## QA prompt engineering and LLM parameters to match a representative sample of citizens

In [20]:
import os
import sys
sys.path.append("../../../")

import pandas as pd

from src.llms_helpers.openai_api import query_open_ai
from src.eval_utils.experimenter import ExperimentRunner, Data, ModelConfig, SurveyConfig, EvalConfig
from src.data_helpers.mapper_wuw_meaning import profile_questions, energy_calculations, factors_to_be_considered_keys_flat, factors_to_be_considered_keys

import warnings
warnings.filterwarnings('ignore')

%load_ext autoreload
%autoreload 2



### 1. Generate Prompts

In [21]:
df_sample = pd.read_csv("../../../data/ariadne/heating_buildings/df_c1t1_sample_50_processed_relevant_factors_citizen_type.csv", encoding='ISO-8859-1')
df_questions = pd.read_csv("../../../data/ariadne/heating_buildings/questions_selected_survey_mapper_encoded.csv", sep=';', encoding='utf-8', on_bad_lines='skip')

In [22]:
df_questions.head(2)

| Unnamed: 0 | question_id | Question | option_type | exclude_values | mapping_dict |
|---|---|---|---|---|---|
| 0 | ges | Geschlecht | categorical | | {'1':'männlich', '2':'weiblich'} |
| 1 | city_category | City size | ordinal | | {'1':'weniger als 20.000 Einwohner','2':'zwisc... |


### 2. Run and evaluate experiment

In [23]:
standard_entries = ['key','citizen_type'] + profile_questions + energy_calculations + factors_to_be_considered_keys_flat
df_answers = df_sample[(df_sample['citizen_type']!='defier')][standard_entries]

In [24]:
df_answers.shape

(45, 38)

In [25]:
# parameters prompt engineering
rand_question_order=factors_to_be_considered_keys
survey_context="custom_subject"
final_question="custom_subject"
final_question_prompt="T1"
rand_order_options=False
prompts_directory="../../../data/ariadne/heating_buildings/experiment_1/prompts"

# parameters run estimations
n_trials = 5
model = "gpt-3.5-turbo-1106" #"gpt-4-1106-preview" # "gpt-3.5-turbo"  #"gpt-4"#"babbage-002"

# parameters for evaluation
mapper = {'A': 1, 'B': 2}
final_columns = ['ea801', 'ea802', 'ea803', 'ea804', 'ea805', 'ea806', 'ea807', 'ea808', 'ea809', 'ea810', 'ea811', 
               'ea812', 'ea813', 'ea814', 'ea815']
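
The evaluation parameters above imply that each reply has to be turned into one numeric code per `ea8xx` column. A minimal sketch of that step is shown here, assuming the model answers in the numbered-list format requested later in the notebook. The helper `parse_numbered_choices` is hypothetical and is not the parsing used inside `ExperimentRunner`, which is not shown in this notebook.

```python
import re

mapper = {'A': 1, 'B': 2}
final_columns = [f"ea8{i:02d}" for i in range(1, 16)]  # ea801 ... ea815

def parse_numbered_choices(response, mapper, columns):
    """Map a reply such as '1. A 2. B ...' onto the survey columns.

    Hypothetical helper: returns None when the reply does not contain
    exactly one valid choice per column, so that malformed or truncated
    replies can be counted separately instead of being silently scored.
    """
    # Capture '<number>. <letter>' pairs; parentheses around the letter are optional.
    pairs = re.findall(r"(\d+)\s*[.)]\s*\(?([A-Za-z])\)?(?![A-Za-z])", response)
    choices = {}
    for number, letter in pairs:
        idx = int(number)
        code = mapper.get(letter.upper())
        # Reject unknown letters, out-of-range numbers and repeated numbers.
        if code is None or not 1 <= idx <= len(columns) or idx in choices:
            return None
        choices[idx] = code
    if len(choices) != len(columns):
        return None
    return {col: choices[i] for i, col in enumerate(columns, start=1)}

# Illustrative reply (made up, not model output): A for the first four decisions, then B.
reply = " ".join(f"{i}. {'A' if i <= 4 else 'B'}" for i in range(1, 16))
parsed = parse_numbered_choices(reply, mapper, final_columns)
truncated = parse_numbered_choices("1. A 2. B 3.", mapper, final_columns)
```

Returning `None` for incomplete replies keeps format failures visible, which matters here because truncated reasoning is one of the failure modes discussed in the observations.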

### 3. Run a recurrent experiment and analyse the results

**Observations**
* At temperature 0.5, the model differentiates between the ways of thinking of the different profiles.
* At temperature 0.7, it also differentiates between them and integrates the numbers given in the prompt into its reasoning more comprehensively.
* Limiting the reflection to 400 tokens cuts sentences off mid-way, so the subsequent request for the final options does not work well.
* Raising the limit to 500 tokens is still not enough.
* Adding a constraint to the prompt that limits the thoughts to no more than 300 words fixed this.
* The estimates at temperature 0.7 are slightly better whenever the response arrives in the right format.
* Responses that do not match the expected format occur because the reasoning was not completed. Reducing the reasoning to 250 words lowers its quality.
* Raising the reasoning limit to 700 tokens, keeping the prompt instruction to stay below 300 words, and adding an instruction after the reasoning with an example of how the options should be given avoids these errors. However, the larger space for reasoning seems to shift decisions towards option B.
* Reducing the reasoning to 250 words again, while keeping the more detailed guidance and the 700-token threshold as a safety margin, lowers the quality of the decisions: the model makes irrational choices such as simple renovation at lower prices and extensive renovation at higher prices.
* Because longer reasoning goes with a tendency to choose option B more often, the prompt was simplified to be closer to the original version. Even so, a different way of thinking is applied that still moves the tendency towards B. Cases where all options are A are no longer observed. Most likely, the reasoning step limits emotional decision-making.
* After removing the added statement about having 1500 euros available, I analysed the reasoning for the individual who chose simple renovation in real life. One of the questions in the prompt highlights the tendency towards short-term profit, and the reasoning picks this up. However, the final decision is the "purely rational" one. Thus, the setup is avoiding emotion-driven decision-making.
* Instructing the AI directly to mimic only the personality reflected in the previous prompt seems to work better, although it is still slightly more rational than the actual humans. The reproducibility of the responses, however, is much higher.
    * With GPT-3.5, I get all options A for those who chose only A. On the other hand, it keeps making the mistake of choosing A first and then B in other cases, and it does not manage to give its responses in a structured form.
    * With GPT-4, I cannot avoid getting option B in at least the first 4 decisions, although reproducibility increases and the model clearly recognises that the person thinks in a certain way.
* Directly targeting the AI to mimic the human behind the previously given answers works much better at 0.5, 0.6 and 0.7, both in terms of reproducibility across 3 iterations and in distinguishing the human who only wants simple renovation at any price from the other two. However, it performs less well at reproducing the results for the subject who always chooses comprehensive renovation. My impression is that the model tries to reproduce the responses from an objective, quantitative perspective and does not capture emotional or personal interests: it bases its decisions on the financial aspects. The prompt also emphasises those aspects, so the following tests should be run:
    * Compare results when the prompt is balanced more evenly between financial interest and personal beliefs, in order to decide on prompt and temperature
    * Compare results when running the experiment with factors chosen from the literature and crucial requirements for shaping people's profiles
    * Compare results when using different numbers of questions
In [8]:
temperature = [0.1]
top_p = 1.0
max_tokens = 100
n_iter = 1

for n_q in [0, 3, 5, 7, 9, 12, 15]:
    data = Data(df_questions, df_answers.head(20), [])
    model_config = ModelConfig(model, n_trials)
    survey_config = SurveyConfig(n_iter, n_q, rand_question_order, survey_context, final_question, final_question_prompt, rand_order_options)
    eval_config = EvalConfig(mapper, final_columns)

    additional_sentence = """Interviewer: Bitte entscheiden sich nun für die verschiedenen Optionen. Halten Sie Ihre Antworten in einer nummerierten Liste organisiert: '1. (Option Buchstabe) 2. (Option Buchstabe) 3. (Option Buchstabe) usw.\nIch:"""
    for temp in temperature:
        output_file = f"../../../data/ariadne/heating_buildings/experiment_1/length_{model}-t_{temp}_nq_{n_q}_qa.csv"
        exp_run = ExperimentRunner(data=data, model_config=model_config, survey_config=survey_config, eval_config=eval_config)
        exp_run.generate_prompts()
        exp_run.store_prompts(directory=prompts_directory)
        exp_run.run_recurrent_experiment(output_file, temp, top_p, max_tokens, additional_sentence)

Interviewer: Sie sind Eigentümer eines Hauses und haben die Kontrolle über Entscheidungen bezüglich der Heizung und Sie besitzen eine Zentralheizung und Ihre Immobilie wurde 1995 bis 2001 gebaut. Bitte erzählen Sie uns ein wenig über sich selbst.
\Person: Ich bin ein 49 Jahre alter Mann und ich wohne in einer Stadt mit zwischen 20.000 bis 100.000 Einwohnern in Baden-Württemberg, Deutschland. Zusätzlich, mein Einkommensniveau beträgt 5.700 Euro und mehr pro Monat und mein höchster Bildungsabschluss ist Abschluss einer Universität, wissenschaftlichen Hochschule, Kunsthochschule.
Interviewer: Im Folgenden erhalten Sie die Möglichkeit, sich zwischen zwei Methoden der Heizungsoptimierung zu entscheiden: einer „einfache Heizungsoptimierung“ und einer „umfassenden Heizungsoptimierung“.
Bei einer einfachen Heizungsoptimierung dämmt ein Fachunternehmen die Heizungsrohre in Ihrem Haus nach aktuellem Dämmstandard. Diese Heizungsoptimierung dauert ca. 1-2 Stunden.
Bei einer umfassenden Heizungsopt
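
The loop above writes one result file per number of profile questions, so comparing the runs needs a single score per run. The sketch below computes the share of decisions where the predicted code equals the observed one. The layout of the stored results is not shown in this notebook, so the dictionaries used here (subject key mapped to a dict of column codes) and the function name `match_rate` are assumptions, and the toy values are not experiment results.

```python
def match_rate(observed, predicted, columns):
    """Share of decisions where the predicted code equals the observed one.

    Both arguments map a subject key to a dict {column: code}. Subjects
    without a prediction are skipped; a missing column counts as a miss.
    """
    hits = 0
    total = 0
    for key, truth in observed.items():
        guess = predicted.get(key)
        if guess is None:
            continue
        for col in columns:
            total += 1
            hits += int(truth[col] == guess.get(col))
    return hits / total if total else float("nan")

# Toy data for two subjects and two decisions (invented for illustration).
columns = ["ea801", "ea802"]
observed = {"s1": {"ea801": 1, "ea802": 1}, "s2": {"ea801": 2, "ea802": 2}}
predicted_by_n_q = {
    0: {"s1": {"ea801": 1, "ea802": 1}, "s2": {"ea801": 1, "ea802": 2}},
    3: {"s1": {"ea801": 1, "ea802": 1}, "s2": {"ea801": 2, "ea802": 2}},
}
rates = {n_q: match_rate(observed, pred, columns)
         for n_q, pred in predicted_by_n_q.items()}
```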

#### Estimate with literature suggested variables

In [13]:
temperature = [0.1]
top_p = 1.0
max_tokens = 100
n_iter = 1

# literature: age of building, lives in the house, building type, region, income level, trust/belief on energy efficiency, altruism, political opinion (US) + control variables
custom_question_ids = ['a1','a4','bel2_1','altru1_2','so7','pk31','pk1_6','pk2_7']
data = Data(df_questions, df_answers.head(20), custom_question_ids)
model_config = ModelConfig(model, n_trials)
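# NOTE: n_q is not set in this cell; it keeps whatever value an earlier cell left behind,
# and the same value is reused in the output_file name below.
survey_config = SurveyConfig(n_iter, n_q, rand_question_order, survey_context, final_question, final_question_prompt, rand_order_options)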
eval_config = EvalConfig(mapper, final_columns)

additional_sentence = """Interviewer: Bitte entscheiden sich nun für die verschiedenen Optionen. Halten Sie Ihre Antworten in einer nummerierten Liste organisiert: '1. (Option Buchstabe) 2. (Option Buchstabe) 3. (Option Buchstabe) usw.\nIch:"""
for temp in temperature:
    output_file = f"../../../data/ariadne/heating_buildings/experiment_1/length_{model}-t_{temp}_nq_{n_q}_qa.csv"
    exp_run = ExperimentRunner(data=data, model_config=model_config, survey_config=survey_config, eval_config=eval_config)
    exp_run.generate_prompts()
    exp_run.store_prompts(directory=prompts_directory)
    exp_run.run_recurrent_experiment(output_file, temp, top_p, max_tokens, additional_sentence)

Skipping qustion as subject did not give a valid answer. (message printed 11 times)
Interviewer: Sie sind Eigentümer eines Hauses und haben die Kontrolle über Entscheidungen bezüglich der Heizung und Sie besitzen eine Zentralheizung und Ihre Immobilie wurde 1995 bis 2001 gebaut. Bitte erzählen Sie uns ein wenig über sich selbst.
\Person: Ich bin ein 49 Jahre alter Mann und ich wohne in einer Stadt mit zwischen 20.000 bis 100.000 Einwohnern in Baden-Würt

#### Estimate with correlation analysis suggested variables

In [26]:
temperature = [0.1]
top_p = 1.0
max_tokens = 100
n_iter = 1

# prof_status, education_level, city size, 
factors_log_reg_analysis = ['pk31','bel2_1','altru1_2','pk1_6','pk2_7','altq','a5']
data = Data(df_questions, df_answers.head(20), factors_log_reg_analysis)
model_config = ModelConfig(model, n_trials)
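# NOTE: n_q is not set in this cell; it keeps whatever value an earlier cell left behind,
# and the same value is reused in the output_file name below.
survey_config = SurveyConfig(n_iter, n_q, rand_question_order, survey_context, final_question, final_question_prompt, rand_order_options)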
eval_config = EvalConfig(mapper, final_columns)

additional_sentence = """Interviewer: Bitte entscheiden sich nun für die verschiedenen Optionen. Halten Sie Ihre Antworten in einer nummerierten Liste organisiert: '1. (Option Buchstabe) 2. (Option Buchstabe) 3. (Option Buchstabe) usw.\nIch:"""
for temp in temperature:
    output_file = f"../../../data/ariadne/heating_buildings/experiment_1/length_{model}-t_{temp}_nq_{n_q}_qa.csv"
    exp_run = ExperimentRunner(data=data, model_config=model_config, survey_config=survey_config, eval_config=eval_config)
    exp_run.generate_prompts()
    exp_run.store_prompts(directory=prompts_directory)
    exp_run.run_recurrent_experiment(output_file, temp, top_p, max_tokens, additional_sentence)

Skipping qustion as subject did not give a valid answer. (message printed 11 times)
Interviewer: Sie sind Eigentümer eines Hauses und haben die Kontrolle über Entscheidungen bezüglich der Heizung und Sie besitzen eine Zentralheizung und Ihre Immobilie wurde 1995 bis 2001 gebaut. Bitte erzählen Sie uns ein wenig über sich selbst.
\Person: Ich bin ein 49 Jahre alter Mann und ich wohne in einer Stadt mit zwischen 20.000 bis 100.000 Einwohnern in Baden-Würt

In [15]:
df_answers.columns

Index(['key', 'citizen_type', 'bundesland_name', 'ges', 'altq',
       'city_category', 'so5', 'so1', 'so2', 'ist6', 'ebj', 'ebe', 'ebu',
       'kdj', 'kde', 'kdu', 'kdf', 'ist5', 'a1', 'a4', 'a6', 'bel1', 'bel2_1',
       'pk1_1', 'pk1_3', 'pk1_6', 'pk2_3', 'pk2_7', 'altru1_1', 'altru1_2',
       'pk31', 'pk35', 'a7', 'so3_5', 'so6', 'so3_1', 'so7'],
      dtype='object')
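
Printing the columns is a manual way to check that the selected question ids are actually available. The same check can be sketched in code, with the column names copied from the output above and the two id lists copied from the earlier cells. Whether `Data` looks the ids up in `df_answers`, in `df_questions`, or in both is not shown in this notebook, so treat the result as a hint rather than a diagnosis. The helper `missing_ids` is hypothetical.

```python
# Column names copied from the df_answers.columns output above.
answer_columns = [
    'key', 'citizen_type', 'bundesland_name', 'ges', 'altq',
    'city_category', 'so5', 'so1', 'so2', 'ist6', 'ebj', 'ebe', 'ebu',
    'kdj', 'kde', 'kdu', 'kdf', 'ist5', 'a1', 'a4', 'a6', 'bel1', 'bel2_1',
    'pk1_1', 'pk1_3', 'pk1_6', 'pk2_3', 'pk2_7', 'altru1_1', 'altru1_2',
    'pk31', 'pk35', 'a7', 'so3_5', 'so6', 'so3_1', 'so7',
]

# Question id lists copied from the two experiment cells.
custom_question_ids = ['a1', 'a4', 'bel2_1', 'altru1_2', 'so7', 'pk31', 'pk1_6', 'pk2_7']
factors_log_reg_analysis = ['pk31', 'bel2_1', 'altru1_2', 'pk1_6', 'pk2_7', 'altq', 'a5']

def missing_ids(question_ids, available):
    """Return the requested ids that are absent from `available`, in order."""
    available = set(available)
    return [q for q in question_ids if q not in available]

missing_literature = missing_ids(custom_question_ids, answer_columns)
missing_log_reg = missing_ids(factors_log_reg_analysis, answer_columns)
print(missing_literature, missing_log_reg)
```

With the lists as copied here, the second check reports `a5`, which does not appear among the printed columns.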