# Prepare Allais data

This notebook generates 'data_allais.csv'. It processes the raw data and provides a standardised output:
- Cleans up the results of the emotion elicitation questionnaire
- Calculates a single locus score for each participant - See [original paper](https://psycnet.apa.org/record/2011-19211-001), page 11.
- Calculates a Big 5 score (openness to experience, conscientiousness, extraversion, agreeableness, and emotional stability) for each participant - See [original paper](https://www.sciencedirect.com/science/article/abs/pii/S0092656603000461), appendix A.

Notes:
- The primary ID of each record, throughout the entire process, is the row index of that entry in `raw_data_allais.csv`

Exact cleaning steps:
1. Remove duplicate column headers 
2. Remove participants with low progress (not enough data)
3. Rename generic column names into usable ones
4. Remove participants with multiple records
5. Join against demographic data (drop 5 participants without demographic information)
6. Adjust game choice values & data types
7. Adjust certainty values & data types 
8. Adjust game emotions values & data types
9. Adjust TIPI answer values & data types
10. Calculate TIPI values
11. Calculate locus values
12. Add a column saying if the participant displays ambiguity aversion (DAA)
13. Write processed output CSV

In [None]:
import numpy as np
import pandas as pd

# Input and output paths (relative to the notebook's working directory)
allais = 'data/raw_data_allais.csv'
demographic = 'data/demographic_allais.csv'
output = 'data/data_allais.csv'

df = pd.read_csv(allais)
demographic_df = pd.read_csv(demographic)

#### Step 1: Remove duplicate column headers

In [None]:
# Drop the first data row (index label 0), which repeats the column headers
df = df.drop([0])
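
A minimal sketch of what this step does, using a toy frame with made-up column values (the real export is assumed to carry a second copy of the header in its first data row). It also shows that the remaining index labels are not renumbered, which is what keeps the row index usable as the primary ID:

```python
import pandas as pd

# Toy stand-in for the raw export (hypothetical values): row 0 repeats the header
toy = pd.DataFrame({
    'Progress': ['Progress', '100', '42'],
    'Q5': ['Which lottery do you choose?', '1', '2'],
})

# drop([0]) removes the row whose index label is 0; the other labels are kept,
# so they still point at the original row positions in the raw file
toy = toy.drop([0])
print(toy.index.tolist())
```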

#### Step 2: Remove exactly 3 participants with low progress (not enough data)

In [None]:
df = df.astype({'Progress': 'int32'})
before = len(df)
df = df[df['Progress'] >= 99]
after = len(df)
print(f'Removed {before - after} participants with low progress. Records left: {after}')
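
If the `Progress` column ever contained non-numeric text, the `astype` call above would raise an error. The sketch below is one possible, more tolerant variant on toy data (the values are hypothetical): `pd.to_numeric` with `errors='coerce'` turns unparsable entries into NaN, and NaN never passes the `>= 99` filter.

```python
import pandas as pd

# Hypothetical toy data: Progress is read as text, as in the raw CSV
toy = pd.DataFrame({'Progress': ['100', '99', '57', 'n/a']}, index=[1, 2, 3, 4])

# Convert to numbers; anything unparsable becomes NaN and fails the filter
progress = pd.to_numeric(toy['Progress'], errors='coerce')
kept = toy[progress >= 99]

print(f'Removed {len(toy) - len(kept)} of {len(toy)} records')
```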

#### Step 3: Rename generic column names into useable ones

In [None]:
mapping = {
    "Q89": "participant_id", 

    "Q5": "ChoiceGame1",
    "Q51": "CertaintyGame1",
    "Q6_1": "Hopeful1A",
    "Q55_1": "Curiosity1A",
    "Q56_1": "Enlightenment1A",
    "Q57_1": "Thrilled1A",
    "Q58_1": "Anticipatory1A",
    "Q59_1": "Satisfied1A",
    "Q60_1": "Hopeful1B",
    "Q61_1": "Curiosity1B",
    "Q62_1": "Enlightenment1B",
    "Q63_1": "Thrilled1B",
    "Q64_1": "Anticipatory1B",
    "Q65_1": "Satisfied1B",
    "Q82": "ChoiceGame2",
    "Q54": "CertaintyGame2",
    "Q66_1": "Hopeful2A",
    "Q67_1": "Curiosity2A",
    "Q68_1": "Enlightenment2A",
    "Q69_1": "Thrilled2A",
    "Q70_1": "Anticipatory2A",
    "Q71_1": "Satisfied2A",
    "Q72_1": "Hopeful2B",
    "Q73_1": "Curiosity2B",
    "Q74_1": "Enlightenment2B",
    "Q75_1": "Thrilled2B",
    "Q76_1": "Anticipatory2B",
    "Q77_1": "Satisfied2B",
    "Q16_1": "TipiEnthusiastic",
    "Q16_2": "TipiQuiet",
    "Q16_3": "TipiWarm",
    "Q16_4": "TipiQuarrelsome",
    "Q16_5": "TipiDisciplined",
    "Q16_6": "TipiCareless",
    "Q16_7": "TipiStable",
    "Q16_8": "TipiAnxious",
    "Q16_9": "TipiComplex",
    "Q16_10": "TipiUncreative",

    "Q18": "Locus1",
    "Q19": "Locus2",
    "Q21": "Locus3",
    "Q22": "Locus4",
    "Q23": "Locus5",
    "Q24": "Locus6",
    "Q25": "Locus7",
    "Q26": "Locus8",
    "Q27": "Locus9",
    "Q28": "Locus10",
    "Q29": "Locus11",
    "Q30": "Locus12",
    "Q31": "Locus13",
    "Q32": "Locus14",
    "Q33": "Locus15",
    "Q34": "Locus16",
    "Q35": "Locus17",
    "Q36": "Locus18",
    "Q37": "Locus19",
    "Q38": "Locus20",
    "Q39": "Locus21",
    "Q40": "Locus22",
    "Q41": "Locus23",
    "Q42": "Locus24",
    "Q43": "Locus25",
    "Q44": "Locus26",
    "Q45": "Locus27",
    "Q46": "Locus28",
    "Q47": "Locus29",
}

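# Count how many of the expected raw columns are present
count = sum(1 for k in mapping if k in df.columns)
print(f'{count}/{len(mapping)} of keys found')

# Keep only the mapped columns, then give them readable names
df = df[list(mapping)]
df = df.rename(columns=mapping)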
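
The count printed above says how many keys were found, but not which ones are missing. A small sketch, assuming a shortened hypothetical mapping and toy frame, of how the missing keys could be listed by name before selecting columns:

```python
import pandas as pd

# Hypothetical toy frame and a shortened mapping
toy = pd.DataFrame({'Q89': ['p1'], 'Q5': ['1'], 'Unused': ['x']})
toy_mapping = {'Q89': 'participant_id', 'Q5': 'ChoiceGame1', 'Q51': 'CertaintyGame1'}

# Report which expected columns are absent instead of only counting matches
missing = [k for k in toy_mapping if k not in toy.columns]
present = [k for k in toy_mapping if k in toy.columns]
print(f'Missing columns: {missing}')

# Select and rename only the columns that exist
renamed = toy[present].rename(columns=toy_mapping)
```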

#### Step 4: Remove participant with multiple records (there should be exactly 1 participant to remove)

In [None]:
# Find participants with multiple records
count = df['participant_id'].value_counts()

before = len(df)

# Remove all records from participants with more than one record
for participant, n in count[count > 1].items():
    print(f'Removing {n} records for {participant}')
    df = df[df['participant_id'] != participant]

after = len(df)

print(f'Removed {before - after} records from same participants. Records left: {after}')
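
The same removal can be sketched without a loop. This is an alternative formulation on toy data, not the notebook's own code: `duplicated(keep=False)` flags every record of a participant who appears more than once, so all of their records are dropped together.

```python
import pandas as pd

# Hypothetical toy data: participant 'a' appears twice
toy = pd.DataFrame({'participant_id': ['a', 'b', 'a', 'c']}, index=[1, 2, 3, 4])

# keep=False marks every record of a repeated participant, not just the later ones
is_repeated = toy['participant_id'].duplicated(keep=False)
unique_only = toy[~is_repeated]

print(f'Removed {len(toy) - len(unique_only)} records')
```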

#### Step 5: Join against demographic data

In [None]:
demographic_cols = ['participant_id', 'Sex', 'Age']
demographic_df = demographic_df[demographic_cols]

before = len(df)

# Keep the original index of 'df' as a column so it survives the merge
df['index'] = df.index
# Inner join (the default): participants without demographic data are dropped
df = pd.merge(df, demographic_df, on='participant_id')

# Restore index from original 'df'
df.index = df['index']
df = df.drop(columns=['index'])

after = len(df)
print(f'Removed {before - after} participants without demographic information. Records left: {after}')
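
To see which participants are dropped by the join, and not only how many, a left join with `indicator=True` can help. The sketch below uses hypothetical toy frames and is one possible variant of this step; it ends with the same rows and index labels an inner join would keep.

```python
import pandas as pd

# Hypothetical toy data: participant 'b' has no demographic record
toy = pd.DataFrame({'participant_id': ['a', 'b', 'c']}, index=[3, 5, 8])
toy_demo = pd.DataFrame({'participant_id': ['a', 'c'], 'Sex': ['F', 'M'], 'Age': [31, 45]})

# A left join with indicator=True shows which participants lack demographic data
checked = toy.reset_index().merge(toy_demo, on='participant_id', how='left', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'participant_id'].tolist()
print(f'No demographic data for: {unmatched}')

# Keep matched rows only and restore the original index labels
merged = checked[checked['_merge'] == 'both'].drop(columns='_merge').set_index('index')
```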

#### Step 6: Adjust game choice values & data types

If the participant chose lottery A, set the value to the integer 1; otherwise, set it to the integer 2.

In [None]:
choice_cols = ['ChoiceGame1', 'ChoiceGame2']
for col in choice_cols:
    df[col] = np.where(df[col] == '1', 1, 2)

choice_dtype = {c: 'int32' for c in choice_cols}
df = df.astype(choice_dtype)
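
Note that `np.where` codes every answer other than `'1'` as 2, including a missing answer. The sketch below shows a stricter alternative on toy data. It assumes the raw code for lottery B is `'2'`, which the source does not confirm, so the mapping would need to be adapted to the actual raw codes.

```python
import pandas as pd

# Hypothetical raw answers, including one missing value
raw_choice = pd.Series(['1', '2', None, '1'])

# Series.map leaves unknown or missing answers as NaN instead of coding them as 2
# (the '2' key is an assumption about how lottery B is coded)
strict = raw_choice.map({'1': 1, '2': 2})
print(f'Answers that could not be coded: {strict.isna().sum()}')
```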

#### Step 7: Adjust game certainty values & data types

Make sure certainty values are integers from 1 (very uncertain) to 5 (very certain).

In [None]:
certainty_cols = ['CertaintyGame1', 'CertaintyGame2']
for col in certainty_cols:
    print(df[col].value_counts())

df = df.astype({c: 'int32' for c in certainty_cols})
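
The `value_counts` output above has to be inspected by eye. As a sketch, the 1 to 5 range could also be checked programmatically; the toy values here are hypothetical.

```python
import pandas as pd

# Hypothetical raw certainty answers, read as text and cast to integers
certainty = pd.Series(['1', '5', '3', '4']).astype('int32')

# between() is inclusive on both ends, so this checks 1 <= value <= 5
in_range = certainty.between(1, 5)
print(in_range.all())
```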

#### Step 8: Adjust game emotions values & data types

For every emotion (hopeful, curiosity, enlightenment, thrilled, anticipatory, satisfied), recode the raw answer codes (1, 6, 7, 8) to a 1-4 scale.

In [None]:
hopeful_columns = [ 'Hopeful1A', 'Hopeful1B', 'Hopeful2A', 'Hopeful2B' ]
curiosity_columns = [ 'Curiosity1A', 'Curiosity1B', 'Curiosity2A', 'Curiosity2B' ]
enlightenment_columns = [ 'Enlightenment1A', 'Enlightenment1B', 'Enlightenment2A', 'Enlightenment2B' ]
thrilled_columns = [ 'Thrilled1A', 'Thrilled1B', 'Thrilled2A', 'Thrilled2B' ]
anticipatory_columns = [ 'Anticipatory1A', 'Anticipatory1B', 'Anticipatory2A', 'Anticipatory2B' ]
satisfied_columns = [ 'Satisfied1A', 'Satisfied1B', 'Satisfied2A', 'Satisfied2B' ]
all_columns = hopeful_columns + curiosity_columns + enlightenment_columns + \
    thrilled_columns + anticipatory_columns + satisfied_columns

value_mapping = {
    '1': 1,
    '6': 2,
    '7': 3,
    '8': 4,
}

for col in all_columns:
    df[col] = df[col].replace(value_mapping)

df = df.astype({c: 'int32' for c in all_columns})
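
`replace` leaves any code that is not in `value_mapping` unchanged, so an unexpected raw code would only surface later in the `astype` call, if at all. A possible variant, sketched on a toy Series, uses `map`, which turns unknown codes into NaN so they can be detected right away:

```python
import pandas as pd

value_mapping = {'1': 1, '6': 2, '7': 3, '8': 4}

# Hypothetical raw answers for one emotion column
raw = pd.Series(['1', '6', '7', '8', '6'])

# map() yields NaN for any code missing from value_mapping, so leftovers are easy to spot
recoded = raw.map(value_mapping)
assert recoded.notna().all(), 'unexpected raw code'
recoded = recoded.astype('int32')
```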

#### Step 9: Adjust TIPI answer values & data types

Remove textual values and cast to numeric types

In [None]:
tipi_cols = [
    "TipiEnthusiastic",
    "TipiQuiet",
    "TipiWarm",
    "TipiQuarrelsome",
    "TipiDisciplined",
    "TipiCareless",
    "TipiStable",
    "TipiAnxious",
    "TipiComplex",
    "TipiUncreative",
]

# Update tipi columns to be integers
df = df.astype({col: 'int32' for col in tipi_cols})

#### Step 10: Calculate TIPI values

In [None]:
# Map columns to the question numbers in the TIPI calculation document
tipi_mapping = {
    "Tipi1": "TipiEnthusiastic",
    "Tipi6": "TipiQuiet",
    "Tipi7": "TipiWarm",
    "Tipi2": "TipiQuarrelsome",
    "Tipi3": "TipiDisciplined",
    "Tipi8": "TipiCareless",
    "Tipi9": "TipiStable",
    "Tipi4": "TipiAnxious",
    "Tipi5": "TipiComplex",
    "Tipi10": "TipiUncreative",
}

# Calculate Big5
big5 = ['Openness', 'Conscientiousness', 'Extroversion', 'Agreeability', 'Stability']

def tipi_score(major, minor):
    # 'major' is scored as answered; 'minor' is reverse-scored on the 1-7 scale (8 - answer)
    major_col = tipi_mapping[major]
    minor_col = tipi_mapping[minor]
    return (df[major_col] + (8 - df[minor_col])) / 2

# Each trait pairs a regular item with its reverse-scored counterpart (TIPI scoring key)
df['Openness'] = tipi_score('Tipi5', 'Tipi10')
df['Conscientiousness'] = tipi_score('Tipi3', 'Tipi8')
df['Extroversion'] = tipi_score('Tipi1', 'Tipi6')
df['Agreeability'] = tipi_score('Tipi7', 'Tipi2')
df['Stability'] = tipi_score('Tipi9', 'Tipi4')

print(df[big5].mean())

# Averages from Journal:
# Conscientiousness = 4.61
# Agreeability = 4.69
# Stability = 4.34
# Openness = 5.51
# Extroversion = 3.98

# Remove raw columns
df = df.drop(tipi_mapping.values(), axis=1)
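
As a worked example of the scoring formula: each trait score is the mean of one regular item and one reverse-scored item, and on the 1-7 TIPI scale an answer `x` is reversed as `8 - x`. The helper name `tipi_pair_score` below is hypothetical and only illustrates the arithmetic on single numbers.

```python
def tipi_pair_score(regular, reversed_item):
    """Average a regular item with a reverse-scored item on a 1-7 scale."""
    return (regular + (8 - reversed_item)) / 2

# Hypothetical participant: 6 on "enthusiastic" and 2 on "quiet" (reverse-scored)
print(tipi_pair_score(6, 2))  # (6 + (8 - 2)) / 2 = 6.0
```

A participant who answers 1 on the regular item and 7 on the reversed item gets the minimum score of 1.0, and the opposite answers give the maximum of 7.0.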

#### Step 11: Calculate locus values

- Reads each raw answer as the option the participant chose: 1 (first answer) or 2 (second answer)
- Adds a column per question that says if the participant should get a point for that answer or not
- Calculates a single column with the aggregated locus score per participant

Note: Questions 1, 8, 14, 19, 24 and 27 are not scored, see the Locus questionnaire for more information.

In [None]:
# Raw values correspond to the option the participant chose, stored as the string '1' or '2'

# This mapping converts participant answers to scores, i.e. a numeric 1 if they should get a point, otherwise 0
# Questions that are not scored are compared against '-1', which will never appear, so the score will always be 0
locus_score_mapping = {
    '1': '-1',
    '2': '1',
    '3': '2',
    '4': '2',
    '5': '2',
    '6': '1',
    '7': '1',
    '8': '-1',
    '9': '1',
    '10': '2',
    '11': '2',
    '12': '2',
    '13': '2',
    '14': '-1',
    '15': '2',
    '16': '1',
    '17': '1',
    '18': '1',
    '19': '-1',
    '20': '1',
    '21': '1',
    '22': '2',
    '23': '1',
    '24': '-1',
    '25': '1',
    '26': '2',
    '27': '-1',
    '28': '2',
    '29': '1',
}

# Add a temporary 'LocusScore<n>' column per question: 1 if the answer matches the scoring key, otherwise 0
for k, v in locus_score_mapping.items():
    locus_k = 'Locus' + k
    score_k = 'LocusScore' + k
    df[score_k] = np.where(df[locus_k] == v, 1, 0)

# Calculate a single locus score column
score_keys = ['LocusScore' + dk for dk in locus_score_mapping]
df['Locus'] = df[score_keys].sum(axis=1)

# Drop the temporary per-question score columns
df = df.drop(score_keys, axis=1)

df['Locus'].value_counts()
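
A small worked example of the scoring logic, using a shortened hypothetical key with three questions and two made-up participants. Question 1 stands in for an unscored question: its key `'-1'` never matches, so it always contributes 0.

```python
import numpy as np
import pandas as pd

# Shortened scoring key: item 1 is unscored ('-1' never matches), items 2 and 3 are scored
toy_key = {'1': '-1', '2': '1', '3': '2'}

# Hypothetical answers of two participants
toy = pd.DataFrame({
    'Locus1': ['1', '2'],
    'Locus2': ['1', '2'],
    'Locus3': ['2', '2'],
})

# One point for every answer that equals the keyed option
points = pd.DataFrame(
    {k: np.where(toy['Locus' + k] == v, 1, 0) for k, v in toy_key.items()},
    index=toy.index,
)
toy['Locus'] = points.sum(axis=1)
print(toy['Locus'].tolist())
```

The first participant matches the key on questions 2 and 3, the second only on question 3.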

#### Step 12: Add a column saying if the participant displays ambiguity aversion (DAA)

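The Allais paradox is the observation that most people choose gamble A when presented with game 1,
but gamble B when presented with game 2.
This pattern is known as the "certainty effect", and it is paradoxical because the two choices together are inconsistent with expected utility theory.
The `DAA` column is 1 for participants who show this pattern (A in game 1, B in game 2) and 0 otherwise.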

In [None]:
def isDAA(x):
    if x['ChoiceGame1'] == 1 and x['ChoiceGame2'] == 2:
        return 1
    return 0

df['DAA'] = df.apply(isDAA, axis=1)
df['DAA'].value_counts()
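
The row-wise `apply` above can also be written as a vectorised comparison. The sketch below is an equivalent formulation on toy data covering all four choice combinations, not a change to the notebook's method:

```python
import pandas as pd

# Hypothetical toy data: every combination of choices in the two games
toy = pd.DataFrame({'ChoiceGame1': [1, 1, 2, 2], 'ChoiceGame2': [2, 1, 2, 1]})

# Boolean mask for "A in game 1 and B in game 2", stored as 0/1
toy['DAA'] = ((toy['ChoiceGame1'] == 1) & (toy['ChoiceGame2'] == 2)).astype('int32')
print(toy['DAA'].tolist())
```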

#### Step 13: Write processed output CSV

In [None]:
df.to_csv(output, index=True, index_label='ID')
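
Because the index is written under the label `ID`, a consumer of the output file can restore it with `index_col='ID'`. A minimal round-trip sketch, using an in-memory buffer and a hypothetical toy frame instead of the real output path:

```python
import io
import pandas as pd

# Hypothetical processed data whose index labels come from the raw file
toy = pd.DataFrame({'Locus': [12, 9]}, index=[3, 8])

# Write with the same arguments as above, then read back from the buffer
buffer = io.StringIO()
toy.to_csv(buffer, index=True, index_label='ID')
buffer.seek(0)
restored = pd.read_csv(buffer, index_col='ID')
print(restored.index.tolist())
```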