c)
The questionnaire by Bruhin et al. comprises personality traits according to the Big Five. This question asks you to impute other (economic) preferences of the study participants. Examples include:

Social preferences (inequity aversion, reciprocity, guilt aversion...)
Time preferences (myopia, present bias...)
Risk preferences
...
Your task is thus the following:

Find a dataset on individuals that contains the Big Five along with other preference measures. Think of datasets used in scientific publications.
Train models to predict the other preferences from the Big Five. Evaluate their performance.
Make an out-of-sample prediction using the fitted models to impute the preference measures for the study participants of Bruhin et al.

First of all: we searched through hundreds of papers in scientific databases for even rudimentarily useful data sets, but kept failing at the following points:
- Some papers dealt with Big Five scores, but the open data were either raw data without computed Big Five scores or very large data sets with many features and no usable Big Five scores.
- Some papers seemed really useful, but the data were not open access or had not been published at all.

In the end, we found some data that might be useful for this task. But then we ran into the next problem:

The Big Five scores in our data from Bruhin et al. cannot be clearly interpreted. Even in the appendix or the supplementary data of the paper, we could not find any information on the questionnaire that states the answer options for the Big Five questions. So we do not know the scoring system behind the Big Five scores.

Maybe we missed something, but the scoring varies across most papers. So we decided to simply create a mapping between the data set we found and the Big Five scores of Bruhin et al.
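
The mapping idea can be illustrated with a small sketch. It rescales raw scores onto the interval [0, 1], either using the observed minimum and maximum or using assumed scale bounds. The helper name `rescale_to_unit`, the raw values, and the bounds 6 and 30 are hypothetical; the true scale bounds of the Bruhin et al. questionnaire are not known to us.

```python
import pandas as pd


def rescale_to_unit(scores: pd.Series, lower=None, upper=None) -> pd.Series:
    """Min-max rescale scores to [0, 1].

    If lower/upper are omitted, the observed min/max are used, which
    implicitly assumes the sample covers the full range of the scale.
    """
    lo = scores.min() if lower is None else lower
    hi = scores.max() if upper is None else upper
    if hi == lo:
        raise ValueError("cannot rescale a constant column")
    return (scores - lo) / (hi - lo)


# Hypothetical raw sum scores (not taken from any real data set).
raw = pd.Series([10, 14, 18, 22, 26])

# Variant 1: use the observed range (min=10, max=26).
observed = rescale_to_unit(raw)

# Variant 2: assume the questionnaire scale runs from 6 to 30.
theoretical = rescale_to_unit(raw, lower=6, upper=30)

print(observed.tolist())
print(theoretical.round(3).tolist())
```

The two variants agree only if the sample happens to contain the lowest and highest possible score. With the observed range, the most extreme participant in the sample is always mapped to 0 or 1, which may not match how the other data set was scored.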

The data set we found: https://github.com/automoto/big-five-data

The only useful variable here that we can use to predict information missing for our subjects is the subject's country.
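
When predicting a label such as country, accuracy is only informative relative to a trivial baseline: if one class dominates, always predicting that class may already score high. The following is a minimal sketch of such a comparison on synthetic data; the class labels, class shares, and sample size are made up for illustration and say nothing about the actual data set.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: five trait scores in [0, 1] and an imbalanced label
# that is unrelated to the traits (about 70 % "A", 20 % "B", 10 % "C").
X = rng.random((1000, 5))
y = rng.choice(["A", "B", "C"], size=1000, p=[0.7, 0.2, 0.1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
model_acc = accuracy_score(y_test, model.predict(X_test))
# Balanced accuracy averages recall over classes, so it is not inflated
# by a dominant class.
model_bal_acc = balanced_accuracy_score(y_test, model.predict(X_test))

print(f"baseline accuracy:       {baseline_acc:.3f}")
print(f"model accuracy:          {model_acc:.3f}")
print(f"model balanced accuracy: {model_bal_acc:.3f}")
```

A model is only useful for imputation if it clearly beats the baseline; reporting balanced accuracy alongside plain accuracy makes this easier to judge.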

In [1]:
import pandas as pd

df_countries = pd.read_csv('data/big_five_scores.csv')

df_bruhin = pd.read_csv('data/subjects.csv')

print(df_countries.columns)

Index(['case_id', 'country', 'age', 'sex', 'agreeable_score',
       'extraversion_score', 'openness_score', 'conscientiousness_score',
       'neuroticism_score'],
      dtype='object')


In [2]:
# Let's first explore the big five scores in the data set
df_countries.head(1)

Unnamed: 0,case_id,country,age,sex,agreeable_score,extraversion_score,openness_score,conscientiousness_score,neuroticism_score
0,1,South Afri,24,1,0.753333,0.496667,0.803333,0.886667,0.426667


In [3]:
# As stated in the README of the data set, each of its Big Five personality traits takes a value between 0 and 1. We apply the same range to the Bruhin data by min-max normalizing the values.

# List of Big Five personality traits
big_five_traits = ['bf_consciousness', 'bf_openness', 'bf_extraversion', 'bf_agreeableness', 'bf_neuroticism']

# Normalize each of the Big Five traits in df_bruhin
for trait in big_five_traits:
    min_val = df_bruhin[trait].min()
    max_val = df_bruhin[trait].max()
    df_bruhin[trait] = (df_bruhin[trait] - min_val) / (max_val - min_val)

# rename the columns to match df_countries format
df_bruhin.rename(columns={
    'bf_consciousness': 'conscientiousness_score',
    'bf_openness': 'openness_score',
    'bf_extraversion': 'extraversion_score',
    'bf_agreeableness': 'agreeable_score',
    'bf_neuroticism': 'neuroticism_score'
}, inplace=True)

df_bruhin.head(1)

Unnamed: 0,sid,conscientiousness_score,openness_score,extraversion_score,agreeable_score,neuroticism_score,cogabil,pe_D1_stud_natsci,pe_D1_stud_law,pe_D1_stud_socsci,pe_D1_stud_med,pe_monthinc,pe_age,pe_female
0,12010050501,0.75,0.421053,0.4375,0.866667,0.176471,3,1,0,0,0,400,21,1


In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import numpy as np

# df_countries has exactly the nine columns printed above, so no column
# renaming or column subsampling is needed. Keep the name used below.
df_countries_reduced = df_countries

# Select relevant columns for model training
features = ['agreeable_score', 'extraversion_score', 'openness_score', 'conscientiousness_score', 'neuroticism_score']
X = df_countries_reduced[features]
y = df_countries_reduced['country']

# Encode the target variable
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Prepare the out-of-sample data from df_bruhin (scaled with the scaler fitted on df_countries)
X_bruhin = df_bruhin[features]
X_bruhin_scaled = scaler.transform(X_bruhin)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_encoded, test_size=0.2, random_state=42)

# Function to train and evaluate models
def train_and_evaluate(models, X_train, y_train, X_test, y_test):
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        results[name] = accuracy
    return results

# Define the models to train with updated Logistic Regression parameters
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, solver='saga'),
    # 'Decision Tree': DecisionTreeClassifier()
}

# Train and evaluate the models
results = train_and_evaluate(models, X_train, y_train, X_test, y_test)
print(results)

# Select the best model based on the results
best_model_name = max(results, key=results.get)
best_model = models[best_model_name]

# Train the best model on the entire dataset
best_model.fit(X_scaled, y_encoded)

# Predict the countries for the subjects in df_bruhin
y_bruhin_pred = best_model.predict(X_bruhin_scaled)
y_bruhin_pred_labels = label_encoder.inverse_transform(y_bruhin_pred)

# Add predictions to df_bruhin
df_bruhin['predicted_country'] = y_bruhin_pred_labels

# Save the results
df_bruhin.to_csv('data/subjects_with_predictions.csv', index=False)

# Display the first few rows of the prediction results
print(df_bruhin.head())


{'Logistic Regression': 0.6915868083237069}
           sid  conscientiousness_score  openness_score  extraversion_score  \
0  12010050501                   0.7500        0.421053              0.4375   
1  12010050502                   0.5000        0.684211              0.4375   
2  12010050603                   0.4375        0.473684              0.4375   
3  12010050704                   0.6875        0.421053              0.2500   
4  12010050705                   0.8125        0.315789              0.0000   

   agreeable_score  neuroticism_score  cogabil  pe_D1_stud_natsci  \
0         0.866667           0.176471        3                  1   
1         0.600000           0.647059        7                  1   
2         0.733333           0.647059        3                  0   
3         0.800000           0.647059        9                  1   
4         0.933333           0.235294        4                  0   

   pe_D1_stud_law  pe_D1_stud_socsci  pe_D1_stud_med  pe_monthinc 

The next data set we found includes information about the preference of individuals regarding attrition in a longitudinal study. The data set is from a scientific publication and can be found here: https://data.mendeley.com/datasets/g3jx8zt2t9/1
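
A single train/test split gives only one accuracy estimate. One possible alternative is k-fold cross-validation with the scaler placed inside a pipeline, so that the scaling is fitted on the training folds only. The sketch below uses synthetic data; the 1 to 5 score range, the sample size, and the way the binary label is generated are assumptions for illustration, not properties of the attrition data set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(1)

# Synthetic stand-in for five trait scores on an assumed 1-5 scale.
X = rng.uniform(1, 5, size=(500, 5))
# Binary label that depends weakly on the first trait (purely illustrative).
y = (X[:, 0] + rng.normal(0, 1.5, size=500) > 3).astype(int)

# The scaler is part of the pipeline, so it is refitted on each training fold.
pipeline = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")

print(f"mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

The spread across folds gives a rough idea of how stable the accuracy estimate is before the model is used for out-of-sample prediction.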

In [5]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the datasets
df_attrition = pd.read_csv('data/personality_survey_participation_MI.csv')
df_bruhin = pd.read_csv('data/subjects.csv')

# List of Big Five personality traits
big_five_traits_attrition = ['Openness', 'Conscientiousness', 'Extraversion', 'Agreeableness', 'Neuroticism']
big_five_traits_bruhin = ['bf_openness', 'bf_consciousness', 'bf_extraversion', 'bf_agreeableness', 'bf_neuroticism']

# Normalize each of the Big Five traits in df_attrition
scaler = MinMaxScaler()
df_attrition[big_five_traits_attrition] = scaler.fit_transform(df_attrition[big_five_traits_attrition])

# Normalize each of the Big Five traits in df_bruhin
# Note: fit_transform refits the scaler, so each data set is scaled by its own observed range
df_bruhin[big_five_traits_bruhin] = scaler.fit_transform(df_bruhin[big_five_traits_bruhin])

# Rename columns in df_bruhin to match df_attrition
df_bruhin.rename(columns={
    'bf_openness': 'Openness',
    'bf_consciousness': 'Conscientiousness',
    'bf_extraversion': 'Extraversion',
    'bf_agreeableness': 'Agreeableness',
    'bf_neuroticism': 'Neuroticism'
}, inplace=True)

# Save the normalized df_bruhin
df_bruhin.to_csv('data/subjects_normalized.csv', index=False)

# Select relevant columns for model training
features = ['Openness', 'Conscientiousness', 'Extraversion', 'Agreeableness', 'Neuroticism']
X = df_attrition[features]
y = df_attrition['Attrition']

# Prepare the test data from df_bruhin
X_bruhin = df_bruhin[features]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Function to train and evaluate models
def train_and_evaluate(models, X_train, y_train, X_test, y_test):
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        results[name] = accuracy
    return results

# Define the models to train
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, solver='saga'),
    'Decision Tree': DecisionTreeClassifier()
}

# Train and evaluate the models
results = train_and_evaluate(models, X_train, y_train, X_test, y_test)
print(results)

# Select the best model based on the results
best_model_name = max(results, key=results.get)
best_model = models[best_model_name]

# Train the best model on the entire dataset
best_model.fit(X, y)

# Predict the Attrition for the subjects in df_bruhin
y_bruhin_pred = best_model.predict(X_bruhin)

# Add predictions to df_bruhin
df_bruhin['predicted_attrition'] = y_bruhin_pred

# Save the results
df_bruhin.to_csv('data/subjects_with_attrition_predictions.csv', index=False)

# Display the first few rows of the prediction results
print(df_bruhin.head())


{'Logistic Regression': 0.8835758835758836, 'Decision Tree': 0.7775467775467776}
           sid  Conscientiousness  Openness  Extraversion  Agreeableness  \
0  12010050501             0.7500  0.421053        0.4375       0.866667   
1  12010050502             0.5000  0.684211        0.4375       0.600000   
2  12010050603             0.4375  0.473684        0.4375       0.733333   
3  12010050704             0.6875  0.421053        0.2500       0.800000   
4  12010050705             0.8125  0.315789        0.0000       0.933333   

   Neuroticism  cogabil  pe_D1_stud_natsci  pe_D1_stud_law  pe_D1_stud_socsci  \
0     0.176471        3                  1               0                  0   
1     0.647059        7                  1               0                  0   
2     0.647059        3                  0               0                  1   
3     0.647059        9                  1               0                  0   
4     0.235294        4                  0               