In [2]:
import coins

# Train ML models

The function below trains the ML models used for prediction. Training is necessary in two situations:
- The program is used for the first time and no models exist
- New data from laufbahndiagnostik is available and should be analyzed

Data must be placed in the directory 'data/input/' as CSV files. The following files are read:
- Session data as 'sessions_v1.csv' (old data structure) and 'sessions_v2.csv' (new data structure)
- Image data as 'images_v1.csv' (old data structure) and 'images_v2.csv' (new data structure)
- IPIP data as 'ipip.csv'
- Mood data as 'mood.csv'
- MPZM data as 'mpzm.csv'
- Emotion data as 'emotions_v2.csv' (new data structure)
- Hand-built image labels as 'imageLabels.csv'
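
Before starting a training run, it can help to confirm that the expected files are in place. The following is a minimal sketch and not part of the `coins` package: the helper `missing_input_files` is hypothetical, and the file list simply mirrors the names given above.

```python
from pathlib import Path

# File names as listed above. The helper below is a hypothetical convenience,
# not a function provided by coins.
EXPECTED_FILES = [
    "sessions_v1.csv",
    "sessions_v2.csv",
    "images_v1.csv",
    "images_v2.csv",
    "ipip.csv",
    "mood.csv",
    "mpzm.csv",
    "emotions_v2.csv",
    "imageLabels.csv",
]


def missing_input_files(input_dir="data/input"):
    """Return the expected file names that are not present in input_dir."""
    base = Path(input_dir)
    return [name for name in EXPECTED_FILES if not (base / name).is_file()]


missing = missing_input_files()
if missing:
    print("Missing input files:", ", ".join(missing))
else:
    print("All expected input files are present.")
```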

The function can also translate the user-written image descriptions and analyze them for sentiment and emotions. The result is saved as a separate CSV file in the directory 'data/output/analyzedDataFrames/'. To activate this step, set translate=True. With translate=False, the existing translations and analysis are loaded instead. Only use translate=True when new data has to be translated and analyzed, because each run is charged to your IBM Watson and DeepL accounts. You need to provide your IBM Watson and DeepL API credentials in 'data/input/credentials.yaml'.

To build the ML models, the function also searches for significant correlations in the data. To simplify the data, it converts nearly all numerical user values into categories. Set multiclass=False for two classes (e.g. not neurotic / neurotic) or multiclass=True for three classes (e.g. not neurotic / neutral / neurotic). The split parameter defines where the category boundaries are placed:
- mean
- median
- hard (based on provided documentation)

Keep in mind that a hard split can lead to errors due to unbalanced data.
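
To illustrate how the choice of split point affects class balance, here is a minimal sketch of a two-class categorization. It is not the code used inside `coins`: the function `categorize`, the rule "strictly above the threshold means class 1", and the example scores are all assumptions made for illustration.

```python
from statistics import mean, median


def categorize(values, split="mean"):
    """Assign each value to class 1 (above the split point) or 0 (otherwise).

    Only the two-class case with a mean or median split is sketched here.
    """
    if split == "mean":
        threshold = mean(values)
    elif split == "median":
        threshold = median(values)
    else:
        raise ValueError("this sketch only supports 'mean' and 'median'")
    return [1 if v > threshold else 0 for v in values]


# Hypothetical, skewed scores for five users.
scores = [1.0, 1.5, 2.0, 2.5, 8.0]

print(categorize(scores, split="mean"))    # threshold 3.0 -> [0, 0, 0, 0, 1]
print(categorize(scores, split="median"))  # threshold 2.0 -> [0, 0, 0, 1, 1]
```

On skewed data like this, the mean split places a single user in the upper class, while the median split is more balanced. A fixed threshold can produce the same kind of imbalance.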

The drop percentage specifies how often a value must occur in the data to be kept. For example, with dropPercent=5, samples whose value makes up less than 5 percent of all values are dropped.
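
One possible reading of the drop percentage, sketched under assumptions: the helper `drop_rare_values` is hypothetical and works on a plain list of category labels, whereas the actual implementation in `coins` may apply the rule differently.

```python
from collections import Counter


def drop_rare_values(values, drop_percent=5):
    """Keep only entries whose value accounts for at least drop_percent of all entries."""
    counts = Counter(values)
    min_count = len(values) * drop_percent / 100
    return [v for v in values if counts[v] >= min_count]


# 100 hypothetical labels: 'c' occurs only 3 times (3 percent of all values).
labels = ["a"] * 60 + ["b"] * 37 + ["c"] * 3
kept = drop_rare_values(labels, drop_percent=5)

print(len(kept))          # 97
print(sorted(set(kept)))  # ['a', 'b']
```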

Executing the function can take a long time, depending on your computational power.

In [4]:
response = coins.operations.trainModels(translate=False, multiclass=False, split='mean', dropPercent=5)
response

"Alle Model wurden erfolgreich erstellt. Du findest sie im Ordner 'output/modelResults'."

# Calculate Correlations

If you are more interested in the correlations than in the ML models and automated predictions, use the function below to find and report all significant correlations. The results are written as structured CSV files to the directory 'output/correlations/'.

This function takes the same parameters as the training function above.

In [5]:
response = coins.operations.calculateCorrelations(translate=False, multiclass=False, split='mean', dropPercent=5)
response

"Alle Correlationen wurden erfolgreich berechnet. Du findest sie im Ordner 'output/correlations'."

# Prediction

You can get predictions for new user data. This data (mainly 'ipip.csv' and 'images_v2.csv') must be placed in the directory 'input/prediction/'. Null values are not allowed.
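
Since null values are not allowed, it may be worth scanning the prediction files before running the prediction. The sketch below assumes that a null value appears as an empty or missing cell in the CSV file. The helper `rows_with_empty_cells` and the column names in the example are hypothetical.

```python
import csv
import io


def rows_with_empty_cells(csv_file):
    """Return (row_number, column_name) pairs for every empty or missing cell."""
    reader = csv.reader(csv_file)
    header = next(reader)
    problems = []
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        if not row:
            continue  # skip completely blank lines
        for column, value in zip(header, row):
            if value.strip() == "":
                problems.append((row_number, column))
        # A short row means the trailing cells are missing entirely.
        for column in header[len(row):]:
            problems.append((row_number, column))
    return problems


# In-memory example with hypothetical columns; for a real file use
# open("input/prediction/ipip.csv", newline="") instead of io.StringIO.
sample = io.StringIO("user_id,score\na,3\nb,\n")
print(rows_with_empty_cells(sample))  # [(3, 'score')]
```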

You can choose which kind of data to predict. The possible options are:
- Personality (dfPersonality)
- Socio Demographics (dfSocioDemographics)
- Image Ratings (dfImageRatings)
- Image Descriptions (dfImageDescriptions)
- Image Contents (dfImageContents)

Analyzing the Image Descriptions again requires API credentials for IBM Watson and DeepL. This operation costs money.
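
Because one of the options incurs API costs, a small guard can prevent accidental paid runs. This is a sketch only: `check_target` and its `allow_paid` flag are hypothetical and not part of `coins`. The option names are taken from the list above, and treating only dfImageDescriptions as a paid option is an assumption based on the text.

```python
# Option names as listed above.
VALID_TARGETS = {
    "dfPersonality",
    "dfSocioDemographics",
    "dfImageRatings",
    "dfImageDescriptions",
    "dfImageContents",
}

# Assumption: only this option uses the paid IBM Watson and DeepL APIs.
PAID_TARGETS = {"dfImageDescriptions"}


def check_target(target, allow_paid=False):
    """Return target if it is a known option and paid use was explicitly allowed."""
    if target not in VALID_TARGETS:
        raise ValueError(f"unknown prediction target: {target!r}")
    if target in PAID_TARGETS and not allow_paid:
        raise RuntimeError(f"{target!r} uses paid APIs; pass allow_paid=True to proceed")
    return target


print(check_target("dfPersonality"))
print(check_target("dfImageDescriptions", allow_paid=True))
```

The validated name could then be passed on to `coins.operations.predictNewData`.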

In [6]:
response = coins.operations.predictNewData("dfImageDescriptions")
response

user_id,human,animal,nature,mobility,child,food,0,1,2,3,...,utilization_translation_joyCategory,utilization_translation_fearCategory,utilization_translation_disgustCategory,utilization_translation_angerCategory,story_translation_sentimentCategory,story_translation_sadnessCategory,story_translation_joyCategory,story_translation_fearCategory,story_translation_disgustCategory,story_translation_angerCategory
a,0,0,1,0,0,0,0,0,0,0,...,1.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0
b,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0
