# Preperation of the data
The survey results of the Stack Overflow survey of 2018 will be used. The results of the survey mainly consists of categorical answers on questions like: 'Do you write code as a hobby?' or 'How many times do you exercise each week?'. Before a Machine Learning classifier can be used, all the values with type 'object' (string) must be mapped to a numerical representation of that object. On top of that, the survey will be cleared of 'unimportant' columns that have no correlation with the target 'JobSatisfaction' and rows containing NaN values in the target column 'JobSatisfaction' will be dropped.

In [1]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import MultiLabelBinarizer
from scipy import stats

### 1. Import the original Stack Overflow 2018 survey results

In [2]:
# Import original dataset
so_survey_original = pd.read_csv('./dataset/2018 Stack Overflow Survey Results.csv', low_memory=False)

print('Rows: %i, Columns: %i' % so_survey_original.shape)
so_survey_original.head(3)

Rows: 98855, Columns: 129


Unnamed: 0,Respondent,Hobby,OpenSource,Country,Student,Employment,FormalEducation,UndergradMajor,CompanySize,DevType,...,Exercise,Gender,SexualOrientation,EducationParents,RaceEthnicity,Age,Dependents,MilitaryUS,SurveyTooLong,SurveyEasy
0,1,Yes,No,Kenya,No,Employed part-time,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Mathematics or statistics,20 to 99 employees,Full-stack developer,...,3 - 4 times per week,Male,Straight or heterosexual,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Black or of African descent,25 - 34 years old,Yes,,The survey was an appropriate length,Very easy
1,3,Yes,Yes,United Kingdom,No,Employed full-time,"Bachelor’s degree (BA, BS, B.Eng., etc.)","A natural science (ex. biology, chemistry, phy...","10,000 or more employees",Database administrator;DevOps specialist;Full-...,...,Daily or almost every day,Male,Straight or heterosexual,"Bachelor’s degree (BA, BS, B.Eng., etc.)",White or of European descent,35 - 44 years old,Yes,,The survey was an appropriate length,Somewhat easy
2,4,Yes,Yes,United States,No,Employed full-time,Associate degree,"Computer science, computer engineering, or sof...",20 to 99 employees,Engineering manager;Full-stack developer,...,,,,,,,,,,


The original data frame will be (deep) copied to a new data frame on which the preperation will take place. After each preperation step the original data frame can be used to make 'before' and 'after' comparisons with the new (prepared) data frame.

In [3]:
# Copy the original data frame to a new data frame
so_survey = so_survey_original.copy(deep=True)

### 2. Drop rows with missing job satisfaction
The target feature is 'JobSatisfaction'. It isn't desired to have missing values for a target feature, because the value NaN doesn't refer to numerical classification value. All target values have to be a numerical classification value before the Machine Learning model can be made.

In [4]:
# Drop all rows with value NaN in the column 'JobSatisfaction'
mask = pd.isnull(so_survey['JobSatisfaction']) == False
so_survey = so_survey[mask].reset_index(drop=True)

print('Rows (orig): %i, Rows (prep): %i' % (so_survey_original.shape[0], so_survey.shape[0]))

Rows (orig): 98855, Rows (prep): 69276


### 3. Drop unimportant columns
Some columns can be left out because they have no correlation with the target column 'JobSatisfaction', are redundant or have too many missing values.

#### 3.1. Drop columns 'Salary',  'Salary Type', 'Currency' and 'CurrencySybmo'
The column 'ConvertedSalary' is the combination of the columns 'Salary', 'SalaryType', 'Currency' and 'CurrencySymbol'. The columns 'Salary', 'SalaryType', 'Currency' and 'CurrencySymbol' are therefore redundant and can be dropped.

In [5]:
# Drop the columns 'Salary' and 'SalaryType'
so_survey.drop(so_survey[['Salary', 'SalaryType', 'Currency', 'CurrencySymbol']], axis=1, inplace=True)

print('Columns (orig): %i, Columns (prep): %i' % (so_survey_original.shape[1], so_survey.shape[1]))

Columns (orig): 129, Columns (prep): 125


#### 3.2. Drop column 'Respondent'
The column 'Respondent' is an unique identifier for each survey respondent. The values within this column aren't correlated with the target column 'JobSatisfaction'.

In [6]:
# Drop the columns 'Respondent'
so_survey.drop(so_survey[['Respondent']], axis=1, inplace=True)

print('Columns (orig): %i, Columns (prep): %i' % (so_survey_original.shape[1], so_survey.shape[1]))

Columns (orig): 129, Columns (prep): 124


#### 3.3. Drop columns containing 'StackOverflow' and 'Survey'
Questions containing 'StackOverflow' and 'Survey' are used to gather feedback for Stack Overflow. The results of these questions aren't correlated with the targed column 'JobSatisfaction'.

In [7]:
def find_columns(df, str_list):
    """ Find all columns which contain
    any of the strings in 'str_list'
    in their name.
    """
    columns = []
    for column in so_survey.columns:
        # Check for every string in 'str_list' 
        # if it's in name of the column.
        for s in str_list:
            if s in column:
                columns.append(column)
                # Break prevents duplicate column names in the list 'columns'
                break
    return columns

In [8]:
columns = find_columns(so_survey, ['StackOverflow', 'Survey'])
so_survey.drop(so_survey[columns], axis=1, inplace=True)

print('Columns (orig): %i, Columns (prep): %i' % (so_survey_original.shape[1], so_survey.shape[1]))

Columns (orig): 129, Columns (prep): 114


#### 3.4. Drop columns with a NaN value percentage of 35% or higher
Imputing statistical values (e.g. mean, mode, etc.) in columns that have more than 35% of missing datawon't result in accurate data.

In [9]:
df = so_survey.isnull().sum() / len(so_survey) * 100

# Drop the columns with NaN value percentage of 35% or higher
columns = df[df >= 35].index
so_survey.drop(so_survey[columns], axis=1, inplace=True)

print('Columns (orig): %i, Columns (prep): %i' % (so_survey_original.shape[1], so_survey.shape[1]))

Columns (orig): 129, Columns (prep): 95


### 4. Prepare values in list format
Some values in the dataset are denoted as $'Python;Java;C#'$. This format represents a list of different values. First, the ';' seperated list has to be converted to a real list data structure. Second, the list needs to be splitted into different columns to make it easy interpretable for Machine Learning algorithms. 

#### 4.1. Find columns containing list format
The columns containing ';' seperated lists will be searched and listed below to get an overview.

In [10]:
def find_cols_containing_char(df, char):
    """ Finds all columns with values
    containing the given character.
    """
    list_columns = []
    # Look for every column
    for col in df.columns:
        values = df[col].values
        # Numerical columns can be skipped, because columns 
        # with type 'number' doesn't contain(string-like) characters.
        if np.issubdtype(values.dtype, np.number) == False:
            # Look for every value in column
            for val in values:
                if char in str(val):
                    list_columns.append(col)
                    # Break prevents extra time spent while this 
                    # column already contains the character
                    break
    return list_columns

In [11]:
# Find all columns with values containing the character ';'
list_columns = find_cols_containing_char(so_survey, ';')
list_columns

['DevType',
 'CommunicationTools',
 'EducationTypes',
 'SelfTaughtTypes',
 'LanguageWorkedWith',
 'LanguageDesireNextYear',
 'DatabaseWorkedWith',
 'DatabaseDesireNextYear',
 'PlatformWorkedWith',
 'PlatformDesireNextYear',
 'FrameworkDesireNextYear',
 'IDE',
 'Methodology',
 'VersionControl',
 'AdsActions',
 'Gender',
 'SexualOrientation',
 'RaceEthnicity']

#### 4.2. Convert ';' seperated values to binarized representation
List values such as $'Python;Java;C#'$ can't be used as input for a Machine Learning algorithm. First, the value has to be numerical. Second, numerification of the ';' value as is will result a unique class for every unique list. It is instead needed to get a unique class for every language present in the list.<br><br>
To do this the the *MultiLabelBinarizer* algorithm from the *sklearn.preprocessing* library can be used. This algorithm converts values like $[['Python', 'Java', 'C#'], ['Python', 'C#']]$ to the following representation: $[[1, 1, 1], [1, 0, 1]]$ with the columns 'Python', 'Java' and 'C#'. This way the the values will be converted to a $0$ or a $1$ for each row, which is easier interpretable for a Machine Learning algorithm.

In [12]:
def binarize_list_columns(df, columns, sep=';'):
    """ Converts all ';' (or other seperator) lists to actual list data structures.
    The newly created list will be converted to a binarized matrix,
    in which each column is a unique value present in the list.
    The column tells if a certain value is present in the corresponding row or not
    by using boolean (binary) values.
    """
    df_copy = df.copy(deep=True)
    mlb = MultiLabelBinarizer()
    for col in columns:
        # Convert list formats like 'Python;Java;C#' to [Python, Java, C#]
        # If value is NaN, the value will be converted to []
        tranformed_vals = [x.split(';') if x is not np.nan else [] for x in df_copy[col].values]
        
        # Binerize the tranformed values
        binerized = mlb.fit_transform(tranformed_vals)

        # Add the binerized value to the existing data frame
        df_copy = df_copy.join(pd.DataFrame(binerized,
                                            columns=[col + '%' + str(x) for x in mlb.classes_]), lsuffix='_left', rsuffix='_right')
        
        # Delete original column because it doesn't add value anymore
        del df_copy[col]
    return df_copy

In [13]:
# The column 'Student' will colide with the value 'Student' in the binarization process.
# Therefore the column 'Student' will be renamed to 'is_student'
so_survey.rename(columns={'Student': 'is_student'}, inplace=True)

In [14]:
# Start binarization of list columns
so_survey_bin = binarize_list_columns(so_survey, list_columns, ';')

print('Columns (orig): %i, Columns (prep): %i' % (so_survey_original.shape[1], so_survey_bin.shape[1]))
so_survey_bin.head(3)

Columns (orig): 129, Columns (prep): 366


Unnamed: 0,Hobby,OpenSource,Country,is_student,Employment,FormalEducation,UndergradMajor,CompanySize,YearsCoding,YearsCodingProf,...,SexualOrientation%Bisexual or Queer,SexualOrientation%Gay or Lesbian,SexualOrientation%Straight or heterosexual,RaceEthnicity%Black or of African descent,RaceEthnicity%East Asian,RaceEthnicity%Hispanic or Latino/Latina,RaceEthnicity%Middle Eastern,"RaceEthnicity%Native American, Pacific Islander, or Indigenous Australian",RaceEthnicity%South Asian,RaceEthnicity%White or of European descent
0,Yes,No,Kenya,No,Employed part-time,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Mathematics or statistics,20 to 99 employees,3-5 years,3-5 years,...,0,0,1,1,0,0,0,0,0,0
1,Yes,Yes,United Kingdom,No,Employed full-time,"Bachelor’s degree (BA, BS, B.Eng., etc.)","A natural science (ex. biology, chemistry, phy...","10,000 or more employees",30 or more years,18-20 years,...,0,0,1,0,0,0,0,0,0,1
2,Yes,Yes,United States,No,Employed full-time,Associate degree,"Computer science, computer engineering, or sof...",20 to 99 employees,24-26 years,6-8 years,...,0,0,0,0,0,0,0,0,0,0


### 5. Encode ordinal data

In [15]:
mappings = {}

In [16]:
def find_object_cols(df):
    """ Finds all columns that 
    have the dtype of 'object'
    """
    return [x for x in df.columns if df[x].dtype == object]

#### Hand pick columns containing ordinal values and order them

In [17]:
print(find_object_cols(so_survey_bin))

['Hobby', 'OpenSource', 'Country', 'is_student', 'Employment', 'FormalEducation', 'UndergradMajor', 'CompanySize', 'YearsCoding', 'YearsCodingProf', 'JobSatisfaction', 'CareerSatisfaction', 'HopeFiveYears', 'JobSearchStatus', 'LastNewJob', 'UpdateCV', 'TimeFullyProductive', 'AgreeDisagree1', 'AgreeDisagree2', 'AgreeDisagree3', 'OperatingSystem', 'NumberMonitors', 'CheckInCode', 'AdBlocker', 'AdsAgreeDisagree1', 'AdsAgreeDisagree2', 'AdsAgreeDisagree3', 'AIDangerous', 'AIInteresting', 'AIResponsible', 'AIFuture', 'EthicsChoice', 'EthicsReport', 'EthicsResponsible', 'EthicalImplications', 'HypotheticalTools1', 'HypotheticalTools2', 'HypotheticalTools3', 'HypotheticalTools4', 'HypotheticalTools5', 'WakeTime', 'HoursComputer', 'HoursOutside', 'SkipMeals', 'Exercise', 'EducationParents', 'Age', 'Dependents']


In [18]:
ordinal_columns = ['CompanySize', 'YearsCoding', 'YearsCodingProf', 'JobSatisfaction', 'CareerSatisfaction',
                   'TimeFullyProductive', 'AgreeDisagree1', 'AgreeDisagree2', 'AgreeDisagree3', 'NumberMonitors',
                   'CheckInCode', 'AdsAgreeDisagree1', 'AdsAgreeDisagree2', 'AdsAgreeDisagree3', 'HypotheticalTools1',
                   'HypotheticalTools2', 'HypotheticalTools3', 'HypotheticalTools4', 'HypotheticalTools5',
                   'HoursComputer', 'HoursOutside', 'SkipMeals', 'Exercise', 'Age']
mappings['CompanySize'] = {
    '20 to 99 employees': 2,
    '10,000 or more employees': 7,
    '100 to 499 employees': 3,
    '10 to 19 employees': 1,
    '1,000 to 4,999 employees': 5,
    '5,000 to 9,999 employees': 6,
    '500 to 999 employees': 4,
    'Fewer than 10 employees': 0
}
mappings['YearsCoding'] = {
    '3-5 years': 1,
    '30 or more years': 9,
    '24-26 years': 7,
    '18-20 years': 5,
    '6-8 years': 2,
    '9-11 years': 3,
    '0-2 years': 0,
    '15-17 years': 5,
    '12-14 years': 4,
    '21-23 years': 6,
    '27-29 years': 8
}
mappings['YearsCodingProf'] = {
    '3-5 years': 1,
    '30 or more years': 9,
    '24-26 years': 7,
    '18-20 years': 5,
    '6-8 years': 2,
    '9-11 years': 3,
    '0-2 years': 0,
    '15-17 years': 5,
    '12-14 years': 4,
    '21-23 years': 6,
    '27-29 years': 8
}
mappings['JobSatisfaction'] = {
    'Extremely satisfied': 6,
    'Moderately dissatisfied': 1,
    'Moderately satisfied': 5,
    'Neither satisfied nor dissatisfied': 3,
    'Slightly satisfied': 4,
    'Slightly dissatisfied': 2,
    'Extremely dissatisfied': 0
}
mappings['CareerSatisfaction'] = {
    'Extremely satisfied': 6,
    'Moderately dissatisfied': 1,
    'Moderately satisfied': 5,
    'Neither satisfied nor dissatisfied': 3,
    'Slightly satisfied': 4,
    'Slightly dissatisfied': 2,
    'Extremely dissatisfied': 0
}
mappings['TimeFullyProductive'] = {
    'One to three months': 1,
    'Three to six months': 2,
    'Less than a month': 0,
    'Six to nine months': 3,
    'More than a year': 5,
    'Nine months to a year': 4
}
mappings['AgreeDisagree1'] = {
    'Strongly agree': 4,
    'Agree': 3,
    'Disagree': 1,
    'Neither Agree nor Disagree': 2,
    'Strongly disagree': 0
}
mappings['AgreeDisagree2'] = {
    'Strongly agree': 4,
    'Agree': 3,
    'Disagree': 1,
    'Neither Agree nor Disagree': 2,
    'Strongly disagree': 0
}
mappings['AgreeDisagree3'] = {
    'Strongly agree': 4,
    'Agree': 3,
    'Disagree': 1,
    'Neither Agree nor Disagree': 2,
    'Strongly disagree': 0
}
mappings['NumberMonitors'] = {
    '1': 0,
    '2': 1,
    'More than 4': 4,
    '3': 2,
    '4': 3
}
mappings['CheckInCode'] = {
    'Multiple times per day': 5,
    'A few times per week': 3,
    'Weekly or a few times per month': 2,
    'Never': 0,
    'Less than once per month': 1,
    'Once a day': 4
}
mappings['AdsAgreeDisagree1'] = {
    'Strongly agree': 4,
    'Agree': 3,
    'Disagree': 1,
    'Neither Agree nor Disagree': 2,
    'Strongly disagree': 0
}
mappings['AdsAgreeDisagree2'] = {
    'Strongly agree': 4,
    'Agree': 3,
    'Disagree': 1,
    'Neither Agree nor Disagree': 2,
    'Strongly disagree': 0
}
mappings['AdsAgreeDisagree3'] = {
    'Strongly agree': 4,
    'Agree': 3,
    'Disagree': 1,
    'Neither Agree nor Disagree': 2,
    'Strongly disagree': 0
}
mappings['HypotheticalTools1'] = {
    'Extremely interested': 4,
    'A little bit interested': 1,
    'Somewhat interested': 2,
    'Very interested': 3,
    'Not at all interested': 0
}
mappings['HypotheticalTools2'] = {
    'Extremely interested': 4,
    'A little bit interested': 1,
    'Somewhat interested': 2,
    'Very interested': 3,
    'Not at all interested': 0
}
mappings['HypotheticalTools3'] = {
    'Extremely interested': 4,
    'A little bit interested': 1,
    'Somewhat interested': 2,
    'Very interested': 3,
    'Not at all interested': 0
}
mappings['HypotheticalTools4'] = {
    'Extremely interested': 4,
    'A little bit interested': 1,
    'Somewhat interested': 2,
    'Very interested': 3,
    'Not at all interested': 0
}
mappings['HypotheticalTools5'] = {
    'Extremely interested': 4,
    'A little bit interested': 1,
    'Somewhat interested': 2,
    'Very interested': 3,
    'Not at all interested': 0
}
mappings['HoursComputer'] = {
    '9 - 12 hours': 3,
    '5 - 8 hours': 2,
    'Over 12 hours': 4,
    '1 - 4 hours': 1,
    'Less than 1 hour': 0
}
mappings['HoursOutside'] = {
    '1 - 2 hours': 2,
    '30 - 59 minutes': 1,
    'Less than 30 minutes': 0,
    '3 - 4 hours': 3,
    'Over 4 hours': 4
}
mappings['SkipMeals'] = {
    'Never': 0,
    '3 - 4 times per week': 2,
    '1 - 2 times per week': 1,
    'Daily or almost every day': 3
}
mappings['Exercise'] = {
    '3 - 4 times per week': 2,
    'Daily or almost every day': 3,
    "I don't typically exercise": 0,
    '1 - 2 times per week': 1
}
mappings['Age'] = {
    '25 - 34 years old': 2,
    '35 - 44 years old': 3,
    '18 - 24 years old': 1,
    '45 - 54 years old': 4,
    '55 - 64 years old': 5,
    'Under 18 years old': 0,
    '65 years or older': 6
}

#### Map ordinal values to mapped values

In [19]:
for col in ordinal_columns:
    so_survey_bin[col] = so_survey_bin[col].map(mappings[col])
    
    # Impute NaN (most_frequent)
    values = so_survey_bin[col].values
    stat_value = stats.mode(values[~np.isnan(values)])[0][0]
    so_survey_bin[col].fillna(stat_value, inplace=True)

In [20]:
so_survey_bin.head(3)

Unnamed: 0,Hobby,OpenSource,Country,is_student,Employment,FormalEducation,UndergradMajor,CompanySize,YearsCoding,YearsCodingProf,...,SexualOrientation%Bisexual or Queer,SexualOrientation%Gay or Lesbian,SexualOrientation%Straight or heterosexual,RaceEthnicity%Black or of African descent,RaceEthnicity%East Asian,RaceEthnicity%Hispanic or Latino/Latina,RaceEthnicity%Middle Eastern,"RaceEthnicity%Native American, Pacific Islander, or Indigenous Australian",RaceEthnicity%South Asian,RaceEthnicity%White or of European descent
0,Yes,No,Kenya,No,Employed part-time,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Mathematics or statistics,2.0,1.0,1,...,0,0,1,1,0,0,0,0,0,0
1,Yes,Yes,United Kingdom,No,Employed full-time,"Bachelor’s degree (BA, BS, B.Eng., etc.)","A natural science (ex. biology, chemistry, phy...",7.0,9.0,5,...,0,0,1,0,0,0,0,0,0,1
2,Yes,Yes,United States,No,Employed full-time,Associate degree,"Computer science, computer engineering, or sof...",2.0,7.0,2,...,0,0,0,0,0,0,0,0,0,0


### 6. Encode nominal data

In [21]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

In [22]:
def one_hot_encode(df, columns):
    dataframes = [df]
    for col in columns:
        # Get unique values without NaN
        uniques = df[col].unique()
        uniques = uniques[~pd.isnull(uniques)]

        # Encode labels
        encoder = LabelEncoder()
        labels = encoder.fit_transform(uniques)
        df[col] = df[col].map({label: index for index, label in enumerate(encoder.classes_)})

        # Impute NaN
        values = df[col].values
        stat_value = stats.mode(values[~np.isnan(values)])[0][0]
        df[col].fillna(stat_value, inplace=True)

        # One Hot
        ohe = OneHotEncoder()
        feature_arr = ohe.fit_transform(df[[col]]).toarray()
        feature_labels = [col + '%' + str(x) for x in list(encoder.classes_)]
        dataframes.append(pd.DataFrame(feature_arr, columns=feature_labels))

        # Delete old column
        del df[col]
    return pd.concat(dataframes, axis=1)

In [23]:
so_survey_enc = one_hot_encode(so_survey_bin, find_object_cols(so_survey_bin))

In [24]:
so_survey_enc.head(3)

Unnamed: 0,CompanySize,YearsCoding,YearsCodingProf,JobSatisfaction,CareerSatisfaction,AssessJob1,AssessJob2,AssessJob3,AssessJob4,AssessJob5,...,"EducationParents%Bachelor’s degree (BA, BS, B.Eng., etc.)","EducationParents%Master’s degree (MA, MS, M.Eng., MBA, etc.)","EducationParents%Other doctoral degree (Ph.D, Ed.D., etc.)",EducationParents%Primary/elementary school,"EducationParents%Professional degree (JD, MD, etc.)","EducationParents%Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)",EducationParents%Some college/university study without earning a degree,EducationParents%They never completed any formal education,Dependents%No,Dependents%Yes
0,2.0,1.0,1,6,6,10.0,7.0,8.0,1.0,2.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,7.0,9.0,5,1,3,1.0,7.0,10.0,8.0,2.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,2.0,7.0,2,5,5,,,,,,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


### 6. Impute missing data
Some data is still missing, this missing data can be imputed with statistical values (e.g. mean, mode, etc.).

In [25]:
def impute_missing(df):
    """ Missing values of numerical columns will 
    be replaced by the mean of all values of that column.
    Missing values of categorical columns will
    be replaced by the mode of all values of that column.
    """
    # Look for every column
    for col in df.columns:
        values = df[col].values
        stat_value = 0
        if col is 'ConvertedSalary':
            # If column is numerical: get mean of all values
            stat_value = values[~np.isnan(values)].mean()
        else:
            # If column is numerical: get mode of all values
            stat_value = stats.mode(values[~np.isnan(values)])[0][0]
        # Fill missing values with statistical value
        df[col].fillna(stat_value, inplace=True)

In [26]:
# Impute missing (NaN) values
impute_missing(so_survey_enc)

# Check if dataset has still missing values
df = so_survey_enc.isnull().sum()
for v in df.values:
    if v > 0:
        print('Has missing (NaN) values')
        break

### 7. Export prepped data frame
The preprocessed data frame can be exported to a csv file, so future Machine Learning algorithms can use it.

In [27]:
# Export preprocessed imputed data frame to a csv file
so_survey_enc.to_csv('./dataset/so_survey_prepped.csv', index=False)

### 8. Export used mappings
The used mappings to numerical values are useful for decoding. A data frame will be created in which the numerical values are the row indeces and the text values are the column indeces. This will make future decoding of the values a lot easier.

In [28]:
def inverse_mappings(maps):
    """ Switches the keys and values for every key value pairs of the original mappings.
    E.g. 'Hobby': { 'yes': 1, 'no': 0 },
    will become 'Hobby': { 1: 'yes', 0: 'no' }
    """
    maps_inv = {}
    # Look for every key/column with mapped values
    for key in maps.keys():
        m = {}
        # Switch keys and values for all key value pairs.
        for k, v in maps[key].items():
            m[v] = k
        maps_inv[key] = m
    return maps_inv

In [29]:
# Inverse the current mappings and create a mappings matrix dataframe from the inversed mappings
maps_matrix = pd.DataFrame(inverse_mappings(mappings))

maps_matrix.head(3)

Unnamed: 0,CompanySize,YearsCoding,YearsCodingProf,JobSatisfaction,CareerSatisfaction,TimeFullyProductive,AgreeDisagree1,AgreeDisagree2,AgreeDisagree3,NumberMonitors,...,HypotheticalTools1,HypotheticalTools2,HypotheticalTools3,HypotheticalTools4,HypotheticalTools5,HoursComputer,HoursOutside,SkipMeals,Exercise,Age
0,Fewer than 10 employees,0-2 years,0-2 years,Extremely dissatisfied,Extremely dissatisfied,Less than a month,Strongly disagree,Strongly disagree,Strongly disagree,1,...,Not at all interested,Not at all interested,Not at all interested,Not at all interested,Not at all interested,Less than 1 hour,Less than 30 minutes,Never,I don't typically exercise,Under 18 years old
1,10 to 19 employees,3-5 years,3-5 years,Moderately dissatisfied,Moderately dissatisfied,One to three months,Disagree,Disagree,Disagree,2,...,A little bit interested,A little bit interested,A little bit interested,A little bit interested,A little bit interested,1 - 4 hours,30 - 59 minutes,1 - 2 times per week,1 - 2 times per week,18 - 24 years old
2,20 to 99 employees,6-8 years,6-8 years,Slightly dissatisfied,Slightly dissatisfied,Three to six months,Neither Agree nor Disagree,Neither Agree nor Disagree,Neither Agree nor Disagree,3,...,Somewhat interested,Somewhat interested,Somewhat interested,Somewhat interested,Somewhat interested,5 - 8 hours,1 - 2 hours,3 - 4 times per week,3 - 4 times per week,25 - 34 years old


In [30]:
maps_matrix.to_csv('./dataset/so_survey_mappings.csv', index=False)