# Preparation of the data
I'm using the results of the 2018 Stack Overflow survey. The results consist mainly of categorical answers to questions like 'Do you write code as a hobby?' or 'How many times do you exercise each week?'. Before I can use Machine Learning, I have to convert all answers to numerical values. I'll also get rid of most of the missing values, because most Machine Learning algorithms can't handle them.

In [1]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import MultiLabelBinarizer

### Import the original Stack Overflow survey 2018 results

In [2]:
so_survey_original = pd.read_csv('./dataset/2018 Stack Overflow Survey Results.csv', low_memory=False)
so_survey_original.head(3)

Unnamed: 0,Respondent,Hobby,OpenSource,Country,Student,Employment,FormalEducation,UndergradMajor,CompanySize,DevType,...,Exercise,Gender,SexualOrientation,EducationParents,RaceEthnicity,Age,Dependents,MilitaryUS,SurveyTooLong,SurveyEasy
0,1,Yes,No,Kenya,No,Employed part-time,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Mathematics or statistics,20 to 99 employees,Full-stack developer,...,3 - 4 times per week,Male,Straight or heterosexual,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Black or of African descent,25 - 34 years old,Yes,,The survey was an appropriate length,Very easy
1,3,Yes,Yes,United Kingdom,No,Employed full-time,"Bachelor’s degree (BA, BS, B.Eng., etc.)","A natural science (ex. biology, chemistry, phy...","10,000 or more employees",Database administrator;DevOps specialist;Full-...,...,Daily or almost every day,Male,Straight or heterosexual,"Bachelor’s degree (BA, BS, B.Eng., etc.)",White or of European descent,35 - 44 years old,Yes,,The survey was an appropriate length,Somewhat easy
2,4,Yes,Yes,United States,No,Employed full-time,Associate degree,"Computer science, computer engineering, or sof...",20 to 99 employees,Engineering manager;Full-stack developer,...,,,,,,,,,,


In [3]:
so_survey = so_survey_original.copy(deep=True)

### Prepare values in list format
Some values in the dataset are denoted as 'Python;Java;C#'. This list format needs to be split and converted to numerical values.
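
As a minimal sketch of what the splitting step means, the toy Series below (made-up values, not survey data) is split on ';' with pandas' `str.split`. Each string becomes a list of its parts, and a missing value stays missing.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only (not taken from the survey)
languages = pd.Series(['Python;Java;C#', 'Python', np.nan])

# str.split turns each string into a list of its parts; NaN stays NaN
split_languages = languages.str.split(';')
print(split_languages.tolist())
```

The lists still aren't numerical, so a second step is needed to turn them into numbers.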

#### Find columns containing list format

In [4]:
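def find_cols_containing_char(df, char):
    """Return the names of non-numeric columns in which at least one value contains `char`."""
    list_columns = []
    for col in df.columns:
        values = df[col].values
        # Numeric columns can't contain the separator, so skip them
        if not np.issubdtype(values.dtype, np.number):
            for val in values:
                if char in str(val):
                    list_columns.append(col)
                    break
    return list_columns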
list_columns = find_cols_containing_char(so_survey, ';')
so_survey[list_columns].head(3)

Unnamed: 0,DevType,CommunicationTools,EducationTypes,SelfTaughtTypes,HackathonReasons,LanguageWorkedWith,LanguageDesireNextYear,DatabaseWorkedWith,DatabaseDesireNextYear,PlatformWorkedWith,...,FrameworkDesireNextYear,IDE,Methodology,VersionControl,AdBlockerReasons,AdsActions,ErgonomicDevices,Gender,SexualOrientation,RaceEthnicity
0,Full-stack developer,Slack,"Taught yourself a new language, framework, or ...",The official documentation and/or standards fo...,To build my professional network,JavaScript;Python;HTML;CSS,JavaScript;Python;HTML;CSS,Redis;SQL Server;MySQL;PostgreSQL;Amazon RDS/A...,Redis;SQL Server;MySQL;PostgreSQL;Amazon RDS/A...,AWS;Azure;Linux;Firebase,...,Django;React,Komodo;Vim;Visual Studio Code,Agile;Scrum,Git,,Saw an online advertisement and then researche...,Standing desk,Male,Straight or heterosexual,Black or of African descent
1,Database administrator;DevOps specialist;Full-...,Confluence;Office / productivity suite (Micros...,"Taught yourself a new language, framework, or ...",The official documentation and/or standards fo...,,JavaScript;Python;Bash/Shell,Go;Python,Redis;PostgreSQL;Memcached,PostgreSQL,Linux,...,React,IPython / Jupyter;Sublime Text;Vim,,Git;Subversion,The website I was visiting asked me to disable it,,Ergonomic keyboard or mouse,Male,Straight or heterosexual,White or of European descent
2,Engineering manager;Full-stack developer,,,,,,,,,,...,,,,,,,,,,


#### Convert ';' separated values to binarized representation
List values such as $'Python;Java;C#'$ can't be used as input for a Machine Learning algorithm. First, the value has to be numerical. Second, encoding the ';' separated values as they are would produce a unique class for every unique list. What is needed instead is a unique class for every language present in the list.<br><br>
To do this I use the *MultiLabelBinarizer* class from the *sklearn.preprocessing* module. It converts values like $[['Python', 'Java', 'C#'], ['Python', 'C#']]$ to the following representation: $[[1, 1, 1], [1, 0, 1]]$ with the columns 'Python', 'Java' and 'C#'. This way every value is converted to a $0$ or a $1$ for each row, which a Machine Learning algorithm can read.
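
The conversion described above can be tried on its own with the same two toy lists. This is only a small sketch. Note that `MultiLabelBinarizer` stores the labels it found in `classes_` in sorted order, so the output columns follow that order and not the order in the input lists.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# The two example lists from the text above
samples = [['Python', 'Java', 'C#'], ['Python', 'C#']]

mlb = MultiLabelBinarizer()
binarized = mlb.fit_transform(samples)

# classes_ holds the column labels in sorted order
print(list(mlb.classes_))   # ['C#', 'Java', 'Python']
# One row per sample, one 0/1 column per class
print(binarized.tolist())   # [[1, 1, 1], [1, 0, 1]]
```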

In [7]:
def binarize_list_columns(df, columns, sep=';'):
    df_copy = df.copy(deep=True)
    mlb = MultiLabelBinarizer()
    for col in columns:
        # Convert list formats like 'Python;Java;C#' to ['Python', 'Java', 'C#']
        # Missing values (NaN) are converted to []
        transformed_vals = [x.split(sep) if isinstance(x, str) else [] for x in df_copy[col].values]

        # Binarize the transformed values
        binarized = mlb.fit_transform(transformed_vals)

        # Add one 0/1 column per class, aligned on the original index
        binarized_df = pd.DataFrame(binarized, columns=mlb.classes_, index=df_copy.index)
        df_copy = df_copy.join(binarized_df, lsuffix='_left', rsuffix='_right')

        # Delete the original column because it doesn't add value anymore
        del df_copy[col]
    return df_copy

# The column 'Student' would collide with the value 'Student' during binarization.
# Therefore the column 'Student' is renamed to 'is_student'
so_survey.rename(columns={'Student': 'is_student'}, inplace=True)

# Start binarization of list columns
so_survey_bin = binarize_list_columns(so_survey, list_columns, ';')
so_survey_bin.head(3)

Unnamed: 0,Respondent,Hobby,OpenSource,Country,is_student,Employment,FormalEducation,UndergradMajor,CompanySize,YearsCoding,...,Bisexual or Queer,Gay or Lesbian,Straight or heterosexual,Black or of African descent,East Asian,Hispanic or Latino/Latina,Middle Eastern,"Native American, Pacific Islander, or Indigenous Australian",South Asian,White or of European descent
0,1,Yes,No,Kenya,No,Employed part-time,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Mathematics or statistics,20 to 99 employees,3-5 years,...,0,0,1,1,0,0,0,0,0,0
1,3,Yes,Yes,United Kingdom,No,Employed full-time,"Bachelor’s degree (BA, BS, B.Eng., etc.)","A natural science (ex. biology, chemistry, phy...","10,000 or more employees",30 or more years,...,0,0,1,0,0,0,0,0,0,1
2,4,Yes,Yes,United States,No,Employed full-time,Associate degree,"Computer science, computer engineering, or sof...",20 to 99 employees,24-26 years,...,0,0,0,0,0,0,0,0,0,0
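
The rename of 'Student' is needed because the new column names come straight from the answer values. One possible alternative, sketched here with a hypothetical helper `binarize_with_prefix` and made-up toy data, is to prefix every new column with the name of its source column, so an answer value can never clash with an existing column name. This is not what the notebook does; it only illustrates the idea.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

def binarize_with_prefix(df, col, sep=';'):
    """Hypothetical helper: binarize one list column and prefix the new columns with `col`."""
    # Split strings on the separator; anything else (NaN, None) becomes an empty list
    lists = [v.split(sep) if isinstance(v, str) else [] for v in df[col]]
    mlb = MultiLabelBinarizer()
    encoded = pd.DataFrame(
        mlb.fit_transform(lists),
        columns=[f'{col}_{label}' for label in mlb.classes_],
        index=df.index,
    )
    # Drop the original column and attach the prefixed 0/1 columns
    return df.drop(columns=col).join(encoded)

# Toy data for illustration only (not taken from the survey)
toy = pd.DataFrame({
    'Student': ['No', 'Yes'],
    'DevType': ['Student;Back-end developer', None],
})
result = binarize_with_prefix(toy, 'DevType')
print(result.columns.tolist())
```

With the prefix, the original 'Student' column and the 'DevType_Student' column can coexist without a rename, at the cost of longer column names.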


### Convert string categories to numerical categories