# Predicting Job Satisfaction of Software Developers
This research project builds a classification model that tries to predict the job satisfaction of software developers from the results of the 2018 Stack Overflow Developer Survey.

## Describing the dataset
[Stack Overflow](https://stackoverflow.com/) is a website where you can ask and answer software-related questions. It is a platform where millions of programmers, software developers, software engineers, etc. meet every day to learn from each other. Stack Overflow is aware of the enormous popularity of its platform and therefore runs an annual survey to gain general insight into the average software engineer and their field of work.

Every year the results of the survey are published on their website (cleaned and in CSV format). Multiple datasets (one for each year since 2015) are available for analysis. This project uses the 2018 survey results exclusively, mainly because they are recent and complete. The 2018 survey was filled in by around 100,000 software developers, most of whom answered 129 different questions. These questions cover job satisfaction, salary, favourite programming languages, weekly exercise, company size, etc. This large variety of questions offers many interesting research opportunities.

**Source of the dataset**<br />
Stack Overflow Developer Survey 2018 (186 MB): https://www.kaggle.com/stackoverflow/stack-overflow-2018-developer-survey


**Data description:**<br />
The 2018 Stack Overflow survey results contain 98855 rows and 129 columns in total. Some of these columns consist of only numerical data, such as 'Salary' and 'ConvertedSalary'. All other columns are categorical. The categorical columns can be divided into three types:
- Values denoted in text. E.g. 'Yes', 'No', 'United States', 'Employed part-time', etc.
- Values denoted in ';' separated lists. E.g. 'Python;Java;C#', 'Windows;Linux', etc.
- Values denoted in numbers. E.g. 1, 2, 3, 4, etc. (for rankings)
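As a rough illustration of the column types listed above, the sketch below sorts the columns of a small stand-in frame into numeric, ';' separated list, and plain text columns. The frame and the helper `column_kind` are hypothetical and only serve the example. Note that ranking columns stored as numbers cannot be told apart from truly numerical columns by dtype alone, so they would still have to be marked as categorical by hand.

```python
import pandas as pd

# Toy frame standing in for the raw survey; names and values are illustrative only.
raw = pd.DataFrame({
    "ConvertedSalary": [52000.0, None, 78000.0],
    "Student": ["Yes", "No", "Yes"],
    "LanguageWorkedWith": ["Python;Java;C#", "Python;C#", None],
    "AssessJob1": [1.0, 3.0, 2.0],  # a ranking stored as numbers
})


def column_kind(series):
    """Return 'numeric', 'list' or 'text' for a single column (hypothetical helper)."""
    if pd.api.types.is_numeric_dtype(series):
        return "numeric"
    # A text column is treated as a list column if any value contains ';'.
    if series.dropna().astype(str).str.contains(";", regex=False).any():
        return "list"
    return "text"


kinds = {name: column_kind(raw[name]) for name in raw.columns}
print(kinds)
```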

Textual input can't be interpreted easily by the average Machine Learning algorithm. Therefore, the original dataset needs preprocessing before it can be used for further analysis. A [notebook](./dataset_preprocessing.ipynb) has been created that is dedicated to preprocessing the Stack Overflow survey dataset in the following ways:
- Drop rows with missing values in the column 'JobSatisfaction'. The column 'JobSatisfaction' is the target value that will be used for analysis. Missing values are not acceptable in a target feature, because NaN is not a valid class.
- Drop unimportant columns. Some columns can be left out because they have no correlation with the target column 'JobSatisfaction', are redundant or have too many missing values (80%).
- Preprocess values denoted in ';' separated lists. List values such as 'Python;Java;C#' can't be used as input for a Machine Learning algorithm. First, the values have to be numerical. Second, encoding each list as is would produce a separate class for every unique list. What is needed instead is a separate class for every item (e.g. every language) present in the lists.
- Encode text to numerical values. Text isn't easy to interpret for Machine Learning algorithms. To solve this problem all text-formatted values will be converted to numerical values.
- (optional) Impute missing values. A lot of data is missing; this missing data can be imputed with statistical values (e.g. mean, mode, etc.). Both the imputed and the not-imputed dataset will be used in the analysis, and the one that performs better will be chosen.
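The row-dropping, column-dropping and imputation steps above could look roughly like the sketch below. It is not the code from the preprocessing notebook: the toy frame, the column `MostlyEmpty`, the `MAX_MISSING` cut-off (read from the "(80%)" in the list) and the choice of mode imputation are all assumptions made for the example.

```python
import pandas as pd

# Toy data; only 'JobSatisfaction' is a column name taken from the text above.
raw = pd.DataFrame({
    "JobSatisfaction": ["Satisfied", None, "Dissatisfied", "Satisfied",
                        "Satisfied", "Dissatisfied", "Satisfied"],
    "CompanySize": ["Small", "Large", None, "Small", "Large", "Small", "Large"],
    "MostlyEmpty": [None, None, None, None, "x", None, None],
})

MAX_MISSING = 0.8  # assumed cut-off for "too many missing values"

# Drop rows without a target value.
prepped = raw.dropna(subset=["JobSatisfaction"])

# Drop columns whose share of missing values reaches the cut-off.
missing_share = prepped.isna().mean()
prepped = prepped.loc[:, missing_share < MAX_MISSING]

# (optional) Impute the remaining gaps with the most frequent value per column.
imputed = prepped.fillna(prepped.mode().iloc[0])

print(prepped.shape, imputed.isna().sum().sum())
```

Keeping `prepped` and `imputed` side by side mirrors the plan of evaluating both the not-imputed and the imputed dataset.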

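The above steps result in a preprocessed dataset with 69276 rows and 410 columns.<br />
The data will be transformed as follows (the first table shows example rows before preprocessing, the second table shows the same rows afterwards):<br />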

| Student | Programming Language | Country        |
|:-------:|:--------------------:|:--------------:|
| Yes     | Python;Java;C#       | Kenya          |
| No      | Python;C#            | United Kingdom |
| Yes     | Java;C#              | United States  |

| Student | Python | Java | C# | Country |
|:-------:|:------:|:----:|:--:|:-------:|
| 1       | 1      | 1    | 1  | 0       |
| 0       | 1      | 0    | 1  | 1       |
| 1       | 0      | 1    | 1  | 2       |
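The transformation shown in the two tables can be reproduced with a few pandas calls. This is a minimal sketch on the three example rows, not the notebook's actual implementation: the explicit 'Yes'/'No' mapping and the use of `pd.factorize` for the country codes are assumptions chosen so that the result matches the second table.

```python
import pandas as pd

# The three example rows from the first table above.
raw = pd.DataFrame({
    "Student": ["Yes", "No", "Yes"],
    "Programming Language": ["Python;Java;C#", "Python;C#", "Java;C#"],
    "Country": ["Kenya", "United Kingdom", "United States"],
})

# One 0/1 indicator column per language found in the ';' separated lists.
languages = raw["Programming Language"].str.get_dummies(sep=";")

# Explicit mapping so that 'Yes' becomes 1, as in the second table.
student = raw["Student"].map({"No": 0, "Yes": 1})

# factorize assigns integer codes in order of first appearance.
country_codes, country_labels = pd.factorize(raw["Country"])
country = pd.Series(country_codes, index=raw.index, name="Country")

encoded = pd.concat([student, languages[["Python", "Java", "C#"]], country], axis=1)
print(encoded)
```

The labels returned by `pd.factorize` (here `country_labels`) are what a mapping table for decoding would need to store.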

## Research
Research will be conducted on the job satisfaction of software developers. The dataset, as described above, will be used to create a classification model that predicts the job satisfaction of software developers. The following research question will be answered:

**Can an accurate model be created, given the features denoted in the survey, to predict the job satisfaction of software developers?**

The model is considered accurate if at least 90% of its predictions equal the target values (column 'JobSatisfaction'). In addition to the research question, an initial hypothesis can be made:

**Salary is the main influencer of job satisfaction, and is therefore responsible for a high-accuracy classification model.**

This hypothesis will be either accepted or rejected according to the research results.
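The 90% criterion from the research question can be written down as a small check. The sketch below uses made-up labels and a hypothetical `accuracy` helper; it only illustrates how the share of matching predictions would be compared with the threshold.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the target values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


ACCURACY_THRESHOLD = 0.9  # the 90% criterion stated above

# Made-up encoded labels, purely for illustration.
y_true = [0, 1, 2, 2, 1, 0, 1, 2, 0, 1]
y_pred = [0, 1, 2, 2, 1, 0, 1, 2, 0, 2]

score = accuracy(y_true, y_pred)
is_accurate = score >= ACCURACY_THRESHOLD
print(score, is_accurate)
```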

## Exploration
Initial exploration gives insight into the data itself and builds intuition for conducting the research.

In [1]:
%matplotlib inline
import pandas as pd

#### Import the preprocessed dataset
Only the not-imputed dataset is imported for exploration, because it yields more accurate visualizations than the imputed dataset.

In [9]:
# Import not-imputed preprocessed dataset
so_survey = pd.read_csv('./dataset/so_survey_prepped.csv')

# Import the mappings for decoding purposes
so_mappings = pd.read_csv('./dataset/so_survey_mappings.csv')
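To illustrate why the not-imputed dataset is preferred for exploration, the sketch below compares the distribution of a made-up categorical column before and after imputation. Mode imputation is assumed here as one of the statistical options mentioned earlier; the answers themselves are invented.

```python
import pandas as pd

# Made-up answers for one categorical column; None marks a skipped question.
answers = pd.Series(["Never", "Daily", None, "Never", None, None, "Weekly", None])

# Shares among the answers that were actually given.
observed_share = answers.value_counts(normalize=True)

# Mode imputation assigns every missing answer to the most frequent category.
imputed = answers.fillna(answers.mode().iloc[0])
imputed_share = imputed.value_counts(normalize=True)

print(observed_share["Never"], imputed_share["Never"])
```

In this toy case the share of 'Never' grows from half of the given answers to three quarters of all rows, which is the kind of distortion a plot of imputed data would show.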

In [10]:
# Show first three entries of so_survey data frame
so_survey.head(3)

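| Unnamed: 0 | Hobby | OpenSource | Country | is_student | Employment | FormalEducation | UndergradMajor | CompanySize | YearsCoding | YearsCodingProf | ... | Bisexual or Queer | Gay or Lesbian | Straight or heterosexual | Black or of African descent | East Asian | Hispanic or Latino/Latina | Middle Eastern | Native American, Pacific Islander, or Indigenous Australian | South Asian | White or of European descent |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 0 | 0 | 0 | 0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | ... | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 1 | 0.0 | 1 | 0.0 | 1.0 | 1.0 | 1.0 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 0 | 1 | 2 | 0.0 | 1 | 1.0 | 2.0 | 0.0 | 2.0 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |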
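The 'Unnamed: 0' column in the output above may be the row index that was written along with the data when the CSV was saved. That is an assumption, since the export step is not shown here. If it holds, passing `index_col=0` to `pd.read_csv` restores it as the index instead of leaving it as an extra column. The sketch uses a small in-memory CSV rather than the real file.

```python
import io

import pandas as pd

# Small stand-in frame, written to an in-memory CSV together with its row index.
frame = pd.DataFrame({"Hobby": [0, 0, 0], "OpenSource": [0, 1, 1]})
csv_text = frame.to_csv()  # index=True is the default

default_read = pd.read_csv(io.StringIO(csv_text))
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_read.columns))  # ['Unnamed: 0', 'Hobby', 'OpenSource']
print(list(indexed_read.columns))  # ['Hobby', 'OpenSource']
```

Handling the index this way would also keep a plain row counter from being treated as a feature later on.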


In [11]:
# Show first three entries of so_mappings data frame
so_mappings.head(3)

| Unnamed: 0 | Hobby | OpenSource | Country | is_student | Employment | FormalEducation | UndergradMajor | CompanySize | YearsCoding | YearsCodingProf | ... | HypotheticalTools4 | HypotheticalTools5 | WakeTime | HoursComputer | HoursOutside | SkipMeals | Exercise | EducationParents | Age | Dependents |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 0 | Yes | No | Kenya | No | Employed part-time | Bachelor’s degree (BA, BS, B.Eng., etc.) | Mathematics or statistics | 20 to 99 employees | 3-5 years | 3-5 years | ... | Extremely interested | Extremely interested | Between 5:00 - 6:00 AM | 9 - 12 hours | 1 - 2 hours | Never | 3 - 4 times per week | Bachelor’s degree (BA, BS, B.Eng., etc.) | 25 - 34 years old | Yes |
| 1 | No | Yes | United Kingdom | Yes, part-time | Employed full-time | Associate degree | A natural science (ex. biology, chemistry, phy... | 10,000 or more employees | 30 or more years | 18-20 years | ... | A little bit interested | A little bit interested | Between 6:01 - 7:00 AM | 5 - 8 hours | 30 - 59 minutes |  | Daily or almost every day |  | 35 - 44 years old |  |
| 2 |  |  | United States | Yes, full-time | Independent contractor, freelancer, or self-em... | Some college/university study without earning ... | Computer science, computer engineering, or sof... | 100 to 499 employees | 24-26 years | 6-8 years | ... |  |  |  |  |  | 3 - 4 times per week |  | Some college/university study without earning ... |  | No |


#### Create decode function
A generic decode function is created in advance, so that encoded values can be translated back to their original labels.

In [18]:
def decode_labels(encoded_labels, column_name, decoder=so_mappings):
    """ Decodes encoded (preprocessed) labels using a decoder.
    E.g. [0, 1, 0, 0] for column 'Hobby' => ['Yes', 'No', 'Yes', 'Yes']
    """
    # Look the labels up in the decoder argument rather than the global
    # so_mappings, so that a different mapping frame can be passed in.
    decoded_labels = decoder[column_name].values
    return [decoded_labels[i] for i in encoded_labels]

['Bachelor’s degree (BA, BS, B.Eng., etc.)',
 'Associate degree',
 'Bachelor’s degree (BA, BS, B.Eng., etc.)',
 'Bachelor’s degree (BA, BS, B.Eng., etc.)',
 'Some college/university study without earning a degree',
 'Master’s degree (MA, MS, M.Eng., MBA, etc.)']
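For reference, a self-contained variant of the decode function can be tried without the survey files. The mapping frame `toy_mappings` below is hypothetical, and the `int()` cast is an addition not present in the function above: it is there because several encoded columns in `so_survey` hold floats such as 0.0, which cannot be used directly as positions.

```python
import pandas as pd

# Hypothetical mapping frame: row i holds the original label for code i.
toy_mappings = pd.DataFrame({
    "Hobby": ["Yes", "No"],
    "Country": ["Kenya", "United Kingdom"],
})


def decode_labels(encoded_labels, column_name, decoder):
    """Look up the original label for every encoded value."""
    labels = decoder[column_name].values
    # int() lets codes stored as floats (e.g. 0.0) be used as positions.
    return [labels[int(code)] for code in encoded_labels]


decoded = decode_labels([0, 1, 0.0, 0], "Hobby", toy_mappings)
print(decoded)  # ['Yes', 'No', 'Yes', 'Yes']
```

Missing values (NaN) would still need to be filtered out before decoding, since they have no position in the mapping.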