# Developer Survey Analysis

The goal of this analysis is to examine the 2018 survey results and gain relevant insights into the profession. Some questions we are interested in right off the bat are:
- what are the determinants of job satisfaction?
- what are the most popular languages/technologies among developers?
- is it possible to extract different developer "archetypes" from this survey? Does it indicate bias among the survey respondents or in the profession as a whole?

We will leverage descriptive statistics as well as modeling to answer all these questions. We will analyze the results collected by Stack Overflow and made publicly available [here](https://insights.stackoverflow.com/survey).

In [1]:
# We start by loading the necessary libraries
import pandas as pd
import numpy as np
import zipfile
import os

## Loading the Data

In [2]:
# We first extract the files from the archive and then ingest them into pandas

# The context manager closes the archive even if extraction fails
with zipfile.ZipFile('developer_survey_2018.zip', 'r') as zip_ref:
    zip_ref.extractall(os.getcwd())

data_2018 = pd.read_csv('survey_results_public.csv')
schema_2018 = pd.read_csv('survey_results_schema.csv')



## Initial Assessment

In [3]:
print('The 2018 survey results have {} respondents and {} variables'.format(
    data_2018.shape[0], data_2018.shape[1]))

The 2018 survey results have 98855 respondents and 129 variables


In [7]:
# The schema will serve as a useful guide
schema_2018.head()

| | Column | QuestionText |
|---|---|---|
| 0 | Respondent | Randomized respondent ID number (not in order ... |
| 1 | Hobby | Do you code as a hobby? |
| 2 | OpenSource | Do you contribute to open source projects? |
| 3 | Country | In which country do you currently reside? |
| 4 | Student | Are you currently enrolled in a formal, degree... |
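Since the schema maps each column name to the question behind it, a small lookup helper can save scrolling later on. The sketch below is illustrative only: `question_for` is a hypothetical helper, not part of the original notebook, and the toy `schema` frame stands in for `schema_2018` with shortened question texts.

```python
import pandas as pd

# Toy stand-in for schema_2018, built from the first rows shown above
# (question texts are shortened here; the real file holds the full wording).
schema = pd.DataFrame({
    "Column": ["Respondent", "Hobby", "OpenSource", "Country"],
    "QuestionText": [
        "Randomized respondent ID number",
        "Do you code as a hobby?",
        "Do you contribute to open source projects?",
        "In which country do you currently reside?",
    ],
})


def question_for(column, schema_df):
    """Return the question text for a column name (hypothetical helper)."""
    lookup = schema_df.set_index("Column")["QuestionText"]
    # Series.get returns the default instead of raising on an unknown key
    return lookup.get(column, "<unknown column>")


hobby_question = question_for("Hobby", schema)
missing_question = question_for("NotAColumn", schema)
print(hobby_question)
```

With the real data, the same call would presumably take `schema_2018` as its second argument.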


In [12]:
# We can print out the data type of each column, since it will inform our analysis
data_2018.dtypes

Respondent                       int64
Hobby                           object
OpenSource                      object
Country                         object
Student                         object
Employment                      object
FormalEducation                 object
UndergradMajor                  object
CompanySize                     object
DevType                         object
YearsCoding                     object
YearsCodingProf                 object
JobSatisfaction                 object
CareerSatisfaction              object
HopeFiveYears                   object
JobSearchStatus                 object
LastNewJob                      object
AssessJob1                     float64
AssessJob2                     float64
AssessJob3                     float64
AssessJob4                     float64
AssessJob5                     float64
AssessJob6                     float64
AssessJob7                     float64
AssessJob8                     float64
AssessJob9               

The 2018 survey gathered responses from almost 100,000 people across 129 variables, which gives us ample material for the questions at hand. The data contains both numeric and categorical variables; the categorical ones will need to be encoded before we can feed them into machine learning models.
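To make the encoding step concrete, here is a minimal sketch using one-hot encoding via `pd.get_dummies`. It is one possible approach, not necessarily the one used later in this analysis. The column names are borrowed from the survey, but the values in `toy` are made up for illustration.

```python
import pandas as pd

# Toy data: column names come from the survey, the values are invented.
toy = pd.DataFrame({
    "Hobby": ["Yes", "No", "Yes"],
    "OpenSource": ["No", "No", "Yes"],
    "AssessJob1": [1.0, 5.0, 3.0],
})

# Treat every non-numeric column as categorical
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# One-hot encode; drop_first avoids a redundant column per variable
encoded = pd.get_dummies(toy, columns=categorical_cols, drop_first=True)

print(encoded.columns.tolist())
```

Numeric columns pass through unchanged, while each categorical column is replaced by indicator columns named `<column>_<category>`.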

## Data Preparation

### Missing Values

We first assess how prevalent missing values are, and then deal with them.
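As a rough sketch of what assessing and handling missing values could look like, the example below works on a small made-up frame. The 50% threshold, the median imputation, and the `"Missing"` label are all assumptions chosen for illustration, not decisions taken from this analysis.

```python
import numpy as np
import pandas as pd

# Toy data with deliberately missing entries (values are invented).
toy = pd.DataFrame({
    "JobSatisfaction": ["Satisfied", None, "Dissatisfied", "Satisfied"],
    "AssessJob1": [1.0, np.nan, 3.0, 5.0],
    "MostlyEmpty": [np.nan, np.nan, np.nan, 2.0],
})

# Share of missing values per column, worst offenders first
missing_share = toy.isnull().mean().sort_values(ascending=False)

# Drop columns whose missing share exceeds an (arbitrary) threshold
threshold = 0.5
reduced = toy.loc[:, toy.isnull().mean() <= threshold]

# Impute what remains: median for numeric columns, a label for the rest
filled = reduced.copy()
num_cols = filled.select_dtypes(include="number").columns
filled[num_cols] = filled[num_cols].fillna(filled[num_cols].median())
other_cols = filled.columns.difference(num_cols)
filled[other_cols] = filled[other_cols].fillna("Missing")

print(missing_share)
```

Whether dropping or imputing is appropriate depends on why the values are missing, so the threshold and fill strategy should be revisited for each variable.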

In [25]:
data_2013.isnull().mean().sort_values(ascending = False)

If you make a software product, how does your company make money? (You can choose more than one)    0.922201
Which of the following languages or technologies have you used significantly in the past year?      0.852509
Which technologies are you excited about?                                                           0.763420
Which technology products do you own? (You can choose more than one)                                                                 