# Developer Survey Analysis

The goal of this analysis is to examine the 2018 survey results and gain relevant insights into the profession. Some questions we are interested in from the outset are:
- what are the determinants of job satisfaction?
- what are the most popular languages/technologies among developers?
- is it possible to extract different developer "archetypes" from this survey? Does it indicate bias in the respondents of the survey or in the profession as a whole?

We will use descriptive statistics as well as modeling to answer these questions, working from the results collected by Stack Overflow and made publicly available [here](https://insights.stackoverflow.com/survey).

In [1]:
# We start by loading the necessary libraries
import pandas as pd
import numpy as np
import zipfile
import os

## Loading the Data

In [2]:
# We first extract the files from the archive, then load them into pandas

# The context manager closes the archive even if extraction fails
with zipfile.ZipFile('developer_survey_2018.zip', 'r') as zip_ref:
    zip_ref.extractall(os.getcwd())

data_2018 = pd.read_csv('survey_results_public.csv')
schema_2018 = pd.read_csv('survey_results_schema.csv')
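
As an alternative sketch, a CSV file can be read straight from the archive without extracting it to disk. The helper name `read_csv_from_zip` is hypothetical, and using it here would assume that the archive member names match the file names used above.

```python
import zipfile

import pandas as pd


def read_csv_from_zip(archive_path, member_name, **read_csv_kwargs):
    """Load one CSV member of a zip archive into a DataFrame without extracting it."""
    with zipfile.ZipFile(archive_path, 'r') as archive:
        # ZipFile.open returns a binary file-like object that pandas can parse directly
        with archive.open(member_name) as csv_file:
            return pd.read_csv(csv_file, **read_csv_kwargs)


# Possible usage (assumes the archive is in the working directory):
# data_2018 = read_csv_from_zip('developer_survey_2018.zip', 'survey_results_public.csv')
```

This avoids leaving extracted copies in the working directory, at the cost of re-reading the archive each time the notebook is run.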



## Initial Assessment

In [21]:
print('The 2013 survey results have {} respondents and {} variables, '
      'while in 2017 these numbers were {} and {} respectively'.format(
          data_2013.shape[0], data_2013.shape[1],
          data_2017.shape[0], data_2017.shape[1]))

The 2013 survey results have 9743 respondents and 29 variables, while in 2017 these numbers were 51392 and 154 respectively


In [3]:
# We examine the column names of data_2013
# We notice that some features are unnamed

print(data_2013.columns)

Index(['What Country or Region do you live in?',
       'Which US State or Territory do you live in?', 'How old are you?',
       'How many years of IT/Programming experience do you have?',
       'How would you best describe the industry you currently work in?',
       'How many people work for your company?',
       'Which of the following best describes your occupation?',
       'Including yourself, how many developers are employed at your company?',
       'How large is the team that you work on?',
       'What other departments / roles do you interact with regularly?',
       ...
       'Unnamed: 118', 'Unnamed: 119', 'Unnamed: 120', 'Unnamed: 121',
       'What advertisers do you remember seeing on Stack Overflow?',
       'What is your current Stack Overflow reputation?',
       'How do you use Stack Overflow?', 'Unnamed: 125', 'Unnamed: 126',
       'Unnamed: 127'],
      dtype='object', length=128)


In [4]:
# We look at these columns further
# It appears they simply hold additional information from a previous question
# We will drop them, since they have a lot of missing values
# and probably do not capture the main information, which was in the first, named, column
data_2013[[col for col in data_2013.columns if "Unnamed" in col]]

# Collect the names of the unnamed columns, then drop them in one call
cols_to_drop = [col for col in data_2013.columns if "Unnamed" in col]

data_2013 = data_2013.drop(columns=cols_to_drop)
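
The claim that these columns are mostly empty can be checked directly before dropping them. The following is a minimal sketch on a small made-up frame; the `toy` data and the helper name `unnamed_missing_share` are illustrative and not taken from the survey.

```python
import numpy as np
import pandas as pd


def unnamed_missing_share(df):
    """Return the share of missing values in each column whose name contains 'Unnamed'."""
    unnamed_cols = [col for col in df.columns if "Unnamed" in str(col)]
    return df[unnamed_cols].isnull().mean().sort_values(ascending=False)


# Toy data standing in for data_2013 (hypothetical values)
toy = pd.DataFrame({
    "How old are you?": ["20-24", "25-29", "30-34", "25-29"],
    "Unnamed: 1": [np.nan, np.nan, np.nan, "x"],
    "Unnamed: 2": [np.nan, "y", np.nan, "z"],
})
shares = unnamed_missing_share(toy)
print(shares)
```

Applied to the real data before the drop, the same helper would show how sparse each unnamed column actually is.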

In [10]:
# We print out the first few rows for the 2017 survey results, as well as the schema
print(data_2017.head())
schema_2017.head()

   Respondent                                       Professional  \
0           1                                            Student   
1           2                                            Student   
2           3                             Professional developer   
3           4  Professional non-developer who sometimes write...   
4           5                             Professional developer   

                ProgramHobby         Country      University  \
0                  Yes, both   United States              No   
1                  Yes, both  United Kingdom  Yes, full-time   
2                  Yes, both  United Kingdom              No   
3                  Yes, both   United States              No   
4  Yes, I program as a hobby     Switzerland              No   

                         EmploymentStatus  \
0  Not employed, and not looking for work   
1                      Employed part-time   
2                      Employed full-time   
3                      Emp

|   | Column | Question |
|---|---|---|
| 0 | Respondent | Respondent ID number |
| 1 | Professional | Which of the following best describes you? |
| 2 | ProgramHobby | Do you program as a hobby or contribute to ope... |
| 3 | Country | In which country do you currently live? |
| 4 | University | Are you currently enrolled in a formal, degree... |


In [13]:
# Let's explore if any columns are in common
np.intersect1d(data_2013.columns.values, data_2017.columns.values)

array([], dtype=object)

As initially feared, the surveys do not have any column names in common, which will complicate the analysis. Two approaches are possible: we could map the column names onto a common schema and drop the columns that have no equivalent, or we could run the same analysis on each dataset separately and do the mapping when interpreting the results. We choose the latter, since it retains more information and allows for more flexibility. We keep in mind that, for graphs, we may have to rename elements to allow for comparison.
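
For reference, the first approach (which is also the kind of renaming needed for comparable graphs) could be sketched as follows. The helper `to_common_schema`, the mapping dictionaries, and the shared name `country` are hypothetical; the real correspondence between the two surveys would have to be worked out by hand. The frames below are toy stand-ins for `data_2013` and `data_2017`.

```python
import pandas as pd


def to_common_schema(df, mapping):
    """Keep only the columns listed in `mapping` and rename them to the common names."""
    present = [col for col in mapping if col in df.columns]
    return df[present].rename(columns=mapping)


# Hypothetical mappings from each survey's column names to shared names
mapping_2013 = {"What Country or Region do you live in?": "country"}
mapping_2017 = {"Country": "country"}

# Toy frames standing in for data_2013 and data_2017
toy_2013 = pd.DataFrame({
    "What Country or Region do you live in?": ["United States", "Germany"],
    "How old are you?": ["25-29", "30-34"],
})
toy_2017 = pd.DataFrame({
    "Country": ["United Kingdom", "Switzerland"],
    "ProgramHobby": ["Yes, both", "Yes, I program as a hobby"],
})

# Columns without an entry in the mapping are dropped; a year column tags the source
common_2013 = to_common_schema(toy_2013, mapping_2013).assign(year=2013)
common_2017 = to_common_schema(toy_2017, mapping_2017).assign(year=2017)
combined = pd.concat([common_2013, common_2017], ignore_index=True)
print(combined)
```

The sketch also makes the cost of this approach visible: every column without an entry in the mapping is lost.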

## Data Preparation

### Missing Values

We first measure the share of missing values in each column, then decide how to deal with them.
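
One common way to act on such an assessment is to drop the columns whose missing share exceeds a chosen cutoff. The sketch below assumes a cutoff of 0.75, which is an arbitrary illustration and not a value taken from this analysis; `drop_sparse_columns` is a hypothetical helper, and the `toy` frame is made up.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_missing_share=0.75):
    """Drop columns whose share of missing values is strictly above the cutoff."""
    missing_share = df.isnull().mean()
    keep = missing_share[missing_share <= max_missing_share].index
    return df[keep]


# Toy data (hypothetical values)
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, 1.0],  # 80% missing
    "partly_missing": [1.0, np.nan, 2.0, np.nan, 3.0],        # 40% missing
    "complete": [1, 2, 3, 4, 5],                              # 0% missing
})
cleaned = drop_sparse_columns(toy)
print(cleaned.columns.tolist())
```

Whether dropping is appropriate depends on the question: a sparse column may still be informative for the respondents who answered it.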

In [25]:
data_2013.isnull().mean().sort_values(ascending = False)

If you make a software product, how does your company make money? (You can choose more than one)    0.922201
Which of the following languages or technologies have you used significantly in the past year?      0.852509
Which technologies are you excited about?                                                           0.763420
Which technology products do you own? (You can choose more than one)