# Stack Overflow Survey Data Analysis

In this file, I intend to practice data analysis techniques by analyzing the data from the 2018 Stack Overflow Developer Survey.

The original data can be found [here](https://www.kaggle.com/stackoverflow/stack-overflow-2018-developer-survey).

To begin, let's read the data.

In [13]:
import unicodecsv

def read_csv(filename):
    with open(filename, 'rb') as f:
        reader = unicodecsv.DictReader(f)
        return list(reader)
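
On Python 3, the standard library's `csv` module reads Unicode text directly, so the same reader can be written without `unicodecsv`. The following is a minimal sketch, assuming the survey files are UTF-8 encoded; the name `read_csv_stdlib` is made up for illustration and is not used elsewhere in this notebook.

```python
import csv

def read_csv_stdlib(filename, encoding='utf-8'):
    # Open in text mode; newline='' lets the csv module handle line endings itself.
    with open(filename, newline='', encoding=encoding) as f:
        # Each row becomes a dict keyed by the header line.
        return list(csv.DictReader(f))
```

It would be called the same way as `read_csv`, for example `read_csv_stdlib('survey_results_public.csv')`.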

In [40]:
responses = read_csv('survey_results_public.csv')
schema_raw = read_csv('survey_results_schema.csv')

schema = {}
for row in schema_raw:
    schema[row['Column']] = row['QuestionText']

After exploring the schema and responses for a while, I came up with several questions I wanted answered, at least to begin with:
1. How do JobSearchStatus/HopeFiveYears correlate with job/career satisfaction? With country? With years coding?
2. How does job/career satisfaction correlate with company size?
3. How does employment correlate with years coding?
4. How do students compare to non-students in terms of job/career satisfaction?
5. How do open-source contributors compare to non-contributors in terms of job/career satisfaction?
6. How does open-source contribution correlate with years coding?
7. How do people with degrees in unrelated fields compare to those with related degrees across the other survey answers?
8. How does DevType relate to job/career satisfaction?
9. Are certain values of DevType more common among people with many years of coding?
10. Are there any non-hobbyists? How do they compare to hobbyists across the other survey answers?
11. How does country relate to formal education?

Let's begin with question 3: How does employment status correlate with years coding?
To make the initial question clearer and the responses more readable, we'll rephrase it like so:

**How does one's experience in coding, both professionally and generally, relate to one's employment status?**

In [41]:
respondents_by_employment_status = {'Employed part-time': [],
                                    'Employed full-time': [],
                                    'Independent contractor, freelancer, or self-employed': [],
                                    'Not employed, and not looking for work': [],
                                    'Not employed, but looking for work': [],
                                    'Retired': [],
                                    'NA': []}

for response in responses:
    employment = response['Employment']
    respondents_by_employment_status[employment].append(response)
    # possible alternative: keep only their IDs
    # respondents_by_employment_status[employment].append(response['Respondent'])
    
# respondents_by_employment_status['Employed full-time'][0]
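
The hard-coded keys above raise a `KeyError` if a response carries an employment value that is not listed. As a sketch of one alternative, `collections.defaultdict` creates each group the first time it is seen. The helper name `group_by_column` and the sample rows are illustrative only and are not taken from the survey data.

```python
from collections import defaultdict

def group_by_column(rows, column):
    # Map each distinct value of `column` to the list of rows that carry it.
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)
    # Return a plain dict so that later lookups of unknown keys still fail loudly.
    return dict(groups)

# Made-up rows, shaped like the dicts returned by read_csv.
sample = [
    {'Respondent': '1', 'Employment': 'Employed full-time'},
    {'Respondent': '2', 'Employment': 'Retired'},
    {'Respondent': '3', 'Employment': 'Employed full-time'},
]
grouped = group_by_column(sample, 'Employment')
print({status: len(rows) for status, rows in grouped.items()})
```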

In [46]:
def get_years_coding_dict():
    ## deprecated
    return {'0-2 years': 0,
           '3-5 years': 0,
           '6-8 years': 0,
           '9-11 years': 0,
           '12-14 years': 0,
           '15-17 years': 0,
           '18-20 years': 0,
           '21-23 years': 0,
           '24-26 years': 0,
           '27-29 years': 0,
           '30 or more years': 0,
           'NA': 0}
    
def get_dict_keys_by_column(column: str) -> dict:
    column_values = {response[column] for response in responses}
    return dict.fromkeys(column_values, 0)
    
def get_years_coding_by_employment_dict():
    return {'Employed part-time': get_dict_keys_by_column('YearsCoding'),
              'Employed full-time': get_dict_keys_by_column('YearsCoding'),
              'Independent contractor, freelancer, or self-employed': get_dict_keys_by_column('YearsCoding'),
              'Not employed, and not looking for work': get_dict_keys_by_column('YearsCoding'),
              'Not employed, but looking for work': get_dict_keys_by_column('YearsCoding'),
              'Retired': get_dict_keys_by_column('YearsCoding'),
              'NA': get_dict_keys_by_column('YearsCoding')}
    # TODO make generic
    
def get_dict_of_dicts_by_columns(column: str, nested_column: str) -> dict:
    column_values = {response[column] for response in responses}
    return {value: get_dict_keys_by_column(nested_column) for value in column_values}
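
One likely source of unpredictable ordering in the resulting tables is that `get_dict_keys_by_column` collects the values into a set, and sets have no defined order. The sketch below keeps the values in a predictable order instead. The function name `ordered_zero_counts`, its `preferred_order` parameter, and the sample rows are assumptions made for illustration.

```python
def ordered_zero_counts(rows, column, preferred_order=None):
    # Distinct values in first-seen order (dicts preserve insertion order in Python 3.7+).
    seen = list(dict.fromkeys(row[column] for row in rows))
    if preferred_order is not None:
        # Known labels first, in the requested order; unexpected labels go last.
        known = [value for value in preferred_order if value in seen]
        extra = [value for value in seen if value not in preferred_order]
        seen = known + extra
    return dict.fromkeys(seen, 0)

# Made-up rows for illustration.
sample = [{'YearsCoding': '3-5 years'}, {'YearsCoding': '0-2 years'},
          {'YearsCoding': '3-5 years'}, {'YearsCoding': 'NA'}]
order = ['0-2 years', '3-5 years', '6-8 years', 'NA']
print(list(ordered_zero_counts(sample, 'YearsCoding', order)))
```

Labels listed in `preferred_order` but absent from the data (here `'6-8 years'`) are skipped, which matches the behavior of building the keys from the responses themselves.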

In [47]:
# years_coding_by_employment = get_years_coding_by_employment_dict()
# years_coding_prof_by_employment = get_years_coding_by_employment_dict()
years_coding_by_employment = get_dict_of_dicts_by_columns('Employment', 'YearsCoding')
# Professional experience lives in its own survey column, 'YearsCodingProf'.
years_coding_prof_by_employment = get_dict_of_dicts_by_columns('Employment', 'YearsCodingProf')
# TODO RIGHT NOW: second method works but jumbles the result columns (as seen in sheet 2 of temp.csv) - find out why!!!!
#                 if can't, reindex the columns


# TODO move to function:
# Single pass over the responses, updating both tables for each respondent.
for response in responses:
    status = response['Employment']
    years_coding_by_employment[status][response['YearsCoding']] += 1
    years_coding_prof_by_employment[status][response['YearsCodingProf']] += 1
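
The counting above amounts to tallying how often each (employment status, years coding) pair occurs. As a sketch of the same idea using only the standard library, `collections.Counter` can count the pairs in one pass without pre-building the nested dicts. The helper name `count_pairs` and the sample rows are made up for illustration.

```python
from collections import Counter

def count_pairs(rows, outer_column, inner_column):
    # Count every (outer value, inner value) combination in a single pass.
    return Counter((row[outer_column], row[inner_column]) for row in rows)

# Made-up rows for illustration.
sample = [
    {'Employment': 'Employed full-time', 'YearsCoding': '3-5 years'},
    {'Employment': 'Employed full-time', 'YearsCoding': '0-2 years'},
    {'Employment': 'Retired', 'YearsCoding': '3-5 years'},
    {'Employment': 'Employed full-time', 'YearsCoding': '3-5 years'},
]
pair_counts = count_pairs(sample, 'Employment', 'YearsCoding')
print(pair_counts[('Employed full-time', '3-5 years')])
```

A `Counter` returns 0 for pairs that never occur, so missing combinations need no special handling.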

In [48]:
import pandas as pd

print("Years coding by employment status of developer:")
row_labels = ['0-2 years', '3-5 years', '6-8 years', '9-11 years', '12-14 years',
              '15-17 years', '18-20 years', '21-23 years', '24-26 years', '27-29 years', '30 or more years', 'NA']

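frame = pd.DataFrame(years_coding_by_employment)
# reindex returns a new DataFrame with rows in the order of row_labels;
# as the last expression in the cell, it is what the notebook displays.
frame.reindex(index=row_labels)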
# pd.DataFrame.from_dict(years_coding_by_employment, orient='index')

Years coding by employment status of developer:


| YearsCoding | Employed full-time | Independent contractor, freelancer, or self-employed | Not employed, and not looking for work | Not employed, but looking for work | Retired | Employed part-time | NA |
|---|---|---|---|---|---|---|---|
| 0-2 years | 5362 | 763 | 1283 | 2005 | 32 | 795 | 442 |
| 3-5 years | 15419 | 1750 | 1596 | 2053 | 11 | 2008 | 476 |
| 6-8 years | 15009 | 1481 | 732 | 750 | 11 | 1136 | 219 |
| 9-11 years | 9976 | 1121 | 194 | 288 | 4 | 496 | 90 |
| 12-14 years | 6666 | 884 | 69 | 135 | 4 | 242 | 30 |
| 15-17 years | 5024 | 793 | 27 | 92 | 6 | 146 | 29 |
| 18-20 years | 4095 | 679 | 39 | 104 | 13 | 113 | 29 |
| 21-23 years | 2132 | 358 | 12 | 70 | 7 | 56 | 13 |
| 24-26 years | 1477 | 296 | 7 | 35 | 5 | 35 | 7 |
| 27-29 years | 820 | 173 | 12 | 21 | 3 | 21 | 10 |
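
Raw counts are dominated by the full-time group, so comparing groups may be easier with each column expressed as a share of its own total. The sketch below shows the pandas mechanics on a two-row, two-column excerpt of the counts above. Because it is only an excerpt, the resulting shares are relative to the excerpt and say nothing about the full survey.

```python
import pandas as pd

# Excerpt of the counts table: two experience buckets, two employment groups.
excerpt = pd.DataFrame(
    {'Employed full-time': [5362, 15419], 'Retired': [32, 11]},
    index=['0-2 years', '3-5 years'],
)
# Divide every column by its own total so that each column sums to 1.
shares = excerpt.div(excerpt.sum(axis=0), axis=1)
print(shares.round(3))
```

Applied to the full `frame`, the same two calls would give the distribution of coding experience within each employment status.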


In [None]:
### deprecated for now: switched to a DataFrame (above) for a more organized display ###

# %matplotlib inline
# import matplotlib.pyplot as plt
# import numpy as np
# import seaborn as sns

# sns.set()

# employment_statuses = list(respondents_by_employment_status.keys())
# indices = np.arange(len(employment_statuses))  # the x locations for the statuses
# # width = 0.35  # the width of the bars
# width = np.min(np.diff(indices))/3  # the width of the bars

# fig, ax = plt.subplots(figsize=(20,12))
# # ax.bar(employment_statuses, list(years_coding_by_employment.values()), width=width, label='Years coding')
# ax.bar(employment_statuses, list(years_coding_by_employment), width=width, label='Years coding')
# ax.set_yticklabels(list(years_coding.keys()))
# # ax.bar(indices - width, list(years_coding_prof_by_employment.values()), width=width, label='Years coding professionally')
# ax.legend()

# plt.setp(ax.get_xticklabels(), rotation=10, horizontalalignment='right')
# plt.show()

# # TODO right now: 
# #    * set y labels correctly ('0-2 years' is the first, not the 0 one)
# #    * create list containing the actual years coding by employment status and replace list(years_coding_by_employment.values()) with it
# # TODO later: Make y-axis exponential. Add axis labels. Add title.