# Stack Overflow Survey Data Analysis

In this file, I intend to practice data analysis techniques by analyzing the data from the 2018 Stack Overflow Developer Survey.

The original data can be found [here](https://www.kaggle.com/stackoverflow/stack-overflow-2018-developer-survey).

To begin, let's read the data.

In [2]:
import unicodecsv

def read_csv(filename):
    with open(filename, 'rb') as f:
        reader = unicodecsv.DictReader(f)
        return list(reader)

In [3]:
responses = read_csv('survey_results_public.csv')
schema_raw = read_csv('survey_results_schema.csv')
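
As a side note, on Python 3 the same reading step can be done with the standard library alone. The following is a minimal sketch, assuming the files are UTF-8 encoded; `read_csv_stdlib` is a hypothetical name, not part of the original notebook.

```python
import csv

def read_csv_stdlib(filename, encoding="utf-8"):
    """Read a CSV file into a list of dicts, one per row (hypothetical helper)."""
    # newline="" lets the csv module handle line breaks inside quoted fields itself.
    with open(filename, newline="", encoding=encoding) as f:
        return list(csv.DictReader(f))
```

Under that encoding assumption it could stand in for `read_csv`, e.g. `read_csv_stdlib('survey_results_schema.csv')`.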

In [5]:
schema = {}
for row in schema_raw:
    schema[row['Column']] = row['QuestionText']
    
print("The schema of the gathered data:")
print()
for key in schema.keys():
    print("Question key: {}".format(key))
    print("\tText: {}".format(schema[key]))

The schema of the gathered data:

Question key: Respondent
	Text: Randomized respondent ID number (not in order of survey response time)
Question key: Hobby
	Text: Do you code as a hobby?
Question key: OpenSource
	Text: Do you contribute to open source projects?
Question key: Country
	Text: In which country do you currently reside?
Question key: Student
	Text: Are you currently enrolled in a formal, degree-granting college or university program?
Question key: Employment
	Text: Which of the following best describes your current employment status?
Question key: FormalEducation
	Text: Which of the following best describes the highest level of formal education that you’ve completed?
Question key: UndergradMajor
	Text: You previously indicated that you went to a college or university. Which of the following best describes your main field of study (aka 'major')
Question key: CompanySize
	Text: Approximately how many people are employed by the company or organization you work for?
Question ke

After delving into the schema and responses for a while, I came up with several questions I wanted answered, at least to begin with:
1. How do JobSearchStatus/HopeFiveYears correlate to job/career satisfaction? to country? to years coding?
2. How does job/career satisfaction correlate to company size?
3. How does employment correlate to years coding?
4. How do students compare to non-students in terms of job/career satisfaction?
5. How do open-source coders compare to non-os in terms of job/career satisfaction?
6. How does open-source correlate to years coding?
7. How do people with non-related degrees compare to regulars in pretty much everything?
8. How does DevType relate to job/career satisfaction?
9. Are certain values of DevType taken by people with a lot of years coding?
10. Are there any non-hobbyists? How do they compare to hobbyists in pretty much everything?
11. How does country relate to formal education?
12. How do non-hobbyists compare in terms of job/career satisfaction?

Let's begin with question #3: "How does employment status correlate to years coding?"  
To make the initial question clearer and the responses more readable, we'll rephrase it like so:

**How does one's experience in coding, both professionally and generally, relate to one's employment status?**

In [6]:
## helper functions ##

def get_dict_keys_by_column(column: str) -> dict:
    column_values = {response[column] for response in responses}
    return dict.fromkeys(column_values, 0)
    
def get_dict_of_dicts_by_columns(column: str, nested_column: str) -> dict:
    column_values = {response[column] for response in responses}
    return {value: get_dict_keys_by_column(nested_column) for value in column_values}

In [7]:
years_coding_by_employment = get_dict_of_dicts_by_columns('Employment', 'YearsCoding')
years_coding_prof_by_employment = get_dict_of_dicts_by_columns('Employment', 'YearsCodingProf')


# TODO move to function:
for status in years_coding_by_employment.keys():
    for response in [resp for resp in responses if resp['Employment'] == status]: 
        resp_years_coding = response['YearsCoding']
        years_coding_by_employment[status][resp_years_coding] += 1
        
for status in years_coding_prof_by_employment.keys():
    for response in [resp for resp in responses if resp['Employment'] == status]: 
        resp_years_coding = response['YearsCodingProf']
        years_coding_prof_by_employment[status][resp_years_coding] += 1
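
The two counting loops above could be folded into one function, as the TODO suggests. Here is a rough sketch of one way to do it in a single pass over the rows; `count_by_columns` is a hypothetical name and the sample rows are made up purely to show the shape of the result.

```python
from collections import Counter, defaultdict

def count_by_columns(rows, column, nested_column):
    """Count nested_column values per value of column in one pass (hypothetical helper)."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[column]][row[nested_column]] += 1
    # Convert to plain dicts so the result can be handed to pd.DataFrame as before.
    return {key: dict(counter) for key, counter in counts.items()}

# Tiny made-up sample, only to illustrate the output structure.
sample = [
    {"Employment": "Employed full-time", "YearsCoding": "3-5 years"},
    {"Employment": "Employed full-time", "YearsCoding": "3-5 years"},
    {"Employment": "Retired", "YearsCoding": "30 or more years"},
]
sample_counts = count_by_columns(sample, "Employment", "YearsCoding")
print(sample_counts)
```

Unlike the pre-filled dictionaries above, combinations that never occur are simply absent here, so a DataFrame built from the result would probably need `.fillna(0)`.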

In [32]:
import pandas as pd

print("Years coding by employment status of developer:")
# TODO: Find less hard-coded way to explicitly reindex dataframe
row_labels = ['0-2 years', '3-5 years', '6-8 years', '9-11 years', '12-14 years',
              '15-17 years', '18-20 years', '21-23 years', '24-26 years', '27-29 years', '30 or more years', 'NA']
column_labels = ['Employed part-time', 'Employed full-time', 'Independent contractor, freelancer, or self-employed',
                 'Not employed, but looking for work', 'Not employed, and not looking for work', 'Retired', 'NA']

# YBE = Years by Employment
frame_ybe = pd.DataFrame(years_coding_by_employment)
frame_ybe.reindex(index=row_labels, columns=column_labels)
# pd.DataFrame.from_dict(years_coding_by_employment, orient='index')

Years coding by employment status of developer:


| Years coding | Employed part-time | Employed full-time | Independent contractor, freelancer, or self-employed | Not employed, but looking for work | Not employed, and not looking for work | Retired | NA |
|---|---|---|---|---|---|---|---|
| 0-2 years | 795 | 5362 | 763 | 2005 | 1283 | 32 | 442 |
| 3-5 years | 2008 | 15419 | 1750 | 2053 | 1596 | 11 | 476 |
| 6-8 years | 1136 | 15009 | 1481 | 750 | 732 | 11 | 219 |
| 9-11 years | 496 | 9976 | 1121 | 288 | 194 | 4 | 90 |
| 12-14 years | 242 | 6666 | 884 | 135 | 69 | 4 | 30 |
| 15-17 years | 146 | 5024 | 793 | 92 | 27 | 6 | 29 |
| 18-20 years | 113 | 4095 | 679 | 104 | 39 | 13 | 29 |
| 21-23 years | 56 | 2132 | 358 | 70 | 12 | 7 | 13 |
| 24-26 years | 35 | 1477 | 296 | 35 | 7 | 5 | 7 |
| 27-29 years | 21 | 820 | 173 | 21 | 12 | 3 | 10 |
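
Regarding the TODO about the hard-coded `row_labels`: one possible approach is to derive the row order from the labels themselves by sorting on their leading number. This is only a sketch, assuming every label is either `'NA'` or starts with a number, as in the list above; `years_label_sort_key` is a hypothetical helper.

```python
def years_label_sort_key(label):
    """Sort key for labels such as '0-2 years' or '30 or more years'; 'NA' sorts last."""
    if label == "NA":
        return float("inf")
    # The leading number is whatever comes before the first '-' or space.
    leading = label.replace("-", " ").split()[0]
    return int(leading)

labels = ["12-14 years", "NA", "0-2 years", "30 or more years", "3-5 years"]
ordered = sorted(labels, key=years_label_sort_key)
print(ordered)
```

With the frame from the cell above, this could be used as `frame_ybe.reindex(index=sorted(frame_ybe.index, key=years_label_sort_key))`.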


In [10]:
print("Years coding professionally by employment status of developer:")

frame_ybe_prof = pd.DataFrame(years_coding_prof_by_employment).sort_index()
frame_ybe_prof.reindex(index=row_labels, columns=column_labels)

Years coding professionally by employment status of developer:


| Years coding professionally | Employed part-time | Employed full-time | Independent contractor, freelancer, or self-employed | Not employed, but looking for work | Not employed, and not looking for work | Retired | NA |
|---|---|---|---|---|---|---|---|
| 0-2 years | 2278 | 14234 | 1824 | 2801 | 1762 | 15 | 507 |
| 3-5 years | 970 | 17850 | 1614 | 541 | 229 | 7 | 151 |
| 6-8 years | 267 | 9768 | 1080 | 165 | 47 | 5 | 53 |
| 9-11 years | 108 | 6393 | 928 | 81 | 37 | 8 | 18 |
| 12-14 years | 84 | 3549 | 570 | 48 | 14 | 6 | 16 |
| 15-17 years | 50 | 2418 | 487 | 38 | 9 | 4 | 6 |
| 18-20 years | 41 | 2275 | 457 | 38 | 6 | 7 | 6 |
| 21-23 years | 21 | 1082 | 237 | 15 | 5 | 2 | 6 |
| 24-26 years | 12 | 621 | 187 | 19 | 6 | 5 | 7 |
| 27-29 years | 6 | 382 | 92 | 12 | 3 | 6 | 5 |


Some interesting points that pop out:
1. As could be expected, most of the respondents are employed full-time. 
2. The greatest concentration of full-time-working respondents is between 0-8 years of coding professionally and 3-11 years of coding in general.  
Relatively few full-time employees have 0-2 years of experience coding in general.
3. Comparing the "Not employed, but looking for work" columns of the two tables, there's a drastic drop in respondents with more than 2 years of professional experience (2801 at 0-2 years versus 541 at 3-5 years).  
Meanwhile, in the general-experience table, a similar drop-off shows up for respondents who have over 9 years of coding experience!

Let's zoom in on point #2:
* One possible conclusion is that to be accepted to a job where you can get professional coding experience, you need at least some non-professional coding experience.
* It seems that as years of experience increase, fewer and fewer respondents are full-time employees. That could mean a number of things:
   * Many could still be full-time employees in jobs that don't involve coding (or at least not full-time coding). They may have risen through the hierarchy of their workplaces to managerial or more business-related jobs.
   * Spending ~10 years coding might wear one out and lead them to seek another vocation, for any number of reasons: tediousness, a high-stress workplace, etc.
   * It could be that, as rumor has it, at the ripe old age of not-that-very-old, the IT industry prefers to promote younger and fresher programmers just out of university rather than hold on to its aging employees.
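
Since the total number of respondents also shrinks with experience, it may help to look at the share of full-time employees within each experience bracket, not just the raw counts. The sketch below does that for a few rows copied from the "years coding in general" table above; the variable names are hypothetical and the row selection is arbitrary.

```python
# Counts copied from the "years coding in general" table above.
# Order within each list: part-time, full-time, independent, looking,
# not looking, retired, NA.
counts_by_years = {
    "0-2 years": [795, 5362, 763, 2005, 1283, 32, 442],
    "3-5 years": [2008, 15419, 1750, 2053, 1596, 11, 476],
    "9-11 years": [496, 9976, 1121, 288, 194, 4, 90],
    "27-29 years": [21, 820, 173, 21, 12, 3, 10],
}
FULL_TIME = 1  # position of 'Employed full-time' in each list

# Percentage of each bracket's respondents who are employed full-time.
full_time_share = {
    years: row[FULL_TIME] / sum(row) * 100
    for years, row in counts_by_years.items()
}
for years, share in full_time_share.items():
    print(f"{years}: {share:.1f}% employed full-time")
```

Comparing these shares against the raw counts gives a rough sense of how much of the decline comes from there simply being fewer respondents in the higher brackets.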

Now let's shift our focus to point #3:
* It seems that having even a minimal 2 years of professional experience is enough to make finding other jobs (or at least, not being unemployed and searching for one) a lot easier.
* On the other hand, the same drop-off in unemployed-and-looking respondents occurs only at almost 10 years of general coding experience! We can't know from the data whether some of it *is* professional, but it seems that non-professional experience doesn't count for much when looking for a job unless you have a lot of it.
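
To put rough numbers on these drop-offs, the bracket-to-bracket change in the "Not employed, but looking for work" column can be computed directly. This is a small sketch using counts copied from the first five rows of the two tables above; `relative_changes` is a hypothetical helper.

```python
# "Not employed, but looking for work" counts, copied from the two tables above.
brackets = ["0-2 years", "3-5 years", "6-8 years", "9-11 years", "12-14 years"]
looking_general = [2005, 2053, 750, 288, 135]
looking_professional = [2801, 541, 165, 81, 48]

def relative_changes(counts):
    """Percentage change from each bracket to the next (negative means a drop)."""
    return [(after - before) / before * 100
            for before, after in zip(counts, counts[1:])]

for name, counts in [("general", looking_general),
                     ("professional", looking_professional)]:
    for start, end, change in zip(brackets, brackets[1:], relative_changes(counts)):
        print(f"{name}: {start} -> {end}: {change:+.1f}%")
```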

In [11]:
### DEPRECATED for now, having changed to dataframe (above) for more organized data display method ###

# respondents_by_employment_status = {'Employed part-time': [],
#                                     'Employed full-time': [],
#                                     'Independent contractor, freelancer, or self-employed': [],
#                                     'Not employed, and not looking for work': [],
#                                     'Not employed, but looking for work': [],
#                                     'Retired': [],
#                                     'NA': []}

# for response in responses:
#     employment = response['Employment']
#     respondents_by_employment_status[employment].append(response)
#     # possible alternative: only keeping their id's
#     # respondents_by_employment_status[employment].append(response['Respondent'])



# %matplotlib inline
# import matplotlib.pyplot as plt
# import numpy as np
# import seaborn as sns

# sns.set()

# employment_statuses = list(respondents_by_employment_status.keys())
# indices = np.arange(len(employment_statuses))  # the x locations for the statuses
# # width = 0.35  # the width of the bars
# width = np.min(np.diff(indices))/3  # the width of the bars

# fig, ax = plt.subplots(figsize=(20,12))
# # ax.bar(employment_statuses, list(years_coding_by_employment.values()), width=width, label='Years coding')
# ax.bar(employment_statuses, list(years_coding_by_employment, width=width, label='Years coding')
# ax.set_yticklabels(list(years_coding.keys()))
# # ax.bar(indices - width, list(years_coding_prof_by_employment.values()), width=width, label='Years coding professionally')
# ax.legend()

# plt.setp(ax.get_xticklabels(), rotation=10, horizontalalignment='right')
# plt.show()

# # TODO right now: 
# #    * set y labels correctly ('0-2 years' is the first, not the 0 one)
# #    * create list containing the actual years coding by employment status and replace list(years_coding_by_employment.values()) with it
# # TODO later: Make y-axis exponential. Add axis labels. Add title.

Graphs are cool and look highly professional. Thus:

In [12]:
 #TODO display above data as graph
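
One way the graph might look is a grouped bar chart with a logarithmic y-axis, as the earlier TODO notes suggest. The sketch below assumes pandas and matplotlib are installed and, to stay short, uses only a few rows and columns copied from the "years coding" table above; in the notebook, the full `frame_ybe` would presumably take the place of `frame`.

```python
import matplotlib.pyplot as plt
import pandas as pd

# A small excerpt of the "years coding" table, copied by hand for illustration.
frame = pd.DataFrame(
    {
        "Employed full-time": [5362, 15419, 15009],
        "Not employed, but looking for work": [2005, 2053, 750],
    },
    index=["0-2 years", "3-5 years", "6-8 years"],
)

fig, ax = plt.subplots(figsize=(10, 6))
# logy=True puts the counts on a logarithmic axis so small groups stay visible.
frame.plot(kind="bar", ax=ax, logy=True)
ax.set_xlabel("Years coding")
ax.set_ylabel("Respondents (log scale)")
ax.set_title("Years coding by employment status")
plt.setp(ax.get_xticklabels(), rotation=10, horizontalalignment="right")
```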

Let's move on to question #12: "How do non-hobbyists compare in terms of job/career satisfaction?", which we'll rephrase like so:

**How does career satisfaction correlate to coding as a hobby?**

In [33]:
career_satisfaction_by_hobby = get_dict_of_dicts_by_columns('Hobby', 'CareerSatisfaction')

for key in career_satisfaction_by_hobby.keys():
    for resp in [response for response in responses if response['Hobby'] == key]:
        career_satisfaction = resp['CareerSatisfaction']
        career_satisfaction_by_hobby[key][career_satisfaction] += 1
        
row_labels = ['Extremely satisfied', 'Moderately satisfied', 'Slightly satisfied', 'Neither satisfied nor dissatisfied',
              'Slightly dissatisfied', 'Moderately dissatisfied', 'Extremely dissatisfied', 'NA']

# CSH = Career Satisfaction by Hobby
frame_csh = pd.DataFrame(career_satisfaction_by_hobby)
frame_csh.reindex(index=row_labels)

| Career satisfaction | Yes | No |
|---|---|---|
| Extremely satisfied | 12113 | 2203 |
| Moderately satisfied | 22965 | 4961 |
| Slightly satisfied | 10799 | 2685 |
| Neither satisfied nor dissatisfied | 5091 | 1225 |
| Slightly dissatisfied | 5111 | 1476 |
| Moderately dissatisfied | 4125 | 1137 |
| Extremely dissatisfied | 2075 | 538 |
| NA | 17618 | 4733 |


We can see far more respondents in the 'Yes' hobby column, which is to be expected: many programmers have an affinity for programming even before choosing it as their career path (or foundation).

To make the comparison clearer, we'll standardize the results by expressing each column as percentages of its own total.

In [42]:
from decimal import Decimal

# Total respondents per hobby answer; used as the denominator for each column.
total_respondents = {hobby_key: sum(counts.values())
                     for hobby_key, counts in career_satisfaction_by_hobby.items()}

# Build a new dict instead of overwriting the counts, so this cell can be re-run safely.
career_satisfaction_pct_by_hobby = {
    hobby_key: {satisfaction_key: round(Decimal(count * 100 / total_respondents[hobby_key]), 2)
                for satisfaction_key, count in counts.items()}
    for hobby_key, counts in career_satisfaction_by_hobby.items()
}

print("Results, each column standardized as percentage of total respondents in column:")
frame_csh_updated = pd.DataFrame(career_satisfaction_pct_by_hobby)
frame_csh_updated.reindex(index=row_labels)

Results, each column standardized as percentage of total respondents in column:


| Career satisfaction | Yes (%) | No (%) |
|---|---|---|
| Extremely satisfied | 15.16 | 11.62 |
| Moderately satisfied | 28.74 | 26.17 |
| Slightly satisfied | 13.52 | 14.16 |
| Neither satisfied nor dissatisfied | 6.37 | 6.46 |
| Slightly dissatisfied | 6.4 | 7.79 |
| Moderately dissatisfied | 5.16 | 6.0 |
| Extremely dissatisfied | 2.6 | 2.84 |
| NA | 22.05 | 24.97 |


In [None]:
# TODO RIGHT NOW: Add 'difference' column to frame, to easily analyze the results
# useful links: https://www.one-tab.com/page/fjiTSujlQmag0Ac950ONmQ
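
A possible way to tackle this TODO is to do the normalization in pandas and add the difference as a third column. The sketch below rebuilds the counts by hand from the career-satisfaction table above so that it runs on its own; in the notebook, `frame_csh` would presumably be used in place of `counts`, and the `Difference` column name is my own choice.

```python
import pandas as pd

# Counts copied from the career-satisfaction-by-hobby table above.
counts = pd.DataFrame(
    {
        "Yes": [12113, 22965, 10799, 5091, 5111, 4125, 2075, 17618],
        "No": [2203, 4961, 2685, 1225, 1476, 1137, 538, 4733],
    },
    index=["Extremely satisfied", "Moderately satisfied", "Slightly satisfied",
           "Neither satisfied nor dissatisfied", "Slightly dissatisfied",
           "Moderately dissatisfied", "Extremely dissatisfied", "NA"],
)

# Dividing by counts.sum() scales each column by its own total.
percent = counts / counts.sum() * 100
# Positive values mean the answer is more common among hobbyists.
percent["Difference"] = percent["Yes"] - percent["No"]
print(percent.round(2))
```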