# Stack Overflow Developer Survey EDA

Stack Overflow is a popular website where programmers ask questions about their code and receive answers from other programmers. It runs a yearly developer survey to better understand its community. What follows is an exploration of the survey dataset in an attempt to uncover patterns among the respondents.

In [7]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', None)

## First Look

In [11]:
dtype_spec = {
    'NEWJobHunt': 'str',
    'NEWJobHuntResearch': 'str',
    'NEWLearn': 'str'
}

dev = pd.read_csv('developer_dataset.csv', dtype=dtype_spec)
dev = dev.rename(columns={
    'NEWJobHunt': 'JobHunt',
    'NEWJobHuntResearch': 'JobHuntResearch',
    'NEWLearn': 'Learn'
})
dev.head()

Unnamed: 0,RespondentID,Year,Country,Employment,UndergradMajor,DevType,LanguageWorkedWith,LanguageDesireNextYear,DatabaseWorkedWith,DatabaseDesireNextYear,PlatformWorkedWith,PlatformDesireNextYear,Hobbyist,OrgSize,YearsCodePro,JobSeek,ConvertedComp,WorkWeekHrs,JobHunt,JobHuntResearch,Learn
0,1,2018,United States,Employed full-time,"Computer science, computer engineering, or sof...",Engineering manager;Full-stack developer,,,,,,,,,,,141000.0,,,,
1,1,2019,United States,Employed full-time,"Computer science, computer engineering, or sof...","Developer, full-stack",C;C++;C#;Python;SQL,C;C#;JavaScript;SQL,MySQL;SQLite,MySQL;SQLite,Linux;Windows,Linux;Windows,No,100 to 499 employees,1.0,I am not interested in new job opportunities,61000.0,80.0,,,
2,1,2020,United States,Employed full-time,"Computer science, computer engineering, or sof...",,HTML/CSS;Ruby;SQL,Java;Ruby;Scala,MySQL;PostgreSQL;Redis;SQLite,MySQL;PostgreSQL,AWS;Docker;Linux;MacOS;Windows,Docker;Google Cloud Platform;Heroku;Linux;Windows,Yes,,8.0,,,,,,Once a year
3,2,2018,United States,Employed full-time,"Computer science, computer engineering, or sof...",Full-stack developer,C#;JavaScript;SQL;TypeScript;HTML;CSS;Bash/Shell,C#;JavaScript;SQL;TypeScript;HTML;CSS;Bash/Shell,"SQL Server;Microsoft Azure (Tables, CosmosDB, ...","SQL Server;Microsoft Azure (Tables, CosmosDB, ...",Azure,Azure,,,4.0,,48000.0,,,,
4,2,2019,United States,Employed full-time,"Computer science, computer engineering, or sof...",Data or business analyst;Database administrato...,Bash/Shell/PowerShell;HTML/CSS;JavaScript;PHP;...,Bash/Shell/PowerShell;HTML/CSS;JavaScript;Rust...,Couchbase;DynamoDB;Firebase;MySQL,Firebase;MySQL;Redis,Android;AWS;Docker;IBM Cloud or Watson;iOS;Lin...,Android;AWS;Docker;IBM Cloud or Watson;Linux;S...,Yes,10 to 19 employees,8.0,I am not interested in new job opportunities,90000.0,40.0,,,


## Column Descriptions
- **RespondentID** - A unique ID given to every respondent. Because the survey is yearly, the same respondent can appear in several rows, one per year
- **Year** - The year in which the responses were given
- **Country** - Country in which the developer resides
- **Employment** - Employment status
- **UndergradMajor** - Undergraduate degree
- **DevType** - Developer's job title
- **LanguageWorkedWith** - Coding languages the developer uses
- **LanguageDesireNextYear** - Coding languages the developer would like to use next year
- **DatabaseWorkedWith** - Databases the developer uses
- **DatabaseDesireNextYear** - Databases the developer would like to use next year
- **PlatformWorkedWith** - Platforms the developer uses
- **PlatformDesireNextYear** - Platforms the developer would like to use next year
- **Hobbyist** - Whether the developer codes as a hobby
- **OrgSize** - Number of employees at the developer's organization
- **YearsCodePro** - Number of years the developer has coded professionally
- **JobSeek** - Whether the developer is seeking a new role
- **ConvertedComp** - Salary/compensation converted to USD
- **WorkWeekHrs** - Number of hours the developer works per week
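
As the preview above shows, several of these columns (for example **LanguageWorkedWith**) store multiple answers in one semicolon-separated string. A minimal sketch of how such a column could be split into per-option counts follows. The helper name `count_multiselect` and the toy data are hypothetical and exist only for illustration; with the real data one would pass a column such as `dev['LanguageWorkedWith']`.

```python
import pandas as pd


def count_multiselect(series: pd.Series, sep: str = ";") -> pd.Series:
    """Count how often each option appears in a delimiter-separated column."""
    return (
        series.dropna()        # skipped questions carry no options
        .str.split(sep)        # "C;C++" -> ["C", "C++"]
        .explode()             # one row per selected option
        .str.strip()           # guard against stray whitespace
        .value_counts()
    )


# Toy data for illustration only; these rows are NOT taken from the survey file.
toy = pd.Series(["C;C++;Python", "Python;SQL", None, "SQL"])
language_counts = count_multiselect(toy)
print(language_counts)
```

Counting individual options this way avoids treating every distinct combination of answers as its own category.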


## Missing Data
Most questions on the survey are optional, so a substantial amount of missing data is likely.

In [17]:
dev.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 111209 entries, 0 to 111208
Data columns (total 21 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   RespondentID            111209 non-null  int64  
 1   Year                    111209 non-null  int64  
 2   Country                 111209 non-null  object 
 3   Employment              109425 non-null  object 
 4   UndergradMajor          98453 non-null   object 
 5   DevType                 100433 non-null  object 
 6   LanguageWorkedWith      102018 non-null  object 
 7   LanguageDesireNextYear  96044 non-null   object 
 8   DatabaseWorkedWith      85859 non-null   object 
 9   DatabaseDesireNextYear  74234 non-null   object 
 10  PlatformWorkedWith      91609 non-null   object 
 11  PlatformDesireNextYear  85376 non-null   object 
 12  Hobbyist                68352 non-null   object 
 13  OrgSize                 54804 non-null   object 
 14  YearsCodePro        

In [18]:
missing_percentage = dev.isnull().mean().round(4) * 100
missing_percentage

RespondentID               0.00
Year                       0.00
Country                    0.00
Employment                 1.60
UndergradMajor            11.47
DevType                    9.69
LanguageWorkedWith         8.26
LanguageDesireNextYear    13.64
DatabaseWorkedWith        22.79
DatabaseDesireNextYear    33.25
PlatformWorkedWith        17.62
PlatformDesireNextYear    23.23
Hobbyist                  38.54
OrgSize                   50.72
YearsCodePro              14.76
JobSeek                   45.55
ConvertedComp             17.87
WorkWeekHrs               54.06
JobHunt                   82.80
JobHuntResearch           83.20
Learn                     78.22
dtype: float64

#### Notes
- **RespondentID, Year, Country** are the only columns without missing data; they were probably required by the survey
- In the remaining columns, the percentage of missing responses ranges from 1.6% to 83.2%
- **JobHunt, JobHuntResearch, Learn** have far more missing data than the other columns (roughly 78% to 83%) because they were not asked in every survey year. Since the percentage is so high, we drop them before further analysis
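
One way to check that a column is missing because the question was absent in some years, rather than skipped by respondents, is to compute the missing rate per survey year. The sketch below assumes a `Year` column like the one in `dev`; the helper name `missing_by_year` and the toy rows are hypothetical and do not reflect the actual survey values.

```python
import numpy as np
import pandas as pd


def missing_by_year(df: pd.DataFrame, year_col: str = "Year") -> pd.DataFrame:
    """Percentage of missing values in each column, broken down by year."""
    return (
        df.drop(columns=year_col)
        .isnull()                 # True where a value is missing
        .groupby(df[year_col])    # group the boolean mask by survey year
        .mean()                   # share of missing values per group
        .mul(100)
        .round(2)
    )


# Toy data for illustration only; NOT taken from the survey file.
toy = pd.DataFrame({
    "Year": [2018, 2018, 2019, 2019],
    "JobHunt": [np.nan, np.nan, "answer", np.nan],
    "Country": ["US", "US", "US", "US"],
})
by_year = missing_by_year(toy)
print(by_year)
```

A column that shows 100% missing in a given year was most likely not part of that year's questionnaire.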

In [21]:
dev = dev.drop(columns=['JobHunt', 'JobHuntResearch', 'Learn'])


## Summary Statistics

In [22]:
dev.describe(include='all')

Unnamed: 0,RespondentID,Year,Country,Employment,UndergradMajor,DevType,LanguageWorkedWith,LanguageDesireNextYear,DatabaseWorkedWith,DatabaseDesireNextYear,PlatformWorkedWith,PlatformDesireNextYear,Hobbyist,OrgSize,YearsCodePro,JobSeek,ConvertedComp,WorkWeekHrs
count,111209.0,111209.0,111209,109425,98453,100433,102018,96044,85859,74234,91609,85376,68352,54804,94793.0,60556,91333.0,51089.0
unique,,,9,7,19,15627,33037,36490,7936,8558,13600,16024,2,10,,3,,
top,,,United States,Employed full-time,"Computer science, computer engineering, or sof...","Developer, full-stack",C#;HTML/CSS;JavaScript;SQL,Python,MySQL,PostgreSQL,Windows,Linux,Yes,20 to 99 employees,,"I’m not actively looking, but I am open to new...",,
freq,,,53727,84707,60852,6814,1071,1263,5850,4474,4990,4343,54733,10516,,33943,,
mean,19262.039709,2018.854832,,,,,,,,,,,,,9.547045,,125177.7,41.05167
std,11767.011322,0.777503,,,,,,,,,,,,,7.548931,,246121.8,13.833929
min,1.0,2018.0,,,,,,,,,,,,,0.0,,0.0,1.0
25%,9268.0,2018.0,,,,,,,,,,,,,4.0,,46000.0,40.0
50%,18535.0,2019.0,,,,,,,,,,,,,8.0,,79000.0,40.0
75%,28347.0,2019.0,,,,,,,,,,,,,14.0,,120000.0,42.0


#### Notes
- Some of the categorical columns have very few unique responses while others have thousands, probably because some questions allowed a single choice while others allowed several answers, so that every distinct combination counts as a unique value
- The maximum **WorkWeekHrs** is 475, well above the 168 hours in a week. This should be explored
- The middle 50% of developers have between 4 and 14 years of professional experience, which may indicate that the field is still growing
- Responses came from 2018, 2019, and 2020
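
A possible first step for the **WorkWeekHrs** issue is to isolate the rows that report more hours than a week contains. This is only a sketch: the helper name `flag_impossible_hours` and the toy rows are hypothetical, and whether such rows should be dropped, capped, or kept is a separate decision.

```python
import pandas as pd

HOURS_PER_WEEK = 7 * 24  # 168 hours in a week


def flag_impossible_hours(df: pd.DataFrame, col: str = "WorkWeekHrs") -> pd.DataFrame:
    """Return the rows whose reported weekly hours exceed 168."""
    # Comparisons with NaN evaluate to False, so missing answers are not flagged.
    return df[df[col] > HOURS_PER_WEEK]


# Toy data for illustration only; NOT taken from the survey file.
toy = pd.DataFrame({
    "RespondentID": [1, 2, 3, 4],
    "WorkWeekHrs": [40.0, 475.0, None, 168.0],
})
suspect = flag_impossible_hours(toy)
print(suspect)
```

With the real data, `flag_impossible_hours(dev)` would list the affected respondents so their other answers can be inspected before deciding how to handle them.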