### Performing exploratory data analysis on data from the StakeOverFlow developer survey from the years 2018 to 2021

##### Table of Contents

Phase 1 : Extracting the data
- Exploring the data
- Data Modelling
        
Phase 2 : Transforming the data
- Cleaning the data
- Merging dataframes
- Cleaning the new dataframe

Phase 3 : Loading the data
- Loading in the data into a database

#### Extracting the data


While writing this notebook, the data was stored locally in my computer therefore to follow along with this note book you will need to download the data from here : https://insights.stackoverflow.com/survey/



In [1]:
import pandas as pd
import numpy as np

#setting the maximum display for the notebook cells 
pd.set_option('display.max_rows', 48)
pd.set_option('display.max_columns', 48)

#removing annoying warnings 
pd.options.mode.chained_assignment = None  # default='warn'

In [2]:
# loading the data from csv files
df_2018 = pd.read_csv("..\\Data\\survey_results_public_2018.csv", low_memory=False)
df_2019 = pd.read_csv("..\\Data\\survey_results_public_2019.csv", low_memory=False)
df_2020 = pd.read_csv("..\\Data\\survey_results_public_2020.csv", low_memory=False)
df_2021 = pd.read_csv("..\\Data\\survey_results_public_2021.csv", low_memory=False)

##### Exploring the data

In [3]:
# The size of the different datasets
print(f"df_2018 has shape : {df_2018.shape}")
print(f"df_2019 has shape : {df_2019.shape}")
print(f"df_2020 has shape : {df_2020.shape}")
print(f"df_2021 has shape : {df_2021.shape}")

df_2018 has shape : (98855, 129)
df_2019 has shape : (88883, 85)
df_2020 has shape : (64461, 61)
df_2021 has shape : (83439, 48)


#### Observation - *The number of survey participants has been decreasing from 2018 to 2020 even though the survey length is becoming shorter. Except for the year 2021 which has seen an increase in survey participant and decreasing survey length*

##### Since the 2021 dataset has fewer columns, it is the primary dataset and a subsest of the other datasets will be taken in order to answer the following questions:

#### Question 1. *How much impact has the pandemic had on developer's choices of tech stack?*
This is question is broken down to three parts namely:

##### Question 1.1 Which programming languages have gained popularity from 2018 to 2021?
columns needed : ResponseId, LanguageHaveWorkedWith, LanguageWantToWorkWith

##### Question 1.2 Which database services have gained popularity from 2018 to 2021?
columns needed : ResponseId, DatabaseHaveWorkedWith, DatabaseWantToWorkWith

##### Question 1.3 Which platforms and frameworks have gained popularity from 2018 to 2021?
columns needed : ResponseId, WebframeHaveWorkedWith, WebframeWantToWorkWith, MiscTechHaveWorkedWith,    MiscTechWantToWorkWith, PlatformHaveWorkedWith, PlatformWantToWorkWith

#### Question 2. *How has the distribution of gender, age and ethnicity in the developer community changed from 2018 to 2021?*
columns needed : ResponseId, Age, Ethnicity, Gender

In [4]:
df_2021.columns

Index(['ResponseId', 'MainBranch', 'Employment', 'Country', 'US_State',
       'UK_Country', 'EdLevel', 'Age1stCode', 'LearnCode', 'YearsCode',
       'YearsCodePro', 'DevType', 'OrgSize', 'Currency', 'CompTotal',
       'CompFreq', 'LanguageHaveWorkedWith', 'LanguageWantToWorkWith',
       'DatabaseHaveWorkedWith', 'DatabaseWantToWorkWith',
       'PlatformHaveWorkedWith', 'PlatformWantToWorkWith',
       'WebframeHaveWorkedWith', 'WebframeWantToWorkWith',
       'MiscTechHaveWorkedWith', 'MiscTechWantToWorkWith',
       'ToolsTechHaveWorkedWith', 'ToolsTechWantToWorkWith',
       'NEWCollabToolsHaveWorkedWith', 'NEWCollabToolsWantToWorkWith', 'OpSys',
       'NEWStuck', 'NEWSOSites', 'SOVisitFreq', 'SOAccount', 'SOPartFreq',
       'SOComm', 'NEWOtherComms', 'Age', 'Gender', 'Trans', 'Sexuality',
       'Ethnicity', 'Accessibility', 'MentalHealth', 'SurveyLength',
       'SurveyEase', 'ConvertedCompYearly'],
      dtype='object')

In [5]:
# Dropping the columns that are not necessary in the 2021 dataset
columns_interested_in21 = ["ResponseId", "LanguageHaveWorkedWith", "LanguageWantToWorkWith", "DatabaseHaveWorkedWith", 
                          "DatabaseWantToWorkWith", "WebframeHaveWorkedWith", "WebframeWantToWorkWith", 
                          "PlatformHaveWorkedWith", "PlatformWantToWorkWith", "Age", "Ethnicity", "Gender"]

# Dropping the columns that are not necessary in the 2020 dataset
columns_interested_in20 = ['Respondent',  'Age', 'DatabaseDesireNextYear', 'DatabaseWorkedWith',
                           'Ethnicity', 'Gender', 'LanguageDesireNextYear','LanguageWorkedWith',
                           'PlatformDesireNextYear','PlatformWorkedWith']

# Dropping the columns that are not necessary in the 2019 dataset
columns_interested_in19 = ['Respondent', 'Age', 'DatabaseDesireNextYear', 'DatabaseWorkedWith',
                           'Ethnicity', 'Gender', 'LanguageDesireNextYear','LanguageWorkedWith',
                           'PlatformDesireNextYear','PlatformWorkedWith']

# Dropping the columns that are not necessary in the 2018 dataset
columns_interested_in18 = ['Respondent',  'Age', 'DatabaseDesireNextYear', 'DatabaseWorkedWith','RaceEthnicity',
                           'Gender', 'LanguageDesireNextYear', 'LanguageWorkedWith','PlatformDesireNextYear',
                           'PlatformWorkedWith', 'FrameworkWorkedWith','FrameworkDesireNextYear']

df_2018 = df_2018[columns_interested_in18]
df_2019 = df_2019[columns_interested_in19]
df_2020 = df_2020[columns_interested_in20]
df_2021 = df_2021[columns_interested_in21]

# The new size of the different datasets
print(f"df_2018 has shape : {df_2018.shape}")
print(f"df_2019 has shape : {df_2019.shape}")
print(f"df_2020 has shape : {df_2020.shape}")
print(f"df_2021 has shape : {df_2021.shape}")

df_2018 has shape : (98855, 12)
df_2019 has shape : (88883, 10)
df_2020 has shape : (64461, 10)
df_2021 has shape : (83439, 12)


#### Data Modelling

#### Transforming the data

##### Before creating tables, all the datasets are joined to create one dataframe that can then be subdivided into tables

*The question of which web framework the respondents used or would like to use was not asked in 2018, so framework in general will be used.*

In [6]:
# Standardizing the dataframe column names 
print(df_2018.columns)
print(df_2019.columns)
print(df_2020.columns)
print(df_2021.columns)

Index(['Respondent', 'Age', 'DatabaseDesireNextYear', 'DatabaseWorkedWith',
       'RaceEthnicity', 'Gender', 'LanguageDesireNextYear',
       'LanguageWorkedWith', 'PlatformDesireNextYear', 'PlatformWorkedWith',
       'FrameworkWorkedWith', 'FrameworkDesireNextYear'],
      dtype='object')
Index(['Respondent', 'Age', 'DatabaseDesireNextYear', 'DatabaseWorkedWith',
       'Ethnicity', 'Gender', 'LanguageDesireNextYear', 'LanguageWorkedWith',
       'PlatformDesireNextYear', 'PlatformWorkedWith'],
      dtype='object')
Index(['Respondent', 'Age', 'DatabaseDesireNextYear', 'DatabaseWorkedWith',
       'Ethnicity', 'Gender', 'LanguageDesireNextYear', 'LanguageWorkedWith',
       'PlatformDesireNextYear', 'PlatformWorkedWith'],
      dtype='object')
Index(['ResponseId', 'LanguageHaveWorkedWith', 'LanguageWantToWorkWith',
       'DatabaseHaveWorkedWith', 'DatabaseWantToWorkWith',
       'WebframeHaveWorkedWith', 'WebframeWantToWorkWith',
       'PlatformHaveWorkedWith', 'PlatformWantToWork

In [7]:
# Convert:
#       RaceEthnicity to Ethnicity in the 2018 dataset
#       FrameworkWorkedWith to WebframeHaveWorkedWith in the 2018 dataset
#       FrameworkDesireNextYear to WebframeWantToWorkWith in the 2018 dataset
columns2 = {"FrameworkWorkedWith": "WebframeHaveWorkedWith",
           "FrameworkDesireNextYear": "WebframeWantToWorkWith", "RaceEthnicity": "Ethnicity"}
df_2018.rename(columns=columns2, inplace=True)

#       DatabaseHaveWorkedWith to DatabaseWorkedWith in the 2021 dataset
#       DatabaseWantToWorkWith to DatabaseDesireNextYear in the 2021 dataset
#       LanguageHaveWorkedWith to LanguageWorkedWith in the 2021 dataset
#       LanguageWantToWorkWith to LanguageDesireNextYear in the 2021 dataset 
#       PlatformHaveWorkedWith to PlatformWorkedWith in the 2021 dataset
#       PlatformWantToWorkWith to PlatformDesireNextYear in the 2021 dataset
columns3 = {"ResponseId" : "Respondent", "DatabaseHaveWorkedWith" : "DatabaseWorkedWith", 
            "DatabaseWantToWorkWith" : "DatabaseDesireNextYear", "LanguageHaveWorkedWith" : "LanguageWorkedWith",
            "LanguageWantToWorkWith" : "LanguageDesireNextYear", "PlatformHaveWorkedWith" : "PlatformWorkedWith", 
            "PlatformWantToWorkWith" : "PlatformDesireNextYear"}
df_2021.rename(columns=columns3, inplace=True)

# Making sure that the 2021 dataset columns align with the rest of the datasets
new_column_order = ['ResponseId', 'Age', 'DatabaseDesireNextYear', 'DatabaseWorkedWith', 
       'Ethnicity', 'Gender', 'LanguageDesireNextYear','LanguageWorkedWith',
       'PlatformDesireNextYear', 'PlatformWorkedWith',
       'WebframeHaveWorkedWith', 'WebframeWantToWorkWith']

df_2021 = df_2021[new_column_order]

print(df_2018.columns)
print(df_2019.columns)
print(df_2020.columns)
print(df_2021.columns)

Index(['ResponseId', 'Age', 'DatabaseDesireNextYear', 'DatabaseWorkedWith',
       'Ethnicity', 'Gender', 'LanguageDesireNextYear', 'LanguageWorkedWith',
       'PlatformDesireNextYear', 'PlatformWorkedWith',
       'WebframeHaveWorkedWith', 'WebframeWantToWorkWith'],
      dtype='object')
Index(['ResponseId', 'Age', 'DatabaseDesireNextYear', 'DatabaseWorkedWith',
       'Ethnicity', 'Gender', 'LanguageDesireNextYear', 'LanguageWorkedWith',
       'PlatformDesireNextYear', 'PlatformWorkedWith'],
      dtype='object')
Index(['ResponseId', 'Age', 'DatabaseDesireNextYear', 'DatabaseWorkedWith',
       'Ethnicity', 'Gender', 'LanguageDesireNextYear', 'LanguageWorkedWith',
       'PlatformDesireNextYear', 'PlatformWorkedWith'],
      dtype='object')
Index(['ResponseId', 'LanguageWorkedWith', 'LanguageDesireNextYear',
       'DatabaseWorkedWith', 'DatabaseDesireNextYear',
       'WebframeHaveWorkedWith', 'WebframeWantToWorkWith',
       'PlatformWorkedWith', 'PlatformDesireNextYear', 'Age', 

In [54]:
# Creating tables for question 1.1
# First the columns are flattened since each entry is a list of items
def to_list(df: pd.DataFrame, column_name: str):
    
    col_one_list = df[column_name].tolist()
    
    new_languages_list = []
    for list_item in col_one_list: 
        
        # for nan values
        if type(list_item) == float:
            new_languages_list.append(list_item)
            
        if type(list_item) == list:
            singleitem = next(iter(list_item.split(';')))
            new_languages_list.append(singleitem)
            
        if type(list_item) == str: 
            singleitem = next(iter(list_item.split(';')))
            new_languages_list.append(singleitem)
            
    return new_languages_list

languages_worked_with = to_list(df_2021, 'LanguageWorkedWith')
languages_want_to_work_with = to_list(df_2021, 'LanguageDesireNextYear')


# Columns needed : ResponseId, LanguageHaveWorkedWith, LanguageWantToWorkWith
from collections import Counter
if Counter(languages_worked_with) == Counter(languages_want_to_work_with):
    df = pd.DataFrame(list(zip(languages_worked_with, languages_want_to_work_with)), columns=['languages_worked_with','languages_want_to_work_with'])
    df.index.rename("respondent", inplace=True)
df

Unnamed: 0_level_0,languages_worked_with,languages_want_to_work_with
respondent,Unnamed: 1_level_1,Unnamed: 2_level_1
0,C++,Swift
1,JavaScript,
2,Assembly,Julia
3,JavaScript,JavaScript
4,Bash/Shell,Bash/Shell
...,...,...
83434,Clojure,Clojure
83435,,
83436,Groovy,Java
83437,Bash/Shell,Go
