### Exploratory data analysis of the Stack Overflow developer survey, 2018 to 2021

##### Table of Contents

Phase 1 : Extracting the data
- Exploring the data
- Data Modelling
        
Phase 2 : Transforming the data
- Cleaning the data
- Merging dataframes
- Cleaning the new dataframe

Phase 3 : Loading the data
- Loading the data into a database

#### Extracting the data


While writing this notebook, the data was stored locally on my computer. To follow along, you will need to download the data from https://insights.stackoverflow.com/survey/



In [53]:
import pandas as pd
import numpy as np

#setting the maximum display for the notebook cells 
pd.set_option('display.max_rows', 48)
pd.set_option('display.max_columns', 48)

#removing warnings 
pd.options.mode.chained_assignment = None  # default='warn'

##### Observation - *The number of survey participants decreased from 2018 to 2020 even though the survey became shorter each year. 2021 is the exception: participation increased while the survey length kept decreasing. Since the 2021 dataset has fewer columns than the others, it is treated as the primary dataset, and a matching subset of the other datasets is taken in order to answer the following questions:*

#### Question 1. *How much impact has the pandemic had on developer's choices of tech stack?*
This question is broken down into three parts:

##### Question 1.1 Which programming languages have gained popularity from 2018 to 2021?
columns needed : ResponseId, LanguageHaveWorkedWith, LanguageWantToWorkWith

##### Question 1.2 Which database services have gained popularity from 2018 to 2021?
columns needed : ResponseId, DatabaseHaveWorkedWith, DatabaseWantToWorkWith

##### Question 1.3 Which platforms and frameworks have gained popularity from 2018 to 2021?
columns needed : ResponseId, WebframeHaveWorkedWith, WebframeWantToWorkWith, MiscTechHaveWorkedWith, MiscTechWantToWorkWith, PlatformHaveWorkedWith, PlatformWantToWorkWith

#### Question 2. *How has the distribution of gender and age in the developer community changed from 2018 to 2021?*
columns needed : ResponseId, Age, Gender

In [95]:
import json
import logging
from sqlalchemy import create_engine 

def get_credentials(filepath : str) -> dict:
    """Loads database credentials from file.
    Args: 
        filepath - path to the json file

    Returns :
        A dictionary containing database credentials
    """
    with open(filepath, "r") as file:
        data = json.load(file)

    return data

# raw string so the backslashes in the Windows path are not read as escape sequences
credentials = get_credentials(r"..\sof_sa\conf\staging_db_credentials.json")

# TO DO: EXPORT TO UTILITY FUNCTIONS
def execute_sql(path_to_sql_file: str, credentials: dict) -> pd.DataFrame:
    """Executes an sql query 

    Args:
        path_to_sql_file (str): path to the sql file that contains the sql statement to execute.
        credentials (dict): credentials to the database where the query will be executed 

    Returns:
        pd.DataFrame: a pandas dataframe representing the results of the query
    """
    try:
        DATABASE_URL = f'postgresql+psycopg2://{credentials["user"]}:{credentials["password"]}@{credentials["host"]}:{credentials["port"]}/{credentials["database"]}'
        engine = create_engine(DATABASE_URL, pool_pre_ping=True)

        with open(path_to_sql_file, 'r') as file, engine.connect() as connection:
            df = pd.read_sql_query(file.read(), connection)
            return df
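    except Exception as e:
        # log the failure, then re-raise so the caller never receives None in place of a dataframe
        logging.error(e)
        raise

# raw strings keep the backslashes in the Windows paths literal
df2018 = execute_sql(r"..\sof_sa\SQL\select_2018_data.sql", credentials)
df2019 = execute_sql(r"..\sof_sa\SQL\select_2019_data.sql", credentials)
df2020 = execute_sql(r"..\sof_sa\SQL\select_2020_data.sql", credentials)
df2021 = execute_sql(r"..\sof_sa\SQL\select_2021_data.sql", credentials)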
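
The connection URL in `execute_sql` is assembled with an f-string, so a password containing characters such as `@` or `/` would produce a malformed URL. A minimal sketch of escaping the user and password with `urllib.parse.quote_plus` first, assuming the same credential keys as above. `build_database_url` is a hypothetical helper name and the sample credentials are made up:

```python
from urllib.parse import quote_plus

def build_database_url(credentials: dict) -> str:
    """Builds a postgresql+psycopg2 URL, escaping special characters in user and password."""
    user = quote_plus(credentials["user"])
    password = quote_plus(credentials["password"])
    return (
        f"postgresql+psycopg2://{user}:{password}"
        f"@{credentials['host']}:{credentials['port']}/{credentials['database']}"
    )

# made-up credentials, for illustration only
sample_credentials = {
    "user": "analyst",
    "password": "p@ss/word",
    "host": "localhost",
    "port": 5432,
    "database": "staging",
}
database_url = build_database_url(sample_credentials)
print(database_url)
```

The returned string can be passed to `create_engine` in place of the hand-built `DATABASE_URL`.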

##### Exploring the data

In [55]:
# The size of the different datasets
print(f"df2018 has shape : {df2018.shape}")
print(f"df2019 has shape : {df2019.shape}")
print(f"df2020 has shape : {df2020.shape}")
print(f"df2021 has shape : {df2021.shape}")

df2018 has shape : (98855, 12)
df2019 has shape : (88883, 12)
df2020 has shape : (64461, 12)
df2021 has shape : (83439, 12)
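
To put numbers on the participation trend, the year-over-year change can be computed from the row counts printed above. This is a small sketch that assumes each row corresponds to one respondent; the `respondents` and `changes` names are only illustrative:

```python
# row counts copied from the shapes printed above
respondents = {2018: 98855, 2019: 88883, 2020: 64461, 2021: 83439}

years = sorted(respondents)
changes = {}
for previous, current in zip(years, years[1:]):
    # percentage change relative to the previous year
    change = (respondents[current] - respondents[previous]) / respondents[previous] * 100
    changes[current] = change
    print(f"{previous} -> {current}: {change:+.1f}%")
```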


#### Data Modelling

![Data Model](../img/model.jpg)

#### Transforming the data

*Before creating tables, all the datasets are joined to create one dataframe that can then be subdivided into tables. The question of which web framework the respondents used or would like to use was not asked in 2018, so framework in general will be used.*
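
If "joined" here means stacking the yearly rows on top of each other, one way to do it is `pd.concat`, assuming all four dataframes share the same column names (they each have 12 columns). The tiny frames below are made-up stand-ins for the yearly dataframes, `df_all` is a hypothetical name, and the idea that respondent ids restart every year is an assumption:

```python
import pandas as pd

# made-up stand-ins for two of the yearly dataframes, with a few of the shared columns
df_a = pd.DataFrame({
    "respondent": [1, 2],
    "age": ["18 - 24 years old", "25 - 30 years old"],
    "year": [2018, 2018],
})
df_b = pd.DataFrame({
    "respondent": [1, 2],
    "age": ["31 - 36 years old", "Over 60 years old"],
    "year": [2019, 2019],
})

# stack the rows; ignore_index=True rebuilds a continuous 0..n-1 index
df_all = pd.concat([df_a, df_b], ignore_index=True)

# assumption: respondent ids restart every year, so (respondent, year) identifies a row
assert not df_all.duplicated(subset=["respondent", "year"]).any()
print(df_all.groupby("year").size())
```

Keeping the `year` column in the combined frame is what later allows it to be split into per-question tables without losing track of the survey year.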

*Inspecting each column and cleaning it if necessary*

In [96]:
# for values that are in range form: 
#   - take the average of the two values and use it as age
# for values with one value and text
#   - take the value and discard text
# for entries with value 'prefer not to say'
#   - replace with nan
# for values of type float 
#   - round off and convert to int

def age_to_range(number: int) -> str:
    """Maps an age in years to the label of its age bucket."""

    if number < 18:
        return 'Under 18 years old'
    elif 18 <= number <= 24:
        return '18 - 24 years old'
    elif 25 <= number <= 30:
        return '25 - 30 years old'
    elif 31 <= number <= 36:
        return '31 - 36 years old'
    elif 37 <= number <= 42:
        return '37 - 42 years old'
    elif 43 <= number <= 48:
        return '43 - 48 years old'
    elif 49 <= number <= 54:
        return '49 - 54 years old'
    elif 55 <= number <= 60:
        return '55 - 60 years old'
    elif number > 60:
        return 'Over 60 years old'

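def clean_age_column(age):
    """Maps a raw age (a string label or a float) to one of the age_to_range buckets."""

    # a missing answer is treated like an explicit refusal
    if age is None:
        return 'Prefer not to say'

    if isinstance(age, str):
        n = age.replace(" ", "")
        # single bound followed by 'or', e.g. '<age> years or older': take the leading number
        if 'or' in n:
            return age_to_range(int(n[0:2]))

        # 'Under 18 years old' parses to 18, which age_to_range places in '18 - 24 years old'
        if 'Under' in n:
            return age_to_range(int(n[5:7]))

        # a range: use the midpoint of the two bounds
        if '-' in n:
            return age_to_range((int(n[0:2]) + int(n[3:5]))//2)

        if 'Prefer' in n:
            return 'Prefer not to say'

    if isinstance(age, float):
        return age_to_range(round(age))

    # any other input (for example an int) falls through and returns None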
         

# Export to utility functions
def replace_na_with_mean(df: pd.DataFrame, column_name: str) -> None:
    """Fills missing values in column_name, in place, with the rounded mean of the parseable ages."""

    new_age = []
    for age in df[column_name].to_list():
        if isinstance(age, str):
            n = age.replace(" ", "")
            if 'or' in n:
                new_age.append(int(n[0:2]))

            if 'Under' in n:
                new_age.append(int(n[5:7]))

            if '-' in n:
                new_age.append((int(n[0:2]) + int(n[3:5]))//2)

            if 'Prefer' in n or 'None' in n:
                new_age.append(np.nan)

        if isinstance(age, float):
            if np.isnan(age):
                new_age.append(age)
            else:
                new_age.append(round(age))

    # average only the entries that parsed to an int; NaN placeholders are skipped
    valid_ages = [x for x in new_age if isinstance(x, int)]
    mean = round(sum(valid_ages) / len(valid_ages))

    # assign the result back instead of calling fillna(inplace=True) on a column selection
    df[column_name] = df[column_name].fillna(mean)
 
# replace nan values with mean   
replace_na_with_mean(df2018, 'age')
replace_na_with_mean(df2019, 'age')
replace_na_with_mean(df2020, 'age')
replace_na_with_mean(df2021, 'age')

# clean the age column
df2018['age'] = df2018['age'].apply(clean_age_column)
df2019['age'] = df2019['age'].apply(clean_age_column)
df2020['age'] = df2020['age'].apply(clean_age_column)
df2021['age'] = df2021['age'].apply(clean_age_column)

# age column is clean
for df in [df2018, df2019, df2020, df2021]:
    print(df['age'].value_counts())

25 - 30 years old    31759
18 - 24 years old    16887
37 - 42 years old    11477
49 - 54 years old     3313
55 - 60 years old      959
Over 60 years old      179
Name: age, dtype: int64
25 - 30 years old     35593
18 - 24 years old     19428
31 - 36 years old     15697
37 - 42 years old      8353
43 - 48 years old      3795
Under 18 years old     2196
49 - 54 years old      2019
55 - 60 years old      1072
Over 60 years old       730
Name: age, dtype: int64
31 - 36 years old     28032
25 - 30 years old     14463
18 - 24 years old     10672
37 - 42 years old      5141
43 - 48 years old      2376
49 - 54 years old      1348
Under 18 years old     1200
55 - 60 years old       711
Over 60 years old       518
Name: age, dtype: int64
25 - 30 years old    32568
18 - 24 years old    26369
37 - 42 years old    15183
49 - 54 years old     5472
55 - 60 years old     1819
Prefer not to say      575
Over 60 years old      421
Name: age, dtype: int64
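
Two details of the string checks in `clean_age_column` are worth keeping in mind when reading these counts: integer input returns `None`, and 'Under 18 years old' is parsed as 18 and therefore lands in the '18 - 24 years old' bucket. Below is a sketch of a more defensive parser, assuming raw ages arrive either as numbers or as labels containing one or two numbers. `parse_age` is a hypothetical helper, and the sample labels are assumptions about the raw survey values, not taken from the datasets:

```python
import math
import re
from typing import Optional

def parse_age(raw) -> Optional[int]:
    """Returns a representative integer age, or None when no age can be recovered."""
    if raw is None:
        return None
    if isinstance(raw, (int, float)):
        # NaN marks a missing answer
        if isinstance(raw, float) and math.isnan(raw):
            return None
        return round(raw)

    text = str(raw)
    numbers = [int(x) for x in re.findall(r"\d+", text)]
    if not numbers:
        # no digits at all, e.g. 'Prefer not to say'
        return None
    if text.strip().lower().startswith("under"):
        # stay just below the stated bound so the value remains in the youngest bucket
        return numbers[0] - 1
    if len(numbers) == 2:
        # a range: use the midpoint of the two bounds
        return sum(numbers) // 2
    # a single bound such as '<age> years or older'
    return numbers[0]

# assumed examples of raw values
for raw in ["Under 18 years old", "25 - 34 years old", "65 years or older", "Prefer not to say", 29.6, 30]:
    print(repr(raw), "->", parse_age(raw))
```

The integer result can be passed to `age_to_range`, with `None` handled separately (for example by mapping it to 'Prefer not to say').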


In [50]:
# the age column was already cleaned in the previous cell, so it is only previewed here
df2019.head()
    

| | respondent | age | gender | database_desire_next_year | database_worked_with | language_desire_next_year | language_worked_with | platform_desire_next_year | platform_worked_with | web_framework_have_worked_with | web_framework_want_to_work_with | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1059 | 25 - 30 years old | | | | C++;Rust;TypeScript | Bash/Shell/PowerShell;C;C++;Java;JavaScript;Py... | Linux;Windows | Linux;Raspberry Pi;Windows | | | 2019 |
| 1 | 1080 | 18 - 24 years old | Man | MySQL | MySQL;SQLite | C++;Java | C++;C#;Java;JavaScript;SQL | Android;Windows | Android;Linux;Windows | jQuery;Spring | React.js;Spring | 2019 |
| 2 | 1875 | 18 - 24 years old | Man | MySQL | MySQL | HTML/CSS;Java;JavaScript | Java | | | | | 2019 |
| 3 | 2287 | 25 - 30 years old | Man | | Oracle | Python | C | Android | | Angular/Angular.js | | 2019 |
| 4 | 2721 | 18 - 24 years old | Man | | | C;C++;Java;JavaScript;Python | Assembly;Python | Linux | | | | 2019 |


In [51]:
# question 1.1
from collections import Counter

def count_unique_items_in_column(df: pd.DataFrame, column_name: str) -> pd.DataFrame:
    """Counts how often each value appears as the first entry of a ';'-separated column."""

    first_items = []
    for entry in df[column_name].tolist():

        # NaN marks a missing answer and is kept as is
        if isinstance(entry, float):
            first_items.append(entry)

        # list-valued cells: only the first entry of the first element is used
        elif isinstance(entry, list):
            first_items.append(entry[0].split(';')[0])

        # string cells: only the first entry of the ';'-separated answer is used
        elif isinstance(entry, str):
            first_items.append(entry.split(';')[0])

    # count the number of occurrences of each item
    occurrences = Counter(first_items)

    df_temp = pd.DataFrame(list(occurrences.items()), columns=[column_name, 'count'])
    df_temp.set_index(column_name, inplace=True)
    df_temp.sort_values(by='count', ascending=True, inplace=True)
    return df_temp
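
`count_unique_items_in_column` looks only at the first entry of each `;`-separated answer. If the goal is to count every item a respondent listed, a sketch along the following lines may help. It uses plain Python on made-up sample answers, and `count_all_items` is a hypothetical name:

```python
from collections import Counter

def count_all_items(answers: list) -> Counter:
    """Counts every item across ';'-separated answers, skipping missing values."""
    counts = Counter()
    for answer in answers:
        # missing answers show up as NaN (a float) or None, so only strings are split
        if isinstance(answer, str):
            counts.update(item.strip() for item in answer.split(';') if item.strip())
    return counts

# made-up sample answers in the same format as language_worked_with
sample_answers = ["C++;Java", "Java;JavaScript;Python", float("nan"), "Python"]
language_counts = count_all_items(sample_answers)
print(language_counts.most_common())
```

With pandas, `df['language_worked_with'].str.split(';').explode().value_counts()` produces the same kind of tally directly from the column.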


In [52]:
# Columns needed : ResponseId, LanguageHaveWorkedWith, LanguageWantToWorkWith
df18_languages_workedwith = count_unique_items_in_column(df2018, 'language_worked_with')
df18_languages_workedwith

In [None]:
# Creating tables for question 1.2
# columns needed : ResponseId, DatabaseHaveWorkedWith, DatabaseWantToWorkWith



In [None]:
# Creating tables for question 1.3
# columns needed : ResponseId, WebframeHaveWorkedWith, WebframeWantToWorkWith, MiscTechHaveWorkedWith, MiscTechWantToWorkWith, PlatformHaveWorkedWith, PlatformWantToWorkWith



In [None]:
# Creating tables for question 2
# columns needed : ResponseId, Age, Ethnicity, Gender