### Performing exploratory data analysis on data from the StakeOverFlow developer survey from the years 2018 to 2021

##### Table of Contents

Phase 1 : Extracting the data
- Exploring the data
- Data Modelling
        
Phase 2 : Transforming the data
- Cleaning the data
- Merging dataframes
- Cleaning the new dataframe

Phase 3 : Loading the data
- Loading in the data into a database

#### Extracting the data


While writing this notebook, the data was stored locally in my computer therefore to follow along with this note book you will need to download the data from here : https://insights.stackoverflow.com/survey/



In [7]:
import pandas as pd
import numpy as np

#setting the maximum display for the notebook cells 
pd.set_option('display.max_rows', 48)
pd.set_option('display.max_columns', 48)

#removing warnings 
pd.options.mode.chained_assignment = None  # default='warn'

##### Observation - *The number of survey participants has been decreasing from 2018 to 2020 even though the survey length is becoming shorter. Except for the year 2021 which has seen an increase in survey participant and decreasing survey length. Since the 2021 dataset has fewer columns, it is the primary dataset and a subsest of the other datasets will be taken in order to answer the following questions:*

#### Question 1. *How much impact has the pandemic had on developer's choices of tech stack?*
This is question is broken down to three parts namely:

##### Question 1.1 Which programming languages have gained popularity from 2018 to 2021?
columns needed : ResponseId, LanguageHaveWorkedWith, LanguageWantToWorkWith

##### Question 1.2 Which database services have gained popularity from 2018 to 2021?
columns needed : ResponseId, DatabaseHaveWorkedWith, DatabaseWantToWorkWith

##### Question 1.3 Which platforms and frameworks have gained popularity from 2018 to 2021?
columns needed : ResponseId, WebframeHaveWorkedWith, WebframeWantToWorkWith, MiscTechHaveWorkedWith,    MiscTechWantToWorkWith, PlatformHaveWorkedWith, PlatformWantToWorkWith

#### Question 2. *How has the distribution of gender, age and ethnicity in the developer community changed from 2018 to 2021?*
columns needed : ResponseId, Age, Ethnicity, Gender

In [8]:
import json
import logging
from sqlalchemy import create_engine 

def get_credentials(filepath : str) -> dict:
    """Loads database credentials from file.
    Args: 
        filepath - path to the json file

    Returns :
        A dictionary containing database credentials
    """
    with open(filepath, "r") as file:
        data = json.loads(file.read())
   
    return data
credentials = get_credentials("..\sof_sa\conf\staging_db_credentials.json")

# TO DO: EXPORT TO UTILITY FUNCTIONS
def execute_sql(path_to_sql_file: str, credentials: dict) -> pd.DataFrame:
    """Executes an sql query 

    Args:
        path_to_sql_file (str): path to the sql file that contains the sql statement to execute.
        credentials (dict): credentials to the database where the query will be executed 

    Returns:
        pd.DataFrame: a pandas dataframe representing the results of the query
    """
    try:
        DATABASE_URL = f'postgresql+psycopg2://{credentials["user"]}:{credentials["password"]}@{credentials["host"]}:{credentials["port"]}/{credentials["database"]}'
        engine = create_engine(DATABASE_URL, pool_pre_ping=True)

        with open(path_to_sql_file, 'r') as file, engine.connect() as connection:
            df = pd.read_sql_query(file.read(), connection)
            return df
    except Exception as e:
        logging.error(e)
    
df2018 = execute_sql("..\sof_sa\SQL\select_2018_data.sql", credentials)
df2019 = execute_sql("..\sof_sa\SQL\select_2019_data.sql", credentials)
df2020 = execute_sql("..\sof_sa\SQL\select_2020_data.sql", credentials)
df2021 = execute_sql("..\sof_sa\SQL\select_2021_data.sql", credentials)

##### Exploring the data

In [9]:
# The size of the different datasets
print(f"df2018 has shape : {df2018.shape}")
print(f"df2019 has shape : {df2019.shape}")
print(f"df2020 has shape : {df2020.shape}")
print(f"df2021 has shape : {df2021.shape}")

df2018 has shape : (98855, 13)
df2019 has shape : (88883, 13)
df2020 has shape : (64461, 13)
df2021 has shape : (83439, 13)


#### Data Modelling

![Data Model](../img/model.jpg)

#### Transforming the data

*Before creating tables, all the datasets are joined to create one dataframe that can then be subdivided into tables. The question of which web framework the respondents used or would like to use was not asked in 2018, so framework in general will be used.*

In [10]:
df_list = [df2018, df2019, df2020, df2021]
df_all = pd.concat(df_list)
df_all.shape

(335638, 13)

*Inspecting each column and cleaning it if necessary*

In [11]:
df_all.columns

Index(['respondent', 'age', 'ethnicity', 'gender', 'database_desire_next_year',
       'database_worked_with', 'language_desire_next_year',
       'language_worked_with', 'platform_desire_next_year',
       'platform_worked_with', 'web_framework_have_worked_with',
       'web_framework_want_to_work_with', 'year'],
      dtype='object')

In [26]:
# respondents is not important as we are not going to be checking for individual preferences.
# age column
df_all['age'].unique()


array([None, '18 - 24 years old', '55 - 64 years old',
       '35 - 44 years old', '25 - 34 years old', '45 - 54 years old',
       'Under 18 years old', '65 years or older', nan, 18.0, 23.0, 19.0,
       49.0, 25.0, 20.0, 31.0, 14.0, 13.0, 71.0, 26.0, 69.0, 66.0, 15.0,
       36.0, 44.0, 33.0, 43.0, 83.0, 32.0, 29.0, 16.0, 30.0, 13.5, 17.0,
       58.0, 65.0, 99.0, 12.0, 50.0, 22.0, 24.0, 21.0, 37.0, 76.0, 60.0,
       81.0, 45.0, 54.0, 59.0, 61.0, 88.0, 42.0, 35.0, 73.0, 67.0, 38.0,
       27.0, 9.0, 28.0, 63.0, 64.0, 39.0, 52.0, 77.0, 47.0, 34.0, 62.0,
       41.0, 40.0, 56.0, 46.0, 51.0, 48.0, 57.0, 23.9, 55.0, 68.0, 1.0,
       53.0, 17.5, 70.0, 16.5, 46.5, 11.0, 3.0, 97.0, 29.5, 78.0, 74.0,
       26.5, 26.3, 24.5, 72.0, 10.0, 75.0, 79.0, 36.8, 14.1, 19.5, 98.0,
       43.5, 22.5, 31.5, 21.5, 28.5, 33.6, 2.0, 38.5, 30.8, 24.8, 90.0,
       61.3, 4.0, 17.3, 19.9, 80.0, 85.0, 23.5, 16.9, 20.9, 91.0, 98.9,
       57.9, 94.0, 95.0, 37.5, 14.5, 5.0, 82.0, 84.0, 37.3, 33.5, 53.8,
     

In [34]:
# for values that are in range form: 
#   - take the average of the two values and use it as age
# for values with one value and text
#   - take the value and discard text
# for entries with value 'prefer not to say'
#   - replace with nan
# for values of type float 
#   - round off and convert to int

from math import isnan 

age_list = list(df_all['age'])
print(f"length b4 : {len(age_list)}")
new_age = []

for age in age_list:
    if isinstance(age, str):
        n = age.replace(" ", "")
        if 'or' in n:
            new_age.append(int(n[0:2]))
            
        if 'Under' in n:
            new_age.append(int(n[5:7]))
        
        if '-' in n:
            new_age.append((int(n[0:2]) + int(n[3:5]))//2)
            
        if 'Prefer' in n or 'None' in n:
            new_age.append(np.nan)
        
    if isinstance(age, float):
        if isnan(age):
            new_age.append(age)
        else:
            new_age.append(round(age))
            
    if isinstance(age, type(None)):
        new_age.append(age)
            
# replacing the column Age with the new ages
df_all['age'] = new_age

# replacing nan values with the mean of the column for a specific year
def calculate_mean_of_column(df: pd.DataFrame, column: str, filter_column: str, filter_value: int):
    """Calculates the mean of a column in a dataframe based on filter column and filter value

    Args:
        df (pd.DataFrame): dataframe to be filtered.
        column (str): column in df whose mean is to be calculated.
        filter_column (str): name of column in df to be used as a filter.
        filter_value (str): value to filter the filter column on.

    Returns:
        int representing the mean.
    """
    mean = df[df[filter_column] == filter_value][column].mean(skipna=True)
    return round(mean)

from statistics import mean
mean2018 = calculate_mean_of_column(df_all, 'age', 'year', 2018)
mean2019 = calculate_mean_of_column(df_all, 'age', 'year', 2019)
mean2020 = calculate_mean_of_column(df_all, 'age', 'year', 2020)
mean2021 = calculate_mean_of_column(df_all, 'age', 'year', 2021)

list_of_means = [mean2018, mean2019, mean2020, mean2021]
avg_mean = round(mean(list_of_means))

# for some reason inplace is not working
df_all['age'].fillna(avg_mean, inplace=True)

# age column is clean

length b4 : 335638


False    335638
Name: age, dtype: int64

In [38]:
# ethnicity column
df_all['ethnicity'].value_counts()

White or of European descent                                                                                                                                                                   163459
South Asian                                                                                                                                                                                     27381
East Asian                                                                                                                                                                                      10488
Middle Eastern                                                                                                                                                                                   8961
Black or of African descent                                                                                                                                                                      7143
          

In [11]:
# question 1.1
def count_unique_items_incolumn(df, column):
    ''' Separates colon(;) separated items in a column into individual items in a list and calculates the number of occurances of eaach item.
        input : a pandas dataframe, a column in df that contains colon separated values(a;b)
        output : a sorted dataframe on count with each item and its count
    '''
    new_list = []
    for lang in list(df[column]):
        new_items = lang.split(";")
        new_items = map(lambda x: x.strip(), new_items)
        new_list.extend(new_items)

    

def df_column_to_list(df: pd.DataFrame, column_name: str) -> list:
    """Converts a given dataframe into a list

    Args:
        df (pd.DataFrame): dataframe to be modified
        column_name (str): column name to be converted in dataframe

    Returns:
        list: column items in list form
    """
    col_list = df[column_name].tolist()
    
    new_list = []
    for list_item in col_list: 
        
        # for nan values
        if type(list_item) == float:
            new_list.append(list_item)
            
        if type(list_item) == list:
            singleitem = next(iter(list_item.split(';')))
            new_list.append(singleitem)
            
        if type(list_item) == str: 
            singleitem = next(iter(list_item.split(';')))
            new_list.append(singleitem)
            
    # find the number of occurances of a item in a list
    occ = Counter(new_list)
    language = []
    count = []
    for x in occ:
        key = x
        value = occ[key]
        language.append(key)
        count.append(value)

    df_temp = pd.DataFrame(list(zip(language, count)), columns = [column, 'count'])
    df_temp.set_index(column, inplace=True)
    df_temp.sort_values(by='count', ascending=True, inplace=True)
    return df_temp


def create_df(columns: list, data: list = None) -> pd.DataFrame:
    """Creates a dataframe using the data provided and names the columns according the columns lists

    Args:
        columns (list): name of columns to be used in the dataframe.
        data (list, optional): a list of lists that is the data to be used in the dataframe. Defaults to None.

    Raises:
        ValueError: if no data is provided.
        ValueError: if the length of the data provided is not equal to three.

    Returns:
        pd.DataFrame: A dataframe from the data.
    """
    if data is None:
        raise ValueError("No data provided.")
    
    if len(data) != 3:
        raise ValueError("Length of data must be 3.")
    
    difference = len(data[0]) - len(data[1])
    
    # get the number of elements missing
    no_missing_elements = abs(difference)
    
    if difference < 0: 
        # first list is smaller than second list so fix first list
        # then add that to small list
        new_values = ["None"]*no_missing_elements
        data[0].extend(new_values)
    
    if difference > 0:
        # second list is smaller so fix second list 
        # then add that to small list
        new_values = ["None"]*no_missing_elements
        data[1].extend(new_values)
        
    df = pd.DataFrame(list(zip(data[0], data[1], data[3])), columns=columns)
    df.index.rename("respondent", inplace=True)
    
    return df


In [12]:
# Columns needed : ResponseId, LanguageHaveWorkedWith, LanguageWantToWorkWith

languages_worked_with = df_column_to_list(df_2021, 'LanguageWorkedWith')
languages_want_to_work_with = df_column_to_list(df_2021, 'LanguageDesireNextYear')

languages_worked_with = df_column_to_list(df_2021, 'LanguageWorkedWith')
languages_want_to_work_with = df_column_to_list(df_2021, 'LanguageDesireNextYear')

languages_worked_with = df_column_to_list(df_2021, 'LanguageWorkedWith')
languages_want_to_work_with = df_column_to_list(df_2021, 'LanguageDesireNextYear')

languages_worked_with = df_column_to_list(df_2021, 'LanguageWorkedWith')
languages_want_to_work_with = df_column_to_list(df_2021, 'LanguageDesireNextYear')

Unnamed: 0_level_0,languages_worked_with,languages_want_to_work_with
respondent,Unnamed: 1_level_1,Unnamed: 2_level_1
0,C#,C#
1,JavaScript,Python
2,Objective-C,Objective-C
3,,
4,HTML/CSS,Java
...,...,...
64456,,
64457,Assembly,Assembly
64458,,
64459,HTML/CSS,HTML/CSS


In [None]:
from collections import Counter

languages_worked_with = df_column_to_list(df_2021, 'LanguageWorkedWith')
languages_want_to_work_with = df_column_to_list(df_2021, 'LanguageDesireNextYear')
print(dict(Counter(languages_worked_with)))

In [None]:
# Creating tables for question 1.2
# columns needed : ResponseId, DatabaseHaveWorkedWith, DatabaseWantToWorkWith



In [None]:
# Creating tables for question 1.3
# columns needed : ResponseId, WebframeHaveWorkedWith, WebframeWantToWorkWith, MiscTechHaveWorkedWith,    MiscTechWantToWorkWith, PlatformHaveWorkedWith, PlatformWantToWorkWith



In [None]:
# Creating tables for question 2
# columns needed : ResponseId, Age, Ethnicity, Gender