### Performing exploratory data analysis on data from the Stack Overflow Developer Survey from the years 2018 to 2021

##### The purpose of this notebook is to analyze Stack Overflow Developer Survey data in order to answer the following question:\
***What is the developers' choice of tech stack?***\
This question is broken down into four parts:
- Question 1 Which programming languages have gained popularity from 2018 to 2021?
- Question 2 Which database services have gained popularity from 2018 to 2021?
- Question 3 Which cloud platforms have gained popularity from 2018 to 2021?
- Question 4 Which web frameworks have gained popularity from 2018 to 2021?


#### Table of Contents

Extracting the data\
Data Model\
Transforming the data\
Loading the data into a database

## Extracting the data

While writing this notebook, the data was stored locally on my computer, so to follow along with this notebook you will need to download the data from here: https://insights.stackoverflow.com/survey/
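If you are not loading the data through a catalog, a downloaded CSV can be read straight into pandas. The snippet below is only a sketch: the in-memory sample and its two columns are made up for illustration, and the real survey files have many more columns whose names may differ from the ones used in this notebook.

```python
import io

import pandas as pd

# Hypothetical stand-in for a downloaded survey file. In practice, pass the
# path of the CSV you downloaded to pd.read_csv instead of this sample.
sample_csv = io.StringIO(
    "respondent,language_worked_with\n"
    "1,JavaScript;Python\n"
    "2,\n"
    "3,SQL\n"
)

df_sample = pd.read_csv(sample_csv)

# Empty answers are read as NaN, so they show up as missing values.
print(df_sample.shape)
print(df_sample["language_worked_with"].isna().sum())
```

Note that answers with several choices arrive as one semicolon-separated string, which is why the counting function later in this notebook splits on `;`.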

In [2]:
# Imports
import numpy as np  # used further down when cleaning the age column
import pandas as pd

In [1]:
df2018 = catalog.load('public_2018_data')

## Transforming the data

In [3]:
df2018.shape

In [5]:
list(df2018.columns)

In [54]:
# question 1
from collections import Counter

def count_unique_items_in_column(df: pd.DataFrame, column_name: str) -> pd.DataFrame:
    """Counts the unique items in a dataframe column. The column must contain semicolon-separated strings or missing values (None)

    Args:
        df (pd.DataFrame): dataframe whose column is counted
        column_name (str): column name in dataframe

    Returns:
        pd.DataFrame: new dataframe containing each item and the number of times it occurs in df
    
    Raises:
        ValueError: if the column passed does not exist in dataframe
    """
    if column_name not in df.columns:
        raise ValueError(f"No column named {column_name} in dataframe.")
    
    column_as_list = df[column_name].tolist()
    
    new_list = []
    for list_item in column_as_list: 
        
        # keep missing answers (None) so that they are counted as well
        if isinstance(list_item, type(None)):
            new_list.append(list_item)
            
        if isinstance(list_item, str): 
            new_list.extend(list_item.split(";"))
            
    # count the number of occurrences of each item in the list
    occ = Counter(new_list)

    # build a dataframe of (item, count) pairs, indexed by item and sorted by count
    df_temp = pd.DataFrame(list(occ.items()), columns=[column_name, 'count'])
    df_temp.set_index(column_name, inplace=True)
    df_temp.sort_values(by='count', ascending=False, inplace=True)
    return df_temp


def _merge(dataframe_list: list) -> pd.DataFrame:
    """Merges dataframes on index

    Args:
        dataframe_list (list): a list of dataframes to merge

    Returns:
        pd.DataFrame: merged dataframe
        
    Raises:
        ValueError: if the list does not contain exactly four dataframes
    """
    if len(dataframe_list) != 4:
        raise ValueError("List must contain exactly four (4) dataframes")
        
    df18_19 = pd.merge(dataframe_list[0], dataframe_list[1], left_index=True, right_index=True)
    df20_21 = pd.merge(dataframe_list[2], dataframe_list[3], left_index=True, right_index=True)
    dfs_merged = pd.merge(df18_19, df20_21, left_index=True, right_index=True)
    dfs_merged.columns = ['2018', '2019', '2020', '2021']
    
    return dfs_merged
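For comparison, the counting step can also be written with pandas built-ins. This is a sketch rather than a drop-in replacement: `count_items` and the toy dataframe are hypothetical, and unlike `count_unique_items_in_column` it skips missing answers instead of counting them under `None`.

```python
import pandas as pd


def count_items(df: pd.DataFrame, column_name: str) -> pd.DataFrame:
    """Count semicolon-separated items in a column, skipping missing answers."""
    counts = (
        df[column_name]
        .dropna()            # drop None/NaN answers
        .str.split(";")      # 'A;B' -> ['A', 'B']
        .explode()           # one row per item
        .value_counts()      # sorted by count, descending
    )
    return counts.rename("count").rename_axis(column_name).to_frame()


# Made-up answers, only to show the shape of the result.
toy = pd.DataFrame(
    {"language_worked_with": ["Python;SQL", "Python", None, "SQL;Go;Python"]}
)
toy_counts = count_items(toy, "language_worked_with")
print(toy_counts)
```

The result has the same layout as the function above (items in the index, a single `count` column), so it could be passed to `_merge` in the same way.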

In [23]:
df_list = [df2018, df2019, df2020, df2021]
def display_index_values(df_list: list, column_name: str):
    for i in df_list:
        df_temp = count_unique_items_in_column(i, column_name)
        yield df_temp.index.values
        
j = display_index_values(df_list, 'web_framework_have_worked_with')

for i in j:
    print(i)

[None 'Node.js' 'Angular' 'React' '.NET Core' 'Spring' 'Django' 'Cordova'
 'TensorFlow' 'Xamarin' 'Spark' 'Hadoop' 'Torch/PyTorch']
['jQuery' None 'React.js' 'Angular/Angular.js' 'ASP.NET' 'Express'
 'Spring' 'Vue.js' 'Django' 'Flask' 'Laravel' 'Other(s):' 'Ruby on Rails'
 'Drupal']
[None 'jQuery' 'React.js' 'Angular' 'ASP.NET' 'Express' 'ASP.NET Core'
 'Vue.js' 'Spring' 'Angular.js' 'Django' 'Flask' 'Laravel' 'Ruby on Rails'
 'Symfony' 'Gatsby' 'Drupal']
['React.js' None 'jQuery' 'Express' 'Angular' 'Vue.js' 'ASP.NET Core '
 'Flask' 'ASP.NET' 'Django' 'Spring' 'Angular.js' 'Laravel'
 'Ruby on Rails' 'Gatsby' 'FastAPI' 'Symfony' 'Svelte' 'Drupal']


In [20]:
# Web Frameworks
# rename React in 2018 dataset to React.js
# rename 'Angular' in 2018 dataset to Angular.js
# rename 'Angular/Angular.js' to Angular.js in 2019 dataset
# add ASP.NET to 2018 dataset and set it to zero
# add jQuery to 2018 dataset and set it to zero
# add Vue.js to 2018 dataset and set it to zero
# add Flask to 2018 dataset and set it to zero
# add Laravel to 2018 dataset and set it to zero
# add Express to 2018 dataset and set it to zero
# add Ruby on Rails to 2018 dataset and set it to zero
# add Drupal to 2018 dataset and set it to zero
df18 = count_unique_items_in_column(df2018, 'web_framework_have_worked_with').rename(index={'React': 'React.js', 'Angular': 'Angular.js', 'Angular/Angular.js': 'Angular.js'})
for i in ['ASP.NET', 'jQuery', 'Vue.js', 'Flask', 'Laravel',  'Express', 'Ruby on Rails', 'Drupal']:
    if i not in df18.index.values:
        df18.loc[i] = 0
df19 = count_unique_items_in_column(df2019, 'web_framework_have_worked_with')
df20 = count_unique_items_in_column(df2020, 'web_framework_have_worked_with')
df21 = count_unique_items_in_column(df2021, 'web_framework_have_worked_with')

l = [df18, df19, df20, df21]
web_frameworks = _merge(l)


df18 = count_unique_items_in_column(df2018, 'web_framework_want_to_work_with').rename(index={'React': 'React.js', 'Angular': 'Angular.js', 'Angular/Angular.js': 'Angular.js'})
for i in ['ASP.NET', 'jQuery', 'Vue.js', 'Flask', 'Laravel',  'Express', 'Ruby on Rails', 'Drupal']:
    if i not in df18.index.values:
        df18.loc[i] = 0
df19 = count_unique_items_in_column(df2019, 'web_framework_want_to_work_with')
df20 = count_unique_items_in_column(df2020, 'web_framework_want_to_work_with')
df21 = count_unique_items_in_column(df2021, 'web_framework_want_to_work_with')

l = [df18, df19, df20, df21]
future_web_frameworks = _merge(l)

web_framework_want_to_work_with,2018,2019,2020,2021
,43637,25939,24437,31344
React.js,23736,29531,20071,25718
Django,8774,11358,8237,9112
Spring,8063,9846,6241,7229
ASP.NET,0,13495,4818,4488
jQuery,0,16918,8382,8487
Vue.js,0,19784,13142,15784
Flask,0,8163,6097,7594
Laravel,0,6536,4260,4734
Express,0,12092,8128,11885
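Because `_merge` joins on the index and keeps only the items present in every year, frameworks missing from 2018 had to be added by hand with a count of zero. An outer join gives the same zeros without listing the items. The snippet below is a sketch with made-up counts and only two years; it is not how the tables above were produced.

```python
import pandas as pd

# Toy per-year counts. In the notebook these would come from
# count_unique_items_in_column for each survey year.
counts_2018 = pd.DataFrame({"count": [5, 3]}, index=["React.js", "Django"])
counts_2019 = pd.DataFrame({"count": [6, 4, 2]}, index=["React.js", "Vue.js", "Django"])

merged = (
    pd.concat(
        [counts_2018["count"], counts_2019["count"]],
        axis=1,                 # place the years side by side
        keys=["2018", "2019"],  # column names for each year
    )
    .fillna(0)                  # items absent in a year get a count of zero
    .astype(int)
)
print(merged)
```

The trade-off is that an outer join also keeps items that appear in a single year only, so the merged table can grow much longer than the inner-joined one.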


In [24]:
j = display_index_values(df_list, 'platform_worked_with')

for i in j:
    print(i)

[None 'Linux' 'Windows Desktop or Server' 'Android' 'AWS' 'Mac OS'
 'Raspberry Pi' 'WordPress' 'iOS' 'Firebase' 'Azure' 'Arduino' 'Heroku'
 'Google Cloud Platform/App Engine' 'Serverless' 'Drupal' 'Amazon Echo'
 'Windows Phone' 'SharePoint' 'ESP8266' 'Salesforce'
 'Apple Watch or Apple TV' 'IBM Cloud or Watson' 'Google Home'
 'Gaming console' 'Mainframe' 'Predix']
['Linux' 'Windows' 'Docker' 'Android' 'AWS' 'MacOS' 'Slack' 'Raspberry Pi'
 'WordPress' 'iOS' 'Google Cloud Platform' 'Microsoft Azure' 'Arduino'
 'Heroku' None 'Kubernetes' 'Other(s):' 'IBM Cloud or Watson']
['Linux' 'Windows' 'Docker' 'AWS' 'Android' 'MacOS' None 'Raspberry Pi'
 'Microsoft Azure' 'WordPress' 'Google Cloud Platform' 'iOS' 'Kubernetes'
 'Heroku' 'Arduino' 'Slack Apps and Integrations' 'IBM Cloud or Watson']
[None 'AWS' 'Google Cloud Platform' 'Microsoft Azure' 'DigitalOcean'
 'Heroku' 'IBM Cloud or Watson' 'Oracle Cloud Infrastructure']


In [17]:
# Platforms
# Only dealing with cloud platforms: AWS, Microsoft Azure, Heroku, Google Cloud Platform, IBM Cloud or Watson
# change 'Google Cloud Platform/App Engine' into 'Google Cloud Platform'
# change 'Azure' in 2018 dataset to 'Microsoft Azure'
df18 = count_unique_items_in_column(df2018, 'platform_worked_with').rename(index={'Google Cloud Platform/App Engine': 'Google Cloud Platform', 'Azure': 'Microsoft Azure'})
df19 = count_unique_items_in_column(df2019, 'platform_worked_with')
df20 = count_unique_items_in_column(df2020, 'platform_worked_with')
df21 = count_unique_items_in_column(df2021, 'platform_worked_with')

l = [df18, df19, df20, df21]
platforms = _merge(l)

df18 = count_unique_items_in_column(df2018, 'platform_desire_next_year').rename(index={'Google Cloud Platform/App Engine': 'Google Cloud Platform', 'Azure': 'Microsoft Azure'})
df19 = count_unique_items_in_column(df2019, 'platform_desire_next_year')
df20 = count_unique_items_in_column(df2020, 'platform_desire_next_year')
df21 = count_unique_items_in_column(df2021, 'platform_desire_next_year')

l = [df18, df19, df20, df21]
future_platforms = _merge(l)

platform_worked_with,2018,2019,2020,2021
,32856,8169,10618,41820
AWS,15927,21304,14389,26295
Microsoft Azure,7267,9528,7830,15096
Heroku,6913,8527,5974,8182
Google Cloud Platform,5302,9928,7569,16228
IBM Cloud or Watson,950,1514,876,1768


In [28]:
j = display_index_values(df_list, 'language_worked_with')

for i in j:
    print(i)

['JavaScript' 'HTML' 'CSS' 'SQL' 'Java' 'Bash/Shell' 'Python' 'C#' 'PHP'
 None 'C++' 'C' 'TypeScript' 'Ruby' 'Swift' 'Assembly' 'Go' 'Objective-C'
 'VB.NET' 'R' 'Matlab' 'VBA' 'Kotlin' 'Scala' 'Groovy' 'Perl'
 'Visual Basic 6' 'Lua' 'CoffeeScript' 'Delphi/Object Pascal' 'Haskell'
 'Rust' 'F#' 'Clojure' 'Erlang' 'Cobol' 'Ocaml' 'Julia' 'Hack']
['JavaScript' 'HTML/CSS' 'SQL' 'Python' 'Java' 'Bash/Shell/PowerShell'
 'C#' 'PHP' 'C++' 'TypeScript' 'C' 'Other(s):' 'Ruby' 'Go' 'Assembly'
 'Swift' 'Kotlin' 'R' 'VBA' 'Objective-C' 'Scala' 'Rust' 'Dart' None
 'Elixir' 'Clojure' 'WebAssembly' 'F#' 'Erlang']
['JavaScript' 'HTML/CSS' 'SQL' 'Python' 'Java' 'Bash/Shell/PowerShell'
 'C#' 'PHP' 'TypeScript' 'C++' 'C' None 'Go' 'Kotlin' 'Ruby' 'Assembly'
 'VBA' 'Swift' 'R' 'Rust' 'Objective-C' 'Dart' 'Scala' 'Perl' 'Haskell'
 'Julia']
['JavaScript' 'HTML/CSS' 'Python' 'SQL' 'Java' 'Node.js' 'TypeScript' 'C#'
 'Bash/Shell' 'C++' 'PHP' 'C' 'PowerShell' 'Go' 'Kotlin' 'Rust' 'Ruby'
 'Dart' 'Assembly' 'Swift

In [33]:
# Languages
df18 = count_unique_items_in_column(df2018, 'language_worked_with')
df19 = count_unique_items_in_column(df2019, 'language_worked_with')
df20 = count_unique_items_in_column(df2020, 'language_worked_with')
df21 = count_unique_items_in_column(df2021, 'language_worked_with')
l = [df18, df19, df20, df21]

languages = _merge(l)

# future_languages
df18 = count_unique_items_in_column(df2018, 'language_desire_next_year')
df19 = count_unique_items_in_column(df2019, 'language_desire_next_year')
df20 = count_unique_items_in_column(df2020, 'language_desire_next_year')
df21 = count_unique_items_in_column(df2021, 'language_desire_next_year')
l = [df18, df19, df20, df21]

future_languages = _merge(l)

In [34]:
future_languages

language_desire_next_year,2018,2019,2020,2021
JavaScript,38465,44739,26188,37008
Python,32795,40006,26682,34929
SQL,28011,33566,19970,26631
,25611,4795,10348,6618
Java,22556,23508,13264,17222
C#,20419,22449,13674,17999
TypeScript,16896,23720,17150,26905
Go,15529,17060,12605,15788
C++,15289,16856,9756,15249
PHP,12244,12837,7106,8852
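Raw counts depend on how many people answered each year, so one rough way to see which languages gained ground is to compare ranks instead. The sketch below uses only the 2018 and 2021 counts of the nine named languages shown above (the blank row for missing answers is left out), so the ranks are relative to this excerpt and not to the full table.

```python
import pandas as pd

# 2018 and 2021 counts copied from the rows displayed above.
future_languages_sample = pd.DataFrame(
    {
        "2018": [38465, 32795, 28011, 22556, 20419, 16896, 15529, 15289, 12244],
        "2021": [37008, 34929, 26631, 17222, 17999, 26905, 15788, 15249, 8852],
    },
    index=["JavaScript", "Python", "SQL", "Java", "C#", "TypeScript", "Go", "C++", "PHP"],
)

# Rank 1 is the most wanted language within each year.
ranks = future_languages_sample.rank(ascending=False)

# A positive value means the language moved up between 2018 and 2021.
rank_change = (ranks["2018"] - ranks["2021"]).astype(int).sort_values(ascending=False)
print(rank_change)
```

Within this excerpt TypeScript moves up three places, while SQL and Java move down.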


In [42]:
j = display_index_values(df_list, 'database_desire_next_year')

for i in j:
    print(i)

[None 'MySQL' 'MongoDB' 'PostgreSQL' 'SQL Server' 'Redis' 'Elasticsearch'
 'SQLite' 'Microsoft Azure (Tables, CosmosDB, SQL, etc)'
 'Google Cloud Storage' 'MariaDB' 'Amazon DynamoDB' 'Cassandra'
 'Google BigQuery' 'Amazon RDS/Aurora' 'Oracle' 'Neo4j' 'Memcached'
 'Amazon Redshift' 'Apache Hive' 'Apache HBase' 'IBM Db2']
['PostgreSQL' 'MySQL' 'MongoDB' None 'Redis' 'SQLite'
 'Microsoft SQL Server' 'Elasticsearch' 'Firebase' 'MariaDB' 'Oracle'
 'DynamoDB' 'Cassandra' 'Other(s):' 'Couchbase']
[None 'PostgreSQL' 'MongoDB' 'MySQL' 'Redis' 'SQLite' 'Elasticsearch'
 'Microsoft SQL Server' 'Firebase' 'MariaDB' 'Oracle' 'DynamoDB'
 'Cassandra' 'Couchbase' 'IBM DB2']
['PostgreSQL' None 'MySQL' 'MongoDB' 'Redis' 'SQLite' 'Elasticsearch'
 'Microsoft SQL Server' 'Firebase' 'MariaDB' 'DynamoDB' 'Oracle'
 'Cassandra' 'Couchbase' 'IBM DB2']


In [47]:
# Databases
df18 = count_unique_items_in_column(df2018, 'database_worked_with').rename(index={'SQL Server': 'Microsoft SQL Server', 'IBM Db2': 'IBM DB2', 'Amazon DynamoDB': 'DynamoDB'})
for i in ['Cassandra', 'Couchbase', 'Firebase']:
    if i not in df18.index.values:
        df18.loc[i] = 0
df19 = count_unique_items_in_column(df2019, 'database_worked_with')
df20 = count_unique_items_in_column(df2020, 'database_worked_with')
df21 = count_unique_items_in_column(df2021, 'database_worked_with')
l = [df18, df19, df20, df21]

databases = _merge(l)

# future_Databases
df18 = count_unique_items_in_column(df2018, 'database_desire_next_year').rename(index={'SQL Server': 'Microsoft SQL Server', 'IBM Db2': 'IBM DB2', 'Amazon DynamoDB': 'DynamoDB'})
for i in ['Couchbase', 'Firebase']:
    if i not in df18.index.values:
        df18.loc[i] = 0
df19 = count_unique_items_in_column(df2019, 'database_desire_next_year')
df20 = count_unique_items_in_column(df2020, 'database_desire_next_year')
df21 = count_unique_items_in_column(df2021, 'database_desire_next_year')
l = [df18, df19, df20, df21]

future_databases = _merge(l)

In [49]:
databases

database_worked_with,2018,2019,2020,2021
MySQL,38909,40537,27559,35289
,32585,12857,14924,13893
Microsoft SQL Server,27293,24590,16336,18896
PostgreSQL,21776,25758,17892,28424
MongoDB,17183,19100,13086,19479
SQLite,13036,23713,15434,22634
Redis,11944,13971,9056,14552
Elasticsearch,9312,10720,6817,9331
MariaDB,8853,12401,8312,12088
Oracle,7376,12353,8155,8868


In [52]:

def merge_dfs(df_list: list, column_name: str, rename: bool = False, rename_data: dict = None, fill_missing: bool = False, missing_values: list = None) -> pd.DataFrame:
    """Count the items in a column of each yearly dataframe and merge the counts into one dataframe.

    Args:
        df_list (list): list of the four yearly dataframes, ordered from 2018 to 2021
        column_name (str): column whose items are counted in every dataframe
        rename (bool, optional): whether to rename index values of the 2018 counts. Defaults to False.
        rename_data (dict, optional): mapping of old to new index values. Defaults to None.
        fill_missing (bool, optional): whether to add items missing from the 2018 counts. Defaults to False.
        missing_values (list, optional): items to add to the 2018 counts with a count of zero. Defaults to None.

    Raises:
        ValueError: if rename is set to true but no data provided
        ValueError: if fill_missing is set to true but no data provided

    Returns:
        pd.DataFrame : merged dataframe
    """
    df18 = count_unique_items_in_column(df_list[0], column_name)

    if rename:
        if not rename_data:
            raise ValueError('Rename specified as True but no data provided')

        # align the 2018 labels with the ones used in later surveys
        df18 = df18.rename(index=rename_data)

    if fill_missing:
        if not missing_values:
            raise ValueError('Fill missing specified as True but no data provided')

        # items that do not appear in the 2018 survey get a count of zero
        for i in missing_values:
            if i not in df18.index.values:
                df18.loc[i] = 0

    df19 = count_unique_items_in_column(df_list[1], column_name)
    df20 = count_unique_items_in_column(df_list[2], column_name)
    df21 = count_unique_items_in_column(df_list[3], column_name)

    l = [df18, df19, df20, df21]

    dfs_merged = _merge(l)
    return dfs_merged

In [57]:
df_list = [df2018, df2019, df2020, df2021]
df = merge_dfs(df_list, 'database_worked_with', rename=True, rename_data={'SQL Server': 'Microsoft SQL Server', 'IBM Db2': 'IBM DB2', 'Amazon DynamoDB': 'DynamoDB'}, fill_missing=True, missing_values=['Cassandra', 'Couchbase', 'Firebase'])

df

database_worked_with,2018,2019,2020,2021
MySQL,38909,40537,27559,35289
,32585,12857,14924,13893
Microsoft SQL Server,27293,24590,16336,18896
PostgreSQL,21776,25758,17892,28424
MongoDB,17183,19100,13086,19479
SQLite,13036,23713,15434,22634
Redis,11944,13971,9056,14552
Elasticsearch,9312,10720,6817,9331
MariaDB,8853,12401,8312,12088
Oracle,7376,12353,8155,8868


It works!!! Transformations for question 1 done.

In [64]:
# question 2

def age_to_range(number: int) -> str:
    """Checks which age range a number falls within, then returns the matching label

    Args:
        number (int): number to be checked

    Returns:
        str: a string based on the number passed
    """

    if number < 18:
        return 'Under 18 years old'
    elif number >= 18 and number <= 24:
        return '18 - 24 years old'
    elif number >= 25 and number <= 30:
        return '25 - 30 years old'
    elif number >= 31 and number <= 36:
        return '31 - 36 years old'
    elif number >= 37 and number <= 42:
        return '37 - 42 years old'
    elif number >= 43 and number <= 48:
        return '43 - 48 years old'
    elif number >= 49 and number <= 54:
        return '49 - 54 years old'
    elif number >= 55 and number <= 60:
        return '55 - 60 years old'
    elif number > 60:
        return 'Over 60 years old'

def clean_age_column(age) -> str:
    """Cleans the age column of a dataframe

    Args:
        age (Any): An int, str or float representing age

    Returns:
        str: a string based on the age passed
    """
    if isinstance(age, str):
        n = age.replace(" ", "")
        if 'or' in n:
            return age_to_range(int(n[0:2]))
                
        if 'Under' in n:
            return age_to_range(int(n[5:7]))
            
        if '-' in n:
            return age_to_range((int(n[0:2]) + int(n[3:5]))//2)
                
        if 'Prefer' in n:
            return 'Prefer not to say'

    # a missing answer is treated the same as 'Prefer not to say'
    if age is None:
        return 'Prefer not to say'

    if isinstance(age, (float, int)):
        return age_to_range(round(age))

def replace_na_with_mean(df: pd.DataFrame, column_name: str) -> None:
    """Replaces na values in column of a dataframe with mean

    Args:
        df (pd.DataFrame): dataframe to be modified
        column_name (str): column in dataframe
    
    Raises:
        ValueError: if the column passed does not exist in dataframe
    """
    
    if column_name not in df.columns:
        raise ValueError(f"No column named {column_name} in dataframe.")
    
    age_list = df[column_name].to_list()
    new_age = []

    for age in age_list:
        if isinstance(age, str):
            n = age.replace(" ", "")
            if 'or' in n:
                new_age.append(int(n[0:2]))
                
            if 'Under' in n:
                new_age.append(int(n[5:7]))
            
            if '-' in n:
                new_age.append((int(n[0:2]) + int(n[3:5]))//2)
                
            if 'Prefer' in n or 'None' in n:
                new_age.append(np.nan)
        
        if isinstance(age, float):
            if np.isnan(age):
                new_age.append(age)
            else:
                new_age.append(round(age))
                
    sum_of_numbers = 0
    length_of_number = 0
    for x in new_age:
        if isinstance(x, int):
            sum_of_numbers += x
            length_of_number += 1 
    mean = round(sum_of_numbers/length_of_number)

    # assign the result back instead of calling fillna in place on a column selection
    df[column_name] = df[column_name].fillna(mean)
  
replace_na_with_mean(df2018, 'age')
replace_na_with_mean(df2019, 'age')
replace_na_with_mean(df2020, 'age')
replace_na_with_mean(df2021, 'age')

df2018['age'] = df2018['age'].apply(clean_age_column)
df2019['age'] = df2019['age'].apply(clean_age_column)
df2020['age'] = df2020['age'].apply(clean_age_column)
df2021['age'] = df2021['age'].apply(clean_age_column)
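For ages that are already numeric, the same bins can be expressed with `pd.cut`. This is a sketch under the assumption that ages are whole numbers; the sample ages are made up, and the string answers such as '18 - 24 years old' would still need the parsing done in `clean_age_column`.

```python
import numpy as np
import pandas as pd

# Upper edges are inclusive, so (17, 24] covers ages 18 to 24, and so on.
bins = [-np.inf, 17, 24, 30, 36, 42, 48, 54, 60, np.inf]
labels = [
    'Under 18 years old',
    '18 - 24 years old',
    '25 - 30 years old',
    '31 - 36 years old',
    '37 - 42 years old',
    '43 - 48 years old',
    '49 - 54 years old',
    '55 - 60 years old',
    'Over 60 years old',
]

# Made-up ages, chosen to hit the bin boundaries.
ages = pd.Series([16, 18, 24, 25, 43, 61])
age_ranges = pd.cut(ages, bins=bins, labels=labels)
print(age_ranges.tolist())
```

Missing ages stay missing with `pd.cut`, so they can be handled in a separate, explicit step.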

In [69]:
# age = pd.DataFrame(df2018['age'], columns=['age'])
df_list = [df2018, df2019, df2020, df2021]
age = merge_dfs(df_list, 'age')
age

age,2018,2019,2020,2021
25 - 30 years old,66040,35593,14463,33600
18 - 24 years old,16887,19428,10672,26369
37 - 42 years old,11477,8353,5141,15183
49 - 54 years old,3313,2019,1348,5472
55 - 60 years old,959,1072,711,1819
Over 60 years old,179,730,518,421


In [74]:
j = display_index_values(df_list, 'gender')

for i in j:
    print(i)

['Male' None 'Female' 'Non-binary, genderqueer, or gender non-conforming'
 'Transgender']
['Man' 'Woman' None 'Transgender'
 'Non-binary, genderqueer, or gender non-conforming']
['Man' None 'Woman' 'Transgender'
 'Non-binary, genderqueer, or gender non-conforming']
['Man' 'Woman' None 'Transgender'
 'Non-binary, genderqueer, or gender non-conforming' 'Prefer not to say'
 'Or, in your own words:']


In [None]:

# The 2018 dataset uses 'Male' and 'Female' while the other datasets use 'Man' and 'Woman',
# so rename the 2018 choices to match
list_of_choices = []
for item in df2018['gender'].to_list():
    if isinstance(item, str):
        # replace both labels so that multi-select answers such as 'Male;Female' are fully renamed
        item = item.replace('Female', 'Woman').replace('Male', 'Man')

    # missing answers are kept as they are so the list stays as long as the column
    list_of_choices.append(item)
  
df2018['gender'] = list_of_choices
df2018['gender'].unique()

In [71]:

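def add_trans_option(df: pd.DataFrame) -> list:
    """Combines the gender and transgender columns into one list of gender choices

    Args:
        df (pd.DataFrame): dataframe with 'gender' and 'transgender' columns

    Returns:
        list: one entry per row, with ';Transgender' appended where the respondent answered 'Yes'
    """
    e = []
    for gender, choice in zip(df['gender'].to_list(), df['transgender'].to_list()):
        
        if isinstance(choice, str) and isinstance(gender, str): 
            if 'Yes' in choice:
                e.append(gender +  ';Transgender')
            elif 'No' in choice:
                e.append(gender)
            else:
                # 'Prefer not to say', 'Or, in your own words:' or any other answer
                e.append(None)
        else:
            # one of the two answers is missing, so keep the gender as it is
            e.append(gender)
        
    return e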
    

In [73]:
df2019['gender'] = add_trans_option(df2019)
df2020['gender'] = add_trans_option(df2020)
df2021['gender'] = add_trans_option(df2021)

df_list = [df2018, df2019, df2020, df2021]
gender = merge_dfs(df_list, 'gender', rename=True, rename_data={'Male': 'Man', 'Female': 'Woman'})
gender

gender,2018,2019,2020,2021
Man,59785,78302,46236,74320
,34386,3477,13904,3503
Woman,4409,6709,4038,4261
"Non-binary, genderqueer, or gender non-conforming",595,1011,624,955
Transgender,423,1990,900,2060
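Since the number of answers differs from year to year, shares are easier to compare than raw counts. The sketch below copies the four named rows from the table above (the blank row for missing answers is left out) and divides each count by its year's total. Respondents could pick several options, so these are shares of selections rather than of people.

```python
import pandas as pd

# Counts copied from the table above, without the row of missing answers.
gender_sample = pd.DataFrame(
    {
        "2018": [59785, 4409, 595, 423],
        "2019": [78302, 6709, 1011, 1990],
        "2020": [46236, 4038, 624, 900],
        "2021": [74320, 4261, 955, 2060],
    },
    index=[
        "Man",
        "Woman",
        "Non-binary, genderqueer, or gender non-conforming",
        "Transgender",
    ],
)

# Dividing by the column totals aligns on the year labels.
shares = gender_sample / gender_sample.sum()
print(shares.round(3))
```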


## Loading the data into a database

In [118]:
# load that data to production db
# raw string so that the backslashes in the Windows path are not read as escape sequences
production_credentials = get_credentials(r"..\sof_sa\conf\prod_db_credentials.json")
load_data_into_db("languages", production_credentials, df=languages)
load_data_into_db("future_languages", production_credentials, df=future_languages)
load_data_into_db("databases", production_credentials, df=databases)
load_data_into_db("future_databases", production_credentials, df=future_databases)
load_data_into_db("platforms", production_credentials, df=platforms)
load_data_into_db("future_platforms", production_credentials, df=future_platforms)
load_data_into_db("web_frameworks", production_credentials, df=web_frameworks)
load_data_into_db("future_web_frameworks", production_credentials, df=future_web_frameworks)
load_data_into_db("age", production_credentials, df=age)
load_data_into_db("gender", production_credentials, df=gender)

dataframe created from dataframe
dataframe created from dataframe
dataframe created from dataframe
dataframe created from dataframe
dataframe created from dataframe
dataframe created from dataframe
dataframe created from dataframe
dataframe created from dataframe
dataframe created from dataframe


NameError: name 'gender' is not defined
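`get_credentials` and `load_data_into_db` are project helpers that are not shown in this notebook. To try the loading step without them, a dataframe can be written to a database with pandas' own `to_sql`. The snippet below is a stand-in sketch only: it uses an in-memory SQLite database and made-up counts, not the production database or the real tables.

```python
import sqlite3

import pandas as pd

# Made-up counts standing in for one of the merged tables.
languages_sample = pd.DataFrame(
    {"2018": [10, 7], "2021": [12, 9]},
    index=pd.Index(["Python", "SQL"], name="language"),
)

# In-memory SQLite database, used here only as a stand-in.
conn = sqlite3.connect(":memory:")

# The index is written as a regular column named after the index.
languages_sample.to_sql("languages", conn, if_exists="replace")

# Read the table back to check that the load worked.
loaded = pd.read_sql("SELECT * FROM languages", conn, index_col="language")
conn.close()
print(loaded)
```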