# Exploratory Analysis

**CS 5306 - Project 1**

*Team Members: Wenren Zhou (wz366), Alicia Chen (ac2596), Julie Phan (jp2254)*

**Question:** How do user demographics impact contribution and participation on Stack Overflow?

**Data Source:**
Stack Overflow Annual Developer Survey (https://insights.stackoverflow.com/survey)
We will use Stack Overflow's annual developer survey, one of the largest and most comprehensive surveys of software developers. The data is made available under the Open Database License.

**Details on Survey (overview of the survey topics):**
The survey has six sections: (1) Basic Information; (2) Education, Work, and Career; (3) Technology and Tech Culture; (4) Stack Overflow Usage + Community; (5) Demographic Information; and (6) Final Qs. Most questions are optional, and the results are anonymized to protect participants' personal information. With more than 80,000 responses from over 180 countries and territories, the survey collects wide-ranging information about participants, from identity and educational background to developer experience, including programming skills and participation in open source communities.


Some questions that we will explore include: Which demographic groups (age, gender, ethnicity, sexual orientation, disability status, and mental health) have the highest proportion of engagement? The lowest? Which groups are (proportionally) most likely and least likely to be contributors and not just users of the platform? We will also investigate intersectional behavior and trends, such as how non-straight people of color interact with the platform, and how that might differ from heterosexual individuals of color or non-straight white individuals. Lastly, we will look at how the demographics of Stack Overflow users compare to the demographics of other similar spaces (e.g., Quora).

In [61]:
import pandas
import matplotlib.pyplot as plt
from pandas.api.types import CategoricalDtype
import pprint

In [5]:
# loading the dataset
survey = pandas.read_csv("/Users/juliexphan/Documents/1 - Fall 2021/4 - CS 5306/cs5306-p1/survey_results_public.csv")
survey.head()

Unnamed: 0,ResponseId,MainBranch,Employment,Country,US_State,UK_Country,EdLevel,Age1stCode,LearnCode,YearsCode,...,Age,Gender,Trans,Sexuality,Ethnicity,Accessibility,MentalHealth,SurveyLength,SurveyEase,ConvertedCompYearly
0,1,I am a developer by profession,"Independent contractor, freelancer, or self-em...",Slovakia,,,"Secondary school (e.g. American high school, G...",18 - 24 years,Coding Bootcamp;Other online resources (ex: vi...,,...,25-34 years old,Man,No,Straight / Heterosexual,White or of European descent,None of the above,None of the above,Appropriate in length,Easy,62268.0
1,2,I am a student who is learning to code,"Student, full-time",Netherlands,,,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",11 - 17 years,"Other online resources (ex: videos, blogs, etc...",7.0,...,18-24 years old,Man,No,Straight / Heterosexual,White or of European descent,None of the above,None of the above,Appropriate in length,Easy,
2,3,"I am not primarily a developer, but I write co...","Student, full-time",Russian Federation,,,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",11 - 17 years,"Other online resources (ex: videos, blogs, etc...",,...,18-24 years old,Man,No,Prefer not to say,Prefer not to say,None of the above,None of the above,Appropriate in length,Easy,
3,4,I am a developer by profession,Employed full-time,Austria,,,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",11 - 17 years,,,...,35-44 years old,Man,No,Straight / Heterosexual,White or of European descent,I am deaf / hard of hearing,,Appropriate in length,Neither easy nor difficult,
4,5,I am a developer by profession,"Independent contractor, freelancer, or self-em...",United Kingdom of Great Britain and Northern I...,,England,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",5 - 10 years,Friend or family member,17.0,...,25-34 years old,Man,No,,White or of European descent,None of the above,,Appropriate in length,Easy,


In [6]:
# Survey size: total number of cells (rows x columns), not the number of respondents
survey.size

4005072

In [14]:
total_respondents = len(survey)
print("Total Respondents:", total_respondents)

Total Respondents: 83439
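
The two numbers above measure different things: `survey.size` counts cells, while `len(survey)` counts rows. A minimal sketch of the distinction on a hypothetical toy frame (`toy` is not part of the survey data), followed by the same arithmetic applied to the two figures reported above:

```python
import pandas as pd

# Hypothetical toy frame: 3 rows, 2 columns.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

n_rows, n_cols = toy.shape
# size is always rows * columns
assert toy.size == n_rows * n_cols

# Same relationship applied to the values printed above:
# 4005072 cells spread over 83439 rows.
n_columns = 4005072 // 83439
print("Columns implied by the reported size:", n_columns)
```

Using `survey.shape` directly would report both the row and column counts in one call.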


In [7]:
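# Parse Data Function - counts each option separately when a respondent selected several
def simplify_data(demographic_df):
    """Return {option: count} for a multi-select survey column (NaN responses are ignored)."""
    df_dict = {}
    # value_counts() drops NaN and yields (response string, count) pairs
    for index, value in demographic_df.value_counts().items():
        # Multi-select responses are stored as "A;B;C"
        keys = index.split(";")
        for k in keys:
            # Credit the respondent count to every option they selected
            df_dict[k] = df_dict.get(k, 0) + value
    return df_dict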
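
The same per-option counting can also be expressed with pandas string methods. The sketch below is one possible alternative, not the notebook's method; it runs on a hypothetical toy Series (`responses`) rather than the real `survey['Gender']` column.

```python
import pandas as pd

# Hypothetical toy responses; the real column would be survey['Gender'].
responses = pd.Series(["Man", "Woman", "Man;Woman", None, "Man"])

# Split each response on ";" and give every selected option its own row.
selections = responses.dropna().str.split(";").explode()

# Count how many respondents picked each option.
counts = selections.value_counts().to_dict()
print(counts)
```

Because a respondent who selected two options is counted once under each, the per-option totals can sum to more than the number of respondents.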

## Gender

In [8]:
# Gender - raw values of all responses ("check all that apply")
# Survey results do not actually include the "Or, in your own words" responses 
# NaN dropped from dataset
gender = survey['Gender'].value_counts()
gender_df = pandas.DataFrame(gender)
gender_df

Unnamed: 0,Gender
Man,74817
Woman,4120
Prefer not to say,1442
"Non-binary, genderqueer, or gender non-conforming",690
"Or, in your own words:",413
"Man;Or, in your own words:",268
"Man;Non-binary, genderqueer, or gender non-conforming",252
"Woman;Non-binary, genderqueer, or gender non-conforming",147
Man;Woman,41
"Non-binary, genderqueer, or gender non-conforming;Or, in your own words:",21


In [78]:
# Simplified data for gender 
gender_data = survey['Gender'].copy()
gender_simp = simplify_data(gender_data)
gender_simp_df = pandas.DataFrame(list(gender_simp.items()),columns=["Gender","# People"])
gender_simp_df.set_index("Gender")

Unnamed: 0_level_0,# People
Gender,Unnamed: 1_level_1
Man,75428
Woman,4372
Prefer not to say,1442
"Non-binary, genderqueer, or gender non-conforming",1168
"Or, in your own words:",756


## Age

In [8]:
# Age - sorted in the way that it appears in the survey
# NaN values are dropped 
age = survey['Age'].value_counts()
age_df = pandas.DataFrame(age, index = ['Under 18 years old', '18-24 years old', '25-34 years old', 
                                        '35-44 years old','45-54 years old', '55-64 years old', 
                                        '65 years or older', 'Prefer not to say'])
age_df

Unnamed: 0,Age
Under 18 years old,5376
18-24 years old,20993
25-34 years old,32568
35-44 years old,15183
45-54 years old,5472
55-64 years old,1819
65 years or older,421
Prefer not to say,575
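
Since the research question is about proportions rather than raw counts, the counts can be converted to shares. A minimal sketch, assuming the counts are copied by hand from the age table above (in the notebook they would come from `age_df` directly); `age_counts` and `age_share` are names introduced here for illustration:

```python
import pandas as pd

# Counts copied from the age table above.
age_counts = pd.Series({
    "Under 18 years old": 5376,
    "18-24 years old": 20993,
    "25-34 years old": 32568,
    "35-44 years old": 15183,
    "45-54 years old": 5472,
    "55-64 years old": 1819,
    "65 years or older": 421,
    "Prefer not to say": 575,
})

# Percentage of respondents who answered the age question (NaN already excluded).
age_share = (age_counts / age_counts.sum() * 100).round(1)
print(age_share)
```

Note that the denominator here is the number of people who answered the question, not the total number of respondents.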


## Ethnicity

In [80]:
# Ethnicity - sorted the way it appears in the survey
'''
Black or of African descent
East Asian
Hispanic or Latino/a/x
Middle Eastern
White or of European descent
Biracial
Indigenous (such as Native American, Pacific Islander, or Indigenous Australian)
South Asian
Multiracial
Southeast Asian
I don't know
Prefer not to say
'''
# Removed "Or, in your own words" -> value is not specified in dataset

ethnicity_df = pandas.DataFrame(survey['Ethnicity'].value_counts())

In [81]:
# Simplified data for ethnicity
ethnicity_data = survey['Ethnicity'].copy()
ethnicity_simp = simplify_data(ethnicity_data)
ethnicity_simp_df = pandas.DataFrame(list(ethnicity_simp.items()),columns=["Ethnicity","# People"])
ethnicity_simp_df.set_index("Ethnicity")

Unnamed: 0_level_0,# People
Ethnicity,Unnamed: 1_level_1
White or of European descent,46434
South Asian,9214
Hispanic or Latino/a/x,5570
Southeast Asian,4083
Prefer not to say,3062
Middle Eastern,4222
East Asian,3735
I don't know,2684
Black or of African descent,2686
"Or, in your own words:",2916


## Sexual Orientation

## Years of Experience

## Gender x Ethnicity 

In [40]:
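def simplify_data_2d(combined_df):
    """Count every (d1, d2) pair from "d1a;d1b|d2a;d2b" strings; returns {"d1|d2": count}."""
    df_dict = {}
    # value_counts() drops NaN, i.e. respondents missing either answer
    for index, value in combined_df.value_counts().items():
        full = index.split("|")
        d1_keys = full[0].split(";")
        d2_keys = full[1].split(";")

        # Credit the count to every combination of selected options
        for d1 in d1_keys:
            for d2 in d2_keys:
                combined_key = d1 + "|" + d2
                df_dict[combined_key] = df_dict.get(combined_key, 0) + value

    return df_dict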

In [52]:
def simplify_data_2d_nested_dict(combined_df):
    """Same counts as simplify_data_2d, returned as {d1: {d2: count}}."""
    df_dict = {}
    for index, value in combined_df.value_counts().items():
        full = index.split("|")
        d1_keys = full[0].split(";")
        d2_keys = full[1].split(";")

        for d1 in d1_keys:
            # Create the inner dict the first time this d1 option appears
            inner = df_dict.setdefault(d1, {})
            for d2 in d2_keys:
                inner[d2] = inner.get(d2, 0) + value

    return df_dict
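
The two-dimensional count can also be sketched with `explode` and `pandas.crosstab`. This is an alternative illustration, not the notebook's method, and it uses a hypothetical toy frame (`toy`) in place of `survey[['Gender', 'Ethnicity']]`.

```python
import pandas as pd

# Hypothetical toy data standing in for survey[['Gender', 'Ethnicity']].
toy = pd.DataFrame({
    "Gender": ["Man", "Man;Woman", "Woman", None],
    "Ethnicity": ["East Asian;Biracial", "South Asian", "East Asian", "South Asian"],
})

# Drop rows missing either answer, then split both multi-select columns into lists.
pairs = toy.dropna().copy()
pairs["Gender"] = pairs["Gender"].str.split(";")
pairs["Ethnicity"] = pairs["Ethnicity"].str.split(";")

# Explode one column at a time so each row holds a single (gender, ethnicity) pair.
pairs = pairs.explode("Gender").explode("Ethnicity")

# Rows are ethnicities and columns are genders.
table = pd.crosstab(pairs["Ethnicity"], pairs["Gender"])
print(table)
```

As with the dictionary versions, a respondent who selected several options contributes to every matching cell.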

In [41]:
combined_df = survey['Gender'] +"|" + survey['Ethnicity']

print(combined_df)

0        Man|White or of European descent
1        Man|White or of European descent
2                   Man|Prefer not to say
3        Man|White or of European descent
4        Man|White or of European descent
                       ...               
83434    Man|White or of European descent
83435     Man|Black or of African descent
83436    Man|White or of European descent
83437    Man|White or of European descent
83438          Man|Hispanic or Latino/a/x
Length: 83439, dtype: object


In [53]:
gender_ethnicity_dict = simplify_data_2d(combined_df)
gender_ethnicity_dict2 = simplify_data_2d_nested_dict(combined_df)

In [63]:
pandas.DataFrame.from_dict(gender_ethnicity_dict2)

Unnamed: 0,Man,Woman,Prefer not to say,"Non-binary, genderqueer, or gender non-conforming","Or, in your own words:"
White or of European descent,43253,2279,246,835,363
South Asian,8570,506,72,57,54
Hispanic or Latino/a/x,5156,306,31,113,61
Southeast Asian,3677,296,59,60,57
Middle Eastern,3904,237,27,74,56
East Asian,3289,328,44,102,49
Prefer not to say,2048,151,796,30,36
Black or of African descent,2418,198,29,56,51
I don't know,2366,170,78,46,75
"Or, in your own words:",2559,167,26,59,235


In [60]:
pprint.pprint(gender_ethnicity_dict2)

{'Man': {'Biracial': 790,
         'Black or of African descent': 2418,
         'East Asian': 3289,
         'Hispanic or Latino/a/x': 5156,
         "I don't know": 2366,
         'Indigenous (such as Native American, Pacific Islander, or Indigenous Australian)': 481,
         'Middle Eastern': 3904,
         'Multiracial': 1147,
         'Or, in your own words:': 2559,
         'Prefer not to say': 2048,
         'South Asian': 8570,
         'Southeast Asian': 3677,
         'White or of European descent': 43253},
 'Non-binary, genderqueer, or gender non-conforming': {'Biracial': 65,
                                                       'Black or of African descent': 56,
                                                       'East Asian': 102,
                                                       'Hispanic or Latino/a/x': 113,
                                                       "I don't know": 46,
                                                       'Indigenous (such as Nati

In [58]:
for key, value in sorted(gender_ethnicity_dict.items(), key=lambda x: x[0]): 
    print("{} : {}".format(key, value))

Man|Biracial : 790
Man|Black or of African descent : 2418
Man|East Asian : 3289
Man|Hispanic or Latino/a/x : 5156
Man|I don't know : 2366
Man|Indigenous (such as Native American, Pacific Islander, or Indigenous Australian) : 481
Man|Middle Eastern : 3904
Man|Multiracial : 1147
Man|Or, in your own words: : 2559
Man|Prefer not to say : 2048
Man|South Asian : 8570
Man|Southeast Asian : 3677
Man|White or of European descent : 43253
Non-binary, genderqueer, or gender non-conforming|Biracial : 65
Non-binary, genderqueer, or gender non-conforming|Black or of African descent : 56
Non-binary, genderqueer, or gender non-conforming|East Asian : 102
Non-binary, genderqueer, or gender non-conforming|Hispanic or Latino/a/x : 113
Non-binary, genderqueer, or gender non-conforming|I don't know : 46
Non-binary, genderqueer, or gender non-conforming|Indigenous (such as Native American, Pacific Islander, or Indigenous Australian) : 39
Non-binary, genderqueer, or gender non-conforming|Middle Eastern : 74
N

In [33]:
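categories = ["Gender", "Ethnicity"]
def get_demos(full_df):
    result = full_df[categories[0]]
    # Early return: only the first category is used for now,
    # so the loop below that appends the remaining categories never runs.
    return result
    for i in range(1, len(categories)):
        c = categories[i]
        result = result + "|" + full_df[c]
    return result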

In [34]:
test_df.agg(get_demos, axis=1)

0        Man
1        Man
2        Man
3        Man
4        Man
        ... 
83434    Man
83435    Man
83436    Man
83437    Man
83438    Man
Length: 83439, dtype: object
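
The output above contains only the gender because `get_demos` returns before its loop runs. A minimal sketch of how all listed categories could be joined column-wise instead of row by row; `toy` and `combine_demos` are hypothetical names introduced for illustration, and the toy rows are not survey data:

```python
import pandas as pd

categories = ["Gender", "Ethnicity"]

# Hypothetical toy rows; the real frame would be the survey data.
toy = pd.DataFrame({
    "Gender": ["Man", "Woman", None],
    "Ethnicity": ["South Asian", None, "East Asian"],
})

def combine_demos(df, columns):
    """Join the given columns with "|"; a row with any missing answer becomes NaN."""
    result = df[columns[0]]
    for col in columns[1:]:
        result = result + "|" + df[col]
    return result

combined = combine_demos(toy, categories)
print(combined)
```

Operating on whole columns avoids calling a Python function once per row, which matters for a frame with tens of thousands of rows.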