# Exploratory Analysis

**CS 5306 - Project 1**

*Team Members: Wenren Zhou (wz366), Alicia Chen (ac2596), Julie Phan (jp2254)*

**Question:** How do user demographics impact contribution and participation on Stack Overflow?

**Data Source:**
Stack Overflow Annual Developer Survey (https://insights.stackoverflow.com/survey)
We will use Stack Overflow's annual developer survey, one of the largest and most comprehensive surveys of software developers. The data is made available under the Open Database License.

**Details on Survey (overview of the survey topics):**
The survey has six sections: (1) Basic Information; (2) Education, Work, and Career; (3) Technology and Tech Culture; (4) Stack Overflow Usage + Community; (5) Demographic Information; (6) Final Questions. Most questions are optional, and the results are anonymized to protect participants' personal information. With more than 80,000 responses from over 180 countries and territories, the survey collects wide-ranging information about participants, from identity and educational background to developer experience, programming skills, and participation in open source communities.


Questions we will explore include: Which demographic groups (by age, gender, ethnicity, sexual orientation, disability status, and mental health) have the highest proportion of engagement, and which have the lowest? Which groups are proportionally most and least likely to be contributors rather than only users of the platform? We will also investigate intersectional behavior and trends, such as how non-straight people of color interact with the platform and how that differs from straight people of color or from non-straight white individuals. Lastly, we will compare the demographics of Stack Overflow users with those of similar spaces (e.g., Quora).
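
One way to compare groups proportionally is a row-normalized cross-tabulation. The following is a minimal sketch on made-up data, not results from the survey. The `Contributes` column is a hypothetical stand-in, since the survey's actual participation columns are not shown in this notebook.

```python
import pandas

# Made-up respondents for illustration only.
# "Contributes" is a hypothetical yes/no flag, not a real survey column.
toy = pandas.DataFrame({
    "Gender": ["Man", "Man", "Man", "Woman", "Woman", "Woman"],
    "Contributes": ["Yes", "No", "Yes", "Yes", "No", "No"],
})

# normalize="index" converts each row to within-group proportions,
# so groups of very different sizes can be compared directly.
rates = pandas.crosstab(toy["Gender"], toy["Contributes"], normalize="index")
print(rates)
```

Normalizing within each group matters here because the groups differ greatly in size, so raw counts alone would mostly reflect who answered the survey.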

In [1]:
import pandas
import matplotlib.pyplot as plt
from pandas.api.types import CategoricalDtype

In [2]:
# loading the dataset
survey = pandas.read_csv("/Users/alicia/FA21/cs5306-p1/survey_results_public.csv")
survey.head()

Unnamed: 0,ResponseId,MainBranch,Employment,Country,US_State,UK_Country,EdLevel,Age1stCode,LearnCode,YearsCode,...,Age,Gender,Trans,Sexuality,Ethnicity,Accessibility,MentalHealth,SurveyLength,SurveyEase,ConvertedCompYearly
0,1,I am a developer by profession,"Independent contractor, freelancer, or self-em...",Slovakia,,,"Secondary school (e.g. American high school, G...",18 - 24 years,Coding Bootcamp;Other online resources (ex: vi...,,...,25-34 years old,Man,No,Straight / Heterosexual,White or of European descent,None of the above,None of the above,Appropriate in length,Easy,62268.0
1,2,I am a student who is learning to code,"Student, full-time",Netherlands,,,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",11 - 17 years,"Other online resources (ex: videos, blogs, etc...",7.0,...,18-24 years old,Man,No,Straight / Heterosexual,White or of European descent,None of the above,None of the above,Appropriate in length,Easy,
2,3,"I am not primarily a developer, but I write co...","Student, full-time",Russian Federation,,,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",11 - 17 years,"Other online resources (ex: videos, blogs, etc...",,...,18-24 years old,Man,No,Prefer not to say,Prefer not to say,None of the above,None of the above,Appropriate in length,Easy,
3,4,I am a developer by profession,Employed full-time,Austria,,,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",11 - 17 years,,,...,35-44 years old,Man,No,Straight / Heterosexual,White or of European descent,I am deaf / hard of hearing,,Appropriate in length,Neither easy nor difficult,
4,5,I am a developer by profession,"Independent contractor, freelancer, or self-em...",United Kingdom of Great Britain and Northern I...,,England,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",5 - 10 years,Friend or family member,17.0,...,25-34 years old,Man,No,,White or of European descent,None of the above,,Appropriate in length,Easy,


In [3]:
# Survey size: total number of cells (rows x columns), not the number of responses
survey.size

4005072
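
`DataFrame.size` counts cells rather than respondents, so the figure above is rows times columns. A small sketch on a toy frame (the values are made up) shows how `shape` separates the two numbers:

```python
import pandas

# Toy stand-in for the survey: 3 respondents, 2 columns (hypothetical values).
toy = pandas.DataFrame({
    "Age": ["18-24 years old", "25-34 years old", None],
    "Gender": ["Man", "Woman", "Man"],
})

n_cells = toy.size           # rows * columns, missing values included
n_rows, n_cols = toy.shape   # respondents and columns, reported separately

print(n_cells, n_rows, n_cols)
```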

In [24]:
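# Parse helper: tally each option separately when a response holds several ";"-separated selections
def simplify_data(demographic_df):
    """Return a dict mapping each individual option to its total count.

    `demographic_df` is a single survey column (a pandas Series); NaN responses
    are excluded because value_counts() drops them by default.
    """
    df_dict = {}
    for index, value in demographic_df.value_counts().items():
        # A multi-select response such as "Man;Woman" counts toward each option.
        for k in index.split(";"):
            df_dict[k] = df_dict.get(k, 0) + value
    return df_dict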
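
The same tally can also be written with pandas string methods instead of an explicit loop. This is a sketch of an alternative, not a replacement for the function above. The name `simplify_data_vectorized` and the toy answers are made up for illustration.

```python
import pandas

def simplify_data_vectorized(responses):
    """Count each option once per respondent who selected it.

    `responses` is a Series of ";"-separated answers; missing values are skipped.
    """
    # Split each answer into a list, then give every selection its own row.
    options = responses.dropna().str.split(";").explode()
    return options.value_counts().to_dict()

# Hypothetical answers, made up for illustration.
toy = pandas.Series(["Man", "Man;Woman", "Woman", None, "Man;Or, in your own words:"])
counts = simplify_data_vectorized(toy)
print(counts)
```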

## Gender

In [12]:
# Gender: raw counts of every distinct response (the question was "check all that apply")
# The public results do not include the free-text "Or, in your own words" answers
# value_counts() excludes NaN by default
gender = survey['Gender'].value_counts()
gender_df = pandas.DataFrame(gender)
gender_df

Unnamed: 0,Gender
Man,74817
Woman,4120
Prefer not to say,1442
"Non-binary, genderqueer, or gender non-conforming",690
"Or, in your own words:",413
"Man;Or, in your own words:",268
"Man;Non-binary, genderqueer, or gender non-conforming",252
"Woman;Non-binary, genderqueer, or gender non-conforming",147
Man;Woman,41
"Non-binary, genderqueer, or gender non-conforming;Or, in your own words:",21


In [30]:
# Simplified data for gender 
gender_data = survey['Gender'].copy()
gender_simp = simplify_data(gender_data)
gender_simp

{'Man': 75428,
 'Woman': 4372,
 'Prefer not to say': 1442,
 'Non-binary, genderqueer, or gender non-conforming': 1168,
 'Or, in your own words:': 756}
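
To compare options on a common scale, the simplified counts can be converted to percentage shares. A minimal sketch follows, using the counts from the output above. Because respondents could select several options, the denominator here is the total number of selections, not the number of respondents.

```python
import pandas

# Counts copied from the simplified gender output above.
gender_simp = {
    "Man": 75428,
    "Woman": 4372,
    "Prefer not to say": 1442,
    "Non-binary, genderqueer, or gender non-conforming": 1168,
    "Or, in your own words:": 756,
}

counts = pandas.Series(gender_simp)

# Share of all selections, in percent, rounded to one decimal place.
shares = (counts / counts.sum() * 100).round(1)
print(shares)
```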

## Age

In [8]:
# Age: counts reindexed into the order the options appear in the survey
# value_counts() excludes NaN by default
age = survey['Age'].value_counts()
age_df = pandas.DataFrame(age, index = ['Under 18 years old', '18-24 years old', '25-34 years old', 
                                        '35-44 years old','45-54 years old', '55-64 years old', 
                                        '65 years or older', 'Prefer not to say'])
age_df

Unnamed: 0,Age
Under 18 years old,5376
18-24 years old,20993
25-34 years old,32568
35-44 years old,15183
45-54 years old,5472
55-64 years old,1819
65 years or older,421
Prefer not to say,575
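
Another way to keep the age brackets in survey order is an ordered categorical type, which the `CategoricalDtype` import at the top of the notebook supports. Here is a sketch on made-up responses. The labels match the age options shown above, while the toy values and variable names are assumptions.

```python
import pandas
from pandas.api.types import CategoricalDtype

# Age options in the order shown in the table above.
age_order = [
    "Under 18 years old", "18-24 years old", "25-34 years old",
    "35-44 years old", "45-54 years old", "55-64 years old",
    "65 years or older", "Prefer not to say",
]
age_dtype = CategoricalDtype(categories=age_order, ordered=True)

# Hypothetical responses, made up for illustration.
toy = pandas.Series([
    "25-34 years old", "Under 18 years old",
    "25-34 years old", "65 years or older",
]).astype(age_dtype)

# Sorting a categorical index follows the category order, not the alphabet.
ordered_counts = toy.value_counts().sort_index()
print(ordered_counts)
```

With this approach, brackets that nobody selected still appear in the result with a count of zero, which keeps tables and plots aligned across groups.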


## Ethnicity

In [9]:
# Ethnicity options as listed in the survey
# ("Or, in your own words" is omitted because its free-text value is not in the dataset):
#   Black or of African descent
#   East Asian
#   Hispanic or Latino/a/x
#   Middle Eastern
#   White or of European descent
#   Biracial
#   Indigenous (such as Native American, Pacific Islander, or Indigenous Australian)
#   South Asian
#   Multiracial
#   Southeast Asian
#   I don't know
#   Prefer not to say
# Note: value_counts() orders rows by frequency, not by survey order

ethnicity_df = pandas.DataFrame(survey['Ethnicity'].value_counts())
ethnicity_df

Unnamed: 0,Ethnicity
White or of European descent,42671
South Asian,8328
Hispanic or Latino/a/x,3585
Southeast Asian,3224
Prefer not to say,3062
...,...
"Hispanic or Latino/a/x;Middle Eastern;Indigenous (such as Native American, Pacific Islander, or Indigenous Australian)",1
I don't know;Multiracial;Middle Eastern;Biracial,1
"White or of European descent;I don't know;Biracial;Or, in your own words:",1
I don't know;Hispanic or Latino/a/x;Black or of African descent,1


In [10]:
ethnicity_df.to_csv("/Users/alicia/FA21/cs5306-p1/.csv")

## Sexual Orientation

## Years of Experience