As aspiring data scientists, we wield a lot of power. Every decision made today, whether it be deciding prices on a McDonalds menu or predicting the winner of an election, is increasingly relying on drawing conclusions from data. However, with this great power comes great responsibility as well. We have to ensure that the decisions we make are fair and unbiased. 

If we start our experiment with a goal of achieving a certain result, we shouldn't keep tweaking our methodology as we're running the experiment until we find an appropriate finding. A responsible approach would be to initially agree on a methodology, and run the experiment without "peeking" at the raw data. Only when the analysis is finished should we examine our result. This approach is called "blind analysis," and we will investigate how it works and why it's so important!

In [57]:
# Just run this cell - it contains a bunch of helpful libraries, some of which you may be familar with!
from datascience import * # This is the library used in Data 8!
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import ipywidgets as widgets
import numpy as np
from IPython.display import display

Let's take the example of Simpson's paradox. This is a statistical phenomenon where we see a trend in many groups of data, but this trend seemingly reverses when the groups are aggregated. A famous case of this actually involves our school! In 1973, researchers wanted to investigate possible gender bias in graduate school admissions at UC Berkeley. Take a look at the data they found below: 

In [58]:
table = Table().with_columns("", ["Men", "Women"], "Applicants", [8442, 4321], "Accepted", ["44%", "35%"])
table

Unnamed: 0,Applicants,Accepted
Men,8442,44%
Women,4321,35%


Just based on this data, what would you conclude? Does it look like there's a possible bias here?

Researchers certainly thought so! That is, until they delved deeper into the data. If we look at a breakdown examining enrollment by department, for the six largest departments, the data paints a different picture. 

In [59]:
table = Table().with_columns("", ["A", "B", "C", "D", "E", "F"], "Male Applicants", [825, 560, 325, 417, 191, 373], 
                            "Male Acceptance", ["62%", "63%", "37%", "33%", "28%", "6%"], 
                            "Female Applicants", [108, 25, 593, 375, 393, 341], 
                            "Female Acceptance", ["82%", "68%", "34%", "35%", "24%", "7%"])
table

Unnamed: 0,Male Applicants,Male Acceptance,Female Applicants,Female Acceptance
A,825,62%,108,82%
B,560,63%,25,68%
C,325,37%,593,34%
D,417,33%,375,35%
E,191,28%,393,24%
F,373,6%,341,7%


Now, we get some evidence contrary to our initial claims! 4 of the top 6 departments seem to be biased against men.


This phenomenon is really unintuitive to understand at first, so don't worry if it takes a while to sink in. Because it's so hard to grasp, it can easily be exploited by people that misapply statistics. If we wanted to demonstrate that the admissions process is biased towards men, we could point to the first table as proof. On the other hand, if we wanted to prove that women are the beneficaries of this bias, we could just refer to the above statistic! 

So who's right? It all depends on what methodology you follow. If a researcher were to "peek" at the raw findings here, and find a result that clashes with their goal, they could easily switch to the other methodology and obtain a satisfactory finding. This is why it's important to carefully decide your methodology at the start, and not proceed until you're absolutely sure that it is the fairest and most complete method. Only then should we finalize it, and run our experiment without tweaking the method. 

In [89]:
expenditures = Table.from_df(pd.read_csv("californiaDDSDataV2.csv"))

In [91]:
expenditures

Id,Age Cohort,Age,Gender,Expenditures,Ethnicity
10210,13 to 17,17,Female,2113,White not Hispanic
10409,22 to 50,37,Male,41924,White not Hispanic
10486,0 to 5,3,Male,1454,Hispanic
10538,18 to 21,19,Female,6400,Hispanic
10568,13 to 17,13,Male,4412,White not Hispanic
10690,13 to 17,15,Female,4566,Hispanic
10711,13 to 17,13,Female,3915,White not Hispanic
10778,13 to 17,17,Male,3873,Black
10820,13 to 17,14,Female,5021,White not Hispanic
10823,13 to 17,13,Male,2887,Hispanic


CLAIM: White Non-Hispanics were receiving more funding than Hispanics

In [98]:
ethnicity_grouped = expenditures.group("Ethnicity", np.mean).select("Ethnicity", "Expenditures mean")
ethnicity_grouped

Ethnicity,Expenditures mean
American Indian,36438.2
Asian,18392.4
Black,20884.6
Hispanic,11065.6
Multi Race,4456.73
Native Hawaiian,42782.3
Other,3316.5
White not Hispanic,24697.5


In [154]:
w = widgets.Dropdown(
    options=["Age Cohort", "Age", "Gender"],
    value="Age Cohort",
    description='Choose column to correlate with average expenditure:',
)
display(w)

A Jupyter Widget

In [174]:
grouped = expenditures.group(w.value, np.mean)
if w.value == "Age Cohort":
    rows = [grouped.row(i) for i in range(1, 5)]
    grouped.remove([1, 2, 3, 4])
    for i in range(1, 4):
        grouped.append(rows[i])
grouped.select(w.value, "Expenditures mean")

Age Cohort,Expenditures mean
0 to 5,1415.28
6 to 12,2226.86
18 to 21,9888.54
22 to 50,40209.3
51+,53521.9


In [195]:
white_grouped = expenditures.group([w.value, "Ethnicity"], np.mean).where("Ethnicity", "White not Hispanic").select(0, "Expenditures mean").relabel(1, "Average White (Non-Hispanic) Expenditure")
hispanic_column = expenditures.group([w.value, "Ethnicity"], np.mean).where("Ethnicity", "Hispanic").column("Expenditures mean")
white_grouped = white_grouped.with_columns("Average Hispanic Expenditure", hispanic_column)
white_grouped = white_grouped.with_columns("Difference between Hispanic and White Expenditure", hispanic_column - white_grouped.column(1))
white_grouped

Age Cohort,Average White (Non-Hispanic) Expenditure,Average Hispanic Expenditure,Difference between Hispanic and White Expenditure
0 to 5,1366.9,1393.2,26.3045
13 to 17,3904.36,3955.28,50.9233
18 to 21,10133.1,9959.85,-173.212
22 to 50,40187.6,40924.1,736.492
51+,52670.4,55585.0,2914.58
6 to 12,2052.26,2312.19,259.926


Based on this data, for a typical person in a given age cohort, the expenditures seem to be really similar by race. In fact, for all but one age cohort, the average Hispanic expenditure is actually greater than the corresponding average white expenditure! This is another example of Simpson's paradox. The overall data tells us that the average White expenditure is much higher than the average Hispanic expenditure. However, when we break the data down more, we see a different trend that directly counters our initial assumptions. 

In [193]:
expenditures.group(["Ethnicity", "Age Cohort"]).show(40)

Ethnicity,Age Cohort,count
American Indian,13 to 17,1
American Indian,22 to 50,1
American Indian,51+,2
Asian,0 to 5,8
Asian,13 to 17,20
Asian,18 to 21,41
Asian,22 to 50,29
Asian,51+,13
Asian,6 to 12,18
Black,0 to 5,3
