# Evidence of Discrimination?

The Department of Developmental Services (DDS) in California is responsible for allocating funds to support over 250,000 developmentally-disabled residents. The data set `ca_dds_expenditures.csv` contains data about 1,000 of these residents. The data come from a discrimination lawsuit which alleged that the DDS privileged white (non-Hispanic) residents over Hispanic residents in allocating funds. We will focus on comparing the allocation of funds (i.e., expenditures) for these two ethnicities only, although the data set contains other ethnicities as well.

There are 6 variables in this data set:

- Id:  5-digit, unique identification code for each consumer (similar to a social security number and used for identification purposes)  
- Age Cohort:  Binned age variable represented as six age cohorts (0-5, 6-12, 13-17, 18-21, 22-50, and 51+)
- Age:  Unbinned age variable
- Gender:  Male or Female
- Expenditures:  Dollar amount of annual expenditures spent on each consumer
- Ethnicity:  Eight ethnic groups (American Indian, Asian, Black, Hispanic, Multi-race, Native Hawaiian, Other, and White non-Hispanic)
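
The description above spells one label "White non-Hispanic", while the filters later in this notebook use `"White not Hispanic"`. A filter on a label that does not occur in the data matches zero rows without raising an error, so it can help to check the exact spelling first. The sketch below is one possible check; `check_labels` is a hypothetical helper and the toy frame contains made-up rows standing in for the real CSV.

```python
import pandas as pd

def check_labels(frame, column, expected):
    """Return the expected labels that never occur in frame[column]."""
    present = set(frame[column].dropna().unique())
    return [label for label in expected if label not in present]

# Toy frame with made-up rows (an assumption, not the DDS data).
toy = pd.DataFrame({"Ethnicity": ["Hispanic", "White not Hispanic", "Asian"]})

# Any label listed here would match zero rows in a boolean filter.
missing = check_labels(toy, "Ethnicity", ["Hispanic", "White non-Hispanic"])
print(missing)
```

On the real data, the same call would be `check_labels(df, 'Ethnicity', [...])` after reading the CSV.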

# Question 1

Read in the data set. Make a graphic that compares the _average_ expenditures by the DDS on Hispanic residents and white (non-Hispanic) residents. Comment on what you see.

In [None]:
from google.colab import files
uploaded = files.upload()
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt

import io
df = pd.read_csv(io.BytesIO(uploaded['ca_dds_expenditures.csv']))


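# The question asks for the *average* expenditure, so aggregate with .mean(), not .sum():
# totals also depend on how many residents each group has.
hispanic_exp = df[df['Ethnicity']=="Hispanic"]['Expenditures'].mean()
nonhisp_exp = df[df['Ethnicity']=="White not Hispanic"]['Expenditures'].mean()

bar_plot = plt.figure()
ax = bar_plot.add_axes([0, 0, 2, 1])
x_axis = ['Hispanic', 'White not Hispanic']
y_axis = [hispanic_exp, nonhisp_exp]
ax.set_title('Average expenditures by ethnicity')
ax.set_ylabel('Average annual expenditure (USD)')
ax.bar(x_axis, y_axis)


**Based on this graph, the average expenditure per resident is higher for white (non-Hispanic) residents than for Hispanic residents. The totals point the same way: expenditures sum to a little over 4,000,000 USD for Hispanic residents and nearly 10,000,000 USD for white residents. Taken alone, this looks consistent with the claim made in the lawsuit.**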

# Question 2

Now, calculate the average expenditures by ethnicity and age cohort. Make a graphic that compares the average expenditure on Hispanic residents and white (non-Hispanic) residents, _within each age cohort_. 

Comment on what you see. How do these results appear to contradict the results you obtained in Question 1?

In [None]:
# Keep only the two groups under comparison.
two_groups = df[df['Ethnicity'].isin(['Hispanic', 'White not Hispanic'])]

# Order cohorts by the youngest age they contain; sorting the labels as strings
# would not put them in age order.
cohorts = df.groupby('Age Cohort')['Age'].min().sort_values().index

# Mean expenditure for each (age cohort, ethnicity) pair.
avg_exp = two_groups.pivot_table(index='Age Cohort', columns='Ethnicity',
                                 values='Expenditures', aggfunc='mean').reindex(cohorts)
print(avg_exp)

# Grouped bar chart: one pair of bars per age cohort.
ax = avg_exp.plot.bar(figsize=(12, 6), rot=0)
ax.set_title('Average expenditures by age cohort and ethnicity')
ax.set_ylabel('Average annual expenditure (USD)')

**After breaking up the data into age cohorts, the gap from Question 1 largely disappears: within each cohort the average expenditures for Hispanic and white (non-Hispanic) residents are close, and in several cohorts more is spent on average on Hispanic residents. This appears to contradict Question 1, where white residents had the clearly higher overall average.**

# Question 3

Can you explain the discrepancy between the two analyses you conducted above (i.e., Questions 1 and 2)? Try to tell a complete story that interweaves tables, graphics, and explanation.

_Hint:_ You might want to consider looking at:

- the distributions of ages of Hispanics and whites
- the average expenditure as a function of age

In [None]:
# Count residents at each age, per group, and plot the two age distributions.
for group, label in [("Hispanic", "Hispanic"), ("White not Hispanic", "White")]:
    age_counts = df.loc[df['Ethnicity']==group, 'Age'].value_counts().sort_index()
    dist = plt.figure()
    ax = dist.add_axes([0, 0, 2, 1])
    ax.set_title(f"Distribution of Age in {label} Residents")
    ax.set_xlabel("Age")
    ax.set_ylabel("Number of residents")
    ax.bar(age_counts.index, age_counts.values)

**From these two graphs, we can see that there are more white residents aged 40-80 than Hispanic residents of the same age. Both distributions are positively skewed (most residents are young), but the white distribution has a longer, heavier right tail, so a larger share of white residents falls in the older cohorts.**

**The most likely explanation is that age confounds the comparison. Younger residents make up a much larger share of the Hispanic group, and average expenditures increase with age. The overall Hispanic average is therefore pulled down by the group's age composition rather than by lower spending at any given age, which is an instance of Simpson's paradox. Anyone who gathered only the overall averages, saw that more money was being spent on white residents, and looked no further would have missed this.**
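
The reversal between Questions 1 and 2 can be reproduced on a small scale. The sketch below uses a toy table with made-up numbers (not the DDS data) and two hypothetical groups, `A` and `B`: group `A` has the higher mean in every cohort, yet the lower overall mean, because most of its members are in the low-expenditure cohort. It also shows that each group's overall mean equals the cohort means weighted by that group's cohort shares.

```python
import pandas as pd

# Toy data with made-up numbers (NOT the DDS data): two groups, two age cohorts.
toy = pd.DataFrame({
    "Ethnicity":    ["A"] * 4 + ["B"] * 4,
    "Age Cohort":   ["young", "young", "young", "old",
                     "young", "old", "old", "old"],
    "Expenditures": [2000, 2200, 2100, 40000,
                     1900, 39000, 38000, 40000],
})

# Mean expenditure within each (group, cohort) cell.
cell_means = toy.pivot_table(index="Ethnicity", columns="Age Cohort",
                             values="Expenditures", aggfunc="mean")

# Share of each group's members that falls in each cohort (each row sums to 1).
cohort_shares = pd.crosstab(toy["Ethnicity"], toy["Age Cohort"], normalize="index")

# Overall mean per group, computed directly and as a share-weighted average.
overall = toy.groupby("Ethnicity")["Expenditures"].mean()
reconstructed = (cell_means * cohort_shares).sum(axis=1)

print(cell_means)
print(cohort_shares)
print(overall)
```

Applying the same three tables to `df`, restricted to the two ethnicities of interest, would show how much of the overall gap is attributable to the cohort shares rather than to the within-cohort means.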

## Submission Instructions

Once you are finished, follow these steps:

1. Restart the kernel and re-run this notebook from beginning to end by going to `Kernel > Restart Kernel and Run All Cells`.

2. If this process stops halfway through, that means there was an error. Correct the error and repeat Step 1 until the notebook runs from beginning to end.

3. Double check that there is a number next to each code cell and that these numbers are in order.

Then, submit your lab as follows: upload the notebook (`.ipynb`) to iLearn.