<div class="alert alert-block alert-danger">

# Chi Square Goodness of Fit (COMPLETE)

    
</div>

Made in collaboration with [Skew the Script](https://skewthescript.org/) and [CourseKata](https://coursekata.org/).

<img src="https://i.postimg.cc/ty1GkxB8/Skew-the-Script-Logo.png" title="Skew the script logo" width=200 align = left>

<img src="https://i.postimg.cc/tXcF0nzD/Course-Kata-logo.png" title="CourseKata logo" width=200 align = right>

In [None]:
# Load the CourseKata library
suppressPackageStartupMessages({
    library(coursekata)
})

### 1.0 - Did Harvard Discriminate?

In the fall of 2014, Harvard was sued by a group of Asian-American applicants who were rejected in admissions. They claimed racial discrimination.

<img src="https://i.postimg.cc/7qvYC29z/Chi-Sq-Fit-Harvard.png" title="Harvard University" width = 500/>

In the summer of 2018, courts released reports on Harvard admissions data, giving an unprecedented window into private university admissions data. The data we’ll look at today is reconstructed from parts of the plaintiffs' report in which the defense and plaintiffs generally agreed on the findings (see page 41 of the plaintiffs' statistical [report](https://samv91khoyt2i553a2t1s05i-wpengine.netdna-ssl.com/wp-content/uploads/2018/06/Doc-415-1-Arcidiacono-Expert-Report.pdf)).

<img src="https://i.postimg.cc/ysh0ncdm/Chi-Sq-Fit-Legal-Papers.png" title="Harvard Legal Papers" width = 500/>


We'll use this data to help us explore today’s key analysis:

**Is there convincing evidence that Harvard discriminated against Asian applicants?**

To help answer this, the courts looked at Academic Index Ratings.

**Academic Index Ratings:**

- Internal measures of academic qualification produced by the Harvard admissions office.

- Calculated based on standardized test scores and high school grades/performance.


The code below loads a data frame with the following information (obtained from Plaintiff's report, table 8.5.7, and [Harvard Gazette](https://news.harvard.edu/gazette/story/2015/05/strong-enrollment-for-class-of-2019)):

- `Group` The racial/ethnic groups of the applicant pool
- `Top10_AI` Among class of 2019 applicants in the top 10% of the Academic Index, the proportion belonging to each `Group`
- `Admitted` Among admitted class of 2019 students, the proportion belonging to each `Group`

*Note: Other groups (Native American, Mixed Race, etc.) and international students were not included in the court’s main analysis.*

**1.1:** Take a look at the distributions of those in the top 10% and those who were admitted. What do you notice?

In [None]:
# Create a data frame with the 3 variables
Group <- c("Asian-American", "Hispanic", "African-American", "White")
Top10_AI <- c(0.575, 0.031, 0.008, 0.387)
Admitted <- c(0.214, 0.122, 0.112, 0.553)
Harvard <- data.frame(Group, Top10_AI, Admitted)

# Print out the data frame
Harvard

<div class="alert alert-block alert-warning">

**Sample Responses**

- Asian-Americans make up a much smaller share of admitted students than of applicants with the top AI ratings.

- White Americans make up a much larger share of admitted students than of applicants with the top AI ratings.

- Hispanic and African-Americans make up a higher share of admitted students than of applicants with the top AI ratings, but they are also still the most underrepresented groups overall.
    
</div>

**1.2:** Take a look at the actual differences between the two variables by creating a new variable called `Prop_Difference` (for the difference in proportions). Which groups are overrepresented? Which are underrepresented?

In [None]:
# COMPLETE VERSION
# Create a new variable of group differences
Harvard$Prop_Difference <- (Harvard$Admitted - Harvard$Top10_AI)
Harvard

We can see obvious differences among the groups, but what we really want to know is: Is each difference a *statistically significant* difference?

Imagine Harvard claims: “We only accept the top academic applicants and we treat those applicants
equally. Our admitted class is as good as a random sample from the pool of top applicants.” 

You’d like to test if there’s convincing evidence against this claim. 

Instead of comparing each row one-by-one (which will increase the chance of Type I Error), how do we compare all the groups in just one test?

We can use a **Chi-Square ($\chi^2$) test for Goodness of Fit**.
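
To see why testing each group separately is risky, here is a rough sketch of how the chance of at least one false alarm grows. It assumes four separate tests that are independent of one another, each run at a significance level of 0.05. Both the independence and the 0.05 level are illustrative assumptions, not part of the court analysis.

```r
# Illustrative assumptions: 4 separate tests, each with alpha = 0.05,
# treated as independent of one another
alpha <- 0.05
n_tests <- 4

# Chance of making at least one Type I error across all the tests
familywise_error <- 1 - (1 - alpha)^n_tests
familywise_error
```

Under these assumptions, the chance of at least one false positive is roughly 0.19, well above the 0.05 we would accept for a single test.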

Note: Chi ($\chi$) is pronounced "kai," as in *kite*, not like *chai* latte or tai *chi*.


<img src="https://i.postimg.cc/GLzGvRgY/Chi-Sq-Fit-Chai-and-Chi.png" title="Chai Latte and Tai Chi" width = 300/>

### 2.0 - The difference between what is expected and what is observed

**2.1:** If the data generating process is random (meaning the distribution of selected applicants should resemble the distribution of a random sample of applicants), what should the approximate difference between `Admitted` and `Top10_AI` be?

<div class="alert alert-block alert-warning">

**Sample Responses**

It should be zero, or pretty close to zero.
    
</div>

**2.2:** How might we write this as a word equation? Also, how would we write the alternative model?

<div class="alert alert-block alert-warning">

**Sample Responses**

**Empty Model:**

*Admitted = Other Stuff*

**Alternative Model:**

*Admitted = Top10_AI + Other Stuff*
    
</div>

#### Assumptions of a chi-square test

Before we begin our chi-square test, we need to convert our proportions to counts, so we know how many students we should expect to be admitted for each group and how many were actually admitted. Running the analysis on the proportions themselves would give invalid results, because the chi-square test assumes that the values are counts, not proportions, and that the expected count for each group is at least 5.

We can easily convert these values by multiplying our proportions by the number of admitted applicants.

In total, Harvard admitted n = 2023 applicants from these racial groups to the [Class of 2019](https://features.thecrimson.com/2015/freshman-survey/makeup-narrative/).

**2.3:** So, how many of each group would we expect to admit if it was a random sample? Save the counts into a new variable called `Expected`.

In [None]:
# COMPLETE VERSION
Harvard$Expected <- (Harvard$Top10_AI * 2023)
Harvard

**2.4:** Now that we know what we should expect, let's create a new variable of what we actually observed. Save it as a variable called `Observed`.

In [None]:
# COMPLETE VERSION
Harvard$Observed <- (Harvard$Admitted * 2023)
Harvard

**2.5:** Now, we can get the actual count difference between the number of students we would *expect* to be admitted for each racial group, and how many we actually *observe* being admitted.

Save this difference as a new variable called `Count_Difference`.

In [None]:
Harvard$Count_Difference <- (Harvard$Observed - Harvard$Expected)
Harvard

**2.6:** Looking at the data frame, identify the DATA, the MODEL, and the ERROR.

<div class="alert alert-block alert-warning">

**Sample Response**

DATA --> Observed

MODEL --> Expected

ERROR --> Count_Difference
    
</div>
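
Now that the expected counts are available, the assumption mentioned earlier (every expected count is at least 5) can be checked directly. The sketch below rebuilds the expected counts from the values given above so that it runs on its own; the names `n_admitted` and `expected_counts` are made up for this illustration.

```r
# Proportions of top-10% Academic Index applicants, copied from the data frame
Group <- c("Asian-American", "Hispanic", "African-American", "White")
Top10_AI <- c(0.575, 0.031, 0.008, 0.387)
n_admitted <- 2023  # total admitted from these groups

# Expected counts if the admitted class were a random sample of top applicants
expected_counts <- Top10_AI * n_admitted
names(expected_counts) <- Group
expected_counts

# Check the condition: is every expected count at least 5?
all(expected_counts >= 5)
```

The smallest expected count belongs to the African-American group (about 16), so the condition holds for all four groups.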

### 3.0 - Summarizing the difference between expected and observed

**3.1:** Now that we have all the differences (or, residuals) for each group, what do you think will happen if we sum them up? What can we do to resolve this?
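
One way to explore the first question is to add up the raw count differences and see what happens. This is a small self-contained sketch that rebuilds the differences from the proportions given earlier; `count_difference` here is a stand-alone vector, not the data frame column.

```r
# Rebuild the count differences from the proportions given earlier
Top10_AI <- c(0.575, 0.031, 0.008, 0.387)
Admitted <- c(0.214, 0.122, 0.112, 0.553)
count_difference <- (Admitted - Top10_AI) * 2023

# Positive and negative differences cancel each other out
sum(count_difference)

# Squaring makes every difference positive, so they no longer cancel
sum(count_difference^2)
```

The first sum comes out essentially zero (apart from tiny floating-point error), which is why the raw differences cannot be summed to measure how far off the model is.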

In [None]:
# COMPLETE VERSION
# Save your squared differences as a new variable called `Diff_Sq`.
Harvard$Diff_Sq <- Harvard$Count_Difference^2
Harvard

#### Scaling the Differences

Now that we have all these squared differences, it can be hard to interpret them, so we will want to scale them.

To help this make sense, let's consider the following scenario:

**Situation 1**: You go to a donut shop and order 3 donuts, but when you get home you notice that 2 donuts are missing.

**Situation 2**: You go to a donut shop and order 600 donuts, but when you get home you notice that 2 donuts are missing.



<img src="https://i.postimg.cc/kJVkFQKq/Chi-Sq-Fit-Donuts.png" title="Donuts" width = 300/>

In these situations, the raw difference is the same:

<img src="https://i.postimg.cc/jxFnBbDy/Chi-Sq-Fit-2-Situations.png" title="Raw Difference" width = 300/>


So if we divide by the expected count we can get a sense of the differences in scale:

<img src="https://i.postimg.cc/z8S2ZtkV/Chi-Sq-Fit-Percent-Lost.png" title="Percent Lost" width = 500/>


We can see there is a huge difference in scale! With the 3-donut order, I may infer that they messed up my order and that it's more likely to have happened on purpose, whereas with the 600-donut order it was likely a chance mistake.

<img src="https://i.postimg.cc/6WyFBrpF/Chi-Sq-Fit-Chance-or-Not-Chance.png" title="Chance or Not Chance" width = 500/>
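
The donut comparison can be sketched in a few lines of R. The vector names `ordered` and `lost` are made up for this illustration.

```r
# Donuts ordered (what we expected) and donuts lost in each situation
ordered <- c(situation_1 = 3, situation_2 = 600)
lost <- c(situation_1 = 2, situation_2 = 2)

# Same raw difference, very different share of the order
lost / ordered

# The same idea with squared differences, as in the chi-square calculation
lost^2 / ordered
```

Losing 2 of 3 donuts is about two-thirds of the order, while losing 2 of 600 is well under 1%.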

**3.2:** So next, to scale our values, we can simply divide our squared differences by our expected counts (saved as a new variable called `Scaled_Diff`) and THEN we can sum them up to get our **$\chi^2$ statistic**.

In [None]:
# COMPLETE VERSION
Harvard$Scaled_Diff <- Harvard$Diff_Sq / Harvard$Expected
Harvard
sum(Harvard$Scaled_Diff)

In [None]:
# Or do the calculations by hand, using counts rounded to whole numbers
((433-1163)^2 / 1163) + ((247-63)^2 / 63) + ((227-16)^2 / 16) + ((1119-783)^2 / 783)

**3.3:** A shortcut is to use the `chisq.test()` function, which will do all these steps for us (yay for R!). The value you get will differ from the other methods because the counts and expected proportions below are rounded, but it leads to the same conclusion.

You also get the p-value when you use the function. Try running the chi-square analysis below, and interpret the p-value that you get.

In [None]:
# The chisq.test() function

# chisq.test(x, p) 
# where:
# x: A numerical vector of observed frequencies.
# p: A numerical vector of expected proportions.

Observed <- c(433, 247, 227, 1119)

chisq.test(Observed, p = c(0.57, 0.03, 0.01, 0.39)) # p must add up to 1, so the Top10_AI proportions are rounded here

# Chi-Square Score to P Value Calculator
# https://www.statology.org/chi-square-p-value-calculator/

<div class="alert alert-block alert-warning">

**Sample Response**

The p-value is very small (less than .001). This means that if the admitted class really were a random sample from the pool of top applicants, group differences this large would be very unlikely to occur by random chance.

</div>
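
As an alternative to rounding the proportions by hand, `chisq.test()` has a `rescale.p` argument that rescales `p` so it sums to 1. The sketch below uses the unrounded `Top10_AI` proportions (which sum to 1.001) with the rounded admitted counts, and stores the result in a variable called `result` (a name chosen for this example) so that individual pieces can be pulled out.

```r
Observed <- c(433, 247, 227, 1119)          # rounded admitted counts
Top10_AI <- c(0.575, 0.031, 0.008, 0.387)   # sums to 1.001 because of rounding

# rescale.p = TRUE tells chisq.test() to rescale p so that it sums to 1
result <- chisq.test(Observed, p = Top10_AI, rescale.p = TRUE)

result$statistic   # the chi-square statistic
result$parameter   # degrees of freedom
result$p.value     # the p-value
result$expected    # expected counts used by the test
```

Comparing `result$expected` with the `Expected` column is a quick way to confirm that the function is testing against the model you intended.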

### 4.0 - Interpreting $\chi^2$ and p-value

**4.1:** Here is the equation summarizing what we have done:

<img src="https://i.postimg.cc/9VPsCQmm/Chi-Sq-Fit-Chi-Sq-Formula.png" title="Chi Square Equation" width = 500/>

Try putting this equation into your own words.

<div class="alert alert-block alert-warning">

**Sample Response**

To find the chi-square value, you take the difference between what we observe and what we expect for each group and square it. Then we scale each squared difference by dividing by the expected value. Finally, we add up all those scaled squared differences.

</div>
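
The same steps can be written as a small R function. `chi_sq_stat()` is a hypothetical helper written for this illustration, not a built-in R function, and it is applied here to the unrounded counts from earlier in the notebook.

```r
# Hypothetical helper (not a built-in): chi-square statistic computed by hand
chi_sq_stat <- function(observed, expected) {
  squared_diff <- (observed - expected)^2   # step 1: square each difference
  scaled_diff <- squared_diff / expected    # step 2: divide by the expected count
  sum(scaled_diff)                          # step 3: add them up
}

Top10_AI <- c(0.575, 0.031, 0.008, 0.387)
Admitted <- c(0.214, 0.122, 0.112, 0.553)
chi_sq_stat(observed = Admitted * 2023, expected = Top10_AI * 2023)
```

When observed and expected counts match exactly, every term is zero and so is the statistic.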

**4.2:** We have calculated a $\chi^2$ value of 3878.05. How do we interpret this?

When chi-square is big, that means your observed data is very different from what was expected under the random model. Therefore, if the random model is true, your observations are unlikely (low p-value). You reject the random model.

When chi-square is small, that means your observed data is very similar to what was expected under the random model. Therefore, if the random model is true, your observations are likely (high p-value). You fail to reject the random model.

What is the relationship you notice between $\chi^2$ and p-value?

<div class="alert alert-block alert-warning">

**Sample Response**

As $\chi^2$ gets bigger, p-value gets smaller.

</div>

How do we know if our $\chi^2$ value is big or small? We can compare it to a sampling distribution! Below are various probability distributions of random chi-square models that slightly change depending on the degrees of freedom. 

We can calculate degrees of freedom as the number of groups we have minus one. We have four groups, so our degrees of freedom is 4 - 1 = 3.

Looking at the distributions below, we can see how likely we are to observe each possible $\chi^2$ value under the empty model, for each potential degrees of freedom. 

**4.3:** What do you notice about the shape of the sampling distribution of $\chi^2$? How do the degrees of freedom ($k$) seem to affect it?

<img src="https://i.postimg.cc/z8kCVSxB/Chi-Sq-Fit-Chi-Sq-Dist.png" title="Chi Square Distribution" width = 500/>

<div class="alert alert-block alert-warning">

**Sample Response**

They appear to be skewed right, and all start at 0. The higher the $k$, the more normal it appears to get.

It's very similar to the F distribution!

</div>
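
The shape of this sampling distribution can also be explored by simulation. The sketch below draws many random "admitted classes" of 2023 students from the top-applicant proportions and computes $\chi^2$ for each one. The seed, the number of simulations (2000), and the variable names are arbitrary choices for this illustration.

```r
set.seed(1)  # arbitrary seed so the simulation is repeatable

Top10_AI <- c(0.575, 0.031, 0.008, 0.387)
p <- Top10_AI / sum(Top10_AI)   # rescale so the proportions sum to exactly 1
n <- 2023
expected <- n * p

# Draw 2000 random admitted classes; each column is one simulated class
sim_counts <- rmultinom(2000, size = n, prob = p)

# Chi-square statistic for each simulated class
sim_chi_sq <- apply(sim_counts, 2, function(obs) sum((obs - expected)^2 / expected))

summary(sim_chi_sq)
quantile(sim_chi_sq, 0.95)   # value that only 5% of simulated classes exceed
```

Running `hist(sim_chi_sq)` afterwards should show a right-skewed shape that starts at 0, much like the curve for 3 degrees of freedom.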

**4.4:** Look at the distribution for $k = 3$ (our degrees of freedom). What is the approximate range for the most common $\chi^2$ values? What does this tell us about our sample $\chi^2$ value?

<div class="alert alert-block alert-warning">

**Sample Response**

Most of the distribution falls between 0 and 8, with the curve starting to really fall after about $\chi^2 = 4$, so a $\chi^2$ above 8 is uncommon. Our sample $\chi^2$ is far beyond that range, so the differences among the groups are quite large and unlikely to be due to chance alone.

</div>
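
These visual estimates can be checked with R's built-in chi-square distribution functions, `qchisq()` and `pchisq()`. This is a sketch; the 0.95 cutoff is a conventional choice rather than something set by the lesson, and the variable names are made up for this example.

```r
deg_free <- 4 - 1   # four groups minus one

# Value that only 5% of chi-square statistics exceed under the random model
cutoff <- qchisq(0.95, df = deg_free)
cutoff

# Proportion of the distribution that lies above 8
tail_above_8 <- pchisq(8, df = deg_free, lower.tail = FALSE)
tail_above_8

# Proportion that lies above the statistic calculated earlier
tail_above_stat <- pchisq(3878.05, df = deg_free, lower.tail = FALSE)
tail_above_stat
```

The cutoff is about 7.81, and less than 5% of the distribution lies above 8, so a statistic in the thousands is far out in the tail.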

**4.5:** What does this finding suggest about the plaintiff's case?

<div class="alert alert-block alert-warning">

**Sample Response**

It suggests that there was more than just "other stuff" involved, and that perhaps Harvard was not just randomly selecting applicants out of the top applicants pool.

</div>

### 5.0 - Discussion

**Discussion Question:** Imagine the plaintiffs argue that the above evidence proves racial discrimination. If you were Harvard’s lawyer, how would you defend the school’s admissions policy? Explain.

<div class="alert alert-block alert-warning">

**Sample Response**

*Responses will vary.*

</div>