In the last file, we looked at the gender frequencies of people included in a sample data set on US income. We had an inkling that the amount of data we had on males and females was different from an even split, so we learned how to perform the chi-squared test to put this inkling to the test. The sample dataset consisted of 32561 rows, and here are the first few:

![image.png](attachment:image.png)

Each row represents a single person who was counted in the `1990` US Census and contains information about their income and demographics. Here are some of the relevant columns:

* `age`: how old the person is.
* `workclass`: the type of sector the person is employed in.
* `race`: the race of the person.
* `sex`: the gender of the person, either `Male` or `Female`.
* `high_income`: whether or not the person makes more than `50k`.

In the last file, we calculated a chi-squared statistic for a single categorical column, such as sex. In this file, we'll learn how to apply this same technique to two-way contingency tables that show how two categorical columns interact. For instance, here's a table showing the relationship between `sex` and `high_income`:

![image.png](attachment:image.png)

Looking at this table, we might see a pattern between `sex` and `high_income`, but it's hard to immediately quantify that pattern and tell if it's significant. If there were no interaction between `sex` and `high_income`, we would expect their values to be independent of each other. We can apply the chi-squared test (also known as the [chi-squared test of association](https://en.wikipedia.org/wiki/Chi-squared_test)) to figure out if there's a statistically significant association between two categorical columns.

The first step we took in calculating the chi-squared statistic for a single categorical variable was calculating the expected value. This calculation gets slightly more complicated when considering two categorical variables at once, but we'll walk through an example here.

Above, we said that if there were no interaction between `sex` and `high_income`, then their counts should be independent of each other. This is the same independence that we learned about in probability.

If two events are independent, then the probability of both of them happening at the same time is just the product of their individual probabilities, as seen below:

**P(A ∩ B) = P(A) × P(B)**


We can use this relationship to calculate our expected values. In a multiple category chi-squared test, we calculate expected values across our whole dataset. We can illustrate this by converting our chart from above into proportions:

![image.png](attachment:image.png)

Each cell represents the proportion of people in the data set that fall into the specified categories. For example:

* `20.5%` of people in the whole data set are `Males` who earn `>50k` in income.
* `33%` of the whole data set is `Female`.
* `75.9%` of the whole data set earns `<=50k`.

The bottom row represents the probability of being a particular `gender`, while the last column represents the probability of falling into each `high_income` category. The other cells represent the intersection of the two events.
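As a rough sketch of how a proportion table like this could be built in R, the snippet below enters the four observed counts by hand (the same counts used later in this file) and relies on base R's `prop.table()` and `addmargins()`. The object names are our own and are only illustrative.

```r
# Observed counts entered by hand (object names are illustrative)
observed <- matrix(
  c(6662, 1179,
    15128, 9592),
  nrow = 2, byrow = TRUE,
  dimnames = list(high_income = c(">50k", "<=50k"),
                  sex = c("Male", "Female"))
)

# Divide every cell by the grand total to get proportions
proportions <- prop.table(observed)

# Append row and column totals (the marginal proportions)
proportions_with_totals <- addmargins(proportions)
round(proportions_with_totals, 3)
```

The `Sum` row and column hold the marginal proportions for each `sex` and each `high_income` category.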

Using this information, we can start calculating the expected values using the formula above. For example, `24.1%` of all people in the income data set earn `>50k`, and `33%` of them are `Female`, so if the two were independent we'd expect the proportion of people who are `Female` and earn `>50k` to be `0.241 * 0.33`, which is `0.07953`. We have this expectation based on the proportions of `Females` and `>50k` earners across the whole data set. Instead, we see that the observed proportion is `0.036`, which indicates that there may be some association between the `sex` and `high_income` columns.
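The arithmetic for this one cell can be sketched in R as follows. The variable names are hypothetical, and the observed count of `1179` is the `Female`, `>50k` count from the observed table used later in this file.

```r
# Marginal proportions, rounded as in the text
p_over50k <- 0.241  # proportion of people earning >50k
p_female  <- 0.33   # proportion of people who are Female

# Under independence, the joint proportion is the product of the margins
expected_prop <- p_over50k * p_female

# Observed joint proportion: Females earning >50k out of all rows
observed_prop <- 1179 / 32561

c(expected = expected_prop, observed = round(observed_prop, 3))
```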

The expected values are calculated under the assumption that `sex` and `high_income` are independent of each other, so we can use this as our null hypothesis:

* $H_0$: gender and earning a high income are independent of each other
* $H_1$: gender and earning a high income are not independent of each other

Saying that the two variables are not independent is another way of saying that they are associated: knowing the value of one tells us something about the other.

We can convert our expected proportion to an expected count value by multiplying by `32561`, the total number of rows in the data set, which gives us `32561 * 0.07953`, or `2589.6`. With this in mind, calculate all of the expected values for all the different category combinations.

**Task**

Using the marginal proportions in the table above, calculate the expected values for each of the `4` cells in the table.

* Calculate the expected value for `Males` who earn `>50k`.
* Calculate the expected value for `Males` who earn `<=50k`.
* Calculate the expected value for `Females` who earn `>50k`.
* Calculate the expected value for `Females` who earn `<=50k`.

**Answer**

`males_over50k <- .67 * .241 * 32561
males_under50k <- .67 * .759 * 32561
females_over50k <- .33 * .241 * 32561
females_under50k <- .33 * .759 * 32561`

We should have ended up with values like this:

![image.png](attachment:image.png)
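As an alternative sketch, the same expected counts can be derived directly from the row and column totals of the observed table, without rounding the proportions first. The matrix below is entered by hand and the object names are our own.

```r
# Observed counts entered by hand (object names are illustrative)
observed <- matrix(
  c(6662, 1179,
    15128, 9592),
  nrow = 2, byrow = TRUE,
  dimnames = list(high_income = c(">50k", "<=50k"),
                  sex = c("Male", "Female"))
)

n <- sum(observed)  # total number of rows, 32561

# Expected count for each cell: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / n
round(expected, 1)
```

Because this version uses unrounded proportions, its results differ somewhat from the values above, which were computed from proportions rounded to two or three decimals.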

Now that we have our expected values, we can calculate the chi-squared value by using the same principles from the last file.


![image.png](attachment:image.png)

Here's the table of our observed values for reference:

![image.png](attachment:image.png)

**Task**

* Compute the chi-squared value for the observed values above and the expected values above.

**Answer**

`chisq_gender_income <- 0
observed <- c(6662, 1179, 15128, 9592)
expected <- c(5257.6, 2589.6, 16558.2, 8155.6)`

`for (i in seq_along(observed)) {
    O <- observed[i]
    E <- expected[i]
    chisq_gender_income <- chisq_gender_income + (O - E)^2 / E
}`
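Since R arithmetic is vectorized, the loop above can also be written as a single expression. This is only an alternative sketch that uses the same `observed` and `expected` vectors.

```r
observed <- c(6662, 1179, 15128, 9592)
expected <- c(5257.6, 2589.6, 16558.2, 8155.6)

# Sum of squared differences, each scaled by its expected count
chisq_gender_income <- sum((observed - expected)^2 / expected)
round(chisq_gender_income, 1)
```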

Now that we've found our chi-squared value, `1520.0`, we can use the same technique with the chi-squared sampling distribution from the last file to find the `p-value` associated with it.

Before we can calculate the `p-value`, we need to figure out how many degrees of freedom the test statistic has. Previously, we described the degrees of freedom as the number of values that contribute to the statistic, minus 1.

This formula changes somewhat when we consider the interaction between two categorical variables. Instead of counting the categories of a single variable, we count how many different values each of the two variables can take.

For a two-way contingency table, we calculate the degrees of freedom by taking the number of categories for the first variable, `sex`, and subtracting 1 from it. Then, we multiply this value by the number of categories for the second variable, `high_income`, minus 1.

The only difference between this calculation and the one we learned in the last file is the multiplication. We essentially take the degrees of freedom for each individual categorical variable and multiply them together. The result is the degrees of freedom of the chi-squared distribution we will use under the null hypothesis.

The degrees of freedom calculation can be summarized below:

![image.png](attachment:image.png)

Here, 

* `r` represents the number of categories for the variable we are using along the **rows**, which is `high_income` in this example. 
* `c` represents the number of categories for the variable along the **columns**, which is `sex`.

**Task**

* Calculate the degrees of freedom for this particular contingency table between `sex` and `high_income`.

**Answer**

`r <- 2
c <- 2
df <- (r - 1) * (c - 1)`
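When the contingency table is already stored as a matrix, the same calculation can read `r` and `c` off its dimensions. The sketch below assumes a hand-entered matrix of the observed counts, and the `3 x 4` table at the end is purely hypothetical.

```r
# Observed counts entered by hand: rows are high_income, columns are sex
observed <- matrix(c(6662, 1179, 15128, 9592), nrow = 2, byrow = TRUE)

# (rows - 1) * (columns - 1)
degrees_of_freedom <- (nrow(observed) - 1) * (ncol(observed) - 1)
degrees_of_freedom

# A hypothetical table with 3 rows and 4 columns would have 6
df_3x4 <- (3 - 1) * (4 - 1)
df_3x4
```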

Now that we know the degrees of freedom, we know what the distribution of the statistic is under the null hypothesis. It turns out that the degrees of freedom for two variables with only two categories each is still 1. We have all the ingredients we need to calculate the `p-value`.

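We can use the `pchisq()` function to calculate the cumulative probability of the chi-squared distribution up to our test statistic. Subtracting that from 1 gives the probability of seeing a test statistic at least this large under the null, which is our `p-value`.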

**Task**

* Calculate the `p-value` for observing a test statistic of `1517` under the null hypothesis.
* Using the `p-value`, decide whether to reject or fail to reject the null hypothesis. If we believe we should reject the null hypothesis, assign the value `TRUE`. Otherwise, assign `FALSE`. Use a significance level of `0.05`.

**Answer**

`pvalue <- 1 - pchisq(1517, 1)
reject_null <- TRUE`
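As a small variation on the answer above, `pchisq()` can return the upper-tail probability directly through its `lower.tail` argument, and the decision can be computed rather than typed in. The variable names here are our own. For a statistic this large the `p-value` is so small that it may print as `0`, so the sketch also asks for it on the log scale with `log.p = TRUE`.

```r
chisq_stat <- 1517
df <- 1
alpha <- 0.05

# Upper-tail probability: P(statistic >= chisq_stat) under the null
p_value <- pchisq(chisq_stat, df = df, lower.tail = FALSE)

# Same probability on the natural-log scale, useful when p_value underflows
log_p_value <- pchisq(chisq_stat, df = df, lower.tail = FALSE, log.p = TRUE)

# Reject the null when the p-value falls below the significance level
reject_null <- p_value < alpha
reject_null
```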

We learned how to perform the chi-squared test by hand. The chi-squared test is such a common test that R actually has a dedicated function for the test, called `chisq.test()`. 

The input to `chisq.test()` is a data matrix; in this case, it's the contingency table that we used to calculate the test statistic by hand. The function takes this matrix and automatically calculates the `test statistic`, `degrees of freedom`, and `p-value`. The test uses the null hypothesis that the two categorical variables used are independent of each other.

For example, we would use the following code to set up the data for our test for `sex` and `high_income`:

![image.png](attachment:image.png)

We use the `table()` function to take the two categorical variables and convert them into a contingency table. Once we have constructed this table, we can give it to the `chisq.test()` function. 

The test function outputs all of the important characteristics of the test that we've described in a user-friendly format. The output mentions a "continuity correction", but we don't need to worry about this. Just know that the output of `chisq.test()` correctly describes the results of the test, and allows us to make the judgment about the null hypothesis.
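Because the data file itself isn't reproduced here, the following sketch passes the observed counts to `chisq.test()` as a hand-built matrix instead of a `table()` result. It also shows how individual pieces of the result can be pulled out, and how the `correct` argument switches the continuity correction off. The object names are illustrative.

```r
# Observed counts entered by hand: rows are high_income, columns are sex
observed <- matrix(
  c(6662, 1179,
    15128, 9592),
  nrow = 2, byrow = TRUE,
  dimnames = list(high_income = c(">50k", "<=50k"),
                  sex = c("Male", "Female"))
)

# Default call: applies the continuity correction for 2 x 2 tables
result <- chisq.test(observed)
result$statistic   # chi-squared test statistic
result$parameter   # degrees of freedom
result$p.value     # p-value
result$expected    # expected counts under independence

# Same test without the continuity correction
result_uncorrected <- chisq.test(observed, correct = FALSE)
result_uncorrected$statistic
```

With the default correction, the reported statistic should come out at roughly `1517.8`, a little below the `1520.0` we computed by hand, since the hand calculation used rounded proportions and no correction.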

The reason we didn't discuss this function first is that it's important to become acquainted with the entire process of hypothesis testing. With a convenient function like `chisq.test()`, it's easy to overlook important technical aspects of the test. Without knowing what the null and alternative hypotheses are, we have no way of knowing how to interpret the resulting `p-value`. But now that we're equipped with the full process, we can use the function in an informed, responsible manner.

The above code allows us to quickly test hypotheses concerning categorical variables in our data.

**Task**

We suspect that there is an association between `race` and `education level`. Using the income data, test the null hypothesis that `race` and `education level` are independent of each other. 

Race is contained in the `race` column, while education level is contained in the `education` column.

* Take these two columns and construct a two-way contingency table from them. 
* Using the `chisq.test()` function, perform the chi-squared test and make a decision about the null hypothesis. If we think we should reject the null hypothesis that `race` and `education` are independent of each other, assign `TRUE`. Otherwise, assign `FALSE`. Use a significance level of `0.05`.

**Answer**

`library(readr)
income <- read_csv("income.csv")`

`race_education_table <- table(income$race, income$education)
chisq.test(race_education_table)
reject_null <- TRUE`
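The decision can also be derived from the test result instead of being assigned by hand. The sketch below shows that pattern on a small, randomly generated stand-in data frame. The `toy` data and its category labels are hypothetical and have nothing to do with the real income data, so the outcome here says nothing about `race` and `education`.

```r
set.seed(1)

# Hypothetical stand-in data: two independently sampled categorical columns
toy <- data.frame(
  race = sample(c("A", "B", "C"), 600, replace = TRUE),
  education = sample(c("HS", "Bachelors", "Masters"), 600, replace = TRUE)
)

# Two-way contingency table, then the chi-squared test
toy_table <- table(toy$race, toy$education)
toy_result <- chisq.test(toy_table)

# Compare the p-value against the significance level
alpha <- 0.05
reject_null <- toy_result$p.value < alpha
reject_null
```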

Now that we've learned the chi-squared test, we can develop hypotheses about relationships between categorical variables and test them. There are a few caveats to using the chi-squared test that are important to cover, though:

* Finding an insignificant result doesn't mean we can conclude that there is no association between the columns. 

    - For instance, if we found that the chi-squared test between the `sex` and `race` columns returned a `p-value` of `.1`, it wouldn't mean that there is no relationship between `sex` and `race`. It might be the case that the association between the two variables is too small to detect with the data on hand.
    

* Finding a statistically significant result doesn't imply anything about the strength of the relationship between the two variables. 

    - For instance, finding that a chi-squared test between `sex` and `race` results in a `p-value` of `.01` doesn't tell us how strongly the two columns are related, and it doesn't by itself mean that the data set contains too many `Females` who are `White` (or too few). A statistically significant finding means only that there is some evidence that the two variables are not independent of each other. That is to say, knowing a person's gender changes the probability that they are a certain race, according to the data set.



* Chi-squared tests work best when the numbers in each cell of the cross table are large. There is no hard rule, but a general rule of thumb is that the test is valid if the expected count in each cell is at least 5.
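One common way to apply the last rule of thumb is to inspect the expected counts that `chisq.test()` returns. The sketch below compares the `sex` and `high_income` counts with a small made-up table. Both matrices are entered by hand, `small_table` is entirely hypothetical, and the threshold of 5 is the rule of thumb rather than a strict requirement.

```r
# Observed counts for sex and high_income, entered by hand
observed <- matrix(c(6662, 1179, 15128, 9592), nrow = 2, byrow = TRUE)

# A hypothetical table with very few observations
small_table <- matrix(c(3, 2, 4, 12), nrow = 2)

# Expected counts under independence; warnings about small counts are muted
expected_large <- chisq.test(observed)$expected
expected_small <- suppressWarnings(chisq.test(small_table))$expected

# TRUE when every expected count meets the rule of thumb
large_ok <- all(expected_large >= 5)
small_ok <- all(expected_small >= 5)
c(large_ok = large_ok, small_ok = small_ok)
```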

In this file, we covered chi-squared tests for multiple categories, and learned how to quickly perform chi-squared tests. 

We learned when to apply and when not to apply chi-squared tests. Chi-squared tests can be a powerful tool to discover interrelationships and figure out when anomalies in our data should be investigated further.