# Multi-category chi-squared tests

In the last mission, we looked at the gender frequencies of people included in a data set on US income. The dataset consists of 32561 rows, and here are the first few:

```
age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,high_income
39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K
38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K
```

Each row represents a single person who was counted in the 1990 US Census, and contains information about their income and demographics. Here are some of the relevant columns:

* `age` -- how old the person is
* `workclass` -- the type of sector the person is employed in.
* `race` -- the race of the person.
* `sex` -- the gender of the person, either Male or Female.
* `high_income` -- whether the person makes more than 50k or not.

In the last mission, we calculated a chi-squared value indicating how the observed frequencies in a single categorical column, such as `sex`, varied from the US population as a whole.<br>

In this mission, we'll look at how to apply the same technique to cross tables, which show how two categorical columns interact. For instance, here's a table showing the relationship between `sex` and `high_income`:

| |Male|Female|Totals|
|---|---|---|---|
|>50k income|6662|1179|7841|
|<=50k income|15128|9592|24720|
|Totals|21790|10771|32561|

Looking at this table, you might see a pattern between `sex` and `high_income`. But it's hard to immediately quantify that pattern and tell whether it's significant. We can apply the chi-squared test (also known as the [chi-squared test of association](https://en.wikipedia.org/wiki/Chi-squared_test)) to figure out whether there's a statistically significant association between two categorical columns.

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

In [2]:
income = pd.read_csv('data/income.csv')
income.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,high_income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


## Multi-category chi-squared tests

In the single category chi-squared test, we find expected values from other data sets, and then compare them with our own observed values. In a multiple category chi-squared test, we calculate expected values across our whole dataset. We'll illustrate this by converting the table from the last screen into proportions:

| |Male|Female|Totals|
|---|---|---|---|
|>50k income|.205|.036|.241|
|<=50k income|.465|.294|.759|
|Totals|.669|.331|1|

Each cell represents the proportion of people in the data set that fall into the specified categories.

* `20.5%` of the whole data set is `Male` and earns `>50k` in income.
* `33.1%` of the whole dataset is `Female`.
* `75.9%` of the whole dataset earns `<=50k`.

We can use our total proportions to calculate expected values. `24.1%` of all people in `income` earn `>50k`, and `33.1%` of all people in `income` are `Female`, so if sex and income were unrelated, we'd expect the proportion of people who are `Female` and earn `>50k` to be `.241` * `.331`, which is `.079771`. We have this expectation based on the proportions of `Females` and `>50k` earners across the whole dataset. Instead, we see that the observed proportion is `.036`, which indicates that there may be some association between the `sex` and `high_income` columns.

We can convert our expected proportion to an expected value by multiplying by `32561`, the total number of rows in the data set, which gives us `32561` * `.079771`, or `2597.4`.
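
As a rough sketch of the arithmetic above, the marginal proportions and the expected count for one cell can be reproduced from the observed counts. The variable names are illustrative and not part of the mission, and the proportions are rounded to three decimals only so that the result matches the figures quoted in the text.

```python
import numpy as np

# Observed counts from the sex / high_income table
# (rows: >50k, <=50k; columns: Male, Female)
observed = np.array([[6662, 1179],
                     [15128, 9592]])
total = observed.sum()  # 32561

# Marginal proportions, rounded to three decimals as in the text
prop_over50k = round(observed[0].sum() / total, 3)    # share earning >50k
prop_female = round(observed[:, 1].sum() / total, 3)  # share who are Female

# Expected proportion and count for Female & >50k if the columns were unrelated
expected_prop = prop_over50k * prop_female
expected_count = expected_prop * total

print(prop_over50k, prop_female)
print(round(expected_count, 1))
```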


Using the expected proportions in the table above, calculate the expected values for each of the 4 cells in the table.
* Calculate the expected value for Males who earn >50k, and assign to `males_over50k`.
* Calculate the expected value for Males who earn <=50k, and assign to `males_under50k`.
* Calculate the expected value for Females who earn >50k, and assign to `females_over50k`.
* Calculate the expected value for Females who earn <=50k, and assign to `females_under50k`.

In [6]:
total = 32561
males = total * .669
females = total * .331
over50k_prob = .241
under50k_prob = .759

males_over50k = males * over50k_prob
males_under50k = males * under50k_prob
females_over50k = females * over50k_prob
females_under50k = females * under50k_prob

print('males_over50k: {}'.format(males_over50k))
print('males_under50k: {}'.format(males_under50k))
print('females_over50k: {}'.format(females_over50k))
print('females_under50k: {}'.format(females_under50k))


males_over50k: 5249.777469
males_under50k: 16533.531531
females_over50k: 2597.423531
females_under50k: 8180.267469
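
The values above are built from proportions rounded to three decimals. As an alternative sketch (not the mission's own solution), the expected counts can be computed directly from the row and column totals, which avoids the rounding and therefore gives slightly different numbers. The array layout is an assumption made for this example.

```python
import numpy as np

# Observed counts (rows: >50k, <=50k; columns: Male, Female)
observed = np.array([[6662, 1179],
                     [15128, 9592]])

row_totals = observed.sum(axis=1)  # totals per income group
col_totals = observed.sum(axis=0)  # totals per sex
total = observed.sum()

# expected[i, j] = row_totals[i] * col_totals[j] / total
expected = np.outer(row_totals, col_totals) / total
print(expected.round(1))
```

By construction, each row and column of `expected` sums to the same total as the corresponding row or column of `observed`.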


## Calculating chi-squared

In the last screen, you should have ended up with a table like this:

| |Male|Female|
|---|---|---|
|>50k income|5249.8|2597.4|
|<=50k income|16533.5|8180.3|

Now that we have our expected values, we can calculate the chi-squared value by using the same principle from the previous mission. Here are the steps:

* Subtract the expected value from the observed value.
* Square the difference.
* Divide the squared difference by the expected value.
* Repeat for all the observed and expected values and add up the values.

Here's the formula:

$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$


Here's the table of our **observed values** for reference:

| |Male|Female|
|---|---|---|
|>50k income|6662|1179|
|<=50k income|15128|9592|

Compute the chi-squared value for the observed values above and the expected values above.
* Assign the result to chisq_gender_income.

In [8]:
observed = [6662, 1179, 15128, 9592]
expected = [5249.8, 2597.4, 16533.5, 8180.3]
values = []

for i, obs in enumerate(observed):
    exp = expected[i]
    value = (obs - exp) ** 2 / exp
    values.append(value)

chisq_gender_income = sum(values)

In [9]:
chisq_gender_income

1517.5510981525103
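
The loop above can also be written as a single array expression. This is a minimal sketch using NumPy, with the same observed and expected values in the same order; it is an alternative formulation, not a different method.

```python
import numpy as np

observed = np.array([6662, 1179, 15128, 9592])
expected = np.array([5249.8, 2597.4, 16533.5, 8180.3])

# Element-wise (observed - expected)^2 / expected, summed over all cells
chisq = np.sum((observed - expected) ** 2 / expected)
print(chisq)
```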

## Finding statistical significance

Now that we've found our chi-squared value, `1517.6`, we can use the same technique with the chi-squared sampling distribution from the last mission to find a p-value associated with the chi-squared value. **The p-value will tell us whether the difference between the observed and expected values is statistically significant or not.**<br>

Rather than construct a sampling distribution again manually, we'll use the [scipy.stats.chisquare](http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mstats.chisquare.html) function that we covered in the last mission.<br>

We could find the chi-squared value and the p-value using the scipy.stats.chisquare function like this:

```python
import numpy as np
from scipy.stats import chisquare

observed = np.array([10, 10, 5, 5])
expected = np.array([5, 5, 10, 10])
chisquare_value, pvalue = chisquare(observed, expected)
```

Use the scipy.stats.chisquare function to find the chi-squared value and p-value for the above observed and expected counts.
* Assign the p-value to pvalue_gender_income.

In [11]:
from scipy.stats import chisquare

chisq_gender_income, pvalue_gender_income = chisquare(observed, expected)
print('pvalue_gender_income: {}'.format(pvalue_gender_income))

pvalue_gender_income: 0.0
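
One detail worth checking, offered as a side note rather than as part of the mission: by default `scipy.stats.chisquare` computes the p-value with `k - 1` degrees of freedom for `k` cells (3 here), whereas a 2x2 cross table whose expected values come from its own totals has `(2 - 1) * (2 - 1) = 1` degree of freedom. The `ddof` argument adjusts for this. With a chi-squared value this large, the p-value underflows to zero either way, so the conclusion does not change.

```python
import numpy as np
from scipy.stats import chi2, chisquare

observed = np.array([6662, 1179, 15128, 9592])
expected = np.array([5249.8, 2597.4, 16533.5, 8180.3])

# Default: the p-value uses k - 1 = 3 degrees of freedom
stat, p_default = chisquare(observed, expected)

# ddof=2 gives k - 1 - 2 = 1 degree of freedom, matching a 2x2 cross table
_, p_adjusted = chisquare(observed, expected, ddof=2)

# The same p-value taken directly from the chi-squared distribution
p_direct = chi2.sf(stat, df=1)

print(stat, p_default, p_adjusted, p_direct)
```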


## Cross tables

We can also scale up the chi-squared test to cases **where each category contains more than two possibilities**. We'll illustrate this with an example where we look at `sex` vs `race`. <br>

Before we can find the chi-squared value, we need to find the observed frequency counts. We can do this using the [pandas.crosstab](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.crosstab.html) function. **The crosstab function returns a table of frequency counts for two or more columns**. Here's how you could use the function:

```python
import pandas

table = pandas.crosstab(income["sex"], [income["high_income"]])
print(table)
```
The above code will print a table showing how many people from `income` fall into each category of `sex` and `high_income`.<br>

The second parameter passed into [pandas.crosstab](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.crosstab.html) is actually a list -- this parameter can contain more than one item.
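
To see the shape of the result without loading the full file, here is a small sketch on a hypothetical six-row DataFrame (the `toy` data is invented for illustration and is not taken from `income`). It also shows the `margins=True` option, which adds the row and column totals.

```python
import pandas as pd

# Hypothetical toy data standing in for the real income DataFrame
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female", "Male", "Male"],
    "high_income": [">50K", "<=50K", "<=50K", "<=50K", ">50K", "<=50K"],
})

# Rows are the categories of the first argument, columns those of the second
table = pd.crosstab(toy["sex"], toy["high_income"])
print(table)

# margins=True appends an "All" row and column holding the totals
table_with_totals = pd.crosstab(toy["sex"], toy["high_income"], margins=True)
print(table_with_totals)
```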

Use the [pandas.crosstab](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.crosstab.html) function to print out a table comparing the `sex` column of `income` to the `race` column of `income`.

In [12]:
print(pd.crosstab(income['sex'], [income['race']]))

race      Amer-Indian-Eskimo   Asian-Pac-Islander   Black   Other   White
sex                                                                      
 Female                  119                  346    1555     109    8642
 Male                    192                  693    1569     162   19174


## Finding expected values

Now that we have the observed frequency counts, **we can generate the expected values**. We can use the [scipy.stats.chi2_contingency](http://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.chi2_contingency.html) function to do this. The function:

* takes in
  * a cross table of observed counts (such as the result of `pandas.crosstab`),
* returns
  * the chi-squared value,
  * the p-value,
  * the degrees of freedom,
  * the expected frequencies.
  
Let's say we have the following 2x2 table of observed counts:

```
[[5, 5],
 [10, 10]]
```

Here's how we could use the [scipy.stats.chi2_contingency](http://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.chi2_contingency.html) function:

```python
import numpy as np
from scipy.stats import chi2_contingency
observed = np.array([[5, 5], [10, 10]])

chisq_value, pvalue, df, expected = chi2_contingency(observed)
```

You can also directly pass the result of the [pandas.crosstab](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.crosstab.html) function into the [scipy.stats.chi2_contingency](http://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.chi2_contingency.html) function, which makes it extremely easy to perform a chi-squared test.

Use the scipy.stats.chi2_contingency function to calculate the pvalue for the `sex` and `race` columns of `income`.
* Assign the result to `pvalue_gender_race`.

In [16]:
from scipy.stats import chi2_contingency

sex_race_crosstab = pd.crosstab(income['sex'], [income['race']])
_, pvalue_gender_race, _, _ = chi2_contingency(sex_race_crosstab)

pvalue_gender_race

5.1920613027604561e-97
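
The cell above keeps only the p-value. As a sketch of what the other return values look like, the example below hard-codes the counts from the `sex` / `race` cross table printed earlier, so it runs without the data file. The array layout is an assumption made for this example.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts copied from the sex / race cross table above
# (rows: Female, Male; columns in the order printed by crosstab)
observed = np.array([[119, 346, 1555, 109, 8642],
                     [192, 693, 1569, 162, 19174]])

chisq_value, pvalue, dof, expected = chi2_contingency(observed)

print(chisq_value, pvalue)
print(dof)                # (2 - 1) * (5 - 1) = 4
print(expected.round(1))  # same shape as observed
```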

## Caveats

Now that we've learned the chi-squared test, you should be able to figure out **if the association between two columns of categorical data is statistically significant or not**. There are a few caveats to using the chi-squared test that are important to cover, though:

### Finding that a result isn't significant doesn't mean that no association between the columns exists. 
For instance, if we found that the chi-squared test between the sex and race columns returned a p-value of .1, it wouldn't mean that there is no relationship between sex and race. *It just means that there isn't a statistically significant relationship*.

### Finding a statistically significant result doesn't imply anything about what the association is.
For instance, finding that a chi-squared test between sex and race results in a p-value of .01 doesn't mean that the dataset contains too many Females who are White (or too few). A statistically significant finding means that **some evidence of a relationship** between the variables exists, **but needs to be investigated further**.
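
One possible way to start that investigation, sketched here as a suggestion rather than something this mission covers, is to look at the standardized difference `(observed - expected) / sqrt(expected)` for each cell (often called a Pearson residual). Cells with large positive or negative values contribute the most to the chi-squared value.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts from the sex / race cross table (rows: Female, Male)
observed = np.array([[119, 346, 1555, 109, 8642],
                     [192, 693, 1569, 162, 19174]])

chisq_value, _, _, expected = chi2_contingency(observed)

# Positive: more people than expected in the cell; negative: fewer
residuals = (observed - expected) / np.sqrt(expected)
print(residuals.round(2))

# For this table, the squared residuals add up to the chi-squared value
print(np.isclose((residuals ** 2).sum(), chisq_value))
```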

### Chi-squared tests can only be applied when the possibilities within a category are mutually exclusive.
For instance, the Census counts individuals as either Male or Female, not both, so each person falls into exactly one cell of the cross table.

### Chi-squared tests are more valid when the numbers in each cell of the cross table are larger.
So if each number is 100, great -- if each number is 1, you may need to gather more data. A commonly cited rule of thumb is that every expected count should be at least 5.
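
A quick way to check cell sizes is to inspect the expected frequencies returned by `scipy.stats.chi2_contingency`. The sketch below uses a hypothetical small table and a threshold of 5, which is a commonly cited rule of thumb and an assumption here, not a hard rule.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical small table, used only for illustration
small = np.array([[3, 1],
                  [2, 4]])

_, _, _, expected = chi2_contingency(small)

# Flag cells whose expected count falls below the rule-of-thumb threshold of 5
too_small = expected < 5
print(expected)
print(too_small.any())
```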