## Significance Testing for Chi-Squared

Based on: https://github.com/sengkchu/codingdisciple.content/blob/master/Learning%20data%20science/Learning/Studying%20Statistics/Chi-Squared%20Test%20for%20Independence/Chi-Squared%20Test%20for%20Independence.ipynb

We will use a Chi-Squared test for independence to determine whether there is a statistically significant relationship between two categorical variables.

### Chi-Squared Test Assumptions

We'll be looking at data from the census in 1994. Specifically, we are interested in the relationship between 'sex' and 'hours-per-week' worked. Click [here](https://archive.ics.uci.edu/ml/datasets/Census+Income) for the documentation and citation of the data. First let's get the assumptions out of the way:

+ There must be different participants in each group, with no participant being in more than one group. In our case, each individual can only have one 'sex' and cannot be in multiple work-hour categories.
+ For the 1994 census, sex could only be recorded as Male or Female.
+ The data must be a random sample from the population. In our case, the census is assumed to be representative of the population.

### Data Exploration

For the sake of this example, we'll convert the numerical column 'hours-per-week' into a categorical column using pandas, where the categories are bins from 0-9 hours, 10-19 hours, etc. Then we'll assign 'sex' and 'hours_per_week_categories' to a new dataframe.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
from scipy import stats

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [None]:
cols = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
        'marital-status', 'occupation', 'relationship', 'race', 'sex',
        'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']
data = pd.read_csv('data/census.csv', names=cols)

#Create a column for work hour categories.
def process_hours(df):
    cut_points = [0,9,19,29,39,49,1000]
    label_names = ["0-9","10-19","20-29","30-39","40-49","50+"]
    df["hours_per_week_categories"] = pd.cut(df["hours-per-week"],
                                             cut_points,labels=label_names)
    return df

data = process_hours(data)
workhour_by_sex = data[['sex', 'hours_per_week_categories']]
workhour_by_sex.head()
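
To see how the bin edges above assign values to categories, here is a minimal sketch on a handful of made-up hours values (not the census data). By default, `pd.cut` builds intervals that are open on the left and closed on the right, so the cut points `[0, 9, 19, ...]` produce (0, 9], (9, 19], and so on.

```python
import pandas as pd

# Made-up hours values, chosen to sit on or next to the bin edges.
hours = pd.Series([1, 9, 10, 19, 20, 40, 49, 50, 99])

cut_points = [0, 9, 19, 29, 39, 49, 1000]
label_names = ["0-9", "10-19", "20-29", "30-39", "40-49", "50+"]

# Default behaviour: right-closed intervals (0, 9], (9, 19], ..., (49, 1000].
binned = pd.cut(hours, cut_points, labels=label_names)

print(pd.DataFrame({"hours": hours, "category": binned}))
```

One consequence of the left-open first interval is that a value of exactly 0 would not fall into any bin and would become NaN, unless `include_lowest=True` is passed to `pd.cut`.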

In [None]:
workhour_by_sex['sex'].value_counts()

In [None]:
workhour_by_sex['hours_per_week_categories'].value_counts()

### The Null and Alternative Hypotheses

Recall that we are interested in knowing if there is a relationship between 'sex' and 'hours_per_week_categories'. In order to do so, we would have to use the Chi-squared test. But first, let's state our null hypothesis and the alternative hypothesis.

$ H_0 :  \text{There is no relationship between sex and the number of hours per week worked.} $

$ H_a :  \text{There is a relationship between sex and the number of hours per week worked.} $


### Constructing the Contingency Table

The next step is to format the data into a frequency count table, called a <b>Contingency Table</b>. We can build one with the pd.crosstab() function in pandas.

In [None]:
contingency_table = pd.crosstab(
    workhour_by_sex['sex'],
    workhour_by_sex['hours_per_week_categories'],
    margins = True
)
contingency_table

Each cell in this table represents a frequency count. For example, the intersection of the 'Male' row and the '10-19' column represents the number of males in our sample data set who work 10-19 hours per week. The intersection of the 'All' row and the '50+' column represents the total number of people who work 50+ hours a week.
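
As a small illustration of how to read individual cells, the sketch below builds a contingency table from six made-up records (not the census data) and looks up one interior cell and one margin cell by label. The column name `hours_cat` and the toy values are assumptions made for this example only.

```python
import pandas as pd

# Six made-up records; not taken from the census data.
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Female", "Female", "Male", "Female"],
    "hours_cat": ["0-9", "50+", "0-9", "0-9", "50+", "50+"],
})

# margins=True adds the 'All' row and column holding the totals.
table = pd.crosstab(toy["sex"], toy["hours_cat"], margins=True)
print(table)

# Interior cell: row label first, then column label.
males_50_plus = table.loc["Male", "50+"]

# Margin cell: everyone in the 50+ column, regardless of sex.
total_50_plus = table.loc["All", "50+"]

print(males_50_plus, total_50_plus)
```

Note that the row labels come out in sorted order, so 'Female' appears before 'Male', with 'All' last.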

### Visualizing the Contingency Table with a Stacked Bar Chart

In [None]:
# Assign the frequency values, excluding the 'All' margins.
# pd.crosstab sorts row labels alphabetically, so row 0 is Female and row 1 is Male.
femalecount = contingency_table.iloc[0, 0:6].values
malecount = contingency_table.iloc[1, 0:6].values

# Plot the stacked bar chart
fig = plt.figure(figsize=(10, 5))
sns.set(font_scale=1.8)
categories = ["0-9","10-19","20-29","30-39","40-49","50+"]
p1 = plt.bar(categories, femalecount, 0.55, color='#d62728')
p2 = plt.bar(categories, malecount, 0.55, bottom=femalecount)
plt.legend((p2[0], p1[0]), ('Male', 'Female'))
plt.xlabel('Hours per Week Worked')
plt.ylabel('Count')
plt.show()

The chart above visualizes our sample data from the census. If there is truly no relationship between sex and the number of hours per week worked, then the split between 'Male' and 'Female' would be in the same proportion for each time category. For example, if 5% of the females worked 50+ hours, we would expect the same percentage of males to work 50+ hours.
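
The "same percentage" idea can be checked numerically by turning each row of counts into proportions. The sketch below uses made-up counts (not the census data) that were deliberately chosen so that both rows have identical proportions, which is what perfect independence would look like.

```python
import numpy as np
import pandas as pd

# Made-up counts, constructed so that sex and hours are independent.
counts = pd.DataFrame(
    {"0-39": [30, 10], "40+": [60, 20]},
    index=["Female", "Male"],
)

# Divide each row by its own total to get within-row proportions.
row_share = counts.div(counts.sum(axis=1), axis=0)
print(row_share)

# Under independence, every row has the same proportions.
same_split = np.allclose(row_share.loc["Female"], row_share.loc["Male"])
print(same_split)
```

On a table built with pd.crosstab(), passing `normalize='index'` produces the same kind of row-wise proportions directly.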

### The Chi-Squared Test for Independence - Using Scipy

Rather than computing the test statistic by hand, we can take a shortcut: Scipy has a function that performs the calculation for us. Click [here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html) for the documentation.

All we need to do is format the observed values into a two-dimensional array and plug it into the function.

In [None]:
f_obs = np.array([contingency_table.iloc[0][0:6].values,
                  contingency_table.iloc[1][0:6].values])
print(f_obs)

In [None]:
chi2_results = stats.chi2_contingency(f_obs)
print(chi2_results)
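
For reference, the quantities behind the test can also be computed by hand. The sketch below does this for a made-up 2x3 table (not the census data): the expected count for each cell is (row total × column total) / grand total, the test statistic is the sum of (observed − expected)² / expected over all cells, and the degrees of freedom are (rows − 1) × (columns − 1). The variable names are chosen for this example only.

```python
import numpy as np
from scipy import stats

# Made-up 2x3 table of observed counts; not the census data.
observed = np.array([[20, 30, 50],
                     [30, 30, 40]])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 3)
n = observed.sum()

# Expected counts if the row and column variables were independent.
expected = row_totals * col_totals / n

# Pearson chi-squared statistic: sum over cells of (O - E)^2 / E.
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (rows - 1) * (columns - 1).
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# p-value: upper-tail probability of the chi-squared distribution.
p_manual = stats.chi2.sf(chi2_manual, dof)

print(expected)
print(chi2_manual, dof, p_manual)
```

For a 2x2 table (one degree of freedom), `chi2_contingency` applies Yates' continuity correction by default, so a hand calculation like this one would only match it there with `correction=False`.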

The function returns several values, so we need the documentation to see what each one means.
Using the [documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html), complete the following code block to print out our desired statistics.

In [None]:
# TODO: Store the results from the chi-squared test in the following variables:
p_value =
df = 
chi2_test_statistic = 
print(f"The chi-squared value we calculated was {chi2_test_statistic:.3f}, ")
print(f"and with {df} degrees of freedom, the p-value this results in is {p_value:.3f}.")

### Conclusion

Based on the p-value we calculated, write a conclusion about the null and alternative hypotheses stated above.
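
One common way to phrase such a conclusion is to compare the p-value with a significance level chosen in advance. The sketch below assumes a significance level of 0.05, which is a widely used convention but is not specified anywhere in this notebook; the helper name `decide` is hypothetical.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Return the usual wording for a hypothesis test decision.

    alpha is an assumed significance level (0.05 by convention).
    """
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"


# Example calls with made-up p-values.
print(decide(0.001))
print(decide(0.30))
```

Note that failing to reject the null hypothesis is not the same as showing that it is true; it only means the data do not provide enough evidence against it at the chosen level.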