# Lambda School Data Science Module 122
## Hypothesis Testing - Chi-Square Tests

# Objectives

* Explain the purpose of a chi-square test and identify applications
* Use a chi-square test for independence to test for a statistically significant association between two categorical variables
* Use a chi-square test p-value to draw the correct conclusion about the null and alternative hypothesis


## Prepare 
In the last lecture, we learned about the t-test, which allows you to weigh evidence for or against the claim that the mean of a population is equal to a reference value (the null hypothesis).

T-tests are often the appropriate statistical test when you are working with a quantitative, continuous variable.

However, there are lots of other kinds of data and many other methods of data analysis.  For example, we might like to examine the relationship between two categorical variables.  In that case, we can use a chi-square test.  "Chi-square" refers to a particular statistical distribution: the chi-square test is named for the chi-square distribution, just as the t-test is named for the t-distribution.

The chi-square test works - in general - by comparing the counts that actually appear in a two-way table to the counts we would expect to see if the two variables were not related to each other at all.
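As a minimal sketch of that comparison, the cell below computes expected counts and the chi-square statistic by hand. The counts are made up for illustration (they are not Titanic data), and the variable names are assumptions.

```python
import numpy as np

# Hypothetical observed counts for two categorical variables
# (rows = group A / group B, columns = outcome no / yes). Not Titanic data.
observed = np.array([[30, 10],
                     [20, 40]])

row_totals = observed.sum(axis=1)  # total count in each row
col_totals = observed.sum(axis=0)  # total count in each column
n = observed.sum()                 # grand total

# Counts we would expect in each cell if the two variables were unrelated:
# (row total * column total) / grand total
expected = np.outer(row_totals, col_totals) / n

# Chi-square statistic: sum over all cells of (observed - expected)^2 / expected
chi2_stat = ((observed - expected) ** 2 / expected).sum()

print(expected)
print(chi2_stat)
```

The further the observed counts are from the expected counts, the larger the statistic and the stronger the evidence against "no relationship".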

[More about the Chi-square test](https://en.wikipedia.org/wiki/Chi-squared_test).


## Titanic Example

In the early hours of April 15, 1912, the supposedly unsinkable ship RMS Titanic sank after striking an iceberg, killing more than half of the passengers and crew aboard.

The Titanic.csv dataset contains demographic information for 889 of those passengers as well as a record of whether or not each passenger survived. 

Our goal is to determine if there is a relationship between ticket class and passenger survival on the Titanic.



A chi-square test *always* tests the null hypothesis that there is *no* relationship between two variables vs. the alternative hypothesis that there *is* some relationship between the two variables.


Therefore, in this example:

**Ho:** There ____________________ relationship between passenger ticket class and survival on the Titanic.

**Ha:** There _____________________ relationship between passenger ticket class and survival on the Titanic.

In [None]:
import pandas as pd
import numpy as np


data_url = 'https://raw.githubusercontent.com/Chelsea-Myers/Lambda-Intro/master/Titanic.csv'

titanic = pd.read_csv(data_url, skipinitialspace=True, header=0)

print(titanic.shape)
titanic.head()

Survived = 0 means the passenger did not survive and 
Survived = 1 means the passenger did survive.

Pclass = 1, 2, 3 indicates the passenger had a 1st, 2nd or 3rd class ticket, respectively.

To start, let's look at the frequency and relative frequency of survival on the Titanic.
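One way to get both summaries in pandas is `value_counts`. The sketch below uses a tiny hypothetical column rather than the Titanic file, so the numbers mean nothing on their own; only the pattern carries over.

```python
import pandas as pd

# Hypothetical toy data (not the Titanic file): 0 = did not survive, 1 = survived
toy = pd.DataFrame({'Survived': [0, 0, 0, 1, 1]})

# Frequency: how many rows fall in each category
counts = toy['Survived'].value_counts()

# Relative frequency: proportions, multiplied by 100 to get percents
percents = toy['Survived'].value_counts(normalize=True) * 100

print(counts)
print(percents)
```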

In [None]:
#Frequency of survival



In [None]:
#Relative frequency of survival.  Multiply by 100 to convert from
#proportions to percents




Survival results here:   


Now let's look at the frequency and relative frequency of ticket class.


In [None]:
#Frequency and relative frequency of ticket class.

Passenger class results here: 

Now let's look at the joint distribution of survival by passenger class.  That means we want to see how many people fall into each combination of the two categories.
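A two-way table of counts can be built with `pd.crosstab`. This is a sketch on ten hypothetical passengers (assumed column names `Pclass` and `Survived`, matching the description above), not the real data.

```python
import pandas as pd

# Hypothetical toy data: ten made-up passengers
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    'Survived': [1, 1, 0, 1, 0, 0, 0, 0, 1, 0],
})

# Rows are ticket classes, columns are survival outcomes,
# and each cell counts the passengers in that combination
joint = pd.crosstab(toy['Pclass'], toy['Survived'])
print(joint)
```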

In [None]:
#Joint distribution

So... is there a relationship between ticket class and survival?  

Let's begin by including the marginal distribution of each variable.  We actually calculated those before, but we can add them to the "margins" of the two-way table (hence the name marginal distribution) so we can remember how many people survived overall and how many people were in each ticket class overall.
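In `pd.crosstab`, the `margins=True` argument appends an `All` row and an `All` column holding those totals. A sketch on the same kind of hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data: ten made-up passengers
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    'Survived': [1, 1, 0, 1, 0, 0, 0, 0, 1, 0],
})

# margins=True adds row and column totals labelled 'All'
joint_margins = pd.crosstab(toy['Pclass'], toy['Survived'], margins=True)
print(joint_margins)
```

The `All` column repeats the ticket class frequencies and the `All` row repeats the survival frequencies.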

In [None]:
# Joint distribution with margins

But what we really want to know is "Of people in each ticket class, what proportion survived?"  We can compare those proportions and see if they are the same or different.

We use `normalize='index'` in `pd.crosstab` to tell pandas that we want to compute the proportion of individuals who did and did not survive within each level of the row variable (what pandas calls the index).  

In statistical terminology, we call this the conditional distribution.  We are computing the distribution of survival *conditional* on what passenger class they were in.
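A sketch of a conditional distribution on hypothetical toy data: with `normalize='index'`, each row of the table is divided by its own row total, so every row sums to 1.

```python
import pandas as pd

# Hypothetical toy data: ten made-up passengers
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    'Survived': [1, 1, 0, 1, 0, 0, 0, 0, 1, 0],
})

# Proportion surviving / not surviving within each ticket class
cond = pd.crosstab(toy['Pclass'], toy['Survived'], normalize='index')
print(cond)

# Each row is a distribution of its own, so it sums to 1
print(cond.sum(axis=1))
```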

In [None]:
#Conditional distribution of survival by passenger class

We observe:

But because we haven't actually computed a statistical test, we don't know for sure if there is strong evidence that there is a relationship between ticket class and survival.  That's where the Chi-Square test comes in.

As a refresher:

**Ho:** 

**Ha:** 

Just like in the t-test examples, if the p-value is less than the significance level, we will reject the null hypothesis.  If the p-value is greater than the significance level, we will fail to reject.
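The decision rule can be written as a small function. `decide` is a hypothetical helper for illustration, and the 0.05 default is just the conventional significance level.

```python
def decide(p_value, alpha=0.05):
    """Return the hypothesis-test decision for a p-value at level alpha."""
    if p_value < alpha:
        return 'reject the null hypothesis'
    return 'fail to reject the null hypothesis'


print(decide(0.003))  # small p-value
print(decide(0.40))   # large p-value
```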

We import the chi-square function (chi2_contingency) from scipy.stats.  

**Take a very close look at the entry in the contingency table function**. It is the *table* we created above, not just the two variables of interest.

The chi2_contingency function has a lot of output, but we are most interested in the p-value, which we are calling p below.  
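Here is a sketch of the call on hypothetical toy data (120 made-up passengers). Note that the table passed in is the plain table of counts, without the `All` margins and without normalizing.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data: 40 made-up passengers in each ticket class
toy = pd.DataFrame({
    'Pclass':   [1] * 40 + [2] * 40 + [3] * 40,
    'Survived': [1] * 25 + [0] * 15 + [1] * 20 + [0] * 20 + [1] * 10 + [0] * 30,
})

# Table of counts: no margins, no normalization
table = pd.crosstab(toy['Pclass'], toy['Survived'])

# The result unpacks into the test statistic, the p-value,
# the degrees of freedom, and the table of expected counts
chi2, p, dof, expected = chi2_contingency(table)

print('Chi-square statistic:', chi2)
print('P-value:', p)
print('Degrees of freedom:', dof)
```

For a table with r rows and c columns, the degrees of freedom are (r - 1) * (c - 1).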

In [None]:
from scipy.stats import chi2_contingency

#Chi-square test


P-value = 

So... we definitely think that passenger ticket class is ???

Let's make a nice visualization - a side by side bar plot - to illustrate this relationship.



First, let's take a look at our conditional distribution of survival by passenger class again.

In [None]:
#Copy code for conditional distribution of survival by passenger class here:

We'd like to create a bar plot where we compare the percent of survivors in each passenger class.

We're going to start by creating two vectors: one for the percent of individuals who survived and one for the percent of individuals who didn't survive and plotting those with the help of some graphing parameters that are going to make everything line up nicely.
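As a sketch of how those two vectors might be pulled out of a conditional distribution table, the cell below uses hypothetical toy data; the lowercase names `died`, `survived`, and `n_groups` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: ten made-up passengers
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    'Survived': [1, 1, 0, 1, 0, 0, 0, 0, 1, 0],
})

# Conditional distribution of survival by class, as percents
cond_pct = pd.crosstab(toy['Pclass'], toy['Survived'], normalize='index') * 100

# Each column of the table is one vector: one value per ticket class
died = cond_pct[0].values      # percent that died in each class
survived = cond_pct[1].values  # percent that survived in each class

# Number of groups on the x axis and their positions
n_groups = len(cond_pct.index)
ind = np.arange(n_groups)

print(died, survived, ind)
```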

[More info about barplots](https://matplotlib.org/examples/api/barchart_demo.html).

In [None]:
import matplotlib.pyplot as plt

# Need this for graphing purposes - it's the number of passenger classes
N = ## Fill in here 


Died = (## Fill in here##) # Percent that died in each ticket class
Survived = (## Fill in here##) #Percent that survived in each ticket class

#This is more graphical stuff
ind = np.arange(N)  # the x locations for the groups
width = 0.35       # the width of the bars

#Create the plot
fig, ax = plt.subplots()
rects1 = ax.bar(ind, Died, width, color='g') #bars for died
rects2 = ax.bar(ind + width, Survived, width, color='b') #bars for survived

# add some text for labels, title and axes ticks
ax.set_ylabel('## Fill in here##')
ax.set_title('## Fill in here##')
ax.set_xticks(ind + width / 2)
ax.set_xticklabels(('## Fill in here##'))

ax.legend((rects1[0], rects2[0]), ('##Fill in here##'))



We can see by our lovely graph and chi-square test that:



---



Now let's look at passenger sex and survival.  Were women and children really the first ones in the lifeboats?

Note that here both child and adult males are considered male and child and adult females are considered female in the data so we really can't conclude anything about children from this analysis.


First, is a chi-square test appropriate for these data?

Answer: 

What is the distribution of passenger sex on the Titanic?

In [None]:
#Frequency of gender



#Relative frequency of gender



Passengers on the Titanic were...

Refresh your memory by calculating the distribution of passenger survival.

In [None]:
#Frequency of survival



#Relative frequency of survival



Survival results: 

Calculate the joint distribution of passenger sex and survival.  Add on the margins.  Can you draw any initial conclusions about the relationship between passenger sex and survival?

In [None]:
#Joint distribution and joint distribution adding margins.

Results: 

Calculate the distribution of survival conditional on passenger sex.  What does this tell you?

In [None]:
#Conditional distribution of survival by passenger sex

Results: 

Now we need to conduct the chi-square test.  What are our hypotheses?

**Ho:**

**Ha:** 

In [None]:
#chi-square test

What is the p-value?  What do we conclude (at the 0.05 significance level) about the relationship between passenger sex and survival?

Results: 

Create a side-by-side bar plot illustrating the relationship of passenger sex and survival.  

In [None]:
#Need this for graphing purposes - it's the number of sexes (male and female)

N = ## Fill in here ##


Died = ## Fill in here ## # Percent that died for each sex
Survived = ## Fill in here ## #Percent that survived for each sex

#This is more graphical stuff
ind = np.arange(N)  # the x locations for the groups
width = 0.35       # the width of the bars

#Create the plot
fig, ax = plt.subplots()
rects1 = ax.bar(ind, Died, width, color='g') #bars for died
rects2 = ax.bar(ind + width, Survived, width, color='b') #bars for survived

# add some text for labels, title and axes ticks
ax.set_ylabel('##Fill in here##')
ax.set_title('##Fill in here##')
ax.set_xticks(ind + width / 2)
ax.set_xticklabels(('##Fill in here##'))

ax.legend((rects1[0], rects2[0]), ('##Fill in here##'))

Explain your results to someone who is interested in Titanic history but knows little about statistics.

# Determine family size on the Titanic

As a proxy, how many people have the same last name?
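One possible approach, sketched on made-up names: split off the last name and count how many rows share it. The column name `Name` and the "Last, Title. First" format are assumptions about the dataset, and the new column names are hypothetical.

```python
import pandas as pd

# Hypothetical names, assumed to follow a 'Last, Title. First' format
toy = pd.DataFrame({'Name': [
    'Smith, Mr. John',
    'Smith, Mrs. Anna',
    'Lee, Miss. May',
    'Smith, Master. Tom',
    'Lee, Mr. Ken',
    'Brown, Mr. Al',
]})

# Everything before the first comma is treated as the last name
toy['LastName'] = toy['Name'].str.split(',').str[0].str.strip()

# For each passenger, count how many rows share that last name
toy['SameNameCount'] = toy.groupby('LastName')['LastName'].transform('count')

print(toy)
```

This is only a rough proxy: unrelated passengers can share a last name, and relatives can have different ones.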

In [None]:
#Print the first 5 observations
titanic.head()