# Lab 5.4: Chi-Square Tests

## Outline

- Chi-square test for goodness of fit
- Chi-square test for independence

In [1]:
import pandas as pd
import yaml

from scipy import stats
from sqlalchemy import create_engine

# safe_load avoids the Loader argument that yaml.load requires in recent PyYAML
with open('../../pg_creds.yaml') as f:
    pg_creds = yaml.safe_load(f)['student']

engine = create_engine('postgresql://{user}:{password}@{host}:{port}/{dbname}'.format(**pg_creds))

**Question 1**  

Seven percent of mutual fund investors rate corporate stocks “very safe,” 58% rate them “somewhat safe,” 24% rate them “not very safe,” 4% rate them “not at all safe,” and 7% are “not sure.” A BusinessWeek/Harris poll asked 529 mutual fund investors how they would rate corporate bonds on safety. The responses are as follows.  

<img src="images/q1.png" width="300">  

Do mutual fund investors’ attitudes toward corporate bonds differ from their attitudes toward
corporate stocks? Clearly state the null and alternative hypotheses.

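$H_0$: mutual fund investors' safety ratings of corporate bonds follow the same distribution as their ratings of corporate stocks (7% "very safe," 58% "somewhat safe," 24% "not very safe," 4% "not at all safe," 7% "not sure").

$H_a$: the distribution of safety ratings for corporate bonds differs from that for corporate stocks.

In [2]:
# Observed bond ratings (assumed to be in the same category order as the stock percentages)
corp_bonds_observed = [48, 323, 79, 16, 63]
stock_proportions = [0.07, 0.58, 0.24, 0.04, 0.07]
expected_freq = [p * sum(corp_bonds_observed) for p in stock_proportions]

stats.chisquare(corp_bonds_observed, expected_freq)

Power_divergenceResult(statistic=41.69, pvalue=1.93e-08)

Inference: With $\chi^2 \approx 41.69$ on 4 degrees of freedom and a p-value of about $1.9 \times 10^{-8}$ (both rounded), which is far below 0.05, we reject the null hypothesis. Investors' attitudes toward corporate bonds differ from their attitudes toward corporate stocks: fewer respondents than expected rate bonds "not very safe" (79 observed against about 127 expected), and more are "not sure" (63 against about 37).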
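
As a cross-check, the goodness-of-fit statistic can be computed by hand from $\chi^2 = \sum_i (O_i - E_i)^2 / E_i$, taking the stock percentages quoted in the question as the expected proportions. This is a minimal sketch: the variable names are illustrative, and the observed counts are assumed to be listed in the same category order as the percentages.

```python
import numpy as np
from scipy import stats

# Observed bond ratings (assumed order: very safe, somewhat safe,
# not very safe, not at all safe, not sure)
observed = np.array([48, 323, 79, 16, 63])
# Stock-rating proportions quoted in the question
stock_props = np.array([0.07, 0.58, 0.24, 0.04, 0.07])

# Expected counts if bonds were rated like stocks
expected = stock_props * observed.sum()

chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = len(observed) - 1                   # number of categories minus one
p_value = stats.chi2.sf(chi2_stat, dof)   # upper-tail probability

print(round(chi2_stat, 2), dof, p_value)
```

The hand-computed statistic should agree with `stats.chisquare(observed, expected)`, which applies the same formula.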

**Question 2**  

A news article reports that "Americans have differing views on two potentially inconvenient and invasive practices that airports could implement to uncover potential terrorist attacks." This news piece was based on a survey conducted among a random sample of 1,137 adults nationwide, interviewed by telephone November 7-10, 2010, where one of the questions on the survey was "Some airports are now using 'full-body' digital x-ray machines to electronically screen passengers in airport security lines. Do you think these new x-ray machines should or should not be used at airports?" Below is a summary of responses based on party affiliation.  

<img src="images/q4.png" width="550">  

The differences in each political group may be due to chance. Complete the following computations under the null
hypothesis of independence between an individual's party affiliation and his/her support of full-body scans. It may be useful to first add on an extra column for row totals before proceeding with the computations.  

1) How many Republicans would you expect to not support the use of full-body scans?  

2) How many Democrats would you expect to support the use of full-body scans?  

3) How many Independents would you expect to not know or not answer?  

4) Test if an individual's party affiliation affects his/her support of full-body scans. Clearly state the null and alternative hypotheses in your test.

In [3]:
# 1
total_should = 264+299+351
total_should_not = 38+55+77
total_no_answer = 16+15+22
total_republican = 318
total_democrat = 369
total_independent = 450
total_count = 318+369+450

In [4]:
(total_should_not/total_count) * (total_republican/total_count) * total_count

47.54617414248021

In [5]:
# 2
(total_should/total_count) * (total_democrat/total_count) * total_count

296.6279683377309

In [6]:
# 3
(total_no_answer/total_count) * (total_independent/total_count) * total_count

20.976253298153033
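
The three calculations above all use the same rule: under independence, the expected count for a cell is its row total times its column total, divided by the grand total. As a sketch (names are illustrative), all nine expected counts can be obtained at once with an outer product:

```python
import numpy as np

# Rows: Should, Should Not, No Answer
# Columns: Republican, Democrat, Independent
observed = np.array([[264, 299, 351],
                     [38, 55, 77],
                     [16, 15, 22]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)

# Expected count under independence: row total * column total / grand total
expected = np.outer(row_totals, col_totals) / observed.sum()

print(expected.round(2))
```

The entries `expected[1, 0]`, `expected[0, 1]` and `expected[2, 2]` correspond to parts 1, 2 and 3 respectively.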

### 4
$H_0$: opinion on the use of full-body x-ray machines is independent of political party affiliation.

$H_a$: opinion on the use of full-body x-ray machines depends on political party affiliation.

In [7]:
political_df = pd.DataFrame(index=['Should', 'Should Not', 'No Answer'], columns=['Republican','Democrat','Independent'])

In [8]:
political_df

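|            | Republican | Democrat | Independent |
|------------|------------|----------|-------------|
| Should     | NaN        | NaN      | NaN         |
| Should Not | NaN        | NaN      | NaN         |
| No Answer  | NaN        | NaN      | NaN         |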


In [9]:
political_df['Republican'] = [264,38,16]
political_df['Democrat'] = [299,55,15]
political_df['Independent'] = [351,77,22]

In [10]:
political_df

|            | Republican | Democrat | Independent |
|------------|------------|----------|-------------|
| Should     | 264        | 299      | 351         |
| Should Not | 38         | 55       | 77          |
| No Answer  | 16         | 15       | 22          |


In [11]:
chi2, p, dof, expected = stats.chi2_contingency(political_df)

In [12]:
p

0.35977211142065163

The p-value is greater than 0.05, so we fail to reject the null hypothesis: the data do not provide sufficient evidence that opinion on the use of full-body x-ray machines depends on political party affiliation.
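
To see where the p-value comes from, the statistic can be rebuilt from its definition. The sketch below (illustrative names) recomputes $\chi^2$ from the observed and expected counts and uses $(r - 1)(c - 1)$ degrees of freedom for an $r \times c$ table:

```python
import numpy as np
from scipy import stats

# Rows: Should, Should Not, No Answer
# Columns: Republican, Democrat, Independent
observed = np.array([[264, 299, 351],
                     [38, 55, 77],
                     [16, 15, 22]])

chi2, p, dof, expected = stats.chi2_contingency(observed)

# Recompute the statistic, degrees of freedom and p-value by hand
chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = stats.chi2.sf(chi2_manual, dof_manual)

print(chi2_manual, dof_manual, p_manual)
```

The two routes agree here because `chi2_contingency` applies a continuity correction only when there is a single degree of freedom.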

**Question 3**  

A clothes retailer believes that there is no difference in sales across Monday, Tuesday and Wednesday. You are given the data in a `cloth_sales` table (in RDS) to test the claim. The table contains two columns: `dt` for the date, and `sales`, containing the count of sales for that day. Start by drawing up the table for the observed and expected frequencies for the chi-square test. 

**Hint:** 
- It will probably be easiest to extract the week of the year (`week`) and the day of the week (`dow`) using the [`date_trunc`](https://www.postgresql.org/docs/current/static/functions-datetime.html#FUNCTIONS-DATETIME-TRUNC) and [`date_part`](https://www.postgresql.org/docs/current/static/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT) functions in PostgreSQL respectively. You can then [`pivot`](http://pandas.pydata.org/pandas-docs/stable/reshaping.html#reshaping-by-pivoting-dataframe-objects) the table with pandas. The `head` of the resulting data frame should look something like:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th>date_part</th>
      <th>1.0</th>
      <th>2.0</th>
      <th>3.0</th>
    </tr>
    <tr>
      <th>date_trunc</th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>2014-10-27</th>
      <td>1016</td>
      <td>978</td>
      <td>1010</td>
    </tr>
    <tr>
      <th>2014-11-03</th>
      <td>987</td>
      <td>991</td>
      <td>997</td>
    </tr>
    <tr>
      <th>2014-11-10</th>
      <td>1014</td>
      <td>983</td>
      <td>1002</td>
    </tr>
    <tr>
      <th>2014-11-17</th>
      <td>991</td>
      <td>945</td>
      <td>992</td>
    </tr>
    <tr>
      <th>2014-11-24</th>
      <td>1001</td>
      <td>1058</td>
      <td>1002</td>
    </tr>
  </tbody>
</table>  

(N.B. `date_part('dow', dt)` will return the number of days after Sunday, so Monday = 1.0, Tuesday = 2.0, and so on.)
   
- Use `scipy.stats.chisquare()` to carry out a goodness of fit test
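
The reshaping step in the hint can be tried without a database connection. The sketch below rebuilds the long-format rows for the first two weeks of the sample table and pivots them; the frame names `long_df` and `wide_df` are illustrative, not part of the lab.

```python
import pandas as pd

# Long format, as the query would return it: one row per (week, day of week)
long_df = pd.DataFrame({
    "date_trunc": pd.to_datetime(["2014-10-27"] * 3 + ["2014-11-03"] * 3),
    "date_part": [1.0, 2.0, 3.0] * 2,
    "sales": [1016, 978, 1010, 987, 991, 997],
})

# Wide format: one row per week, one column per day of the week
wide_df = long_df.pivot(index="date_trunc", columns="date_part", values="sales")

# Column sums give the observed totals per weekday for the chi-square test
observed = wide_df.sum()

print(wide_df)
print(observed)
```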

$H_0$: sales of clothing are distributed equally across Monday, Tuesday and Wednesday.

$H_a$: sales of clothing are not distributed equally across Monday, Tuesday and Wednesday.

In [13]:
cloth_data = pd.read_sql("SELECT date_trunc('week',dt), date_part('dow',dt), sales FROM cloth_sales", engine)

In [14]:
cloth_data.head()

|   | date_trunc | date_part | sales |
|---|------------|-----------|-------|
| 0 | 2014-10-27 | 1.0       | 1016  |
| 1 | 2014-10-27 | 2.0       | 978   |
| 2 | 2014-10-27 | 3.0       | 1010  |
| 3 | 2014-11-03 | 1.0       | 987   |
| 4 | 2014-11-03 | 2.0       | 991   |


In [15]:
cloth_data = cloth_data.pivot(index='date_trunc', columns='date_part', values='sales')

In [16]:
cloth_data.head()

| date_part  | 1.0  | 2.0  | 3.0  |
|------------|------|------|------|
| date_trunc |      |      |      |
| 2014-10-27 | 1016 | 978  | 1010 |
| 2014-11-03 | 987  | 991  | 997  |
| 2014-11-10 | 1014 | 983  | 1002 |
| 2014-11-17 | 991  | 945  | 992  |
| 2014-11-24 | 1001 | 1058 | 1002 |


In [17]:
cloth_data_total = pd.DataFrame(index=['Total'], columns=['Monday','Tuesday','Wednesday'])

In [18]:
cloth_data_total

|       | Monday | Tuesday | Wednesday |
|-------|--------|---------|-----------|
| Total | NaN    | NaN     | NaN       |


In [19]:
cloth_data_total['Monday'] = cloth_data[1.0].sum()
cloth_data_total['Tuesday'] = cloth_data[2.0].sum()
cloth_data_total['Wednesday'] = cloth_data[3.0].sum()

In [20]:
cloth_data_total

|       | Monday | Tuesday | Wednesday |
|-------|--------|---------|-----------|
| Total | 99839  | 98941   | 99892     |


In [21]:
# Convert the single-row DataFrame to a 1-D NumPy array before passing it to
# stats.chisquare. (.iloc replaces .ix, which was removed from pandas.)
import numpy as np
cloth_data_total_array = np.asarray(cloth_data_total.iloc[0])
cloth_data_total_array

array([99839, 98941, 99892])

In [22]:
expected_cloth_sales = [cloth_data_total.sum().sum()/3]*3

In [23]:
expected_cloth_sales

[99557.33333333333, 99557.33333333333, 99557.33333333333]

In [24]:
stats.chisquare(cloth_data_total_array, expected_cloth_sales)

Power_divergenceResult(statistic=5.7374444206353461, pvalue=0.056771422190437341)

The p-value is greater than 0.05 (though only slightly), so we fail to reject the null hypothesis: the data do not provide sufficient evidence that clothing sales differ across Monday, Tuesday and Wednesday.
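
When the expected frequencies are uniform, `scipy.stats.chisquare` does not need them spelled out: if `f_exp` is omitted, the categories are assumed to be equally likely. A short sketch using the weekday totals above (the variable name is illustrative):

```python
from scipy import stats

# Observed totals for Monday, Tuesday, Wednesday
observed_totals = [99839, 98941, 99892]

# f_exp omitted: all categories are assumed equally likely
result = stats.chisquare(observed_totals)

print(result.statistic, result.pvalue)
```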

**Question 4**  

1) A lawsuit has been filed against a university, alleging sexual discrimination against female applicants in the admissions process. Use the data below to test whether sex affects admission at this university.
      
   **Hint:**
   - Construct your null and alternative hypotheses
   - Use `scipy.stats.chi2_contingency()` to carry out a test for independence
   - The function takes the contingency table as a NumPy array as its first argument


|        | Admitted | Not Admitted |
|--------|----------|--------------|
| Male   | 3715     | 4727         |
| Female | 1513     | 2808         |

2) You are also given the breakdown of the female and male admissions by department (A to F).

<img src="images/paradox_1.png" width="300px">

Test if sex and department are independent in terms of number of applicants.

3) (Extra Credit) Based on all the data you are given, is it fair to say that there is sexual discrimination in the admission process? Explain your answer. (Hint: Simpson's paradox)

### 1

$H_0$: admission to the university is independent of sex.

$H_a$: admission to the university depends on sex.

In [25]:
university_df = pd.DataFrame({"Male": [3715,4727], "Female": [1513,2808]}, index=['Admitted','Not Admitted']).T

In [26]:
university_df

|        | Admitted | Not Admitted |
|--------|----------|--------------|
| Female | 1513     | 2808         |
| Male   | 3715     | 4727         |


In [27]:
chi2, p, dof, expected = stats.chi2_contingency(university_df)

In [28]:
p

1.7473568514632148e-22

The p-value is less than 0.05, therefore we reject the null hypothesis and conclude that admissions to university are dependent on sex.
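
Two details help when reading this result. For a 2×2 table, `chi2_contingency` applies Yates' continuity correction by default (`correction=True`), and the p-value by itself says nothing about direction, so it is worth looking at the admission rates as well. A sketch with illustrative names:

```python
import pandas as pd
from scipy import stats

admissions = pd.DataFrame(
    {"Admitted": [3715, 1513], "Not Admitted": [4727, 2808]},
    index=["Male", "Female"],
)

# Admission rate by sex: admitted / total applicants in each row
rates = admissions["Admitted"] / admissions.sum(axis=1)

# Default behaviour: Yates' continuity correction for a 2x2 table
_, p_corrected, _, _ = stats.chi2_contingency(admissions)
# Same test without the correction
_, p_uncorrected, _, _ = stats.chi2_contingency(admissions, correction=False)

print(rates.round(3))
print(p_corrected, p_uncorrected)
```

With samples this large the correction barely matters; both p-values are far below 0.05, and the rates show that men were admitted at the higher overall rate.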

### 2

$H_0$: sex and department are independent in terms of number of applicants.

$H_a$: sex and department are dependent in terms of number of applicants.

In [29]:
university_department_df = pd.DataFrame({"Male Applicants": [825,560,325,417,191,373], "Female Applicants": [108,25,593,375,393,341]}, index=['A','B','C','D','E','F'])

In [30]:
university_department_df

|   | Female Applicants | Male Applicants |
|---|-------------------|-----------------|
| A | 108               | 825             |
| B | 25                | 560             |
| C | 593               | 325             |
| D | 375               | 417             |
| E | 393               | 191             |
| F | 341               | 373             |


In [31]:
chi2, p, dof, expected = stats.chi2_contingency(university_department_df)

In [32]:
p

9.4440769769112824e-229

The p-value is less than 0.05, therefore we reject the null hypothesis and conclude that sex and department are dependent in terms of number of applicants.

### 3
We need to create dataframes per department based on the table and run tests per department.

In [33]:
university_department_a_df = pd.DataFrame({"Male": [512,314], "Female": [89,19]}, index=['Admitted','Not Admitted']).T

In [34]:
university_department_b_df = pd.DataFrame({"Male": [353,207], "Female": [17,8]}, index=['Admitted','Not Admitted']).T

In [35]:
university_department_c_df = pd.DataFrame({"Male": [120,205], "Female": [202,391]}, index=['Admitted','Not Admitted']).T

In [36]:
university_department_d_df = pd.DataFrame({"Male": [138,279], "Female": [131,244]}, index=['Admitted','Not Admitted']).T

In [37]:
university_department_e_df = pd.DataFrame({"Male": [53,138], "Female": [94,299]}, index=['Admitted','Not Admitted']).T

In [38]:
university_department_f_df = pd.DataFrame({"Male": [22,351], "Female": [24,317]}, index=['Admitted','Not Admitted']).T

In [39]:
university_department_total_df = pd.DataFrame({"Male": [1198,1493], "Female": [557,1278]}, index=['Admitted','Not Admitted']).T

In [40]:
stats.chi2_contingency(university_department_a_df)[1]

4.9055086094758964e-05

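The p-value is less than 0.05, so admission and sex are significantly associated in department A. The counts show the direction: about 82% of female applicants were admitted, against about 62% of male applicants.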

In [41]:
stats.chi2_contingency(university_department_b_df)[1]

0.77050405320557358

The p-value is greater than 0.05, so we find no significant association between sex and admission in department B.

In [42]:
stats.chi2_contingency(university_department_c_df)[1]

0.42617526141992279

The p-value is greater than 0.05, so we find no significant association between sex and admission in department C.

In [43]:
stats.chi2_contingency(university_department_d_df)[1]

0.63782826912679236

The p-value is greater than 0.05, so we find no significant association between sex and admission in department D.

In [44]:
stats.chi2_contingency(university_department_e_df)[1]

0.36869809459730318

The p-value is greater than 0.05, so we find no significant association between sex and admission in department E.

In [45]:
stats.chi2_contingency(university_department_f_df)[1]

0.64038166517852968

The p-value is greater than 0.05, so we find no significant association between sex and admission in department F.

In [46]:
stats.chi2_contingency(university_department_total_df)[1]

1.0557968087828395e-21

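Pooled across departments, admission is significantly associated with sex, with men admitted at the higher overall rate. Within departments, however, the association is significant only in department A, and there it favors women.

This is an instance of Simpson's paradox. The test in part 2 shows that men and women applied to different departments: about 93% of female applicants applied to departments C through F, which admit a much smaller share of applicants of either sex, while about half of male applicants applied to departments A and B, which admit most of their applicants. The pooled gap therefore reflects where people applied rather than how each department treated them, so these data alone do not support the charge of discrimination against female applicants.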
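
The extra-credit hint points to Simpson's paradox, which is easiest to see by comparing admission rates within each department against the pooled rates. The sketch below reuses the per-department counts entered in the cells above; the dictionary layout and helper names are illustrative.

```python
import pandas as pd

# (admitted, not admitted) per department, copied from the cells above
male = {"A": (512, 314), "B": (353, 207), "C": (120, 205),
        "D": (138, 279), "E": (53, 138), "F": (22, 351)}
female = {"A": (89, 19), "B": (17, 8), "C": (202, 391),
          "D": (131, 244), "E": (94, 299), "F": (24, 317)}


def rate(counts):
    """Admission rate from an (admitted, not admitted) pair."""
    admitted, rejected = counts
    return admitted / (admitted + rejected)


def pooled(groups):
    """Admission rate over all departments combined."""
    admitted = sum(a for a, _ in groups.values())
    total = sum(a + r for a, r in groups.values())
    return admitted / total


by_dept = pd.DataFrame({
    "male_rate": {dept: rate(c) for dept, c in male.items()},
    "female_rate": {dept: rate(c) for dept, c in female.items()},
})

print(by_dept.round(3))
print("pooled male:", round(pooled(male), 3))
print("pooled female:", round(pooled(female), 3))
```

Women have the higher admission rate in four of the six departments (A, B, D and F), yet the lower pooled rate, because most of their applications went to departments that admit few applicants.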

#### Hint

Some functions that may be useful to you:

- From the `sqlalchemy` package:
    - `create_engine`
- From the `pandas` package:
    - `read_sql`
    - `pivot`
- From the `scipy.stats` package:
    - `chisquare`
    - `chi2_contingency`