# Lab 5.4: Chi-Square Tests

## Outline

- Chi-square test for goodness of fit
- Chi-square test for independence

In [1]:
import pandas as pd
import yaml

from scipy import stats
from sqlalchemy import create_engine

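# safe_load avoids the Loader argument that newer PyYAML versions require for yaml.load
with open('../../pg_creds.yaml') as f:
    pg_creds = yaml.safe_load(f)['student']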

engine = create_engine('postgresql://{user}:{password}@{host}:{port}/{dbname}'.format(**pg_creds))

**Question 1**  

Seven percent of mutual fund investors rate corporate stocks “very safe,” 58% rate them “somewhat safe,” 24% rate them “not very safe,” 4% rate them “not at all safe,” and 7% are “not sure.” A BusinessWeek/Harris poll asked 529 mutual fund investors how they would rate corporate bonds on safety. The responses are as follows.  

<img src="images/q1.png" width="300">  

Do mutual fund investors’ attitudes toward corporate bonds differ from their attitudes toward
corporate stocks? Clearly state the null and alternative hypotheses.
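
Under the null hypothesis that bond ratings follow the same distribution as the stock ratings, the expected counts are the stated percentages applied to the 529 respondents. A minimal sketch of that step follows. The helper name `goodness_of_fit` is an assumption made for illustration, and the observed counts are left as an argument because they come from the table in the image above.

```python
import numpy as np
from scipy import stats

# Category shares for corporate stocks, taken from the question text
# (very safe, somewhat safe, not very safe, not at all safe, not sure).
stock_shares = np.array([0.07, 0.58, 0.24, 0.04, 0.07])
n_respondents = 529

# Expected counts under H0: bond ratings follow the stock distribution.
expected = stock_shares * n_respondents


def goodness_of_fit(observed):
    """Chi-square goodness-of-fit test of observed counts against `expected`.

    `observed` must list the five counts in the same category order and
    sum to 529. (Hypothetical helper, not part of the lab.)
    """
    return stats.chisquare(f_obs=observed, f_exp=expected)
```

Passing the five observed bond counts to `goodness_of_fit` returns the test statistic and p-value, with 5 - 1 = 4 degrees of freedom.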

**Question 2**  

A news article reports that "Americans have differing views on two potentially inconvenient and invasive practices that airports could implement to uncover potential terrorist attacks." This news piece was based on a survey conducted among a random sample of 1,137 adults nationwide, interviewed by telephone November 7-10, 2010, where one of the questions on the survey was "Some airports are now using 'full-body' digital x-ray machines to electronically screen passengers in airport security lines. Do you think these new x-ray machines should or should not be used at airports?" Below is a summary of responses based on party affiliation.  

<img src="images/q4.png" width="550">  

The differences in each political group may be due to chance. Complete the following computations under the null
hypothesis of independence between an individual's party affiliation and his/her support of full-body scans. It may be useful to first add on an extra column for row totals before proceeding with the computations.  

1) How many Republicans would you expect to not support the use of full-body scans?  

2) How many Democrats would you expect to support the use of full-body scans?  

3) How many Independents would you expect to not know or not answer?  

4) Test if an individual's party affiliation affects his/her support of full-body scans. Clearly state the null and alternative hypotheses in your test.
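
Under independence, the expected count in each cell is (row total × column total) / grand total. The sketch below shows that computation on a toy table. The numbers are placeholders and not the survey data, and the function name `expected_counts` is an assumption made for illustration.

```python
import numpy as np
from scipy import stats


def expected_counts(observed):
    """Expected cell counts under independence: row total * column total / n."""
    observed = np.asarray(observed, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    return row_totals * col_totals / observed.sum()


# Toy 2x2 table with placeholder numbers, NOT the survey data.
toy = np.array([[10, 20],
                [30, 40]])
toy_expected = expected_counts(toy)

# chi2_contingency reports the same expected table as its fourth return value.
chi2, p_value, dof, scipy_expected = stats.chi2_contingency(toy)
```

The same function should apply to the party-affiliation table once its counts are entered as a 2-D array with one row per response and one column per party.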

**Question 3**  

A clothes retailer believes that there is no difference in sales across Monday, Tuesday and Wednesday. You are given the data in a `cloth_sales` table (in RDS) to test the claim. The table contains two columns: `dt` for the date, and `sales`, containing the count of sales for that day. Start by drawing up the table for the observed and expected frequencies for the chi-square test. 

**Hint:** 
- It will probably be easiest to extract the week of the year (`week`) and the day of the week (`dow`) using the [`date_trunc`](https://www.postgresql.org/docs/current/static/functions-datetime.html#FUNCTIONS-DATETIME-TRUNC) and [`date_part`](https://www.postgresql.org/docs/current/static/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT) functions in PostgreSQL respectively. You can then [`pivot`](http://pandas.pydata.org/pandas-docs/stable/reshaping.html#reshaping-by-pivoting-dataframe-objects) the table with pandas. The `head` of the resulting data frame should look something like:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th>date_part</th>
      <th>1.0</th>
      <th>2.0</th>
      <th>3.0</th>
    </tr>
    <tr>
      <th>date_trunc</th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>2014-10-27</th>
      <td>1016</td>
      <td>978</td>
      <td>1010</td>
    </tr>
    <tr>
      <th>2014-11-03</th>
      <td>987</td>
      <td>991</td>
      <td>997</td>
    </tr>
    <tr>
      <th>2014-11-10</th>
      <td>1014</td>
      <td>983</td>
      <td>1002</td>
    </tr>
    <tr>
      <th>2014-11-17</th>
      <td>991</td>
      <td>945</td>
      <td>992</td>
    </tr>
    <tr>
      <th>2014-11-24</th>
      <td>1001</td>
      <td>1058</td>
      <td>1002</td>
    </tr>
  </tbody>
</table>  

(N.B. `date_part('dow', dt)` will return the number of days after Sunday, so Monday = 1.0, Tuesday = 2.0, and so on.)
   
- Use `scipy.stats.chisquare()` to carry out a goodness of fit test

In [8]:
cloth_data = pd.read_sql("SELECT date_trunc('week',dt), date_part('dow',dt), sales FROM cloth_sales", engine)

In [9]:
cloth_data.head()

|   | date_trunc | date_part | sales |
|---|------------|-----------|-------|
| 0 | 2014-10-27 | 1.0       | 1016  |
| 1 | 2014-10-27 | 2.0       | 978   |
| 2 | 2014-10-27 | 3.0       | 1010  |
| 3 | 2014-11-03 | 1.0       | 987   |
| 4 | 2014-11-03 | 2.0       | 991   |


In [11]:
cloth_data = cloth_data.pivot(index='date_trunc', columns='date_part', values='sales')

In [12]:
cloth_data.head()

| date_part  | 1.0  | 2.0  | 3.0  |
|------------|------|------|------|
| date_trunc |      |      |      |
| 2014-10-27 | 1016 | 978  | 1010 |
| 2014-11-03 | 987  | 991  | 997  |
| 2014-11-10 | 1014 | 983  | 1002 |
| 2014-11-17 | 991  | 945  | 992  |
| 2014-11-24 | 1001 | 1058 | 1002 |


In [13]:
stats.chisquare(cloth_data)

Power_divergenceResult(statistic=array([  40.51101273,  127.43775583,   33.04504865]), pvalue=array([ 0.99999997,  0.02865544,  1.        ]))
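
Note that `stats.chisquare` works along `axis=0` by default, so passing the pivoted frame runs one test per weekday column (comparing weeks within each day), which is why the result above contains three statistics. The claim in the question compares the weekdays with each other, so one possible approach is to collapse the table to a total per weekday first. The sketch below does this using only the five weeks shown in the `head` output, so its result is illustrative and will likely differ from a test on the full table. The variable names are assumptions.

```python
import pandas as pd
from scipy import stats

# The five weeks shown by cloth_data.head(); the full table may have more rows.
head_rows = pd.DataFrame(
    {1.0: [1016, 987, 1014, 991, 1001],
     2.0: [978, 991, 983, 945, 1058],
     3.0: [1010, 997, 1002, 992, 1002]},
    index=pd.to_datetime(['2014-10-27', '2014-11-03', '2014-11-10',
                          '2014-11-17', '2014-11-24']),
)

# Observed frequencies: total sales per weekday (Monday, Tuesday, Wednesday).
observed = head_rows.sum(axis=0)

# With f_exp omitted, chisquare assumes equal expected frequencies,
# which matches H0: sales are the same on all three days.
result = stats.chisquare(observed)
```

On the full pivoted frame, the equivalent call would be `stats.chisquare(cloth_data.sum(axis=0))`, which yields a single statistic with 3 - 1 = 2 degrees of freedom.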

**Question 4**  

1) A lawsuit has been filed against a university, charging it with sexual discrimination against female applicants during the admissions process. Use the data below to test whether sex affects admission at this university.
      
   **Hint:**
   - Construct your null and alternative hypotheses
   - Use `scipy.stats.chi2_contingency()` to carry out a test for independence
   - The function takes the contingency table as a `numpy` array as its first argument


|        | Admitted | Not Admitted |
|--------|----------|--------------|
| Male   | 3715     | 4727         |
| Female | 1513     | 2808         |

2) You are also given the breakdown of the female and male admissions by department (A to F).

<img src="images/paradox_1.png" width="300px">

Test if sex and department are independent in terms of number of applicants.

3) (Extra Credit) Based on all the data you are given, is it fair to say that there is sexual discrimination in the admission process? Explain your answer. (Hint: Simpson's paradox)
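
For part 1), the aggregate admissions table can be passed to `chi2_contingency` directly. The following is a minimal sketch under the assumption that rows are sex and columns are admission outcome. The variable names are illustrative, and interpreting the result is left to the exercise.

```python
import numpy as np
from scipy import stats

# Rows: Male, Female; columns: Admitted, Not Admitted (table from part 1).
admissions = np.array([[3715, 4727],
                       [1513, 2808]])

# H0: admission is independent of sex; H1: they are not independent.
# For 2x2 tables chi2_contingency applies Yates' continuity correction
# by default (correction=True).
chi2, p_value, dof, expected = stats.chi2_contingency(admissions)

print(f"chi2 = {chi2:.2f}, dof = {dof}, p-value = {p_value:.3g}")
```

The returned `expected` array holds the counts implied by independence, which can be compared cell by cell with the observed table. The same call should work for the department-by-sex table in part 2) once its counts are entered.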

#### Hint

Some functions that may be useful to you:

- From the `sqlalchemy` package:
    - `create_engine`
- From the `pandas` package:
    - `read_sql`
    - `pivot`
- From the `scipy.stats` package:
    - `chisquare`
    - `chi2_contingency`