# Lab 5.4: Chi-Square Tests

## Outline

- Chi-square test for goodness of fit
- Chi-square test for independence

**Question 1**  

Seven percent of mutual fund investors rate corporate stocks “very safe,” 58% rate them “somewhat safe,” 24% rate them “not very safe,” 4% rate them “not at all safe,” and 7% are “not sure.” A BusinessWeek/Harris poll asked 529 mutual fund investors how they would rate corporate bonds on safety. The responses are as follows.  

<img src="images/q1.png" width="300">  

Do mutual fund investors’ attitudes toward corporate bonds differ from their attitudes toward
corporate stocks? Clearly state the null and alternative hypotheses.

In [38]:
%pylab inline

import pandas as pd
import statsmodels.api as sm

from IPython.display import Latex, YouTubeVideo

from scipy import stats
from statsmodels.graphics.gofplots import qqplot
from statsmodels.distributions.empirical_distribution import ECDF

Populating the interactive namespace from numpy and matplotlib


H0: attitudes toward bonds follow the same distribution as attitudes toward stocks, i.e. the following probabilities hold:

- P(very safe) = .07
- P(somewhat safe) = .58
- P(not very safe) = .24
- P(not at all safe) = .04
- P(not sure) = .07

Ha: attitudes toward bonds do not follow the stock distribution, i.e. at least one P_i differs.

Conduct a chi-square goodness-of-fit test.

In [26]:
import numpy as np

attitudes = ["very safe", "somewhat safe", "not very safe", "not at all safe", "not sure"]
counts = [48, 323, 79, 16, 63]
n = sum(counts)
expected = np.array([.07, .58, .24, .04, .07]) * n

print(pd.DataFrame({"Counts": counts, "Expected": expected}, index=attitudes).T)

stats.chisquare(counts, expected)

          very safe  somewhat safe  not very safe  not at all safe  not sure
Counts        48.00         323.00          79.00            16.00     63.00
Expected      37.03         306.82         126.96            21.16     37.03


Power_divergenceResult(statistic=41.691944400470568, pvalue=1.9323272579221198e-08)

Since p ≈ 1.9e-08 < .05, we reject the null hypothesis. The data provide evidence that investors' attitudes toward bonds differ from their attitudes toward stocks.
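To make the test statistic less of a black box, the sketch below recomputes it by hand from the observed counts and the null proportions given above. The variable names are illustrative, and the p-value is taken from the upper tail of a chi-square distribution with (number of categories - 1) degrees of freedom.

```python
import numpy as np
from scipy import stats

# Observed bond ratings from the poll (n = 529)
observed = np.array([48, 323, 79, 16, 63])
# Null-hypothesis proportions, taken from the stock ratings
null_probs = np.array([0.07, 0.58, 0.24, 0.04, 0.07])

# Expected counts if bonds were rated like stocks
expected = null_probs * observed.sum()

# Chi-square statistic: sum over categories of (O - E)^2 / E
chi2_stat = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: number of categories minus one
dof = len(observed) - 1

# p-value: upper-tail probability of the chi-square distribution
p_value = stats.chi2.sf(chi2_stat, dof)

print("chi2 = {:.2f}, dof = {}, p = {:.3g}".format(chi2_stat, dof, p_value))
```

The two largest contributions come from "not very safe" and "not sure", which is where the bond responses depart most from the stock proportions.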

**Question 2**  

A news article reports that "Americans have differing views on two potentially inconvenient and invasive practices that airports could implement to uncover potential terrorist attacks." This news piece was based on a survey conducted among a random sample of 1,137 adults nationwide, interviewed by telephone November 7-10, 2010, where one of the questions on the survey was "Some airports are now using 'full-body' digital x-ray machines to electronically screen passengers in airport security lines. Do you think these new x-ray machines should or should not be used at airports?" Below is a summary of responses based on party affiliation.  

<img src="images/q4.png" width="550">  

The differences in each political group may be due to chance. Complete the following computations under the null
hypothesis of independence between an individual's party affiliation and his/her support of full-body scans. It may be useful to first add on an extra column for row totals before proceeding with the computations.  

1) How many Republicans would you expect to not support the use of full-body scans?  

2) How many Democrats would you expect to support the use of full-body scans?  

3) How many Independents would you expect to not know or not answer?  

4) Test if an individual's party affiliation affects his/her support of full-body scans. Clearly state the null and alternative hypotheses in your test.


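1) How many Republicans would you expect to not support the use of full-body scans?

Under independence, the expected count is (row total × column total) / n = 170 × 318 / 1137 ≈ 47.5 Republicans (38 were observed).

2) How many Democrats would you expect to support the use of full-body scans? 

914 × 369 / 1137 ≈ 296.6 Democrats (299 were observed).

3) How many Independents would you expect to not know or not answer? 

53 × 450 / 1137 ≈ 21.0 Independents (22 were observed).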
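Under the null hypothesis of independence, each expected cell count is (row total × column total) / n. A minimal sketch of that computation, assuming the response counts from the survey table and using illustrative variable names:

```python
import numpy as np
import pandas as pd

answers = ["should", "should not", "don't know/no answer"]
observed = pd.DataFrame(
    {"Republican": [264, 38, 16],
     "Democrat": [299, 55, 15],
     "Independent": [351, 77, 22]},
    index=answers,
)

row_totals = observed.sum(axis=1)   # totals per answer
col_totals = observed.sum(axis=0)   # totals per party
n = observed.values.sum()           # grand total

# Expected count under independence: (row total * column total) / n
expected = pd.DataFrame(
    np.outer(row_totals, col_totals) / n,
    index=observed.index,
    columns=observed.columns,
)

print(expected.round(2))
```

Comparing this table cell by cell with the observed counts shows how far each group is from what independence would predict.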

4) Test if an individual's party affiliation affects his/her support of full-body scans. Clearly state the null and alternative hypotheses in your test.

Conduct a chi-square test for independence.

H0: support for full-body scans is independent of party affiliation.

Ha: support for full-body scans depends on party affiliation.

n = 1137

In [37]:
answers = ["should", "should not", "don't know/no answer"]

data = pd.DataFrame({"Republican": [264, 38, 16], "Democrat": [299, 55, 15], "Independent": [351, 77, 22]}, index=answers)
data

|                      | Democrat | Independent | Republican |
|----------------------|----------|-------------|------------|
| should               | 299      | 351         | 264        |
| should not           | 55       | 77          | 38         |
| don't know/no answer | 15       | 22          | 16         |


In [40]:
chi2, p, dof, expected = stats.chi2_contingency(data)

Latex(r"$\chi^2 = {:.4}; p = {:.2}$".format(chi2, p))

<IPython.core.display.Latex object>

Since p ≈ 0.36 is well above .05 (χ² ≈ 4.36 with 4 degrees of freedom), we fail to reject the null hypothesis. There is not sufficient evidence to conclude that support for full-body scans depends on party affiliation.
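As a cross-check on the decision, the sketch below reruns the test on a plain array and verifies the pieces by hand: the degrees of freedom for an r × c table, and the statistic as a sum of (O - E)² / E. The 0.05 significance level is a conventional choice assumed here, not something the question specifies.

```python
import numpy as np
from scipy import stats

# Rows: should, should not, don't know/no answer
# Columns: Republican, Democrat, Independent
observed = np.array([[264, 299, 351],
                     [38, 55, 77],
                     [16, 15, 22]])

chi2, p, dof, expected = stats.chi2_contingency(observed)

# Degrees of freedom for an r x c table: (r - 1) * (c - 1)
rows, cols = observed.shape
dof_by_hand = (rows - 1) * (cols - 1)

# Same statistic as in the goodness-of-fit case, summed over all cells
chi2_by_hand = ((observed - expected) ** 2 / expected).sum()

alpha = 0.05  # assumed significance level
reject_h0 = p < alpha

print("chi2 = {:.2f}, dof = {}, p = {:.2f}".format(chi2, dof, p))
print("reject H0:", reject_h0)
```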

**Question 3**  

A clothes retailer believes that there is no difference in sales across Monday, Tuesday and Wednesday. You are given the data in a `cloth_sales` table (in RDS) to test the claim. The table contains two columns: `dt` for the date, and `sales`, containing the count of sales for that day. Start by drawing up the table for the observed and expected frequencies for the chi-square test. 

**Hint:** 
- It will probably be easiest to extract the week of the year (`week`) and the day of the week (`dow`) using the [`date_trunc`](https://www.postgresql.org/docs/current/static/functions-datetime.html#FUNCTIONS-DATETIME-TRUNC) and [`date_part`](https://www.postgresql.org/docs/current/static/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT) functions in PostgreSQL respectively. You can then [`pivot`](http://pandas.pydata.org/pandas-docs/stable/reshaping.html#reshaping-by-pivoting-dataframe-objects) the table with pandas. The `head` of the resulting data frame should look something like:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th>date_part</th>
      <th>1.0</th>
      <th>2.0</th>
      <th>3.0</th>
    </tr>
    <tr>
      <th>date_trunc</th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>2014-10-27</th>
      <td>1016</td>
      <td>978</td>
      <td>1010</td>
    </tr>
    <tr>
      <th>2014-11-03</th>
      <td>987</td>
      <td>991</td>
      <td>997</td>
    </tr>
    <tr>
      <th>2014-11-10</th>
      <td>1014</td>
      <td>983</td>
      <td>1002</td>
    </tr>
    <tr>
      <th>2014-11-17</th>
      <td>991</td>
      <td>945</td>
      <td>992</td>
    </tr>
    <tr>
      <th>2014-11-24</th>
      <td>1001</td>
      <td>1058</td>
      <td>1002</td>
    </tr>
  </tbody>
</table>  

(N.B. `date_part('dow', dt)` will return the number of days after Sunday, so Monday = 1.0, Tuesday = 2.0, and so on.)
   
- Use `scipy.stats.chisquare()` to carry out a goodness of fit test

In [70]:
from sqlalchemy import create_engine
import yaml

# Read the database credentials; safe_load avoids constructing arbitrary objects from YAML tags
with open('../../pg_creds.yaml') as f:
    pg_creds = yaml.safe_load(f)['student']

engine = create_engine('postgresql://{user}:{password}@{host}:{port}/{dbname}'.format(**pg_creds))

# Truncate each date to the Monday that starts its week
r = pd.read_sql("""select date_trunc('week', dt) from cloth_sales""", engine)
r


|   | date_trunc |
|---|------------|
| 0 | 2014-10-27 |
| 1 | 2014-10-27 |
| 2 | 2014-10-27 |
| 3 | 2014-11-03 |
| 4 | 2014-11-03 |
| 5 | 2014-11-03 |
| 6 | 2014-11-10 |
| 7 | 2014-11-10 |
| 8 | 2014-11-10 |
| 9 | 2014-11-17 |
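The remaining steps (pivot, sum per weekday, goodness-of-fit test) might look roughly like the sketch below. It does not touch the database: the rows are the five weeks shown in the hint table, so the result illustrates the method only and is not the answer for the full `cloth_sales` table. The query string and the column aliases `week` and `dow` are assumptions about how the data could be pulled.

```python
import pandas as pd
from scipy import stats

# Hypothetical query (untested assumption) that would return one row per day:
query = """
select date_trunc('week', dt) as week, date_part('dow', dt) as dow, sales
from cloth_sales
"""

# Stand-in for pd.read_sql(query, engine): the five weeks from the hint table
long_df = pd.DataFrame({
    "week": ["2014-10-27"] * 3 + ["2014-11-03"] * 3 + ["2014-11-10"] * 3
            + ["2014-11-17"] * 3 + ["2014-11-24"] * 3,
    "dow": [1.0, 2.0, 3.0] * 5,
    "sales": [1016, 978, 1010, 987, 991, 997, 1014, 983, 1002,
              991, 945, 992, 1001, 1058, 1002],
})

# Wide table: one row per week, one column per day of the week
wide = long_df.pivot(index="week", columns="dow", values="sales")

# Observed frequencies: total sales for Monday, Tuesday, Wednesday
observed = wide.sum(axis=0)

# With no expected frequencies given, chisquare assumes equal frequencies
chi2, p = stats.chisquare(observed)

print(wide)
print("chi2 = {:.3f}, p = {:.3f}".format(chi2, p))
```

Leaving `f_exp` at its default matches the retailer's claim directly: equal expected sales on each of the three days.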


**Question 4**  

1) A lawsuit has been filed against a university alleging sex discrimination against female applicants during the admissions process. Use the data below to test whether sex affects admission at this university.
      
   **Hint:**
   - Construct your null and alternative hypotheses
   - Use `scipy.stats.chi2_contingency()` to carry out a test for independence
   - The function takes the contingency table as a `numpy` array as its first argument


|        | Admitted | Not Admitted |
|--------|----------|--------------|
| Male   | 3715     | 4727         |
| Female | 1513     | 2808         |

2) You are also given the breakdown of the female and male admissions by department (A to F).

<img src="images/paradox_1.png" width="300px">

Test if sex and department are independent in terms of number of applicants.

3) (Extra Credit) Based on all the data you are given, is it fair to say that there is sex discrimination in the admission process? Explain your answer. (Hint: Simpson's paradox)
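For part 1, a minimal sketch of the independence test on the aggregate admissions table is shown below, with H0: admission is independent of sex, and Ha: admission depends on sex. Variable names are illustrative. Note that for a 2 × 2 table `chi2_contingency` applies Yates' continuity correction by default (`correction=True`).

```python
import numpy as np
from scipy import stats

# Rows: Male, Female; columns: Admitted, Not Admitted
admissions = np.array([[3715, 4727],
                       [1513, 2808]])

# Test for independence between sex and admission outcome
chi2, p, dof, expected = stats.chi2_contingency(admissions)

# Admission rate per row, useful context for interpreting the test
admit_rate = admissions[:, 0] / admissions.sum(axis=1)

print("chi2 = {:.2f}, dof = {}, p = {:.3g}".format(chi2, dof, p))
print("admission rates (male, female):", admit_rate.round(3))
```

A significant result on the aggregate table says only that sex and admission are associated overall. It does not account for which departments applicants chose, which is what parts 2 and 3 examine.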

#### Hint

Some functions that may be useful to you:

- From the `sqlalchemy` package:
    - `create_engine`
- From the `pandas` package:
    - `read_sql`
    - `pivot`
- From the `scipy.stats` package:
    - `chisquare`
    - `chi2_contingency`