# Hypothesis Test: Proportions

## Smoking Status vs. Hypertension

In [1]:
from data.create_data import *
import numpy as np
import pandas as pd
import scipy
from scipy import stats
import math
import matplotlib.pyplot as plt
from statsmodels.distributions.empirical_distribution import ECDF

%matplotlib inline

In [2]:
data = read_frmgham()
data_hyp = data[data['prevhyp']!=1] # drop prev history of hypertension
data_hyp = data_hyp[['cursmoke', 'hyperten']]

### Research Questions
1. Is there a statistically significant association between smoking status and the presence of hypertension?

#### Categorical Variables
##### Smoking Status (`cursmoke`)
  * `0`: Non-current smoker
  * `1`: Current smoker
  
##### Hypertension (`hyperten`)
Hypertension is recorded at the first exam at which the participant is treated for high blood pressure (systolic blood pressure ≥ 140 mmHg or diastolic blood pressure ≥ 90 mmHg).  
  * `0`: No hypertension
  * `1`: Hypertension
  
## Chi<sup>2</sup> Tests
The Chi<sup>2</sup> test investigates *differences* between **categorical variables**, whereas `t-tests` test for differences between numerical values.

More specifically, they provide a way to investigate:
  * the differences in the distributions of categorical variables with the same categories (**Goodness of Fit**)
  * the dependence between categorical variables (**Independence Test**)

### Proportions
By testing **proportions**, the Chi<sup>2</sup> test measures the difference between the *expected* and *observed* outcomes.
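
For reference, the difference between observed and expected outcomes is usually summarized with Pearson's chi-square statistic. The symbols here are introduced for illustration and do not appear elsewhere in this notebook: $O_i$ is the observed count in category $i$ and $E_i$ is the count expected under the null hypothesis.

$$
\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}
$$

The statistic is 0 when every observed count equals its expected count, and it grows as the two diverge.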

### Chi<sup>2</sup> Test: *Goodness-of-Fit* 
The **goodness-of-fit** test checks whether the distribution of a *categorical* variable in a sample matches an expected distribution.
  * It's an analog of the *one-sample t-test*, but for categorical variables.
  
**Use Case Example**:
  * Check whether the race demographics of members of your school match that of the entire U.S. population.
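
A minimal sketch of such a comparison, assuming the reference proportions are known in advance. The counts and proportions are hypothetical illustrative numbers, not real demographic data. The point is the `f_exp` argument of `stats.chisquare`, which supplies an expected distribution other than the default of equal counts.

```python
import numpy as np
from scipy import stats

# Hypothetical observed counts for three categories in a sample
# (illustrative numbers only, not taken from any dataset).
observed = np.array([90, 70, 40])

# Hypothetical reference proportions for the same three categories.
reference_props = np.array([0.5, 0.3, 0.2])

# Expected counts must sum to the same total as the observed counts.
expected = reference_props * observed.sum()

# Goodness-of-fit test against the non-uniform expected distribution.
chi2_stat, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print("Chi2 = %.2f, p-value = %.4f" % (chi2_stat, p_value))
```

Omitting `f_exp` makes `stats.chisquare` assume that all categories are equally likely.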


#### Hypertension

1) Hypotheses
   * **H<sub>0</sub>**: There is no difference between the expected and observed counts of hypertension.
   * **H<sub>A</sub>**: There is a difference between the expected and observed counts of hypertension.
   
2) Compute the **test statistic** (Chi<sup>2</sup>) & **p-value**.

In [3]:
hyperten_table=pd.crosstab(data_hyp.hyperten, columns="count")
hyperten_table

| hyperten | count |
|----------|-------|
| 0        | 2985  |
| 1        | 3298  |


In [4]:
# No f_exp is given, so equal expected counts are assumed for both categories.
gof_chi2, gof_pval = stats.chisquare(f_obs=hyperten_table['count'])
print("Goodness-of-Fit Chi2 = %.2f" % gof_chi2)
print("Goodness-of-Fit p-value = %.2f" % gof_pval)

Goodness-of-Fit Chi2 = 15.59
Goodness-of-Fit p-value = 0.00


3) Results
  * test-statistic: **Chi<sup>2</sup>** = 15.59
  * **p-value** < 0.001 (displayed as 0.00 after rounding to two decimals)

The **p-value** is very small (p-value < 0.05), providing substantial evidence against the null hypothesis (H<sub>0</sub>). Thus, H<sub>0</sub> is rejected in favor of the alternative hypothesis (H<sub>A</sub>).

##### Conclusion
There is a statistically significant difference between the expected and observed counts of hypertension. Because `stats.chisquare` was called without `f_exp`, the expected distribution is uniform (equal counts in each category). The observed counts are therefore unlikely to have come from a population in which hypertension and no hypertension are equally common.

In other words, the proportion of hypertensive patients in the sample (about 52.5%) differs from the expected proportion (50%).
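
As a cross-check, the goodness-of-fit statistic can be recomputed by hand from the two counts in the crosstab. This is a sketch, assuming the same equal-frequency expectation that `stats.chisquare` uses by default. The variable names are illustrative and not part of the original notebook.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the crosstab output (hyperten = 0, hyperten = 1).
observed = np.array([2985, 3298])

# Default expectation of stats.chisquare: equal counts in every category.
expected = np.full(observed.shape, observed.sum() / float(observed.size))

# Chi-square statistic: sum over categories of (O - E)^2 / E.
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom = number of categories - 1.
dof = observed.size - 1
pval_manual = stats.chi2.sf(chi2_manual, dof)

# Compare with scipy's own result.
chi2_scipy, pval_scipy = stats.chisquare(f_obs=observed)

print("manual: chi2 = %.2f, p = %.6f" % (chi2_manual, pval_manual))
print("scipy:  chi2 = %.2f, p = %.6f" % (chi2_scipy, pval_scipy))
```

Printing the p-value with six decimals shows that it is small but not exactly zero.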

### Chi<sup>2</sup> Test: *Independence Test*
The **test of independence** determines whether the outcome (*hypertension*) depends on a predictor variable (*smoking status*). In other words, the test determines whether two categorical variables are associated with one another in the population.

1) Hypotheses
   * **H<sub>0</sub>**: The data point to a population where there is no relationship between current smoking and hypertension.
   * **H<sub>A</sub>**: The data point to a population where there is a relationship between current smoking and hypertension.
   
2) Compute the **test statistic** (Chi<sup>2</sup>) & **p-value**.

In [5]:
# Contingency table: rows = smoking status, columns = hypertension.
# Count the rows in each (cursmoke, hyperten) combination.
contin_table = data_hyp.groupby(['cursmoke', 'hyperten']).size()
df_contin_table = contin_table.unstack('hyperten')
df_contin_table

| cursmoke | hyperten = 0 | hyperten = 1 |
|----------|--------------|--------------|
| 0        | 1403         | 1740         |
| 1        | 1582         | 1558         |


In [6]:
# For a 2x2 table, chi2_contingency applies Yates' continuity correction by default.
chi2_val, chi2_pval, chi2_df, chi2_expect_val = stats.chi2_contingency(df_contin_table)
print("Test-statistic: Chi2 = %.2f" % chi2_val)
print("p-value = %f" % chi2_pval)

Test-statistic: Chi2 = 20.55
p-value = 0.000006
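
To make the expected counts explicit, the sketch below rebuilds them from the row and column totals of the contingency table. It then compares the default (Yates-corrected) result with the uncorrected one. The counts are copied from the table above. The variable names are illustrative and not part of the original notebook.

```python
import numpy as np
from scipy import stats

# Observed counts: rows = cursmoke (0, 1), columns = hyperten (0, 1).
observed = np.array([[1403, 1740],
                     [1582, 1558]])

# Expected counts under independence: (row total * column total) / N.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / float(observed.sum())

# Default call: Yates' continuity correction is applied to 2x2 tables.
chi2_yates, p_yates, dof, expected_scipy = stats.chi2_contingency(observed)

# Same test without the continuity correction.
chi2_plain, p_plain, _, _ = stats.chi2_contingency(observed, correction=False)

# Uncorrected statistic computed directly from its definition.
chi2_manual = ((observed - expected) ** 2 / expected).sum()

print("Yates-corrected Chi2 = %.2f (p = %.6f)" % (chi2_yates, p_yates))
print("Uncorrected Chi2     = %.2f (p = %.6f)" % (chi2_plain, p_plain))
```

With counts this large the correction changes the statistic only slightly, so the conclusion does not depend on it.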


##### Summary Statistics
Additional statistics about the sample.

In [7]:
contin_val = df_contin_table.values
# Marginal totals (rows = smoking status, columns = hypertension)
print("Total Non-Smokers = %d" % (contin_val[0,0] + contin_val[0,1]))
print("Total Smokers = %d" % (contin_val[1,0] + contin_val[1,1]))
print("Total without Hypertension = %d" % (contin_val[0,0] + contin_val[1,0]))
print("Total with Hypertension = %d" % (contin_val[0,1] + contin_val[1,1]))
print("")
# Individual cell counts
print("Current Non-Smoker without Hypertension = %d" % (contin_val[0,0]))
print("Current Non-Smoker with Hypertension = %d" % (contin_val[0,1]))
print("Current Smoker without Hypertension = %d" % (contin_val[1,0]))
print("Current Smoker with Hypertension = %d" % (contin_val[1,1]))

Total Non-Smokers = 3143
Total Smokers = 3140
Total without Hypertension = 2985
Total with Hypertension = 3298

Current Non-Smoker without Hypertension = 1403
Current Non-Smoker with Hypertension = 1740
Current Smoker without Hypertension = 1582
Current Smoker with Hypertension = 1558


3) Results
  * test-statistic: **Chi<sup>2</sup>** = 20.55
  * **p-value** = 0.000006
  
The **p-value** is small (p-value < 0.05), providing substantial evidence against the null hypothesis (H<sub>0</sub>). Thus, H<sub>0</sub> is rejected in favor of the alternative hypothesis (H<sub>A</sub>).

If H<sub>0</sub> *were to be true*, the test-statistic (Chi<sup>2</sup>) value should be close to 0, indicating that the difference between the observed and expected is small.

##### Conclusion
Based on the sample, there is a statistically significant association between being a current smoker and having hypertension. The Chi<sup>2</sup> value is large rather than close to zero, indicating a difference between the observed counts and those expected under independence.

There is a relationship between smoking status and hypertension. Thus, the two variables are not independent of one another.


## Considerations: Limitations
**Chi<sup>2</sup> testing** indicates whether a difference exists, but says nothing about its direction or size.
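
One way to see what the difference is, sketched here as a follow-up rather than as part of the original analysis, is to compare the share of hypertensive participants within each smoking group and to compute an effect size. The phi coefficient used below is a common choice for 2x2 tables. The counts are copied from the contingency table, and the variable names are illustrative.

```python
import numpy as np

# Observed counts: rows = cursmoke (0, 1), columns = hyperten (0, 1).
observed = np.array([[1403, 1740],
                     [1582, 1558]], dtype=float)

# Share of participants with hypertension within each smoking group.
hyp_rate = observed[:, 1] / observed.sum(axis=1)
rate_diff = hyp_rate[1] - hyp_rate[0]  # current smokers minus non-smokers

# Phi coefficient for a 2x2 table, computed from the four cell counts.
a, b = observed[0]
c, d = observed[1]
phi = (a * d - b * c) / np.sqrt((a + b) * (c + d) * (a + c) * (b + d))

print("Hypertension share, non-smokers: %.3f" % hyp_rate[0])
print("Hypertension share, current smokers: %.3f" % hyp_rate[1])
print("Difference (smokers - non-smokers): %.3f" % rate_diff)
print("Phi coefficient: %.3f" % phi)
```

In this table the hypertensive share is lower among current smokers (about 50%) than among non-smokers (about 55%), and the phi coefficient is small in magnitude. The association is statistically significant but weak, and the table alone says nothing about causation.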