# OpenIntro Stats Exercises

The following exercises are taken from OpenIntro Statistics (4th edition). I work through them to develop my understanding of the statistical tests and to become more familiar with common statistical packages.

In [44]:
from scipy import stats
import numpy as np
import pandas as pd
import seaborn as sns

## Chapter 6

### 6.2 Hypothesis Tests for the Difference of Two Proportions

#### 6.13 Guided Practice

In [2]:
# Create the dataframe given in the question

df = pd.DataFrame({'heart_attack' : [145, 200], 'no_event' : [12788, 12738]} , index = ['fish_oil', 'placebo'])
df['total'] = df['heart_attack'] + df['no_event']
df


| | heart_attack | no_event | total |
|---|---|---|---|
| fish_oil | 145 | 12788 | 12933 |
| placebo | 200 | 12738 | 12938 |


First verify that a normal approximation is valid. Each subject was randomly assigned to one of the two groups so we may assume independence holds. Now we verify the success-failure condition:

In [3]:
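# Sample proportions and sample sizes for the fish oil and placebo groups

p1_hat, p2_hat = df['heart_attack'] / df['total']
n1, n2 = df['total']

# Success-failure condition: more than 10 expected successes and failures in each group
all([n1*p1_hat > 10, n1*(1-p1_hat) > 10, n2*p2_hat > 10, n2*(1-p2_hat) > 10])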

True

All good here. We are safe to use a normal distribution to approximate the sampling distribution. We now construct the 95% confidence interval.
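For reference, the interval takes the usual form of a point estimate plus or minus a critical value times the standard error. A sketch of the formula, where $\hat{p}_1, \hat{p}_2$ are the sample proportions, $n_1, n_2$ are the group sizes, and $z^\star$ is my own label for the 97.5th percentile of the standard normal distribution:

$
(\hat{p}_1 - \hat{p}_2) \pm z^\star \sqrt{ \frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2} }, \qquad z^\star \approx 1.96
$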

In [7]:
point_estimate = p1_hat - p2_hat
std_error = np.sqrt( p1_hat*(1-p1_hat)/n1 + p2_hat*(1-p2_hat)/n2 )

# Compute corresponding z-value using an inverse CDF of a standard normal distribution:

z = stats.norm.ppf(0.975)

# Convert to float dtype for more aesthetic printing

print([float(point_estimate-z*std_error), float(point_estimate+z*std_error)])

[-0.007041644824162001, -0.0014517763930541822]


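The confidence interval lies entirely below zero, so at the 5% significance level we have evidence to reject the hypothesis that the fish oil treatment does not affect heart attack outcomes. The data point towards a lower heart attack rate in the fish oil group.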
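Since the same interval will be useful for other two-proportion problems, here is a minimal sketch that wraps the calculation in a function. The name `diff_prop_ci` and its arguments are my own invention (not from the book or from scipy), and it assumes the normal approximation has already been checked.

```python
import numpy as np
from scipy import stats


def diff_prop_ci(x1, n1, x2, n2, level=0.95):
    """Wald confidence interval for p1 - p2 (hypothetical helper)."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    # Unpooled standard error of the difference in sample proportions
    se = np.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    # Critical value that leaves (1 - level) / 2 in each tail
    z_star = stats.norm.ppf(1 - (1 - level) / 2)
    diff = p1_hat - p2_hat
    return float(diff - z_star * se), float(diff + z_star * se)


# Fish oil (145 heart attacks out of 12933) vs placebo (200 out of 12938)
lower, upper = diff_prop_ci(145, 12933, 200, 12938)
print(lower, upper)
```

Calling it with the counts from the table should reproduce the interval printed above, up to floating-point rounding.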

#### 6.15 Guided Practice

Not exactly what Ex 6.15 asks for but a good exercise nonetheless.

In [8]:
# Create the data given:

df = pd.DataFrame({'Yes':[500,505], 'No':[44425,44405]} )
df.index = ['Mammogram','Control']
df

| | Yes | No |
|---|---|---|
| Mammogram | 500 | 44425 |
| Control | 505 | 44405 |


Let $p_1, \, p_2$ be the true proportions of those who die of cancer in the mammogram and control groups respectively. We wish to test whether there is a statistically significant difference between the two proportions. We state our null and alternative hypotheses as:

$
H_0 : p_1 - p_2 = 0 
$

$
H_1 : p_1 - p_2 \neq 0
$

We will conduct a two-tailed test at the 5% significance level. We first verify that a normal distribution is a valid approximation of the sampling distribution in this case:

In [9]:
n1, n2 = df['Yes'] + df['No']
p1, p2 = df['Yes'] / [n1, n2]

all([n1*p1>10, n1*(1-p1)>10, n2*p2>10, n2*(1-p2)>10])

True

The success-failure condition holds in both groups, so a normal approximation is reasonable. Since the women were randomly assigned to one of the two groups, we may assume independence holds both within each group and between the groups.

We now compute the p-value for our observed statistic:

In [126]:
mu = 0

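# Estimate the standard deviation of the null distribution from the two sample proportions (unpooled).
# A pooled proportion is the more usual choice under H0; p1 and p2 are almost equal here, so it makes little difference.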
sigma = np.sqrt( p1*(1-p1)/n1 + p2*(1-p2)/n2 )

# p1-p2<0 so we compute the p-value using symmetries of the normal distribution

p = 2*stats.norm(mu, sigma).cdf(p1-p2)
float(p)

0.8697840130334852

This is well above the 5% threshold, so we do not have sufficient evidence to reject the null hypothesis.
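As a cross-check, here is a sketch of the same test using the pooled proportion for the standard error, which is the usual choice when the null hypothesis is $p_1 - p_2 = 0$. The variable names (`deaths`, `sizes`, `se_pool`, `z_score`, `p_value`) are my own, and the counts are copied from the table above.

```python
import numpy as np
from scipy import stats

# Counts from the table above: deaths and group sizes (mammogram, control)
deaths = np.array([500, 505])
sizes = np.array([500 + 44425, 505 + 44405])

p_hat = deaths / sizes
p_pool = deaths.sum() / sizes.sum()  # pooled proportion under H0: p1 = p2

# Pooled standard error of the difference in sample proportions
se_pool = np.sqrt(p_pool * (1 - p_pool) * (1 / sizes).sum())
z_score = (p_hat[0] - p_hat[1]) / se_pool

# Two-sided p-value: twice the tail area beyond |z|
p_value = 2 * stats.norm.sf(abs(z_score))
print(float(z_score), float(p_value))
```

Because the two sample proportions are so close, the pooled p-value should agree with the unpooled one to roughly three decimal places, and the conclusion is unchanged.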

#### 6.27 Exercise

In [11]:
# Create the dataframe given in the question

df = pd.DataFrame({'Control':[35, 193, 64], 'Pilots': [19, 132, 51], 'Truck Drivers': [35, 117, 51], 'Train Operators':[29, 119, 32], 'Bus/Taxi/Limo Drivers':[21, 131, 58]})
df.index = ['Less than 6 hours of sleep', '6 to 8 hours of sleep', 'More than 8 hours']
df.loc['Total'] = df.sum()
df

| | Control | Pilots | Truck Drivers | Train Operators | Bus/Taxi/Limo Drivers |
|---|---|---|---|---|---|
| Less than 6 hours of sleep | 35 | 19 | 35 | 29 | 21 |
| 6 to 8 hours of sleep | 193 | 132 | 117 | 119 | 131 |
| More than 8 hours | 64 | 51 | 51 | 32 | 58 |
| Total | 292 | 202 | 203 | 180 | 210 |


We wish to conduct a hypothesis test as to whether or not there is a difference between the proportion of truck drivers who get less than 6 hours of sleep per day and the corresponding proportion for non-transportation workers (i.e. the control group).

We begin by setting up our hypotheses. Let $\hat{p}_t, \, \hat{p}_c$ represent the sample proportions for the truck drivers and control groups respectively and $p_t, \, p_c$ be the corresponding population parameters. Then, our hypotheses are:

$H_0: p_t - p_c = 0$

$H_1: p_t - p_c \neq 0$

We now use the pooled proportion to verify the success-failure condition, which ensures that a normal distribution is a valid approximation of the sampling distribution.

In [77]:
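n_c, n_t = df.loc['Total', ['Control', 'Truck Drivers']]
n = n_c + n_t
c, t = df.loc['Less than 6 hours of sleep', ['Control', 'Truck Drivers']]
p_pooled = (c+t)/n

# Check the success-failure condition in each group using the pooled proportion
all( [ n_c*p_pooled > 10, n_c*(1-p_pooled) > 10, n_t*p_pooled > 10, n_t*(1-p_pooled) > 10 ] )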

True

The sample sizes are large enough for the normal approximation to be valid here. Since our null hypothesis is that there is no difference in the proportions between the two groups, we will compute the pooled standard error and use this as the standard deviation of our null distribution.

In [88]:
p_c, p_t = c/n_c, t/n_t
SE_pool = np.sqrt( p_pooled*(1-p_pooled)/n_t + p_pooled*(1-p_pooled)/n_c )
SE_pool

np.float64(0.031842080861455284)

We now compute the p-value for the observed difference in sample proportions. We will use a 5% significance level.

In [92]:
# p_t - p_c > 0, so double the upper-tail probability to get the two-sided p-value
p = 2*(1-stats.norm(0, SE_pool).cdf(p_t-p_c))
p

np.float64(0.09887007932782055)

The p-value of roughly 0.099 is above our 5% significance level, so we do not have sufficient evidence to reject the null hypothesis.
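One way to sanity-check this result with a library routine is a chi-square test on the corresponding 2x2 table. For a 2x2 table without continuity correction, the chi-square statistic is the square of the pooled z-score, so the two p-values should agree. This is only a sketch: the names `table` and `p_chi2` are mine, and I collapse the two longer sleep categories into a single "6 hours or more" column.

```python
import numpy as np
from scipy import stats

# Rows: control, truck drivers; columns: less than 6 hours, 6 hours or more
table = np.array([[35, 292 - 35],
                  [35, 203 - 35]])

# correction=False switches off the Yates continuity correction so that
# the statistic equals the square of the pooled two-proportion z-score
chi2, p_chi2, dof, expected = stats.chi2_contingency(table, correction=False)

z_score = np.sqrt(chi2)
print(float(z_score), float(p_chi2))
```

The p-value printed here should match the one computed from the normal distribution above, up to floating-point rounding.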

### 6.3 Testing for Goodness of Fit Using Chi-Square Tests

#### 6.34 Exercise

We begin by summarising the data given to us in the question. We employ a NaN value for the sake of emulating a real dataset.

In [148]:
df = pd.DataFrame({'Woods': [4, 4.8], 'Cultivated grassplot': [16, 14.7] , 'Deciduous forests': [61, 39.6], 'Other': [345, np.nan] })
df.index = ['Number of sites', 'Expected percentage coverage']
# Treat the missing 'Other' percentage as 0, then set it so that the percentages sum to 100
df = df.fillna(0)
df.loc['Expected percentage coverage', 'Other'] = 100 - df.loc['Expected percentage coverage'].sum()
df

| | Woods | Cultivated grassplot | Deciduous forests | Other |
|---|---|---|---|---|
| Number of sites | 4.0 | 16.0 | 61.0 | 345.0 |
| Expected percentage coverage | 4.8 | 14.7 | 39.6 | 40.9 |


We wish to determine if the particular region examined is representative of the whole of Hainan Island, China. We will first check that the conditions for using a Chi-Squared test apply.
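A minimal sketch of the sample-size part of that check, assuming the usual rule of thumb that every expected count should be at least 5. The arrays are copied from the table above and the names `observed`, `expected_pct` and `expected` are my own. Independence of the sites cannot be checked in code and has to be argued from the study design.

```python
import numpy as np

# Values copied from the table above, in the order:
# woods, cultivated grassplot, deciduous forests, other
observed = np.array([4, 16, 61, 345])
expected_pct = np.array([4.8, 14.7, 39.6, 40.9])

# Expected count in each category if sites followed the stated percentages
expected = observed.sum() * expected_pct / 100
print(expected.round(2))

# Rule of thumb (assumed): every expected count is at least 5
print(bool((expected >= 5).all()))
```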