## Module 5 - Inference 2

Below we'll use a few different techniques to demonstrate inference. Keep in mind that, whatever technique we're applying, we always follow the same steps:

Step 1: Formulate the null hypothesis and the alternative hypothesis.<br>
Step 2: Specify the level of significance to be used.<br>
Step 3: Select the test statistic.<br>
Step 4: Establish the value of the test statistic.<br>
Step 5: Make a decision.<br>
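
As a rough sketch of how the five steps map onto code, the hypothetical helper below runs a two-sided one-sample z-test. The function name `one_sample_z_test` and the input numbers are illustrative assumptions, not values from this module, and the population standard deviation is assumed to be known.

```python
import numpy as np
from scipy import stats

def one_sample_z_test(sample_mean, mu_0, sigma, n, alpha=0.05):
    """Hypothetical helper: two-sided z-test of H0: mu = mu_0 (sigma assumed known)."""
    # Step 1: H0 is mu = mu_0, HA is mu != mu_0 (hence the two-sided p-value below)
    # Step 2: the significance level is the `alpha` argument
    # Step 3: the test statistic is the z-score of the sample mean
    # Step 4: compute its value using the standard error sigma / sqrt(n)
    z = (sample_mean - mu_0) / (sigma / np.sqrt(n))
    p_value = 2 * stats.norm.sf(abs(z))
    # Step 5: reject H0 when the p-value falls below alpha
    reject = p_value < alpha
    return z, p_value, reject

# Illustrative (made-up) inputs
z, p_value, reject = one_sample_z_test(sample_mean=52, mu_0=50, sigma=10, n=100)
print("z = %.2f, p = %.4f, reject H0: %s" % (z, p_value, reject))
```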

In [1]:
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
import plotly.graph_objs as go
import numpy as np
from scipy import stats
import random
import math

## Understanding the z-score
What if we don't have a sample distribution, but instead only a single value? The z-score allows us to identify the probability associated with the observation of a single value, in the context of an underlying normal distribution (remember, according to the central limit theorem, point estimates of samples taken from non-normal distributions will themselves form a normal distribution). The z-score is calculated as:<br><br>
$$ z = \frac{x - \mu}{\sigma}$$<br>

<b>z-score example</b><br>
Sophia took the GRE and scored 160 on the Verbal Reasoning section and 157 on the Quantitative Reasoning section. The mean score for the Verbal Reasoning section for all test takers was 151 with a standard deviation of 7, and the mean score for the Quantitative Reasoning section was 153 with a standard deviation of 7.67. Suppose that both distributions are nearly normal. How did Sophia rank for the verbal reasoning and quantitative reasoning sections?

In [2]:
# Simulate the test data (population)
test_data = np.random.normal(loc=151, scale=7, size=10000)
test_data = np.round(test_data)

In [3]:
# What is Sophia's z-score on the verbal reasoning section?
z_score_v = (160 - 151) / 7
print("Z-score on verbal reasoning is %.4f" % z_score_v)

# What is Sophia's z-score on the quantitative reasoning section?
z_score_q = (157 - 153) / 7.67
print("Z-score on quantitative reasoning is %.4f" % z_score_q)

Z-score on verbal reasoning is 1.2857
Z-score on quantitative reasoning is 0.5215


In [4]:
data = [go.Histogram(x=test_data, histnorm='probability')]

layout = go.Layout(
    shapes = [
        {
            'type': 'line',
            'x0': 160,
            'y0': 0,
            'x1': 160,
            'y1': 0.06,
            'line': {
                'color': 'rgb(155, 0, 0)',
                'width': 3,
            },
        }
    ],
    xaxis=dict(title="Score"),
    yaxis=dict(title="Probability")
)

fig = go.Figure(data = data, layout = layout)
iplot(fig)

In [5]:
# What is the corresponding percentile for these scores?
vr_percentile = stats.norm.cdf(z_score_v)*100
qr_percentile = stats.norm.cdf(z_score_q)*100
print("Sophia was in the %.0fth percentile for verbal reasoning" % vr_percentile)
print("Sophia was in the %.0fth percentile for quantitative reasoning" % qr_percentile)

Sophia was in the 90th percentile for verbal reasoning
Sophia was in the 70th percentile for quantitative reasoning
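
As a sanity check, the theoretical percentile can be compared with an empirical one taken from simulated scores. This is only a sketch: the seed and the simulation size are arbitrary assumptions, and the simulated population is regenerated here so the cell is self-contained.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for reproducibility
simulated_scores = rng.normal(loc=151, scale=7, size=100_000)

# Share of simulated verbal scores at or below Sophia's score of 160
empirical_percentile = np.mean(simulated_scores <= 160) * 100
# Percentile implied by the normal CDF of her z-score
theoretical_percentile = stats.norm.cdf((160 - 151) / 7) * 100

print("Empirical: %.1f, theoretical: %.1f" % (empirical_percentile, theoretical_percentile))
```

The two numbers should agree closely, with the small gap coming from simulation noise.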


What is the probability of sampling a group of 30 students that has an average verbal reasoning score of at least 155? To answer this question, we use the standard error of the mean in place of the standard deviation of the distribution:
$$ z = \frac{\overline{x} - \mu}{\frac{\sigma}{\sqrt{n}}}$$<br>

In [6]:
z_score_sample = (155 - 151) / (7 / np.sqrt(30))
print("The z-value is %.4f" % z_score_sample)
stat_percentile = (1 - stats.norm.cdf(z_score_sample))
print("The corresponding p-value is %.4f" % stat_percentile)

The z-value is 3.1298
The corresponding p-value is 0.0009
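
The same probability can be approximated by simulation: draw many groups of 30 scores, average each group, and count how often the average reaches 155. This is a sketch under the assumption that individual scores are normal with mean 151 and standard deviation 7. The seed and number of repeats are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # arbitrary seed
n_students, n_repeats = 30, 100_000

# Each row is one simulated group of 30 students; take the mean of every row
sample_means = rng.normal(loc=151, scale=7, size=(n_repeats, n_students)).mean(axis=1)

simulated_p = np.mean(sample_means >= 155)
theoretical_p = stats.norm.sf((155 - 151) / (7 / np.sqrt(n_students)))

print("Spread of sample means: %.4f (standard error: %.4f)" % (sample_means.std(), 7 / np.sqrt(n_students)))
print("Simulated p: %.5f, theoretical p: %.5f" % (simulated_p, theoretical_p))
```

The spread of the simulated means should sit close to the standard error, which is why the standard error belongs in the denominator of the z-score.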


## The t-distribution
Just as for the z-statistic, we can calculate t, replacing the population standard deviation $\sigma$ with the sample standard deviation $s$:

$$ t = \frac{\overline{x} - \mu}{\frac{s}{\sqrt{n}}}$$<br>

with n-1 degrees of freedom.
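
To see what the degrees of freedom change in practice, the sketch below compares two-sided 95% critical values of the t-distribution with the normal one. The particular degrees of freedom listed are arbitrary choices for illustration.

```python
from scipy import stats

# Critical value for a two-sided 95% interval under the normal distribution
z_crit = stats.norm.ppf(0.975)

for df in (4, 9, 29, 99):
    t_crit = stats.t.ppf(0.975, df)
    print("df = %3d: t critical = %.4f (z critical = %.4f)" % (df, t_crit, z_crit))
```

The t critical value is always larger than the z critical value and approaches it as the degrees of freedom grow, which reflects the extra uncertainty from estimating the standard deviation on a small sample.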

<b>t-distribution Example</b><br>
Let's say the class only had 10 students. Now, we will not have a great estimation of the population standard deviation. Instead we use the t-distribution in place of 'z'.

In [7]:
student_marks_small = [170, 152, 155, 156, 142, 158, 148, 144, 153, 155]
n = len(student_marks_small)
degrees_of_freedom = n - 1

sample_mean = np.mean(student_marks_small)
sample_sd = np.std(student_marks_small, ddof=1)
sample_se = sample_sd / (np.sqrt(n))

small_ci = stats.t.ppf(1-0.05, degrees_of_freedom) * sample_se  # 0.95 quantile: a 90% two-sided interval (use 1-0.025 for 95%)

print("Our value with confidence interval is %.4f +/- %.4f" % (sample_mean, small_ci))

Our value with confidence interval is 153.3000 +/- 4.5648


Because the population mean of 151 lies inside this interval (153.3 ± 4.56), we do not have sufficient evidence to reject the null hypothesis that these students come from a population with a mean of 151. Another way to approach this problem is to calculate the t-statistic ourselves and look up the associated p-value. Note that since the alternative hypothesis is that the students differ from the overall population in either direction, this is a two-sided t-test.

In [8]:
t_statistic = (151 - sample_mean) / (sample_sd / np.sqrt(n))
print("t-statistic is %.4f" % t_statistic)
p_val = stats.t.sf(np.abs(t_statistic), n-1) * 2 # Two-sided test so we multiply the p-value from the SF by 2
print("p-value is %.4f" % p_val)

t-statistic is -0.9236
p-value is 0.3798
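
SciPy also provides `stats.ttest_1samp` for this one-sample test, which can serve as a cross-check of the manual calculation. A minimal sketch, with the data repeated so the cell runs on its own:

```python
from scipy import stats

student_marks_small = [170, 152, 155, 156, 142, 158, 148, 144, 153, 155]

# Two-sided one-sample t-test against the population mean of 151
result = stats.ttest_1samp(student_marks_small, popmean=151)
print("t-statistic is %.4f, p-value is %.4f" % (result.statistic, result.pvalue))
```

The sign of the statistic is flipped relative to the manual version because SciPy computes the sample mean minus the population mean. The two-sided p-value is unaffected.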


### The t-test statistic with two means (paired)
The t-statistic is also useful when comparing two means. Before calculating the t-statistic however, we must determine if the data are paired or unpaired. In the case of paired data, the t-statistic is calculated as we've seen previously, but it is based on the difference between the two groups:

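$$t = \frac{\overline{x}_{diff} - \mu_{diff}}{SE_{diff}}$$

Where $\mu_{diff}$ is the mean difference under the null hypothesis (typically 0), $s_{diff}$ is the sample standard deviation of the differences, and again:

$$SE_{diff} = \frac{s_{diff}}{\sqrt{n}}$$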

<b>Paired t-test example</b><br>
Let's say I allowed that set of students to re-take the test following a second round of my training course. So now we have two sets of values for the same group. Did the second course improve their marks?

In [9]:
original_marks = [170, 152, 155, 156, 142, 158, 148, 144, 153, 155]
revised_marks = [174, 155, 179, 155, 145, 160, 155, 150, 155, 165]

In [10]:
n = len(original_marks)
# Per-student improvement: revised mark minus original mark
mark_diff = [revised - original for original, revised in zip(original_marks, revised_marks)]
diff_mean = np.mean(mark_diff)
diff_sd = np.std(mark_diff, ddof=1)

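So we can again apply the t-statistic, this time measuring how far the mean difference is from 0. Because we are asking whether the marks improved, this is a one-sided test and the p-value is not doubled.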

In [11]:
t_statistic = (diff_mean - 0) / (diff_sd / np.sqrt(n))
print("t-statistic is %.4f" % t_statistic)
p_val = stats.t.sf(np.abs(t_statistic), n-1)
print("p-value is %.5f" % p_val)

t-statistic is 2.7014
p-value is 0.01217


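Or, we could use the SciPy function for paired t-tests. Its p-value is two-sided, so it is double the one-sided value we calculated, and the sign of the statistic depends on the order in which the two groups are passed.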

In [12]:
stats.ttest_rel(original_marks, revised_marks) # Note, this assumes a 2-tailed test

Ttest_relResult(statistic=-2.7013510133444893, pvalue=0.024339785246278736)
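
To get the one-sided result directly, recent SciPy releases accept an `alternative` argument. The sketch below assumes SciPy 1.6 or later, where `ttest_rel` supports `alternative='greater'`. On older versions the two-sided p-value has to be halved by hand.

```python
from scipy import stats

original_marks = [170, 152, 155, 156, 142, 158, 148, 144, 153, 155]
revised_marks = [174, 155, 179, 155, 145, 160, 155, 150, 155, 165]

# H0: mean(revised - original) = 0, HA: mean(revised - original) > 0
result = stats.ttest_rel(revised_marks, original_marks, alternative='greater')
print("t-statistic is %.4f, one-sided p-value is %.5f" % (result.statistic, result.pvalue))
```

Passing the revised marks first makes the statistic positive when marks improve, which matches the direction of the manual calculation.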

### The t-test statistic with two means (independent)
The above example worked because we had groups that were paired (same individuals, different measurements). What if the data were unpaired? In this case we cannot calculate the t-statistic the same way. Instead, we replace the standard error with the standard error of the difference between two means, where $s_{1}$ and $s_{2}$ are the sample standard deviations of the two groups:

$$SE_{\overline{x}_{1} - \overline{x}_{2}} = \sqrt{\frac{s_{1}^2}{n_{1}} + \frac{s_{2}^2}{n_{2}}}$$

This gives us the following formula for the t-statistic:

$$t = \frac{\overline{x}_{1} - \overline{x}_{2}}{\sqrt{\frac{s_{1}^2}{n_{1}} + \frac{s_{2}^2}{n_{2}}}}$$

<b>Unpaired t-test example</b><br>
In a final attempt to prove the value of his training course, Gabe compares the marks of his students with those of students trained by his competitor (Abe's GRE training). Is there any difference between the two courses?

In [13]:
group_1_marks = [170, 152, 155, 156, 142, 158, 148, 144, 153, 155]
group_2_marks = [171, 159, 164, 175, 155, 181, 154, 158, 160, 170]

n1 = len(group_1_marks)
n2 = len(group_2_marks)

group_1_mean = np.mean(group_1_marks)
group_2_mean = np.mean(group_2_marks)

group_1_sd = np.std(group_1_marks, ddof=1)
group_2_sd = np.std(group_2_marks, ddof=1)

se = np.sqrt((group_1_sd**2)/(n1) + (group_2_sd**2)/(n2))

In [14]:
t_statistic = (group_1_mean - group_2_mean) / se
print("t-statistic is %.4f" % t_statistic)
p_val = stats.t.sf(np.abs(t_statistic), n1-1) * 2
print("p-value is %.4f" % p_val)

t-statistic is -2.9924
p-value is 0.0151


Again, there is a python function for this test. We can compare the result:

In [15]:
stats.ttest_ind(group_1_marks, group_2_marks)

Ttest_indResult(statistic=-2.9924111641121973, pvalue=0.007813262523768034)

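The p-value from the Python function is lower because of our quick estimate of the degrees of freedom in the last step. Our approximation $(n_{1}-1)$ is fast but conservative, whereas `stats.ttest_ind` by default assumes equal variances and uses $n_{1} + n_{2} - 2 = 18$ degrees of freedom. The t-statistics agree because the two groups are the same size, so the pooled and unpooled standard errors coincide.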
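
A middle ground between the two results is Welch's t-test, which keeps the unpooled standard error and estimates the degrees of freedom with the Welch-Satterthwaite formula. The sketch below computes that estimate by hand and compares it with `stats.ttest_ind(..., equal_var=False)`. The variable names `v1`, `v2`, and `welch_df` are introduced here for illustration.

```python
import numpy as np
from scipy import stats

group_1_marks = [170, 152, 155, 156, 142, 158, 148, 144, 153, 155]
group_2_marks = [171, 159, 164, 175, 155, 181, 154, 158, 160, 170]
n1, n2 = len(group_1_marks), len(group_2_marks)

v1 = np.var(group_1_marks, ddof=1) / n1  # s1^2 / n1
v2 = np.var(group_2_marks, ddof=1) / n2  # s2^2 / n2

# Welch-Satterthwaite approximation to the degrees of freedom
welch_df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

t_statistic = (np.mean(group_1_marks) - np.mean(group_2_marks)) / np.sqrt(v1 + v2)
p_manual = stats.t.sf(abs(t_statistic), welch_df) * 2

welch = stats.ttest_ind(group_1_marks, group_2_marks, equal_var=False)
print("Welch df is %.2f, manual p-value is %.4f, SciPy p-value is %.4f" % (welch_df, p_manual, welch.pvalue))
```

The estimated degrees of freedom fall between the conservative $n_{1}-1 = 9$ and the pooled $n_{1}+n_{2}-2 = 18$, so the p-value lands between the two values reported above.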

## Inference Using Categorical Data
When working with categorical data, the sample proportion (p) is the number of successes divided by the number of trials. If there are enough successes and failures (generally taken as at least 10 of each), we can calculate a standard error based on p as:

$$SE = \sqrt{\frac{p(1-p)}{n}}$$

Note that, for confidence intervals we would typically use the sample proportion to calculate the standard error, but for hypothesis tests we would use the proportion claimed in the null hypothesis.

<b>Categorical Inference Example</b><br>
A simple random sample of 1,028 US adults in March 2013 found that 56% support nuclear arms reduction. Does this provide convincing evidence that a majority of Americans supported nuclear arms reduction at the 5% significance level?

In [16]:
n = 1028
p_sample = 0.56
p_null = 0.5 # Here we use the null proportion

std_error = np.sqrt((p_null*(1-p_null))/n)

z_score = (p_sample - p_null) / std_error
p_value = 1 - stats.norm.cdf(z_score)

print("The z-score is %.4f, which has an associated p-value of %.5f" % (z_score, p_value))

The z-score is 3.8475, which has an associated p-value of 0.00006
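
For comparison, a confidence interval for the same survey uses the sample proportion in the standard error rather than the null proportion. A minimal sketch of a 95% interval, with variable names chosen here for illustration:

```python
import numpy as np
from scipy import stats

n = 1028
p_sample = 0.56

# For a confidence interval, the standard error is based on the sample proportion
se_sample = np.sqrt(p_sample * (1 - p_sample) / n)
z_crit = stats.norm.ppf(0.975)  # two-sided 95% critical value

ci_low = p_sample - z_crit * se_sample
ci_high = p_sample + z_crit * se_sample
print("95%% confidence interval: (%.4f, %.4f)" % (ci_low, ci_high))
```

The whole interval lies above 0.5, which is consistent with the result of the hypothesis test.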


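How many people would we need to sample to keep the margin of error at or below 0.04 at the 95% confidence level? We use p = 0.5 because it gives the largest possible standard error, and therefore the most cautious sample size.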

In [17]:
# Sample size estimator
p = 0.5
z = 1.96
cutoff = 0.04
n = ((z / cutoff)**2) * (p * (1-p))
n = math.ceil(n)
print("Estimated sample size is at least %.0f" % n)

Estimated sample size is at least 601
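
The same calculation can be wrapped in a small function so that the margin of error and confidence level are easy to vary. This is a sketch: `required_sample_size` is a hypothetical helper name, and it looks up the critical value with `stats.norm.ppf` instead of hard-coding 1.96.

```python
import math
from scipy import stats

def required_sample_size(margin, p=0.5, confidence=0.95):
    """Hypothetical helper: smallest n whose margin of error is at most `margin`."""
    z = stats.norm.ppf(1 - (1 - confidence) / 2)  # two-sided critical value
    return math.ceil((z / margin) ** 2 * p * (1 - p))

for margin in (0.05, 0.04, 0.02):
    print("Margin %.2f needs at least %d people" % (margin, required_sample_size(margin)))
```

Because the margin enters the formula squared, halving the margin of error roughly quadruples the required sample size.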


## Comparing Two Proportions
When comparing the proportions of two groups, the null hypothesis is often that the two groups are equivalent (i.e. $p_{1} - p_{2} = 0$). Just as we used the null proportion above, here we use the pooled proportion, which is our best estimate of the common proportion under the null hypothesis:

$$\hat{p} = \frac{\text{num. successes}}{\text{num. cases}} = \frac{p_{1}n_{1} + p_{2}n_{2}}{n_{1} + n_{2}}$$

This pooled proportion is used to calculate the standard error:

$$SE = \sqrt{\frac{\hat{p}(1-\hat{p})}{n_{1}} + \frac{\hat{p}(1-\hat{p})}{n_{2}}}$$

Which gives us the following equation for the z-statistic:

$$z = \frac{(p_{1} - p_{2}) - 0}{\sqrt{\frac{\hat{p}(1-\hat{p})}{n_{1}} + \frac{\hat{p}(1-\hat{p})}{n_{2}}}}$$

<b>Comparing Proportions Example</b><br>
Time magazine reported the result of a telephone poll of 800 adult Americans. The question posed of the Americans who were surveyed was: "Should the federal tax on cigarettes be raised to pay for health care reform?" The results of the survey were: 

|             | Non-Smokers    | Smokers  |
|-------------|----------------|----------|
|Yes          | 351            | 141      |
|No           | 254            | 154      |

Is there sufficient evidence at the α = 0.05 level, say, to conclude that the two populations — smokers and non-smokers — differ significantly with respect to their opinions?

In [18]:
n1 = 351 + 254
n2 = 141 + 154

p_ns = 351 / n1  # Proportion of non-smokers who answered yes
p_s = 141 / n2  # Proportion of smokers who answered yes
pp = (351 + 141) / (n1 + n2)

# Here our point estimate is the difference in probabilities between the two groups
point_estimate = p_ns - p_s

# Our denominator will be the standard error, which we calculate using the pooled proportion
std_error = np.sqrt(((pp*(1-pp))/n1) + ((pp*(1-pp))/n2))

z_score = (point_estimate - 0) / std_error
p_value = (1 - stats.norm.cdf(np.abs(z_score))) * 2  # Two-sided test; abs() keeps this valid if z is negative
print("The p-value is %.4f. We reject the null hypothesis, there is some difference between these two groups" % p_value)

The p-value is 0.0038. We reject the null hypothesis, there is some difference between these two groups
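
For a 2x2 table, this pooled two-proportion z-test is equivalent to the chi-square test covered in the next section: the chi-square statistic without continuity correction equals $z^2$, and the p-values match. The sketch below checks this with `stats.chi2_contingency`, passing `correction=False` to switch off the Yates correction that SciPy applies to 2x2 tables by default.

```python
import numpy as np
from scipy import stats

# Rows: yes / no, columns: non-smokers / smokers
table = np.array([[351, 141],
                  [254, 154]])
n1, n2 = table.sum(axis=0)
p_ns, p_s = table[0] / table.sum(axis=0)
pp = table[0].sum() / table.sum()  # pooled proportion of "yes" answers

z_score = (p_ns - p_s) / np.sqrt(pp * (1 - pp) * (1 / n1 + 1 / n2))
p_z = 2 * stats.norm.sf(abs(z_score))

chi2, p_chi2, dof, expected = stats.chi2_contingency(table, correction=False)
print("z^2 = %.4f, chi-square = %.4f" % (z_score ** 2, chi2))
print("z-test p-value = %.4f, chi-square p-value = %.4f" % (p_z, p_chi2))
```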


## Chi-Square Distribution
What do we do when we have multiple groups? The Chi-Square test generalizes the z-test to cases where there are more than two proportions. The $\chi^2$ statistic is calculated as:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

Where O is the observed count and E is the expected count under the null hypothesis, computed for each cell as:

$$E = \frac{row\ total \times column\ total}{sample\ size}   $$

<b>Chi-Square Example</b><br>
A random sample of 395 people were surveyed and each person was asked to report the highest education level they obtained. The data that resulted from the survey is summarized in the following table:

|             | High School | Bachelors | Masters | PhD  | Total |
|-------------|-------------|-----------|---------|------|-------|
|Female       | 60          | 54        | 46      | 41   | 201   |
|Male         | 40          | 44        | 53      | 57   | 194   |
|Total        | 100         | 98        | 99      | 98   | 395   |


Are gender and education level dependent at the 5% level of significance? In other words, given the data collected above, is there a relationship between the gender of an individual and the level of education that they have obtained?

In [19]:
num_rows = 2
num_cols = 4

# First comparison (high school)
obs_1 = 60
row_1_total = 201
col_1_total = 100
sample_total = 395
exp_1 = (col_1_total * row_1_total) / sample_total
chi_1 = (obs_1 - exp_1)**2 / (exp_1)

# Second comparison
obs_2 = 54
row_1_total = 201
col_2_total = 98
exp_2 = (col_2_total * row_1_total) / sample_total
chi_2 = (obs_2 - exp_2)**2 / (exp_2)

# Third comparison
obs_3 = 46
row_1_total = 201
col_3_total = 99
exp_3 = (col_3_total * row_1_total) / sample_total
chi_3 = (obs_3 - exp_3)**2 / (exp_3)

# Fourth comparison
obs_4 = 41
row_1_total = 201
col_4_total = 98
exp_4 = (col_4_total * row_1_total) / sample_total
chi_4 = (obs_4 - exp_4)**2 / (exp_4)

# Fifth comparison (high school)
obs_5 = 40
row_2_total = 194
col_1_total = 100
sample_total = 395
exp_5 = (col_1_total * row_2_total) / sample_total
chi_5 = (obs_5 - exp_5)**2 / (exp_5)

# Sixth comparison
obs_6 = 44
row_2_total = 194
col_2_total = 98
exp_6 = (col_2_total * row_2_total) / sample_total
chi_6 = (obs_6 - exp_6)**2 / (exp_6)

# Seventh comparison
obs_7 = 53
row_2_total = 194
col_3_total = 99
exp_7 = (col_3_total * row_2_total) / sample_total
chi_7 = (obs_7 - exp_7)**2 / (exp_7)

# Eighth comparison
obs_8 = 57
row_2_total = 194
col_4_total = 98
exp_8 = (col_4_total * row_2_total) / sample_total
chi_8 = (obs_8 - exp_8)**2 / (exp_8)

chi = chi_1 + chi_2 + chi_3 + chi_4 + chi_5 + chi_6 + chi_7 + chi_8
dof = (num_rows - 1) * (num_cols - 1)
print("The Chi-Square value is %.4f" % chi)
p_value = 1 - stats.distributions.chi2.cdf(chi, dof)  # Use the computed statistic rather than a hard-coded value
print("The associated p-value is %.4f" % p_value)

The Chi-Square value is 8.0061
The associated p-value is 0.0459
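
The cell-by-cell calculation above can also be written in vectorized form. In the sketch below, `np.outer` of the row and column totals produces every expected count at once. The variable names are chosen here for illustration.

```python
import numpy as np
from scipy import stats

observed = np.array([[60, 54, 46, 41],
                     [40, 44, 53, 57]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)

# E = (row total x column total) / sample size, for every cell
expected = np.outer(row_totals, col_totals) / observed.sum()

chi = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = stats.chi2.sf(chi, dof)

print("The Chi-Square value is %.4f" % chi)
print("The associated p-value is %.4f" % p_value)
```

This form scales to tables of any size without adding a block of code per cell.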


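Again, we can use a function from SciPy's stats module to check these values. It returns the chi-square statistic, the p-value, the degrees of freedom, and the table of expected counts.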

In [20]:
obs_array = np.array([[60, 54, 46, 41], [40, 44, 53, 57]])
stats.chi2_contingency(obs_array)

(8.006066246262538,
 0.045886500891747214,
 3,
 array([[50.88607595, 49.86835443, 50.37721519, 49.86835443],
        [49.11392405, 48.13164557, 48.62278481, 48.13164557]]))
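
To finish with the decision step, the result can be unpacked into named values and compared with the significance level. A minimal sketch, assuming the 5% level stated in the example:

```python
import numpy as np
from scipy import stats

obs_array = np.array([[60, 54, 46, 41], [40, 44, 53, 57]])
chi2_stat, p_value, dof, expected = stats.chi2_contingency(obs_array)

alpha = 0.05
reject_null = p_value < alpha

# The chi-square approximation is usually considered reasonable when every expected count is at least 5
print("Smallest expected count: %.2f" % expected.min())
print("p-value = %.4f, reject independence at alpha = %.2f: %s" % (p_value, alpha, reject_null))
```

With a p-value just under 0.05, we reject the null hypothesis of independence, although only by a narrow margin.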