# P-values

In [None]:
from datascience import *
from cs104 import *
import numpy as np
%matplotlib inline

## 1. Midterm Scores

Was Lab Section 3 graded differently than the other lab sections, or could its low average score be attributed to chance in which students happened to be assigned to that section?

In this dataset, each **row** (observation) is a single student. There are 4 lab sections. 

In [None]:
scores = Table().read_table("data/scores_by_section.csv")
scores.sample(5)

How many students are in each section?

In [None]:
scores.group("Section")

What was the mean Midterm score in each section?

In [None]:
scores.group("Section", np.mean)

Next, we pull out Lab Section 3's midterm scores and compute their observed average.

In [None]:
observed_scores = scores.where("Section", 3).column("Midterm")
observed_scores

In [None]:
observed_average = np.mean(observed_scores)
observed_average

The observed sample size is the number of students in Lab Section 3.

In [None]:
observed_sample_size = scores.where("Section", 3).num_rows
observed_sample_size

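**Null hypothesis model**: Lab Section 3's scores are like a random sample from the whole class, so the model parameter is the mean of the population.

**Null hypothesis restated**: The mean in *any* sample should be close to the mean in the population (the whole class).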

In [None]:
null_hypothesis_model_parameter = np.mean(scores.column("Midterm"))
null_hypothesis_model_parameter

The observed statistic is the absolute difference between Lab Section 3's average midterm score and the average midterm score across *all* students.

In [None]:
observed_midterm_statistic = abs(observed_average - null_hypothesis_model_parameter)
observed_midterm_statistic

Abstraction: Let's create a function that calculates a statistic as the absolute difference between the mean of a sample and the mean of a population. 

In [None]:
def statistic_abs_diff_means(sample):
    """Return the absolute difference in means between a sample and a population"""
    return np.abs(np.mean(sample) - null_hypothesis_model_parameter)

In [None]:
observed_midterm_statistic = statistic_abs_diff_means(observed_scores)
observed_midterm_statistic

To simulate a sample under the null hypothesis, we'll sample without replacement, which represents a different assignment of students to sections.

For example, these students "could have been" Section 3 (if the section had only 5 students).

In [None]:
sample = scores.sample(5, with_replacement=False)
sample

In [None]:
sample.column("Midterm")

Let's wrap this in a function that we can later pass to `simulate_sample_statistic`.

In [None]:
def sample_scores(sample_size):
    """
    Return the Midterm scores of sample_size students sampled
    without replacement from the whole class.
    
    Note: we're using with_replacement=False here because we don't 
    want to sample the same student's score twice.
    """
    return scores.sample(sample_size, with_replacement=False).column("Midterm")

In [None]:
sample_scores(observed_sample_size)
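
To make the with/without replacement distinction concrete, here is a minimal sketch that uses only Python's standard library instead of the course's `Table` methods. The list `toy_scores` is made-up data for illustration, not the class dataset.

```python
import random

# Hypothetical midterm scores for a tiny class of 8 students (made-up data).
toy_scores = [12, 15, 18, 20, 21, 22, 24, 25]

rng = random.Random(0)  # seeded so the draws are repeatable

# Without replacement: each student appears at most once,
# like assigning distinct students to a section.
without_replacement = rng.sample(toy_scores, 5)

# With replacement: the same student's score may be drawn more than once.
with_replacement = rng.choices(toy_scores, k=5)

print("without replacement:", without_replacement)
print("with replacement:   ", with_replacement)
```

Because every score in `toy_scores` is distinct, the sample drawn without replacement can never contain a repeated value, while the sample drawn with replacement may.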

In [None]:
simulated_midterm_statistics = simulate_sample_statistic(sample_scores, 
                                         observed_sample_size, 
                                         statistic_abs_diff_means, 
                                         5000)
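
The `simulate_sample_statistic` function comes from the course library, and its source is not shown here. As a rough mental model, assuming it simply repeats "draw a sample, compute the statistic" a given number of times, it might behave like the sketch below. The name `simulate_sample_statistic_sketch`, the toy data, and the helper functions are all hypothetical stand-ins, not the library's actual implementation.

```python
import random
import statistics

def simulate_sample_statistic_sketch(make_sample, sample_size, compute_statistic, num_trials):
    """Hypothetical stand-in: collect one statistic per simulated sample."""
    simulated = []
    for _ in range(num_trials):
        sample = make_sample(sample_size)            # one sample under the null
        simulated.append(compute_statistic(sample))  # its statistic
    return simulated

# Made-up scores for a class of 10 students.
toy_scores = [12, 15, 18, 20, 21, 22, 24, 25, 17, 19]
toy_population_mean = statistics.mean(toy_scores)
rng = random.Random(104)  # seeded for repeatability

def toy_sample_scores(sample_size):
    """Sample scores without replacement from the toy class."""
    return rng.sample(toy_scores, sample_size)

def toy_abs_diff_means(sample):
    """Absolute difference between the sample mean and the toy class mean."""
    return abs(statistics.mean(sample) - toy_population_mean)

toy_statistics = simulate_sample_statistic_sketch(
    toy_sample_scores, 4, toy_abs_diff_means, 1000)
print(len(toy_statistics), min(toy_statistics), max(toy_statistics))
```

The result is one simulated statistic per trial, which is the kind of array the histogram in the next cell summarizes.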

In [None]:
results = Table().with_columns("Statistic: abs(sample mean - population mean)", 
                               simulated_midterm_statistics)
plot = results.hist(title="Null hypothesis empirical distribution")    
plot.dot(observed_midterm_statistic)

## 2. Calculating p-values

In [None]:
def plot_simulated_and_observed_statistics(simulated_statistics, observed_statistic):
    """
    Plots the empirical distribution of the simulated statistics, along with
    the observed data's statistic, highlighting the tail in yellow.
    """
    results = Table().with_columns("Statistic: abs(sample mean - population mean)", 
                                   simulated_statistics)
    plot = results.hist(left_end=observed_statistic, title="Null hypothesis empirical distribution")
    plot.dot(observed_statistic)

In [None]:
plot_simulated_and_observed_statistics(simulated_midterm_statistics, observed_midterm_statistic)

Suppose we had instead observed other statistics: 0.3, 1.5, and 2.7.

In [None]:
with Figure(1,3):
    plot_simulated_and_observed_statistics(simulated_midterm_statistics, 0.3)
    plot_simulated_and_observed_statistics(simulated_midterm_statistics, 1.5)
    plot_simulated_and_observed_statistics(simulated_midterm_statistics, 2.7)

**Calculating p-values:**
Let's compute the proportion of the histogram that is colored yellow. This captures the proportion of the simulated samples under the null hypothesis whose statistic is at least as extreme as the observed statistic.

We'll do it first for our midterm example, and then generalize to a function we can use to compute p-values for any test.

In [None]:
simulated_midterm_statistics

In [None]:
np.count_nonzero(simulated_midterm_statistics >= observed_midterm_statistic) / len(simulated_midterm_statistics)

Now as a function:

In [None]:
def empirical_pvalue(null_statistics, observed_statistic): 
    """
    Return the proportion of the null statistics that are greater than 
    or equal to the observed statistic.
    """
    return np.count_nonzero(null_statistics >= observed_statistic) / len(null_statistics)

In [None]:
empirical_pvalue(simulated_midterm_statistics, observed_midterm_statistic)
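
As a small worked example of the same calculation, the sketch below counts by hand on ten made-up null statistics. It uses a plain-Python version, named `empirical_pvalue_sketch` here to avoid confusion with the function above, so that it runs without NumPy. The numbers are illustrative only.

```python
def empirical_pvalue_sketch(null_statistics, observed_statistic):
    """Proportion of null statistics >= the observed statistic (plain Python)."""
    count = sum(1 for s in null_statistics if s >= observed_statistic)
    return count / len(null_statistics)

# Ten made-up simulated statistics.
toy_null_statistics = [0.1, 0.4, 0.5, 0.9, 1.2, 1.2, 1.8, 2.0, 2.6, 3.1]

# Six of the ten values (1.2, 1.2, 1.8, 2.0, 2.6, 3.1) are >= 1.2.
p_moderate = empirical_pvalue_sketch(toy_null_statistics, 1.2)

# Only one value (3.1) is >= 3.0.
p_extreme = empirical_pvalue_sketch(toy_null_statistics, 3.0)

print(p_moderate, p_extreme)
```

The further the observed statistic sits in the tail, the smaller the proportion of simulated statistics that reach it, and so the smaller the p-value.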

## 3. Impact of Sample Size on p-value

Recall that as our sample size increases, we observe less variability in our samples.

Here's a visualization showing the **Null hypothesis empirical distribution** for different sample sizes. Because larger samples vary less, the distribution narrows as the sample size grows, and the p-value for the same observed statistic shrinks.

In [None]:

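def visualize_p_value_and_sample_size(sample_size):
    """
    Plot the null hypothesis empirical distribution for the given sample size,
    mark the observed midterm statistic, and show the p-value in the title.
    """
    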
    simulated_midterm_statistics = simulate_sample_statistic(sample_scores, 
                                         sample_size, 
                                         statistic_abs_diff_means, 
                                         1000)
    
    results = Table().with_columns("Statistic: abs(sample mean - class mean)", 
                                   simulated_midterm_statistics)
    plot = results.hist(left_end=observed_midterm_statistic,bins=np.arange(0,6,0.2)) 

    plot.dot(observed_midterm_statistic)
    plot.set_ylim(0,2)
    
    pvalue = empirical_pvalue(simulated_midterm_statistics, observed_midterm_statistic)
    plot.set_title('sample-size = ' + str(sample_size) + '\np-value = '+str(pvalue))
    

interact(visualize_p_value_and_sample_size, 
         sample_size=Slider(5,80))
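
The effect of sample size on variability can also be checked numerically, without the interactive plot. The sketch below uses only the standard library and a made-up class of 200 scores. The name `spread_of_sample_means` and the data are assumptions for illustration, not part of the course library.

```python
import random
import statistics

rng = random.Random(104)  # seeded for repeatability

# Made-up population: 200 students with integer scores between 0 and 30.
toy_scores = [rng.randint(0, 30) for _ in range(200)]

def spread_of_sample_means(sample_size, num_trials=1000):
    """Standard deviation of the sample mean across many simulated samples."""
    sample_means = [
        statistics.mean(rng.sample(toy_scores, sample_size))  # without replacement
        for _ in range(num_trials)
    ]
    return statistics.pstdev(sample_means)

spreads = {n: spread_of_sample_means(n) for n in (5, 20, 80)}
for n, spread in spreads.items():
    print("sample size", n, "spread of sample means", round(spread, 3))
```

The spread should shrink as the sample size grows, which is why the same observed statistic looks more surprising for larger samples.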