<h1> Lecture 18 

Data Science 8, Spring 2021 </h1>

<h3>
<b>
<ul>
<li>Hypothesis Testing and $p$-values   </li><br>
            
<li>Making Decisions with Incomplete Information  </li><br>

<li>Error Probabilities  </li><br>
</ul>
</b>
</h3>

In [None]:
from datascience import *
import numpy as np
import warnings
warnings.filterwarnings("ignore")

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
plots.rcParams["patch.force_edgecolor"] = True

#The following allows porting images into a Markdown window
from IPython.display import Image

## The GSI's Defense ##

In [None]:
scores = Table.read_table('scores_by_section.csv')
scores

<h3> Headcounts of each of the twelve sections</h3>

In [None]:
# Recall: when the group method is given only a column label,
# it returns the headcount (number of rows) for each category.
# Here the categories are the section numbers. 
section_headcounts = scores.group('Section')
section_headcounts.show()

<h3>Compute the average score for each section</h3>
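
As a rough, library-free illustration of what grouping by section does, here is a minimal sketch. The `rows` list holds made-up (section, score) pairs and is purely hypothetical; it is not the course data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (section, midterm score) rows; NOT the real class data.
rows = [(1, 20), (1, 22), (2, 15), (2, 17), (2, 19), (3, 10)]

# Collect the scores that belong to each section.
scores_by_section = defaultdict(list)
for section, score in rows:
    scores_by_section[section].append(score)

# Headcount per section: what grouping with only a label reports.
headcounts = {s: len(v) for s, v in scores_by_section.items()}

# Average per section: what grouping with an averaging function reports.
averages = {s: mean(v) for s, v in scores_by_section.items()}
```

The two dictionaries play the roles of the headcount table and the section-average table.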

In [None]:
# Pass the function np.average as the second argument, and 
# the group method will return the average score for each section.
section_averages = scores.group('Section', np.average)
section_averages.show()

<h4>Section 3's Midterm Average</h4>

This is our observed test statistic.

In [None]:
observed_average = section_averages.column('Midterm average').item(2) 
observed_average

<h4>Section 3's Population (Head Count)</h4>

This is our sample size. 

In [None]:
sample_size = section_headcounts.column('count').item(2)
sample_size

In [None]:
# In our random selection, we do NOT want to select the same student 
# more than once.  So, we must sample withOUT replacement.
random_sample = scores.sample(sample_size, with_replacement=False)
random_sample

<h4>What we've done in the cell above constitutes running ONE trial. <br>
    
We must run a large number of trials so we can construct an empirical distribution. </h4>

<h4>Random Sample's Average Score: <br>
    
This is our Simulated Test Statistic</h4>

In [None]:
random_sample_average_score = np.average(random_sample.column('Midterm'))
random_sample_average_score
# If you wish to round, uncomment the line below
#np.round(random_sample_average_score,2)

<h4>Compare with the observed statistic&mdash;average of scores in Sec. 3</h4>

In [None]:
observed_average
# If you wish to round, uncomment the line below
#np.round(observed_average,2)

<h4>The two averages don't look very far apart, but we still haven't defined what "far" means here.</h4>

<h3>Define a function that does the above:</h3><br>

<h4>Creates a random sample of size <tt>sample_size</tt> from the roster,  <br>

and computes the average score of that sample.<br>

Each run of this function constitutes one "trial."</h4>
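
The single trial described above can also be sketched without the course data file, using only the standard library. Everything here is an assumption made for illustration: the roster `all_scores` is made up, the section size of 27 is arbitrary, and `one_trial_average` is a hypothetical name.

```python
import random
from statistics import mean

# Hypothetical class roster: 300 made-up midterm scores between 0 and 25.
rng = random.Random(8)
all_scores = [rng.randint(0, 25) for _ in range(300)]

def one_trial_average(scores_list, sample_size, rng):
    """Average score of `sample_size` students drawn without replacement."""
    # Random.sample never selects the same list position twice,
    # so no student can be picked more than once.
    sample = rng.sample(scores_list, sample_size)
    return mean(sample)

# One trial: one random "section" of 27 students (27 is an assumed size).
trial_average = one_trial_average(all_scores, 27, rng)
```

Calling `one_trial_average` again generally gives a different value, which is the variation the simulation relies on.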

In [None]:
# Simulate one value of the test statistic 
# under the hypothesis that the section is like a random sample from the class

def random_sample_midterm_avg(sample_size):
    random_sample = scores.sample(sample_size, with_replacement = False)
    return np.average(random_sample.column('Midterm'))

<h4>Let's run the cell a few times to check if it's generating varying numbers.</h4>

In [None]:
random_sample_midterm_avg(sample_size)

<h3>The next cell runs the "trial" many, many times&mdash;in fact, <tt>num_simulations</tt> times.</h3>
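
A minimal standalone sketch of repeating the trial many times follows. It uses made-up scores, an assumed section size of 27, and only 2,000 trials to keep it quick; none of these values come from the course data.

```python
import random
from statistics import mean

rng = random.Random(18)
all_scores = [rng.randint(0, 25) for _ in range(300)]  # made-up roster
sample_size = 27    # assumed section size
num_trials = 2000   # far fewer than 50,000, just to keep the sketch fast

# Each element is the average of one random sample drawn
# without replacement from the roster.
simulated_averages = [
    mean(rng.sample(all_scores, sample_size))
    for _ in range(num_trials)
]
```

The list `simulated_averages` is the empirical distribution of the statistic. Its values cluster around the overall class average, because every sample is drawn from the same roster.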

In [None]:
# Simulate 50,000 copies of the test statistic

num_simulations= 50000

# Create an empty array that will ultimately contain 
# the sample average for each of the trials.
sample_averages = make_array()

for i in np.arange(num_simulations):
    new_sample_average=random_sample_midterm_avg(sample_size)
    sample_averages = np.append(sample_averages, new_sample_average)    

<h3> Our Decision: </h3>

<h3>Compare the simulated distribution of the statistic and the actual, observed statistic </h3>

<h4>Create a table containing the sample averages.</h4>

In [None]:
averages_table = Table().with_column('Random Sample Average', sample_averages)
#averages_table

<h4>Plot a histogram of the sample averages<br>
    
Superimpose on the histogram the value of the actual, observed statistic (red dot)
    
</h4>

In [None]:
averages_table.hist(bins = 20)
# Plot a red dot of size 120, at vertical coordinate -0.01, 
# which is just under the horizontal axis
plots.scatter(observed_average, -0.01, color='red', s=120);

<h3>Question: Does Sec. 3's average (red dot) <br>
seem very different from the averages of the random samples?</h3>

<h4>This is an example where each answer&mdash;'yes' or 'no'&mdash;seems plausible. <br>
    
This is why we have to define terms such as "different," "close," or "far" more concretely.</h4>

SLIDE: Statistical Significance

<h3>Conventions About Inconsistency</h3>

<h4>Approach I: Determine the fraction of simulated sample averages $\leq$ the observed value.<br><br>
    
If the fraction is less than 5%, the observed value is treated as a statistical outlier.<br><br>
    
    
Note: The observed average is the average score of Sec. 3.</h4>

<h4> How many of our sample averages (simulated) <br><br>
are less than, or equal to, the observed average? </h4>

In [None]:
sample_averages <= observed_average

<h4>To determine the count of sample averages $\leq$ observed average, use <tt>sum</tt>.</h4>

In [None]:
tail_head_count = sum(sample_averages <= observed_average)
tail_head_count

<h4>The tail probability is the ratio of the tail head count<br><br> 
over <tt>num_simulations</tt>, the total number of samples (trials).<br><br> 

The tail probability is also called the $p$-value.<br><br> 
    It's the probability, computed under the assumption that the section is like a random sample from the class, of obtaining results <u>at least as extreme</u> as the observed value.</h4>
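
To make the tail-probability arithmetic concrete, here is a small sketch on ten made-up simulated averages. The function name `left_tail_p_value` and all the numbers are hypothetical.

```python
def left_tail_p_value(simulated, observed):
    """Fraction of simulated statistics that are <= the observed one."""
    at_or_below = sum(1 for value in simulated if value <= observed)
    return at_or_below / len(simulated)

# Ten made-up simulated averages and a made-up observed average.
toy_simulated = [12.0, 13.5, 14.0, 14.5, 15.0, 15.5, 16.0, 16.5, 17.0, 18.0]
toy_observed = 13.5

# Two of the ten values (12.0 and 13.5) are at or below 13.5.
p_value = left_tail_p_value(toy_simulated, toy_observed)

# Under the 5% convention, this toy result is not an outlier.
is_outlier = p_value < 0.05
```

With 2 of 10 values in the tail, the toy p-value is 0.2, well above the 5% cutoff.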

In [None]:
# (1) Calculate the p-value: simulation area beyond observed value
tail_probability=tail_head_count/num_simulations
tail_probability

<h3>This is NOT less than 5%.  So, the data are consistent with the GSI's assertion.</h3>

<h4>Alternative Python code for calculating the tail probability:</h4>

In [None]:
# (1) Calculate the p-value: simulation area beyond observed value
np.count_nonzero(sample_averages <= observed_average) / num_simulations
# (2) See if this is less than 5%

<h3>Where is the 5% cutoff?</h3>

In [None]:
# Recall the Averages Table
averages_table

<h4>Sort the table from low to high values:</h4>

In [None]:
sorted_averages_table = averages_table.sort(0)
sorted_averages_table

<h4>Grab the boundary value <u>beyond</u> which the tail probability $<$ 0.05 (5%):</h4>
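
The idea of reading the cutoff off a sorted list can be sketched on a toy example. The values 1 through 100 stand in for simulated averages, and `left_tail_cutoff` is a hypothetical helper, not part of the `datascience` library.

```python
def left_tail_cutoff(simulated, tail_percent=5):
    """Value at the position that marks off the lowest `tail_percent` percent."""
    ordered = sorted(simulated)
    # Integer arithmetic avoids floating-point surprises in the index.
    index = len(ordered) * tail_percent // 100
    return ordered[index]

# Toy stand-in for the simulated averages: the whole numbers 1 to 100.
toy_values = list(range(1, 101))

# 5% of 100 is 5, so the cutoff sits at index 5 of the sorted list.
cutoff = left_tail_cutoff(toy_values)

# Share of values strictly below the cutoff.
share_below = sum(1 for v in toy_values if v < cutoff) / len(toy_values)
```

Here the cutoff is 6, and exactly 5 of the 100 values lie strictly below it, which mirrors taking the item at index 2,500 out of 50,000 sorted averages.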

In [None]:
# (1) Find the simulated value at the 5% position: 5% of 50,000 = 2,500
five_percent_index = num_simulations // 20  # 5% of the number of trials
five_percent_point = sorted_averages_table.column(0).item(five_percent_index)
five_percent_point

<h3>If Sec. 3's average score (i.e., observed value) <br><br>
had been lower than the 5%-point above, <br><br>
it would have contradicted the GSI's claim.</h3>

In [None]:
# (2) See if this value is greater than observed value
observed_average

<h2>It was close, but the GSI Wins by scraping by!<br><br>
I still would NOT want to be in that GSI's section!</h2>

### Visual Representation

In [None]:
averages_table.hist(bins = 20)
# Plot a gold-colored vertical line, of thickness 2, 
# at horizontal coordinate equal to the 
# five_percent_point (about 13.63; the exact value varies from run to run), 
# from vertical coordinate 0 to 35 (%/unit) (0 to 0.35)
plots.plot([five_percent_point, five_percent_point], [0, 0.35], color='gold', lw=2)

# Give the plot a title
plots.title('Area to the left of the gold line is 5%');

# Plot a red dot of size 120, at vertical coordinate -0.01, 
# which is just under the horizontal axis
plots.scatter(observed_average, -0.01, color='red', s=120);

<h3>QUESTION: Did Sec. 3's average score fall in the 5% tail?</h3>