# <center>Intro to Hypothesis Testing</center>

## Part 1: our first hypothesis test

You find a brain in the human dissection lab, but you suspect it may belong to a new unknown species.

You decide that brain mass is the only feature that matters. So you will measure the brain's mass, then compare it to an online database of all human brain masses.

So you measure the mass of the brain, and record it:

In [78]:
sample_brain_mass = 1564.2 # grams

Then you download your dataset and load it into Python.

In [79]:
import numpy as np
brain_masses = np.load('human_brain_masses.npy')
print(brain_masses)

[1031.03862494 1132.34383661 1188.58290064 ... 1325.03535947 1342.84315573
 1578.74968743]


### __Does our sample brain come from a different *population* than the brains in this list of human brain masses?__  
*(it's worth noting, though okay to ignore for now, that we are treating these brain measurements as a __population__ and not a sample from a larger population–we'll deal with this later)*  

How can we answer that question?

<i>Intuitive approach: 

Is our brain smaller than the largest human brain?  
Is it larger than the smallest human brain?  
Is it larger or smaller than the average human brain?  
How many human brains are larger than our sample?  
How many are smaller? </i>

__Task__:  
Answer these questions using Python.

In [80]:
# answer here
is_heavier = brain_masses > sample_brain_mass

number_of_heavier_brains = np.sum(is_heavier)

print(f'{number_of_heavier_brains} are heavier than our sample brain.')

251206 are heavier than our sample brain.
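
The remaining intuitive questions can be answered the same way. The following is a minimal sketch that uses synthetic data as a stand-in for `human_brain_masses.npy` (the mean of 1300 g and spread of 150 g are assumed values, chosen only so the example runs on its own):

```python
import numpy as np

# Synthetic stand-in for the real dataset (assumed parameters, not the real data)
rng = np.random.default_rng(0)
brain_masses = rng.normal(1300, 150, size=100_000)
sample_brain_mass = 1564.2  # grams, as recorded above

# Each comparison reduces the whole array to a single answer
smaller_than_largest = sample_brain_mass < brain_masses.max()
larger_than_smallest = sample_brain_mass > brain_masses.min()
larger_than_average = sample_brain_mass > brain_masses.mean()

# Summing a boolean array counts how many entries are True
n_larger = int(np.sum(brain_masses > sample_brain_mass))
n_smaller = int(np.sum(brain_masses < sample_brain_mass))

print(f'Smaller than the largest brain? {smaller_than_largest}')
print(f'Larger than the smallest brain? {larger_than_smallest}')
print(f'Larger than the average brain?  {larger_than_average}')
print(f'{n_larger} brains are larger, {n_smaller} are smaller')
```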


In [81]:
%timeit np.sum(brain_masses>sample_brain_mass)

18.3 ms ± 1.91 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [82]:
# iterating directly over an array yields its *values*, not its indices;
# this is convenient but possibly confusing for new programmers
for this_brain in brain_masses:
    pass  # use this_brain here

# when you need the index, it is simpler to *always* use this:
for i in range(len(brain_masses)):
    this_brain = brain_masses[i]

We could ask hundreds of such questions. Visualization can help speed this up.

__Task__:  
(1) plot a histogram of the human brain mass dataset  
(2) label the mean human brain and our sample brain on that histogram

In [88]:
# answer here
import matplotlib.pyplot as pl  # assumption: `pl` is pyplot; the original import is not shown

fig,ax = pl.subplots(figsize=(12,7))
fontsize = 'x-large'


# show a histogram of all brain masses
hvals,hbins,hps = ax.hist(brain_masses, bins=500, histtype='step', linewidth=2, color='grey')
ax.set_xlabel('Brain mass (g)', fontsize=fontsize)
ax.set_ylabel('Number of brains', fontsize=fontsize)
ax.set_title('Human population brain masses', fontsize=fontsize, pad=60, color='grey')

mean_human_brain_mass = np.mean(brain_masses)
ax.axvline(mean_human_brain_mass, color='darkblue')
ax.text(mean_human_brain_mass+20, 10000, 'Average\nbrain', color='darkblue', fontsize=fontsize)

# display our sample brain mass on the histogram
ax.axvline(sample_brain_mass, color='darkmagenta')
ax.text(1580, 30000, 'Sample\nbrain', color='darkmagenta', fontsize=fontsize)

delta = sample_brain_mass - mean_human_brain_mass 
ax.hlines(65000, mean_human_brain_mass, sample_brain_mass, color='darkgray')
ax.text((mean_human_brain_mass+sample_brain_mass)/2, 66000, f'{delta:0.2f} g', color='darkgray', fontsize=fontsize, ha='center')

hbins = (hbins[1:]+hbins[:-1])/2 # bin centers
fill_xs = hbins[hbins>=sample_brain_mass]
fill_ys = hvals[hbins>=sample_brain_mass]
ax.fill_between(fill_xs, fill_ys, color='grey')

# two-tailed
ax.hlines(65000, mean_human_brain_mass-delta, mean_human_brain_mass, color='darkgray')
ax.text(mean_human_brain_mass-delta/2, 66000, f'{delta:0.2f} g', color='darkgray', fontsize=fontsize, ha='center')
fill_xs = hbins[hbins<=mean_human_brain_mass-delta]
fill_ys = hvals[hbins<=mean_human_brain_mass-delta]
ax.fill_between(fill_xs, fill_ys, color='grey')

<matplotlib.collections.PolyCollection at 0x1c28fc5358>

While it's not possible to be *certain* about the answer to our question, we can judge *how likely* it is that our brain came from this population distribution.

One measure of that is the fraction of brains that are __further from the mean__ than our sample is.

__Task:__  
(1) Compute the *fraction* of human brains with a mass larger than our sample brain.  
(2) Compute the *fraction* of human brains with a mass *further from the mean* than our sample brain.

In [104]:
# answer here

is_heavier = brain_masses > sample_brain_mass

number_of_heavier_brains = np.sum(is_heavier)

fraction_of_heavier_brains = number_of_heavier_brains / len(brain_masses)

print(fraction_of_heavier_brains)

# doubling the one-tailed fraction assumes the distribution is symmetric about its mean
fraction_of_more_extreme_brains = fraction_of_heavier_brains*2
#print(fraction_of_more_extreme_brains)


0.03588657655522522


We could rephrase the latter fraction as "the probability of finding a brain whose mass is more extreme than our sample's."  
This is often referred to as a "p-value."

The first fraction is a "one-tailed" value, and the second is a "two-tailed" value. Unless you have a strong reason to believe your sample could not have been extreme in the other direction (in our case, lighter), it makes the most sense to use a two-tailed test.
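
As a rough sketch of the difference, both fractions can be computed directly from the data without assuming symmetry. The helper name `empirical_p_values` and the synthetic population below are illustrative assumptions, not part of the original notebook:

```python
import numpy as np

def empirical_p_values(population, value):
    """Return (one_tailed, two_tailed) fractions for `value` against `population`."""
    population = np.asarray(population, dtype=float)
    mu = population.mean()
    distance = abs(value - mu)
    # one-tailed: fraction of the population strictly beyond the value, on its side of the mean
    if value >= mu:
        one_tailed = np.mean(population > value)
    else:
        one_tailed = np.mean(population < value)
    # two-tailed: fraction lying farther from the mean than the value, in either direction
    two_tailed = np.mean(np.abs(population - mu) > distance)
    return one_tailed, two_tailed

# Synthetic stand-in for the brain mass dataset (assumed parameters)
rng = np.random.default_rng(1)
population = rng.normal(1300, 150, size=200_000)

one_tailed, two_tailed = empirical_p_values(population, 1564.2)
print(f'one-tailed: {one_tailed:.4f}')
print(f'two-tailed: {two_tailed:.4f}')
```

For a symmetric population the two-tailed fraction comes out close to twice the one-tailed fraction; for a skewed population the two can differ noticeably, which is one reason to count both tails directly.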

<hr/>

### Generalizing the approach

If we can *assume* that the population underlying our data is *normally distributed*, then we can make this analysis very general.

(Recall the <a href="https://en.wikipedia.org/wiki/Central_limit_theorem">central limit theorem</a> for some basic intuition on the generality of normal distributions.)
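
For some hands-on intuition, here is a small simulation sketch. It assumes an exponential population (chosen only because it is clearly not normal) and shows that the means of repeated samples are far more symmetric than the raw values:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(2)

# A clearly non-normal, right-skewed population (illustrative choice)
population = rng.exponential(scale=1.0, size=200_000)

# Draw 5000 samples of 100 observations each and take the mean of every sample
sample_means = rng.choice(population, size=(5_000, 100)).mean(axis=1)

# Skewness is 0 for a perfectly symmetric distribution such as the normal
print(f'skewness of raw values:   {skew(population):.2f}')
print(f'skewness of sample means: {skew(sample_means):.2f}')
```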

To see why, let's visualize the basic properties of a normal distribution.

In [86]:
from scipy.stats import norm

xs = np.linspace(-4,4,1000)
ys = norm.pdf(xs, 0, 1)

fig,ax = pl.subplots()

ax.plot(xs, ys, color='k', lw=2)

for i,col in zip([3, 2.5, 2, 1], [.9, .8, .7, .6]):
    use = (xs>0) & (xs<i)
    ax.fill_between(xs[use], ys[use], color=pl.cm.Greys(col), lw=0)
    ax.annotate(f'{i} std', [np.max(xs[use])+.15,np.min(ys[use])+.01], color=pl.cm.Greys(col), ha='center')

To help illustrate the value of understanding this point, consider what happens when we choose to show our data in an arbitrary new unit of measurement.

In [90]:
fig,axs = pl.subplots(2,1,figsize=(5,7), sharex=True)

for ax,scale,unit in zip(axs, [1,0.035274], ['g','oz']):
    ax.hist(brain_masses*scale, color='k', bins=500, histtype='step')
    ax.set_xlabel(f'Brain mass ({unit})')
    ax.set_yticks([])
    ax.axvline(sample_brain_mass*scale, color='darkmagenta')

In [94]:
mean = np.mean(brain_masses)
std = np.std(brain_masses)

print(mean)
print(std)

relative_to_mean_in_grams = (brain_masses - mean)
relative_to_mean_in_std_units = relative_to_mean_in_grams / std

1292.0229673451206
151.17544116993284


In [102]:
fig,axs = pl.subplots(1,3)

axs[0].hist(brain_masses, bins=100)
axs[0].set_title('Original')
axs[1].hist(relative_to_mean_in_grams, bins=100)
axs[1].set_title('Mean subtracted')
axs[2].hist(relative_to_mean_in_std_units, bins=100)
axs[2].set_title('Mean subtracted and divided by std')


Text(0.5,1,'Mean subtracted and divided by std')

The data are identical, but if we didn't know that, it would not be immediately obvious how much of an "outlier" each one is, because the units are different.  

To get around this, we normalize our measurements by subtracting the mean and dividing by standard deviation to produce a __z-score__.  
<br>
<center>$z = \frac{\text{sample value - population mean}}{\text{population standard deviation}}$</center>

<br>
<center>$z = \frac{x-\mu}{\sigma}$</center>


where $x$ is our sample value, $\mu$ is the population mean, and $\sigma$ is the population standard deviation.
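
A quick sketch of why this helps: the z-score of a measurement does not depend on the unit it was recorded in. The function name `z_score` and the synthetic masses are assumptions made for this example; the grams-to-ounces factor is the one used in the plot above.

```python
import numpy as np

def z_score(x, mu, sigma):
    """Distance of x from mu, expressed in units of sigma."""
    return (x - mu) / sigma

# Synthetic stand-in for the brain mass dataset (assumed parameters)
rng = np.random.default_rng(3)
masses_g = rng.normal(1300, 150, size=10_000)
sample_g = 1564.2
g_to_oz = 0.035274

# z-score computed in grams
z_g = z_score(sample_g, masses_g.mean(), masses_g.std())

# the same computation after converting everything to ounces
masses_oz = masses_g * g_to_oz
z_oz = z_score(sample_g * g_to_oz, masses_oz.mean(), masses_oz.std())

print(f'z-score in grams:  {z_g:.6f}')
print(f'z-score in ounces: {z_oz:.6f}')
```

The conversion factor multiplies both the numerator and the denominator of the z-score, so it cancels.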

In [111]:
brain_masses

mean = brain_masses.mean()
std = brain_masses.std()

zscores = (brain_masses-mean)/(std)

zscores

array([-1.72636733, -1.05625047, -0.68423856, ...,  0.2183714 ,
        0.33616696,  1.89664881])

__Task__: compute the z-score for every brain in our dataset, and for the sample brain, and display them.

In [101]:
zscore_of_sample_brain = (sample_brain_mass - mean) / std

print(zscore_of_sample_brain)

1.8004050826544735


If the original values are normally distributed, the z-scored values follow the <u>*standard* normal distribution</u>, which has a mean of 0 and a standard deviation of 1. 

We can use this system to express in general terms (a) how much of an outlier a data point is (ex. "1.2 standard deviations from the mean"), and (b) how likely that data point was to have arisen from the population (this is related to the p-value we saw above).

We can use existing packages in Python to determine the probability of observing any given data point in a normal distribution. Here is an example:

In [108]:
from scipy.stats import norm # norm is the normal distribution

z_score = (sample_brain_mass-mean_human_brain_mass) / np.std(brain_masses, ddof=1)

p = norm.sf(z_score) # survival function: returns the fraction of the distribution that is greater than the input value 
# (and it uses the standard normal distribution by default)

print(z_score)
print(p*2)

1.8004049540540876
0.07179671915293835
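
To connect `norm.sf` back to the shaded plot of standard deviations, this short sketch tabulates how much of a standard normal distribution lies within, and beyond, 1, 2, and 3 standard deviations of the mean. The dictionary names are just for this example.

```python
from scipy.stats import norm

within = {}
beyond = {}
for k in (1, 2, 3):
    # cdf(k) - cdf(-k): fraction of the distribution between -k and +k
    within[k] = norm.cdf(k) - norm.cdf(-k)
    # 2 * sf(k): fraction in both tails combined, i.e. a two-tailed p-value for z = k
    beyond[k] = 2 * norm.sf(k)
    print(f'{k} std: {within[k]:.4f} within, {beyond[k]:.4f} beyond')
```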


<hr/>

### Putting it all together: conducting a hypothesis test

__Null hypothesis__: our sample brain comes from the population of human brains  
__Alternative hypothesis__: our sample brain comes from a separate population of brains  

__Compute z-score__  
__Compute p-value__  
__Compute "effect size"__ 

*What reasonable conclusions can we make from this analysis?*

*What does the p-value tell us? What does it not tell us?*

*One caveat*: This is also quite an odd scenario in reality, because we rarely have information about full *populations* and single data points to compare them to.
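
The steps listed above could be gathered into one function, sketched below. The function name is hypothetical, the population is synthetic (generated from the mean and standard deviation printed earlier in this notebook), and reporting the raw difference in grams as the "effect size" is one possible choice assumed here, not the only one.

```python
import numpy as np
from scipy.stats import norm

def single_value_z_test(value, population):
    """Compare one observation against a fully known population."""
    population = np.asarray(population, dtype=float)
    mu = population.mean()
    sigma = population.std()  # ddof=0: we treat the data as the whole population
    z = (value - mu) / sigma
    # two-tailed p-value: probability of a z-score at least this far from 0
    p_two_tailed = 2 * norm.sf(abs(z))
    # effect size reported here as the raw difference from the population mean
    return {'z': z, 'p_two_tailed': p_two_tailed, 'difference': value - mu}

# Synthetic stand-in built from the summary statistics printed earlier
rng = np.random.default_rng(6)
population = rng.normal(1292.02, 151.18, size=200_000)

result = single_value_z_test(1564.2, population)
for name, number in result.items():
    print(f'{name}: {number:.4f}')
```

Note that the p-value and the effect size answer different questions: the first describes how surprising the observation is under the null hypothesis, the second how large the difference is in meaningful units.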

---

## Part 2: dealing with samples of multiple observations

We often want to know about a *group of sampled points* as opposed to a single point.  
A common example would be to ask "does this sample come from a population with a mean of X?" 

This has a similar flavor to our first example question. It's a bit more complex, but the same logic applies.

Let's begin by loading in a similar dataset, but this time consisting of a set of sampled observations.  
Suppose each observation in the sample represents the mean firing rate of a single neuron.  

In [112]:
sample = np.load('sample.npy')

print(sample)

[ 19.08765187  68.65806486  33.56354929  32.49577301  30.48491205
   9.27890057  27.96165612  -3.17118488  19.24172174  35.57012565
   3.46882322  33.69680934  29.00681338  26.24718219  34.13053343
  82.60800516  72.86523956  63.62233261  53.87813344  -2.98364416
   2.923537    52.54128835   9.43490808  61.32799195   4.09816448
  88.87137507  36.40587655  46.21497085 -19.43758789  20.50959796
 -44.17382065  22.16441264  -1.10324205 -13.17386638  42.25494227
  27.17121254  94.43270511 -41.27862449  35.76603664  42.72439828
 -17.77928443 -58.33710308  42.46037535  51.20356776  65.45432975
  37.14975646   0.503567    54.79503172  34.73825892 -24.52529967
  13.70100255  14.9085       4.30102928 -26.74032759  63.57866916
 100.64613485  53.43042129  59.48199558 -22.85693732  13.48268966
 -27.91227646  -0.54621774  56.90457851  24.52251321  58.36478292
  49.79139193  38.61330352  36.20647124  -0.23270032  55.51730976
   4.61814591  74.60904414   6.19978153  24.11885351 -28.61337162
  -1.32663

In [116]:
fig,ax = pl.subplots()

ax.hist(sample)
ax.set_xlabel('Mean relative firing rate (Hz)')
ax.set_ylabel('Number of neurons')

Text(0,0.5,'Number of neurons')

We suspect that these neurons come from a neuronal population that has a mean firing rate of 30 Hz.

Null hypothesis: the sample comes from a population with a mean of 30.  
Alternative hypothesis: the sample comes from a population with a mean not equal to 30.

__Task__: inspect these data and speculate on how you might answer this question.

In [118]:
# inspect here
print(len(sample))

print(np.mean(sample))

120
27.030809250655714


*A naïve idea:* compute the sample mean, and if it differs from 30, reject the null hypothesis.

This and many other simple approaches don't work because we don't know __what to expect by chance when sampling a set of points from a population__. In other words, we haven't specified our *null distribution* (i.e. the one we conveniently had available in the previous example).

In [121]:
# visual demonstration

fig,ax = pl.subplots()

xs = np.arange(15,45,.01)
ys = norm.pdf(xs, 30, 5)
ax.plot(xs, ys, color='black', label='Population')
ax.axvline(30, lw=4, color='k')
ax.legend(loc='upper left')
pl.waitforbuttonpress()
ax2 = ax.twinx()

for i,col in enumerate(['maroon','darkorange','forestgreen','steelblue','violet','grey']):
    sample = np.random.normal(30, 5, size=20)
    ax2.hist(sample, bins=8, density=True, histtype='step', color=col, alpha=.5, label=f'Sample {i}')
    ax2.axvline(sample.mean(), lw=2, color=col)
    ax2.legend(loc='upper right')
    pl.waitforbuttonpress()

To figure this out, we could look into probability or statistics theory, but we can also build some intuition ourselves first.

Let's create a "known" population and sample from it many times, then analyze what we observe.

This will allow us to build some intuition about what happens when we sample, and what to expect about a population based on a single sample that came from it.

In [122]:
mean,std = 30,10

fig,ax = pl.subplots()

population = np.random.normal(mean,std,size=100000)
# if we could somehow infer this population's exact properties from just a single sample,
# we could run our original test; that is what a one-sample t-test ends up doing
ax.hist(population, color='k', histtype='step', bins=200, density=True)
xs = np.arange(20,40,.01)
ys = norm.pdf(xs,mean,std)
ax.plot(xs,ys,color='k')
ax.set_title(f'Our "population": mean={mean}, std={std}')

Text(0.5,1,'Our "population": mean=30, std=10')

We want to learn what we can expect when sampling from this population.

Note: `np.random.choice` samples *with replacement* by default, meaning the same member of the population can be drawn more than once.

__Task:__ from our "population", collect 10000 random samples each containing 50 observations.

In [123]:
N = 10000 # number of times to sample the population
n = 50 # number of observations to take in each sample
samples = np.array([np.random.choice(population, size=n) for i in range(N)])
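
To make the idea of replacement concrete, here is a minimal sketch on a tiny made-up population of ten values. It uses NumPy's `Generator.choice`, whose `replace` argument behaves like the one in `np.random.choice`:

```python
import numpy as np

rng = np.random.default_rng(4)
tiny_population = np.arange(10)

# With replacement: each draw is made from the full population, so values can repeat
with_replacement = rng.choice(tiny_population, size=10, replace=True)

# Without replacement: each member can be drawn at most once
without_replacement = rng.choice(tiny_population, size=10, replace=False)

print('with replacement:   ', with_replacement)
print('without replacement:', without_replacement)
```

When the population is much larger than the sample, as with 100000 values and samples of 50, repeated draws are rare and the two schemes give nearly the same results.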

Lesson #1 of sampling: how do the means of our samples relate to the mean of the population?

In [129]:
fig,axs = pl.subplots(2,1,sharex=True,gridspec_kw=dict(hspace=.4))
cols = ['red','orange','gold','green','blue','violet','grey','steelblue','pink']

pl.waitforbuttonpress()

for si,(sample,col) in enumerate(zip(samples,cols)):
    axs[0].hist(sample, histtype='step', color=col)
    axs[0].axvline(sample.mean(), color=col)
    axs[1].axvline(sample.mean(), color=col, alpha=.5)
    axs[0].set_xlabel('Observed value')
    axs[0].set_ylabel('Frequency')
    axs[0].set_title('Each histogram is a sample of observations\nWe get one of these when we do an experiment in real life')
    axs[1].set_xlabel('Mean of observations in a given sample')
    pl.waitforbuttonpress()
axs[1].hist(samples.mean(axis=1), bins=50, histtype='step', color='k', lw=3)
axs[1].set_ylabel('Frequency')


So now we're beginning to get a sense of how much we can expect our sample mean to reflect the true mean.

This distribution is our "null distribution."

If we could characterize the center and shape of our null distribution (i.e. its mean and standard deviation), we could compute the probability of observing a single sample from it, as we did in the simpler case before.
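
A non-interactive version of Lesson #1 can be sketched numerically. It rebuilds a population and samples with the same settings as above, using a seeded generator so the numbers are reproducible (the seed and the vectorized sampling call are choices made for this example):

```python
import numpy as np

rng = np.random.default_rng(5)
population = rng.normal(30, 10, size=100_000)

N = 10_000  # number of samples
n = 50      # observations per sample

# One row per sample; the row means form the distribution of sample means
samples = rng.choice(population, size=(N, n))
sample_means = samples.mean(axis=1)

print(f'population mean:         {population.mean():.3f}')
print(f'mean of sample means:    {sample_means.mean():.3f}')
print(f'population std:          {population.std():.3f}')
print(f'std of the sample means: {sample_means.std():.3f}')
```

The sample means center on the population mean, but they are much less spread out than the individual observations.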

Lesson #2 of sampling: how does the size `n` of our samples relate to our estimate of the population mean?

In [None]:
#task

for n in range(10,200,50):
    samples = np.array([np.random.choice(population, size=n) for i in range(N)])
    axs[1].hist(samples.mean(axis=1), bins=50, histtype='step', label=f'n={n}', lw=3, color=pl.cm.Greys(n/200+.1))
axs[1].legend(title='n: # of observations in a sample', fontsize='x-small')

# normalize those: express each sample mean as its distance from the population mean,
# in units of (sample std / sqrt(n))
fig,ax = pl.subplots()
for n in range(10,200,50):
    samples = np.array([np.random.choice(population, size=n) for i in range(N)])
    z = (samples.mean(axis=1)-population.mean())/(samples.std(axis=1, ddof=1)/np.sqrt(n))
    ax.hist(z, density=True, bins=50, histtype='step', label=f'n={n}', lw=3, color=pl.cm.Greys(n/200+.3))
pl.legend()
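
The same lesson can be read off as numbers rather than histograms. This sketch (with sample sizes of 10, 40, and 160 picked for illustration) measures the spread of the sample means at each size:

```python
import numpy as np

rng = np.random.default_rng(7)
population = rng.normal(30, 10, size=100_000)

spreads = {}
for n in (10, 40, 160):
    # 5000 samples of size n; the std of their means measures how far
    # a typical sample mean lands from the population mean
    means = rng.choice(population, size=(5_000, n)).mean(axis=1)
    spreads[n] = means.std()
    print(f'n={n:>3}: std of sample means = {spreads[n]:.3f}')
```

Each fourfold increase in `n` roughly halves the spread, so larger samples pin down the population mean more tightly, with diminishing returns.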

Lesson #3 of sampling: How can we use one sample to infer our null distribution?

It turns out that our null distribution of sample means can be inferred from:  
* (1) our sample mean
* (2) our sample standard deviation ($s$)
* (3) our sample size ($n$)

Specifically, our null distribution is estimated as a normal distribution with the same mean as our sample, and a standard deviation equal to $\frac{s}{\sqrt n}$.  
This latter fraction is called the __standard error of the mean__, or SEM.
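
As a rough check of that claim, the sketch below compares the SEM estimated from a single sample with the spread of means actually observed across many samples. The simulated population and seed are assumptions for illustration; `scipy.stats.sem` is used only to confirm the manual formula.

```python
import numpy as np
from scipy.stats import sem

rng = np.random.default_rng(8)
population = rng.normal(30, 10, size=100_000)
n = 50

# What one experiment gives us: a single sample and its SEM estimate
one_sample = rng.choice(population, size=n)
sem_estimate = one_sample.std(ddof=1) / np.sqrt(n)

# What we normally cannot see: the spread of means over many repeated samples
many_means = rng.choice(population, size=(10_000, n)).mean(axis=1)
empirical_spread = many_means.std()

print(f'SEM from one sample:       {sem_estimate:.3f}')
print(f'scipy.stats.sem:           {sem(one_sample):.3f}')
print(f'std of 10000 sample means: {empirical_spread:.3f}')
```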

Why do you suppose it's called the standard error of the mean?
<hr/>

Now we can ask our original question, rephrasing it as: __How much of an outlier is the *difference between our sample mean and the hypothesized mean of 30*?__  
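
One way to write that standardized difference explicitly (the symbols $\bar{x}$ for the sample mean and $\mu_0$ for the hypothesized mean are notation introduced here for convenience):

<br>
<center>$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$</center>

Here $s$ and $n$ are the sample standard deviation and sample size listed above, so the denominator is the SEM. In our case $\mu_0 = 30$.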

And we are now able to precisely specify how we expect this *difference* to be distributed:  

In [134]:
from scipy.stats import t

fig,ax = pl.subplots()

for n,col in zip([5,10,1000],['r','g','b']):
    samples = np.array([np.random.choice(population, size=n) for i in range(N)])
    normed = (samples.mean(axis=1)-mean) / (samples.std(axis=1,ddof=1)/np.sqrt(n))
    ax.hist(normed, bins=75, histtype='step', label=f'n={n}', density=True, lw=1, color=col)
    
    # with density true above:
    xs = np.arange(-4,4,.01)
    ys = t.pdf(xs, n-1)
    ax.plot(xs,ys, color=col)
ax.legend(fontsize='x-small')
ax.set_xlim([-5,5])
ax.set_title('The t-distribution')

Text(0.5,1,'The t-distribution')

The single parameter of the t-distribution is the "degrees of freedom," corresponding to sample size - 1.

(The smooth curves in the plot above are the theoretical t-distributions, overlaid on the simulated histograms.)
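
To see what the degrees of freedom change in practice, this sketch compares the two-tailed probability of landing beyond ±2 under t-distributions for the sample sizes plotted above and under the standard normal. The cutoff of 2 is an arbitrary choice for illustration.

```python
from scipy.stats import norm, t

cutoff = 2.0

# Two-tailed probability beyond the cutoff under the standard normal
normal_tail = 2 * norm.sf(cutoff)

# The same probability under t-distributions with n - 1 degrees of freedom
t_tails = {n: 2 * t.sf(cutoff, n - 1) for n in (5, 10, 1000)}

for n, tail in t_tails.items():
    print(f'n={n:>4} (df={n - 1:>3}): P(|t| > {cutoff}) = {tail:.4f}')
print(f'standard normal:   P(|z| > {cutoff}) = {normal_tail:.4f}')
```

Small samples give heavier tails, so the same standardized difference is less surprising; as `n` grows, the t-distribution approaches the standard normal.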

<hr/>

__Task__: load in the sample from above and compute a manual t-test to determine whether or not it comes from a population with a mean of 30.

*Hint: `scipy.stats` provides a distribution object called `t` for the t-distribution. Consider using its `cdf()` or `sf()` methods.*

**Bonus:** Complete this without using the `scipy.stats` functions but instead by manually creating the relevant t-distribution.

In [138]:
# 1. State our hypotheses

sample = np.load('sample.npy')

# 2. Calculate our test statistic (in this case: t-statistic)
n = len(sample)
sem = np.std(sample) / n**0.5 # note: np.std defaults to ddof=0; the next cell uses the sample std (ddof=1)
t_statistic = (30 - sample.mean()) / (sem)

# 3. Find a p-value
from scipy.stats import t
p = t.sf(t_statistic, n-1)

print(p) # one-tailed p-value

two_tailed_p = p*2
print(two_tailed_p)

0.15390134999644162
0.30780269999288323


In [154]:
# 1. State our hypotheses

sample = np.load('sample.npy')

# 2. Calculate our test statistic (in this case: t-statistic)
# "critical value: the value of the test statistic past which the test would be significant"
n = len(sample)
sem = np.std(sample, ddof=1) / n**0.5
t_statistic = np.abs(sample.mean() - 30) / (sem)

# 3. Find a p-value
from scipy.stats import t
p = t.sf(t_statistic, n-1)

#print(p) # one-tailed p-value

two_tailed_p = p*2
print(f'Our manual t statistic: {t_statistic:0.3f}')
print(f'Our manual p value: {two_tailed_p:0.7f}')

Our manual t statistic: 1.020
Our manual p value: 0.3098181


In [152]:
from scipy.stats import ttest_1samp

tstat,pvalue = ttest_1samp(a=sample, popmean=30)

print(f'scipy t statistic: {tstat:0.3f}')
print(f'scipy p value: {pvalue:0.7f}')

scipy t statistic: -1.020
scipy p value: 0.3098181
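
For the bonus task, one possible approach is sketched below: write out the t-distribution's density from its textbook formula and integrate the upper tail numerically. The function names, the integration limit of 60, and the grid size are all choices made for this sketch; the limit is ample for 119 degrees of freedom but would truncate too much of the tail for very small samples. The t statistic is the rounded value printed above, and `scipy` is used only to check the result.

```python
import math
import numpy as np
from scipy.stats import t

def t_pdf(x, df):
    """Density of the t-distribution with `df` degrees of freedom."""
    # log of the normalizing constant Gamma((df+1)/2) / (sqrt(df*pi) * Gamma(df/2))
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return np.exp(log_norm - (df + 1) / 2 * np.log1p(x**2 / df))

def upper_tail(t_stat, df, upper=60.0, steps=200_001):
    """Approximate P(T > t_stat) with the trapezoid rule on [t_stat, upper]."""
    xs = np.linspace(t_stat, upper, steps)
    ys = t_pdf(xs, df)
    return float(np.sum((ys[1:] + ys[:-1]) / 2 * np.diff(xs)))

df = 119         # sample size of 120, minus 1
t_stat = 1.020   # rounded manual t statistic from above

manual_p = 2 * upper_tail(t_stat, df)
scipy_p = 2 * t.sf(t_stat, df)

print(f'manual two-tailed p: {manual_p:.6f}')
print(f'scipy two-tailed p:  {scipy_p:.6f}')
```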


In [155]:
ttest_1samp?
