# Parameter estimation: t-tests and p-values

One of the most basic data analysis problems is to estimate parameters of a **population**, such as the mean, based on our dataset, which is usually a small **sample** of the population. 

The characteristics of our sample, such as the mean and standard deviation, are (usually) easy to calculate. But how does our sample relate to the broader population that we want to know about? 

### A one-sample t-test

Imagine you work at a factory where the mean size of a product needs to be 10 cm. For quality control, a sample of 1000 items was measured.

In [79]:
#generate a pretend sample 
import numpy
numpy.random.seed(2345)

sample = numpy.random.normal(9.9, 1.0, 1000)
print(sample[0:10])

[  8.94870125  11.6687718    8.75817307  10.61075487  10.41095064
  11.04902943   9.36153977   9.16335491   9.82400352  10.74881782]


In [80]:
#what's the mean?
print(numpy.mean(sample))

9.93712322189


So the mean size of items in the sample is somewhat less than 10 cm. How do we know if this means that the true mean size of the product is less than 10 cm?

The classic statistical approach to this problem is a "one-sample t-test." This test basically takes a measure of the distance between the sample mean and the hypothesized mean of the population, and uses statistical distributions as the basis for figuring out how likely it is that our sample came from a population with that mean.

To do this, we identify two hypotheses:

**$H_0$, the "null" hypothesis:** The mean of the population that the sample came from is NOT DIFFERENT from the expected value.

**$H_1$, the "alternative" hypothesis:** The mean of the population that the sample came from is DIFFERENT from the expected value.

*Note:* We could ask whether the mean is smaller or larger instead of just different (that's the difference between a one- and two-sided t-test). For now we'll just focus on the two-sided test.

In classical statistical testing, we are usually trying to see if there is sufficient evidence *against* the null hypothesis to *reject* it in favour of the alternative. We'll take up the question of *how* different the sample mean needs to be from the expected value before we call it different in a few moments...

### Calculating the test statistic

The basic equation for a one-sample t-test is:

$t=\frac{\bar{x}-\mu}{\sigma/\sqrt{n}}$

where:

- $\bar{x}$ is the sample mean

- $\mu$ is the mean value for the population (in our case the expected population mean)

- $\sigma$ is the standard deviation of the sample (often written $s$ to distinguish it from the population standard deviation)

- $n$ is the size of the sample

So in our example data, the value of $t$ is: 

In [81]:
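# t = (sample mean - hypothesized mean) / (standard deviation / sqrt(n))
# note: numpy.std defaults to ddof=0; ddof=1 gives the sample standard deviation (negligible difference at n = 1000)
tstat = (numpy.mean(sample) - 10) / (numpy.std(sample) / numpy.sqrt(1000))
print("t-statistic = %.3f" % tstat)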

t-statistic = -2.035


A couple of important notes before we move on to interpreting what this test statistic means.

First, the $t$-test has a few key assumptions:

- the observations in the sample are independent (often taken to mean that they should have been randomly selected from the population)
- the population from which the sample was drawn should be approximately normally distributed or, more precisely, the sampling distribution of the sample mean should be approximately normal

Further, $t$ differs from a $Z$-score because we scale by the sample standard deviation, adjusted by the size of the sample, instead of the population standard deviation (if you knew the population standard deviation, you should do a $Z$-test!). Because using the sample standard deviation as an approximation adds uncertainty, we interpret the $t$ test statistic with a sampling distribution that is broader than the standard normal distribution. As the sample size $n$ increases, the sample standard deviation becomes a more reliable estimate and the $t$-distribution approaches the standard normal. What this means is that the sample size enters the calculation of the significance of the test (through the degrees of freedom, $n-1$), which we'll get to next.
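To make the comparison between the two distributions concrete, here is a minimal sketch that prints the two-sided 5% critical values for a few degrees of freedom. The particular degrees of freedom (5, 30 and 999) are chosen only for illustration.

```python
from scipy.stats import norm, t

# two-sided 5% critical value for the standard normal distribution
z_crit = norm.ppf(0.975)

# the same critical value for t-distributions with increasing degrees of freedom
t_crits = {df: t.ppf(0.975, df) for df in (5, 30, 999)}

print("normal:      %.3f" % z_crit)
for df in sorted(t_crits):
    print("t, df = %3d: %.3f" % (df, t_crits[df]))
```

The critical value shrinks toward the normal one as the degrees of freedom grow, so with a sample of 1000 the $t$-test and the $Z$-test give nearly the same answer.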

### Interpreting the test statistic: calculating $p$

The $t$ statistic calculated earlier measures how far the observed sample mean is from our hypothetical population mean, in units of standard error. To decide whether or not this distance is large enough that we should reject the null hypothesis, we'll use the concept of a "p-value."

The **p-value** is defined as the probability of obtaining a result equal to or "more extreme" than what was actually observed, when the null hypothesis is true.

For our test case:

In [85]:
from scipy.stats import t

df = len(sample) - 1  # degrees of freedom: n - 1

# two-sided p-value = Prob(|T| >= |tstat|) under the null
pval = t.sf(numpy.abs(tstat), df) * 2
print('p-value = %.4f' % pval)

p-value = 0.0421


In [None]:
#should put a plot illustrating the result here...

Thus, if the null hypothesis is true and the true mean of the population is 10 cm, the probability of observing a sample mean at least as far from 10 cm as ours is about 4%. So, either the true mean of the population that generated this sample is NOT 10 cm, or we simply selected a pretty unusual sample (hence the importance of independence in our sample!).
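As a cross-check, the manual calculation can be compared with `scipy.stats.ttest_1samp`, which runs a two-sided one-sample t-test by default. This sketch regenerates the same pretend sample so that it runs on its own, and it uses `ddof=1` for the standard deviation, so the statistic may differ very slightly from the value reported earlier.

```python
import numpy
from scipy import stats

# regenerate the pretend sample used in this notebook
numpy.random.seed(2345)
sample = numpy.random.normal(9.9, 1.0, 1000)

# manual calculation with the sample standard deviation (ddof=1)
n = len(sample)
t_manual = (numpy.mean(sample) - 10) / (numpy.std(sample, ddof=1) / numpy.sqrt(n))
p_manual = stats.t.sf(numpy.abs(t_manual), n - 1) * 2

# library calculation: two-sided one-sample t-test against a mean of 10
result = stats.ttest_1samp(sample, 10)
t_scipy, p_scipy = result[0], result[1]

print("manual: t = %.3f, p = %.4f" % (t_manual, p_manual))
print("scipy:  t = %.3f, p = %.4f" % (t_scipy, p_scipy))
```

The two lines should agree, which confirms that the formula and the library function describe the same test.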

### Interpreting the test statistic: determining significance

Now, finally, we need to decide whether this outcome is sufficiently unlikely that we should reject the null. To do this, we need a couple more concepts.

First, think about the four possible outcomes of a test:

|   | **Prediction: True** | **Prediction: False** |
|---|---|---|
| **Reality: True** | True positive (TP) | False negative (FN) |
| **Reality: False** | False positive (FP) | True negative (TN) |

In statistical hypothesis testing, we refer to the "bad outcomes" (a test that leads to an FN or FP outcome) as two different types of error:

**Type I error:** FP, or rejecting the null when it is true

**Type II error:** FN, or failing to reject the null when it is false

We refer to the probability of a Type I error that we're willing to accept as $\alpha$ (the "significance level").  By convention, this is often set at 0.05 or 0.01. This is the probability of making the wrong decision *when the null hypothesis is true.*

The null hypothesis we specified earlier was that the population mean was 10 cm. We have found that, if the null is true, the probability of observing a sample at least as extreme as ours is about 0.04.

So, in our example case, if we take $\alpha$ = 0.05, since our $p$-value (0.04) is less than $\alpha$, we could reject the null. It seems unlikely that our sample came from a population with a true mean of 10 cm. But you should choose your significance level BEFORE conducting your test!

*Important:* $p$-values are related to the concept of Type I error but are not the actual Type I error rate of a model! Further, they SAY NOTHING about Type II error (power, or "sensitivity," analysis is required to examine the probability of Type II error). Beware!
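The two error rates can be illustrated with a small simulation. This is a sketch under assumed settings: the number of repeated experiments (2000), the random seed, and the helper `rejection_rate` are all hypothetical choices made for illustration, and the population standard deviation of 1.0 matches the pretend sample above.

```python
import numpy
from scipy import stats

rng = numpy.random.RandomState(0)   # arbitrary seed, for reproducibility
alpha = 0.05                        # significance level chosen before testing
n_experiments = 2000                # number of simulated samples (illustrative)
n = 1000                            # items per sample

def rejection_rate(true_mean):
    """Fraction of simulated samples whose two-sided p-value is below alpha."""
    samples = rng.normal(true_mean, 1.0, size=(n_experiments, n))
    pvals = stats.ttest_1samp(samples, 10, axis=1)[1]
    return numpy.mean(pvals < alpha)

# null is true (mean really is 10): every rejection is a false positive
type1_rate = rejection_rate(10.0)

# null is false (mean is 9.9): every rejection is a true positive
power = rejection_rate(9.9)

print("Type I error rate:  %.3f" % type1_rate)
print("power:              %.3f" % power)
print("Type II error rate: %.3f" % (1 - power))
```

When the null is true, the rejection rate should land close to $\alpha$. The Type II error rate, by contrast, depends on how far the true mean is from 10 cm and on the sample size, which is why it needs a separate power analysis.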

### Conclusions

Some key things to remember about $p$-values:

- a low $p$-value simply implies that either (a) the null is true and you've got a very unlikely sample, or (b) the null is false

- they say nothing about the relative probabilities of the hypotheses considered; in fact, the test says nothing about the probability of the alternative being true.

- statistical significance is not the same as practical significance!

This hypothesis testing approach is the basis of most classic statistical tests. This approach is often referred to as "frequentist" statistics, having to do with the idea of how "frequent" the observed outcome would be if you repeated the "experiment" many times. If this approach seems unsatisfying to you---perhaps because the choice of significance level seems arbitrary or because you think we should actually figure out how likely the alternative hypothesis is---you are not alone. We'll discuss an alternative approach in the next notebook (Parameter_estimation_bayesian).

### Understanding model performance

Type I and Type II error are closely related to measures used to assess the performance of classification models:

**Precision:** Proportion of predicted positives that are truly positive. $\frac{TP}{TP+FP}$

**Recall:** Proportion of actual positives that are correctly predicted. $\frac{TP}{TP+FN}$

Thus, Type I error (a test being a false positive) lowers precision, while Type II error (a test being a false negative) lowers recall. In fact, the false negative rate, $\frac{FN}{TP+FN}$, is exactly $1 - \text{recall}$.

*NB:* Sensitivity = recall
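A short sketch of these measures, using hypothetical confusion-matrix counts that were made up purely for illustration:

```python
# hypothetical counts, chosen only for illustration
TP, FN, FP, TN = 40, 10, 5, 45

precision = TP / float(TP + FP)              # predicted positives that are real
recall = TP / float(TP + FN)                 # actual positives that were found
false_negative_rate = FN / float(TP + FN)    # Type II analogue, equals 1 - recall
false_positive_rate = FP / float(FP + TN)    # Type I analogue

print("precision = %.3f" % precision)
print("recall    = %.3f" % recall)
print("false negative rate = %.3f" % false_negative_rate)
print("false positive rate = %.3f" % false_positive_rate)
```

Note that the false positive rate is computed over the actual negatives, whereas precision is computed over the predicted positives, so the two are related but not interchangeable.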