# 2. Hypothesis Tests

So far we have only dealt with descriptive statistics, which give us a good sense of the kind of data we are handling. However, if we have a sample or operational dataset that was chosen correctly and is large enough, we can also do some inferential statistics.

Essentially, we infer onto the population from which the sample was drawn the characteristics we have seen in the sample.

### But first, some vocabulary:

##### POPULATION - aka REFERENCE GROUP, TARGET GROUP or STUDY GROUP is the entire group of interest.  

If I am interested in obesity in children ages 5 - 12 in NYS,  then the population is all 5 - 12 year olds living in NYS. 

##### RANDOM SAMPLE - a selection of members of a population such that each member is chosen independently and has a non-zero probability of being chosen.  

Random digit dialing is not a random sample of the whole population, only of the population with a land line.  Those without a land line have a 0 chance of being chosen.  

##### SIMPLE RANDOM SAMPLE - a random sample in which each member has an equal probability of being selected.

If a family has more than one land line, then that family has a greater chance of being chosen than a family with only one land line.  So it may be a random sample of land line owners, but not a simple random sample.

There are other types of samples, e.g. cluster, stratified, block etc. 

##### BLINDING - a device used in clinical trials to minimize bias.

Double blind means that neither the administering individual nor the receiving unit is aware of which treatment is being given.  Single blind means that only the receiving unit is unaware of the treatment.

##### PARAMETER - a numerical descriptive measure of a population.

Because it is based on all the observations in the population, its value is almost always unknown.  Querying the whole population is usually not possible because it is too big.

##### ESTIMATOR - a rule or function of the sample (a statistic) that is used to estimate a population parameter:

	POINT ESTIMATOR - calculates a single number - mean, median, variance
	INTERVAL ESTIMATOR - calculates an interval - confidence interval

![title](parameter.jpg)

We want to choose estimators of the population that are unbiased and have minimum variance.  That is, the expected value of the estimator is the value of the parameter and all such values vary tightly around the parameter.
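A small simulation can make unbiasedness concrete. The sketch below is only an illustration: the normal population with variance 4 and the sample size of 5 are assumptions chosen for the demo. It compares the usual sample variance (divisor n-1) with the version that divides by n.

```r
set.seed(123)
sigma2 <- 4   # assumed true population variance
n <- 5        # assumed sample size

# For each simulated sample, compute both variance estimators
sims <- replicate(20000, {
  x <- rnorm(n, mean = 0, sd = sqrt(sigma2))
  c(unbiased = var(x),                  # divisor n - 1
    biased   = var(x) * (n - 1) / n)    # divisor n
})

# Average of each estimator over all simulated samples
est <- rowMeans(sims)
est
```

The average of the n-1 version should sit close to the true variance, while the n version should fall systematically below it, near $\sigma^2 (n-1)/n$.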

# SAMPLING DISTRIBUTION 
of a sample statistic calculated from a sample of n measurements is the probability distribution (PDF) of the statistic.

The above definition is a critical one in understanding inferential statistics.  


For the time being, we will be considering only those variables which are quantitative, continuous.  Thus the estimators of interest are  $\bar{x}$ for μ and $s^2$ for $\sigma^2$.   First we will proceed to learn about point and interval estimators for the mean, then later the variance.

Say I have a population of 50 people and I take a simple random sample (SRS) of 10 people.  There are $\binom{50}{10}$ = 10,272,278,170 such samples I can get, each with a different set of people in it.

If I were to take the mean age ($\bar{x}$) of my sample of ten, I will get a number.  

Now choose another one of the $1.03 \times 10^{10}$ samples and find the mean age ($\bar{x}$).  It will usually be a different number.  In fact, across all the possible $1.03 \times 10^{10}$ samples, the $\bar{x}$'s will vary slightly from sample to sample.  Thus, the $\bar{x}$'s of all the possible samples (of size 10 in this example) have a distribution (i.e. I can create a histogram of all the $\bar{x}$'s), and each value occurs with a specific probability.  This also means that, for all these $\bar{x}$'s, we can calculate a mean and a variance.
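The idea can be sketched in R. The population of 50 ages below is made up for illustration, and only 5,000 of the possible samples are drawn, since enumerating all of them is impractical.

```r
set.seed(1)

# Hypothetical population of 50 ages (invented values, for illustration only)
population <- round(runif(50, min = 18, max = 80))

# Number of distinct simple random samples of size 10
n_samples_possible <- choose(50, 10)

# Draw 5000 simple random samples (without replacement) and keep each mean
xbars <- replicate(5000, mean(sample(population, size = 10)))

mean(population)   # the population mean, mu
mean(xbars)        # mean of the sample means, close to mu
var(xbars)         # variance of the sample means
```

Calling `hist(xbars)` afterwards draws the histogram of the sample means described above.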


#### Knowledge of this distribution will allow us to make inferences from the sample onto the population.

The first important piece of information about the distribution of $\bar{x}$ is:

IF THE UNDERLYING DISTRIBUTION OF X ~ N($\mu$, $\sigma^2$), then  $\bar{x}$ ~ N($\mu$, $\sigma^2_{\bar{x}}$),

where  $\sigma^2_{\bar{x}} = \sigma^2/n$.  Its square root, $\sigma_{\bar{x}} = \sigma/\sqrt{n}$, is called the standard error of the mean (SEM).

Note that as the sample size n gets larger, the SEM gets smaller!  
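As a quick check of how the standard error of the mean, $\sigma/\sqrt{n}$, shrinks with the sample size, here is a minimal sketch. The population standard deviation of 10 and the sample sizes are assumed values for illustration.

```r
sigma <- 10                        # assumed population standard deviation
n_values <- c(4, 25, 100, 400)     # assumed sample sizes

# Standard error of the mean for each sample size
sem <- sigma / sqrt(n_values)
data.frame(n = n_values, sem = sem)
```

Quadrupling the sample size halves the SEM, because n enters through its square root.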

So, if X ~ N($\mu$ ,$\sigma^2$) I can find the probability of getting the $\bar{x}$  I obtained from my one sample:

#### Example:

Suppose we have a quantitative, continuous variable X ~ N(100,100).  

I take an SRS of size n = 25 and want to know the probability that the sample mean, not an individual value of X, falls between 99 and 102.    

The statement above tells me that the $\bar{x}$'s of all samples of size 25 taken from this population are distributed as N(100, 100/25), i.e. N(100, 4) (note how much smaller the variance of the means is than the variance of X):

$P(99 < \bar{x} < 102) = P(\frac {99-100}{\sqrt4} < \frac {\bar{x}-\mu}{\sigma_{\bar{x}}} < \frac {102-100}{\sqrt4}) = P(-0.5 < Z < 1.00)$

= 0.8413 - 0.3085

= 0.5328
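The same probability can be obtained without a table. This is a minimal sketch using `pnorm`, with the values taken from the example above.

```r
mu <- 100       # population mean
sigma2 <- 100   # population variance
n <- 25         # sample size

sem <- sqrt(sigma2 / n)   # standard error of the mean

# P(99 < xbar < 102) under xbar ~ N(mu, sigma2 / n)
p <- pnorm(102, mean = mu, sd = sem) - pnorm(99, mean = mu, sd = sem)
round(p, 4)
```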

## To recap, $\bar{x}$  is an estimator of $\mu$; in fact, it is a point estimator.  Now we will learn how to create and interpret an interval estimator.

First, another definition:

### DEGREES OF FREEDOM (DF) of a set of observations is the number of values that could be assigned arbitrarily within the specifications of the system.  In other words, the number of values that are free to vary in the calculation of a statistic, in this case the mean.  

This value varies depending on the type of distribution we are sampling from.

Normally, it is inferred from the number of groups that are free to vary.



## What to do if $\sigma^2$ is unknown?  For that we need a sampling distribution, a distribution of the sample means where $\sigma^2$ is unknown.

## The t Distribution (or Student's t Distribution)

The t distribution is a family of distributions whose shape depends on one parameter, the degrees of freedom (DF). The t distribution is similar to the normal distribution, symmetric and mound shaped, but it has thicker tails to allow for more uncertainty.

![title](t_dist.png)

As the df increase, the t distribution approaches the standard normal distribution: $t_d \to N(0,1)$.
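One way to see this convergence is to compare upper 97.5% quantiles. The sketch below uses an arbitrary, assumed set of degrees of freedom.

```r
dfs <- c(2, 5, 15, 30, 100, 1000)   # assumed degrees of freedom to compare

t_crit <- qt(0.975, df = dfs)   # 97.5th percentile of each t distribution
z_crit <- qnorm(0.975)          # 97.5th percentile of the standard normal

data.frame(df = dfs, t_crit = round(t_crit, 3), z_crit = round(z_crit, 3))
```

The t quantiles are always larger than the normal quantile (thicker tails) and shrink toward it as the df grow.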

THE 100(U)th PERCENTILE of a t distribution with d degrees of freedom is denoted $t_{d,u}$, thus P($t_d$ < $t_{d,u}$) = u.  

The u is the cumulative probability and d the degrees of freedom.

So, now we need to use the t distribution table to find the probabilities (area under the curve).  

As we noted above, the t is a family of distributions, each of which is defined by its df.  Thus, using the table for the t distribution, $t_{29,0.80} = 0.854$, $t_{15, 0.975}$ = 2.131, etc.  



In [2]:
qt(0.8,29)      # 80th percentile of t with 29 df
qt(0.975,15)    # 97.5th percentile of t with 15 df

Like the standard normal Z distribution, the t distribution is also symmetrical, centered around 0 - we need only the upper half of the t values.  For example: $t_{15, 0.975}$ = 2.131 so $t_{15, 0.025}$ = -2.131.

Putting these two together we can estimate μ, not using one number, but an interval within which we think the true μ will lie.

### INTERVAL ESTIMATION - UNKNOWN VARIANCE, since in reality, we do not know the variance.

If x1, x2, … xn ~ N($\mu,\sigma^2$) and are independent, then ($\bar{x}-\mu$)/(s/$\sqrt{n}$) is distributed as a t distribution with (n-1) degrees of freedom (df).
 
$$t_{n-1} = \frac {\bar{x} - \mu}{s/\sqrt{n}}$$

A 100(1-$\alpha$)% CONFIDENCE INTERVAL (CI) for $\mu$  is defined by the interval:

$(\bar{x} - t_{n-1,1-\alpha/2}\,s/\sqrt{n},\ \bar{x} + t_{n-1,1-\alpha/2}\,s/\sqrt{n})$

Interval estimates of $\mu$: twenty-five samples, 95% CI
![title](CI.png)

That is, over all possible 95% confidence intervals that can be constructed about the  $\bar{x}$'s calculated from all the possible random samples of size n, 95% will contain the parameter $\mu$.  Our confidence is in the process by which we constructed the interval, not in the values themselves.
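This interpretation can be checked by simulation. The sketch below assumes a normal population with $\mu$ = 50 and $\sigma$ = 10 (values invented for the demo), draws many samples of size 25, builds a 95% t interval from each, and counts how often the interval contains $\mu$.

```r
set.seed(42)
mu <- 50      # assumed true population mean
sigma <- 10   # assumed true population standard deviation
n <- 25       # sample size

# For each simulated sample, does the 95% t interval contain mu?
covers <- replicate(10000, {
  x <- rnorm(n, mean = mu, sd = sigma)
  half_width <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)
  (mean(x) - half_width <= mu) && (mu <= mean(x) + half_width)
})

mean(covers)   # proportion of intervals that contain mu
```

The proportion should land near 0.95, although any single interval either contains $\mu$ or does not.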

The interval is exact when the underlying population distribution is normal; otherwise, when n is large, the interval is an approximation justified by the central limit theorem (CLT).  When n is small and the population is not normal, we cannot invoke the CLT and therefore cannot construct the confidence interval this way.

Example: A 95% CI for μ where $\bar{x}$ = 15, n = 25 and $s^2$ = 49:
	
95% CI → 100(1-α) CI → α = 0.05; α/2 = 0.025 therefore want ±$t_{24, 0.975}$ = 2.064

$(\bar{x} - t_{n-1,1-\alpha/2}\,s/\sqrt{n},\ \bar{x} + t_{n-1,1-\alpha/2}\,s/\sqrt{n})$

(15 - 2.064(7/5), 15 + 2.064(7/5))

(15 - 2.8896, 15 + 2.8896)

(12.11, 17.89)
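The same interval can be reproduced in R with `qt` instead of the table. This is a minimal sketch using the numbers from the example above.

```r
xbar <- 15       # sample mean
n <- 25          # sample size
s <- sqrt(49)    # sample standard deviation (s^2 = 49)

t_crit <- qt(0.975, df = n - 1)   # t_{24, 0.975}

# Lower and upper limits of the 95% CI
ci <- xbar + c(-1, 1) * t_crit * s / sqrt(n)
round(ci, 2)
```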

95% of all confidence intervals constructed from all samples of size 25 taken from this population will contain μ.  I can only hope that I have chosen a sample whose CI does, in fact, contain μ.  I will never really know!

#### WHAT AFFECTS THE LENGTH OF A CONFIDENCE INTERVAL?


n:  As the sample size increases, the length of the CI decreases.

$\alpha$:  As the confidence wanted increases (that is, as $\alpha$ decreases), the length of the CI increases.


I can construct an interval for which 100% of the intervals will capture μ: (-∞, ∞).  But that really does not give me any information, so there is always a trade-off between the amount of information the interval gives you and the proportion of intervals that will contain μ.
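Both effects can be seen numerically. In the sketch below, `ci_length` is a hypothetical helper written for this illustration, and the sample standard deviation of 7 is an assumed value borrowed from the earlier example.

```r
# Hypothetical helper: total length of the t-based CI for the mean
ci_length <- function(n, conf, s = 7) {
  alpha <- 1 - conf
  2 * qt(1 - alpha / 2, df = n - 1) * s / sqrt(n)
}

# Length versus sample size, at 95% confidence
by_n <- sapply(c(10, 25, 100, 400), ci_length, conf = 0.95)

# Length versus confidence level, at n = 25
by_conf <- sapply(c(0.80, 0.90, 0.95, 0.99), ci_length, n = 25)

by_n
by_conf
```

The first vector decreases as n grows; the second increases as the confidence level rises.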

CAUTIONS: To use  $\bar{x} \pm t_{n-1,1-\alpha/2}\,s/\sqrt{n}$:

1.   Data must be an SRS from the population or must be able to be thought of as independent observations.
2.   If data is the result of more complex sampling, the CI must be calculated from other formulae.
3.   Nothing can be done to rescue badly produced data.
4.   Outliers can have a large effect on the above CI, so beware and check the data.
5.   If n is small and the population is not close to normally distributed, the above interval is incorrect; non-parametric methods must be used!


### Let's do an example: We draw a random sample of n = 100 in which the average systolic blood pressure is 130.1 mm Hg with a standard deviation of 21.21 mm Hg. We know from prior research that the population mean is 120 mm Hg.

Is the group significantly different (with respect to systolic blood pressure!) from the regular population?

In [56]:
xbar = 130.1            # sample mean 
mu0 = 120           # hypothesized value 
s = 21.21                # sample standard deviation 
n = 100                 # sample size 

## Let's calculate the t test statistic
t = (xbar-mu0)/(s/sqrt(n))
t

In [57]:
alpha = .05
t.half.alpha = qt(1-alpha/2, n-1)
c(-t.half.alpha, t.half.alpha)

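The critical values (about ±1.984) bound the central 95% of the t distribution with 99 df. The t statistic (about 4.76) falls outside this range, so we reject the hypothesis that the group mean equals 120 mm Hg at the 5% level.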

In [58]:
pval = 2*pt(abs(t), n-1, lower.tail=FALSE)   # two-sided p-value
pval

### Let's calculate the 95% confidence interval from this data
$(\bar{x} - t_{n-1,1-\alpha/2}\,s/\sqrt{n},\ \bar{x} + t_{n-1,1-\alpha/2}\,s/\sqrt{n})$

In [59]:
xbar - qt(1-alpha/2, n-1)*(s/sqrt(n))   # lower limit of the 95% CI
xbar + qt(1-alpha/2, n-1)*(s/sqrt(n))   # upper limit of the 95% CI

## The variance is usually unknown as well.

Point Estimation of the Variance

The sampling distribution that describes the distribution of all the $s^2$'s is based on the chi-square distribution.

A $\chi^2$ test can be used to test whether the variance of a population is equal to a specified value.

### Chi-Square ($\chi^2$) Distribution:

If $x_1, x_2, \ldots, x_n$ ~ N(0,1) and are independent, then  $\sum_{i=1}^{n} x_i^2$ is said to follow a chi-square distribution with n degrees of freedom, written $\chi^2_n$.

![title](chi.png)
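The definition of the chi-square distribution can be checked by simulation. The sketch below assumes 5 degrees of freedom (an arbitrary choice) and compares sums of squared standard normal draws with the theoretical distribution.

```r
set.seed(7)
k <- 5   # assumed degrees of freedom for the demo

# Each replicate: sum of k squared independent N(0,1) values
sums <- replicate(20000, sum(rnorm(k)^2))

mean(sums)   # a chi-square with k df has mean k
var(sums)    # and variance 2k

# Simulated versus theoretical 95th percentile
q_sim <- unname(quantile(sums, 0.95))
q_theory <- qchisq(0.95, df = k)
c(simulated = q_sim, theoretical = q_theory)
```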

The sampling distribution of $s^2$, if we assume X ~ N($\mu, \sigma^2$), is:

$$\frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}$$

A 100(1-$\alpha$)% CONFIDENCE INTERVAL for $\sigma^2$ is given by

$$\left[\frac{(n-1)s^2}{\chi^2_{n-1,1-\alpha/2}},\ \frac{(n-1)s^2}{\chi^2_{n-1,\alpha/2}}\right]$$
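The interval for the variance can be computed with `qchisq`. In the sketch below, `var_ci` is a hypothetical helper written for this illustration, and the inputs ($s^2$ = 49, n = 25) are reused from the earlier confidence interval example as assumed values.

```r
# Hypothetical helper: chi-square based CI for a population variance,
# assuming the data come from a normal population
var_ci <- function(s2, n, conf = 0.95) {
  alpha <- 1 - conf
  c(lower = (n - 1) * s2 / qchisq(1 - alpha / 2, df = n - 1),
    upper = (n - 1) * s2 / qchisq(alpha / 2, df = n - 1))
}

ci_var <- var_ci(s2 = 49, n = 25)
ci_var          # interval for sigma^2
sqrt(ci_var)    # corresponding interval for sigma
```

Unlike the interval for the mean, this interval is not symmetric about $s^2$, because the chi-square distribution is skewed.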



![title](t_table.png)

![title](x2_table.png)