# Sampling Distributions
## The Sampling Distribution of a Sample Mean
A **statistic** is a number that describes some characteristic of a **sample**. The value of a statistic can be computed directly from the sample data. We often use a statistic to estimate an unknown parameter.

A **parameter** is a number that describes some characteristic of the **population**. In statistical practice, the value of a parameter is not known because we cannot examine the entire population.

$\dagger$  
A statistic is the estimator, as a random variable, of the unknown parameter.

Let $\mu$ and $\sigma$ be the *mean* and *standard deviation* of the *population*, respectively. The *sample mean* $\bar{X}$ of $n$ observations is related to them as follows:
1. $E\big(\bar{X}\big) = \mu$, thus an unbiased estimator of $\mu$.
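2. $\sigma_{\bar{X}}^2 = \sigma^2 \big( \bar{X} \big) = \displaystyle \frac{\sigma^2} {n}$, which is smaller than the population variance $\sigma^2$ whenever $n > 1$. (This is the variance of $\bar{X}$ itself, not an estimator of $\sigma^2$.)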
3. Standard Deviation or Standard Error (SE): $\sigma_{\bar{X}} = \sigma \big( \bar{X} \big) = \displaystyle \frac{\sigma} {\sqrt{n}}$.

**Central Limit Theorem** (CLT): For a ***sufficiently*** large sample size $n$, the sample mean is approximately Normal: $\bar{X} \sim \DeclareMathOperator*{\N}{N} \N\Big(\mu, \displaystyle\frac{\sigma} {\sqrt{n}}\Big)$, where the second argument is the standard deviation.
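
The properties of $\bar{X}$ listed above can be checked with a small simulation. This is only a sketch: the exponential population (with $\mu = 1$, $\sigma = 1$), the sample size `n = 50`, and the number of repetitions are assumptions chosen for illustration, not values from these notes.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

n = 50          # sample size (assumed for illustration)
reps = 2000     # number of repeated samples (assumed)

# Population: exponential with rate 1, so mu = 1 and sigma = 1.
# It is strongly skewed, i.e. clearly not Normal.
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(reps)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)

print(f"mean of sample means: {mean_of_means:.3f} (theory: mu = 1.000)")
print(f"sd of sample means:   {sd_of_means:.3f} "
      f"(theory: sigma/sqrt(n) = {1 / n ** 0.5:.3f})")
```

The simulated mean and standard deviation of $\bar{X}$ should land close to $\mu$ and $\sigma/\sqrt{n}$, even though the population itself is skewed.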

## Sampling Distributions for "Counts" and "Proportions"
### Bernoulli Random Variable
$X$ is a **Bernoulli random variable** with parameter $p$.
1. **State Space**: $\{0,1\}$
2. **Distribution func**: $P(X = 1) = p, P(X=0) = 1-p$
3. **Mean**: $E(X)=0\cdot P(X=0) + 1 \cdot P(X = 1) = p$;
4. **Variance**: $\DeclareMathOperator*{\Var}{Var} \Var(X) = (0-p)^2\cdot P(X=0) + (1-p)^2 \cdot P(X = 1) = p(1-p)$

Over $n$ independent trials, we count how many times the event $\{X = 1\}$ (a "success") occurs.

### Binomial Random Variable
The random variable $X$ that **counts** the number of successes (for example, defective items) among $n$ independent trials, each with success probability $p$, is called a **binomial random variable** with parameters $n$ and $p$: $X\sim Binomial(n,p)$.
1. **State Space**: $\{0,1,2,\dots,n\}$
2. **Distribution func**: $P(X = x) = \displaystyle \frac{n!} {x!(n-x)!} p^x (1-p)^{n-x}$
3. **Mean**: $\sum\limits_{x = 0}^{n} xP(X = x) = \sum\limits_{x = 1}^{n} \displaystyle \frac{n\cdot (n-1)!} {(x-1)!(n-x)!} p\cdot p^{x-1} (1-p)^{n-x} = np$
4. **Variance**: $\sum\limits_{x = 0}^{n} x^2P(X = x) - (np)^2 = np(1-p)$
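
The mean and variance formulas above can be verified numerically from the probability function. The sketch below uses illustrative values `n = 10` and `p = 0.1` (an assumption, any valid pair works) and the helper name `binomial_pmf` is made up for this example.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 10, 0.1  # illustrative values, not from a data set
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

# Mean: sum of x * P(X = x).
mean = sum(x * px for x, px in enumerate(pmf))

# Variance: E(X^2) - (E X)^2.
second_moment = sum(x**2 * px for x, px in enumerate(pmf))
variance = second_moment - mean**2

print(f"mean     = {mean:.6f}, np      = {n * p:.6f}")
print(f"variance = {variance:.6f}, np(1-p) = {n * p * (1 - p):.6f}")
```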

***
**Sampling Distribution of a Count**: Choose an SRS of size $n$ from a population with proportion $p$ of successes. When the population is **much larger** than the sample, the count $X$ of successes in the sample has approximately the binomial distribution with parameters $n$ and $p$.

>**e.g.** 10% of a lot of $10000$ light bulbs are unqualified. Select $10$ bulbs at random from the lot. What is the probability that none of them is unqualified?
>
>| Actual Probability | Using the binomial distribution |
>| :----------------: |----------------------------------------------------------|
>| $$P(X=0) = \frac{9000} {10000} \times \frac{8999} {9999} \times \cdots \times \frac{8991} {9991} = 0.3485$$ | $$P(X=0) = \binom{10}{0} (0.10)^{0} (0.90)^{10} = 0.3487$$ |
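
The comparison in the light bulb example can be reproduced directly. This sketch assumes that "10% of $10000$" means exactly $1000$ unqualified bulbs, so the exact probability comes from sampling without replacement.

```python
from math import comb

N, unqualified, n = 10_000, 1_000, 10   # lot size, unqualified bulbs, sample size

# Exact: sampling without replacement, so all 10 selected bulbs
# must come from the 9000 qualified ones.
exact = comb(N - unqualified, n) / comb(N, n)

# Binomial approximation with n = 10 and p = 0.10.
approx = comb(n, 0) * 0.10**0 * 0.90**n

print(f"exact = {exact:.4f}, binomial = {approx:.4f}")
```

Because the population is much larger than the sample, the two probabilities agree to about three decimal places.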

### Proportions
The proportion of successes: $\newcommand{\ffrac}{\displaystyle\frac} \hat{p} = \ffrac{X} {n}$
1. **Mean**: $p$
2. **Variance**: $\sigma_{\hat{p}}^2 = \ffrac{p(1-p)} {n}$, so the standard deviation is $\sigma_{\hat{p}} = \sqrt{\ffrac{p(1-p)} {n}}$
3. Remember that $\hat{p}$ is NOT BINOMIAL.

>**e.g.** Suppose $2500$ students are surveyed on an issue, and $60\%$ of all students would agree. What's the probability that the sample proportion who agree is at least $58\%$?
>
>$\hat{p} = \ffrac{X} {2500}$, $X\sim Binomial(2500,0.6)$. Then $P(\hat{p} \geq 0.58) = P(X \geq 2500 \times 0.58) = P(X \geq 1450)$. The remaining work is done in JMP.

***
As $n$ gets larger, plots of the distribution show that $X$ is approximately *Normal* with **mean** and **standard deviation** $\mu_X = np$ and $\sigma_X = \sqrt{np(1-p)}$ respectively. The distribution of $\hat{p}$ is also approximately *Normal*, with **mean** and **standard deviation** $\mu_{\hat{p}} = p$ and $\sigma_{\hat{p}} = \sqrt{\ffrac{p(1-p)} {n}}$ respectively. As a rule of thumb, we use the **Normal approximation** when $n$ is large enough that $np \geq 10$ and $n(1 - p) \geq 10$.
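
As a rough check of the Normal approximation, the survey example with $2500$ students and $p = 0.6$ can be worked out in a few lines. This is a sketch using Python's standard library in place of JMP; it applies no continuity correction, so the result is an approximation.

```python
from math import sqrt
from statistics import NormalDist

n, p = 2500, 0.60                  # sample size and population proportion
mu_x = n * p                       # mean of the count X
sigma_x = sqrt(n * p * (1 - p))    # standard deviation of X

# Rule of thumb for using the Normal approximation.
assert n * p >= 10 and n * (1 - p) >= 10

# P(p_hat >= 0.58) = P(X >= 1450), approximated with N(mu_x, sigma_x).
prob = 1 - NormalDist(mu_x, sigma_x).cdf(0.58 * n)
print(f"P(p_hat >= 0.58) is roughly {prob:.4f}")
```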

## Summary
For a binomial distribution, when $n$ is large enough that $np \geq 10$ and $n(1 - p) \geq 10$, the count is approximately $X \sim \N \left(np, \sqrt{np(1-p)}\right)$

and the proportion is approximately $\hat{p} \sim \N \left( p, \sqrt{\ffrac{p(1-p)} {n}} \right)$

# Introduction to Inference
## Estimating with Confidence
### Statistical Estimation
Collect data from a sample, compute statistics, then make an inference about the parameter of the population.

**Setting in advance**: Simple Conditions for Inference About a Mean
1. We have an SRS from the population of interest. There is no nonresponse or other practical difficulty.
2. The variable we measure has an exactly Normal distribution $\N(\mu,\sigma)$ in the population.
3. We *don’t know* the population mean $\mu$, but we *do know* the population standard deviation $\sigma$.

Since we don't know $\mu$, we can only construct an interval that is likely to contain it.

The first ingredient is the **confidence level** $C = 1 - \alpha$: the required probability that the method produces an interval containing $\mu$. It is a number between $0$ and $1$, and can be read as the overall capture rate of the true parameter $\mu$ over repeated sampling.

Next is the **critical value** $z^*$, which satisfies $P(-z^* \leq Z \leq z^*) = C$ for $Z \sim \N(0,1)$ (it can be found in a table).

With these we get the **margin of error**: $m = z^* \cdot \sigma_{\bar{X}} = z^* \ffrac{\sigma} {\sqrt{n}}$. Here $\sigma_{\bar{X}}$ is computed from the parameter $\sigma$, not from a statistic, and $n$ is the sample size.

And finally the **confidence interval**: $\bar{x} \pm m$, where $\bar{x}$ is calculated from the dataset.
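
The pieces above (confidence level, critical value, margin of error) fit together as in the following sketch. The function name `z_confidence_interval` is made up for illustration, and the inputs $\bar{x} = 17$, $\sigma = 60$, $n = 18$ are borrowed from the score example used later in these notes.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(x_bar, sigma, n, confidence=0.95):
    """Return (lower, upper) for the population mean when sigma is known."""
    alpha = 1 - confidence
    z_star = NormalDist().inv_cdf(1 - alpha / 2)   # critical value z*
    margin = z_star * sigma / sqrt(n)              # margin of error m
    return x_bar - margin, x_bar + margin

low, high = z_confidence_interval(x_bar=17, sigma=60, n=18)
print(f"95% CI: [{low:.1f}, {high:.1f}]")
```

The critical value is obtained from the inverse Normal CDF rather than a printed table, so it is slightly more precise than $1.96$.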
***
Logic behind
1. Since $\bar{X} \sim \N\Big(\mu, \ffrac{\sigma} {\sqrt{n}}\Big)$, the z-table tells us that with probability $C$ the sample mean $\bar{x}$ lies in an interval two margins of error long, centered at the true mean $\mu$.
2. Saying that $\bar{x}$ lies within one margin of error of the parameter $\mu$ is EQUIVALENT to saying that $\mu$ lies within one margin of error of $\bar{x}$.
3. So, over repeated sampling, a proportion of about $C$ of the samples produce an interval that contains the parameter $\mu$.
***
Their relation:

Lower $C$, lower $z^*$, lower $m$; lower $\sigma$, lower $m$; higher $n$, lower $m$.

Since $n$ is the easiest to control, we can solve $m = z^* \ffrac{\sigma} {\sqrt{n}}$ for the sample size that achieves a desired margin of error, $n = \Big(\ffrac{z^* \sigma} {m}\Big)^2$, and then round up to an integer.

The conclusion is then stated like this: I am $C$ confident that the population mean falls between $\bar{x} - m$ and $\bar{x} + m$.
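
Choosing $n$ for a desired margin of error amounts to solving $m = z^* \sigma / \sqrt{n}$ for $n$ and rounding up. A minimal sketch, where the function name and the target margin `margin=10` are hypothetical and $\sigma = 60$ is taken from the score example in these notes:

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_margin(sigma, margin, confidence=0.95):
    """Smallest n whose margin of error is at most `margin` (sigma known)."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z_star * sigma / margin) ** 2)   # always round up

n_needed = sample_size_for_margin(sigma=60, margin=10)
print(n_needed)
```

Rounding up rather than to the nearest integer guarantees the margin of error does not exceed the target.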

### About this method
1. It applies only to this setting (a Normal population)
2. Data must be collected using an SRS
3. Badly produced data ruin everything
4. The confidence interval is not resistant to outliers
5. $n$ can't be too small
6. $\sigma$ of the population is required
7. $m$ covers only random sampling error

## Tests of Significance (Hypothesis Testing)
**Tests of Significance**: compare an observation with a claim (hypothesis) about a parameter whose truth is in question. The result is expressed as a probability, called the **p-value**.

The original claim is called the **null hypothesis**, $H_0$; the claim we are looking for evidence in favor of is the **alternative hypothesis**, $H_a$, which is *two-sided* if it states that the parameter is different from the null value, and *one-sided* if it states that the parameter is bigger or smaller than the null value.

***
**Test Statistic**: based on a statistic that estimates the parameter in question. *When $H_0$ is true, the estimate should be near the parameter value.*

When the estimate is Normally distributed, the test statistic can be

$$z = \frac{\text{estimate  -  hypothesized value}} {\text{standard deviation of the estimate}}$$

**p-value**: the probability, assuming $H_0$ is true, that the test statistic takes a value as extreme as or more extreme than the one actually observed.

Then comes the **significance level** $\alpha$, a fixed value chosen in advance.
1. *p-value* less than $\alpha$ → reject $H_0$ → conclude $H_a$ (we then say the result is statistically significant at level $\alpha$)
2. *p-value* greater than or equal to $\alpha$ → can't reject $H_0$ → can't conclude $H_a$
3. A smaller $\alpha$ means that we insist on stronger evidence against $H_0$

>**e.g.**
>
>A score. Given a Normal distribution with unknown $\mu$ and $\sigma = 60$. $H_0: \mu = 0$, $H_a: \mu \neq 0$; $\alpha = 0.05$. Observations from $18$ respondents give sample mean $\bar{x} = 17$.
>
>First, in theory, $\sigma_{\bar{X}} = \sigma \, /\sqrt{n} = 60 \, / \sqrt{18} = 14.1421$.
>Then, assuming $H_0$ is true, calculate the test statistic and p-value:
>
>$$z = \frac{\bar{x} - \mu_0} {\sigma \, / \sqrt{n}} = 1.2$$
>
>and for this two-sided alternative hypothesis
>
>$$\text{p-value} = P(Z > |z| \text{ OR } Z < -|z|) = 2P(Z > 1.2) = 0.2302 > 0.05$$
>
>Conclusion: we can't reject $H_0$. Such a high p-value gives no evidence to support $H_a$.

Common steps for this method:
1. State the null hypothesis $H_0$ and the alternative $H_a$. Remember, the test is designed to assess the strength of the evidence against $H_0$; only when the evidence is strong enough do we reject $H_0$ and accept $H_a$
2. Calculate the value of the *test statistic*, which normally measures how far the data are from $H_0$
3. Assuming that $H_0$ is true, find the *p-value* for the observed data: the probability that the test statistic would weigh against $H_0$ at least as strongly as it does for these observations
4. Decide on the significance level $\alpha$, that is, how much evidence against $H_0$ you regard as decisive, and compare the p-value with it

### Connection between the Confidence Interval and Hypothesis Test
>Following the earlier example
>
>With $\alpha = 0.05$ and a two-sided alternative, $C = 1-\alpha = 0.95$, so the *critical value* $z^*$ is $1.96$ and the confidence interval is $17 \pm 1.96 \cdot 14.14 = [-10.7, 44.7]$, which contains the hypothesized value $0$. You can't reject that!
>
>Or first construct the corresponding interval on the standardized scale: $0 \pm z^* = [-1.96,1.96]$. Then from the given sample mean $\bar{x} = 17$ we find the corresponding $z=1.2$, which lies inside this interval. You can't reject that!

### $z$ test for a population mean
Assume a Normal distribution with unknown $\mu$ and known $\sigma$, and let $H_0:\mu = \mu_0$. We can calculate the **one-sample $z$ statistic**

$$z = \frac{\bar{x} - \mu_0} {\sigma \, / \sqrt{n}}$$

Then for each alternative hypothesis, we calculate the *p-value* following the table below.

$$\begin{array}{cc}
\hline
H_a & \text{p-value} \\ \hline
\mu > \mu_0 & P(Z \geq z) \\
\mu < \mu_0 & P(Z \leq z) \\
\mu \neq \mu_0 & 2\cdot P(Z \geq \left|z\right|) \\ \hline
\end{array}$$
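
The three rows of the table translate into code as follows. This is a sketch under the same assumptions as the text (Normal population, known $\sigma$); the function name `z_test` and the `alternative` labels are made up for illustration. The usage line reruns the score example ($\bar{x} = 17$, $\sigma = 60$, $n = 18$).

```python
from math import sqrt
from statistics import NormalDist

def z_test(x_bar, mu_0, sigma, n, alternative="two-sided"):
    """Return (z, p_value) for H0: mu = mu_0 when sigma is known."""
    z = (x_bar - mu_0) / (sigma / sqrt(n))
    std_normal = NormalDist()
    if alternative == "greater":        # Ha: mu > mu_0
        p_value = 1 - std_normal.cdf(z)
    elif alternative == "less":         # Ha: mu < mu_0
        p_value = std_normal.cdf(z)
    elif alternative == "two-sided":    # Ha: mu != mu_0
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    else:
        raise ValueError(f"unknown alternative: {alternative!r}")
    return z, p_value

z, p = z_test(x_bar=17, mu_0=0, sigma=60, n=18)
print(f"z = {z:.2f}, p-value = {p:.4f}")
```

The computed p-value differs slightly from a table lookup because $z$ is not rounded to $1.2$ before the tail probability is taken.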

>**comprehensive example**
>
> A total of $324$ athletes participated. Among the $n = 201$ women, the average caloric intake is $2403.7$, while the recommended amount is $2811.5$. Is there evidence that female athletes are deficient in caloric intake?
>
>$H_0: \mu = 2811.5$, versus $H_a : \mu < 2811.5$
>
>Take the standard deviation $\sigma = 880$ as given from a previously carried-out test. Then $\sigma_{\bar{X}} = \ffrac{\sigma} {\sqrt{n}} = \ffrac{880} {\sqrt{201}} = 62.0704$, and the test statistic is $z = \ffrac{\bar{x} - \mu_0} {\sigma_{\bar{X}}} = \ffrac{2403.7 - 2811.5} {62.0704} = -6.57$, so the *p-value* is $P(Z \leq -6.57) \approx 0$.
>
>The p-value is so small that we reject $H_0$ and conclude that there is "strong evidence" of below-recommended caloric consumption among female athletes.
>
>Another way to state the conclusion: the sample mean $\bar{x} = 2403.7$ is ***statistically significantly smaller*** than $\mu_0 = 2811.5$.

### Type I and Type II ERROR
**Type I error**: reject $H_0$ when $H_0$ is true.

**Type II error**: fail to reject $H_0$ when $H_0$ is false.
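
One way to see what a Type I error means in practice: when $H_0$ is true, a test at level $\alpha$ should reject in about a fraction $\alpha$ of repeated samples. The simulation below is a sketch with made-up settings ($\mu_0 = 0$, $\sigma = 1$, $n = 25$, two-sided $z$ test), not data from these notes.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

random.seed(1)  # fixed seed so the run is repeatable

mu_0, sigma, n = 0.0, 1.0, 25    # hypothetical setting; H0 is true here
alpha, reps = 0.05, 4000
z_star = NormalDist().inv_cdf(1 - alpha / 2)

rejections = 0
for _ in range(reps):
    # Draw a sample from the population described by H0.
    sample = [random.gauss(mu_0, sigma) for _ in range(n)]
    z = (fmean(sample) - mu_0) / (sigma / sqrt(n))
    if abs(z) > z_star:          # the two-sided test rejects H0
        rejections += 1

type_i_rate = rejections / reps
print(f"observed Type I error rate: {type_i_rate:.3f} (alpha = {alpha})")
```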

## Use and Abuse of Tests
### When choosing the significance level $\alpha$
1. How serious are the consequences when an error occurs?
    1. Type II: Failing to reject the claim that global warming is real when it actually is not (taking that claim as $H_0$)
    2. Type I: An innocent person is convicted of a crime (taking innocence as $H_0$)
2. Is this a preliminary study? If so, a larger $\alpha$ may be acceptable
### About the test result
1. Statistical significance only says whether the observed effect could plausibly be due to chance alone, arising from random sampling.
2. A statistically significant effect may not be practically important. Significance flags that an effect exists but says nothing about its magnitude. Still, it may be meaningful.
3. With a very large sample size, significance can be reached even for the tiniest effect.
4. Don't test a hypothesis on the same data that suggested it; state the hypothesis before sampling.
5. Exploratory data analysis is not always bad. If you find something interesting, switch to another data set and repeat the analysis.

# Inference for Distributions
## Inference for the Mean of a Population ($\text{t-Test}$)
Draw an SRS of size $n$ from a large population that has a Normal distribution with mean $\mu$ and standard deviation $\sigma$. The **one-sample $\mathrm{t}$ statistic** has the t distribution with degrees of freedom $df = n - 1$.

$$t = \frac{\bar{x} - \mu} {s_{\bar{x}}} \sim t_{n-1}$$

where $s_{\bar{x}} = \ffrac{s} {\sqrt{n}}$ is called the **standard error** and $s$ is the sample standard deviation.

$\odot$ As the degrees of freedom increase, the density curve of the $\mathrm{t}$ distribution gets closer to the standard Normal curve.$\Join$
***
How to use the table?

Given the sample size $n$, compute the degrees of freedom $n-1$ to pick the row of the table. Then pick the column whose upper-tail probability is $\alpha$ (one-sided) or $\alpha/2$ (two-sided, or a confidence level $C = 1 - \alpha$). The entry at that row and column is the critical value $t$.

Using this $t$, we get the confidence interval

$$\bar{x} \pm t \times \frac{s} {\sqrt{n}}$$

where $m = t \times \frac{s} {\sqrt{n}}$ is the **margin of error**.
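
A small sketch of the interval computation, with the critical value passed in from a table lookup. The function name is hypothetical; the inputs reuse the summary statistics of the example further down ($\bar{x} = 7.06$, $s = 3.56$, $n = 8$) and the table entry $2.365$ for $df = 7$ at upper-tail probability $0.025$.

```python
from math import sqrt

def t_confidence_interval(x_bar, s, n, t_star):
    """CI for the mean; t_star is looked up in a t table for df = n - 1."""
    margin = t_star * s / sqrt(n)       # margin of error m
    return x_bar - margin, x_bar + margin

# df = 7 and upper-tail probability 0.025 (95% confidence) give t* = 2.365.
low, high = t_confidence_interval(x_bar=7.06, s=3.56, n=8, t_star=2.365)
print(f"95% CI: [{low:.2f}, {high:.2f}]")
```

Python's standard library has no inverse $t$ distribution, which is why the critical value is supplied by hand here.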
***
Test of Hypothesis

$$\begin{array}{cc}
\hline
H_a & \text{p-value} \\ \hline
\mu > \mu_0 & P(T \geq t) \\
\mu < \mu_0 & P(T \leq t) \\
\mu \neq \mu_0 & 2\cdot P(T \geq \left|t\right|) \\ \hline
\end{array}$$

>**e.g.** given $\alpha = 0.05$, $n=8$, $\bar{x} = 7.06$, $s = 3.56$; $H_0: \mu = 5.4, H_{a}: \mu \neq 5.4$

>First we calculate the standard error $\ffrac{s} {\sqrt{n}} = \ffrac{3.56} {\sqrt{8}} = 1.26$, so the $\text{t}$ statistic is

>$$t = \frac{\bar{x} - \mu_0} {s_{\bar{x}}} = \frac{7.06 - 5.4} {1.26} = 1.32$$

>Then look up the row $df=7$ in the table:

>$$\begin{array}{c:cccc}
df & .15 & .10 & .05 & .025 \\
7 & 1.119 & 1.415 & 1.895 & 2.365
\end{array}$$

>Since $1.119 < 1.32 < 1.415$, the upper-tail probability $P(T \geq 1.32)$ lies between $0.10$ and $0.15$; we take it to be approximately $0.11$.

>Since this is a two-tailed test, $p\text{-value} = 2 \times 0.11 = 0.22 > 0.05$. We can't reject the null hypothesis at the $0.05$ level. There is NO SIGNIFICANT DIFFERENCE ( $\alpha= 5\%$) BETWEEN ....
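
The table lookup in this example can be cross-checked numerically. The sketch below is not a library routine: it integrates the $t$ density with Simpson's rule (the helper names `t_pdf` and `t_upper_tail` are made up here) and then recomputes the statistic from the stated $\bar{x}$, $\mu_0$, $s$ and $n$, which gives $t \approx 1.32$.

```python
from math import gamma, pi, sqrt

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """P(T >= t) by Simpson's rule on [0, |t|], using symmetry about 0."""
    a = abs(t)
    h = a / steps                       # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3                # integral of the density from 0 to |t|
    tail = 0.5 - area                   # P(T >= |t|)
    return tail if t >= 0 else 1 - tail

x_bar, mu_0, s, n = 7.06, 5.4, 3.56, 8
t_stat = (x_bar - mu_0) / (s / sqrt(n))
p_two_sided = 2 * t_upper_tail(abs(t_stat), df=n - 1)
print(f"t = {t_stat:.2f}, two-sided p-value = {p_two_sided:.3f}")
```

The numerical p-value falls in the range the table brackets (between $0.20$ and $0.30$ for the two-sided test), so the conclusion of not rejecting $H_0$ at the $0.05$ level is unchanged.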

$\dagger$
- Strictly, the $t$ statistic has a $t$ distribution only when the population is Normally distributed, but in practice the procedure works well when the sample size is large.
- So if the sample size is less than about $15$, use this procedure only if the data are close to Normal.
- Also, under the same conditions, the CI from the $t$ procedure is wider than the one from the $z$ procedure; that is the price of not knowing the population standard deviation and of having so few observations.
- Often a "significant result" from a two-sided test can be used to justify a one-sided test on another sample from the same population. Remember: the p-value for a one-sided test is half of the p-value for the two-sided (two-tailed) test, provided the data fall in the direction of $H_a$.
- Sometimes the choice of alternative hypothesis affects the conclusion, so always use a two-sided test.
***

Matched Pairs $\text{t}$ Test Procedures

To compare the responses to the two treatments in a matched-pairs design, find the difference between the responses within each pair. Then apply the one-sample t procedures to the differences:

We first make two observations on the same individual, or one observation on each of two similar individuals, resulting in **paired data**. If the conditions for inference are met, we can use one-sample $t$ procedures to perform inference about the mean difference $\mu_d$.

>**e.g.** Matched Pairs t Test
>
>$H_0:\mu=\mu_0 = 0$, $H_a: \mu \neq 0$. Then from the sample of differences:
>
>1. Degrees of freedom: $df = n-1=14$
>2. Mean of the sample: $\bar{x} = 2.433$
>3. Standard deviation of the sample: $s= 1.460$
>4. Standard error (sample): $SE_{\bar{x}} = \ffrac{s} {\sqrt{n}} = 0.377$
>5. $t$ statistic: $t = \ffrac{\bar{x} - \mu_0} {SE_{\bar{x}}} = 6.45$
>
>Then use these values to find the p-value in the table, remembering that the hypothesis is two-sided:
>
>$p\text{-value} = 2 \times \Pr(T \geq t) < 0.001 < \alpha$. Reject the null hypothesis!
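
The matched pairs procedure starts from raw pairs rather than summary statistics, so a sketch of the first steps may help. The `before` and `after` values below are hypothetical numbers invented for illustration; they are not the data behind the example above.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical paired measurements on the same five individuals.
before = [10, 12, 9, 11, 13]
after = [12, 15, 10, 14, 15]

diffs = [a - b for a, b in zip(after, before)]   # within-pair differences
n = len(diffs)
d_bar = mean(diffs)          # sample mean of the differences
s_d = stdev(diffs)           # sample standard deviation of the differences
se = s_d / sqrt(n)           # standard error of the mean difference
t_stat = (d_bar - 0) / se    # one-sample t statistic for H0: mu_d = 0
df = n - 1

print(f"differences = {diffs}")
print(f"t = {t_stat:.2f} with df = {df}")
```

From here the p-value is found exactly as in the one-sample case, by looking up `t_stat` in the row for `df` of the $t$ table.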

## Test
Chapter 6 & 7
Lab 6 7 8 9 10
### Chapter 5
Population and sample, and what's the random variable.

Bernoulli, counting



