# Unit 5


In [33]:
import numpy as np

import matplotlib.pyplot as plt
from scipy.stats import (
    geom,
    percentileofscore,
    poisson,
    norm)
from itertools import (
    combinations
)

from math import (
    sqrt
)

from fractions import Fraction
from sklearn.linear_model import LinearRegression

## 5.1.2

Population symbols

- proportion = $p$
- mean = $\mu$
- standard deviation = $\sigma$

Sample symbols

- proportion = $\hat p$
- mean = $\bar x$
- standard deviation = $s$


## 5.1.3 Sampling with or without replacement

For independence, the population should be at least 10 times as large as
the sample.

$$
\text{Population} \ge 10 \cdot n
$$

where $n$ is the sample size.

When that condition holds, the probabilities shift very little as you
sample $n$ items from the population, so you can treat the draws as
approximately independent.
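To see why the 10-times rule makes the draws nearly independent, here is a small sketch. The helper `prob_two_successes` and the two populations (both 40% successes) are made up for illustration; they are not from the course material.

```python
from fractions import Fraction


def prob_two_successes(successes, population, replace):
    """Probability that two draws in a row are both successes."""
    first = Fraction(successes, population)
    if replace:
        second = first  # the population is unchanged for the second draw
    else:
        # one success has already been removed from the population
        second = Fraction(successes - 1, population - 1)
    return first * second


# Hypothetical populations: one tiny, one large, both 40% successes
for population in (10, 10_000):
    successes = population * 4 // 10
    with_repl = prob_two_successes(successes, population, replace=True)
    without_repl = prob_two_successes(successes, population, replace=False)
    print(population, float(with_repl), float(without_repl))
```

For the population of 10 the two probabilities differ noticeably; for the population of 10,000 they are nearly the same, which is the sense in which sampling is "almost independent".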

Sampling without Replacement:

A sampling plan where each
observation that is sampled is kept out of subsequent selections,
resulting in a sample where each observation can be selected no more than
one time.


## 5.1.4 Sampling error

Sampling error tends to decrease as the sample size increases.
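A quick simulation can illustrate this. Everything here is an assumption for the sake of the sketch: the population is simulated (normal, mean 50, standard deviation 10), the seed is arbitrary, and `mean_abs_error` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical population of 100,000 values
population = rng.normal(loc=50, scale=10, size=100_000)
mu = population.mean()


def mean_abs_error(n, trials=2_000):
    """Average distance between a sample mean and the population mean."""
    # draws are made with replacement (the default for rng.choice)
    samples = rng.choice(population, size=(trials, n))
    return np.abs(samples.mean(axis=1) - mu).mean()


errors = {n: mean_abs_error(n) for n in (10, 100, 1_000)}
for n, err in errors.items():
    print(n, err)
```

The average error should shrink as `n` grows.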


## 5.1.5 Distribution of sample means

mean of a sampling distribution of sample means

$ \mu_{\bar x} = \mu_{\text{population}} $

standard deviation of a sampling distribution of sample means

$ \sigma_{\bar x} = \frac{\sigma_{\text{population}}}{\sqrt{n}} $

Central Limit Theorem: If the sample size is large enough (generally $ n
\ge 30$), and the std dev of the population is finite, then the
distribution of the sample means will be approximately normal.
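The two formulas and the Central Limit Theorem can be checked with a simulation. This is only a sketch: the exponential population (mean 2, std dev 2), the sample size of 40, and the seed are all arbitrary choices, not values from the course.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

n = 40           # sample size (above the usual n >= 30 guideline)
trials = 20_000  # number of samples drawn

# Exponential with scale=2 is right skewed, with mean 2 and std dev 2
samples = rng.exponential(scale=2.0, size=(trials, n))
sample_means = samples.mean(axis=1)

print(sample_means.mean())       # should be close to the population mean, 2
print(sample_means.std(ddof=1))  # should be close to 2 / sqrt(40)
print(2 / np.sqrt(n))
```

Plotting `sample_means` with `plt.hist` should show a roughly bell-shaped histogram even though the population itself is skewed.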


## Distribution of sample proportions

mean of a distribution of sample proportions

$ \mu_{\hat p} = p $

standard deviation of a distribution of sample proportions

$ \sigma_{\hat p} = \sqrt{\frac{pq}{n}} $

- p = probability of success
- q = probability of failure (1 - p)

for inference these conditions must be true:

$ n \cdot p \ge 10 $ and $ n \cdot q \ge 10 $

shape is going to look like a binomial distribution: skewed left when p
is high and sample size is low, skewed right when p is low and sample
size is low, approximately normal when sample size is high.
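The formulas and the two conditions above can be bundled into one helper. `proportion_sampling_summary` is a hypothetical name, and the example inputs are illustrative only.

```python
from math import sqrt


def proportion_sampling_summary(p, n):
    """Mean, std dev, and normal-approximation check for sample proportions."""
    q = 1 - p
    return {
        "mean": p,
        "std_dev": sqrt(p * q / n),
        # both expected counts must be at least 10
        "normal_ok": n * p >= 10 and n * q >= 10,
    }


print(proportion_sampling_summary(p=0.68, n=200))  # conditions met
print(proportion_sampling_summary(p=0.05, n=50))   # n * p is only 2.5
```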


## 5.2.3 Errors

Type I error:
an error when a true null hypothesis is rejected

Type II error:
when a false null hypothesis is not rejected


## 5.2.4

Power indicates sensitivity or the ability to detect a difference that is
present.

Alpha ($\alpha$) is the probability of making a Type I error (rejecting
the null hypothesis when we shouldn't).
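Both ideas can be seen in a simulation of a right-tailed z-test. This is a rough sketch under made-up assumptions: the null mean of 100, the standard deviation of 15, the sample size of 36, the alternative mean of 106, and the helper `rejection_rate` are all hypothetical.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(seed=2)

alpha = 0.05
mu_0, sigma, n, trials = 100, 15, 36, 20_000  # hypothetical values
critical = norm.ppf(1 - alpha)                # right-tailed critical z


def rejection_rate(true_mu):
    """Fraction of simulated samples whose z statistic exceeds the critical value."""
    samples = rng.normal(loc=true_mu, scale=sigma, size=(trials, n))
    z = (samples.mean(axis=1) - mu_0) / (sigma / np.sqrt(n))
    return np.mean(z > critical)


type_1_rate = rejection_rate(true_mu=100)  # null is true: rejections are Type I errors
power = rejection_rate(true_mu=106)        # null is false: rejections are correct

print(type_1_rate)  # should land near alpha
print(power)
```

When the null is true, the rejection rate should sit close to alpha. When the true mean really is higher, the rejection rate is the power of the test, and one minus that is the Type II error rate.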


## 5.2.5

right tailed: looking for value higher than null

left tailed: looking for value lower than null

two-tailed: looking for value different than null (higher or lower)


## 5.2.7

Chi-squared test for association/independence:

ex: does gender affect whether or not someone likes apple, orange or
banana?
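A sketch of how this test might be run with `scipy.stats.chi2_contingency`. The counts below are invented for illustration (rows are two gender groups, columns are apple, orange, banana); they are not data from the course.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed counts
#                  apple  orange  banana
observed = np.array([
    [30, 25, 45],   # group 1
    [35, 40, 25],   # group 2
])

chi2, p_value, dof, expected = chi2_contingency(observed)

print(chi2)      # test statistic
print(p_value)   # small p-value suggests fruit preference and gender are associated
print(dof)       # (rows - 1) * (columns - 1)
print(expected)  # counts expected if the two variables were independent
```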


In [34]:
norm(loc=122, scale=35).cdf(67)

0.05804156686932752

## 5.3.2 z-test for population means

$$
z = \frac{\bar x - \mu}
         {\frac{\sigma}{\sqrt{n}}}
$$


In [35]:
mu = 7.5
sigma = 0.7
n = 60
x_bar = 7.2

num = x_bar - mu
denom = sigma / sqrt(n)

num / denom

-3.319700011034927

## 5.3.3

z test population proportions

$$
z = \frac{\hat p - p}
         {\sqrt{\frac{pq}{n}}}
$$


In [36]:
n = 200
p_hat = 140 / n
p = 0.68
q = 1 - p

num = p_hat - p
denom = sqrt(p*q / n)
num/denom

0.6063390625908296

## 5.3.4

Find critical z value

just use standard normal distribution and ppf


In [37]:
# critical value for left-tailed with alpha of 0.05
# less than this value
norm.ppf(0.05)

-1.6448536269514729

In [38]:
# critical for right-tailed with 0.09 alpha
norm().ppf(1 - .09)

1.3407550336902165

In [39]:
# two tailed with a 0.05 alpha
# remember to split the tail
norm.ppf(1 - 0.05/2)  # more extreme than plus or minus this value

1.959963984540054

## 5.3.5 find a p-value from a z-test statistic

use norm.cdf or norm.sf

- [find a p-value from z-score](https://www.tutorialspoint.com/how-to-find-a-p-value-from-a-z-score-in-python#:~:text=The%20p%2Dvalue%20for%20a%20z,a%20standard%20normal%20random%20variable.)


In [40]:
# left tailed test
norm().cdf(-2.74)

0.003071959218650488
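The same idea extends to right-tailed and two-tailed tests. The helper below is a sketch; `z_p_value` and its `tail` argument are hypothetical names, not part of scipy.

```python
from scipy.stats import norm


def z_p_value(z, tail):
    """p-value for a z statistic; tail is 'left', 'right', or 'two'."""
    if tail == "left":
        return norm.cdf(z)           # area below z
    if tail == "right":
        return norm.sf(z)            # area above z (same as 1 - cdf)
    if tail == "two":
        return 2 * norm.sf(abs(z))   # both tails beyond |z|
    raise ValueError("tail must be 'left', 'right', or 'two'")


print(z_p_value(-2.74, "left"))
print(z_p_value(1.5, "right"))
print(z_p_value(1.96, "two"))
```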

## 5.3.6 confidence intervals

use `norm().ppf`


In [41]:
norm().ppf(1 - .01/2)

2.5758293035489004
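The cell above is the critical value for a 99% confidence level. A small helper makes the pattern reusable for other levels; `z_star` is a hypothetical name chosen for this sketch.

```python
from scipy.stats import norm


def z_star(confidence):
    """Critical z value that leaves (1 - confidence) / 2 in each tail."""
    alpha = 1 - confidence
    return norm.ppf(1 - alpha / 2)


for level in (0.90, 0.95, 0.99):
    print(level, z_star(level))
```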

## 5.3.7 Calculating the confidence interval for population proportion

- [5.3.7](https://app.sophia.org/spcc/introduction-to-statistics-2-challenge-5-3/7/7595/confidence-interval-for-population-proportion-2)

$$
CI = \hat p \pm z^{*} \sqrt{\frac{\hat p \hat q}{n}}
$$


In [42]:
p_hat = 120 / 500
q_hat = 1 - p_hat
n = 500
# for 90% confidence interval
z = norm().ppf(1 - .10/2)


def ci(p_hat, q_hat, n, z):

    spread = z * sqrt((p_hat * q_hat) / n)

    return (p_hat - spread, p_hat + spread)


ci(p_hat, q_hat, n, z)

(0.20858372631813238, 0.2714162736818676)

## Standard error of a sample proportion

Standard error of a sample proportion when the population proportion $p$
(and so the population std dev) is unknown:

$$
\sqrt{\frac{\hat p \hat q}{n}}
$$

- $\hat p$ = sample proportion of success
- $\hat q$ = complement of $\hat p$ ($1 - \hat p$)
- $n$ = sample size

Remember, for sample means the standard error is $\frac{s}{\sqrt n}$.

Standard deviation of the sample proportion when the population
proportion $p$ is known:

$$
\sqrt{\frac{p q}{n}}
$$


In [43]:
n = 186
p = 57 / n
q = 1 - p

sqrt(p*q / n)

0.03380359319930459

## 5.4.1 T-tests

When the population standard deviation is not known, use a t-statistic.
It has the same form as z but replaces the population standard deviation
with the sample standard deviation.

$$
t = \frac{\bar x - \mu}
         {\frac{s}{\sqrt{n}}}
$$

remember: standard error = $\frac{s}{\sqrt{n}}$


t distribution is usually a little flatter and with a little more spread.

degrees of freedom: sample size - 1

independence: population at least 10 times as large as the sample
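The "flatter, with more spread" remark shows up in the critical values: for the same tail area, t critical values sit further out than z, and they approach z as the degrees of freedom grow. A minimal sketch, with the degrees of freedom chosen arbitrarily:

```python
from scipy.stats import norm, t

# two-tailed critical values at alpha = 0.05
z_crit = norm.ppf(0.975)
t_crits = {df: t(df=df).ppf(0.975) for df in (5, 30, 1000)}

for df, t_crit in t_crits.items():
    print(df, t_crit, z_crit)
```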


## 5.4.2 Calculate T-Test statistic


In [44]:
n = 101
x_bar = 6.9
s = 2.5
mu = 8

(x_bar - mu) / (s/sqrt(n))

-4.42194527329319
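When the raw data is available rather than just summary statistics, `scipy.stats.ttest_1samp` computes the same statistic. The eight data values below are invented purely to show that the manual formula and scipy agree.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Hypothetical sample; mu_0 is the hypothesized population mean
data = np.array([6.1, 7.4, 5.9, 8.2, 6.8, 7.0, 6.5, 7.7])
mu_0 = 8

n = len(data)
s = data.std(ddof=1)  # sample standard deviation (n - 1 in the denominator)
manual_t = (data.mean() - mu_0) / (s / np.sqrt(n))

result = ttest_1samp(data, popmean=mu_0)

print(manual_t)
print(result.statistic)  # should match manual_t
print(result.pvalue)     # two-tailed p-value by default
```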

## 5.4.3 critical t value


In [48]:
from scipy.stats import t

# two tailed
t(df=28).ppf(1 - 0.10/2)

1.701130934265931

## 5.4.4

How to find a P-Value from a T-Test statistic


In [59]:
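# these are going to be a little different.
# sophia answers are rounded
# two tailed: double the upper-tail area beyond |t|
# (doubling the cdf of a positive t gives a value above 1, which is not a probability)
2 * t(df=29).sf(abs(1.311))

0.2001 (rounded to four decimal places)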

In [61]:
# left tailed...
t(df=15).cdf(-2.624)

0.009580243498838651

In [64]:
# right tailed
1 - t(df=19).cdf(1.729)

# or
t(df=19).sf(1.729)

0.050012127821915116