# Unit 5


In [1]:
import numpy as np

import matplotlib.pyplot as plt
from scipy.stats import (
    binom,
    chisquare,
    geom,
    percentileofscore,
    poisson,
    norm,
    t)
from itertools import (
    combinations
)

from math import (
    sqrt
)

from fractions import Fraction
from sklearn.linear_model import LinearRegression

## 5.1.2

Population symbols

- proportion = $p$
- mean = $\mu$
- standard deviation = $\sigma$

Sample

- Proportion = $\hat p$
- mean = $\bar x$
- standard deviation = $ s $


## 5.1.3 Sampling with or without replacement

For independence, a large population is going to be at least 10 times
larger than the sample.

$$
\text{Population} \ge 10 \cdot n \\
\text{where n is sample size}
$$

When that holds, the probabilities shift very little as you sample $n$ items
from the population without replacement, so you can treat the draws as
approximately independent.
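
As a rough illustration of why the 10-times rule works, the sketch below compares the chance of drawing two successes with and without replacement. The population of 1000 items with 300 successes and the helper name `is_large_population` are made up for the example.

```python
from fractions import Fraction


def is_large_population(population_size, n):
    """Return True when the population is at least 10 times the sample size."""
    return population_size >= 10 * n


# Hypothetical population: 1000 items, 300 of which are "successes".
population_size, successes = 1000, 300

# Probability that two draws are both successes.
with_replacement = Fraction(successes, population_size) ** 2
without_replacement = (Fraction(successes, population_size)
                       * Fraction(successes - 1, population_size - 1))

print(is_large_population(population_size, n=2))  # True
print(float(with_replacement), float(without_replacement))
```

The two probabilities differ only in the fourth decimal place, which is why the draws can be treated as nearly independent.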

Sampling without replacement:

A sampling plan where each observation that is sampled is kept out of
subsequent selections, so each observation can be selected no more than
once.


## 5.1.4 Sampling error

Sampling error decreases as the sample size increases.


## 5.1.5 Distribution of sample means

mean of a sampling distribution of sample means

$ \mu_{\bar x} = \mu_{\text{population}} $

standard deviation of a sampling distribution of sample means

$ \sigma_{\bar x} = \frac{\sigma_{\text{population}}}{\sqrt{n}} $

Central Limit Theorem: If the sample size is large enough (generally
$n \ge 30$) and the standard deviation of the population is finite, then the
distribution of the sample means will be approximately normal.
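
A small simulation can make these formulas concrete. This is only a sketch: the exponential population (mean 2, standard deviation 2), the sample size of 36, and the 5000 repetitions are all arbitrary choices, and it uses the standard library's `random` module rather than NumPy so it runs without extra packages.

```python
import random
from math import sqrt
from statistics import pstdev

rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumed population: exponential with mean 2 and std dev 2 (right-skewed).
mu, sigma = 2.0, 2.0
n = 36              # size of each sample
num_samples = 5000  # how many samples to draw

# Draw many samples and record each sample's mean.
sample_means = [
    sum(rng.expovariate(1 / mu) for _ in range(n)) / n
    for _ in range(num_samples)
]

mean_of_means = sum(sample_means) / num_samples
sd_of_means = pstdev(sample_means)

print(mean_of_means, mu)              # these two should be close
print(sd_of_means, sigma / sqrt(n))   # and so should these
```

Even though the population is skewed, the sample means cluster around $\mu$ with a spread close to $\sigma / \sqrt{n}$.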


## Distribution of sample proportions

mean of a distribution of sample proportions

$ \mu_{\hat p} = p $

standard deviation of a distribution of sample proportions

$ \sigma_{\hat p} = \sqrt{\frac{pq}{n}} $

- $p$ = probability of success
- $q$ = probability of failure ($1 - p$)

for inference these conditions must be true:

$ n \cdot p \ge 10 $ and $ n \cdot q \ge 10 $

shape is going to look like a binomial distribution: skewed left when p
is high and sample size is low, skewed right when p is low and sample
size is low, approximately normal when sample size is high.
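
The formulas and conditions above could be bundled into one helper. This is a minimal sketch; the function name `sample_proportion_distribution` and the example inputs are assumptions made for illustration.

```python
from math import sqrt


def sample_proportion_distribution(p, n):
    """Mean, standard deviation, and normality check for the distribution of p-hat."""
    q = 1 - p
    mean = p
    std_dev = sqrt(p * q / n)
    # Both expected counts must be at least 10 for the normal approximation.
    conditions_met = n * p >= 10 and n * q >= 10
    return mean, std_dev, conditions_met


print(sample_proportion_distribution(p=0.68, n=200))
print(sample_proportion_distribution(p=0.05, n=50))  # n*p = 2.5, too few expected successes
```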


## 5.2.3 Errors

Type I error:
an error when a true null hypothesis is rejected

Type II error:
when a false null hypothesis is not rejected


## 5.2.4

Power indicates sensitivity or the ability to detect a difference that is
present.

Alpha ($\alpha$) is the probability of making a Type I error (rejecting the
null hypothesis when it is actually true).
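
One way to see that alpha is the Type I error rate is to simulate many two-tailed z-tests in which the null hypothesis really is true and count how often it gets rejected. The null mean, standard deviation, sample size, and number of trials below are hypothetical, and the sketch uses the standard library's `statistics.NormalDist` in place of `scipy.stats.norm`.

```python
import random
from math import sqrt
from statistics import NormalDist

rng = random.Random(1)  # fixed seed so the run is repeatable

alpha = 0.05
mu_0, sigma, n = 100.0, 15.0, 30  # hypothetical null mean, population std dev, sample size
critical_z = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed cutoff

trials = 4000
rejections = 0
for _ in range(trials):
    # The null hypothesis is true here: every sample comes from mean mu_0.
    sample = [rng.gauss(mu_0, sigma) for _ in range(n)]
    x_bar = sum(sample) / n
    z = (x_bar - mu_0) / (sigma / sqrt(n))
    if abs(z) > critical_z:
        rejections += 1  # rejecting a true null is a Type I error

type_1_rate = rejections / trials
print(type_1_rate)  # should land near alpha
```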


## 5.2.5

right tailed: looking for value higher than null

left tailed: looking for value lower than null

two-tailed: looking for value different than null (higher or lower)


## 5.2.7

Chi-squared test for association/independence:

ex: does gender affect whether or not someone likes apple, orange or
banana?


In [2]:
norm(loc=122, scale=35).cdf(67)

0.05804156686932752

## 5.3.2 z-test for population means

$$
z = \frac{\bar x - \mu}
         {\frac{\sigma}{\sqrt{n}}}
$$


In [3]:
mu = 7.5
sigma = 0.7
n = 60
x_bar = 7.2

num = x_bar - mu
denom = sigma / sqrt(n)

num / denom

-3.319700011034927

## 5.3.3

z test population proportions

$$
z = \frac{\hat p - p}
         {\sqrt{\frac{pq}{n}}}
$$


In [4]:
n = 200
p_hat = 140 / n
p = 0.68
q = 1 - p

num = p_hat - p
denom = sqrt(p*q / n)
num/denom

0.6063390625908296

In [5]:
n = 635
p_hat = 71 / n

## 5.3.4

Find critical z value

just use standard normal distribution and ppf


In [6]:
# critical value for left-tailed with alpha of 0.05
# less than this value
norm.ppf(0.05)

-1.6448536269514729

In [8]:
# critical for right-tailed with 0.09 alpha
norm().ppf(1 - .09)

1.3407550336902165

In [9]:
# two tailed with a 0.05 alpha
# remember to split the tail
norm.ppf(1 - 0.05/2)  # more extreme than plus or minus this value

1.959963984540054

## 5.3.5 find a p-value from a z-test statistic

use norm.cdf or norm.sf

- [find a p-value from z-score](https://www.tutorialspoint.com/how-to-find-a-p-value-from-a-z-score-in-python#:~:text=The%20p%2Dvalue%20for%20a%20z,a%20standard%20normal%20random%20variable.)


In [10]:
# left tailed test
norm().cdf(-2.74)

0.003071959218650488
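
The three tail cases can be gathered into one helper. This is a sketch only: `p_value_from_z` is a made-up name, and it relies on the standard library's `statistics.NormalDist` instead of `scipy.stats.norm`, which should give the same values to many decimal places.

```python
from statistics import NormalDist


def p_value_from_z(z, tail):
    """p-value for a z statistic; tail is 'left', 'right', or 'two'."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "left":
        return std_normal.cdf(z)
    if tail == "right":
        return 1 - std_normal.cdf(z)
    if tail == "two":
        # double the area beyond |z| in one tail
        return 2 * (1 - std_normal.cdf(abs(z)))
    raise ValueError("tail must be 'left', 'right', or 'two'")


print(p_value_from_z(-2.74, "left"))
print(p_value_from_z(1.5, "right"))
print(p_value_from_z(1.96, "two"))
```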

## 5.3.6 confidence intervals

use `norm().ppf`


In [11]:
norm().ppf(1 - .01/2)

2.5758293035489004

## 5.3.7 Calculating the confidence interval for population proportion

- [5.3.7](https://app.sophia.org/spcc/introduction-to-statistics-2-challenge-5-3/7/7595/confidence-interval-for-population-proportion-2)

$$
CI = \hat p \pm z \cdot \sqrt{\frac{\hat p \hat q}{n}}
$$


In [12]:
p_hat = 120 / 500
q_hat = 1 - p_hat
n = 500
# for 90% confidence interval
z = norm().ppf(1 - .10/2)


def ci(p_hat, q_hat, n, z):

    spread = z * sqrt((p_hat * q_hat) / n)

    return (p_hat - spread, p_hat + spread)


ci(p_hat, q_hat, n, z)

(0.20858372631813238, 0.2714162736818676)

In [13]:
n = 635
p = 71/635
q = 1 - p

sqrt(p*q/n)

0.012505703808498162

## Standard error of a sample proportion

Standard error of the sample proportion when the population proportion (and therefore the population std dev) is unknown:

$$
\sqrt{\frac{\hat p \hat q}{n}}
$$

- $\hat p = $ sample proportion of success
- $\hat q = $ complement of $\hat p$
- n = sample size

Remember:
sample means: $ \frac{s}{\sqrt n}$

Standard deviation of the sample proportion when the population proportion $p$ (and therefore the population std dev) is known:

$$
\sqrt{\frac{p q}{n}}
$$


In [14]:
n = 186
p = 57 / n
q = 1 - p

sqrt(p*q / n)

0.03380359319930459

## 5.4.1 T-tests

When the population standard deviation is not known, use the t-statistic. It
has the same form as the z-statistic but replaces the population standard
deviation with the sample standard deviation.

$$
t = \frac{\bar x - \mu}
         {\frac{s}{\sqrt{n}}}
$$

remember: standard error = $\frac{s}{\sqrt{n}}$


In [15]:
s = 635

s/sqrt(n)

46.56045901927967

t distribution is usually a little flatter and with a little more spread.

degrees of freedom: sample size - 1

independence: population at least 10 times as large as the sample


In [16]:
obs = [53, 52.5, 54, 51, 50.5, 49.5, 48, 53, 52, 50]
np.std(obs, ddof=1) / sqrt(len(obs))

0.5918426968188235

## 5.4.2 Calculate T-Test statistic


In [17]:
n = 101
x_bar = 6.9
s = 2.5
mu = 8

(x_bar - mu) / (s/sqrt(n))

-4.42194527329319

## 5.4.3 critical t value


In [18]:
# from scipy.stats import t

# two tailed
t(df=28).ppf(1 - 0.10/2)

1.701130934265931

## 5.4.4

How to find a P-Value from a T-Test statistic


In [19]:
# two-tailed: double the tail area beyond |t|
# (doubling the cdf of a positive t gives a value above 1, which is not a probability)
# sophia answers are rounded
round(2 * t(df=29).sf(abs(1.311)), 4)

0.2001

In [20]:
# left tailed...
t(df=15).cdf(-2.624)

0.009580243498838651

In [21]:
# right tailed
1 - t(df=19).cdf(1.729)

# or
t(df=19).sf(1.729)

0.050012127821915116

## 5.4.5 Confidence interval using T-distribution

$$
CI = \bar x \pm t \cdot \frac{s}{\sqrt n}
$$


In [22]:
n = 101
x_bar = 6.5
s = 2.14
t_crit = 1.984  # named t_crit so it does not shadow scipy.stats.t

# 95% confidence interval

spread = t_crit * s / sqrt(n)
print(
    x_bar - spread,
    x_bar + spread
)

6.077531089929404 6.922468910070596


## 5.4.6 standard error of sample mean

$$
\frac{s}{\sqrt n}
$$

remember to use ddof=1 when calculating the sample std dev


In [23]:
obs = [6, 6.5, 6, 7.5, 5]
np.std(obs, ddof=1)/sqrt(len(obs))

0.40620192023179796

## 5.5.1 ANOVA

Analysis of variance: compares means by analyzing the sample variances from independently selected samples.

compare three or more population means

conditions:

- independent samples from the populations
- each population must be normally distributed
- variances (therefore std dev) of all distributions are the same.

Null hypothesis: these are all the same

Alternate: at least one is different

F-statistic

$$
F = \frac{\text{Variability between the samples}}
         {\text{Variability within each sample}}
$$

- small F-statistic: null hypothesis likely
- large F-statistic: evidence against null. would be rare if null were true
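
To make the ratio concrete, here is a minimal sketch of a one-way F-statistic computed by hand, taking "variability between" as the between-group mean square and "variability within" as the within-group mean square. The function name and the three small samples are made up for illustration.

```python
def f_statistic(groups):
    """One-way ANOVA F: between-group mean square over within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variability between the samples: group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variability within each sample: observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means)
                    for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Made-up samples: the third group sits well above the other two.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
print(f_statistic(groups))
```

With these numbers the between-group mean square is 13 and the within-group mean square is 1, so F comes out at about 13, a large value that counts as evidence against the null.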


## one-way ANOVA, two-way ANOVA

one-way anova:

consider the population means based on one characteristic

two-way anova:

consider the population means based upon two characteristics (factors)


## Chi-Square statistic

categorical data. measures how expected frequency differs from observed frequency.

1. take observed values
2. subtract the expected values
3. square that difference
4. divide by the expected values
5. add up all of those fractions

$$
X^2 = \sum \frac{(O - E)^2}
                {E}
$$
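
The five steps map directly onto a short loop. The sketch below uses a hypothetical helper name, `chi_square_statistic`, and borrows the observed and expected counts from the cell that follows.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    total = 0.0
    for o, e in zip(observed, expected):
        difference = o - e         # steps 1 and 2: observed minus expected
        squared = difference ** 2  # step 3: square the difference
        total += squared / e       # steps 4 and 5: divide by expected, accumulate
    return total


obs = [11, 15, 12, 12]
exp = [12.5, 12.5, 12.5, 12.5]
print(chi_square_statistic(obs, exp))
```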


In [24]:
# from scipy.stats import chisquare
obs = [11, 15, 12, 12]
exp = [12.5, 12.5, 12.5, 12.5]

chisquare(obs, exp, ddof=0)

Power_divergenceResult(statistic=0.72, pvalue=0.8684899681806465)

In [25]:
536*.2

107.2

## 5.5.4 Chi-Square test for goodness of fit

The chi-square distribution is a right-skewed distribution. The chi-square statistic measures how far a sample of categorical data falls from the counts you would expect in each category, given an idea of what the population should look like.

- smaller chi-square value = small discrepancy
- larger chi-square value = large discrepancy

degrees of freedom = number of categories - 1
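
As a sketch of how the degrees of freedom enter the p-value, the example below takes the right-tail area of a chi-square distribution with (categories - 1) degrees of freedom and compares it with `scipy.stats.chisquare` at its default `ddof=0`. The four-category counts are reused from the earlier chi-square statistic example.

```python
from scipy.stats import chi2, chisquare

obs = [11, 15, 12, 12]
exp = [12.5, 12.5, 12.5, 12.5]

statistic = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
df = len(obs) - 1                 # number of categories - 1
p_value = chi2.sf(statistic, df)  # right-tail area of the chi-square distribution

# chisquare's default ddof=0 uses the same k - 1 degrees of freedom
result = chisquare(obs, exp)
print(p_value, result.pvalue)
```

In SciPy, `ddof` is subtracted from k - 1, so passing `ddof=1` would compute the p-value with k - 2 degrees of freedom instead.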


In [26]:
obs = [51, 46, 61, 49, 46, 49, 36, 41, 36, 34, 33, 30]
n = 512
percent_of_population = [
    .08, .07, .08, .08, .08, .08, .09, .09, .09, .09, .08, .09]
exp = [n * val for val in percent_of_population]
print(exp)

[40.96, 35.84, 40.96, 40.96, 40.96, 40.96, 46.08, 46.08, 46.08, 46.08, 40.96, 46.08]


In [27]:
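# note: SciPy subtracts ddof from the default k - 1 degrees of freedom,
# so ddof=1 uses k - 2 here; the rule above (categories - 1) matches ddof=0
chisquare(obs, exp, ddof=1)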

Power_divergenceResult(statistic=34.21737041170634, pvalue=0.00016967314455865548)

In [28]:
obs = [20, 12, 10, 18]
exp = [15, 15, 15, 15]

# note: ddof=1 makes SciPy use k - 2 degrees of freedom; ddof=0 gives categories - 1
chisquare(obs, exp, ddof=1)

Power_divergenceResult(statistic=4.533333333333333, pvalue=0.10365712861152787)

## 5.5.5 Chi-square test for homogeneity

$$
\text{expected value for cell} = \frac{\text{(row total)(column total)}}
                                      {\text{(grand total)}}
$$

degrees of freedom = (number of rows - 1)(number of columns - 1)
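
A minimal sketch of the expected-count formula applied to every cell of a table. The helper names are hypothetical, and the 2x2 layout of the observed counts is an assumption inferred from the arithmetic in the cell that follows.

```python
def expected_counts(table):
    """Expected count for each cell: (row total)(column total) / (grand total)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def degrees_of_freedom(table):
    """(number of rows - 1) * (number of columns - 1)."""
    return (len(table) - 1) * (len(table[0]) - 1)


# Assumed layout of a 2x2 table of observed counts.
table = [[22, 15],
         [18, 19]]
print(expected_counts(table))     # [[20.0, 17.0], [20.0, 17.0]]
print(degrees_of_freedom(table))  # 1
```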


In [29]:
(18 + 19) * (15 + 19) / (22 + 15 + 18 + 19)

17.0

In [30]:
(2-1) * (4-1)

3

In [31]:
n = 100
x_bar = 2.9
sigma = 0.1
mu = 3

(x_bar - mu) / (sigma/sqrt(n))

-10.000000000000009

In [32]:
b_obs = [18, 42, 12]
g_obs = [30, 15, 18]

e_b_obs = sum(b_obs) / len(b_obs)
e_g_obs = sum(g_obs) / len(g_obs)

print(e_b_obs, e_g_obs)

24.0 21.0


In [33]:
norm().sf(abs(-1.5))

norm().cdf(-1.5)

0.06680720126885807

In [34]:
n = 635
p_hat = 71/635
p = .141
q = 1 - p

(p_hat - p) / sqrt(p*q/n)

-2.1134870420287037

In [35]:
obs = [65, 71, 74, 61, 66, 70, 72]

obs_mean = sum(obs)/len(obs)
std_dev = 4.577

# obs_mean - *std_dev, obs_mean + *std_dev

In [36]:
dist = norm(loc=obs_mean, scale=std_dev)

In [37]:
spread = norm.ppf(1 - 0.05/2) * 4.577 / sqrt(7)

obs_mean - spread, obs_mean + spread
# more extreme than plus or minus this value

(65.03794468307048, 71.81919817407238)

In [38]:
row_total = 52 + 63
col_total = 37 + 63
grand_total = 48 + 37 + 52 + 63

row_total * col_total / grand_total

57.5

In [39]:
1 - 0.01

0.99

In [40]:
n = 635

In [41]:
obs = [17, 13, 14, 16]
exp = [15, 15, 15, 15]
# note: ddof=1 makes SciPy use k - 2 degrees of freedom; ddof=0 gives categories - 1
chisquare(obs, exp, ddof=1)

Power_divergenceResult(statistic=0.6666666666666666, pvalue=0.7165313105737892)

In [42]:
n = 635
p = 71/n
q = 1-p

In [43]:
spread = norm().ppf(1-0.05) * sqrt(p*q/n)  # 1.6448536269514722
# norm().cdf(1.644)
p-spread, p+spread

(0.09124097135505821, 0.13238107588903628)

In [44]:
obs = [65, 71, 74, 61, 66, 70, 72]

obs_mean = np.mean(obs)

obs_std = np.std(obs, ddof=1)

d = norm(loc=obs_mean, scale=obs_std)

d.interval(.95)

(59.45707720385795, 77.40006565328491)

In [45]:
from scipy.stats import t

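# note: this interval is scaled by the sample std dev; a confidence interval
# for the mean would use the standard error, scale=obs_std / sqrt(len(obs))
t.interval(confidence=0.95, df=len(obs)-1, loc=obs_mean, scale=obs_std)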

(57.22813320982061, 79.62900964732225)

In [46]:
e = norm.fit(np.array([65, 71, 74, 61, 66, 70, 72]))
e_norm = norm(*e)
e_norm.interval(.95)

(60.12258175033833, 76.73456110680453)

In [47]:
def y_hat(x):
    return 6.6 * x - 260


150 - y_hat(64)

-12.399999999999977

In [48]:
1 / ((1/6) ** 6)

46656.000000000015

In [50]:
a = binom(n=150, p=0.165)
a.var()

20.666249999999998

In [52]:
# actual - estimated, divide by actual. multiply by 100
(100.5 - 95) / 95 * 100

5.7894736842105265

In [53]:
(24 - 35)/6

-1.8333333333333333

In [54]:
obs = [27, 50, 75, 54, 44]
27 / sum(obs)

0.108

In [55]:
.5 * .5

0.25

In [56]:
(17 - 9) * 1.5

12.0

In [57]:
17 + 12, 9 - 12

(29, -3)

In [59]:
n = 125
p_hat = 21/n
p = 0.15
q = 1-p

(p_hat - p) / sqrt(p*q/n)

0.563601861976635

In [60]:
n = 72
x_bar = 36.1
sigma = 0.3
mu = 36

(x_bar - mu) / (sigma/sqrt(n))

2.8284271247462303