# Sampling Distribution 

- Inferential statistics connects sample statistics to population parameters. For example, we estimate the mean of a population from the mean of a sample and report a confidence interval around it. We can then test claims about the population mean with hypothesis tests.

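- A sampling distribution is the probability distribution of a statistic (e.g., the sample mean) over repeated samples of the same size.

- The standard deviation of a statistic's sampling distribution is called its standard error.

- For the sample mean, standard error = $\frac{\sigma}{\sqrt{n}}$, where $\sigma$ is the population standard deviation and $n$ is the sample size.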

- Note that all statistics have sampling distributions, not just the mean. 
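
The standard error formula can be checked with a small simulation. The sketch below is illustrative only: the normal population, its parameters (`loc=10.0`, `sigma=2.0`), the sample size, and the seed are all assumptions chosen for the demonstration, not values from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

sigma = 2.0         # assumed population standard deviation
n = 25              # assumed sample size
n_samples = 20_000  # number of repeated samples

# Draw many samples of size n and record the mean of each one.
samples = rng.normal(loc=10.0, scale=sigma, size=(n_samples, n))
sample_means = samples.mean(axis=1)

# The spread of the sample means is the (empirical) standard error.
empirical_se = sample_means.std(ddof=1)
theoretical_se = sigma / np.sqrt(n)

print(f"empirical SE:   {empirical_se:.4f}")
print(f"theoretical SE: {theoretical_se:.4f}")
```

The two printed values should be close to each other, and the gap should shrink as `n_samples` grows.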

# Central Limit Theorem 

- If the sample size is large enough, the distribution of the sample mean is approximately normal, with mean equal to the population mean $\mu$ and standard deviation equal to the population standard deviation divided by the square root of the sample size, $\frac{\sigma}{\sqrt{n}}$.

- The population itself doesn't need to be normally distributed. Even if the population distribution is skewed, the distribution of the sample mean is approximately normal for large samples.
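
A quick simulation can make the theorem concrete. This is a minimal sketch under stated assumptions: the exponential population (mean 1, standard deviation 1, skewness 2), the sample size of 50, and the seed are picked only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # fixed seed for reproducibility

# Assumed skewed population: exponential with scale 1
# (mean 1, standard deviation 1, skewness 2).
n = 50
n_samples = 20_000
samples = rng.exponential(scale=1.0, size=(n_samples, n))
sample_means = samples.mean(axis=1)

means_center = sample_means.mean()
means_spread = sample_means.std(ddof=1)
means_skew = stats.skew(sample_means)

print(f"mean of sample means: {means_center:.3f} (population mean: 1)")
print(f"sd of sample means:   {means_spread:.3f} (1/sqrt(n): {1 / np.sqrt(n):.3f})")
print(f"skewness of sample means: {means_skew:.3f} (population skewness: 2)")
```

The sample means center on the population mean, their spread is close to $\frac{\sigma}{\sqrt{n}}$, and their skewness is much smaller than that of the population.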


# Confidence Interval 

- The width of a confidence interval depends on the sample size: all else being equal, a larger sample gives a narrower interval.

- Interval estimators are commonly called confidence intervals (CIs).

- Interval endpoints are called the upper and lower confidence limits.

- Notation: the significance level is $\alpha$ and the confidence level is $100(1-\alpha)\%$.

- If the confidence level is 95%, this means $\alpha=1-0.95=0.05$.

- $100(1-\alpha)\%$ CI = $\bar{x} \pm z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}$

  CI = confidence interval

  $\bar{x}$ = sample mean

  $z_{\frac{\alpha}{2}}$ = critical value of the standard normal distribution (about 1.96 for a 95% confidence level)

  ${\sigma}$ = population standard deviation

  ${n}$ = sample size

  Confidence level = $100(1-\alpha)\%$
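
The formula above translates directly into a few lines of Python. The helper name `z_confidence_interval` and the input values are hypothetical, chosen only to illustrate the calculation when $\sigma$ is treated as known.

```python
import numpy as np
from scipy import stats

def z_confidence_interval(xbar, sigma, n, alpha=0.05):
    """Return (lower, upper) for a 100(1 - alpha)% CI, assuming sigma is known."""
    z = stats.norm.ppf(1 - alpha / 2)  # critical value z_{alpha/2}
    margin = z * sigma / np.sqrt(n)    # margin of error
    return xbar - margin, xbar + margin

# Hypothetical values, for illustration only.
lower, upper = z_confidence_interval(xbar=50.0, sigma=10.0, n=100, alpha=0.05)
print(lower, upper)

# Cross-check against scipy's built-in interval.
reference = stats.norm.interval(0.95, loc=50.0, scale=10.0 / np.sqrt(100))
print(reference)
```

Both printed intervals should agree, since `stats.norm.interval` applies the same critical value to the same standard error.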


# t Distribution 

- If the population standard deviation is unknown and has to be estimated from the sample, especially when the sample size is small, we use the t distribution (Student's t distribution) rather than the normal distribution.

- There are actually many different t distributions. The particular form of the t distribution is determined by its degrees of freedom (df). The degrees of freedom refers to the number of independent observations in a set of data.

  df = sample size - 1

- As the sample size increases, the t distribution becomes more similar to the normal distribution.

     $t=\frac{\bar{x}-\mu}{s/\sqrt{n}}$
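
To see how the t distribution approaches the normal distribution, one can compare critical values at the 97.5th percentile for increasing degrees of freedom. The particular df values below are arbitrary choices for illustration.

```python
from scipy import stats

z_crit = stats.norm.ppf(0.975)  # normal critical value for 95% confidence

# Arbitrary degrees of freedom, from very small to large.
dfs = (2, 5, 18, 75, 1000)
t_crits = {df: stats.t.ppf(0.975, df) for df in dfs}

for df, t_crit in t_crits.items():
    print(f"df={df:>4}: t = {t_crit:.4f}  (z = {z_crit:.4f})")
```

The t critical value is always larger than the z value, so t-based intervals are wider, but the difference becomes negligible as df grows.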


# Confidence Intervals Using the Normal Distribution & t Distribution

In [1]:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
sns.get_dataset_names()

['anagrams',
 'anscombe',
 'attention',
 'brain_networks',
 'car_crashes',
 'diamonds',
 'dots',
 'exercise',
 'flights',
 'fmri',
 'gammas',
 'geyser',
 'iris',
 'mpg',
 'penguins',
 'planets',
 'taxis',
 'tips',
 'titanic']

In [3]:
tips = sns.load_dataset("tips")

In [4]:
tips.head()

index,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [5]:
tipsFri = tips[tips["day"] == "Fri"]
tipsFri.head()

index,total_bill,tip,sex,smoker,day,time,size
90,28.97,3.0,Male,Yes,Fri,Dinner,2
91,22.49,3.5,Male,No,Fri,Dinner,2
92,5.75,1.0,Female,Yes,Fri,Dinner,2
93,16.32,4.3,Female,Yes,Fri,Dinner,2
94,22.75,3.25,Female,No,Fri,Dinner,2


In [6]:
xbar = tipsFri.total_bill.mean()
xbar

17.15157894736842

In [7]:
tipsFri.shape

(19, 7)

In [8]:
# sem calculates the standard error of the mean, s / sqrt(n), using the sample standard deviation s (ddof=1)
sem = tipsFri.total_bill.sem()
sem

1.9047607734794163

In [9]:
# or we can calculate it by hand as follows.
tipsFri.total_bill.std() / np.sqrt(len(tipsFri))

1.9047607734794163

In [10]:
# margin of error (1.96 is the rounded z critical value for 95% confidence)
moe = 1.96 * sem
moe

3.733331116019656

In [11]:
upper = xbar + moe
upper

20.884910063388077

In [12]:
lower = xbar - moe
lower

13.418247831348765

In [13]:
# we can calculate the confidence interval directly with scipy.stats as follows.
stats.norm.interval(0.95, loc=xbar, scale=sem)

(13.41831643218411, 20.884841462552732)

In [14]:
stats.norm.interval(0.95, loc=tipsFri.total_bill.mean(), scale=tipsFri.total_bill.sem())

(13.41831643218411, 20.884841462552732)

In [15]:
# we calculate the t-distribution confidence interval using scipy.stats.
stats.t.interval(0.95, df=len(tipsFri)-1, loc=tipsFri.total_bill.mean(), scale=tipsFri.total_bill.sem())

(13.149825056979097, 21.153332837757745)
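
The t interval can also be rebuilt by hand from the critical value, which shows why it is wider than the normal interval. The sketch below copies the Friday mean, standard error, and sample size from the outputs above as literals so that it runs without loading the dataset; the variable names are introduced here for illustration.

```python
from scipy import stats

# Summary values copied from the cell outputs above (Friday total_bill).
xbar_fri = 17.15157894736842
sem_fri = 1.9047607734794163
n_fri = 19

# Critical value at the 97.5th percentile of t with n - 1 = 18 df.
t_crit = stats.t.ppf(0.975, df=n_fri - 1)
moe_t = t_crit * sem_fri  # t-based margin of error

t_interval = (xbar_fri - moe_t, xbar_fri + moe_t)
z_interval = stats.norm.interval(0.95, loc=xbar_fri, scale=sem_fri)

print("t interval:", t_interval)
print("z interval:", z_interval)
```

With only 19 observations the t critical value is noticeably larger than 1.96, so the t interval extends further on both sides.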

In [16]:
tipsSun = tips[tips["day"] == "Sun"]
tipsSun.head()

index,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [17]:
len(tipsSun)

76

In [18]:
stats.t.interval(0.95, df=len(tipsSun)-1, loc=tipsSun.total_bill.mean(), scale=tipsSun.total_bill.sem())

(19.39177370652103, 23.42822629347897)

In [19]:
xbar = tipsSun.total_bill.mean()
xbar

21.41

In [20]:
std = tipsSun.total_bill.std()
std

8.832121828869889

In [21]:
sem = std / np.sqrt(len(tipsSun))
sem

1.0131138555021968

In [22]:
moe = 1.96 * sem
moe

1.9857031567843058

In [23]:
# t critical value at the 97.5th percentile (df = 75), used for the upper limit of a 95% confidence interval.
stats.t.ppf(0.975, 75)

1.9921021536898653

In [24]:
# t critical value at the 2.5th percentile (df = 75), used for the lower limit; it mirrors the value above.
stats.t.ppf(0.025, 75)

-1.9921021536898658

In [25]:
moe = 1.992 * sem
moe

2.018122800160376

In [26]:
# z critical value at the 97.5th percentile, for comparison with the t value above.
stats.norm.ppf(0.975)

1.959963984540054
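
The steps above (mean, standard error, critical value, margin of error) can be collected into one helper. This is a minimal sketch: the function name `mean_confidence_interval`, its parameters, and the `bills` data are hypothetical and are not part of the tips dataset.

```python
import numpy as np
from scipy import stats

def mean_confidence_interval(data, confidence=0.95, use_t=True):
    """Return (lower, upper) confidence limits for the mean of `data`."""
    data = np.asarray(data, dtype=float)
    n = data.size
    xbar = data.mean()
    sem = data.std(ddof=1) / np.sqrt(n)  # sample-based standard error
    tail = 1 - (1 - confidence) / 2      # e.g. 0.975 for 95% confidence
    if use_t:
        crit = stats.t.ppf(tail, df=n - 1)
    else:
        crit = stats.norm.ppf(tail)
    moe = crit * sem                     # margin of error
    return xbar - moe, xbar + moe

# Hypothetical bills, invented for illustration only.
bills = [12.5, 18.0, 22.3, 9.8, 15.4, 30.1, 17.6, 21.0]
t_ci = mean_confidence_interval(bills, use_t=True)
z_ci = mean_confidence_interval(bills, use_t=False)
print("t interval:", t_ci)
print("z interval:", z_ci)
```

For a sample this small the two intervals differ visibly; for a sample the size of the Sunday data (76 rows) the critical values 1.992 and 1.960 are already close.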