## Chap 9 - Estimation and Confidence Intervals

In [1]:
import pandas as pd
import itertools
import matplotlib.pyplot as plt
import seaborn as sns
import scipy as sp
import scipy.stats  # load the stats submodule explicitly so sp.stats is available
sns.set()

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)

### 9.2 - Point Estimate for a Population mean

<div class="alert alert-info">
<ul>
<li>**POINT ESTIMATE** - The statistic, computed from sample information, that estimates the population parameter</li>
</ul>
</div>

### 9.3 - Confidence Intervals for a Population mean

<div class="alert alert-info">
<ul>
<li>**CONFIDENCE INTERVAL** - A range of values constructed from sample data such that the population parameter is likely to occur within that range at a specified probability. The probability is the *level of confidence*</li>
</ul>
</div>

E.g. After sampling annual salary data, we obtain a point estimate of \$85000, and we are 90% confident that the mean annual salary of workers is between \$81000 and \$89000.

For a population with known standard deviation $\sigma$, where $z$ is the standard normal value corresponding to the chosen level of confidence, the confidence interval is
<div class="alert alert-success">
**CONFIDENCE INTERVAL FOR POPULATION MEAN WITH $\sigma$ KNOWN**
$$\bar X \pm z \frac {\sigma} {\sqrt n}$$
</div>

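Typical levels of confidence and their corresponding $z$ values are listed below. For a two-sided interval at level of confidence $c$, each tail holds probability $(1-c)/2$, so $z$ is the $1-(1-c)/2$ quantile of the standard normal distribution (e.g. the 0.975 quantile for 95%):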

In [2]:
# For 90% level of confidence, 
print(sp.stats.norm.ppf(0.950))
# use z = 1.645

# For 95% level of confidence, 
print(sp.stats.norm.ppf(0.975))
# use z = 1.960

# For 99% level of confidence, 
print(sp.stats.norm.ppf(0.995))
# use z = 2.576

1.6448536269514722
1.959963984540054
2.5758293035489004
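
The same lookup can be wrapped in a small helper that turns a level of confidence into an interval. This is a minimal sketch, not the textbook's method: the function name `z_confidence_interval` and the input values are hypothetical and chosen only for illustration.

```python
from math import sqrt
from scipy import stats

def z_confidence_interval(x_bar, sigma, n, confidence=0.95):
    """Two-sided confidence interval for a mean when sigma is known."""
    # Half of the leftover probability goes in each tail.
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    margin = z * sigma / sqrt(n)  # margin of error: z * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin

# Illustrative inputs only (not taken from the text).
low, high = z_confidence_interval(x_bar=100, sigma=15, n=36, confidence=0.95)
print(low, high)
```

Raising `confidence` widens the interval, and raising `n` narrows it, which matches the formula above.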


#### Self Review 9.1
Restaurant daily sales $X$ have an unknown mean $\mu$ and a known standard deviation $\sigma=3000$. For a sample of size $n=40$, the mean sales are $\bar X = 20000$.

a) The population mean, $\mu$ is unknown

b) The best estimate of the population mean is the sample mean, $\hat \mu = \bar X = 20000$. This is the point estimate of the unknown mean.

c)

In [3]:
mu_hat = 20000                    # point estimate (sample mean)
z = sp.stats.norm.ppf(0.995)      # z for a 99% level of confidence
margin = round(z*3000/(40**0.5))  # margin of error: z * sigma / sqrt(n)
print(margin)
print(mu_hat + margin)
print(mu_hat - margin)

1222.0
21222.0
18778.0


The 99% CI of the population mean is $\begin{bmatrix} 18778, 21222\end{bmatrix}$
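
As a cross-check, SciPy can produce the same interval directly with `scipy.stats.norm.interval`. This is a sketch that assumes the level of confidence is passed as the first positional argument and the standard error $\sigma/\sqrt n$ as `scale`; the variable names are illustrative.

```python
from math import sqrt
from scipy import stats

x_bar, sigma, n = 20000, 3000, 40

# First positional argument: level of confidence.
# loc is the point estimate, scale is the standard error of the mean.
low, high = stats.norm.interval(0.99, loc=x_bar, scale=sigma / sqrt(n))
print(round(float(low)), round(float(high)))
```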

d) With a 99% level of confidence, the population mean of daily sales falls between \$18778 and \$21222. That is, about 99% of similarly constructed intervals will contain the population mean.
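
The statement that about 99% of similarly constructed intervals contain the population mean can be checked with a small simulation. This is a sketch under stated assumptions: the population is taken to be normal, and the "true" mean `mu` is a hypothetical value chosen for the simulation, since the real $\mu$ is unknown.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma, n, trials = 20000, 3000, 40, 10_000  # mu is hypothetical

z = stats.norm.ppf(0.995)          # z for a 99% level of confidence
margin = z * sigma / np.sqrt(n)    # same margin of error for every sample

# Each row is one sample of size n; each sample yields one interval.
sample_means = rng.normal(mu, sigma, size=(trials, n)).mean(axis=1)

# An interval covers mu exactly when its sample mean is within the margin.
coverage = (np.abs(sample_means - mu) <= margin).mean()
print(coverage)
```

The printed fraction should land close to 0.99, with small deviations due to sampling noise.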

$\diamond$

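For a population with unknown standard deviation, the sample standard deviation $s$ replaces $\sigma$ and a $t$ value with $n-1$ degrees of freedom replaces $z$. The confidence interval is now: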
<div class="alert alert-success">
**CONFIDENCE INTERVAL FOR POPULATION MEAN WITH $\sigma$ UNKNOWN**
$$\bar X \pm t \frac {s} {\sqrt n}$$
</div>
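
To see what replacing $z$ with $t$ changes, the sketch below compares the 95% critical values for a few degrees of freedom. The particular `df` values are arbitrary and chosen only for illustration.

```python
from scipy import stats

z_95 = stats.norm.ppf(0.975)  # two-sided 95% critical value for z

for df in (4, 9, 29, 99):
    t_95 = stats.t.ppf(0.975, df=df)  # two-sided 95% critical value for t
    print(f"df={df:3d}  t={t_95:.3f}  z={z_95:.3f}")
```

The $t$ value is larger than $z$ for small samples, which widens the interval to reflect the extra uncertainty from estimating $\sigma$ with $s$, and it approaches $z$ as the degrees of freedom grow.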

#### Self Review 9.2

a)

In [4]:
a = pd.Series([4, 1, 2, 2, 1, 2, 2, 1, 0, 3])
print([a.mean(), a.std()])

[1.8, 1.1352924243950933]


The mean and the standard deviation of the sample are $1.8$ and $1.135$ respectively.

b) The population mean, $\mu$ is unknown and is estimated using the sample mean, $\bar X = 1.8$

c) Since $\sigma$ is unknown, use $t$-distribution with `df=10-1`:

In [5]:
t_value = sp.stats.t.ppf(0.975, df=(a.count()-1))
i2 = t_value*(a.std()/(a.count()**0.5))
print([a.mean()-i2, a.mean()+i2])

[0.9878607239333317, 2.612139276066668]


The 95% confidence interval is $\begin{bmatrix}0.988, 2.612\end{bmatrix}$
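
The manual calculation can be cross-checked with `scipy.stats.t.interval`. This is a sketch that assumes the level of confidence is passed as the first positional argument and uses `scipy.stats.sem` (which defaults to the sample standard deviation, `ddof=1`) for the standard error $s/\sqrt n$.

```python
import pandas as pd
from scipy import stats

a = pd.Series([4, 1, 2, 2, 1, 2, 2, 1, 0, 3])

# loc is the sample mean; scale is the standard error s / sqrt(n).
low, high = stats.t.interval(0.95, df=a.count() - 1,
                             loc=a.mean(), scale=stats.sem(a))
print([low, high])
```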

d) The $t$-distribution is used because the population standard deviation is unknown.

e) No. The 95% confidence interval $\begin{bmatrix}0.988, 2.612\end{bmatrix}$ does not contain $0$, so it is not reasonable to conclude that the population mean is $0$.

$\diamond$

**References:**

Lind, Marchal, Wathen (2012). Statistical Techniques in Business and Economics (McGraw-Hill)