## Chap 9 - Estimation and Confidence Intervals

In [1]:
import pandas as pd
import itertools
import matplotlib.pyplot as plt
import seaborn as sns
import scipy as sp
import scipy.stats  # make sure the sp.stats submodule is loaded
sns.set()

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)

### 9.2 - Point Estimate for a Population mean

<div class="alert alert-info">
<ul>
<li>**POINT ESTIMATE** - The statistic, computed from sample information, that estimates the population parameter</li>
</ul>
</div>

### 9.3 - Confidence Intervals for a Population mean

<div class="alert alert-info">
<ul>
<li>**CONFIDENCE INTERVAL** - A range of values constructed from sample data such that the population parameter is likely to occur within that range at a specified probability. The probability is the *level of confidence*</li>
</ul>
</div>

For example, after sampling annual salary data we might obtain a point estimate of \$85000 and be 90% confident that the mean annual salary of workers is between \$81000 and \$89000.

For a population with known standard deviation $\sigma$, and a $z$ value corresponding to the chosen level of confidence, the confidence interval is:
<div class="alert alert-success">
**CONFIDENCE INTERVAL FOR POPULATION MEAN WITH $\sigma$ KNOWN**
$$\bar X \pm z \frac {\sigma} {\sqrt n}$$
</div>

Typical levels of confidence and their corresponding $z$ values are:

In [2]:
# For 90% level of confidence, 
print(sp.stats.norm.ppf(0.950))
# use z = 1.645

# For 95% level of confidence, 
print(sp.stats.norm.ppf(0.975))
# use z = 1.960

# For 99% level of confidence, 
print(sp.stats.norm.ppf(0.995))
# use z = 2.576

1.6448536269514722
1.959963984540054
2.5758293035489004
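
The quantile arguments above (0.950, 0.975, 0.995) come from splitting the excluded probability equally between the two tails: for a level of confidence $c$, the argument is $1 - (1-c)/2$. A minimal sketch of that mapping follows; the helper name `z_critical` is illustrative and not part of SciPy.

```python
from scipy import stats

def z_critical(confidence):
    """Two-sided standard normal critical value for a level of confidence."""
    tail = (1 - confidence) / 2        # probability left in each tail
    return stats.norm.ppf(1 - tail)    # quantile that leaves `tail` above it

for c in (0.90, 0.95, 0.99):
    print(f"{c:.2f} -> z = {z_critical(c):.3f}")
```

The printed values should agree with the 1.645, 1.960 and 2.576 quoted in the cell above.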


#### Self Review 9.1
Restaurant daily sales $X$ have an unknown mean $\mu$ and a known standard deviation $\sigma=3000$. For a sample of size $n=40$, the mean daily sales are $\bar X = 20000$.

a) The population mean, $\mu$, is unknown.

b) The best estimate of the population mean is the sample mean, $\hat \mu = \bar X = 20000$. This is the point estimate of the unknown mean.

c)

In [3]:
mu_hat = 20000
z = sp.stats.norm.ppf(0.995)          # z value for a 99% level of confidence
margin = round(z*3000/(40**0.5))      # margin of error, z * sigma / sqrt(n)
print(margin)
print(mu_hat + margin)
print(mu_hat - margin)

1222.0
21222.0
18778.0


The 99% CI of the population mean is $\begin{bmatrix} 18778, 21222\end{bmatrix}$

d) With a 99% level of confidence, the population mean of daily sales falls between \$18778 and \$21222. About 99% of similarly constructed intervals will contain the population mean: if the sampling were repeated many times in the same way, roughly 99 out of every 100 intervals would capture $\mu$.
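
The "similarly constructed intervals" reading can be checked by simulation. The sketch below assumes, purely for illustration, a normal population whose true mean is 20000 (in practice $\mu$ is unknown) with the known $\sigma=3000$. It draws many samples of $n=40$ and counts how often the 99% interval covers the assumed mean. The seed and the number of trials are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)         # fixed seed so the run is repeatable
true_mu, sigma, n = 20000, 3000, 40    # true_mu is an assumed value for the simulation
trials = 10_000

z = stats.norm.ppf(0.995)              # z value for a 99% level of confidence
margin = z * sigma / np.sqrt(n)        # same half-width for every sample, since sigma is known

# Each row is one sample of size n; take the mean of each row.
sample_means = rng.normal(true_mu, sigma, size=(trials, n)).mean(axis=1)

# An interval covers the true mean when |sample mean - true mean| <= margin.
covered = np.abs(sample_means - true_mu) <= margin
coverage = covered.mean()
print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

The printed share should land close to 0.99. Any single interval either contains $\mu$ or does not; the 99% describes the procedure, not one particular interval.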

$\diamond$

For a population with unknown standard deviation, the confidence interval is now:
<div class="alert alert-success">
**CONFIDENCE INTERVAL FOR POPULATION MEAN WITH $\sigma$ UNKNOWN**
$$\bar X \pm t \frac {s} {\sqrt n}$$
</div>
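
As a rough sketch of the formula above, the function below builds the interval from raw data and cross-checks it against `scipy.stats.t.interval`. The function name `t_confidence_interval` and the sample values are hypothetical and chosen only for illustration.

```python
import numpy as np
from scipy import stats

def t_confidence_interval(sample, confidence=0.95):
    """Confidence interval for the mean when sigma is unknown."""
    x = np.asarray(sample, dtype=float)
    n = x.size
    mean = x.mean()
    s = x.std(ddof=1)                                   # sample standard deviation
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)
    margin = t_crit * s / np.sqrt(n)                    # t * s / sqrt(n)
    return mean - margin, mean + margin

sample = [12.1, 9.8, 11.4, 10.6, 10.9, 11.7]            # hypothetical measurements
low, high = t_confidence_interval(sample, 0.95)

# Cross-check with SciPy's built-in interval (df, location, standard error).
ref_low, ref_high = stats.t.interval(0.95, len(sample) - 1,
                                     loc=np.mean(sample), scale=stats.sem(sample))
print(low, high)
print(ref_low, ref_high)
```

Both lines should print the same pair of endpoints, since `stats.sem` also uses the $n-1$ denominator by default.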

#### Self Review 9.2

a)

In [4]:
a = pd.Series([4, 1, 2, 2, 1, 2, 2, 1, 0, 3])
print([a.mean(), a.std()])

[1.8, 1.1352924243950933]


The mean and the standard deviation of the sample are $1.8$ and $1.135$ respectively.

b) The population mean, $\mu$, is unknown and is estimated using the sample mean, $\bar X = 1.8$.

c) Since $\sigma$ is unknown, use $t$-distribution with `df=10-1`:

In [5]:
t_value = sp.stats.t.ppf(0.975, df=(a.count()-1))   # two-sided 95% critical value, df = n - 1
i2 = t_value*(a.std()/(a.count()**0.5))             # margin of error, t * s / sqrt(n)
print([a.mean()-i2, a.mean()+i2])

[0.9878607239333317, 2.612139276066668]


The 95% confidence interval is $\begin{bmatrix}0.988, 2.612\end{bmatrix}$.

d) The $t$-distribution is used because the population standard deviation is unknown.
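
The practical effect of switching to $t$ is a wider interval for small samples. The short sketch below compares the two-sided 95% critical values; the list of degrees of freedom is an arbitrary illustration.

```python
from scipy import stats

z_crit = stats.norm.ppf(0.975)                  # 95% critical value, sigma known
for df in (4, 9, 29, 99, 999):
    t_crit = stats.t.ppf(0.975, df=df)          # 95% critical value, sigma estimated by s
    print(f"df={df:>3}  t={t_crit:.3f}  z={z_crit:.3f}  ratio={t_crit / z_crit:.3f}")
```

With `df=9`, as in this exercise, the critical value is about 2.262 rather than 1.960, and the two values converge as the sample grows.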

e) No. The value $0$ lies outside the 95% confidence interval, so it is not reasonable to conclude that the population mean is $0$.

$\diamond$

### 9.4 - A Confidence Interval for a Proportion

<div class="alert alert-info">
**PROPORTION** - The fraction / ratio / percent indicating the part of the sample / population having a particular trait of interest
</div>

<div class="alert alert-success">
**SAMPLE PROPORTION**, $p$, where $X$ is the number of "successes", $n$ is the number of items sampled, is:
$$p = \frac X n$$

Note that the population proportion is denoted as $\pi$
</div>

For the central limit theorem to apply, and hence for the normal distribution to be used, both $n\pi$ and $n(1-\pi)$ must be greater than $5$.

Letting $p$ be the point estimate of $\pi$, the confidence interval of $\pi$ is:
<div class="alert alert-success">
**CONFIDENCE INTERVAL FOR POPULATION PROPORTION**

$$p \pm z\sqrt{\frac{p(1-p)}{n}}$$
</div>
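
The interval and the sample-size condition can be wrapped together in a few lines. This is a sketch under stated assumptions: the function name `proportion_ci` and the counts in the usage line are hypothetical, and the check uses the sample proportion $p$ as a stand-in for the unknown $\pi$.

```python
from scipy import stats

def proportion_ci(successes, n, confidence=0.95):
    """Normal-approximation confidence interval for a population proportion."""
    p = successes / n
    # Rule-of-thumb check from the text, with p standing in for the unknown pi.
    if n * p <= 5 or n * (1 - p) <= 5:
        raise ValueError("n*p and n*(1-p) should both exceed 5 for the normal approximation")
    z = stats.norm.ppf(1 - (1 - confidence) / 2)   # two-sided critical value
    margin = z * (p * (1 - p) / n) ** 0.5          # z * sqrt(p(1-p)/n)
    return p - margin, p + margin

low, high = proportion_ci(successes=30, n=200, confidence=0.95)  # hypothetical counts
print(round(low, 3), round(high, 3))
```

Raising an error when the condition fails is one possible design; a warning would also be reasonable if the caller only wants a rough figure.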

#### Self Review 9.3

a)

In [6]:
n3, x3 = 1400, 420
p3 = x3/n3
print(p3)

0.3


$p = 0.3$

b) For a 99% CI,

In [7]:
z3 = sp.stats.norm.ppf(0.995)          # z value for a 99% level of confidence
print(z3)
intv3 = z3*(p3 * (1-p3)/n3)**0.5       # margin of error, z * sqrt(p(1-p)/n)
print(intv3)
print([p3-intv3, p3+intv3])

2.5758293035489004
0.03154733729101684
[0.26845266270898316, 0.3315473372910168]


The 99% confidence interval is $\begin{bmatrix}0.268, 0.332\end{bmatrix}$.

c) About 99% of intervals constructed this way will contain the population proportion $\pi$.

$\diamond$

### 9.5 - Choosing an Appropriate Sample Size

The choice of sample size depends on three factors:

- The margin of error the researcher can tolerate
- The level of confidence desired, e.g. 95% or 99%
- The variation or dispersion of the population being studied

The margin of error, $E$, is the amount that is added to and subtracted from the sample mean or proportion to form the confidence interval. There is a trade-off between margin of error and sample size: a smaller margin of error requires a larger sample, and vice versa.

A larger sample size is required for a higher level of confidence.

<div class="alert alert-success">
**SAMPLE SIZE FOR ESTIMATING POPULATION MEAN**
<p>
Solving for $n$ in $E=z\frac {\sigma} {\sqrt n}$,
$$n = \begin{pmatrix}\frac{z\sigma}{E}\end{pmatrix}^2$$

where $n$ is the size of the sample, $z$ is the standard normal value corresponding to the level of confidence, $\sigma$ is the population standard deviation and $E$ is the maximum allowable error.</p>
</div>
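
Because $E$ sits in the denominator and the ratio is squared, halving the margin of error roughly quadruples the required sample. A minimal sketch follows, with a hypothetical helper `sample_size_mean`; $\sigma=3000$ is borrowed from Self Review 9.1 and the margins of error are made-up values.

```python
import math
from scipy import stats

def sample_size_mean(sigma, margin, confidence=0.95):
    """Smallest n for which z * sigma / sqrt(n) does not exceed the margin of error."""
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / margin) ** 2)    # always round up to a whole observation

for E in (1000, 500, 250):                         # hypothetical margins of error
    print(E, sample_size_mean(sigma=3000, margin=E, confidence=0.99))
```

The result is rounded up rather than to the nearest integer, since rounding down would leave the margin of error slightly larger than requested.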

<div class="alert alert-success">
**SAMPLE SIZE FOR ESTIMATING POPULATION PROPORTION**
<p>
Solving for $n$ in $E=z\sqrt{\frac {\pi (1-\pi)} {n}}$,
$$n = \pi (1-\pi)\begin{pmatrix}\frac{z}{E}\end{pmatrix}^2$$

where $n$ is the size of the sample, $z$ is the standard normal value corresponding to the level of confidence, $\pi$ is the population proportion and $E$ is the maximum allowable error.</p>
</div>
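
The formula needs a value for $\pi$ before any data are collected. One common fallback, assumed here rather than taken from the text above, is $\pi=0.5$: the product $\pi(1-\pi)$ reaches its maximum of $0.25$ there, so the resulting $n$ is the most conservative. The helper name `sample_size_proportion` and its `pi_guess` parameter are hypothetical.

```python
import math
from scipy import stats

def sample_size_proportion(margin, confidence=0.95, pi_guess=0.5):
    """Smallest n for which z * sqrt(pi(1-pi)/n) does not exceed the margin of error."""
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    return math.ceil(pi_guess * (1 - pi_guess) * (z / margin) ** 2)

print(sample_size_proportion(0.05, 0.95))                # conservative choice, pi_guess = 0.5
print(sample_size_proportion(0.05, 0.95, pi_guess=0.3))  # smaller n when a prior estimate exists
```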

#### Self Review 9.4

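Given that $E=0.05$, the population standard deviation is $\sigma=0.279$ and $z=2.576$ for a 99% level of confidence, the sample-size formula for a population mean applies. (The value $0.279$ cannot be a product $\pi(1-\pi)$, which never exceeds $0.25$.)

In [8]:
E4, z4 = 0.05, sp.stats.norm.ppf(0.995)
print(z4)
n4 = (z4*0.279/E4)**2          # n = (z * sigma / E)^2
print(round(n4, 2))

2.5758293035489004
206.59


The minimum sample size is 207.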

$\diamond$

**References:**

Lind, Marchal, Wathen (2012). Statistical Techniques in Business and Economics (McGraw-Hill)