This notebook is an informal reference on confidence intervals for engineers. All content here is taken from Montgomery and Runger, Applied Statistics and Probability for Engineers, 7th ed.

<h1>Confidence Intervals</h1>

The sample mean $\overline{X}$ is a point estimate of the population mean $\mu$. But, it does not communicate the variability of the estimate as represented by its sampling distribution. The standard error of the sample mean $\mathrm{SE}(\overline{X})$ gives some idea of the spread of the distribution. Can we do better?

<b>Confidence intervals</b>, constructed from samples drawn from the population, do better: they provide a range of plausible values for an unknown population parameter $\theta$. A probability $1-\alpha$, called the <b>confidence level</b> or <b>confidence coefficient</b>, is associated with this interval. The confidence level is the probability that the interval-building procedure yields an interval containing the true, but unknown, population parameter $\theta$ that is being estimated. By the rules of probability, the <b>significance level</b> $\alpha$ is the probability that the resulting interval does not contain $\theta$.

For example, suppose 100 samples are drawn and a 95% confidence interval is constructed from each. Then about 5 of these confidence intervals are expected not to contain the true but unknown population parameter. If 100 80% confidence intervals were constructed instead, about 20 of them would be expected not to contain $\theta$. If 100 99% confidence intervals were constructed instead, only about 1 would be expected not to contain $\theta$. So, a 95% confidence interval $6\pm1.3$ is informally read as "I am 95% sure that $\theta$ lies in $[4.7, 7.3]$"; more precisely, the procedure that produced it captures $\theta$ in 95% of repeated samples.
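
The repeated-sampling reading above can be checked with a small simulation. The sketch below is only illustrative: the population values (`mu=6.0`, `sigma=2.0`), the sample size `n=25`, and the helper name `coverage` are assumptions, and it uses the known-$\sigma$ interval $\overline{x} \pm z\frac{\sigma}{\sqrt{n}}$ derived in the next section. It relies on the standard library's `statistics.NormalDist().inv_cdf`, which plays the same role as `stats.norm.ppf`.

```python
import random
from statistics import NormalDist, fmean


def coverage(n_intervals, n=25, mu=6.0, sigma=2.0, alpha=0.05, seed=0):
    """Fraction of simulated known-sigma intervals that contain mu.

    mu, sigma and n are hypothetical values chosen for illustration.
    """
    rng = random.Random(seed)
    z_hi = NormalDist().inv_cdf(1 - alpha / 2)  # same role as stats.norm.ppf
    half_width = z_hi * sigma / n ** 0.5
    hits = 0
    for _ in range(n_intervals):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = fmean(sample)
        if xbar - half_width <= mu <= xbar + half_width:
            hits += 1
    return hits / n_intervals


# With only 100 intervals the observed fraction fluctuates around 0.95.
print("100 intervals:   ", coverage(100))
# With many more intervals it settles close to the nominal level.
print("20000 intervals: ", coverage(20000))
```

The count of misses in any single batch of 100 intervals is itself random, so seeing 3 or 7 misses instead of exactly 5 is consistent with a 95% confidence level.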

To demonstrate how to calculate confidence intervals and explore further nuances, the next section estimates the population mean $\mu$ with a confidence interval centered on the sample mean $\overline{x}$.

<h2>Confidence Interval on the Mean of a Normally Distributed Population</h2>

The goal is to provide an interval estimate of the unknown population mean $\mu$. Suppose that the population follows a normal distribution with a known variance, say $\sigma^2=1.3$. From the Central Limit Theorem, <b>it is known that the probability distribution of sample means is a normal distribution with mean $\mu$ and variance $\frac{\sigma^2}{n}$</b>. This is significant because <b>even if the center or mean $\mu$ of the sampling distribution is not known</b>, it is possible to infer <b>the probability of observing a certain difference between the sample mean $\overline{X}$ and the mean $\mu$</b>. In other words, because the population variance $\sigma^2$ is known, the variance of the sampling distribution $\frac{\sigma^2}{n}$ can be calculated, and with it the probability associated with various differences between the sample mean $\overline{X}$ and the population mean $\mu$.

This inference can be performed as follows:
<ul>
    <li>The population variance $\sigma^2$ is known.</li>
    <li>Calculate the sampling distribution's standard deviation:</li>
    <font size="4">
        $$\sigma_{\overline{X}}=\sqrt{\frac{\sigma^2}{n}}$$
    </font><br>
    <li>Choose a significance level $\alpha$. A common choice is $\alpha=0.05$. This corresponds to a probability or confidence level of $1 - \alpha = 0.95$.</li>
    <li>Use the standard normal distribution to find z-values that appropriately represent $\alpha$.</li>
    <ul>
        <li>Recall that $Z$ is a standardized normal random variable:</li>
        <font size="4">
            \begin{align}
                Z &= \frac{\overline{X} - \mu}{\frac{\sigma}{\sqrt{n}}}\\
            \end{align}
        </font><br>
        <li>For the lower bound, we want a value $z_{lo}$ that satisfies the following. Substitution of how $Z$ is calculated translates probability statements to $\overline{X} - \mu$ terms.</li>
        <font size="4">
            \begin{align}
                P(Z \le z_{lo}) &= \frac{\alpha}{2}\\
                P\left(\frac{\overline{X} - \mu}{\frac{\sigma}{\sqrt{n}}} \le z_{lo}\right) &= \frac{\alpha}{2}\\
                P\left(\overline{X} - \mu \le z_{lo}\frac{\sigma}{\sqrt{n}}\right) &= \frac{\alpha}{2}\\
                P\left(\overline{X} - \mu \le z_{lo}\frac{\sigma}{\sqrt{n}}\right) &= \frac{0.05}{2}\\
                P\left(\overline{X} - \mu \le z_{lo}\frac{\sigma}{\sqrt{n}}\right) &= 0.025\\
            \end{align}
        </font><br>
        <li>For the upper bound, we want a value $z_{hi}$ that satisfies:</li>
        <font size="4">
            \begin{align}
                P(Z \le z_{hi}) &= 1 - \frac{\alpha}{2}\\
                P\left(\frac{\overline{X} - \mu}{\frac{\sigma}{\sqrt{n}}} \le z_{hi}\right) &= 1 - \frac{\alpha}{2}\\
                P\left(\overline{X} - \mu \le z_{hi}\frac{\sigma}{\sqrt{n}}\right) &= 1 - \frac{\alpha}{2}\\
                P\left(\overline{X} - \mu \le z_{hi}\frac{\sigma}{\sqrt{n}}\right) &= 1 - \frac{0.05}{2}\\
                P\left(\overline{X} - \mu \le z_{hi}\frac{\sigma}{\sqrt{n}}\right) &= 0.975\\
            \end{align}
        </font><br>
        <li>Combining the upper and lower bounds with properties of probabilities:</li><br>
        <font size="4">
            \begin{align}
                P\left(z_{lo}\frac{\sigma}{\sqrt{n}} \le \overline{X} - \mu \le z_{hi}\frac{\sigma}{\sqrt{n}}\right) &= P\left(\overline{X} - \mu \le z_{hi}\frac{\sigma}{\sqrt{n}}\right) - P\left(\overline{X} - \mu \le z_{lo}\frac{\sigma}{\sqrt{n}}\right)\\
                x_{lo}&= z_{lo}\frac{\sigma}{\sqrt{n}}\\
                x_{hi}&= z_{hi}\frac{\sigma}{\sqrt{n}}\\
                P\left(x_{lo} \le \overline{X} - \mu \le x_{hi}\right) &=1 - \frac{\alpha}{2} - \frac{\alpha}{2}\\
                &=1 - \alpha\\
                &=1 - 0.05\\
                &=0.95
            \end{align}
        </font><br>
        <li>$\sigma$ and $n$ are known. $z_{lo}$ and $z_{hi}$ can be found through tables or computer software. In Python's scipy.stats module, the operation of finding a value for $z$ such that $P(Z \le z) = p$ for some chosen probability $p$ is called the <b>percent point function</b> ppf(). The actual call as of the time of writing is stats.norm.ppf() for the standard normal distribution.</li><br>
        <font size="4">
            $$\mathrm{ppf \left(\frac{\alpha}{2} \right)} = z_{lo} \mid P(Z \le z_{lo}) = \frac{\alpha}{2} = 0.025$$<br>
            $$\mathrm{ppf \left(1 - \frac{\alpha}{2} \right)} = z_{hi} \mid P(Z \le z_{hi}) = 1 - \frac{\alpha}{2} = 0.975$$
        </font><br>
        <li>After $z_{lo}$ and $z_{hi}$ are plugged in, we now have <b>a range of values $[x_{lo}, x_{hi}]$ centered at 0</b>. Furthermore, the associated probability that $\overline{X} - \mu$ is found in this interval is $1 - \alpha$.</li>
        <li>Refer to confidence_intervals.py for more detailed notes accompanied by a simulated experiment and visual aids.</li>
    </ul>
</ul>
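
The steps listed above can be traced numerically. This is a minimal sketch, assuming $\sigma^2=1.3$ as in the text and a hypothetical sample size `n = 30`. It uses `statistics.NormalDist` from the standard library, whose `inv_cdf` method corresponds to `stats.norm.ppf`, and it confirms that the probability mass between $x_{lo}$ and $x_{hi}$ is $1-\alpha$.

```python
from statistics import NormalDist

alpha = 0.05
sigma = 1.3 ** 0.5  # population standard deviation, from sigma^2 = 1.3
n = 30              # hypothetical sample size

std_normal = NormalDist()                 # mean 0, standard deviation 1
z_lo = std_normal.inv_cdf(alpha / 2)      # P(Z <= z_lo) = 0.025, about -1.96
z_hi = std_normal.inv_cdf(1 - alpha / 2)  # P(Z <= z_hi) = 0.975, about +1.96

standard_error = sigma / n ** 0.5         # sigma of the sampling distribution
x_lo = z_lo * standard_error
x_hi = z_hi * standard_error

# Distribution of (X-bar - mu): normal, centered at 0, with the standard error.
difference = NormalDist(mu=0.0, sigma=standard_error)
prob = difference.cdf(x_hi) - difference.cdf(x_lo)

print(f"z_lo = {z_lo:.4f}, z_hi = {z_hi:.4f}")
print(f"x_lo = {x_lo:.4f}, x_hi = {x_hi:.4f}")
print(f"P(x_lo <= X-bar - mu <= x_hi) = {prob:.4f}")
```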

One last thing - we still do not know what the value of $\mu$ is. What we can do is to center the interval $[x_{lo}, x_{hi}]$ around our sample mean $\overline{x}$. Our confidence interval is then:
<font size="4">
    $$[\overline{x} + x_{lo}, \overline{x} + x_{hi}]$$
</font>

Because we used the standard normal distribution (which is centered at 0), the interval $[x_{lo}, x_{hi}]$ is also centered at 0. So we can express the interval as $[-x, x]$ where $x=x_{hi}=-x_{lo}$. Our confidence interval then becomes:<br>
<font size="4">
    $$\overline{x} \pm x$$
</font>

<h3>Summary</h3>

In summary, $[\overline{x} + x_{lo}, \overline{x} + x_{hi}]$ is an interval that contains the true population mean $\mu$, $(1-\alpha)\times 100 \%$ of the time where: <br><br>
<font size="4">
    $$x_{lo} = z_{lo}\frac{\sigma}{\sqrt{n}}$$<br>
    $$x_{hi} = z_{hi}\frac{\sigma}{\sqrt{n}}$$<br>
    $$z_{lo} = \mathrm{ppf}\left(\frac{\alpha}{2}\right)$$<br>
    $$z_{hi} = \mathrm{ppf}\left(1 - \frac{\alpha}{2}\right)$$<br>
    $$P(\overline{x} + x_{lo} \le \mu \le \overline{x} + x_{hi}) = 1 - \alpha$$
</font><br>

In short, a two-sided confidence interval with known $\sigma$ is given by:<br><br>
<font size = "4">
    $$\overline{x} + \mathrm{ppf}\left(\frac{\alpha}{2}\right)\frac{\sigma}{\sqrt{n}} \le \mu \le \overline{x} + \mathrm{ppf}\left(1 - \frac{\alpha}{2}\right)\frac{\sigma}{\sqrt{n}}$$
</font>
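
As an illustration of the summary, the two-sided interval with known $\sigma$ can be wrapped in a small helper. The function name `z_confidence_interval` and the measurements are hypothetical; `NormalDist().inv_cdf` from the standard library is used where `stats.norm.ppf` would be used with scipy.

```python
from statistics import NormalDist, fmean


def z_confidence_interval(sample, sigma, alpha=0.05):
    """Two-sided interval for mu when the population sigma is known."""
    n = len(sample)
    xbar = fmean(sample)
    z_hi = NormalDist().inv_cdf(1 - alpha / 2)  # upper quantile, 1 - alpha/2
    margin = z_hi * sigma / n ** 0.5            # x_hi; x_lo is its negative
    return xbar - margin, xbar + margin


measurements = [5.1, 6.3, 5.8, 6.9]  # hypothetical data, sigma assumed known
lo, hi = z_confidence_interval(measurements, sigma=1.0)
print(f"95% interval for mu: [{lo:.3f}, {hi:.3f}]")
```

Only the upper quantile is computed because, by symmetry of the standard normal distribution, $z_{lo} = -z_{hi}$.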

<h2>Large-Sample Confidence Intervals</h2>

Often, the population standard deviation $\sigma$ is not known. But to construct confidence intervals, $x_{lo}$ and $x_{hi}$ both require $\sigma.$ What can be done here?

An easy approximation is to <b>use the sample standard deviation $s$ instead of $\sigma$</b>. But this estimate is only good <b>when the sample size $n$ is reasonably large</b>. The Central Limit Theorem usually works well when $n\ge30$. To be safe, use this approximation when the sample size is, say, $n \ge 40$ to compensate for the extra variability introduced by using the sample standard deviation $s$ instead of $\sigma$:<br><br>

<font size = "4">
    $$\overline{x} + \mathrm{ppf}\left(\frac{\alpha}{2}\right)\frac{s}{\sqrt{n}} \le \mu \le \overline{x} + \mathrm{ppf}\left(1 - \frac{\alpha}{2}\right)\frac{s}{\sqrt{n}}$$
</font>
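
A minimal sketch of the large-sample interval follows. The helper name `large_sample_interval`, the hard cutoff at $n \ge 40$, and the simulated skewed data are assumptions made for illustration; the only change from the known-$\sigma$ case is that `statistics.stdev` (the sample standard deviation $s$) replaces $\sigma$.

```python
import random
from statistics import NormalDist, fmean, stdev


def large_sample_interval(sample, alpha=0.05):
    """Approximate two-sided interval for mu using s in place of sigma."""
    n = len(sample)
    if n < 40:
        # Rule of thumb from the text; enforcing it here is a design choice.
        raise ValueError("large-sample approximation is intended for n >= 40")
    xbar = fmean(sample)
    s = stdev(sample)  # sample standard deviation (n - 1 in the denominator)
    z_hi = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z_hi * s / n ** 0.5
    return xbar - margin, xbar + margin


rng = random.Random(42)
sample = [rng.expovariate(1 / 3.0) for _ in range(50)]  # hypothetical data
lo, hi = large_sample_interval(sample)
print(f"approximate 95% interval for mu: [{lo:.3f}, {hi:.3f}]")
```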

<h2>One-Sided vs Two-Sided Confidence Intervals</h2>

The confidence interval calculated above is a <b>two-sided</b> interval. It asserts that the true but unknown population mean $\mu$ is between two values with a certain probability $1 - \alpha$. In contrast, a <b>one-sided</b> confidence interval asserts that the population mean $\mu$ is either less than or greater than some value with a certain probability $1 - \alpha$.

An <b>upper-confidence bound</b> asserts that the population mean $\mu$ is less than or equal to some value with a certain probability $1 - \alpha$:<br><br>
<font size="4">
    \begin{align}
        \mu &\le \overline{x} + x_{hi}\\\\
        x_{hi} &= z_{hi}\frac{\sigma}{\sqrt{n}}\\\\
        P(Z \le z_{hi}) &= 1 - \alpha\\\\
        \mu &\le \overline{x} + \mathrm{ppf}\left(1 - \alpha\right)\frac{\sigma}{\sqrt{n}}\\\\
    \end{align}
</font>

A <b>lower-confidence bound</b> asserts that the population mean $\mu$ is greater than or equal to some value with a certain probability $1 - \alpha$:<br><br>
<font size="4">
    \begin{align}
        \mu &\ge \overline{x} + x_{lo}\\\\
        x_{lo} &= z_{lo}\frac{\sigma}{\sqrt{n}}\\\\
        P(Z \ge z_{lo}) &= 1 - \alpha\\\\
        \mu &\ge \overline{x} + \mathrm{ppf}\left(\alpha\right)\frac{\sigma}{\sqrt{n}}\\\\
    \end{align}
</font>
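
The two one-sided bounds can be sketched as follows. The function names and the numbers in the usage lines are hypothetical. Note that the full $\alpha$ goes into a single tail, so the quantiles are $\mathrm{ppf}(1-\alpha)$ and $\mathrm{ppf}(\alpha)$ rather than the $\frac{\alpha}{2}$ versions; `NormalDist().inv_cdf` stands in for `stats.norm.ppf`.

```python
from statistics import NormalDist


def upper_confidence_bound(xbar, sigma, n, alpha=0.05):
    """Value u such that mu <= u with confidence 1 - alpha."""
    z_hi = NormalDist().inv_cdf(1 - alpha)  # all of alpha in the upper tail
    return xbar + z_hi * sigma / n ** 0.5


def lower_confidence_bound(xbar, sigma, n, alpha=0.05):
    """Value l such that mu >= l with confidence 1 - alpha."""
    z_lo = NormalDist().inv_cdf(alpha)      # negative for alpha < 0.5
    return xbar + z_lo * sigma / n ** 0.5


# Hypothetical numbers: sample mean 10, known sigma 2, sample size 16.
upper = upper_confidence_bound(10.0, 2.0, 16)
lower = lower_confidence_bound(10.0, 2.0, 16)
print(f"mu <= {upper:.4f} with 95% confidence")
print(f"mu >= {lower:.4f} with 95% confidence")
```

For the same $\alpha$, each one-sided bound sits closer to $\overline{x}$ than the corresponding end of the two-sided interval, because $\mathrm{ppf}(1-\alpha) < \mathrm{ppf}\left(1-\frac{\alpha}{2}\right)$.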

<h2>Confidence Intervals with Unknown Variance</h2>

To construct confidence intervals without knowing the population variance, the above procedure still works with a few small changes:
<ul>
    <li>Instead of using $\sigma$ to calculate the $Z$ variable, use the sample standard deviation $s$ to calculate the $T$ variable:</li><br>
    <font size="4">
        $$T=\frac{\overline{X} - \mu}{\frac{s}{\sqrt{n}}}$$<br>
    </font>
    <li>Instead of following the standard normal distribution, the $T$ variable follows Student's $t$ distribution with $n-1$ degrees of freedom.</li>
    <ul>
        <li>In other words, if we were to draw many, many samples from the population, calculate the value of $T$ associated with each sample, and plot a histogram of those values, we would find that the shape of the histogram closely resembles the $t$ distribution.</li>
    </ul>
    <li>In the bounds $x_{lo}$ and $x_{hi}$, likewise replace $\sigma$ with $s$ and the z-values with the corresponding quantiles of the $t$ distribution.</li>
    <li>In Python's scipy.stats module, stats.t is used to invoke the $t$ distribution.</li>
    <li>See confidence_intervals.py</li>
</ul>
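
The changes listed above amount to swapping the normal quantile for a $t$ quantile. The sketch below assumes scipy is installed and uses `stats.t.ppf` with `df = n - 1`; the helper name `t_confidence_interval` and the ten measurements are hypothetical, and it is not meant to reproduce confidence_intervals.py.

```python
from statistics import fmean, stdev

from scipy import stats


def t_confidence_interval(sample, alpha=0.05):
    """Two-sided interval for mu with unknown sigma (t distribution)."""
    n = len(sample)
    xbar = fmean(sample)
    s = stdev(sample)                            # sample standard deviation
    t_hi = stats.t.ppf(1 - alpha / 2, df=n - 1)  # t quantile, n - 1 d.o.f.
    margin = t_hi * s / n ** 0.5
    return xbar - margin, xbar + margin


# Hypothetical small sample (n = 10).
sample = [9.8, 10.2, 10.4, 9.9, 10.0, 10.1, 9.7, 10.3, 10.0, 9.6]
lo, hi = t_confidence_interval(sample)
print(f"95% t-interval for mu: [{lo:.3f}, {hi:.3f}]")

# For comparison: the t quantile is larger than the normal quantile (~1.96),
# so the t-interval is wider for the same data.
print("t quantile:", stats.t.ppf(0.975, df=len(sample) - 1))
print("z quantile:", stats.norm.ppf(0.975))
```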

<h2>Confidence Intervals for a Population Proportion</h2>

Sometimes, the situation is such that a sample is drawn and a certain number of items $X$ in the sample belong to a class of interest. The proportion of items in the sample that belong to the class of interest is $\hat{P}=\frac{X}{n}$. The task now then is to estimate what proportion $p$ of the population belongs to this class of interest.

The number of items $X$ in the sample that belong to the class of interest follows a binomial distribution with parameters $n$ and $p$, where $n$ is the sample size and $p$ is the probability of observing an item belonging to the class of interest. The mean and variance of the binomial distribution are:<br><br>
<font size="4">
    $$\mu=E(X)=np$$<br>
    $$\sigma^2=V(X)=np(1-p)$$
</font>

If $n$ is large and both $np$ and $n(1 - p)$ are at least 5, then the following variable $Z$ is approximately standard normal. This amounts to approximating the sampling distribution of $\hat{P}$ with a normal distribution centered at $p$ with standard deviation $\sqrt{\frac{p\left(1 - p\right)}{n}}$.<br><br>
<font size="4">
    $$Z=\frac{X - np}{\sqrt{np(1-p)}}=\frac{\hat{P}-p}{\sqrt{\frac{p(1-p)}{n}}}$$
</font><br>

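Constructing a confidence interval then proceeds exactly as in the case of a standard normal variable. The only difference is that the standard deviation of the sampling distribution (the standard error) takes the form below. Because $p$ is unknown, the sample proportion $\hat{p}$ is substituted for $p$ when the interval is computed:<br><br>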
<font size="4">
    $$\frac{\sigma}{\sqrt{n}}=\sqrt{\frac{p(1-p)}{n}}$$
</font><br>
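
A minimal sketch of the proportion interval follows. The helper name `proportion_interval` and the counts (10 items of interest in a sample of 85) are hypothetical. Two assumptions are made explicit in the code: the unknown $p$ is replaced by $\hat{p}$ in the standard error, and the $np \ge 5$, $n(1-p) \ge 5$ check is applied using $\hat{p}$.

```python
from statistics import NormalDist


def proportion_interval(x, n, alpha=0.05):
    """Approximate two-sided interval for a population proportion p."""
    p_hat = x / n
    # Assumption: the normal-approximation check is done with p_hat,
    # because the true p is unknown.
    if n * p_hat < 5 or n * (1 - p_hat) < 5:
        raise ValueError("normal approximation is not recommended here")
    z_hi = NormalDist().inv_cdf(1 - alpha / 2)
    # Assumption: p_hat is plugged in for p in the standard error.
    standard_error = (p_hat * (1 - p_hat) / n) ** 0.5
    margin = z_hi * standard_error
    return p_hat - margin, p_hat + margin


# Hypothetical data: 10 items of interest in a sample of 85.
lo, hi = proportion_interval(10, 85)
print(f"95% interval for p: [{lo:.4f}, {hi:.4f}]")
```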

<h2>Choice of Sample Size</h2>

<h3>For a Two-Sided Confidence Interval</h3>

Recall that when we draw a sample and calculate its sample mean, $100\left(1 - \alpha\right)$% of the time the sample mean falls within a two-sided interval centered on the population mean $\mu$, between $\mu + x_{lo}$ and $\mu + x_{hi}$:
<font size = "4">
    $$\mu + x_{lo} \le \overline{X} \le \mu + x_{hi}$$<br>
    $$\mu + z_{lo}\frac{\sigma}{\sqrt{n}} \le \overline{X} \le \mu + z_{hi}\frac{\sigma}{\sqrt{n}}$$<br>
    $$z_{lo}\frac{\sigma}{\sqrt{n}} \le \overline{X} -\mu \le z_{hi}\frac{\sigma}{\sqrt{n}}$$<br>
</font><br>

Because of the symmetrical nature of the standard normal distribution:<br><br>
<font size = "4">
    $$-\mathrm{ppf}\left(\frac{\alpha}{2}\right) = \mathrm{ppf}\left(1 - \frac{\alpha}{2}\right)$$<br>
    $$\left|z_{lo}\right| = z_{hi}$$<br>
</font><br>

Since $z_{lo} = -z_{hi}$, the two-sided inequality can be rewritten with an absolute value:<br><br>
<font size = "4">
    $$-z_{hi}\frac{\sigma}{\sqrt{n}} \le \overline{X} -\mu \le z_{hi}\frac{\sigma}{\sqrt{n}}$$<br>
    $$\left|\overline{X} - \mu\right| \le z_{hi}\frac{\sigma}{\sqrt{n}}$$<br>
    $$\left|\overline{X} - \mu\right| \le \mathrm{ppf}\left(1 - \frac{\alpha}{2}\right)\frac{\sigma}{\sqrt{n}}$$<br>
</font><br>

This implies that, with $100\left(1 - \alpha\right)$% confidence, the difference $\left|\overline{X} - \mu\right|$ is no more than $\mathrm{ppf}\left(1 - \frac{\alpha}{2}\right)\frac{\sigma}{\sqrt{n}}$. Therefore:<br><br>
<font size="4">
    $$\mathrm{max}\left(\left|\overline{X} - \mu\right|\right) = \mathrm{ppf}\left(1 - \frac{\alpha}{2}\right)\frac{\sigma}{\sqrt{n}}$$<br>
    $$\sqrt{n} = \mathrm{ppf}\left(1 - \frac{\alpha}{2}\right)\frac{\sigma}{\mathrm{max}\left(\left|\overline{X} - \mu\right|\right)}$$<br>
    $$n = \left(\mathrm{ppf}\left(1 - \frac{\alpha}{2}\right)\frac{\sigma}{\mathrm{max}\left(\left|\overline{X} - \mu\right|\right)}\right)^2$$<br>
</font><br>

In other words, we can calculate the sample size $n$ that guarantees, with $100\left(1 - \alpha\right)$% confidence, that the difference between the sample mean and the true population mean will be no more than a chosen value $\mathrm{max}\left(\left|\overline{X} - \mu\right|\right)$. Because $n$ must be an integer, round the result up to the next whole number.
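
The sample-size formula can be sketched as below. The function name `sample_size_two_sided`, the parameter name `max_error` (standing for the chosen $\mathrm{max}\left(\left|\overline{X} - \mu\right|\right)$), and the example numbers are hypothetical; the result is rounded up with `math.ceil` so that the requested error is not exceeded.

```python
import math
from statistics import NormalDist


def sample_size_two_sided(sigma, max_error, alpha=0.05):
    """Smallest n with ppf(1 - alpha/2) * sigma / sqrt(n) <= max_error."""
    z_hi = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil((z_hi * sigma / max_error) ** 2)


# Hypothetical requirement: sigma = 2, error of at most 0.5 at 95% confidence.
n_required = sample_size_two_sided(sigma=2.0, max_error=0.5)
print("required sample size:", n_required)

# Check: the margin actually achieved with that n is within the target.
achieved = NormalDist().inv_cdf(0.975) * 2.0 / n_required ** 0.5
print(f"achieved margin: {achieved:.4f}")
```

Halving the allowed error roughly quadruples the required sample size, since $n$ grows with the square of $\frac{\sigma}{\mathrm{max}\left(\left|\overline{X} - \mu\right|\right)}$.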

<h3>For a One-Sided Lower-Bound Confidence Interval</h3>

A one-sided lower-bound confidence interval around the population mean $\mu$ gives a range where $100\left(1 - \alpha\right)$% of sample means can be found. This range is bounded by some lower value:<br><br>

<font size = "4">
    $$\overline{X} \ge \mu + x_{lo}$$<br>
    $$\overline{X} \ge \mu + z_{lo}\frac{\sigma}{\sqrt{n}}$$<br>
    $$\overline{X} - \mu \ge z_{lo}\frac{\sigma}{\sqrt{n}}$$<br>
</font><br>

This means that, with $1 - \alpha$ confidence, the difference between the sample mean and the population mean is no less than some (negative) value:<br><br>
<font size = "4">
    $$\mathrm{min}\left(\overline{X} - \mu\right) = z_{lo}\frac{\sigma}{\sqrt{n}}$$<br>
    $$\sqrt{n} = z_{lo}\frac{\sigma}{\mathrm{min}\left(\overline{X} - \mu\right)}$$<br>
    $$n = \left(z_{lo}\frac{\sigma}{\mathrm{min}\left(\overline{X} - \mu\right)}\right)^2$$<br>
    $$n = \left(\frac{\sigma\mathrm{ppf}\left(\alpha\right)}{\mathrm{min}\left(\overline{X} - \mu\right)}\right)^2$$<br>
</font><br>

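The above equation gives the sample size that guarantees, with $1 - \alpha$ confidence, that the sample mean falls below the population mean by no more than the chosen amount $\left|\mathrm{min}\left(\overline{X} - \mu\right)\right|$. For $\alpha < 0.5$, both $\mathrm{ppf}\left(\alpha\right)$ and $\mathrm{min}\left(\overline{X} - \mu\right)$ are negative, so their ratio is positive. Round $n$ up to the next whole number.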

<h3>For a One-Sided Upper-Bound Confidence Interval</h3>

Similarly, a sample size can be calculated that guarantees, with $1-\alpha$ confidence, that the sample mean exceeds the population mean by no more than some chosen value:

<font size = "4">
    $$\overline{X} \le \mu + x_{hi}$$<br>
    $$\overline{X} \le \mu + z_{hi}\frac{\sigma}{\sqrt{n}}$$<br>
    $$\overline{X} - \mu \le z_{hi}\frac{\sigma}{\sqrt{n}}$$<br>
    $$\mathrm{max}\left(\overline{X} - \mu\right) = z_{hi}\frac{\sigma}{\sqrt{n}}$$<br>
    $$\sqrt{n} = z_{hi}\frac{\sigma}{\mathrm{max}\left(\overline{X} - \mu\right)}$$<br>
    $$n = \left(z_{hi}\frac{\sigma}{\mathrm{max}\left(\overline{X} - \mu\right)}\right)^2$$<br>
    $$n = \left(\frac{\sigma\mathrm{ppf}\left(1 - \alpha\right)}{\mathrm{max}\left(\overline{X} - \mu\right)}\right)^2$$<br>
</font><br>
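
Both one-sided sample-size formulas can be sketched with a single helper. The function name `sample_size_one_sided` and the convention that the sign of `bound` selects the case (positive for $\mathrm{max}\left(\overline{X} - \mu\right)$, negative for $\mathrm{min}\left(\overline{X} - \mu\right)$) are assumptions made for this illustration; the result is rounded up to a whole number.

```python
import math
from statistics import NormalDist


def sample_size_one_sided(sigma, bound, alpha=0.05):
    """Sample size for a one-sided bound on X-bar - mu.

    bound > 0 is read as max(X-bar - mu) (upper-bound case);
    bound < 0 is read as min(X-bar - mu) (lower-bound case).
    """
    if bound == 0:
        raise ValueError("bound must be nonzero")
    if bound > 0:
        z = NormalDist().inv_cdf(1 - alpha)  # ppf(1 - alpha), positive
    else:
        z = NormalDist().inv_cdf(alpha)      # ppf(alpha), negative
    # z and bound share the same sign, so the ratio is positive.
    return math.ceil((z * sigma / bound) ** 2)


# Hypothetical requirement: sigma = 2, one-sided error of 0.5, alpha = 0.05.
n_upper = sample_size_one_sided(sigma=2.0, bound=0.5)
n_lower = sample_size_one_sided(sigma=2.0, bound=-0.5)
print("upper-bound case:", n_upper)
print("lower-bound case:", n_lower)
```

By symmetry of the standard normal distribution, $\mathrm{ppf}\left(\alpha\right) = -\mathrm{ppf}\left(1 - \alpha\right)$, so the two cases give the same sample size for bounds of equal magnitude.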