This notebook is a reference and quick-start guide to basic statistical concepts for engineers. All content here is taken from Montgomery and Runger, Applied Statistics and Probability for Engineers, 7th ed.

<h1>Basic Terminology</h1>

A <b>random experiment</b> can be thought of as executing a trial or procedure. The outcome of this trial varies with each execution even when the trial is performed in the same exact manner every time.

Repetitions of a random experiment will produce different outcomes. The set of all possible outcomes is called the <b>sample space</b> of the random experiment.

A <b>discrete</b> sample space is one where the outcomes are countable (even if there are infinitely many of them). An example is the set of all ways to arrange the numbers 0 - 9. A <b>continuous</b> sample space is one where the possibilities are represented by an interval of real numbers. An example is the length of a rod (in m) or the mass of a ball (in kg).

An <b>event</b> is simply a subset of a sample space.<br>
The <b>union</b> of two events $E_1 \cup E_2$ is the set of all outcomes belonging to either of the two events.<br>
The <b>intersection</b> of two events $E_1 \cap E_2$ is the set of all outcomes common between the two events.<br>
The <b>complement</b> of an event $'E_1$ is the set of all outcomes not in $E_1$.<br>
Two events are <b>mutually exclusive</b> when they share no common outcomes: $E_1 \cap E_2 = \emptyset$.

A <b>random variable</b> can be thought of as a function that produces varying values with different probabilities. For example, the random variable $X$ can be used to denote the measured length of a rod. After every measurement, it may be found that it is more likely to encounter values between $4.7 \le X \le 5.2$ m than it is to encounter values $X \le 4.7$ m. Mathematically:
<center><font size="4">$P(4.7 \le X \le 5.2) > P(X \le 4.7)$</font></center><br>
The random variable $X$ can take on many varying values. The variable $x$ will be used to represent all possible values that $X$ can take.

<h1>Counting Techniques</h1><br>
Counting techniques are useful in calculating probabilities.

<b>Multiplication Rule</b><br>
If there are $k$ steps, $n_1$ ways to complete step 1, $n_2$ ways to complete step 2, $n_3$ ways to complete step 3, ..., the number of different ways to complete $k$ steps is:<br>

<center><font size="4">
$$N = n_1 \times n_2 \times n_3 \times\ldots\times n_k$$
</font></center>

<b>Permutations</b><br>
Permutations of $n$ unique objects can be thought of as different ways of arranging $n$ objects. For example, all the permutations of the numbers 0, 1, and 2 chosen $r=2$ at a time are:<br>
0 1<br>
0 2<br>
1 0<br>
1 2<br>
2 0<br>
2 1<br>
The number of different permutations of $n$ unique objects chosen $r$ objects at a time is:<br>
<center><font size="4">$$N = P^n_r = \frac{n!}{(n - r)!}$$</font></center>
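
The permutation count can be checked by brute force. The snippet below is a minimal sketch using only the Python standard library; the variable names are illustrative and not part of the notes.

```python
from itertools import permutations
from math import factorial, perm

items = [0, 1, 2]
r = 2

# Enumerate every ordered arrangement of r items drawn from `items`.
arrangements = list(permutations(items, r))

# Count the arrangements with n! / (n - r)! and with math.perm (Python 3.8+).
n = len(items)
count_formula = factorial(n) // factorial(n - r)
count_builtin = perm(n, r)

print(arrangements)
print(count_formula, count_builtin)
```

The enumeration reproduces the six rows listed above, and both counts agree with the formula.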

<b>Permutations with Similar Objects</b><br>
There are $k$ types of objects. There are:<br>
$n_1$ type 1 objects<br>
$n_2$ type 2 objects<br>
$n_3$ type 3 objects<br>
$\ldots$<br>
$n_k$ type k objects<br>
The total number of objects $n = n_1 + n_2 + n_3 +\ldots+n_k$. The number of permutations in this case is:<br>
<center><font size="4">$$N = \frac{n!}{n_1!n_2!n_3! \ldots n_k!}$$</font></center>
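
As a rough check of this formula, the sketch below counts the distinct arrangements of the letters in the word LEVEL (an example chosen here for illustration: two L's, two E's, one V) both with the formula and by enumeration. The helper name `count_arrangements` is hypothetical.

```python
from collections import Counter
from itertools import permutations
from math import factorial

def count_arrangements(objects):
    """Distinct orderings of a sequence that may contain repeated objects."""
    total = factorial(len(objects))          # n!
    for n_i in Counter(objects).values():    # divide by n_1! n_2! ... n_k!
        total //= factorial(n_i)
    return total

word = "LEVEL"
by_formula = count_arrangements(word)
# Brute force: generate all 5! orderings and keep only the distinct ones.
by_enumeration = len(set(permutations(word)))

print(by_formula, by_enumeration)
```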

<b>Combinations</b><br>
A combination is like a permutation except that order doesn't matter. For example, the combination 0 1 2 is the same as 1 0 2. The number of combinations of $n$ objects chosen $r$ at a time is:<br>
<center><font size="4">$$C^n_r={n \choose r}=\frac{n!}{(n-r)!r!}$$</font></center>
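
A short sketch of the same idea in Python, reusing the numbers 0, 1, and 2 chosen $r=2$ at a time. `itertools.combinations` and `math.comb` are standard-library functions; the variable names are illustrative.

```python
from itertools import combinations
from math import comb

items = [0, 1, 2]
r = 2

# Each subset appears once; (0, 1) and (1, 0) are the same combination.
subsets = list(combinations(items, r))

# n! / ((n - r)! r!) via math.comb (Python 3.8+).
count = comb(len(items), r)

print(subsets, count)
```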

<h1>Probability Rules</h1>

If there are $N$ possible outcomes, all <b>equally likely</b> to occur, the probability of each outcome occurring is:<br>
<center><font size="4">$$P(O_n) = \frac{1}{N}$$</font></center>

The <b>probability of an event $E$</b> (a subset of all possible outcomes) is the sum of the probabilities of the outcomes belonging to $E$. If $k$ outcomes belong to $E$, the probability of $E$ is:<br>
<center><font size="4">$$P(E)=\sum_{i=1}^k{P(O_i)}$$</font></center><br>
Furthermore, the probability of any event must be between 0 and 1:<br><br>
<center><font size="4">$$0 \le P(E) \le 1$$</font></center><br>

The <b>probability of the sample space $S$</b> must equal 1. In other words, the sum of the probabilities of all possible outcomes must equal 1.<br><br>
<center><font size="4">$$P(S)=1$$</font></center>

For two <b>mutually exclusive</b> events $E_1$ and $E_2$:<br>
<center><font size="4">
    $$P(E_1 \cup E_2)=P(E_1)+P(E_2)$$<br>
    $$P(E_1 \cap E_2)=P(\emptyset)=0$$
    </font></center>

If two events $E_1$ and $E_2$ have common outcomes (have some overlap in their Venn diagram), $E_1 \cap E_2 \ne \emptyset$, the <b>probability of their union</b> is:<br><br>
<center><font size="4">$$P(E_1 \cup E_2)=P(E_1)+P(E_2)-P(E_1 \cap E_2)$$</font></center><br>

The intuition behind this is that if $P(E_1)$ and $P(E_2)$ were simply added, their overlap $(E_1 \cap E_2)$ would be counted twice. So, it is appropriate to subtract the probability of their overlap $P(E_1 \cap E_2)$ once.<br>

Extending this to 3 events $E_1$, $E_2$, and $E_3$:<br><br>
<center><font size="4">
    $$P(E_1 \cup E_2 \cup E_3)=P(E_1)+P(E_2)+P(E_3)-P(E_1 \cap E_2)-P(E_1 \cap E_3)-P(E_2 \cap E_3)+P(E_1 \cap E_2 \cap E_3)$$
    </font></center>
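
The three-event formula can be verified on a small sample space. The sketch below assumes a single roll of a fair die and three events chosen here purely for illustration; exact fractions avoid rounding issues.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

E1 = {2, 4, 6}   # roll is even
E2 = {4, 5, 6}   # roll is greater than 3
E3 = {2, 3, 5}   # roll is prime

# Probability of the union, computed directly from the set union.
direct = prob(E1 | E2 | E3)

# The same probability from the inclusion-exclusion formula.
inclusion_exclusion = (
    prob(E1) + prob(E2) + prob(E3)
    - prob(E1 & E2) - prob(E1 & E3) - prob(E2 & E3)
    + prob(E1 & E2 & E3)
)

print(direct, inclusion_exclusion)
```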

Of all the times $E_1$ has occurred, what proportion of these will $E_2$ also occur? This is known as <b>conditional probability</b> of $E_2$ given $E_1$: $P(E_2 \mid E_1)$. In other words, $P(E_2 \mid E_1)$ is the probability that $E_2$ will occur when $E_1$ has already occurred:<br><br>
<center><font size="4">$$P(E_2 \mid E_1)=\frac{P(E_2 \cap E_1)}{P(E_1)}$$</font></center><br>

The intuition behind this is to recognize that the conditional probability $P(E_2 \mid E_1)$ is the proportion/fraction of times that $E_2$ occurs when $E_1$ has already occurred. 

The probability of the intersection between two events $E_1$ and $E_2$ can be expressed in terms of conditional probabilities by the <b>multiplication rule</b>:<br><br>
<center><font size="4">$$P(E_1 \cap E_2)=P(E_1 \mid E_2) \times P(E_2)=P(E_2 \mid E_1) \times P(E_1)$$</font></center><br>
Intuition: The conditional probability notation $\mid$ can "kind of" be treated like the divide $\frac{x}{y}$ operator.
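
To make conditional probability and the multiplication rule concrete, here is a minimal sketch for one roll of a fair die. The events are assumptions chosen for illustration.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

E1 = {2, 4, 6}   # roll is even
E2 = {4, 5, 6}   # roll is greater than 3

# Conditional probabilities: P(A | B) = P(A and B) / P(B)
p_E2_given_E1 = prob(E2 & E1) / prob(E1)
p_E1_given_E2 = prob(E1 & E2) / prob(E2)

# Multiplication rule, written both ways; both should equal P(E1 and E2).
via_E2 = p_E1_given_E2 * prob(E2)
via_E1 = p_E2_given_E1 * prob(E1)

print(p_E2_given_E1, via_E1, via_E2)
```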

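For any two events $E_1$ and $E_2$, the <b>total probability rule</b> splits $E_1$ into the part that overlaps $E_2$ and the part that overlaps its complement $'E_2$ (these two parts are mutually exclusive). The probability of $E_1$ is then:<br><br>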
<center><font size="4">$$P(E_1)=P(E_1 \cap E_2)+P(E_1 \cap 'E_2)=P(E_1 \mid E_2) \times P(E_2)+P(E_1 \mid 'E_2) \times P('E_2)$$</font></center><br>
This can be extended to multiple events. Suppose that $E_2$, $E_3$, $E_4$, and $E_5$ are mutually exclusive and together cover the whole sample space S:<br><br>
<center><font size="4">
        \begin{align}
        P(E_1)&=P(E_1 \cap E_2)+P(E_1 \cap E_3)+P(E_1 \cap E_4)+P(E_1 \cap E_5)\\
        &=P(E_1 \mid E_2) \times P(E_2)+P(E_1 \mid E_3) \times P(E_3)+P(E_1 \mid E_4) \times P(E_4)+P(E_1 \mid E_5) \times
            P(E_5)
        \end{align}
    </font></center>
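
A small numeric sketch of the total probability rule follows. The scenario is hypothetical and not from the notes: three machines (a mutually exclusive, exhaustive partition) produce all the parts in a factory, and each has its own made-up defect rate.

```python
from fractions import Fraction

# Hypothetical share of total production per machine: P(M_i)
share = {"M1": Fraction(1, 2), "M2": Fraction(3, 10), "M3": Fraction(1, 5)}

# Hypothetical defect rate per machine: P(defect | M_i)
defect_rate = {"M1": Fraction(1, 100), "M2": Fraction(2, 100), "M3": Fraction(5, 100)}

# The machines must partition the sample space for the rule to apply.
partition_total = sum(share.values())

# Total probability rule: P(defect) = sum of P(defect | M_i) * P(M_i)
p_defect = sum(defect_rate[m] * share[m] for m in share)

print(partition_total, p_defect)
```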

Two events $E_1$ and $E_2$ are <b>independent</b> when the occurrence of $E_1$ does not affect the probability of $E_2$ occurring and vice versa. The following are properties of the probabilities of two independent events:<br><br>
<center><font size="4">
$$P(E_1 \mid E_2) = P(E_1)$$<br>
$$P(E_1 \cap E_2) = P(E_1)P(E_2)$$
</font></center>
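
Independence can be tested directly from the product property. The sketch below again assumes one roll of a fair die; the events and the helper `independent` are illustrative.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

def independent(A, B):
    """True when P(A and B) equals P(A) * P(B)."""
    return prob(A & B) == prob(A) * prob(B)

E1 = {2, 4, 6}      # roll is even
E2 = {1, 2, 3, 4}   # roll is at most 4
E3 = {4, 5, 6}      # roll is greater than 3

print(independent(E1, E2))  # knowing E2 does not change the chance of E1
print(independent(E1, E3))  # knowing E3 does change the chance of E1
```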

<h1>Probability Distributions</h1>

A <b>probability mass function</b> (pmf) $f(x_i)$ characterizes the probability distribution of a <u>discrete</u> random variable. It can be thought of as a function that outputs the probability of the discrete value $x_i$. It is kind of like a bar chart/histogram where the y values are probabilities instead of counts.<br>

A pmf $f(x_i)$ has the following properties owing to the fact that its outputs are probabilities:<br>

All probabilities are non-negative: <center><font size="4">$$f(x_i) \ge 0$$</font></center><br>
The sum of probabilities for all $x_i$ is 1: <center><font size="4">$$\sum_{i=1}^k{f(x_i)}=1$$</font></center><br>
The value of the pmf at each $x_i$ is the probability of $x_i$:
<center><font size="4">
$$f(x_i)=P(X=x_i)$$
</font></center>

A <b>probability density function</b>(pdf) $f(x)$ is a function that characterizes the probability distribution of a <u>continuous</u> random variable. It is similar to the pmf except that <b>its y values are not probabilities</b>. Instead, integrating a pdf $f(x)$ over an interval of $x$ will yield the probability of $x$ being in that range. A histogram is an approximation of the pdf.<br>

A pdf $f(x)$ has the following properties:<br>

The density is non-negative everywhere: <center><font size="4">$$f(x) \ge 0$$</font></center><br>

The total area under the pdf over all values of $x$ is 1:
<center><font size="4">$$\int_{-\infty}^{+\infty}{f(x)\mathrm{d}x}=1$$</font></center><br>

The probability of encountering $x$ in the range between $a$ and $b$ is the integral of the probability density function in that range:<br><br>
<center><font size="4">$$P(a \le X \le b)=\int_a^b{f(x)\mathrm{d}x}$$</font></center>

A <b>cumulative distribution function</b> (cdf) $F(x)$ is a function that gives the probability of encountering values of the random variable $X$ that are less than or equal to $x$. Mathematically,<br>
<center><font size="4">$$F(x)=P(X \le x)$$</font></center><br>

By definition, a cumulative distribution function involves a summation-type function.<br>

For a discrete random variable $X$:
<center><font size="4">$$F(x)=\sum_{x_i \le x}{f(x_i)}$$</font></center><br>

For a continuous random variable $X$:
<center><font size="4">$$F(x)=\int_{-\infty}^x{f(u)\mathrm{d}u}$$</font></center><br>

In either case, owing to the properties of probabilities, the value of $F(x)$ cannot be less than 0 or greater than 1:<br><br>
<center><font size="4">$$0 \le F(x) \le 1$$</font></center><br>

Furthermore, because a cdf is a summation-type function starting from the left end of the range of x, it must be monotonically non-decreasing:<br><br>
<center><font size="4">If $a \le b$, then $F(a) \le F(b)$</font></center><br>

The <b>expected value</b> of a random variable $E(X)$ can be thought of as a weighted sum of all the values of $X$ and their corresponding probabilities, $f(x)$:<br>

For a discrete random variable:
<center><font size="4">$$E(X)=\sum_{x_i}{x_if(x_i)}$$</font></center><br>

For a continuous random variable:
<center><font size="4">$$E(X)=\int_{-\infty}^{+\infty}{xf(x)\mathrm{d}x}$$</font></center><br>

The expectation operator can be extended as follows:<br>

For a discrete random variable:
<center><font size="4">$$E(h(X))=\sum_{x_i}{h(x_i)f(x_i)}$$</font></center><br>

For a continuous random variable:
<center><font size="4">$$E(h(X))=\int_{-\infty}^{+\infty}{h(x)f(x)\mathrm{d}x}$$</font></center><br>

The <b>mean</b> $\mu$ and <b>variance</b> $\sigma^2$ of a random variable $X$ can be expressed in terms of the expectation operator:<br>

For a discrete random variable:
<center><font size="4">
$$\mu=E(X)=\sum_{x_i}{x_if(x_i)}$$
$$\sigma^2=E((X-\mu)^2)=\sum_{x_i}{(x_i-\mu)^2f(x_i)}$$
</font></center><br>

For a continuous random variable:
<center><font size="4">
$$\mu=E(X)=\int_{-\infty}^{+\infty}{xf(x)\mathrm{d}x}$$
$$\sigma^2=E((X-\mu)^2)=\int_{-\infty}^{+\infty}{(x-\mu)^2f(x)\mathrm{d}x}$$
</font></center><br>

The <b>standard deviation</b> of a random variable $X$ is the square root of the variance of $X$:<br>
<center><font size="4">
$$\sigma=\sqrt{\sigma^2}$$
</font></center>
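
The discrete definitions above can be tried out on a small pmf. The sketch below assumes a fair six-sided die (an example chosen here, not taken from the notes) and computes the cdf, mean, variance, and standard deviation directly from the pmf.

```python
from fractions import Fraction
from math import sqrt

# pmf of a fair die: f(x_i) = 1/6 for x_i = 1, ..., 6
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

total = sum(pmf.values())                      # must equal 1

def cdf(x):
    """F(x) = P(X <= x): sum of f(x_i) over all x_i <= x."""
    return sum(p for x_i, p in pmf.items() if x_i <= x)

mean = sum(x * p for x, p in pmf.items())                     # E(X)
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())   # E((X - mu)^2)
std_dev = sqrt(variance)                                      # sigma

print(total, cdf(3), mean, variance, std_dev)
```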

<h2>Bernoulli Trial</h2>

A <b>Bernoulli trial</b> is a random experiment with only two outcomes, 0 or 1. The probability of obtaining 0 or 1 may not be equal. For example, the probability of obtaining tails when flipping a (very) unfair coin could be 0.8, and the probability of obtaining heads is necessarily 0.2.

<h2>Binomial Distribution</h2>

The <b>Binomial Distribution</b> of a discrete random variable $X$ gives the probability of encountering $x$ successful outcomes in $n$ independent Bernoulli trials, each with the same success probability $p$. For example, take a coin that lands tails with probability 1/10. If the coin is flipped 20 times, what is the probability of getting exactly $x=5$ tails? Here, $n=20$, $p=0.1$, and $x=5$.<br>

Mathematically, the Binomial Distribution is:
<center><font size="4">
$$f(x)={n \choose x}p^x(1-p)^{n-x}$$<br>
$$x=0, 1, 2, \ldots, n$$
</font></center><br>

The mean and variance of the Binomial Distribution are:
<center><font size="4">
$$\mu=E(X)=np$$<br>
$$\sigma^2=V(X)=np(1-p)$$
</font></center><br>
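
The coin example can be evaluated numerically. The following is a minimal sketch using only the standard library; `binomial_pmf` is a name introduced here for illustration. It also checks the stated mean and variance by summing over the pmf.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for x successes in n independent Bernoulli(p) trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 20, 0.1

# Probability of exactly 5 tails in 20 flips.
p_five_tails = binomial_pmf(5, n, p)

# Sanity checks computed from the pmf itself.
total = sum(binomial_pmf(x, n, p) for x in range(n + 1))
mean = sum(x * binomial_pmf(x, n, p) for x in range(n + 1))
variance = sum((x - mean) ** 2 * binomial_pmf(x, n, p) for x in range(n + 1))

print(round(p_five_tails, 4), total, mean, variance)
```

The probability comes out at roughly 0.032, and the computed mean and variance match $np$ and $np(1-p)$ up to floating-point error.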

<h2>Geometric Distribution</h2>

The <b>Geometric Distribution</b> of a discrete random variable $X$ gives the probability that the first success occurs on the $x$-th of a series of independent Bernoulli trials of constant probability $p$. So, it answers the question, "What is the probability that the first tail appears on the 10th coin flip, if the coin comes up tails with probability 2/10?" Here, $x=10$, and $p$ is 0.2.<br>

Mathematically, the Geometric Distribution is:<br>
<center><font size="4">
$$f(x)=(1-p)^{x-1}p$$<br>
$$x=1,2, \ldots$$
</font></center><br>

The mean and variance of the Geometric Distribution are:
<center><font size="4">
$$\mu=E(X)=\frac{1}{p}$$<br>
$$\sigma^2=V(X)=\frac{1-p}{p^2}$$
</font></center><br>
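
A short sketch of the geometric pmf for the coin question ($x=10$, $p=0.2$). The function name is illustrative, and the mean is approximated by truncating the infinite sum at 1000 terms, which is an assumption that the remaining tail is negligible.

```python
def geometric_pmf(x, p):
    """P(first success occurs on trial x): x - 1 failures, then a success."""
    return (1 - p) ** (x - 1) * p

p = 0.2
p_first_tail_on_10th = geometric_pmf(10, p)

# Approximate E(X) by truncating the infinite sum; the tail is negligible here.
approx_mean = sum(x * geometric_pmf(x, p) for x in range(1, 1001))

print(p_first_tail_on_10th, approx_mean)
```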

<h2>Negative Binomial Distribution</h2>

The <b>Negative Binomial Distribution</b> of a discrete random variable $X$ gives the probability that exactly $x$ independent Bernoulli trials of constant probability $p$ are needed to obtain $r$ successes. It answers the question, "If a coin is to be flipped until 5 tails are encountered, what is the probability that exactly 20 flips will be needed when the coin comes up tails with probability 2/10?" Here, $r=5$, $x=20$, and $p=0.2$.<br>

Mathematically, the Negative Binomial Distribution is:
<center><font size="4">
$$f(x)={x-1 \choose r-1}(1-p)^{x-r}p^r$$<br>
$$x=r, r+1, r+2, \ldots$$
</font></center><br>

The mean and variance of the Negative Binomial Distribution are:
<center><font size="4">
$$\mu=E(X)=\frac{r}{p}$$<br>
$$\sigma^2=V(X)=\frac{r(1-p)}{p^2}$$
</font></center><br>
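
The same coin question can be evaluated with a few lines of Python. This is a sketch under the parameters stated above ($r=5$, $x=20$, $p=0.2$); the function name is introduced here for illustration.

```python
from math import comb

def negative_binomial_pmf(x, r, p):
    """P(the r-th success occurs on trial x)."""
    # r - 1 successes among the first x - 1 trials, then a success on trial x.
    return comb(x - 1, r - 1) * (1 - p) ** (x - r) * p**r

p_20_flips = negative_binomial_pmf(20, 5, 0.2)

# With r = 1 the formula reduces to the geometric distribution.
geometric_case = negative_binomial_pmf(10, 1, 0.2)

print(round(p_20_flips, 5), geometric_case)
```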

<h2>Hypergeometric Distribution</h2>

The <b>Hypergeometric Distribution</b> of a discrete random variable $X$ helps in the following situation. There are 100 balls in a bag. 27 of these balls are blue. If 3 balls are taken from the bag, without replacement, what is the probability of obtaining 2 blue balls? In general terms, there are $N$ objects, $K$ of which are designated as successes. If a sample of size $n$ is drawn from the $N$ objects, what is the probability that the sample contains $x$ successes? Here, $N=100$, $K=27$, $n=3$, and $x=2$.<br>

Mathematically, the Hypergeometric Distribution is:
<center><font size="4">
$$f(x)=\frac{{K \choose x}{N-K \choose n-x}}{N \choose n}$$<br>
$x=$max$\{0,n+K-N\}$ to min$\{K,n\}$
</font></center><br>

The mean and variance of the Hypergeometric Distribution are:
<center><font size="4">
$$\mu=E(X)=np$$<br>
$$\sigma^2=V(X)=np(1 - p)\left( \frac{N - n}{N - 1} \right)$$<br>
$$p = \frac{K}{N}$$
</font></center><br>
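
The ball example ($N=100$, $K=27$, $n=3$, $x=2$) can be worked through with exact fractions. The sketch below is illustrative; `hypergeometric_pmf` is a name made up for this example.

```python
from fractions import Fraction
from math import comb

def hypergeometric_pmf(x, N, K, n):
    """P(x successes in a sample of n drawn without replacement)."""
    return Fraction(comb(K, x) * comb(N - K, n - x), comb(N, n))

N, K, n = 100, 27, 3
p_two_blue = hypergeometric_pmf(2, N, K, n)

# Valid values of x run from max{0, n + K - N} to min{K, n}.
support = range(max(0, n + K - N), min(K, n) + 1)
total = sum(hypergeometric_pmf(x, N, K, n) for x in support)
mean = sum(x * hypergeometric_pmf(x, N, K, n) for x in support)

print(p_two_blue, float(p_two_blue), total, mean)
```

The probability of drawing exactly 2 blue balls is about 0.158, and the mean agrees with $np$ where $p=K/N$.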

<h2>Poisson Distribution</h2>

The <b>Poisson Distribution</b> of a discrete random variable $X$ helps in the following situation. Suppose that a length of wire has an average of 2.3 flaws per mm. What is the probability of having exactly 5 flaws in 10 mm of wire? Flaws that occur at random along the wire at a constant average rate form what is called a <b>Poisson process</b>. When an average number of successes per unit $\lambda$ is given, the Poisson Distribution gives the probability of encountering $x$ successes in $T$ units. In the example above, $\lambda=2.3$, $T=10$, and $x=5$.<br>

Mathematically, the Poisson Distribution is:
<center><font size="4">
$$f(x)=\frac{e^{-\lambda T}(\lambda T)^x}{x!}$$<br>
$$x=0, 1, 2, \ldots$$
</font></center><br>

The mean and variance of the Poisson Distribution are:
<center><font size="4">
$$\mu=E(X)=\lambda T$$<br>
$$\sigma^2=V(X)=\lambda T$$
</font></center><br>
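
A minimal sketch of the wire example ($\lambda=2.3$, $T=10$, $x=5$) in plain Python. The function name is illustrative, and the sanity check truncates the infinite sum at $x=100$, assuming the remaining tail is negligible for $\lambda T=23$.

```python
from math import exp, factorial

def poisson_pmf(x, lam, T):
    """P(x events in T units) for a Poisson process with rate lam per unit."""
    m = lam * T                      # expected number of events in T units
    return exp(-m) * m**x / factorial(x)

lam, T = 2.3, 10
p_five_flaws = poisson_pmf(5, lam, T)

# Truncated checks of the total probability and the mean.
total = sum(poisson_pmf(x, lam, T) for x in range(101))
mean = sum(x * poisson_pmf(x, lam, T) for x in range(101))

print(p_five_flaws, total, mean)
```

With an expected 23 flaws in 10 mm, seeing only 5 is very unlikely, so the probability is tiny (on the order of $10^{-6}$).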

<h2>Normal Distribution</h2>

Suppose that there are a million balls. Take a sample of 10 balls, measure their diameters, average them, and call this average $a_1$. Take another sample of 10, measure, average, and call this $a_2$. Take yet another sample of 10, measure, average, and call this $a_3$. Keep repeating this many times, $m$, to get $a_1, a_2, \ldots, a_m$. If a histogram is made out of these averages, the <b>Central Limit Theorem</b> implies that this histogram will approximate the <b>Normal Distribution</b>, with the approximation improving as the sample size grows. In other words, the continuous random variable $X$, which represents the average diameter of a sample of size $n=10$, will be approximately normally distributed.<br>

Mathematically, the Normal Distribution with mean $\mu$ and variance $\sigma^2$ is:
<center><font size="4">
$$f(x)=\frac{1}{\sqrt{2\pi}\sigma}e^{\frac{-(x-\mu)^2}{2\sigma^2}}$$<br>
$$-\infty \lt x \lt +\infty$$
</font></center><br>

<h2>Standard Normal Distribution</h2>

The <b>Standard Normal Distribution</b> is obtained by modifying the normal distribution by <b>standardization</b>. Given a continuous random variable $X$, the standardized random variable $Z$ can be obtained by mean-centering and scaling by the standard deviation:<br><br>
<center><font size="4">
$$Z=\frac{(X-\mu)}{\sigma}$$
</font></center><br>

The continuous standard normal random variable $Z$ will be normally distributed with:<br>
<center><font size="4">
$$\mu=0$$<br>
$$\sigma^2=1$$
</font></center><br>

The standard normal distribution is useful in calculating probabilities. Calculating probabilities of normal variables directly would involve integrating many normal distributions with many different means and variances. Instead, it is more convenient to standardize the random variables and only integrate the standard normal distribution to calculate probabilities:
<center><font size="4">
$$P(X \le x)=P(Z \le z)=P\left(\frac{X-\mu}{\sigma} \le \frac{x-\mu}{\sigma} \right)$$<br>
</font></center><br>
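
The standardization step translates directly into code. The sketch below evaluates the standard normal cdf with `math.erf` from the standard library. The mean of 5.0 m and standard deviation of 0.2 m for the rod length are hypothetical values chosen for illustration.

```python
from math import erf, sqrt

def standard_normal_cdf(z):
    """P(Z <= z) for a standard normal variable, via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def normal_cdf(x, mu, sigma):
    """P(X <= x): standardize x, then use the standard normal cdf."""
    z = (x - mu) / sigma
    return standard_normal_cdf(z)

mu, sigma = 5.0, 0.2   # hypothetical rod length parameters (m)

p_below_5_2 = normal_cdf(5.2, mu, sigma)                            # P(X <= 5.2)
p_between = normal_cdf(5.2, mu, sigma) - normal_cdf(4.7, mu, sigma)  # P(4.7 <= X <= 5.2)

print(round(p_below_5_2, 4), round(p_between, 4))
```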

<h2>Standard Normal Approximation to the Binomial Distribution</h2>

It is sometimes convenient to approximate a binomial random variable with a standard normal variable for the purposes of <b>calculating probabilities</b>. Given $X$ is a binomial random variable with parameters $n$ and $p$, the standard normal variable $Z$ is calculated as follows:<br><br>
<center><font size="4">
$$Z=\frac{X-np}{\sqrt{np(1-p)}}$$
</font></center><br>

Approximate probabilities can be computed as follows:<br>
<center><font size="4">
$$P(X \le x)=P(X \le x+0.5)\approx P\left(Z \le \frac{x+0.5-np}{\sqrt{np(1-p)}} \right)$$
$$P(X \ge x)=P(X \ge x-0.5)\approx P\left(Z \ge \frac{x-0.5-np}{\sqrt{np(1-p)}} \right)$$
</font></center><br>

These approximations are good when:
<center><font size="4">
$$np \gt 5$$<br>
$$n(1-p) \gt 5$$
</font></center><br>
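
To get a feel for how good the approximation is, the sketch below compares the exact binomial probability $P(X \le x)$ with the continuity-corrected normal approximation. The values $n=50$, $p=0.3$, and $x=12$ are assumptions picked for illustration; they satisfy both rule-of-thumb conditions.

```python
from math import comb, erf, sqrt

def binomial_cdf(x, n, p):
    """Exact P(X <= x) for a binomial random variable."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x + 1))

def standard_normal_cdf(z):
    """P(Z <= z) for a standard normal variable."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

n, p, x = 50, 0.3, 12   # hypothetical example values

# Rule-of-thumb conditions for the approximation to be reasonable.
conditions_ok = n * p > 5 and n * (1 - p) > 5

exact = binomial_cdf(x, n, p)

# Continuity correction: P(X <= x) is approximated by P(Z <= (x + 0.5 - np) / sd).
z = (x + 0.5 - n * p) / sqrt(n * p * (1 - p))
approx = standard_normal_cdf(z)

print(conditions_ok, exact, approx)
```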

<h2>Exponential Distribution</h2>

Given a Poisson process with a mean number of successes per unit $\lambda$, the <b>Exponential Distribution</b> describes the distance $X$ between two successive successes. Suppose that a length of wire has an average of 2.3 flaws per mm. What is the probability of having no flaws in 10 mm of wire, that is, $P(X > 10)$? Here, $\lambda=2.3$, and $x=10$.<br>

Mathematically, the exponential distribution is:
<center><font size="4">
$$f(x)=\lambda e^{-\lambda x}$$<br>
$$0 \le x \lt \infty$$
</font></center><br>

The mean and variance of the exponential distribution are:
<center><font size="4">
$$\mu=E(X)=\frac{1}{\lambda}$$<br>
$$\sigma^2=V(X)=\frac{1}{\lambda^2}$$
</font></center><br>

The exponential distribution has a property called <b>lack of memory</b>. Using the example, suppose that a flaw has not been encountered in 60 mm of wire. What is the probability that a flaw will be encountered in the next 10 mm (that is, before the 70 mm mark)? It might be tempting to assume that because there were no flaws in 60 mm of wire, it is "overdue" for a flaw. However, the probability of encountering a flaw in the next 10 mm of wire is <b>independent of the past</b>. That is, it is independent of the fact that a flaw has not been encountered in the last 60 mm. Mathematically, <br><br>
<center><font size="4">
$$P(X<t+a|X>t)=P(X<a)$$
</font></center><br>

Here, $t=60$ mm, and $a=10$ mm, making $t+a=70$ mm.
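
The lack of memory property can be checked numerically. The sketch below uses a hypothetical rate of $\lambda=0.05$ flaws per mm rather than the 2.3 used earlier, purely so that the probabilities are not vanishingly close to 0 or 1. It works with the survival function $P(X>x)=e^{-\lambda x}$, which follows from integrating the pdf.

```python
from math import exp

lam = 0.05          # hypothetical flaw rate per mm, chosen for illustration
t, a = 60.0, 10.0   # no flaw in the first t mm; look at the next a mm

def survival(x):
    """P(X > x) for the exponential distribution."""
    return exp(-lam * x)

# P(X < a): a flaw within the first a mm, with no history.
unconditional = 1.0 - survival(a)

# P(X < t + a | X > t) = P(t < X < t + a) / P(X > t)
conditional = (survival(t) - survival(t + a)) / survival(t)

print(unconditional, conditional)
```

Both values agree up to floating-point error, so the 60 mm of flaw-free wire has no effect on the next 10 mm.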

<h1>Joint Probability Distributions</h1>

<b>Joint probability distributions</b> can be thought of as functions that describe how two random variables behave together. For two continuous random variables, their <b>joint probability density function</b> (joint pdf) $f_{XY}(x, y)$ is a surface; the x and y-axes are the values of the random variables; the z-axis is the probability density at the specific $(X=x, Y=y)$ pair. As with a one-dimensional pdf, the height of the surface is not itself a probability; probabilities are obtained by integrating over a region.<br>

Joint pdf's have the following properties owing to the basic properties of probabilities:<br>

The density at any combination of x and y values $(X=x, Y=y)$ must be non-negative:<br><br>
<center><font size="4">
$$f_{XY}(x, y) \ge 0$$
</font></center><br>

The joint pdf must integrate to 1 over the entire x-y plane:<br><br>
<center><font size="4">
$$\int_{-\infty}^{+\infty}{\int_{-\infty}^{+\infty}{f_{XY}(x, y)\mathrm{d}x\mathrm{d}y}}=1$$
</font></center><br>

Combinations of $(x, y)$ value pairs are represented by a region in the x-y plane. The region is defined by an interval of X and an interval of Y. When double-integrating, <b>try to always integrate the variable with largest interval first</b>. This prevents accidentally including regions of the x-y plane that are not desired. Failing that, split up the integral based on different regions of the x-y plane.
<center><font size="4">
$$P((X,Y)\in R)=\int{\int_{R}{f_{XY}(x, y)\mathrm{d}x\mathrm{d}y}}$$
</font></center><br>

The <b>marginal probability density function</b> of X can be obtained from the joint pdf of X and Y by integrating Y out over the relevant region R. This yields the pdf of X. A similar process can be used to obtain the marginal pdf of Y.<br>
<center><font size="4">
$$f_X(x)=\int_{R}{f_{XY}(x, y)\mathrm{d}y}$$<br>
$$f_Y(y)=\int_{R}{f_{XY}(x, y)\mathrm{d}x}$$
</font></center><br>


The <b>mean</b> and <b>variance</b> of a random variable $X$ from a joint pdf can be expressed with its marginal pdf and the expectation operator:<br>
<center><font size="4">
$$E(X)=\mu_X=\int_{-\infty}^{+\infty}{xf_X(x)\mathrm{d}x}=\int_{-\infty}^{+\infty}{\int_{-\infty}^{+\infty}{xf_{XY}(x, y)\mathrm{d}x\mathrm{d}y}}$$<br>
$$V(X)=\int_{-\infty}^{+\infty}{(x-\mu_X)^2f_X(x)\mathrm{d}x}=\int_{-\infty}^{+\infty}{\int_{-\infty}^{+\infty}{(x-\mu_X)^2f_{XY}(x, y)\mathrm{d}x\mathrm{d}y}}$$<br>
</font></center><br>

<h1>Covariance and Correlation</h1>

The <b>covariance</b> between two random variables X and Y is a measure of their <b>linear</b> relationship. The <b>correlation</b> scales the covariance by the standard deviation of each variable. Two random variables are said to be correlated when their correlation is non-zero.<br>

The covariance between two random variables $\sigma_{XY}$ is defined mathematically as:<br><br>
<center><font size="4">
$$\sigma_{XY}=E[(X-\mu_X)(Y-\mu_Y)]=E(XY)-\mu_X\mu_Y$$
</font></center><br>

The correlation between two random variables $\rho_{XY}$ is defined mathematically as:<br><br>
<center><font size="4">
$$\rho_{XY}=\frac{\sigma_{XY}}{\sigma_X \sigma_Y}$$<br>
$$-1 \le \rho_{XY} \le +1$$
</font></center><br>

The covariance and correlation of two independent variables are both 0. But two variables whose covariance and correlation are zero are <b>not necessarily independent</b>.<br>
<center><font size="4">
$$\sigma_{XY}=\rho_{XY}=0$$
</font></center><br>
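
A classic small example shows zero covariance without independence. The sketch below assumes a discrete $X$ that is equally likely to be $-1$, $0$, or $1$, with $Y=X^2$; this example is added for illustration and is not from the notes. $Y$ is completely determined by $X$, yet the covariance is zero because the relationship is not linear.

```python
from fractions import Fraction

# Joint pmf of (X, Y) where X is uniform on {-1, 0, 1} and Y = X**2.
joint = {(x, x * x): Fraction(1, 3) for x in (-1, 0, 1)}

def expect(g):
    """E[g(X, Y)] under the joint pmf."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

mu_x = expect(lambda x, y: x)
mu_y = expect(lambda x, y: y)

# Covariance: E(XY) - mu_X * mu_Y
covariance = expect(lambda x, y: x * y) - mu_x * mu_y

# Independence would require P(X=0, Y=0) = P(X=0) * P(Y=0).
p_x0 = sum(p for (x, y), p in joint.items() if x == 0)
p_y0 = sum(p for (x, y), p in joint.items() if y == 0)
p_x0_y0 = joint.get((0, 0), Fraction(0))

print(covariance, p_x0_y0, p_x0 * p_y0)
```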

<h1>Linear Functions of Random Variables</h1>

This section details how to calculate the mean and variance of a random variable $Y$ that is a linear function of $p$ random variables $X_1, X_2, \ldots, X_p$ with constants $c_0, c_1, \ldots, c_p$:<br><br>
<font size="4">
    $$Y = c_0 + c_1X_1 + c_2X_2 + \ldots + c_pX_p$$
</font>

The <b>mean</b> and <b>variance</b> of $Y$ can be expressed in terms of the means, variances, and covariances of the individual random variables $X_i$:<br><br>
<font size="4">
    $$E(Y) = c_0 + c_1E(X_1) + c_2E(X_2) + \ldots + c_pE(X_p)$$<br>
    $$V(Y) = c_1^2V(X_1) + c_2^2V(X_2) + \ldots + c_p^2V(X_p) + 2 \sum_{a<b}{\sum{c_ac_b\mathrm{cov}(X_a, X_b)}}$$
</font>

If all $X_i$ are independent, the variance of $Y$ can be simplified to:<br><br>
<font size="4">
    $$V(Y) = c_1^2V(X_1) + c_2^2V(X_2) + \ldots + c_p^2V(X_p)$$
</font>

<h2>Mean and Variance of an Average</h2>

Sometimes, $p$ samples, each with $n$ independent observations, are taken from a pool where $E(X)=\mu$ and $V(X)=\sigma^2$. A sample average $\overline{X}$ can be calculated for each sample:<br><br>
<font size="4">
    $$S_1=\{x_{1,1}, x_{1,2}, x_{1,3}, \ldots, x_{1,n}\} \;\;\;\; \overline{X}_1=\frac{1}{n}\sum_{i=1}^n{x_{1,i}}$$<br>
    $$S_2=\{x_{2,1}, x_{2,2}, x_{2,3}, \ldots, x_{2,n}\} \;\;\;\; \overline{X}_2=\frac{1}{n}\sum_{i=1}^n{x_{2,i}}$$<br>
    $$S_3=\{x_{3,1}, x_{3,2}, x_{3,3}, \ldots, x_{3,n}\} \;\;\;\; \overline{X}_3=\frac{1}{n}\sum_{i=1}^n{x_{3,i}}$$<br>
    $$S_p=\{x_{p,1}, x_{p,2}, x_{p,3}, \ldots, x_{p,n}\} \;\;\;\; \overline{X}_p=\frac{1}{n}\sum_{i=1}^n{x_{p,i}}$$<br>
</font><br>

The mean and variance of each sample average $\overline{X}_j$ depend on the sample size $n$, not on the number of samples $p$:
<font size="4">
    $$E(\overline{X}_j)=\mu$$<br>
    $$V(\overline{X}_j)=\frac{\sigma^2}{n}$$<br>
</font><br>
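
The variance of a sample average equals the population variance divided by the number of observations in each sample. This can be verified exactly on a tiny example. The sketch below assumes the population is a fair die and enumerates every equally likely sample of two independent rolls; the setup is illustrative and not from the notes.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
sample_size = 2

# Population mean and variance of a fair die.
mu = Fraction(sum(faces), 6)
sigma2 = sum((x - mu) ** 2 for x in faces) / 6

# Every equally likely sample of independent rolls, and its average.
samples = list(product(faces, repeat=sample_size))
means = [Fraction(sum(s), sample_size) for s in samples]

# Mean and variance of the sample averages over all possible samples.
mean_of_means = sum(means) / len(means)
var_of_means = sum((m - mean_of_means) ** 2 for m in means) / len(means)

print(mu, sigma2, mean_of_means, var_of_means)
```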

<h1>Descriptive Statistics</h1>

For a sample with $n$ observations $\{x_1, x_2, x_3, \ldots, x_n\}$:

The sample mean is:<br><br>
<font size="4">
    $$\overline{x}=\frac{\sum_{i=1}^{n}{x_i}}{n}$$
</font>

The sample variance is:<br><br>
<font size="4">
    $$s^2=\frac{\sum_{i=1}^{n}{(x_i - \overline{x})^2}}{n - 1}$$
</font>

The sample range is:<br><br>
<font size="4">
    $$r=\mathrm{max}(x_i) - \mathrm{min}(x_i)$$
</font>
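
These three statistics are quick to compute by hand or with the standard-library `statistics` module. The data below are hypothetical rod lengths made up for illustration. Note that the sample variance uses squared deviations and divides by $n-1$.

```python
import statistics

data = [4.9, 5.1, 5.0, 4.8, 5.2]   # hypothetical rod lengths (m)
n = len(data)

# Sample mean, variance (n - 1 in the denominator), and range from the definitions.
x_bar = sum(data) / n
s2 = sum((x - x_bar) ** 2 for x in data) / (n - 1)
r = max(data) - min(data)

# The statistics module gives the same mean and sample variance.
x_bar_lib = statistics.mean(data)
s2_lib = statistics.variance(data)

print(x_bar, s2, r)
```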

<h2>Visual Aids</h2><br>
<b>Stem-and-leaf</b> diagrams divide the data into bins and provide a way to visualize the distribution of data while still displaying individual values. They work well for smaller data sets but struggle to be meaningful for larger data sets.<br>

<b>Histograms</b> provide a way to visualize the distribution of data and can handle larger data sets. However, they do not show the individual data values.<br>

<b>Box plots</b> show the distribution of data by highlighting the boundaries of the first and third quartiles. They also show possible outliers and the distance between these outliers and the first and third quartiles.

The <b>normal probability plot</b> shows how closely a data set resembles the normal probability distribution. It can show whether or not a data set has lighter or heavier tails, or how skewed the data set is relative to the normal probability distribution.


<h1>Inferential Statistics</h1>

The next sections form the foundations of inferential statistics. Inferential statistics aims to draw conclusions about a population by examining a sample drawn from it.

<h1>Point Estimation</h1>

Many situations arise where an estimate of a generally unknown parameter is needed. Take the following examples:
<ul>
    <li>the average weight of a bag of chips produced at a factory</li>
    <li>the average height of a population of a country</li>
    <li>the standard deviation of temperature of a region</li>
</ul>
It is quite often infeasible to measure the value of a <b>population parameter $\theta$</b>. For example, it would be very expensive to measure the weight of every bag of chips that was produced from a factory!

Instead, a <b>random sample</b> is drawn and measurements taken. For example, 100 bags of chips can be weighed and their average weight computed. In this case, the sample mean $\overline{X}$ is called a <b>point estimator $\hat{\Theta}$</b> and its observed value $\overline{x}$ is called a <b>point estimate $\hat{\theta}$</b> of the population parameter, which is the mean weight of all bags. A point estimator can also be thought of as a function performed on the sample to generate an estimate. Any function of the observations of a random sample is known as a <b>statistic</b>. The point estimator that is the sample mean is one such example.

<h1>Sampling Distributions</h1>

The sample mean is also a random variable. If 3 samples are taken (a total of 300 bags of chips), 3 sample means can be computed. The 3 sample means will be different, leading to the idea that the sample mean has a <b>sampling distribution</b>. Thus, point estimates can be thought of as having a probability distribution called the sampling distribution. Intuitively, this makes sense since every sample taken from a population will be slightly different. If the observations in every sample are slightly different, then it follows that a function (which a point estimator is) of the slightly different observations will also yield different values.

The <b>Central Limit Theorem</b> states that the distribution of the sample mean $\overline{X}$ tends toward a normal distribution as the sample size $n$ grows, regardless of the shape of the population's probability distribution (provided the population has mean $\mu$ and finite variance $\sigma^2$). Mathematically, the Central Limit Theorem states that the distribution of the standardized sample mean $Z$ approaches the standard normal distribution as the sample size approaches infinity.
<font size="4">
    $$Z=\frac{\overline{X}-\mu}{\frac{\sigma}{\sqrt{n}}}$$<br>
    $$E(\overline{X})=\mu$$<br>
    $$\sigma^2_{\overline{X}}=\frac{\sigma^2}{n}$$<br>
    $$\sigma_{\overline{X}}=\frac{\sigma}{\sqrt{n}}$$<br>
    $$n\rightarrow\infty$$
</font>

Similarly, for a random variable that is the difference between two sample means $\overline{X}_1$ and $\overline{X}_2$ from two independent populations, the sampling distribution of the following $Z$ is approximately standard normal:<br><br>
<font size="4">
    $$Z=\frac{\overline{X}_1 - \overline{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}}$$<br>
    $$E(\overline{X}_1 - \overline{X}_2) = \mu_1 - \mu_2$$<br>
    $$\sigma^2_{\overline{X}_1 - \overline{X}_2} = \frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}$$<br>
    $$\sigma_{\overline{X}_1 - \overline{X}_2} = \sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}$$<br>
    $$n_1, n_2\rightarrow\infty$$
</font><br>

Where
<ul>
    <li>$\overline{X}_1$ is the sample mean from the population 1</li><br>
    <li>$\overline{X}_2$ is the sample mean from the population 2</li><br>
    <li>$\mu_1$ is the mean of population 1</li><br>
    <li>$\mu_2$ is the mean of population 2</li><br>
    <li>$\sigma^2_1$ is the variance of population 1</li><br>
    <li>$\sigma^2_2$ is the variance of population 2</li><br>
    <li>$n_1$ is the sample size of the sample from population 1</li><br>
    <li>$n_2$ is the sample size of the sample from population 2</li><br>
</ul>

<h1>General Concepts of Point Estimation</h1>

With the idea of a sampling distribution comes the idea of the expected value of a point estimator. The expected value of a point estimator is the sum of the products of the point estimates and their corresponding probabilities as given by the sampling distribution. How does this value differ from the population parameter? The difference between the expected value of a point estimator and the value of the population parameter is called <b>bias</b>:<br><br>

<font size="4">
    $$\mathrm{bias} = E(\hat{\Theta})-\theta$$
</font>

A point estimator is <b>unbiased</b> when $E(\hat{\Theta})=\theta$.<br><br>

What about the variance of a point estimator's sampling distribution? Well, a smaller variance implies a higher chance of observing a point estimate that is close to the expected value of the point estimator. This leads to the concept of the <b>minimum variance unbiased estimator</b> (MVUE). In other words, the MVUE is the best unbiased estimator because it is centered around the population parameter (unbiased) and there is a high chance of obtaining a point estimate that is close to the expected value (minimum variance). The sample mean $\overline{X}$ is the MVUE of the population mean $\mu$ when the population is normally distributed.<br>

The <b>standard error</b> of a point estimator can be used to communicate the precision of its point estimates. $\mathrm{SE}(\hat{\Theta})$ is the standard deviation of the sampling distribution of $\hat{\Theta}$. Mathematically:<br><br>

<font size="4">
    $$\mathrm{SE}(\hat{\Theta})=\sqrt{V(\hat{\Theta})}$$<br>
    $$\mathrm{SE}(\overline{X})=\frac{\sigma}{\sqrt{n}}$$
</font><br>

The <b>estimated standard error</b> of a point estimate is used when estimates are used in place of generally unknown parameters. For example, for a sample of size $n$, the variance and standard deviation of the sampling distribution of the sample mean $\overline{X}$ are:<br><br>

<font size="4">
    $$\sigma^2_{\overline{X}}=\frac{\sigma^2}{n}$$<br>
    $$\mathrm{SE}(\overline{X})=\sigma_{\overline{X}}=\frac{\sigma}{\sqrt{n}}$$
</font><br>

But in general, the variance $\sigma^2$ of the population distribution is unknown. What can be done is to use the sample standard deviation $s$ in place of $\sigma$ to calculate an estimate of the standard error of $\overline{X}$:<br><br>

<font size="4">
    $$\hat{\sigma}_{\overline{X}}=\frac{s}{\sqrt{n}}$$
</font><br>
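A minimal sketch of the estimated standard error of the sample mean, using only the Python standard library. The measurements are made-up values for illustration.

```python
import math
import statistics

# Hypothetical measurements (e.g. rod lengths in m), made up for illustration
sample = [4.9, 5.1, 5.0, 4.8, 5.2, 5.0, 4.7, 5.3]

n = len(sample)
x_bar = statistics.fmean(sample)   # point estimate of mu
s = statistics.stdev(sample)       # sample standard deviation (n - 1 divisor)
se_hat = s / math.sqrt(n)          # estimated standard error of the sample mean

print(f"x_bar = {x_bar:.3f}, s = {s:.3f}, estimated SE = {se_hat:.3f}")
```

Note that `statistics.stdev` uses the $n-1$ divisor, which matches the sample standard deviation $s$ used in the formula above.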

<h1>Bootstrapping</h1>

For the sample mean point estimator $\overline{X}$, the Central Limit Theorem was used to derive the result that $\sigma_{\overline{X}}=\frac{\sigma}{\sqrt{n}}$ (derivation not included in notes). However, some point estimators $\hat{\Theta}$ have complicated forms, and deriving an expression for their standard error is difficult, if not impossible. The <b>bootstrap</b> technique can be used to estimate the standard error of the point estimator $\mathrm{SE}(\hat{\Theta})$.

Suppose that a random sample from the population is available to characterize the population parameter $\theta$. Suppose further that the <b>probability density function of the population is known</b>. With the use of a computer, bootstrapping proceeds as follows:<br>
<ul>
    <li>Using the sample, calculate the sample point estimate $\hat{\theta}$ of the population parameter $\theta$.</li>
    <li>Use the known pdf of the population to generate samples. These are known as bootstrap samples.</li>
    <li>Compute the bootstrap estimate $\hat{\theta}_B$ of the population parameter $\theta$.</li>
    <li>Repeat generating samples and calculating bootstrap estimates $\hat{\theta}_B$ $n_B$ times.</li>
    <li>The $n_B$ bootstrap estimates $\{\hat{\theta_1}_B, \hat{\theta_2}_B, \hat{\theta_3}_B, \ldots, \hat{\theta_{n_B}}_B\}$ of $\theta$ can be treated as an approximation of the sampling distribution of the point estimate $\hat{\theta}$ from the original sample.</li>
    <li>The standard deviation of the sampling distribution formed by the bootstrap estimates is an estimate of the standard error of the point estimate $\hat{\theta}$:</li>
</ul>
<font size ="4">
    $$\overline{\theta}_B=\frac{1}{n_B}\sum_{i=1}^{n_B}{\hat{\theta_i}_B}$$<br>
    $$\mathrm{SE}(\hat{\theta})=\sqrt{\frac{1}{n_B - 1} \sum_{i=1}^{n_B}{(\hat{\theta_i}_B - \overline{\theta}_B)^2}}$$
</font>
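The steps above can be sketched as follows. This is only an illustration under stated assumptions: the population is taken to be exponential, the parameter of interest is its rate, the point estimator is $1/\overline{x}$, and "using the known pdf" is interpreted as simulating from that pdf with the point estimate plugged in. The "observed" sample is itself simulated, since no real data accompany these notes.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

def estimator(sample):
    """Point estimator of the exponential rate: 1 / sample mean."""
    return 1.0 / statistics.fmean(sample)

# Stand-in for the observed sample (in practice this would be real data)
TRUE_RATE = 0.5   # hypothetical, used only to manufacture the sample
observed = [random.expovariate(TRUE_RATE) for _ in range(30)]

theta_hat = estimator(observed)   # point estimate from the original sample
n = len(observed)
n_B = 2000                        # number of bootstrap samples (assumed)

boot_estimates = []
for _ in range(n_B):
    # Draw a bootstrap sample from the assumed pdf with theta_hat plugged in
    boot_sample = [random.expovariate(theta_hat) for _ in range(n)]
    # Compute the bootstrap estimate from that sample
    boot_estimates.append(estimator(boot_sample))

# Sample standard deviation of the bootstrap estimates (n_B - 1 divisor)
se_boot = statistics.stdev(boot_estimates)
print(f"theta_hat = {theta_hat:.3f}, bootstrap SE = {se_boot:.3f}")
```

`statistics.stdev` applies the $n_B - 1$ divisor, so `se_boot` corresponds to the $\mathrm{SE}(\hat{\theta})$ formula above.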

If the <b>probability density function of the population is unknown</b>, which is probably more often the case, the sample can be treated as the population. Samples of the same size as the original can then be drawn from it with replacement to build a set of bootstrap estimates $\{\hat{\theta_1}_B, \hat{\theta_2}_B, \hat{\theta_3}_B, \ldots, \hat{\theta_{n_B}}_B\}$ of $\theta$. This set of bootstrap estimates can then be treated as the sampling distribution of $\hat{\theta}$.
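A minimal sketch of the resampling version, here applied to the sample median (an estimator whose standard error has no simple closed form). The data values and the number of bootstrap samples are assumptions for illustration.

```python
import random
import statistics

random.seed(2)  # fixed seed so the run is repeatable

# Hypothetical observed sample (made-up values for illustration)
observed = [12.1, 9.8, 11.4, 10.3, 13.7, 9.1, 10.9, 12.6, 11.0, 10.2,
            14.2, 9.6, 11.8, 10.7, 12.9]

theta_hat = statistics.median(observed)   # point estimate from the sample
n = len(observed)
n_B = 2000                                # number of bootstrap samples (assumed)

boot_estimates = []
for _ in range(n_B):
    # Treat the sample as the population: draw n values WITH replacement
    boot_sample = random.choices(observed, k=n)
    boot_estimates.append(statistics.median(boot_sample))

# Standard deviation of the bootstrap estimates estimates SE(theta_hat)
se_boot = statistics.stdev(boot_estimates)
print(f"median = {theta_hat}, bootstrap SE = {se_boot:.3f}")
```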

<h1>Confidence Intervals</h1>

The sample mean $\overline{X}$ is a point estimate of the population mean $\mu$. But, it does not communicate the variability of the estimate as represented by its sampling distribution. The standard error of the sample mean $\mathrm{SE}(\overline{X})$ gives some idea of the spread of the distribution. Can we do better?

<b>Confidence intervals</b>, constructed from samples drawn from the population, provide a range of plausible values for the population parameter $\theta$ being estimated. A probability $1-\alpha$, called the <b>confidence level</b> or <b>confidence coefficient</b>, is associated with this interval. The confidence level is the probability that the procedure used to construct the interval produces an interval containing the true, but unknown, population parameter $\theta$. By the rules of probability, the <b>significance level</b> $\alpha$ is the probability that an interval constructed this way does not contain $\theta$.

For example, suppose 100 samples are drawn and 100 95% confidence intervals are constructed. Then, it is expected that 5 of these confidence intervals do not contain the true but unknown population parameter. If 100 80% confidence intervals were constructed instead, then it is expected that 20 of these confidence intervals do not contain $\theta$. If 100 99% confidence intervals were constructed instead, then it is expected that only 1 of these confidence intervals would not contain $\theta$. So, a 95% confidence interval $6\pm1.3$ is loosely read as "I am 95% sure that $\theta$ is between $[4.7, 7.3]$", where the 95% describes how often intervals built this way capture $\theta$.
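The repeated-sampling idea can be checked with a small simulation. The sketch below borrows the known-variance interval for the mean that is derived in the next section, and assumes a normal population with $\mu=5$ and $\sigma=1.2$; all of these values are chosen only for illustration.

```python
import math
import random
from statistics import NormalDist, fmean

random.seed(3)  # fixed seed so the run is repeatable

MU, SIGMA = 5.0, 1.2   # assumed "true" population parameters
N = 25                 # sample size
ALPHA = 0.05           # significance level, so confidence level is 0.95
REPS = 2000            # number of intervals to construct

# z-value with P(Z <= z_hi) = 1 - alpha/2 (about 1.96 for alpha = 0.05)
z_hi = NormalDist().inv_cdf(1 - ALPHA / 2)
half_width = z_hi * SIGMA / math.sqrt(N)

covered = 0
for _ in range(REPS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    x_bar = fmean(sample)
    # Does this interval contain the true mean?
    if x_bar - half_width <= MU <= x_bar + half_width:
        covered += 1

coverage = covered / REPS
print(f"fraction of intervals containing mu: {coverage:.3f}")
```

The printed fraction should land close to $1-\alpha=0.95$, with some run-to-run variation from the simulation itself.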

To demonstrate how to calculate confidence intervals and explore further nuances, the following sections construct a confidence interval for the population mean $\mu$ that is centered on the sample mean $\overline{x}$.

<h2>Confidence Interval on the Mean of a Normally Distributed Population</h2>

The goal is to provide an interval estimate of the unknown population mean $\mu$. Suppose that the population follows a normal distribution with a known variance, for example $\sigma^2=1.3$. <b>The sampling distribution of the sample mean is then a normal distribution with mean $\mu$ and variance $\frac{\sigma^2}{n}$</b> (by the Central Limit Theorem, the same holds approximately for other populations when $n$ is large). This is significant because <b>even if the center or mean $\mu$ of the sampling distribution is not known</b>, it is possible to infer <b>the probability of observing a certain difference between the sample mean $\overline{X}$ and the mean $\mu$</b>. In other words, because the population variance $\sigma^2$ is known, the sampling distribution's variance $\frac{\sigma^2}{n}$ can be calculated, and with it the probability associated with any given difference between the sample mean $\overline{X}$ and the population mean $\mu$.

This inference can be performed as follows:
<ul>
    <li>The population variance $\sigma^2$ is known.</li>
    <li>Calculate the sampling distribution's standard deviation:</li>
    <font size="4">
        $$\sigma_{\overline{X}}=\sqrt{\frac{\sigma^2}{n}}$$
    </font>
    <li>Choose a significance level $\alpha$. A common choice is $\alpha=0.05$. This corresponds to a probability or confidence level of $1 - \alpha = 0.95$.</li>
    <li>Use the standard normal distribution to find z-values that appropriately represent $\alpha$.</li>
    <ul>
        <li>Recall that $Z$ is a standardized normal random variable:</li>
        <font size="4">
            \begin{align}
                Z &= \frac{\overline{X} - \mu}{\frac{\sigma}{\sqrt{n}}}\\
            \end{align}
        </font>
        <li>For the lower bound, we want a value $z_{lo}$ that satisfies the following. Substituting the definition of $Z$ turns the probability statement into one about $\overline{X} - \mu$.</li>
        <font size="4">
            \begin{align}
                P(Z \le z_{lo}) &= \frac{\alpha}{2}\\
                P\left(\frac{\overline{X} - \mu}{\frac{\sigma}{\sqrt{n}}} \le z_{lo}\right) &= \frac{\alpha}{2}\\
                P\left(\overline{X} - \mu \le z_{lo}\frac{\sigma}{\sqrt{n}}\right) &= \frac{\alpha}{2}\\
                P\left(\overline{X} - \mu \le z_{lo}\frac{\sigma}{\sqrt{n}}\right) &= \frac{0.05}{2}\\
                P\left(\overline{X} - \mu \le z_{lo}\frac{\sigma}{\sqrt{n}}\right) &= 0.025\\
            \end{align}
        </font><br>
        <li>For the upper bound, we want a value $z_{hi}$ that satisfies:</li>
        <font size="4">
            \begin{align}
                P(Z \le z_{hi}) &= 1 - \frac{\alpha}{2}\\
                P\left(\frac{\overline{X} - \mu}{\frac{\sigma}{\sqrt{n}}} \le z_{hi}\right) &= 1 - \frac{\alpha}{2}\\
                P\left(\overline{X} - \mu \le z_{hi}\frac{\sigma}{\sqrt{n}}\right) &= 1 - \frac{\alpha}{2}\\
                P\left(\overline{X} - \mu \le z_{hi}\frac{\sigma}{\sqrt{n}}\right) &= 1 - \frac{0.05}{2}\\
                P\left(\overline{X} - \mu \le z_{hi}\frac{\sigma}{\sqrt{n}}\right) &= 0.975\\
            \end{align}
        </font><br>
        <li>Combining the upper and lower bounds with properties of probabilities:</li><br>
        <font size="4">
            \begin{align}
                P\left(z_{lo}\frac{\sigma}{\sqrt{n}} \le \overline{X} - \mu \le z_{hi}\frac{\sigma}{\sqrt{n}}\right) &= P\left(\overline{X} - \mu \le z_{hi}\frac{\sigma}{\sqrt{n}}\right) - P\left(\overline{X} - \mu \le z_{lo}\frac{\sigma}{\sqrt{n}}\right)\\
                x_{lo}&= z_{lo}\frac{\sigma}{\sqrt{n}}\\
                x_{hi}&= z_{hi}\frac{\sigma}{\sqrt{n}}\\
                P\left(x_{lo} \le \overline{X} - \mu \le x_{hi}\right) &=1 - \frac{\alpha}{2} - \frac{\alpha}{2}\\
                &=1 - \alpha\\
                &=1 - 0.05\\
                &=0.95
            \end{align}
        </font><br>
        <li>$\sigma$ and $n$ are known. $z_{lo}$ and $z_{hi}$ can be found through tables or computer software. In Python's scipy.stats module, the operation of finding a value for $z$ such that $P(Z \le z) = p$ for some chosen probability $p$ is called the <b>percent point function</b>. The actual call as of the time of writing is stats.norm.ppf() for the standard normal distribution.</li>
        <li>After $z_{lo}$ and $z_{hi}$ are plugged in, we now have <b>a range of values $[x_{lo}, x_{hi}]$ centered at 0</b>. Furthermore, the associated probability that $\overline{X} - \mu$ is found in this interval is $1 - \alpha$.</li>
        <li>Refer to confidence_intervals.py for more detailed notes accompanied by a simulated experiment and visual aids.</li>
    </ul>
</ul>

One last thing - we still do not know what the value of $\mu$ is. What we can do is to center the interval $[x_{lo}, x_{hi}]$ around our sample mean $\overline{x}$. Our confidence interval is then:
<font size="4">
    $$[\overline{x} + x_{lo}, \overline{x} + x_{hi}]$$
</font>

Because we used the standard normal distribution (which is centered at 0), the interval $[x_{lo}, x_{hi}]$ is also centered at 0. So we can express the interval as $[-x, x]$ where $x=x_{hi}=-x_{lo}$. Our confidence interval then becomes:<br>
<font size="4">
    $$\overline{x} \pm x$$
</font>
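The whole calculation can be sketched with `stats.norm.ppf` from scipy.stats. The variance $\sigma^2=1.3$ is the value used in the text; the sample size and the observed sample mean are hypothetical numbers chosen for illustration.

```python
import math
from scipy import stats

sigma2 = 1.3    # known population variance (value used in the text)
n = 20          # assumed sample size
x_bar = 10.4    # assumed observed sample mean
alpha = 0.05    # significance level

se = math.sqrt(sigma2 / n)             # std. dev. of the sampling distribution
z_lo = stats.norm.ppf(alpha / 2)       # P(Z <= z_lo) = alpha / 2
z_hi = stats.norm.ppf(1 - alpha / 2)   # P(Z <= z_hi) = 1 - alpha / 2

x_lo = z_lo * se                       # negative offset from x_bar
x_hi = z_hi * se                       # positive offset, equal to -x_lo

ci = (x_bar + x_lo, x_bar + x_hi)
print(f"z_lo = {z_lo:.3f}, z_hi = {z_hi:.3f}")
print(f"95% CI for mu: [{ci[0]:.3f}, {ci[1]:.3f}]")
```

Because the standard normal distribution is symmetric about 0, `z_lo` is just `-z_hi`, which is why the interval can be written as $\overline{x} \pm x$.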

<h3>Summary</h3>

In summary, $[\overline{x} + x_{lo}, \overline{x} + x_{hi}]$ is an interval that contains the true population mean $\mu$, $(1-\alpha)\times 100 \%$ of the time where: <br><br>
<font size="4">
\begin{align}
    x_{lo} &= z_{lo}\frac{\sigma}{\sqrt{n}}\\\\
    x_{hi} &= z_{hi}\frac{\sigma}{\sqrt{n}}\\\\
    P(Z \le z_{lo}) &= \frac{\alpha}{2}\\\\
    P(Z \le z_{hi}) &= 1 - \frac{\alpha}{2}\\\\
    P(\overline{x} + x_{lo} \le \mu \le \overline{x} + x_{hi}) &= 1 - \alpha
\end{align}
</font><br>

<h2>Large-Sample Confidence Intervals</h2>

Often, the population standard deviation $\sigma$ is not known. But to construct confidence intervals, $x_{lo}$ and $x_{hi}$ both require $\sigma$. What can be done here?

An easy approximation would be to <b>use the sample standard deviation $s$ instead of $\sigma$</b>. But this estimate only becomes good <b>when the sample size $n$ is reasonably large</b>. The Central Limit Theorem usually works well when $n\ge30$. To be safe, use this approximation when the sample size is, say, $n \ge 40$ to compensate for the extra variability introduced by using the sample standard deviation $s$ instead of $\sigma$.
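A sketch of the large-sample approximation, using only the standard library. The sample is simulated from a deliberately non-normal population (exponential with mean 2) whose $\sigma$ is treated as unknown; the population, sample size, and seed are assumptions for illustration.

```python
import math
import random
from statistics import NormalDist, fmean, stdev

random.seed(4)  # fixed seed so the run is repeatable

# Simulated stand-in for a large sample from a non-normal population
sample = [random.expovariate(0.5) for _ in range(60)]

n = len(sample)                  # n >= 40, so the approximation is reasonable
x_bar = fmean(sample)
s = stdev(sample)                # used in place of the unknown sigma
alpha = 0.05
z_hi = NormalDist().inv_cdf(1 - alpha / 2)

half_width = z_hi * s / math.sqrt(n)
ci = (x_bar - half_width, x_bar + half_width)
print(f"n = {n}, x_bar = {x_bar:.3f}, s = {s:.3f}")
print(f"approximate 95% CI for mu: [{ci[0]:.3f}, {ci[1]:.3f}]")
```

`NormalDist().inv_cdf` is the standard-library counterpart of the percent point function `stats.norm.ppf` mentioned earlier.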

<h2>Choice of Sample Size</h2>

Note that the width of the confidence interval depends on the sample size $n$. The width $x_{hi} - x_{lo}$ can be expressed as follows, and the equation can then be rearranged to compute the sample size $n$ needed for a specified width $x_{hi} - x_{lo}$. In other words, if I wanted a 95% confidence interval of a specific width $x_{hi} - x_{lo}$, say 5 cm, I can compute the sample size $n$ that is required to achieve that.<br><br>

<font size="4">
    \begin{align}
        x_{hi} - x_{lo} &= z_{hi}\frac{\sigma}{\sqrt{n}} - z_{lo}\frac{\sigma}{\sqrt{n}}\\\\
        &=\frac{\sigma}{\sqrt{n}}(z_{hi} - z_{lo})\\\\
        \sqrt{n}&=\frac{\sigma}{x_{hi} - x_{lo}}(z_{hi} - z_{lo})\\\\
        n&=\left(\frac{\sigma}{x_{hi} - x_{lo}}(z_{hi} - z_{lo})\right)^2
    \end{align}
</font>
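The rearranged formula can be sketched as a small helper. The function name `required_sample_size` and the example numbers ($\sigma=10$ cm, width of 5 cm) are hypothetical; the result is rounded up because $n$ must be a whole number.

```python
import math
from statistics import NormalDist

def required_sample_size(sigma, width, alpha=0.05):
    """Smallest n giving a two-sided (1 - alpha) interval of total width <= width.

    sigma is the known population standard deviation; width is x_hi - x_lo.
    """
    z_hi = NormalDist().inv_cdf(1 - alpha / 2)
    z_lo = NormalDist().inv_cdf(alpha / 2)
    n_exact = (sigma * (z_hi - z_lo) / width) ** 2
    return math.ceil(n_exact)   # round up to the next whole observation

# Hypothetical numbers: sigma = 10 cm, desired total interval width = 5 cm
n_needed = required_sample_size(sigma=10.0, width=5.0)
print(n_needed)
```

Since the width shrinks with $\sqrt{n}$, halving the width requires roughly four times as many observations.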

<h2>One-Sided vs Two-sided Confidence Intervals</h2>

The confidence interval calculated above is a <b>two-sided</b> interval. It asserts that the true but unknown population mean $\mu$ is between two values with a certain probability $1 - \alpha$. In contrast, a <b>one-sided</b> confidence interval asserts that the population mean $\mu$ is either less than or greater than some value with a certain probability $1 - \alpha$.

An <b>upper-confidence bound</b> asserts that the population mean $\mu$ is less than or equal to some value with a certain probability $1 - \alpha$:<br><br>
<font size="4">
    \begin{align}
        \mu &\le \overline{x} + x_{hi}\\\\
        x_{hi} &= z_{hi}\frac{\sigma}{\sqrt{n}}\\\\
        P(Z \le z_{hi}) &= 1 - \alpha\\\\
    \end{align}
</font>

A <b>lower-confidence bound</b> asserts that the population mean $\mu$ is greater than or equal to some value with a certain probability $1 - \alpha$:<br><br>
<font size="4">
    \begin{align}
        \mu &\ge \overline{x} + x_{lo}\\\\
        x_{lo} &= z_{lo}\frac{\sigma}{\sqrt{n}}\\\\
        P(Z \ge z_{lo}) &= 1 - \alpha\\\\
    \end{align}
</font>
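A sketch of both one-sided bounds, using the standard library's `NormalDist().inv_cdf` as the percent point function. The inputs ($\sigma=2$, $n=36$, $\overline{x}=50$) are hypothetical values for illustration.

```python
import math
from statistics import NormalDist

# Hypothetical inputs for illustration
sigma = 2.0     # known population standard deviation
n = 36          # sample size
x_bar = 50.0    # observed sample mean
alpha = 0.05

se = sigma / math.sqrt(n)
std_normal = NormalDist()

# Upper bound: all of alpha sits in the upper tail, P(Z <= z_hi) = 1 - alpha
z_hi = std_normal.inv_cdf(1 - alpha)
upper_bound = x_bar + z_hi * se

# Lower bound: all of alpha sits in the lower tail, P(Z >= z_lo) = 1 - alpha
z_lo = std_normal.inv_cdf(alpha)
lower_bound = x_bar + z_lo * se

print(f"mu <= {upper_bound:.3f} with 95% confidence")
print(f"mu >= {lower_bound:.3f} with 95% confidence")
```

Note that the one-sided z-value (about 1.645 for $\alpha=0.05$) is smaller than the two-sided one (about 1.96), because $\alpha$ is no longer split between two tails.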

<h2>Confidence Intervals with Unknown Variance</h2>

To construct confidence intervals without knowing the population variance, the above procedure still works with a few small changes:
<ul>
    <li>Instead of using $\sigma$ to calculate the $Z$ variable, use the sample standard deviation $s$ to calculate the $T$ variable:</li><br>
    <font size="4">
        $$T=\frac{\overline{X} - \mu}{\frac{s}{\sqrt{n}}}$$<br>
    </font>
    <li>The $T$ variable does not follow the standard normal distribution. When the population is normally distributed, it follows Student's $t$ distribution with $n-1$ degrees of freedom.</li>
    <li>In Python's scipy.stats module, stats.t is used to invoke the $t$ distribution.</li>
    <li>See confidence_intervals.py</li>
</ul>
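A minimal sketch of a $t$-based interval using `stats.t.ppf`, assuming the sample comes from a normal population. The measurements are made-up values for illustration, and the comparison against the z-value is included only to show the effect of the change.

```python
import math
import statistics
from scipy import stats

# Hypothetical measurements (made up for illustration)
sample = [4.9, 5.1, 5.0, 4.8, 5.2, 5.0, 4.7, 5.3]

n = len(sample)
x_bar = statistics.fmean(sample)
s = statistics.stdev(sample)            # sample standard deviation
alpha = 0.05
df = n - 1                              # degrees of freedom

t_lo = stats.t.ppf(alpha / 2, df)       # P(T <= t_lo) = alpha / 2
t_hi = stats.t.ppf(1 - alpha / 2, df)   # P(T <= t_hi) = 1 - alpha / 2
se_hat = s / math.sqrt(n)               # estimated standard error

ci = (x_bar + t_lo * se_hat, x_bar + t_hi * se_hat)

z_hi = stats.norm.ppf(1 - alpha / 2)    # for comparison only
print(f"t_hi = {t_hi:.3f} vs z_hi = {z_hi:.3f}")
print(f"95% CI for mu: [{ci[0]:.3f}, {ci[1]:.3f}]")
```

For small samples the $t$ value is larger than the corresponding $z$ value, so the interval is wider, reflecting the extra uncertainty from estimating $\sigma$ with $s$.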