## <span style="color:darkblue"> Class Notes: Unit 3 (Random variables and distributions)

*The content of this notebook is based on modified materials from Open Intro Biostatistics (OIBiostat)*


----
#### <span style="color:darkblue">  A. Discrete random variables and their probability distribution

A <span style="color:violet">*probability distribution*<span style="color:black"> is a table (often shown as a graph) of all disjoint outcomes and their associated probabilities.  

Counting the number of heads, $X$, in 4 tosses of a fair coin:

<img style="float: center", src="prob_4tosses.png">


If the variable $X$ records the number of heads, then $X$ is a random variable and the graph shows its distribution.

A <span style="color:violet">*random variable*<span style="color:black"> assigns a numerical value to each outcome of a random phenomenon, and is usually written with a capital letter such as $X$, $Y$, or $Z$. 

A <span style="color:violet">*discrete random variable*<span style="color:black"> takes on a finite (or countably infinite) number of values.

Suppose $X$ is the number of heads in *3* tosses of a coin, instead of 4.


<img style="float: center", src="coinToss.png">

#### Distribution of discrete random variables

The distribution of a discrete random variable is the collection of its values and the probabilities associated with those values.

It is usually recorded in a table, a graph, or both.

|i|1|2|3|4|Total| 
|:--------:|--|--|--|--|:---------:| 
|$x_i$ | 0 | 1| 2 | 3 |  --| 
| $P(X = x_i)$ |  1/8|  3/8 |  3/8 |  1/8 |  8/8 = 1.00| 


**Bar graph showing a distribution**

<img style="float: center", src="barPlotCoinTossing.png">
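
The table and bar graph for three tosses can be reproduced by listing every possible outcome and counting heads. The R sketch below assumes a fair coin, so that all eight outcomes are equally likely; the names `tosses`, `heads`, and `dist` are illustrative and do not come from the original notes.

```r
# Enumerate all 2^3 = 8 equally likely outcomes of three tosses (1 = head, 0 = tail)
tosses <- expand.grid(toss1 = 0:1, toss2 = 0:1, toss3 = 0:1)

# X = number of heads in each outcome
heads <- rowSums(tosses)

# Distribution of X: proportion of outcomes that give each value of X
dist <- table(heads) / nrow(tosses)
print(dist)

# Bar graph of the distribution
barplot(dist, xlab = "Number of heads", ylab = "Probability")
```

The printed proportions should match the row $P(X = x_i)$ in the table: 1/8, 3/8, 3/8, 1/8.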

#### Expectation of a random variable (Definition in OI Biostat, 2.3.2)

If $X$ has outcomes $x_1$, ..., $x_k$ with probabilities $P(X=x_1)$, ..., $P(X=x_k)$, the <span style="color:violet">*expected value*<span style="color:black"> of $X$ is the sum of each outcome multiplied by its corresponding probability:
\begin{align}
E(X) 	&= x_1 P(X=x_1) + \cdots + x_k P(X=x_k) \notag \\
	&= \sum_{i=1}^{k}x_iP(X=x_i) \notag
\end{align}
The Greek letter $\mu$ may be used in place of the notation $E(X)$ and is sometimes written $\mu_X$.


*Simple example:*

Let $X$ be the number of heads in 3 tosses of a coin. Then

\begin{align*}
E(X) &= 0P(X=0) + 1P(X=1) + 2P(X=2) + 3P(X = 3)  \\
	&= (0)(1/8) + (1)(3/8) + (2)(3/8) + (3)(1/8)  \\
	&= 12/8  \\
	&= 1.5 
\end{align*}
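
The same arithmetic can be checked in R. This is a minimal sketch in which the vectors `x` and `p` are typed in by hand from the three-toss distribution; the variable names are illustrative.

```r
# Values of X (heads in 3 tosses) and their probabilities
x <- c(0, 1, 2, 3)
p <- c(1, 3, 3, 1) / 8

# Expected value: sum of each value multiplied by its probability
mu <- sum(x * p)
mu
```

The result agrees with the hand calculation, $E(X) = 1.5$.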


#### Variance and SD of a random variable (OI Biostat 2.3.3)

If $X$ takes on outcomes $x_1$, ..., $x_k$ with probabilities $P(X=x_1)$, ..., $P(X=x_k)$ and expected value $\mu=E(X)$, then the <span style="color:violet">*variance*<span style="color:black"> of $X$, denoted by $\text{Var}(X)$ or the symbol $\sigma^2$, is
\begin{align}
\sigma^2 &= (x_1-\mu)^2 P(X=x_1) + \cdots \notag \\
	& \qquad\quad\cdots+ (x_k-\mu)^2 P(X=x_k) \notag \\
	&= \sum_{j=1}^{k} (x_j - \mu)^2 P(X=x_j) \notag
\end{align}
The <span style="color:violet">*standard deviation (sd)*<span style="color:black"> of $X$, labeled $\sigma$, is the square root of the variance.  It is sometimes written $\sigma_X$.


*Example continued:* 

Let $X$ be the number of heads in 3 tosses of a coin. Then

\begin{align}
    \sigma_X^2 &= (x_1-\mu)^2P(X=x_1) + \cdots + (x_4-\mu)^2 P(X=x_4) \notag \\
    	&= (0- 1.5)^2(1/8) + (1 - 1.5)^2 (3/8) + (2 -1.5)^2 (3/8) + (3-1.5)^2 (1/8) \notag  \\
        &= 3/4 = 0.75. \notag
    \end{align}
    
The standard deviation is $\sqrt{3/4} = \sqrt{3}/2 \approx 0.866$.  
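
The variance and standard deviation can be computed the same way in R. The sketch below re-enters the three-toss distribution by hand; `sigma2` and `sigma` are illustrative names.

```r
# Values of X (heads in 3 tosses) and their probabilities
x <- c(0, 1, 2, 3)
p <- c(1, 3, 3, 1) / 8

# Expected value
mu <- sum(x * p)

# Variance: probability-weighted average of squared deviations from the mean
sigma2 <- sum((x - mu)^2 * p)

# Standard deviation: square root of the variance
sigma <- sqrt(sigma2)

c(variance = sigma2, sd = sigma)
```

Note that R's built-in `var()` and `sd()` are sample statistics computed from data and are not appropriate here, because the calculation weights each value by its probability.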


----
#### Unit 3 In-class Exercise 
----

----
#### <span style="color:darkblue"> B. Continuous random variables and the normal distribution

- Concept of <span style="color:violet">*continuous distribution*<span style="color:black">: the distribution for a variable that can take on all values in a specified range.

- The <span style="color:violet">*normal distribution*<span style="color:black">, perhaps the most important continuous distribution in statistics.  


#### Probabilities for continuous distributions

Two important features of continuous distributions:

-  The total area under the density curve is 1

-  The probability that a variable has a value within a specified interval is the area under the curve over that interval

<img style="float: center", src="fdicHeightContDistFilled.png">
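
Both features can be checked numerically. The sketch below uses the standard normal density `dnorm` as a stand-in for a generic density curve (an assumption made for illustration; it is not the height distribution in the figure) and R's `integrate` function to compute areas.

```r
# Total area under the standard normal density curve
total_area <- integrate(dnorm, lower = -Inf, upper = Inf)$value

# Probability of falling in the interval [0, 1] = area under the curve over [0, 1]
area_0_to_1 <- integrate(dnorm, lower = 0, upper = 1)$value

# The same probability from the cumulative distribution function
prob_0_to_1 <- pnorm(1) - pnorm(0)

c(total_area = total_area, area = area_0_to_1, prob = prob_0_to_1)
```

The total area should be 1 up to numerical error, and the two interval calculations should agree.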


#### Features of the normal distribution


- About 68% of the data are within 1 SD of the mean

- About 95% of the data are within 2 SDs of the mean

- About 99.7% of the data are within 3 SDs of the mean

<img style="float: center", src="A6895997.png">
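
These percentages can be recovered from the standard normal distribution with `pnorm`. In this short sketch, `k` and `within_k_sd` are illustrative names.

```r
# Number of standard deviations from the mean
k <- 1:3

# Area within k SDs of the mean: P(-k < Z < k) for a standard normal Z
within_k_sd <- pnorm(k) - pnorm(-k)

round(within_k_sd, 4)
```

The values are roughly 0.683, 0.954, and 0.997, which is why the rule is stated with rounded percentages.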

#### A normal example

The distributions of test scores on the SAT and the ACT are both nearly normal. 

Suppose that one student scores an 1800 on the SAT (Student A) and another student scores a 24 on the ACT (Student B). Which student performed better?

<img style="float: center", src="satActNormals.png">



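- SAT scores are $N(1500, 300)$. ACT scores are $N(21,5)$. Here $N(\mu, \sigma)$ denotes a normal distribution with mean $\mu$ and standard deviation $\sigma$.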


- $x_A$ represents the score of Student A; $x_B$ represents the score of Student B.  


$$Z_{A} = \frac{x_{A} - \mu_{SAT}}{\sigma_{SAT}} = \frac{1800-1500}{300} = 1$$


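$$Z_{B} = \frac{x_{B} - \mu_{ACT}}{\sigma_{ACT}} = \frac{24 - 21}{5} = 0.6$$

Student A scored 1 SD above the SAT mean, while Student B scored 0.6 SD above the ACT mean, so Student A performed better relative to other test takers.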
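
The two standardized scores can be computed in R with a small helper. The function `z_score` is a hypothetical name introduced here for illustration; it is not a built-in R function.

```r
# Hypothetical helper: number of SDs that x lies above (or below) the mean
z_score <- function(x, mu, sigma) (x - mu) / sigma

# Student A: 1800 on the SAT, where scores are N(1500, 300)
z_A <- z_score(1800, mu = 1500, sigma = 300)

# Student B: 24 on the ACT, where scores are N(21, 5)
z_B <- z_score(24, mu = 21, sigma = 5)

c(z_A = z_A, z_B = z_B)
```

Because both scores are now on the same standard-deviation scale, they can be compared directly.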


#### Calculating normal probabilities (example 1)

What is the percentile rank for a student who scores an 1800 on the SAT for a year in which the scores are $N(1500, 300)$?

a) Calculate a $Z$-score. If $X$ is a normal random variable with mean $\mu$ and standard deviation $\sigma$, 

$$
Z = \frac{X - \mu}{\sigma} 
$$ 
is a standard normal random variable (mean $\mu = 0$, standard deviation $\sigma =1$). For a score of 1800, $Z = (1800 - 1500)/300 = 1$.

b) Calculate the normal probability. 

- `pnorm(z)` calculates the area (i.e., probability) to the left of $z$

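```r
# Area to the left of z = 1 under the standard normal curve
pnorm(1)
```

This returns approximately 0.841, so a score of 1800 is at about the 84th percentile.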
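
As an alternative to standardizing by hand, `pnorm` accepts `mean` and `sd` arguments. The sketch below compares the two routes for the SAT example; `p_via_z` and `p_direct` are illustrative names.

```r
# Route 1: standardize first, then use the standard normal
z <- (1800 - 1500) / 300
p_via_z <- pnorm(z)

# Route 2: pass the mean and standard deviation directly
p_direct <- pnorm(1800, mean = 1500, sd = 300)

c(p_via_z = p_via_z, p_direct = p_direct)
```

Both routes give the same probability, about 0.841.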

#### Calculating normal probabilities (example 2)

What score on the SAT would put a student in the 99$^{th}$ percentile?

a) Identify the $Z$-value. `qnorm(p)` calculates the value $z$ such that for a standard normal variable $Z$, $p = P(Z \leq z)$.

```r
# z-value with 99% of the standard normal area to its left
qnorm(0.99)
```

This returns approximately 2.33.

b) Calculate the score, $X$. If $Z$ is a standard normal random variable, 
$$
X = \sigma Z + \mu
$$

is Normal with mean $\mu$ and standard deviation $\sigma$.

\begin{align*}
X &= \sigma Z + \mu \\
&\approx 300(2.33) + 1500 \\
&= 2199
\end{align*}

A score of about 2199 would put a student in the 99$^{th}$ percentile.
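
The score can also be obtained in one step, since `qnorm` accepts `mean` and `sd` arguments. This sketch compares that route with the two-step calculation; `z_99`, `score_via_z`, and `score_direct` are illustrative names.

```r
# Route 1: find the z-value, then convert back to the SAT scale
z_99 <- qnorm(0.99)
score_via_z <- 300 * z_99 + 1500

# Route 2: ask for the 99th percentile of N(1500, 300) directly
score_direct <- qnorm(0.99, mean = 1500, sd = 300)

c(score_via_z = score_via_z, score_direct = score_direct)
```

Both give about 2197.9. The hand calculation gives 2199 because it rounds the z-value to 2.33 before multiplying.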

