# MATH 3350 Course Notes - Module S2

## The Normal Distribution

A Normal distribution with mean $ \mu $ and standard deviation $ \sigma $ is denoted by $ N(\mu , \sigma)$. Its probability density function is:

$ f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left ( \frac{x-\mu}{\sigma}\right )^2}$

The distribution is centered at $ \mu $ and the spread of the distribution is proportional to $ \sigma $.
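
As a quick check, the density can be coded by hand and compared with R's built-in `dnorm`. This is a minimal sketch that assumes the standard form of the normal density, with a factor of $-\frac{1}{2}$ in the exponent; the helper name `normal_density` and the sample scores are made up for illustration.

```r
# Hand-written normal density (normal_density is a hypothetical helper name)
normal_density <- function(x, mu, sigma) {
  1 / (sigma * sqrt(2 * pi)) * exp(-0.5 * ((x - mu) / sigma)^2)
}

ACT_mu <- 21
ACT_sigma <- 6
scores <- c(15, 21, 27)   # one sigma below the mean, the mean, one sigma above

by_hand  <- normal_density(scores, ACT_mu, ACT_sigma)
built_in <- dnorm(scores, mean = ACT_mu, sd = ACT_sigma)

print(by_hand)
print(all.equal(by_hand, built_in))   # TRUE when the two calculations agree
```

The density is highest at the mean and takes the same value one standard deviation on either side, which is what "centered at $ \mu $" means in practice.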
 

#### ACT Scores are approximately normally distributed $ N(21,6) $

In [None]:
ACT_mu <- 21
ACT_sigma <- 6
x <- seq(ACT_mu-3*ACT_sigma, ACT_mu+3*ACT_sigma,length.out = 100) # "empirical rule" - 3 sigmas
d <- dnorm(x, mean = ACT_mu, sd = ACT_sigma)
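# Set the axis labels and title in plot() itself; a separate title(xlab = ...) call
# would draw 'Score' on top of the default x-axis label.
plot(x, d, type = 'l', xlab = 'Score', ylab = 'Density', main = 'Distribution of ACT Scores')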

#### If you scored a 27 on the ACT, what is your percentile? (What percentage of scores are at or below your score?)
This is equivalent to finding the area under the distribution below the boundary _x=27_.


In [None]:
plot(x,d, type = 'l', ylab='Density',xlab='Score', main="ACT Score Distribution")
abline(v=27,col="red",lwd=2)

The **`pnorm`** command will compute this area.  Note that by default, this computes the area to the LEFT of the boundary.

Note that the area will be between 0 and 1 (a proportion of the entire curve).  Multiply by 100 to express as a percentage.

In [None]:
left_area <- pnorm(27,ACT_mu,ACT_sigma)   # pnorm(score, mu, sigma)
cat("Proportion:", left_area, "\n")
cat("Percent:", 100*left_area, "\n")
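
The same function can also give the area to the RIGHT of a boundary, or the area between two boundaries. The sketch below is only an illustration of that idea: it uses the `lower.tail = FALSE` argument of `pnorm` and a difference of two left-hand areas. The variable names `below_27`, `above_27`, and `between_21_27` are made up for this example.

```r
ACT_mu <- 21
ACT_sigma <- 6

below_27 <- pnorm(27, ACT_mu, ACT_sigma)                       # area to the left of 27
above_27 <- pnorm(27, ACT_mu, ACT_sigma, lower.tail = FALSE)   # area to the right of 27

# Area between the mean (21) and 27: subtract one left-hand area from the other
between_21_27 <- pnorm(27, ACT_mu, ACT_sigma) - pnorm(21, ACT_mu, ACT_sigma)

cat("Below 27:", below_27, "\n")
cat("Above 27:", above_27, "\n")
cat("Between 21 and 27:", between_21_27, "\n")
```

The left and right areas should add up to 1, since together they cover the entire curve.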

#### SAT scores are approximately normally distributed $ N(1060 , 217) $
#### If you are at the 70th percentile on the SAT, what is your score?
This is the **_reverse_** of the function above.  We are given the area to the left of the boundary, and we need to find the boundary. 

The **`qnorm`** command will compute this boundary.  We will round to the nearest whole number, since SAT scores are always whole numbers.

In [None]:
SAT_mu <- 1060
SAT_sigma <- 217

round(qnorm(.70, mean = SAT_mu,  sd = SAT_sigma))
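
One way to convince yourself that `qnorm` and `pnorm` undo each other is to feed the output of one into the other. This is a small sketch of that round trip; the names `boundary` and `area_back` are made up for the example.

```r
SAT_mu <- 1060
SAT_sigma <- 217

# Area -> boundary: the (unrounded) score at the 70th percentile
boundary <- qnorm(0.70, mean = SAT_mu, sd = SAT_sigma)

# Boundary -> area: this should recover the 0.70 we started with
area_back <- pnorm(boundary, mean = SAT_mu, sd = SAT_sigma)

cat("Boundary:", boundary, "\n")
cat("Area to the left of that boundary:", area_back, "\n")
```

Note that the round trip is only exact before rounding; a rounded score sits at a slightly different percentile.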

#### Comparing Scores

Which is the better score?  A 27 on the ACT -- $ N(21 , 6) $

or 

a 1300 on the SAT -- $ N(1060 , 217) $ ?

In [None]:
ACT_score <- 27
SAT_score <- 1300

p_ACT <- pnorm(ACT_score, ACT_mu, ACT_sigma)
cat('The ACT score is at percentile', p_ACT * 100, '\n')

p_SAT <- pnorm(SAT_score, SAT_mu, SAT_sigma)
cat('The SAT score is at percentile', p_SAT * 100, '\n')

####   How many standard deviations above the mean is an ACT score of 27? 

We can visualize the answer by plotting the distribution.  The dashed vertical lines show the boundaries at 1, 2, and 3 standard deviations above the mean. The blue triangle marker shows the location (on the x-axis) of the ACT score of 27.

In [None]:
#ACT plot with standard deviations
x <- seq(ACT_mu-3.2*ACT_sigma, ACT_mu+3.2*ACT_sigma,length.out = 100) #Extend just past 3 sigmas
d <- dnorm(x, mean = ACT_mu, sd = ACT_sigma)
plot(x,d, type = 'l', main='ACT Scores', xlab = 'Score', ylab = 'Density')
abline(v=ACT_mu,col="red")
abline(v=ACT_mu+ACT_sigma,lty=2,col="grey")
abline(v=ACT_mu+2*ACT_sigma,lty=2,col="grey")
abline(v=ACT_mu+3*ACT_sigma,lty=2,col="grey")
points(x=ACT_score,y=-0.001,col="blue",pch=17,cex=2)

This is a particularly easy case, because the score is exactly one standard deviation (6 points) above the mean.

####   How many standard deviations above the mean is an SAT score of  1300?

We take a similar approach to visualize this scenario below.  However, this time, the score does not coincide neatly with a whole number of standard deviations.

In [None]:
#SAT plot with standard deviations
x <- seq(SAT_mu-3.2*SAT_sigma, SAT_mu+3.2*SAT_sigma,length.out = 100) #Extend just past 3 sigmas
d <- dnorm(x, mean = SAT_mu, sd = SAT_sigma)
plot(x,d, type = 'l', main='SAT Scores', xlab = 'Score', ylab = 'Density')
abline(v=SAT_mu,col="red")
abline(v=SAT_mu+SAT_sigma,lty=2,col="grey")
abline(v=SAT_mu+2*SAT_sigma,lty=2,col="grey")
abline(v=SAT_mu+3*SAT_sigma,lty=2,col="grey")
points(x=SAT_score,y=-0.00002,col="blue",pch=17,cex=2)

Clearly, the SAT score is a little more than 1 standard deviation above the mean.
 
Since the score is 1300, and the mean is 1060, the score is 240 points above the mean. One standard deviation is 217 points, so this score is one standard deviation PLUS an additional 23 points.  As a fraction of one standard deviation, this is $\frac {23}{217} \approx 0.106$, making the total number of standard deviations approximately 1.106.

This can be shown succinctly in the following equation:
   
 SAT standard deviations above mean $ =  \frac{1300-1060}{217} \approx{1.11} $
 
This value (the number of standard deviations from the mean) is called a **_z-score_**, and the calculation we performed follows the formula:

## <center>$z=\frac{x-\mu}{\sigma} $</center>
 
Note that the formula also reproduces our earlier result for the ACT score (although we didn't really need the formula to figure it out!):

 ACT standard deviations above mean $ =  \frac{27-21}{6} = 1 $  
 
The code below shows how this formula is computed in _**R**_.

In [None]:
ACT_z <- (ACT_score - ACT_mu)/ACT_sigma
SAT_z <- (SAT_score - SAT_mu)/SAT_sigma
cat('The ACT score is ', ACT_z,' standard deviation(s) away from the mean \n')
cat('The SAT score is ', SAT_z,' standard deviation(s) away from the mean \n')

#### Note that z-scores can be negative.  What would this mean?

_What ACT score would have a z-score of -1.5?_
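
Questions like this one run the z-score formula backwards: solving $z=\frac{x-\mu}{\sigma}$ for $x$ gives $x = \mu + z\sigma$. The sketch below illustrates that rearrangement with different z-scores than the one in the question; `z_to_score` is a hypothetical helper name, not a built-in R function.

```r
# Convert a z-score back to a raw score: x = mu + z * sigma
# (z_to_score is a hypothetical helper name)
z_to_score <- function(z, mu, sigma) {
  mu + z * sigma
}

ACT_mu <- 21
ACT_sigma <- 6

one_above <- z_to_score(1, ACT_mu, ACT_sigma)    # one standard deviation above the mean
one_below <- z_to_score(-1, ACT_mu, ACT_sigma)   # one standard deviation below the mean

cat("z = +1 corresponds to an ACT score of", one_above, "\n")
cat("z = -1 corresponds to an ACT score of", one_below, "\n")
```

A negative z-score simply lands below the mean, by the same distance that the matching positive z-score lands above it.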

### The Standard Normal Distribution $ N(0,1)$

Recall that the **Standard Normal Distribution** has $\mu=0$ and $\sigma=1$.  This means that 1 standard deviation corresponds to exactly 1 unit. 
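
Because a z-score measures position in standard deviations, the area to the left of a raw score in $ N(\mu , \sigma) $ should match the area to the left of its z-score in $ N(0,1) $. The following sketch checks this for the ACT score of 27; the names `area_raw` and `area_std` are made up for the example.

```r
ACT_mu <- 21
ACT_sigma <- 6
ACT_score <- 27

ACT_z <- (ACT_score - ACT_mu) / ACT_sigma   # z-score of the raw score

area_raw <- pnorm(ACT_score, mean = ACT_mu, sd = ACT_sigma)   # area using N(21, 6)
area_std <- pnorm(ACT_z)                                      # area using N(0, 1), the defaults

cat("Area using the raw score:", area_raw, "\n")
cat("Area using the z-score:  ", area_std, "\n")
```

This is why z-scores let us compare scores from different tests on a single common scale.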



In [None]:
#Plot the Standard Normal Distribution with standard deviation boundaries

x <- seq(-3.2, 3.2, length.out = 100)  #Extend just past 3 sigmas
d <- dnorm(x)                          #Default mean is 0; default sd is 1
plot(x,d, type = 'l', main='Standard Normal Distribution', xlab = "",ylab="Density")
abline(v=0,col="red")
abline(v=-3,lty=2,col="grey")
abline(v=-2,lty=2,col="grey")
abline(v=-1,lty=2,col="grey")
abline(v=1,lty=2,col="grey")
abline(v=2,lty=2,col="grey")
abline(v=3,lty=2,col="grey")

#### The 'Empirical Rule'
The behavior of ALL Normal distributions is consistent with the following characteristics:

- Approximately 68% of observations lie within 1 standard deviation of the mean ($-1 \le z \le +1$)
- Approximately 95% of observations lie within 2 standard deviations of the mean ($-2 \le z \le +2$)
- Approximately 99.7% of observations lie within 3 standard deviations of the mean ($-3 \le z \le +3$)

This can be visualized in the plot above as the area under the curve within the specified bands.

In [None]:
#Area in tails (note left tail is used by default)
right_tail <- pnorm(1,lower.tail=FALSE)   #Proportion more than 1 standard deviation above
left_tail <- pnorm(-1)                    #Proportion more than 1 standard deviation below

within_1sd <- 1 - (left_tail + right_tail)

cat("Left tail:",left_tail,"\n")
cat("Right tail:", right_tail,"\n")
cat("Within 1 SD:", within_1sd,"\n")
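
The same idea extends to 2 and 3 standard deviations. As a rough sketch, the area within $k$ standard deviations can be computed as the left-hand area at $+k$ minus the left-hand area at $-k$; the names `k` and `within_k` are made up for this example.

```r
k <- 1:3                           # number of standard deviations
within_k <- pnorm(k) - pnorm(-k)   # area between -k and +k on the standard normal curve

for (i in seq_along(k)) {
  cat("Within", k[i], "SD:", round(100 * within_k[i], 2), "%\n")
}
```

The results should land close to the 68%, 95%, and 99.7% figures quoted in the Empirical Rule.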
