# Week 5. Continuous Distributions

All the distributions we have seen so far refer to cases where the set S of possible outcomes is countable, that is, finite or countably infinite. These were the discrete distributions. We will now work with continuous distributions, where S contains an entire interval of possible outcomes and is therefore uncountably infinite.

For continuous random variables the probability that X takes on any particular value x is 0. We therefore need to find the probability that X falls in some interval (a, b). <br><br>

![title](integral.gif)

### Examples of continuous random variables<br>

1. Height of randomly selected corn plants <br><br>

2. Volume in a randomly selected experimental pond<br><br>

In this case we cannot use a PMF, as there is an infinite number of possible outcomes between any two given values. Instead we need a function f(x) that we can integrate between two constants a and b, where $-\infty \leq a \leq b \leq \infty $:

$$P(a\leq X \leq b) = \int_{a}^{b} f(x) dx $$



f(x) is then a PDF (Probability Density Function) 



Properties:

$\large \int_{-\infty}^{\infty} f(x) dx = 1 $  **The area under the curve over the whole set S is 1**

$\large f(x) \geq 0 $ **Never negative**

Note that the value of a PDF at a given point can be larger than 1, because it is a density and not a probability.

## For Example:

Given the function

#### $f(x) = 3x^2$, for $0 < x < 1$

#### Take the value x = 0.9, $f(0.9) = 3(0.9)^2 = 2.43 > 1$

In this case f(x) is not a probability; instead it is the height of the curve at x.
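
As a quick numerical check of the two properties for this example, the sketch below evaluates the height at 0.9 and the total area between 0 and 1. The names `f`, `height` and `total` are only illustrative.

```r
# Density from the example above: f(x) = 3x^2 on the interval (0, 1)
f = function(x) {3 * x^2}

height = f(0.9)              # height of the curve at x = 0.9
total = integrate(f, 0, 1)   # area under the whole curve

height        # larger than 1, which is allowed for a density
total$value   # numerically 1, as required for a PDF
```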

In [None]:
curve(3*x^2, 0,1)
abline(v=0.9, col = "red")
abline(h=2.43, col = "red")
points(0.9,2.43,col = "blue", pch = 15)

## Exercise:

What is the probability that X falls between ½ and 1?  That is, what is P(½ < X < 1)?

Let's first draw the function and get the x and y points

In [None]:
curve(3*x^2, 0,1)
##What kind of curve is this??

when x = 0.5 then y  = 3*(0.5^2) 

= 0.75

and when x = 1 then y = 3 * (1 ^2)

= 3

In [None]:
curve(3*(x^2), 0,1)
points(0.5,0.75,col = "red")
points(1,3, col = "red")

### To get the probability we need to calculate the area under the curve between these two points

In [None]:
curve(3*x^2, 0,1)
x=seq(0.5,1,length=200)
y=3*x^2 ##height of the curve at each x
polygon(c(0.5,x,1),c(0,y,0),col="skyblue") ##shade from the x axis up to the curve
points(0.5,0.75,col = "red", pch = 19)
points(1,3, col = "red",pch = 19)

To get the area under the curve we need to integrate the function, using the ends of the interval as the limits of the integral

$$P(0.5\leq X \leq 1) = \int_{0.5}^{1} 3x^2 dx $$

$$ = \bigg[ x^3 \bigg] _{x=0.5}^{x=1} $$

$$ = (1^3) - (0.5^3) $$

$$ = 1- 0.125 $$

## $$ P = 0.875 $$

# Let's do it the easy way

In [None]:
f_ex = function(x) {3*(x^2)} ##create the function
integrate(f_ex,0.5,1) ##Integrate using the boundaries
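
As a cross-check, the numerical result can be compared with the antiderivative $x^3$ evaluated at the two limits. This is only a sketch; `f_area`, `F_anti`, `by_hand` and `by_integrate` are illustrative names.

```r
f_area = function(x) {3 * x^2}      # same density as in the example
F_anti = function(x) {x^3}          # its antiderivative

by_hand = F_anti(1) - F_anti(0.5)   # 1 - 0.125
by_integrate = integrate(f_area, 0.5, 1)$value

by_hand
all.equal(by_hand, by_integrate)    # the two approaches agree
```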

Similarly a CDF of a continuous distribution is defined as:

$$F(x) = \int_{-\infty}^{x} f(t) dt $$

$$\text{for } -\infty < x < \infty$$

Remember that the CDF accumulates the probability until reaching 1. 

Looking at our previous example, $f(x) = 3x^2$ for $0 < x < 1$:
to calculate the CDF we integrate the function, but the limits now go from 0 to x, which gives $F(x) = x^3$.
At x = 1 the CDF reaches P = 1, the total area.
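
A minimal sketch of this idea, assuming the density is 0 below 0 so the integral can start there: the CDF is built numerically and compared with $x^3$. The names `f_pdf`, `F_num` and `xs` are illustrative.

```r
f_pdf = function(t) {3 * t^2}

# CDF by numerical integration from 0 up to x
F_num = function(x) {integrate(f_pdf, 0, x)$value}

xs = c(0.25, 0.5, 0.75, 1)
cdf_num = sapply(xs, F_num)
cdf_exact = xs^3

cbind(xs, cdf_num, cdf_exact)

# An interval probability is a difference of two CDF values
p_interval = F_num(1) - F_num(0.5)
p_interval
```

Working with differences of CDF values is the same pattern used later with the built-in `p` functions such as `punif` and `pnorm`.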

In [None]:
curve(x^3,0,1)

## The Uniform Distribution

The PDF of a random variable X that has a uniform distribution on the interval (a, b) is:

$$f(x) = \frac {1} {b-a}$$

The CDF is:

$$ F(x) = \frac {x-a}{b-a}$$

where a < x < b

This distribution has a constant probability density over all points of the interval
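
In R these two formulas correspond to `dunif` and `punif`. The sketch below uses hypothetical limits a = 2 and b = 6, chosen only for illustration.

```r
a = 2; b = 6          # hypothetical interval limits
xs = c(2.5, 4, 5.5)

pdf_vals = dunif(xs, min = a, max = b)   # constant height 1/(b - a)
cdf_vals = punif(xs, min = a, max = b)   # (x - a)/(b - a)

pdf_vals
cdf_vals

# Probability of an interval inside (a, b) as a difference of CDF values
p_3_to_5 = punif(5, a, b) - punif(3, a, b)
p_3_to_5
```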

In [None]:
par(mfrow=c(1,2))
a = 0.4; b = 0.6                        # Limits of the uniform interval
plot(x = 0, y = 0, type = "n",         # Set up a blank plot
     xlim = c(0, 1), ylim = c(0, 6),    # Define x and y range
     xlab = "x", ylab = "f(x)", main = "PDF")

lines(x = c(0, a, a, b, b, 1),          # Density is 0 outside (a, b)
      y = c(0, 0, 1/(b-a), 1/(b-a), 0, 0), col = "red")   # and 1/(b-a) = 5 inside

plot(x = 0, y = 0, type = "n",         # Set up a blank plot
     xlim = c(0, 1), ylim = c(0, 1),    # Define x and y range
     xlab = "x", ylab = "F(x)", main = "CDF")

lines(x = c(0, a, b, 1),                # CDF rises linearly from 0 at a to 1 at b
      y = c(0, 0, 1, 1), col = "red")

## One of the most important applications of the uniform distribution is in the generation of random numbers. 

In [None]:
runif(1,0,1) ##generates one random number taken from the uniform distribution between the range of 0 and 1
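
A small sketch of what many such draws look like. The seed value is arbitrary and only there so the numbers can be reproduced; `u` is an illustrative name.

```r
set.seed(123)                 # arbitrary seed, only for reproducibility
u = runif(10000, min = 0, max = 1)

range(u)          # all draws stay inside (0, 1)
mean(u)           # close to the theoretical mean (a + b)/2 = 0.5
mean(u < 0.25)    # close to F(0.25) = 0.25
```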

## The Normal Distribution

The most widely used model for random variables with continuous distributions is the family of normal distributions


$$ \text{PDF} = f(x | \mu, \sigma^2) = \frac {1} {\sigma \sqrt {2\pi}} e^{-\frac {(x - \mu)^2} {2\sigma^2}} $$

$$mean = \mu$$

$$Variance = \sigma^2$$
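
The formula can be written out by hand and compared with R's built-in `dnorm`. This is a sketch; `normal_pdf` is an illustrative helper, not a base R function. Note that `dnorm` takes the standard deviation (`sd`), not the variance.

```r
# Normal PDF written directly from the formula
normal_pdf = function(x, mu, sigma) {
    1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
}

xs = seq(-3, 3, by = 0.5)
same_as_dnorm = all.equal(normal_pdf(xs, mu = 0, sigma = 1),
                          dnorm(xs, mean = 0, sd = 1))
same_as_dnorm

# The area under the whole curve is 1
total_area = integrate(normal_pdf, -Inf, Inf, mu = 0, sigma = 1)$value
total_area
```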

The normal distribution belongs to a family of continuous distributions whose PDFs cannot be integrated by hand (at least not practically), so tables of the CDF or computer programs are necessary in order to compute probabilities and quantiles.


Physicists call it the Gaussian distribution, and because of its shape it is also called a bell curve.
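
In R the CDF and the quantiles mentioned above are given by `pnorm` and `qnorm`. The sketch below uses a hypothetical mean of 100 and standard deviation of 15, chosen only for illustration.

```r
mu = 100; sigma = 15     # hypothetical mean and standard deviation

p_below = pnorm(115, mean = mu, sd = sigma)                       # P(X < 115)
p_above = pnorm(115, mean = mu, sd = sigma, lower.tail = FALSE)   # P(X > 115)
p_between = pnorm(115, mu, sigma) - pnorm(85, mu, sigma)          # P(85 < X < 115)

q90 = qnorm(0.90, mean = mu, sd = sigma)                          # 90th percentile

c(p_below, p_above, p_between, q90)
```

`qnorm` is the inverse of `pnorm`: feeding `q90` back into `pnorm` returns 0.90.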

In [None]:
x = seq(-4,4,length=100)
hx = dnorm(x) ##height of the standard normal PDF at each x
dh = pnorm(x) ##standard normal CDF at each x

In [None]:
plot(x,hx, type = "l")

In [None]:
plot(x,dh, type = "l")

## Problem 1
If scores are normally distributed with a mean of 35 and a standard deviation of 10, what percent of the scores is: (a) greater than 34? (b) smaller than 42? (c) between 28 and 34?

In [None]:
#a


In [None]:
#b

In [None]:
#c

## Problem 2
A test is normally distributed with a mean of 70 and a standard deviation of 8. (a) What score would be needed to be in the 85th percentile? (b) What score would be needed to be in the 22nd percentile?

In [None]:
#a

In [None]:
#b

In [None]:
##You could for fun plot one of the distributions with the percentile 
##We will not do this in class

## Histograms

As we have seen previously, histograms are useful when we want to check for normality (which is important if we want to apply some statistical tests). The histogram aims to show the center and spread of the data.

To create a histogram we first create a frequency table:


In [None]:
x = round(rnorm(1000,0,1),1)


In [None]:
table(x)

In [None]:
hist(x)
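
The link between the frequency table and the histogram can be sketched by binning the data with the same breaks that `hist` chooses. This sketch uses its own simulated, unrounded data; `w`, `h`, `bins` and `freq` are illustrative names.

```r
set.seed(1)                           # arbitrary seed for reproducibility
w = rnorm(1000, mean = 0, sd = 1)

h = hist(w, plot = FALSE)             # compute the histogram without drawing it
bins = cut(w, breaks = h$breaks, include.lowest = TRUE)
freq = table(bins)                    # frequency table over the same intervals

freq
all(as.vector(freq) == h$counts)      # bar heights are the table counts
```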

In [None]:
#Let's run another example: load the Sparrows dataset and select one species
##Let's plot the histogram of weights

Sparrows = read.table(file = "Sparrows.txt", header = TRUE)
Sparrows2 = Sparrows[Sparrows$Species == "SSTS",]
hist(Sparrows2$Wt, xlab = "Weight in grams", main = expression(italic("Ammodramus caudacutus")))

### We can also draw multiple histograms using the lattice package

In [None]:
Sparrows2

In [None]:
library(lattice)
histogram(~Wt|factor(Observer),
         data = Sparrows2,
         layout = c(1,7),
         nint = 30,
         xlab = "Weight in grams",
         strip = FALSE,
         strip.left = TRUE,
         ylab = "Frequencies")

## Going back to our initial example, we can change the number of bins so that the histogram looks more like a continuous distribution

In [None]:
hist(x, breaks = 40)

In [None]:
hist(x, breaks = 400) ##but we lose resolution

#### However, with this shape, it is still a little difficult to assess whether the raw data follow a normal distribution. In order to solve this problem we can use Kernel Density Curves

## Kernel Density Curves

A kernel density is a non-parametric way to estimate an empirical PDF of a random variable X. It uses a smoothing parameter that affects the shape of the curve.

As explained by Ieno and Zuur (2015), a kernel defines small functions, one around each observation, that are added up to form a smooth curve.
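
The "small functions added up" idea can be sketched by hand, assuming a Gaussian kernel (the default in `density`). The data here are simulated and hypothetical, not the sparrow weights; `w`, `bw_demo`, `grid_pts` and `kde_manual` are illustrative names.

```r
set.seed(1)                              # arbitrary seed for reproducibility
w = rnorm(50, mean = 20, sd = 2)         # hypothetical weights
bw_demo = 0.8                            # hypothetical bandwidth
grid_pts = seq(13, 27, length.out = 200)

# One small Gaussian bump per observation, averaged at every grid point
kde_manual = sapply(grid_pts, function(g) mean(dnorm(g, mean = w, sd = bw_demo)))

# Built-in estimate on a fixed range, interpolated at the same grid points
d_builtin = density(w, bw = bw_demo, from = 12, to = 28, n = 512)
kde_builtin = approx(d_builtin$x, d_builtin$y, xout = grid_pts)$y

max_diff = max(abs(kde_manual - kde_builtin))   # small; density() uses a fast approximation
area = sum(kde_manual) * diff(grid_pts)[1]      # roughly 1, as for any PDF
c(max_diff, area)
```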

In [None]:
d = density(Sparrows$Wt)

In [None]:
plot(d)

In [None]:
Sparrows$fSpecies <- factor(Sparrows$Species,
                            levels = c("SESP","SSTS"),
                            labels = c("A.maritimus",
                                       "A.caudacutus"))
                                  
par(mar = c(5,5,2,2), cex.lab = 1.5)


plot(d,
     xlab = "Weights (in grams)",
     cex.lab = 1.5,
     cex.main = 1.5,
     main = "",
     xlim = c(15,28),
     ylim = c(0, 0.35),
     lwd = 5)

d1 <- density(Sparrows$Wt[Sparrows$fSpecies == "A.maritimus"])
d2 <- density(Sparrows$Wt[Sparrows$fSpecies == "A.caudacutus"])

lines(d1, lty = 2, lwd = 2)
lines(d2, lty = 3, lwd = 2)

legend("topright",
        legend = expression("Both Species",
                        italic("A. maritimus"),
                        italic("A. caudacutus")),
        lty = c(1, 2, 3),
        lwd = c(5, 2, 2))

### How does the smoothing function work?

#### In the density function the smoothing parameter is called bw (bandwidth). It is the standard deviation of the smoothing kernel and can be any positive number; larger values give smoother curves

In [None]:
d_ex = density(x, bw = 0.01)
plot(d_ex)

In [None]:
bw_ex = c(0.1,0.5,1)
col2 = c("red","blue","forestgreen")
plot(d_ex)
for (i in 1:3){
    d_ex = density(x, bw = bw_ex[i])
    lines(d_ex, col = col2[i])
}
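
When `bw` is not given, `density` chooses a bandwidth from the data using the rule implemented in `bw.nrd0`. The sketch below, on simulated data with illustrative names, shows how to inspect that default and what a wider bandwidth does to the peak of the curve.

```r
set.seed(2)                   # arbitrary seed for reproducibility
z = rnorm(500)

d_default = density(z)        # bandwidth picked automatically
d_default$bw
bw.nrd0(z)                    # the same value, computed directly

# Doubling the bandwidth flattens the estimate
d_wide = density(z, bw = 2 * d_default$bw)
c(max(d_default$y), max(d_wide$y))
```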

## Another way to graph the distribution of a continuous variable is using a quantile-quantile plot (Q-Q plot)

### Q-Q plots

The basic idea is to plot the quantiles of two distributions against each other. We then draw a straight line, and if the points fall approximately along the line we can infer that the distributions are similar.

We can plot the quantiles of the raw data against the quantiles of any theoretical distribution and check how similar the distributions are.

Following our initial example

In [None]:
qqnorm(x)
qqline(x)
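
What `qqnorm` plots can be sketched by hand: the sorted data against standard normal quantiles at evenly spread probabilities (`ppoints`). The data below are simulated and the names are illustrative.

```r
set.seed(3)                          # arbitrary seed for reproducibility
y = rnorm(200, mean = 10, sd = 2)
n = length(y)

sample_q = sort(y)                   # quantiles of the raw data
theory_q = qnorm(ppoints(n))         # matching standard normal quantiles

qq = qqnorm(y, plot.it = FALSE)      # same coordinates, in the original data order
same_x = all.equal(sort(qq$x), theory_q)
same_y = all.equal(sort(qq$y), sample_q)

r_qq = cor(theory_q, sample_q)       # close to 1 when the points follow a straight line
c(same_x, same_y, r_qq)
```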

### In the sparrow example, we can also compare the distribution of weights for the two species with Q-Q plots from the lattice package

In [None]:
qqmath (~Wt|fSpecies,
        data = Sparrows,
        cex = 1, col = 1,
        ylab = list("Weight (in grams)", cex = 1.5),
        xlab = list("Theoretical Quantiles", cex = 1.5),
       prepanel = prepanel.qqmathline,
       panel = function(x, ...) {
          panel.qqmathline(x, ...)
          panel.qqmath(x, ...)
       })