# MATH 3350 Course Notes - Module S0

# <span style="color: blue;">I. JupyterHub</span>

Some materials for this course will be published as Jupyter notebooks on the University's JupyterHub. JupyterHub is a self-contained environment for programming in either R or Python. Students and instructors can access it from any device with a web browser. Any link to a document housed on JupyterHub will take you to the environment. The link below will also connect you to the UNG JupyterHub (though the interface may look slightly different).

[UNG JupyterHub](https://datahub.ung.edu)

You may be asked to log in using two-factor authentication. Use the same UNG login information that you use for your UNG email, D2L, Banner, etc.

After you have logged in, you can open existing files by navigating the directory in the left pane of the JupyterHub window, or you can create new files by selecting "File", "New". Select the "Python3" kernel to create a Python-based notebook, or select the "R" kernel to create an R-based notebook. We will be using this Jupyter environment to work with R.

While reading through the notes, you will be prompted to try executing R code snippets to see data, statistical results, and graphics (such as plots). 

# <span style="color: blue;">II. Jupyter Notebooks</span>

The document you are reading is a Jupyter notebook. A Jupyter notebook is a "live" document that contains both text and working code. The notebook is arranged as a series of cells. Cells that contain code (either R or Python) are called code cells, and cells that contain anything else are called "Markdown" cells. Markdown cells may contain formatted text, images, links, tables, etc. This cell is a Markdown cell. You can double-click on this cell to edit the contents. Once you have made your changes, press Shift + Enter to run the cell, or click the "play" button at the top of this document, to see your changes.

The cell below is a code cell. Since this notebook was created with an R kernel, the cell contains R code. Select the cell below and hit Shift + Enter to see the output.

In [None]:
# Here is an example of a code cell (using the R language).

for (i in 1:10){
    cat(i,' ' )
}

# <span style="color: blue;">III.  Introduction to R </span>

### Let's play around with some basic R commands and create some plots.

#### Generate data from known probability distributions

In [None]:
# Generate 100 random values from a uniform distribution

u_values <- runif(100, min=1, max=5)      #runif is 'random uniform'
head(u_values)                            #Shows the first few of the 100 values generated
hist(u_values)                            #Creates a histogram of all the values


In the example above, u_values is a VECTOR of values.

### Tasks
- Try running the code cell above multiple times.  Use Ctrl + Enter to run the cell without advancing to the next cell.

- Notice that the histogram changes each time, because the values being generated are random values.

- How do the histograms compare to what you would expect from a uniform distribution?

- Now try generating 1000 of these values instead of 100.  What do you notice?

- Try again with 5000 values.  What do you notice?
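
The histogram changes on every run because `runif` draws new random values each time. When repeatable output is useful (for example, when comparing results with a classmate), base R's `set.seed` fixes the starting point of the random number generator. The sketch below is only an illustration; the seed value 3350 and the variable names are arbitrary choices, not part of the course materials.

```r
# Fix the starting point of the random number generator (3350 is an arbitrary seed)
set.seed(3350)
first_draw <- runif(100, min = 1, max = 5)

# Resetting the same seed reproduces exactly the same 100 values
set.seed(3350)
second_draw <- runif(100, min = 1, max = 5)

print(identical(first_draw, second_draw))   # TRUE
print(range(first_draw))                    # smallest and largest value, both between 1 and 5
```

Without a call to `set.seed`, each run starts from a different point, which is why the histogram above keeps changing.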

#### What else can we do with this distribution?
These functions will allow us to work with the uniform distribution:

| Function | Description |
|----------|-------------|
| runif | Generate random values from a uniform distribution |
| dunif | Calculate the pdf f(x) for a given value x in a uniform distribution |
| punif | Calculate the cdf F(x) for a given value x in a uniform distribution |
| qunif | Identify the value at a given quantile of a uniform distribution |

#### Let's look at some examples.

##### Density (pdf) function: dunif
We'll start by using **dunif** to answer the following question:  
In the uniform distribution from 1 to 15, what is the height (y-coordinate) of the pdf, f(x), for a particular x?


In [None]:
# If f(x) is the pdf of the uniform distribution from 1 to 15, what is f(5)?

dunif(5, min=1, max=15)    #Height of pdf graph at f(5)


In [None]:
# If f(x) is the pdf of the uniform distribution from 1 to 15, what is f(14.8)?

dunif(14.8, min=1, max=15) #Height of pdf graph at f(14.8)

In [None]:
# If f(x) is the pdf of the uniform distribution from 1 to 15, what is f(17)?

dunif(17, min=1, max=15)   #Height of pdf graph at f(17)
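
The three results above are consistent with the uniform pdf formula $f(x) = 1/(b-a)$ for $a \le x \le b$ and $f(x) = 0$ elsewhere. As a quick check, here is a minimal sketch using the same endpoints, 1 and 15 (the variable names `a`, `b`, and `height` are illustrative):

```r
a <- 1     # lower endpoint (min)
b <- 15    # upper endpoint (max)

# Constant height of the pdf between a and b
height <- 1 / (b - a)
print(height)                                 # 1/14, roughly 0.0714

# dunif returns this same height for any x inside [a, b] ...
print(dunif(c(5, 14.8), min = a, max = b))

# ... and 0 for any x outside the interval
print(dunif(17, min = a, max = b))            # 0
```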

##### CDF probability function: punif
Now we'll use **punif** to answer the following question:  
In the uniform distribution from 1 to 15, what is the value of the cdf, F(x), for a particular x?  
Recall that for a random variable **X** with this distribution, F(x) = P(**X** $\le$ x).

In [None]:
# If F(x) is the cdf of the uniform distribution from 1 to 15, what is F(5)?

punif(5, min=1, max=15)    #P(X < 5)

In [None]:
# If F(x) is the cdf of the uniform distribution from 1 to 15, what is F(14.8)?

punif(14.8, min=1, max=15)    #P(X < 14.8)

In [None]:
# If F(x) is the cdf of the uniform distribution from 1 to 15, what is F(17)?

punif(17, min=1, max=15)    #P(X < 17)
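
For a uniform distribution on $[a, b]$, the cdf is $F(x) = (x-a)/(b-a)$ between the endpoints, 0 below $a$, and 1 above $b$. Because the distribution is continuous, $P(X < x)$ and $P(X \le x)$ are equal. The sketch below checks the formula against `punif` and then subtracts two cdf values to get the probability of an interval; the interval from 5 to 10 is a made-up example.

```r
a <- 1     # lower endpoint (min)
b <- 15    # upper endpoint (max)

# cdf by formula and by punif: both give 4/14, roughly 0.2857
print((5 - a) / (b - a))
print(punif(5, min = a, max = b))

# Any x beyond the upper endpoint has all of the distribution at or below it
print(punif(17, min = a, max = b))    # 1

# Probability of an interval: P(5 < X <= 10) = F(10) - F(5)
prob_interval <- punif(10, min = a, max = b) - punif(5, min = a, max = b)
print(prob_interval)                  # 5/14, roughly 0.357
```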

##### Quantile function: qunif
Now we'll use **qunif** to answer the following question:  
In the uniform distribution from 1 to 15, for what value in the distribution does a given proportion *p* of the distribution lie at or below that value?  

Note that the quantile function is $F^{-1}(p)$, the *inverse* of the cdf: it takes a proportion *p* and returns the corresponding value x.

In [None]:
# In the uniform distribution from 1 to 15, find the value for which 28% of the values 
#                                           in the distribution lie at or below that point.
# This point is called the 28th percentile.

qunif(0.28,min=1,max=15)

In [None]:
# In the uniform distribution from 1 to 15, find the value where 100% of the values 
#                                           in the distribution lie at or below that point.
qunif(1,min=1,max=15)

In [None]:
# In the uniform distribution from 1 to 15, what value is at the 25th percentile (aka "first quartile")?
qunif(0.25,min=1,max=15)
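
For a uniform distribution on $[a, b]$, the quantile at proportion $p$ works out to $a + p(b - a)$. The sketch below compares that formula with `qunif` and confirms that `punif` undoes `qunif`; the variable names are illustrative.

```r
a <- 1     # lower endpoint (min)
b <- 15    # upper endpoint (max)
p <- 0.28  # proportion of the distribution at or below the value we want

# Quantile from qunif and from the formula a + p * (b - a)
x_p <- qunif(p, min = a, max = b)
print(x_p)                  # 4.92
print(a + p * (b - a))      # 4.92

# Feeding the quantile back into the cdf recovers the original proportion
print(punif(x_p, min = a, max = b))    # 0.28
```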

#### Similar Functions for Other Distributions

Some distributions you should recognize have a similar set of functions, as shown in the table below.  
**NOTE: Remember that the density (pdf) functions return a probability P(X=x) _only for DISCRETE distributions_! For continuous distributions, the density is the height of the pdf curve, not a probability.** 


| Distribution | Random Value | Density (pdf) | Probability _in tail_ (cdf) | Quantile | Parameters (and default values if applicable) |
|----------|-------------|---------|-------|--------|------------|
| Binomial | rbinom | dbinom | pbinom | qbinom | size, prob (*n and p; no defaults*) |
| Exponential | rexp | dexp | pexp | qexp | rate (1) (*lambda*)|
| Geometric | rgeom | dgeom | pgeom | qgeom | prob (*p; no default*)  |
| Hypergeometric | rhyper | dhyper | phyper | qhyper | m, n, k (*see R docs; no defaults*)  |
| Negative Binomial | rnbinom | dnbinom | pnbinom | qnbinom |  size, prob (*target #successes & p; no defaults*) |
| Normal | rnorm | dnorm | pnorm | qnorm | mean (0), sd (1) |
| Poisson | rpois | dpois | ppois | qpois | lambda (*no default*) |
| Uniform | runif | dunif | punif | qunif | min (0), max (1) |
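
To illustrate the note about density functions, the sketch below contrasts a discrete density with a continuous one. The specific distributions (10 fair coin tosses, and a uniform distribution on 0 to 0.5) are examples chosen only for this illustration.

```r
# Discrete case: dbinom returns an actual probability, P(X = 3) in 10 fair coin tosses
p_three <- dbinom(3, size = 10, prob = 0.5)
print(p_three)

# The probabilities of all possible outcomes (0 through 10) add up to 1
print(sum(dbinom(0:10, size = 10, prob = 0.5)))

# Continuous case: the uniform density on [0, 0.5] has height 2,
# which cannot be a probability because it is greater than 1
height <- dunif(0.25, min = 0, max = 0.5)
print(height)    # 2

# Probabilities for a continuous distribution come from the cdf instead:
# P(0.2 < X <= 0.3) = F(0.3) - F(0.2)
prob_interval <- punif(0.3, min = 0, max = 0.5) - punif(0.2, min = 0, max = 0.5)
print(prob_interval)
```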


And these are some additional distributions we will be working with as we explore statistics...

| Distribution | Random Value | Density (pdf) | Probability _in tail_ (cdf) | Quantile | Parameters |
|----------|-------------|---------|-------|--------|------------|
| Chi-Square | rchisq | dchisq | pchisq | qchisq | df (*degrees of freedom; no default*) |
| F Distribution | rf | df | pf | qf | df1, df2 (*degrees of freedom; no defaults*) |
| t Distribution | rt | dt | pt | qt | df (*degrees of freedom; no default*) |
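
These distributions follow the same r/d/p/q naming pattern. Here is a brief sketch with the t distribution, assuming 10 degrees of freedom and a seed of 3350 purely for illustration:

```r
df_example <- 10    # hypothetical degrees of freedom for this illustration

# cdf: P(T <= 1.5)
p <- pt(1.5, df = df_example)
print(p)

# The quantile function undoes the cdf, recovering 1.5 (up to rounding)
print(qt(p, df = df_example))

# Density: height of the pdf curve at 0
print(dt(0, df = df_example))

# Random values: five draws from this t distribution (arbitrary seed for repeatability)
set.seed(3350)
t_values <- rt(5, df = df_example)
print(t_values)
```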

### More Examples
#### Example 1. Simple application of binomial distribution

In [None]:
#Probability of rolling exactly three 6's in 10 rolls of a fair die
dbinom(3, size=10, prob=1/6)

#Probability of rolling AT MOST three 6's in 10 rolls of a fair die
pbinom(3, size=10, prob=1/6)


In [None]:
#Probability of rolling AT LEAST three 6's in 10 rolls of a fair die
1 - pbinom(2, size=10, prob=1/6)    # **NOTICE WE NEED 2 HERE INSTEAD OF 3**

#Another way to accomplish the same thing
pbinom(2, size=10, prob=1/6, lower.tail=FALSE)    
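
One way to see why the cutoff must be 2 is to add up the individual probabilities directly. The sketch below (variable names are illustrative) sums `dbinom` over 3 through 10 and compares the result with the complement approach.

```r
# P(X >= 3): add up the probabilities of exactly 3, 4, ..., 10 sixes
at_least_three <- sum(dbinom(3:10, size = 10, prob = 1/6))

# Complement approach: 1 - P(X <= 2)
complement <- 1 - pbinom(2, size = 10, prob = 1/6)

print(at_least_three)
print(complement)
print(isTRUE(all.equal(at_least_three, complement)))    # TRUE

# Using 3 as the cutoff gives P(X >= 4), which leaves out P(X = 3)
wrong_cutoff <- 1 - pbinom(3, size = 10, prob = 1/6)
print(at_least_three - wrong_cutoff)          # the missing piece ...
print(dbinom(3, size = 10, prob = 1/6))       # ... matches P(X = 3)
```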

#### Example 2. Generate a plot of a distribution pdf
We can create a plot of any distribution by generating many (x,y) points and then plotting them.  

**Notes about the example below:** 
* *xvalues* and *yvalues* are VECTORS of values  
* The plot creates (x,y) coordinates by pairing elements from each vector in the order they are given
* Notice that the density function (dnorm) computes a y value for EACH x value provided in the vector
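
The vectorized behavior described in these notes can be seen on a small scale first. In the sketch below, `x_demo` and `y_demo` are illustrative names that are not used elsewhere in the notebook.

```r
# A short vector of x values
x_demo <- c(-1, 0, 1)

# dnorm is applied to every element, returning one y value per x value
y_demo <- dnorm(x_demo)
print(y_demo)
print(length(y_demo) == length(x_demo))    # TRUE

# Same result as calling dnorm on each value separately
print(all.equal(y_demo, c(dnorm(-1), dnorm(0), dnorm(1))))    # TRUE

# seq(-3, 3, 0.1) steps from -3 to 3 in increments of 0.1
print(length(seq(-3, 3, 0.1)))             # 61
```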

In [None]:
xvalues <- seq(-3,3,0.1)     # We will generate many points between -3 and 3 for x
head(xvalues)

yvalues <- dnorm(xvalues)    # Use the density function to generate corresponding y coordinates
head(yvalues)

plot (xvalues,yvalues)       # Generate plot


##### Here's an improved plot using the same data points

In [None]:
#Change from individual points to a smooth line, add a title, and label the y axis
plot (xvalues,yvalues, main="PDF of Standard Normal Curve", ylab="Density", type="l")   # "l" stands for line

#### Example 3. Generate a plot of a distribution cdf
This time we need the pnorm (cdf) function instead of the dnorm (density) function.  

In [None]:
new_yvals <- pnorm(xvalues)     # Calculate cdf F(x) for all of our x coordinates

plot (xvalues,new_yvals, main="CDF of Standard Normal Curve", ylab="P(X < x)", type="l")
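
The cdf and pdf plots are linked: F(x) is the area under the pdf curve to the left of x. As a rough numerical check, the sketch below uses base R's `integrate` function to approximate that area for x = 1 (an arbitrary choice) and compares it with `pnorm`.

```r
# Approximate the area under the standard normal pdf from -Inf up to 1
area <- integrate(dnorm, lower = -Inf, upper = 1)
print(area$value)

# The cdf gives the same quantity directly
print(pnorm(1))

# The two values should agree to several decimal places
print(abs(area$value - pnorm(1)) < 1e-4)
```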

### Ways to Simulate a Probability Experiment
Let's look at a simple probability experiment using coin tosses.  Consider a scenario of tossing a coin 10 times.

In [None]:
coin <- c('H','T')                        #Define a vector with all possible outcomes (heads, tails)
flips <- sample(coin,10,replace = TRUE)   #Sample from these outcomes, WITH replacement
print(flips)                              #Look at our result vector

num_heads <- sum(flips == 'H')            #Count number of heads
print(num_heads)

cat('There were',num_heads,'heads in 10 flips.')  #Show results in a full sentence  
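
Two other base R functions can summarize the same kind of result: `table` counts each outcome, and `mean` applied to a logical vector gives the proportion of TRUE values. This sketch regenerates its own flips so that it runs independently; the seed value is arbitrary.

```r
coin <- c('H', 'T')
set.seed(3350)                               # arbitrary seed so the run is repeatable
flips <- sample(coin, 10, replace = TRUE)

# Count each outcome; factor(..., levels = coin) keeps a row for an outcome even if its count is 0
counts <- table(factor(flips, levels = coin))
print(counts)

# Proportion of heads: the mean of a TRUE/FALSE vector is the fraction of TRUE values
prop_heads <- mean(flips == 'H')
print(prop_heads)
```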

#### What if we repeat the above trial thousands of times?
We'll store the results of each trial and create a histogram to visualize the number of heads over all trials.  
Remember: EACH trial consists of 10 coin tosses.  We are interested in the _number of heads_ in each trial.

In [None]:
num_trials <- 10000              # set the number of trials
heads <- integer(num_trials)     # create a vector to store the number of heads for each trial

for (i in 1:num_trials){
    flips <- sample(coin,10,replace = TRUE) # create a trial of 10 coin flips
    heads[i] <- sum(flips == "H")           # count and store the number of heads in this trial
}

head(heads)        # display results of first few trials
hist(heads)        # histogram of all results
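
The same repeated experiment can be written more compactly with base R's `replicate`, which evaluates an expression a given number of times and collects the results. This is an alternative sketch, not a replacement for the loop above; `heads_rep` is an illustrative name and the seed is arbitrary.

```r
coin <- c('H', 'T')
num_trials <- 10000
set.seed(3350)    # arbitrary seed so the run is repeatable

# Each repetition flips the coin 10 times and counts the heads
heads_rep <- replicate(num_trials, sum(sample(coin, 10, replace = TRUE) == 'H'))

print(head(heads_rep))    # results of the first few trials
print(mean(heads_rep))    # should land close to the expected value 10 * 0.5 = 5
```

The loop version makes each step visible, while `replicate` avoids managing the index and the storage vector by hand.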


### Another Method

The simulation above is useful because it is concrete and can be explored one step at a time.  
For complex simulations, this kind of approach may be the only method available.

However, in this scenario, we can also use the binomial distribution to generate a similar set of test data.

In [None]:
successes <- rbinom(num_trials, size=10, prob=0.5)    #10000 trials, 10 tosses, 0.5 chance of 'heads' (success)
hist(successes)

### Questions

1. What is the theoretical probability of getting exactly 5 heads in 10 flips?
2. How does the theoretical probability compare to the empirical probability in the simulations above?

The code below can help answer these questions.


In [None]:
#1. Theoretical probability of exactly 5 heads in 10 flips
theo <- dbinom(5,size=10,prob=0.5)
cat("Theoretical: ", theo, "\n")

#2. Examine empirical results from both methods:

# Empirical probability from concrete coin toss simulation
emp_prob1 <- sum(heads==5)/num_trials
cat("Simulation 1: ", emp_prob1, "\n")

# Empirical probability from binomial distribution random values simulation
emp_prob2 <- sum(successes==5)/num_trials
cat("Simulation 2: ", emp_prob2, "\n")
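
The same comparison can be extended from exactly 5 heads to every possible outcome. The sketch below regenerates its own binomial sample (with an arbitrary seed) so that it runs independently; `empirical`, `theoretical`, and `comparison` are illustrative names.

```r
num_trials <- 10000
set.seed(3350)    # arbitrary seed so the run is repeatable
successes <- rbinom(num_trials, size = 10, prob = 0.5)

# Empirical proportion for each possible number of heads (0 through 10);
# factor(..., levels = 0:10) keeps outcomes that never occurred as zeros
empirical <- as.vector(table(factor(successes, levels = 0:10))) / num_trials

# Theoretical probability of each outcome from the binomial pdf
theoretical <- dbinom(0:10, size = 10, prob = 0.5)

comparison <- data.frame(heads = 0:10,
                         empirical = empirical,
                         theoretical = round(theoretical, 4))
print(comparison)

# Largest gap between simulation and theory across all outcomes
print(max(abs(empirical - theoretical)))
```

With 10000 trials the two columns should be close, and the gap typically shrinks as the number of trials grows.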