We learned about some of the basic ideas behind probability: what it represents, how to calculate it and some rules for approaching our calculations. We motivated our examples using coins, dice and cards since they are familiar real-world objects. All the examples we've used so far share the same implicit assumption: every single outcome is equally likely. That is, the probability of rolling a 1 on a six-sided die is the same as the probability of rolling a 6. This assumption of probability being equally distributed across the different outcomes is valid for some phenomena, but not in general.

As we progress, we'll discuss different ways that probability can be distributed across the different outcomes of the experiment and explore some examples of said probability distributions. The quickest way to understand how probability is distributed is to look at a visualization. To use the dice roll as an example again, we know that each outcome has a 1 in 6 probability of happening, so its probability distribution would look like:

![image.png](attachment:image.png)

Knowing how probability is distributed is important because it tells us many important things about the experiment, such as:

* Which outcome has the highest probability associated with it?
* What would be the probability of observing a range of outcomes instead of just a single one?

In this case, all of the outcomes are equally likely, so we see a characteristically flat line as expected. If we were to visualize the probability distribution for a coin toss or drawing a single card, we would also see a similarly flat curve.

The "shape" of this probability distribution is so common that it has been given a special name: The **Uniform Distribution**. We can represent the probabilities of a dice roll, coin toss, and card drawing all as a uniform distribution, with some slight tweaks to adjust it to each particular situation. We'll start our exploration of probability distributions with the uniform.

We saw a visualization of how the probability is distributed for a single dice roll. If the probabilities of a random experiment take on a flat shape, we say it follows a uniform distribution.
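As a minimal sketch of the idea, the uniform distribution of a single die roll can be written out in R as a named vector. The variable names `outcomes` and `die_probs` are illustrative choices, not part of any library.

```r
# Outcomes of a single six-sided die
outcomes <- 1:6

# Each outcome receives the same share of the total probability
die_probs <- rep(1 / length(outcomes), times = length(outcomes))
names(die_probs) <- outcomes

# Every entry is 1/6, which is what produces the flat shape
die_probs
```

Because every entry is identical, plotting these values would give the same flat shape as the visualization.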

We can learn a lot about a random experiment just by looking at its probability distribution. Let's say that we start with the visualization. What would we be able to learn about the random experiment that produced it just by looking at the graph? We'll use another visualization as an example:

![image.png](attachment:image.png)

Judging from the shape of the distribution, we can see that this random experiment also follows a uniform distribution. Looking at the x-axis, we can tell that the outcomes of the experiment are the numbers from 1 to 10. Conversely, we can also see which outcomes aren't a part of the random experiment. Since there are 10 outcomes and each has the same amount of probability distributed to it, we can mentally calculate that each outcome has a 10% probability of occurring. Just by looking at a visualization, there's already a lot we can deduce about the random experiment without knowing any specific details about it.

Since probability distributions are created from probabilities, it's natural that they must also follow the rules of probability. We learned that probability can range from 0 to 1, but not go any lower or higher. From this, we know that we should never see a probability distribution where the probability was negative or greater than 1.

Finally, we know that if we try to calculate the probability of observing all of the outcomes (aka the sample space), then we would get a probability of 100%. This fact can also be observed in the probability distribution. If we added up all of the probability over all of the outcomes we saw in a distribution (aka the heights of all the bars), then they should all sum to 1.

It is important to remember that all probability distributions share these qualities. The uniform distribution is a useful tool for learning this intuition, which is why we start with it. Once we start looking at other probability distributions with more complicated shapes, we can use these shared qualities to calculate probabilities and learn things about the random experiments that produce them.
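The two rules above (every probability lies between 0 and 1, and all probabilities sum to 1) can be sketched as a small check in R. The helper `is_valid_distribution()` is a hypothetical name introduced only for illustration, and the example vectors are made up.

```r
# Hypothetical helper (not a built-in R function): checks the two rules above
is_valid_distribution <- function(probs) {
  in_range <- all(probs >= 0 & probs <= 1)
  # all.equal() tolerates tiny floating point error in the sum
  sums_to_one <- isTRUE(all.equal(sum(probs), 1))
  in_range && sums_to_one
}

uniform_1_to_10 <- rep(1 / 10, times = 10)

is_valid_distribution(uniform_1_to_10)    # TRUE
is_valid_distribution(c(0.5, 0.7, -0.2))  # FALSE: contains a negative probability
is_valid_distribution(c(0.5, 0.6))        # FALSE: sums to 1.1
```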

**Task**

Below are two hypothetical probability distributions. Using the visualizations, answer the following questions:

![image.png](attachment:image.png)

* Do both of these distributions follow a uniform distribution? Assign `TRUE` to `are_uniforms` if we believe so, `FALSE` otherwise.
* Which outcome in Random Experiment A has the most probability associated with it? Assign the answer to `most_probable_A`.
* Is Random Experiment B a valid probability distribution?
    * Calculate the total amount of probability in Random Experiment B, and assign it to `total_probability_B`.
    * If we believe Random Experiment B has a valid probability distribution, assign `TRUE` to `is_B_valid`, `FALSE` otherwise.
    
**Answer**

```r
are_uniforms <- FALSE
most_probable_A <- 1
total_probability_B <- 0.2 + 0.2 + 0.2 + 0.2 + 0.3
is_B_valid <- FALSE
```

As we start going into non-uniform distributions, it will be harder to keep track of what probabilities are distributed to which outcome. We need a convenient, mathematical way to express how probability is distributed since we won't always have a visualization to reference. The tool we'll use is called the **probability distribution function**: a function that takes in an outcome and gives the probability associated with that outcome.

A function is useful here because it allows us to associate a single probability with each outcome. That is to say, the probability distribution function assigns exactly one probability to every outcome, although different outcomes may share the same probability. Conventionally, we'll denote a probability distribution function as **P**.

![image.png](attachment:image.png)

Since each outcome has the same amount of probability distributed to it, we would say that this particular function is a constant function. Using **X** to represent an outcome, we can represent the probability distribution functionally as:

![image.png](attachment:image.png)

To fully describe the visualization, we need to indicate two things. First, we need to describe what outcomes actually have probability distributed to them, in this case the whole numbers from 1 to 10. Implicitly, this also means we need to describe where the function doesn't have probability, indicated by the second equation that describes the probability "elsewhere". Elsewhere refers to all the values that aren't the whole numbers from 1 to 10. Stating which values have no probability associated with them allows us to be more explicit about what outcomes are in the sample space. Since the function is constant, it doesn't matter what outcome we look at. For example, the probability that we see a 1 is the same as the probability that we see a 10 in the example.

![image.png](attachment:image.png)
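The two-part definition described above can be sketched as an R function. The name `p_uniform_1_to_10()` is a hypothetical choice for this illustration; it returns 1/10 for the whole numbers from 1 to 10 and 0 "elsewhere".

```r
# Hypothetical function name; returns P(X = x) for the 1-to-10 uniform example
p_uniform_1_to_10 <- function(x) {
  # Outcomes in the sample space get 1/10, everything else gets 0
  ifelse(x %in% 1:10, 1 / 10, 0)
}

p_uniform_1_to_10(1)    # 0.1
p_uniform_1_to_10(10)   # 0.1
p_uniform_1_to_10(4.5)  # 0, not a possible outcome
p_uniform_1_to_10(11)   # 0, outside the sample space
```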

A constant function doesn't fully demonstrate why a probability distribution function might be useful, so we'll explore a more complex example.

Many board games incorporate rolls of two dice and looking at the sum of the two results. If we rolled a 1 and a 6, then the sum we observe would be a 7. Let's say that we have a board game night planned with friends, and we want to gain an advantage by familiarizing ourselves with the probability distribution for this sum. The day before the board game night, we lay out the different possibilities below in a table:

![image.png](attachment:image.png)

Looking at all the different possible sums is a start, but we can do better by visualizing this table as a probability distribution. Noting that there are 36 combinations that the dice can result in, we calculate the different probabilities for each sum and create the following visualization:

![image.png](attachment:image.png)

It's now easier to see that the most likely sum in a double dice roll is 7, based on the peak that we see in the probability distribution. As we get farther away from 7, the probability of observing smaller or larger values decreases. The probability distribution has a nice symmetric quality, which may come in handy later. Before we go to our game night, we'd like to summarize this visualization further into a probability distribution function.

**Task**

Using the table of sums and the visualization of the probability distribution, help fill out some of the probability distribution function for the two dice rolls. We'll use **X** to indicate an outcome of the dice roll sum.

![image.png](attachment:image.png)

**Answer**

```r
prob_2 <- 1/36
prob_4 <- 3/36
prob_7 <- 6/36
prob_10 <- prob_4
```
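These answers can be double-checked by letting R enumerate the 36 combinations instead of counting them by hand. This is only one possible sketch; the names `sums` and `sum_probs` are illustrative.

```r
# All 36 combinations of two six-sided dice, arranged as a 6 x 6 table of sums
sums <- outer(1:6, 1:6, FUN = "+")

# Count how often each sum appears, then divide by the 36 combinations
sum_probs <- table(sums) / length(sums)

sum_probs["7"]   # equals 6/36, the peak of the distribution
sum_probs["4"]   # equals 3/36
sum_probs["10"]  # also 3/36, reflecting the symmetry around 7
```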

The probability distribution function is useful for quickly understanding how much probability is distributed to a single outcome, but it is also great for helping us calculate the probabilities for ranges of outcomes as well. We'll go back to the probability distribution of the sum of two dice rolls to demonstrate this.

In addition to asking what the probability of seeing a 3 is, we can also ask what the probability of seeing a sum of 3 or less is. The only two outcomes that satisfy this condition are 2 and 3, so we can formulate the probability of observing a 3 or less as:

![image.png](attachment:image.png)

There is a special name for this type of probability. When we calculate probabilities up until a certain outcome, we call these probabilities **cumulative probabilities**. They are called cumulative because we are adding up all of the probability up until a certain point.

Cumulative probabilities are important to us because sometimes we want to know how probable a certain set of the outcomes are. This fact will become more crucial to our understanding of hypothesis testing. For now, it's important to understand that cumulative probabilities allow us to get probabilities of ranges of outcomes, as opposed to just a single one.

By convention, cumulative probabilities are typically defined as all the probabilities up until a certain point. In other words, all the outcomes that are less than or equal to a given outcome. If we wanted to calculate the probability of all the outcomes greater than 3, we might just sum up all the probabilities of the numbers from 4 to 12:

![image.png](attachment:image.png)

Using the properties of probability however, we have another clever way to calculate this probability. Even though cumulative probabilities typically look at all the outcomes less than or equal to a given outcome, we can use them to calculate probabilities of seeing higher outcomes. We found earlier that the only outcomes less than or equal to 3 were 2 and 3 itself. The opposite, or complement, of all the outcomes less than or equal to 3 is all the outcomes greater than 3, which is what we want. Using this fact to our advantage, we can calculate the probability in the following way:

![image.png](attachment:image.png)
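The two routes to P(X > 3) described above can be sketched in R: summing the outcomes from 4 to 12 directly, or subtracting the cumulative probability P(X ≤ 3) from 1. The variable names are illustrative assumptions.

```r
# Probability of each sum from 2 to 12 for two six-sided dice
sums <- outer(1:6, 1:6, FUN = "+")
sum_probs <- sapply(2:12, function(s) mean(sums == s))
names(sum_probs) <- 2:12

# Running total of probability up to and including each outcome
cum_probs <- cumsum(sum_probs)

p_leq_3 <- cum_probs[["3"]]                       # P(X <= 3) = 1/36 + 2/36
p_gt_3_complement <- 1 - p_leq_3                  # complement rule
p_gt_3_direct <- sum(sum_probs[as.character(4:12)])  # direct sum over 4 to 12

c(p_gt_3_complement, p_gt_3_direct)  # both equal 33/36
```

Both approaches agree, but the complement needs only two terms instead of nine.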

**Task**

We've provided the table of dice roll sums below. Using this table, calculate the following probabilities:

![image.png](attachment:image.png)

1. What is **P(X ≤ 6)**? Assign this probability to `prob_leq_6`.
2. What is **P(X > 9)**? Assign this probability to `prob_gt_9`.
3. What is **P(6 ≤ X ≤ 8)**? Assign this probability to `prob_btwn_6_and_8`.

**Answer**

```r
prob_leq_6 <- 1/36 + 2/36 + 3/36 + 4/36 + 5/36
prob_gt_9 <- 6/36
prob_btwn_6_and_8 <- 5/36 + 6/36 + 5/36
```

The dice random experiment helps inform us about probability and probability distributions, so that we have a good fundamental grasp before diving into real-life experiments. When we start learning about hypothesis testing, we'll need to understand how our data is distributed.

We can also think of gathering data as a random experiment in and of itself. As a first example, let's say that we are interested in measuring the weights of people who work for our company. Weights are continuous and can take on a continuum of values. If we pick a random person in the company and measure their weight, we cannot say with certainty what their weight will be. Because we don't know what we will observe when we measure this person's weight, we can think of measuring weight as a random experiment. This applies in general to most things that we would want to measure.

So, when we gather data of any sort on a single person, we are essentially performing a random experiment on them. As we measure more and more people, we'll gather more and more data. After we gather enough data, we can start to graph the **empirical distribution** of the data. The empirical distribution shows us how the data we measure is distributed and tells us the probability of seeing a particular value in the data. This is why we call it an **empirical distribution**; "empirical" means based on observation or evidence.

As an example, let's say that we gathered hypothetical data on how fast students finish their exercise on average. After gathering all the data, we used `ggplot` and `geom_density()` to generate the empirical distribution:

![image.png](attachment:image.png)

From our data, we can start learning some things about how our hypothetical students fare in their learning. We can see from the peak that most students finish their exercises at around 90 seconds. There is also a considerable amount of variation, ranging from 45 seconds to 135 seconds. The empirical distribution also appears to be symmetric. Just like with the probability distribution, we can learn a lot from the empirical distribution of our data.

We might ask ourselves, "What's the difference between an empirical distribution and a probability distribution?" In past random experiments, we knew all the possible values of the sample space. This allowed us to calculate the exact probability associated with each outcome. When we collect real-life data, we do not see all the possible values that the outcomes can take. For example, if we collect weight data on just 50 people, the data is only representative of those 50 people, not the general population. It is highly unlikely, if not impossible, that the weights we measure cover all the possible weights that people can have. There is a special relationship between the empirical distribution and the probability distribution for a random experiment. As we collect data on more people, we would expect our empirical distribution to start resembling the actual probability distribution. Even though we can collect only a finite amount of data, we can still learn about its distribution.
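A small simulation can illustrate this relationship. The sketch below uses the sum of two dice rather than weights, because its true probability distribution is known exactly. The helper `empirical_probs()`, the seed, and the sample sizes are all arbitrary choices made for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# True probability of each sum from 2 to 12 for two six-sided dice
sums <- outer(1:6, 1:6, FUN = "+")
true_probs <- sapply(2:12, function(s) mean(sums == s))

# Hypothetical helper: roll two dice n times and return the observed proportions
empirical_probs <- function(n) {
  rolls <- sample(1:6, n, replace = TRUE) + sample(1:6, n, replace = TRUE)
  sapply(2:12, function(s) mean(rolls == s))
}

# Largest gap between the empirical and true probabilities for each sample size
gap_small <- max(abs(empirical_probs(50) - true_probs))
gap_large <- max(abs(empirical_probs(100000) - true_probs))

gap_small
gap_large
```

With only 50 rolls the empirical proportions can be noticeably off, while with 100,000 rolls the largest gap should be very small.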

We'll discuss one of the most important probability distributions in statistics: the **normal distribution**. The normal distribution is also known as the "bell curve" for its characteristic bell shape. Many real world phenomena resemble a normal distribution, which is why it's so important to the field of statistics.

We've overlaid the outline of an actual normal distribution on the empirical distribution, and we can see that the two closely match each other. Even though our data isn't exactly like the normal distribution, the normal is still a great approximation. Assuming that the data follows a normal distribution allows us to study the data in more detailed ways, like in hypothesis testing. This assumption doesn't always hold, but for our purposes, we'll assume that our data does follow a normal distribution. The normal distribution is one of the most well-studied probability distributions and has a lot of R functionality dedicated to it.

All the characteristics of probability distributions we learned still apply to the normal distribution, but there are some key differences that we need to be aware of. The probability distributions we saw were **discrete**, meaning that their probability is distributed to single, distinct values. For example, in the sum of two dice rolls, we saw that the only possible sums were the whole numbers from 2 to 12. There was no probability allocated to in-between values like 4.5. The normal distribution is different in that it is **continuous**, meaning that it can take on any value within a range, not just distinct values. In the example above, we see that the students' exercise finish times range anywhere between 30 and 150 seconds.

Another special aspect of the normal distribution is that it is always defined by two special values: a **mean** and a **variance**. In other words, if we specify the mean and variance for a normal distribution, we know exactly what the distribution looks like in a visualization. The plot below shows three different normal distributions with different means and variances.

![image.png](attachment:image.png)

Changing the mean of a normal distribution shifts where its peak is on the x-axis. Increasing the mean shifts the normal distribution to the right, while decreasing it shifts the distribution to the left. Changing the variance changes the spread, or how wide the normal distribution is. Increasing the variance makes the normal distribution wider and shorter, while decreasing it makes the distribution narrower and taller. The students' exercise data above seems to be well approximated by a normal distribution with mean 90 and variance 15. Deciding the mean and variance of our data is critical in hypothesis testing: we ultimately use hypothesis testing to see what values for the mean and variance might work for the data.
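The effect of the mean and the spread can be sketched numerically with R's `dnorm()` function, which evaluates the height of a normal curve at given points. The three curves below are hypothetical and are not the ones shown in the plot. Note that `dnorm()` takes a standard deviation (`sd`), the square root of the variance.

```r
# Grid of x values at which to evaluate each curve
x <- seq(-10, 10, by = 0.01)

# Three hypothetical normal curves
base_curve    <- dnorm(x, mean = 0, sd = 1)
shifted_curve <- dnorm(x, mean = 3, sd = 1)  # larger mean, same spread
wide_curve    <- dnorm(x, mean = 0, sd = 2)  # same mean, larger spread

# The peak sits at the mean, so changing the mean moves the peak
peak_base    <- x[which.max(base_curve)]     # approximately 0
peak_shifted <- x[which.max(shifted_curve)]  # approximately 3

# A larger spread makes the curve wider and therefore shorter
max(base_curve) > max(wide_curve)  # TRUE
```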

We learned that the normal distribution is bell-shaped. If we are curious, this bell shape is produced by the following formula, the probability density function **P(x)** for a normal curve:

![image.png](attachment:image.png)

This formula deserves some explanation. The mean is represented by the symbol **μ** (read as "mu"), while the variance is represented as $σ^2$ (read as "sigma squared"). The particular outcome is represented by the small **x**, while the random experiment itself is big **X**. We would read **P(X = x)** as "the probability density that the random experiment takes on **x** as a value." We say "density" because the value at that particular point is not a probability; it is more a statement of how often one value will appear relative to another. For example, if the value `0` has a higher density than the value `1`, we would interpret that as saying that values near `0` are more likely to occur than values near `1`.
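Assuming the formula shown is the standard normal density,

$$P(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right),$$

it can be written directly in R as a sketch. The function `normal_density()` and its argument names `mu` and `sigma_sq` are hypothetical, chosen to mirror the symbols **μ** and $σ^2$.

```r
# Hypothetical helper implementing the normal density formula directly
normal_density <- function(x, mu = 0, sigma_sq = 1) {
  1 / sqrt(2 * pi * sigma_sq) * exp(-(x - mu)^2 / (2 * sigma_sq))
}

# With mean 0 and variance 1, 0 has a higher density than 1
normal_density(0)  # about 0.3989
normal_density(1)  # about 0.2420
```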

From a data scientist's perspective, the more common way we'll deal with the normal probability density function is through the `dnorm()` function. The `"d"` in `dnorm()` stands for "density", which is another name for the probability distribution function, and the "norm" part refers back to the normal. 

`dnorm()` has one required argument and two optional arguments. The first argument is the specific value that we want to calculate a probability density for, represented by the small **x** in **P(X = x)**. The first optional argument is `mean`, where we can define the mean of the normal distribution that we are calculating the density from.

The second optional argument is called `sd`, which stands for "standard deviation". We may recall that the standard deviation is just the square root of the variance $σ^2$, so it is represented as **σ**. If we'd like to specify a variance for `dnorm()`, then we need to take the square root of the variance and pass that into the `sd` argument. This is a small gotcha for people who are new to using the function. If we do not specify `mean` and `sd`, then the function automatically defaults to using `mean = 0` and `sd = 1`.
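To make the gotcha concrete, here is a brief sketch with a made-up variance of 4. The variable names are illustrative.

```r
variance <- 4

# Intended: convert the variance to a standard deviation first
density_correct <- dnorm(0, mean = 0, sd = sqrt(variance))

# Mistaken: this treats 4 as the standard deviation,
# which corresponds to a variance of 16 instead of 4
density_mistaken <- dnorm(0, mean = 0, sd = variance)

c(density_correct, density_mistaken)  # the two values differ

# Omitting mean and sd falls back to mean = 0 and sd = 1
dnorm(0) == dnorm(0, mean = 0, sd = 1)  # TRUE
```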

Let's say that we wanted to calculate the probability density of observing an exercise time of 90 seconds in our hypothetical example. We're using a normal with mean 90 and variance 15 to model the data, so we would use the following code to get the probability density of observing a 90.

```r
dnorm(90, mean = 90, sd = sqrt(15))
# 0.1030065
```

This value corresponds to what we see in the visualization. 

**Task**

1. Calculate the probability density of seeing the number 0 in a normal distribution with mean 0 and standard deviation 1. 
2. Calculate the probability density of seeing the number 5 in a normal distribution with mean 5 and standard deviation 5.
3. Calculate the probability density of seeing the number -1 in a normal distribution with mean 1 and variance 4. 

**Answer**

```r
prob_norm_0 <- dnorm(0)
prob_norm_5 <- dnorm(5, 5, 5)
prob_norm_1 <- dnorm(-1, 1, 2)
```

We learned how to use the `dnorm()` function to calculate probability densities of specific values in the normal distribution. We'll look to another function to calculate cumulative probabilities for a normal curve. Unlike with discrete probability distributions, calculating probabilities in a continuous distribution by hand requires calculus, which is out of the scope of this file. Thankfully, R has another function that captures this entire calculation in one line. This function is called `pnorm()`.

The "p" in `pnorm()` represents the cumulative probability calculation, and the function has similar syntax to `dnorm()`. Like `dnorm()`, `pnorm()` takes one required argument and two optional arguments. The required argument represents the value that we want to take the cumulative probability up to, as represented in the equation **P(X ≤ x)**. The optional arguments are also `mean` and `sd`, which allow us to specify these values for the normal we're trying to calculate the cumulative probability for. Let's say that we want to calculate the cumulative probability up until the number 1 in a normal with mean `0` and standard deviation `1`. As we've mentioned before, cumulative probability looks to the left of the value by convention, as shown below:

```r
pnorm(1)
# 0.8413447
```

![image.png](attachment:image.png)
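Just as with the dice sums, the complement rule gives the probability to the right of a value. The sketch below shows two equivalent ways of getting P(X > 1) for a normal with mean 0 and standard deviation 1; the `lower.tail` argument is part of `pnorm()`, while the variable names are illustrative.

```r
p_left <- pnorm(1)                              # P(X <= 1), about 0.8413
p_right <- 1 - pnorm(1)                         # P(X > 1) via the complement, about 0.1587
p_right_alt <- pnorm(1, lower.tail = FALSE)     # same quantity, computed directly

c(p_left, p_right, p_right_alt)
```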

If we wanted to calculate the probability of a range in a normal distribution, then we would need to use `pnorm()` twice. By taking advantage of the fact that `pnorm()` always calculates the probability to the left of a given value, we can subtract two cumulative probabilities to get the probability of a region. If we want to get the probability between the values -1 and 1 in a normal distribution with mean 2 and variance 9, we would use the code below. We've also represented this subtraction graphically:

```r
pnorm(1, 2, 3) - pnorm(-1, 2, 3)
# 0.2107861
```
 
 ![image.png](attachment:image.png)

Cumulative probability calculations are critical in hypothesis testing, so it is worth spending some time understanding the differences between `dnorm()` and `pnorm()`. With continuous probability distributions, we can see that cumulative probability is represented as the "area" under the curve. This is different from the counting and addition we did for discrete probability distributions. This is perhaps one of the most important takeaways. With cumulative probabilities of a normal distribution under our belt, we are ready to learn about our first hypothesis test.
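The "area under the curve" interpretation can be checked numerically. As a sketch, base R's `integrate()` can approximate the area under `dnorm()` between -1 and 1 for a normal with mean 2 and standard deviation 3, and the result should agree with the subtraction of two `pnorm()` calls. The variable names are illustrative.

```r
# Probability of the region via two cumulative probabilities
by_subtraction <- pnorm(1, mean = 2, sd = 3) - pnorm(-1, mean = 2, sd = 3)

# Same region as a numerically approximated area under the density curve
by_area <- integrate(dnorm, lower = -1, upper = 1, mean = 2, sd = 3)$value

c(by_subtraction, by_area)  # both about 0.2108
```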