We all have beliefs about how the world works. These beliefs are usually based on events we observe throughout our lives, and we develop theories about the causes of these events. An egg tastes better after we break it over a hot pan, so we begin to think that the heat plays a role in making it tastier. When a belief has evidence to back it up, we are more likely to believe that it reflects the truth. As more evidence piles up, these beliefs eventually solidify, at least until contradictory evidence comes to light.

When we propose an idea for how the world works, we are making a **hypothesis**. A hypothesis is an attempt to explain a phenomenon based on limited evidence. We say "limited" because a phenomenon would just be a fact if we knew everything about it. We may also think about a hypothesis as an educated guess as to how something works. We may observe in our everyday life that our best boiled eggs happen when they are cooked for 6 to 7 minutes. From these observations, we can propose a hypothesis that all boiled eggs are perfectly cooked after this particular amount of time.

An essential characteristic of all hypotheses is that they are **testable**. That is to say, we can set up a process or experiment that will do only one of two things: 

1. support the hypothesis or 
2. reject it. 

We must be able to test our hypotheses, or else we will not be able to figure out if the hypothesis represents the ground truth or not. Developing hypotheses and learning how to test them are critical skills for data scientists. Data scientists leverage data to guide company strategy and make improvements to their products. Some examples of hypothesis problems that data scientists may face include:

* if we use a new ad on our website, how will we know if it created a meaningful increase in user engagement?
* if we raise the price of a product, will it cause a meaningful drop in sales?
* if we develop a new weight loss pill, how can we know if it helped people lose more weight?

We'll learn the framework to create hypotheses and test them. This framework is called **hypothesis testing**. It has a rigorous mathematical foundation and serves as the guiding force for researchers and data scientists alike in their experiments. Once we've mastered it, we can start properly making data-driven decisions. 

When we propose hypotheses, there are really only two conclusions we can reach. After we perform an experiment and collect our data, the results will either **support** our hypothesis or **discredit** it. To be more precise, when we perform an experiment, there are actually two hypotheses that are being tested at the same time. One hypothesis corresponds to our own belief about the world, while the other corresponds to the contrary. Take the example of the new ad.

In this case, the ad will either increase user engagement or it won't. A natural experiment to test this would be to randomly divide our website users into two groups: 

1. one set of users will see the new ad, while 
2. the second set will not see it.

With these two groups in place, we'll measure user engagement in each group. After the experiment is finished, we can compare the number of engaged users between the two groups. 
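The random split described above can be sketched in R. This is only an illustration with simulated users: the number of users, the engagement rates, and every variable name below are assumptions made up for the example, not values from a real experiment.

```r
# Hypothetical illustration: simulated users, not real website data.
set.seed(42)

n_users <- 1000

# Randomly assign each user to one of two equally sized groups.
group <- sample(rep(c("sees_ad", "no_ad"), each = n_users / 2))

# Simulate whether each user engaged, using made-up engagement rates.
engage_prob <- ifelse(group == "sees_ad", 0.12, 0.10)
engaged <- rbinom(n_users, size = 1, prob = engage_prob) == 1

# Count the engaged users in each group so the groups can be compared.
engaged_counts <- tapply(engaged, group, sum)
print(engaged_counts)
```

Shuffling a vector of group labels with `sample()` keeps the two groups the same size while leaving the assignment of any individual user to chance.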

1. One hypothesis is that the ad did increase user engagement; 
2. the other hypothesis is that the ad did not increase user engagement.

In the world of hypothesis testing, these two hypotheses have special names. 

1. The hypothesis that states that the ad won't have an effect on the world is called the **null hypothesis**. 
2. Likewise, the hypothesis that states that it will increase engagement is the **alternative hypothesis**.

We can frame the other examples in terms of a null and alternative hypothesis as well:

* if we raise the price of a product, will it cause a meaningful drop in sales?

    * **null hypothesis**: the number of purchases of the product was the same at the higher price as it was at the lower price.
    * **alternative hypothesis**: the number of purchases of the product was lower at the higher price than it was at the lower price.
    
* if we develop a new weight loss pill, how can we know if it helped people lose more weight?

    * **null hypothesis**: patients who went on the weight loss pill lost no more weight than those who didn't.
    * **alternative hypothesis**: patients who went on the weight loss pill lost more weight than those who didn't.

Why is there a need to define these two hypotheses in the first place? The full answer is out of scope, but we can try to shed a little light on the rationale. 

In a court trial, we presume a defendant to be "innocent until proven guilty." That is, until we see evidence of a person's guilt, we default to assuming their innocence. In the same way, we presume that nothing special will happen in a random experiment until the data says otherwise.

The **null hypothesis** assumes that the experiment will not change the world in any way, so we treat it as the default case. With the **null** and **alternative** hypotheses now in our vocabulary, we can start incorporating them into our understanding of hypothesis testing.

We'll walk through an example experiment to learn how to create a null and alternative hypothesis. We have data from a fake experiment on weight loss. A company has developed a new drug that was designed to help subjects lose weight. In order to test the drug, the company gathered people to participate in the study and randomly split them into two groups. 

* **Group A** was given a placebo pill, while
* **Group B** was given the new drug.

**Group A** is our control group, while **Group B** is our treatment group. The company measured each participant's weight before the start of the study and recorded it again after the end. The resulting data in `weight_loss.csv` contains information on how much weight each person lost.

The event we are interested in studying is whether or not the drug helps with weight loss. Knowing what we know about **null** and **alternative** hypotheses, we would construct them as follows:

* $H_0$: the new drug will not reduce the subjects' weight
* $H_A$: the new drug will reduce the subjects' weight

We use **H** to denote a hypothesis and the subscript **0** or **A** to distinguish between the **null** and **alternative** hypothesis.

The company is studying whether the new drug will lead to weight loss, so the **alternative hypothesis** is specifically worded to state this. If we wanted to consider whether the drug changes weight in either direction, loss or gain, then the hypothesis would have to be phrased differently. This small detail is why it's important to be specific about what event we want to test.

Everybody is different, so we would expect the drug to affect individuals in different ways. There are also many other factors that can cause a subject's weight to change, regardless of what group they're in. Together, these factors cause the measured weight loss to vary in both groups.

If the weight loss can vary in both groups, how would we be able to compare them? We need a single value that helps summarize the weight loss in the entire group, and we can get this in the **mean** or **average** of each group. 

Taking the mean of each group makes it easier to compare **Group A** and **Group B** in terms of weight loss. The diagram below illustrates the weight loss of each subject, stratified by drug group:

![image.png](attachment:image.png)

There is some overlap between the distribution of the two groups' weight losses. That being said, there is also a slight shift to the right in **Group B**. Before we move on, calculate the means to get an idea of how different the average weight losses are between the two groups.

**Task**

* Calculate the mean for Group A.
* Calculate the mean for Group B.

**Answer**

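```r
library(readr)

# Load the experiment data: one column of weight losses per group.
data <- read_csv("weight_loss.csv")

# Average weight loss in the control (A) and treatment (B) groups.
mean_group_a <- mean(data$A)
mean_group_b <- mean(data$B)
```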

We saw an experiment that split subjects into two groups: a **treatment** group and a **control** group. Since there was natural variation in the amount of weight loss among the subjects, we took the average weight loss in each group. 

We'll dig more into the intuition for why we would use the mean to compare the two groups. We'll refer back to the visualization of the data below:

![image.png](attachment:image.png)

For **Group A**, most of the weight losses hover around 2.5 pounds, as indicated by the peak in the red empirical distribution. Some people lose less than 2.5 pounds while others lose more, but the general balance is around 2.5 pounds.

Generally, about as many people lose a lower-than-average amount of weight as lose a higher-than-average amount. As we get farther away from the mean weight loss, Group A's empirical distribution keeps an approximately symmetric shape. Overall, Group A's distribution forms an approximate bell curve, which we learned is called the **normal distribution**. The same can be said of the shape of **Group B's** distribution.

If we **assume** that each group's weight losses form a **normal distribution**, we can develop the intuition behind how to use the means to decide which hypothesis to support. We'll put aside the raw data for a bit to consider some hypothetical situations. With two normal curves, there are two extreme situations in how much the two overlap. 

1. One extreme case is no overlap, while 
2. the other is total overlap, as seen below.

![image.png](attachment:image.png)

If two experimental groups looked like the left plot, then it is apparent that the two groups are very different. Conversely, if the two groups resembled the right plot, then the distribution of their weight losses would look similar. This brings us back to the means of each group.

Recall that normal distributions are defined by two important values: 

1. the mean and 
2. the variance. 

The `mean` is represented by the peak of the **normal distribution**, while the `variance` changes how wide the bell is. 
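To make the roles of the two values concrete, here is a small sketch using R's `dnorm()` density function. The mean of 2.5 and the variances of 1 and 4 are arbitrary values picked for illustration.

```r
# Evaluate two normal density curves on a grid of x values.
x <- seq(-10, 10, by = 0.01)

# Same mean (2.5), different variances (1 and 4); these values are made up.
narrow <- dnorm(x, mean = 2.5, sd = sqrt(1))
wide <- dnorm(x, mean = 2.5, sd = sqrt(4))

# Both curves peak at the mean...
peak_narrow <- x[which.max(narrow)]
peak_wide <- x[which.max(wide)]

# ...but the curve with the larger variance has a lower peak and is wider.
narrow_is_taller <- max(narrow) > max(wide)
```

Note that `dnorm()` takes the standard deviation rather than the variance, which is why the variances are wrapped in `sqrt()`.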

When we look at the separation between the two curves in the diagram, we can think of this as comparing their means. That is, if the means of the two groups are extremely different, it suggests that the groups are truly different from each other. Applying this idea back to our weight loss data, if the average weight loss of **Group A** is much less than the average of **Group B**, it suggests that they are different; that is, the drug helped increase weight loss in **Group B**!

This idea is central to this file and is worth reiterating. Natural variation in our subjects causes them to experience different degrees of weight loss. This variation leads to approximate bell shapes for both groups' weight losses.

With our data, we can calculate the **average** weight loss for each group, as well as the **variance** of the data. By performing this experiment with their new drug, the company is trying to study if the drug will increase the average amount of weight loss enough in the group that takes it. If the drug `increases` the **average** weight loss enough, we will be able to distinguish it more easily in a visualization.

In real-world data, the overlap of the two groups will vary somewhere between the two extremes, and we see this in our data. The question of comparing groups then becomes, "How do we know they're different if there is some overlap?" Statistics give us an answer to this.

Recall that statistics are really just handy summaries of data. In terms of the weight loss experiment, 

* the `average` or `mean` would tell us how much weight loss a new subject would expect to experience if they were put in a particular group. 
* The `variance` tells us how much the weight losses in a group differ from the average. 

As we've mentioned before, the `mean` plays a crucial role in allowing us to compare groups, but we haven't incorporated it into our hypotheses! For our weight loss experiment, our two hypotheses currently take the form:

* $H_0$: the new drug will not reduce the subjects' weight
* $H_A$: the new drug will reduce the subjects' weight

Above, we looked at how we might be able to distinguish between two normal curves. We saw that we could distinguish two normal curves if their two means were extremely different from each other. If the two curves have a lot of overlap, then the means would be similar to each other. 

The hypotheses in their current form are qualitative, and don't really allow us any way to do the testing we want to do. By grounding the hypotheses in numbers, we allow ourselves to better define them and make it easier to test which one is the case. We have assumed that each group is normally distributed, so we'll rephrase the hypotheses in terms of group means.

* $H_0$: the `mean` weight loss of **Group B** is the same as the `mean` weight loss of **Group A**.
* $H_A$: the `mean` weight loss of **Group B** is greater than the `mean` weight loss of **Group A**.

Notice that we haven't mentioned variance in our hypotheses. We could technically add variance to the hypotheses, but it adds extra complexity that we don't need at the moment. Variance still plays an important role in hypothesis testing, but more on a conceptual level. We'll refer back to our diagram of the two normal curves together:

![image.png](attachment:image.png)

In the above diagram, both groups have essentially the same shape. In other words, they have the same variance. Having a large variance means that the bell is wider and shorter, while having a small variance will make the bell taller and thinner. 

The variance of the normal curve influences how much it might overlap with other nearby curves, so it plays a critical role in changing how well we can distinguish between two normal curves. Applying this to the experiment, the variance of the two groups may decide if we are able to tell them apart! The diagram below illustrates this effect:

![image.png](attachment:image.png)

In both plots, the two curves on each plot are the same distance from each other. The only thing that was changed for the second plot was that the variance was reduced for both curves. On the left it is harder to distinguish between the two groups, while it is more clear on the right. 
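One way to put a number on "harder to distinguish" is the area shared by the two curves. The sketch below assumes both curves have the same standard deviation, in which case they cross at the midpoint between their means and the shared area has a simple closed form. The function name `overlap_area` and the example means and standard deviations are hypothetical.

```r
# Shared area under two normal curves with the SAME standard deviation.
# With equal spread the curves cross halfway between the two means, so the
# overlap is twice the tail area beyond that midpoint.
overlap_area <- function(mean_a, mean_b, sd) {
  2 * pnorm(-abs(mean_a - mean_b) / (2 * sd))
}

# Same distance between the means (2 units), different spread.
overlap_high_var <- overlap_area(0, 2, sd = 2)
overlap_low_var <- overlap_area(0, 2, sd = 0.5)

# Shrinking the variance shrinks the overlap, so the groups separate.
overlap_low_var < overlap_high_var
```

An overlap of 1 means the curves coincide, while an overlap close to 0 means they barely touch.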

Variance plays a larger role in distinguishing between two groups when the differences between them are small. Two normal curves could have a small but meaningful difference between them, but if they both have high variance, this difference might be hidden. This fact is why researchers and data scientists try to have as many sample points as possible for their experiments: more subjects typically reduce the variance of the group means we calculate from them.
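The effect of sample size can be checked with a short simulation. The sketch below repeats a made-up experiment many times and measures how much the group mean varies from one repetition to the next. The mean of 2.5, the standard deviation of 2, the group sizes, and the helper name `var_of_mean` are all assumptions for illustration.

```r
set.seed(1)

# Variance of the sample mean across many repeated simulated experiments.
var_of_mean <- function(n, reps = 2000) {
  means <- replicate(reps, mean(rnorm(n, mean = 2.5, sd = 2)))
  var(means)
}

var_small_n <- var_of_mean(10)   # theory: sd^2 / n = 4 / 10
var_large_n <- var_of_mean(100)  # theory: sd^2 / n = 4 / 100

# Larger groups give a more stable mean.
var_large_n < var_small_n
```

The individual weight losses are just as spread out in both cases. What shrinks with more subjects is the variance of the mean, which is the quantity that matters when comparing group averages.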

We're close to learning our first hypothesis test. We've slowly developed an intuition for how to tell two normal curves apart by looking at both their means and variances. Before we move on, we'll need to calculate the variance for each group.

**Task**

* Calculate the variance for **Group A**.
* Calculate the variance for **Group B**.

**Answer**

```r
var_group_a <- var(data$A)
var_group_b <- var(data$B)
```

We alluded to the fact that real-world data usually has some degree of overlap between two groups of data. In our case, we saw that there was some overlap between the weight losses of Group A and Group B.

![image.png](attachment:image.png)

We have assumed that the weight losses for both **Group A** and **Group B** are **normally distributed** and calculated their `means` and `variances`. 

We've also rephrased our hypotheses in terms of their `means`. These steps were all necessary in setting up the test we'll perform to decide if the two groups are different in the presence of overlap between their weight losses.

Let's take another look at our two hypotheses:

* $H_0$: the mean weight loss of Group B is the same as the mean weight loss of Group A
* $H_A$: the mean weight loss of Group B is greater than the mean weight loss of Group A

While these hypotheses are what we need, they could still use some improvement. Converting the hypotheses into more quantitative terms was a step in the right direction, but mathematical statements are the best for assigning a clear `"yes/no"` decision to the hypotheses. 

We can convert the above hypotheses into a mathematical form using some notation:

![image.png](attachment:image.png)

We use $\bar{x}$ to denote the `average` and the subscript **A** or **B** to distinguish between the two groups. We can confirm for ourselves that the phrases are the same as the mathematical equations we wrote out.

One benefit of converting the phrases into mathematical terms is that we can manipulate the mathematical versions in helpful ways. For example, instead of looking at two quantities ($\bar{x}_A$ and $\bar{x}_B$), we can subtract $\bar{x}_B$ from both sides and focus the hypothesis onto one quantity: the difference between the means.

![image.png](attachment:image.png)

The hypotheses are still the same, but the above hypotheses suggest something different: that the difference between the two group `means` is either **zero** or **less than zero**. 

For this particular example, we're going to assume that the `variances` we calculated from the data represent the `true variance` we would see if we tested many, many people. 

The difference between the `means` of two normal distributions also follows a **normal distribution**. For this particular normal distribution, its `mean` is $\bar{x}_A - \bar{x}_B$ and its `variance` is a weighted sum of the variances of the two groups we're comparing, given by the following equation:

![image.png](attachment:image.png)

We moved from comparing two quantities to focusing on their difference instead. Thanks to some technical properties of the normal distribution, we know that the difference between the means of two groups that follow normal distributions is also normal. Before we finally learn the test, let's calculate the mean and variance of this particular normal.

**Task**

1. Using the data, calculate the difference of the means of the groups.
2. Using the data, calculate the variance of the difference of the means of the groups.

**Answer**

```r
# Difference of the group means (A minus B) and the variance of that difference.
diff_mean <- mean(data$A) - mean(data$B)
diff_var <- (var(data$A) / length(data$A)) + (var(data$B) / length(data$B))
```

Let's look at the mathematical versions of our null and alternative hypotheses:

![image.png](attachment:image.png)

We learned above that the difference between `means` of two **normal distributions** also follows a normal. This detail gives our hypotheses an additional important meaning: 

* the **null hypothesis** is a declaration that this particular normal distribution (aka the difference in weight loss) has a `mean` of **0**. 
* The **alternative hypothesis** says that its `mean` is **less than 0**. Both hypotheses are statements about probability distributions!

The importance of this statement cannot be overstated. When we create hypotheses for hypothesis tests, we are making claims that the world follows a particular probability distribution. 

Our first null hypothesis was "the new drug will not reduce the subjects' weight," which we slowly transformed into "the difference in `means` between the two groups is **zero**". 

Along the way we incorporated the **normal distribution** into our hypotheses. These statements are still the same, but we have slowly moved from **qualitative** terms to **quantitative** definitions. Statistics and probability together form the basis of hypothesis testing and are what allow us to make mathematically rigorous assertions about the world around us. Now we will see how to assign a probability to our hypotheses.

Recall that any normal distribution can be standardized, meaning that we can change it to have `mean` **0** and `standard deviation` **1**. In order to standardize a **normal distribution**, we subtract the `mean` and divide by the `standard deviation`, as shown below:

![image.png](attachment:image.png)

![image.png](attachment:image.png)

This quantity is known as a **test statistic**. We call them test statistics because they are used to conduct the hypothesis test and conclude which hypothesis to support. 
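As a sketch of how this standardization plays out in R, the block below computes the test statistic on simulated data. The data frame `sim` is a hypothetical stand-in for `weight_loss.csv`: its group sizes, means, and standard deviations are invented, so the resulting value will not match the one from the real data.

```r
# Hypothetical stand-in for weight_loss.csv: simulated weight losses in pounds.
set.seed(123)
sim <- data.frame(
  A = rnorm(50, mean = 2.5, sd = 2),  # control group (placebo)
  B = rnorm(50, mean = 4.0, sd = 2)   # treatment group (drug)
)

# Mean and variance of the difference between the group means.
diff_mean <- mean(sim$A) - mean(sim$B)
diff_var <- var(sim$A) / length(sim$A) + var(sim$B) / length(sim$B)

# Standardize: subtract the mean under the null hypothesis (0), then divide
# by the standard deviation of the difference.
t_stat <- (diff_mean - 0) / sqrt(diff_var)
t_stat
```

Base R's `t.test(sim$A, sim$B)` reports the same statistic by default, which gives a convenient cross-check on the hand calculation.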

Test statistics often follow a well-known probability distribution, so we can calculate the point probabilities and cumulative probabilities. 

In this case, `t` follows a **t-distribution**. A t-distribution is bell shaped like the normal distribution, but it gives slightly more probability to the tails than the normal. Because we are trying to investigate the means of **two** groups and the test statistic follows a t-distribution, we are conducting a two sample independent t-test. A **two sample independent t-test** is used to compare continuous quantities between two groups.

Since test statistics follow a probability distribution, we can calculate a probability of observing that test statistic. From there, we can make a data-driven judgment call on which hypothesis we should follow.

We calculated a test statistic of about -6.96. For the two sample independent t-test, the test statistic is the ratio of **the difference between the means** to the **standard deviation** of that difference. We always interpret test statistics relative to a null hypothesis, so we would interpret a test statistic of -6.96 as being about 7 standard deviations away from zero! 

Under a normal distribution, most of the data will fall between -3 and 3 standard deviations, so seeing a test statistic 7 standard deviations away is extremely unlikely. The diagram below illustrates this point:

![image.png](attachment:image.png)

We can use our knowledge of cumulative probability to calculate the probability of seeing a test statistic **this extreme or more**. Since we are treating the variances as known, we can use the `pnorm()` function for the normal distribution. Assuming that the data follows a normal distribution gives us access to all this useful functionality. Assuming that the null hypothesis is true, the probability of observing a test statistic of `-6.96` or lower is:

![image.png](attachment:image.png)

We look at all the points at -6.96 and below because, for a continuous distribution like the normal, the probability of any single exact value is zero, so only ranges of values have meaningful probabilities. Our alternative hypothesis only considers that Group B's weight loss will be greater than Group A's, so we only need to look at one side of the normal distribution. This probability is extremely small and suggests that the actual difference of means is lower than zero. Equipped with this probability, we can finally make a judgment on which hypothesis we should support.
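The tail probability can be reproduced with a single call to `pnorm()`, using the test statistic of -6.96 quoted above. The variable names are only illustrative.

```r
test_stat <- -6.96

# Lower-tail probability: P(Z <= -6.96) for a standard normal Z.
p_value <- pnorm(test_stat)
p_value

# By symmetry, the same area sits in the upper tail beyond +6.96.
all.equal(p_value, pnorm(-test_stat, lower.tail = FALSE))
```

By default `pnorm()` returns the area to the left of its argument, which is exactly the one-sided region our alternative hypothesis cares about.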

The difference in the means we observed is so extreme under the null hypothesis that it seems unlikely that the null hypothesis is true. So, we will choose to **reject the null hypothesis**. 

The phrasing here is specific: we cannot say that the **alternative hypothesis** is supported. Even though we observed a test statistic of -6.96, it would be incorrect to say that the actual difference in average weight loss between the groups is this value. We have only shown that assuming a mean difference of 0 is highly unlikely. We can only either **reject the null hypothesis** or **fail to reject the null hypothesis** in hypothesis testing. This phrasing is chosen specifically so that we can rule out null hypotheses that are unlikely to be true.

The probability we calculated earlier also has a famous name: the **p-value**. In general, the p-value represents the probability of observing a test statistic at least as extreme as the one we observed, assuming that the null hypothesis is true.

If this value is high, the difference in weight loss between the two groups is consistent with the null distribution. Translated back to our original hypotheses, we do not have enough evidence to conclude that the new drug increased weight loss. 

On the other hand, a low **p-value** means that a mean difference as extreme as the one we observed would be very unlikely if the null hypothesis were true.

This brings up an issue: how small should a **p-value** be before we decide to **reject** the **null hypothesis**? Unfortunately, there is no universally accepted answer, but many research fields have agreed on `0.05` or `0.01` as "safe" thresholds: we **reject** the **null hypothesis** when the **p-value** falls below the chosen threshold.

Hypothesis testing is one of the main reasons that learning probability and statistics is so important for a data scientist. What we've learned here forms a great foundation for learning other types of tests, so we'll do a quick review of our process before we move on.

**Task**

* Say we calculated a test statistic of -3. Would we reject the null hypothesis if we observed this test statistic if we were using a threshold of 0.05? Assign `TRUE` to **reject_null_stat_one**, `FALSE` otherwise.
* Say we calculated a test statistic of `-1.6`. Would we reject the null hypothesis if we observed this test statistic if we were using a threshold of 0.05? Assign `TRUE` to **reject_null_stat_two**, `FALSE` otherwise.

**Answer**

```r
reject_null_stat_one <- TRUE
reject_null_stat_two <- FALSE
```
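These answers can be checked by converting each test statistic into a one-sided p-value with `pnorm()` and comparing it to the threshold, following the same normal-distribution approach used above. The variable names in this sketch are hypothetical and are not part of the task.

```r
threshold <- 0.05

# One-sided p-values: probability of a statistic this low or lower under the null.
p_stat_one <- pnorm(-3)    # roughly 0.0013
p_stat_two <- pnorm(-1.6)  # roughly 0.055

# Reject the null hypothesis only when the p-value is below the threshold.
reject_check_one <- p_stat_one < threshold
reject_check_two <- p_stat_two < threshold
```

A test statistic of -1.6 falls just short of the 0.05 threshold, which shows how close calls depend on the threshold chosen before the experiment.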

We make hypotheses when we propose that the world works a certain way. In the case of our weight loss experiment, we hypothesized that the new drug would help subjects lose weight. We know that the drug will either cause a change or not, so we need to develop a **null hypothesis** to balance out the **alternative hypothesis**. 

The experiment split the subjects into two groups: 
1. one that received the drug, and 
2. another that didn't. 

After performing the experiment and collecting our data, we started to investigate the average weight loss in each group to allow us to better compare the two groups. Afterwards, we translated our hypotheses into mathematical statements in order to connect them to our data. 

We calculated a **test statistic** and used our knowledge of the normal distribution to calculate a special probability. This probability, a **p-value**, tells us how likely we would be to observe a difference in `mean` weight losses at least as extreme as ours, assuming that the drug caused no change at all. 

Since the **p-value** was small, we decided to **reject the null hypothesis** and concluded that the new drug did cause some degree of weight loss.

The process of hypothesis testing with an experiment can be simplified as follows:

1. Decide what kind of measurement we will be comparing between the treatment and control groups. This choice will change the type of test statistic we can use.
2. Define our null and alternative hypotheses in both simple and mathematical terms.
3. Calculate our test statistic.
4. Figure out what probability distribution our test statistic comes from.
5. Use this distribution to calculate our **p-value**.
6. Use this **p-value** to decide whether to `reject` or `fail to reject` the **null hypothesis**.
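The steps above can be collected into one small function. This is a sketch under the same assumptions used in this walkthrough (a one-sided comparison of two group means, with the test statistic treated as normally distributed), not a general-purpose test. The function name `one_sided_mean_test` and the simulated inputs are hypothetical.

```r
# Hypothetical helper: tests whether the treatment mean exceeds the control mean.
one_sided_mean_test <- function(control, treatment, threshold = 0.05) {
  # Steps 1-2: compare group means; the null says their difference is zero.
  diff_mean <- mean(control) - mean(treatment)
  diff_var <- var(control) / length(control) +
    var(treatment) / length(treatment)

  # Step 3: test statistic (difference divided by its standard deviation).
  test_stat <- diff_mean / sqrt(diff_var)

  # Steps 4-5: treat the statistic as standard normal and take the lower tail.
  p_value <- pnorm(test_stat)

  # Step 6: reject the null hypothesis when the p-value is below the threshold.
  list(test_stat = test_stat, p_value = p_value,
       reject_null = p_value < threshold)
}

# Simulated example: the treatment group loses more weight on average.
set.seed(2024)
control <- rnorm(200, mean = 2.5, sd = 2)
treatment <- rnorm(200, mean = 4.5, sd = 2)
result <- one_sided_mean_test(control, treatment)
result$reject_null
```

Returning the test statistic and p-value alongside the decision keeps the intermediate numbers available for inspection rather than hiding them behind a single `TRUE` or `FALSE`.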

In this file, we got an introduction to hypothesis testing and the process of evaluating hypotheses. The two sample independent t-test is used for performing tests on outcomes that are continuous. If we were looking at another type of outcome, like a categorical value, then we would use a different test.