<img src="materials/images/introduction-to-statistics-II-cover.png"/>


# 👋 Welcome, before you start
<br>

### 📚 Module overview

We will go through eleven lessons with you:

- <font color=#E98300>**Lesson 1: Z-score**</font>    `📍You are here.`
    
- [**Lesson 2: P-value**](Lesson_2_P-value.ipynb)
    
- [**Lesson 3: Welch's T-test**](Lesson_3_Welchs_T-test.ipynb)

- [**Lesson 4: Log2 Fold Change**](Lesson_4_Log2_Fold_Change.ipynb)

- [**Lesson 5: Pearson Correlation**](Lesson_5_Pearson_Correlation.ipynb)

- [**Lesson 6: Spearman Correlation**](Lesson_6_Spearman_Correlation.ipynb)

- [**Lesson 7: False Discovery Rate**](Lesson_7_False_Discovery_Rate.ipynb)

- [**Lesson 8: Benjamini-Hochberg**](Lesson_8_Benjamini_Hochberg.ipynb)

- [**Lesson 9: Dimensionality Reduction Methods: Principal Component Analysis**](Lesson_9_Dimensionality_Reduction_Methods_Principal_Component_Analysis.ipynb)

- [**Lesson 10: Dimensionality Reduction Methods: t-SNE**](Lesson_10_Dimensionality_Reduction_Methods_t-SNE.ipynb)

- [**Lesson 11: UMAP**](Lesson_11_UMAP.ipynb)
</br>



<div class="alert alert-block alert-info">
<h3>⌨️ Keyboard shortcut</h3>

These common shortcuts can save you time as you go through this notebook:
- Run the current cell: **`Shift + Enter`**.
- Add a cell above the current cell: Press **`A`**.
- Add a cell below the current cell: Press **`B`**.
- Change a code cell to markdown cell: Select the cell, and then press **`M`**.
- Delete a cell: Press **`D`** twice.

Need more help with keyboard shortcuts? Press **`H`** to look them up.
</div>

---

# Lesson 1: Z-score

To understand what a <mark>**z-score**</mark> is, we must first become familiar with what a <mark>**normal distribution**</mark> is.

`🕒 This module should take about 15 minutes to complete.`

`✍️ This notebook is written using Python.`

---

# Normal Distribution

<img src="materials/images/images_z-score/normal.png"/>

We say that a dataset has a normal distribution if its values follow a smooth (continuous), bell-shaped curve that is symmetric: each side looks the same when the curve is cut down the middle. The normal distribution is extremely important to statistics. One reason it is so useful is that it enables us to determine the probability of something occurring due to chance. For example, it enables us to determine the probability that a data point will fall a given distance from the distribution's mean.

## The Empirical Rule (also known as the "68-95-99.7 Rule")

The standard normal (Z) distribution has a mean of zero and a standard deviation of 1. You can think of the **standard deviation** as roughly **the average distance of a data point from the mean**. A value on the Z-distribution represents the number of standard deviations the data is above or below the mean; these are called <mark>**z-scores**</mark>. 

A z-score tells us how far a value is from the mean, measured in standard deviations. For example, a score equal to the mean is zero distance from the mean, so its z-score is zero. Likewise, z = 1 on the Z-distribution represents a value that is 1 standard deviation above the mean, and z = -1 represents a value that is 1 standard deviation below the mean.

In a normal distribution, approximately <mark>**68%**</mark> of values fall within $\pm$ 1 standard deviation of the mean. Approximately <mark>**95%**</mark> of values fall within $\pm$ 2 standard deviations of the mean. And virtually all values (<mark>**99.7%**</mark>) fall within $\pm$ 3 standard deviations of the mean.

This is known as the Empirical Rule. This characteristic of the normal distribution is what enables us to estimate the probability of selecting a data point that lies a certain distance from the mean of the distribution, as measured by its z-score.
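
The Empirical Rule can be checked on simulated data. The sketch below is only an illustration: the mean of 100, the standard deviation of 15, the sample size, and the name `fraction_within` are assumptions made for this example, not values from the lesson.

```python
import random
import statistics

# Simulate a normally distributed dataset.
# Assumed parameters for illustration only: mean 100, standard deviation 15.
rng = random.Random(0)  # fixed seed so the run is repeatable
values = [rng.gauss(100, 15) for _ in range(50_000)]

mean = statistics.fmean(values)
sd = statistics.pstdev(values)


def fraction_within(k):
    """Share of simulated values within k standard deviations of the mean."""
    inside = sum(1 for v in values if abs(v - mean) <= k * sd)
    return inside / len(values)


for k in (1, 2, 3):
    print(f"within +/-{k} SD: {fraction_within(k):.1%}")
```

With a sample this large, the three printed shares should land close to 68%, 95%, and 99.7%, although the exact digits depend on the random draw.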

<img src="materials/images/images_z-score/emperical_rule.png"/>

Before we discuss how the z-score is calculated, let's make sure that we can picture the key characteristics of a normal distribution.

<img src="materials/images/images_z-score/68-95-99.png"/>

Each normal distribution has its own mean, $\mu$, and its own standard deviation, $\sigma$, and the total area under the curve is 1. 

### 68% of values, in a normal distribution, will fall within 1 standard deviation of the mean.

<img src="materials/images/images_z-score/68.png"/>

### 27% of values will fall between 1 and 2 standard deviations above and below the mean.

<img src="materials/images/images_z-score/27.png"/>

### Approximately 4.7% of values will fall between 2 and 3 standard deviations above and below the mean.

<img src="materials/images/images_z-score/5.png"/>

### Only about 0.3% of values will fall beyond 3 standard deviations away from the mean.

<img src="materials/images/images_z-score/3.png"/>
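
The areas of these bands can also be computed directly from the standard normal curve. This is a minimal sketch using `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `share_within` is made up for this example. The exact areas differ slightly from the rounded figures in the headings above, because 68%, 95%, and 99.7% are themselves rounded. The band between 2 and 3 standard deviations, for instance, comes out closer to 4.3%.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1


def share_within(k):
    """Area under the standard normal curve between -k and +k."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


within_1 = share_within(1)
between_1_and_2 = share_within(2) - share_within(1)
between_2_and_3 = share_within(3) - share_within(2)
beyond_3 = 1 - share_within(3)

print(f"within 1 SD:        {within_1:.2%}")
print(f"between 1 and 2 SD: {between_1_and_2:.2%}")
print(f"between 2 and 3 SD: {between_2_and_3:.2%}")
print(f"beyond 3 SD:        {beyond_3:.2%}")
```

The four shares add up to 1, matching the total area under the curve.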

We will now use the above information to understand the usefulness of the z-score.

---

# Standardization

Using the mean and standard deviation, we are able to convert a score into what's called a z-score. A z-score, also known as a standard score, helps us to understand where an individual score falls in relation to other scores in a distribution. Standardization is simply the process of converting each score in a distribution to a z-score.

<img src="materials/images/images_z-score/standardization.png"/>

$$\large \mu = \text{mean}$$
$$\large \sigma = \text{standard deviation}$$

$$\huge\ z = \frac{x-\mu}{\sigma}$$

You can think of the z-score as communicating how far away from the expected value (mean) each value in a distribution is, in standard deviation units. Therefore, standardization is simply the process of converting individual raw scores into standard deviation units. Let's look at an example.

Suppose that you took a final exam in your biology class and scored 75 out of 100. Let's say that the mean score on the exam was 70, with a standard deviation of 5. Let's convert your score to a z-score to see where it falls in relation to the other scores in the distribution.

$$\huge\ z = \frac{75-70}{5}$$

$$\huge\ z = 1 $$
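
The same calculation can be sketched in Python. The function name `z_score` and the list of class scores are hypothetical and are used here only to illustrate standardizing one score and then a whole set of scores.

```python
import statistics


def z_score(x, mean, sd):
    """Convert a raw score to a z-score: (x - mean) / sd."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd


# The exam example from the text: score 75, mean 70, standard deviation 5.
exam_z = z_score(75, mean=70, sd=5)
print(exam_z)  # 1.0

# Standardizing a whole (made-up) set of scores converts each one to a z-score.
scores = [60, 65, 70, 75, 80]
mu = statistics.fmean(scores)
sigma = statistics.pstdev(scores)  # population standard deviation
z_scores = [z_score(s, mu, sigma) for s in scores]
print([round(z, 2) for z in z_scores])
```

After standardization, the z-scores have a mean of 0 and a standard deviation of 1, whatever scale the raw scores were on.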

---

# Estimating probabilities and percentiles

Your test score of 75 minus the mean score of 70 gives you a value of 5. We then divide that result by the standard deviation of 5 to return a z-score of 1 (1 standard deviation above the mean). Let's interpret your z-score. 

<img src="materials/images/images_z-score/prob_1stdev.png"/>

From the earlier illustration of the normal distribution, we know that 50% of values fall below a z-score of zero (i.e. the point where the raw score equals the mean) and 50% fall above it. This is because the normal distribution is symmetrical, with half the values above the mean and the other half below. We also saw that approximately 34% of values lie between the mean and 1 standard deviation above the mean. We can use this information to estimate how your test score compares with those of the other students who took the exam.

To estimate the probability that a score was lower than yours, we add the 50% and the 34% to get the area under the curve to the left of your score, which is 84%. Conversely, there is a 16% chance (1 - 0.84) that someone scored better than you did on the exam. Relative to the other students who took the exam, you did pretty well: your score is at the 84th percentile.
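
The estimate above can be checked against the normal cumulative distribution function. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) and assumes, as the text does, that exam scores are normally distributed.

```python
from statistics import NormalDist

z = 1.0  # z-score of the exam result from the example

# Probability that a score falls below z on the standard normal curve.
below = NormalDist().cdf(z)
above = 1 - below

print(f"share scoring lower:  {below:.2%}")  # 84.13%
print(f"share scoring higher: {above:.2%}")  # 15.87%

# Working on the raw scale (mean 70, standard deviation 5) gives the same
# answer, because standardizing does not change the area under the curve.
below_raw = NormalDist(mu=70, sigma=5).cdf(75)
print(f"{below_raw:.2%}")  # 84.13%
```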

## Z-table
Because probabilities for a normal distribution are tedious to calculate by hand, we can convert a value to a z-score and use the Z-table to find the probability we need. **To use the Z-table to find probabilities, do the following:**

1. Go to the row that represents the leading digit of your z-value and the first digit after the decimal point.
2. Go to the column that represents the second digit after the decimal point of your z-value.
3. Intersect the row and column.

That number represents the probability that a score from this distribution will fall below this z-score.

For example, the Z-table below displays the probabilities for z-scores 1 standard deviation below and 1 standard deviation above the mean. We can see that, as we estimated previously, our test score, which falls 1 standard deviation above the mean, is calculated to be above 84.13 percent of scores on this exam.

<img src="materials/images/images_z-score/z_table.png"/>
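
The lookup steps can also be mimicked in code. This is a rough sketch rather than a replacement for the table: the function name `z_table_lookup` is made up for this example, and it rounds the z-score to two decimals, as a printed Z-table does.

```python
from statistics import NormalDist


def z_table_lookup(z):
    """Return the Z-table row label, column label, and cumulative probability."""
    sign = -1 if z < 0 else 1
    hundredths = round(abs(z) * 100)        # z expressed in whole hundredths
    tenths, second_decimal = divmod(hundredths, 10)
    row = sign * tenths / 10                # leading digit and first decimal
    column = second_decimal / 100           # second decimal
    probability = NormalDist().cdf(sign * hundredths / 100)
    return row, column, round(probability, 4)


print(z_table_lookup(1.0))   # (1.0, 0.0, 0.8413)
print(z_table_lookup(-1.0))  # (-1.0, 0.0, 0.1587)
```

The first result matches the 84.13 percent read from the table for a score 1 standard deviation above the mean.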

<div class="alert alert-block alert-warning">
<b>Alert:</b> You need not worry about whether to include an “equal to” in a greater-than probability because the probability of a continuous random variable equaling one number exactly is zero. (There is no area under the curve at one specific point.)
</div>

---

# Comparing different variables

When we standardize values (convert them to z-scores) we've put them into standard terms. We've put them on the same scale. So, standardizing values is extremely useful for comparing different variables to each other, especially when they are on completely different scales. 

For example, let's say that we wanted to compare a student's GRE score of <mark>**325**</mark>, TOEFL score of <mark>**104**</mark> and GPA of <mark>**3.89**</mark> to determine which score was the most impressive. Presently, they each are on different scales, so it's difficult to compare them without specific domain knowledge. But if we standardize them, they can then be fairly compared. Let's do that.

<img src="materials/images/images_z-score/comp_1.png"/>

---

First, we'll standardize the GRE score. Remember, in order to standardize a score, we subtract the mean from the score, and then divide by the standard deviation. 

Suppose we already know that the mean GRE score is 316 and the standard deviation is 11. We can now standardize the raw score as shown below.

<img src="materials/images/images_z-score/comp_2a.png"/>

$$\huge\ z = \frac{325-316}{11}$$

$$\huge\ z \approx 0.82 $$

<img src="materials/images/images_z-score/comp_2b.png"/>

We can see that the GRE score is above the mean value but, importantly, we are able to quantify how much greater it is than the mean value in standard deviation units.

---

Now, let's standardize the TOEFL score. It has a mean score of 107, and a standard deviation of 6.

<img src="materials/images/images_z-score/comp_3a.png"/>

$$\huge\ z = \frac{104-107}{6}$$

$$\huge\ z = -.50 $$

<img src="materials/images/images_z-score/comp_3b.png"/>

The TOEFL score is below the mean but, again, we can quantify this measure in standard deviation units.

---

Finally, we'll standardize the GPA score. It has a mean score of 3.46 and a standard deviation of .26.

<img src="materials/images/images_z-score/comp_4a.png"/>

$$\huge\ z = \frac{3.89-3.46}{.26}$$

$$\huge\ z = 1.65 $$

<img src="materials/images/images_z-score/comp_4b.png"/>

Like the GRE score, the GPA is above its mean. By standardizing these distances, we can compare how far each score lies from its respective mean.

After standardizing the scores (converting them to z-scores), we can see that the student's GPA is the most impressive of the three: it lies the farthest above its respective mean, in standard deviation units.

By standardizing values (putting them on the same scale, in terms of standard deviations), we can compare just about any variable to another. This is especially important when the variables are natively on completely different scales.
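
The comparison can be reproduced in a few lines of Python. The dictionary layout and variable names below are illustrative; the scores, means, and standard deviations are the ones used in the example above.

```python
# Each entry: (student's score, mean, standard deviation) from the example.
scores = {
    "GRE": (325, 316, 11),
    "TOEFL": (104, 107, 6),
    "GPA": (3.89, 3.46, 0.26),
}

# Standardize each score: z = (x - mean) / standard deviation.
z_scores = {name: (x - mean) / sd for name, (x, mean, sd) in scores.items()}

# List the scores from the highest z-score to the lowest.
for name, z in sorted(z_scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: z = {z:+.2f}")
# GPA: z = +1.65
# GRE: z = +0.82
# TOEFL: z = -0.50

best = max(z_scores, key=z_scores.get)
print(best)  # GPA
```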

<img src="materials/images/images_z-score/comp_5.png"/>
<img src="materials/images/images_z-score/comp_6.png"/>

It should now be clear why standardizing raw scores into z-scores is so important. It's an effective way of determining where a raw score falls relative to other scores in a distribution. Further, it's an important method used to compare different variables to each other, especially when they are measured using disparate scales. 

---

# 🌟 Ready for the next one?
<br>

- [**Lesson 2: P-value**](Lesson_2_P-value.ipynb)
    
- [**Lesson 3: Welch's T-test**](Lesson_3_Welchs_T-test.ipynb)

- [**Lesson 4: Log2 Fold Change**](Lesson_4_Log2_Fold_Change.ipynb)

- [**Lesson 5: Pearson Correlation**](Lesson_5_Pearson_Correlation.ipynb)

- [**Lesson 6: Spearman Correlation**](Lesson_6_Spearman_Correlation.ipynb)

- [**Lesson 7: False Discovery Rate**](Lesson_7_False_Discovery_Rate.ipynb)

- [**Lesson 8: Benjamini-Hochberg**](Lesson_8_Benjamini_Hochberg.ipynb)

- [**Lesson 9: Dimensionality Reduction Methods: Principal Component Analysis**](Lesson_9_Dimensionality_Reduction_Methods_Principal_Component_Analysis.ipynb)

- [**Lesson 10: Dimensionality Reduction Methods: t-SNE**](Lesson_10_Dimensionality_Reduction_Methods_t-SNE.ipynb)

- [**Lesson 11: UMAP**](Lesson_11_UMAP.ipynb)
</br>

---

# Contributions & acknowledgment

Thanks to Antony Ross for contributing the content for this notebook.

---

Copyright (c) 2022 Stanford Data Ocean (SDO)

All rights reserved.