<a href="https://colab.research.google.com/github/afvallejo/CSHL2022/blob/main/1_Penguins_and_Descriptive_Statistics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Penguins and Descriptive Statistics

**Authors**: Clarence Mah (ckmah@ucsd.edu) and Michelle Franc Ragsac

**Credit**: Adapted from UCSD CMM262

We will learn how to use Jupyter Notebooks to explore the Palmer Penguins dataset in Python with some commonly-used data science packages such as `pandas`, `seaborn`, `matplotlib`, `numpy`, and `scipy`.

## Table of Contents
1. [The Palmer Penguins Dataset](#Exploring-the-Palmer-Penguins-DataFrame-with-the-pandas-and-seaborn-Packages)
2. [Descriptive Statistics](#2.-Descriptive-Statistics)


> The Palmer Penguins dataset is based on data points collected and made available by [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php) and the [Palmer Station, Antarctica LTER](https://pal.lternet.edu/), a member of the [Long Term Ecological Research Network](https://lternet.edu/).
>
> Source: https://allisonhorst.github.io/palmerpenguins/

---

## Exploring the Palmer Penguins `DataFrame` with the `pandas` and `seaborn` Packages

`DataFrames` are a fundamental object type for representing data sets, implemented by the `pandas` Python package. Generally: 

- Each **row** contains all information about a single entry in a data set.
- Each **column** describes a single aspect of all entries in a data set.

We'll start out this notebook by **importing** the `pandas` Python package so that we can access the Palmer Penguins dataset in nice, tabular fashion. 

<div class="alert alert-block alert-info">
    <p><b>Additional Information about <code>pandas</code> in case you're interested...</b></p>
    <p>Pandas is an open-source Python package for data analysis and manipulation built around <code>DataFrame</code> and <code>Series</code> objects. A <code>DataFrame</code> can be thought of as a two-dimensional array with attached row and column labels, similar to the tables found in Excel spreadsheets or <code>data.frame</code> objects in the R programming language. A <code>Series</code> can be thought of as a single column with attached row labels. Pandas also has tools for reading and writing data between in-memory structures and commonly-used formats, such as CSV, text files, Microsoft Excel sheets, SQL databases, and the HDF5 format.</p>
    <ul>
        <li><strong>Pandas Website</strong>: <a href="https://pandas.pydata.org/">https://pandas.pydata.org/</a></li>
        <li><strong>Pandas Code Base on GitHub</strong>: <a href="https://github.com/pandas-dev/pandas">https://github.com/pandas-dev/pandas</a></li>
    </ul>
</div>

We'll also be using the `seaborn` and `matplotlib` packages for visualizing our data! Both are frequently used to generate figures for publications. Visualizations are a powerful way to interpret relationships between variables, so they'll be helpful to create while we work with `pandas`.

### Import the Packages We'll be Using in this Portion of the Notebook

In [None]:
# First, import/load the pandas package into our notebook using "pd" as the shorthand
import pandas as pd 

# Next, import/load the seaborn package into our notebook using "sns" as the shorthand
import seaborn as sns

### Load in the Palmer Penguins Dataset using `seaborn`

Since `seaborn` includes a sample of the Palmer Penguins dataset, we can load it directly from the module for us to play with! 

In [None]:
# This command uses the load_dataset() method from the seaborn package (sns)
# to load the "penguins" dataset into a variable called penguins
penguins = sns.load_dataset("penguins")

---

## Quick Commands for Previewing Information Contained in `DataFrames` 

Now that we have the dataset loaded into our notebook, the first thing we'll want to do is preview what information is present. Here are the commands that we'll be going over:

* The `DataFrame.info()` method gives us a quick summary of all the columns present in our dataset
* The `DataFrame.describe()` method gives us some quick summary statistics of the numerical values present in our dataset
* The `DataFrame.shape` attribute contains the dimensions of the dataset in rows x columns format 
* The `DataFrame.head()` and `DataFrame.tail()` methods allow us to preview the head/beginning or tail/end of a dataset, respectively

<div class="alert alert-block alert-info">
    <p>With these commands, <code>DataFrame</code> should be replaced by the name of the variable containing your <code>pandas</code> <code>DataFrame</code> (e.g., <code>penguins.shape</code> instead of <code>DataFrame.shape</code> will give us the dimensions of the penguins dataset).</p>
</div>

### Getting Quick Summary Information About the `DataFrame` with the `info()` and `describe()` Methods

`DataFrame.info()` provides a quick summary about all of the entries within a `DataFrame`, along with the data types that are recognized in each of the columns. This method gives you the number of rows and columns, the number of non-null values, what type of data is in each column, and how much memory the data takes up on your computer. 

In [None]:
penguins.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB


<br>On the other hand, `DataFrame.describe()` provides summary statistics information about columns containing numerical values within our dataset! 

In [None]:
penguins.describe()

|  | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
| --- | --- | --- | --- | --- |
| count | 342.0 | 342.0 | 342.0 | 342.0 |
| mean | 43.92193 | 17.15117 | 200.915205 | 4201.754386 |
| std | 5.459584 | 1.974793 | 14.061714 | 801.954536 |
| min | 32.1 | 13.1 | 172.0 | 2700.0 |
| 25% | 39.225 | 15.6 | 190.0 | 3550.0 |
| 50% | 44.45 | 17.3 | 197.0 | 4050.0 |
| 75% | 48.5 | 18.7 | 213.0 | 4750.0 |
| max | 59.6 | 21.5 | 231.0 | 6300.0 |


Here, you can see that we were able to calculate the number of entries (`count`) within our columns, the `mean`, the standard deviation (`std`), the minimum value (`min`), the different percentiles, and the maximum value (`max`) for the columns in the Palmer Penguins dataset containing numerical values. Because the `species`, `island`, and `sex` columns within the dataset aren't numerical values, we don't calculate anything for them. 

### Viewing the Dimensions of a `DataFrame` with the `shape` Attribute 

The `DataFrame.shape` command shows how many rows and columns we have.

In [None]:
penguins.shape

(344, 7)

### Previewing a `DataFrame` with the `head()` and `tail()` Methods

View the first 5 rows with `DataFrame.head()`.

In [None]:
penguins.head()

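|  | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |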


View the last 5 rows with `DataFrame.tail()`.

In [None]:
penguins.tail()

<div class="alert alert-block alert-info">
    <p>The <code>DataFrame.head()</code> and <code>DataFrame.tail()</code> methods also take integers as input to specify the <strong>number of rows from the beginning or end of the dataset you would like to preview</strong>. For example, the command <code>penguins.head(10)</code> would allow us to preview the first 10 rows of the dataset instead of the 5 rows that we saw when we didn't provide an integer as input.</p>
</div>
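As a small illustration of passing an integer to these methods, here is a sketch on a tiny, made-up table (the `toy` DataFrame and its values are invented for this example and are not part of the Palmer Penguins data):

```python
import pandas as pd

# A tiny, made-up table standing in for the penguins DataFrame
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap", "Chinstrap"],
    "body_mass_g": [3750.0, 3800.0, 5000.0, 5200.0, 3700.0, 3650.0],
})

first_two = toy.head(2)   # the first 2 rows
last_three = toy.tail(3)  # the last 3 rows

print(first_two)
print(last_three)
```

The same pattern should work on `penguins`, e.g., `penguins.tail(3)`.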

---

## Evaluating Missing or Null or `NaN` Values in our `DataFrame` 

After previewing the Palmer Penguins dataset, we can see that there are a couple of missing values denoted as `NaN` in our data. It is always a good idea to check for missing values before continuing with our analysis. Within this section of the notebook, we'll be using the following commands to **clean** or remove these data points with null values from our dataset:

* The `DataFrame.isna()` method will help us detect rows (data points) that have a `NaN` or missing value in one of its columns
* The `DataFrame.dropna()` method will help us drop/remove these rows containing `NaN` or missing values from our dataset

In [None]:
# First, detect cells within our DataFrame that contain NaN/missing values 
penguins.isna()

<div class="alert alert-block alert-info">
    <p>The <code>DataFrame.isna()</code> method will show you the cells within your dataset that have missing values by either filling the cell with a <code>True</code> if the value in the cell is a <code>NaN</code> or missing, and <code>False</code> if the value in the cell is <strong>not <code>NaN</code> and not missing</strong>. Since this is kind of hard to read, we can do some more commands to digest what's going on and how many missing values we actually have in our dataset!</p>
</div>

Since the output from the `DataFrame.isna()` command is hard to read, we can use **both** the `DataFrame.isna()` and the `sum()` functions to calculate the number of `NaN` or missing values we have in each of our columns! 

In [None]:
# Next, let's determine the total number of NaN/missing values within our dataset with the sum() function...
# We can use the isna() function like before, but then redirect the output from that function into the sum() function right afterwards!
penguins.isna().sum()

From this output, we can see that there are `NaN` values present in the `bill_length_mm`, `bill_depth_mm`, `flipper_length_mm`, `body_mass_g`, and `sex` columns of our dataset: they have 2, 2, 2, 2, and 11 values missing in their columns, respectively. 
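To make the `isna().sum()` chain a little more concrete, here is a minimal sketch on a made-up table (the `toy` DataFrame is an assumption for illustration only). It also shows one way to count the number of *rows* that have at least one missing value, which is what `dropna()` will remove by default:

```python
import numpy as np
import pandas as pd

# Made-up data with a few missing entries
toy = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 40.3, 36.7],
    "sex": ["Male", None, "Female", None],
})

# isna() gives True/False per cell; sum() counts the True values per column
missing_per_column = toy.isna().sum()

# any(axis=1) flags rows with at least one missing cell; sum() counts those rows
rows_with_any_missing = toy.isna().any(axis=1).sum()

print(missing_per_column)
print(f"Rows with at least one missing value: {rows_with_any_missing}")
```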

### Removing Rows with Missing Values from a `DataFrame` with the `dropna()` Method

Let's go ahead and remove samples (rows) with missing values using `DataFrame.dropna()`.

In [None]:
# Let's remove rows in our dataset that have missing values in one or more of the columns with dropna()
penguins.dropna()

After running this method, we can see in the bottom-left that we have **333 penguin samples** instead of the original **344 penguin samples**. 

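Note that `dropna()` returns a new `DataFrame` and leaves the original unchanged, so remember to save the result to a variable! A common convention is to re-assign it to the same variable, `penguins`, so that the rows with missing values are excluded from here on.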

In [None]:
# Let's print out the shape of the DataFrame before and after removing rows with missing information
# so that we can see if a change was actually made to our variable! 
print(f"The shape of the DataFrame before removing rows with NaN values: {penguins.shape}") 

penguins = penguins.dropna() # run the same command as before BUT with saving it to a variable

# Print out the shape of the DataFrame after our modification...
print(f"The shape of the DataFrame after removing rows with NaN values: {penguins.shape}")

<div class="alert alert-block alert-info">
    <p>By default, the <code>DataFrame.dropna()</code> method drops rows with <code>NaN</code> values, but it can also be used to remove columns by specifying the <code>axis</code> parameter as <code>1</code> (e.g., <code>penguins.dropna(axis=1)</code> would drop columns in the Palmer Penguins dataset that have <code>NaN</code> values present).</p>
</div>
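Here is a short sketch comparing a few ways of calling `dropna()` on a made-up table (the `toy` DataFrame is invented for this example). In addition to `axis`, it uses the `subset` parameter, which restricts the check for missing values to the columns you list:

```python
import numpy as np
import pandas as pd

# Made-up data: one missing bill length and one missing sex
toy = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 40.3, 36.7],
    "body_mass_g": [3750.0, 3800.0, 3250.0, 3450.0],
    "sex": ["Male", "Female", None, "Female"],
})

rows_dropped = toy.dropna()                             # drop rows with any NaN
cols_dropped = toy.dropna(axis=1)                       # drop columns with any NaN
subset_dropped = toy.dropna(subset=["bill_length_mm"])  # only check bill_length_mm

print(f"Original shape:          {toy.shape}")
print(f"After dropna():          {rows_dropped.shape}")
print(f"After dropna(axis=1):    {cols_dropped.shape}")
print(f"After dropna(subset=..): {subset_dropped.shape}")
```

Using `subset` can be handy when you only care about missing values in the columns you plan to analyze.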

---

## Visualizations of Data with Built-In `pandas` Plotting Functions and the `seaborn` Package

It's always good practice to have a rough idea of how your data "looks" or how it is distributed before you start analyzing it! We can do this by visualizing the distribution for each variable (e.g., the penguin's bill length, named `bill_length_mm` in our dataset). Within this section, we'll be doing several things:

* Learning how to access certain columns in our `DataFrame` using the bracket notation 
* Generating basic plots using the built-in `pandas` `plot()` method
    * Generating histograms with `kind="hist"` as a method parameter
    * Generating scatter plots with `kind="scatter"` as a method parameter
* Generating plots across all numerical values with the `seaborn` method called `sns.pairplot()`

### Accessing Columns of a `DataFrame` with Bracket Notation

In [None]:
# In pandas, we can access certain columns using brackets 
# and then put the name of the column we want to access as a string inside
#     e.g., penguins["column_name_like_this"]

# For example, let's access the bill length column (called bill_length_mm) for the penguins dataset!
penguins["bill_length_mm"]

From this output, we can see that we were able to successfully access the bill lengths of all the penguins in our dataset! 

<div class="alert alert-block alert-info">
    <p>When we access a column in a <code>DataFrame</code> using this notation, the column is <strong>not</strong> returned as a <code>DataFrame</code> but as a <code>Series</code> object instead! You can think of a <code>pandas</code> <code>Series</code> as a labeled list of items rather than the table-like structure that <code>DataFrame</code> objects provide.</p>
</div>
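The difference between a `Series` and a `DataFrame` can be checked directly. The sketch below uses a made-up two-row table (`toy` is an assumption for illustration); note that passing a *list* of column names inside the brackets returns a `DataFrame`, even when the list has only one name:

```python
import pandas as pd

# Made-up data with two numerical columns
toy = pd.DataFrame({
    "bill_length_mm": [39.1, 39.5],
    "bill_depth_mm": [18.7, 17.4],
})

as_series = toy["bill_length_mm"]    # single brackets + string -> Series
as_frame = toy[["bill_length_mm"]]   # list of names -> DataFrame with 1 column
two_columns = toy[["bill_length_mm", "bill_depth_mm"]]  # DataFrame with 2 columns

print(type(as_series))
print(type(as_frame))
print(two_columns.shape)
```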

### Visualizing Data using Built-In `pandas` Plotting Functions

Within this section, we'll cover two common ways of visualizing biological data: 

* *Histograms* are an approximate representation of the distribution of numerical data;<br>they're a special form of bar chart where the data is continuous, so each bar is the frequency of occurrences for data points that fit within a certain "bin" 
* *Scatterplots* are a diagram that use Cartesian coordinates to display values for two variables in a dataset (usually);<br>the position of each dot on the horizontal and vertical axes represents the values for an individual data point

Luckily, `pandas` has a built-in plotting function that we can use to generate basic plots based on the information in our `DataFrame`: it's available as the `plot()` method (e.g., `DataFrame.plot()`). This `plot()` method has a parameter called `kind` that we can set to the type of plot we wish to generate! 

<div class="alert alert-block alert-info">
    <p>Depending on what you want to visualize, you can use the <code>plot()</code> function on either a <code>pandas</code> <code>DataFrame</code> or <code>Series</code> object.</p>
</div>

#### Generating a Histogram with the `pandas` `plot()` Function

Let's use a histogram to visualize the distribution of different penguin bill lengths within our dataset!

In [None]:
# First, we can access the bill length column within our dataframe,
# then right after it, we can plot the information using the plot() method while specifying that we want to see a histogram! 
penguins["bill_length_mm"].plot(kind="hist")

That seemed to work! Let's pick another variable and visualize it with the `plot()` function! 

<div class="alert alert-block alert-success">
    <p><b>Exercise:</b> Create a command in the code cell below to generate a histogram for penguin bill depth.</p>
</div>

In [None]:
# !TODO

#### Generating a Scatter Plot with the `pandas` `plot()` Function

Let's use a scatter plot to visualize different penguin bill lengths versus penguin bill depths within our dataset!

In [None]:
# Within this example, we'll run the plot() function on the penguins DataFrame object itself instead of specifying a specific column
# Then, within the plot() function, we'll specify the kind of plot as a scatter-plot and then what we want our x and y axes to represent! 
penguins.plot(kind="scatter", x="bill_length_mm", y="bill_depth_mm")

<div class="alert alert-block alert-success">
    <p><b>Exercise:</b> Create a command in the code cell below to generate a scatter plot for penguin flipper length versus penguin body mass.</p>
</div>

In [None]:
# !TODO

### Using `seaborn` to Quickly Visualize All Variables at Once with `sns.pairplot()`

Turns out we can visualize all the variables at once with the handy function `sns.pairplot()`!

Just like we did above with one or two variables, `sns.pairplot()` plots all variables as histograms on the diagonal, while the other plots are scatter plots showing all pairwise relationships between variables. The x and y axes are shared across plots.

In [None]:
# setting corner=True turns off the redundant plots in the upper right triangle of the matrix
sns.pairplot(penguins, corner=True) 

With `sns.pairplot()`, visualizing the distributions of all our variables was a whole lot easier! On the diagonal, we can see the histogram for each of our variables, and everywhere off the diagonal, we can see how pairs of variables plot against each other (e.g., on the bottom row, we can see `body_mass_g` versus `bill_length_mm`, `bill_depth_mm`, and finally, `flipper_length_mm`). 

---

![image.png](attachment:672e6850-faa1-4694-8432-f3af10d8dc1d.png)

<div class="alert alert-block alert-danger">
    <b>Let's go back to the lecture presentation before delving into the next few blocks of code!</b> 
</div>



# 2. Descriptive Statistics

Now that we have a handle on how to explore datasets in Python using `pandas`, let's go back to our lesson on Descriptive Statistics! 

> Remember, it's usually not possible to collect observations on every member of the population. For example, if we wanted to study wingspan of penguins, it's really difficult to measure the wingspan of **every single penguin on the face of the Earth**! Fortunately, we can take a *sample* of the total *population*, assuming that our sample is a decent representation of what we want to study! Therefore, the **sample** mean, variance, and standard deviation are *estimates* of the true population mean, variance, and standard deviation! 

Within this portion of the notebook, we'll be learning about the normal distribution and its characteristics by using some of the concepts from Descriptive Statistics. 

## The Normal Distribution 

Empirical data often follow or approximate a bell-shaped distribution, called the normal distribution, and this shape tends to become easier to see as the sample size increases. 

<img alt="Normal Distribution Image" src="https://github.com/biom262/cmm262-2021/raw/main/module-2-statistics/img/day1_norm-dist.png">

Looking at the pairplot above, do any of the variables resemble this shape? The image above is a representation of the normal distribution! 

---

### Import the Packages We'll be Using in this Portion of the Notebook

For this portion of the notebook, we'll be importing another plotting library called `matplotlib.pyplot` to visualize our data. We'll also be importing a package called `numpy` to perform some numerical operations later on! 

In [None]:
# Load submodule pyplot from matplotlib using "plt" as the shorthand
import matplotlib.pyplot as plt

# Load numpy package using "np" as the shorthand
import numpy as np

---

## The Mean and Median

Let's inspect our sample's body mass distribution and calculate the mean and median of our data points! 

To make things easier, we can access the `body_mass_g` column of our Palmer Penguins dataset and save the values to a variable called `mass`. 

In [None]:
mass = penguins["body_mass_g"] # saves the body mass values to the mass variable

Now, let's calculate the **mean** and **median** of the penguin body masses using the built-in `pandas` functions: 

* `Series.mean()` will calculate the mean of a set of values for us
* `Series.median()` will calculate the median of a set of values for us 

<div class="alert alert-block alert-info">
    <p>Remember, we have to replace the <code>Series</code> with the specific column of the dataset we want to evaluate (e.g., <code>penguins["body_mass_g"].mean()</code> will calculate the mean of the <code>body_mass_g</code> column of the <code>DataFrame</code> referenced by the <code>penguins</code> variable, just as <code>mass.mean()</code> will also calculate the mean since <code>mass = penguins["body_mass_g"]</code>).</p>
</div>

In [None]:
# Calculate mean and save to variable `mass_mean`
mass_mean = mass.mean()

# Calculate median and save to variable `mass_median`
mass_median = mass.median()

# Let's print what these values are! 
print(f"The mean body mass is: {mass_mean}")
print(f"The median body mass is: {mass_median}")

Finally, let's plot a histogram of the distribution of body masses using the `sns.histplot()` function found in the `seaborn` package! We'll also plot where the mean and median are on this distribution using the `plt.axvline()` function from `matplotlib.pyplot`. 

<div class="alert alert-block alert-warning">
    <p>You can also use the <code>Series.plot(kind="hist")</code> method from <code>pandas</code> if you'd like, but <code>sns.histplot()</code> is prettier and the TAs are a little biased towards <code>seaborn</code>...</p>
</div>

In [None]:
# Generate a histogram of the masses stored in the mass variable using the histplot() function! 
sns.histplot(mass)

# Draw vertical red line at the mean
plt.axvline(mass_mean, color="red")

# Draw vertical blue line at the median
plt.axvline(mass_median, color="blue")

### What's the General Takeaway? 

Generally, for **symmetric** distributions, the **mean and median are equal**. 

However, when looking at the body mass distribution of our penguins, is it symmetric? Notice that the mean is shifted to the right of the median. Can we say that the body masses of penguins in our sample are normally distributed? 

In general, if the histogram has a tail on one side (or the distribution is **skewed**), then **the mean is pulled away from the median in the direction of the tail**.
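To see this effect in isolation, here is a minimal sketch with two made-up sets of values (they are not penguin measurements): one symmetric, and one with a single large value forming a tail on the right.

```python
import pandas as pd

# Made-up values: symmetric around 3
symmetric = pd.Series([1, 2, 3, 4, 5])

# Same values, except the largest one is replaced by a far-away value (a right tail)
right_skewed = pd.Series([1, 2, 3, 4, 50])

print(f"Symmetric:    mean = {symmetric.mean()}, median = {symmetric.median()}")
print(f"Right-skewed: mean = {right_skewed.mean()}, median = {right_skewed.median()}")
```

In the symmetric case, the mean and median are both 3. In the skewed case, the median stays at 3, while the mean is pulled up to 12 by the single large value.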

### Let's do a quick exercise to practice what we just did in the last section! 

<div class="alert alert-block alert-success">
    <p><b>Exercise:</b> <i>Let's examine the flipper length and see what conclusions we can make about the distribution!</i></p>
    <p>Plot the histogram of the flipper length and draw lines for the mean and median. Is flipper length roughly normally distributed? Justify your answer.</p>
</div>

In [None]:
# !TODO

# 1. Define a new variable called flipper_length with the flipper lengths of the Palmer Penguins dataset 
flipper_length = None 

# 2. Plot a histogram of the flipper lengths using the variable you just created 
#    then also plot the mean and median flipper lengths that we have in our dataset 


---

## The Standard Deviation and Z-Scores

### Spotting the Standard Deviation on a Normal Curve 

To see how the standard deviation is related to the curve, start at the top of the curve and look towards the right. Notice that there is a place where the curve changes from looking like an "upside-down cup" to a "right-way-up cup"; formally, the curve has a **point of inflection**. That point is one standard deviation above the average. It is the point `z=1`, which is "the average plus 1 standard deviation".

Symmetrically on the left-hand side of the mean, the point of inflection is at `z=−1`, that is, "the average minus 1 standard deviation".

The **z-score** is a commonly used way to describe "average ± **Z** standard deviations".

In general, **for bell-shaped distributions, the SD is the distance between the mean and the points of inflection on either side.**
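In other words, the z-score of a value is `(value - mean) / SD`. Here is a minimal sketch of that calculation on a small, made-up array (chosen so that the mean and SD come out as round numbers):

```python
import numpy as np

# Made-up values chosen so the arithmetic is easy to follow
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = values.mean()  # 5.0
sd = values.std()     # 2.0 (numpy's default divides by n, i.e., ddof=0)

# z-score: how many standard deviations each value is from the mean
z_scores = (values - mean) / sd

print(f"mean = {mean}, SD = {sd}")
print(f"z-scores = {z_scores}")
```

Here, the value 9 is two standard deviations above the mean (`z = 2`), and the value 2 is one and a half standard deviations below it (`z = -1.5`).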

### Sampling from the Normal Distribution to Visualize Standard Deviation

To visualize the phenomena that we just talked about, let's draw some samples from a normal distribution and then plot the distribution against its mean and standard deviations! We'll be doing the following steps: 

1. Sampling from a normal distribution using the `numpy` method `np.random.normal()` for 10,000 points
2. Plotting the normal distribution as a kernel density estimation plot using the `seaborn` method `sns.kdeplot()` 
3. Plotting the mean of the sampled distribution using the `matplotlib.pyplot` method `plt.axvline()`
4. Plotting +1 and -1 standard deviations above and below this mean using the `matplotlib.pyplot` method `plt.axvline()` 
5. Labeling our x-axis with the `matplotlib.pyplot` method `plt.xlabel()` to indicate that this axis represents the z-score 

In [None]:
# Draw 10,000 samples from a standard normal distribution (mean 0, standard deviation 1).
normal_data = np.random.normal(size=10000)

In [None]:
# Calculate -1 and +1 standard deviation
minus1sd = normal_data.mean() - normal_data.std()
plus1sd  = normal_data.mean() + normal_data.std()

In [None]:
# Plot the distribution using the sns.kdeplot() method 
sns.kdeplot(normal_data, bw_adjust=2, fill=True)

# Plot the mean as a red line
plt.axvline(normal_data.mean(), color="red")

# Plot ± 1 SD as blue lines
plt.axvline(minus1sd, color="blue")
plt.axvline(plus1sd, color="blue")

# Label the x-axis
plt.xlabel("z-score");

### What's the General Takeaway? 

Whenever you examine a histogram, you should start out by looking at the horizontal axis. On the horizontal axis of a standard normal curve, the values are standard units. For now, we can think of the normal curve as a smoothed outline of a histogram! Knowing this, there are a few properties that apply to this sort of curve; some are apparent by mere observation and others require a considerable amount of mathematics to establish:

* **The total area under the curve is 1.** So you can think of it as a histogram drawn to the density scale.
* **The curve is symmetric about 0.** So if a variable has this distribution, its mean and median are both 0.
* **The points of inflection of the curve are at -1 and +1.**
* **If a variable has this distribution, its standard deviation is 1.** The normal curve is one of the very few distributions that has a standard deviation so clearly identifiable on the histogram.
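These properties can be spot-checked numerically. The sketch below is one possible way to do so, using `NormalDist` from Python's built-in `statistics` module rather than the packages used elsewhere in this notebook; the grid spacing and the helper function `curvature()` are choices made for this example.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)  # the standard normal curve

# 1. Total area under the curve: approximate it by adding up thin rectangles
#    between -8 and 8 (the curve is essentially zero outside that range)
step = 0.001
grid = [-8 + i * step for i in range(16001)]
total_area = sum(std_normal.pdf(x) for x in grid) * step

# 2. Symmetry about 0: the median (50th percentile) is at 0
median = std_normal.inv_cdf(0.5)

# 3. Point of inflection at z = 1: the curvature changes sign there
def curvature(x, h=1e-3):
    """Approximate the second derivative of the curve at x."""
    return (std_normal.pdf(x + h) - 2 * std_normal.pdf(x) + std_normal.pdf(x - h)) / h**2

curvature_inside = curvature(0.9)   # negative: "upside-down cup"
curvature_outside = curvature(1.1)  # positive: "right-way-up cup"

print(f"Approximate total area: {total_area:.6f}")
print(f"Median: {median}")
print(f"Curvature just inside z=1:  {curvature_inside:.4f}")
print(f"Curvature just outside z=1: {curvature_outside:.4f}")
```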

---

## The Practical Use of Z-Scores and P-Values

Knowing all of this information about the standard curve, what does the area under the curve tell us about our data? Well, we can measure proportions of the total amount of our data! 

### Handy-Dandy Cheat Sheet for Z-Scores and P-Values

Here's a cheat sheet of some approximate values that describe normal distributions:

| Range | z-score | Fraction in range (p-value) |
| --- | --- | --- |
| average ± 1 SD | 1   |0.68 |
| average ± 2 SD | 2   |0.95 |
| average ± 3 SD | 3   |0.997|

But how did we get these values? Let's calculate them with code! 

### Validating the Values in the Z-Score/P-Value Cheat Sheet with Code! 

Let's find the area within **1 standard deviation of the mean** under the standard normal curve (the area under the curve between the blue lines). This is also denoted as the area between `z = -1` and `z = 1`. The area is approximately `0.68`. This fraction is the probability or **two-tailed p-value** corresponding to a **`z-score = 1`**.

<img alt="Normal Distribution bell curve with vertical lines at 1 standard deviation below and above the mean" src="https://github.com/biom262/cmm262-2021/raw/main/module-2-statistics/img/day1_norm-dist-again.png">

Now, let's go through the different steps to calculating this value manually. 

1. **First, we'll calculate the lower and upper bounds of our different tails.** The lower bound is the mean minus one standard deviation, whereas the upper bound is the mean plus one standard deviation! 
2. **Next, we'll count the number of samples that are present within our bounds**. 
3. **Finally, we'll calculate the fraction of samples that are within our lower and upper bounds compared to the total number of samples**. 

In [None]:
# Step 1: Calculate the lower and upper bounds of our distribution (we'll be looking at the area within one standard deviation of the mean)
lower_bound = normal_data.mean() - normal_data.std() # mean minus 1 standard deviation
upper_bound = normal_data.mean() + normal_data.std() # mean plus 1 standard deviation

print(f"The calculated lower bound is: {lower_bound}")
print(f"The calculated upper bound is: {upper_bound}")

In [None]:
# Step 2a: Determine the points that are above our lower bound and points that are below our upper bound
points_above_lower_bound = normal_data > lower_bound
points_below_upper_bound = normal_data < upper_bound

print(f"There were {sum(points_above_lower_bound)} data points above our lower bound")
print(f"There were {sum(points_below_upper_bound)} data points below our upper bound")

In [None]:
# Step 2b: Calculate the total number of points that appear within our lower and upper bounds! 
points_within_bounds = points_above_lower_bound & points_below_upper_bound
total_within_bounds  = sum(points_within_bounds) 

print(f"There were {total_within_bounds} data points within both of our lower and upper bounds")

In [None]:
# Step 3: Finally, calculate the fraction of samples that are within our bounds by dividing it by the total number of samples
total_number_of_data_points = len(normal_data)
fraction_samples_within_bounds = total_within_bounds / total_number_of_data_points

print(f"{fraction_samples_within_bounds} of our data points are within our bounds")

From those calculations, we can see that our fraction is approximately `0.68`, corresponding to the P-value from our cheat sheet! 
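The counting steps above can also be collapsed into a single expression. The sketch below is one way to do it, relying on the fact that the mean of an array of `True`/`False` values is the fraction of `True` values. It draws its own samples with a fixed seed (an assumption made here so that the example is reproducible), so the exact fraction will differ slightly from the one computed above.

```python
import numpy as np

# Draw 10,000 samples from a standard normal distribution (seeded for reproducibility)
rng = np.random.default_rng(seed=0)
data = rng.normal(size=10_000)

lower = data.mean() - data.std()  # mean minus 1 standard deviation
upper = data.mean() + data.std()  # mean plus 1 standard deviation

# True where a point is within the bounds; the mean of True/False values is the fraction of True
fraction_within = np.mean((data > lower) & (data < upper))

print(f"Fraction within 1 SD of the mean: {fraction_within}")
```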

However, it's kind of tedious to calculate this manually every time we want to calculate a P-value. **Luckily, there are packages in Python that can perform these calculations for us!**

### Calculating P-Values with the `stats` Submodule from `scipy` 

That was starting to get a little complicated. Luckily, the `scipy` Python package includes methods to calculate P-values easily!

In [None]:
from scipy import stats # start out by importing the package we need 

# Calculate the areas left of z=1 and z=-1 
area_left_z_plus_1 = stats.norm.cdf(1)
area_left_z_minus_1 = stats.norm.cdf(-1)

# Determine the area between z=1 and z=-1
area_left_z_plus_1 - area_left_z_minus_1
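The same subtraction can be repeated for each row of the cheat sheet. Here is a short sketch that loops over `z = 1, 2, 3` (the dictionary name `areas` is just a choice for this example):

```python
from scipy import stats

areas = {}
for z in (1, 2, 3):
    # Area between -z and +z = (area left of +z) - (area left of -z)
    areas[z] = stats.norm.cdf(z) - stats.norm.cdf(-z)
    print(f"Area within ±{z} SD of the mean: {areas[z]:.4f}")
```

Rounded, these come out to about 0.68, 0.95, and 0.997.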

---

## Applying Descriptive Statistics, Z-Scores, and P-Values to the Palmer Penguins Dataset

Now that we have somewhat of a handle on the theory, let's apply the concepts that we've learned to the Palmer Penguins dataset! 

As we saw earlier, none of the variables that we studied were normally distributed. But **for the sake of this exercise, let's assume that `bill_length_mm` is approximately normal**! Say we want to ask: what's the probability that a randomly-selected penguin's bill length is longer than 55 mm?

We can do this by calculating the Z-score for `bill_length_mm=55`! 

In [None]:
# Let's start out by reviewing what the distribution of bill length looks like again 
# using the seaborn function sns.histplot()... 
sns.histplot(penguins["bill_length_mm"])

### Using the `stats.zmap()` Function to Calculate a Z-Score

We can use the function `stats.zmap()` from the `scipy` package to use our own sampling of penguin bill lengths as a reference to calculate a z-score for an arbitrary value. This function performs the following steps:

1. First, the function starts by calculating the mean and standard deviation for the array of data points you provide 
2. Next, for your samples of interest, the function calculates the number of standard deviations from the mean -- a.k.a. the Z-scores!
3. Finally, the function returns these Z-scores back to you! 
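The steps above can be sketched by hand and compared against `stats.zmap()`. The reference values below are made up for illustration and are not the actual penguin measurements. One detail worth knowing: `stats.zmap()` uses `ddof=0` by default (dividing by `n`), whereas the `pandas` method `Series.std()` uses `ddof=1` by default (dividing by `n - 1`), so a z-score computed by hand with `Series.std()` will differ slightly.

```python
import numpy as np
from scipy import stats

# Made-up reference sample of bill lengths (mm)
reference = np.array([39.1, 39.5, 40.3, 36.7, 46.1, 50.0, 48.7, 45.2])
value = 55  # the value we want a z-score for

# z-score from scipy
z_scipy = stats.zmap(value, reference)

# Steps 1 and 2 by hand: mean and SD of the reference, then the distance in SD units
z_manual = (value - reference.mean()) / reference.std(ddof=0)

# The same calculation with the sample SD (ddof=1), as pandas would compute by default
z_sample_sd = (value - reference.mean()) / reference.std(ddof=1)

print(f"stats.zmap():      {z_scipy}")
print(f"By hand (ddof=0):  {z_manual}")
print(f"By hand (ddof=1):  {z_sample_sd}")
```

For a sample as large as the penguins dataset, the difference between the two conventions is small.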

<div class="alert alert-block alert-info">
    <p>If you are ever confused about what a function does, you can use the following notation to bring up the documentation in a Jupyter Notebook: <code>?FUNCTION</code>.<br>(e.g., we can type <code>?stats.zmap</code>, without parentheses, into a code cell and then run it to view the documentation for this function!)</p>
</div>

In [None]:
# Let's view the documentation for the stats.zmap() function!
?stats.zmap

### Using the `stats.zmap()` and `stats.norm.cdf()` Functions to Calculate the Probability that a Randomly-Selected Penguin's Bill Length is Longer than 55 mm

Now that we've figured out how to use the `stats.zmap()` function, let's use it to solve our question about the bill lengths of penguins! 

In [None]:
# The first parameter is for samples you want to calculate the scores for
# The second parameter is the reference input to calculate sample z-scores
my_zscore = stats.zmap(55, penguins["bill_length_mm"])  

print(f"Our Z-score is: {my_zscore}")

Finally, we can convert this Z-score to a **one-tailed P-value** with the `stats.norm.cdf()` function! 

<div class="alert alert-block alert-info">
    <p>Remember, we are asking for the probability that a randomly selected penguin's bill length is <strong>longer</strong> than 55 mm. The <code>stats.norm.cdf()</code> function always gives the area under the curve to the <code>left</code> of the score, so we subtract it from 1 to get the area to the right.</p>
</div>

In [None]:
p_value = 1 - stats.norm.cdf(my_zscore)
print(f"Our calculated P-value is: {p_value}")
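As an alternative to writing `1 - stats.norm.cdf(...)`, `scipy` also provides the survival function `stats.norm.sf()`, which returns the area to the right of a score directly. The sketch below compares the two on an arbitrary example z-score of 2 (not the z-score computed above):

```python
from scipy import stats

example_z = 2.0  # an arbitrary z-score chosen for illustration

upper_tail_via_cdf = 1 - stats.norm.cdf(example_z)  # 1 minus the area to the left
upper_tail_via_sf = stats.norm.sf(example_z)        # area to the right, directly

print(f"1 - cdf: {upper_tail_via_cdf}")
print(f"sf:      {upper_tail_via_sf}")
```

Both approaches give about 0.023 here. For very large z-scores, `sf()` tends to be the more numerically accurate of the two.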