# Lab 02: Pandas

In this lab, you will review how to create DataFrames and extract information from them. The first part of the lab will be a review of commands we have learned from lecture and some questions to help you solidify your ```pandas``` knowledge. 

In the second part of the lab, you will work with real datasets to answer some questions using your Data 8 knowledge, but with ```pandas```. The purpose of this assignment is for you to combine Python, math, and the ideas in Data 8 to draw some interesting conclusions. The methods and results will help build the foundation of Data 100.

## Score Breakdown
Question | Points| Question | Points
--- | --- | --- | ---
1a | 2   | 3 |  7
1b | 2   | 4a | 2
1c | 2   | 4b | 2
1d | 2   | 4c | 2
1d |  2  | 4d | 7
2a | 2   | 4e | 2
2b | 2   | 4f | 2
2c |  2   | 4g | 2
2d | 2
2e | 2
Total | | | 50

## Collaboration Policy

Data science is a collaborative activity. While you may talk with others about the labs, we ask that you **write your solutions individually**. If you do discuss the assignments with others please **include their names** below. (It's a good way to learn your classmates' names too!)

**Collaborators**: *list collaborators here*

---
[Pandas](https://pandas.pydata.org/) is one of the most widely used Python libraries in data science. In this lab, you will review commonly used data-wrangling operations/tools in `pandas`. We aim to give you familiarity with:

* Creating `DataFrames`,
* Slicing `DataFrames` (i.e., selecting rows and columns)
* Filtering data (using boolean arrays)

In this lab, you are going to use several `pandas` methods. As a reminder from lecture, you may press `shift+tab` on method parameters to see the documentation for that method. For example, if you were using the `drop` method in `pandas`, you could press `shift+tab` to see what `drop` is expecting.

`pandas` is very similar to the `datascience` library that you saw in Data 8. This [conversion notebook](https://data100.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FDS-100%2Fsu23-materials&branch=main&urlpath=lab%2Ftree%2Fsu23-materials%2Flec%2Flec02%2Fdata8_translation_examples.ipynb) may serve as a useful guide!

This lab expects that you have watched the `Pandas I` and `II` lectures. If you have not, this lab will probably take a very long time.

**Note**: The `pandas` interface is notoriously confusing for beginners, and the documentation is not consistently great. Throughout the semester, you will have to search through `pandas` documentation and experiment, but remember it is part of the learning experience and will help shape you as a data scientist!

**This assignment seems long, but rest assured that a large part of it is a tutorial (i.e., we will guide you through many aspects of using `pandas` in the most efficient way possible!).**

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px
%matplotlib inline

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

# PART 1: ```PANDAS``` REVIEW

## **REVIEW:** Creating `DataFrames` & Basic Manipulations

Recall that a [DataFrame](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#dataframe) is a table in which each column has a specific data type; there is an index over the columns (typically string labels) and an index over the rows (typically ordinal numbers).

Usually, you'll create `DataFrames` by using a function like `pd.read_csv`. However, in this section, we'll discuss how to create them from scratch.

The [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) for the `pandas` `DataFrame` class describes several ways to construct a `DataFrame`.

**Syntax 1:** You can create a `DataFrame` by specifying the columns and values using a dictionary, as shown below. 

The keys of the dictionary are the column names, and the values of the dictionary are lists containing the row entries.

In [None]:
fruit_info = pd.DataFrame(
    data = {'fruit': ['apple', 'orange', 'banana', 'raspberry'],
          'color': ['red', 'orange', 'yellow', 'pink'],
          'price': [1.0, 0.75, 0.35, 0.05]
          })
fruit_info

**Syntax 2:** You can also define a `DataFrame` by specifying the rows as shown below. 

Each row corresponds to a distinct tuple, and the columns are specified separately.

In [None]:
fruit_info2 = pd.DataFrame(
    [("red", "apple", 1.0), ("orange", "orange", 0.75), ("yellow", "banana", 0.35),
     ("pink", "raspberry", 0.05)], 
    columns = ["color", "fruit", "price"])
fruit_info2
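
The two syntaxes describe the same table; only the column order differs. As a quick check, the sketch below rebuilds both tables under the illustrative names `by_column` and `by_row` (these names are not used elsewhere in the lab) and compares them after reordering the columns.

```python
import pandas as pd

# Syntax 1: dictionary of column name -> list of values
by_column = pd.DataFrame({
    "fruit": ["apple", "orange", "banana", "raspberry"],
    "color": ["red", "orange", "yellow", "pink"],
    "price": [1.0, 0.75, 0.35, 0.05],
})

# Syntax 2: list of row tuples, with the column names given separately
by_row = pd.DataFrame(
    [("red", "apple", 1.0), ("orange", "orange", 0.75),
     ("yellow", "banana", 0.35), ("pink", "raspberry", 0.05)],
    columns=["color", "fruit", "price"],
)

# Put by_row's columns in the same order as by_column, then compare contents.
same = by_row[by_column.columns].equals(by_column)
print(same)
```

`DataFrame.equals` compares values, labels, and data types, so it only reports a match once the columns are in the same order.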

You can obtain the dimensions of a `DataFrame` by using the shape attribute `DataFrame.shape`.

In [None]:
fruit_info.shape

You can also convert the entire `DataFrame` into a two-dimensional `NumPy` array. Remember that a `NumPy` array holds homogeneous data (every entry shares one data type), whereas a `DataFrame` can contain heterogeneous data (each column can have its own type).

In [None]:
numbers = pd.DataFrame({"A":[1, 2, 3], "B":[0, 1, 1]})
numpy_numbers = numbers.to_numpy()

print(type(numpy_numbers))
print(numpy_numbers)
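
To see what "homogeneous" means in practice, the sketch below converts one all-integer table and one mixed table. The names `mixed` and `numeric` are illustrative only. When the columns have different types, `to_numpy` has to fall back to a single general-purpose type for the whole array.

```python
import pandas as pd

numeric = pd.DataFrame({"A": [1, 2, 3], "B": [0, 1, 1]})
mixed = pd.DataFrame({"fruit": ["apple", "orange"], "price": [1.0, 0.75]})

numeric_array = numeric.to_numpy()
mixed_array = mixed.to_numpy()

# Each DataFrame column keeps its own type...
print(mixed.dtypes)

# ...but each NumPy array has exactly one dtype for all of its entries.
print(numeric_array.dtype)  # an integer dtype
print(mixed_array.dtype)    # object, general enough to hold strings and floats
```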

The `values` attribute returns the contents of the `DataFrame` as a two-dimensional `NumPy` array, which displays much like a list of lists.

In [None]:
fruit_info.values

There are other constructors but we will not discuss them here.

## **REVIEW:** Selecting Rows and Columns in `pandas`

As you've seen in lecture, there are two verbose operators in `pandas` for selecting rows and columns: `loc` and `iloc`. Let's review them briefly.

**Approach 1:** `loc`

The first of the two verbose operators is `loc`, which takes two arguments. The first is one or more **row labels**, the second is one or more **column labels** - both of which are displayed in bold to the left of each of the rows and above each of the columns, respectively. These are not the same as positional indices, which are used for indexing Python lists or `NumPy` arrays!

The desired rows or columns can be provided individually, in slice notation, or as a list. Some examples are given below.

Note that **slicing in `loc` is inclusive** on the provided labels.

In [None]:
# Get rows 0 through 2 (inclusive) with labels 'fruit' through 'price' (which would include the color column that is in between both labels)
fruit_info.loc[0:2, 'fruit':'price']

In [None]:
# Get rows 0 through 2 (inclusive) and columns 'fruit' and 'price'. 
# Note the difference in notation and result from the previous example.
fruit_info.loc[0:2, ['fruit', 'price']]

In [None]:
# Get rows 0 and 2 and columns fruit and price. 
fruit_info.loc[[0, 2], ['fruit', 'price']]

In [None]:
# Get rows 0 and 2 and column fruit
fruit_info.loc[[0, 2], ['fruit']]

Note that if we request a single column but don't enclose it in a list, the return type of the `loc` operator is a `Series` rather than a `DataFrame`. 

In [None]:
# Get rows 0 and 2 and column fruit, returning the result as a Series
fruit_info.loc[[0, 2], 'fruit']
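
If you are unsure which type a selection returns, checking `type` and `shape` is a quick way to find out. A minimal sketch, using a small stand-in table named `fruit` (an illustrative name, not one from the lab):

```python
import pandas as pd

fruit = pd.DataFrame({
    "fruit": ["apple", "orange", "banana", "raspberry"],
    "price": [1.0, 0.75, 0.35, 0.05],
})

as_series = fruit.loc[[0, 2], "fruit"]    # bare column label
as_frame = fruit.loc[[0, 2], ["fruit"]]   # column label inside a list

print(type(as_series).__name__, as_series.shape)  # one-dimensional
print(type(as_frame).__name__, as_frame.shape)    # two-dimensional, one column
```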

If we provide only one argument to `loc`, it uses the provided argument to select rows, and returns all columns.

In [None]:
fruit_info.loc[0:1]

Note that if you try to access columns without providing rows, `loc` will raise an error. Uncomment the following lines of code one at a time to try them out and become familiar with the error messages. Then, comment them out again.

In [None]:
# Uncomment, this code will crash
#fruit_info.loc[["fruit", "price"]]

# Uncomment, this code works fine: 
#fruit_info.loc[:, ["fruit", "price"]]
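
If you would rather see the error without interrupting the notebook, you can catch it. The sketch below assumes a small stand-in table named `fruit` (illustrative only). With a single argument, `loc` looks for the strings among the row labels, finds none, and raises a `KeyError`.

```python
import pandas as pd

fruit = pd.DataFrame({
    "fruit": ["apple", "orange", "banana", "raspberry"],
    "color": ["red", "orange", "yellow", "pink"],
    "price": [1.0, 0.75, 0.35, 0.05],
})

try:
    # "fruit" and "price" are column labels, not row labels.
    fruit.loc[["fruit", "price"]]
    raised_key_error = False
except KeyError as err:
    raised_key_error = True
    print("KeyError:", err)

# Passing ":" for the rows tells loc that the list refers to columns.
columns_only = fruit.loc[:, ["fruit", "price"]]
print(columns_only)
```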

<br/>

**Approach 2:** `iloc`

`iloc` is very similar to `loc` except that its arguments are **row numbers** and **column numbers**, rather than row and column labels. A useful mnemonic is that the `i` stands for "integer". This is quite similar to indexing into a Python `list` or `NumPy` array.

In addition, **slicing for `iloc` is exclusive** on the provided integer indices. Some examples are given below:

In [None]:
# Get rows 0 through 3 (exclusive) and columns 0 through 3 (exclusive)
fruit_info.iloc[0:3, 0:3]

In [None]:
# Get rows 0 through 3 (exclusive) and columns 0 and 2.
fruit_info.iloc[0:3, [0, 2]]

In [None]:
# Get rows 0 and 2 and columns 0 and 2.
fruit_info.iloc[[0, 2], [0, 2]]

In [None]:
# Get rows 0 and 2 and column fruit
fruit_info.iloc[[0, 2], [0]]

In [None]:
# Get rows 0 and 2 and column fruit, returning the result as a Series
fruit_info.iloc[[0, 2], 0]

Note that in these `loc` and `iloc` examples above, the row **label** and row **number** were always the same.

Let's see an example where they are different. If we sort our fruits by price, we get:

In [None]:
fruit_info_sorted = fruit_info.sort_values("price")
fruit_info_sorted

After sorting, note how row number 0 now has index label 3, row number 1 now has index label 2, etc. These indices are the arbitrary numerical indices generated when we created the `DataFrame`. For example, `banana` was originally in row 2, and so it has row label 2. Note the distinction between the index _label_, and the actual index _position_.

If we request the rows in positions 0 and 2 using `iloc`, we're indexing using the row NUMBERS, not labels. 

In [None]:
fruit_info_sorted.iloc[[0, 2], 0]

Lastly, similar to `loc`, the second argument to `iloc` is optional. That is, if you provide only one argument to `iloc`, it treats the argument you provide as a set of desired row numbers, not column numbers.

In [None]:
fruit_info_sorted.iloc[[0, 2]]
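
The sketch below puts the label and position lookups side by side on a sorted copy of the fruit data (rebuilt here as `fruit` so the example runs on its own). It also uses `reset_index(drop=True)`, which is not covered above and is shown only as one possible way to make labels match positions again.

```python
import pandas as pd

fruit = pd.DataFrame({
    "fruit": ["apple", "orange", "banana", "raspberry"],
    "price": [1.0, 0.75, 0.35, 0.05],
})
fruit_sorted = fruit.sort_values("price")

# loc follows the label: label 0 still belongs to the apple row.
by_label = fruit_sorted.loc[0, "fruit"]

# iloc follows the position: the first row is now the cheapest fruit.
by_position = fruit_sorted.iloc[0, 0]

print(by_label, by_position)

# Discard the old labels and number the rows 0, 1, 2, ... in their new order.
renumbered = fruit_sorted.reset_index(drop=True)
print(renumbered.loc[0, "fruit"])
```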

**Approach 3:** `[]` Notation for Accessing Rows and Columns

`pandas` also supports the `[]` operator. It's similar to `loc` in that it lets you access rows and columns by their name.

However, unlike `loc`, which takes row names and also optionally column names, `[]` is more flexible. If you provide it with a slice of rows, it'll give you rows, and if you provide it with only column names, it'll give you columns (whereas `loc` will raise an error). One difference from `loc`: an integer slice such as `0:2` inside `[]` is read as row positions and excludes the endpoint, as in `iloc`.

Some examples:

In [None]:
fruit_info[0:2]

In [None]:
# Here we're providing a list of column names as a single argument to []
fruit_info[["fruit", "color", "price"]]

Note that slicing notation is not supported for columns if you use `[]` notation. Use `loc` instead.

In [None]:
# Uncomment and this code crashes
#fruit_info["fruit":"price"]

# Uncomment and this works fine
#fruit_info.loc[:, "fruit":"price"]
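
Row slices are worth a quick check as well, because `[]` and `loc` treat the endpoint differently. A minimal sketch on a stand-in table named `fruit` (illustrative only):

```python
import pandas as pd

fruit = pd.DataFrame({
    "fruit": ["apple", "orange", "banana", "raspberry"],
    "price": [1.0, 0.75, 0.35, 0.05],
})

# [] with an integer slice counts positions and excludes the endpoint.
first_two = fruit[0:2]

# loc with the same slice uses labels and includes the endpoint.
through_label_two = fruit.loc[0:2]

print(len(first_two), len(through_label_two))
```

The two results differ by one row even though the slice looks identical, which is one reason to prefer the more explicit `loc` or `iloc` when the endpoint matters.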

`[]` and `loc` are quite similar. For example, the following two pieces of code are functionally equivalent for selecting the fruit and price columns.

1. `fruit_info[["fruit", "price"]]` 
2. `fruit_info.loc[:, ["fruit", "price"]]`.

Because it yields more concise code, you'll find that our code and your code both tend to feature `[]`. However, there are some subtle pitfalls of using `[]`. If you're ever having performance issues, weird behavior, or you see a `SettingWithCopyWarning` in `pandas`, switch from `[]` to `loc`, and this may help.
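
As one illustration of the pitfall, consider changing values in only some rows. Chaining two `[]` operations (for example `fruit[mask]["price"] = ...`) may end up modifying a temporary copy and, depending on your `pandas` version, typically produces a warning. The sketch below, which uses a stand-in table named `fruit` and an arbitrary price floor of 0.5, does the selection and assignment in a single `loc` call instead.

```python
import pandas as pd

fruit = pd.DataFrame({
    "fruit": ["apple", "orange", "banana", "raspberry"],
    "price": [1.0, 0.75, 0.35, 0.05],
})

# Boolean mask marking the rows to change
cheap = fruit["price"] < 0.5

# One loc call: rows chosen by the mask, column chosen by label
fruit.loc[cheap, "price"] = 0.5

print(fruit)
```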

To avoid getting too bogged down in indexing syntax, we'll avoid a more thorough discussion of `[]` and `loc`. We may return to this at a later point in the course.

For more on `[]` vs. `loc`, you may optionally try reading:
1. https://stackoverflow.com/questions/48409128/what-is-the-difference-between-using-loc-and-using-just-square-brackets-to-filte
2. https://stackoverflow.com/questions/38886080/python-pandas-series-why-use-loc/65875826#65875826
3. https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas/53954986#53954986

Now that we've reviewed basic indexing, let's discuss how we can modify `DataFrames`. We'll do this via a series of exercises. 

<br/><br/>

---

## Question 1a

For a `DataFrame` `d`, you can add a column with `d['new column name'] = ...` and assign a `list` or `array` of values to the column. Add a column of integers containing 1, 2, 3, and 4 called `rank1` to the `fruit_info` table, which expresses **your personal preference** about the taste ordering for each fruit (1 is tastiest; 4 is least tasty). There is no right order; the ranking is completely your choice.


In [None]:
...
fruit_info

<br/><br/>

---

## Question 1b

You can also add a column to `d` with `d.loc[:, 'new column name'] = ...`. As above, the first parameter is for the rows, and the second is for columns. The `:` means changing all rows, and the `'new column name'` indicates the name of the column you are modifying (or, in this case, adding). 

Add a column called `rank2` to the `fruit_info` table, which contains the same values in the same order as the `rank1` column.


In [None]:
...
fruit_info

<br/><br/>

---

## Question 1c

Use the `.drop()` method to [drop](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) both the `rank1` and `rank2` columns you created. Make sure to use the `axis` parameter correctly. Note that `drop` does not change a table but instead returns a new table with fewer columns or rows unless you set the optional `inplace` argument.

**Hint:** Look through the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) to see how you can drop multiple columns of a `DataFrame` at once using a list of column names.


In [None]:
fruit_info_original = ...
fruit_info_original

<br/><br/>

---

## Question 1d

Use the `.rename()` method to rename the columns of `fruit_info_original` so they begin with capital letters. Set this new `DataFrame` to `fruit_info_caps`. For an example of how to use rename, see this linked [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html).

In [None]:
...
fruit_info_caps

<br/><br/>


## Babynames Dataset
For the next few questions of this lab, let's move on to a real-world dataset. We'll be using the babynames dataset from Lecture 3. The babynames dataset contains a record of the given names of babies born in the United States each year.

Let's run the following cells to build the `DataFrame` `babynames`. Note that we only include data from California due to memory constraints (the full dataset has over 6 million rows!). There should be a total of 407428 records.


In [None]:
file_path = 'data/namesbystate_ca.txt.gz'
column_labels = ['State', 'Sex', 'Year', 'Name', 'Count']

babynames = pd.read_csv(file_path, names=column_labels)

In [None]:
len(babynames)

In [None]:
babynames.head()

## Selection Examples on Baby Names

As with our synthetic fruit dataset, we can use `loc` and `iloc` to select rows and columns of interest from our dataset.

In [None]:
babynames.loc[2:5, 'Name']

Notice the difference between the following cell and the previous one; just passing in `'Name'` returns a `Series` while `['Name']` returns a `DataFrame`.

In [None]:
babynames.loc[2:5, ['Name']]

The code below collects the rows in positions 1 through 3, and the column in position 3 ("Name").

In [None]:
babynames.iloc[1:4, [3]]

<br/><br/>

---

## Question 2a

Use `.loc` to select `Name` and `Year` **in that order** from the `babynames` table.


In [None]:
name_and_year = ...
name_and_year[:5]

<br/><br/>

---

## Question 2b
Now repeat the same selection using the plain `[]` notation.

In [None]:
name_and_year = ...
name_and_year[:5]

## **REVIEW**: Filtering with boolean arrays

Filtering is the process of removing unwanted entries. In your quest for cleaner data, you will undoubtedly filter your data at some point: whether it be for clearing up cases with missing values, for culling out fishy outliers, or for analyzing subgroups of your dataset. Example usage looks like `df[df['column name'] < 5]`.

For your reference, some commonly used comparison operators are given below.

Symbol | Usage      | Meaning 
------ | ---------- | -------------------------------------
==   | a == b   | Does a equal b?
<=   | a <= b   | Is a less than or equal to b?
&gt;=   | a >= b   | Is a greater than or equal to b?
<    | a < b    | Is a less than b?
&#62;    | a &#62; b    | Is a greater than b?
~    | ~p       | Returns negation of p
&#124; | p &#124; q | p OR q
&    | p & q    | p AND q
^  | p ^ q | p XOR q (exclusive or)

As an example, in the following, we construct a `DataFrame` containing only rows where the name is Arman.

In [None]:
arman_babynames = babynames[babynames['Name'] == 'Arman']
arman_babynames.head(5)
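
It can help to look at the boolean array itself before using it as a filter. The sketch below uses a tiny made-up table named `toy` (the rows are invented for illustration and are not taken from `babynames`) to show a mask, its negation, and two masks combined.

```python
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame({
    "Name": ["Ava", "Ben", "Ava", "Cy"],
    "Year": [1999, 2000, 2000, 2000],
    "Count": [5, 12, 30, 7],
})

# A comparison on a column produces one True/False value per row.
is_ava = toy["Name"] == "Ava"
in_2000 = toy["Year"] == 2000
print(is_ava)

# & keeps rows where both masks are True; ~ flips a mask.
ava_2000 = toy[is_ava & in_2000]
not_ava = toy[~is_ava]

# When the comparisons are written inline, each one needs its own parentheses,
# because & and | bind more tightly than == and >.
either = toy[(toy["Name"] == "Cy") | (toy["Count"] > 20)]

print(ava_2000)
print(not_ava)
print(either)
```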

<br/><br/>

---
## Question 2c
Using a boolean array, select the names in Year 2000 (from `babynames`) that have counts larger than 3000. Keep all columns from the original `babynames` `DataFrame`.

_Note_: Compound expressions have to be grouped with parentheses. That is, any time you use `p & q` to filter the `DataFrame`, make sure to use `df[(df[p]) & (df[q])]` or `df.loc[(df[p]) & (df[q])]`. 

You may use either `[]` or `loc`. Both will achieve the same result. For more on `[]` vs. `loc`, see the Stack Overflow links from the intro portion of this lab.

In [None]:
result = ...
result

## **REVIEW:** `str`

`pandas` provides special purpose functions for working with specific common data types such as strings and dates. For example, the code below provides the length of every baby's name from our `babynames` dataset. 

In [None]:
babynames['Name'].str.len()
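
The `.str` accessor exposes many of the familiar Python string methods, applied to every entry at once. A short sketch on a made-up `Series` called `names` (the values are chosen for illustration only):

```python
import pandas as pd

names = pd.Series(["Mary", "Helen", "Arman", "Olivia"])

lengths = names.str.len()              # number of characters in each name
shouting = names.str.upper()           # each name in upper case
ends_with_a = names.str.endswith("a")  # True/False for each name

print(lengths.tolist())
print(shouting.tolist())

# The boolean result can be used as a filter, like any other boolean array.
print(names[ends_with_a].tolist())
```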

<br/><br/>

---

## Question 2d

Add a column to `babynames` named `First Letter` that contains the first letter of each baby's name.

Hint: you can index using `.str` similarly to how you'd normally index Python strings. Or, you can use `.str.get` [(documentation here)](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.get.html).

In [None]:
...
babynames

<br/><br/>

---

## Question 2e

In 2022, how many babies had names that started with the letter "A"? 

In [None]:
babynames_2022 = ...
just_A_names_2022 = ...
number_A_names = ...
number_A_names

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

# PART 2

In this second part of the lab, you will work with real datasets to investigate health questions. We will work on data analysis and plotting using widely used Python libraries.

### Initialize your environment

In [None]:
import numpy as np
np.random.seed(42)
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

### Preliminary: Jupyter Shortcuts ###

Here are some useful Jupyter notebook keyboard shortcuts.  To learn more keyboard shortcuts, go to **Help -> Keyboard Shortcuts** in the menu above. 

Here are a few we like:
1. `ctrl`+`return` : *run the current cell*
1. `shift`+`return`: *run the current cell and move to the next*
1. `esc` : *command mode* (may need to press before using any of the commands below)
1. `a` : *create a cell above*
1. `b` : *create a cell below*
1. `dd` : *delete a cell*
1. `m` : *convert a cell to markdown*
1. `y` : *convert a cell to code*

### Preliminary: `NumPy` ###

You should be able to understand the code in the following cells. If not, review the following:

* [Data 8 Textbook Chapter on NumPy](https://www.inferentialthinking.com/chapters/05/1/Arrays)
* [DS100 NumPy Review](http://ds100.org/fa17/assets/notebooks/numpy/Numpy_Review.html)
* [Condensed NumPy Review](http://cs231n.github.io/python-numpy-tutorial/#numpy)
* [The Official NumPy Tutorial](https://numpy.org/doc/stable/user/quickstart.html)

**Jupyter pro-tip**: Pull up the documentation for any function in Jupyter by running a cell with
the function name and a `?` at the end:

In [None]:
np.arange?

**Another Jupyter pro-tip**: Pull up the documentation for any function in Jupyter by typing the function
name, then `<Shift><Tab>` on your keyboard. This is super convenient when you forget the order
of the arguments to a function. You can press `<Tab>` multiple times to expand the docs and reveal additional information.

Try it on the function below:

In [None]:
np.linspace

### Preliminary: LaTeX ###
You should use LaTeX to format math in your answers. If you aren't familiar with LaTeX, don't worry. It's not hard to use in a Jupyter notebook. Just place your math in between dollar signs within Markdown cells:

`$ f(x) = 2x $` becomes $ f(x) = 2x $.

If you have a longer equation, use double dollar signs to place it on a line by itself:

`$$ \sum_{i=0}^n i^2 $$` becomes:

$$ \sum_{i=0}^n i^2$$


You can align multiple lines in an `align` block, using `&` as the alignment anchor and `\\` for a newline, as follows:

```
\begin{align}
f(x) &= (x - 1)^2 \\
&= x^2 - 2x + 1
\end{align}
```
becomes

\begin{align}
f(x) &= (x - 1)^2 \\
&= x^2 - 2x + 1
\end{align}

* [This PDF](latex_tips.pdf) has some handy LaTeX tips.
* [For more about basic LaTeX formatting, you can read this article.](https://www.sharelatex.com/learn/Mathematical_expressions)


### Preliminary: Sums ###

Here's a recap of some basic algebra written in sigma notation. The facts are all just applications of the ordinary associative and distributive properties of addition and multiplication, written compactly and without the possibly ambiguous "$\dots$". But if you are ever unsure of whether you're working correctly with a sum, you can always try writing $\sum_{i=1}^n a_i$ as $a_1 + a_2 + \cdots + a_n$ and see if that helps.

You can use any reasonable notation for the index over which you are summing, just as in Python you can use any reasonable name in `for name in list`. Thus $\sum_{i=1}^n a_i = \sum_{k=1}^n a_k$.

- $\sum_{i=1}^n (a_i + b_i) = \sum_{i=1}^n a_i + \sum_{i=1}^n b_i$
- $\sum_{i=1}^n d = nd$
- $\sum_{i=1}^n (ca_i + d) = c\sum_{i=1}^n a_i + nd$

These properties may be useful in the future when we cover Least Squares Predictors. To see the LaTeX we used, double-click this cell. Evaluate the cell to exit.
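
If you would like to convince yourself numerically, the three identities can be checked with `NumPy`. This is only a sanity check on arbitrary random numbers, not a proof; the values of `n`, `c`, and `d` below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
a = rng.normal(size=n)
b = rng.normal(size=n)
c, d = 3.0, -2.0

# sum of (a_i + b_i) equals sum of a_i plus sum of b_i
identity_1 = np.isclose(np.sum(a + b), np.sum(a) + np.sum(b))

# summing the constant d a total of n times gives n * d
identity_2 = np.isclose(np.sum(np.full(n, d)), n * d)

# sum of (c * a_i + d) equals c * sum of a_i plus n * d
identity_3 = np.isclose(np.sum(c * a + d), c * np.sum(a) + n * d)

print(identity_1, identity_2, identity_3)
```

`np.isclose` is used instead of `==` because floating-point sums can differ by tiny rounding errors.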

<br/><br/>
<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

## Question 1: Distributions ##

Visualizing distributions, both categorical and numerical, helps us understand variability. In Data 8, you visualized numerical distributions by drawing histograms ([Chapter 7.2 link](https://inferentialthinking.com/chapters/07/2/Visualizing_Numerical_Distributions.html#histogram)), which look like bar charts but represent proportions through the *areas* of the bars instead of the heights or lengths.


---

### Part 0: Matplotlib Tutorial

We will not be using Data 8's `datascience` library in this course. Instead, we will learn industry- and academia-standard libraries for exploring and visualizing data, including `matplotlib` ([official website](https://matplotlib.org/)).
In this exercise, you will use the `hist` function in `matplotlib` instead of the corresponding `Table` method to draw histograms. In a previous cell, we imported the matplotlib library as `plt`, which allows us to call `plt.hist()`.

To start off, suppose we want to plot the probability distribution of the number of spots on a single roll of a die. That should be a flat histogram since the chance of each of the values 1 through 6 is $\frac{1}{6}$. Here is a first attempt at drawing the histogram.

In [None]:
faces = range(1, 7)
plt.hist(faces)

This default plot is not helpful. We have to choose some arguments to get a visualization that we can interpret. 

Note that the second printed line shows the left ends of the default bins, as well as the right end of the last bin. The first line shows the counts in the bins. If you don't want the printed lines, you can add a semi-colon `;` at the end of the call to `plt.hist`, but we'll keep the lines for now.
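
To inspect the default bins without drawing anything, you can compute the same two arrays with `np.histogram`, which `plt.hist` relies on for its binning. A minimal sketch:

```python
import numpy as np

faces = list(range(1, 7))

# With no bins argument, the range from min to max is split into 10 equal bins.
counts, edges = np.histogram(faces)

print(counts)  # number of values that fall in each bin
print(edges)   # 11 edges: the left end of each bin plus the right end of the last
```

Because the six faces are spread over ten bins of width 0.5, several bins are empty, which is why the default plot shows gaps between the bars.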

Let's redraw the histogram with bins of unit length centered at the possible values. By the end of the tutorial, you'll see a reason for centering. Notice that the argument for specifying bins is the same as the one for the `Table` method `hist` from the `datascience` library in Data 8 ([link](https://www.data8.org/datascience/reference-nb/datascience-reference.html#tbl.hist())).

In [None]:
unit_bins = np.arange(0.5, 6.6)
plt.hist(faces, bins=unit_bins)

We need to see the edges of the bars! Let's specify the edge color `ec` to be `white`. [Here](https://matplotlib.org/3.5.3/gallery/color/named_colors.html) are all the colors you could use, but do try to drag yourself away from the poetic names.

In [None]:
plt.hist(faces, bins=unit_bins, ec='white')

That's much better, but look at the vertical axis. It is not drawn to the density scale defined in Data 8 ([Chapter 7.2 link](https://inferentialthinking.com/chapters/07/2/Visualizing_Numerical_Distributions.html#the-vertical-axis-density-scale)). We want a histogram of a probability distribution, so the total area should be 1. We just have to ask for that by setting `density` to `True`.

In [None]:
plt.hist(faces, bins=unit_bins, ec='white', density=True)

That's the probability histogram of the number of spots on one roll of a die. The proportion is $\frac{1}{6}$ in each of the bins.
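
The claim that the total area is 1 can be checked directly: on the density scale, the area of each bar is its height times its width. The sketch below uses `np.histogram` with the same bins so that no plot is needed.

```python
import numpy as np

faces = list(range(1, 7))
unit_bins = np.arange(0.5, 6.6)  # edges 0.5, 1.5, ..., 6.5

# density=True rescales the heights so that the bar areas sum to 1.
heights, edges = np.histogram(faces, bins=unit_bins, density=True)

widths = np.diff(edges)               # every bin has width 1 here
total_area = np.sum(heights * widths)

print(heights)
print(total_area)
```

With unit-width bins, each bar's height equals its area, so every height here is the proportion 1/6. With wider or narrower bins, height and proportion would no longer coincide.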

Finally, we can set the opacity, or transparency, of the bars with the `alpha` parameter, which is a value from 0 to 1. For 70% opacity:

In [None]:
plt.hist(faces, bins=unit_bins, ec='white', density=True, alpha=0.7)

**Note/Reminder**: The above cells printed the counts/proportions and bin boundaries with the visualization. This was intentional on our part to show you how `plt.hist()` returned different values per plot. You may use a semicolon `;` on the last line to suppress additional display as needed.

<br/><br/>

---

### Question 3

Define a function `plot_distribution` that takes an array of numbers (integers or decimals) and draws the histogram of the distribution using unit bins centered at the integers and white edges for the bars.

The histogram should be drawn to the density scale, and the opacity should be 75%. The left-most bar should be centered at the integer closest to the smallest number in the array, and the right-most bar should be centered around the integer closest to the largest number in the array.

The display does not need to include the printed proportions and bins. No titles or labels are required for this question. For grading purposes, assign your plot to `histplot`.

If you have trouble defining the function, go back and carefully read all the lines of code that resulted in the probability histogram of the number of spots on one roll of a die. Pay special attention to the bins. Feel free to create a cell to test your function on generic arrays to check for correctness!

**Hint**: 
* See `plt.hist()` [documentation](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html).
* We want to: (1) center each bin at integer values and (2) make sure all the values in the array are captured by the bins.
    * For example, let’s say we have the following input array: `[0.3, 0.7, 1.1, 1.4, 1.9]`.
    * The smallest value is `0.3`; the left endpoint of the leftmost bin (the first bin) should be `-0.5` and the right endpoint of this bin should be `0.5` so that this bin is centered at the integer `0`.
    * This first bin above captures `0.3`. The second bin will be centered at `1` (between `0.5` and `1.5`) and captures `0.7`, `1.1`, and `1.4`.
    * We can continue in this manner until all values are captured by our bins.
* What is the left endpoint of the left-most bar? What is the right endpoint of the right-most bar? You may find `min()`, `max()`, and `round()` helpful.
* Please keep in mind your function should be implemented so that it works for _any_ generic array of numbers (integers or decimals), not just the `faces` array in the cell below.
* If you implement the function correctly, you should get a plot like this:

<img src="images/q1a.png" alt="question 1a plot">

In [None]:
def plot_distribution(arr):
    # Define bins
    unit_bins = ...
    # Plot the data arr using unit_bins, assign the plot to histplot
    histplot = ...
    return histplot
faces = range(1, 10)
histplot = plot_distribution(faces)

<br/>

---

### Tutorial: Serum Cholesterol

Recall from Data 8 that you can perform [hypothesis testing using the permutation test](https://inferentialthinking.com/chapters/12/1/AB_Testing.html) (Chapter 12.1). **Before continuing, we HIGHLY ENCOURAGE you to read the above linked Data 8 chapter for a review of how hypothesis testing works.**

Scientists across several hospitals have gathered data about heart disease and non-disease patients, and they are organized into the following dataset called `hearts_df` (from the `csv` file `hearts.csv`). In this question, we study one recorded feature in `hearts_df`: serum cholesterol. Serum cholesterol refers to the total amount of cholesterol in one’s blood. Further details about the dataset are discussed in [this Kaggle page](https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction).
In this assignment, we will investigate whether patients with heart disease have different serum cholesterol levels than patients without heart disease.

**Run the cell below**, which assigns `non_disease_chol` to an array of serum cholesterol values of patients without heart disease (of which there are 390), and `disease_chol` to an array of serum cholesterol values of patients with heart disease (of which there are 356).

In [None]:
# Just run this cell. You will learn these functions soon!

import pandas as pd
hearts_df = pd.read_csv("hearts.csv")

non_disease_chol = hearts_df[hearts_df['HeartDisease'] == 0]['Cholesterol'].values
print(len(non_disease_chol))

disease_chol = hearts_df[hearts_df['HeartDisease'] == 1]['Cholesterol'].values
print(len(disease_chol))

Suppose that we overlay the distributions of cholesterol levels from the two groups:

In [None]:
# Just run this cell. You will learn these functions soon!

import seaborn as sns
sns.histplot(hearts_df, x="Cholesterol", hue="HeartDisease");
plt.title("Distribution of Cholesterol Levels");

In the plot above, `0` indicates data from patients without heart disease, and `1` indicates data from patients with heart disease. The distribution of serum cholesterol of patients without heart disease is centered slightly left of the distribution corresponding to those with heart disease. Specifically, the **average** serum cholesterol of patients without heart disease appears lower than that of patients with heart disease.

<br/>    
As mentioned in the introduction of this question, we'd like to study whether this difference reflects just chance variation or perhaps a difference in the distributions in the larger population. Suppose we propose the following two hypotheses:

> **Null hypothesis ($\mathcal{H}_0$)**: In the population, the distribution of serum cholesterol of patients without heart disease is the same as that of patients with heart disease. The (observed) difference in the sample is due to chance.

> **Alternative hypothesis ($\mathcal{H}_1$)**: In the population, the distribution of serum cholesterol of patients without heart disease is **different** from that of patients with heart disease.

We would like to perform hypothesis testing using the permutation test. One way to do so is to compute an observed test statistic and then compare it with multiple simulated test statistics generated through random permutations.
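
To make "random permutation" concrete, the sketch below shows a single shuffle on two tiny made-up groups (the numbers are invented and are not taken from `hearts.csv`). Under the null hypothesis the group labels carry no information, so we can pool all the values, shuffle them, and deal them back out into two groups of the original sizes. Computing a test statistic on the shuffled groups, and repeating the process many times, is left to the questions that follow.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up values, for illustration only
group_a = np.array([210, 190, 250, 230])
group_b = np.array([240, 260, 220])

# Pool the two groups, then shuffle without replacement.
pooled = np.concatenate([group_a, group_b])
shuffled = rng.permutation(pooled)

# Deal the shuffled values back into groups of the original sizes.
sim_a = shuffled[:len(group_a)]
sim_b = shuffled[len(group_a):]

print(sim_a)
print(sim_b)
```

Every original value appears exactly once across the two simulated groups; only the group assignment has changed.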


<br/><br/>

---
## Question 4

In this question, we will confirm some details about the hypothesis testing operations proposed.

### Question 4a
Given the study description and hypotheses outlined above, select the statement that most accurately describes the hypothesis test we conducted. Answer this question by entering the letter corresponding to your answer in the textbox below, along with a sentence describing your choice. 

**A.** The hypothesis test is one-sided.  The null hypothesis is rejected when the average serum cholesterol of patients with heart disease is significantly higher than that of patients without heart disease. \
**B.** The hypothesis test is two-sided because we are comparing the average serum cholesterol of two different groups. \
**C.** The hypothesis test is two-sided. The null hypothesis is rejected when the average serum cholesterol of patients with heart disease is significantly higher or lower than that of patients without heart disease. \
**D.** The hypothesis test is two-sided because the test statistic, the difference in means, is symmetrically distributed. In other words, the two halves of the distribution closely resemble each other, so the test is two-sided.


**Hint**: Visit just the first few paragraphs of [this page](https://www.stat.berkeley.edu/~spector/s133/Random1.html) to refresh your knowledge on the differences between "one-sided" and "two-sided" tests.
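As a refresher on the terminology only, here is a minimal sketch on made-up numbers. The array `simulated` and the value `observed` are hypothetical and unrelated to the cholesterol data. It contrasts a one-sided tail proportion with a two-sided one for a signed statistic.

```python
import numpy as np

# Hypothetical signed test statistics simulated under some null hypothesis
simulated = np.array([-3.0, -2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0])
observed = 2.0  # hypothetical observed signed statistic

# One-sided: only values at least as large as the observed one count as extreme
one_sided = np.mean(simulated >= observed)

# Two-sided: values at least as far from zero in either direction count as extreme
two_sided = np.mean(np.abs(simulated) >= np.abs(observed))

print(one_sided, two_sided)
```

Which of the two applies depends on the alternative hypothesis, not on the shape of the simulated distribution.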

_Type your answer here, replacing this text._

### Question 4b

Suppose that we choose as our test statistic the **absolute difference** between the average cholesterol level of patients with heart disease and the corresponding average for patients without heart disease.
In the cell below, assign `observed_difference` to the observed value of the test statistic computed from our original samples: `non_disease_chol` and `disease_chol`.

**Hint**: This test statistic is slightly different from what is presented in the Data 8 textbook, [Chapter 12.1](https://inferentialthinking.com/chapters/12/1/AB_Testing.html#the-hypotheses).

In [None]:
observed_difference = ...
observed_difference

<!-- BEGIN QUESTION -->

<br/><br/>

---

### Question 4c

Before we write any code, let’s review the idea of hypothesis testing with the permutation test. It follows the procedure below: 
1. We first simulate the experiment many times (say, 10,000 times) under the assumption that the null hypothesis is true, using [random permutation](https://inferentialthinking.com/chapters/12/1/AB_Testing.html#predicting-the-statistic-under-the-null-hypothesis) (i.e., shuffling without replacement). This simulation produces an empirical distribution of many values of a predetermined test statistic (say, 10,000 values).
2. Then, we compare our one true observed test statistic to this empirical distribution of simulated test statistics to compute an empirical p-value.
3. Finally, we compare this p-value to a particular cutoff threshold (often 0.05) to decide whether we reject or fail to reject the null hypothesis.
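The three steps can be sketched on toy data as follows. This is only an illustration under stated assumptions: the arrays `group_a` and `group_b`, the helper `abs_mean_diff`, the 1,000 repetitions, and the 0.05 cutoff are all hypothetical, and the sketch uses NumPy's `default_rng` generator rather than the seeded functions used elsewhere in this lab.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator so the sketch is reproducible

# Hypothetical toy samples (not the lab's cholesterol data)
group_a = np.array([5.1, 4.8, 6.0, 5.5, 5.9, 6.2])
group_b = np.array([4.9, 5.0, 5.2, 4.7, 5.3])

def abs_mean_diff(a, b):
    """Absolute difference between the two group means."""
    return abs(np.mean(a) - np.mean(b))

observed_stat = abs_mean_diff(group_a, group_b)

# Step 1: simulate the statistic under the null by shuffling the pooled values
pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)
n_reps = 1000
simulated_stats = np.empty(n_reps)
for i in range(n_reps):
    shuffled = rng.permutation(pooled)
    simulated_stats[i] = abs_mean_diff(shuffled[:n_a], shuffled[n_a:])

# Step 2: empirical p-value = share of simulated statistics at least as extreme
p_value = np.mean(simulated_stats >= observed_stat)

# Step 3: compare with a cutoff chosen in advance
cutoff = 0.05
print(p_value, "reject" if p_value < cutoff else "fail to reject")
```

Preallocating `simulated_stats` is one design choice; growing an array inside the loop, as the optional skeleton in this lab does, gives the same values.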

In the cell below, answer the following questions:
* What does an empirical p-value from a permutation test mean in this particular context of serum cholesterol and having heart disease?
* Suppose the empirical p-value is $p=0.15$, and our p-value cutoff threshold is $0.01$. Do we reject or fail to reject the null hypothesis? Why?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<br/><br/>

---

### Question 4d

Now, we begin the permutation test by generating an array called `differences` that contains simulated values of our test statistic from **10,000 permuted samples**. Again, note that our test statistic differs from what is in the Data 8 textbook: we are computing the **absolute** difference between the average cholesterol levels of patients with and without heart disease, where labels have been assigned at random (i.e., in a world where the null hypothesis is true, so disease status is arbitrary and should have no effect on cholesterol).

**Reminder**: Data 100 does **not** support the `datascience` library, so you should instead use the appropriate functions from the `NumPy` library. Some suggested references: Lab 01 (for a quick `NumPy` tutorial), `NumPy` array indexing/slicing [documentation](https://numpy.org/doc/stable/user/basics.indexing.html), `np.random.choice` [documentation](https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html) (in particular, the `size` and `replace` parameters), and `np.append` [documentation](https://numpy.org/doc/stable/reference/generated/numpy.append.html).
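As a quick illustration of the `NumPy` functions mentioned above, here is a small sketch on made-up arrays. The names `first`, `second`, `combined`, `shuffled`, `left`, and `right` are hypothetical and not part of the lab.

```python
import numpy as np

np.random.seed(0)  # only for reproducibility of this sketch

first = np.array([1, 2, 3])
second = np.array([4, 5])

# np.append returns a NEW array; the inputs are left unchanged
combined = np.append(first, second)

# Sampling every element without replacement yields a random permutation
shuffled = np.random.choice(combined, size=len(combined), replace=False)

# Slicing splits the shuffled array back into groups of the original sizes
left = shuffled[:len(first)]
right = shuffled[len(first):]

print(combined, shuffled, left, right)
```

Because `np.append` does not modify its first argument, its result has to be assigned to a name to be kept.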

**Note**: We have provided some optional skeleton code below, but you do not need to follow it. However, please still assign your simulated differences to the array `differences`.

In [None]:
np.random.seed(42) # Do not modify this line.

# Create an empty array to hold our simulated differences
differences = np.array([]) 
# Set number of repetitions
repetitions = 10000
# Combine the two arrays into a single array
all_cholesterol = np.append(non_disease_chol, disease_chol)

for i in np.arange(repetitions):
    # Permute all_cholesterol (sampling every value without replacement)
    shuffled_cholesterols = np.random.choice(all_cholesterol, size=len(all_cholesterol), replace=False)
    
    # Make the simulated patient and non-patient groups
    sim_non_disease_chol = shuffled_cholesterols[:len(non_disease_chol)]
    sim_disease_chol = shuffled_cholesterols[len(non_disease_chol):]
    
    # Calculate the test statistic for this permutation
    sim_difference = np.abs(np.mean(sim_disease_chol) - np.mean(sim_non_disease_chol))
    
    # Append the test statistic to differences
    differences = ...

differences

<!-- BEGIN QUESTION -->

<br/><br/>

---

### Question 4e

The array `differences` is an empirical distribution of the test statistic simulated under the null hypothesis. This is a prediction about the test statistic, based on the null hypothesis.

Use the `plot_distribution` function you defined in an earlier part to plot a histogram of this empirical distribution. Because you are using this function, your histogram should have unit bins, with bars centered at integers. No title or labels are required for this question.

**Hint**: This part should be very straightforward.


In [None]:
...

<!-- END QUESTION -->

<br/><br/>

---

### Question 4f

Compute `empirical_p`, the empirical p-value based on `differences`, the empirical distribution of the test statistic, and `observed_difference`, the observed value of the test statistic.

**Hint**: 
* Review the conclusion of the [Data 8 textbook example](https://inferentialthinking.com/chapters/12/1/AB_Testing.html#conclusion-of-the-test) in Chapter 12.1.
* There are two main differences between this example and the Data 8 example: our test statistic is different, and our hypotheses are different. How can you adjust the code from the Data 8 example to calculate `empirical_p`?

In [None]:
empirical_p = ...
empirical_p

<!-- BEGIN QUESTION -->

<br/><br/>

---

### Question 4g

Based on your computed empirical p-value, do we reject or fail to reject the null hypothesis? Use the p-value cutoff of $0.01$, or $1\%$, proposed in Question 4c.


_Type your answer here, replacing this text._

<br/><br/>
<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

## Congratulations! You have finished Lab 02!