In [None]:
# If you're following along, run this cell.
import babypandas as bpd
import numpy as np

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

from IPython.display import display, IFrame

def binning_animation():
    src="https://docs.google.com/presentation/d/e/2PACX-1vTnRGwEnKP2V-Z82DlxW1b1nMb2F0zWyrXIzFSpQx_8Wd3MFaf56y2_u3JrLwZ5SjWmfapL5BJLfsDG/embed?start=false&loop=false&delayms=60000"
    width=900
    height=307
    display(IFrame(src, width, height))

# Lecture 8 – Histograms and Overlaid Plots

## DSC 10, Winter 2022

### Announcements

- Homework 2 is due on **Saturday, 1/22 at 11:59pm**.
- Lab 3 is due on **Tuesday, 1/25 at 11:59pm**.
- Supplemental videos 🎥 to be aware of:
    - [Lecture 7 supplement](https://www.youtube.com/watch?v=OVTroiHby3g) (more examples of bar charts).
    - [Lecture 6 supplement](https://www.youtube.com/watch?v=xg7rnjWnZ48).
    - [Discussion 2 video](https://www.youtube.com/watch?v=Q3mww8m3iIQ).
    - All are linked on the course homepage.

### Agenda

- Motivating histograms.
- Density histograms.
- Overlaid plots.

### Review: types of visualizations

The type of visualization we create depends on the kinds of variables we're visualizing.

- **Scatter plot**: numerical vs. numerical.
    - Example: weight vs. height.
- **Line plot**: sequential numerical (time) vs. numerical.
    - Example: height vs. time.
- **Bar chart**: categorical vs. numerical.
    - Example: heights of all family members.
- **Histogram**: distribution of a single numerical variable.
    
**Note:** We may interchange the words "plot", "chart", and "graph"; they all mean the same thing.

## Motivating histograms

Below, we load in a dataset of the Top 200 songs on the Spotify Charts as of Wednesday, January 19th.

In [None]:
charts = bpd.read_csv('data/spotify-wednesday-jan-19.csv').set_index('Position')
charts

In [None]:
charts.take(np.arange(15)) \
      .sort_values('Streams') \
      .plot(kind='barh', x='Track Name', y='Streams', title='Number of Streams for Top 15 Songs', color='red');

Note the optional `title` and `color` arguments.

### How do we visualize the distribution of the number of streams?

- **Question:** Can we use a bar chart? 🧐

- **Answer:** No! 🙅‍♀️
    - Bar charts visualize the relationship between a categorical variable (e.g. song name) and a numerical variable (e.g. streams).
    - In all of the bar charts we created, we had to pick a category for our labels.
        - With a bar chart, we can visualize the number of streams **per song**. 
    - If we're looking at just the number of streams, and ignoring all other columns, there's no "category".

### New idea: binning

- Binning is the act of counting the number of numerical values that fall within ranges, called “bins”.
- A bin is defined by a left endpoint (lower bound) and right endpoint (upper bound).
- A value falls in a bin if it is greater than or equal to the left endpoint and less than the right endpoint.
    - [a, b): a is included, b is not.
- The width of a bin is its right endpoint minus its left endpoint.
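
As a small sketch of the rule above, the cell below counts how many values fall in the single bin [2, 4). The array `values` is made up for illustration and is not taken from the Spotify data.

```python
import numpy as np

# Hypothetical values, chosen only to illustrate the rule.
values = np.array([1, 1, 2, 3, 3, 3, 5, 7])

# The bin [2, 4): left endpoint included, right endpoint excluded.
left, right = 2, 4

# A value is in the bin if left <= value < right.
in_bin = (values >= left) & (values < right)
count = np.count_nonzero(in_bin)

# Width is the right endpoint minus the left endpoint.
width = right - left

print(count)  # 4
print(width)  # 2
```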


In [None]:
binning_animation()

### Distribution of streams

- $x$-axis: streams (numerical).
- $y$-axis: number of songs. The height of each bar is the number of songs whose stream count falls in that bar's bin.

In [None]:
charts.plot(kind='hist', y='Streams');

- 👀 It seems like the vast majority of songs on the charts have under 2 million streams, but some songs have as many as 4 million streams.
    - The `1e6` in the bottom right means "multiply this axis by a million", since $10^{6} = 1{,}000{,}000$.

### Plotting histograms

- **Histograms** (not bar charts!) visualize the distribution of a single numerical variable by placing numbers into bins.
- To create one from a DataFrame `df`, use
```py
df.plot(
    kind='hist', 
    y=column_name
)
```
- ⚠️ By default, the height of a bar is the *number* of values in the corresponding *bin*.
- Optional: specify the number of bins with `bins=`. Use `ec='w'` to see where bins start and end more clearly.

In [None]:
charts.plot(kind='hist', y='Streams', bins=20, ec='w');

### Example: Number of songs per artist

In [None]:
songs_per_artist = charts.groupby('Artist').count().get(['Streams'])
songs_per_artist = songs_per_artist.assign(Count=songs_per_artist.get('Streams')).drop(columns=['Streams'])
songs_per_artist.plot(kind='hist', y='Count', ec='w');

- Note: No more "1e6" in the bottom right!
     - We're now plotting number of songs, not number of streams.

### Custom bins

- We can specify our own bins with an array or list.
    - It's good to do this, so that we know where the bins start and end.
- `bins=np.arange(1, 9)` creates the bins [1, 2), [2, 3), [3, 4), [4, 5), [5, 6), [6, 7), and [7, 8].
    - **Important**: in a histogram, only the last bin is inclusive of the right endpoint!
- **Warning**: Data points not in any bin will not be included in the histogram.
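
To see these endpoint rules in action without drawing anything, here is a minimal sketch using `np.histogram`, NumPy's own binning function. Using it here is our choice for illustration, and the array `example_counts` is hypothetical.

```python
import numpy as np

# Hypothetical data. Note that 0 and 9 lie outside every bin.
example_counts = np.array([0, 1, 1, 1, 2, 2, 3, 7, 8, 9])

# Bin edges 1, 2, ..., 8 give the bins [1, 2), [2, 3), ..., [6, 7), [7, 8].
bin_edges = np.arange(1, 9)

# np.histogram returns the count in each bin and the edges it used.
counts, edges = np.histogram(example_counts, bins=bin_edges)

print(counts.tolist())      # [3, 2, 1, 0, 0, 0, 2]
print(int(counts.sum()))    # 8
print(len(example_counts))  # 10
```

The last bin counts both 7 and 8 because it includes its right endpoint, while 0 and 9 are silently dropped, so the counts sum to 8 rather than 10.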

In [None]:
song_bins = np.arange(1, 9)
songs_per_artist.plot(kind='hist', y='Count', ec='w', bins=song_bins);

- 👀 The vast majority of artists had only 1 or 2 songs on the charts, but someone had as many as 7.
- We'd say this distribution is **right-skewed** or **right-tailed**.

### Bin widths don't have to be uniform!

- When we set `bins=np.arange(1, 9)`, each bin has the same width (1).
- But we could make bins of varying widths.
- `bins=[1, 2, 3, 4, 6, 8]` creates the bins [1, 2), [2, 3), [3, 4), [4, 6), and [6, 8].
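
One quick way to check the width of each bin is to take the differences between consecutive edges. The sketch below uses `np.diff` for this; the variable name `bin_widths` is ours.

```python
import numpy as np

# The same edges as above: [1, 2), [2, 3), [3, 4), [4, 6), [6, 8].
uneven_edges = [1, 2, 3, 4, 6, 8]

# np.diff subtracts each edge from the next one, giving one width per bin.
bin_widths = np.diff(uneven_edges)

print(bin_widths.tolist())  # [1, 1, 1, 2, 2]
```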

In [None]:
weird_bins = [1, 2, 3, 4, 6, 8]
songs_per_artist.plot(kind='hist', y='Count', ec='w', bins=weird_bins);

### Discussion Question

Intuitively, what should happen to our histogram if we combine the two bins [1, 2) and [2, 3) into one large bin [1, 3)?

A. The height of the bar for bin [1, 3) should be the sum of the heights of the bars for bins [1, 2) and [2, 3).

B. The height of the bar for bin [1, 3) should be the average of the heights of the bars for bins [1, 2) and [2, 3).

C. The area of the bar for bin [1, 3) should be the sum of the areas of the bars for bins [1, 2) and [2, 3).

D. More than one of the above.


### To answer, go to **[menti.com](https://menti.com)** and enter the code **6723 8967**.

In [None]:
weirder_bins = [1, 3, 4, 6, 8]
songs_per_artist.plot(kind='hist', y='Count', ec='w', bins=weirder_bins);

### There's a problem... 🤔

- We know that there are ~90 artists with 1 song and ~20 artists with 2 songs.
- But with these new bins, it looks like there are ~110 artists with 1 song and ~110 artists with 2 songs.
- **Takeaway:** Using bins with different widths is **misleading** if the $y$-axis is frequency (which is the default).
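
The same effect can be reproduced on a tiny made-up dataset. In the sketch below, `song_counts` is hypothetical (not the real chart data), and `np.histogram` stands in for the counting that a frequency histogram does.

```python
import numpy as np

# Hypothetical songs-per-artist values: nine 1s, two 2s, and a few larger values.
song_counts = np.array([1] * 9 + [2] * 2 + [3, 4, 6, 7])

# Separate bins [1, 2) and [2, 3) ...
separate, _ = np.histogram(song_counts, bins=[1, 2, 3, 4, 6, 8])

# ... versus a single merged bin [1, 3).
merged, _ = np.histogram(song_counts, bins=[1, 3, 4, 6, 8])

print(separate.tolist())  # [9, 2, 1, 1, 2]
print(merged.tolist())    # [11, 1, 1, 2]
```

With frequency on the $y$-axis, the merged bar is drawn at height 11 across the whole interval from 1 to 3, even though no single value occurs 11 times.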

## Density histograms

### Solution: normalize bars by their width

- Use the `density=True` keyword argument to make a **density histogram**.
    - **Important:** We will **always** do this moving forward.

In [None]:
songs_per_artist.plot(kind='hist', y='Count', ec='w', bins=weirder_bins, density=True);

What do you notice about this new histogram?

- The $y$-axis values are now decimals.

- The relative heights of the bars are now different.

### Areas are proportions!

- The **area** of a bar in a density histogram is equal to the proportion (percentage) of all data points that fall into that bin.
- **The total area of a density histogram is always 1 (100%)**.
- Proportions and percentages represent the same thing.
    - A proportion is a decimal between 0 and 1; a percentage is between 0\% and 100\%.
    - 0.34 means 34\%.

In [None]:
songs_per_artist.plot(kind='hist', y='Count', ec='w', bins=weirder_bins, density=True);

### How to calculate heights in a density histogram

$$\text{Area} = \text{Height} \times \text{Width}$$

That means

$$\text{Height} = \frac{\text{Area}}{\text{Width}} = \frac{\text{Proportion (or Percentage)}}{\text{Width}}$$

- Note that this means the units for height are "proportion per ($x$-axis unit)".
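
The formula above can be sketched in a few lines of NumPy. The dataset `song_counts` below is hypothetical, and comparing against `np.histogram(..., density=True)` is our own sanity check rather than part of the lecture.

```python
import numpy as np

# Hypothetical songs-per-artist values (every value lies inside some bin).
song_counts = np.array([1] * 9 + [2] * 2 + [3, 4, 6, 7])
edges = np.array([1, 3, 4, 6, 8])

# Step 1: count the values in each bin and convert counts to proportions.
counts, _ = np.histogram(song_counts, bins=edges)
proportions = counts / counts.sum()

# Step 2: height = proportion / width.
widths = np.diff(edges)
heights = proportions / widths

# Step 3: area = height * width, and the areas should add up to 1.
areas = heights * widths

# Sanity check against NumPy's built-in density calculation.
density, _ = np.histogram(song_counts, bins=edges, density=True)

print(np.allclose(heights, density))  # True
print(np.isclose(areas.sum(), 1.0))   # True
```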

### Example calculation

In [None]:
songs_per_artist.plot(kind='hist', y='Count', ec='w', bins=weirder_bins, density=True);

- The $y$-axis units here are "proportion per song", since the $x$-axis represents number of songs.
    - Unfortunately, the $y$-axis label on the histogram still displays as "Frequency". **This is wrong!**
- Based on this histogram, what proportion of artists had either 1 or 2 songs in the top 200?

### Example calculation

- The height of the [1, 3) bar is roughly 0.43.
    - Interpretation: 0.43 per song, or 43% per song.
- The width of the bin is 3 - 1 = 2 songs.

- Hence,

$$\text{Area} = \text{Height} \times \text{Width} = 0.43 \times 2 = 0.86$$

- Since areas = proportions, this means that the proportion of artists with 1 or 2 songs on the charts was roughly 0.86 (86\%).

In [None]:
# Check: count the artists with 1 or 2 songs directly.
between_1_3 = songs_per_artist[(songs_per_artist.get('Count') >= 1) & (songs_per_artist.get('Count') < 3)].shape[0]
between_1_3

In [None]:
total = songs_per_artist.shape[0]
total

In [None]:
between_1_3 / total

This approximately matches the 0.86 we estimated from the histogram. It doesn't match exactly, since we made a rough guess for the height of the bar.

### Important

**In this class, "histogram" will always mean "density histogram".** We will **only** use density histograms moving forward.

### Discussion Question

Suppose we created a density histogram of people's shoe sizes. Below are the bins we chose along with their heights.

| Bin | Height of Bar |
| --- | --- |
| [3, 7) | 0.05 |
| [7, 10) | 0.1 |
| [10, 12) | 0.15 |
| [12, 16] | $X$ |


What should the value of $X$ be so that this is a valid histogram?

A. 0.02

B. 0.05

C. 0.2

D. 0.5

E. 0.7

### To answer, go to **[menti.com](https://menti.com)** and enter the code **6723 8967**.

### Bar charts vs. histograms

Bar Chart | Histogram
---|---
1 categorical axis,  1 numerical axis | 2 numerical axes
Bars have arbitrary, but equal, widths and spacing | Horizontal axis is numerical and to scale
Lengths of bars are proportional to the numerical quantity of interest | Height measures density; the area of each bar is the proportion (percent) of individuals in its bin

## Overlaid plots

### New dataset: populations of San Diego and San Jose over time

The data for both cities was downloaded from [macrotrends.net](https://www.macrotrends.net/cities/23129/san-diego/population).

In [None]:
population = bpd.read_csv('data/sd-sj-2022.csv').set_index('date')
population

### Recall: line plots

In [None]:
population.plot(kind='line', y='Growth SD');

In [None]:
population.plot(kind='line', y='Growth SJ');

### Overlaying plots

- If `y=column_name` is omitted, all columns are plotted!

In [None]:
population.plot(kind='line');

### Selecting multiple columns at once
- To get multiple columns, use `.get([column_1, ..., column_k])`.
- Passing a list of column labels to `.get` returns a DataFrame.
    - `.get([column_name])` will return a DataFrame with just one column!

In [None]:
growths = population.get(['Growth SD', 'Growth SJ'])
growths

In [None]:
growths.plot(kind='line');

### To plot multiple graphs at once:
* Drop all extraneous columns from your DataFrame.
    * Equivalently, select only the columns that contain information relevant to your plot.
* Specify the column for the $x$-axis (if not the index) in `.plot(x=column_name)`.
* `plot` will plot **all** other columns on a shared $y$-axis.

The same thing works for `barh`, `bar`, and `hist`, but not `scatter`.

## Example: Overlaid histograms

### New dataset: heights of children and their parents

- This data was collected by Francis Galton, a eugenicist and the creator of linear regression.
    - We will revisit this dataset later on in the course.
- We only need the `'father'`, `'mother'`, and `'childHeight'` columns for now.

In [None]:
heights = bpd.read_csv('data/galton.csv')
heights

In [None]:
heights = heights.get(['father', 'mother', 'childHeight'])
heights

### Plotting overlaid histograms

- `alpha` controls how transparent the bars are (`alpha=1` is opaque, `alpha=0` is transparent).

In [None]:
heights.plot(kind='hist', density=True, ec='w', alpha=0.65, bins=np.arange(55, 80, 2.5));

### There's too much going on...

It's too hard to read any of the individual histograms here. Let's instead just draw two, one for `'childHeight'` and one for `'mother'`.

In [None]:
heights.get(['mother', 'childHeight']) \
       .plot(kind='hist', density=True, ec='w', alpha=0.65, bins=np.arange(55, 80, 2.5));

### Discussion Questions

Try to answer these questions; you don't have to submit your answers to Menti.

1. What proportion of children were between 70 and 75 inches tall?

2. What proportion of mothers were between 60 and 63 inches tall?

<h3>Answers</h3>
<details>
<summary>Click here to show.</summary>
    
<b>Question 1</b>
    
The height of the $[70, 72.5)$ bar is around $0.08$, meaning that $0.08 \cdot 2.5 = 0.2$ of children had heights in that interval. The height of the $[72.5, 75)$ bar is around $0.02$, meaning $0.02 \cdot 2.5 = 0.05$ of children had heights in that interval. Thus, the overall proportion of children who were between $70$ and $75$ inches tall was around $0.20 + 0.05 = 0.25$, or $25\%$.
    
To verify our answer, we can run

<code>heights[(heights.get('childHeight') >= 70) & (heights.get('childHeight') < 75)].shape[0] / heights.shape[0]</code>
    
<b>Question 2</b>
    
We can't tell. We could try breaking it up into the proportion of mothers in $[60, 62.5)$ and $[62.5, 63)$, but we don't know the latter. In the absence of any additional information, we can't infer anything about the distribution of values within a bin. For example, it could be that everyone in the interval $[62.5, 65)$ actually falls in the interval $[62.5, 63)$, or it could be that no one does!

</details>

## Summary

### Summary

- Histograms (not bar charts!) are used to display the distribution of a numerical variable.
- We will always use density histograms.
    - In density histograms, the area of a bar represents the proportion (percentage) of values within its bin.
    - The total area of all bars is 1 (100%).
- We can overlay multiple line plots, bar charts, and histograms on top of one another to look at multiple relationships or distributions.
- **Next time**: More DataFrame manipulation.
    - Writing our own functions (Lab 3).