In [None]:
# Run cells by clicking on them and hitting CTRL + ENTER on your keyboard
from IPython.display import YouTubeVideo
from datascience import *
import numpy as np
%matplotlib inline

# Module 2.1 Part 2: Histograms

This lecture guide introduces histograms, a popular method for visualizing the distributions of numerical data.

Seven videos make up this notebook, with a total running time of 60:14.

1. [The Area Principle](#section1) *1 video, total runtime 7:21*
2. [Binning](#section2) *2 videos, total runtime 18:03*
3. [Drawing Histograms](#section3) *1 video, total runtime 13:05*
4. [Density](#section4) *1 video, total runtime 9:37*
5. [Check for Understanding](#section5) *2 videos, total runtime 12:08, and 2 short answer questions*

Textbook readings: [Chapter 7.2: Visualization, Numerical Distributions](https://www.inferentialthinking.com/chapters/07/2/Visualizing_Numerical_Distributions.html)

<a id='section1'></a>
## 1. The Area Principle

Before diving into histograms, we introduce the *area principle*. Following this principle ensures that
histograms honestly depict patterns in the data.

In [None]:
YouTubeVideo("qEYz6D0MKq8")

<a id='section2'></a>
## 2. Binning

In the next two videos, you'll learn how to *bin* numerical variables so that they can be plotted in a histogram.

In [None]:
YouTubeVideo("kREoWbByNZs")

In [None]:
YouTubeVideo("vz5VLqrw-tA")
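
As a minimal sketch of what binning does, the snippet below counts a handful of made-up durations with NumPy's `np.histogram`. The duration values are invented for illustration and do not come from the bike share data, and the `datascience` `bin` method is deliberately not used here. Each bin includes its left edge and excludes its right edge, except that NumPy's last bin also includes its right edge.

```python
import numpy as np

# Hypothetical rental durations in seconds (made-up values, not from trip.csv)
durations = np.array([120, 299, 300, 450, 899, 900, 1500, 3600])

# Bin edges every 15 minutes (900 seconds): 0, 900, 1800, 2700, 3600
edges = np.arange(0, 3601, 900)

# counts[i] is the number of durations d with edges[i] <= d < edges[i + 1];
# NumPy's last bin additionally includes its right edge (here, 3600)
counts, _ = np.histogram(durations, bins=edges)

# Print one row per bin: left edge, right edge, count
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(left, right, count)
```

Note that a duration of exactly 900 seconds lands in the second bin, not the first, which is why the choice of bin edges matters for values that sit on a boundary.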

Let's again consider the Bay Area Bike Share dataset. First, filter out all trips whose duration exceeds one hour. Note that
the *Duration* column records the rental duration in seconds. Next, tabulate *Duration*'s distribution twice: once using bins
corresponding to 5 minute intervals, and once using bins corresponding to 15 minute intervals. Which binwidth (a.k.a. bin interval) do you prefer?

In [None]:
# load the bike trips data
bike_trips = Table.read_table('https://www.inferentialthinking.com/data/trip.csv')
bike_trips

In [None]:
# filter out trips exceeding one hour
...

# count table with bins corresponding to 5 minute intervals
...

# count table with bins corresponding to 15 minute intervals
...

<details>
    <summary>Solution</summary>
    <b>Code</b>: <br>
    # filter out trips exceeding one hour <br>
    bike_trips = bike_trips.where("Duration", are.below_or_equal_to(3600)) <br>
    # count table with bins corresponding to 5 minute intervals <br>
    bike_trips.bin("Duration", bins = np.arange(0, 3601, 300)) <br>
    # count table with bins corresponding to 15 minute intervals <br>
    bike_trips.bin("Duration", bins = np.arange(0, 3601, 900)) <br>
    <b>Explanation</b>: <br>
    Most rentals last fewer than 15 minutes, so the 15 minute bins lump most of the data into a single bin. The table using 5 minute intervals
    as bins shows how the rentals are spread within that range, and therefore conveys more information. Bins corresponding to 5 minute intervals are more appropriate.
</details>
<br>

<a id='section3'></a>
## 3. Drawing Histograms

In the next video, you'll learn how to visualize the distribution of a numerical variable using a histogram.

In [None]:
YouTubeVideo("xPv7VNSBJZQ")

Let's visualize the rental duration distribution that you tabulated in the previous section. Once again, consider
bins corresponding to 5 and 15 minute intervals. Don't forget to indicate the units! Which histogram do you prefer?

In [None]:
# histogram of duration with bins corresponding to 5 minute intervals
bike_trips.hist(...)

# histogram of duration with bins corresponding to 15 minute intervals
bike_trips.hist(...)

<details>
    <summary>Solution</summary>
    <b>Code</b>: <br>
    # histogram of duration with bins corresponding to 5 minute intervals <br>
    bike_trips.hist("Duration", bins = np.arange(0, 3601, 300), unit = "Seconds") <br>
    # histogram of duration with bins corresponding to 15 minute intervals <br>
    bike_trips.hist("Duration", bins = np.arange(0, 3601, 900), unit = "Seconds") <br>
    <b>Explanation</b>: <br>
    As with the tables, the histogram produced with bins corresponding to 5 minute intervals provides a more descriptive picture of
    <i>Duration</i>'s distribution than does the histogram using 15 minute intervals.
</details>
<br>

<a id='section4'></a>
## 4. Density

The vertical axis of a histogram typically corresponds to a quantity called *density*. This new measurement is defined in the following video.

In [None]:
YouTubeVideo("F8Pv0DWqPls")
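
As a rough numerical sketch, assuming the textbook's definition of density as the percent of entries in a bin divided by the bin's width, the snippet below computes bar heights for bins of unequal width. The durations and bin edges are made up for illustration. It also checks that each bar's area (height times width) equals the percent of entries in its bin, which is how density heights satisfy the area principle.

```python
import numpy as np

# Hypothetical durations in seconds (made-up values for illustration)
durations = np.array([100, 200, 250, 400, 700, 1000, 1300, 2000, 2500, 3000])

# Deliberately unequal bins: two 5 minute bins followed by two wider bins
edges = np.array([0, 300, 600, 1800, 3600])
widths = np.diff(edges)

# Count the entries in each bin and convert the counts to percents
counts, _ = np.histogram(durations, bins=edges)
percents = 100 * counts / counts.sum()

# Density (bar height) in percent per second
density = percents / widths

# Area of each bar = height * width = percent of entries in the bin
areas = density * widths

print(density)
print(areas)
```

Because the heights are measured in percent per second, the bar areas add up to 100 no matter how the bin widths are chosen.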

<a id='section5'></a>
## 5. Check for Understanding

The problems in the following videos will help you assess your understanding of the material introduced in this submodule.
Don't forget to answer the short answer questions as well!

In [None]:
YouTubeVideo("ZwvovAbWUyY")

In [None]:
YouTubeVideo("Jl5fNPkEcDI")

### Short Answer Questions

**A. Fill in the blank: The area principle dictates that the area of a histogram's bar must be ________ to the number of entries in the bar's bin.**

<details>
    <summary>Solution</summary>
    proportional
</details>
<br>

**B. True or false: The area principle ensures that no distributional details are lost when binning a numerical variable.**

<details>
    <summary>Solution</summary>
    False. Even when the binwidth is selected judiciously, some distributional details are lost in the binning process. For example,
    review the tables and histograms that you generated in Sections 2 and 3. Even though the area principle is respected when using
    a binwidth of 15 minutes, much less information is conveyed about <i>Duration</i>'s distribution than by the table and plot
    employing a 5 minute binwidth.
</details>
<br>