In [None]:
# Run cells by clicking on them and hitting CTRL + ENTER on your keyboard
from IPython.display import YouTubeVideo
from datascience import *
%matplotlib inline

# Module 2.1 Part 1: Charts, Data Types, and Categorical Distributions 

In this notebook, you'll be introduced to various types of data and learn how best to visualize them.

This notebook contains 8 videos, with a total runtime of 54:49.

1. [Line Graphs](#section1) *2 videos, total runtime 16:10*
2. [Scatter Plots](#section2) *2 videos, total runtime 13:41*
3. [Choosing the Plotting Method](#section3) *1 video, total runtime 2:14*
4. [Types of Data](#section4) *1 video, total runtime 3:57*
5. [Distributions](#section5) *2 videos, total runtime 18:47*
6. [Check for Understanding](#section6)

Textbook Readings:
- [Chapter 7: Visualization](https://www.inferentialthinking.com/chapters/07/Visualization.html)
- [Chapter 7.1: Visualizing Categorical Distributions](https://www.inferentialthinking.com/chapters/07/1/Visualizing_Categorical_Distributions.html)

<a id='section1'></a>
## 1. Line Graphs

A picture is worth a thousand words -- or, in a data science context, a thousand values stored in a
table. The next videos describe how we can succinctly represent chronological trends via line graphs.

In [None]:
YouTubeVideo("pcEadlLnFBw")

In [None]:
YouTubeVideo("5-NEr5Pnybk")

The data below provides aggregated information on movies produced in the United States between 1980 and 2015.
Use a line plot to visualize the number of movies produced each year during this time period. Repeat for the
total gross per year. Are there any obvious trends?

In [None]:
# load the movies_by_year dataset
movies_by_year = Table.read_table('https://www.inferentialthinking.com/data/movies_by_year.csv')

# visualize the number of movies produced per year
movies_by_year.plot(...)

# visualize the total gross produced per year
movies_by_year.plot(...)

<details>
    <summary>Solution</summary>
    <b>Code</b>: <br>
    movies_by_year.plot('Year', 'Number of Movies') <br>
    movies_by_year.plot('Year', 'Total Gross') <br>
    <b>Interpretation</b>: <br>
    Both the number of movies and total gross tended to increase from year to year between 1980 and 2015.
</details>

<a id='section2'></a>
## 2. Scatter Plots

Next, scatter plots are introduced as a tool for displaying the relationship between two numerical variables.

In [None]:
YouTubeVideo("6mPOvbubJSM")

In [None]:
YouTubeVideo("WxrsPBNklks")

Consider the `actors` dataset loaded in the cell below. Plot the relationship between the actors' average gross per movie and the gross
of their most successful movie. Only consider actors who have starred in over 30 movies.

In [None]:
# load the actors dataset
actors = Table.read_table('https://www.inferentialthinking.com/data/actors.csv')

# produce the scatter plot
...

<details>
    <summary>Solution</summary>
    actors.where("Number of Movies", are.above(30)).scatter("Average per Movie", "Gross")
</details>
<br>

<a id='section3'></a>
## 3. Choosing the Plotting Method

In the next video, you'll learn how to choose between a line graph and a scatter plot.

In [None]:
YouTubeVideo("CQIc1pjkyEM")

<a id='section4'></a>
## 4. Types of Data

You'll learn to distinguish between numerical and categorical variables in the upcoming video.

In [None]:
YouTubeVideo("EHRg9ojcVRQ")

<a id='section5'></a>
## 5. Distributions

You'll learn how to visualize and interpret the distributions of categorical variables in the following videos.

In [None]:
YouTubeVideo("ME3LjCrvxik")

In [None]:
YouTubeVideo("hMvuoBFWC1o")

The Bay Area Bike Share service published a dataset describing every one of their bicycle rentals from September 2014 to August 2015.
There were 354,152 rentals in all. Plot the number of rides starting at each station.

*Hint*: Each row corresponds to a single bike rental. The `Start Station` variable indicates the station where that rental began.
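
To make the hint concrete, here is a minimal sketch of what "one row per rental" implies for counting. It uses only the Python standard library, and the station names are hypothetical placeholders rather than values from `trip.csv`. The tally it builds is the kind of per-category count that a bar chart of a categorical distribution displays.

```python
from collections import Counter

# Hypothetical station names, one entry per rental (NOT taken from trip.csv).
start_stations = [
    "Ferry Plaza", "Main St", "Ferry Plaza",
    "Depot", "Ferry Plaza", "Main St",
]

# Tally how many rentals start at each station.
rides_per_station = Counter(start_stations)

# List stations from most to least popular, the order a sorted bar chart would use.
for station, rides in rides_per_station.most_common():
    print(f"{station}: {rides}")
```

Because every row is one rental, the counts always add up to the number of rows in the table.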

In [None]:
# load the bike trips data
bike_trips = Table.read_table('https://www.inferentialthinking.com/data/trip.csv')

In [None]:
# plot the distribution of bike rides
...

<details>
    <summary>Solution</summary>
    bike_trips.group("Start Station").sort(1, descending=True).barh(0)
</details>
<br>

<a id="section6"></a>
## 6. Check for Understanding

For the following questions, consider the `temperatures` table:

| Date       | min_temp | max_temp | temp_diff | forecast |
|------------|----------|----------|-----------|----------|
| 2020-05-01 | 44       | 68       | 24        | 0        |
| 2020-05-02 | 49       | 72       | 23        | 2        | 
| 2020-05-03 | 45       | 67       | 22        | 1        |
| 2020-05-04 | 47       | 66       | 19        | 0        |
| 2020-05-05 | 48       | 64       | 16        | 1        |
| 2020-05-06 | 42       | 62       | 20        | 1        |
| 2020-05-07 | 50       | 65       | 15        | 1        |

**A. What kind of plot should be used to visualize the relationship between *Date* and *temp_diff*?** 

<details>
    <summary>Solution</summary>
    A line graph, since we wish to visualize a chronological trend. 
</details>
<br>

**B. What kind of plot should be used to visualize the relationship between *min_temp* and *max_temp*?** 

<details>
    <summary>Solution</summary>
    A scatter plot, since we wish to visualize the relationship between two numerical variables.
</details>
<br>

**C. Suppose we created a scatter plot of *max_temp* and *temp_diff*. What would each point in the plot represent?**

<details>
    <summary>Solution</summary>
    Each point in this scatter plot corresponds to an observation in the temperatures table. Each point therefore
    corresponds to a date, and provides information on that date's <i>max_temp</i> and <i>temp_diff</i>.
</details>
<br>
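
As a rough illustration of the answer to C, the sketch below pairs up the *max_temp* and *temp_diff* values from the `temperatures` table using plain Python lists (an assumption made for portability; the notebook itself works with `datascience` tables). Each pair is one point in the scatter plot, and each comes from exactly one row, that is, one date.

```python
# Columns copied from the temperatures table above.
dates = ["2020-05-01", "2020-05-02", "2020-05-03", "2020-05-04",
         "2020-05-05", "2020-05-06", "2020-05-07"]
min_temp = [44, 49, 45, 47, 48, 42, 50]
max_temp = [68, 72, 67, 66, 64, 62, 65]
temp_diff = [24, 23, 22, 19, 16, 20, 15]

# One (x, y) point per row: x is max_temp, y is temp_diff.
points = list(zip(max_temp, temp_diff))

# Show which date each point belongs to.
for date, point in zip(dates, points):
    print(date, point)
```

The table has seven rows, so the scatter plot has seven points.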

**D. The forecast for each *Date* was recorded in the *forecast* column: 0 corresponds to sunny, 1 to cloudy, and 2 to rainy.
Is *forecast* a numerical or a categorical variable?**

<details>
    <summary>Solution</summary>
    Although <i>forecast</i> is recorded as a number, it's a categorical variable. It doesn't make sense to perform arithmetic operations on these values.
</details>
<br>
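
The point in D can be checked with a short sketch. It uses the *forecast* values from the table above and the standard library only (the `labels` dictionary and variable names are illustrative, not part of the notebook). Counting the categories gives a meaningful summary, while averaging the codes gives a number that matches no category at all.

```python
from collections import Counter
from statistics import mean

# forecast column from the temperatures table above.
forecast = [0, 2, 1, 0, 1, 1, 1]

# Coding scheme stated in question D.
labels = {0: "sunny", 1: "cloudy", 2: "rainy"}

# Decode the numbers into category names and count each category.
decoded = [labels[code] for code in forecast]
forecast_counts = Counter(decoded)
print(forecast_counts)

# Arithmetic on the codes runs without error, but the result is not a valid code.
code_mean = mean(forecast)
print(code_mean in labels)
```

The counts describe the distribution of the variable. The mean depends entirely on the arbitrary choice of which number stands for which kind of weather.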