# Lecture 21 - Line Charts, Bar Charts, Scatter Plots, and More

Thursday 2021/11/04

---

## ✨ Visualize your dataset

▶️ First, run the code below to ensure you're using the correct version of plotly.

In [None]:
# Install plotly 5.3.1 using pip
# Colab environment supports pip
if 'google.colab' in str(get_ipython()):
    !pip install plotly==5.3.1

# If you're using conda, use the code below
# !conda install -c plotly plotly=5.3.1

Import modules used by **🧭 Check Your Work** sections and the autograder.

In [None]:
import unittest
import base64
import plotly
tc = unittest.TestCase()

---

### 🎯 Pre-exercise: Import Packages

#### 👇 Tasks

- ✔️ Import the following Python packages.
    1. `pandas`: Use alias `pd`.
    2. `numpy`: Use alias `np`.
    3. `plotly.express`: Use alias `px`.
    4. `plotly.graph_objects`: Use alias `go`.

In [None]:
# YOUR CODE BEGINS




# YOUR CODE ENDS

#### 🧭 Check your work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
import sys
tc.assertTrue('pd' in globals(), 'Check whether you have correctly imported Pandas with an alias.')
tc.assertTrue('np' in globals(), 'Check whether you have correctly imported NumPy with an alias.')
tc.assertIsNotNone(go.Figure, 'Check whether you have correctly imported plotly.graph_objects with an alias go.')
tc.assertIsNotNone(px.scatter, 'Check whether you have correctly imported plotly.express with an alias px.')

---
### 📌 Import dataset

![BLUEbikes](https://github.com/bdi475/images/blob/main/lecture-notes/dataviz-python/bluebike-transparent-bike.png?raw=true)

Today, we work with a bike-sharing trips dataset 🚲 to uncover insights about trips made by subscribers and casual riders of Bluebikes (in Boston). The original dataset was downloaded from [https://www.bluebikes.com/system-data](https://www.bluebikes.com/system-data) and preprocessed for this exercise.

▶️ Run the code below to import the dataset. This dataset is fairly large with ~2 million rows, **so it may take up to a few minutes**.

In [None]:
# Display all columns
pd.set_option('display.max_columns', 50)

df_trips = pd.read_csv('https://github.com/bdi475/datasets/blob/main/bluebikes-trip-data-2020-sampled.csv.gz?raw=true',
                       compression='gzip',
                       parse_dates=['start_time', 'stop_time'])

df_trips_backup = df_trips.copy()

display(df_trips)

--- 

## 📦 Box plots and histograms review

---

### 🎯 Exercise 1: Trip duration box plot (horizontal)

#### 👇 Tasks

- ✔️ Draw a horizontal box plot of `trip_duration` where `trip_duration < 1800` (less than 30 minutes).
- ✔️ Store your figure to a variable named `fig`.
- ✔️ Add an appropriate title to your figure.
- ✔️ Display the figure using `fig.show()`

In [None]:
# YOUR CODE BEGINS





# YOUR CODE ENDS

#### 🔑 Sample output

![image](https://github.com/bdi475/images/blob/main/exercises/plotly-dataviz/box_plot_trip_duration_under_30mins.png?raw=true)

#### 🧭 Check your work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
tc.assertEqual(len(fig.data), 1, 'There must be only one plot in your figure')
tc.assertIsNotNone(fig.layout.title.text, 'Missing figure title')
tc.assertEqual(fig.data[0].type, 'box', 'Not a box plot')
tc.assertEqual(fig.data[0].orientation, 'h', 'Your plot should have a horizontal orientation')
np.testing.assert_array_equal(fig.data[0].x, df_trips['trip_duration'][df_trips['trip_duration'] < 1800], 'Incorrect data')

---

### 🎯 Exercise 2: Trip duration dispersion by user type

#### 👇 Tasks

- ✔️ Draw horizontal box plots of `trip_duration` by `user_type`.
- ✔️ Store your figure to a variable named `fig`.
- ✔️ Add an appropriate title to your figure.
- ✔️ Display the figure using `fig.show()`

In [None]:
# YOUR CODE BEGINS






# YOUR CODE ENDS

#### 🔑 Sample output

![image](https://github.com/bdi475/images/blob/main/exercises/plotly-dataviz/box_plots_trip_duration_by_user_type.png?raw=true)

#### 🧭 Check your work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
tc.assertEqual(len(fig.data), 1, 'There must be only one plot in your figure')
tc.assertIsNotNone(fig.layout.title.text, 'Missing figure title')
tc.assertEqual(fig.data[0].type, 'box', 'Not a box plot')
tc.assertEqual(fig.data[0].orientation, 'h', 'Your plot should have a horizontal orientation')
np.testing.assert_array_equal(fig.data[0].x, df_trips['trip_duration'], 'Incorrect x-axis data')
np.testing.assert_array_equal(fig.data[0].y, df_trips['user_type'], 'Incorrect y-axis data')

---

### 🎯 Exercise 3: Trip duration histogram

#### 👇 Tasks

- ✔️ Draw a histogram of `trip_duration` in `df_trips`.
- ✔️ Set the number of bins to `36`.
- ✔️ Store your figure to a variable named `fig`.
- ✔️ Add an appropriate title to your figure.
- ✔️ Display the figure using `fig.show()`

In [None]:
# YOUR CODE BEGINS


# YOUR CODE ENDS

#### 🔑 Sample output

![image](https://github.com/bdi475/images/blob/main/exercises/plotly-dataviz/trip_duration_distribution_histogram.png?raw=true)

#### 🧭 Check your work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
tc.assertEqual(len(fig.data), 1, 'There must be only one plot in your figure')
tc.assertIsNotNone(fig.layout.title.text, 'Missing figure title')
tc.assertEqual(fig.data[0].type, 'histogram', 'Not a histogram')
tc.assertEqual(fig.data[0].nbinsx, 36, 'There should be 36 bins - set the nbins parameter')
tc.assertEqual(fig.data[0].orientation, 'v', 'Your plot should have a vertical orientation')
np.testing.assert_array_equal(fig.data[0].x, df_trips['trip_duration'], 'Incorrect data')

--- 

## 📈 Line chart

A line chart displays how one or more numeric variables change along an ordered axis, most often time. Discrete data points are usually connected by straight lines.

---

### 🎯 Exercise 4: Annual gold price at the end of the year 📈

▶️ First, run the code cell below to import the annual gold closing prices dataset.

In [None]:
# DO NOT CHANGE THE CODE BELOW
df_gold = pd.read_csv('https://github.com/bdi475/datasets/raw/main/gold-annual-closing-price.csv')
df_gold

#### 👇 Tasks

- ✔️ Using `df_gold`, create a line chart that displays the closing price by year.
- ✔️ Store your figure to a variable named `fig`.
- ✔️ Add an appropriate title to your figure.
- ✔️ Display the figure using `fig.show()`

In [None]:
# YOUR CODE BEGINS


# YOUR CODE ENDS

#### 🔑 Sample output

![image](https://github.com/bdi475/images/blob/main/exercises/plotly-dataviz/annual_closing_price_of_gold_line.png?raw=true)

#### 🧭 Check your work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
tc.assertEqual(len(fig.data), 1, 'There must be only one plot in your figure')
tc.assertIsNotNone(fig.layout.title.text, 'Missing figure title')
tc.assertEqual(fig.data[0].type, 'scatter', 'Must be a line chart')
tc.assertIsNotNone(fig.data[0].line.color, 'Must be a line chart')
np.testing.assert_array_equal(fig.data[0].x, df_gold['Year'], 'Incorrect x-axis data')
np.testing.assert_array_equal(fig.data[0].y, df_gold['Closing Price'], 'Incorrect y-axis data')

---
### 🎯 Exercise 5: Create an aggregated DataFrame with number of trips by date

#### 👇 Tasks

- ✔️ A common task when visualizing data is to aggregate it before plotting.
- ✔️ Using `df_trips`, create a new DataFrame named `df_num_trips_by_date` that holds the number of trips by date.
- ✔️ We will give you the fully working code below.


```python
# YOUR CODE BEGINS
df_num_trips_by_date = df_trips.groupby(
    df_trips['start_time'].dt.date,
    as_index=False
).size()

df_num_trips_by_date.rename(columns={
    'start_time': 'date',
    'size': 'num_trips'
}, inplace=True)

display(df_num_trips_by_date)
# YOUR CODE ENDS
```
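To see what this kind of aggregation does on a small scale, here is a sketch that counts rows per calendar date in a three-row toy DataFrame. The names `df_toy_trips` and `trips_per_day` and the timestamps are hypothetical. It uses `reset_index(name=...)` rather than `as_index=False`, which is just one of several equivalent ways to get a flat DataFrame.

```python
import pandas as pd

# Hypothetical toy data: three trips on two different days
df_toy_trips = pd.DataFrame({
    'start_time': pd.to_datetime([
        '2020-01-01 08:00',
        '2020-01-01 17:30',
        '2020-01-02 09:15',
    ]),
})

# Group by the date part of the timestamp and count the rows in each group
trips_per_day = (
    df_toy_trips.groupby(df_toy_trips['start_time'].dt.date)
    .size()
    .reset_index(name='num_trips')
    .rename(columns={'start_time': 'date'})
)

print(trips_per_day)
```

Each row of the result is one date with its trip count, which is the shape a line or bar chart expects.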

In [None]:
# YOUR CODE BEGINS











# YOUR CODE ENDS

#### 🧭 Check your work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
df_check = df_trips_backup.groupby(df_trips_backup['start_time'].dt.date).size() \
    .reset_index().rename(columns={'start_time': 'date', 0: 'num_trips'})

pd.testing.assert_frame_equal(df_num_trips_by_date.sort_values('date').reset_index(drop=True),
                              df_check.sort_values('date').reset_index(drop=True),)

---

### 🎯 Exercise 6: Number of trips by date 📈

#### 👇 Tasks

- ✔️ Using `df_num_trips_by_date`, create a line chart that displays the number of trips by date.
- ✔️ Store your figure to a variable named `fig`.
- ✔️ Add an appropriate title to your figure.
- ✔️ Display the figure using `fig.show()`

In [None]:
# YOUR CODE BEGINS


# YOUR CODE ENDS

#### 🔑 Sample output

![image](https://github.com/bdi475/images/blob/main/exercises/plotly-dataviz/number_of_trips_by_date_line.png?raw=true)

#### 🧭 Check your work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
tc.assertEqual(len(fig.data), 1, 'There must be only one plot in your figure')
tc.assertIsNotNone(fig.layout.title.text, 'Missing figure title')
tc.assertEqual(fig.data[0].type, 'scatter', 'Must be a line chart')
tc.assertIsNotNone(fig.data[0].line.color, 'Must be a line chart')
np.testing.assert_array_equal(fig.data[0].x, df_num_trips_by_date['date'], 'Incorrect x-axis data')
np.testing.assert_array_equal(fig.data[0].y, df_num_trips_by_date['num_trips'], 'Incorrect y-axis data')

---

### 🎯 Exercise 7: Number of trips by date (Scatter Plot)

#### 👇 Tasks

- ✔️ Using `df_num_trips_by_date`, create a **scatter** plot that displays the number of trips by date.
- ✔️ Store your figure to a variable named `fig`.
- ✔️ Add an appropriate title to your figure.
- ✔️ Display the figure using `fig.show()`

In [None]:
# YOUR CODE BEGINS


# YOUR CODE ENDS

#### 🔑 Sample output

![image](https://github.com/bdi475/images/blob/main/exercises/plotly-dataviz/number_of_trips_by_date_scatter.png?raw=true)

#### 🧭 Check your work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
tc.assertEqual(len(fig.data), 1, 'There must be only one plot in your figure')
tc.assertIsNotNone(fig.layout.title.text, 'Missing figure title')
tc.assertEqual(fig.data[0].type, 'scatter', 'Must be a scatter plot')
tc.assertIsNone(fig.data[0].line.color, 'Must be a scatter plot')
np.testing.assert_array_equal(fig.data[0].x, df_num_trips_by_date['date'], 'Incorrect x-axis data')
np.testing.assert_array_equal(fig.data[0].y, df_num_trips_by_date['num_trips'], 'Incorrect y-axis data')

---
### 🎯 Exercise 8: Create an aggregated DataFrame with number of trips by date & user type

#### 👇 Tasks

- ✔️ Using `df_trips`, create a new DataFrame named `df_num_trips_by_date_and_user_type` that holds the number of trips by date and user type.
- ✔️ We will give you the fully working code below.

```python
# YOUR CODE BEGINS
df_num_trips_by_date_and_user_type = df_trips.groupby(
    [df_trips['start_time'].dt.date, 'user_type'],
    as_index=False
).size()

df_num_trips_by_date_and_user_type.rename(columns={
    'start_time': 'date',
    'size': 'num_trips'
}, inplace=True)

display(df_num_trips_by_date_and_user_type)
# YOUR CODE ENDS
```

In [None]:
# YOUR CODE BEGINS











# YOUR CODE ENDS

#### 🧭 Check your work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
df_check = df_trips_backup.groupby([df_trips_backup['start_time'].dt.date, 'user_type']).size() \
    .reset_index().rename(columns={'start_time': 'date', 0: 'num_trips'})

pd.testing.assert_frame_equal(df_num_trips_by_date_and_user_type.sort_values(['date', 'user_type']).reset_index(drop=True),
                              df_check.sort_values(['date', 'user_type']).reset_index(drop=True),)

---

### 🎯 Exercise 9: Number of trips by date and user type

#### 👇 Tasks

- ✔️ Using `df_num_trips_by_date_and_user_type`, create a **line** chart that displays the number of trips by date and user type.
- ✔️ Draw two line charts on a single figure.
    - Use different colors to distinguish the user types.
- ✔️ Store your figure to a variable named `fig`.
- ✔️ Add an appropriate title to your figure.
- ✔️ Display the figure using `fig.show()`

In [None]:
# YOUR CODE BEGINS






# YOUR CODE ENDS

#### 🔑 Sample output

![image](https://github.com/bdi475/images/blob/main/exercises/plotly-dataviz/number_of_trips_by_date_and_user_type_line.png?raw=true)

#### 🧭 Check your work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
tc.assertEqual(len(fig.data), 2, 'There must be two plots in your figure')
tc.assertIsNotNone(fig.layout.title.text, 'Missing figure title')
tc.assertEqual(fig.data[0].type, 'scatter', 'Must be a line plot')
tc.assertEqual(fig.data[1].type, 'scatter', 'Must be a line plot')
tc.assertIsNotNone(fig.data[0].line.color, 'Must be a line plot')
tc.assertIsNotNone(fig.data[1].line.color, 'Must be a line plot')
np.testing.assert_array_equal(
    fig.data[0].x,
    df_num_trips_by_date_and_user_type[df_num_trips_by_date_and_user_type['user_type'] == 'Customer']['date'],
    'Incorrect x-axis data'
)
np.testing.assert_array_equal(
    fig.data[0].y,
    df_num_trips_by_date_and_user_type[df_num_trips_by_date_and_user_type['user_type'] == 'Customer']['num_trips'],
    'Incorrect y-axis data'
)

np.testing.assert_array_equal(
    fig.data[1].x,
    df_num_trips_by_date_and_user_type[df_num_trips_by_date_and_user_type['user_type'] == 'Subscriber']['date'],
    'Incorrect x-axis data'
)
np.testing.assert_array_equal(
    fig.data[1].y,
    df_num_trips_by_date_and_user_type[df_num_trips_by_date_and_user_type['user_type'] == 'Subscriber']['num_trips'],
    'Incorrect y-axis data'
)

---

### 🎯 Exercise 10: Number of trips by date and user type (scatter plot)

#### 👇 Tasks

- ✔️ Using `df_num_trips_by_date_and_user_type`, create a **scatter** plot that displays the number of trips by date and user type.
- ✔️ Draw two scatter plots on a single figure.
    - Use different colors to distinguish the user types.
- ✔️ Store your figure to a variable named `fig`.
- ✔️ Add an appropriate title to your figure.
- ✔️ Display the figure using `fig.show()`

In [None]:
# YOUR CODE BEGINS






# YOUR CODE ENDS

#### 🔑 Sample output

![image](https://github.com/bdi475/images/blob/main/exercises/plotly-dataviz/number_of_trips_by_date_and_user_type_scatter.png?raw=true)

#### 🧭 Check your work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
tc.assertEqual(len(fig.data), 2, 'There must be two plots in your figure')
tc.assertIsNotNone(fig.layout.title.text, 'Missing figure title')
tc.assertEqual(fig.data[0].type, 'scatter', 'Must be a scatter plot')
tc.assertEqual(fig.data[1].type, 'scatter', 'Must be a scatter plot')
tc.assertIsNone(fig.data[0].line.color, 'Must be a scatter plot')
tc.assertIsNone(fig.data[1].line.color, 'Must be a scatter plot')
np.testing.assert_array_equal(
    fig.data[0].x,
    df_num_trips_by_date_and_user_type[df_num_trips_by_date_and_user_type['user_type'] == 'Customer']['date'],
    'Incorrect x-axis data'
)
np.testing.assert_array_equal(
    fig.data[0].y,
    df_num_trips_by_date_and_user_type[df_num_trips_by_date_and_user_type['user_type'] == 'Customer']['num_trips'],
    'Incorrect y-axis data'
)

np.testing.assert_array_equal(
    fig.data[1].x,
    df_num_trips_by_date_and_user_type[df_num_trips_by_date_and_user_type['user_type'] == 'Subscriber']['date'],
    'Incorrect x-axis data'
)
np.testing.assert_array_equal(
    fig.data[1].y,
    df_num_trips_by_date_and_user_type[df_num_trips_by_date_and_user_type['user_type'] == 'Subscriber']['num_trips'],
    'Incorrect y-axis data'
)

--- 

## 📊 Bar chart

A bar chart compares a numeric value across the categories of a categorical variable. The length of each bar encodes the value for its category.

---
### 🎯 Exercise 11: Create an aggregated DataFrame with number of trips by month

#### 👇 Tasks

- ✔️ Using `df_trips`, create a new DataFrame named `df_num_trips_by_month` that holds the number of trips by month.
- ✔️ We will give you the fully working code below.

```python
# YOUR CODE BEGINS
df_num_trips_by_month = df_trips.groupby(
    df_trips['start_time'].dt.month,
    as_index=False
).size()

df_num_trips_by_month.rename(columns={
    'start_time': 'month',
    'size': 'num_trips'
}, inplace=True)

display(df_num_trips_by_month)
# YOUR CODE ENDS
```

In [None]:
# YOUR CODE BEGINS











# YOUR CODE ENDS

#### 🧭 Check your work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
df_check = df_trips_backup.groupby(df_trips_backup['start_time'].dt.month).size() \
    .reset_index().rename(columns={'start_time': 'month', 0: 'num_trips'})

pd.testing.assert_frame_equal(df_num_trips_by_month.sort_values('month').reset_index(drop=True),
                              df_check.sort_values('month').reset_index(drop=True),)

---

### 🎯 Exercise 12: Number of trips by month (Bar Chart)

#### 👇 Tasks

- ✔️ Using `df_num_trips_by_month`, create a **bar** chart that displays the number of trips by month.
- ✔️ Store your figure to a variable named `fig`.
- ✔️ Add an appropriate title to your figure.
- ✔️ Display the figure using `fig.show()`

In [None]:
# YOUR CODE BEGINS


# YOUR CODE ENDS

#### 🔑 Sample output

![image](https://github.com/bdi475/images/blob/main/exercises/plotly-dataviz/number_of_trips_by_month_bar.png?raw=true)

#### 🧭 Check your work

- Once you're done, run the code cell below to test correctness.
- ✔️ If the code cell runs without an error, you're good to move on.
- ❌ If the code cell throws an error, go back and fix incorrect parts.

In [None]:
tc.assertEqual(len(fig.data), 1, 'There must be only one plot in your figure')
tc.assertIsNotNone(fig.layout.title.text, 'Missing figure title')
tc.assertEqual(fig.data[0].type, 'bar', 'Must be a bar chart')
tc.assertEqual(fig.data[0].orientation, 'v', 'Your plot should have a vertical orientation')
np.testing.assert_array_equal(fig.data[0].x, df_num_trips_by_month['month'], 'Incorrect x-axis data')
np.testing.assert_array_equal(fig.data[0].y, df_num_trips_by_month['num_trips'], 'Incorrect y-axis data')