In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw05.ipynb")

# Homework 5 – Data Visualization 🐧

## Data 6, Summer 2021

This homework is due on **Tuesday, August 10th at 11:00PM. (Notice the 24-hour extension for this assignment!)** You must submit the assignment to Gradescope. Submission instructions can be found at the bottom of this notebook. See the [syllabus](http://data6.org/su21/syllabus/#late-policy-and-extensions) for our late submission policy.

### Acknowledgements

Many of the pictures and descriptions in this assignment are taken from [Dr. Allison Horst](https://twitter.com/allison_horst) and [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php). See [here](https://allisonhorst.github.io/palmerpenguins/) for more details.

In [1]:
# Run this cell, don't change anything. Ignore any warning messages, if you receive them. 
!pip install kaleido

In [2]:
# Run this cell, don't change anything.

from datascience import *
import numpy as np
Table.interactive_plots()
import seaborn as sns
import copy
import plotly.io
import plotly.express as px
from IPython.display import display, Image
np.set_printoptions(suppress=True)
import warnings
warnings.filterwarnings('ignore')
from ipywidgets import interact

def save_and_show(fig, path):
    if fig is None:
        # Without show = False the plotting method returns None, so there is nothing to save.
        print('Error: Make sure to use the argument show = False above.')
        return
    plotly.io.write_image(fig, path)
    display(Image(path))

## Disclaimer (MAKE SURE YOU READ THIS!)

When creating graphs in this assignment, there are two things you will always have to do that we didn't talk about in lecture:

**1. Assign the graph to a variable name. (As in previous assignments, we will tell you which variable names to use.)**

**2. Use the argument `show = False` in your graphing method, in addition to the other arguments you want to use.**

These steps are **required** in order for your work to be graded. The distinction is subtle, since you will see the same visual output either way. See below for an example.

<b style="color:green;">Good:</b>

```py
fig_5 = table.sort('other_column').barh('column_name', show = False)
```

<b style="color:red;">Bad:</b>

```py
table.sort('other_column').barh('column_name')
```

Also note that most of this homework will be graded manually by us rather than being graded by an autograder.

**A general tip we have is that you should slow down and read the question prompts, as they'll have helpful information and tips needed to answer the question.**

Lastly, if you ever restart your kernel or reopen your notebook, remember to re-import all of your tools by running the 3 code cells above.

# Visualization Fundamentals

## The Data

In the first part of this assignment (Questions 1-3), we will explore a dataset containing size measurements for three penguin species observed on three islands in the Palmer Archipelago, Antarctica. The data was collected by Dr. Kristen Gorman, a marine biologist, from 2007 to 2009.

Here's a photo of Dr. Gorman in the wild collecting the data:

<img src='images/gorman1.png' width=500>

Run the cell below to load in our data.

In [5]:
# Run this cell!
penguins = Table.from_df(sns.load_dataset('penguins').dropna())
penguins

Let's make sure we understand what each of the columns in our data represents before proceeding.

### `'species'`

There are three species of penguin in our dataset: Adelie, Chinstrap, and Gentoo.
<img src='images/lter_penguins.png' width=500>

### `'island'`

The penguins in our dataset come from three islands: Biscoe, Dream, and Torgersen. (The smaller image of Anvers Island may initially be confusing; the dark region is land and the light region is water.)

<img src='images/island.png' width=500>

<div align = center>
    Image taken from <a href=https://journals.plos.org/plosone/article/figure?id=10.1371/journal.pone.0090081.g001>here</a>
</div>

### `'bill_length_mm'` and `'bill_depth_mm'`

See the illustration below.

<img src='images/culmen_depth.png' width=350>

### `'flipper_length_mm'`, `'body_mass_g'`, `'sex'`

[Flippers](https://www.thespruce.com/flipper-definition-penguin-wings-385251) are the equivalent of wings on penguins. Body mass and sex should be relatively clear.

## Question 1 – Bar Charts

### Question 1a

Let's start by visualizing the distribution of the islands that the penguins in our dataset come from. As discussed in Lecture 25, we use bar charts to display the distribution of categorical variables, so `barh` sounds like it will be useful here.

Run the following line of code. Make sure to scroll to the very bottom of the box that appears to see the x-axis.

In [6]:
penguins.barh('island')

<!-- BEGIN QUESTION -->

Hmm... that doesn't look right. In two sentences, explain what's wrong with the above graph and describe how you would fix it.

_Hint: Recall the three-step visualization process._

<!--
BEGIN QUESTION
name: q1a
points: 2
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



### Question 1b

Now, fix the above visualization so that it correctly shows the distribution of the islands that our penguins come from.

Specifically, assign `fig_1b` to a bar chart with three bars, one for each island. The length of each bar should correspond to the number of penguins from each island.

**Note:** Remember the disclaimer at the top of the assignment. Don't forget to set `show = False`.

In [8]:
fig_1b = ...
fig_1b

<!-- BEGIN QUESTION -->

After creating your visualization above, run the following cell. You should see a picture of your graph. You must run this cell in order to get credit for this question.

<!--
BEGIN QUESTION
name: q1b
points: 1
manual: true
-->

In [9]:
# Run this cell, don't change anything.
save_and_show(fig_1b, 'images/saved/1b.png')

<!-- END QUESTION -->



### Question 1c

Let's instead display the distribution of our penguins' species.

Below, assign `fig_1c` to a bar chart with three bars, one for each **species.** The length of each bar should correspond to the **number of penguins from each species.** 

Unlike in Question 1b, you will need to:
1. Sort the table after grouping to match the example below.
2. Edit the title and axis labels to match the example below (hint: set the arguments `xaxis_title`, `yaxis_title`, and `title`).

<img src='images/examples/1c.png' width=750>

**Note:** Remember the disclaimer at the top of the assignment. Don't forget to set `show = False`.

_Hint: Remember that you can use the `\` character to break your work into multiple lines._

In [10]:
fig_1c = ...
fig_1c

<!-- BEGIN QUESTION -->

After creating your visualization above, run the following cell. You should see a picture of your plot. You must run this cell in order to get credit for this question.


<!--
BEGIN QUESTION
name: q1c
points: 3
manual: true
-->

In [11]:
# Run this cell, don't change anything.
save_and_show(fig_1c, 'images/saved/1c.png')

<!-- END QUESTION -->

### Question 1d

Let's now try to create some grouped bar charts in order to look at the relationship between `'species'` and `'island'`. We'll first need to create a table with one categorical column and several numerical columns, as that's what `barh` needs in order to draw a grouped bar chart.

Below, assign `species_by_island` to a table with **three rows and four columns.** 

Each row should correspond to a single species of penguin, and the columns should correspond to the islands where our penguins are from. The entries in `species_by_island` should describe the number of penguins of a particular species from a particular island.

The first row of `species_by_island` is given below; remember there are supposed to be three rows.

| species   |   Biscoe |   Dream |   Torgersen |
|----------|---------:|--------:|------------:|
| Adelie    |       44 |      55 |          47 |

_Hint: This should be very straightforward; there is a method we studied that does exactly what you need to do. Also, if you see a warning that starts with `Creating an ndarray from...`, you can safely ignore it._

<!--
BEGIN QUESTION
name: q1d
points: 1
-->

In [13]:
species_by_island = ...
species_by_island

In [None]:
grader.check("q1d")

### Question 1e

Now, using the table `species_by_island` you created in the previous part, replicate the grouped bar chart below and assign the result to `fig_1e`. Make sure to match the axis labels and title.

<img src='images/examples/1e.png' width=750>

**Note:** Remember the disclaimer at the top of the assignment. Don't forget to set `show = False`.

In [18]:
fig_1e = ...
fig_1e

<!-- BEGIN QUESTION -->

After creating your visualization above, run the following cell. You should see a picture of your graph. You must run this cell in order to get credit for this question.

<!--
BEGIN QUESTION
name: q1e
points: 2
manual: true
-->

In [19]:
# Run this cell, don't change anything.
save_and_show(fig_1e, 'images/saved/1e.png')

<!-- END QUESTION -->



### Question 1f

Great, now we know how to create grouped bar charts starting from our `penguins` table. Let's try it again – but this time, there will be less scaffolding.

Assign `fig_1f` to the **grouped bar chart** below, which describes the **mean body weight** of each **species** of penguin, separated by **sex**. Once again, make sure to match the axis label and title.

<img src='images/examples/1f.png' width=750>

**Note:** Remember the disclaimer at the top of the assignment. Don't forget to set `show = False`.

_Hint: Think about how this task is similar to what you did in the previous two parts. Don't forget the three-step visualization process, which says that you should first create a table with the information you need in your visualization; in that first step, `np.mean` will need to be used somewhere._

In [20]:
fig_1f = ...
fig_1f

<!-- BEGIN QUESTION -->

After creating your visualization above, run the following cell. You should see a picture of your graph. You must run this cell in order to get credit for this question.


<!--
BEGIN QUESTION
name: q1f
points: 3
manual: true
-->

In [21]:
# Run this cell, don't change anything.
save_and_show(fig_1f, 'images/saved/1f.png')

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 1g

Comment on what you see in the grouped bar chart from the previous part. Specifically, which species of penguin appears to be the largest on average, and which sex of penguins appears to be larger on average? Is there a pair of species whose sizes are roughly the same on average?

_Hint: We use the term "on average" a lot in the question, and that's because the graph you created only shows statistics for each species and sex on average (because you used `np.mean`). It is **not** saying that all female Gentoo penguins are larger than all female Chinstrap penguins, for example._
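To make the "on average" caveat concrete, here is a small sketch using made-up body masses (the numbers and group names are hypothetical, not taken from `penguins`). It shows that one group can have a larger mean even though some individuals in the other group are heavier.

```python
from statistics import mean

# Hypothetical body masses in grams for two made-up groups.
group_a = [4500, 5000, 5500]
group_b = [3000, 3500, 4700]

# Group A is heavier on average...
a_is_heavier_on_average = mean(group_a) > mean(group_b)

# ...but the heaviest member of group B still outweighs the lightest member of group A.
some_b_outweighs_some_a = max(group_b) > min(group_a)

print(a_is_heavier_on_average, some_b_outweighs_some_a)
```

Both comparisons are true at once, which is why a bar chart of means cannot support claims about every individual.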

<!--
BEGIN QUESTION
name: q1g
points: 2
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



## Question 2 – Histograms

<img src='images/all3.png' width=400>

<div align=center>
Adelie, Chinstrap, and Gentoo penguins.
</div>

Great! Now that we've explored the distributions of some of the categorical variables in our dataset (species and island), it's time to study the distributions of some of the numerical variables. As covered in [Lecture 25](https://docs.google.com/presentation/d/1o7s7R9yy8tvNmMbwAWZg17tooQzIAfZtSTuHAY_rLIs/edit#slide=id.ge2af9457a7_0_0), we can visualize a numerical distribution by creating a histogram.

### Question 2a

Before we draw any histograms, we'll make sure that we understand how histograms work.

Run the cell below to draw a histogram displaying the distribution of our penguins' bill lengths.

In [22]:
penguins.select('bill_length_mm').hist(bins = np.arange(30, 60, 5), density = False)

<!-- BEGIN QUESTION -->

Remember, in a histogram, each bin is inclusive of the left endpoint and exclusive of the right endpoint. What this means is that, for example, the bin between 35 mm and 40 mm above corresponds to bill lengths that are greater than or equal to 35 mm and less than 40 mm.
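As a rough illustration of the "left-inclusive, right-exclusive" rule, the sketch below counts a few made-up bill lengths into bins by hand. The values and the helper name `count_in_bins` are hypothetical; this is not how `hist` is implemented, only a way to see where boundary values land.

```python
# Hypothetical bill lengths in mm (made-up values, not the penguins data).
lengths = [34.9, 35.0, 39.9, 40.0, 44.5, 49.9, 50.0]

# Same bin edges as np.arange(30, 60, 5): 30, 35, 40, 45, 50, 55.
edges = [30, 35, 40, 45, 50, 55]

def count_in_bins(values, edges):
    """Count values into bins of the form [left, right)."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(edges) - 1):
            # Left endpoint included, right endpoint excluded.
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts

counts = count_in_bins(lengths, edges)
print(counts)
```

Notice that 35.0 is counted in the 35 to 40 bin rather than the 30 to 35 bin, and 50.0 is counted in the 50 to 55 bin.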

In the Markdown cell below, answer the following three questions by looking at the graph above; do **not** write any code. **If you don't believe it is possible to determine the answer by looking at the above graph, write "impossible to tell" and explain why.** Remember that you can hover over the bars above to get their exact heights.

1. How many penguins have a bill length between 50 mm (inclusive) and 55 mm (exclusive)?
2. True or False: `penguins.where('bill_length_mm', are.between(50, 55)).num_rows` is equal to the correct answer from the previous question.
3. How many penguins have a bill length between 43 mm (inclusive) and 50 mm (exclusive)?

In your answer, use numbers to specify what part of your answer corresponds to the question number above (1, 2, or 3).

<!--
BEGIN QUESTION
name: q2a
points: 3
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



### Question 2b

Below, assign `fig_2b` to a count histogram that visualizes the distribution of our penguins' bill depths (**not** lengths, as in the previous part). Specifically, you must re-create the histogram below. Make sure that your histogram has the same bins, y-axis scale, axis labels, and title as the example.

<img src='images/examples/2b.png' width=750>

**Note:** Remember the disclaimer at the top of the assignment. Don't forget to set `show = False`.

_Hint: Our solution sets the arguments `density`, `bins`, `xaxis_title`, `yaxis_title`, `title`, and `show`._

In [23]:
fig_2b = ...
fig_2b

<!-- BEGIN QUESTION -->

After creating your visualization above, run the following cell. You should see a picture of your graph. You must run this cell in order to get credit for this question.

<!--
BEGIN QUESTION
name: q2b
points: 3
manual: true
-->

In [24]:
# Run this cell, don't change anything.
save_and_show(fig_2b, 'images/saved/2b.png')

<!-- END QUESTION -->



### Question 2c

When creating histograms, it's important to try several different bin sizes in order to make sure that we're satisfied with the level of detail (or lack thereof) in our histogram.

Run the code cell below. It will present you with a histogram of the distribution of our penguins' body masses, along with a slider for bin widths. Use the slider to try several different bin widths and look at the resulting histograms.

In [25]:
# Don't worry about the code, just play with the slider that appears after running.
def draw_mass_histogram(bin_width):
    fig = penguins.select('body_mass_g').hist(bins = np.arange(2700, 6300+2*bin_width, bin_width), density = False, show = False)
    display(fig)
    
interact(draw_mass_histogram, bin_width=(25, 525, 25));
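If you are curious how much the slider changes the histogram's level of detail, the sketch below counts the bins produced by a given width. It uses Python's built-in `range` in place of `np.arange`, which gives the same edges for these integer inputs; the helper name `number_of_bins` is our own and not part of any library.

```python
def number_of_bins(bin_width, start=2700, stop=6300):
    """Count the bins created by edges start, start + bin_width, ... as in the cell above."""
    # The slider cell uses np.arange(2700, 6300 + 2 * bin_width, bin_width) as its edges.
    edges = list(range(start, stop + 2 * bin_width, bin_width))
    # n edges always define n - 1 bins.
    return len(edges) - 1

narrow = number_of_bins(50)
wide = number_of_bins(500)
print(narrow, wide)
```

A smaller width produces many more bins over the same range of body masses.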

<!-- BEGIN QUESTION -->

In the cell below, compare the two histograms that result from setting the bin width to **50** and **500.** 

What are the pros and cons of each size? (Remember that these histograms are displaying the same data, just with different bin sizes.)

<!--
BEGIN QUESTION
name: q2c
points: 2
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



### Question 2d

We can also draw overlaid histograms to compare the distribution of a numerical variable, separated by some category (by using the `group` argument). Run the cell below to generate a histogram of body masses, separated by species.

In [27]:
# Run this cell.
penguins.select('species', 'body_mass_g') \
        .hist('body_mass_g', group = 'species', xaxis_title = 'Body Mass (g)', density = False)

The above graph is... okay. It's a little hard to extract any insight from it, since there are so many overlapping regions. However, you can click the text in the legend (`species = Gentoo` for example) to make certain categories disappear. Try it!

Also, keep in mind that there are fewer Chinstrap and Gentoo penguins in our dataset than Adelie penguins, which is why the bars for Adelie are all significantly taller. If you hide the Gentoo distribution and look at just the Adelie and Chinstrap distributions, you'll see that the "shapes" of the distributions are very similar (which is consistent with your result from Question 1f).
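One way to compare "shapes" when groups have different sizes is to convert each group's bin counts into proportions. The sketch below does this for two made-up sets of counts (hypothetical numbers, not the penguin data); `to_proportions` is a helper name we introduce for illustration.

```python
def to_proportions(counts):
    """Convert bin counts into the fraction of the group that falls in each bin."""
    total = sum(counts)
    return [c / total for c in counts]

# Hypothetical bin counts: the second group is half the size of the first.
large_group = [10, 40, 40, 10]
small_group = [5, 20, 20, 5]

same_shape = to_proportions(large_group) == to_proportions(small_group)
print(to_proportions(large_group), same_shape)
```

The bars for the larger group are twice as tall in a count histogram, yet the proportions are identical, so the two distributions have the same shape.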

However, there's another solution – we can make three separate histograms, one for each species. Below, assign `fig_2d` to the graph below, which displays the distributions of all three species' body masses on separate axes. Unlike the previous parts of this assignment, you will need to set the `height` and `width` arguments in `hist` to 700 and 500, respectively (otherwise the resulting graph is too big).

<img src='images/examples/2d.png' width=300>

**Note:** Remember the disclaimer at the top of the assignment. Don't forget to set `show = False`.

_Hint: Our solution used the arguments `group`, `density`, `overlay`, `height`, `width`, `title`, and `show`. You should start by copying the code we used to create the overlaid histogram and then add and modify arguments._

In [28]:
fig_2d = ...
fig_2d

<!-- BEGIN QUESTION -->

After creating your visualization above, run the following cell. You should see a picture of your graph. You must run this cell in order to get credit for this question.

<!--
BEGIN QUESTION
name: q2d
points: 2
manual: true
-->

In [29]:
# Run this cell, don't change anything.
save_and_show(fig_2d, 'images/saved/2d.png')

<!-- END QUESTION -->



## Question 3 – Moving Forward

<img src='images/biscoe.png' width=400>

<div align=center>
A Gentoo penguin colony at Biscoe Point.
</div>

Run the following cell to generate a scatter plot.

In [30]:
penguins.scatter('bill_length_mm', 'bill_depth_mm', 
                 group = 'species', s = 35, sizes = 'body_mass_g',
                 title = 'Bill Length vs. Bill Depth')

<!-- BEGIN QUESTION -->

Here's what you're seeing above:
- Position on the x-axis represents bill length.
- Position on the y-axis represents bill depth.
- Color represents species, as per the legend on the right.
- Size represents body mass (this is a little hard to see, but some points are slightly larger than others).

You'll note that there are three general "clusters" or "groups" of points, corresponding to the three penguin species.

Use the scatter plot to fill in the blanks below. Both blanks should be a species of penguin.

_"It appears that the distribution of bill lengths of Chinstrap penguins is very similar to the distribution of bill lengths of _ _ _ _ penguins, while the distribution of bill depths of Chinstrap penguins is very similar to the distribution of bill depths of _ _ _ _ penguins."_

You can copy and paste the text above and replace the blanks in your answer.

<!--
BEGIN QUESTION
name: q3
points: 2
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



### Fun Demo

We won't cover boxplots in depth this semester, but we'll draw some for you here. If you aren't familiar with boxplots (also known as "box-and-whisker plots"), you can watch the first few minutes of [this](https://www.youtube.com/watch?v=sDZSljMKkPw) video by Suraj Rampure or [this](https://www.khanacademy.org/math/cc-sixth-grade-math/cc-6th-data-statistics/cc-6th-box-whisker-plots/v/constructing-a-box-and-whisker-plot) video by Khan Academy.

Below, we'll draw boxplots for the distribution of body masses, separated by species.

In [31]:
px.box(penguins.to_df(), x = 'species', y = 'body_mass_g', color = 'species')

This boxplot is showing the same information as the histograms in Question 2d, just in a different way.

We can even take things a step further, by separating by species and sex:

In [32]:
px.box(penguins.to_df(), x = 'species', y = 'body_mass_g', color = 'sex')

In a single graph, we now see the distribution of penguin body masses, separated by species and sex. Pretty cool!

# Advanced Visualization Techniques

## The Data

<img src = 'images/sfo.jpg' width = 600>

In the next part of this homework, we'll explore flight data from San Francisco International Airport (SFO).

## Question 4 – Big 3 Airlines

We'll start by looking at the table `sfo_activity`. The dataset was downloaded directly from [data.sfgov.org](https://data.sfgov.org/Transportation/Air-Traffic-Passenger-Statistics/rkru-6vcg), San Francisco's open data portal.

Run the cell below to load it in.

In [33]:
sfo_activity = Table.read_table('data/Air_Traffic_Passenger_Statistics.csv')
sfo_activity

The table has many columns, most of which are not useful for us. Let's start by cutting down the table so that it is in a more meaningful format for us.

Run the cell below. You don't need to change the cell at all, but you should understand how the code in it works.

In [35]:
# Run the cell, but read through this to figure out what's happening.
sfo_three = sfo_activity.select('Activity Period', 'Operating Airline', 'Activity Type Code', 'Passenger Count') \
            .where('Operating Airline', are.contained_in(['American Airlines', 'Delta Air Lines', 'United Airlines'])) \
            .where('Activity Period', are.above(201312)) \
            .group(['Activity Period', 'Operating Airline'], np.sum) \
            .select('Activity Period', 'Operating Airline', 'Passenger Count sum') \
            .relabeled('Passenger Count sum', 'Number of Passengers')

sfo_three

Our condensed table, `sfo_three`, contains information about the Big 3 US airlines – American, Delta, and United. The larger table `sfo_activity` separated passenger counts by type of transit (arriving and departing), destination (Domestic vs. International), and terminal, while `sfo_three` aggregates all of this information into one row per airline per month.

**For example, the first row of `sfo_three` tells us that 227423 passengers traveled on American Airlines to or from SFO in January 2014. (`sfo_three` only contains information about January 2014 and onwards.)**

There's another issue with `sfo_three`: the `'Activity Period'` column is formatted strangely. It stores dates as integers, which will cause problems when we create visualizations. Our visualization library will treat the gap between November 2016 (`201611`) and December 2016 (`201612`) as different from the gap between December 2016 (`201612`) and January 2017 (`201701`), even though both pairs are exactly one month apart.
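The arithmetic behind this issue can be checked directly. In the sketch below, `months_since_year_zero` is a hypothetical helper we introduce only to show what "one month apart" looks like numerically; it is not used elsewhere in this assignment.

```python
# Treating YYYYMM values as plain integers gives uneven gaps.
gap_within_year = 201612 - 201611    # November 2016 to December 2016
gap_across_years = 201701 - 201612   # December 2016 to January 2017

def months_since_year_zero(period):
    """Convert a YYYYMM integer into a running month count."""
    year, month = divmod(period, 100)
    return year * 12 + month

# Counting in months, both gaps are the same size.
true_gap_within_year = months_since_year_zero(201612) - months_since_year_zero(201611)
true_gap_across_years = months_since_year_zero(201701) - months_since_year_zero(201612)

print(gap_within_year, gap_across_years)
print(true_gap_within_year, true_gap_across_years)
```

As integers the two gaps are 1 and 89, which would stretch the x-axis at every new year; as months they are both 1.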

That's not a problem – we can convert the `'Activity Period'` column into a format that Python understands as being a date. This way, when we go to create line plots, the x-axis will be set correctly.

Again, run the cell below. You should understand how the code in the cell works.

In [36]:
# Run this cell and think about how this code works.

def convert_activity_period(date):
    # This condition exists solely to safeguard against running this cell multiple times
    if isinstance(date, str) and '-' in date: 
        return date
    
    # Get year and month from YYYYMM (int)
    year = str(date)[:4]
    month = str(date)[4:]
    
    # Reformat to 'YYYY-MM' (str)
    return year + '-' + month

sfo_three = sfo_three.with_columns(
    'Activity Period', sfo_three.apply(convert_activity_period, 'Activity Period')
)

sfo_three

Great. Now we can proceed with creating visualizations.

### Question 4a

Our first goal will be to create a line plot of the number of passengers who traveled through SFO each month on each of the Big 3 airlines (we'll want three separate lines, one for each airline).

To do this, we will need to rearrange `sfo_three` so that it has **one column for each airline**, and **only one row for each `'Activity Period'`** (as opposed to the three rows per period it has now).

Below, assign `sfo_three_pivoted` to a table with four columns – `'Activity Period'`, `'American Airlines'`, `'Delta Air Lines'`, and `'United Airlines'` – and one row for each activity period. The numbers in the table should describe the number of passengers who traveled on a particular airline through SFO in a particular month. The first few rows of `sfo_three_pivoted` are shown below.


| Activity Period   |   American Airlines |   Delta Air Lines |   United Airlines |
|------------------:|--------------------:|------------------:|------------------:|
| 2014-01           |              227423 |            184953 |           1314840 |
| 2014-02           |              205019 |            173069 |           1188119 |
| 2014-03           |              240918 |            228384 |           1435920 |
| 2014-04           |              242020 |            231490 |           1517518 |
| 2014-05           |              246582 |            256089 |           1610137 |

_Hints:_ 
- _1. The name `sfo_three_pivoted` should tell you which table method to use._
- _2. You may also be wondering which function you need to provide as the fourth argument to the aforementioned method – most functions will work, since in `sfo_three` there is only one row for every combination of `'Activity Period'` and `'Operating Airline'`._
    - _We used `sum`; **don't use `np.mean`** since it may convert some numbers to scientific notation._
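To see what "one column for each airline, one row for each period" means, here is a plain-Python sketch of the reshaping, using the first two months from the table above. It deliberately avoids the `datascience` library and is not the solution to this question; the variable names are our own.

```python
# Long format: one row per (period, airline) pair, as in sfo_three.
long_rows = [
    ('2014-01', 'American Airlines', 227423),
    ('2014-01', 'Delta Air Lines', 184953),
    ('2014-01', 'United Airlines', 1314840),
    ('2014-02', 'American Airlines', 205019),
    ('2014-02', 'Delta Air Lines', 173069),
    ('2014-02', 'United Airlines', 1188119),
]

# Wide format: one entry per period, holding one value per airline.
wide = {}
for period, airline, passengers in long_rows:
    by_airline = wide.setdefault(period, {})
    # Summing is harmless here because each (period, airline) pair appears once.
    by_airline[airline] = by_airline.get(airline, 0) + passengers

for period, by_airline in wide.items():
    print(period, by_airline)
```

Six long rows collapse into two wide rows, each carrying three airline values.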

<!--
BEGIN QUESTION
name: q4a
points: 2
-->

In [37]:
sfo_three_pivoted = ...
sfo_three_pivoted

In [None]:
grader.check("q4a")

### Question 4b

Using the table you created in the previous subpart, assign `fig_4b` to a line plot describing the number of passengers traveling through SFO each month on Delta Air Lines. Your plot should match the example below, including the axis labels and title.

<img src = 'images/examples/4b.png' width = 800>

**Note:** Remember the disclaimer at the top of the assignment. Don't forget to set `show = False`.

In [42]:
fig_4b = ...
fig_4b

<!-- BEGIN QUESTION -->

After creating your visualization above, run the following cell. You should see a picture of your graph. You must run this cell in order to get credit for this question.

<!--
BEGIN QUESTION
name: q4b
points: 2
manual: true
-->

In [43]:
# Run this cell, don't change anything.
save_and_show(fig_4b, 'images/saved/4b.png')

<!-- END QUESTION -->



### Question 4c

Let's take things one step further. Now, assign `fig_4c` to a line plot describing the number of travelers through SFO each month for each of the Big 3 airlines. Your plot should match the example below, including the axis labels and title.

<img src = 'images/examples/4c.png' width = 800>

**Note:** Remember the disclaimer at the top of the assignment. Don't forget to set `show = False`.

_Hint: You should only provide your plotting method with one column name; lines will automatically be drawn for all other columns. See the examples in Lecture 27._ 

In [45]:
fig_4c = ...
fig_4c

<!-- BEGIN QUESTION -->

After creating your visualization above, run the following cell. You should see a picture of your graph. You must run this cell in order to get credit for this question.

<!--
BEGIN QUESTION
name: q4c
points: 2
manual: true
-->

In [46]:
# Run this cell, don't change anything.
save_and_show(fig_4c, 'images/saved/4c.png')

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 4d

There are a few interesting trends evident in the line plot you created above. One, for instance, is that it seems like the month with the most travelers every year is usually either July or August, which makes sense as people tend to travel a lot in the summer.

In the cell below, answer the following questions.

1. Why are there far more passengers on United than on either of the other airlines? _(Hint: Go to the [Wikipedia article for SFO](https://en.wikipedia.org/wiki/San_Francisco_International_Airport) and look at the information under "Summary" on the right-hand side.)_

2. There is a particular month which seems to see the fewest travelers per year. Which month is it, and why do you think there are fewer travelers in this month than in any other month? _Hint: How many days are in each month? Also, remember that you can zoom in to certain regions of the plot._

3. What is the cause of the dramatic drop in early 2020?

As in Question 2a, please number your responses to each question.

<!--
BEGIN QUESTION
name: q4d
points: 3
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



## Question 5 – United

As we discovered in the previous question, United is by far the most popular airline at SFO.

<img src = 'images/united.jpg' width = 400>

Now we will look at data from the [US Bureau of Transportation Statistics](https://www.transtats.bts.gov/ONTIME/Departures.aspx) containing flight departure on-time statistics for all **domestic** flights on United from SFO in 2019. (We chose 2019 instead of 2020 so that seasonal travel trends are more evident – not many people traveled in the summer of 2020, for example.)

If you click the above hyperlink, you'll be able to enter the parameters you'd like; we chose `All Statistics`, `San Francisco, CA: San Francisco International (SFO)`, `United Airlines Inc. (UA)`, `All Months`, `All Days`, and `2019`, in that order. After entering your parameters and clicking `Submit`, you'll see a table; clicking the `csv` link above the table will download a `csv` that you can then use in a notebook.



Run the cell below to load in the resulting table. The extra arguments are there because the downloaded `csv` has a few unnecessary rows that need to be removed before we store the data as a table.

In [48]:
# Run this cell!
united_raw = Table.read_table('data/Detailed_Statistics_Departures.csv', header = 6, skipfooter = 1, engine = 'python')
united_raw

Once again, the table has more columns than we need. Run the cell below to select only the columns we need and rename them to something slightly more convenient (`'Date'` instead of `'Date (MM/DD/YYYY)'`, for example).

In [49]:
# Run this cell to clean up some data.
united = united_raw.select('Date (MM/DD/YYYY)', 'Flight Number', 'Destination Airport', 'Departure delay (Minutes)') \
                   .relabeled(['Date (MM/DD/YYYY)', 'Destination Airport', 'Departure delay (Minutes)'],
                              ['Date', 'Destination', 'Delay'])

united

This table, `united`, is much more manageable. It contains the date, flight number, destination, and delay (in minutes) of every domestic flight on United from SFO in 2019.

**Note:** You will not use the `sfo_three` or `sfo_three_pivoted` tables from this point forward.

### Question 5a

What are the most common destinations from SFO on United? Below, assign `fig_5a` to a bar chart that describes the number of flights on United from SFO to its top 10 destinations. Your bar chart should match the given example exactly.

<img src = 'images/examples/5a.png' width = 800>


**Note:** Remember the disclaimer at the top of the assignment. Don't forget to set `show = False`.

_Hint: You did this several times in a previous section. Remember – `group`, `sort`, `take`, `barh`._

In [50]:
fig_5a = ...
fig_5a

<!-- BEGIN QUESTION -->

After creating your visualization above, run the following cell. You should see a picture of your graph. You must run this cell in order to get credit for this question.

<!--
BEGIN QUESTION
name: q5a
points: 2
manual: true
-->

In [51]:
# Run this cell, don't change anything.
save_and_show(fig_5a, 'images/saved/5a.png')

<!-- END QUESTION -->



### Question 5b

Cool! The results shouldn't be all that surprising, given that Newark (EWR), Chicago (ORD), Los Angeles (LAX), Denver (DEN), Houston (IAH), and Washington-Dulles (IAD) are all [hubs for United](https://en.wikipedia.org/wiki/United_Airlines#Hubs). The other airports in the top 10 are in popular West Coast destinations – Las Vegas (LAS), Seattle (SEA), San Diego (SAN), and Portland (PDX).

Up until now, the bar graphs we've drawn have all had the same colors by default. That's boring – let's look at how to change the color of our bars from blue to something else.

We'll start by making a copy of your `fig_5a` from the last subpart. Run the following cell.

In [52]:
fig_5b = copy.deepcopy(fig_5a)

`fig_5b`, like `fig_5a`, is of type `Figure`. All figures have a method, `update_traces`, which allow us to change the figure's visual properties. To change bar colors, you should do the following:

Your job: **Call `.update_traces(marker = ...)` on `fig_5b` with a single optional argument, `marker`.**

The value of the `marker` argument should be a **dictionary** with a single key-value pair; the key should be `'color'` and the value should be the color you want as a string (like `'green'`, `'orange'`, `'red'`, etc).

Your code should fit on a single line, and should be in the following format. (The above paragraph tells you what should go in each blank; you just need to match the description to the syntax. This is the only line of code you need to write for this question.)

<div align=center>

```py
___.___(___ = {___: ___})
```
    
</div>

Start by using `'green'` as your color. Once you get it to work, change `'green'` to whatever you want. The first column of [this table](https://css-tricks.com/snippets/css/named-colors-and-hex-equivalents/) shows you all possible colors.

In [54]:
...

<!-- BEGIN QUESTION -->

After creating your visualization above, run the following cell. You should see a picture of your graph. You must run this cell in order to get credit for this question.

<!--
BEGIN QUESTION
name: q5b
points: 2
manual: true
-->

In [55]:
# Run this cell, don't change anything.
save_and_show(fig_5b, 'images/saved/5b.png')

<!-- END QUESTION -->



### Question 5c

Wouldn't it be nice if we could visualize airports on, say, a map? That would require knowing the latitude and longitude of each airport, which is not information we currently have.

Run the following cell to load a table that contains the latitudes and longitudes of most airports in the US.

In [57]:
# Run this cell.
airports = Table.read_table('data/airports.csv') \
                .where('iata_code', are.not_equal_to('nan')) \
                .where('iso_country', 'US') \
                .select('name', 'iata_code', 'latitude_deg', 'longitude_deg')
airports

We can **join** this new table with our existing data in order to get the latitude and longitude of every one of United's domestic destinations from SFO.

In the cell below, assign `destinations` to a table with one row per unique destination from SFO on United. `destinations` should have 5 columns: `'Destination'`, `'count'`, `'name'`, `'latitude_deg'`, and `'longitude_deg'`.

The first few rows of `destinations` are given below.

| Destination   |   count | name                                             |   latitude_deg |   longitude_deg |
|--------------:|--------:|-------------------------------------------------:|---------------:|----------------:|
| ABQ           |       6 | Albuquerque International Sunport                |        35.0402 |       -106.609  |
| ANC           |      89 | Ted Stevens Anchorage International Airport      |        61.1744 |       -149.996  |
| ATL           |     660 | Hartsfield Jackson Atlanta International Airport |        33.6367 |        -84.4281 |
| AUS           |    1251 | Austin Bergstrom International Airport           |        30.1945 |        -97.6699 |
| BHM           |       2 | Birmingham-Shuttlesworth International Airport   |        33.5629 |        -86.7535 |

_Hint: We've created an intermediate table `united_grouped` for you to use in your work. If you use it, your solution should be a single method call and should fit on one line. Don't overthink this, and if you're stuck, look at the bolded word at the top of this cell. Are there any common values between the two tables?_
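To make the idea of a join concrete, here is a minimal sketch in plain Python rather than with the `datascience` library. The rows, the column names (`'code'`, `'airport_code'`), and the airport names are hypothetical toy values, not the real tables, and the sketch assumes that rows without a match in the other table are dropped.

```python
# Hypothetical toy rows; these are NOT the real `united_grouped` or `airports` tables.
flight_counts = [
    {'code': 'AAA', 'count': 3},
    {'code': 'BBB', 'count': 5},
    {'code': 'CCC', 'count': 1},   # has no match in airport_info below
]
airport_info = [
    {'airport_code': 'AAA', 'name': 'Airport A'},
    {'airport_code': 'BBB', 'name': 'Airport B'},
]

# Index the second table by the column whose values should match the first table.
info_by_code = {row['airport_code']: row for row in airport_info}

# Keep each row of the first table that has a match, and attach the extra column.
joined = []
for row in flight_counts:
    match = info_by_code.get(row['code'])
    if match is not None:
        joined.append({**row, 'name': match['name']})

for row in joined:
    print(row)
```

The key point is that the two tables do not need identically named columns; they only need a column in each whose values line up.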

<!--
BEGIN QUESTION
name: q5c
points: 2
-->

In [59]:
united_grouped = united.group('Destination')
destinations = ...
destinations

In [None]:
grader.check("q5c")

<!-- BEGIN QUESTION -->

### Question 5d

Now, using the `destinations` table you just created, assign `map_5d` to a map with markers at every destination that one can travel to on United from SFO. Your map should look similar to the one below.

<img src = 'images/examples/5d.png' width = 600>

Some pointers:
1. Use `Marker.map_table`. See [Lecture 27](https://docs.google.com/presentation/d/19PAJUTWZJcSmdGKsyr-ixG9Vdf6FEKzR4A06Whb6Xvo/edit#slide=id.ge2aef31ca8_0_0) for the type of table that `Marker.map_table` expects. In this subpart, the table you call `Marker.map_table` on should only have two columns.
2. You can set the `color` argument to be whatever you want; your markers don't need to be green.
3. Set the `marker_icon` argument to `'plane'` to make the icon of each marker an airplane.

**Note:** Remember the disclaimer at the top of the assignment. Don't forget to set `show = False`.

<!--
BEGIN QUESTION
name: q5d
points: 2
manual: true
-->

In [65]:
map_5d = ...
map_5d

<!-- END QUESTION -->



Unlike with graphs, there is no cell to run after creating your map – just make sure your map is displayed as the output of the cell above.

<!-- BEGIN QUESTION -->

### Question 5e

Let's add some spice to our map. Below, assign `map_5e` to a map with circles at every destination that one can travel to on United from SFO. The **color** of each circle should correspond to the **number of flights to that airport from SFO** – the more flights there are, the lighter the circle should be. Furthermore, each circle should be labeled with the name of the airport. Your map should look similar to the one below.

<img src = 'images/examples/5e.png' width = 600>

Some pointers again: 
1. You should use `Circle.map_table` instead of `Marker.map_table`.
2. As we did in Lecture 27, you will need to select columns from `destinations` and then use `relabeled` to change their names so that they match what `Circle.map_table` expects. This table should have four columns.
    - Two of its column names must be `'labels'` and `'color_scale'`.
    - We recommend assigning this table to a variable; we provide `temp_table` for this purpose, but you don't need to use it.
    - There are multiple examples from lecture that you can follow pretty closely.

3. In `Circle.map_table`, set `line_color` to `None`, `fill_opacity` to `0.7`, and `area` to `500` to match our map.

**Note:** Remember the disclaimer at the top of the assignment. Don't forget to set `show = False`.

<!--
BEGIN QUESTION
name: q5e
points: 3
manual: true
-->

In [67]:
temp_table = ...
map_5e = ...
map_5e

<!-- END QUESTION -->



Unsurprisingly, we see lighter circles at the hubs and popular West Coast destinations we discussed before.

### Question 5f

To round out our analysis, we'll look at flight delays, which we've ignored so far. The `'Delay'` column in the `united` table describes the number of minutes each flight was delayed.

As an aside, run the cell below to see a histogram of flight delays for flights to Newark (EWR) and Chicago (ORD).

In [69]:
# Run this cell! Make sure you understand what is being graphed.
united.where('Destination', are.contained_in(['EWR', 'ORD'])) \
      .where('Delay', are.below(60)) \
      .hist('Delay', bins = np.arange(-25, 65, 5), group = 'Destination', density = False)

We'll now work towards creating a scatter plot that describes the total number of flights and number of delayed flights from SFO to each United destination. Before creating the scatter plot, we'll first create a table `dest_with_delay`, which will contain the number of flights and number of delayed flights to every destination. We say a flight is delayed if its `'Delay'` is greater than 0 (positive) in the `united` table. The first few rows of `dest_with_delay` are shown below.

| Destination   |   Number of Flights |   Number of Delayed Flights |
|--------------:|--------------------:|----------------------------:|
| ABQ           |                   6 |                           1 |
| ANC           |                  89 |                          35 |
| ATL           |                 660 |                         287 |
| AUS           |                1251 |                         390 |
| BHM           |                   2 |                           1 |

You'll recognize the first two columns of `dest_with_delay` – they're exactly the same as the columns of `united_grouped`, just with `'count'` relabeled to `'Number of Flights'`.

The third column, `'Number of Delayed Flights'`, comes from grouping `united` with a special function that we've defined for you, called `num_positive`. Run the three cells below to define it and see it in action.

In [70]:
def num_positive(arr):
    # Returns the number of elements in arr that are greater than 0
    return len(arr[arr > 0])

In [71]:
# 3 of these numbers are positive
num_positive(np.array([5, 1, -2, 4, 0]))

In [72]:
# 2081 flights to EWR had a positive delay
num_positive(united.where('Destination', 'EWR').column('Delay'))

The `num_positive` function takes in an array and returns the number of elements in it that are positive. We can use it with `group` to determine the number of delayed flights per destination, like so:

In [73]:
united.group('Destination', num_positive)

**Your job in the cell below** is to put together everything we just told you to create the table `dest_with_delay` that was described earlier. Almost all of the code is already provided for you – you'll just need to stitch everything together to create one table. Our solution is of the following form:

```py
dest_with_delay = united_grouped.with_columns(
      ___, ___.column(___)
).relabeled(___, 'Number of Flights')
```

You can begin by copying this skeleton code into the coding cell.
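If it's unclear what grouping with a function like `num_positive` computes, the following sketch reproduces the idea with plain `numpy` arrays. The destinations and delays here are hypothetical toy values, not rows from `united`, and the explicit loop is only an illustration of the concept, not how the `datascience` library implements `group`.

```python
import numpy as np

def num_positive(arr):
    # Same definition as in the notebook: count the elements greater than 0
    return len(arr[arr > 0])

# Hypothetical toy data, not taken from the `united` table
destinations = np.array(['EWR', 'ORD', 'EWR', 'ORD', 'EWR'])
delays = np.array([12, -3, 0, 45, 7])

delayed_counts = {}
for dest in np.unique(destinations):
    # The boolean mask keeps only the delays for this destination
    delays_for_dest = delays[destinations == dest]
    delayed_counts[str(dest)] = num_positive(delays_for_dest)

print(delayed_counts)
```

Note that a delay of exactly 0 is not counted as delayed, which matches the definition of a delayed flight given above.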

<!--
BEGIN QUESTION
name: q5f
points: 2
-->

In [75]:
dest_with_delay = ...
dest_with_delay

In [None]:
grader.check("q5f")

### Question 5g

Now, assign `fig_5g` to a **scatter plot** with **total number of flights** on the x-axis and **number of delayed flights** on the y-axis. Your scatter plot should look like the example below.

<img src = 'images/examples/5g.png' width = 700>

**Note:** Remember the disclaimer at the top of the assignment. Don't forget to set `show = False`.

_Hint: This should be very straightforward; your code should fit on one line._

In [80]:
fig_5g = ...
fig_5g

<!-- BEGIN QUESTION -->

After creating your visualization above, run the following cell. You should see a picture of your graph. You must run this cell in order to get credit for this question.

<!--
BEGIN QUESTION
name: q5g
points: 1
manual: true
-->

In [81]:
# Run this cell, don't change anything.
save_and_show(fig_5g, 'images/saved/5g.png')

<!-- END QUESTION -->



### Question 5h

Now, assign `fig_5h` to a scatter plot with the same information as in the previous part, but with the following modifications:
- There are too many points in the bottom left of the graph. Only include points where `'Number of Flights'` is **at least** 700.
- Label each point according to the name of the corresponding airport. We can use `labels = ...` to do this.
- Increase the size of each point by setting `s = 50`.
- Match the title in the example below.

<img src = 'images/examples/5h.png' width = 700>

**Note:** Remember the disclaimer at the top of the assignment. Don't forget to set `show = False`.

In [82]:
fig_5h = ...
fig_5h

<!-- BEGIN QUESTION -->

After creating your visualization above, run the following cell. You should see a picture of your graph. You must run this cell in order to get credit for this question.

<!--
BEGIN QUESTION
name: q5h
points: 2
manual: true
-->

In [83]:
# Run this cell, don't change anything.
save_and_show(fig_5h, 'images/saved/5h.png')

<!-- END QUESTION -->



Unsurprisingly, the destinations with more flights have more delayed flights. It doesn't seem like there are any particular outliers; for most destinations, around a third of flights experience some sort of delay.
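As a rough check on that observation, the sketch below computes the fraction of delayed flights for the five rows of `dest_with_delay` shown earlier in this question. Only those five rows are used, so this is an illustration rather than a summary of all destinations; the variable names are our own.

```python
# First five rows of `dest_with_delay`, copied from the table shown earlier:
# (destination, number of flights, number of delayed flights)
rows = [
    ('ABQ', 6, 1),
    ('ANC', 89, 35),
    ('ATL', 660, 287),
    ('AUS', 1251, 390),
    ('BHM', 2, 1),
]

# Fraction of flights to each destination that had a positive delay
delayed_fraction = {dest: delayed / total for dest, total, delayed in rows}

for dest, frac in delayed_fraction.items():
    print(dest, round(frac, 2))
```

Destinations with only a handful of flights, such as ABQ and BHM, can land far from one third simply because each flight shifts the fraction a lot, which is one reason Question 5h restricts the plot to destinations with at least 700 flights.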

# Done!

Congrats! You've finished our last Data 6 homework assignment! This was a long one, so pat yourself on the back!

To submit your work, follow the steps outlined on Ed. **Remember that for this homework in particular, almost all problems will be graded manually, rather than by the autograder.**

The point breakdown for this assignment is given in the table below:

| **Category** | Points |
| --- | --- |
| Autograder | 7 |
| Written (Including Visualizations) | 44 |
| **Total** | 51 |

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export()