# Homework 3: Data Visualization and Python Functions

## Due Saturday, July 23rd at 11:59pm PST

Welcome to Homework 3! This week, we will cover DataFrame manipulations, making visualizations, and defining functions. You can find additional help on these topics in [Notes 11-17](https://notes.dsc10.com/02-data_sets/groupby.html) in the course notes.

### Instructions

This assignment is due on Saturday, July 23rd at 11:59pm. You are given six slip days throughout the quarter to extend deadlines. See the syllabus for more details. With the exception of using slip days, late work will not be accepted unless you have made special arrangements with your instructor.

**Important**: For homeworks, the `otter` tests don't usually tell you that your answer is correct. More often, they help catch careless mistakes. It's up to you to ensure that your answer is correct. If you're not sure, ask someone (not for the answer, but for some guidance about your approach). These are great questions for office hours (see the schedule on the [Calendar](https://dsc10.com/calendar)) or Campuswire. Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged.

**Please do not use for-loops for any questions in this homework.** If you don't know what a for-loop is, don't worry -- we haven't covered them yet. But if you do know what they are and are wondering why it's not OK to use them, it is because loops in Python are slow, and looping over arrays and DataFrames should usually be avoided.
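
As a rough illustration of the alternative to looping, here is a minimal sketch using a made-up array (the name `army_sizes` and its values are hypothetical, not taken from this assignment's data). Arithmetic and aggregation on a NumPy array apply to every element at once, so no loop is needed.

```python
import numpy as np

# Hypothetical army sizes, for illustration only.
army_sizes = np.array([1000, 2500, 400])

# Element-wise arithmetic: every entry is doubled without writing a loop.
doubled = army_sizes * 2

# Aggregations also operate on the whole array at once.
total = army_sizes.sum()

print(doubled)
print(total)
```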

In [1]:
# Please don't change this cell, but do make sure to run it
import babypandas as bpd
import numpy as np

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

import otter
grader = otter.Notebook()

%reload_ext pandas_tutor

## 1. Winter is Coming ❄️ ⚔️

<img src="./images/Game_of_Thrones.jpeg" width=400/>

*Game of Thrones* is a hit fantasy-adventure series based on George R.R. Martin's book series, "A Song of Ice and Fire." The series is set in a fictional world, where powerful families, or "houses", fight for control of the Seven Kingdoms. In this question, we will investigate these battles.
The data used in this question comes from [Chris Albon's "The War of the Five Kings" dataset](https://github.com/chrisalbon/war_of_the_five_kings_dataset), which contains detailed information on all of the battles in the series.

The file named `battles.csv` in the `data/` directory has a row for each battle, and the following columns.

|Column|Description|
|------|-----------|
|`'name'`|The name of the battle.|
|`'year'`|The year in which the battle took place (the dataset contains battles from the years 298-300).|
|`'attacker_king'`|The name of the king who was on the attacking side of the battle.|
|`'defender_king'`|The name of the king who was on the defending side of the battle.|
|`'major_death'`|Indicates whether a major character died during the battle. Values of `1.0` mean that some major character died, `0.0` means not.|
|`'major_capture'`|Indicates whether a major character was captured during the battle. Values of `1.0` mean that some major character was captured, `0.0` means not.|
|`'attacker_size'`|Number of people in the attacking army.|
|`'defender_size'`|Number of people in the defending army.|
|`'summer'`|Indicates if the battle took place in summer. Values of `1.0` mean the battle was in summer, `0.0` means not.|
|`'location'`|The specific location where the battle took place.|
|`'region'`|The more general region where the battle took place.|
|`'attacker_win'`| Indicates whether the attacking army won the battle. Values of `1.0` mean the attackers won, `0.0` means they lost.|

First, we'll read the data in as a DataFrame.

In [2]:
battles = bpd.read_csv('data/battles.csv')
battles

You may notice the DataFrame has many `NaN` values. `NaN` means "not a number," and it's how `babypandas` represents missing values. For the most part, you don't need to worry about these: methods like `.mean()` and `.sum()` skip them automatically.

Let's explore particular columns to get to know the data a little better. The `.describe()` method gives us some useful information about a column. Try it out on the `'name'` column.

In [3]:
battles.get('name').describe()

We learn that this column has 38 values, all of which are unique, and as a result the most frequent name appears only once.

If we try this same command on the `'attacker_king'` column, we'll see that although there are 36 values (and therefore 2 missing values), there are only 4 distinct values. There are many battles with the same `'attacker_king'`. The most common `'attacker_king'` is `'Joffrey/Tommen Baratheon'` with 14 instances. 

In [4]:
battles.get('attacker_king').describe()

**Question 1.1.** We want to use some column as an index to help us better understand what each row represents. Which column would be the best index for this dataset? Set the index of `battles` to whichever column makes the most sense. 

In [5]:
battles = ...
battles

In [None]:
grader.check("q1_1")

**Question 1.2.** Assign `weakest_attack` and `weakest_defense` to the **names** of the battles with the smallest attacking army and smallest defending army, respectively.

Similarly, assign `strongest_attack` and `strongest_defense` to the **names** of the battles with the largest attacking army and largest defending army, respectively.

In the case of a tie, choose any one of the armies involved in the tie.

_**Hint:**_ When sorting values, `NaN` is always sorted to the last position, whether you sort in ascending or descending order. Since there are `NaN` values in this dataset, accessing the last position will probably give you an incorrect answer. Sort accordingly!
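
To see what this hint describes, here is a minimal sketch using `pandas` as a stand-in for `babypandas`, with a made-up Series. The values are hypothetical, and the sketch assumes `babypandas` sorts missing values the same way, as the hint states.

```python
import numpy as np
import pandas as pd

# Hypothetical sizes with one missing value, for illustration only.
sizes = pd.Series([300.0, np.nan, 100.0, 200.0])

ascending = sizes.sort_values(ascending=True)
descending = sizes.sort_values(ascending=False)

# In both orders, the missing value ends up in the last position.
print(ascending.to_numpy())
print(descending.to_numpy())
```

In both printed arrays, the first entry is the one you usually want, while the last entry is `nan`.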

In [9]:
weakest_attack = ...
print("Weakest attack:", weakest_attack)

strongest_attack = ...
print("Strongest attack:", strongest_attack)

weakest_defense = ...
print("Weakest defense:", weakest_defense)

strongest_defense = ...
print("Strongest defense:", strongest_defense)

In [None]:
grader.check("q1_2")

**Question 1.3.** Leaders often like to choose where their battles are fought so they can gain an advantage over their opponent. Make a DataFrame named `river_north_storm` containing only the battles from `"The Riverlands"`, `"The North"`, and `"The Stormlands"` regions. All columns of `battles` should be included.

In [15]:
river_north_storm = ...
river_north_storm

In [None]:
grader.check("q1_3")

**Question 1.4.** Make an appropriate plot that would help you answer the question,

> Among battles from `"The Riverlands"`, `"The North"`, and `"The Stormlands"` regions, do those with larger `attacker_size` also have larger `defender_size`?

You only need to make a plot; you don't need to answer the question above.

_**Note**_: The plot should only include battles for which you have available data for both `attacker_size` and `defender_size`.

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q1_4
manual: true
-->

In [20]:
# Create your plot here
...

<!-- END QUESTION -->



**Question 1.5.** Some characters in *Game of Thrones* have a large impact on the plot. When they die in battle, we record their deaths in the `major_death` column, which has a value of `1.0` if a major death occurred in that battle and `0.0` otherwise.

Create a DataFrame named `major_deaths`, indexed by `'attacker_king'`. This DataFrame should have one column called `'num_major'` that contains the total number of major deaths that each `attacker_king` saw in their battles.

_**Hint:**_ You will need to change the names of columns, which you can do using `.assign` and `.drop`. Instead of using `.drop`, you may want to use `.get` and pass in a `list` containing the name of a single column that you want to keep (this was done in Lecture 8).

In [21]:
major_deaths = ...
major_deaths

In [None]:
grader.check("q1_5")

**Question 1.6.** It turns out that there is at least one `'attacker_king'` that never had any major deaths in their battles. Below, assign `happy_kings` to an array of the name(s) of these `'attacker_king'`(s).

In [26]:
happy_kings = ...
happy_kings 

In [None]:
grader.check("q1_6")

**Question 1.7.** Suppose that you are royalty in the *Game of Thrones* universe, and you want to conquer more land. You want to team up with the strongest kings, so you're trying to find out who can give you the best attacking army. We want to find the kings whose armies have the largest average size (`'attacker_size'` column), who get rid of their enemies' leaders most often on average (`'major_death'` column), and who win most often on average (`'attacker_win'` column).


Create a DataFrame called `mean_stats`, indexed by `'attacker_king'`, that contains the means of these three columns for each king. `mean_stats` should only have these three columns, in this order: `'attacker_size'`, `'major_death'`, and `'attacker_win'`.




In [30]:
mean_stats = ...
mean_stats

In [None]:
grader.check("q1_7")

**Question 1.8.** While it might make for an entertaining show, it would be no fun to participate in a big battle. From the perspective of a soldier in battle, we'll say that a battle is considered "bad" if there were lots of attackers/defenders involved, if important people were captured, and if important people were killed. Additionally, we'll say that battles that take place in summer are considered worse, as the heat makes it harder for soldiers to carry heavy armor and equipment.

A battle's "badness rating" is a weighted average of these five columns, with the following weights:

- ``'attacker_size'``: 25%
- ``'defender_size'``: 25%
- ``'major_death'``: 25%
- ``'major_capture'``: 15%
- ``'summer'``: 10%

Define a function called `calculate_badness` that takes in a battle's name and outputs the battle's "badness rating".

_**Hint:**_ It may be helpful to work out an example by hand to ensure you know how the calculation is meant to be performed. For example, `'Battle of the Whispering Wood'` has an `'attacker_size'` of 1875, `'defender_size'` of 6000, `'major_death'` of 1.0, `'major_capture'` of 1.0, and `'summer'` of 1.0, so  `calculate_badness("Battle of the Whispering Wood")` should return 1969.25. Once you've implemented `calculate_badness`, you should verify that your function works as intended. This is good practice in general!
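
The hand calculation in the hint above can be sketched as plain arithmetic. This only reproduces the worked example with the values quoted in the hint; it does not look anything up in `battles`, and the variable names are chosen for illustration.

```python
# Values for 'Battle of the Whispering Wood', as quoted in the hint above.
attacker_size = 1875
defender_size = 6000
major_death = 1.0
major_capture = 1.0
summer = 1.0

# Weighted average using the weights listed above (they add up to 100%).
badness = (
    0.25 * attacker_size
    + 0.25 * defender_size
    + 0.25 * major_death
    + 0.15 * major_capture
    + 0.10 * summer
)
print(badness)
```

The printed value should agree with 1969.25, up to floating-point rounding.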

In [35]:
def calculate_badness(name):
    ...

In [None]:
grader.check("q1_8")

**Question 1.9.** Use the `calculate_badness` function you've already written, along with the `.apply` method, to create an **array** called `badness_array` that contains the "badness rating" of each battle, in the same order as the rows of the `battles` DataFrame. Many battles may not have a "badness rating" if some data needed to calculate it is missing.

_**Hint**_: Note that the `.apply` method allows you to apply a function to any *column* in a DataFrame, but not to the index. Instead, try using `.apply` on a version of the DataFrame that has the index reset to its default.

In [40]:
badness_array = ...
badness_array

In [None]:
grader.check("q1_9")

**Question 1.10.** Create a DataFrame called `with_badness` that contains all the columns of `battles` plus one more called `'badness'`, containing the values in `badness_array`. Order the rows in descending order of `'badness'`. 

Then, save the name of the worst battle as `worst_battle`. Here, the worst means having the largest "badness rating."

In [44]:
with_badness = ...
worst_battle = ...
print('The battle with the largest "badness rating" is:', worst_battle)
with_badness

In [None]:
grader.check("q1_10")

## 2. Club Penguin 🐧 

In this question, we will explore a dataset containing size measurements for three penguin species observed on three islands in the Palmer Archipelago, Antarctica. The data was collected by Dr. Kristen Gorman, a marine biologist, from 2007 to 2009. Below is a picture of Dr. Gorman and her fellow researchers in action:

<img src="./images/q2_dr_gorman.jpg" width=300/> 

Let's break down the columns in our dataset. 

#### `'species'`

There are three species of penguin in our dataset: Adelie, Chinstrap, and Gentoo.

<img src="./images/q2_lter_penguins.png" width=350/>

#### `'island'`

Dr. Gorman recorded data on penguins from three islands: Biscoe, Dream, and Torgersen. Below is a map:

<img src="./images/q2_map.png" width=350/>

#### `'bill_length_mm'` and  `'bill_depth_mm'`

These are physical attributes of individual penguins. Below is an illustration:

<img src="./images/q2_culmen_depth.png" width=350/>

#### `'flipper_length_mm'`, `'body_mass_g'`, and `'sex'`

Although penguins are birds that cannot fly, their wing, or flipper, structures are optimized for swimming. Here is a National Geographic [video](https://youtu.be/A9mbCNs47FI) showcasing the amazing swimming ability of penguins. The mass of the penguin is another attribute that was recorded in grams, and the sex of the penguin was recorded as either male or female. 

Run the next cell to load in the data. We have cleaned the data beforehand to ensure there are no missing values. Take some time to look at a few rows of the DataFrame to see what information is recorded.

In [50]:
# Run this cell to load the dataset.
penguins = bpd.read_csv('data/penguins.csv')
penguins

**Question 2.1.** Suppose we're curious about how the mean bill length varies between penguin species. Assign `species_mean_bill` to a DataFrame with a single column, `'bill_length_mm'`, that contains the mean bill length for each species. `'species'` should be the index.

In [51]:
species_mean_bill = ...
species_mean_bill

In [None]:
grader.check("q2_1")

**Question 2.2.** Below, write code to find the name of the species with the largest mean bill length, and assign the name `species_largest_bill` to the result.

<!--
BEGIN QUESTION
name: q2_2
-->

In [54]:
species_largest_bill = ...
species_largest_bill

In [None]:
grader.check("q2_2")

**Question 2.3.** We want to visualize the **mean bill length** (in mm) for each species. Which visualization is the most appropriate? Assign an integer from 1 to 6 representing your answer to the name `q2_3`.

 1. Histogram
 2. Line plot
 3. Scatterplot
 4. Bar chart
 5. Boxplot
 6. None of the above

In [57]:
q2_3 = ...
q2_3

In [None]:
grader.check("q2_3")

**Question 2.4.** Using your `species_mean_bill` DataFrame from Question 2.1, generate a visualization that shows the mean bill length (in mm) for each species. Don't worry about sorting.

_**Hint:**_ Your visualization type should be the one you selected in Question 2.3. If it doesn't make sense, try a different type of visualization!

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2_4
manual: true
-->

In [60]:
# Create your plot here
...

<!-- END QUESTION -->



**Question 2.5.** Assign `tor_ade` to a DataFrame that only contains rows for Adelie penguins on Torgersen Island. `tor_ade` should have all of the columns in `penguins` except for `'species'` and `'island'`, which will now be redundant (since all of the penguins in `tor_ade` will have the same `'species'` and `'island'`).

In [61]:
tor_ade = ...
tor_ade

In [None]:
grader.check("q2_5")

**Question 2.6.** Calculate the proportion of Adelie penguins on Torgersen Island that have a mass of over 4000 grams, and assign the result to `proportion_above_4000g`.

In [65]:
proportion_above_4000g = ...
proportion_above_4000g

In [None]:
grader.check("q2_6")

**Question 2.7.** Create a density histogram showing the distribution of the body mass of Adelie penguins on Torgersen Island. Your histogram should have 10 bars in total (you can accomplish this by setting `bins=10`).

Make sure to set `density=True` to create a density histogram, and set `ec='w'` to make the separation of bars more clear.

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2_7
manual: true
-->

In [68]:
# Create your plot here
...

<!-- END QUESTION -->



<br>

**Question 2.8.** Finally, we're interested in understanding the median `'flipper_length_mm'` of penguins from every combination of `'species'` and `'island'`. 

Below, assign `median_flipper_lengths` to a DataFrame with three columns, `'species'`, `'island'`, and `'flipper_length_mm'`. Each row of `median_flipper_lengths` should correspond to a different combination of `'species'` and `'island'`.

_**Hint 1:**_ In order to create the desired DataFrame, you will have to group by more than one column. Please read [Note 11.4](https://notes.dsc10.com/02-data_sets/groupby.html) on grouping with subgroups for more information.

_**Hint 2:**_ Remember to use `.reset_index()` after you group by multiple columns to get rid of the "multi-index".

<!--
BEGIN QUESTION
name: q2_8
-->

In [69]:
median_flipper_lengths = ...
median_flipper_lengths

In [None]:
grader.check("q2_8")

Say goodbye to our penguins!

<img src="./images/q2_all3.png" width=350/>

## 3. Ramen 🍜

Instant ramen was invented by Momofuku Ando in 1958 to help relieve food shortages in postwar Japan. It started off as a single product, but the ramen industry has expanded over the years, and there are now over 100 different kinds of instant ramen. Ramen has made its way not only to the dinner table but also to popular game shows, TV series, and even museums! Click [here](https://www.cupnoodles-museum.jp/en/osaka_ikeda/) to read more!


<img src="./images/noodles-lowres-8607.png" width=350/>

We have a [dataset of instant ramen ratings from Kaggle](https://www.kaggle.com/datasets/residentmario/ramen-ratings?resource=download). First, we'll read in the data from a CSV. There is no good index, so we will leave it unset.

In [75]:
ramen_data = bpd.read_csv('data/ramen-rating.csv')
ramen_data

Notice that the `'Country'` column contains a country code. We want to convert these country codes into actual country names that everyone can understand.

We'll use a Python [dictionary](https://www.tutorialspoint.com/python/python_dictionary.htm) to help us with this conversion. A dictionary is a simple way to map each unique key to a value. For example, the dictionary below maps course codes to course names.

In [76]:
dsc_courses = {
    # key: value
    'DSC 10': 'Principles of Data Science',
    'DSC 20': 'Programming and Basic Data Structures for Data Science',
    'DSC 30': 'Data Structures and Algorithms for Data Science',
    'DSC 40A': 'Theoretical Foundations of Data Science I',
    'DSC 40B': 'Theoretical Foundations of Data Science II',
    'DSC 80': 'The Practice and Application of Data Science'
}

We can access the value corresponding to each key using bracket notation.

In [77]:
dsc30_name = dsc_courses['DSC 30']
dsc30_name

Here, `'DSC 30'` is the key and `'Data Structures and Algorithms for Data Science'` is the value.

Let's use a dictionary to help us with our country code to country name conversion. Below is a dictionary containing country codes as keys and country names as values for each of the countries in our ramen dataset.

In [78]:
# Run this cell, DO NOT change it.
country_codes = {
    'AU':'Australia',
    'BD':'Bangladesh', 
    'BR':'Brazil', 
    'KH':'Cambodia' , 
    'CA':'Canada', 
    'CN':'China',
    'CO':'Colombia', 
    'DXB':'Dubai' , 
    'EE':'Estonia' , 
    'FIJI':'Fiji', 
    'FI':'Finland' , 
    'DE':'Germany',
    'GHAN':'Ghana' , 
    'NL':'Holland', 
    'HK':'Hong Kong', 
    'HU':'Hungary', 
    'IN':'India', 
    'ID':'Indonesia',
    'JP':'Japan', 
    'MY':'Malaysia', 
    'MX':'Mexico', 
    'MM':'Myanmar', 
    'NP':'Nepal', 
    'AN':'Netherlands',
    'NG':'Nigeria', 
    'PK':'Pakistan', 
    'PH':'Philippines', 
    'PL':'Poland', 
    'SWK':'Sarawak',
    'SG':'Singapore', 
    'KOR':'South Korea', 
    'SE':'Sweden', 
    'TW':'Taiwan', 
    'TH':'Thailand', 
    'UK' :'United Kingdom' ,
    'USA':'United States', 
    'VN':'Vietnam' 
    }

**Question 3.1.** Using the dictionary `country_codes`, define a function named `code_to_country` that takes as input a country code and returns the corresponding country's name. This should only take one line of code.

_**Hint 1:**_ If you're stuck, take a look at the DSC 30 example above.

_**Hint 2:**_ Once you've implemented `code_to_country`, you should verify that it works as intended by trying a few examples yourself. The provided tests will **not** do this for you.

In [79]:
def code_to_country(code):
    ...

In [None]:
grader.check("q3_1")

**Question 3.2.** Use your `code_to_country` function and the `.apply` method to convert all of the country codes in the `'Country'` column of `ramen_data` into country names. Do this without creating an additional column or reordering the existing columns. Assign the resulting DataFrame to the variable name `ramen`.

_**Hint:**_ Is there a way to use the `.assign` method to *replace* values in this column without creating an additional column? See if you can find out by reading [the `babypandas` documentation](https://babypandas.readthedocs.io/en/latest/index.html).

**Important**: For the rest of the questions in this section, use the DataFrame `ramen` instead of `ramen_data`.

In [84]:
ramen = ...
ramen

In [None]:
grader.check("q3_2")

**Question 3.3.** 
Define a function named `word_count` that returns the number of words in a ramen's `'Variety'`. It should take as input a string from the `'Variety'` column and return the number of words in that string. We'll consider a piece of text to be a word if and only if it is separated from adjacent words by a space.
For example:
- `word_count('Cup Noodles Chicken Vegetable')` should return 4.
- `word_count('Tonkotsu-Shoyu Rich Pork Flavor Ramen')` should return 5. Notice that `'Tonkotsu-Shoyu'` counts as one word.

_**Hint**_: The string method [`.split()`](https://docs.python.org/3/library/stdtypes.html#str.split) will be helpful.
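
As a quick illustration of what `.split` produces, here is a minimal sketch on a made-up variety name (the string is hypothetical, not taken from the dataset). Splitting on a space returns a list of the pieces between spaces, so a hyphenated piece stays together.

```python
# A made-up variety name, for illustration only.
variety = 'Spicy Beef-Flavor Noodles'

# Split the string wherever a space appears.
pieces = variety.split(' ')
print(pieces)  # ['Spicy', 'Beef-Flavor', 'Noodles']
```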

In [90]:
def word_count(variety):
    ...
    
# Test cases for your own reference. Feel free to test out more!
print(word_count('Cup Noodles Chicken Vegetable'))  # Should print 4
print(word_count('Tonkotsu-Shoyu Rich Pork Flavor Ramen')) # Should print 5

In [None]:
grader.check("q3_3")

**Question 3.4.** Create a DataFrame called `with_word_count` with columns from left to right `'Brand'`, `'Country'`, `'Variety'` and `'Stars'` and a new column `'Word_Count'` that has the word count for each variety. Sort the DataFrame in descending order of `'Word_Count'`.

_**Note**_: The `'Country'` column should have full country names, not codes.

In [95]:
with_word_count = ...
with_word_count


In [None]:
grader.check("q3_4")

**Question 3.5.** How many words does the longest ramen `'Variety'` have? Assign this number to `most_ramen_words`. How many words does the shortest ramen `'Variety'` have? Assign this number to `fewest_ramen_words`. What is the absolute difference between these values? Assign this number to `range_ramen_words`.

In [101]:
most_ramen_words = ...
fewest_ramen_words = ...
range_ramen_words = ...
print('Most ramen words:', most_ramen_words)
print('Fewest ramen words:', fewest_ramen_words)
print('Range of ramen words:', range_ramen_words)

In [None]:
grader.check("q3_5")

**Question 3.6.** Create a function named `mean_word_count` that takes as an input the name of a ramen brand and returns the average `'Word_Count'` for all ramen varieties belonging to that brand.

In [106]:
def mean_word_count(brand):
    ...

In [None]:
grader.check("q3_6")

**Question 3.7.** Create a horizontal bar chart that displays the mean word count for all ramen brands that have **more than ten varieties**. Sort the bars so the brands whose varieties have the most words on average appear at the very top, and those with the fewest words on average appear at the bottom.

_**Hint 1:**_ If you use `.groupby` more than once on the same DataFrame, the order of rows will be the same, even with different aggregation methods.
 
_**Hint 2:**_ To get the bar chart to display nicely, try adjusting the optional `figsize` argument, as we did in Lecture 8.
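
Hint 1 can be checked on a toy example. The sketch below uses `pandas` as a stand-in for `babypandas`, and the DataFrame, its column names, and its values are all hypothetical. It groups the same data twice with different aggregation methods and compares the order of the resulting rows.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    'brand': ['B', 'A', 'B', 'C', 'A'],
    'words': [3, 5, 4, 2, 7],
})

counts = df.groupby('brand').count()
means = df.groupby('brand').mean()

# Both results are indexed by the same groups, in the same order.
print(list(counts.index))
print(list(means.index))
```

Because the rows line up, a column from one grouped result can be used alongside the other.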

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q3_7
manual: true
-->

In [111]:
# Create your plot here
...

<!-- END QUESTION -->

**Question 3.8.** Define a function named `point_total` that takes in a full country name and returns a point total for that country's ramen, according to the following scheme:
- 1 point for every variety of ramen that has at least 1 star and less than 3 stars,
- 2 points for every variety with at least 3 stars and less than 4 stars, and
- 3 points for every variety with at least 4 stars (and at most 5 stars, which is the maximum possible).

|Points Received | Stars (Condition)| 
| --- | --- | 
|1| $[1,3)$|
|2| $[3, 4)$ | 
|3| $[4,5]$ |

_**Hint:**_ Make sure that your function works for countries that don't have varieties of ramen at every possible number of stars. If you aren't able to accomplish this using grouping, try another strategy!
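
To make the interval boundaries in the table concrete, here is a minimal sketch that scores a made-up array of star ratings. The ratings and variable names are hypothetical, and the sketch assumes that a rating below 1 star earns no points, which the scheme above does not state. It is not a full `point_total`, since it never looks at `ramen` or at a country.

```python
import numpy as np

# Hypothetical star ratings, for illustration only.
stars = np.array([0.5, 1.0, 2.75, 3.0, 3.99, 4.0, 5.0])

# Boolean arrays marking which ratings fall in each interval.
one_point = (stars >= 1) & (stars < 3)     # [1, 3)
two_points = (stars >= 3) & (stars < 4)    # [3, 4)
three_points = (stars >= 4) & (stars <= 5) # [4, 5]

# Count the ratings in each interval and weight them by their points.
total = (
    1 * np.count_nonzero(one_point)
    + 2 * np.count_nonzero(two_points)
    + 3 * np.count_nonzero(three_points)
)
print(total)  # 12
```

Here two ratings land in each interval, so the total is 2 × 1 + 2 × 2 + 2 × 3 = 12. Note that 3.0 earns 2 points and 4.0 earns 3 points, because each interval includes its left endpoint.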

In [113]:
def point_total(country):
    ...

In [None]:
grader.check("q3_8")

**Question 3.9.** Among the five countries listed below, which has the **highest** point total, using the points system from Question 3.8?

-  `'United States'`
-  `'Canada'`
-  `'Sweden'`
-  `'China'`
-  `'Japan'`

Save the name (not country code) of the country as `country` and the country's number of points as `points`. You can set the value of `country` and `points` by hand for this question based on the output of the function you just wrote, for various inputs.

In [119]:
country = ...
points = ...

In [None]:
grader.check("q3_9")

## 4. Histograms 🧑‍💻

Suppose we have a DataFrame called `data` with two numerical columns, `'x'` and `'y'`. Consider the following scatter plot, which was generated by calling `data.plot(kind='scatter', x='x', y='y')`:

<img src="./images/q4_scatter.png" width=450/>

Now consider these two histograms:

**Histogram A**:

<img src="./images/q4_hist_one.png" width=450/>

**Histogram B**:

<img src="./images/q4_hist_two.png" width=450/>

**Question 4.1.** Which of these two lines of code generated Histogram A? Assign either `1` or `2` to `which_code`.
 1. `data.plot(kind='hist', density=True, y='x')`
 2. `data.plot(kind='hist', density=True, y='y')`  

In [124]:
which_code = ...

In [None]:
grader.check("q4_1")

**Question 4.2.** Suppose we run this block of code:

```py
new_data = bpd.DataFrame().assign(
    x = data.get('x') / 4,
    y = data.get('y')
)
```
    
We then run `new_data.plot(kind='hist', density=True, y='x')`. How will this new histogram look compared to the original histogram, `data.plot(kind='hist', density=True, y='x')`, assuming both histograms are drawn on the same scale, with fixed axes? Assign `histogram_difference` to either 1, 2, 3, or 4, corresponding to your choice.

1. The new histogram will be wider and taller than the original histogram.
2. The new histogram will be wider and shorter than the original histogram.
3. The new histogram will be narrower and taller than the original histogram.
4. The new histogram will be narrower and shorter than the original histogram.

_**Hint:**_ Look at the end of Lecture 8 for an example of two histograms drawn on the same scale with fixed axes.

In [127]:
histogram_difference = ...

In [None]:
grader.check("q4_2")

**Question 4.3.** Below, we show Histogram B again.

<img src="./images/q4_hist_two.png" width=450/>

What **percent** of values in Histogram B are between -5 (inclusive) and -2 (exclusive)? While we cannot answer this question exactly since we do not know where the bins start and end, we can still approximate the answer. Assign the variable `are_between` to a number 1 through 5, corresponding to the closest answer.

1. 10% 
2. 13% 
3. 25%
4. 38%
5. 52%
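
The approximation relies on the fact that, in a density histogram, the area of a bar (width times height) equals the proportion of values in that bin. Here is a minimal sketch with made-up bars; the widths and heights are hypothetical and deliberately not read off Histogram B.

```python
# Hypothetical bars of a density histogram: (bin width, bar height).
bars = [(2, 0.05), (2, 0.10)]

# Area = width * height gives the proportion of values in each bin.
proportion = bars[0][0] * bars[0][1] + bars[1][0] * bars[1][1]

# Convert the proportion to a percent.
percent = proportion * 100
print(percent)
```

With these made-up bars, about 30% of the values fall in the two bins combined.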

In [130]:
are_between = ...

In [None]:
grader.check("q4_3")

## Finish Line 

To submit your assignment:

1. Select `Kernel -> Restart & Run All` to ensure that you have executed all cells, including the test cells.
2. Read through the notebook to make sure everything is fine and all tests passed.
3. Run the cell below to run all tests, and make sure that they all pass.
4. Download your notebook using `File -> Download as -> Notebook (.ipynb)`, then upload your notebook to Gradescope.

In [133]:
grader.check_all()