# Homework 3: Data Visualization and Python Functions

## Due Saturday, January 29th at 11:59pm PST

Welcome to Homework 3! This week, we will cover DataFrame manipulations, making visualizations, and defining functions. You can find additional help on these topics in [Notes 11-17](https://notes.dsc10.com/02-data_sets/groupby.html) in the course notes.

### Instructions

This assignment is due on Saturday, January 29th at 11:59pm. You are given six slip days throughout the quarter to extend deadlines. See the syllabus for more details. With the exception of using slip days, late work will not be accepted unless you have made special arrangements with your instructor.

**Important**: For homeworks, the `otter` tests don't usually tell you that your answer is correct. More often, they help catch careless mistakes. It's up to you to ensure that your answer is correct. If you're not sure, ask someone (not for the answer, but for some guidance about your approach). These are great questions for office hours (see the schedule on the [Calendar](https://dsc10.com/calendar)) or Campuswire. Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged.

**Please do not use for-loops for any questions in this homework.** If you don't know what a for-loop is, don't worry -- we haven't covered them yet. But if you do know what they are and are wondering why it's not OK to use them, it is because loops in Python are slow, and looping over arrays and DataFrames should usually be avoided.

In [None]:
# Please don't change this cell, but do make sure to run it
import babypandas as bpd
import numpy as np

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

import otter
grader = otter.Notebook()

## 1. Gotta Catch 'Em All! ⚡️

<img src="./images/q1_pokemon.png" width=400/>

Pokémon is an immensely popular video game and animation franchise that originated from Japan in 1996. Pokémon, short for Pocket Monsters, come in a variety of unique types and have several kinds of stats.  In this problem, we will investigate how attack and defense stats vary among these types up to the seventh generation.

The file named `pokedex.csv` in the `data/` directory has a row for each Pokémon, and the following columns.

|Column|Description|
|------|-----------|
|`'pokedex_number'`|The Pokémon's identification number in an encyclopedia of all Pokémon.|
|`'name'`|The name of the Pokémon.|
|`'type'`|The categorical type of the Pokémon, for example, "normal", "fire", "water". Each Pokémon is limited to one type for simplicity.|
|`'attack'`|The Pokémon's power for physical moves.|
|`'defense'`|The Pokémon's ability to prevent damage from attacks.|
|`'hp'`| Hit Points. Indicates how much damage a Pokémon can tolerate.|
|`'sp_attack'`|Special Attack. The Pokémon's power for special moves.|
|`'sp_defense'`|Special Defense. The Pokémon's ability to prevent damage from special attacks.|
|`'generation'`|A group of Pokémon that are compatible for Pokémon games.|
|`'is_legendary'`|Indicates whether the Pokémon is legendary. Legendary Pokémon are rare and powerful. 1 means legendary, 0 means not.|

First, we read the data in as a DataFrame.

In [None]:
pokedex = bpd.read_csv('data/pokedex.csv')
pokedex

Let's explore particular columns to get to know the data a little better. The `.describe()` method gives us some useful information about a column. Try it out on the `'name'` column.

In [None]:
pokedex.get('name').describe()

We learn that this column has 801 values, all of which are unique, and as a result the most frequent name appears only once.

If we try this same command on the `'type'` column, we'll see that although there are 801 entries, only 18 of them are unique. There are many Pokémon with the same `'type'`. The most common `'type'` is "water"; there are 114 such Pokémon. 

In [None]:
pokedex.get('type').describe()
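
To make the numbers in these summaries concrete, here is a minimal sketch in plain Python (not babypandas) that computes the same four kinds of quantities described above: how many entries there are, how many are unique, which value is most common, and how often it appears. The list `toy_types` and all variable names are made up for illustration and are not part of `pokedex.csv`.

```python
from collections import Counter

# Hypothetical toy data, not taken from pokedex.csv
toy_types = ['water', 'fire', 'water', 'grass', 'water', 'fire']

count = len(toy_types)        # total number of entries
unique = len(set(toy_types))  # number of distinct values

# most_common(1) gives a list holding the single most frequent (value, frequency) pair
top, freq = Counter(toy_types).most_common(1)[0]

print(count, unique, top, freq)  # 6 3 water 3
```

When every value is unique, as with `'name'`, the frequency of the most common value is 1.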

**Question 1.1.** Which would be a better choice of index for this dataset, `'name'` or `'type'`? Set the index of `pokedex` to whichever of these two attributes makes more sense.

In [None]:
pokedex = ...
pokedex

In [None]:
grader.check("q1_1")

**Question 1.2.** Assign `weakest_attack` and `weakest_defense` to the names of the weakest Pokémon in terms of attack and defense respectively.

Similarly, assign `strongest_attack` and `strongest_defense` to the names of the strongest Pokémon in terms of attack and defense respectively.

In the case of a tie, choose any one of the equally weakest or equally strongest Pokémon.

In [None]:
weakest_attack = ...
print("Weakest attack:", weakest_attack)

strongest_attack = ...
print("Strongest attack:", strongest_attack)

weakest_defense = ...
print("Weakest defense:", weakest_defense)

strongest_defense = ...
print("Strongest defense:", strongest_defense)

In [None]:
grader.check("q1_2")

**Question 1.3.** Typically at the beginning of a game, the Pokémon trainer (the player) has to make a choice between Pokémon of `'type'` "water", "grass", and "fire". Make a DataFrame named `water_grass_fire` containing only the "water", "grass", and "fire" Pokémon. All columns of `pokedex` should be included.

In [None]:
water_grass_fire = ...
water_grass_fire

In [None]:
grader.check("q1_3")

**Question 1.4.** Create a DataFrame named `legendary_pokemon`, indexed by `'type'` and having one column, called `'num_legendary'`, that contains the number of legendary Pokémon of each type.

_**Hint:**_ You will need to drop and rename columns. Instead of using `.drop`, you may want to use `.get` with a list containing the name of a single column (this was done in Lecture 8).

In [None]:
legendary_pokemon = ...
legendary_pokemon

In [None]:
grader.check("q1_4")

**Question 1.5.** It turns out that there are a few `'type'`s that don't have any legendary Pokémon 😢. Below, assign `non_legendary` to an array of the names of these `'type'`s.

In [None]:
non_legendary = ...
non_legendary

In [None]:
grader.check("q1_5")

**Question 1.6.** Suppose that as a Pokémon trainer, you want to assemble a strong team of Pokémon of various `'type'`s. Create a DataFrame called `median_stats`, indexed by `'type'`, that contains the median statistics for Pokémon of each type. `median_stats` should have five columns: `'attack'`, `'defense'`, `'hp'`, `'sp_attack'`, and `'sp_defense'`.

In [None]:
median_stats = ...
median_stats

In [None]:
grader.check("q1_6")

**Question 1.7.** A strong Pokémon is one that has high values for `'attack'`, `'defense'`, `'hp'`, `'sp_attack'`, and `'sp_defense'`. Suppose that you develop a formula to summarize all of these stats into a single number called strength. The strength of a Pokémon is a weighted average of these five stats, where each stat is weighted as follows:

- `'attack'`: 25%
- `'defense'`: 25%
- `'hp'`: 10%
- `'sp_attack'`: 30%
- `'sp_defense'`: 10%

Define a function called `calculate_strength` that takes as input the `'pokedex_number'` of a Pokémon and returns its strength, as defined above.

_**Hint:**_ It may be helpful to work out an example by hand to ensure you know how the calculation is meant to be performed. For example, since Absol (`'pokedex_number'` of 359) has an `'attack'` of 150, `'defense'` of 60, `'hp'` of 65, `'sp_attack'` of 115, and `'sp_defense'` of 60, its strength should be 99.5, and so `calculate_strength(359)` should return 99.5. Once you've implemented `calculate_strength`, you should verify that your function works as intended. This is good practice in general.
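
The by-hand calculation for Absol can be written out as plain arithmetic. This is only a sketch of the weighted average using the numbers quoted in the hint; it does not look anything up in `pokedex`, and the variable name `absol_strength` is made up for illustration.

```python
# Each stat for Absol (from the hint) multiplied by its weight (from the question)
absol_strength = (
    0.25 * 150    # attack
    + 0.25 * 60   # defense
    + 0.10 * 65   # hp
    + 0.30 * 115  # sp_attack
    + 0.10 * 60   # sp_defense
)

print(absol_strength)  # approximately 99.5
```

Because the weights are decimals, small floating-point rounding differences are possible, so compare results like this approximately rather than expecting an exact match in every case.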

In [None]:
def calculate_strength(number):
    ...

In [None]:
grader.check("q1_7")

**Question 1.8.** Create a DataFrame called `with_strength` that contains all the columns of `pokedex` plus one more called `'strength'`, containing the strength of each Pokémon as defined in the previous question. Order the rows in descending order of `'strength'`.

_**Hint**_: Use the `calculate_strength` function you've already written, along with the `.apply` method. This should only take one line of code; don't use a for-loop (don't worry if you don't know what that is).

In [None]:
with_strength = ...
with_strength

In [None]:
grader.check("q1_8")

**Question 1.9.** Make a plot that will help you answer the question,

> Among **fire** Pokémon, do those with stronger attack power also have stronger defense power?

Consider only regular `'attack'` and `'defense'`, **not** special (`'sp_attack'` and `'sp_defense'`). All you need to do for this question is create a plot.

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q1_9
manual: true
-->

In [None]:
# Create your plot here
...

<!-- END QUESTION -->



**Question 1.10.** Considering that Pokémon `'generation'`s were developed in order, we might wonder how Pokémon have evolved over time. Draw a line plot that shows the proportion of legendary Pokémon in each generation. This kind of plot might help you answer the question

> Are later-generation Pokémon more likely to be legendary?

_**Hint 1:**_ You'll have to do some DataFrame manipulation before you can create the line plot; this is typical when visualizing data.

_**Hint 2:**_ A proportion is a value between 0 and 1. 0 means "none" and 1 means "all". The proportion of 1s amongst the numbers 0, 0, 1, 1, 0 is 0.4, which is also the average of these five numbers.
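
The claim in Hint 2 can be checked directly in plain Python. This small sketch uses only the five numbers from the hint; the variable names are made up for illustration.

```python
values = [0, 0, 1, 1, 0]

# Proportion of 1s: how many 1s there are, divided by how many numbers there are
proportion_of_ones = values.count(1) / len(values)

# Average: the sum of the numbers, divided by how many numbers there are
average = sum(values) / len(values)

print(proportion_of_ones, average)  # 0.4 0.4
```

The two agree because adding up a collection of 0s and 1s is the same as counting the 1s.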

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q1_10
manual: true
-->

In [None]:
# Create your plot here
...

<!-- END QUESTION -->



## 2. Club Penguin 🐧 

In this question, we will explore a dataset containing size measurements for three penguin species observed on three islands in the Palmer Archipelago, Antarctica. The data was collected by Dr. Kristen Gorman, a marine biologist, from 2007 to 2009. Below is a picture of Dr. Gorman and her fellow researchers in action:

<img src="./images/q2_dr_gorman.jpg" width=300/> 

Let's break down the columns in our dataset. 

#### `'species'`

There are three species of penguin in our dataset: Adelie, Chinstrap, and Gentoo.

<img src="./images/q2_lter_penguins.png" width=350/>

#### `'island'`

Dr. Gorman recorded data on penguins from three islands: Biscoe, Dream, and Torgersen. Below is a map:

<img src="./images/q2_map.png" width=350/>

#### `'bill_length_mm'` and  `'bill_depth_mm'`

These are physical attributes of individual penguins. Below is an illustration:

<img src="./images/q2_culmen_depth.png" width=350/>

#### `'flipper_length_mm'`, `'body_mass_g'`, and `'sex'`

Although penguins are birds that cannot fly, their wing, or flipper, structures are optimized for swimming. Here is a National Geographic [video](https://youtu.be/A9mbCNs47FI) showcasing the amazing swimming ability of penguins. The mass of the penguin is another attribute that was recorded in grams, and the sex of the penguin was recorded as either male or female. 

Run the next cell to load in the data. We have cleaned the data beforehand to ensure there are no missing values. Take some time to look at a few rows of the DataFrame to see what information is recorded.

In [None]:
# Run this cell to load the dataset.
penguins = bpd.read_csv('data/penguins.csv')
penguins

**Question 2.1.** Suppose we're curious about how the mean bill length varies between penguin species. Assign `species_mean_bill` to a DataFrame with a single column, `'bill_length_mm'`, that contains the mean bill length for each species. `'species'` should be the index.

In [None]:
species_mean_bill = ...
species_mean_bill

In [None]:
grader.check("q2_1")

**Question 2.2.** Below, write code to find the name of the species with the largest mean bill length, and assign the name `species_largest_bill` to the result.

<!--
BEGIN QUESTION
name: q2_2
-->

In [None]:
species_largest_bill = ...
species_largest_bill

In [None]:
grader.check("q2_2")

**Question 2.3.** We want to visualize the **mean bill length** (in mm) for each species. Which visualization is the most appropriate? Assign an integer from 1 to 6 representing your answer to the name `q2_3`.

 1. Histogram
 2. Line plot
 3. Scatterplot
 4. Bar chart
 5. Boxplot
 6. None of the above

In [None]:
q2_3 = ...
q2_3

In [None]:
grader.check("q2_3")

**Question 2.4.** Using your `species_mean_bill` DataFrame from 2.1, generate a visualization that shows the mean bill length (in mm) for each species.

_**Hint 1:**_ Your visualization type should be the one you selected in 2.3. If it doesn't make sense, try a different type of visualization!

_**Hint 2:**_ Don't worry about sorting.

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2_4
manual: true
-->

In [None]:
# Create your plot here
...

<!-- END QUESTION -->



**Question 2.5.** Assign `tor_ade` to a DataFrame that only contains rows for Adelie penguins on Torgersen Island. `tor_ade` should have all of the columns in `penguins` except for `'species'` and `'island'`, which will now be redundant (since all of the penguins in `tor_ade` will have the same `'species'` and `'island'`).

In [None]:
tor_ade = ...
tor_ade

In [None]:
grader.check("q2_5")

**Question 2.6.** Calculate the proportion of Adelie penguins on Torgersen Island that have a mass of over 4000 grams, and assign the result to the name `proportion_above_4000g`.

In [None]:
proportion_above_4000g = ...
proportion_above_4000g

In [None]:
grader.check("q2_6")

**Question 2.7.** Create a density histogram showing the distribution of the body mass of Adelie penguins on Torgersen Island. Your histogram should have 10 bars in total (you can accomplish this by setting `bins=10`).

Make sure to set `density=True` to create a density histogram, and set `ec='w'` to make the separation of bars more clear.
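
As a reminder of what `density=True` changes, here is a small sketch of the arithmetic behind a density histogram. It assumes the usual convention that a bar's height is the proportion of values in its bin divided by the bin's width, so that the areas of all bars add up to 1. The counts and bin width below are made up and have nothing to do with the penguin data.

```python
# Hypothetical data: 20 values sorted into 3 bins, each 500 units wide
counts = [4, 10, 6]
bin_width = 500
total = sum(counts)

# Height of a bar = (proportion of values in the bin) / (bin width)
height_first = counts[0] / total / bin_width
height_second = counts[1] / total / bin_width
height_third = counts[2] / total / bin_width

# Area of a bar = height * width = proportion of values in that bin
area_second = height_second * bin_width
total_area = (height_first + height_second + height_third) * bin_width

print(area_second)  # close to 0.5, since half of the values fall in the second bin
print(total_area)   # close to 1
```

This is why the vertical axis of a density histogram shows small numbers: the heights are proportions per unit on the horizontal axis, not counts.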

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2_7
manual: true
-->

In [None]:
# Create your plot here
...

<!-- END QUESTION -->



<br>

**Question 2.8.** Finally, we're interested in understanding the median `'flipper_length_mm'` of penguins from every combination of `'species'` and `'island'`.

Below, assign `median_flipper_lengths` to a DataFrame with three columns, `'species'`, `'island'`, and `'flipper_length_mm'`. Each row of `median_flipper_lengths` should correspond to a different combination of `'species'` and `'island'`.

_**Hint 1:**_ In order to create the desired DataFrame, you will have to group by more than one column. Please read [Note 11.4](https://notes.dsc10.com/02-data_sets/groupby.html) on Subgroups to find out how.

_**Hint 2:**_ Remember to use `.reset_index()` after you group by multiple columns to get rid of the "multi-index".

<!--
BEGIN QUESTION
name: q2_8
-->

In [None]:
median_flipper_lengths = ...
median_flipper_lengths

In [None]:
grader.check("q2_8")

Say goodbye to our penguins!

<img src="./images/q2_all3.png" width=350/>

## 3. Olympics ⛸️🥇

Last summer, Tokyo held the highly anticipated Summer Olympics, in which 206 nations participated and medals were awarded in 339 events across 33 distinct sports. This February, Beijing will host the 2022 Winter Olympic Games.

In light of the upcoming Winter Olympic Games, we will analyze historical data on athletes who won medals in the Winter Olympic Games from 1924 (the year of the first Winter Games) to 2014. We will be using a subset of data found [here](https://www.kaggle.com/the-guardian/olympic-games#dictionary.csv), a collection that covers Olympic medals from 1896 to 2014.

First, we'll read in the data from a CSV. There is no good index, so we will leave it unset.

In [None]:
winter_data = bpd.read_csv('data/winter.csv')
winter_data

The `'Country'` column contains International Olympic Committee (IOC) [country codes](https://olympics.fandom.com/wiki/List_of_IOC_country_codes). We want to convert these country codes into actual country names that everyone can understand.

We'll use a Python [dictionary](https://www.tutorialspoint.com/python/python_dictionary.htm) to help us with this conversion. A dictionary is a simple way to map each unique key to a value. For example, the dictionary below maps course codes to course names.

In [None]:
dsc_courses = {
    'DSC 10': 'Principles of Data Science',
    'DSC 20': 'Programming and Basic Data Structures for Data Science',
    'DSC 30': 'Data Structures and Algorithms for Data Science',
    'DSC 40A': 'Theoretical Foundations of Data Science I',
    'DSC 40B': 'Theoretical Foundations of Data Science II',
    'DSC 80': 'The Practice and Application of Data Science'
}

We can access the value corresponding to each key using bracket notation.

In [None]:
dsc30_name = dsc_courses['DSC 30']
dsc30_name

Here, `'DSC 30'` is the key and `'Data Structures and Algorithms for Data Science'` is the value.
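
Two more dictionary behaviors are worth knowing before you continue. The sketch below uses a two-entry excerpt of the dictionary above, named `courses` here for illustration; `'DSC 99'` is a made-up course code that is deliberately not a key.

```python
# A two-entry excerpt of the dsc_courses dictionary
courses = {
    'DSC 10': 'Principles of Data Science',
    'DSC 30': 'Data Structures and Algorithms for Data Science',
}

print('DSC 30' in courses)  # True: this key exists
print('DSC 99' in courses)  # False: this made-up key does not exist
print(len(courses))         # 2, the number of key-value pairs

# Bracket notation only works for keys that exist:
# courses['DSC 99'] would raise a KeyError.
```

The `in` keyword checks keys, not values, so `'Principles of Data Science' in courses` is `False`.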

Let's use a dictionary to help us with our country code to country name conversion. Below is a dictionary containing country codes as keys and country names as values for each of the countries in our dataset of Winter Olympic medal winners.

In [None]:
# Run this cell, DO NOT change it.
country_codes = {
 'USA': 'United States',
 'AFG': 'Afghanistan',
 'AHO': 'Netherlands Antilles',
 'ALG': 'Algeria',
 'ANZ': 'Australasia',
 'ARG': 'Argentina',
 'ARM': 'Armenia',
 'AZE': 'Azerbaijan',
 'CAN': 'Canada',
 'IRI': 'Iran',
 'GHA': 'Ghana',
 'NOR': 'Norway',
 'ERI': 'Eritrea',
 'IRL': 'Ireland',
 'URS': 'Soviet Union',
 'FIN': 'Finland',
 'IRQ': 'Iraq',
 'NAM': 'Namibia',
 'VIE': 'Vietnam',
 'SYR': 'Syria',
 'TAN': 'Tanzania',
 'SWE': 'Sweden',
 'IND': 'India',
 'ETH': 'Ethiopia',
 'IOP': 'Independent Olympic Participants',
 'RSA': 'South Africa',
 'RU1': 'Russian Empire',
 'PER': 'Peru',
 'PHI': 'Philippines',
 'SUR': 'Suriname',
 'PAR': 'Paraguay',
 'PAK': 'Pakistan',
 'GER': 'Germany',
 'SUI': 'Switzerland',
 'BAH': 'Bahamas',
 'PAN': 'Panama',
 'MAS': 'Malaysia',
 'AUT': 'Austria',
 'TRI': 'Trinidad and Tobago',
 'TTO': 'Trinidad and Tobago',
 'INA': 'Indonesia',
 'HKG': 'Hong Kong',
 'SUD': 'Sudan',
 'HAI': 'Haiti',
 'MGL': 'Mongolia',
 'BAR': 'Barbados',
 'MDA': 'Moldova',
 'MKD': 'Macedonia',
 'GRE': 'Greece',
 'MNE': 'Montenegro',
 'GRN': 'Grenada',
 'EGY': 'Egypt',
 'BDI': 'Burundi',
 'DOM': 'Dominican Republic',
 'GUA': 'Guatemala',
 'GUY': 'Guyana',
 'URU': 'Uruguay',
 'PUR': 'Puerto Rico',
 'BER': 'Bermuda',
 'ECU': 'Ecuador',
 'BOH': 'Bohemia',
 'BOT': 'Botswana',
 'KGZ': 'Kyrgyzstan',
 'ZIM': 'Zimbabwe',
 'ZZX': 'Mixed Team',
 'BRA': 'Brazil',
 'BRN': 'Bahrain',
 'GEO': 'Georgia',
 'BWI': 'West Indies Federation',
 'KEN': 'Kenya',
 'KUW': 'Kuwait',
 'KSA': 'Saudi Arabia',
 'RUS': 'Russia',
 'ITA': 'Italy',
 'GDR': 'East Germany',
 'TJK': 'Tajikistan',
 'THA': 'Thailand',
 'TCH': 'Czechoslovakia',
 'FRA': 'France',
 'TGA': 'Tonga',
 'TOG': 'Togo',
 'NIG': 'Niger',
 'NGR': 'Nigeria',
 'TPE': 'Taipei',
 'NED': 'Netherlands',
 'FRG': 'West Germany',
 'KOR': 'Korea, South',
 'CHI': 'Chile',
 'CHN': 'China',
 'GBR': 'United Kingdom',
 'CZE': 'Czech Republic',
 'JPN': 'Japan',
 'EUN': 'Unified Team',
 'CYP': 'Cyprus',
 'POL': 'Poland',
 'EUA': 'United Team of Germany',
 'CMR': 'Cameroon',
 'TUR': 'Turkey',
 'TUN': 'Tunisia',
 'POR': 'Portugal',
 'VEN': 'Venezuela',
 'SLO': 'Slovenia',
 'UGA': 'Uganda',
 'UAE': 'United Arab Emirates',
 'AUS': 'Australia',
 'ISV': 'US Virgin Islands',
 'JAM': 'Jamaica',
 'MEX': 'Mexico',
 'CUB': 'Cuba',
 'SGP': 'Singapore',
 'SIN': 'Singapore',
 'SEN': 'Senegal',
 'BLR': 'Belarus',
 'QAT': 'Qatar',
 'LTU': 'Lithuania',
 'DJI': 'Djibouti',
 'LAT': 'Latvia',
 'BEL': 'Belgium',
 'LIB': 'Lebanon',
 'SCG': 'Serbia and Montenegro',
 'MOZ': 'Mozambique',
 'CRC': 'Costa Rica',
 'COL': 'Colombia',
 'ZAM': 'Zambia',
 'HUN': 'Hungary',
 'UKR': 'Ukraine',
 'MAR': 'Morocco',
 'ISL': 'Iceland',
 'CRO': 'Croatia',
 'GAB': 'Gabon',
 'MRI': 'Mauritius',
 'CIV': 'Ivory Coast',
 'LIE': 'Liechtenstein',
 'SRB': 'Serbia',
 'SRI': 'Sri Lanka',
 'YUG': 'Yugoslavia',
 'EST': 'Estonia',
 'KAZ': 'Kazakhstan',
 'BUL': 'Bulgaria',
 'ISR': 'Israel',
 'DEN': 'Denmark',
 'SVK': 'Slovakia',
 'ROU': 'Romania',
 'ESP': 'Spain',
 'PRK': 'Korea, North',
 'LUX': 'Luxembourg',
 'NZL': 'New Zealand',
 'UZB': 'Uzbekistan'}

**Question 3.1.** Using the dictionary `country_codes`, define a function named `code_to_country` that takes as input a country code and returns the corresponding country's name. This should only take one line of code.

_**Hint 1:**_ Take a look at the DSC 30 example if you're stuck.

_**Hint 2:**_ Once you've implemented `code_to_country`, you should verify that it works as intended by trying a few examples yourself. The public tests will **not** do this for you.

In [None]:
def code_to_country(code):
    ...

In [None]:
grader.check("q3_1")

**Question 3.2.** Use your `code_to_country` function and the `.apply` method to convert all of the country codes in the `'Country'` column of `winter_data` into country names. Do this without creating an additional column. Assign the resulting DataFrame to the variable name `winter`.

_**Hint:**_ Is there a way to use the `.assign` method to *replace* values in this column without creating an additional column?

**Important**: For the rest of the questions in this section, use the DataFrame `winter` instead of `winter_data`.

In [None]:
winter = ...
winter

In [None]:
grader.check("q3_2")

**Question 3.3.** 
Define a function named `avg_name_length` that returns the average length of an individual's first and last name. It should take as an input the name of an individual, as a string in the format `'lastname, firstname'`. Because we assume that the input is in this format, people with middle names or multiple first or last names will have all such names counted as part of `'lastname'` or `'firstname'`. 

For example:
- `avg_name_length('Rampure, Suraj')` should return 6.0, the average of 7 and 5.
- `avg_name_length('Tiefenbruck, Janine LoBue')` should return 11.5, the average of 11 and 12. Notice that `'Janine LoBue'` is being counted as the `'firstname'` and its length is 12, which includes the space.

_**Hint 1:**_ The string method [`.split()`](https://docs.python.org/3/library/stdtypes.html#str.split) will be useful in splitting the name into `'lastname'` and `'firstname'`.

_**Hint 2:**_ Notice that there is a space after the comma, which we are not counting as part of either name's length.
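
If you haven't used `.split()` before, here is a short sketch of how it behaves on a string that is unrelated to the dataset. The string `place` and the other names are made up for illustration.

```python
# Hypothetical string, unrelated to the Olympic data
place = 'San Diego, California'

# Split wherever a comma followed by a space appears; the separator itself is discarded
parts = place.split(', ')

print(parts)          # ['San Diego', 'California']
print(len(parts[0]))  # 9, because the space inside 'San Diego' counts as a character
```

Note that `.split()` returns a list, and that `len` counts every character in a string, including spaces.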

In [None]:
def avg_name_length(full_name):
    ...
    
# Test cases for your own reference. Feel free to test out more!
print(avg_name_length('Skywalker, Luke'))  # Should print 6.5
print(avg_name_length('Schmidt, John Jacob Jingleheimer')) # Should print 15.0

In [None]:
grader.check("q3_3")

**Question 3.4.** Create a DataFrame called `avg_names` with columns `'Year'`, `'Sport'`, `'Athlete'` and `'Country'` and a new column `'Avg_Name_Length'` that has the average first and last name length for each athlete, sorted with the longest average name length at the top and shortest at the bottom.

**Note**: The `'Country'` column should have full country names, not IOC codes.

In [None]:
avg_names = ...
avg_names

In [None]:
grader.check("q3_4")

**Question 3.5.** What is the length of the longest average name? Assign this number to `longest_avg_name_length`. What is the length of the shortest average name? Assign this number to `shortest_avg_name_length`. What is the absolute difference between the longest and shortest average names? Assign this number to `range_avg_name_length`.

In [None]:
longest_avg_name_length = ...
shortest_avg_name_length = ...
range_avg_name_length = ...
print('Longest average name length:', longest_avg_name_length)
print('Shortest average name length:', shortest_avg_name_length)
print('Range of average name length:', range_avg_name_length)

In [None]:
grader.check("q3_5")

**Question 3.6.** Create a function named `median_values` that takes as an input the name of a sport and returns the median of the average first and last name length for all Olympians belonging to that sport. For example, `median_values('Skiing')` should return the median of the average name lengths for all Skiing athletes.

_**Hint:**_ Within the body of `median_values`, use a DataFrame you've already created.

In [None]:
def median_values(sport):
    ...

In [None]:
grader.check("q3_6")

**Question 3.7.** Create a horizontal bar chart that displays the median of the average name length of all Olympians from each sport.

_**Hint:**_ There are only seven unique sports for which we have data, so your bar chart should only have 7 bars.

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q3_7
manual: true
-->

In [None]:
# Create your plot here
...

<!-- END QUESTION -->



**Question 3.8.** Define a function named `point_total` that takes in a full country name (not IOC code) and returns the total number of points that country has earned at the Olympics. In the Olympics, countries get 4 points for every bronze, 5 points for every silver, and 10 points for every gold medal.

_**Hint:**_ Make sure that your function works for countries that don't have any medals of a particular type. If you aren't able to accomplish this using grouping, try another strategy!
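
To see why a missing medal type needs care, consider this sketch of the points arithmetic for a made-up country. It uses a plain dictionary rather than a DataFrame, so it is not a solution to this question; the tally, the medal labels, and the variable names are all hypothetical.

```python
# Hypothetical medal tally for one country; note that there is no 'Silver' entry
tally = {'Gold': 2, 'Bronze': 3}

# .get(key, 0) returns 0 when the key is missing, instead of raising a KeyError
gold = tally.get('Gold', 0)
silver = tally.get('Silver', 0)
bronze = tally.get('Bronze', 0)

# 10 points per gold, 5 per silver, 4 per bronze
toy_points = 10 * gold + 5 * silver + 4 * bronze
print(toy_points)  # 32
```

The same idea applies to your function: a country with no medals of some type should contribute 0 points for that type, not cause an error.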

In [None]:
def point_total(country):
    ...

In [None]:
grader.check("q3_8")

**Question 3.9.** Among the five countries listed below, which has the **third highest** point total, using the points system from 3.8?

-  `'United States'`
-  `'Canada'`
-  `'Sweden'`
-  `'China'`
-  `'France'`

Save the name (not IOC code) of the country as `country` and the country's number of points as `points`. You can set the value of `country` and `points` by hand.

_**Hint:**_ Use a previously defined function to help!

In [None]:
country = ...
points = ...

In [None]:
grader.check("q3_9")

The following cell creates a DataFrame, `top_countries`, that contains the top 10 countries historically using the points system from 3.8:

In [None]:
# Run this cell, DO NOT change it.
countries_points = bpd.DataFrame().assign(Country = np.array(['France', 'Switzerland', 'Finland', 'Belgium', 'United Kingdom',
                      'Sweden', 'Canada', 'United States', 'Austria', 'Norway',
                      'Germany', 'Czechoslovakia', 'Hungary', 'Italy', 'West Germany',
                      'Netherlands', 'Soviet Union', 'United Team of Germany', 'Japan',
                      'Poland', 'Korea, North', 'Romania', 'East Germany', 'Spain',
                      'Liechtenstein', 'Bulgaria', 'Yugoslavia', 'Unified Team',
                      'Korea, South', 'China', 'Luxembourg', 'New Zealand', 'Russia',
                      'Ukraine', 'Belarus', 'Australia', 'Slovenia', 'Kazakhstan',
                      'Uzbekistan', 'Denmark', 'Czech Republic', 'Croatia', 'Estonia',
                      'Latvia', 'Slovakia']))
top_countries = countries_points.assign(Points = countries_points.get('Country').apply(point_total)).sort_values('Points', ascending = False).reset_index(drop=True).iloc[:10]
top_countries

For comparison, here is a table of the top 10 countries in the most recent Winter Olympics (Pyeongchang 2018), determined using the same points system. The data was obtained from [this](https://www.nytimes.com/interactive/2018/sports/olympics/medal-count-results-schedule.html) article.

|Country|Points|
|------|-----------|
|Norway|254|
|Germany|218|
|Canada|190|
|United States|154|
|Netherlands|134|
|South Korea|106|
|Sweden|104|
|Switzerland|96|
|France|94|
|Austria|89|

**Question 3.10.** Below, make two observations regarding the relationship between the two tables (the historical standings in the `top_countries` DataFrame and the 2018 standings above).

_**Note:**_ There are many potential observations; we are just asking you for two. An example observation (which you cannot use in your answer) is "Canada is in the top 3 of both tables."

_Type your answer here, replacing this text._

## 4. Histograms 🧑‍💻

Suppose we have a DataFrame called `data` with two numerical columns, `'x'` and `'y'`. Consider the following scatter plot, which was generated by calling `data.plot(kind='scatter', x='x', y='y')`:

<img src="./images/q4_scatter.png" width=450/>

Now consider these two histograms:

**Histogram A**:

<img src="./images/q4_hist_one.png" width=450/>

**Histogram B**:

<img src="./images/q4_hist_two.png" width=450/>

**Question 4.1.** Which of these two lines of code generated Histogram A? Assign either `1` or `2` to `which_code`.
 1. `data.plot(kind='hist', density=True, y='x')`
 2. `data.plot(kind='hist', density=True, y='y')`  

In [None]:
which_code = ...

In [None]:
grader.check("q4_1")

**Question 4.2.** Suppose we run this block of code:

```py
new_data = bpd.DataFrame().assign(
    x = data.get('x') / 4,
    y = data.get('y')
)
```
    
We then run `new_data.plot(kind='hist', density=True, y='x')`. How will this new histogram look compared to the original histogram, `data.plot(kind='hist', density=True, y='x')`, assuming both histograms are drawn on the same scale, with fixed axes? Assign `histogram_difference` to either 1, 2, 3, or 4, corresponding to your choice.

1. The new histogram will be wider and taller than the original histogram.
2. The new histogram will be wider and shorter than the original histogram.
3. The new histogram will be narrower and taller than the original histogram.
4. The new histogram will be narrower and shorter than the original histogram.

_**Hint:**_ Look at the end of Lecture 8 for an example of two histograms drawn on the same scale with fixed axes.

In [None]:
histogram_difference = ...

In [None]:
grader.check("q4_2")

**Question 4.3.** Below, we show Histogram B again.

<img src="./images/q4_hist_two.png" width=450/>

What **percent** of values in Histogram B are between -5 (inclusive) and -2 (exclusive)? While we cannot answer this question exactly since we do not know where the bins start and end, we can still approximate the answer. Assign the variable `are_between` to a number 1 through 5, corresponding to the closest answer.

1. 10% 
2. 13% 
3. 25%
4. 38%
5. 52%

In [None]:
are_between = ...

In [None]:
grader.check("q4_3")

## Finish Line 

To submit your assignment:

1. Select `Kernel -> Restart & Run All` to ensure that you have executed all cells, including the test cells.
2. Read through the notebook to make sure everything is fine and all tests passed.
3. Run the cell below to run all tests, and make sure that they all pass.
4. Download your notebook using `File -> Download as -> Notebook (.ipynb)`, then upload your notebook to Gradescope.

In [None]:
grader.check_all()