# Homework 3: Data Visualization and Python Functions

## Due Saturday, October 16th at 11:59pm PST

Welcome to Homework 3! This week, we will cover DataFrame manipulations, making visualizations, and defining functions. You can find additional help on functions in [Chapter 2.6](https://eldridgejm.github.io/dive_into_data_science/02-data_sets/apply.html?highlight=function) and resources for creating visualizations in [Chapter 3](https://eldridgejm.github.io/dive_into_data_science/03-visualization/intro.html) of Dive into Data Science.

### Instructions

This assignment is due Saturday, October 16 at 11:59pm. You are given six slip days throughout the quarter to extend deadlines. See the syllabus for more details. With the exception of using slip days, late work will not be accepted unless you have made special arrangements with your instructor.

**Important**: For homeworks, the `otter` tests don't usually tell you that your answer is correct. More often, they help catch careless mistakes. It's up to you to ensure that your answer is correct. If you're not sure, ask someone (not for the answer, but for some guidance about your approach). These are great questions for office hours (schedule on Canvas) or your team's chatroom on Campuswire. Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. 

**Please do not use for-loops for any questions in this homework.** If you don't know what a for-loop is, don't worry -- we haven't covered them yet. But if you do know what they are and are wondering why it's not OK to use them, it is because loops in Python are slow, and looping over arrays and tables should usually be avoided.

In [None]:
# Please don't change this cell, but do make sure to run it
import babypandas as bpd
import matplotlib.pyplot as plt
import numpy as np
import otter
grader = otter.Notebook()

## 1. Gotta Catch 'Em All!

<img src="./images/pokemon.png" width=400/>

Pokémon is an immensely popular video game and animation franchise that originated from Japan in 1996. Pokémon, short for Pocket Monsters, come in a variety of unique types and have several kinds of stats.  In this problem, we will investigate how attack and defense stats vary among these types up to the seventh generation.

The file named `pokedex.csv` in the `data/` directory has a row for each Pokémon, and the following columns.

|Column|Description|
|------|-----------|
|pokedex_number|The Pokémon's identification number in an encyclopedia of all Pokémon.|
|name|The name of the Pokémon.|
|type|The categorical type of the Pokémon, for example, "normal", "fire", "water". Each Pokémon is limited to one type for simplicity.|
|attack|The Pokémon's power for physical moves.|
|defense|The Pokémon's ability to prevent damage from attacks.|
|hp| Hit Points. Indicates how much damage a Pokémon can tolerate.|
|sp_attack|Special Attack. The Pokémon's power for special moves.|
|sp_defense|Special Defense. The Pokémon's ability to prevent damage from special attacks.|
|generation|A group of Pokémon that are compatible for Pokémon games.|
|is_legendary|Indicates whether the Pokémon is legendary. Legendary Pokémon are rare and powerful. 1 means legendary, 0 means not.|

First, we read the data in as a DataFrame.

In [None]:
pokedex = bpd.read_csv('data/pokedex.csv')
pokedex

Let's explore particular columns to get to know the data a little better. The `.describe()` method gives us some useful information about a column. Try it out on the `name` column.

In [None]:
pokedex.get('name').describe()

We learn that this column has 801 values, all of which are unique, and as a result the most frequent name appears only once.

If we try this same command on the `type` column, we'll see that although there are 801 entries, only 18 of them are unique. There are many Pokémon with the same `type`. The most common `type` is "water"; there are 114 such Pokémon. 

In [None]:
pokedex.get('type').describe()
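To see where those four summary values come from, here is a minimal sketch that recomputes them for a small, made-up list of types using only Python's standard library. The list `sample_types` is hypothetical and is not drawn from `pokedex.csv`.

```python
from collections import Counter

# Hypothetical sample of Pokémon types (not taken from pokedex.csv).
sample_types = ["water", "fire", "water", "grass", "water", "fire"]

tally = Counter(sample_types)          # maps each distinct value to how often it appears
count = len(sample_types)              # total number of entries
unique = len(tally)                    # number of distinct values
top, freq = tally.most_common(1)[0]    # most frequent value and its frequency

print(count, unique, top, freq)
```

A column whose `unique` equals its `count` has no repeated values.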

**Question 1.1.** Which would be a better choice of index for this dataset, `name` or `type`? Set the index of `pokedex` to whichever of these two attributes makes more sense.

In [None]:
pokedex = ...
pokedex

In [None]:
grader.check("q1_1")

**Question 1.2.** Assign `weakest_attack` and `weakest_defense` to the names of the weakest Pokémon in terms of attack and defense respectively.

Similarly, assign `strongest_attack` and `strongest_defense` to the names of the strongest Pokémon in terms of attack and defense respectively.

In the case of a tie, choose any one of the equally weakest or equally strongest Pokémon.

In [None]:
weakest_attack = ...
print("Weakest attack:", weakest_attack)

strongest_attack = ...
print("Strongest attack:", strongest_attack)

weakest_defense = ...
print("Weakest defense:", weakest_defense)

strongest_defense = ...
print("Strongest defense:", strongest_defense)

In [None]:
grader.check("q1_2")

**Question 1.3.** Typically at the beginning of a game, the Pokémon trainer (the player) has to make a choice between Pokémon of `type` "water", "grass", and "fire". Make a DataFrame named `water_grass_fire` containing only the "water", "grass", and "fire" Pokémon. All columns of `pokedex` should be included.

In [None]:
water_grass_fire = ...
water_grass_fire

In [None]:
grader.check("q1_3")

**Question 1.4.** Create a DataFrame named `legendary_pokemon`, indexed by `type` and having one column, called `num_legendary`, that contains the number of legendary Pokémon of each type.

*Hint:* You will need to drop and rename columns. Instead of using `.drop`, you may want to use `.get` with a list containing the name of a single column.

In [None]:
legendary_pokemon = ...
legendary_pokemon

In [None]:
grader.check("q1_4")

**Question 1.5.** Notice that the `legendary_pokemon` DataFrame has fewer than 18 rows, even though there are 18 unique `type`s; this means that certain `type`s don't have any legendary Pokémon. Determine which `type`s don't have any legendary Pokémon, and assign `non_legendary` to an array of these `type`s.

In [None]:
non_legendary = ...
non_legendary

In [None]:
grader.check("q1_5")

**Question 1.6.** Suppose that as a Pokémon trainer, you want to assemble a strong team of Pokémon of various `type`s. Create a DataFrame called `mean_stats`, indexed by `type`, that contains the average statistics for Pokémon of each type. `mean_stats` should have five columns: `attack`, `defense`, `hp`, `sp_attack`, and `sp_defense`.

In [None]:
mean_stats = ...
mean_stats

In [None]:
grader.check("q1_6")

**Question 1.7.** A strong Pokémon is one that has high values for `attack`, `defense`, `hp`, `sp_attack`, and `sp_defense`. Suppose that you develop a formula to summarize all of these stats into a single number called strength. The strength of a Pokémon is a weighted average of these five stats, where each stat is weighted as follows:

- `attack`: 20%
- `defense`: 20%
- `hp`: 30%
- `sp_attack`: 15%
- `sp_defense`: 15%

Define a function called `calculate_strength` that takes as input the `pokedex_number` of a Pokémon and returns its strength, as defined above.
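As a quick check of the formula, here is a worked example for a single made-up Pokémon. The stat values are hypothetical, and this sketch only illustrates the arithmetic; it does not look anything up in `pokedex`.

```python
# Hypothetical stats for one made-up Pokémon (not from pokedex.csv).
attack, defense, hp, sp_attack, sp_defense = 50, 60, 70, 40, 80

# Weighted average using the weights listed above; the weights sum to 1.
strength = (0.20 * attack + 0.20 * defense + 0.30 * hp
            + 0.15 * sp_attack + 0.15 * sp_defense)

print(strength)  # approximately 61
```

Because `hp` carries the largest weight, changing it moves the strength more than an equal change in any other stat.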

In [None]:
def calculate_strength(number):
    ...

In [None]:
grader.check("q1_7")

**Question 1.8.** Create a DataFrame called `with_strength` that contains all the columns of `pokedex` plus one more called `strength`, containing the strength of each Pokémon as defined in the previous question. Order the rows in descending order of `strength`.

*Hint*: Use the `calculate_strength` function you've already written.

In [None]:
with_strength = ...
with_strength

**Question 1.9.** Make a plot that will help you answer the question, "*Among water Pokémon, do those with stronger attack power also have stronger defense power?*" Consider only regular `attack` and `defense`, **not** special (`sp_attack` and `sp_defense`).

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q1_9
manual: true
-->

In [None]:
# Create your plot here
...

<!-- END QUESTION -->



**Question 1.10.** Considering that Pokémon `generation`s were developed in order, we might wonder how Pokémon have evolved over time. Draw a line plot that shows the trend, across generations, in the proportion of legendary Pokémon in each generation. This kind of plot might help you answer the question "*Are later-generation Pokémon more likely to be legendary?*"

*Hint*: You'll have to do some DataFrame manipulation before you can create the line plot.

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q1_10
manual: true
-->

In [None]:
# create your plot here
...

<!-- END QUESTION -->



## 2. Power Outages


The dataset below contains information about power outages that occurred in North America between 2000 and 2014. The data is a subset of the data available [here](https://www.kaggle.com/autunno/15-years-of-power-outages). Run the next cell to load in the data. Take some time to look at a few rows of the DataFrame to see what information is recorded.

In [None]:
# Run this cell.
power = bpd.read_csv("data/powerdata.csv")
power

**Question 2.1.** Find the year in which the most power outages occurred. Save your answer in the variable `greatest_outage_year`. Find the proportion of all outages that occurred in this year, and save that result in the variable `greatest_outage_prop`. Note that the year should be an integer while the greatest proportion should be a float.

In [None]:
greatest_outage_year = ...
greatest_outage_prop = ...
print("The proportion of outages that occurred in", greatest_outage_year, "is", greatest_outage_prop)

In [None]:
grader.check("q2_1")

**Question 2.2.** Now, let's create a visualization of the information we touched on above! Create a histogram showing the distribution of power outages by year. Your histogram should have one bar for each year included in the dataset (2000 to 2014).

*Note:* As done in Lecture 8, make sure to set `density=True` to create a density histogram, and set `ec='w'` to make the separation between bars clearer.
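For reference, a density histogram scales each bar so that its area, not its height, equals the proportion of data in that bin. The sketch below works through that arithmetic for three made-up bins of equal width; the counts and bin width are hypothetical and unrelated to the outage data.

```python
# Hypothetical counts for three bins, each 2 units wide.
count_a, count_b, count_c = 2, 5, 3
bin_width = 2
total = count_a + count_b + count_c

# Height of a density bar = (proportion of data in the bin) / (bin width).
height_a = count_a / total / bin_width
height_b = count_b / total / bin_width
height_c = count_c / total / bin_width

# The bar areas (height * width) add up to 1, i.e. 100% of the data.
total_area = (height_a + height_b + height_c) * bin_width
print(height_a, height_b, height_c, total_area)
```

Reading a proportion off a density histogram therefore means multiplying a bar's height by its width.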

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2_2
manual: true
-->

In [None]:
# Create your plot here
...

<!-- END QUESTION -->



One of the columns in our dataset is `NERC Region`, which refers to a geographic region in North America. This map from [Wikipedia](https://en.wikipedia.org/wiki/North_American_Electric_Reliability_Corporation) shows many of the NERC Regions in our dataset. 

<img src="./images/nerc.png" width=600/>

**Question 2.3.** Make a horizontal bar chart showing the number of outages per `NERC Region`. Make sure that the bars are sorted from longest at the top to shortest at the bottom. 

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q2_3
manual: true
-->

In [None]:
# create your plot here
...

<!-- END QUESTION -->



**Question 2.4.** We notice that the "RFC" NERC Region had the greatest number of power outages. Let's explore those outages in particular. Create a DataFrame named `rfc_tags`, indexed by `Tags`, with one column "Count", containing the number of outages in the "RFC" NERC Region associated with each of the `Tags`.

*Note*: Two outages should only appear in the same row of the output if they have the exact same set of `Tags`. For example, an outage tagged "vandalism" will count separately from an outage tagged "vandalism, cyber". 

In [None]:
rfc_tags = ...
rfc_tags

In [None]:
grader.check("q2_4")

**Question 2.5.** How many outages were there in California in 2013 that were caused by vandalism? Save your answer as `ca_vandalism`.
 
*Hint*: There are many geographic areas in California. For instance, "Central California" is one, "Northern California" is another, but there are more.

*Hint*: There are many types of vandalism. For instance, "Physical Attack" in `Event Description` is one type of vandalism, and "Vandalism" itself in `Event Description` is also considered a type of vandalism. There is an easy way to get all related records.

In [None]:
ca_vandalism = ...
ca_vandalism

In [None]:
grader.check("q2_5")

## 3. Olympics

This summer, Tokyo held the highly-anticipated Summer Olympics. At the Olympics, 206 nations participated, and medals were awarded in 339 events across 33 distinct sports. In February 2022, Beijing will host the Winter Olympic Games.

In light of the upcoming Winter Olympic Games, we will analyze historical data on Olympic athletes who won medals in the Winter Olympic Games from 1896-2014. We will be using a subset of the data available [here](https://www.kaggle.com/the-guardian/olympic-games#dictionary.csv).

First, we'll read in the data from a CSV. There is no good index, so we will leave it unset.

In [None]:
winter_data = bpd.read_csv('data/winter.csv')
winter_data

The `Country` column contains International Olympic Committee (IOC) [country codes](https://olympics.fandom.com/wiki/List_of_IOC_country_codes). We want to convert these country codes into actual country names that everyone can understand.

We'll use a Python [dictionary](https://www.tutorialspoint.com/python/python_dictionary.htm) to help us with this conversion. A dictionary is a simple way to map a unique key to a value. For example, the dictionary below maps course codes to course names.

In [None]:
dsc_courses = {
    'DSC 10': 'Principles of Data Science',
    'DSC 20': 'Programming and Basic Data Structures for Data Science',
    'DSC 30': 'Data Structures and Algorithms for Data Science',
    'DSC 40A': 'Theoretical Foundations of Data Science I',
    'DSC 40B': 'Theoretical Foundations of Data Science II',
    'DSC 80': 'The Practice and Application of Data Science'
}

We can access the value corresponding to each key using bracket notation.

In [None]:
dsc30_name = dsc_courses['DSC 30']
dsc30_name

Here, `DSC 30` is the key and `Data Structures and Algorithms for Data Science` is the value.
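A few other dictionary operations are handy to know alongside bracket lookup. The sketch below uses a made-up `airports` dictionary (not part of this assignment) to show a lookup, a lookup with a default via `.get`, and a key membership test.

```python
# Hypothetical dictionary mapping airport codes to city names.
airports = {"SAN": "San Diego", "LAX": "Los Angeles", "SFO": "San Francisco"}

city = airports["SAN"]                      # bracket lookup returns the value for the key
fallback = airports.get("JFK", "unknown")   # .get returns a default when the key is absent
has_lax = "LAX" in airports                 # membership tests check keys, not values

print(city, fallback, has_lax)
```

Note that bracket lookup with a key that is not in the dictionary, such as `airports["JFK"]`, raises a `KeyError` rather than returning a default.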

Let's use a dictionary to help us with our country code to country name conversion. Below is a dictionary containing country codes as keys and country names as values for each of the countries in our dataset of Winter Olympic medal winners.

In [None]:
# Run this cell, DO NOT change it.
country_codes = {
 'USA': 'United States',
 'AFG': 'Afghanistan',
 'AHO': 'Netherlands Antilles',
 'ALG': 'Algeria',
 'ANZ': 'Australasia',
 'ARG': 'Argentina',
 'ARM': 'Armenia',
 'AZE': 'Azerbaijan',
 'CAN': 'Canada',
 'IRI': 'Iran',
 'GHA': 'Ghana',
 'NOR': 'Norway',
 'ERI': 'Eritrea',
 'IRL': 'Ireland',
 'URS': 'Soviet Union',
 'FIN': 'Finland',
 'IRQ': 'Iraq',
 'NAM': 'Namibia',
 'VIE': 'Vietnam',
 'SYR': 'Syria',
 'TAN': 'Tanzania',
 'SWE': 'Sweden',
 'IND': 'India',
 'ETH': 'Ethiopia',
 'IOP': 'Independent Olympic Participants',
 'RSA': 'South Africa',
 'RU1': 'Russian Empire',
 'PER': 'Peru',
 'PHI': 'Philippines',
 'SUR': 'Suriname',
 'PAR': 'Paraguay',
 'PAK': 'Pakistan',
 'GER': 'Germany',
 'SUI': 'Switzerland',
 'BAH': 'Bahamas',
 'PAN': 'Panama',
 'MAS': 'Malaysia',
 'AUT': 'Austria',
 'TRI': 'Trinidad and Tobago',
 'TTO': 'Trinidad and Tobago',
 'INA': 'Indonesia',
 'HKG': 'Hong Kong',
 'SUD': 'Sudan',
 'HAI': 'Haiti',
 'MGL': 'Mongolia',
 'BAR': 'Barbados',
 'MDA': 'Moldova',
 'MKD': 'Macedonia',
 'GRE': 'Greece',
 'MNE': 'Montenegro',
 'GRN': 'Grenada',
 'EGY': 'Egypt',
 'BDI': 'Burundi',
 'DOM': 'Dominican Republic',
 'GUA': 'Guatemala',
 'GUY': 'Guyana',
 'URU': 'Uruguay',
 'PUR': 'Puerto Rico',
 'BER': 'Bermuda',
 'ECU': 'Ecuador',
 'BOH': 'Bohemia',
 'BOT': 'Botswana',
 'KGZ': 'Kyrgyzstan',
 'ZIM': 'Zimbabwe',
 'ZZX': 'Mixed Team',
 'BRA': 'Brazil',
 'BRN': 'Bahrain',
 'GEO': 'Georgia',
 'BWI': 'West Indies Federation',
 'KEN': 'Kenya',
 'KUW': 'Kuwait',
 'KSA': 'Saudi Arabia',
 'RUS': 'Russia',
 'ITA': 'Italy',
 'GDR': 'East Germany',
 'TJK': 'Tajikistan',
 'THA': 'Thailand',
 'TCH': 'Czechoslovakia',
 'FRA': 'France',
 'TGA': 'Tonga',
 'TOG': 'Togo',
 'NIG': 'Niger',
 'NGR': 'Nigeria',
 'TPE': 'Taipei',
 'NED': 'Netherlands',
 'FRG': 'West Germany',
 'KOR': 'Korea, South',
 'CHI': 'Chile',
 'CHN': 'China',
 'GBR': 'United Kingdom',
 'CZE': 'Czech Republic',
 'JPN': 'Japan',
 'EUN': 'Unified Team',
 'CYP': 'Cyprus',
 'POL': 'Poland',
 'EUA': 'United Team of Germany',
 'CMR': 'Cameroon',
 'TUR': 'Turkey',
 'TUN': 'Tunisia',
 'POR': 'Portugal',
 'VEN': 'Venezuela',
 'SLO': 'Slovenia',
 'UGA': 'Uganda',
 'UAE': 'United Arab Emirates',
 'AUS': 'Australia',
 'ISV': 'US Virgin Islands',
 'JAM': 'Jamaica',
 'MEX': 'Mexico',
 'CUB': 'Cuba',
 'SGP': 'Singapore',
 'SIN': 'Singapore',
 'SEN': 'Senegal',
 'BLR': 'Belarus',
 'QAT': 'Qatar',
 'LTU': 'Lithuania',
 'DJI': 'Djibouti',
 'LAT': 'Latvia',
 'BEL': 'Belgium',
 'LIB': 'Lebanon',
 'SCG': 'Serbia and Montenegro',
 'MOZ': 'Mozambique',
 'CRC': 'Costa Rica',
 'COL': 'Colombia',
 'ZAM': 'Zambia',
 'HUN': 'Hungary',
 'UKR': 'Ukraine',
 'MAR': 'Morocco',
 'ISL': 'Iceland',
 'CRO': 'Croatia',
 'GAB': 'Gabon',
 'MRI': 'Mauritius',
 'CIV': 'Ivory Coast',
 'LIE': 'Liechtenstein',
 'SRB': 'Serbia',
 'SRI': 'Sri Lanka',
 'YUG': 'Yugoslavia',
 'EST': 'Estonia',
 'KAZ': 'Kazakhstan',
 'BUL': 'Bulgaria',
 'ISR': 'Israel',
 'DEN': 'Denmark',
 'SVK': 'Slovakia',
 'ROU': 'Romania',
 'ESP': 'Spain',
 'PRK': 'Korea, North',
 'LUX': 'Luxembourg',
 'NZL': 'New Zealand',
 'UZB': 'Uzbekistan'}

**Question 3.1.** Using `country_codes`, define a function named `code_to_country` that takes as input a country code and returns the corresponding country's name. This should only take one line of code.

In [None]:
def code_to_country(code):
    ...

In [None]:
grader.check("q3_1")

**Question 3.2.** Using your `code_to_country` function, use `apply` to convert all of the country codes into country names in the `Country` column. Do this without creating an additional column. Assign the resulting DataFrame to the variable name `winter`.

*Hint*: Is there a way to use the `.assign` method to *replace* values in this column without creating an additional column?

**Important**: For the rest of the questions in this section, use the DataFrame `winter` instead of `winter_data`.

In [None]:
winter = ...
winter

In [None]:
grader.check("q3_2")

**Question 3.3.** 
Define a function named `avg_name_length` that returns the average length of an individual's first and last name. It should take as an input the name of an individual, as a string in the format `'lastname, firstname'`. Because we assume that the input is in this format, people with middle names or multiple first or last names will have all such names counted as part of `lastname` or `firstname`. 

For example, `avg_name_length('Rampure, Suraj')` should return 6.0, the average of 7 and 5. `avg_name_length('Tiefenbruck, Janine LoBue')` should return 11.5, the average of 11 and 12. Notice that `'Janine LoBue'` is being counted as the `firstname` and its length is 12, which includes the space.

*Hint:* The string function [`.split()`](https://docs.python.org/3/library/stdtypes.html#str.split) will be useful in splitting the name into `lastname` and `firstname`.

*Hint:* Notice that there is a space after the comma, which we are not counting as part of either name's length.
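To illustrate the two hints together, here is a small sketch of how `.split()` behaves when given a separator. The string `pair` is a made-up example, not a name from the dataset.

```python
# Hypothetical string in a 'first part, second part' layout.
pair = "apple, banana split"

# Splitting on ', ' (comma followed by a space) removes the separator entirely.
parts = pair.split(", ")
first, second = parts[0], parts[1]

print(parts)                      # ['apple', 'banana split']
print(len(first), len(second))    # 5 12
```

Spaces inside a piece, like the one in `'banana split'`, are kept and counted by `len`; only the separator itself is dropped.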

In [None]:
def avg_name_length(full_name):
    ...
    
# Test cases for your own reference. Feel free to test out more!
print(avg_name_length('Skywalker, Leia'))  # Should print 6.5
print(avg_name_length('Schmidt, John Jacob Jingleheimer')) # Should print 15.0

In [None]:
grader.check("q3_3")

**Question 3.4.** 
Create a DataFrame called `avg_names` with columns `Year`, `Sport`, `Athlete`, and `Country`, plus a new column `Avg_Name_Length` that has the average first and last name length for each athlete, sorted with the longest average name length at the top and shortest at the bottom.

*Note*: The `Country` column should have full country names, not IOC codes.

In [None]:
avg_names = ...
avg_names

In [None]:
grader.check("q3_4")

**Question 3.5.** What is the length of the longest average name? Assign this number to `longest_avg_name_length`. What is the length of the shortest average name? Assign this number to `shortest_avg_name_length`. What is the absolute difference between the longest and shortest average names? Assign this number to `range_avg_name_length`.

In [None]:
longest_avg_name_length = ...
shortest_avg_name_length = ...
range_avg_name_length = ...
range_avg_name_length

In [None]:
grader.check("q3_5")

**Question 3.6.** Create a function named `mean_values` that takes as an input the name of a country and returns the mean of the average first and last name length for all Olympians from that country. 

*Hint:* Use a function you've already written.

In [None]:
def mean_values(country):
    ...

In [None]:
grader.check("q3_6")

**Question 3.7.** Create a horizontal bar chart of the mean of the average first and last name length of all Olympians from each country. Use your chart to answer the question: **which country has the longest average name in general?** Write your answer by hand by assigning the full name of this country to `country_longest_avg_name`. Similarly, assign `country_shortest_avg_name` to the full name of the country with the shortest average name.

In [None]:
# Create your plot here
...
country_longest_avg_name = ...
country_shortest_avg_name = ...

In [None]:
grader.check("q3_7")

**Question 3.8.** Define a function named `point_total` that takes in a full country name (not IOC code) and returns the total number of points that country has earned at the Olympics. In the Olympics, countries get 1 point for every bronze, 3 points for every silver, and 5 points for every gold medal.
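As a sanity check on the scoring rule, here is the arithmetic for one made-up country. The medal counts and the dictionary keys are hypothetical; check the actual column names and values in `winter` before relying on them.

```python
# Hypothetical medal counts for one made-up country.
medal_counts = {"Gold": 2, "Silver": 1, "Bronze": 4}

# Points awarded per medal, as described above.
points_per_medal = {"Gold": 5, "Silver": 3, "Bronze": 1}

total_points = (medal_counts["Gold"] * points_per_medal["Gold"]
                + medal_counts["Silver"] * points_per_medal["Silver"]
                + medal_counts["Bronze"] * points_per_medal["Bronze"])

print(total_points)  # 17
```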

In [None]:
def point_total(country):
    ...

In [None]:
grader.check("q3_8")

**Question 3.9.** Among the five countries listed below, which has the **third** highest point total?

-  United States
-  Russia
-  Italy
-  China
-  France

Save the name (not IOC code) of the country as `country` and the country's number of points as `points`. You can set the value of `country` and `points` by hand.

*Hint*: Use a previously defined function to help!

In [None]:
country = ...
points = ...

In [None]:
grader.check("q3_9")

## 4. Histograms

Suppose we have a DataFrame called `data` with two numerical columns, `x` and `y`. Consider the following scatter plot, which was generated by calling `data.plot(kind='scatter', x='x', y='y')`:

<img src="./images/scatter.png" width=600/>

Now consider these two histograms:

**Histogram A**:

<img src="./images/hist_one.png" width=600/>

**Histogram B**:

<img src="./images/hist_two.png" width=600/>

**Question 4.1.** Which of these two lines of code generated Histogram B? Assign either `1` or `2` to `which_code`.
 1. `data.plot(kind='hist', density=True, y='x')`
 2. `data.plot(kind='hist', density=True, y='y')`  

In [None]:
which_code = ...

In [None]:
grader.check("q4_1")

**Question 4.2.** Suppose we run this block of code:

    new_data = bpd.DataFrame().assign(
        x = data.get('x') * 3,
        y = data.get('y')
    )
    
We then run `new_data.plot(kind='hist', density=True, y='x')`. Describe how the new histogram looks compared to `data.plot(kind='hist', density=True, y='x')`, assuming both histograms are drawn on the same scale, with fixed axes. Comment on the height and width of the new histogram as compared to the old.

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q4_2
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



**Question 4.3.** According to Histogram B above, approximately what **percent** of points in the scatterplot have an `x` value between -5 (inclusive) and -3 (exclusive)? Assign the variable `x_between` to a number 1 through 4, corresponding to your choice.

1. 10% 
2. 15% 
3. 25% 
4. 40% 

In [None]:
x_between = ...

In [None]:
grader.check("q4_3")

# Finish Line

To submit your assignment:

1. Select `Kernel -> Restart & Run All` to ensure that you have executed all cells, including the test cells.
2. Read through the notebook to make sure everything is fine and all tests passed.
3. Run the cell below to run all tests, and make sure that they all pass.
4. Download your notebook using `File -> Download as -> Notebook (.ipynb)`, then upload your notebook to Gradescope.

In [None]:
grader.check_all()