

# Homework 1: Basic Python, Arrays, and DataFrames

## Due Monday, July 7th at 11:59PM

Welcome to Homework 1! This week's homework will cover basic Python, arrays, and DataFrames. You can find additional help on these topics in [Chapter 1](https://www.inferentialthinking.com/chapters/01/what-is-data-science.html) of Computational and Inferential Thinking and [BPD 1-11](https://notes.dsc10.com/01-getting_started/tools.html) in the `babypandas` notes.


### Instructions

Remember to start early and submit often.

**Important**: For homeworks, the `otter` tests don't usually tell you that your answer is correct. More often, they help catch careless mistakes. It's up to you to ensure that your answer is correct. If you're not sure, ask someone (not for the answer, but for some guidance about your approach). These are great questions for office hours (the schedule can be found [here](https://dsc10.com/calendar)) or Campuswire. Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. 

In [None]:
# Please don't change this cell, but do make sure to run it.
import babypandas as bpd
import matplotlib.pyplot as plt
import numpy as np
import otter
grader = otter.Notebook()

plt.style.use('ggplot')

## 1. The Three Musketeers vs. Les Trois Mousquetaires 🐭🐭🐭

<center><img src=images/3_musketeers_illustration.png width=350>(<a href="https://en.wikipedia.org/wiki/The_Three_Musketeers">source</a>)</center>

In Lecture 1, we counted the number of times that the characters Amy, Beth, Jo, Meg, and Laurie were named in each chapter of the classic book, Little Women. In programming, the word "character" also refers to a single element of a string. For instance, the string `"3 zebras!"` has 9 characters – `"3"`, `" "`, `"z"`, `"e"`, `"b"`, `"r"`, `"a"`, `"s"`, and `"!"`. 
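
As a small illustration of counting characters, the sketch below measures the example string above and a second, made-up string. The variable names and the text in `sentence_text` are hypothetical and are not taken from the book.

```python
# A string's length counts every character, including spaces and punctuation.
example = "3 zebras!"
num_chars = len(example)

# The string method `count` tallies how many times a substring appears.
sentence_text = "It rained. We stayed in. Nobody minded."  # hypothetical text
num_periods = sentence_text.count(".")

# Characters per period is a rough proxy for typical sentence length.
chars_per_period = len(sentence_text) / num_periods

print(num_chars)
print(num_periods)
print(chars_per_period)
```

Dividing total characters by total periods is only an approximation of sentence length, since not every sentence ends in a period and not every period ends a sentence.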

Let's use this concept to see if *The Three Musketeers* by Alexandre Dumas has longer sentences in English or the original French. 

The following code generates a scatter plot where each dot represents a chapter of *The Three Musketeers*, either in English or French. For each chapter, there is both a blue dot corresponding to that chapter's English translation, and a green dot for the original French chapter. The horizontal position of a dot measures the number of periods in the chapter. The vertical position measures the total number of characters in that chapter. 

In [None]:
# This cell contains code that hasn't yet been covered in the course.
# It isn't expected that you'll understand the code, but you should be able to 
# interpret the scatter plot it generates.

import numpy as np
import matplotlib.pyplot as plt

plt.style.use('fivethirtyeight')

eng_file = "./data/The_Three_Musketeers.txt"
fre_file = "./data/Les_Trois_Mousquetaires.txt"

eng_chapters = open(eng_file, encoding="utf-8").read().split('Chapter ')[67:]
fre_chapters = open(fre_file, encoding="utf-8").read().split('CHAPITRE ')[67:]

eng_periods = np.char.count(eng_chapters, '.')
fre_periods = np.char.count(fre_chapters, '.')

eng_chars = [len(c) for c in eng_chapters]
fre_chars = [len(c) for c in fre_chapters]

plt.scatter(eng_periods, eng_chars, color='b')
plt.scatter(fre_periods, fre_chars, color='g')
plt.xlabel('Periods')
plt.ylabel('Characters')
plt.legend(['English', 'French'])
plt.axis([0, 450, 0, 45000]); 

Before we decide which language has longer sentences, let's start with some investigative questions. All of the questions in this section should be answered using the scatter plot alone, plus process of elimination and estimation. Don't write any code for this section! It may help to open another copy of the notebook in a new tab so you can easily refer to the scatter plot throughout this section.

**Question 1.1.** How many <ins>periods</ins> are in the **English** chapter with the greatest number of <ins>characters</ins>? Assign the variable `musketeers_1` to either 1, 2, 3, 4, or 5 corresponding to your choice.

1. 325
2. 352
3. 400
4. 405
5. 452

In [None]:
musketeers_1 = ...

In [None]:
grader.check("q1_1")

**Question 1.2.** How many <ins>characters</ins> are in the **French** chapter with the most <ins>periods</ins>? Assign the variable `musketeers_2` to either 1, 2, 3, 4, or 5.

1. 27,625
2. 32,625
3. 37,625
4. 39,625
5. 40,625

In [None]:
musketeers_2 = ...

In [None]:
grader.check("q1_2")

**Question 1.3.** Which of the following is closest to the <ins>average number of characters per period</ins> in the **English** version of *The Three Musketeers*? This is roughly the length of a typical sentence in the English translation of the book. Assign the variable `musketeers_3` to either 1, 2, 3, 4, or 5.

1. 50
2. 100
3. 150
4. 200
5. 250

In [None]:
musketeers_3 = ...

In [None]:
grader.check("q1_3")

**Question 1.4.** Which version of *The Three Musketeers* has more characters per period, on average? Assign the variable `musketeers_4` to either 1 or 2.

1. French
2. English

In [None]:
musketeers_4 = ...

In [None]:
grader.check("q1_4")

**Question 1.5.** Which of the following is a valid conclusion we can make based on the above scatter plot alone? Assign the variable `musketeers_5` to 1, 2, 3, 4, or 5. There is only one correct answer.

1. The number of periods in each chapter is the same in the French and English versions of the book.

1. The number of characters in each chapter is the same in the French and English versions of the book.

1. The last chapter has more periods in the English translation than in the original French.

1. In the first chapter, the average number of characters per period is roughly the same in the French and English versions of the book.

1. None of these is a valid conclusion based on the scatter plot alone.

In [None]:
musketeers_5 = ...

In [None]:
grader.check("q1_5")

<font color=red>**🚨 Important**: The tests in this section only check that you selected one of the possible answer choices, not that you selected the right one! Unlike in labs, tests in homeworks **do not** check that you answered correctly; they only check that your answer is *reasonable*, or in the correct format. To put it another way: all of your tests might pass, but that doesn't mean you'll get full credit – some of your answers may still be wrong. It's up to you to make sure that they're right!</font>

## 2. Python Basics 🐍

**Question 2.1.** When you run the following cell, Python produces a cryptic error message.

In [None]:
2024 = 2025 - 1.0

Choose the best explanation of what's wrong with the code, and then assign 1, 2, 3, or 4 to `basics_1` below to indicate your answer.

1. The left hand side is an `int`, while the right hand side is a `float`. The left side should be `2024.0` instead.

1. The result should be written after the calculation. It should be `2025 - 1.0 = 2024`.

1. This is trying to create a variable named `2024`, which doesn't make sense because `2024` is a number.

1. Python is not able to subtract a `float` from an `int` because they are of different data types.


<font color=red>**🚨 Important**: Once you have finished this question, "comment out" the above code cell by replacing its contents with `# 2024 = 2025 - 1.0`. This will prevent the error message from appearing when your notebook is graded.</font>

***Note:*** A shortcut for "commenting" out code is to highlight the code and press `command` or `control` and `/`.

In [None]:
basics_1 = ...

In [None]:
grader.check("q2_1")

**Question 2.2.** Consider the following poorly-written code.

In [None]:
two = 2
three = 3
three = three ** two
three = two - three
three = three * two
three = three * three

As this code executes, what values does the variable `three` take on? Assign 1, 2, 3, or 4 to `basics_2` to indicate your answer.

1. The variable `three` takes on the values 3, 6, 4, 8, 64.

1. The variable `three` takes on the values 3, 6, -4, -8, 64.

1. The variable `three` takes on the values 3, 9, 7, 14, 196.

1. The variable `three` takes on the values 3, 9, -7, -14, 196.

In [None]:
basics_2 = ...

In [None]:
grader.check("q2_2")

## 3. Road Trip   🚘 

This weekend, you plan on going on a road trip to Palm Springs with your friends Jason and Minchan. Before you go, you want to plan which route (Route A or Route B) and whose car (Jason's or Minchan's) to take. Answer the questions below using Python to perform all the intermediate calculations such as adding, squaring, and dividing.

<font color=red>**🚨 Important**: The `math` package has not been imported. You don't need it for this question, and you should not import it, otherwise the Gradescope autograder may error.</font>

**Question 3.1.** First, you need to decide on the route. You plan to take the route for which the average speed is faster. 

For Route A, you will have to take 3 freeways, each with a different speed limit. Below is a table showing the speed limits (which we will assume is the speed you will travel at) and the time you will have to spend on each freeway on Route A.

| Freeway | Speed Limit (miles per hour) | Time (hours)|
| --- | --- | --- |
| I-5 | 70 | 3 |
| CA-73 | 55 | 2 |
| CA-133 | 45 | 1 |

Using this information, calculate the average speed, in miles per hour, if you take Route A, and assign your answer to the variable `route_A`. Recall from math and physics that the average speed is the total distance driven divided by the total time taken.

In [None]:
# Feel free to define intermediate variables to use in your solution.
total_distance = ...
total_time = ...
route_A = ...
route_A

In [None]:
grader.check("q3_1")

**Question 3.2.** Next, let's consider the other route you might take. For Route B, you will have to take 3 freeways, each with a different speed limit. Below is a table showing the speed limits (which we will assume is the speed you will travel at) and the *distance* you will need to travel on each freeway on Route B.

| Freeway | Speed Limit (miles per hour) | Distance (miles)|
| --- | --- | --- |
| I-5 | 70 | 150 |
| CA-57 | 65 | 110 |
| CA-109 | 55 | 90 |

Using this information, calculate the average speed, in miles per hour, if you take Route B, and assign your answer to the variable `route_B`.

Note that the third column is `'Distance (miles)'`, not `'Time (hours)'`. Unlike in Question 3.1, you aren't given the amount of time you'll spend on each freeway; you need to compute these times yourself. To calculate the time it will take on each freeway, divide the distance for that freeway by the speed for that freeway. Finally, add up the times for each freeway to find the total time.

In [None]:
# Feel free to define intermediate variables to use in your solution.
total_distance2 = ...
total_time2 = ...

route_B = ...
route_B

In [None]:
grader.check("q3_2")

**Question 3.3.** Now it's time to decide whose car to take (Jason's or Minchan's). You decide to take the car with the biggest fuel tank. 

Unfortunately, Jason doesn't know the exact volume of his car's fuel tank, only that the tank has a height of 16 inches, a width of 12 inches, and a length of 32 inches. Minchan doesn't know the exact volume of his car's fuel tank either, but knows that his tank is in the shape of a cube. What is the length of one of the sides of Minchan's fuel tank (in inches) so that it stores the same amount of fuel as Jason's? Save your answer in the variable `side_length`.

In [None]:
# Feel free to define intermediate variables to use in your solution.
side_length = ...
side_length

In [None]:
grader.check("q3_3")

In this problem, though you calculated three different quantities in three different ways, all of your results are actually considered **means**, of various kinds!

In Question 3.1, given $n$ values $x_1, x_2, ..., x_n$, you found an *arithmetic mean*, using the formula

$${x_1+x_2+...+x_n \over n},$$

where the numerator represented total distance and the denominator represented total time. An arithmetic mean is the usual type of mean or average you're used to seeing. It turns out that you actually computed a more sophisticated arithmetic mean, known as a _weighted arithmetic mean_, 

$$\frac{w_1 x_1 + w_2 x_2 + ... + w_n x_n}{w_1 + w_2 + ... + w_n},$$

where the weights $w_1, w_2, w_3$ were the times on each freeway.

In Question 3.2, given  $n$ values $x_1, x_2, ..., x_n$, you found a *harmonic mean*, using the formula

$${n \over {{1 \over x_1}+{1 \over x_2}+ ... + {1 \over x_n}}},$$ 


where the numerator represented total distance and the denominator represented total time. To calculate the total time, you needed to sum the time taken for each part of the trip, calculated using the fact that time is distance over speed. Again, it turns out that you actually computed the _weighted harmonic mean_, but this time the weights were the distances on each of the freeways. If you're curious, see the formula [here](https://en.wikipedia.org/wiki/Harmonic_mean#Weighted_harmonic_mean).

Finally in Question 3.3, given $n$ values $x_1, x_2, ..., x_n$, you found a *geometric mean*, using the formula 

$${\sqrt[n]{x_1 \cdot x_2 \cdot ... \cdot x_n}},$$ 

where each value represented a dimension of the fuel tank. 

As you can see, there are many different notions of the mean. You'll learn about some of them if you take DSC 40A!
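
To make the three kinds of mean concrete, here is a minimal sketch using made-up numbers (a hypothetical three-leg trip and a hypothetical box). None of these values come from the questions above, and the variable names are illustrative only.

```python
# Hypothetical speeds (miles per hour) on three legs of a trip.
speed_1, speed_2, speed_3 = 60, 40, 30

# Weighted arithmetic mean: the weights are the hours spent on each leg.
hours_1, hours_2, hours_3 = 1, 2, 1
distance_from_hours = speed_1 * hours_1 + speed_2 * hours_2 + speed_3 * hours_3
weighted_arithmetic = distance_from_hours / (hours_1 + hours_2 + hours_3)

# Weighted harmonic mean: the weights are the miles driven on each leg,
# and the time on each leg is distance divided by speed.
miles_1, miles_2, miles_3 = 120, 80, 60
hours_from_miles = miles_1 / speed_1 + miles_2 / speed_2 + miles_3 / speed_3
weighted_harmonic = (miles_1 + miles_2 + miles_3) / hours_from_miles

# Geometric mean of three hypothetical box dimensions (inches):
# the cube root of their product.
geometric = (2 * 4 * 8) ** (1 / 3)

print(weighted_arithmetic)
print(weighted_harmonic)
print(geometric)
```

Because of floating-point rounding, the geometric mean printed here may differ very slightly from the exact cube root.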

## 4. AI Revolution ✨💻✨ 

In this problem, we want to compare and contrast early awareness of ChatGPT among American teenagers in different household income groups. The data below comes from [Pew Research Center's Survey of U.S. Teens](https://www.pewresearch.org/short-reads/2023/11/16/about-1-in-5-us-teens-whove-heard-of-chatgpt-have-used-it-for-schoolwork/sr_23-11-16_ai-in-schools_2-png/), in fall 2023, back when ChatGPT was new to the general public. The numbers below show **percentages** of each income group falling into each awareness category; note that each row sums to 100 or near 100 (because some respondents did not answer the question).

| Household Income     | Highly aware of ChatGPT  | A little aware of ChatGPT  | Not at all aware of ChatGPT |
|-------------------------------------------|-------------|-----------|-----------|
| less than 30,000 dollars      | 11 | 30 | 59 | 
| 30,000 to 74,999 dollars      | 22 | 36 | 40 | 
| more than 75,000 dollars   | 26 | 50  | 24 | 

We define the **dissimilarity** between two income groups as the largest absolute difference between their three respective percentages.

To better understand dissimilarity, consider the following hypothetical situation, where we compare the ChatGPT awareness between income group A and income group B. Suppose:
* Group A's *percentage of highly aware* is **10 percentage points more** than Group B's.
* Group A's *percentage of a little aware* is **4 percentage points less** than Group B's.
* Group A's *percentage of not at all aware* is **7 percentage points less** than Group B's.

Then, we would say the dissimilarity between Group A and Group B is 10, since 10 is larger than both 4 and 7.
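
The hypothetical comparison above can be sketched in code. The percentages below are invented so that the differences come out to 10, 4, and 7; they are not survey data, and the variable names are assumptions made for this illustration.

```python
# Hypothetical percentages for Group A and Group B (not from the survey).
a_high, a_little, a_none = 30, 36, 33
b_high, b_little, b_none = 20, 40, 40

# The dissimilarity is the largest of the three absolute differences.
example_dissimilarity = max(
    abs(a_high - b_high),
    abs(a_little - b_little),
    abs(a_none - b_none),
)

print(example_dissimilarity)
```

Taking absolute values first matters: without `abs`, the negative differences would never be chosen as the maximum, even when they are larger in size.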

**Question 4.1.** 
Using this method, compute the dissimilarity between the following two income groups: less than 30,000 dollars and 30,000 to 74,999 dollars.  Assign the result to the variable `dissimilarity`. Use a single expression (a single line of code) to compute the answer. Let Python perform all the arithmetic (like subtracting) rather than calculating the expression yourself. 

**_Hint:_**  The built-in `abs` function computes absolute values. 

In [None]:
dissimilarity = ...
dissimilarity

In [None]:
grader.check("q4_1")

**Question 4.2.** Which pair of income groups is **most** dissimilar, according to this measurement? Assign either 1, 2, or 3 to the variable `most_dissimilar` below. Check whether your answer matches up with your intuition.

1. Less than 30,000 dollars and 30,000 to 74,999 dollars. 
1. Less than 30,000 dollars and more than 75,000 dollars.
1. 30,000 to 74,999 dollars and more than 75,000 dollars.

In [None]:
most_dissimilar = ...

In [None]:
grader.check("q4_2")

**Question 4.3.** Suppose instead of measuring awareness of ChatGPT with three categories (highly aware, a little aware, and not at all aware), the researchers had instead measured awareness with two categories (aware and not aware, where aware includes **both** highly aware and a little aware). Between which two income categories would the dissimilarity change if we measured awareness with two categories instead of three? Assign either 1, 2, 3, 4, or 5 to the variable `would_change` below.

1. Less than 30,000 dollars and 30,000 to 74,999 dollars.
1. Less than 30,000 dollars and more than 75,000 dollars.
1. 30,000 to 74,999 dollars and more than 75,000 dollars.
1. All of the above.
1. None of the above.

In [None]:
would_change = ...

In [None]:
grader.check("q4_3")

## 5. Arrays 🗃️

**Question 5.1.** Make an array called `quirky_numbers` containing the following numbers (in the given order):

1. The square root of 23
2. 73 degrees, in radians
3. $3^9 + 7^5$
4. The mathematical constant $e$ divided by 8: $\frac{e}{8}$
5. The base 10 logarithm of 5

*Hint:* Check out the functions and constants available in the `numpy` module, which has been imported as `np`. If you're unsure of what function to use, a quick Google search should do the trick. Do **not** import `math` or any other modules.

*Note:* In this problem, as with all others, we'll only check that your answer is correct. There may be several valid ways to produce the correct answer.

In [None]:
quirky_numbers = ...
quirky_numbers

In [None]:
grader.check("q5_1")

**Question 5.2.** Make an array called `likes` containing the following three strings:
- `'I like planting'`
- `'my cats'`
- `'and my family!'`

In [None]:
likes = ...
likes

In [None]:
grader.check("q5_2")

<center><img src=images/cat_plant.jpeg width=400><a href="https://www.reddit.com/r/pottedcats/comments/xpokrc/blue_eyes/">source</a></center>


In Lecture 2, we looked at several string methods, like `upper` and `replace`. Strings have another method that we haven't seen yet, called `join`. `join` takes one argument, an array of strings, and it returns a single string. Specifically, `some_string.join(some_array)` evaluates to a new string consisting of all of the elements in `some_array`, with `some_string` inserted in between each element.

For example, `'-'.join(np.array(['call', '858', '534', '2230']))` evaluates to `'call-858-534-2230'`.
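
Here is the same example as runnable code, along with a second separator to show that the joining string can be more than one character. The variable names are chosen only for this illustration.

```python
import numpy as np

parts = np.array(['call', '858', '534', '2230'])

# The string before `.join` is placed between consecutive elements.
dashed = '-'.join(parts)

# The separator can be any string, including one with several characters.
slashed = ' / '.join(parts)

print(dashed)
print(slashed)
```

Note that the separator only appears between elements, never before the first element or after the last one.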

**Question 5.3.** Use the array `likes` and the method `join` to make two strings:

1. `'I like planting, my cats, and my family!'` (call this one `by_comma`)
1. `'I like planting my cats and my family!'` (call this one `by_space`)

In [None]:
by_comma = ...
by_space = ...

# Don't change the lines below.
print(by_comma)
print(by_space)

In [None]:
grader.check("q5_3")

Now let's get some practice accessing individual elements of arrays.  In Python (and in many programming languages), elements are accessed by *integer position*, with the position of the first element being zero. That's probably not the way you learned to count, so it's easy to get mixed up here. Be careful!
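
As a quick sketch of zero-based positions, consider the small made-up array below (the array and variable names are hypothetical).

```python
import numpy as np

seasons = np.array(['winter', 'spring', 'summer', 'fall'])  # hypothetical array

# Positions start at 0, so the first element is at position 0.
first = seasons[0]

# The third element is at position 2, one less than its "counting" number.
third = seasons[2]

# The last element is at position len(seasons) - 1, not len(seasons).
last = seasons[len(seasons) - 1]

print(first, third, last)
```

Asking for `seasons[len(seasons)]` would raise an `IndexError`, because no element sits at that position.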

**Question 5.4.** The cell below creates an array of strings.

In [None]:
some_strings = np.array(['flowers', '🌼', '🌸', '🌱', 'plant', 'dog', '🐶', 'cat', '🐈'])
some_strings

What is the integer position of `'🐶'` in the array? You can just type in the answer, which should be of type `int`. This is a conceptual question, not a coding question.

_Note:_ Your answer should be a **positive** integer!

In [None]:
dog_emoji_position = ...
dog_emoji_position

In [None]:
grader.check("q5_4")

**Question 5.5.** Suppose you have an array with 500 elements. What is the integer position of the ninth-to-last element in this array? You can just type in the answer, which should be of type `int`. This is a conceptual question, not a coding question.

_Note:_ Again, your answer should be a **positive** integer!

In [None]:
ninth_last_position = ...
ninth_last_position

In [None]:
grader.check("q5_5")

**Question 5.6.** Suppose you have an array with 123 elements. At what integer position is the middle element of this array? You can just type in the answer, which should be of type `int`. This is a conceptual question, not a coding question.

_Note:_ Again, your answer should be a **positive** integer!

In [None]:
mid_position = ...
mid_position

In [None]:
grader.check("q5_6")

By the way, it's also possible to use negative integer positions to access elements in an array, which can be easier than using positive integer positions sometimes.  If a position is negative, you count from the end of the array rather than from the beginning. Position -1 corresponds to the last element, -2 corresponds to the second-last element, and so on. For instance, to find the third-to-last element of `some_strings`, we could use:

In [None]:
some_strings[-3]
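
Negative and positive positions refer to the same elements. The sketch below checks this on a small hypothetical array; the names are illustrative only.

```python
import numpy as np

letters = np.array(['a', 'b', 'c', 'd', 'e'])  # hypothetical array
n = len(letters)

# Position -1 and position n - 1 both refer to the last element.
last_matches = letters[-1] == letters[n - 1]

# More generally, position -k and position n - k refer to the same element.
second_last_matches = letters[-2] == letters[n - 2]

print(last_matches, second_last_matches)
```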

**Question 5.7.** Let's say you are trying to create an online [multiplication table](https://en.wikipedia.org/wiki/Multiplication_table#Modern_times) for an elementary school. Using the function `np.arange()`, make an array called `multiples_of_10` which contains all the multiples of 10, in ascending order, that appear on the multiplication table below. 

It's also a good exercise to think about how your code would change if you wanted the multiples of $k$, for some other value of $k$ besides 10.

<center>
    <br>
    <img src=images/mult.jpg width=500>
</center>
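
As a reminder of how `np.arange` behaves, here is a brief sketch with arbitrary numbers that are unrelated to the table above. The variable names are hypothetical.

```python
import numpy as np

# One argument: start at 0 and stop just before the given value.
up_to_five = np.arange(5)

# Two arguments: start at the first value and stop just before the second.
two_to_six = np.arange(2, 7)

# Three arguments: the third value is the step between consecutive elements.
by_threes = np.arange(3, 16, 3)

print(up_to_five)
print(two_to_six)
print(by_threes)
```

In every case the stop value itself is excluded, which is the detail most likely to cause an off-by-one mistake.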

In [None]:
multiples_of_10 = ...
multiples_of_10

In [None]:
grader.check("q5_7")

## 6. Politics 🐘🐴

The table below has a row for each of the eight most populated counties in California. We'll be looking at how residents of these counties voted in the 2024 presidential election. Our data was compiled and curated by the [Associated Press](https://apnews.com/projects/election-results-2024/california/?r=0).

For each county in the table, the `'Harris'` column records the total number of votes cast for Kamala Harris, and the `'Trump'` column records the total number of votes cast for Donald Trump. Votes for other candidates are not included in this data.

|County|Harris|Trump|
|---|---|---|
|Los Angeles County|2417109|1189862|
|San Diego County|841372|593270|
|Orange County|691731|654815|
|Riverside County|451782|463677|
|San Bernardino County|362114|378416|
|Santa Clara County|510744|210924|
|Alameda County|499551|140789|
|Sacramento County|381564|252140|

In this question, we'll be working with the data from the `'Harris'` and `'Trump'` columns as *arrays*. Here are those arrays:

In [None]:
harris = np.array([2417109, 841372, 691731, 451782, 362114, 510744, 499551, 381564])
harris

In [None]:
trump = np.array([1189862, 593270, 654815, 463677, 378416, 210924, 140789, 252140])
trump

Remember, the `numpy` package (`np` for short) provides many handy functions for working with arrays. These are specifically designed to work with arrays and are faster than using Python's built-in functions. 

Some frequently used array functions are `np.min()`, `np.max()`, `np.sum()`, `np.abs()`, and `np.round()`. There are many more, which you can browse by typing `np.` into a code cell and hitting the *tab* key.
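
The sketch below shows element-wise arithmetic and a few of these functions on two small made-up arrays. The data and names are hypothetical and unrelated to the election results.

```python
import numpy as np

# Hypothetical win and loss counts for three teams.
wins = np.array([8, 3, 5])
losses = np.array([2, 7, 5])

# Arithmetic between arrays of the same length works element by element.
games = wins + losses
win_rate = wins / games

# Functions like np.abs also work element by element,
# while np.sum and np.max reduce an array to a single number.
margins = np.abs(wins - losses)
total_wins = np.sum(wins)
best_rate = np.max(win_rate)

print(win_rate)
print(margins)
print(total_wins, best_rate)
```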

**Question 6.1.** What proportion of the Harris or Trump voters in each county are Harris voters? Store these proportions for each county in a new array called `harris_share`. Similarly, store the proportions of Trump voters in `trump_share`.

In [None]:
harris_share = ...
harris_share

In [None]:
grader.check("q6_1_1")

In [None]:
trump_share = ...
trump_share

In [None]:
grader.check("q6_1_2")

**Question 6.2.** Find the gap between the proportion of Harris and Trump voters in each county. Create an array called `gaps` containing the absolute differences between the proportions in `harris_share` and `trump_share`.

In [None]:
gaps = ...
gaps

In [None]:
grader.check("q6_2")

**Question 6.3.** Now, find the gap between the Harris share and Trump share for each county, but this time, you're only allowed to use the variable `harris_share`. You may not use `trump_share`. Create an array called `gaps_again` containing the absolute differences between the Harris share and Trump share for each county. The answer will be the same as the last question, but your method should be different.

In [None]:
gaps_again = ...
gaps_again

In [None]:
grader.check("q6_3")

**Question 6.4.** You might say that the most bipartisan county is the one with the smallest gap. Find the smallest value in the `gaps` array and save it as `smallest_gap`. Referring back to the table, try to figure out which county that is!

In [None]:
smallest_gap = ...
smallest_gap

In [None]:
grader.check("q6_4")

## 7. World Cup 🌎⚽

The Fédération Internationale de Football Association (FIFA) is the international governing body of soccer (or football, depending on where you're from). FIFA has 211 member associations, making it one of the largest sports organizations in the world. The organization has hosted an international tournament, called the World Cup, every four years since 1930, except during WWII. The most recent one took place in 2022 in Qatar. 

<img src="images/messi.jpeg" width=60%>

The file `world_cup.csv` in the `data/` directory contains information about every World Cup tournament that has ever taken place. Its columns are described below.

| Column      | Description |
| ----------- | ----------- |
| `'Year'`      | Year of World Cup     |
| `'Host'`   | Name of host country        |
| `'Total Attendance'` | Total number of people in attendance across all matches  |
| `'Matches'` | Total number of matches played |
| `'Teams'` | Total number of teams that competed |
| `'First'` | Winner of the World Cup |
| `'Second'` | The team in second place | 
| `'Third'` | The team in third place|
| `'Fourth'` | The team in fourth place |

**Question 7.1.** Read this file into a DataFrame called `world_cup`. 

In [None]:
world_cup = ...
world_cup

In [None]:
grader.check("q7_1")

**Question 7.2.** Add a column to `world_cup` called `'Average_Attendance'` that contains the average number of attendees per match in each World Cup tournament. Do not round.

In [None]:
world_cup = ...
world_cup

In [None]:
grader.check("q7_2")

**Question 7.3.** Create a new DataFrame, `world_cup_by_year`, by setting the index of `world_cup` to `'Year'`. Don't change `world_cup`.

In [None]:
world_cup_by_year = ...
world_cup_by_year

In [None]:
grader.check("q7_3")

You should think about why we've chosen to set the index to `'Year'`, instead of any other column.
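
One way to see what an index is for: once a column of unique labels is the index, a single value can be looked up by label. The sketch below uses a small made-up table; the data, the names, and the fallback to `pandas` (for environments where `babypandas` is not installed) are all assumptions made for this illustration.

```python
import numpy as np

try:
    import babypandas as bpd
except ImportError:
    # Assumption: babypandas wraps a subset of pandas, so pandas can stand in here.
    import pandas as bpd

# Hypothetical table with one row per pet.
pets = bpd.DataFrame().assign(
    Name=np.array(['Rex', 'Milo', 'Luna']),
    Species=np.array(['dog', 'cat', 'cat']),
    Age=np.array([7, 3, 5]),
)

# set_index returns a new DataFrame; `pets` itself is left unchanged.
pets_by_name = pets.set_index('Name')

# With names as the index, .loc looks up a value by its label.
luna_age = pets_by_name.get('Age').loc['Luna']

print(luna_age)
```

A column works well as an index when each of its values appears exactly once, since every label then identifies a single row.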

**Question 7.4.** Michelle was born in 2006. Where was the World Cup held that year, and who won? Assign your results to `location_06` and `winner_06`, respectively.

Don't type in the answers by hand; get Python to extract this information for you.

In [None]:
location_06 = ...
winner_06 = ...

# Don't change the lines below.
print('Location:', location_06)
print('Winner:', winner_06)

In [None]:
grader.check("q7_4")

**Question 7.5.** Since the first tournament in 1930, more and more countries have joined FIFA, which means more matches are played in each tournament. Using DataFrame operations, find the number of World Cup tournaments that had more than 50 matches. Assign the number of such tournaments to `over_50_matches`. 

In [None]:
over_50_matches = ...
over_50_matches

In [None]:
grader.check("q7_5")

**Question 7.6.** Assign `third_highest_attendance` to the third highest total attendance of all World Cup tournaments. Assign `third_highest_year` to the year in which this attendance occurred.

Again, don't type in these values by hand; get Python to extract this information for you.

**_Note:_** Remember that you can perform intermediate steps in the lines before `third_highest_attendance` and `third_highest_year`.

In [None]:
third_highest_attendance = ...
third_highest_year = ...

# Don't change the lines below.
print('Attendance:', third_highest_attendance)
print('Year:', third_highest_year)

In [None]:
grader.check("q7_6")

**Question 7.7.** Some countries have hosted the World Cup more than once. Set `repeat_host` to an array of the names of these countries.

**_Hints:_** 
- You want to collect rows with the same value in the `'Host'` column. What DataFrame method can help you do that?
- The index of a DataFrame is not an array, but you can use the `np.array()` function to convert it to an array.
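
As a sketch of the two ideas in these hints, the example below groups a small made-up table and converts the resulting index to an array. The data, the names, and the fallback to `pandas` (for environments where `babypandas` is not installed) are assumptions made for this illustration.

```python
import numpy as np

try:
    import babypandas as bpd
except ImportError:
    # Assumption: babypandas wraps a subset of pandas, so pandas can stand in here.
    import pandas as bpd

# Hypothetical table with one row per trip.
visits = bpd.DataFrame().assign(
    City=np.array(['Lima', 'Oslo', 'Lima', 'Cairo', 'Oslo', 'Lima']),
    Days=np.array([3, 2, 4, 5, 1, 2]),
)

# groupby collects rows that share a City; sum adds up Days within each group.
days_by_city = visits.groupby('City').sum()

# After grouping, the unique cities form the index, in sorted order.
cities = np.array(days_by_city.index)

print(days_by_city)
print(cities)
```

Other aggregation methods, such as `.count()`, follow the same pattern and also leave the grouping column as the index.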

In [None]:
repeat_host = ...
repeat_host

In [None]:
grader.check("q7_7")

**Question 7.8.** Now find out which country was the most popular host overall by finding the sum of the `'Total Attendance'` for each country that has ever hosted a World Cup tournament. Assign the name of the host country with the greatest total attendance across all World Cups to `most_popular_host`.

**_Hint:_** Our solution for this question used only one line of code (thanks, `groupby`)!

In [None]:
most_popular_host = ...
most_popular_host

In [None]:
grader.check("q7_8")

## Finish Line: Almost there, but make sure to follow the steps below to submit! 🏁

**_Citations:_** Did you use any generative artificial intelligence tools to assist you on this assignment? If so, please state, for each tool you used, the name of the tool (e.g., ChatGPT) and the problem(s) in this assignment where you used the tool for help.

<hr style="color:Maroon;background-color:Maroon;border:0 none; height: 3px;">

Please cite tools here.

<hr style="color:Maroon;background-color:Maroon;border:0 none; height: 3px;">

To submit your assignment:

1. Make sure to comment out the code in Question 2.1 that causes an error.
1. Select `Kernel -> Restart & Run All` to ensure that you have executed all cells, including the test cells. 
1. Read through the notebook to make sure all cells ran and all tests passed.
1. Run the cell below to run all tests, and make sure that they all pass.
1. Download your notebook using `File -> Download as -> Notebook (.ipynb)`, then upload your notebook to Gradescope.
1. Stick around while the Gradescope autograder grades your work. Make sure you see that all tests have passed on Gradescope.
1. Check that you have a confirmation email from Gradescope and save it as proof of your submission. 

With homeworks, unlike with labs, the grade you see on Gradescope is **not your final score**. We will run correctness tests after the assignment's due date has passed.

In [None]:
grader.check_all()