# Homework 3: DataFrames, Control Flow, and Probability

## Due Sunday, October 27th at 11:59PM

Welcome to Homework 3! This homework will cover lots of different topics:
- Grouping with subgroups (see [BPD 11](https://notes.dsc10.com/02-data_sets/groupby.html#subgroups))
- Merging DataFrames (see [BPD 13](https://notes.dsc10.com/02-data_sets/merging.html))
- Conditional statements (see [CIT 9.1](https://inferentialthinking.com/chapters/09/1/Conditional_Statements.html))
- Iteration (see [CIT 9.2](https://inferentialthinking.com/chapters/09/2/Iteration.html))
- Probability (see [CIT 9.5](https://inferentialthinking.com/chapters/09/5/Finding_Probabilities.html))

### Instructions

Remember to start early and submit often. You are given six slip days throughout the quarter to extend deadlines. See the syllabus for more details. With the exception of using slip days, late work will not be accepted unless you have made special arrangements with your instructor.

**Important**: For homeworks, the `otter` tests don't usually tell you that your answer is correct. More often, they help catch careless mistakes. It's up to you to ensure that your answer is correct. If you're not sure, ask someone (not for the answer, but for some guidance about your approach). These are great questions for office hours (the schedule can be found [here](https://dsc10.com/calendar)) or Ed. Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. 

In [None]:
# Please don't change this cell, but do make sure to run it.
import babypandas as bpd
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (10, 5)

import numpy as np
import otter
grader = otter.Notebook()

# We need to import some extra packages for some fun demonstrations.
import json
from ipywidgets import interact, widgets
from IPython.display import YouTubeVideo, HTML, display, clear_output, Image, IFrame

# Don't worry about this.
def read_json(path):
    # Use a context manager so the file is closed after it is read.
    with open(path, 'r') as f:
        return json.load(f)

answer_words = bpd.read_csv('data/wordle.csv').get('word').values

def candy_map():
    src = "https://map.candystore.com/halloween/2024/fullscreen.html"
    width = 800
    height = 600
    display(IFrame(src, width, height))

### Supplemental Video on DataHub and Jupyter Notebooks

In Lab 0, we linked you to a video that walks you through key ideas you should be aware of when working on DataHub and in Jupyter Notebooks, including
- how files are organized on DataHub
- what it means to "restart the kernel"
- how to use keyboard shortcuts (most important: use `SHIFT + ENTER` to run a cell!)

Now that you have some experience with Jupyter Notebooks, we're linking this video again for your convenience. If you feel a little shaky on how to work your way around a notebook or troubleshoot issues, we recommend you give it another watch. (When troubleshooting, make sure to always check the [Debugging](https://dsc10.com/debugging/) tab on the course website as well.)

The video is quite long, but if you open it directly on YouTube (which you can do by clicking the video's title after it loads in the next cell), you'll see timestamps in the description that you can use to jump to the parts of the video you'd like to learn more about.

In [None]:
# Run this cell.
YouTubeVideo('Hq8VaNirDRQ')

## 0. Mid-Quarter Survey

We'd like to hear from you on how DSC 10 has been going so far this quarter. To do so, we've put together a survey that asks you to provide feedback on all aspects of the course. You can provide as much or as little detail as you'd like. We value your input and will use the results of the survey to improve the course!

This survey is entirely anonymous, though you are free to leave your name and email if you want. The responses to the survey will be visible to both course staff and the Data Science Student Representatives. There will also be a question at the end of the survey that will allow you to provide feedback on the DSC program as a whole.

<center><h3>Click <a href="https://forms.gle/wPLUUvWXKhkzGA5X6">here</a> to access the survey.</h3></center>

After completing the survey, enter the keyword provided at the end of the survey to get credit towards this homework assignment.

In [None]:
survey_keyword = ...

In [None]:
grader.check("q0")

## 1. 100 Years of "J" Baby Names 👶🏻

What letter does your first name start with? In this problem, we'll look at baby names starting with the letter "J". The file `data/baby_names.csv` contains information from the [Social Security Administration](https://www.ssa.gov/oact/babynames/limits.html) about "J" baby names in the US from 1924 to 2023 — that's one hundred years of data! Run the cell below to read in the data.

In [None]:
baby = bpd.read_csv('data/baby_names.csv')
baby

The DataFrame `baby` has a row for each `'State'` (50 US states plus Washington DC), `'Gender'` (`'M'` or `'F'`, as assigned at birth), `'Year'` (between 1924 and 2023), and `'Name'`. The `'Count'` column records the number of babies of that gender who were given that name in one state in one year.

The first row in `baby` contains the name John. Below, we look at only the rows corresponding to the name John.

In [None]:
baby[baby.get('Name') == 'John']

The first row of the DataFrame shows that there were 36 male babies named John born in Alaska in 1924. The many other rows for the name John come from other years and other states. Some of them even correspond to female babies named John!


Run the cell below to find out when and where many female Johns were born.

In [None]:
female_john = baby[(baby.get('Name') == 'John') & (baby.get('Gender') == 'F')]
female_john.sort_values(by='Count', ascending=False)

**Question 1.1.** There are many more male Johns than female Johns, so let's look at the popularity of the name John in male babies over time. Create a line plot that shows how the number of male babies named John has changed over time in the US. Then use your plot to answer the question that follows.

In [None]:
# Create your line plot here.

Around what year was the peak in popularity for the name John in male babies? Choose the closest answer from the options below and set `male_john_peak` to 1, 2, 3, or 4 corresponding to your answer choice.
1. 1930
2. 1950
3. 1970
4. 1990

In [None]:
male_john_peak = ...

In [None]:
grader.check("q1_1")

**Question 1.2.** In the `baby` DataFrame, how many babies of each gender were born in each state? Create a DataFrame named `num_babies` with one row for each gender in each state and columns `'State'`, `'Gender'`, and `'Count'`, where `'Count'` contains the total number of babies with a "J" name of that gender in that state. The first few rows of `num_babies` are shown below.

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>State</th>
      <th>Gender</th>
      <th>Count</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>AK</td>
      <td>F</td>
      <td>15495</td>
    </tr>
    <tr>
      <th>1</th>
      <td>AK</td>
      <td>M</td>
      <td>44767</td>
    </tr>
    <tr>
      <th>2</th>
      <td>AL</td>
      <td>F</td>
      <td>191205</td>
    </tr>
    <tr>
      <th>3</th>
      <td>AL</td>
      <td>M</td>
      <td>555313</td>
    </tr>
  </tbody>
</table>

***Hints:***
- You can do this in one line of code.
- Don't forget to use `.reset_index()`.
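
As a reminder of how grouping with subgroups behaves, here is a minimal sketch on a made-up DataFrame. The DataFrame `sales` and its columns are hypothetical and have nothing to do with `baby`. The `try`/`except` import is only there so the sketch also runs where `babypandas` is not installed, under the assumption that the same methods exist in `pandas`.

```python
try:
    import babypandas as bpd   # the course library
except ImportError:
    import pandas as bpd       # assumption: these methods behave the same in pandas

# Hypothetical toy data.
sales = bpd.DataFrame().assign(
    City=['SD', 'SD', 'SD', 'LA', 'LA'],
    Item=['tea', 'tea', 'coffee', 'tea', 'tea'],
    Sold=[3, 4, 10, 1, 2]
)

# Group by two columns at once, then move the group labels
# out of the index and back into ordinary columns.
totals = sales.groupby(['City', 'Item']).sum().reset_index()
totals
```

Here `totals` has one row for each combination of `'City'` and `'Item'` that appears in `sales`, and `'Sold'` holds the sum within each combination.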


In [None]:
num_babies = ...
num_babies

In [None]:
grader.check("q1_2")

A gendered name is a combination of a name and a gender, such as female John. Let's explore the average age of people with each gendered name. For example, let's calculate the average age of all female Johns.

In [None]:
female_john

We'll define the age of a person as 2024 (the current year) minus the year in which the person was born. This doesn't take into account people's birthdays, because we don't have that information. For example, if a female John was born in 1984, they will be counted as 2024 - 1984 = 40 years old. Therefore the **total age** of all the female Johns is given below.

In [None]:
total_age = ((2024 - female_john.get('Year')) * female_john.get('Count')).sum()
total_age

To find the average age, we need to know how many female Johns there are. The **total count** of female Johns is given below.

In [None]:
total_count = female_john.get('Count').sum()
total_count

Therefore the **average age** of female Johns is given below.

In [None]:
average_age = total_age / total_count
average_age

Notice that we _cannot_ calculate the average age of female Johns as follows.

In [None]:
age = 2024 - female_john.get('Year')
age.mean()

This is incorrect because it does not take into account the fact that more female Johns were born in some years than in others.
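
To see the difference on a small scale, here is a minimal sketch with made-up numbers. The arrays `years` and `counts` are hypothetical and are not taken from `baby`.

```python
import numpy as np

# Hypothetical data: one person born in 1984 and nine people born in 2004.
years = np.array([1984, 2004])
counts = np.array([1, 9])

ages = 2024 - years   # 40 and 20

# Unweighted mean: treats both years as if they had the same number of births.
naive_average = ages.mean()

# Weighted mean: total age of all people divided by the total number of people.
weighted_average = (ages * counts).sum() / counts.sum()

print(naive_average)     # 30.0
print(weighted_average)  # 22.0
```

Since nine of the ten people are 20 years old, the weighted average of 22 describes the group far better than the unweighted average of 30.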

**Question 1.3.** Create a DataFrame named `avg_age` that has one row for each gendered name and columns `'Gender'`, `'Name'`, and `'Average_Age'`, where `'Average_Age'` contains the average age of all people with that gendered name. The first few rows of `avg_age` are shown below.

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Gender</th>
      <th>Name</th>
      <th>Average_Age</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>F</td>
      <td>Ja</td>
      <td>24.000000</td>
    </tr>
    <tr>
      <th>1</th>
      <td>M</td>
      <td>Ja</td>
      <td>24.571429</td>
    </tr>
    <tr>
      <th>2</th>
      <td>F</td>
      <td>Jace</td>
      <td>10.451613</td>
    </tr>
    <tr>
      <th>3</th>
      <td>M</td>
      <td>Jace</td>
      <td>11.472549</td>
    </tr>
  </tbody>
</table>

***Hints:***
- Before attempting this question, make sure you understand the strategy shown above for finding the average age of female Johns. You will need to generalize this approach.
- This is a multi-step problem. Add cells and display your intermediate results so you can see your progress as you go.
- You should check that the average age for female Johns in your DataFrame `avg_age` is the same as we found above.


In [None]:
avg_age = ...
avg_age

In [None]:
grader.check("q1_3")

## 2. Trick or Treat 🍭🎃

In this question, we'll be exploring some data on the most popular Halloween candies in each state, from [this article](https://www.candystore.com/blogs/facts-trivia/halloween-candy-map-popular?y=2024).

Run the cell below to see a fun interactive data visualization from the same article. Try hovering over your favorite state.

In [None]:
candy_map()

In [None]:
states = bpd.read_csv('data/popular_candy_by_state.csv')
states

In the `states` DataFrame above, each state's `'Top Candy'` is recorded, based on candy sales in that state. `'Pounds'` refers to the total pounds of that specific candy sold in that state.<br>

The `states` DataFrame does not contain any information about the candies themselves, e.g. which candies are chocolate and which candies are fruity. For this information, we can refer to a dataset curated by FiveThirtyEight for their article [The Ultimate Halloween Candy Power Ranking](https://fivethirtyeight.com/videos/the-ultimate-halloween-candy-power-ranking/), which we recommend reading!

Run the cell below to load in a dataset containing information about many varieties of candy and save it as a DataFrame named `varieties`.

_Note_: The column in `varieties` that contains the names of the candies is `'competitorname'`, because these candies were all competing against each other in FiveThirtyEight's Halloween Candy Power Ranking. 

In [None]:
varieties = bpd.read_csv('data/halloween_candy.csv')
varieties

**Question 2.1.** Using the `merge` method, combine the `states` and `varieties` DataFrames, and assign the resulting DataFrame to the variable `states_and_varieties`. 
- `states_and_varieties` should contain all of the columns in both `states` and `varieties`, minus the `'competitorname'` column from `varieties`, which is redundant with the `'Top Candy'` column from `states`.
- Sort `states_and_varieties` by `'State'` in ascending order.



In [None]:
states_and_varieties = ...
states_and_varieties

In [None]:
grader.check("q2_1")

**Question 2.2.** If you completed Question 2.1 correctly, you'll notice that `states_and_varieties` has fewer rows than both `states` and `varieties`. This is because there are some candies that are in `states` and not in `varieties`, and other candies that are in `varieties` and not in `states`. 

Below, assign `states_not_varieties` to the number of different candies that are in `states` and not in `varieties`. Similarly, assign `varieties_not_states` to the number of different candies that are in `varieties` and not in `states`.

_Hint_: There are two ways to find the number of unique values in a column.

1. Group by that column. On the resulting DataFrame, use `.shape[0]`.

2. Use the `.unique()` method on the Series corresponding to that column. Use `len` on the resulting array.

You'll need to do this three times – once each for the columns that contain candy names in `states`, `varieties`, and `states_and_varieties`.
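
The following sketch illustrates, on made-up data, why a merge can have fewer rows than either input and how counting unique values helps explain it. The DataFrames `pets` and `sounds` are hypothetical. The fallback import assumes that the same methods exist in `pandas`.

```python
try:
    import babypandas as bpd   # the course library
except ImportError:
    import pandas as bpd       # assumption: these methods behave the same in pandas

# Hypothetical toy tables whose matching columns have different names.
pets = bpd.DataFrame().assign(
    Owner=['Ana', 'Ben', 'Cy', 'Di'],
    Pet=['cat', 'dog', 'cat', 'emu']
)
sounds = bpd.DataFrame().assign(
    animal=['cat', 'dog', 'owl'],
    sound=['meow', 'woof', 'hoot']
)

# Only rows whose animal appears in both tables survive the merge.
merged = pets.merge(sounds, left_on='Pet', right_on='animal')

in_pets = len(pets.get('Pet').unique())                 # 3: cat, dog, emu
in_sounds = sounds.groupby('animal').count().shape[0]   # 3: cat, dog, owl
in_both = len(merged.get('Pet').unique())               # 2: cat, dog

pets_only = in_pets - in_both       # 1 (emu)
sounds_only = in_sounds - in_both   # 1 (owl)
```

Di's emu has no match in `sounds`, and the owl has no match in `pets`, so neither appears in `merged`.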



In [None]:
states_not_varieties = ...
varieties_not_states = ...
print('There are', states_not_varieties, 'candies in `states` that are not in `varieties`.')
print('There are', varieties_not_states, 'candies in `varieties` that are not in `states`.')

In [None]:
grader.check("q2_2")

Now that we better understand how `states_and_varieties` came to be, let's use it to learn more about states' candy preferences.

In [None]:
states_and_varieties

The 0s and 1s in the columns `'chocolate'`, `'fruity'`, `'caramel'`, etc. can be interpreted as Boolean values. For instance, since the state `'CT'` has a 1 in its `'chocolate'` column, it means that Connecticut's most popular candy includes chocolate.

**Question 2.3.** Among just the states in `states_and_varieties` where the most popular candy includes chocolate, what proportion of these states have a most popular candy that also includes caramel? Assign your answer to `p_caramel_given_chocolate`. It should be a decimal between 0 and 1.


In [None]:
p_caramel_given_chocolate = ...
p_caramel_given_chocolate

In [None]:
grader.check("q2_3")

## 3. Wordle 🟨 ⬛ 🟨 🟩 ⬛


<img src = "data/wordle_example.jpg" width=200>

[Wordle](https://www.nytimes.com/games/wordle/index.html), now owned by The New York Times, is a word-guessing game that became extremely popular at the end of 2021. Players have six tries to guess an unknown five-letter answer word. To play, you first enter a five-letter word as a guess for the answer word. After you make your guess, each letter of your guess will be highlighted with a color-coded square as follows:

- A black square ⬛ means that this letter is **not** in the answer word at all.  
- A yellow square 🟨 means that this letter is in the answer word, but in a **different position**.  
- A green square 🟩 means that this letter is in the **correct position** in the answer word.

In this question, you will replicate some of that behavior using `for`-loops. We'll make a simplifying assumption that's not present in the real Wordle game: the answer word will always have five **different** letters and every guess must also have five **different** letters.

We'll represent your Wordle results with a string of five colored square emojis. (Emojis can be included in strings, just like letters, numbers, and punctuation!)

For example, if the answer word is `'STEIN'` and your guess is `'SCALE'`, as shown in the image above, your results can be represented by the string `'🟩⬛⬛⬛🟨'`. 

The array `emojis` defined below contains all the symbols we'll need to construct such strings.

In [None]:
emojis = np.array(['⬛', '🟨', '🟩'])
emojis

Recall that in [Lecture 10](https://dsc10.com/resources/lectures/lec10/lec10.html), we introduced the accumulator pattern. In the coin-flipping example, we started with an empty array and added onto it in each iteration of our `for`-loop.

We can also use the accumulator pattern for strings, by starting with an empty string, and repeatedly adding onto it via concatenation. Here's an example that loops through the letters in a word and replaces certain letters with colored square emojis.

In [None]:
output = ''
for ch in 'goodbye':
    if ch == 'b':
        output = output + emojis[0] # add a black square
    elif ch == 'y':
        output = output + emojis[1] # add a yellow square
    elif ch == 'g':
        output = output + emojis[2] # add a green square
    else:
        output = output + ch
output

In the above example, we started with an empty string, `output`. For each character of the string `'goodbye'`, we added a single new character to `output`, depending on whether we saw `'g'`, `'b'`, `'y'`, or something else.

**Question 3.1.** Now, complete the implementation of the function `emojify`, which takes as input two five-letter strings, each having no repeated letters. Each letter of the first input string, `guess`, should be checked against the second input string, `answer`. The function should return a new string, formed entirely of emojis from the `emojis` array, that indicates the accuracy of each letter in the guess, as described in the rules above. For example, let's say the `answer` string is `'shark'`. Here is how the function would work on various example guesses.

```py
>>> emojify('crept', 'shark')
'⬛🟨⬛⬛⬛'

>>> emojify('chalk', 'shark')
'⬛🟩🟩⬛🟩'

>>> emojify('harks', 'shark')
'🟨🟨🟨🟨🟨'

>>> emojify('sharp', 'shark')
'🟩🟩🟩🟩⬛'

>>> emojify('shark', 'shark')
'🟩🟩🟩🟩🟩'
```

_Note_: As we did in the preceding example, use the array `emojis` to access the emojis – don't actually include any colored square emojis in the code you write below.

***Hints:*** 
- Look at the slide titled *Ranges* in [Lecture 10](https://dsc10.com/resources/lectures/lec10/lec10.html#Ranges) for guidance.
- You'll need to use the `in` keyword, which we also introduced in Lecture 10.
- Remember to write a general function that works for any answer word, not just `'shark'`.
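
If you'd like a refresher on looping over positions and using `in`, here is a small sketch on an unrelated task: capitalizing the vowels in a word. The variable names `word` and `marked` are just for illustration.

```python
word = 'python'
marked = ''

# range(len(word)) produces the positions 0, 1, ..., len(word) - 1.
for i in range(len(word)):
    # `in` checks whether the letter at position i appears anywhere in 'aeiou'.
    if word[i] in 'aeiou':
        marked = marked + word[i].upper()
    else:
        marked = marked + word[i]

marked  # 'pythOn'
```

Looping over positions rather than over the letters themselves means you always know where in the string you currently are.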

In [None]:
def emojify(guess, answer):
    # This line ensures your code works correctly regardless of 
    # whether the guess is in upper case or lower case.
    guess = guess.lower()

    # Empty string to add on to.
    emoji_output_string = '' 
    
    # You'll need to complete the body of this for loop.
    for i in range(len(guess)):
        ...
    
    # Don't change this
    return emoji_output_string

# An example call to emojify. Try out some other words, too.
emojify('crept', 'shark')

In [None]:
grader.check("q3_1")

### Final Product

You just implemented the logic for a Wordle game. Let's try it out! 

Run the following cell once you've completed the rest of this question, and you'll see a text box. Type a five-letter word with no repeated letters to guess the secret answer word. You can change your guess by backspacing and typing in a different guess for the same answer word. 

You can play the game more than once by running the cell again to generate a new answer word.

In [None]:
answer = np.random.choice(answer_words)
def emojify_live(guess):
    result = emojify(guess, answer)
    display(HTML('<h3>' + result + '</h3>'))
    if result == '🟩🟩🟩🟩🟩':
        display(HTML('<h3> You win! </h3>'))
display(HTML('<h2> Let\'s play Wordle! </h2>'))
interact(emojify_live, guess="");

## 4. Alternating Products

In this problem, we'll define two functions that compute some sort of "alternating product" of a sequence of values.

**Question 4.1.** Complete the implementation of the function `alternating_product`, which takes in an array of numbers, `values`, and returns the product of every other element in `values`, starting with the first element (at position `0`). Example behavior is shown below.

```py
>>> alternating_product(np.array([2, 3.5, 1, 1.5]))
2.0 # comes from 2 * 1

>>> alternating_product(np.array([2, 3.5, 1, 1.5, 4.5]))
9.0 # comes from 2 * 1 * 4.5
```

In [None]:
def alternating_product(values):
    ...
    
# Feel free to change this input to make sure your function works correctly.
alternating_product(np.array([2, 3.5, 1, 1.5]))

In [None]:
grader.check("q4_1")

**Question 4.2.** In math, the word "alternating" is also used to describe sequences of numbers where the signs oscillate back and forth between positive and negative. Complete the implementation of the function `alternating_sign_product`, which takes in an array of positive numbers, `values`, and returns the product of every element in `values`, with alternating signs, starting with a positive sign for element `0`, a negative sign for element `1`, and so on. Example behavior is shown below.

```py
>>> alternating_sign_product(np.array([2, 3.5, 1]))
-7.0 # comes from 2 * (-3.5) * 1

>>> alternating_sign_product(np.array([2, 3.5, 1, 1.5]))
10.5 # comes from 2 * (-3.5) * 1 * (-1.5)
```

***Hint:*** If `x` is an integer, `x % 2` evaluates to 0 when `x` is even and to 1 when `x` is odd. If `x` represents the position of an element in the array, you can use this to help you figure out whether the sign should be positive or negative.
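
To illustrate how `%` can tell even positions apart from odd ones, here is a sketch of a related but different computation, an alternating sum. The function `alternating_sum` is a hypothetical helper, not something you are asked to write.

```python
import numpy as np

def alternating_sum(values):
    """Add the elements at even positions and subtract those at odd positions."""
    total = 0
    for i in np.arange(len(values)):
        if i % 2 == 0:
            total = total + values[i]   # positions 0, 2, 4, ...
        else:
            total = total - values[i]   # positions 1, 3, 5, ...
    return total

alternating_sum(np.array([5, 1, 4, 2]))  # 5 - 1 + 4 - 2 = 6
```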


In [None]:
def alternating_sign_product(values):
    ...
    
# Feel free to change this input to make sure your function works correctly.
alternating_sign_product(np.array([2, 3.5, 1]))

In [None]:
grader.check("q4_2")

## 5. Lucky Triton Lotto 🔱 🎱 

Suppose UCSD holds an annual lottery called the Lucky Triton Lotto, where students can enter to win Triton Cash, or even free housing! Here's how the Lucky Triton Lotto works:

- First, you pick five **different** numbers, one at a time, from 1 to 29. This is because, according to [USNews](https://www.usnews.com/best-colleges/university-of-california-san-diego-1317), UCSD is ranked 29th in the nation among the best universities to attend for 2024-2025. Let's say you select 7, 12, 24, 15, and 13, in that order.
- Then, you separately pick a number from 1 to 12. This is because UCSD's Data Science program is ranked 12th in [USNews's](https://www.usnews.com/best-colleges/rankings/computer-science/data-analytics-science) best undergraduate Data Science programs list (though we think it's number one). Let's say you select 3.
- The six numbers you have selected, or **your numbers**, can be represented all together as (7, 12, 24, 15, 13, 3). This is a _sequence_ of six numbers – **order matters**!

The **winning numbers** are chosen by King Triton drawing five balls, one at a time, **without replacement**, from a pot of white balls numbered 1 to 29. Then, he draws a gold ball, the Tritonball, from a pot of gold balls numbered 1 to 12. Both pots are completely separate, hence the different ball colors. For example, maybe the winning numbers are (15, 9, 24, 23, 1, 3).

We’ll assume for this problem that in order to win the grand prize (free housing), all six of your numbers need to match the winning numbers and be in the **exact same order**. In other words, your entire sequence of numbers must be exactly the same as the sequence of winning numbers. However, if some numbers in your sequence match up with the corresponding number in the winning sequence, you will still win some Triton Cash. 

Suppose again that you select (7, 12, 24, 15, 13, 3) and the winning numbers are (15, 9, 24, 23, 1, 3). In this case, two of your numbers are considered to match two of the winning numbers. 
- Your numbers: (7, 12, **24**, 15, 13, **3**)
- Winning numbers: (15, 9, **24**, 23, 1, **3**)

You won't win free housing, but you will win some Triton Cash. Note that although both sequences include the number 15 within the first five numbers (representing a white ball), since they are in different positions, that's not considered a match.
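
The matching rule described above can be sketched in code as follows. The function `count_matches` is a hypothetical helper written only to illustrate the rule. It is not part of the assignment.

```python
def count_matches(your_numbers, winning_numbers):
    """Count the positions at which the two sequences hold the same number."""
    matches = 0
    for i in range(len(your_numbers)):
        # A number only counts as a match if it sits in the same position.
        if your_numbers[i] == winning_numbers[i]:
            matches = matches + 1
    return matches

count_matches([7, 12, 24, 15, 13, 3], [15, 9, 24, 23, 1, 3])  # 2
```

The number 15 appears in both sequences but at different positions, so it does not contribute to the count.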


**Question 5.1.** What is the probability that your Tritonball number (the last number in your sequence) matches the winning Tritonball number? Calculate your answer and assign it to `tritonball_chance`. If you need to do any calculations (e.g. multiplication or division), make Python do it; don't use a separate calculator. Your result should be a decimal number between 0 and 1.

In [None]:
tritonball_chance = ...
tritonball_chance

In [None]:
grader.check("q5_1")

**Question 5.2.** What is the probability that your first three numbers match the first three winning numbers? Calculate your answer and assign it to `first_three_chance`. If you need to do any calculations (e.g. multiplication or division), make Python do it; don't use a separate calculator. Your result should be a decimal number between 0 and 1.

***Hint:*** You need **all three** of the first three numbers to match. What probability rule should you use?

In [None]:
first_three_chance = ...
first_three_chance

In [None]:
grader.check("q5_2")

**Question 5.3.** What is the probability that you win the grand prize, free housing? Calculate your answer and assign it to `free_housing_chance`. If you need to do any calculations (e.g. multiplication or division), make Python do it; don't use a separate calculator. Your result should be a decimal number between 0 and 1.

***Hint:*** When you select a ball without replacement, what happens to the total number of balls you can select next time?
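
To build intuition for drawing without replacement, here is a sketch that uses a much smaller, made-up pot of five balls and only two draws. The names `balls` and `my_picks` are hypothetical. The sketch lists every possible ordered outcome and compares that count-based probability to the multiplication rule.

```python
from itertools import permutations

balls = [1, 2, 3, 4, 5]   # hypothetical small pot
my_picks = (2, 5)         # the ordered pair we hope to see drawn

# Every ordered way to draw two balls without replacement.
# All of these outcomes are equally likely.
outcomes = list(permutations(balls, 2))
by_counting = outcomes.count(my_picks) / len(outcomes)

# Multiplication rule: 1 of 5 balls matches on the first draw,
# then 1 of the 4 remaining balls matches on the second draw.
by_rule = (1 / 5) * (1 / 4)

print(len(outcomes))   # 20
print(by_counting, by_rule)
```

Both approaches agree because each draw removes one ball from the pot, leaving fewer possibilities for the next draw.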

In [None]:
free_housing_chance = ...
free_housing_chance

In [None]:
grader.check("q5_3")

**Question 5.4.** What is the probability that you do **not** win free housing? Calculate your answer and assign it to `no_free_housing_chance`. If you need to do any calculations (e.g. multiplication or division), make Python do it; don't use a separate calculator. Your result should be a decimal number between 0 and 1.

In [None]:
no_free_housing_chance = ...
no_free_housing_chance

In [None]:
grader.check("q5_4")

## Finish Line: Almost there, but make sure to follow the steps below to submit! 🏁

**_Citations:_** Did you use any generative artificial intelligence tools to assist you on this assignment? If so, please state, for each tool you used, the name of the tool (e.g., ChatGPT) and the problem(s) in this assignment where you used the tool for help.

<hr style="color:Maroon;background-color:Maroon;border:0 none; height: 3px;">

Please cite tools here.

<hr style="color:Maroon;background-color:Maroon;border:0 none; height: 3px;">

To submit your assignment:

1. Select `Kernel -> Restart & Run All` to ensure that you have executed all cells, including the test cells.
1. Read through the notebook to make sure everything is fine and all tests passed.
1. Run the cell below to run all tests, and make sure that they all pass.
1. Download your notebook using `File -> Download as -> Notebook (.ipynb)`, then upload your notebook to Gradescope.
1. Stick around while the Gradescope autograder grades your work. Make sure you see that all tests have passed on Gradescope.
1. Check that you have a confirmation email from Gradescope and save it as proof of your submission.

In [None]:
grader.check_all()