# Homework 2: Arrays and DataFrames

## Due Saturday, October 9 at 11:59pm

Welcome to Homework 2! This week's HW will cover arrays and DataFrames in Python. You can find additional help on these topics in [Chapter 02](https://eldridgejm.github.io/dive_into_data_science/02-data_sets/arrays.html), sections 1 through 5, of our textbook.

### Instructions

This assignment is due Saturday, October 9 at 11:59pm. You are given six slip days throughout the quarter to extend deadlines. See the syllabus for more details. With the exception of using slip days, late work will not be accepted unless you have made special arrangements with your instructor.

**Important**: For homeworks, the `otter` tests don't usually tell you that your answer is correct. More often, they help catch careless mistakes. It's up to you to ensure that your answer is correct. If you're not sure, ask someone (not for the answer, but for some guidance about your approach). These are great questions for office hours (schedule on Canvas) or your team's chatroom on Campuswire. Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. 

**Please do not use for-loops for any questions in this homework.** If you don't know what a for-loop is, don't worry -- we haven't covered them yet. But if you do know what they are and are wondering why it's not OK to use them, it is because loops in Python are slow compared to array and DataFrame operations, which work on all elements at once, so looping over arrays and tables should usually be avoided.
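As a rough illustration of working without loops, here is a minimal sketch of element-wise array arithmetic. The `prices` array and the numbers in it are made up for this example and are not part of the homework.

```python
import numpy as np

# Hypothetical data: prices in dollars (not part of the homework).
prices = np.array([2.50, 4.00, 10.00])

# Arithmetic on an array applies to every element at once; no loop is needed.
with_tax = prices * 1.08      # multiply each price by 1.08
discounted = prices - 0.50    # subtract 0.50 from each price
total = prices + with_tax     # add two arrays element by element

print(with_tax)
print(discounted)
print(total)
```

Each result is a new array with the same number of elements as `prices`.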

In [None]:
# Don't change this cell; just run it. 
import numpy as np
import babypandas as bpd
import math

import otter
grader = otter.Notebook()

## 1. Arrays


**Question 1.1.** Make an array called `weird_numbers` containing the following numbers (in the given order):

1. The factorial of 10
2. 3 radians, in degrees
3. The mathematical constant $e$
4. $7^4$
5. The square root of 7

*Hint:* Take a look at the functions and constants in the `math` module.

In [None]:
weird_numbers = ...
weird_numbers

In [None]:
grader.check("q1_1")

**Question 1.2.** Make an array called `words` containing the following three strings:
* `A panda eats`
* `shoots`
* `and leaves.`


In [None]:
words = ...
words

In [None]:
grader.check("q1_2")

<img src="./data/commas.png"/>

Strings have a method called `join`.  `join` takes one argument, an array of strings, and it returns a single string. Specifically, the value of `some_string.join(some_array)` is a single string that's the concatenation ("putting together") of all the strings in `some_array`, *except* `some_string` is inserted in between each string.

Read that again if you need to!
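If it helps, here is a small sketch of `join` on a made-up array. The `letters` array is hypothetical and is not the array used in the questions.

```python
import numpy as np

# Hypothetical array of strings (not the homework's array).
letters = np.array(["a", "b", "c"])

dashed = "-".join(letters)   # "-" is inserted between neighboring strings
squished = "".join(letters)  # an empty separator just concatenates

print(dashed)    # a-b-c
print(squished)  # abc
```

Notice that the separator appears only *between* elements, never before the first or after the last.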

**Question 1.3.** Use the array `words` and the method `join` to make two strings:

1. `A panda eats, shoots, and leaves.` (call this one `with_commas`)
2. `A panda eats shoots and leaves.` (call this one `with_spaces`)

*Hint:* If you're not quite sure what `join` does, first try calling, for example, `"foo".join(words)`.

In [None]:
with_commas = ...
with_spaces = ...
print(with_commas)
print(with_spaces)

In [None]:
grader.check("q1_3")

Now let's get some practice with accessing individual elements of arrays.  In Python (and in many programming languages), elements are accessed by *integer position*, with the position of the first element being zero.  
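For instance, a minimal sketch of positional access on a made-up array might look like this. The `planets` array is hypothetical and is not part of the homework.

```python
import numpy as np

# Hypothetical array (not part of the homework).
planets = np.array(["Mercury", "Venus", "Earth", "Mars"])

first = planets[0]   # position 0 holds the first element
third = planets[2]   # position 2 holds the third element

print(first)  # Mercury
print(third)  # Earth
```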

**Question 1.4.** The cell below creates an array of strings.

In [None]:
some_strings = np.array(['first', 'second', 'third', 'fourth', 'fifth', 'sixth', 'last'])
some_strings

What is the integer position of `'fifth'` in the array? You can just type in the answer, which should be of type `int`. This is a conceptual question, not a coding question.

In [None]:
fifth_position = ...
fifth_position

In [None]:
grader.check("q1_4")

**Question 1.5.** Suppose you have an array with 100 elements. What integer position is the second-to-last element of this array? You can just type in the answer, which should be of type `int`. This is a conceptual question, not a coding question.

In [None]:
second_to_last_position = ...
second_to_last_position

In [None]:
grader.check("q1_5")

By the way, you can also use negative integer positions to access elements from the end of the array. Use -1 to get the last element, -2 to get the second-to-last element and so on. For example, to find the last element of `some_strings`:

In [None]:
some_strings[-1]

**Question 1.6.** Suppose you have an array with 499 elements. What integer position is the middle element of this array? You can just type in the answer, which should be of type `int`. This is a conceptual question, not a coding question.

In [None]:
mid_position = ...
mid_position

In [None]:
grader.check("q1_6")

## 2. Unisex Names


The table below shows data on the eight most common unisex names in America. The data originally comes from the Social Security Administration, but it was compiled and curated by FiveThirtyEight for their 2015 article [The Most Common Unisex Names In America: Is Yours One Of Them?](https://fivethirtyeight.com/features/there-are-922-unisex-names-in-america-is-yours-one-of-them/).

For each name in the table below, we have the `Count` which estimates the number of living people with that name, and the `Male Share`, or the proportion of people with that name who are male. Throughout this problem, we'll use *male* to mean assigned male at birth and *female* to mean assigned female at birth. We will assume that each person fits into exactly one of these two categories. 

|Name|Count|Male Share|
|---|---|---|
|Casey|177000|0.5843|
|Riley|155000|0.5076|
|Jessie|136000|0.4778|
|Jackie|133000|0.4211|
|Avery|122000|0.3352|
|Jaime|110000|0.5617|
|Peyton|95000|0.4337|
|Kerry|89000|0.4839|

In this question, we'll be working with the data from the `Count` and `Male Share` columns as *arrays*. Here are those arrays:

In [None]:
count = np.array([177000, 155000, 136000, 133000, 122000, 110000, 95000, 89000])
count

In [None]:
male_share = ...
male_share

Remember, the `numpy` package (`np` for short) provides many handy functions for working with arrays. These are specifically designed to work with arrays and are faster than using Python's built-in functions. 

Some frequently used array functions are `np.min()`, `np.max()`, `np.sum()`, `np.abs()`, and `np.round()`. There are many more, which you can browse by typing `np.` into a code cell and hitting the *tab* key.
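Here is a short sketch of the functions named above, applied to a made-up array. The `readings` values are hypothetical and unrelated to the names data.

```python
import numpy as np

# Hypothetical measurements (not the homework data).
readings = np.array([-1.25, 3.14159, 0.5, -7.0])

smallest = np.min(readings)      # single smallest value
largest = np.max(readings)       # single largest value
total = np.sum(readings)         # sum of all elements
magnitudes = np.abs(readings)    # absolute value of each element
rounded = np.round(readings, 2)  # each element rounded to 2 decimal places

print(smallest, largest, total)
print(magnitudes)
print(rounded)
```

`np.min`, `np.max`, and `np.sum` each reduce the array to a single number, while `np.abs` and `np.round` return a new array of the same length.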

**Question 2.1.** What proportion of people with each name are female? Store the female share proportions for each name in a new array called `female_share`. 

In [None]:
female_share = ...
female_share

In [None]:
grader.check("q2_1")

**Question 2.2.** Now let's find the gap between the male and female shares for each name. Use `male_share` and `female_share` to create an array called `gaps` containing the absolute differences between the male share and female share for each name.

In [None]:
gaps = ...
gaps

In [None]:
grader.check("q2_2")

**Question 2.3.** Now, find the gap between the male and female shares for each name, but this time, you're only allowed to use the variable `male_share`. You may not use `female_share`. Create an array called `gaps_again` containing the absolute differences between the male share and female share for each name. The answer will be the same as the last question, but your method should be different.

In [None]:
gaps_again = ...
gaps_again

In [None]:
grader.check("q2_3")

**Question 2.4.** You might say that the most unisex name is the one with the smallest gap. Find the smallest value in the `gaps` array and save it as `smallest_gap`. Referring back to the table, try to figure out which name that is!

In [None]:
smallest_gap = ...
smallest_gap

In [None]:
grader.check("q2_4")

**Question 2.5.** Create an array called `male_count` containing the estimated number of males with each name, and another called `female_count` containing the estimated number of females with each name. Make sure to round each value to the nearest whole number.

In [None]:
male_count = ...
female_count = ...

In [None]:
grader.check("q2_5")

**Question 2.6.** You might say that the most unisex name is the one for which the number of males with that name is closest to the number of females with that name. Assign `smallest_diff` to the smallest absolute difference between the number of males and number of females, among the given names. Referring back to the table, try to figure out which name that is!

In [None]:
smallest_diff = ...
smallest_diff

In [None]:
grader.check("q2_6")

**Question 2.7.** Perhaps instead of working with proportions as we've done here, you prefer to work with percentages. Create an array called `male_percentage` that contains the percentage of people with each name who are male. Similarly, create an array `female_percentage` with the percentage of people with each name who are female. Round each percentage to two decimal places.

In [None]:
male_percentage = ...
female_percentage = ...

In [None]:
grader.check("q2_7")

## 3. Serializing Cereal

<img src="./data/cereal.jpg"/>

The file named `cereal.csv` in the `data/` directory contains data on the production and nutritional content of various popular brand-name cereals.

**Question 3.1.** Read the CSV file into a DataFrame called `cereal_data`.

In [None]:
cereal_data = ...
cereal_data

In [None]:
grader.check("q3_1")

**Question 3.2.** Set the index to be the name of the cereal, so that it's easier to interpret. Save the new DataFrame in a variable called `cereals`.

In [None]:
cereals = ...
cereals

In [None]:
grader.check("q3_2")

**Question 3.3.** Using DataFrame operations, determine how much sodium (given in milligrams) is in a serving of 'Kix'.

In [None]:
kix_sodium = ...
kix_sodium

In [None]:
grader.check("q3_3")

**Question 3.4.** Assign to `lowest_rated` the name of the cereal with the lowest rating. If you're surprised by the result, that's because the `rating` column contains a nutritional rating, not a flavor rating!

In [None]:
lowest_rated = ...
lowest_rated

In [None]:
grader.check("q3_4")

**Question 3.5.** Manufacturers try to hide the less-than-stellar nutrition of their products by making the serving sizes small so that each serving seems to contain fewer calories, grams of fat, grams of sugar, etc. Notice that the serving size, in the `cups` column, varies from one cereal to the next. Being the savvy consumers that we are, we know that we should compare nutrition facts *per cup*. Starting with the `cereals` DataFrame, create a new DataFrame called `sugars_per_cup` which has an additional column called `spc` containing the amount of sugar in each cup of the cereal, sorted so that the cereals with the most sugars per cup appear first.

*Hint*: This is a multi-step problem. Start by defining an array or Series containing sugars per cup, then add it as a column to the DataFrame. Don't forget to sort!

In [None]:
# For a multi-step problem, it's helpful to define intermediate variables. 
# Feel free to do that here, or for any problem!

sugars_per_cup = ...
sugars_per_cup

In [None]:
grader.check("q3_5")

Here's how you can see all the cereals, ranked from most sugary to least sugary:

In [None]:
sugars_per_cup.index

**Question 3.6.** The column labeled `mfr` contains an abbreviation for the manufacturer that produces the cereal. One of these is General Mills, represented by G. Using the `cereals` DataFrame, create a new DataFrame called `general_mills` which contains only those cereals manufactured by General Mills (G).

In [None]:
general_mills = ...
general_mills

In [None]:
grader.check("q3_6")

**Question 3.7.** Among all cereals that are manufactured by General Mills, what is the median amount of sodium per serving?

In [None]:
general_mills_sodium = ...
general_mills_sodium

In [None]:
grader.check("q3_7")

**Question 3.8.** The `temp` column indicates whether the cereal is meant to be eaten hot (listed as 'H') or cold (listed as 'C'). It would be interesting to see if there is a noticeable difference in the amount of protein in hot versus cold cereals. Using the `cereals` DataFrame, find the absolute value of the difference between the average amount of protein for hot cereal and the average amount of protein for cold cereal. Save the result as `protein_difference`. 

*Hint*: To help break this problem into multiple steps, we suggest you define two intermediate variables, `average_protein_hot` and `average_protein_cold`.

In [None]:
# It may help to define these intermediate variables, but you don't need to use them.
average_protein_hot = ...
average_protein_cold = ...

protein_difference = ...
protein_difference

In [None]:
grader.check("q3_8")

**Question 3.9.** Starting with `cereals`, create a DataFrame called `high_fiber_low_calorie` containing all of the cereal brands that have at least 2 grams of fiber per serving and at most 100 calories per serving.

In [None]:
high_fiber_low_calorie = ...
high_fiber_low_calorie

In [None]:
grader.check("q3_9")

**Question 3.10.** Set `low_calorie_mfr` to the abbreviation for the manufacturer that produces cereals with the lowest average calories per serving. If you're interested, you can look up the full name of that manufacturer in the table below.

|Abbreviation|Full Name|
|---|---|
|A|American Home Food Products|
|G|General Mills|
|K|Kelloggs|
|N|Nabisco|
|P|Post|
|Q|Quaker Oats|
|R|Ralston Purina|

*Hint*: Our solution for this question used only one line of code (thanks, `groupby`)!

In [None]:
low_calorie_mfr = ...
low_calorie_mfr

In [None]:
grader.check("q3_10")

**Question 3.11.** Assign to `kelloggs_proportion` the proportion of cereals manufactured by Kelloggs (K) that have at least 2 grams of fat.

*Hint*: Start by defining a new DataFrame, `kelloggs`, that contains only the cereals manufactured by Kelloggs.

In [None]:
kelloggs = ...
kelloggs_proportion = ...
kelloggs_proportion

In [None]:
grader.check("q3_11")

## 4. COVID-19 Vaccinations in San Diego County

The file `data/vaccines.csv` contains information about COVID-19 vaccination progress in San Diego County. The data, which comes from the [California Health and Human Services Open Data Portal](https://data.chhs.ca.gov/dataset/vaccine-progress-dashboard), includes information on vaccine doses administered each day in 2021 through September 28, 2021. 

In [None]:
covid_vaccine = bpd.read_csv('data/vaccines.csv')
covid_vaccine

For a given day, the column `total_doses` contains the total number of vaccines administered that day. The column `cumulative_total_doses` contains the total number of vaccines administered up until and including that day. The [first COVID-19 vaccine dose in San Diego County](https://www.countynewscenter.com/county-of-san-diego-to-open-first-vaccination-super-station-in-partnership-with-uc-san-diego-health-padres-and-city-of-san-diego/) was administered on December 16, 2020. 

**Question 4.1.** How many COVID-19 vaccines were administered in San Diego County in December 2020? Save the result as `december_total`.

In [None]:
december_total = ...
december_total

In [None]:
grader.check("q4_1")

**Question 4.2.** Assuming that vaccines were given each day from December 16 through December 31, 2020, on how many days in December were vaccines administered? What was the average number of vaccines given per day during this period from December 16 to December 31? Save your answers as `december_days` and `december_average`, respectively.

In [None]:
december_days = ...
december_average = ...

In [None]:
grader.check("q4_2")

**Question 4.3.** Create a new table `june_vaccine` to include data (all columns) from June 2021 only. Assign to `june_total` the total number of doses administered in June.

*Hint*: You may want to use `.str.contains`; the *Boolean Indexing* section of the [DSC 10 Reference Sheet](https://drive.google.com/file/d/1mQApk9Ovdi-QVqMgnNcq5dZcWucUKoG-/view) shows you how to use it.

In [None]:
june_vaccine = ...
june_total = ...

In [None]:
grader.check("q4_3")

**Question 4.4.** Assign `june_average` to the average number of vaccines given per day during the month of June. By what factor did the average number of daily vaccine doses increase from December 2020 to June 2021? Save your answer as `increase_factor`. For example, if the average number of doses given each day in June was double that of December, you would set `increase_factor` to 2.

In [None]:
june_average = ...
increase_factor = ...

In [None]:
grader.check("q4_4")

**Question 4.5.** Determine the day in 2021 on which the most doses were administered, and save the result as `most_doses_day`.

In [None]:
most_doses_day = ...
most_doses_day

In [None]:
grader.check("q4_5")

**Question 4.6.** Determine the *week* in 2021 in which the most doses were administered on average, and save the result as `most_doses_week`.

In [None]:
most_doses_week = ...
most_doses_week

In [None]:
grader.check("q4_6")

**Question 4.7.** Create an array called `ranked_days` containing the names of the seven days of the week, ordered from the least popular day of the week to get vaccinated, to the most popular day to get vaccinated. 

*Hint*: After doing some grouping, you'll want to grab the index of the table using `.index`, then convert it to an array by feeding it in as an input to `np.array`.
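The grouping-then-index pattern from the hint can be sketched on a made-up table as follows. The `sales` table and its column names are hypothetical. The sketch uses `pandas` directly, on the assumption that `babypandas` supports the same `groupby`, `sort_values`, and `.index` operations.

```python
import numpy as np
import pandas as pd  # assumption: babypandas mirrors these pandas methods

# Hypothetical table (not the vaccine data).
sales = pd.DataFrame({
    "store": ["North", "South", "North", "East", "South", "East"],
    "units": [5, 20, 7, 1, 25, 2],
})

# One row per store; the index holds the store names.
by_store = sales.groupby("store").sum()

# Sort by the totals, grab the index, and convert it to an array.
ordered_stores = np.array(by_store.sort_values(by="units").index)
print(ordered_stores)
```

After grouping, the group labels live in the index rather than in a column, which is why `.index` is needed to retrieve them.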

*Magic trick suggestion*: Guess what day of the week your friend got vaccinated!

In [None]:
ranked_days = ...
ranked_days

In [None]:
grader.check("q4_7")

**Question 4.8.** Make a line graph of the total number of vaccine doses administered per week so far in 2021. Check that the peak matches your answer `most_doses_week` above!

*Hint*: Look at [Chapter 3.1 of the textbook](https://eldridgejm.github.io/dive_into_data_science/03-visualization/line.html) for instructions on how to create a line graph. You'll have to do some DataFrame manipulation before you can plot the graph.

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q4_8
manual: true
-->

In [None]:
...

<!-- END QUESTION -->



# Finish Line

Congratulations! You are done with Homework 2.

To submit your assignment:

1. Select `Kernel -> Restart & Run All` to ensure that you have executed all cells, including the test cells.
2. Read through the notebook to make sure everything is fine and all tests passed.
3. Run the cell below to run all tests, and make sure that they all pass.
4. Download your notebook using `File -> Download as -> Notebook (.ipynb)`, then upload your notebook to Gradescope.

In [None]:
# For your convenience, you can run this cell to run all the tests at once!
grader.check_all()