# Homework 5: Pivot Tables, Probability, and Iteration

**Reading**: Textbook chapter [8.3](https://www.inferentialthinking.com/chapters/08/3/cross-classifying-by-more-than-one-variable.html) to [8.5](https://www.inferentialthinking.com/chapters/08/5/bike-sharing-in-the-bay-area.html) and chapter [9](https://www.inferentialthinking.com/chapters/09/randomness.html).

Please complete this notebook by filling in the cells provided. Before you begin, execute the following cell to load the provided tests. Each time you start your server, you will need to execute this cell again to load the tests.

Homework 5 is due Thursday, 2/22 at 11:59pm. You will receive an early submission bonus point if you turn in your final submission by Wednesday, 2/21 at 11:59pm. Start early so that you can come to office hours if you're stuck. Check the website for the office hours schedule. Late work will not be accepted as per the [policies](http://data8.org/sp18/policies.html) of this course. 

Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. Refer to the policies page to learn more about how to learn cooperatively.

For all problems that require you to write out explanations and sentences, you **must** provide your answer in the designated space. Moreover, throughout this homework and all future ones, please do not reassign variables anywhere in the notebook! For example, if you use `max_temperature` in your answer to one question, do not reassign it later on.

In [None]:
# Don't change this cell; just run it. 

import numpy as np
from datascience import *

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)

from client.api.notebook import Notebook
ok = Notebook('hw05.ok')
_ = ok.auth(inline=True)

## 1. Causes of Death by Year


This exercise is designed to give you practice using the Table method `pivot`. [Here](http://data8.org/sp18/python-reference.html) is a link to the Python reference page in case you need a quick refresher.

We'll be looking at a dataset from the California Department of Public Health (available [here](http://www.healthdata.gov/dataset/leading-causes-death-zip-code-1999-2013)) that records the cause of death (as recorded on a death certificate) for everyone who died in California from 1999 to 2013.  The data are in the file `causes_of_death.csv.zip`.  Each row records the number of deaths by one cause in one year in one ZIP code.

To make the file smaller, we've compressed it; run the next cell to unzip and load it.

In [13]:
!unzip -o causes_of_death.csv.zip
causes = Table.read_table('causes_of_death.csv')
causes

The causes of death in the data are abbreviated.  We've provided a table called `abbreviations.csv` to translate the abbreviations.

In [4]:
abbreviations = Table.read_table('abbreviations.csv')
abbreviations.show()

The dataset is missing data on certain causes of death for certain years.  It looks like those causes of death are relatively rare, so for some purposes it makes sense to drop them from consideration.  Of course, we'll have to keep in mind that we're no longer looking at a comprehensive report on all deaths in California.

**Question 1.** Let's clean up our data. First, filter out the HOM, HYP, and NEP rows from the table for the reasons described in the above paragraph. Next, join together the abbreviations table and our causes of death table so that we have a more detailed description of each disease in each row. Lastly, drop the column which contains the acronym of the disease, and rename the column with the full description 'Cause of Death'. Assign the variable `cleaned_causes` to the resulting table. 

In [4]:
cleaned_causes = ...
cleaned_causes

In [6]:
_ = ok.grade('q1_1')

We're going to examine the changes in causes of death over time.  To make a plot of those numbers, we need to have a table with one row per year, and the information about all the causes of death for each year.
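To see what this kind of reshaping does, here is a minimal plain-Python sketch on a made-up toy dataset. The records, the category names, and the helper `pivot_counts` are all hypothetical and are not part of the assignment; the sketch does not use the `datascience` library. It totals a value for each (row key, column key) pair, which is the idea behind cross-classifying by two variables.

```python
from collections import defaultdict

# Toy records: (year, category, count). All values are made up.
records = [
    (2001, "A", 5),
    (2001, "B", 2),
    (2002, "A", 7),
    (2001, "A", 1),
    (2002, "B", 4),
]

def pivot_counts(rows):
    """Return {year: {category: total count}} for (year, category, count) rows."""
    table = defaultdict(lambda: defaultdict(int))
    for year, category, count in rows:
        table[year][category] += count
    # Convert the nested defaultdicts to plain dicts for easier inspection.
    return {year: dict(cols) for year, cols in table.items()}

toy_by_year = pivot_counts(records)
print(toy_by_year)
```

Each outer key plays the role of a row, and each inner key plays the role of a column in the reshaped table.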

**Question 2.** Create a table with one row for each year and a column for each kind of death, where each cell contains the number of deaths by that cause in that year. Call the table `cleaned_causes_by_year`.

In [None]:
cleaned_causes_by_year = ...
cleaned_causes_by_year.show()

In [8]:
_ = ok.grade('q1_2')

**Question 3.** Make a plot of all the causes of death by year, using your cleaned-up version of the dataset.  There should be a single plot with one line per cause of death.

*Hint:* Use the Table method `plot`.  If you pass only a single argument, a line will be made for each of the other columns.

In [7]:
...

After seeing the plot above, we would now like to examine the distributions of diseases over the years using percentages. Below, we have assigned `distributions` to a table with all of the same columns, but the raw counts in the cells are replaced by the percentage of the total number of deaths by a particular disease that happened in that specific year. 

Try to understand the code below. 

In [8]:
def percents(array_x):
    return np.round( (array_x/sum(array_x))*100, 2)

labels = cleaned_causes_by_year.labels
distributions = Table().with_columns(labels[0], cleaned_causes_by_year.column(0),
                                     labels[1], percents(cleaned_causes_by_year.column(1)),
                                     labels[2], percents(cleaned_causes_by_year.column(2)),
                                     labels[3], percents(cleaned_causes_by_year.column(3)),
                                     labels[4], percents(cleaned_causes_by_year.column(4)),
                                     labels[5], percents(cleaned_causes_by_year.column(5)),
                                     labels[6], percents(cleaned_causes_by_year.column(6)),
                                     labels[7], percents(cleaned_causes_by_year.column(7)),
                                     labels[8], percents(cleaned_causes_by_year.column(8)),
                                     labels[9], percents(cleaned_causes_by_year.column(9)),
                                     labels[10], percents(cleaned_causes_by_year.column(10)),
                                     labels[11], percents(cleaned_causes_by_year.column(11)))
distributions.show()

**Question 4.** What is the sum (roughly) of each of the columns (except the Year column) in the table above? Why does this make sense? 

*Write your answer here, replacing this text.*

**Question 5.** We suspect that the larger percentage of stroke-related deaths over the years 1999-2013 happened in the earlier years, while the larger percentage of Chronic Liver Disease-related deaths over this time period occurred in the most recent years. Draw a bar chart to display both of the distributions of these diseases over the time period. 

*Hint:* The relevant column labels are "Cerebrovascular Disease (Stroke)" and "Chronic Liver Disease and Cirrhosis"

In [9]:
...

## 2. Probability


We will be testing some probability concepts that were introduced in lecture. For all of the following problems, we will introduce a problem statement and give you a proposed answer. Next, you must assign a certain variable to one of three integers. You are more than welcome to create more cells across this notebook to use for arithmetic operations, but be sure to assign the requested variable to 1, 2, or 3 in the end. 

1. Assign the variable to 1 if you believe our proposed answer is too low.
2. Assign the variable to 2 if you believe our proposed answer is correct.
3. Assign the variable to 3 if you believe our proposed answer is too high.
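
As an illustration of the 1/2/3 scheme above, here is a small sketch on a practice problem that is not one of the homework questions: flipping a fair coin 3 times and asking for the chance of at least one head. The proposed answer of 1/2, the helper `compare`, and the variable names are all made up for this example. It finds the exact chance by listing every equally likely outcome, which is only practical for very small problems.

```python
from fractions import Fraction
from itertools import product

def compare(proposed, exact):
    """Return 1 if proposed is too low, 2 if correct, 3 if too high."""
    if proposed < exact:
        return 1
    if proposed == exact:
        return 2
    return 3

# Hypothetical proposed answer for the practice problem.
proposed = Fraction(1, 2)

# All 8 equally likely sequences of 3 flips.
outcomes = list(product("HT", repeat=3))
# Keep the sequences that contain at least one head.
favorable = [o for o in outcomes if "H" in o]
exact = Fraction(len(favorable), len(outcomes))

practice_answer = compare(proposed, exact)
print(exact, practice_answer)
```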

**Question 1.** You roll a 6-sided die 10 times. What is the chance of getting 10 sixes?

Our proposed answer: $$(\frac{1}{6})^{10}$$

Assign `ten_sixes` to either 1, 2, or 3 depending on whether you think our answer is too low, correct, or too high. 

In [3]:
ten_sixes = ...
ten_sixes

In [4]:
_ = ok.grade('q2_1')

**Question 2.** Take the same problem set-up as before, rolling a fair die 10 times. What is the chance that every roll is less than or equal to 5?

Our proposed answer: $$1 - (\frac{1}{6})^{10}$$

Assign `five_or_less` to either 1, 2, or 3. 

In [5]:
five_or_less = ...
five_or_less

In [6]:
_ = ok.grade('q2_2')

**Question 3.** Assume we are picking a lottery ticket. We must choose three distinct numbers from 1 to 100 and write them on a ticket. Next, someone picks three numbers one by one, each time without putting the previous number back in. We win if our numbers are all called. 

If we decide to play the game and pick our numbers as 12, 14, and 89, what is the chance that we win? 

Our proposed answer: $$(\frac{3}{100})^3$$

Assign `lottery` to either 1, 2, or 3. 

In [13]:
lottery = ... 

In [14]:
_ = ok.grade('q2_3')

**Question 4.** Assume we have two lists, list A and list B. List A contains the numbers [10,20,30], while list B contains the numbers [10,20,30,40]. We choose one number from list A randomly and one number from list B randomly. What is the chance that the number we drew from list A is larger than the number we drew from list B?

Our proposed answer: $$\frac{1}{4}$$

Assign `list_chances` to either 1, 2, or 3. 

In [7]:
list_chances = ...

In [16]:
_ = ok.grade('q2_4')

## 3. Monkeys Typing Shakespeare
##### (...or at least the string "datascience")

A monkey is banging repeatedly on the keys of a typewriter. Each time, the monkey is equally likely to hit any of the 26 lowercase letters of the English alphabet, regardless of what it has hit before. There are no other keys on the keyboard.

**Question 1.** Suppose the monkey hits the keyboard 11 times.  Compute the chance that the monkey types the sequence `datascience`.  (Call this `datascience_chance`.) Use algebra and type in an arithmetic expression that Python can evaluate.

In [2]:
datascience_chance = ...
datascience_chance

In [4]:
_ = ok.grade('q3_1')

**Question 2.** Write a function called `simulate_key_strike`.  It should take **no arguments**, and it should return a random one-character string that is equally likely to be any of the 26 lower-case English letters. 

In [5]:
# We have provided the code below to compute a list called letters,
# containing all the lower-case English letters.  Print it if you
# want to verify what it contains.
import string
letters = list(string.ascii_lowercase)

def simulate_key_strike():
    """Simulates one random key strike."""
    ...

# An example call to your function:
simulate_key_strike()

In [7]:
_ = ok.grade('q3_2')

**Question 3.** Write a function called `simulate_several_key_strikes`.  It should take one argument: an integer specifying the number of key strikes to simulate. It should return a string containing that many characters, each one obtained from simulating a key strike by the monkey.

*Hint:* If you make a list or array of the simulated key strikes, you can convert that to a string by calling `"".join(key_strikes_array)` (if your array is called `key_strikes_array`).
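
A quick sketch of how `join` behaves, using a short made-up list of one-character strings (the names `key_strikes`, `typed`, and `dashed` are only for illustration):

```python
# A list of one-character strings (made-up example values).
key_strikes = ["d", "a", "t", "a"]

# "".join glues the pieces together with the empty string between them.
typed = "".join(key_strikes)
print(typed)

# A different separator shows what the string before .join controls.
dashed = "-".join(key_strikes)
print(dashed)
```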

In [8]:
def simulate_several_key_strikes(num_strikes):
    # Fill in this function.  Our solution used several lines
    # of code.
    ...

# An example call to your function:
simulate_several_key_strikes(11)

In [11]:
_ = ok.grade('q3_3')

**Question 4.** Use `simulate_several_key_strikes` 1000 times, each time simulating the monkey striking 11 keys.  Compute the proportion of times the monkey types `"datascience"`, calling that proportion `datascience_proportion`.

In [18]:
# Our solution used several lines of code.
...
datascience_proportion = ...
datascience_proportion

In [22]:
_ = ok.grade('q3_4')

**Question 5.** Check the value your simulation computed for `datascience_proportion`.  Is your simulation a good way to estimate the chance that the monkey types `"datascience"` in 11 strikes (the answer to question 1)?  Why or why not?

*Write your answer here, replacing this text.*

**Question 6.** Compute the chance that the monkey types the letter `"e"` at least once in the 11 strikes.  Call it `e_chance`. Use algebra and type in an arithmetic expression that Python can evaluate. 

In [23]:
e_chance = ...
e_chance

In [None]:
_ = ok.grade('q3_6')

**Question 7.** In comparison to `datascience_chance`, do you think that a computer simulation would be a more or less effective way to estimate `e_chance`?  Why or why not?  (You don't need to write a simulation, but it is an interesting exercise.)

*Write your answer here, replacing this text.*

## 4. (Optional) Unrolling Loops

**The rest of this homework is optional. Do it for your own practice, but it will not be incorporated into the final grading!**

"Unrolling" a `for` loop means to manually write out all the code that it executes.  The result is code that does the same thing as the loop, but without the loop.  For example, the unrolled version of this loop:

    for num in np.arange(3):
        print("The number is", num)

is this:

    print("The number is", 0)
    print("The number is", 1)
    print("The number is", 2)

It's important to understand that this is really all that a `for` loop does.  In this exercise, you'll practice unrolling `for` loops.
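
The same idea applies when the loop body updates a variable. Below is a small sketch, not taken from the questions that follow, that adds up squares with a loop and then with the loop unrolled. It uses Python's built-in `range` instead of `np.arange` so that it runs without NumPy, and the variable names are made up for this example.

```python
# Loop version: add up the squares of 0, 1, and 2.
total_loop = 0
for num in range(3):
    total_loop = total_loop + num ** 2

# Unrolled version: the loop body written out once per value of num.
total_unrolled = 0
num = 0
total_unrolled = total_unrolled + num ** 2
num = 1
total_unrolled = total_unrolled + num ** 2
num = 2
total_unrolled = total_unrolled + num ** 2

print(total_loop, total_unrolled)
```

Both versions end with the same total, because the unrolled code performs exactly the assignments the loop would have performed, in the same order.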

In each question below, write code that does the same thing as the given code, but with any `for` loops unrolled.  It's a good idea to run both your answer and the original code to verify that they do the same thing.  (Of course, if the code does something random, you'll get a different random outcome than the original code!)

First, run the cell below to load data that will be used in a few questions.  It's a table with 52 rows, one for each type of card in a deck of playing cards.  A playing card has a "suit" ("♠︎", "♣︎", "♥︎", or "♦︎") and a "rank" (2 through 10, J, Q, K, or A).  There are 4 suits and 13 ranks, so there are $4 \times 13 = 52$ different cards.
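
The count of 52 can be checked with a short sketch that pairs every rank with every suit. This does not read `deck.csv`; the suit names below are stand-ins for the suit symbols described above, and the variable names are made up for this example.

```python
from itertools import product

# Stand-in labels; the deck described above uses suit symbols instead of names.
suits = ["spades", "clubs", "hearts", "diamonds"]
ranks = [str(n) for n in range(2, 11)] + ["J", "Q", "K", "A"]

# Every (rank, suit) pair is one card.
cards = list(product(ranks, suits))
print(len(suits), len(ranks), len(cards))
```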

In [3]:
deck = Table.read_table("deck.csv")
deck

**Optional Question 1.** Unroll the code below.

In [4]:
# This table will hold the cards in a randomly-drawn hand of
# 5 cards.  We simulate cards being drawn as follows: We draw
# a card at random from the deck, make a copy of it, put the
# copy in our hand, and put the card back in the deck.  That
# means we might draw the same card multiple times, which is
# different from a normal draw in most card games.
hand = Table().with_columns("Rank", make_array(), "Suit", make_array())
for draw in np.arange(5):
    card = deck.row(np.random.randint(deck.num_rows))
    hand.append(card)
hand

In [6]:
hand = Table().with_columns("Rank", make_array(), "Suit", make_array())
...

In [11]:
_ = ok.grade('q4_1')

**Optional Question 2.** Unroll the code below.

In [7]:
for joke_iteration in np.arange(4):
    print("Knock, knock.")
    print("Who's there?")
    print("Banana.")
    print("Banana who?")
print("Knock, knock.")
print("Who's there?")
print("Orange.")
print("Orange who?")
print("Orange you glad I didn't say banana?")

In [None]:
...

**Optional Question 3.** Unroll the code below.

*Hint:* `np.random.randint` returns a random integer between 0 (inclusive) and the value that's passed in (exclusive).

In [9]:
# This table will hold the cards in a randomly-drawn hand of
# 4 cards.  The cards are drawn as follows: For each of the
# 4 suits, we draw a random card of that suit and put it into
# our hand.  The cards within a suit are drawn uniformly at
# random, meaning each card of the suit has an equal chance of
# being drawn.
hand_of_4 = Table().with_columns("Rank", make_array(), "Suit", make_array())
for suit in make_array("♠︎", "♣︎", "♥︎", "♦︎"):
    cards_of_suit = deck.where("Suit", are.equal_to(suit))
    card = cards_of_suit.row(np.random.randint(cards_of_suit.num_rows))
    hand_of_4.append(card)
hand_of_4

In [None]:
...

In [10]:
_ = ok.grade('q4_3')

## 5. Submission


Once you're finished, select "Save and Checkpoint" in the File menu and then execute the `submit` cell below. The result will contain a link that you can use to check that your assignment has been submitted successfully. If you submit more than once before the deadline, we will only grade your final submission. If you mistakenly submit the wrong one, you can head to [okpy.org](https://okpy.org/) and flag the correct version. To do so, go to the website, click on this assignment, and find the version you would like to be graded. There should be an option to flag that submission for grading!

In [None]:
_ = ok.submit()