# Homework 4: Tables and Functions
The tools that we've learned over the last week (for example, function definitions, histograms, and the table methods `where`, `apply`, and `group`) are enough to analyze a wide range of questions and datasets.  

Please complete this notebook by filling in the cells provided. Before you begin, execute the following cell to load the provided tests.

In [None]:
# Don't change this cell; just run it. 
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

from client.api.notebook import Notebook
ok = Notebook('hw04.ok')
_ = ok.auth(inline=True)

Reading:
- Textbook chapters [6](https://data-8r.gitbooks.io/textbook/chapters/06/tables.html) and [7](https://data-8r.gitbooks.io/textbook/chapters/07/functions-and-tables.html)

Deadline:

This assignment is due **Tuesday, July 25 at 1PM**. You will receive an early submission bonus point if you turn in your final submission by **Monday, July 24 at 1PM**. Late work will not be accepted unless you have made special arrangements with your TA or the instructor.

Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. 

You should start early so that you have time to get help if you're stuck. Drop-in office hours will be held at various times in the week; check the course calendar on the [course webpage](http://data8r.org) for the latest schedule.

Once you're finished, select "Save and Checkpoint" in the File menu and then execute the `submit` cell below. The result will contain a link that you can use to check that your assignment has been submitted successfully. If you submit more than once before the deadline, we will only grade your final submission.

In [None]:
_ = ok.submit()

## 1. Review of Histograms


We measured the heights of the members of 200 families, each of which included 1 mother, 1 father, and a varying number of adult sons. We then made the following histograms, in which every bin is two inches wide.

![](three_height_histograms.png)

#### Question 1

For each quantity listed below, either calculate its value using the histograms, or write *Unknown* if it is not possible to calculate the value numerically given the information we have.
1. The **percentage** of mothers that are at least 60 inches but less than 64 inches tall.
2. The **percentage** of fathers that are at least 64 inches but less than 67 inches tall.
3. The **number** of mothers that are at least 60 inches tall.
4. The **number** of sons that are at least 70 inches tall.

*Write your answer here, replacing this text.*

#### Question 2
If the fathers' histogram were redrawn with bins of width 4 inches, what would be the height of the bar for the bin from 72 to 76 inches?
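As a reminder of how bar heights behave when bins are merged, here is a small sketch with made-up numbers (they are not read from the histograms above). It assumes the vertical axis is in percent per inch, so that a bar's area, rather than its height, gives the percentage of people in that bin.

```python
# Made-up heights (percent per inch) for two adjacent bins, each 2 inches wide.
# These are NOT values from the histograms in this assignment.
old_width = 2
left_height = 3.0
right_height = 5.0

# Area = height * width = percentage of individuals in each bin.
left_percent = left_height * old_width
right_percent = right_height * old_width

# Merging the two bins keeps the total percentage but doubles the width.
new_width = 2 * old_width
merged_height = (left_percent + right_percent) / new_width
print(merged_height)  # 4.0
```

Under that assumption, the merged bar's height is the average of the two original heights, because both original bins have the same width.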

*Write your answer here, replacing this text.*

#### Question 3
Some of the sons in the dataset are taller than all of the mothers.  It isn't possible to tell exactly how many, because the binning disguises the exact height values of the mothers and sons.  However, we can calculate upper and lower bounds on the value using our histograms. What's the lowest possible value for the percentage of sons who are taller than all of the mothers? The highest possible value?

*Write your answer here, replacing this text.*

Run the following cell to load some more height data, this time on 100 adult men and women.

In [6]:
height_data = Table.read_table("Height_Data.csv")
male_heights = height_data.column("Male Height")
female_heights = height_data.column("Female Height")
all_heights = np.append(male_heights, female_heights)
height_data

#### Question 4
Create a histogram of the heights of the men in the sample. Then, do the same for the women.

In [5]:
...

In [None]:
...

#### Question 5
Create a single histogram of the heights of everyone in the sample, both men and women. 

*Hint: You will need to use the `all_heights` variable, and make a new table*.

In [10]:
...

## 2. Writing Documentation for Functions


When you want to figure out how to use a function, typing its name and a question mark in a code cell (and then running the cell) will show you its *documentation*.  It's a good idea to write documentation for the functions you write, too.  This exercise will give you practice with that.

**Question 1.** The function below does something interesting, but it's been left without documentation.  Figure out what it does by calling it.  (We've given two example calls to get you started.)  Then write documentation that would help someone understand what the function does.  At a minimum, you should describe:

* what the function does, in one short sentence;
* the purpose and type of each argument; and
* what the function returns.

You can follow the [NumPy guidelines for documenting functions](http://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_numpy.html) if you like.
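For reference, here is a minimal sketch of what NumPy-style documentation can look like. The function `rectangle_area` is a hypothetical example invented for illustration; it has nothing to do with `mystery_function`.

```python
def rectangle_area(width, height):
    """Compute the area of a rectangle.

    Parameters
    ----------
    width : float
        Horizontal side length, in any unit.
    height : float
        Vertical side length, in the same unit as `width`.

    Returns
    -------
    float
        The area, in squared units.
    """
    return width * height

# help() prints the docstring, much like `rectangle_area?` in a notebook.
help(rectangle_area)
```

The one-line summary, the `Parameters` section, and the `Returns` section correspond to the three items in the list above.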

**Note:** To complete this exercise, you need to be able to hear audio output from the device you're using.

In [2]:
def mystery_function(arg0, arg1, arg2):
    """
    Documentation goes here.
    """
    v = 10000
    w = v*arg2
    x = np.linspace(arg0, arg1, w)
    y = np.cumsum(x) / v
    z = np.sin(2*np.pi*y)
    from IPython.display import Audio
    return Audio(z, rate=v)

In [5]:
mystery_function(220, 220, 2)

In [6]:
mystery_function(440, 220, 2)

## 3. The Climate near Berkeley


The US National Oceanic and Atmospheric Administration (NOAA) operates thousands of climate observation stations (mostly in the US) that collect information about local climate.  Among other things, each station records the highest and lowest observed temperature each day.  These data, called "Quality Controlled Local Climatological Data," are publicly available [here](http://www.ncdc.noaa.gov/orders/qclcd/) and described [here](https://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/quality-controlled-local-climatological-data-qclcd).

We've provided you with an excerpt of that dataset.  All the readings are from 2015 and from California stations.

**Question 1.** Load the data from `temperatures.csv` into a table called `temperatures`.  Check out the columns in the table.  Each row represents the data from one station on one day.  The column "Date" is in MMDD format, meaning that the last two digits denote the day of the month, and the first 1 or 2 digits denote the month.
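To make the MMDD format concrete, here is a small sketch that splits such a value into its month and day. It assumes the dates are stored as integers (for example `105` for January 5), and `split_mmdd` is a hypothetical helper name, not something the assignment provides.

```python
def split_mmdd(date):
    """Split an MMDD-style integer such as 1231 or 105 into (month, day)."""
    month = date // 100  # everything before the last two digits
    day = date % 100     # the last two digits
    return month, day

print(split_mmdd(1231))  # (12, 31)
print(split_mmdd(105))   # (1, 5)
```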

In [2]:
temperatures = ...
temperatures

In [3]:
_ = ok.grade('q3_1')

**Question 2.** Each station is named for the city in which it resides.  Is there a station in Berkeley?  Write code to help you answer the question in the next cell, and then write your answer in the cell after that, along with **an English explanation** of what your code does.

*Hint:* Use the Table method `.where`.

In [5]:
# Use this cell to work on this problem.

*Write your answer here, replacing this text.*

Let's find the station closest to the UC Berkeley campus.  The campus is located roughly at latitude 37.871746 and longitude -122.259030.  We'll break this down into a few steps.

**Question 3.** Create a table called `with_degree_differences` that's a copy of `temperatures`, but with 2 extra columns:

1. "Latitude difference": The difference between the latitude of the row's station and the latitude of UC Berkeley.
2. "Longitude difference": The difference between the longitude of the row's station and the longitude of UC Berkeley.

In [6]:
# We've provided the lat/long of UC Berkeley so you don't have to retype them:
BERKELEY_LATITUDE = 37.871746
BERKELEY_LONGITUDE = -122.259030

with_degree_differences = ...
    ...
    ...
with_degree_differences

In [7]:
_ = ok.grade('q3_3')

**Question 4.**  Degrees latitude and longitude don't correspond directly to distances, because the Earth is a sphere.  Near Berkeley, one degree latitude is [around 69 miles](https://www2.usgs.gov/faq/categories/9794/3022), and one degree longitude is around 54.6 miles.  Compute a table called `with_mile_differences` that's a copy of `with_degree_differences` with 2 extra columns:

1. "North-South difference": The difference between UC Berkeley and the row's station along the North-South axis.  This is the difference in latitude times 69.
2. "East-West difference": The difference between UC Berkeley and the row's station along the East-West axis.  This is the difference in longitude times 54.6.

In [8]:
MILES_PER_DEGREE_LATITUDE = 69
MILES_PER_DEGREE_LONGITUDE = 54.6
with_mile_differences = with_degree_differences.with_columns(
    ...
    ...
with_mile_differences

In [9]:
_ = ok.grade('q3_4')

**Question 5.** Compute the distance from UC Berkeley to each row's station.  By the Pythagorean theorem, the distance is:
$$\sqrt{(\text{North-South difference (miles)})^2 + (\text{East-West difference (miles)})^2}$$

Create a table called `with_distances` that's a copy of `with_mile_differences`, but with an extra column called "Distance to UC Berkeley" containing these distances.

*Hint:* Use elementwise arithmetic operations to square each difference, add them, and square-root them.
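As a reminder of how elementwise arithmetic works on arrays, here is a sketch on made-up numbers. The array names and values are invented for illustration and do not come from the `temperatures` data.

```python
import numpy as np

# Made-up differences in miles for three hypothetical stations.
north_south = np.array([3.0, 6.0, 0.0])
east_west = np.array([4.0, 8.0, 5.0])

# Each operation below acts on every element of the arrays at once,
# giving straight-line distances of 5, 10 and 5 miles.
straight_line = np.sqrt(north_south ** 2 + east_west ** 2)
print(straight_line)
```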

In [10]:
# We found it useful to compute an array of the distances on a separate line,
# but you can do whatever you want as long as you define the with_distances
# table appropriately.
distances = ...
with_distances = with_mile_differences.with_columns(
    ...
with_distances

In [11]:
_ = ok.grade('q3_5')

**Question 6.** Sort the table by distance to find the station that's closest to Berkeley.  Find its name and assign it to `closest_station_name`.

In [12]:
closest_station_name = ...
closest_station_name

In [13]:
_ = ok.grade('q3_6')

**Question 7.** Make a table called `closest_station_readings`.  It should be a table like the original `temperatures` table, except it should contain only the rows from the station you found in the previous question.  Sort it in increasing order by date.

In [14]:
closest_station_readings = ...

# This prints out your whole table (with unnecessary columns removed).
closest_station_readings.select(2, 1, 0).show()
# This code makes a plot of the highs and lows over time in your table,
# which is easier to read than the raw numbers.  You don't need to modify
# this.
closest_station_readings.scatter(2, make_array(0, 1))

In [15]:
_ = ok.grade('q3_7')

**Question 8.** From the graph, can you figure out the hottest and coldest months in 2015, in terms of average minimum temperature?  (If it looks like there's a tie, name all the months that might qualify.  If you can't answer the question from these data, explain why.)

*Write your answer here, replacing this text.*

## 4. Fixing Misspellings


You're editing a collection of your essays for publication, and you discover that you've been misspelling the word "misspell" as "mispel" your whole life.  You decide to use Python to correct this embarrassing mistake.

**Question 1.** Write a function called `correct_mispel`.  It should take a single string as its argument, and return the same string, but with all instances of "mispel" replaced with "misspell".

*Hint:* Use the string method `.replace`.  It takes two arguments: the piece of text you want to find, and the piece of text you want to replace it with.
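Here is a quick illustration of `.replace` on an unrelated misspelling (the sentence is made up):

```python
sentence = "I recieve mail, and you recieve mail."
fixed = sentence.replace("recieve", "receive")

print(fixed)     # I receive mail, and you receive mail.
print(sentence)  # the original is unchanged; replace returns a new string
```

Every occurrence is replaced, and because strings are immutable you need to use the returned value.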

In [2]:
# Write a function called correct_mispel in this cell.

In [4]:
_ = ok.grade('q4_1')

**Question 2.** Now you need to load your data into Python.  The file `essay_filenames.csv` is a table that contains the *filenames* of your essays.  Each filename is a string that's the name of an essay.  Load it into a table called `essay_filenames`.

In [5]:
essay_filenames = ...
essay_filenames

In [6]:
_ = ok.grade('q4_2')

**Question 3.** Below, we've provided a function that takes as its argument the *filename* of an essay and returns the text in that file (as one long string).  Using `apply`, create a table called `essays` with two columns:

1. "Name": The filename of the essay
2. "Text": The whole text of the essay

(The essays are actually books from [Project Gutenberg](https://www.gutenberg.org), modified to misspell "misspell".  Attributions and copyright information are contained in the text of each essay.)

In [7]:
def load_essay(filename):
    """Loads the text in the given file, returning it as one long string."""
    with open(filename, 'r') as essay_file:
        return essay_file.read()

essays = ...
essays

In [8]:
_ = ok.grade('q4_3')

**Question 4.** Using `apply` and the function you wrote earlier, create a table called `corrected_essays` with two columns:

1. "Name": The filename of the essay
2. "Corrected text": The whole text of the essay, with "mispel" corrected as "misspell".

In [9]:
corrected_essays = ...
corrected_essays

In [10]:
_ = ok.grade('q4_4')

Did this do anything?  Were there even misspellings in the original essay?  Let's find out.

**Question 5.** The string method `splitlines` produces a list of the lines of the string.  Use it to make a table called `news_writing_lines` with a column called "Line" containing the lines from the text called "News Writing".  That is, there should be one row in `news_writing_lines` for each line in the text called "News Writing".
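A small sketch of `splitlines` on a made-up three-line string:

```python
poem = "Roses are red\nViolets are blue\nData is fun"
lines = poem.splitlines()  # a Python list of strings, one per line

print(lines)       # ['Roses are red', 'Violets are blue', 'Data is fun']
print(len(lines))  # 3
```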

In [11]:
news_writing = ...
news_writing_lines = ...
news_writing_lines

In [12]:
_ = ok.grade('q4_5')

**Question 6.** Use the table method `where` and the predicate `are.containing` to find all the lines in `news_writing_lines` that include the word "mispel".  Make a table of those lines called `misspelled_lines`.

*Note:* You should also find versions of "mispel" like "mispeled" or "mispeling", and your code probably corrected those, too.  That's okay.
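The following plain-Python sketch shows why longer variants are caught as well. It assumes `are.containing` performs a plain substring test, uses a made-up list of lines and a different misspelling, and does not use the `datascience` library.

```python
lines = [
    "Please recieve this gift.",
    "She recieved it gladly.",
    "Nothing to see here.",
]

# Substring test: "recieve" also matches inside "recieved".
matching = [line for line in lines if "recieve" in line]
print(len(matching))  # 2

# Replacing the substring fixes the longer word as a side effect.
print(lines[1].replace("recieve", "receive"))  # She received it gladly.
```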

In [13]:
misspelled_lines = ...
misspelled_lines

In [14]:
_ = ok.grade('q4_6')

**Question 7.** In the cell below, repeat the work you did in questions 5 and 6, but for the corrected version of "News Writing" you produced in `corrected_essays`.  Did your correction fix the misspellings?

In [16]:
# Use this cell to check whether your code fixed the misspellings.

*Write your answer here, replacing this text.*

## 5. Causes of Death in California


This exercise is designed to give you practice using the Table method `group`.
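As a reminder of what grouping with a sum computes, here is a conceptual sketch in plain Python. It is not the `Table` API, and the rows are made up rather than taken from the dataset.

```python
from collections import defaultdict

# Made-up rows of (category, count); not taken from the dataset.
rows = [("A", 10), ("B", 5), ("A", 7), ("C", 1), ("B", 2)]

totals = defaultdict(int)
for category, count in rows:
    totals[category] += count  # one running total per distinct category

# Sort categories from largest total to smallest.
ranked = sorted(totals.items(), key=lambda pair: pair[1], reverse=True)
print(ranked)  # [('A', 17), ('B', 7), ('C', 1)]
```

Grouping collapses all rows that share a category into a single row, which is the same idea the exercises below apply to a table.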

We'll be looking at a dataset from the California Department of Public Health (available [here](http://www.healthdata.gov/dataset/leading-causes-death-zip-code-1999-2013) and described [here](http://www.cdph.ca.gov/data/statistics/Pages/DeathProfilesbyZIPCode.aspx)) that records the cause of death (as recorded on a death certificate) for everyone who died in California from 1999 to 2013.  The data are in the file `causes_of_death.csv.zip`.  Each row records the number of deaths by one cause in one year in one ZIP code.

To make the file smaller, we've compressed it; run the next cell to unzip and load it.

In [None]:
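# zipfile is part of the Python standard library, so nothing needs to be installed.
import zipfile

# The with-statement closes the archive automatically after extraction.
with zipfile.ZipFile("causes_of_death.csv.zip", 'r') as zip_ref:
    zip_ref.extractall(".")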
causes = Table.read_table('causes_of_death.csv')
causes

The causes of death in the data are abbreviated.  If you want to know what the abbreviations mean, we've provided a table called `abbreviations.csv`.

**Question 1.** Find the top 5 causes of death in California over the entire period covered by the data.  To do that, create a table with one row for each of the top 5 causes of death, a column called "Cause of Death", and a column called "Count" that records the total number of deaths due to that cause.  Sort it in descending order by count, and call it `top_5_causes`.

In [None]:
# Use this cell to find the top 5 causes of death.

In [None]:
_ = ok.grade('q5_1')

**Question 2.** Create a bar chart that displays the *proportion of all deaths* by each cause.
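A proportion is a count divided by the total of all counts. Here is a minimal sketch on made-up totals (the names and numbers are invented for illustration):

```python
counts = {"A": 30, "B": 50, "C": 20}  # made-up totals
total = sum(counts.values())

# Divide each count by the grand total; the results sum to 1.
proportions = {name: count / total for name, count in counts.items()}
print(proportions)  # {'A': 0.3, 'B': 0.5, 'C': 0.2}
```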

In [None]:
# Use this cell to make your plot.

**Question 3.** Create a plot of the total number of deaths per year in California.

*Hint:* Use the Table method `plot`.  The first argument is the name or index of the column to put on the horizontal axis.

In [None]:
...

# This line will make the vertical axis start at 0.  You can remove
# it if you want to see the default plot, which is more zoomed-in.
plots.ylim(0, 300000)

**Question 4.** You should see that deaths have increased a little over time, though not uniformly.  How would you explain that?  Describe a dataset you'd like to see to test whether your explanation is valid.

*Write your answer here, replacing this text.*