# Homework 3: Tables and Charts

**Reading:** Textbook chapters [5](http://www.cs.cornell.edu/courses/cs1380/2018sp/textbook/chapters/05/tables.html) and [6](http://www.cs.cornell.edu/courses/cs1380/2018sp/textbook/chapters/06/visualization.html).

**Deadline:** This assignment is due Friday, February 16 at 9 pm. You will receive an early submission bonus point if you turn in your final submission by Thursday, February 15 at 9 pm. Late work will not be accepted. (Homework is usually due on Thursdays. We're giving you an extra day on this one because the last problem is based on material from Lab 04, which takes place on Wednesday and Thursday. It also means you don't have to skip any plans you might have on February 14 just to get a bonus point.)

**Academic Integrity:** Please review the [course policies on Academic Integrity][policies]. Directly sharing answers with other students is not permitted, but discussing problems with other students is allowed if you keep the discussion at a high level: use English, not code, to communicate.

**Office Hours (OH):** Drop-in office hours are held Monday through Friday. The schedule is in a [Google calendar][oh].

**Ask For Help:** If you get stuck, please visit OH to ask for help! And please start early, so that you have time to get help if you become stuck.

**Checks vs. Tests:** Remember, the checks provided in the homework are just hints: they check for common errors to help guide you along the right path, but they don't tell you whether your answer is 100% correct. When your homework is graded, we will run rigorous tests for correct answers.

**Submission:** See the end of this notebook for instructions on how to submit the assignment in Vocareum.

[policies]: http://www.cs.cornell.edu/courses/cs1380/2018sp/policies.html
[oh]: http://www.cs.cornell.edu/courses/cs1380/2018sp/oh.html

In [None]:
# Don't change this cell; just run it. 
# Each time you re-open Vocareum, you will need to execute this cell again.

import numpy as np
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

from test import *

## 1. Differences between Universities, Part II


**Question 1.** Suppose you're choosing a university to attend, and you'd like to quantify how *dissimilar* any two universities are.  You rate each university you're considering on several numerical traits.  You decide on a very detailed list of 1000 traits, and you measure all of them!  Some of those traits could include:

* The cost per year to attend.
* The average Yelp review of nearby Thai restaurants.
* The USA Today ranking of the medical school.
* The number of inches it must snow before the university will cancel classes.
* The average number of emails per week from Denice Cassaro.

You decide to quantify the dissimilarity between two universities as the *total* of the differences in their traits.  That is, the dissimilarity is:

* the sum of
* the absolute values of
* the differences of
* the 1000 trait values.
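
The same definition can be written as a formula. The symbols $c_i$ and $h_i$ are our own shorthand here (they do not appear elsewhere in the assignment) for the $i$-th trait value of the first and second university:

$$\text{dissimilarity} = \sum_{i=1}^{1000} \lvert c_i - h_i \rvert$$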

In the next cell, we've loaded arrays containing the 1000 trait values for Cornell and Harvard.  Compute the dissimilarity (according to the above definition) between Cornell and Harvard.  Call your answer `dissimilarity`.  Use a single line of code to compute the answer.

*Note:* The data we're using aren't real; we made them up for this exercise.

In [None]:
cornell = Table.read_table("cornell.csv").column("Trait value")
harvard = Table.read_table("harvard.csv").column("Trait value")

dissimilarity = ...
dissimilarity

In [None]:
check1_1(dissimilarity)

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


**Question 2.** What is the type of each variable in Question 1?

1. `cornell`, `harvard`, and `dissimilarity` are all numpy arrays
2. `cornell` and `harvard` are numpy arrays, and `dissimilarity` is a floating-point number
3. `cornell` and `harvard` are lists, and `dissimilarity` is a floating-point number

In [None]:
# Answer either 1, 2, or 3...
data_types = ...
data_types

In [None]:
check1_2(data_types) 

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


**Question 3.** Why do we take the absolute value when computing `dissimilarity`?  Choose the best answer.

1. Because the values of some traits might be negative.
2. Because we want the final answer to be a positive number.
3. Because the direction of the difference between traits does not matter.
4. Because some traits are larger in magnitude (e.g., values around 1e5) and some are smaller (e.g., values around 1e-5).

In [None]:
why_abs_value = ...
why_abs_value

In [None]:
check1_3(why_abs_value)

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


After computing dissimilarities between several schools, you notice a problem with your method: the scale of the traits matters a lot.

Since schools cost tens of thousands of dollars to attend, the cost-to-attend trait is always a much bigger *number* than most other traits.  That makes it affect the dissimilarity a lot more than other traits.  Two schools that differ in cost-to-attend by \$900, but are otherwise identical, get a dissimilarity of 900.  But two schools that differ in graduation rate by .9 (a huge difference!), but are otherwise identical, get a dissimilarity of only .9.

One way to fix this problem is to assign different *weights* to different traits.  For example, we could fix the problem above by multiplying the difference in the cost-to-attend traits by .001, so that a difference of \$900 in cost-to-attend results in a dissimilarity of \$900 $\times$ .001, which is .9.

Here's a revised method that does that for every trait:

1. For each trait, subtract the two schools' trait values.
2. Then take the absolute value of that difference.
3. *Now multiply that absolute value by a weight specific to that trait, such as .001 or 2.*
4. Now sum the 1000 resulting numbers.
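
As a rough illustration of the four steps above, here is a minimal sketch on two made-up schools with only two traits. The names `school_a`, `school_b`, and `toy_weights`, and all of the numbers, are hypothetical and chosen to mirror the \$900 and .9 example; they are not the homework data.

```python
import numpy as np

# Hypothetical trait values: [cost per year in dollars, graduation rate]
school_a = np.array([30000.0, 0.95])
school_b = np.array([30900.0, 0.05])

# Hypothetical weights: shrink the cost difference, leave graduation rate as is
toy_weights = np.array([0.001, 1.0])

# Steps 1 and 2: elementwise difference, then absolute value
abs_diffs = np.abs(school_a - school_b)

# Without weights, the cost trait dominates the total
unweighted_total = np.sum(abs_diffs)

# Step 3: multiply each absolute difference by its trait's weight
weighted_diffs = abs_diffs * toy_weights

# Step 4: sum the weighted differences
weighted_total = np.sum(weighted_diffs)

print(unweighted_total, weighted_total)
```

With these weights, both traits contribute roughly equally to the weighted total, whereas the unweighted total is almost entirely the cost difference.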

**Question 4.** Suppose you've already decided on a weight for each trait.  These are loaded into an array called `weights` in the cell below.  `weights.item(0)` is the weight for the first trait, `weights.item(1)` is the weight for the second trait, and so on.  Use the revised method to compute a revised dissimilarity between Cornell and Harvard.

*Hint:* Using array arithmetic, your answer should be almost as short as in Question 1.

In [None]:
weights = Table.read_table("weights.csv").column("Weight")

revised_dissimilarity = ...
revised_dissimilarity

In [None]:
check1_4(revised_dissimilarity)

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


## 2. Unemployment


The Federal Reserve Bank of St. Louis [publishes data](https://fred.stlouisfed.org/categories/33509) about jobs in the US.  In this problem we'll analyze data on unemployment in the United States.

**Question 1.** Some data from the bank are in the CSV file `unemployment.csv`.  Load that file into a table named `unemployment`.

In [None]:
unemployment = ...
unemployment

In [None]:
check2_1(unemployment)

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


There are many ways of defining unemployment.  The dataset you just loaded includes two notions of the unemployment rate:

1. The percentage who can't find a job among people who are able to work and are looking for a full-time job.  This is called the Non-Employment Index, which is abbreviated to NEI.
2. The percentage who can't find any job *or* are only working at a part-time job, again among people who are able to work and are looking for a full-time job.  The group working part-time is called "Part-Time for Economic Reasons," so the acronym for this index is NEI-PTER.  

**Question 2.** Create a table named `by_nei` that is sorted in decreasing order by NEI.  And create another table named `by_nei_pter` that is sorted in decreasing order by NEI-PTER.

In [None]:
by_nei = ...
by_nei

In [None]:
by_nei_pter = ...
by_nei_pter

In [None]:
# This check will give you two answers, one for each table.
check2_2(by_nei, by_nei_pter)

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


**Question 3.** Use `take` to make a table containing the data for the 10 quarters when NEI was greatest.  Call that table `greatest_nei`.
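
The idea of "sort in decreasing order, then take the first few" can be sketched with plain NumPy arrays. This is only an analogy under our own assumptions: it uses NumPy's `argsort` and `take` rather than the table methods the question asks for, and the labels and rates below are made up.

```python
import numpy as np

# Made-up quarter labels and rates (not from unemployment.csv)
toy_quarters = np.array(["Q1", "Q2", "Q3", "Q4"])
toy_rates = np.array([5.1, 7.3, 6.0, 9.2])

# Indices that would sort the rates from largest to smallest
order = np.argsort(toy_rates)[::-1]

# Keep the labels and rates of the two largest values
top_two_quarters = toy_quarters.take(order[:2])
top_two_rates = toy_rates.take(order[:2])

print(top_two_quarters, top_two_rates)
```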

In [None]:
greatest_nei = ...
greatest_nei

In [None]:
check2_3(greatest_nei)

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


**Question 4.** NEI-PTER is the percentage of people who are unemployed (hence counted in the NEI) plus the percentage of people who are PTER.  Compute an array containing the percentage of people who were PTER in each quarter.  (The first element of the array should correspond to the first row of `unemployment`, and so on.)

*Note:* Use the original `unemployment` table for this.
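
Since NEI-PTER is described as NEI plus the PTER percentage, the PTER percentage is the elementwise difference of the two. A small sketch on made-up numbers follows; the array names and values are hypothetical, not columns from the dataset.

```python
import numpy as np

# Made-up percentages for three quarters (not from unemployment.csv)
toy_nei = np.array([10.0, 9.5, 11.0])
toy_nei_pter = np.array([11.2, 10.9, 12.5])

# Elementwise subtraction keeps the same order as the inputs
toy_pter = toy_nei_pter - toy_nei

print(toy_pter)
```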

In [None]:
pter = ...
pter

In [None]:
check2_4(pter)

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


**Question 5.** Add `pter` as a column named PTER to `unemployment`, and sort the resulting table by that column in decreasing order.  Call the table `by_pter`.

Try to do this with a single line of code, if you can.

In [None]:
by_pter = ...
by_pter

In [None]:
check2_5(by_pter)

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


**Question 6.**  It has been claimed that the "Great Recession" of 2008-2009 might have caused many people to become part-time for economic reasons.  Does this dataset support that claim?  Specifically, is the PTER rate high during or shortly after the Great Recession, compared to other periods in the dataset?  Justify your answer by referring to specific values in a table or by producing a visualization of the data.

*Write your answer here, replacing this text. Feel free to add code cells above or below if you want to generate new tables or charts.*

## 3. Birth Rates


The following table contains population estimates based on census data.  There are 52 rows in the table, one for each state as well as Puerto Rico and the District of Columbia.  In the rest of this problem, just assume that the word "states" means all those 52 territories.  The columns include population estimates on July 1, 2015 and July 1, 2016.  The last four columns describe some components of the estimated change in population during this time interval.

In [None]:
# Don't change this cell; just run it.
# From http://www2.census.gov/programs-surveys/popest/datasets/2010-2016/national/totals/nst-est2016-alldata.csv
# See http://www2.census.gov/programs-surveys/popest/datasets/2010-2015/national/totals/nst-est2015-alldata.pdf
#     for column descriptions. (As of Feb. 2017, no descriptions were posted for 2010-2016.)
pop = Table.read_table('nst-est2016-alldata.csv').where('SUMLEV', 40).select([1, 4, 12, 13, 27, 34, 62, 69])
pop = pop.relabeled(2, '2015').relabeled(3, '2016')
pop = pop.relabeled(4, 'BIRTHS').relabeled(5, 'DEATHS')
pop = pop.relabeled(6, 'MIGRATION').relabeled(7, 'OTHER')
pop.set_format(range(2,8), NumberFormatter(decimals=0)).show(3)

**Question 1.** Assign `us_birth_rate` to the total US annual birth rate during this time interval. The annual birth rate for a year-long period is the number of births in that period as a proportion of the population at the start of the period.
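
To make the definition concrete, here is a sketch on three made-up regions. The arrays and their values are hypothetical. It also shows that a total rate computed from summed counts generally differs from the average of the per-region rates.

```python
import numpy as np

# Made-up start-of-period populations and birth counts for three regions
toy_start_pop = np.array([1000, 2000, 500])
toy_births = np.array([12, 20, 8])

# Rate for the combined area: total births over total starting population
total_rate = np.sum(toy_births) / np.sum(toy_start_pop)

# Per-region rates, for comparison
region_rates = toy_births / toy_start_pop
mean_of_rates = np.mean(region_rates)

print(total_rate, mean_of_rates)
```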

In [None]:
us_birth_rate = ...
us_birth_rate

In [None]:
check3_1(us_birth_rate)

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


**Question 2.** Assign `fastest_growth` to an array of the names of the five states with the fastest population growth rates.  The elements of the array should be sorted in descending order of growth rate. The growth rate is the change in population during the time period as a proportion of the population at the start of the period.

In [None]:
fastest_growth = ...
fastest_growth

In [None]:
check3_2(fastest_growth)

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


**Question 3.** Assign `movers` to the number of states for which the absolute value of the annual rate of migration was above 1%. The annual rate of migration for a year-long period is the net number of migrations (in and out) as a proportion of the population at the start of the period. The `MIGRATION` column contains estimated annual net migration counts by state.
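
Counting how many values satisfy a condition can be done by comparing an array to a threshold and counting the `True` entries. A minimal sketch, using made-up numbers and hypothetical names:

```python
import numpy as np

# Made-up net migration counts and start-of-period populations
toy_migration = np.array([15, -30, 2])
toy_start_pop = np.array([1000, 2000, 500])

# Net migration as a proportion of the starting population
toy_rates = toy_migration / toy_start_pop

# Boolean array: True where the magnitude of the rate exceeds 1%
above_one_percent = np.abs(toy_rates) > 0.01

# Number of True entries
toy_count = np.count_nonzero(above_one_percent)

print(toy_count)
```

Taking the absolute value first means that large net outflows count as well as large net inflows.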

In [None]:
movers = ...
movers

In [None]:
check3_3(movers)

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


**Question 4.** Assign `ne_births` to the total number of births that occurred in region 1 (the Northeastern US).

In [None]:
ne_births = ...
ne_births

In [None]:
check3_4(ne_births)

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


**Question 5.** Assign `less_than_ne_births` to the number of states that had a total population in 2016 that was smaller than the number of babies born in region 1 in this time period.

In [None]:
less_than_ne_births = ...
less_than_ne_births

In [None]:
check3_5(less_than_ne_births)

In [None]:
#
# AUTOGRADER TEST - DO NOT REMOVE
#


**Question 6.** Was there an association between birth rate and death rate during this time period? Use the code cell below to support your conclusion with a chart. If an association exists, what might explain it?

*Write your answer here, replacing this text.*

In [None]:
# Generate a chart here to support your conclusion


## 4. Marginal Histograms


Suppose we have a table called `t` that has two columns in it:

- `x`: a column containing the x-values of some points
- `y`: a column containing the y-values of some points

Also suppose we use `t.scatter('x', 'y')` to produce a scatter plot of those points:

![](scatter.png)
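
A histogram of a single column uses only that column's values and ignores the other coordinate. The sketch below illustrates this with NumPy's `np.histogram` on six made-up points; the points and bin edges are hypothetical and unrelated to the scatter plot above.

```python
import numpy as np

# Six made-up points, given as separate x and y coordinates
toy_x = np.array([0, 1, 2, 3, 4, 5])
toy_y = np.array([0, 0, 0, 1, 5, 5])

# Shared bin edges: [0, 2), [2, 4), [4, 6]
edges = [0, 2, 4, 6]

# Counts per bin, computed separately for each coordinate
x_counts, _ = np.histogram(toy_x, bins=edges)
y_counts, _ = np.histogram(toy_y, bins=edges)

print(x_counts, y_counts)
```

Here the x-values are spread evenly across the bins while the y-values pile up at the low end, so the two histograms have different shapes even though they describe the same points.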


**Question 1.** Below are two expressions, as well as two histograms.  Match each of the histograms to the expression that produced it. Edit the cell below to give your answer and explain your reasoning.

**Expression 1:** `t.hist('x')`

**Histogram for Expression 1:**

**Explanation:**

* * *

**Expression 2:** `t.hist('y')`

**Histogram for Expression 2:**

**Explanation:**

**Histogram A:** ![](var1.png)
**Histogram B:** ![](var2.png)

## 5. Submission

To submit your assignment, click the red Submit button above. You may submit as many times as you wish before the deadline. Only your final submission will be graded. No late work will be accepted, so please make sure you submit something before the deadline!

Before you submit, it would be wise to click on the menu item Kernel -> Restart & Run All.  That will re-run all your cells from scratch.  Take a second look to make sure all your answers are passing the checks.  Doing this will help catch any errors in your homework that result from running cells in a strange order.