# Homework 3: Table Manipulation and Visualization

Please complete this notebook by filling in the cells provided. 

**Helpful Resource:**
- [Python Reference](http://data8.org/sp22/python-reference.html): Cheat sheet of helpful array & table methods used in Data 8!

**Recommended Reading**: 
* [Visualization](https://inferentialthinking.com/chapters/07/Visualization.html)

**Reminders**

1. For every problem that asks for a written explanation, you **must** provide your answer in the designated space. 

2. Directly sharing answers is not okay, but discussing problems with your instructor or with other students is encouraged. 

3. You should start early so that you have time to get help if you're stuck.

In [None]:
# Don't change this cell; just run it. 

import numpy as np
from datascience import *
import warnings
warnings.simplefilter('ignore', FutureWarning)

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

## 1. Unemployment

The Great Recession of 2008-2009 was a period of economic decline observed globally, with scale and timing varying from country to country. In the United States, it resulted in a rapid rise in unemployment that affected industries and population groups to different extents.

The Federal Reserve Bank of St. Louis publishes data about jobs in the US.  Below, we've loaded data on unemployment in the United States. There are many ways of defining unemployment, and our dataset includes two notions of the unemployment rate:

1. Among people who are able to work and are looking for a full-time job, the percentage who can't find a job.  This is called the Non-Employment Index, or NEI.
2. Among people who are able to work and are looking for a full-time job, the percentage who can't find any job *or* are only working at a part-time job.  The latter group is called "Part-Time for Economic Reasons", so the acronym for this index is NEI-PTER.

The source of the data is [here](https://fred.stlouisfed.org/categories/33509).

**Question 1.** The data are in a CSV file called `unemployment.csv`.  Load that file into a table called `unemployment`. **(4 Points)**


In [None]:
unemployment = ...
unemployment

**Question 2.** Sort the data in descending order by NEI, naming the sorted table `by_nei`.  Create another table called `by_nei_pter` that's sorted in descending order by NEI-PTER instead. **(4 Points)**


In [None]:
by_nei = ...
by_nei_pter = ...

In [None]:
# Run this cell to check your by_nei table. You do not need to change the code.
by_nei.show(5)

In [None]:
# Run this cell to check your by_nei_pter table. You do not need to change the code.
by_nei_pter.show(5)

**Question 3.** Use `take` to make a table containing the data for the 11 quarters when NEI was greatest.  Call that table `greatest_nei`.

`greatest_nei` should be sorted in descending order of `NEI`. Note that each row of `unemployment` represents a quarter. **(4 Points)**


In [None]:
greatest_nei = ...
greatest_nei

**Question 4.** It's believed that many people became PTER (recall: "Part-Time for Economic Reasons") in the "Great Recession" of 2008-2009.  NEI-PTER is the percentage of people who are unemployed (included in the NEI) plus the percentage of people who are PTER.

Compute an array containing the percentage of people who were PTER in each quarter.  (The first element of the array should correspond to the first row of `unemployment`, and so on.) **(4 Points)**

*Note:* Use the original `unemployment` table for this.


In [None]:
pter = ...
pter

**Question 5.** Add `pter` as a column to `unemployment` (name the column `PTER`) and sort the resulting table by that column in descending order.  Call the resulting table `by_pter`.

Try to do this with a single line of code, if you can. **(4 Points)**


In [None]:
by_pter = ...
by_pter

**Question 6.** Create a line plot of PTER over time. To do this, create a new table called `pter_over_time` by making a copy of the `unemployment` table and adding two new columns: `Year` and `PTER` using the `year` array and the `pter` array, respectively. Then, generate a line plot using one of the table methods you've learned in class.

The order of the columns matters for our correctness tests, so be sure `Year` comes before `PTER`. **(4 Points)**

*Note:* When constructing `pter_over_time`, do not just add the `year` column to the `by_pter` table. Please follow the directions in the question above.


In [None]:
year = 1994 + np.arange(by_pter.num_rows)/4
year
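
To see what the `year` expression above produces, here is a minimal sketch on a made-up row count. The name `num_quarters` is an assumption for illustration only, standing in for `by_pter.num_rows`. Dividing the quarter index by 4 turns each quarter into a fraction of a year.

```python
import numpy as np

num_quarters = 8  # hypothetical row count, standing in for by_pter.num_rows

# np.arange(8) is 0, 1, ..., 7; dividing by 4 gives steps of a quarter year.
example_year = 1994 + np.arange(num_quarters) / 4
print(example_year)
```

The first four entries fall in 1994 (1994, 1994.25, 1994.5, 1994.75), and the fifth entry is exactly 1995.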

In [None]:
year = 1994 + np.arange(by_pter.num_rows)/4
pter_over_time = unemployment.with_columns(
    'Year', year,
    'PTER', pter
)

...              # Use this line to create the line plot
plots.ylim(0,2); # Do not change this line

**Question 7.** Were PTER rates high during the Great Recession (that is to say, were PTER rates particularly high in the years 2008 through 2011)? Assign `highPTER` to `True` if you think PTER rates were notably high in this period, compared with rates from 1995 to 2007, or `False` if you think they weren't. **(4 Points)**


In [None]:
highPTER = ...

## 2. Birth Rates

The following table gives Census-based population estimates for each US state on both July 1, 2015 and July 1, 2016. The last four columns describe the components of the estimated change in population during this time interval. **For all questions below, assume that the word "states" refers to all 52 rows including Puerto Rico and the District of Columbia.**

The data was taken from [here](http://www2.census.gov/programs-surveys/popest/datasets/2010-2016/national/totals/nst-est2016-alldata.csv). (Note: If it doesn't download for you when you click the link, please copy the link address and paste it into your browser's address bar!) If you want to read more about the different column descriptions, click [here](http://www2.census.gov/programs-surveys/popest/datasets/2010-2015/national/totals/nst-est2015-alldata.pdf).

The raw data is a bit messy—run the cell below to clean the table and make it easier to work with.

In [None]:
# Don't change this cell; just run it.
pop = Table.read_table('nst-est2016-alldata.csv').where('SUMLEV', 40).select([1, 4, 12, 13, 27, 34, 62, 69])
pop = pop.relabeled('POPESTIMATE2015', '2015').relabeled('POPESTIMATE2016', '2016')
pop = pop.relabeled('BIRTHS2016', 'BIRTHS').relabeled('DEATHS2016', 'DEATHS')
pop = pop.relabeled('NETMIG2016', 'MIGRATION').relabeled('RESIDUAL2016', 'OTHER')
pop = pop.with_columns("REGION", np.array([int(region) if region != "X" else 0 for region in pop.column("REGION")]))
pop.set_format([2, 3, 4, 5, 6, 7], NumberFormatter(decimals=0)).show(5)

**Question 1.** Use Table methods and the `sum` function to assign `us_birth_rate` to the total US annual birth rate during this time interval. The annual birth rate for a year-long period is the total number of births in that period as a proportion of the total population size at the start of the time period. **(4 Points)**

*Hint:* Which year corresponds to the start of the time period?
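
As an illustration of the rate definition only, here is a small sketch with hypothetical numbers for two made-up regions. None of these values come from the `pop` table, and all variable names are assumptions.

```python
# Hypothetical populations at the start of the period and births during it.
example_populations = [1_200_000, 800_000]
example_births = [15_000, 10_000]

# Totals across all regions.
total_population = sum(example_populations)   # 2,000,000
total_births = sum(example_births)            # 25,000

# A rate is a count divided by the population at the start of the period.
example_birth_rate = total_births / total_population
print(example_birth_rate)  # 0.0125, i.e. 1.25% of the starting population
```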

In [None]:
total_pop_at_start = ...
total_births_in_period = ...
us_birth_rate = ...
us_birth_rate

**Question 2.** Assign `movers` to the number of states for which the **absolute value** of the **annual rate of migration** was higher than 1%. The annual rate of migration for a year-long period is the net number of migrations (in and out) as a proportion of the population size at the start of the period. The `MIGRATION` column contains estimated annual net migration counts by state. **(4 Points)**

*Hint*: `migration_rates` should be a helpful table based on the `pop` table, and `movers` should be an integer, the number of rows in `migration_rates` where the absolute value of the migration rate is above 1% (i.e., greater than 0.01).
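
The idea of counting values whose absolute value exceeds a threshold can be sketched on a small hypothetical array. The rates below are invented for illustration and the names are assumptions; this is one possible approach, not necessarily the intended one.

```python
import numpy as np

# Hypothetical net migration rates for five made-up regions.
example_rates = np.array([0.015, -0.02, 0.004, -0.003, 0.011])

# np.abs drops the sign, so large outflows count as well as large inflows.
is_large = np.abs(example_rates) > 0.01

# np.count_nonzero counts the True entries of a boolean array.
example_count = np.count_nonzero(is_large)
print(example_count)  # 3
```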


In [None]:
migration_rates_array = pop.column('MIGRATION') / pop.column('2015')
migration_rates = pop.with_columns('MIGRATION_RATE', migration_rates_array)
migration_rates.show(5)
movers = ...
movers

**Question 3.** Assign `west_births` to the total number of births that occurred in region 4 (the Western US). **(4 Points)**

*Hint:* Make sure you double check the type of the values in the `REGION` column and appropriately filter so you are only considering states in the Western US. The correct answer is a bit less than 1 million.


In [None]:
west_region = pop.where("REGION", 4)  # a helpful table
west_region.show(5)
west_births = ...  # total number of western births
west_births

**Question 4.** In the next question, you will be creating a visualization to understand the relationship between birth and death rates. The annual death rate for a year-long period is the total number of deaths in that period as a proportion of the population size at the start of the time period.

What visualization is most appropriate to see if there is an association between birth and death rates during a given time interval?

1. Line Graph
2. Bar Chart
3. Scatter Plot
4. Histogram

Assign `visualization` below to the number corresponding to the correct visualization. **(4 Points)**


In [None]:
visualization = ...

<!-- BEGIN QUESTION -->

**Question 5.** In the code cell below, create a visualization that will help us determine if there is an association between birth rate and death rate during this time interval. It may be helpful to create an intermediate table here. **(4 Points)**

Things to consider:

- What type of chart will help us illustrate an association between two quantitative variables?
- How can you manipulate a certain table to help generate your chart?
- Check out the Recommended Reading for this homework!


In [None]:
# In this cell, use birth_rates and death_rates to generate your visualization
# We start by making two arrays of length 52 for the by-state birth rates and death rates
birth_rates = ...
death_rates = ...

bd_rates_table = Table().with_columns(
    'Birthrate', birth_rates,
    'Deathrate', death_rates
)   # helpful table

...                          # visualization

<!-- END QUESTION -->

**Question 6.** True or False: There is a negative association between birth rate and death rate during this time interval. 

Assign `negative_assoc` to `True` or `False` in the cell below. **(4 Points)**


In [None]:
negative_assoc = ...

## 3. Uber

Below, we load tables containing 200,000 weekday Uber rides in the Manila, Philippines, and Boston, Massachusetts metropolitan areas from the [Uber Movement](https://movement.uber.com) project. The `sourceid` and `dstid` columns contain codes corresponding to start and end locations of each ride. The `hod` column contains codes corresponding to the hour of the day the ride took place. The `ride time` column contains the length of the ride in minutes.

In [None]:
boston = Table.read_table("boston.csv")
manila = Table.read_table("manila.csv")
print("Boston Table")
boston.show(4)
print("Manila Table")
manila.show(4)

<!-- BEGIN QUESTION -->

**Question 1.** Produce a histogram that visualizes the distribution of all ride times in Boston using the given bins in `equal_bins`. **(4 Points)**


In [None]:
equal_bins = np.arange(0, 120, 5)
...   # Add a line of code to draw the histogram

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.** Now, produce a histogram that visualizes the distribution of all ride times in Manila using the given bins. **(4 Points)**


In [None]:
equal_bins = np.arange(0, 120, 5)
...   # Add a line of code to draw the histogram

# Don't delete the following line!
plots.ylim(0, 0.05);

<!-- END QUESTION -->

**Question 3.** Let's take a closer look at the y-axis label. Assign `unit_meaning` to an integer (1, 2, 3, 4) that corresponds to the "unit" in "Percent per unit". **(4 Points)**

1. minute  
2. ride time  
3. second
4. hod


In [None]:
unit_meaning = ...
unit_meaning

**Question 4.** Assign `boston_under_15` and `manila_under_15` to the percentage of rides that are less than 15 minutes in their respective metropolitan areas. 

  - The area of each histogram bar is proportional to the number of entries in the bin. Review the reading, [Section 7.2.5](https://inferentialthinking.com/chapters/07/2/Visualizing_Numerical_Distributions.html#), as needed.
  - What is the width of each bar in the histograms you just drew?
  - Use the height variables provided below, which give the heights of the first 3 bars in both histograms, in order to compute the percentages. 
  - Your solution should only use height variables, numbers, and mathematical operations. You should **not** access the tables `boston` and `manila` in any way. 
  
**(4 Points)**
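
The area principle can be illustrated with a separate, hypothetical histogram whose bins are 10 units wide. The heights and names below are invented and unrelated to the Boston and Manila data.

```python
# Hypothetical histogram: bins [0, 10) and [10, 20), heights in percent per unit.
bin_width = 10
height_0_to_10 = 2.0
height_10_to_20 = 3.5

# Area of a bar = height * width = percent of entries in that bin.
percent_under_20 = bin_width * (height_0_to_10 + height_10_to_20)
print(percent_under_20)  # 55.0
```

In this made-up histogram, 55% of the entries fall below 20 because the two bar areas are 20% and 35%.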


In [None]:
boston_under_5_ht = 1.2
manila_under_5_ht = 0.6
boston_5_to_under_10_ht = 3.2
manila_5_to_under_10_ht = 1.4
boston_10_to_under_15_ht = 4.9
manila_10_to_under_15_ht = 2.2

boston_under_15 = ...
manila_under_15 = ...

boston_under_15, manila_under_15

**Question 5.** Let's take a closer look at the distribution of ride times in both cities. Calculate `boston_median_ride_time` and `manila_median_ride_time`.

*Hint:* The median of a sorted list has half of the list elements to its left, and half to its right. For example, you can sort the Boston ride times and locate the median based on the "middle index", or use the function `np.median`.

**(4 Points)**
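
The "middle index" idea from the hint can be sketched on a short hypothetical list. The ride times and variable names below are invented for illustration.

```python
import statistics

# Hypothetical ride times in minutes (odd count, so there is one middle value).
example_times = [12.5, 7.0, 30.0, 18.0, 9.5]

sorted_times = sorted(example_times)       # [7.0, 9.5, 12.5, 18.0, 30.0]
middle_index = len(sorted_times) // 2      # 2
median_by_index = sorted_times[middle_index]

# Cross-check against the standard library.
assert median_by_index == statistics.median(example_times)
print(median_by_index)  # 12.5
```

When a list has an even number of elements there is no single middle element; `np.median` then returns the average of the two middle values.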


In [None]:
boston_median_ride_time = ...
manila_median_ride_time = ...

boston_median_ride_time, manila_median_ride_time

<!-- BEGIN QUESTION -->

**Question 6.** Identify *one difference* between the histograms, in terms of the statistical properties. Can you comment on the center and/or skew of each histogram? Does one of the distributions show a wider spread of values away from the center (more variability)? **(4 Points)**

*Hint*: The best way to do this is to compare the two histograms (from 3.1 and 3.2) visually.


%=================================%

_Type your answer here, replacing this text._

%=================================%

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 7.** Why is your response in Question 6 the case? Specifically, based on one of the following two readings, why are the distributions for Boston and Manila different in the way you have identified? **(4 Points)**

- [Boston reading](https://www.boston-discovery-guide.com/boston-driving-tips.html)
- [Manila reading](https://manilafyi.com/why-is-manila-traffic-so-bad/)

*Hint:* Try thinking about external factors of the two cities that may be causing the difference! There may be multiple different factors that come into play.


%=================================%

_Type your answer here, replacing this text._

%=================================%

<!-- END QUESTION -->

## 4. Histograms

Consider the following scatter plot: 

![](scatter.png)

The axes of the plot represent values of two variables: $x$ and $y$. 

Suppose we have a table called `t` that has two columns in it:

- `x`: a column containing the x-values of the points in the scatter plot
- `y`: a column containing the y-values of the points in the scatter plot

Below, you are given three histograms—one corresponds to column `x`, one corresponds to column `y`, and one does not correspond to either column. 

**Histogram A:**

![](var3.png)

**Histogram B:**

![](var1.png)

**Histogram C:**

![](var2.png)

**Question 1.** Suppose we run `t.hist('y')`. Which histogram does this code produce? Assign `histogram_column_y` to either 'A', 'B', or 'C'. **(4 Points)**

In [None]:
histogram_column_y = ...

<!-- BEGIN QUESTION -->

**Question 2.** State at least one reason **why** you chose the histogram from Question 1. (e.g., "I chose histogram A because ..."). **(4 Points)**


%===========================%

_Type your answer here, replacing this text._

%===========================%

<!-- END QUESTION -->

## 5. Finish Line

Congratulations, you're done with Homework 3!  

1. Make sure you have run all the cells in your notebook in order, so that all images/graphs appear in the output. 
2. Save and Checkpoint your file (Ctrl-S).
3. Download a copy as HTML.
4. Upload your HTML file to the assignment activity on Moodle.