In [None]:
# Initialize OK
from client.api.notebook import Notebook
ok = Notebook('lab01.ok')

In [1]:
from IPython.display import HTML
from datascience import *

import matplotlib
matplotlib.use('Agg')
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import os
plt.style.use('fivethirtyeight')

import pandas as pd
import zipfile
import io
import math

def css_styling():
    # read the shared stylesheet and close the file when done
    with open('../notebook_styles.css', 'r') as css_file:
        styles = css_file.read()
    return HTML(styles)
css_styling()

# Lab 01 - Life and death in the United States

## Speedy review of life tables

We just learned some of the basics of the **period life table**. The period life table is a key concept in the study of mortality.

Let's quickly review some of the material that we covered in the brief lecture.

**Life table concepts, columns and notation**

The life tables we'll work with today describe the experience of a hypothetical cohort of people who start at age 0. The mortality experience of this hypothetical cohort is given by observed mortality in a real-world population (e.g. a U.S. state). The number of people who start at age 0 in the hypothetical cohort is called the **radix**. Today, we'll use life tables that have a couple of important properties:

* the radix is 100,000 people
* the width of all of the age intervals is 1

With that in mind, we'll use the following names for the columns of the life table:

* `lx` - the number of people who survive to exact age $x$ (called $l_x$ in mathematical notation)
* `dx` - the number of deaths between ages $x$ and $x + 1$ (called ${}_nd_x$ in mathematical notation)
* `qx` - the probability of dying between ages $x$ and $x+1$, given survival to age $x$ (called ${}_nq_x$ in mathematical notation)
* `mx` - the death *rate* between ages $x$ and $x+1$ in the life table population (called ${}_nm_x$ in mathematical notation)
* `Lx` - the number of person-years lived between ages $x$ and $x + 1$ by people in the life table population (called ${}_nL_x$ in mathematical notation)
* `ax` - the average number of years lived *among people who die* in the interval from $x$ to $x + 1$ (called ${}_na_x$ in mathematical notation)
* `Tx` - the total number of person-years lived above exact age $x$ (called $T_x$ in mathematical notation)
* `ex` - the (remaining) life expectancy at exact age $x$ (called $e_x$ in mathematical notation)
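
To see how several of these columns fit together, here is a minimal sketch that builds them from a survivorship column. The numbers are made up for illustration (they are not USMD data), and the sketch assumes that people who die in an interval live half of it on average, so `ax` is 0.5 everywhere.

```python
# Toy survivorship column for ages 0-3 (made-up numbers, not USMD data).
# Everyone in this hypothetical cohort dies before exact age 4.
lx = [100000, 90000, 60000, 20000]
ax = [0.5, 0.5, 0.5, 0.5]  # assumption: deaths happen mid-interval on average

# survivors to the next exact age; 0 after the last interval
lx_next = lx[1:] + [0]

# dx: deaths in each interval are the drop in the number of survivors
dx = [l - l_next for l, l_next in zip(lx, lx_next)]

# qx: probability of dying in the interval, given survival to age x
qx = [d / l for d, l in zip(dx, lx)]

# Lx: one full year for each survivor, plus ax years for each person who dies
Lx = [l_next + a * d for l_next, a, d in zip(lx_next, ax, dx)]

# mx: deaths per person-year lived in the interval
mx = [d / L for d, L in zip(dx, Lx)]

# Tx: person-years lived above exact age x (add up Lx from age x onward)
Tx = [sum(Lx[i:]) for i in range(len(Lx))]

print(dx)
print(Lx)
print(Tx)
```

Notice that `qx` divides deaths by the number of *people* alive at the start of the interval, while `mx` divides deaths by the *person-years* lived in it. That is why one is a probability and the other is a rate.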


## Introductions

**What is your partner's name?**
<!--
BEGIN QUESTION
name: intro1
points: 1
manual: True
-->
<!-- EXPORT TO PDF -->

*Write your answer here, replacing this text.*

**Which U.S. State has your partner spent the most time in?**
<!--
BEGIN QUESTION
name: intro2
points: 1
manual: True
-->
<!-- EXPORT TO PDF -->

*Write your answer here, replacing this text.*

**Which U.S. State would your partner most like to visit in the future?**
<!--
BEGIN QUESTION
name: intro3
points: 1
manual: True
-->
<!-- EXPORT TO PDF -->

*Write your answer here, replacing this text.*

Let's start with a couple of questions that will check your understanding of life tables. (Be sure to discuss these with your partner.)

**Question** What is another name for $l_0$?
<!--
BEGIN QUESTION
name: intro4
points: 1
manual: True
-->
<!-- EXPORT TO PDF -->

*Write your answer here, replacing this text.*

**Question** How can life expectancy at age $x$ be calculated from other columns in the life table?
<!--
BEGIN QUESTION
name: intro5
points: 1
manual: True
-->
<!-- EXPORT TO PDF -->

*Write your answer here, replacing this text.*

Great -- we're ready to get started!

Moving forward, in today's lab, we'll start to use what we learned by exploring life tables for the US states. Our goals are:

* to build our understanding of life tables
* to start to recognize some characteristic patterns of mortality in human populations
* to investigate how different human populations vary in terms of mortality
* to get some practice using the tools you're learning in Data 8

### The US Mortality Database

We'll be looking at life tables from the [United States Mortality Database](https://usa.mortality.org/), a brand-new resource that was produced by researchers here at UC Berkeley, along with collaborators at INED (the French Institute for Demographic Studies) and the UN. The USMD has life tables for each US state starting several decades in the past and going up to 2015.

First, please sign up as a user of the US Mortality Database (you and your partner should each do this). You can sign up for free by going to this link:

[https://usa.mortality.org/mp/auth.pl](https://usa.mortality.org/mp/auth.pl)

When you sign up, you help the people who put tons of work into this database prove to their funders that lots of people use it.

Since we will be looking at life tables for different US states, it will be helpful to have a list of the two-letter codes for the 50 states and for Washington, D.C.

In [2]:
## all of the state codes
all_states = \
       ['AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA',\
       'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME',\
       'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM',\
       'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX',\
       'UT', 'VA', 'VT', 'WA', 'WI', 'WV', 'WY']

In order to make it easier to read in data from the USMD, I've written a simple function, `get_state_lt`. Look at the function and try to get a sense for how it works. (It's OK if you don't understand every detail at this point.)

In [3]:
def get_state_lt(state_code, sex_code, year = None):
    """
    Grab the life table data for a given state, sex, and time period
    
    Arguments:
      state_code - the two-letter state abbreviation, all caps (example: 'CA' or 'NY')
      sex_code - either 'f' for females, 'm' for males, or 'b' for both
      year - the year (if not used, all years are returned)
    """
    
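    # path to the zip archive that holds all of the state life tables
    zipurl = '../data/temp-us-lifetables.zip'
    
    # paths inside a zip archive always use forward slashes, whatever the operating system
    fileurl = '/'.join(['States', state_code, state_code + '_' + sex_code + 'ltper_1x1.csv'])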
    
    with zipfile.ZipFile(zipurl) as archive:        
        with archive.open(fileurl) as ltfile:
            lt_data_raw = pd.read_csv(ltfile)
            lt_data = Table.from_df(lt_data_raw)

    # remove the highest age group
    lt_data = lt_data.where('Age', are.not_equal_to('110+'))
    
    # convert everything but state name and sex to numeric data types
    for cur_col in lt_data.labels[2:]:
        lt_data[cur_col] = pd.to_numeric(lt_data[cur_col])
    
    if year is not None:
        lt_data = lt_data.where('Year', are.equal_to(int(year)))
    else:
        print("No year specified, so returning all years.")
        
    return(lt_data)
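
If the zip-file part of `get_state_lt` looks mysterious, the sketch below tries the same pattern (open one CSV file inside a zip archive without unpacking it) on a tiny archive built in memory. The state code `XX` and the values are made up for illustration, and the sketch uses Python's built-in `csv` module instead of `pd.read_csv` so that it runs on its own.

```python
import csv
import io
import zipfile

# Build a tiny zip archive in memory. The member name and values are made up.
member_name = 'States/XX/XX_fltper_1x1.csv'
csv_text = "PopName,Sex,Year,Age,lx\nXX,f,2015,0,100000\nXX,f,2015,1,99500\n"

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr(member_name, csv_text)

# Read the member back without extracting it to disk.
buffer.seek(0)
with zipfile.ZipFile(buffer) as archive:
    with archive.open(member_name) as raw_file:
        # archive.open gives us bytes, so wrap it to get text for the csv module
        text_file = io.TextIOWrapper(raw_file, encoding='utf-8', newline='')
        rows = list(csv.DictReader(text_file))

print(rows[0]['lx'])
```

Values read this way come back as strings, which is the same reason `get_state_lt` converts its columns to numeric types after reading the file.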


**Now use the `get_state_lt` function to read in the life table for New York males in 2015**
<!--
BEGIN QUESTION
name: load_lt
points: 1
-->

In [4]:
test_lt = get_state_lt(...)
test_lt

In [None]:
ok.grade("load_lt");

You can see that, in addition to the life table columns we reviewed above, a USMD life table has the following columns:

* `PopName` - the state code
* `Sex` - 'm' for males, 'f' for females, and 'b' for both
* `Year` - the period (calendar year) that the life table is based on. The most recent year is 2015.

## Visualize columns of the life table for Californian women

We'll start a deeper dive into US life tables using Californian women in 2015 as an example.

**Question: Use the `get_state_lt` function to retrieve the life table for Californian females in 2015**
<!--
BEGIN QUESTION
name: load_lt_caf
points: 1
-->

In [9]:
ca_f_lt = ...
ca_f_lt

In [None]:
ok.grade("load_lt_caf");

**Question: Plot the survivorship column (lx) by age**
<!--
BEGIN QUESTION
name: plot_survivorship
points: 1
manual: true
image: true
-->
<!-- EXPORT TO PDF -->

In [14]:
ca_f_lt.plot(..., ...)

**Question: Plot the life table number of deaths (dx) by age**
<!--
BEGIN QUESTION
name: plot_deaths
points: 1
manual: true
image: true
-->
<!-- EXPORT TO PDF -->

In [15]:
...

**Question: Plot the life table death rate (mx) by age**
<!--
BEGIN QUESTION
name: plot_death_rate
points: 1
manual: true
image: true
-->
<!-- EXPORT TO PDF -->

In [16]:
...

Death rates like the ones you just plotted are central to the study of mortality; however, as the plot reveals, they vary tremendously across the age range.  In fact, they vary so much that it can be hard to see much helpful information on a plot like the one above.  Therefore, demographers usually look at the logarithm of the death rate, rather than the death rates themselves.
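
To get a feel for why the logarithm helps, here is a small sketch with made-up death rates (illustrative values only, not USMD data). It compares how spread out the rates are on the raw scale and on the log scale.

```python
import numpy as np

# Hypothetical death rates at a few ages (illustrative values, not USMD data)
ages = np.array([0, 10, 30, 60, 90])
mx = np.array([0.005, 0.0001, 0.001, 0.01, 0.2])

# natural logarithm, applied to each value in the array
log_mx = np.log(mx)

# On the raw scale, the largest rate is thousands of times the smallest...
print(mx.max() / mx.min())

# ...but on the log scale, the values span only a modest range
print(log_mx.max() - log_mx.min())
```

On a plot of the raw rates, everything except the oldest ages would be squashed against the horizontal axis. On the log scale, differences among the small rates at young ages become visible.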

**Create a new column called logmx that has the log of the life table death rate**  
*[HINT: the function `np.log` will take the logarithm of every value in an array for you]*
<!--
BEGIN QUESTION
name: log_death_rate
points: 1
-->

In [17]:
ca_f_lt = ca_f_lt.with_column('logmx', ...)
ca_f_lt

In [None]:
ok.grade("log_death_rate");

**Question: Now plot the log of the life table death rate by age**
<!--
BEGIN QUESTION
name: plot_logmx
points: 1
manual: true
image: true
-->
<!-- EXPORT TO PDF -->

In [19]:
...

**About what age seems to have the lowest death rate?**
<!--
BEGIN QUESTION
name: lowest_death_rate
points: 1
manual: True
-->
<!-- EXPORT TO PDF -->

*Write your answer here, replacing this text.*

In class, we discussed how to use the life table to calculate life expectancy at birth: it is the total number of person-years lived above age 0 ($T_0$) divided by the number of people who start at age 0 ($l_0$, the radix).

**Question: use the CA female life table to calculate life expectancy at birth**
<!--
BEGIN QUESTION
name: manual_ex
points: 1
-->

In [20]:
radix = 100000 # this is the radix for all of the USMD lifetables
ca_f_e0 = ... / ...
ca_f_e0

In [None]:
ok.grade("manual_ex");

**Question: Now compare your calculation to the `ex` entry in the first row of the life table**
<!--
BEGIN QUESTION
name: compare_ex
points: 1
manual: True
-->
<!-- EXPORT TO PDF -->

In [22]:
... # print the first few rows of the life table to compare

We most often discuss life expectancy at birth, but life expectancy can be calculated at any age; the *ex* column of the life table has the expected number of years of life to be lived above each age.
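
Life expectancy at any age $x$ is the person-years lived above that age divided by the number of people who reach it, $e_x = T_x / l_x$. The sketch below applies this to a toy life table with made-up numbers (not USMD data) in which everyone dies before exact age 4.

```python
# Toy life table columns for ages 0-3 (made-up numbers, not USMD data)
lx = [100000, 90000, 60000, 20000]   # survivors to exact age x
Tx = [220000, 125000, 50000, 10000]  # person-years lived above exact age x

# remaining life expectancy at each age: person-years ahead per survivor
ex = [T / l for T, l in zip(Tx, lx)]

for age, e in enumerate(ex):
    print(age, round(e, 2))
```

Real life tables will not necessarily follow the steady decline you see in this toy example, so look closely at the plot you make from the USMD data.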

**Question: Plot life expectancy by age**
<!--
BEGIN QUESTION
name: plot_ex
points: 1
manual: True
image: True
-->
<!-- EXPORT TO PDF -->

In [23]:
...

**Does anything surprise you about the plot of life expectancy by age?**
<!--
BEGIN QUESTION
name: ex_plot_surprise
points: 1
manual: True
-->
<!-- EXPORT TO PDF -->

*Write your answer here, replacing this text.*

## Comparing California and another state

**Question: Now plot the dx column of the life table for New Hampshire females in 2015.**  
*[NOTE: naturally, you will have to open the life table before you can plot it]*
<!--
BEGIN QUESTION
name: nh_f_dx_plot
points: 1
manual: True
image: True
-->
<!-- EXPORT TO PDF -->

In [24]:
...

**Question: ... and calculate the log death rates and plot them by age for NH females as well.**
*[NOTE: you may get a warning 'divide by zero encountered in log'. We'll disregard this for the purposes of the plot -- but, so you know, the problem is that at some ages the death rate among NH females is 0, and you can't take the log of 0]*
<!--
BEGIN QUESTION
name: nh_f_logmx_plot
points: 1
manual: True
image: True
-->
<!-- EXPORT TO PDF -->
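
If you are curious about the warning, this small sketch shows what `np.log` does with a death rate of exactly 0. The rates are made up for illustration (not USMD data), and wrapping the call in `np.errstate` is just one optional way to silence the warning.

```python
import numpy as np

# Hypothetical death rates, one of which is exactly 0 (not USMD data)
mx = np.array([0.0004, 0.0, 0.0002])

# np.errstate silences the divide-by-zero warning inside this block only
with np.errstate(divide='ignore'):
    log_mx = np.log(mx)

print(log_mx)

# True wherever the log is infinite, i.e. wherever the death rate was 0
print(np.isinf(log_mx))
```

The log of 0 comes back as negative infinity rather than an error, so the rest of the array is still usable.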

In [25]:
nh_f_lt = nh_f_lt.with_column('logmx', ...)
nh_f_lt.plot(...)

**Question: What difference do you notice between the plots for NH females and the analogous plots you made earlier for CA females? What might explain this difference?**
<!--
BEGIN QUESTION
name: nh_ca_diff
manual: True
points: 1
-->
<!-- EXPORT TO PDF -->

*Write your answer here, replacing this text.*

**Question: What similarities do you notice between the plot for NH females and the plot you made earlier for CA females?**
<!--
BEGIN QUESTION
name: nh_ca_sim
manual: True
points: 1
-->
<!-- EXPORT TO PDF -->

*Write your answer here, replacing this text.*

## Comparing males and females

Now we'll start to compare some life table quantities for males and for females.

**Question: Get the life tables for California males and females in 2015**

In [26]:
ca_lt_f = ...
ca_lt_m = ...

**Question: Now make a new table whose columns are 'Age', 'Males' (with the male `ex` column), and 'Females' (with the female `ex` column)**
<!--
BEGIN QUESTION
name: make_mf_ex_table
points: 1
-->

In [27]:
ca_lt_ex = Table().with_columns('Age', ...,
                                'Males', ...,
                                'Females', ...)
ca_lt_ex

In [None]:
ok.grade("make_mf_ex_table");

**Question: Plot the male and female ex column by age (both males and females in one plot)**
<!--
BEGIN QUESTION
name: plot_mf_ex_oneplot
points: 1
manual: True
image: True
-->
<!-- EXPORT TO PDF -->

In [32]:
...

Now we'll generalize the approach above so that it is a function. Our function will be able to plot a comparison between males and females for any life table column. This will save us lots of time, and it's a good example of how writing functions can be helpful.

**Question: Complete the missing parts of the function below**

In [33]:
def mf_compare(state, year, col):
    # For a given life table column, state, and year, make a plot comparing males and females.

    # open the data for males and females
    lt_m = ...
    lt_f = ...
    
    # make a table that has the male and female values together
    lt_comp = Table().with_columns('Age', lt_f['Age'],
                                   'Males', ...,
                                   'Females', ...)
    
    return(lt_comp)

**Question: Now use the `mf_compare` function to compare the life table distribution of ages at death (the `dx` column) for males and females in California in 2015**
<!--
BEGIN QUESTION
name: mf_compare_test
points: 1
-->

In [34]:
ca_compare_dx = mf_compare(...)
ca_compare_dx.plot("Age")

In [None]:
ok.grade("mf_compare_test");

**Question: Pick two other states and use the `mf_compare` function to compare the life table distribution of ages at death (the `dx` column) for males and females in those two states**

In [37]:
# first state
...

In [38]:
# second state
...

**Question** The general shape of the male/female differences is probably reasonably similar across the three states you examined. What do you notice about male versus female ages at death across all three states?
<!--
BEGIN QUESTION
name: m_f_diff_comment
points: 1
manual: True
-->
<!-- EXPORT TO PDF -->

*Write your answer here, replacing this text.*

### You have finished the lab

There are optional challenge problems below. If you are not going to attempt them, please scroll down to the bottom of the notebook to submit the lab.

### SUBMIT your assignment by MIDNIGHT on the day of class

**For this first lab, to be sure that submission works correctly, please submit using `okpy` and ALSO submit the generated .pdf on bCourses**

## Optional challenge problems: Exploring cross-state variation

**Question** If you have extra time, see if you can figure out   
(a) which state has the highest female life expectancy in 2015?  
(b) which state has the lowest female life expectancy in 2015?  
What might explain the variation in female life expectancy that you observe?

In [39]:
...

*Write your answer here, replacing this text.*

**Question** If you **still** have extra time, see if you can figure out which state has the biggest gap between male and female life expectancies.

In [40]:
...

In [41]:
...

*Write your answer here, replacing this text.*

### Don't forget to SUBMIT your assignment by MIDNIGHT on the day of class

If you attempted the challenge questions, great! Be sure to submit afterwards using the instructions in the cell above.

**For this first lab, to be sure that submission works correctly, please submit using `okpy` and ALSO submit the generated .pdf on bCourses**

# Submit
Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output.
**Please save before submitting!**

<!-- EXPECT 19 EXPORTED QUESTIONS -->

In [None]:
# Save your notebook first, then run this cell to submit.
import jassign.to_pdf
jassign.to_pdf.generate_pdf('lab01.ipynb', 'lab01.pdf')
ok.submit()