# Lecture 7

In [None]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

# Modifying original data vs. returning a copy

There are two methods to change the label of a column in the datascience package.

- `tb.relabel()` overwrites the column name in the original Table.
- `tb.relabeled()` returns a new Table with the updated column name and leaves the original Table unchanged.

Side note: there is also a `tb.copy()` method, which copies all the values in a Table into new memory locations. This is a way to preserve the values in a Table at a particular point in time.
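
The same in-place versus copy distinction shows up in core Python, which may make it more concrete. The sketch below uses built-in lists rather than a Table: `list.sort()` modifies the list it is called on, while `sorted()` returns a new list and leaves the original alone. The flavor names are made up for illustration.

```python
import copy

flavors = ["strawberry", "chocolate", "vanilla"]  # made-up example values

# sorted() returns a new list; the original is unchanged (like tb.relabeled())
alphabetical = sorted(flavors)

# copy.copy() takes a snapshot of the list before we modify it
# (similar in spirit to tb.copy())
snapshot = copy.copy(flavors)

# list.sort() reorders the original list in place (like tb.relabel())
flavors.sort()

print(alphabetical)  # the sorted copy
print(snapshot)      # the original order, preserved by the snapshot
print(flavors)       # the original list, now modified
```

One difference worth noting: `list.sort()` returns `None`, whereas `tb.relabel()` also returns the updated Table.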

In [None]:
cones = Table.read_table("cones.csv")
cones

In [None]:
# use tb.relabeled() to get a Table with a renamed column **without** changing the original Table
cones.relabeled("Flavor", "Taste")

In [None]:
# the original Table has not been modified
cones

In [None]:
# if we use the tb.relabel() method, it changes the original Table (and also returns the updated Table)
cones.relabel("Flavor", "Another Taste")

In [None]:
cones

### Question...

Are `tb.relabel()` and `tb.relabeled()` good method names? 

## Example: Census data ##

The US government conducts a census every 10 years. We can examine the census data to see interesting patterns in the population of people in the United States.


In [None]:
# As of August 2021, this census file is online here: 
data = 'http://www2.census.gov/programs-surveys/popest/technical-documentation/file-layouts/2010-2019/nc-est2019-agesex-res.csv'

# A local copy can be accessed here in case census.gov moves the file:
# data = path_data + 'nc-est2019-agesex-res.csv'

full = Table.read_table(data)
full

In [None]:
# get a reduced set of columns that we want to analyze further
partial = full.select('SEX', 'AGE', 'POPESTIMATE2014', 'POPESTIMATE2019')
partial

In [None]:
# rename the columns to make them easier to work with
simple = partial.relabeled('POPESTIMATE2014', '2014').relabeled('POPESTIMATE2019', '2019')
simple

In [None]:
# let's examine the data a little more
simple.sort('AGE', descending=True)

In [None]:
# let's remove the totals (value of 999 in the AGE column)
no_999 = simple.where('AGE', are.below(999))
no_999.sort("AGE", descending=True)

In [None]:
# let's split the data into male, female and everyone
everyone = no_999.where('SEX', 0).drop('SEX')
males = no_999.where('SEX', 1).drop('SEX')
females = no_999.where('SEX', 2).drop('SEX')

In [None]:
females

In [None]:
# let's see which ages have the most people
females.sort('2019', descending=True)

In [None]:
males.sort('2019', descending=True)

In [None]:
# let's create a Table with the ages and the 2019 counts of males and females
pop_2019 = Table().with_columns(
    'Age', males.column('AGE'),
    'Males', males.column('2019'),
    'Females', females.column('2019')
)

In [None]:
pop_2019

In [None]:
# let's add a percent female column to our Table
percent_females = 100 * pop_2019.column('Females')/(pop_2019.column('Males') + pop_2019.column('Females'))
counts_and_percents = pop_2019.with_column('Percent Female', percent_females)

In [None]:
counts_and_percents
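
To see what the `Percent Female` calculation does row by row, here is a minimal sketch in plain Python. The counts are made-up toy numbers, not census values, and the variable names are hypothetical.

```python
# toy counts for three age groups (made-up numbers, not census data)
toy_males = [100, 80, 30]
toy_females = [100, 120, 70]

# percent female = 100 * females / (males + females), computed per row
toy_percent_female = [
    100 * f / (m + f) for m, f in zip(toy_males, toy_females)
]

print(toy_percent_female)
```

In the notebook, the same arithmetic is applied to whole NumPy arrays at once, so no explicit loop is needed.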

## Line Graphs ##

A line plot is a useful way to visualize how one numerical variable changes as a function of another, such as time or, as in this example, age. We can do this using the `tb.plot('x_col_name', 'y_col_name')` method.


In [None]:
# plot percent female as a function of age
counts_and_percents.plot('Age', 'Percent Female')

In [None]:
pop_2019

In [None]:
# if we only pass one column label to the tb.plot() method, every other column is plotted against it
pop_2019.plot('Age')

In [None]:
# let's calculate which ages saw the biggest change in numbers from 2014 to 2019
everyone = everyone.with_column(
    'Change', everyone.column('2019') - everyone.column('2014')
)

In [None]:
everyone.sort('Change', descending=True)

In [None]:
# Let's examine the percent change
pop_growth = everyone.with_column(
    'Percent change', everyone.column('Change')/everyone.column('2014'))

pop_growth = pop_growth.set_format('Percent change', PercentFormatter)

pop_growth.sort('Percent change', descending=True)
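
As a worked example of the percent change calculation, the sketch below uses made-up population counts (not census values) and hypothetical variable names. It also shows that the computed value is a proportion; Python's `%` format specifier displays it as a percentage, which is roughly what a percent formatter does for display.

```python
# made-up counts for one age group in two years (not census data)
toy_2014 = 2000
toy_2019 = 2500

toy_change = toy_2019 - toy_2014        # absolute change
toy_proportion = toy_change / toy_2014  # relative change, as a proportion

# the "%" format spec multiplies by 100 and appends a percent sign
toy_label = f"{toy_proportion:.2%}"

print(toy_change, toy_proportion, toy_label)
```

The underlying number stays a proportion; only its display changes.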


In [None]:
# plot percent change - any ideas why there are larger increases around age 72 and in the late 90s?
pop_growth.plot("AGE", "Percent change")

# to plot percentages rather than proportions, use this instead:
# pop_growth.with_column("Percent change", pop_growth.column("Percent change") * 100).plot("AGE", "Percent change")

In [None]:
age_to_examine = 72
print(2014 - age_to_examine)  # people who were 72 in 2014 were born in which year? 
print(2019 - age_to_examine)  # people who were 72 in 2019 were born in which year?  
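
The subtraction above gives an approximate birth year. Someone who is a given age at some point in a year was born in one of two calendar years, depending on whether their birthday has already passed. The helper below is a hypothetical illustration of that range and is not part of the datascience package.

```python
def possible_birth_years(year, age):
    """Return the two calendar years in which a person who is `age`
    at some point during `year` could have been born."""
    # birthday still to come this year  -> born in year - age - 1
    # birthday already passed this year -> born in year - age
    return (year - age - 1, year - age)

for year in (2014, 2019):
    print(year, possible_birth_years(year, 72))
```

Comparing the two ranges shows which birth cohorts are being compared at age 72 in the two years.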

## COVID-19 trends

If there is time, let's explore COVID-19 trends in the United States...