In [1]:
# HIDDEN
from datascience import *
import numpy as np
np.set_printoptions(threshold=50)

### Example: Trends in the Population of the United States ###

We are now ready to work with large tables of data. The file below contains "Annual Estimates of the Resident Population by Single Year of Age and Sex for the United States." Notice that `read_table` can read data directly from a URL.

In [2]:
census_url = 'http://www.census.gov/popest/data/national/asrh/2014/files/NC-EST2014-AGESEX-RES.csv'
full_census_table = Table.read_table(census_url)
full_census_table

SEX,AGE,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014
0,0,3944153,3944160,3951330,3963071,3926665,3945610,3948350
0,1,3978070,3978090,3957888,3966510,3978006,3943077,3962123
0,2,4096929,4096939,4090862,3971573,3979952,3992690,3957772
0,3,4119040,4119051,4111920,4102501,3983049,3992425,4005190
0,4,4063170,4063186,4077552,4122303,4112638,3994047,4003448
0,5,4056858,4056872,4064653,4087713,4132210,4123408,4004858
0,6,4066381,4066412,4073013,4074979,4097780,4143094,4134352
0,7,4030579,4030594,4043047,4083240,4084964,4108615,4154000
0,8,4046486,4046497,4025604,4053206,4093213,4095827,4119524
0,9,4148353,4148369,4125415,4035769,4063193,4104133,4106832


Only the first 10 rows of the table are displayed. Later we will see how to display the entire table; however, this is typically not useful with large tables.

A [description of the table](http://www.census.gov/popest/data/national/asrh/2014/files/NC-EST2014-AGESEX-RES.pdf) appears online. The `SEX` column contains numeric codes: `0` stands for the total, `1` for male, and `2` for female. The `AGE` column contains ages in completed years, but the special value `999` marks a row that holds the total population across all ages. The rest of the columns contain estimates of the US population.
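
The coding scheme described above can be made concrete with a small sketch. It uses plain NumPy arrays rather than a `Table`, and holds only a handful of `SEX`, `AGE`, and `POPESTIMATE2010` values copied from the tables in this section. The names `SEX_LABELS` and `is_total_row` are illustrative assumptions and are not part of the `datascience` library.

```python
import numpy as np

# A few (SEX, AGE, POPESTIMATE2010) values copied from the tables in this section.
sex = np.array([0, 0, 0, 1, 2])
age = np.array([0, 1, 999, 999, 999])
pop_2010 = np.array([3951330, 3957888, 309347057, 152089484, 157257573])

# Hypothetical lookup for the numeric SEX codes (not provided by the library).
SEX_LABELS = {0: 'total', 1: 'male', 2: 'female'}

# Rows with AGE == 999 hold totals over all ages, not a single year of age.
is_total_row = age == 999
for s, p in zip(sex[is_total_row], pop_2010[is_total_row]):
    print(SEX_LABELS[int(s)], int(p))

# If SEX == 0 really is the total, the male and female totals should add up to it.
male_plus_female = int(pop_2010[is_total_row & (sex != 0)].sum())
overall = int(pop_2010[is_total_row & (sex == 0)][0])
print(male_plus_female == overall)
```

A check of this kind is a quick way to confirm that we have read the codes correctly before relying on them.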

Typically, a public table will contain more information than necessary for a particular investigation or analysis. In this case, let us suppose that we are only interested in the population changes from 2010 to 2014. Let us select the relevant columns.

In [3]:
partial_census_table = full_census_table.select(['SEX', 'AGE', 'POPESTIMATE2010', 'POPESTIMATE2014'])
partial_census_table

SEX,AGE,POPESTIMATE2010,POPESTIMATE2014
0,0,3951330,3948350
0,1,3957888,3962123
0,2,4090862,3957772
0,3,4111920,4005190
0,4,4077552,4003448
0,5,4064653,4004858
0,6,4073013,4134352
0,7,4043047,4154000
0,8,4025604,4119524
0,9,4125415,4106832


We can also simplify the labels of the selected columns.

In [4]:
simple = partial_census_table.relabeled('POPESTIMATE2010', '2010').relabeled('POPESTIMATE2014', '2014')
simple

SEX,AGE,2010,2014
0,0,3951330,3948350
0,1,3957888,3962123
0,2,4090862,3957772
0,3,4111920,4005190
0,4,4077552,4003448
0,5,4064653,4004858
0,6,4073013,4134352
0,7,4043047,4154000
0,8,4025604,4119524
0,9,4125415,4106832


We now have a table that is easy to work with. Each column of the table is an array of the same length, and so columns can be combined using arithmetic. Here is the change in population between 2010 and 2014.

In [5]:
simple.column('2014') - simple.column('2010')

array([  -2980,    4235, -133090, ...,    6717,   13410, 4662996])

Let us augment `simple` with two columns that contain these changes, one in absolute terms and one as a percent relative to the value in 2010.

In [6]:
change = simple.column('2014') - simple.column('2010')
census = simple.with_columns(
    'Change', change,
    'Percent Change', change/simple.column('2010')
)
census.set_format('Percent Change', PercentFormatter)

SEX,AGE,2010,2014,Change,Percent Change
0,0,3951330,3948350,-2980,-0.08%
0,1,3957888,3962123,4235,0.11%
0,2,4090862,3957772,-133090,-3.25%
0,3,4111920,4005190,-106730,-2.60%
0,4,4077552,4003448,-74104,-1.82%
0,5,4064653,4004858,-59795,-1.47%
0,6,4073013,4134352,61339,1.51%
0,7,4043047,4154000,110953,2.74%
0,8,4025604,4119524,93920,2.33%
0,9,4125415,4106832,-18583,-0.45%
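
The arithmetic behind the two new columns can be checked by hand on the first three rows. The sketch below uses plain NumPy arrays with values copied from the table. The `Percent Change` column is computed as a proportion (`change/simple.column('2010')`), so here we scale by 100 and round to two decimal places ourselves to match what the table displays. The variable names are illustrative.

```python
import numpy as np

# 2010 and 2014 estimates for ages 0, 1, and 2 (SEX code 0), copied from the table.
pop_2010 = np.array([3951330, 3957888, 4090862])
pop_2014 = np.array([3948350, 3962123, 3957772])

# Absolute change: elementwise subtraction of two arrays of the same length.
change = pop_2014 - pop_2010

# Relative change as a proportion of the 2010 value, then as a rounded percent.
proportion = change / pop_2010
percent = np.round(100 * proportion, 2)

print(change)
print(percent)
```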


**Sorting the data.** Arranging data in increasing or decreasing order is often an important first step in data analysis. The method `sort` allows us to do this. By default, `sort` arranges values from smallest to largest. We will use the option `descending=True` to sort from largest to smallest.
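
To see what ascending and descending order mean for a table, here is a minimal NumPy sketch, not the implementation of `sort`. It sorts a few `Change` values copied from the table above and carries the matching ages along by reusing the same ordering. The use of `np.argsort` is our choice for illustration.

```python
import numpy as np

# Ages and their changes in population, copied from the table above (SEX code 0).
ages = np.array([0, 1, 2, 6, 7])
change = np.array([-2980, 4235, -133090, 61339, 110953])

# argsort returns the positions that would arrange `change` from smallest to largest.
ascending_order = np.argsort(change)

# Reversing those positions gives largest to smallest.
descending_order = ascending_order[::-1]

# Applying the same positions to `ages` keeps each age with its change.
print(ages[ascending_order], change[ascending_order])
print(ages[descending_order], change[descending_order])
```

The key point is that sorting a table by one column rearranges whole rows, so the other columns stay aligned with the sorted one.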

Let us sort the table in decreasing order of the absolute change in population.

In [18]:
census.sort('Change', descending=True)

SEX,AGE,2010,2014,Change,Percent Change
0,999,309347057,318857056,9509999,3.07%
1,999,152089484,156936487,4847003,3.19%
2,999,157257573,161920569,4662996,2.97%
0,67,2693709,3485502,791793,29.39%
0,64,2706063,3488136,782073,28.90%
0,66,2621346,3347776,726430,27.71%
0,65,2678532,3384449,705917,26.35%
0,71,1953614,2519748,566134,28.98%
0,34,3822188,4362895,540707,14.15%
0,23,4217221,4698584,481363,11.41%


Not surprisingly, the top row of the sorted table is the line that corresponds to the entire population: both sexes and all age groups. From 2010 to 2014, the population of the United States increased by about 9.5 million people, a change of just over 3%.

The next two rows correspond to all the men and all the women respectively. The male population grew more than the female population, both in absolute and percentage terms. Both percent changes were around 3%.
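
The figures quoted for the three total rows can be reproduced with a few lines of plain Python. The estimates below are copied from the top three rows of the sorted table; the dictionary names are illustrative.

```python
# (2010 estimate, 2014 estimate) for the three AGE == 999 rows of the sorted table.
totals = {
    'total': (309347057, 318857056),
    'male': (152089484, 156936487),
    'female': (157257573, 161920569),
}

summary = {}
for label, (pop_2010, pop_2014) in totals.items():
    change = pop_2014 - pop_2010
    percent = 100 * change / pop_2010
    summary[label] = (change, round(percent, 2))
    print(f'{label:>6}: {change:>10,} ({percent:.2f}%)')
```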

Now take a look at the next few rows. The percent change jumps from about 3% for the overall population to almost 30% for the people in their late sixties and early seventies. This stunning change contributes to what is known as the greying of America.

By far the greatest absolute change was among those in the 64-67 age group in 2014. What could explain this large increase? We can explore this question by examining the years in which the relevant groups were born.

- Those who were in the 64-67 age group in 2010 were born in the years 1943 to 1946. The attack on Pearl Harbor was in late 1941, and by 1942 U.S. forces were heavily engaged in a massive war that ended in 1945. 

- Those who were 64 to 67 years old in 2014 were born in the years 1947 to 1950, at the height of the post-WWII baby boom in the United States. 

The post-war jump in births is the major reason for the large changes that we have observed.
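
The birth years quoted above come from subtracting age from the year of the estimate. The sketch below makes that step explicit. It is an approximation: it ignores whether a person's birthday had already occurred by the date of the estimate, so each birth year could be off by one. The function name `birth_year_range` is hypothetical.

```python
import numpy as np

def birth_year_range(year, ages):
    """Approximate earliest and latest birth year for people of the given ages in `year`."""
    birth_years = year - ages
    return int(birth_years.min()), int(birth_years.max())

# The 64-67 age group discussed above.
ages = np.arange(64, 68)

for year in (2010, 2014):
    earliest, latest = birth_year_range(year, ages)
    print(year, ':', earliest, 'to', latest)
```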