In [1]:
# HIDDEN
import numpy as np
np.set_printoptions(threshold=50)

Tables are a fundamental object type for representing data sets. A table can be viewed in two ways. Tables are a sequence of named columns that each describe a single aspect of all entries in a data set. Tables are also a sequence of rows that each contain all information about a single entry in a data set. 

In order to use tables, import everything from the `datascience` module, which was created for this text.

In [2]:
from datascience import *

Empty tables can be created using the `Table` function, which optionally takes a list of column labels. The `with_row` and `with_rows` methods of a table return new tables with one or more additional rows. For example, we could create a table of `Odd` and `Even` numbers.

In [3]:
Table(['Odd', 'Even']).with_row([3, 4])

Odd,Even
3,4


In [4]:
Table(['Odd', 'Even']).with_rows([[3, 4], [5, 6], [7, 8]])

Odd,Even
3,4
5,6
7,8


Tables can also be extended with additional columns. The `with_column` and `with_columns` methods return new tables with additional labeled columns. Below, we begin each example with an empty table that has no columns.

In [5]:
Table().with_column('Odd', [3, 5, 7])

Odd
3
5
7


In [6]:
Table().with_columns([
        'Odd',  [3, 5, 7],
        'Even', [4, 6, 8]
    ])

Odd,Even
3,4
5,6
7,8


Tables are more typically created from files that contain comma-separated values, called CSV files. The file below contains "Annual Estimates of the Resident Population by Single Year of Age and Sex for the United States." 

In [7]:
census_url = 'http://www.census.gov/popest/data/national/asrh/2014/files/NC-EST2014-AGESEX-RES.csv'
full_census_table = Table.read_table(census_url)
full_census_table

SEX,AGE,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014
0,0,3944153,3944160,3951330,3963071,3926665,3945610,3948350
0,1,3978070,3978090,3957888,3966510,3978006,3943077,3962123
0,2,4096929,4096939,4090862,3971573,3979952,3992690,3957772
0,3,4119040,4119051,4111920,4102501,3983049,3992425,4005190
0,4,4063170,4063186,4077552,4122303,4112638,3994047,4003448
0,5,4056858,4056872,4064653,4087713,4132210,4123408,4004858
0,6,4066381,4066412,4073013,4074979,4097780,4143094,4134352
0,7,4030579,4030594,4043047,4083240,4084964,4108615,4154000
0,8,4046486,4046497,4025604,4053206,4093213,4095827,4119524
0,9,4148353,4148369,4125415,4035769,4063193,4104133,4106832


A [description of the table](http://www.census.gov/popest/data/national/asrh/2014/files/NC-EST2014-AGESEX-RES.pdf) appears online. The `SEX` column contains numeric codes: `0` stands for the total, `1` for male, and `2` for female. The `AGE` column contains ages, but the special value `999` marks rows that give the total population summed over all ages. The rest of the columns contain estimates of the US population.

Typically, a public table will contain more information than necessary for a particular investigation or analysis. In this case, let us suppose that we are only interested in the population changes from 2010 to 2014. We can select only a subset of the columns using the `select` method.

In [8]:
partial_census_table = full_census_table.select(['SEX', 'AGE', 'POPESTIMATE2010', 'POPESTIMATE2014'])
partial_census_table

SEX,AGE,POPESTIMATE2010,POPESTIMATE2014
0,0,3951330,3948350
0,1,3957888,3962123
0,2,4090862,3957772
0,3,4111920,4005190
0,4,4077552,4003448
0,5,4064653,4004858
0,6,4073013,4134352
0,7,4043047,4154000
0,8,4025604,4119524
0,9,4125415,4106832


The `relabeled` method creates an alternative version of the table that replaces a column label. Using this method, we can simplify the labels of the selected columns.

In [9]:
simple = partial_census_table.relabeled('POPESTIMATE2010', '2010').relabeled('POPESTIMATE2014', '2014')
simple

SEX,AGE,2010,2014
0,0,3951330,3948350
0,1,3957888,3962123
0,2,4090862,3957772
0,3,4111920,4005190
0,4,4077552,4003448
0,5,4064653,4004858
0,6,4073013,4134352
0,7,4043047,4154000
0,8,4025604,4119524
0,9,4125415,4106832


The `column` method returns an array of the values in a particular column. Each column of a table is an array of the same length, and so columns can be combined using arithmetic.

In [10]:
simple.column('2014') - simple.column('2010')

array([  -2980,    4235, -133090, ...,    6717,   13410, 4662996])

A table with an additional column can be created from an existing table using the `with_column` method, which takes a label and a sequence of values as arguments. The `with_columns` method, used below, adds several columns at once from a list of alternating labels and values. For each new column, there must be either as many values as there are existing rows in the table or a single value (which fills the column with that value).

In [11]:
change = simple.column('2014') - simple.column('2010')
annual_growth_rate = (simple.column('2014') / simple.column('2010')) ** (1/4) - 1
census = simple.with_columns(['Change', change, 'Growth', annual_growth_rate])
census

SEX,AGE,2010,2014,Change,Growth
0,0,3951330,3948350,-2980,-0.000188597
0,1,3957888,3962123,4235,0.000267397
0,2,4090862,3957772,-133090,-0.00823453
0,3,4111920,4005190,-106730,-0.0065532
0,4,4077552,4003448,-74104,-0.00457471
0,5,4064653,4004858,-59795,-0.00369821
0,6,4073013,4134352,61339,0.00374389
0,7,4043047,4154000,110953,0.00679123
0,8,4025604,4119524,93920,0.00578232
0,9,4125415,4106832,-18583,-0.00112804
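The `Growth` column is a compound (geometric) annual growth rate: the constant yearly rate that would carry the 2010 population to the 2014 population over four years. As a minimal sketch in plain Python, using the first row of the table and illustrative variable names that are not part of the `datascience` module:

```python
# Populations for SEX 0, AGE 0, copied from the first row of the table.
pop_2010 = 3951330
pop_2014 = 3948350
years = 4  # the estimates are four years apart (2010 to 2014)

# Compound annual growth rate: the fourth root of the overall ratio, minus 1.
annual_growth_rate = (pop_2014 / pop_2010) ** (1 / years) - 1

# Compounding that rate for four years recovers the 2014 estimate.
reconstructed_2014 = pop_2010 * (1 + annual_growth_rate) ** years

print(annual_growth_rate)
print(round(reconstructed_2014))
```

The array expression in the cell above applies this same arithmetic to every row at once.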


Although the columns of this table are simply arrays of numbers, the format of those numbers can be changed to improve the interpretability of the table. The `set_format` method takes `Formatter` objects, which exist for dates (`DateFormatter`), currencies (`CurrencyFormatter`), numbers, and percentages.

In [12]:
census.set_format('Growth', PercentFormatter)
census.set_format(['2010', '2014', 'Change'], NumberFormatter)

SEX,AGE,2010,2014,Change,Growth
0,0,3951330,3948350,-2980,-0.02%
0,1,3957888,3962123,4235,0.03%
0,2,4090862,3957772,-133090,-0.82%
0,3,4111920,4005190,-106730,-0.66%
0,4,4077552,4003448,-74104,-0.46%
0,5,4064653,4004858,-59795,-0.37%
0,6,4073013,4134352,61339,0.37%
0,7,4043047,4154000,110953,0.68%
0,8,4025604,4119524,93920,0.58%
0,9,4125415,4106832,-18583,-0.11%
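Formatting changes only how values are displayed, not the numbers stored in the column. The sketch below imitates the percentage display with Python's built-in `format` function; it is an illustration of the idea and not how `PercentFormatter` is implemented.

```python
# Growth values copied from the first three rows of the unformatted table.
growth = [-0.000188597, 0.000267397, -0.00823453]

# The ".2%" format multiplies by 100, rounds to two decimals, and appends "%".
as_percent = [format(g, ".2%") for g in growth]

print(as_percent)  # ['-0.02%', '0.03%', '-0.82%']
print(growth[0])   # the underlying number is unchanged
```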


The information in a table can be accessed in many ways. The attributes demonstrated below can be used for any table.

In [13]:
census.labels

('SEX', 'AGE', '2010', '2014', 'Change', 'Growth')

In [14]:
census.num_columns

6

In [15]:
census.num_rows

306

In [16]:
census.row(5)

Row(SEX=0, AGE=5, 2010=4064653, 2014=4004858, Change=-59795, Growth=-0.0036982077956316806)

In [17]:
census.column(2)

array([  3951330,   3957888,   4090862, ...,     26074,     45058,
       157257573])

An individual item in a table can be accessed via a row or a column.

In [18]:
census.row(0).item(2)

3951330

In [19]:
census.column(2).item(0)

3951330
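Both expressions reach the same cell because a table can be read row by row or column by column. A rough plain-Python picture of these two views, using the first three rows of the census data and hypothetical variable names:

```python
# Row view: each tuple holds SEX, AGE, 2010, 2014 for one entry.
rows = [
    (0, 0, 3951330, 3948350),
    (0, 1, 3957888, 3962123),
    (0, 2, 4090862, 3957772),
]

# Column view: transposing the rows gives one list per column.
columns = [list(col) for col in zip(*rows)]

via_row = rows[0][2]        # row 0, then item 2
via_column = columns[2][0]  # column 2, then item 0

print(via_row, via_column)  # 3951330 3951330
```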

Let's take a look at the growth rates of the total number of males and females by selecting only the *rows* that sum over all ages. According to the data set description, this sum is marked by the special value `999` in the `AGE` column.

In [20]:
census.where('AGE', 999)

SEX,AGE,2010,2014,Change,Growth
0,999,309347057,318857056,9509999,0.76%
1,999,152089484,156936487,4847003,0.79%
2,999,157257573,161920569,4662996,0.73%


What ages of males are driving this rapid growth? We can first filter the `census` table to keep only the male entries, then sort by growth rate in decreasing order.

In [21]:
males = census.where('SEX', 1)
males.sort('Growth', descending=True)

SEX,AGE,2010,2014,Change,Growth
1,99,6104,9037,2933,10.31%
1,100,9351,13729,4378,10.08%
1,98,9504,13649,4145,9.47%
1,93,60182,85980,25798,9.33%
1,96,22022,31235,9213,9.13%
1,94,43828,62130,18302,9.12%
1,97,14775,20479,5704,8.50%
1,95,31736,42824,11088,7.78%
1,91,104291,138080,33789,7.27%
1,92,83462,109873,26411,7.12%


The fact that there are more men with `AGE` of 100 than 99 looks suspicious; shouldn't there be fewer? A careful look at the description of the data set reveals that the 100 category actually includes all men who are 100 or older. The growth rates in men at these very old ages could have several explanations, such as a large influx from another country, but the most natural explanation is that people were simply living longer in 2014 than in 2010.

The `where` method can also take an array of boolean values, constructed by applying some comparison operator to a column of the table. For example, we can find all of the age groups among males for which the absolute `Change` is substantial. The `show` method displays all rows without abbreviating.

In [22]:
males.where(males.column('Change') > 200000).show()

SEX,AGE,2010,2014,Change,Growth
1,23,2151095,2399883,248788,2.77%
1,24,2161380,2391398,230018,2.56%
1,34,1908761,2192455,283694,3.52%
1,57,1910028,2110149,200121,2.52%
1,59,1779504,2006900,227396,3.05%
1,64,1291843,1661474,369631,6.49%
1,65,1272693,1607688,334995,6.02%
1,66,1239805,1589127,349322,6.40%
1,67,1270148,1653257,383109,6.81%
1,71,903263,1169356,266093,6.67%


By far the largest changes are clumped together in the 64-67 range. People of these ages in 2014 would have been born between 1947 and 1950, the height of the post-WWII baby boom in the United States.

It is possible to specify multiple conditions using the functions `np.logical_and` and `np.logical_or`. When two conditions are combined with `logical_and`, both must be true for a row to be retained. When conditions are combined with `logical_or`, at least one of them must be true for a row to be retained. Here are two different ways to select 18 and 19 year olds.

In [23]:
both = census.where(census.column('SEX') != 0)
both.where(np.logical_or(both.column('AGE') == 18, 
                         both.column('AGE') == 19))

SEX,AGE,2010,2014,Change,Growth
1,18,2305733,2165062,-140671,-1.56%
1,19,2334906,2220790,-114116,-1.24%
2,18,2185272,2060528,-124744,-1.46%
2,19,2236479,2105604,-130875,-1.50%


In [24]:
both.where(np.logical_and(both.column('AGE') >= 18, 
                          both.column('AGE') <= 19))

SEX,AGE,2010,2014,Change,Growth
1,18,2305733,2165062,-140671,-1.56%
1,19,2334906,2220790,-114116,-1.24%
2,18,2185272,2060528,-124744,-1.46%
2,19,2236479,2105604,-130875,-1.50%


Here, we observe a rather dramatic decrease in the number of 18 and 19 year olds in the United States; the children of the baby boom generation are now in their 20s, leaving fewer teenagers in America.

**Aggregation and Grouping.** This particular dataset includes entries for sums across all ages and sexes, using the special values `999` and `0` respectively. However, if these rows did not exist, we would be able to aggregate the remaining entries in order to compute those same results.

In [25]:
both.where('AGE', 999).select(['SEX', '2014'])

SEX,2014
1,156936487
2,161920569


In [26]:
no_sums = both.select(['SEX', 'AGE', '2014']).where(both.column('AGE') != 999)
females = no_sums.where('SEX', 2).select(['AGE', '2014'])
females

AGE,2014
0,1930493
1,1938870
2,1935270
3,1956572
4,1959950
5,1961391
6,2024024
7,2031760
8,2014402
9,2009560


In [27]:
sum(females['2014'])

161920569

Some columns express categories, such as the sex or age group in the case of the census table. Aggregation can also be performed on every value for a category using the `group` and `groups` methods. The `group` method takes a single column (or column name) as an argument and collects all values associated with each unique value in that column.

The second argument to `group`, the name of a function, specifies how to aggregate the resulting values. For example, the `sum` function can be used to add together the populations for each age. The example below first selects only the `AGE` and `2014` columns, because a `SEX sum` column would be meaningless: the `SEX` values are simply codes for different categories. The `2014 sum` column, however, does contain the total number across all sexes for each age category.

In [28]:
no_sums.select(['AGE', '2014']).group('AGE', sum)

AGE,2014 sum
0,3948350
1,3962123
2,3957772
3,4005190
4,4003448
5,4004858
6,4134352
7,4154000
8,4119524
9,4106832
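Conceptually, grouping collects the values that share a category and then aggregates each collection. Here is a minimal plain-Python sketch of that two-step process, using the 2014 figures for 18 and 19 year olds shown earlier. It illustrates the idea only and is not the `datascience` implementation.

```python
from collections import defaultdict

# (SEX, AGE, 2014 population) rows copied from the outputs above.
rows = [
    (1, 18, 2165062),
    (1, 19, 2220790),
    (2, 18, 2060528),
    (2, 19, 2105604),
]

# Step 1: collect the populations that share each AGE value.
groups = defaultdict(list)
for sex, age, population in rows:
    groups[age].append(population)

# Step 2: aggregate each collection with sum.
totals = {age: sum(populations) for age, populations in sorted(groups.items())}

print(totals)  # {18: 4225590, 19: 4326394}
```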


**Joining Tables.** There are many situations in data analysis in which two different rows need to be considered together. Two tables can be joined into one, an operation that creates one long row out of two matching rows. These matching rows can be from the same table or different tables.

For example, we might want to estimate which age categories are expected to change significantly in the future. Someone who is 14 years old in 2014 will be 20 years old in 2020. Therefore, one estimate of the number of 20 year olds in 2020 is the number of 14 year olds in 2014. Between the ages of 1 and 50, annual mortality rates are very low (less than 0.5% for men and less than 0.3% for women [[1](http://www.ssa.gov/oact/STATS/table4c6.html)]). Immigration also affects population changes, but for now we will ignore its influence. Let's consider just females in this analysis.

In [29]:
females['AGE+6'] = females['AGE'] + 6
females

AGE,2014,AGE+6
0,1930493,6
1,1938870,7
2,1935270,8
3,1956572,9
4,1959950,10
5,1961391,11
6,2024024,12
7,2031760,13
8,2014402,14
9,2009560,15


In order to relate the age in 2014 to the age in 2020, we will join this table with itself, matching values in the `AGE` column with values in the `AGE+6` column.

In [30]:
joined = females.join('AGE', females, 'AGE+6')
joined

AGE,2014,AGE+6,AGE_2,2014_2
6,2024024,12,0,1930493
7,2031760,13,1,1938870
8,2014402,14,2,1935270
9,2009560,15,3,1956572
10,2015380,16,4,1959950
11,2001949,17,5,1961391
12,1993547,18,6,2024024
13,2041159,19,7,2031760
14,2068252,20,8,2014402
15,2035299,21,9,2009560
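The matching performed by this self-join can be pictured in plain Python: index the rows by their `AGE+6` value, then look up each row's `AGE` in that index. This is a sketch of the idea on five female rows copied from the tables above, with hypothetical variable names, and not the library's implementation.

```python
# (AGE, 2014 population) for a few female rows.
females = [(0, 1930493), (1, 1938870), (6, 2024024), (7, 2031760), (12, 1993547)]

# Index every row by the age it will have six years later (AGE+6).
by_age_plus_6 = {age + 6: (age, population) for age, population in females}

# Pair each row with the row whose AGE+6 equals this row's AGE, if any.
joined = []
for age, population in females:
    if age in by_age_plus_6:
        age_2, population_2 = by_age_plus_6[age]
        joined.append((age, population, age + 6, age_2, population_2))

for row in joined:
    print(row)
```

Rows with no match, such as ages 0 and 1 here, drop out of the result, which is why the joined table starts at age 6.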


The resulting table has five columns. The first three are the same as before. The two new columns are the values for `AGE` and `2014` that appeared in a different row, the one in which that `AGE` appeared in the `AGE+6` column. For instance, the first row contains the number of 6 year olds in 2014 and an estimate of the number of 6 year olds in 2020 (who were 0 in 2014). Some relabeling and selecting makes this table clearer.

In [31]:
future = joined.select(['AGE', '2014', '2014_2']).relabeled('2014_2', '2020 (est)')
future

AGE,2014,2020 (est)
6,2024024,1930493
7,2031760,1938870
8,2014402,1935270
9,2009560,1956572
10,2015380,1959950
11,2001949,1961391
12,1993547,2024024
13,2041159,2031760
14,2068252,2014402
15,2035299,2009560


Adding a `Change` column and sorting by that change describes some of the major changes in age categories that we can expect in the United States for women under 50. According to this simplistic analysis, there will be substantially more women in their late 30s by 2020.

In [32]:
predictions = future.where(future['AGE'] < 50)
predictions['Change'] = predictions['2020 (est)'] - predictions['2014']
predictions.sort('Change', descending=True)

AGE,2014,2020 (est),Change
40,1940627,2170440,229813
38,1936610,2154232,217622
30,2110672,2301237,190565
37,1995155,2148981,153826
39,1993913,2135416,141503
29,2169563,2298701,129138
35,2046713,2169563,122850
36,2009423,2110672,101249
28,2144666,2244480,99814
41,1977497,2046713,69216
