# Wrangling data in Tables

## David Culler

This notebook illustrates several of the datascience Table methods for wrangling typical data.
Here we use some simple census data, and even that is fairly obscure in its raw form. We walk
through the process of going from raw data to a distilled form and then answer a simple question:
"How does the relative difference between males and females vary with age?" The answer: there are
a few more boys than girls, but far fewer old men than old women.

In [None]:
# HIDDEN
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

Construct a raw table from a CSV file on the web.

In [None]:
#census_url = 'http://www.census.gov/popest/data/national/asrh/2014/files/NC-EST2014-AGESEX-RES.csv'
census_url = 'http://www2.census.gov/programs-surveys/popest/datasets/2010-2014/national/asrh/nc-est2014-agesex-res.csv'
raw_census = Table.read_table(census_url)
raw_census

In [None]:
Table.read_table('https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/birthsmokers.txt')

## Stage 1: cleaning up columns and encodings

As is often the case, the data you find is pretty ugly: more columns than we need, cryptic column names, and categories stored as numeric codes.

In [None]:
# A simple tool to decode the SEX column: 0 = all, 1 = male, 2 = female
def categorize_sex(s):
    return ['all', 'male', 'female'][s]
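
The decoder can be tried on its own before applying it to a whole column. The sketch below assumes the SEX codes are 0, 1 and 2, as the list order in `categorize_sex` implies. The name `categorize_sex_safe` is hypothetical and not part of the notebook: it is a dict-based variant that raises a clear error on an unexpected code, whereas list indexing would quietly accept a negative code such as -1.

```python
# Hypothetical stricter variant of categorize_sex (not in the original notebook).
SEX_LABELS = {0: 'all', 1: 'male', 2: 'female'}

def categorize_sex_safe(s):
    """Decode a SEX code, raising ValueError on a code we do not know."""
    try:
        return SEX_LABELS[int(s)]
    except KeyError:
        raise ValueError(f"unexpected SEX code: {s!r}") from None

# Decode the three codes the list-based version expects.
decoded = [categorize_sex_safe(code) for code in [0, 1, 2]]
print(decoded)
```

Either version can be passed to `apply` in the same way; the stricter one is mainly useful when the encoding of a new file has not been checked yet.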

In [None]:
# Keep only what we need
pre_census = raw_census.select(['SEX', 'AGE', 'CENSUS2010POP', 'POPESTIMATE2014'])
# Clean up the column names
pre_census.relabel('CENSUS2010POP','2010pop')
pre_census.relabel('POPESTIMATE2014','2014est')
# Decode the categories
pre_census['CAT'] = pre_census.apply(categorize_sex, 'SEX')
# Create a new clean table: drop the encoded column now that it is decoded, and put the label first
p2_census = pre_census.drop('SEX')
p2_census.move_to_start('CAT')
p2_census

In [None]:
p2_census.show()

## Stage 2: Cleaning up rows

The census table includes *special* rows, marked with an AGE of 999, that hold the total of the other rows in each category.

In [None]:
# How many people?
total = p2_census.where('AGE',999)
total

In [None]:
# Remove the rows that are totals of the other rows
# Now we have a clean Table
census = p2_census.where(p2_census['AGE'] < 999)
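
Before discarding the total rows, it can be worth checking that they really are the sums of the rows they summarize. The sketch below does this in plain Python on a handful of made-up numbers (they are not census figures), assuming, as the cells above suggest, that an AGE of 999 marks a total row. The names `TOTAL_AGE` and `totals_consistent` are illustrative only.

```python
# Toy rows as (category, age, population); the numbers are made up for illustration.
rows = [
    ('male', 0, 10), ('male', 1, 12), ('male', 999, 22),
    ('female', 0, 9), ('female', 1, 13), ('female', 999, 22),
]
TOTAL_AGE = 999  # assumed sentinel age for the per-category total rows

def totals_consistent(rows):
    """Return True if each category's total row equals the sum of its other rows."""
    summed, reported = {}, {}
    for cat, age, pop in rows:
        if age == TOTAL_AGE:
            reported[cat] = pop
        else:
            summed[cat] = summed.get(cat, 0) + pop
    return summed == reported

print(totals_consistent(rows))

# Same filter as the notebook cell: keep only the rows for real ages.
clean_rows = [r for r in rows if r[1] < TOTAL_AGE]
print(len(clean_rows))
```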

## Stage 3: Transform 1 dimension to 2

The natural form of the data is age × gender.  We could split the table and reassemble it,
but a better approach is to pivot.

In [None]:
# Split it by gender into two tables
male = census.where('CAT','male')
female = census.where('CAT','female')

In [None]:
male

In [None]:
pop2010 = census.pivot('CAT','AGE','2010pop',sum)
pop2010
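
To make the reshaping concrete, here is a minimal plain-Python sketch of what a pivot does: each distinct category becomes a column, each distinct age becomes a row, and the values that land in the same cell are summed. This is an illustration on made-up numbers, not how the datascience library implements `pivot`; the function name `pivot_sketch` is hypothetical.

```python
from collections import defaultdict

# Toy long-form rows as (category, age, population); made-up numbers.
rows = [
    ('all', 0, 19), ('male', 0, 10), ('female', 0, 9),
    ('all', 1, 25), ('male', 1, 12), ('female', 1, 13),
]

def pivot_sketch(rows):
    """Turn (category, age, value) rows into {age: {category: summed value}}."""
    cells = defaultdict(lambda: defaultdict(int))
    for cat, age, pop in rows:
        cells[age][cat] += pop  # sum plays the role of the collect function
    return {age: dict(by_cat) for age, by_cat in sorted(cells.items())}

pivoted = pivot_sketch(rows)
for age, by_cat in pivoted.items():
    print(age, by_cat)
```

Because each (age, category) pair occurs once in the cleaned census table, summing leaves the values unchanged; the collect function only matters when several rows fall into the same cell.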

## Visualization

Now we can easily visualize what's in this data.

In [None]:
# The number of individuals by age
pop2010.plot('AGE',overlay=True)

In [None]:
pop2010['diff'] = pop2010['male'] - pop2010['female']
pop2010.show()

In [None]:
# The difference between the male and female counts by age
pop2010.select(['AGE', 'diff']).bar('AGE')

In [None]:
pop2010['Rel Diff'] = pop2010['diff'] / pop2010['all']
pop2010.set_format('Rel Diff', PercentFormatter)
pop2010.show()

In [None]:
pop2010.select(['AGE', 'Rel Diff']).bar('AGE')

In [None]:
pop2010['Ratio F/M'] = pop2010['female'] / pop2010['male']
pop2010.select(['AGE', 'Ratio F/M']).bar('AGE')
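
The three derived columns are simple arithmetic on each row. The sketch below works them through for one made-up row (not a census figure), assuming that the `all` count equals male plus female. Under that assumption the female-to-male ratio can also be recovered from the relative difference, which the last lines check. The function name `sex_balance` is hypothetical.

```python
import math

def sex_balance(male, female):
    """Return (diff, relative diff, female/male ratio) for one age group."""
    total = male + female          # assumes 'all' is exactly male + female
    diff = male - female
    rel_diff = diff / total
    ratio_f_m = female / male
    return diff, rel_diff, ratio_f_m

# Toy counts, made up for illustration.
diff, rel_diff, ratio_f_m = sex_balance(male=40, female=60)
print(diff, rel_diff, ratio_f_m)

# With total = male + female, ratio F/M = (1 - rel_diff) / (1 + rel_diff).
ratio_from_rel = (1 - rel_diff) / (1 + rel_diff)
print(math.isclose(ratio_from_rel, ratio_f_m))
```

A negative relative difference and a ratio above 1 say the same thing: females outnumber males in that age group.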