# Scraping Poverty Thresholds from the Census Website

We have a use case in which it would be convenient to capture the poverty thresholds from the Census website.  The question is whether or not we can reliably scrape the site for the relevant info.

In [1]:
#Data manipulation
import numpy as np
import pandas as pd
from pandas import Series,DataFrame

#Parsing
from bs4 import BeautifulSoup as bs

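#Content capture
try:
    #Python 2
    import urllib2
except ImportError:
    #Python 3: urllib.request provides the same urlopen, so reuse the urllib2 name
    import urllib.request as urllib2
import urllib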

#Display
from IPython.display import HTML

The relevant pages are all linked from a central location.

In [2]:
#Define URL of central repo
url_repo='https://www.census.gov/hhes/www/poverty/data/threshld/'
HTML(url_repo)

## Test Case - 2009

We need to work with a test case, so, how about 2009?  Once we figure this out, we can develop a crawler to grab data from the year-specific sites by way of the central repo above.  Let's grab the HTML from the 2009 site and parse it.

In [3]:
#Capture URL in a string
url09='https://www.census.gov/hhes/www/poverty/data/threshld/thresh09.html'

#Open the target
url09target=urllib2.urlopen(url09)

#Read the HTML
url09content=url09target.read()

#Parse the HTML
soup09=bs(url09content)

print(soup09.prettify())

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="US Census Bureau Poverty main page " name="DC.title"/>
  <meta content="The Census Bureau reports poverty data from several major household surveys and programs. " name="DC.description"/>
  <meta content="US Census Bureau, Demographic Internet Staff" name="DC.creator"/>
  <meta content="EN-US" name="DC.language" scheme="DCTERMS.RFC1766"/>
  <meta content="US Census Bureau Poverty " name="description"/>
  <link href="/hhes/www/poverty/index.html " rel="index"/>
  <!--CHANGE PAGE TITLE (Keep Census Bureau)-->
  <title>
   Poverty Thresholds 2009  - U.S Census Bureau
  </title>
  <!--UPDATE LINK TO CBHEADER INCLUDE-->
  <!--START CBHEADER-->
  <!-- Site wide JS -->
  <script src="/main/javascript/ruthsarian_

It appears that the info we need resides entirely in the table.  The cells in each table are marked in the HTML source by `<td>` tags.  Let's confirm they hold what we need.

In [4]:
soup09.find_all('td')

[<td class="rowName2">One person (unrelated individual) ......</td>,
 <td align="right">10,956</td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td class="rowName2">    Under 65 years ......</td>,
 <td align="right">11,161</td>,
 <td>11,161</td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td class="rowName2">    65 years and over ......</td>,
 <td align="right">10,289</td>,
 <td>10,289</td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td colspan="11"> </td>,
 <td class="rowName2">Two people ......</td>,
 <td align="right">13,991</td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td class="rowName2">    Householder under 65 years ......</td>,
 <td align="right">14,439</td>,
 <t

Indeed they do.  So now we need to roll through this list of table cells to gather the relevant info.  Ultimately, we want to put this in a DF, so we need to be able to meaningfully split the data in this list.  First, note that data are arranged row-wise.  Here is the first row and the first entry in the second.

     <td class="rowName2">One person (unrelated individual) ......</td>,
     <td align="right">10,956</td>,
     <td> </td>,
     <td> </td>,
     <td> </td>,
     <td> </td>,
     <td> </td>,
     <td> </td>,
     <td> </td>,
     <td> </td>,
     <td> </td>,
     <td class="rowName2">    Under 65 years ......</td>,
     
This suggests an approach in which we populate a row list until some condition is met (indicating the row is complete).  We can then throw the row list in a large list that holds all of the data in the table, and start again with the next row.  Only the row names (e.g. `One person ...`) have tag classes (`rowName2`).  The existence of this class can be our condition.

Once we have this list of rows, it can be easily converted to a [pandas](http://pandas.pydata.org/) DataFrame, which will facilitate combining the information from all years.  Let's get to it.
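The row-building idea can be tried on a small hand-written snippet before touching the real page. The sketch below assumes a made-up two-row table (not the Census markup itself) and the standard-library `html.parser` backend; the names `rows` and `current` are illustrative only.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the Census table (an assumption, not the real page)
html = """
<table>
  <tr><td class="rowName2">One person ......</td><td align="right">10,956</td><td> </td></tr>
  <tr><td class="rowName2">Under 65 years ......</td><td align="right">11,161</td><td>11,161</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

rows = []      # completed rows
current = []   # row currently being built

for cell in soup.find_all('td'):
    # A cell with a class attribute is a row name, so it starts a new row
    if cell.get('class') is not None and current:
        rows.append(current)
        current = []
    current.append(cell.get_text().strip())

# The loop only closes a row when the next row name appears,
# so the row still in progress is added once the loop ends
if current:
    rows.append(current)
```

Each inner list in `rows` then corresponds to one table row, with the row name in the first position.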

In [5]:
#Create container to capture rows
tbl_rows=[]

#Create container for row-specific lists
row_list=[]

#For each cell in the table...
for i,cell in enumerate(soup09.find_all('td')):
    #...if it's the first cell or it isn't a rowName2 cell...
    if i==0 or cell.get('class') is None:
        #...add the contents to the current row list...
        row_list.append(cell.get_text())
    else:
        #...otherwise a new row is starting, so add the finished row list to tbl_rows...
        tbl_rows.append(row_list)
        #...reinitialize the row_list...
        row_list=[]
        #...and add the new rowName2 cell info
        row_list.append(cell.get_text())
#Note: the row still being built when the loop ends is not added to tbl_rows
        
#Create column names (list() keeps the concatenation valid on Python 3 as well)
df09_names=['household','wt_avg']+list(range(len(tbl_rows[0])-1))

#Capture data in DF 
df09=DataFrame(tbl_rows,columns=df09_names)

#Drop extraneous last column
df09.pop(df09.columns[-1])

#Create a year variable
df09['year']=2009

#Set household and year to index
df09.set_index(['year','household'],inplace=True)

df09

| year | household | wt_avg | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2009 | One person (unrelated individual) ...... | 10956 |  |  |  |  |  |  |  |  |  |
| 2009 | Under 65 years ...... | 11161 | 11161.0 |  |  |  |  |  |  |  |  |
| 2009 | 65 years and over ...... | 10289 | 10289.0 |  |  |  |  |  |  |  |  |
| 2009 | Two people ...... | 13991 |  |  |  |  |  |  |  |  |  |
| 2009 | Householder under 65 years ...... | 14439 | 14366.0 | 14787.0 |  |  |  |  |  |  |  |
| 2009 | Householder 65 years and over ...... | 12982 | 12968.0 | 14731.0 |  |  |  |  |  |  |  |
| 2009 | Three people ...... | 17098 | 16781.0 | 17268.0 | 17285.0 |  |  |  |  |  |  |
| 2009 | Four people ...... | 21954 | 22128.0 | 22490.0 | 21756.0 | 21832.0 |  |  |  |  |  |
| 2009 | Five people ...... | 25991 | 26686.0 | 27074.0 | 26245.0 | 25603.0 | 25211.0 |  |  |  |  |
| 2009 | Six people ...... | 29405 | 30693.0 | 30815.0 | 30180.0 | 29571.0 | 28666.0 | 28130.0 |  |  |  |


This is exactly what we need.  Here is a function that executes this task so that it is more easily deployable across all years.

In [11]:
def povtable(url,year):
    '''
    Function navigates to the Census Poverty Threshold page in question, scrapes the table containing
    the poverty thresholds, and returns it as a pandas DataFrame.
    '''
    #Open the target
    target=urllib2.urlopen(url)

    #Read the HTML
    content=target.read()

    #Parse the HTML
    soup=bs(content)
    
    #Create container to capture rows
    tbl_rows=[]

    #Create container for row-specific lists
    row_list=[]

    #For each cell in the table...
    for i,cell in enumerate(soup.find_all('td')):
        #...if it's the first cell or it isn't a rowName2 cell...
        if i==0 or cell.get('class') is None:
            #...add the contents to the current row list...
            row_list.append(cell.get_text())
        else:
            #...otherwise a new row is starting, so add the finished row list to tbl_rows...
            tbl_rows.append(row_list)
            #...reinitialize the row_list...
            row_list=[]
            #...and add the new rowName2 cell info
            row_list.append(cell.get_text())
    #Note: the row still being built when the loop ends is not added to tbl_rows

    #Create column names (list() keeps the concatenation valid on Python 3 as well)
    df_names=['household','wt_avg']+list(range(1,len(tbl_rows[0])))

    #Capture data in DF 
    df=DataFrame(tbl_rows,columns=df_names)
    
    #Drop extraneous last column
    df.pop(df.columns[-1])

    #Create a year variable
    df['year']=year

    #Set household and year to index
    df.set_index(['year','household'],inplace=True)

    return df

povtable(url09,2009)

| year | household | wt_avg | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2009 | One person (unrelated individual) ...... | 10956 |  |  |  |  |  |  |  |  |  |
| 2009 | Under 65 years ...... | 11161 | 11161.0 |  |  |  |  |  |  |  |  |
| 2009 | 65 years and over ...... | 10289 | 10289.0 |  |  |  |  |  |  |  |  |
| 2009 | Two people ...... | 13991 |  |  |  |  |  |  |  |  |  |
| 2009 | Householder under 65 years ...... | 14439 | 14366.0 | 14787.0 |  |  |  |  |  |  |  |
| 2009 | Householder 65 years and over ...... | 12982 | 12968.0 | 14731.0 |  |  |  |  |  |  |  |
| 2009 | Three people ...... | 17098 | 16781.0 | 17268.0 | 17285.0 |  |  |  |  |  |  |
| 2009 | Four people ...... | 21954 | 22128.0 | 22490.0 | 21756.0 | 21832.0 |  |  |  |  |  |
| 2009 | Five people ...... | 25991 | 26686.0 | 27074.0 | 26245.0 | 25603.0 | 25211.0 |  |  |  |  |
| 2009 | Six people ...... | 29405 | 30693.0 | 30815.0 | 30180.0 | 29571.0 | 28666.0 | 28130.0 |  |  |  |
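The household labels come back with their dot leaders (`......`), and the indented ones carry non-breaking spaces in the raw cells. A minimal sketch of a label cleaner follows; `clean_label` is a hypothetical helper, not part of the notebook, and it assumes that a run of two or more trailing dots is always a leader rather than meaningful text.

```python
import re

def clean_label(label):
    """Strip dot leaders and surrounding whitespace, including non-breaking spaces."""
    # Replace non-breaking spaces with ordinary spaces
    label = label.replace(u'\xa0', ' ')
    # Drop a trailing run of two or more dots and any whitespace around it
    label = re.sub(r'\s*\.{2,}\s*$', '', label)
    return label.strip()

examples = [
    'One person (unrelated individual) ......',
    u'\xa0 \xa0 Under 65 years ......',
]
cleaned = [clean_label(text) for text in examples]
```

Applying something like this before `set_index` would give shorter index labels; whether the labels then line up across years would still need checking.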


## Crawler

Now that we have our function, let's see if we can deploy it to all years.  The first thing we need to do is capture all of the relevant links.

In [19]:
#Open the target
repo_target=urllib2.urlopen(url_repo)

#Read the HTML
repo_content=repo_target.read()

#Parse the HTML
repo_soup=bs(repo_content)

#For each <a> tag...
for link in repo_soup.find_all('a'):
    #...if it links to another source...
    if (link.get('href')!=None):
        #...and that source is threshold related...
        if ('thresh' in link.get('href')):
            #...tell me about it
            print(u'%s | %s' % (link,link.get_text()))

<a href="/hhes/www/poverty/data/threshld/14PRELIMINARY.xls" title="Preliminary Estimates of Weighted Average Poverty Thresholds for 2014">Preliminary Estimates of Weighted Average Poverty Thresholds for 2014</a> | Preliminary Estimates of Weighted Average Poverty Thresholds for 2014
<a href="/hhes/www/poverty/data/threshld/thresh14.xls" title="Poverty Thresholds: 2014">2014</a> | 2014
<a href="/hhes/www/poverty/data/threshld/thresh13.xls" title="Poverty Thresholds: 2013">2013</a> | 2013
<a href="/hhes/www/poverty/data/threshld/thresh12.xlsx" title="Poverty Thresholds: 2012">2012</a> | 2012
<a href="/hhes/www/poverty/data/threshld/thresh11.xls" title="Poverty Thresholds: 2011">2011</a> | 2011
<a href="/hhes/www/poverty/data/threshld/thresh10.xlsx" title="Poverty Thresholds: 2010">2010</a> | 2010
<a href="thresh09.html">2009</a> | 2009
<a href="thresh08.html">2008</a> | 2008
<a href="thresh07.html">2007</a> | 2007
<a href="thresh06.html">2006</a> | 2006
<a href="thresh05.html">2005</a> |

Ok, our function is designed to handle the types of tables contained in the 1980 to 2009 range.  We can isolate these links.  In fact, all we really need is the URL snippet housed in the `href` attribute of the `<a>` tag.  Let's throw those in a list.

We also need to reference the years associated with each threshold.  In effect, we want a dictionary to map snippets to year (e.g. `'thresh09.html' => 2009`).  

In [34]:
#Create container for URL snippets
url_snips=[]

#For each <a> tag...
for link in repo_soup.find_all('a'):
    #...if it links to another source...
    if (link.get('href')!=None):
        #...and that source is for information in the 1980-2009 time period...
        if link.get_text() in [str(yr) for yr in range(1980,2010)]:
            #...throw the url snippet in the list
            url_snips.append(link.get('href'))
        
#Capture list of strings containing year abbreviations
yr_abv=[str(yr)[-2:] for yr in range(1980,2010)]

#Create mapping between year abbreviations and years
yr_map=dict(zip(yr_abv,range(1980,2010)))

#Map year abbreviations to url_snips
yr_url_map=dict(zip(['thresh'+yr+'.html' for yr in yr_abv],yr_abv))

#Create dictionary with direct mapping of URL snippets to years
url_snip_yr={}
##For each snippet...
for key in yr_url_map.keys():
    ##...update the dict with the appropriate year as the value
    url_snip_yr.update({key:yr_map[yr_url_map[key]]})
    
print('URL snippets in the dictionary and from the page match up: %s' % (set(url_snip_yr.keys())==set(url_snips)))
    
url_snip_yr

URL snippets in the dictionary and from the page match up: True


{'thresh00.html': 2000,
 'thresh01.html': 2001,
 'thresh02.html': 2002,
 'thresh03.html': 2003,
 'thresh04.html': 2004,
 'thresh05.html': 2005,
 'thresh06.html': 2006,
 'thresh07.html': 2007,
 'thresh08.html': 2008,
 'thresh09.html': 2009,
 'thresh80.html': 1980,
 'thresh81.html': 1981,
 'thresh82.html': 1982,
 'thresh83.html': 1983,
 'thresh84.html': 1984,
 'thresh85.html': 1985,
 'thresh86.html': 1986,
 'thresh87.html': 1987,
 'thresh88.html': 1988,
 'thresh89.html': 1989,
 'thresh90.html': 1990,
 'thresh91.html': 1991,
 'thresh92.html': 1992,
 'thresh93.html': 1993,
 'thresh94.html': 1994,
 'thresh95.html': 1995,
 'thresh96.html': 1996,
 'thresh97.html': 1997,
 'thresh98.html': 1998,
 'thresh99.html': 1999}
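The same snippet-to-year mapping could also be derived from the file names themselves. The sketch below is one possible alternative, not the approach used above: `snippet_year` is a hypothetical helper, and it assumes every relevant name has the form `threshNN.html` and that two-digit years from 80 upward belong to the 1900s.

```python
import re

def snippet_year(snippet, first_year=1980):
    """Return the four-digit year for a name like 'thresh09.html', or None."""
    match = re.match(r'thresh(\d{2})\.html$', snippet)
    if match is None:
        return None
    two_digit = int(match.group(1))
    # Two-digit years at or above the pivot (80 by default) are read as 19xx
    pivot = first_year % 100
    return (1900 if two_digit >= pivot else 2000) + two_digit

# Illustrative inputs; the .xls name is included to show it gets skipped
snippets = ['thresh09.html', 'thresh80.html', 'thresh99.html',
            'thresh00.html', 'thresh14.xls']
url_snip_yr_alt = {s: snippet_year(s) for s in snippets
                   if snippet_year(s) is not None}
```

Comparing `url_snip_yr_alt` against `url_snip_yr` would be a quick cross-check of the two constructions.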

Ok, now we can roll through each year and build our master DF (which is just a concatenation of the year-specific DFs).
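Each year-specific address is the repository URL plus the `href` snippet. As a sketch of a more general way to build it, the standard library's `urljoin` resolves both the bare file names (`thresh09.html`) and the root-relative paths (`/hhes/www/...`) seen in the link listing; the variable names `html_url` and `xls_url` are illustrative.

```python
try:
    from urllib.parse import urljoin   # Python 3
except ImportError:
    from urlparse import urljoin       # Python 2

url_repo = 'https://www.census.gov/hhes/www/poverty/data/threshld/'

# A bare file name is resolved against the repository directory
html_url = urljoin(url_repo, 'thresh09.html')

# A root-relative path replaces the path but keeps the scheme and host
xls_url = urljoin(url_repo, '/hhes/www/poverty/data/threshld/thresh14.xls')
```

This avoids relying on a hard-coded character count when slicing a known URL.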

In [44]:
#Create container for DFs
thresh_dfs=[]

#For each key (a URL snippet)...
for key in url_snip_yr.keys():
    #...build the full URL from the repo URL and capture the threshold data in thresh_dfs
    thresh_dfs.append(povtable(url_repo+key,url_snip_yr[key]))

#Concatenate the DFs together
thresh_df=pd.concat(thresh_dfs)

thresh_df

| year | household | wt_avg | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1983 | One person (unrelated individual)...... | $5,061 |  |  |  |  |  |  |  |  |  |
| 1983 | Under 65 years...... | 5180 | 5180 |  |  |  |  |  |  |  |  |
| 1983 | 65 years and over...... | 4775 | 4775 |  |  |  |  |  |  |  |  |
| 1983 | Two persons...... | 6483 |  |  |  |  |  |  |  |  |  |
| 1983 | Householder under 65 years...... | 6697 | 6667 | 6863 |  |  |  |  |  |  |  |
| 1983 | Householder 65 years and over...... | 6023 | 6019 | 6837 |  |  |  |  |  |  |  |
| 1983 | Three persons...... | 7938 | 7789 | 8015 | 8022 |  |  |  |  |  |  |
| 1983 | Four persons...... | 10178 | 10270 | 10437 | 10098 | 10133 |  |  |  |  |  |
| 1983 | Five persons...... | 12049 | 12385 | 12565 | 12181 | 11882 | 11701 |  |  |  |  |
| 1983 | Six persons...... | 13630 | 14245 | 14301 | 14007 | 13724 | 13305 | 13056 |  |  |  |
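The scraped cells are text: the raw `<td>` values carry thousands separators (`10,956`), and at least one value above keeps its dollar sign (`$5,061`). A minimal sketch of a converter follows; `to_number` is a hypothetical helper, and it assumes blank or missing cells should become `None`.

```python
def to_number(text):
    """Convert a cell such as '$5,061' or '10,956' to a float; blanks become None."""
    if text is None:
        return None
    # Remove non-breaking spaces, dollar signs and thousands separators
    cleaned = text.replace(u'\xa0', '').replace('$', '').replace(',', '').strip()
    if not cleaned:
        return None
    return float(cleaned)

# Sample values taken from the cells and output shown above
samples = ['$5,061', '10,956', '5180', u'\xa0', None]
converted = [to_number(value) for value in samples]
```

One way to use it would be column by column, for example `thresh_df['wt_avg'].map(to_number)`, assuming every cell in that column is either a string or missing.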
