# Scraping Poverty Thresholds from the Census Website

It would be convenient to capture the poverty thresholds directly from the Census website.  The question is whether we can reliably scrape the site for the relevant information.

In [22]:
#Data manipulation
import numpy as np
import pandas as pd
from pandas import Series,DataFrame

#Parsing
from bs4 import BeautifulSoup as bs

#Content capture
import urllib2
import urllib

#Display
from IPython.display import HTML

from pkg_resources import resource_stream


The relevant pages are sourced from a central location.

In [7]:
#Define URL of central repo
url_repo='https://www.census.gov/hhes/www/poverty/data/threshld/'
HTML(url_repo)
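
As a rough sketch of how the year-specific pages might later be collected from this central location, the snippet below pulls out every link whose file name looks like `thresh09.html` and resolves it against `url_repo`. The helper name `threshold_links`, the `sample_index` markup, and the assumption that every year follows the `threshNN.html` naming pattern are all illustrative; the real index page has not been inspected here.

```python
import re

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

def threshold_links(index_html, base_url):
    """Map two-digit year strings to absolute URLs of threshNN.html pages."""
    links = {}
    # Look at every double-quoted href in the page source
    for href in re.findall(r'href="([^"]+)"', index_html):
        # Keep only links that match the assumed threshNN.html pattern
        match = re.search(r'thresh(\d{2})\.html$', href)
        if match:
            links[match.group(1)] = urljoin(base_url, href)
    return links

# Hypothetical stand-in for the HTML of the central repo page
sample_index = '''
<a href="thresh09.html">2009</a>
<a href="thresh08.html">2008</a>
<a href="index.html">Overview</a>
'''

url_repo = 'https://www.census.gov/hhes/www/poverty/data/threshld/'
links = threshold_links(sample_index, url_repo)
```

A regular expression is used here only to keep the sketch free of dependencies; with the page already parsed by BeautifulSoup, looping over `find_all('a')` would be the more natural choice.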

We need a test case to work with, so let's start with 2009.  Once we figure this out, we can develop a crawler to grab data from the year-specific pages by way of the central repo above.  Let's grab the HTML from the 2009 page and parse it.

In [8]:
#Capture URL in a string
url09='https://www.census.gov/hhes/www/poverty/data/threshld/thresh09.html'

#Open the target
url09target=urllib2.urlopen(url09)

#Read the HTML
url09content=url09target.read()

#Parse the HTML
soup09=bs(url09content)

print soup09.prettify()

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="US Census Bureau Poverty main page " name="DC.title"/>
  <meta content="The Census Bureau reports poverty data from several major household surveys and programs. " name="DC.description"/>
  <meta content="US Census Bureau, Demographic Internet Staff" name="DC.creator"/>
  <meta content="EN-US" name="DC.language" scheme="DCTERMS.RFC1766"/>
  <meta content="US Census Bureau Poverty " name="description"/>
  <link href="/hhes/www/poverty/index.html " rel="index"/>
  <!--CHANGE PAGE TITLE (Keep Census Bureau)-->
  <title>
   Poverty Thresholds 2009  - U.S Census Bureau
  </title>
  <!--UPDATE LINK TO CBHEADER INCLUDE-->
  <!--START CBHEADER-->
  <!-- Site wide JS -->
  <script src="/main/javascript/ruthsarian_

It appears that the info we need resides entirely in the table.  The cells of the table are marked in the HTML source by `<td>` tags.  Let's confirm they hold what we need.

In [11]:
soup09.find_all('td')

[<td class="rowName2">One person (unrelated individual) ......</td>,
 <td align="right">10,956</td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td class="rowName2">    Under 65 years ......</td>,
 <td align="right">11,161</td>,
 <td>11,161</td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td class="rowName2">    65 years and over ......</td>,
 <td align="right">10,289</td>,
 <td>10,289</td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td colspan="11"> </td>,
 <td class="rowName2">Two people ......</td>,
 <td align="right">13,991</td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td> </td>,
 <td class="rowName2">    Householder under 65 years ......</td>,
 <td align="right">14,439</td>,
 <td>14,366</td>,
 <td>14,787</td>,
 <td> </

Indeed they do.  So now we need to roll through this list of table cells to gather the relevant info.  Ultimately, we want to put this in a DataFrame, so we need to be able to split the list meaningfully.  First, note that the data are arranged row-wise.  Here is the first row, followed by the first entry of the second.

     <td class="rowName2">One person (unrelated individual) ......</td>,
     <td align="right">10,956</td>,
     <td> </td>,
     <td> </td>,
     <td> </td>,
     <td> </td>,
     <td> </td>,
     <td> </td>,
     <td> </td>,
     <td> </td>,
     <td> </td>,
     <td class="rowName2">    Under 65 years ......</td>,
     
This suggests an approach in which we populate a row list until some condition is met (indicating the row is complete).  We can then throw the row list in a large list that holds all of the data in the table, and start again with the next row.  Only the row names (e.g. `One person ...`) have tag classes (`rowName2`).  The existence of this class can be our condition.
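
The grouping idea can be sketched in isolation before touching the real page. In this minimal example each cell is represented by a hypothetical `(css_class, text)` pair standing in for `cell.get('class')` and `cell.get_text()`; the function name `group_cells` and the sample cells are assumptions made for illustration.

```python
def group_cells(cells):
    """Group a flat list of (css_class, text) pairs into rows.

    Any cell that carries a class (e.g. 'rowName2') starts a new row.
    """
    rows = []
    current = []
    for css_class, text in cells:
        # A classed cell closes the row collected so far, if there is one
        if css_class is not None and current:
            rows.append(current)
            current = []
        current.append(text)
    # The last row has no following row name to close it, so add it here
    if current:
        rows.append(current)
    return rows

# Hypothetical cells mirroring the start of the 2009 table
cells = [
    ('rowName2', 'One person (unrelated individual) ......'),
    (None, '10,956'),
    (None, ' '),
    ('rowName2', '    Under 65 years ......'),
    (None, '11,161'),
    (None, '11,161'),
]
rows = group_cells(cells)
```

Note the explicit step after the loop: because a row is only closed when the next row name appears, the final row has to be added separately.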

Once we have this list of rows, it can be easily converted to a [pandas](http://pandas.pydata.org/) DataFrame, which will facilitate combining the information from all years.  Let's get to it.
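
One detail worth keeping in mind for the conversion: the cell text arrives as strings such as `10,956`, blank cells, and labels padded with spaces and dot leaders. The helpers below are a hedged sketch of how those strings could be tidied; `clean_label` and `clean_value` are hypothetical names, and treating blank cells as possibly containing non-breaking spaces is an assumption about the page.

```python
def clean_label(text):
    """Strip padding and the trailing dot leader from a row label."""
    return text.strip().rstrip('.').strip()

def clean_value(text):
    """Turn '10,956' into 10956; return None for blank cells."""
    # Blank cells may hold a non-breaking space (assumption)
    stripped = text.replace(u'\xa0', u' ').strip()
    if not stripped:
        return None
    return int(stripped.replace(',', ''))

# Hypothetical raw row as it would come out of get_text()
raw_row = ['    Under 65 years ......', '11,161', '11,161', ' ']
clean_row = [clean_label(raw_row[0])] + [clean_value(v) for v in raw_row[1:]]
```

Whether to clean before or after building the DataFrame is a matter of taste; doing it on the plain lists keeps the logic independent of pandas.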

In [35]:
#Create container to capture rows
tbl_rows=[]

#Create container for row-specific lists
row_list=[]

#For each cell in the table...
for i,cell in enumerate(soup09.find_all('td')):
    #...if it's the first cell or it isn't a rowName2 cell...
    if i==0 or cell.get('class') is None:
        #...add the contents to the current row list...
        row_list.append(cell.get_text())
    else:
        #...otherwise add the completed row list to tbl_rows...
        tbl_rows.append(row_list)
        #...reinitialize the row_list...
        row_list=[]
        #...and add the new rowName2 cell info
        row_list.append(cell.get_text())

#Add the final row, which no later rowName2 cell would otherwise flush
tbl_rows.append(row_list)
        
#Create column names
df09_names=['household','wt_avg']+range(len(tbl_rows[0])-1)

#Capture data in DF 
df09=DataFrame(tbl_rows,columns=df09_names)

#Drop extraneous last column
df09.pop(df09.columns[-1])

#Create a year variable
df09['year']=2009

#Set household and year to index
df09.set_index(['year','household'],inplace=True)

df09

| year | household | wt_avg | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2009 | One person (unrelated individual) ...... | 10956 | | | | | | | | | |
| 2009 | Under 65 years ...... | 11161 | 11161.0 | | | | | | | | |
| 2009 | 65 years and over ...... | 10289 | 10289.0 | | | | | | | | |
| 2009 | Two people ...... | 13991 | | | | | | | | | |
| 2009 | Householder under 65 years ...... | 14439 | 14366.0 | 14787.0 | | | | | | | |
| 2009 | Householder 65 years and over ...... | 12982 | 12968.0 | 14731.0 | | | | | | | |
| 2009 | Three people ...... | 17098 | 16781.0 | 17268.0 | 17285.0 | | | | | | |
| 2009 | Four people ...... | 21954 | 22128.0 | 22490.0 | 21756.0 | 21832.0 | | | | | |
| 2009 | Five people ...... | 25991 | 26686.0 | 27074.0 | 26245.0 | 25603.0 | 25211.0 | | | | |
| 2009 | Six people ...... | 29405 | 30693.0 | 30815.0 | 30180.0 | 29571.0 | 28666.0 | 28130.0 | | | |


This is exactly what we need.  Here is a function that executes this task, so that it can be deployed more easily across all years.