---
layout: single
title: 'Acquiring U.S. census data with Python and cenpy'
date: 2016-06-28
authors: [Zach Schira]
category: [tutorials]
excerpt: 'This tutorial outlines the use of the Cenpy package to search for, and acquire specific census data.'
sidebar:
  nav:
author_profile: false
comments: true
lang: [python]
lib: [pandas, cenpy, pysal]
---


Several online sources provide access to census data, including the U.S. Census Bureau's American FactFinder and a number of third-party sites. These sources, however, are not well suited to large-scale data acquisition and analysis. The [Cenpy](https://pypi.python.org/pypi/cenpy/0.9.1) Python package allows programmatic access to this data through the [Census Bureau's API](http://www.census.gov/data/developers/data-sets.html){:data-proofer-ignore=''}.

This tutorial outlines how to use the Cenpy package to search for and acquire specific census data. Cenpy returns this data as a pandas dataframe, which makes it easy to access and analyze within Python. To visualize the data, look into the [GeoPandas](http://geopandas.org/) package, which builds on pandas to add tools for geospatial data analysis.

## Objectives
- Install Cenpy package
- Search for desired census data
- Download and store data

## Dependencies 

The Cenpy package depends on pandas and requests. 
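
Before importing, it can help to confirm that these packages are actually installed. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is illustrative and is not part of Cenpy, and `pip` is assumed as the installer.

```python
import importlib.util


def missing_packages(names):
    """Return the names in `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Cenpy's stated dependencies, plus Cenpy itself
required = ["pandas", "requests", "cenpy"]
missing = missing_packages(required)

if missing:
    # pip is assumed here; use your environment's package manager if it differs
    print("Missing packages, try: pip install " + " ".join(missing))
else:
    print("All required packages are importable.")
```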

In [1]:
import pandas as pd
import cenpy as cen
import pysal



## Finding Data
The cenpy `explorer` module allows you to view all of the available [United States Census Bureau APIs](http://www.census.gov/data/developers/data-sets.html){:data-proofer-ignore=''}.

In [2]:
datasets = list(cen.explorer.available(verbose=True).items())

# print first rows of the dataframe containing datasets
pd.DataFrame(datasets).head()

Unnamed: 0,0,1
0,title,NONEMP2007 2007 Nonemp...
1,temporal,NONEMP2007 ...
2,spatial,NONEMP2007 United Stat...
3,publisher,NONEMP2007 U.S. Census...
4,programCode,NONEMP2007 006:007 POP...


Passing the name of a specific API to `explorer.explain()` returns a description of the data available. For this example, we will use the 2012 American Community Survey 1-year data (`2012acs1`).

In [3]:
dataset = '2012acs1'
cen.explorer.explain(dataset)

{'2012 American Community Survey: 1-Year Estimates': "The American Community Survey (ACS) is a nationwide survey designed to provide communities a fresh look at how they are changing. The ACS replaced the decennial census long form in 2010 and thereafter by collecting long form type information throughout the decade rather than only once every 10 years.  Questionnaires are mailed to a sample of addresses to obtain information about households -- that is, about each person and the housing unit itself.  The American Community Survey produces demographic, social, housing and economic estimates in the form of 1-year, 3-year and 5-year estimates based on population thresholds. The strength of the ACS is in estimating population and housing characteristics. It produces estimates for small areas, including census tracts and population subgroups.  Although the ACS produces population, demographic and housing unit estimates,it is the Census Bureau's Population Estimates Program that produces an

The base module allows you to establish a connection with the desired API that will be used later to acquire data.

In [4]:
con = cen.base.Connection(dataset)
con

Connection to 2012 American Community Survey: 1-Year Estimates (ID: http://api.census.gov/data/id/2012acs1)

## Acquiring Data

### Geographical specification

Cenpy uses FIPS codes to specify the geographical extent of the data to be downloaded. The object `con` is our connection to the API, and its `geographies` attribute is a dictionary.

In [5]:
print(type(con))
print(type(con.geographies))
print(con.geographies.keys())

<class 'cenpy.remote.APIConnection'>
<class 'dict'>
dict_keys(['fips'])


In [6]:
# print head of data frame in the geographies dictionary
con.geographies['fips'].head()

Unnamed: 0,geoLevelId,name,optionalWithWCFor,requires,wildcard
0,500,congressional district,state,[state],[state]
1,60,county subdivision,,"[state, county]",
2,795,public use microdata area,,[state],
3,310,metropolitan statistical area/micropolitan sta...,,,
4,160,place,state,[state],[state]
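
The `requires` column indicates which parent geographies have to be supplied before a given level can be queried. As a rough illustration, the check below copies the requirements from the rows printed above into a plain dictionary. The helper `missing_parents` is hypothetical and not part of Cenpy.

```python
# Requirements copied from the first rows of con.geographies['fips'] shown above
REQUIRES = {
    "congressional district": ["state"],
    "county subdivision": ["state", "county"],
    "public use microdata area": ["state"],
    "place": ["state"],
}


def missing_parents(geo_level, geo_filter):
    """List parent geographies that `geo_level` requires but `geo_filter` lacks."""
    return [parent for parent in REQUIRES.get(geo_level, []) if parent not in geo_filter]


# County subdivisions need both a state and a county in the filter
print(missing_parents("county subdivision", {"state": "8"}))  # ['county']
print(missing_parents("place", {"state": "8"}))               # []
```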


`geo_unit` and `geo_filter` are both required arguments of the `query()` function. `geo_unit` specifies the geographic level at which data should be returned, and `geo_filter` restricts the query to a parent geography so that more data than needed is not downloaded. The following example will download data for all counties in Colorado, whose state FIPS code is 08 and is passed here as `'8'` (state FIPS codes are accessible [here](https://www.mcc.co.mercer.pa.us/dps/state_fips_code_listing.htm)).

In [7]:
g_unit = 'county:*'
g_filter = {'state':'8'}
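
To make the roles of the two arguments more concrete, the sketch below shows how a unit and a filter could map onto the Census API's `for` and `in` geography clauses. This is an illustration only, not Cenpy's implementation: the function `census_query_string` is hypothetical, and joining several parent geographies with a space is an assumption.

```python
from urllib.parse import urlencode


def census_query_string(cols, geo_unit, geo_filter):
    """Build an illustrative query string with `get`, `for` and `in` clauses."""
    params = {"get": ",".join(cols), "for": geo_unit}
    if geo_filter:
        # Assumption: several parent geographies are joined with a space
        params["in"] = " ".join(f"{level}:{code}" for level, code in geo_filter.items())
    # Keep ':', '*' and ',' readable instead of percent-encoding them
    return urlencode(params, safe=":*,")


print(census_query_string(["NAME", "GEOID"], "county:*", {"state": "8"}))
# get=NAME,GEOID&for=county:*&in=state:8
```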

### Specifying variables to extract

The other argument taken by `query()` is `cols`, a list of columns drawn from the variables of the API. These variables can be displayed using the `variables` attribute. Because there are so many of them, however, it is easier to use the [Social Explorer](https://www.socialexplorer.com/) site to find the data you are interested in.

In [8]:
var = con.variables
print('Number of variables in', dataset, ':', len(var))
con.variables.head()

Number of variables in 2012acs1 : 68401


Unnamed: 0,concept,group,label,limit,predicateOnly,predicateType
for,Census API Geography Specification,,Census API FIPS 'for' clause,0,True,fips-for
in,Census API Geography Specification,,Census API FIPS 'in' clause,0,True,fips-in
B20005E_045M,B20005E. Sex by Work Experience by Earnings f...,,Margin of Error for!!Male:!!Other:!!With earni...,0,,
B06004HPR_002M,"B06004HPR. Place of Birth (White Alone, Not H...",,Margin of Error for!!Born in Puerto Rico,0,,
B24126_438E,B24126. Detailed Occupation for the Full-Time...,,"Multiple machine tool setters, operators, and ...",0,,


Related columns of data always start with the same prefix, so cenpy includes a function, `varslike`, that creates a list of column names matching the input pattern. It is also useful to add the `NAME` and `GEOID` columns, which provide the name and geographic ID of each record. In this example, we will use table [B01001A](https://www.socialexplorer.com/data/ACS2013/metadata/?ds=ACS13&table=B01001A), which gives data for sex by age within the desired geography. The identifier at the end of each column name corresponds to males or females of different age groups, with a trailing `E` for an estimate and `M` for its margin of error.

In [9]:
cols = con.varslike('B01001A_')
cols.extend(['NAME', 'GEOID'])
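
The idea of selecting columns by a shared prefix can be sketched with the standard `re` module. This is not Cenpy's implementation, and `names_like` is a hypothetical helper; the variable names are taken from output shown in this tutorial.

```python
import re

# A few variable names that appear in this tutorial's output
names = [
    "B01001A_007M",
    "B01001A_008E",
    "B01001A_009M",
    "B20005E_045M",
    "B06004HPR_002M",
    "B24126_438E",
]


def names_like(pattern, candidates):
    """Return the candidates whose beginning matches the regular expression."""
    regex = re.compile(pattern)
    return [name for name in candidates if regex.match(name)]


selected = names_like("B01001A_", names)
print(selected)  # ['B01001A_007M', 'B01001A_008E', 'B01001A_009M']

# Split the selection by suffix: 'E' columns versus 'M' (margin of error) columns
estimates = [name for name in selected if name.endswith("E")]
margins = [name for name in selected if name.endswith("M")]
print(estimates, margins)
```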

With the three necessary arguments, data can be downloaded and saved as a pandas dataframe.

In [10]:
data = con.query(cols, geo_unit=g_unit, geo_filter=g_filter)
# prints a deprecation warning because of how cenpy calls pandas

It is useful to replace the default index with the data from the `NAME` or `GEOID` column, as these will give a more useful description of the data.

In [11]:
data.index = data.NAME

# print first five rows and last five columns
data.iloc[:5, -5:]

NAME,B01001A_007M,B01001A_008E,B01001A_009M,NAME,GEOID
"Adams County, Colorado",514,12648,624,"Adams County, Colorado",05000US08001
"Arapahoe County, Colorado",432,13231,582,"Arapahoe County, Colorado",05000US08005
"Boulder County, Colorado",632,15297,189,"Boulder County, Colorado",05000US08013
"Denver County, Colorado",389,15602,829,"Denver County, Colorado",05000US08031
"Douglas County, Colorado",367,4953,442,"Douglas County, Colorado",05000US08035

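Indexing by `GEOID` works the same way. The sketch below uses pandas' `set_index` on a small dataframe built from three of the rows above, rather than on a live query result, so it runs without a connection to the API.

```python
import pandas as pd

# Three rows copied from the query output above
data = pd.DataFrame(
    {
        "B01001A_008E": [12648, 13231, 15297],
        "NAME": [
            "Adams County, Colorado",
            "Arapahoe County, Colorado",
            "Boulder County, Colorado",
        ],
        "GEOID": ["05000US08001", "05000US08005", "05000US08013"],
    }
)

# set_index returns a new dataframe; drop=False keeps GEOID as a regular column too
by_geoid = data.set_index("GEOID", drop=False)

# Rows can now be selected by label rather than by position
print(by_geoid.loc["05000US08013", "B01001A_008E"])  # 15297
```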

### Topologically Integrated Geographic Encoding and Referencing (TIGER) data

The Census TIGER API provides geometries for desired geographic regions. For instance, we may want additional information on each county, such as its area.

In [12]:
cen.tiger.available()

[{'name': 'AIANNHA', 'type': 'MapServer'},
 {'name': 'CBSA', 'type': 'MapServer'},
 {'name': 'Hydro_LargeScale', 'type': 'MapServer'},
 {'name': 'Hydro', 'type': 'MapServer'},
 {'name': 'Labels', 'type': 'MapServer'},
 {'name': 'Legislative', 'type': 'MapServer'},
 {'name': 'Places_CouSub_ConCity_SubMCD', 'type': 'MapServer'},
 {'name': 'PUMA_TAD_TAZ_UGA_ZCTA', 'type': 'MapServer'},
 {'name': 'Region_Division', 'type': 'MapServer'},
 {'name': 'School', 'type': 'MapServer'},
 {'name': 'Special_Land_Use_Areas', 'type': 'MapServer'},
 {'name': 'State_County', 'type': 'MapServer'},
 {'name': 'tigerWMS_ACS2013', 'type': 'MapServer'},
 {'name': 'tigerWMS_ACS2014', 'type': 'MapServer'},
 {'name': 'tigerWMS_ACS2015', 'type': 'MapServer'},
 {'name': 'tigerWMS_ACS2016', 'type': 'MapServer'},
 {'name': 'tigerWMS_ACS2017', 'type': 'MapServer'},
 {'name': 'tigerWMS_ACS2018', 'type': 'MapServer'},
 {'name': 'tigerWMS_Census2010', 'type': 'MapServer'},
 {'name': 'tigerWMS_Current', 'type': 'MapServer

First, you must establish a connection to the TIGER API; then you can display the available layers. No TIGER data is available for ACS 2012, so we will use ACS 2013 for the sake of example. Ideally, you would use TIGER data that corresponds to your dataset.

In [13]:
con.set_mapservice('tigerWMS_ACS2013')

# print layers
con.mapservice.layers

{0: (ESRILayer) 2010 Census Public Use Microdata Areas,
 1: (ESRILayer) 2010 Census Public Use Microdata Areas Labels,
 2: (ESRILayer) 2010 Census ZIP Code Tabulation Areas,
 3: (ESRILayer) 2010 Census ZIP Code Tabulation Areas Labels,
 4: (ESRILayer) Tribal Census Tracts,
 5: (ESRILayer) Tribal Census Tracts Labels,
 6: (ESRILayer) Tribal Block Groups,
 7: (ESRILayer) Tribal Block Groups Labels,
 8: (ESRILayer) Census Tracts,
 9: (ESRILayer) Census Tracts Labels,
 10: (ESRILayer) Census Block Groups,
 11: (ESRILayer) Census Block Groups Labels,
 12: (ESRILayer) Unified School Districts,
 13: (ESRILayer) Unified School Districts Labels,
 14: (ESRILayer) Secondary School Districts,
 15: (ESRILayer) Secondary School Districts Labels,
 16: (ESRILayer) Elementary School Districts,
 17: (ESRILayer) Elementary School Districts Labels,
 18: (ESRILayer) Estates,
 19: (ESRILayer) Estates Labels,
 20: (ESRILayer) County Subdivisions,
 21: (ESRILayer) County Subdivisions Labels,
 22: (ESRILayer) 

The data retrieved earlier was at the county level, so we will use layer 84. Using the TIGER connection, `query()` can retrieve the data, taking the layer and a `where` clause for the geographic location as arguments.

In [14]:
geodata = con.mapservice.query(layer=84, where='STATE=8')
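
Rather than reading a layer number off the printed list, the lookup can be scripted. The sketch below works on a plain dictionary of names copied from the truncated listing above. The real dictionary holds `ESRILayer` objects, so `find_layers` is a hypothetical helper that would need adapting.

```python
# Layer names copied from the (truncated) listing above; the real dictionary
# holds ESRILayer objects rather than plain strings
layers = {
    8: "Census Tracts",
    9: "Census Tracts Labels",
    10: "Census Block Groups",
    11: "Census Block Groups Labels",
    20: "County Subdivisions",
    21: "County Subdivisions Labels",
}


def find_layers(layers, keyword):
    """Return {layer_id: name} for non-label layers whose name contains `keyword`."""
    keyword = keyword.lower()
    return {
        layer_id: name
        for layer_id, name in layers.items()
        if keyword in name.lower() and not name.endswith("Labels")
    }


print(find_layers(layers, "count"))  # {20: 'County Subdivisions'}
```

With the real `con.mapservice.layers` dictionary, the comparison would presumably have to use each layer's printed name, for example through `str(layer)`. That detail is an assumption.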

In [15]:
# preview geodata
geodata.iloc[:5, :5]

Unnamed: 0,AREALAND,AREAWATER,BASENAME,CENTLAT,CENTLON
0,4376528327,25375721,La Plata,37.2863615,-107.8435627
1,8206547707,4454510,Saguache,38.0807339,-106.2808607
2,1419419128,3530746,Sedgwick,40.8759564,-102.3517903
3,1003660601,2035929,San Juan,37.7640122,-107.6762274
4,4605714032,8166134,Cheyenne,38.828178,-102.6034141
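
The `AREALAND` column gives the area information mentioned earlier. Assuming its values are in square meters, a conversion to square kilometers could look like the sketch below, which uses three values copied from the preview instead of the live `geodata` dataframe.

```python
# AREALAND values copied from the preview above; assumed to be in square meters
area_land_m2 = {
    "La Plata": 4376528327,
    "Saguache": 8206547707,
    "Sedgwick": 1419419128,
}

SQ_M_PER_SQ_KM = 1_000_000

# Convert each county's land area from square meters to square kilometers
area_land_km2 = {
    county: area / SQ_M_PER_SQ_KM for county, area in area_land_m2.items()
}

for county, area in area_land_km2.items():
    print(f"{county}: {area:,.1f} km^2")
```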


This data can now be merged with the original data to create one pandas dataframe containing all of the relevant data.

In [16]:
newdata = pd.merge(data, geodata, left_on='county', right_on='COUNTY')
newdata.iloc[:5, -5:]

Unnamed: 0,OID,STATE,STGEOMETRY.AREA,STGEOMETRY.LEN,geometry
0,27553700234319,8,5211597000.0,511817.561207,"POLYGON ((-11644798.3074 4851335.998899996, -1..."
1,27553703789414,8,3523333000.0,435243.866171,"(POLYGON ((-11665321.7253 4803086.1294, -11665..."
2,27553701435070,8,3280834000.0,291031.607339,"(POLYGON ((-11760745.906 4874953.136100002, -1..."
3,27553700234321,8,678468800.0,341476.209637,"(POLYGON ((-11700783.4396 4811897.626999997, -..."
4,27553711656416,8,3653474000.0,276727.360768,"POLYGON ((-11674338.3814 4803073.133100003, -1..."
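
When merging on county codes, it is worth checking that every census row found a matching geometry. The sketch below uses small toy dataframes with made-up values rather than the tutorial's `data` and `geodata`, and relies on the `how`, `validate`, and `indicator` options of `pd.merge`.

```python
import pandas as pd

# Toy stand-ins for the census and TIGER tables; all values here are made up
census = pd.DataFrame({"county": ["001", "005", "013"], "population": [10, 20, 30]})
tiger = pd.DataFrame({"COUNTY": ["001", "005", "999"], "AREALAND": [100, 200, 300]})

# Both key columns are strings here; merge keys need a common dtype to match
merged = pd.merge(
    census,
    tiger,
    left_on="county",
    right_on="COUNTY",
    how="left",             # keep every census row, even without a TIGER match
    validate="one_to_one",  # raise if a county code is duplicated on either side
    indicator=True,         # adds a _merge column recording where each row came from
)

# Census rows that found no matching geometry
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched["county"].tolist())  # ['013']
```

The tutorial's own call uses the default inner join, which silently drops unmatched rows. The left join above is simply one way to make such gaps visible.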
