# Cenpy Tutorial 
Author: Zach Schira

## About 
There are several useful online sources for accessing census data, provided both by the US Census Bureau ([American Factfinder](http://factfinder.census.gov)) and by outside sources. These sources, however, are not well suited to large-scale data acquisition and analysis. The [Cenpy](https://pypi.python.org/pypi/cenpy/0.9.1) Python package allows programmatic access to this data through the [Census Bureau's API](http://www.census.gov/data/developers/data-sets.html).

This tutorial outlines how to use the Cenpy package to search for and acquire specific census data. Cenpy saves this data as a [Pandas](http://pandas.pydata.org/) dataframe, which allows easy access to and analysis of the data within Python. For visualization of this data, look into the [GeoPandas](http://geopandas.org/) package, which builds on the base Pandas package to add tools for geospatial data analysis.

## Objectives
- Install Cenpy package
- Search for desired census data
- Download and store data

## Dependencies 
The Cenpy package depends on pandas and [requests](http://docs.python-requests.org/en/master/). Ensure that Python and pip are already installed, then use the following commands to install cenpy.

In [None]:
!pip install pandas
!pip install requests
!pip install cenpy

In [1]:
import pandas as pd
import cenpy as cen

## Finding Data
The cenpy `explorer` module allows you to view all of the available [United States Census Bureau APIs](http://www.census.gov/data/developers/data-sets.html).

In [2]:
datasets = list(cen.explorer.available(verbose=True).items())
print('Number of datasets: ', len(datasets))

# print all available datasets
for name, description in datasets:
    print(name, ': ', description)

Number of datasets:  125
POPESTcochar62014 :  Vintage 2014 Population Estimates: County Population Estimates by 5 Year Age Groups, Sex, 6 Races, and Hispanic Origin
CBP2012 :  2012 County Business Patterns: Business Patterns
POPESTprmagesex2014 :  Vintage 2014 Population Estimates: Puerto Rico Municipios Estimates by 5-Year Age Groups and Sex
IDBSINGLEYEAR :  Time Series International Database: International Populations by Single Year of Age and Sex
POPESTagesex :  Vintage 2014 Population Estimates: National Annual Resident Population Estimates by Single Year of Age and Sex
SBO2012 :  2012 Survey of Business Owners: Company Summary
ACSSF5Y2009 :  2005-2009 American Community Survey 5-Year Estimates
EITSHV :  Time Series Economic Indicators Time Series -: Housing Vacancies and Homeownership
POPESTprcagesex2013 :  Vintage 2013 Population Estimates: Puerto Rico Commonwealth Estimates by Single Year of Age and Sex
PEPCharage2015 :  Vintage 2015 Population Estimates: Characteristics by Sing
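
With more than a hundred datasets, it is usually quicker to filter the listing by keyword than to read it in full. The sketch below assumes the listing can be treated as a plain dict mapping dataset IDs to titles, as the printed output above suggests. The `sample` entries are copied from that output, and `find_datasets` is a hypothetical helper, not part of cenpy.

```python
# A few entries copied from the listing above; in the notebook this could be
# replaced by dict(datasets).
sample = {
    'CBP2012': '2012 County Business Patterns: Business Patterns',
    'SBO2012': '2012 Survey of Business Owners: Company Summary',
    'ACSSF5Y2009': '2005-2009 American Community Survey 5-Year Estimates',
    'IDBSINGLEYEAR': 'Time Series International Database: '
                     'International Populations by Single Year of Age and Sex',
}


def find_datasets(listing, keyword):
    """Return the entries whose ID or title contains keyword (case-insensitive)."""
    needle = keyword.lower()
    return {
        name: title
        for name, title in listing.items()
        if needle in name.lower() or needle in title.lower()
    }


matches = find_datasets(sample, 'american community survey')
for name, title in matches.items():
    print(name, ':', title)
```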

Passing the name of a specific API to `explorer.explain()` will give a description of the data available. For this example, we will use the 2012 [American Community Survey](https://www.census.gov/programs-surveys/acs/) 1-year data (`2012acs1`).

In [3]:
dataset = '2012acs1'
cen.explorer.explain(dataset)

{'2012 American Community Survey: 1-Year Estimates': "The American Community Survey (ACS) is a nationwide survey designed to provide communities a fresh look at how they are changing. The ACS replaced the decennial census long form in 2010 and thereafter by collecting long form type information throughout the decade rather than only once every 10 years.  Questionnaires are mailed to a sample of addresses to obtain information about households -- that is, about each person and the housing unit itself.  The American Community Survey produces demographic, social, housing and economic estimates in the form of 1-year, 3-year and 5-year estimates based on population thresholds. The strength of the ACS is in estimating population and housing characteristics. It produces estimates for small areas, including census tracts and population subgroups.  Although the ACS produces population, demographic and housing unit estimates,it is the Census Bureau's Population Estimates Program that produces an

The base module allows you to establish a connection with the desired API that will be used later to acquire data.

In [4]:
con = cen.base.Connection(dataset)
con

Connection to 2012 American Community Survey: 1-Year Estimates (ID: http://api.census.gov/data/id/2012acs1)

## Getting Data

### Geographical specification

Cenpy uses [FIPS codes](https://www.census.gov/geo/reference/codes/cou.html) to specify the geographical extent of the data to be downloaded. The object `con` is our connection to the API, and its `geographies` attribute is a dictionary whose `fips` entry lists the geographic levels available for the dataset.

In [5]:
print(type(con))
print(type(con.geographies))
con.geographies.keys()

<class 'cenpy.remote.APIConnection'>
<class 'dict'>


dict_keys(['fips'])

In [6]:
con.geographies['fips']

Unnamed: 0,geoLevelId,name,optionalWithWCFor,requires
0,500.0,congressional district,state,[state]
1,60.0,county subdivision,,"[state, county]"
2,795.0,public use microdata area,,[state]
3,320.0,metropolitan statistical area/micropolitan sta...,,[state]
4,310.0,metropolitan statistical area/micropolitan sta...,,
5,160.0,place,state,[state]
6,50.0,county,state,[state]
7,,combined statistical area,,
8,,combined new england city and town area,,
9,,new england city and town area,,


`geo_unit` and `geo_filter` are both required arguments to the `query()` function. `geo_unit` specifies the geographic level at which data is returned. `geo_filter` then restricts the query to an enclosing geography, so that no more data than necessary is downloaded. The following example will download data from all counties in Colorado (state FIPS codes are accessible [here](https://www.mcc.co.mercer.pa.us/dps/state_fips_code_listing.htm)).

In [7]:
g_unit = 'county:*'
g_filter = {'state':'8'}
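
For orientation, the Census Bureau's API expresses the same idea with `for` and `in` request parameters: `for` names the unit to return and `in` names the enclosing geography. The sketch below shows how a `geo_unit` and `geo_filter` pair could map onto those parameters. `build_params` is a hypothetical helper written for illustration, and it is an assumption that cenpy performs a similar translation internally.

```python
from urllib.parse import urlencode


def build_params(cols, geo_unit, geo_filter):
    """Map query arguments onto Census-API-style request parameters."""
    params = {'get': ','.join(cols), 'for': geo_unit}
    if geo_filter:
        # Several enclosing geographies are joined with spaces,
        # e.g. "state:08 county:013".
        params['in'] = ' '.join(f'{k}:{v}' for k, v in geo_filter.items())
    return params


# '08' is Colorado's two-digit state FIPS code (the tutorial passes it as '8').
params = build_params(['NAME', 'B01001A_001E'], 'county:*', {'state': '08'})
print(params)

# safe=':*,' keeps the geography syntax readable in the encoded string.
print(urlencode(params, safe=':*,'))
```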

### Specifying variables to extract

The other argument taken by `query()` is `cols`, a list of column names taken from the variables of the API. These variables can be displayed using the `variables` attribute; however, because there are so many of them, it is easier to use the [Social Explorer](https://www.socialexplorer.com/) site to find the data you are interested in.

In [8]:
var = con.variables
print('Number of variables in', dataset, ':', len(var))

Number of variables in 2012acs1 : 68401


In [9]:
# print first 30 variables
con.variables[:30]

Unnamed: 0,concept,label,predicateOnly,predicateType
AIANHH,,American Indian Area/Alaska Native Area/Hawaii...,,
AIANHHFP,,American Indian Area/Alaska Native Area/Hawaii...,,
AIHHTLI,,American Indian Trust Land/Hawaiian Home Land ...,,
AITS,,American Indian Tribal Subdivision (FIPS),,
AITSCE,,American Indian Tribal Subdivision (Census),,
ANRC,,Alaska Native Regional Corporation (FIPS),,
B00001_001E,B00001. Unweighted Sample Count of the Popula...,Total,,
B00002_001E,B00002. Unweighted Sample Housing Units,Total,,
B01001A_001E,B01001A. SEX BY AGE (WHITE ALONE),Total:,,
B01001A_001M,B01001A. SEX BY AGE (WHITE ALONE),Margin of Error for!!Total:,,


Related columns of data always start with the same prefix, so cenpy includes a function, `varslike`, that creates a list of column names matching the input pattern. It is also useful to add the `NAME` and `GEOID` columns, as these provide the name and geographic ID of each row. In this example, we will use table [B01001A](https://www.socialexplorer.com/data/ACS2013/metadata/?ds=ACS13&table=B01001A), which gives data for sex by age (White alone) within the desired geography. The three-digit number at the end of each column name identifies a sex and age group.

In [10]:
cols = con.varslike('B01001A_')
cols.extend(['NAME', 'GEOID'])
cols

['B01001A_001E',
 'B01001A_001M',
 'B01001A_002E',
 'B01001A_002M',
 'B01001A_003E',
 'B01001A_003M',
 'B01001A_004E',
 'B01001A_004M',
 'B01001A_005E',
 'B01001A_005M',
 'B01001A_006E',
 'B01001A_006M',
 'B01001A_007E',
 'B01001A_007M',
 'B01001A_008E',
 'B01001A_008M',
 'B01001A_009E',
 'B01001A_009M',
 'B01001A_010E',
 'B01001A_010M',
 'B01001A_011E',
 'B01001A_011M',
 'B01001A_012E',
 'B01001A_012M',
 'B01001A_013E',
 'B01001A_013M',
 'B01001A_014E',
 'B01001A_014M',
 'B01001A_015E',
 'B01001A_015M',
 'B01001A_016E',
 'B01001A_016M',
 'B01001A_017E',
 'B01001A_017M',
 'B01001A_018E',
 'B01001A_018M',
 'B01001A_019E',
 'B01001A_019M',
 'B01001A_020E',
 'B01001A_020M',
 'B01001A_021E',
 'B01001A_021M',
 'B01001A_022E',
 'B01001A_022M',
 'B01001A_023E',
 'B01001A_023M',
 'B01001A_024E',
 'B01001A_024M',
 'B01001A_025E',
 'B01001A_025M',
 'B01001A_026E',
 'B01001A_026M',
 'B01001A_027E',
 'B01001A_027M',
 'B01001A_028E',
 'B01001A_028M',
 'B01001A_029E',
 'B01001A_029M',
 'B01001A_030E
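
The column names follow a regular pattern: a table ID, an underscore, a three-digit line number, and a one-letter suffix. Judging from the labels in the variables table above, `E` marks an estimate and `M` its margin of error. The sketch below splits names into those parts. `parse_variable` is a hypothetical helper, and the regular expression only covers the name shapes seen in this tutorial.

```python
import re

# Table ID (e.g. B01001A), three-digit line number, then E or M.
VAR_PATTERN = re.compile(
    r'^(?P<table>[A-Z]\d{5}[A-Z]?)_(?P<line>\d{3})(?P<kind>[EM])$'
)
KINDS = {'E': 'estimate', 'M': 'margin of error'}


def parse_variable(name):
    """Split a name such as 'B01001A_002M' into (table, line, kind), or return None."""
    match = VAR_PATTERN.match(name)
    if match is None:
        return None  # e.g. 'NAME' or 'GEOID'
    return match['table'], int(match['line']), KINDS[match['kind']]


sample_cols = ['B01001A_001E', 'B01001A_001M', 'B01001A_002E', 'NAME', 'GEOID']
parsed = {c: parse_variable(c) for c in sample_cols}

# Keep only the estimate columns, dropping margins of error and labels.
estimates = [c for c, p in parsed.items() if p and p[2] == 'estimate']
print(estimates)
```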

With the three necessary arguments, data can be downloaded and saved as a pandas dataframe.

In [11]:
data = con.query(cols, geo_unit=g_unit, geo_filter=g_filter)
# spits a deprecation warning because of how cenpy calls pandas

  df[cols] = df[cols].convert_objects(convert_numeric=convert_numeric)


It is useful to replace the default index with the data from the `NAME` or `GEOID` column, as these will give a more useful description of the data.

In [12]:
data.index = data.NAME
data

Unnamed: 0_level_0,B01001A_001E,B01001A_001M,B01001A_002E,B01001A_002M,B01001A_003E,B01001A_003M,B01001A_004E,B01001A_004M,B01001A_005E,B01001A_005M,...,B01001A_028E,B01001A_028M,B01001A_029E,B01001A_029M,B01001A_030E,B01001A_030M,B01001A_031E,B01001A_031M,NAME,GEOID
NAME,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
"Adams County, Colorado",380017,6171,190078,3378,14551,903,16003,1466,13330,1343,...,20219,454,11970,159,6558,671,2483,670,"Adams County, Colorado",05000US08001
"Arapahoe County, Colorado",445459,5621,218157,3358,13536,742,15491,1398,13177,1520,...,31661,390,17581,213,10082,701,5125,688,"Arapahoe County, Colorado",05000US08005
"Boulder County, Colorado",270200,3059,136575,1499,6977,321,6985,715,8421,847,...,18213,166,9613,195,5176,636,2985,645,"Boulder County, Colorado",05000US08013
"Denver County, Colorado",479745,6561,241272,3883,17073,954,13994,1779,10457,1465,...,26702,493,15788,523,10250,654,5408,650,"Denver County, Colorado",05000US08031
"Douglas County, Colorado",271346,1601,135162,1093,9516,354,12390,1074,11748,1049,...,16373,249,8144,230,3486,384,1177,366,"Douglas County, Colorado",05000US08035
"El Paso County, Colorado",527453,4503,264361,2801,17916,599,18957,1465,17433,1477,...,33074,478,19306,508,10462,861,4738,880,"El Paso County, Colorado",05000US08041
"Jefferson County, Colorado",495398,3342,246793,1492,12941,361,14763,1470,15163,1344,...,38198,422,21943,149,12204,737,5601,733,"Jefferson County, Colorado",05000US08059
"Larimer County, Colorado",281231,3020,139420,1665,7459,454,8569,802,7141,1004,...,19663,309,11539,177,6522,687,3344,654,"Larimer County, Colorado",05000US08069
"Mesa County, Colorado",136439,1926,67852,1173,4594,780,4456,918,4048,839,...,9718,309,6246,225,3524,567,2565,517,"Mesa County, Colorado",05000US08077
"Pueblo County, Colorado",128522,4049,62431,2107,3472,550,3488,750,4246,940,...,9469,522,6629,244,4158,529,2018,472,"Pueblo County, Colorado",05000US08101

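Each estimate column (`E`) in this table is paired with a margin of error column (`M`). ACS margins of error are published at the 90 percent confidence level, so the estimate plus or minus the margin gives a 90 percent interval, and dividing the margin by 1.645 gives an approximate standard error. A minimal worked example using the Adams County total from the table above (`moe_summary` is a hypothetical helper introduced here):

```python
Z_90 = 1.645  # z-score behind the ACS 90 percent margins of error


def moe_summary(estimate, margin):
    """Return the 90 percent interval and approximate standard error."""
    return {
        'lower': estimate - margin,
        'upper': estimate + margin,
        'std_error': margin / Z_90,
    }


# B01001A_001E and B01001A_001M for Adams County, Colorado
adams = moe_summary(380017, 6171)
print(adams['lower'], adams['upper'], round(adams['std_error'], 1))
```
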

### Topologically Integrated Geographic Encoding and Referencing (TIGER) data

The Census [TIGER API](https://www.census.gov/geo/maps-data/data/tiger.html) provides geometries for desired geographic regions. For instance, we may want additional information on each county, such as its area.

In [13]:
cen.tiger.available()

[{'name': 'AIANNHA', 'type': 'MapServer'},
 {'name': 'CBSA', 'type': 'MapServer'},
 {'name': 'Hydro_LargeScale', 'type': 'MapServer'},
 {'name': 'Hydro', 'type': 'MapServer'},
 {'name': 'Labels', 'type': 'MapServer'},
 {'name': 'Legislative', 'type': 'MapServer'},
 {'name': 'Places_CouSub_ConCity_SubMCD', 'type': 'MapServer'},
 {'name': 'PUMA_TAD_TAZ_UGA_ZCTA', 'type': 'MapServer'},
 {'name': 'Region_Division', 'type': 'MapServer'},
 {'name': 'School', 'type': 'MapServer'},
 {'name': 'Special_Land_Use_Areas', 'type': 'MapServer'},
 {'name': 'State_County', 'type': 'MapServer'},
 {'name': 'tigerWMS_ACS2013', 'type': 'MapServer'},
 {'name': 'tigerWMS_ACS2014', 'type': 'MapServer'},
 {'name': 'tigerWMS_ACS2015', 'type': 'MapServer'},
 {'name': 'tigerWMS_Census2010', 'type': 'MapServer'},
 {'name': 'tigerWMS_Current', 'type': 'MapServer'},
 {'name': 'tigerWMS_Econ2012', 'type': 'MapServer'},
 {'name': 'tigerWMS_PhysicalFeatures', 'type': 'MapServer'},
 {'name': 'Tracts_Blocks', 'type': 'Ma

First, you must establish a connection to the TIGER API; then you can display the available layers. No TIGER data is available for ACS 2012, so we will use ACS 2013 for the sake of example, but ideally you will be able to find TIGER data that corresponds to your dataset.

In [14]:
con.set_mapservice('tigerWMS_ACS2013')
con.mapservice.layers

{0: (ESRILayer) 2010 Census Public Use Microdata Areas,
 1: (ESRILayer) 2010 Census Public Use Microdata Areas Labels,
 2: (ESRILayer) 2010 Census ZIP Code Tabulation Areas,
 3: (ESRILayer) 2010 Census ZIP Code Tabulation Areas Labels,
 4: (ESRILayer) Tribal Census Tracts,
 5: (ESRILayer) Tribal Census Tracts Labels,
 6: (ESRILayer) Tribal Block Groups,
 7: (ESRILayer) Tribal Block Groups Labels,
 8: (ESRILayer) Census Tracts,
 9: (ESRILayer) Census Tracts Labels,
 10: (ESRILayer) Census Block Groups,
 11: (ESRILayer) Census Block Groups Labels,
 12: (ESRILayer) Unified School Districts,
 13: (ESRILayer) Unified School Districts Labels,
 14: (ESRILayer) Secondary School Districts,
 15: (ESRILayer) Secondary School Districts Labels,
 16: (ESRILayer) Elementary School Districts,
 17: (ESRILayer) Elementary School Districts Labels,
 18: (ESRILayer) Estates,
 19: (ESRILayer) Estates Labels,
 20: (ESRILayer) County Subdivisions,
 21: (ESRILayer) County Subdivisions Labels,
 22: (ESRILayer) 

The data retrieved earlier was at the county level, so we will use layer 84.

In [15]:
con.mapservice.layers[84]

(ESRILayer) Counties

Now, using the TIGER connection, `query()` can retrieve the data, taking the layer and the geographic location as arguments.

In [16]:
geodata = con.mapservice.query(layer=84, where='STATE=8')

In [17]:
geodata

Unnamed: 0,AREALAND,AREAWATER,BASENAME,CENTLAT,CENTLON,COUNTY,COUNTYCC,COUNTYNS,FUNCSTAT,GEOID,INTPTLAT,INTPTLON,LSADC,MTFCC,NAME,OBJECTID,OID,STATE,geometry
0,1881237983,36592000,Boulder,+40.0924502,-105.3577112,013,H1,00198122,A,08013,+40.0949699,-105.3976911,06,G4020,Boulder County,512,27553701435070,08,<pysal.cg.shapes.Polygon object at 0x11484cd30>
1,396290895,4208401,Denver,+39.7620189,-104.8765880,031,H6,00198131,C,08031,+39.7618502,-104.8811054,06,G4020,Denver County,529,27553700234321,08,<pysal.cg.shapes.Polygon object at 0x114a21550>
2,6179976050,30284242,Pueblo,+38.1732359,-104.5127778,101,H1,00198166,A,08101,+38.1706581,-104.4898924,06,G4020,Pueblo County,653,27553704502959,08,<pysal.cg.shapes.Polygon object at 0x117f53748>
3,85478497,1411781,Broomfield,+39.9541268,-105.0527108,014,H6,01945881,C,08014,+39.9533016,-105.0520384,06,G4020,Broomfield County,662,27553700234320,08,<pysal.cg.shapes.Polygon object at 0x118409320>
4,2958007403,16886462,Delta,+38.8613998,-107.8631974,029,H1,00198130,A,08029,+38.8617559,-107.8647570,06,G4020,Delta County,1015,27553332290708,08,<pysal.cg.shapes.Polygon object at 0x118474470>
5,4605714129,8166134,Cheyenne,+38.8281780,-102.6034141,017,H1,00198124,A,08017,+38.8353865,-102.6045852,06,G4020,Cheyenne County,182,27553535192644,08,<pysal.cg.shapes.Polygon object at 0x1184d0588>
6,3332684865,5204095,San Miguel,+38.0042019,-108.4056970,113,H1,00198172,A,08113,+38.0093735,-108.4273260,06,G4020,San Miguel County,589,27553365483912,08,<pysal.cg.shapes.Polygon object at 0x118585dd8>
7,7634475784,21422730,Garfield,+39.6003285,-107.9052109,045,H1,00198138,A,08045,+39.5993517,-107.9097802,06,G4020,Garfield County,1029,27553339760720,08,<pysal.cg.shapes.Polygon object at 0x1185cb2e8>
8,1003638261,2035929,San Juan,+37.7640122,-107.6762274,111,H1,00198171,A,08111,+37.7810745,-107.6702566,06,G4020,San Juan County,220,27553364782268,08,<pysal.cg.shapes.Polygon object at 0x118652278>
9,5255909595,27403242,Montezuma,+37.3382920,-108.5970192,083,H1,00198157,A,08083,+37.3380247,-108.5957864,06,G4020,Montezuma County,780,275531343259915,08,<pysal.cg.shapes.Polygon object at 0x1186e1198>
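
The `AREALAND` and `AREAWATER` columns hold areas in square meters, so a small conversion makes them easier to read. The sketch below uses three rows copied from the table above as plain tuples. `SQ_M_PER_SQ_KM` and `area_summary` are names introduced here for illustration.

```python
SQ_M_PER_SQ_KM = 1_000_000

# (BASENAME, AREALAND, AREAWATER) copied from the query result above
rows = [
    ('Boulder', 1881237983, 36592000),
    ('Denver', 396290895, 4208401),
    ('Pueblo', 6179976050, 30284242),
]


def area_summary(land_m2, water_m2):
    """Return land area in km^2 and the share of total area that is water."""
    total = land_m2 + water_m2
    return land_m2 / SQ_M_PER_SQ_KM, water_m2 / total


for name, land, water in rows:
    land_km2, water_share = area_summary(land, water)
    print(f'{name}: {land_km2:,.0f} km^2 land, {water_share:.1%} water')
```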


This data can now be merged with the original data to create one pandas dataframe containing all of the relevant data.

In [18]:
newdata = pd.merge(data, geodata, left_on='county', right_on='COUNTY')
newdata

Unnamed: 0,B01001A_001E,B01001A_001M,B01001A_002E,B01001A_002M,B01001A_003E,B01001A_003M,B01001A_004E,B01001A_004M,B01001A_005E,B01001A_005M,...,GEOID_y,INTPTLAT,INTPTLON,LSADC,MTFCC,NAME_y,OBJECTID,OID,STATE,geometry
0,380017,6171,190078,3378,14551,903,16003,1466,13330,1343,...,8001,39.8743252,-104.3318718,6,G4020,Adams County,1226,27553700234319,8,<pysal.cg.shapes.Polygon object at 0x1187e6f28>
1,445459,5621,218157,3358,13536,742,15491,1398,13177,1520,...,8005,39.6445537,-104.3317065,6,G4020,Arapahoe County,2980,27553703789414,8,<pysal.cg.shapes.Polygon object at 0x11ca6f780>
2,270200,3059,136575,1499,6977,321,6985,715,8421,847,...,8013,40.0949699,-105.3976911,6,G4020,Boulder County,512,27553701435070,8,<pysal.cg.shapes.Polygon object at 0x11484cd30>
3,479745,6561,241272,3883,17073,954,13994,1779,10457,1465,...,8031,39.7618502,-104.8811054,6,G4020,Denver County,529,27553700234321,8,<pysal.cg.shapes.Polygon object at 0x114a21550>
4,271346,1601,135162,1093,9516,354,12390,1074,11748,1049,...,8035,39.325414,-104.9259871,6,G4020,Douglas County,2762,27553711656416,8,<pysal.cg.shapes.Polygon object at 0x118d5df28>
5,527453,4503,264361,2801,17916,599,18957,1465,17433,1477,...,8041,38.8273831,-104.5274718,6,G4020,El Paso County,2878,27553704502958,8,<pysal.cg.shapes.Polygon object at 0x11b848ac8>
6,495398,3342,246793,1492,12941,361,14763,1470,15163,1344,...,8059,39.5795106,-105.2454623,6,G4020,Jefferson County,540,27553702223972,8,<pysal.cg.shapes.Polygon object at 0x11a989208>
7,281231,3020,139420,1665,7459,454,8569,802,7141,1004,...,8069,40.6580933,-105.4867638,6,G4020,Larimer County,1691,27553352525178,8,<pysal.cg.shapes.Polygon object at 0x11ca235f8>
8,136439,1926,67852,1173,4594,780,4456,918,4048,839,...,8077,39.0194206,-108.4618935,6,G4020,Mesa County,2925,27553356301770,8,<pysal.cg.shapes.Polygon object at 0x11c017b38>
9,128522,4049,62431,2107,3472,550,3488,750,4246,940,...,8101,38.1706581,-104.4898924,6,G4020,Pueblo County,653,27553704502959,8,<pysal.cg.shapes.Polygon object at 0x117f53748>
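
Once both tables share a row per county, derived measures take only a line or two. The sketch below rebuilds a tiny version of the merge with values copied from the tables above and computes people per square kilometer of land. Table B01001A counts the White-alone population, so this is a density for that group rather than for all residents. The zero-padding step is an assumption about how the join keys might need to be aligned if the two sources format county codes differently.

```python
import pandas as pd

census = pd.DataFrame({
    'county': ['13', '31', '101'],           # county codes, possibly unpadded
    'B01001A_001E': [270200, 479745, 128522],
})
tiger = pd.DataFrame({
    'COUNTY': ['013', '031', '101'],         # three-digit county FIPS codes
    'NAME': ['Boulder County', 'Denver County', 'Pueblo County'],
    'AREALAND': [1881237983, 396290895, 6179976050],  # square meters
})

# Align the key format before joining; validate guards against duplicate keys.
census['county'] = census['county'].str.zfill(3)
merged = pd.merge(census, tiger, left_on='county', right_on='COUNTY',
                  how='inner', validate='one_to_one')

# Convert square meters to square kilometers before dividing.
merged['density_per_km2'] = merged['B01001A_001E'] / (merged['AREALAND'] / 1e6)
print(merged[['NAME', 'density_per_km2']].round(1))
```

Using `how='inner'` keeps only counties present in both tables, which matters here because the ACS query above returned fewer counties than the TIGER layer.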
