Skip to content
Branch: master
Find file History
Latest commit cfa35e6 Jan 30, 2020
Permalink
Type Name Latest commit message Commit time
..
Failed to load latest commit information.
.ipynb_checkpoints Minor revisions Jan 30, 2020
inputs Minor revisions Jan 30, 2020
outputs Minor revisions Jan 30, 2020
README.md Corrected minor errors, fixed metadata table block Jan 21, 2020
acs_first_try.py Updated notebook and created readme Jan 20, 2020
acs_second_try.py Updated notebook and created readme Jan 20, 2020
census_acs.ipynb Minor revisions Jan 30, 2020

README.md

American Community Survey

This script is a Python / Pandas notebook that pulls data from the Census Bureau's American Community Survey (ACS) 5-year series API for census tracts, public use microdata areas (PUMAs), and ZIP Code Tabulation Areas (ZCTAs). It was designed to be run iteratively, so that the output of each step of the process can be evaluated. Once data is retrieved from the API it's dumped to a json file, so that if the script has to be rerun from the beginning the API doesn't have to be called again.

This script requires an API key from the Census Bureau and one input file, an Excel spreadsheet that has lists of variables and geographies to retrieve. The NYC Geodatabase has two ACS tables for each geography: acs1 contains a subset of variables from ACS profile tables DP02 and DP03, and acs2 contains a subset of DP04 and DP05. The workbook has sheets for both acs1 and acs2 that correlate the names of the variables used in the NYC Geodatabase with the census variable names and labels. This was originally created using the variable metadata list from the Census Bureau for each of the four profile tables. The workbook also has sheets with geographic identifiers for each geography.

The names of the variable sheet (acs1 or acs2) and geography must be specified in the variable block of the notebook, to pull the specific variable set for a specific type of geography. The IDs for census variables and their labels may change from year to year, so the current variables from the API are read in first and compared to the variables from the Excel workbook. If there is a difference in the number of variables, variable ID codes, or variable labels, the notebook will identify these. The Excel sheet must be updated accordingly, and then the notebook can be rerun from the beginning. The most common change from year to year are labels that indicate dollar values according to year.

Variables can only be retrieved in chunks or blocks of 50 at a time, so the variable list is broken into pieces and is looped through until all variables are retrieved. As data is retrieved, it is appended to a list of nested lists (datalist), one for each geography. This is done in such a way to insure that data from subsequent chunks is appended to the appropriate geographic list / record. A function was written for retrieval.

Different types of geographies require different urls, and the urls return geographies in different quantities. To account for this, the retrieval function is called within loops for each type of geography, which are designed so that each record and block of data that's retrieved is appended correctly to the datalist, and the header row of variable names is only appended once at the outset (it is automatically returned with each API call). Census tracts are retrieved one county at a time, while PUMAs are retrieved for the entire state and are filtered later on in the script. ZCTAs must be retrieved one at a time, as they don't nest within other geography types.

Once the data is retrieved and inserted into a dataframe, a new GEOID2 column is created that holds the FIPS code without the summary level codes, and annotations or footnotes are replaced with NaN or NULL values. All columns are saved as an object type, so strings can be easily replaced with nulls. The names of the columns are replaced with the names used within the geodatabase.

At the end of the process the data is written out to a SQLite database; a CREATE TABLE statement is constructed using the dataframe column names and data types there were stored in the input spreadsheet. Data is always written to a test database and evaluated before being inserted into the final database. The final section of the notebook can be used for creating a metadata table that contains the variable names and labels, and the table where they're stored (acs1 or acs2). The metadata blocks should only be run for acs1 and acs2 for a single geography and not for all of them, as the metadata table is identical for all three (one metadata table applies to them all). To make it easier to read the metadata table contains one record for each variable grouping (estimate, margin of error, percent total, percent margin of error).

You can’t perform that action at this time.