A quick and dirty Ruby class/script for importing Census ACS data

Overview

This is a set of quick and dirty scripts for loading census ACS data into a postgres database.

I've tested it on ACS2010_5yr and ACS2011_5yr data, using Ruby 1.9.3 and Ruby 2.3.0. Ruby 1.8.7 will probably work, but given how painfully slow its CSV module is, you'll want to take steps to update it to FasterCSV. Other databases may well work, as long as you specify :use_copy => false where appropriate, but I have not tested them.

Sequel version 4.x does not work, and I have not yet investigated why.

The only gems required are 'pg' and 'sequel'.
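
If you manage gems with Bundler, a minimal Gemfile sketch might look like the one below. The version pin on sequel is an assumption based on the Sequel 4.x note above; the repository's own Gemfile and Gemfile.lock are the authoritative source.

source 'https://rubygems.org'

gem 'pg'              # PostgreSQL driver
gem 'sequel', '< 4.0' # Sequel 4.x is known not to work with these scripts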

There is one caveat: you must download the Sequence_Number_and_Table_Number_Lookup.txt file in addition to the census data itself. That file is used to generate two lookup tables, which the importer script then uses to create and import the rest of the tables.

Look at the import_census.rb script to see the actual workflow; a rough sketch follows below.
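
As a purely hypothetical sketch of that workflow (everything below except the Sequel calls and the create_lookup_tables() name is a placeholder; the real class and method names are defined under lib/ and demonstrated in import_census.rb):

require 'sequel'
require_relative 'lib/acs_importer'   # hypothetical path; see lib/ for the real files

# Connect with Sequel. Only Postgres has been tested; for other databases
# you would also pass :use_copy => false where appropriate.
DB = Sequel.connect('postgres://localhost/census')

importer = AcsImporter.new(DB, :schema => 'acs2011_5yr')   # hypothetical class name

# Build the two lookup tables from the downloaded
# Sequence_Number_and_Table_Number_Lookup.txt file...
importer.create_lookup_tables('Sequence_Number_and_Table_Number_Lookup.txt')

# ...then create and import the rest of the tables from the downloaded ACS data.
importer.import_tables('path/to/acs2011_5yr')              # hypothetical method name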

It is slow, and I'm sorry I can't help much with that; blame the Ruby CSV parser.

Caveats

Please read Appendix C of the ACS Technical Documentation. Actually, read the whole thing.

Because the fields are stored in postgres as numbers, we cannot use "." as a jam value to indicate that the sample is too small, as is the case in the distributed files. Instead, we have replaced it with '-2'.

All numbers are stored as integers where possible, and otherwise as double precision floats.
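
When aggregating, you will usually want to exclude those jam values so they don't skew sums and averages. A minimal sketch, using the b01001 table from the example further down:

SELECT AVG("b010010002") AS avg_male_pop
FROM "acs2011_5yr"."b01001"
WHERE "b010010002" <> -2;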

Using imported data

The tables are named after the associated table name as displayed in FactFinder, and the columns are similarly cryptically named. Getting the right table and column can be an exercise in frustration. However, looking at the census_column_lookup table, generated by create_lookup_tables(), will get you pointed in the right direction.

In order to get data for a specific geographic area, you must join through the geoheader table.

Example

Let's say we are trying to get the population of men and women for each county in the state of California. By looking through census_column_lookup or the ACS website, we find that the table we are after is "B01001": "SEX BY AGE". That gives us (edited down for space):

SELECT "table_id",
       "column_id",
       "column_title",
       "topic",
       "table_universe",
       "subject_area"
FROM "acs2011_5yr"."census_column_lookup"
WHERE "table_id"='B01001';
table_id | column_id  | column_title   | topic      | table_universe   | subject_area
---------+------------+----------------+------------+------------------+--------------
b01001   | b010010001 | Total:         | SEX BY AGE | Total population | [NULL]
b01001   | b010010002 | Male:          | SEX BY AGE | Total population | [NULL]
b01001   | b010010003 | Under 5 years  | SEX BY AGE | Total population | [NULL]
b01001   | b010010004 | 5 to 9 years   | SEX BY AGE | Total population | [NULL]
b01001   | b010010026 | Female:        | SEX BY AGE | Total population | [NULL]
b01001   | b010010027 | Under 5 years  | SEX BY AGE | Total population | [NULL]
b01001   | b010010028 | 5 to 9 years   | SEX BY AGE | Total population | [NULL]
b01001   | b010010029 | 10 to 14 years | SEX BY AGE | Total population | [NULL]

Unfortunately, the Census does not see fit to break this down hierarchically, but you can see that some column_title entries end with a ':'. That tells you that the columns which follow are subsets of a previous column ending with a ':', but it doesn't (annoyingly) tell you which one, so you'll have to use your intuition, or look the table up on FactFinder, which displays the headings appropriately indented.

So, from this we can see that the male population is in column "b010010002" and the female population in column "b010010026". (If we had wanted the female population aged 10-14, then we could have chosen "b010010029".)

So to get those population columns we could do:

SELECT "b010010002" AS male_pop,
       "b010010026" AS female_pop
FROM "acs2011_5yr"."b01001" AS pop;

However, that does not tell us anything about what geographical area each number refers to. To do that, we need to join the geoheader table to the data table using the logrecno field and the state abbreviation (national-level data uses the abbreviation 'us').

I personally do a LEFT JOIN from the geoheader table to the desired data table, just in case there is data missing from that table; this is especially important if you want to join more than one data table. For example:

SELECT "geo"."name",
       "b010010002" AS male_pop,
       "b010010026" AS female_pop
FROM "acs2011_5yr"."geoheader" AS geo
LEFT OUTER JOIN "acs2011_5yr"."b01001" AS pop ON (geo.logrecno=pop.logrecno
                                                  AND geo.stusab=pop.stusab);

Next, if we want to limit the data to just California counties, we must add the appropriate WHERE clauses:

SELECT "geo"."name",
       "b010010002" AS male_pop,
       "b010010026" AS female_pop
FROM "acs2011_5yr"."geoheader" AS geo
LEFT OUTER JOIN "acs2011_5yr"."b01001" AS pop ON ("geo"."logrecno"="pop"."logrecno"
                                                  AND "geo"."stusab"="pop"."stusab")
WHERE geo.stusab='ca'
  AND geo.sumlevel = '50';

If you have loaded Census TIGER data and want to join to that, then you must also join on the geoid_tiger column (not the geoid column, because that would be too easy).

SELECT "geo"."name",
       "b010010002" AS male_pop,
       "b010010026" AS female_pop,
       "co"."the_geom"
FROM "acs2011_5yr"."geoheader" AS geo
LEFT OUTER JOIN "acs2011_5yr"."b01001" AS pop ON ("geo"."logrecno"="pop"."logrecno"
                                                  AND "geo"."stusab"="pop"."stusab")
JOIN "tiger"."county" co ON co.geoid = geo.geoid_tiger
WHERE geo.stusab='ca'
  AND geo.sumlevel = '50';

Finally, if you want to get the margin of error for any estimate, it is stored in the corresponding TABLE_moe table (here, b01001_moe) under the same column name:

SELECT "geo"."name",
       "pop"."b010010002" AS male_pop,
       "pop_moe"."b010010002" AS male_pop_moe,
       "pop"."b010010026" AS female_pop,
       "pop_moe"."b010010026" AS female_pop_moe,
       "co"."the_geom"
FROM "acs2011_5yr"."geoheader" AS geo
LEFT OUTER JOIN "acs2011_5yr"."b01001" AS pop ON ("geo"."logrecno"="pop"."logrecno"
                                                  AND "geo"."stusab"="pop"."stusab")
LEFT OUTER JOIN "acs2011_5yr"."b01001_moe" AS pop_moe ON ("geo"."logrecno"="pop_moe"."logrecno"
                                                  AND "geo"."stusab"="pop_moe"."stusab")
JOIN "tiger"."county" co ON co.geoid = geo.geoid_tiger
WHERE geo.stusab='ca'
  AND geo.sumlevel = '50';