This is a set of quick and dirty scripts for loading census ACS data into a postgres database.
I've tested it on ACS2010_5yr data, ACS2011_5yr, and ACS2011_5yr data, using ruby 1.9.3, and ruby 2.3.0.
Ruby 1.8.7 will probably work, but given how painfully slow the CSV module is,
you'll want to take steps to update that to FasterCSV. Other databases may
well work, as long as you specify :use_copy => false where appropriate, but I
have not tested it.
Sequel version 4.x does not work, and I have not yet investigated why.
The only gems required are 'pg' and 'sequel'.
There is one caveat, you must download the Sequence_Number_and_Table_Number_Lookup.txt file as well as any census data. This table is used to generate two tables, which the importer script then uses to create and import the rest of the tables.
Look at the import_census.rb script to get the workflow.
It is slow, and I'm sorry I can't help much with that – blame the ruby CSV parser.
Please read Appendix C of the ACS Technical Documentation. Actually, read the whole thing.
Because in postgres we are storing the fields as numbers, we cannot use "." as a jam value to indicate that the sample is too small, as is the case in the distributed files.. Instead, we have replaced that with '-2'.
All numbers are stored as integers or doubles where possible. Otherwise as double precision floats.
The tables are named after the associated table name as displayed in
FactFinder, and the columns are similary
cryptically named. Getting the right table and column can be an excercise in
frustration. However, looking at the census_column_lookup table, generated by
create_lookup_tables() will get you pointed in the right direction.
In order to get data for a specific geographic area, you must join through the geoheader table.
Let's say we are trying to get the population for men and women for each county in
the state of the California. By looking through census_column_lookup or the
ACS website, we find that the table we are after is "B01001": "SEX BY AGE". That gives us (edited down for space):
SELECT "table_id",
"column_id",
"column_title",
"topic",
"table_universe",
"subject_area"
FROM "acs2011_5yr"."census_column_lookup"
WHERE "table_id"='B01001';| table_id | column_id | column_title | topic | table_universe | subject_area |
|---|---|---|---|---|---|
| b01001 | b010010001 | Total: | SEX BY AGE | Total population | [NULL] |
| b01001 | b010010002 | Male: | SEX BY AGE | Total population | [NULL] |
| b01001 | b010010003 | Under 5 years | SEX BY AGE | Total population | [NULL] |
| b01001 | b010010004 | 5 to 9 years | SEX BY AGE | Total population | [NULL] |
| b01001 | b010010026 | Female: | SEX BY AGE | Total population | [NULL] |
| b01001 | b010010027 | Under 5 years | SEX BY AGE | Total population | [NULL] |
| b01001 | b010010028 | 5 to 9 years | SEX BY AGE | Total population | [NULL] |
| b01001 | b010010029 | 10 to 14 years | SEX BY AGE | Total population | [NULL] |
Unfortunately, the Census does not see fit to break this down hierarchically,
but you can see that in some column_title columns the entry ends with a ':' – this
tells us that what follows is a subset of a previous column that ended with a
':', but it doesn't (annoyingly) tell you which one, so you'll have to use your
intuition, or look the table up on FactFinder, which displays the headings
appropriately indented.
So, from this we can see that the male population is in column "b010010002" and the female population in column "b010010026". (If we had wanted the female population aged 10-14, then we could have chosen "B010010029").
So to get those population columns we could do:
SELECT "b010010002" AS male_pop,
"b010010026" AS female_pop
FROM "acs2011_5yr"."b01001" AS pop;However, that does not tell us anything about what geographical area that
number refers to. In order to do that, we need to join to the geoheader
table to the data table using the logrecno field and on the state
abbreviation (national level data uses the abbreviation 'us').
I personally do a LEFT JOIN from the geoheader table to the
desired, which is especially critical if you want to join more than
one data table, just in case there is data missing from the table. For example:
SELECT "geo"."name",
"b010010002" AS male_pop,
"b010010026" AS female_pop
FROM "acs2011_5yr"."geoheader" AS geo
LEFT OUTER JOIN "acs2011_5yr"."b01001" AS pop ON (geo.logrecno=pop.logrecno
AND geo.stusab=pop.stusab);Finally, if we want to limit the data to just california counties, then we
must add the appropriate WHERE clauses:
SELECT "geo"."name",
"b010010002" AS male_pop,
"b010010026" AS female_pop
FROM "acs2011_5yr"."geoheader" AS geo
LEFT OUTER JOIN "acs2011_5yr"."b01001" AS pop ON ("geo"."logrecno"="pop"."logrecno"
AND "geo"."stusab"="pop"."stusab")
WHERE geo.stusab='ca'
AND geo.sumlevel = '50';If you have loaded Census TIGER data and want to join to that, then you must also join on
the geoid_tiger column (not the geoid column, because that would be too easy).
SELECT "geo"."name",
"b010010002" AS male_pop,
"b010010026" AS female_pop,
"co"."the_geom"
FROM "acs2011_5yr"."geoheader" AS geo
LEFT OUTER JOIN "acs2011_5yr"."b01001" AS pop ON ("geo"."logrecno"="pop"."logrecno"
AND "geo"."stusab"="pop"."stusab")
JOIN "tiger"."county" co ON co.geoid = geo.geoid_tiger
WHERE geo.stusab='ca'
AND geo.sumlevel = '50';Finally, if you want to get the margin of error for any estimate, those are located
in the TABLE_moe table with the same column name:
SELECT "geo"."name",
"pop"."b010010002" AS male_pop,
"pop_moe"."b010010002" AS male_pop_moe,
"pop"."b010010026" AS female_pop,
"pop_moe"."b010010026" AS female_pop_moe,
"co"."the_geom"
FROM "acs2011_5yr"."geoheader" AS geo
LEFT OUTER JOIN "acs2011_5yr"."b01001" AS pop ON ("geo"."logrecno"="pop"."logrecno"
AND "geo"."stusab"="pop"."stusab")
LEFT OUTER JOIN "acs2011_5yr"."b01001_moe" AS pop_moe ON ("geo"."logrecno"="pop_moe"."logrecno"
AND "geo"."stusab"="pop_moe"."stusab")
JOIN "tiger"."county" co ON co.geoid = geo.geoid_tiger
WHERE geo.stusab='ca'
AND geo.sumlevel = '50';