| ACS Data File Structure |
Geography File:
Located At*.zip
Year = [xxxx] - Last year of the period
Period = [x] - Period length in years
State = [AB] - Standard state abbreviation
Filename = "g" + Year + Period + State + ".txt"
File Columns (name, description, field width, starting position):
STUSAB     State Postal Abbreviation  2   7
SUMLEVEL   Summary Level              3   9
COMPONENT  Geographic Component       2  12
LOGRECNO   Logical Record Number      7  14
Many other columns too numerous to enumerate here, but documented in section 2.4
Other columns include State, County, Tract, and Block Group, which can be used to map geographic areas to a Logical Record Number
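As a rough illustration of the column list above, here is a minimal Python sketch that slices one geography record into its fields. It assumes the two numbers after each description are the field width and the 1-based starting position, and that the file is fixed-width. The names GEO_FIELDS and parse_geo_line are hypothetical, and the sample record is made up, not real ACS data.

```python
# Assumed layout: (field name, width, 1-based start position), taken from
# the column list in this document. Reading the two numbers this way is an
# assumption.
GEO_FIELDS = [
    ("STUSAB", 2, 7),
    ("SUMLEVEL", 3, 9),
    ("COMPONENT", 2, 12),
    ("LOGRECNO", 7, 14),
]


def parse_geo_line(line):
    """Slice one fixed-width geography record into a dict of strings."""
    record = {}
    for name, width, start in GEO_FIELDS:
        begin = start - 1  # convert the 1-based position to a 0-based index
        record[name] = line[begin:begin + width].strip()
    return record


# Made-up sample record: 6 characters of file id, then the four fields.
sample = "ACSSF " + "AK" + "040" + "00" + "0000001"
print(parse_geo_line(sample))
```

The remaining geography columns (State, County, Tract, Block Group) could be added to GEO_FIELDS in the same way once their widths and positions are looked up in the documentation.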
Shape File:
Located at[FIPS-code]
[FIPS-code] is a somewhat arbitrary 2-digit code corresponding to a US state; see the link in the Resources section.
Index of files:
Relevant Metadata for each polygon:
- STATE - FIPS 5-2 state code
- COUNTY - 3-digit county code
- TRACT - 6-digit census tract code
- BLKGROUP - 1-digit block group code
NOTE: While the shapefile TRACT field does map directly
NOTE: Opening this file in OpenOffice (as of version 3.3) will cause it to crash. If you need to use OO, have a friend open it in Excel and export it as something else (LibreOffice 3.3.2 seems to work though).
Located at
Maps Column Names to Sequence Numbers and Row Locations
Each row corresponds to a column in one of the Estimate/MoE files, except some which are aggregates.
File Columns:
- File ID (fileid) - Always "ACSSF" as far as I can tell
- Table ID (tblid) - A unique ID for each row, used in -v flag of ACSImporter
- Sequence Numbers (seq) - Correspond to names of Estimate/MoE files
- Line Number (col) - Misnomer, this is really column number (sort of). The actual column number is
computed as (position + col - 1), with the first column as 1. Individual lines within an Estimate/MoE
file correspond to different locations by LOGRECNO.
- Position (position) - The starting position to compute the real column number as described above.
If position is undefined for the current row, it inherits from the one above it.
- Cells (cells1) - Total number of columns for aggregate column types. Redundant and useless.
- Total cells in sequence (cells2) - Total number of cells for a sequence (only appears in one row for
a given sequence number). Redundant and useless.
- Title (title) - Column Description
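The column arithmetic described for the col and position fields can be sketched as follows. This is a minimal Python illustration, assuming the spreadsheet rows have already been read into dicts with the keys tblid, col and position, and that empty cells are None. The function name actual_columns and the sample rows are hypothetical.

```python
def actual_columns(rows):
    """Return (tblid, col, actual_column) for every row with a line number.

    A row without its own position inherits the most recent one seen above
    it; the actual column is position + col - 1 (first column is 1).
    """
    position = None
    results = []
    for row in rows:
        if row.get("position") is not None:
            position = row["position"]  # remember the latest start position
        col = row.get("col")
        if col is None or position is None:
            continue  # header rows, or rows seen before any position
        results.append((row["tblid"], col, position + col - 1))
    return results


# Made-up rows in the shape assumed above.
rows = [
    {"tblid": "B01001", "col": None, "position": 7},
    {"tblid": "B01001", "col": None, "position": None},
    {"tblid": "B01001", "col": 1, "position": None},
    {"tblid": "B01001", "col": 2, "position": None},
]
print(actual_columns(rows))
```

With a starting position of 7, line 1 lands in column 7, which agrees with the Estimate/MoE schema where the estimates begin at field 7.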
In addition to the columns, there are several "types" of rows. There are normal ones, which have
(fileid, tblid, seq, col, title) fields but are missing (position, cells1, cells2). These correspond to actual
numbers in actual files and are basically what we want. They are unfortunately missing essential information in
the title which is provided by header rows.
There are two types of header rows, Measurement and Demographic. Measurement headers are the only rows with position
defined and always have cells1 defined as well. If it is the last measurement header for a sequence number, it will also
have cells2 defined. Their title fields are in all caps and describe the type of measurement for all their children,
sometimes followed by the demographic in parentheses (e.g. "SEX BY AGE"). Demographic headers always (?) immediately follow
Measurement headers (if they are present at all; unknown if they can be absent). They don't have entries for (col,
position, cells1, cells2), and their titles always start with "Universe: " followed by a description of the demographic
(e.g. "Universe: People reporting single ancestry"). Thus, to get the full information about what a row really means,
you need the row's title as well as the titles for its Measurement and Demographic headers.
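A minimal Python sketch of the row classification described above, and of attaching the Measurement and Demographic titles to each normal row. It assumes rows are dicts with the keys title, col and position, with None for empty cells. The names classify and full_titles, the joined title format, and the sample rows are all hypothetical.

```python
def classify(row):
    """Return 'measurement', 'demographic' or 'normal' for a lookup row."""
    if row.get("position") is not None:
        return "measurement"  # only Measurement headers define position
    if row.get("col") is None and row["title"].startswith("Universe: "):
        return "demographic"
    return "normal"


def full_titles(rows):
    """Prefix each normal row's title with its header titles."""
    measurement = demographic = None
    titles = []
    for row in rows:
        kind = classify(row)
        if kind == "measurement":
            # A new Measurement header resets the Demographic header.
            measurement, demographic = row["title"], None
        elif kind == "demographic":
            demographic = row["title"]
        else:
            parts = [p for p in (measurement, demographic, row["title"]) if p]
            titles.append(", ".join(parts))
    return titles


# Made-up rows in the shape assumed above.
rows = [
    {"title": "SEX BY AGE", "col": None, "position": 7},
    {"title": "Universe: Total population", "col": None, "position": None},
    {"title": "Total:", "col": 1, "position": None},
    {"title": "Male:", "col": 2, "position": None},
]
for title in full_titles(rows):
    print(title)
```

This only recovers the two header levels; it does not attempt to rebuild any deeper nesting among the normal rows.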
EDIT: We have since discovered that the titles for this file are essentially a tree, with headers as nodes and the actual titles as leaves. Unfortunately, it is nearly impossible to extract this tree structure, as there is no indication of parent-child relationship, and the only indication of node depth is encoded in the formatting of the cell, which we cannot easily extract.
NOTE: Opening this file in OpenOffice (as of version 3.3) will cause it to crash. If you need to use OO, have a friend open it in Excel and export it as something else.
Located at
Maps Column Names to Table IDs
Structure is the same as Sequence_Number_and_Table_Number_Lookup.xls but has slightly different columns
File Columns:
- Table ID (tblid) - Same as Sequence_Number_and_Table_Number_Lookup.xls tblid
- Line (col) - Same as Sequence_Number_and_Table_Number_Lookup.xls col
- Unique ID (id) - Equal to tblid + (3 digit int)(col). Maps directly to the ids in the first row of the 2005-2009_SummaryFileXLS files, except the ids in those files are equal to tblid + '_' + (3 digit int)(col)
- Stub (title) - Same as Sequence_Number_and_Table_Number_Lookup.xls title
This file is useful for mapping desired column names to ids (the Unique ID column). The location for each unique id is extracted from the 2005-2009_SummaryFileXLS files. Unfortunately, it shares the same naming deficiencies that Sequence_Number_and_Table_Number_Lookup.xls does, so this must be done manually, at least for the moment.
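The two id formats described above differ only by an underscore, so converting between them is mechanical. A minimal Python sketch, assuming "3 digit int" means the line number zero-padded to three digits (as in the example B08406_002 used elsewhere in this document); the function names are hypothetical.

```python
def table_shell_id(tblid, col):
    """Unique ID as used in the table shells file: tblid + 3-digit col."""
    return f"{tblid}{col:03d}"


def summary_file_id(tblid, col):
    """ID as used in the first row of the Seq[seq].xls files."""
    return f"{tblid}_{col:03d}"


def shell_to_summary(unique_id):
    """Insert the underscore before the trailing 3-digit line number."""
    return f"{unique_id[:-3]}_{unique_id[-3:]}"


print(table_shell_id("B08406", 2))
print(summary_file_id("B08406", 2))
print(shell_to_summary("B08406002"))
```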
2005-2009_SummaryFileXLS files
These files are named Seq[seq].xls, where [seq] is the sequence number the file corresponds to (they range from "Seq1.xls" to "Seq117.xls"). Each file contains the schema for the corresponding Estimate and MoE files and has only two rows. The first row holds the ID of each corresponding column in the Estimate/MoE files. The first 6 column names are always FILEID FILETYPE STUSAB CHARITER SEQUENCE LOGRECNO. The remainder are unique ids (e.g. B08406_002) corresponding to the unique ids found in ACS2009_5-Year_TableShells.xls.
The second row holds the column names. While some of them are fairly descriptive (e.g. "Workers 16 years and over"), others are entirely useless (e.g. "MEAN HOUSEHOLD INCOME OF QUINTILES, Universe: Households, Quintile Means:, Lowest Quintile" in ACS2009_5-Year_TableShells.xls becomes "Households% Households" in its corresponding sequence file). The second row should therefore be ignored completely.
These files are useful for mapping unique ids to sequence numbers and column numbers.
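The mapping from unique ids to sequence and column numbers could look roughly like this in Python. The sketch assumes the first row of a Seq[seq].xls file has already been read into a list of strings by whatever spreadsheet reader is available. The function name index_sequence_header, the sequence number 30 and the sample header are hypothetical.

```python
def index_sequence_header(seq, header):
    """Map each unique id in a Seq file's first row to (seq, column).

    Columns are numbered from 1. The first six columns are the fixed key
    fields and are skipped.
    """
    locations = {}
    for column, name in enumerate(header, start=1):
        if column <= 6:
            continue  # FILEID .. LOGRECNO
        locations[name] = (seq, column)
    return locations


# Made-up header row; only the six key names come from this document.
header = ["FILEID", "FILETYPE", "STUSAB", "CHARITER", "SEQUENCE", "LOGRECNO",
          "B08406_001", "B08406_002"]
print(index_sequence_header(30, header))
```

Merging the dicts from all sequence files would give a single lookup from unique id to the file and column that holds it.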
Estimate and MoE Files:
Located at*.zip
Naming convention is as follows (x denotes a digit):
Type = ["e" | "m"] - Estimate or Margin of error
Year = [xxxx] - Last year of the period
Period = [x] - Period length in years
State = [AB] - Standard state abbreviation
Seq = [xxxx] - Sequence Number
Res = [xxx] - Reserved for future use, currently "000"
Filename = Type + Year + Period + State + Seq + Res + ".txt"
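The naming convention above can be sketched in Python as a builder and a parser. This is only an illustration under stated assumptions: the state abbreviation is passed through unchanged (real files may use a different letter case), and the function names are hypothetical.

```python
import re


def estimate_filename(kind, year, period, state, seq, res="000"):
    """Build Type + Year + Period + State + Seq + Res + '.txt'."""
    if kind not in ("e", "m"):
        raise ValueError("kind must be 'e' (estimate) or 'm' (margin of error)")
    return f"{kind}{year:04d}{period:d}{state}{seq:04d}{res}.txt"


# One group per component of the naming convention.
_NAME = re.compile(r"^([em])(\d{4})(\d)([A-Za-z]{2})(\d{4})(\d{3})\.txt$")


def parse_estimate_filename(name):
    """Split a filename back into its components, or raise ValueError."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"not an Estimate/MoE filename: {name!r}")
    kind, year, period, state, seq, res = match.groups()
    return {"type": kind, "year": int(year), "period": int(period),
            "state": state, "seq": int(seq), "res": res}


name = estimate_filename("e", 2009, 5, "AK", 1)
print(name)
print(parse_estimate_filename(name))
```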
Each file contains rows with the following schema (field, description, size, contents):
FILEID            File Identification             6 Characters  Always "ACSSF"
FILETYPE          File Type                       6 Characters  Year + Type + Period
STUSAB            State/U.S.-Abbreviation (USPS)  2 Characters  State
CHARITER          Character Iteration             3 Characters  Seems to correspond to Res (i.e. currently always "000")
SEQUENCE          Sequence Number                 4 Characters  Seq
LOGRECNO          Logical Record Number           7 Characters  Location Identifier
Field # 7 and up  Estimates                       Various       Defined by position and cells in Sequence_Number_and_Table_Number_Lookup.xls
256 is a hard max for number of fields in one row.
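To show how the schema is used, here is a minimal Python sketch that reads Estimate/MoE rows and looks up one value by LOGRECNO and 1-based column number. It assumes the files are comma-delimited with no header row, which this document does not state. The function names and the sample row are hypothetical, and the numbers are made up.

```python
import csv
import io

MAX_FIELDS = 256  # hard maximum number of fields in one row


def read_rows(text):
    """Return {LOGRECNO: list of all fields} for comma-delimited rows."""
    rows = {}
    for fields in csv.reader(io.StringIO(text)):
        if not fields:
            continue  # skip blank lines
        if len(fields) > MAX_FIELDS:
            raise ValueError("row has more than 256 fields")
        rows[fields[5]] = fields  # LOGRECNO is the sixth field
    return rows


def lookup(rows, logrecno, column):
    """Return the raw string at a 1-based column for one location."""
    return rows[logrecno][column - 1]


# Made-up row: six key fields followed by three values.
sample = "ACSSF,2009e5,ak,000,0001,0000001,100,51,49\n"
rows = read_rows(sample)
print(lookup(rows, "0000001", 8))
```

Values are left as strings here, since how missing or suppressed values are encoded is not covered in this section.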
Excerpt from a file:
Summary level info: