### Census Data User Guide

2020-10-11 - New 2018 5-yr census data were downloaded from data.census.gov and prepared as three tables.

Tables:

* 2018_5yr_cendatagov_ESTIMATES.pkl - Census estimates data
* 2018_5yr_cendatagov_ESTIMATES_DD.pkl - Census estimates data - data dictionary
* 2018_5yr_cendatagov_GAZ.pkl - 2018 Gazetteer data


**IMPORTANT:** Use the `GEOID` field (not `GEO_ID`) to join the estimates data and the gazetteer data. In both tables `GEOID` is stored as an integer, so the tables are ready to join.
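The sample rows later in this guide suggest that `GEOID` is the part of `GEO_ID` that follows `US`, cast to an integer. The sketch below illustrates that relationship under that assumption; `geo_id_to_geoid` is a hypothetical helper and not part of the prepared tables.

```python
import pandas as pd

def geo_id_to_geoid(geo_id: pd.Series) -> pd.Series:
    """Hypothetical helper: keep the digits after 'US' and cast to int64."""
    return geo_id.str.split("US").str[-1].astype("int64")

# Two GEO_ID values taken from the estimates sample in this guide.
geo_id = pd.Series(["1400000US10003000400", "1400000US53073000700"])
geoid = geo_id_to_geoid(geo_id)
print(geoid.tolist())

# Integer storage drops leading zeros (e.g. Alabama's state code "01"),
# so zero-pad to 11 characters to recover the standard tract identifier.
padded = pd.Series([1001020100]).astype(str).str.zfill(11)
print(padded.tolist())
```

Zero-padding is only needed when the identifier must match string-typed tract IDs from other sources; for joining the two prepared tables, the integer form is sufficient.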

A notable change from the first data set that was prepared (2019 1-yr) is that the geo tables are no longer needed.

---

To read the tables, import pandas:

In [1]:
import pandas as pd

Here is some code to read all tables at once. Modify the path as needed.

In [2]:
# Directory containing the prepared pickle files
data_dir = '/media/school/project/data.census.gov_downloads'
# Read census estimates and their data dictionary
cen_20185_estimates = pd.read_pickle(f'{data_dir}/2018_5yr_cendatagov_ESTIMATES.pkl')
cen_20185_estimates_dd = pd.read_pickle(f'{data_dir}/2018_5yr_cendatagov_ESTIMATES_DD.pkl')
# Read gazetteer file that links geography to lat/lon
cen_20185_gaz = pd.read_pickle(f'{data_dir}/2018_5yr_cendatagov_GAZ.pkl')

Function to display samples of data for each table:

In [3]:
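from IPython.display import display

def show_info(df):
    """Print a DataFrame's column summary, then render the frame itself."""
    # df.info() prints directly and returns None, so print() also emits "None".
    print(df.info())
    display(df)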

---

## Data Dictionaries

The following table is the data dictionary for the `estimates` data.

It may be useful, for example, when deciding which columns to use in the models, since the column names are not particularly descriptive. The descriptions come mostly directly from the Census template files, with some minor manual modifications.
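One way to use the data dictionary is to look up a description by column name, or to search descriptions for a keyword. The following is a minimal sketch on a toy stand-in for `cen_20185_estimates_dd` built from rows shown in this guide; the names `dd`, `describe`, and `matches` are illustrative only.

```python
import pandas as pd

# Toy stand-in for cen_20185_estimates_dd, using rows shown in this guide.
dd = pd.DataFrame({
    "column": ["GEOID", "NAME", "B02001_001E", "B02001_002E"],
    "description": [
        "Field to Join Gazetteer Data",
        "Geographic Area Name",
        "Estimate!!Total",
        "Estimate!!Total!!White alone",
    ],
})

# Map column name -> description for quick lookups.
describe = dd.set_index("column")["description"]
print(describe["B02001_002E"])

# Find every column whose description mentions a keyword (case-insensitive).
matches = dd[dd["description"].str.contains("white", case=False)]
print(matches["column"].tolist())
```

With the real table, the same two lines should work by replacing `dd` with `cen_20185_estimates_dd`.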

### Table 1: cen_20185_estimates_dd

In [4]:
show_info(cen_20185_estimates_dd)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1316 entries, 0 to 1315
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   column       1316 non-null   object
 1   description  1316 non-null   object
dtypes: object(2)
memory usage: 20.7+ KB
None


index,column,description
0,GEOID,Field to Join Gazetteer Data
1,GEO_ID,id
2,NAME,Geographic Area Name
3,B02001_001E,Estimate!!Total
4,B02001_002E,Estimate!!Total!!White alone
...,...,...
1311,S2405_C03_006E,Estimate!!Service occupations!!Civilian employ...
1312,S2405_C04_006E,Estimate!!Sales and office occupations!!Civili...
1313,S2405_C05_006E,"Estimate!!Natural resources, construction, and..."
1314,S2405_C06_006E,"Estimate!!Production, transportation, and mate..."


---

## Data Tables

### Table 2: cen_20185_estimates

This table contains the data that can be used for the clustering models.

`GEOID` can be used to join Gazetteer data.
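Before clustering, the identifier columns usually need to be set aside, and the sample output for this table shows at least one column (`S2405_C04_015E`) with missing values in every displayed row. The sketch below shows one possible way to build a numeric feature matrix on a toy stand-in for `cen_20185_estimates`; dropping all-missing columns is an assumption about preprocessing, not a step taken from the prepared data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for cen_20185_estimates, using values shown in this guide.
est = pd.DataFrame({
    "GEOID": [10003000400, 10003001300],
    "GEO_ID": ["1400000US10003000400", "1400000US10003001300"],
    "NAME": [
        "Census Tract 4, New Castle County, Delaware",
        "Census Tract 13, New Castle County, Delaware",
    ],
    "B02001_001E": [2933, 3512],
    "S2405_C04_015E": [np.nan, np.nan],
    "S2405_C06_006E": [24.1, 0.0],
})

features = (
    est.set_index("GEOID")                  # keep the join key as the index
       .drop(columns=["GEO_ID", "NAME"])    # text identifiers are not features
       .select_dtypes(include="number")     # guard against stray object columns
       .dropna(axis="columns", how="all")   # drop columns with no data at all
)
print(features.columns.tolist())
```

Keeping `GEOID` as the index means cluster labels computed from `features` can be joined back to the gazetteer data later.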

In [5]:
show_info(cen_20185_estimates)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 74001 entries, 0 to 74000
Columns: 1316 entries, GEOID to S2405_C06_006E
dtypes: float64(942), int64(372), object(2)
memory usage: 743.6+ MB
None


index,GEOID,GEO_ID,NAME,B02001_001E,B02001_008E,B02001_006E,B02001_010E,B02001_004E,B02001_007E,B02001_002E,...,S2405_C03_005E,S2405_C05_001E,S2405_C06_009E,S2405_C03_003E,S2405_C01_003E,S2405_C01_015E,S2405_C05_003E,S2405_C05_012E,S2405_C04_015E,S2405_C06_006E
0,10003000400,1400000US10003000400,"Census Tract 4, New Castle County, Delaware",2933,40,0,40,26,0,1372,...,0.0,2.7,0.0,0.0,20.0,32.5,100.0,0.0,,24.1
1,10003001300,1400000US10003001300,"Census Tract 13, New Castle County, Delaware",3512,52,0,52,0,0,3223,...,0.0,4.8,0.0,0.0,100.0,11.7,75.0,0.0,,0.0
2,10003002600,1400000US10003002600,"Census Tract 26, New Castle County, Delaware",3512,64,0,64,0,224,1036,...,0.0,9.9,18.8,0.0,136.0,35.9,73.5,0.0,,7.5
3,10003010200,1400000US10003010200,"Census Tract 102, New Castle County, Delaware",2157,27,0,22,0,33,1530,...,0.0,8.3,0.0,0.0,73.0,14.2,84.9,0.0,,75.0
4,10003010300,1400000US10003010300,"Census Tract 103, New Castle County, Delaware",3220,0,0,0,0,0,2138,...,0.0,5.1,0.0,0.0,75.0,13.5,65.3,0.0,,28.8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73996,53063010304,1400000US53063010304,"Census Tract 103.04, Spokane County, Washington",5857,65,0,65,0,0,5753,...,0.0,10.3,0.0,0.0,293.0,16.6,65.9,0.0,,20.0
73997,53063010501,1400000US53063010501,"Census Tract 105.01, Spokane County, Washington",8554,271,0,271,222,55,7782,...,0.0,7.8,0.0,0.0,159.0,7.4,29.6,0.0,,5.9
73998,53063010601,1400000US53063010601,"Census Tract 106.01, Spokane County, Washington",3921,278,0,217,48,24,3483,...,0.0,9.8,0.0,0.0,50.0,6.2,84.0,8.8,,24.8
73999,53073000700,1400000US53073000700,"Census Tract 7, Whatcom County, Washington",6581,360,0,301,69,554,4847,...,0.0,14.7,0.0,0.0,370.0,8.8,76.5,0.0,,16.2


### Table 3: cen_20185_gaz

`GEOID` can be used to join to estimates data.
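A join on `GEOID` might look like the sketch below. It uses toy frames rather than the real tables: the gazetteer rows and the Delaware estimate are copied from the samples in this guide, while the estimate value for the Alabama tract is a made-up placeholder. The `validate` and `indicator` arguments are optional checks, not something the prepared data requires.

```python
import pandas as pd

# Toy estimates: the 1000 for the Alabama tract is a placeholder, not real data.
est = pd.DataFrame({
    "GEOID": [1001020100, 10003000400],
    "B02001_001E": [1000, 2933],
})

# Toy gazetteer rows copied from the sample shown in this guide.
gaz = pd.DataFrame({
    "USPS": ["AL", "AL"],
    "GEOID": [1001020100, 1001020200],
    "INTPTLAT": [32.481959, 32.475758],
    "INTPTLONG": [-86.491338, -86.472468],
})

merged = est.merge(
    gaz,
    on="GEOID",
    how="left",               # keep every estimates row
    validate="one_to_one",    # raise if GEOID is duplicated on either side
    indicator=True,           # adds a _merge column: 'both' or 'left_only'
)
print(merged[["GEOID", "USPS", "INTPTLAT", "_merge"]])
```

Rows marked `left_only` are tracts that found no gazetteer match, which is a quick way to spot key mismatches after the join.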

A data dictionary was not prepared for this table, but the Census Bureau documents the record layout here: https://www.census.gov/programs-surveys/geography/technical-documentation/records-layout/gaz18-record-layouts.html


Brief overview: 

* USPS - State
* GEOID - ID for tract
* ALAND - Area of Land, m^2
* AWATER - Area of Water, m^2
* ALAND_SQMI - Area of Land, mi^2
* AWATER_SQMI - Area of Water, mi^2
* INTPTLAT - Latitude
* INTPTLONG - Longitude
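As a quick sanity check on the units listed above, the square-mile columns should equal the square-meter columns divided by the number of square meters in a square mile. The sketch below checks this against three rows from the gazetteer sample in this guide; the constant name `SQ_M_PER_SQ_MI` is illustrative.

```python
# One international mile is exactly 1609.344 m.
SQ_M_PER_SQ_MI = 1609.344 ** 2

# (ALAND in m^2, ALAND_SQMI) pairs from the gazetteer sample in this guide.
rows = [(9817813, 3.791), (3325679, 1.284), (5349273, 2.065)]

for aland_m2, aland_sqmi in rows:
    converted = round(aland_m2 / SQ_M_PER_SQ_MI, 3)
    print(aland_m2, converted, aland_sqmi)
```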

In [6]:
show_info(cen_20185_gaz)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74001 entries, 0 to 74000
Data columns (total 8 columns):
 #   Column                                                                                                                                  Non-Null Count  Dtype  
---  ------                                                                                                                                  --------------  -----  
 0   USPS                                                                                                                                    74001 non-null  object 
 1   GEOID                                                                                                                                   74001 non-null  int64  
 2   ALAND                                                                                                                                   74001 non-null  int64  
 3   AWATER                                                        

index,USPS,GEOID,ALAND,AWATER,ALAND_SQMI,AWATER_SQMI,INTPTLAT,INTPTLONG
0,AL,1001020100,9817813,28435,3.791,0.011,32.481959,-86.491338
1,AL,1001020200,3325679,5669,1.284,0.002,32.475758,-86.472468
2,AL,1001020300,5349273,9054,2.065,0.003,32.474024,-86.459703
3,AL,1001020400,6384276,8408,2.465,0.003,32.471030,-86.444835
4,AL,1001020500,11408869,43534,4.405,0.017,32.458922,-86.421826
...,...,...,...,...,...,...,...,...
73996,PR,72153750501,1795740,0,0.693,0.000,18.031240,-66.867250
73997,PR,72153750502,689929,0,0.266,0.000,18.024746,-66.860442
73998,PR,72153750503,3322874,1952,1.283,0.001,18.023325,-66.874841
73999,PR,72153750601,10987037,4527,4.242,0.002,18.017808,-66.839070
