### Census Data User Guide - Version 3

**2020-10-13 V3 UPDATE: A new cleaning procedure was implemented. Details are below.**

2020-10-11 V2 UPDATE: Tables were regenerated with pickle `protocol=4` for compatibility.
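Pickle protocol 4 can be read by Python 3.4 and later, while protocol 5 requires Python 3.8 or later. The following is a minimal sketch of writing a table with an explicit protocol. The toy DataFrame and the temporary path are hypothetical and are not the actual preparation script.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for one of the prepared tables (hypothetical values).
toy = pd.DataFrame({"GEOID": [10003000400, 10003001300], "value": [1.5, None]})

# Write with protocol 4 so that Python versions older than 3.8 can read the file.
path = os.path.join(tempfile.mkdtemp(), "toy_table.pkl")
toy.to_pickle(path, protocol=4)

# Round trip: read the file back for comparison with the original.
reloaded = pd.read_pickle(path)
```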

2020-10-11 - New 2018 5-year census data were downloaded from data.census.gov and prepared as three tables.

Tables:

* `2018_5yr_cendatagov_ESTIMATES_v3.pkl` - Census estimates data
* `2018_5yr_cendatagov_ESTIMATES_DD_v3.pkl` - Census estimates data - data dictionary
* `2018_5yr_cendatagov_GAZ_v3.pkl` - 2018 Gazetteer data


**IMPORTANT:** Use the `GEOID` field (not `GEO_ID`) to join the estimates data and the gazetteer data. Both tables store `GEOID` as an integer, so they can be joined directly.
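A minimal sketch of such a join, using small toy tables rather than the real files. The tract IDs and names are copied from the samples in this guide, but the Delaware coordinates are made up for illustration, and the `validate` check is an optional safeguard rather than part of the original workflow.

```python
import pandas as pd

# Toy stand-in for the estimates table (IDs and names from this guide's sample).
estimates = pd.DataFrame({
    "GEOID": [10003000400, 10003001300],
    "NAME": ["Census Tract 4, New Castle County, Delaware",
             "Census Tract 13, New Castle County, Delaware"],
})

# Toy stand-in for the gazetteer table. The first two coordinate pairs are
# made up; the third row is the Alabama tract from this guide's sample.
gaz = pd.DataFrame({
    "GEOID": [10003000400, 10003001300, 1001020100],
    "INTPTLAT": [39.75, 39.76, 32.481959],
    "INTPTLONG": [-75.54, -75.55, -86.491338],
})

# Both keys are int64, so no casting is needed before merging.
# validate="one_to_one" raises an error if GEOID is duplicated on either side.
joined = estimates.merge(gaz, on="GEOID", how="left", validate="one_to_one")
print(joined)
```

A left join keeps every estimates row even if a tract has no gazetteer match; unmatched rows would show nulls in the coordinate columns.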

**V3 Changes to Data Prep:** 

1. The previous version coerced all values to numeric in one blanket step, which turned some potentially valid values into nulls. The cleaning step now parses values more carefully before casting them to numeric.
2. Data for Puerto Rico were dropped, along with a row that represented US summary data. The data are now limited to the 50 states plus Washington, DC. The Gazetteer data were updated to drop the same rows.
3. For any column whose values were not already within the range [0, 100], the values were rescaled to that range. Nulls were ignored during scaling.
4. After this processing, 163 columns contained only null values. These were dropped from the Estimates table, and the data dictionary was updated to drop the same fields.
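Steps 1, 3, and 4 can be sketched roughly as follows. This is not the actual preparation code: the raw values, the characters stripped before casting, and the choice of min-max scaling are all assumptions made for illustration, since this guide does not list the exact cleaning rules.

```python
import pandas as pd

def clean_numeric(series: pd.Series) -> pd.Series:
    """Cast a raw text column to float, keeping values such as '250,000+'.

    Assumption: stripping commas and a trailing '+' or '-' after a digit is
    illustrative only; the real cleaning rules are not given in this guide.
    """
    text = series.astype(str).str.replace(",", "", regex=False)
    text = text.str.replace(r"(\d)[+-]$", r"\1", regex=True)
    # Anything still non-numeric (e.g. '-', '(X)') becomes NaN.
    return pd.to_numeric(text, errors="coerce")

def scale_to_0_100(series: pd.Series) -> pd.Series:
    """Min-max scale to [0, 100] unless the values already lie in that range."""
    lo, hi = series.min(), series.max()  # NaN is skipped by default
    # Leave all-null, already-in-range, and constant columns unchanged.
    if pd.isna(lo) or (lo >= 0 and hi <= 100) or lo == hi:
        return series
    return (series - lo) / (hi - lo) * 100

# Hypothetical raw download with annotation-style placeholders.
raw = pd.DataFrame({
    "count": ["1,200", "250,000+", "-", "300"],
    "pct": ["12.5", "100", "0", "(X)"],
    "empty": ["(X)", "-", "N", "**"],
})

cleaned = raw.apply(clean_numeric)                      # step 1
scaled = cleaned.apply(scale_to_0_100)                  # step 3
scaled = scaled.dropna(axis="columns", how="all")       # step 4
print(scaled)
```

In this toy example `count` is rescaled, `pct` is left alone because it is already within [0, 100], and `empty` is dropped because every value is null.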

---

To read the tables, import pandas:

In [1]:
import pandas as pd

Here is some code to read all tables at once. Modify the path as needed.

In [2]:
# Read census estimates
cen_20185_estimates = pd.read_pickle('/media/school/project/data.census.gov_downloads/2018_5yr_cendatagov_ESTIMATES_v3.pkl')
cen_20185_estimates_dd = pd.read_pickle('/media/school/project/data.census.gov_downloads/2018_5yr_cendatagov_ESTIMATES_DD_v3.pkl')
# Read gazetteer file that links geography to lat/lon
cen_20185_gaz = pd.read_pickle('/media/school/project/data.census.gov_downloads/2018_5yr_cendatagov_GAZ_v3.pkl')

Function to display samples of data for each table:

In [3]:
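from IPython.display import display
def show_info(df):
    """Print a summary of `df`, then render a preview of its rows."""
    # df.info() prints the summary itself and returns None, so wrapping it
    # in print() also emits a trailing "None" line in the output.
    print(df.info())
    display(df)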

---

## Data Dictionaries

The following table is the data dictionary for the `estimates` data.

It is useful when, for example, deciding which columns to use in the models, since the column names are not descriptive. The descriptions come mostly directly from the Census template files, with some minor manual modifications.

### Table 1: cen_20185_estimates_dd_v3

In [4]:
show_info(cen_20185_estimates_dd)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1153 entries, 0 to 1152
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   column       1153 non-null   object
 1   description  1153 non-null   object
dtypes: object(2)
memory usage: 18.1+ KB
None


| | column | description |
|---|---|---|
| 0 | GEOID | Field to Join Gazetteer Data |
| 1 | GEO_ID | id |
| 2 | NAME | Geographic Area Name |
| 3 | B02001_001E | Estimate!!Total |
| 4 | B02001_002E | Estimate!!Total!!White alone |
| ... | ... | ... |
| 1148 | S2405_C03_006E | Estimate!!Service occupations!!Civilian employ... |
| 1149 | S2405_C04_006E | Estimate!!Sales and office occupations!!Civili... |
| 1150 | S2405_C05_006E | Estimate!!Natural resources, construction, and... |
| 1151 | S2405_C06_006E | Estimate!!Production, transportation, and mate... |
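One way to use the data dictionary is to look up a column code or to search the descriptions by keyword. The sketch below uses a toy dictionary built from the first rows shown above; with the real table, `cen_20185_estimates_dd` would take the place of `dd`. The keyword is an arbitrary example.

```python
import pandas as pd

# Toy data dictionary with rows copied from the sample in this guide.
dd = pd.DataFrame({
    "column": ["GEOID", "GEO_ID", "NAME", "B02001_001E", "B02001_002E"],
    "description": ["Field to Join Gazetteer Data", "id", "Geographic Area Name",
                    "Estimate!!Total", "Estimate!!Total!!White alone"],
})

# Map column codes to descriptions for quick lookups.
describe = dd.set_index("column")["description"]
print(describe["B02001_002E"])

# Find every column whose description mentions a keyword (case-insensitive).
matches = dd[dd["description"].str.contains("total", case=False)]
print(matches["column"].tolist())
```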


---

## Data Tables

### Table 2: cen_20185_estimates_v3

This table contains the data that can be used for the clustering models.

`GEOID` can be used to join Gazetteer data.
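As a rough sketch of preparing this table for a clustering model, the identifier columns can be separated from the float estimate columns. The toy rows below reuse values from the sample in this guide (tract names omitted). Filling nulls with the column median is only one hypothetical strategy and is not something this guide prescribes.

```python
import pandas as pd

# Toy stand-in for the estimates table, with two of the estimate columns.
estimates = pd.DataFrame({
    "GEOID": [10003000400, 10003001300, 10003002600],
    "GEO_ID": ["1400000US10003000400", "1400000US10003001300",
               "1400000US10003002600"],
    "B02001_002E": [3.119600, 7.328331, 2.355616],
    "S2405_C04_002E": [100.0, None, 0.0],
})

# Keep only the float estimate columns. GEOID (int64) and the text columns
# are identifiers, not features.
features = estimates.select_dtypes(include="float64")

# Many clustering implementations reject NaN, so nulls need some strategy.
# Median imputation here is an assumption made for illustration.
features_filled = features.fillna(features.median())
print(features_filled)
```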

In [5]:
show_info(cen_20185_estimates)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73056 entries, 0 to 73055
Columns: 1153 entries, GEOID to S2405_C02_002E
dtypes: float64(1150), int64(1), object(2)
memory usage: 642.7+ MB
None


| | GEOID | GEO_ID | NAME | B02001_002E | B02001_003E | B02001_008E | B02001_004E | B02001_007E | B02001_001E | B02001_006E | ... | S2405_C04_012E | S2405_C05_007E | S2405_C01_008E | S2405_C06_001E | S2405_C01_011E | S2405_C02_011E | S2405_C04_004E | S2405_C04_002E | S2405_C05_008E | S2405_C02_002E |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10003000400 | 1400000US10003000400 | Census Tract 4, New Castle County, Delaware | 3.119600 | 8.768328 | 0.937866 | 0.265767 | 0.000000 | 4.173841 | 0.000000 | ... | 0.0 | 0.0 | 0.775194 | 2.8 | 4.592987 | 69.6 | 56.5 | 100.0 | 0.0 | 0.0 |
| 1 | 10003001300 | 1400000US10003001300 | Census Tract 13, New Castle County, Delaware | 7.328331 | 0.744868 | 1.219226 | 0.000000 | 0.000000 | 4.997794 | 0.000000 | ... | 11.7 | 0.0 | 3.875969 | 2.8 | 3.967603 | 77.0 | 0.0 | NaN | 0.0 | NaN |
| 2 | 10003002600 | 1400000US10003002600 | Census Tract 26, New Castle County, Delaware | 2.355616 | 12.609971 | 1.500586 | 0.000000 | 2.907958 | 4.997794 | 0.000000 | ... | 31.9 | 0.0 | 0.000000 | 13.5 | 5.033832 | 63.3 | 0.0 | 0.0 | NaN | 0.0 |
| 3 | 10003010200 | 1400000US10003010200 | Census Tract 102, New Castle County, Delaware | 3.478854 | 3.126100 | 0.633060 | 0.000000 | 0.428405 | 3.069545 | 0.000000 | ... | 0.0 | 10.5 | 5.340224 | 10.7 | 3.075661 | 60.3 | 31.8 | 0.0 | 0.0 | 62.5 |
| 4 | 10003010300 | 1400000US10003010300 | Census Tract 103, New Castle County, Delaware | 4.861301 | 4.862170 | 0.000000 | 0.000000 | 0.000000 | 4.582260 | 0.000000 | ... | 7.7 | 0.0 | 4.478898 | 10.2 | 3.977855 | 84.5 | 8.3 | NaN | 0.0 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 73051 | 53063010304 | 1400000US53063010304 | Census Tract 103.04, Spokane County, Washington | 13.080946 | 0.000000 | 1.524033 | 0.000000 | 0.000000 | 8.334875 | 0.000000 | ... | 39.4 | 11.7 | 2.067183 | 19.7 | 6.561411 | 48.1 | 6.6 | 76.9 | 0.0 | 0.0 |
| 73052 | 53063010501 | 1400000US53063010501 | Census Tract 105.01, Spokane County, Washington | 17.694407 | 0.029326 | 6.354045 | 2.269243 | 0.714008 | 12.172874 | 0.000000 | ... | 14.6 | 4.5 | 6.459948 | 7.1 | 11.185155 | 67.6 | 0.0 | 0.0 | 37.3 | 10.0 |
| 73053 | 53063010601 | 1400000US53063010601 | Census Tract 106.01, Spokane County, Washington | 7.919509 | 0.275660 | 6.518171 | 0.490647 | 0.311567 | 5.579827 | 0.000000 | ... | 11.0 | 0.0 | 1.119724 | 13.0 | 6.161575 | 73.4 | 0.0 | 0.0 | 53.8 | 0.0 |
| 73054 | 53073000700 | 1400000US53073000700 | Census Tract 7, Whatcom County, Washington | 11.020919 | 1.167155 | 8.440797 | 0.705305 | 7.192003 | 9.365172 | 0.000000 | ... | 0.0 | 23.8 | 2.239449 | 12.5 | 9.032192 | 52.4 | 3.2 | 0.0 | 0.0 | 46.2 |


### Table 3: cen_20185_gaz_v3

`GEOID` can be used to join to estimates data.

A data dictionary was not prepared for this table, but the Census Bureau's record layout is available here: https://www.census.gov/programs-surveys/geography/technical-documentation/records-layout/gaz18-record-layouts.html


Brief overview: 

* USPS - State
* GEOID - ID for tract
* ALAND - Area of Land, m^2
* AWATER - Area of Water, m^2
* ALAND_SQMI - Area of Land, mi^2
* AWATER_SQMI - Area of Water, mi^2
* INTPTLAT - Latitude
* INTPTLONG - Longitude
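The relationship between these fields can be checked with a short sketch. The rows are the first three Alabama tracts from the sample in this guide. Padding `GEOID` to 11 digits relies on the standard tract ID layout (2-digit state, 3-digit county, 6-digit tract); the `geoid_str` name is hypothetical and is not a column in the prepared tables.

```python
import pandas as pd

# One international mile is exactly 1609.344 m.
SQ_M_PER_SQ_MI = 1609.344 ** 2

# First three gazetteer rows from the sample in this guide.
gaz = pd.DataFrame({
    "GEOID": [1001020100, 1001020200, 1001020300],
    "ALAND": [9817813, 3325679, 5349273],
    "ALAND_SQMI": [3.791, 1.284, 2.065],
})

# ALAND (square meters) converted to square miles should match ALAND_SQMI.
computed_sqmi = (gaz["ALAND"] / SQ_M_PER_SQ_MI).round(3)
print(computed_sqmi.tolist())

# Storing GEOID as an integer drops the leading zero of state codes below 10.
# Zero-padding to 11 characters restores the text form of the tract ID.
geoid_str = gaz["GEOID"].astype(str).str.zfill(11)
print(geoid_str.tolist())
```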

In [6]:
show_info(cen_20185_gaz)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73056 entries, 0 to 73055
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   USPS    73056 non-null  object 
 1   GEOID   73056 non-null  int64  
 2   ALAND   73056 non-null  int64  
 3   AWATER

| | USPS | GEOID | ALAND | AWATER | ALAND_SQMI | AWATER_SQMI | INTPTLAT | INTPTLONG |
|---|---|---|---|---|---|---|---|---|
| 0 | AL | 1001020100 | 9817813 | 28435 | 3.791 | 0.011 | 32.481959 | -86.491338 |
| 1 | AL | 1001020200 | 3325679 | 5669 | 1.284 | 0.002 | 32.475758 | -86.472468 |
| 2 | AL | 1001020300 | 5349273 | 9054 | 2.065 | 0.003 | 32.474024 | -86.459703 |
| 3 | AL | 1001020400 | 6384276 | 8408 | 2.465 | 0.003 | 32.471030 | -86.444835 |
| 4 | AL | 1001020500 | 11408869 | 43534 | 4.405 | 0.017 | 32.458922 | -86.421826 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 73051 | WY | 56043000200 | 5780716346 | 9742603 | 2231.947 | 3.762 | 43.878830 | -107.669052 |
| 73052 | WY | 56043000301 | 1993203 | 0 | 0.770 | 0.000 | 44.014369 | -107.956379 |
| 73053 | WY | 56043000302 | 15429213 | 687001 | 5.957 | 0.265 | 44.028771 | -107.950748 |
| 73054 | WY | 56045951100 | 6100010068 | 5041727 | 2355.227 | 1.947 | 43.846213 | -104.570020 |
