## Resources
- [FIPS codes](https://www2.census.gov/geo/docs/reference/codes/files/)
- [Census API: Datasets in /data/2018/acs/acs5/profile and its descendants](https://api.census.gov/data/2018/acs/acs5/profile.html)
- [American Community Survey 5-Year Data (2009-2018)](https://www.census.gov/data/developers/data-sets/acs-5year.html)
- [Census Datasets](https://api.census.gov/data.html)
- [Notes on ACS Estimate and Annotation Values (weird values)](https://www.census.gov/data/developers/data-sets/acs-1year/notes-on-acs-estimate-and-annotation-values.html)
- [Getting Census Data in 5 Easy Steps (towards data science)](https://towardsdatascience.com/getting-census-data-in-5-easy-steps-a08eeb63995d)
- [Using the Census Bureau's API (medium)](https://medium.com/@shep.nathan.d/using-the-u-s-census-bureaus-api-af113337f478)
- [Census Bureau YouTube tutorial (youtube)](https://www.youtube.com/watch?v=K0-ifZS0mQI&feature=emb_title&ab_channel=U.S.CensusBureau)
- [DATA GEMS: How to Extract Data from the Census API (youtube)](https://www.youtube.com/watch?v=0DVdHquaRiU)
- [Python Tutorial: Using the Census API (datacamp video)](https://www.youtube.com/watch?v=l47HptzM7ao)


- [Land Area and Persons per Square Mile](https://www.census.gov/quickfacts/fact/note/US/LND110210)
- [Gazetteer 2018 Geographic Data](https://www.census.gov/geographies/reference-files/time-series/geo/gazetteer-files.2018.html)

## Contents  
- [Pulling DP05 From Census API](#Pulling-DP05-From-Census-API)
- [Pulling DP03 From Census API](#Pulling-DP03-From-Census-API)

In [220]:
import pandas as pd
import requests

# Pulling DP05 From Census API
## States: TX (48), NY (36), CA (06), FL (12), IL (17)

Table: https://data.census.gov/cedsci/table?q=ACSDP1Y2019.DP05&tid=ACSDP5Y2018.DP05&hidePreview=false  
ACS 5-YEAR DEMOGRAPHIC AND HOUSING ESTIMATES   
Survey/Program: American Community Survey   
2018: ACS 5-Year Estimates Data Profiles  
TableID: DP05  

In [221]:
# Reference: Census YT tutorial: https://www.youtube.com/watch?v=K0-ifZS0mQI&feature=emb_title&ab_channel=U.S.CensusBureau
# Add 'NAME' to the 'get' parameter to return the county name
# Change 'DP05' to pull a different table
# Set state:XX to the FIPS code of the desired state

#### Expected URL format

In [222]:
example_url = 'https://api.census.gov/data/2018/acs/acs5/profile?get=group(DP05),NAME&for=county:*&in=state:48&key=abaab7067d6de5d6a0216fca0b8fca4e9015a87f'
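The query string above breaks down into four parameters: `get` (the table group plus `NAME`), `for` (the geography level), `in` (the parent state), and `key`. A minimal sketch of assembling them for any table and state follows. `build_params` and the `CENSUS_API_KEY` environment variable are hypothetical names introduced for illustration, not part of the original notebook.

```python
import os


def build_params(group, state_fips, key=None):
    """Return query parameters for a county-level pull of one table group."""
    params = {
        'get': f'group({group}),NAME',  # every variable in the group, plus the place name
        'for': 'county:*',              # all counties ...
        'in': f'state:{state_fips}',    # ... within one state (two-digit FIPS string)
    }
    # Assumption: the key is kept in an environment variable instead of the notebook.
    key = key or os.environ.get('CENSUS_API_KEY')
    if key:
        params['key'] = key
    return params


tx_params = build_params('DP05', '48', key='YOUR_KEY')
print(tx_params)
```

Passing the state code as a string keeps the leading zero in codes such as `'06'`.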

## Texas: 48

In [223]:
# Set base url
url = 'https://api.census.gov/data/2018/acs/acs5/profile?'

# Set params
params = {
    'get': 'group(DP05),NAME',
    'for': 'county:*',
    'in': 'state:48',
    'key': 'abaab7067d6de5d6a0216fca0b8fca4e9015a87f'
}

# Make a request and display the response code.
res = requests.get(url, params=params)
res

<Response [200]>

In [224]:
df_tx = pd.DataFrame(res.json())
df_tx.head(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,707,708,709,710,711,712,713,714,715,716
0,NAME,DP05_0031PM,DP05_0032E,DP05_0032M,DP05_0032PE,DP05_0032PM,DP05_0033E,DP05_0033M,DP05_0033PE,DP05_0033PM,...,DP05_0029PEA,DP05_0030MA,DP05_0030EA,DP05_0030PMA,DP05_0030PEA,DP05_0031MA,DP05_0031EA,DP05_0031PEA,state,county
1,"Austin County, Texas",0.6,87.0,1.9,-888888888,-888888888,29565,-555555555,29565,-888888888,...,,,,,,,,,48,015


In [225]:
# Set the values in the first row to the columns
df_tx.columns = df_tx.iloc[0]

In [226]:
# Drop the first row
df_tx = df_tx.iloc[1:, :]

df_tx.head(2)

Unnamed: 0,NAME,DP05_0031PM,DP05_0032E,DP05_0032M,DP05_0032PE,DP05_0032PM,DP05_0033E,DP05_0033M,DP05_0033PE,DP05_0033PM,...,DP05_0029PEA,DP05_0030MA,DP05_0030EA,DP05_0030PMA,DP05_0030PEA,DP05_0031MA,DP05_0031EA,DP05_0031PEA,state,county
1,"Austin County, Texas",0.6,87.0,1.9,-888888888,-888888888,29565,-555555555,29565,-888888888,...,,,,,,,,,48,15
2,"Kenedy County, Texas",20.5,34.7,38.9,-888888888,-888888888,595,181,595,-888888888,...,,,,,,,,,48,261
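The two cells above promote the first row to column names and then drop it. The same result can be sketched in one step by unpacking the payload before building the frame. The sample rows below are a hand-written subset in the same shape as the API response, and `rows_to_frame` is a hypothetical helper.

```python
import pandas as pd


def rows_to_frame(rows):
    """Build a DataFrame from a payload whose first row holds the column names."""
    header, *records = rows
    return pd.DataFrame(records, columns=header)


sample = [
    ['NAME', 'DP05_0033E', 'state', 'county'],
    ['Austin County, Texas', '29565', '48', '015'],
    ['Kenedy County, Texas', '595', '48', '261'],
]

frame = rows_to_frame(sample)
print(frame)
```

Unlike the slice-based version, this leaves the index starting at 0 and the columns without a leftover name.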


In [227]:
df_tx.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 254 entries, 1 to 254
Columns: 717 entries, NAME to county
dtypes: object(717)
memory usage: 1.4+ MB


In [228]:
df_tx.to_csv('../data/preprocessing/raw_dp05_tx.csv', index=False)
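One thing to watch when this CSV is read back: the API returns FIPS codes as strings such as `015`, and `pd.read_csv` infers integers by default, which drops the leading zeros. A small sketch of the difference, using an in-memory CSV rather than the exported file:

```python
import io

import pandas as pd

csv_text = 'NAME,state,county\n"Austin County, Texas",48,015\n'

# Default type inference turns the county code into an integer.
inferred = pd.read_csv(io.StringIO(csv_text))

# Reading every column as text keeps the code exactly as exported.
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(inferred.loc[0, 'county'])
print(as_text.loc[0, 'county'])
```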

## New York: 36

In [229]:
# Set base url
url = 'https://api.census.gov/data/2018/acs/acs5/profile?'

# Set params
params = {
    'get': 'group(DP05),NAME',
    'for': 'county:*',
    'in': 'state:36',
    'key': 'abaab7067d6de5d6a0216fca0b8fca4e9015a87f'
}

# Make a request and display the response code.
res = requests.get(url, params=params)
res

<Response [200]>

In [230]:
df_ny = pd.DataFrame(res.json())
df_ny.head(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,707,708,709,710,711,712,713,714,715,716
0,NAME,DP05_0031PM,DP05_0032E,DP05_0032M,DP05_0032PE,DP05_0032PM,DP05_0033E,DP05_0033M,DP05_0033PE,DP05_0033PM,...,DP05_0029PEA,DP05_0030MA,DP05_0030EA,DP05_0030PMA,DP05_0030PEA,DP05_0031MA,DP05_0031EA,DP05_0031PEA,state,county
1,"Schoharie County, New York",0.3,93.7,1.2,-888888888,-888888888,31364,-555555555,31364,-888888888,...,,,,,,,,,36,095


In [231]:
# Set the values in the first row to the columns
df_ny.columns = df_ny.iloc[0]

In [232]:
# Drop the first row
df_ny = df_ny.iloc[1:, :]

df_ny.head(2)

Unnamed: 0,NAME,DP05_0031PM,DP05_0032E,DP05_0032M,DP05_0032PE,DP05_0032PM,DP05_0033E,DP05_0033M,DP05_0033PE,DP05_0033PM,...,DP05_0029PEA,DP05_0030MA,DP05_0030EA,DP05_0030PMA,DP05_0030PEA,DP05_0031MA,DP05_0031EA,DP05_0031PEA,state,county
1,"Schoharie County, New York",0.3,93.7,1.2,-888888888,-888888888,31364,-555555555,31364,-888888888,...,,,,,,,,,36,95
2,"Onondaga County, New York",0.1,75.6,0.1,-888888888,-888888888,464242,-555555555,464242,-888888888,...,,,,,,,,,36,67


In [233]:
df_ny.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62 entries, 1 to 62
Columns: 717 entries, NAME to county
dtypes: object(717)
memory usage: 347.4+ KB


In [234]:
df_ny.to_csv('../data/preprocessing/raw_dp05_ny.csv', index=False)

## California: 06

In [235]:
# Set base url
url = 'https://api.census.gov/data/2018/acs/acs5/profile?'

# Set params
params = {
    'get': 'group(DP05),NAME',
    'for': 'county:*',
    'in': 'state:06',
    'key': 'abaab7067d6de5d6a0216fca0b8fca4e9015a87f'
}

# Make a request and display the response code.
res = requests.get(url, params=params)
res

<Response [200]>

In [236]:
df_ca = pd.DataFrame(res.json())
df_ca.head(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,707,708,709,710,711,712,713,714,715,716
0,NAME,DP05_0031PM,DP05_0032E,DP05_0032M,DP05_0032PE,DP05_0032PM,DP05_0033E,DP05_0033M,DP05_0033PE,DP05_0033PM,...,DP05_0029PEA,DP05_0030MA,DP05_0030EA,DP05_0030PMA,DP05_0030PEA,DP05_0031MA,DP05_0031EA,DP05_0031PEA,state,county
1,"Lake County, California",0.3,94.1,1.1,-888888888,-888888888,64148,-555555555,64148,-888888888,...,,,,,,,,,06,033


In [237]:
# Set the values in the first row to the columns
df_ca.columns = df_ca.iloc[0]

In [238]:
# Drop the first row
df_ca = df_ca.iloc[1:, :]

df_ca.head(2)

Unnamed: 0,NAME,DP05_0031PM,DP05_0032E,DP05_0032M,DP05_0032PE,DP05_0032PM,DP05_0033E,DP05_0033M,DP05_0033PE,DP05_0033PM,...,DP05_0029PEA,DP05_0030MA,DP05_0030EA,DP05_0030PMA,DP05_0030PEA,DP05_0031MA,DP05_0031EA,DP05_0031PEA,state,county
1,"Lake County, California",0.3,94.1,1.1,-888888888,-888888888,64148,-555555555,64148,-888888888,...,,,,,,,,,6,33
2,"Mariposa County, California",1.1,98.1,4.2,-888888888,-888888888,17540,-555555555,17540,-888888888,...,,,,,,,,,6,43


In [239]:
df_ca.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58 entries, 1 to 58
Columns: 717 entries, NAME to county
dtypes: object(717)
memory usage: 325.0+ KB


In [240]:
df_ca.to_csv('../data/preprocessing/raw_dp05_ca.csv', index=False)

## Florida: 12

In [241]:
# Set base url
url = 'https://api.census.gov/data/2018/acs/acs5/profile?'

# Set params
params = {
    'get': 'group(DP05),NAME',
    'for': 'county:*',
    'in': 'state:12',
    'key': 'abaab7067d6de5d6a0216fca0b8fca4e9015a87f'
}

# Make a request and display the response code.
res = requests.get(url, params=params)
res

<Response [200]>

In [242]:
df_fl = pd.DataFrame(res.json())
df_fl.head(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,707,708,709,710,711,712,713,714,715,716
0,NAME,DP05_0031PM,DP05_0032E,DP05_0032M,DP05_0032PE,DP05_0032PM,DP05_0033E,DP05_0033M,DP05_0033PE,DP05_0033PM,...,DP05_0029PEA,DP05_0030MA,DP05_0030EA,DP05_0030PMA,DP05_0030PEA,DP05_0031MA,DP05_0031EA,DP05_0031PEA,state,county
1,"Okaloosa County, Florida",0.1,83.2,0.3,-888888888,-888888888,200737,-555555555,200737,-888888888,...,,,,,,,,,12,091


In [243]:
# Set the values in the first row to the columns
df_fl.columns = df_fl.iloc[0]

In [244]:
# Drop the first row
df_fl = df_fl.iloc[1:, :]

df_fl.head(2)

Unnamed: 0,NAME,DP05_0031PM,DP05_0032E,DP05_0032M,DP05_0032PE,DP05_0032PM,DP05_0033E,DP05_0033M,DP05_0033PE,DP05_0033PM,...,DP05_0029PEA,DP05_0030MA,DP05_0030EA,DP05_0030PMA,DP05_0030PEA,DP05_0031MA,DP05_0031EA,DP05_0031PEA,state,county
1,"Okaloosa County, Florida",0.1,83.2,0.3,-888888888,-888888888,200737,-555555555,200737,-888888888,...,,,,,,,,,12,91
2,"Taylor County, Florida",0.5,88.1,1.8,-888888888,-888888888,22098,-555555555,22098,-888888888,...,,,,,,,,,12,123


In [245]:
df_fl.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67 entries, 1 to 67
Columns: 717 entries, NAME to county
dtypes: object(717)
memory usage: 375.4+ KB


In [246]:
df_fl.to_csv('../data/preprocessing/raw_dp05_fl.csv', index=False)

## Illinois: 17

In [247]:
# Set base url
url = 'https://api.census.gov/data/2018/acs/acs5/profile?'

# Set params
params = {
    'get': 'group(DP05),NAME',
    'for': 'county:*',
    'in': 'state:17',
    'key': 'abaab7067d6de5d6a0216fca0b8fca4e9015a87f'
}

# Make a request and display the response code.
res = requests.get(url, params=params)
res

<Response [200]>

In [248]:
df_il = pd.DataFrame(res.json())
df_il.head(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,707,708,709,710,711,712,713,714,715,716
0,NAME,DP05_0031PM,DP05_0032E,DP05_0032M,DP05_0032PE,DP05_0032PM,DP05_0033E,DP05_0033M,DP05_0033PE,DP05_0033PM,...,DP05_0029PEA,DP05_0030MA,DP05_0030EA,DP05_0030PMA,DP05_0030PEA,DP05_0031MA,DP05_0031EA,DP05_0031PEA,state,county
1,"Jersey County, Illinois",1.0,81.3,3.2,-888888888,-888888888,22069,-555555555,22069,-888888888,...,,,,,,,,,17,083


In [249]:
# Set the values in the first row to the columns
df_il.columns = df_il.iloc[0]

In [250]:
# Drop the first row
df_il = df_il.iloc[1:, :]

df_il.head(2)

Unnamed: 0,NAME,DP05_0031PM,DP05_0032E,DP05_0032M,DP05_0032PE,DP05_0032PM,DP05_0033E,DP05_0033M,DP05_0033PE,DP05_0033PM,...,DP05_0029PEA,DP05_0030MA,DP05_0030EA,DP05_0030PMA,DP05_0030PEA,DP05_0031MA,DP05_0031EA,DP05_0031PEA,state,county
1,"Jersey County, Illinois",1.0,81.3,3.2,-888888888,-888888888,22069,-555555555,22069,-888888888,...,,,,,,,,,17,83
2,"Putnam County, Illinois",1.2,92.6,4.6,-888888888,-888888888,5746,-555555555,5746,-888888888,...,,,,,,,,,17,155


In [251]:
df_il.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102 entries, 1 to 102
Columns: 717 entries, NAME to county
dtypes: object(717)
memory usage: 571.5+ KB


In [252]:
df_il.to_csv('../data/preprocessing/raw_dp05_il.csv', index=False)

## Combining States

In [253]:
df = pd.concat([df_tx, df_ny, df_ca, df_fl, df_il])
df

Unnamed: 0,NAME,DP05_0031PM,DP05_0032E,DP05_0032M,DP05_0032PE,DP05_0032PM,DP05_0033E,DP05_0033M,DP05_0033PE,DP05_0033PM,...,DP05_0029PEA,DP05_0030MA,DP05_0030EA,DP05_0030PMA,DP05_0030PEA,DP05_0031MA,DP05_0031EA,DP05_0031PEA,state,county
1,"Austin County, Texas",0.6,87.0,1.9,-888888888,-888888888,29565,-555555555,29565,-888888888,...,,,,,,,,,48,015
2,"Kenedy County, Texas",20.5,34.7,38.9,-888888888,-888888888,595,181,595,-888888888,...,,,,,,,,,48,261
3,"Nueces County, Texas",0.1,79.3,0.3,-888888888,-888888888,360486,-555555555,360486,-888888888,...,,,,,,,,,48,355
4,"Colorado County, Texas",0.1,85.6,0.3,-888888888,-888888888,21022,-555555555,21022,-888888888,...,,,,,,,,,48,089
5,"San Patricio County, Texas",0.5,86.1,1.6,-888888888,-888888888,67046,-555555555,67046,-888888888,...,,,,,,,,,48,409
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98,"Champaign County, Illinois",0.2,78.4,0.8,-888888888,-888888888,209448,-555555555,209448,-888888888,...,,,,,,,,,17,019
99,"Ford County, Illinois",0.5,73.3,1.4,-888888888,-888888888,13398,-555555555,13398,-888888888,...,,,,,,,,,17,053
100,"Kendall County, Illinois",0.5,80.9,1.8,-888888888,-888888888,124626,-555555555,124626,-888888888,...,,,,,,,,,17,093
101,"Marion County, Illinois",0.4,81.3,1.4,-888888888,-888888888,38084,-555555555,38084,-888888888,...,,,,,,,,,17,121
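Because each state frame carries its own row labels starting at 1, `pd.concat` produces a combined frame with repeated index values. A minimal sketch of the effect and of `ignore_index=True`, which renumbers the rows, using two tiny stand-in frames:

```python
import pandas as pd

tx = pd.DataFrame({'NAME': ['Austin County, Texas', 'Kenedy County, Texas']}, index=[1, 2])
ny = pd.DataFrame({'NAME': ['Schoharie County, New York']}, index=[1])

kept = pd.concat([tx, ny])                      # labels repeat: 1, 2, 1
fresh = pd.concat([tx, ny], ignore_index=True)  # labels renumbered: 0, 1, 2

print(kept.index.tolist())
print(fresh.index.tolist())
```

Repeated labels are harmless for the CSV export here (it uses `index=False`), but they matter for any later `.loc` lookup on the combined frame.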


In [254]:
# Export data to .csv
df.to_csv('../data/preprocessing/raw_dp05_five_states.csv', index=False)

## Pulling DP05 Headers

In [255]:
# Per directions in tutorial at https://www.youtube.com/watch?v=K0-ifZS0mQI&feature=emb_title&ab_channel=U.S.CensusBureau,
# downloaded table and saved csv data with overlays
# These headers correspond to the codes in the previous table

header_df = pd.read_csv('../data/preprocessing/acs5y2018_dp05_data_with_overlays.csv')
header_df

Unnamed: 0,GEO_ID,NAME,DP05_0031PM,DP05_0032E,DP05_0032M,DP05_0032PE,DP05_0032PM,DP05_0033E,DP05_0033M,DP05_0033PE,...,DP05_0029M,DP05_0029PE,DP05_0029PM,DP05_0030E,DP05_0030M,DP05_0030PE,DP05_0030PM,DP05_0031E,DP05_0031M,DP05_0031PE
0,id,Geographic Area Name,Percent Margin of Error!!SEX AND AGE!!Total po...,Estimate!!SEX AND AGE!!Total population!!65 ye...,Margin of Error!!SEX AND AGE!!Total population...,Percent Estimate!!SEX AND AGE!!Total populatio...,Percent Margin of Error!!SEX AND AGE!!Total po...,Estimate!!RACE!!Total population,Margin of Error!!RACE!!Total population,Percent Estimate!!RACE!!Total population,...,Margin of Error!!SEX AND AGE!!Total population...,Percent Estimate!!SEX AND AGE!!Total populatio...,Percent Margin of Error!!SEX AND AGE!!Total po...,Estimate!!SEX AND AGE!!Total population!!65 ye...,Margin of Error!!SEX AND AGE!!Total population...,Percent Estimate!!SEX AND AGE!!Total populatio...,Percent Margin of Error!!SEX AND AGE!!Total po...,Estimate!!SEX AND AGE!!Total population!!65 ye...,Margin of Error!!SEX AND AGE!!Total population...,Percent Estimate!!SEX AND AGE!!Total populatio...
1,0100000US,United States,0.1,79.3,0.1,(X),(X),322903030,*****,322903030,...,5463,49238581,(X),21781300,3215,44.2,0.1,27457281,3635,55.8


In [256]:
# Get rid of geo ID column
header_df = header_df.iloc[:, 1:]

header_df

Unnamed: 0,NAME,DP05_0031PM,DP05_0032E,DP05_0032M,DP05_0032PE,DP05_0032PM,DP05_0033E,DP05_0033M,DP05_0033PE,DP05_0033PM,...,DP05_0029M,DP05_0029PE,DP05_0029PM,DP05_0030E,DP05_0030M,DP05_0030PE,DP05_0030PM,DP05_0031E,DP05_0031M,DP05_0031PE
0,Geographic Area Name,Percent Margin of Error!!SEX AND AGE!!Total po...,Estimate!!SEX AND AGE!!Total population!!65 ye...,Margin of Error!!SEX AND AGE!!Total population...,Percent Estimate!!SEX AND AGE!!Total populatio...,Percent Margin of Error!!SEX AND AGE!!Total po...,Estimate!!RACE!!Total population,Margin of Error!!RACE!!Total population,Percent Estimate!!RACE!!Total population,Percent Margin of Error!!RACE!!Total population,...,Margin of Error!!SEX AND AGE!!Total population...,Percent Estimate!!SEX AND AGE!!Total populatio...,Percent Margin of Error!!SEX AND AGE!!Total po...,Estimate!!SEX AND AGE!!Total population!!65 ye...,Margin of Error!!SEX AND AGE!!Total population...,Percent Estimate!!SEX AND AGE!!Total populatio...,Percent Margin of Error!!SEX AND AGE!!Total po...,Estimate!!SEX AND AGE!!Total population!!65 ye...,Margin of Error!!SEX AND AGE!!Total population...,Percent Estimate!!SEX AND AGE!!Total populatio...
1,United States,0.1,79.3,0.1,(X),(X),322903030,*****,322903030,(X),...,5463,49238581,(X),21781300,3215,44.2,0.1,27457281,3635,55.8


In [257]:
header_df.to_csv('../data/preprocessing/dp05_headers.csv')

### Examine the columns

In [258]:
api_cols = list(df.columns)

In [259]:
header_cols = list(header_df.columns)

In [260]:
len(api_cols), len(header_cols)

(717, 357)

In [261]:
# Examine the difference in columns
difference = set(api_cols) - set(header_cols)

print(len(difference))

list(difference)[:10]

359


['DP05_0017PEA',
 'DP05_0049PMA',
 'DP05_0046PEA',
 'DP05_0089PEA',
 'DP05_0048EA',
 'DP05_0051MA',
 'DP05_0070EA',
 'DP05_0045PEA',
 'DP05_0033PEA',
 'DP05_0013MA']

In [262]:
# Looking some of these up in the .csv shows that at least some are blank
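Every extra column in the sample above ends in `EA`, `MA`, `PEA`, or `PMA`, which suggests they are the annotation counterparts of the estimate and margin-of-error columns. A sketch of separating them by suffix follows; the regular expression is an assumption based on the names visible in this output, not an official naming rule.

```python
import re

# Assumed pattern: table prefix, four-digit line number, value type, trailing "A".
ANNOTATION = re.compile(r'^DP\d{2}_\d{4}(?:PE|PM|E|M)A$')

columns = [
    'NAME', 'DP05_0032E', 'DP05_0032PM',
    'DP05_0017PEA', 'DP05_0048EA', 'DP05_0013MA', 'state',
]

annotation_cols = [c for c in columns if ANNOTATION.match(c)]
value_cols = [c for c in columns if c not in annotation_cols]

print(annotation_cols)
print(value_cols)
```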

In [263]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 543 entries, 1 to 102
Columns: 717 entries, NAME to county
dtypes: object(717)
memory usage: 3.0+ MB


In [264]:
df.isna().sum()

0
NAME              0
DP05_0031PM       0
DP05_0032E        0
DP05_0032M        0
DP05_0032PE       0
               ... 
DP05_0031MA     512
DP05_0031EA     543
DP05_0031PEA    543
state             0
county            0
Length: 717, dtype: int64

In [265]:
# Save the per-column NaN counts to a new dataframe.
nan_df = pd.DataFrame(df.isna().sum(), columns=['NaN Counts'])

# Display the columns that are NaN for more than 500 of the 543 counties.
nan_df[nan_df['NaN Counts'] > 500]

Unnamed: 0_level_0,NaN Counts
0,Unnamed: 1_level_1
DP05_0031PMA,523
DP05_0032MA,523
DP05_0032EA,543
DP05_0033EA,543
DP05_0033PEA,543
...,...
DP05_0030PMA,523
DP05_0030PEA,543
DP05_0031MA,512
DP05_0031EA,543


## Create a dictionary of the columns and their identifiers

In [266]:
row_one_df = header_df.iloc[:1, :]

row_one_df

Unnamed: 0,NAME,DP05_0031PM,DP05_0032E,DP05_0032M,DP05_0032PE,DP05_0032PM,DP05_0033E,DP05_0033M,DP05_0033PE,DP05_0033PM,...,DP05_0029M,DP05_0029PE,DP05_0029PM,DP05_0030E,DP05_0030M,DP05_0030PE,DP05_0030PM,DP05_0031E,DP05_0031M,DP05_0031PE
0,Geographic Area Name,Percent Margin of Error!!SEX AND AGE!!Total po...,Estimate!!SEX AND AGE!!Total population!!65 ye...,Margin of Error!!SEX AND AGE!!Total population...,Percent Estimate!!SEX AND AGE!!Total populatio...,Percent Margin of Error!!SEX AND AGE!!Total po...,Estimate!!RACE!!Total population,Margin of Error!!RACE!!Total population,Percent Estimate!!RACE!!Total population,Percent Margin of Error!!RACE!!Total population,...,Margin of Error!!SEX AND AGE!!Total population...,Percent Estimate!!SEX AND AGE!!Total populatio...,Percent Margin of Error!!SEX AND AGE!!Total po...,Estimate!!SEX AND AGE!!Total population!!65 ye...,Margin of Error!!SEX AND AGE!!Total population...,Percent Estimate!!SEX AND AGE!!Total populatio...,Percent Margin of Error!!SEX AND AGE!!Total po...,Estimate!!SEX AND AGE!!Total population!!65 ye...,Margin of Error!!SEX AND AGE!!Total population...,Percent Estimate!!SEX AND AGE!!Total populatio...


In [267]:
# Convert the row of the dataframe into a list
descriptions = row_one_df.values.tolist()

# The output is a nested list so convert it to a regular list
descriptions = descriptions[0]

# View the first five entries
descriptions[:5]

['Geographic Area Name',
 'Percent Margin of Error!!SEX AND AGE!!Total population!!65 years and over!!Female',
 'Estimate!!SEX AND AGE!!Total population!!65 years and over!!Sex ratio (males per 100 females)',
 'Margin of Error!!SEX AND AGE!!Total population!!65 years and over!!Sex ratio (males per 100 females)',
 'Percent Estimate!!SEX AND AGE!!Total population!!65 years and over!!Sex ratio (males per 100 females)']

In [268]:
# Check the length of the headers and their descriptions
len(header_cols), len(descriptions)

(357, 357)

In [269]:
# Create a dictionary from a zipped list of the header columns and descriptions
header_dict = dict(zip(header_cols, descriptions))

# View the first five entries in the dictionary
list(header_dict.items())[:5]

[('NAME', 'Geographic Area Name'),
 ('DP05_0031PM',
  'Percent Margin of Error!!SEX AND AGE!!Total population!!65 years and over!!Female'),
 ('DP05_0032E',
  'Estimate!!SEX AND AGE!!Total population!!65 years and over!!Sex ratio (males per 100 females)'),
 ('DP05_0032M',
  'Margin of Error!!SEX AND AGE!!Total population!!65 years and over!!Sex ratio (males per 100 females)'),
 ('DP05_0032PE',
  'Percent Estimate!!SEX AND AGE!!Total population!!65 years and over!!Sex ratio (males per 100 females)')]
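The descriptions pack a hierarchy into one string, with `!!` separating the measure type from the nested category path. A short sketch of pulling those parts apart; `split_label` is a hypothetical helper, and the labels are copied from the output above.

```python
def split_label(label):
    """Split a description into its measure type and category path."""
    measure, *path = label.split('!!')
    return measure, path


measure, path = split_label(
    'Estimate!!SEX AND AGE!!Total population!!65 years and over!!'
    'Sex ratio (males per 100 females)'
)
print(measure)
print(path)

# Labels without a separator, such as the name column, yield an empty path.
print(split_label('Geographic Area Name'))
```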

In [270]:
# Rename the columns in the original dataframe according to the dictionary
df.rename(columns = header_dict, inplace=True)

In [271]:
# Drop every column that contains at least one NaN value
df.dropna(axis=1, inplace=True)
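With its default `how='any'`, `dropna(axis=1)` removes a column as soon as a single row is missing, so partially filled columns go too. A small sketch contrasting that with `how='all'`, which only removes columns that are empty in every row (the frame below is made up for illustration):

```python
import pandas as pd

sample = pd.DataFrame({
    'NAME': ['A', 'B', 'C'],
    'mostly_present': ['(X)', None, '(X)'],
    'always_missing': [None, None, None],
})

drop_any = sample.dropna(axis=1)             # default how='any'
drop_all = sample.dropna(axis=1, how='all')  # keep partially filled columns

print(list(drop_any.columns))
print(list(drop_all.columns))
```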

In [272]:
# Display the dataframe
df

Unnamed: 0,Geographic Area Name,Percent Margin of Error!!SEX AND AGE!!Total population!!65 years and over!!Female,Estimate!!SEX AND AGE!!Total population!!65 years and over!!Sex ratio (males per 100 females),Margin of Error!!SEX AND AGE!!Total population!!65 years and over!!Sex ratio (males per 100 females),Percent Estimate!!SEX AND AGE!!Total population!!65 years and over!!Sex ratio (males per 100 females),Percent Margin of Error!!SEX AND AGE!!Total population!!65 years and over!!Sex ratio (males per 100 females),Estimate!!RACE!!Total population,Margin of Error!!RACE!!Total population,Percent Estimate!!RACE!!Total population,Percent Margin of Error!!RACE!!Total population,...,DP05_0004PMA,DP05_0004PEA,DP05_0018PMA,DP05_0018PEA,DP05_0025PMA,DP05_0028PMA,DP05_0028PEA,DP05_0029PMA,state,county
1,"Austin County, Texas",0.6,87.0,1.9,-888888888,-888888888,29565,-555555555,29565,-888888888,...,(X),(X),(X),(X),(X),(X),(X),(X),48,015
2,"Kenedy County, Texas",20.5,34.7,38.9,-888888888,-888888888,595,181,595,-888888888,...,(X),(X),(X),(X),(X),(X),(X),(X),48,261
3,"Nueces County, Texas",0.1,79.3,0.3,-888888888,-888888888,360486,-555555555,360486,-888888888,...,(X),(X),(X),(X),(X),(X),(X),(X),48,355
4,"Colorado County, Texas",0.1,85.6,0.3,-888888888,-888888888,21022,-555555555,21022,-888888888,...,(X),(X),(X),(X),(X),(X),(X),(X),48,089
5,"San Patricio County, Texas",0.5,86.1,1.6,-888888888,-888888888,67046,-555555555,67046,-888888888,...,(X),(X),(X),(X),(X),(X),(X),(X),48,409
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98,"Champaign County, Illinois",0.2,78.4,0.8,-888888888,-888888888,209448,-555555555,209448,-888888888,...,(X),(X),(X),(X),(X),(X),(X),(X),17,019
99,"Ford County, Illinois",0.5,73.3,1.4,-888888888,-888888888,13398,-555555555,13398,-888888888,...,(X),(X),(X),(X),(X),(X),(X),(X),17,053
100,"Kendall County, Illinois",0.5,80.9,1.8,-888888888,-888888888,124626,-555555555,124626,-888888888,...,(X),(X),(X),(X),(X),(X),(X),(X),17,093
101,"Marion County, Illinois",0.4,81.3,1.4,-888888888,-888888888,38084,-555555555,38084,-888888888,...,(X),(X),(X),(X),(X),(X),(X),(X),17,121


In [273]:
df.to_csv('../data/preprocessing/raw_dp05_with_headers_five_states.csv', index=False)
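The exported values still contain large negative placeholders such as `-888888888` and `-555555555`, which the "Notes on ACS Estimate and Annotation Values" page listed under Resources treats as special codes rather than real estimates. A sketch of converting columns to numbers and blanking those codes follows. Treating both codes as missing is an assumption about what later analysis needs, and the two-row frame is a stand-in for the real data.

```python
import pandas as pd

# Assumed list of placeholder codes; extend it with any others found in the data.
SENTINELS = [-888888888, -555555555]

sample = pd.DataFrame({
    'Estimate!!RACE!!Total population': ['29565', '595'],
    'Margin of Error!!RACE!!Total population': ['-555555555', '181'],
})

# The API returns strings, so convert first, then mask the placeholder codes.
numeric = sample.apply(pd.to_numeric, errors='coerce')
numeric = numeric.mask(numeric.isin(SENTINELS))

print(numeric)
```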

# Pulling DP03 From Census API
## States: TX (48), NY (36), CA (06), FL (12), IL (17)

Table: https://data.census.gov/cedsci/table?q=ACSDP1Y2019.DP03&tid=ACSDP1Y2019.DP03&hidePreview=true  
SELECTED ECONOMIC CHARACTERISTICS  
Survey/Program: American Community Survey   
2018: ACS 5-Year Estimates Data Profiles  
TableID: DP03  

### Expected URL format

In [274]:
example_url = 'https://api.census.gov/data/2018/acs/acs5/profile?get=group(DP03),NAME&for=county:*&in=state:48&key=abaab7067d6de5d6a0216fca0b8fca4e9015a87f'

## Texas: 48

In [275]:
# Set base url
url03 = 'https://api.census.gov/data/2018/acs/acs5/profile?'

# Set params
params03 = {
    'get': 'group(DP03),NAME',
    'for': 'county:*',
    'in': 'state:48',
    'key': 'abaab7067d6de5d6a0216fca0b8fca4e9015a87f'
}

# Make a request and display the response code.
res03 = requests.get(url03, params=params03)
res03

<Response [200]>

In [276]:
df03_tx = pd.DataFrame(res03.json())
df03_tx.head(3)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1091,1092,1093,1094,1095,1096,1097,1098,1099,1100
0,NAME,GEO_ID,DP03_0001E,DP03_0001M,DP03_0001PE,DP03_0001PM,DP03_0002E,DP03_0002M,DP03_0002PE,DP03_0002PM,...,DP03_0136EA,DP03_0136MA,DP03_0136PEA,DP03_0136PMA,DP03_0137MA,DP03_0137EA,DP03_0137PEA,DP03_0137PMA,state,county
1,"Austin County, Texas",0500000US48015,23354,108,23354,-888888888,14475,413,62.0,1.8,...,(X),(X),,,(X),(X),,,48,015
2,"Kenedy County, Texas",0500000US48261,428,122,428,-888888888,220,83,51.4,13.4,...,(X),(X),,,(X),(X),,,48,261


In [277]:
# Set the values in the first row to the columns
df03_tx.columns = df03_tx.iloc[0]

In [278]:
# Drop the first row
df03_tx = df03_tx.iloc[1:, :]

df03_tx.head(3)

Unnamed: 0,NAME,GEO_ID,DP03_0001E,DP03_0001M,DP03_0001PE,DP03_0001PM,DP03_0002E,DP03_0002M,DP03_0002PE,DP03_0002PM,...,DP03_0136EA,DP03_0136MA,DP03_0136PEA,DP03_0136PMA,DP03_0137MA,DP03_0137EA,DP03_0137PEA,DP03_0137PMA,state,county
1,"Austin County, Texas",0500000US48015,23354,108,23354,-888888888,14475,413,62.0,1.8,...,(X),(X),,,(X),(X),,,48,15
2,"Kenedy County, Texas",0500000US48261,428,122,428,-888888888,220,83,51.4,13.4,...,(X),(X),,,(X),(X),,,48,261
3,"Nueces County, Texas",0500000US48355,280990,413,280990,-888888888,177352,1636,63.1,0.6,...,(X),(X),,,(X),(X),,,48,355


In [279]:
# Display information about the dataframe.
df03_tx.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 254 entries, 1 to 254
Columns: 1101 entries, NAME to county
dtypes: object(1101)
memory usage: 2.1+ MB


In [280]:
# Export the dataframe as a csv.
df03_tx.to_csv('../data/preprocessing/raw_dp03_tx.csv', index=False)

## New York: 36

In [281]:
# Set base url
url03 = 'https://api.census.gov/data/2018/acs/acs5/profile?'

# Set params
params03 = {
    'get': 'group(DP03),NAME',
    'for': 'county:*',
    'in': 'state:36',
    'key': 'abaab7067d6de5d6a0216fca0b8fca4e9015a87f'
}

# Make a request and display the response code.
res03 = requests.get(url03, params=params03)
res03

<Response [200]>

In [282]:
df03_ny = pd.DataFrame(res03.json())
df03_ny.head(3)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1091,1092,1093,1094,1095,1096,1097,1098,1099,1100
0,NAME,GEO_ID,DP03_0001E,DP03_0001M,DP03_0001PE,DP03_0001PM,DP03_0002E,DP03_0002M,DP03_0002PE,DP03_0002PM,...,DP03_0136EA,DP03_0136MA,DP03_0136PEA,DP03_0136PMA,DP03_0137MA,DP03_0137EA,DP03_0137PEA,DP03_0137PMA,state,county
1,"Schoharie County, New York",0500000US36095,26508,75,26508,-888888888,15390,382,58.1,1.5,...,(X),(X),,,(X),(X),,,36,095
2,"Onondaga County, New York",0500000US36067,376244,342,376244,-888888888,235381,1621,62.6,0.4,...,(X),(X),,,(X),(X),,,36,067


In [283]:
# Set the values in the first row to the columns
df03_ny.columns = df03_ny.iloc[0]

In [284]:
# Drop the first row
df03_ny = df03_ny.iloc[1:, :]

df03_ny.head(3)

Unnamed: 0,NAME,GEO_ID,DP03_0001E,DP03_0001M,DP03_0001PE,DP03_0001PM,DP03_0002E,DP03_0002M,DP03_0002PE,DP03_0002PM,...,DP03_0136EA,DP03_0136MA,DP03_0136PEA,DP03_0136PMA,DP03_0137MA,DP03_0137EA,DP03_0137PEA,DP03_0137PMA,state,county
1,"Schoharie County, New York",0500000US36095,26508,75,26508,-888888888,15390,382,58.1,1.5,...,(X),(X),,,(X),(X),,,36,95
2,"Onondaga County, New York",0500000US36067,376244,342,376244,-888888888,235381,1621,62.6,0.4,...,(X),(X),,,(X),(X),,,36,67
3,"Clinton County, New York",0500000US36019,67919,178,67919,-888888888,37903,797,55.8,1.2,...,(X),(X),,,(X),(X),,,36,19


In [285]:
# Display information about the dataframe.
df03_ny.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62 entries, 1 to 62
Columns: 1101 entries, NAME to county
dtypes: object(1101)
memory usage: 533.4+ KB


In [286]:
# Export the dataframe as a csv.
df03_ny.to_csv('../data/preprocessing/raw_dp03_ny.csv', index=False)

## California: 06

In [287]:
# Set base url
url03 = 'https://api.census.gov/data/2018/acs/acs5/profile?'

# Set params
params03 = {
    'get': 'group(DP03),NAME',
    'for': 'county:*',
    'in': 'state:06',
    'key': 'abaab7067d6de5d6a0216fca0b8fca4e9015a87f'
}

# Make a request and display the response code.
res03 = requests.get(url03, params=params03)
res03

<Response [200]>

In [288]:
df03_ca = pd.DataFrame(res03.json())
df03_ca.head(3)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1091,1092,1093,1094,1095,1096,1097,1098,1099,1100
0,NAME,GEO_ID,DP03_0001E,DP03_0001M,DP03_0001PE,DP03_0001PM,DP03_0002E,DP03_0002M,DP03_0002PE,DP03_0002PM,...,DP03_0136EA,DP03_0136MA,DP03_0136PEA,DP03_0136PMA,DP03_0137MA,DP03_0137EA,DP03_0137PEA,DP03_0137PMA,state,county
1,"Lake County, California",0500000US06033,52331,199,52331,-888888888,26160,858,50.0,1.6,...,(X),(X),,,(X),(X),,,06,033
2,"Mariposa County, California",0500000US06043,15019,113,15019,-888888888,7735,395,51.5,2.5,...,(X),(X),,,(X),(X),,,06,043


In [289]:
# Set the values in the first row to the columns
df03_ca.columns = df03_ca.iloc[0]

In [290]:
# Drop the first row
df03_ca = df03_ca.iloc[1:, :]

df03_ca.head(3)

Unnamed: 0,NAME,GEO_ID,DP03_0001E,DP03_0001M,DP03_0001PE,DP03_0001PM,DP03_0002E,DP03_0002M,DP03_0002PE,DP03_0002PM,...,DP03_0136EA,DP03_0136MA,DP03_0136PEA,DP03_0136PMA,DP03_0137MA,DP03_0137EA,DP03_0137PEA,DP03_0137PMA,state,county
1,"Lake County, California",0500000US06033,52331,199,52331,-888888888,26160,858,50.0,1.6,...,(X),(X),,,(X),(X),,,6,33
2,"Mariposa County, California",0500000US06043,15019,113,15019,-888888888,7735,395,51.5,2.5,...,(X),(X),,,(X),(X),,,6,43
3,"Yuba County, California",0500000US06115,56589,222,56589,-888888888,33246,972,58.7,1.7,...,(X),(X),,,(X),(X),,,6,115


In [291]:
# Display information about the dataframe.
df03_ca.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58 entries, 1 to 58
Columns: 1101 entries, NAME to county
dtypes: object(1101)
memory usage: 499.0+ KB


In [292]:
# Export the dataframe as a csv.
df03_ca.to_csv('../data/preprocessing/raw_dp03_ca.csv', index=False)

## Florida: 12

In [293]:
# Set base url
url03 = 'https://api.census.gov/data/2018/acs/acs5/profile?'

# Set params
params03 = {
    'get': 'group(DP03),NAME',
    'for': 'county:*',
    'in': 'state:12',
    'key': 'abaab7067d6de5d6a0216fca0b8fca4e9015a87f'
}

# Make a request and display the response code.
res03 = requests.get(url03, params=params03)
res03

<Response [200]>

In [294]:
df03_fl = pd.DataFrame(res03.json())
df03_fl.head(3)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1091,1092,1093,1094,1095,1096,1097,1098,1099,1100
0,NAME,GEO_ID,DP03_0001E,DP03_0001M,DP03_0001PE,DP03_0001PM,DP03_0002E,DP03_0002M,DP03_0002PE,DP03_0002PM,...,DP03_0136EA,DP03_0136MA,DP03_0136PEA,DP03_0136PMA,DP03_0137MA,DP03_0137EA,DP03_0137PEA,DP03_0137PMA,state,county
1,"Okaloosa County, Florida",0500000US12091,160704,276,160704,-888888888,102023,1281,63.5,0.8,...,(X),(X),,,(X),(X),,,12,091
2,"Taylor County, Florida",0500000US12123,18353,161,18353,-888888888,6918,443,37.7,2.4,...,(X),(X),,,(X),(X),,,12,123


In [295]:
# Set the values in the first row to the columns
df03_fl.columns = df03_fl.iloc[0]

In [296]:
# Drop the first row
df03_fl = df03_fl.iloc[1:, :]

df03_fl.head(3)

Unnamed: 0,NAME,GEO_ID,DP03_0001E,DP03_0001M,DP03_0001PE,DP03_0001PM,DP03_0002E,DP03_0002M,DP03_0002PE,DP03_0002PM,...,DP03_0136EA,DP03_0136MA,DP03_0136PEA,DP03_0136PMA,DP03_0137MA,DP03_0137EA,DP03_0137PEA,DP03_0137PMA,state,county
1,"Okaloosa County, Florida",0500000US12091,160704,276,160704,-888888888,102023,1281,63.5,0.8,...,(X),(X),,,(X),(X),,,12,91
2,"Taylor County, Florida",0500000US12123,18353,161,18353,-888888888,6918,443,37.7,2.4,...,(X),(X),,,(X),(X),,,12,123
3,"Washington County, Florida",0500000US12133,20237,91,20237,-888888888,9358,507,46.2,2.5,...,(X),(X),,,(X),(X),,,12,133


In [297]:
# Display information about the dataframe.
df03_fl.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67 entries, 1 to 67
Columns: 1101 entries, NAME to county
dtypes: object(1101)
memory usage: 576.4+ KB


In [298]:
# Export the dataframe as a csv.
df03_fl.to_csv('../data/preprocessing/raw_dp03_fl.csv', index=False)

## Illinois: 17

In [299]:
# Set base url
url03 = 'https://api.census.gov/data/2018/acs/acs5/profile?'

# Set params
params03 = {
    'get': 'group(DP03),NAME',
    'for': 'county:*',
    'in': 'state:17',
    'key': 'abaab7067d6de5d6a0216fca0b8fca4e9015a87f'
}

# Make a request and display the response code.
res03 = requests.get(url03, params=params03)
res03

<Response [200]>

In [300]:
df03_il = pd.DataFrame(res03.json())
df03_il.head(3)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1091,1092,1093,1094,1095,1096,1097,1098,1099,1100
0,NAME,GEO_ID,DP03_0001E,DP03_0001M,DP03_0001PE,DP03_0001PM,DP03_0002E,DP03_0002M,DP03_0002PE,DP03_0002PM,...,DP03_0136EA,DP03_0136MA,DP03_0136PEA,DP03_0136PMA,DP03_0137MA,DP03_0137EA,DP03_0137PEA,DP03_0137PMA,state,county
1,"Jersey County, Illinois",0500000US17083,18079,74,18079,-888888888,11127,375,61.5,2.1,...,(X),(X),,,(X),(X),,,17,083
2,"Putnam County, Illinois",0500000US17155,4777,27,4777,-888888888,3107,117,65.0,2.4,...,(X),(X),,,(X),(X),,,17,155


In [301]:
# Use the values in the first row as the column names
df03_il.columns = df03_il.iloc[0]

In [302]:
# Drop the first row
df03_il = df03_il.iloc[1:, :]

df03_il.head(3)

Unnamed: 0,NAME,GEO_ID,DP03_0001E,DP03_0001M,DP03_0001PE,DP03_0001PM,DP03_0002E,DP03_0002M,DP03_0002PE,DP03_0002PM,...,DP03_0136EA,DP03_0136MA,DP03_0136PEA,DP03_0136PMA,DP03_0137MA,DP03_0137EA,DP03_0137PEA,DP03_0137PMA,state,county
1,"Jersey County, Illinois",0500000US17083,18079,74,18079,-888888888,11127,375,61.5,2.1,...,(X),(X),,,(X),(X),,,17,83
2,"Putnam County, Illinois",0500000US17155,4777,27,4777,-888888888,3107,117,65.0,2.4,...,(X),(X),,,(X),(X),,,17,155
3,"De Witt County, Illinois",0500000US17039,13024,53,13024,-888888888,8614,250,66.1,1.9,...,(X),(X),,,(X),(X),,,17,39


In [303]:
# Display information about the dataframe.
df03_il.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102 entries, 1 to 102
Columns: 1101 entries, NAME to county
dtypes: object(1101)
memory usage: 877.5+ KB


In [304]:
# Export the dataframe as a csv.
df03_il.to_csv('../data/preprocessing/raw_dp03_il.csv', index=False)
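
The same steps repeat for every state: build the dataframe, promote the first row to the column names, then drop that row. A minimal sketch of a helper that does this in one place follows. `rows_to_frame` is a hypothetical name that does not appear elsewhere in this notebook, and the sample rows are a trimmed copy of the Illinois output above rather than a live API response.

```python
import pandas as pd


def rows_to_frame(rows):
    """Turn a Census-API style list of lists into a DataFrame.

    Assumes the first inner list holds the column names and every
    remaining inner list holds one record.
    """
    header, *records = rows
    return pd.DataFrame(records, columns=header)


# Trimmed sample shaped like res03.json(); values copied from the output above.
sample_rows = [
    ["NAME", "GEO_ID", "DP03_0001E", "state", "county"],
    ["Jersey County, Illinois", "0500000US17083", "18079", "17", "083"],
    ["Putnam County, Illinois", "0500000US17155", "4777", "17", "155"],
]

sample_df = rows_to_frame(sample_rows)
print(sample_df)
```

Unlike the cells above, this sketch numbers the rows from 0 rather than 1, because the header row never becomes part of the dataframe.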

## Combining States

In [305]:
df03 = pd.concat([df03_tx, df03_ny, df03_ca, df03_fl, df03_il])
df03

Unnamed: 0,NAME,GEO_ID,DP03_0001E,DP03_0001M,DP03_0001PE,DP03_0001PM,DP03_0002E,DP03_0002M,DP03_0002PE,DP03_0002PM,...,DP03_0136EA,DP03_0136MA,DP03_0136PEA,DP03_0136PMA,DP03_0137MA,DP03_0137EA,DP03_0137PEA,DP03_0137PMA,state,county
1,"Austin County, Texas",0500000US48015,23354,108,23354,-888888888,14475,413,62.0,1.8,...,(X),(X),,,(X),(X),,,48,015
2,"Kenedy County, Texas",0500000US48261,428,122,428,-888888888,220,83,51.4,13.4,...,(X),(X),,,(X),(X),,,48,261
3,"Nueces County, Texas",0500000US48355,280990,413,280990,-888888888,177352,1636,63.1,0.6,...,(X),(X),,,(X),(X),,,48,355
4,"Colorado County, Texas",0500000US48089,16585,107,16585,-888888888,9832,373,59.3,2.2,...,(X),(X),,,(X),(X),,,48,089
5,"San Patricio County, Texas",0500000US48409,50886,205,50886,-888888888,30618,712,60.2,1.4,...,(X),(X),,,(X),(X),,,48,409
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98,"Champaign County, Illinois",0500000US17019,174427,251,174427,-888888888,108838,1206,62.4,0.7,...,(X),(X),,,(X),(X),,,17,019
99,"Ford County, Illinois",0500000US17053,10736,98,10736,-888888888,6618,184,61.6,1.8,...,(X),(X),,,(X),(X),,,17,053
100,"Kendall County, Illinois",0500000US17093,92166,424,92166,-888888888,67387,1364,73.1,1.4,...,(X),(X),,,(X),(X),,,17,093
101,"Marion County, Illinois",0500000US17121,30162,100,30162,-888888888,18482,320,61.3,1.0,...,(X),(X),,,(X),(X),,,17,121


In [306]:
# Export data to .csv
df03.to_csv('../data/preprocessing/raw_dp03_five_states.csv', index=False)
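
Each per-state dataframe keeps its own index starting at 1, so the combined dataframe carries repeated index labels. This does not affect the exported file, since `index=False` is used, but it matters for label-based lookups such as `df03.loc[1]`. The sketch below uses two tiny stand-in frames (not the real data) to show the effect and one possible fix, `ignore_index=True`.

```python
import pandas as pd

# Two tiny stand-ins for per-state frames, each indexed from 1 like the frames above.
tx = pd.DataFrame(
    {"NAME": ["Austin County, Texas", "Kenedy County, Texas"]}, index=[1, 2]
)
il = pd.DataFrame(
    {"NAME": ["Jersey County, Illinois", "Putnam County, Illinois"]}, index=[1, 2]
)

# Plain concat keeps the original labels, so 1 and 2 each appear twice.
stacked = pd.concat([tx, il])

# ignore_index=True discards the old labels and renumbers from 0.
renumbered = pd.concat([tx, il], ignore_index=True)

print(stacked.index.is_unique)
print(renumbered.index.is_unique)
```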

## Pulling DP03 Headers

In [307]:
# Following the tutorial at https://www.youtube.com/watch?v=K0-ifZS0mQI&feature=emb_title&ab_channel=U.S.CensusBureau,
# I downloaded the table and saved the "data with overlays" csv.
# The headers in that file correspond to the codes in the previous table.

header_df03 = pd.read_csv('../data/preprocessing/acs5y2018_dp03_data_with_overlays.csv')
header_df03.head(3)

Unnamed: 0,GEO_ID,NAME,DP03_0001E,DP03_0001M,DP03_0001PE,DP03_0001PM,DP03_0002E,DP03_0002M,DP03_0002PE,DP03_0002PM,...,DP03_0135PE,DP03_0135PM,DP03_0136E,DP03_0136M,DP03_0136PE,DP03_0136PM,DP03_0137E,DP03_0137M,DP03_0137PE,DP03_0137PM
0,id,Geographic Area Name,Estimate!!EMPLOYMENT STATUS!!Population 16 yea...,Margin of Error!!EMPLOYMENT STATUS!!Population...,Percent Estimate!!EMPLOYMENT STATUS!!Populatio...,Percent Margin of Error!!EMPLOYMENT STATUS!!Po...,Estimate!!EMPLOYMENT STATUS!!Population 16 yea...,Margin of Error!!EMPLOYMENT STATUS!!Population...,Percent Estimate!!EMPLOYMENT STATUS!!Populatio...,Percent Margin of Error!!EMPLOYMENT STATUS!!Po...,...,Percent Estimate!!PERCENTAGE OF FAMILIES AND P...,Percent Margin of Error!!PERCENTAGE OF FAMILIE...,Estimate!!PERCENTAGE OF FAMILIES AND PEOPLE WH...,Margin of Error!!PERCENTAGE OF FAMILIES AND PE...,Percent Estimate!!PERCENTAGE OF FAMILIES AND P...,Percent Margin of Error!!PERCENTAGE OF FAMILIE...,Estimate!!PERCENTAGE OF FAMILIES AND PEOPLE WH...,Margin of Error!!PERCENTAGE OF FAMILIES AND PE...,Percent Estimate!!PERCENTAGE OF FAMILIES AND P...,Percent Margin of Error!!PERCENTAGE OF FAMILIE...
1,0100000US,United States,257754872,16354,257754872,(X),163276329,146596,63.3,0.1,...,9.3,0.1,(X),(X),11.3,0.1,(X),(X),25.6,0.1


In [308]:
# Drop the GEO_ID column (the first column)
header_df03 = header_df03.iloc[:, 1:]

header_df03.head(3)

Unnamed: 0,NAME,DP03_0001E,DP03_0001M,DP03_0001PE,DP03_0001PM,DP03_0002E,DP03_0002M,DP03_0002PE,DP03_0002PM,DP03_0003E,...,DP03_0135PE,DP03_0135PM,DP03_0136E,DP03_0136M,DP03_0136PE,DP03_0136PM,DP03_0137E,DP03_0137M,DP03_0137PE,DP03_0137PM
0,Geographic Area Name,Estimate!!EMPLOYMENT STATUS!!Population 16 yea...,Margin of Error!!EMPLOYMENT STATUS!!Population...,Percent Estimate!!EMPLOYMENT STATUS!!Populatio...,Percent Margin of Error!!EMPLOYMENT STATUS!!Po...,Estimate!!EMPLOYMENT STATUS!!Population 16 yea...,Margin of Error!!EMPLOYMENT STATUS!!Population...,Percent Estimate!!EMPLOYMENT STATUS!!Populatio...,Percent Margin of Error!!EMPLOYMENT STATUS!!Po...,Estimate!!EMPLOYMENT STATUS!!Population 16 yea...,...,Percent Estimate!!PERCENTAGE OF FAMILIES AND P...,Percent Margin of Error!!PERCENTAGE OF FAMILIE...,Estimate!!PERCENTAGE OF FAMILIES AND PEOPLE WH...,Margin of Error!!PERCENTAGE OF FAMILIES AND PE...,Percent Estimate!!PERCENTAGE OF FAMILIES AND P...,Percent Margin of Error!!PERCENTAGE OF FAMILIE...,Estimate!!PERCENTAGE OF FAMILIES AND PEOPLE WH...,Margin of Error!!PERCENTAGE OF FAMILIES AND PE...,Percent Estimate!!PERCENTAGE OF FAMILIES AND P...,Percent Margin of Error!!PERCENTAGE OF FAMILIE...
1,United States,257754872,16354,257754872,(X),163276329,146596,63.3,0.1,162248196,...,9.3,0.1,(X),(X),11.3,0.1,(X),(X),25.6,0.1


In [309]:
# Export the dataframe as a csv.
header_df03.to_csv('../data/preprocessing/dp03_headers.csv')

### Examine the columns

In [310]:
api_cols = list(df03.columns)

In [311]:
header_cols = list(header_df03.columns)

In [312]:
len(api_cols), len(header_cols)

(1101, 549)

In [313]:
# Examine the difference in columns
difference = set(api_cols) - set(header_cols)

print(len(difference))

list(difference)[:10]

551


['DP03_0064EA',
 'DP03_0046MA',
 'DP03_0132EA',
 'DP03_0054EA',
 'DP03_0058PMA',
 'DP03_0029EA',
 'DP03_0132PEA',
 'DP03_0113MA',
 'DP03_0089MA',
 'DP03_0051PMA']
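
The ten names shown above all end in `EA`, `MA`, `PEA`, or `PMA`. One way to check whether that pattern holds more widely is to tally the column names by suffix. The sketch below runs on a small hand-picked list of names taken from the outputs above; whether every one of the 551 extra columns follows the pattern is an assumption this sample cannot confirm. `suffix_of` is a hypothetical helper.

```python
import re
from collections import Counter

# Small sample of column names taken from the outputs above.
cols = [
    "NAME", "GEO_ID",
    "DP03_0001E", "DP03_0001M", "DP03_0001PE", "DP03_0001PM",
    "DP03_0064EA", "DP03_0046MA", "DP03_0132PEA", "DP03_0058PMA",
    "state", "county",
]

# Capture the letters after the four-digit line number, e.g. "PEA" in DP03_0132PEA.
pattern = re.compile(r"^DP03_\d{4}([A-Z]+)$")


def suffix_of(col):
    """Return the letter suffix of a DP03 code, or 'other' for non-DP03 names."""
    match = pattern.match(col)
    return match.group(1) if match else "other"


suffix_counts = Counter(suffix_of(c) for c in cols)
print(suffix_counts)
```

Running the same tally on `api_cols` would show how many columns fall under each suffix.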

In [314]:
# Looking some of these up in the .csv shows that at least some of them are blank

In [315]:
# Count the NaN values in each column and save the counts to a new dataframe.
nan_df = pd.DataFrame(df03.isna().sum(), columns=['NaN Counts'])

# Display the entries with many NaN values.
nan_df[nan_df['NaN Counts'] > 200]

Unnamed: 0_level_0,NaN Counts
0,Unnamed: 1_level_1
DP03_0001EA,543
DP03_0001MA,543
DP03_0001PEA,543
DP03_0002EA,543
DP03_0002MA,543
...,...
DP03_0135PEA,543
DP03_0136PEA,543
DP03_0136PMA,543
DP03_0137PEA,543


## Create a dictionary of the columns and their identifiers

In [316]:
row_one_df = header_df03.iloc[:1, :]

row_one_df

Unnamed: 0,NAME,DP03_0001E,DP03_0001M,DP03_0001PE,DP03_0001PM,DP03_0002E,DP03_0002M,DP03_0002PE,DP03_0002PM,DP03_0003E,...,DP03_0135PE,DP03_0135PM,DP03_0136E,DP03_0136M,DP03_0136PE,DP03_0136PM,DP03_0137E,DP03_0137M,DP03_0137PE,DP03_0137PM
0,Geographic Area Name,Estimate!!EMPLOYMENT STATUS!!Population 16 yea...,Margin of Error!!EMPLOYMENT STATUS!!Population...,Percent Estimate!!EMPLOYMENT STATUS!!Populatio...,Percent Margin of Error!!EMPLOYMENT STATUS!!Po...,Estimate!!EMPLOYMENT STATUS!!Population 16 yea...,Margin of Error!!EMPLOYMENT STATUS!!Population...,Percent Estimate!!EMPLOYMENT STATUS!!Populatio...,Percent Margin of Error!!EMPLOYMENT STATUS!!Po...,Estimate!!EMPLOYMENT STATUS!!Population 16 yea...,...,Percent Estimate!!PERCENTAGE OF FAMILIES AND P...,Percent Margin of Error!!PERCENTAGE OF FAMILIE...,Estimate!!PERCENTAGE OF FAMILIES AND PEOPLE WH...,Margin of Error!!PERCENTAGE OF FAMILIES AND PE...,Percent Estimate!!PERCENTAGE OF FAMILIES AND P...,Percent Margin of Error!!PERCENTAGE OF FAMILIE...,Estimate!!PERCENTAGE OF FAMILIES AND PEOPLE WH...,Margin of Error!!PERCENTAGE OF FAMILIES AND PE...,Percent Estimate!!PERCENTAGE OF FAMILIES AND P...,Percent Margin of Error!!PERCENTAGE OF FAMILIE...


In [317]:
# Convert the row of the dataframe into a list
descriptions = row_one_df.values.tolist()

# The output is a nested list so convert it to a regular list
descriptions = descriptions[0]

# View the first five entries
descriptions[:5]

['Geographic Area Name',
 'Estimate!!EMPLOYMENT STATUS!!Population 16 years and over',
 'Margin of Error!!EMPLOYMENT STATUS!!Population 16 years and over',
 'Percent Estimate!!EMPLOYMENT STATUS!!Population 16 years and over',
 'Percent Margin of Error!!EMPLOYMENT STATUS!!Population 16 years and over']

In [318]:
# Check the length of the headers and their descriptions
len(header_cols), len(descriptions)

(549, 549)

In [319]:
# Create a dictionary from a zipped list of the header columns and descriptions
header_dict = dict(zip(header_cols, descriptions))

# View the first five entries in the dictionary
list(header_dict.items())[:5]

[('NAME', 'Geographic Area Name'),
 ('DP03_0001E', 'Estimate!!EMPLOYMENT STATUS!!Population 16 years and over'),
 ('DP03_0001M',
  'Margin of Error!!EMPLOYMENT STATUS!!Population 16 years and over'),
 ('DP03_0001PE',
  'Percent Estimate!!EMPLOYMENT STATUS!!Population 16 years and over'),
 ('DP03_0001PM',
  'Percent Margin of Error!!EMPLOYMENT STATUS!!Population 16 years and over')]
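
As an aside, the same code-to-description mapping can be built without the intermediate list. Selecting one row with `iloc[0]` returns a Series indexed by the column names, and `to_dict()` turns it into a dictionary. The sketch below uses a two-column stand-in for `header_df03` (`header_sample` is a hypothetical name) and checks that both approaches agree.

```python
import pandas as pd

# Small stand-in for header_df03: row 0 holds the descriptions.
header_sample = pd.DataFrame(
    {
        "NAME": ["Geographic Area Name", "United States"],
        "DP03_0001E": [
            "Estimate!!EMPLOYMENT STATUS!!Population 16 years and over",
            "257754872",
        ],
    }
)

# Row 0 as a Series is indexed by column name; to_dict() maps code -> description.
sample_dict = header_sample.iloc[0].to_dict()

# The zip-based construction used above, for comparison.
zipped = dict(zip(header_sample.columns, header_sample.iloc[0].tolist()))

print(sample_dict == zipped)
```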

In [320]:
# Rename the columns in the original dataframe according to the dictionary
df03 = df03.rename(columns = header_dict)

In [321]:
# Drop columns that contain any NaN values
df03 = df03.dropna(axis=1)
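
Note that `dropna(axis=1)` uses the default `how='any'`, so a column is removed if even one row is missing. If only the entirely empty columns should go, `how='all'` is the alternative. The sketch below contrasts the two on a small made-up frame; the column names mimic the real ones but the values in the partly empty column are illustrative only.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame(
    {
        "DP03_0001E": ["23354", "428"],    # fully populated
        "DP03_0001EA": [np.nan, np.nan],   # entirely empty
        "DP03_0002EA": ["(X)", np.nan],    # partly empty (made-up values)
    }
)

# Default how="any": drop a column if it has at least one NaN.
any_dropped = sample.dropna(axis=1)

# how="all": drop a column only if every value is NaN.
all_dropped = sample.dropna(axis=1, how="all")

print(list(any_dropped.columns))
print(list(all_dropped.columns))
```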

In [322]:
# Display the dataframe
df03.head(3)

Unnamed: 0,Geographic Area Name,GEO_ID,Estimate!!EMPLOYMENT STATUS!!Population 16 years and over,Margin of Error!!EMPLOYMENT STATUS!!Population 16 years and over,Percent Estimate!!EMPLOYMENT STATUS!!Population 16 years and over,Percent Margin of Error!!EMPLOYMENT STATUS!!Population 16 years and over,Estimate!!EMPLOYMENT STATUS!!Population 16 years and over!!In labor force,Margin of Error!!EMPLOYMENT STATUS!!Population 16 years and over!!In labor force,Percent Estimate!!EMPLOYMENT STATUS!!Population 16 years and over!!In labor force,Percent Margin of Error!!EMPLOYMENT STATUS!!Population 16 years and over!!In labor force,...,DP03_0134EA,DP03_0134MA,DP03_0135EA,DP03_0135MA,DP03_0136EA,DP03_0136MA,DP03_0137MA,DP03_0137EA,state,county
1,"Austin County, Texas",0500000US48015,23354,108,23354,-888888888,14475,413,62.0,1.8,...,(X),(X),(X),(X),(X),(X),(X),(X),48,15
2,"Kenedy County, Texas",0500000US48261,428,122,428,-888888888,220,83,51.4,13.4,...,(X),(X),(X),(X),(X),(X),(X),(X),48,261
3,"Nueces County, Texas",0500000US48355,280990,413,280990,-888888888,177352,1636,63.1,0.6,...,(X),(X),(X),(X),(X),(X),(X),(X),48,355


In [323]:
df03.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 543 entries, 1 to 102
Columns: 644 entries, Geographic Area Name to county
dtypes: object(644)
memory usage: 2.7+ MB


In [324]:
# Export the dataframe as a csv.
df03.to_csv('../data/preprocessing/raw_dp03_with_headers_five_states.csv', index=False)
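
One thing to watch when these exported files are read back: the `state` and `county` columns hold FIPS codes with leading zeros (for example `015`), and `pd.read_csv` will parse them as integers unless told otherwise. A minimal sketch, using an in-memory stand-in for the exported file, shows the difference that `dtype=str` makes.

```python
import io

import pandas as pd

# In-memory stand-in for one of the exported CSV files (rows taken from the output above).
csv_text = (
    "NAME,state,county\n"
    '"Austin County, Texas",48,015\n'
    '"Colorado County, Texas",48,089\n'
)

# Default type inference turns the codes into integers and loses the leading zeros.
inferred = pd.read_csv(io.StringIO(csv_text))

# Reading every column as text keeps the codes exactly as written.
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(inferred["county"].tolist())
print(as_text["county"].tolist())
```

Reading everything as text also keeps the estimate columns as strings, so numeric columns would need an explicit conversion afterwards.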