# ACS data cleaning, variable selection and visualisation

Visualising ACS indicators and automating plotting per county in California



### Table of Contents
* [Dataset descriptions](#description)<br>
1. [Loading and exploring data:](#load) Familiarization with querying ACS API <br> 
1. [Selecting variables:](#select) /Work in progress/ Select variables present in both RECS and ACS: income indicators and demographics for the final insights<br>
1. [Visualising data:](#visualize) /Dummy/ Spot-check plot of a chosen indicator for one county in California<br>
1. [Generalising visualisation:](#generic) Automating spatial visualisation of indicators for later use<br>
1. [Exporting data:](#export) /Work in progress/ Export the variables selected in Section 2 for use in ML models<br>

Note: /Work in progress/ to do the same with IPUMS disaggregated microdata

---

## Describing the ACS dataset<a id='description'></a>

- ##### American Community Survey (2015-2019?)

**Data collection and querying** 

I am looking at Census data from the 5-year American Community Survey (ACS). The data is collected over 5 years and gives a very high level of detail on the health, demographics and housing of households across the US (data is given per census block group). The ACS produces survey-based period estimates. For instance, the 2011-2015 5-year estimates are based on data collected during all 5 years, by selecting a random sample of households (around 2 million per year) across the US and interviewing them. The data is hence collected over 60 months.


To obtain the data I installed a Python package that interfaces with the Census API. Its documentation can be found here: https://jtleider.github.io/censusdata/ and its code repository here: https://github.com/jtleider/censusdata. The package uses the official US Census API, which can be found at: https://www.census.gov/data/developers.html




**Dataset description**

- Structure: Each entry (row) represents 5-year estimates for variables in a particular census block group in the US. Columns represent individual variables relating to health, housing and socio-economic indicators.
- Granularity: Data is given at the census block group level (smaller than a zip code or census tract) and represents aggregate data of randomly sampled households in that census block group, over 5 years.
- Scope: Each observation (census block group) has been randomly sampled from every state, the District of Columbia, and Puerto Rico. Census block groups not sampled have synthetic data, estimated by the Census Bureau, so that there is full coverage of the US. Topics covered include income, employment, health insurance, age distribution, and education.
- Temporality: The 5-year ACS gives estimates of the values of the data at the census block group level, based on the data collected in the yearly ACS surveys. However, as the Census Bureau points out in a disclaimer, this is not merely the average of the variable values of the 5 annual ACS surveys, but has gone through more statistically robust checks. Hence, each entry has only one estimate per variable, showing the representative value of that variable over the 5-year period of the ACS (2015-2019, for instance, in the case above).
- Faithfulness: The data comes from a very official source, the US Census Bureau. There is some metadata on what values variables take if there are any errors (-888888); these should be dropped, or set aside to understand any systematic reasons why those errors occurred. Margins of error/levels of uncertainty are reported for the variables, so we should check these when drawing conclusions from variables that might be more uncertain. Note: it is not clear to me how errors propagate from our individual variables to our ML models, but I would like to know how to deal with this.
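On the question of uncertainty, a first step is to carry the published margins of error (MOE) through to derived indicators. The sketch below uses the approximation formulas the Census Bureau documents for sums and proportions of ACS estimates; the example numbers are hypothetical, and this only covers derived indicators, not propagation into an ML model.

```python
import math

def moe_sum(moes):
    """MOE of a sum of estimates: square root of the sum of squared MOEs."""
    return math.sqrt(sum(m ** 2 for m in moes))

def moe_proportion(num, den, moe_num, moe_den):
    """MOE of a proportion num/den, where num counts a subset of den."""
    p = num / den
    radicand = moe_num ** 2 - (p ** 2) * (moe_den ** 2)
    if radicand < 0:
        # Fall back to the ratio formula when the radicand is negative
        radicand = moe_num ** 2 + (p ** 2) * (moe_den ** 2)
    return math.sqrt(radicand) / den

# Hypothetical block group: 50 unemployed (MOE 30) out of 800 in the labour force (MOE 120)
share = 50 / 800
share_moe = moe_proportion(50, 800, 30, 120)
print(f"share = {share:.3f}, MOE = {share_moe:.3f}")
```

In the ACS tables the MOE of an estimate such as `B23025_005E` is usually published under the same name with an `M` suffix; that naming should be checked against the variable list before relying on it.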

One reason to question its validity might be that the data is collected over a period of multiple years yet represents a typical household at a single moment in time, but given the nature of census survey data collection methods, this seems acceptable to me. The main limiting factor is that the data is given as census block group estimates, whilst we will use ML on household-level information from the RECS to train our algorithm. Although researchers have used these two datasets in conjunction before, we have to be careful with how we interpret results. Another limitation is that this dataset has no variables on energy consumption, so we will need to use ML methods on the Residential Energy Consumption Survey (RECS) and then match/project those to our ACS data to generate energy consumption estimates at the census block group level.

**Cleaning operations to do**

More than data cleaning, I can foresee a long phase of work focused on understanding many of the variables in this dataset (there are over 20,000 variables, with information on them found here: https://www.census.gov/programs-surveys/acs/technical-documentation/summary-file-documentation.html).
The following steps are needed in order to create "typical census block group households" from the ACS dataset:
- 1) Identify the variables that exist in both the RECS and ACS surveys, or that are similar enough to transform one into the other (i.e. if RECS has income per year and ACS has income per month, transform to monthly income in both cases)
- 2) For the matched variables, transform census block group variables into household variables. For instance, in the case of demographic data, a census block group will give the percentage of households that are from a certain ethnicity. We will have to make judgement calls on what threshold these census block group variables should meet in order to assign a certain ethnicity (not a %) to that "typical household"
- 3) Normalize the variables that are shared between RECS and ACS
- 4) Reserve some socio-demographic variables that will *not* be included in the model as explanatory variables, in order to later reflect on the distribution of energy poverty indicators across areas with different demographics.
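Steps 1 to 3 above can be sketched with three small helpers. This is only an illustration under assumptions: the 0.5 threshold, the category names and the income figures are hypothetical placeholders, not values chosen for the project.

```python
def annual_to_monthly(annual_income):
    """Step 1: put income on a common (monthly) basis."""
    return annual_income / 12

def dominant_category(shares, threshold=0.5):
    """Step 2: assign a single category to the 'typical household' only if
    its share in the block group meets the threshold; otherwise None."""
    label, share = max(shares.items(), key=lambda kv: kv[1])
    return label if share >= threshold else None

def min_max_normalise(values):
    """Step 3: rescale values to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical block-group shares and incomes
monthly = annual_to_monthly(60000)
clear_majority = dominant_category({'group_a': 0.6, 'group_b': 0.4})
no_majority = dominant_category({'group_a': 0.4, 'group_b': 0.35, 'group_c': 0.25})
scaled = min_max_normalise([10, 20, 30])
```

Returning `None` when no category meets the threshold makes the judgement call explicit instead of silently labelling mixed block groups.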

---

## Section 1. Loading data<a id='load'></a>
Query data<br>

#### Dependencies

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import csv
import random  # used by create_indicator_map() to pick a colour map
import warnings 
import geopandas as gpd
from mpl_toolkits.axes_grid1 import make_axes_locatable
warnings.filterwarnings('ignore')
plt.style.use('fivethirtyeight') 

# Set some parameters
plt.rcParams['figure.figsize'] = (12, 9)
plt.rcParams['font.size'] = 14
np.set_printoptions(4)

##### Installing the API package to the American Community Survey (5-year) dataset (ACS)

In [2]:
pip install CensusData

Note: you may need to restart the kernel to use updated packages.


In [3]:
import censusdata
#pd.set_option('display.expand_frame_repr', False)
#pd.set_option('display.precision', 2)

##### Identifying variables

Identify the relevant tables containing the variables of interest, either through the ACS documentation (Table Shells: https://www.census.gov/programs-surveys/acs/technical-documentation/summary-file-documentation.html) or from within Python: `censusdata.search` searches the variable metadata for a given text pattern.
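Because the search results come back as (variable name, concept, label) tuples, they can be narrowed down with an explicit text filter rather than by position in the list. The sketch below runs on a few hand-written rows that only mimic that structure; the concept and label strings are illustrative and should be checked against the real output, and `filter_variables` is a hypothetical helper, not part of the censusdata package.

```python
# Hand-written rows mimicking the (name, concept, label) structure
sample_results = [
    ('B23025_003E', 'EMPLOYMENT STATUS FOR THE POPULATION 16 YEARS AND OVER',
     'Estimate!!Total!!In labor force!!Civilian labor force'),
    ('B23025_005E', 'EMPLOYMENT STATUS FOR THE POPULATION 16 YEARS AND OVER',
     'Estimate!!Total!!In labor force!!Civilian labor force!!Unemployed'),
    ('B15003_002E', 'EDUCATIONAL ATTAINMENT FOR THE POPULATION 25 YEARS AND OVER',
     'Estimate!!Total!!No schooling completed'),
]

def filter_variables(results, text, field='label'):
    """Keep rows whose chosen field contains `text` (case-insensitive)."""
    position = {'name': 0, 'concept': 1, 'label': 2}[field]
    return [row for row in results if text.lower() in row[position].lower()]

unemployment_vars = [row[0] for row in filter_variables(sample_results, 'unemploy')]
education_vars = [row[0] for row in filter_variables(sample_results, 'education', field='concept')]
print(unemployment_vars, education_vars)
```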

In [4]:
# Each search result is a tuple: (variable name, concept, label)
unemp = censusdata.search('acs5', 2015, 'label', 'unemploy')[150:170] # the slice keeps a hand-picked range of the matches found in the label
educ = censusdata.search('acs5', 2015, 'concept', 'education')[730:790]
#income = censusdata.search('acs5', 2015, 'concept', 'income') #there are 5966 variables related to income !! A lot of data cleaning here

# If you already know the table you want, print it to check its sub-variables
#censusdata.printtable(censusdata.censustable('acs5', 2015, 'B15003'))

##### Identify geographies of interest

Identify the geographies of interest: for the census block groups (CBG) of Alameda County, CA, look up the geographic identifier (FIPS code) for California, then the identifiers of all its counties.


In [5]:
states_FIPS = censusdata.geographies(censusdata.censusgeo([('state', '*')]), 'acs5', 2015) 
#California is 06, now see counties

counties_FIPS_CA = censusdata.geographies(censusdata.censusgeo([('state', '06'), ('county', '*')]), 'acs5', 2015)
#Alameda is 001

##### Download data & preliminary transformations

With the variables and geographies of interest identified, download the data using censusdata.download and compute the percent unemployed and the percent without a high school diploma.
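The percentage calculation can be written more compactly by generating the list of `B15003` columns instead of typing each one, and by guarding against block groups with a zero denominator. A minimal sketch on a toy DataFrame (the counts are made up; only the column names come from this notebook):

```python
import numpy as np
import pandas as pd

# B15003_002E ... B15003_016E: attainment levels below a high school diploma
nohs_cols = [f'B15003_{i:03d}E' for i in range(2, 17)]

# Toy data: one populated block group and one with no population
toy = pd.DataFrame({col: [1, 0] for col in nohs_cols})
toy['B15003_001E'] = [60, 0]

# Replace zero denominators by NaN so the result is NaN rather than a division error or inf
denominator = toy['B15003_001E'].replace(0, np.nan)
toy['percent_nohs'] = toy[nohs_cols].sum(axis=1) / denominator * 100
print(toy[['B15003_001E', 'percent_nohs']])
```

Keeping empty block groups as NaN makes them easy to show as "no data" on the maps later on.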

In [6]:
county_acs = censusdata.download('acs5', 2015,
                             censusdata.censusgeo([('state', '06'), ('county', '001'), ('block group', '*')]),
                             ['B23025_003E', 'B23025_005E', 'B15003_001E', 'B15003_002E', 'B15003_003E',
                              'B15003_004E', 'B15003_005E', 'B15003_006E', 'B15003_007E', 'B15003_008E',
                              'B15003_009E', 'B15003_010E', 'B15003_011E', 'B15003_012E', 'B15003_013E',
                              'B15003_014E', 'B15003_015E', 'B15003_016E'])
county_acs['percent_unemployed'] = county_acs.B23025_005E / county_acs.B23025_003E * 100
county_acs['percent_nohs'] = (county_acs.B15003_002E + county_acs.B15003_003E + county_acs.B15003_004E
                          + county_acs.B15003_005E + county_acs.B15003_006E + county_acs.B15003_007E + county_acs.B15003_008E
                          + county_acs.B15003_009E + county_acs.B15003_010E + county_acs.B15003_011E + county_acs.B15003_012E
                          + county_acs.B15003_013E + county_acs.B15003_014E +
                          county_acs.B15003_015E + county_acs.B15003_016E) / county_acs.B15003_001E * 100
#county_acs = county_acs[['percent_unemployed', 'percent_nohs']]



In [7]:
# Convert the geography index into multiple geolocation columns

# Reset the index and keep the geography object in an "id" column
county_acs.reset_index(inplace = True)
county_acs.rename(columns={"index": "id"}, inplace = True)

# Parse the geography information from its string representation
county_acs['geo_inf'] = [str(geo) for geo in county_acs['id']]
county_acs[['Block','Tract', 'County Name', 'State', 'FIP']] = county_acs.geo_inf.str.split(',',expand=True)
county_acs[['STATEFP','COUNTYFP', 'TRACTCE', 'BLKGRPCE']] = county_acs.FIP.str.split('>',expand=True)

# Extract the numeric codes (raw strings avoid invalid-escape warnings)
county_acs['STATEFP'] = county_acs['STATEFP'].str.extract(r'(\d+)', expand=False)
county_acs['COUNTYFP'] = county_acs['COUNTYFP'].str.extract(r'(\d+)', expand=False)
county_acs['TRACTCE'] = county_acs['TRACTCE'].str.extract(r'(\d+)', expand=False)
county_acs['BLKGRPCE'] = county_acs['BLKGRPCE'].str.extract(r'(\d+)', expand=False)

# Drop helper columns
county_acs.drop(columns = ['id', 'geo_inf', 'FIP'], inplace = True)

#reorder remaining columns
cols_to_order = ['BLKGRPCE', 'TRACTCE', 'COUNTYFP', 'STATEFP', 'Block', 'Tract', 'County Name', 'State' ]
new_columns = cols_to_order + (county_acs.columns.drop(cols_to_order).tolist())
county_acs = county_acs[new_columns]
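
The split-based parsing above depends on the exact number of commas and `>` separators in the geography string. A regular expression over the `name:digits` pairs is a little more tolerant. In the sketch below the sample string is an assumption about what the geography index looks like once converted to text, and `parse_geo` is a hypothetical helper:

```python
import re

# Assumed text form of one geography index entry
sample = ('Block Group 1, Census Tract 4001, Alameda County, California: '
          'Summary level: 150, state:06> county:001> tract:400100> block group:1')

def parse_geo(text):
    """Extract the geographic codes from 'name:digits' pairs in a geography string."""
    pairs = re.findall(r'(state|county|tract|block group):(\d+)', text)
    return dict(pairs)

codes = parse_geo(sample)
print(codes)
```

The codes stay as strings so that leading zeros (e.g. `'06'`, `'001'`) are preserved for the merge with the shapefile columns.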


In [8]:
# Preliminary data transformations
# Rank CBGs in Alameda County from highest to lowest unemployment rate (top 5)
sorted_unemp = county_acs.sort_values('percent_unemployed', ascending=False).head(5) # there are 1047 CBGs in Alameda County

# Show correlations between unemployment and educational attainment (numeric columns only)
county_acs.select_dtypes('number').corr()

Unnamed: 0,B23025_003E,B23025_005E,B15003_001E,B15003_002E,B15003_003E,B15003_004E,B15003_005E,B15003_006E,B15003_007E,B15003_008E,B15003_009E,B15003_010E,B15003_011E,B15003_012E,B15003_013E,B15003_014E,B15003_015E,B15003_016E,percent_unemployed,percent_nohs
B23025_003E,1.0,0.458544,0.934936,0.136133,-0.006521,-0.023377,0.099565,0.031795,0.09091,0.066327,0.124903,0.136836,0.065185,0.120634,0.109599,0.165062,0.089033,0.242528,-0.111272,-0.14371
B23025_005E,0.458544,1.0,0.373904,0.257057,0.028455,0.066268,0.088238,0.137116,0.169131,0.19182,0.179211,0.267996,0.104702,0.222197,0.254194,0.260726,0.242523,0.29343,0.663627,0.245175
B15003_001E,0.934936,0.373904,1.0,0.181188,-0.025269,-0.015157,0.113624,0.043239,0.10388,0.083552,0.138671,0.130795,0.062996,0.157447,0.131978,0.201386,0.166428,0.283824,-0.149983,-0.122739
B15003_002E,0.136133,0.257057,0.181188,1.0,-0.008426,0.037115,0.091265,0.181151,0.252039,0.253449,0.318449,0.334328,0.175006,0.259912,0.280916,0.198943,0.290531,0.330291,0.169744,0.579068
B15003_003E,-0.006521,0.028455,-0.025269,-0.008426,1.0,-0.011817,0.036735,0.058954,0.005814,0.000962,-0.023454,-0.006213,-0.021357,0.021216,0.026491,-0.034478,0.000275,0.013191,0.017671,0.025129
B15003_004E,-0.023377,0.066268,-0.015157,0.037115,-0.011817,1.0,-0.014678,0.020442,-0.010769,0.019987,0.068116,0.043581,-0.00316,0.018599,0.045101,0.036454,0.046824,0.032346,0.120226,0.077452
B15003_005E,0.099565,0.088238,0.113624,0.091265,0.036735,-0.014678,1.0,0.087442,0.093747,0.070154,0.041945,0.10709,0.080806,0.038632,0.133953,0.099638,0.034608,0.058807,0.029697,0.097191
B15003_006E,0.031795,0.137116,0.043239,0.181151,0.058954,0.020442,0.087442,1.0,0.09044,0.119616,0.106986,0.221564,0.090403,0.139269,0.148017,0.148038,0.105779,0.164723,0.125795,0.300634
B15003_007E,0.09091,0.169131,0.10388,0.252039,0.005814,-0.010769,0.093747,0.09044,1.0,0.222301,0.058814,0.317645,0.113559,0.166957,0.221852,0.204434,0.154839,0.156883,0.103029,0.336848
B15003_008E,0.066327,0.19182,0.083552,0.253449,0.000962,0.019987,0.070154,0.119616,0.222301,1.0,0.221498,0.25696,0.157756,0.194705,0.146175,0.191314,0.180731,0.154858,0.153507,0.366161


---
## Section 2: Selecting variables<a id='select'></a>
Now we need to select the variables in the ACS that are also present in the RECS dataset.

---
## Section 3: Visualising data on county/state<a id='visualize'></a>



Adapted from: https://towardsdatascience.com/mapping-census-data-fbab6722def0

Boundary data for the census block groups comes from: https://www.census.gov/geographies/mapping-files/time-series/geo/cartographic-boundary.html

In [9]:
# All CBG in CA
blocks_map_ca = gpd.read_file('geo_data/cb_2019_06_bg_500k.shp')

In [10]:
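# All US counties (national cartographic boundary file), restricted to California
counties_map_us = gpd.read_file('geo_data/cb_2019_us_county_500k.shp')
counties_map_ca = counties_map_us[counties_map_us['STATEFP'] == '06']

# Select county of interest
county = counties_map_ca[counties_map_ca['NAME'] == 'Alameda']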

Plot county in state

In [None]:

fig, ax = plt.subplots(figsize=(5,8))

county.plot(ax=ax, color = 'red', alpha = 0.7, markersize = 3)

blocks_map_ca.plot(ax=ax, color = 'white', edgecolor='gray', alpha = 0.3)

ax.axis('off') # You can optionally omit the axes

# Show a title
ax.set_title('Alameda county in California')

plt.show()

Define and plot census block groups in county

In [None]:
blocks_map_ca_county = blocks_map_ca[blocks_map_ca['COUNTYFP'] == county['COUNTYFP'].values[0]]
blocks_map_ca_county.plot()   

In [None]:
print(blocks_map_ca_county.crs)

Add ACS data to each census block group, checking whether there are the same number of observations or not. If not, spot check the CBG that are not represented

In [None]:
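# Compare the number of block groups in the shapefile and in the ACS download
print(len(blocks_map_ca_county), len(county_acs))
# In this case the lengths don't match, so an assert on equality would stop the notebook here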

In this case, there is a CBG in the ACS data that is not in the geospatial files, so after merging we check which one is left out and where it is located: https://censusreporter.org/profiles/15000US060019900000-block-group-0-alameda-ca/

In this case, we see that CBG 0 from Alameda County is actually the water between SF and Berkeley, so we omit it (an inner join is enough).
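The identifier in the URL above can be rebuilt from the four code columns, which gives a quick way to look up any unmatched block group. A small sketch, assuming the usual 2 + 3 + 6 + 1 digit layout of block group identifiers; `block_group_geoid` is a hypothetical helper:

```python
def block_group_geoid(statefp, countyfp, tractce, blkgrpce):
    """Concatenate state (2), county (3), tract (6) and block group (1) codes."""
    return f'{statefp:0>2}{countyfp:0>3}{tractce:0>6}{blkgrpce}'

# The block group left out of the merge: tract 990000, block group 0 in Alameda County
geoid = block_group_geoid('06', '001', '990000', '0')
profile_id = '15000US' + geoid
print(profile_id)
```

The cartographic boundary files are expected to carry a matching `GEOID` column, so this single key could replace the four-column merge; check the column names of the shapefile before relying on that.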

In [None]:
#merge CBG data to block geopandas dataframe
blocks_county_ind = blocks_map_ca_county.merge(county_acs, on = ['BLKGRPCE','TRACTCE','COUNTYFP','STATEFP'])

In [None]:
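# Check which ACS rows have no matching geometry (anti-join on all four keys)
keys = ['BLKGRPCE', 'TRACTCE', 'COUNTYFP', 'STATEFP']
unmatched = county_acs.merge(blocks_map_ca_county[keys], on=keys, how='left', indicator=True)
unmatched[unmatched['_merge'] == 'left_only']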


In [None]:
fig, ax = plt.subplots(figsize=(10,15))
# lay down a light grey background map so areas with no data (e.g. the Oakland port) stay visible
blocks_county_ind.plot(ax=ax, color='lightgrey', alpha=.5)

# set the legend specifications
divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="5%", pad=0.05)
# add the map with the choropleth
blocks_county_ind.plot(ax=ax, 
                        column='percent_nohs', 
                        cmap='Purples_r',
                        legend=True, 
                        cax=cax,
                      )


ax.set_title('Alameda county % of no high school education (5y 2015 ACS)', fontsize=20)
fig.patch.set_visible(False)
ax.axis('off')
plt.tight_layout()
#plt.savefig('images/Alameda_education.png')

---
## Section 5: Generic function for indicator/county in California<a id='generic'></a>

These two functions select indicators from the ACS, transform them and plot them geospatially, per census block group, for one selected county at a time. The examples use counties in California, since the downloaded shapefiles are those of CA, but the functions can be applied to other states by downloading the corresponding .shp files.
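To make the switch to another state less error-prone, the file names can be built from the FIPS codes instead of being typed by hand. The sketch below follows the `geo_data/cb_2019_06_bg_500k.shp` naming used in this notebook; both helpers and the image naming scheme are hypothetical.

```python
from pathlib import Path

def block_group_shapefile(state_fips, year=2019, folder='geo_data'):
    """Path of the block group boundary file for one state."""
    return Path(folder) / f'cb_{year}_{state_fips}_bg_500k.shp'

def indicator_image_path(county_fips, indicator, folder='images'):
    """Path under which a county map for one indicator could be saved."""
    return Path(folder) / f'{county_fips}_{indicator}.png'

shp_path = block_group_shapefile('06')
png_path = indicator_image_path('001', 'percent_nohs')
print(shp_path, png_path)
```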


In [None]:
def county_indicators(state, county):
    """Download ACS block group data for one county and build indicators.

    For now this creates two indicators, percentage unemployed and
    percentage without a high school diploma; expand or change this
    once we know which indicators we need.

    Args:
        state: str. State FIPS code, e.g. '06'.
        county: str. County FIPS code, e.g. '001'.
    """
    
    #Download Census data indicators - TO CHANGE
    county_acs = censusdata.download('acs5', 2015,
                             censusdata.censusgeo([('state', state), ('county', county), ('block group', '*')]),
                             ['B23025_003E', 'B23025_005E', 'B15003_001E', 'B15003_002E', 'B15003_003E',
                              'B15003_004E', 'B15003_005E', 'B15003_006E', 'B15003_007E', 'B15003_008E',
                              'B15003_009E', 'B15003_010E', 'B15003_011E', 'B15003_012E', 'B15003_013E',
                              'B15003_014E', 'B15003_015E', 'B15003_016E'])
    county_acs['percent_unemployed'] = county_acs.B23025_005E / county_acs.B23025_003E * 100
    county_acs['percent_nohs'] = (county_acs.B15003_002E + county_acs.B15003_003E + county_acs.B15003_004E
                              + county_acs.B15003_005E + county_acs.B15003_006E + county_acs.B15003_007E + county_acs.B15003_008E
                              + county_acs.B15003_009E + county_acs.B15003_010E + county_acs.B15003_011E + county_acs.B15003_012E
                              + county_acs.B15003_013E + county_acs.B15003_014E +
                              county_acs.B15003_015E + county_acs.B15003_016E) / county_acs.B15003_001E * 100 #county_acs = county_acs[['percent_unemployed', 'percent_nohs']]
    
    
    # Convert the geography index into multiple geolocation columns

    # Reset the index and keep the geography object in an "id" column
    county_acs.reset_index(inplace = True)
    county_acs.rename(columns={"index": "id"}, inplace = True)

    # Parse the geography information from its string representation
    county_acs['geo_inf'] = [str(geo) for geo in county_acs['id']]
    county_acs[['Block','Tract', 'County Name', 'State', 'FIP']] = county_acs.geo_inf.str.split(',',expand=True)
    county_acs[['STATEFP','COUNTYFP', 'TRACTCE', 'BLKGRPCE']] = county_acs.FIP.str.split('>',expand=True)

    # Extract the numeric codes (raw strings avoid invalid-escape warnings)
    county_acs['STATEFP'] = county_acs['STATEFP'].str.extract(r'(\d+)', expand=False)
    county_acs['COUNTYFP'] = county_acs['COUNTYFP'].str.extract(r'(\d+)', expand=False)
    county_acs['TRACTCE'] = county_acs['TRACTCE'].str.extract(r'(\d+)', expand=False)
    county_acs['BLKGRPCE'] = county_acs['BLKGRPCE'].str.extract(r'(\d+)', expand=False)

    # Drop helper columns
    county_acs.drop(columns = ['id', 'geo_inf', 'FIP'], inplace = True)

    #reorder remaining columns
    cols_to_order = ['BLKGRPCE', 'TRACTCE', 'COUNTYFP', 'STATEFP', 'Block', 'Tract', 'County Name', 'State' ]
    new_columns = cols_to_order + (county_acs.columns.drop(cols_to_order).tolist())
    county_acs = county_acs[new_columns]


    return county_acs


In [None]:
def create_indicator_map(state, county, county_acs, image_path, color=None):
    """Creates a map visualizing an ACS indicator (currently percent_nohs)
    
    Args:
        state: str. State FIPS code, e.g. '06'.
        county: str. County FIPS code, e.g. '001'.
        county_acs: dataframe containing the ACS indicator/transformed indicator that we want to plot
        image_path: str. The path to save the image, "folder/file.png" (saving is currently commented out)
        color: int. 0=red 1=orange 2=green 3=blue 4=purple, random if None
        
    """
    # read data
    blocks_map_state = gpd.read_file('geo_data/cb_2019_'+state+'_bg_500k.shp')
    
    #filter all CBG in the selected county
    blocks_map_state_county = blocks_map_state[blocks_map_state['COUNTYFP'] == county]
    
    # choose a colour scheme: random if none is given, otherwise by index
    cmaps = ['Reds_r', 'Oranges_r', 'Greens_r', 'Blues_r', 'Purples_r']
    if color is None:
        cmap = random.choice(cmaps)  # requires `import random` at the top of the notebook
    else:
        cmap = cmaps[color]
        
    #combine geospatial data with census indicators
    blocks_county_ind = blocks_map_state_county.merge(county_acs, on = ['BLKGRPCE','TRACTCE','COUNTYFP','STATEFP'])
    
    #save county name
    county_name = blocks_county_ind['County Name'].values[0]
    
    # combine block groups to get a county outline (dissolve on the first column, the state code)
    outline = blocks_map_state_county.dissolve(by=blocks_map_state_county.columns[0], 
                                       aggfunc='first')

    # plot
    fig, ax = plt.subplots(figsize=(12,12))
    # areas with no data are shown in grey
    blocks_county_ind.plot(ax=ax, color='lightgrey', alpha=.5)

    # set the legend specifications
    
    divider = make_axes_locatable(ax)
    cax = divider.append_axes("right", size="5%", pad=0.05)
    # add the map with the choropleth
    blocks_county_ind.plot(ax=ax, 
                            column='percent_nohs', 
                            cmap=cmap,
                            legend=True, 
                            cax=cax,
                          )
    outline.plot(ax=ax, 
                 facecolor='none', 
                 edgecolor='grey', 
                 linewidth=.5)
    ax.set_title(county_name.strip() + ' % no high school per CBG (2015 5y ACS)', fontsize=20)
    fig.patch.set_visible(False)
    ax.axis('off')
    plt.tight_layout()
    #plt.savefig(image_path)

1. Check the state code and the county code we are interested in

In [None]:
states_FIPS = censusdata.geographies(censusdata.censusgeo([('state', '*')]), 'acs5', 2015) 
#California is 06, now see counties

counties_FIPS_CA = censusdata.geographies(censusdata.censusgeo([('state', '06'), ('county', '*')]), 'acs5', 2015)
#Alameda '001', Marin '041', Napa '055', SF '075'

2. Plot maps for each selected county and the list of indicators downloaded and constructed in county_indicators()  

In [None]:
#select state and county of interest
state_FIPS = '06'
counties_FIPS = ['001', '041','055','075'] #Alameda '001', Marin '041', Napa '055', SF '075'
blocks_map_state = gpd.read_file('geo_data/cb_2019_'+state_FIPS+'_bg_500k.shp')

#county = ['Alameda', 'Santa Clara', 'San Francisco'] #change

for i, county in enumerate(counties_FIPS):
    #get county ACS indicators
    county_acs = county_indicators(state_FIPS, county)
    
    #plot CBG indicators on County maps
    image_path = 'images/'+ county + '.png'
    create_indicator_map(state_FIPS, county, county_acs, image_path, color=i)


---
## Section 4: Exporting data<a id='export'></a>

## WORK IN PROGRESS:

### IPUMS Dataset

IPUMS provides census and survey data from around the world, integrated across time and space. IPUMS integration and documentation make it easy to study change, conduct comparative research, merge information across data types, and analyze individuals within family and community contexts. Data and services are available free of charge.