# ACS data cleaning, variable selection and export

We will clean and prepare the dataset to be used as out of sample observations to create our predictions for the resource allocation problem.



### Table of Contents
1. [Loading and exploring data:](#load) Familiarization with querying ACS API <br> 
1. [Selecting variables:](#select) Select variables in RECS & ACS, income indicators, and demographics for final insights <br>
1. [Adding variables:](#add) Merging energy expenditure and the percenatge of the population that lived in rural or urban areas in last year <br>
1. [Saving Data:](#saving)  Save our data to an exportable csv file <br>


---

## Section 1. Loading data<a id='load'></a>
Query data<br>

Import the necessary libraries for the notebook

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import csv
import warnings 
import geopandas as gpd
import re
from mpl_toolkits.axes_grid1 import make_axes_locatable
from sklearn.preprocessing import MinMaxScaler
import os
warnings.filterwarnings('ignore')
plt.style.use('fivethirtyeight') 

# Set some parameters
plt.rcParams['figure.figsize'] = (12, 9)
plt.rcParams['font.size'] = 14
np.set_printoptions(4)

Pip install the CensusData API

In [3]:
pip install CensusData

Processing ./.cache/pip/wheels/ff/34/2c/e77833bac8e3bbd9f2d2d40d1dc3d1cd975872792aeba374a9/CensusData-1.10-py3-none-any.whl
Installing collected packages: CensusData
Successfully installed CensusData-1.10
Note: you may need to restart the kernel to use updated packages.


Now import the Census data library to access the Census API

In [4]:
import censusdata

Create a list of all of the unique county codes in California

In [5]:
ca_county_lst = []
for i in range(1, 113, 2):
    if i in range(1,10):
        ca_county_lst.append('00' + str(i))
    elif i in range(11,100):
        ca_county_lst.append('0' + str(i))
    else:
        ca_county_lst.append(str(i))
ca_county_lst

['001',
 '003',
 '005',
 '007',
 '009',
 '011',
 '013',
 '015',
 '017',
 '019',
 '021',
 '023',
 '025',
 '027',
 '029',
 '031',
 '033',
 '035',
 '037',
 '039',
 '041',
 '043',
 '045',
 '047',
 '049',
 '051',
 '053',
 '055',
 '057',
 '059',
 '061',
 '063',
 '065',
 '067',
 '069',
 '071',
 '073',
 '075',
 '077',
 '079',
 '081',
 '083',
 '085',
 '087',
 '089',
 '091',
 '093',
 '095',
 '097',
 '099',
 '101',
 '103',
 '105',
 '107',
 '109',
 '111']

---
## Section 2: Selecting variables<a id='select'></a>
Now we need to select variables in the ACS that are also present in the RECS dataset

We will be loading and cleaning the data for the ACS variable B25024, which represents the number of units in the given structure

In [6]:
#Print the dummy columns of the variable B25024
censusdata.printtable(censusdata.censustable('acs5', 2015, 'B25024'))

Variable     | Table                          | Label                                                    | Type 
-------------------------------------------------------------------------------------------------------------------
B25024_001E  | B25024.  Units in Structure    | Total:                                                   | int  
B25024_002E  | B25024.  Units in Structure    | 1, detached                                              | int  
B25024_003E  | B25024.  Units in Structure    | 1, attached                                              | int  
B25024_004E  | B25024.  Units in Structure    | 2                                                        | int  
B25024_005E  | B25024.  Units in Structure    | 3 or 4                                                   | int  
B25024_006E  | B25024.  Units in Structure    | 5 to 9                                                   | int  
B25024_007E  | B25024.  Units in Structure    | 10 to 19                                     

In [7]:
#Create a dictionary that takes in the created df name from the specific CA county code, and the cleaned and 
#normalized dataframe for the number of units in the given structure
units_in_structure_df_names = {}
for i in range(len(ca_county_lst)):
    df_name = 'units_in_structure_df_county_' + str(ca_county_lst[i])
    
    df_data = censusdata.download('acs5', 2015,
                             censusdata.censusgeo([('state', '06'), ('county', ca_county_lst[i]), ('block group', '*')]),
                             ['B25024_010E', 'B25024_002E', 'B25024_003E', 'B25024_004E','B25024_005E', 'B25024_006E', 'B25024_007E'
                              ,'B25024_008E', 'B25024_009E'], key = '80125df736ca356c2f7c99ef78ca3e7acea58265')
    
    df_data = df_data.rename(columns={'B25024_010E':'Mobile Home', 'B25024_002E': '1, detached', 'B25024_003E':'1, attached ', 'B25024_004E':'2',
                                        'B25024_005E':'3 or 4', 'B25024_006E':'5 to 9', 'B25024_007E':'10 to 19', 
                                        'B25024_008E':'20 to 49','B25024_009E':'50 or more'})
    
    df_data['2 to 4'] = df_data['2'] + df_data['3 or 4']
    df_data['5 or more'] = df_data['5 to 9'] + df_data['10 to 19'] + df_data['20 to 49'] + df_data['50 or more']
    df_data = df_data.drop(['2', '3 or 4', '5 to 9', '10 to 19', '20 to 49', '50 or more'], axis=1)
    df_data = df_data.div(df_data.sum(axis=1), axis=0)
    
    
    units_in_structure_df_names[df_name] = df_data
units_in_structure_df_names['units_in_structure_df_county_001'].head()

Unnamed: 0,Mobile Home,"1, detached","1, attached",2 to 4,5 or more
"Block Group 4, Census Tract 4097, Alameda County, California: Summary level: 150, state:06> county:001> tract:409700> block group:4",0.0,0.619048,0.0,0.258503,0.122449
"Block Group 1, Census Tract 4098, Alameda County, California: Summary level: 150, state:06> county:001> tract:409800> block group:1",0.0,1.0,0.0,0.0,0.0
"Block Group 2, Census Tract 4098, Alameda County, California: Summary level: 150, state:06> county:001> tract:409800> block group:2",0.0,0.479472,0.0,0.225806,0.294721
"Block Group 3, Census Tract 4098, Alameda County, California: Summary level: 150, state:06> county:001> tract:409800> block group:3",0.0,1.0,0.0,0.0,0.0
"Block Group 1, Census Tract 4099, Alameda County, California: Summary level: 150, state:06> county:001> tract:409900> block group:1",0.0,0.819608,0.07451,0.105882,0.0


We will be loading and cleaning the data for the ACS variable B25008, which represents the percentage of home owners who own or rent their home

In [8]:
#Print the dummy columns of the variable B25008
censusdata.printtable(censusdata.censustable('acs5', 2015, 'B25008'))

Variable     | Table                          | Label                                                    | Type 
-------------------------------------------------------------------------------------------------------------------
B25008_001E  | B25008.  Total Population in O | Total:                                                   | int  
B25008_002E  | B25008.  Total Population in O | Owner occupied                                           | int  
B25008_003E  | B25008.  Total Population in O | Renter occupied                                          | int  
-------------------------------------------------------------------------------------------------------------------


In [9]:
#Create a dictionary that takes in the created df name from the specific CA county code, and the cleaned 
#dataframe for the percentage of home owners who own or rent their home
tenure_df_names = {}
for i in range(len(ca_county_lst)):
    df_name = 'tenure_df_county_' + str(ca_county_lst[i])
    
    df_data = censusdata.download('acs5', 2015,
                             censusdata.censusgeo([('state', '06'), ('county', ca_county_lst[i]), ('block group', '*')]),
                             ['B25008_002E', 'B25008_003E'], key = '80125df736ca356c2f7c99ef78ca3e7acea58265')
    
    df_data = df_data.rename(columns={'B25008_002E':'Owned', 'B25008_003E':'Rented'})
    
    
    df_data = df_data.div(df_data.sum(axis=1), axis=0)
    
    
    tenure_df_names[df_name] = df_data
tenure_df_names['tenure_df_county_001'].head()

Unnamed: 0,Owned,Rented
"Block Group 4, Census Tract 4097, Alameda County, California: Summary level: 150, state:06> county:001> tract:409700> block group:4",0.506536,0.493464
"Block Group 1, Census Tract 4098, Alameda County, California: Summary level: 150, state:06> county:001> tract:409800> block group:1",0.646226,0.353774
"Block Group 2, Census Tract 4098, Alameda County, California: Summary level: 150, state:06> county:001> tract:409800> block group:2",0.372405,0.627595
"Block Group 3, Census Tract 4098, Alameda County, California: Summary level: 150, state:06> county:001> tract:409800> block group:3",0.842905,0.157095
"Block Group 1, Census Tract 4099, Alameda County, California: Summary level: 150, state:06> county:001> tract:409900> block group:1",0.901991,0.098009


We will be loading and cleaning the data for the ACS variable B25034, which represents when the year range the house was built

In [10]:
#Print the dummy columns of the variable B25034
censusdata.printtable(censusdata.censustable('acs5', 2015, 'B25034'))

Variable     | Table                          | Label                                                    | Type 
-------------------------------------------------------------------------------------------------------------------
B25034_001E  | B25034. Year Structure Built   | Total:                                                   | int  
B25034_002E  | B25034. Year Structure Built   | Built 2014 or later                                      | int  
B25034_003E  | B25034. Year Structure Built   | Built 2010 to 2013                                       | int  
B25034_004E  | B25034. Year Structure Built   | Built 2000 to 2009                                       | int  
B25034_005E  | B25034. Year Structure Built   | Built 1990 to 1999                                       | int  
B25034_006E  | B25034. Year Structure Built   | Built 1980 to 1989                                       | int  
B25034_007E  | B25034. Year Structure Built   | Built 1970 to 1979                           

In [11]:
#Create a dictionary that takes in the created df name from the specific CA county code, and the cleaned, 
#normalized dataframe mutliplyed by the midpoint of each range, then summed to one aggregate value for each census 
#block for the year the home was built
year_structure_built_df_names = {}
for i in range(len(ca_county_lst)):
    df_name = 'year_structure_built_df_county_' + str(ca_county_lst[i])
    
    df_data = censusdata.download('acs5', 2015,
                             censusdata.censusgeo([('state', '06'), ('county', ca_county_lst[i]), ('block group', '*')]),
                             ['B25034_002E', 'B25034_003E', 'B25034_004E', 'B25034_005E', 'B25034_006E',
                              'B25034_007E', 'B25034_008E','B25034_009E', 'B25034_010E','B25034_011E'], key = '80125df736ca356c2f7c99ef78ca3e7acea58265')
    
    df_data = df_data.rename(columns={'B25034_002E':'Built 2014 or later', 'B25034_003E':'Built 2010 to 2013', 
                                      'B25034_004E':'Built 2000 to 2009', 'B25034_005E':'Built 1990 to 1999', 
                                      'B25034_006E':'Built 1980 to 1989', 'B25034_007E':'Built 1970 to 1979', 
                                      'B25034_008E':'Built 1960 to 1969','B25034_009E':'1950 to 1959', 
                                      'B25034_010E':'1940 to 1949','B25034_011E':'Built 1939 or earlier'})

    df_data = df_data.div(df_data.sum(axis=1), axis=0)
        
    df_midpoints_lst = [2015, 2012, 2005, 1995, 1985, 1975, 1965, 1955, 1945, 1920]
    

    for i in range(len(df_data.columns)):
        df_data.iloc[:,i] = df_data.iloc[:,i] * df_midpoints_lst[i]

    
    df_data["Aggregate Year House Built"] = df_data.sum(axis=1)
    df_data = df_data[["Aggregate Year House Built"]]
    
    
    year_structure_built_df_names[df_name] = df_data
year_structure_built_df_names['year_structure_built_df_county_001']

Unnamed: 0,Aggregate Year House Built
"Block Group 4, Census Tract 4097, Alameda County, California: Summary level: 150, state:06> county:001> tract:409700> block group:4",1959.489796
"Block Group 1, Census Tract 4098, Alameda County, California: Summary level: 150, state:06> county:001> tract:409800> block group:1",1945.647383
"Block Group 2, Census Tract 4098, Alameda County, California: Summary level: 150, state:06> county:001> tract:409800> block group:2",1963.064516
"Block Group 3, Census Tract 4098, Alameda County, California: Summary level: 150, state:06> county:001> tract:409800> block group:3",1960.609319
"Block Group 1, Census Tract 4099, Alameda County, California: Summary level: 150, state:06> county:001> tract:409900> block group:1",1956.568627
...,...
"Block Group 2, Census Tract 4507.44, Alameda County, California: Summary level: 150, state:06> county:001> tract:450744> block group:2",1986.172805
"Block Group 1, Census Tract 4507.45, Alameda County, California: Summary level: 150, state:06> county:001> tract:450745> block group:1",1992.441520
"Block Group 2, Census Tract 4507.45, Alameda County, California: Summary level: 150, state:06> county:001> tract:450745> block group:2",1989.053697
"Block Group 1, Census Tract 4507.46, Alameda County, California: Summary level: 150, state:06> county:001> tract:450746> block group:1",1982.757390


We will be loading and cleaning the data for the ACS variable B25038, which represents the year range that the reident of the house moved in

In [12]:
#Print the dummy columns of the variable B25038
censusdata.printtable(censusdata.censustable('acs5', 2015, 'B25038'))

Variable     | Table                          | Label                                                    | Type 
-------------------------------------------------------------------------------------------------------------------
B25038_001E  | B25038. Tenure by Year Househo | Total:                                                   | int  
B25038_002E  | B25038. Tenure by Year Househo | Owner occupied:                                          | int  
B25038_003E  | B25038. Tenure by Year Househo | !! Owner occupied: Moved in 2015 or later                | int  
B25038_004E  | B25038. Tenure by Year Househo | !! Owner occupied: Moved in 2010 to 2014                 | int  
B25038_005E  | B25038. Tenure by Year Househo | !! Owner occupied: Moved in 2000 to 2009                 | int  
B25038_006E  | B25038. Tenure by Year Househo | !! Owner occupied: Moved in 1990 to 1999                 | int  
B25038_007E  | B25038. Tenure by Year Househo | !! Owner occupied: Moved in 1980 to 1989     

In [13]:
#Create a dictionary that takes in the created df name from the specific CA county code, and the cleaned, 
#normalized dataframe mutliplyed by the midpoint of each range, then summed to one aggregate value for each census 
#block for the year range the resident of the home moved in
year_moved_in_df_names = {}
for i in range(len(ca_county_lst)):
    df_name = 'year_moved_in_df_county_' + str(ca_county_lst[i])
    
    df_data = censusdata.download('acs5', 2015,
                             censusdata.censusgeo([('state', '06'), ('county', ca_county_lst[i]), ('block group', '*')]),
                             ['B25038_003E', 'B25038_004E', 'B25038_005E', 'B25038_006E', 'B25038_007E', 'B25038_008E',
                              'B25038_010E', 'B25038_011E', 'B25038_012E', 'B25038_013E', 'B25038_014E', 'B25038_015E'], key = '80125df736ca356c2f7c99ef78ca3e7acea58265')
    
    df_data = df_data.rename(columns={'B25038_003E':'Owner occupied: Moved in 2015 or later', 'B25038_004E':'Owner occupied: Moved in 2010 to 2014',
                                      'B25038_005E':'Owner occupied: Moved in 2000 to 2009', 'B25038_006E':'Owner occupied: Moved in 1990 to 1999', 
                                      'B25038_007E':'Owner occupied: Moved in 1980 to 1989', 'B25038_008E':'Owner occupied: Moved in 1979 or earlier',
                              'B25038_010E':'Renter occupied: Moved in 2015 or later', 'B25038_011E':'Renter occupied: Moved in 2010 to 2014',
                                      'B25038_012E':'Renter occupied: Moved in 2000 to 2009', 'B25038_013E':'Renter occupied: Moved in 1990 to 1999',
                                      'B25038_014E':'Renter occupied: Moved in 1980 to 1989', 'B25038_015E':'Renter occupied: Moved in 1979 or earlier'})

    df_data = df_data.div(df_data.sum(axis=1), axis=0)
    
    df_midpoints_lst = [2015, 2012, 2005, 1995, 1985, 1965, 2015, 2012, 2005, 1995, 1985, 1965]
    

    for i in range(len(df_data.columns)):
        df_data.iloc[:,i] = df_data.iloc[:,i] * df_midpoints_lst[i]

    
    df_data["Aggregate Year Moved In"] = df_data.sum(axis=1)
    df_data = df_data[["Aggregate Year Moved In"]]
           
    
    year_moved_in_df_names[df_name] = df_data
year_moved_in_df_names['year_moved_in_df_county_001']


Unnamed: 0,Aggregate Year Moved In
"Block Group 4, Census Tract 4097, Alameda County, California: Summary level: 150, state:06> county:001> tract:409700> block group:4",2003.476923
"Block Group 1, Census Tract 4098, Alameda County, California: Summary level: 150, state:06> county:001> tract:409800> block group:1",1992.547059
"Block Group 2, Census Tract 4098, Alameda County, California: Summary level: 150, state:06> county:001> tract:409800> block group:2",2000.325939
"Block Group 3, Census Tract 4098, Alameda County, California: Summary level: 150, state:06> county:001> tract:409800> block group:3",1997.079295
"Block Group 1, Census Tract 4099, Alameda County, California: Summary level: 150, state:06> county:001> tract:409900> block group:1",1991.588235
...,...
"Block Group 2, Census Tract 4507.44, Alameda County, California: Summary level: 150, state:06> county:001> tract:450744> block group:2",1999.009331
"Block Group 1, Census Tract 4507.45, Alameda County, California: Summary level: 150, state:06> county:001> tract:450745> block group:1",2005.516484
"Block Group 2, Census Tract 4507.45, Alameda County, California: Summary level: 150, state:06> county:001> tract:450745> block group:2",2004.220951
"Block Group 1, Census Tract 4507.46, Alameda County, California: Summary level: 150, state:06> county:001> tract:450746> block group:1",2000.575943


We will be loading and cleaning the data for the ACS variable B25041, which represents the number of bedrooms in the home

In [14]:
#Print the dummy columns of the variable B25041
censusdata.printtable(censusdata.censustable('acs5', 2015, 'B25041'))

Variable     | Table                          | Label                                                    | Type 
-------------------------------------------------------------------------------------------------------------------
B25041_001E  | B25041.  Bedrooms              | Total:                                                   | int  
B25041_002E  | B25041.  Bedrooms              | No bedroom                                               | int  
B25041_003E  | B25041.  Bedrooms              | 1 bedroom                                                | int  
B25041_004E  | B25041.  Bedrooms              | 2 bedrooms                                               | int  
B25041_005E  | B25041.  Bedrooms              | 3 bedrooms                                               | int  
B25041_006E  | B25041.  Bedrooms              | 4 bedrooms                                               | int  
B25041_007E  | B25041.  Bedrooms              | 5 or more bedrooms                           

In [15]:
#Create a dictionary that takes in the created df name from the specific CA county code, and the cleaned, 
#normalized dataframe mutliplyed by the midpoint of each range, then summed to one aggregate value for each census 
#block for the number of bedrooms in the home
num_bedrooms_df_names = {}
for i in range(len(ca_county_lst)):
    df_name = 'num_bedrooms_df_county_' + str(ca_county_lst[i])
    
    df_data = censusdata.download('acs5', 2015,
                             censusdata.censusgeo([('state', '06'), ('county', ca_county_lst[i]), ('block group', '*')]),
                             ['B25041_002E', 'B25041_003E', 'B25041_004E', 'B25041_005E', 'B25041_006E',
                              'B25041_007E'], key = '80125df736ca356c2f7c99ef78ca3e7acea58265')
    
    df_data = df_data.rename(columns={'B25041_002E':'No bedroom', 'B25041_003E':'1 bedrooms',
                                      'B25041_004E':'2 bedrooms', 'B25041_005E':'3 bedrooms', 'B25041_006E':'4 bedroom',
                              'B25041_007E':'5 or more bedrooms'})
    

    df_data = df_data.div(df_data.sum(axis=1), axis=0)
    
    df_midpoints_lst = [0, 1, 2, 3, 4, 10]
    

    for i in range(len(df_data.columns)):
        df_data.iloc[:,i] = df_data.iloc[:,i] * df_midpoints_lst[i]

    
    df_data["Aggregate Number of Bedrooms Per Household"] = df_data.sum(axis=1)
    df_data = df_data[["Aggregate Number of Bedrooms Per Household"]]
    

    num_bedrooms_df_names[df_name] = df_data
num_bedrooms_df_names['num_bedrooms_df_county_001']


Unnamed: 0,Aggregate Number of Bedrooms Per Household
"Block Group 4, Census Tract 4097, Alameda County, California: Summary level: 150, state:06> county:001> tract:409700> block group:4",2.591837
"Block Group 1, Census Tract 4098, Alameda County, California: Summary level: 150, state:06> county:001> tract:409800> block group:1",2.964187
"Block Group 2, Census Tract 4098, Alameda County, California: Summary level: 150, state:06> county:001> tract:409800> block group:2",2.580645
"Block Group 3, Census Tract 4098, Alameda County, California: Summary level: 150, state:06> county:001> tract:409800> block group:3",3.727599
"Block Group 1, Census Tract 4099, Alameda County, California: Summary level: 150, state:06> county:001> tract:409900> block group:1",3.039216
...,...
"Block Group 2, Census Tract 4507.44, Alameda County, California: Summary level: 150, state:06> county:001> tract:450744> block group:2",3.617564
"Block Group 1, Census Tract 4507.45, Alameda County, California: Summary level: 150, state:06> county:001> tract:450745> block group:1",3.295322
"Block Group 2, Census Tract 4507.45, Alameda County, California: Summary level: 150, state:06> county:001> tract:450745> block group:2",3.508803
"Block Group 1, Census Tract 4507.46, Alameda County, California: Summary level: 150, state:06> county:001> tract:450746> block group:1",2.798165


We will be loading and cleaning the data for the ACS variable B25017, which represents the number of rooms in the home

In [45]:
#Print the dummy columns of the variable B25017
censusdata.printtable(censusdata.censustable('acs5', 2015, 'B25017'))

Variable     | Table                          | Label                                                    | Type 
-------------------------------------------------------------------------------------------------------------------
B25017_001E  | B25017.  Rooms                 | Total:                                                   | int  
B25017_002E  | B25017.  Rooms                 | 1 room                                                   | int  
B25017_003E  | B25017.  Rooms                 | 2 rooms                                                  | int  
B25017_004E  | B25017.  Rooms                 | 3 rooms                                                  | int  
B25017_005E  | B25017.  Rooms                 | 4 rooms                                                  | int  
B25017_006E  | B25017.  Rooms                 | 5 rooms                                                  | int  
B25017_007E  | B25017.  Rooms                 | 6 rooms                                      

In [17]:
#Create a dictionary that takes in the created df name from the specific CA county code, and the cleaned, 
#normalized dataframe mutliplyed by the midpoint of each range, then summed to one aggregate value for each census 
#block for the number of rooms in the home
num_rooms_df_names = {}
for i in range(len(ca_county_lst)):
    df_name = 'num_rooms_df_county_' + str(ca_county_lst[i])
    
    df_data = censusdata.download('acs5', 2015,
                             censusdata.censusgeo([('state', '06'), ('county', ca_county_lst[i]), ('block group', '*')]),
                             ['B25017_002E', 'B25017_003E', 'B25017_004E', 'B25017_005E', 'B25017_006E',
                              'B25017_007E', 'B25017_008E', 'B25017_009E', 'B25017_010E'], key = '80125df736ca356c2f7c99ef78ca3e7acea58265')
    
    df_data = df_data.rename(columns={'B25017_002E':'1 room', 'B25017_003E':'2 rooms', 
                                      'B25017_004E':'3 rooms', 'B25017_005E':'4 rooms', 'B25017_006E':'5 rooms',
                                      'B25017_007E':'6 rooms', 'B25017_008E':'7 rooms', 'B25017_009E':'8 rooms', 
                                      'B25017_010E':'9 or more rooms'})

    df_data = df_data.div(df_data.sum(axis=1), axis=0)
    
    df_midpoints_lst = [1, 2, 3, 4, 5, 6, 7, 8, 11]
    

    for i in range(len(df_data.columns)):
        df_data.iloc[:,i] = df_data.iloc[:,i] * df_midpoints_lst[i]

    
    df_data["Aggregate Number of Rooms Per Household"] = df_data.sum(axis=1)
    df_data = df_data[["Aggregate Number of Rooms Per Household"]]
    
    
    num_rooms_df_names[df_name] = df_data
num_rooms_df_names['num_rooms_df_county_001']


Unnamed: 0,Aggregate Number of Rooms Per Household
"Block Group 4, Census Tract 4097, Alameda County, California: Summary level: 150, state:06> county:001> tract:409700> block group:4",4.812925
"Block Group 1, Census Tract 4098, Alameda County, California: Summary level: 150, state:06> county:001> tract:409800> block group:1",6.347107
"Block Group 2, Census Tract 4098, Alameda County, California: Summary level: 150, state:06> county:001> tract:409800> block group:2",5.126100
"Block Group 3, Census Tract 4098, Alameda County, California: Summary level: 150, state:06> county:001> tract:409800> block group:3",6.519713
"Block Group 1, Census Tract 4099, Alameda County, California: Summary level: 150, state:06> county:001> tract:409900> block group:1",6.831373
...,...
"Block Group 2, Census Tract 4507.44, Alameda County, California: Summary level: 150, state:06> county:001> tract:450744> block group:2",7.001416
"Block Group 1, Census Tract 4507.45, Alameda County, California: Summary level: 150, state:06> county:001> tract:450745> block group:1",5.719298
"Block Group 2, Census Tract 4507.45, Alameda County, California: Summary level: 150, state:06> county:001> tract:450745> block group:2",6.111796
"Block Group 1, Census Tract 4507.46, Alameda County, California: Summary level: 150, state:06> county:001> tract:450746> block group:1",5.423038


We will be loading and cleaning the data for the ACS variable B25040, which represents the primary type of fuel used to heat the home

In [18]:
#Print the dummy columns of the variable B25040
censusdata.printtable(censusdata.censustable('acs5', 2015, 'B25040'))

Variable     | Table                          | Label                                                    | Type 
-------------------------------------------------------------------------------------------------------------------
B25040_001E  | B25040.  House Heating Fuel    | Total:                                                   | int  
B25040_002E  | B25040.  House Heating Fuel    | Utility gas                                              | int  
B25040_003E  | B25040.  House Heating Fuel    | Bottled, tank, or LP gas                                 | int  
B25040_004E  | B25040.  House Heating Fuel    | Electricity                                              | int  
B25040_005E  | B25040.  House Heating Fuel    | Fuel oil, kerosene, etc.                                 | int  
B25040_006E  | B25040.  House Heating Fuel    | Coal or coke                                             | int  
B25040_007E  | B25040.  House Heating Fuel    | Wood                                         

In [19]:
#Create a dictionary that takes in the created df name from the specific CA county code, and the cleaned and 
#normalized dataframe for the primary type of fuel used in the home
primary_heating_fuel_df_names = {}
for i in range(len(ca_county_lst)):
    df_name = 'primary_heating_fuel_df_county_' + str(ca_county_lst[i])
    
    df_data = censusdata.download('acs5', 2015,
                             censusdata.censusgeo([('state', '06'), ('county', ca_county_lst[i]), ('block group', '*')]),
                             ['B25040_002E', 'B25040_003E', 'B25040_005E', 'B25040_004E',
                              'B25040_007E'], key = '80125df736ca356c2f7c99ef78ca3e7acea58265')
    
    df_data = df_data.rename(columns={'B25040_002E':'Utility Gas', 'B25040_003E':'Bottled, tank, or LP gas', 
                                    'B25040_005E':'Fuel oil, kerosene, etc.', 'B25040_004E':'Electricity',
                                      'B25040_007E':'Wood'})
    
    df_data = df_data.div(df_data.sum(axis=1), axis=0)

    
    primary_heating_fuel_df_names[df_name] = df_data
primary_heating_fuel_df_names['primary_heating_fuel_df_county_001'].head()


Unnamed: 0,Utility Gas,"Bottled, tank, or LP gas","Fuel oil, kerosene, etc.",Electricity,Wood
"Block Group 4, Census Tract 4097, Alameda County, California: Summary level: 150, state:06> county:001> tract:409700> block group:4",0.636735,0.0,0.0,0.363265,0.0
"Block Group 1, Census Tract 4098, Alameda County, California: Summary level: 150, state:06> county:001> tract:409800> block group:1",0.508982,0.0,0.0,0.491018,0.0
"Block Group 2, Census Tract 4098, Alameda County, California: Summary level: 150, state:06> county:001> tract:409800> block group:2",0.83959,0.03413,0.0,0.12628,0.0
"Block Group 3, Census Tract 4098, Alameda County, California: Summary level: 150, state:06> county:001> tract:409800> block group:3",0.726872,0.0,0.0,0.251101,0.022026
"Block Group 1, Census Tract 4099, Alameda County, California: Summary level: 150, state:06> county:001> tract:409900> block group:1",0.847059,0.0,0.0,0.152941,0.0


We will be loading and cleaning the data for the ACS variable B11016, which represents the number of members in each household

In [20]:
#Print the dummy columns of the variable B11016
censusdata.printtable(censusdata.censustable('acs5', 2015, 'B11016'))

Variable     | Table                          | Label                                                    | Type 
-------------------------------------------------------------------------------------------------------------------
B11016_001E  | B11016.  HOUSEHOLD TYPE BY HOU | Total:                                                   | int  
B11016_002E  | B11016.  HOUSEHOLD TYPE BY HOU | Family households:                                       | int  
B11016_003E  | B11016.  HOUSEHOLD TYPE BY HOU | !! Family households: 2-person household                 | int  
B11016_004E  | B11016.  HOUSEHOLD TYPE BY HOU | !! Family households: 3-person household                 | int  
B11016_005E  | B11016.  HOUSEHOLD TYPE BY HOU | !! Family households: 4-person household                 | int  
B11016_006E  | B11016.  HOUSEHOLD TYPE BY HOU | !! Family households: 5-person household                 | int  
B11016_007E  | B11016.  HOUSEHOLD TYPE BY HOU | !! Family households: 6-person household     

In [21]:
#Create a dictionary that takes in the created df name from the specific CA county code, and the cleaned, 
#normalized dataframe mutliplyed by the midpoint of each range, then summed to one aggregate value for each census 
#block for the number of residents in the home
num_residents_df_names = {}
for i in range(len(ca_county_lst)):
    df_name = 'num_residents_df_county_' + str(ca_county_lst[i])
    
    df_data = censusdata.download('acs5', 2015,
                                  censusdata.censusgeo([('state', '06'), ('county', ca_county_lst[i]), ('block group', '*')]),
                                  ['B11016_003E', 'B11016_004E', 'B11016_005E', 'B11016_006E', 'B11016_007E', 'B11016_008E',
                                   'B11016_011E', 'B11016_012E', 'B11016_013E', 'B11016_014E', 'B11016_015E', 'B11016_016E'], key = '80125df736ca356c2f7c99ef78ca3e7acea58265')
    
    df_data = df_data.rename(columns={'B11016_003E':'Family households: 2-person household', 'B11016_004E':'Family households: 3-person household',
                                      'B11016_005E':'Family households: 4-person household', 'B11016_006E':'Family households: 5-person household',
                                      'B11016_007E':'Family households: 6-person household', 'B11016_008E':'Family households: 7-or-more-person household',
                                      'B11016_010E':'Nonfamily households: 1-person household','B11016_011E':'Nonfamily households: 2-person household', 
                                      'B11016_012E':'Nonfamily households: 3-person household','B11016_013E':'Nonfamily households: 4-person household', 
                                      'B11016_014E':'Nonfamily households: 5-person household','B11016_015E':'Nonfamily households: 6-person household', 
                                      'B11016_016E':'Nonfamily households: 7-person household'})

    df_data = df_data.div(df_data.sum(axis=1), axis=0)
    
    df_midpoints_lst = [2, 3, 4, 5, 6, 9, 1, 2, 3, 4, 5, 6, 9]
    

    for i in range(len(df_data.columns)):
        df_data.iloc[:,i] = df_data.iloc[:,i] * df_midpoints_lst[i]

    
    df_data["Aggregate Members Per Household"] = df_data.sum(axis=1)
    df_data = df_data[["Aggregate Members Per Household"]]
    
    
    num_residents_df_names[df_name] = df_data
num_residents_df_names['num_residents_df_county_001']


Unnamed: 0,Aggregate Members Per Household
"Block Group 4, Census Tract 4097, Alameda County, California: Summary level: 150, state:06> county:001> tract:409700> block group:4",4.578947
"Block Group 1, Census Tract 4098, Alameda County, California: Summary level: 150, state:06> county:001> tract:409800> block group:1",2.928270
"Block Group 2, Census Tract 4098, Alameda County, California: Summary level: 150, state:06> county:001> tract:409800> block group:2",3.387097
"Block Group 3, Census Tract 4098, Alameda County, California: Summary level: 150, state:06> county:001> tract:409800> block group:3",2.912791
"Block Group 1, Census Tract 4099, Alameda County, California: Summary level: 150, state:06> county:001> tract:409900> block group:1",2.803279
...,...
"Block Group 2, Census Tract 4507.44, Alameda County, California: Summary level: 150, state:06> county:001> tract:450744> block group:2",3.146048
"Block Group 1, Census Tract 4507.45, Alameda County, California: Summary level: 150, state:06> county:001> tract:450745> block group:1",3.343886
"Block Group 2, Census Tract 4507.45, Alameda County, California: Summary level: 150, state:06> county:001> tract:450745> block group:2",2.847755
"Block Group 1, Census Tract 4507.46, Alameda County, California: Summary level: 150, state:06> county:001> tract:450746> block group:1",2.852288


We will be loading and cleaning the data for the ACS variable B19001, which represents the income from the past 12 months of the household

In [22]:
#Print the dummy columns of the variable B19001, the income from the past 12 months of the household
censusdata.printtable(censusdata.censustable('acs5', 2015, 'B19001'))

Variable     | Table                          | Label                                                    | Type 
-------------------------------------------------------------------------------------------------------------------
B19001_001E  | B19001. Household Income in th | Total:                                                   | int  
B19001_002E  | B19001. Household Income in th | Less than $10,000                                        | int  
B19001_003E  | B19001. Household Income in th | $10,000 to $14,999                                       | int  
B19001_004E  | B19001. Household Income in th | $15,000 to $19,999                                       | int  
B19001_005E  | B19001. Household Income in th | $20,000 to $24,999                                       | int  
B19001_006E  | B19001. Household Income in th | $25,000 to $29,999                                       | int  
B19001_007E  | B19001. Household Income in th | $30,000 to $34,999                           

In [23]:
#Create a dictionary that takes in the created df name from the specific CA county code, and the cleaned, 
#normalized dataframe mutliplyed by the midpoint of each range, then summed to one aggregate value for each census 
#block for the income from the past 12 months of the household
household_income_past_year_df_names = {}
for i in range(len(ca_county_lst)):
    df_name = 'household_income_past_year_df_county_' + str(ca_county_lst[i])
    
    df_data = censusdata.download('acs5', 2015,
                                  censusdata.censusgeo([('state', '06'), ('county', ca_county_lst[i]), ('block group', '*')]),
                                  ['B19001_002E', 'B19001_003E', 'B19001_004E', 'B19001_005E', 'B19001_006E',
                                  'B19001_007E', 'B19001_008E', 'B19001_009E', 'B19001_010E','B19001_011E', 'B19001_012E',
                                  'B19001_013E', 'B19001_014E','B19001_015E', 'B19001_016E', 'B19001_017E'], key = '80125df736ca356c2f7c99ef78ca3e7acea58265')
    
    df_data = df_data.rename(columns={'B19001_002E':'Less than \$10,000', 'B19001_003E':'\$10,000 to \$14,999', 
                                      'B19001_004E':'\$15,000 to \$19,999', 'B19001_005E':'\$20,000 to \$24,999', 
                                      'B19001_006E':'\$25,000 to \$29,999',
                                      'B19001_007E':'\$30,000 to \$34,999', 'B19001_008E':'\$35,000 to \$39,999', 
                                      'B19001_009E':'\$40,000 to \$44,999', 'B19001_010E':'\$45,000 to \$49,999',
                                      'B19001_011E':'\$50,000 to \$59,999', 'B19001_012E':'\$60,000 to \$74,999',
                                      'B19001_013E':'\$75,000 to \$99,999', 'B19001_014E':'\$100,000 to \$124,999',
                                      'B19001_015E':'\$125,000 to \$149,999', 'B19001_016E':'\$150,000 to \$199,999',
                                      'B19001_017E':'\$200,000 or more'})

    df_data = df_data.div(df_data.sum(axis=1), axis=0)
    
    df_midpoints_lst = [5000, 12500, 17500, 22500, 27500, 32500, 37500, 42500, 47500, 55500, 67500, 87500, 112500, 137500, 175000, 400000]
    

    for i in range(len(df_data.columns)):
        df_data.iloc[:,i] = df_data.iloc[:,i] * df_midpoints_lst[i]

    
    df_data["Aggregate Household Income in the Past 12 Months"] = df_data.sum(axis=1)
    df_data = df_data[["Aggregate Household Income in the Past 12 Months"]]
    
    
    household_income_past_year_df_names[df_name] = df_data
household_income_past_year_df_names['household_income_past_year_df_county_001']


Unnamed: 0,Aggregate Household Income in the Past 12 Months
"Block Group 4, Census Tract 4097, Alameda County, California: Summary level: 150, state:06> county:001> tract:409700> block group:4",79778.846154
"Block Group 1, Census Tract 4098, Alameda County, California: Summary level: 150, state:06> county:001> tract:409800> block group:1",100992.647059
"Block Group 2, Census Tract 4098, Alameda County, California: Summary level: 150, state:06> county:001> tract:409800> block group:2",51383.105802
"Block Group 3, Census Tract 4098, Alameda County, California: Summary level: 150, state:06> county:001> tract:409800> block group:3",93458.149780
"Block Group 1, Census Tract 4099, Alameda County, California: Summary level: 150, state:06> county:001> tract:409900> block group:1",148760.784314
...,...
"Block Group 2, Census Tract 4507.44, Alameda County, California: Summary level: 150, state:06> county:001> tract:450744> block group:2",163992.223950
"Block Group 1, Census Tract 4507.45, Alameda County, California: Summary level: 150, state:06> county:001> tract:450745> block group:1",226242.257742
"Block Group 2, Census Tract 4507.45, Alameda County, California: Summary level: 150, state:06> county:001> tract:450745> block group:2",178470.070423
"Block Group 1, Census Tract 4507.46, Alameda County, California: Summary level: 150, state:06> county:001> tract:450746> block group:1",122273.700306


##### Begin merging individual datasets together

In [24]:
#Create a list of the dictionaries corresponding to each of the X variables created above
dict_lst = [units_in_structure_df_names, tenure_df_names, year_structure_built_df_names, year_moved_in_df_names, num_bedrooms_df_names,
           num_rooms_df_names, primary_heating_fuel_df_names, num_residents_df_names, household_income_past_year_df_names]

In [25]:
#For each dictionary of dataframes in dct_list, concatenate all the dataframes into 1 dataframe, and then place 
#this add this dataframe to the variable_dct dictionary with the corresponding variable name
variable_dct = {}

variable_dct_names = []

for i in dict_lst:
    name = re.search(r"(\w+df)", str(i)).group(0)
    variable_dct_names.append(name)
    starter_df = pd.DataFrame()
    for key, value in i.items():
        #i[key].reset_index(inplace=True)
        starter_df = pd.concat([starter_df, i[key]], axis=0)
    variable_dct[name] = starter_df

        
variable_dct["units_in_structure_df"]

Unnamed: 0,Mobile Home,"1, detached","1, attached",2 to 4,5 or more
"Block Group 4, Census Tract 4097, Alameda County, California: Summary level: 150, state:06> county:001> tract:409700> block group:4",0.000000,0.619048,0.000000,0.258503,0.122449
"Block Group 1, Census Tract 4098, Alameda County, California: Summary level: 150, state:06> county:001> tract:409800> block group:1",0.000000,1.000000,0.000000,0.000000,0.000000
"Block Group 2, Census Tract 4098, Alameda County, California: Summary level: 150, state:06> county:001> tract:409800> block group:2",0.000000,0.479472,0.000000,0.225806,0.294721
"Block Group 3, Census Tract 4098, Alameda County, California: Summary level: 150, state:06> county:001> tract:409800> block group:3",0.000000,1.000000,0.000000,0.000000,0.000000
"Block Group 1, Census Tract 4099, Alameda County, California: Summary level: 150, state:06> county:001> tract:409900> block group:1",0.000000,0.819608,0.074510,0.105882,0.000000
...,...,...,...,...,...
"Block Group 1, Census Tract 28, Ventura County, California: Summary level: 150, state:06> county:111> tract:002800> block group:1",0.064639,0.169202,0.024715,0.242395,0.499049
"Block Group 2, Census Tract 28, Ventura County, California: Summary level: 150, state:06> county:111> tract:002800> block group:2",0.028916,0.971084,0.000000,0.000000,0.000000
"Block Group 3, Census Tract 28, Ventura County, California: Summary level: 150, state:06> county:111> tract:002800> block group:3",0.000000,1.000000,0.000000,0.000000,0.000000
"Block Group 4, Census Tract 28, Ventura County, California: Summary level: 150, state:06> county:111> tract:002800> block group:4",0.000000,0.862816,0.092058,0.045126,0.000000


In [26]:
#For each variable dataframe stored in variable_dct, concatenate them into a single, final dataframe. Then, drop rows
#with NA values or where they are equal to 0, to deal with missing data
final_df = pd.DataFrame()
for i in variable_dct_names:
    final_df = pd.concat([final_df, variable_dct[i]], axis=1)



final_df = final_df.dropna(axis=0,how='any')
final_df = final_df.loc[~(final_df==0).all(axis=1)]


final_df.head()

Unnamed: 0,Mobile Home,"1, detached","1, attached",2 to 4,5 or more,Owned,Rented,Aggregate Year House Built,Aggregate Year Moved In,Aggregate Number of Bedrooms Per Household,Aggregate Number of Rooms Per Household,Utility Gas,"Bottled, tank, or LP gas","Fuel oil, kerosene, etc.",Electricity,Wood,Aggregate Members Per Household,Aggregate Household Income in the Past 12 Months
"Block Group 4, Census Tract 4097, Alameda County, California: Summary level: 150, state:06> county:001> tract:409700> block group:4",0.0,0.619048,0.0,0.258503,0.122449,0.506536,0.493464,1959.489796,2003.476923,2.591837,4.812925,0.636735,0.0,0.0,0.363265,0.0,4.578947,79778.846154
"Block Group 1, Census Tract 4098, Alameda County, California: Summary level: 150, state:06> county:001> tract:409800> block group:1",0.0,1.0,0.0,0.0,0.0,0.646226,0.353774,1945.647383,1992.547059,2.964187,6.347107,0.508982,0.0,0.0,0.491018,0.0,2.92827,100992.647059
"Block Group 2, Census Tract 4098, Alameda County, California: Summary level: 150, state:06> county:001> tract:409800> block group:2",0.0,0.479472,0.0,0.225806,0.294721,0.372405,0.627595,1963.064516,2000.325939,2.580645,5.1261,0.83959,0.03413,0.0,0.12628,0.0,3.387097,51383.105802
"Block Group 3, Census Tract 4098, Alameda County, California: Summary level: 150, state:06> county:001> tract:409800> block group:3",0.0,1.0,0.0,0.0,0.0,0.842905,0.157095,1960.609319,1997.079295,3.727599,6.519713,0.726872,0.0,0.0,0.251101,0.022026,2.912791,93458.14978
"Block Group 1, Census Tract 4099, Alameda County, California: Summary level: 150, state:06> county:001> tract:409900> block group:1",0.0,0.819608,0.07451,0.105882,0.0,0.901991,0.098009,1956.568627,1991.588235,3.039216,6.831373,0.847059,0.0,0.0,0.152941,0.0,2.803279,148760.784314


##### Begin scaling the ordinal variables in our dataset

In [27]:
#For the ordinal variables stored in final_df, apply the sklearn MinMaxScaler in order to scale these variables 
#approprtiately
scaled_columns_lst = ["Aggregate Year House Built", "Aggregate Year Moved In", "Aggregate Number of Bedrooms Per Household",
"Aggregate Number of Rooms Per Household", "Aggregate Members Per Household", "Aggregate Household Income in the Past 12 Months"]

scaling_df = final_df[scaled_columns_lst]

scaler = MinMaxScaler()
scaled_df = pd.DataFrame(scaler.fit_transform(scaling_df), index = scaling_df.index, columns = scaling_df.columns)

scaled_df.head()

Unnamed: 0,Aggregate Year House Built,Aggregate Year Moved In,Aggregate Number of Bedrooms Per Household,Aggregate Number of Rooms Per Household,Aggregate Members Per Household,Aggregate Household Income in the Past 12 Months
"Block Group 4, Census Tract 4097, Alameda County, California: Summary level: 150, state:06> county:001> tract:409700> block group:4",0.437511,0.699541,0.323652,0.404898,0.508772,0.189314
"Block Group 1, Census Tract 4098, Alameda County, California: Summary level: 150, state:06> county:001> tract:409800> block group:1",0.28415,0.41455,0.370149,0.567814,0.325363,0.243019
"Block Group 2, Census Tract 4098, Alameda County, California: Summary level: 150, state:06> county:001> tract:409800> block group:2",0.477116,0.617381,0.322255,0.438154,0.376344,0.117426
"Block Group 3, Census Tract 4098, Alameda County, California: Summary level: 150, state:06> county:001> tract:409800> block group:3",0.449914,0.532726,0.465479,0.586143,0.323643,0.223945
"Block Group 1, Census Tract 4099, Alameda County, California: Summary level: 150, state:06> county:001> tract:409900> block group:1",0.405147,0.389549,0.379518,0.619239,0.311475,0.363951


In [28]:
#Now replace the original columns in final_df with the new scaled ones
for i in scaled_df.columns:
    if i == "Aggregate Household Income in the Past 12 Months":
        final_df["Household Income in the Past 12 Months"] = scaled_df[i]
    else:
        final_df[i] = scaled_df[i]

final_df.head()

Unnamed: 0,Mobile Home,"1, detached","1, attached",2 to 4,5 or more,Owned,Rented,Aggregate Year House Built,Aggregate Year Moved In,Aggregate Number of Bedrooms Per Household,Aggregate Number of Rooms Per Household,Utility Gas,"Bottled, tank, or LP gas","Fuel oil, kerosene, etc.",Electricity,Wood,Aggregate Members Per Household,Aggregate Household Income in the Past 12 Months,Household Income in the Past 12 Months
"Block Group 4, Census Tract 4097, Alameda County, California: Summary level: 150, state:06> county:001> tract:409700> block group:4",0.0,0.619048,0.0,0.258503,0.122449,0.506536,0.493464,0.437511,0.699541,0.323652,0.404898,0.636735,0.0,0.0,0.363265,0.0,0.508772,79778.846154,0.189314
"Block Group 1, Census Tract 4098, Alameda County, California: Summary level: 150, state:06> county:001> tract:409800> block group:1",0.0,1.0,0.0,0.0,0.0,0.646226,0.353774,0.28415,0.41455,0.370149,0.567814,0.508982,0.0,0.0,0.491018,0.0,0.325363,100992.647059,0.243019
"Block Group 2, Census Tract 4098, Alameda County, California: Summary level: 150, state:06> county:001> tract:409800> block group:2",0.0,0.479472,0.0,0.225806,0.294721,0.372405,0.627595,0.477116,0.617381,0.322255,0.438154,0.83959,0.03413,0.0,0.12628,0.0,0.376344,51383.105802,0.117426
"Block Group 3, Census Tract 4098, Alameda County, California: Summary level: 150, state:06> county:001> tract:409800> block group:3",0.0,1.0,0.0,0.0,0.0,0.842905,0.157095,0.449914,0.532726,0.465479,0.586143,0.726872,0.0,0.0,0.251101,0.022026,0.323643,93458.14978,0.223945
"Block Group 1, Census Tract 4099, Alameda County, California: Summary level: 150, state:06> county:001> tract:409900> block group:1",0.0,0.819608,0.07451,0.105882,0.0,0.901991,0.098009,0.405147,0.389549,0.379518,0.619239,0.847059,0.0,0.0,0.152941,0.0,0.311475,148760.784314,0.363951


---
## Section 3: Adding variables<a id='add'></a>
Now we need to parse geospatial information to add energy expenditure and the percenatge of the population that lived in rural or urban areas in last year

#### Parse geospatial information

In [29]:
#Convert geo index into mutiple geolocation columns

ca_acs = final_df.copy()
#rest index and make copy
ca_acs.reset_index(inplace = True)
ca_acs.rename(columns={"index": "id"}, inplace = True)
ca_acs['geo_inf'] = [str(ca_acs.id[i]) for i in range(0, len(ca_acs))]

#parse geo information
ca_acs['geo_inf'] = [str(ca_acs.id[i]) for i in range(0, len(ca_acs))]
ca_acs[['Block','Tract', 'County Name', 'State', 'FIP']] = ca_acs.geo_inf.str.split(',',expand=True)                                                               
ca_acs[['STATEFP','COUNTYFP', 'TRACTCE', 'BLKGRPCE']] = ca_acs.FIP.str.split('>',expand=True)                                                               

#extract numbers
ca_acs['STATEFP'] = ca_acs['STATEFP'].str.extract('(\d+)')
ca_acs['COUNTYFP'] = ca_acs['COUNTYFP'].str.extract('(\d+)')
ca_acs['TRACTCE'] = ca_acs['TRACTCE'].str.extract('(\d+)')
ca_acs['BLKGRPCE'] = ca_acs['BLKGRPCE'].str.extract('(\d+)')
ca_acs['Tract ID'] = ca_acs[['STATEFP', 'COUNTYFP', 'TRACTCE']].apply(lambda x: ''.join(x), axis=1).astype(int) 
   
#drop columns
ca_acs.drop(columns = {'id', 'geo_inf', 'FIP','Block', 'Tract', 'State'}, inplace = True)

#reorder remaining columns
cols_to_order = ['Tract ID','BLKGRPCE', 'TRACTCE', 'COUNTYFP', 'STATEFP', 'County Name']#,  ]
new_columns = cols_to_order + (ca_acs.columns.drop(cols_to_order).tolist())
ca_acs = ca_acs[new_columns]

ca_acs.head()

Unnamed: 0,Tract ID,BLKGRPCE,TRACTCE,COUNTYFP,STATEFP,County Name,Mobile Home,"1, detached","1, attached",2 to 4,...,Aggregate Number of Bedrooms Per Household,Aggregate Number of Rooms Per Household,Utility Gas,"Bottled, tank, or LP gas","Fuel oil, kerosene, etc.",Electricity,Wood,Aggregate Members Per Household,Aggregate Household Income in the Past 12 Months,Household Income in the Past 12 Months
0,6001409700,4,409700,1,6,Alameda County,0.0,0.619048,0.0,0.258503,...,0.323652,0.404898,0.636735,0.0,0.0,0.363265,0.0,0.508772,79778.846154,0.189314
1,6001409800,1,409800,1,6,Alameda County,0.0,1.0,0.0,0.0,...,0.370149,0.567814,0.508982,0.0,0.0,0.491018,0.0,0.325363,100992.647059,0.243019
2,6001409800,2,409800,1,6,Alameda County,0.0,0.479472,0.0,0.225806,...,0.322255,0.438154,0.83959,0.03413,0.0,0.12628,0.0,0.376344,51383.105802,0.117426
3,6001409800,3,409800,1,6,Alameda County,0.0,1.0,0.0,0.0,...,0.465479,0.586143,0.726872,0.0,0.0,0.251101,0.022026,0.323643,93458.14978,0.223945
4,6001409900,1,409900,1,6,Alameda County,0.0,0.819608,0.07451,0.105882,...,0.379518,0.619239,0.847059,0.0,0.0,0.152941,0.0,0.311475,148760.784314,0.363951


#### Add Energy expenditure

The energy expenditure data is from the Low-Income Energy Affordability Data program conducted by the Department of Energy https://www.energy.gov/eere/slsc/maps/lead-tool. The resolution of this dataset is per county and the data comes from the disaggregated Public use microdata (PUMS) from the US Census Bureau, for 2015.

In [31]:
#Read in a csv file that has the total energy expenditure for each census tract group
expenditure_ct = pd.read_csv("data/lead-tool-map-data.csv", skiprows=8, sep=",")
expenditure_ct.head()

Unnamed: 0,Geography ID,Name,County,Tribal Areas,Avg. Annual Energy Cost
0,6005000101,Census Tract Census Tract 1.01,Amador County,,3815
1,6015000101,Census Tract Census Tract 1.01,Del Norte County,,1574
2,6029000101,Census Tract Census Tract 1.01,Kern County,,1683
3,6043000101,Census Tract Census Tract 1.01,Mariposa County,,2579
4,6051000101,Census Tract Census Tract 1.01,Mono County,Benton Paiute Reservation and Off-Reservation ...,3831


In [32]:
expenditure_ct_copy = expenditure_ct.copy()
energy_scaler = MinMaxScaler()
expenditure_ct_copy = expenditure_ct_copy[["Avg. Annual Energy Cost"]]
energy_scaled_df = pd.DataFrame(energy_scaler.fit_transform(expenditure_ct_copy), index = expenditure_ct_copy.index, columns = expenditure_ct_copy.columns)
energy_scaled_df = energy_scaled_df.rename(columns = {'Avg. Annual Energy Cost':'Normalized Avg. Annual Energy Cost'})
energy_scaled_df.head()
expenditure_ct = pd.concat([expenditure_ct, energy_scaled_df], axis=1)
expenditure_ct.head()

Unnamed: 0,Geography ID,Name,County,Tribal Areas,Avg. Annual Energy Cost,Normalized Avg. Annual Energy Cost
0,6005000101,Census Tract Census Tract 1.01,Amador County,,3815,0.653108
1,6015000101,Census Tract Census Tract 1.01,Del Norte County,,1574,0.255062
2,6029000101,Census Tract Census Tract 1.01,Kern County,,1683,0.274423
3,6043000101,Census Tract Census Tract 1.01,Mariposa County,,2579,0.43357
4,6051000101,Census Tract Census Tract 1.01,Mono County,Benton Paiute Reservation and Off-Reservation ...,3831,0.65595


In [33]:
#Now merge the energy expenditure data with our final_df to make our enhanced_df
enhanced_df = ca_acs.merge(expenditure_ct[['Avg. Annual Energy Cost', 'Normalized Avg. Annual Energy Cost', 'Geography ID']], left_on='Tract ID', right_on='Geography ID').drop(columns= ['Geography ID']) 
enhanced_df.head()

Unnamed: 0,Tract ID,BLKGRPCE,TRACTCE,COUNTYFP,STATEFP,County Name,Mobile Home,"1, detached","1, attached",2 to 4,...,Utility Gas,"Bottled, tank, or LP gas","Fuel oil, kerosene, etc.",Electricity,Wood,Aggregate Members Per Household,Aggregate Household Income in the Past 12 Months,Household Income in the Past 12 Months,Avg. Annual Energy Cost,Normalized Avg. Annual Energy Cost
0,6001409700,4,409700,1,6,Alameda County,0.0,0.619048,0.0,0.258503,...,0.636735,0.0,0.0,0.363265,0.0,0.508772,79778.846154,0.189314,1690,0.275666
1,6001409700,1,409700,1,6,Alameda County,0.0,0.639665,0.092179,0.111732,...,0.595611,0.062696,0.0,0.341693,0.0,0.367339,51989.028213,0.11896,1690,0.275666
2,6001409700,2,409700,1,6,Alameda County,0.0,0.446404,0.052917,0.181818,...,0.648387,0.0,0.0,0.345161,0.006452,0.415312,44130.537975,0.099065,1690,0.275666
3,6001409700,3,409700,1,6,Alameda County,0.0,0.582051,0.066667,0.2,...,0.810888,0.0,0.0,0.189112,0.0,0.473265,50243.315508,0.11454,1690,0.275666
4,6001409800,1,409800,1,6,Alameda County,0.0,1.0,0.0,0.0,...,0.508982,0.0,0.0,0.491018,0.0,0.325363,100992.647059,0.243019,1843,0.302842


#### Add Urban/Rural divide classifier

The Urban/Rural divide classifier is obtained from the US Centus data 
https://www.census.gov/programs-surveys/geography/technical-documentation/records-layout/2010-urban-lists-record-layout.html. The resolution is per county.

Accordin to this classification, area can either be an Urban Cluster (2,500 to 50,000 people), an Urban Area (50,000 or more people) or a Rural area (otherwise).

In our case we have to just select the variables POPPCT_RURAL, POPPCT_UC, POPPCT_UA from the following code and divide them by 100 to get the percentage of population leaving in each of these areas. Then, we will merge to the enhanced or final df on the STATE and COUNTY codes in this file with our datasets state and county codes. This is a very simple step if you follow what I did for energy expenditure.


In [42]:
#Read in a csv file that has the amount of the population that lives ina rural, urban cluster, or urban area, 
#mutate the COUNTY column to add a leading space to match the County Name column in enhanced_df, then save three
#dictionarues contining the county name as the key, and the population statistic as the value
urban_rural_df = pd.read_csv("data/PctUrbanRural_County-2.csv")
urban_rural_df["COUNTYNAME"] = " " + urban_rural_df["COUNTYNAME"] + " County"

urban_rural_poppct_rural_dct = pd.Series(urban_rural_df.POPPCT_RURAL.values,index=urban_rural_df.COUNTYNAME).to_dict()
urban_rural_poppct_uc_dct = pd.Series(urban_rural_df.POPPCT_UC.values,index=urban_rural_df.COUNTYNAME).to_dict()
urban_rural_poppct_ua_dct = pd.Series(urban_rural_df.POPPCT_UA.values,index=urban_rural_df.COUNTYNAME).to_dict()

In [43]:
#Add the Percentage of Rural Population, Percentage of Urban Cluster Population, Percentage of Urban Area Population
#coumns to enhanced_df using the map function to produce our finished dataframe

enhanced_df["Percentage of Rural Population"] = enhanced_df["County Name"].map(urban_rural_poppct_rural_dct)/100
enhanced_df["Percentage of Urban Cluster Population"] = enhanced_df["County Name"].map(urban_rural_poppct_uc_dct)/100
enhanced_df["Percentage of Urban Area Population"] = enhanced_df["County Name"].map(urban_rural_poppct_ua_dct)/100
enhanced_df.head()

Unnamed: 0,Tract ID,BLKGRPCE,TRACTCE,COUNTYFP,STATEFP,County Name,Mobile Home,"1, detached","1, attached",2 to 4,...,Electricity,Wood,Aggregate Members Per Household,Aggregate Household Income in the Past 12 Months,Household Income in the Past 12 Months,Avg. Annual Energy Cost,Normalized Avg. Annual Energy Cost,Percentage of Rural Population,Percentage of Urban Cluster Population,Percentage of Urban Area Population
0,6001409700,4,409700,1,6,Alameda County,0.0,0.619048,0.0,0.258503,...,0.363265,0.0,0.508772,79778.846154,0.189314,1690,0.275666,0.0039,0.0,0.9961
1,6001409700,1,409700,1,6,Alameda County,0.0,0.639665,0.092179,0.111732,...,0.341693,0.0,0.367339,51989.028213,0.11896,1690,0.275666,0.0039,0.0,0.9961
2,6001409700,2,409700,1,6,Alameda County,0.0,0.446404,0.052917,0.181818,...,0.345161,0.006452,0.415312,44130.537975,0.099065,1690,0.275666,0.0039,0.0,0.9961
3,6001409700,3,409700,1,6,Alameda County,0.0,0.582051,0.066667,0.2,...,0.189112,0.0,0.473265,50243.315508,0.11454,1690,0.275666,0.0039,0.0,0.9961
4,6001409800,1,409800,1,6,Alameda County,0.0,1.0,0.0,0.0,...,0.491018,0.0,0.325363,100992.647059,0.243019,1843,0.302842,0.0039,0.0,0.9961


# Save to csv <a id='saving'></a>

In [None]:
#Save our enhanced_df as a zipped csv file and export it to our working directory
compression_opts = dict(method='zip',
                        archive_name='ACS_data.csv')  
enhanced_df.to_csv('ACS_data.zip', index=True,
          compression=compression_opts)