# Week 1: Census Data Analysis

## Notes

### About the PDB

- The data comes from the Census Planning Database (PDB); the full dataset can be downloaded as a .csv file.
- The PDB contains data from both the 2010 decennial census and the 2010-2014 American Community Survey (ACS). Since the purpose of the ACS is to measure changing social and economic characteristics of the population, we primarily refer to ACS variables in this analysis.
- PDB data is available at the census tract level and at the more granular block group level.
- Variable names are explained here: https://api.census.gov/data/2016/pdb/blockgroup/variables.html (block group), https://api.census.gov/data/2016/pdb/tract/variables.html (tract)

### Getting geographic information
- Locations are given as State/County/Tract/BG codes. To interpret these as longitude/latitude coordinates, we need a mapping from census tract or block group to geography.
- Mappings from census tract to longitude/latitude coordinates are available in the Census Tracts Gazetteer file (https://www.census.gov/geo/maps-data/data/gazetteer2017.html). This is what we use in this preliminary analysis.
- Mappings from block group can be accessed by opening the relevant shapefiles in ArcGIS (http://gif.berkeley.edu/resources/arcgis_education_edition.html).
    - About Shapefiles: https://www2.census.gov/geo/pdfs/maps-data/data/tiger/tgrshp2017/TGRSHP2017_TechDoc_Ch2.pdf
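
As a minimal sketch of how the State/County/Tract codes combine into a single tract identifier: a tract GEOID is the 2-digit state code, the 3-digit county code, and the 6-digit tract code concatenated. The helper names `tract_geoid` and `county_geoid_range` are hypothetical and only illustrate the arithmetic; stored as an integer, California's leading zero is dropped.

```python
def tract_geoid(state, county, tract):
    """Concatenate state (2 digits), county (3 digits) and tract (6 digits) codes.

    Returned as an int, so a leading zero in the state code is dropped.
    """
    return int(f"{int(state):02d}{int(county):03d}{int(tract):06d}")


def county_geoid_range(state, county):
    """Half-open [low, high) range covering every tract GEOID in one county."""
    low = tract_geoid(state, county, 0)
    return low, low + 1_000_000


# State 6 (California), county 1 (Alameda County), tract 400100
example_geoid = tract_geoid(6, 1, 400100)
alameda_range = county_geoid_range(6, 1)
```

The half-open range is convenient for filtering a Gazetteer table down to one county with `>=` and `<` comparisons.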

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [3]:
import numpy as np

In [4]:
# Census Planning Database - Block Group
# full_pdb16_bg_df = pd.read_csv("raw-data/pdb2016_bg_v8_us.csv", encoding="ISO-8859-1")

# Census Planning Database - Census Tract
full_pdb16_tr_df = pd.read_csv("raw-data/pdb2016_tr_v8_us.csv", encoding="ISO-8859-1")

### Alameda County

In [14]:
def df_for_county(county_name):
    return full_pdb16_tr_df.loc[full_pdb16_tr_df['County_name'] == county_name]

alameda_tr_df = df_for_county("Alameda County")

In [20]:
# Importing longitude/latitude mappings
gaz_tracts_df = pd.read_csv("raw-data/2017_gaz_tracts_06.csv", encoding="ISO-8859-1")
# The last Gazetteer header ('INTPTLONG') is padded with trailing whitespace; strip it.
gaz_tracts_df.columns = gaz_tracts_df.columns.str.strip()

def map_lat_long_geoid(geoid_min, geoid_max, df):
    # Adds 'Latitude' and 'Longitude' columns to the dataframe by looking up
    # each GIDTR among the Gazetteer tracts with geoid_min <= GEOID < geoid_max.
    # Modifies df in place.
    in_range = (gaz_tracts_df['GEOID'] >= geoid_min) & (gaz_tracts_df['GEOID'] < geoid_max)
    lat_long_df = gaz_tracts_df.loc[in_range].set_index('GEOID')

    # Series.map looks each GIDTR up in the GEOID index; unmatched tracts get NaN.
    df['Latitude'] = df['GIDTR'].map(lat_long_df['INTPTLAT'])
    df['Longitude'] = df['GIDTR'].map(lat_long_df['INTPTLONG'])
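
The same lookup can also be expressed as a left join, which returns a new dataframe instead of modifying one in place. The sketch below uses two tiny stand-in tables; the tract id `6001999999` is made up to show what happens when a tract has no Gazetteer entry, and the column names assume the Gazetteer headers have already been stripped of whitespace.

```python
import pandas as pd

# Stand-in for the PDB tract table (third id is hypothetical, with no match)
tracts = pd.DataFrame({"GIDTR": [6001400100, 6001400200, 6001999999]})

# Stand-in for the Gazetteer table
gazetteer = pd.DataFrame({
    "GEOID": [6001400100, 6001400200],
    "INTPTLAT": [37.867628, 37.848139],
    "INTPTLONG": [-122.231946, -122.249597],
})

# A left join keeps every tract; tracts without a match get NaN coordinates
merged = (
    tracts.merge(gazetteer, how="left", left_on="GIDTR", right_on="GEOID")
    .rename(columns={"INTPTLAT": "Latitude", "INTPTLONG": "Longitude"})
    .drop(columns="GEOID")
)
```

Checking `merged["Latitude"].isna()` afterwards is a quick way to spot tracts that did not receive coordinates.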

In [17]:
map_lat_long_geoid(6001000000, 6002000000, alameda_tr_df)
alameda_tr_df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Unnamed: 0,GIDTR,State,State_name,County,County_name,Tract,Flag,Num_BGs_in_Tract,LAND_AREA,AIAN_LAND,...,pct_Census_Mail_Returns_CEN_2010,pct_Vacants_CEN_2010,pct_Deletes_CEN_2010,pct_Census_UAA_CEN_2010,pct_Mailback_Count_CEN_2010,pct_FRST_FRMS_CEN_2010,pct_RPLCMNT_FRMS_CEN_2010,pct_BILQ_Mailout_count_CEN_2010,Latitude,Longitude
3563,6001400100,6,California,1,Alameda County,400100,,1,2.657,0,...,77.76,5.05,0,3.25,91.70,77.69,0.07,,37.867628,-122.231946
3564,6001400200,6,California,1,Alameda County,400200,,2,0.230,0,...,79.14,6.42,0,3.42,90.16,79.14,0.00,,37.848139,-122.249597
3565,6001400300,6,California,1,Alameda County,400300,,4,0.427,0,...,77.20,4.12,0,5.76,90.13,69.50,7.70,,37.840598,-122.254436
3566,6001400400,6,California,1,Alameda County,400400,,3,0.271,0,...,77.05,4.86,0,5.06,90.08,69.10,7.96,,37.848280,-122.257453
3567,6001400500,6,California,1,Alameda County,400500,,3,0.227,0,...,71.76,4.01,0,6.13,89.86,65.04,6.72,,37.848541,-122.264728
3568,6001400600,6,California,1,Alameda County,400600,,2,0.115,0,...,67.79,6.75,0,4.68,88.57,60.00,7.79,,37.841991,-122.264888
3569,6001400700,6,California,1,Alameda County,400700,,4,0.340,0,...,60.92,5.42,0,7.62,86.96,49.82,11.10,,37.841767,-122.272353
3570,6001400800,6,California,1,Alameda County,400800,,3,0.268,0,...,62.31,5.30,0,7.70,87.00,52.11,10.21,,37.845467,-122.283394
3571,6001400900,6,California,1,Alameda County,400900,,2,0.165,0,...,60.50,6.51,0,7.81,85.68,49.74,10.76,,37.839491,-122.280265
3572,6001401000,6,California,1,Alameda County,401000,,6,0.446,0,...,55.11,5.61,0,11.43,82.96,44.96,10.15,,37.831226,-122.271901
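
The `SettingWithCopyWarning` printed above appears because `df_for_county` returns a filtered selection of `full_pdb16_tr_df`, and new columns are then assigned to that selection. One common way to avoid it, sketched here on a toy table, is to take an explicit `.copy()` of the selection. The function name `df_for_county_copy` and the toy rows are assumptions for illustration only.

```python
import pandas as pd

toy_pdb = pd.DataFrame({
    "County_name": ["Alameda County", "Alameda County", "San Mateo County"],
    "GIDTR": [6001400100, 6001400200, 6081600100],
})


def df_for_county_copy(df, county_name):
    # .copy() makes the result independent of the source dataframe, so
    # adding columns to it is unambiguous and does not trigger the warning.
    return df.loc[df["County_name"] == county_name].copy()


alameda_toy = df_for_county_copy(toy_pdb, "Alameda County")
alameda_toy["Latitude"] = [37.867628, 37.848139]
```

The source table is left untouched; only the county-level copy gains the new column.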


In [18]:
# Some preliminary datasets
gender_alameda_tr_df = alameda_tr_df[['Latitude', 'Longitude', 'LAND_AREA', 'Tot_Population_ACS_10_14', 'pct_Females_ACS_10_14']]
ethnicity_alameda_tr_df = alameda_tr_df[['Latitude', 'Longitude', 'LAND_AREA', 'Tot_Population_ACS_10_14', 'NH_AIAN_alone_ACS_10_14', 'NH_Asian_alone_ACS_10_14', 'NH_Blk_alone_ACS_10_14', 'NH_NHOPI_alone_ACS_10_14', 'NH_SOR_alone_ACS_10_14', 'NH_White_alone_ACS_10_14']]

health_ins_alameda_tr_df = alameda_tr_df[['Latitude', 'Longitude', 'LAND_AREA', 'Tot_Population_ACS_10_14', 'No_Health_Ins_ACS_10_14', 'pct_No_Health_Ins_ACS_10_14', 'One_Health_Ins_ACS_10_14', 'pct_One_Health_Ins_ACS_10_14', 'pct_TwoPHealthIns_ACS_10_14', 'Two_Plus_Health_Ins_ACS_10_14']]
income_alameda_tr_df = alameda_tr_df[['Latitude', 'Longitude', 'LAND_AREA', 'Tot_Population_ACS_10_14', 'Med_HHD_Inc_ACS_10_14', 'Prs_Blw_Pov_Lev_ACS_10_14', 'PUB_ASST_INC_ACS_10_14']]

In [19]:
income_alameda_tr_df

Unnamed: 0,Latitude,Longitude,LAND_AREA,Tot_Population_ACS_10_14,Med_HHD_Inc_ACS_10_14,Prs_Blw_Pov_Lev_ACS_10_14,PUB_ASST_INC_ACS_10_14
3563,37.867628,-122.231946,2.657,3385,"$165,625",137,0
3564,37.848139,-122.249597,0.230,1939,"$134,531",78,0
3565,37.840598,-122.254436,0.427,5428,"$71,618",460,180
3566,37.848280,-122.257453,0.271,4279,"$98,824",307,40
3567,37.848541,-122.264728,0.227,3516,"$73,837",530,47
3568,37.841991,-122.264888,0.115,1750,"$57,639",143,36
3569,37.841767,-122.272353,0.340,4396,"$41,023",832,9
3570,37.845467,-122.283394,0.268,3218,"$59,018",243,36
3571,37.839491,-122.280265,0.165,2031,"$60,089",243,50
3572,37.831226,-122.271901,0.446,5505,"$38,403",1630,112
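
Note that `Med_HHD_Inc_ACS_10_14` is stored as text such as `"$165,625"`, so it cannot be averaged or plotted directly. A minimal sketch of converting such strings to numbers follows; the variable names are illustrative, and the `None` entry is an assumed example of a missing value.

```python
import pandas as pd

income_text = pd.Series(["$165,625", "$134,531", None])

# Drop the dollar sign and thousands separators, then convert;
# anything that still fails to parse becomes NaN.
income_usd = pd.to_numeric(
    income_text.str.replace(r"[$,]", "", regex=True),
    errors="coerce",
)
```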


In [35]:
gender_alameda_tr_df.to_csv('datasets/alameda/gender_alameda_tr.csv')
ethnicity_alameda_tr_df.to_csv('datasets/alameda/ethnicity_alameda_tr.csv')
health_ins_alameda_tr_df.to_csv('datasets/alameda/health_ins_alameda_tr.csv')
income_alameda_tr_df.to_csv('datasets/alameda/income_alameda_tr.csv')

### Other Bay Area Counties

In [37]:
sanfrancisco_tr_df = df_for_county("San Francisco County")

In [30]:
map_lat_long_geoid(6075000000, 6076000000, sanfrancisco_tr_df)


In [32]:
gender_sanfrancisco_tr_df = sanfrancisco_tr_df[['Latitude', 'Longitude', 'LAND_AREA', 'Tot_Population_ACS_10_14', 'pct_Females_ACS_10_14']]
ethnicity_sanfrancisco_tr_df = sanfrancisco_tr_df[['Latitude', 'Longitude', 'LAND_AREA', 'Tot_Population_ACS_10_14', 'NH_AIAN_alone_ACS_10_14', 'NH_Asian_alone_ACS_10_14', 'NH_Blk_alone_ACS_10_14', 'NH_NHOPI_alone_ACS_10_14', 'NH_SOR_alone_ACS_10_14', 'NH_White_alone_ACS_10_14']]

health_ins_sanfrancisco_tr_df = sanfrancisco_tr_df[['Latitude', 'Longitude', 'LAND_AREA', 'Tot_Population_ACS_10_14', 'No_Health_Ins_ACS_10_14', 'pct_No_Health_Ins_ACS_10_14', 'One_Health_Ins_ACS_10_14', 'pct_One_Health_Ins_ACS_10_14', 'pct_TwoPHealthIns_ACS_10_14', 'Two_Plus_Health_Ins_ACS_10_14']]
income_sanfrancisco_tr_df = sanfrancisco_tr_df[['Latitude', 'Longitude', 'LAND_AREA', 'Tot_Population_ACS_10_14', 'Med_HHD_Inc_ACS_10_14', 'Prs_Blw_Pov_Lev_ACS_10_14', 'PUB_ASST_INC_ACS_10_14']]

In [36]:
gender_sanfrancisco_tr_df.to_csv('datasets/san-francisco/gender_sanfrancisco_tr.csv')
ethnicity_sanfrancisco_tr_df.to_csv('datasets/san-francisco/ethnicity_sanfrancisco_tr.csv')
health_ins_sanfrancisco_tr_df.to_csv('datasets/san-francisco/health_ins_sanfrancisco_tr.csv')
income_sanfrancisco_tr_df.to_csv('datasets/san-francisco/income_sanfrancisco_tr.csv')

In [47]:
sanmateo_tr_df = df_for_county("San Mateo County")

In [48]:
map_lat_long_geoid(6081000000, 6082000000, sanmateo_tr_df)


In [44]:
gender_sanmateo_tr_df = sanmateo_tr_df[['Latitude', 'Longitude', 'LAND_AREA', 'Tot_Population_ACS_10_14', 'pct_Females_ACS_10_14']]
ethnicity_sanmateo_tr_df = sanmateo_tr_df[['Latitude', 'Longitude', 'LAND_AREA', 'Tot_Population_ACS_10_14', 'NH_AIAN_alone_ACS_10_14', 'NH_Asian_alone_ACS_10_14', 'NH_Blk_alone_ACS_10_14', 'NH_NHOPI_alone_ACS_10_14', 'NH_SOR_alone_ACS_10_14', 'NH_White_alone_ACS_10_14']]

health_ins_sanmateo_tr_df = sanmateo_tr_df[['Latitude', 'Longitude', 'LAND_AREA', 'Tot_Population_ACS_10_14', 'No_Health_Ins_ACS_10_14', 'pct_No_Health_Ins_ACS_10_14', 'One_Health_Ins_ACS_10_14', 'pct_One_Health_Ins_ACS_10_14', 'pct_TwoPHealthIns_ACS_10_14', 'Two_Plus_Health_Ins_ACS_10_14']]
income_sanmateo_tr_df = sanmateo_tr_df[['Latitude', 'Longitude', 'LAND_AREA', 'Tot_Population_ACS_10_14', 'Med_HHD_Inc_ACS_10_14', 'Prs_Blw_Pov_Lev_ACS_10_14', 'PUB_ASST_INC_ACS_10_14']]

gender_sanmateo_tr_df.to_csv('datasets/san-mateo/gender_sanmateo_tr.csv')
ethnicity_sanmateo_tr_df.to_csv('datasets/san-mateo/ethnicity_sanmateo_tr.csv')
health_ins_sanmateo_tr_df.to_csv('datasets/san-mateo/health_ins_sanmateo_tr.csv')
income_sanmateo_tr_df.to_csv('datasets/san-mateo/income_sanmateo_tr.csv')

In [49]:
santaclara_tr_df = df_for_county("Santa Clara County")

In [51]:
map_lat_long_geoid(6085000000, 6086000000, santaclara_tr_df)


In [52]:
gender_santaclara_tr_df = santaclara_tr_df[['Latitude', 'Longitude', 'LAND_AREA', 'Tot_Population_ACS_10_14', 'pct_Females_ACS_10_14']]
ethnicity_santaclara_tr_df = santaclara_tr_df[['Latitude', 'Longitude', 'LAND_AREA', 'Tot_Population_ACS_10_14', 'NH_AIAN_alone_ACS_10_14', 'NH_Asian_alone_ACS_10_14', 'NH_Blk_alone_ACS_10_14', 'NH_NHOPI_alone_ACS_10_14', 'NH_SOR_alone_ACS_10_14', 'NH_White_alone_ACS_10_14']]

health_ins_santaclara_tr_df = santaclara_tr_df[['Latitude', 'Longitude', 'LAND_AREA', 'Tot_Population_ACS_10_14', 'No_Health_Ins_ACS_10_14', 'pct_No_Health_Ins_ACS_10_14', 'One_Health_Ins_ACS_10_14', 'pct_One_Health_Ins_ACS_10_14', 'pct_TwoPHealthIns_ACS_10_14', 'Two_Plus_Health_Ins_ACS_10_14']]
income_santaclara_tr_df = santaclara_tr_df[['Latitude', 'Longitude', 'LAND_AREA', 'Tot_Population_ACS_10_14', 'Med_HHD_Inc_ACS_10_14', 'Prs_Blw_Pov_Lev_ACS_10_14', 'PUB_ASST_INC_ACS_10_14']]

gender_santaclara_tr_df.to_csv('datasets/santa-clara/gender_santaclara_tr.csv')
ethnicity_santaclara_tr_df.to_csv('datasets/santa-clara/ethnicity_santaclara_tr.csv')
health_ins_santaclara_tr_df.to_csv('datasets/santa-clara/health_ins_santaclara_tr.csv')
income_santaclara_tr_df.to_csv('datasets/santa-clara/income_santaclara_tr.csv')

In [72]:
def write_datasets_for_county(county_name, dir_path):
    # Gets the data for the county, maps the latitude/longitude coordinates,
    # and writes the relevant datasets.
    df = df_for_county(county_name)

    # GIDTR range for the county: state code + 3-digit county code + 6-digit tract code.
    state, county = df['State'].iloc[0], df['County'].iloc[0]
    gidtr_min = int(str(state) + str(county).zfill(3) + '000000')
    gidtr_max = int(str(state) + str(county + 1).zfill(3) + '000000')
    map_lat_long_geoid(gidtr_min, gidtr_max, df)

    # Columns shared by every dataset, followed by the topic-specific columns.
    base_cols = ['Latitude', 'Longitude', 'LAND_AREA', 'Tot_Population_ACS_10_14']
    datasets = {
        'gender': ['pct_Females_ACS_10_14'],
        'ethnicity': ['NH_AIAN_alone_ACS_10_14', 'NH_Asian_alone_ACS_10_14', 'NH_Blk_alone_ACS_10_14', 'NH_NHOPI_alone_ACS_10_14', 'NH_SOR_alone_ACS_10_14', 'NH_White_alone_ACS_10_14'],
        'health_ins': ['No_Health_Ins_ACS_10_14', 'pct_No_Health_Ins_ACS_10_14', 'One_Health_Ins_ACS_10_14', 'pct_One_Health_Ins_ACS_10_14', 'pct_TwoPHealthIns_ACS_10_14', 'Two_Plus_Health_Ins_ACS_10_14'],
        'income': ['Med_HHD_Inc_ACS_10_14', 'Prs_Blw_Pov_Lev_ACS_10_14', 'PUB_ASST_INC_ACS_10_14'],
    }

    # e.g. "Contra Costa County" -> "contracostacounty"
    county_slug = county_name.lower().replace(" ", "")
    for name, cols in datasets.items():
        df[base_cols + cols].to_csv(dir_path + name + "_" + county_slug + "_tr.csv")

In [74]:
write_datasets_for_county("Marin County", "datasets/marin/")
write_datasets_for_county("Contra Costa County", "datasets/contra-costa/")
write_datasets_for_county("Napa County", "datasets/napa/")
write_datasets_for_county("Sonoma County", "datasets/sonoma/")
write_datasets_for_county("Solano County", "datasets/solano/")


# Pre-Process Census Data for Unknowns

## Contra Costa

### Ethnicity

In [4]:
pd.read_csv('census-datasets/contra-costa/ethnicity_contracostacounty_tr.csv')

Unnamed: 0.1,Unnamed: 0,Latitude,Longitude,LAND_AREA,Tot_Population_ACS_10_14,NH_AIAN_alone_ACS_10_14,NH_Asian_alone_ACS_10_14,NH_Blk_alone_ACS_10_14,NH_NHOPI_alone_ACS_10_14,NH_SOR_alone_ACS_10_14,NH_White_alone_ACS_10_14
0,4000,38.038123,-121.629968,30.506,4024.0,0.0,181.0,210.0,0.0,0.0,2566.0
1,4001,37.999316,-121.732928,1.378,6736.0,45.0,460.0,448.0,31.0,0.0,2889.0
2,4002,38.009414,-121.705822,2.363,3991.0,0.0,205.0,57.0,0.0,0.0,2056.0
3,4003,37.983711,-121.705654,1.670,6014.0,86.0,316.0,550.0,23.0,0.0,2294.0
4,4004,37.989109,-121.670210,7.252,8860.0,231.0,970.0,866.0,0.0,0.0,3108.0
5,4005,37.971409,-121.747864,2.031,6431.0,34.0,1158.0,1421.0,0.0,0.0,1463.0
6,4006,37.983675,-121.730632,2.652,10400.0,48.0,715.0,378.0,18.0,115.0,5758.0
7,4007,37.957794,-121.707487,2.513,8247.0,0.0,1136.0,527.0,1.0,0.0,3264.0
8,4008,37.936299,-121.671017,12.980,10735.0,63.0,369.0,502.0,6.0,0.0,4876.0
9,4009,37.943061,-121.758766,5.135,10676.0,23.0,1027.0,1637.0,36.0,0.0,4501.0


## Health Insurance

In [7]:
pd.read_csv('census-datasets/alameda/income_alameda_tr.csv')

Unnamed: 0.1,Unnamed: 0,Latitude,Longitude,LAND_AREA,Tot_Population_ACS_10_14,Med_HHD_Inc_ACS_10_14,Prs_Blw_Pov_Lev_ACS_10_14,PUB_ASST_INC_ACS_10_14
0,3563,37.867628,-122.231946,2.657,3385.0,"$165,625",137.0,0.0
1,3564,37.848139,-122.249597,0.230,1939.0,"$134,531",78.0,0.0
2,3565,37.840598,-122.254436,0.427,5428.0,"$71,618",460.0,180.0
3,3566,37.848280,-122.257453,0.271,4279.0,"$98,824",307.0,40.0
4,3567,37.848541,-122.264728,0.227,3516.0,"$73,837",530.0,47.0
5,3568,37.841991,-122.264888,0.115,1750.0,"$57,639",143.0,36.0
6,3569,37.841767,-122.272353,0.340,4396.0,"$41,023",832.0,9.0
7,3570,37.845467,-122.283394,0.268,3218.0,"$59,018",243.0,36.0
8,3571,37.839491,-122.280265,0.165,2031.0,"$60,089",243.0,50.0
9,3572,37.831226,-122.271901,0.446,5505.0,"$38,403",1630.0,112.0


In [6]:
set(pd.read_csv('census-datasets/alameda/poverty_level_alameda_tr_split.csv')['Variable'])

{'Above_poverty_level', 'Prs_Blw_Pov_Lev_ACS_10_14'}

In [9]:
health_ins_binarized = pd.read_csv('census-datasets/alameda/health_ins_alameda_tr_split_binarized.csv')

In [22]:
health_ins_binarized[(health_ins_binarized['variable'] == 'One_Plus_Health_Ins')\
                     & (np.abs(health_ins_binarized['Latitude'] - 37.867) < 5e-4)]

Unnamed: 0,Latitude,Longitude,value,variable
493,37.866788,-122.259979,6691,One_Plus_Health_Ins
495,37.866844,-122.277135,3867,One_Plus_Health_Ins
853,37.866788,-122.259979,975,One_Plus_Health_Ins
855,37.866844,-122.277135,848,One_Plus_Health_Ins


In [23]:
health_ins = pd.read_csv('census-datasets/alameda/health_ins_alameda_tr.csv')

In [24]:
health_ins.shape

(361, 11)

In [26]:
health_ins[np.abs(health_ins['Latitude'] - 37.867) < 5e-4]

Unnamed: 0.1,Unnamed: 0,Latitude,Longitude,LAND_AREA,Tot_Population_ACS_10_14,No_Health_Ins_ACS_10_14,pct_No_Health_Ins_ACS_10_14,One_Health_Ins_ACS_10_14,pct_One_Health_Ins_ACS_10_14,pct_TwoPHealthIns_ACS_10_14,Two_Plus_Health_Ins_ACS_10_14
133,3696,37.866788,-122.259979,0.158,8154,488,5.98,6691,82.06,11.96,975
135,3698,37.866844,-122.277135,0.282,5045,317,6.28,3867,76.65,16.81,848
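
The latitude filter used above is a tolerance match rather than an exact comparison. The same idea can be written with `np.isclose`, which makes the tolerance explicit. This is a sketch on three rows copied from the outputs above; the name `toy_rows` is hypothetical.

```python
import numpy as np
import pandas as pd

toy_rows = pd.DataFrame({
    "Latitude": [37.866788, 37.866844, 37.84828],
    "value": [6691, 3867, 3456],
})

# rtol=0 turns off the relative tolerance so only the absolute one applies
near_target = toy_rows[np.isclose(toy_rows["Latitude"], 37.867, rtol=0, atol=5e-4)]
```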


In [30]:
health_ins_binarized.iloc[493]['Latitude'] - health_ins_binarized.iloc[855]['Latitude']

-5.660000000062837e-05

In [31]:
health_ins_binarized.shape

(1440, 4)

In [32]:
health_ins_binarized.iloc[3]

Latitude                     37.8483
Longitude                   -122.257
value                            292
variable     No_Health_Ins_ACS_10_14
Name: 3, dtype: object

In [38]:
health_ins_binarized.iloc[[3, 363, 723, 1083]]

Unnamed: 0,Latitude,Longitude,value,variable
3,37.84828,-122.257453,292,No_Health_Ins_ACS_10_14
363,37.84828,-122.257453,3456,One_Plus_Health_Ins
723,37.84828,-122.257453,526,One_Plus_Health_Ins
1083,37.84828,-122.257453,5,Unknown


### Add Two-Plus Health Insurance to One Health Insurance

In [43]:
health_ins_cleaned_df = health_ins_binarized.iloc[:720].copy()

In [54]:
vals_to_add = np.zeros(len(health_ins_cleaned_df))

In [55]:
vals_to_add.shape

(720,)

In [56]:
# Rows 720-1079 hold the two-plus-insurance counts. Stage them at offsets
# 360-719 so they are added onto the one-insurance row of the same tract.
for i in np.arange(720, 1080):
    vals_to_add[i - 360] = health_ins_binarized.iloc[i]['value']

In [59]:
health_ins_cleaned_df['value'] = health_ins_cleaned_df['value'] + vals_to_add
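
An alternative to index arithmetic, sketched here under the assumption that each tract has a unique (Latitude, Longitude) pair, is to let `groupby` add up rows that share the same tract and the same `variable` label. Because both the one-insurance and two-plus-insurance rows are labelled `One_Plus_Health_Ins`, they collapse into a single row. The four rows below are the ones shown for tract index 3 above.

```python
import pandas as pd

long_rows = pd.DataFrame({
    "Latitude": [37.84828] * 4,
    "Longitude": [-122.257453] * 4,
    "value": [292, 3456, 526, 5],
    "variable": [
        "No_Health_Ins_ACS_10_14",
        "One_Plus_Health_Ins",
        "One_Plus_Health_Ins",
        "Unknown",
    ],
})

# Sum all rows that share a tract and a variable label
combined = long_rows.groupby(
    ["Latitude", "Longitude", "variable"], as_index=False, sort=False
)["value"].sum()
```

This approach does not depend on the rows being arranged in blocks of 360, but it would merge two distinct tracts if they happened to share identical coordinates.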

### Split the Unknowns

In [68]:
unknown_pop_vals = np.array(health_ins_binarized.iloc[1080:]['value'])

In [79]:
vals_to_add = np.zeros(len(health_ins_cleaned_df))

In [80]:
vals_to_add.shape

(720,)

In [81]:
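# Split each tract's 'Unknown' count between the two groups in proportion
# to their known sizes (rows 0-359: no insurance, rows 360-719: insured).
for i in range(360):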
    no_ins_pop = health_ins_cleaned_df.iloc[i]['value']
    one_ins_pop = health_ins_cleaned_df.iloc[i + 360]['value']
    frac_no_ins = no_ins_pop / (no_ins_pop + one_ins_pop)
    frac_with_ins = 1.0 - frac_no_ins
    vals_to_add[i] = np.round(frac_no_ins * unknown_pop_vals[i], decimals=0)
    vals_to_add[360 + i] = np.round(frac_with_ins * unknown_pop_vals[i], decimals=0)

In [82]:
for i in range(360): 
    print(vals_to_add[i] + vals_to_add[360 + i] - unknown_pop_vals[i])

0.0
0.0
0.0
...
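
The check above passes for this data, but rounding the two shares independently is not guaranteed to reproduce the unknown total in general: when a share lands exactly on .5, `np.round` rounds both halves to the nearest even number. A tract with no known population would also divide by zero. The sketch below is one possible vectorized variant that rounds only one share and takes the other as the complement. The function name and the choice to send all unknowns to the insured group when the known total is zero are assumptions of this sketch, not part of the original analysis.

```python
import numpy as np


def split_unknown_two_way(no_ins, with_ins, unknown):
    """Split `unknown` between two groups in proportion to their known sizes."""
    no_ins = np.asarray(no_ins, dtype=float)
    with_ins = np.asarray(with_ins, dtype=float)
    unknown = np.asarray(unknown, dtype=float)

    total = no_ins + with_ins
    # Assumption: where the known total is 0, the uninsured share is set to 0
    frac_no_ins = np.divide(no_ins, total, out=np.zeros_like(total), where=total > 0)

    add_no_ins = np.round(frac_no_ins * unknown)
    # Taking the complement guarantees the two parts sum to `unknown`
    add_with_ins = unknown - add_no_ins
    return add_no_ins, add_with_ins


# First tract uses the counts shown earlier (292 uninsured, 3456 + 526 insured,
# 5 unknown); the other two are made-up edge cases.
add_no, add_with = split_unknown_two_way([292, 1, 0], [3982, 1, 0], [5, 1, 3])
```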


In [84]:
health_ins_cleaned_df['value'] = health_ins_cleaned_df['value'] + vals_to_add

In [85]:
health_ins_cleaned_df.shape

(720, 4)

In [86]:
health_ins.shape

(361, 11)

In [87]:
health_ins_cleaned_df.head()

Unnamed: 0,Latitude,Longitude,value,variable
0,37.867628,-122.231946,114.0,No_Health_Ins_ACS_10_14
1,37.848139,-122.249597,105.0,No_Health_Ins_ACS_10_14
2,37.840598,-122.254436,404.0,No_Health_Ins_ACS_10_14
3,37.84828,-122.257453,292.0,No_Health_Ins_ACS_10_14
4,37.848541,-122.264728,555.0,No_Health_Ins_ACS_10_14


In [88]:
health_ins_cleaned_df.to_csv('census-datasets/alameda/health_ins_binarized_unknown_removed.csv')

## Race

In [95]:
race_split_df = pd.read_csv('census-datasets/alameda/ethnicity_alameda_tr_racial_split.csv')

In [102]:
race_split_df.head()

Unnamed: 0,Latitude,Longitude,Value,Variable
0,37.671028,-121.913048,7,NH_AIAN_alone_ACS_10_14
1,37.551134,-121.80546,0,NH_AIAN_alone_ACS_10_14
2,37.866788,-122.259979,73,NH_AIAN_alone_ACS_10_14
3,37.698814,-122.123482,0,NH_AIAN_alone_ACS_10_14
4,37.556962,-122.021234,0,NH_AIAN_alone_ACS_10_14


In [104]:
set(race_split_df['Variable'].values)

{'NH_AIAN_alone_ACS_10_14',
 'NH_Asian_alone_ACS_10_14',
 'NH_Blk_alone_ACS_10_14',
 'NH_NHOPI_alone_ACS_10_14',
 'NH_SOR_alone_ACS_10_14',
 'NH_White_alone_ACS_10_14',
 'Other'}

In [128]:
vals_to_add = np.zeros(len(race_split_df) - 360)

In [129]:
unknown_pop_vals = np.array(race_split_df['Value'].iloc[2160:].values)

In [130]:
unknown_pop_vals.shape

(360,)

In [131]:
for i in range(360):
    # Rows i, i + 360, ..., i + 1800 are the six race groups of tract i.
    indices = i + np.arange(0, 6) * 360
    pop_values = np.array(race_split_df.iloc[indices]['Value'].values)
    # Distribute the tract's unknown count in proportion to each group's size.
    pop_fractions = pop_values / np.sum(pop_values)
    vals_to_add[indices] = np.round(pop_fractions * unknown_pop_vals[i])

In [133]:
vals_to_add.shape

(2160,)
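
With six groups, rounding each share independently can leave the added counts slightly above or below the tract's unknown total (for example, three equal groups sharing 1 person each round to 0). A different technique, the largest-remainder method, avoids this: floor every share, then hand the leftover people to the groups with the largest fractional parts. The sketch below is an assumption-laden alternative, not the method used in this notebook; `allocate_unknown` is a hypothetical name, and tracts with no known population are left unchanged.

```python
import numpy as np


def allocate_unknown(known, unknown):
    """Largest-remainder split of `unknown` across groups.

    known:   integer counts, shape (n_groups, n_tracts)
    unknown: integer counts, shape (n_tracts,)
    """
    known = np.asarray(known, dtype=np.int64)
    unknown = np.asarray(unknown, dtype=np.int64)
    totals = known.sum(axis=0)
    safe_totals = np.where(totals > 0, totals, 1)  # avoid dividing by zero

    # Integer arithmetic keeps each quota known * unknown / totals exact
    base = (known * unknown) // safe_totals
    remainder = (known * unknown) % safe_totals

    # People still unassigned after flooring; tracts with no known
    # population receive nothing (assumption of this sketch)
    leftover = np.where(totals > 0, unknown - base.sum(axis=0), 0)

    # Rank groups within each tract by descending remainder and give one
    # extra person to the top `leftover` groups
    order = np.argsort(-remainder, axis=0, kind="stable")
    ranks = np.argsort(order, axis=0, kind="stable")
    return base + (ranks < leftover)


# Toy input: 3 groups x 2 tracts; the second tract has no known population
added = allocate_unknown([[1, 0], [1, 0], [1, 0]], [1, 3])
```

If the rows really are laid out in six contiguous blocks of 360, as the index arithmetic above implies, the `known` array could be built with `race_split_df['Value'].iloc[:2160].to_numpy().reshape(6, 360)`.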

In [136]:
race_split_df_clean = race_split_df.iloc[:2160].copy()

In [138]:
race_split_df_clean['Value'] = race_split_df_clean['Value'] + vals_to_add

In [139]:
race_split_df_clean['Value'] - race_split_df.iloc[:2160]['Value']

0         1.0
1         0.0
2        17.0
3         0.0
4         0.0
5         0.0
6        49.0
7         5.0
8        24.0
9        17.0
10        6.0
11        3.0
12        0.0
13       29.0
14        0.0
15       17.0
16        4.0
17       10.0
18        0.0
19        0.0
20       15.0
21        7.0
22        9.0
23        2.0
24        0.0
25       19.0
26        0.0
27       23.0
28        0.0
29        6.0
        ...  
2130    254.0
2131    196.0
2132    141.0
2133    359.0
2134    137.0
2135    388.0
2136    107.0
2137     79.0
2138    161.0
2139    252.0
2140     81.0
2141     31.0
2142    167.0
2143    202.0
2144    263.0
2145    138.0
2146    100.0
2147    204.0
2148    186.0
2149    145.0
2150    138.0
2151     65.0
2152     30.0
2153    161.0
2154     45.0
2155      4.0
2156     43.0
2157     67.0
2158     13.0
2159      0.0
Name: Value, Length: 2160, dtype: float64

In [142]:
race_split_df_clean.to_csv('census-datasets/alameda/ethnicity_alameda_tr_split_unknown_removed.csv')

In [141]:
vals_to_add

array([ 1.,  0., 17., ..., 67., 13.,  0.])

In [127]:
np.arange(0, 6) * 360 + 359

array([ 359,  719, 1079, 1439, 1799, 2159])

In [120]:
5 + np.arange(0, 7) * 360

array([   5,  365,  725, 1085, 1445, 1805, 2165])