## What is SAHIE?

The Small Area Health Insurance Estimates (SAHIE) is a program of the U.S. Census Bureau. SAHIE produces and disseminates model-based estimates of health insurance coverage for counties and states.

## What is BEA?

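The Bureau of Economic Analysis (BEA) is a U.S. Department of Commerce agency that publishes economic statistics. This notebook uses its county-level personal income, population, and per capita personal income data.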

## Objective

Combine the SAHIE and BEA data into a single dataset, then visualize the relationships among per capita personal income, population, and insurance rates for California and for the entire US.

In [1]:
import pandas as pd
import numpy as np
import requests
import os

In [2]:
keys = pd.read_csv('keys.csv', header = None)
# keys.csv is assumed to hold a single row: Census key in column 0, BEA key in column 1
census_api_key = keys.iloc[0, 0]
bea_api_key = keys.iloc[0, 1]

In [3]:
def json_to_df(response):
    # The first row of the JSON payload holds the column names; the rest are records
    rows = response.json()
    return pd.DataFrame(rows[1:], columns = rows[0])

In [4]:
# Census SAHIE data for counties from 2015 to 2020
sahie_url = 'https://api.census.gov/data/timeseries/healthins/sahie?get=GEOID,NIC_PT,NUI_PT,NAME&for=county:*&time={}&key={}'

def get_sahie(year):
    # Request one year of county-level estimates, passing the API key in the query string
    response = requests.get(sahie_url.format(year, census_api_key))
    response.raise_for_status()
    return json_to_df(response)

dfCensus2020 = get_sahie(2020)
dfCensus2019 = get_sahie(2019)
dfCensus2018 = get_sahie(2018)
dfCensus2017 = get_sahie(2017)
dfCensus2016 = get_sahie(2016)
dfCensus2015 = get_sahie(2015)

In [5]:
# Concatenating dataframes
dfCensus = pd.concat([dfCensus2020, dfCensus2019, dfCensus2018, dfCensus2017, dfCensus2016, dfCensus2015])
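The per-year requests and the concatenation can also be written as one loop. The sketch below is a minimal illustration that runs without network access: `fake_fetch` is a hypothetical stand-in for the HTTP request, and the row it returns reuses the 2020 Autauga County values for every year purely as placeholder data.

```python
import pandas as pd

def rows_to_df(rows):
    # The first row holds the column names; the remaining rows are records
    return pd.DataFrame(rows[1:], columns=rows[0])

def fake_fetch(year):
    # Hypothetical stand-in for the API call; returns rows shaped like the JSON payload.
    # The numbers are placeholders repeated for every year, not real estimates.
    return [
        ['GEOID', 'NIC_PT', 'NUI_PT', 'NAME', 'time', 'state', 'county'],
        ['01001', '41521', '4902', 'Autauga County, AL', str(year), '01', '001'],
    ]

# Newest year first, matching the order used in the concatenation above
years = range(2020, 2014, -1)
combined = pd.concat([rows_to_df(fake_fetch(y)) for y in years], ignore_index=True)
print(combined.shape)
```

Passing `ignore_index=True` gives the combined frame a single running index instead of repeating each year's 0-based index.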

In [6]:
dfCensus.shape

(18853, 7)

In [7]:
# Checking for nulls
dfCensus.isna().sum()

GEOID     0
NIC_PT    0
NUI_PT    0
NAME      0
time      0
state     0
county    0
dtype: int64

In [8]:
dfCensus.head()

Unnamed: 0,GEOID,NIC_PT,NUI_PT,NAME,time,state,county
0,1001,41521,4902,"Autauga County, AL",2020,1,1
1,1003,158919,19391,"Baldwin County, AL",2020,1,3
2,1005,13928,2337,"Barbour County, AL",2020,1,5
3,1007,14024,2097,"Bibb County, AL",2020,1,7
4,1009,40288,6156,"Blount County, AL",2020,1,9


In [9]:
# Renaming column
dfCensus.rename(columns = {'time': 'Year'}, inplace = True)

In [10]:
# BEA regional income data for 2015 to 2020
dfBEA2020 = pd.read_csv('BEAData2020.csv', header = 4)
dfBEA2019 = pd.read_csv('BEAData2019.csv', header = 4)
dfBEA2018 = pd.read_csv('BEAData2018.csv', header = 4)
dfBEA2017 = pd.read_csv('BEAData2017.csv', header = 4)
dfBEA2016 = pd.read_csv('BEAData2016.csv', header = 4)
dfBEA2015 = pd.read_csv('BEAData2015.csv', header = 4)

In [11]:
# Changing column names and inserting years as values into records to prepare for dataframe concatenation
dfBEA2020.rename(columns = {'2020': 'Values'}, inplace = True)
dfBEA2019.rename(columns = {'2019': 'Values'}, inplace = True)
dfBEA2018.rename(columns = {'2018': 'Values'}, inplace = True)
dfBEA2017.rename(columns = {'2017': 'Values'}, inplace = True)
dfBEA2016.rename(columns = {'2016': 'Values'}, inplace = True)
dfBEA2015.rename(columns = {'2015': 'Values'}, inplace = True)

dfBEA2020['Year'] = '2020'
dfBEA2019['Year'] = '2019'
dfBEA2018['Year'] = '2018'
dfBEA2017['Year'] = '2017'
dfBEA2016['Year'] = '2016'
dfBEA2015['Year'] = '2015'

In [12]:
# Concatenating dataframes (each year's column was already renamed to 'Values' above)
dfBEA = pd.concat([dfBEA2020, dfBEA2019, dfBEA2018, dfBEA2017, dfBEA2016, dfBEA2015])

In [13]:
dfBEA.head()

Unnamed: 0,GeoFips,GeoName,LineCode,Description,Values,Year
0,1001,"Autauga, AL",1.0,Personal income (thousands of dollars),2628375,2020
1,1001,"Autauga, AL",2.0,Population (persons) 1/,56145,2020
2,1001,"Autauga, AL",3.0,Per capita personal income (dollars) 2/,46814,2020
3,1003,"Baldwin, AL",1.0,Personal income (thousands of dollars),11682821,2020
4,1003,"Baldwin, AL",2.0,Population (persons) 1/,229287,2020
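The rename, year-tagging, and concatenation steps for the BEA files can be sketched as a single comprehension. This is an illustration only: the tiny in-memory frames stand in for the CSV files, and the 2019 value is a labeled placeholder rather than real BEA data.

```python
import pandas as pd

# Stand-ins for the per-year CSV files; each has its values in a column named after the year
frames_by_year = {
    '2020': pd.DataFrame({'GeoFips': ['1001'], 'LineCode': [1.0], '2020': ['2628375']}),
    '2019': pd.DataFrame({'GeoFips': ['1001'], 'LineCode': [1.0], '2019': ['0']}),  # placeholder value
}

# Rename the year column to 'Values', tag each row with its year, then stack the frames
stacked = pd.concat(
    [frame.rename(columns={year: 'Values'}).assign(Year=year)
     for year, frame in frames_by_year.items()],
    ignore_index=True,
)
print(stacked)
```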


In [14]:
# Checking for nulls in BEA dataframe
dfBEA.isnull().sum()

GeoFips          0
GeoName        120
LineCode       120
Description    120
Values         120
Year             0
dtype: int64

In [15]:
# Dropping rows with nulls; these appear to be non-data lines from the BEA files, and missing GeoName values cannot be imputed
dfBEA.dropna(inplace = True)

In [16]:
# Getting data into lists to transpose BEA data 
# Creating empty lists that we will insert elements into then splitting data into lists by value type and year

personalIncomeValues2020 = []
personalIncomeValues2019 = []
personalIncomeValues2018 = []
personalIncomeValues2017 = []
personalIncomeValues2016 = []
personalIncomeValues2015 = []

populationValues2020 = []
populationValues2019 = []
populationValues2018 = []
populationValues2017 = []
populationValues2016 = []
populationValues2015 = []

perCapitaValues2020 = []
perCapitaValues2019 = []
perCapitaValues2018 = []
perCapitaValues2017 = []
perCapitaValues2016 = []
perCapitaValues2015 = []


# Routing table: (LineCode, Year) -> the list that collects those values
valueLists = {
    (1.0, '2020'): personalIncomeValues2020, (2.0, '2020'): populationValues2020, (3.0, '2020'): perCapitaValues2020,
    (1.0, '2019'): personalIncomeValues2019, (2.0, '2019'): populationValues2019, (3.0, '2019'): perCapitaValues2019,
    (1.0, '2018'): personalIncomeValues2018, (2.0, '2018'): populationValues2018, (3.0, '2018'): perCapitaValues2018,
    (1.0, '2017'): personalIncomeValues2017, (2.0, '2017'): populationValues2017, (3.0, '2017'): perCapitaValues2017,
    (1.0, '2016'): personalIncomeValues2016, (2.0, '2016'): populationValues2016, (3.0, '2016'): perCapitaValues2016,
    (1.0, '2015'): personalIncomeValues2015, (2.0, '2015'): populationValues2015, (3.0, '2015'): perCapitaValues2015,
}

def toList(df):
    # Walk the rows in order and append each value to the list for its line code and year
    for lineCode, year, value in zip(df['LineCode'], df['Year'], df['Values']):
        target = valueLists.get((lineCode, year))
        # Rows with an unexpected line code or year are skipped
        if target is not None:
            target.append(value)

In [17]:
toList(dfBEA)

In [18]:
# Creating dataframe with just values in it
dfValues2020 = pd.DataFrame(list(zip(personalIncomeValues2020, populationValues2020, perCapitaValues2020)))
dfValues2019 = pd.DataFrame(list(zip(personalIncomeValues2019, populationValues2019, perCapitaValues2019)))
dfValues2018 = pd.DataFrame(list(zip(personalIncomeValues2018, populationValues2018, perCapitaValues2018)))
dfValues2017 = pd.DataFrame(list(zip(personalIncomeValues2017, populationValues2017, perCapitaValues2017)))
dfValues2016 = pd.DataFrame(list(zip(personalIncomeValues2016, populationValues2016, perCapitaValues2016)))
dfValues2015 = pd.DataFrame(list(zip(personalIncomeValues2015, populationValues2015, perCapitaValues2015)))
dfValues = pd.concat([dfValues2020, dfValues2019, dfValues2018, dfValues2017, dfValues2016, dfValues2015])

In [19]:
dfValues

Unnamed: 0,0,1,2
0,2628375,56145,46814
1,11682821,229287,50953
2,930687,24589,37850
3,759270,22136,34300
4,2246150,57879,38808
...,...,...,...
3135,2236617,44780,49947
3136,4589490,23083,198826
3137,818349,20777,39387
3138,373506,8282,45099


In [20]:
# Keeping one row per county and year (the LineCode 1.0 rows) so the rows line up with dfValues
# Resetting both indexes so the column-wise concat aligns by position;
# this assumes every county has all three LineCode rows, in the same order, in every year
dfBEA = dfBEA.loc[dfBEA['LineCode'] == 1.0].reset_index(drop = True)
dfValues = dfValues.reset_index(drop = True)
df = pd.concat([dfBEA, dfValues], axis = 1)
df.rename(columns = {0: 'Personal Income', 1: 'Population', 2: 'Per Capita Income'}, inplace = True)

In [21]:
df.head()

Unnamed: 0,GeoFips,GeoName,LineCode,Description,Values,Year,Personal Income,Population,Per Capita Income
0,1001,"Autauga, AL",1.0,Personal income (thousands of dollars),2628375,2020,2628375,56145,46814
1,1003,"Baldwin, AL",1.0,Personal income (thousands of dollars),11682821,2020,11682821,229287,50953
2,1005,"Barbour, AL",1.0,Personal income (thousands of dollars),930687,2020,930687,24589,37850
3,1007,"Bibb, AL",1.0,Personal income (thousands of dollars),759270,2020,759270,22136,34300
4,1009,"Blount, AL",1.0,Personal income (thousands of dollars),2246150,2020,2246150,57879,38808
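The long-to-wide reshaping done above with lists can also be expressed with `DataFrame.pivot`, which matches values by key rather than by row position. The sketch below is a minimal alternative, not the notebook's method; it uses the Autauga and Baldwin rows shown earlier.

```python
import pandas as pd

# Long format: one row per county, year, and line code
long_df = pd.DataFrame({
    'GeoFips': ['1001'] * 3 + ['1003'] * 3,
    'GeoName': ['Autauga, AL'] * 3 + ['Baldwin, AL'] * 3,
    'LineCode': [1.0, 2.0, 3.0] * 2,
    'Values': ['2628375', '56145', '46814', '11682821', '229287', '50953'],
    'Year': ['2020'] * 6,
})

# Wide format: one row per county and year, one column per line code
wide = (
    long_df
    .pivot(index=['GeoFips', 'GeoName', 'Year'], columns='LineCode', values='Values')
    .rename(columns={1.0: 'Personal Income', 2.0: 'Population', 3.0: 'Per Capita Income'})
    .reset_index()
)
wide.columns.name = None  # drop the leftover 'LineCode' label on the column axis
print(wide)
```

Because the pivot keys on `GeoFips` and `Year`, a county missing one of its three line codes would show up as a null in that column instead of shifting every later row.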


In [22]:
# Comparing shapes: the BEA frame has 13 fewer rows than the SAHIE frame
df.shape, dfCensus.shape

((18840, 9), (18853, 7))

In [23]:
dfCensus.head()

Unnamed: 0,GEOID,NIC_PT,NUI_PT,NAME,Year,state,county
0,1001,41521,4902,"Autauga County, AL",2020,1,1
1,1003,158919,19391,"Baldwin County, AL",2020,1,3
2,1005,13928,2337,"Barbour County, AL",2020,1,5
3,1007,14024,2097,"Bibb County, AL",2020,1,7
4,1009,40288,6156,"Blount County, AL",2020,1,9


In [24]:
dfCensus.loc[dfCensus['NAME'].str.endswith('CA')]

Unnamed: 0,GEOID,NIC_PT,NUI_PT,NAME,Year,state,county
187,06001,1368328,70851,"Alameda County, CA",2020,06,001
188,06003,746,92,"Alpine County, CA",2020,06,003
189,06005,23225,1599,"Amador County, CA",2020,06,005
190,06007,158242,12276,"Butte County, CA",2020,06,007
191,06009,30272,2365,"Calaveras County, CA",2020,06,009
...,...,...,...,...,...,...,...
239,06107,358698,45247,"Tulare County, CA",2015,06,107
240,06109,35027,2736,"Tuolumne County, CA",2015,06,109
241,06111,649972,71633,"Ventura County, CA",2015,06,111
242,06113,167209,13510,"Yolo County, CA",2015,06,113


In [25]:
df.loc[df['GeoName'].str.endswith('CA')]

Unnamed: 0,GeoFips,GeoName,LineCode,Description,Values,Year,Personal Income,Population,Per Capita Income
212,6001,"Alameda, CA",1.0,Personal income (thousands of dollars),144751041,2020,144751041,1662323,87078
213,6003,"Alpine, CA",1.0,Personal income (thousands of dollars),85240,2020,85240,1119,76175
214,6005,"Amador, CA",1.0,Personal income (thousands of dollars),1912798,2020,1912798,40083,47721
215,6007,"Butte, CA",1.0,Personal income (thousands of dollars),10696500,2020,10696500,212744,50279
216,6009,"Calaveras, CA",1.0,Personal income (thousands of dollars),2455536,2020,2455536,46308,53026
...,...,...,...,...,...,...,...,...,...
15965,06107,"Tulare, CA",1.0,Personal income (thousands of dollars),16464644,2015,16464644,456794,36044
15966,06109,"Tuolumne, CA",1.0,Personal income (thousands of dollars),2234222,2015,2234222,53599,41684
15967,06111,"Ventura, CA",1.0,Personal income (thousands of dollars),46380512,2015,46380512,845599,54849
15968,06113,"Yolo, CA",1.0,Personal income (thousands of dollars),10425111,2015,10425111,211998,49176


In [26]:
# We will join the two dataframes based on FIPS/GeoID
# Trimming leading 0's from both datasets
# Column-level assignment avoids chained indexing, which can silently modify a temporary copy
dfCensus['GEOID'] = dfCensus['GEOID'].astype(str).str.lstrip('0')
df['GeoFips'] = df['GeoFips'].astype(str).str.lstrip('0')
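Stripping leading zeros is one way to make the two key columns comparable; padding to the standard five-digit county FIPS width is another. The sketch below compares both on a few codes taken from the outputs above. It is only an illustration of the string methods, not a change to the join logic.

```python
import pandas as pd

# A mix of codes with and without leading zeros, as seen in the two datasets
fips = pd.Series(['1001', '01001', '6001', '06107'])

stripped = fips.str.lstrip('0')   # remove leading zeros
padded = fips.str.zfill(5)        # left-pad with zeros to five characters

print(stripped.tolist())
print(padded.tolist())
```

Either form works as a join key as long as both datasets receive the same treatment.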

In [27]:
# Renaming column to prepare for dataframe merge
df.rename(columns = {'GeoFips': 'GEOID'}, inplace = True)

In [28]:
df.head()

Unnamed: 0,GEOID,GeoName,LineCode,Description,Values,Year,Personal Income,Population,Per Capita Income
0,1001,"Autauga, AL",1.0,Personal income (thousands of dollars),2628375,2020,2628375,56145,46814
1,1003,"Baldwin, AL",1.0,Personal income (thousands of dollars),11682821,2020,11682821,229287,50953
2,1005,"Barbour, AL",1.0,Personal income (thousands of dollars),930687,2020,930687,24589,37850
3,1007,"Bibb, AL",1.0,Personal income (thousands of dollars),759270,2020,759270,22136,34300
4,1009,"Blount, AL",1.0,Personal income (thousands of dollars),2246150,2020,2246150,57879,38808


In [29]:
# Merging SAHIE and BEA Data to begin creating final dataframe
df = df.merge(dfCensus, how = 'left', left_on = ['GEOID', 'Year'], right_on = ['GEOID', 'Year'])

In [30]:
df.head()

Unnamed: 0,GEOID,GeoName,LineCode,Description,Values,Year,Personal Income,Population,Per Capita Income,NIC_PT,NUI_PT,NAME,state,county
0,1001,"Autauga, AL",1.0,Personal income (thousands of dollars),2628375,2020,2628375,56145,46814,41521,4902,"Autauga County, AL",1,1
1,1003,"Baldwin, AL",1.0,Personal income (thousands of dollars),11682821,2020,11682821,229287,50953,158919,19391,"Baldwin County, AL",1,3
2,1005,"Barbour, AL",1.0,Personal income (thousands of dollars),930687,2020,930687,24589,37850,13928,2337,"Barbour County, AL",1,5
3,1007,"Bibb, AL",1.0,Personal income (thousands of dollars),759270,2020,759270,22136,34300,14024,2097,"Bibb County, AL",1,7
4,1009,"Blount, AL",1.0,Personal income (thousands of dollars),2246150,2020,2246150,57879,38808,40288,6156,"Blount County, AL",1,9


In [31]:
# Removing columns we do not need anymore
df.drop(columns = ['GEOID', 'county', 'GeoName', 'LineCode', 'Description', 'Values'], inplace = True)

# Renaming columns for better clarity
df.rename(columns = {'NIC_PT': 'Number Insured', 'NUI_PT': 'Number Uninsured', 'NAME': 'County', 'state': 'State'}, inplace = True)

In [32]:
df.head()

Unnamed: 0,Year,Personal Income,Population,Per Capita Income,Number Insured,Number Uninsured,County,State
0,2020,2628375,56145,46814,41521,4902,"Autauga County, AL",1
1,2020,11682821,229287,50953,158919,19391,"Baldwin County, AL",1
2,2020,930687,24589,37850,13928,2337,"Barbour County, AL",1
3,2020,759270,22136,34300,14024,2097,"Bibb County, AL",1
4,2020,2246150,57879,38808,40288,6156,"Blount County, AL",1


In [33]:
# Checking for nulls in merged dataframe
df.isnull().sum()

Year                   0
Personal Income        0
Population             0
Per Capita Income      0
Number Insured       305
Number Uninsured     305
County               305
State                305
dtype: int64
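To see which rows failed to find a SAHIE match, `merge` accepts `indicator=True`, which adds a `_merge` column recording each row's origin. The sketch below is a small illustration: the frames are toy data, and the `99999` code is a hypothetical unmatched county.

```python
import pandas as pd

bea = pd.DataFrame({
    'GEOID': ['1001', '1003', '99999'],   # '99999' is a made-up code with no SAHIE match
    'Year': ['2020', '2020', '2020'],
    'Population': [56145, 229287, 0],
})
sahie = pd.DataFrame({
    'GEOID': ['1001', '1003'],
    'Year': ['2020', '2020'],
    'NIC_PT': ['41521', '158919'],
})

# Left join, with a '_merge' column saying whether each row matched
merged = bea.merge(sahie, how='left', on=['GEOID', 'Year'], indicator=True)
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched[['GEOID', 'Year']])
```

Inspecting the unmatched keys before dropping nulls shows whether the gaps come from geography differences between the two sources or from key formatting.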

In [34]:
# Dropping nulls as we cannot impute for County or State
df.dropna(inplace = True)

In [35]:
df

Unnamed: 0,Year,Personal Income,Population,Per Capita Income,Number Insured,Number Uninsured,County,State
0,2020,2628375,56145,46814,41521,4902,"Autauga County, AL",01
1,2020,11682821,229287,50953,158919,19391,"Baldwin County, AL",01
2,2020,930687,24589,37850,13928,2337,"Barbour County, AL",01
3,2020,759270,22136,34300,14024,2097,"Bibb County, AL",01
4,2020,2246150,57879,38808,40288,6156,"Blount County, AL",01
...,...,...,...,...,...,...,...,...
18835,2015,2236617,44780,49947,34742,4905,"Sweetwater County, WY",56
18836,2015,4589490,23083,198826,16943,3060,"Teton County, WY",56
18837,2015,818349,20777,39387,15960,2242,"Uinta County, WY",56
18838,2015,373506,8282,45099,5397,1140,"Washakie County, WY",56


In [36]:
# Separating county from state: names look like 'Autauga County, AL'
df['State'] = df['County'].str[-2:]
df['County'] = df['County'].str[:-4]

In [37]:
df.head()

Unnamed: 0,Year,Personal Income,Population,Per Capita Income,Number Insured,Number Uninsured,County,State
0,2020,2628375,56145,46814,41521,4902,Autauga County,AL
1,2020,11682821,229287,50953,158919,19391,Baldwin County,AL
2,2020,930687,24589,37850,13928,2337,Barbour County,AL
3,2020,759270,22136,34300,14024,2097,Bibb County,AL
4,2020,2246150,57879,38808,40288,6156,Blount County,AL
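The county/state split relies on every name ending in a comma, a space, and a two-letter state code. A sketch that splits on the last `', '` instead makes that assumption explicit; the names below are taken from the outputs above.

```python
import pandas as pd

names = pd.Series(['Autauga County, AL', 'Baldwin County, AL', 'Alameda County, CA'])

# Split once, from the right, on ', ' and expand the two parts into columns
parts = names.str.rsplit(', ', n=1, expand=True)
parts.columns = ['County', 'State']
print(parts)
```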


In [38]:
# Reordering columns
columns = ['County', 'Year', 'State', 'Personal Income', 'Population', 'Per Capita Income', 'Number Insured', 'Number Uninsured']
df = df.reindex(columns = columns)

In [39]:
df.shape

(18535, 8)

In [40]:
# Fixing data types for aggregations
df['Number Insured'] = pd.to_numeric(df['Number Insured'])
df['Number Uninsured'] = pd.to_numeric(df['Number Uninsured'])
df['Personal Income'] = pd.to_numeric(df['Personal Income'])
df['Population'] = pd.to_numeric(df['Population'])
df['Per Capita Income'] = pd.to_numeric(df['Per Capita Income'])
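The five conversions can be applied in one step, and `errors='coerce'` turns anything that cannot be parsed into a null instead of raising. The sketch below is illustrative: the `'(NA)'` entry is a hypothetical bad value added to show the coercion.

```python
import pandas as pd

demo = pd.DataFrame({
    'Population': ['56145', '229287', '(NA)'],        # '(NA)' is a hypothetical unparseable entry
    'Per Capita Income': ['46814', '50953', '37850'],
})

numeric_cols = ['Population', 'Per Capita Income']
# Convert every listed column; unparseable strings become NaN
converted = demo[numeric_cols].apply(pd.to_numeric, errors='coerce')
print(converted.isna().sum())
```

Counting the nulls after coercion gives a quick check on how many values failed to parse.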

In [41]:
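# We can calculate the percentage of insured, uninsured, and nonrespondents for insurance
# Note: SAHIE counts cover people under age 65 while the BEA population covers all ages,
# so the remainder labeled 'Nonrespondent' is only a rough proxy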
df['Percent Insured'] = round(df['Number Insured'] / df['Population'] * 100, 2)
df['Percent Uninsured'] = round(df['Number Uninsured'] / df['Population'] * 100, 2)
df['Percent Nonrespondent'] = round(100 - (df['Percent Insured'] + df['Percent Uninsured']), 2)

In [42]:
# Cleaning lowercase state value (assigning back so the change is applied to df itself)
df['State'] = df['State'].replace('ia', 'IA')

In [43]:
# We now have our final dataset ready for analysis
df.head()

Unnamed: 0,County,Year,State,Personal Income,Population,Per Capita Income,Number Insured,Number Uninsured,Percent Insured,Percent Uninsured,Percent Nonrespondent
0,Autauga County,2020,AL,2628375,56145,46814,41521,4902,73.95,8.73,17.32
1,Baldwin County,2020,AL,11682821,229287,50953,158919,19391,69.31,8.46,22.23
2,Barbour County,2020,AL,930687,24589,37850,13928,2337,56.64,9.5,33.86
3,Bibb County,2020,AL,759270,22136,34300,14024,2097,63.35,9.47,27.18
4,Blount County,2020,AL,2246150,57879,38808,40288,6156,69.61,10.64,19.75
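As a check on the percentage columns, the first row (Autauga County, 2020) can be recomputed by hand from the values shown above.

```python
# Autauga County, 2020, from the table above
number_insured = 41521
number_uninsured = 4902
population = 56145

percent_insured = round(number_insured / population * 100, 2)
percent_uninsured = round(number_uninsured / population * 100, 2)
# Whatever share of the population is in neither count
percent_remainder = round(100 - (percent_insured + percent_uninsured), 2)

print(percent_insured, percent_uninsured, percent_remainder)  # 73.95 8.73 17.32
```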


In [44]:
# Exporting final dataframe as csv, skipping the write if the file already exists
path = os.getcwd()
export_path = os.path.join(path, 'Datasets', 'SAHIE_BEA_2015_to_2020.csv')
if not os.path.exists(export_path):
    df.to_csv(export_path, index = False)
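The export step assumes the `Datasets` folder already exists. A sketch using `pathlib` creates the folder when needed and still skips the write if the file is present. `export_once` is a hypothetical helper name, and the example writes to a temporary directory so it can run anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd

def export_once(frame, export_path):
    # Write the frame to CSV only if the target file does not exist yet
    export_path = Path(export_path)
    export_path.parent.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    if export_path.exists():
        return False
    frame.to_csv(export_path, index=False)
    return True

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'Datasets' / 'SAHIE_BEA_2015_to_2020.csv'
    demo = pd.DataFrame({'County': ['Autauga County'], 'Year': ['2020']})
    first = export_once(demo, target)    # file is written
    second = export_once(demo, target)   # file already exists, so nothing is written

print(first, second)
```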

## Data Visualization - Tableau

With the consolidated dataset, we can now create a Tableau story with our data to convey findings and insights.

In [1]:
%%html

<div class='tableauPlaceholder' id='viz1664945379035' style='position: relative'><noscript><a href='#'><img alt='Percent Insured vs. Population in US&#47;California from 2015 to 2020 ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;SA&#47;SAHIEBEATableau&#47;PercentInsuredvs_PopulationinUSCaliforniafrom2015to2020&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='path' value='views&#47;SAHIEBEATableau&#47;PercentInsuredvs_PopulationinUSCaliforniafrom2015to2020?:language=en-US&amp;:embed=true' /> <param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;SA&#47;SAHIEBEATableau&#47;PercentInsuredvs_PopulationinUSCaliforniafrom2015to2020&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='language' value='en-US' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1664945379035');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='1016px';vizElement.style.height='991px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>