## Contents
- [Data Dictionary](#Data-Dictionary)
- [Importing and Combining COVID Data](#Importing-and-Combining-COVID-Data) 
- [Importing and Combining Census Number Data](#Importing-and-Combining-Census-Number-Data) 
- [Combining COVID and Census Data](#Combining-COVID-and-Census-Data) 
- [Feature Engineering](#Feature-Engineering)
- [Dataframe to Hold Both Total Numbers and Percent Values](#Dataframe-to-Hold-Both-Total-Numbers-and-Percent-Values)
- [Dataframe to Hold Percent Values](#Dataframe-to-Hold-Percent-Values)

# Data Dictionary

|Column Label|Type|Description
|------------|----|-----------|
|health_ins_noninst_pop|int|(Discrete): Total civilian noninstitutionalized population, used to calculate yes/no coverage %
|health_ins_noninst_pop_cov_no|int|(Discrete): Civilian noninstitutionalized population without health insurance coverage
|health_ins_noninst_pop_cov_yes|int|(Discrete): Civilian noninstitutionalized population with health insurance coverage
|health_ins_noninst_pop_private|int|(Discrete): Civilian noninstitutionalized population with private health insurance
|health_ins_noninst_pop_public|int|(Discrete): Civilian noninstitutionalized population with public coverage
|inc_hhlds|int|(Discrete): Total households used to calculate household income %
|inc_hhlds_less_than_10_000|int|(Discrete): Households with income under 10,000 dollars
|inc_hhlds_10_000_to_14_999|int|(Discrete): Households with income of 10,000 to 14,999 dollars
|inc_hhlds_15_000_to_24_999|int|(Discrete): Households with income of 15,000 to 24,999 dollars
|inc_hhlds_25_000_to_34_999|int|(Discrete): Households with income of 25,000 to 34,999 dollars
|inc_hhlds_35_000_to_49_999|int|(Discrete): Households with income of 35,000 to 49,999 dollars
|inc_hhlds_50_000_to_74_999|int|(Discrete): Households with income of 50,000 to 74,999 dollars
|inc_hhlds_75_000_to_99_999|int|(Discrete): Households with income of 75,000 to 99,999 dollars
|inc_hhlds_100_000_to_149_999|int|(Discrete): Households with income of 100,000 to 149,999 dollars
|inc_hhlds_150_000_to_199_999|int|(Discrete): Households with income of 150,000 to 199,999 dollars
|inc_hhlds_200_000_or_more|int|(Discrete): Households with income of 200,000 dollars or more
|inc_mean_hhld_inc_dol|int|(Discrete): Mean household income (dollars)
|inc_med_hhld_inc_dol|int|(Discrete): Median household income (dollars)
|inc_med_earn_female_full_yr_workers_dol|int|(Discrete): Median earnings for female full-time, year-round workers (dollars)
|inc_med_earn_male_full_yr_workers_dol|int|(Discrete): Median earnings for male full-time, year-round workers (dollars)
|inc_per_capita_inc_dol|int|(Discrete): Per capita income (dollars)
|obes_percent|float|(Continuous): Population obesity rate, stored as a proportion (e.g., 0.373)
|pop_density|float|(Continuous): People per square mile
|race_pop|int|(Discrete): Total population used to calculate race demographic %
|race_pop_american_indian_and_alaska_native_alone|int|(Discrete): Population American Indian and Alaska Native
|race_pop_asian_alone|int|(Discrete): Population Asian
|race_pop_black_or_african_american_alone|int|(Discrete): Population Black or African American
|race_pop_hispanic_or_latino_of_any_race|int|(Discrete): Population Hispanic or Latino of any race
|race_pop_native_hawaiian_and_other_pacific_islander_alone|int|(Discrete): Population Native Hawaiian or other Pacific Islander
|race_pop_some_other_race_alone|int|(Discrete): Population some other race
|race_pop_two_or_more_races|int|(Discrete): Population two or more races
|race_pop_white_alone|int|(Discrete): Population White
|sex_age_median_age_in_years|float|(Continuous): Median age (years)
|sex_age_pop|int|(Discrete): Total population used to calculate sex/age demographic %
|sex_age_pop_under_5|int|(Discrete): Population under 5
|sex_age_pop_5_to_9|int|(Discrete): Population 5-9
|sex_age_pop_10_to_14|int|(Discrete): Population 10-14
|sex_age_pop_15_to_19|int|(Discrete): Population 15-19
|sex_age_pop_20_to_24|int|(Discrete): Population 20-24
|sex_age_pop_25_to_34|int|(Discrete): Population 25-34
|sex_age_pop_35_to_44|int|(Discrete): Population 35-44
|sex_age_pop_45_to_54|int|(Discrete): Population 45-54
|sex_age_pop_55_to_59|int|(Discrete): Population 55-59
|sex_age_pop_60_to_64|int|(Discrete): Population 60-64
|sex_age_pop_65_to_74|int|(Discrete): Population 65-74
|sex_age_pop_75_to_84|int|(Discrete): Population 75-84
|sex_age_pop_85_and_over|int|(Discrete): Population 85+
|sex_age_pop_female|int|(Discrete): Population female
|sex_age_pop_male|int|(Discrete): Population male
|sq_mi|float|(Continuous): Square miles

In [1]:
import pandas as pd

# Importing and Combining COVID Data

In [2]:
# Import the relevant dataframes.
ca = pd.read_csv('../data/preprocessing/cleaned_covid_ca.csv')
fl = pd.read_csv('../data/preprocessing/cleaned_covid_fl.csv')
il = pd.read_csv('../data/preprocessing/cleaned_covid_il.csv')
ny = pd.read_csv('../data/preprocessing/cleaned_covid_ny.csv')
tx = pd.read_csv('../data/preprocessing/cleaned_covid_tx.csv')

In [3]:
# Display the first few rows of data. 
ca.head(2)

Unnamed: 0,Geographic Area Name,total_cases,total_fatalities,death_rate,total_tests
0,"Santa Clara County, California",23978.0,388.0,0.016181,839764
1,"San Mateo County, California",10942.0,159.0,0.014531,285657


In [4]:
# Display the first few rows of data. 
fl.head(2)

Unnamed: 0,Geographic Area Name,total_cases,total_fatalities,death_rate,total_tests
0,"Alachua County, Florida",10044,74,0.007368,110327
1,"Baker County, Florida",1750,19,0.010857,9005


In [5]:
# Display the first few rows of data. 
il.head(2)

Unnamed: 0,Geographic Area Name,total_cases,total_fatalities,death_rate,total_tests
0,"Adams County, Illinois",1862,19,0.010204,38542
1,"Alexander County, Illinois",120,1,0.008333,2350


In [6]:
# Display the first few rows of data. 
ny.head(2)

Unnamed: 0,Geographic Area Name,total_cases,total_fatalities,death_rate,total_tests
0,"Albany County, New York",3577,128,0.035784,176101
1,"Allegany County, New York",271,9,0.03321,24790


In [7]:
# Display the first few rows of data. 
tx.head(2)

Unnamed: 0,Geographic Area Name,total_cases,total_fatalities,death_rate,total_tests
0,"Anderson County, Texas",2941,39,0.013261,30052
1,"Andrews County, Texas",584,10,0.017123,2127


In [8]:
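# Stack the five state dataframes into a single COVID dataframe.
# Each state's original row index is kept, so index labels repeat.
covid = pd.concat([ca, fl, il, ny, tx])
covid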

Unnamed: 0,Geographic Area Name,total_cases,total_fatalities,death_rate,total_tests
0,"Santa Clara County, California",23978.0,388.0,0.016181,839764
1,"San Mateo County, California",10942.0,159.0,0.014531,285657
2,"Santa Barbara County, California",9781.0,120.0,0.012269,158693
3,"Tuolumne County, California",269.0,4.0,0.014870,20986
4,"Sierra County, California",6.0,0.0,0.000000,641
...,...,...,...,...,...
249,"Wood County, Texas",626.0,39.0,0.062300,6844
250,"Yoakum County, Texas",289.0,6.0,0.020761,1531
251,"Young County, Texas",535.0,8.0,0.014953,3363
252,"Zapata County, Texas",345.0,9.0,0.026087,4355
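
Because each state file carries its own 0-based index, the concatenated frame above repeats index labels (the last Texas row is labeled 252, not the total row count). The `.0` on every case count is also consistent with mixing a float column (California) with integer columns (the other states). A minimal sketch of both effects, using small made-up frames rather than the real files; `ca_demo` and `tx_demo` are hypothetical names:

```python
import pandas as pd

# Hypothetical stand-ins for two per-state frames (values are illustrative only).
ca_demo = pd.DataFrame({
    'Geographic Area Name': ['A County, California', 'B County, California'],
    'total_cases': [100.0, 50.0],   # float column, as in the California file
})
tx_demo = pd.DataFrame({
    'Geographic Area Name': ['C County, Texas'],
    'total_cases': [10],            # integer column, as in the other files
})

# Default concat keeps each frame's own index, so labels repeat.
stacked = pd.concat([ca_demo, tx_demo])

# ignore_index=True rebuilds a clean 0..n-1 index.
stacked_clean = pd.concat([ca_demo, tx_demo], ignore_index=True)

# County names should be unique before they are used as a merge key.
has_duplicate_names = stacked_clean['Geographic Area Name'].duplicated().any()

print(list(stacked.index), list(stacked_clean.index), has_duplicate_names)
```

The repeated labels are harmless here because the later merge builds a fresh index, but `ignore_index=True` avoids surprises if the frame is indexed by position before that step.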


# Importing and Combining Census Number Data

In [9]:
# Import the relevant dataframes.
race = pd.read_csv('../data/preprocessing/cleaned_dp05_race_five_states.csv')
sa = pd.read_csv('../data/preprocessing/cleaned_dp05_sex_age_five_states.csv')
land = pd.read_csv('../data/preprocessing/cleaned_area_five_states.csv')
ins = pd.read_csv('../data/preprocessing/cleaned_dp03_insurance_five_states.csv')
inc = pd.read_csv('../data/preprocessing/cleaned_dp03_income_five_states.csv')
obes = pd.read_csv('../data/preprocessing/cleaned_obesity_five_states.csv')

In [10]:
# Display the first few rows of data. 
race.head(2)

Unnamed: 0,Geographic Area Name,race_pop,race_pop_hispanic_or_latino_of_any_race,race_pop_white_alone,race_pop_black_or_african_american_alone,race_pop_american_indian_and_alaska_native_alone,race_pop_asian_alone,race_pop_native_hawaiian_and_other_pacific_islander_alone,race_pop_some_other_race_alone,race_pop_two_or_more_races
0,"Austin County, Texas",29565,7819,18525,2576,48,89,0,52,456
1,"Kenedy County, Texas",595,522,72,0,0,1,0,0,0


In [11]:
# Display the first few rows of data. 
sa.head(2)

Unnamed: 0,Geographic Area Name,sex_age_pop,sex_age_pop_male,sex_age_pop_female,sex_age_pop_under_5,sex_age_pop_5_to_9,sex_age_pop_10_to_14,sex_age_pop_15_to_19,sex_age_pop_20_to_24,sex_age_pop_25_to_34,sex_age_pop_35_to_44,sex_age_pop_45_to_54,sex_age_pop_55_to_59,sex_age_pop_60_to_64,sex_age_pop_65_to_74,sex_age_pop_75_to_84,sex_age_pop_85_and_over,sex_age_median_age_in_years
0,"Austin County, Texas",29565,14684,14881,1780,1960,2118,1861,1712,3339,3275,3821,2327,1978,3243,1532,619,40.7
1,"Kenedy County, Texas",595,286,309,85,37,40,10,10,95,47,75,51,9,85,29,22,39.5


In [12]:
# Display the first few rows of data. 
land.head(2)

Unnamed: 0,Geographic Area Name,sq_mi
0,"Anderson County, Texas",1062.63
1,"Andrews County, Texas",1500.721


In [13]:
# Display the first few rows of data. 
ins.head(2)

Unnamed: 0,Geographic Area Name,health_ins_noninst_pop,health_ins_noninst_pop_cov_yes,health_ins_noninst_pop_private,health_ins_noninst_pop_public,health_ins_noninst_pop_cov_no
0,"Austin County, Texas",29298,25749,20393,8863,3549
1,"Kenedy County, Texas",595,467,212,276,128


In [14]:
# Display the first few rows of data. 
inc.head(2)

Unnamed: 0,Geographic Area Name,inc_hhlds,inc_hhlds_less_than_10_000,inc_hhlds_10_000_to_14_999,inc_hhlds_15_000_to_24_999,inc_hhlds_25_000_to_34_999,inc_hhlds_35_000_to_49_999,inc_hhlds_50_000_to_74_999,inc_hhlds_75_000_to_99_999,inc_hhlds_100_000_to_149_999,inc_hhlds_150_000_to_199_999,inc_hhlds_200_000_or_more,inc_med_hhld_inc_dol,inc_mean_hhld_inc_dol,inc_per_capita_inc_dol,inc_med_earn_male_full_yr_workers_dol,inc_med_earn_female_full_yr_workers_dol
0,"Austin County, Texas",11041,482,459,1255,927,1186,1851,1651,2150,551,529,65365,80769,30858,55417,38603
1,"Kenedy County, Texas",209,25,4,49,13,71,16,23,8,0,0,36125,40908,15820,40848,23295


In [15]:
# Display the first few rows of data. 
obes.head(2)

Unnamed: 0,Geographic Area Name,obes_percent
0,"Anderson County, Texas",0.373
1,"Andrews County, Texas",0.313


In [16]:
# Merge the first four dataframes on Geographic Area Name.
# This will be the dataframe for tracking total numbers.
df_num = race.merge(sa,on='Geographic Area Name').merge(land,on='Geographic Area Name').merge(obes,on='Geographic Area Name')
df_num.head(2)

Unnamed: 0,Geographic Area Name,race_pop,race_pop_hispanic_or_latino_of_any_race,race_pop_white_alone,race_pop_black_or_african_american_alone,race_pop_american_indian_and_alaska_native_alone,race_pop_asian_alone,race_pop_native_hawaiian_and_other_pacific_islander_alone,race_pop_some_other_race_alone,race_pop_two_or_more_races,...,sex_age_pop_35_to_44,sex_age_pop_45_to_54,sex_age_pop_55_to_59,sex_age_pop_60_to_64,sex_age_pop_65_to_74,sex_age_pop_75_to_84,sex_age_pop_85_and_over,sex_age_median_age_in_years,sq_mi,obes_percent
0,"Austin County, Texas",29565,7819,18525,2576,48,89,0,52,456,...,3275,3821,2327,1978,3243,1532,619,40.7,646.492,0.32
1,"Kenedy County, Texas",595,522,72,0,0,1,0,0,0,...,47,75,51,9,85,29,22,39.5,1458.453,0.22


In [17]:
# Merge the remaining dataframes on Geographic Area Name.
df_num = df_num.merge(ins,on='Geographic Area Name').merge(inc,on='Geographic Area Name')
df_num.head(2)

Unnamed: 0,Geographic Area Name,race_pop,race_pop_hispanic_or_latino_of_any_race,race_pop_white_alone,race_pop_black_or_african_american_alone,race_pop_american_indian_and_alaska_native_alone,race_pop_asian_alone,race_pop_native_hawaiian_and_other_pacific_islander_alone,race_pop_some_other_race_alone,race_pop_two_or_more_races,...,inc_hhlds_50_000_to_74_999,inc_hhlds_75_000_to_99_999,inc_hhlds_100_000_to_149_999,inc_hhlds_150_000_to_199_999,inc_hhlds_200_000_or_more,inc_med_hhld_inc_dol,inc_mean_hhld_inc_dol,inc_per_capita_inc_dol,inc_med_earn_male_full_yr_workers_dol,inc_med_earn_female_full_yr_workers_dol
0,"Austin County, Texas",29565,7819,18525,2576,48,89,0,52,456,...,1851,1651,2150,551,529,65365,80769,30858,55417,38603
1,"Kenedy County, Texas",595,522,72,0,0,1,0,0,0,...,16,23,8,0,0,36125,40908,15820,40848,23295
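
`DataFrame.merge` defaults to an inner join, so any county whose name is missing or spelled differently in one table is dropped without warning. The sketch below, on hypothetical miniature tables (`race_demo` and `land_demo` are not part of the notebook), shows one way such a merge could be audited with the `validate` and `indicator` parameters:

```python
import pandas as pd

# Hypothetical miniature versions of two census tables (illustrative values).
race_demo = pd.DataFrame({
    'Geographic Area Name': ['A County, Texas', 'B County, Texas'],
    'race_pop': [1000, 500],
})
land_demo = pd.DataFrame({
    'Geographic Area Name': ['A County, Texas', 'C County, Texas'],
    'sq_mi': [10.0, 20.0],
})

# validate='one_to_one' raises an error if a county appears twice on either side.
# how='outer' with indicator=True keeps unmatched rows and records their origin.
audit = race_demo.merge(
    land_demo,
    on='Geographic Area Name',
    how='outer',
    validate='one_to_one',
    indicator=True,
)

# Rows that the default inner merge would silently drop.
unmatched = audit[audit['_merge'] != 'both']
print(unmatched[['Geographic Area Name', '_merge']])
```

Running an audit like this once per merge, then switching back to the inner join, keeps the pipeline unchanged while making any lost counties visible.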


# Combining COVID and Census Data

In [18]:
# Merge the COVID data with the census data on Geographic Area Name.
df_num = covid.merge(df_num,on='Geographic Area Name')

In [19]:
df_num.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 543 entries, 0 to 542
Data columns (total 54 columns):
 #   Column                                                     Non-Null Count  Dtype  
---  ------                                                     --------------  -----  
 0   Geographic Area Name                                       543 non-null    object 
 1   total_cases                                                543 non-null    float64
 2   total_fatalities                                           543 non-null    float64
 3   death_rate                                                 542 non-null    float64
 4   total_tests                                                543 non-null    int64  
 5   race_pop                                                   543 non-null    int64  
 6   race_pop_hispanic_or_latino_of_any_race                    543 non-null    int64  
 7   race_pop_white_alone                                       543 non-null    int64  
 8   race_pop_b

# Feature Engineering

In [20]:
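# Convert the death rate (a proportion) to deaths per 100 cases.
df_num['deaths_per_100_cases'] = df_num['death_rate'] * 100

In [21]:
# Cases per 100 residents, using race_pop as the total population.
df_num['cases_per_100_people'] = (df_num['total_cases'] / df_num['race_pop']) * 100

In [22]:
# Tests per 100 residents, using race_pop as the total population.
df_num['tests_per_100_people'] = (df_num['total_tests'] / df_num['race_pop']) * 100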
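
The `info()` output above lists 542 non-null values for `death_rate` against 543 rows, so one county carries a missing rate into `deaths_per_100_cases`. One possible cause, not confirmed by the source, is a county with zero recorded cases. The sketch below shows how per-100 rates could be computed so that a zero denominator yields `NaN` explicitly; `per_100` is a hypothetical helper and the numbers are made up:

```python
import numpy as np
import pandas as pd

def per_100(numerator, denominator):
    """Hypothetical helper: rate per 100, with NaN where the denominator is 0."""
    safe_denominator = denominator.replace(0, np.nan)
    return numerator / safe_denominator * 100

# Illustrative values only; the second county has no recorded cases.
demo = pd.DataFrame({
    'total_cases': [200.0, 0.0],
    'total_fatalities': [4.0, 0.0],
    'total_tests': [1000, 50],
    'race_pop': [10000, 500],
})

demo['deaths_per_100_cases'] = per_100(demo['total_fatalities'], demo['total_cases'])
demo['cases_per_100_people'] = per_100(demo['total_cases'], demo['race_pop'])
demo['tests_per_100_people'] = per_100(demo['total_tests'], demo['race_pop'])

# Count the rows where a rate could not be computed.
missing_rates = int(demo['deaths_per_100_cases'].isna().sum())
print(demo, missing_rates)
```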

In [23]:
# Display the columns.
list(df_num.columns)

['Geographic Area Name',
 'total_cases',
 'total_fatalities',
 'death_rate',
 'total_tests',
 'race_pop',
 'race_pop_hispanic_or_latino_of_any_race',
 'race_pop_white_alone',
 'race_pop_black_or_african_american_alone',
 'race_pop_american_indian_and_alaska_native_alone',
 'race_pop_asian_alone',
 'race_pop_native_hawaiian_and_other_pacific_islander_alone',
 'race_pop_some_other_race_alone',
 'race_pop_two_or_more_races',
 'sex_age_pop',
 'sex_age_pop_male',
 'sex_age_pop_female',
 'sex_age_pop_under_5',
 'sex_age_pop_5_to_9',
 'sex_age_pop_10_to_14',
 'sex_age_pop_15_to_19',
 'sex_age_pop_20_to_24',
 'sex_age_pop_25_to_34',
 'sex_age_pop_35_to_44',
 'sex_age_pop_45_to_54',
 'sex_age_pop_55_to_59',
 'sex_age_pop_60_to_64',
 'sex_age_pop_65_to_74',
 'sex_age_pop_75_to_84',
 'sex_age_pop_85_and_over',
 'sex_age_median_age_in_years',
 'sq_mi',
 'obes_percent',
 'health_ins_noninst_pop',
 'health_ins_noninst_pop_cov_yes',
 'health_ins_noninst_pop_private',
 'health_ins_noninst_pop_public',

In [24]:
# Rename the geography column to county_state.
df_num = df_num.rename(columns={'Geographic Area Name': 'county_state'})

In [25]:
# Set the index to the geography.
df_num = df_num.set_index('county_state')

In [26]:
# Display the first few rows of the dataframe.
df_num.head(2)

Unnamed: 0_level_0,total_cases,total_fatalities,death_rate,total_tests,race_pop,race_pop_hispanic_or_latino_of_any_race,race_pop_white_alone,race_pop_black_or_african_american_alone,race_pop_american_indian_and_alaska_native_alone,race_pop_asian_alone,...,inc_hhlds_150_000_to_199_999,inc_hhlds_200_000_or_more,inc_med_hhld_inc_dol,inc_mean_hhld_inc_dol,inc_per_capita_inc_dol,inc_med_earn_male_full_yr_workers_dol,inc_med_earn_female_full_yr_workers_dol,deaths_per_100_cases,cases_per_100_people,tests_per_100_people
county_state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
"Santa Clara County, California",23978.0,388.0,0.016181,839764,1922200,495455,615912,45379,3003,685265,...,83325,160340,116178,154183,52451,90862,64739,1.61815,1.247425,43.68765
"San Mateo County, California",10942.0,159.0,0.014531,285657,765935,189002,303047,16838,1151,212474,...,32301,67087,113776,162639,57375,79347,65524,1.453116,1.428581,37.295201


In [27]:
# Compare the base populations to be used in calculations.
df_num[['race_pop', 'sex_age_pop', 'health_ins_noninst_pop', 'inc_hhlds']]

Unnamed: 0_level_0,race_pop,sex_age_pop,health_ins_noninst_pop,inc_hhlds
county_state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
"Santa Clara County, California",1922200,1922200,1912773,635525
"San Mateo County, California",765935,765935,762101,261969
"Santa Barbara County, California",443738,443738,436711,144962
"Tuolumne County, California",53932,53932,50912,22427
"Sierra County, California",2930,2930,2903,1241
...,...,...,...,...
"Wood County, Texas",43815,43815,42880,16531
"Yoakum County, Texas",8571,8571,8571,2676
"Young County, Texas",18114,18114,17825,7105
"Zapata County, Texas",14369,14369,14334,4405


In [28]:
# Create a new column for county population density which is the result of
# dividing population by square miles
df_num['pop_density'] = df_num['race_pop'] / df_num['sq_mi']
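
The per-capita rates and the density column all use `race_pop` as the total population. In the rows displayed earlier, `race_pop` and `sex_age_pop` agree, while `health_ins_noninst_pop` is equal or slightly smaller. A minimal sketch of how that comparison could be made programmatic, on a hypothetical two-county frame with made-up numbers:

```python
import pandas as pd

# Illustrative values only, indexed the same way as df_num.
demo = pd.DataFrame(
    {
        'race_pop': [1000, 500],
        'sex_age_pop': [1000, 500],
        'health_ins_noninst_pop': [990, 500],
        'sq_mi': [10.0, 25.0],
    },
    index=pd.Index(['A County, Texas', 'B County, Texas'], name='county_state'),
)

# True when the two population totals agree for every county.
totals_match = demo['race_pop'].equals(demo['sex_age_pop'])

# Share of the total population that falls in the insurance universe.
noninst_share = demo['health_ins_noninst_pop'] / demo['race_pop']

# People per square mile, as in the cell above.
demo['pop_density'] = demo['race_pop'] / demo['sq_mi']
print(totals_match, noninst_share.tolist(), demo['pop_density'].tolist())
```

If `totals_match` were ever false, the choice of denominator for the per-100 columns would need to be stated explicitly.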

In [29]:
df_num.head(2)

Unnamed: 0_level_0,total_cases,total_fatalities,death_rate,total_tests,race_pop,race_pop_hispanic_or_latino_of_any_race,race_pop_white_alone,race_pop_black_or_african_american_alone,race_pop_american_indian_and_alaska_native_alone,race_pop_asian_alone,...,inc_hhlds_200_000_or_more,inc_med_hhld_inc_dol,inc_mean_hhld_inc_dol,inc_per_capita_inc_dol,inc_med_earn_male_full_yr_workers_dol,inc_med_earn_female_full_yr_workers_dol,deaths_per_100_cases,cases_per_100_people,tests_per_100_people,pop_density
county_state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
"Santa Clara County, California",23978.0,388.0,0.016181,839764,1922200,495455,615912,45379,3003,685265,...,160340,116178,154183,52451,90862,64739,1.61815,1.247425,43.68765,1488.824137
"San Mateo County, California",10942.0,159.0,0.014531,285657,765935,189002,303047,16838,1151,212474,...,67087,113776,162639,57375,79347,65524,1.453116,1.428581,37.295201,1707.25265


In [30]:
# Export the data.
df_num.to_csv('../data/cleaned_numbers_five_states.csv')

# Dataframe to Hold Both Total Numbers and Percent Values

In [31]:
# Make a copy of the dataframe that will hold total numbers AND percentages.
df = df_num.copy()

In [32]:
for column in df.columns:
    print(column)

total_cases
total_fatalities
death_rate
total_tests
race_pop
race_pop_hispanic_or_latino_of_any_race
race_pop_white_alone
race_pop_black_or_african_american_alone
race_pop_american_indian_and_alaska_native_alone
race_pop_asian_alone
race_pop_native_hawaiian_and_other_pacific_islander_alone
race_pop_some_other_race_alone
race_pop_two_or_more_races
sex_age_pop
sex_age_pop_male
sex_age_pop_female
sex_age_pop_under_5
sex_age_pop_5_to_9
sex_age_pop_10_to_14
sex_age_pop_15_to_19
sex_age_pop_20_to_24
sex_age_pop_25_to_34
sex_age_pop_35_to_44
sex_age_pop_45_to_54
sex_age_pop_55_to_59
sex_age_pop_60_to_64
sex_age_pop_65_to_74
sex_age_pop_75_to_84
sex_age_pop_85_and_over
sex_age_median_age_in_years
sq_mi
obes_percent
health_ins_noninst_pop
health_ins_noninst_pop_cov_yes
health_ins_noninst_pop_private
health_ins_noninst_pop_public
health_ins_noninst_pop_cov_no
inc_hhlds
inc_hhlds_less_than_10_000
inc_hhlds_10_000_to_14_999
inc_hhlds_15_000_to_24_999
inc_hhlds_25_000_to_34_999
inc_hhlds_35_000_to_

# Dataframe to Hold Percent Values

In [33]:
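# Define a function to create new columns with percentages.
def to_percentage(dataframe):
    """Add a 'percent_' column for each count column, dividing by its group total.

    Values are stored as proportions between 0 and 1. The dataframe is
    modified in place and nothing is returned.
    """
    for column in dataframe.columns:
        # Race counts are divided by the total race population.
        if column.startswith('race_pop_'):
            dataframe['percent_' + column] = dataframe[column] / dataframe['race_pop']

        # Sex and age counts are divided by the total sex/age population.
        elif column.startswith('sex_age_pop_'):
            dataframe['percent_' + column] = dataframe[column] / dataframe['sex_age_pop']

        # Only the coverage yes/no counts match; private and public stay as counts.
        elif column.startswith('health_ins_noninst_pop_cov'):
            dataframe['percent_' + column] = dataframe[column] / dataframe['health_ins_noninst_pop']

        # Household income brackets are divided by total households.
        elif column.startswith('inc_hhlds_'):
            dataframe['percent_' + column] = dataframe[column] / dataframe['inc_hhlds']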
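
The function above adds columns to the dataframe it receives and returns nothing. As an alternative sketch, not the notebook's method, the same prefix rules could be held in a mapping and applied to produce a new dataframe, leaving the input untouched. `BASES` and `with_percentages` are hypothetical names, and the demo values are made up:

```python
import pandas as pd

# Prefix of the count columns -> column holding the matching total.
# This mapping mirrors the four branches of to_percentage.
BASES = {
    'race_pop_': 'race_pop',
    'sex_age_pop_': 'sex_age_pop',
    'health_ins_noninst_pop_cov': 'health_ins_noninst_pop',
    'inc_hhlds_': 'inc_hhlds',
}

def with_percentages(dataframe):
    """Hypothetical variant: return a copy with percent_* columns appended."""
    new_columns = {}
    for column in dataframe.columns:
        for prefix, base in BASES.items():
            if column.startswith(prefix):
                new_columns['percent_' + column] = dataframe[column] / dataframe[base]
                break
    # assign returns a new dataframe; the input is not modified.
    return dataframe.assign(**new_columns)

demo = pd.DataFrame({
    'race_pop': [1000, 500],
    'race_pop_asian_alone': [100, 50],
    'inc_hhlds': [400, 200],
    'inc_hhlds_200_000_or_more': [40, 10],
})
result = with_percentages(demo)
print(list(result.columns))
```

Returning a new frame makes the separate `df_num.copy()` step unnecessary, at the cost of one extra assignment at the call site.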

In [34]:
# Apply the function to the copy of the numbers dataframe.
to_percentage(df)

In [35]:
# Display the first few rows of the dataframe. 
df.head(2)

Unnamed: 0_level_0,total_cases,total_fatalities,death_rate,total_tests,race_pop,race_pop_hispanic_or_latino_of_any_race,race_pop_white_alone,race_pop_black_or_african_american_alone,race_pop_american_indian_and_alaska_native_alone,race_pop_asian_alone,...,percent_inc_hhlds_less_than_10_000,percent_inc_hhlds_10_000_to_14_999,percent_inc_hhlds_15_000_to_24_999,percent_inc_hhlds_25_000_to_34_999,percent_inc_hhlds_35_000_to_49_999,percent_inc_hhlds_50_000_to_74_999,percent_inc_hhlds_75_000_to_99_999,percent_inc_hhlds_100_000_to_149_999,percent_inc_hhlds_150_000_to_199_999,percent_inc_hhlds_200_000_or_more
county_state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
"Santa Clara County, California",23978.0,388.0,0.016181,839764,1922200,495455,615912,45379,3003,685265,...,0.032184,0.027568,0.045773,0.046066,0.069604,0.112076,0.10289,0.180432,0.131112,0.252295
"San Mateo County, California",10942.0,159.0,0.014531,285657,765935,189002,303047,16838,1151,212474,...,0.029114,0.023087,0.044536,0.047429,0.069649,0.115617,0.110609,0.180571,0.123301,0.256088


In [36]:
# Extract the columns whose names contain 'percent' (this also matches
# obes_percent) and save them to a new dataframe.
df_percent = df.filter(regex='percent', axis=1)
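
`DataFrame.filter(regex=...)` keeps any column whose name contains a match, which is why `obes_percent` ends up alongside the `percent_` columns. The sketch below contrasts the unanchored pattern with an anchored one on a hypothetical three-column frame (illustrative values only):

```python
import pandas as pd

demo = pd.DataFrame({
    'obes_percent': [0.3],
    'percent_race_pop_asian_alone': [0.1],
    'race_pop': [1000],
})

# Unanchored pattern: matches 'percent' anywhere in the column name.
anywhere = demo.filter(regex='percent', axis=1)

# Anchored pattern: only columns created with the 'percent_' prefix.
prefixed_only = demo.filter(regex='^percent_', axis=1)

print(list(anywhere.columns), list(prefixed_only.columns))
```

Here the unanchored pattern is the convenient choice, since `obes_percent` is also wanted in the percent dataframe.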

In [37]:
# Display the first few rows of the dataframe.
df_percent.head(2)

Unnamed: 0_level_0,obes_percent,percent_race_pop_hispanic_or_latino_of_any_race,percent_race_pop_white_alone,percent_race_pop_black_or_african_american_alone,percent_race_pop_american_indian_and_alaska_native_alone,percent_race_pop_asian_alone,percent_race_pop_native_hawaiian_and_other_pacific_islander_alone,percent_race_pop_some_other_race_alone,percent_race_pop_two_or_more_races,percent_sex_age_pop_male,...,percent_inc_hhlds_less_than_10_000,percent_inc_hhlds_10_000_to_14_999,percent_inc_hhlds_15_000_to_24_999,percent_inc_hhlds_25_000_to_34_999,percent_inc_hhlds_35_000_to_49_999,percent_inc_hhlds_50_000_to_74_999,percent_inc_hhlds_75_000_to_99_999,percent_inc_hhlds_100_000_to_149_999,percent_inc_hhlds_150_000_to_199_999,percent_inc_hhlds_200_000_or_more
county_state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
"Santa Clara County, California",0.081,0.257754,0.32042,0.023608,0.001562,0.3565,0.003242,0.002467,0.034446,0.504874,...,0.032184,0.027568,0.045773,0.046066,0.069604,0.112076,0.10289,0.180432,0.131112,0.252295
"San Mateo County, California",0.071,0.24676,0.395656,0.021984,0.001503,0.277405,0.013411,0.003233,0.040049,0.49332,...,0.029114,0.023087,0.044536,0.047429,0.069649,0.115617,0.110609,0.180571,0.123301,0.256088


## Combine percent dataframe with other key features

In [38]:
# Other metrics from the original dataframe to carry over.
# These columns were not converted in the percentage step: they are rates,
# dollar amounts, or (for private/public insurance) counts left as-is.
df_temp = df[[
    'total_cases',
    'total_fatalities',
    'death_rate',
    'total_tests',
    'sex_age_median_age_in_years', 
    'health_ins_noninst_pop_private',
    'health_ins_noninst_pop_public',
    'inc_med_hhld_inc_dol',
    'inc_mean_hhld_inc_dol',
    'inc_per_capita_inc_dol',
    'inc_med_earn_male_full_yr_workers_dol',
    'inc_med_earn_female_full_yr_workers_dol',
    'deaths_per_100_cases',
    'cases_per_100_people',
    'tests_per_100_people',
    'pop_density'
]]

In [39]:
# Concatenate the two dataframes to get a complete feature set.
df_percent = pd.concat([df_temp, df_percent], axis=1)
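
With `axis=1`, `pd.concat` lines rows up by index label rather than by position. Both frames here come from `df`, so their `county_state` indexes already match. The sketch below, on hypothetical frames with made-up values, shows that alignment still holds when the row order differs:

```python
import pandas as pd

index = pd.Index(['A County, Texas', 'B County, Texas'], name='county_state')
left = pd.DataFrame({'total_cases': [10.0, 20.0]}, index=index)

# Same counties, deliberately in reverse order.
right = pd.DataFrame({'obes_percent': [0.3, 0.2]}, index=index[::-1])

# axis=1 aligns rows by index label, not by position.
combined = pd.concat([left, right], axis=1)

print(combined.loc['A County, Texas'])
```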

In [42]:
# Display the dataframe.
df_percent.head(2)

Unnamed: 0_level_0,total_cases,total_fatalities,death_rate,total_tests,sex_age_median_age_in_years,health_ins_noninst_pop_private,health_ins_noninst_pop_public,inc_med_hhld_inc_dol,inc_mean_hhld_inc_dol,inc_per_capita_inc_dol,...,percent_inc_hhlds_less_than_10_000,percent_inc_hhlds_10_000_to_14_999,percent_inc_hhlds_15_000_to_24_999,percent_inc_hhlds_25_000_to_34_999,percent_inc_hhlds_35_000_to_49_999,percent_inc_hhlds_50_000_to_74_999,percent_inc_hhlds_75_000_to_99_999,percent_inc_hhlds_100_000_to_149_999,percent_inc_hhlds_150_000_to_199_999,percent_inc_hhlds_200_000_or_more
county_state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
"Santa Clara County, California",23978.0,388.0,0.016181,839764,37.0,1466921,513162,116178,154183,52451,...,0.032184,0.027568,0.045773,0.046066,0.069604,0.112076,0.10289,0.180432,0.131112,0.252295
"San Mateo County, California",10942.0,159.0,0.014531,285657,39.6,597415,211256,113776,162639,57375,...,0.029114,0.023087,0.044536,0.047429,0.069649,0.115617,0.110609,0.180571,0.123301,0.256088


In [43]:
# Export the data.
df_percent.to_csv('../data/cleaned_percent_five_states.csv')
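
Both exports write `county_state` as the first CSV column because it is the dataframe index. A minimal sketch of reading such a file back with the index restored, using an in-memory buffer and made-up values instead of the real file:

```python
import io
import pandas as pd

demo = pd.DataFrame(
    {'total_cases': [10.0, 20.0]},
    index=pd.Index(['A County, Texas', 'B County, Texas'], name='county_state'),
)

# Write to an in-memory buffer instead of a file, purely for illustration.
buffer = io.StringIO()
demo.to_csv(buffer)
buffer.seek(0)

# index_col restores county_state as the index instead of a regular column.
restored = pd.read_csv(buffer, index_col='county_state')

print(restored.index.name, list(restored.columns))
```

Without `index_col`, the county names would come back as an ordinary column and a new integer index would be created.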