## Contents
- [Data Dictionary](#Data-Dictionary)
- [Importing and Combining Total Number Data](#Importing-and-Combining-Total-Number-Data)  
- [Feature Engineering](#Feature-Engineering)
- [Dataframe to Hold Both Total Numbers and Percent Values](#Dataframe-to-Hold-Both-Total-Numbers-and-Percent-Values)
- [Dataframe to Hold Percent Values](#Dataframe-to-Hold-Percent-Values)

# Data Dictionary

|Column Label|Type|Description|
|---|---|---|
health_ins_noninst_pop|int|(Discrete): Total population civilian noninstitutionalized used to calculate yes/no coverage percentages
health_ins_noninst_pop_cov_no|int|(Discrete): Population civilian noninstitutionalized without health insurance coverage
health_ins_noninst_pop_cov_yes|int|(Discrete): Population civilian noninstitutionalized with health insurance coverage
health_ins_noninst_pop_private|int|(Discrete): Population civilian noninstitutionalized with private health insurance
health_ins_noninst_pop_public|int|(Discrete): Population civilian noninstitutionalized with public coverage
inc_hhlds|int|(Discrete): Total households used to calculate household income percentages
inc_hhlds_less_than_10_000|int|(Discrete): Households less than \$10,000 income
inc_hhlds_10_000_to_14_999|int|(Discrete): Households \$10,000 to \$14,999 income
inc_hhlds_15_000_to_24_999|int|(Discrete): Households \$15,000 to \$24,999 income
inc_hhlds_25_000_to_34_999|int|(Discrete): Households \$25,000 to \$34,999 income
inc_hhlds_35_000_to_49_999|int|(Discrete): Households \$35,000 to \$49,999 income
inc_hhlds_50_000_to_74_999|int|(Discrete): Households \$50,000 to \$74,999 income
inc_hhlds_75_000_to_99_999|int|(Discrete): Households \$75,000 to \$99,999 income
inc_hhlds_100_000_to_149_999|int|(Discrete): Households \$100,000 to \$149,999 income
inc_hhlds_150_000_to_199_999|int|(Discrete): Households \$150,000 to \$199,999 income
inc_hhlds_200_000_or_more|int|(Discrete): Households \$200,000+ income
inc_mean_hhld_inc_dol|int|(Discrete): Mean household income (dollars)
inc_med_hhld_inc_dol|int|(Discrete): Median household income (dollars)
inc_med_earn_female_full_yr_workers_dol|int|(Discrete): Median earnings for female full-time, year-round workers (dollars)
inc_med_earn_male_full_yr_workers_dol|int|(Discrete): Median earnings for male full-time, year-round workers (dollars)
inc_med_earn_workers_dol|int|(Discrete): Median earnings for workers (dollars)
inc_per_capita_inc_dol|int|(Discrete): Per capita income (dollars)
pop_density|float|(Continuous): People per square mile
race_pop|int|(Discrete): Total population used to calculate race demographic percentages
race_pop_american_indian_and_alaska_native_alone|int|(Discrete): Population American Indian and Alaska Native 
race_pop_asian_alone|int|(Discrete): Population Asian
race_pop_black_or_african_american_alone|int|(Discrete): Population Black or African American
race_pop_hispanic_or_latino_of_any_race|int|(Discrete): Population Hispanic or Latino of any race
race_pop_native_hawaiian_and_other_pacific_islander_alone|int|(Discrete): Population Native Hawaiian or other Pacific Islander
race_pop_some_other_race_alone|int|(Discrete): Population some other race
race_pop_two_or_more_races|int|(Discrete): Population two or more races
race_pop_white_alone|int|(Discrete): Population White
sex_age_median_age_in_years|float|(Continuous): Median age (years)
sex_age_pop|int|(Discrete): Total population used to calculate sex/age demographic percentages
sex_age_pop_under_5|int|(Discrete): Population under 5
sex_age_pop_5_to_9|int|(Discrete): Population 5-9
sex_age_pop_10_to_14|int|(Discrete): Population 10-14
sex_age_pop_15_to_19|int|(Discrete): Population 15-19
sex_age_pop_20_to_24|int|(Discrete): Population 20-24
sex_age_pop_25_to_34|int|(Discrete): Population 25-34
sex_age_pop_35_to_44|int|(Discrete): Population 35-44
sex_age_pop_45_to_54|int|(Discrete): Population 45-54
sex_age_pop_55_to_59|int|(Discrete): Population 55-59
sex_age_pop_60_to_64|int|(Discrete): Population 60-64
sex_age_pop_65_to_74|int|(Discrete): Population 65-74
sex_age_pop_75_to_84|int|(Discrete): Population 75-84
sex_age_pop_85_and_over|int|(Discrete): Population 85+
sex_age_pop_female|int|(Discrete): Population female
sex_age_pop_male|int|(Discrete): Population male
sq_mi|float|(Continuous): Square miles

# Importing and Combining Total Number Data

In [1]:
import pandas as pd

In [2]:
# Import the relevant dataframes.
race = pd.read_csv('../data/preprocessing/cleaned_tx_dp05_race.csv')
sa = pd.read_csv('../data/preprocessing/cleaned_tx_dp05_sex_age.csv')
land = pd.read_csv('../data/preprocessing/cleaned_tx_area.csv')
ins = pd.read_csv('../data/preprocessing/cleaned_tx_dp03_insurance.csv')
inc = pd.read_csv('../data/preprocessing/cleaned_tx_dp03_income.csv')

In [3]:
# Display the first few rows of data. 
race.head(2)

Unnamed: 0,Geographic Area Name,race_pop,race_pop_hispanic_or_latino_of_any_race,race_pop_white_alone,race_pop_black_or_african_american_alone,race_pop_american_indian_and_alaska_native_alone,race_pop_asian_alone,race_pop_native_hawaiian_and_other_pacific_islander_alone,race_pop_some_other_race_alone,race_pop_two_or_more_races
0,"Austin County, Texas",29565,7819,18525,2576,48,89,0,52,456
1,"Kenedy County, Texas",595,522,72,0,0,1,0,0,0


In [4]:
# Display the first few rows of data. 
sa.head(2)

Unnamed: 0,Geographic Area Name,sex_age_pop,sex_age_pop_male,sex_age_pop_female,sex_age_pop_under_5,sex_age_pop_5_to_9,sex_age_pop_10_to_14,sex_age_pop_15_to_19,sex_age_pop_20_to_24,sex_age_pop_25_to_34,sex_age_pop_35_to_44,sex_age_pop_45_to_54,sex_age_pop_55_to_59,sex_age_pop_60_to_64,sex_age_pop_65_to_74,sex_age_pop_75_to_84,sex_age_pop_85_and_over,sex_age_median_age_in_years
0,"Austin County, Texas",29565,14684,14881,1780,1960,2118,1861,1712,3339,3275,3821,2327,1978,3243,1532,619,40.7
1,"Kenedy County, Texas",595,286,309,85,37,40,10,10,95,47,75,51,9,85,29,22,39.5


In [5]:
# Display the first few rows of data. 
land.head(2)

Unnamed: 0,Geographic Area Name,sq_mi
0,"Anderson County, Texas",1062.63
1,"Andrews County, Texas",1500.721


In [6]:
# Display the first few rows of data. 
ins.head(2)

Unnamed: 0,Geographic Area Name,health_ins_noninst_pop,health_ins_noninst_pop_cov_yes,health_ins_noninst_pop_private,health_ins_noninst_pop_public,health_ins_noninst_pop_cov_no
0,"Austin County, Texas",29298,25749,20393,8863,3549
1,"Kenedy County, Texas",595,467,212,276,128


In [7]:
# Display the first few rows of data. 
inc.head(2)

Unnamed: 0,Geographic Area Name,inc_hhlds,inc_hhlds_less_than_10_000,inc_hhlds_10_000_to_14_999,inc_hhlds_15_000_to_24_999,inc_hhlds_25_000_to_34_999,inc_hhlds_35_000_to_49_999,inc_hhlds_50_000_to_74_999,inc_hhlds_75_000_to_99_999,inc_hhlds_100_000_to_149_999,inc_hhlds_150_000_to_199_999,inc_hhlds_200_000_or_more,inc_med_hhld_inc_dol,inc_mean_hhld_inc_dol,inc_per_capita_inc_dol,inc_med_earn_workers_dol,inc_med_earn_male_full_yr_workers_dol,inc_med_earn_female_full_yr_workers_dol
0,"Austin County, Texas",11041,482,459,1255,927,1186,1851,1651,2150,551,529,65365,80769,30858,33993,55417,38603
1,"Kenedy County, Texas",209,25,4,49,13,71,16,23,8,0,0,36125,40908,15820,29453,40848,23295


In [8]:
# Merge the first three dataframes on Geographic Area Name.
# This will be the dataframe for tracking total numbers.
df_num = race.merge(sa,on='Geographic Area Name').merge(land,on='Geographic Area Name')
df_num.head()

Unnamed: 0,Geographic Area Name,race_pop,race_pop_hispanic_or_latino_of_any_race,race_pop_white_alone,race_pop_black_or_african_american_alone,race_pop_american_indian_and_alaska_native_alone,race_pop_asian_alone,race_pop_native_hawaiian_and_other_pacific_islander_alone,race_pop_some_other_race_alone,race_pop_two_or_more_races,...,sex_age_pop_25_to_34,sex_age_pop_35_to_44,sex_age_pop_45_to_54,sex_age_pop_55_to_59,sex_age_pop_60_to_64,sex_age_pop_65_to_74,sex_age_pop_75_to_84,sex_age_pop_85_and_over,sex_age_median_age_in_years,sq_mi
0,"Austin County, Texas",29565,7819,18525,2576,48,89,0,52,456,...,3339,3275,3821,2327,1978,3243,1532,619,40.7,646.492
1,"Kenedy County, Texas",595,522,72,0,0,1,0,0,0,...,95,47,75,51,9,85,29,22,39.5,1458.453
2,"Nueces County, Texas",360486,228462,107652,13071,919,7134,242,226,2780,...,52547,45030,43503,22563,21051,28881,15165,5299,35.3,838.316
3,"Colorado County, Texas",21022,6200,11855,2655,27,7,0,0,278,...,2054,2233,2440,1280,1866,2467,1356,640,42.5,960.284
4,"San Patricio County, Texas",67046,38483,26032,1003,101,671,30,7,719,...,8923,8328,8078,4417,3367,5759,2935,929,35.3,693.436
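
The merges above rely on `Geographic Area Name` matching exactly, once per county, in every table. A minimal sketch of how that assumption could be checked is shown below. The toy frames `race_demo` and `land_demo` are hypothetical stand-ins built from values in the outputs above, and the `how`, `validate` and `indicator` arguments are an optional addition, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy tables; values are copied from the outputs shown above.
race_demo = pd.DataFrame({
    'Geographic Area Name': ['Austin County, Texas', 'Kenedy County, Texas'],
    'race_pop': [29565, 595],
})
land_demo = pd.DataFrame({
    'Geographic Area Name': ['Austin County, Texas', 'Kenedy County, Texas'],
    'sq_mi': [646.492, 1458.453],
})

# validate='one_to_one' raises an error if a county name repeats on either side.
# how='outer' with indicator=True labels rows that appear in only one table.
merged_demo = race_demo.merge(
    land_demo,
    on='Geographic Area Name',
    how='outer',
    validate='one_to_one',
    indicator=True,
)

# Rows that failed to match would show 'left_only' or 'right_only' here.
unmatched = merged_demo[merged_demo['_merge'] != 'both']
print(len(merged_demo), len(unmatched))
```

An empty `unmatched` frame suggests every county was found in both tables; the default inner merge would silently drop any county that was not.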


In [9]:
# Merge the remaining dataframes on Geographic Area Name.
df_num = df_num.merge(ins,on='Geographic Area Name').merge(inc,on='Geographic Area Name')
df_num.head()

Unnamed: 0,Geographic Area Name,race_pop,race_pop_hispanic_or_latino_of_any_race,race_pop_white_alone,race_pop_black_or_african_american_alone,race_pop_american_indian_and_alaska_native_alone,race_pop_asian_alone,race_pop_native_hawaiian_and_other_pacific_islander_alone,race_pop_some_other_race_alone,race_pop_two_or_more_races,...,inc_hhlds_75_000_to_99_999,inc_hhlds_100_000_to_149_999,inc_hhlds_150_000_to_199_999,inc_hhlds_200_000_or_more,inc_med_hhld_inc_dol,inc_mean_hhld_inc_dol,inc_per_capita_inc_dol,inc_med_earn_workers_dol,inc_med_earn_male_full_yr_workers_dol,inc_med_earn_female_full_yr_workers_dol
0,"Austin County, Texas",29565,7819,18525,2576,48,89,0,52,456,...,1651,2150,551,529,65365,80769,30858,33993,55417,38603
1,"Kenedy County, Texas",595,522,72,0,0,1,0,0,0,...,23,8,0,0,36125,40908,15820,29453,40848,23295
2,"Nueces County, Texas",360486,228462,107652,13071,919,7134,242,226,2780,...,16265,18198,6452,5780,55048,74820,27649,30869,48043,34488
3,"Colorado County, Texas",21022,6200,11855,2655,27,7,0,0,278,...,935,678,428,329,49504,70324,27861,30469,41853,34692
4,"San Patricio County, Texas",67046,38483,26032,1003,101,671,30,7,719,...,2928,3112,1483,767,55229,71588,25281,31318,50594,31847


In [10]:
# Display the columns.
list(df_num.columns)

['Geographic Area Name',
 'race_pop',
 'race_pop_hispanic_or_latino_of_any_race',
 'race_pop_white_alone',
 'race_pop_black_or_african_american_alone',
 'race_pop_american_indian_and_alaska_native_alone',
 'race_pop_asian_alone',
 'race_pop_native_hawaiian_and_other_pacific_islander_alone',
 'race_pop_some_other_race_alone',
 'race_pop_two_or_more_races',
 'sex_age_pop',
 'sex_age_pop_male',
 'sex_age_pop_female',
 'sex_age_pop_under_5',
 'sex_age_pop_5_to_9',
 'sex_age_pop_10_to_14',
 'sex_age_pop_15_to_19',
 'sex_age_pop_20_to_24',
 'sex_age_pop_25_to_34',
 'sex_age_pop_35_to_44',
 'sex_age_pop_45_to_54',
 'sex_age_pop_55_to_59',
 'sex_age_pop_60_to_64',
 'sex_age_pop_65_to_74',
 'sex_age_pop_75_to_84',
 'sex_age_pop_85_and_over',
 'sex_age_median_age_in_years',
 'sq_mi',
 'health_ins_noninst_pop',
 'health_ins_noninst_pop_cov_yes',
 'health_ins_noninst_pop_private',
 'health_ins_noninst_pop_public',
 'health_ins_noninst_pop_cov_no',
 'inc_hhlds',
 'inc_hhlds_less_than_10_000',
 'inc

In [11]:
# Rename the geography column to a shorter, code-friendly label.
df_num = df_num.rename(columns={'Geographic Area Name': 'county_state'})

In [12]:
# Set the index to the geography.
df_num = df_num.set_index('county_state')

In [13]:
# Display the first few rows of the dataframe.
df_num.head(3)

county_state,race_pop,race_pop_hispanic_or_latino_of_any_race,race_pop_white_alone,race_pop_black_or_african_american_alone,race_pop_american_indian_and_alaska_native_alone,race_pop_asian_alone,race_pop_native_hawaiian_and_other_pacific_islander_alone,race_pop_some_other_race_alone,race_pop_two_or_more_races,sex_age_pop,...,inc_hhlds_75_000_to_99_999,inc_hhlds_100_000_to_149_999,inc_hhlds_150_000_to_199_999,inc_hhlds_200_000_or_more,inc_med_hhld_inc_dol,inc_mean_hhld_inc_dol,inc_per_capita_inc_dol,inc_med_earn_workers_dol,inc_med_earn_male_full_yr_workers_dol,inc_med_earn_female_full_yr_workers_dol
"Austin County, Texas",29565,7819,18525,2576,48,89,0,52,456,29565,...,1651,2150,551,529,65365,80769,30858,33993,55417,38603
"Kenedy County, Texas",595,522,72,0,0,1,0,0,0,595,...,23,8,0,0,36125,40908,15820,29453,40848,23295
"Nueces County, Texas",360486,228462,107652,13071,919,7134,242,226,2780,360486,...,16265,18198,6452,5780,55048,74820,27649,30869,48043,34488


# Feature Engineering

In [14]:
# Compare the base populations to be used in calculations.
df_num[['race_pop', 'sex_age_pop', 'health_ins_noninst_pop', 'inc_hhlds']]

county_state,race_pop,sex_age_pop,health_ins_noninst_pop,inc_hhlds
"Austin County, Texas",29565,29565,29298,11041
"Kenedy County, Texas",595,595,595,209
"Nueces County, Texas",360486,360486,355767,128926
"Colorado County, Texas",21022,21022,20703,7511
"San Patricio County, Texas",67046,67046,66274,23121
...,...,...,...,...
"McCulloch County, Texas",8098,8098,7960,3255
"Lee County, Texas",16952,16952,16491,6104
"Ellis County, Texas",168838,168838,167585,55840
"Kerr County, Texas",51365,51365,50494,20766

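The comparison above can also be made programmatically rather than by eye. The sketch below uses a hypothetical toy frame, `base_demo`, holding three of the displayed rows; it only shows the pattern and says nothing about counties not listed here.

```python
import pandas as pd

# Hypothetical toy frame; values are copied from the first rows of the output above.
base_demo = pd.DataFrame(
    {
        'race_pop': [29565, 595, 360486],
        'sex_age_pop': [29565, 595, 360486],
        'health_ins_noninst_pop': [29298, 595, 355767],
    },
    index=['Austin County, Texas', 'Kenedy County, Texas', 'Nueces County, Texas'],
)

# True when the race and sex/age base populations agree in every row.
same_base = bool((base_demo['race_pop'] == base_demo['sex_age_pop']).all())

# Residents excluded from the civilian noninstitutionalized base, per county.
gap = base_demo['race_pop'] - base_demo['health_ins_noninst_pop']

print(same_base)
print(gap)
```

In these rows the race and sex/age bases are identical, while the health insurance base is equal or smaller, which is why each group of counts needs its own denominator.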

In [15]:
# Create a new column for county population density which is the result of
# dividing population by square miles
df_num['pop_density'] = df_num['race_pop'] / df_num['sq_mi']
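
As a worked check of the density calculation, the sketch below repeats the division for two counties using the population and area values shown in the earlier outputs. The frame `density_demo` is a hypothetical stand-in for `df_num`.

```python
import pandas as pd

# Hypothetical toy frame; populations and areas are copied from the outputs above.
density_demo = pd.DataFrame(
    {'race_pop': [29565, 595], 'sq_mi': [646.492, 1458.453]},
    index=['Austin County, Texas', 'Kenedy County, Texas'],
)

# People per square mile: total population divided by area.
density_demo['pop_density'] = density_demo['race_pop'] / density_demo['sq_mi']

print(density_demo['pop_density'].round(6))
```

Austin County works out to roughly 45.7 people per square mile and Kenedy County to roughly 0.41.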

In [16]:
df_num.head()

county_state,race_pop,race_pop_hispanic_or_latino_of_any_race,race_pop_white_alone,race_pop_black_or_african_american_alone,race_pop_american_indian_and_alaska_native_alone,race_pop_asian_alone,race_pop_native_hawaiian_and_other_pacific_islander_alone,race_pop_some_other_race_alone,race_pop_two_or_more_races,sex_age_pop,...,inc_hhlds_100_000_to_149_999,inc_hhlds_150_000_to_199_999,inc_hhlds_200_000_or_more,inc_med_hhld_inc_dol,inc_mean_hhld_inc_dol,inc_per_capita_inc_dol,inc_med_earn_workers_dol,inc_med_earn_male_full_yr_workers_dol,inc_med_earn_female_full_yr_workers_dol,pop_density
"Austin County, Texas",29565,7819,18525,2576,48,89,0,52,456,29565,...,2150,551,529,65365,80769,30858,33993,55417,38603,45.731424
"Kenedy County, Texas",595,522,72,0,0,1,0,0,0,595,...,8,0,0,36125,40908,15820,29453,40848,23295,0.407967
"Nueces County, Texas",360486,228462,107652,13071,919,7134,242,226,2780,360486,...,18198,6452,5780,55048,74820,27649,30869,48043,34488,430.012072
"Colorado County, Texas",21022,6200,11855,2655,27,7,0,0,278,21022,...,678,428,329,49504,70324,27861,30469,41853,34692,21.89144
"San Patricio County, Texas",67046,38483,26032,1003,101,671,30,7,719,67046,...,3112,1483,767,55229,71588,25281,31318,50594,31847,96.686644


In [17]:
# Export the data.
df_num.to_csv('../data/tx_cleaned_numbers.csv')
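
Because `county_state` is the index, `to_csv` writes it as the first column of the file. A minimal sketch of reading such a file back with the index restored follows; it uses an in-memory buffer in place of the real path, and `export_demo` is a hypothetical toy frame.

```python
import io

import pandas as pd

# Hypothetical toy frame with the same index name as df_num.
export_demo = pd.DataFrame(
    {'race_pop': [29565, 595]},
    index=pd.Index(['Austin County, Texas', 'Kenedy County, Texas'], name='county_state'),
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
export_demo.to_csv(buffer)
buffer.seek(0)

# index_col restores county_state as the index instead of a regular column.
restored = pd.read_csv(buffer, index_col='county_state')
print(restored)
```

Passing `index_col` when reading is one way to avoid the index coming back as an extra data column.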

# Dataframe to Hold Both Total Numbers and Percent Values

In [18]:
# Make a copy of the dataframe that will hold total numbers AND percentages.
df = df_num.copy()

In [19]:
for column in df.columns:
    print(column)

race_pop
race_pop_hispanic_or_latino_of_any_race
race_pop_white_alone
race_pop_black_or_african_american_alone
race_pop_american_indian_and_alaska_native_alone
race_pop_asian_alone
race_pop_native_hawaiian_and_other_pacific_islander_alone
race_pop_some_other_race_alone
race_pop_two_or_more_races
sex_age_pop
sex_age_pop_male
sex_age_pop_female
sex_age_pop_under_5
sex_age_pop_5_to_9
sex_age_pop_10_to_14
sex_age_pop_15_to_19
sex_age_pop_20_to_24
sex_age_pop_25_to_34
sex_age_pop_35_to_44
sex_age_pop_45_to_54
sex_age_pop_55_to_59
sex_age_pop_60_to_64
sex_age_pop_65_to_74
sex_age_pop_75_to_84
sex_age_pop_85_and_over
sex_age_median_age_in_years
sq_mi
health_ins_noninst_pop
health_ins_noninst_pop_cov_yes
health_ins_noninst_pop_private
health_ins_noninst_pop_public
health_ins_noninst_pop_cov_no
inc_hhlds
inc_hhlds_less_than_10_000
inc_hhlds_10_000_to_14_999
inc_hhlds_15_000_to_24_999
inc_hhlds_25_000_to_34_999
inc_hhlds_35_000_to_49_999
inc_hhlds_50_000_to_74_999
inc_hhlds_75_000_to_99_999
inc_

# Dataframe to Hold Percent Values

In [20]:
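# Define a function that adds a 'percent_' column for each count column.
# Values are stored as fractions (0 to 1) of the matching base population.
# The dataframe is modified in place, so nothing is returned.
def to_percentage(dataframe):

    # Iterate over a snapshot of the columns, since new ones are added in the loop.
    for column in list(dataframe.columns):
        if column.startswith('race_pop_'):
            dataframe['percent_' + column] = dataframe[column] / dataframe['race_pop']

        elif column.startswith('sex_age_pop_'):
            # Use the sex/age base population rather than the race base.
            dataframe['percent_' + column] = dataframe[column] / dataframe['sex_age_pop']

        elif column.startswith('health_ins_noninst_pop_cov'):
            # Only the yes/no coverage counts match this prefix.
            dataframe['percent_' + column] = dataframe[column] / dataframe['health_ins_noninst_pop']

        elif column.startswith('inc_hhlds_'):
            dataframe['percent_' + column] = dataframe[column] / dataframe['inc_hhlds']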
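
For comparison, here is a sketch of a variant that leaves its input untouched and returns a new dataframe. The name `with_percentages` and the `BASES` mapping are hypothetical, and the sketch assumes each group of counts is divided by its own group total, as described in the data dictionary.

```python
import pandas as pd

# Hypothetical mapping: column prefix -> base column used as the denominator.
BASES = {
    'race_pop_': 'race_pop',
    'sex_age_pop_': 'sex_age_pop',
    'health_ins_noninst_pop_cov': 'health_ins_noninst_pop',
    'inc_hhlds_': 'inc_hhlds',
}


def with_percentages(dataframe):
    """Return a copy with a 'percent_' column for every matching count column."""
    result = dataframe.copy()
    for column in dataframe.columns:
        for prefix, base in BASES.items():
            if column.startswith(prefix):
                result['percent_' + column] = dataframe[column] / dataframe[base]
                break
    return result


# Toy input; values are copied from the income output shown earlier.
toy = pd.DataFrame(
    {'inc_hhlds': [11041, 209], 'inc_hhlds_less_than_10_000': [482, 25]},
    index=['Austin County, Texas', 'Kenedy County, Texas'],
)
toy_pct = with_percentages(toy)
print(toy_pct['percent_inc_hhlds_less_than_10_000'].round(6))
```

Returning a copy means the function can be re-run safely; the in-place version is shorter but changes whatever dataframe it is given.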

In [21]:
# Apply the function to the numbers dataframe
to_percentage(df)

In [22]:
# Display the first few rows of the dataframe.
df.head(3)

county_state,race_pop,race_pop_hispanic_or_latino_of_any_race,race_pop_white_alone,race_pop_black_or_african_american_alone,race_pop_american_indian_and_alaska_native_alone,race_pop_asian_alone,race_pop_native_hawaiian_and_other_pacific_islander_alone,race_pop_some_other_race_alone,race_pop_two_or_more_races,sex_age_pop,...,percent_inc_hhlds_less_than_10_000,percent_inc_hhlds_10_000_to_14_999,percent_inc_hhlds_15_000_to_24_999,percent_inc_hhlds_25_000_to_34_999,percent_inc_hhlds_35_000_to_49_999,percent_inc_hhlds_50_000_to_74_999,percent_inc_hhlds_75_000_to_99_999,percent_inc_hhlds_100_000_to_149_999,percent_inc_hhlds_150_000_to_199_999,percent_inc_hhlds_200_000_or_more
"Austin County, Texas",29565,7819,18525,2576,48,89,0,52,456,29565,...,0.043655,0.041572,0.113667,0.08396,0.107418,0.167648,0.149534,0.194729,0.049905,0.047912
"Kenedy County, Texas",595,522,72,0,0,1,0,0,0,595,...,0.119617,0.019139,0.23445,0.062201,0.339713,0.076555,0.110048,0.038278,0.0,0.0
"Nueces County, Texas",360486,228462,107652,13071,919,7134,242,226,2780,360486,...,0.069179,0.053853,0.102229,0.100228,0.130897,0.18143,0.126158,0.141151,0.050044,0.044832


In [23]:
# Extract the columns with percentages, save to a new dataframe.
df_percent = df.filter(regex = 'percent', axis = 1)
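
The pattern `'percent'` matches any column whose name contains that text anywhere. A slightly stricter sketch, anchored to the `percent_` prefix, is shown below; `filter_demo` is a hypothetical toy frame with values taken from the outputs above.

```python
import pandas as pd

# Hypothetical toy frame: one count column and two percentage columns.
filter_demo = pd.DataFrame({
    'race_pop': [29565],
    'percent_race_pop_white_alone': [0.626585],
    'percent_sex_age_pop_male': [0.496668],
})

# '^percent_' keeps only columns whose names start with the prefix.
anchored = filter_demo.filter(regex='^percent_', axis=1)

# Equivalent selection using a boolean mask over the column names.
by_prefix = filter_demo.loc[:, filter_demo.columns.str.startswith('percent_')]

print(list(anchored.columns))
```

With the column names used in this notebook both patterns select the same columns; the anchored form only matters if some other column name happens to contain "percent".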

In [24]:
# Display the first few rows of the dataframe.
df_percent.head(3)

county_state,percent_race_pop_hispanic_or_latino_of_any_race,percent_race_pop_white_alone,percent_race_pop_black_or_african_american_alone,percent_race_pop_american_indian_and_alaska_native_alone,percent_race_pop_asian_alone,percent_race_pop_native_hawaiian_and_other_pacific_islander_alone,percent_race_pop_some_other_race_alone,percent_race_pop_two_or_more_races,percent_sex_age_pop_male,percent_sex_age_pop_female,...,percent_inc_hhlds_less_than_10_000,percent_inc_hhlds_10_000_to_14_999,percent_inc_hhlds_15_000_to_24_999,percent_inc_hhlds_25_000_to_34_999,percent_inc_hhlds_35_000_to_49_999,percent_inc_hhlds_50_000_to_74_999,percent_inc_hhlds_75_000_to_99_999,percent_inc_hhlds_100_000_to_149_999,percent_inc_hhlds_150_000_to_199_999,percent_inc_hhlds_200_000_or_more
"Austin County, Texas",0.264468,0.626585,0.08713,0.001624,0.00301,0.0,0.001759,0.015424,0.496668,0.503332,...,0.043655,0.041572,0.113667,0.08396,0.107418,0.167648,0.149534,0.194729,0.049905,0.047912
"Kenedy County, Texas",0.877311,0.121008,0.0,0.0,0.001681,0.0,0.0,0.0,0.480672,0.519328,...,0.119617,0.019139,0.23445,0.062201,0.339713,0.076555,0.110048,0.038278,0.0,0.0
"Nueces County, Texas",0.633761,0.29863,0.036259,0.002549,0.01979,0.000671,0.000627,0.007712,0.493833,0.506167,...,0.069179,0.053853,0.102229,0.100228,0.130897,0.18143,0.126158,0.141151,0.050044,0.044832


## Combine percent dataframe with other key features

In [25]:
# Display the columns.
df_percent.columns

Index(['percent_race_pop_hispanic_or_latino_of_any_race',
       'percent_race_pop_white_alone',
       'percent_race_pop_black_or_african_american_alone',
       'percent_race_pop_american_indian_and_alaska_native_alone',
       'percent_race_pop_asian_alone',
       'percent_race_pop_native_hawaiian_and_other_pacific_islander_alone',
       'percent_race_pop_some_other_race_alone',
       'percent_race_pop_two_or_more_races', 'percent_sex_age_pop_male',
       'percent_sex_age_pop_female', 'percent_sex_age_pop_under_5',
       'percent_sex_age_pop_5_to_9', 'percent_sex_age_pop_10_to_14',
       'percent_sex_age_pop_15_to_19', 'percent_sex_age_pop_20_to_24',
       'percent_sex_age_pop_25_to_34', 'percent_sex_age_pop_35_to_44',
       'percent_sex_age_pop_45_to_54', 'percent_sex_age_pop_55_to_59',
       'percent_sex_age_pop_60_to_64', 'percent_sex_age_pop_65_to_74',
       'percent_sex_age_pop_75_to_84', 'percent_sex_age_pop_85_and_over',
       'percent_health_ins_noninst_pop_cov_ye

In [26]:
# Other metrics from the original dataframe to carry over.
# These are medians, means, and density values, plus the private/public
# insurance counts, none of which were converted in the percentage step.
df_temp = df[[
    'sex_age_median_age_in_years', 
    'health_ins_noninst_pop_private',
    'health_ins_noninst_pop_public',
    'inc_med_hhld_inc_dol',
    'inc_mean_hhld_inc_dol',
    'inc_per_capita_inc_dol',
    'inc_med_earn_workers_dol',
    'inc_med_earn_male_full_yr_workers_dol',
    'inc_med_earn_female_full_yr_workers_dol',
    'pop_density',
]]

In [27]:
# Concatenate the two dataframes to get a complete feature set.
df_percent = pd.concat([df_temp, df_percent], axis=1)
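
`pd.concat` with `axis=1` lines rows up by index label, not by position, so the two frames combine correctly as long as both are indexed by county. The sketch below illustrates this with two hypothetical toy frames, `left` and `right`, whose rows are deliberately in different orders.

```python
import pandas as pd

# Hypothetical toy frames indexed by county, listed in different row orders.
left = pd.DataFrame(
    {'pop_density': [45.731424, 0.407967]},
    index=['Austin County, Texas', 'Kenedy County, Texas'],
)
right = pd.DataFrame(
    {'percent_sex_age_pop_male': [0.480672, 0.496668]},
    index=['Kenedy County, Texas', 'Austin County, Texas'],
)

# Rows are matched on the index labels, so each county keeps its own values.
combined = pd.concat([left, right], axis=1)
print(combined)
```

A county present in only one frame would appear with missing values, so checking `combined.isna().any()` after concatenating is a cheap safeguard.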

In [28]:
# Display the dataframe.
df_percent.head(3)

county_state,sex_age_median_age_in_years,health_ins_noninst_pop_private,health_ins_noninst_pop_public,inc_med_hhld_inc_dol,inc_mean_hhld_inc_dol,inc_per_capita_inc_dol,inc_med_earn_workers_dol,inc_med_earn_male_full_yr_workers_dol,inc_med_earn_female_full_yr_workers_dol,pop_density,...,percent_inc_hhlds_less_than_10_000,percent_inc_hhlds_10_000_to_14_999,percent_inc_hhlds_15_000_to_24_999,percent_inc_hhlds_25_000_to_34_999,percent_inc_hhlds_35_000_to_49_999,percent_inc_hhlds_50_000_to_74_999,percent_inc_hhlds_75_000_to_99_999,percent_inc_hhlds_100_000_to_149_999,percent_inc_hhlds_150_000_to_199_999,percent_inc_hhlds_200_000_or_more
"Austin County, Texas",40.7,20393,8863,65365,80769,30858,33993,55417,38603,45.731424,...,0.043655,0.041572,0.113667,0.08396,0.107418,0.167648,0.149534,0.194729,0.049905,0.047912
"Kenedy County, Texas",39.5,212,276,36125,40908,15820,29453,40848,23295,0.407967,...,0.119617,0.019139,0.23445,0.062201,0.339713,0.076555,0.110048,0.038278,0.0,0.0
"Nueces County, Texas",35.3,208747,119691,55048,74820,27649,30869,48043,34488,430.012072,...,0.069179,0.053853,0.102229,0.100228,0.130897,0.18143,0.126158,0.141151,0.050044,0.044832


In [29]:
# Export the data.
df_percent.to_csv('../data/tx_cleaned_percent.csv')