## Contents
- [Importing and Cleaning DP05 Data](#Importing-and-Cleaning-DP05-Data)  
- [Create 'SEX AND AGE' Dataframe](#Create-'SEX-AND-AGE'-Dataframe)
- [Create 'RACE' Dataframe](#Create-'RACE'-Dataframe)
- [Importing and Cleaning DP03 Data](#Importing-and-Cleaning-DP03-Data) 
- [Importing and Cleaning Testing Data](#Importing-and-Cleaning-Testing-Data) 

In [2]:
import pandas as pd
import numpy as np

# Importing and Cleaning DP05 Data

In [2]:
# Read in the data with headers as a dataframe, setting the geographic area name as the index.
dp05 = pd.read_csv('../data/preprocessing/tx_dp05_data_with_headers.csv', index_col=0)
dp05.head(3)

Unnamed: 0_level_0,Percent Margin of Error!!SEX AND AGE!!Total population!!65 years and over!!Female,Estimate!!SEX AND AGE!!Total population!!65 years and over!!Sex ratio (males per 100 females),Margin of Error!!SEX AND AGE!!Total population!!65 years and over!!Sex ratio (males per 100 females),Percent Estimate!!SEX AND AGE!!Total population!!65 years and over!!Sex ratio (males per 100 females),Percent Margin of Error!!SEX AND AGE!!Total population!!65 years and over!!Sex ratio (males per 100 females),Estimate!!RACE!!Total population,Margin of Error!!RACE!!Total population,Percent Estimate!!RACE!!Total population,Percent Margin of Error!!RACE!!Total population,Estimate!!RACE!!Total population!!One race,...,DP05_0004PMA,DP05_0004PEA,DP05_0018PMA,DP05_0018PEA,DP05_0025PMA,DP05_0028PMA,DP05_0028PEA,DP05_0029PMA,state,county
Geographic Area Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
"Austin County, Texas",0.6,87.0,1.9,-888888888,-888888888,29565,-555555555,29565,-888888888,29004,...,(X),(X),(X),(X),(X),(X),(X),(X),48,15
"Kenedy County, Texas",20.5,34.7,38.9,-888888888,-888888888,595,181,595,-888888888,595,...,(X),(X),(X),(X),(X),(X),(X),(X),48,261
"Nueces County, Texas",0.1,79.3,0.3,-888888888,-888888888,360486,-555555555,360486,-888888888,354655,...,(X),(X),(X),(X),(X),(X),(X),(X),48,355


In [3]:
# Replace '!!' in column titles with a single space.
dp05.columns = dp05.columns.str.replace('!!', ' ')

In [4]:
# Drop the state and county number columns.
dp05 = dp05.drop(columns= ['state', 'county'])

In [5]:
# Display the first few rows of the dataframe
dp05.head(3)

Unnamed: 0_level_0,Percent Margin of Error SEX AND AGE Total population 65 years and over Female,Estimate SEX AND AGE Total population 65 years and over Sex ratio (males per 100 females),Margin of Error SEX AND AGE Total population 65 years and over Sex ratio (males per 100 females),Percent Estimate SEX AND AGE Total population 65 years and over Sex ratio (males per 100 females),Percent Margin of Error SEX AND AGE Total population 65 years and over Sex ratio (males per 100 females),Estimate RACE Total population,Margin of Error RACE Total population,Percent Estimate RACE Total population,Percent Margin of Error RACE Total population,Estimate RACE Total population One race,...,Geographic Area Name.1,DP05_0001PMA,DP05_0004PMA,DP05_0004PEA,DP05_0018PMA,DP05_0018PEA,DP05_0025PMA,DP05_0028PMA,DP05_0028PEA,DP05_0029PMA
Geographic Area Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
"Austin County, Texas",0.6,87.0,1.9,-888888888,-888888888,29565,-555555555,29565,-888888888,29004,...,"Austin County, Texas",(X),(X),(X),(X),(X),(X),(X),(X),(X)
"Kenedy County, Texas",20.5,34.7,38.9,-888888888,-888888888,595,181,595,-888888888,595,...,"Kenedy County, Texas",(X),(X),(X),(X),(X),(X),(X),(X),(X)
"Nueces County, Texas",0.1,79.3,0.3,-888888888,-888888888,360486,-555555555,360486,-888888888,354655,...,"Nueces County, Texas",(X),(X),(X),(X),(X),(X),(X),(X),(X)


## Replace -888888888, -555555555, and (X) with NaN

An entry of -888888888 or '(X)' means that the estimate is not applicable or not available.  
An entry of -555555555 or '*****' in a margin of error column means that the estimate is controlled, so a statistical test for sampling variability is not appropriate.  

In [6]:
# Replace -888888888, -555555555, and '(X)' with NaN.
dp05 = dp05.replace([-888888888, -555555555, '(X)'], np.nan)
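
Whether `replace` catches every sentinel depends on the dtype pandas inferred for each column. A column that mixes numbers with '(X)' is read as strings, so its sentinels may be stored as '-888888888' rather than as an integer. The sketch below uses a toy frame with made-up values and hypothetical column names, and lists each sentinel in both forms. It is a precaution to consider, not a step taken in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): one numeric column and one column that
# pandas would read as strings because it mixes numbers with '(X)'.
toy = pd.DataFrame(
    {
        "estimate": [29565, -888888888, 595],
        "annotation": ["(X)", "-555555555", "12"],
    }
)

# List each sentinel in both its integer and its string form.
sentinels = [-888888888, -555555555, "-888888888", "-555555555", "(X)"]
cleaned = toy.replace(sentinels, np.nan)

# Count the missing values per column after the replacement.
print(cleaned.isna().sum())
```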

## Drop Margin of Error Columns

In [7]:
# Drop columns that contain 'Margin' in the name.
# https://stackoverflow.com/questions/406230/regular-expression-to-match-a-line-that-doesnt-contain-a-word
dp05 = dp05.filter(regex='^((?!Margin).)*$', axis=1)
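
The pattern `^((?!Margin).)*$` uses a negative lookahead: it only matches a name in which no position is the start of the word 'Margin'. A boolean mask built with `str.contains` expresses the same filter and may be easier to read. The sketch below compares the two on a toy frame whose column names imitate the DP05 headers.

```python
import pandas as pd

# Toy frame with column names that mimic the DP05 headers.
toy = pd.DataFrame(
    columns=[
        "Estimate RACE Total population",
        "Margin of Error RACE Total population",
        "Percent Margin of Error RACE Total population",
    ]
)

# Approach used above: keep names that never contain 'Margin'.
by_regex = toy.filter(regex="^((?!Margin).)*$", axis=1)

# Alternative: build a boolean mask of names containing 'Margin' and invert it.
by_mask = toy.loc[:, ~toy.columns.str.contains("Margin", regex=False)]

print(list(by_regex.columns))
print(list(by_mask.columns))
```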

## Drop Percent Columns

In [8]:
# Drop columns that contain 'Percent' in the name.
# Any percentages will be recalculated after combining total counts.
dp05 = dp05.filter(regex='^((?!Percent).)*$', axis=1)
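
The reason for dropping the percent columns is that percentages cannot be summed or averaged when counties are combined, while counts can. The sketch below illustrates this with the total and male population counts for Austin and Kenedy counties taken from this notebook's output. The short column names `pop` and `pop_male` are placeholders chosen for the example.

```python
import pandas as pd

# Counts taken from the Austin and Kenedy rows in this notebook's output.
counts = pd.DataFrame(
    {"pop": [29565, 595], "pop_male": [14684, 286]},
    index=["Austin County, Texas", "Kenedy County, Texas"],
)

# Per-county percentages.
pct_by_county = counts["pop_male"] / counts["pop"] * 100

# Combine the counts first, then compute the percentage for the group.
combined = counts.sum()
pct_combined = combined["pop_male"] / combined["pop"] * 100

# Averaging the county percentages weights small Kenedy County the same as
# Austin County, so it differs from the percentage of the combined counts.
print(round(pct_combined, 2), round(pct_by_county.mean(), 2))
```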

## Drop Columns with NaN Values

In [9]:
# Drop the columns that have NaN values.
dp05 = dp05.loc[:, (dp05.isna().sum() < 1)]
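
The mask above keeps a column only when its NaN count is zero. `DataFrame.dropna(axis=1)` should give the same result more directly. The sketch below checks the equivalence on a toy frame with made-up values.

```python
import numpy as np
import pandas as pd

# Toy frame: column 'b' contains a NaN and should be dropped.
toy = pd.DataFrame({"a": [1, 2], "b": [np.nan, 3.0], "c": [4, 5]})

# Approach used above: keep columns whose NaN count is zero.
by_mask = toy.loc[:, toy.isna().sum() < 1]

# Equivalent built-in: drop every column that contains at least one NaN.
by_dropna = toy.dropna(axis=1)

print(by_mask.equals(by_dropna))
```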

# Create 'SEX AND AGE' Dataframe

In [10]:
# Extract the columns that have SEX AND AGE in the title
# and save to a new dataframe.
dp05_sex_age = dp05.filter(regex = '(SEX AND AGE)', axis=1)
dp05_sex_age.head(3)

Unnamed: 0_level_0,Estimate SEX AND AGE Total population 65 years and over Sex ratio (males per 100 females),Estimate SEX AND AGE Total population,Estimate SEX AND AGE Total population Male,Estimate SEX AND AGE Total population Female,Estimate SEX AND AGE Total population Sex ratio (males per 100 females),Estimate SEX AND AGE Total population Under 5 years,Estimate SEX AND AGE Total population 5 to 9 years,Estimate SEX AND AGE Total population 10 to 14 years,Estimate SEX AND AGE Total population 15 to 19 years,Estimate SEX AND AGE Total population 20 to 24 years,...,Estimate SEX AND AGE Total population 21 years and over,Estimate SEX AND AGE Total population 62 years and over,Estimate SEX AND AGE Total population 65 years and over,Estimate SEX AND AGE Total population 18 years and over.1,Estimate SEX AND AGE Total population 18 years and over Male,Estimate SEX AND AGE Total population 18 years and over Female,Estimate SEX AND AGE Total population 18 years and over Sex ratio (males per 100 females),Estimate SEX AND AGE Total population 65 years and over.1,Estimate SEX AND AGE Total population 65 years and over Male,Estimate SEX AND AGE Total population 65 years and over Female
Geographic Area Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
"Austin County, Texas",87.0,29565,14684,14881,98.7,1780,1960,2118,1861,1712,...,21531,6718,5394,22460,11107,11353,97.8,5394,2509,2885
"Kenedy County, Texas",34.7,595,286,309,92.6,85,37,40,10,10,...,413,137,136,428,190,238,79.8,136,35,101
"Nueces County, Texas",79.3,360486,178020,182466,97.6,24665,25055,24806,25524,26397,...,254583,62009,49345,270482,132024,138458,95.4,49345,21822,27523


In [11]:
# Display information about the SEX and AGE dataframe
dp05_sex_age.info()

<class 'pandas.core.frame.DataFrame'>
Index: 254 entries, Austin County, Texas to Falls County, Texas
Data columns (total 32 columns):
 #   Column                                                                                     Non-Null Count  Dtype  
---  ------                                                                                     --------------  -----  
 0   Estimate SEX AND AGE Total population 65 years and over Sex ratio (males per 100 females)  254 non-null    float64
 1   Estimate SEX AND AGE Total population                                                      254 non-null    int64  
 2   Estimate SEX AND AGE Total population Male                                                 254 non-null    int64  
 3   Estimate SEX AND AGE Total population Female                                               254 non-null    int64  
 4   Estimate SEX AND AGE Total population Sex ratio (males per 100 females)                    254 non-null    float64
 5   Estimate SEX AND AGE

In [12]:
# These are the columns to be used in the analysis.
columns_sex_age = [
    'Estimate SEX AND AGE Total population',
    'Estimate SEX AND AGE Total population Male',
    'Estimate SEX AND AGE Total population Female',
    'Estimate SEX AND AGE Total population Under 5 years',
    'Estimate SEX AND AGE Total population 5 to 9 years',
    'Estimate SEX AND AGE Total population 10 to 14 years',
    'Estimate SEX AND AGE Total population 15 to 19 years',
    'Estimate SEX AND AGE Total population 20 to 24 years',
    'Estimate SEX AND AGE Total population 25 to 34 years',
    'Estimate SEX AND AGE Total population 35 to 44 years',
    'Estimate SEX AND AGE Total population 45 to 54 years',
    'Estimate SEX AND AGE Total population 55 to 59 years',
    'Estimate SEX AND AGE Total population 60 to 64 years',
    'Estimate SEX AND AGE Total population 65 to 74 years',
    'Estimate SEX AND AGE Total population 75 to 84 years',
    'Estimate SEX AND AGE Total population 85 years and over',
    'Estimate SEX AND AGE Total population Median age (years)',
]

In [13]:
# Extract the important columns for gender and age analysis.
dp05_sex_age = dp05_sex_age[columns_sex_age]

In [14]:
# Display the dataframe
dp05_sex_age.head(3)

Unnamed: 0_level_0,Estimate SEX AND AGE Total population,Estimate SEX AND AGE Total population Male,Estimate SEX AND AGE Total population Female,Estimate SEX AND AGE Total population Under 5 years,Estimate SEX AND AGE Total population 5 to 9 years,Estimate SEX AND AGE Total population 10 to 14 years,Estimate SEX AND AGE Total population 15 to 19 years,Estimate SEX AND AGE Total population 20 to 24 years,Estimate SEX AND AGE Total population 25 to 34 years,Estimate SEX AND AGE Total population 35 to 44 years,Estimate SEX AND AGE Total population 45 to 54 years,Estimate SEX AND AGE Total population 55 to 59 years,Estimate SEX AND AGE Total population 60 to 64 years,Estimate SEX AND AGE Total population 65 to 74 years,Estimate SEX AND AGE Total population 75 to 84 years,Estimate SEX AND AGE Total population 85 years and over,Estimate SEX AND AGE Total population Median age (years)
Geographic Area Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
"Austin County, Texas",29565,14684,14881,1780,1960,2118,1861,1712,3339,3275,3821,2327,1978,3243,1532,619,40.7
"Kenedy County, Texas",595,286,309,85,37,40,10,10,95,47,75,51,9,85,29,22,39.5
"Nueces County, Texas",360486,178020,182466,24665,25055,24806,25524,26397,52547,45030,43503,22563,21051,28881,15165,5299,35.3


In [15]:
# Lowercase the column names, drop the 'estimate ' prefix, and shorten 'sex and age'.
dp05_sex_age.columns = dp05_sex_age.columns.str.lower().str.replace('estimate ', '').str.replace('sex and age', 'sex_age')

In [16]:
# Shorten 'total population', drop ' years', and replace spaces with underscores.
dp05_sex_age.columns = dp05_sex_age.columns.str.replace('total population', 'pop').str.replace(' years', '').str.replace(' ', '_')
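
The two renaming cells above can be collapsed into one function, which makes the order of the replacements explicit and easy to test. This is a sketch: `tidy_sex_age_name` is a hypothetical helper name, not something defined elsewhere in the notebook.

```python
def tidy_sex_age_name(name: str) -> str:
    """Apply the same replacements as the two cells above, in the same order."""
    return (
        name.lower()
        .replace("estimate ", "")
        .replace("sex and age", "sex_age")
        .replace("total population", "pop")
        .replace(" years", "")
        .replace(" ", "_")
    )


examples = [
    "Estimate SEX AND AGE Total population Under 5 years",
    "Estimate SEX AND AGE Total population 85 years and over",
    "Estimate SEX AND AGE Total population Median age (years)",
]
for raw in examples:
    print(tidy_sex_age_name(raw))
```

With such a helper, the renaming could be written as `dp05_sex_age = dp05_sex_age.rename(columns=tidy_sex_age_name)`. Note that '(years)' survives because the replacement targets ' years' with a leading space.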

In [17]:
# Display columns.
list(dp05_sex_age.columns)

['sex_age_pop',
 'sex_age_pop_male',
 'sex_age_pop_female',
 'sex_age_pop_under_5',
 'sex_age_pop_5_to_9',
 'sex_age_pop_10_to_14',
 'sex_age_pop_15_to_19',
 'sex_age_pop_20_to_24',
 'sex_age_pop_25_to_34',
 'sex_age_pop_35_to_44',
 'sex_age_pop_45_to_54',
 'sex_age_pop_55_to_59',
 'sex_age_pop_60_to_64',
 'sex_age_pop_65_to_74',
 'sex_age_pop_75_to_84',
 'sex_age_pop_85_and_over',
 'sex_age_pop_median_age_(years)']

In [18]:
# Rename the median age column for the function.
dp05_sex_age = dp05_sex_age.rename(columns={'sex_age_pop_median_age_(years)': 'sex_age_median_age_in_years'})

In [19]:
# Export the cleaned county-level sex and age counts.
dp05_sex_age.to_csv('../data/preprocessing/cleaned_tx_dp05_sex_age.csv')
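
A quick consistency check on the exported columns is to confirm that the thirteen age bands add up to the total population. The sketch below rebuilds the Austin and Kenedy rows from the output shown earlier and flags any county where the sums disagree. The same comparison could be run on `dp05_sex_age` itself.

```python
import pandas as pd

band_columns = [
    "sex_age_pop_under_5", "sex_age_pop_5_to_9", "sex_age_pop_10_to_14",
    "sex_age_pop_15_to_19", "sex_age_pop_20_to_24", "sex_age_pop_25_to_34",
    "sex_age_pop_35_to_44", "sex_age_pop_45_to_54", "sex_age_pop_55_to_59",
    "sex_age_pop_60_to_64", "sex_age_pop_65_to_74", "sex_age_pop_75_to_84",
    "sex_age_pop_85_and_over",
]

# Values copied from the dataframe output above: total first, then age bands.
rows = {
    "Austin County, Texas": [29565, 1780, 1960, 2118, 1861, 1712, 3339,
                             3275, 3821, 2327, 1978, 3243, 1532, 619],
    "Kenedy County, Texas": [595, 85, 37, 40, 10, 10, 95,
                             47, 75, 51, 9, 85, 29, 22],
}
toy = pd.DataFrame.from_dict(
    rows, orient="index", columns=["sex_age_pop"] + band_columns
)

# Sum the age bands per county and list counties where the sum is off.
band_total = toy[band_columns].sum(axis=1)
mismatched = toy[band_total != toy["sex_age_pop"]]
print(len(mismatched))
```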

# Create 'RACE' Dataframe

In [20]:
# Extract the columns that have RACE in the title
# and save to a new dataframe.
dp05_race = dp05.filter(regex='(RACE)', axis=1)
dp05_race.head(3)

Unnamed: 0_level_0,Estimate RACE Total population,Estimate RACE Total population One race,Estimate RACE Total population Two or more races,Estimate RACE Total population One race.1,Estimate RACE Total population One race White,Estimate RACE Total population One race Black or African American,Estimate RACE Total population One race American Indian and Alaska Native,Estimate RACE Total population One race American Indian and Alaska Native Cherokee tribal grouping,Estimate RACE Total population One race American Indian and Alaska Native Chippewa tribal grouping,Estimate RACE Total population One race American Indian and Alaska Native Navajo tribal grouping,...,Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino,Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino White alone,Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino Black or African American alone,Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino American Indian and Alaska Native alone,Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino Asian alone,Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino Native Hawaiian and Other Pacific Islander alone,Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino Some other race alone,Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino Two or more races,Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino Two or more races Two races including Some other race,"Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino Two or more races Two races excluding Some other race, and Three or more races"
Geographic Area Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
"Austin County, Texas",29565,29004,561,29004,23810,2626,48,0,0,0,...,21746,18525,2576,48,89,0,52,456,0,456
"Kenedy County, Texas",595,595,0,595,573,0,0,0,0,0,...,73,72,0,0,1,0,0,0,0,0
"Nueces County, Texas",360486,354655,5831,354655,324198,13620,1638,426,31,37,...,132024,107652,13071,919,7134,242,226,2780,145,2635


In [21]:
# Display information about the dataframe.
dp05_race.info()

<class 'pandas.core.frame.DataFrame'>
Index: 254 entries, Austin County, Texas to Falls County, Texas
Data columns (total 46 columns):
 #   Column                                                                                                                                                       Non-Null Count  Dtype
---  ------                                                                                                                                                       --------------  -----
 0   Estimate RACE Total population                                                                                                                               254 non-null    int64
 1   Estimate RACE Total population One race                                                                                                                      254 non-null    int64
 2   Estimate RACE Total population Two or more races                                                                            

In [22]:
# The race columns fall into two groups: one that ignores Hispanic or Latino origin and one that accounts for it.
# The grouping that accounts for Hispanic or Latino origin appears to be more accurate.

columns_non_hispanic = [
    'Estimate RACE Total population',
    'Estimate RACE Total population One race White',
    'Estimate RACE Total population One race Black or African American',
    'Estimate RACE Total population One race American Indian and Alaska Native',
    'Estimate RACE Total population One race Asian',
    'Estimate RACE Total population One race Some other race',
    'Estimate RACE Total population Two or more races.1'
]

# This grouping seems to be better
columns_hispanic = [
    'Estimate RACE Total population',
    'Estimate HISPANIC OR LATINO AND RACE Total population Hispanic or Latino (of any race)',
    'Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino White alone',
    'Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino Black or African American alone',
    'Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino American Indian and Alaska Native alone',
    'Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino Asian alone',
    'Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino Native Hawaiian and Other Pacific Islander alone',
    'Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino Some other race alone',
    'Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino Two or more races'
]
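
One property that may make the second grouping easier to work with is that its non-Hispanic categories do not overlap. For the three counties shown in the output above, the seven categories add up exactly to the 'Not Hispanic or Latino' total. The sketch below checks this with values copied from that output. The short column names are placeholders for the example, and the check says nothing about the remaining counties.

```python
import pandas as pd

category_columns = [
    "white_alone", "black_alone", "aian_alone", "asian_alone",
    "nhpi_alone", "other_alone", "two_or_more",
]

# Values copied from the output above: 'Not Hispanic or Latino' total first,
# then the seven non-Hispanic categories in the order they appear.
rows = {
    "Austin County, Texas": [21746, 18525, 2576, 48, 89, 0, 52, 456],
    "Kenedy County, Texas": [73, 72, 0, 0, 1, 0, 0, 0],
    "Nueces County, Texas": [132024, 107652, 13071, 919, 7134, 242, 226, 2780],
}
toy = pd.DataFrame.from_dict(
    rows, orient="index", columns=["not_hispanic"] + category_columns
)

# The categories partition the total if their row sums match it.
category_total = toy[category_columns].sum(axis=1)
print((category_total == toy["not_hispanic"]).all())
```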

In [23]:
# Extract the important columns for analysis.
dp05_race = dp05_race[columns_hispanic]

In [24]:
# Display the columns of the dataframe.
dp05_race.columns

Index(['Estimate RACE Total population',
       'Estimate HISPANIC OR LATINO AND RACE Total population Hispanic or Latino (of any race)',
       'Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino White alone',
       'Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino Black or African American alone',
       'Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino American Indian and Alaska Native alone',
       'Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino Asian alone',
       'Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino Native Hawaiian and Other Pacific Islander alone',
       'Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino Some other race alone',
       'Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino Two or more races'],
      dtype='object')

In [25]:
# Lowercase the column names and shorten the Hispanic or Latino prefix to 'race pop'.
dp05_race.columns = dp05_race.columns.str.lower().str.replace('estimate hispanic or latino and race total population', 'race pop')

In [26]:
# Drop 'not hispanic or latino ' and shorten the remaining total population prefix.
dp05_race.columns = dp05_race.columns.str.replace('not hispanic or latino ', '').str.replace('estimate race total population', 'race pop')

In [27]:
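# Remove parentheses and replace spaces with underscores.
# regex=False treats '(' and ')' as literal characters in every pandas version.
dp05_race.columns = dp05_race.columns.str.replace('(', '', regex=False).str.replace(')', '', regex=False).str.replace(' ', '_')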

In [28]:
# Display the columns of the dataframe.
list(dp05_race.columns)

['race_pop',
 'race_pop_hispanic_or_latino_of_any_race',
 'race_pop_white_alone',
 'race_pop_black_or_african_american_alone',
 'race_pop_american_indian_and_alaska_native_alone',
 'race_pop_asian_alone',
 'race_pop_native_hawaiian_and_other_pacific_islander_alone',
 'race_pop_some_other_race_alone',
 'race_pop_two_or_more_races']

In [29]:
# Export the cleaned county-level race counts.
dp05_race.to_csv('../data/preprocessing/cleaned_tx_dp05_race.csv')

# Importing and Cleaning DP03 Data

In [30]:
# Read in the data with headers as a dataframe, setting the geographic area name as the index.
dp03 = pd.read_csv('../data/preprocessing/tx_dp03_data_with_headers.csv', index_col=0)
dp03.head(3)

Unnamed: 0_level_0,GEO_ID,Estimate!!EMPLOYMENT STATUS!!Population 16 years and over,Margin of Error!!EMPLOYMENT STATUS!!Population 16 years and over,Percent Estimate!!EMPLOYMENT STATUS!!Population 16 years and over,Percent Margin of Error!!EMPLOYMENT STATUS!!Population 16 years and over,Estimate!!EMPLOYMENT STATUS!!Population 16 years and over!!In labor force,Margin of Error!!EMPLOYMENT STATUS!!Population 16 years and over!!In labor force,Percent Estimate!!EMPLOYMENT STATUS!!Population 16 years and over!!In labor force,Percent Margin of Error!!EMPLOYMENT STATUS!!Population 16 years and over!!In labor force,Estimate!!EMPLOYMENT STATUS!!Population 16 years and over!!In labor force!!Civilian labor force,...,DP03_0134EA,DP03_0134MA,DP03_0135EA,DP03_0135MA,DP03_0136EA,DP03_0136MA,DP03_0137MA,DP03_0137EA,state,county
Geographic Area Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
"Austin County, Texas",0500000US48015,23354,108,23354,-888888888,14475,413,62.0,1.8,14475,...,(X),(X),(X),(X),(X),(X),(X),(X),48,15
"Kenedy County, Texas",0500000US48261,428,122,428,-888888888,220,83,51.4,13.4,220,...,(X),(X),(X),(X),(X),(X),(X),(X),48,261
"Nueces County, Texas",0500000US48355,280990,413,280990,-888888888,177352,1636,63.1,0.6,175891,...,(X),(X),(X),(X),(X),(X),(X),(X),48,355


In [31]:
# Replace '!!' in column titles with a single space.
dp03.columns = dp03.columns.str.replace('!!', ' ')

In [32]:
# Drop the state and county number columns and the duplicate geographic area name column.
dp03 = dp03.drop(columns= ['state', 'county', 'Geographic Area Name.1'])

## Replace -888888888, -555555555, and (X) with NaN

An entry of -888888888 or '(X)' means that the estimate is not applicable or not available.  
An entry of -555555555 or '*****' in a margin of error column means that the estimate is controlled, so a statistical test for sampling variability is not appropriate.  

In [33]:
# Replace -888888888, -555555555, and '(X)' with NaN
dp03 = dp03.replace([-888888888, -555555555, '(X)'], np.nan)

## Drop Margin of Error Columns

In [34]:
# Drop columns that contain 'Margin' in the name.
# https://stackoverflow.com/questions/406230/regular-expression-to-match-a-line-that-doesnt-contain-a-word
dp03 = dp03.filter(regex='^((?!Margin).)*$', axis=1)

## Drop Percent Columns

In [35]:
# Drop columns that contain 'Percent' in the name.
# Any percentages will be recalculated after combining total counts.
dp03 = dp03.filter(regex='^((?!Percent).)*$', axis=1)

## Drop Columns with NaN Values

In [36]:
# Drop the columns that have NaN values.
dp03 = dp03.loc[:, (dp03.isna().sum() < 1)]
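
The DP03 cleaning steps repeat the DP05 steps almost exactly, so they could be wrapped in one function. The sketch below is one way to do that: `clean_acs_profile` and `SENTINELS` are hypothetical names, and the toy frame uses made-up values. Unlike the DP05 cells above, the helper also drops 'Geographic Area Name.1' when it is present.

```python
import numpy as np
import pandas as pd

SENTINELS = [-888888888, -555555555, "(X)"]


def clean_acs_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper that repeats the cleaning steps used above."""
    out = df.copy()
    # Replace '!!' in column titles with a single space.
    out.columns = out.columns.str.replace("!!", " ", regex=False)
    # Drop identifier columns; ignore any that are absent.
    out = out.drop(
        columns=["state", "county", "Geographic Area Name.1"], errors="ignore"
    )
    # Turn the sentinel values into NaN.
    out = out.replace(SENTINELS, np.nan)
    # Drop margin of error and percent columns.
    out = out.loc[:, ~out.columns.str.contains("Margin|Percent")]
    # Drop every column that still contains a NaN.
    return out.dropna(axis=1)


toy = pd.DataFrame(
    {
        "Estimate!!RACE!!Total population": [29565, 595],
        "Margin of Error!!RACE!!Total population": [-555555555, 181],
        "Percent Estimate!!RACE!!Total population": [29565, 595],
        "DP05_0004PEA": ["(X)", "(X)"],
        "state": [48, 48],
        "county": [15, 261],
    },
    index=["Austin County, Texas", "Kenedy County, Texas"],
)
cleaned = clean_acs_profile(toy)
print(list(cleaned.columns))
```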

In [37]:
# Display the first few rows of the dataframe
dp03.head(3)

Unnamed: 0_level_0,GEO_ID,Estimate EMPLOYMENT STATUS Population 16 years and over,Estimate EMPLOYMENT STATUS Population 16 years and over In labor force,Estimate EMPLOYMENT STATUS Population 16 years and over In labor force Civilian labor force,Estimate EMPLOYMENT STATUS Population 16 years and over In labor force Civilian labor force Employed,Estimate EMPLOYMENT STATUS Population 16 years and over In labor force Civilian labor force Unemployed,Estimate EMPLOYMENT STATUS Population 16 years and over In labor force Armed Forces,Estimate EMPLOYMENT STATUS Population 16 years and over Not in labor force,Estimate EMPLOYMENT STATUS Civilian labor force,Estimate EMPLOYMENT STATUS Females 16 years and over,...,Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population Civilian noninstitutionalized population 19 to 64 years In labor force Unemployed,Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population Civilian noninstitutionalized population 19 to 64 years In labor force Unemployed With health insurance coverage,Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population Civilian noninstitutionalized population 19 to 64 years In labor force Unemployed With health insurance coverage With private health insurance,Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population Civilian noninstitutionalized population 19 to 64 years In labor force Unemployed With health insurance coverage With public coverage,Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population Civilian noninstitutionalized population 19 to 64 years In labor force Unemployed No health insurance coverage,Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population Civilian noninstitutionalized population 19 to 64 years Not in labor force,Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population Civilian noninstitutionalized population 19 to 64 years Not in labor force With health insurance coverage,Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population Civilian noninstitutionalized population 19 to 64 years Not in labor force With health insurance coverage With private health insurance,Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population Civilian noninstitutionalized population 19 to 64 years Not in labor force With health insurance coverage With public coverage,Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population Civilian noninstitutionalized population 19 to 64 years Not in labor force No health insurance coverage
Geographic Area Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
"Austin County, Texas",0500000US48015,23354,14475,14475,13801,674,0,8879,14475,11825,...,645,469,393,78,176,3758,2918,2081,1036,840
"Kenedy County, Texas",0500000US48261,428,220,220,220,0,0,208,220,238,...,0,0,0,0,0,83,38,33,5,45
"Nueces County, Texas",0500000US48355,280990,177352,175891,165496,10395,1461,103638,175891,143930,...,8603,4279,2743,1640,4324,49810,37679,22693,18091,12131


Economic categories covered in DP03:
1. Employment Status
2. Commuting to Work
3. Occupation
4. Industry
5. Class of Worker
6. Income and Benefits
7. Health Insurance Coverage

In [38]:
# Display DP03 columns as a list.
list(dp03.columns)

['GEO_ID',
 'Estimate EMPLOYMENT STATUS Population 16 years and over',
 'Estimate EMPLOYMENT STATUS Population 16 years and over In labor force',
 'Estimate EMPLOYMENT STATUS Population 16 years and over In labor force Civilian labor force',
 'Estimate EMPLOYMENT STATUS Population 16 years and over In labor force Civilian labor force Employed',
 'Estimate EMPLOYMENT STATUS Population 16 years and over In labor force Civilian labor force Unemployed',
 'Estimate EMPLOYMENT STATUS Population 16 years and over In labor force Armed Forces',
 'Estimate EMPLOYMENT STATUS Population 16 years and over Not in labor force',
 'Estimate EMPLOYMENT STATUS Civilian labor force',
 'Estimate EMPLOYMENT STATUS Females 16 years and over',
 'Estimate EMPLOYMENT STATUS Females 16 years and over In labor force',
 'Estimate EMPLOYMENT STATUS Females 16 years and over In labor force Civilian labor force',
 'Estimate EMPLOYMENT STATUS Females 16 years and over In labor force Civilian labor force Employed',
 'E

# Create 'INCOME AND BENEFITS' Dataframe

In [39]:
# Extract the columns that have INCOME AND BENEFITS in the title
# and save to a new dataframe.
dp03_income = dp03.filter(regex = '(INCOME AND BENEFITS)', axis=1)
dp03_income.head(3)

Unnamed: 0_level_0,Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Total households,"Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Total households Less than $10,000","Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Total households $10,000 to $14,999","Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Total households $15,000 to $24,999","Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Total households $25,000 to $34,999","Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Total households $35,000 to $49,999","Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Total households $50,000 to $74,999","Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Total households $75,000 to $99,999","Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Total households $100,000 to $149,999","Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Total households $150,000 to $199,999",...,"Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Families $200,000 or more",Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Families Median family income (dollars),Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Families Mean family income (dollars),Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Per capita income (dollars),Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Nonfamily households,Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Nonfamily households Median nonfamily income (dollars),Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Nonfamily households Mean nonfamily income (dollars),Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Median earnings for workers (dollars),"Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Median earnings for male full-time, year-round workers (dollars)","Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Median earnings for female full-time, year-round workers (dollars)"
Geographic Area Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
"Austin County, Texas",11041,482,459,1255,927,1186,1851,1651,2150,551,...,479,79066,92253,30858,2781,27492,44884,33993,55417,38603
"Kenedy County, Texas",209,25,4,49,13,71,16,23,8,0,...,0,40625,44286,15820,55,23828,28953,29453,40848,23295
"Nueces County, Texas",128926,8919,6943,13180,12922,16876,23391,16265,18198,6452,...,5221,65679,86004,27649,41539,32517,47182,30869,48043,34488


In [40]:
# Display information about the dataframe.
dp03_income.info()

<class 'pandas.core.frame.DataFrame'>
Index: 254 entries, Austin County, Texas to Falls County, Texas
Data columns (total 44 columns):
 #   Column                                                                                                                                                              Non-Null Count  Dtype
---  ------                                                                                                                                                              --------------  -----
 0   Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Total households                                                                                  254 non-null    int64
 1   Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Total households Less than $10,000                                                                254 non-null    int64
 2   Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Total households $10,000 to $1

In [41]:
# These are the columns to be used in the analysis.
columns_income = [
    'Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Total households',
    'Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Total households Less than $10,000',
    'Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Total households $10,000 to $14,999',
    'Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Total households $15,000 to $24,999',
    'Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Total households $25,000 to $34,999',
    'Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Total households $35,000 to $49,999',
    'Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Total households $50,000 to $74,999',
    'Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Total households $75,000 to $99,999',
    'Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Total households $100,000 to $149,999',
    'Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Total households $150,000 to $199,999',
    'Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Total households $200,000 or more',
    'Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Total households Median household income (dollars)',
    'Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Total households Mean household income (dollars)',
    'Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Per capita income (dollars)',
    'Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Median earnings for workers (dollars)',
    'Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Median earnings for male full-time, year-round workers (dollars)',
    'Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Median earnings for female full-time, year-round workers (dollars)'
]

In [42]:
# Extract the important columns for income analysis.
dp03_income = dp03_income[columns_income]

In [43]:
# Display the first few rows of the dataframe.
dp03_income.head(3)

Unnamed: 0_level_0,Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Total households,"Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Total households Less than $10,000","Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Total households $10,000 to $14,999","Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Total households $15,000 to $24,999","Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Total households $25,000 to $34,999","Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Total households $35,000 to $49,999","Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Total households $50,000 to $74,999","Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Total households $75,000 to $99,999","Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Total households $100,000 to $149,999","Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Total households $150,000 to $199,999","Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Total households $200,000 or more",Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Total households Median household income (dollars),Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Total households Mean household income (dollars),Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Per capita income (dollars),Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Median earnings for workers (dollars),"Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Median earnings for male full-time, year-round workers (dollars)","Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Median earnings for female full-time, year-round workers (dollars)"
Geographic Area Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
"Austin County, Texas",11041,482,459,1255,927,1186,1851,1651,2150,551,529,65365,80769,30858,33993,55417,38603
"Kenedy County, Texas",209,25,4,49,13,71,16,23,8,0,0,36125,40908,15820,29453,40848,23295
"Nueces County, Texas",128926,8919,6943,13180,12922,16876,23391,16265,18198,6452,5780,55048,74820,27649,30869,48043,34488


In [44]:
# Lowercase the column names and drop the 'estimate ' prefix.
dp03_income.columns = dp03_income.columns.str.lower().str.replace('estimate ', '')

In [45]:
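# Abbreviate 'income' and drop the inflation-adjustment qualifier.
# regex=False treats the parentheses as literal characters.
dp03_income.columns = dp03_income.columns.str.replace('income', 'inc').str.replace(' and benefits (in 2018 inflation-adjusted dollars)', '', regex=False)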

In [46]:
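# Drop 'total ', join comma-separated phrases with '_', and remove dollar signs.
# regex=False keeps '$' from being read as an end-of-string anchor.
dp03_income.columns = dp03_income.columns.str.replace('total ', '').str.replace(', ', '_').str.replace('$', '', regex=False)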

In [47]:
# Abbreviate 'households' and 'household', and replace the remaining commas with '_'.
dp03_income.columns = dp03_income.columns.str.replace('households', 'hhlds').str.replace('household', 'hhld').str.replace(',', '_')

In [48]:
# Remove the redundant 'hhlds' from the median and mean household income names.
dp03_income.columns = dp03_income.columns.str.replace('inc hhlds median', 'inc median').str.replace('inc hhlds mean', 'inc mean')

In [49]:
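# Strip parentheses and 'for ', then replace spaces with underscores.
# regex=False treats '(' and ')' as literal characters.
dp03_income.columns = dp03_income.columns.str.replace('(', '', regex=False).str.replace(')', '', regex=False).str.replace('for ', '').str.replace(' ', '_')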

In [50]:
# Abbreviate 'median' and the full-time, year-round label.
dp03_income.columns = dp03_income.columns.str.replace('median', 'med').str.replace('full-time_year-round', 'full_yr')

In [51]:
# Abbreviate 'dollars' and 'earnings'.
dp03_income.columns = dp03_income.columns.str.replace('dollars', 'dol').str.replace('earnings', 'earn')

In [52]:
# Display column names.
list(dp03_income.columns)

['inc_hhlds',
 'inc_hhlds_less_than_10_000',
 'inc_hhlds_10_000_to_14_999',
 'inc_hhlds_15_000_to_24_999',
 'inc_hhlds_25_000_to_34_999',
 'inc_hhlds_35_000_to_49_999',
 'inc_hhlds_50_000_to_74_999',
 'inc_hhlds_75_000_to_99_999',
 'inc_hhlds_100_000_to_149_999',
 'inc_hhlds_150_000_to_199_999',
 'inc_hhlds_200_000_or_more',
 'inc_med_hhld_inc_dol',
 'inc_mean_hhld_inc_dol',
 'inc_per_capita_inc_dol',
 'inc_med_earn_workers_dol',
 'inc_med_earn_male_full_yr_workers_dol',
 'inc_med_earn_female_full_yr_workers_dol']
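
The renaming cells above can also be written as one ordered list of literal replacements. The following is a minimal sketch, not the notebook's own method: the helper name `shorten_income_columns` and the `REPLACEMENTS` list are illustrative assumptions. Passing `regex=False` makes every pattern a literal string, so `$`, `(` and `)` need no escaping.

```python
import pandas as pd

# Ordered (old, new) literal replacements mirroring the cells above.
# The order matters: later patterns rely on earlier substitutions.
REPLACEMENTS = [
    ('estimate ', ''),
    ('income', 'inc'),
    (' and benefits (in 2018 inflation-adjusted dollars)', ''),
    ('total ', ''),
    (', ', '_'),
    ('$', ''),
    ('households', 'hhlds'),
    ('household', 'hhld'),
    (',', '_'),
    ('inc hhlds median', 'inc median'),
    ('inc hhlds mean', 'inc mean'),
    ('(', ''),
    (')', ''),
    ('for ', ''),
    (' ', '_'),
    ('median', 'med'),
    ('full-time_year-round', 'full_yr'),
    ('dollars', 'dol'),
    ('earnings', 'earn'),
]


def shorten_income_columns(columns):
    """Hypothetical helper: lowercase the names, then apply each literal replacement in order."""
    names = pd.Index(columns).str.lower()
    for old, new in REPLACEMENTS:
        names = names.str.replace(old, new, regex=False)
    return list(names)


# Three of the original DP03 column names from this notebook.
example = [
    'Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Total households Less than $10,000',
    'Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Total households Median household income (dollars)',
    'Estimate INCOME AND BENEFITS (IN 2018 INFLATION-ADJUSTED DOLLARS) Median earnings for male full-time, year-round workers (dollars)',
]
short_names = shorten_income_columns(example)
print(short_names)
```

Keeping the replacements in a single list makes the order of operations explicit and easier to review than a series of separate cells.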

In [53]:
# Export the cleaned county income data.
dp03_income.to_csv('../data/preprocessing/cleaned_tx_dp03_income.csv')

# Create 'HEALTH INSURANCE' Dataframe

In [54]:
# Extract the columns that have HEALTH INSURANCE in the title
# and save to a new dataframe.
dp03_ins = dp03.filter(regex = '(HEALTH INSURANCE)', axis=1)
dp03_ins.head(3)

Unnamed: 0_level_0,Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population,Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population With health insurance coverage,Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population With health insurance coverage With private health insurance,Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population With health insurance coverage With public coverage,Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population No health insurance coverage,Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population Civilian noninstitutionalized population under 19 years,Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population Civilian noninstitutionalized population under 19 years No health insurance coverage,Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population Civilian noninstitutionalized population 19 to 64 years,Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population Civilian noninstitutionalized population 19 to 64 years In labor force,Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population Civilian noninstitutionalized population 19 to 64 years In labor force Employed,...,Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population Civilian noninstitutionalized population 19 to 64 years In labor force Unemployed,Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population Civilian noninstitutionalized population 19 to 64 years In labor force Unemployed With health insurance coverage,Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population Civilian noninstitutionalized population 19 to 64 years In labor force Unemployed With health insurance coverage With private health insurance,Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population Civilian noninstitutionalized population 19 to 64 years In labor force Unemployed With health insurance coverage With public coverage,Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population Civilian noninstitutionalized population 19 to 64 years In labor force Unemployed No health insurance coverage,Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population Civilian noninstitutionalized population 19 to 64 years Not in labor force,Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population Civilian noninstitutionalized population 19 to 64 years Not in labor force With health insurance coverage,Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population Civilian noninstitutionalized population 19 to 64 years Not in labor force With health insurance coverage With private health insurance,Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population Civilian noninstitutionalized population 19 to 64 years Not in labor force With health insurance coverage With public coverage,Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population Civilian noninstitutionalized population 19 to 64 years Not in labor force No health insurance coverage
Geographic Area Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
"Austin County, Texas",29298,25749,20393,8863,3549,7434,415,16634,12876,12231,...,645,469,393,78,176,3758,2918,2081,1036,840
"Kenedy County, Texas",595,467,212,276,128,167,4,292,209,209,...,0,0,0,0,0,83,38,33,5,45
"Nueces County, Texas",355767,295165,208747,119691,60602,95317,8580,212235,162425,153822,...,8603,4279,2743,1640,4324,49810,37679,22693,18091,12131


In [55]:
# Display information about the dataframe.
dp03_ins.info()

<class 'pandas.core.frame.DataFrame'>
Index: 254 entries, Austin County, Texas to Falls County, Texas
Data columns (total 24 columns):
 #   Column                                                                                                                                                                                                                      Non-Null Count  Dtype
---  ------                                                                                                                                                                                                                      --------------  -----
 0   Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population                                                                                                                                                 254 non-null    int64
 1   Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population With health insurance coverage             

In [56]:
# These are the columns to be used in the analysis.
columns_ins = [
    'Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population',
    'Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population With health insurance coverage',
    'Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population With health insurance coverage With private health insurance',
    'Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population With health insurance coverage With public coverage',
    'Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population No health insurance coverage'
]

## NOTE: 
We may not be able to analyze public/private coverage.

1. 'Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population' (DP03_0095E)
2. 'Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population With health insurance coverage' (DP03_0096E)
3. 'Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population With health insurance coverage With private health insurance' (DP03_0097E)
4. 'Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population With health insurance coverage With public coverage' (DP03_0098E)
5. 'Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population No health insurance coverage' (DP03_0099E)

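Values 2 and 5 add up to value 1, as expected. However, values 3 and 4 do not add up to value 2: their sum is consistently larger (for Austin County, 20,393 + 8,863 = 29,256 versus 25,749). I'm not sure why, and I've asked the Census Bureau. One plausible explanation is that a person can hold both private and public coverage, in which case the two categories overlap.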
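
The arithmetic in this note can be checked directly. Below is a minimal sketch using only the three preview rows displayed above; the `preview` frame and its short column names are illustrative assumptions, not objects from the notebook.

```python
import pandas as pd

# Values copied from the three preview rows of dp03_ins shown in this notebook.
preview = pd.DataFrame(
    {
        'total': [29298, 595, 355767],      # value 1 (DP03_0095E)
        'cov_yes': [25749, 467, 295165],    # value 2 (DP03_0096E)
        'private': [20393, 212, 208747],    # value 3 (DP03_0097E)
        'public': [8863, 276, 119691],      # value 4 (DP03_0098E)
        'cov_no': [3549, 128, 60602],       # value 5 (DP03_0099E)
    },
    index=['Austin County, Texas', 'Kenedy County, Texas', 'Nueces County, Texas'],
)

# Covered plus uncovered should reproduce the total population.
totals_match = (preview['cov_yes'] + preview['cov_no'] == preview['total']).all()

# Amount by which private + public exceeds the covered population.
excess = preview['private'] + preview['public'] - preview['cov_yes']

print(totals_match)
print(excess)
```

Running the same two comparisons on the full `dp03_ins` dataframe would show whether the pattern holds for all 254 counties.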

In [57]:
# Extract the important columns for analysis.
dp03_ins = dp03_ins[columns_ins]

In [58]:
# Display the first few rows of the dataframe.
dp03_ins.head(3)

Unnamed: 0_level_0,Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population,Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population With health insurance coverage,Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population With health insurance coverage With private health insurance,Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population With health insurance coverage With public coverage,Estimate HEALTH INSURANCE COVERAGE Civilian noninstitutionalized population No health insurance coverage
Geographic Area Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
"Austin County, Texas",29298,25749,20393,8863,3549
"Kenedy County, Texas",595,467,212,276,128
"Nueces County, Texas",355767,295165,208747,119691,60602


In [59]:
# Clean up column names.
dp03_ins.columns = dp03_ins.columns.str.lower().str.replace('estimate ', '')

In [60]:
# Shorten the long population prefix.
dp03_ins.columns = dp03_ins.columns.str.replace(
    'health insurance coverage civilian noninstitutionalized population', 'health_ins_noninst_pop')

In [61]:
# Clean up column names.
dp03_ins.columns = dp03_ins.columns.str.replace('with health insurance coverage', 'cov_yes')

In [62]:
# Clean up column names.
dp03_ins.columns = dp03_ins.columns.str.replace('no health insurance coverage', 'cov_no')

In [63]:
# Clean up column names.
dp03_ins.columns = dp03_ins.columns.str.replace('cov_yes with private health insurance', 'private')

In [64]:
# Clean up column names.
dp03_ins.columns = dp03_ins.columns.str.replace('cov_yes with public coverage', 'public')

In [65]:
# Clean up column names.
dp03_ins.columns = dp03_ins.columns.str.replace(' ', '_')

In [66]:
# Display the columns as a list.
list(dp03_ins.columns)

['health_ins_noninst_pop',
 'health_ins_noninst_pop_cov_yes',
 'health_ins_noninst_pop_private',
 'health_ins_noninst_pop_public',
 'health_ins_noninst_pop_cov_no']

In [67]:
# Export the cleaned county health insurance data.
dp03_ins.to_csv('../data/preprocessing/cleaned_tx_dp03_insurance.csv')

# Importing and Cleaning Obesity Data

In [68]:
# Read in the obesity data as a dataframe.
tx_ob = pd.read_csv('../data/preprocessing/tx_obesity_data.csv')

In [69]:
tx_ob.head()

Unnamed: 0,County,State,CountyFIPS,Percentage,Lower Limit,Upper Limit
0,Anderson County,Texas,48001,37.3,28.1,47.5
1,Andrews County,Texas,48003,31.3,20.0,44.2
2,Angelina County,Texas,48005,39.6,35.6,43.6
3,Aransas County,Texas,48007,37.7,26.6,49.8
4,Archer County,Texas,48009,28.3,18.9,39.4


## Create a county column using the lowercase county-only formatting

In [70]:
tx_ob['county'] = tx_ob['County'].str.replace(' County', '')

In [71]:
tx_ob['county'] = tx_ob['county'].str.lower()

In [72]:
tx_ob.head()

Unnamed: 0,County,State,CountyFIPS,Percentage,Lower Limit,Upper Limit,county
0,Anderson County,Texas,48001,37.3,28.1,47.5,anderson
1,Andrews County,Texas,48003,31.3,20.0,44.2,andrews
2,Angelina County,Texas,48005,39.6,35.6,43.6,angelina
3,Aransas County,Texas,48007,37.7,26.6,49.8,aransas
4,Archer County,Texas,48009,28.3,18.9,39.4,archer
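
The two steps above can be combined into one reusable function. This is a minimal sketch: the name `normalize_county` is hypothetical, and unlike the literal replacement above it anchors the ` County` suffix to the end of the string and trims surrounding whitespace, which is an assumption about how the raw names might vary.

```python
import pandas as pd


def normalize_county(names):
    """Hypothetical helper: trim whitespace, drop a trailing ' County', and lowercase."""
    return (
        names.str.strip()
             .str.replace(r'\s+County$', '', regex=True)
             .str.lower()
    )


# Sample names in the same format as the 'County' column above;
# the trailing space on the last entry is an invented edge case.
sample = pd.Series(['Anderson County', 'Andrews County', 'Archer County '])
normalized = normalize_county(sample)
print(normalized.tolist())
```

Applying the same function to every source table keeps the `county_name` key consistent across the cleaned files.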


In [73]:
tx_ob = tx_ob.rename(columns={'Percentage': 'obes_percent', 'county': 'county_name'})
tx_ob.head(3)

Unnamed: 0,County,State,CountyFIPS,obes_percent,Lower Limit,Upper Limit,county_name
0,Anderson County,Texas,48001,37.3,28.1,47.5,anderson
1,Andrews County,Texas,48003,31.3,20.0,44.2,andrews
2,Angelina County,Texas,48005,39.6,35.6,43.6,angelina


In [74]:
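# Keep only the columns needed for the analysis; copy to avoid chained-assignment warnings.
tx_ob = tx_ob[['county_name', 'obes_percent']].copy()

In [76]:
# Convert the percentage to a proportion between 0 and 1.
tx_ob['obes_percent'] = tx_ob['obes_percent'] / 100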

In [77]:
tx_ob

Unnamed: 0,county_name,obes_percent
0,anderson,0.373
1,andrews,0.313
2,angelina,0.396
3,aransas,0.377
4,archer,0.283
...,...,...
249,wood,0.331
250,yoakum,0.295
251,young,0.359
252,zapata,0.302


In [78]:
tx_ob.to_csv('../data/preprocessing/cleaned_tx_obesity.csv', index=False)

# Importing and Cleaning Testing Data

In [79]:
tests = pd.read_csv('../data/preprocessing/Texas_total_tests_oct-22.csv')

In [80]:
tests.head(3)

Unnamed: 0,COVID-19 Cumulative Total Tests Performed in Texas by County,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11
0,County,13-Oct,14-Oct,15-Oct,16-Oct,17-Oct,18-Oct,19-Oct,20-Oct,21-Oct,22-Oct,23-Oct
1,Anderson,28600,28641,28844,29027,29075,29317,29350,29371,29485,29843,29964
2,Andrews,1843,1854,1888,1894,1903,1915,1925,1930,1943,2092,2112


In [81]:
# Rename the columns using the first row.
tests = tests.rename(columns=tests.iloc[0])
tests

Unnamed: 0,County,13-Oct,14-Oct,15-Oct,16-Oct,17-Oct,18-Oct,19-Oct,20-Oct,21-Oct,22-Oct,23-Oct
0,County,13-Oct,14-Oct,15-Oct,16-Oct,17-Oct,18-Oct,19-Oct,20-Oct,21-Oct,22-Oct,23-Oct
1,Anderson,28600,28641,28844,29027,29075,29317,29350,29371,29485,29843,29964
2,Andrews,1843,1854,1888,1894,1903,1915,1925,1930,1943,2092,2112
3,Angelina,18736,18815,18974,19091,19564,20161,20196,20211,20269,20325,21026
4,Aransas,3135,3239,3340,3354,3414,3433,3476,3501,3534,3575,3619
...,...,...,...,...,...,...,...,...,...,...,...,...
252,Young,3045,3063,3189,3201,3223,3233,3259,3262,3273,3298,3347
253,Zapata,3948,3958,4074,4208,4218,4232,4287,4291,4307,4324,4346
254,Zavala,2248,2264,2305,2329,2345,2358,2370,2375,2377,2390,2494
255,Unknown,113478,113579,113719,113765,113798,113824,113906,113933,114111,114179,114247


In [82]:
# Drop the first and last two rows.
tests = tests.iloc[1:255, :]

In [83]:
tests

Unnamed: 0,County,13-Oct,14-Oct,15-Oct,16-Oct,17-Oct,18-Oct,19-Oct,20-Oct,21-Oct,22-Oct,23-Oct
1,Anderson,28600,28641,28844,29027,29075,29317,29350,29371,29485,29843,29964
2,Andrews,1843,1854,1888,1894,1903,1915,1925,1930,1943,2092,2112
3,Angelina,18736,18815,18974,19091,19564,20161,20196,20211,20269,20325,21026
4,Aransas,3135,3239,3340,3354,3414,3433,3476,3501,3534,3575,3619
5,Archer,956,964,990,1003,1013,1015,1031,1033,1045,1062,1073
...,...,...,...,...,...,...,...,...,...,...,...,...
250,Wood,6095,6166,6242,6289,6342,6393,6447,6483,6554,6699,6804
251,Yoakum,1439,1444,1446,1448,1478,1479,1480,1481,1486,1495,1525
252,Young,3045,3063,3189,3201,3223,3233,3259,3262,3273,3298,3347
253,Zapata,3948,3958,4074,4208,4218,4232,4287,4291,4307,4324,4346
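
The header promotion and row trimming above can be sketched on a toy frame. The function name `promote_header`, the parameter `n_trailing`, and the contents of the final toy row are illustrative assumptions; only the overall shape (a header stored in row 0 and non-county rows at the bottom) comes from the output above.

```python
import pandas as pd

# Toy frame shaped like the raw testing file: the real header sits in row 0
# and the last two rows are not counties. The final row is a made-up placeholder.
raw = pd.DataFrame(
    [
        ['County', '13-Oct', '14-Oct'],
        ['Anderson', '28600', '28641'],
        ['Andrews', '1843', '1854'],
        ['Unknown', '113478', '113579'],
        ['placeholder', '0', '0'],
    ],
    columns=['Title', 'Unnamed: 1', 'Unnamed: 2'],
)


def promote_header(df, n_trailing):
    """Use row 0 as the header, then drop that row and the last n_trailing rows."""
    out = df.copy()
    out.columns = out.iloc[0].tolist()
    out = out.iloc[1:len(out) - n_trailing]
    return out.reset_index(drop=True)


counties = promote_header(raw, n_trailing=2)
print(counties)
```

Working on a copy leaves the raw frame untouched, so the slice can be re-run with a different `n_trailing` if the source file changes.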


In [84]:
# The final date to use is '2020-10-21'.
# Copy the selection so later assignments do not raise SettingWithCopyWarning.
tests = tests[['County', '21-Oct']].copy()

In [85]:
# Lowercase the county names.
tests['County'] = tests['County'].str.lower()

In [86]:
tests = tests.rename(columns={'County': 'county_name', '21-Oct': 'total_tests'})

In [87]:
tests

Unnamed: 0,county_name,total_tests
1,anderson,29485
2,andrews,1943
3,angelina,20269
4,aransas,3534
5,archer,1045
...,...,...
250,wood,6554
251,yoakum,1486
252,young,3273
253,zapata,4307


In [88]:
tests.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 254 entries, 1 to 254
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   county_name  254 non-null    object
 1   total_tests  254 non-null    object
dtypes: object(2)
memory usage: 4.1+ KB


In [89]:
# Define a function to strip thousands separators and convert to an integer.
def clean_nums(text):
    """Return `text` as an int, ignoring any commas."""
    return int(text.replace(',', ''))

In [90]:
tests['total_tests'] = tests['total_tests'].apply(clean_nums)
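
The same conversion can be done without a row-wise `apply`. This is a minimal sketch of a vectorized alternative; the sample values are illustrative, and the entries containing commas are an assumption about what the raw file may hold.

```python
import pandas as pd

# Illustrative string counts, some with thousands separators.
counts = pd.Series(['29485', '1,943', '113,478'])

# Remove commas as literal characters, then cast the strings to integers.
cleaned = counts.str.replace(',', '', regex=False).astype('int64')
print(cleaned)
```

On a column of a few hundred rows the speed difference is negligible, so the choice is mostly about readability.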

In [91]:
tests.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 254 entries, 1 to 254
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   county_name  254 non-null    object
 1   total_tests  254 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 4.1+ KB


In [92]:
tests

Unnamed: 0,county_name,total_tests
1,anderson,29485
2,andrews,1943
3,angelina,20269
4,aransas,3534
5,archer,1045
...,...,...
250,wood,6554
251,yoakum,1486
252,young,3273
253,zapata,4307


In [93]:
# Export the cleaned data.
tests.to_csv('../data/preprocessing/cleaned_tx_tests.csv', index=False)