## Contents
- [Importing and Cleaning DP05 Data](#Importing-and-Cleaning-DP05-Data)  
- [Create 'SEX AND AGE' Dataframe](#Create-'SEX-AND-AGE'-Dataframe)
- [Create 'RACE' Dataframe](#Create-'RACE'-Dataframe)

In [1]:
import pandas as pd
import numpy as np

# Importing and Cleaning DP05 Data

In [2]:
# Read in the data with headers as a dataframe, setting the geographic area name as the index.
dp05 = pd.read_csv('./data/preprocessing/tx_dp05_data_with_headers.csv', index_col=0)
dp05.head(3)

Unnamed: 0_level_0,Percent Margin of Error!!SEX AND AGE!!Total population!!65 years and over!!Female,Estimate!!SEX AND AGE!!Total population!!65 years and over!!Sex ratio (males per 100 females),Margin of Error!!SEX AND AGE!!Total population!!65 years and over!!Sex ratio (males per 100 females),Percent Estimate!!SEX AND AGE!!Total population!!65 years and over!!Sex ratio (males per 100 females),Percent Margin of Error!!SEX AND AGE!!Total population!!65 years and over!!Sex ratio (males per 100 females),Estimate!!RACE!!Total population,Margin of Error!!RACE!!Total population,Percent Estimate!!RACE!!Total population,Percent Margin of Error!!RACE!!Total population,Estimate!!RACE!!Total population!!One race,...,DP05_0004PMA,DP05_0004PEA,DP05_0018PMA,DP05_0018PEA,DP05_0025PMA,DP05_0028PMA,DP05_0028PEA,DP05_0029PMA,state,county
Geographic Area Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
"Austin County, Texas",0.6,87.0,1.9,-888888888,-888888888,29565,-555555555,29565,-888888888,29004,...,(X),(X),(X),(X),(X),(X),(X),(X),48,15
"Kenedy County, Texas",20.5,34.7,38.9,-888888888,-888888888,595,181,595,-888888888,595,...,(X),(X),(X),(X),(X),(X),(X),(X),48,261
"Nueces County, Texas",0.1,79.3,0.3,-888888888,-888888888,360486,-555555555,360486,-888888888,354655,...,(X),(X),(X),(X),(X),(X),(X),(X),48,355


In [3]:
# Replace '!!' in the column names with a single space.
dp05.columns = dp05.columns.str.replace('!!', ' ')

In [4]:
# Drop the state and county number columns.
dp05 = dp05.drop(columns= ['state', 'county'])

In [5]:
# Display the first few rows of the dataframe
dp05.head(3)

Unnamed: 0_level_0,Percent Margin of Error SEX AND AGE Total population 65 years and over Female,Estimate SEX AND AGE Total population 65 years and over Sex ratio (males per 100 females),Margin of Error SEX AND AGE Total population 65 years and over Sex ratio (males per 100 females),Percent Estimate SEX AND AGE Total population 65 years and over Sex ratio (males per 100 females),Percent Margin of Error SEX AND AGE Total population 65 years and over Sex ratio (males per 100 females),Estimate RACE Total population,Margin of Error RACE Total population,Percent Estimate RACE Total population,Percent Margin of Error RACE Total population,Estimate RACE Total population One race,...,Geographic Area Name.1,DP05_0001PMA,DP05_0004PMA,DP05_0004PEA,DP05_0018PMA,DP05_0018PEA,DP05_0025PMA,DP05_0028PMA,DP05_0028PEA,DP05_0029PMA
Geographic Area Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
"Austin County, Texas",0.6,87.0,1.9,-888888888,-888888888,29565,-555555555,29565,-888888888,29004,...,"Austin County, Texas",(X),(X),(X),(X),(X),(X),(X),(X),(X)
"Kenedy County, Texas",20.5,34.7,38.9,-888888888,-888888888,595,181,595,-888888888,595,...,"Kenedy County, Texas",(X),(X),(X),(X),(X),(X),(X),(X),(X)
"Nueces County, Texas",0.1,79.3,0.3,-888888888,-888888888,360486,-555555555,360486,-888888888,354655,...,"Nueces County, Texas",(X),(X),(X),(X),(X),(X),(X),(X),(X)


## Replace -888888888, -555555555, and (X) with NaN

An entry of -888888888 or '(X)' means that the estimate is not applicable or not available.  
An entry of -555555555 or '*****' in a margin of error column indicates that the estimate is controlled, so a statistical test for sampling variability is not appropriate.  

In [6]:
dp05 = dp05.replace([-888888888, -555555555, '(X)'], np.nan)
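
As a rough illustration of what this replacement does, here is a minimal sketch on a toy frame. The column names and values are hypothetical, not taken from the DP05 file. It also shows one possible follow-up that the notebook itself does not perform: a column that held '(X)' is read as strings, so it may need `pd.to_numeric` before any arithmetic.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) mimicking the three sentinel styles in DP05.
toy = pd.DataFrame(
    {
        "estimate": [29565, 595, -888888888],
        "margin": [-555555555, 181, 42],
        "annotation": ["(X)", "12.5", "(X)"],
    }
)

# Same replacement as the cell above: every sentinel becomes NaN.
cleaned = toy.replace([-888888888, -555555555, "(X)"], np.nan)

# Assumption: columns that contained "(X)" hold strings, so coerce them to numbers.
cleaned["annotation"] = pd.to_numeric(cleaned["annotation"], errors="coerce")

# Count the missing values per column after cleaning.
print(cleaned.isna().sum())
```

Note that '*****' is not in the replacement list; if that string appears in the data, it would have to be added separately.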

## Drop Margin of Error Columns

In [7]:
# Drop columns that contain 'Margin' in the name.
# https://stackoverflow.com/questions/406230/regular-expression-to-match-a-line-that-doesnt-contain-a-word
dp05 = dp05.filter(regex='^((?!Margin).)*$', axis=1)
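
The negative-lookahead pattern above can be hard to read, so the sketch below compares it with a boolean mask built from `Index.str.contains`. The toy column names are hypothetical and only imitate the DP05 header style. Both approaches should keep the same columns.

```python
import pandas as pd

# Hypothetical column names in the style of the DP05 headers.
toy = pd.DataFrame(
    columns=[
        "Estimate RACE Total population",
        "Margin of Error RACE Total population",
        "Percent Estimate RACE Total population",
        "Percent Margin of Error RACE Total population",
    ]
)

# Approach used above: keep names in which 'Margin' never appears.
by_regex = toy.filter(regex="^((?!Margin).)*$", axis=1)

# Alternative: flag names containing 'Margin', then invert the mask.
by_mask = toy.loc[:, ~toy.columns.str.contains("Margin", regex=False)]

print(list(by_regex.columns))
print(list(by_mask.columns))
```

The mask version is arguably easier to extend, for example to drop several keywords at once, but either form works here.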

# Create 'SEX AND AGE' Dataframe

In [8]:
# Extract the columns that have SEX AND AGE in the title
# and save to a new dataframe.
dp05_sex_age = dp05.filter(regex = '(SEX AND AGE)', axis=1)
dp05_sex_age.head(3)

Unnamed: 0_level_0,Estimate SEX AND AGE Total population 65 years and over Sex ratio (males per 100 females),Percent Estimate SEX AND AGE Total population 65 years and over Sex ratio (males per 100 females),Estimate SEX AND AGE Total population,Percent Estimate SEX AND AGE Total population,Estimate SEX AND AGE Total population Male,Percent Estimate SEX AND AGE Total population Male,Estimate SEX AND AGE Total population Female,Percent Estimate SEX AND AGE Total population Female,Estimate SEX AND AGE Total population Sex ratio (males per 100 females),Percent Estimate SEX AND AGE Total population Sex ratio (males per 100 females),...,Estimate SEX AND AGE Total population 18 years and over Female,Percent Estimate SEX AND AGE Total population 18 years and over Female,Estimate SEX AND AGE Total population 18 years and over Sex ratio (males per 100 females),Percent Estimate SEX AND AGE Total population 18 years and over Sex ratio (males per 100 females),Estimate SEX AND AGE Total population 65 years and over.1,Percent Estimate SEX AND AGE Total population 65 years and over.1,Estimate SEX AND AGE Total population 65 years and over Male,Percent Estimate SEX AND AGE Total population 65 years and over Male,Estimate SEX AND AGE Total population 65 years and over Female,Percent Estimate SEX AND AGE Total population 65 years and over Female
Geographic Area Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
"Austin County, Texas",87.0,,29565,29565,14684,49.7,14881,50.3,98.7,,...,11353,50.5,97.8,,5394,5394,2509,46.5,2885,53.5
"Kenedy County, Texas",34.7,,595,595,286,48.1,309,51.9,92.6,,...,238,55.6,79.8,,136,136,35,25.7,101,74.3
"Nueces County, Texas",79.3,,360486,360486,178020,49.4,182466,50.6,97.6,,...,138458,51.2,95.4,,49345,49345,21822,44.2,27523,55.8


In [9]:
# Drop the columns that have NaN values.
dp05_sex_age = dp05_sex_age.loc[:, (dp05_sex_age.isna().sum() < 1)]
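
The mask above keeps only the columns with zero missing values. A small sketch on a toy frame (hypothetical column names) suggests that `DataFrame.dropna(axis="columns")` gives the same result, which may read more directly.

```python
import numpy as np
import pandas as pd

# Toy frame: one complete column, one with a single gap, one entirely empty.
toy = pd.DataFrame(
    {
        "complete": [1.0, 2.0, 3.0],
        "one_gap": [1.0, np.nan, 3.0],
        "all_gaps": [np.nan, np.nan, np.nan],
    }
)

# Approach used above: keep columns whose NaN count is below 1.
by_mask = toy.loc[:, toy.isna().sum() < 1]

# Equivalent built-in: drop every column that contains at least one NaN.
by_dropna = toy.dropna(axis="columns")

print(list(by_mask.columns))
print(by_mask.equals(by_dropna))
```

Either way, a single missing county is enough to remove a column, so it is worth checking which columns were dropped before relying on the result.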

In [10]:
# Display information about the SEX and AGE dataframe
dp05_sex_age.info()

<class 'pandas.core.frame.DataFrame'>
Index: 254 entries, Austin County, Texas to Falls County, Texas
Data columns (total 60 columns):
 #   Column                                                                                     Non-Null Count  Dtype  
---  ------                                                                                     --------------  -----  
 0   Estimate SEX AND AGE Total population 65 years and over Sex ratio (males per 100 females)  254 non-null    float64
 1   Estimate SEX AND AGE Total population                                                      254 non-null    int64  
 2   Percent Estimate SEX AND AGE Total population                                              254 non-null    int64  
 3   Estimate SEX AND AGE Total population Male                                                 254 non-null    int64  
 4   Percent Estimate SEX AND AGE Total population Male                                         254 non-null    float64
 5   Estimate SEX AND AGE

In [11]:
# These are the columns to be used in the analysis.
columns_sex_age = [
    'Estimate SEX AND AGE Total population',
    'Estimate SEX AND AGE Total population Male',
    'Estimate SEX AND AGE Total population Female',
    'Estimate SEX AND AGE Total population Under 5 years',
    'Estimate SEX AND AGE Total population 5 to 9 years',
    'Estimate SEX AND AGE Total population 10 to 14 years',
    'Estimate SEX AND AGE Total population 15 to 19 years',
    'Estimate SEX AND AGE Total population 20 to 24 years',
    'Estimate SEX AND AGE Total population 25 to 34 years',
    'Estimate SEX AND AGE Total population 35 to 44 years',
    'Estimate SEX AND AGE Total population 45 to 54 years',
    'Estimate SEX AND AGE Total population 55 to 59 years',
    'Estimate SEX AND AGE Total population 60 to 64 years',
    'Estimate SEX AND AGE Total population 65 to 74 years',
    'Estimate SEX AND AGE Total population 75 to 84 years',
    'Estimate SEX AND AGE Total population 85 years and over',
    'Estimate SEX AND AGE Total population Median age (years)',
]

In [12]:
# Extract the important columns for gender and age analysis.
dp05_sex_age = dp05_sex_age[columns_sex_age]

In [13]:
# Display the dataframe
dp05_sex_age.head(3)

Unnamed: 0_level_0,Estimate SEX AND AGE Total population,Estimate SEX AND AGE Total population Male,Estimate SEX AND AGE Total population Female,Estimate SEX AND AGE Total population Under 5 years,Estimate SEX AND AGE Total population 5 to 9 years,Estimate SEX AND AGE Total population 10 to 14 years,Estimate SEX AND AGE Total population 15 to 19 years,Estimate SEX AND AGE Total population 20 to 24 years,Estimate SEX AND AGE Total population 25 to 34 years,Estimate SEX AND AGE Total population 35 to 44 years,Estimate SEX AND AGE Total population 45 to 54 years,Estimate SEX AND AGE Total population 55 to 59 years,Estimate SEX AND AGE Total population 60 to 64 years,Estimate SEX AND AGE Total population 65 to 74 years,Estimate SEX AND AGE Total population 75 to 84 years,Estimate SEX AND AGE Total population 85 years and over,Estimate SEX AND AGE Total population Median age (years)
Geographic Area Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
"Austin County, Texas",29565,14684,14881,1780,1960,2118,1861,1712,3339,3275,3821,2327,1978,3243,1532,619,40.7
"Kenedy County, Texas",595,286,309,85,37,40,10,10,95,47,75,51,9,85,29,22,39.5
"Nueces County, Texas",360486,178020,182466,24665,25055,24806,25524,26397,52547,45030,43503,22563,21051,28881,15165,5299,35.3


In [14]:
# Clean up column names.
dp05_sex_age.columns = dp05_sex_age.columns.str.lower().str.replace('estimate ', '').str.replace('sex and age', 'sex_age')

In [15]:
# Clean up column names.
dp05_sex_age.columns = dp05_sex_age.columns.str.replace('total population', 'pop').str.replace(' years', '').str.replace(' ', '_')

In [16]:
# Display columns.
dp05_sex_age.columns

Index(['sex_age_pop', 'sex_age_pop_male', 'sex_age_pop_female',
       'sex_age_pop_under_5', 'sex_age_pop_5_to_9', 'sex_age_pop_10_to_14',
       'sex_age_pop_15_to_19', 'sex_age_pop_20_to_24', 'sex_age_pop_25_to_34',
       'sex_age_pop_35_to_44', 'sex_age_pop_45_to_54', 'sex_age_pop_55_to_59',
       'sex_age_pop_60_to_64', 'sex_age_pop_65_to_74', 'sex_age_pop_75_to_84',
       'sex_age_pop_85_and_over', 'sex_age_pop_median_age_(years)'],
      dtype='object')

In [17]:
# Rename the median age column (drops 'pop' and the parentheses) for use in the downstream function.
dp05_sex_age = dp05_sex_age.rename(columns={'sex_age_pop_median_age_(years)': 'sex_age_median_age_in_years'})
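
The two clean-up cells above chain six string replacements. The sketch below gathers the same steps into one plain-Python helper so the order of operations is easy to follow and test. The name `clean_sex_age_name` is hypothetical and does not appear in the notebook.

```python
def clean_sex_age_name(name: str) -> str:
    """Apply the same replacements as the two clean-up cells, in the same order."""
    name = name.lower()
    name = name.replace("estimate ", "")
    name = name.replace("sex and age", "sex_age")
    name = name.replace("total population", "pop")
    name = name.replace(" years", "")
    name = name.replace(" ", "_")
    return name


# A few of the original DP05 column names used in this notebook.
examples = [
    "Estimate SEX AND AGE Total population",
    "Estimate SEX AND AGE Total population Under 5 years",
    "Estimate SEX AND AGE Total population 85 years and over",
    "Estimate SEX AND AGE Total population Median age (years)",
]

for original in examples:
    print(clean_sex_age_name(original))
```

Because `DataFrame.rename` accepts a function for `columns`, a helper like this could be applied in one step with `df.rename(columns=clean_sex_age_name)`. The median age column keeps its '(years)' suffix because ' years' (with a leading space) does not occur inside the parentheses, which is why the separate rename is still needed.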

In [18]:
# Export the cleaned county-level sex and age estimates.
dp05_sex_age.to_csv('./data/preprocessing/tx_dp05_sex_age_cleaned.csv')

# Create 'RACE' Dataframe

In [19]:
# Extract the columns that have RACE in the title (this also matches
# HISPANIC OR LATINO AND RACE) and save to a new dataframe.
dp05_race = dp05.filter(regex='(RACE)', axis=1)
dp05_race.head(3)

Unnamed: 0_level_0,Estimate RACE Total population,Percent Estimate RACE Total population,Estimate RACE Total population One race,Percent Estimate RACE Total population One race,Estimate RACE Total population Two or more races,Percent Estimate RACE Total population Two or more races,Estimate RACE Total population One race.1,Percent Estimate RACE Total population One race.1,Estimate RACE Total population One race White,Percent Estimate RACE Total population One race White,...,Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino Native Hawaiian and Other Pacific Islander alone,Percent Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino Native Hawaiian and Other Pacific Islander alone,Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino Some other race alone,Percent Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino Some other race alone,Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino Two or more races,Percent Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino Two or more races,Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino Two or more races Two races including Some other race,Percent Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino Two or more races Two races including Some other race,"Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino Two or more races Two races excluding Some other race, and Three or more races","Percent Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino Two or more races Two races excluding Some other race, and Three or more races"
Geographic Area Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
"Austin County, Texas",29565,29565,29004,98.1,561,1.9,29004,98.1,23810,80.5,...,0,0.0,52,0.2,456,1.5,0,0.0,456,1.5
"Kenedy County, Texas",595,595,595,100.0,0,0.0,595,100.0,573,96.3,...,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0
"Nueces County, Texas",360486,360486,354655,98.4,5831,1.6,354655,98.4,324198,89.9,...,242,0.1,226,0.1,2780,0.8,145,0.0,2635,0.7


In [20]:
# Drop the columns that have NaN values.
dp05_race = dp05_race.loc[:, (dp05_race.isna().sum() < 1)]

In [21]:
# Display information about the dataframe.
dp05_race.info()

<class 'pandas.core.frame.DataFrame'>
Index: 254 entries, Austin County, Texas to Falls County, Texas
Data columns (total 92 columns):
 #   Column                                                                                                                                                               Non-Null Count  Dtype  
---  ------                                                                                                                                                               --------------  -----  
 0   Estimate RACE Total population                                                                                                                                       254 non-null    int64  
 1   Percent Estimate RACE Total population                                                                                                                               254 non-null    int64  
 2   Estimate RACE Total population One race                                             

In [22]:
# Export the cleaned data for race (this file is overwritten by the final export below).
dp05_race.to_csv('./data/preprocessing/tx_dp05_race_cleaned.csv')

In [23]:
# The race columns are broken down into two basic parts, one without Hispanic origin and one with.
# The Hispanic section appears to be more accurate.

columns_non_hispanic = [
    'Estimate RACE Total population',
    'Estimate RACE Total population One race White',
    'Estimate RACE Total population One race Black or African American',
    'Estimate RACE Total population One race American Indian and Alaska Native',
    'Estimate RACE Total population One race Asian',
    'Estimate RACE Total population One race Some other race',
    'Estimate RACE Total population Two or more races.1'
]

# This grouping seems to be better
columns_hispanic = [
    'Estimate RACE Total population',
    'Estimate HISPANIC OR LATINO AND RACE Total population Hispanic or Latino (of any race)',
    'Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino White alone',
    'Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino Black or African American alone',
    'Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino American Indian and Alaska Native alone',
    'Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino Asian alone',
    'Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino Native Hawaiian and Other Pacific Islander alone',
    'Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino Some other race alone',
    'Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino Two or more races'
]

In [24]:
# Extract the important columns for analysis.
dp05_race = dp05_race[columns_hispanic]

In [25]:
dp05_race.columns

Index(['Estimate RACE Total population',
       'Estimate HISPANIC OR LATINO AND RACE Total population Hispanic or Latino (of any race)',
       'Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino White alone',
       'Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino Black or African American alone',
       'Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino American Indian and Alaska Native alone',
       'Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino Asian alone',
       'Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino Native Hawaiian and Other Pacific Islander alone',
       'Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino Some other race alone',
       'Estimate HISPANIC OR LATINO AND RACE Total population Not Hispanic or Latino Two or more races'],
      dtype='object')

In [26]:
# Clean up column names.
dp05_race.columns = dp05_race.columns.str.lower().str.replace('estimate hispanic or latino and race total population', 'race pop').str.replace('not hispanic or latino ', '')

In [27]:
# Clean up column names.
dp05_race.columns = dp05_race.columns.str.replace('estimate race total population', 'race pop').str.replace(' ', '_')

In [28]:
# Display columns.
dp05_race.columns

Index(['race_pop', 'race_pop_hispanic_or_latino_(of_any_race)',
       'race_pop_white_alone', 'race_pop_black_or_african_american_alone',
       'race_pop_american_indian_and_alaska_native_alone',
       'race_pop_asian_alone',
       'race_pop_native_hawaiian_and_other_pacific_islander_alone',
       'race_pop_some_other_race_alone', 'race_pop_two_or_more_races'],
      dtype='object')

In [29]:
# Export the cleaned county-level race estimates.
dp05_race.to_csv('./data/preprocessing/tx_dp05_race_cleaned.csv')
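
One way to probe the earlier remark that the Hispanic grouping "seems to be better" is to check whether the group columns add up to the total population for each county. The sketch below does this on a toy frame with made-up numbers and only a subset of the cleaned column names. It assumes the groups are meant to be mutually exclusive, which should be confirmed against the ACS table documentation.

```python
import pandas as pd

# Toy data (hypothetical counts) using a subset of the cleaned column names.
toy = pd.DataFrame(
    {
        "race_pop": [1000, 500],
        "race_pop_hispanic_or_latino_(of_any_race)": [300, 100],
        "race_pop_white_alone": [500, 350],
        "race_pop_black_or_african_american_alone": [120, 30],
        "race_pop_two_or_more_races": [80, 15],
    },
    index=["County A", "County B"],
)

# Every column except the total is treated as one group.
group_columns = [c for c in toy.columns if c != "race_pop"]

# Difference between the reported total and the sum of the groups.
difference = toy["race_pop"] - toy[group_columns].sum(axis=1)

# Counties where the groups do not add up to the total.
mismatched = difference[difference != 0]
print(mismatched)
```

Running the same check on `dp05_race` would show whether any county's groups fail to sum to `race_pop`.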