# Data Processing and Integration

This notebook cleans and merges the datasets so that they are consistent and can be analyzed together.

**Main data**
- 2021 Childhood Blood Lead Surveillance: State Data

Reference: [CDC](https://www.cdc.gov/lead-prevention/php/data/state-surveillance-data.html)

**Supplementary data**
- Age of housing
- Insurance coverage
- Child poverty
- Housing insecurity (rented vs. owned)
- Parents' occupation

Reference: [Census](https://data.census.gov/)

## 1. Load Data

**1) Main data**

In [1]:
import pandas as pd
import numpy as np

In [64]:
lead_df = pd.read_csv("data/2021-blood-lead-by-state-county.csv", encoding="ISO-8859-1")

In [65]:
lead_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2159 entries, 0 to 2158
Data columns (total 15 columns):
 #   Column                                              Non-Null Count  Dtype  
---  ------                                              --------------  -----  
 0   State                                               2159 non-null   object 
 1   County                                              2159 non-null   object 
 2   year                                                1602 non-null   float64
 3   Total Population of Children <72 Months of Age      1496 non-null   float64
 4   Number of Children Tested <72 Months of Age         1597 non-null   object 
 5   Number of Children with Confirmed BLLs ³5 µg/dL     1508 non-null   object 
 6   Percent of Children with Confirmed BLLs ³5 µg/dL    2158 non-null   object 
 7   Number of Children with Confirmed BLLs ³10 µg/dL    1496 non-null   object 
 8   Percent of Children with Confirmed BLLs ³10 µg/dL   1496 non-null   object 
 9

In [4]:
lead_df.head()

Unnamed: 0,State,County,year,Total Population of Children <72 Months of Age,Number of Children Tested <72 Months of Age,Number of Children with Confirmed BLLs ³5 µg/dL,Percent of Children with Confirmed BLLs ³5 µg/dL,Number of Children with Confirmed BLLs ³10 µg/dL,Percent of Children with Confirmed BLLs ³10 µg/dL,Number of Children with Confirmed BLLs 5-9 µg/dL,Number of Children with Confirmed BLLs 10-14 µg/dL,Number of Children with Confirmed BLLs 15-19 µg/dL,Number of Children with Confirmed BLLs 20-24 µg/dL,Number of Children with Confirmed BLLs 25-44 µg/dL,Number of Children with Confirmed BLLs ³45 µg/dL
0,AL,Autauga,2021.0,4045.0,238,SD,SD,SD,SD,0,SD,0,0,0,0
1,AL,Baldwin,2021.0,14651.0,552,SD,SD,0,0.00%,SD,0,0,0,0,0
2,AL,Barbour,2021.0,1571.0,268,SD,SD,SD,SD,0,SD,SD,0,0,0
3,AL,Bibb,2021.0,1459.0,105,SD,SD,SD,SD,0,SD,0,0,0,0
4,AL,Blount,2021.0,4148.0,365,0,0.00%,0,0.00%,0,0,0,0,0,0
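
The preview shows that most count and percent columns are stored as text: some cells hold `SD`, percentages carry a `%` sign, and the headers contain `³` where `≥` was presumably intended. The sketch below shows one possible way to make these columns numeric. It assumes that `SD` marks a suppressed value that should become `NaN` and that `³` is a mis-decoded `≥`; the helper name `clean_lead_columns` and its `id_cols` parameter are illustrative and not part of the original notebook.

```python
import pandas as pd


def clean_lead_columns(df, id_cols=("State", "County")):
    """Return a copy of df whose non-identifier columns are numeric."""
    out = df.copy()
    # Assumption: "³" in the headers stands for a mis-decoded "≥"
    out.columns = out.columns.str.replace("³", "≥", regex=False).str.strip()
    for col in out.columns:
        if col in id_cols:
            continue
        text = out[col].astype(str).str.strip()
        # Drop thousands separators and the trailing percent sign
        text = text.str.replace(",", "", regex=False).str.rstrip("%")
        # Assumption: "SD" means suppressed; it and any other
        # non-numeric text are coerced to NaN
        out[col] = pd.to_numeric(text, errors="coerce")
    return out


# Toy rows in the same shape as the preview above
toy = pd.DataFrame({
    "State": ["AL", "AL"],
    "County": ["Autauga", "Blount"],
    "Number of Children Tested <72 Months of Age": ["238", "365"],
    "Percent of Children with Confirmed BLLs ³5 µg/dL": ["SD", "0.00%"],
})
cleaned_toy = clean_lead_columns(toy)
print(cleaned_toy.dtypes)
```

Percent columns stay in percent units after this step (`0.00%` becomes `0.0`), so they would still need dividing by 100 to be comparable with proportions.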


**2) Supplementary data**

In [66]:
# Housing-related tables
house_built_year_df = pd.read_csv("data/2021-year-home-built.csv", skiprows=1)
house_plumbing_df = pd.read_csv("data/2021-plumbing.csv", skiprows=1)
house_price_df = pd.read_csv("data/2021-price.csv", skiprows=1)

# Insurance
insurance_df = pd.read_csv("data/2021.AGE BY HEALTH INSURANCE COVERAGE STATUS.K202701-Data.csv")

# Poverty
poverty_df = pd.read_csv("data/2021.POVERTY STATUS IN THE PAST 12 MONTHS BY AGE.K201701-Data.csv")

# Housing insecurity (tenure: owner- vs. renter-occupied)
insecurity_df = pd.read_csv("data/2021.TENURE.B25003.csv")

# Parents' occupation and industry
occupation_df = pd.read_csv("data/2021. OCCUPATION FOR THE CIVILIAN EMPLOYED POPULATION 16 YEARS AND OVER.K202401-Data.csv")
industry_df = pd.read_csv("data/2021.INDUSTRY FOR THE CIVILIAN EMPLOYED POPULATION 16 YEARS AND OVER.K202403-Data.csv")

## 2. Cleaning the Data

### 1) Housing-related data

In [41]:
house_built_year_df.info()
house_built_year_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3221 entries, 0 to 3220
Data columns (total 25 columns):
 #   Column                                          Non-Null Count  Dtype  
---  ------                                          --------------  -----  
 0   Geography                                       3221 non-null   object 
 1   Geographic Area Name                            3221 non-null   object 
 2   Estimate!!Total:                                3221 non-null   int64  
 3   Margin of Error!!Total:                         3221 non-null   int64  
 4   Estimate!!Total:!!Built 2020 or later           3221 non-null   float64
 5   Margin of Error!!Total:!!Built 2020 or later    3221 non-null   int64  
 6   Estimate!!Total:!!Built 2010 to 2019            3221 non-null   float64
 7   Margin of Error!!Total:!!Built 2010 to 2019     3221 non-null   int64  
 8   Estimate!!Total:!!Built 2000 to 2009            3221 non-null   float64
 9   Margin of Error!!Total:!!Built 2000 to 20

Unnamed: 0,Geography,Geographic Area Name,Estimate!!Total:,Margin of Error!!Total:,Estimate!!Total:!!Built 2020 or later,Margin of Error!!Total:!!Built 2020 or later,Estimate!!Total:!!Built 2010 to 2019,Margin of Error!!Total:!!Built 2010 to 2019,Estimate!!Total:!!Built 2000 to 2009,Margin of Error!!Total:!!Built 2000 to 2009,...,Margin of Error!!Total:!!Built 1970 to 1979,Estimate!!Total:!!Built 1960 to 1969,Margin of Error!!Total:!!Built 1960 to 1969,Estimate!!Total:!!Built 1950 to 1959,Margin of Error!!Total:!!Built 1950 to 1959,Estimate!!Total:!!Built 1940 to 1949,Margin of Error!!Total:!!Built 1940 to 1949,Estimate!!Total:!!Built 1939 or earlier,Margin of Error!!Total:!!Built 1939 or earlier,Unnamed: 24
0,0500000US01001,"Autauga County, Alabama",24170,70,0.002938,82,0.101283,392,0.216177,566,...,587,0.116922,437,0.035581,238,0.013529,123,0.02077,199,
1,0500000US01003,"Baldwin County, Alabama",121763,201,0.003622,193,0.135008,1146,0.272086,1319,...,870,0.051099,699,0.023891,528,0.014134,441,0.022215,494,
2,0500000US01005,"Barbour County, Alabama",11667,141,0.0,24,0.036856,124,0.093512,178,...,236,0.124882,201,0.085712,199,0.049713,176,0.076198,167,
3,0500000US01007,"Bibb County, Alabama",9013,86,0.0,24,0.068679,146,0.148341,230,...,254,0.110618,299,0.049706,221,0.029735,123,0.071896,252,
4,0500000US01009,"Blount County, Alabama",24527,93,0.000897,25,0.055123,292,0.194194,461,...,503,0.093774,388,0.047825,336,0.039711,198,0.05076,284,


In [42]:
house_plumbing_df.info()
house_plumbing_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3221 entries, 0 to 3220
Data columns (total 9 columns):
 #   Column                                                         Non-Null Count  Dtype  
---  ------                                                         --------------  -----  
 0   Geography                                                      3221 non-null   object 
 1   Geographic Area Name                                           3221 non-null   object 
 2   Estimate!!Total:                                               3221 non-null   int64  
 3   Margin of Error!!Total:                                        3221 non-null   int64  
 4   Estimate!!Total:!!Complete plumbing facilities                 3221 non-null   int64  
 5   Margin of Error!!Total:!!Complete plumbing facilities          3221 non-null   int64  
 6   Estimate!!Total:!!Lacking complete plumbing facilities         3221 non-null   int64  
 7   Margin of Error!!Total:!!Lacking complete plumbing facilities

Unnamed: 0,Geography,Geographic Area Name,Estimate!!Total:,Margin of Error!!Total:,Estimate!!Total:!!Complete plumbing facilities,Margin of Error!!Total:!!Complete plumbing facilities,Estimate!!Total:!!Lacking complete plumbing facilities,Margin of Error!!Total:!!Lacking complete plumbing facilities,Unnamed: 8
0,0500000US01001,"Autauga County, Alabama",21856,424,21746,429,110,91,
1,0500000US01003,"Baldwin County, Alabama",87190,1307,86952,1314,238,148,
2,0500000US01005,"Barbour County, Alabama",9088,301,9069,302,19,20,
3,0500000US01007,"Bibb County, Alabama",7083,289,7034,289,49,49,
4,0500000US01009,"Blount County, Alabama",21300,411,21181,429,119,98,


In [25]:
house_price_df.info()
house_price_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3218 entries, 0 to 3217
Data columns (total 59 columns):
 #   Column                                             Non-Null Count  Dtype  
---  ------                                             --------------  -----  
 0   Geography                                          3218 non-null   object 
 1   Geographic Area Name                               3218 non-null   object 
 2   Race/Ethnic Group                                  3218 non-null   int64  
 3   Population Groups                                  3218 non-null   object 
 4   Estimate!!Total:                                   3218 non-null   int64  
 5   Margin of Error!!Total:                            3218 non-null   int64  
 6   Estimate!!Total:!!Less than $10,000                3218 non-null   int64  
 7   Margin of Error!!Total:!!Less than $10,000         3218 non-null   int64  
 8   Estimate!!Total:!!$10,000 to $14,999               3218 non-null   int64  
 9   Margin o

Unnamed: 0,Geography,Geographic Area Name,Race/Ethnic Group,Population Groups,Estimate!!Total:,Margin of Error!!Total:,"Estimate!!Total:!!Less than $10,000","Margin of Error!!Total:!!Less than $10,000","Estimate!!Total:!!$10,000 to $14,999","Margin of Error!!Total:!!$10,000 to $14,999",...,"Margin of Error!!Total:!!$500,000 to $749,999","Estimate!!Total:!!$750,000 to $999,999","Margin of Error!!Total:!!$750,000 to $999,999","Estimate!!Total:!!$1,000,000 to $1,499,999","Margin of Error!!Total:!!$1,000,000 to $1,499,999","Estimate!!Total:!!$1,500,000 to $1,999,999","Margin of Error!!Total:!!$1,500,000 to $1,999,999","Estimate!!Total:!!$2,000,000 or more","Margin of Error!!Total:!!$2,000,000 or more",Unnamed: 58
0,0500000US01001,"Autauga County, Alabama",1,Total population,16227,520,525,153,325,132,...,193,31,49,2,5,0,30,12,15,
1,0500000US01003,"Baldwin County, Alabama",1,Total population,67242,1296,778,271,621,170,...,564,1874,526,943,294,115,71,435,331,
2,0500000US01005,"Barbour County, Alabama",1,Total population,5654,299,210,93,302,141,...,56,46,31,31,30,0,24,0,24,
3,0500000US01007,"Bibb County, Alabama",1,Total population,5580,408,408,168,96,77,...,22,0,24,44,56,25,43,0,24,
4,0500000US01009,"Blount County, Alabama",1,Total population,16865,661,441,139,292,119,...,84,235,121,52,52,0,30,51,42,


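- The three housing tables must be on a common scale before they are merged, so the estimate columns in each are converted to shares of `Estimate!!Total:`. In the previews above, `house_built_year_df` already shows proportions, while `house_plumbing_df` and `house_price_df` still show raw counts.

**Normalize the housing tables before merging**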

In [67]:
# Clean the column names
house_built_year_df.columns = house_built_year_df.columns.str.strip()
house_plumbing_df.columns = house_plumbing_df.columns.str.strip()
house_price_df.columns = house_price_df.columns.str.strip()

In [36]:
house_built_year_df.shape

(3221, 25)

In [68]:
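def normalize_df(df):
    """Convert each 'Estimate!!Total:...' column to a share of 'Estimate!!Total:'."""
    # Work on a copy so the caller's DataFrame is not modified in place
    df = df.copy()
    total_col = "Estimate!!Total:"
    df[total_col] = pd.to_numeric(df[total_col], errors='coerce')

    for col in df.columns:
        if col.startswith("Estimate!!Total:") and col != total_col:
            df[col] = pd.to_numeric(df[col], errors='coerce') / df[total_col]

    # The total and the geography code are not needed after normalization
    df = df.drop(columns=["Estimate!!Total:", "Geography"], errors="ignore")

    return df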

In [69]:
normalized_house_built_year_df = normalize_df(house_built_year_df).drop(columns=['Unnamed: 24'])
normalized_house_built_year_df.head()

Unnamed: 0,Geographic Area Name,Margin of Error!!Total:,Estimate!!Total:!!Built 2020 or later,Margin of Error!!Total:!!Built 2020 or later,Estimate!!Total:!!Built 2010 to 2019,Margin of Error!!Total:!!Built 2010 to 2019,Estimate!!Total:!!Built 2000 to 2009,Margin of Error!!Total:!!Built 2000 to 2009,Estimate!!Total:!!Built 1990 to 1999,Margin of Error!!Total:!!Built 1990 to 1999,...,Estimate!!Total:!!Built 1970 to 1979,Margin of Error!!Total:!!Built 1970 to 1979,Estimate!!Total:!!Built 1960 to 1969,Margin of Error!!Total:!!Built 1960 to 1969,Estimate!!Total:!!Built 1950 to 1959,Margin of Error!!Total:!!Built 1950 to 1959,Estimate!!Total:!!Built 1940 to 1949,Margin of Error!!Total:!!Built 1940 to 1949,Estimate!!Total:!!Built 1939 or earlier,Margin of Error!!Total:!!Built 1939 or earlier
0,"Autauga County, Alabama",70,0.002938,82,0.101283,392,0.216177,566,0.215929,620,...,0.178486,587,0.116922,437,0.035581,238,0.013529,123,0.02077,199
1,"Baldwin County, Alabama",201,0.003622,193,0.135008,1146,0.272086,1319,0.234956,1517,...,0.090183,870,0.051099,699,0.023891,528,0.014134,441,0.022215,494
2,"Barbour County, Alabama",141,0.0,24,0.036856,124,0.093512,178,0.218651,337,...,0.150081,236,0.124882,201,0.085712,199,0.049713,176,0.076198,167
3,"Bibb County, Alabama",86,0.0,24,0.068679,146,0.148341,230,0.203817,309,...,0.156774,254,0.110618,299,0.049706,221,0.029735,123,0.071896,252
4,"Blount County, Alabama",93,0.000897,25,0.055123,292,0.194194,461,0.226077,548,...,0.148612,503,0.093774,388,0.047825,336,0.039711,198,0.05076,284


In [70]:
normalized_house_built_year_df.shape

(3221, 22)
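
A quick sanity check on the normalization is that the shares in each row add up to roughly 1. The sketch below illustrates the idea on toy values copied from the first two rows of the plumbing preview. The helper name `estimate_share_sums` is hypothetical, and the check only applies to tables whose estimate columns are mutually exclusive and exhaustive (such as year built or plumbing), not to nested tables where subtotals sit next to their parts.

```python
import numpy as np
import pandas as pd


def estimate_share_sums(df, prefix="Estimate!!Total:!!"):
    """Sum the share columns of each row (about 1.0 for exhaustive categories)."""
    share_cols = [c for c in df.columns if c.startswith(prefix)]
    return df[share_cols].sum(axis=1)


# Toy frame: plumbing shares for two counties
toy_shares = pd.DataFrame({
    "Geographic Area Name": ["Autauga County, Alabama", "Baldwin County, Alabama"],
    "Estimate!!Total:!!Complete plumbing facilities": [0.994967, 0.99727],
    "Estimate!!Total:!!Lacking complete plumbing facilities": [0.005033, 0.00273],
})

row_sums = estimate_share_sums(toy_shares)
# Allow a small tolerance for rounding in the displayed values
all_close = bool(np.isclose(row_sums, 1.0, atol=1e-5).all())
print(all_close)
```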

In [71]:
normalized_house_plumbing_df = normalize_df(house_plumbing_df).drop(columns=['Unnamed: 8'])
normalized_house_plumbing_df.head()

Unnamed: 0,Geographic Area Name,Margin of Error!!Total:,Estimate!!Total:!!Complete plumbing facilities,Margin of Error!!Total:!!Complete plumbing facilities,Estimate!!Total:!!Lacking complete plumbing facilities,Margin of Error!!Total:!!Lacking complete plumbing facilities
0,"Autauga County, Alabama",424,0.994967,429,0.005033,91
1,"Baldwin County, Alabama",1307,0.99727,1314,0.00273,148
2,"Barbour County, Alabama",301,0.997909,302,0.002091,20
3,"Bibb County, Alabama",289,0.993082,289,0.006918,49
4,"Blount County, Alabama",411,0.994413,429,0.005587,98


In [72]:
normalized_house_price_df = normalize_df(house_price_df).drop(columns=['Unnamed: 58','Race/Ethnic Group','Population Groups'])
normalized_house_price_df.head()

Unnamed: 0,Geographic Area Name,Margin of Error!!Total:,"Estimate!!Total:!!Less than $10,000","Margin of Error!!Total:!!Less than $10,000","Estimate!!Total:!!$10,000 to $14,999","Margin of Error!!Total:!!$10,000 to $14,999","Estimate!!Total:!!$15,000 to $19,999","Margin of Error!!Total:!!$15,000 to $19,999","Estimate!!Total:!!$20,000 to $24,999","Margin of Error!!Total:!!$20,000 to $24,999",...,"Estimate!!Total:!!$500,000 to $749,999","Margin of Error!!Total:!!$500,000 to $749,999","Estimate!!Total:!!$750,000 to $999,999","Margin of Error!!Total:!!$750,000 to $999,999","Estimate!!Total:!!$1,000,000 to $1,499,999","Margin of Error!!Total:!!$1,000,000 to $1,499,999","Estimate!!Total:!!$1,500,000 to $1,999,999","Margin of Error!!Total:!!$1,500,000 to $1,999,999","Estimate!!Total:!!$2,000,000 or more","Margin of Error!!Total:!!$2,000,000 or more"
0,"Autauga County, Alabama",520,0.032353,153,0.020028,132,0.019042,183,0.008936,107,...,0.023541,193,0.00191,49,0.000123,5,0.0,30,0.00074,15
1,"Baldwin County, Alabama",1296,0.01157,271,0.009235,170,0.008209,196,0.006038,153,...,0.063561,564,0.027869,526,0.014024,294,0.00171,71,0.006469,331
2,"Barbour County, Alabama",299,0.037142,93,0.053414,141,0.039795,98,0.034135,76,...,0.019632,56,0.008136,31,0.005483,30,0.0,24,0.0,24
3,"Bibb County, Alabama",408,0.073118,168,0.017204,77,0.014875,60,0.018459,80,...,0.003405,22,0.0,24,0.007885,56,0.00448,43,0.0,24
4,"Blount County, Alabama",661,0.026149,139,0.017314,119,0.010258,84,0.015357,114,...,0.013104,84,0.013934,121,0.003083,52,0.0,30,0.003024,42
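
The calls above drop `'Unnamed: 24'`, `'Unnamed: 8'`, and `'Unnamed: 58'` by hand, which depends on the exact column count of each file. As a sketch of a less brittle alternative, the hypothetical helper below removes every column whose header starts with `Unnamed:`. It assumes these columns are empty placeholders (presumably produced by a trailing delimiter in the CSV), so it should not be used if a real column happens to carry such a name.

```python
import pandas as pd


def drop_unnamed_columns(df):
    """Return df without columns whose header starts with 'Unnamed:'."""
    keep = ~df.columns.astype(str).str.startswith("Unnamed:")
    return df.loc[:, keep]


# Toy frame with an empty trailing column like the ones in the housing tables
toy_unnamed = pd.DataFrame({
    "Geographic Area Name": ["Autauga County, Alabama"],
    "Estimate!!Total:": [21856],
    "Unnamed: 8": [float("nan")],
})
trimmed = drop_unnamed_columns(toy_unnamed)
print(list(trimmed.columns))
```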


**Merge data by 'Geographic Area Name'**

In [73]:
merged_house_df = normalized_house_built_year_df.merge(normalized_house_plumbing_df, on="Geographic Area Name", how="inner") \
                              .merge(normalized_house_price_df, on="Geographic Area Name", how="inner")


merged_house_df.head()

Unnamed: 0,Geographic Area Name,Margin of Error!!Total:_x,Estimate!!Total:!!Built 2020 or later,Margin of Error!!Total:!!Built 2020 or later,Estimate!!Total:!!Built 2010 to 2019,Margin of Error!!Total:!!Built 2010 to 2019,Estimate!!Total:!!Built 2000 to 2009,Margin of Error!!Total:!!Built 2000 to 2009,Estimate!!Total:!!Built 1990 to 1999,Margin of Error!!Total:!!Built 1990 to 1999,...,"Estimate!!Total:!!$500,000 to $749,999","Margin of Error!!Total:!!$500,000 to $749,999","Estimate!!Total:!!$750,000 to $999,999","Margin of Error!!Total:!!$750,000 to $999,999","Estimate!!Total:!!$1,000,000 to $1,499,999","Margin of Error!!Total:!!$1,000,000 to $1,499,999","Estimate!!Total:!!$1,500,000 to $1,999,999","Margin of Error!!Total:!!$1,500,000 to $1,999,999","Estimate!!Total:!!$2,000,000 or more","Margin of Error!!Total:!!$2,000,000 or more"
0,"Autauga County, Alabama",70,0.002938,82,0.101283,392,0.216177,566,0.215929,620,...,0.023541,193,0.00191,49,0.000123,5,0.0,30,0.00074,15
1,"Baldwin County, Alabama",201,0.003622,193,0.135008,1146,0.272086,1319,0.234956,1517,...,0.063561,564,0.027869,526,0.014024,294,0.00171,71,0.006469,331
2,"Barbour County, Alabama",141,0.0,24,0.036856,124,0.093512,178,0.218651,337,...,0.019632,56,0.008136,31,0.005483,30,0.0,24,0.0,24
3,"Bibb County, Alabama",86,0.0,24,0.068679,146,0.148341,230,0.203817,309,...,0.003405,22,0.0,24,0.007885,56,0.00448,43,0.0,24
4,"Blount County, Alabama",93,0.000897,25,0.055123,292,0.194194,461,0.226077,548,...,0.013104,84,0.013934,121,0.003083,52,0.0,30,0.003024,42


In [46]:
merged_house_df.shape


(3218, 85)
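
The merged table has 3218 rows, while the year-built and plumbing tables each have 3221, so the inner join leaves out any area that is missing from at least one table. The sketch below shows one way to list the names that fail to match, using an outer merge with pandas' `indicator=True` option. The function name `unmatched_keys` and the toy frames are illustrative only.

```python
import pandas as pd


def unmatched_keys(left, right, key="Geographic Area Name"):
    """List key values that appear in only one of the two frames."""
    probe = left[[key]].merge(right[[key]], on=key, how="outer", indicator=True)
    # "_merge" is "both", "left_only", or "right_only"
    return probe.loc[probe["_merge"] != "both"].reset_index(drop=True)


# Toy frames: the right-hand table lacks one county
toy_left = pd.DataFrame({"Geographic Area Name": [
    "Autauga County, Alabama", "Baldwin County, Alabama", "Barbour County, Alabama"]})
toy_right = pd.DataFrame({"Geographic Area Name": [
    "Autauga County, Alabama", "Baldwin County, Alabama"]})

missing = unmatched_keys(toy_left, toy_right)
print(missing)
```

The `_x` suffix on `Margin of Error!!Total:` in the merged preview arises because that column name exists in more than one table; dropping the margin-of-error columns before merging would avoid the suffixes.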

**Drop the margin-of-error columns**

In [74]:
def drop_margin_of_error_columns(df):
    return df.drop(columns=[col for col in df.columns if col.startswith("Margin of Error")], errors='ignore')

In [75]:
cleaned_house_df = drop_margin_of_error_columns(merged_house_df)

cleaned_house_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3218 entries, 0 to 3217
Data columns (total 39 columns):
 #   Column                                                  Non-Null Count  Dtype  
---  ------                                                  --------------  -----  
 0   Geographic Area Name                                    3218 non-null   object 
 1   Estimate!!Total:!!Built 2020 or later                   3218 non-null   float64
 2   Estimate!!Total:!!Built 2010 to 2019                    3218 non-null   float64
 3   Estimate!!Total:!!Built 2000 to 2009                    3218 non-null   float64
 4   Estimate!!Total:!!Built 1990 to 1999                    3218 non-null   float64
 5   Estimate!!Total:!!Built 1980 to 1989                    3218 non-null   float64
 6   Estimate!!Total:!!Built 1970 to 1979                    3218 non-null   float64
 7   Estimate!!Total:!!Built 1960 to 1969                    3218 non-null   float64
 8   Estimate!!Total:!!Built 1950 to 1959  
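
The remaining column names all repeat the `Estimate!!Total:!!` prefix, which makes later selection and plotting verbose. As an optional sketch, the hypothetical helper below strips that prefix. It is not part of the original workflow, and it assumes that the shortened labels stay unique across the merged tables.

```python
import pandas as pd


def shorten_estimate_columns(df, prefix="Estimate!!Total:!!"):
    """Return df with the common estimate prefix removed from column names."""
    return df.rename(
        columns=lambda c: c[len(prefix):] if c.startswith(prefix) else c
    )


# Toy frame with two of the column names from the housing table
toy_long = pd.DataFrame({
    "Geographic Area Name": ["Autauga County, Alabama"],
    "Estimate!!Total:!!Built 2020 or later": [0.002938],
    "Estimate!!Total:!!Complete plumbing facilities": [0.994967],
})
toy_short = shorten_estimate_columns(toy_long)
print(list(toy_short.columns))
```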

### 2) Insurance

In [76]:
insurance_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1905 entries, 0 to 1904
Data columns (total 21 columns):
 #   Column                                                                       Non-Null Count  Dtype 
---  ------                                                                       --------------  ----- 
 0   Geographic Area Name                                                         1905 non-null   object
 1   Estimate!!Total:                                                             1905 non-null   int64 
 2   Margin of Error!!Total:                                                      1905 non-null   int64 
 3   Estimate!!Total:!!Under 19 years:                                            1905 non-null   int64 
 4   Margin of Error!!Total:!!Under 19 years:                                     1905 non-null   int64 
 5   Estimate!!Total:!!Under 19 years:!!With health insurance coverage            1905 non-null   int64 
 6   Margin of Error!!Total:!!Under 19 years:!!With h

In [77]:
insurance_df.head()

Unnamed: 0,Geographic Area Name,Estimate!!Total:,Margin of Error!!Total:,Estimate!!Total:!!Under 19 years:,Margin of Error!!Total:!!Under 19 years:,Estimate!!Total:!!Under 19 years:!!With health insurance coverage,Margin of Error!!Total:!!Under 19 years:!!With health insurance coverage,Estimate!!Total:!!Under 19 years:!!No health insurance coverage,Margin of Error!!Total:!!Under 19 years:!!No health insurance coverage,Estimate!!Total:!!19 to 64 years:,...,Estimate!!Total:!!19 to 64 years:!!With health insurance coverage,Margin of Error!!Total:!!19 to 64 years:!!With health insurance coverage,Estimate!!Total:!!19 to 64 years:!!No health insurance coverage,Margin of Error!!Total:!!19 to 64 years:!!No health insurance coverage,Estimate!!Total:!!65 years and over:,Margin of Error!!Total:!!65 years and over:,Estimate!!Total:!!65 years and over:!!With health insurance coverage,Margin of Error!!Total:!!65 years and over:!!With health insurance coverage,Estimate!!Total:!!65 years and over:!!No health insurance coverage,Margin of Error!!Total:!!65 years and over:!!No health insurance coverage
0,"Autauga County, Alabama",56855,1194,14580,770,14169,743,411,333,33031,...,29728,1781,3303,1317,9244,499,9184,517,60,102
1,"Baldwin County, Alabama",235756,2108,52496,1056,49496,1488,3000,1270,132966,...,115945,3431,17021,2775,50294,1384,49522,1018,772,1034
2,"Barbour County, Alabama",21787,946,5563,427,5548,445,15,38,11783,...,10368,1069,1415,651,4441,449,4441,449,0,216
3,"Bibb County, Alabama",21544,1444,4365,754,4232,777,133,211,13170,...,11940,1346,1230,711,4009,1106,4009,1106,0,216
4,"Blount County, Alabama",58690,266,14112,389,13838,488,274,248,34088,...,28940,1262,5148,1233,10490,270,10490,270,0,216


**Clean the column names**


In [78]:
insurance_df.columns = insurance_df.columns.str.strip()

**Drop the margin-of-error columns**

In [79]:
insurance_cleaned_df = drop_margin_of_error_columns(insurance_df)
insurance_cleaned_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1905 entries, 0 to 1904
Data columns (total 11 columns):
 #   Column                                                                Non-Null Count  Dtype 
---  ------                                                                --------------  ----- 
 0   Geographic Area Name                                                  1905 non-null   object
 1   Estimate!!Total:                                                      1905 non-null   int64 
 2   Estimate!!Total:!!Under 19 years:                                     1905 non-null   int64 
 3   Estimate!!Total:!!Under 19 years:!!With health insurance coverage     1905 non-null   int64 
 4   Estimate!!Total:!!Under 19 years:!!No health insurance coverage       1905 non-null   int64 
 5   Estimate!!Total:!!19 to 64 years:                                     1905 non-null   int64 
 6   Estimate!!Total:!!19 to 64 years:!!With health insurance coverage     1905 non-null   int64 
 7   Estima

**Normalize the data**


In [80]:
cleaned_insurance_df = normalize_df(insurance_cleaned_df)

In [81]:
cleaned_insurance_df.head()

Unnamed: 0,Geographic Area Name,Estimate!!Total:!!Under 19 years:,Estimate!!Total:!!Under 19 years:!!With health insurance coverage,Estimate!!Total:!!Under 19 years:!!No health insurance coverage,Estimate!!Total:!!19 to 64 years:,Estimate!!Total:!!19 to 64 years:!!With health insurance coverage,Estimate!!Total:!!19 to 64 years:!!No health insurance coverage,Estimate!!Total:!!65 years and over:,Estimate!!Total:!!65 years and over:!!With health insurance coverage,Estimate!!Total:!!65 years and over:!!No health insurance coverage
0,"Autauga County, Alabama",0.256442,0.249213,0.007229,0.580969,0.522874,0.058095,0.162589,0.161534,0.001055
1,"Baldwin County, Alabama",0.222671,0.209946,0.012725,0.563998,0.491801,0.072198,0.213331,0.210056,0.003275
2,"Barbour County, Alabama",0.255336,0.254647,0.000688,0.540827,0.47588,0.064947,0.203837,0.203837,0.0
3,"Bibb County, Alabama",0.202609,0.196435,0.006173,0.611307,0.554215,0.057092,0.186084,0.186084,0.0
4,"Blount County, Alabama",0.24045,0.235781,0.004669,0.580814,0.493099,0.087715,0.178736,0.178736,0.0
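
Because `normalize_df` divides every estimate by the overall total, `Under 19 years:!!No health insurance coverage` is the share of the whole population that is both under 19 and uninsured, not the uninsured rate among children. If the within-group rate is what the analysis needs, it can be recovered by dividing by the age-group share. The sketch below illustrates this with the first two rows of the table above; `within_group_rate` is a hypothetical helper name.

```python
import pandas as pd

GROUP = "Estimate!!Total:!!Under 19 years:"
UNINSURED = "Estimate!!Total:!!Under 19 years:!!No health insurance coverage"


def within_group_rate(df, part_col, group_col):
    """Share of the group (not of the whole population) that falls in part_col."""
    return df[part_col] / df[group_col]


# First two rows of the normalized insurance table
toy_insurance = pd.DataFrame({
    "Geographic Area Name": ["Autauga County, Alabama", "Baldwin County, Alabama"],
    GROUP: [0.256442, 0.222671],
    UNINSURED: [0.007229, 0.012725],
})
child_uninsured_rate = within_group_rate(toy_insurance, UNINSURED, GROUP)
print(child_uninsured_rate)
```

For Autauga County this gives about 0.028, which agrees with the raw counts in the earlier preview (411 uninsured out of 14580 children under 19).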


### 3) Child poverty

In [82]:
poverty_df.head()

Unnamed: 0,Geographic Area Name,Estimate!!Total:,Margin of Error!!Total:,Estimate!!Total:!!Income in the past 12 months below poverty level:,Margin of Error!!Total:!!Income in the past 12 months below poverty level:,Estimate!!Total:!!Income in the past 12 months below poverty level:!!Under 18 years,Margin of Error!!Total:!!Income in the past 12 months below poverty level:!!Under 18 years,Estimate!!Total:!!Income in the past 12 months below poverty level:!!18 to 64 years,Margin of Error!!Total:!!Income in the past 12 months below poverty level:!!18 to 64 years,Estimate!!Total:!!Income in the past 12 months below poverty level:!!65 years and over,Margin of Error!!Total:!!Income in the past 12 months below poverty level:!!65 years and over,Estimate!!Total:!!Income in the past 12 months at or above poverty level:,Margin of Error!!Total:!!Income in the past 12 months at or above poverty level:,Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Under 18 years,Margin of Error!!Total:!!Income in the past 12 months at or above poverty level:!!Under 18 years,Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!18 to 64 years,Margin of Error!!Total:!!Income in the past 12 months at or above poverty level:!!18 to 64 years,Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!65 years and over,Margin of Error!!Total:!!Income in the past 12 months at or above poverty level:!!65 years and over
0,"Autauga County, Alabama",58601,548,3648,1933,808,747,2479,1417,361,293,54953,2018,12800,967,33270,1419,8883,553
1,"Baldwin County, Alabama",235395,2413,25321,5128,8633,2679,14091,3068,2597,1106,210074,5512,40988,2596,121389,4057,47697,1841
2,"Barbour County, Alabama",21787,946,2636,1098,766,669,1515,572,355,210,19151,1475,4364,669,10701,974,4086,462
3,"Bibb County, Alabama",21544,1444,4394,1874,1901,835,2427,1206,66,100,17150,2186,2404,939,10803,1648,3943,1100
4,"Blount County, Alabama",58342,591,5840,1742,1439,817,3563,1128,838,449,52502,1910,11804,939,31046,1134,9652,481


In [83]:
poverty_df.columns = poverty_df.columns.str.strip()

# Drop the margin-of-error columns
poverty_df = drop_margin_of_error_columns(poverty_df)

# Normalize by the table total
cleaned_poverty_df = normalize_df(poverty_df)

cleaned_poverty_df.head()

Unnamed: 0,Geographic Area Name,Estimate!!Total:!!Income in the past 12 months below poverty level:,Estimate!!Total:!!Income in the past 12 months below poverty level:!!Under 18 years,Estimate!!Total:!!Income in the past 12 months below poverty level:!!18 to 64 years,Estimate!!Total:!!Income in the past 12 months below poverty level:!!65 years and over,Estimate!!Total:!!Income in the past 12 months at or above poverty level:,Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Under 18 years,Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!18 to 64 years,Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!65 years and over
0,"Autauga County, Alabama",0.062251,0.013788,0.042303,0.00616,0.937749,0.218426,0.567738,0.151584
1,"Baldwin County, Alabama",0.107568,0.036675,0.059861,0.011033,0.892432,0.174124,0.515682,0.202625
2,"Barbour County, Alabama",0.12099,0.035159,0.069537,0.016294,0.87901,0.200303,0.491164,0.187543
3,"Bibb County, Alabama",0.203955,0.088238,0.112653,0.003063,0.796045,0.111586,0.501439,0.183021
4,"Blount County, Alabama",0.100099,0.024665,0.061071,0.014364,0.899901,0.202324,0.532138,0.165438
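
None of the normalized columns is the child poverty rate itself: the `Under 18 years` columns are shares of the total population. A possible derivation, sketched below under the assumption that the two `Under 18 years` columns together cover all children in the table, is to divide the below-poverty share by the sum of both. The function name `child_poverty_rate` is illustrative.

```python
import pandas as pd

BELOW_U18 = ("Estimate!!Total:!!Income in the past 12 months "
             "below poverty level:!!Under 18 years")
ABOVE_U18 = ("Estimate!!Total:!!Income in the past 12 months "
             "at or above poverty level:!!Under 18 years")


def child_poverty_rate(df):
    """Children below the poverty level as a share of all children."""
    children = df[BELOW_U18] + df[ABOVE_U18]
    return df[BELOW_U18] / children


# Two rows (Autauga and Bibb) from the normalized poverty table
toy_poverty = pd.DataFrame({
    "Geographic Area Name": ["Autauga County, Alabama", "Bibb County, Alabama"],
    BELOW_U18: [0.013788, 0.088238],
    ABOVE_U18: [0.218426, 0.111586],
})
toy_poverty["child_poverty_rate"] = child_poverty_rate(toy_poverty)
print(toy_poverty[["Geographic Area Name", "child_poverty_rate"]])
```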


### 4) Housing insecurity (rented vs. owned)


In [84]:
insecurity_df.head()

Unnamed: 0,Label (Grouping),Total:,Total:!!Owner occupied,Total:!!Renter occupied
0,"Baldwin County, Alabama",,,
1,Estimate,94105.0,71380.0,22725.0
2,"Calhoun County, Alabama",,,
3,Estimate,44631.0,32470.0,12161.0
4,"Cullman County, Alabama",,,


- The values are stored in the `Estimate` row beneath each county name, so each county name has to be matched with the values in the row that follows it.
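
As an alternative to back-filling, the pairing can be made explicit by keeping only the `Estimate` rows and taking each county name from the row directly above. The sketch below assumes that every county row is immediately followed by exactly one `Estimate` row; the helper name `pair_names_with_estimates` and the toy frame are illustrative.

```python
import pandas as pd


def pair_names_with_estimates(df, label_col="Label (Grouping)"):
    """Keep the 'Estimate' rows and label each with the county name above it."""
    labels = df[label_col].astype(str).str.strip()
    is_estimate = labels.str.lower().eq("estimate")
    paired = df.loc[is_estimate].copy()
    # The county name sits on the row directly above its "Estimate" row
    paired[label_col] = labels.shift(1)[is_estimate]
    return paired.reset_index(drop=True)


# Toy frame in the same layout as the tenure table
toy_tenure = pd.DataFrame({
    "Label (Grouping)": ["Baldwin County, Alabama", "Estimate",
                         "Calhoun County, Alabama", "Estimate"],
    "Total:": [None, 94105.0, None, 44631.0],
    "Total:!!Owner occupied": [None, 71380.0, None, 32470.0],
    "Total:!!Renter occupied": [None, 22725.0, None, 12161.0],
})
paired_toy = pair_names_with_estimates(toy_tenure)
print(paired_toy)
```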

In [85]:
insecurity_df.columns = insecurity_df.columns.str.strip()
# Back-fill so that each county row picks up the values from the Estimate row below it
insecurity_fillna_df = insecurity_df.bfill()


In [None]:
insecurity_fillna_df.head()

Unnamed: 0,Label (Grouping),Total:,Total:!!Owner occupied,Total:!!Renter occupied
0,"Baldwin County, Alabama",94105,71380,22725
1,Estimate,94105,71380,22725
2,"Calhoun County, Alabama",44631,32470,12161
3,Estimate,44631,32470,12161
4,"Cullman County, Alabama",35131,26688,8443


In [87]:
insecurity_fillna_df.tail()

Unnamed: 0,Label (Grouping),Total:,Total:!!Owner occupied,Total:!!Renter occupied
1675,Estimate,135865,70013,65852
1676,"Toa Alta Municipio, Puerto Rico",20918,17421,3497
1677,Estimate,20918,17421,3497
1678,"Toa Baja Municipio, Puerto Rico",28276,20613,7663
1679,Estimate,28276,20613,7663


In [88]:
insecurity_drop_est_df = insecurity_fillna_df[insecurity_fillna_df["Label (Grouping)"].str.strip().str.lower() != "estimate"]
insecurity_drop_est_df = insecurity_drop_est_df.rename(columns={"Label (Grouping)": "Geographic Area Name"})

insecurity_drop_est_df.head()

Unnamed: 0,Geographic Area Name,Total:,Total:!!Owner occupied,Total:!!Renter occupied
0,"Baldwin County, Alabama",94105,71380,22725
2,"Calhoun County, Alabama",44631,32470,12161
4,"Cullman County, Alabama",35131,26688,8443
6,"DeKalb County, Alabama",24979,19663,5316
8,"Elmore County, Alabama",32108,22990,9118


In [89]:
df_numeric = insecurity_drop_est_df.copy()

# Strip thousands separators and convert the count columns to numbers
for col in df_numeric.columns[1:]:
    df_numeric[col] = df_numeric[col].astype(str).str.replace(",", "").str.strip()
    df_numeric[col] = pd.to_numeric(df_numeric[col], errors='coerce')

# Convert the owner- and renter-occupied counts to shares of the total
for col in df_numeric.columns[2:]:
    df_numeric[col] = df_numeric[col] / df_numeric["Total:"]

df_normalized = df_numeric.drop(columns=["Total:"])

df_normalized.head()

Unnamed: 0,Geographic Area Name,Total:!!Owner occupied,Total:!!Renter occupied
0,"Baldwin County, Alabama",0.758514,0.241486
2,"Calhoun County, Alabama",0.727521,0.272479
4,"Cullman County, Alabama",0.759671,0.240329
6,"DeKalb County, Alabama",0.787181,0.212819
8,"Elmore County, Alabama",0.716021,0.283979


In [90]:
cleaned_insecurity_df = df_normalized

### 5) Parents' occupation

In [91]:
print(occupation_df.shape)
occupation_df.head()

(1904, 13)


Unnamed: 0,Geographic Area Name,Estimate!!Total:,Margin of Error!!Total:,"Estimate!!Total:!!Management, business, science, and arts occupations","Margin of Error!!Total:!!Management, business, science, and arts occupations",Estimate!!Total:!!Service occupations,Margin of Error!!Total:!!Service occupations,Estimate!!Total:!!Sales and office occupations,Margin of Error!!Total:!!Sales and office occupations,"Estimate!!Total:!!Natural resources, construction, and maintenance occupations","Margin of Error!!Total:!!Natural resources, construction, and maintenance occupations","Estimate!!Total:!!Production, transportation, and material moving occupations","Margin of Error!!Total:!!Production, transportation, and material moving occupations"
0,"Autauga County, Alabama",26405,2208,7355,1541,3913,1061,7157,1400,3143,1143,4837,1064
1,"Baldwin County, Alabama",110347,4579,40689,3375,18103,3162,26503,3506,9963,2611,15089,3038
2,"Barbour County, Alabama",9848,1107,2525,760,1643,771,2474,869,1187,504,2019,709
3,"Bibb County, Alabama",7153,1331,2052,779,1196,753,940,640,1701,650,1264,538
4,"Blount County, Alabama",25646,1601,6924,1490,3842,1517,5734,1255,5073,1053,4073,1210


In [92]:
print(industry_df.shape)
industry_df.head()

(1904, 29)


Unnamed: 0,Geographic Area Name,Estimate!!Total:,Margin of Error!!Total:,"Estimate!!Total:!!Agriculture, forestry, fishing and hunting, and mining","Margin of Error!!Total:!!Agriculture, forestry, fishing and hunting, and mining",Estimate!!Total:!!Construction,Margin of Error!!Total:!!Construction,Estimate!!Total:!!Manufacturing,Margin of Error!!Total:!!Manufacturing,Estimate!!Total:!!Wholesale trade,...,"Estimate!!Total:!!Professional, scientific, and management, and administrative and waste management services","Margin of Error!!Total:!!Professional, scientific, and management, and administrative and waste management services","Estimate!!Total:!!Educational services, and health care and social assistance","Margin of Error!!Total:!!Educational services, and health care and social assistance","Estimate!!Total:!!Arts, entertainment, and recreation, and accommodation and food services","Margin of Error!!Total:!!Arts, entertainment, and recreation, and accommodation and food services","Estimate!!Total:!!Other services, except public administration","Margin of Error!!Total:!!Other services, except public administration",Estimate!!Total:!!Public administration,Margin of Error!!Total:!!Public administration
0,"Autauga County, Alabama",26405.0,2208.0,294.0,302.0,1701.0,811.0,3981.0,1139.0,305.0,...,2188.0,1324.0,5463.0,1358.0,2575.0,930.0,1180.0,580.0,1948.0,828.0
1,"Baldwin County, Alabama",110347.0,4579.0,856.0,628.0,10194.0,2535.0,8893.0,1926.0,2664.0,...,14185.0,3460.0,21659.0,3540.0,11201.0,2691.0,5536.0,1781.0,5488.0,1853.0
2,"Barbour County, Alabama",9848.0,1107.0,429.0,278.0,223.0,184.0,2390.0,775.0,127.0,...,470.0,326.0,1737.0,619.0,886.0,480.0,519.0,326.0,433.0,281.0
3,"Bibb County, Alabama",7153.0,1331.0,212.0,291.0,1023.0,592.0,797.0,479.0,257.0,...,324.0,224.0,1104.0,478.0,126.0,215.0,329.0,314.0,907.0,594.0
4,"Blount County, Alabama",25646.0,1601.0,593.0,365.0,3771.0,1057.0,4285.0,1337.0,443.0,...,3047.0,1053.0,4858.0,1100.0,1045.0,648.0,837.0,558.0,1078.0,545.0


In [93]:
occupation_df.columns = occupation_df.columns.str.strip()
industry_df.columns = industry_df.columns.str.strip()

In [94]:
occupation_drop_est_df = drop_margin_of_error_columns(occupation_df)
industry_drop_est_df = drop_margin_of_error_columns(industry_df)

In [95]:
normalized_occupation_df = normalize_df(occupation_drop_est_df)
normalized_industry_df = normalize_df(industry_drop_est_df)

In [96]:
normalized_occupation_df.head()

Unnamed: 0,Geographic Area Name,"Estimate!!Total:!!Management, business, science, and arts occupations",Estimate!!Total:!!Service occupations,Estimate!!Total:!!Sales and office occupations,"Estimate!!Total:!!Natural resources, construction, and maintenance occupations","Estimate!!Total:!!Production, transportation, and material moving occupations"
0,"Autauga County, Alabama",0.278546,0.148192,0.271047,0.11903,0.183185
1,"Baldwin County, Alabama",0.368737,0.164055,0.240179,0.090288,0.136741
2,"Barbour County, Alabama",0.256397,0.166836,0.251219,0.120532,0.205016
3,"Bibb County, Alabama",0.286873,0.167203,0.131413,0.237802,0.176709
4,"Blount County, Alabama",0.269984,0.149809,0.223583,0.197809,0.158816


In [97]:
normalized_industry_df.head()

Unnamed: 0,Geographic Area Name,"Estimate!!Total:!!Agriculture, forestry, fishing and hunting, and mining",Estimate!!Total:!!Construction,Estimate!!Total:!!Manufacturing,Estimate!!Total:!!Wholesale trade,Estimate!!Total:!!Retail trade,"Estimate!!Total:!!Transportation and warehousing, and utilities",Estimate!!Total:!!Information,"Estimate!!Total:!!Finance and insurance, and real estate and rental and leasing","Estimate!!Total:!!Professional, scientific, and management, and administrative and waste management services","Estimate!!Total:!!Educational services, and health care and social assistance","Estimate!!Total:!!Arts, entertainment, and recreation, and accommodation and food services","Estimate!!Total:!!Other services, except public administration",Estimate!!Total:!!Public administration
0,"Autauga County, Alabama",0.011134,0.06442,0.150767,0.011551,0.113804,0.066124,0.023594,0.052869,0.082863,0.206893,0.097519,0.044689,0.073774
1,"Baldwin County, Alabama",0.007757,0.092381,0.080591,0.024142,0.156089,0.031446,0.010349,0.071003,0.128549,0.196281,0.101507,0.050169,0.049734
2,"Barbour County, Alabama",0.043562,0.022644,0.242689,0.012896,0.167953,0.064683,0.0,0.034829,0.047725,0.176381,0.089968,0.052701,0.043968
3,"Bibb County, Alabama",0.029638,0.143017,0.111422,0.035929,0.142737,0.058856,0.007829,0.080526,0.045296,0.154341,0.017615,0.045995,0.1268
4,"Blount County, Alabama",0.023123,0.14704,0.167083,0.017274,0.146027,0.043827,0.007097,0.024877,0.11881,0.189425,0.040747,0.032637,0.042034


In [98]:
merged_parent_df = normalized_industry_df.merge(normalized_occupation_df, on="Geographic Area Name", how="left")

merged_parent_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1904 entries, 0 to 1903
Data columns (total 19 columns):
 #   Column                                                                                                        Non-Null Count  Dtype  
---  ------                                                                                                        --------------  -----  
 0   Geographic Area Name                                                                                          1904 non-null   object 
 1   Estimate!!Total:!!Agriculture, forestry, fishing and hunting, and mining                                      1881 non-null   float64
 2   Estimate!!Total:!!Construction                                                                                1881 non-null   float64
 3   Estimate!!Total:!!Manufacturing                                                                               1881 non-null   float64
 4   Estimate!!Total:!!Wholesale trade                     
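
The left join above keeps every industry row and fills the occupation columns with NaN wherever a geographic area has no match. A minimal sketch of how `DataFrame.merge` can surface such cases, using small toy frames with hypothetical values rather than the notebook's data:

```python
import pandas as pd

# Toy stand-ins for normalized_industry_df / normalized_occupation_df (hypothetical values)
industry = pd.DataFrame({
    "Geographic Area Name": ["A County, Alabama", "B County, Alabama", "C County, Alabama"],
    "Construction": [0.10, 0.20, 0.30],
})
occupation = pd.DataFrame({
    "Geographic Area Name": ["A County, Alabama", "B County, Alabama"],
    "Service occupations": [0.15, 0.25],
})

# validate="one_to_one" raises MergeError if either side has duplicate keys;
# indicator=True adds a "_merge" column recording where each row came from.
checked = industry.merge(
    occupation,
    on="Geographic Area Name",
    how="left",
    validate="one_to_one",
    indicator=True,
)

# Rows present only on the left side had no occupation match
unmatched = checked.loc[checked["_merge"] == "left_only", "Geographic Area Name"].tolist()
print(unmatched)
```

Dropping the `_merge` column afterwards leaves the same frame the plain left join would produce.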

In [99]:
cleaned_parent_df = merged_parent_df

## 3. Export Cleaned Data

In [100]:
sub_data_arr = [cleaned_house_df, cleaned_insecurity_df, cleaned_insurance_df, cleaned_parent_df, cleaned_poverty_df]

**Export cleaned data to cleaned folder**

In [101]:
# Remove duplicate rows, keeping the first occurrence of each geographic area
for i, df in enumerate(sub_data_arr):
    sub_data_arr[i] = df.drop_duplicates(subset=["Geographic Area Name"])
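
Before dropping duplicates it can help to look at which rows are affected. A small sketch on a hypothetical toy frame (the real frames are the entries of `sub_data_arr`):

```python
import pandas as pd

# Hypothetical toy frame with one repeated geographic area
toy = pd.DataFrame({
    "Geographic Area Name": ["A County, Alabama", "A County, Alabama", "B County, Alabama"],
    "value": [0.1, 0.1, 0.4],
})

# keep=False flags every member of a duplicated group, not just the later copies
dupes = toy[toy.duplicated(subset=["Geographic Area Name"], keep=False)]
print(dupes)

# drop_duplicates keeps the first row per key by default (keep="first")
deduped = toy.drop_duplicates(subset=["Geographic Area Name"])
print(len(deduped))
```

If the flagged rows differ in their values, keeping the first one is a choice worth checking rather than a neutral cleanup.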

In [102]:
state_abbreviations = {
    "Alabama": "AL", "Alaska": "AK", "Arizona": "AZ", "Arkansas": "AR", "California": "CA",
    "Colorado": "CO", "Connecticut": "CT", "Delaware": "DE", "Florida": "FL", "Georgia": "GA",
    "Hawaii": "HI", "Idaho": "ID", "Illinois": "IL", "Indiana": "IN", "Iowa": "IA", "Kansas": "KS",
    "Kentucky": "KY", "Louisiana": "LA", "Maine": "ME", "Maryland": "MD", "Massachusetts": "MA",
    "Michigan": "MI", "Minnesota": "MN", "Mississippi": "MS", "Missouri": "MO", "Montana": "MT",
    "Nebraska": "NE", "Nevada": "NV", "New Hampshire": "NH", "New Jersey": "NJ", "New Mexico": "NM",
    "New York": "NY", "North Carolina": "NC", "North Dakota": "ND", "Ohio": "OH", "Oklahoma": "OK",
    "Oregon": "OR", "Pennsylvania": "PA", "Rhode Island": "RI", "South Carolina": "SC",
    "South Dakota": "SD", "Tennessee": "TN", "Texas": "TX", "Utah": "UT", "Vermont": "VT",
    "Virginia": "VA", "Washington": "WA", "West Virginia": "WV", "Wisconsin": "WI", "Wyoming": "WY",
    "Puerto Rico": "PR"
}

for i in range(len(sub_data_arr)):
    df = sub_data_arr[i]
    updated_df = df.copy()
    
    updated_df[["County", "State"]] = updated_df["Geographic Area Name"].str.rsplit(", ", n=1, expand=True)
    
    updated_df["State"] = updated_df["State"].map(state_abbreviations)
    updated_df["County"] = updated_df["County"].str.replace(r" County$", "", regex=True)
    
    column_order = ["State", "County"] + [col for col in updated_df.columns if col not in ["State", "County"]]
    updated_df = updated_df[column_order]

    sub_data_arr[i] = updated_df
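
Two details of the loop above are worth checking. The `state_abbreviations` dictionary has no entry for "District of Columbia", so any such rows would get a NaN `State`, and only the " County" suffix is removed, so names such as "Aleutians East Borough" keep theirs. A hedged sketch on toy data (the abbreviated mapping and the column name `State_name` are illustrative assumptions):

```python
import pandas as pd

# Abbreviated mapping for illustration; the notebook's state_abbreviations dict is the full version
abbrev = {"Alabama": "AL", "Louisiana": "LA"}

toy = pd.DataFrame({
    "Geographic Area Name": [
        "Autauga County, Alabama",
        "Acadia Parish, Louisiana",
        "District of Columbia, District of Columbia",
    ]
})

# Split on the last ", " so the state name is always the final piece
toy[["County", "State_name"]] = toy["Geographic Area Name"].str.rsplit(", ", n=1, expand=True)
toy["State"] = toy["State_name"].map(abbrev)

# Names absent from the mapping become NaN; list them before they silently fall out of a join
unmapped = sorted(toy.loc[toy["State"].isna(), "State_name"].unique())
print(unmapped)

# An anchored pattern removes only a trailing " County"; "Parish" and similar suffixes remain
toy["County"] = toy["County"].str.replace(r" County$", "", regex=True)
counties = toy["County"].tolist()
print(counties)
```

Whether the remaining suffixes matter depends on how the lead data spells its county names, which is what the later join keys on.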


In [103]:
sub_data_arr[0].head()

Unnamed: 0,State,County,Geographic Area Name,Estimate!!Total:!!Built 2020 or later,Estimate!!Total:!!Built 2010 to 2019,Estimate!!Total:!!Built 2000 to 2009,Estimate!!Total:!!Built 1990 to 1999,Estimate!!Total:!!Built 1980 to 1989,Estimate!!Total:!!Built 1970 to 1979,Estimate!!Total:!!Built 1960 to 1969,...,"Estimate!!Total:!!$175,000 to $199,999","Estimate!!Total:!!$200,000 to $249,999","Estimate!!Total:!!$250,000 to $299,999","Estimate!!Total:!!$300,000 to $399,999","Estimate!!Total:!!$400,000 to $499,999","Estimate!!Total:!!$500,000 to $749,999","Estimate!!Total:!!$750,000 to $999,999","Estimate!!Total:!!$1,000,000 to $1,499,999","Estimate!!Total:!!$1,500,000 to $1,999,999","Estimate!!Total:!!$2,000,000 or more"
0,AL,Autauga,"Autauga County, Alabama",0.002938,0.101283,0.216177,0.215929,0.098386,0.178486,0.116922,...,0.08227,0.145313,0.078326,0.096876,0.02804,0.023541,0.00191,0.000123,0.0,0.00074
1,AL,Baldwin,"Baldwin County, Alabama",0.003622,0.135008,0.272086,0.234956,0.152805,0.090183,0.051099,...,0.087297,0.145014,0.101068,0.149891,0.067577,0.063561,0.027869,0.014024,0.00171,0.006469
2,AL,Barbour,"Barbour County, Alabama",0.0,0.036856,0.093512,0.218651,0.164395,0.150081,0.124882,...,0.029537,0.03219,0.061372,0.047931,0.015741,0.019632,0.008136,0.005483,0.0,0.0
3,AL,Bibb,"Bibb County, Alabama",0.0,0.068679,0.148341,0.203817,0.160435,0.156774,0.110618,...,0.041039,0.068459,0.062366,0.051434,0.007168,0.003405,0.0,0.007885,0.00448,0.0
4,AL,Blount,"Blount County, Alabama",0.000897,0.055123,0.194194,0.226077,0.143026,0.148612,0.093774,...,0.045894,0.114082,0.065461,0.069315,0.020338,0.013104,0.013934,0.003083,0.0,0.003024


In [112]:
# Export each cleaned sub dataset to its own CSV file
file_names = ["house_df.csv", "insecurity_df.csv", "insurance_df.csv", "parent_df.csv", "poverty_df.csv"]
file_paths = [f"data/cleaned/{file}" for file in file_names]

for df, path in zip(sub_data_arr, file_paths):
    df.to_csv(path, index=False)
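
`to_csv` does not create missing folders, so the export fails if `data/cleaned` does not exist yet. A minimal sketch of creating the folder first, written against a temporary directory (an assumption made here so the example does not touch the notebook's files):

```python
from pathlib import Path
import tempfile

import pandas as pd

# Temporary base directory so the sketch does not write into the project's data/ folder
base = Path(tempfile.mkdtemp())
out_dir = base / "data" / "cleaned"

# parents=True creates intermediate folders; exist_ok=True makes reruns safe
out_dir.mkdir(parents=True, exist_ok=True)

toy = pd.DataFrame({"State": ["AL"], "County": ["Autauga"], "value": [0.1]})
out_path = out_dir / "house_df.csv"
toy.to_csv(out_path, index=False)

# Read the file back to confirm it was written with the expected columns
round_trip = pd.read_csv(out_path)
print(round_trip.shape)
```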

## 4. Merge sub data and main data

In [105]:
# Drop "Geographic Area Name", which is now redundant with the State and County columns
for i in range(len(sub_data_arr)):
    sub_data_arr[i] = sub_data_arr[i].drop(columns=["Geographic Area Name"], errors="ignore")

Merge all sub data

In [106]:
merged_sub_df = sub_data_arr[0].copy()

for df in sub_data_arr[1:]:
    merged_sub_df = merged_sub_df.merge(df, on=["State", "County"], how="outer", suffixes=("", "_dup"))

merged_sub_df.shape

(3218, 77)
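
The loop above folds the list of frames into one with successive outer joins, and `suffixes=("", "_dup")` renames any non-key column that appears in more than one input. A sketch of the same pattern with `functools.reduce` on hypothetical toy frames, plus a check for leftover `_dup` columns:

```python
from functools import reduce

import pandas as pd

# Hypothetical toy frames standing in for the entries of sub_data_arr
frames = [
    pd.DataFrame({"State": ["AL", "AL"], "County": ["Autauga", "Baldwin"], "built_pre_1950": [0.1, 0.2]}),
    pd.DataFrame({"State": ["AL"], "County": ["Autauga"], "uninsured": [0.05]}),
    pd.DataFrame({"State": ["AL", "AK"], "County": ["Baldwin", "Anchorage Municipality"], "poverty": [0.11, 0.09]}),
]

# Fold the list into one frame with successive outer joins on the shared keys
merged = reduce(
    lambda left, right: left.merge(right, on=["State", "County"], how="outer", suffixes=("", "_dup")),
    frames,
)

# Any "_dup" column means two inputs shared a non-key column name
dup_cols = [c for c in merged.columns if c.endswith("_dup")]
print(merged.shape, dup_cols)
```

An empty `dup_cols` list indicates that no column names collided during the merge.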

In [107]:
merged_sub_df

Unnamed: 0,State,County,Estimate!!Total:!!Built 2020 or later,Estimate!!Total:!!Built 2010 to 2019,Estimate!!Total:!!Built 2000 to 2009,Estimate!!Total:!!Built 1990 to 1999,Estimate!!Total:!!Built 1980 to 1989,Estimate!!Total:!!Built 1970 to 1979,Estimate!!Total:!!Built 1960 to 1969,Estimate!!Total:!!Built 1950 to 1959,...,"Estimate!!Total:!!Natural resources, construction, and maintenance occupations","Estimate!!Total:!!Production, transportation, and material moving occupations",Estimate!!Total:!!Income in the past 12 months below poverty level:,Estimate!!Total:!!Income in the past 12 months below poverty level:!!Under 18 years,Estimate!!Total:!!Income in the past 12 months below poverty level:!!18 to 64 years,Estimate!!Total:!!Income in the past 12 months below poverty level:!!65 years and over,Estimate!!Total:!!Income in the past 12 months at or above poverty level:,Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Under 18 years,Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!18 to 64 years,Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!65 years and over
0,AK,Aleutians East Borough,0.000000,0.021563,0.071878,0.108715,0.371069,0.194070,0.100629,0.038634,...,,,,,,,,,,
1,AK,Aleutians West Census Area,0.000687,0.057005,0.032967,0.247940,0.266484,0.213599,0.048764,0.010302,...,,,,,,,,,,
2,AK,Anchorage Municipality,0.000390,0.046563,0.118767,0.118657,0.257871,0.280064,0.102105,0.058295,...,0.081916,0.129813,0.090879,0.029282,0.052536,0.009061,0.909121,0.213045,0.577913,0.118162
3,AK,Bethel Census Area,0.000501,0.050067,0.181409,0.204940,0.262517,0.223131,0.058411,0.007844,...,,,,,,,,,,
4,AK,Bristol Bay Borough,0.000000,0.011931,0.086768,0.250542,0.226681,0.250542,0.086768,0.016269,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3213,WY,Teton,0.005583,0.043380,0.199321,0.225500,0.217805,0.183553,0.035534,0.031837,...,0.064701,0.013272,0.076713,0.025096,0.048440,0.003178,0.923287,0.159985,0.566284,0.197018
3214,WY,Uinta,0.000000,0.037533,0.113278,0.089693,0.376800,0.192766,0.040367,0.040821,...,0.214531,0.110490,0.058630,0.006730,0.032406,0.019493,0.941370,0.240951,0.545917,0.154502
3215,WY,Washakie,0.000000,0.042686,0.074440,0.081989,0.083550,0.229828,0.082249,0.160854,...,,,,,,,,,,
3216,WY,Weston,0.002915,0.048674,0.133489,0.157680,0.097348,0.121539,0.067910,0.092101,...,,,,,,,,,,


In [108]:
# Find the differences between "Lead States" and "Merged States"
lead_df["State"] = lead_df["State"].str.strip()
lead_df["County"] = lead_df["County"].str.strip()

merged_sub_df["State"] = merged_sub_df["State"].str.strip()
merged_sub_df["County"] = merged_sub_df["County"].str.strip()

lead_states = sorted(lead_df["State"].dropna().unique())
merged_states = sorted(merged_sub_df["State"].dropna().unique())

lead_states_set = set(lead_states)
merged_states_set = set(merged_states)

only_in_lead = lead_states_set - merged_states_set
only_in_merged = merged_states_set - lead_states_set

state_difference_df = pd.DataFrame({
    "Only in Lead States": pd.Series(list(only_in_lead)),
    "Only in Merged States": pd.Series(list(only_in_merged))
})

state_difference_df

Unnamed: 0,Only in Lead States,Only in Merged States
0,,PA
1,,ME
2,,VA
3,,MT
4,,OR
5,,AR
6,,PR
7,,VT
8,,KY
9,,CO


Merge with main data

In [109]:
# Inner join: keep only (State, County) pairs present in both the lead data and the sub data
final_merged_df = lead_df.merge(merged_sub_df, on=["State", "County"], how="inner")

final_merged_df.shape

(1782, 90)

In [110]:
final_merged_df

Unnamed: 0,State,County,year,Total Population of Children <72 Months of Age,Number of Children Tested <72 Months of Age,Number of Children with Confirmed BLLs ³5 µg/dL,Percent of Children with Confirmed BLLs ³5 µg/dL,Number of Children with Confirmed BLLs ³10 µg/dL,Percent of Children with Confirmed BLLs ³10 µg/dL,Number of Children with Confirmed BLLs 5-9 µg/dL,...,"Estimate!!Total:!!Natural resources, construction, and maintenance occupations","Estimate!!Total:!!Production, transportation, and material moving occupations",Estimate!!Total:!!Income in the past 12 months below poverty level:,Estimate!!Total:!!Income in the past 12 months below poverty level:!!Under 18 years,Estimate!!Total:!!Income in the past 12 months below poverty level:!!18 to 64 years,Estimate!!Total:!!Income in the past 12 months below poverty level:!!65 years and over,Estimate!!Total:!!Income in the past 12 months at or above poverty level:,Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!Under 18 years,Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!18 to 64 years,Estimate!!Total:!!Income in the past 12 months at or above poverty level:!!65 years and over
0,AL,Autauga,2021.0,4045.0,238,SD,SD,SD,SD,0,...,0.119030,0.183185,0.062251,0.013788,0.042303,0.006160,0.937749,0.218426,0.567738,0.151584
1,AL,Baldwin,2021.0,14651.0,552,SD,SD,0,0.00%,SD,...,0.090288,0.136741,0.107568,0.036675,0.059861,0.011033,0.892432,0.174124,0.515682,0.202625
2,AL,Barbour,2021.0,1571.0,268,SD,SD,SD,SD,0,...,0.120532,0.205016,0.120990,0.035159,0.069537,0.016294,0.879010,0.200303,0.491164,0.187543
3,AL,Bibb,2021.0,1459.0,105,SD,SD,SD,SD,0,...,0.237802,0.176709,0.203955,0.088238,0.112653,0.003063,0.796045,0.111586,0.501439,0.183021
4,AL,Blount,2021.0,4148.0,365,0,0.00%,0,0.00%,0,...,0.197809,0.158816,0.100099,0.024665,0.061071,0.014364,0.899901,0.202324,0.532138,0.165438
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1777,IN,Washington,2021.0,,,,0.28,,,,...,0.095438,0.317467,0.075221,0.016652,0.043143,0.015426,0.924779,0.217300,0.542945,0.164534
1778,IN,Wayne,2021.0,,,,0.7,,,,...,0.085093,0.227488,0.163569,0.047653,0.098396,0.017520,0.836431,0.184959,0.480386,0.171086
1779,IN,Wells,2021.0,,,,0.53,,,,...,0.118619,0.266711,0.055379,0.008598,0.043133,0.003649,0.944621,0.235243,0.540460,0.168918
1780,IN,White,2021.0,,,,0.35,,,,...,0.098389,0.326958,0.097718,0.025934,0.063485,0.008299,0.902282,0.201369,0.504772,0.196141


In [111]:
final_merged_df.to_csv("data/cleaned/Lead_Poisoning_Risk_Factors.csv", index=False)