*DATA PREPARATION*
**Milestone 2: Data Wrangling – CDC PLACES Dataset**
***Daniel Solis Toro***

*DATA WRANGLING*

**Step 0: Initial Data Inspection**

In [56]:
# Load the original dataset and inspect its shape, structure, and missing values.

import pandas as pd

# Load dataset with default settings for initial inspection
raw_df = pd.read_csv('cdc_places_data.csv')

# Basic structure
print("Original dataset shape:", raw_df.shape)

# Column names and types
print("\nColumn info:")
print(raw_df.info())

# Sample of data
print("\nSample rows:")
display(raw_df.sample(5))

# Count of missing values per column
print("\nMissing values per column:")
print(raw_df.isnull().sum())



Original dataset shape: (240886, 22)

Column info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 240886 entries, 0 to 240885
Data columns (total 22 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   Year                        240886 non-null  int64  
 1   StateAbbr                   240886 non-null  object 
 2   StateDesc                   240886 non-null  object 
 3   LocationName                240806 non-null  object 
 4   DataSource                  240886 non-null  object 
 5   Category                    240886 non-null  object 
 6   Measure                     240886 non-null  object 
 7   Data_Value_Unit             240886 non-null  object 
 8   Data_Value_Type             240886 non-null  object 
 9   Data_Value                  240886 non-null  float64
 10  Data_Value_Footnote_Symbol  14 non-null      object 
 11  Data_Value_Footnote         14 non-null      object 
 12  Low_Confidence_Limit 

|  | Year | StateAbbr | StateDesc | LocationName | DataSource | Category | Measure | Data_Value_Unit | Data_Value_Type | Data_Value | ... | Low_Confidence_Limit | High_Confidence_Limit | TotalPopulation | TotalPop18plus | LocationID | CategoryID | MeasureId | DataValueTypeID | Short_Question_Text | Geolocation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 204588 | 2022 | TX | Texas | Hall | BRFSS | Prevention | Mammography use among women aged 50-74 years | % | Age-adjusted prevalence | 66.5 | ... | 59.7 | 72.8 | 2810 | 2205 | 48191 | PREVENT | MAMMOUSE | AgeAdjPrv | Mammography | POINT (-100.68136474111 34.5307336832386) |
| 85867 | 2022 | KY | Kentucky | Trimble | BRFSS | Health Outcomes | Current asthma among adults | % | Age-adjusted prevalence | 11.1 | ... | 10.2 | 12.0 | 8539 | 6660 | 21223 | HLTHOUT | CASTHMA | AgeAdjPrv | Current Asthma | POINT (-85.3370797534498 38.6131211554902) |
| 15960 | 2022 | AK | Alaska | Southeast Fairbanks | BRFSS | Health-Related Social Needs | Lack of reliable transportation in the past 12... | % | Crude prevalence | 10.2 | ... | 9.2 | 11.2 | 7021 | 5207 | 2240 | SOCLNEED | LACKTRPT | CrdPrv | Transportation Barriers | POINT (-143.213550848982 63.8762788125121) |
| 181600 | 2022 | SD | South Dakota | Tripp | BRFSS | Health Status | Frequent physical distress among adults | % | Crude prevalence | 15.2 | ... | 13.6 | 17.1 | 5565 | 4279 | 46123 | HLTHSTAT | PHLTH | CrdPrv | Frequent Physical Distress | POINT (-99.8839778474528 43.346001016691) |
| 130354 | 2022 | NM | New Mexico | Hidalgo | BRFSS | Disability | Any disability among adults | % | Age-adjusted prevalence | 35.6 | ... | 31.0 | 40.4 | 4003 | 3120 | 35023 | DISABLT | DISABILITY | AgeAdjPrv | Any Disability | POINT (-108.714784306827 31.914023667001) |



Missing values per column:
Year                               0
StateAbbr                          0
StateDesc                          0
LocationName                      80
DataSource                         0
Category                           0
Measure                            0
Data_Value_Unit                    0
Data_Value_Type                    0
Data_Value                         0
Data_Value_Footnote_Symbol    240872
Data_Value_Footnote           240872
Low_Confidence_Limit               0
High_Confidence_Limit              0
TotalPopulation                    0
TotalPop18plus                     0
LocationID                         0
CategoryID                         0
MeasureId                          0
DataValueTypeID                    0
Short_Question_Text                0
Geolocation                       80
dtype: int64
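The raw counts above are easier to compare as shares of the total row count. The following is a minimal sketch, assuming a small made-up frame in place of `raw_df`; the helper name `missing_summary` is hypothetical and not part of the notebook.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages for columns that have gaps."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing value, worst first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy data standing in for raw_df (values are illustrative only)
toy = pd.DataFrame({
    "LocationName": ["Hall", None, "Tripp", "Hidalgo"],
    "Data_Value": [66.5, 11.1, 15.2, 35.6],
    "Data_Value_Footnote": [None, None, None, "*"],
})
report = missing_summary(toy)
print(report)
```

Applied to the real data, this view would show the two footnote columns as about 99.99% empty (240872 of 240886 rows) and `LocationName` and `Geolocation` as about 0.03% empty (80 rows each).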


**Step 1: Initial Data Loading**

In [17]:
from fuzzywuzzy import process, fuzz

# Load CDC PLACES data, specifying column types to handle potential mixed formats
cdc_df = pd.read_csv('cdc_places_data.csv',
    dtype={
        'Data_Value_Footnote_Symbol': str,
        'Data_Value_Footnote': str
    })

**Step 2: Header Replacement and Standardization**

In [20]:
# Rename column headers to lowercase and snake_case format for clarity and consistency
new_headers = {
    'Year': 'year',
    'StateAbbr': 'state_abbr',
    'StateDesc': 'state_name',
    'LocationName': 'county_name',
    'DataSource': 'data_source',
    'Category': 'category',
    'Measure': 'health_measure',
    'Data_Value_Unit': 'unit',
    'Data_Value_Type': 'value_type',
    'Data_Value': 'prevalence_value',
    'Data_Value_Footnote_Symbol': 'footnote_symbol',
    'Data_Value_Footnote': 'footnote_text',
    'Low_Confidence_Limit': 'confidence_low',
    'High_Confidence_Limit': 'confidence_high',
    'TotalPopulation': 'total_pop',
    'TotalPop18plus': 'pop_adults',
    'LocationID': 'location_id',
    'CategoryID': 'category_id',
    'MeasureId': 'measure_id',
    'DataValueTypeID': 'value_type_id',
    'Short_Question_Text': 'measure_short',
    'Geolocation': 'geo_coordinates'
}
cdc_df = cdc_df.rename(columns=new_headers)
print("\nAfter Header Renaming:")
print(cdc_df.columns.tolist())


After Header Renaming:
['year', 'state_abbr', 'state_name', 'county_name', 'data_source', 'category', 'health_measure', 'unit', 'value_type', 'prevalence_value', 'footnote_symbol', 'footnote_text', 'confidence_low', 'confidence_high', 'total_pop', 'pop_adults', 'location_id', 'category_id', 'measure_id', 'value_type_id', 'measure_short', 'geo_coordinates']


**Step 3: Data Type Cleaning and Formatting**

In [23]:
# Ensure FIPS codes are 5-digit strings (restores leading zeros lost when the column was read as integers)
cdc_df['location_id'] = cdc_df['location_id'].astype(str).str.zfill(5)

# Convert prevalence values to numeric; entries that cannot be parsed become NaN
cdc_df['prevalence_value'] = pd.to_numeric(cdc_df['prevalence_value'], errors='coerce')

# Fill any missing confidence limits with the column-wide median.
# Step 0 reported no missing values in either column, so this acts as a safeguard only.
cdc_df['confidence_low'] = cdc_df['confidence_low'].fillna(cdc_df['confidence_low'].median())
cdc_df['confidence_high'] = cdc_df['confidence_high'].fillna(cdc_df['confidence_high'].median())
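A column-wide median mixes measures with very different scales (the sample rows range from about 11% for asthma to about 66% for mammography). If imputation were ever needed, one alternative is to fill within each health measure and record which rows were filled. This is a sketch under that assumption, on toy values; the `confidence_low_imputed` column is hypothetical and does not exist in the notebook.

```python
import pandas as pd

# Toy data: one missing lower limit in each of two measures
toy = pd.DataFrame({
    "health_measure": ["Asthma", "Asthma", "Asthma",
                       "Mammography", "Mammography", "Mammography"],
    "confidence_low": [10.0, None, 12.0, 60.0, 62.0, None],
})

# Record which rows are about to be imputed so the change stays auditable
toy["confidence_low_imputed"] = toy["confidence_low"].isna()

# Median of each measure, broadcast back to every row of that measure
group_median = toy.groupby("health_measure")["confidence_low"].transform("median")
toy["confidence_low"] = toy["confidence_low"].fillna(group_median)
print(toy)
```

The missing asthma limit is filled from the other asthma rows and the mammography limit from the other mammography rows, so neither is pulled toward the other measure's scale.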

**Step 4: Text Value Standardization**

In [33]:
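# Standardize casing for readability.
# Note: str.title() lowercases interior capitals and capitalizes the letter after an
# apostrophe (e.g. "DeKalb" becomes "Dekalb"), so such names need correcting afterwards.
text_cols = ['state_name', 'county_name', 'health_measure', 'category']
cdc_df[text_cols] = cdc_df[text_cols].apply(lambda col: col.str.title())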

# Ensure consistent state abbreviation format
cdc_df['state_abbr'] = cdc_df['state_abbr'].str.upper()
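Title-casing is convenient but not neutral for proper nouns. The sketch below shows what `str.title()` does to a few county-style names and one possible gentler alternative; `gentle_title` is a hypothetical helper, not something the notebook uses.

```python
import pandas as pd

names = pd.Series(["DeKalb", "Prince George's", "McLean", "hall"])

# str.title() capitalizes after every non-letter and lowercases everything else
titled = names.str.title()
print(titled.tolist())   # ['Dekalb', "Prince George'S", 'Mclean', 'Hall']

def gentle_title(name):
    """Title-case only names written entirely in upper or lower case."""
    if pd.isnull(name):
        return name
    if name.isupper() or name.islower():
        # Capitalize the first letter of each space-separated word only
        return " ".join(word[:1].upper() + word[1:].lower() for word in name.split(" "))
    # Mixed-case names are assumed to be intentional and are left alone
    return name

gentle = names.apply(gentle_title)
print(gentle.tolist())   # ['DeKalb', "Prince George's", 'McLean', 'Hall']
```

The trade-off is that a mixed-case typo such as "haLL" would pass through unchanged, so this approach suits sources whose casing is mostly trustworthy.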


**Step 5: Outlier Detection**

In [61]:
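# Flag outliers with the 1.5 * IQR rule, computed separately for each health_measure
def flag_outliers_iqr(group, column='prevalence_value'):
    """Return a boolean Series that is True where `column` falls outside the IQR fences."""
    q1 = group[column].quantile(0.25)
    q3 = group[column].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    # between() is False for NaN, so a missing value would also be flagged
    return ~group[column].between(lower_bound, upper_bound)

# include_groups=False (pandas 2.2+) keeps the grouping column out of each group;
# dropping the group level realigns the flags with cdc_df's original index
cdc_df['outlier_flag'] = cdc_df.groupby('health_measure').apply(
    flag_outliers_iqr,
    include_groups=False
).reset_index(level=0, drop=True)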

print(f"\nFound {cdc_df['outlier_flag'].sum()} potential outliers based on IQR method")



Found 1607 potential outliers based on IQR method
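To make the rule concrete, here is a small worked example of the same 1.5 × IQR fences on toy values. It uses `groupby(...).transform` instead of `apply`, which is an alternative formulation and not the notebook's code, and it keeps the fences as columns so each flag can be checked by eye.

```python
import pandas as pd

# Toy data: measure "A" contains one extreme value, measure "B" does not
toy = pd.DataFrame({
    "health_measure": ["A"] * 5 + ["B"] * 5,
    "prevalence_value": [10.0, 11.0, 12.0, 13.0, 50.0, 1.0, 2.0, 3.0, 4.0, 5.0],
})

values = toy.groupby("health_measure")["prevalence_value"]
# Each quartile is computed per measure and broadcast back to that measure's rows
q1 = values.transform(lambda s: s.quantile(0.25))
q3 = values.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1

toy["lower_bound"] = q1 - 1.5 * iqr
toy["upper_bound"] = q3 + 1.5 * iqr
toy["outlier_flag"] = ~toy["prevalence_value"].between(toy["lower_bound"], toy["upper_bound"])
print(toy)
```

For measure "A" the quartiles are 11 and 13, giving fences of 8 and 16, so only the value 50 is flagged. For measure "B" the fences are -1 and 7 and nothing is flagged.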


**Step 6: Duplicate Handling**

In [64]:
# Identify rows that share the same year, state, county, and health measure
dupes = cdc_df.duplicated(subset=['year', 'state_abbr', 'county_name', 'health_measure'], keep=False)
print(f"Found {dupes.sum()} duplicate entries")

# Keep the most recent entry for each state, county, and measure.
# value_type is not part of this key, so rows that differ only in value type
# (crude vs. age-adjusted prevalence) are collapsed to a single row.
cdc_df = cdc_df.sort_values('year').drop_duplicates(
    subset=['state_abbr', 'county_name', 'health_measure'],
    keep='last'
)

Found 0 duplicate entries
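The choice of key columns decides what counts as a duplicate. The sketch below compares a key without `value_type` against one that includes it, on four made-up rows for a single county and measure. Using `location_id` in the key and a stable sort are assumptions made for this illustration, not the notebook's settings.

```python
import pandas as pd

# Toy data: one county and measure, two years, two estimate types per year
toy = pd.DataFrame({
    "year": [2021, 2021, 2022, 2022],
    "state_abbr": ["TX"] * 4,
    "location_id": ["48191"] * 4,
    "health_measure": ["Current Asthma Among Adults"] * 4,
    "value_type": ["Crude prevalence", "Age-adjusted prevalence"] * 2,
    "prevalence_value": [9.8, 9.9, 10.1, 10.2],
})

# A stable sort keeps the original order among rows from the same year
ordered = toy.sort_values("year", kind="stable")

# Key without value_type: crude and age-adjusted rows collapse into one
narrow = ordered.drop_duplicates(
    subset=["state_abbr", "location_id", "health_measure"], keep="last")

# Key with value_type: the latest year survives for each estimate type
wide = ordered.drop_duplicates(
    subset=["state_abbr", "location_id", "health_measure", "value_type"], keep="last")

print(len(narrow), len(wide))
```

With the narrower key a single 2022 row remains; with the wider key both the crude and the age-adjusted 2022 rows are kept.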


**Step 7: Fuzzy Matching for County Name Corrections**

In [67]:
# Dictionary of known county name variants (key) and their standard spelling (value)
county_standard = {
    'De Kalb': 'DeKalb',
    'St. Marys': "St. Mary's",
    'Prince Georges': "Prince George's",
    'Ste Genevieve': 'Ste. Genevieve',
    'La Porte': 'LaPorte'
}

# Apply fuzzy matching only if name is not null
def fuzzy_fix_county(name):
    if pd.isnull(name):
        return name
    matches = process.extractOne(name, county_standard.keys(), scorer=fuzz.token_sort_ratio)
    if matches and matches[1] > 85:
        return county_standard[matches[0]]
    return name

# Apply fuzzy matching to clean county names
cdc_df['county_name_clean'] = cdc_df['county_name'].apply(fuzzy_fix_county)
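Because every county name is compared against the correction keys, a near match can rewrite a name that was already correct. One way to keep that visible is to collect each change for review. The sketch below uses the standard library's `difflib.SequenceMatcher` as a stand-in for fuzzywuzzy, so its scores will not equal `token_sort_ratio` scores; `best_match` and `fix_with_audit` are hypothetical helpers.

```python
from difflib import SequenceMatcher

# Subset of the correction dictionary, for illustration
county_standard = {
    "De Kalb": "DeKalb",
    "St. Marys": "St. Mary's",
    "Prince Georges": "Prince George's",
}

def best_match(name, choices):
    """Return (choice, score 0-100) for the closest key, ignoring case."""
    scored = [
        (choice, round(100 * SequenceMatcher(None, name.lower(), choice.lower()).ratio()))
        for choice in choices
    ]
    return max(scored, key=lambda pair: pair[1])

def fix_with_audit(names, threshold=85):
    """Apply corrections and collect every changed name for manual review."""
    cleaned, audit = [], []
    for name in names:
        choice, score = best_match(name, county_standard)
        new_name = county_standard[choice] if score > threshold else name
        if new_name != name:
            audit.append((name, new_name, score))
        cleaned.append(new_name)
    return cleaned, audit

cleaned, audit = fix_with_audit(["Dekalb", "Madison", "St. Mary", "Prince Georges"])
for original, corrected, score in audit:
    print(f"{original!r} -> {corrected!r} (score {score})")
```

In this toy run "St. Mary" is rewritten to "St. Mary's" even though it may be a distinct, valid name. The audit list is what surfaces that kind of change for a human to accept or reject.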


**Step 8: Final Data Quality Checks**

In [69]:
print("\nFinal Data Quality Check:")
print(cdc_df.info())

print("\nPrevalence Value Statistics:")
print(cdc_df['prevalence_value'].describe())



Final Data Quality Check:
<class 'pandas.core.frame.DataFrame'>
Index: 120231 entries, 208810 to 240885
Data columns (total 24 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   year               120231 non-null  int64  
 1   state_abbr         120231 non-null  object 
 2   state_name         120231 non-null  object 
 3   county_name        120191 non-null  object 
 4   data_source        120231 non-null  object 
 5   category           120231 non-null  object 
 6   health_measure     120231 non-null  object 
 7   unit               120231 non-null  object 
 8   value_type         120231 non-null  object 
 9   prevalence_value   120231 non-null  float64
 10  footnote_symbol    7 non-null       object 
 11  footnote_text      7 non-null       object 
 12  confidence_low     120231 non-null  float64
 13  confidence_high    120231 non-null  float64
 14  total_pop          120231 non-null  int64  
 15  pop_adults         12023
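Beyond `info()` and `describe()`, a few explicit rule checks can catch values that are present but implausible. This is a minimal sketch on toy rows. The three rules are assumptions: prevalence is a percentage, the estimate lies inside its confidence interval, and FIPS codes have five digits. `quality_issues` is a hypothetical helper.

```python
import pandas as pd

def quality_issues(df: pd.DataFrame) -> dict:
    """Count rows that violate a few basic expectations."""
    return {
        # Percent-type prevalence should lie in [0, 100]
        "value_out_of_range": int((~df["prevalence_value"].between(0, 100)).sum()),
        # The point estimate should sit inside its confidence interval
        "outside_interval": int(
            ((df["prevalence_value"] < df["confidence_low"])
             | (df["prevalence_value"] > df["confidence_high"])).sum()
        ),
        # FIPS codes should be exactly five digits
        "bad_fips": int((~df["location_id"].str.fullmatch(r"\d{5}")).sum()),
    }

# Toy data: the second row breaks the interval rule, the third the range and FIPS rules
toy = pd.DataFrame({
    "prevalence_value": [66.5, 11.1, 120.0],
    "confidence_low": [59.7, 12.0, 100.0],
    "confidence_high": [72.8, 13.0, 130.0],
    "location_id": ["48191", "21223", "2240"],
})
issues = quality_issues(toy)
print(issues)
```

A count of zero for every rule on the cleaned frame would support the claim that the data is ready to save. Any non-zero count points to rows worth inspecting first.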

**Step 9: Save Final Cleaned Dataset**

In [71]:
cdc_df.to_csv('cleaned_cdc_data.csv', index=False)
print("\nCleaning complete. Saved to 'cleaned_cdc_data.csv'")



Cleaning complete. Saved to 'cleaned_cdc_data.csv'
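One thing to keep in mind when the saved file is loaded again: CSV stores no type information, so zero-padded FIPS codes are parsed as integers unless the reader is told otherwise. The sketch below shows the round trip with an in-memory buffer in place of 'cleaned_cdc_data.csv'; the two rows are illustrative.

```python
import io
import pandas as pd

cleaned = pd.DataFrame({
    "location_id": ["02240", "48191"],
    "prevalence_value": [10.2, 66.5],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)

# Default parsing infers integers and drops the leading zero
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Declaring the column as a string keeps the five-digit code intact
buffer.seek(0)
typed_read = pd.read_csv(buffer, dtype={"location_id": str})

print(default_read["location_id"].tolist())
print(typed_read["location_id"].tolist())
```

Anyone reusing the cleaned file would therefore need to pass `dtype={'location_id': str}` or repeat the `zfill(5)` step from Step 3.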


**Step 10: Final Human-Readable Dataset Preview**

In [73]:
# Display a sample of the cleaned dataset
print("\nFinal Cleaned Data Sample:")
display(cdc_df[['year', 'state_name', 'county_name_clean', 'health_measure', 'prevalence_value']].head(20))


Final Cleaned Data Sample:


|  | year | state_name | county_name_clean | health_measure | prevalence_value |
| --- | --- | --- | --- | --- | --- |
| 208810 | 2021 | Texas | Madison | Cholesterol Screening Among Adults | 79.9 |
| 52909 | 2021 | Georgia | Pike | High Cholesterol Among Adults Who Have Ever Be... | 36.3 |
| 83939 | 2021 | Kentucky | Hart | Cholesterol Screening Among Adults | 79.7 |
| 52907 | 2021 | Georgia | Wilcox | High Cholesterol Among Adults Who Have Ever Be... | 39.8 |
| 165232 | 2021 | Oklahoma | Cimarron | High Cholesterol Among Adults Who Have Ever Be... | 39.9 |
| 165225 | 2021 | Ohio | Stark | Taking Medicine To Control High Blood Pressure... | 81.4 |
| 193117 | 2021 | Tennessee | Lake | Cholesterol Screening Among Adults | 81.8 |
| 83960 | 2021 | Kansas | Morris | Taking Medicine To Control High Blood Pressure... | 83.2 |
| 52882 | 2021 | Georgia | Putnam | High Blood Pressure Among Adults | 45.5 |
| 83966 | 2021 | Kentucky | Edmonson | Taking Medicine To Control High Blood Pressure... | 83.1 |


*ETHICAL REFLECTION*

**Data Transformations**

The data underwent several transformations to improve its quality and usability. Column names were renamed to snake_case for readability, string casing was standardized, and any missing confidence limits were set to be filled with the column median (the initial inspection found none missing in those columns). Outliers were flagged, not removed, using the interquartile range (IQR) method within each health measure; rows were deduplicated to keep the most recent entry per county and measure; and fuzzy string matching corrected known county name variants. The CDC PLACES dataset, the source of this data, is publicly available for academic and public health use, with no specific legal restrictions, though ethical use and transparency are expected. These changes carry risks: imputation can mask true variability, fuzzy matching can introduce errors by rewriting names that were already correct, and IQR-based detection can flag valid but rare data points. The cleaning also rests on assumptions, including that the IQR method is a reliable outlier criterion for every measure and that the predefined county name corrections are accurate.

**Data Sourcing and Ethical Considerations**

The data was sourced directly from the CDC’s official repositories, which supports its credibility, and was acquired ethically: it is open access and contains no personally identifiable information. To mitigate ethical concerns, every transformation was documented step by step, and the raw file was left unmodified so the work can be reproduced. Outlier detection used the conventional 1.5 × IQR fences, and fuzzy matches required a similarity score above 85 and were written to a separate column (county_name_clean), so the original names remain available for review. Together, these choices keep the cleaning transparent and help protect data integrity.