*DATA PREPARATION Milestone 4:* Data Wrangling with the U.S. Census Bureau API - Daniel Solis Toro

In [3]:
import requests
import pandas as pd
import numpy as np
from scipy import stats

# API Key
api_key = "e74d75dcaedb5caf0ceb4ddb4465b4be3ca1cc93"

# Step 0: Connect to the API and pull raw data
url = "https://api.census.gov/data/2021/acs/acs5"
params = {
    "get": "NAME,B19013_001E,B15003_017E,B15003_022E",
    "for": "county:*",
    "key": api_key
}

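response = requests.get(url, params=params, timeout=30)
response.raise_for_status()  # stop early on HTTP errors (e.g. invalid key or variable name)
data = response.json()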

# Create a DataFrame
df = pd.DataFrame(data[1:], columns=data[0])

# Step #1 – Rename Columns
# Improve readability by using human-friendly column names
df.rename(columns={
    "NAME": "County Name",
    "B19013_001E": "Median Household Income",
    "B15003_017E": "High School Graduates",
    "B15003_022E": "Bachelor's Degree",
    "state": "State FIPS",
    "county": "County FIPS"
}, inplace=True)

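# Step #2 – Convert Data Types
# Convert selected columns from strings to numeric; unparseable values become NaN
numeric_cols = ["Median Household Income", "High School Graduates", "Bachelor's Degree"]
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")
# The Census API can encode unavailable estimates as large negative codes (e.g. -666666666).
# Income and population counts cannot be negative, so treat such values as missing too.
df[numeric_cols] = df[numeric_cols].where(df[numeric_cols] >= 0)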

# Step #3 – Drop Rows with Missing Values
# Remove any county entries missing income or education values
df.dropna(subset=numeric_cols, inplace=True)

# Step #4 – Format Numbers
# Round numeric values to 2 decimal places
df[numeric_cols] = df[numeric_cols].round(2)

# Step #5 – Detect and Remove Outliers
# Using Z-score to remove rows with extreme values
z_scores = np.abs(stats.zscore(df[numeric_cols]))
df = df[(z_scores < 3).all(axis=1)].copy()  # copy so later column assignments do not hit a view

# Step #6 – Create Education Ratio Column
# Add a new column with the ratio of bachelor's degree holders to high school graduates
df["Bachelor-to-HS Ratio"] = (df["Bachelor's Degree"] / df["High School Graduates"]).round(3)

# Final Cleaned Dataset Preview
df.reset_index(drop=True, inplace=True)
df.head(10)

   County Name               Median Household Income  High School Graduates  Bachelor's Degree  State FIPS  County FIPS  Bachelor-to-HS Ratio
0  Autauga County, Alabama   62660                    10458                  6507               1           1            0.622
1  Baldwin County, Alabama   64346                    36186                  33379              1           3            0.922
2  Barbour County, Alabama   36422                    5204                   1212               1           5            0.233
3  Bibb County, Alabama      54277                    5556                   1276               1           7            0.23
4  Blount County, Alabama    52830                    11019                  3783               1           9            0.343
5  Bullock County, Alabama   29063                    2455                   560                1           11           0.228
6  Butler County, Alabama    45236                    5292                   1087               1           13           0.205
7  Calhoun County, Alabama   50977                    22101                  9159               1           15           0.414
8  Chambers County, Alabama  47232                    7212                   2383               1           17           0.33
9  Cherokee County, Alabama  43475                    5451                   1148               1           19           0.211
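
The preview above shows State FIPS and County FIPS without leading zeros (1 and 3 rather than 01 and 003), which typically happens when the codes are parsed as integers. The sketch below is one possible way to restore the standard 2-digit state and 3-digit county codes so the table can be joined to other county-level data. The sample rows are a hand-copied stand-in for the real DataFrame, and the `add_geoid` helper and `GEOID` column name are hypothetical, not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample mirroring the first two rows of the preview
sample = pd.DataFrame({
    "County Name": ["Autauga County, Alabama", "Baldwin County, Alabama"],
    "State FIPS": [1, 1],
    "County FIPS": [1, 3],
})

def add_geoid(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with zero-padded FIPS codes and a 5-character GEOID column."""
    out = frame.copy()
    # State codes are 2 characters wide, county codes 3 characters wide
    out["State FIPS"] = out["State FIPS"].astype(str).str.zfill(2)
    out["County FIPS"] = out["County FIPS"].astype(str).str.zfill(3)
    # "GEOID" is an assumed column name for the combined state + county code
    out["GEOID"] = out["State FIPS"] + out["County FIPS"]
    return out

padded = add_geoid(sample)
print(padded[["County Name", "GEOID"]])
```

Keeping the codes as strings avoids losing the leading zeros again if the table is written out and read back.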


*Ethical Considerations of Data Wrangling (U.S. Census Data)*
For this project, we collected public socioeconomic data for every U.S. county from the U.S. Census Bureau API (2021 American Community Survey 5-year estimates). We cleaned the data by renaming variables, converting data types, dropping rows with missing values, rounding numeric values, removing outliers, and adding a calculated education ratio. These transformations make the dataset easier to use, but they can also introduce bias: counties dropped for missing values or flagged as outliers (more than three standard deviations from the mean) may be unusually small, poor, or wealthy, and some of them may represent marginalized communities. Because the data comes directly from the U.S. Census Bureau, a credible source, and is published as county-level aggregates, it adheres to federal privacy and ethical standards. Every cleaning step is documented in the code above, and no data was fabricated. The assumptions behind these steps, such as treating rows with missing values as safe to drop, could affect the fairness of any downstream analysis, so further contextual evaluation is recommended to ensure equity.
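
One way to begin the contextual evaluation recommended above is to count which groups lose rows to the outlier filter before removing anything. The following is a minimal sketch under stated assumptions: `audit_outlier_filter` is a hypothetical helper, the 20-row table is synthetic (not Census results), and the threshold of 3 matches the z-score rule used in Step #5.

```python
import numpy as np
import pandas as pd
from scipy import stats

def audit_outlier_filter(frame, cols, group_col, threshold=3.0):
    """Count, per group, how many rows a |z| < threshold filter would remove."""
    z = np.abs(stats.zscore(frame[cols].to_numpy(dtype=float), axis=0))
    removed = ~(z < threshold).all(axis=1)  # True where any column is extreme
    summary = (
        frame.assign(removed=removed)
        .groupby(group_col)["removed"]
        .agg(total="size", removed="sum")
    )
    summary["share_removed"] = (summary["removed"] / summary["total"]).round(3)
    return summary.sort_values("share_removed", ascending=False)

# Synthetic data: 19 ordinary incomes plus one extreme value in state "02"
synthetic = pd.DataFrame({
    "State FIPS": ["01"] * 10 + ["02"] * 10,
    "Median Household Income": list(range(40000, 59000, 1000)) + [250000],
})

summary = audit_outlier_filter(
    synthetic, ["Median Household Income"], "State FIPS"
)
print(summary)
```

If the removed rows cluster in a few states or in counties with particular demographics, that is a signal to review the threshold, or to report results with and without the filter, rather than dropping those counties silently.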