*DATA PREPARATION*
**Milestone 3: Cleaning/Formatting Website Data – Wikipedia Dataset**
***Daniel Solis Toro***

*CLEANING/FORMATTING WEBSITE DATA*:
This notebook extracts and cleans county income data from Wikipedia.

**Step 0: Setup & Import Libraries**

In [109]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
import numpy as np

**Step 1: Extract Raw Data from Wikipedia**

In [112]:
url = "https://en.wikipedia.org/wiki/List_of_United_States_counties_by_per_capita_income"
headers = {'User-Agent': 'Mozilla/5.0'}

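response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()  # Stop on HTTP errors instead of parsing an error page
soup = BeautifulSoup(response.text, 'html.parser')
table = soup.find('table', {'class': 'wikitable'})  # First table with class "wikitable"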

# Extract table data
data = []
for row in table.find_all('tr'):
    cols = row.find_all(['th', 'td'])
    cols = [col.text.strip() for col in cols]
    data.append(cols)

df = pd.DataFrame(data)
df.columns = df.iloc[0]  # Set headers
df = df[1:]  # Remove header row

print("Raw data shape:", df.shape)
df.head()

Raw data shape: (3297, 8)


| | Rank | County or county-equivalent | State, federal district or territory | Per capitaincome | Medianhouseholdincome | Medianfamilyincome | Population | Number ofhouseholds |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | New York County | New York | $76,592 | $69,659 | $86,553 | 1628706 | 759460 |
| 2 | 2 | Arlington | Virginia | $62,018 | $103,208 | $139,244 | 214861 | 94454 |
| 3 | 3 | Falls Church City | Virginia | $59,088 | $120,000 | $152,857 | 12731 | 5020 |
| 4 | 4 | Marin | California | $56,791 | $90,839 | $117,357 | 254643 | 102912 |
| 5 | 5 | Santa Clara | California | $56,248 | $124,055 | $124,055 | 1927852 | 640215 |
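
The extraction loop in Step 1 stores each `<tr>` as a list of cell strings, and `pd.DataFrame` pads any row shorter than the widest one with `None`. A quick way to see whether that is happening is to count row widths before building the frame. The sketch below is only illustrative: `row_width_counts` is a hypothetical helper, and the short row in `sample` is made up rather than taken from the live page.

```python
from collections import Counter


def row_width_counts(rows):
    """Count how many scraped rows have each number of cells."""
    return Counter(len(row) for row in rows)


# Illustrative rows: a header, one complete row, and one made-up short row
# that lacks its last two cells.
sample = [
    ["Rank", "County", "State", "Per capita income", "Population", "Households"],
    ["1", "New York County", "New York", "$76,592", "1628706", "759460"],
    ["0", "Example Territory", "", "$10,000"],
]

counts = row_width_counts(sample)
for width, n_rows in sorted(counts.items()):
    print(f"{n_rows} row(s) with {width} cells")
```

In the notebook, the same call on `data` would show whether any rows are narrower than the 8-column header, which is one possible source of the missing population and household values reported later.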


**Step 2: Data Cleaning Transformations**

***Transformation #1: Clean Column Names***

In [116]:
# Normalize column names: turn newlines and spaces into underscores and lowercase everything
df.columns = [col.replace('\n', ' ').replace(' ', '_').lower() for col in df.columns]
print("Cleaned columns:", df.columns.tolist())

Cleaned columns: ['rank', 'county_or_county-equivalent', 'state,_federal_district_or_territory', 'per_capitaincome', 'medianhouseholdincome', 'medianfamilyincome', 'population', 'number_ofhouseholds']
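
The cleaned names above still contain commas and hyphens, and some header words are fused (for example `per_capitaincome`), apparently because the line breaks inside the header cells were lost during scraping. A stricter normalization could look like the sketch below. It is an alternative, not what this notebook does: `to_snake_case` and the `FUSED_FIXES` mapping are hypothetical, and adopting them would mean updating the column names used in later cells.

```python
import re

# Hypothetical mapping for headers whose words were fused during scraping.
FUSED_FIXES = {
    "per_capitaincome": "per_capita_income",
    "medianhouseholdincome": "median_household_income",
    "medianfamilyincome": "median_family_income",
    "number_ofhouseholds": "number_of_households",
}


def to_snake_case(name):
    """Lowercase a header and collapse every non-alphanumeric run to '_'."""
    cleaned = re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")
    return FUSED_FIXES.get(cleaned, cleaned)


raw_headers = [
    "Rank",
    "County or county-equivalent",
    "State, federal district or territory",
    "Per capitaincome",
]
clean_headers = [to_snake_case(h) for h in raw_headers]
print(clean_headers)
```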


***Transformation #2: Convert Currency to Numeric***

In [119]:
# Remove $ and commas from income columns, convert to float, handle empty values

currency_cols = ['per_capitaincome', 'medianhouseholdincome', 'medianfamilyincome']

for col in currency_cols:
    # Replace empty strings with NaN first
    df[col] = df[col].replace('', np.nan)
    # Then clean and convert
    df[col] = (df[col].str.replace('$', '', regex=False)
               .str.replace(',', '', regex=False)
               .astype(float))
    
print(f"Converted {currency_cols} to numeric")
print("Missing values after conversion:")
print(df[currency_cols].isna().sum())

Converted ['per_capitaincome', 'medianhouseholdincome', 'medianfamilyincome'] to numeric
Missing values after conversion:
per_capitaincome         9
medianhouseholdincome    9
medianfamilyincome       9
dtype: int64
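
The cell above relies on `.astype(float)`, which raises an error if a cell holds any non-numeric placeholder other than an empty string. A more tolerant variant, sketched here under that assumption, uses `pd.to_numeric(..., errors="coerce")` so that unparseable text becomes `NaN`. `parse_currency` is a hypothetical helper, and the placeholder values in `sample` are made up.

```python
import pandas as pd


def parse_currency(series):
    """Strip '$' and ',' and coerce anything unparseable to NaN."""
    cleaned = series.str.replace(r"[$,]", "", regex=True).str.strip()
    return pd.to_numeric(cleaned, errors="coerce")


# Two values from the table above plus made-up placeholders.
sample = pd.Series(["$76,592", "", "N/A", None, "$62,018"])
parsed = parse_currency(sample)

print(parsed.tolist())
print("missing:", int(parsed.isna().sum()))
```

Coercion is convenient but silent, so it is worth comparing the missing count before and after to see how many cells were turned into `NaN`.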


***Transformation #3: Clean and Standardize Location Names***

In [122]:
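# Remove bracketed footnote markers (like [5]) and extra spaces from county names.
# [^\]]* stops at the nearest closing bracket, so text between two footnotes is kept.
df['county_or_county-equivalent'] = df['county_or_county-equivalent'].str.replace(r'\[[^\]]*\]', '', regex=True).str.strip()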
df['county_or_county-equivalent'].head()

1      New York County
2            Arlington
3    Falls Church City
4                Marin
5          Santa Clara
Name: county_or_county-equivalent, dtype: object
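
Footnote stripping depends on how the bracket pattern is written. A greedy pattern such as `\[.*\]` spans from the first `[` to the last `]` in a string, so it also deletes any text that sits between two footnotes. The short comparison below uses a made-up county name to show the difference; it only needs the standard `re` module.

```python
import re

GREEDY = re.compile(r"\[.*\]")      # first '[' through the last ']'
NARROW = re.compile(r"\[[^\]]*\]")  # each '[...]' marker on its own

# Made-up name with two footnote markers around real text.
name = "Example County[5] (north)[a]"

greedy_result = GREEDY.sub("", name).strip()
narrow_result = NARROW.sub("", name).strip()

print(repr(greedy_result))
print(repr(narrow_result))
```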

***Transformation #4: Handle Missing/Inconsistent Data***

In [125]:
# Identify and clean rows with missing or placeholder values
print("Missing values before cleaning:")
print(df.isna().sum())

# Remove rows where critical columns are missing; .copy() avoids chained-assignment warnings below
df = df.dropna(subset=['per_capitaincome', 'county_or_county-equivalent']).copy()

# Fill any remaining missing income values with the column median
for col in currency_cols:
    df[col] = df[col].fillna(df[col].median())

print("\nMissing values after cleaning:")
print(df.isna().sum())

Missing values before cleaning:
rank                                     0
county_or_county-equivalent              0
state,_federal_district_or_territory     0
per_capitaincome                         9
medianhouseholdincome                    9
medianfamilyincome                       9
population                              56
number_ofhouseholds                     66
dtype: int64

Missing values after cleaning:
rank                                     0
county_or_county-equivalent              0
state,_federal_district_or_territory     0
per_capitaincome                         0
medianhouseholdincome                    0
medianfamilyincome                       0
population                              56
number_ofhouseholds                     57
dtype: int64
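
Once a value has been filled with the median, it can no longer be told apart from an observed one. One way to keep that information, sketched below as a suggestion rather than as part of this notebook, is to record a boolean flag column before filling. `impute_median_with_flag` and the `_imputed` suffix are hypothetical names, and the missing value in `sample` is made up.

```python
import numpy as np
import pandas as pd


def impute_median_with_flag(frame, columns):
    """Fill NaN with each column's median and flag which cells were filled."""
    out = frame.copy()  # Leave the caller's frame untouched
    for col in columns:
        out[f"{col}_imputed"] = out[col].isna()
        out[col] = out[col].fillna(out[col].median())
    return out


# Values from the table above, with one made-up gap.
sample = pd.DataFrame({
    "per_capitaincome": [76592.0, 62018.0, 59088.0],
    "medianfamilyincome": [86553.0, np.nan, 152857.0],
})
result = impute_median_with_flag(sample, ["medianfamilyincome"])
print(result)
```

Summing a flag column then gives the number of imputed cells, which is a useful figure to report alongside any analysis of that column.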


***Transformation #5: Convert Population to Numeric***

In [128]:
numeric_cols = ['population', 'number_ofhouseholds']

for col in numeric_cols:
    # Replace empty strings with NaN
    df[col] = df[col].replace('', np.nan)
    # Remove commas and convert
    df[col] = (df[col].str.replace(',', '', regex=False)
               .astype('Int64'))  # Uses pandas' nullable integer type

print("Converted population metrics to integers")
print(df[numeric_cols].head())

Converted population metrics to integers
   population  number_ofhouseholds
1     1628706               759460
2      214861                94454
3       12731                 5020
4      254643               102912
5     1927852               640215
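
As with the income columns, a direct cast only works while every non-missing cell is a clean digit string. The sketch below shows a more defensive route to the same nullable `Int64` result, going through `pd.to_numeric` first. `to_nullable_int` is a hypothetical helper and the placeholder strings in `sample` are made up.

```python
import pandas as pd


def to_nullable_int(series):
    """Remove thousands separators and convert to pandas' nullable Int64."""
    cleaned = series.str.replace(",", "", regex=False)
    # Unparseable text becomes NaN, which Int64 stores as <NA>.
    return pd.to_numeric(cleaned, errors="coerce").astype("Int64")


sample = pd.Series(["1,628,706", "214861", "", None, "n/a"])
converted = to_nullable_int(sample)

print(converted)
print("missing:", int(converted.isna().sum()))
```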


**Final Cleaned Dataset**

In [131]:
# Add a derived column: average number of people per household
df['household_size'] = df['population'] / df['number_ofhouseholds']

# Display cleaned data
print("Final cleaned data shape:", df.shape)
df.head(10)

Final cleaned data shape: (3288, 9)


| | rank | county_or_county-equivalent | state,_federal_district_or_territory | per_capitaincome | medianhouseholdincome | medianfamilyincome | population | number_ofhouseholds | household_size |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | New York County | New York | 76592.0 | 69659.0 | 86553.0 | 1628706 | 759460 | 2.144558 |
| 2 | 2 | Arlington | Virginia | 62018.0 | 103208.0 | 139244.0 | 214861 | 94454 | 2.274769 |
| 3 | 3 | Falls Church City | Virginia | 59088.0 | 120000.0 | 152857.0 | 12731 | 5020 | 2.536056 |
| 4 | 4 | Marin | California | 56791.0 | 90839.0 | 117357.0 | 254643 | 102912 | 2.474376 |
| 5 | 5 | Santa Clara | California | 56248.0 | 124055.0 | 124055.0 | 1927852 | 640215 | 3.011257 |
| 6 | 6 | Alexandria City | Virginia | 54608.0 | 85706.0 | 107511.0 | 143684 | 65369 | 2.198045 |
| 7 | 7 | Pitkin | Colorado | 51814.0 | 72745.0 | 93981.0 | 17173 | 7507 | 2.287598 |
| 8 | 8 | Los Alamos | New Mexico | 51044.0 | 106686.0 | 124979.0 | 17979 | 7590 | 2.368775 |
| 9 | 9 | Fairfax County | Virginia | 50532.0 | 110292.0 | 128596.0 | 1101071 | 389908 | 2.823925 |
| 10 | 10 | Hunterdon | New Jersey | 50349.0 | 106143.0 | 125828.0 | 127047 | 46816 | 2.713752 |
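
Because `household_size` is a ratio of two scraped columns, a misaligned or missing cell in either one produces an implausible or missing ratio. A simple range check can surface such rows for review. The sketch below is an assumption-laden example: `flag_implausible_household_size` is a hypothetical helper, the bounds of 1.0 and 6.0 are arbitrary illustrative thresholds, and the last two rows of `sample` are made up.

```python
import pandas as pd


def flag_implausible_household_size(frame, low=1.0, high=6.0):
    """Return rows whose population/households ratio is missing or out of range."""
    size = frame["population"] / frame["number_ofhouseholds"]
    mask = size.isna() | (size < low) | (size > high)
    return frame[mask.fillna(False).astype(bool)]


sample = pd.DataFrame({
    "county": ["New York County", "Arlington", "Made-up A", "Made-up B"],
    "population": pd.array([1628706, 214861, 500, None], dtype="Int64"),
    "number_ofhouseholds": pd.array([759460, 94454, 20, 100], dtype="Int64"),
})

flagged = flag_implausible_household_size(sample)
print(flagged)
```

Flagged rows are candidates for inspection, not automatic removal; an unusual ratio may be genuine, for instance where a large share of residents live in group quarters.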


**Ethical Implications**

Preparing the Wikipedia-based dataset involved several transformations: standardizing column names, converting currency and population values to numeric types, stripping footnote markers from location names, dropping rows with no per capita income, imputing any remaining missing income values with the column median, and deriving a household size metric. The data was ethically sourced under Wikipedia’s Creative Commons license; however, its crowd-sourced nature raises concerns about accuracy and reliability. While no specific legal or regulatory guidelines apply to this dataset, ethical considerations still matter because it includes sensitive socioeconomic indicators such as income. Data cleaning introduces its own risk of bias, particularly through the assumptions behind row removal, imputation, and standardization. To limit these risks, the analyst chose median imputation over mean imputation because the median is less sensitive to extreme values, preserved original values where possible, and avoided removing outliers unless clearly erroneous. Every transformation is shown in the notebook, so the cleaned table can be traced back to the raw scrape.