## Project Milestone 3: Cleaning and Formatting Life Expectancy Data from Wikipedia

### Step 0: Import Libraries

In [96]:
# pandas for data manipulation, numpy for numerical operations,
# requests to fetch the webpage, BeautifulSoup to parse HTML content,
# and StringIO to pass the downloaded HTML text to pandas.read_html
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
from io import StringIO

### Step 1: Load the HTML page

In [98]:
# Send a GET request to the Wikipedia page and parse the content
url = "https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_life_expectancy"
print("Fetching data from Wikipedia...")
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop early if the page could not be fetched
soup = BeautifulSoup(response.text, 'html.parser')

Fetching data from Wikipedia...


### Extract all HTML tables from the page using pandas

In [100]:
print("Extracting tables from the page...")
tables = pd.read_html(StringIO(response.text))
print(f"{len(tables)} tables found on the page.")

Extracting tables from the page...
21 tables found on the page.


### Using the second table (manually identified to be the relevant one for life expectancy data)

In [102]:
life_expectancy_df = tables[1]
print("Preview of raw data:")
print(life_expectancy_df.head())

Preview of raw data:
  State / Territory   2014                     2014 →2019   2019         \
  State / Territory    All   Male Female F Δ M 2014 →2019    All   Male   
0     United States  78.89  76.46  81.25  4.79       0.02  78.91  76.40   
1            Hawaii  81.46  78.63  84.19  5.56       0.14  81.60  78.63   
2        California  81.04  78.73  83.26  4.53       0.11  81.15  78.69   
3          New York  80.69  78.27  82.90  4.63       0.37  81.06  78.54   
4         Minnesota  80.90  78.84  82.89  4.05      −0.23  80.67  78.45   

                ...   2020                     2020 →2021   2021         \
  Female F Δ M  ...    All   Male Female F Δ M 2020 →2021    All   Male   
0  81.43  5.03  ...  77.05  74.32  79.88  5.56      −0.60  76.45  73.63   
1  84.60  5.97  ...  81.64  78.60  84.71  6.11      −0.75  80.89  77.85   
2  83.60  4.91  ...  79.24  76.44  82.16  5.72      −0.66  78.58  75.59   
3  83.43  4.89  ...  78.25  75.32  81.19  5.87       1.18  79.43  76.68   
4  
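
Selecting `tables[1]` by position works only as long as the page layout stays the same. As a minimal sketch, assuming the target table can be recognized by labels in its header, the lookup could be done by content instead. `find_table_by_header` is a hypothetical helper written for illustration (it is not part of pandas), and the toy tables below are stand-ins, not page data.

```python
import pandas as pd


def find_table_by_header(tables, required_labels):
    """Return the first DataFrame whose column labels contain every required label.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    for candidate in tables:
        # Collect labels from single-level or MultiIndex columns into one set.
        labels = set()
        for col in candidate.columns:
            parts = col if isinstance(col, tuple) else (col,)
            labels.update(str(part).strip() for part in parts)
        if all(label in labels for label in required_labels):
            return candidate
    raise ValueError(f"No table contains all of: {required_labels}")


# Toy stand-ins for the list returned by pd.read_html (made-up, not page data).
toy_tables = [
    pd.DataFrame({"Rank": [1], "Name": ["x"]}),
    pd.DataFrame(
        [["Hawaii", 81.46, 81.60]],
        columns=pd.MultiIndex.from_tuples(
            [
                ("State / Territory", "State / Territory"),
                ("2014", "All"),
                ("2019", "All"),
            ]
        ),
    ),
]

selected = find_table_by_header(toy_tables, ["State / Territory", "2014", "2019"])
print(selected.columns.tolist())
```

In the notebook, the same call could be applied to the `tables` list, falling back to manual inspection if no table matches.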

### Step 2: Rename columns for clarity

In [104]:
# Replacing confusing or overly long column headers with simplified versions
print("Renaming columns for clarity...")
print(f"Original columns: {life_expectancy_df.columns.tolist()}")

Renaming columns for clarity...
Original columns: [('State / Territory', 'State / Territory'), ('2014', 'All'), ('2014', 'Male'), ('2014', 'Female'), ('2014', 'F Δ M'), ('2014 →2019', '2014 →2019'), ('2019', 'All'), ('2019', 'Male'), ('2019', 'Female'), ('2019', 'F Δ M'), ('2019 →2020', '2019 →2020'), ('2020', 'All'), ('2020', 'Male'), ('2020', 'Female'), ('2020', 'F Δ M'), ('2020 →2021', '2020 →2021'), ('2021', 'All'), ('2021', 'Male'), ('2021', 'Female'), ('2021', 'F Δ M'), ('2019 →2021', '2019 →2021')]


In [105]:
# Print the number of columns to help match with correct renaming
print(f"Number of columns in the table: {len(life_expectancy_df.columns)}")

Number of columns in the table: 21


In [106]:
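# Rename the first 9 columns and leave the others unchanged.
# NOTE: per the original headers printed above, these nine columns hold
# State / Territory, 2014 (All, Male, Female, F Δ M), 2014 →2019, and
# 2019 (All, Male, Female), so the year labels below do not match the data.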
new_column_names = [
    "State_Territory", "2020_Life_Expectancy", "2020_Male", "2020_Female",
    "Change_2019_2020", "2019_Life_Expectancy", "2010_Life_Expectancy",
    "Change_2010_2020", "2000_Life_Expectancy"
] + life_expectancy_df.columns[9:].tolist()

In [107]:
life_expectancy_df.columns = new_column_names

print("Updated column names:")
print(life_expectancy_df.columns.tolist())

Updated column names:
['State_Territory', '2020_Life_Expectancy', '2020_Male', '2020_Female', 'Change_2019_2020', '2019_Life_Expectancy', '2010_Life_Expectancy', 'Change_2010_2020', '2000_Life_Expectancy', ('2019', 'F Δ M'), ('2019 →2020', '2019 →2020'), ('2020', 'All'), ('2020', 'Male'), ('2020', 'Female'), ('2020', 'F Δ M'), ('2020 →2021', '2020 →2021'), ('2021', 'All'), ('2021', 'Male'), ('2021', 'Female'), ('2021', 'F Δ M'), ('2019 →2021', '2019 →2021')]
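
The table arrives with a two-level header, so one alternative to a hand-written name list is to derive each name from the header itself. The sketch below is an assumption-laden illustration, not the method used in this notebook: `clean_label` and `flatten_header` are hypothetical helpers, and reading `F Δ M` as "female minus male" is an interpretation that is consistent with the preview row (81.25 - 76.46 = 4.79).

```python
import pandas as pd


def clean_label(label):
    """Make one header label safe to use as part of a column name."""
    text = str(label).replace("→", " to ").replace("Δ", " minus ").replace("/", " ")
    # Collapse any run of whitespace into a single underscore.
    return "_".join(text.split())


def flatten_header(columns):
    """Turn two-level header tuples into single strings such as '2014_All'."""
    flat = []
    for top, bottom in columns:
        top, bottom = clean_label(top), clean_label(bottom)
        # When both levels repeat the same label, keep a single copy.
        flat.append(top if top == bottom else f"{top}_{bottom}")
    return flat


# A few header tuples copied from the "Original columns" output above.
sample_columns = pd.MultiIndex.from_tuples(
    [
        ("State / Territory", "State / Territory"),
        ("2014", "All"),
        ("2014", "F Δ M"),
        ("2014 →2019", "2014 →2019"),
    ]
)
flat_names = flatten_header(sample_columns)
print(flat_names)
# → ['State_Territory', '2014_All', '2014_F_minus_M', '2014_to_2019']
```

Names built this way keep the year that appears in the source header, which avoids labeling a 2014 column as 2020.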


### Step 3: Remove rows with missing life expectancy values (e.g., footnotes or summary rows)

In [109]:
print("Removing rows with missing life expectancy values...")
initial_row_count = len(life_expectancy_df)
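# .copy() makes the filtered frame independent, so later column assignments modify it directly
life_expectancy_df = life_expectancy_df[life_expectancy_df['2020_Life_Expectancy'].notna()].copy()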
print(f"Removed {initial_row_count - len(life_expectancy_df)} rows with missing data.")

Removing rows with missing life expectancy values...
Removed 0 rows with missing data.


### Step 4: Strip whitespace and fix casing in the State_Territory column

In [111]:
# Ensuring consistency in naming format for downstream processing
print("Standardizing state and territory names...")
life_expectancy_df['State_Territory'] = life_expectancy_df['State_Territory'].str.strip().str.title()
print("Preview after casing fix:")
print(life_expectancy_df[['State_Territory']].head())

Standardizing state and territory names...
Preview after casing fix:
  State_Territory
0   United States
1          Hawaii
2      California
3        New York
4       Minnesota


### Step 5: Convert numeric columns from object to float

In [113]:
# Some columns are stored as objects due to formatting. Convert them to float for analysis.
print("Converting numeric columns from object to float...")
num_cols = [
    "2020_Life_Expectancy", "2020_Male", "2020_Female",
    "Change_2019_2020", "2019_Life_Expectancy", "2010_Life_Expectancy",
    "Change_2010_2020", "2000_Life_Expectancy"
]
life_expectancy_df[num_cols] = life_expectancy_df[num_cols].apply(pd.to_numeric, errors='coerce')
print("Numeric conversion complete.")

Converting numeric columns from object to float...
Numeric conversion complete.
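
One thing to watch with `errors='coerce'`: the raw preview shows negative changes written with a Unicode minus sign (for example `−0.23`), and `pd.to_numeric` does not parse that character, so such entries would silently become `NaN`. The following is a minimal sketch on made-up strings, assuming the affected cells are plain strings that differ from valid numbers only by that character.

```python
import pandas as pd

# Made-up sample mirroring the formats seen in the raw preview.
raw = pd.Series(["0.02", "\u22120.23", "1.18", "n/a"])

# Replace the Unicode minus sign (U+2212) with an ASCII hyphen-minus before parsing.
normalized = raw.str.replace("\u2212", "-", regex=False).str.strip()
values = pd.to_numeric(normalized, errors="coerce")

# For comparison: the same call without the replacement.
naive = pd.to_numeric(raw, errors="coerce")

print(values.tolist())
print(naive.isna().tolist())
```

Comparing `naive.isna().sum()` with `values.isna().sum()` on the real columns would show how many entries the coercion would otherwise lose.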


### Step 6: Identify and handle outliers

In [115]:
# Flagging any life expectancy values outside reasonable bounds (<60 or >90)
print("Checking for outliers in life expectancy (2020)...")
outlier_mask = (life_expectancy_df['2020_Life_Expectancy'] < 60) | (life_expectancy_df['2020_Life_Expectancy'] > 90)
outliers = life_expectancy_df[outlier_mask]
print(f"Found {len(outliers)} outliers. Removing them...")
life_expectancy_df = life_expectancy_df[~outlier_mask]

Checking for outliers in life expectancy (2020)...
Found 0 outliers. Removing them...
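
Dropping out-of-range rows discards values that may be real. A hedged alternative is to flag them in a separate column and decide later. The sketch below uses made-up values and hypothetical column names (`Life_Expectancy`, `Out_Of_Range`) with the same 60 and 90 bounds as the check above.

```python
import pandas as pd

# Made-up values for illustration, not page data.
toy = pd.DataFrame(
    {
        "State_Territory": ["A", "B", "C", "D"],
        "Life_Expectancy": [78.9, 59.5, 81.5, None],
    }
)

LOWER, UPPER = 60, 90  # same bounds as the outlier check above

# between() is inclusive on both ends by default, so 60 and 90 count as in range.
in_range = toy["Life_Expectancy"].between(LOWER, UPPER)

# Flag only values that are present and outside the bounds; missing values
# are left unflagged here so they can be handled by a separate rule.
toy["Out_Of_Range"] = toy["Life_Expectancy"].notna() & ~in_range

flagged = toy[toy["Out_Of_Range"]]
print(flagged)
```

Keeping the flag alongside the data leaves the original rows available for review before any removal.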


### Step 7: Extract and print section headers using BeautifulSoup

In [117]:
print("\nSection Headers on Wikipedia Page:\n")
section_headers = soup.find_all(['h2', 'h3'])
for header in section_headers:
    header_text = header.get_text(strip=True).replace('[edit]', '')
    print("-", header_text)



Section Headers on Wikipedia Page:

- Contents
- History
- Methodology
- National Center for Health Statistics (2019–2021)
- US Mortality DataBase (2014–2021)
- Global Data Lab (2019–2021)
- Percentage surviving in 2019
- Data of the National Center for Health Statistics
- Data of the United States Mortality DataBase
- Past life expectancy, 1940–2019
- Life expectancy in counties with 500,000+ people in 2019
- Charts
- See also
- References
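
The list above includes navigation headings such as "Contents". As a rough sketch, assuming edit links sit in `span` elements with the class `mw-editsection` and that the headings in `SKIP` are not wanted, the extraction could filter them out. The HTML below is made-up markup for illustration, not the real page source.

```python
from bs4 import BeautifulSoup

# Made-up markup that imitates the heading structure (not the real page source).
sample_html = """
<h2>Contents</h2>
<h2>History<span class="mw-editsection">[edit]</span></h2>
<h3>Methodology<span class="mw-editsection">[edit]</span></h3>
<h2>References</h2>
"""

SKIP = {"Contents", "See also", "References"}  # assumed navigation headings

soup_sample = BeautifulSoup(sample_html, "html.parser")
headers = []
for tag in soup_sample.find_all(["h2", "h3"]):
    # Remove edit-link spans before reading the heading text.
    for span in tag.find_all("span", class_="mw-editsection"):
        span.decompose()
    text = tag.get_text(strip=True)
    if text and text not in SKIP:
        headers.append((tag.name, text))

print(headers)
# → [('h2', 'History'), ('h3', 'Methodology')]
```

Keeping the tag name next to the text preserves the section hierarchy (`h2` versus `h3`) for later use.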


### Step 8: Final cleaned dataset output

In [119]:
print("\nCleaned Life Expectancy Dataset:\n")
print(life_expectancy_df.to_string(index=False))


Cleaned Life Expectancy Dataset:

   State_Territory  2020_Life_Expectancy  2020_Male  2020_Female  Change_2019_2020  2019_Life_Expectancy  2010_Life_Expectancy  Change_2010_2020  2000_Life_Expectancy  (2019, F Δ M) (2019 →2020, 2019 →2020)  (2020, All)  (2020, Male)  (2020, Female)  (2020, F Δ M) (2020 →2021, 2020 →2021)  (2021, All)  (2021, Male)  (2021, Female)  (2021, F Δ M) (2019 →2021, 2019 →2021)
     United States                 78.89      76.46        81.25              4.79                  0.02                 78.91             76.40                 81.43           5.03                    −1.86        77.05         74.32           79.88           5.56                    −0.60        76.45         73.63           79.41           5.78                    −2.46
            Hawaii                 81.46      78.63        84.19              5.56                  0.14                 81.60             78.63                 84.60           5.97                     0.04        81.64

### Step 9: Summary of Ethical Considerations

### Ethical Considerations of Data Wrangling

1. **Changes Made**: Column headers were renamed, state/territory names were stripped of whitespace and title-cased, and numeric data stored as strings was converted to numeric types. Rows with missing values in the key life expectancy column and rows with values outside the 60–90 range were filtered out, although in this run neither filter removed any rows.

2. **Regulatory & Legal Guidelines**: The data was sourced from a publicly available Wikipedia page, which aggregates statistics from government and health agencies (e.g., the CDC). For use in regulated environments or research, referencing and verifying against the original government sources would be necessary.

3. **Risks Introduced**: Transformations may introduce bias, particularly through the removal of outliers which might represent real but extreme values. Additionally, assumptions made about missing or malformed data may misrepresent some regions.

4. **Assumptions Made**: Values outside the range of 60–90 years for 2020 life expectancy were assumed to be errors or anomalies. Missing values were treated as non-usable data points.

5. **Data Source Credibility**: While Wikipedia is a secondary source, it includes citations from reputable institutions. However, data from the original sources (such as CDC reports) would be more reliable for critical analysis.

6. **Data Acquisition Ethics**: The dataset was collected using ethical means from a publicly accessible online source using standard data scraping methods.

7. **Mitigation Strategies**: All cleaning and transformation steps were documented to ensure transparency. The use of open tools and reproducible code helps promote ethical data practices. Where possible, cross-validation with original sources is recommended to ensure data accuracy.
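
The cross-validation mentioned in the last item can be sketched as a join between the scraped table and a reference table. Everything below is hypothetical: the two frames hold made-up numbers, the column names are placeholders, and `TOLERANCE` is an arbitrary threshold chosen for illustration.

```python
import pandas as pd

# Made-up stand-ins for the scraped table and a table from an original source.
scraped = pd.DataFrame(
    {"State_Territory": ["A", "B", "C"], "Life_Expectancy": [80.0, 78.5, 81.2]}
)
reference = pd.DataFrame(
    {"State_Territory": ["A", "B", "D"], "Life_Expectancy": [80.0, 78.9, 79.0]}
)

# An outer join keeps rows that appear in only one source; indicator=True
# adds a "_merge" column saying where each row came from.
merged = scraped.merge(
    reference,
    on="State_Territory",
    how="outer",
    suffixes=("_scraped", "_reference"),
    indicator=True,
)
merged["abs_diff"] = (
    merged["Life_Expectancy_scraped"] - merged["Life_Expectancy_reference"]
).abs()

TOLERANCE = 0.05  # arbitrary threshold for this sketch

# Report rows missing from either source or differing by more than the tolerance.
mismatches = merged[(merged["_merge"] != "both") | (merged["abs_diff"] > TOLERANCE)]
print(mismatches)
```

Rows reported here are candidates for manual review against the cited source rather than automatic correction.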
