# Life Expectancy Data Cleaning Notebook

This notebook documents the data cleaning process for the Life Expectancy dataset in preparation for building linear regression models. The dataset contains health and economic indicators for countries from 2000 to 2015.

**Dataset Source**: WHO Life Expectancy Data

**Cleaning Steps**:
1. Initial Data Exploration
2. Data Cleaning (column names, missing values, outliers)
3. Feature Engineering
4. Export Cleaned Data

## 1. Initial Data Exploration

In [1]:
import pandas as pd

# Load dataset
df = pd.read_csv('Life_Expectancy/Life Expectancy Data.csv')

# Initial exploration
print("Dataset shape:", df.shape)
print("\nFirst row:")
df.head(1)

Dataset shape: (2938, 22)

First row:


| | Country | Year | Status | Life expectancy | Adult Mortality | infant deaths | Alcohol | percentage expenditure | Hepatitis B | Measles | ... | Polio | Total expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness 1-19 years | thinness 5-9 years | Income composition of resources | Schooling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 2015 | Developing | 65.0 | 263.0 | 62 | 0.01 | 71.279624 | 65.0 | 1154 | ... | 6.0 | 8.16 | 65.0 | 0.1 | 584.25921 | 33736494.0 | 17.2 | 17.3 | 0.479 | 10.1 |


### Key Observations:
- 2938 rows × 22 columns
- Mix of numerical and categorical features
- Target variable: 'Life expectancy'
- Some column names contain stray whitespace (doubled, leading, or trailing spaces) that needs cleaning

## 2. Data Cleaning

### 2.1 Fix Column Names

In [2]:
# Clean column names: collapse double spaces, then strip leading/trailing whitespace
df.columns = df.columns.str.replace('  ', ' ').str.strip()
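The cell above handles doubled spaces specifically. A slightly more general variant, sketched here under the assumption that any run of whitespace should collapse to one space, uses a regular expression. The header strings below are hypothetical stand-ins, not values read from the CSV.

```python
import pandas as pd

# Hypothetical raw headers with leading, trailing, and doubled spaces
raw_columns = pd.Index([" BMI ", "thinness  1-19 years", "Life expectancy "])

# Collapse any run of whitespace to a single space, then trim both ends
clean_columns = raw_columns.str.replace(r"\s+", " ", regex=True).str.strip()

print(list(clean_columns))
# ['BMI', 'thinness 1-19 years', 'Life expectancy']
```

The regex form also covers tabs or runs of three or more spaces, which the literal double-space replacement would only partly reduce.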

### 2.2 Handle Missing Values

In [3]:
# Calculate missing values percentage
missing_percent = (df.isnull().mean() * 100).round(2)
print('Missing Values (%): ')
print(missing_percent[missing_percent > 0].sort_values(ascending=False))

Missing Values (%): 
Population                         22.19
Hepatitis B                        18.82
GDP                                15.25
Total expenditure                   7.69
Alcohol                             6.60
Income composition of resources     5.68
Schooling                           5.55
BMI                                 1.16
thinness 1-19 years                 1.16
thinness 5-9 years                  1.16
Polio                               0.65
Diphtheria                          0.65
Life expectancy                     0.34
Adult Mortality                     0.34


**Decisions:**
- Drop 'Population' (22% missing) and 'Hepatitis B' (19% missing)
- Fill remaining missing values in numeric columns with the column median (robust to outliers)

In [4]:
# Drop high-missing columns
df.drop(columns=['Population','Hepatitis B'], inplace=True)
print("Remaining columns after dropping:", df.shape[1])

Remaining columns after dropping: 20
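The drop above names the two columns explicitly. The same decision could be expressed as a threshold rule; the sketch below assumes a cutoff of 18%, chosen only because it separates the two dropped columns from GDP (15.25% missing). The toy frame and its column names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; column names are illustrative
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],  # 30% missing
    "b": [1.0, 2.0, 3.0, np.nan, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],        # 10% missing
    "c": [float(i) for i in range(10)],                                  # complete
})

MISSING_THRESHOLD = 0.18  # assumed cutoff, not taken from the notebook

# Share of missing values per column, as a fraction between 0 and 1
missing_share = toy.isnull().mean()

# Columns whose missing share exceeds the cutoff
cols_to_drop = missing_share[missing_share > MISSING_THRESHOLD].index.tolist()
toy_reduced = toy.drop(columns=cols_to_drop)

print(cols_to_drop)  # ['a']
```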


In [5]:
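# Fill remaining missing values in numeric columns with the column median.
# Assign the result back: calling fillna(inplace=True) on df[col] is chained
# assignment, which newer pandas versions warn about and may not apply to df.
for col in df.select_dtypes(include=['number']).columns:
    df[col] = df[col].fillna(df[col].median())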
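To illustrate why the median is described as robust to outliers, here is a minimal comparison of mean and median imputation on made-up values containing one extreme entry. The series name and numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Illustrative values: one extreme entry and one missing entry
gdp = pd.Series([500.0, 600.0, 700.0, 100000.0, np.nan])

mean_fill = gdp.fillna(gdp.mean())      # mean is pulled up by the extreme value
median_fill = gdp.fillna(gdp.median())  # median stays near the typical values

print(mean_fill.iloc[-1])    # 25450.0
print(median_fill.iloc[-1])  # 650.0
```

The mean-imputed value is far larger than three of the four observed entries, while the median-imputed value sits between the typical ones.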

### 2.3 Handle Outliers

In [6]:
# Cap outliers at 1st and 99th percentiles
num_cols = df.select_dtypes(include=['number']).columns
for col in num_cols:
    upper = df[col].quantile(0.99)
    lower = df[col].quantile(0.01)
    df[col] = df[col].clip(lower, upper)
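The loop above caps each numeric column separately. Note that `num_cols` as selected there also includes `Year` and the target `Life expectancy`. As a sketch, the same percentile capping can be written without a loop by passing per-column bounds to `DataFrame.clip`; the toy data below is randomly generated and purely illustrative.

```python
import numpy as np
import pandas as pd

# Randomly generated toy data with one injected extreme value
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "x": rng.normal(size=1000),
    "y": rng.exponential(size=1000),
})
toy.loc[0, "x"] = 50.0

# One row of bounds per quantile, one column per feature
bounds = toy.quantile([0.01, 0.99])

# axis=1 aligns the bound Series with the frame's columns
capped = toy.clip(lower=bounds.loc[0.01], upper=bounds.loc[0.99], axis=1)

print(capped.loc[0, "x"] < 50.0)  # True: the injected value was capped
```

Capping keeps every row, unlike dropping outliers, so the row count is unchanged.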

## 3. Feature Engineering

In [7]:
# One-hot encode categorical variables
df = pd.get_dummies(df, columns=['Country', 'Status'])
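Because the stated goal is linear regression, it may be worth considering `drop_first=True`, which omits one dummy per categorical variable so the dummies are not perfectly collinear with an intercept. This is a suggestion rather than what the cell above does; the toy frame below is illustrative.

```python
import pandas as pd

# Small illustrative frame; values are made up
toy = pd.DataFrame({
    "Status": ["Developing", "Developed", "Developing"],
    "Schooling": [10.1, 15.0, 9.8],
})

# Full one-hot encoding: one dummy column per category
full = pd.get_dummies(toy, columns=["Status"])

# Reduced encoding: the first category (alphabetically) becomes the baseline
reduced = pd.get_dummies(toy, columns=["Status"], drop_first=True)

print(list(full.columns))     # ['Schooling', 'Status_Developed', 'Status_Developing']
print(list(reduced.columns))  # ['Schooling', 'Status_Developing']
```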

## 4. Export Cleaned Data

In [8]:
# Save cleaned data
df.to_csv('life_expectancy_cleaned.csv', index=False)
print("Data cleaning complete. Saved to 'life_expectancy_cleaned.csv'")
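A quick round-trip check can confirm that an export reloads with the same columns and no missing values. The sketch below uses an in-memory buffer and a made-up two-row frame as stand-ins for the real file and data.

```python
import io

import pandas as pd

# Made-up frame standing in for the cleaned data
toy = pd.DataFrame({"Year": [2014, 2015], "Life expectancy": [59.9, 65.0]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Reload and compare structure with the original
reloaded = pd.read_csv(buffer)
same_columns = list(reloaded.columns) == list(toy.columns)
no_missing = int(reloaded.isnull().sum().sum()) == 0

print(same_columns, no_missing)  # True True
```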

## Conclusion

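The dataset has been cleaned through:
- Normalizing column names
- Handling missing values (dropping 'Population' and 'Hepatitis B', median-filling the remaining numeric columns)
- Treating outliers (capping at the 1st and 99th percentiles)
- Feature encoding (one-hot encoding 'Country' and 'Status')

The cleaned data is saved to 'life_expectancy_cleaned.csv' and is ready for exploratory analysis and modeling.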