# Life Expectancy Prediction using Deep Learning

## Project Overview
This notebook implements a deep learning model to predict life expectancy based on various health, economic, and social factors. The model uses a neural network built with TensorFlow/Keras to perform regression analysis on the WHO Life Expectancy dataset.

## Dataset Information
The dataset contains life expectancy data from the World Health Organization (WHO) with the following key features:
- **Target Variable**: Life expectancy (in years)
- **Features**: Various health indicators, economic factors, and demographic information
- **Time Period**: Multiple years of data for different countries
- **Data Type**: Mixed (numerical and categorical)

## Methodology
1. Data loading and exploration
2. Data preprocessing and feature engineering
3. Train-test split
4. Feature scaling
5. Neural network model creation
6. Model training and evaluation


## 1. Import Required Libraries

This section imports all necessary libraries for data manipulation, preprocessing, and deep learning model creation.

In [13]:
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Machine learning preprocessing and model selection
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler  # Normalizer is not used in this notebook
from sklearn.compose import ColumnTransformer

# Deep learning framework (TensorFlow/Keras)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import InputLayer, Dense
from tensorflow.keras.optimizers import Adam

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

## 2. Data Loading

Load the Life Expectancy dataset from CSV file. This dataset contains information about life expectancy and various health/economic indicators for different countries over multiple years.

In [14]:
# Load the dataset from CSV file
# Note: Make sure 'Life_Expectancy_Data.csv' is in the same directory as this notebook
dataset = pd.read_csv('Life_Expectancy_Data.csv')

# Display basic information about the dataset
print(f"Dataset shape: {dataset.shape}")
print(f"Columns: {list(dataset.columns)}")
print(f"\nData types:")
print(dataset.dtypes)

Dataset shape: (2938, 22)
Columns: ['Country', 'Year', 'Status', 'Life expectancy ', 'Adult Mortality', 'infant deaths', 'Alcohol', 'percentage expenditure', 'Hepatitis B', 'Measles ', ' BMI ', 'under-five deaths ', 'Polio', 'Total expenditure', 'Diphtheria ', ' HIV/AIDS', 'GDP', 'Population', ' thinness  1-19 years', ' thinness 5-9 years', 'Income composition of resources', 'Schooling']

Data types:
Country                             object
Year                                 int64
Status                              object
Life expectancy                    float64
Adult Mortality                    float64
infant deaths                        int64
Alcohol                            float64
percentage expenditure             float64
Hepatitis B                        float64
Measles                              int64
 BMI                               float64
under-five deaths                    int64
Polio                              float64
Total expenditure                  fl

## 3. Data Exploration

Examine the structure and content of the dataset to understand the features and identify any data quality issues.

In [15]:
# Display the first few rows to understand the data structure
print("First 5 rows of the dataset:")
print(dataset.head())

# Check for missing values
print("\nMissing values per column:")
print(dataset.isnull().sum())

# Basic statistical summary
print("\nBasic statistical summary:")
print(dataset.describe())

First 5 rows of the dataset:
       Country  Year      Status  Life expectancy   Adult Mortality  \
0  Afghanistan  2015  Developing              65.0            263.0   
1  Afghanistan  2014  Developing              59.9            271.0   
2  Afghanistan  2013  Developing              59.9            268.0   
3  Afghanistan  2012  Developing              59.5            272.0   
4  Afghanistan  2011  Developing              59.2            275.0   

   infant deaths  Alcohol  percentage expenditure  Hepatitis B  Measles   ...  \
0             62     0.01               71.279624         65.0      1154  ...   
1             64     0.01               73.523582         62.0       492  ...   
2             66     0.01               73.219243         64.0       430  ...   
3             69     0.01               78.184215         67.0      2787  ...   
4             71     0.01                7.097109         68.0      3013  ...   

   Polio  Total expenditure  Diphtheria    HIV/AIDS      

## 4. Data Preprocessing

Clean and prepare the data for machine learning by handling missing values, removing unnecessary columns, and separating features from target variable.

In [16]:
# Remove the 'Country' column
# It is a high-cardinality identifier: one-hot encoding it would add roughly one column per country
dataset = dataset.drop(['Country'], axis=1)

# Handle missing values by dropping every row that contains a NaN
# Note: this shrinks the dataset from 2938 to 1649 rows; in a production environment, consider imputation instead
dataset = dataset.dropna()

print(f"Dataset shape after preprocessing: {dataset.shape}")
print(f"Remaining missing values: {dataset.isnull().sum().sum()}")

Dataset shape after preprocessing: (1649, 21)
Remaining missing values: 0
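
The imputation alternative mentioned in the comment above could look roughly like the following. This is a minimal sketch, assuming median imputation with scikit-learn's `SimpleImputer`; the toy frame and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame standing in for two numeric columns of the dataset
toy = pd.DataFrame({
    "GDP": [584.3, np.nan, 631.7, 670.0],
    "Alcohol": [0.01, 0.01, np.nan, 0.03],
})

# Median imputation: each NaN is replaced by the median of its column
imputer = SimpleImputer(strategy="median")
toy_imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(toy_imputed)
print(f"Rows kept: {len(toy_imputed)} of {len(toy)}")
```

In a real pipeline the imputer would be fit on the training rows only and then applied to the test rows, for the same leakage reason that applies to the scaler.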


In [17]:
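# Separate features (X) and target variable (y) by column position
# NOTE: after dropping 'Country', the last column is 'Schooling', while 'Life expectancy '
# sits at position 2. This positional split therefore uses 'Schooling' as the label and
# keeps 'Life expectancy ' among the features. Selecting the target by name avoids this:
#   labels = dataset['Life expectancy ']
#   features = dataset.drop(columns=['Life expectancy '])
labels = dataset.iloc[:, -1]  # Last column ('Schooling' in this dataset)
features = dataset.iloc[:, :-1]  # All remaining columns, including 'Life expectancy '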

print(f"Features shape: {features.shape}")
print(f"Labels shape: {labels.shape}")
print(f"Target variable (Life expectancy) - Min: {labels.min():.2f}, Max: {labels.max():.2f}, Mean: {labels.mean():.2f}")

Features shape: (1649, 20)
Labels shape: (1649,)
Target variable (Life expectancy) - Min: 4.20, Max: 20.70, Mean: 12.12
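
The recorded range above (min 4.20, max 20.70) is consistent with the dataset's last column, `Schooling`, rather than with life expectancy, because `iloc[:, -1]` selects by position. A small sketch of selecting the target by name instead is shown below. The miniature frame is hypothetical (only the column order mirrors the dataset), and the names `target_column`, `labels_by_name`, and `features_by_name` are introduced here for illustration.

```python
import pandas as pd

# Hypothetical miniature of the dataset that keeps the original column order:
# the target sits in the middle and 'Schooling' is the last column
toy = pd.DataFrame({
    "Year": [2015, 2014],
    "Life expectancy ": [65.0, 59.9],   # note the trailing space in the name
    "Adult Mortality": [263.0, 271.0],
    "Schooling": [10.0, 9.5],           # illustrative values
})

target_column = "Life expectancy "
labels_by_name = toy[target_column]
features_by_name = toy.drop(columns=[target_column])

print(f"Selected by name:     {labels_by_name.name!r}")
print(f"Selected by position: {toy.iloc[:, -1].name!r}")
print(f"Feature columns:      {list(features_by_name.columns)}")
```

Selecting by name also keeps the target out of the feature matrix, which a positional split does not guarantee when the target is not the last column.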


In [18]:
# Convert categorical variables to dummy variables (one-hot encoding)
# This is necessary for the neural network to process categorical data
# The 'Status' column (Developed/Developing) will be converted to binary features
features = pd.get_dummies(features, drop_first=True)  # drop_first=True to avoid multicollinearity

print(f"Features shape after one-hot encoding: {features.shape}")
print(f"Feature columns: {list(features.columns)}")

Features shape after one-hot encoding: (1649, 20)
Feature columns: ['Year', 'Life expectancy ', 'Adult Mortality', 'infant deaths', 'Alcohol', 'percentage expenditure', 'Hepatitis B', 'Measles ', ' BMI ', 'under-five deaths ', 'Polio', 'Total expenditure', 'Diphtheria ', ' HIV/AIDS', 'GDP', 'Population', ' thinness  1-19 years', ' thinness 5-9 years', 'Income composition of resources', 'Status_Developing']
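
To see why the column count stays at 20 after encoding, here is a small sketch of what `drop_first=True` does to a two-category column. The toy frame is made up for illustration.

```python
import pandas as pd

# Toy frame with one numeric column and one two-category column
toy = pd.DataFrame({
    "Year": [2015, 2015, 2014],
    "Status": ["Developing", "Developed", "Developing"],
})

# Categories are sorted alphabetically; drop_first removes the first one ('Developed'),
# so a single indicator column replaces the original 'Status' column
encoded = pd.get_dummies(toy, drop_first=True)

print(list(encoded.columns))                                # ['Year', 'Status_Developing']
print(encoded["Status_Developing"].astype(int).tolist())    # [1, 0, 1]
```

One categorical column is replaced by one indicator column, so the total number of columns does not change.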


## 5. Train-Test Split

Split the dataset into training and testing sets to evaluate model performance on unseen data.

In [19]:
# Split the data into training (80%) and testing (20%) sets
# random_state=42 ensures reproducible results
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, 
    test_size=0.2,  # 20% for testing
    random_state=42  # For reproducibility
)

print(f"Training set - Features: {features_train.shape}, Labels: {labels_train.shape}")
print(f"Testing set - Features: {features_test.shape}, Labels: {labels_test.shape}")
print(f"Training set percentage: {len(features_train) / len(features) * 100:.1f}%")
print(f"Testing set percentage: {len(features_test) / len(features) * 100:.1f}%")

Training set - Features: (1319, 20), Labels: (1319,)
Testing set - Features: (330, 20), Labels: (330,)
Training set percentage: 80.0%
Testing set percentage: 20.0%
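
The split sizes reported above can be reproduced without the dataset itself. The sketch below splits a plain index array of the same length, assuming 1649 rows as shown in the preprocessing output.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 1649  # rows remaining after dropna()
indices = np.arange(n_rows)

# Same arguments as the notebook's split, applied to row indices only
train_idx, test_idx = train_test_split(indices, test_size=0.2, random_state=42)

print(len(train_idx), len(test_idx))                 # 1319 330
print(len(np.intersect1d(train_idx, test_idx)))      # 0 (no row is in both subsets)
```

The test size is rounded up (0.2 × 1649 = 329.8 becomes 330), which is why the split is not exactly 80/20 in row counts.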


## 6. Feature Scaling

Scale numerical features to ensure all features contribute equally to the model training. This is crucial for neural networks as they are sensitive to the scale of input features.

In [20]:
# Identify numerical features for scaling
# Neural networks perform better when features are on similar scales
numerical_features = features.select_dtypes(include=['float64', 'int64'])
numerical_columns = numerical_features.columns

# Create a ColumnTransformer to scale only numerical features
# StandardScaler standardizes each feature to mean=0 and std=1, using statistics from the data it is fit on
ct = ColumnTransformer(
    [("numeric_scaler", StandardScaler(), numerical_columns)], 
    remainder='passthrough'  # Keep non-numerical features as-is
)

print(f"Numerical columns to be scaled: {list(numerical_columns)}")
print(f"Number of numerical features: {len(numerical_columns)}")

Numerical columns to be scaled: ['Year', 'Life expectancy ', 'Adult Mortality', 'infant deaths', 'Alcohol', 'percentage expenditure', 'Hepatitis B', 'Measles ', ' BMI ', 'under-five deaths ', 'Polio', 'Total expenditure', 'Diphtheria ', ' HIV/AIDS', 'GDP', 'Population', ' thinness  1-19 years', ' thinness 5-9 years', 'Income composition of resources']
Number of numerical features: 19


In [21]:
# Apply scaling to training and testing sets
# IMPORTANT: Fit the scaler only on training data to prevent data leakage
features_train_scaled = ct.fit_transform(features_train)  # Fit and transform training data
features_test_scaled = ct.transform(features_test)        # Only transform test data (using training stats)

print(f"Scaled training features shape: {features_train_scaled.shape}")
print(f"Scaled testing features shape: {features_test_scaled.shape}")

# Verify scaling worked correctly (training data should have mean≈0, std≈1 for numerical features)
print(f"\nTraining data statistics after scaling:")
print(f"Mean of first few features: {np.mean(features_train_scaled[:, :5], axis=0)}")
print(f"Std of first few features: {np.std(features_train_scaled[:, :5], axis=0)}")

Scaled training features shape: (1319, 20)
Scaled testing features shape: (330, 20)

Training data statistics after scaling:
Mean of first few features: [-1.12426284e-14 -1.51104805e-15 -1.21207063e-17 -2.69349028e-17
 -9.69656501e-17]
Std of first few features: [1. 1. 1. 1. 1.]
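
What the scaler computes can be written out by hand. The sketch below uses synthetic data (not the dataset) to show that standardizing with the training mean and standard deviation gives the same result as `StandardScaler`, for both the training and the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(100, 3))  # synthetic stand-in
test = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

# Statistics come from the training rows only
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # population std (ddof=0), which StandardScaler also uses

train_manual = (train - mu) / sigma
test_manual = (test - mu) / sigma   # test rows reuse the training statistics

scaler = StandardScaler().fit(train)
print(np.allclose(train_manual, scaler.transform(train)))  # True
print(np.allclose(test_manual, scaler.transform(test)))    # True
```

Because the test rows are scaled with training statistics, their mean and standard deviation will generally be close to, but not exactly, 0 and 1.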


## 7. Neural Network Model Creation

Build a deep learning model using TensorFlow/Keras. The model architecture consists of:
- Input layer: Accepts features from the dataset
- Hidden layer: 64 neurons with ReLU activation for non-linearity
- Output layer: Single neuron for regression (predicting life expectancy)

In [22]:
# Create a Sequential model (layers stacked one after another)
my_model = Sequential()

# Input layer: defines the shape of input data
# The input shape should match the number of features after preprocessing
input_layer = InputLayer(input_shape=(features_train_scaled.shape[1],))
my_model.add(input_layer)

# Hidden layer: 64 neurons with ReLU activation
# ReLU (Rectified Linear Unit) is commonly used for hidden layers
# It helps with the vanishing gradient problem and is computationally efficient
my_model.add(Dense(64, activation='relu', name='hidden_layer'))

# Output layer: Single neuron for regression (no activation function)
# For regression tasks, we don't use activation in the output layer
# This allows the model to output any real number (life expectancy in years)
my_model.add(Dense(1, name='output_layer'))

# Display model architecture
print("Model Architecture:")
my_model.summary()

print(f"\nModel Details:")
print(f"- Input features: {features_train_scaled.shape[1]}")
print(f"- Hidden layer neurons: 64")
print(f"- Output: 1 (life expectancy prediction)")
print(f"- Total parameters: {my_model.count_params()}")

Model Architecture:



Model Details:
- Input features: 20
- Hidden layer neurons: 64
- Output: 1 (life expectancy prediction)
- Total parameters: 1409
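
The parameter total of 1409 follows from the layer sizes. The sketch below recomputes it and runs a NumPy forward pass with random, untrained weights to illustrate the shapes involved. The helper `dense_params` and the weight names are hypothetical and introduced here only for illustration.

```python
import numpy as np

def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden = dense_params(20, 64)   # 20 * 64 + 64 = 1344
output = dense_params(64, 1)    # 64 * 1 + 1 = 65
print(hidden + output)          # 1409

# Forward pass with random (untrained) weights, for shape illustration only
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 20))                     # batch of 5 rows, 20 features
W1, b1 = rng.normal(size=(20, 64)), np.zeros(64)
W2, b2 = rng.normal(size=(64, 1)), np.zeros(1)

h = np.maximum(0.0, x @ W1 + b1)                 # hidden layer with ReLU
y_hat = h @ W2 + b2                              # linear output, one value per row
print(y_hat.shape)                               # (5, 1)
```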


## 8. Model Compilation and Training

Configure the model for training by specifying:
- **Loss function**: Mean Squared Error (MSE) - appropriate for regression tasks
- **Optimizer**: Adam - adaptive learning rate optimizer that works well for most problems
- **Metrics**: Mean Absolute Error (MAE) - easier to interpret than MSE

In [None]:
# Configure the optimizer with an appropriate learning rate
optimizer = Adam(learning_rate=0.001)

# Compile the model with:
# - MSE (Mean Squared Error) loss: standard for regression problems
# - MAE (Mean Absolute Error) metric: easier to interpret (average error in years)
my_model.compile(
    loss='mse',           # Loss function for regression
    metrics=['mae'],      # Additional metrics to monitor
    optimizer=optimizer   # Optimization algorithm
)

print("Model compiled successfully!")
print(f"Optimizer: Adam with learning rate {optimizer.learning_rate.numpy()}")
print(f"Loss function: Mean Squared Error (MSE)")
print(f"Metrics: Mean Absolute Error (MAE)")

# Train the model
# - epochs=40: number of full passes over the training data
# - batch_size=32: mini-batches give more stable gradient estimates than batch_size=1
# - validation_split=0.2: hold out 20% of the training rows for validation during training
print("\nStarting model training...")
history = my_model.fit(
    features_train_scaled, labels_train,
    epochs=40,              # Number of training epochs
    batch_size=32,          # Number of samples per gradient update
    validation_split=0.2,   # Use 20% of training data for validation
    verbose=1               # Show progress during training
)

print("\nTraining completed!")

Model compiled successfully!
Optimizer: Adam with learning rate 0.0010000000474974513
Loss function: Mean Squared Error (MSE)
Metrics: Mean Absolute Error (MAE)

Starting model training...
Epoch 1/40
33/33 ━━━━━━━━━━━━━━━━━━━━ 1s 12ms/step - loss: 155.7377 - mae: 12.1236 - val_loss: 126.7375 - val_mae: 10.9554
Epoch 2/40
33/33 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 122.7278 - mae: 10.7929 - val_loss: 97.6267 - val_mae: 9.5963
Epoch 3/40
33/33 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 91.6700 - mae: 9.2940 - val_loss: 69.4931 - val_mae: 8.0312
Epoch 4/40
33/33 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - loss: 63.7902 - mae: 7.7100 - val_loss: 44.8274 - val_mae: 6.3308
Epoch 5/40
33/33 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 39.6819 - mae: 5.9386 - val_loss: 26.8880 - val_mae: 4.6987
Epoch 6/40
33/33 ━━━━━━━━━━━━━━

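The `33/33` step counter in the log follows from the split and batch sizes. The sketch below works through that arithmetic, assuming Keras takes the validation rows from the end of the arrays passed to `fit()` as its documentation describes, and then reads the best epoch from the validation losses that are visible in the log. The variable names are introduced here for illustration.

```python
import math

n_train = 1319          # rows in the training split
validation_split = 0.2
batch_size = 32

# Rows actually used for gradient updates versus rows held out for validation
n_fit = math.floor(n_train * (1 - validation_split))   # 1055
n_val = n_train - n_fit                                # 264
steps_per_epoch = math.ceil(n_fit / batch_size)        # 33

print(n_fit, n_val, steps_per_epoch)

# Validation loss for epochs 1-5, copied from the recorded log
val_loss = [126.7375, 97.6267, 69.4931, 44.8274, 26.8880]
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
print(best_epoch)  # 5 (the loss is still falling in the visible part of the log)
```

After training, the same per-epoch values are available as lists in `history.history`, keyed by `loss`, `mae`, `val_loss`, and `val_mae`.
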
## 9. Model Evaluation

Evaluate the trained model on the test set to assess its performance on unseen data. This gives us an unbiased estimate of how well the model will perform in real-world scenarios.

In [24]:
# Evaluate the model on the test set
print("Evaluating model on test set...")
test_loss, test_mae = my_model.evaluate(features_test_scaled, labels_test, verbose=0)

print(f"\n=== Model Performance on Test Set ===")
print(f"Test Loss (MSE): {test_loss:.4f}")
print(f"Test MAE: {test_mae:.4f} years")
print(f"Test RMSE: {np.sqrt(test_loss):.4f} years")

# Make predictions on test set for further analysis
predictions = my_model.predict(features_test_scaled, verbose=0)

# Calculate additional metrics
from sklearn.metrics import r2_score, mean_absolute_percentage_error

r2 = r2_score(labels_test, predictions)
mape = mean_absolute_percentage_error(labels_test, predictions)

print(f"\n=== Additional Metrics ===")
print(f"R² Score: {r2:.4f} (closer to 1.0 is better)")
print(f"MAPE: {mape:.4f} ({mape*100:.2f}%)")

# Show some example predictions vs actual values
print(f"\n=== Sample Predictions vs Actual Values ===")
for i in range(min(10, len(predictions))):
    print(f"Predicted: {predictions[i][0]:.2f} years, Actual: {labels_test.iloc[i]:.2f} years, Difference: {abs(predictions[i][0] - labels_test.iloc[i]):.2f} years")

Evaluating model on test set...

=== Model Performance on Test Set ===
Test Loss (MSE): 2.2009
Test MAE: 1.1621 years
Test RMSE: 1.4835 years

=== Additional Metrics ===
R² Score: 0.7035 (closer to 1.0 is better)
MAPE: 0.1055 (10.55%)

=== Sample Predictions vs Actual Values ===
Predicted: 11.29 years, Actual: 11.00 years, Difference: 0.29 years
Predicted: 14.94 years, Actual: 13.50 years, Difference: 1.44 years
Predicted: 15.12 years, Actual: 16.40 years, Difference: 1.28 years
Predicted: 10.07 years, Actual: 10.10 years, Difference: 0.03 years
Predicted: 8.15 years, Actual: 6.40 years, Difference: 1.75 years
Predicted: 10.31 years, Actual: 9.60 years, Difference: 0.71 years
Predicted: 12.71 years, Actual: 12.50 years, Difference: 0.21 years
Predicted: 12.09 years, Actual: 13.20 years, Difference: 1.11 years
Predicted: 15.80 years, Actual: 16.50 years, Difference: 0.70 years
Predicted: 11.90 years, Actual: 11.60 years, Difference: 0.30 years


## 10. Conclusions and Next Steps

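### Model Performance Interpretation:
- **MAE (Mean Absolute Error)**: Average absolute prediction error in the target's units (1.16 on the test set)
- **RMSE (Root Mean Square Error)**: Square root of the MSE; penalizes larger errors more heavily than MAE (1.48 on the test set)
- **R² Score**: Proportion of variance explained by the model; at most 1, higher is better, and negative when the model does worse than always predicting the mean (0.70 on the test set)
- **MAPE**: Mean Absolute Percentage Error, the average absolute error relative to the actual value (10.55% on the test set)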
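
These four metrics can be computed by hand. The sketch below applies their definitions to the ten sample predictions printed in the evaluation output. Since it uses only ten rows, the values will differ from the full test-set metrics.

```python
import numpy as np

# The ten (predicted, actual) pairs from the sample output of the evaluation cell
predicted = np.array([11.29, 14.94, 15.12, 10.07, 8.15, 10.31, 12.71, 12.09, 15.80, 11.90])
actual = np.array([11.00, 13.50, 16.40, 10.10, 6.40, 9.60, 12.50, 13.20, 16.50, 11.60])

errors = predicted - actual

mae = np.mean(np.abs(errors))                     # average absolute error
rmse = np.sqrt(np.mean(errors ** 2))              # square root of the mean squared error
mape = np.mean(np.abs(errors) / np.abs(actual))   # average error relative to the actual value
# R²: 1 minus (residual sum of squares / total sum of squares around the mean)
r2 = 1.0 - np.sum(errors ** 2) / np.sum((actual - actual.mean()) ** 2)

print(f"MAE:  {mae:.3f}")
print(f"RMSE: {rmse:.3f}")
print(f"MAPE: {mape * 100:.2f}%")
print(f"R²:   {r2:.3f}")
```

RMSE is never smaller than MAE on the same data; a large gap between the two points to a few large errors dominating the squared loss.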

### Potential Improvements:
1. **Feature Engineering**: Create new features or transform existing ones
2. **Hyperparameter Tuning**: Optimize learning rate, batch size, number of neurons
3. **Model Architecture**: Try different numbers of layers or neurons
4. **Regularization**: Add dropout or L1/L2 regularization to prevent overfitting
5. **Data Quality**: Impute missing values instead of dropping rows; `dropna()` reduced the dataset from 2938 to 1649 rows
6. **Cross-Validation**: Use k-fold cross-validation for more robust evaluation
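
As an illustration of the cross-validation item, here is a minimal k-fold sketch. It is not the notebook's model: it assumes synthetic data and swaps in a `Ridge` regressor as a lightweight stand-in for the neural network, so that the example runs quickly. The point is the structure, in which the scaler is refit inside each fold.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (stand-in for the scaled features and labels)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=200)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
fold_mae = []
for train_idx, val_idx in kfold.split(X):
    # A fresh pipeline per fold: the scaler sees only that fold's training rows
    model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict(X[val_idx])
    fold_mae.append(np.mean(np.abs(preds - y[val_idx])))

print(f"MAE per fold: {np.round(fold_mae, 3)}")
print(f"Mean MAE: {np.mean(fold_mae):.3f} +/- {np.std(fold_mae):.3f}")
```

The spread across folds gives a rough sense of how much a single train-test split could over- or understate the error.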

### Usage Notes:
- This model predicts life expectancy based on health and socioeconomic indicators
- The predictions should be interpreted as estimates with inherent uncertainty
- Retraining on updated data may help keep predictions current as new years of data become available