## 1. Train Test Split

- Import data

In [1]:
%store -r df

- It’s crucial to ensure that the train-test split respects the temporal and hierarchical structure of the data. Simply shuffling rows randomly could lead to data leakage  or break the natural grouping of data by country and time

- In that case I'll use a temporal split:
    - Use data from earlier years (2000–2010) for training.
    - Reserve later years (2011–2015) for testing.
     

In [2]:
train = df[df['Year'] <= 2010]
test = df[df['Year'] > 2010]

- Prevents future data from leaking into the training set.
- Mimics real-world use cases where predictions are made for future periods.
     

In [4]:
print(f'Test data shape:{test.shape}')

Test data shape:(925, 23)


In [5]:
print(f'Train data shape: {train.shape}')

Train data shape: (2013, 23)


- Train Test Split

In [6]:
# For training
X_train = train.drop(['Country', 'Life expectancy '], axis=1)
y_train = train['Life expectancy ']

In [7]:
# For the test set
X_test = test.drop(['Country', 'Life expectancy '], axis=1)
y_test = test['Life expectancy ']

## 2. Training

## a. Linear Regression

**i) Import libraries**

In [8]:
import sklearn as sk
from sklearn.linear_model import LinearRegression

**ii) Reinitialise LR model**

In [9]:
model_LR = LinearRegression()

**iv) Fit model - LR**

In [10]:
model_LR.fit(X_train, y_train)

**v) Predict**

In [11]:
y_pred_train_lr = model_LR.predict(X_train)
y_pred_test_lr = model_LR.predict(X_test)

**vi) Evaluate Model - LR**

**1. Train Metrics**

In [12]:
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

In [13]:
mse_train_lr = mean_squared_error(y_train, y_pred_train_lr)
mae_train_lr = mean_absolute_error(y_train, y_pred_train_lr)
r2_train_lr = r2_score(y_train, y_pred_train_lr)

print(f"Train MSE: {mse_train_lr:.4f}")
print(f"Train MAE: {mae_train_lr:.4f}")
print(f"Train R² Score: {r2_train_lr:.4f}")

Train MSE: 0.1733
Train MAE: 0.3142
Train R² Score: 0.8277


> **Train MSE**: Before transformation is 1**5.4997**, After transformation: **15.3956**, After scaling: **13.8513**, After temporal split: **14.8966**

> **Train MSE**: After scaling: Train MAE: **2.7678**, After temporal split: Train MAE: **2.9072**

> **R Squared**: Before transformation: Train R² Score: **0.8304**,After transformation: Train R² Score: **0.8304**,After scaling: Train R² Score: **0.8484**,After Temporal split: **0.8455**.

**2. Test Metrics**

In [14]:
mse_test_lr = mean_squared_error(y_test, y_pred_test_lr)
mae_test_lr = mean_absolute_error(y_test, y_pred_test_lr)
r2_test_lr = r2_score(y_test, y_pred_test_lr)

print(f"Test MSE: {mse_test_lr:.4f}")
print(f"Test MAE: {mae_test_lr:.4f}")
print(f"Test R² Score: {r2_test_lr:.4f}")

Test MSE: 0.1555
Test MAE: 0.2922
Test R² Score: 0.8324


> **Test MSE**: Before transformation is **14.2255**, After transformation: **13.2045**, After scaling: **12.0188**, After temporal split: **10.8894**

> **Test MAE**: After scaling Test MAE: **2.5973**, After temporal split: Test MAE: **2.5022**

> **R Squared**: Before transformation: Test R² Score: **0.8359**, After transformation: Test R² Score: **0.8359**, After scaling: Test R² Score: **0.8613**, After Temporal split: **0.8482**

**3. Residuals**

In [15]:
# Calculate residuals
residual = y_test - y_pred_test_lr
residual2 = y_test - y_pred_test_lgb

residual3 = y_test - y_pred_test_cb
residual4 = y_test - y_pred_test_lgb

residuals = [residual, residual2, residual3, residual4]
import seaborn as sns
import matplotlib.pyplot as plt

# explore residual
fig, ax = plt.subplots(2,2, figsize=(8, 5))
sns.histplot(residual, bins=12, kde=True, color='green', ax=ax[0,0])
sns.histplot(residual2, bins=12, kde=True, color='green', ax=ax[0,1])
sns.histplot(residual3, bins=12, kde=True, color='green', ax=ax[1,1])
sns.histplot(residual4, bins=12, kde=True, color='green', ax=ax[1,0])

for i,cols in enumerate (residuals): 
    if i<len(ax):
        plt.title("Residuals - Linear Regression", fontsize=14)
        plt.axvline(x=0, color='red', linestyle='dashed', label='Zero Residual Line')
        plt.grid(alpha=0.3)
        plt.xlabel("Residuals", fontsize=12)
        plt.ylabel("Frequency", fontsize=12)
        plt.legend()
        plt.show()

NameError: name 'y_pred_test_lgb' is not defined

**4. Cross Validation**

In [16]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model_LR, X_train, y_train, cv=5, scoring='r2')
print("Cross-validation scores:", scores)
print("Mean R²:", scores.mean(), "Std Dev:", scores.std())

Cross-validation scores: [0.83975802 0.78812784 0.72792147 0.82630918 0.82114316]
Mean R²: 0.8006519322488839 Std Dev: 0.04013488244726311


## b. Random Forest

**i) Import RFR**

In [17]:
from sklearn.ensemble import RandomForestRegressor

**ii) Reinitialize model**

In [18]:
rf = RandomForestRegressor()

**iii) Fit model - RF**

In [19]:
rf.fit(X_train, y_train)

**iv) Predict**

In [20]:
y_pred_train_rf = rf.predict(X_train)
y_pred_test_rf = rf.predict(X_test)

**v) Evaluate Model - RF**

**1. Train Metrics**

In [21]:
mse_train_rf = mean_squared_error(y_train, y_pred_train_rf)
mae_train_rf = mean_absolute_error(y_train, y_pred_train_rf)
r2_train_rf = r2_score(y_train, y_pred_train_rf)

print(f"Train MSE: {mse_train_rf:.4f}")
print(f"Train MAE: {mae_train_rf:.4f}")
print(f"Train R² Score: {r2_train_rf:.4f}")

Train MSE: 0.0055
Train MAE: 0.0453
Train R² Score: 0.9945


> **Train MSE:** Before Temporal split: Train MSE: **0.5284**, After Temporal split: Train MSE: **0.4812**, 

> **Train MAE**: Before Temporal Split: Train MAE: **0.4462**, After Temporal Split: Train MAE: **0.4259**

> **Train R Squared**: Before temporal split: Train R² Score: **0.9942**, After temporal split: Train R² Score:**0.9950**

**2. Test Metrics**

In [22]:
mse_test_rf = mean_squared_error(y_test, y_pred_test_rf)
mae_test_rf = mean_absolute_error(y_test, y_pred_test_rf)
r2_test_rf = r2_score(y_test, y_pred_test_rf)

print(f"Test MSE: {mse_test_rf:.4f}")
print(f"Test MAE: {mae_test_rf:.4f}")
print(f"Test R² Score: {r2_test_rf:.4f}")

Test MSE: 0.0751
Test MAE: 0.1886
Test R² Score: 0.9191


> **Test MAE**: > Before Temporal Split: Test MAE: **1.0845**, After Temporal Split: Test MAE: **1.6125**

> **Test MAE**: Before Temporal Split: Test MAE: **1.0845**,  After Temporal Split: Test MAE: **1.6125**

>**Test R Squared**: Before temporal split: Test R² Score: **0.9689**, After temporal split: Test R² Score: **0.9262**, 



In [23]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(rf, X_train, y_train, cv=5, scoring='r2')
print("Cross-validation scores:", scores)
print("Mean R²:", scores.mean(), "Std Dev:", scores.std())

Cross-validation scores: [0.93906629 0.87468202 0.90772911 0.90526103 0.90787667]
Mean R²: 0.906923022874289 Std Dev: 0.020381331083069595


## 3. Support Vector Machine

**i) Import model**

In [206]:
from sklearn.svm import SVR

**ii) Initialise model**

In [225]:
# Initialize SVR with a kernel (e.g., 'rbf', 'linear', or 'poly')
model_SVR = SVR(kernel='rbf', C=1.0, epsilon=0.1)

**iii) Fit model - SVR**

In [226]:
# Fit the model to the scaled training data
model_SVR.fit(X_train, y_train)

**iv) Predict**

In [237]:
# Predict on the test set
y_pred_train_svr = model_SVR.predict(X_train)
y_pred_test_svr = model_SVR.predict(X_test)

**v) Evaluate**

**i) Train Metrics**

In [238]:
mse_train_svr = mean_squared_error(y_train, y_pred_train_svr)
mae_train_svr = mean_absolute_error(y_train, y_pred_train_svr)
r2_train_svr = r2_score(y_train, y_pred_train_svr)

print(f"Train MSE: {mse_train_svr:.4f}")
print(f"Train MAE: {mae_train_svr:.4f}")
print(f"Train R² Score: {r2_train_svr:.4f}")

Train MSE: 104.8100
Train MAE: 7.6981
Train R² Score: -0.0867


**ii) Test Metrics**

In [240]:
mse_test_svr = mean_squared_error(y_test, y_pred_test_svr)
mae_test_svr = mean_absolute_error(y_test, y_pred_test_svr)
r2_test_svr = r2_score(y_test, y_pred_test_svr)

print(f"Test MSE: {mse_test_svr:.4f}")
print(f"Test MAE: {mae_test_svr:.4f}")
print(f"Test R² Score: {r2_test_svr:.4f}")

Test MSE: 70.0083
Test MAE: 6.7898
Test R² Score: 0.0238


> The model performs very poorly

## 4. CatBoost

**i) Import model**

In [209]:
from catboost import CatBoostRegressor

**ii) Initialise model**

In [210]:
# Initialize the CatBoost model
model_CBR = CatBoostRegressor(
    iterations=100,       # Number of boosting iterations
    learning_rate=0.1,    # Learning rate
    depth=6,              # Depth of the trees
    loss_function='RMSE', # Loss function (e.g., RMSE for regression)
    verbose=0             # Suppress verbose output during training
)

**iii) Fit model**

In [211]:
# Fit the model
model_CBR.fit(
    X_train, y_train,
    # cat_features=cat_features  # Pass categorical feature indices here
)

<catboost.core.CatBoostRegressor at 0x7be231a105f0>

**iv) Predict**

In [216]:
# Predict on the test set
y_pred_train_cb = model_CBR.predict(X_train)
y_pred_test_cb = model_CBR.predict(X_test)

**v) Evaluate - CatBoost**

**i) Train Metrics**

In [217]:
mse_train_cb = mean_squared_error(y_train, y_pred_train_cb)
mae_train_cb = mean_absolute_error(y_train, y_pred_train_cb)
r2_train_cb = r2_score(y_train, y_pred_train_cb)

print(f"Train MSE: {mse_train:.4f}")
print(f"Train MAE: {mae_train:.4f}")
print(f"Train R² Score: {r2_train:.4f}")

Train MSE: 14.8966
Train MAE: 2.9072
Train R² Score: 0.8455


**ii) Test Metrics**

In [218]:
mse_test_cb = mean_squared_error(y_test, y_pred_test_cb)
mae_test_cb = mean_absolute_error(y_test, y_pred_test_cb)
r2_test_cb = r2_score(y_test, y_pred_test_cb)

print(f"Test MSE: {mse_test_cb:.4f}")
print(f"Test MAE: {mae_test_cb:.4f}")
print(f"Test R² Score: {r2_test_cb:.4f}")

Test MSE: 5.5011
Test MAE: 1.6636
Test R² Score: 0.9233


In [248]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model_CBR, X_train, y_train, cv=5, scoring='r2')
print("Cross-validation scores:", scores)
print("Mean R²:", scores.mean(), "Std Dev:", scores.std())

Cross-validation scores: [0.93481169 0.87281734 0.9158566  0.89870677 0.9244196 ]
Mean R²: 0.9093223980610723 Std Dev: 0.021751487021140718


## 5. LightGBM

**i) Import model - LGB**

In [None]:
import lightgbm as lgb

**ii) Split data - LGD dataset function**

In [None]:
train_data = lgb.Dataset(X_train, label=y_train)
test_data = lgb.Dataset(X_test, label=y_test, reference=train_data)

**iii) Set parameters**

In [None]:
params = {
    'objective': 'regression',  # For regression tasks
    'metric': 'rmse',          # Evaluation metric
    'boosting_type': 'gbdt',   # Gradient Boosting Decision Tree
    'learning_rate': 0.1,
    'num_leaves': 31,          # Number of leaves in one tree
    'verbose': -1              # Suppress logs
}

**iv) Train model**

In [None]:
model_LGB = lgb.train(
    params,
    train_data,
    valid_sets=[test_data],
    valid_names=['valid'],
    num_boost_round=200,
    callbacks=[lgb.early_stopping(10)]  # Use callbacks instead of early_stopping_rounds
)

Training until validation scores don't improve for 10 rounds


Early stopping, best iteration is:
[81]	valid's rmse: 2.2124


**v) Predict**

In [220]:
y_pred_train_lgb = model_LGB.predict(X_train)
y_pred_test_lgb = model_LGB.predict(X_test)

**vi) Evaluate Model - LightGBM**

**1. Train Metrics**

In [221]:
mse_train_lgb = mean_squared_error(y_train, y_pred_train_lgb)
mae_train_lgb = mean_absolute_error(y_train, y_pred_train_lgb)
r2_train_lgb = r2_score(y_train, y_pred_train_lgb)

print(f"Train MSE: {mse_train_lgb:.4f}")
print(f"Train MAE: {mae_train_lgb:.4f}")
print(f"Train R² Score: {r2_train_lgb:.4f}")

Train MSE: 0.8448
Train MAE: 0.6062
Train R² Score: 0.9912


**2. Test Metrics**

In [223]:
mse_test_lgb = mean_squared_error(y_test, y_pred_test_lgb)
mae_test_lgb = mean_absolute_error(y_test, y_pred_test_lgb)
r2_test_lgb = r2_score(y_test, y_pred_test_lgb)

print(f"Test MSE: {mse_test_lgb:.4f}")
print(f"Test MAE: {mae_test_lgb:.4f}")
print(f"Test R² Score: {r2_test_lgb:.4f}")

Test MSE: 4.8947
Test MAE: 1.5572
Test R² Score: 0.9318


Metrics Summary

| Model     | Train MSE | Test MSE | Train MAE | Test MAE | Train R² Score | Test R² Score |
|-----------|----------|----------|-----------|----------|---------------|--------------|
| **LR**        | 14.8966  | 10.8894  | 2.9072    | 2.5022   | 0.8455        | 0.8482       |
| **RF**        | 0.4970   | 5.3383   | 0.4263    | 1.6278   | 0.9948        | 0.9256       |
| **SVR**       | 104.8100 | 70.0083  | 7.6981    | 6.7898   | -0.0867       | 0.0238       |
| **CatBoost**  | 14.8966  | 5.5011   | 2.9072    | 1.6636   | 0.8455        | 0.9233       |
| **LightGBM**  | 0.8448   | 4.8947   | 0.6062    | 1.5572   | 0.9912        | 0.9318       |

In [301]:
import joblib

# Assuming 'best_model' is the best one you identified
joblib.dump(rf, "/home/davidkibet/Desktop/Life Expectancy ML/models/rf_model.pkl") 
joblib.dump(model_LGB, "/home/davidkibet/Desktop/Life Expectancy ML/models/LGB_model.pkl") 

# Save training and test sets
joblib.dump(X_train, "/home/davidkibet/Desktop/Life Expectancy ML/models/train_test_sets/X_train.pkl")
joblib.dump(X_test, "/home/davidkibet/Desktop/Life Expectancy ML/models/train_test_sets/X_test.pkl")
joblib.dump(y_train, "/home/davidkibet/Desktop/Life Expectancy ML/models/train_test_sets/y_train.pkl")
joblib.dump(y_test, "/home/davidkibet/Desktop/Life Expectancy ML/models/train_test_sets/y_test.pkl")

['/home/davidkibet/Desktop/Life Expectancy ML/models/train_test_sets/y_test.pkl']