## Life expectancy prediction using SVM algorithm
### Adel.AMD
### October 2024

In this project, we classify life expectancy into three categories (Low, Medium, and High) from socio-economic and health-related data, using the cleaned-life-expectancy-dataset from WHO. To do this, we apply a multi-class Support Vector Machine (SVM) classifier.

First, let's install kagglehub so we can load the dataset directly from Kaggle.

In [1]:
pip install kagglehub

Note: you may need to restart the kernel to use updated packages.


Now we can load the dataset directly from Kaggle using the kagglehub package.

In [3]:
# Loading Dataset
import kagglehub

# Download latest version
path = kagglehub.dataset_download("paperxd/cleaned-life-expectancy-dataset")

print("Path to dataset files:", path)
print("File name: Cleaned-Life-Exp.csv")

Path to dataset files: /Users/adelahmadi/.cache/kagglehub/datasets/paperxd/cleaned-life-expectancy-dataset/versions/1
File name: Cleaned-Life-Exp.csv


Next, we import the libraries we need: pandas, NumPy, and Matplotlib.

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Let's load the dataset using pandas.

In [5]:
import os

# Build the CSV path from the download directory returned by kagglehub
filename = os.path.join(path, 'Cleaned-Life-Exp.csv')
data = pd.read_csv(filename)

In [6]:
data.columns

Index(['Country', 'Year', 'Status', 'Life expectancy', 'Adult Mortality',
       'infant deaths', 'Alcohol', 'percentage expenditure', 'Hepatitis B',
       'Measles', 'BMI', 'under-five deaths', 'Polio', 'Total expenditure',
       'Diphtheria', 'HIV/AIDS', 'GDP', 'Population', 'thinness  1-19 years',
       'thinness 5-9 years', 'Income composition of resources', 'Schooling'],
      dtype='object')

This dataset is already cleaned (see the dataset's Kaggle page for a full explanation), so we skip most of the data cleaning and move on.
Right now, the target variable, Life expectancy, is a continuous numeric value. Let's get its minimum and maximum.

In [10]:
print('Minimum:',min(data['Life expectancy']))
print('Maximum:',max(data['Life expectancy']))

Minimum: -3.4576872896992716
Maximum: 2.076724197034914


So the values lie between -4 and 3 (the negative values suggest the column was rescaled during cleaning). Let's bin these continuous values into categories.

In [11]:
# Bin the continuous life expectancy into three categories
bins = [-4, -1.5, 0.5, 3]
labels = ['Low', 'Medium', 'High']
data['lifeexp_category'] = pd.cut(data['Life expectancy'], bins=bins, labels=labels)
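
To make the bin edges concrete, here is a minimal sketch of how `pd.cut` assigns values with these edges. The sample values are hypothetical and chosen to sit on and around the edges; by default each interval is open on the left and closed on the right, so -1.5 falls in Low and 0.5 falls in Medium.

```python
import pandas as pd

# Same edges and labels as in the cell above.
bins = [-4, -1.5, 0.5, 3]
labels = ['Low', 'Medium', 'High']

# Hypothetical values placed on and next to the bin edges.
sample = pd.Series([-3.46, -1.5, -1.49, 0.5, 0.51, 2.08])
binned = pd.cut(sample, bins=bins, labels=labels)

# Show each value next to the category it was assigned.
for value, category in zip(sample, binned):
    print(f"{value:>6}: {category}")
```

Any value outside the outer edges (at or below -4, or above 3) would become NaN, so the edges need to cover the full observed range.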

Now we drop the target variable, and the category derived from it, from the SVM features.

In [12]:
X = data.drop(['Life expectancy','lifeexp_category'],axis=1)

One more step for the features: the Country values are strings and must be encoded. We use one-hot encoding for the country names.

In [13]:
X = pd.get_dummies(X, columns=['Country'], drop_first=True)
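
As an illustration of what this call produces, the sketch below applies the same `get_dummies` arguments to a hypothetical toy frame (only the column name `Country` mirrors the dataset). With `drop_first=True`, the first category in sorted order gets no column of its own and is represented by all other country columns being zero.

```python
import pandas as pd

# Hypothetical toy frame, not taken from the real dataset.
toy = pd.DataFrame({
    'Country': ['Albania', 'Brazil', 'Chad', 'Brazil'],
    'Schooling': [0.4, 0.1, -1.2, 0.2],
})

# Same arguments as in the cell above.
encoded = pd.get_dummies(toy, columns=['Country'], drop_first=True)

# 'Albania' is the dropped baseline, so only two country columns remain.
print(encoded.columns.tolist())
```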

Our target variable for this algorithm is lifeexp_category, created above.

In [15]:
Y = data['lifeexp_category']

Now we split the dataset into training and test sets, holding out 30% of the data for testing.

In [16]:
from sklearn.model_selection import train_test_split

In [17]:
X_Train, X_Test, Y_Train, Y_Test = train_test_split(X,Y,test_size = .3, random_state = 20)
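
The three categories are not equally common (the classification report further down shows Low as the smallest class), so one optional variation is a stratified split, which keeps the class proportions the same in both parts. The sketch below uses hypothetical toy labels rather than the real dataset; `stratify` is a standard `train_test_split` parameter, but the notebook itself does not use it.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 30 High, 50 Medium, 20 Low.
y_toy = ['High'] * 30 + ['Medium'] * 50 + ['Low'] * 20
X_toy = [[i] for i in range(len(y_toy))]

# stratify=y_toy keeps the 30/50/20 proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=20, stratify=y_toy
)

print("train:", Counter(y_tr))
print("test: ", Counter(y_te))
```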

Let's continue and scale the feature variables. First we import the classes we need.

In [18]:
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

In [21]:
scaler = StandardScaler()
X_Train_Scaled = scaler.fit_transform(X_Train)
X_Test_Scaled = scaler.transform(X_Test)
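
Note that the scaler is fitted on the training set only, and the test set is transformed with the training statistics, so nothing about the test set leaks into preprocessing. A minimal sketch on hypothetical toy arrays, checking that `StandardScaler` agrees with the manual formula (x - mean) / std computed from the training data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: three training rows, one test row.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
test = np.array([[4.0, 30.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the training statistics

# Manual equivalent using training mean and population std (ddof=0).
mean = train.mean(axis=0)
std = train.std(axis=0)
manual_test = (test - mean) / std

print(np.allclose(test_scaled, manual_test))
```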

Now we can train the model using the model.fit() function. For now we pick arbitrary hyperparameters; at the end we use grid search to look for better ones.

In [22]:
model = SVC(kernel='rbf', C=100, gamma=0.1)
model.fit(X_Train_Scaled, Y_Train)

We import classification_report from sklearn to print the model's precision, recall, F1 score, and accuracy.

In [23]:
from sklearn.metrics import classification_report

Now we predict the test data to evaluate model accuracy.

In [25]:
y_predicted = model.predict(X_Test_Scaled)
print(classification_report(Y_Test, y_predicted))

              precision    recall  f1-score   support

        High       0.93      0.92      0.93       317
         Low       0.84      0.71      0.77        86
      Medium       0.90      0.93      0.91       479

    accuracy                           0.91       882
   macro avg       0.89      0.85      0.87       882
weighted avg       0.91      0.91      0.90       882



Let's print the confusion matrix to evaluate the model's predictions.

In [27]:
from sklearn import metrics
confusion_matrix = metrics.confusion_matrix(Y_Test, y_predicted)
confusion_matrix

array([[292,   0,  25],
       [  0,  61,  25],
       [ 21,  12, 446]])
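
The precision and recall figures in the classification report can be recovered from this matrix. The sketch below assumes the usual scikit-learn layout: rows are true labels, columns are predicted labels, and classes are in sorted order (High, Low, Medium), which matches the support column of the report.

```python
import numpy as np

labels = ['High', 'Low', 'Medium']
cm = np.array([[292, 0, 25],
               [0, 61, 25],
               [21, 12, 446]])

# Recall: correct predictions divided by the number of true samples (row sums).
recall = cm.diagonal() / cm.sum(axis=1)
# Precision: correct predictions divided by the number of predictions (column sums).
precision = cm.diagonal() / cm.sum(axis=0)

for name, p, r in zip(labels, precision, recall):
    print(f"{name:>6}: precision={p:.2f} recall={r:.2f}")
```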

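Great! In every row, the main-diagonal value is the largest, which is a good sign: most samples of each class are predicted correctly. The weakest class is Low, where 25 of 86 samples are misclassified as Medium. Next, we use grid search to look for better hyperparameters.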
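
Before running the search, it can help to count how many candidates a grid produces. In `SVC`, `gamma` only affects the 'rbf', 'poly' and 'sigmoid' kernels, so crossing every `gamma` value with `kernel='linear'` refits the same linear model several times. The sketch below uses the same values as the grid in this notebook and compares one crossed grid with a list of two sub-grids; the names `full_grid` and `split_grid` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

C_values = [0.1, 1, 10, 100]
gamma_values = [1, 0.1, 0.01, 0.001]

# Every C x gamma x kernel combination, as in the notebook's grid.
full_grid = {'C': C_values, 'gamma': gamma_values, 'kernel': ['rbf', 'linear']}

# Alternative: search gamma only for the RBF kernel.
split_grid = [
    {'kernel': ['rbf'], 'C': C_values, 'gamma': gamma_values},
    {'kernel': ['linear'], 'C': C_values},
]

n_full = len(ParameterGrid(full_grid))
n_split = len(ParameterGrid(split_grid))
print(n_full, n_split)
```

`GridSearchCV` accepts a list of dicts for `param_grid`, so a grid shaped like `split_grid` could be passed in place of the single dict to avoid the redundant linear fits.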

In [29]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001],
    'kernel': ['rbf', 'linear']
}

# Initialize the SVM model
model = SVC()
# Set up GridSearchCV
grid = GridSearchCV(model, param_grid, refit=True, verbose=2, cv=5)
grid.fit(X_Train_Scaled, Y_Train)

# Best parameters
print(f"Best parameters: {grid.best_params_}")

# Predictions using the best model
y_predicted = grid.predict(X_Test_Scaled)
print(classification_report(Y_Test, y_predicted))
confusion_matrix = metrics.confusion_matrix(Y_Test, y_predicted)
confusion_matrix

Fitting 5 folds for each of 32 candidates, totalling 160 fits
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.5s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.4s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.4s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.4s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.4s
[CV] END ......................C=0.1, gamma=1, kernel=linear; total time=   0.1s
[CV] END ......................C=0.1, gamma=1, kernel=linear; total time=   0.1s
[CV] END ......................C=0.1, gamma=1, kernel=linear; total time=   0.1s
[CV] END ......................C=0.1, gamma=1, kernel=linear; total time=   0.1s
[CV] END ......................C=0.1, gamma=1, kernel=linear; total time=   0.1s
[CV] END .......................C=0.1, gamma=0.1, kernel=rbf; total time=   0.4s
[CV] END .......................C=0.1, gamma=0.

array([[287,   0,  30],
       [  0,  77,   9],
       [ 21,  13, 445]])
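
The two confusion matrices printed in this notebook can be compared directly. A small sketch, assuming rows are true labels and columns are predicted labels in the order High, Low, Medium; the helper names `accuracy` and `recall` are illustrative.

```python
import numpy as np

# Confusion matrices copied from the outputs above.
before = np.array([[292, 0, 25], [0, 61, 25], [21, 12, 446]])
after = np.array([[287, 0, 30], [0, 77, 9], [21, 13, 445]])

def accuracy(cm):
    """Share of all samples that lie on the main diagonal."""
    return cm.trace() / cm.sum()

def recall(cm):
    """Per-class recall: diagonal divided by row sums."""
    return cm.diagonal() / cm.sum(axis=1)

low = 1  # index of the Low class in sorted label order
print(f"accuracy:   {accuracy(before):.3f} -> {accuracy(after):.3f}")
print(f"Low recall: {recall(before)[low]:.3f} -> {recall(after)[low]:.3f}")
```

On these numbers the tuned model gives up a few correct High predictions (292 to 287) in exchange for a clear gain on the Low class (61 to 77 correct).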