# Life Expectancy (WHO) Machine Learning

This notebook applies several machine learning methods to the WHO Life Expectancy dataset.

In [34]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import skew, kurtosis

#Imputation
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.impute import KNNImputer
from sklearn.impute import SimpleImputer


In [35]:
life = pd.read_csv("Life Expectancy Data.csv")

In [36]:
life

Unnamed: 0,Country,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
0,Afghanistan,2015,Developing,65.0,263.0,62,0.01,71.279624,65.0,1154,...,6.0,8.16,65.0,0.1,584.259210,33736494.0,17.2,17.3,0.479,10.1
1,Afghanistan,2014,Developing,59.9,271.0,64,0.01,73.523582,62.0,492,...,58.0,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0
2,Afghanistan,2013,Developing,59.9,268.0,66,0.01,73.219243,64.0,430,...,62.0,8.13,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.470,9.9
3,Afghanistan,2012,Developing,59.5,272.0,69,0.01,78.184215,67.0,2787,...,67.0,8.52,67.0,0.1,669.959000,3696958.0,17.9,18.0,0.463,9.8
4,Afghanistan,2011,Developing,59.2,275.0,71,0.01,7.097109,68.0,3013,...,68.0,7.87,68.0,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2933,Zimbabwe,2004,Developing,44.3,723.0,27,4.36,0.000000,68.0,31,...,67.0,7.13,65.0,33.6,454.366654,12777511.0,9.4,9.4,0.407,9.2
2934,Zimbabwe,2003,Developing,44.5,715.0,26,4.06,0.000000,7.0,998,...,7.0,6.52,68.0,36.7,453.351155,12633897.0,9.8,9.9,0.418,9.5
2935,Zimbabwe,2002,Developing,44.8,73.0,25,4.43,0.000000,73.0,304,...,73.0,6.53,71.0,39.8,57.348340,125525.0,1.2,1.3,0.427,10.0
2936,Zimbabwe,2001,Developing,45.3,686.0,25,1.72,0.000000,76.0,529,...,76.0,6.16,75.0,42.1,548.587312,12366165.0,1.6,1.7,0.427,9.8


In [37]:
life.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2938 entries, 0 to 2937
Data columns (total 22 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Country                          2938 non-null   object 
 1   Year                             2938 non-null   int64  
 2   Status                           2938 non-null   object 
 3   Life expectancy                  2928 non-null   float64
 4   Adult Mortality                  2928 non-null   float64
 5   infant deaths                    2938 non-null   int64  
 6   Alcohol                          2744 non-null   float64
 7   percentage expenditure           2938 non-null   float64
 8   Hepatitis B                      2385 non-null   float64
 9   Measles                          2938 non-null   int64  
 10   BMI                             2904 non-null   float64
 11  under-five deaths                2938 non-null   int64  
 12  Polio               

The dataset has 2938 entries and 22 columns: 21 feature columns and the target column **Life expectancy**.

**Country** and **Status** are the two categorical features. **Country** clearly has many unique values, so let's check whether **Status** can be converted to a numerical variable.

In [38]:
life.Status.value_counts()

Developing    2426
Developed      512
Name: Status, dtype: int64

For **Status** there are two outcomes:
1. Developing
2. Developed

So we can replace Developing with 0 and Developed with 1.

In [39]:
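# Assign the result back instead of calling replace(inplace=True) on a column selection,
# which may not modify the DataFrame in recent pandas versions.
life["Status"] = life["Status"].map({"Developing": 0, "Developed": 1})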
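A quick sanity check after encoding can confirm that every label was mapped. The sketch below uses a small made-up `status` series rather than the real column; the names `status`, `mapping`, and `encoded` are illustrative only.

```python
import pandas as pd

# Toy stand-in for the Status column (not the real dataset).
status = pd.Series(["Developing", "Developed", "Developing"], name="Status")

# Explicit mapping; any label missing from the dict would become NaN.
mapping = {"Developing": 0, "Developed": 1}
encoded = status.map(mapping)

# Guard against unexpected labels before continuing.
assert encoded.notna().all()
print(encoded.tolist())  # [0, 1, 0]
```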

Let's check for missing values.

In [40]:
life.isna().sum()

Country                              0
Year                                 0
Status                               0
Life expectancy                     10
Adult Mortality                     10
infant deaths                        0
Alcohol                            194
percentage expenditure               0
Hepatitis B                        553
Measles                              0
 BMI                                34
under-five deaths                    0
Polio                               19
Total expenditure                  226
Diphtheria                          19
 HIV/AIDS                            0
GDP                                448
Population                         652
 thinness  1-19 years               34
 thinness 5-9 years                 34
Income composition of resources    167
Schooling                          163
dtype: int64

There are 14 columns with null values, including the target **Life expectancy**. Some have many null values, such as Population with 652 (about 22% of the rows) and Hepatitis B with 553 (about 19%).
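Raw counts are easier to compare as a share of all rows. The following is a minimal sketch on a made-up frame `toy` (a stand-in for `life`, with invented values); the same expression should carry over to the real DataFrame.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `life`; the values are made up.
toy = pd.DataFrame({
    "Population": [1.0, np.nan, np.nan, 4.0],
    "GDP": [10.0, 20.0, np.nan, 40.0],
    "Year": [2000, 2001, 2002, 2003],
})

# isna().mean() is the fraction of missing rows per column.
missing_pct = (toy.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

Sorting in descending order puts the columns that need the most imputation at the top.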

In [41]:
# Drop Country: it is a high-cardinality text column and the imputers need numeric input
life = life.drop(columns="Country")

In [45]:
# Note: `life` still contains the target column, so it is imputed as well.

# Iterative (model-based) imputation with default settings
imp_iter = IterativeImputer()
imputed_iter = imp_iter.fit_transform(life)
life_imp_iter = pd.DataFrame(imputed_iter, columns=life.columns)

# max_iter=10 is already the default, so this differs only by the fixed random_state
imp_iter_iterativ = IterativeImputer(max_iter=10, random_state=0)
imputed_iter_iterativ = imp_iter_iterativ.fit_transform(life)
life_imp_iter_iterativ = pd.DataFrame(imputed_iter_iterativ, columns=life.columns)

# k-nearest-neighbours imputation (default n_neighbors=5)
imp_knn = KNNImputer()
imputed_knn = imp_knn.fit_transform(life)
life_imp_knn = pd.DataFrame(imputed_knn, columns=life.columns)

# Column mean imputation
imp_mean = SimpleImputer(strategy="mean")
imputed_mean = imp_mean.fit_transform(life)
life_imp_mean = pd.DataFrame(imputed_mean, columns=life.columns)

# Column median imputation
imp_median = SimpleImputer(strategy="median")
imputed_median = imp_median.fit_transform(life)
life_imp_median = pd.DataFrame(imputed_median, columns=life.columns)



In [46]:
life_imp_knn.isna().sum()

Year                               0
Status                             0
Life expectancy                    0
Adult Mortality                    0
infant deaths                      0
Alcohol                            0
percentage expenditure             0
Hepatitis B                        0
Measles                            0
 BMI                               0
under-five deaths                  0
Polio                              0
Total expenditure                  0
Diphtheria                         0
 HIV/AIDS                          0
GDP                                0
Population                         0
 thinness  1-19 years              0
 thinness 5-9 years                0
Income composition of resources    0
Schooling                          0
dtype: int64
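The check above confirms that no null values remain, but it does not say which imputer fills them best. One possible way to compare the four approaches is to hide values that are actually known and measure how well each imputer recovers them. The sketch below does this on synthetic, correlated data (not the WHO dataset); the column names `a`, `b`, `c`, the 20% masking rate, and the RMSE metric are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

rng = np.random.default_rng(0)

# Synthetic, fully observed data with correlated columns (not the WHO data).
x = rng.normal(size=200)
full = pd.DataFrame({
    "a": x,
    "b": 2 * x + rng.normal(scale=0.1, size=200),
    "c": -x + rng.normal(scale=0.1, size=200),
})

# Hide roughly 20% of column "b" at random so the true values are known.
mask = rng.random(200) < 0.2
holed = full.copy()
holed.loc[mask, "b"] = np.nan

imputers = {
    "iterative": IterativeImputer(random_state=0),
    "knn": KNNImputer(),
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
}

# RMSE between imputed and true values, on the hidden cells only.
rmse = {}
for name, imp in imputers.items():
    filled = pd.DataFrame(imp.fit_transform(holed), columns=holed.columns)
    err = filled.loc[mask, "b"].to_numpy() - full.loc[mask, "b"].to_numpy()
    rmse[name] = float(np.sqrt(np.mean(err ** 2)))

print(rmse)
```

On the real data, the same idea would apply to rows where a column is observed; the ranking of imputers there may differ from this toy setup.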