# Predicting Life Expectancy with World Health Organization data 
The dataset, curated on Kaggle, has 22 columns. Here we try to predict life expectancy from the other columns, which cover factors such as disease, alcohol consumption, and mortality.
Original data: it covers 193 countries. The health data was collected from the WHO data repository website, and the corresponding economic data was collected from the United Nations website.


# Cross-validating and tuning a tensorflow.keras neural network regressor with scikit-learn's cross-validation
The data is then pipelined and used to fit a neural network. This is then refactored into cross-validating and hyperparameter tuning pipelines.

In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

In [3]:
path = r"Life Expectancy Data.csv"
df = pd.read_csv(path)

In [4]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2938 entries, 0 to 2937
Data columns (total 22 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Country                          2938 non-null   object 
 1   Year                             2938 non-null   int64  
 2   Status                           2938 non-null   object 
 3   Life expectancy                  2928 non-null   float64
 4   Adult Mortality                  2928 non-null   float64
 5   infant deaths                    2938 non-null   int64  
 6   Alcohol                          2744 non-null   float64
 7   percentage expenditure           2938 non-null   float64
 8   Hepatitis B                      2385 non-null   float64
 9   Measles                          2938 non-null   int64  
 10   BMI                             2904 non-null   float64
 11  under-five deaths                2938 non-null   int64  
 12  Polio               

# Preprocessing is necessary
However, I will leave most preprocessing out of this version, as I am integrating it into the 5_b project.<br>
I believe there is value in validating more of the model and the data handling together.<br>
In this project, dropping incomplete rows takes us from about 2900 eligible rows down to 1649. The dropped rows will be brought back in the next 5_b project.
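
To see where the rows go before dropping them, it can help to count the gaps per column. The sketch below uses a small made-up table (the name `toy` and all values in it are assumptions for illustration, not the WHO data) and compares the row count before and after a blanket `dropna()`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the WHO table; the values are made up for illustration.
toy = pd.DataFrame({
    "Life expectancy": [65.0, 59.9, np.nan, 71.2, 80.1],
    "Alcohol":         [0.01, np.nan, 0.03, 4.5, np.nan],
    "Hepatitis B":     [65.0, 62.0, 64.0, np.nan, 95.0],
    "Polio":           [6.0, 58.0, 62.0, 99.0, 97.0],
})

# Missing values per column, largest first.
missing_per_column = toy.isna().sum().sort_values(ascending=False)

# Rows that survive a blanket dropna(): only rows with no gaps at all.
rows_before = len(toy)
rows_after = len(toy.dropna())

print(missing_per_column)
print(f"{rows_after} of {rows_before} rows are complete")
```

A single sparse column can remove many rows on its own, so the per-column counts show which columns are the most expensive to keep.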

## Generally about the data
General trends for life expectancy: here we try to build a model that treats the rows as independent and identically distributed (IID). A time-series approach comes to mind, since improvements happen over time and by place, but those two variables are dropped in the IID approach.
A time-series approach would also give rise to some easy-to-interpret plots.<br>
The yearly rows for a given country are very close to each other. In effect, we have almost bootstrapped each country's data.<br>
I decided to exclude Country and Year because both are somewhat leaky when rows are treated as independent. Country would also need an extra step when predicting on new data, unless we have an embedding strategy for the country name.
There could be general trends by year: technological advancement means life expectancy in, say, 1948 would differ from 2020. Again, these are time trends, not independent observations.<br>
I also find GDP to be a somewhat leaky variable, but GDP combined with Total expenditure makes an interesting feature: the absolute amount of money spent on health care. The suggested feature is therefore 'absolute amount per capita on health care', that is, how many dollars are spent per person.<br>
On investigating the data, 'Total expenditure' looks faulty: it is described as a percentage but appears to hold the actual amount, which is the same as the suggested feature. That leaves a relative measure missing, so one is added instead.<br>
Some of the column names have leading or trailing whitespace, and those are fixed as well. Things need to look neat.

In [5]:
df.columns = df.columns.str.strip()
df['Relative Expenditure'] = df['Total expenditure'] / df['GDP']
df = df.drop(labels=['Country','Year','GDP'], axis=1)
df = df.dropna()
print(df.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1649 entries, 0 to 2937
Data columns (total 20 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Status                           1649 non-null   object 
 1   Life expectancy                  1649 non-null   float64
 2   Adult Mortality                  1649 non-null   float64
 3   infant deaths                    1649 non-null   int64  
 4   Alcohol                          1649 non-null   float64
 5   percentage expenditure           1649 non-null   float64
 6   Hepatitis B                      1649 non-null   float64
 7   Measles                          1649 non-null   int64  
 8   BMI                              1649 non-null   float64
 9   under-five deaths                1649 non-null   int64  
 10  Polio                            1649 non-null   float64
 11  Total expenditure                1649 non-null   float64
 12  Diphtheria          

# Binary Categorical Variable, Easy fix.
'Status' is a categorical variable with two values, Developing and Developed. It is kept and converted to a 0/1 encoding. get_dummies works well for this; because the variable is binary, a lambda on the series that maps it to 1 or 0 will also do.

In [6]:
print(df['Status'].value_counts())
df['Status'] = df['Status'].apply(lambda x: 1 if x == 'Developing' else 0)
print(df['Status'].value_counts())

Developing    1407
Developed      242
Name: Status, dtype: int64
1    1407
0     242
Name: Status, dtype: int64
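
For comparison, here is a minimal sketch of both routes on a three-row toy series (the series and the variable names `status_binary` and `status_dummies` are assumptions for illustration). With `drop_first=True`, `get_dummies` keeps a single 0/1 column, which carries the same information as the explicit mapping.

```python
import pandas as pd

status = pd.Series(["Developing", "Developed", "Developing"], name="Status")

# Route 1: explicit mapping, equivalent to the lambda used above.
status_binary = status.map({"Developing": 1, "Developed": 0})

# Route 2: one-hot encoding; drop_first=True keeps a single 0/1 column
# because the second column would be fully determined by the first.
status_dummies = pd.get_dummies(status, prefix="Status", drop_first=True, dtype=int)

print(status_binary.tolist())
print(status_dummies)
```

The mapping keeps the column name 'Status', while `get_dummies` produces a column named after the retained category, which can be easier to read later.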


In [7]:
target = df.pop('Life expectancy')
features = df
features.shape[1]

19

# Train Test Split Approach
We will also revisit cross-validation, which is preferred. <br>
TensorFlow has internal validation, and that is used later on in the cross-validation. <br>
There isn't anything in particular to stratify the splits on other than 'Status'. There are enough points in each group, though, that the model should capture the distinction without it.

In [8]:
# Internal validation possible in TF with validation_split. Splitting only for test set 
X_train, X_test, y_train, y_test = train_test_split(features, target, train_size=0.8) # stratify on developing or other variables reasonable
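
If stratifying on 'Status' were wanted, a sketch could look like the following. The toy frame, the names `toy_features` and `toy_target`, and the `random_state` value are assumptions for illustration; only `train_test_split` and its `stratify` argument come from scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with roughly the same class imbalance as 'Status' (about 85% vs 15%).
rng = np.random.default_rng(0)
toy_features = pd.DataFrame({
    "Status": [1] * 85 + [0] * 15,
    "Alcohol": rng.normal(size=100),
})
toy_target = pd.Series(rng.normal(70, 9, size=100), name="Life expectancy")

# stratify keeps the Developing/Developed ratio the same in both splits;
# random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_features, toy_target,
    train_size=0.8, stratify=toy_features["Status"], random_state=42,
)

print(X_tr["Status"].mean(), X_te["Status"].mean())
```

Fixing `random_state` also makes the reported errors reproducible from run to run, which an unseeded split does not guarantee.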

In [9]:
numeric_features = features.select_dtypes(['float64','int64']) # or a list comprehension: [col for col in features.columns if features[col].dtype in ('float64', 'int64')]
numeric_columns = numeric_features.columns
ct = ColumnTransformer([("Numeric Scaler", StandardScaler(), numeric_columns)], remainder='passthrough')
fulldata_ct = ct
# Zipping the standardscaler

In [10]:
X_train_scaled = ct.fit_transform(X_train) # This is similar to a function that would take in the data, calculate the mean and std for each column. Store that information and apply it to new data to z-standardize.
X_test_scaled = ct.transform(X_test)
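
The fit/transform split described in the comment above can be checked by hand on a tiny array. This is a sketch with made-up numbers (the arrays `train` and `test` are assumptions): the mean and standard deviation are learned from the training rows only and then reused on the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
test = np.array([[2.0, 30.0]])

# Learn mean and standard deviation from the training rows only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), which StandardScaler also uses

# Apply the training statistics to the unseen rows.
manual_test = (test - mean) / std

scaler = StandardScaler().fit(train)
sklearn_test = scaler.transform(test)

print(np.allclose(manual_test, sklearn_test))
```

Calling `fit_transform` on the test set instead would compute new statistics from the test rows and leak information about them into the evaluation.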

Preprocessing done. Note that nothing in the data was imputed: rows with missing values were dropped, along with the Country, Year, and GDP columns. If we suspect a time series is a good idea, I estimate that approach would also yield good results, especially visually.

In [11]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import InputLayer, Dense, Dropout
from keras.optimizers import Adam

Using TensorFlow backend.


In [12]:
my_model = Sequential(name='my_model')

In [13]:

input_layer = InputLayer(input_shape=(features.shape[1],), name='input_layer') # (Feature columns, any number of rows.)
my_model.add(input_layer)

In [14]:
# 19 features and a close enough binary number. 2,4,8,16,32
dense_1 = Dense(32, activation='relu', name='hidden_layer_one')
#dense_2 = Dense(16, activation='relu', name='hidden_layer_two')
my_model.add(dense_1)
#my_model.add(dense_2)

In [15]:
output_layer = Dense(1, name='regression_output') # Regression, no activation function on this estimating layer
my_model.add(output_layer)

In [16]:
print(my_model.summary())

Model: "my_model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
hidden_layer_one (Dense)     (None, 32)                640       
_________________________________________________________________
regression_output (Dense)    (None, 1)                 33        
Total params: 673
Trainable params: 673
Non-trainable params: 0
_________________________________________________________________
None
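
The parameter counts in the summary follow directly from the layer sizes: a dense layer has one weight per input-unit pair plus one bias per unit. A quick check (the helper `dense_params` is a hypothetical name introduced here):

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units


n_features = 19
hidden = dense_params(n_features, 32)   # 19 * 32 + 32
output = dense_params(32, 1)            # 32 * 1 + 1

print(hidden, output, hidden + output)  # 640 33 673
```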


In [17]:
# Put the pieces for backpropagation in
my_model.compile(loss='mse',metrics=['mae'], optimizer='Adam') # Standard Learning Rate for Adam

In [18]:
history = my_model.fit(X_train_scaled, y_train, epochs=40, batch_size=1, verbose=False) # from ca 10 -> 2 MAE

In [19]:
res_mse, res_mae = my_model.evaluate(X_test_scaled, y_test)



In [20]:
print(f"Mean Squared Error: {np.round(res_mse, decimals=2)}")
print(f"Mean Absolute Error: {np.round(res_mae, decimals=2)}")

Mean Squared Error: 9.54
Mean Absolute Error: 2.15


In [21]:
print('-----FIVE ACTUAL VALUES ---')
print(y_test.iloc[5:10])

-----FIVE ACTUAL VALUES ---
2156    62.8
2557    68.1
1335    73.4
1193    65.5
1987    59.6
Name: Life expectancy, dtype: float64


In [22]:
# prediction = my_model.predict(np.array([X_test_scaled[1,:],])) Single prediction A Bit harder To give the model [[]]
prediction = my_model.predict(np.array(X_test_scaled[5:10,:]))
print('-----FIVE PREDICTIONS VALUES ---')
prediction

-----FIVE PREDICTIONS VALUES ---


array([[63.2666  ],
       [69.29033 ],
       [74.30104 ],
       [64.472466],
       [62.19476 ]], dtype=float32)

In [23]:
np.round(target.std(), decimals=2)

8.8

# So far the model does pretty well.
We have a mean absolute error (MAE) of about 2.15 years, which is quite low. The target has a standard deviation of roughly 8.8 and a variance of about 77. Against that baseline the model does a lot better, with an MAE of 2.15 and a mean squared error (MSE) of 9.54. The split is not seeded, so the exact figures vary a little between runs. MAE is computed a bit differently from the standard deviation, but both are in years, so they are in the same ballpark for comparison.
Comparing the MSE with the target variance is somewhat like an R-squared value for explanatory power, except that this one follows a smaller-is-better scheme.
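
The comparison can be made concrete with a little arithmetic on the printed numbers. This is a rough sketch: the MSE of 9.54 comes from the test split shown in the output, while the standard deviation of 8.8 is taken over the whole target column, so the resulting R-squared is only an approximation.

```python
import math

test_mse = 9.54      # from the evaluation output above
target_std = 8.8     # from target.std() above

# Root mean squared error is in the same unit as the target (years).
rmse = math.sqrt(test_mse)
target_variance = target_std ** 2

# Share of the target variance that the model accounts for (approximate).
approx_r2 = 1 - test_mse / target_variance

print(f"RMSE: {rmse:.2f} years")                   # RMSE: 3.09 years
print(f"Target variance: {target_variance:.2f}")   # Target variance: 77.44
print(f"Approximate R^2: {approx_r2:.2f}")         # Approximate R^2: 0.88
```

The RMSE of about 3.1 years is the figure to set against the standard deviation of 8.8, since both are root-mean-square quantities.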

# Use all the data?
Technically speaking, there is not much need for such a large test set if we are doing cross-validation. 
And so we could use our CT object on more of the data set at this point.  

In [24]:
X_train, X_test, y_train, y_test = train_test_split(features, target, train_size=0.95)

In [25]:
X_train = fulldata_ct.fit_transform(X_train) # fulldata_ct is the same object as ct, so this refits the scaler on the larger training split.
X_test = fulldata_ct.transform(X_test)

In [26]:
# One more time With a Random Search
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.metrics import make_scorer
from scipy.stats import randint as sp_randint
import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping

In [27]:
def design_model():
    model = Sequential(name="my_model")
    inputs = tf.keras.Input(shape=(features.shape[1],)) # Cannot use input_shape with Cross Validation; renamed to avoid shadowing the built-in input()
    model.add(inputs)
    model.add(Dense(11, activation = 'relu'))
    model.add(Dense(1))
    opt = tf.keras.optimizers.Adam(learning_rate = 0.01)
    model.compile(loss='mse', metrics=['mae'], optimizer=opt)
    return model

In [28]:
def do_randomized_search(design=design_model):
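  # Note: Keras 2 names this fit argument `epochs`; `nb_epoch` is the older Keras 1 name,
  # so it is worth verifying that the sampled value actually reaches fit().
  param_grid = {'batch_size': sp_randint(2, 16), 'nb_epoch': sp_randint(10, 100)}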
  model = KerasRegressor(build_fn=design)
  grid = RandomizedSearchCV(estimator = model, param_distributions=param_grid, 
                            scoring = make_scorer(mean_squared_error, greater_is_better=False), 
                            n_iter = 12, n_jobs=-1) # njobs -1 uses all processors if one has more to spare.
  es = EarlyStopping(monitor='loss', mode='min', verbose=1, patience = 20)
  random_search = grid.fit(X_train, y_train, verbose = 0, callbacks=[es]) # es goes here?
 
  return random_search
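
To get a feel for what the randomized search draws, the same distributions can be sampled directly with scikit-learn's `ParameterSampler`, without training anything. This is a sketch: the key names are copied from the search above, and `random_state=0` is an assumption added here for repeatability.

```python
from scipy.stats import randint as sp_randint
from sklearn.model_selection import ParameterSampler

# Same distributions as in the search above. randint(low, high) draws
# integers with low included and high excluded.
param_distributions = {
    "batch_size": sp_randint(2, 16),   # 2..15
    "nb_epoch": sp_randint(10, 100),   # 10..99
}

# n_iter=12 mirrors the search; random_state is an addition for repeatability.
candidates = list(ParameterSampler(param_distributions, n_iter=12, random_state=0))

for candidate in candidates[:3]:
    print(candidate)
```

With only 12 draws from a grid of 14 batch sizes by 90 epoch counts, the search covers a small fraction of the space, so trends read off the results are indicative rather than conclusive.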

In [29]:
random_search = do_randomized_search()
print("-------------- RANDOMIZED SEARCH COMPLETED--------------------")

-------------- RANDOMIZED SEARCH COMPLETED--------------------


In [30]:
random = pd.DataFrame({'Mean Score':random_search.cv_results_['mean_test_score'],
                      'Standard Dev': random_search.cv_results_['std_test_score'],
                       'Parameters': random_search.cv_results_['params']
                      })
# cv_results_:dict of numpy (masked) ndarrays
random.sort_values(by='Mean Score', ascending=False, inplace=True)
pd.concat([random.head(5), random.tail(5)]) # Best and worst performers; DataFrame.append is deprecated in newer pandas.
# We see a tendency for small batch sizes and > 43 epochs
# We could use this information to do a sort of Bayesian estimation ourselves.
# Clearly smaller batches are better at this point.

Unnamed: 0,Mean Score,Standard Dev,Parameters
9,-114.186734,17.984845,"{'batch_size': 3, 'nb_epoch': 55}"
0,-161.383234,26.037532,"{'batch_size': 4, 'nb_epoch': 28}"
5,-184.535867,31.99985,"{'batch_size': 4, 'nb_epoch': 52}"
10,-265.861618,31.616364,"{'batch_size': 6, 'nb_epoch': 28}"
1,-518.913632,186.794057,"{'batch_size': 8, 'nb_epoch': 66}"
8,-1222.874248,295.12789,"{'batch_size': 12, 'nb_epoch': 62}"
4,-1265.548907,222.80349,"{'batch_size': 11, 'nb_epoch': 21}"
2,-1386.27931,210.82804,"{'batch_size': 13, 'nb_epoch': 93}"
6,-1571.947558,187.049547,"{'batch_size': 12, 'nb_epoch': 54}"
7,-1818.998144,627.104025,"{'batch_size': 13, 'nb_epoch': 35}"


In [31]:
best_estimator = random_search.best_estimator_
# TF res_mse, res_mae = best_estimator.evaluate(X_test_scaled, y_test)
-1* best_estimator.score(X_test, y_test)



88.41277657933982

# The Cross Validated model performs worse. Why?
Notice the difference: the model built by the design_model function has fewer nodes.
11 nodes were used here because I took the code from another dataset in the tensorflow_4 project.
In this project we were using powers of two for the layer sizes. I manually added hidden layers without much success; reducing the model's number of nodes, however, makes it worse off.
We do notice that smaller batches and high epoch counts seem to work well for the model. <br>
With 32 nodes we had an MSE of about 9 and an MAE of about 2 years. The 11-node model comes in at about 88 MSE and roughly 7 years off.
The lower-complexity model does not capture enough of the data. So we can say that going from 11 to 32 nodes there is computational reducibility of the problem. We can bridge this reducibility by increasing the complexity of the model; this is sometimes also referred to as a bias-variance trade-off. Tweaking hyperparameters does not make up for it at this point.
One caveat on this comparison: the search varies 'nb_epoch', whereas the Keras 2 fit argument is named 'epochs' and defaults to 1. If the sampled value never reaches fit, every candidate trains for a single epoch, which would also explain the high errors and why small batches, with more weight updates per epoch, come out ahead. The two setups also differ in learning rate and batch size, so node count is not the only thing that changed.

# Time to refactor the Code to be able to do more exhaustive searches
The continuation of this project will refactor the code around two of my ideas: computational reducibility, and iterative modeling that takes Bayesian-optimization thinking into account. The last two DataFrames give a good indication of how to optimize two hyperparameters.
The code here is enough of a start for those two refactorings.
My thinking is that I want the code to first add complexity, then add optimization, and keep going like this. Added to the 5_b project will be the viking model that did just those things.

# What else can be done? Get the Data back
Another important piece is all the data that was dropped. Since it has year and country columns, I am fairly sure a first draft of time-series interpolation could help bridge those gaps.
I would not want to impute the values at this stage. After the interpolation approach, we could iterate between imputation and interpolation. <br>
That is, the best estimate for a country's missing value is its closest existing value in time. If that doesn't exist, the closest neighbor could be used, found by vectorizing each country and year. I would use cosine similarity, often used in natural language processing, to find it. So those are two ways to improve the data.<br>
I will end up trying to impute with a KNNImputer, which follows a similar nearest-neighbor idea. <br>
Barring that, we could use a stratified split to infer values and, finally, impute with the column mean. However, imputing with the mean is, well, strange. The missing values will tend to come from poorer nations, and these will have life expectancies that differ from the mean. Since it is almost half the data, I would hesitate to use mean imputation for a first go.
### SPOILER: taking outliers into account makes for more trouble.
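
The two-step idea (interpolate within a country first, then fall back to nearest neighbors) could be sketched as follows. The toy panel, the country labels "A" and "B", and `n_neighbors=2` are assumptions for illustration. Note that scikit-learn's `KNNImputer` uses a NaN-aware Euclidean distance by default rather than cosine similarity.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy panel: two hypothetical countries, three years each, with gaps.
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "Year":    [2000, 2001, 2002, 2000, 2001, 2002],
    "Alcohol": [1.0, np.nan, 3.0, np.nan, np.nan, np.nan],
    "BMI":     [20.0, 21.0, 22.0, 20.5, 21.5, np.nan],
})
value_columns = ["Alcohol", "BMI"]

# Step 1: interpolate within each country along the year axis.
toy = toy.sort_values(["Country", "Year"])
interpolated = toy.copy()
interpolated[value_columns] = (
    toy.groupby("Country")[value_columns]
       .transform(lambda s: s.interpolate(limit_direction="both"))
)

# Step 2: fill what is still missing from the nearest rows across countries.
imputer = KNNImputer(n_neighbors=2)
filled = interpolated.copy()
filled[value_columns] = imputer.fit_transform(interpolated[value_columns])

print(interpolated)
print(filled)
```

Interpolation cannot help country "B" for 'Alcohol' because it has no observed value at all, which is exactly the case where the neighbor-based step takes over.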