# Revisiting the 5A project
The 5A project aimed to predict life expectancy from a set of features. <br>
To set the bar higher than in project 5A, I built a stronger model in another program of mine, optimized for the 5A data. That optimized model is given here by the function viking. <br>
In 5A we excluded some data. We are going to revisit that data with the optimized model, but without fitting to it at all. That should give us an idea of how robust the model is to outliers.

In [243]:
import numpy as np 
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
np.random.seed(416)

In [244]:
path = r"Life Expectancy Data.csv"
df = pd.read_csv(path)

In [245]:
df.columns = df.columns.str.strip()
df['Relative Expenditure'] = df['Total expenditure'] / df['GDP']
df = df.drop(labels=['Country','Year','GDP'], axis=1)
df = df.dropna()
df['Status'] = df['Status'].apply(lambda x: 1 if x == 'Developing' else 0)
target = df.pop('Life expectancy')
features = df

In [246]:
numeric_features = features.select_dtypes(['float64','int64'])
numeric_columns = numeric_features.columns
ct = ColumnTransformer([("Numeric Scaler", StandardScaler(), numeric_columns)], remainder='passthrough')
# Standardize the numeric features; without this, scale-sensitive pieces such as L1/L2 penalties will not work properly.

In [247]:
X_train, X_test, y_train, y_test = train_test_split(features, target, train_size=0.8)
X_train_scaled = ct.fit_transform(X_train)
X_test_scaled = ct.transform(X_test)
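As a small aside on the scaling step, here is a minimal sketch on made-up data (the array `X` and its two columns are assumptions, not the life expectancy set). It shows what fitting `StandardScaler` on the training split only does: the training columns end up with mean 0 and unit variance, while the test columns are only approximately standardized because they reuse the training statistics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(416)
# Hypothetical features on very different scales (a small rate and a large count).
X = np.column_stack([rng.normal(0.5, 0.1, 200), rng.normal(1e6, 3e5, 200)])

X_tr, X_te = train_test_split(X, train_size=0.8, random_state=416)

scaler = StandardScaler().fit(X_tr)  # statistics come from the training rows only
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)      # test rows reuse the training mean and std

print("train mean/std:", X_tr_s.mean(axis=0), X_tr_s.std(axis=0))
print("test  mean/std:", X_te_s.mean(axis=0), X_te_s.std(axis=0))
```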

In [248]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Dropout

In [249]:
def viking():
    my_model = Sequential(name='Model_Viking')
    input_layer = Input(shape=(features.shape[1],), name='input_layer')
    my_model.add(input_layer)

    dense_1 = Dense(32, activation='relu', name='hidden_layer_one')
    dense_2 = Dense(32, activation='relu', name='hidden_layer_two')
    my_model.add(dense_1)
    my_model.add(dense_2)

    output_layer = Dense(1, name='regression_output')
    my_model.add(output_layer)
    
    my_model.compile(loss='mse',metrics=['mae'], optimizer='Adam')
    return my_model

In [250]:
my_model = viking()
print(my_model.summary())

Model: "Model_Viking"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
hidden_layer_one (Dense)     (None, 32)                640       
_________________________________________________________________
hidden_layer_two (Dense)     (None, 32)                1056      
_________________________________________________________________
regression_output (Dense)    (None, 1)                 33        
Total params: 1,729
Trainable params: 1,729
Non-trainable params: 0
_________________________________________________________________
None


In [251]:
history = my_model.fit(X_train_scaled, y_train, epochs=97, batch_size=4, verbose=False)

In [252]:
res_mse, res_mae = my_model.evaluate(X_test_scaled, y_test)



In [253]:
print(f"Mean Squared Error: {np.round(res_mse, decimals=2)}")
print(f"Mean Absolute Error: {np.round(res_mae, decimals=2)}")

Mean Squared Error: 6.66
Mean Absolute Error: 1.83


# An even tougher baseline to beat.
In project 5A the losses were about: 
* Mean Squared Error: 7.69 <br>
* Mean Absolute Error: 2.16<br>
<br>
Now we have this total beast of a model that pushes both numbers down even further! This is our baseline for checking whether the model can handle the data that was not included.

# Revisiting the data without dropping the NaNs
Keep in mind that the rows we dropped previously were probably of a particular kind, so dropping them likely removed outliers from the model's point of view. That said, let's revisit these data points.

In [274]:
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline

In [275]:
path = r"Life Expectancy Data.csv"
df = pd.read_csv(path)
df.columns = df.columns.str.strip()
df['Relative Expenditure'] = df['Total expenditure'] / df['GDP']
df = df.drop(labels=['Country','Year','GDP'], axis=1) 
df['Status'] = df['Status'].apply(lambda x: 1 if x == 'Developing' else 0)
imputer = KNNImputer(n_neighbors=3)
imputed_data = imputer.fit_transform(df)  # technically cheating: the imputer sees every row, including the target column, before the train/test split

In [276]:
imputed_df = pd.DataFrame(imputed_data, columns = df.columns)
display(imputed_df.head())
print(imputed_df.isna().sum())

|  | Status | Life expectancy | Adult Mortality | infant deaths | Alcohol | percentage expenditure | Hepatitis B | Measles | BMI | under-five deaths | Polio | Total expenditure | Diphtheria | HIV/AIDS | Population | thinness 1-19 years | thinness 5-9 years | Income composition of resources | Schooling | Relative Expenditure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 65.0 | 263.0 | 62.0 | 0.01 | 71.279624 | 65.0 | 1154.0 | 19.1 | 83.0 | 6.0 | 8.16 | 65.0 | 0.1 | 33736494.0 | 17.2 | 17.3 | 0.479 | 10.1 | 0.013966 |
| 1 | 1.0 | 59.9 | 271.0 | 64.0 | 0.01 | 73.523582 | 62.0 | 492.0 | 18.6 | 86.0 | 58.0 | 8.18 | 62.0 | 0.1 | 327582.0 | 17.5 | 17.5 | 0.476 | 10.0 | 0.013351 |
| 2 | 1.0 | 59.9 | 268.0 | 66.0 | 0.01 | 73.219243 | 64.0 | 430.0 | 18.1 | 89.0 | 62.0 | 8.13 | 64.0 | 0.1 | 31731688.0 | 17.7 | 17.7 | 0.47 | 9.9 | 0.012869 |
| 3 | 1.0 | 59.5 | 272.0 | 69.0 | 0.01 | 78.184215 | 67.0 | 2787.0 | 17.6 | 93.0 | 67.0 | 8.52 | 67.0 | 0.1 | 3696958.0 | 17.9 | 18.0 | 0.463 | 9.8 | 0.012717 |
| 4 | 1.0 | 59.2 | 275.0 | 71.0 | 0.01 | 7.097109 | 68.0 | 3013.0 | 17.2 | 97.0 | 68.0 | 7.87 | 68.0 | 0.1 | 2978599.0 | 18.2 | 18.2 | 0.454 | 9.5 | 0.123864 |


Status                             0
Life expectancy                    0
Adult Mortality                    0
infant deaths                      0
Alcohol                            0
percentage expenditure             0
Hepatitis B                        0
Measles                            0
BMI                                0
under-five deaths                  0
Polio                              0
Total expenditure                  0
Diphtheria                         0
HIV/AIDS                           0
Population                         0
thinness  1-19 years               0
thinness 5-9 years                 0
Income composition of resources    0
Schooling                          0
Relative Expenditure               0
dtype: int64
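The "cheating" in the imputation cell comes from fitting the imputer on every row before the split. A minimal sketch of a leak-free variant, assuming synthetic data (the arrays `X` and `y` and the missing-value mask are made up for illustration): put `KNNImputer` and `StandardScaler` in a `Pipeline`, keep the target out of the imputed matrix, and fit the pipeline on the training split only.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(416)
# Hypothetical features and target; roughly 10% of the feature entries go missing.
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=300)
X[rng.random(X.shape) < 0.1] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=416)

prep = Pipeline([
    ("impute", KNNImputer(n_neighbors=3)),  # neighbours are looked up among training rows only
    ("scale", StandardScaler()),
])
X_tr_prep = prep.fit_transform(X_tr)  # fit on the training split
X_te_prep = prep.transform(X_te)      # test rows are filled in from training neighbours

print("remaining NaNs:", np.isnan(X_tr_prep).sum(), np.isnan(X_te_prep).sum())
```

With this layout the test rows never influence the imputed values or the scaling statistics, so a test score is a fairer estimate.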


In [266]:
target = imputed_df.pop('Life expectancy') 
features = imputed_df

In [267]:
numeric_features = features.select_dtypes(['float64','int64'])
numeric_columns = numeric_features.columns
ct = ColumnTransformer([("Numeric Scaler", StandardScaler(), numeric_columns)], remainder='passthrough')

In [268]:
X_train, X_test, y_train, y_test = train_test_split(features, target, train_size=0.8)
X_train_imputed_scaled = ct.fit_transform(X_train)
X_test_imputed_scaled = ct.transform(X_test)

In [270]:
my_model.evaluate(X_test_imputed_scaled, y_test)



[10.528401335891413, 2.2058244]
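One thing that may contribute to this gap, separately from outliers: the cells above refit `ct` on the imputed training split, so the test features are scaled with different statistics than the ones `my_model` was trained on. The sketch below is an illustration under loud assumptions: synthetic data, and a plain `LinearRegression` standing in for the network (none of these names come from the notebook). It only shows that refitting the scaler can change the error by itself.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
w = np.array([1.0, -2.0, 0.5])

# "Original" data the model is trained on, and "new" data from a shifted distribution.
X_a = rng.normal(0.0, 1.0, size=(500, 3))
X_b = rng.normal(2.0, 3.0, size=(500, 3))
y_a = X_a @ w + 10.0
y_b = X_b @ w + 10.0

scaler_a = StandardScaler().fit(X_a)
model = LinearRegression().fit(scaler_a.transform(X_a), y_a)

# Same model, same new data, two different scalers.
scaler_b = StandardScaler().fit(X_b)
mse_same_scaler = mean_squared_error(y_b, model.predict(scaler_a.transform(X_b)))
mse_refit_scaler = mean_squared_error(y_b, model.predict(scaler_b.transform(X_b)))

print(f"MSE with the scaler the model was trained with: {mse_same_scaler:.4f}")
print(f"MSE with a scaler refit on the new data:        {mse_refit_scaler:.4f}")
```

In the notebook, the analogous check would be to keep the first fitted `ct` under its own name and use it to transform the imputed test split before calling `evaluate`.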

# Conclusions: taking in more outliers, even while letting the model cheat
We end up with worse performance when we take in more of the data. Even when the script is allowed to cheat a bit with the imputation, we are still worse off. So what happened? We included more outliers in our data, and the Viking model was not prepared for that: it was trained on a smaller dataset without these missing values. <br>
 What if we imputed in some other way? That is certainly a good 'hyperparameter' to tweak. There is no real statistical consensus on how to solve complex problems like this one. Is a mean better than cosine? Is the median better than an IterativeImputer that uses round-robin linear regression, modeling each feature with missing values as a function of the other features and assuming Gaussian (output) variables? Well, who can answer that? Iterations, many many many iterations, and experiments. <br>
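Treating the imputation method as a 'hyperparameter' can be sketched as a small cross-validated comparison. Everything below is an assumption made for illustration: the data is synthetic, and a `Ridge` regressor stands in for the network to keep the example fast. `IterativeImputer` is still experimental in scikit-learn, so it needs the explicit enabling import.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to enable IterativeImputer)
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(416)
# Synthetic, correlated features with roughly 15% of the entries missing.
base = rng.normal(size=(400, 1))
X = base + rng.normal(scale=0.5, size=(400, 6))
y = X.sum(axis=1) + rng.normal(scale=0.2, size=400)
X[rng.random(X.shape) < 0.15] = np.nan

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=3),
    "iterative": IterativeImputer(random_state=416),
}

scores = {}
for name, imputer in imputers.items():
    # The imputer lives inside the pipeline, so it is refit on each training fold.
    pipe = make_pipeline(imputer, StandardScaler(), Ridge())
    neg_mae = cross_val_score(pipe, X, y, cv=5, scoring="neg_mean_absolute_error")
    scores[name] = -neg_mae.mean()

for name, mae in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name:10s} MAE = {mae:.3f}")
```

Which imputer comes out ahead depends on the data, which is the point made above: only experiments on the actual dataset can answer it.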
<br>
 We took the original loss down. We started with a Mean Squared Error of about 7.7 in project 5A. The higher-complexity model used here, Viking, took that down to 6.66. Revisiting the data and expanding it brought the error up to 10.5 with that same model. What would be next? Another iteration to find the best model given all the data. You see, the problem is now a different problem: we are still trying to predict the same life expectancy variable, but from other data.<br>
# Mean Absolute Error Gives us a Clue
Now, this was a pretty lenient lesson to learn. We added about 50% more data, and the model trained on the cleaner data got worse on Mean Squared Error, which punishes outliers (see above). Mean Absolute Error did not suffer by a whole lot: it returned to about 2.2, which was our starting error with the lower-complexity model. <br>
So adding outliers and not training on them produces a result that is worse on outliers. No real surprise.
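To make the difference between the two metrics concrete, here is a tiny worked example with made-up residuals in years (the numbers are assumptions, not taken from the model). A single large miss multiplies the MSE by almost nine but only about doubles the MAE.

```python
import numpy as np

# Hypothetical prediction errors in years; the second set has one outlier.
errors_clean = np.array([1.0, -2.0, 1.5, -1.0, 2.0])
errors_outlier = np.array([1.0, -2.0, 1.5, -1.0, 10.0])

def mse(e):
    return float(np.mean(e ** 2))

def mae(e):
    return float(np.mean(np.abs(e)))

print(f"clean:   MSE = {mse(errors_clean):.2f}, MAE = {mae(errors_clean):.2f}")
# clean:   MSE = 2.45, MAE = 1.50
print(f"outlier: MSE = {mse(errors_outlier):.2f}, MAE = {mae(errors_outlier):.2f}")
# outlier: MSE = 21.65, MAE = 3.10
```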

# The really scary outlier cake
We have seen that we were only mildly punished for not training on outliers here. There is of course always something much worse lurking out there. Always a bigger fish. Suppose we had a model with a year column, and suppose that in 2016 the earth was hit by a comet on July 1. On June 29th our model would predict something reasonable for the next day. On June 30th, however, its prediction for the next day would not be good at all. One outlier made all the difference.