
# Report on U.N. World Happiness Data

## Davide Vaccari

# Objective: Predict World Happiness Rankings 

What makes the citizens of one country happier than the citizens of other countries?  Are variables measuring perceptions of corruption, GDP, maintaining a healthy lifestyle, or social support associated with a country's happiness ranking?  

Let's use the United Nations' World Happiness Rankings country-level data to experiment with models that predict happiness rankings well.


---

**Data**: 2019 World Happiness Survey Rankings
*(Data can be found on Advanced Projects in ML courseworks site)*

**Features**
*   Country or region
*   GDP per capita
*   Social support
*   Healthy life expectancy
*   Freedom to make life choices
*   Generosity
*   Perceptions of corruption

**Target**
*   Happiness_level (Very High = Top 20% and Very Low = Bottom 20%)
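
As a rough illustration of how a five-level target like this could be derived from a country's rank, the sketch below bins ranks into quintiles. Only "Very High" and "Very Low" are named in the description above; the three middle label names and the helper `happiness_level` are assumptions made for this example, not part of the dataset documentation.

```python
# Only "Very High" and "Very Low" come from the description above;
# the three middle names are assumed for illustration.
LEVELS = ["Very High", "High", "Average", "Low", "Very Low"]


def happiness_level(rank: int, n_countries: int) -> str:
    """Map a 1-based happiness rank (1 = happiest) to a quintile label."""
    if not 1 <= rank <= n_countries:
        raise ValueError("rank must be between 1 and n_countries")
    # Integer form of ceil(rank * 5 / n_countries) - 1, giving an index 0..4.
    quintile = (rank * 5 - 1) // n_countries
    return LEVELS[quintile]


# Hypothetical example with 10 countries: ranks 1-2 form the top 20%.
labels = [happiness_level(rank, 10) for rank in range(1, 11)]
print(labels)
```

Integer arithmetic is used for the bin index so that ranks sitting exactly on a 20% boundary are not affected by floating-point rounding.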

Source: https://worldhappiness.report/




To these, I've added:

* Air quality (mean PM2.5 exposure, from https://www.stateofglobalair.org/data/#/air/plot). Hong Kong and Kosovo are the only two entries with missing data. For each, I imputed the mean of its neighbors: Macedonia, Albania, Serbia, and Montenegro for Kosovo, and China for Hong Kong.

* Suicide rates (2016 data, from https://apps.who.int/gho/data/node.main.MHSUICIDE?lang=en)

* Number of terrorist attacks
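
The neighbor-mean imputation described for the air quality feature can be sketched as follows. The PM2.5 values are made-up placeholders, not the figures in the dataset, and `impute_from_neighbors` is a hypothetical helper written for this example.

```python
from statistics import mean

# Placeholder PM2.5 exposures; NOT the real figures from the dataset.
pm25 = {
    "Macedonia": 30.0,
    "Albania": 18.0,
    "Serbia": 25.0,
    "Montenegro": 21.0,
    "China": 50.0,
    "Kosovo": None,
    "Hong Kong": None,
}

# Neighbors used for each missing entry, as listed in the text above.
neighbors = {
    "Kosovo": ["Macedonia", "Albania", "Serbia", "Montenegro"],
    "Hong Kong": ["China"],
}


def impute_from_neighbors(values, neighbor_map):
    """Return a copy of `values` with missing entries set to the neighbor mean."""
    filled = dict(values)
    for country, nearby in neighbor_map.items():
        if filled[country] is None:
            filled[country] = mean(values[name] for name in nearby)
    return filled


filled = impute_from_neighbors(pm25, neighbors)
print(filled["Kosovo"], filled["Hong Kong"])
```

With a single neighbor, as for Hong Kong, the imputed value is simply that neighbor's value.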

# Mini-Hackathon In Class Tasks



1.   Build, save, and submit at least one Keras model.
2.   Build, save, and submit at least one Scikit-learn model.
3.   Seek advice through collaboration via Github:

*      Save notebook w/ best model to private repo
*      Invite a collaborator
*      Collaborator should submit at least two issues w/ suggestions for model improvement

4.   If time, improve model further!











# Import the data




In [None]:
# Colab Setup: 
# note that tabular preprocessors require scikit-learn>=0.24.0
# Newest Tensorflow 2 has some bugs for onnx conversion
!pip install scikit-learn --upgrade 
import os
os.environ['TF_KERAS'] = '1'
%tensorflow_version 1.x

In [54]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

data=pd.read_excel("final.xlsx", header=1)

data.head()

Unnamed: 0,Happiness_level,Country,GDP.per.capita,Social.support,Healthy.life.expectancy,Freedom.to.make.life.choices,Generosity,Perceptions.of.corruption,name,region,sub.region,Mean_Exposure_pm25,Suicide.rates.2016.,Terrorist_attacks
0,Very High,Finland,1.34,1.587,0.986,0.596,0.153,0.393,Finland,Europe,Northern Europe,5.57,15.9,57.333333
1,Very High,Denmark,1.383,1.573,0.996,0.592,0.252,0.41,Denmark,Europe,Northern Europe,9.79,12.8,2.0
2,Very High,Norway,1.488,1.582,1.028,0.603,0.271,0.341,Norway,Europe,Northern Europe,6.64,12.2,1.0
3,Very High,Iceland,1.38,1.624,1.026,0.591,0.354,0.118,Iceland,Europe,Northern Europe,5.7,14.0,1.0
4,Very High,Netherlands,1.396,1.522,0.999,0.557,0.322,0.298,Netherlands,Europe,Western Europe,12.0,12.6,1.0


In [55]:
# Clean up final data: drop the target and the identifier columns in one call

X=data.drop(columns=['Happiness_level', 'name', 'Country', 'sub.region'])

X


Unnamed: 0,GDP.per.capita,Social.support,Healthy.life.expectancy,Freedom.to.make.life.choices,Generosity,Perceptions.of.corruption,region,Mean_Exposure_pm25,Suicide.rates.2016.,Terrorist_attacks
0,1.340,1.587,0.986,0.596,0.153,0.393,Europe,5.57,15.9,57.333333
1,1.383,1.573,0.996,0.592,0.252,0.410,Europe,9.79,12.8,2.000000
2,1.488,1.582,1.028,0.603,0.271,0.341,Europe,6.64,12.2,1.000000
3,1.380,1.624,1.026,0.591,0.354,0.118,Europe,5.70,14.0,1.000000
4,1.396,1.522,0.999,0.557,0.322,0.298,Europe,12.00,12.6,1.000000
...,...,...,...,...,...,...,...,...,...,...
151,0.359,0.711,0.614,0.555,0.217,0.411,Africa,36.20,6.7,20.000000
152,0.476,0.885,0.499,0.417,0.276,0.147,Africa,24.70,5.4,21.000000
153,0.350,0.517,0.361,0.000,0.158,0.025,Asia,52.40,4.7,6023.000000
154,0.026,0.000,0.105,0.225,0.235,0.035,Africa,46.40,7.7,76.000000


# Build a model to predict happiness rankings

In [56]:
# Set up training and test data
from sklearn.model_selection import train_test_split

y=data['Happiness_level']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33, random_state=42)

print(X_train.shape)
print(y_train.shape)
print(X_train.columns.tolist())

(104, 10)
(104,)
['GDP.per.capita', 'Social.support', 'Healthy.life.expectancy', 'Freedom.to.make.life.choices', 'Generosity', 'Perceptions.of.corruption', 'region', 'Mean_Exposure_pm25', 'Suicide.rates.2016.', 'Terrorist_attacks']


In [None]:
y_test

## Preprocess data using Column Transformer and save fit preprocessor to ".pkl" file

In [57]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# We create the preprocessing pipelines for both numeric and categorical data.

# Every column except 'region' is numeric.
numeric_features=X.columns.tolist()
numeric_features.remove('region')

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['region']

# Replace missing values with the most frequent value, then one-hot encode.
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# final preprocessor object set up with ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])


# Fit the preprocessor on the training data only, so no test-set statistics leak into it
preprocess=preprocessor.fit(X_train) 


In [58]:
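# Write function to transform data with the fitted preprocessor.
# Note: this function reuses the name `preprocessor`, shadowing the ColumnTransformer
# defined above; the fitted transformer stays available as `preprocess`.

def preprocessor(data):
    preprocessed_data=preprocess.transform(data)
    return preprocessed_data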

In [59]:
# Check shape for keras input:
preprocessor(X_train).shape # pretty small dataset

(104, 14)
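
The 14 columns are the 9 numeric features plus the one-hot columns produced for `region`. As a dependency-free sketch of what the two pipelines do to a single column, the example below median-imputes and standardizes a numeric column and one-hot encodes a categorical one. The toy values and the helper names `impute_and_scale` and `one_hot` are assumptions for illustration; unlike the fitted pipeline, this toy version computes its statistics on the same data it transforms.

```python
from statistics import mean, median, pstdev


def impute_and_scale(column):
    """Median-impute missing values (None), then scale to zero mean and unit variance."""
    observed = [value for value in column if value is not None]
    fill = median(observed)
    filled = [fill if value is None else value for value in column]
    mu = mean(filled)
    sigma = pstdev(filled)  # population standard deviation, as StandardScaler uses
    return [(value - mu) / sigma for value in filled]


def one_hot(values, categories):
    """Encode each value as a 0/1 row; categories not in the list become all zeros."""
    return [[1.0 if value == c else 0.0 for c in categories] for value in values]


# Made-up toy data, not rows from the happiness dataset.
gdp = [1.0, None, 3.0, 2.0]
region = ["Europe", "Asia", "Africa", "Antarctica"]

scaled = impute_and_scale(gdp)
encoded = one_hot(region, categories=["Africa", "Asia", "Europe"])
print(scaled)
print(encoded)
```

The all-zero row for the unseen category mirrors the behavior requested with `handle_unknown='ignore'`.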

In [60]:
# Check shape for keras output:
pd.get_dummies(y_train).shape

(104, 5)
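
The target has five classes, so its one-hot matrix has five columns. Below is a dependency-free sketch of encoding labels with one column per class in sorted order, then decoding a row of scores back to a label by taking the highest-scoring column. The label names other than "Very High" and "Very Low" are assumptions, as are the helper names.

```python
def one_hot_labels(labels):
    """Return (classes, matrix) with one column per class, in sorted order."""
    classes = sorted(set(labels))
    matrix = [[1 if label == c else 0 for c in classes] for label in labels]
    return classes, matrix


def decode(row, classes):
    """Map a row of scores (one-hot or probabilities) to the highest-scoring class."""
    best_index = max(range(len(row)), key=row.__getitem__)
    return classes[best_index]


# Hypothetical labels; only "Very High" and "Very Low" are named in this report.
y = ["Very High", "Low", "Average", "Very Low", "High", "Low"]
classes, Y = one_hot_labels(y)

print(classes)
print(decode([0.1, 0.2, 0.05, 0.6, 0.05], classes))
```

Decoding only works if the class order used for encoding is kept, which is why the column labels of the encoded training target are reused when predictions are mapped back to labels.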

In [None]:
# Now we will automatically save out the best model.

# Use model checkpoints, validated on the test data, to save the best model seen during fitting.
# Caveat: because the checkpoint is selected on the test set, the reported test accuracy is optimistic.
# Here is a good source for further reading: https://machinelearningmastery.com/how-to-stop-training-deep-neural-networks-at-the-right-time-using-early-stopping/

# Model with best L2 regularization run for three times the epochs
from keras.models import Sequential, load_model
from keras.layers import Dense
from keras.regularizers import l2
from keras.callbacks import EarlyStopping, ModelCheckpoint

model = Sequential()
model.add(Dense(128, input_dim=14, activation='relu', kernel_regularizer=l2(0.01), bias_regularizer=l2(0.01)))
model.add(Dense(64, activation='relu', kernel_regularizer=l2(0.07), bias_regularizer=l2(0.07)))
model.add(Dense(32, activation='relu', kernel_regularizer=l2(0.01), bias_regularizer=l2(0.01)))
model.add(Dense(16, activation='relu', kernel_regularizer=l2(0.01), bias_regularizer=l2(0.01)))

model.add(Dense(5, activation='softmax')) 
                                            
# Compile model
model.compile(loss='categorical_crossentropy', optimizer='adagrad', metrics=['accuracy'])

# Save the model with the highest val_accuracy; stop early if val_loss has not improved for 200 consecutive epochs
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=200)
mc = ModelCheckpoint('best_model.h5', monitor='val_accuracy',mode='max', verbose=1, save_best_only=True) # evaluating val_acc maximization

# Fitting the NN to the Training set
model.fit(preprocessor(X_train), pd.get_dummies(y_train), batch_size=50,
          validation_data=(preprocessor(X_test), pd.get_dummies(y_test)), epochs=700, verbose=1, callbacks=[es,mc])
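
The two callbacks monitor different quantities: early stopping watches `val_loss`, while the checkpoint keeps the weights with the best `val_accuracy`. The sketch below replays a made-up metric history through both rules in plain Python to show how they interact. It is a simplified model of the callbacks' logic, not the Keras implementation, and `simulate_callbacks` is a hypothetical helper.

```python
def simulate_callbacks(val_losses, val_accuracies, patience):
    """Replay per-epoch validation metrics through the two callback rules.

    Returns (stop_epoch, checkpoint_epoch), both 0-based.
    """
    best_loss = float("inf")
    best_acc = float("-inf")
    epochs_without_improvement = 0
    checkpoint_epoch = None
    stop_epoch = len(val_losses) - 1  # default: every epoch was run

    for epoch, (loss, acc) in enumerate(zip(val_losses, val_accuracies)):
        # Checkpoint rule: keep the weights from the best val_accuracy so far.
        if acc > best_acc:
            best_acc = acc
            checkpoint_epoch = epoch

        # Early-stopping rule: count epochs since val_loss last improved.
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                stop_epoch = epoch
                break

    return stop_epoch, checkpoint_epoch


# Made-up metric history, with a small patience so the stop is visible.
losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9]
accs = [0.40, 0.50, 0.55, 0.60, 0.58, 0.57, 0.50]
result = simulate_callbacks(losses, accs, patience=3)
print(result)
```

In this toy history the saved checkpoint comes from an epoch after the lowest validation loss, which shows that the best-accuracy and best-loss epochs need not coincide.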



In [71]:
model = load_model('best_model.h5')

#Now we have automated model building such that we can choose the best model evaluated on test data 
#throughout the model building process!


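# predict_classes() returns the predicted class index for multi-class data.
# Note: it is available in the TensorFlow 1.x Keras API used here but was removed from later
# TensorFlow 2 releases, where np.argmax(model.predict(...), axis=1) gives the same indices.

prediction_index=model.predict_classes(preprocessor(X_test))

# Now let's run some code to get keras to return the label rather than the index...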

# get labels from one hot encoded y_train data
labels=pd.get_dummies(y_train).columns

# Iterate through all predicted indices using map method

predicted_labels=list(map(lambda x: labels[x], prediction_index))


# load model_eval_metrics() function into our session to calculate metrics

from math import sqrt

import pandas as pd
from sklearn.metrics import (accuracy_score, f1_score, precision_score, recall_score,
                             mean_squared_error, mean_absolute_error, r2_score)

def model_eval_metrics(y_true, y_pred, classification="TRUE"):
    # Metrics that do not apply to the chosen task are reported as 0.
    accuracy_eval = f1_score_eval = precision_eval = recall_eval = 0
    mse_eval = rmse_eval = mae_eval = r2_eval = 0
    if classification == "TRUE":
        # Classification: macro-average so every class counts equally.
        accuracy_eval = accuracy_score(y_true, y_pred)
        f1_score_eval = f1_score(y_true, y_pred, average="macro", zero_division=0)
        precision_eval = precision_score(y_true, y_pred, average="macro", zero_division=0)
        recall_eval = recall_score(y_true, y_pred, average="macro", zero_division=0)
    else:
        # Regression metrics.
        mse_eval = mean_squared_error(y_true, y_pred)
        rmse_eval = sqrt(mse_eval)
        mae_eval = mean_absolute_error(y_true, y_pred)
        r2_eval = r2_score(y_true, y_pred)
    metricdata = {'accuracy': [accuracy_eval], 'f1_score': [f1_score_eval], 'precision': [precision_eval], 'recall': [recall_eval], 'mse': [mse_eval], 'rmse': [rmse_eval], 'mae': [mae_eval], 'r2': [r2_eval]}
    return pd.DataFrame.from_dict(metricdata)

# add metrics to submittable object
modelevalobject=model_eval_metrics( y_test,predicted_labels,classification="TRUE")

modelevalobject

Unnamed: 0,accuracy,f1_score,precision,recall,mse,rmse,mae,r2
0,0.673077,0.659399,0.694127,0.677121,0,0,0,0
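
To make the macro-averaged columns concrete, here is a dependency-free sketch that computes accuracy and macro precision, recall, and F1 by hand on a small made-up example. It is meant to illustrate the definitions behind `average="macro"` and `zero_division=0`, not to replace the scikit-learn functions; the labels and the helper `macro_metrics` are assumptions.

```python
def macro_metrics(y_true, y_pred):
    """Compute accuracy and macro-averaged precision, recall, and F1.

    An undefined ratio (zero denominator) is scored as 0, mirroring zero_division=0.
    """
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1s.append(2 * precision * recall / denom if denom else 0.0)
        precisions.append(precision)
        recalls.append(recall)
    n = len(classes)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": sum(precisions) / n,  # unweighted mean over classes
        "recall": sum(recalls) / n,
        "f1_score": sum(f1s) / n,
    }


# Made-up labels for illustration only.
y_true = ["High", "High", "Low", "Low", "Average", "Average"]
y_pred = ["High", "Low", "Low", "Low", "Average", "High"]
metrics = macro_metrics(y_true, y_pred)
print(metrics)
```

Because each class contributes equally to the macro average, macro F1 can differ noticeably from accuracy when some classes are predicted much better than others.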


# Save preprocessor, save Keras model to ONNX file, generate predictions, and submit to leaderboard

In [None]:
#install aimodelshare library
! pip install aimodelshare --upgrade --extra-index-url https://test.pypi.org/simple/ 

In [14]:
#Save preprocessor function to local "preprocessor.zip" file for leaderboard submission
import aimodelshare as ai
ai.export_preprocessor(preprocessor,"")

In [None]:
#test your preprocessor
prep=ai.import_preprocessor("preprocessor.zip")
prep(X_test)

In [72]:
#Save keras model object to onnx file

from aimodelshare.aimsonnx import model_to_onnx
# transform Keras model to ONNX
onnx_model_keras= model_to_onnx(model, framework='keras', 
                                   transfer_learning=False,
                                   deep_learning=True,
                                   task_type='classification')

# Save model to local .onnx file
with open("onnx_model_keras.onnx", "wb") as f:
    f.write(onnx_model_keras.SerializeToString())

The ONNX operator number change on the optimization: 39 -> 15


### To submit a model, sign up for a username and password at:
[AI Model Share Initiative Site](http://mlsite5aimodelshare-dev.s3-website.us-east-2.amazonaws.com/)

# Set up necessary arguments for model submission using aimodelshare python library.


In [11]:
import pickle

In [16]:
#aimodelshare username and password
username="username"
password="password"

# load submit model creds (only gives access to s3 bucket)
# Load from pkl file
with open("aws_creds_worldhappiness.pkl", 'rb') as file:
    aws_key,aws_password,region = pickle.load(file)

token=ai.aws.get_aws_token(username, password)
awscreds=ai.aws.get_aws_client(aws_key=aws_key, aws_secret=aws_password, aws_region=region)

In [74]:
# Submit the model to the leaderboard with submit_model()
ai.submit_model("onnx_model_keras.onnx",
                "https://z69mxrxdz5.execute-api.us-east-1.amazonaws.com/prod/m",
                token,awscreds,prediction_submission=predicted_labels,
                preprocessor="preprocessor.zip")

True

# Now you can check the leaderboard!

In [77]:
# Check leaderboard
# Note: this reassigns `data`, which previously held the happiness dataset.
data=ai.get_leaderboard("https://z69mxrxdz5.execute-api.us-east-1.amazonaws.com/prod/m",
                token,awscreds,verbose=2)

# Drop duplicate model submissions (rows whose first eight columns match an earlier row)
data=data.loc[~data.iloc[:,0:8].duplicated(),:]
data.head()


Unnamed: 0,accuracy,f1_score,precision,recall,ml_framework,transfer_learning,deep_learning,model_type,depth,num_params,batchnormalization_layers,dense_layers,dropout_layers,loss,optimizer,model_config,username,timestamp,version
0,0.673077,0.659399,0.694127,0.677121,keras,,True,Sequential,5.0,12869.0,,5.0,,str,Adagrad,"{'name': 'sequential_28', 'layers': [{'class_n...",dv2438,2021-02-15 22:09:46.058993,181
1,0.653846,0.653063,0.702656,0.671061,keras,,True,Sequential,5.0,12741.0,,5.0,,str,Adagrad,"{'name': 'sequential_34', 'layers': [{'class_n...",dv2438,2021-02-14 16:18:09.182759,164
2,0.596154,0.581563,0.653571,0.605455,keras,,True,Sequential,4.0,9477.0,,4.0,,str,SGD,"{'name': 'sequential_1', 'layers': [{'class_na...",mikedparrott,2021-02-08 22:15:47.939712,146
3,0.538462,0.532364,0.593922,0.550303,sklearn,,,SVC,,110.0,,,,,,"{'C': 10, 'break_ties': False, 'cache_size': 2...",prajseth,2021-02-02 00:57:22.077630,67
10,0.519231,0.524285,0.63,0.54,keras,,True,Sequential,5.0,15941.0,,5.0,,str,Adam,"{'name': 'sequential_45', 'layers': [{'class_n...",dv2438,2021-02-08 23:26:40.070375,158
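
The de-duplication step keeps the first row for each distinct combination of values in the leading columns, which is likely why the index jumps from 3 to 10 in the output above. A plain-Python sketch of the same keep-first rule, using made-up rows and a hypothetical helper `drop_duplicate_rows`:

```python
def drop_duplicate_rows(rows, n_key_columns):
    """Keep the first row for each distinct combination of the leading columns."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[:n_key_columns])
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Made-up leaderboard rows: (accuracy, f1_score, framework, username).
rows = [
    (0.67, 0.66, "keras", "alice"),
    (0.65, 0.65, "keras", "alice"),
    (0.67, 0.66, "keras", "bob"),      # same first three columns as row 0
    (0.54, 0.53, "sklearn", "carol"),
]

unique_rows = drop_duplicate_rows(rows, n_key_columns=3)
print(unique_rows)
```

Columns outside the key, such as the username here, are ignored when deciding whether two rows are duplicates.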


In [80]:
ai.stylize_leaderboard(data, category="classification")

Unnamed: 0,accuracy,f1_score,precision,recall,ml_framework,transfer_learning,deep_learning,model_type,depth,num_params,batchnormalization_layers,dense_layers,dropout_layers,loss,optimizer,model_config,username,version
0,67.31%,65.94%,69.41%,67.71%,keras,,True,Sequential,5.0,12869.0,,5.0,,str,Adagrad,"{'name': 'sequential_28', 'lay...",dv2438,181
1,65.38%,65.31%,70.27%,67.11%,keras,,True,Sequential,5.0,12741.0,,5.0,,str,Adagrad,"{'name': 'sequential_34', 'lay...",dv2438,164
2,59.62%,58.16%,65.36%,60.55%,keras,,True,Sequential,4.0,9477.0,,4.0,,str,SGD,"{'name': 'sequential_1', 'laye...",mikedparrott,146
3,53.85%,53.24%,59.39%,55.03%,sklearn,,,SVC,,110.0,,,,,,"{'C': 10, 'break_ties': False,...",prajseth,67
10,51.92%,52.43%,63.00%,54.00%,keras,,True,Sequential,5.0,15941.0,,5.0,,str,Adam,"{'name': 'sequential_45', 'lay...",dv2438,158
11,51.92%,52.43%,63.00%,54.00%,sklearn,,,LogisticRegression,,45.0,,,,,lbfgs,"{'C': 53.37, 'class_weight': N...",dv2438,157
12,51.92%,52.34%,56.79%,53.33%,keras,True,True,Sequential,4.0,35205.0,,4.0,,str,SGD,"{'name': 'sequential_5', 'laye...",prajseth,15
13,51.92%,51.99%,55.57%,53.00%,keras,True,True,Sequential,4.0,135941.0,,4.0,,str,SGD,"{'name': 'sequential_1', 'laye...",prajseth,124
14,50.00%,50.49%,55.78%,51.48%,keras,True,True,Sequential,4.0,135941.0,,4.0,,str,SGD,"{'name': 'sequential_9', 'laye...",prajseth,34
15,50.00%,50.50%,55.44%,51.52%,keras,True,True,Sequential,5.0,201733.0,,5.0,,str,SGD,"{'name': 'sequential_8', 'laye...",prajseth,30
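
The `num_params` value of 12869 reported for the top entry is consistent with the architecture defined earlier: 14 inputs followed by dense layers of 128, 64, 32, 16, and 5 units. A short worked check, where `dense_param_count` is a helper written for this example:

```python
def dense_param_count(input_dim, layer_units):
    """Total trainable parameters of a stack of fully connected layers."""
    total = 0
    fan_in = input_dim
    for units in layer_units:
        total += fan_in * units + units  # weight matrix plus bias vector
        fan_in = units
    return total


# 14 preprocessed features, then the five Dense layers of the Keras model.
n_params = dense_param_count(14, [128, 64, 32, 16, 5])
print(n_params)
```

Layer by layer this is 1920 + 8256 + 2080 + 528 + 85 parameters, which is large relative to the 104 training rows.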


In [83]:
data.iloc[0,15]

"{'name': 'sequential_28', 'layers': [{'class_name': 'Dense', 'config': {'name': 'dense_144', 'trainable': True, 'batch_input_shape': (None, 14), 'dtype': 'float32', 'units': 128, 'activation': 'relu', 'use_bias': True, 'kernel_initializer': {'class_name': 'VarianceScaling', 'config': {'scale': 1.0, 'mode': 'fan_avg', 'distribution': 'uniform', 'seed': None}}, 'bias_initializer': {'class_name': 'Zeros', 'config': {}}, 'kernel_regularizer': {'class_name': 'L1L2', 'config': {'l1': 0.0, 'l2': 0.009999999776482582}}, 'bias_regularizer': {'class_name': 'L1L2', 'config': {'l1': 0.0, 'l2': 0.009999999776482582}}, 'activity_regularizer': None, 'kernel_constraint': None, 'bias_constraint': None}}, {'class_name': 'Dense', 'config': {'name': 'dense_145', 'trainable': True, 'dtype': 'float32', 'units': 64, 'activation': 'relu', 'use_bias': True, 'kernel_initializer': {'class_name': 'VarianceScaling', 'config': {'scale': 1.0, 'mode': 'fan_avg', 'distribution': 'uniform', 'seed': None}}, 'bias_initi