# Hyperparameter Testing

## Objective

The purpose of this notebook is to run preliminary hyperparameter tests for the project.  

## Import libraries

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn import tree

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## Load dataset

In [2]:
#- Define data file
file = '../dataset/ObesityDataSet_raw_and_data_sinthetic.csv'

In [3]:
#- Load dataset to a pandas dataframe for analysis
ds = pd.read_csv(file)

## Preprocessing

In [4]:
# Transformation of binary data
ds["Gender"] = ds.Gender.apply(lambda s: 1 if s == "Female" else 0)
ds["family_history_with_overweight"] = ds.family_history_with_overweight.apply(lambda s: 1 if s == "yes" else 0)
ds["FAVC"] = ds.FAVC.apply(lambda s: 1 if s == "yes" else 0)
ds["SMOKE"] = ds.SMOKE.apply(lambda s: 1 if s == "yes" else 0)
ds["SCC"] = ds.SCC.apply(lambda s: 1 if s == "yes" else 0)

In [5]:
# One hot encoding for categorical data
CAEC_list = pd.get_dummies(ds.CAEC, prefix="CAEC")
ds.drop("CAEC", inplace=True, axis=1)
ds = ds.join(CAEC_list)

CALC_list = pd.get_dummies(ds.CALC, prefix="CALC")
ds.drop("CALC", inplace=True, axis=1)
ds = ds.join(CALC_list)

MTRANS_list = pd.get_dummies(ds.MTRANS, prefix="MTRANS")
ds.drop("MTRANS", inplace=True, axis=1)
ds = ds.join(MTRANS_list)
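
As a possible shorthand for the encode/drop/join pattern above, `pd.get_dummies` also accepts a `columns` argument that encodes several columns in one call and removes the originals. The sketch below uses a small made-up frame (`toy` and its values are illustrative assumptions, not rows from the dataset).

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame({
    "Age": [21, 23, 27],
    "CAEC": ["Sometimes", "Frequently", "Sometimes"],
    "CALC": ["no", "Sometimes", "no"],
})

# Encode both categorical columns at once; each new column is named
# "<original column>_<category>" and the original columns are dropped.
encoded = pd.get_dummies(toy, columns=["CAEC", "CALC"])
print(encoded.columns.tolist())
```

The resulting column names match those produced with `prefix="CAEC"` and `prefix="CALC"`, so downstream cells should not need changes.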

In [6]:
# Transformation of target feature through a dictionary
obesity = {"Insufficient_Weight":1, "Normal_Weight":2, "Overweight_Level_I":3, "Overweight_Level_II":4, "Obesity_Type_I":5, "Obesity_Type_II":6, "Obesity_Type_III":7}
ds["NObeyesdad"] = ds.NObeyesdad.map(obesity)
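
One caveat with a dictionary mapping is that `Series.map` silently returns `NaN` for any label that is missing from the dictionary. A minimal sanity check might look like the sketch below; the `labels` series and its misspelled last entry are made up for illustration.

```python
import pandas as pd

obesity = {"Insufficient_Weight": 1, "Normal_Weight": 2,
           "Overweight_Level_I": 3, "Overweight_Level_II": 4,
           "Obesity_Type_I": 5, "Obesity_Type_II": 6, "Obesity_Type_III": 7}

# Hypothetical labels; the last one is deliberately not a key of the dictionary.
labels = pd.Series(["Normal_Weight", "Obesity_Type_I", "Overweight"])

encoded_labels = labels.map(obesity)

# Labels that could not be mapped show up as NaN after map().
unmapped = labels[encoded_labels.isna()].unique().tolist()
print("Unmapped labels:", unmapped)
```

An empty list would indicate that every label in the column was covered by the dictionary.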

## Obtain Train and Test datasets

In [7]:
# Obtain train and test datasets
X_train, X_test, y_train, y_test = train_test_split(ds.drop('NObeyesdad',axis=1), 
                                                    ds['NObeyesdad'],
                                                    test_size=0.30, 
                                                    random_state=0)

In [8]:
# Standard scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
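
The scaler is fitted on the training set only, and the test set is transformed with the training statistics. The following toy sketch (the arrays are made-up values, not data from the notebook) shows what that means in practice.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: three training rows and one test row.
X_train_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test_toy = np.array([[2.0, 40.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_toy)  # learns mean/std from train
X_test_scaled = scaler.transform(X_test_toy)        # reuses the train mean/std

print("Training means:", scaler.mean_)
print("Scaled test row:", X_test_scaled)
```

The scaled training columns have zero mean, while the test row is only expressed relative to the training distribution and need not be centred.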

## Decision Trees

The hyperparameters to test are:
* Maximum depth of the tree (max_depth).  
* Minimum number of samples required to split (min_samples_split).  

**Maximum depth.**  
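This limits how many levels the tree may grow.  
The default value is *None*, which lets nodes expand until all leaves are pure or contain fewer than min_samples_split samples.  
A high value tends to cause overfitting; a low value tends to cause underfitting.  
The values selected for the pre-test were 5, 10, 50 and 100, plus the default *None*, to probe both underfitting and overfitting.  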

**Minimum number of samples.**  
This is the minimum number of samples required to split an internal node.  
The default value is 2.  
The values selected for the pre-test were 2, 5, 10, 50 and 100, to see the effect of requiring too few and too many samples.  

**Note:** *random_state* was set to 0 to obtain deterministic behaviour during fitting.  
This parameter controls the randomness of the estimator. The features are always randomly permuted at each split, so with the default value *None* the best split found may vary across runs, for example when several candidate splits yield the same improvement.  
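
The effect of the two hyperparameters described above can be sketched on synthetic data. Everything in this block is an illustrative assumption: the data comes from `make_classification` rather than the obesity dataset, and the `summarize` helper and the chosen values are hypothetical. It reports tree size together with train and test accuracy, since a large gap between the two is a common sign of overfitting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the obesity dataset).
# flip_y adds label noise so that a fully grown tree has something to overfit.
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     n_informative=5, flip_y=0.1,
                                     random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0)


def summarize(**params):
    """Fit one tree and return its size and its train/test accuracy."""
    clf = DecisionTreeClassifier(random_state=0, **params)
    clf.fit(Xd_train, yd_train)
    return {
        "depth": clf.get_depth(),
        "leaves": clf.get_n_leaves(),
        "train": clf.score(Xd_train, yd_train),
        "test": clf.score(Xd_test, yd_test),
    }


# Vary one hyperparameter at a time, leaving the other at its default.
depth_results = {d: summarize(max_depth=d) for d in (2, 5, None)}
split_results = {s: summarize(min_samples_split=s) for s in (2, 50, 200)}

for name, results in (("max_depth", depth_results),
                      ("min_samples_split", split_results)):
    for value, r in results.items():
        print(f"{name}={value}: {r}")
```

Comparing the `train` and `test` entries per setting gives a rough picture of where the tree is too constrained and where it starts memorising the training data.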


### Maximum Depth

In [9]:
# Train models
model_none = tree.DecisionTreeClassifier(max_depth = None, random_state = 0)
model_none.fit(X_train, y_train)
model_5 = tree.DecisionTreeClassifier(max_depth = 5, random_state = 0)
model_5.fit(X_train, y_train)
model_10 = tree.DecisionTreeClassifier(max_depth = 10, random_state = 0)
model_10.fit(X_train, y_train)
model_50 = tree.DecisionTreeClassifier(max_depth = 50, random_state = 0)
model_50.fit(X_train, y_train)
model_100 = tree.DecisionTreeClassifier(max_depth = 100, random_state = 0)
model_100.fit(X_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=100, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=0, splitter='best')

In [10]:
# Test models
y_pred_none = model_none.predict(X_test)
accuracy_none = accuracy_score(y_test, y_pred_none)
y_pred_5 = model_5.predict(X_test)
accuracy_5 = accuracy_score(y_test, y_pred_5)
y_pred_10 = model_10.predict(X_test)
accuracy_10 = accuracy_score(y_test, y_pred_10)
y_pred_50 = model_50.predict(X_test)
accuracy_50 = accuracy_score(y_test, y_pred_50)
y_pred_100 = model_100.predict(X_test)
accuracy_100 = accuracy_score(y_test, y_pred_100)

#### Model Accuracy Comparison

In [11]:
print("None:  ", accuracy_none)
print("5:     ", accuracy_5)
print("10:    ", accuracy_10)
print("50:    ", accuracy_50)
print("100:   ", accuracy_100)

None:   0.9321766561514195
5:      0.832807570977918
10:     0.9337539432176656
50:     0.9321766561514195
100:    0.9321766561514195


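With this train/test split, a maximum depth of 10 gives the highest test accuracy (0.9338). The margin over the unrestricted tree and over depths 50 and 100 (all 0.9322) is small, and their identical scores suggest the tree stops growing before it reaches a depth of 50. A depth of 5 clearly underfits (0.8328).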
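
Because the accuracies of several depths lie very close together on a single split, k-fold cross-validation is one way to check whether such differences are stable. The sketch below is only an illustration on synthetic data from `make_classification` (an assumption; in the notebook the encoded features and the NObeyesdad target would be used instead), and the depths and the 5 folds are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the obesity dataset).
X_cv, y_cv = make_classification(n_samples=600, n_features=10,
                                 n_informative=5, n_classes=3,
                                 random_state=0)

cv_summary = {}
for depth in (5, 10, None):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    # One accuracy value per fold; the spread hints at how noisy a
    # single-split comparison can be.
    scores = cross_val_score(clf, X_cv, y_cv, cv=5, scoring="accuracy")
    cv_summary[depth] = (scores.mean(), scores.std())
    print(f"max_depth={depth}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the standard deviation across folds is larger than the gap between two settings, the single-split ranking of those settings should be read with caution.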

### Minimum Number of Samples

In [12]:
# Train models
model_2 = tree.DecisionTreeClassifier(min_samples_split = 2, random_state = 0)
model_2.fit(X_train, y_train)
model_5 = tree.DecisionTreeClassifier(min_samples_split = 5, random_state = 0)
model_5.fit(X_train, y_train)
model_10 = tree.DecisionTreeClassifier(min_samples_split = 10, random_state = 0)
model_10.fit(X_train, y_train)
model_50 = tree.DecisionTreeClassifier(min_samples_split = 50, random_state = 0)
model_50.fit(X_train, y_train)
model_100 = tree.DecisionTreeClassifier(min_samples_split = 100, random_state = 0)
model_100.fit(X_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=100,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=0, splitter='best')

In [13]:
# Test models
y_pred_2 = model_2.predict(X_test)
accuracy_2 = accuracy_score(y_test, y_pred_2)
y_pred_5 = model_5.predict(X_test)
accuracy_5 = accuracy_score(y_test, y_pred_5)
y_pred_10 = model_10.predict(X_test)
accuracy_10 = accuracy_score(y_test, y_pred_10)
y_pred_50 = model_50.predict(X_test)
accuracy_50 = accuracy_score(y_test, y_pred_50)
y_pred_100 = model_100.predict(X_test)
accuracy_100 = accuracy_score(y_test, y_pred_100)

#### Model Accuracy Comparison

In [14]:
print("2:   ", accuracy_2)
print("5:   ", accuracy_5)
print("10:  ", accuracy_10)
print("50:  ", accuracy_50)
print("100: ", accuracy_100)

2:    0.9321766561514195
5:    0.9274447949526814
10:   0.9274447949526814
50:   0.8943217665615142
100:  0.8233438485804416


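With this train/test split, the default minimum of 2 samples per split gives the highest test accuracy (0.9322). Accuracy does not improve for larger thresholds and falls to 0.8233 at 100.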
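
The two hyperparameters were tested one at a time, each with the other left at its default. A joint search over both could be sketched with `GridSearchCV`, as below. This is a hedged illustration on synthetic data from `make_classification` (an assumption, not the obesity dataset); the grid simply reuses the values tried in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the obesity dataset).
X_gs, y_gs = make_classification(n_samples=500, n_features=10,
                                 n_informative=5, n_classes=3,
                                 random_state=0)

# Same candidate values as in the single-parameter tests.
param_grid = {
    "max_depth": [5, 10, 50, 100, None],
    "min_samples_split": [2, 5, 10, 50, 100],
}

# Every combination is scored with 5-fold cross-validation.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_gs, y_gs)

print("Best parameters:", search.best_params_)
print("Best mean CV accuracy:", round(search.best_score_, 3))
```

A joint search can reveal interactions that one-at-a-time tests miss, for example a shallower tree that works best only together with a higher split threshold.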