# MODEL CREATION AND TRAINING

In this notebook, we create the model that the system will use. We explore several ML algorithms, build a model with each, compare their performance metrics, and select the best-performing model for use in our system.

Since our problem is a classification one, all the candidate algorithms are classifiers:
Logistic Regression,
Random Forest,
K-nearest neighbor,
Naive Bayes Classifier,
Voting Classifier (ensemble algorithm), and
Stacking Classifier (ensemble algorithm)

We will use k-fold cross validation to examine the performance of these models.
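
As a rough illustration of the k-fold procedure, the sketch below scores two candidate classifiers on the same five stratified folds. It uses synthetic data from `make_classification` as a stand-in for the obesity dataset (the 8 features and 6 classes are assumptions chosen to resemble it), and the names `X_demo`, `y_demo`, `candidates`, and `cv_means` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real data (assumption: 8 features, 6 classes)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=6, n_redundant=0,
    n_classes=6, n_clusters_per_class=1, random_state=0,
)

# One seeded splitter, so every model is scored on identical folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

cv_means = {}
for name, estimator in candidates.items():
    # cross_val_score clones the estimator and refits it on each training fold
    scores = cross_val_score(estimator, X_demo, y_demo, scoring="accuracy", cv=cv)
    cv_means[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Sharing one seeded splitter keeps the comparison between models fair, because each model sees exactly the same train/validation partitions.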

In [1]:
# Importing libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

In [2]:
# Loading the data

data = pd.read_csv("obesity_data.csv")

In [5]:
# Splitting the dataset into the features, and the target variable

x = data.drop('Obesity Status', axis=1)
y = data['Obesity Status']

## 1. Logistic Regression 

In [23]:
# Importing important libraries

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Creating the model and splitting the dataset
model = LogisticRegression(random_state=0)

x_train,x_test,y_train,y_test=train_test_split(x,y,train_size=0.75,test_size=0.25)

log_reg = model.fit(x_train, y_train)

### Using k-fold cross validation to assess the model...

In [24]:
from sklearn.model_selection import cross_val_score

score = cross_val_score(log_reg, x_train, y_train, scoring = 'accuracy', cv = 5)
print(np.mean(score))

0.8754716981132076


In [27]:
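# 5-fold cross validation on the test split.
# Note: cross_val_score refits a fresh clone of the estimator on each fold,
# so this does not score the log_reg model fitted above.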

test_score  = cross_val_score(log_reg, x_test, y_test, scoring = 'accuracy', cv = 5)
print(np.mean(test_score))

0.8733333333333334
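
For comparison, a minimal sketch of scoring an already fitted model on a held-out split is given below. It relies on the estimator's own `score` method rather than on `cross_val_score`. The data is synthetic (an assumption, since the real dataset is not reproduced here) and the names `X_tr`, `X_te`, `clf`, and `holdout_accuracy` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption: 8 features, 6 classes)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=6, n_redundant=0,
    n_classes=6, n_clusters_per_class=1, random_state=0,
)

# Seeded, stratified split so the result is reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# score() evaluates the fitted model itself; nothing is refitted here
holdout_accuracy = clf.score(X_te, y_te)
print(f"hold-out accuracy: {holdout_accuracy:.3f}")
```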


## 2. Naive Bayes Classifier

In [26]:
# Importing important libraries

from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

# Creating the model and splitting the dataset

nb = GaussianNB()
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=42)

nb_model = nb.fit(x_train, y_train)

In [28]:
# Assessment via k-fold cross validation
# On training set
from sklearn.model_selection import cross_val_score
score2 = cross_val_score(nb_model, x_train, y_train, scoring = 'accuracy', cv = 5)
print(np.mean(score2))

0.7137218045112783


In [29]:
# On test set (cross_val_score refits a clone on folds of the test split)
test_score2  = cross_val_score(nb_model, x_test, y_test, scoring = 'accuracy', cv = 5)
print(np.mean(test_score2))


0.5361904761904762


## 3. K-nearest neighbor

In [30]:
# Importing important libraries

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Creating the model and splitting the dataset

knn = KNeighborsClassifier(n_neighbors=5)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20)

knn_model = knn.fit(x_train, y_train)

### Using k-fold cross validation to assess...

In [31]:
from sklearn.model_selection import cross_val_score
score3 = cross_val_score(knn_model, x_train, y_train, scoring = 'accuracy', cv = 5)
print(np.mean(score3))

0.8796992481203008


In [32]:
# Scores on test set (cross_val_score refits a clone on folds of the test split)

test_score3  = cross_val_score(knn_model, x_test, y_test, scoring = 'accuracy', cv = 5)
print(np.mean(test_score3))

0.8295238095238094


## 4. Random Forest Classifier

In [40]:
# Importing important libraries

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Creating the model and splitting the dataset

rfc = RandomForestClassifier(n_estimators=50, random_state=42, max_depth=4)
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.80, test_size=0.20, stratify=y)

rfc_model = rfc.fit(x_train, y_train)

### Using k-fold cross validation to assess...

In [41]:
from sklearn.model_selection import cross_val_score
score4 = cross_val_score(rfc_model, x_train, y_train, scoring = 'accuracy', cv = 5)
print(np.mean(score4))

0.7843984962406014


In [42]:
# On test data (cross_val_score refits a clone on folds of the test split)

test_score4  = cross_val_score(rfc_model, x_test, y_test, scoring = 'accuracy', cv = 5)
print(np.mean(test_score4))

0.6752380952380952


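In these runs, increasing 'n_estimators' in the Random Forest increased the computation time of the cross validation, while the accuracy scores decreased. The growth in computation time is expected, since each additional tree must be trained. The drop in accuracy is an observation from these experiments rather than a general rule: the train/test split above is not seeded, so part of the variation between runs may come from the split itself.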
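
One way to check this observation under controlled conditions is to sweep 'n_estimators' with the data, folds, and forest seed all held fixed. The sketch below does that on synthetic data (an assumption, as are the names `X_demo`, `y_demo`, and `sweep` and the chosen grid of tree counts), and it makes no claim about what the sweep shows on the real dataset.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the real data (assumption: 8 features, 6 classes)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=6, n_redundant=0,
    n_classes=6, n_clusters_per_class=1, random_state=0,
)

# Fixed folds: only n_estimators changes between runs
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

sweep = []
for n in (10, 50, 100):
    forest = RandomForestClassifier(n_estimators=n, max_depth=4, random_state=42)
    start = time.perf_counter()
    acc = cross_val_score(forest, X_demo, y_demo, scoring="accuracy", cv=cv).mean()
    elapsed = time.perf_counter() - start
    sweep.append((n, acc, elapsed))
    print(f"n_estimators={n:>3}  accuracy={acc:.3f}  time={elapsed:.2f}s")
```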

## Onto the ensemble classifiers...

## 5. Voting Classifier

In [43]:
# Import necessary libraries

from sklearn import model_selection
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.ensemble import VotingClassifier

# Splitting the data set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test=train_test_split(x, y, stratify=y, test_size=0.20, random_state=42)

# Creating the submodels
estimators = []
model1 = LogisticRegression()
estimators.append(('Logistic',model1))
model2 = DecisionTreeClassifier()
estimators.append(('cart',model2))
model3 = SVC()
estimators.append(('svm',model3))


# Define the model
ensemble = VotingClassifier(estimators)

# Fit the model
ensemble.fit(x_train,y_train)

# Predictions 
y_predict = ensemble.predict(x_test)

# Accuracy computation
from sklearn.metrics import accuracy_score
model_score = accuracy_score(y_test,y_predict)
print("{:.2f}%".format(model_score*100))

91.55%


In [44]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

report = classification_report(y_test, y_predict)
print(report)

              precision    recall  f1-score   support

           0       0.91      0.91      0.91        11
           1       0.92      1.00      0.96        12
           2       0.80      1.00      0.89        12
           3       1.00      0.75      0.86        12
           4       0.91      0.83      0.87        12
           5       1.00      1.00      1.00        12

    accuracy                           0.92        71
   macro avg       0.92      0.92      0.91        71
weighted avg       0.92      0.92      0.91        71
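
The cell above imports `confusion_matrix` without using it. As a small sketch of how it complements the classification report, the example below builds a confusion matrix from hypothetical toy labels (`y_true_demo` and `y_pred_demo` are made up for illustration, not taken from the notebook's results) and derives per-class recall from it.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels, for illustration only
y_true_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 1, 1, 1, 2, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# Per-class recall: correct predictions divided by the row total
recall_per_class = cm.diagonal() / cm.sum(axis=1)
print(recall_per_class)
```

Reading along a row shows which classes a given true class is confused with, which the aggregated precision and recall figures do not reveal.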



## 6. Stacking Classifier

In [45]:
# Importing necessary libraries

from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

# Base estimators (level-0 models) for the stacking method

estimators = [
    ('dt', DecisionTreeClassifier()),
    ('nb', GaussianNB()),
    ('rf', RandomForestClassifier()),
    ('knn', KNeighborsClassifier())
]

# Building the stack model

stack_model = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())

# Splitting the dataset

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=42)

# Training the model

stack_model.fit(x_train, y_train)

In [46]:
y_pred = stack_model.predict(x_test)

# Performance evaluation

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        12
           1       0.88      0.93      0.90        15
           2       0.91      1.00      0.95        10
           3       1.00      1.00      1.00        10
           4       1.00      0.83      0.91        12
           5       1.00      1.00      1.00        12

    accuracy                           0.96        71
   macro avg       0.96      0.96      0.96        71
weighted avg       0.96      0.96      0.96        71



In [47]:
# Accuracy computation
from sklearn.metrics import accuracy_score
model_score = accuracy_score(y_test,y_pred)
print("{:.2f}%".format(model_score*100))

95.77%


The ensemble algorithms recorded the highest accuracy values, with the Voting Classifier reaching 91.55% and the Stacking Classifier reaching 95.77%. This may be because these algorithms combine several individual algorithms (logistic regression, SVM, KNN, etc.) and aggregate their predictions to improve overall accuracy. Note that these two figures come from a single hold-out split of 71 samples, whereas the earlier models were scored with 5-fold cross validation, so the numbers are not strictly comparable.
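
To put the two ensembles on the same footing as the single models, they could also be scored with the same k-fold procedure. The following is a minimal sketch on synthetic data (an assumption), with a reduced set of base estimators chosen to keep the run short; the names `base`, `ensembles`, and `ensemble_cv` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data (assumption: 8 features, 6 classes)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=6, n_redundant=0,
    n_classes=6, n_clusters_per_class=1, random_state=0,
)

# Shared base estimators; cross_val_score clones them, so reuse is safe
base = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier(random_state=0)),
    ("nb", GaussianNB()),
]

ensembles = {
    "voting": VotingClassifier(estimators=base),
    "stacking": StackingClassifier(
        estimators=base, final_estimator=LogisticRegression(max_iter=1000)
    ),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

ensemble_cv = {}
for name, model in ensembles.items():
    scores = cross_val_score(model, X_demo, y_demo, scoring="accuracy", cv=cv)
    ensemble_cv[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```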

Let's create a dummy input and look at the prediction. This mirrors the actual system, which will take the input, place it in a NumPy array, send it to the model, and retrieve the prediction result.

In [50]:
# Dummy input


test_case = [23, 167, 120, 0, 0, 1, 1, 1]
test = np.array([test_case])

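# Reuse the training feature names so the columns match what the model was fitted on
df = pd.DataFrame(test, columns=x.columns)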
test_pred = stack_model.predict(df)
print(test_pred)

[3]
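
The input-handling step described above could be wrapped in a small helper that validates the input length and attaches the feature names before predicting. This is only a sketch: `predict_one`, the placeholder feature names `f0` to `f7`, the toy training rows, and the 1-nearest-neighbor `demo_model` are all hypothetical stand-ins for the real system and the real stacking model.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical feature names; the real ones come from the training data
feature_names = [f"f{i}" for i in range(8)]

# Tiny toy training set, for illustration only
train_df = pd.DataFrame(
    [
        [20, 160, 55, 0, 0, 1, 1, 1],
        [45, 170, 110, 1, 1, 0, 0, 0],
        [30, 175, 80, 0, 1, 1, 0, 1],
        [52, 165, 125, 1, 0, 0, 1, 0],
    ],
    columns=feature_names,
)
labels = [0, 1, 0, 1]
demo_model = KNeighborsClassifier(n_neighbors=1).fit(train_df, labels)


def predict_one(model, values, columns):
    """Predict the class for a single observation given as a flat list."""
    if len(values) != len(columns):
        raise ValueError(f"expected {len(columns)} values, got {len(values)}")
    # A one-row DataFrame with named columns matches the training layout
    row = pd.DataFrame([values], columns=columns)
    return model.predict(row)[0]


pred = predict_one(demo_model, [23, 167, 120, 0, 0, 1, 1, 1], feature_names)
print(pred)
```

Checking the length up front turns a malformed request into a clear error instead of a shape mismatch raised deep inside the model.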


In [49]:
# Saving the Stacking Classifier model

import pickle

filename = 'obesitymodel.sav'
# Use a context manager so the file is flushed and closed after writing
with open(filename, 'wb') as f:
    pickle.dump(stack_model, f)
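
On the consuming side, the saved model would be loaded back with `pickle.load`. The sketch below shows a full save-and-load round trip on a small stand-in model trained on synthetic data (both are assumptions, as is the temporary file name `demo_model.sav`), and checks that the restored model predicts identically.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in model on synthetic data (assumption: not the real stacking model)
X_demo, y_demo = make_classification(n_samples=100, n_features=8, random_state=0)
fitted = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.sav")

    # Save the fitted model
    with open(path, "wb") as fh:
        pickle.dump(fitted, fh)

    # Load it back, as the serving system would
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

# The restored model should reproduce the original predictions
same_predictions = bool((restored.predict(X_demo) == fitted.predict(X_demo)).all())
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code, and a model is best loaded with the same scikit-learn version that saved it.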