# Learning Models

<img src="https://blogs.elespectador.com/wp-content/uploads/2017/09/logo-Universidad-Nacional.png" 
     style="float: right; margin-right: 10px;" 
     width="130"
     />



<div style="text-align: left"> 
Valentina Orduz Bonilla <br>
Student of Mathematics <br>
Universidad Nacional de Colombia - Sede Bogotá <br>
</div>




Starting from the problems of the SVM models homework, we are going to implement a machine learning model that considers the hypothesis spaces reviewed in class. The model must guarantee:

1. Generalization by means of an appropriate cross-validation strategy.
2. Hyperparameter adjustment (with an adequate selection of hyperparameters for each model).

In [None]:
# Basic libraries.
import numpy as np
import pandas as pd
import zipfile
from pathlib import Path
import urllib.request
from datetime import datetime

# Optimization libraries.
from scipy.optimize import linprog
from scipy.spatial import Delaunay

# Data Preprocessing libraries.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Machine Learning Models and Metrics.
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# Metrics and GridSearch.
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score, make_scorer
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

# Visualization libraries.
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
import plotly.express as px
from plotly.offline import plot
from plotly.subplots import make_subplots


#Datasets

In the previous document we studied two datasets, the Banknote Authentication Data Set and the Occupancy Detection Data Set. Now we are going to evaluate them with different models.

Let's start by loading the data.

In [None]:
#Banknote Authentication Data Set

#Save the dataset in such a way that we can call any feature
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00267/data_banknote_authentication.txt", 
                 sep = ',', 
                 header = None, 
                 names=["variance_of_Wavelet","skewness_of_Wavelet",
                        "curtosis_of_Wavelet","entropy",
                        "class"],
                 thousands = ',')
variables_bank=["variance_of_Wavelet","skewness_of_Wavelet",
                        "curtosis_of_Wavelet","entropy"]

#print("Dimensionality of the Dataframe:",df.shape)

In [None]:
#Occupancy Detection Data Set

#Save the dataset in such a way that we can call any feature

dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')

def load_occupancy_data():
    tarball_path = Path("datasets/occupancy_data.zip")
    if not tarball_path.is_file():
        Path("datasets").mkdir(parents=True, exist_ok=True)
        url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00357/occupancy_data.zip"
        urllib.request.urlretrieve(url, tarball_path)
        with zipfile.ZipFile(tarball_path) as occupancy_tarball:
           # open the csv file in the dataset
           occupancy_tarball.extractall(path="datasets")
    list_df =[pd.read_csv(Path("datasets/datatraining.txt"),parse_dates=['date'],date_parser=dateparse),
              pd.read_csv(Path("datasets/datatest.txt"),parse_dates=['date'],date_parser=dateparse),
              pd.read_csv(Path("datasets/datatest2.txt"),parse_dates=['date'],date_parser=dateparse),]
    return list_df

In [None]:
train_o, test1_o, test2_o= load_occupancy_data()


Now, as in the previous document, we have to convert the date feature into a number so that it can be used by the algorithms.

In [None]:
train_o['date_numeric'] = train_o['date'].apply(lambda time: time.year+time.month/12+ time.day/365 + time.hour/8760+time.minute/525600)
test1_o['date_numeric'] = test1_o['date'].apply(lambda time: time.year+time.month/12+ time.day/365 + time.hour/8760+time.minute/525600)
test2_o['date_numeric'] = test2_o['date'].apply(lambda time: time.year+time.month/12+ time.day/365 + time.hour/8760+time.minute/525600)

In [None]:
variables_occupancy = ['date_numeric', 'Temperature', 'Humidity', 'Light', 
              'CO2', 'HumidityRatio']

#Dataset division


#Data Splitting 

We want this notebook's outputs to be the same every time we run it, so we fix the random seed.

In [None]:
np.random.seed(42)

Next, we set aside 80% of the data for training and divide the remaining 20% equally between testing and validation, that is, 10% each.

BANKNOTE AUTHENTICATION

In [None]:
# First, split the data into a training set (80%) and a remaining set (20%).
X_train_banknote, X_rem, y_train_banknote, y_rem = train_test_split(df[variables_bank],df["class"], train_size=0.8)

# We want the validation and test sets to be the same size (10% of the overall data each),
# so we set test_size=0.5 (that is, 50% of the remaining data).
test_size = 0.5
X_valid_banknote, X_test_banknote, y_valid_banknote, y_test_banknote = train_test_split(X_rem,y_rem, test_size=test_size)
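
The two-step split can be checked on toy data. The following is a minimal sketch, assuming 1000 synthetic rows with a 60/40 class balance; it also passes `stratify` so that the class proportions are preserved in each part, which is an optional addition and is not used in the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data, invented for illustration: 600 rows of class 0 and 400 of class 1.
X = np.arange(1000).reshape(-1, 1)
y = np.array([0] * 600 + [1] * 400)

# Step 1: 80% for training, 20% remaining.
X_train, X_rem, y_train, y_rem = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=42)

# Step 2: split the remaining 20% in half (10% validation, 10% test).
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rem, y_rem, test_size=0.5, stratify=y_rem, random_state=42)

print("Sizes (train, valid, test):", len(X_train), len(X_valid), len(X_test))
print("Share of class 1:", y_train.mean(), y_valid.mean(), y_test.mean())
```

Passing `random_state` directly to `train_test_split` is an alternative to relying on the global NumPy seed.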

OCCUPANCY DETECTION

In [None]:
# Features and target for each split (the target 'Occupancy' is kept out of X).
X_train_o=train_o[variables_occupancy]
y_train_o=train_o['Occupancy']

X_test1_o=test1_o[variables_occupancy]
y_test1_o=test1_o['Occupancy']

X_test2_o=test2_o[variables_occupancy]
y_test2_o=test2_o['Occupancy']

#Feature Scaling

In [None]:
#Banknote Authentication Data Set:
scaler_banknote = StandardScaler()
df_banknote_transformed = scaler_banknote.fit_transform(X_train_banknote[variables_bank])
df_banknote_transformed_test = scaler_banknote.transform(X_test_banknote[variables_bank])
df_banknote_transformed_valid = scaler_banknote.transform(X_valid_banknote[variables_bank])

#Occupancy Detection Data Set:
scaler_occupancy = StandardScaler()
df_occupancy_transformed = scaler_occupancy.fit_transform(train_o[variables_occupancy])
df_occupancy_transformed_test1 = scaler_occupancy.transform(test1_o[variables_occupancy])
df_occupancy_transformed_test2 = scaler_occupancy.transform(test2_o[variables_occupancy])

#Hyperparameter Adjustment

In order to use the different models, we need to find appropriate hyperparameters using the training dataset we just defined. For this, we need a function that depends on the hyperparameters and allows us to compare the resulting models. We will call this function a metric.

##Metric Definition 

We need to choose a metric that evaluates the performance of the algorithms and helps us determine the hyperparameters. Thinking about the bank's situation, the bank wants its clients to be comfortable and to stay, which calls for both a high recall and a good precision; that is why we choose the F1 score.

The F1 score (or the F-score or F-measure) can be seen as a harmonic mean of the recall and the precision, where the relative contribution of both is equal. The formula is 

$$F_1=\frac{2\,(\text{precision}\cdot \text{recall})}{\text{precision}+\text{recall}}.$$
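
As a small worked example of the formula, the sketch below computes F1 by hand from a confusion matrix and compares it with `f1_score` from scikit-learn. The labels `y_true` and `y_pred` are made-up toy values, not data from either dataset.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Toy labels, invented for illustration only.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels {0, 1}, ravel() returns tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # 3 / 4
recall = tp / (tp + fn)      # 3 / 4
f1_manual = 2 * precision * recall / (precision + recall)

print("Precision:", precision)
print("Recall:", recall)
print("F1 (manual):", f1_manual)
print("F1 (sklearn):", f1_score(y_true, y_pred))
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot reach a high F1 by maximizing only precision or only recall.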

##Hyperparameter tuning



Now we are going to find the best hyperparameters for the four models: 

*   Logistic Regression
*   Support Vector Machine
*   K-Nearest Neighbor
*   Decision Tree Classifier
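
The searches for these four models all follow the same pattern. As a minimal sketch of that pattern, assuming synthetic data from `make_classification` instead of the two datasets and a deliberately tiny grid, the scaler can also be placed inside a `Pipeline` so that it is refit on the training part of each fold. The step names `scaler` and `svm` are choices made for this example only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary problem, used only to keep the sketch self-contained.
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, random_state=42)

# Scaling and the classifier are chained, so each fold fits its own scaler.
pipeline = Pipeline([("scaler", StandardScaler()), ("svm", SVC(kernel="rbf"))])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {"svm__C": [0.1, 1, 10], "svm__gamma": [0.1, 1]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1")
search.fit(X, y)

print("Best hyperparameters:", search.best_params_)
print("Best cross-validated F1:", search.best_score_)
```

For binary labels, the string `scoring="f1"` selects the same metric as a scorer built with `make_scorer(f1_score)`.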


###Logistic Regression

In [None]:
# Create a logistic regression model
model_logistic_reg_banknote = LogisticRegression(max_iter=10000)
model_logistic_reg_occupancy = LogisticRegression(max_iter=10000)


# Define hyperparameters for tuning
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000,10000], 
              'penalty': ['l2','l1'],
              'solver': ['liblinear', 'saga']} 

# Define F1 score as the metric for evaluation
f1 = make_scorer(f1_score)

# Tune hyperparameters using GridSearchCV
log_grid_search_banknote = GridSearchCV(model_logistic_reg_banknote, param_grid, cv=5, scoring=f1)
log_grid_search_occupancy = GridSearchCV(model_logistic_reg_occupancy, param_grid, cv=5, scoring=f1)
log_grid_search_banknote.fit(df_banknote_transformed, y_train_banknote)
log_grid_search_occupancy.fit(df_occupancy_transformed, train_o["Occupancy"])

In [None]:
# Print best hyperparameters and F1 score
print("\t Banknote Data Set: ")
print("Best Hyperparameters: ", log_grid_search_banknote.best_params_)
print("Best F1 Score: ", log_grid_search_banknote.best_score_)

print("\n\t Occupancy Data Set: ")
print("Best Hyperparameters: ", log_grid_search_occupancy.best_params_)
print("Best F1 Score: ", log_grid_search_occupancy.best_score_)

	 Banknote Data Set: 
Best Hyperparameters:  {'C': 100, 'penalty': 'l1', 'solver': 'liblinear'}
Best F1 Score:  0.9916771803099673

	 Occupancy Data Set: 
Best Hyperparameters:  {'C': 0.001, 'penalty': 'l1', 'solver': 'saga'}
Best F1 Score:  0.9677440928347621


###Support Vector Machine (SVM)



In [None]:
# Define the SVM classifier
model_svm_banknote = SVC(max_iter=10000)
model_svm_occupancy = SVC(max_iter=10000)

# Define the parameter grid
param_grid = {'C': [0.01, 0.1, 1, 10, 100], 
              'kernel': ['rbf',],#'sigmoid','poly','linear'
              'gamma': [0.1, 1, 10, 100]}

# Define the GridSearchCV object
svm_grid_search_banknote= GridSearchCV(model_svm_banknote, param_grid, cv=5, scoring=f1)
svm_grid_search_occupancy = GridSearchCV(model_svm_occupancy, param_grid, cv=5, scoring=f1)

# Fit the GridSearchCV object to the data
svm_grid_search_banknote.fit(df_banknote_transformed, y_train_banknote)
svm_grid_search_occupancy.fit(df_occupancy_transformed, train_o["Occupancy"])

# Print the best hyperparameters
print("\t Banknote Data Set: ")
print("Best hyperparameters: ", svm_grid_search_banknote.best_params_)
print("Best F1 Score: ", svm_grid_search_banknote.best_score_)

print("\n\t Occupancy Data Set: ")
print("Best hyperparameters: ", svm_grid_search_occupancy.best_params_)
print("Best F1 Score: ", svm_grid_search_occupancy.best_score_)

	 Banknote Data Set: 
Best hyperparameters:  {'C': 1, 'gamma': 10, 'kernel': 'rbf'}
Best F1 Score:  1.0

	 Occupancy Data Set: 
Best hyperparameters:  {'C': 1, 'gamma': 0.1, 'kernel': 'rbf'}
Best F1 Score:  0.9554811142205516


###K-Nearest Neighbor (KNN)

In [None]:
# Define the KNN classifier
knn_banknote = KNeighborsClassifier()
knn_occupancy = KNeighborsClassifier()

# Define the parameter grid
param_grid = {'n_neighbors': [3, 5, 7, 9], 
              'weights': ['uniform', 'distance'],
              'metric': ['euclidean', 'manhattan']
              }

# Define the GridSearchCV object
knn_grid_search_banknote = GridSearchCV(knn_banknote, param_grid, cv=5, scoring=f1)
knn_grid_search_occupancy = GridSearchCV(knn_occupancy, param_grid, cv=5, scoring=f1)

# Fit the GridSearchCV object to the data
knn_grid_search_banknote.fit(df_banknote_transformed, y_train_banknote)
knn_grid_search_occupancy.fit(df_occupancy_transformed, train_o["Occupancy"])

# Print the best hyperparameters
print("\t Banknote Data Set: ")
print("Best hyperparameters: ", knn_grid_search_banknote.best_params_)
print("Best F1 Score: ", knn_grid_search_banknote.best_score_)

print("\n\t Occupancy Data Set: ")
print("Best hyperparameters: ", knn_grid_search_occupancy.best_params_)
print("Best F1 Score: ", knn_grid_search_occupancy.best_score_)

	 Banknote Data Set: 
Best hyperparameters:  {'metric': 'euclidean', 'n_neighbors': 3, 'weights': 'uniform'}
Best F1 Score:  0.9980099502487562

	 Occupancy Data Set: 
Best hyperparameters:  {'metric': 'euclidean', 'n_neighbors': 3, 'weights': 'uniform'}
Best F1 Score:  0.9818181818181818


###Decision Tree Classifier

In [None]:
# Define the Decision Tree classifier
dtc_banknote = DecisionTreeClassifier()
dtc_occupancy = DecisionTreeClassifier()

# Define the parameter grid
param_grid = {'max_depth': [5, 10, 15, 20], 
              'min_samples_split': [2, 5, 10, 15, 20]}

# Define the GridSearchCV object
dtc_grid_search_banknote = GridSearchCV(dtc_banknote, param_grid, cv=5, scoring=f1)
dtc_grid_search_occupancy = GridSearchCV(dtc_occupancy, param_grid, cv=5, scoring=f1)

# Fit the GridSearchCV object to the data
dtc_grid_search_banknote.fit(df_banknote_transformed, y_train_banknote)
dtc_grid_search_occupancy.fit(df_occupancy_transformed, train_o["Occupancy"])

# Print the best hyperparameters
print("\tBest Hyperparameters Banknote Data Set: ")
print("Best hyperparameters: ", dtc_grid_search_banknote.best_params_)
print("Best F1 Score: ", dtc_grid_search_banknote.best_score_)

print("\n\tBest Hyperparameters Occupancy Data Set: ")
print("Best hyperparameters: ", dtc_grid_search_occupancy.best_params_)
print("Best F1 Score: ", dtc_grid_search_occupancy.best_score_)

	Best Hyperparameters Banknote Data Set: 
Best hyperparameters:  {'max_depth': 10, 'min_samples_split': 2}
Best F1 Score:  0.9844421587833387

	Best Hyperparameters Occupancy Data Set: 
Best hyperparameters:  {'max_depth': 5, 'min_samples_split': 2}
Best F1 Score:  0.7176247467206485


We just found the best hyperparameters for each model; the next step is to evaluate the models.

#Model Evaluation

We have the best hyperparameters for each model; however, we also want to choose the best model, which is why we rely on cross-validation.

To avoid overfitting and to make sure that the results do not depend on the particular partition we made (training and testing), we use cross-validation, a technique that trains and tests the model on different portions of the dataset in successive iterations. The grid searches above already applied it with five folds (`cv=5`). It also lets us assess the skill of machine learning models so that we can choose the best one.
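
To make the idea concrete, here is a minimal sketch of 5-fold cross-validation with `cross_val_score`, assuming synthetic data from `make_classification` and a logistic regression with default regularization; neither is taken from the experiments in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary problem, invented for illustration.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# Each of the 5 folds is used once for scoring while the other 4 train the model.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=folds, scoring="f1")

print("F1 per fold:", scores)
print("Mean F1: %.3f (std %.3f)" % (scores.mean(), scores.std()))
```

The spread of the per-fold scores gives a rough idea of how much the result depends on the particular partition.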

First we evaluate all the models with the testing set, and then we evaluate the best models with the validation set.

##Model Evaluation with testing set





###Logistic Regression

In [None]:
model_logistic_reg_banknote = LogisticRegression(C=100,penalty = 'l1', solver= 'liblinear',max_iter=10000, random_state= 42)
model_logistic_reg_occupancy = LogisticRegression(max_iter=10000,C= 0.001, penalty= 'l1', solver= 'saga')

model_logistic_reg_banknote.fit(df_banknote_transformed, y_train_banknote)
model_logistic_reg_occupancy.fit(df_occupancy_transformed, train_o["Occupancy"])

log_predictions_banknote = model_logistic_reg_banknote.predict(df_banknote_transformed_test)
log_predictions_occupancy = model_logistic_reg_occupancy.predict(df_occupancy_transformed_test1)

print("Logistic Regression Banknote Data Set: ")
print("F1 Score: ", f1_score(y_test_banknote, log_predictions_banknote))
print("Accuracy Score: ", accuracy_score(y_test_banknote, log_predictions_banknote))
print("Precision Score: ", precision_score(y_test_banknote, log_predictions_banknote))
print("Recall Score: ", recall_score(y_test_banknote, log_predictions_banknote))

print("\nLogistic Regression Occupancy Data Set: ")
print("F1 Score: ", f1_score(test1_o["Occupancy"], log_predictions_occupancy))
print("Accuracy Score: ", accuracy_score(test1_o["Occupancy"], log_predictions_occupancy))
print("Precision Score: ", precision_score(test1_o["Occupancy"], log_predictions_occupancy))
print("Recall Score: ", recall_score(test1_o["Occupancy"], log_predictions_occupancy))

Logistic Regression Banknote Data Set: 
F1 Score:  0.9787234042553192
Accuracy Score:  0.9782608695652174
Precision Score:  0.971830985915493
Recall Score:  0.9857142857142858

Logistic Regression Occupancy Data Set: 
F1 Score:  0.9714285714285715
Accuracy Score:  0.9786116322701689
Precision Score:  0.9472140762463344
Recall Score:  0.9969135802469136


###SVM

In [None]:
model_svm_banknote = SVC(C= 1, gamma= 10, kernel= 'rbf', max_iter=10000)
model_svm_occupancy = SVC(C= 1, gamma= 0.1, kernel= 'rbf', max_iter=10000)

model_svm_banknote.fit(df_banknote_transformed, y_train_banknote)
model_svm_occupancy.fit(df_occupancy_transformed, train_o["Occupancy"])

svm_predictions_banknote = model_svm_banknote.predict(df_banknote_transformed_test)
svm_predictions_occupancy = model_svm_occupancy.predict(df_occupancy_transformed_test1)

print("SVM Banknote Data Set: ")
print("F1 Score: ", f1_score(y_test_banknote, svm_predictions_banknote))
print("Accuracy: ", accuracy_score(y_test_banknote, svm_predictions_banknote))
print("Precision: ", precision_score(y_test_banknote, svm_predictions_banknote))
print("recall: ", recall_score(y_test_banknote, svm_predictions_banknote))

print("\nSVM Occupancy Data Set: ")
print("F1 Score: ", f1_score(test1_o["Occupancy"], svm_predictions_occupancy))
print("Accuracy: ", accuracy_score(test1_o["Occupancy"], svm_predictions_occupancy))
print("Precision: ", precision_score(test1_o["Occupancy"], svm_predictions_occupancy))
print("recall: ", recall_score(test1_o["Occupancy"], svm_predictions_occupancy))

SVM Banknote Data Set: 
F1 Score:  1.0
Accuracy:  1.0
Precision:  1.0
recall:  1.0

SVM Occupancy Data Set: 
F1 Score:  0.9603567888999008
Accuracy:  0.9699812382739212
Precision:  0.9263862332695985
recall:  0.9969135802469136


###KNN

In [None]:
knn_banknote = KNeighborsClassifier(metric= 'euclidean', n_neighbors= 3, weights= 'uniform')
knn_occupancy = KNeighborsClassifier(metric= 'euclidean', n_neighbors= 9, weights= 'uniform')  # note: the grid search above reported n_neighbors=3

knn_banknote.fit(df_banknote_transformed, y_train_banknote)
knn_occupancy.fit(df_occupancy_transformed, train_o["Occupancy"])

knn_predictions_banknote = knn_banknote.predict(df_banknote_transformed_test)
knn_predictions_occupancy = knn_occupancy.predict(df_occupancy_transformed_test1)

print("KNN Banknote Data Set: ")
print("F1 Score: ", f1_score(y_test_banknote, knn_predictions_banknote))
print("Accuracy: ", accuracy_score(y_test_banknote, knn_predictions_banknote))
print("Precision: ", precision_score(y_test_banknote, knn_predictions_banknote))
print("recall: ", recall_score(y_test_banknote, knn_predictions_banknote))

print("\nKNN Occupancy Data Set: ")
print("F1 Score: ", f1_score(test1_o["Occupancy"], knn_predictions_occupancy))
print("Accuracy: ", accuracy_score(test1_o["Occupancy"], knn_predictions_occupancy))
print("Precision: ", precision_score(test1_o["Occupancy"], knn_predictions_occupancy))
print("recall: ", recall_score(test1_o["Occupancy"], knn_predictions_occupancy))

KNN Banknote Data Set: 
F1 Score:  1.0
Accuracy:  1.0
Precision:  1.0
recall:  1.0

KNN Occupancy Data Set: 
F1 Score:  0.948125321006677
Accuracy:  0.9621013133208255
Precision:  0.9466666666666667
recall:  0.9495884773662552


###Decision Tree Classifier

In [None]:
dtc_banknote = DecisionTreeClassifier(max_depth= 20, min_samples_split= 2)  # note: the grid search above reported max_depth=10
dtc_occupancy = DecisionTreeClassifier(max_depth= 5, min_samples_split= 2)

dtc_banknote.fit(df_banknote_transformed, y_train_banknote)
dtc_occupancy.fit(df_occupancy_transformed, train_o["Occupancy"])

dtc_predictions_banknote = dtc_banknote.predict(df_banknote_transformed_test)
dtc_predictions_occupancy = dtc_occupancy.predict(df_occupancy_transformed_test1)

print("Decision Tree Banknote Data Set: ")
print("F1 Score: ", f1_score(y_test_banknote, dtc_predictions_banknote))
print("Accuracy: ", accuracy_score(y_test_banknote, dtc_predictions_banknote))
print("Precision: ", precision_score(y_test_banknote, dtc_predictions_banknote))
print("recall: ", recall_score(y_test_banknote, dtc_predictions_banknote))

print("\nDecision Tree Occupancy Data Set: ")
print("F1 Score: ", f1_score(test1_o["Occupancy"], dtc_predictions_occupancy))
print("Accuracy: ", accuracy_score(test1_o["Occupancy"], dtc_predictions_occupancy))
print("Precision: ", precision_score(test1_o["Occupancy"], dtc_predictions_occupancy))
print("recall: ", recall_score(test1_o["Occupancy"], dtc_predictions_occupancy))

Decision Tree Banknote Data Set: 
F1 Score:  0.9781021897810218
Accuracy:  0.9782608695652174
Precision:  1.0
recall:  0.9571428571428572

Decision Tree Occupancy Data Set: 
F1 Score:  0.7626076260762606
Accuracy:  0.8551594746716698
Precision:  0.9480122324159022
recall:  0.6378600823045267


In summary, all the above results can be seen in the following tables.


In [None]:
# Collect the test-set metrics of every model in one table per dataset.
metrics = {'f1_score': f1_score, 'precision_score': precision_score,
           'recall_score': recall_score, 'accuracy_score': accuracy_score}
model_names = ['Logistic Reg', "SVM", "KNN", "Decision Tree"]

def summarize(y_true, predictions):
    # One row per model, one column per metric.
    table = {'model': model_names}
    for name, metric in metrics.items():
        table[name] = [metric(y_true, y_pred) for y_pred in predictions]
    return pd.DataFrame(table)

results_banknote = summarize(y_test_banknote,
                             [log_predictions_banknote, svm_predictions_banknote,
                              knn_predictions_banknote, dtc_predictions_banknote])
results_occupancy = summarize(test1_o["Occupancy"],
                              [log_predictions_occupancy, svm_predictions_occupancy,
                               knn_predictions_occupancy, dtc_predictions_occupancy])

In [None]:
print('Banknote Authentication')
results_banknote.style.background_gradient(cmap='Reds',subset='f1_score')

|   | model | f1_score | precision_score | recall_score | accuracy_score |
|---|---|---|---|---|---|
| 0 | Logistic Reg | 0.978723 | 0.971831 | 0.985714 | 0.978261 |
| 1 | SVM | 1.0 | 1.0 | 1.0 | 1.0 |
| 2 | KNN | 1.0 | 1.0 | 1.0 | 1.0 |
| 3 | Decision Tree | 0.978102 | 1.0 | 0.957143 | 0.978261 |


In [None]:
print('Occupancy Detection')
results_occupancy.style.background_gradient(cmap='Reds',subset='f1_score')

Occupancy Detection


|   | model | f1_score | precision_score | recall_score | accuracy_score |
|---|---|---|---|---|---|
| 0 | Logistic Reg | 0.971429 | 0.947214 | 0.996914 | 0.978612 |
| 1 | SVM | 0.960357 | 0.926386 | 0.996914 | 0.969981 |
| 2 | KNN | 0.948125 | 0.946667 | 0.949588 | 0.962101 |
| 3 | Decision Tree | 0.762608 | 0.948012 | 0.63786 | 0.855159 |


These results let us conclude that, using the F1 score as reference, the best models are:


*   SVM and KNN for the Banknote Authentication dataset
*   Logistic regression for the Occupancy Detection dataset
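
This selection can also be read off a results table programmatically. The following is a minimal sketch, assuming a small DataFrame that copies the `f1_score` column of the Banknote Authentication table above; the name `results` is hypothetical.

```python
import pandas as pd

# F1 scores copied from the Banknote Authentication results table.
results = pd.DataFrame({
    "model": ["Logistic Reg", "SVM", "KNN", "Decision Tree"],
    "f1_score": [0.978723, 1.0, 1.0, 0.978102],
})

# Keep every model that reaches the maximum, so that ties are not hidden.
is_best = results["f1_score"] == results["f1_score"].max()
best = results.loc[is_best, "model"].tolist()
print(best)  # ['SVM', 'KNN']
```

Comparing against the maximum, rather than calling `idxmax`, returns both models when two of them tie.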



##Model Evaluation with validation set

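Now we evaluate the best models we found once more, this time on the validation set. For the Occupancy Detection dataset, the second test file (`datatest2.txt`) plays the role of the validation set.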

###Banknote Authentication dataset

In [None]:
svm_predictions_valid_banknote = model_svm_banknote.predict(df_banknote_transformed_valid)
knn_predictions_valid_banknote = knn_banknote.predict(df_banknote_transformed_valid)

print("SVM Banknote Data Set: ")
print("F1 Score: ", f1_score(y_valid_banknote, svm_predictions_valid_banknote))
print("Accuracy: ", accuracy_score(y_valid_banknote, svm_predictions_valid_banknote))
print("Precision: ", precision_score(y_valid_banknote, svm_predictions_valid_banknote))
print("recall: ", recall_score(y_valid_banknote, svm_predictions_valid_banknote))

print("\nKNN Banknote Data Set: ")
print("F1 Score: ", f1_score(y_valid_banknote, knn_predictions_valid_banknote))
print("Accuracy: ", accuracy_score(y_valid_banknote, knn_predictions_valid_banknote))
print("Precision: ", precision_score(y_valid_banknote, knn_predictions_valid_banknote))
print("recall: ", recall_score(y_valid_banknote, knn_predictions_valid_banknote))

SVM Banknote Data Set: 
F1 Score:  1.0
Accuracy:  1.0
Precision:  1.0
recall:  1.0

KNN Banknote Data Set: 
F1 Score:  1.0
Accuracy:  1.0
Precision:  1.0
recall:  1.0


###Occupancy Detection dataset

In [None]:
log_reg_predictions_valid_occupancy = model_logistic_reg_occupancy.predict(df_occupancy_transformed_test2)

print("Logistic Regression Occupancy Data Set: ")
print("F1 Score: ", f1_score(test2_o["Occupancy"], log_reg_predictions_valid_occupancy))
print("Accuracy: ", accuracy_score(test2_o["Occupancy"], log_reg_predictions_valid_occupancy))
print("Precision: ", precision_score(test2_o["Occupancy"], log_reg_predictions_valid_occupancy))
print("recall: ", recall_score(test2_o["Occupancy"], log_reg_predictions_valid_occupancy))

Logistic Regression Occupancy Data Set: 
F1 Score:  0.9840579710144928
Accuracy:  0.9932321575061526
Precision:  0.9741750358680057
recall:  0.9941434846266471


In both cases we get excellent F1 scores, which confirms that the selected models perform well.

#Conclusions

After this process we can say that, comparing the models by F1 score, the best models for the Banknote Authentication dataset are Support Vector Machine and K-Nearest Neighbor, which both scored 1, and the best model for the Occupancy Detection dataset is Logistic Regression, which scored 0.98 on the validation set (0.97 on the testing set). In both cases, the models give good results in both testing and validation.

#References

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html

https://scikit-learn.org/stable/modules/cross_validation.html

https://learn.g2.com/cross-validation

https://www.geeksforgeeks.org/svm-hyperparameter-tuning-using-gridsearchcv-ml/