Delving deeper into ML methods, better metrics are needed to find the best approach to a particular task. This section also covers fine-tuning ML models.

This notebook considers ways to improve the ML methods and models already in use.

In [1]:
# import the modules for the tasks at hand
# the datasets are loaded from sklearn rather than from external files (CSV, SPSS, etc.)
from sklearn import datasets
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

Confusion matrix

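A confusion matrix is a table that describes the performance of a classification model (or classifier) on a set of test data for which the true values are known. In scikit-learn's output, each row corresponds to a true class and each column to a predicted class.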

In [2]:
# adding the dataset to be used at the start
cancer = datasets.load_breast_cancer()
cancer.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])

In [3]:
X = cancer.data
y = cancer.target

In [4]:
# predict the values using k-NN
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

knn = KNeighborsClassifier(n_neighbors = 8)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 42)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# confusion matrix calculation 
print(confusion_matrix(y_test, y_pred))

[[ 77   3]
 [  4 144]]


In [5]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.95      0.96      0.96        80
           1       0.98      0.97      0.98       148

    accuracy                           0.97       228
   macro avg       0.97      0.97      0.97       228
weighted avg       0.97      0.97      0.97       228



In [6]:
# accuracy metric or score for the dataset
knn.score(X_test, y_test)

0.9692982456140351
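The figures in the classification report and the accuracy score can be recovered from the confusion matrix alone. The sketch below hard-codes the matrix printed above and uses plain Python; the names `cm` and `metrics` are illustrative only.

```python
# Confusion matrix printed above: rows are true classes, columns are predicted classes
cm = [[77, 3],
      [4, 144]]

n_classes = len(cm)
total = sum(sum(row) for row in cm)

metrics = {}
for c in range(n_classes):
    tp = cm[c][c]                                           # correctly predicted as class c
    predicted_c = sum(cm[r][c] for r in range(n_classes))   # column sum: everything predicted as c
    actual_c = sum(cm[c])                                   # row sum: everything that truly is c (support)
    precision = tp / predicted_c
    recall = tp / actual_c
    f1 = 2 * precision * recall / (precision + recall)
    metrics[c] = {"precision": precision, "recall": recall, "f1": f1, "support": actual_c}
    print(f"class {c}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} support={actual_c}")

# accuracy is the share of the diagonal (correct predictions) in the whole table
accuracy = sum(cm[c][c] for c in range(n_classes)) / total
print(f"accuracy={accuracy}")
```

Rounded to two decimals, these values agree with the report, and the accuracy (221 correct out of 228) agrees with `knn.score`.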

Hyperparameter tuning

Hyperparameter tuning is the process of selecting the parameters that are set before a model is trained, such as k in k-NN and alpha in regularized regression.

One method is grid search cross-validation, used below to find an optimized value of k for k-NN.

In [8]:
# import the grid search class
from sklearn.model_selection import GridSearchCV

# parameter grid to vary the k value over (k = 1, ..., 49)
param_grid = {"n_neighbors" : np.arange(1,50)}

knn = KNeighborsClassifier()

knn_cv = GridSearchCV(knn, param_grid, cv = 5)
knn_cv.fit(X,y)

print("Best parameter: {}".format(knn_cv.best_params_))
print("Best score: {}".format(knn_cv.best_score_))

Best parameter: {'n_neighbors': 13}
Best score: 0.9332401800962584
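To make the search less of a black box, the same idea can be written as an explicit loop: for each candidate k, compute the mean 5-fold cross-validated accuracy, then keep the k with the highest mean. This is a minimal sketch of the idea, not a description of how `GridSearchCV` is implemented internally; the names `X_bc`, `y_bc`, `k_values` and `mean_scores` are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# reload the breast cancer data so the block is self-contained
cancer = datasets.load_breast_cancer()
X_bc, y_bc = cancer.data, cancer.target

k_values = np.arange(1, 50)
mean_scores = []
for k in k_values:
    knn_k = KNeighborsClassifier(n_neighbors=int(k))
    # 5-fold cross-validation: train on four folds, score accuracy on the fifth, five times
    fold_scores = cross_val_score(knn_k, X_bc, y_bc, cv=5)
    mean_scores.append(fold_scores.mean())

# pick the k with the highest mean cross-validated accuracy
best_index = int(np.argmax(mean_scores))
best_k = int(k_values[best_index])
best_score = mean_scores[best_index]

print("Best k: {}".format(best_k))
print("Best mean CV accuracy: {}".format(best_score))
```

Because the best score is measured on the same folds that were used to choose k, it tends to be slightly optimistic. Holding out a separate test set before tuning gives a fairer final estimate.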


Randomized search CV is useful when the hyperparameter space is large, where an exhaustive grid search becomes computationally expensive. Instead of trying every combination, it evaluates a fixed number of randomly sampled ones.

For this example a decision tree is used as the classifier; it has more hyperparameters to tune than k-NN or regression.
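The cost difference can be made concrete by counting model fits. The sketch below enumerates a discrete version of the decision tree search space used in this section and compares it with a fixed budget of random draws. It uses only the standard library; `n_iter = 10` is the default budget of `RandomizedSearchCV`, and the names `space`, `grid` and `sampled` are illustrative.

```python
import itertools
import random

# discrete version of the decision tree search space used in this section
space = {
    "max_depth": [3, None],
    "max_features": list(range(1, 9)),       # randint(1, 9) draws integers from 1 to 8
    "min_samples_leaf": list(range(1, 9)),
    "criterion": ["gini", "entropy"],
}

# grid search: every combination of values is evaluated
grid = list(itertools.product(*space.values()))
n_grid = len(grid)

# randomized search: only n_iter randomly drawn combinations are evaluated
n_iter = 10
rng = random.Random(0)  # seed chosen arbitrarily for repeatability
sampled = [{name: rng.choice(values) for name, values in space.items()}
           for _ in range(n_iter)]

cv = 5  # each candidate is fitted once per fold
print("grid search candidates: {}, fits: {}".format(n_grid, n_grid * cv))
print("randomized search candidates: {}, fits: {}".format(n_iter, n_iter * cv))
```

With four hyperparameters the grid already has 2 x 8 x 8 x 2 = 256 candidates, and every extra hyperparameter multiplies that number, while the randomized budget stays fixed.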

In [11]:
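# import randint from scipy.stats to sample integer hyperparameter values
from scipy.stats import randint

# import the dataset
# note: load_diabetes has a continuous (regression) target, so the classifier treats
# each distinct target value as its own class, which likely explains the low score below
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target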

# import the decision tree classifier and randomized search CV
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

# decision tree parameters that can be varied to tune the classifier
param_dist = {"max_depth" : [3, None],
              "max_features" : randint(1,9),
              "min_samples_leaf" : randint(1,9),
              "criterion" : ["gini", "entropy"]}

# initialize the decision tree
tree = DecisionTreeClassifier()

# applying the tuning
tree_cv = RandomizedSearchCV(tree, param_dist, cv = 5)
tree_cv.fit(X,y)

# print the best parameters found and the corresponding score
print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best Score: {}".format(tree_cv.best_score_))

Tuned Decision Tree Parameters: {'criterion': 'entropy', 'max_depth': 3, 'max_features': 2, 'min_samples_leaf': 3}
Best Score: 0.020352400408580187
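A best score of about 0.02 is what one would expect when a classifier is fitted to a continuous target: almost every target value in the diabetes data becomes its own class. The sketch below assumes the intent was to tune a classifier and swaps in the breast cancer dataset used earlier. The `random_state` values and the explicit `n_iter=10` are additions for repeatability, and `X_clf` and `y_clf` are illustrative names. The resulting score is not shown here because it depends on the sampled candidates.

```python
from scipy.stats import randint
from sklearn import datasets
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# assumption: use a classification dataset (binary target) instead of load_diabetes
cancer = datasets.load_breast_cancer()
X_clf, y_clf = cancer.data, cancer.target

# same search space as above; randint(1, 9) samples integers from 1 to 8
param_dist = {
    "max_depth": [3, None],
    "max_features": randint(1, 9),
    "min_samples_leaf": randint(1, 9),
    "criterion": ["gini", "entropy"],
}

# random_state is an assumption added so that repeated runs give the same result
tree = DecisionTreeClassifier(random_state=42)

# evaluate 10 randomly sampled candidates with 5-fold cross-validation
tree_cv = RandomizedSearchCV(tree, param_dist, n_iter=10, cv=5, random_state=42)
tree_cv.fit(X_clf, y_clf)

print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best Score: {}".format(tree_cv.best_score_))
```

For the diabetes data itself, a regression model such as `DecisionTreeRegressor` would be the matching choice, with R² instead of accuracy as the default score.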
