# GridSearchCV

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
#import data
data = pd.read_csv('breast-cancer-wisconsin.csv',header=None)

#set column names
data.columns = ['Sample Code Number','Clump Thickness','Uniformity of Cell Size',
                'Uniformity of Cell Shape','Marginal Adhesion','Single Epithelial Cell Size',
                'Bare Nuclei','Bland Chromatin','Normal Nucleoli','Mitoses','Class']
#view top 10 rows
data.head(10)

Each row in the dataset has one of two possible classes: benign (represented by 2) or malignant (represented by 4). The dataset also has 10 attributes (shown above). All of them are used for prediction except Sample Code Number, which is only an ID number.

## Clean the data

Clean the data and rename the class values as 0/1 for model building (where 1 represents a malignant case). Also, let's observe the distribution of the class.

In [None]:
data = data.drop(['Sample Code Number'],axis=1) #Drop 1st column
data = data[data['Bare Nuclei'] != '?'].copy() #Remove rows with missing data
data['Bare Nuclei'] = data['Bare Nuclei'].astype(int) #Column was read as strings because of the '?' markers
data['Class'] = np.where(data['Class'] ==2,0,1) #Change the Class representation
data['Class'].value_counts() #Class distribution

## Baseline: Dummy Classifier

Before building a classification model, let's build a Dummy Classifier to determine the 'baseline' performance. This answers the question: what would be the success rate of the model if one were simply guessing? The dummy classifier we are using will simply predict the majority class.

In [None]:
#Split data into attributes and class
X = data.drop(['Class'],axis=1)
y = data['Class']

#perform training and test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

#Dummy Classifier
from sklearn.dummy import DummyClassifier
clf = DummyClassifier(strategy= 'most_frequent').fit(X_train,y_train)
y_pred = clf.predict(X_test)

#Distribution of y test
print('y actual : \n' +  str(y_test.value_counts()))

#Distribution of y predicted
print('y predicted : \n' + str(pd.Series(y_pred).value_counts()))

From the output, we can observe that there are 68 malignant and 103 benign cases in the test dataset. However, our classifier predicts all cases as benign (as it is the majority class).

In [None]:
# Model Evaluation metrics 
from sklearn.metrics import accuracy_score,recall_score,precision_score,f1_score
print('Accuracy Score : ' + str(accuracy_score(y_test,y_pred)))
print('Precision Score : ' + str(precision_score(y_test,y_pred)))
print('Recall Score : ' + str(recall_score(y_test,y_pred)))
print('F1 Score : ' + str(f1_score(y_test,y_pred)))

#Dummy Classifier Confusion matrix
from sklearn.metrics import confusion_matrix
print('Confusion Matrix : \n' + str(confusion_matrix(y_test,y_pred)))

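The accuracy of the dummy model is 60.2%, but accuracy is not the best metric here: the model never predicts a malignant case, so its precision, recall and F1 score for the malignant class are all 0. So, let's train a real classifier and compare it on all of these evaluation metrics.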
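To make the gap between accuracy and recall concrete, here is a minimal pure-Python sketch of a majority-class baseline. It uses the test-set class counts quoted above (103 benign, 68 malignant); the helper names and the tiny training list are illustrative assumptions, not part of scikit-learn.

```python
from collections import Counter

def majority_class_predictions(y_train, n_predictions):
    """Predict the most frequent training label for every sample."""
    majority_label = Counter(y_train).most_common(1)[0][0]
    return [majority_label] * n_predictions

def confusion_counts(y_true, y_pred, positive=1):
    """Return (tn, fp, fn, tp), treating `positive` as the malignant class."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    return tn, fp, fn, tp

# Test labels as described in the text: 103 benign (0) and 68 malignant (1).
y_test = [0] * 103 + [1] * 68
# Assumption: a stand-in training set in which benign is the majority class.
y_train = [0, 0, 0, 1, 1]

y_pred = majority_class_predictions(y_train, len(y_test))
tn, fp, fn, tp = confusion_counts(y_test, y_pred)

accuracy = (tp + tn) / len(y_test)
# Recall is the share of malignant cases that were caught; guard against 0/0.
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"accuracy={accuracy:.3f}, recall={recall:.3f}")
```

The baseline gets every benign case right and every malignant case wrong, which is why a seemingly reasonable accuracy can coexist with a recall of zero.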

In [None]:

#Logistic regression
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression().fit(X_train,y_train)
y_pred = clf.predict(X_test)

# Model Evaluation metrics 
from sklearn.metrics import accuracy_score,recall_score,precision_score,f1_score
print('Accuracy Score : ' + str(accuracy_score(y_test,y_pred)))
print('Precision Score : ' + str(precision_score(y_test,y_pred)))
print('Recall Score : ' + str(recall_score(y_test,y_pred)))
print('F1 Score : ' + str(f1_score(y_test,y_pred)))

#Logistic Regression Classifier Confusion matrix
from sklearn.metrics import confusion_matrix
print('Confusion Matrix : \n' + str(confusion_matrix(y_test,y_pred)))

Looking at the misclassified instances in the confusion matrix, we can observe that 8 malignant cases have been incorrectly classified as benign (false negatives), and just one benign case has been classified as malignant (false positive).
A false negative is more serious because the disease goes undetected, which can lead to the death of the patient. A false positive, by contrast, leads to unnecessary treatment and additional cost.
Let's try to reduce the false negatives by using grid search to find the hyperparameters that give the best recall. Grid search can be used to optimize any chosen evaluation metric.
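Conceptually, a grid search tries every combination of the candidate hyperparameter values and keeps the one with the best score. The sketch below shows that idea in plain Python. `grid_search` and `toy_score` are hypothetical names, and `toy_score` is a made-up placeholder for the cross-validated metric that `GridSearchCV` would compute by refitting the model for each candidate.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and return the best one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(params):
    # Hypothetical placeholder score, NOT a real model evaluation:
    # it peaks at C=1 and slightly favours the l2 penalty.
    penalty_bonus = 0.01 if params["penalty"] == "l2" else 0.0
    return 1.0 - abs(params["C"] - 1) / 25 + penalty_bonus

param_grid = {"penalty": ["l1", "l2"], "C": [0.009, 0.01, 0.09, 1, 5, 10, 25]}

# 2 penalties x 7 values of C gives 14 candidate settings.
n_candidates = len(list(product(*param_grid.values())))
best_params, best_score = grid_search(param_grid, toy_score)

print(n_candidates, best_params)
```

The cost grows multiplicatively with each added hyperparameter, since every candidate is fitted once per cross-validation fold.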

In [None]:

#Grid Search
from sklearn.model_selection import GridSearchCV
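clf = LogisticRegression(solver='liblinear') #liblinear supports both l1 and l2 (the default lbfgs solver does not support l1)
pens = ['l1', 'l2']
Cs = [0.009,0.01,0.09,1,5,10,25]
grid_values = {'penalty': pens,'C':Cs}
grid_clf_acc = GridSearchCV(clf, param_grid = grid_values,scoring = 'recall')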
grid_clf_acc.fit(X_train, y_train)

#Predict values based on new parameters
y_pred_acc = grid_clf_acc.predict(X_test)

# New Model Evaluation metrics 
print('Accuracy Score : ' + str(accuracy_score(y_test,y_pred_acc)))
print('Precision Score : ' + str(precision_score(y_test,y_pred_acc)))
print('Recall Score : ' + str(recall_score(y_test,y_pred_acc)))
print('F1 Score : ' + str(f1_score(y_test,y_pred_acc)))

#Logistic Regression (Grid Search) Confusion matrix
confusion_matrix(y_test,y_pred_acc)

The hyperparameters we tuned are:
1. Penalty: l1 or l2, which specifies the norm used in the penalization.
2. C: Inverse of regularization strength; smaller values of C specify stronger regularization.
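To see why a smaller C means stronger regularization, consider one common way of writing the regularized objective: C times the data loss, plus the l1 or l2 norm penalty on the weights. The sketch below assumes that formulation. The weights and the data-loss value are hypothetical numbers chosen only for illustration.

```python
def penalty_term(weights, penalty):
    """l1: sum of absolute weights; l2: half the sum of squared weights."""
    if penalty == "l1":
        return sum(abs(w) for w in weights)
    if penalty == "l2":
        return 0.5 * sum(w * w for w in weights)
    raise ValueError(f"unknown penalty: {penalty}")

def regularized_objective(data_loss, weights, C, penalty):
    """Assumed form of the objective: C * data_loss + penalty(weights)."""
    return C * data_loss + penalty_term(weights, penalty)

weights = [2.0, -1.0, 0.5]   # hypothetical coefficients
data_loss = 10.0             # hypothetical summed log-loss on the training data

penalty_shares = []
for C in (0.01, 1, 25):
    total = regularized_objective(data_loss, weights, C, "l2")
    share = penalty_term(weights, "l2") / total
    penalty_shares.append(share)
    print(f"C={C}: penalty is {share:.1%} of the objective")
```

As C shrinks, the penalty makes up a larger share of the objective, so the optimizer is pushed harder towards small weights.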

Also, GridSearchCV has a scoring parameter that specifies the metric used to evaluate each candidate model (we have chosen recall). From the confusion matrix above, we can see that the number of false negatives has decreased, but at the cost of more false positives. The recall after grid search has risen from 88.2% to 91.1%, whereas the precision has dropped from 98.3% to 87.3%.

In [None]:
grid_clf_acc.cv_results_

In [None]:
grid_clf_acc.best_score_

In [None]:
grid_clf_acc.best_params_

In [None]:
grid_clf_acc.best_estimator_

In [None]:
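#Mean cross-validated recall per candidate ('mean_train_score' only exists when return_train_score=True)
scores = np.array(grid_clf_acc.cv_results_['mean_test_score'])
#Candidates are ordered with C as the outer loop and penalty as the inner loop
scores = scores.reshape(len(Cs), len(pens))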

for ind, i in enumerate(Cs):
    plt.plot(pens, scores[ind], label='C: ' + str(i))
plt.legend()
plt.xlabel('Penalty')
plt.ylabel('Mean score')
plt.show()