# Classification Models Practice

About data set: this is classic data set for classification, features describe characteristics of the cell nuclei present in the image, 10 real-valued features are computed for each cell nucleus. 
The goal is diagnosis (4 = malignant, 2 = benign).
Resource: https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic)

## Importing the libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
import warnings
warnings.filterwarnings("ignore")

## Importing the dataset

In [3]:
dataset = pd.read_csv('breast-cancer-wisconsin.data', 
                      names = ['ID', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape', 
                                 'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin', 
                                 'Normal Nucleoli', 'Mitoses', 'Class' ])


In [4]:
dataset.dtypes

ID                             object
Clump Thickness                 int64
Uniformity of Cell Size         int64
Uniformity of Cell Shape        int64
Marginal Adhesion               int64
Single Epithelial Cell Size     int64
Bare Nuclei                    object
Bland Chromatin                 int64
Normal Nucleoli                 int64
Mitoses                         int64
Class                           int64
dtype: object

In [5]:
dataset.shape

(699, 11)

## Data Preprocessing

The data set has missed values in 'Bare Nuclei' (I decided replace into mean value) and it is better encode dependent variable class - instead 2 and 4 assign 0 and 1

In [6]:
# replace ? into Nan to calculate mean
dataset['Bare Nuclei'].replace({"?": np.nan}, inplace=True)

In [7]:
dataset['Bare Nuclei'] = pd.to_numeric(dataset['Bare Nuclei'])

In [8]:
dataset['Bare Nuclei']= dataset['Bare Nuclei'].fillna(dataset['Bare Nuclei'].mean(skipna=True)).astype(np.int64)

## Splitting data into features and dependent variable

In [9]:
X = dataset.iloc[:, 1:-1].values # exclude ID
y = dataset.iloc[:, -1].values

In [10]:
transdict = {2: 0, 4: 1}
y = np.array([transdict[x] for x in y])

## Splitting the dataset into the Training set and Test set

In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

## Feature Scaling

We should use feature scaling for all models where we use distance - to avoid dominating one of the variables. Tree and forest don't need it but it won't spoil results. But all values are in range from 1 to 10, so we don't need it

# Logistic Regression

In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

classifier_log_reg = LogisticRegression()
classifier_log_reg.fit(X_train, y_train)

LogisticRegression()

In [13]:
y_pred = classifier_log_reg.predict(X_test)

In [14]:
confusion_matrix(y_test, y_pred)

array([[82,  3],
       [ 1, 54]], dtype=int64)

In [15]:
acc_score_log_reg = accuracy_score(y_test, y_pred)
acc_score_log_reg

0.9714285714285714

Indeed, great result, only 4 wrong predictions and only one of them is second type error, lets see coeficients to see significance of features

In [16]:
classifier_log_reg.coef_

array([[ 0.57437269, -0.03111522,  0.30197919,  0.38782657,  0.15825968,
         0.40505614,  0.375729  ,  0.15414341,  0.4853966 ]])

The most impact make first and 6st feature - Clump Thickness and Mitoses.   

# K-NN

In [17]:
from sklearn.neighbors import KNeighborsClassifier

classifier_knn = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier_knn.fit(X_train, y_train)


KNeighborsClassifier()

In [18]:
y_pred = classifier_knn.predict(X_test)

In [19]:
confusion_matrix(y_test, y_pred)

array([[83,  2],
       [ 1, 54]], dtype=int64)

In [20]:
acc_score_knn = accuracy_score(y_test, y_pred)
acc_score_knn

0.9785714285714285

k-NN beats Logisitic Regression on one observation. Five neighbours is optimal in our occasion  

# SVM 

In [21]:
from sklearn.svm import SVC

classifier_SVM_lin = SVC(kernel = 'linear')
classifier_SVM_lin.fit(X_train, y_train)

SVC(kernel='linear')

In [22]:
y_pred = classifier_SVM_lin.predict(X_test)

In [23]:
confusion_matrix(y_test, y_pred)

array([[82,  3],
       [ 1, 54]], dtype=int64)

In [24]:
acc_score_SVM_lin = accuracy_score(y_test, y_pred)
acc_score_SVM_lin

0.9714285714285714

Linear SVM gives result like Log. Reg., lets now see SVM with different non linear kernels

# Kernel SVM

In [25]:
from sklearn.svm import SVC

kernels = ['poly', 'rbf', 'sigmoid']

In [26]:
for k in kernels:
    classifier_SVM = SVC(kernel = k) 
    classifier_SVM.fit(X_train, y_train)
    y_pred = classifier_SVM.predict(X_test)
    
    print('Kernel :', k)
    print(confusion_matrix(y_test, y_pred))
    print('Score : ', accuracy_score(y_test, y_pred))

Kernel : poly
[[82  3]
 [ 1 54]]
Score :  0.9714285714285714
Kernel : rbf
[[82  3]
 [ 1 54]]
Score :  0.9714285714285714
Kernel : sigmoid
[[57 28]
 [54  1]]
Score :  0.4142857142857143


According to results, it is better use kernel of sigmoid or rbf

In [27]:
classifier_SVM_rbf = SVC(kernel = 'rbf') #‘poly’, ‘rbf’, ‘sigmoid’,
classifier_SVM_rbf.fit(X_train, y_train)
y_pred = classifier_SVM_rbf.predict(X_test)

In [28]:
confusion_matrix(y_test, y_pred)

array([[82,  3],
       [ 1, 54]], dtype=int64)

In [29]:
acc_score_SVM_rbf = accuracy_score(y_test, y_pred)
acc_score_SVM_rbf

0.9714285714285714

# Naive Bayes

In [30]:
from sklearn.naive_bayes import GaussianNB
 
classifier_Bayes = GaussianNB()
classifier_Bayes.fit(X_train, y_train)

GaussianNB()

In [31]:
y_pred = classifier_Bayes.predict(X_test)

In [32]:
confusion_matrix(y_test, y_pred)

array([[80,  5],
       [ 1, 54]], dtype=int64)

In [33]:
acc_score_bayes = accuracy_score(y_test, y_pred)
acc_score_bayes

0.9571428571428572

Naive Bayes is bad in predicting 'benign' in this situation

# Decision Tree CLassifier

In [34]:
from sklearn.tree import DecisionTreeClassifier
 
classifier_tree = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier_tree.fit(X_train, y_train)


DecisionTreeClassifier(criterion='entropy', random_state=0)

In [35]:
y_pred = classifier_tree.predict(X_test)

In [36]:
confusion_matrix(y_test, y_pred)

array([[80,  5],
       [ 5, 50]], dtype=int64)

In [37]:
acc_score_tree = accuracy_score(y_test, y_pred)
acc_score_tree

0.9285714285714286

Decision tree gives worse result than previous models, it can be because of lack of data

# Random Forest Classifier

In [38]:
from sklearn.ensemble import RandomForestClassifier
 
classifier_forest = RandomForestClassifier(n_estimators = 5, criterion = 'gini', random_state = 0)
classifier_forest.fit(X_train, y_train)

RandomForestClassifier(n_estimators=5, random_state=0)

In [39]:
y_pred = classifier_forest.predict(X_test)

In [40]:
confusion_matrix(y_test, y_pred)

array([[83,  2],
       [ 3, 52]], dtype=int64)

In [41]:
acc_score_forest = accuracy_score(y_test, y_pred)
acc_score_forest

0.9642857142857143

Random forest Classifier gives 97.8% with criterion gini and 5 trees, also with criterion entropy with 13 trees

# Summarizing results

In [42]:
print(f"score for Logistic Regression: {acc_score_log_reg} % accuracy")
print(f"score for K-NN: {acc_score_knn} % accuracy")
print(f"score for SVM linear: {acc_score_SVM_lin} % accuracy")
print(f"score for SVM kernel: {acc_score_SVM_rbf} % accuracy")
print(f"score for Naive Bayes: {acc_score_bayes} % accuracy")
print(f"score for Decision Tree: {acc_score_tree} % accuracy")
print(f"score for Random Forest: {acc_score_forest} % accuracy")


score for Logistic Regression: 0.9714285714285714 % accuracy
score for K-NN: 0.9785714285714285 % accuracy
score for SVM linear: 0.9714285714285714 % accuracy
score for SVM kernel: 0.9714285714285714 % accuracy
score for Naive Bayes: 0.9571428571428572 % accuracy
score for Decision Tree: 0.9285714285714286 % accuracy
score for Random Forest: 0.9642857142857143 % accuracy


As we can see, best results give k-NN, RandomForest. I would prefer k-NN if we had a lot data because it is fast and efficient, Random forest is powerful and accurate model at all, but it need more tunning hyperparameters than k-NN.

# Cross-validation and GridSearch

And now, I wanna find best hyperparameters for each model using GridSearchCV and compare their accuracy scores and see the contribution of this approach to determining the best model experimentally. After that use XgBoost and Catboost and to see if the beat results of previous models. And finding materials about best metrics in classification tasks recommend using F1-score for evaluating models

In [43]:
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.metrics import f1_score

# Logistic Regression

In [44]:
parameters = [{'solver': ['liblinear'],'penalty': ['l1', 'l2']}, # I think it will choose liblinear because it fits for small data set
             {'solver': ['lbfgs'],'penalty': ['none', 'l2']}]
grid_search = GridSearchCV(estimator =  LogisticRegression(),
                           param_grid = parameters,
                           scoring = 'f1',
                           cv = 10,
                           n_jobs = -1)
grid_search.fit(X_train, y_train)
best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_
print("Best F1-score: {:.2f} ".format(best_accuracy))
print("Best Parameters:", best_parameters)

Best F1-score: 0.95 
Best Parameters: {'penalty': 'none', 'solver': 'lbfgs'}


In [45]:
classifier = LogisticRegression(**best_parameters)
classifier.fit(X_train, y_train)
print("Test set F1-score: {:.2f} ".format(f1_score(y_test, classifier.predict(X_test))))

Test set F1-score: 0.96 


# K-NN

In [46]:
parameters = [{'n_neighbors': [3, 5, 7, 9, 11], 'weights': ['uniform', 'distance'], 'metric': ['minkowski'], 
               'p': [1,2,3]}]
grid_search = GridSearchCV(estimator =  KNeighborsClassifier(),
                           param_grid = parameters,
                           scoring = 'f1',
                           cv = 10,
                           n_jobs = -1)
grid_search.fit(X_train, y_train)
best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_
print("Best F1-score: {:.2f} ".format(best_accuracy))
print("Best Parameters:", best_parameters)

Best F1-score: 0.95 
Best Parameters: {'metric': 'minkowski', 'n_neighbors': 7, 'p': 2, 'weights': 'distance'}


In [47]:
classifier = KNeighborsClassifier(**best_parameters)
classifier.fit(X_train, y_train)
print("Test set F1-score: {:.2f} ".format(f1_score(y_test, classifier.predict(X_test))))

Test set F1-score: 0.97 


# SVM 

In [48]:
parameters = [{'C': [0.25, 0.5, 0.75], 'kernel': ['linear']},
              {'C': [0.25, 0.5, 0.75], 'kernel': ['rbf'], 'gamma': [0.1,  0.3,  0.5,  0.7, 0.9]},
             {'C': [0.25, 0.5, 0.75], 'kernel': ['poly'], 'gamma': [0.1,  0.3,  0.5,  0.7, 0.9], 
              'degree': [2, 3, 4, 5, 6, 7], 'coef0': [0, 1, 3, 5]},
             {'C': [0.25, 0.5, 0.75], 'kernel': ['sigmoid'], 'gamma': [0.1,  0.3,  0.5,  0.7, 0.9], 
              'coef0': [0, 1, 3, 5]}]
grid_search = GridSearchCV(estimator =  SVC(),
                           param_grid = parameters,
                           scoring = 'f1',
                           cv = 10,
                           n_jobs = -1)
grid_search.fit(X_train, y_train)
best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_
print("Best F1-score: {:.2f} ".format(best_accuracy))
print("Best Parameters:", best_parameters)

Best F1-score: 0.95 
Best Parameters: {'C': 0.75, 'kernel': 'linear'}


In [49]:
classifier = SVC(**best_parameters)
classifier.fit(X_train, y_train)
print("Test set F1-score: {:.2f} ".format(f1_score(y_test, classifier.predict(X_test))))

Test set F1-score: 0.96 


# Naive Bayes

In [50]:
parameters = [{'var_smoothing': np.logspace(0,-9, num=100)}]
grid_search = GridSearchCV(estimator =  GaussianNB(),
                           param_grid = parameters,
                           scoring = 'f1',
                           cv = 10,
                           n_jobs = -1)
grid_search.fit(X_train, y_train)
best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_
print("Best F1-score: {:.2f} ".format(best_accuracy))
print("Best Parameters:", best_parameters)

Best F1-score: 0.95 
Best Parameters: {'var_smoothing': 0.12328467394420659}


In [51]:
classifier = GaussianNB(**best_parameters)
classifier.fit(X_train, y_train)
print("Test set F1-score: {:.2f} ".format(f1_score(y_test, classifier.predict(X_test))))

Test set F1-score: 0.96 


# Decision Tree CLassifier

In [52]:
parameters = [{'criterion': ['gini', 'entropy']}]
grid_search = GridSearchCV(estimator =  DecisionTreeClassifier(random_state = 0),
                           param_grid = parameters,
                           scoring = 'f1',
                           cv = 10,
                           n_jobs = -1)
grid_search.fit(X_train, y_train)
best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_
print("Best F1-score: {:.2f} ".format(best_accuracy))
print("Best Parameters:", best_parameters)

Best F1-score: 0.91 
Best Parameters: {'criterion': 'entropy'}


In [53]:
classifier = DecisionTreeClassifier(random_state = 0, **best_parameters)
classifier.fit(X_train, y_train)
print("Test set F1-score: {:.2f} ".format(f1_score(y_test, classifier.predict(X_test))))

Test set F1-score: 0.91 


# Random Forest Classifier

In [54]:
parameters = [{'n_estimators': [5, 8, 10, 12, 15, 18, 20, 25], 'criterion': ['gini', 'entropy']} ]
grid_search = GridSearchCV(estimator = RandomForestClassifier(random_state = 0),
                           param_grid = parameters,
                           scoring = 'f1',
                           cv = 10,
                           n_jobs = -1)
grid_search.fit(X_train, y_train)
best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_
print("Best F1-score: {:.2f} ".format(best_accuracy))
print("Best Parameters:", best_parameters)

Best F1-score: 0.94 
Best Parameters: {'criterion': 'entropy', 'n_estimators': 20}


In [55]:
classifier = RandomForestClassifier(random_state = 0, **best_parameters)
classifier.fit(X_train, y_train)
print("Test set F1-score: {:.2f} ".format(f1_score(y_test, classifier.predict(X_test))))

Test set F1-score: 0.96 


Conclusion. So models with best hyperparameters was found and the best models with F1-score score on training are K-NN, Naive Bayes, SVM, Logistic regression and on test set the leader is K-NN. But all models has great results 0.91-0.96. Thanks to these algorithms we can be sure to choose right hyperparameters and evaluate models carefully.

# XgBoost and Catboost

These methods are known for their power in both regression and classification, it is interesting to see if they give the best results. The second should be good for categorical data, but the sample may be small for this algorithm.

In [56]:
from xgboost import XGBClassifier
accuracies = cross_val_score(estimator = XGBClassifier(verbosity = 0), X = X_train, y = y_train, cv = 10, scoring = 'f1')
print("Best F1-score: {:.2f} ".format(accuracies.mean()))
print("Standard Deviation: {:.2f} ".format(accuracies.std()))

Best F1-score: 0.93 
Standard Deviation: 0.03 


In [57]:
regressor = XGBClassifier()
regressor.fit(X_train, y_train)
print("Test set F1-score: {:.2f} ".format(f1_score(y_test, classifier.predict(X_test))))

Test set F1-score: 0.96 


In [58]:
from catboost import CatBoostClassifier
accuracies = cross_val_score(estimator = CatBoostClassifier(metric_period = 200), X = X_train, y = y_train, cv = 10, scoring = 'f1')
print("Best F1-score: {:.2f} ".format(accuracies.mean()))
print("Standard Deviation: {:.2f} ".format(accuracies.std()))

Learning rate set to 0.007683
0:	learn: 0.6781819	total: 158ms	remaining: 2m 38s
200:	learn: 0.0594538	total: 516ms	remaining: 2.05s
400:	learn: 0.0270740	total: 839ms	remaining: 1.25s
600:	learn: 0.0161270	total: 1.17s	remaining: 779ms
800:	learn: 0.0106805	total: 1.5s	remaining: 373ms
999:	learn: 0.0075683	total: 1.82s	remaining: 0us
Learning rate set to 0.007683
0:	learn: 0.6788662	total: 1.82ms	remaining: 1.81s
200:	learn: 0.0732659	total: 329ms	remaining: 1.31s
400:	learn: 0.0374880	total: 652ms	remaining: 974ms
600:	learn: 0.0229140	total: 985ms	remaining: 654ms
800:	learn: 0.0153208	total: 1.32s	remaining: 327ms
999:	learn: 0.0108368	total: 1.64s	remaining: 0us
Learning rate set to 0.007683
0:	learn: 0.6787709	total: 1.73ms	remaining: 1.73s
200:	learn: 0.0698130	total: 328ms	remaining: 1.3s
400:	learn: 0.0342041	total: 656ms	remaining: 980ms
600:	learn: 0.0211700	total: 982ms	remaining: 652ms
800:	learn: 0.0140734	total: 1.41s	remaining: 351ms
999:	learn: 0.0099887	total: 1.83s	

In [59]:
regressor = CatBoostClassifier(metric_period = 200)
regressor.fit(X_train, y_train)
print("Test set F1-score: {:.2f} ".format(f1_score(y_test, classifier.predict(X_test))))

Learning rate set to 0.008037
0:	learn: 0.6778431	total: 6.03ms	remaining: 6.02s
200:	learn: 0.0698486	total: 392ms	remaining: 1.56s
400:	learn: 0.0357454	total: 728ms	remaining: 1.09s
600:	learn: 0.0221856	total: 1.07s	remaining: 708ms
800:	learn: 0.0146790	total: 1.42s	remaining: 353ms
999:	learn: 0.0104182	total: 1.76s	remaining: 0us
Test set F1-score: 0.96 


Conclusion. XGBClassifier and CatBoostClassifier didn't beat k-NN F1-score it can be explained by a small sample of observations, but usually they are best ones