# Breast Cancer Detection

- Detects breast cancer using multiple models: Logistic Regression (LR), SVM, K-NN, Random Forest, a neural network (NN), and Naive Bayes, plus a hard-voting ensemble of LR, SVM, K-NN, and Random Forest.
- Compares the models using evaluation metrics (confusion matrix, classification report, and ROC AUC score) and k-fold cross-validation.

### Importing the required libraries

In [24]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import cross_val_score

### Importing the dataset

In [25]:
dataset = pd.read_csv("breast_cancer.csv")
X = dataset.iloc[ :  , 1:-1].values
y = dataset.iloc[ :  , -1].values
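
The slicing above keeps every column except the first and the last as features, and takes the last column as the target. A minimal sketch of that behaviour on a tiny hypothetical frame follows; the column names and values are illustrative assumptions, not the contents of `breast_cancer.csv`.

```python
import pandas as pd

# Hypothetical three-row frame; column names and values are made up for illustration.
demo = pd.DataFrame({
    "sample_id": [1001, 1002, 1003],  # first column: dropped by the 1:-1 slice
    "feature_a": [5, 3, 8],
    "feature_b": [1, 1, 10],
    "label": [2, 2, 4],               # last column: used as the target
})

X_demo = demo.iloc[:, 1:-1].values  # all rows, columns 1 up to (not including) the last
y_demo = demo.iloc[:, -1].values    # all rows, last column only

print(X_demo.shape)  # (3, 2)
print(y_demo)        # [2 2 4]
```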

### Splitting the dataset into the Training set and Test set

In [26]:
x_train , x_test , y_train , y_test = train_test_split(X , y , test_size = 0.2, random_state = 0)

### Training multiple models on the Training set

In [27]:
# Training the logistic regression (lr) model
classifier_lr = LogisticRegression(random_state = 0)
classifier_lr.fit(x_train , y_train)

# Training the SVM model
classifier_svm = SVC(kernel= 'linear' , random_state = 0)
classifier_svm.fit(x_train , y_train)

# Training the K-NN model
classifier_knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier_knn.fit(x_train, y_train)

# Training the Naive Bayes (nb) model
classifier_nb = GaussianNB()
classifier_nb.fit(x_train , y_train)

# Training the Neural Network (nn) model
classifier_nn = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, activation='relu', solver='adam', random_state=1)
classifier_nn.fit(x_train, y_train)

# Training the Random Forest (rf) model
classifier_rf = RandomForestClassifier(n_estimators=100, criterion='gini', random_state=0)
classifier_rf.fit(x_train, y_train)

# Training the Ensemble Model
ensemble_model = VotingClassifier(estimators=[('lr', classifier_lr), ('svm', classifier_svm), ('knn', classifier_knn) , ('rf' , classifier_rf)], voting='hard')
ensemble_model.fit(x_train, y_train)
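
With `voting='hard'`, the ensemble predicts the class that most of its member models vote for. Because this ensemble has four members, 2-2 ties are possible; scikit-learn documents that ties are resolved in ascending order of class label. The sketch below reproduces that rule by hand with NumPy. The vote table is hypothetical, and `hard_vote` is an illustrative helper, not part of scikit-learn.

```python
import numpy as np

# Hypothetical predictions for five samples, one row per member model.
# The labels 2 and 4 match the labels printed elsewhere in this notebook.
votes = np.array([
    [2, 4, 4, 2, 4],  # lr
    [2, 4, 2, 2, 4],  # svm
    [2, 4, 4, 4, 2],  # knn
    [2, 2, 2, 4, 2],  # rf
])

classes = np.array([2, 4])  # class labels in ascending order


def hard_vote(votes, classes):
    """Majority vote per sample; a tie goes to the class that sorts first."""
    # counts[i, j] = number of models that predicted classes[i] for sample j
    counts = np.array([(votes == c).sum(axis=0) for c in classes])
    # argmax returns the first maximum, so a tie resolves to the lower label
    return classes[counts.argmax(axis=0)]


majority = hard_vote(votes, classes)
print(majority)  # [2 4 2 2 2]
```

Under this rule, the last three samples are 2-2 ties and all resolve to label 2. An odd number of members, or `voting='soft'` with probability-capable models, would avoid such ties.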

### Predicting the Test set results using multiple models

In [28]:
# Predicting the results for Logistic Regression (lr)
y_predict_lr = classifier_lr.predict(x_test)
print(f"Logistic Regression - Prediction results:\n{y_predict_lr}\n")

# Predicting the results for SVM
y_predict_svm = classifier_svm.predict(x_test)
print(f"SVM - Prediction results:\n{y_predict_svm}\n")

# Predicting the results for K-NN
y_predict_knn = classifier_knn.predict(x_test)
print(f"K-NN - Prediction results:\n{y_predict_knn}\n")

# Predicting the results for Naive Bayes (nb)
y_predict_nb = classifier_nb.predict(x_test)
print(f"Naive Bayes - Prediction results:\n{y_predict_nb}\n")

# Predicting the results for Neural Network (nn)
y_predict_nn = classifier_nn.predict(x_test)
print(f"Neural Network - Prediction results:\n{y_predict_nn}\n")

# Predicting the results for Random Forest (rf)
y_predict_rf = classifier_rf.predict(x_test)
print(f"Random Forest - Prediction results:\n{y_predict_rf}\n")

# Predicting the results for Ensemble Model
y_predict_ensemble = ensemble_model.predict(x_test)
print(f"Ensemble Model - Prediction results:\n{y_predict_ensemble}\n")

Logistic Regression - Prediction results:
[2 2 4 4 2 2 2 4 2 2 4 2 4 2 2 2 4 4 4 2 2 2 4 2 4 4 2 2 2 4 2 4 4 2 2 2 4
 4 2 4 2 2 2 2 2 2 2 4 2 2 4 2 4 2 2 2 4 4 2 4 2 2 2 2 2 2 2 2 4 4 2 2 2 2
 2 2 4 2 2 2 4 2 4 2 2 4 2 4 4 2 4 2 4 4 2 4 4 4 4 2 2 2 4 4 2 2 4 2 2 2 4
 2 2 4 2 2 2 2 2 2 2 4 2 2 4 4 2 4 2 4 2 2 4 2 2 4 2]

SVM - Prediction results:
[2 2 4 4 2 2 2 4 2 2 4 2 4 2 2 4 4 4 4 2 2 2 4 2 4 4 2 2 2 4 2 4 4 2 2 2 4
 4 2 4 2 2 2 2 2 2 2 4 2 2 4 2 4 2 2 2 4 4 2 4 2 2 2 2 2 2 2 2 4 4 2 2 2 2
 2 2 4 2 2 2 4 2 4 2 2 4 2 4 4 2 4 2 4 4 2 4 4 4 4 2 2 2 4 4 2 2 4 4 2 2 4
 2 2 4 2 2 2 2 2 2 2 4 2 2 4 4 2 4 2 4 2 2 4 2 2 4 2]

K-NN - Prediction results:
[2 2 4 4 2 2 2 4 2 2 4 2 4 2 2 2 4 4 4 2 2 2 4 2 4 4 2 2 2 4 2 4 4 2 2 2 4
 4 2 4 2 2 2 2 2 2 2 4 2 2 4 2 4 2 2 2 4 4 2 4 2 2 2 2 2 2 2 2 4 4 2 2 2 2
 2 2 4 2 2 2 4 2 4 2 2 4 2 4 4 2 4 2 4 4 4 4 4 4 4 2 2 2 4 4 2 2 4 4 2 2 4
 2 2 4 2 2 2 2 2 2 2 4 2 2 4 4 2 4 2 4 2 2 4 2 2 4 2]

Naive Bayes - Prediction results:
[2 2 4 4 2 2 2 4 2 2 4 2 4 2 2 

### Making the Confusion Matrix for each model used

In [29]:
# Confusion matrix for Logistic Regression
cm_lr = confusion_matrix(y_test , y_predict_lr)
print(f"Logistic Regression - Confusion Matrix:\n{cm_lr}\n")

# Confusion matrix for SVM
cm_svm = confusion_matrix(y_test , y_predict_svm)
print(f"SVM - Confusion Matrix:\n{cm_svm}\n")

# Confusion matrix for K-NN
cm_knn = confusion_matrix(y_test , y_predict_knn)
print(f"K-NN - Confusion Matrix:\n{cm_knn}\n")

# Confusion matrix for Naive Bayes
cm_nb = confusion_matrix(y_test , y_predict_nb)
print(f"Naive Bayes - Confusion Matrix:\n{cm_nb}\n")

# Confusion matrix for Neural Network
cm_nn = confusion_matrix(y_test , y_predict_nn)
print(f"Neural Network - Confusion Matrix:\n{cm_nn}\n")

# Confusion matrix for Random Forest
cm_rf = confusion_matrix(y_test , y_predict_rf)
print(f"Random Forest - Confusion Matrix:\n{cm_rf}\n")

# Confusion matrix for Ensemble Model
cm_ensemble = confusion_matrix(y_test , y_predict_ensemble)
print(f"Ensemble Model - Confusion Matrix:\n{cm_ensemble}\n")

Logistic Regression - Confusion Matrix:
[[84  3]
 [ 3 47]]

SVM - Confusion Matrix:
[[83  4]
 [ 2 48]]

K-NN - Confusion Matrix:
[[84  3]
 [ 1 49]]

Naive Bayes - Confusion Matrix:
[[80  7]
 [ 0 50]]

Neural Network - Confusion Matrix:
[[84  3]
 [ 2 48]]

Random Forest - Confusion Matrix:
[[84  3]
 [ 1 49]]

Ensemble Model - Confusion Matrix:
[[84  3]
 [ 3 47]]
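
To make the matrices above easier to compare, the sketch below derives accuracy, sensitivity, and specificity from the printed values. It assumes scikit-learn's layout (rows are true labels, columns are predicted labels, both in ascending order) and treats label 4 as the positive class. Reading 2 as benign and 4 as malignant is an assumption based on the original Wisconsin breast cancer data; this notebook does not state it.

```python
import numpy as np

# Confusion matrices copied from the output above.
# Rows: true label (2, 4). Columns: predicted label (2, 4).
matrices = {
    "Logistic Regression": [[84, 3], [3, 47]],
    "SVM": [[83, 4], [2, 48]],
    "K-NN": [[84, 3], [1, 49]],
    "Naive Bayes": [[80, 7], [0, 50]],
    "Neural Network": [[84, 3], [2, 48]],
    "Random Forest": [[84, 3], [1, 49]],
    "Ensemble Model": [[84, 3], [3, 47]],
}


def summarize(cm):
    """Return accuracy, sensitivity and specificity, with label 4 as positive."""
    cm = np.asarray(cm)
    tn, fp, fn, tp = cm.ravel()
    return {
        "accuracy": (tp + tn) / cm.sum(),
        "sensitivity": tp / (tp + fn),  # recall for label 4
        "specificity": tn / (tn + fp),  # recall for label 2
        "errors": int(fp + fn),
    }


for name, cm in matrices.items():
    s = summarize(cm)
    print(f"{name:20s} errors={s['errors']}  acc={s['accuracy']:.3f}  "
          f"sens={s['sensitivity']:.3f}  spec={s['specificity']:.3f}")
```

Under these assumptions, Naive Bayes misses no label-4 cases but produces the most false positives, while K-NN and Random Forest make the fewest errors overall (4 of 137).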



### Computing the classification report and ROC AUC score for each model used

In [30]:
# Logistic Regression - Classification Report and ROC-AUC Score
cls_rep_lr = classification_report(y_test, y_predict_lr)
print("Logistic Regression - Classification Report:\n", cls_rep_lr)
roc_auc_lr = roc_auc_score(y_test, y_predict_lr)
print("Logistic Regression - ROC AUC Score: {:.2f}".format(roc_auc_lr),"\n")

# SVM - Classification Report and ROC-AUC Score
cls_rep_svm =  classification_report(y_test, y_predict_svm)
print("SVM - Classification Report:\n", cls_rep_svm)
roc_auc_svm = roc_auc_score(y_test, y_predict_svm)
print("SVM - ROC AUC Score: {:.2f}".format(roc_auc_svm),"\n")

# KNN - Classification Report and ROC-AUC Score
cls_rep_knn = classification_report(y_test, y_predict_knn)
print("KNN - Classification Report:\n", cls_rep_knn)
roc_auc_knn = roc_auc_score(y_test, y_predict_knn)
print("KNN - ROC AUC Score: {:.2f}".format(roc_auc_knn),"\n")

# Naive Bayes - Classification Report and ROC-AUC Score
cls_rep_nb = classification_report(y_test, y_predict_nb)
print("Naive Bayes - Classification Report:\n", cls_rep_nb)
roc_auc_nb = roc_auc_score(y_test, y_predict_nb)
print("Naive Bayes - ROC AUC Score: {:.2f}".format(roc_auc_nb),"\n")

# Neural Network - Classification Report and ROC-AUC Score
cls_rep_nn =  classification_report(y_test, y_predict_nn)
print("Neural Network - Classification Report:\n", cls_rep_nn)
roc_auc_nn = roc_auc_score(y_test, y_predict_nn)
print("Neural Network - ROC AUC Score: {:.2f}".format(roc_auc_nn),"\n")

# Random Forest - Classification Report and ROC-AUC Score
cls_rep_rf =  classification_report(y_test, y_predict_rf)
print("Random Forest - Classification Report:\n", cls_rep_rf)
roc_auc_rf = roc_auc_score(y_test, y_predict_rf)
print("Random Forest - ROC AUC Score: {:.2f}".format(roc_auc_rf),"\n")

# Ensemble Model - Classification Report and ROC-AUC Score
cls_rep_ensemble =  classification_report(y_test, y_predict_ensemble)
print("Ensemble Model - Classification Report:\n", cls_rep_ensemble)
roc_auc_ensemble = roc_auc_score(y_test, y_predict_ensemble)
print("Ensemble Model - ROC AUC Score: {:.2f}".format(roc_auc_ensemble),"\n")

Logistic Regression - Classification Report:
               precision    recall  f1-score   support

           2       0.97      0.97      0.97        87
           4       0.94      0.94      0.94        50

    accuracy                           0.96       137
   macro avg       0.95      0.95      0.95       137
weighted avg       0.96      0.96      0.96       137

Logistic Regression - ROC AUC Score: 0.95 

SVM - Classification Report:
               precision    recall  f1-score   support

           2       0.98      0.95      0.97        87
           4       0.92      0.96      0.94        50

    accuracy                           0.96       137
   macro avg       0.95      0.96      0.95       137
weighted avg       0.96      0.96      0.96       137

SVM - ROC AUC Score: 0.96 

KNN - Classification Report:
               precision    recall  f1-score   support

           2       0.99      0.97      0.98        87
           4       0.94      0.98      0.96        50
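
The ROC AUC scores above are computed from hard class predictions rather than from probabilities or decision scores. In that case the ROC curve has a single interior point, and the area reduces to the mean of the two per-class recalls (the balanced accuracy). For Logistic Regression, for example, (84/87 + 47/50) / 2 ≈ 0.95, which matches the printed score. The sketch below illustrates the difference on synthetic data; the dataset and model settings are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the notebook's CSV).
X, y = make_classification(n_samples=500, n_features=9, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
labels = model.predict(X_test)              # hard 0/1 predictions
scores = model.predict_proba(X_test)[:, 1]  # probability of the positive class

auc_from_labels = roc_auc_score(y_test, labels)
auc_from_scores = roc_auc_score(y_test, scores)
bal_acc = balanced_accuracy_score(y_test, labels)

print(f"AUC from hard labels:   {auc_from_labels:.3f}")
print(f"Balanced accuracy:      {bal_acc:.3f}")
print(f"AUC from probabilities: {auc_from_scores:.3f}")
```

Passing probabilities (or `decision_function` output for a model such as `SVC`) lets the metric use the model's ranking of samples, which is the usual way to report ROC AUC. A hard-voting ensemble exposes neither, so it can only be scored from labels.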

    

### Computing the accuracy of each model with k-Fold Cross Validation

In [31]:
# Cross-validation for Logistic Regression
accuracies_lr = cross_val_score(estimator = classifier_lr , X = x_train , y = y_train , cv = 10)

# Will print the average / mean of the 10 accuracies:
print("Logistic Regression - Accuracy: {:.2f} %".format(accuracies_lr.mean() * 100))

# Will print the standard deviation of the accuracies:
print("Logistic Regression - Standard Deviation: {:.2f} %".format(accuracies_lr.std() * 100),"\n")

# Cross-validation for SVM
accuracies_svm = cross_val_score(estimator=classifier_svm, X=x_train, y=y_train, cv=10)
print("SVM - Accuracy: {:.2f} %".format(accuracies_svm.mean() * 100))
print("SVM - Standard Deviation: {:.2f} %".format(accuracies_svm.std() * 100),"\n")

# Cross-validation for K-NN
accuracies_knn = cross_val_score(estimator=classifier_knn, X=x_train, y=y_train, cv=10)
print("KNN - Accuracy: {:.2f} %".format(accuracies_knn.mean() * 100))
print("KNN - Standard Deviation: {:.2f} %".format(accuracies_knn.std() * 100),"\n")

# Cross-validation for Naive Bayes
accuracies_nb = cross_val_score(estimator=classifier_nb, X=x_train, y=y_train, cv=10)
print("Naive Bayes - Accuracy: {:.2f} %".format(accuracies_nb.mean() * 100))
print("Naive Bayes - Standard Deviation: {:.2f} %".format(accuracies_nb.std() * 100),"\n")

# Cross-validation for Neural Network
accuracies_nn = cross_val_score(estimator=classifier_nn, X=x_train, y=y_train, cv=10)
print("Neural Network - Accuracy: {:.2f} %".format(accuracies_nn.mean() * 100))
print("Neural Network - Standard Deviation: {:.2f} %".format(accuracies_nn.std() * 100),"\n")

# Cross-validation for Random Forest
accuracies_rf = cross_val_score(estimator=classifier_rf, X=x_train, y=y_train, cv=10)
print("Random Forest - Accuracy: {:.2f} %".format(accuracies_rf.mean() * 100))
print("Random Forest - Standard Deviation: {:.2f} %".format(accuracies_rf.std() * 100),"\n")

# Cross-validation for Ensemble Model
accuracies_ensemble = cross_val_score(estimator= ensemble_model, X=x_train, y=y_train, cv=10)
print("Ensemble Model - Accuracy: {:.2f} %".format(accuracies_ensemble.mean() * 100))
print("Ensemble Model - Standard Deviation: {:.2f} %".format(accuracies_ensemble.std() * 100),"\n")

Logistic Regression - Accuracy: 96.70 %
Logistic Regression - Standard Deviation: 1.97 % 

SVM - Accuracy: 97.07 %
SVM - Standard Deviation: 2.19 % 

KNN - Accuracy: 97.44 %
KNN - Standard Deviation: 1.85 % 

Naive Bayes - Accuracy: 96.52 %
Naive Bayes - Standard Deviation: 2.24 % 

Neural Network - Accuracy: 96.71 %
Neural Network - Standard Deviation: 2.41 % 

Random Forest - Accuracy: 96.70 %
Random Forest - Standard Deviation: 2.58 % 

Ensemble Model - Accuracy: 97.07 %
Ensemble Model - Standard Deviation: 2.20 % 
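
The repeated cross-validation calls above can also be written as a loop over a dictionary of models. The sketch below does this for two of the models, and makes explicit that an integer `cv` with a classifier corresponds to stratified k-fold splitting without shuffling. It uses scikit-learn's built-in `load_breast_cancer` data as a stand-in, which is an assumption and a different dataset from `breast_cancer.csv`, so the numbers it prints are not comparable with those above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Stand-in dataset bundled with scikit-learn (not the notebook's CSV).
X, y = load_breast_cancer(return_X_y=True)

models = {
    "K-NN": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
}

# For classifiers, cv=10 is equivalent to this unshuffled stratified splitter.
cv = StratifiedKFold(n_splits=10)

results = {}
for name, model in models.items():
    fold_scores = cross_val_score(model, X, y, cv=cv)  # one accuracy per fold
    results[name] = fold_scores
    print(f"{name} - Accuracy: {fold_scores.mean() * 100:.2f} %")
    print(f"{name} - Standard Deviation: {fold_scores.std() * 100:.2f} %\n")
```

Each model is cloned and refit on nine folds and scored on the held-out fold, so the fitted classifiers from the training cell are not reused here.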



### Conclusion

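Comparing the confusion matrices, classification reports, and cross-validation accuracies of all the models, the `K-NN` model performs best overall: it has the highest mean 10-fold cross-validation accuracy (97.44 %) with the lowest standard deviation (1.85 %), and on the test set it ties with Random Forest for the fewest misclassifications (4 of 137).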