<font size="8">**Part 1: Breast Cancer Data**</font>


<font size="5">In this portion of the project you will use a dataset describing tumors from breast cancer patients. You will classify
these tumors as malignant ('M') or benign ('B') using a KNN classifier and a Naive Bayes classifier. Since this
dataset does not have any missing values, you will not need to address imputation.</font>

<font size="5">**Prepare the breastcancer_data.csv dataset**</font>

<font size="5">a. Remove the 'id' column, as it will not be helpful to the classification task.</font>

<font size="5">b. Separate the target column ('diagnosis'). You may want to convert its
values to integers (0 for benign; 1 for malignant).</font>



<font size="5">**Create and test your models**</font>

<font size="5">a. Train and test a KNN Classifier and a Naive Bayes Classifier.</font>

<font size="5">b. Report cross-validated F1, precision, and recall scores (5 folds) for each variant
in tabular form as shown below.
</font>


In [1]:
import pandas as pd
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

<font size="8">**Naive Bayes Classifier**</font>


In [2]:
# Read the CSV
df = pd.read_csv('../input/breastcancerdata/breastcancer_data.csv')
# Drop the 'id' column, which is not useful for classification
df = df.drop(columns=['id'])
# Encode the target as integers: 'M' (malignant) -> 1, 'B' (benign) -> 0
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})
# Isolate the target column
y = df['diagnosis']
# Select the feature columns: skip the first column (assumed to be 'diagnosis');
# note that the [1:-1] slice also leaves out the last column
features = df.columns[1:-1].tolist()
X = df[features]
# Naive Bayes classifier: hold out 40% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)
gnb = GaussianNB()
gnb.fit(X_train, y_train)
# Predict labels for the held-out rows
y_pred = gnb.predict(X_test)


<font size="7">**Naive Bayes - Predictions**</font>


In [3]:
#Print Predictions
print(metrics.classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.96      0.95      0.96       148
           1       0.91      0.93      0.92        80

    accuracy                           0.94       228
   macro avg       0.94      0.94      0.94       228
weighted avg       0.94      0.94      0.94       228



<font size="7">**Naive Bayes - Cross Validation**</font>

In [4]:
scores = cross_val_score(gnb, X, y, cv=5, scoring='f1_micro')
print('Cross-validated scores: {}'.format(scores))
print()
print("Average F1_micro: %0.2f" % (scores.mean()))

Cross-validated scores: [0.9122807  0.92105263 0.94736842 0.94736842 0.95575221]

Average F1_micro: 0.94
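
For single-label classification, micro-averaged F1 is the same number as accuracy, so the `f1_micro` scores above do not yet give the F1, precision, and recall of the malignant class that the task asks for. The sketch below illustrates the difference on synthetic data generated with `make_classification` (an assumption made only so the snippet runs on its own; `X_demo`, `y_demo`, and `model` are illustrative names, not part of the original notebook). The `'f1'` scorer assumes the positive class is labeled with the integer 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the tumor data (assumption: binary target, labels 0/1)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=1)

model = GaussianNB()

# Micro-averaged F1 pools all predictions, which reduces to accuracy here
f1_micro = cross_val_score(model, X_demo, y_demo, cv=5, scoring='f1_micro')
accuracy = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')

# Binary F1 scores only the positive class (label 1)
f1_binary = cross_val_score(model, X_demo, y_demo, cv=5, scoring='f1')

print('f1_micro equals accuracy:', np.allclose(f1_micro, accuracy))
print('Mean binary F1: %0.2f' % f1_binary.mean())
```

The same pattern should carry over to `X` and `y` from this notebook, provided the target is encoded as integers.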


<font size="7">**KNN Classifier**</font>

In [5]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)

<font size="7">**KNN - Predictions**</font>

In [6]:
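# NOTE: knn has not been fit yet, and y_pred still holds the Naive Bayes
# predictions from In [2], so this report repeats the Naive Bayes results
print(metrics.classification_report(y_test, y_pred))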


              precision    recall  f1-score   support

           0       0.96      0.95      0.96       148
           1       0.91      0.93      0.92        80

    accuracy                           0.94       228
   macro avg       0.94      0.94      0.94       228
weighted avg       0.94      0.94      0.94       228
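
The report above matches the Naive Bayes report line for line, because the KNN model was created but never fit or used for prediction. A minimal sketch of the missing steps follows, run on synthetic data so it is self-contained (the data and the names `X_tr`, `X_te`, `y_tr`, `y_te`, `knn_demo`, and `y_pred_knn` are assumptions for illustration). It mirrors the 60/40 split used for Naive Bayes.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the tumor data (assumption: binary target, labels 0/1)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.4, random_state=1)

# Fit KNN on the training split, then predict the held-out rows
knn_demo = KNeighborsClassifier(n_neighbors=5)
knn_demo.fit(X_tr, y_tr)
y_pred_knn = knn_demo.predict(X_te)

# Store the predictions under their own name so they cannot be
# confused with the Naive Bayes predictions
print(metrics.classification_report(y_te, y_pred_knn))
```

In the notebook itself, the equivalent would be to call `knn.fit(X_train, y_train)` and `knn.predict(X_test)` before printing the report.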



<font size="7">**KNN - Cross Validation**</font>

In [7]:
scores = cross_val_score(knn, X, y, cv=5, scoring='f1_micro')
print('Cross-validated scores: {}'.format(scores))
print()
print("Average F1_micro: %0.2f" % (scores.mean()))

Cross-validated scores: [0.88596491 0.93859649 0.93859649 0.94736842 0.92920354]

Average F1_micro: 0.93
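
The task asks for cross-validated F1, precision, and recall for each model in tabular form. One possible way to build such a table is `cross_validate` with several scorers, collected into a pandas DataFrame. This is a sketch on synthetic data, not the notebook's own results; `X_demo`, `y_demo`, `models`, `rows`, and `summary` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the tumor data (assumption: binary target, labels 0/1)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=1)

models = {
    'Naive Bayes': GaussianNB(),
    'KNN (k=5)': KNeighborsClassifier(n_neighbors=5),
}
metric_names = ['f1', 'precision', 'recall']

rows = {}
for name, model in models.items():
    # cross_validate returns one array of per-fold scores per metric,
    # stored under the key 'test_<metric>'
    cv_results = cross_validate(model, X_demo, y_demo, cv=5, scoring=metric_names)
    rows[name] = {m: cv_results['test_' + m].mean() for m in metric_names}

# One row per model, one column per metric (mean over the 5 folds)
summary = pd.DataFrame.from_dict(rows, orient='index').round(2)
print(summary)
```

Replacing `X_demo` and `y_demo` with `X` and `y` should give the table for the breast cancer data, assuming the target is encoded as 0/1 integers.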
