# Comparison of Classifiers for Breast Cancer Prediction

In this assignment, the task is to build a Decision Tree classifier and a Naive Bayes classifier on the given dataset and compare the results of the two.

### *The dataset*
We have been given the Breast Cancer Prediction dataset. It contains measurements of cell nuclei in breast tissue and is designed to assist in diagnosing breast cancer. Here is an overview of the dataset.

 - Consists of 569 instances/observations. 
 - Has 30 numerical features, excluding ID and diagnosis.
 - The features are real-valued, meaning they consist of floating-point numbers.
 - It is a classification problem, where the goal is to classify tumors based on cell properties.


#### *What is the purpose of this dataset?*
As mentioned before, the purpose of this dataset is to use measurements of cell nuclei in breast tissue to assist in diagnosing breast cancer.

#### *What are the features?*
The features are numeric values that describe different properties of the tumor, such as texture, radius and smoothness.

#### *What are the targets?*
M and B act as our boolean targets. B means the tumor is benign and acts as our boolean 0 (False). M means the tumor is malignant, the opposite of benign, and acts as our boolean 1 (True).


In [28]:
import pandas as pd
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, classification_report
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay

In [29]:
dataset = pd.read_csv('wdbc.data', sep=",", header=None)
dataset.drop(columns=[0], inplace=True)
dataset[1] = dataset[1].map({'M': True, 'B': False})
dataset.head(5)


Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,22,23,24,25,26,27,28,29,30,31
0,True,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,True,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,True,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,True,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,True,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


## Pre-processing of data
The .data file has no header row, so by default pandas would treat the first row of data as the "header" for the columns. We fix this by setting header=None, which makes pandas number the columns instead.

Our pre-processing consists of removing the ID, since it is not relevant for us. We use dataset.drop(columns=[0], inplace=True) to remove the ID column.

Additionally, we remap the B and M labels in the diagnosis column to boolean values for clarity, which will most likely help us later on.
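To make the effect of header=None concrete, here is a minimal sketch on two made-up rows laid out like the file (ID, diagnosis, features). The values and the in-memory string are assumptions for illustration only; they are not taken from wdbc.data.

```python
import io
import pandas as pd

# Two made-up rows in the same layout as the file: ID, diagnosis, then features.
raw = "101,M,17.99,10.38\n102,B,13.54,14.36\n"

# Default behaviour: the first data row is consumed as the header.
with_default = pd.read_csv(io.StringIO(raw))

# header=None: every row is data and the columns are numbered 0, 1, 2, ...
without_header = pd.read_csv(io.StringIO(raw), header=None)

# Drop the ID column (label 0) and map the diagnosis (label 1) to booleans.
cleaned = without_header.drop(columns=[0])
cleaned[1] = cleaned[1].map({"M": True, "B": False})

print(len(with_default), len(without_header))  # rows kept in each case
print(list(cleaned.columns))                   # remaining column labels
print(cleaned[1].tolist())                     # boolean targets
```

Note that after dropping the ID the diagnosis keeps its label 1, even though it is now the first column by position.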





## Splitting the dataset


### *Choosing the sampling technique*
We will use simple random sampling, meaning the data is split by selecting rows completely at random. It is also worth noting that this needs to be done without replacement, so that no row ends up in more than one subset.

Stratified sampling was also considered. However, in our case, I will choose random selection for simplicity's sake. Stratified sampling organizes the rows into groups by a specific attribute before selecting, which ensures that the sample is representative of each group. It would be a more natural option if the dataset contained more "boolean features", meaning clearly categorical features that are easier to group by than floating-point numbers.
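The difference between the two techniques can be sketched on toy data. The frame below is hypothetical (40 malignant and 60 benign rows with a dummy feature), and grouping by the diagnosis column is only one possible way to stratify; it is shown for comparison, not as the method used in this assignment.

```python
import pandas as pd

# Toy data (hypothetical): 40 malignant (True) and 60 benign (False) rows.
toy = pd.DataFrame({
    "diagnosis": [True] * 40 + [False] * 60,
    "radius": [float(i) for i in range(100)],
})

# Simple random sampling without replacement: class proportions may drift.
random_part = toy.sample(frac=0.20, random_state=0)

# Stratified sampling: draw 20% from each diagnosis group separately,
# so the sample keeps the 40/60 class ratio of the full data.
stratified_part = toy.groupby("diagnosis").sample(frac=0.20, random_state=0)

print(len(random_part), int(random_part["diagnosis"].sum()))
print(len(stratified_part), int(stratified_part["diagnosis"].sum()))
```

Both samples contain 20 rows, but only the stratified one is guaranteed to contain exactly 8 malignant rows.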

### *The 60/20/20 split*
My approach to splitting the dataset is a 60/20/20 split. The reason is that model reliability is crucial in our case, and a 60/20/20 split provides a better balance between training and evaluation. Larger validation and test sets should also improve our ability to detect and prevent overfitting. In other words, we want our evaluation to be thorough and the model to generalize well to new data.

- 60% for training
  - This ensures the model has enough data to learn from. In other words, it should be sufficient to capture the patterns and complexities within the data.

- 20% for validation
  - This allows us to tune our hyperparameters and make decisions based on the model's performance, without overfitting to the training data.

- 20% for testing
  - This is used for an unbiased evaluation of the model's performance. It happens after training and validation, to ensure the model is evaluated on data it has not seen before.
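One way to obtain these three parts with pandas is to sample in two steps: first take 80% of the rows, then take 75% of those, since 0.80 × 0.75 = 0.60. The helper name `split_60_20_20`, the `seed` parameter and the toy frame are assumptions for this sketch.

```python
import pandas as pd

def split_60_20_20(frame, seed=0):
    """Hypothetical helper: random 60/20/20 split without replacement."""
    # 80% for training + validation, the remaining 20% for testing.
    train_val = frame.sample(frac=0.80, random_state=seed)
    test = frame.drop(train_val.index)
    # 75% of the 80% equals 60% of all rows; the rest is the validation set.
    train = train_val.sample(frac=0.75, random_state=seed)
    validation = train_val.drop(train.index)
    return train, validation, test

# Toy frame with 100 rows so the expected sizes are easy to check.
toy = pd.DataFrame({"value": range(100)})
train, validation, test = split_60_20_20(toy)
print(len(train), len(validation), len(test))
```

Because each part is built by dropping the rows already selected, no row can appear in more than one subset.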

### *Comparing our split choice with the 80/10/10 split*

When choosing the split, it stood between two options: the 60/20/20 split or the 80/10/10 split. I did not choose 80/10/10 because that split is mainly relevant when the goal is to maximize the amount of training data, which is not our desired goal.
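To see what the two options mean in practice, the approximate row counts for the 569 instances can be computed directly. The helper `split_sizes` is hypothetical, and the exact counts produced by a sampling routine may differ by a row depending on how it rounds.

```python
def split_sizes(n_rows, fractions):
    """Hypothetical helper: approximate row counts for a train/validation/test split."""
    train = round(n_rows * fractions[0])
    validation = round(n_rows * fractions[1])
    test = n_rows - train - validation  # remainder, so the counts sum to n_rows
    return train, validation, test

n_rows = 569  # number of instances in the dataset

print(split_sizes(n_rows, (0.60, 0.20, 0.20)))
print(split_sizes(n_rows, (0.80, 0.10, 0.10)))
```

With 80/10/10 the validation and test sets shrink to roughly half the size they have under 60/20/20, which makes the evaluation metrics noisier.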


### *Defending the chosen 60/20/20 split*

I think this is a good approach because there will be sufficient data for training to yield good results, while enough data is held out to detect overfitting. The split gives a balanced evaluation, which provides flexibility and reliability.

In [30]:
# Let's make our split: 60% training, 20% validation and 20% test, using
# simple random sampling without replacement (the default for DataFrame.sample).

# 80% of the rows go to training + validation; the remaining 20% form the test set.
training_validation = dataset.sample(frac=0.80)
test = dataset.drop(training_validation.index)

# 75% of that 80% is 60% of the full dataset; the rest (20%) is the validation set.
train = training_validation.sample(frac=0.75)
validation = training_validation.drop(train.index)


In [31]:
# The diagnosis is stored under the column label 1. After dropping column 0,
# dataset.columns[1] refers to label 2 (a continuous feature), and fitting a
# classifier on it raises "ValueError: Unknown label type: continuous".
target = 1

train_features = train.drop(columns=[target])
train_targets = train[target]

validation_features = validation.drop(columns=[target])
validation_targets = validation[target]

test_features = test.drop(columns=[target])
test_targets = test[target]

In [32]:
dt1 = DecisionTreeClassifier()
dt1.fit(train_features, train_targets)
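A `ValueError: Unknown label type: continuous` is what scikit-learn reports when a classifier receives real-valued targets. One likely way this can happen in this notebook is a mix-up between column position and column label after the ID column has been dropped. The sketch below shows the mix-up on a toy frame with made-up values; it is an illustration under that assumption, not a trace of the actual run.

```python
import pandas as pd

# Toy frame (hypothetical values) with integer column labels 0..3,
# mimicking a file read with header=None: ID, diagnosis, two features.
toy = pd.DataFrame([[101, True, 17.99, 10.38], [102, False, 13.54, 14.36]])
toy = toy.drop(columns=[0])

# The remaining labels are 1, 2, 3, so the label at position 1 is 2.
print(list(toy.columns))
print(toy.columns[1])

positional = toy[toy.columns[1]]  # first feature: continuous floats
by_label = toy[1]                 # diagnosis: booleans

print(positional.tolist())
print(by_label.tolist())
```

Selecting the target with the label 1 directly (or with `dataset.columns[0]` after the drop) picks the diagnosis column, which holds discrete classes as a classifier expects.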