## Breast Cancer Tumor Classification (Work in progress)

One of the older methods of diagnosing breast cancer is fine needle aspiration (FNA). A thin, hollow needle is inserted into a lump and a sample of cells is extracted. The specimen is then studied under a microscope and the following measurements are made.

<b>1: Sample code number: id number</b>

<b>2: Clump Thickness: 1 - 10</b>

Indicates whether the cells are mono-layered or multi-layered.

<b>3: Uniformity of Cell Size: 1 - 10</b> 

A measure of the variance of the sizes of the cells. Cancer cells vary in size.

<b>4: Uniformity of Cell Shape: 1 - 10</b>

A measure of the variance in the shapes of the cells. Cancer cells vary in shape.

<b>5: Marginal Adhesion: 1 - 10</b>

Assess the adhesion ability of the cells. Cancer cells tend not to stick together.

<b>6: Single Epithelial Cell Size: 1 - 10</b>

Determines whether the epithelial cells are significantly enlarged.

<b>7: Bare Nuclei: 1 - 10</b> 

The proportion of nuclei that are not surrounded by cytoplasm relative to those that are.

<b>8: Bland Chromatin: 1 - 10</b> 

A rating of the texture of the nucleus, from fine to coarse.

<b>9: Normal Nucleoli: 1 - 10</b> 

Nucleoli are small structures inside the nucleus. Normally they are small and barely visible, but in malignant cells they become more prominent and more numerous. A higher value of this attribute indicates a higher chance of malignancy.

<b>10: Mitoses: 1 - 10</b>

Describes the level of cell division.

<b>11: Class: (2 for benign, 4 for malignant)</b>

The dataset was created by Dr. William H. Wolberg (physician), University of Wisconsin Hospitals, Madison, Wisconsin, USA.
It is available from the UCI Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Original)

Here we use this dataset to build a model that identifies whether a tumor is malignant (cancerous) or benign.

We will apply three classification methods and assess the performance of each.

The three methods are as follows.<br>
<b>1. Logistic Regression</b><br>
<b>2. Linear Discriminant Analysis</b><br>
<b>3. K-nearest neighbors</b>

In [18]:
# Missing values are recorded as "?" in the file, so read them as NA
breast_data = read.csv("datasets/breast-cancer-wisconsin.csv",header=FALSE,na.strings="?")
# Drop every row that contains a missing value
breast_data = na.omit(breast_data)
# Remove the first column. It contains the sample code number
breast_data = breast_data[,-1]
# Make the class attribute qualitative (factor levels 2 and 4)
breast_data$V11 = as.factor(breast_data$V11)
# The remaining columns V2..V11 follow the order of attributes 2-11 listed above
names(breast_data)
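As a minimal sketch of how `na.strings="?"` and `na.omit` interact, the cell below reads a tiny made-up CSV from a string. The values and the name `toy` are hypothetical and are not taken from the Wisconsin file.

```r
# Hypothetical three-row CSV; the second row has a "?" in its third field
csv_text <- "101,5,1,2\n102,4,?,2\n103,1,1,4"

# "?" is converted to NA while reading
toy <- read.csv(text = csv_text, header = FALSE, na.strings = "?")
n_missing <- sum(is.na(toy))   # number of NA cells

# na.omit drops every row that has at least one NA
toy_clean <- na.omit(toy)
n_rows <- nrow(toy_clean)

print(n_missing)
print(n_rows)
```

Because the `?` is read as `NA` rather than as text, the affected column stays numeric, which is what the models later in this notebook need.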

In [12]:
nrow(breast_data)

In [19]:
set.seed(1)
# Sample 550 row indices for training; the remaining rows form the test set
train = sample(nrow(breast_data),550)
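The index vector `train` is used in two ways later on: `subset=train` selects the training rows, and `breast_data[-train,]` selects everything else. A small sketch of the same pattern on a made-up data frame (the names `toy`, `toy_train`, `train_rows` and `test_rows` are illustrative only):

```r
set.seed(1)
# Hypothetical 10-row data frame standing in for breast_data
toy <- data.frame(id = 1:10, value = rnorm(10))

# Draw 7 distinct row indices without replacement
toy_train <- sample(nrow(toy), 7)

train_rows <- toy[toy_train, ]    # rows used for fitting
test_rows  <- toy[-toy_train, ]   # negative indexing keeps the other rows

# The two sets are disjoint and together cover the whole data frame
overlap <- length(intersect(train_rows$id, test_rows$id))
print(c(nrow(train_rows), nrow(test_rows), overlap))
```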

# Logistic Regression

In [15]:
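# Fit on the training rows only; with a factor response, binomial glm models P(second level), i.e. P(class 4)
glm.fit = glm(V11~.,data=breast_data,family=binomial,subset=train)
# Predicted probability of malignancy for the held-out rows
glm.prob = predict(glm.fit,newdata=breast_data[-train,],type="response")
# Default to benign (2), then label as malignant (4) where the probability exceeds 0.5
glm.pred = rep(2,length(glm.prob))
glm.pred[glm.prob>0.5]=4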
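The thresholding step relies on which class `glm` treats as the "success". For a factor response with `family=binomial`, the first level is the failure and the fitted probabilities refer to the other level. The sketch below checks this on a small made-up dataset (the vectors `x` and `y` are hypothetical, not taken from the Wisconsin data):

```r
# Hypothetical toy data: larger x tends to go with class 4
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- factor(c(2, 2, 2, 4, 2, 4, 2, 4, 4, 4))

# The first level ("2") is the reference, so probabilities are for "4"
print(levels(y))

toy_fit  <- glm(y ~ x, family = binomial)
toy_prob <- predict(toy_fit, type = "response")

# Same labelling rule as above: start at 2, switch to 4 above 0.5
toy_pred <- rep(2, length(toy_prob))
toy_pred[toy_prob > 0.5] <- 4

# A positive slope means the probability of class 4 rises with x
slope <- coef(toy_fit)[["x"]]
print(slope > 0)
```

If the factor levels were ordered differently, the probabilities would refer to a different class, so checking `levels()` before thresholding is a cheap safeguard.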

In [16]:
#error rate
mean(glm.pred!=breast_data[-train,]$V11)*100

In [17]:
#accuracy
mean(glm.pred==breast_data[-train,]$V11)*100

In [22]:
# Confusion table (rows: predicted, columns: actual); 2 = benign, 4 = malignant
table(glm.pred,breast_data[-train,]$V11)

        
glm.pred  2  4
       2 91  2
       4  1 39

In [24]:
91/(91+1)*100

The model correctly classifies 98.9% of the benign cases in the test set (91 of 92).

In [25]:
39/(39+2)*100

The model also correctly classifies 95.1% of the malignant cases in the test set (39 of 41).
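The two per-class rates can also be computed directly from the confusion table instead of typing the counts by hand. The sketch below re-enters the four counts printed above as a matrix; the names `conf`, `per_class` and `accuracy` are illustrative.

```r
# Counts copied from the confusion table above
# (rows: predicted class, columns: actual class)
conf <- matrix(c(91, 1, 2, 39), nrow = 2,
               dimnames = list(predicted = c("2", "4"),
                               actual    = c("2", "4")))

# Share of each actual class that was predicted correctly, in percent
per_class <- diag(conf) / colSums(conf) * 100
print(round(per_class, 1))   # 98.9 for benign (2), 95.1 for malignant (4)

# Overall accuracy on the 133 test rows, in percent
accuracy <- sum(diag(conf)) / sum(conf) * 100
print(round(accuracy, 1))    # 97.7
```

In the notebook itself, the same expressions should work on the object returned by `table(glm.pred,breast_data[-train,]$V11)`, since it has the same layout.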