## Breast Cancer Tumor Classification (Work in progress)

Fine needle aspiration (FNA) is one of the older methods of diagnosing breast cancer. A thin, hollow needle is inserted into a lump to extract a sample of cells. The specimen is then examined under a microscope, and the following attributes are recorded.

<b>1: Sample code number: id number</b>

<b>2: Clump Thickness: 1 - 10</b>

Indicates whether the cells are mono-layered or multi-layered.

<b>3: Uniformity of Cell Size: 1 - 10</b> 

A measure of the variance of the sizes of the cells. Cancer cells vary in size.

<b>4: Uniformity of Cell Shape: 1 - 10</b>

A measure of the variance in the shapes of the cells. Cancer cells vary in shape.

<b>5: Marginal Adhesion: 1 - 10</b>

Assesses how well the cells adhere to one another. Cancer cells tend not to stick together.

<b>6: Single Epithelial Cell Size: 1 - 10</b>

Indicates whether the epithelial cells are significantly enlarged.

<b>7: Bare Nuclei: 1 - 10</b> 

The ratio of the number of cells not covered by cytoplasm to the number that are.

<b>8: Bland Chromatin: 1 - 10</b> 

A rating of the texture of the nucleus, from fine to coarse.

<b>9: Normal Nucleoli: 1 - 10</b> 

Nucleoli are small structures present in the nucleus. Normally they are small and barely visible, but they become prominent and plentiful in malignant cells. A higher value of this attribute indicates a higher chance of malignancy.

<b>10: Mitoses: 1 - 10</b>

Describes the level of cell division.

<b>11: Class: (2 for benign, 4 for malignant)</b>

The dataset was created by Dr. William H. Wolberg (physician), University of Wisconsin Hospitals, Madison, Wisconsin, USA.
It is available from the UCI Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Original)

Here we use this dataset to build a model that identifies whether a tumor is malignant (cancerous) or benign.

We will apply three classification methods and assess the performance of each.

The three methods are as follows.<br>
<b>1. Logistic Regression</b><br>
<b>2. Linear Discriminant Analysis</b><br>
<b>3. K-nearest neighbors</b>

In [2]:
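# "?" marks missing values in the raw file; read them as NA
breast_data = read.csv("datasets/breast-cancer-wisconsin.csv",header=FALSE,na.strings="?")
# drop rows that contain missing values
breast_data = na.omit(breast_data)
# drop the sample code number (column 1); the remaining columns are V2..V11
breast_data = breast_data[,-1]
names(breast_data)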
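Because the file is read without a header, the columns keep R's default names `V2` to `V11`. A minimal sketch of a lookup vector that maps those defaults back to the attribute list above may make later cells easier to read; `attribute_names` is a hypothetical helper and is not part of the original notebook.

```r
# Hypothetical lookup from R's default column names to the attributes listed above.
attribute_names = c(
  V2  = "Clump Thickness",
  V3  = "Uniformity of Cell Size",
  V4  = "Uniformity of Cell Shape",
  V5  = "Marginal Adhesion",
  V6  = "Single Epithelial Cell Size",
  V7  = "Bare Nuclei",
  V8  = "Bland Chromatin",
  V9  = "Normal Nucleoli",
  V10 = "Mitoses",
  V11 = "Class"
)

# Look up the meaning of a few columns by name.
attribute_names[c("V3", "V4", "V10")]
```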

In [3]:
nrow(breast_data)

In [4]:
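# treat the class label (2 = benign, 4 = malignant) as a factor
breast_data$V11 = as.factor(breast_data$V11)
set.seed(1)
# 550 randomly chosen rows form the training set; the rest are held out
train = sample(nrow(breast_data),550)
glm.fit = glm(V11~.,data=breast_data,family=binomial,subset=train)
# predicted probability of the second factor level (4 = malignant)
glm.prob = predict(glm.fit,newdata=breast_data[-train,],type="response")
glm.pred = rep(2,length(glm.prob))
glm.pred[glm.prob>0.5]=4
# test error rate (%)
mean(glm.pred!=breast_data[-train,]$V11)*100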

In [5]:
# test accuracy (%)
mean(glm.pred==breast_data[-train,]$V11)*100
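The error rate and accuracy do not show which kind of mistake the classifier makes, and a missed malignant tumor is usually the costlier one. A confusion matrix separates the two. The sketch below uses small invented label vectors (`actual` and `predicted` are illustrative, not results from this notebook) so that it runs on its own.

```r
# Toy labels, invented for illustration only (2 = benign, 4 = malignant).
actual    = factor(c(2, 2, 2, 4, 4, 4, 2, 4), levels = c(2, 4))
predicted = factor(c(2, 2, 4, 4, 4, 2, 2, 4), levels = c(2, 4))

# Rows are predictions, columns are true classes.
conf_mat = table(predicted, actual)
conf_mat

# Sensitivity: share of malignant cases that were predicted malignant.
sensitivity = conf_mat["4", "4"] / sum(conf_mat[, "4"])
# Specificity: share of benign cases that were predicted benign.
specificity = conf_mat["2", "2"] / sum(conf_mat[, "2"])
c(sensitivity = sensitivity, specificity = specificity)
```

In the notebook itself, the equivalent call would be `table(glm.pred, breast_data[-train,]$V11)`.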

In [6]:
summary(glm.fit)


Call:
glm(formula = V11 ~ ., family = binomial, data = breast_data, 
    subset = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.4181  -0.1145  -0.0639   0.0231   2.0039  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -10.14349    1.29725  -7.819 5.31e-15 ***
V2            0.50161    0.15677   3.200  0.00138 ** 
V3           -0.08125    0.21424  -0.379  0.70449    
V4            0.49669    0.25178   1.973  0.04853 *  
V5            0.30697    0.12642   2.428  0.01517 *  
V6            0.13494    0.16705   0.808  0.41921    
V7            0.28867    0.09704   2.975  0.00293 ** 
V8            0.49372    0.18590   2.656  0.00791 ** 
V9            0.22690    0.11776   1.927  0.05401 .  
V10           0.47054    0.35130   1.339  0.18043    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 718.760  on 549  degrees of freedom
Residu
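The estimates in the coefficient table are on the log-odds scale, which is hard to read directly. Exponentiating them gives odds ratios. The sketch below copies the estimates from the table above into a hypothetical vector `estimates` so that it runs without the fitted model.

```r
# Estimates copied from the coefficient table in the summary output.
estimates = c(V2 = 0.50161, V3 = -0.08125, V4 = 0.49669, V5 = 0.30697,
              V6 = 0.13494, V7 = 0.28867, V8 = 0.49372, V9 = 0.22690,
              V10 = 0.47054)

# exp(coefficient) is the multiplicative change in the odds of class 4
# (malignant) for a one-point increase in that score, other scores held fixed.
odds_ratios = exp(estimates)
round(odds_ratios, 2)
```

With the fitted model in memory, `exp(coef(glm.fit))` gives the same quantities, including the intercept. An odds ratio above 1 raises the odds of malignancy, and one below 1 (here only `V3`, whose coefficient is not significant) lowers them.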

In [7]:
# V3 - uniformity of cell size, V4 - uniformity of cell shape
# find the correlation between these attributes and the predicted probability
cor(glm.prob,breast_data[-train,]$V3)

In [8]:
cor(glm.prob,breast_data[-train,]$V4)

In [10]:
cor(glm.prob,breast_data[-train,]$V10)
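The three calls above can be collapsed into one that covers every predictor. The sketch below runs on synthetic data, since the dataset file is not bundled here: `toy`, `size`, `shape`, `other`, and `prob` are invented stand-ins for the test set and for `glm.prob`. It also illustrates one possible reason a predictor can correlate strongly with the predicted probability while its own coefficient is not significant: it may be closely related to another predictor that already carries the signal.

```r
set.seed(1)
n = 200

# Synthetic 1-10 scores: `shape` closely tracks `size`, `other` is unrelated.
size  = sample(1:10, n, replace = TRUE)
shape = pmin(10, pmax(1, size + sample(-1:1, n, replace = TRUE)))
other = sample(1:10, n, replace = TRUE)
toy   = data.frame(size, shape, other)

# Stand-in for glm.prob: a probability driven by `shape` alone.
prob = plogis(-5 + 0.9 * shape)

# Correlation of every predictor with the predicted probability in one call.
cors = sapply(toy, function(x) cor(prob, x))
round(cors, 2)

# `size` correlates with `prob` only through its link to `shape`.
cor(toy$size, toy$shape)
```

In the notebook, a comparable call would be `sapply(breast_data[-train, -10], function(x) cor(glm.prob, x))`, where `-10` drops the factor column `V11`.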