## Gaussian Naive Bayes and Breast Cancer

Data from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29

The data were originally gathered by Dr. William H. Wolberg at the University of Wisconsin Hospitals.

Attribute Information:

1. Sample code number: id number
2. Clump Thickness: 1 - 10
3. Uniformity of Cell Size: 1 - 10
4. Uniformity of Cell Shape: 1 - 10
5. Marginal Adhesion: 1 - 10
6. Single Epithelial Cell Size: 1 - 10
7. Bare Nuclei: 1 - 10
8. Bland Chromatin: 1 - 10
9. Normal Nucleoli: 1 - 10
10. Mitoses: 1 - 10
11. Class: (2 for benign, 4 for malignant)

In [71]:
import pandas as pd
cancer_data = pd.read_csv("/Users/evancolvin/Dropbox/Documents/MLCancer Presentation/cancer.csv")
cancer_data.head(10)

Unnamed: 0,ID,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bare Nuclei,Brand Chromatin,Normal Nucleoli,Mitosis,Class
0,1000025,5,1,1,1,2,1,3,1,1,2
1,1002945,5,4,4,5,7,10,3,2,1,2
2,1015425,3,1,1,1,2,2,3,1,1,2
3,1016277,6,8,8,1,3,4,3,7,1,2
4,1017023,4,1,1,3,2,1,3,1,1,2
5,1017122,8,10,10,8,7,10,9,7,1,4
6,1018099,1,1,1,1,2,10,3,1,1,2
7,1018561,2,1,2,1,2,1,3,1,1,2
8,1033078,2,1,1,1,2,1,1,1,5,2
9,1033078,4,2,1,1,2,1,2,1,1,2


In [72]:
cancer_data.iloc[23]


ID                             1057013
Clump Thickness                      8
Uniformity of Cell Size              4
Uniformity of Cell Shape             5
Marginal Adhesion                    1
Single Epithelial Cell Size          2
Bare Nuclei                          ?
Brand Chromatin                      7
Normal Nucleoli                      3
Mitosis                              1
Class                                4
Name: 23, dtype: object

Some missing values are coded as '?', as in the Bare Nuclei field of the row shown above.

For simplicity, we'll just remove those observations.

In [73]:
# Convert every column to numeric; values that cannot be parsed (such as '?') become NaN
cancer_data = cancer_data.apply(pd.to_numeric, errors='coerce')
cancer_data.iloc[23]


ID                             1057013.0
Clump Thickness                      8.0
Uniformity of Cell Size              4.0
Uniformity of Cell Shape             5.0
Marginal Adhesion                    1.0
Single Epithelial Cell Size          2.0
Bare Nuclei                          NaN
Brand Chromatin                      7.0
Normal Nucleoli                      3.0
Mitosis                              1.0
Class                                4.0
Name: 23, dtype: float64

In [74]:
# Drop every row that contains a NaN, which removes the observations with missing values
print(len(cancer_data))
cancer_data = cancer_data.dropna(axis=0)
print(len(cancer_data))
cancer_data.iloc[23]

699
683


ID                             1059552.0
Clump Thickness                      1.0
Uniformity of Cell Size              1.0
Uniformity of Cell Shape             1.0
Marginal Adhesion                    1.0
Single Epithelial Cell Size          2.0
Bare Nuclei                          1.0
Brand Chromatin                      3.0
Normal Nucleoli                      1.0
Mitosis                              1.0
Class                                2.0
Name: 24, dtype: float64

Dropping the rows with missing values removed 16 of the 699 observations, leaving 683, which is still enough data for what we want to do.
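As an alternative sketch of the same cleanup, pandas can treat '?' as missing while the file is being read, so no separate conversion step is needed. The tiny in-memory CSV below, including its rows and its reduced set of columns, is made up for illustration and is not taken from the real file.

```python
import io
import pandas as pd

# Hypothetical three-row sample; the second row uses '?' for a missing value.
sample_csv = io.StringIO(
    "ID,Clump Thickness,Bare Nuclei,Class\n"
    "1,5,1,2\n"
    "2,8,?,4\n"
    "3,3,2,2\n"
)

# na_values tells read_csv to parse '?' as NaN instead of keeping it as a string.
sample = pd.read_csv(sample_csv, na_values="?")
print(sample.isna().sum())  # number of missing values in each column

# Drop every row that contains at least one NaN.
cleaned = sample.dropna(axis=0)
print(len(sample), len(cleaned))
```

Counting the missing values per column before dropping rows shows which fields are affected, which helps when deciding whether dropping is acceptable.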

We'll relabel the target values as 'benign' and 'malignant' instead of 2 and 4, respectively. It won't change the analysis, but it will make the results easier to read.

We'll then extract the values into a NumPy array to feed into the Naive Bayes classifier.

In [75]:
cancer_data = cancer_data.drop('ID', axis=1)  # remove the ID column; it is an identifier, not a feature
# Relabel the numeric class codes with readable names
cancer_data['Class'] = cancer_data['Class'].map({2.0: 'benign', 4.0: 'malignant'})

In [76]:
cancer_data.head(25)

Unnamed: 0,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bare Nuclei,Brand Chromatin,Normal Nucleoli,Mitosis,Class
0,5,1,1,1,2,1.0,3,1,1,benign
1,5,4,4,5,7,10.0,3,2,1,benign
2,3,1,1,1,2,2.0,3,1,1,benign
3,6,8,8,1,3,4.0,3,7,1,benign
4,4,1,1,3,2,1.0,3,1,1,benign
5,8,10,10,8,7,10.0,9,7,1,malignant
6,1,1,1,1,2,10.0,3,1,1,benign
7,2,1,2,1,2,1.0,3,1,1,benign
8,2,1,1,1,2,1.0,1,1,5,benign
9,4,2,1,1,2,1.0,2,1,1,benign


In [77]:
cancer = cancer_data.values

In [78]:
classification = cancer[:, 9]  # target labels are in column index 9 (the tenth column, counting from 0)
data = cancer[:, 0:9]  # the features are columns 0 through 8

print(classification[:10])
print(data[:10])

['benign' 'benign' 'benign' 'benign' 'benign' 'malignant' 'benign' 'benign'
 'benign' 'benign']
[[5 1 1 1 2 1.0 3 1 1]
 [5 4 4 5 7 10.0 3 2 1]
 [3 1 1 1 2 2.0 3 1 1]
 [6 8 8 1 3 4.0 3 7 1]
 [4 1 1 3 2 1.0 3 1 1]
 [8 10 10 8 7 10.0 9 7 1]
 [1 1 1 1 2 10.0 3 1 1]
 [2 1 2 1 2 1.0 3 1 1]
 [2 1 1 1 2 1.0 1 1 5]
 [4 2 1 1 2 1.0 2 1 1]]
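Slicing by position works here, but it silently depends on the column order. A minimal sketch of the same relabel-and-extract step that selects columns by name instead is shown below. The three-row frame is hypothetical and uses only two of the nine features.

```python
import pandas as pd

# Hypothetical cleaned frame with two features and the numeric class code.
frame = pd.DataFrame({
    "Clump Thickness": [5, 8, 3],
    "Bare Nuclei": [1.0, 10.0, 2.0],
    "Class": [2, 4, 2],
})

# Map the numeric codes to readable labels in one step.
frame["Class"] = frame["Class"].map({2: "benign", 4: "malignant"})

# Select by name, so the code keeps working if the column order changes.
features = frame.drop("Class", axis=1).values.astype(float)
labels = frame["Class"].values

print(features.shape, labels)
```

Casting the features to float also gives the classifier a plain numeric array rather than a mixed object array.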


### Training and Evaluating the Model

We'll train on the first 500 observations and see how well the model predicts the remaining 183, which it hasn't seen before.

In [79]:
training_set = data[0:500, :]  # first 500 rows, all feature columns
training_result = classification[0:500]  # labels for those 500 rows
print(training_set[:10])
print(training_result[:10])

[[5 1 1 1 2 1.0 3 1 1]
 [5 4 4 5 7 10.0 3 2 1]
 [3 1 1 1 2 2.0 3 1 1]
 [6 8 8 1 3 4.0 3 7 1]
 [4 1 1 3 2 1.0 3 1 1]
 [8 10 10 8 7 10.0 9 7 1]
 [1 1 1 1 2 10.0 3 1 1]
 [2 1 2 1 2 1.0 3 1 1]
 [2 1 1 1 2 1.0 1 1 5]
 [4 2 1 1 2 1.0 2 1 1]]
['benign' 'benign' 'benign' 'benign' 'benign' 'malignant' 'benign' 'benign'
 'benign' 'benign']


In [80]:
print(training_set.shape)

(500, 9)


In [81]:
test_set = data[500:, :]  # all columns, rows 500 to the end
test_result = classification[500:]  # rows 500 to the end
print(test_set[:10])
print(test_result[:10])
print(test_set.shape)
print(test_result.shape)

[[4 10 4 7 3 10.0 9 10 1]
 [1 1 1 1 1 1.0 1 1 1]
 [1 1 1 1 1 1.0 2 1 1]
 [3 1 2 2 2 1.0 1 1 1]
 [4 7 8 3 4 10.0 9 1 1]
 [1 1 1 1 3 1.0 1 1 1]
 [4 1 1 1 3 1.0 1 1 1]
 [10 4 5 4 3 5.0 7 3 1]
 [7 5 6 10 4 10.0 5 3 1]
 [3 1 1 1 2 1.0 2 1 1]]
['malignant' 'benign' 'benign' 'benign' 'malignant' 'benign' 'benign'
 'malignant' 'malignant' 'benign']
(183, 9)
(183,)
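The slice-based split above takes rows in file order, so if the file happens to be ordered in some systematic way, the training and test sets can differ in makeup. One possible alternative, sketched here on synthetic stand-in data rather than the cancer data, is scikit-learn's `train_test_split`, which shuffles the rows and can keep the class ratio equal in both parts. The random features, the 35% malignant share, and the seed are all assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 683 x 9 feature matrix: integer scores from 1 to 10.
rng = np.random.RandomState(0)
X = rng.randint(1, 11, size=(683, 9))
y = np.where(rng.rand(683) < 0.35, "malignant", "benign")

# test_size=183 keeps the same 500/183 sizes as the slice-based split;
# stratify=y preserves the benign/malignant ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=183, stratify=y, random_state=0
)

print(X_train.shape, X_test.shape)
print((y_train == "malignant").mean(), (y_test == "malignant").mean())
```

Fixing `random_state` makes the shuffle reproducible, so the reported error counts do not change from run to run.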


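### Running the Machine Learning Model
We'll use scikit-learn's GaussianNB classifier. It is the standard Naive Bayes classifier with the added assumption that, within each class, every feature is normally distributed. Since the features here are integer scores from 1 to 10, that assumption is only an approximation.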
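To make the idea concrete, here is a from-scratch NumPy sketch of what a Gaussian Naive Bayes classifier computes: for each class it estimates a prior plus a mean and variance per feature, then scores a new row by adding the log prior to the sum of per-feature Gaussian log densities. This is an illustration under simplifying assumptions, not scikit-learn's implementation. The function names, the `1e-9` variance floor, and the synthetic data are all made up for the example.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate the prior, per-feature mean, and per-feature variance for each class."""
    params = {}
    for label in np.unique(y):
        rows = X[y == label]
        params[label] = (
            len(rows) / float(len(X)),  # prior P(class)
            rows.mean(axis=0),          # mean of each feature within the class
            rows.var(axis=0) + 1e-9,    # variance, with a small floor to avoid dividing by zero
        )
    return params

def predict_gaussian_nb(params, X):
    """Return the class with the highest log posterior score for each row of X."""
    labels = list(params)
    scores = []
    for label in labels:
        prior, mean, var = params[label]
        # Sum of per-feature Gaussian log densities (the "naive" independence assumption).
        log_likelihood = -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mean) ** 2 / var, axis=1)
        scores.append(np.log(prior) + log_likelihood)
    return np.array(labels)[np.argmax(scores, axis=0)]

# Synthetic, well-separated two-class data (not the cancer data).
rng = np.random.RandomState(0)
X_demo = np.vstack([rng.normal(2, 1, size=(50, 3)), rng.normal(8, 1, size=(50, 3))])
y_demo = np.array(["benign"] * 50 + ["malignant"] * 50)

model = fit_gaussian_nb(X_demo, y_demo)
accuracy_demo = (predict_gaussian_nb(model, X_demo) == y_demo).mean()
print(accuracy_demo)
```

Working with log probabilities instead of multiplying raw densities avoids numerical underflow when there are many features.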

In [82]:
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(training_set, training_result)
# Predict on the training data itself to measure the training error
predictions = gnb.predict(training_set)

print("Number of mislabeled points out of a total %d points : %d" % (training_set.shape[0],
                                                                     (training_result != predictions).sum()))

Number of mislabeled points out of a total 500 points : 24


In [83]:
test_predictions = gnb.predict(test_set)
print("Number of mislabeled points out of a total %d points : %d" % (test_set.shape[0],
                                                                     (test_result != test_predictions).sum()))

Number of mislabeled points out of a total 183 points : 3


In [84]:
print(test_predictions[:10])
print(test_result[:10])

['malignant' 'benign' 'benign' 'benign' 'malignant' 'benign' 'benign'
 'malignant' 'malignant' 'benign']
['malignant' 'benign' 'benign' 'benign' 'malignant' 'benign' 'benign'
 'malignant' 'malignant' 'benign']


In [85]:
# So how did we do? Accuracy is the fraction of test predictions that match the true labels.
accuracy = (test_result == test_predictions).mean()
print(accuracy)

0.983606557377
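An accuracy of about 98.4% corresponds to the 3 errors in 183 test predictions, but accuracy alone does not say whether those errors were missed malignant cases or false alarms, which matters in a medical setting. A confusion matrix separates the two. The sketch below uses scikit-learn's `confusion_matrix` on hypothetical labels, not on the notebook's actual predictions.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration; not the notebook's actual predictions.
y_true = ["benign", "benign", "malignant", "malignant", "malignant", "benign"]
y_pred = ["benign", "malignant", "malignant", "benign", "malignant", "benign"]

# Rows are true classes, columns are predicted classes, in the order given by labels.
matrix = confusion_matrix(y_true, y_pred, labels=["benign", "malignant"])
true_benign, false_malignant, false_benign, true_malignant = matrix.ravel()

# Sensitivity (recall for malignant): the share of malignant cases the model caught.
sensitivity = true_malignant / float(true_malignant + false_benign)
print(matrix)
print(sensitivity)
```

Applied to the notebook's variables, the same call would take `test_result` and `test_predictions` as its first two arguments.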
