# Cancer Diagnostics
Using this set of [breast cancer data](http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29), create a model to predict breast cancer. Also, which traits are most indicative of whether or not an individual will be diagnosed?

---

In [26]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import cross_val_score
from sklearn import ensemble

%matplotlib inline

In [2]:
file = 'C:/Users/Carter Carlson/Documents/Thinkful/Large Databases/Breast cancer.csv'
df = pd.read_csv(file)


# Add column headers
cols = ['id', 'clump thickness', 'cell size uniformity', 'cell shape uniformity', 'marginal adhesion', 
        'single epithelial cell size', 'bare nuclei', 'bland chromatin', 'normal nucleoli', 'mitoses', 'class']
df.columns = cols
df.head()

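| | id | clump thickness | cell size uniformity | cell shape uniformity | marginal adhesion | single epithelial cell size | bare nuclei | bland chromatin | normal nucleoli | mitoses | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 1 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 2 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 3 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |
| 4 | 1017122 | 8 | 10 | 10 | 8 | 7 | 10 | 9 | 7 | 1 | 4 |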
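
Because `pd.read_csv` treats the first line of a file as a header by default, a file without a header row loses its first record once `df.columns` is overwritten. A minimal sketch of an alternative load, assuming the CSV has no header line (the raw UCI file does not): `header=None` keeps every record and `names=` assigns the labels in one step. The inline `raw_text` reuses three rows from the output above as a stand-in for the real file.

```python
import io
import pandas as pd

cols = ['id', 'clump thickness', 'cell size uniformity', 'cell shape uniformity',
        'marginal adhesion', 'single epithelial cell size', 'bare nuclei',
        'bland chromatin', 'normal nucleoli', 'mitoses', 'class']

# Stand-in for the real file: rows copied from the head() output above
raw_text = (
    "1002945,5,4,4,5,7,10,3,2,1,2\n"
    "1015425,3,1,1,1,2,2,3,1,1,2\n"
    "1016277,6,8,8,1,3,4,3,7,1,2\n"
)

# Default behaviour: the first record is consumed as the header row
lost = pd.read_csv(io.StringIO(raw_text))
lost.columns = cols

# header=None keeps every record; names= supplies the column labels
kept = pd.read_csv(io.StringIO(raw_text), header=None, names=cols)

print(len(lost), len(kept))
```

If the '?' markers should become missing values at load time, `na_values='?'` can be passed to the same call.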


In [3]:
# In 'class', 2 means benign (negative for cancer) and 4 means malignant (positive for cancer)
df['has cancer'] = (df['class'] == 4)

In [5]:
# Pass axis by keyword; the positional form was removed in pandas 2.0
X = df.drop(['id', 'class', 'has cancer'], axis=1)
Y = df['has cancer']

bnb = BernoulliNB()
bnb.fit(X, Y)

ValueError: could not convert string to float: '?'
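
The error means at least one feature column holds non-numeric text. A minimal sketch for locating it, using a small made-up frame (the values are illustrative, not taken from the dataset): `pd.to_numeric(..., errors='coerce')` turns unparseable entries into NaN, so counting NaNs per column points at the culprit.

```python
import pandas as pd

# Hypothetical toy frame; 'bare nuclei' is stored as strings because of the '?'
toy = pd.DataFrame({
    'clump thickness': [5, 3, 6, 4],
    'bare nuclei': ['10', '2', '?', '1'],
})

# Coerce every column to numeric; entries that cannot be parsed become NaN
coerced = toy.apply(pd.to_numeric, errors='coerce')

# Number of unparseable entries per column (the toy frame has no real NaNs)
bad_counts = coerced.isna().sum()
print(bad_counts[bad_counts > 0])
```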

In [12]:
# Count how often each value occurs in every feature column
for col in df.drop(['id', 'class', 'has cancer'], axis=1):
    print(df.groupby(col)[col].count())

clump thickness
1     139
2      50
3     104
4      79
5     127
6      33
7      23
8      44
9      14
10     69
Name: clump thickness, dtype: int64
cell size uniformity
1     372
2      45
3      52
4      38
5      30
6      25
7      19
8      28
9       6
10     67
Name: cell size uniformity, dtype: int64
cell shape uniformity
1     345
2      58
3      53
4      43
5      32
6      29
7      30
8      27
9       7
10     58
Name: cell shape uniformity, dtype: int64
marginal adhesion
1     392
2      58
3      58
4      33
5      23
6      21
7      13
8      25
9       4
10     55
Name: marginal adhesion, dtype: int64
single epithelial cell size
1      44
2     375
3      71
4      48
5      39
6      40
7      11
8      21
9       2
10     31
Name: single epithelial cell size, dtype: int64
bare nuclei
1     401
10    132
2      30
3      28
4      19
5      30
6       4
7       8
8      21
9       9
Name: bare nuclei, dtype: int64
bland chromatin
1     150
2     160
3     160


In [10]:
df = df[df['bare nuclei'] != '?']
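
Filtering out the '?' rows leaves 'bare nuclei' as a column of strings. A small sketch, on a made-up frame, of converting it to integers afterwards so that every feature has a numeric dtype; whether this step is strictly required depends on the estimator and library version.

```python
import pandas as pd

# Hypothetical toy frame standing in for df
toy = pd.DataFrame({
    'bare nuclei': ['10', '2', '?', '1'],
    'class': [4, 2, 2, 2],
})

# Drop rows carrying the missing-value marker, as in the cell above
clean = toy[toy['bare nuclei'] != '?'].copy()

# Convert the remaining strings to integers
clean['bare nuclei'] = clean['bare nuclei'].astype(int)

print(clean.dtypes)
print(len(toy) - len(clean), 'row(s) dropped')
```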

In [38]:
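X = df.drop(['id', 'class', 'has cancer'], axis=1)
Y = df['has cancer']

# Fit on the full data so feature_importances_ is available afterwards
regr = ensemble.RandomForestClassifier()
regr.fit(X, Y)

# Mean accuracy across 10 cross-validation folds
score = cross_val_score(regr, X, Y, cv=10).mean()
print('Accuracy: {}'.format(round(score, 2)))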

Accuracy: 0.96
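
Accuracy alone can hide how many positive cases are missed, which matters for a diagnostic model. A hedged sketch of a per-class view using `cross_val_predict` and `confusion_matrix`. The data here is synthetic (`make_classification`), the names `X_demo`, `y_demo`, and `clf` are placeholders, and `random_state` is fixed only so the sketch is repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for X and Y (9 features, like the notebook's feature set)
X_demo, y_demo = make_classification(n_samples=300, n_features=9,
                                     n_informative=5, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)

# Out-of-fold predictions: each sample is predicted by a model that never saw it
pred = cross_val_predict(clf, X_demo, y_demo, cv=10)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_demo, pred)
print(cm)

# Recall for the positive class: share of true positives that were caught
recall = recall_score(y_demo, pred)
print('Recall:', round(recall, 2))
```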


In [39]:
feature_importance = regr.feature_importances_

print('Relative importance by feature:\n')
# Pair each column name with its importance instead of tracking an index by hand
for col, importance in zip(X.columns, feature_importance):
    print(col, ': ', str(round(importance, 2)))

Relative importance by feature:

clump thickness :  0.06
cell size uniformity :  0.34
cell shape uniformity :  0.3
marginal adhesion :  0.02
single epithelial cell size :  0.03
bare nuclei :  0.11
bland chromatin :  0.12
normal nucleoli :  0.03
mitoses :  0.01
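
The importances are easier to compare when sorted. A minimal sketch that pairs `feature_importances_` with the column names in a pandas Series and ranks them; the data is synthetic and the column names `f0` to `f8` are placeholders. Without a fixed `random_state`, the values can shift slightly from run to run.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with placeholder column names
X_arr, y_demo = make_classification(n_samples=300, n_features=9,
                                    n_informative=5, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(9)])

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_demo, y_demo)

# Label each importance with its column name, then sort from most to least important
ranked = pd.Series(forest.feature_importances_, index=X_demo.columns)
ranked = ranked.sort_values(ascending=False)
print(ranked.round(2))
```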
