### How To Use Data Science To Predict Whether A Mammogram Mass Is Benign Or Malignant

Let's apply some data science to the public "Mammographic Mass" data set from the UC Irvine Machine Learning Repository.

The 961 records in this data set contain these 6 attributes:

- BI-RADS assessment: 1 to 5
- Patient's age in years (integer)
- Mass shape: 1 (round), 2 (oval), 3 (lobular), 4 (irregular)
- Mass margin: 1 (circumscribed), 2 (microlobulated), 3 (obscured), 4 (ill-defined), 5 (spiculated)
- Mass density: 1 (high), 2 (iso), 3 (low), 4 (fat-containing)
- Severity: 0 (benign), 1 (malignant)

Because BI-RADS is itself an assessment of confidence in the severity classification rather than a measured property of the mass, we'll discard it from our feature set. The remaining features, "Age", "Shape", "Margin" and "Density", will be used in models designed to predict our binary target variable, "Severity".

(Note: Although "Shape" and "Margin" are nominal data types, their numeric codes turn out to behave like ordinal values, so we can pass them to sklearn as plain numbers.)

Let's get started!

### Data Preparation

We'll start by loading our data into a pandas dataframe, converting missing data ('?') into NaN, and adding column names.

In [113]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd

masses_data = pd.read_csv('mammographic_masses.data.txt', na_values=['?'], names = ['BI-RADS', 'age', 'shape', 'margin', 'density', 'severity'])
masses_data.head()

Unnamed: 0,BI-RADS,age,shape,margin,density,severity
0,5.0,67.0,3.0,5.0,3.0,1
1,4.0,43.0,1.0,1.0,,1
2,5.0,58.0,4.0,5.0,3.0,1
3,4.0,28.0,1.0,1.0,3.0,0
4,5.0,74.0,1.0,5.0,,1
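As a quick check of how `na_values` behaves, here is a minimal sketch that parses the five records shown above from an in-memory string instead of the real file. The `sample_csv` string is an assumption reconstructed from that output, with `?` placed where the table shows a blank.

```python
import io

import pandas as pd

# The first five records as shown above, with '?' standing in for the blanks.
sample_csv = """5,67,3,5,3,1
4,43,1,1,?,1
5,58,4,5,3,1
4,28,1,1,3,0
5,74,1,5,?,1
"""

column_names = ['BI-RADS', 'age', 'shape', 'margin', 'density', 'severity']

sample = pd.read_csv(io.StringIO(sample_csv), na_values=['?'], names=column_names)

# Columns that contain NaN are stored as floats; complete columns stay integers.
print(sample.dtypes)
print(sample.isnull().sum())
```

In this small sample only "density" becomes a float column, because it is the only one with a missing value.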


Next we'll inspect how "clean" our data is with the pandas "describe" function.

In [114]:
masses_data.describe()

Unnamed: 0,BI-RADS,age,shape,margin,density,severity
count,959.0,956.0,930.0,913.0,885.0,961.0
mean,4.348279,55.487448,2.721505,2.796276,2.910734,0.463059
std,1.783031,14.480131,1.242792,1.566546,0.380444,0.498893
min,0.0,18.0,1.0,1.0,1.0,0.0
25%,4.0,45.0,2.0,1.0,3.0,0.0
50%,4.0,57.0,3.0,3.0,3.0,0.0
75%,5.0,66.0,4.0,4.0,3.0,1.0
max,55.0,96.0,4.0,5.0,4.0,1.0


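The "count" row shows quite a few missing values in our features (for example, only 885 of the 961 rows have a density). The "severity" target, with a count of 961, is complete.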

Before eliminating all rows with missing data, we'll want to look at those rows and make sure the missing fields aren't obviously correlated with the rest of the data. If they are, dropping the rows would bias our data set, and we'd want to go back and fill in the missing values instead.

In [115]:
masses_data.loc[(masses_data['age'].isnull()) |
              (masses_data['shape'].isnull()) |
              (masses_data['margin'].isnull()) |
              (masses_data['density'].isnull())]

Unnamed: 0,BI-RADS,age,shape,margin,density,severity
1,4.0,43.0,1.0,1.0,,1
4,5.0,74.0,1.0,5.0,,1
5,4.0,65.0,1.0,,3.0,0
6,4.0,70.0,,,3.0,0
7,5.0,42.0,1.0,,3.0,0
9,5.0,60.0,,5.0,1.0,1
12,4.0,64.0,1.0,,3.0,0
19,4.0,40.0,1.0,,,0
20,,66.0,,,1.0,1
22,4.0,43.0,1.0,,,0
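Eyeballing the rows is one option; a slightly more systematic check is to count the gaps per column and compare the malignancy rate of complete and incomplete rows. The sketch below does this on a six-row sample copied from the outputs above (rows 0 to 5), so the numbers it prints say nothing about the full data set.

```python
import numpy as np
import pandas as pd

# Rows 0-5 as they appear in the outputs above (BI-RADS omitted).
sample = pd.DataFrame({
    'age':      [67, 43, 58, 28, 74, 65],
    'shape':    [3, 1, 4, 1, 1, 1],
    'margin':   [5, 1, 5, 1, 5, np.nan],
    'density':  [3, np.nan, 3, 3, np.nan, 3],
    'severity': [1, 1, 1, 0, 1, 0],
})

feature_columns = ['age', 'shape', 'margin', 'density']

# True for every row with at least one missing feature;
# equivalent to chaining the per-column isnull() tests with |.
has_missing = sample[feature_columns].isnull().any(axis=1)

# Missing values per column.
print(sample[feature_columns].isnull().sum())

# Malignancy rate for complete rows (False) versus incomplete rows (True).
print(sample.groupby(has_missing)['severity'].mean())
```

If the two rates differed sharply on the real data, that would be a hint that the missing values are not random.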


The missing data seems randomly distributed, so we've got the "green light" to use pandas' "dropna" function to drop the rows with missing data.

In [116]:
masses_data.dropna(inplace=True)
masses_data.describe()

Unnamed: 0,BI-RADS,age,shape,margin,density,severity
count,830.0,830.0,830.0,830.0,830.0,830.0
mean,4.393976,55.781928,2.781928,2.813253,2.915663,0.485542
std,1.888371,14.671782,1.242361,1.567175,0.350936,0.500092
min,0.0,18.0,1.0,1.0,1.0,0.0
25%,4.0,46.0,2.0,1.0,3.0,0.0
50%,4.0,57.0,3.0,3.0,3.0,0.0
75%,5.0,66.0,4.0,4.0,3.0,1.0
max,55.0,96.0,4.0,5.0,4.0,1.0
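One detail worth knowing: `dropna()` with no arguments also discards rows whose only gap is in "BI-RADS", a column we don't use as a feature (the earlier "count" row shows 959 of 961 values present). A possible alternative, sketched here on three made-up rows, is to restrict the check to the columns the models need via the `subset` parameter. Whether this would change the 830-row result on the real data is not something the outputs above tell us.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the second is missing only BI-RADS, the third only density.
sample = pd.DataFrame({
    'BI-RADS':  [5, np.nan, 4],
    'age':      [67, 58, 43],
    'shape':    [3, 4, 1],
    'margin':   [5, 5, 1],
    'density':  [3, 3, np.nan],
    'severity': [1, 1, 1],
})

model_columns = ['age', 'shape', 'margin', 'density', 'severity']

dropped_any = sample.dropna()                         # drops both incomplete rows
dropped_subset = sample.dropna(subset=model_columns)  # keeps the BI-RADS-only gap

print(len(dropped_any), len(dropped_subset))
```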


Next we'll convert the pandas dataframe feature and target columns into numpy arrays so sklearn can use them.

In [117]:
all_features = masses_data[['age', 'shape', 'margin', 'density']].values
feature_names = ['age', 'shape', 'margin', 'density']

all_classes = masses_data['severity'].values

all_features

array([[67.,  3.,  5.,  3.],
       [58.,  4.,  5.,  3.],
       [28.,  1.,  1.,  3.],
       ...,
       [64.,  4.,  5.,  3.],
       [66.,  4.,  5.,  3.],
       [62.,  3.,  3.,  3.]])

To further prepare our feature data for use with our models, let's standardize it with sklearn's "preprocessing.StandardScaler" class, which rescales each feature to zero mean and unit variance.

In [118]:
from sklearn import preprocessing

scaler = preprocessing.StandardScaler()
all_features_scaled = scaler.fit_transform(all_features)
all_features_scaled

array([[ 0.7650629 ,  0.17563638,  1.39618483,  0.24046607],
       [ 0.15127063,  0.98104077,  1.39618483,  0.24046607],
       [-1.89470363, -1.43517241, -1.157718  ,  0.24046607],
       ...,
       [ 0.56046548,  0.98104077,  1.39618483,  0.24046607],
       [ 0.69686376,  0.98104077,  1.39618483,  0.24046607],
       [ 0.42406719,  0.17563638,  0.11923341,  0.24046607]])
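To make the scaling concrete, here is a small sketch in two parts. The first compares `StandardScaler` with the formula (x - mean) / std on three rows from the array above (age, shape and margin only, since density is constant in those rows). The second recomputes the first scaled age by hand from the `describe()` statistics, assuming that `StandardScaler` divides by the population standard deviation (ddof=0) while `describe()` reports the sample standard deviation (ddof=1).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age, shape and margin for three rows visible in the array output above.
features = np.array([
    [67., 3., 5.],
    [58., 4., 5.],
    [28., 1., 1.],
])

scaled = StandardScaler().fit_transform(features)

# numpy's std() defaults to ddof=0, the same convention StandardScaler uses.
manual = (features - features.mean(axis=0)) / features.std(axis=0)
print(np.allclose(scaled, manual))

# Recompute the first scaled age from the describe() output after dropna().
n = 830
mean_age = 55.781928
sample_std_age = 14.671782                      # describe() uses ddof=1
population_std_age = sample_std_age * np.sqrt((n - 1) / n)
z_age = (67 - mean_age) / population_std_age
print(z_age)
```

The hand-computed value should land very close to the 0.7650629 in the first row of the scaled array.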

### Models

Now let's run our data through several models to see which one comes out on top!

### Decision Trees

Let's start by splitting our data into training and test sets, then fitting a Decision Tree Classifier to the training data.

In [119]:
import numpy
from sklearn.model_selection import train_test_split

numpy.random.seed(1234)

(training_inputs, testing_inputs, training_classes, testing_classes)\
    = train_test_split(all_features_scaled, all_classes, train_size=0.75, random_state=1)

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(random_state=1)

clf.fit(training_inputs, training_classes)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=1,
            splitter='best')

We can now evaluate our Decision Tree Classifier's accuracy against our test data:

In [120]:
clf.score(testing_inputs, testing_classes)

0.7355769230769231
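For a classifier, `score` returns plain accuracy: the fraction of test rows predicted correctly. With 830 rows and `train_size=0.75`, the split leaves 208 test rows, and the value above is consistent with 153 of 208 correct. The sketch below illustrates both points on synthetic data of the same shape; `X` and `y` are random stand-ins, not the mammogram data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in with the same shape as the cleaned data: 830 rows, 4 features.
X = rng.normal(size=(830, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=830) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=1)

clf = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

# score() is the share of test rows whose predicted class matches the true class.
manual_accuracy = np.mean(clf.predict(X_test) == y_test)

print(len(X_train), len(X_test))
print(manual_accuracy, clf.score(X_test, y_test))
```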

What happens if, rather than a single set of train/test data, we use "K-Fold Cross Validation" to get a better measure of our model's accuracy?

In [121]:
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(clf, all_features_scaled, all_classes, cv=10)

cv_scores.mean()

0.7373123154639465
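Under the hood, `cross_val_score` with `cv=10` splits the data into ten folds, trains on nine, scores on the tenth, and repeats for each fold. A minimal sketch of the same loop written out by hand, on synthetic data from `make_classification` and assuming the stratified, unshuffled folds that sklearn uses for classifiers when `cv` is an integer:

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the four scaled features and the binary target.
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

clf = DecisionTreeClassifier(random_state=1)

fold_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=10).split(X, y):
    # Fit a fresh copy on nine folds, then score it on the held-out fold.
    model = clone(clf).fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

print(np.mean(fold_scores))
print(cross_val_score(clf, X, y, cv=10).mean())
```

Because every row is used for testing exactly once, the mean is less sensitive to one lucky or unlucky split than a single train/test score.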

Now let's try several other models and see how they do...

### Random Forest

In [122]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=10, random_state=1)
cv_scores = cross_val_score(clf, all_features_scaled, all_classes, cv=10)

cv_scores.mean()

0.7528157927878762

### Support Vector Machine

In [123]:
from sklearn import svm

C = 1.0
svc = svm.SVC(kernel='linear', C=C)

cv_scores = cross_val_score(svc, all_features_scaled, all_classes, cv=10)

cv_scores.mean()

0.7964988875362076

Let's also try the other Support Vector Machine kernels (rbf, sigmoid and poly) as a form of "hyperparameter tuning"...

In [124]:
C = 1.0
svc = svm.SVC(kernel='rbf', C=C)
cv_scores = cross_val_score(svc, all_features_scaled, all_classes, cv=10)
cv_scores.mean()

0.8012023704574396

In [125]:
C = 1.0
svc = svm.SVC(kernel='sigmoid', C=C)
cv_scores = cross_val_score(svc, all_features_scaled, all_classes, cv=10)
cv_scores.mean()

0.7351055791108685

In [126]:
C = 1.0
svc = svm.SVC(kernel='poly', C=C)
cv_scores = cross_val_score(svc, all_features_scaled, all_classes, cv=10)
cv_scores.mean()

0.792753942599667
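The four kernel cells differ only in the `kernel` argument, so they can be folded into one loop. A sketch on synthetic data (the `X` and `y` arrays are stand-ins, so the scores it prints will not match the ones above):

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the four scaled features and the binary target.
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

kernel_scores = {}
for kernel in ['linear', 'rbf', 'sigmoid', 'poly']:
    svc = svm.SVC(kernel=kernel, C=1.0)
    kernel_scores[kernel] = cross_val_score(svc, X, y, cv=10).mean()

# Print the kernels from best to worst mean accuracy.
for kernel, score in sorted(kernel_scores.items(), key=lambda item: item[1], reverse=True):
    print(kernel, round(score, 4))
```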

### K-Nearest-Neighbors

In [127]:
from sklearn import neighbors

clf = neighbors.KNeighborsClassifier(n_neighbors=10)
cv_scores = cross_val_score(clf, all_features_scaled, all_classes, cv=10)

cv_scores.mean()

0.7854795488574507

Before we move on to other models, let's try other "K" values to see how they perform.

In [128]:
for n in range(1, 51):
    clf = neighbors.KNeighborsClassifier(n_neighbors=n)
    cv_scores = cross_val_score(clf, all_features_scaled, all_classes, cv=10)
    print(n, cv_scores.mean())

1 0.7239123742356184
2 0.6889838098036746
3 0.7541080699103032
4 0.7300813008130081
5 0.7735464506108056
6 0.7626163189342738
7 0.7940595133145824
8 0.7747082406280172
9 0.7880200243482641
10 0.7854795488574507
11 0.7915333809104012
12 0.7794257168045002
13 0.7819084701174035
14 0.7915039950743742
15 0.7878748443250353
16 0.7794411093852764
17 0.7818073688482151
18 0.775681121699341
19 0.7805147418944068
20 0.7828666582707136
21 0.7853927906748946
22 0.7817342540895289
23 0.7805588206484475
24 0.780587506821712
25 0.7878171221471251
26 0.7866269957880302
27 0.7854365195975539
28 0.7902271105327232
29 0.7865979597833844
30 0.7878314652337574
31 0.7914172368918182
32 0.7878314652337574
33 0.7865976099520032
34 0.7866119530386354
35 0.7866262961252677
36 0.7854358199347914
37 0.7866843681345592
38 0.7866553321299133
39 0.7878891874116676
40 0.7854791990260694
41 0.7854645061080558
42 0.7818500482767305
43 0.7830692106404713
44 0.783054867553839
45 0.783054867553839
46 0.7854648559394373
...
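Among the results visible above, K=7 scores highest at about 0.794 (the output is cut off before K=50, so the last few values are unknown). Rather than scanning the printout by eye, the scores can be collected in a dictionary and the best K picked programmatically. A sketch on synthetic stand-in data, with a shorter range of K to keep it quick:

```python
from sklearn import neighbors
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the four scaled features and the binary target.
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

k_scores = {}
for n in range(1, 21):
    clf = neighbors.KNeighborsClassifier(n_neighbors=n)
    k_scores[n] = cross_val_score(clf, X, y, cv=10).mean()

# The K with the highest mean cross-validated accuracy.
best_k = max(k_scores, key=k_scores.get)
print(best_k, k_scores[best_k])
```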

### Naive Bayes

In [129]:
from sklearn.naive_bayes import MultinomialNB

# MultinomialNB requires non-negative values,
#  so use min-max scaling on the features
scaler = preprocessing.MinMaxScaler()
all_features_minmax = scaler.fit_transform(all_features)

clf = MultinomialNB()
cv_scores = cross_val_score(clf, all_features_minmax, all_classes, cv=10)

cv_scores.mean()

0.7844055665169388
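Min-max scaling maps each feature onto the range 0 to 1 via (x - min) / (max - min), which is what makes the values acceptable to `MultinomialNB`. A small sketch using a handful of illustrative ages that span the minimum (18) and maximum (96) reported by `describe()`; the two ages in between are picked arbitrarily:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative ages; 18 and 96 are the min and max from describe().
ages = np.array([[18.], [57.], [67.], [96.]])

scaled = MinMaxScaler().fit_transform(ages)

# The same transformation written out by hand.
manual = (ages - ages.min(axis=0)) / (ages.max(axis=0) - ages.min(axis=0))

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

An age of 57 sits exactly halfway between 18 and 96, so it maps to 0.5.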

### Logistic Regression

In [130]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
cv_scores = cross_val_score(clf, all_features_scaled, all_classes, cv=10)
cv_scores.mean()

0.8073583532737221

### The Verdict?

The "winner" for the most accurate model is Logistic Regression, with a cross-validated accuracy of 80.7%. The "losers" are the Decision Tree (73.7% cross-validated, 73.6% on the single train/test split) and the sigmoid-kernel SVM (73.5%).
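To line the models up side by side, one option is to put them in a dictionary and cross-validate each in a single loop. The sketch below does this on synthetic stand-in data, so its ranking is illustrative only and need not match the results above; the model settings mirror the ones used in the earlier cells.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the four scaled features and the binary target.
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

models = {
    'Decision Tree': DecisionTreeClassifier(random_state=1),
    'Random Forest': RandomForestClassifier(n_estimators=10, random_state=1),
    'SVM (rbf)': SVC(kernel='rbf', C=1.0),
    'KNN (K=10)': KNeighborsClassifier(n_neighbors=10),
    'Logistic Regression': LogisticRegression(),
}

# Mean 10-fold cross-validated accuracy per model.
results = {name: cross_val_score(model, X, y, cv=10).mean()
           for name, model in models.items()}

for name, score in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f'{name}: {score:.3f}')
```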