## Predicting whether a mammogram mass is benign or malignant

We'll be using the "mammographic masses" public dataset from the UCI repository (source: https://archive.ics.uci.edu/ml/datasets/Mammographic+Mass)

This data contains 961 instances of masses detected in mammograms, and contains the following attributes:


   1. BI-RADS assessment: 1 to 5 (ordinal)  
   2. Age: patient's age in years (integer)
   3. Shape: mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal)
   4. Margin: mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal)
   5. Density: mass density high=1 iso=2 low=3 fat-containing=4 (ordinal)
   6. Severity: benign=0 or malignant=1 (binominal)
   
BI-RADS is an assesment of how confident the severity classification is; it is not a "predictive" attribute and so it will be discarded. The age, shape, margin, and density attributes are the features that will be used to build the model with, and "severity" is the classification we will attempt to predict based on those attributes.


Several different machine learning techniques will be applied to this data set to see which one yields the highest accuracy as measured with K-Fold cross validation (K=10):

* Decision tree
* Random forest
* KNN
* Naive Bayes
* SVM
* Logistic Regression
* Keras neural network

The data needs to be cleaned and normalized; many rows contain missing data, and there may be erroneous data identifiable as outliers as well.

In [32]:
import pandas as pd
import numpy as numpy

from sklearn import tree, svm
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier



## Preparing the data
Import of the data set, converting missing values (indicated as ?), setting column names.

In [94]:
col_names = ['BI_RADS', 'age', 'shape', 'margin', 'density', 'severity']
df = pd.read_csv('mammographic_masses.data.txt', na_values='?', header=None, names=col_names)
df.head()

Unnamed: 0,BI_RADS,age,shape,margin,density,severity
0,5.0,67.0,3.0,5.0,3.0,1
1,4.0,43.0,1.0,1.0,,1
2,5.0,58.0,4.0,5.0,3.0,1
3,4.0,28.0,1.0,1.0,3.0,0
4,5.0,74.0,1.0,5.0,,1


Evaluation of the data we have.

In [95]:
df.describe()

Unnamed: 0,BI_RADS,age,shape,margin,density,severity
count,959.0,956.0,930.0,913.0,885.0,961.0
mean,4.348279,55.487448,2.721505,2.796276,2.910734,0.463059
std,1.783031,14.480131,1.242792,1.566546,0.380444,0.498893
min,0.0,18.0,1.0,1.0,1.0,0.0
25%,4.0,45.0,2.0,1.0,3.0,0.0
50%,4.0,57.0,3.0,3.0,3.0,0.0
75%,5.0,66.0,4.0,4.0,3.0,1.0
max,55.0,96.0,4.0,5.0,4.0,1.0


Creating two datasets with dropped and filled NaN's for (possible) testing in future to see which one will give better results.

In [4]:
df_fill_na = df.fillna(method='bfill')
df_drop_na = df.dropna()

Converting pandas dataframe to numpy array and normalizing it.

In [5]:
# Features 
features = df_fill_na[['age', 'shape', 'margin', 'density']].to_numpy()
scaler = StandardScaler().fit(features)
features_scaled = scaler.transform(features)

# Labels
labels = df_fill_na[['severity']].to_numpy()

## Decision Trees

Creating a single train/test split of the data.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(features_scaled, labels, test_size=0.20, random_state=42)

Creating a Decision Tree / Random Forest Classifier and fit it to the training data.

In [36]:
# Decision Tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train,y_train)

# Random Forest
clf_rand_forest = RandomForestClassifier(n_estimators=10)
clf_rand_forest = clf_rand_forest.fit(X_train, y_train) 

Measuring the accuracy of the resulting decision tree model using test data.

In [37]:
score_dec_tree = clf.score(X_test, y_test)
score_rand_forest = clf_rand_forest.score(X_test, y_test)

print('Decision tree score: ' + str(score_dec_tree))
print('Random forest score: ' + str(score_rand_forest))

Decision tree score: 0.7823834196891192
Random forest score: 0.7668393782383419


Using K-Fold cross validation to get a better measure of model's accuracy (K=10).


In [50]:
# Refit models with all data (not only on training piece)
clf_all_data = tree.DecisionTreeClassifier()
clf_all_data = clf_all_data.fit(features_scaled, labels)

# Random Forest
clf_rand_forest_all_data = RandomForestClassifier(n_estimators=10)
clf_rand_forest_all_data = clf_rand_forest_all_data.fit(features_scaled, labels) 

scores_trees_Kcrossval = cross_val_score(clf_all_data, features_scaled, labels, cv=10)
scores_rand_forest_Kcrossval = cross_val_score(clf_rand_forest_all_data, features_scaled, labels, cv=10)

print('Decision tree K-fold validation score: ' + str(scores_trees_Kcrossval.mean()))
print('Random forest K-fold validation score: ' + str(scores_rand_forest_Kcrossval.mean()))

Decision tree K-fold validation score: 0.7336984536082475
Random forest K-fold validation score: 0.7534579037800688


## SVM

Using svm.SVC with rbf kernel (showed highest score)

In [51]:
clf_svm = svm.SVC(kernel='rbf')
scores_svm = cross_val_score(clf_svm, features_scaled, labels, cv=10)

print('SVM K-fold validation score: ' + str(scores_svm.mean()))

SVM K-fold validation score: 0.7867053264604811


## KNN
K-Nearest-Neighbors method. Starting with K hyperparameter as 5.

In [73]:
neigh = KNeighborsClassifier(n_neighbors=5)
neigh = neigh.fit(features_scaled, labels)

scores_KNN = cross_val_score(neigh, features_scaled, labels, cv=10)
print('KNN score: ' + str(scores_KNN.mean()))

KNN score: 0.7596756872852233


Now making an algo to choose the best K for our dataset in range from 1 to 50.

In [91]:
def choose_K(X, y):
    K = []
    for i in range(1, 50):
        neigh = KNeighborsClassifier(n_neighbors=i)
        neigh = neigh.fit(X, y)
        scores_KNN = cross_val_score(neigh, X, y, cv=10)
        K.append(scores_KNN.mean())
    return K

K_eval = choose_K(features_scaled, labels)
max_KNN_score = max(K_eval)
index_KNN_max = K.index(max_KNN_score)

print('Max KNN score: {0} when K={1}'.format(max_KNN_score, index_KNN_max))

Max KNN score: 0.7877684707903778 when K=41


## Naive Bayes

 naive_bayes.MultinomialNB.
 MinMaxScaler to get the features in the range MultinomialNB requires.

## Logistic Regression

Trying Logistic Regression

## Neural Network

