# Final Project

## Predict whether a mammogram mass is benign or malignant

We'll be using the "mammographic masses" public dataset from the UCI repository (source: https://archive.ics.uci.edu/ml/datasets/Mammographic+Mass).

The data contains 961 instances of masses detected in mammograms, with the following attributes:


   1. BI-RADS assessment: 1 to 5 (ordinal)  
   2. Age: patient's age in years (integer)
   3. Shape: mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal)
   4. Margin: mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal)
   5. Density: mass density high=1 iso=2 low=3 fat-containing=4 (ordinal)
   6. Severity: benign=0 or malignant=1 (binominal)
   
BI-RADS is an assessment of how confident the severity classification is; it is not a "predictive" attribute, so we will discard it. The age, shape, margin, and density attributes are the features that we will build our model with, and "severity" is the classification we will attempt to predict based on those attributes.

Although "shape" and "margin" are nominal data types, which sklearn typically doesn't deal with well, they are close enough to ordinal that we shouldn't just discard them. The "shape" for example is ordered increasingly from round to irregular.

False positives in mammogram results cause a lot of unnecessary anguish and surgery. If we can build a better way to interpret them through supervised machine learning, it could improve a lot of lives.

## Your assignment

Apply several different supervised machine learning techniques to this data set, and see which one yields the highest accuracy as measured with K-Fold cross validation (K=10). Apply:

* Decision tree
* Random forest
* KNN
* Naive Bayes
* SVM
* Logistic Regression

The data needs to be cleaned; many rows contain missing data, and there may be erroneous data identifiable as outliers as well.

Remember some techniques such as SVM also require the input data to be normalized first.

Many techniques also have "hyperparameters" that need to be tuned. Once you identify a promising approach, see if you can make it even better by tuning its hyperparameters.

I was able to achieve over 80% accuracy - can you beat that?

Below I've set up an outline of a notebook for this project, with some guidance and hints. 


## Let's begin: prepare your data

Start by importing the mammographic_masses.data.txt file into a Pandas dataframe (hint: use read_csv) and take a look at it.

In [2]:
import pandas as pd
db = pd.read_csv('mammographic_masses.data.txt',header=None)
db.head()

   0   1  2  3  4  5
0  5  67  3  5  3  1
1  4  43  1  1  ?  1
2  5  58  4  5  3  1
3  4  28  1  1  3  0
4  5  74  1  5  ?  1


In [7]:
import numpy as np
db.columns=['Bi-Rads','Age','Shape','Margin','Density','Severity']
df=db.replace('?', np.nan)

In [10]:
df.drop('Bi-Rads',axis=1,inplace=True)

Make sure you use the optional parameters in read_csv to convert missing data (indicated by a ?) into NaN, and to add the appropriate column names (BI_RADS, age, shape, margin, density, and severity):
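As a minimal sketch of that hint, `read_csv` can do both steps in one call through its `names` and `na_values` parameters. The three sample rows and the variable name `masses` are stand-ins so the snippet runs without the data file; in the notebook you would pass the file name instead of the `StringIO` object.

```python
import io
import pandas as pd

# Three sample rows in the same layout as mammographic_masses.data.txt
# (stand-in data so the snippet runs without the file).
sample = io.StringIO("5,67,3,5,3,1\n4,43,1,1,?,1\n5,58,4,5,3,1\n")

column_names = ['BI_RADS', 'age', 'shape', 'margin', 'density', 'severity']

# names= labels the columns; na_values= turns every '?' into NaN while parsing.
masses = pd.read_csv(sample, header=None, names=column_names, na_values=['?'])

print(masses.isnull().sum())
print(masses.dtypes)
```

Because the `?` markers are converted during parsing, the columns come back numeric, so a later `pd.to_numeric` pass should not be needed.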

In [11]:
df.head()

  Age Shape Margin Density  Severity
0  67     3      5     3.0         1
1  43     1      1     NaN         1
2  58     4      5     3.0         1
3  28     1      1     3.0         0
4  74     1      5     NaN         1


Evaluate whether the data needs cleaning; your model is only as good as the data it's given. Hint: use describe() on the dataframe.

In [12]:
df.describe()

       Severity
count     961.0
mean   0.463059
std    0.498893
min         0.0
25%         0.0
50%         0.0
75%         1.0
max         1.0


There are quite a few missing values in the data set. Before we just drop every row that's missing data, let's make sure we don't bias our data in doing so. Does there appear to be any sort of correlation to what sort of data has missing fields? If there were, we'd have to try and go back and fill that data in.

In [13]:
df.isnull().sum()

Age          5
Shape       31
Margin      48
Density     76
Severity     0
dtype: int64
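One way to look for a pattern in the missing values is to flag the incomplete rows and compare them with the complete ones. The sketch below uses a small made-up frame named `toy` (an assumption, not the real data); the same two lines should work on `df` once its columns are numeric.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df; the values are made up for illustration.
toy = pd.DataFrame({
    'Age':      [67, 43, 58, 28, 74, 65, np.nan, 42],
    'Density':  [3, np.nan, 3, 3, np.nan, 3, 3, 2],
    'Severity': [1, 1, 1, 0, 1, 0, 0, 0],
})

# Flag rows with at least one missing field.
toy['has_missing'] = toy.isnull().any(axis=1)

# Compare class balance and mean age between complete and incomplete rows.
summary = toy.groupby('has_missing')[['Severity', 'Age']].mean()
print(summary)
```

If the two groups look similar on the real data, dropping the incomplete rows is less likely to bias the model; a large gap would be a reason to consider filling the values in instead.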

If the missing data seems randomly distributed, go ahead and drop rows with missing data. Hint: use dropna().

In [14]:
df.isnull()

     Age  Shape  Margin  Density  Severity
0  False  False   False    False     False
1  False  False   False     True     False
2  False  False   False    False     False
3  False  False   False    False     False
4  False  False   False     True     False
5  False  False    True    False     False
6  False   True    True    False     False
7  False  False    True    False     False
8  False  False   False    False     False
9  False   True   False    False     False


In [16]:
data=df.dropna()
data.isnull().sum()

Age         0
Shape       0
Margin      0
Density     0
Severity    0
dtype: int64

In [20]:
# Convert every column to a numeric dtype. Reassigning the result (rather than
# writing into a slice of df) avoids pandas' SettingWithCopyWarning.
data = data.apply(pd.to_numeric)


In [21]:
data.describe()

             Age      Shape     Margin    Density   Severity
count      831.0      831.0      831.0      831.0      831.0
mean   55.777377   2.783394   2.814681   2.915764   0.484958
std    14.663528   1.242331   1.566771   0.350737   0.500075
min         18.0        1.0        1.0        1.0        0.0
25%         46.0        2.0        1.0        3.0        0.0
50%         57.0        3.0        3.0        3.0        0.0
75%         66.0        4.0        4.0        3.0        1.0
max         96.0        4.0        5.0        4.0        1.0


Next you'll need to convert the Pandas dataframes into numpy arrays that can be used by scikit-learn. Create an array that extracts only the feature data we want to work with (age, shape, margin, and density) and another array that contains the classes (severity). You'll also need an array of the feature name labels.

In [68]:
X=data[['Age','Shape','Margin','Density']]
y=data['Severity']
Names=['Age','Shape','Margin','Density']

Some of our models require the input data to be normalized, so go ahead and normalize the attribute data. Hint: use preprocessing.StandardScaler().

In [69]:
import sklearn as sk
import sklearn.preprocessing as pre
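The cell above only imports the preprocessing module, so here is a minimal sketch of the scaling step itself. The four rows are made up (in the order Age, Shape, Margin, Density) and the names `features` and `features_scaled` are placeholders; in the notebook the input would be `X`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up rows in the order Age, Shape, Margin, Density.
features = np.array([[67, 3, 5, 3],
                     [58, 4, 5, 3],
                     [28, 1, 1, 3],
                     [57, 1, 5, 2]], dtype=float)

# StandardScaler shifts each column to mean 0 and rescales it to unit variance.
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

print(features_scaled.mean(axis=0))  # approximately 0 for every column
print(features_scaled.std(axis=0))   # approximately 1 for every column
```

Without this step Age (tens of years) dominates distance-based models such as KNN and SVM, because the other features only range from 1 to 5.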

## Decision Trees

Before moving to K-Fold cross validation and random forests, start by creating a single train/test split of our data. Set aside 75% for training, and 25% for testing.

In [81]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, np.ravel(y), test_size=0.25)

Now create a DecisionTreeClassifier and fit it to your training data.

In [82]:
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()
dtree.fit(X_train,y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [83]:
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score

Measure the accuracy of the resulting decision tree model using your test data.

In [84]:
predictions = dtree.predict(X_test)
print(classification_report(y_test,predictions))

             precision    recall  f1-score   support

          0       0.70      0.73      0.71       101
          1       0.74      0.70      0.72       107

avg / total       0.72      0.72      0.72       208



In [85]:
print(confusion_matrix(y_test,predictions))

[[74 27]
 [32 75]]


In [86]:
print(accuracy_score(y_test,predictions))

0.716346153846
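The accuracy, precision, and recall figures can be reproduced by hand from the confusion matrix printed above, which helps when reading the report. This is a small worked check; it relies on scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix from the decision tree output:
# rows are true classes (0 = benign, 1 = malignant), columns are predictions.
cm = np.array([[74, 27],
               [32, 75]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()        # share of all predictions that are correct
precision_malignant = tp / (tp + fp)   # of predicted malignant, how many really are
recall_malignant = tp / (tp + fn)      # of truly malignant, how many were found

print(accuracy, precision_malignant, recall_malignant)
```

The 27 in the top-right cell counts the false positives (benign masses flagged as malignant), which is the error type the introduction is most concerned with.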


Now instead of a single train/test split, use K-Fold cross validation to get a better measure of your model's accuracy (K=10). Hint: use model_selection.cross_val_score

In [87]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(dtree, X, y, cv=10)
scores

array([ 0.71428571,  0.77380952,  0.71428571,  0.73493976,  0.78313253,
        0.71084337,  0.72289157,  0.75903614,  0.75609756,  0.70731707])
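`cross_val_score` returns one accuracy per fold, so a single number for comparing models is usually the mean, with the standard deviation as a rough idea of the spread. The sketch below simply copies the ten scores printed above into an array.

```python
import numpy as np

# The ten fold accuracies printed by cross_val_score above.
scores = np.array([0.71428571, 0.77380952, 0.71428571, 0.73493976, 0.78313253,
                   0.71084337, 0.72289157, 0.75903614, 0.75609756, 0.70731707])

print(scores.mean())  # average accuracy across the 10 folds
print(scores.std())   # fold-to-fold variation
```

In the notebook itself, `scores.mean()` on the returned array gives the same figure directly.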

Now try a RandomForestClassifier instead. Does it perform better?

In [88]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=150)
rfc.fit(X_train, y_train)
predictions = rfc.predict(X_test)
print(classification_report(y_test,predictions))

             precision    recall  f1-score   support

          0       0.74      0.68      0.71       101
          1       0.72      0.78      0.75       107

avg / total       0.73      0.73      0.73       208



In [89]:
print(confusion_matrix(y_test,predictions))

[[69 32]
 [24 83]]


In [90]:
print(accuracy_score(y_test,predictions))

0.730769230769


## SVM

Next try using svm.SVC with a linear kernel. How does it compare to the decision tree?

In [126]:
from sklearn.svm import SVC
SV = SVC(kernel='linear')
SV.fit(X_train,y_train)
predictions = SV.predict(X_test)
print(classification_report(y_test,predictions))

             precision    recall  f1-score   support

          0       0.86      0.67      0.76       101
          1       0.74      0.90      0.81       107

avg / total       0.80      0.79      0.79       208



In [127]:
print(confusion_matrix(y_test,predictions))

[[68 33]
 [11 96]]


In [128]:
print(accuracy_score(y_test,predictions))

0.788461538462


In [131]:
from sklearn.model_selection import GridSearchCV
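# Note: gamma is ignored by the linear kernel, so each value of C is fitted
# five times with identical results (visible in the log below).
param_grid = {'C': [0.1,1, 10, 100, 1000], 'gamma': [1,0.1,0.01,0.001,0.0001], 'kernel': ['linear']}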
grid = GridSearchCV(SVC(),param_grid,refit=True,verbose=3)
grid.fit(X_train,y_train)

Fitting 3 folds for each of 25 candidates, totalling 75 fits
[CV] C=0.1, gamma=1, kernel=linear ...................................
[CV]  C=0.1, gamma=1, kernel=linear, score=0.8221153846153846, total=   0.0s
[CV] C=0.1, gamma=1, kernel=linear ...................................
[CV]  C=0.1, gamma=1, kernel=linear, score=0.7980769230769231, total=   0.0s
[CV] C=0.1, gamma=1, kernel=linear ...................................
[CV]  C=0.1, gamma=1, kernel=linear, score=0.7681159420289855, total=   0.0s
[CV] C=0.1, gamma=0.1, kernel=linear .................................
[CV]  C=0.1, gamma=0.1, kernel=linear, score=0.8221153846153846, total=   0.0s
[CV] C=0.1, gamma=0.1, kernel=linear .................................
[CV]  C=0.1, gamma=0.1, kernel=linear, score=0.7980769230769231, total=   0.0s
[CV] C=0.1, gamma=0.1, kernel=linear .................................
[CV]  C=0.1, gamma=0.1, kernel=linear, score=0.7681159420289855, total=   0.0s
[CV] C=0.1, gamma=0.01, kernel=linear .......

[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s


[CV]  C=1, gamma=0.1, kernel=linear, score=0.8221153846153846, total=   0.0s
[CV] C=1, gamma=0.1, kernel=linear ...................................
[CV]  C=1, gamma=0.1, kernel=linear, score=0.8028846153846154, total=   0.0s
[CV] C=1, gamma=0.1, kernel=linear ...................................
[CV]  C=1, gamma=0.1, kernel=linear, score=0.7681159420289855, total=   0.0s
[CV] C=1, gamma=0.01, kernel=linear ..................................
[CV]  C=1, gamma=0.01, kernel=linear, score=0.8221153846153846, total=   0.0s
[CV] C=1, gamma=0.01, kernel=linear ..................................
[CV]  C=1, gamma=0.01, kernel=linear, score=0.8028846153846154, total=   0.0s
[CV] C=1, gamma=0.01, kernel=linear ..................................
[CV]  C=1, gamma=0.01, kernel=linear, score=0.7681159420289855, total=   0.0s
[CV] C=1, gamma=0.001, kernel=linear .................................
[CV]  C=1, gamma=0.001, kernel=linear, score=0.8221153846153846, total=   0.0s
[CV] C=1, gamma=0.001, kernel=

[CV]  C=1000, gamma=0.0001, kernel=linear, score=0.8028846153846154, total=   7.0s
[CV] C=1000, gamma=0.0001, kernel=linear .............................
[CV]  C=1000, gamma=0.0001, kernel=linear, score=0.7681159420289855, total=  13.8s


[Parallel(n_jobs=1)]: Done  75 out of  75 | elapsed:  4.3min finished


GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'C': [0.1, 1, 10, 100, 1000], 'gamma': [1, 0.1, 0.01, 0.001, 0.0001], 'kernel': ['linear']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=3)

In [132]:
grid.best_params_

{'C': 1, 'gamma': 1, 'kernel': 'linear'}

In [133]:
grid.best_estimator_

SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=1, kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [134]:
grid_predictions = grid.predict(X_test)

In [135]:
print(confusion_matrix(y_test,grid_predictions))

[[68 33]
 [11 96]]


In [136]:
print(classification_report(y_test,grid_predictions))

             precision    recall  f1-score   support

          0       0.86      0.67      0.76       101
          1       0.74      0.90      0.81       107

avg / total       0.80      0.79      0.79       208



In [137]:
print(accuracy_score(y_test,grid_predictions))

0.788461538462
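The grid search above returns the same confusion matrix as the untuned linear SVM, and the large C values take far longer to fit on unscaled features. A leaner variant, sketched here on synthetic data from `make_classification` (an assumption, not the mammogram data), scales the features inside a `Pipeline` and searches only over C. The names `X_demo`, `y_demo`, and `search` are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the four mammogram features (not the real data).
X_demo, y_demo = make_classification(n_samples=200, n_features=4, n_informative=3,
                                     n_redundant=1, random_state=0)

# Scaling lives inside the pipeline, so it is refitted on each training fold.
pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('svc', SVC(kernel='linear')),
])

# Only C is searched: gamma has no effect on a linear kernel.
param_grid = {'svc__C': [0.1, 1, 10]}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

Parameters of a pipeline step are addressed as `<step name>__<parameter>`, which is why the grid key is `svc__C`.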


## KNN
How about K-Nearest-Neighbors? Hint: use neighbors.KNeighborsClassifier - it's a lot easier than implementing KNN from scratch like we did earlier in the course. Start with a K of 10. K is an example of a hyperparameter - a parameter on the model itself which may need to be tuned for best results on your particular data set.

In [104]:
from sklearn.neighbors import KNeighborsClassifier
KN= KNeighborsClassifier(n_neighbors=10)
KN.fit(X_train,y_train)
predictions = KN.predict(X_test)
print(classification_report(y_test,predictions))

             precision    recall  f1-score   support

          0       0.77      0.70      0.74       101
          1       0.74      0.80      0.77       107

avg / total       0.76      0.75      0.75       208



In [105]:
print(confusion_matrix(y_test,predictions))

[[71 30]
 [21 86]]


In [106]:
print(accuracy_score(y_test,predictions))

0.754807692308


Choosing K is tricky, so we can't discard KNN until we've tried different values of K. Write a for loop to run KNN with K values ranging from 1 to 50 and see if K makes a substantial difference. Make a note of the best performance you could get out of KNN.

In [110]:
for number in range(1, 51):
    KN = KNeighborsClassifier(n_neighbors=number)
    KN.fit(X_train, y_train)
    predictions = KN.predict(X_test)
    print('for K=' + str(number) + ' accuracy is: ' + str(accuracy_score(y_test, predictions)))

for K=1 accuracy is: 0.673076923077
for K=2 accuracy is: 0.649038461538
for K=3 accuracy is: 0.759615384615
for K=4 accuracy is: 0.764423076923
for K=5 accuracy is: 0.783653846154
for K=6 accuracy is: 0.774038461538
for K=7 accuracy is: 0.759615384615
for K=8 accuracy is: 0.764423076923
for K=9 accuracy is: 0.764423076923
for K=10 accuracy is: 0.754807692308
for K=11 accuracy is: 0.764423076923
for K=12 accuracy is: 0.764423076923
for K=13 accuracy is: 0.769230769231
for K=14 accuracy is: 0.774038461538
for K=15 accuracy is: 0.783653846154
for K=16 accuracy is: 0.759615384615
for K=17 accuracy is: 0.75
for K=18 accuracy is: 0.754807692308
for K=19 accuracy is: 0.745192307692
for K=20 accuracy is: 0.735576923077
for K=21 accuracy is: 0.745192307692
for K=22 accuracy is: 0.740384615385
for K=23 accuracy is: 0.730769230769
for K=24 accuracy is: 0.721153846154
for K=25 accuracy is: 0.735576923077
for K=26 accuracy is: 0.730769230769
for K=27 accuracy is: 0.730769230769
for K=28 accuracy is: 0.721153846154
for K=29 accuracy is: 0.730769230769
...

Best accuracy in the output shown: 0.783653846154, reached at both K=5 and K=15.
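Picking K from a single test split can reward a lucky split rather than a good K. A possible alternative, sketched here on synthetic data (an assumption; `X_demo`, `y_demo`, and `mean_accuracy` are placeholder names), scores every K with 10-fold cross-validation and scales the features first, since KNN is distance based.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the four mammogram features (not the real data).
X_demo, y_demo = make_classification(n_samples=300, n_features=4, n_informative=3,
                                     n_redundant=1, random_state=0)

# Mean 10-fold accuracy for each K from 1 to 50.
mean_accuracy = {}
for k in range(1, 51):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    mean_accuracy[k] = cross_val_score(model, X_demo, y_demo, cv=10).mean()

# The K with the highest mean accuracy (the smallest such K on ties).
best_k = max(mean_accuracy, key=mean_accuracy.get)
print(best_k, mean_accuracy[best_k])
```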

## Naive Bayes

Now try naive_bayes.MultinomialNB. How does its accuracy stack up? Hint: you'll need to use MinMaxScaler to get the features in the range MultinomialNB requires.

In [113]:
from sklearn.naive_bayes import MultinomialNB
NB= MultinomialNB()
NB.fit(X_train,y_train)
predictions = NB.predict(X_test)
print(classification_report(y_test,predictions))

             precision    recall  f1-score   support

          0       0.81      0.66      0.73       101
          1       0.73      0.85      0.78       107

avg / total       0.77      0.76      0.76       208



In [114]:
print(confusion_matrix(y_test,predictions))

[[67 34]
 [16 91]]


In [115]:
print(accuracy_score(y_test,predictions))

0.759615384615
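The cell above fits MultinomialNB on the raw features, which works only because none of them are negative. If the features have been standardized, MultinomialNB will reject the negative values, which is where the MinMaxScaler hint comes in. A minimal sketch with made-up rows (Age, Shape, Margin, Density) and made-up labels:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Made-up rows in the order Age, Shape, Margin, Density, with made-up labels.
features = np.array([[67, 3, 5, 3],
                     [58, 4, 5, 3],
                     [28, 1, 1, 3],
                     [57, 1, 5, 2],
                     [36, 2, 1, 3],
                     [70, 4, 4, 3]], dtype=float)
labels = np.array([1, 1, 0, 1, 0, 1])

# MinMaxScaler maps each column onto [0, 1], so no value is negative.
model = make_pipeline(MinMaxScaler(), MultinomialNB())
model.fit(features, labels)

scaled = model.named_steps['minmaxscaler'].transform(features)
print(scaled.min(), scaled.max())
print(model.predict(features))
```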


## Revisiting SVM

svm.SVC may perform differently with different kernels. The choice of kernel is an example of a "hyperparameter." Try the rbf, sigmoid, and poly kernels and see what the best-performing kernel is. Do we have a new winner?

In [120]:
SV = SVC(kernel='rbf')
SV.fit(X_train,y_train)
predictions = SV.predict(X_test)
print(accuracy_score(y_test,predictions))

0.759615384615


In [121]:
SV = SVC(kernel='sigmoid')
SV.fit(X_train,y_train)
predictions = SV.predict(X_test)
print(accuracy_score(y_test,predictions))

0.485576923077


In [122]:
SV = SVC(kernel='poly')
SV.fit(X_train,y_train)
predictions = SV.predict(X_test)
print(accuracy_score(y_test,predictions))

0.754807692308
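The three cells above repeat the same steps with a different kernel, and they run on unscaled features, which may be part of why the sigmoid kernel scores near chance. A loop keeps the comparison in one place; the sketch below uses synthetic data (an assumption) and adds scaling and 10-fold cross-validation. `X_demo`, `y_demo`, and `kernel_scores` are placeholder names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the four mammogram features (not the real data).
X_demo, y_demo = make_classification(n_samples=300, n_features=4, n_informative=3,
                                     n_redundant=1, random_state=0)

# Mean 10-fold accuracy per kernel, with features standardized inside each fold.
kernel_scores = {}
for kernel in ['linear', 'rbf', 'sigmoid', 'poly']:
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    kernel_scores[kernel] = cross_val_score(model, X_demo, y_demo, cv=10).mean()

for kernel, score in kernel_scores.items():
    print(kernel, score)
```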


## Logistic Regression

We've tried all these fancy techniques, but fundamentally this is just a binary classification problem. Try Logistic Regression, which is a simple way to tackle this sort of thing.

In [125]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train,y_train)
predictions = logreg.predict(X_test)
print(accuracy_score(y_test,predictions))

0.783653846154


## Do we have a winner?

Which model, and which choice of hyperparameters, performed the best? 

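Logistic Regression with default parameters and KNN with a tuned K (5 or 15) both reached just over 78% accuracy on the test split. The best result, almost 79%, came from the SVM with a linear kernel, and the grid search did not improve on it. On a test set of 208 cases that gap is a single prediction, so the three models are effectively tied.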
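The assignment asks for a comparison by 10-fold cross-validation, while most of the figures above come from one train/test split. The sketch below shows one way to line up all six model types under the same protocol. It runs on synthetic data (an assumption, not the mammogram data), and the dictionary `models` with its settings is illustrative rather than tuned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the four mammogram features (not the real data).
X_demo, y_demo = make_classification(n_samples=300, n_features=4, n_informative=3,
                                     n_redundant=1, random_state=0)

# Each model is wrapped with the scaler it needs; trees are left unscaled.
models = {
    'Decision tree': DecisionTreeClassifier(random_state=0),
    'Random forest': RandomForestClassifier(n_estimators=50, random_state=0),
    'KNN (K=10)': make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=10)),
    'Naive Bayes': make_pipeline(MinMaxScaler(), MultinomialNB()),
    'SVM (linear)': make_pipeline(StandardScaler(), SVC(kernel='linear')),
    'Logistic regression': make_pipeline(StandardScaler(), LogisticRegression()),
}

# Mean accuracy over 10 folds for every model.
results = {name: cross_val_score(model, X_demo, y_demo, cv=10).mean()
           for name, model in models.items()}

# Print the models from best to worst mean accuracy.
for name, score in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f'{name}: {score:.3f}')
```

Swapping `X_demo` and `y_demo` for the notebook's `X` and `y` would give cross-validated figures that are directly comparable across all six models.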