# Final Project

## Predict whether a mammogram mass is benign or malignant

We'll be using the "mammographic masses" public dataset from the UCI repository (source: https://archive.ics.uci.edu/ml/datasets/Mammographic+Mass)

This data set contains 961 instances of masses detected in mammograms, each with the following attributes:


   1. BI-RADS assessment: 1 to 5 (ordinal)  
   2. Age: patient's age in years (integer)
   3. Shape: mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal)
   4. Margin: mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal)
   5. Density: mass density high=1 iso=2 low=3 fat-containing=4 (ordinal)
   6. Severity: benign=0 or malignant=1 (binominal)
   
BI-RADS is an assessment of how confident the severity classification is; it is not a "predictive" attribute, so we will discard it. Age, shape, margin, and density are the features we will build our model with, and "severity" is the classification we will attempt to predict from them.

Although "shape" and "margin" are nominal data types, which sklearn typically doesn't deal with well, they are close enough to ordinal that we shouldn't just discard them. The "shape" for example is ordered increasingly from round to irregular.
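If treating "shape" and "margin" as ordinal turns out to be too strong an assumption, one common alternative is one-hot encoding. The sketch below uses made-up rows in the notebook's column layout; it is an illustration of the idea, not part of the original assignment.

```python
import pandas as pd

# Toy rows only; the column names mirror the notebook but the values are made up.
toy = pd.DataFrame({
    "age": [67, 43, 58],
    "shape": [3, 1, 4],
    "margin": [5, 1, 5],
})

# One indicator column per category, so no ordering is implied between codes.
encoded = pd.get_dummies(toy, columns=["shape", "margin"], dtype=int)
print(encoded.columns.tolist())
```

The trade-off is more columns (one per observed category) in exchange for dropping the assumed ordering.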

A lot of unnecessary anguish and surgery arises from false positives in mammogram results. If we can build a better way to interpret them through supervised machine learning, it could improve a lot of lives.

## Your assignment

Apply several different supervised machine learning techniques to this data set, and see which one yields the highest accuracy as measured with K-Fold cross validation (K=10). Apply:

* Decision tree
* Random forest
* KNN
* Naive Bayes
* SVM
* Logistic Regression
* And, as a bonus challenge, a neural network using Keras.
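As a rough sketch of what "accuracy as measured with K-Fold cross validation (K=10)" looks like in scikit-learn, the snippet below scores a decision tree on synthetic stand-in data. The arrays `features` and `labels` are placeholders, not the mammogram data.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the four features and the binary label (not the real data).
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 4))
labels = (features[:, 0] + features[:, 1] > 0).astype(int)

# K=10 folds: every sample is used for testing exactly once.
folds = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), features, labels, cv=folds)
print(len(scores), scores.mean())
```

Swapping in any of the other classifiers only changes the first argument to `cross_val_score`.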

The data needs to be cleaned; many rows contain missing data, and there may be erroneous data identifiable as outliers as well.
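One possible way to approach the cleaning step is sketched below on toy rows: drop incomplete rows, then flag values outside the documented range. The values are illustrative only, and dropping rows is just one option next to imputing them.

```python
import numpy as np
import pandas as pd

# Toy rows in the notebook's column layout; the values are illustrative only.
toy = pd.DataFrame({
    "BI_RADS": [5, 4, 55, 4],
    "age": [67, 43, 58, np.nan],
    "severity": [1, 1, 1, 0],
})

# Drop any row with a missing value instead of imputing it.
complete = toy.dropna()

# Flag values outside the documented 1-5 BI-RADS range as likely typos.
out_of_range = ~complete["BI_RADS"].between(1, 5)
print(complete[out_of_range])
```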

Remember that some techniques, such as SVM, also require the input data to be normalized first.
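A minimal sketch of scaling inside a pipeline, assuming scikit-learn's `make_pipeline`: the scaler is then fit only on the training part of each fold. The data here is synthetic and only mimics features on different scales.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with features on very different scales.
rng = np.random.default_rng(1)
features = np.column_stack([rng.normal(55, 14, 200), rng.integers(1, 5, 200)])
labels = (features[:, 0] > 55).astype(int)

# The scaler is refit on the training part of each fold, so no test-fold
# statistics leak into the model.
svm_pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scores = cross_val_score(svm_pipeline, features, labels, cv=10)
print(scores.mean())
```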

Many techniques also have "hyperparameters" that need to be tuned. Once you identify a promising approach, see if you can make it even better by tuning its hyperparameters.

I was able to achieve over 80% accuracy - can you beat that?


In [1]:
import pandas as pd

# Preprocessing

In [2]:
data= pd.read_csv("mammographic_masses.data")
data.head()

Unnamed: 0,5,67,3,5.1,3.1,1
0,4,43,1,1,?,1
1,5,58,4,5,3,1
2,4,28,1,1,3,0
3,5,74,1,5,?,1
4,4,65,1,?,3,0


In [3]:
data= pd.read_csv("mammographic_masses.data", na_values=["?"], names=["BI_RADS", "age", "shape", "margin", "density", "severity"])
data.head()

Unnamed: 0,BI_RADS,age,shape,margin,density,severity
0,5.0,67.0,3.0,5.0,3.0,1
1,4.0,43.0,1.0,1.0,,1
2,5.0,58.0,4.0,5.0,3.0,1
3,4.0,28.0,1.0,1.0,3.0,0
4,5.0,74.0,1.0,5.0,,1
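The difference between the two `read_csv` calls can be reproduced on a two-line sample. This is a small sketch using an in-memory string in place of the real file:

```python
import io
import pandas as pd

# Two rows in the same layout as the data file (no header line, "?" for missing).
raw = "5,67,3,5,3,1\n4,43,1,1,?,1\n"
names = ["BI_RADS", "age", "shape", "margin", "density", "severity"]

# Default call: the first record is consumed as the header.
naive = pd.read_csv(io.StringIO(raw))

# Explicit names keep every record, and na_values turns "?" into NaN.
parsed = pd.read_csv(io.StringIO(raw), na_values=["?"], names=names)
print(len(naive), len(parsed), parsed["density"].isna().sum())
```

Without `names`, the first patient's values become column labels (with duplicates renamed to `5.1` and `3.1`), which is what the first `head()` output shows.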


In [4]:
data.describe()

Unnamed: 0,BI_RADS,age,shape,margin,density,severity
count,959.0,956.0,930.0,913.0,885.0,961.0
mean,4.348279,55.487448,2.721505,2.796276,2.910734,0.463059
std,1.783031,14.480131,1.242792,1.566546,0.380444,0.498893
min,0.0,18.0,1.0,1.0,1.0,0.0
25%,4.0,45.0,2.0,1.0,3.0,0.0
50%,4.0,57.0,3.0,3.0,3.0,0.0
75%,5.0,66.0,4.0,4.0,3.0,1.0
max,55.0,96.0,4.0,5.0,4.0,1.0


In [5]:
data.isnull().sum()

BI_RADS      2
age          5
shape       31
margin      48
density     76
severity     0
dtype: int64

In [6]:
data.loc[(data["age"].isnull()) | (data["shape"].isnull()) | (data["margin"].isnull()) | data["density"].isnull()]

Unnamed: 0,BI_RADS,age,shape,margin,density,severity
1,4.0,43.0,1.0,1.0,,1
4,5.0,74.0,1.0,5.0,,1
5,4.0,65.0,1.0,,3.0,0
6,4.0,70.0,,,3.0,0
7,5.0,42.0,1.0,,3.0,0
...,...,...,...,...,...,...
778,4.0,60.0,,4.0,3.0,0
819,4.0,35.0,3.0,,2.0,0
824,6.0,40.0,,3.0,4.0,1
884,5.0,,4.0,4.0,3.0,1


# Data cleaning

In [7]:
col= list(data.columns)
col

['BI_RADS', 'age', 'shape', 'margin', 'density', 'severity']

In [8]:
for i, name in enumerate(col):
    print(f"Median for {i} column: ", data[name].median())

Median for 0 column:  4.0
Median for 1 column:  57.0
Median for 2 column:  3.0
Median for 3 column:  3.0
Median for 4 column:  3.0
Median for 5 column:  0.0


# Filling with medians

In [9]:
# Fill each column's missing values with that column's median
data.fillna(data.median(), inplace=True)

In [10]:
data.isnull().sum()

BI_RADS     0
age         0
shape       0
margin      0
density     0
severity    0
dtype: int64
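The median fill above uses statistics from the whole data set. A hedged alternative, sketched here with scikit-learn's `SimpleImputer` on toy arrays, learns the medians from training rows only and reuses them on unseen rows. The arrays `train` and `test` are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix with gaps; values are illustrative, not from the data set.
train = np.array([[67.0, 3.0], [43.0, np.nan], [58.0, 4.0], [np.nan, 1.0]])
test = np.array([[np.nan, 2.0]])

# Medians are learned from the training rows only, then reused on new rows.
imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train)
test_filled = imputer.transform(test)
print(imputer.statistics_, test_filled)
```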

In [11]:
data.head(6)

Unnamed: 0,BI_RADS,age,shape,margin,density,severity
0,5.0,67.0,3.0,5.0,3.0,1
1,4.0,43.0,1.0,1.0,3.0,1
2,5.0,58.0,4.0,5.0,3.0,1
3,4.0,28.0,1.0,1.0,3.0,0
4,5.0,74.0,1.0,5.0,3.0,1
5,4.0,65.0,1.0,3.0,3.0,0


In [12]:
data.drop("BI_RADS", axis=1, inplace=True)
data.head()

Unnamed: 0,age,shape,margin,density,severity
0,67.0,3.0,5.0,3.0,1
1,43.0,1.0,1.0,3.0,1
2,58.0,4.0,5.0,3.0,1
3,28.0,1.0,1.0,3.0,0
4,74.0,1.0,5.0,3.0,1


In [13]:
data.corr()

Unnamed: 0,age,shape,margin,density,severity
age,1.0,0.360532,0.402995,0.021119,0.431329
shape,0.360532,1.0,0.718893,0.057495,0.552781
margin,0.402995,0.718893,1.0,0.094516,0.557867
density,0.021119,0.057495,0.094516,1.0,0.054681
severity,0.431329,0.552781,0.557867,0.054681,1.0


# Splitting x & y

In [14]:
x= data.iloc[:,:-1].values
y= data.loc[:,"severity"].values

# Normalization

In [15]:
from sklearn import preprocessing
scaler= preprocessing.StandardScaler()
x_scaled= scaler.fit_transform(x)
x_scaled

array([[ 0.79698441,  0.22038395,  1.43676223,  0.22480407],
       [-0.86561042, -1.41505218, -1.18321596,  0.22480407],
       [ 0.17351135,  1.03810202,  1.43676223,  0.22480407],
       ...,
       [ 0.58916006,  1.03810202,  1.43676223,  0.22480407],
       [ 0.72770962,  1.03810202,  1.43676223,  0.22480407],
       [ 0.45061049,  0.22038395,  0.12677314,  0.22480407]])

# Visualization

In [16]:
from CompareDetails import Compare_Details

In [17]:
data.head()

Unnamed: 0,age,shape,margin,density,severity
0,67.0,3.0,5.0,3.0,1
1,43.0,1.0,1.0,3.0,1
2,58.0,4.0,5.0,3.0,1
3,28.0,1.0,1.0,3.0,0
4,74.0,1.0,5.0,3.0,1


In [18]:
for i in range(1, len(col) - 1):
    Compare_Details(data, col[i], "severity", 7)

age,Total No. (age),Percentage (age),Total Outcome (0),Percentage Outcome (0),Total Outcome (1),Percentage Outcome (1)
Greater Than Mean,510,53.07,192,37.21 %,318,71.46 %
Less Than Mean,451,46.93,324,62.79 %,127,28.54 %


shape,Total No. (shape),Percentage (shape),Total severity (0),Percentage severity (0),Total severity (1),Percentage severity (1)
1.0,224,23.31%,186,83.04%,38,16.96%
2.0,211,21.96%,176,83.41%,35,16.59%
3.0,126,13.11%,69,54.76%,57,45.24%
4.0,400,41.62%,85,21.25%,315,78.75%


margin,Total No. (margin),Percentage (margin),Total severity (0),Percentage severity (0),Total severity (1),Percentage severity (1)
1.0,357,37.15%,316,88.52%,41,11.48%
2.0,24,2.5%,9,37.5%,15,62.5%
3.0,164,17.07%,80,48.78%,84,51.22%
4.0,280,29.14%,89,31.79%,191,68.21%
5.0,136,14.15%,22,16.18%,114,83.82%


density,Total No. (density),Percentage (density),Total severity (0),Percentage severity (0),Total severity (1),Percentage severity (1)
1.0,16,1.66%,9,56.25%,7,43.75%
2.0,59,6.14%,41,69.49%,18,30.51%
3.0,874,90.95%,459,52.52%,415,47.48%
4.0,12,1.25%,7,58.33%,5,41.67%


In [19]:
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


# Hyperparameter tuning

In [20]:
params={
    'knn': {'n_neighbors': [3,5,7,9,11,13,15],
            'metric': ['cosine', 'euclidean', 'manhattan'],
            'weights': ['uniform', 'distance']},

    'svc': {'C': [0.1, 1, 10, 100],
            'gamma': [1,0.1, 0.01, 0.001],
            'kernel': ['rbf', 'linear']},

    'dtc': {'criterion': ['gini', 'entropy'],
            'max_depth': [2,4,6,8,10,12]},

    'nb': {'priors': [None],
           'var_smoothing': [0.00000001, 0.000000001, 0.0000000001]},

    'rf': {'criterion': ['gini', 'entropy'],
           'max_depth': [2,4,6,8,10,12]},

    'lr': {'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}
}
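To get a feel for the cost of these grids, the sketch below counts the candidate combinations per model and the number of cross-validation fits. The 20-fold setting is taken from the search loop later in the notebook; the counts are derived by hand from the grid above.

```python
from math import prod

# Same grids as in the notebook cell above, reduced to the number of options per parameter.
option_counts = {
    "knn": [7, 3, 2],
    "svc": [4, 4, 2],
    "dtc": [2, 6],
    "nb": [1, 3],
    "rf": [2, 6],
    "lr": [5],
}
cv_folds = 20

# Candidates per model, and total fits once each candidate is cross-validated.
candidates = {name: prod(counts) for name, counts in option_counts.items()}
fits = {name: n * cv_folds for name, n in candidates.items()}
print(candidates)
print(fits)
```

KNN and SVC dominate the search time here, which is worth keeping in mind before widening their grids.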

# Machine Learning

In [21]:
svc= SVC()
knn= KNeighborsClassifier()
dtc= DecisionTreeClassifier()
nb= GaussianNB()
rf= RandomForestClassifier()
lr= LogisticRegression()

In [22]:
models= {'svc':svc, 'knn':knn, 'dtc':dtc, 'nb':nb, 'rf':rf, 'lr':lr}

In [23]:
x_train, x_test, y_train, y_test= train_test_split(x_scaled, y, test_size=0.2, random_state=56)

In [24]:
model_accuracy = {}
score = 0.0001
for name, estimator in models.items():

    # Exhaustive grid search with 20-fold cross-validation on the training split
    mod = GridSearchCV(
        estimator,
        params[name],
        verbose=0,
        cv=20,
        n_jobs=-1
    )

    gridsearch_result = mod.fit(x_train, y_train)
    predict = mod.predict(x_test)
    test_score = gridsearch_result.score(x_test, y_test)

    print(f"{name}: ", gridsearch_result.best_estimator_)
    # Arguments are passed as (predict, y_test), so rows are predicted labels
    # and columns are true labels
    print(confusion_matrix(predict, y_test))

    # Keep the search whose best estimator scores highest on the test split
    if score < test_score:
        score = test_score
        gridsearch = gridsearch_result

    model_accuracy[name] = test_score


svc:  SVC(C=10, gamma=0.1)
[[84 18]
 [20 71]]
knn:  KNeighborsClassifier(metric='manhattan', n_neighbors=15)
[[81 19]
 [23 70]]
dtc:  DecisionTreeClassifier(max_depth=4)
[[82 19]
 [22 70]]
nb:  GaussianNB(var_smoothing=1e-08)
[[77 16]
 [27 73]]
rf:  RandomForestClassifier(criterion='entropy', max_depth=4)
[[81 15]
 [23 74]]
lr:  LogisticRegression(solver='newton-cg')
[[81 17]
 [23 72]]
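Accuracy alone hides how the errors split between missed malignancies and false alarms. As a small worked example, the SVC matrix printed above can be unpacked by hand; note that the notebook calls `confusion_matrix(predict, y_test)`, so rows are predictions and columns are true labels.

```python
# Confusion matrix printed for the SVC above; rows are predictions, columns are truth.
matrix = [[84, 18],
          [20, 71]]

tn, fn = matrix[0]   # predicted benign: truly benign, truly malignant
fp, tp = matrix[1]   # predicted malignant: truly benign, truly malignant

total = tn + fn + fp + tp
accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)   # share of malignant masses that were caught
specificity = tn / (tn + fp)   # share of benign masses correctly cleared

print(f"accuracy={accuracy:.4f} sensitivity={sensitivity:.4f} specificity={specificity:.4f}")
```

Given the motivation of reducing false positives, specificity is arguably the number to watch alongside accuracy.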


In [25]:
model_accuracy

{'svc': 0.8031088082901554,
 'knn': 0.7823834196891192,
 'dtc': 0.7875647668393783,
 'nb': 0.7772020725388601,
 'rf': 0.8031088082901554,
 'lr': 0.7927461139896373}

In [26]:
gridsearch.best_estimator_

In [27]:
data.corr()

Unnamed: 0,age,shape,margin,density,severity
age,1.0,0.360532,0.402995,0.021119,0.431329
shape,0.360532,1.0,0.718893,0.057495,0.552781
margin,0.402995,0.718893,1.0,0.094516,0.557867
density,0.021119,0.057495,0.094516,1.0,0.054681
severity,0.431329,0.552781,0.557867,0.054681,1.0


# Dropping less correlated columns

In [28]:
dropped_data= data.drop("density", axis=1)

In [29]:
dropped_data.head()

Unnamed: 0,age,shape,margin,severity
0,67.0,3.0,5.0,1
1,43.0,1.0,1.0,1
2,58.0,4.0,5.0,1
3,28.0,1.0,1.0,0
4,74.0,1.0,5.0,1


In [30]:
x=dropped_data.iloc[:,:-1].values
y=dropped_data.loc[:,'severity'].values

In [31]:
scaler = preprocessing.StandardScaler()
x_scaled = scaler.fit_transform(x)
x_scaled

array([[ 0.79698441,  0.22038395,  1.43676223],
       [-0.86561042, -1.41505218, -1.18321596],
       [ 0.17351135,  1.03810202,  1.43676223],
       ...,
       [ 0.58916006,  1.03810202,  1.43676223],
       [ 0.72770962,  1.03810202,  1.43676223],
       [ 0.45061049,  0.22038395,  0.12677314]])

In [32]:
x_train,x_test,y_train,y_test=train_test_split(x_scaled,y,test_size=0.2,random_state=56)

In [33]:
model_accuracy = {}
score = 0.0001
for name, estimator in models.items():

    # Exhaustive grid search with 20-fold cross-validation on the training split
    mod = GridSearchCV(
        estimator,
        params[name],
        verbose=0,
        cv=20,
        n_jobs=-1
    )

    gridsearch_result = mod.fit(x_train, y_train)
    predict = mod.predict(x_test)
    test_score = gridsearch_result.score(x_test, y_test)

    print(f"{name}: ", gridsearch_result.best_estimator_)
    # Arguments are passed as (predict, y_test), so rows are predicted labels
    # and columns are true labels
    print(confusion_matrix(predict, y_test))

    # Keep the search whose best estimator scores highest on the test split
    if score < test_score:
        score = test_score
        gridsearch = gridsearch_result

    model_accuracy[name] = test_score

svc:  SVC(C=1, gamma=1)
[[83 17]
 [21 72]]
knn:  KNeighborsClassifier(metric='manhattan', n_neighbors=9)
[[85 18]
 [19 71]]
dtc:  DecisionTreeClassifier(max_depth=4)
[[82 19]
 [22 70]]
nb:  GaussianNB(var_smoothing=1e-08)
[[76 14]
 [28 75]]
rf:  RandomForestClassifier(criterion='entropy', max_depth=4)
[[81 16]
 [23 73]]
lr:  LogisticRegression(solver='newton-cg')
[[80 16]
 [24 73]]


In [34]:
model_accuracy

{'svc': 0.8031088082901554,
 'knn': 0.8082901554404145,
 'dtc': 0.7875647668393783,
 'nb': 0.7823834196891192,
 'rf': 0.7979274611398963,
 'lr': 0.7927461139896373}

# Neural Network

In [35]:
x=data.iloc[:,:-1].values
y=data.loc[:,'severity'].values

In [36]:
x_scaled = scaler.fit_transform(x)
x_scaled

array([[ 0.79698441,  0.22038395,  1.43676223,  0.22480407],
       [-0.86561042, -1.41505218, -1.18321596,  0.22480407],
       [ 0.17351135,  1.03810202,  1.43676223,  0.22480407],
       ...,
       [ 0.58916006,  1.03810202,  1.43676223,  0.22480407],
       [ 0.72770962,  1.03810202,  1.43676223,  0.22480407],
       [ 0.45061049,  0.22038395,  0.12677314,  0.22480407]])

In [37]:
x_train,x_test,y_train,y_test=train_test_split(x_scaled,y,test_size=0.2,random_state=42)

In [38]:
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

In [39]:
model = Sequential()
# 4 feature inputs going into a 6-unit layer (more does not seem to help - in fact you can go down to 4)
model.add(Dense(6, input_dim=4, kernel_initializer='normal', activation='relu'))
# "Deep learning" turns out to be unnecessary - this additional hidden layer doesn't help either.
model.add(Dense(4, kernel_initializer='normal', activation='relu'))
# Output layer with a binary classification (benign or malignant)
model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
# Compile model; adam seemed to work best
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])



In [40]:
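# Note: this fits on the full scaled data set, so the reported accuracy is training accuracy;
# the x_train/x_test split created above is not used here.
model.fit(x_scaled,y,verbose=2,epochs=200)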

Epoch 1/200
31/31 - 1s - 40ms/step - accuracy: 0.5328 - loss: 0.6927
Epoch 2/200
31/31 - 0s - 3ms/step - accuracy: 0.5369 - loss: 0.6912
Epoch 3/200
31/31 - 0s - 2ms/step - accuracy: 0.5369 - loss: 0.6867
Epoch 4/200
31/31 - 0s - 2ms/step - accuracy: 0.5369 - loss: 0.6756
Epoch 5/200
31/31 - 0s - 2ms/step - accuracy: 0.5369 - loss: 0.6550
Epoch 6/200
31/31 - 0s - 2ms/step - accuracy: 0.5369 - loss: 0.6286
Epoch 7/200
31/31 - 0s - 2ms/step - accuracy: 0.5369 - loss: 0.6011
Epoch 8/200
31/31 - 0s - 2ms/step - accuracy: 0.5369 - loss: 0.5770
Epoch 9/200
31/31 - 0s - 2ms/step - accuracy: 0.7347 - loss: 0.5599
Epoch 10/200
31/31 - 0s - 2ms/step - accuracy: 0.7908 - loss: 0.5498
Epoch 11/200
31/31 - 0s - 2ms/step - accuracy: 0.7929 - loss: 0.5429
Epoch 12/200
31/31 - 0s - 2ms/step - accuracy: 0.7940 - loss: 0.5378
Epoch 13/200
31/31 - 0s - 2ms/step - accuracy: 0.7908 - loss: 0.5338
Epoch 14/200
31/31 - 0s - 2ms/step - accuracy: 0.7908 - loss: 0.5304
Epoch 15/200
31/31 - 0s - 2ms/step - accur

<keras.src.callbacks.history.History at 0x1d925222240>
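Two numbers in the training log can be checked by hand. The sketch below assumes the class counts from the comparison tables earlier in the notebook and Keras's default batch size of 32 (no `batch_size` is passed to `fit`).

```python
import math

# Class counts taken from the comparison tables earlier in the notebook.
benign = 192 + 324      # severity == 0
malignant = 318 + 127   # severity == 1
total = benign + malignant

# Accuracy of always predicting the majority class (benign).
baseline = benign / total

# Keras uses a batch size of 32 unless told otherwise, which gives the steps per epoch.
steps_per_epoch = math.ceil(total / 32)
print(f"{baseline:.4f}", steps_per_epoch)
```

The plateau at 0.5369 during the first epochs equals this majority-class baseline, which suggests the network starts out predicting benign for every sample before it begins to separate the classes.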

In [41]:

x=dropped_data.iloc[:,:-1].values
y=dropped_data.loc[:,'severity'].values

In [42]:
x_scaled = scaler.fit_transform(x)
x_scaled

array([[ 0.79698441,  0.22038395,  1.43676223],
       [-0.86561042, -1.41505218, -1.18321596],
       [ 0.17351135,  1.03810202,  1.43676223],
       ...,
       [ 0.58916006,  1.03810202,  1.43676223],
       [ 0.72770962,  1.03810202,  1.43676223],
       [ 0.45061049,  0.22038395,  0.12677314]])

In [43]:
model = Sequential()
# 3 feature inputs (density dropped) going into a 6-unit layer
model.add(Dense(6, input_dim=3, kernel_initializer='normal', activation='relu'))
# "Deep learning" turns out to be unnecessary - this additional hidden layer doesn't help either.
model.add(Dense(4, kernel_initializer='normal', activation='relu'))
# Output layer with a binary classification (benign or malignant)
model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
# Compile model; adam seemed to work best
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])



In [44]:
# Note: this fits on the full 3-feature data set, so the reported accuracy is training accuracy.
model.fit(x_scaled,y,verbose=2,epochs=200)

Epoch 1/200
31/31 - 1s - 30ms/step - accuracy: 0.5234 - loss: 0.6929
Epoch 2/200
31/31 - 0s - 2ms/step - accuracy: 0.5369 - loss: 0.6919
Epoch 3/200
31/31 - 0s - 2ms/step - accuracy: 0.5369 - loss: 0.6893
Epoch 4/200
31/31 - 0s - 2ms/step - accuracy: 0.5369 - loss: 0.6828
Epoch 5/200
31/31 - 0s - 2ms/step - accuracy: 0.5369 - loss: 0.6708
Epoch 6/200
31/31 - 0s - 2ms/step - accuracy: 0.5369 - loss: 0.6543
Epoch 7/200
31/31 - 0s - 2ms/step - accuracy: 0.5369 - loss: 0.6324
Epoch 8/200
31/31 - 0s - 2ms/step - accuracy: 0.5369 - loss: 0.6074
Epoch 9/200
31/31 - 0s - 2ms/step - accuracy: 0.5369 - loss: 0.5853
Epoch 10/200
31/31 - 0s - 2ms/step - accuracy: 0.6837 - loss: 0.5679
Epoch 11/200
31/31 - 0s - 2ms/step - accuracy: 0.7888 - loss: 0.5555
Epoch 12/200
31/31 - 0s - 2ms/step - accuracy: 0.7950 - loss: 0.5467
Epoch 13/200
31/31 - 0s - 2ms/step - accuracy: 0.7940 - loss: 0.5402
Epoch 14/200
31/31 - 0s - 2ms/step - accuracy: 0.7940 - loss: 0.5358
Epoch 15/200
31/31 - 0s - 2ms/step - accur

<keras.src.callbacks.history.History at 0x1d92535bcb0>