# **Breast Cancer Classification**

**Author:** Meg Hutch

**Date:** October 22, 2019


**Objective:** Classify breast cancer tumors as malignant or benign using the Breast Cancer Wisconsin dataset downloaded from https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

**Additional reference:** https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

**Dataset:** The following data description is provided with the dataset:

**Attribute Information:**

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

1.   ID Number
2.   Diagnosis (M = malignant, B = benign)

Ten real-valued features are computed for each cell nucleus (3-32):

3.  radius (mean of distances from center to points on the perimeter)
4.  texture (standard deviation of gray-scale values)
5.  perimeter
6.  area
7.  smoothness (local variation in radius lengths)
8.  compactness (perimeter^2 / area - 1.0)
9.  concavity (severity of concave portions of the contour)
10. concave points (number of concave portions of the contour)
11. symmetry
12. fractal dimension ("coastline approximation" - 1)

The mean, standard error and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
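
The field numbering above can be reproduced with a short sketch. It assumes the column naming pattern seen in the feature lists later in this notebook (for example radius_mean); the _se and _worst suffixes are an assumption here, since only the _mean columns are used below.

```python
# Base feature names, in the order listed above
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
# Suffixes assumed to follow the CSV naming (only "mean" appears in this notebook)
stats = ["mean", "se", "worst"]

# Fields 1 and 2 are the ID and the diagnosis, so the features start at field 3
field_names = {}
field = 3
for stat in stats:
    for feature in base_features:
        field_names[field] = f"{feature}_{stat}"
        field += 1

print(len(field_names))  # number of feature columns
print(field_names[3], field_names[13], field_names[23])
```

Fields 3, 13 and 23 come out as the mean, standard error and worst radius, which matches the description.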

**WIP Updates**

**11.01.2019: MH implemented the code for logistic regression, will need to add random forest, and also ensure the neural network is okay**

**11.04.2019: MH is not sure if I need the train_x or val_x to also be converted to long format ; I'm having problems getting the models to run due to dimension problems**

**11.05.2019: MH will try revising the code for xb, yb and in regards to the data loader -- I think this is the problem. First though, I will save what I've done to github**

**Update: Can't figure out how to upload to github, but I created a cleaner version of this code. I'm thinking about removing the validation set, since I think this reduces the size, but I'm a bit wary of doing so.**

**Need to also implement random forest classifier**

In [0]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.colors
import seaborn as sns

# Connect Colab to Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [0]:
# Import Data
tumor = pd.read_csv('/content/drive/My Drive/Projects/Breast_Cancer_Wisconsin/data.csv')

# View data
tumor.head(10)

# **Explore Data**

First, we will examine the class balance and the distributions of the mean-valued features.

In [0]:
tumor.diagnosis.value_counts().plot(kind="bar")
count_dx = tumor.groupby(['diagnosis']).size()
print('Total Number of Patients:', len(tumor.index))
print('Number Diagnosed:', count_dx)
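# Use the computed counts rather than hard-coded values, and format as percentages
print('Percent Benign: {:.1%}'.format(count_dx['B'] / len(tumor.index)))
print('Percent Malignant: {:.1%}'.format(count_dx['M'] / len(tumor.index)))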


In [0]:
# Create a new dataframe containing only the columns of interest
tumor_plots = tumor[['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean']]

# Generically define how many plots along and across
ncols = 3
nrows = int(np.ceil(len(tumor_plots.columns) / (1.0*ncols)))
fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(10, 10))

# Lazy counter so we can remove unwanted axes
counter = 0
for i in range(nrows):
    for j in range(ncols):

        ax = axes[i][j]

        # Plot when we have data
        if counter < len(tumor_plots.columns):

            ax.hist(tumor_plots[tumor_plots.columns[counter]], bins=50, color='blue', alpha=0.5, label='{}'.format(tumor_plots.columns[counter]))
            ax.set_xlabel('x')
            ax.set_ylabel('Count')  # hist() plots raw counts here, not a normalized density
            leg = ax.legend(loc='upper left')
            leg.draw_frame(False)

        # Remove axis when we no longer have data
        else:
            ax.set_axis_off()

        counter += 1

plt.show()

# **Correlations for Feature Selection**

I'll eventually need to look into this more. For now, I'm just going to include all ten mean features.

In [0]:
# Basic correlogram
#sns.pairplot(tumor_plots)
#plt.show()

In [0]:
tumor.head()

# **Logistic Regression**

We will first assess classification using a simple logistic regression. This will also serve as a benchmark once we develop our neural network classifier.

These steps were followed from the following tutorial: https://www.datacamp.com/community/tutorials/understanding-logistic-regression-python

In [0]:
# Packages 
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

Create dummy variables for diagnoses

In [0]:
tumor['diagnosis'] = tumor.diagnosis.map({'B':0, 'M':1})

Split the data frame into features and labels


In [0]:
# create x to represent the input features; y is the label; 
x = tumor[['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean']]
y = tumor.diagnosis # This is output of our training data

Split the data into testing and training

In [0]:
X_train,X_test,y_train,y_test=train_test_split(x,y,test_size=0.25,random_state=0)

Develop the logistic regression model

In [0]:
# instantiate the model (using default parameters)
logreg = LogisticRegression()

# fit the model with data
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)

In [0]:
y_pred

**Model Evaluation using Confusion Matrix**

In [0]:
# import the metrics class
from sklearn import metrics

cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix

The confusion matrix generated above is a 2x2 array. Diagonal values, running from the top left to the bottom right, count correct predictions, while off-diagonal values (bottom left and top right) count incorrect predictions. In scikit-learn's layout, rows are the actual classes and columns are the predicted classes.

In [0]:
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred))
print("Recall:",metrics.recall_score(y_test, y_pred))
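
As a rough check on the metrics above, accuracy, precision and recall can be derived by hand from a 2x2 confusion matrix. The sketch below uses made-up counts (not results from this notebook) and assumes scikit-learn's layout, with rows as actual classes, columns as predicted classes, and malignant (1) as the positive class.

```python
# Hypothetical counts for illustration only, laid out as
# [[TN, FP],
#  [FN, TP]]
demo_cm = [[85, 5],
           [8, 45]]

(tn, fp), (fn, tp) = demo_cm
total = tn + fp + fn + tp

demo_accuracy = (tp + tn) / total   # share of all predictions that are correct
demo_precision = tp / (tp + fp)     # of predicted malignant, share truly malignant
demo_recall = tp / (tp + fn)        # of truly malignant, share that were detected

print(f"Accuracy: {demo_accuracy:.3f}")
print(f"Precision: {demo_precision:.3f}")
print(f"Recall: {demo_recall:.3f}")
```

For a tumor classifier, recall is often the number to watch, since a false negative means a malignant tumor was labelled benign.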

**ROC**

The Receiver Operating Characteristic (ROC) curve is a plot of the true positive rate against the false positive rate across classification thresholds. It shows the tradeoff between sensitivity and specificity.

In [0]:
y_pred_proba = logreg.predict_proba(X_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
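
To make the ROC construction concrete, the sketch below computes the false positive rate and true positive rate at a few thresholds by hand. The labels and scores are invented for illustration, and roc_point is a hypothetical helper, not part of scikit-learn.

```python
# Toy labels (1 = malignant) and predicted probabilities, invented for illustration
toy_true = [0, 0, 1, 1, 0, 1, 0, 1]
toy_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]

def roc_point(y_true, y_score, threshold):
    """Return (FPR, TPR) when scores >= threshold are predicted positive."""
    tp = sum(1 for t, s in zip(y_true, y_score) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, y_score) if t == 0 and s >= threshold)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return fp / negatives, tp / positives

# Each threshold yields one point on the ROC curve
for threshold in (0.9, 0.65, 0.3, 0.0):
    toy_fpr, toy_tpr = roc_point(toy_true, toy_score, threshold)
    print(f"threshold={threshold:.2f}  FPR={toy_fpr:.2f}  TPR={toy_tpr:.2f}")
```

Lowering the threshold raises sensitivity (TPR) but also raises the false positive rate, which is one minus specificity.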

# **Random Forest Classification**

## **PyTorch Neural Network for Classification**


In [0]:
# Import PyTorch packages
import torch
from torch import nn
from torchvision import datasets, transforms
from torch import optim
from torch.utils.data.sampler import SubsetRandomSampler
from torch.utils.data.dataloader import DataLoader
from torch.utils.data import TensorDataset
import torch.nn.functional as F

In [0]:
# Function that randomly shuffles and splits the dataset
def split_indices(n, val_pct):
  # Determine size of test/validation set
  n_val = int(val_pct*n)
  # Create random permutation of 0 to n-1
  idxs = np.random.permutation(n)
  # Pick first n_val indices for test/validation set
  return idxs[n_val:], idxs[:n_val]

In [0]:
train_indices, test_indices = split_indices(len(tumor), val_pct=0.2)

print(len(train_indices), len(test_indices))
print('Sample test indices: ' , test_indices[:20])

In [0]:
# Create a test set
test_ds = tumor[tumor.index.isin(test_indices)]

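# Keep the remaining rows as the training pool. Reset the index so that the
# positional indices returned by split_indices match the row labels in the next split
tumor = tumor[tumor.index.isin(train_indices)].reset_index(drop=True)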

Now we can apply this function once more to create a training and validation set

In [0]:
train_indices, val_indices = split_indices(len(tumor), val_pct=0.2)

print(len(train_indices), len(val_indices))
print('Sample val indices: ' , val_indices[:10])
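
One thing to watch when splitting a second time: split_indices returns positions from 0 to n-1, while tumor.index.isin matches index labels. After filtering a DataFrame, the two only agree if the index has been reset. A small sketch with a made-up frame shows the difference:

```python
import pandas as pd

# Made-up frame with the default index labels 0..5
demo = pd.DataFrame({"value": [10, 20, 30, 40, 50, 60]})

# Keep rows with labels 1, 3, 4, 5, as a first split might
kept = demo[demo.index.isin([1, 3, 4, 5])]

# Positions 0..3 cover all four kept rows, but only labels 1 and 3 fall in 0..3
positions = [0, 1, 2, 3]
by_label = kept[kept.index.isin(positions)]
print(len(by_label))

# After resetting the index, labels and positions coincide again
kept_reset = kept.reset_index(drop=True)
by_label_reset = kept_reset[kept_reset.index.isin(positions)]
print(len(by_label_reset))
```

Without the reset, rows whose original labels exceed the new length end up in neither the training nor the validation set.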

**Create a Training Set**

In [0]:
# Create training set
train_ds = tumor[tumor.index.isin(train_indices)]

# Using the training dataset just created, remove the diagnosis variable and create training feature and label vector
xb = train_ds[['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean']]
yb = train_ds.diagnosis # This is output of our training data

# Convert data into arrays
xb = np.array(xb, dtype = "float32")
yb = np.array(yb, dtype= "float32")

# Convert arrays into tensors
xb = torch.from_numpy(xb)
yb = torch.from_numpy(yb)

#Combine the arrays 
trainloader = TensorDataset(xb, yb) 

# Define the batchsize
batch_size=25

# Training Loader
trainloader = DataLoader(trainloader, 
                         batch_size)

**Similarly, Create the Validation Set**


In [0]:
# Create validation set
val_ds = tumor[tumor.index.isin(val_indices)]

# Using the validation dataset just created, remove the diagnosis variable and create validation feature and label vector
xb = val_ds[['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean']]
yb = val_ds.diagnosis # This is output of our validation data

# Convert data into arrays
xb = np.array(xb, dtype = "float32")
yb = np.array(yb, dtype= "float32")

# Convert arrays into tensors
xb = torch.from_numpy(xb)
yb = torch.from_numpy(yb)

#Combine the arrays 
val_loader = TensorDataset(xb, yb) 

# Define the batchsize
batch_size=25

# Validation Loader
val_loader = DataLoader(val_loader, 
                         batch_size)

**Create/Format the Test Set**

In [0]:
# Using the test dataset just created, remove the diagnosis variable and create test feature and label vector
xb = test_ds[['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean']]
yb = test_ds.diagnosis # This is output of our test data

# Convert data into arrays
xb = np.array(xb, dtype = "float32")
yb = np.array(yb, dtype= "float32")

# Convert arrays into tensors
xb = torch.from_numpy(xb)
yb = torch.from_numpy(yb)

#Combine the arrays 
testloader = TensorDataset(xb, yb) 

# Define the batchsize
batch_size=25

# Test Loader
testloader = DataLoader(testloader, 
                         batch_size)

**Create Neural Network Model**

In [0]:
# Define the model with one hidden layer
model = nn.Sequential(nn.Linear(10, 5),
                      nn.ReLU(),
                      nn.Linear(5, 2),
                      nn.LogSoftmax(dim=1))

# Set optimizer and learning rate
#optimizer = optim.SGD(model.parameters(), lr=0.001)

# Could also use Adam optimizer; similar to stochastic gradient descent, but uses momentum which can speed up the actual fitting process, and it also adjusts the learning rate for each of the individual parameters in the model
optimizer = optim.Adam(model.parameters(), lr=0.003)

# Define the loss
criterion = nn.NLLLoss()

# Set 50 epochs to start
epochs = 50
for e in range(epochs):
    running_loss = 0
    for xb, yb in trainloader:

        # Flatten yb
        #yb = yb.view(yb.shape[0], -1)
        
        # Clear the gradients, do this because gradients are accumulated
        optimizer.zero_grad()
        
        # Training pass
        output = model.forward(xb)
        loss = criterion(output, yb.long()) # Loss calculated from the output compared to the labels 
        loss.backward()
        optimizer.step()
        
        # loss.item() returns the scalar value held in the loss tensor; accumulate it over the epoch's batches
        running_loss += loss.item()
    else:
        print(f"Training loss: {running_loss/len(trainloader)}")

The goal of validation is to measure the model's performance on data that isn't part of the training set. Performance here is up to the developer to define though. Typically, this is just accuracy, the percentage of classes the network predicted correctly. 

First, do a forward pass with one batch from the test set 

In [0]:
xb, yb = next(iter(testloader))

# Get the class probabilities 
ps = torch.exp(model(xb))

# Make sure the shape is appropriate: we should get 2 class probabilities for each of the 25 examples in the batch
print(ps.shape)

With the probabilities, we can get the most likely class using the ps.topk method. This returns the k highest values. Since we just want the most likely class, we can use ps.topk(1). This returns a tuple of the top-k values and the top-k indices. If the highest value is the first element, we'll get back 0 as the index, which here corresponds to benign (1 corresponds to malignant).

In [0]:
top_p, top_class = ps.topk(1, dim=1)
# Look at the most likely classes for the first 20 examples
print(top_class[:20,:])

Now we can check if the predicted classes match the labels. This is simple to do by equating top_class and labels, but we have to be careful of the shapes. To get the equality to work out the way we want, top_class and the labels (yb) must have the same shape.

In [0]:
equals = top_class == yb.view(*top_class.shape) 
equals
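
The shape warning above is easy to see in a small example. The sketch below uses NumPy, whose broadcasting rules PyTorch follows for this kind of comparison; the arrays are made up. Comparing a column of shape (4, 1) with a flat vector of shape (4,) broadcasts to a (4, 4) grid instead of comparing element by element.

```python
import numpy as np

# Invented predictions and labels for four examples
top_class_demo = np.array([[0], [1], [1], [0]])  # shape (4, 1), like the topk indices
labels_demo = np.array([0, 1, 0, 0])             # shape (4,), like yb

# Mismatched shapes compare every prediction against every label
wrong = top_class_demo == labels_demo
print(wrong.shape)

# Reshaping the labels to match gives one comparison per example
right = top_class_demo == labels_demo.reshape(top_class_demo.shape)
print(right.shape)
print(right.mean())  # fraction of matching predictions
```

Taking the mean of the (4, 4) grid would still run without error, which is why the mismatch is easy to miss.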

Now we need to calculate the correct predictions. 

equals has binary values, either 0 or 1. This means that if we just sum up all the values and divide by the total number of values, we get the percentage of correct predictions. This is the same operation as taking the mean, so we can get the accuracy with a call to torch.mean. 

So we'll need to convert equals to a float tensor. Note that torch.mean returns a scalar tensor; to get the actual value as a float, we'll need to call accuracy.item().

In [0]:
accuracy = torch.mean(equals.type(torch.FloatTensor))
print(f'Accuracy: {accuracy.item()*100}%')
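
The accuracy above covers a single batch. A possible extension, sketched below, accumulates correct predictions over every batch in a loader. To stay self-contained it uses a stand-in untrained model and random data (demo_model and demo_loader are placeholders, not the notebook's trained model or testloader), so the printed accuracy is meaningless in itself.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-ins for the notebook's model and test loader (random data, untrained model)
demo_model = nn.Sequential(nn.Linear(10, 5), nn.ReLU(),
                           nn.Linear(5, 2), nn.LogSoftmax(dim=1))
demo_x = torch.randn(60, 10)
demo_y = torch.randint(0, 2, (60,)).float()
demo_loader = DataLoader(TensorDataset(demo_x, demo_y), batch_size=25)

n_correct = 0
n_total = 0
demo_model.eval()            # evaluation mode (matters for layers such as dropout)
with torch.no_grad():        # gradients are not needed for evaluation
    for batch_x, batch_y in demo_loader:
        batch_ps = torch.exp(demo_model(batch_x))
        _, batch_top_class = batch_ps.topk(1, dim=1)
        matches = batch_top_class == batch_y.long().view(*batch_top_class.shape)
        n_correct += matches.sum().item()
        n_total += batch_y.shape[0]

print(f"Accuracy over all batches: {n_correct / n_total * 100:.1f}%")
```

Counting matches and dividing once at the end avoids the bias that averaging per-batch accuracies would introduce when the last batch is smaller.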