# **Breast Cancer Classification**

**Author:** Meg Hutch

**Date:** October 22, 2019


**Objective:** Classify breast cancer tumors as malignant or benign using the Breast Cancer Wisconsin (Diagnostic) dataset, downloaded from https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

**Additional reference:** https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

**Dataset:** The data description provided with the dataset is reproduced below.

**Attribute Information:**

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

1.   ID Number
2.   Diagnosis (M = malignant, B = benign)

Ten real-valued features are computed for each cell nucleus (3-32):

3.  radius (mean of distances from center to points on the perimeter)
4.  texture (standard deviation of gray-scale values)
5.  perimeter
6.  area
7.  smoothness (local variation in radius lengths)
8.  compactness (perimeter^2 / area - 1.0)
9.  concavity (severity of concave portions of the contour)
10. concave points (number of concave portions of the contour)
11. symmetry
12. fractal dimension ("coastline approximation" - 1)

The mean, standard error and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
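
The field numbering described above can be sketched in a few lines of plain Python. This is only an illustration: it assumes the Kaggle CSV names its columns with `_mean`, `_se` and `_worst` suffixes (the `_mean` names match those used later in this notebook), and `field_to_column` is a hypothetical helper, not part of the dataset.

```python
# Base feature names (fields 3-12), spelled as in the CSV column headers
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]

# Each statistic contributes one block of ten consecutive fields (assumed suffixes)
statistics = ["mean", "se", "worst"]

feature_columns = [f"{name}_{stat}" for stat in statistics for name in base_features]


def field_to_column(field_number):
    """Map a 1-based field number (3-32) to its column name (hypothetical helper)."""
    return feature_columns[field_number - 3]


print(len(feature_columns))
print(field_to_column(3), field_to_column(13), field_to_column(23))
```

Fields 1 and 2 (ID and diagnosis) are skipped, which is why the helper subtracts 3.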

**WIP Updates**

**11.01.2019: MH implemented the code for logistic regression; still need to add random forest and check that the neural network is okay**

**11.04.2019: MH is not sure whether train_x or val_x also need to be converted to long format; the models are failing to run because of dimension problems**

**11.05.2019: MH will try revising the code for xb, yb and the data loader, since this is probably where the problem is. First, though, I will save what I've done to GitHub**

In [0]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.colors
import seaborn as sns

# Connect Colab to google drive
from google.colab import drive
drive.mount('/content/drive')

In [0]:
# Import Data
tumor = pd.read_csv('/content/drive/My Drive/Projects/Breast_Cancer_Wisconsin/data.csv')

# View data
tumor.head(10)

# **Explore Data**

First, we will examine the class balance of the diagnoses and the distributions of the mean-valued features.

In [0]:
tumor.diagnosis.value_counts().plot(kind="bar")
count_dx = tumor.groupby(['diagnosis']).size()
print('Total Number of Patients:', len(tumor.index))
print('Number Diagnosed:', count_dx)
print('Percent Benign: {:.1f}%'.format(100 * count_dx['B'] / len(tumor.index)))
print('Percent Malignant: {:.1f}%'.format(100 * count_dx['M'] / len(tumor.index)))


In [0]:
# Create a new dataframe containing only the columns of interest
tumor_plots = tumor[['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean']]

# Generically define how many plots along and across
ncols = 3
nrows = int(np.ceil(len(tumor_plots.columns) / (1.0*ncols)))
fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(10, 10))

# Lazy counter so we can remove unwanted axes
counter = 0
for i in range(nrows):
    for j in range(ncols):

        ax = axes[i][j]

        # Plot when we have data
        if counter < len(tumor_plots.columns):

            ax.hist(tumor_plots[tumor_plots.columns[counter]], bins=50, color='blue', alpha=0.5, label='{}'.format(tumor_plots.columns[counter]))
            ax.set_xlabel('x')
            ax.set_ylabel('Count')
            leg = ax.legend(loc='upper left')
            leg.draw_frame(False)

        # Remove axis when we no longer have data
        else:
            ax.set_axis_off()

        counter += 1

plt.show()

# **Correlations for Feature Selection**

Feature selection based on these correlations still needs a closer look. For now, all ten mean-valued features are included.

In [0]:
# Basic correlogram
#sns.pairplot(tumor_plots)
#plt.show()

In [0]:
tumor.head()

# **Logistic Regression**

We will first assess classification with a simple logistic regression. This will also serve as a benchmark once we develop our neural network classifier.

These steps were followed from the following tutorial: https://www.datacamp.com/community/tutorials/understanding-logistic-regression-python

In [0]:
# Packages 
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

Create dummy variables for diagnoses

In [0]:
tumor['diagnosis'] = tumor.diagnosis.map({'B':0, 'M':1})

Split the data frame into features and labels


In [0]:
# create x to represent the input features; y is the label; 
x = tumor[['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean']]
y = tumor.diagnosis # This is output of our training data

Split the data into testing and training

In [0]:
X_train,X_test,y_train,y_test=train_test_split(x,y,test_size=0.25,random_state=0)

Develop the logistic regression model

In [0]:
# instantiate the model (using default parameters)
logreg = LogisticRegression()

# fit the model with data
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)

In [0]:
y_pred

**Model Evaluation using Confusion Matrix**

In [0]:
# import the metrics class
from sklearn import metrics

cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix

The confusion matrix generated above is in the form of an array. Rows correspond to the actual classes and columns to the predicted classes. The diagonal running from the top left to the bottom right corner holds the correct predictions, while the off-diagonal elements (bottom left and top right) are the incorrect predictions.
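
To make the layout concrete, here is a minimal sketch in plain Python that counts the four cells by hand and derives accuracy, precision and recall from them. The labels are made-up toy values (1 = malignant), and `confusion_counts` is an illustrative helper rather than a scikit-learn function.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 is the positive class."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


# Toy labels, invented for illustration only
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
y_pred_demo = [0, 0, 1, 0, 1, 1, 0, 1, 0, 1]

tn, fp, fn, tp = confusion_counts(y_true_demo, y_pred_demo)

# Rows are actual classes, columns are predicted classes
matrix = [[tn, fp], [fn, tp]]

accuracy_demo = (tp + tn) / len(y_true_demo)  # correct predictions / all predictions
precision_demo = tp / (tp + fp)               # of predicted malignant, how many were malignant
recall_demo = tp / (tp + fn)                  # of actual malignant, how many were found

print(matrix, accuracy_demo, precision_demo, recall_demo)
```

For a tumor classifier, recall on the malignant class is usually the number to watch, since a false negative is a missed cancer.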

In [0]:
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred))
print("Recall:",metrics.recall_score(y_test, y_pred))

**ROC**

The Receiver Operating Characteristic (ROC) curve is a plot of the true positive rate against the false positive rate across classification thresholds. It shows the tradeoff between sensitivity and specificity.
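
As a rough illustration of where the points on the curve come from, the sketch below computes one (false positive rate, true positive rate) pair per threshold in plain Python. The scores and labels are toy values, and `roc_point` is a hypothetical helper, not the scikit-learn implementation.

```python
def roc_point(y_true, scores, threshold):
    """Return (fpr, tpr) when every score >= threshold is predicted positive."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return fp / negatives, tp / positives


# Toy predicted probabilities for the positive class, invented for illustration
y_true_demo = [0, 0, 1, 1]
scores_demo = [0.1, 0.4, 0.35, 0.8]

# Lowering the threshold moves along the curve from (0, 0) towards (1, 1)
roc_points = [roc_point(y_true_demo, scores_demo, th) for th in (0.9, 0.5, 0.3, 0.0)]
print(roc_points)
```

Each threshold trades sensitivity (the true positive rate) against specificity (one minus the false positive rate).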

In [0]:
y_pred_proba = logreg.predict_proba(X_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

# **Random Forest Classification**

## **PyTorch Neural Network for Classification**


In [0]:
# Import PyTorch packages
import torch
from torch import nn
from torchvision import datasets, transforms
from torch import optim
from torch.utils.data.sampler import SubsetRandomSampler
from torch.utils.data.dataloader import DataLoader
from torch.utils.data import TensorDataset
import torch.nn.functional as F

Meg is going to restart here. 

In [0]:
# Function that randomly shuffles and splits the dataset
def split_indices(n, val_pct):
  # Determine size of test/validation set
  n_val = int(val_pct*n)
  # Create random permutation of 0 to n-1
  idxs = np.random.permutation(n)
  # Pick first n_val indices for test/validation set
  return idxs[n_val:], idxs[:n_val]

In [0]:
train_indices, test_indices = split_indices(len(tumor), val_pct=0.2)

print(len(train_indices), len(test_indices))
print('Sample test indices: ' , test_indices[:20])

In [0]:
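# Create a test set (split_indices returns row positions, so select with iloc)
test_ds = tumor.iloc[test_indices]

# Keep the remaining rows as tumor and reset the index so that positions and index labels match in the next split
tumor = tumor.iloc[train_indices].reset_index(drop=True)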

Now we can apply this function once more to create a training and a validation set
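
As a rough check on the set sizes this two-stage split should produce, here is a minimal sketch in plain Python. It assumes the 569 rows counted earlier and a 20% hold-out at each stage; `split_positions`, the fixed seed and the variable names are illustrative and not part of the notebook.

```python
import random


def split_positions(n, holdout_fraction, rng):
    """Shuffle positions 0..n-1 and return (kept, held_out) lists."""
    n_holdout = int(holdout_fraction * n)
    positions = list(range(n))
    rng.shuffle(positions)
    return positions[n_holdout:], positions[:n_holdout]


rng = random.Random(0)  # fixed seed, chosen only to make the sketch repeatable
n_rows = 569

# Stage 1: hold out the test set
remaining, test_pos = split_positions(n_rows, 0.2, rng)

# Stage 2: split what is left into training and validation
train_pos, val_pos = split_positions(len(remaining), 0.2, rng)

# Stage 2 returns positions *within* `remaining`, so map them back to original rows
train_rows = [remaining[i] for i in train_pos]
val_rows = [remaining[i] for i in val_pos]

print(len(train_rows), len(val_rows), len(test_pos))
```

The mapping step matters: positions from the second split only line up with the original row labels if the remaining rows are re-indexed from zero or selected by position.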

In [0]:
train_indices, val_indices = split_indices(len(tumor), val_pct=0.2)

print(len(train_indices), len(val_indices))
print('Sample val indices: ' , val_indices[:10])

**Create a Training Set**

In [0]:
# Create training set
train_ds = tumor[tumor.index.isin(train_indices)]

# Using the training dataset just created, remove the diagnosis variable and create training feature and label vector
xb = train_ds[['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean']]
yb = train_ds.diagnosis # This is output of our training data

# Convert data into arrays
xb = np.array(xb, dtype = "float32")
yb = np.array(yb, dtype= "float32")

# Convert arrays into tensors
xb = torch.from_numpy(xb)
yb = torch.from_numpy(yb)

#Combine the arrays 
trainloader = TensorDataset(xb, yb) 

# Define the batchsize
batch_size=25

# Training Loader
trainloader = DataLoader(trainloader, 
                         batch_size)

**Similarly, Create the Validation Set**

In [0]:
# Create validation set
val_ds = tumor[tumor.index.isin(val_indices)]

# Using the validation dataset just created, remove the diagnosis variable and create validation feature and label vector
xb = val_ds[['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean']]
yb = val_ds.diagnosis # This is output of our validation data

# Convert data into arrays
xb = np.array(xb, dtype = "float32")
yb = np.array(yb, dtype= "float32")

# Convert arrays into tensors
xb = torch.from_numpy(xb)
yb = torch.from_numpy(yb)

#Combine the arrays 
val_loader = TensorDataset(xb, yb) 

# Define the batchsize
batch_size=25

# Validation Loader
val_loader = DataLoader(val_loader, 
                         batch_size)

**Create/Format the Test Set**

In [0]:
# Using the test dataset just created, remove the diagnosis variable and create test feature and label vector
xb = test_ds[['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean']]
yb = test_ds.diagnosis # This is output of our test data

# Convert data into arrays
xb = np.array(xb, dtype = "float32")
yb = np.array(yb, dtype= "float32")

# Convert arrays into tensors
xb = torch.from_numpy(xb)
yb = torch.from_numpy(yb)

#Combine the arrays 
testloader = TensorDataset(xb, yb) 

# Define the batchsize
batch_size=25

# Test Loader
testloader = DataLoader(testloader, 
                         batch_size)

**Create Neural Network Model**

In [0]:
# Define the model with one hidden layer
model = nn.Sequential(nn.Linear(10, 5),
                      nn.ReLU(),
                      nn.Linear(5, 2),
                      nn.LogSoftmax(dim=1))

# Set optimizer and learning rate
#optimizer = optim.SGD(model.parameters(), lr=0.001)

# Could also use Adam optimizer; similar to stochastic gradient descent, but uses momentum which can speed up the actual fitting process, and it also adjusts the learning rate for each of the individual parameters in the model
optimizer = optim.Adam(model.parameters(), lr=0.003)

# Define the loss
criterion = nn.NLLLoss()

# Set 20 epochs to start
epochs = 20
for e in range(epochs):
    running_loss = 0
    for xb, yb in trainloader:

        # Flatten yb
        #yb = yb.view(yb.shape[0], -1)
        
        # Clear the gradients, do this because gradients are accumulated
        optimizer.zero_grad()
        
        # Training pass
        output = model.forward(xb)
        loss = criterion(output, yb.long()) # Loss calculated from the output compared to the labels
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item() # loss.item() gets the scalar value held in the loss; running_loss starts at 0 each epoch
        # The += notation adds loss.item() to running_loss and assigns the result back to running_loss
    else:
        print(f"Training loss: {running_loss/len(trainloader)}")

In [0]:
!wget https://raw.githubusercontent.com/udacity/deep-learning-v2-pytorch/3bd7dea850e936d8cb44adda8200e4e2b5d627e3/intro-to-pytorch/helper.py
import helper

In [0]:
xb, yb = next(iter(testloader))

# Get the class probabilities 
ps = torch.exp(model(xb))

# Make sure the shape is appropriate: we should get 2 class probabilities for each of the 25 examples in the batch
print(ps.shape)

In [0]:
top_p, top_class = ps.topk(1, dim=1)
# Look at the most likely classes for the first 20 examples
print(top_class[:20,:])

In [0]:
equals = top_class == yb.view(*top_class.shape) 

In [0]:
equals

In [0]:
accuracy = torch.mean(equals.type(torch.FloatTensor))
print(f'Accuracy: {accuracy.item()*100}%')

# **DRAFT!!!!**

First need to create a test set. For this we will choose 20% of the data and define the function that can split the data 

In [0]:
# Function that randomly shuffles and splits the dataset
def split_indices(n, val_pct):
  # Determine size of test/validation set
  n_val = int(val_pct*n)
  # Create random permutation of 0 to n-1
  idxs = np.random.permutation(n)
  # Pick first n_val indices for test/validation set
  return idxs[n_val:], idxs[:n_val]

In [0]:
train_indices, test_indices = split_indices(len(tumor), val_pct=0.2)

print(len(train_indices), len(test_indices))
print('Sample test indices: ' , test_indices[:20])

In [0]:
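# Create a test set (split_indices returns row positions, so select with iloc)
test_ds = tumor.iloc[test_indices]

# Keep the remaining rows as tumor and reset the index so that positions and index labels match in the next split
tumor = tumor.iloc[train_indices].reset_index(drop=True)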

Now we can apply this function once more to create a training and a validation set

In [0]:
train_indices, val_indices = split_indices(len(tumor), val_pct=0.2)

print(len(train_indices), len(val_indices))
print('Sample val indices: ' , val_indices[:10])

**Create a Training Set**

In [0]:
from torch.utils.data.dataloader import DataLoader

# Create training set
train_ds = tumor[tumor.index.isin(train_indices)]

# Using the training dataset just created, remove the diagnosis variable and create training feature and label vector
xb = train_ds[['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean']]
yb = train_ds.diagnosis # This is output of our training data

# Convert data into arrays
xb = np.array(xb, dtype = "float32")
yb = np.array(yb, dtype= "float32")

# Convert arrays into tensors
xb = torch.from_numpy(xb)
yb = torch.from_numpy(yb)

# Convert only the labels to long (class indices); the features must stay float32 for nn.Linear
yb = yb.long()

#Combine the arrays 
trainloader = TensorDataset(xb, yb) 

# Define the batchsize
batch_size=25

# Training Loader
trainloader = DataLoader(trainloader, 
                         batch_size)

**Similarly, Create the Validation Set**

In [0]:
from torch.utils.data.dataloader import DataLoader

# Create validation set
val_ds = tumor[tumor.index.isin(val_indices)]

# Using the validation dataset just created, remove the diagnosis variable and create validation feature and label vector
xb = val_ds[['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean']]
yb = val_ds.diagnosis # This is output of our validation data

# Convert data into arrays
xb = np.array(xb, dtype = "float32")
yb = np.array(yb, dtype= "float32")

# Convert arrays into tensors
xb = torch.from_numpy(xb)
yb = torch.from_numpy(yb)

# Convert only the labels to long (class indices); the features must stay float32 for nn.Linear
yb = yb.long()

#Combine the arrays 
val_loader = TensorDataset(xb, yb) 

# Define the batchsize
batch_size=25

# Validation Loader
val_loader = DataLoader(val_loader, 
                         batch_size)

# **Building the Neural Network**

In [0]:
input_size = 10 # we have 10 input features 
num_classes = 2 # we have two output labels, benign and malignant

# Logistic regression model
model = nn.Linear(input_size, num_classes) # nn.Linear automatically initializes the weights and biases

In [0]:
model

In [0]:
print(model.weight.shape)
model.weight

In [0]:
for xb, yb in trainloader:
  print(xb.shape)
  outputs = model(xb)
  break

In [0]:
class MnistModel(nn.Module):
  def __init__(self): # we instantiate the weights and biases
    super().__init__() 
    self.linear = nn.Linear(input_size, num_classes)
    
  def forward(self,xb): # we flatten out the input tensor, and then pass it into self.linear
    #xb = xb.reshape(-1, 10) # indicates to PyTorch that we want a view of the xb tensor with two dimensions, where the length along the 2nd dimension is 10 - one argument to .reshape can be set to -1 (in this case, the first dimension), to let PyTorch figure it out automatically based on the shape of the original tensor
    out = self.linear(xb)
    return out
  
model = MnistModel() 

In [0]:
for xb, yb in trainloader:
    outputs = model(xb)
    break

print('outputs.shape :', outputs.shape)
print('Sample outputs :\n', outputs[:25].data)

In [0]:
# Apply softmax for each output row
probs = F.softmax(outputs, dim=1) # The softmax function requires us to specify a dimension along which the softmax must be applied

# Look at sample probabilities
print("Sample probabilities:\n", probs[:2].data)

# Add up the probabilities of an output row
print("Sum:", torch.sum(probs[0]).item())

Finally, we can determine the predicted label for each sample by simply choosing the index of the element with the highest probability in each output row. This is done using torch.max, which returns the largest element and the index of the largest element along a particular dimension of a tensor.

In [0]:
max_probs, preds = torch.max(probs, dim=1)
print(preds)

Now we can compare with the actual labels. (MH thinks this should be with the train_y dataset, since we are checking whether the model was able to learn from the train_x set.)

In [0]:
yb

# **Evaluation Metric and Loss Function**

We need a way to evaluate how well our model is performing. For this reason, we can use cross entropy. 

Cross Entropy works as follows: 

* For each output row, pick the predicted probability for the correct label. E.g. if the predicted probabilities for a sample are [0.1, 0.3, 0.2,...] and the correct label is 1, we pick the corresponding element 0.3 and ignore the rest.
* Then, take the logarithm of the picked probability. If the probability is high, i.e. close to 1, then its logarithm is a very small negative value, close to 0. If the probability is low (close to 0), then the logarithm is a very large negative value. We also multiply the result by -1, which results in a large positive value of the loss for poor predictions. 
* Finally, take the average of the cross entropy across all the output rows to get the overall loss for a batch of data. 

Cross-entropy is continuous and differentiable, so it provides good feedback for incremental improvements in the model (a slightly higher probability for the correct label leads to a lower loss). 

PyTorch provides an efficient and tensor-friendly implementation of cross-entropy. It also performs a softmax internally so we can directly pass in the outputs of the model without converting them into probabilities. 
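
The three steps above, including the internal softmax, can be sketched in plain Python. This is a simplified illustration of the arithmetic under toy inputs, not PyTorch's implementation; `softmax`, `cross_entropy` and the example logits are made up for the purpose.

```python
import math


def softmax(logits):
    """Turn raw model outputs into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(batch_logits, labels):
    """Average negative log-probability of the correct class over a batch."""
    losses = []
    for logits, label in zip(batch_logits, labels):
        probs = softmax(logits)
        losses.append(-math.log(probs[label]))  # pick the correct label, take -log
    return sum(losses) / len(losses)


# Two toy samples with two classes each (0 = benign, 1 = malignant)
logits_demo = [[2.0, 0.0], [0.0, 0.0]]
labels_demo = [0, 1]

loss_demo = cross_entropy(logits_demo, labels_demo)
print(loss_demo)
```

The first sample is predicted confidently and correctly, so it contributes a small loss; the second is a coin flip and contributes log(2).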



# **DRAFTS**

Create training and validation sets for the data and then apply the indices

In [0]:
# Create training set
train_ds = tumor[tumor.index.isin(train_indices)]

# Create a validation set
val_ds = tumor[tumor.index.isin(val_indices)]

Create data loaders for the training and validation set. But first, remove the diagnosis variable from the x sets

In [0]:
# Using the training dataset just created, remove the diagnosis variable and create training feature and label vector
train_x = train_ds[['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean']]
train_y = train_ds.diagnosis # This is output of our training data

# do the same with the validation set
val_x = val_ds[['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean']] # taking validation data inputs
val_y = val_ds.diagnosis   # output value of validation data

Convert data into arrays

In [0]:
# Convert data into arrays
train_x = np.array(train_x, dtype = "float32")
train_y = np.array(train_y, dtype= "float32")

val_x = np.array(val_x, dtype = "float32")
val_y = np.array(val_y, dtype= "float32")

In [0]:
len(train_x)
#len(val_x)

In [0]:
# Convert arrays into tensors
train_x = torch.from_numpy(train_x)
train_y = torch.from_numpy(train_y)

val_x = torch.from_numpy(val_x)
val_y = torch.from_numpy(val_y)

Now we can create the dataloader 

In [0]:
from torch.utils.data import TensorDataset

#Combine the arrays 
trainloader = TensorDataset(train_x, train_y) 
val_loader = TensorDataset(val_x, val_y) 

In [0]:
# Define the batchsize
batch_size=25

# Training Loader
trainloader = DataLoader(trainloader, 
                         batch_size)

# Validation Data Loader
val_loader = DataLoader(val_loader, 
                         batch_size)

# **Building the Neural Network**

In [0]:
input_size = 10 # we have 10 input features 
num_classes = 2 # we have two output labels, benign and malignant

# Logistic regression model
model = nn.Linear(input_size, num_classes) # nn.Linear automatically initializes the weights and biases

In [0]:
model

In [0]:
print(model.weight.shape)
model.weight

In [0]:
for train_x, train_y in trainloader:
  print(train_x.shape)
  outputs = model(train_x)
  break

In [0]:
class MnistModel(nn.Module):
  def __init__(self): # we instantiate the weights and biases
    super().__init__() 
    self.linear = nn.Linear(input_size, num_classes)
    
  def forward(self, train_x): # we flatten out the input tensor, and then pass it into self.linear
    #train_x = train_x.reshape(-1, 10) # indicates to PyTorch that we want a view of the train_x tensor with two dimensions, where the length along the 2nd dimension is 10 - one argument to .reshape can be set to -1 (in this case, the first dimension), to let PyTorch figure it out automatically based on the shape of the original tensor
    out = self.linear(train_x)
    return out
  
model = MnistModel() 

In [0]:
for train_x, train_y in trainloader:
    outputs = model(train_x)
    break

print('outputs.shape :', outputs.shape)
print('Sample outputs :\n', outputs[:25].data)

In [0]:
# Apply softmax for each output row
probs = F.softmax(outputs, dim=1) # The softmax function requires us to specify a dimension along which the softmax must be applied

# Look at sample probabilities
print("Sample probabilities:\n", probs[:2].data)

# Add up the probabilities of an output row
print("Sum:", torch.sum(probs[0]).item())

Finally, we can determine the predicted label for each sample by simply choosing the index of the element with the highest probability in each output row. This is done using torch.max, which returns the largest element and the index of the largest element along a particular dimension of a tensor.

In [0]:
max_probs, preds = torch.max(probs, dim=1)
print(preds)

Now we can compare with the actual labels. (MH thinks this should be with the train_y dataset, since we are checking whether the model was able to learn from the train_x set.)

In [0]:
train_y

# **Evaluation Metric and Loss Function**


In [0]:
loss_fn = F.cross_entropy
#loss_fn = torch.nn.BCELoss()

In [0]:
# Cross-entropy expects integer (long) class labels, so convert only the labels;
# the features must stay float32 for nn.Linear
train_y = train_y.long()
val_y = val_y.long()

In [0]:
# Loss for current batch of data
loss = loss_fn(outputs, train_y)
print(loss) 

Since the cross-entropy is the negative logarithm of the predicted probability of the correct label, averaged over all training samples, one way to interpret the resulting number, for example 7.6, is to look at e^-7.6 (about 0.0005) as the typical (geometric-mean) predicted probability of the correct label.

The lower the loss, the better the model.
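
A small worked example of this interpretation, using made-up probabilities: exponentiating the negative loss recovers the geometric mean of the probabilities the model assigned to the correct labels.

```python
import math

# Toy probabilities assigned to the correct label for three samples (illustrative values)
probs_correct = [0.9, 0.5, 0.1]

# Cross-entropy: average of the negative log-probabilities
loss = -sum(math.log(p) for p in probs_correct) / len(probs_correct)

# exp(-loss) is the geometric mean of the probabilities
typical_prob = math.exp(-loss)
geometric_mean = (0.9 * 0.5 * 0.1) ** (1 / 3)

print(loss, typical_prob, geometric_mean)

# The figure quoted in the text: a loss of 7.6 corresponds to a probability of about 0.0005
print(round(math.exp(-7.6), 4))
```

Because it is a geometric mean, a single very confident wrong prediction pulls this number down much more than it would pull down an arithmetic average.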

# **Optimizer**

We can use optim.SGD to update the weights and biases during training.

In [0]:
learning_rate = 0.001
optimizer = torch.optim.SGD(model.parameters(), 
                            lr=learning_rate)

Parameters such as batch size and learning rate need to be picked in advance when training machine learning models and are called hyperparameters. Picking the right hyperparameters is critical for training an accurate model within a reasonable amount of time.
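
To illustrate why the learning rate in particular matters, here is a minimal sketch of plain gradient descent on the one-dimensional function f(w) = w². The function, the step count and both learning rates are chosen purely for illustration and have nothing to do with the tumor model.

```python
def gradient_descent(learning_rate, steps=50, start=1.0):
    """Minimise f(w) = w**2, whose gradient is 2*w, and return the final w."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2 * w
    return w


w_small = gradient_descent(0.1)  # each step multiplies w by 0.8, so it shrinks towards 0
w_large = gradient_descent(1.1)  # each step multiplies w by -1.2, so it overshoots and diverges

print(w_small, w_large)
```

A rate that is too small would also converge, only more slowly, which is the "reasonable amount of time" side of the tradeoff.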

# **Training the Model**

After defining the data loaders, model, loss function and optimizer, we are ready to train the model. 

First, we can begin by defining a function loss_batch which:


*   Calculates the loss for a batch of data
*   Optionally performs the gradient descent update step if an optimizer is provided
*   Optionally computes a metric (e.g. accuracy) using the predictions and actual targets 



In [0]:
def loss_batch(model, loss_func, train_x, train_y,
               opt=None, metric=None):

  # Calculate loss
  preds = model(train_x)
  loss = loss_func(preds, train_y.long()) # cross-entropy expects long class indices, but the loaders yield float32 labels
  
  if opt is not None: # The optimizer is an optional argument, so that we can reuse loss_batch for computing the loss on the validation set
    # Compute gradients
    loss.backward()
    # Update parameters
    opt.step()
    # Reset gradients
    opt.zero_grad()
    
  metric_result = None
  if metric is not None:
      # Compute the metric
      metric_result = metric(preds, train_y) # Computes the accuracy
      
  return loss.item(), len(train_x), metric_result

Next, define a function evaluate, which calculates the overall loss (and a metric if provided) for the validation set

In [0]:
def evaluate(model, loss_fn, valid_dl, metric=None):
  with torch.no_grad(): # no_grad tells PyTorch not to track gradients, which are not needed when we only evaluate the model
    # Pass each batch through the model
    results = [loss_batch(model, loss_fn, 
                          val_x, val_y, metric=metric)
              for val_x, val_y in valid_dl]
    # Separate losses, counts and metrics
    losses, nums, metrics = zip(*results)
    # Total size of the dataset
    total = np.sum(nums)
    # Avg. loss across batches
    total_loss = np.sum(np.multiply(losses, nums))
    avg_loss = total_loss / total
    avg_metric = None
    if metric is not None: 
      # Avg. of metric across batches
      tot_metric = np.sum(np.multiply(metrics, nums))
      avg_metric = tot_metric / total
  return avg_loss, total, avg_metric

Also need to redefine the accuracy to operate on an entire batch of outputs directly, so that we can use it as a metric in fit.

In [0]:
def accuracy(outputs, train_y):
  _, preds = torch.max(outputs, dim=1)
  return torch.sum(preds == train_y).item() / len(preds)

Note: We don't need to apply softmax to the outputs, since it doesn't change the relative order of the results. This is because e^x is an increasing function: if y1 > y2, then e^y1 > e^y2, and the same holds after dividing by the sum of the exponentials to get the softmax.
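
The note can be checked numerically with a short plain-Python sketch. The logits are arbitrary toy values, and `softmax` and `argmax` are small illustrative helpers rather than the PyTorch functions.

```python
import math


def softmax(logits):
    """Convert raw outputs to probabilities."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=values.__getitem__)


# Arbitrary raw outputs for three samples and two classes
rows = [[1.2, -0.3], [-2.0, 0.5], [0.1, 0.1001]]

raw_preds = [argmax(r) for r in rows]
softmax_preds = [argmax(softmax(r)) for r in rows]

# The predicted class is the same with or without softmax
same = raw_preds == softmax_preds
print(raw_preds, softmax_preds, same)
```

Skipping softmax in the accuracy metric therefore saves a little computation without changing any prediction.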

Examine how the model performs on the validation set with the initial set of weights and biases

In [0]:
val_loss, total, val_acc = evaluate(
    model, loss_fn, val_loader, metric=accuracy)
print('Loss: {:.4f}, Accuracy: {:.4f}'
     .format(val_loss, val_acc))

In [0]:
type(loss_fn)


In [0]:
train_x.type

In [0]:
#train_y.numel()
train_x.numel()

# **#DRAFTS**

In [0]:
# Create a model
from torch import nn, optim
import torch.nn.functional as F

class Classifier(nn.Module):
  def __init__(self):
      super().__init__()
      self.fc1 = nn.Linear(10, 5)
      self.fc2 = nn.Linear(5, 2)
      
  def forward(self, x):
      # make sure input tensor is flattened
      #x = train_x.view(train_x.shape[0], -1)
      #x = train_x.reshape(-1, 10)
      
      x = F.relu(self.fc1(x))
      x = F.log_softmax(self.fc2(x), dim=1)
      
      return(x)

In [0]:
class TwoLayerNet(nn.Module):
  def __init__(self, D_in, H, D_out):
    """
    In the constructor we instantiate two nn.Linear modules and assign them as
    member variables.
    
    D_in: input dimension
    H: dimension of hidden layer
    D_out: output dimension
    """
    super(TwoLayerNet, self).__init__()
    self.linear1 = nn.Linear(D_in, H) 
    self.linear2 = nn.Linear(H, D_out)
  
  def forward(self, x):
    """
    In the forward function we accept a Variable of input data and we must 
    return a Variable of output data. We can use Modules defined in the 
    constructor as well as arbitrary operators on Variables.
    """
    h_relu = F.relu(self.linear1(x))
    y_pred = self.linear2(h_relu)
    return y_pred

In [0]:
# N is batch size; D_in is input dimension;
# H is the dimension of the hidden layer; D_out is output dimension.
N, D_in, H, D_out = 10, 10, 5, 2

# Take the first N rows of the current batch as float32 inputs
#x = Variable(torch.randn(N, D_in))  # dim: N x D_in

x = train_x[:N].float()

# Construct our model by instantiating the class defined above
model = TwoLayerNet(D_in, H, D_out)

# Forward pass: Compute predicted y by passing x to the model
y_pred = model(x)   # dim: N x D_out

Run the model...not sure if running into problems due to the vector needing to be flattened or problems with the trainloader? 

In [0]:
x = train_x.reshape(-1, 10)

In [0]:
x

In [0]:
model = Classifier()

train_x, train_y = next(iter(testloader))

# Get the class probabilities 
ps = torch.exp(model(train_x))

# Make sure the shape is appropriate
print(ps.shape)

In [0]:
# Define the model with one hidden layer
model = nn.Sequential(nn.Linear(10, 5),
                      nn.ReLU(),
                      nn.Linear(5, 2),
                      nn.LogSoftmax(dim=1))

# Set optimizer and learning rate
optimizer = optim.SGD(model.parameters(), lr=0.03)

# Could also use Adam optimizer; similar to stochastic gradient descent, but uses momentum which can speed up the actual fitting process, and it also adjusts the learning rate for each of the individual parameters in the model
# optimizer = optim.Adam(model.parameters(), lr=0.003)

# Define the loss
criterion = nn.NLLLoss()

# Set 5 epochs to start
epochs = 5
for e in range(epochs):
    running_loss = 0
    for train_x, train_y in trainloader:
      # Flatten data into a vector 
        #train_x = train_x.view(train_x.shape[0], -1)
        #try the reshape method
        #train_x = train_x.reshape(-1, 10)
        
        # Clear the gradients, do this because gradients are accumulated
        optimizer.zero_grad()
        
        # Training pass
        output = model.forward(train_x)
        loss = criterion(output, train_y) # Loss calculated from the output compard to the labels 
        loss.backward()
        optimizer.step()
        
        # loss.item() returns the scalar value held in the loss tensor;
        # add it to running_loss to accumulate the total loss over the epoch
        running_loss += loss.item()
    else:
        # for/else: runs once the loop over all batches has finished
        print(f"Training loss: {running_loss/len(trainloader)}")

In [0]:
train_x.numel()

In [0]:
batch_size=100

# Training sampler and data loader
train_sampler = SubsetRandomSampler(train_indices) #will sample elements randomly from a given list of indices, while creating batches of data
train_loader = DataLoader(tumor, 
                         batch_size,
                         sampler=train_sampler)

# Validation sampler and data loader
val_sampler = SubsetRandomSampler(val_indices)
val_loader = DataLoader(tumor, 
                       batch_size, 
                       sampler=val_sampler)

In [0]:
train_X = train[prediction_var]  # training inputs
train_y = train.diagnosis        # training labels
# Do the same for the test set
test_X = test[prediction_var]    # test inputs
test_y = test.diagnosis          # test labels

In [0]:
# Now split the data into train (70%) and test (30%) sets
train, test = train_test_split(tumor, test_size = 0.3)
# Check their dimensions
print(train.shape)
print(test.shape)

In [0]:
##Drafts!!!!!

In [0]:
# Convert dfs to arrays
inputs = np.array(inputs, dtype = "float32")
labels = np.array(labels, dtype= "float32")

Transform the data into arrays and then convert to tensors

In [0]:
# Convert arrays to tensors
inputs = torch.from_numpy(inputs)
labels = torch.from_numpy(labels)

Need to divide the data into a training and test set. Define a function to split the dataset:
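
A minimal sketch of such a split function, assuming a 70/30 split like the `train_test_split` call used elsewhere in this notebook. The name `split_dataset`, its parameters, and the random stand-in data are hypothetical and only for illustration; `random_split` is the PyTorch utility doing the work.

```python
import torch
from torch.utils.data import TensorDataset, random_split


def split_dataset(dataset, val_fraction=0.3, seed=0):
    """Randomly split a dataset into training and validation subsets."""
    n_val = int(val_fraction * len(dataset))
    n_train = len(dataset) - n_val
    # A seeded generator makes the split reproducible
    generator = torch.Generator().manual_seed(seed)
    return random_split(dataset, [n_train, n_val], generator=generator)


# Stand-in data with an assumed shape: 569 rows, 10 features, binary labels
features = torch.randn(569, 10)
targets = torch.randint(0, 2, (569,)).float()

train_subset, val_subset = split_dataset(TensorDataset(features, targets))
print(len(train_subset), len(val_subset))
```

Each returned subset can be passed to a `DataLoader` in the same way as a full `TensorDataset`.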

In [0]:
from torch.utils.data import TensorDataset, DataLoader

#Define dataset
train_ds = TensorDataset(inputs, labels)

In [0]:
# Define data loader
batch_size = 5
train_dl = DataLoader(train_ds, batch_size, shuffle=True)
next(iter(train_dl))

In [0]:
# The DataLoader is typically used in a for-in loop
for xb, yb in train_dl:
  print(xb)
  print(yb)
  break

**nn.Linear**

nn.Linear can be used to automatically initialize the weights and biases, rather than having to do so manually.

**I still have to assess whether this is actually working. I'm concerned about how to set up the nn.Linear layers. Also, I think I need to apply softmax to turn the outputs into probabilities.**
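
As a rough illustration of both points, the sketch below assumes 10 input features and 2 classes (benign and malignant); the batch `xb` is random stand-in data. The second argument of `nn.Linear` is the number of outputs per sample, and `F.softmax` turns each row of raw scores into probabilities that sum to 1.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical batch: 5 samples with 10 features each
xb = torch.randn(5, 10)

# out_features is the number of outputs per sample (2 classes here),
# not the number of rows in the dataset
linear = nn.Linear(in_features=10, out_features=2)

logits = linear(xb)               # shape (5, 2), unnormalised scores
probs = F.softmax(logits, dim=1)  # each row now sums to 1
pred_class = probs.argmax(dim=1)  # index of the most probable class

print(probs)
print(pred_class)
```

When training with `nn.NLLLoss`, the model should output `log_softmax` rather than `softmax`; the plain probabilities are mainly useful for inspection.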

In [0]:
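# Define model
# Note: the second argument of nn.Linear is the number of outputs per sample,
# not the number of samples in the dataset (569)
model = nn.Linear(10, 569)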
print(model.weight)
print(model.bias)

In [0]:
# parameters() returns all the weight and bias tensors present in the model.
# For our linear model, we have one weight matrix and one bias vector.
list(model.parameters())

In [0]:
# Generate predictions
preds = model(inputs)
preds 

**Loss Function**

Instead of defining a loss function manually, we can use the built-in loss function mse_loss
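
`mse_loss` treats the labels as continuous values. Since the diagnosis is a class label, a classification loss may be a better fit. The sketch below uses random stand-in scores and labels to show that `F.cross_entropy` on raw scores gives the same value as `F.nll_loss` on `log_softmax` outputs, which is the LogSoftmax plus NLLLoss pairing used in other cells of this notebook. Both expect the targets to be integer class indices.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in model outputs: raw scores for 5 samples and 2 classes
logits = torch.randn(5, 2)

# Targets must be integer class indices (dtype int64), not floats
targets = torch.tensor([0, 1, 1, 0, 1])

# cross_entropy applies log_softmax internally ...
ce = F.cross_entropy(logits, targets)

# ... so it matches nll_loss applied to explicit log_softmax outputs
nll = F.nll_loss(F.log_softmax(logits, dim=1), targets)

print(ce.item(), nll.item())
```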

In [0]:
# Import nn.functional
import torch.nn.functional as F

# This package contains many useful loss functions and several other utilities

In [0]:
# Define loss function
loss_fn = F.mse_loss

In [0]:
# Compute the loss for the current predictions of our model
loss = loss_fn(model(inputs), labels)
print(loss)

In [0]:
# Define optimizer
opt = torch.optim.SGD(model.parameters(), lr=1e-5)

#The model.parameters() is passed as an argument to optim.SGD so that the optimizer knows which matrices should be modified during the update step. 
#We can also specify a learning rate which controls the amount by which the parameters are modified

**Train the Model**

We will train the model with the same steps but using batches of data. The utility function **fit** trains the model for a given number of epochs

1.   Generate predictions
2.   Calculate the loss
3.   Compute the gradients w.r.t the weights and biases
4.   Adjust the weights by subtracting a small quantity proportional to the gradient
5.   Reset the gradients to zero

In [0]:
# Utility function to train the model
def fit(num_epochs, model, loss_fn, opt):
  
  # Repeat for given number of epochs:
  for epoch in range(num_epochs):
    
    #Train with batches of data
    for xb, yb in train_dl:
      
      # 1. Generate predictions
      pred = model(xb)
      
      # 2. Calculate loss
      loss = loss_fn(pred, yb)
      
      # 3. Compute Gradients
      loss.backward()
      
      # 4. Update parameters using gradients
      opt.step()
      
      # 5. Reset the gradients to zero
      opt.zero_grad()
      
    # Print the progress
    if (epoch+1) % 10 == 0:
      print('Epoch [{}/{}], Loss: {:.4f}'.format(
          epoch+1, num_epochs, loss.item()))
      
# We use the data loader defined earlier to get batches of data for every iteration
# Instead of updating parameters (weights and biases) manually, we use opt.step to perform the update, and opt.zero_grad to reset the gradients to zero
# We've also added a log statement which prints the loss from the last batch of data for every 10th epoch, to track the progress of training. 
# loss.item returns the actual value stored in the loss tensor
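
The reason gradients need to be reset after each step is that `backward()` adds to any gradient already stored on a parameter. A small sketch with a made-up two-element tensor `w` (not part of the model above) illustrates this:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: the gradient of sum(3 * w) with respect to w is [3, 3]
(3 * w).sum().backward()
first = w.grad.clone()

# Second backward pass without a reset: the new gradient is added to the old one
(3 * w).sum().backward()
accumulated = w.grad.clone()

# Clearing the stored gradient (opt.zero_grad() does this for every parameter)
# makes the next backward pass start fresh
w.grad.zero_()
(3 * w).sum().backward()
after_reset = w.grad.clone()

print(first, accumulated, after_reset)
```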

In [0]:
# Train the model for 100 epochs
fit(100, model, loss_fn, opt)

In [0]:
# Generate predictions to verify that our model is close to our targets
preds = model(inputs)
preds

In [0]:
# Compare with labels
labels

In [0]:
##DRAFTS BELOW######

In [0]:
class MnistModel(nn.Module):
  """Feedforward neural network with 1 hidden layer"""
  def __init__(self, in_size, hidden_size, out_size):
      super().__init__()
      # hidden layer
      self.linear1 = nn.Linear(in_size, hidden_size)
      # output layer
      self.linear2 = nn.Linear(hidden_size, out_size)
    
  def forward(self, xb):
      # Get intermediate outputs using hidden layers
      out = self.linear1(xb)
      # Apply activation function
      out = F.relu(out)
      # Get predictions using output layer
      out = self.linear2(out)
      return out

In [0]:
input_size = 9
num_classes = 2

model = MnistModel(input_size, hidden_size=3, # 3 nodes  
                   out_size=num_classes)

In [0]:
for t in model.parameters():
  print(t.shape)

In [0]:
for xb, yb in train_dl:
    outputs = model(xb)
    loss = F.cross_entropy(outputs, yb.long())  # use the batch labels, as integer class indices
    print('Loss:', loss.item())
    break

print('Outputs.shape:', outputs.shape)
print('Sample outputs :\n', outputs[:2].data)

In [0]:
# Define the model with one hidden layer
model = nn.Sequential(nn.Linear(9, 3),
                      nn.ReLU(),
                      nn.Linear(3, 2),
                      nn.LogSoftmax(dim=1))

# Set optimizer and learning rate
optimizer = optim.SGD(model.parameters(), lr=0.03)

# Could also use the Adam optimizer: like stochastic gradient descent, but it uses momentum,
# which can speed up fitting, and it adapts the learning rate for each individual parameter
# optimizer = optim.Adam(model.parameters(), lr=0.003)

# Define the loss
criterion = nn.NLLLoss()

# Set 5 epochs to start
epochs = 5
for e in range(epochs):
    running_loss = 0
    for inputs, labels in trainloader:
       
        # Clear the gradients, because they accumulate across backward passes
        optimizer.zero_grad()
        
        # Training pass
        output = model(inputs)
        loss = criterion(output, labels)  # Loss calculated from the output compared to the labels
        loss.backward()
        optimizer.step()
        
        # loss.item() returns the scalar value held in the loss tensor;
        # add it to running_loss to accumulate the total loss over the epoch
        running_loss += loss.item()
    else:
        # for/else: runs once the loop over all batches has finished
        print(f"Training loss: {running_loss/len(trainloader)}")
        


In [0]:
# Create a model
from torch import nn, optim
import torch.nn.functional as F

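class Classifier(nn.Module):
  def __init__(self):
      super().__init__()
      self.fc1 = nn.Linear(9, 3)
      self.fc2 = nn.Linear(3, 3)
      self.fc3 = nn.Linear(3, 3)  # third hidden layer, used in forward()
      self.fc4 = nn.Linear(3, 2)
      
  def forward(self, x):
      # Three hidden layers with ReLU activations
      x = F.relu(self.fc1(x))
      x = F.relu(self.fc2(x))
      x = F.relu(self.fc3(x))
      # Output layer returns log-probabilities over the 2 classes
      x = F.log_softmax(self.fc4(x), dim=1)
      
      return x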

In [0]:
model = Classifier()

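# Each batch from the loader is an (inputs, labels) pair
images, labels = next(iter(testloader))

# Get the class probabilities (the model outputs log-probabilities)
ps = torch.exp(model(images))

# Make sure the shape is appropriate: we should get 2 class probabilities per example, i.e. (batch_size, 2)
print(ps.shape)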

# **Create Further Subgroups Based on Characteristics and do Multi-Class Classification**