# Building a Neural Network with PyTorch and PyTorch Lightning

In this lab, we will build our own neural network using PyTorch and PyTorch Lightning. Working with our HCC data, we'll train a simple neural network to predict whether a sample has cancer or not.

In [7]:
!pip install -U -q torch torchvision pytorch-lightning

In [None]:
import pandas as pd
hcc = pd.read_csv('https://github.com/alexwolson/carte_workshop_2024/raw/main/data/HCC_all_ML_classification_test_annotated_frags_all_features_combined_4_tumors.csv.gz', compression='gzip')
hcc = hcc.sample(500)

In [None]:
categorical_columns = ['chr','upstream_motif','downstream_motif','Corrected_Call']
numerical_columns = ['frag','VAF','pos','read_cov','detected_read_cov','plasma_VAF','Corrected_Copy_Number']
y_column = 'alt_match'

X = hcc[categorical_columns + numerical_columns]
y = hcc[y_column]

In [None]:
X = pd.get_dummies(X, columns=categorical_columns)

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2)
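
As a rough check on what the two splits above produce, the arithmetic below works out the sizes of each set. It assumes the 500 rows kept by `hcc.sample(500)` and that a fractional `test_size` is rounded up to a whole number of rows; the helper `split_sizes` is hypothetical and only reproduces the counts, not the shuffling.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test share up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_total = 500                                    # rows kept by hcc.sample(500)
n_train_val, n_test = split_sizes(n_total, 0.2)  # first split: hold out the test set
n_train, n_val = split_sizes(n_train_val, 0.2)   # second split: carve validation out of the rest

print(n_train, n_val, n_test)  # 320 80 100
```

In other words, the second 20% is taken from the remaining 80%, so the final proportions are roughly 64% train, 16% validation, and 20% test.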

## Building a Neural Network with PyTorch

PyTorch is a popular deep learning framework for building neural networks. It is somewhat lower-level than high-level APIs such as Keras (bundled with TensorFlow), which handle more of the details for you. In exchange, PyTorch is more flexible and makes it easier to build custom neural networks.

In order to keep things simple, we will also take advantage of a companion library called PyTorch Lightning. Lightning takes care of much of the code that normally needs to be written by hand, allowing you to focus on building your neural network. Despite this, we will still need to write a fair amount of code to build our neural network.

Let's start by defining a simple, fully connected neural network using PyTorch.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import pytorch_lightning as L

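class NN(L.LightningModule):

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(X_train.shape[1], 64)
        # Add additional layers following the same format. Bring us down to a single output

    def forward(self, x):
        # Define the forward pass of the network
        # Alternate between linear layers and activation functions
        # ReLU can be called using F.relu
        # Finish with torch.sigmoid so the output is a probability in (0, 1),
        # which is what F.binary_cross_entropy in the steps below expects
        return x  # Placeholder: return the network's output instead

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.binary_cross_entropy(y_hat, y.view(-1, 1))
        self.log('train_loss', loss)  # Make the training loss visible to the logger
        return loss

    def validation_step(self, batch, batch_idx):
        # Needed for the validation loader passed to trainer.fit() to be used
        x, y = batch
        loss = F.binary_cross_entropy(self(x), y.view(-1, 1))
        self.log('val_loss', loss)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.001)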
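
If you get stuck on the layers and the forward pass, the sketch below shows one possible shape for them. It is written as a plain `nn.Module` so it runs on its own; the class name `TinyNet`, the hidden width of 32, and the fake batch are all assumptions made for illustration. The key detail is the final `torch.sigmoid`: `F.binary_cross_entropy` expects probabilities between 0 and 1, not raw layer outputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyNet(nn.Module):
    """Hypothetical completion: three linear layers narrowing to one output."""

    def __init__(self, n_features):
        super().__init__()
        self.fc1 = nn.Linear(n_features, 64)
        self.fc2 = nn.Linear(64, 32)   # hidden width of 32 is an arbitrary choice
        self.fc3 = nn.Linear(32, 1)    # single output unit for binary classification

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        # Sigmoid keeps the output in (0, 1), as F.binary_cross_entropy expects
        return torch.sigmoid(self.fc3(x))

torch.manual_seed(0)
net = TinyNet(n_features=10)            # 10 stands in for X_train.shape[1]
x = torch.randn(4, 10)                  # a fake batch of 4 samples
y = torch.tensor([0.0, 1.0, 1.0, 0.0])  # fake binary labels

y_hat = net(x)
loss = F.binary_cross_entropy(y_hat, y.view(-1, 1))
print(y_hat.shape, loss.item())
```

The same `__init__` and `forward` bodies can be moved into the `LightningModule` above, with `X_train.shape[1]` in place of `n_features`.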

## Training the Model

Now that we have defined our neural network, we can train it using PyTorch Lightning. We will create a PyTorch Dataset and DataLoader to feed our data into the model, and then use a Trainer to train the model. The Dataset defines how an individual sample is retrieved, the DataLoader groups those samples into batches (optionally shuffling them), and the Trainer handles the training loop for us.

In [None]:
class HCCDataset(torch.utils.data.Dataset):
    
    def __init__(self, X, y):
        self.X = torch.tensor(X.values, dtype=torch.float32)
        self.y = torch.tensor(y.values, dtype=torch.float32)
        
    def __len__(self):
        return len(self.X)
    
    def __getitem__(self, idx):
        # This is the important function that defines how data is retrieved from the dataset
        return self.X[idx], self.y[idx]
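
To see how a Dataset and a DataLoader work together, here is a small sketch on made-up data. It uses PyTorch's built-in `TensorDataset`, which, like `HCCDataset`, returns a `(features, label)` pair for each index; the toy tensors and the batch size of 4 are assumptions chosen so the batches are easy to inspect.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for the real features and labels: 10 samples, 3 features each
features = torch.arange(30, dtype=torch.float32).view(10, 3)
labels = torch.tensor([0.0, 1.0] * 5)

toy_dataset = TensorDataset(features, labels)  # indexing returns (features[i], labels[i])
toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=False)

# Iterating over the loader yields stacked batches of samples
batch_shapes = [(xb.shape, yb.shape) for xb, yb in toy_loader]
for x_shape, y_shape in batch_shapes:
    print(x_shape, y_shape)
```

With 10 samples and a batch size of 4, the loader yields two full batches and a final batch of 2.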

In [None]:
train_dataset = HCCDataset(X_train.astype(float), y_train)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, shuffle=True)

val_dataset = HCCDataset(X_val.astype(float), y_val)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=32, shuffle=False)

model = NN()
trainer = L.Trainer(max_epochs=10, log_every_n_steps=1)
trainer.fit(model, train_loader, val_loader)

## Evaluating the Model

Now that we have trained our model, we can evaluate it on the test set to see how well it performs.

In [None]:
from sklearn.metrics import accuracy_score

test_dataset = HCCDataset(X_test.astype(float), y_test)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=32, shuffle=False)

The Trainer ran the training loop for us, but for the test set we will write the evaluation loop by hand. There are a few key elements to this:

1. We need to set the model to evaluation mode using `model.eval()`. This tells PyTorch that we are evaluating the model and not training it, so layers such as dropout and batch normalization switch to their inference behavior.
2. We need to use the `torch.no_grad()` context manager to tell PyTorch that we do not need to keep track of gradients during evaluation. This can save memory and speed up computation.
3. We need to loop over the test set and compute predictions using the model. We can then use these predictions to compute the accuracy of the model.
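
To illustrate the first two points in isolation, here is a minimal sketch using a toy model with a dropout layer. The model and input are made up for this example and are not the lab's network.

```python
import torch
import torch.nn as nn

# A toy model with dropout, used only to illustrate the two switches
toy_model = nn.Sequential(nn.Linear(5, 5), nn.Dropout(p=0.5), nn.Linear(5, 1))
x = torch.ones(2, 5)

toy_model.eval()       # dropout becomes a no-op in evaluation mode
with torch.no_grad():  # no computation graph is recorded inside this block
    out_a = toy_model(x)
    out_b = toy_model(x)

print(torch.equal(out_a, out_b))  # True: repeated calls agree because dropout is disabled
print(out_a.requires_grad)        # False: the output is not tracked by autograd
```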

In [None]:
# Switch the model to evaluation mode
model.eval()

# Use torch.no_grad() to disable gradient tracking
with torch.no_grad():
    y_true = []
    y_pred = []
    
    # Loop over the test set
    for x, y in test_loader:
        y_hat = model(x) # Compute predictions
        y_hat = (y_hat > 0.5).float() # Convert to binary predictions
        
        # Append the true and predicted labels to the lists
        y_true.extend(y.numpy())
        y_pred.extend(y_hat.numpy().flatten())

Now `y_pred` and `y_true` contain the predicted and true labels for the test set. We can use these to compute the accuracy of the model.

In [None]:
# Compute the accuracy of the model

# ===== YOUR CODE HERE =====




# ==========================
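
As a hint, accuracy is simply the fraction of predictions that match the true labels. The sketch below computes it by hand on made-up labels (not results from the HCC model); the function name `accuracy` is hypothetical. The `accuracy_score` function imported earlier from scikit-learn computes the same quantity when called as `accuracy_score(y_true, y_pred)`.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Made-up labels for illustration only
toy_true = [1.0, 0.0, 1.0, 1.0, 0.0]
toy_pred = [1.0, 0.0, 0.0, 1.0, 0.0]

print(accuracy(toy_true, toy_pred))  # 0.8, since 4 of 5 predictions match
```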

## Extending Our Model Further

Now that we have a simple neural network that can predict whether a sample has cancer or not, we can extend it further to improve its performance. Here are a few ideas for how you can extend the model:

1. Add more layers to the neural network. You can experiment with different widths, depths and activations to see how they affect the performance of the model.
2. Experiment with different optimization algorithms. You can try using different optimizers such as SGD, RMSprop, or Adam, and experiment with different learning rates.
3. There are many settings that can be configured with the Lightning Trainer. Take a look at the [callbacks](https://pytorch-lightning.readthedocs.io/en/latest/extensions/callbacks.html) (particularly EarlyStopping and ModelCheckpoint) and [logging](https://pytorch-lightning.readthedocs.io/en/latest/extensions/logging.html) options to see how you can monitor and improve your model during training.
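
For the second idea, swapping optimizers mostly comes down to changing one line. The sketch below builds three standard `torch.optim` optimizers for a toy layer; the helper `make_optimizer`, the learning rate of 0.01, and the SGD momentum of 0.9 are assumptions chosen for illustration rather than recommended settings.

```python
import torch
import torch.nn as nn

def make_optimizer(name, params, lr):
    """Hypothetical helper: build one of three standard torch optimizers by name."""
    if name == "sgd":
        return torch.optim.SGD(params, lr=lr, momentum=0.9)
    if name == "rmsprop":
        return torch.optim.RMSprop(params, lr=lr)
    if name == "adam":
        return torch.optim.Adam(params, lr=lr)
    raise ValueError(f"unknown optimizer: {name}")

toy_layer = nn.Linear(3, 1)  # stands in for the full model
optimizers = {
    name: make_optimizer(name, toy_layer.parameters(), lr=0.01)
    for name in ("sgd", "rmsprop", "adam")
}

for name, opt in optimizers.items():
    print(name, type(opt).__name__, opt.param_groups[0]["lr"])
```

In the Lightning module, the equivalent change would be to return the chosen optimizer, built from `self.parameters()`, in `configure_optimizers`.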

## Tracking Experiments

Often, we want to be able to keep track of the experiments we run, including the hyperparameters, metrics, and other information. One way to do this is to use a tool like [Weights & Biases](https://wandb.ai/site) (wandb) to log and visualize our experiments.

In order to use wandb, you will need to sign up for an account and install the wandb library. You can do this by running `!pip install wandb` in a code cell.

In [None]:
!pip install wandb

In [None]:
import wandb
wandb.login()

In [None]:
# Initialize a new wandb run

run = wandb.init(project='hcc', config={
    'learning_rate': 0.001,
    'batch_size': 32,
    'epochs': 10
})

Now that we've initialized a wandb run, we can log information about our model and training process. PyTorch Lightning has built-in support for wandb, so we can easily log metrics, hyperparameters, and other information during training.

In [None]:
from pytorch_lightning.loggers import WandbLogger

wandb_logger = WandbLogger(log_model=True) # Save the model as an artifact

model = NN()

trainer = L.Trainer(max_epochs=10, logger=wandb_logger)
trainer.fit(model, train_loader, val_loader) # Train with the wandb logger attached

Once we have trained our model, we can log the final metrics and any other information we want to keep track of. You'll then be able to view this information in the wandb dashboard: [https://wandb.ai/home](https://wandb.ai/home)