[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/alexwolson/carte-biozone-workshop/blob/main/Lab-3-1.ipynb)

# CARTE-BioZone Workshop on Machine Learning
#### Wednesday, August 30, 2023
#### Lab 1, Day 3: Neural Networks
##### Lab author: Alex Olson

#### Introduction

In this lab, we will be taking our first look at developing our own *neural networks* (NN) with [PyTorch](https://pytorch.org/), probably the most popular machine learning library for working with NNs.

In [1]:
# Check if we are running on Google Colab, or locally
import sys

IN_COLAB = "google.colab" in sys.modules

In [2]:
if not IN_COLAB:
    !pip install -q torch torchvision torchaudio numpy pandas matplotlib scikit-learn

In [4]:
# Standard Libraries
import numpy as np
import random

# Data Manipulation Libraries
import pandas as pd

# Scikit-Learn Libraries
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn import preprocessing

### An Intuitive Intro to Neural Nets

In today's lab, we're going to start with an intuitive exercise on the Titanic dataset using Logistic Regression and a simple Neural Network before moving on to some more complex material. Let's start by loading our dataset and cleaning it up as we have done before.


In [5]:
data = fetch_openml("titanic", version=1, as_frame=True, parser="auto").frame
data.survived = pd.to_numeric(data["survived"])
data.drop(["boat", "body", "home.dest"], axis=1, inplace=True)
data = data.drop(["name", "ticket", "cabin", "embarked"], axis=1)

label_encoder = preprocessing.LabelEncoder()
label_encoder.fit(data["sex"])
data["sex"] = label_encoder.transform(data["sex"])
data.head()

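| | pclass | survived | sex | age | sibsp | parch | fare |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 29.0 | 0 | 0 | 211.3375 |
| 1 | 1 | 1 | 1 | 0.9167 | 1 | 2 | 151.55 |
| 2 | 1 | 0 | 0 | 2.0 | 1 | 2 | 151.55 |
| 3 | 1 | 0 | 1 | 30.0 | 1 | 2 | 151.55 |
| 4 | 1 | 0 | 0 | 25.0 | 1 | 2 | 151.55 |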


Next, as we have become accustomed to doing, we will split the dataset into a training set (where we will do our cross validation) and a test set (our hold-out data). We've done this a few times during the labs, so hopefully you're getting used to the process!

In [6]:
target_data = data["survived"]
feature_data = data.iloc[:, data.columns != "survived"]

X_train, X_test, y_train, y_test = train_test_split(
    feature_data, target_data, test_size=0.3, random_state=0
)

Now, we're ready to try out some models on our training data (you haven't seen anything new yet!). Since we're solving a binary classification problem (i.e., predicting a 0 or 1 target), we want to design classifiers. In this exercise, we're going to fit a logistic regression to our data and then design a neural network architecture that behaves exactly like a logistic regression and validate that we get the same result.

Logistic regression models are linear models similar to linear regression models. Hopefully you somewhat remember them from lecture. Let's review them, starting with the linear regression equation:


<center>$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$</center>

Where $\hat{y}$ is our prediction, $\beta$ is our vector of coefficients (the things we learn), and $x$ is our feature vector. The linear regression equation defines a line (more generally, a hyperplane) in $n$-dimensional space. The problem with linear regression is that it doesn't perform well on classification tasks. Consider the following example:

<img src="https://github.com/lyeskhalil/mlbootcamp/blob/master/img/linear-classification.png?raw=1" width="500"/>

The green line represents our trained linear regression model. Our feature is the size of a tumor, and our target is whether it is malignant or not (0 or 1). As we can see, even though our model is trained to the data to minimize error, for a lot of the values of tumor size it is going to give us a weird result (e.g., for some really small tumors, the prediction would be a negative value!).
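To see the problem numerically, here is a minimal sketch that fits a one-feature least-squares line to a small made-up tumor dataset and then predicts for very small and very large tumors. The sizes and labels are illustrative assumptions, not the data behind the figure, and `predict_linear` is a hypothetical helper name.

```python
# Toy data (hypothetical): tumor size and malignancy label (0 or 1)
sizes = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = [0, 0, 0, 1, 1, 1]

# Closed-form least squares for a single feature: y_hat = beta_0 + beta_1 * x
n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(labels) / n
beta_1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, labels)) / sum(
    (x - mean_x) ** 2 for x in sizes
)
beta_0 = mean_y - beta_1 * mean_x


def predict_linear(x):
    """Prediction of the fitted line; nothing keeps it between 0 and 1."""
    return beta_0 + beta_1 * x


for size in [0.5, 3.5, 8.0]:
    print(f"size={size:.1f} -> prediction={predict_linear(size):.2f}")
```

On this toy data, the smallest tumor gets a negative prediction and the largest gets a prediction above 1, neither of which makes sense as a probability.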

To resolve this, we use the *logistic function* (also called the *sigmoid* function) to 'squish' our linear model to be bounded by 0 and 1. The logistic (sigmoid) function is $\frac{1}{1+e^{-x}}$, and thus our logistic regression equation becomes:

<center>$\large\hat{y} = \frac{1}{1+e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n)}}$</center>

In this equation, the values for $\hat{y}$ can never be below 0, and can never exceed 1 even for the most extreme feature values $x$.
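As a quick sanity check of this bounding behaviour, the sketch below evaluates the logistic regression equation for a range of feature values. The coefficients in `betas` are made up for illustration, and `predict_logistic` is a hypothetical helper, not part of scikit-learn.

```python
import math


def sigmoid(z):
    """Logistic (sigmoid) function: squashes any real number into the 0-1 range."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_logistic(betas, x):
    """betas[0] is the intercept beta_0; betas[1:] pair up with the features x."""
    z = betas[0] + sum(b * xi for b, xi in zip(betas[1:], x))
    return sigmoid(z)


# Hypothetical coefficients for a one-feature model (illustration only)
betas = [-4.0, 1.2]

for value in [-10.0, 0.0, 3.0, 10.0, 50.0]:
    print(f"x={value:6.1f} -> y_hat={predict_logistic(betas, [value]):.4f}")
```

Even for the extreme inputs, every printed prediction stays between 0 and 1.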

**YOUR TURN:**
* Assuming you've trained a nice logistic regression model to the below data (see Figure), what might the model fit look like (i.e., what will the line look like)? ____________________________________
* For new data samples with features $x$, how would you convert the output of the logistic regression, $\hat{y}$, into a classification (0 or 1)? ______________________________

<img src="https://github.com/lyeskhalil/mlbootcamp/blob/master/img/logistic-classification.jpg?raw=1" width="400"/>

OK, cool! So a quick review of logistic regression. Let's use scikit-learn to fit a logistic regression model to our training set and then predict on our test set (we won't do cross validation this time).

In [7]:
imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
imputer.fit(X_train)

X_train = imputer.transform(X_train)

logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)

X_test = imputer.transform(X_test)
predictions = logistic_model.predict(X_test)

accuracy = accuracy_score(y_test, predictions)
print(f"Logistic accuracy: {accuracy*100:.2f}%")

Logistic accuracy: 79.13%


**YOUR TURN:**
* The default regularization penalty for sklearn's logistic regression is L2 (as in ridge regression); can you figure out how to change it to L1 (as in LASSO)? _Note: you'll have to look at the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)!_ ______________________
* What is the mean accuracy of an L1 regularized Logistic regression model on the training set? ______________________

OK - time for the good stuff!

### Neural Networks

A *neural network* (NN) is a type of machine learning model that, like linear or logistic regression, takes a feature vector, X, as input and predicts a target, y. The way it does this is a little bit different, however. A typical NN architecture consists of: an input layer, hidden layers, and an output layer. Each layer consists of a set of nodes (neurons) connected by edges (outputs). Let's look at the figure below:

<img src="https://github.com/lyeskhalil/mlbootcamp/blob/master/img/nn.jpeg?raw=1" width="400"/>

**Input layer**: This is a passive layer that simply takes in your feature data and outputs it to the hidden layers. You can think of each input layer neuron as being associated with a feature in your feature set.

**Hidden layer**: This is where the magic happens. The original features, as received by the input layer, go through a series of transformations within the hidden layer. You can think of each node (neuron) within the hidden layer as a highly transformed feature.

**Output layer**: This is where we get our final result, the 0 or 1 prediction.

### Zooming In

Let's take a look at what is happening at any given node (neuron) within the hidden layer. Take a look at the following image of a neuron within an NN:

<img src="https://github.com/lyeskhalil/mlbootcamp/blob/master/img/neuron.png?raw=1" width="500"/>

Every neuron has some inputs ($x_1, x_2, \dots, x_n$) with input weights ($w_1, w_2, \dots, w_n$) and an output, $Y$. The neuron itself applies a transformation, $f$, known as the *activation function*, to the linear combination of its inputs and input weights. The value $b$ is a constant weight called the bias.
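The computation inside a single neuron can be sketched in a few lines of plain Python. The function name `neuron` and the input, weight, and bias values are illustrative assumptions chosen so the arithmetic is easy to follow by hand.

```python
import math


def neuron(inputs, weights, bias, activation):
    """Compute Y = f(w_1*x_1 + ... + w_n*x_n + b) for a single neuron."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical inputs, weights, and bias (illustration only)
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 0.25]
b = 0.75

# 0.5*1 - 1*2 + 0.25*3 + 0.75 = 0.0, and sigmoid(0) = 0.5
print(neuron(x, w, b, sigmoid))
```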

There are many different types of activation functions, but the popular ones are the *sigmoid*, *tanh*, and *ReLU* activation functions. Yes, you heard correctly: the sigmoid function is a popular activation function! (This should be reminding you of the logistic regression model we discussed above).
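To get a feel for how these three activation functions differ, here is a small sketch that evaluates each one at a few sample values. It uses only the standard library; in PyTorch the counterparts are `torch.sigmoid`, `torch.tanh`, and `torch.relu`.

```python
import math


def sigmoid(z):
    """Squashes z into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    """Squashes z into the range -1 to 1."""
    return math.tanh(z)


def relu(z):
    """Passes positive values through unchanged and clips negatives to 0."""
    return max(0.0, z)


for z in [-2.0, 0.0, 2.0]:
    print(f"z={z:+.1f}  sigmoid={sigmoid(z):.3f}  tanh={tanh(z):+.3f}  relu={relu(z):.1f}")
```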

**YOUR TURN:**
* If you were to develop a simple neural network architecture that was equivalent to a logistic regression model for the Titanic data, how would you do it? Get a pen and paper and draw it out. Make sure to specify: the input layer, the hidden layer(s), the output layer, the activation function(s), the weights, and the biases.
* How many hidden layers does your NN have? What type of activation function, $f$, does it use? _________________
* Say you wanted to add another layer to your NN architecture with 3x neurons, what would your new architecture look like? _________________

### Intro to PyTorch

OK, so now that we've made the connection between NNs and Logistic Regression, let's code up our little NN in PyTorch and use it to predict survivorship on the Titanic dataset.

First, *tensors* are the fundamental data type of PyTorch. Each tensor is effectively a multi-dimensional array, just like a numpy array. The primary difference is that tensors have been set up to support the NN training process: they can keep track of gradients automatically and can be moved onto a GPU for faster computation.
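One concrete way tensors support training is automatic differentiation. The sketch below (with made-up values) creates a tensor from a numpy array and then asks PyTorch for the gradient of a simple function.

```python
import numpy as np
import torch

# A tensor can be created from a NumPy array
array = np.array([[1.0, 2.0], [3.0, 4.0]])
tensor = torch.from_numpy(array).float()
print(tensor.shape)  # torch.Size([2, 2])

# Unlike a NumPy array, a tensor can track gradients (automatic differentiation)
w = torch.tensor([2.0, -1.0], requires_grad=True)
y = (w**2).sum()  # y = w_1^2 + w_2^2
y.backward()  # computes dy/dw = 2 * w
print(w.grad)  # tensor([ 4., -2.])
```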

Let's load our X and y training data into tensors:

In [8]:
import torch

X_train_tensor = torch.from_numpy(X_train).float()
y_train_tensor = torch.from_numpy(y_train.values).float()

X_test_tensor = torch.from_numpy(X_test).float()
y_test_tensor = torch.from_numpy(y_test.values).float()

In [9]:
X_train_tensor.shape, y_train_tensor.shape, X_test_tensor.shape, y_test_tensor.shape

(torch.Size([916, 6]),
 torch.Size([916]),
 torch.Size([393, 6]),
 torch.Size([393]))

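Next, we will define our logistic regression network as a model class. The class below, `LogisticRegression`, applies a single linear layer followed by a sigmoid transformation of the output, as required. (Note that it reuses the name of the `LogisticRegression` class we imported from scikit-learn, so from here on that name refers to the PyTorch model.)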

In [10]:
class LogisticRegression(torch.nn.Module):
    def __init__(self, input_dim, output_dim):
        super(LogisticRegression, self).__init__()
        self.linear = torch.nn.Linear(input_dim, output_dim)

    def forward(self, x):
        outputs = torch.sigmoid(self.linear(x))
        return outputs

Next, we identify the dimensions of our problem: 6 inputs and 1 output (6 features, and a single output giving the predicted probability of class 1). We initialize our model with those dimensions and then specify the loss function (binary [cross entropy](https://en.wikipedia.org/wiki/Cross_entropy)) and optimization technique ([stochastic gradient descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)).

In [11]:
input_dim = 6
output_dim = 1

model = LogisticRegression(input_dim, output_dim)

criterion = torch.nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

In [12]:
model

LogisticRegression(
  (linear): Linear(in_features=6, out_features=1, bias=True)
)
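To build some intuition for the loss, the binary cross entropy that `BCELoss` averages over a batch can be sketched by hand. The probabilities and labels below are made up for illustration, and `binary_cross_entropy` is a hypothetical helper rather than a PyTorch function.

```python
import math


def binary_cross_entropy(probabilities, targets):
    """Mean of -[y*log(p) + (1 - y)*log(1 - p)] over all samples."""
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for p, y in zip(probabilities, targets)
    ]
    return sum(losses) / len(losses)


# Hypothetical predicted survival probabilities and true labels
targets = [1, 0, 1, 0]
confident_and_right = [0.9, 0.1, 0.8, 0.2]
confident_and_wrong = [0.1, 0.9, 0.2, 0.8]

print(f"good predictions: {binary_cross_entropy(confident_and_right, targets):.3f}")
print(f"bad predictions:  {binary_cross_entropy(confident_and_wrong, targets):.3f}")
```

The loss is small when the model assigns high probability to the correct class and grows quickly when it is confidently wrong, which is exactly what we want the optimizer to push against.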

Next, we set up our dataset to be *iterable* such that we can train our neural network in *batches*. A batch is a subset of the total data such that if we combined them all, we'd get the whole dataset. Batching is done to speed up the training process and reduce memory requirements.

**YOUR TURN:**
* If we select a batch size of 256, how many batches of training data will be generated?________________

The `DataLoader` class does this batching operation for us.

In [13]:
batch_size = 256

train_dataset = torch.utils.data.TensorDataset(X_train_tensor, y_train_tensor)
train_loader = torch.utils.data.DataLoader(
    dataset=train_dataset, batch_size=batch_size, shuffle=True
)
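Conceptually, what the loader does on each pass can be sketched in plain Python: shuffle the samples, then cut them into consecutive chunks. This is only an illustration on a made-up dataset of 10 samples; `make_batches` is a hypothetical helper and not how `DataLoader` is actually implemented.

```python
import random


def make_batches(samples, batch_size, shuffle=True, seed=0):
    """Split samples into consecutive batches; the last one may be smaller."""
    samples = list(samples)
    if shuffle:
        random.Random(seed).shuffle(samples)
    return [samples[i : i + batch_size] for i in range(0, len(samples), batch_size)]


# Hypothetical toy dataset of 10 samples, batched 4 at a time
batches = make_batches(range(10), batch_size=4)
for batch in batches:
    print(batch)

# Together, the batches cover every sample exactly once
assert sorted(x for batch in batches for x in batch) == list(range(10))
```

Note that the final batch holds only the leftover samples, so it can be smaller than `batch_size`.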

Next, we train our NN for 40,000 *epochs*, evaluating its accuracy on the test set and printing the result every 4,000 epochs. An epoch is one complete pass of the entire dataset through the network. An iteration is a single forward/backward pass of the network over one batch of data.
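The relationship between epochs and iterations can be sketched with a pair of nested loops. The numbers here are a made-up toy setup, deliberately different from the ones used in this lab.

```python
import math

# Hypothetical toy setup (different from the lab's numbers)
num_samples = 1000
batch_size = 100
num_epochs = 5

iterations_per_epoch = math.ceil(num_samples / batch_size)

iteration_count = 0
for epoch in range(num_epochs):  # one epoch = one full pass over the data
    for batch in range(iterations_per_epoch):  # one iteration = one batch
        iteration_count += 1

print(iterations_per_epoch)  # 10
print(iteration_count)  # 50
```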

**YOUR TURN:**
* If our batch size is 256 and we elect to do 40,000 epochs (i.e., the network sees the entire dataset 40,000 times), how many iterations (forward/backward passes of the data) will our network see? _______________________

Run the below code and check out the incremental accuracy improvement output below.

In [18]:
# Set seeds for reproducibility
random.seed(1)
torch.manual_seed(1)
torch.backends.cudnn.deterministic = True

# Training parameters
total_epochs = 40000
log_interval = 4000  # Interval for logging progress

# Training loop
for epoch in range(total_epochs):
    for batch_idx, (batch_features, batch_target) in enumerate(train_loader):
        # Zero the gradients
        optimizer.zero_grad()

        # Forward pass
        predictions = model(batch_features)

        # Compute loss
        loss = criterion(torch.squeeze(predictions), batch_target)

        # Backward pass and optimization
        loss.backward()
        optimizer.step()

    # Logging progress
    if epoch % log_interval == 0:
        model.eval()
        with torch.no_grad():
            test_predictions = model(X_test_tensor)
            test_predicted_classes = torch.round(test_predictions.data).numpy()

        model.train()
        test_accuracy = accuracy_score(y_test, test_predicted_classes)
        print(
            f"Epoch: {epoch}/{total_epochs} | "
            f"Loss: {loss.item():.2f} | "
            f"Accuracy: {test_accuracy:.2f} | "
            f"Progress: {epoch / total_epochs * 100:.0f}%"
        )

Epoch: 0/40000 | Loss: 0.45 | Accuracy: 0.79 | Progress: 0%
Epoch: 4000/40000 | Loss: 0.43 | Accuracy: 0.79 | Progress: 10%
Epoch: 8000/40000 | Loss: 0.45 | Accuracy: 0.79 | Progress: 20%
Epoch: 12000/40000 | Loss: 0.50 | Accuracy: 0.79 | Progress: 30%
Epoch: 16000/40000 | Loss: 0.46 | Accuracy: 0.79 | Progress: 40%
Epoch: 20000/40000 | Loss: 0.45 | Accuracy: 0.80 | Progress: 50%
Epoch: 24000/40000 | Loss: 0.41 | Accuracy: 0.79 | Progress: 60%
Epoch: 28000/40000 | Loss: 0.49 | Accuracy: 0.79 | Progress: 70%
Epoch: 32000/40000 | Loss: 0.45 | Accuracy: 0.80 | Progress: 80%
Epoch: 36000/40000 | Loss: 0.44 | Accuracy: 0.79 | Progress: 90%


Hopefully you got an accuracy of around 79%. You'll notice that this is similar to the accuracy we got from sklearn's built-in logistic regression earlier in the lab. Any small discrepancy comes from differences in implementation: notably, sklearn applies L2 regularization by default and uses an optimizer called L-BFGS rather than the SGD we used here.
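To make the SGD update performed by `optimizer.step()` more concrete, here is a minimal hand-written sketch of gradient descent on a single sample for a logistic regression model. The sample, learning rate, and the helper name `sgd_step` are illustrative assumptions; PyTorch computes the same kind of gradients automatically via `loss.backward()`.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sgd_step(weights, bias, x, y, lr):
    """One gradient-descent update on a single sample for logistic regression.

    For the binary cross-entropy loss, the gradient with respect to each weight
    is (p - y) * x_i and with respect to the bias is (p - y), where p is the
    predicted probability.
    """
    p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    error = p - y
    new_weights = [w - lr * error * xi for w, xi in zip(weights, x)]
    new_bias = bias - lr * error
    return new_weights, new_bias


# Hypothetical sample with two features and a positive label
weights, bias = [0.0, 0.0], 0.0
x, y = [1.0, 2.0], 1

for step in range(3):
    weights, bias = sgd_step(weights, bias, x, y, lr=0.5)
    p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    print(f"step {step}: p={p:.3f}")
```

Each step nudges the weights so that the predicted probability for this positive sample moves closer to 1.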

Let's take a look at the trained model parameters using the model's `parameters()` method.

In [19]:
params = list(model.parameters())
params

[Parameter containing:
 tensor([[-0.7323, -2.4649, -0.0197, -0.3609, -0.0120,  0.0048]],
        requires_grad=True),
 Parameter containing:
 tensor([3.2712], requires_grad=True)]

Hm, well this is interesting! We can see that our model consists of two tensors: the first has shape (1, 6), and the second has shape (1). Refer back to how you drew what you thought this NN architecture would look like.

**YOUR TURN:**
* What do you think these values represent? ____________________________
* How many hidden layers does the architecture have? ______________________________
* Draw the architecture and label (some of) the weights (trained parameters). ______________________________

(*Hint: to answer these questions, try printing `test_predictions[0]` and `test_predicted_classes[0]` to look at the model's assessment of the first test sample*)

You'll notice that this took considerably longer to train than scikit-learn's logistic regression. This is because PyTorch is set up to be more flexible and to train architectures much more complex than a simple single-neuron network, whereas scikit-learn's implementation of logistic regression is highly optimized.

**YOUR TURN:**
* How many epochs would you need to run to reach 200,000 iterations? ______________________
* Does increasing to 200,000 iterations improve your test set accuracy? ______________________
* Compare the predictions of your logistic regression from scikit-learn and your network developed with PyTorch. Are all the predictions the same? How many predictions are pair-wise different? ______________________

Congratulations! You've completed an introduction to neural networks and PyTorch. If you want to explore more sophisticated architectures and applications, check out this tutorial on designing a PyTorch neural network to classify images of handwritten digits:

https://towardsdatascience.com/handwritten-digit-mnist-pytorch-977b5338e627

Other than that, you're done with the lab!