![MLU Logo](../data/MLU_Logo.png)

# <a name="0">Machine Learning Accelerator - Natural Language Processing - Lecture 3</a>

## Neural Networks with PyTorch

In this notebook, we will build, train and validate a Neural Network using PyTorch.
1. <a href="#1">Implementing a neural network with PyTorch</a>
2. <a href="#2">Loss Functions</a>
3. <a href="#3">Training</a>
4. <a href="#4">Example - Binary Classification</a>
5. <a href="#5">Natural Language Processing Context</a>

In [None]:
%%capture
%pip install -q -r ../requirements.txt

## 1. <a name="1">Implementing a neural network with PyTorch</a>
(<a href="#0">Go to top</a>)

Let's implement a simple neural network with two hidden layers of 64 units each using `nn.Sequential`, which stacks layers in the order they are listed. The network takes 3 input features, passes them through 2 hidden layers and produces a single output. Dropout is applied after each hidden layer.

In [None]:
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(3, 64),                # Linear layer-1 with 64 units
    nn.Tanh(),                       # Tanh activation is applied
    nn.Dropout(0.4),                 # Apply random 40% drop-out to layer_1
    
    nn.Linear(64, 64),               # Linear layer-2 with 64 units
    nn.Tanh(),                       # Tanh activation is applied
    nn.Dropout(0.3),                 # Apply random 30% drop-out to layer_2
    
    nn.Linear(64, 1)                 # Output layer with single unit
)

print(net)
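
As a quick sanity check, the following minimal sketch pushes a random batch through the same architecture and inspects the output shape. The batch size of 5 and the variable names `x` and `out` are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

# Same architecture as above, rebuilt so the sketch is self-contained
net = nn.Sequential(
    nn.Linear(3, 64), nn.Tanh(), nn.Dropout(0.4),
    nn.Linear(64, 64), nn.Tanh(), nn.Dropout(0.3),
    nn.Linear(64, 1),
)

# Hypothetical batch: 5 samples with 3 features each
x = torch.randn(5, 3)

net.eval()                 # Switch dropout off for a deterministic forward pass
with torch.no_grad():      # Gradients are not needed for a shape check
    out = net(x)

print(out.shape)           # torch.Size([5, 1])
```

Each of the 5 samples produces one output value, which matches the single unit in the last layer.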

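We can initialize the weights of the network with the `apply()` function, which calls the given function on every submodule. Here we use Xavier (Glorot) uniform initialization for the weights and a small constant for the biases: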

In [None]:
def init_weights(m):
    if isinstance(m, nn.Linear):
        torch.nn.init.xavier_uniform_(m.weight)
        m.bias.data.fill_(0.01)

net.apply(init_weights)
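
To see what this initialization does, here is a small sketch on a single layer. It assumes the default gain of 1, in which case Xavier uniform samples weights from the interval (-a, a) with a = sqrt(6 / (fan_in + fan_out)). The variable names `layer` and `bound` are illustrative.

```python
import math
import torch
import torch.nn as nn

layer = nn.Linear(3, 64)                      # fan_in = 3, fan_out = 64
torch.nn.init.xavier_uniform_(layer.weight)   # Default gain of 1
layer.bias.data.fill_(0.01)

# Expected sampling bound for Xavier uniform with gain 1
bound = math.sqrt(6.0 / (3 + 64))

# Largest absolute weight should not exceed the bound (small float tolerance)
print(layer.weight.abs().max().item() <= bound + 1e-6)

# Every bias should equal the constant we filled in
print(torch.all(layer.bias == 0.01).item())
```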

Let's look at the layers, activations and dropouts of the network. We can access each one with `net[layer_index]`.

In [None]:
print(net[0])
print(net[1])
print(net[2])
print(net[3])
print(net[4])
print(net[5])
print(net[6])
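
Indexing also gives access to each layer's attributes. The sketch below, which rebuilds the same network to stay self-contained, reads a weight shape, a dropout probability and the total parameter count; `n_params` is an illustrative name.

```python
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(3, 64), nn.Tanh(), nn.Dropout(0.4),
    nn.Linear(64, 64), nn.Tanh(), nn.Dropout(0.3),
    nn.Linear(64, 1),
)

# Linear weights are stored as (out_features, in_features)
print(net[0].weight.shape)   # torch.Size([64, 3])

# Dropout layers expose their drop probability as `p`
print(net[2].p)              # 0.4

# Total number of trainable values: (3*64 + 64) + (64*64 + 64) + (64 + 1)
n_params = sum(p.numel() for p in net.parameters())
print(n_params)              # 4481
```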

## 2. <a name="2">Loss Functions</a>
(<a href="#0">Go to top</a>)

We will go over some popular loss functions here. We select the loss function according to the problem. The full list of supported loss functions is available [here](https://pytorch.org/docs/stable/nn.html#loss-functions).


__Binary Cross-entropy Loss:__ A common loss function for binary classification. It is given by: 
$$
\mathrm{BinaryCrossEntropyLoss} = -\sum_{examples}{(y\log(p) + (1 - y)\log(1 - p))}
$$
where p is the predicted probability (between 0 and 1, e.g. 0.831) and y is the true class (either 1 or 0).

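In PyTorch, `BCEWithLogitsLoss` implements binary cross-entropy on raw model outputs (logits). It applies the sigmoid function internally, so p is always between 0 and 1. By default it averages over the examples instead of summing. If the network already ends with a sigmoid layer, use `BCELoss` instead.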


```python
from torch.nn import BCEWithLogitsLoss
loss = BCEWithLogitsLoss()
```
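
The relationship between the two variants can be checked with a short sketch. The logits and labels below are made-up values; the comparison assumes the default `reduction='mean'` of both losses.

```python
import torch
import torch.nn as nn

logits = torch.tensor([1.5, -0.3, 0.8])   # Hypothetical raw model outputs
y = torch.tensor([1.0, 0.0, 1.0])         # True classes

# Loss computed directly from logits (sigmoid applied internally)
loss_from_logits = nn.BCEWithLogitsLoss()(logits, y)

# Loss computed from probabilities after an explicit sigmoid
p = torch.sigmoid(logits)
loss_from_probs = nn.BCELoss()(p, y)

# The formula above, averaged over the examples
manual = -(y * torch.log(p) + (1 - y) * torch.log(1 - p)).mean()

print(torch.allclose(loss_from_logits, loss_from_probs, atol=1e-6))
print(torch.allclose(loss_from_logits, manual, atol=1e-6))
```

All three values agree up to floating-point error, so the choice mainly depends on whether the network's last layer already applies a sigmoid.
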
__Categorical Cross-entropy Loss:__ It is used for multi-class classification. We apply the softmax function to the raw model outputs (logits) to get class probabilities and then extend the equation of binary cross-entropy. After the softmax function, the predictions for each example sum to 1. The equation is below; $y_j$ is 1 for the true class and 0 for the other classes.
$$
\mathrm{CategoricalCrossEntropyLoss} = -\sum_{examples}\sum_{classes}{y_j\log(p_j)}
$$
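In PyTorch, `CrossEntropyLoss` implements the categorical cross-entropy loss and applies the softmax function internally, so it expects raw logits as predictions and, typically, integer class indices as targets.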


```python
from torch.nn import CrossEntropyLoss
loss = CrossEntropyLoss()
```
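
A minimal sketch of how this loss is called, with made-up logits for 2 examples and 3 classes. It compares the result with the formula above, averaged over the examples (the default `reduction='mean'`).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical logits: 2 examples, 3 classes
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
targets = torch.tensor([0, 2])            # Index of the true class per example

loss = nn.CrossEntropyLoss()(logits, targets)

# Manual version: softmax, then the negative log-probability of the true class
probs = F.softmax(logits, dim=1)
manual = -torch.log(probs[torch.arange(2), targets]).mean()

row_sums = probs.sum(dim=1)               # Each row of probabilities sums to 1

print(torch.allclose(loss, manual, atol=1e-6))
print(row_sums)
```
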
__MSE Loss:__ This is a loss function for regression problems. It measures the squared difference between target values (y) and predictions (p). Squaring ensures that errors with opposite signs do not cancel each other out.
$$
\mathrm{MSE\ loss} = \frac{1}{n} \sum_{examples}(y - p)^2
$$
In PyTorch, we can use it with `MSELoss`:
```python
from torch.nn import MSELoss
loss = MSELoss()
```
__L1 Loss:__ This is similar to MSE loss, but it measures the absolute difference between target values (y) and predictions (p) instead of the squared difference.
$$
\mathrm{L1\ loss} = \frac{1}{n} \sum_{examples}|y - p|
$$
In PyTorch, we can use it with `L1Loss`:
```python
from torch.nn import L1Loss
loss = L1Loss()
```
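
To make the difference between the two regression losses concrete, here is a small worked sketch on made-up targets and predictions. The errors are -0.5, 0 and 1.

```python
import torch
import torch.nn as nn

y = torch.tensor([1.0, 2.0, 3.0])   # Hypothetical target values
p = torch.tensor([1.5, 2.0, 2.0])   # Hypothetical predictions

mse = nn.MSELoss()(p, y)   # (0.25 + 0 + 1) / 3
l1 = nn.L1Loss()(p, y)     # (0.5 + 0 + 1) / 3

print(mse.item(), l1.item())
```

The largest error contributes 1 to both sums, but the smaller error of 0.5 shrinks to 0.25 after squaring, so MSE weights large errors more heavily than L1 does.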

## 3. <a name="3">Training</a>
(<a href="#0">Go to top</a>)

The `torch.optim` module provides the optimization algorithms used to train neural networks. For example, the following creates a Stochastic Gradient Descent (SGD) optimizer with a learning rate of 0.001.

```python
import torch.optim as optim

optimizer = optim.SGD(net.parameters(), lr=0.001)
```
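
What a single optimizer step does can be sketched on one parameter. The toy loss w² and the learning rate of 0.1 are assumptions chosen so that the arithmetic is easy to follow; plain SGD without momentum updates w to w - lr * gradient.

```python
import torch
import torch.optim as optim

# One trainable parameter, starting at 2.0
w = torch.tensor([2.0], requires_grad=True)
optimizer = optim.SGD([w], lr=0.1)

optimizer.zero_grad()      # Clear gradients from any previous step
loss = (w ** 2).sum()      # Toy loss: w^2
loss.backward()            # Gradient: d(loss)/dw = 2 * w = 4
optimizer.step()           # Update: w = 2.0 - 0.1 * 4 = 1.6

print(w.item())
```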

## 4. <a name="4">Example - Binary Classification</a>
(<a href="#0">Go to top</a>)

Let's train a neural network on a synthetic dataset generated with `make_circles`. It has two classes, and the network will learn to separate them.

In [None]:
from sklearn.datasets import make_circles

X, y = make_circles(n_samples=750, shuffle=True, random_state=42, noise=0.05, factor=0.3)

Let's plot the dataset

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

def plot_dataset(X, y, title):
    
    # Activate Seaborn visualization
    sns.set()
    
    # Plot both classes: Class1->Blue, Class2->Red
    plt.scatter(X[y==1, 0], X[y==1, 1], c='blue', label="class 1")
    plt.scatter(X[y==0, 0], X[y==0, 1], c='red', label="class 2")
    plt.legend(loc='upper right')
    plt.xlabel('x1')
    plt.ylabel('x2')
    plt.xlim(-2, 2)
    plt.ylim(-2, 2)
    plt.title(title)
    plt.show()
    
plot_dataset(X, y, title="Dataset")

Importing the necessary libraries

In [None]:
import time
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

We create the network below. Since the two classes look easily separable, a small network is enough: two hidden layers with 10 units each, followed by a sigmoid that turns the output into a probability.

In [None]:
net = nn.Sequential(
    nn.Linear(2, 10),
    nn.ReLU(),
    nn.Linear(10, 10),
    nn.ReLU(),
    nn.Linear(10, 1),
    nn.Sigmoid()
)

def init_weights(m):
    if isinstance(m, nn.Linear):
        torch.nn.init.xavier_uniform_(m.weight)
        m.bias.data.fill_(0.01)

net.apply(init_weights)

Let's define the training parameters

In [None]:
batch_size = 4           # How many samples to use for each weight update 
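epochs = 50              # Total number of passes over the training set
learning_rate = 0.01     # Learning rate
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # Using GPU if available

# Move the network to the selected device so it matches the data batches
net = net.to(device)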

# Define the loss. As we used sigmoid in the last layer, use BCELoss
criterion = nn.BCELoss()

# Define the optimizer, SGD with learning rate
optimizer = optim.SGD(net.parameters(), lr=learning_rate)

In [None]:
# Splitting the dataset into two parts: 80%-20% split
X_train, X_val = X[:int(len(X)*0.8), :], X[int(len(X)*0.8):, :]
y_train, y_val = y[:int(len(X)*0.8)], y[int(len(X)*0.8):]

# Convert to PyTorch tensors
X_train = torch.FloatTensor(X_train)
X_val = torch.FloatTensor(X_val)
y_train = torch.FloatTensor(y_train).unsqueeze(1)
y_val = torch.FloatTensor(y_val).unsqueeze(1)

# Using PyTorch DataLoader to load the data in batches
train_dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
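
To see what the loader yields, the sketch below draws one batch. It uses random stand-in tensors of the same shapes as the training split (600 samples, 2 features) rather than the actual dataset; all names ending in `_demo` are illustrative.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Random stand-ins with the same shapes as X_train and y_train
X_demo = torch.randn(600, 2)
y_demo = torch.randint(0, 2, (600,)).float().unsqueeze(1)

demo_loader = DataLoader(TensorDataset(X_demo, y_demo), batch_size=4, shuffle=True)

# Each iteration yields one (features, labels) pair of batch_size samples
xb, yb = next(iter(demo_loader))
print(xb.shape, yb.shape)    # torch.Size([4, 2]) torch.Size([4, 1])

# Number of batches per epoch: 600 samples / 4 per batch
print(len(demo_loader))      # 150
```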

Let's start the training process. We train on the training set, evaluate on the validation set, and print both losses after each epoch.

In [None]:
import time
import torch
import torch.nn as nn
import torch.optim as optim

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

train_losses = []
val_losses = []
for epoch in range(epochs):
    start = time.time()
    training_loss = 0
    # Training loop, train the network
    net.train()
    for idx, (data, target) in enumerate(train_loader):
        data = data.to(device)
        target = target.to(device)
        
        optimizer.zero_grad()
        output = net(data)
        loss = criterion(output, target)
        training_loss += loss.item()
        loss.backward()
        optimizer.step()
    
    # Get validation predictions
    net.eval()
    with torch.no_grad():
        val_predictions = net(X_val.to(device))
        # Calculate validation loss
        val_loss = criterion(val_predictions, y_val.to(device)).item()
    
    # Average the training loss over the batches. Each batch loss is already
    # a mean over its samples, and so is the validation loss, which is left as is.
    training_loss = training_loss / len(train_loader)
    
    train_losses.append(training_loss)
    val_losses.append(val_loss)
    
    end = time.time()
    print(f"Epoch {epoch}. Train_loss {training_loss:.6f} Validation_loss {val_loss:.6f} Seconds {end-start:.6f}")
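
Besides the loss, accuracy is often easier to interpret for classification. The following is a minimal sketch of a hypothetical helper, `binary_accuracy`, that thresholds sigmoid outputs at 0.5; it is not part of the notebook above, and the example tensors are made up.

```python
import torch

def binary_accuracy(probs, targets, threshold=0.5):
    """Fraction of thresholded predictions that match the 0/1 targets (hypothetical helper)."""
    predicted = (probs >= threshold).float()
    return (predicted == targets).float().mean().item()

# Made-up probabilities and labels: the first two are right, the last two are wrong
probs = torch.tensor([[0.9], [0.2], [0.6], [0.4]])
targets = torch.tensor([[1.0], [0.0], [0.0], [1.0]])

acc = binary_accuracy(probs, targets)
print(acc)   # 0.5
```

With the variables from the training loop, it could be called as `binary_accuracy(val_predictions.cpu(), y_val)`.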

Let's see the training and validation loss plots below. As expected, both losses go down as training continues.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

plt.plot(train_losses, label="Training Loss")
plt.plot(val_losses, label="Validation Loss")
plt.title("Loss values")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.show()

## 5. <a name="5">Natural Language Processing Context</a>
(<a href="#0">Go to top</a>)

If we want to use the same type of architecture for text classification, we first need to convert the text into numerical features. For example, we can compute TF-IDF vectors of the text fields and then train a neural network on those features.
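
A minimal sketch of that pipeline, assuming scikit-learn's `TfidfVectorizer` for feature extraction. The four example sentences and the names `texts`, `X_text` and `text_net` are made up for illustration, and the network is untrained, so its outputs are not meaningful predictions.

```python
import torch
import torch.nn as nn
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up example documents
texts = [
    "great product, works well",
    "terrible quality, broke quickly",
    "works great",
    "broke after a day, terrible",
]

# Turn each document into a TF-IDF vector (one column per vocabulary word)
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(texts).toarray()
X_text = torch.tensor(features, dtype=torch.float32)

# Same kind of network as before; the input size is the vocabulary size
text_net = nn.Sequential(
    nn.Linear(X_text.shape[1], 10),
    nn.ReLU(),
    nn.Linear(10, 1),
    nn.Sigmoid(),
)

with torch.no_grad():
    probs = text_net(X_text)

print(X_text.shape, probs.shape)
```

From here, training would proceed as in the binary classification example, with `X_text` in place of the 2-dimensional inputs.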

We will also look at __more advanced neural network architectures__ such as __Recurrent Neural Networks (RNNs)__, __Long Short-Term Memory networks (LSTMs)__ and __Transformers__.