<a href="https://colab.research.google.com/github/annesjyu/NLP2/blob/main/03_NLP2_CNN_Basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Convolutional Neural Networks (CNNs) for Language Modeling

A CNN is another type of feed-forward neural network suitable for language modeling. Feed-forward means that data moves forward through the network without any backward loops. The core of a CNN applies convolution operations over the text sequence to identify relationships between neighboring tokens. The idea comes from signal processing, where convolution kernels are used to detect patterns in a signal.

For example, applying a convolution kernel of size 3 to the text sequence "this movie has amazing diverse characters" generates new feature vectors that represent the original text and the relationships among its tokens.

<img src="https://github.com/annesjyu/NLP2/blob/main/images/conv_maxpooling_steps.gif?raw=true" width=350></img>

* The height of the kernel (e.g. size=3) will be the number of embeddings it will see at once, similar to representing an n-gram in a word model.

* The width of the kernel should span the length of an entire word embedding.
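The two points above can be checked with a minimal sketch. The toy vocabulary, the embedding size of 8, and the 4 output channels are assumptions made only for illustration. Note that `nn.Conv1d` expects the embedding dimension as channels, so the embedded sequence is permuted to `(batch, embedding_dim, sequence_length)` first.

```python
import torch
import torch.nn as nn

# Hypothetical toy setup: vocabulary and sizes are made up for this example.
sentence = "this movie has amazing diverse characters".split()
vocab = {word: idx for idx, word in enumerate(sentence)}
embedding_dim = 8

torch.manual_seed(0)
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=embedding_dim)
token_ids = torch.tensor([[vocab[word] for word in sentence]])  # (1, 6)

embedded = embedding(token_ids)          # (1, 6, 8): one embedding per token
conv_input = embedded.permute(0, 2, 1)   # (1, 8, 6): channels first for Conv1d

# kernel_size=3 means each kernel sees 3 consecutive embeddings (a trigram).
conv = nn.Conv1d(in_channels=embedding_dim, out_channels=4, kernel_size=3)
features = conv(conv_input)

print(conv.weight.shape)  # each of the 4 kernels spans all 8 embedding values x 3 tokens
print(features.shape)     # 6 tokens give 6 - 3 + 1 = 4 trigram positions
```

Each of the 4 output positions corresponds to one trigram window, such as "this movie has" or "movie has amazing".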

## CNN Hyperparameters

A few hyperparameters affect the performance of a CNN architecture. Some concepts are introduced first, before the hyperparameters are explained:

* Kernel

> A kernel is a small matrix applied at different positions in the input text sequence to compute the convolution result. Its extent is controlled by `kernel_size`, which equals the kernel's height, and the step between consecutive positions is controlled by `stride`.

* Padding

> To apply the convolution at the border tokens, padding extends the sequence so that the kernel has `kernel_size` positions to cover there. The `padding` parameter controls how many positions are added on each side; by default they are filled with zeros.

<img src="https://github.com/annesjyu/NLP2/blob/main/images/cnn_hyperparameters.gif?raw=true" width=350></img>

* Channels

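> Channels are the feature dimension at each position in the input. In language modeling with one-hot encoded tokens, the number of input channels equals the size of the vocabulary; with word embeddings, it equals the embedding dimension.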

* Dimensions

> 1D convolutions are useful for time series in which each time step has a feature vector. Most NLP convolutions are 1D as well.

> 2D convolutions can be used for images, which have two axes: width and height.

> 3D convolutions can be used for videos.

* Dilation

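> Dilation controls the spacing between the kernel's elements when it is applied to the input. A dilation of 1 covers adjacent positions; a dilation of 2 skips every other position, which widens the receptive field without adding parameters.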
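To see how kernel size, stride, padding, and dilation interact, the sketch below compares the output length of `nn.Conv1d` with the usual output-length formula, $L_{out} = \lfloor (L_{in} + 2 \cdot padding - dilation \cdot (kernel\_size - 1) - 1) / stride \rfloor + 1$. The helper `conv1d_output_length` and the chosen settings are illustrative assumptions, not part of the notebook's model.

```python
import torch
import torch.nn as nn

def conv1d_output_length(length, kernel_size, stride=1, padding=0, dilation=1):
    # Hypothetical helper: output length of a 1D convolution along the sequence axis.
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

sequence_width = 7
x = torch.randn(1, 10, sequence_width)  # (batch, channels, sequence_width)

settings = [
    dict(kernel_size=3, stride=1, padding=0, dilation=1),  # plain convolution
    dict(kernel_size=3, stride=1, padding=1, dilation=1),  # padding keeps the length
    dict(kernel_size=3, stride=2, padding=0, dilation=1),  # stride skips positions
    dict(kernel_size=3, stride=1, padding=0, dilation=2),  # dilation widens the kernel
]

results = []
for setting in settings:
    conv = nn.Conv1d(in_channels=10, out_channels=16, **setting)
    actual = conv(x).shape[-1]
    expected = conv1d_output_length(sequence_width, **setting)
    results.append((actual, expected))
    print(setting, "-> output length", actual, "(formula:", expected, ")")
```

With a sequence of width 7, only the padded setting preserves the input length; the others shorten it.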






## Implementation

We create a simple synthetic example to introduce the implementation.

In [None]:
import torch.nn as nn

In [None]:
class SimpleCNN(nn.Module):
  def __init__(self, one_hot_size, kernel_size, stride, output_size):
    super(SimpleCNN, self).__init__()
    self.one_hot_size = one_hot_size
    self.kernel_size = kernel_size
    self.stride = stride
    self.output_size = output_size
    self.conv1 = nn.Conv1d(in_channels=self.one_hot_size,
                  out_channels=16,
                  kernel_size=self.kernel_size,
                  stride=self.stride)
    self.conv2 = nn.Conv1d(in_channels=16,
                  out_channels=32,
                  kernel_size=self.kernel_size,
                  stride=self.stride)
    self.conv3 = nn.Conv1d(in_channels=32,
                  out_channels=self.output_size,
                  kernel_size=self.kernel_size,
                  stride=self.stride)

  def forward(self, x):
    '''
      x : the input with a dimension of (batch_size, one_hot_size, sequence_width)
      y : the output with a dimension of (batch_size, output_size)
    '''
    intermediate1 = self.conv1(x)
    intermediate2 = self.conv2(intermediate1)
    intermediate3 = self.conv3(intermediate2)
    #print(x.shape)
    #print(intermediate1.shape)
    #print(intermediate2.shape)
    #print(intermediate3.shape)
    # Drop only the length-1 sequence dimension so a batch of size 1 keeps its batch axis.
    y = intermediate3.squeeze(dim=2)
    return y

The implementation diagram is shown below. Pay attention to the dimensions and check that they match between the input data and each of conv1, conv2, and conv3:

```
sequence_width, one_hot_size, in_channels, out_channels(1,2,3), output_size.
```

<img src="https://github.com/annesjyu/NLP2/blob/main/images/cnn_dims.drawio.png?raw=true" width=500></img>

The channel dimension grows from layer to layer because it is the feature vector size. The input data has a size of one_hot_size=10, while the output data has a size of output_size=64. In the intermediate layers, the feature size is gradually increased to match the larger output_size.
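The dimension bookkeeping can be traced layer by layer. The sketch below rebuilds the same three convolutions outside the class, purely for illustration, and records the shape after each one. It assumes the values used in this notebook: batch_size=3, one_hot_size=10, sequence_width=7, kernel_size=3, and stride=1.

```python
import torch
import torch.nn as nn

# Same layer sizes as SimpleCNN, rebuilt standalone for shape tracing.
layers = [
    nn.Conv1d(in_channels=10, out_channels=16, kernel_size=3, stride=1),
    nn.Conv1d(in_channels=16, out_channels=32, kernel_size=3, stride=1),
    nn.Conv1d(in_channels=32, out_channels=64, kernel_size=3, stride=1),
]

x = torch.randn(3, 10, 7)  # (batch_size, one_hot_size, sequence_width)
shapes = [tuple(x.shape)]
out = x
for layer in layers:
    out = layer(out)
    shapes.append(tuple(out.shape))

for shape in shapes:
    print(shape)

# The sequence axis shrinks 7 -> 5 -> 3 -> 1, so it can be dropped at the end.
y = out.squeeze(dim=2)
print(tuple(y.shape))
```

The channel axis grows 10, 16, 32, 64 while each convolution removes two positions from the sequence axis. This is why a sequence width of 7 ends at exactly one position after three layers.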

#### Training the simple CNN

In [None]:
import torch
from torch import randn
import torch.optim as optim

In [None]:
output_size = 64
one_hot_size = 10

model1 = SimpleCNN(one_hot_size=one_hot_size,
                   kernel_size=3,
                   stride=1,
                   output_size=output_size)
print(model1)

In [None]:
def TrainRandn(model,
               n_epochs=5,
               n_batches=128,
               batch_size=3,
               sequence_width=7,
               output_size=64):
  '''
    Train the model with the given parameters using randomly generated data.
  '''
  loss_fn = nn.MSELoss()
  optimizer = optim.SGD(model.parameters(), lr=0.001)

  for epoch in range(n_epochs):
    epoch_loss = 0
    for batch in range(n_batches):
        # Generate fake data for the batch.
        x = torch.randn(batch_size, one_hot_size, sequence_width)
        y_target = torch.randn(batch_size, output_size)
        #print(f'x={x}')
        #print(f'y_target={y_target}')

        # Reset gradients
        optimizer.zero_grad()

        # Forward pass: compute predicted y by passing x to the model.
        y_pred = model(x)
        #print(f'y_pred={y_pred}')
        #print(f'y_pred.shape={y_pred.shape}')
        #print(f'y_target.shape={y_target.shape}')

        # Compute the loss.
        loss = loss_fn(y_pred, y_target)
        # MSELoss already averages over all elements, so no division by batch_size is needed.
        batch_loss = loss.item()
        epoch_loss += batch_loss

        # Propagate the loss value backward.
        loss.backward()

        # Trigger the optimizer to update once.
        optimizer.step()
        # End of one batch

    epoch_loss /= n_batches
    print(f'epoch={epoch}, epoch_loss={epoch_loss}\n')

In [None]:
TrainRandn(model1)

### Model Optimization

On top of `SimpleCNN`, a few optimization techniques can be used to improve its performance.

#### Nonlinearity

Nonlinearity is introduced to model complex non-linear relationships in the data. The `Sequential` module is a convenience wrapper that encapsulates a sequence of operations. `ELU` is a nonlinearity similar to `ReLU`, but instead of clipping values below 0 it exponentiates them.
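The difference between the two activations is easiest to see on a few sample values. The sketch below assumes the default `alpha=1.0` and checks `nn.ELU` against its definition, $x$ for $x > 0$ and $\alpha(e^{x} - 1)$ otherwise. The input values are arbitrary.

```python
import torch
import torch.nn as nn

values = torch.tensor([-2.0, -0.5, 0.0, 1.5])

relu_out = nn.ReLU()(values)          # negatives are clipped to 0
elu_out = nn.ELU(alpha=1.0)(values)   # negatives decay smoothly toward -alpha

# ELU written out by hand for comparison (alpha = 1.0).
manual_elu = torch.where(values > 0, values, torch.exp(values) - 1.0)

print("ReLU:", relu_out)
print("ELU: ", elu_out)
print("match:", torch.allclose(elu_out, manual_elu))
```

Positive inputs pass through both functions unchanged. For negative inputs, ELU keeps a small non-zero output, and therefore a non-zero gradient, where ReLU returns 0.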

In [None]:
import torch.nn.functional as F

In [None]:
class OptCNN(nn.Module):
  def __init__(self, one_hot_size, kernel_size, stride, num_classes):
    super(OptCNN, self).__init__()
    self.one_hot_size = one_hot_size
    self.kernel_size = kernel_size
    self.stride = stride
    self.output_size = 64
    self.num_classes = num_classes

    self.convnet = nn.Sequential(
        nn.Conv1d(in_channels=self.one_hot_size,
                  out_channels=16,
                  kernel_size=self.kernel_size,
                  stride=self.stride),
        nn.ELU(),
        nn.Conv1d(in_channels=16,
                  out_channels=32,
                  kernel_size=self.kernel_size,
                  stride=self.stride),
        nn.ELU(),
        nn.Conv1d(in_channels=32,
                  out_channels=self.output_size,
                  kernel_size=self.kernel_size,
                  stride=self.stride),
        nn.ELU()
    )
    self.fc = nn.Linear(self.output_size, self.num_classes)

  def forward(self, x, apply_softmax=False):
    '''
      x : the input with a dimension of (batch_size, one_hot_size, sequence_width)
      y : the output with a dimension of (batch_size, num_classes)
    '''
    # Drop only the length-1 sequence dimension so a batch of size 1 keeps its batch axis.
    features = self.convnet(x).squeeze(dim=2)
    #print(features.shape)
    y = self.fc(features)
    if apply_softmax:
      y = F.softmax(y, dim=1)
    return y

Refer to the `Training the simple CNN` section to retrain using `OptCNN`.

In [None]:
# Instead of generating a 64-dimensional feature vector, classify into multiple classes.
num_classes = 3
model = OptCNN(one_hot_size=one_hot_size,
               kernel_size=3,
               stride=1,
               num_classes=num_classes)
TrainRandn(model, output_size=num_classes)

#### Pooling

Pooling is an operation that summarizes a higher-dimensional feature vector into a lower-dimensional one. Each value in the output summarizes a spatial region of the input. The summary can be computed as sum pooling, average pooling, or max pooling. Put another way, pooling transforms a weak input feature vector into a statistically stronger one.

<img src="https://github.com/annesjyu/NLP2/blob/main/images/pooling.jpg?raw=true" width=350></img>

In [None]:
# Define a max pooling layer
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)

# Example tensor (e.g., a feature map from a convolutional layer)
# Shape: [batch_size, channels, height, width]
input_tensor = torch.randn(1, 1, 4, 4)

# Apply max pooling
output = max_pool(input_tensor)

print("Input Tensor:\n", input_tensor)
print("Output Tensor:\n", output)
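For text, pooling is usually applied along the sequence axis of a 1D feature map. The sketch below uses a small hand-written feature map (values chosen arbitrarily) to contrast local max pooling using `nn.MaxPool1d` with pooling over the whole sequence, which yields one value per channel regardless of sequence length.

```python
import torch
import torch.nn as nn

# Hypothetical feature map: (batch_size=1, channels=2, sequence_width=6)
feature_map = torch.tensor([[[1.0, 5.0, 2.0, 4.0, 3.0, 0.0],
                             [0.0, -1.0, 7.0, 2.0, 2.0, 6.0]]])

# Local max pooling: summarize each window of 2 positions.
local_pool = nn.MaxPool1d(kernel_size=2, stride=2)
local_max = local_pool(feature_map)          # (1, 2, 3)

# Pooling over the whole sequence: one summary value per channel.
global_max = feature_map.max(dim=2).values   # (1, 2)
global_avg = feature_map.mean(dim=2)         # (1, 2)

print("local max:\n", local_max)
print("global max:", global_max)
print("global avg:", global_avg)
```

Pooling over the whole sequence is one way to obtain a fixed-size vector from sentences of different lengths before a linear layer.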

#### Batch Normalization

Batch normalization is commonly used to rescale a layer's outputs to zero mean and unit variance, computed over each batch. As a result, fluctuations in any one batch do not shift the inputs to the next layer too much. In statistics, this rescaling is called the Z-score.
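A minimal sketch of the effect, assuming a made-up convolution output with 16 channels whose values are deliberately shifted and scaled. In training mode, `nn.BatchNorm1d` normalizes each channel over the batch and sequence axes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical conv output: (batch_size=8, channels=16, sequence_width=5),
# shifted to mean ~10 and scaled to std ~4 on purpose.
conv_out = torch.randn(8, 16, 5) * 4.0 + 10.0

batch_norm = nn.BatchNorm1d(num_features=16)
batch_norm.train()  # use batch statistics, as during training

with torch.no_grad():
    normalized = batch_norm(conv_out)

# Statistics are computed per channel, across the batch and sequence axes.
per_channel_mean = normalized.mean(dim=(0, 2))
per_channel_std = normalized.std(dim=(0, 2), unbiased=False)

print("mean before:", conv_out.mean().item())
print("mean after: ", per_channel_mean.abs().max().item())
print("std after:  ", per_channel_std.mean().item())
```

In evaluation mode (`batch_norm.eval()`), the layer uses running averages collected during training instead of the current batch statistics.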

In [None]:
# nn.BatchNorm1d is added after nn.Conv1d; it does not replace the convolution.
# Can you re-write OptCNN to include max pooling and normalization?


#### Residual Connections / Residual Block

A residual connection, also called a skip connection, passes the input directly to the output in addition to the computed result. It is used in deep neural network architectures and has made networks as deep as 152 layers trainable.

A residual network adds a shortcut connection, as shown in subfigure B below. The addition is element-wise. This can be written as:
$F(x) = \mathrm{ReLU}(\mathrm{conv}(x) + x)$.

<img src="https://github.com/annesjyu/NLP2/blob/main/images/residual_connection.png?raw=true" width=350></img>

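When implementing this in a CNN, give the layer $kernel\_size = 3$, $padding = 1$, and a stride of 1. The output then has the same length as its input, so the two can be added element-wise as long as the number of channels also matches.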
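The formula above can be sketched as a small standalone module. `ResidualBlock1d` is a hypothetical name used only for illustration and is not part of `OptCNN`. It assumes the input and output have the same number of channels so that the element-wise addition is valid.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock1d(nn.Module):
    # Hypothetical block computing ReLU(conv(x) + x).
    def __init__(self, channels):
        super().__init__()
        # kernel_size=3 with padding=1 and stride=1 preserves the sequence length.
        self.conv = nn.Conv1d(in_channels=channels,
                              out_channels=channels,
                              kernel_size=3,
                              stride=1,
                              padding=1)

    def forward(self, x):
        # Shortcut: add the unchanged input to the convolution output.
        return F.relu(self.conv(x) + x)

block = ResidualBlock1d(channels=32)
x = torch.randn(3, 32, 7)   # (batch_size, channels, sequence_width)
y = block(x)
print(x.shape, y.shape)     # shapes match, so blocks can be stacked
```

Because the block preserves the input shape, several of them can be chained without changing the surrounding layers.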

In [None]:
# Can you add a residual connection layer to the conv3 in the previous OptCNN?
