# Neural Networks


Neural networks (NNs) are a collection of chained functions that are applied to some input data. These functions are defined by parameters (weights and biases), which are stored in tensors.


The input data is a matrix. For an image, the matrix holds the pixel values. For text, each row typically represents a word. All rows must have the same length, so we apply padding when necessary.

Each arrow between neurons is a function. The output of that function is the value stored in the next neuron, which we can read as an answer to "How likely is it that the data we are looking at has feature X?". How do we combine multiple functions? Each neuron computes a weighted sum of the values feeding into it, plus a bias:

$N = \sum_{i = 0}^{n} w_i \cdot x_i + b$

where:
- $w_i$ is the weight for each node i
- $x_i$ is the value in each node i
- $b$ is the bias for the current layer
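The weighted sum above can be sketched in a few lines of plain Python. The function name `neuron_output` and the numeric values are illustrative assumptions, not part of the lab.

```python
def neuron_output(weights, inputs, bias):
    """Weighted sum of the inputs plus a bias: N = sum(w_i * x_i) + b."""
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    return sum(w * x for w, x in zip(weights, inputs)) + bias


# Illustrative values (assumed, not taken from the lab): three incoming nodes.
weights = [0.5, -1.0, 2.0]
inputs = [1.0, 2.0, 3.0]
bias = 0.5
print(neuron_output(weights, inputs, bias))
```

In a real network, an activation function is usually applied to this value before it is passed on to the next layer.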

# Convolutional Neural Network (CNN)

[A Convolution Visualiser](https://ezyang.github.io/convolution-visualizer/)

Dense neural networks learn features at the positions where they appear in the training set, so they work best on centered, roughly symmetrical images. If a network is trained on pictures where the animal is on the left side of the image, it will probably learn to recognize the eyes, ears, and nose, but only as long as they appear on that side. If we flip the image horizontally, the animal will most likely not be recognized.

This happens because the model ties each learned feature to the position where it was seen.

To solve this problem, we can use CNNs. A CNN learns what a feature in an image looks like without tying it to a position. It searches for each feature with a sliding window and records the responses in an _output feature map_, one per feature.
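A toy 1D sketch of this difference, under simplifying assumptions: the "dense" unit is a single dot product whose weights are copied from the left-aligned example, and the "convolution" is one hand-written kernel followed by a maximum over all positions. The names `dense_score` and `conv_max_score` are hypothetical.

```python
def dense_score(signal, weights):
    """A dense unit ties each weight to one fixed position of the input."""
    return sum(w * x for w, x in zip(weights, signal))


def conv_max_score(signal, kernel):
    """Slide the kernel over the signal and keep the strongest response."""
    k = len(kernel)
    feature_map = [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]
    return max(feature_map)


pattern = [1, 2, 1]
left = [1, 2, 1, 0, 0, 0, 0, 0]   # pattern on the left side
right = [0, 0, 0, 0, 0, 1, 2, 1]  # same pattern on the right side

dense_weights = left  # stand-in for weights learned from left-aligned data only
print(dense_score(left, dense_weights), dense_score(right, dense_weights))
print(conv_max_score(left, pattern), conv_max_score(right, pattern))
```

The dense score drops to zero when the pattern moves, while the sliding-window score is the same for both inputs.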

## CNNs for text classification

CNNs for text classification were first introduced in [Convolutional Neural Networks for Sentence Classification](https://aclanthology.org/D14-1181.pdf). This is the architecture that we will use in this lab.



<img src="https://richliao.github.io/images/YoonKim_ConvtextClassifier.png">

Given as input a text of $n$ words $w_{1}$, $w_{2}$, ..., $w_{n}$, we transform each word into a vector of size $d$. From here on, $w_{1}$, $w_{2}$, ..., $w_{n}$ denote these vectors, which belong to $R^d$. The resulting $d$×$n$ matrix is then used as input for a convolutional layer that passes a sliding window over the text.

For each window of length $l$:

$u_{i} = [w_{i}, \ldots, w_{i+l-1}] \in R^{d \times l}, \quad 1 \le i \le n-l+1$

For each filter $f_{j} ∈ R^{d×l}$ we compute the inner product $\langle u_{i}, f_{j} \rangle$ and obtain the matrix $F ∈ R^{m×n}$ (assuming we pad the input before applying the filters, so that the length $n$ is preserved), where $m$ is the number of filters. We apply max-pooling to the resulting matrix $F$, then apply the activation function. Finally, a *fully connected* layer produces the class distribution, and the predicted class is the one with the highest probability.
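The window and filter computation can be sketched in plain Python on toy values. This is only an illustration under stated assumptions:

- The helper names `frobenius_inner` and `text_convolution` are hypothetical.
- No padding is applied, so each row of the feature map has $n - l + 1$ entries instead of $n$.
- Each window is stored as $l$ rows of size $d$, which is the transpose of the $d×l$ layout but gives the same inner product.
- The activation function is left out.

```python
def frobenius_inner(a, b):
    """Inner product <a, b> of two equally shaped matrices (lists of rows)."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def text_convolution(words, filters):
    """Feature map F with F[j][i] = <u_i, f_j>, using 0-based window index i."""
    l = len(filters[0])  # window length
    n = len(words)       # number of words
    return [
        [frobenius_inner(words[i:i + l], f) for i in range(n - l + 1)]
        for f in filters
    ]


# Toy input (assumed values): n = 4 words, embedding size d = 2.
words = [[1, 0], [0, 1], [1, 1], [2, 0]]
# m = 2 filters, each covering l = 2 words.
filters = [
    [[1, 0], [0, 1]],
    [[0, 1], [1, 0]],
]

F = text_convolution(words, filters)
pooled = [max(row) for row in F]  # max-pooling over all positions, one value per filter
print(F)
print(pooled)
```

After pooling, each filter contributes a single number, so the input of the fully connected layer has size $m$ regardless of the text length.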

## Images vs Text

To understand why an approach using CNNs is suitable for text, we need to visualize our texts as a matrix.
For the following example we will consider that the representation of a sentence was done at the word level.

For example, for a sentence with a maximum length of 70 words and an embedding size of 300, we can create an array of numeric values of shape 70×300 to represent this sentence. Unlike images, where the matrix elements are pixel values, each row in the vector representation of the sentence is the representation of a word.

For images, the convolution filter moves both vertically and horizontally. For text, the filter only moves vertically (from one word to the next), so the convolutions are 1D. A kernel of size (2, 300), that is, a filter size of 2, only looks at 2 words at a time. We can therefore think of the filter size as the size of an n-gram (bigrams, trigrams, etc.).
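To make the n-gram view concrete, the sketch below lists the word windows that a filter of a given size would see. The helper name `ngram_windows` is an assumption, and whitespace splitting stands in for a real tokenizer.

```python
def ngram_windows(tokens, size):
    """All contiguous windows of `size` tokens, as a filter of that size sees them."""
    return [tokens[i:i + size] for i in range(len(tokens) - size + 1)]


tokens = "the mouse ran up the clock".split()
for size in (2, 3):
    print(size, ngram_windows(tokens, size))
```

A sentence of 6 words yields 5 bigram windows and 4 trigram windows when no padding is used.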

We will use the IMDb movie reviews dataset: https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

In [None]:
!pip install unidecode



In [None]:
import torch
import pandas as pd
from pprint import pprint
from sklearn.model_selection import train_test_split
from unidecode import unidecode
from collections import Counter
import nltk
from nltk import word_tokenize
nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
from urllib.request import urlretrieve
urlretrieve('https://raw.githubusercontent.com/LawrenceDuan/IMDb-Review-Analysis/master/IMDb_Reviews.csv', 'IMDB_Dataset.csv')

('IMDB_Dataset.csv', <http.client.HTTPMessage at 0x7fa19e0ba970>)

In [None]:
data = pd.read_csv('IMDB_Dataset.csv')
data = data[:10000]
data

| Unnamed: 0 | review | sentiment |
|---|---|---|
| 0 | My family and I normally do not watch local mo... | 1 |
| 1 | Believe it or not, this was at one time the wo... | 0 |
| 2 | After some internet surfing, I found the "Home... | 0 |
| 3 | One of the most unheralded great works of anim... | 1 |
| 4 | It was the Sixties, and anyone with long hair ... | 0 |
| ... | ... | ... |
| 9995 | The film maybe goes a little far, but if you l... | 1 |
| 9996 | This two-parter was excellent - the best since... | 1 |
| 9997 | Shaggy & Scooby-Doo Get a Clue. It's like watc... | 0 |
| 9998 | Todd Rohal is a mad genius. "Knuckleface Jones... | 1 |


Splitting the dataset into train and test.

In [None]:
train_df, test_df = train_test_split(data, test_size=0.20, random_state = 42)

print('Training set size', len(train_df))
print('Testing set size', len(test_df))

Training set size 8000
Testing set size 2000


As we have seen in past labs, we cannot train a model directly on textual data; we must first transform the data into numerical vector representations.

For this, we have to go through 2 steps:

- **Normalization**: for this example we will only tokenize the texts (splitting them into smaller units)

- **Vectorization**: representing the tokens in numerical vector format

### Character-level vector representation


```
Texts: 'The mouse ran up the clock' and 'The mouse ran down'
```

In addition to the tokens present in our texts, we also add 2 special tokens: UNK (unknown character) and PAD.


```
Index assigned for every token: {0: 'UNK', 1: 'PAD', 2: 't', 3: 'm', 4: 'c', 5: 'h', 6: 'l', 7: 'w', 8: ' ', 9: 'a', 10: 'k', 11: 'e', 12: 'r', 13: 'u', 14: 'n', 15: 's', 16: 'd', 17: 'p', 18: 'o'}
```

The vector representation of the two texts (lowercased), using the corresponding index for each character:

```
'The mouse ran up the clock' = [2, 5, 11, 8, 3, 18, 13, 15, 11, 8, 12, 9, 14, 8, 13, 17, 8, 2, 5, 11, 8, 4, 6, 18, 4, 10]
'The mouse ran down' = [2, 5, 11, 8, 3, 18, 13, 15, 11, 8, 12, 9, 14, 8, 16, 18, 7, 14]
```

We add padding values to the second vector so that it has the same length as the first, and we get:

```
[2, 5, 11, 8, 3, 18, 13, 15, 11, 8, 12, 9, 14, 8, 16, 18, 7, 14, 1, 1, 1, 1, 1, 1, 1, 1]
```
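A minimal sketch of this encoding, reusing the character index listed above. The helper names `encode_chars` and `pad` are hypothetical, and lowercasing the text is an assumption made because the index only contains lowercase characters.

```python
UNK_ID, PAD_ID = 0, 1

# Character index from the example, keyed by id (UNK and PAD handled separately).
id_to_char = {2: 't', 3: 'm', 4: 'c', 5: 'h', 6: 'l', 7: 'w', 8: ' ', 9: 'a',
              10: 'k', 11: 'e', 12: 'r', 13: 'u', 14: 'n', 15: 's', 16: 'd',
              17: 'p', 18: 'o'}
char_to_id = {ch: idx for idx, ch in id_to_char.items()}


def encode_chars(text):
    """Replace each character with its id; unseen characters map to UNK."""
    return [char_to_id.get(ch, UNK_ID) for ch in text.lower()]


def pad(ids, max_len):
    """Truncate to max_len, then right-pad with PAD_ID up to max_len."""
    ids = ids[:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


first = encode_chars('The mouse ran up the clock')
second = pad(encode_chars('The mouse ran down'), len(first))
print(first)
print(second)
```

Characters that are missing from the index, such as digits or punctuation in this toy vocabulary, are mapped to the UNK id.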

#### Exercises

1. Create the dictionary of unique characters and their corresponding id for the train dataset. We will use 0 for unknown, 1 for padding and the numbers from 2 onwards for the others. You can convert UTF-8 characters to their closest ASCII form to reduce the size of the vocabulary, but don't forget to apply the same transformation to the test set before predicting.
2. Define a padding function. It should receive a text and a maximum length and
  a. Truncate the text to the given length (if it's bigger)
  b. Add the padding value to shorter texts, until they achieve the desired length
3. Transform the train and test set into their vector representation -- replace each datapoint with a list of chars, replace these chars with their corresponding id, then pad each line to the desired length.

### An Architecture

We will load our data sets into an object of the class [Dataset](https://pytorch.org/docs/stable/data.html?highlight=dataset#torch.utils.data.Dataset).

In [None]:
class Dataset(torch.utils.data.Dataset):
    def __init__(self, samples, labels):
        self.samples = samples
        self.labels = labels

    def __getitem__(self, k):
        """Returns the kth example from the dataset"""
        return self.samples[k], self.labels[k]

    def __len__(self):
        """Returns the size of the dataset"""
        return len(self.samples)

We define the model architecture:

In [None]:
class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()

        # We define an embedding layer with a vocabulary of size 1024
        # and an output embedding of size 100.
        # padding_idx is the index of the padding token in the vocabulary (1, in our case)

        self.embedding = torch.nn.Embedding(1024, 100, padding_idx=1)

        # Let's define a sequence of layers

        # A dropout layer with a probability of 0.4
        self.dropout = torch.nn.Dropout(p=0.4)

        # A 1D Convolutional layer with 100 input channels, 128 output channels, kernel size = 3 and padding = 1
        # 1D Batch Normalization Layer for 128 features
        # ReLU activation
        # 1D Maxpooling layer of size 2
        conv1 = torch.nn.Sequential(
            torch.nn.Conv1d(in_channels=100, out_channels=128, kernel_size=3, padding=1),
            torch.nn.BatchNorm1d(num_features=128),
            torch.nn.ReLU(),
            torch.nn.MaxPool1d(kernel_size=2),
        )

        # A 1D Convolutional layer with 128 input channels, 128 output channels, kernel size = 5 and padding = 2
        # 1D Batch Normalization Layer for 128 features
        # ReLU activation
        # 1D Maxpooling layer of size 2
        conv2 = torch.nn.Sequential(
            torch.nn.Conv1d(in_channels=128, out_channels=128, kernel_size=5, padding=2),
            torch.nn.BatchNorm1d(num_features=128),
            torch.nn.ReLU(),
            torch.nn.MaxPool1d(kernel_size=2),
        )

        # A 1D Convolutional layer with 128 input channels, 128 output channels, kernel size = 5 and padding = 2
        # 1D Batch Normalization Layer for 128 features
        # ReLU activation
        # 1D Maxpooling layer of size 2
        conv3 = torch.nn.Sequential(
            torch.nn.Conv1d(in_channels=128, out_channels=128, kernel_size=5, padding=2),
            torch.nn.BatchNorm1d(num_features=128),
            torch.nn.ReLU(),
            torch.nn.MaxPool1d(kernel_size=2),
        )

        # Global Average pooling layer which, in our case, is a 1D Average Pooling layer
        # with size 125 and stride 125
        global_average = torch.nn.AvgPool1d(kernel_size=125, stride=125)

        self.convolutions = torch.nn.Sequential(
            conv1, conv2, conv3, global_average
        )

        # Flattening layer
        flatten = torch.nn.Flatten()

        # Linear layer with 128 input features and 2 outputs without activation function
        linear = torch.nn.Linear(in_features=128, out_features=2)

        self.classifier = torch.nn.Sequential(flatten, linear)

    def forward(self, input):
        # we pass the input through the embedding layer
        embeddings = self.embedding(input)

        # Conv1d expects (batch, channels, length), so we move the embedding
        # dimension (the channels) in front of the sequence dimension
        embeddings = embeddings.permute(0, 2, 1)

        # we pass the input through the sequence of layers
        output = self.dropout(embeddings)
        output = self.convolutions(output)
        output = self.classifier(output)
        return output
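The average pooling size of 125 only acts as a global pooling if the sequence has length 125 after the three convolution blocks. The sketch below traces the sequence length through the layers using the standard output-length formulas for 1D convolution and pooling. The starting length of 1000 is an assumption inferred from the pooling size; the lab does not state it. The helper names are hypothetical.

```python
def conv1d_out_len(length, kernel_size, padding=0, stride=1):
    """Output length of a 1D convolution (dilation 1)."""
    return (length + 2 * padding - kernel_size) // stride + 1


def pool1d_out_len(length, kernel_size):
    """Output length of a 1D pooling layer whose stride equals its kernel size."""
    return (length - kernel_size) // kernel_size + 1


seq_len = 1000  # assumed padded review length, not stated in the lab
# (kernel_size, padding) of the three convolutions in the model above
for kernel_size, padding in [(3, 1), (5, 2), (5, 2)]:
    seq_len = conv1d_out_len(seq_len, kernel_size, padding)  # length is preserved
    seq_len = pool1d_out_len(seq_len, 2)                     # max-pooling halves it
    print(seq_len)

print(pool1d_out_len(seq_len, 125))  # length left after the average pooling
```

If the reviews are padded to a different length, the size of the average pooling layer has to change accordingly, otherwise the linear layer no longer receives 128 features.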

In [None]:
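# use the GPU when one is available, otherwise fall back to the CPU
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")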
# initializing the model
model = Model().to(DEVICE)

# Adam optimizer with learning rate = 1e-3
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Cross Entropy loss
loss_fn = torch.nn.CrossEntropyLoss()

# training dataset and dataloader
# test dataset and dataloader
train_ds = Dataset(train_reviews_vectorized, train_labels)
train_dl = torch.utils.data.DataLoader(train_ds, batch_size=64, shuffle=True)
test_ds = Dataset(test_reviews_vectorized, test_labels)
test_dl = torch.utils.data.DataLoader(test_ds, batch_size=64, shuffle=False)

Training loop

In [None]:
best_val_acc = 0
for epoch_n in range(5):
    print(f"Epoch #{epoch_n + 1}")
    model.train()
    for batch in train_dl:
        model.zero_grad()

        inputs, targets = batch
        inputs = inputs.long().to(DEVICE)
        targets = targets.to(DEVICE)

        output = model(inputs)
        loss = loss_fn(output, targets)

        loss.backward()
        optimizer.step()

    # validation
    model.eval()
    all_predictions = torch.tensor([])
    all_targets = torch.tensor([])
    for batch in test_dl:
        inputs, targets = batch
        inputs = inputs.long().to(DEVICE)
        targets = targets.to(DEVICE)

        with torch.no_grad():
            output = model(inputs)

        predictions = output.argmax(1)
        all_targets = torch.cat([all_targets, targets.detach().cpu()])
        all_predictions = torch.cat([all_predictions, predictions.detach().cpu()])

    val_acc = (all_predictions == all_targets).float().mean().numpy()
    print(val_acc)

    if val_acc > best_val_acc:
        torch.save(model.state_dict(), "./model")
        torch.save(optimizer.state_dict(), "./optimizer")
        best_val_acc = val_acc

print("Best validation accuracy", best_val_acc)

Epoch #1
0.5215
Epoch #2
0.7205
Epoch #3
0.517
Epoch #4
0.648
Epoch #5
0.5185
Best validation accuracy 0.7205


### Vector representation at word level



```
Texts: 'The mouse ran up the clock' and 'The mouse ran down'
```

In addition to the tokens present in our texts, we also add 2 special tokens: UNK (unknown word) and PAD.


```
Index assigned for every token: {'UNK': 0, 'PAD': 1, 'the': 2, 'mouse': 3, 'ran': 4, 'up': 5, 'clock': 6, 'down': 7}
```

The vector representation of the two texts using the corresponding index for each word:

```
'The mouse ran up the clock' = [2, 3, 4, 5, 2, 6]
'The mouse ran down' = [2, 3, 4, 7]
```

We add padding values to the second vector so that it has the same length as the first, and we get:

```
[2, 3, 4, 7, 1, 1]
```
One-hot representation of each text (one row per token, one column per vocabulary index):

```
'The mouse ran up the clock' = [[0. 0. 1. 0. 0. 0. 0. 0.]
                                [0. 0. 0. 1. 0. 0. 0. 0.]
                                [0. 0. 0. 0. 1. 0. 0. 0.]
                                [0. 0. 0. 0. 0. 1. 0. 0.]
                                [0. 0. 1. 0. 0. 0. 0. 0.]
                                [0. 0. 0. 0. 0. 0. 1. 0.]]

'The mouse ran down' = [[0. 0. 1. 0. 0. 0. 0. 0.]
                        [0. 0. 0. 1. 0. 0. 0. 0.]
                        [0. 0. 0. 0. 1. 0. 0. 0.]
                        [0. 0. 0. 0. 0. 0. 0. 1.]
                        [0. 1. 0. 0. 0. 0. 0. 0.]
                        [0. 1. 0. 0. 0. 0. 0. 0.]]
```
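The word-level encoding and its one-hot form can be sketched as follows, reusing the word index listed above. The helper names `encode_words` and `one_hot` are hypothetical, and whitespace splitting is a simplifying assumption; the lab imports `word_tokenize` from NLTK for real tokenization.

```python
UNK_ID, PAD_ID = 0, 1
word_to_id = {'UNK': 0, 'PAD': 1, 'the': 2, 'mouse': 3, 'ran': 4,
              'up': 5, 'clock': 6, 'down': 7}


def encode_words(text, max_len):
    """Map words to ids, truncate to max_len, then right-pad with PAD_ID."""
    ids = [word_to_id.get(tok, UNK_ID) for tok in text.lower().split()]
    ids = ids[:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


def one_hot(ids, vocab_size):
    """One row per token, with a single 1.0 in the column of the token id."""
    return [[1.0 if col == idx else 0.0 for col in range(vocab_size)]
            for idx in ids]


first = encode_words('The mouse ran up the clock', max_len=6)
second = encode_words('The mouse ran down', max_len=6)
print(first)
print(second)
for row in one_hot(second, vocab_size=len(word_to_id)):
    print(row)
```

Note that the model defined earlier does not need one-hot vectors: `torch.nn.Embedding` takes the integer ids directly.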

#### Exercises

1. Create the dictionary of unique words and their corresponding id for the train dataset. Same as before, we will use 0 for unknown, 1 for padding and the numbers from 2... for the other words.
We also need to reduce the vocabulary. Some methods you can use:
- Remove low frequency words
- Remove very frequent words (stopwords)
- Preprocess the text (stemming, lemmatization etc)
2. Define a padding function. It should receive a text and a maximum length and
  a. Truncate the text to the given length (if it's bigger)
  b. Add the padding value to shorter texts, until they achieve the desired length
3. Transform the train and test set into their vector representation -- replace each datapoint with a list of words, replace these words with their corresponding id, then pad each line to the desired length.
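As a starting point for the vocabulary reduction in exercise 1, here is one possible sketch based on `collections.Counter`. The function name `build_vocab` and its parameters `min_freq` and `stopwords` are hypothetical choices, and the toy input stands in for the tokenized training reviews.

```python
from collections import Counter


def build_vocab(tokenized_texts, min_freq=2, stopwords=frozenset()):
    """Word -> id index that keeps words seen at least min_freq times."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    vocab = {'UNK': 0, 'PAD': 1}
    # most_common() orders by frequency, so frequent words get small ids
    for tok, freq in counts.most_common():
        if freq >= min_freq and tok not in stopwords:
            vocab[tok] = len(vocab)
    return vocab


# Toy tokenized texts (assumed); in the lab these would come from train_df only.
texts = [['the', 'mouse', 'ran', 'up', 'the', 'clock'],
         ['the', 'mouse', 'ran', 'down']]
vocab = build_vocab(texts, min_freq=2, stopwords={'the'})
print(vocab)
```

Words removed from the vocabulary are not lost at encoding time: they are mapped to the UNK id.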

# Final Exercises

Compare the previous approaches on the given dataset. Try different paddings, architectures, preprocessing techniques etc. Display a table with your results.