# CMU Deep Learning Systems Course (Online)
### Final project by Dmitry Sholomitsky and Lubov Kudrenok

Proposal:
> Model save/load functionality; transformers architecture blocks, including classes like Attention, MultiheadAttention, Transformer.

Implementations *(briefly)*:

1. Transformers architecture:

    * `needle/transformers.py` - new module:
        * `TransformerEncoder`
        * `TransformerEncoderLayer`
        * `MultiheadAttention`
    * `needle/tokenizers.py` - new module:
        * `Tokenizer` - base class for tokenizers;
        * `EngLemmaTokenizer` - tokenizer for English language with lemmatization;
    * `needle/ops.py` - added new functions:
        * `softmax`
        * `bmm` - function for batch matrix multiplication;
    * `needle/data.py` - some minor upgrades:
        * `DataLoader` - added `__len__` method; method `__next__` now puts data on device if provided;

2. Save/load functionality:

    * `needle/backend_ndarray/ndarray.py` - upgrades:
        * `BackendDevice` - added methods `__getstate__` and `__setstate__` to make it `pickle`-able;
        * `NDArray` - added methods `__getstate__` and `__setstate__` to make it `pickle`-able;
    * `needle/utils.py` - new module:
        * `save` - new function to save `needle` object;
        * `load` - new function to load `needle` object.

More details are provided later in this notebook.

Repository with source code: https://github.com/dmitry-rvn/CMUDLS_project.git

## 0. Preparations

In [1]:
! git clone https://github.com/dmitry-rvn/CMUDLS_project.git

Cloning into 'CMUDLS_project'...
remote: Enumerating objects: 33, done.
remote: Counting objects: 100% (33/33), done.
remote: Compressing objects: 100% (30/30), done.
remote: Total 33 (delta 0), reused 33 (delta 0), pack-reused 0
Unpacking objects: 100% (33/33), done.


If `gdown` does not work, the data can be downloaded manually from https://drive.google.com/file/d/1Nodv3EaqGe8jaUvSXAEAek0NlSgf9Fdr/view?usp=share_link or https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

In [2]:
! gdown "1Nodv3EaqGe8jaUvSXAEAek0NlSgf9Fdr&confirm=t"

Downloading...
From: https://drive.google.com/uc?id=1Nodv3EaqGe8jaUvSXAEAek0NlSgf9Fdr&confirm=t
To: /content/IMDB_Dataset.csv.zip
100% 27.0M/27.0M [00:01<00:00, 25.9MB/s]


In [3]:
%cd /content/CMUDLS_project/project
! pip3 install pybind11 loguru tqdm spacy
! make

/content/CMUDLS_project/project
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pybind11
  Downloading pybind11-2.10.3-py3-none-any.whl (222 kB)
Collecting loguru
  Downloading loguru-0.6.0-py3-none-any.whl (58 kB)
Installing collected packages: pybind11, loguru
Successfully installed loguru-0.6.0 pybind11-2.10.3
-- The C compiler identification is GNU 7.5.0
-- The CXX compiler identification is GNU 7.5.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Che

In [None]:
import sys
sys.path.append('./python')

import os
os.environ['NEEDLE_BACKEND'] = 'nd'

In [None]:
from typing import List, Tuple
import warnings

import torch
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from tqdm.notebook import tqdm
from loguru import logger
import matplotlib.pyplot as plt
import seaborn as sns

import needle as ndl
from needle import nn
from needle import optim
from needle import ops
from needle.data import Dataset, DataLoader

from needle.transformers import TransformerEncoder
from needle.tokenizers import Tokenizer, EngLemmaTokenizer

warnings.filterwarnings('ignore')

In [None]:
logger.add('/content/logs/log.txt')

1

In [None]:
DATA_FILEPATH = '/content/IMDB_Dataset.csv.zip'

MAX_VOCAB_SIZE = 10_000
MAX_LENGTH = 128

BATCH_SIZE = 20

EPOCHS = 1
LEARNING_RATE = 0.001
WEIGHT_DECAY = 0.0

In [None]:
if torch.cuda.is_available():
    DEVICE = ndl.backend_ndarray.cuda()
else:
    DEVICE = ndl.backend_ndarray.cpu()
logger.info(f'DEVICE: {DEVICE}')

2023-01-03 16:11:25.958 | INFO     | __main__:<module>:5 - DEVICE: cuda()


## 1. Data

To show that a model built on our architecture actually works, we train it on a text classification dataset from https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews (movie reviews labeled either 'positive' or 'negative').

In [None]:
data = pd.read_csv(DATA_FILEPATH)
data['review'] = data['review'].str.replace('<br />', ' ')
data_train, data_test = train_test_split(data, test_size=0.2, random_state=0)
data_train

Unnamed: 0,review,sentiment
20330,That's about the only redeeming quality in a m...,negative
17532,"Even if I had not read Anne Rice's ""Queen of t...",negative
45819,I sort of liked this Columbo movie its atmosph...,positive
34807,"""Zabriskie Point"" (1970): This was especially ...",positive
31888,Quite one of the worst films I have ever seen....,negative
...,...,...
21243,I did not set very high expectations for this ...,positive
45891,"THE BLOB is a great horror movie, not merely b...",positive
42613,"After too many years of waiting, Anne Rivers S...",positive
43567,I am a massive fan of the LoG. I thought the f...,negative


Since the data is balanced, we will use `accuracy_score` to evaluate classification quality:

In [None]:
data['sentiment'].value_counts()

positive    25000
negative    25000
Name: sentiment, dtype: int64
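
To illustrate why class balance matters for this choice, here is a small sketch in plain Python. The label counts are made up for illustration and are not taken from the dataset. It compares the accuracy of a predictor that always outputs the most frequent class on a balanced and on a skewed label set.

```python
def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most frequent label."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return max(counts.values()) / len(labels)


balanced = ['positive'] * 50 + ['negative'] * 50
imbalanced = ['positive'] * 95 + ['negative'] * 5  # hypothetical skewed split

print(majority_baseline_accuracy(balanced))    # 0.5
print(majority_baseline_accuracy(imbalanced))  # 0.95
```

On balanced labels a trivial predictor cannot beat 0.5, so an accuracy well above that level reflects something the model has learned.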

In [None]:
class IMDB50kReviews(Dataset):
    def __init__(self, data: pd.DataFrame, tokenizer: Tokenizer, max_length: int = None):
        super().__init__()
        self.label2index = {'positive': 1, 'negative': 0}
        self.index2label = {idx: label for label, idx in self.label2index.items()}
        data = data.copy()
        data['sentiment'] = data['sentiment'].map(self.label2index)
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.num_classes = len(self.label2index)

    def __getitem__(self, item):
        text = self.data.iloc[item]['review']
        label = self.data.iloc[item]['sentiment']

        indices = []
        if isinstance(text, str):
            text = [text]
            label = [label]
        for text_ in text:
            indices.append(self.tokenizer(text_, max_length=self.max_length))
        return np.array(indices), np.array(label)

    def __len__(self):
        return len(self.data)

## 2. Tokenizer

To learn token embeddings, we need a tokenizer that takes text as input and returns a list of token indices. It must be fitted on the training data to build its vocabulary.

In `needle/tokenizers.py` we created:
* `Tokenizer` - base class for tokenizers;
* `EngLemmaTokenizer` - tokenizer for English language, that
    * removes punctuation
    * applies lemmatization with `spacy` (+ lowering)
    * adds start-of-sequence and end-of-sequence tokens
    * performs truncation and padding

In [None]:
tokenizer = EngLemmaTokenizer(max_vocab_size=MAX_VOCAB_SIZE).fit(data_train['review'].tolist())

Example of tokenizer usage:

In [None]:
print(f'Tokenizer size: {len(tokenizer)}')
print(f'Default tokens: {tokenizer.default_tokens}\n')

phrase_to_tokenize = 'How are you, NotARealName?'
print(f'Phrase to tokenize: {phrase_to_tokenize}')
for max_length in (None, 5, 10):
    token_ids = tokenizer(phrase_to_tokenize, max_length=max_length)
    print(f'Token ids with max_length={str(max_length): <4}: {token_ids}')

Tokenizer size: 10004
Default tokens: {'<UNK>': 0, '<PAD>': 1, '<SOS>': 2, '<EOS>': 3}

Phrase to tokenize: How are you, NotARealName?
Token ids with max_length=None: [2, 90, 27, 23, 0, 3]
Token ids with max_length=5   : [2, 90, 27, 23, 3]
Token ids with max_length=10  : [2, 90, 27, 23, 0, 3, 1, 1, 1, 1]
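
The output above suggests that truncation keeps the end-of-sequence token in the last position and that padding is added on the right. The following is a minimal sketch of that behaviour, not the code from `needle/tokenizers.py`: the function `encode` and the toy vocabulary are hypothetical, and a lowercase whitespace split stands in for `spacy` lemmatization.

```python
import string

DEFAULT_TOKENS = {'<UNK>': 0, '<PAD>': 1, '<SOS>': 2, '<EOS>': 3}


def encode(text, vocab, max_length=None):
    # Strip punctuation and lowercase; a plain split replaces lemmatization here.
    cleaned = text.translate(str.maketrans('', '', string.punctuation)).lower()
    ids = [vocab.get(word, vocab['<UNK>']) for word in cleaned.split()]
    ids = [vocab['<SOS>']] + ids + [vocab['<EOS>']]
    if max_length is not None:  # assumed to be at least 2
        if len(ids) > max_length:
            # Truncate, keeping the end-of-sequence token in the last position.
            ids = ids[:max_length - 1] + [vocab['<EOS>']]
        # Pad on the right up to max_length.
        ids = ids + [vocab['<PAD>']] * (max_length - len(ids))
    return ids


vocab = dict(DEFAULT_TOKENS, how=4, are=5, you=6)  # hypothetical toy vocabulary
phrase = 'How are you, NotARealName?'
print(encode(phrase, vocab))                # [2, 4, 5, 6, 0, 3]
print(encode(phrase, vocab, max_length=5))  # [2, 4, 5, 6, 3]
print(encode(phrase, vocab, max_length=8))  # [2, 4, 5, 6, 0, 3, 1, 1]
```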


## 3. `Dataset`s and `DataLoader`s

We added a `__len__` method to `needle.data.DataLoader` (for more convenient use with `tqdm`), and its `__next__` method now puts data on the device if one is provided.
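
As a rough illustration of these two changes, the toy iterator below shows one way `__len__` and a device-aware `__next__` can fit together. It is not needle's `DataLoader`: the class `MiniLoader`, the `to_device` hook, and the choice to count a final partial batch are all assumptions made for this sketch.

```python
import math


class MiniLoader:
    """Toy batch iterator (hypothetical, for illustration only)."""

    def __init__(self, items, batch_size, to_device=None):
        self.items = list(items)
        self.batch_size = batch_size
        self.to_device = to_device  # hypothetical callable that moves a batch to a device
        self._pos = 0

    def __len__(self):
        # Number of batches, counting a final partial batch.
        return math.ceil(len(self.items) / self.batch_size)

    def __iter__(self):
        self._pos = 0
        return self

    def __next__(self):
        if self._pos >= len(self.items):
            raise StopIteration
        batch = self.items[self._pos:self._pos + self.batch_size]
        self._pos += self.batch_size
        # Move the batch only when a device hook was provided.
        return self.to_device(batch) if self.to_device is not None else batch


loader = MiniLoader(range(10), batch_size=4, to_device=tuple)
print(len(loader))   # 3
print(list(loader))  # [(0, 1, 2, 3), (4, 5, 6, 7), (8, 9)]
```

With `__len__` defined, `tqdm` can read the total number of batches from the iterable and display a complete progress bar.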

In [None]:
train_dataset = IMDB50kReviews(data_train, tokenizer, max_length=MAX_LENGTH)
valid_dataset = IMDB50kReviews(data_test, tokenizer, max_length=MAX_LENGTH)

In [None]:
train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, device=DEVICE, random_seed=0)
valid_dataloader = DataLoader(valid_dataset, batch_size=BATCH_SIZE, shuffle=False, device=DEVICE)

Check shapes, devices, values:

In [None]:
batch_x, batch_y = next(iter(train_dataloader))
batch_x.shape, batch_x.device, batch_y.shape, batch_y.device

((20, 128), cuda(), (20,), cuda())

In [None]:
np.array(batch_x.numpy()[4], dtype=int)

array([   2,   12,   65,  299,   13,   22,   40,  462,    8,  133,   88,
         12, 4780,    6,  427,   12,   63,   10, 1552,   60, 3912,  717,
       1108,   14,  111,   79,  108,   30,   10,  203,   32,    4,  247,
         22,   12,   30,  126,  110,   12,  140,  776,    6,    0,  276,
         69,  383,   24,   10,    3,    1,    1,    1,    1,    1,    1,
          1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
          1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
          1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
          1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
          1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
          1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
          1,    1,    1,    1,    1,    1,    1])

In [None]:
batch_y.numpy()[:10]

array([0., 0., 0., 1., 0., 0., 1., 1., 0., 1.], dtype=float32)

## 4. Model

To check that our Transformer implementation behaves sensibly, we built a model that uses the Transformer as an **encoder** followed by a **classification head** to classify movie reviews.

The Transformer architecture is implemented in `needle/transformers.py` with 3 classes:
* `TransformerEncoder` - includes token and position embeddings and a stack of `TransformerEncoderLayer`s;
* `TransformerEncoderLayer` - contains `MultiheadAttention` and other components such as `LayerNorm`s and `Linear`s;
* `MultiheadAttention` - holds the *query*, *key*, *value* and *out* weights and applies `softmax` and other operations.
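
For readers unfamiliar with the attention block, here is a NumPy sketch of standard scaled dot-product multi-head self-attention. It assumes the textbook formulation and leaves out biases, masking and dropout, so the needle implementation may differ in its details. The function name and weight shapes are illustrative; `np.matmul` on 3-D arrays plays the role of a batched matrix multiplication such as `bmm`.

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def multihead_self_attention(x, w_q, w_k, w_v, w_out, n_heads):
    batch, seq_len, hidden = x.shape
    head_dim = hidden // n_heads

    def split_heads(t):
        # (batch, seq, hidden) -> (batch * heads, seq, head_dim)
        t = t.reshape(batch, seq_len, n_heads, head_dim).transpose(0, 2, 1, 3)
        return t.reshape(batch * n_heads, seq_len, head_dim)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    # Batched matrix multiplication: one (seq x seq) score matrix per head.
    scores = np.matmul(q, k.transpose(0, 2, 1)) / np.sqrt(head_dim)
    weights = softmax(scores, axis=-1)
    context = np.matmul(weights, v)  # (batch * heads, seq, head_dim)
    # Merge the heads back into (batch, seq, hidden).
    context = context.reshape(batch, n_heads, seq_len, head_dim).transpose(0, 2, 1, 3)
    context = context.reshape(batch, seq_len, hidden)
    return context @ w_out, weights


rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 8))
w_q, w_k, w_v, w_out = (rng.normal(size=(8, 8)) for _ in range(4))
out, weights = multihead_self_attention(x, w_q, w_k, w_v, w_out, n_heads=4)
print(out.shape, weights.shape)  # (2, 5, 8) (8, 5, 5)
```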

Colab does not render the diagram of our Transformer implementation (it shows every operation in the `forward` methods), so please view it at this link: https://github.com/dmitry-rvn/CMUDLS_project/blob/main/assets/transformer-schema.png



In [None]:
class TextClassifier(nn.Module):
    def __init__(self, 
                 num_classes: int,
                 vocab_size: int,
                 transformer_hidden_dim: int, 
                 transformer_linear_dim: int, 
                 transformer_n_layers: int,
                 transformer_n_heads: int,
                 transformer_max_length: int,
                 clf_linear_dim: int, 
                 dropout: float = 0.2,
                 device=None):
        super().__init__()  # initialize nn.Module state before registering submodules
        # our implementation of Transformer encoder
        self.encoder = TransformerEncoder(
            input_dim=vocab_size, 
            max_length=transformer_max_length,
            hidden_dim=transformer_hidden_dim,
            feed_forward_dim=transformer_linear_dim,
            n_layers=transformer_n_layers, 
            n_heads=transformer_n_heads, 
            dropout=dropout, 
            device=device
        )
        # classification head to map start-of-sequence embeddings to classes logits
        self.classificator = nn.Sequential(
            nn.Linear(transformer_hidden_dim, clf_linear_dim, device=device),
            nn.Dropout(dropout),
            nn.ReLU(),
            nn.Linear(clf_linear_dim, num_classes, device=device)
        )
    
    def forward(self, x: ndl.Tensor) -> ndl.Tensor:
        encoded_state = self.encoder(x)  # batch_size x seq_len x hidden_dim
        encoded_state_first = ops.split(encoded_state, axis=1)[0]  # batch_size x hidden_dim
        clf_output = self.classificator(encoded_state_first)  # batch_size x num_classes
        return clf_output
    
    @staticmethod
    def _softmax(x: np.ndarray) -> np.ndarray:
        """
        Numpy softmax for predicting label indices.
        """
        e_x = np.exp(x - x.max(axis=-1, keepdims=True))
        return e_x / e_x.sum(axis=-1, keepdims=True)
    
    def get_labels_from_logits(self, x: ndl.Tensor) -> np.ndarray:
        return self._softmax(x.detach().numpy()).argmax(axis=-1)
    
    def predict(self, x: ndl.Tensor) -> np.ndarray:
        """
        Predict label indices.
        """
        self.eval()
        out = self.forward(x)
        return self.get_labels_from_logits(out)

In [None]:
model = TextClassifier(
    num_classes=train_dataset.num_classes,
    vocab_size=len(tokenizer),
    transformer_max_length=MAX_LENGTH,
    
    transformer_hidden_dim=64 * 4,
    transformer_linear_dim=64 * 4,
    transformer_n_layers=3,
    transformer_n_heads=4,
    
    clf_linear_dim=128,
    device=DEVICE
)

## 5. Training

Unfortunately, training turned out to be very slow, so to demonstrate that the model *actually learns* we try to overfit it on a single batch.
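
The idea behind this sanity check can be shown on a much smaller problem. The sketch below is a toy NumPy softmax regression with random data, not the needle model. It trains repeatedly on one fixed batch and compares the first and last loss values. The data, learning rate and step count are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 16))    # one fixed batch of 20 examples
y = rng.integers(0, 2, size=20)  # random binary labels
w = np.zeros((16, 2))


def loss_and_grad(w):
    logits = x @ w
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(20), y]).mean()
    probs[np.arange(20), y] -= 1  # gradient of softmax cross-entropy w.r.t. logits
    return loss, x.T @ probs / 20


losses = []
for step in range(500):
    loss, grad = loss_and_grad(w)
    losses.append(loss)
    w -= 0.5 * grad  # plain gradient descent on the same batch every step

accuracy = ((x @ w).argmax(axis=1) == y).mean()
print(f'first loss={losses[0]:.4f}, last loss={losses[-1]:.4f}, accuracy={accuracy:.2f}')
```

If the loss on a single repeated batch does not fall, the problem usually lies in the model, the loss or the optimizer wiring rather than in the data.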

In [None]:
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY)
criterion = nn.SoftmaxLoss()

In [None]:
def train_or_evaluate(model, dataloader, criterion, optimizer=None, log_step: int = None) -> Tuple[float, float]:
    if optimizer:
        model.train()
    else:
        model.eval()
    
    losses = []
    metrics = []

    # TODO: delete for 'real' training
    x_, y_ = next(iter(dataloader))

    for i, (x, y) in enumerate(tqdm(dataloader)):
        if optimizer:
            optimizer.reset_grad()
        
        # TODO: delete for 'real' training
        x, y = x_, y_
        
        out = model(x)
        loss = criterion(out, y)

        if optimizer:
            loss.backward()
            optimizer.step()

        loss_value = loss.detach().numpy()[0]
        losses.append(loss_value)
        
        labels = model.get_labels_from_logits(out)
        metric = accuracy_score(y.detach().numpy(), labels)
        metrics.append(metric)

        if log_step and i % log_step == 0:
            logger.debug(f'batch {i + 1} | mov.avg: loss={np.mean(losses[-log_step:]):.4f}, metric={np.mean(metrics[-log_step:]):.4f}')
            logger.debug(f'y_true: {np.array(y.detach().numpy(), dtype=int)}, y_pred: {labels}')
        
        # TODO: delete for 'real' training
        if i == 201:
            break

    return np.mean(losses), np.mean(metrics)

The training logs show the training loss decreasing (almost to 0.0) and the accuracy score increasing (almost to 1.0):

In [None]:
train_losses, train_metrics, eval_losses, eval_metrics = [], [], [], []

for epoch in range(1, EPOCHS + 1):
    train_loss, train_metric = train_or_evaluate(model, train_dataloader, criterion, optimizer, log_step=20)
    logger.info(f'Epoch: {epoch} / {EPOCHS} | train_loss: {train_loss:.5f} | train_metric: {train_metric:.5f}')

    # TODO: uncomment for 'real' training
    # eval_loss, eval_metric = train_or_evaluate(model, valid_dataloader, criterion)
    # logger.info(f'Epoch: {epoch} / {EPOCHS} | eval_loss: {eval_loss:.5f} | eval_metric: {eval_metric:.5f}')

    train_losses.append(train_loss)
    train_metrics.append(train_metric)
    # eval_losses.append(eval_loss)
    # eval_metrics.append(eval_metric)

  0%|          | 0/2000 [00:00<?, ?it/s]

2023-01-03 16:18:14.341 | DEBUG    | __main__:train_or_evaluate:35 - batch 1 | mov.avg: loss=0.8860, metric=0.5500
2023-01-03 16:18:14.385 | DEBUG    | __main__:train_or_evaluate:36 - y_true: [0 0 0 1 0 0 1 1 0 1 0 1 0 0 1 0 0 0 1 1], y_pred: [1 1 1 0 0 1 1 1 0 1 1 1 1 0 1 1 1 0 1 1]
2023-01-03 16:21:35.693 | DEBUG    | __main__:train_or_evaluate:35 - batch 21 | mov.avg: loss=0.9726, metric=0.5300
2023-01-03 16:21:35.722 | DEBUG    | __main__:train_or_evaluate:36 - y_true: [0 0 0 1 0 0 1 1 0 1 0 1 0 0 1 0 0 0 1 1], y_pred: [1 0 0 0 1 1 1 0 1 0 0 1 0 1 0 1 0 0 0 1]
2023-01-03 16:24:58.015 | DEBUG    | __main__:train_or_evaluate:35 - batch 41 | mov.avg: loss=0.7295, metric=0.5225
2023-01-03 16:24:58.043 | DEBUG    | __main__:train_or_evaluate:36 - y_true: [0 0 0 1 0 0 1 1 0 1 0 1 0 0 1 0 0 0 1 1], y_pred: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]
2023-01-03 16:28:19.885 | DEBUG    | __main__:train_or_evaluate:35 - batch 61 | mov.avg: loss=0.6983, metric=0.5525
2023-01-03 16:28:19.887 | D

## 6. Save/load

The main challenge in making `needle` objects serializable was persisting `BackendDevice` and `NDArray`.

For `BackendDevice`, we save only the device's name during serialization and recreate the device by calling the corresponding function during deserialization:
```python
class BackendDevice:
    def __getstate__(self):
        return {'name': self.name}

    def __setstate__(self, state):
        devices = {'cpu': cpu, 'cuda': cuda, 'cpu_numpy': cpu_numpy}
        self.__dict__ = devices[state['name']]().__dict__.copy()
```
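
The same pattern can be tried in isolation with a toy class. In this sketch `ToyDevice` is a hypothetical stand-in, and a lambda plays the role of a backend attribute that `pickle` cannot serialize. Only the name is stored, and the rest of the object is rebuilt from it on load.

```python
import pickle


class ToyDevice:
    """Hypothetical stand-in for a device wrapping a non-picklable backend."""

    def __init__(self, name):
        self.name = name
        self.backend = lambda: name  # lambdas cannot be pickled

    def __getstate__(self):
        # Persist only the name.
        return {'name': self.name}

    def __setstate__(self, state):
        # Rebuild the full object from the name via the normal constructor.
        self.__dict__ = ToyDevice(state['name']).__dict__.copy()


restored = pickle.loads(pickle.dumps(ToyDevice('cpu')))
print(restored.name, restored.backend())  # cpu cpu
```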

For `NDArray`, we convert the numeric data into a `numpy` array during serialization and create an `NDArray` from that `numpy` array during deserialization:
```python
class NDArray:
    def __getstate__(self):
        attributes = self.__dict__.copy()
        attributes['_handle'] = self.numpy()
        return attributes

    def __setstate__(self, state):
        state['_handle'] = NDArray(state['_handle'], device=state['_device'])._handle
        self.__dict__ = state
```
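
A standard-library analogue of this round trip is sketched below. `ToyArray` is hypothetical: a `memoryview`, which `pickle` refuses to serialize, stands in for the backend handle, and `bytes` stands in for the exported `numpy` array.

```python
import pickle


class ToyArray:
    """Hypothetical array whose data lives in a handle that pickle cannot serialize."""

    def __init__(self, values):
        self.shape = (len(values),)
        self._handle = memoryview(bytearray(values))  # memoryview objects are not picklable

    def tolist(self):
        return list(self._handle)

    def __getstate__(self):
        state = self.__dict__.copy()
        state['_handle'] = bytes(self._handle)  # export the buffer as a plain value
        return state

    def __setstate__(self, state):
        # Re-create the handle from the exported value.
        state['_handle'] = memoryview(bytearray(state['_handle']))
        self.__dict__ = state


restored = pickle.loads(pickle.dumps(ToyArray([1, 2, 3])))
print(restored.shape, restored.tolist())  # (3,) [1, 2, 3]
```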

Another nuance: we need to clear each `Tensor`'s `grad` before dumping, otherwise Python raises a recursion error.

After these upgrades we can use plain `pickle.dump` and `pickle.load` in `needle.utils.save` and `needle.utils.load`:

```python
def save(obj, filepath: Union[str, Path]):
    """
    Save needle object to a file
    (tensor's gradients will be set as None).
    """
    if isinstance(obj, ndl.nn.Module):
        for param in obj.parameters():
            param.grad = None
    elif isinstance(obj, ndl.Tensor):
        obj.grad = None
    with open(filepath, 'wb') as file:
        pickle.dump(obj, file)


def load(filepath: Union[str, Path]):
    """
    Load needle object from a file
    (the object is put on the same device it was on before being saved).
    """
    with open(filepath, 'rb') as file:
        return pickle.load(file)
```

Predictions of the model and true labels:

In [None]:
true_labels = np.array(batch_y.detach().numpy(), dtype=int)
model_predictions = model.predict(batch_x)

print(model_predictions, true_labels)

[0 1 0 1 0 0 1 1 0 1 0 1 0 0 1 0 0 0 1 1] [0 0 0 1 0 0 1 1 0 1 0 1 0 0 1 0 0 0 1 1]


Save the model to a file:

In [None]:
ndl.save(model, 'model.ndl')

In [None]:
! du -h model.ndl

15M	model.ndl


The dumped model is also available in the repository's `assets/` folder.

Restore model and make predictions:

In [None]:
model_restored = ndl.load('model.ndl')

model_restored_predictions = model_restored.predict(batch_x)

print(f'model_restored_predictions: {model_restored_predictions}')
print(f'predictions are the same as before dumping: {(model_predictions == model_restored_predictions).all()}')

model_restored_predictions: [0 1 0 1 0 0 1 1 0 1 0 1 0 0 1 0 0 0 1 1]
predictions are the same: True


In [None]:
# TODO: uncomment after 'real' training

# texts = [
#     "I don't like this movie at all. Poor acting, boring story... No wonder that box office is low!",
#     "This movie is fantastic, I really enjoyed it! Great actors, good operator's work, beautiful music! This director became my favourite!"
# ]
# texts_tensor = ndl.Tensor([tokenizer(line, MAX_LENGTH) for line in texts], device=DEVICE)
# preds = model_restored.predict(texts_tensor)
# print(f'preds: {preds}')
# print('labels:')
# for pred in preds:
#     print(train_dataset.index2label[pred])

preds: [0 0]
labels:
negative
negative
