# Transformers architecture and BERT
<sup>This notebook is part of the Natural Language Processing class at the University of Ljubljana, Faculty of Computer and Information Science. Please contact [ales.zagar@fri.uni-lj.si](mailto:ales.zagar@fri.uni-lj.si) with any comments.</sup>

The [Transformers](https://huggingface.co/transformers/quicktour.html) library offers a variety of implemented architectures (TensorFlow and PyTorch) along with [pre-trained models](https://huggingface.co/models) for different tasks: sequence classification, sequence tagging, machine translation, and others. Some Slovene models can be found there as well. Other Slovene models are available at:
   
* [CroSloEn BERT](https://www.clarin.si/repository/xmlui/handle/11356/1330)
* [SloBERTa 1.0](https://www.clarin.si/repository/xmlui/handle/11356/1387)
* [SloBERTa 2.0](https://www.clarin.si/repository/xmlui/handle/11356/1397)

[A nice introduction to BERT](https://huggingface.co/blog/bert-101) (for reading).


A lot of [notebooks](https://huggingface.co/docs/transformers/v4.39.1/en/notebooks) exist that can help you learn or speed up the model building process. 

We load the dataset with the 🤗 Datasets library. We will work on a sentiment analysis task: labeling movie reviews as positive or negative.

In [43]:
from datasets import load_dataset

imdb = load_dataset("imdb")

imdb['train'][0]  # print first example

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

# Tokenizers Overview

Tokenizers play a fundamental role in Natural Language Processing (NLP) by breaking down text into smaller, more manageable units called tokens. These tokens can be words, subwords, or even characters, depending on the granularity required for the task at hand. The choice of tokenizer can significantly impact the performance of NLP models, as it affects how the text is represented and understood by the algorithms.

## Types of Tokenizers

1. **Word Tokenizers**: These tokenizers split text into words, using spaces and punctuation as delimiters. They are simple to implement and understand, but might not be effective for languages that don't use spaces to separate words, or for handling compound words in languages like German.

2. **Subword Tokenizers**: Subword tokenization algorithms such as Byte-Pair Encoding (BPE), WordPiece, and Unigram (the latter two are commonly used through BERT-style tokenizers and the SentencePiece library, respectively) break words down into smaller units (subwords). This approach helps in handling out-of-vocabulary words, and provides a balance between the flexibility of character tokenization and the efficiency of word tokenization.

3. **Character Tokenizers**: These tokenizers break text down to the character level, offering the highest granularity. This can be useful for tasks like character-level language modeling or languages with no clear word boundaries, but it usually leads to longer sequences compared to word or subword tokenization.

4. **Byte-Level Tokenizers**: Similar to character tokenizers, byte-level tokenizers operate at the byte level, encoding each byte of the text as a separate token. This approach is language-agnostic and can handle any text, because the base vocabulary never needs more than the 256 possible byte values.
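To make the difference in granularity concrete, here is a minimal sketch that splits text at the word, character, and byte level using only the Python standard library. The example sentence and the regular expression are illustrative assumptions, not what any particular library does internally.

```python
import re

text = "Tokenizers aren't magic."

# Word-level: runs of word characters, with each punctuation mark as its own token.
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character-level: every character, including spaces, becomes a token.
char_tokens = list(text)

# Byte-level: every UTF-8 byte becomes a token; a non-ASCII letter yields several.
byte_tokens = list("č".encode("utf-8"))

print(word_tokens)
print(len(char_tokens))
print(byte_tokens)
```

The word-level split yields six tokens, the character-level split yields one token per character, and the single Slovene letter `č` turns into two byte tokens.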

## Importance in NLP

Tokenization is the first step in preprocessing text data for NLP tasks. A well-chosen tokenizer can:

- Improve model understanding of language nuances
- Reduce the size of the vocabulary, leading to more efficient training and inference
- Handle a wide range of languages and special text elements like emojis or domain-specific terms

Choosing the right tokenizer is crucial for building robust and high-performing NLP models.

In [44]:
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

In [45]:
tokenizer.tokenize("don't be so judgmental")

['don', "'", 't', 'be', 'so', 'judgment', '##al']
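The `##` prefix in the output above marks a piece that continues the preceding token. BERT's WordPiece tokenizer splits a word by greedy longest-match-first lookup in its vocabulary. The sketch below illustrates that idea; the function name `wordpiece_split` and the tiny vocabulary are hypothetical and only serve the example.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Split one word into subwords by greedy longest-match-first lookup."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the same word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # nothing matched: the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary; the real bert-base-cased vocabulary is far larger.
toy_vocab = {"judgment", "judge", "##al", "##ment", "be", "so"}
print(wordpiece_split("judgmental", toy_vocab))
```

With this toy vocabulary the word `judgmental` is split into `judgment` and `##al`, the same pieces the real tokenizer produced above.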

In [46]:
tokenized_text_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("don't be so judgmental"))
tokenized_text_ids

[1274, 112, 189, 1129, 1177, 9228, 1348]

In [47]:
text = tokenizer.decode(tokenized_text_ids)
text

"don't be so judgmental"

In [48]:
t = tokenizer.encode("don't be so judgmental", return_tensors='pt')  # tokenizer will return pytorch tensors

print(t)
print(tokenizer.decode(t[0]))  # print decoded string with special tokens included
print(tokenizer.decode(t[0], skip_special_tokens=True))

tensor([[ 101, 1274,  112,  189, 1129, 1177, 9228, 1348,  102]])
[CLS] don't be so judgmental [SEP]
don't be so judgmental


### encode() vs encode_plus() methods

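.encode(text):

- This method converts the input text into token IDs and, by default, adds the model's special tokens ([CLS] and [SEP] for BERT).
- It returns a list of token IDs representing the input text (or a tensor when `return_tensors` is set, as in the cell above).
- This method is straightforward and useful when you only need token IDs for the input.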

.encode_plus(text, ...):

- In addition to converting the input text into token IDs, this method also generates additional information such as attention masks, token type IDs, etc., depending on the specific model and tokenizer.
- It returns a dictionary containing token IDs ('input_ids'), attention mask ('attention_mask'), and potentially other information like token type IDs ('token_type_ids'), depending on the model architecture.
- This method is more versatile and useful when you need additional information along with token IDs, such as when preparing inputs for model training or inference.
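The relationship between the two methods can be sketched in plain Python. The functions `encode_sketch` and `encode_plus_sketch` are hypothetical stand-ins, not the library implementation; the IDs 101 and 102 are the [CLS] and [SEP] IDs visible in the outputs of this notebook.

```python
CLS_ID, SEP_ID = 101, 102  # special-token IDs seen in this notebook's outputs


def encode_sketch(token_ids):
    """Mimic encode(): wrap the token IDs in [CLS] ... [SEP]."""
    return [CLS_ID] + list(token_ids) + [SEP_ID]


def encode_plus_sketch(token_ids):
    """Mimic encode_plus() for a single, unpadded sequence."""
    input_ids = encode_sketch(token_ids)
    return {
        "input_ids": input_ids,
        "token_type_ids": [0] * len(input_ids),  # only one segment
        "attention_mask": [1] * len(input_ids),  # no padding yet
    }


ids = [1274, 112, 189, 1129, 1177, 9228, 1348]  # "don't be so judgmental"
print(encode_sketch(ids))
print(encode_plus_sketch(ids))
```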


In [49]:
tokenizer.encode_plus("don't be so judgmental", return_tensors='pt')

{'input_ids': tensor([[ 101, 1274,  112,  189, 1129, 1177, 9228, 1348,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}

To encode more than one sequence at once, we use `batch_encode_plus`.

In [50]:
tokenizer.batch_encode_plus(["don't be so judgmental", 'i am a student'])

{'input_ids': [[101, 1274, 112, 189, 1129, 1177, 9228, 1348, 102], [101, 178, 1821, 170, 2377, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}

We usually want all sequences in a batch to have the same length, so we need to decide how to pad them. If we set `padding=True`, the tokenizer pads to the longest sequence in the batch; with `padding='max_length'`, it pads to the length given by `max_length`.

In [51]:
tokenizer.batch_encode_plus(["don't be so judgmental", 'i am a student'], padding=True, return_tensors='pt')

{'input_ids': tensor([[ 101, 1274,  112,  189, 1129, 1177, 9228, 1348,  102],
        [ 101,  178, 1821,  170, 2377,  102,    0,    0,    0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0, 0, 0]])}

In [52]:
tokenizer.batch_encode_plus(["don't be so judgmental", 'i am a student'], padding='max_length', max_length=15, return_tensors='pt')

{'input_ids': tensor([[ 101, 1274,  112,  189, 1129, 1177, 9228, 1348,  102,    0,    0,    0,
            0,    0,    0],
        [ 101,  178, 1821,  170, 2377,  102,    0,    0,    0,    0,    0,    0,
            0,    0,    0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}
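Both padding strategies shown above boil down to the same operation with a different target length. A minimal sketch follows, assuming the [PAD] ID is 0 as in the outputs above; `pad_batch` is a hypothetical helper and does not truncate sequences that exceed the target.

```python
PAD_ID = 0  # [PAD] ID seen in the outputs above


def pad_batch(batch, max_length=None):
    """Pad to max_length, or to the longest sequence when max_length is None."""
    target = max_length if max_length is not None else max(len(seq) for seq in batch)
    input_ids = [seq + [PAD_ID] * (target - len(seq)) for seq in batch]
    # The attention mask is 1 for real tokens and 0 for padding.
    attention_mask = [[1] * len(seq) + [0] * (target - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [
    [101, 1274, 112, 189, 1129, 1177, 9228, 1348, 102],
    [101, 178, 1821, 170, 2377, 102],
]
longest = pad_batch(batch)                 # like padding=True
fixed = pad_batch(batch, max_length=15)    # like padding='max_length', max_length=15
print(longest["input_ids"][1])
print(fixed["attention_mask"][0])
```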

In [53]:
# Tokenize first sample in the train set
tokenizer(imdb['train'][0]['text'], padding='max_length', max_length=tokenizer.model_max_length)

{'input_ids': [101, 146, 12765, 146, 6586, 140, 19556, 19368, 13329, 118, 162, 21678, 2162, 17056, 1121, 1139, 1888, 2984, 1272, 1104, 1155, 1103, 6392, 1115, 4405, 1122, 1165, 1122, 1108, 1148, 1308, 1107, 2573, 119, 146, 1145, 1767, 1115, 1120, 1148, 1122, 1108, 7842, 1118, 158, 119, 156, 119, 10148, 1191, 1122, 1518, 1793, 1106, 3873, 1142, 1583, 117, 3335, 1217, 170, 5442, 1104, 2441, 1737, 107, 6241, 107, 146, 1541, 1125, 1106, 1267, 1142, 1111, 1991, 119, 133, 9304, 120, 135, 133, 9304, 120, 135, 1109, 4928, 1110, 8663, 1213, 170, 1685, 3619, 3362, 2377, 1417, 14960, 1150, 3349, 1106, 3858, 1917, 1131, 1169, 1164, 1297, 119, 1130, 2440, 1131, 3349, 1106, 2817, 1123, 2209, 1116, 1106, 1543, 1199, 3271, 1104, 4148, 1113, 1184, 1103, 1903, 156, 11547, 1162, 1354, 1164, 2218, 1741, 2492, 1216, 1112, 1103, 4357, 1414, 1105, 1886, 2492, 1107, 1103, 1244, 1311, 119, 1130, 1206, 4107, 8673, 1105, 6655, 10552, 3708, 2316, 1104, 8583, 1164, 1147, 11089, 1113, 4039, 117, 1131, 1144, 2673, 1

Let's now tokenize the whole dataset. We set `truncation=True`, which truncates each input to the length given by the `max_length` argument or, if that argument is not provided, to the maximum input length the model accepts. Truncation proceeds token by token; when a pair of sequences (or a batch of pairs) is provided, tokens are removed from the longest sequence in the pair.
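The truncation behaviour described here can be sketched as follows. Both helpers are hypothetical illustrations rather than the library code. They assume that the two special tokens count towards `max_length` for a single sequence, and that a tie between the two sequences of a pair is resolved by trimming the second one.

```python
def truncate_single(token_ids, max_length, cls_id=101, sep_id=102):
    """Truncate one sequence so that [CLS] + tokens + [SEP] fits into max_length."""
    budget = max_length - 2  # reserve room for the two special tokens
    return [cls_id] + list(token_ids)[:budget] + [sep_id]


def truncate_pair_longest_first(ids_a, ids_b, max_tokens):
    """Drop tokens one at a time from the end of the currently longer sequence."""
    ids_a, ids_b = list(ids_a), list(ids_b)
    while len(ids_a) + len(ids_b) > max_tokens:
        if len(ids_a) > len(ids_b):
            ids_a.pop()
        else:
            ids_b.pop()  # tie-breaking towards the second sequence is an assumption
    return ids_a, ids_b


print(truncate_single([1, 2, 3, 4, 5], max_length=4))
print(truncate_pair_longest_first([1, 2, 3, 4, 5], [6, 7], max_tokens=4))
```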

We will also use the `.map` method of the dataset. Note that caching is one of the reasons why 🤗 Datasets is so efficient: it stores previously downloaded and processed datasets, so when you need them again they are reloaded directly from the cache. This avoids downloading a dataset again or reapplying processing functions. Even after you close Python and start another session, 🤗 Datasets will reload your dataset directly from the cache!

In [54]:
# define preprocess function
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

Combining the utility of Dataset.map() with batch mode is very powerful. It allows you to speed up processing, and freely control the size of the generated dataset.
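In batch mode the mapped function receives a dictionary of columns (lists) instead of a single row, and it may return a different number of rows than it received. The sketch below imitates that contract in plain Python. `map_batched` and `split_into_sentences` are hypothetical, and the real `Dataset.map` does more (caching, merging with existing columns, Arrow storage).

```python
def map_batched(rows, fn, batch_size):
    """Apply fn to column-oriented batches and collect the resulting rows."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Row-oriented chunk -> column-oriented batch, e.g. {"text": [...]}.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        result = fn(batch)
        # Column-oriented result -> rows; the row count may differ from the input.
        n_out = len(next(iter(result.values())))
        out.extend({key: result[key][i] for key in result} for i in range(n_out))
    return out


def split_into_sentences(batch):
    """Hypothetical batch function that returns more rows than it receives."""
    sentences = [s.strip() for text in batch["text"] for s in text.split(".") if s.strip()]
    return {"text": sentences}


rows = [{"text": "Great cast. Weak plot."}, {"text": "Loved it."}]
mapped = map_batched(rows, split_into_sentences, batch_size=2)
print(mapped)
```

Two input rows become three output rows, which is what controlling the size of the generated dataset means in practice.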

In [55]:
# Tokenize dataset
tokenized_imdb = imdb.map(preprocess_function, batched=True, batch_size=1000, load_from_cache_file=True)

In [56]:
tokenized_imdb['train'][0].keys()

dict_keys(['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'])

In [57]:
len(tokenized_imdb['train'][10]['input_ids'])

375

Data collators are objects that will form a batch by using a list of dataset elements as input. These elements are of the same type as the elements of train_dataset or eval_dataset.

In [58]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
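`DataCollatorWithPadding` pads each batch to the longest sequence in that batch, which is why the dataset above was tokenized without padding. The sketch below estimates how many token positions this saves compared with padding everything to 512 tokens, the maximum input length of `bert-base-cased`. The helper `padded_token_count` and the sequence lengths are hypothetical.

```python
def padded_token_count(lengths, batch_size, fixed_length=None):
    """Count token positions after padding, per batch or to a fixed length."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        width = fixed_length if fixed_length is not None else max(batch)
        total += width * len(batch)
    return total


lengths = [375, 120, 90, 40]  # hypothetical token counts of four reviews
dynamic = padded_token_count(lengths, batch_size=2)
fixed = padded_token_count(lengths, batch_size=2, fixed_length=512)
print(dynamic, fixed)
```

In this toy setting dynamic padding processes 930 token positions instead of 2048.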

We also need an evaluation function. 

In [59]:
import evaluate

accuracy = evaluate.load("accuracy")

In [60]:
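import numpy as np  # needed for argmax below


def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair supplied by the Trainer
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)  # most likely class per example
    return accuracy.compute(predictions=predictions, references=labels)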
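What `compute_metrics` calculates can be traced by hand on a few examples. The following is a dependency-free sketch with hypothetical logits and labels; `argmax_rows` and `accuracy_score` are illustrative helpers that stand in for `np.argmax` and the `accuracy` metric.

```python
def argmax_rows(logits):
    """Index of the largest logit in every row."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def accuracy_score(predictions, references):
    """Fraction of predictions that equal the reference labels."""
    correct = sum(int(p == r) for p, r in zip(predictions, references))
    return {"accuracy": correct / len(references)}


logits = [[2.1, -0.3], [0.2, 1.5], [-1.0, 0.4], [0.9, 0.1]]  # hypothetical model outputs
labels = [0, 1, 0, 0]

predictions = argmax_rows(logits)
result = accuracy_score(predictions, labels)
print(predictions, result)
```

Three of the four predictions match the labels, so the accuracy is 0.75.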

# Model 

When loading the pre-trained BertForSequenceClassification model from the bert-base-cased checkpoint, it's important to note that certain weights, specifically those associated with the classifier layer ('classifier.bias' and 'classifier.weight'), are not initialized from the checkpoint. This occurs because the BertForSequenceClassification model adapts the base BERT model for a specific sequence classification task, which often requires a custom final classifier layer tailored to the number of classes in the specific task at hand.

NOTE: This part of the notebook requires a lot of compute resources. 

In [61]:
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}
model = BertForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2, id2label=id2label, label2id=label2id)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [62]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./runs",
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    num_train_epochs=2,  # ignored here: a positive max_steps takes precedence
    weight_decay=0.01,
    evaluation_strategy="no",
    save_strategy="epoch",
    load_best_model_at_end=False,
    max_steps=3,  # demo only: stop after 3 optimizer steps (overrides num_train_epochs)
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)


In [63]:
trainer.train()


TrainOutput(global_step=3, training_loss=0.5767104625701904, metrics={'train_runtime': 5.0082, 'train_samples_per_second': 0.599, 'train_steps_per_second': 0.599, 'total_flos': 479458231740.0, 'train_loss': 0.5767104625701904, 'epoch': 0.00012})
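The `epoch` and `train_samples_per_second` values in this output follow directly from the training arguments. A short check, assuming a single device, no gradient accumulation, and the 25,000 examples of the IMDb training split:

```python
max_steps = 3
per_device_train_batch_size = 1
train_examples = 25_000   # size of the IMDb training split
train_runtime = 5.0082    # seconds, taken from the output above

samples_seen = max_steps * per_device_train_batch_size
epoch_fraction = samples_seen / train_examples
samples_per_second = samples_seen / train_runtime

print(epoch_fraction)
print(round(samples_per_second, 3))
```

Three examples out of 25,000 are a tiny fraction of one epoch, so the resulting model is essentially untrained.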

In [64]:
trainer.evaluate()

KeyboardInterrupt: 

In [65]:
# Save your model
trainer.save_model('./models/sentiment-bert')

# Load your model
model = BertForSequenceClassification.from_pretrained('./models/sentiment-bert')

In [66]:
text = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."

In [67]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="./models/sentiment-bert")
classifier(text)

Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.5852044224739075}]
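For a two-class model, the reported score is the softmax probability of the winning class. The sketch below reproduces that post-processing step with hypothetical logits; a score this close to 0.5 is what one would expect from a classifier head trained for only three steps.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "NEGATIVE", 1: "POSITIVE"}
logits = [-0.17, 0.17]  # hypothetical raw model outputs

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
prediction = {"label": id2label[best], "score": probs[best]}
print(prediction)
```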

# Custom neural model for IMDB reviews sentiment prediction

## Key Components

1. `LightningCustomIMDBModel`: A PyTorch Lightning module that defines the model architecture, training step, validation step, and optimizer configuration.
2. `IMDBDataset`: A custom PyTorch dataset class to handle tokenized IMDb reviews, ensuring they are correctly batched and passed to the model.
3. Training and validation loop setup using PyTorch Lightning's `Trainer`, with added functionality for model checkpointing and early stopping to prevent overfitting.
4. Demonstration of model inference on new data, showcasing the model's ability to evaluate sentiment on unseen movie reviews.

NOTE: This part of the notebook requires a lot of compute resources. 

In [5]:
import lightning as L
import numpy as np
from torch.utils.data import DataLoader
from torch.optim import AdamW
from transformers import BertTokenizer
import torch
import torch.nn as nn
import torch.nn.functional as F

In [6]:
class LightningCustomIMDBModel(L.LightningModule):

    def __init__(self, vocabulary_size, embedding_dimensions=128, cnn_filters=50, dnn_units=512, model_output_classes=2,
                 dropout_rate=0.1, learning_rate=1e-4):
        super().__init__()

        self.model_output_classes = model_output_classes
        self.embedding = nn.Embedding(vocabulary_size, embedding_dimensions)
        self.cnn_layer1 = nn.Conv1d(embedding_dimensions, cnn_filters, kernel_size=2)
        self.cnn_layer2 = nn.Conv1d(embedding_dimensions, cnn_filters, kernel_size=3)
        self.cnn_layer3 = nn.Conv1d(embedding_dimensions, cnn_filters, kernel_size=4)
        self.pool = nn.AdaptiveMaxPool1d(1)
        self.flatten = nn.Flatten()
        self.dense_1 = nn.Linear(cnn_filters * 3, dnn_units)
        self.dropout = nn.Dropout(dropout_rate)
        if self.model_output_classes == 2:
            self.last_dense = nn.Linear(dnn_units, 1)
            self.activation = torch.sigmoid
        else:
            self.last_dense = nn.Linear(dnn_units, model_output_classes)
            self.activation = F.softmax

        self.learning_rate = learning_rate
        self.save_hyperparameters()

    def forward(self, input_ids, labels=None):
        x = self.embedding(input_ids).permute(0, 2, 1)
        x1 = self.pool(F.relu(self.cnn_layer1(x)))
        x2 = self.pool(F.relu(self.cnn_layer2(x)))
        x3 = self.pool(F.relu(self.cnn_layer3(x)))

        concatenated = self.flatten(torch.cat((x1, x2, x3), dim=1))
        concatenated = F.relu(self.dense_1(concatenated))
        concatenated = self.dropout(concatenated)

        logits = self.last_dense(concatenated)

        outputs = {'logits': logits}
        if labels is not None:
            if self.model_output_classes == 2:  # Binary classification
                loss_fct = nn.BCEWithLogitsLoss()
                loss = loss_fct(logits.view(-1), labels.view(-1).type_as(logits))
            else:  # Multiclass classification
                loss_fct = nn.CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.model_output_classes), labels.view(-1))
            outputs['loss'] = loss
            self.log("my_loss", loss, on_step=True, on_epoch=True, prog_bar=True, logger=True)
        return outputs

    def training_step(self, batch, batch_idx):
        # Here, you define one training step
        input_ids, labels = batch['input_ids'], batch['labels']
        outputs = self(input_ids, labels)
        loss = outputs['loss']
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        input_ids, labels = batch['input_ids'], batch['labels']
        outputs = self(input_ids, labels)
        val_loss = outputs['loss']

        logits = outputs['logits']
        probs = torch.sigmoid(logits).cpu().numpy()

        # Determine class predictions with a threshold of 0.5
        preds = (probs >= 0.5).astype(int).flatten()

        # Ensure labels are on the same device as preds and also flattened
        labels = labels.cpu().flatten().numpy().astype(int)

        # Calculate accuracy
        correct = np.sum(preds == labels)  # Count how many predictions match the labels
        total = len(labels)  # Total number of labels
        acc = correct / total  # Calculate the accuracy

        # Log validation loss and accuracy
        self.log('val_loss', val_loss, prog_bar=True)
        self.log('val_acc', acc, prog_bar=True)

        # Return the loss and accuracy
        return {'val_loss': val_loss, 'val_acc': acc}

    def test_step(self, batch, batch_idx):
        input_ids, labels = batch['input_ids'], batch['labels']
        outputs = self(input_ids, labels)
        val_loss = outputs['loss']

        logits = outputs['logits']
        probs = torch.sigmoid(logits).cpu().numpy()

        # Determine class predictions with a threshold of 0.5
        preds = (probs >= 0.5).astype(int).flatten()

        # Ensure labels are on the same device as preds and also flattened
        labels = labels.cpu().flatten().numpy().astype(int)

        # Calculate accuracy
        correct = np.sum(preds == labels)  # Count how many predictions match the labels
        total = len(labels)   # Total number of labels
        acc = correct / total  # Calculate the accuracy

        # Log test loss and accuracy
        self.log('test_loss', val_loss, prog_bar=True)
        self.log('test_acc', acc, prog_bar=True)

        # Return the loss and accuracy
        return {'test_loss': val_loss, 'test_acc': acc}

    def configure_optimizers(self):
        # Define optimizer (and scheduler if necessary)
        optimizer = AdamW(self.parameters(), lr=self.learning_rate)
        return optimizer
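In the binary case the model emits a single logit per review, `BCEWithLogitsLoss` applies the sigmoid internally, and thresholding the sigmoid at 0.5 is the same as checking whether the logit is non-negative. The following sketch writes out the per-example loss in its numerically stable form; the function names are hypothetical.

```python
import math


def bce_with_logits(logit, label):
    """Stable form of -[y*log(sigmoid(x)) + (1 - y)*log(1 - sigmoid(x))]."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def predict(logit):
    """sigmoid(x) >= 0.5 holds exactly when x >= 0."""
    return int(logit >= 0.0)


print(bce_with_logits(0.0, 1.0))   # an uninformative logit costs ln(2)
print(bce_with_logits(2.0, 1.0))   # a confident, correct logit costs much less
print(predict(-0.3), predict(1.2))
```

An untrained model produces logits near zero, so loss values near ln 2 ≈ 0.693 at the start of training are expected.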

In [7]:
from torch.utils.data import Dataset
from datasets import load_dataset, DatasetDict

imdb = load_dataset("imdb")
del imdb['unsupervised']

In [8]:
MAX_SEQ_LENGTH = 256
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
tokenized_imdb = imdb.map(lambda examples: tokenizer(examples['text'], truncation=True, padding='max_length', max_length=MAX_SEQ_LENGTH), batched=True)

# Split the train dataset into train and validation
train_test_split = tokenized_imdb["train"].train_test_split(test_size=0.1)

# Create a DatasetDict to hold the split datasets
split_datasets = DatasetDict({
    'train': train_test_split['train'],
    'validation': train_test_split['test'],
    'test': tokenized_imdb['test']
})


OUTPUT_CLASSES = 2  # read by IMDBDataset.__getitem__ to choose the label dtype


class IMDBDataset(Dataset):
    def __init__(self, tokenized_dataset):
        self.tokenized_dataset = tokenized_dataset

    def __len__(self):
        return len(self.tokenized_dataset)

    def __getitem__(self, idx):
        item = self.tokenized_dataset[idx]
        return {"input_ids": torch.tensor(item['input_ids'], dtype=torch.long),
                "labels": torch.tensor(item['label'], dtype=torch.float if OUTPUT_CLASSES == 2 else torch.long)}


BATCH_SIZE = 512
train_dataset = IMDBDataset(split_datasets["train"])
train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)

val_dataset = IMDBDataset(split_datasets["validation"])
val_dataloader = DataLoader(val_dataset, batch_size=BATCH_SIZE)

test_dataset = IMDBDataset(split_datasets["test"])
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE)

Map: 100%|██████████| 25000/25000 [00:43<00:00, 580.16 examples/s]
Map: 100%|██████████| 25000/25000 [00:40<00:00, 609.83 examples/s]
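The split sizes and the number of batches per epoch can be checked with a little arithmetic. This assumes the 25,000/25,000 train/test sizes shown in the progress bars above, a 10% validation split, and `BATCH_SIZE = 512`; the results should agree with the batch counts that Lightning prints in the training and test logs.

```python
import math

train_total, test_total = 25_000, 25_000
batch_size = 512

val_size = round(train_total * 0.1)     # 10% held out for validation
train_size = train_total - val_size

train_batches = math.ceil(train_size / batch_size)
val_batches = math.ceil(val_size / batch_size)
test_batches = math.ceil(test_total / batch_size)

print(train_size, val_size)
print(train_batches, val_batches, test_batches)
```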


In [9]:
VOCAB_LENGTH = len(tokenizer.vocab)
EMB_DIM = 200
CNN_FILTERS = 100
DNN_UNITS = 256
OUTPUT_CLASSES = 2
DROPOUT_RATE = 0.2
NB_EPOCHS = 15

# Instantiate the model
model = LightningCustomIMDBModel(VOCAB_LENGTH, EMB_DIM, CNN_FILTERS, DNN_UNITS, OUTPUT_CLASSES, DROPOUT_RATE)
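The size of each layer follows from these hyperparameters, and the numbers can be compared with the model summary that Lightning prints when training starts. The sketch assumes a vocabulary of 28,996 tokens for `bert-base-cased` and the `MAX_SEQ_LENGTH` of 256 set earlier.

```python
vocab_size = 28_996  # assumed size of the bert-base-cased vocabulary
emb_dim, cnn_filters, dnn_units = 200, 100, 256
seq_len = 256

# Embedding table: one vector of size emb_dim per vocabulary entry.
embedding_params = vocab_size * emb_dim

# Conv1d: in_channels * out_channels * kernel_size weights plus one bias per filter.
conv_params = {k: emb_dim * cnn_filters * k + cnn_filters for k in (2, 3, 4)}

# Without padding, a kernel of size k shortens the sequence to seq_len - k + 1.
conv_out_lengths = {k: seq_len - k + 1 for k in (2, 3, 4)}

# Max pooling reduces each branch to cnn_filters values; three branches are concatenated.
dense_1_params = (cnn_filters * 3) * dnn_units + dnn_units
last_dense_params = dnn_units * 1 + 1  # single logit for binary classification

print(embedding_params, conv_params, conv_out_lengths)
print(dense_1_params, last_dense_params)
```

Almost all parameters sit in the embedding table (about 5.8 M), while the three convolutions contribute roughly 40 K, 60 K, and 80 K each.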

In [10]:
# To save a checkpoint automatically during training, you can use callbacks like ModelCheckpoint
from lightning.pytorch.callbacks import ModelCheckpoint, EarlyStopping

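# Instantiate built-in callbacks (optional).
# Both monitor 'val_loss' (logged in validation_step) so that checkpoint selection
# and early stopping react to overfitting rather than to the training loss.
checkpoint_callback = ModelCheckpoint(dirpath='checkpoints/', save_top_k=1, verbose=True, monitor='val_loss', mode='min')
early_stopping_callback = EarlyStopping(monitor='val_loss', patience=3, mode='min')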
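The idea behind `patience=3` is that training stops once the monitored value has failed to improve for three checks in a row. Below is a minimal sketch of that logic, not Lightning's actual implementation; the function name and the loss values are hypothetical.

```python
def early_stopping_epoch(losses, patience=3, min_delta=0.0):
    """Return the index of the epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses):
        if loss < best - min_delta:
            best, wait = loss, 0  # improvement: remember it and reset the counter
        else:
            wait += 1             # no improvement at this check
            if wait >= patience:
                return epoch
    return None


monitored_losses = [0.69, 0.61, 0.58, 0.59, 0.60, 0.58, 0.62]  # hypothetical values
stop_epoch = early_stopping_epoch(monitored_losses, patience=3)
print(stop_epoch)
```

The best value is reached at epoch 2, and the three following checks bring no improvement, so training would stop at epoch 5.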

In [12]:
trainer = L.Trainer(callbacks=[checkpoint_callback, early_stopping_callback],
                    max_epochs=NB_EPOCHS,
                    accelerator='cpu',
                    devices=1,
                    enable_progress_bar=True
)

# Train the model
trainer.fit(model, train_dataloader, val_dataloader)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/home/azagar/projects/envs/lab9/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:76: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default

  | Name       | Type              | Params | Mode 
---------------------------------------------------------
0 | embedding  | Embedding         | 5.8 M  | train
1 | cnn_layer1 | Conv1d            | 40.1 K | train
2 | cnn_layer2 | Conv1d            | 60.1 K | train
3 | cnn_layer3 | Conv1d            | 80.1 K | train
4 | pool       | Adaptive

Sanity Checking DataLoader 0:   0%|          | 0/2 [00:00<?, ?it/s]

/home/azagar/projects/envs/lab9/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=15` in the `DataLoader` to improve performance.


                                                                           

/home/azagar/projects/envs/lab9/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=15` in the `DataLoader` to improve performance.
/home/azagar/projects/envs/lab9/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py:310: The number of training batches (44) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.


Epoch 0:  98%|█████████▊| 43/44 [00:50<00:01,  0.86it/s, v_num=0, my_loss_step=0.685]


Detected KeyboardInterrupt, attempting graceful shutdown ...


NameError: name 'exit' is not defined

In [13]:
# Test the model
trainer.test(model, test_dataloader)

/home/azagar/projects/envs/lab9/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'test_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=15` in the `DataLoader` to improve performance.


Testing DataLoader 0: 100%|██████████| 49/49 [00:22<00:00,  2.20it/s]


[{'my_loss_epoch': 0.6835217475891113,
  'test_loss': 0.6835217475891113,
  'test_acc': 0.6295599937438965}]


A complete training run on a GPU produces output similar to this:

```plaintext
/home/azagar/.miniconda3/bin/conda run -n whisper2 --no-capture-output python /home/azagar/myfiles/custom_bert/custom_model_lightning.py 
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
.... 
/home/azagar/local/miniconda3/envs/whisper2/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py:441: It is recommended to use `self.log('test_acc', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
Testing DataLoader 0: 100%|█████████████████████| 33/33 [00:02<00:00, 11.27it/s]
────────────────────────────────────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────────────────────────────────────
        test_acc            0.7319414615631104
        test_loss           0.5770595073699951
────────────────────────────────────────────────────────────────────────────────

Process finished with exit code 0
```

In [None]:
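new = "I don't like this movie."
# Move the inputs to whichever device the model is on (the trainer above uses the CPU)
inps = tokenizer(new, truncation=True, padding='max_length', max_length=MAX_SEQ_LENGTH, return_tensors='pt')['input_ids'].to(model.device)

model.eval()  # disable dropout for inference
with torch.no_grad():
    print(torch.sigmoid(model(inps)['logits']))

# load_from_checkpoint is a classmethod that returns a new model instance
best_model_path = checkpoint_callback.best_model_path
model = LightningCustomIMDBModel.load_from_checkpoint(best_model_path)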

In [None]:
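# To load the model:
import glob

# load_from_checkpoint does not expand wildcards, so resolve the pattern first;
# taking the last match in sorted order is an assumption of this example
ckpt_path = sorted(glob.glob('./checkpoints/*.ckpt'))[-1]
model = LightningCustomIMDBModel.load_from_checkpoint(ckpt_path)
model.eval()
new = "I don't like this movie."
inps = tokenizer(new, truncation=True, padding='max_length', max_length=MAX_SEQ_LENGTH, return_tensors='pt')['input_ids'].to(model.device)

with torch.no_grad():
    print(torch.sigmoid(model(inps)['logits']))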

# Custom neural model for IMDB reviews sentiment prediction using BERT Embeddings

## Key Highlights:

- **BERT Embeddings:** We utilize embeddings from a pre-trained BERT model as the foundation for our feature extraction. BERT's deep understanding of language semantics, garnered from extensive pre-training on diverse corpora, provides a rich contextual basis for our sentiment analysis task.

- **Freezing Weights:** To preserve the intrinsic language understanding capabilities of BERT and expedite training, we freeze the weights of the pre-trained layers. This approach allows us to benefit from BERT's pre-trained knowledge without the computational overhead of fine-tuning millions of parameters.

NOTE: This part of the notebook requires a lot of compute resources. 


In [None]:
import lightning as L
import numpy as np
from torch.utils.data import DataLoader
from torch.optim import AdamW  # transformers.AdamW is deprecated; use the PyTorch optimizer
from transformers import BertTokenizer, AutoModel
import torch
import torch.nn as nn
import torch.nn.functional as F


class LightningCustomIMDBModel(L.LightningModule):

    def __init__(self, model_name, cnn_filters=50, dnn_units=512, model_output_classes=2,
                 dropout_rate=0.1, learning_rate=1e-4, freeze=True):
        super().__init__()

        self.model_output_classes = model_output_classes
        self.bert = AutoModel.from_pretrained(model_name)  # Load pre-trained BERT
        self.bert.train()  # The model is by default in eval mode

        # Freeze BERT parameters
        if freeze:
            self.bert.eval()
            for param in self.bert.parameters():
                param.requires_grad = False

        embedding_dimensions = self.bert.config.hidden_size  # Use the embedding size from BERT config

        self.cnn_layer1 = nn.Conv1d(embedding_dimensions, cnn_filters, kernel_size=2)
        self.cnn_layer2 = nn.Conv1d(embedding_dimensions, cnn_filters, kernel_size=3)
        self.cnn_layer3 = nn.Conv1d(embedding_dimensions, cnn_filters, kernel_size=4)
        self.pool = nn.AdaptiveMaxPool1d(1)
        self.flatten = nn.Flatten()
        self.dense_1 = nn.Linear(cnn_filters * 3, dnn_units)
        self.dropout = nn.Dropout(dropout_rate)
        if self.model_output_classes == 2:
            self.last_dense = nn.Linear(dnn_units, 1)
            self.activation = torch.sigmoid
        else:
            self.last_dense = nn.Linear(dnn_units, model_output_classes)
            self.activation = F.softmax

        self.learning_rate = learning_rate
        self.save_hyperparameters()

    def forward(self, input_ids, attention_mask, labels=None):
        # Get embeddings from BERT
        bert_output = self.bert(input_ids=input_ids, attention_mask=attention_mask)

        # Use the last hidden state as embeddings
        embeddings = bert_output.last_hidden_state.permute(0, 2, 1)  # Permute to match (batch_size, channels, length)

        x1 = self.pool(F.relu(self.cnn_layer1(embeddings)))
        x2 = self.pool(F.relu(self.cnn_layer2(embeddings)))
        x3 = self.pool(F.relu(self.cnn_layer3(embeddings)))

        concatenated = self.flatten(torch.cat((x1, x2, x3), dim=1))
        concatenated = F.relu(self.dense_1(concatenated))
        concatenated = self.dropout(concatenated)

        logits = self.last_dense(concatenated)

        outputs = {'logits': logits}
        if labels is not None:
            if self.model_output_classes == 2:  # Binary classification
                loss_fct = nn.BCEWithLogitsLoss()
                loss = loss_fct(logits.view(-1), labels.view(-1).type_as(logits))
            else:  # Multiclass classification
                loss_fct = nn.CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.model_output_classes), labels.view(-1))
            outputs['loss'] = loss

        return outputs

    def training_step(self, batch, batch_idx):
        # Here, you define one training step
        input_ids, attention_mask, labels = batch['input_ids'], batch['attention_mask'], batch['labels']
        outputs = self(input_ids, attention_mask, labels)
        loss = outputs['loss']
        self.log("train_loss", loss)
        return loss

    def _evaluation_step(self, batch, prefix):
        # Shared logic for validation and test: compute the loss and batch accuracy, then log both
        input_ids, attention_mask, labels = batch['input_ids'], batch['attention_mask'], batch['labels']
        outputs = self(input_ids, attention_mask, labels)
        loss, logits = outputs['loss'], outputs['logits']

        if self.model_output_classes == 2:
            # Binary case: one logit per example, thresholded at probability 0.5
            preds = (torch.sigmoid(logits) >= 0.5).long().view(-1)
        else:
            # Multiclass case: pick the class with the highest logit
            preds = logits.argmax(dim=-1).view(-1)

        # Fraction of predictions in this batch that match the labels
        acc = (preds == labels.view(-1).long()).float().mean()

        # sync_dist=True averages the logged values across processes when training with DDP
        self.log(f'{prefix}_loss', loss, prog_bar=True, sync_dist=True)
        self.log(f'{prefix}_acc', acc, prog_bar=True, sync_dist=True)

        return {f'{prefix}_loss': loss, f'{prefix}_acc': acc}

    def validation_step(self, batch, batch_idx):
        return self._evaluation_step(batch, 'val')

    def test_step(self, batch, batch_idx):
        return self._evaluation_step(batch, 'test')

    def configure_optimizers(self):
        # Define optimizer (and scheduler if necessary)
        optimizer = AdamW(self.parameters(), lr=self.learning_rate)
        return optimizer



from torch.utils.data import Dataset, DataLoader
from datasets import load_dataset, DatasetDict

imdb = load_dataset("imdb")

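# The unlabeled 'unsupervised' split is not needed for classification
del imdb['unsupervised']

# The tokenizer must match the checkpoint used by the model (MODEL_NAME below);
# mixing the cased vocabulary with the uncased model would feed it the wrong token ids
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenized_imdb = imdb.map(
    lambda examples: tokenizer(examples['text'], padding='max_length', truncation=True,
                               max_length=256, return_attention_mask=True),
    batched=True,
)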

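# Split the train dataset into train (90%) and validation (10%).
# A fixed seed keeps the split identical across runs and across DDP processes.
train_test_split = tokenized_imdb["train"].train_test_split(test_size=0.1, seed=42)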

# Create a DatasetDict to hold the split datasets
split_datasets = DatasetDict({
    'train': train_test_split['train'],
    'validation': train_test_split['test'],
    'test': tokenized_imdb['test']
})


class IMDBDataset(Dataset):
    def __init__(self, tokenized_dataset):
        self.tokenized_dataset = tokenized_dataset

    def __len__(self):
        return len(self.tokenized_dataset)

    def __getitem__(self, idx):
        item = self.tokenized_dataset[idx]
        return {"input_ids": torch.tensor(item['input_ids'], dtype=torch.long),
                "attention_mask": torch.tensor(item['attention_mask'], dtype=torch.long),
                "labels": torch.tensor(item['label'], dtype=torch.float if OUTPUT_CLASSES == 2 else torch.long)}


BATCH_SIZE = 16
train_dataset = IMDBDataset(split_datasets["train"])
train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=20)

val_dataset = IMDBDataset(split_datasets["validation"])
val_dataloader = DataLoader(val_dataset, batch_size=BATCH_SIZE, num_workers=20)

test_dataset = IMDBDataset(split_datasets["test"])
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE, num_workers=20)


MODEL_NAME = 'bert-base-uncased'
CNN_FILTERS = 100
DNN_UNITS = 256
OUTPUT_CLASSES = 2
DROPOUT_RATE = 0.2
NB_EPOCHS = 5
FREEZE = True

# To save a checkpoint automatically during training, you can use callbacks like ModelCheckpoint
from lightning.pytorch.callbacks import ModelCheckpoint, EarlyStopping

# Instantiate the model
model = LightningCustomIMDBModel(MODEL_NAME, CNN_FILTERS, DNN_UNITS, OUTPUT_CLASSES, DROPOUT_RATE, freeze=FREEZE)

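# Instantiate built-in callbacks (optional).
# Both monitor the validation loss, so the best checkpoint and the stopping point
# reflect performance on held-out data rather than on the training batches.
checkpoint_callback = ModelCheckpoint(dirpath='checkpoints/', save_top_k=1, verbose=True, monitor='val_loss', mode='min')
early_stopping_callback = EarlyStopping(monitor='val_loss', patience=3, mode='min')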


trainer = L.Trainer(callbacks=[checkpoint_callback, early_stopping_callback],
                    max_epochs=NB_EPOCHS,
                    accelerator='gpu',
                    devices=3,
                    log_every_n_steps=10,
                    strategy='ddp_find_unused_parameters_true'  # For training on multiple gpus
                    )

# Train the model
trainer.fit(model, train_dataloader, val_dataloader)

# Test the model
trainer.test(model, test_dataloader)
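
The evaluation steps above turn logits into class predictions before computing accuracy. The following is a minimal sketch of that conversion in plain Python, outside PyTorch, assuming a single logit per example in the binary case and one logit per class otherwise. The helper names (`predict_binary`, `predict_multiclass`, `accuracy`) are hypothetical and not part of the model defined in this notebook.

```python
import math


def sigmoid(z):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_binary(logits, threshold=0.5):
    """One logit per example: predict class 1 when the probability reaches the threshold."""
    return [int(sigmoid(z) >= threshold) for z in logits]


def predict_multiclass(logit_rows):
    """One logit per class: predict the index of the largest logit (argmax)."""
    return [max(range(len(row)), key=lambda i: row[i]) for row in logit_rows]


def accuracy(preds, labels):
    """Fraction of predictions that equal the gold labels."""
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)


# Toy logits, made up for illustration only
binary_preds = predict_binary([-2.0, 0.3, 3.0])
multi_preds = predict_multiclass([[0.1, 2.0, -1.0], [3.0, 0.0, 1.0]])
print(binary_preds, accuracy(binary_preds, [0, 1, 0]))
print(multi_preds, accuracy(multi_preds, [1, 0]))
```

With a threshold of 0.5, the sigmoid test is equivalent to checking whether the logit is non-negative, so the probability does not have to be computed just to pick a class.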

The run prints something like this:

```plaintext

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

....
Epoch 4: 100%|█| 176/176 [02:08<00:00,  1.37it/s, v_num=32, val_loss=0.465, val_
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
Testing DataLoader 0: 100%|███████████████████| 196/196 [02:01<00:00,  1.61it/s]
────────────────────────────────────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────────────────────────────────────
        test_acc            0.7701200246810913
        test_loss           0.47308972477912903
────────────────────────────────────────────────────────────────────────────────

Process finished with exit code 0
