# NLU project Language Modeling (LM)

Davide Lobba - 232089

Language modeling is a fundamental task in Natural Language Processing (NLP): the model estimates the probability of words in a sentence and can generate new words given a context.
The goal of this task is to train a model that produces meaningful new text according to the user's needs. This requires training on a massive corpus of text, such as books, articles or web pages, because during training the language model learns the dependencies and probability distributions between words, which enables it to generate coherent text.

The dataset used to train and evaluate the models in this project is a preprocessed version of the Penn Treebank (PTB). In this version all words are lower-cased, numbers are replaced with `N`, rare words are replaced by `<unk>`, and the vocabulary consists of the 10k most frequent words. PTB is one of the most popular datasets in the NLP community and is used for both character-level and word-level language modeling. It was created by the University of Pennsylvania from Wall Street Journal articles.
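
As a rough illustration of the preprocessing described above, the following sketch lower-cases the tokens, replaces numbers with `N` and maps words outside the most frequent `vocab_size` tokens to `<unk>`. The function name `ptb_style_preprocess`, the number pattern and the toy sentences are assumptions made for this example; the script that produced the actual PTB files may differ in its details.

```python
import re
from collections import Counter

# Hypothetical pattern for numeric tokens such as "1987", "3.5" or "1,200"
NUMBER_RE = re.compile(r"^\d+([.,]\d+)*$")

def ptb_style_preprocess(sentences, vocab_size):
    """Lower-case tokens, replace numbers with N, map rare words to <unk>."""
    normalized = []
    for sentence in sentences:
        tokens = []
        for token in sentence.lower().split():
            tokens.append("N" if NUMBER_RE.match(token) else token)
        normalized.append(tokens)

    # Keep only the vocab_size most frequent tokens
    counts = Counter(token for tokens in normalized for token in tokens)
    vocabulary = {token for token, _ in counts.most_common(vocab_size)}

    return [
        " ".join(token if token in vocabulary else "<unk>" for token in tokens)
        for tokens in normalized
    ]

toy_corpus = [
    "The index rose 3.5 points",
    "The index fell 12 points",
]
processed = ptb_style_preprocess(toy_corpus, vocab_size=4)
print(processed)
```

In this toy corpus the words `rose` and `fell` occur only once, so they fall outside the four most frequent tokens and are replaced by `<unk>`.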

The models presented in this project are:
- Vanilla LSTM
- LSTM with the improvements proposed by the paper [Regularizing and Optimizing LSTM Language Models](https://arxiv.org/abs/1708.02182)
- GPT-2 provided by Hugging Face, evaluated zero-shot and fine-tuned on the PTB dataset

The performance of the models is evaluated with perplexity.

In [None]:
import os
import torch

from collections import Counter

In [None]:
!wget 'https://data.deepai.org/ptbdataset.zip'

In [None]:
!mkdir dataset

In [None]:
!unzip /content/ptbdataset.zip -d dataset

In [None]:
!pip install wandb -qU
!pip install transformers
!pip install datasets
!pip install pytorch_transformers

In [None]:
import wandb
wandb.login(relogin=True)

In [None]:
PAD_TOKEN = 0

class Lang():
    '''
    This class contains methods for creating word-to-id and label-to-id mappings.
        The w2id method maps words to unique IDs.
        The lab2id method maps labels to unique IDs.
    '''
    def __init__(self, words, cutoff=0):
        self.word2id = self.w2id(words, cutoff=cutoff, unk=True, eos=True)
        self.id2word = {v:k for k, v in self.word2id.items()}
        
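    def w2id(self, elements, cutoff=0, unk=True, eos=True):
        # Reserved ids: padding, unknown word and end-of-sentence
        vocab = {'<pad>': PAD_TOKEN, '<unk>': 1, '<eos>': 2}
        count = Counter(elements)
        for k, v in count.items():
            # Skip tokens that already have an id ('<unk>' occurs in the PTB text):
            # reassigning them would make two different words share the same id
            if v > cutoff and k not in vocab:
                vocab[k] = len(vocab)
        return vocab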
    
    def lab2id(self, elements, pad=True):
        vocab = {}
        if pad:
            vocab['pad'] = PAD_TOKEN
        for elem in elements:
            vocab[elem] = len(vocab)
        return vocab

## Functions for document handling

In [None]:
import torch
import torch.utils.data as data

def read_file(path):
    with open(path, 'r') as f:
        words = f.read().split()
    return words

def read_sentence(sent):
    words = sent.split()
    return words

class Dataset(data.Dataset):
    def __init__(self, path, lang):
        self.input = []
        self.target = []
        self.length = []
        
        with open(path, 'r') as f:
            for line in f:
                words = line.split() + ['<eos>']
                words = [lang.word2id[word] for word in words]
                self.input.append(words[:-1])
                self.target.append(words[1:])
                self.length.append(len(words[:-1]))

    def __len__(self):
        return len(self.input)

    def __getitem__(self, idx):
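        # Word ids must be integer tensors to index the embedding layer
        input = torch.tensor(self.input[idx], dtype=torch.long)
        target = torch.tensor(self.target[idx], dtype=torch.long)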
        length = self.length[idx]
        return input, target, length


## Collate function

This collate function prepares a batch of samples for training, validation or testing. It pads the sequences of a batch to the same length so that the model can process them efficiently, and it handles the input and target sequences separately.

In [None]:
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def collate_fn(data):
    def merge(sequences):
        '''
        Pad a list of sequences from batch * sent_len to batch * max_len
        '''
        lengths = [len(seq) for seq in sequences]
        max_len = 1 if max(lengths)==0 else max(lengths)
        padded_seqs = torch.LongTensor(len(sequences),max_len).fill_(PAD_TOKEN)
        for i, seq in enumerate(sequences):
            end = lengths[i]
            padded_seqs[i, :end] = seq

        padded_seqs = padded_seqs.detach()

        return padded_seqs, lengths

    # Collect inputs and targets separately, then pad each list once
    input_seqs = []
    target_seqs = []
    for inputs, target, _ in data:
        input_seqs.append(inputs)
        target_seqs.append(target)

    inputs, lengths = merge(input_seqs)
    targets, _ = merge(target_seqs)

    return inputs, targets, lengths
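
To make the effect of padding concrete, the sketch below pads a toy batch with `torch.nn.utils.rnn.pad_sequence`, which is imported above, instead of the hand-written `merge` helper. The toy sequences are made up for illustration, and `PAD_TOKEN = 0` mirrors the constant defined earlier.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_TOKEN = 0  # same padding id as in the notebook

# Toy batch of word-id sequences with different lengths (made-up ids)
batch = [
    torch.tensor([5, 8, 2]),
    torch.tensor([7, 2]),
    torch.tensor([4, 9, 6, 2]),
]

lengths = [len(seq) for seq in batch]
# batch_first=True gives a [batch_size, max_len] tensor, padded on the right
padded = pad_sequence(batch, batch_first=True, padding_value=PAD_TOKEN)

print(padded.shape)
print(lengths)
```

The original lengths are kept alongside the padded tensor because `pack_padded_sequence` needs them to skip the padded positions.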

## Baseline LSTM

This class defines a Long Short-Term Memory (LSTM) model. LSTM is a type of Recurrent Neural Network (RNN) that is commonly used in NLP tasks, such as language modeling and text classification.
The model consists of an embedding layer, an LSTM layer and a fully connected output layer.
The embedding layer maps every word to its vector representation, the LSTM layer processes the sentence, and the fully connected layer produces, at each position, a score for every word in the vocabulary, from which the next-word probabilities are obtained.

In [None]:
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
from torch import nn
from typing import List, Tuple, Union
from torch.autograd import Variable

class BaselineLSTM(nn.Module):
    '''
    This class is the implementation of a BaselineLSTM.
        num_classes: number of classes
        emb_dim: embedding dimension for the feature representation of the words
        hid_dim: dimension of the hidden layer of the LSTM
        n_layers: number of hidden layers of the LSTM
        pad_value: value used for the padding of the sequences
    '''

    def __init__(self, num_classes, emb_dim, hid_dim, n_layers = 1, pad_value = 0):
        super(BaselineLSTM, self).__init__()
        self.embd = nn.Embedding(num_classes, emb_dim, pad_value)
        self.lstm = nn.LSTM(emb_dim, hid_dim, n_layers, bidirectional=False)
        self.classifier = nn.Linear(hid_dim, num_classes)
        self.n_layers = n_layers
        self.hidden_size = hid_dim
        self.pad_value = pad_value
        self.init_weights()

    def init_weights(self):
        for p in self.parameters():
            if p.data.ndimension() >= 2:
                nn.init.xavier_uniform_(p.data)
            else:
                nn.init.zeros_(p.data)

    def forward(self, input, lengths, hidden = None):
        total_length = input.shape[1]  # input has shape [batch_size, seq_len]
        input_emb = self.embd(input)

        packed_input = pack_padded_sequence(input_emb, lengths, batch_first=True, enforce_sorted=False)
        packed_output, hidden = self.lstm(packed_input)

        output, output_lengths = pad_packed_sequence(packed_output, batch_first=True, padding_value=self.pad_value, total_length=total_length)

        output = self.classifier(output)
        
        # Reshape output to (batch_size*sequence_length, num_classes)
        output = output.reshape(output.size(0)*output.size(1), output.size(2))

        return output, hidden
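
The shapes flowing through the three layers can be checked with a small stand-alone sketch. The sizes below (`vocab_size=10`, `emb_dim=4`, `hid_dim=6`) are arbitrary toy values, and the sketch skips the packing of padded sequences for brevity, so it is not a replacement for `BaselineLSTM`.

```python
import torch
from torch import nn

torch.manual_seed(0)
vocab_size, emb_dim, hid_dim = 10, 4, 6   # toy sizes, chosen for illustration
batch_size, seq_len = 2, 5

embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
lstm = nn.LSTM(emb_dim, hid_dim, num_layers=1, batch_first=True)
classifier = nn.Linear(hid_dim, vocab_size)

word_ids = torch.randint(1, vocab_size, (batch_size, seq_len))

embedded = embedding(word_ids)                 # [batch_size, seq_len, emb_dim]
hidden_states, _ = lstm(embedded)              # [batch_size, seq_len, hid_dim]
logits = classifier(hidden_states)             # [batch_size, seq_len, vocab_size]
flat_logits = logits.reshape(-1, vocab_size)   # one row of scores per position

# Softmax turns the scores of each position into next-word probabilities
probs = torch.softmax(flat_logits, dim=-1)
print(embedded.shape, hidden_states.shape, flat_logits.shape)
```

Flattening the output to one row per position is what allows `nn.CrossEntropyLoss` to compare it with the flattened targets.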

## Training and Test functions

The loss function used for the language modeling task is the cross-entropy. Perplexity, the evaluation metric, is the exponential of the average cross-entropy loss:
$$ \mathrm{PP}(W) = e^{L_{CE}(W)} $$
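
As a small worked example of the formula above, the sketch below computes the perplexity from the probabilities that a model assigns to the correct next words. The probabilities are made up for illustration.

```python
import math

def perplexity(target_probs):
    """Perplexity = exp(average negative log-likelihood of the target words)."""
    cross_entropy = -sum(math.log(p) for p in target_probs) / len(target_probs)
    return math.exp(cross_entropy)

# A model that always assigns probability 1/4 to the correct word
# is as uncertain as a uniform choice among 4 words
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])

# More confident (and correct) predictions give a lower perplexity
confident_ppl = perplexity([0.9, 0.8, 0.95, 0.7])

print(uniform_ppl, confident_ppl)
```

Perplexity can therefore be read as the effective number of words the model is choosing among at each step, so lower is better.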

- The batch size is 32 for training and 16 for validation and test
- The optimizer is Stochastic Gradient Descent (SGD) with a learning rate of 1.0
- The scheduler is ReduceLROnPlateau
- The LSTM has a single hidden layer
- The embedding and hidden dimensions are both 300

In [None]:
import math
import numpy as np
from torch.nn.utils import clip_grad_norm_

def train_step(model, optimizer, loss_function, train_loader, device, clip = None, scheduler = None, valid_perplexity = None):
    model.train()
    total_loss = 0

    for i, (inputs, targets, lengths) in enumerate(train_loader):
        inputs = inputs.to(device)
        targets = targets.to(device)

        model.zero_grad()
        optimizer.zero_grad()

        outputs, hidden = model(inputs, lengths)
        loss = loss_function(outputs, targets.view(-1))

        loss.backward()

        if clip is not None:
            clip_grad_norm_(model.parameters(), clip)

        optimizer.step()

        total_loss += loss.item()
    
    loss = total_loss/(i+1)
    perplexity = math.exp(loss)

    if scheduler is not None:
        scheduler.step(valid_perplexity)

    return loss, perplexity

In [None]:
def test_step(model, loss_function, test_loader, device):
    model.eval()
    total_loss = 0

    with torch.no_grad():
        for i, (inputs, targets, lengths) in enumerate(test_loader):
            inputs = inputs.to(device)
            targets = targets.to(device)

            outputs, hidden = model(inputs, lengths)
            loss = loss_function(outputs, targets.view(-1))

            total_loss += loss.item()

    loss = total_loss/(i+1)
    perplexity = math.exp(loss)

    return loss, perplexity


In [None]:
from tqdm import tqdm

def train(model, optimizer, loss_function, train_loader, test_loader, valid_loader, epoch, device, clip = None, scheduler = None, ntsgd = True):
    perplexity = float('inf')
    valid_perplexity = float('inf')
    valid_perplexity_list = []
    n=5
    
    for e in tqdm(range(epoch)):
        train_loss, train_perplexity = train_step(model, optimizer, loss_function, train_loader, device, clip, scheduler, valid_perplexity)
        
        test_loss, test_perplexity = test_step(model, loss_function, test_loader, device)

        valid_loss, valid_perplexity = test_step(model, loss_function, valid_loader, device)

        print("\nEpoch: {}".format(e+1))
        print("Train Loss : {}, Train Perplexity : {}" .format(train_loss, train_perplexity))
        print("Test Loss : {}, Test Perplexity : {}" .format(test_loss, test_perplexity))
        print("Valid Loss : {}, Valid Perplexity : {}" .format(valid_loss, valid_perplexity))

        valid_perplexity_list.append(valid_perplexity)

        if ntsgd and 't0' not in optimizer.param_groups[0] and (len(valid_perplexity_list)>n and valid_perplexity > min(valid_perplexity_list[:-n])):
                print('Switching to ASGD')
                optimizer = torch.optim.ASGD(model.parameters(), lr=1, t0=0, lambd=0., weight_decay=1.2e-6)

        if valid_perplexity < perplexity:
            os.makedirs("/content/lstm_weights/", exist_ok = True)
            torch.save(model.state_dict(), "/content/lstm_weights/lstm.pt")
            perplexity = valid_perplexity
        
        wandb.log({"Epoch": e+1, "train_perplexity": train_perplexity, "test_perplexity": test_perplexity, "valid_perplexity": valid_perplexity})
    wandb.log({'best_perplexity': perplexity})
    wandb.run.summary["best_perplexity"] = perplexity
    wandb.finish()

    print("\nTraining Completed, the best perplexity reached in validation is : {}".format(perplexity))
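
The switch from SGD to ASGD in `train` relies on a non-monotone trigger. The sketch below isolates that condition in a stand-alone helper; the name `should_switch_to_asgd` and the validation values are hypothetical, and `n` plays the role of the `n=5` constant used above.

```python
def should_switch_to_asgd(valid_history, n=5):
    """Return True when the latest validation perplexity is worse than the
    best value recorded at least n epochs earlier (non-monotone trigger)."""
    if len(valid_history) <= n:
        return False
    return valid_history[-1] > min(valid_history[:-n])

# Made-up validation perplexities, one per epoch
improving = [200.0, 150.0, 120.0, 110.0, 105.0, 101.0, 99.0]
stalled = [200.0, 150.0, 120.0, 121.0, 122.0, 123.0, 124.0, 125.0]

print(should_switch_to_asgd(improving))  # False
print(should_switch_to_asgd(stalled))    # True
```

Comparing against values from several epochs back, rather than only the previous epoch, avoids switching on a single noisy validation result.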

In [None]:
import torch.optim as optim

def main_baseline(hid_size, emb_size, n_layers, lr, pad_value, epochs):
    
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    
    data_train = read_file('dataset/ptb.train.txt')
    vocab = Lang(data_train)

    train_dataset = Dataset('dataset/ptb.train.txt', vocab)
    test_dataset = Dataset('dataset/ptb.test.txt', vocab)
    valid_dataset = Dataset('dataset/ptb.valid.txt', vocab)

    train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, collate_fn=collate_fn)
    valid_loader = DataLoader(valid_dataset, batch_size=16, collate_fn=collate_fn)
    test_loader = DataLoader(test_dataset, batch_size=16, collate_fn=collate_fn)

    vocab_len = len(vocab.word2id)

    model = BaselineLSTM(vocab_len, emb_size, hid_size, n_layers = n_layers, pad_value = 0).to(device)

    optimizer = optim.SGD(model.parameters(), lr=lr)
    loss_function = nn.CrossEntropyLoss(ignore_index=pad_value)
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min', factor=0.5, patience=2, verbose=True)
  
    wandb.init(project="NLU_project",
               name = "LSTM baseline",
               config = {
               "learning_rate": lr,
               "architecture": "Baseline",
               "hidden_size": hid_size,
               "embedding_size": emb_size,
               "epochs": epochs},
               notes = "this is the run of the baseline"
               )
    
    train(model, optimizer, loss_function, train_loader, test_loader, valid_loader, epochs, device, clip=None, scheduler=scheduler, ntsgd=False)

In [None]:
main_baseline(hid_size=300, emb_size=300, lr=1.0, n_layers=1, pad_value=0, epochs=60)

## Variational dropout (Locked dropout) and Weight dropout
The classes below implement the regularization techniques used to improve the LSTM model, following the paper "Merity, S., Keskar, N. S., & Socher, R. (2017). Regularizing and Optimizing LSTM Language Models. arXiv. https://doi.org/10.48550/arXiv.1708.02182"

In [None]:
from torch import nn

class LockedDropout(nn.Module):
    '''
    Class for the implementation of Variational dropout.
    The class is adapted from: https://pytorchnlp.readthedocs.io/en/latest/_modules/torchnlp/nn/lock_dropout.html
    Here the mask is sampled once per sequence and shared across the time steps of batch-first inputs.
    '''

    def __init__(self, p=0.5):
        super().__init__()
        self.p = p

    def forward(self, x):
        # x has shape [batch_size, seq_len, features], as in the models below
        if not self.training or not self.p:
            return x
        x = x.clone()
        # Sample one mask per sequence and reuse it at every time step
        mask = x.new_empty(x.size(0), 1, x.size(2), requires_grad=False).bernoulli_(1 - self.p)
        mask = mask.div_(1 - self.p)
        mask = mask.expand_as(x)
        return x * mask


    def __repr__(self):
        return self.__class__.__name__ + '(' \
            + 'p=' + str(self.p) + ')'
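
The property that distinguishes variational dropout from standard dropout, a single mask reused at every time step, can be verified with a short sketch. It assumes batch-first inputs of shape `[batch_size, seq_len, features]`; the helper name `locked_dropout_mask` and the toy tensor are made up for this example.

```python
import torch

torch.manual_seed(0)

def locked_dropout_mask(x, p):
    """Sample one dropout mask per sequence and share it across time steps.
    x is assumed to have shape [batch_size, seq_len, features]."""
    keep_prob = 1 - p
    mask = torch.bernoulli(torch.full((x.size(0), 1, x.size(2)), keep_prob))
    # Rescale so the expected value of the activations is unchanged
    return (mask / keep_prob).expand_as(x)

x = torch.ones(2, 4, 6)   # toy batch: 2 sequences, 4 time steps, 6 features
mask = locked_dropout_mask(x, p=0.5)
dropped = x * mask

# Every time step of a sequence sees the same dropped features
same_across_time = bool((dropped == dropped[:, :1, :]).all())
print(same_across_time)  # True
```

Standard dropout would instead sample a different mask for every time step, which disturbs the recurrent signal more than a locked mask does.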

In [None]:
from torch.nn import Parameter
class BackHook(torch.nn.Module):
    def __init__(self, hook):
        super(BackHook, self).__init__()
        self._hook = hook
        self.register_backward_hook(self._backward)

    def forward(self, *inp):
        return inp

    @staticmethod
    def _backward(self, grad_in, grad_out):
        self._hook()
        return None


class WeightDrop(torch.nn.Module):
    '''
    Class for the implementation of the Weight dropout technique.
    The class is taken from: https://pytorchnlp.readthedocs.io/en/latest/_modules/torchnlp/nn/weight_drop.html
    '''
    def __init__(self, module, weights, dropout=0, variational=False):
        super(WeightDrop, self).__init__()
        self.module = module
        self.weights = weights
        self.dropout = dropout
        self.variational = variational
        self._setup()
        self.hooker = BackHook(lambda: self._backward())

    def _setup(self):
        for name_w in self.weights:
            w = getattr(self.module, name_w)
            self.register_parameter(name_w + '_raw', Parameter(w.data))

    def _setweights(self):
        for name_w in self.weights:
            raw_w = getattr(self, name_w + '_raw')
            if self.training:
                mask = raw_w.new_ones((raw_w.size(0), 1))
                mask = torch.nn.functional.dropout(mask, p=self.dropout, training=True)
                w = mask.expand_as(raw_w) * raw_w
                setattr(self, name_w + "_mask", mask)
            else:
                w = raw_w
            rnn_w = getattr(self.module, name_w)
            rnn_w.data.copy_(w)

    def _backward(self):
        # transfer gradients from embeddedRNN to raw params
        for name_w in self.weights:
            raw_w = getattr(self, name_w + '_raw')
            rnn_w = getattr(self.module, name_w)
            raw_w.grad = rnn_w.grad * getattr(self, name_w + "_mask")

    def forward(self, *args):
        self._setweights()
        return self.module(*self.hooker(*args))

In [None]:
import torch.nn.functional as F

class EmbeddingDropout(torch.nn.Embedding):
    '''
    Class for the implementation of Embedding dropout at word level.
    This class is taken from: https://github.com/salesforce/awd-lstm-lm/blob/32fcb42562aeb5c7e6c9dec3f2a3baaaf68a5cb5/embed_regularize.py#L5
    '''
    def __init__(self, num_embeddings, embedding_dim, padding_idx=0, max_norm=None, norm_type=2, scale_grad_by_freq=False, sparse=False, dropout=0.1, scale=None):
        nn.Embedding.__init__(self, num_embeddings = num_embeddings, embedding_dim = embedding_dim, padding_idx = padding_idx, max_norm = max_norm, norm_type = norm_type, scale_grad_by_freq = scale_grad_by_freq, sparse = sparse)
        self.dropout = dropout
        self.scale = scale

    def forward(self, inputs):
        if self.training:
            dropout = self.dropout
        else:
            dropout = 0

        if dropout:
            mask = self.weight.data.new(self.weight.size(0), 1).bernoulli_(1 - dropout).expand_as(self.weight)/ (1 - dropout)
            masked_weight = self.weight * mask
        else:
            masked_weight = self.weight
        if self.scale and self.scale != 1:
            masked_weight = masked_weight * self.scale

        return F.embedding(inputs, masked_weight, max_norm=self.max_norm, norm_type=self.norm_type, scale_grad_by_freq=self.scale_grad_by_freq, sparse=self.sparse)

# LSTM with improvements

This model follows the paper "Merity, S., Keskar, N. S., & Socher, R. (2017). Regularizing and Optimizing LSTM Language Models. arXiv. https://doi.org/10.48550/arXiv.1708.02182".

The improvements that I applied are:
- Weight dropout on the recurrent hidden-to-hidden weight matrices, to prevent overfitting of the LSTM. Note that the dropped weights remain dropped for the entire forward and backward pass.

- NT-ASGD (Non-monotonically Triggered ASGD): training starts with SGD and switches to averaged SGD (ASGD) once the validation perplexity has stopped improving for a fixed number of epochs.

- Variational dropout, which samples a binary dropout mask only once upon the first call and then reuses that locked mask for all repeated connections within the forward and backward pass.

- Embedding dropout, which applies dropout to the embedding matrix at the word level.

- Weight tying, which shares the weights between the embedding and softmax layers, substantially reducing the total parameter count of the model.
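
To give a sense of how much weight tying saves, the following back-of-the-envelope sketch counts the parameters of the embedding and output layers with and without tying. It assumes a vocabulary of 10,000 words and the embedding size of 500 used for this model; the exact vocabulary size built by `Lang` may differ slightly because of the special tokens.

```python
def embedding_and_output_params(vocab_size, emb_dim, tied):
    """Count parameters of the embedding matrix and the output layer."""
    embedding = vocab_size * emb_dim
    output_bias = vocab_size
    # With tying, the output layer reuses the embedding matrix as its weight
    output_weight = 0 if tied else vocab_size * emb_dim
    return embedding + output_weight + output_bias

untied = embedding_and_output_params(10_000, 500, tied=False)
tied = embedding_and_output_params(10_000, 500, tied=True)

print(untied)           # 10010000
print(tied)             # 5010000
print(untied - tied)    # 5000000 parameters saved
```

Tying is only possible when the hidden size of the last LSTM layer equals the embedding size, which is why the model raises an error otherwise.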

In [None]:
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
from torch import nn
from typing import List, Tuple, Union
from torch.autograd import Variable

class FinalModel(nn.Module):
    """
    This is the implementation of the LSTM with the improvements proposed by the paper 
    "Merity, S., Keskar, N. S., & Socher, R. (2017). Regularizing and Optimizing LSTM Language Models.
    arXiv. https://doi.org/10.48550/arXiv.1708.02182".
        num_classes: number of classes
        emb_dim: embedding dimension for the feature representation of the words
        hid_dim: dimension of the hidden layer of the LSTM
        weight_d: weight dropout of the hidden to hidden weight matrices of the LSTM
        input_d: variational dropout applied to the input
        output_d: variational dropout applied to the output
        tie_weights: bool for the application of the weight tying
        emb_dropout: value for the embedding dropout
        n_layers: number of hidden layers of the LSTM
        pad_value: value used for the padding of the sequences
    """

    def __init__(self, num_classes, emb_dim, hid_dim, weight_d = None, input_d = None, output_d = None, tie_weights = False, emb_dropout = None, n_layers = 3, pad_value = 0):
        super(FinalModel, self).__init__()

        if (emb_dropout is not None):
          self.embd = EmbeddingDropout(num_classes, emb_dim, pad_value, dropout = emb_dropout)
        else:
          self.embd = nn.Embedding(num_classes, emb_dim, pad_value)

        self.embd.weight.data.uniform_(-0.1, 0.1)
        
        self.lstm = nn.LSTM(emb_dim, hid_dim, n_layers, bidirectional=False)
        self.classifier = nn.Linear(hid_dim, num_classes)
        self.n_layers = n_layers
        self.hidden_size = hid_dim
        self.pad_value = pad_value
        self.init_weights()

        self.weight_d = weight_d
        self.input_d = input_d
        self.output_d = output_d
        self.tie_weights = tie_weights
        self.emb_dropout = emb_dropout

        hh_weights = []

        if weight_d is not None:
          for i in range(n_layers):
              hh_weights.append('weight_hh_l{}'.format(i))
          self.lstm = WeightDrop(self.lstm, hh_weights, weight_d)  #apply weight dropout to the hidden to hidden weights
          
        if (input_d is not None) and (output_d is not None):
          self.input_drop = LockedDropout(input_d)
          self.output_drop = LockedDropout(output_d)
        elif (input_d != output_d):
          raise ValueError("Input and output dropout must be both set or both None")

        if tie_weights:
            if hid_dim != emb_dim:
                raise ValueError('When using tie_weights, hid_dim must be equal to emb_dim')
            self.classifier.weight = self.embd.weight


    def init_weights(self):
        for p in self.parameters():
            if p.data.ndimension() >= 2:
                nn.init.xavier_uniform_(p.data)
            else:
                nn.init.zeros_(p.data)

    def forward(self, input, lengths, hidden = None):        
        total_length = input.shape[1]
        
        embedding = self.embd(input)  # embedding lookup, with embedding dropout when enabled
        
        if (self.input_d is not None) and (self.output_d is not None): 
          embedding = self.input_drop(embedding)

        packed_input = pack_padded_sequence(embedding, lengths, batch_first=True, enforce_sorted=False)
        packed_output, hidden = self.lstm(packed_input)
        output, output_lengths = pad_packed_sequence(packed_output, batch_first=True, padding_value=self.pad_value, total_length=total_length)

        if (self.input_d is not None) and (self.output_d is not None): 
          output = self.output_drop(output)

        output = self.classifier(output)
        # Reshape output to (batch_size*sequence_length, num_classes)
        output = output.reshape(output.size(0)*output.size(1), output.size(2))

        return output, hidden

In [None]:
import torch.optim as optim

def main_finalmodel(hid_size, emb_size, weight_d, input_d, output_d, tie_weights, emb_dropout, ntsgd, n_layers, lr, weight_decay, clip, pad_value, epochs):
    
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    
    data_train = read_file('dataset/ptb.train.txt')
    vocab = Lang(data_train)

    train_dataset = Dataset('dataset/ptb.train.txt', vocab)
    test_dataset = Dataset('dataset/ptb.test.txt', vocab)
    valid_dataset = Dataset('dataset/ptb.valid.txt', vocab)

    train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, collate_fn=collate_fn)
    valid_loader = DataLoader(valid_dataset, batch_size=16, collate_fn=collate_fn)
    test_loader = DataLoader(test_dataset, batch_size=16, collate_fn=collate_fn)

    vocab_len = len(vocab.word2id)

    model = FinalModel(vocab_len, emb_size, hid_size, weight_d, input_d, output_d, tie_weights, emb_dropout, n_layers, pad_value).to(device)

    optimizer = optim.SGD(model.parameters(), lr=lr, weight_decay=weight_decay)
    loss_function = nn.CrossEntropyLoss(ignore_index=pad_value)
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min', factor=0.5, patience=2, verbose=True)

    wandb.init(project="NLU_project",
               name = "LSTM no Variational dropout",
               config = {
               "learning_rate": lr,
               "architecture": "FinalModel",
               "hidden_size": hid_size,
               "embedding_size": emb_size,
               "weight_dropout": weight_d,
               "input_dropout": input_d,
               "output_dropout": output_d,
               "tie_weights": tie_weights,
               "embedding_dropout": emb_dropout,
               "num_layers": n_layers,
               "epochs": epochs},
               notes = "LSTM no Variational dropout"
               )
    
    train(model, optimizer, loss_function, train_loader, test_loader, valid_loader, epochs, device, clip, scheduler, ntsgd)

In [None]:
main_finalmodel(hid_size = 500, emb_size = 500, weight_d = 0.5, input_d = None, output_d = None, tie_weights = True, emb_dropout = 0.1, ntsgd=True, n_layers = 3, lr = 10.0, weight_decay = 1.2e-6, clip=1, pad_value = 0, epochs = 60)

# Transformer

In [None]:
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

def transformer_init(model_id, device):
    model = GPT2LMHeadModel.from_pretrained(model_id).to(device)
    tokenizer = GPT2TokenizerFast.from_pretrained(model_id)
    
    return model, tokenizer

In [None]:
from datasets import load_dataset

def get_data(data_name, tokenizer):
    ptb_train = load_dataset(data_name, split="train")
    ptb_test = load_dataset(data_name, split="test")
    ptb_valid = load_dataset(data_name, split="validation")

    # Join sentences with a newline so that the last word of a sentence
    # is not glued to the first word of the next one
    ptb_train_encodings = tokenizer("\n".join(ptb_train["sentence"]), return_tensors="pt")
    ptb_test_encodings = tokenizer("\n".join(ptb_test["sentence"]), return_tensors="pt")
    ptb_valid_encodings = tokenizer("\n".join(ptb_valid["sentence"]), return_tensors="pt")

    return ptb_train_encodings, ptb_test_encodings, ptb_valid_encodings

In [None]:
import torch
from tqdm import tqdm

def train_step_transformer(model, optimizer, ptb_train_encodings, device, stride):
    model.train()
    max_length = model.config.n_positions
    seq_len = ptb_train_encodings.input_ids.size(1)

    nlls = []
    prev_end_loc = 0
    for begin_loc in tqdm(range(0, seq_len, stride)):
        end_loc = min(begin_loc + max_length, seq_len)
        trg_len = end_loc - prev_end_loc  # may differ from stride on the last window
        input_ids = ptb_train_encodings.input_ids[:, begin_loc:end_loc].to(device)
        target_ids = input_ids.clone()
        # Tokens already scored in the previous window only serve as context
        target_ids[:, :-trg_len] = -100

        optimizer.zero_grad()

        outputs = model(input_ids, labels=target_ids)
        neg_log_likelihood = outputs.loss * trg_len

        neg_log_likelihood.backward()
        optimizer.step()

        # Detach so the computation graph of each window is not kept in memory
        nlls.append(neg_log_likelihood.detach())

        prev_end_loc = end_loc
        if end_loc == seq_len:
            break

    # Perplexity over the whole training text, computed after the last window
    ppl = torch.exp(torch.stack(nlls).sum() / end_loc)
    return ppl

In [None]:
import torch
from tqdm import tqdm

def test_step_transformer(model, ptb_test_encodings, device, stride):
    model.eval()  # disable dropout during evaluation
    max_length = model.config.n_positions
    seq_len = ptb_test_encodings.input_ids.size(1)

    nlls = []
    prev_end_loc = 0
    for begin_loc in tqdm(range(0, seq_len, stride)):
        end_loc = min(begin_loc + max_length, seq_len)
        trg_len = end_loc - prev_end_loc
        input_ids = ptb_test_encodings.input_ids[:, begin_loc:end_loc].to(device)
        target_ids = input_ids.clone()
        target_ids[:, :-trg_len] = -100

        with torch.no_grad():
            outputs = model(input_ids, labels=target_ids)
            neg_log_likelihood = outputs.loss * trg_len

        nlls.append(neg_log_likelihood)

        prev_end_loc = end_loc
        if end_loc == seq_len:
            break
    ppl = torch.exp(torch.stack(nlls).sum() / end_loc)
    
    return ppl
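
The windowing logic of the evaluation loop can be inspected without loading a model. The sketch below lists, for a hypothetical sequence length, the `(begin_loc, end_loc, trg_len)` triples produced by the same loop; the helper name `sliding_windows` and the toy sizes are assumptions for illustration.

```python
def sliding_windows(seq_len, max_length, stride):
    """Return (begin_loc, end_loc, trg_len) for each evaluation window."""
    windows = []
    prev_end_loc = 0
    for begin_loc in range(0, seq_len, stride):
        end_loc = min(begin_loc + max_length, seq_len)
        # Only tokens not scored by a previous window contribute to the loss
        trg_len = end_loc - prev_end_loc
        windows.append((begin_loc, end_loc, trg_len))
        prev_end_loc = end_loc
        if end_loc == seq_len:
            break
    return windows

# Toy sizes: 10 tokens, context of 4 tokens, stride of 2
windows = sliding_windows(seq_len=10, max_length=4, stride=2)
for window in windows:
    print(window)
```

When `stride` equals `max_length`, as with the value 1024 used in this notebook, the windows do not overlap, so the first tokens of each window are predicted with little context. A smaller stride gives more context per token at the cost of more forward passes.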

In [None]:
from pytorch_transformers import WEIGHTS_NAME, CONFIG_NAME

def finetuning(model, tokenizer, optimizer, ptb_train_encodings, ptb_test_encodings, ptb_valid_encodings, epochs, stride, device):
    perplexity = float('inf')
    valid_perplexity = float('inf')
    
    for e in tqdm(range(epochs)):
        train_perplexity = train_step_transformer(model, optimizer, ptb_train_encodings, device, stride)
        test_perplexity = test_step_transformer(model, ptb_test_encodings, device, stride)
        valid_perplexity = test_step_transformer(model, ptb_valid_encodings, device, stride)

        print("\nEpoch: {}".format(e+1))
        print("Train Perplexity : {}" .format(train_perplexity))
        print("Test Perplexity : {}" .format(test_perplexity))
        print("Valid Perplexity : {}" .format(valid_perplexity))

        if valid_perplexity < perplexity:
            os.makedirs("/content/gpt2_weights/", exist_ok = True)
            output_dir = "/content/gpt2_weights/"
            model_to_save = model.module if hasattr(model, 'module') else model

            output_model_file = os.path.join(output_dir, WEIGHTS_NAME)
            output_config_file = os.path.join(output_dir, CONFIG_NAME)

            torch.save(model_to_save.state_dict(), output_model_file)
            model_to_save.config.to_json_file(output_config_file)
            tokenizer.save_vocabulary(output_dir)

            perplexity = valid_perplexity
        
        wandb.log({"Epoch": e+1, "train_perplexity": train_perplexity, "test_perplexity": test_perplexity, "valid_perplexity": valid_perplexity})
    wandb.log({'best_perplexity': perplexity})
    wandb.run.summary["best_perplexity"] = perplexity
    wandb.finish()
            
    print("\nFinetuning completed, the best perplexity reached in validation is : {}".format(perplexity))

In [None]:
def main_transformer_finetuning(model_name, device, data_name, stride, lr, wd, epochs):
    model, tokenizer = transformer_init(model_name, device)
    train_data, test_data, eval_data = get_data(data_name, tokenizer)

    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=wd)

    wandb.init(project="NLU_project",
               name = "Transformer GPT2 Finetuning",
               config = {
               "learning_rate": lr,
               "architecture": "GPT2",
               "stride": stride,
               "weight_decay": wd,
               "epochs": epochs},
               notes = "run for the finetuning of GPT2"
               )

    finetuning(model, tokenizer, optimizer, train_data, test_data, eval_data, epochs, stride, device)

In [None]:
main_transformer_finetuning(model_name="gpt2", device="cuda", data_name="ptb_text_only", stride=1024, lr=2e-05, wd=0.01, epochs=3)

In [None]:
def main_tranformer_zero_shot(model_name, device, data_name, stride):
    model, tokenizer = transformer_init(model_name, device)
    train_data, test_data, eval_data = get_data(data_name, tokenizer)

    test_perplexity = test_step_transformer(model, test_data, device, stride)
    valid_perplexity = test_step_transformer(model, eval_data, device, stride)

    wandb.init(project="NLU_project",
               name = "Transformer GPT2 Zero-shot",
               config = {
               "architecture": "GPT2",
               "stride": stride
               },
               notes = "run for zero-shot GPT2"
               )

    wandb.log({'best_perplexity': valid_perplexity})
    wandb.run.summary["best_perplexity"] = valid_perplexity
    wandb.finish()

    print("\nZero-shot GPT2 test perplexity on PTB is: {}".format(test_perplexity))
    print("\nZero-shot GPT2 validation perplexity on PTB is: {}".format(valid_perplexity))


In [None]:
main_tranformer_zero_shot(model_name="gpt2", device="cuda", data_name="ptb_text_only", stride=1024)

In [None]:

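def prediction_transformer(model_weights, tokenizer_weights, sentence_prompt, max_tokens):
    model = GPT2LMHeadModel.from_pretrained(model_weights)
    tokenizer = GPT2TokenizerFast.from_pretrained(tokenizer_weights)

    # Tokenize the prompt; the attention mask is passed explicitly to generate()
    inputs = tokenizer(sentence_prompt, return_tensors="pt")
    # Sample with top-k and top-p (nucleus) filtering; GPT-2 has no pad token, so EOS is used instead
    outputs = model.generate(inputs.input_ids, attention_mask=inputs.attention_mask,
                             max_new_tokens=max_tokens, do_sample=True, top_k=50, top_p=0.95,
                             pad_token_id=tokenizer.eos_token_id)

    final_sentence = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    print(final_sentence)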

prediction_transformer(model_weights="gpt2", tokenizer_weights="gpt2", sentence_prompt="the new plan is", max_tokens=15)
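
The `top_k` and `top_p` arguments passed to `generate` above restrict sampling to the most likely next tokens. The following is a conceptual sketch of that filtering on a toy distribution, not the Hugging Face implementation; `top_k_top_p_filter`, `sample` and the example probabilities are made up for illustration.

```python
import random

def top_k_top_p_filter(probs, top_k=50, top_p=0.95):
    """Keep the top_k most likely tokens, then the smallest prefix of them
    whose renormalized cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p / total
        if cumulative >= top_p:
            break
    norm = sum(p for _, p in kept)
    return {token: p / norm for token, p in kept}

def sample(probs, rng=random):
    """Draw one token according to the (already filtered) distribution."""
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

# Toy next-token distribution (illustrative values only).
next_token_probs = {"to": 0.5, "a": 0.3, "the": 0.15, "zebra": 0.04, "qux": 0.01}
filtered = top_k_top_p_filter(next_token_probs, top_k=4, top_p=0.9)
print(filtered)
print(sample(filtered))
```

In this toy case the two unlikely tokens are removed before sampling, which is the intended effect: the tail of the distribution can no longer be drawn.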

# Model Analysis

Function that counts the trainable parameters of the LSTM models

In [None]:
from prettytable import PrettyTable

def count_parameters(model):
  # Tabulate the number of elements of every trainable parameter tensor
  table = PrettyTable(["Mod name", "Parameters Listed"])
  t_params = 0
  for name, parameter in model.named_parameters():
    if not parameter.requires_grad:
      continue
    param = parameter.numel()
    table.add_row([name, param])
    t_params += param
  print(table)
  print(f"Total trainable parameters: {t_params}")

data_train = read_file('dataset/ptb.train.txt')
vocab = Lang(data_train)
count_parameters(BaselineLSTM(num_classes=len(vocab.word2id), emb_dim=300, hid_dim=300, n_layers=1, pad_value=0))
count_parameters(FinalModel(num_classes=len(vocab.word2id), emb_dim=500, hid_dim=500, weight_d=0.5, input_d=0.4, output_d=0.4, tie_weights=True, emb_dropout=0.1, n_layers=3, pad_value=0))
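
As a sanity check on the tables printed above, the parameter count can also be estimated in closed form. The sketch below assumes a model made of an embedding layer, a PyTorch-style `nn.LSTM` (four gates, two bias vectors per layer) and a linear output layer with bias; the actual `BaselineLSTM` and `FinalModel` may differ, and the vocabulary size of 10,000 is a placeholder rather than `len(vocab.word2id)`. The helper `lstm_lm_param_count` is hypothetical.

```python
def lstm_lm_param_count(vocab_size, emb_dim, hid_dim, n_layers, tie_weights=False):
    """Closed-form parameter count for embedding + LSTM + linear decoder."""
    embedding = vocab_size * emb_dim
    lstm = 0
    for layer in range(n_layers):
        in_dim = emb_dim if layer == 0 else hid_dim
        # 4 gates, each with input weights, recurrent weights and two biases
        lstm += 4 * (in_dim * hid_dim + hid_dim * hid_dim + 2 * hid_dim)
    # With weight tying the decoder matrix is the embedding matrix,
    # so it is counted only once (this requires emb_dim == hid_dim).
    output_weight = 0 if tie_weights else hid_dim * vocab_size
    output_bias = vocab_size
    return embedding + lstm + output_weight + output_bias

# Placeholder vocabulary size; the real value comes from the Lang object.
print(lstm_lm_param_count(10_000, 300, 300, n_layers=1))
print(lstm_lm_param_count(10_000, 500, 500, n_layers=3, tie_weights=True))
```

Under these assumptions most of the parameters of the single-layer model sit in the embedding and decoder matrices, which is why weight tying removes a large share of them.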

Function that generates a continuation of a prompt with a trained LSTM model, as a qualitative check of what the model has learned

In [None]:
def prediction_lstm(model, weights_path, sentence, vocab, num_words_to_predict, device):
  # Load the weights onto the requested device (also works if they were saved on a GPU)
  model.load_state_dict(torch.load(weights_path, map_location=device))
  model = model.to(device)
  model.eval()  # disable dropout while generating

  # Encode the prompt as a batch that contains a single sequence
  sentence_words = read_sentence(sentence)
  words = [vocab.word2id[word] for word in sentence_words]
  input_ids = torch.LongTensor([words]).to(device)
  length = [len(words)]

  with torch.no_grad():
      for _ in range(num_words_to_predict):
          output, hidden = model(input_ids, length)

          # Greedy decoding: take the arg-max prediction for the last position
          predicted_id_word = output.max(dim=1).indices[-1].tolist()
          predicted_word = vocab.id2word[predicted_id_word]
          if predicted_word == "<eos>":
            break

          # Append the predicted word (on the same device) and feed the longer sequence back
          next_id = torch.tensor([[vocab.word2id[predicted_word]]], dtype=torch.long, device=device)
          input_ids = torch.cat((input_ids, next_id), 1)
          length = [input_ids.shape[1]]

  final_sentence = [vocab.id2word[word_id.tolist()] for word_id in input_ids.flatten()]
  print(final_sentence)

In [None]:
data_train = read_file('dataset/ptb.train.txt')
vocab = Lang(data_train)
device = "cpu"
model = BaselineLSTM(len(vocab.word2id), 300, 300, 1, 0)
prediction_lstm(model, "/content/lstm_weights/baseline_weights.pt", "once upon a time", vocab, 15, device=device)

# Dataset Analysis

In [None]:
class Dataset_analysis():
  def __init__(self, dataset_path, data):
    self.corpus = self.read_document(dataset_path)
    self.num_words = self.num_word_corpus(self.corpus)
    self.num_sentence = self.num_sentence_corpus(self.corpus)
    self.num_words_sent = self.num_word_sentence(self.corpus)
    self.num_chars_sent = self.num_char_sentence(self.corpus)
    self.num_chars_word = self.num_char_words(self.corpus)
    self.analysis(data)

  def read_document(self, file):
    with open(file, 'r') as f:
        data = f.read()
    return data

  def num_word_corpus(self, doc):
    return len(doc.split())
  
  def num_sentence_corpus(self, doc):
    return len(doc.splitlines())
  
  def num_word_sentence(self, doc):
    return len(doc.split())/len(doc.splitlines())
  
  def num_char_sentence(self, doc):
    return len(doc)/len(doc.splitlines())

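  def num_char_words(self, doc):
    # Average word length, counting only the characters inside tokens (no whitespace)
    words = doc.split()
    return sum(len(word) for word in words)/len(words)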

  def analysis(self, data):
    print(f"Number of words in {data} dataset: {self.num_words}")
    print(f"Number of sentences in {data} dataset: {self.num_sentence}")
    print(f"Average words per sentence in {data} dataset: {self.num_words_sent}")
    print(f"Average chars per word in {data} dataset: {self.num_chars_word}")
    print(f"Average chars per sentence in {data} dataset: {self.num_chars_sent}\n")


In [None]:
dataset_train = Dataset_analysis('/content/dataset/ptb.train.txt', 'train')
dataset_test = Dataset_analysis('/content/dataset/ptb.test.txt', 'test')
dataset_valid = Dataset_analysis('/content/dataset/ptb.valid.txt', 'validation')

In [None]:
def unk_n_hashtag_occ(dataset, text):
    text = text.split()
    counter = Counter(text)

    print("For {} set occurrences of the token unk are {}".format(dataset, counter["<unk>"]))
    print("For {} set occurrences of the token N are {}".format(dataset, counter["N"]))
    print("For {} set occurrences of the token $ are {}\n".format(dataset, counter["$"]))    

In [None]:
unk_n_hashtag_occ("train", dataset_train.corpus)
unk_n_hashtag_occ("test", dataset_test.corpus)
unk_n_hashtag_occ("validation", dataset_valid.corpus)
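
Raw counts are hard to compare across splits of different size, so it can help to express them as a fraction of all tokens. This is a small sketch on a made-up two-line corpus; `placeholder_rates` is a hypothetical helper and the toy text is not taken from PTB.

```python
from collections import Counter

def placeholder_rates(text, placeholders=("<unk>", "N", "$")):
    """Return the share of tokens taken by each placeholder symbol."""
    tokens = text.split()
    if not tokens:
        return {}
    counts = Counter(tokens)
    return {tok: counts[tok] / len(tokens) for tok in placeholders}

# Toy corpus in the PTB style (lower-cased, numbers replaced by N).
toy = "the <unk> rose N percent\nshares fell $ N a share\n"
for token, rate in placeholder_rates(toy).items():
    print(f"{token}: {rate:.2%}")
```

The same function could be called on `dataset_train.corpus`, `dataset_test.corpus` and `dataset_valid.corpus` to compare the three splits directly.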

In [None]:
import seaborn as sns
from nltk.corpus import stopwords
import nltk
from collections import Counter
import matplotlib.pyplot as plt

nltk.download('stopwords')

def plot_top_non_stopwords_barchart(text):
    stop=set(stopwords.words('english'))
    stop.add('<unk>')
    stop.add('N')
    stop.add('$')
    text = text.split()

    counter=Counter(text)
    most=counter.most_common()
    x, y=[], []
    for word,count in most[:40]:
        if (word not in stop):
            x.append(word)
            y.append(count)
    fig, ax = plt.subplots(figsize = (10, 8))
    sns.barplot(x=y,y=x, ax=ax)

In [None]:
plot_top_non_stopwords_barchart(dataset_train.corpus)
plot_top_non_stopwords_barchart(dataset_test.corpus)
plot_top_non_stopwords_barchart(dataset_valid.corpus)

In [None]:
import matplotlib.pyplot as plt

def plot_top_stopwords_barchart(text):
    stop=set(stopwords.words('english'))

    text = text.split()  # convert to a list of tokens

    counter=Counter(text)
    most=counter.most_common()
    x, y=[], []
    for word,count in most[:40]:
        if (word in stop):
            x.append(word)
            y.append(count)

    fig, ax = plt.subplots(figsize = (10, 8))
    x = x[0:8]
    y = y[0:8]
    sns.barplot(x=y,y=x, ax=ax)

In [None]:
plot_top_stopwords_barchart(dataset_train.corpus)
plot_top_stopwords_barchart(dataset_test.corpus)
plot_top_stopwords_barchart(dataset_valid.corpus)

In [None]:
import matplotlib.pyplot as plt

def average_sentence_length_plot(doc):
    sentence_length = []
    for sentence in doc.splitlines():
        sentence_length.append(len(sentence.split()))

    fig, ax = plt.subplots(figsize = (11.7, 8.27))
    sns.histplot(sentence_length, ax=ax)

In [None]:
average_sentence_length_plot(dataset_train.corpus)
average_sentence_length_plot(dataset_test.corpus)
average_sentence_length_plot(dataset_valid.corpus)

In [None]:
import matplotlib.pyplot as plt

def average_word_length_plot(doc):
    word_length = []
    for word in doc.split():
        word_length.append(len(word))

    fig, ax = plt.subplots(figsize = (11.7, 8.27))
    sns.histplot(word_length, ax = ax)

In [None]:
average_word_length_plot(dataset_train.corpus)
average_word_length_plot(dataset_test.corpus)
average_word_length_plot(dataset_valid.corpus)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import text

def get_top_ngrams(corpus, stop_words, n=2):
    # CountVectorizer expects the stop words as a list. Its default tokenizer
    # lower-cases the text, strips punctuation (so "<unk>" becomes "unk")
    # and drops single-character tokens such as "N" and "$"
    sw = list(text.ENGLISH_STOP_WORDS.union(stop_words))

    vec = CountVectorizer(ngram_range=(n, n), stop_words=sw).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:10]

def plot_top_ngrams(corpus, stop_words, n=2):
    top_ngrams = get_top_ngrams(corpus, stop_words, n)
    x, y = map(list, zip(*top_ngrams))
    fig, ax = plt.subplots(figsize = (11.7, 8.27))
    sns.barplot(x=y, y=x, ax=ax)

In [None]:
plot_top_ngrams([dataset_train.corpus], ['unk'], n=3)
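
Because the whole corpus is passed above as a single document, some of the counted n-grams may span two consecutive sentences. A possible alternative, sketched here with the standard library only, counts n-grams line by line; `top_ngrams_per_sentence` and the toy text are hypothetical and are not part of the original analysis.

```python
from collections import Counter

def top_ngrams_per_sentence(text, n=3, stop_words=frozenset(), k=10):
    """Count n-grams inside each line so that none crosses a sentence boundary."""
    counts = Counter()
    for line in text.splitlines():
        tokens = [t for t in line.split() if t not in stop_words]
        # zip over n shifted copies of the token list yields the n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(k)

# Toy two-sentence corpus (illustrative only).
toy = "new york stock exchange\nthe new york stock exchange\n"
for ngram, count in top_ngrams_per_sentence(toy, n=3, stop_words={"the"}, k=2):
    print(" ".join(ngram), count)
```

Unlike `CountVectorizer`, this version splits on whitespace only, so tokens such as `<unk>` and `N` are kept as they are unless they are listed in `stop_words`.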