# Exercise 1 - Question 1 (Language Models)

In [None]:
#Requirements: to be installed in the virtual environment
# Elad Sofer 312124662 and Tomer Shaked 315822221

!pip install torch
!pip install transformers
!pip install datasets
!pip install numpy


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


<a name="ngram_lm"></a>
## Task A: Data Exploration


Let's use an unsupervised dataset (raw corpus) to evaluate language models' perplexity. We use Huggingface's `datasets` library to download needed datasets.


Here we use the `Penn Treebank` dataset, featuring a million words of 1989 Wall Street Journal material. The rare words in this version are already replaced with `<unk>` token. The numbers are also replaced with a special token. This token replacement helps us to end up with a more reasonable vocabulary size to work with.


In [None]:
import numpy as np
import datasets
from datasets import load_dataset
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

ptb_dataset = load_dataset("ptb_text_only", split="train")

# splitting dataset in train/test (to be later used for language model evaluation)
ptb_dataset = ptb_dataset.train_test_split(test_size=0.2, seed=1)
ptb_train, ptb_test = ptb_dataset['train'], ptb_dataset['test']



#### Let's have a look at a few samples of the training dataset (and also the structure of the dataset)

In [None]:
print(f"{ptb_train[0]}\n\n{ptb_train[1]}\n\n{ptb_train[2]}")

{'sentence': "a former executive agreed that the departures do n't reflect major problems adding if you see any company that grows as fast as reebok did it is going to have people coming and going"}

{'sentence': 'with talk today of a second economic <unk> in west germany east germany no longer can content itself with being the economic star in a loser league'}

{'sentence': 'transportation secretary sam skinner who earlier fueled the anti-takeover fires with his <unk> attacks on foreign investment in u.s. carriers now says the bill would further <unk> the jittery capital markets'}


During generation with a given language model, we often need to have a `<stop>` token in our vocabulary to terminate the generation of a given sentence/paragraph. In this dataset, every sample is a sentence, and the `<stop>` token should be added to the end of every sample (i.e., end of sentence).

#### Create a new train/test dataset starting from `ptb_train` and `ptb_test` that has a `<stop>` at the end of each sentence. (Note: do not change the structure of the datasets objects, and just change the respective sentences as discussed).
Hint: use the `.map()` functionality of the `datasets` package (read more [here](https://huggingface.co/docs/datasets/process#map])).

In [None]:
from copy import deepcopy

def add_stop_token(input_sample: dict):
    '''
    args:
        input_sample: a dict representing a sample of the dataset. (look above for the dict struture)
    output:
        modified_sample: modified dict adding <stop> at the end of each sentence.
    '''
    # YOUR CODE HERE

    modified_sample = deepcopy(input_sample)
    modified_sample['sentence']+=' <stop>'
    return modified_sample


ptb_cleaned_train = ptb_train.map(add_stop_token)
ptb_cleaned_test = ptb_test.map(add_stop_token)



For the both `ptb_train` and `ptb_test` datasets, filter out every sample that has less than 3 tokens. it will help remove very short sentences that are not very helpful for training/evaluating a langugage model.

Hint: use `.filter()` functionality of the `datasets` package (read more [here](https://huggingface.co/docs/datasets/process#select-and-filter)).

In [None]:
print(f"{ptb_train[0]}\n\n{ptb_train[1]}\n\n{ptb_train[2]}")
cleaned_train_dataset = ptb_cleaned_train.filter(lambda s: True if len(s['sentence'].split(' ')) >= 3 else False)
cleaned_test_dataset = ptb_cleaned_test.filter(lambda s: True if len(s['sentence'].split(' ')) >= 3 else False)



{'sentence': "a former executive agreed that the departures do n't reflect major problems adding if you see any company that grows as fast as reebok did it is going to have people coming and going"}

{'sentence': 'with talk today of a second economic <unk> in west germany east germany no longer can content itself with being the economic star in a loser league'}

{'sentence': 'transportation secretary sam skinner who earlier fueled the anti-takeover fires with his <unk> attacks on foreign investment in u.s. carriers now says the bill would further <unk> the jittery capital markets'}


#### What are the 10 most frequent tokens in this dataset? Can you spot the token used to replace the numbers in this dataset? How are rare tokens replaced in this dataset?

In [None]:
# YOUR CODE HERE

from collections import Counter
def count_appearnces(counter,dataset):
  for s in dataset:
    counter.update(s['sentence'].split(' '))
  return counter

c_train = count_appearnces(Counter(),cleaned_train_dataset)
c_test = count_appearnces(Counter(),cleaned_test_dataset)

The 'N' token is used to represent numbers as can be seen by the following sentence. We can also notice that 'N' appeared 23912 times in the text which suggests it is a general number replacement since usually 'N' is not a common word in english.

In [None]:
print(f"{ptb_train[3]}")
print(c_train['N'])

{'sentence': "separately the company 's board adopted a proposal to <unk> its N shareholder rights plan further <unk> the company from takeover"}
25966


In [None]:
sorted_c = sorted(c_train.items(), key=lambda x: x[1])
print("The most 10 common tokens are:\n{0}".format(sorted_c[-10:][::-1]))

The most 10 common tokens are:
[('the', 40616), ('<unk>', 35888), ('<stop>', 33539), ('N', 25966), ('of', 19459), ('to', 18896), ('a', 16901), ('in', 14473), ('and', 14013), ("'s", 7850)]


## Task B: Fixed-Window Neural Language Models <a name='fixed_window_neural_lm'></a>

This language model take as input a constant number of tokens, and then outputs a probability distribution for the next token. In this section, we assume the underlying model is a FeedForward Network (FFN) with a single hidden layer. This model doesn't have the sparsity issue of N-gram language models, but is always limited to a fixed window of tokens.

In this section, we don't include the training of the model but rather we use a pretrained model on the same training dataset. We evaluate the language model over the `ptb_test` dataset, to show the power of neural language models, when compared to N-gram language models.

More importantly, we use PyTorch modules in this section, so that you get more familiar with its capabilities. Throughout this exercise, we use a `window_size=3` for this model.



Let's first create a dataset of all consecutive tokens of length `window_size` from the `ptb_train` dataset. you can read more about PyTorch datasets and how to create a custom dataset  [here](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html#creating-a-custom-dataset-for-your-files).

In [None]:
from torch.utils.data import Dataset, DataLoader

window_size = 3
vocabulary_size = 10000
word_emb_dim = 100
hidden_dim = 100

class FixedWindowDataset(Dataset):
    # read more about custom datasets at https://pytorch.org/tutorials/beginner/basics/data_tutorial.html
    def __init__(self,
                 train_dataset: datasets.arrow_dataset.Dataset,
                 test_dataset: datasets.arrow_dataset.Dataset,
                 window_size: int,
                 vocabulary_size: int
                ):
        self.prepared_train_dataset = self.prepare_fixed_window_lm_dataset(train_dataset, window_size + 1)
        self.prepared_test_dataset = self.prepare_fixed_window_lm_dataset(test_dataset, window_size + 1)

        dataset_vocab = self.get_dataset_vocabulary(train_dataset)
        # defining a dictionary that simply maps tokens to their respective index in the embedding matrix
        self.word_to_index = {word: idx for idx,word in enumerate(dataset_vocab)}
        self.index_to_word = {idx: word for idx,word in enumerate(dataset_vocab)}
        assert vocabulary_size > len(dataset_vocab) , f"The dataset vocab size is {len(dataset_vocab)}!"

    def __len__(self):
        return len(self.prepared_train_dataset)

    def get_encoded_test_samples(self):
        all_token_lists = [sample.split() for sample in self.prepared_test_dataset]
        all_token_ids = [[self.word_to_index.get(word, self.word_to_index["<unk>"])
                          for word in token_list[:-1]]
                         for token_list in all_token_lists
                        ]
        all_next_token_ids = [self.word_to_index.get(token_list[-1], self.word_to_index["<unk>"])
                              for token_list in all_token_lists]
        return torch.tensor(all_token_ids).to(device), torch.tensor(all_next_token_ids).to(device)

    def __getitem__(self, idx):
        # here we need to transform the data to the format we expect at the model input
        token_list = self.prepared_train_dataset[idx].split()
        # having a fallback to <unk> token if an unseen word is encoded.
        token_ids = [self.word_to_index.get(word, self.word_to_index["<unk>"]) for word in token_list[:-1]]
        next_token_id = self.word_to_index.get(token_list[-1], self.word_to_index["<unk>"])
        return torch.tensor(token_ids).to(device), torch.tensor(next_token_id).to(device)

    def decode_idx_to_word(self, token_id):
        return [self.index_to_word[id_.item()] for id_ in token_id]

    def get_dataset_vocabulary(self, train_dataset: datasets.arrow_dataset.Dataset):
        vocab = sorted(set(" ".join([sample["sentence"] for sample in train_dataset]).split()))
        # we also add a <start> token to include initial tokens in the sentences in the dataset
        vocab += ["<start>"]
        return vocab

    @staticmethod
    def prepare_fixed_window_lm_dataset(target_dataset: datasets.arrow_dataset.Dataset,
                                        window_size: int):
        '''
        Please note that for the very first tokens, they will be added like "<start> <start> Token#1".
        args:
            target_dataset: the target dataset where its consecutive tokens of length 'window_size' should be extracted
            window_size: the window size for the language model
        output:
            prepared_dataset: a list of strings each containing 'window_size' tokens.
        '''

        prepared_dataset = []
        for s in target_dataset:
          prevs = ['<start>']*(window_size-1)
          for w in s['sentence'].split(' '):
            first_words = " ".join(prevs)
            prepared_dataset.append('{0} {1}'.format(first_words, w))
            prevs = prevs[1:] + [w]

        return prepared_dataset



In [None]:
fixed_window_dataset = FixedWindowDataset(ptb_train, ptb_test, window_size, vocabulary_size)

# let's create a simple dataloader for this dataset
train_dataloader =  DataLoader(fixed_window_dataset, batch_size=8, shuffle=True)


Now, let's define the underlying PyTorch model for the language model. You can read more about PyTorch models [here](https://pytorch.org/tutorials/beginner/introyt/modelsyt_tutorial.html).

**Note**: Here in the forward pass, we compute the negative log-likelihood after passing through the FFN layers. Here we use `torch.nn.LogSoftmax`, as it's numerically more stable than doing seperately `softmax` followed by taking its logarithm.

In [None]:
import torch.optim as optim

class Fixed_window_language_model(torch.nn.Module):
    def __init__(self, emb_dim, hidden_dim, window_size, vocab_size=10000):
        super().__init__()

        self.window_size = window_size
        self.emb_dim = emb_dim
        self.word_embeddings = torch.nn.Embedding(vocab_size, emb_dim) # word embeddings
        self.linear1 = torch.nn.Linear(window_size * emb_dim, hidden_dim) # first linear layer
        self.activation_func = torch.tanh # the activation function
        self.linear2 = torch.nn.Linear(hidden_dim, vocab_size) # second linear layer

        self.log_softmax = torch.nn.LogSoftmax(dim=1)
        self.softmax = torch.nn.Softmax(dim=1)
        self.criterion = torch.nn.NLLLoss()

    def forward(self, input_ids, labels, softmax=False):
        inputs_embeds = self.word_embeddings(input_ids)
        concat_input_embed = inputs_embeds.reshape(-1, self.emb_dim * self.window_size)
        hidden_state = self.activation_func( self.linear1(concat_input_embed) )

        logits = self.log_softmax(self.linear2(hidden_state))
        loss = self.criterion(logits, labels)

        return loss

In [None]:
import torch.optim as optim

class FixedPredictionModel(Fixed_window_language_model):
    def __init__(self, emb_dim, hidden_dim, window_size, vocab_size=10000):
        super().__init__(emb_dim, hidden_dim, window_size)
        self.softmax = torch.nn.Softmax(dim=1)

    def predict(self, input_ids, labels):
        inputs_embeds = self.word_embeddings(input_ids)
        concat_input_embed = inputs_embeds.reshape(-1, self.emb_dim * self.window_size)
        hidden_state = self.activation_func( self.linear1(concat_input_embed) )
        out = self.linear2(hidden_state)
        Px = self.softmax(out)
        return Px

Now let's see how easy it is to train a model with PyTorch! (we provide a trained model in the cell after train, so that you can just start using the model without going through the time-consuming training)

In [None]:
# defining the model
# model_fixed_window = Fixed_window_language_model(emb_dim=word_emb_dim, hidden_dim=hidden_dim,
#                                                  window_size=window_size, vocab_size=vocabulary_size).to(device)

# defining the model
model_fixed_window = FixedPredictionModel(emb_dim=word_emb_dim, hidden_dim=hidden_dim,
                                                 window_size=window_size, vocab_size=vocabulary_size).to(device)


# defining the optimizer
optimizer = optim.SGD(model_fixed_window.parameters(),
                      lr=0.005,
                      momentum=0.9)

In [None]:
def train_fixed_window():
  for epoch in range(2):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(train_dataloader):
        # get the inputs; data is a tuple of (context, target)
        context, target = data

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        loss = model_fixed_window(context, target)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 5000 == 4999. :    # print every 5000 mini-batches
            print(f'[{epoch + 1}, {i + 1:5d}] loss: {running_loss / 5000:.3f}')
            running_loss = 0.0

# train_fixed_window()
print('Finished Training')

# saving the trained model
# torch.save(model_fixed_window.state_dict(), "fixed_window_model.pt")

Finished Training


We provide a trained model, so that you can start using it right away

In [None]:
fixed_window_checkpoint_file = "fixed_window_model.pt"
model_fixed_window.load_state_dict(torch.load(fixed_window_checkpoint_file))

<All keys matched successfully>

In [None]:
# context and 'target' ids (target is the next word after the context)
test_token_ids, test_target_ids = fixed_window_dataset.get_encoded_test_samples()

We now have the `test_token_ids`, `test_target_ids` tensors for the test dataset. The `test_token_ids` are the context ids and `test_target_ids` are the respective **next token** (a.k.a. target here) for these contexts.
#### Using the trained model, implement a function that can output the loss for the discussed test dataset. How can we generally decide if the model is overfitted to the train dataset or not?

In [None]:
def generate_test_dataset_loss(model: torch.nn.Module,
                               test_token_ids: torch.Tensor,
                               test_target_ids: torch.Tensor):
    '''
    args:
        model: fixed-window language model
        test_token_ids: the context ids in a single tensor.
        test_target_ids: the target ids (next token after the context) in a single tensor.
    output:
        avg_test_loss: The average loss of model over test dataset.
    '''

    batch_size = 4
    test_loss = []


    model = model.to(device)

    class MyDataset(Dataset):
      def __init__(self, input_data, labels):
          self.input_data = input_data
          self.labels = labels

      def __len__(self):
          return len(self.input_data)

      def __getitem__(self, idx):
          input_sample = self.input_data[idx]
          label = self.labels[idx]
          return input_sample, label

    # YOUR CODE HERE
    test_ds = MyDataset(test_token_ids.to(device), test_target_ids.to(device))
    test_dl = DataLoader(test_ds, batch_size=batch_size, shuffle=True)

    for idx, (inputs, labels) in enumerate(test_dl):
      loss = model(inputs, labels)
      test_loss.append(loss.item())

    batch_avg = np.array(test_loss).mean()

    # Average loss per sample
    return batch_avg/batch_size

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

test_dataset_loss = generate_test_dataset_loss(model_fixed_window, test_token_ids, test_target_ids)
print(f"Test dataset loss is {test_dataset_loss}")

Test dataset loss is 1.8215297501377739


#### Using the trained fixed-window model, implemention a function that can output entropy for a given sequence.

In [None]:
def get_seqeuence_entropy_fixed_window_lm(model: torch.nn.Module,
                                              input_sequence: str,
                                              window_size: int,
                                              word_to_idx: dict):
    '''
    Note that e.g., in order to get the first token probability, you need to pass a sequence
    like "<start> <start> <start>" (prefix padding) to the neural model. In a similar fashion, we need to pass
    "<start> <start> TOKEN#1" for getting the probability of the second token.
    args:
        model: fixed-window language model
        input_sequence: the sequence for which we want to calculate the probability
        window_size: the size of window for the language model
        word_to_idx: a mapping from words to the embedding indices (to encode tokens before being
                     passed to model). You can get this dict from 'fixed_window_dataset.word_to_index'
    output:
        sequence_entropy: the entropy for the input sequence using the trained model
    '''
    # YOUR CODE HERE
    input_dict = {'sentence': input_sequence}
    sub_sequences = FixedWindowDataset.prepare_fixed_window_lm_dataset([input_dict], window_size=4)
    entropy_sum=0
    for n_word_seq in sub_sequences:
        # pdb.set_trace()
        # here we need to transform the data to the format we expect at the model input
        token_list = n_word_seq.split()

        # having a fallback to <unk> token if an unseen word is encoded.
        token_ids = [word_to_idx.get(word, word_to_idx["<unk>"]) for word in token_list[:-1]]
        next_token_id = word_to_idx.get(token_list[-1], word_to_idx["<unk>"])

        x, y = torch.tensor(token_ids).to(device), torch.tensor([next_token_id]).to(device)
        Px = model.predict(x, y)
        entropy_sum+=-(Px*torch.log(Px)).sum().item()

    return entropy_sum/len(sub_sequences)

In [None]:
# import pdb
word_to_idx = fixed_window_dataset.word_to_index
window_size = 4
sentence = ptb_test[0]['sentence']

sentence_ent = get_seqeuence_entropy_fixed_window_lm(model_fixed_window, sentence ,window_size,word_to_idx)
print("Sentence: \"{0}\"\n has entropy of {1} with window size of {2}".format(sentence,sentence_ent, window_size))

Sentence: "jefferies group inc. said third-quarter net income fell N N to $ N million or N cents a share from $ N million or N cents a share on more shares a year earlier"
 has entropy of 4.592703431406441 with window size of 4


In [None]:
# ptb_test[0]
# fixed_window_dataset.get_encoded_test_samples()[1].shape
len(ptb_test)

8414

#### Compute the perplexity for the trained fixed-window language model over `ptb_test` dataset using the previous function.

In [None]:
entropy = 0
test_len = len(ptb_test)
print("starting perplexity calculation")
for i,test_dict in enumerate(ptb_test):
  sentence = test_dict['sentence']
  entropy += get_seqeuence_entropy_fixed_window_lm(model_fixed_window, sentence ,window_size,word_to_idx)
  if i%400 == 0:
    print(i/test_len)
avg_entropy = entropy/test_len
perplexity = 2**avg_entropy
print(f"The fixed-window model perplexity over test dataset is {perplexity}")

starting perplexity calculation
0.0
0.04753981459472308
0.09507962918944617
0.14261944378416924
0.19015925837889233
0.2376990729736154
0.2852388875683385
0.33277870216306155
0.38031851675778466
0.4278583313525077
0.4753981459472308
0.5229379605419539
0.570477775136677
0.6180175897314001
0.6655574043261231
0.7130972189208462
0.7606370335155693
0.8081768481102923
0.8557166627050155
0.9032564772997386
0.9507962918944616
0.9983361064891847
The fixed-window model perplexity over test dataset is 42.749297469868104


### Task C: RNN-based Language Model <a name='rnn_lm'></a>
To address the need for a neural architecture that can proceed with any length input (as opposed to the fixed-window model that can only process a fixed number of tokens), we implement the Recurrent Neural Network (RNN). The core idea behind is that we can apply the same weight W repeatedly.

An advatange of RNN model compared to fixed-window langauage model is that we can pass a given sentence at once, instead of passing it in many windows of size `window_size`. Moreover, the language model has the ability to look behind further that a fixed number of tokens.

 As we already did a neural model training exercise for the previous neural model, we only provide a trained LM at this section, so that you can focus only on the analysis part.

You can find the dataset structure as well as the RNN architecture in the `rnn_utils.py` file.

In [None]:

from rnn_utils import RNNDataset, RNN_language_model

class My_RNN_language_model(RNN_language_model):
  def __init__(self, vocab_size, emb_dim, hidden_dim, dropout=0.001, pad_idx = -1):
        super().__init__(vocab_size, emb_dim, hidden_dim, dropout=0.001, pad_idx = -1)
        self.softmax = torch.nn.Softmax(dim=1)

  def predict(self, context):
        context = context.t() # transposing it for RNN model
        #context = [src len, batch size]

        embedded = self.dropout(self.embedding(context))

        #embedded = [src len, batch size, emb dim]

        outputs, hidden = self.rnn(embedded)
        #outputs = [src len, batch size, hidden_dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]
        outputs = self.lm_decoder(outputs.permute(1, 0, 2))[:, :-1, :].permute(0, 2, 1)
        outputs = self.softmax(outputs)
        return outputs


vocabulary_size = 10000
word_emb_dim = 200
hidden_dim = 200

rnn_dataset = RNNDataset(ptb_train, ptb_test, vocabulary_size)

# if gpu is available, we puts the model on it
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Here we need a <pad> token for the RNN model, in order to have a batch of sequences with difference sizes
pad_idx = rnn_dataset.pad_idx # the index for <pad> token
rnn_model = My_RNN_language_model(vocab_size=vocabulary_size, emb_dim=word_emb_dim, hidden_dim=hidden_dim,
                               pad_idx=pad_idx)
rnn_model.to(device)
rnn_model.eval()

My_RNN_language_model(
  (criterion): CrossEntropyLoss()
  (embedding): Embedding(10000, 200)
  (rnn): RNN(200, 200, num_layers=4)
  (dropout): Dropout(p=0.001, inplace=False)
  (lm_decoder): Linear(in_features=200, out_features=10000, bias=True)
  (softmax): Softmax(dim=1)
)

load the model weights using the state_dict in `rnn_model.pt` file.

In [None]:
# YOUR CODE HERE
rnn_checkpoint_file = "rnn_model.pt"
rnn_model.load_state_dict(torch.load(rnn_checkpoint_file))

<All keys matched successfully>

As the training of an RNN model is time-consuming, we provide a trained language model on this dataset (`rnn_model.pt`), so that you can just analyze the model performance here.
As mentioned above, as RNN can get sequences with varying lengths, the input sequences should be padded with a special token like `<pad>`, so that we can create a batch of sentences. The output of the defined RNN model (see the architecture detail `rnn_utils.py`) is the model's entropy over the input data.

#### First get the encoded test samples of `ptb_test` dataset, and then pass these (already padded) sentences to the RNN model to get the respective entropy values. Compute the perplexity of the model and compare it with previous approaches.
**HINT**: You can use the `get_encoded_test_samples` function of `rnn_dataset` to get encoded test samples.


In [None]:
def get_seqeuence_entropy_rnn_lm(model: torch.nn.Module,
                                              input_sequence: str,
                                              window_size: int,
                                              word_to_idx: dict):
    '''
    Note that e.g., in order to get the first token probability, you need to pass a sequence
    like "<start> <start> <start>" (prefix padding) to the neural model. In a similar fashion, we need to pass
    "<start> <start> TOKEN#1" for getting the probability of the second token.
    args:
        model: fixed-window language model
        input_sequence: the sequence for which we want to calculate the probability
        window_size: the size of window for the language model
        word_to_idx: a mapping from words to the embedding indices (to encode tokens before being
                     passed to model). You can get this dict from 'fixed_window_dataset.word_to_index'
    output:
        sequence_entropy: the entropy for the input sequence using the trained model
    '''
    # YOUR CODE HERE
    input_dict = {'sentence': input_sequence}
    sub_sequences = FixedWindowDataset.prepare_fixed_window_lm_dataset([input_dict], window_size=4)
    entropy_sum=0
    for n_word_seq in sub_sequences:
        # pdb.set_trace()
        # here we need to transform the data to the format we expect at the model input
        token_list = n_word_seq.split()

        # having a fallback to <unk> token if an unseen word is encoded.
        token_ids = [word_to_idx.get(word, word_to_idx["<unk>"]) for word in token_list[:-1]]
        next_token_id = word_to_idx.get(token_list[-1], word_to_idx["<unk>"])

        x, y = torch.tensor(token_ids).to(device), torch.tensor([next_token_id]).to(device)
        Px = model.predict(x, y)
        entropy_sum+=-(Px*torch.log(Px)).sum().item()

    return entropy_sum/len(sub_sequences)
test_perplexity = -1

# YOUR CODE HERE

torch.cuda.empty_cache()
def rnn_entropy():

  entropy_sum = 0
  test_samples = rnn_dataset.get_encoded_test_samples()
  test_len = len(test_samples)
  for i, test_sample in enumerate(test_samples):

    sample = test_sample.reshape(1,-1).to(device)
    output = rnn_model.predict(sample)

    entropy_sum +=-float(torch.sum(output*torch.log(output))/output.shape[2])
    if i%500 == 0:
      print("{:.2f}".format(i/test_len))
  return entropy_sum/test_len
entropy = rnn_entropy()
test_perplexity = 2**entropy
print(f"The model perplexity is {test_perplexity}")

0.00
0.06
0.12
0.18
0.24
0.30
0.36
0.42
0.48
0.53
0.59
0.65
0.71
0.77
0.83
0.89
0.95
The model perplexity is 58.556497632025945


### Task D: MLM Transformer Language Models (Bonus Question: 10 pts) <a name='rnn_lm'></a>

We are here interested in computing the perplexity of MLM Transformer Language Models such as BERT and RoBERTa. Hoewever, the perplexity for MLM models is not well-defined (The difference with GPT models is illstrated [here](https://huggingface.co/docs/transformers/perplexity).

Instructions: First clone the following repository: https://github.com/asahi417/lmppl.
Install the requirements and follow the instructions to compute the pseudo-perplexity [(Wang and Cho, 2019)](https://aclanthology.org/W19-2304.pdf) of 'BERT-base-uncased', 'BERT-large-uncased', 'RoBERTa-base' and 'RoBERTa-large' for the sentences:
'Shelly ate the sliced banana with a fork.' and 'The fork of Shelly ate the sliced banana.'.

Which sentence gets the lowest pseudo-perplexity for each of the models? Which is the best model according to this test?
What is the relation of this test to semantic roles?


In [26]:
#!git clone https://github.com/asahi417/lmppl
!pip install lmppl
sentence1 = 'Shelly ate the sliced banana with a fork.'
sentence2 = 'The fork of Shelly ate the sliced banana.'

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [18]:
scorers = {}
scorers['RoBERTa_base_sc'] = lmppl.MaskedLM('RoBERTa-base', max_length=100)
scorers['BERT_large_uncased'] = lmppl.MaskedLM('BERT-large-uncased', max_length=100)
scorers['RoBERTa_large'] = lmppl.MaskedLM('RoBERTa-large', max_length=100)


Some weights of the model checkpoint at BERT-large-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [28]:
#!pip install lmppl
import lmppl
text = [sentence1, sentence2]

for scorer_name, scorer in scorers.items():
  ppl = scorer.get_perplexity(text)
  print("\n########## {0} ##########".format(scorer_name))
  print("{0} has Psuedo-Perplexity of {1}".format(text[0], ppl[0]))
  print("{0} has Psuedo-Perplexity of {1}\n".format(text[1], ppl[1]))


100%|██████████| 1/1 [00:00<00:00,  5.23it/s]



########## RoBERTa_base_sc ##########
Shelly ate the sliced banana with a fork. has Psuedo-Perplexity of 13.817431435935024
The fork of Shelly ate the sliced banana. has Psuedo-Perplexity of 31.41023792391415



100%|██████████| 1/1 [00:00<00:00,  3.09it/s]



########## BERT_large_uncased ##########
Shelly ate the sliced banana with a fork. has Psuedo-Perplexity of 22.458738502630812
The fork of Shelly ate the sliced banana. has Psuedo-Perplexity of 199.5484317656372



100%|██████████| 1/1 [00:00<00:00,  2.73it/s]


########## RoBERTa_large ##########
Shelly ate the sliced banana with a fork. has Psuedo-Perplexity of 13.226297032073228
The fork of Shelly ate the sliced banana. has Psuedo-Perplexity of 42.13932649615006






# YOUR ANSWERS HERE

##Answer
1. The sentence "Shelly ate the sliced banana with a fork." has a lower perplexity, indicating that it is more comprehensible to the average reader. Consequently, it is expected to achieve a lower perplexity score.

2. Based on this evaluation, the RoBERTa_large model demonstrates superior performance.

Explanation:

RoBERTa_large surpasses RoBERTa_base_sc in terms of the number of parameters it possesses. With its increased size, RoBERTa_large has the potential to capture more intricate language patterns and dependencies, resulting in improved perplexity performance.

RoBERTa_base_sc has undergone additional fine-tuning specifically for sentence classification tasks, suggesting that RoBERTa_large may be more suitable for different types of tasks.

In general, RoBERTa represents an enhanced iteration of BERT, thereby making it logical for BERT to exhibit the lowest perplexity among the models considered.

3. What is the relation of this test to semantic roles?

In terms of semantic roles, the fork couldn't eat a bannana. It is an AGENT which cannot perform an action.
