Import the required libraries.

In [1]:
import collections

import datasets
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torchtext
import tqdm
import transformers



I updated the datasets library in order to load the IMDb dataset, since earlier versions had issues with it.

In [2]:
print(datasets.__version__)

2.18.0


Load the train and test splits of the IMDb dataset.

In [3]:
train_data, test_data = datasets.load_dataset("imdb", split=["train", "test"])


In [5]:
print(train_data.shape)
print(test_data.shape)

(25000, 2)
(25000, 2)


Initialize the tokenizer from the 'bert-base-uncased' model. In the following cells I use the tokenizer's built-in methods to convert one sample review into tokens and to visualize what one training sample looks like.

In [7]:
tokenizer = transformers.AutoTokenizer.from_pretrained('bert-base-uncased')

In [8]:
train_data[0]

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

In [9]:
tokenizer.convert_ids_to_tokens(tokenizer.encode(train_data[0]['text']))

['[CLS]',
 'i',
 'rented',
 'i',
 'am',
 'curious',
 '-',
 'yellow',
 'from',
 'my',
 'video',
 'store',
 'because',
 'of',
 'all',
 'the',
 'controversy',
 'that',
 'surrounded',
 'it',
 'when',
 'it',
 'was',
 'first',
 'released',
 'in',
 '1967',
 '.',
 'i',
 'also',
 'heard',
 'that',
 'at',
 'first',
 'it',
 'was',
 'seized',
 'by',
 'u',
 '.',
 's',
 '.',
 'customs',
 'if',
 'it',
 'ever',
 'tried',
 'to',
 'enter',
 'this',
 'country',
 ',',
 'therefore',
 'being',
 'a',
 'fan',
 'of',
 'films',
 'considered',
 '"',
 'controversial',
 '"',
 'i',
 'really',
 'had',
 'to',
 'see',
 'this',
 'for',
 'myself',
 '.',
 '<',
 'br',
 '/',
 '>',
 '<',
 'br',
 '/',
 '>',
 'the',
 'plot',
 'is',
 'centered',
 'around',
 'a',
 'young',
 'swedish',
 'drama',
 'student',
 'named',
 'lena',
 'who',
 'wants',
 'to',
 'learn',
 'everything',
 'she',
 'can',
 'about',
 'life',
 '.',
 'in',
 'particular',
 'she',
 'wants',
 'to',
 'focus',
 'her',
 'attention',
 '##s',
 'to',
 'making',
 'some',
 

The tokenize_and_convert_to_tensor function takes a batch of examples and runs the tokenizer over all values under the 'text' key. max_length=256 caps the tokenized length; because the tokenizer adds the [CLS] and [SEP] tokens at the beginning and end, at most 256-2 = 254 tokens of the review itself are kept. truncation=True shortens longer reviews to max_length, and padding="max_length" pads shorter ones with 0s so that every sample has the same length (max_length). Finally, return_tensors="pt" asks the tokenizer to return the token ids as PyTorch tensors.

In [38]:
def tokenize_and_convert_to_tensor(examples):
    # Tokenize a batch of reviews; every example is padded or truncated to 256 ids.
    tokenized_inputs = tokenizer(
        examples["text"], padding="max_length", truncation=True,
        max_length=256, return_tensors="pt",
    )
    return {key: value for key, value in tokenized_inputs.items()}
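
To make the padding and truncation behaviour concrete, here is a minimal pure-Python sketch of what I understand padding="max_length" with truncation=True to do. The helper pad_or_truncate is hypothetical (it is not part of transformers), and the example token ids are arbitrary; the special-token ids 101, 102 and 0 are the [CLS], [SEP] and [PAD] ids of bert-base-uncased.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids in bert-base-uncased


def pad_or_truncate(token_ids, max_length=256):
    """Hypothetical helper mimicking padding="max_length" with truncation=True."""
    body = token_ids[: max_length - 2]        # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)     # 1 marks real tokens
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,  # 0 marks padding
    }


# A short input gets padded, a long one gets truncated (max_length=8 for readability).
short = pad_or_truncate([7592, 2088], max_length=8)
long = pad_or_truncate(list(range(1000, 1020)), max_length=8)
print(short["input_ids"])       # [101, 7592, 2088, 102, 0, 0, 0, 0]
print(short["attention_mask"])  # [1, 1, 1, 1, 0, 0, 0, 0]
print(long["input_ids"])        # [101, 1000, 1001, 1002, 1003, 1004, 1005, 102]
```

With max_length=256 the same logic keeps at most 254 review tokens per sample.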


The code below maps the tokenizer function defined above over the training data and then passes the result to a DataLoader to create batches.

I later added a print statement to check the type of train_dataset['input_ids'] because of issues I ran into during training. More on this in the training section.

In [39]:
train_dataset = train_data.map(tokenize_and_convert_to_tensor, batched=True)
print(type(train_dataset['input_ids']))
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=5, shuffle=True)

<class 'list'>


Here I initialize the BERT model, which is used later to obtain word embeddings for the tokens from bert-base-uncased.

In [30]:
from transformers import BertModel
bertmodel = BertModel.from_pretrained('bert-base-uncased')
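
As a quick shape check, the sketch below builds a tiny, randomly initialised BERT from a config so that nothing has to be downloaded. The sizes are illustrative assumptions (bert-base-uncased uses a hidden size of 768), and the name tiny_bert is hypothetical. It shows that the model expects input_ids shaped (batch_size, seq_len) and returns last_hidden_state shaped (batch_size, seq_len, hidden_size).

```python
import torch
from transformers import BertConfig, BertModel

# Tiny randomly initialised BERT, only to show shapes without downloading weights.
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=1,
                    num_attention_heads=2, intermediate_size=64)
tiny_bert = BertModel(config)
tiny_bert.eval()

input_ids = torch.randint(0, 100, (2, 6))     # (batch_size, seq_len)
attention_mask = torch.ones_like(input_ids)   # no padding in this toy batch

with torch.no_grad():                         # feature extraction only, no gradients
    out = tiny_bert(input_ids=input_ids, attention_mask=attention_mask)

print(out.last_hidden_state.shape)            # (batch_size, seq_len, hidden_size)
```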

Below I define my model, which uses the PyTorch nn module and its transformer classes to create the encoder layers. It takes the embedding size (which comes from the word embeddings), the number of attention heads in each encoder layer, and the number of layers, which sets how many encoder layers are stacked together. It also defines a linear layer with 2 outputs, because the sentiment classifier is binary: the output can only be positive or negative.

Below the model class I initialize the model with an embedding size of 768, since the BERT model above returns word embeddings of that size. The initialization also passes a dropout value of 0.5, although the class as written does not use it yet.
After that I define the optimizer (Adam) and the loss function (cross-entropy) using the built-in PyTorch classes.

In [12]:
class TransformerEncoder(nn.Module):
    def __init__(self, embedding_size, attention_heads, layers, dropout):
        super(TransformerEncoder, self).__init__()
        self.encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model=embedding_size, nhead=attention_heads),
                                            num_layers=layers)
        self.fc = nn.Linear(embedding_size, 2)
    def forward(self, x):
        x = self.encoder(x)
        x = x.mean(dim=1)
        return self.fc(x)
        
model = TransformerEncoder(embedding_size=768, attention_heads=8, layers=3, dropout=0.5)
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()
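
Two details of the cell above are worth a sketch. nn.TransformerEncoderLayer defaults to batch_first=False, so it reads its input as (seq_len, batch, features), and the dropout argument is accepted by the class but never passed on. The variant below is a hypothetical alternative, not the original model: it sets batch_first=True and forwards dropout, and uses small sizes so it runs quickly.

```python
import torch
import torch.nn as nn


class BatchFirstEncoderClassifier(nn.Module):
    """Hypothetical variant of the model above; not the original implementation."""

    def __init__(self, embedding_size, attention_heads, layers, dropout):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=embedding_size,
            nhead=attention_heads,
            dropout=dropout,     # pass the dropout value through to the layer
            batch_first=True,    # inputs are (batch, seq_len, embedding_size)
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.fc = nn.Linear(embedding_size, 2)

    def forward(self, x):
        x = self.encoder(x)      # (batch, seq_len, embedding_size)
        x = x.mean(dim=1)        # average over the sequence dimension
        return self.fc(x)        # (batch, 2)


sketch = BatchFirstEncoderClassifier(embedding_size=32, attention_heads=4, layers=2, dropout=0.1)
dummy = torch.randn(5, 16, 32)   # 5 sequences of 16 tokens each
logits = sketch(dummy)
print(logits.shape)              # one pair of class scores per sequence
```

With batch_first=True, mean(dim=1) averages over tokens, so the output has one row per example in the batch.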



The check below, right before the training loop, inspects the types and dimensions of input_ids, attention_mask and label as they come out of the tokenizer, map and DataLoader steps. During training, each batch turned out to contain a list of 256 tensors (256 being the max_length set in the tokenizer step), each of length 5. After a good deal of research I found that this is not the expected behaviour: input_ids should be a single tensor per batch, with a sequence dimension of length 256.

In [50]:
for batch in train_dataloader:
    print(batch.keys())
    print()

    print('Type of input_ids: ', type(batch['input_ids']))
    print('Length of input_ids: ', len(batch['input_ids']))
    print('type of one instance of input_ids: ', type(batch['input_ids'][0]))
    print('length of one instance of input_ids: ', len(batch['input_ids'][0]))
    print()

    print('Type of attention_mask: ', type(batch['attention_mask']))
    print('Length of attention_mask: ', len(batch['attention_mask']))
    print('type of one instance of attention_mask: ', type(batch['attention_mask'][0]))
    print('length of one instance of attention_mask: ', len(batch['attention_mask'][0]))

    print()
    print('Type of label: ', type(batch['label']))
    print('Length of label: ', len(batch['label']))
    print('Shape of label: ', batch['label'].shape)
    

    break

dict_keys(['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'])

Type of input_ids:  <class 'list'>
Length of input_ids:  256
type of one instance of input_ids:  <class 'torch.Tensor'>
length of one instance of input_ids:  5

Type of attention_mask:  <class 'list'>
Length of attention_mask:  256
type of one instance of attention_mask:  <class 'torch.Tensor'>
length of one instance of attention_mask:  5

Type of label:  <class 'torch.Tensor'>
Length of label:  5
Shape of label:  torch.Size([5])
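
The output above (a list of 256 tensors of length 5) matches what PyTorch's default collate function does when each example stores input_ids as a Python list: it groups the values position by position instead of stacking them. The small sketch below reproduces this on two made-up examples of length 3 and compares it with examples that already hold tensors.

```python
import torch
from torch.utils.data import default_collate

# Two toy examples whose input_ids are Python lists, as in the dataset above.
as_lists = [
    {"input_ids": [11, 12, 13], "label": 0},
    {"input_ids": [21, 22, 23], "label": 1},
]
collated = default_collate(as_lists)
print(type(collated["input_ids"]), len(collated["input_ids"]))  # a list, one tensor per position
print(collated["input_ids"][0])  # first token id of every example in the batch

# The same examples holding tensors are stacked along a new batch dimension.
as_tensors = [
    {"input_ids": torch.tensor(ex["input_ids"]), "label": ex["label"]}
    for ex in as_lists
]
stacked = default_collate(as_tensors)
print(stacked["input_ids"].shape)  # (batch_size, seq_len)
```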


One of the main issues I faced during training is that input_ids in each batch is supposed to be a single tensor with a sequence length of 256, which is passed into the BERT model to get the word embeddings. Instead, it is a list of 256 tensors. I used torch.stack to combine them into one tensor as a workaround. Beyond that, I am having trouble comparing the labels with the output of my model, which takes in the 768-dimensional word embeddings from the BERT model.

Training now breaks when calculating the loss, since the tensor returned by my model is 256x2 whereas the label tensor has 5 elements. My intuition is that the 5 comes from batch_size in the DataLoader. The problem lies either in the output size of the layers in the model or in the tokenization step, since it yields a list of tensors for each batch instead of a single tensor.

Unfortunately I was not able to finish training because of these issues and would need more time to figure out the solution.

In [56]:
for epoch in range(1):
    total_loss = 0.0  # running loss for this epoch
    for batch in train_dataloader:
        
        batch['input_ids'] = torch.stack(batch['input_ids'])   
        batch['attention_mask'] = torch.stack(batch['attention_mask'])   
        
        
        #for debugging        
        print('type of input_ids now ', type(batch['input_ids']))
        print('shape of input-ids now ', batch['input_ids'].shape)
        print()
        ###
        
        outputembed = bertmodel(
            input_ids=batch['input_ids'], 
            attention_mask=batch['attention_mask'])
        
        #embedding size is the 768 because of BERT MODEL
        word_embeddings = (outputembed.last_hidden_state)
        
        #for debugging        
        print('type of word_embeddings', type(word_embeddings))
        print('shape of word_embeddings ', word_embeddings.shape)
        print()
        ###
        
        outputs = model(word_embeddings)
        
        #for debugging
        print('type of outputs', type(outputs))
        print('shape of outputs', outputs.shape)
        print(outputs)
        ###
        
        
        loss = criterion(outputs[:, -1], batch['label']) 

        optimizer.zero_grad()  
        loss.backward()  
        optimizer.step() 

        total_loss += loss.item()
    print(f"Epoch {epoch+1}, Loss: {total_loss:.4f}")

type of input_ids now  <class 'torch.Tensor'>
shape of input-ids now  torch.Size([256, 5])

type of word_embeddings <class 'torch.Tensor'>
shape of word_embeddings  torch.Size([256, 5, 768])

type of outputs <class 'torch.Tensor'>
shape of outputs torch.Size([256, 2])
tensor([[-7.5746e-01, -8.2665e-01],
        [-5.4519e-01, -9.6301e-01],
        [-6.5882e-01, -5.2784e-01],
        [-8.6217e-01, -5.8591e-01],
        [-1.7153e-02, -5.9775e-01],
        [-2.7030e-01, -7.1750e-01],
        [-3.2290e-01, -5.3839e-01],
        [-7.6048e-01, -4.7219e-01],
        [-4.8242e-01, -7.8979e-01],
        [-5.8220e-01, -5.4729e-01],
        [-8.0593e-01, -8.1654e-01],
        [-4.0697e-01, -3.7847e-01],
        [-7.8396e-01, -7.6987e-01],
        [-6.3885e-01, -2.1720e-01],
        [-1.2698e-01, -7.7166e-01],
        [-8.6424e-01, -6.0170e-01],
        [-4.6396e-01, -6.9555e-01],
        [-3.9011e-01, -1.2904e+00],
        [-4.6060e-01, -3.0133e-01],
        [-6.1745e-01, -9.1420e-01],
        [-4

RuntimeError: size mismatch (got input: [256], target: [5])
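
Reading the printed shapes: stacking 256 tensors of length 5 gives [256, 5], so BERT treats the batch as 256 sequences of 5 tokens. The encoder output is then averaged down to [256, 2], and outputs[:, -1] keeps a single column, leaving 256 values to compare against 5 labels. One possible fix, sketched below under assumptions, is to transpose the stacked ids to [5, 256], use a batch-first encoder, and pass the full [5, 2] logits to the loss. The sketch uses random ids and a small nn.Embedding as a hypothetical stand-in for the frozen BERT model, so it shows the shapes only, not real training.

```python
import torch
import torch.nn as nn

batch_size, seq_len, hidden = 5, 256, 32   # hidden is 768 with bert-base-uncased

# What the DataLoader produced above: a list of seq_len tensors of length batch_size.
input_ids_list = [torch.randint(0, 100, (batch_size,)) for _ in range(seq_len)]
labels = torch.randint(0, 2, (batch_size,))

# Stack, then transpose so the batch dimension comes first.
input_ids = torch.stack(input_ids_list).T  # (batch_size, seq_len)

embedder = nn.Embedding(100, hidden)       # hypothetical stand-in for the frozen BERT model
layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=1)
classifier = nn.Linear(hidden, 2)
criterion = nn.CrossEntropyLoss()

with torch.no_grad():                      # embeddings are treated as fixed features
    word_embeddings = embedder(input_ids)  # (batch_size, seq_len, hidden)

logits = classifier(encoder(word_embeddings).mean(dim=1))  # (batch_size, 2)
loss = criterion(logits, labels)           # full logits, not logits[:, -1]
loss.backward()
print(input_ids.shape, logits.shape, loss.item())
```

nn.CrossEntropyLoss expects logits of shape (N, C) and class-index targets of shape (N), which is why the logits must have one row per label.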