<a href="https://colab.research.google.com/github/anilbhatt1/Deep_Learning_EVA4_Phase2/blob/master/E4P2S9_Transformer_Senti_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
! nvidia-smi

Wed Nov 25 11:38:11 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.38       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Source : https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/6%20-%20Transformers%20for%20Sentiment%20Analysis.ipynb

In [2]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [3]:
import torch
from torchtext import data
!pip install transformers

SEED = 1234

torch.manual_seed(SEED)                      # Seed PyTorch's RNG so that runs are reproducible
torch.backends.cudnn.deterministic = True

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/3a/83/e74092e7f24a08d751aa59b37a9fc572b2e4af3918cb66f7766c3affb1b4/transformers-3.5.1-py3-none-any.whl (1.3MB)
     |████████████████████████████████| 1.3MB 5.9MB/s 
Collecting sentencepiece==0.1.91
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)
     |████████████████████████████████| 1.1MB 38.1MB/s 
Collecting tokenizers==0.9.3
  Downloading https://files.pythonhosted.org/packages/4c/34/b39eb9994bc3c999270b69c9eea40ecc6f0e97991dba28282b9fd32d44ee/tokenizers-0.9.3-cp36-cp36m-manylinux1_x86_64.whl (2.9MB)
     |████████████████████████████████| 2.9MB 45.1MB/s 
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)

#### We are using transformers here. A pre-trained transformer was trained with a specific vocabulary, so for our IMDB example we must tokenize the text in exactly the same way and map it to exactly the same vocabulary. Fortunately, each pre-trained transformer ships with its own tokenizer. We are using the BERT model here, whose tokenizer is available as 'BertTokenizer', so we will use BertTokenizer as our tokenizer.

In [4]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')  # 'bert-base-uncased' lower-cases all input text, so casing is ignored


##### We will use the vocabulary that comes with BERT. The number of unique tokens (words and sub-word pieces) in this vocabulary can be checked as below

In [5]:
len(tokenizer.vocab)

30522

#### Example of tokenization using BertTokenizer

In [6]:
tokens = tokenizer.tokenize("Hey How are You, Man  ?")
tokens

['hey', 'how', 'are', 'you', ',', 'man', '?']

#### Example of numericalizing the tokens (i.e. converting tokens into indexes present in vocabulary) using BertTokenizer

In [7]:
indexes = tokenizer.convert_tokens_to_ids(tokens)
indexes

[4931, 2129, 2024, 2017, 1010, 2158, 1029]

##### Transformers are trained with special tokens marking the beginning and end of a sequence, along with a padding token and an unknown token. We can get these tokens and their indexes as shown below. Please note that we will use these indexes later when defining the field ('txt').

In [8]:
init_token = tokenizer.cls_token
eos_token  = tokenizer.sep_token
pad_token  = tokenizer.pad_token
unk_token  = tokenizer.unk_token
print(init_token, eos_token, pad_token, unk_token)

[CLS] [SEP] [PAD] [UNK]


In [9]:
init_token_idx = tokenizer.cls_token_id
eos_token_idx  = tokenizer.sep_token_id
pad_token_idx  = tokenizer.pad_token_id
unk_token_idx  = tokenizer.unk_token_id
print(init_token_idx, eos_token_idx, pad_token_idx, unk_token_idx)

101 102 0 100


#### Transformers are trained with a fixed maximum sequence length and cannot handle sequences longer than that. For BERT in this example the maximum length is 512 tokens. Also, previously we used 'spacy' for tokenizing, but in this case we will define our own tokenizing function, which also keeps the number of tokens within the 512 limit.

In [10]:
max_input_length = tokenizer.max_model_input_sizes['bert-base-uncased']
print(max_input_length)

512


In [11]:
def tokenize_and_cut(sentence):
    tokens = tokenizer.tokenize(sentence)
    tokens = tokens[:max_input_length - 2]  # -2 to leave room for the special tokens at the beginning and end
    return tokens
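
As a rough illustration of why the cut is `max_input_length - 2`, the sketch below works on plain lists of ids. `add_special_tokens_and_cut` is a hypothetical helper (it is not part of `transformers` or torchtext), the review ids are made up, and 101 / 102 are the `[CLS]` / `[SEP]` indexes printed earlier in this notebook.

```python
# Hypothetical helper, for illustration only: it mimics what tokenize_and_cut
# and the Field's init_token / eos_token achieve together.
MAX_INPUT_LENGTH = 512      # value printed by the notebook for bert-base-uncased
CLS_ID, SEP_ID = 101, 102   # [CLS] and [SEP] indexes printed by the notebook

def add_special_tokens_and_cut(token_ids, max_len=MAX_INPUT_LENGTH):
    """Truncate to max_len - 2 ids, then wrap the result in [CLS] ... [SEP]."""
    body = token_ids[:max_len - 2]
    return [CLS_ID] + body + [SEP_ID]

short_review = list(range(1000, 1010))   # 10 made-up token ids
long_review = list(range(1000, 2000))    # 1000 made-up token ids

print(len(add_special_tokens_and_cut(short_review)))  # 12
print(len(add_special_tokens_and_cut(long_review)))   # 512
```

Without the `- 2`, a long review would end up with 514 ids once the two special tokens are added, which is more than BERT accepts.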

#### Defining fields

In [12]:
from torchtext import data

txt = data.Field(batch_first = True,  # Transformers need the batch dimension to come first
                 use_vocab = False,   # Tells torchtext that we handle the vocabulary ourselves, i.e. we won't call txt.build_vocab
                 tokenize = tokenize_and_cut, # Use our custom tokenization function instead of 'spacy'
                 preprocessing = tokenizer.convert_tokens_to_ids, # Numericalizes the tokens
                 init_token = init_token_idx, # Special tokens are given as index values because they are added after numericalization
                 eos_token  = eos_token_idx,
                 pad_token  = pad_token_idx,
                 unk_token  = unk_token_idx)

lbl = data.LabelField(dtype = torch.float)

#### Creating train and test data

In [13]:
from torchtext import datasets
import random
train_data, test_data = datasets.IMDB.splits(txt, lbl)

downloading aclImdb_v1.tar.gz


aclImdb_v1.tar.gz: 100%|██████████| 84.1M/84.1M [00:07<00:00, 11.2MB/s]


In [14]:
print('Length of train_data:',len(train_data), 'Type:', type(train_data))
print('Length of test_data :',len(test_data), 'Type:', type(test_data))
print(train_data.fields)
print(test_data.fields)
print(vars(train_data.examples[0]))   # vars -> built-in function; vars(obj) returns obj.__dict__
print(vars(test_data.examples[0])) 
print(vars(train_data[-1]))

Length of train_data: 25000 Type: torchtext.datasets.imdb.IMDB
Length of test_data : 25000 Type: torchtext.datasets.imdb.IMDB
{'text': <torchtext.data.field.Field object at 0x7f2553021c88>, 'label': <torchtext.data.field.LabelField object at 0x7f25530219e8>}
{'text': <torchtext.data.field.Field object at 0x7f2553021c88>, 'label': <torchtext.data.field.LabelField object at 0x7f25530219e8>}
{'text': [1045, 4635, 2023, 1996, 2190, 1997, 1996, 1062, 29459, 3127, 13068, 2015, 1012, 1996, 10990, 3315, 3556, 9909, 8595, 2000, 2019, 10990, 3898, 2377, 1012, 2045, 2003, 2019, 6581, 4637, 3459, 1998, 6547, 12700, 2008, 2097, 2562, 2017, 16986, 2127, 1996, 2345, 3127, 1012, 7305, 21681, 2515, 1037, 2986, 3105, 2004, 2123, 5277, 1998, 2010, 11477, 13059, 1062, 29459, 1012, 2197, 1010, 2021, 5121, 2025, 2560, 1010, 2003, 1996, 2307, 9855, 2136, 1997, 9809, 1998, 2394, 1012], 'label': 'pos'}
{'text': [2798, 1000, 9610, 2278, 1000, 4341, 2003, 7078, 27547, 2004, 1996, 7082, 2266, 1997, 1996, 7873, 21

#### Unlike in previous examples, the data here is stored as indexes. To see them as tokens, we can use tokenizer.convert_ids_to_tokens as shown below

In [15]:
example = tokenizer.convert_ids_to_tokens(vars(train_data.examples[0])['text'])

print(example)

['i', 'rank', 'this', 'the', 'best', 'of', 'the', 'z', '##orro', 'chapter', '##play', '##s', '.', 'the', 'exciting', 'musical', 'score', 'adds', 'punch', 'to', 'an', 'exciting', 'screen', 'play', '.', 'there', 'is', 'an', 'excellent', 'supporting', 'cast', 'and', 'mystery', 'villain', 'that', 'will', 'keep', 'you', 'guessing', 'until', 'the', 'final', 'chapter', '.', 'reed', 'hadley', 'does', 'a', 'fine', 'job', 'as', 'don', 'diego', 'and', 'his', 'alter', 'ego', 'z', '##orro', '.', 'last', ',', 'but', 'certainly', 'not', 'least', ',', 'is', 'the', 'great', 'directing', 'team', 'of', 'whitney', 'and', 'english', '.']
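
The `##` pieces in the output above ('z', '##orro') come from sub-word tokenization. The sketch below is a simplified version of the greedy longest-match-first strategy that WordPiece tokenizers use. It is not the `transformers` implementation: `TOY_VOCAB` and `wordpiece` are made up for illustration, and the real tokenizer also handles punctuation, casing and very long words.

```python
# Toy vocabulary (an assumption for this sketch); the real BERT vocabulary has 30522 entries.
TOY_VOCAB = {"z", "##orro", "chapter", "##play", "##s", "the", "[UNK]"}

def wordpiece(word, vocab=TOY_VOCAB, unk="[UNK]"):
    """Greedy longest-match-first split of a single lower-cased word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

print(wordpiece("zorro"))         # ['z', '##orro']
print(wordpiece("chapterplays"))  # ['chapter', '##play', '##s']
print(wordpiece("xyz"))           # ['[UNK]']
```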


##### Split the train_data further into train_data & valid_data

In [16]:
import random
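random.seed(SEED)  # random.seed() returns None, so seed first and pass the resulting state explicitly
train_data, valid_data = train_data.split(split_ratio=0.8, random_state=random.getstate())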

In [17]:
print('Length of train_data:',len(train_data))
print('Length of valid_data:',len(valid_data))
print('Length of test_data:',len(test_data))

Length of train_data: 20000
Length of valid_data: 5000
Length of test_data: 25000
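
The idea behind a seeded 80/20 split can be sketched with the standard library alone. This is not torchtext's implementation; `split_indices` is a hypothetical helper that only shows why fixing the seed gives the same 20000 / 5000 split on every run.

```python
import random

SEED = 1234

def split_indices(n_examples, split_ratio=0.8, seed=SEED):
    """Shuffle indices with a private, seeded RNG and cut them at split_ratio."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    cut = int(round(split_ratio * n_examples))
    return indices[:cut], indices[cut:]

train_idx, valid_idx = split_indices(25000)
print(len(train_idx), len(valid_idx))        # 20000 5000
print(split_indices(25000)[0] == train_idx)  # True: same seed, same split
```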


### Building vocabulary. The vocabulary for txt is already fixed: as is the norm, we reuse the vocabulary BERT was trained with. However, we still need to build the vocabulary for lbl as below

In [18]:
lbl.build_vocab(train_data)

In [19]:
print('Unique words in lbl:',len(lbl.vocab))

Unique words in lbl: 2


In [20]:
print(lbl.vocab.stoi)

defaultdict(<function _default_unk_index at 0x7f2591708158>, {'neg': 0, 'pos': 1})


#### Creating the iterators. We use BucketIterator, which returns batches of examples of similar length, minimizing the amount of padding per example.

In [21]:
batch_size = 128
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits((train_data, valid_data, test_data), 
                                                                            batch_size = batch_size,
                                                                            device = device)
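
To get a feel for how much padding bucketing can save, the sketch below compares the pad tokens needed when batches are cut from an arbitrary order with those needed when similar lengths are grouped together. The review lengths are made up, `padding_needed` is a hypothetical helper, and full sorting is used as a stand-in for bucketing (the real iterator only groups examples of similar length).

```python
import random

def padding_needed(lengths, batch_size):
    """Total pad tokens when each batch is padded to its longest sequence."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total

rng = random.Random(1234)
lengths = [rng.randint(20, 512) for _ in range(1024)]  # made-up review lengths

random_order_padding = padding_needed(lengths, batch_size=128)
bucketed_padding = padding_needed(sorted(lengths), batch_size=128)

print('pad tokens, arbitrary order:', random_order_padding)
print('pad tokens, grouped by length:', bucketed_padding)
```

Every pad token still passes through BERT, so less padding means less wasted computation per batch.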

#### Building the model. We'll load the pre-trained model, making sure to load the same model as we did for tokenizer

In [22]:
from transformers import BertModel
bert = BertModel.from_pretrained('bert-base-uncased')


In [23]:
import torch.nn as nn

class BERTGRUSentiment(nn.Module):
     def __init__(self, bert, hidden_dim, output_dim, n_layers, bidirectional, dropout):
         super().__init__()
         self.bert     = bert
         embedding_dim = bert.config.to_dict()['hidden_size']
         self.rnn      = nn.GRU(embedding_dim,
                                hidden_dim,
                                num_layers = n_layers,
                                bidirectional = bidirectional,
                                batch_first = True,
                                dropout = 0 if n_layers < 2 else dropout)
         self.out     = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
         self.dropout = nn.Dropout(dropout)
     
     def forward(self, text):  
         #text = [batch size, sentence len] --> eg: [128, 512] means 128 sentences with 512 tokens each. Sentence length varies from batch to batch (at most 512).
         #we gave 'batch_first = True' in the field definition, hence the batch dimension comes first
         with torch.no_grad():
             embedded = self.bert(text)[0]
         #embedded = [batch size, sentence len, emb dim] --> eg: [128, 512, 768]; BERT adds the embedding dimension (768 for bert-base)

         _, hidden = self.rnn(embedded)
         #hidden = [n layers * n directions, batch size, hid dim] 

         if self.rnn.bidirectional:
             hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
         else:
             hidden = self.dropout(hidden[-1,:,:])
         #hidden = [batch size, hid dim]        

         output = self.out(hidden)
         #output = [batch size, out dim]
         
         return output 
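
Why `hidden[-2]` and `hidden[-1]` are the right slices for a bidirectional GRU can be seen from the layout of the final hidden state. The PyTorch documentation for `nn.GRU` describes the first dimension as layers times directions, ordered layer by layer. `hidden_state_layout` below is a hypothetical helper that simply labels those slices; it does not touch PyTorch.

```python
def hidden_state_layout(n_layers, bidirectional):
    """Label each slice along dim 0 of the GRU's final hidden state."""
    directions = ["forward", "backward"] if bidirectional else ["forward"]
    return [f"layer {layer} {direction}"
            for layer in range(n_layers)
            for direction in directions]

layout = hidden_state_layout(n_layers=2, bidirectional=True)
print(layout)
# ['layer 0 forward', 'layer 0 backward', 'layer 1 forward', 'layer 1 backward']
print(layout[-2], '|', layout[-1])  # layer 1 forward | layer 1 backward
```

Concatenating the two top-layer slices gives a vector of size `hidden_dim * 2`, which is why the linear layer is built with that input size when `bidirectional` is true.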

In [24]:
hidden_dim    = 256
output_dim    = 1            # pos(1) or neg(0) for sentiment analysis 
n_layers      = 2
bidirectional = True
dropout       = 0.25

model = BERTGRUSentiment(bert, hidden_dim, output_dim, n_layers, bidirectional, dropout)

In [25]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 112,241,409 trainable parameters


In [26]:
for name, param in model.named_parameters(): 
    if name.startswith('bert'):
        param.requires_grad = False

In [27]:
print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 2,759,169 trainable parameters
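
The 2,759,169 figure can be checked by hand. The sketch below counts the GRU and linear-layer parameters from the shapes used in this notebook (BERT hidden size 768, `hidden_dim` 256, 2 layers, bidirectional). `gru_param_count` is a hypothetical helper written for this check, assuming the usual GRU layout of three gates with two bias vectors per cell.

```python
def gru_param_count(input_size, hidden_size, n_layers, bidirectional):
    """Count weights and biases of a multi-layer GRU (3 gates per cell)."""
    n_directions = 2 if bidirectional else 1
    total = 0
    for layer in range(n_layers):
        # Layers after the first receive the concatenated outputs of all directions.
        layer_input = input_size if layer == 0 else hidden_size * n_directions
        weight_ih = 3 * hidden_size * layer_input
        weight_hh = 3 * hidden_size * hidden_size
        biases = 2 * 3 * hidden_size          # bias_ih and bias_hh
        total += n_directions * (weight_ih + weight_hh + biases)
    return total

gru = gru_param_count(input_size=768, hidden_size=256, n_layers=2, bidirectional=True)
linear = 256 * 2 * 1 + 1                      # out.weight + out.bias
print(gru, linear, gru + linear)              # 2758656 513 2759169
print(112_241_409 - (gru + linear))           # 109482240 frozen BERT parameters
```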


In [28]:
for name, param in model.named_parameters(): 
    if param.requires_grad:
        print(name)   

rnn.weight_ih_l0
rnn.weight_hh_l0
rnn.bias_ih_l0
rnn.bias_hh_l0
rnn.weight_ih_l0_reverse
rnn.weight_hh_l0_reverse
rnn.bias_ih_l0_reverse
rnn.bias_hh_l0_reverse
rnn.weight_ih_l1
rnn.weight_hh_l1
rnn.bias_ih_l1
rnn.bias_hh_l1
rnn.weight_ih_l1_reverse
rnn.weight_hh_l1_reverse
rnn.bias_ih_l1_reverse
rnn.bias_hh_l1_reverse
out.weight
out.bias


##### SGD updates all parameters with the same learning rate, and choosing this learning rate can be tricky. Adam adapts the step size for each parameter using running averages of its gradients: parameters with consistently large gradients take relatively smaller steps, and parameters with small or infrequent gradients take relatively larger ones. Hence we use Adam in this example.
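
A single Adam update for one scalar parameter can be sketched in plain Python to show the per-parameter scaling. `adam_step` is a hypothetical helper, not PyTorch code; its default hyperparameters mirror the defaults of `torch.optim.Adam` (lr 1e-3, betas 0.9 and 0.999, eps 1e-8).

```python
import math

def adam_step(grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; returns (delta, m, v)."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2     # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    delta = -lr * m_hat / (math.sqrt(v_hat) + eps)
    return delta, m, v

# Two parameters whose gradients differ by four orders of magnitude.
for grad in (100.0, 0.01):
    delta, _, _ = adam_step(grad, m=0.0, v=0.0, t=1)
    print(f"grad={grad}: Adam step={delta:.6f}, plain SGD step={-1e-3 * grad:.6f}")
```

On the first step both parameters move by roughly the learning rate, whereas plain SGD would move them by amounts that differ by a factor of 10,000.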

In [29]:
import torch.optim as optim
optimizer = optim.Adam(model.parameters())

criterion = nn.BCEWithLogitsLoss()

model = model.to(device)
criterion = criterion.to(device)
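
`BCEWithLogitsLoss` takes the raw model output (the logit) rather than a probability. The sketch below shows one standard, numerically stable way to compute binary cross-entropy directly from a logit, next to the textbook form. Both functions are illustrative helpers for a single example, not PyTorch's actual code.

```python
import math

def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy on a raw score (logit)."""
    # Algebraically equal to -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))]
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def naive_bce(logit, target):
    """Textbook form: apply the sigmoid first, then take logs."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

print(bce_with_logits(2.0, 1.0), naive_bce(2.0, 1.0))  # the two forms agree
print(bce_with_logits(800.0, 0.0))  # 800.0; the naive form raises a math domain error here
```

This is why the model's last layer has no sigmoid: the loss works on logits, and the sigmoid is applied only when computing accuracy or predictions.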

In [30]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """
    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division
    acc = correct.sum() / len(correct)
    return acc
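
The same accuracy computation can be followed on a tiny hand-made batch without tensors. `binary_accuracy_py` is a hypothetical pure-Python counterpart of the function above, and the logits and labels are made up.

```python
import math

def binary_accuracy_py(logits, labels):
    """Fraction of examples whose thresholded sigmoid(logit) equals the label."""
    probs = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    preds = [1.0 if p > 0.5 else 0.0 for p in probs]   # equivalent to logit > 0
    correct = sum(1 for p, y in zip(preds, labels) if p == y)
    return correct / len(labels)

logits = [2.0, -1.0, 0.3, -0.2]   # made-up model outputs
labels = [1.0, 0.0, 0.0, 0.0]
print(binary_accuracy_py(logits, labels))  # 0.75
```

Only the third example is wrong: its logit is slightly positive, so it is predicted as 1 while the label is 0.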

In [31]:
def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc  = 0

    model.train()

    for idx, batch in enumerate(iterator):       
        optimizer.zero_grad()      
        predictions = model(batch.text).squeeze(1)
        loss = criterion(predictions, batch.label)
        acc  = binary_accuracy(predictions, batch.label)
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()
        epoch_acc  += acc.item()    

    return epoch_loss / len(iterator), epoch_acc / len(iterator)   

In [32]:
def evaluate(model, iterator, criterion):
    epoch_loss = 0
    epoch_acc  = 0

    model.eval()

    with torch.no_grad():
        for idx, batch in enumerate(iterator):      
            predictions = model(batch.text).squeeze(1)
            loss = criterion(predictions, batch.label)
            acc  = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc  += acc.item() 

    return epoch_loss / len(iterator), epoch_acc / len(iterator)      

In [33]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [None]:
n_epochs = 5
best_valid_loss = float('inf')

for epoch in range(n_epochs):
    
    start_time = time.time()

    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)

    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), '/content/gdrive/My Drive/EVA4P2_S9/E4P2_S9_Transformer_Senti_Analysis.pt')
       
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 17m 29s
	Train Loss: 0.264 | Train Acc: 89.29%
	 Val. Loss: 0.241 |  Val. Acc: 90.98%
Epoch: 02 | Epoch Time: 17m 29s
	Train Loss: 0.231 | Train Acc: 90.85%
	 Val. Loss: 0.200 |  Val. Acc: 92.50%
Epoch: 03 | Epoch Time: 17m 30s
	Train Loss: 0.205 | Train Acc: 91.96%
	 Val. Loss: 0.194 |  Val. Acc: 92.52%
Epoch: 04 | Epoch Time: 17m 26s
	Train Loss: 0.177 | Train Acc: 93.32%
	 Val. Loss: 0.246 |  Val. Acc: 90.08%
Epoch: 05 | Epoch Time: 17m 27s
	Train Loss: 0.151 | Train Acc: 94.19%
	 Val. Loss: 0.220 |  Val. Acc: 91.97%
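
The checkpointing rule in the loop keeps only the weights from the epoch with the lowest validation loss. Replaying that rule on the validation losses from the log above shows which epoch ends up on disk. This is a sketch with the numbers copied by hand; `best_epoch` is a name introduced for illustration.

```python
# Validation losses copied from the training log above (epochs 1-5).
valid_losses = [0.241, 0.200, 0.194, 0.246, 0.220]

best_valid_loss = float('inf')
best_epoch = None
for epoch, valid_loss in enumerate(valid_losses, start=1):
    if valid_loss < best_valid_loss:       # same test as in the training loop
        best_valid_loss = valid_loss
        best_epoch = epoch                 # the notebook calls torch.save at this point

print(best_epoch, best_valid_loss)  # 3 0.194
```

So the saved weights are those from epoch 3, even though training accuracy kept improving in epochs 4 and 5 while validation loss got worse.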


In [34]:
model.load_state_dict(torch.load('/content/gdrive/My Drive/EVA4P2_S9/E4P2_S9_Transformer_Senti_Analysis.pt'))
test_loss, test_acc = evaluate(model, test_iterator, criterion)
print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.192 | Test Acc: 92.22%


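##### Passing user input to the model to get its sentiment. Please note that we trained on movie reviews, so the input should be similar text. The returned value is the sigmoid of the model output: values close to 1 mean positive and values close to 0 mean negative.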

In [35]:
def predict_sentiment(model, tokenizer, sentence): 
    model.eval()
    tokens  = tokenizer.tokenize(sentence)
    tokens  = tokens[:max_input_length - 2]
    indexed = [init_token_idx] + tokenizer.convert_tokens_to_ids(tokens) + [eos_token_idx]
    tensor  = torch.LongTensor(indexed).to(device)  # converts 'indexed', a Python list, into a PyTorch tensor on the device
    tensor  = tensor.unsqueeze(0)                   # adds the batch dimension the model expects: [1, sentence len]
    preds   = torch.sigmoid(model(tensor))
    return preds.item()

In [37]:
predict_sentiment(model, tokenizer, "This film is terrible")

0.03443735092878342