<h1> BERT: <i><b>B</b>idirectional <b>E</b>ncoder <b>R</b>epresentations from <b>T</b>ransformers</i></h1>
<h6><i>created by Google AI Language Team in 2018</i></h6>
BERT is designed to pre-train deep bidirectional representations from unlabelled text by jointly conditioning on both left and right context in all the layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.
BERT is pre-trained on unlabelled text and, at the time of its release, achieved state-of-the-art results on 11 individual NLP tasks, all with only a small amount of task-specific fine-tuning.

Deeply bidirectional means that BERT learns information from both the left and the right side of a token's context during training.
<p>Let's try to understand the concept of left and right context in a deeply bidirectional model with an example:</p>
<ul>
<li>Sentence 1: They exchanged addresses <b>and agreed to keep in touch.</b></li>
<li>Sentence 2: <b>People of India will be</b> addressed by Prime Minister today.</li>
</ul>

If a model is trained unidirectionally and we try to predict the word <i><b>"address"</b></i> in the two sentences above, it will make an error in at least one of them: in Sentence 1 the informative context lies to the right of the word, while in Sentence 2 it lies to the left.
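To make the idea of left and right context concrete, the minimal sketch below splits each example sentence around the target word. The helper name `split_context` and the whitespace tokenization are assumptions made for illustration only; they are not part of BERT.

```python
def split_context(sentence, target_prefix):
    """Return (left_context, target, right_context) for the first word starting with target_prefix."""
    words = sentence.split()
    for i, word in enumerate(words):
        if word.lower().startswith(target_prefix):
            return words[:i], word, words[i + 1:]
    raise ValueError("target word not found")

examples = [
    "They exchanged addresses and agreed to keep in touch.",
    "People of India will be addressed by Prime Minister today.",
]

for example in examples:
    left, target, right = split_context(example, "address")
    # A left-to-right model only sees `left` when predicting `target`;
    # a bidirectional model sees both `left` and `right`.
    print(f"left={left} | target={target} | right={right}")
```

In the first sentence the left context is only two words long, so most of the evidence about the meaning of the target sits on the right; in the second sentence it is the other way round.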

<h3> Word Embedding</h3>

![](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/06/06062705/Word-Vectors.png)

Before BERT, the NLP community relied on features built by counting key terms in a corpus, such as term frequency. These vectors were used in mathematical and statistical models for classification and regression tasks. Term frequencies alone, however, offer little to work with mathematically when it comes to capturing the syntax and semantics of a word in a sentence. Then came the era of word embeddings, in which every word is represented as a vector and words with similar meanings lie close to each other in the vector space. This started with Word2Vec and GloVe.

Consider an example:
<ul>
<li>Sentence 1: Man is related to Woman</li>
<li>Sentence 2: Then King is related to ...</li>
</ul>

The analogy above can be expressed as vector arithmetic: <b>King - Man + Woman ≈ Queen</b>
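As a rough illustration of this arithmetic, the sketch below uses hand-made 3-dimensional vectors. The vectors, the `embeddings` dictionary and the `analogy` helper are invented for this example; real Word2Vec or GloVe vectors have hundreds of dimensions and are learned from data.

```python
import numpy as np

# Hypothetical embeddings with dimensions (royalty, gender, fruit), invented for illustration only.
embeddings = {
    "king":  np.array([0.9,  0.9, 0.0]),
    "queen": np.array([0.9, -0.9, 0.0]),
    "man":   np.array([0.1,  0.9, 0.0]),
    "woman": np.array([0.1, -0.9, 0.0]),
    "apple": np.array([0.0,  0.0, 1.0]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the three input words."""
    query = embeddings[a] - embeddings[b] + embeddings[c]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, embeddings[w]))

print(analogy("king", "man", "woman"))  # queen
```

With learned embeddings the result is only approximately equal to the target vector, which is why the nearest neighbour is searched for rather than an exact match.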

This can be achieved using word embeddings. The main limitation of such embeddings is the amount of information they can store. Word2Vec is trained with a shallow feed-forward network and assigns each word a single, context-independent vector, so a word receives the same vector even when it is used with different meanings. Such words are known as <b>polysemous</b> words. Handling polysemy led to more complex and deeper LSTM-based models.

ELMo (<i>Embeddings from Language Models</i>) and ULMFiT started a new trend: they marked the beginning of the transfer-learning era in NLP and let models capture the syntax and semantics of a word in its context. ELMo was the answer to the problem of <b>polysemy</b>: <i>the same word having different meanings depending on the context</i>.

<h2>Previous NLP model Architectures </h2>

![alt text](https://1.bp.blogspot.com/-RLAbr6kPNUo/W9is5FwUXmI/AAAAAAAADeU/5y9466Zoyoc96vqLjbruLK8i_t8qEdHnQCLcBGAs/s640/image3.png)

> <i>BERT is deeply bidirectional, OpenAI GPT is unidirectional, and ELMo is shallowly bidirectional.</i>

<b>ELMo</b> produces its final vector for a word as a weighted sum of the intermediate word vectors from two stacked biLM layers, each combining a forward pass (<i>context before the token/word</i>) and a backward pass (<i>context after the token/word</i>), together with the raw vector generated by character convolutions. This lets ELMo look at both past and future context, essentially the whole sentence, when generating a word vector, so a polysemous word receives a different vector in each context.
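The weighted sum described above can be sketched in a few lines of NumPy. This is a simplified illustration, not ELMo's implementation: the function name `elmo_combine` and the example vectors are made up, and in ELMo the layer weights and the scale factor are learned for each downstream task rather than fixed as they are here.

```python
import numpy as np

def elmo_combine(layer_vectors, layer_logits, gamma=1.0):
    """Combine the per-layer vectors of one token into a single vector.

    layer_vectors: array of shape (num_layers, dim)
    layer_logits:  unnormalized scalar weight per layer, shape (num_layers,)
    gamma:         overall scaling factor
    """
    weights = np.exp(layer_logits - np.max(layer_logits))
    weights = weights / weights.sum()  # softmax, so the layer weights sum to 1
    return gamma * (weights[:, None] * layer_vectors).sum(axis=0)

# Three made-up 4-dimensional layer outputs for one token:
# character-convolution layer, first biLM layer, second biLM layer.
layers = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
vector = elmo_combine(layers, layer_logits=np.zeros(3))
print(vector)
```

With equal logits every layer contributes one third; raising one logit shifts the final vector towards that layer.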

The true power of transfer learning in NLP was unleashed with <b>ULMFiT</b> (<i>Universal Language Model Fine-tuning</i>). The concept revolves around a Language Model (LM) trained on a generic corpus. Such LMs follow the same idea that ImageNet pre-training used to achieve transfer learning in Computer Vision. The two stages of transfer learning, <b>pre-training</b> and <b>fine-tuning</b>, which are still followed today, started with ULMFiT. In the pre-training stage, the LM is trained to learn generic information from a language corpus. When fine-tuning the pre-trained model on a downstream task, we train the model on task-specific data, and only the newly added final layers are trained from scratch. This results in better accuracy, because the initial layers carry generic language understanding and the last layers carry task-specific information. BERT is based on the same idea: fine-tuning a pre-trained language model helps the model achieve better results on downstream tasks.

Following ELMo and ULMFiT came <b>OpenAI GPT</b> (<i>Generative Pre-trained Transformer</i>). OpenAI GPT is based on the Transformer network proposed in the Google Brain research
paper "[Attention is all you need](https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf)". It replaced the LSTM architecture with a stack of Transformer decoder layers. GPT also emphasized the importance of the Transformer framework, which has a simpler architecture and can train faster than an LSTM-based model. It is also able to learn complex patterns in the data by using the attention mechanism. This started the breakthrough of <i>state-of-the-art</i> NLP frameworks built on <b>Transformers</b>, which include BERT.


<h2> Coming back to BERT... </h2>
BERT overcomes the unidirectionality constraint by using a “<i>Masked Language Model (MLM)</i>” pre-training objective. MLM randomly masks some of the tokens in the input, and the objective is to predict the original vocabulary id of each masked token based only on its context. This lets the representation fuse the left and the right context, which allows us to pre-train a deep bidirectional Transformer. In addition to the MLM, BERT also uses a “<i>next sentence prediction</i>” task that jointly pre-trains text-pair representations.

There are two steps involved in BERT:

![](https://www.researchgate.net/profile/Jan_Christian_Blaise_Cruz/publication/334160936/figure/fig1/AS:776030256111617@1562031439583/Overall-BERT-pretraining-and-finetuning-framework-Note-that-the-same-architecture-in.ppm)


*   Pre-training: the model is trained on unlabelled data over different pre-training tasks.
*   Fine-tuning: BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labelled data from the downstream task.

With a basic understanding of the two steps above, let's dive deeper into the BERT framework.


1.   <h3>BERT Model Architecture:</h3>
BERT's model architecture is a multi-layer bidirectional Transformer encoder. It is built from the encoder of the original encoder-decoder Transformer, whose components are summarized below.
    
    ![](https://mchromiak.github.io/articles/2017/Sep/12/Transformer-Attention-is-all-you-need/img/encoder.png)

    *   <b>Encoder</b>: In the original Transformer, the encoder is composed of a stack of N=6 identical layers (BERT-Base uses 12 layers and BERT-Large uses 24). Each layer has two sub-layers. The first sub-layer is a multi-head self-attention mechanism, and the second is a position-wise fully connected feed-forward network. There is a residual connection around each of the two sub-layers, followed by layer normalization.

    *   <b>Decoder</b>: The decoder of the original Transformer is also composed of N=6 identical layers. In addition to the two sub-layers found in each encoder layer, every decoder layer has a third sub-layer that performs multi-head attention over the output of the encoder stack. As in the encoder, there is a residual connection around each sub-layer, followed by layer normalization. BERT itself uses only the encoder stack.

    *   <b>Attention</b>: Attention is a mechanism for determining which words in the context contribute most to the representation of the current word. The attention scores are calculated as the dot product between the query vector Q and the key vector K. The output of an attention head is a weighted sum of the value vectors V, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
The general formula that sums up the whole process of attention calculation in BERT is:

          ![alt text](https://miro.medium.com/proxy/1*V6LGUR-0NmlOGmm0TDAa5g.png)

      where Q is the matrix of queries, and K and V are the matrices of keys and values.

      For a worked example of the attention calculation, see the [Analytics Vidhya blog](https://www.analyticsvidhya.com/blog/2019/06/understanding-transformers-nlp-state-of-the-art-models/?utm_source=blog&utm_medium=demystifying-bert-groundbreaking-nlp-framework)

2.   <h3> Pre-training BERT:</h3> BERT is pre-trained using two unsupervised tasks:
        <ul>
        <li> <b>Masked Language Model</b>: In order to train the bidirectional representation, BERT simply masks 15% of the input tokens at random and then predicts those masked tokens. A downside is that this creates a mismatch between pre-training and fine-tuning, since the [MASK] token does not appear during fine-tuning. To deal with this, BERT does not always replace the chosen tokens with the actual [MASK] token. The BERT training data generator chooses 15% of the token positions at random for prediction. If the i-th token is chosen, BERT replaces the i-th token with: <ul><li> the [MASK] token 80% of the time</li><li>a random token 10% of the time</li><li>the unchanged i-th token 10% of the time</li></ul>
        </li>
        <li><b> Next Sentence Prediction (NSP)</b>: In order to train a model that understands sentence relationships, we pre-train on a next sentence prediction task. Given two sentences A and B, 50% of the time B is the actual sentence that follows A (labeled as IsNext), and 50% of the time it is a random sentence from the corpus (labeled as NotNext).
        </li>
        </ul>
3.   <h3>Fine-tuning BERT:</h3>The self-attention mechanism in the Transformer allows BERT to model many downstream tasks, whether they involve a single text or a text pair. BERT uses self-attention to encode a concatenated text pair, which effectively includes bidirectional cross attention between the two sentences. For each task, we simply plug the task-specific inputs and outputs into BERT and fine-tune all the parameters end to end. At the output, the token representations are fed into an output layer for token-level tasks, such as sequence tagging or question answering, and the [CLS] representation is fed into an output layer for classification, such as sentiment analysis or entailment.
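The attention calculation described in the architecture section can be sketched with NumPy. This assumes the formula referred to there is the standard scaled dot-product attention from the Transformer paper, softmax(QK<sup>T</sup>/√d<sub>k</sub>)V, for a single head and without masking; the function name and the random inputs are illustrative only.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # compatibility of each query with each key
    scores = scores - scores.max(axis=-1, keepdims=True)  # subtract the row maximum for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))  # 3 query positions, d_k = 4
K = rng.normal(size=(3, 4))  # 3 key positions
V = rng.normal(size=(3, 2))  # one 2-dimensional value per key
output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.sum(axis=-1))  # every row of attention weights sums to 1
print(output.shape)          # one output vector per query position
```

Because no position is masked out, every token attends to tokens on both its left and its right, which is what makes the encoder bidirectional.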
      <!-- <ul><li><b></b></li></ul> -->
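The two pre-training tasks described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not BERT's data generator: the tiny `VOCAB`, the helper names `mask_tokens` and `make_nsp_pair`, and the per-token coin flip (instead of selecting a fixed 15% of positions) are all choices made for this example.

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]  # hypothetical toy vocabulary

def mask_tokens(tokens, rng, select_prob=0.15):
    """Return (corrupted_tokens, labels); labels hold the original token at chosen positions, else None."""
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() < select_prob:
            labels.append(token)                     # the model must predict the original token here
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)               # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(VOCAB))  # 10%: replace with a random token
            else:
                corrupted.append(token)              # 10%: keep the token unchanged
        else:
            labels.append(None)                      # position not chosen: nothing to predict
            corrupted.append(token)
    return corrupted, labels

def make_nsp_pair(sentences, i, rng):
    """Build one NSP example for sentence i (requires i + 1 < len(sentences) and at least 3 sentences)."""
    a = sentences[i]
    if rng.random() < 0.5:
        return a, sentences[i + 1], "IsNext"         # B really follows A
    others = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
    return a, rng.choice(others), "NotNext"          # B is a random other sentence

rng = random.Random(0)
tokens = "the cat sat on the mat".split()
corrupted, labels = mask_tokens(tokens, rng, select_prob=0.5)  # high rate so the short example shows changes
print(corrupted)
print(labels)

toy_corpus = ["the cat sat", "it purred", "the dog ran", "it barked"]
print(make_nsp_pair(toy_corpus, 0, rng))
```

Keeping some chosen tokens unchanged or swapping in a random token means the model cannot rely on seeing `[MASK]` at every position it has to predict.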

<h2> Now, let's start with the BERT implementation using PyTorch: </h2>

In [0]:
#install packages
#Implementing BERT using huggingface/transformer PyTorch library
!pip install transformers

In [0]:
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
import torch.nn.functional as F
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, BertConfig,AdamW, BertForSequenceClassification,get_linear_schedule_with_warmup
from tqdm import tqdm, trange
import pandas as pd
import io
import numpy as np
import matplotlib.pyplot as plt
import random
%matplotlib inline

In [0]:
# Select the GPU as the device if one is available; the training loop later moves each batch to this device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
if device.type == "cuda":
    # Only query the GPU name when a GPU exists; this call raises an error on CPU-only machines
    print(torch.cuda.get_device_name(0))

SEED = 2019

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if device.type == "cuda":
    torch.cuda.manual_seed_all(SEED)

In [0]:
from google.colab import drive
drive.mount('/content/gdrive')

In [0]:
## We are using CoLA dataset for single sentence classification
## It’s a set of sentences labeled as grammatically correct or incorrect
## Link to dataset : https://nyu-mll.github.io/CoLA/
## We will use the raw version because we need to use the BERT tokenizer to break the text down into tokens and chunks that the model will recognize.
# Upload the train file from your local drive

# To upload data from local machine at run time
#from google.colab import files
#uploaded = files.upload()

# The code below applies when Google Drive is mounted in the current Colab session.
# This is helpful when we want to store the trained model and download it to a local machine later.
import os
work_dir = '/content/gdrive/My Drive/BERTFineTuning'
os.makedirs(work_dir, exist_ok=True)  # create the folder on the first run, reuse it afterwards
os.chdir(work_dir)
os.listdir()



In [0]:
df = pd.read_csv("raw/in_domain_train.tsv", delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])
df.shape
df.sample(10)

## create label and sentence list
sentences = df.sentence.values

In [0]:
#check distribution of data based on labels
df.label.value_counts()

In [0]:
## Load the BERT tokenizer, which converts our text into tokens from the BERT vocabulary
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]
print("Tokenize the first sentence:")
print(tokenized_texts[0])

## We need to add special tokens at the beginning and end of each sentence for BERT to work properly
tokenized_texts = [["[CLS]"] + sentence + ["[SEP]"] for sentence in tokenized_texts]
print(' '.join(tokenized_texts[0]))
labels = df.label.values

<h2>BERT requires specifically formatted inputs. For each tokenized input sentence, we need to create:</h2>
<ul>
<li><b>input ids:</b> a sequence of integers mapping each input token to its index in the BERT tokenizer vocabulary
</li>
<li><b>segment mask:</b> <i>(optional)</i> a sequence of 1s and 0s used to distinguish the first sentence from the second in two-sentence inputs. For one-sentence inputs, this is simply a sequence of 0s. For two-sentence inputs, there is a 0 for each token of the first sentence, followed by a 1 for each token of the second sentence
</li>
<li><b>attention mask:</b> <i>(optional)</i> a sequence of 1s and 0s, with 1s for all input tokens and 0s for all padding tokens (we build this mask after padding the inputs)
</li>
<li><b>labels:</b> a single value of 1 or 0. In our task 1 means “grammatical” and 0 means “ungrammatical”
</li>
</ul>

In [0]:
# Set the maximum sequence length. The longest sequence in our training set is 47, but we'll leave room on the end anyway. 
# In the original paper, the authors used a length of 512.
MAX_LEN = 128
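To make the input format listed above concrete, here is a small sketch that builds the input ids, segment mask and attention mask for a sentence pair. The vocabulary `toy_vocab` and the helper `build_inputs` are hypothetical and exist only for this illustration; the real BERT vocabulary has about 30,000 WordPiece entries and different ids.

```python
# Hypothetical toy vocabulary, invented for illustration only.
toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "the": 3, "cat": 4, "sat": 5, "it": 6, "purred": 7}

def build_inputs(tokens_a, tokens_b=None, max_len=10):
    """Return (input_ids, segment_ids, attention_mask), each padded to max_len."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)                # first sentence: segment 0
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)   # second sentence and its [SEP]: segment 1
    input_ids = [toy_vocab[t] for t in tokens]
    attention_mask = [1] * len(input_ids)          # 1 for every real token
    padding = max_len - len(input_ids)
    if padding < 0:
        raise ValueError("sequence longer than max_len")
    return (input_ids + [0] * padding,             # 0 is the id of [PAD] in this toy vocabulary
            segment_ids + [0] * padding,
            attention_mask + [0] * padding)        # 0 for every padding position

ids, segments, mask = build_inputs(["the", "cat", "sat"], ["it", "purred"], max_len=10)
print(ids)       # [1, 3, 4, 5, 2, 6, 7, 2, 0, 0]
print(segments)  # [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
print(mask)      # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```

CoLA is a single-sentence task, so in this notebook the segment mask would be all zeros and is simply left out.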

In [0]:
# Use the BERT tokenizer to convert the tokens to their index numbers in the BERT vocabulary
input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]
print(tokenized_texts[0:2])
print(input_ids[0:2])

In [0]:
def pad_zeros(source, maxlen, dtype="long", truncating="post", padding="post"):
  # Pad with 0 (the id of [PAD]) or truncate every sequence so that it has exactly maxlen entries.
  # dtype, truncating and padding are accepted for call compatibility only:
  # padding and truncation are always applied at the end of the sequence ("post").
  # Note that truncating a long sequence also cuts off its final [SEP] token.
  for i, src in enumerate(source):
    if len(src) < maxlen:
      source[i] = src + [0] * (maxlen - len(src))
    elif len(src) > maxlen:
      source[i] = src[:maxlen]

  return source

In [0]:
# Pad our input tokens
input_ids = pad_zeros(input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")
print(input_ids[1:2])

In [0]:
## Create attention mask
attention_masks = []

## Create a mask of 1 for all input tokens and 0 for all padding tokens

for seq in input_ids:
  seq_mask = [float(i>0) for i in seq]
  attention_masks.append(seq_mask)

print(attention_masks[0])

In [0]:
# Hold out 10% of the data as a validation set using a single random split.
# Using the same random_state in both calls keeps inputs, labels and masks aligned.
train_inputs,validation_inputs,train_labels,validation_labels = train_test_split(input_ids,labels,random_state=SEED,test_size=0.1)
train_masks,validation_masks,_,_ = train_test_split(attention_masks,input_ids,random_state=SEED,test_size=0.1)

In [0]:
# convert all our data into torch tensors, required data type for our model
train_inputs = torch.tensor(train_inputs)
validation_inputs = torch.tensor(validation_inputs)
train_labels = torch.tensor(train_labels)
validation_labels = torch.tensor(validation_labels)
train_masks = torch.tensor(train_masks)
validation_masks = torch.tensor(validation_masks)

In [0]:
# Select a batch size for training. For fine-tuning BERT on a specific task, the authors recommend a batch size of 16 or 32
batch_size = 16

# Wrap our tensors in a torch DataLoader, which shuffles the training data and serves it in mini-batches,
# so only one batch at a time needs to be moved to the GPU
train_data = TensorDataset(train_inputs,train_masks,train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data,sampler=train_sampler,batch_size=batch_size)

validation_data = TensorDataset(validation_inputs,validation_masks,validation_labels)
validation_sampler = SequentialSampler(validation_data)  # no shuffling is needed for evaluation
validation_dataloader = DataLoader(validation_data,sampler=validation_sampler,batch_size=batch_size)

<h2>Train Model</h2>
<p>Now that our input data is properly formatted, it’s time to fine tune the BERT model.</p>

<p>For this task, we first want to modify the pre-trained BERT model to give outputs for classification, and then we want to continue training the model on 
our dataset until the entire model, end-to-end, is well-suited for our task. The huggingface pytorch implementation includes a set of interfaces 
designed for a variety of NLP tasks. Though these interfaces are all built on top of a trained BERT model, each has different top layers and output types 
designed to accommodate their specific NLP task.</p>

<p>We’ll load <i>BertForSequenceClassification</i>. This is the normal BERT model with a single linear layer added on top for classification, which we will use as a sentence classifier. As we feed input data, the entire pre-trained BERT model and the additional untrained classification layer are trained on our specific task.</p>


In [0]:
# Load BertForSequenceClassification, the pretrained BERT model with a single linear classification layer on top. 
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2).to(device)

<h2>For the purposes of fine-tuning, the authors recommend the following hyperparameter ranges:</h2>
<ul>
<li>Batch size: 16, 32</li>
<li>Learning rate (Adam): 5e-5, 3e-5, 2e-5</li>
<li>Number of epochs: 2, 3, 4</li>
</ul>

In [0]:
# Apply weight decay to all parameters except biases and LayerNorm weights
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay': 0.0}
]

In [0]:
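# Parameters:
lr = 1e-5
epochs = 3  # number of fine-tuning epochs; must match the value used in the training loop
# The scheduler is stepped once per batch, so the total number of steps is batches per epoch times epochs
num_training_steps = len(train_dataloader) * epochs
num_warmup_steps = num_training_steps // 10
warmup_proportion = float(num_warmup_steps) / float(num_training_steps)  # roughly 0.1

### In Transformers, the optimizer and the schedule are separate objects, instantiated like this:
# Note: recent versions of transformers deprecate this AdamW in favour of torch.optim.AdamW
optimizer = AdamW(optimizer_grouped_parameters, lr=lr, correct_bias=False)  # To reproduce BertAdam specific behavior set correct_bias=False
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps)  # PyTorch scheduler
#optimizer = AdamW(model.parameters(), lr=lr, correct_bias=False)  # To reproduce BertAdam specific behavior set correct_bias=False
criterion = torch.nn.BCEWithLogitsLoss()  # not used below: the model returns its own cross-entropy loss when labels are passed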
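The linear schedule with warmup used above can be understood with a small stand-alone sketch of the learning-rate multiplier at each step. The function name `linear_warmup_factor` and the step counts are chosen for illustration; the sketch is meant to mirror the behaviour of the schedule, not to replace the library call.

```python
def linear_warmup_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warmup: grow linearly from 0 to 1
        return step / max(1, num_warmup_steps)
    # Decay: shrink linearly from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 1e-5
for step in (0, 50, 100, 550, 1000):
    factor = linear_warmup_factor(step, num_warmup_steps=100, num_training_steps=1000)
    print(step, base_lr * factor)
```

This also shows why the total number of steps matters: if it is set too high, training ends while the learning rate is still far from zero, and if it is set too low the rate reaches zero before training finishes.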

In [0]:
# Function to calculate the accuracy of our predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)
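As a quick check of how the accuracy function behaves, the snippet below applies it to made-up logits. The function is repeated so that the snippet runs on its own, and the example values are invented for illustration.

```python
import numpy as np

def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()  # predicted class = column with the highest logit
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

# Made-up logits for a batch of four sentences (columns: class 0, class 1)
example_logits = np.array([[2.0, -1.0], [-0.5, 0.3], [0.1, 0.2], [1.5, 1.0]])
example_labels = np.array([0, 1, 0, 0])
print(flat_accuracy(example_logits, example_labels))  # 0.75
```

The third sentence is predicted as class 1 but labelled 0, so three of the four predictions are correct.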

In [0]:
# Store our loss and accuracy for plotting
train_loss_set = []

# Number of training epochs (authors recommend between 2 and 4)
epochs = 3

# trange is a tqdm wrapper around the normal python range
for _ in trange(epochs, desc="Epoch"):
  
  
  # Training
  # Tracking variables
  tr_loss = 0
  nb_tr_examples, nb_tr_steps = 0, 0
  
  # Train the data for one epoch
  for step, batch in enumerate(train_dataloader):
    # Add batch to GPU
    batch = tuple(t.to(device) for t in batch)
    # Unpack the inputs from our dataloader
    b_input_ids, b_input_mask, b_labels = batch
    
    # Set our model to training mode (as opposed to evaluation mode)
    model.train()

    # Clear out the gradients (by default they accumulate)
    optimizer.zero_grad()

    # Forward pass
    outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
    
    #print(loss)
    loss = outputs[0]
    train_loss_set.append(loss.item())    
    # Backward pass
    loss.backward()
    # Update parameters and take a step using the computed gradient
    optimizer.step()
    scheduler.step()  # Update learning rate schedule
    
    # Update tracking variables
    tr_loss += loss.item()
    nb_tr_examples += b_input_ids.size(0)
    nb_tr_steps += 1

  print("Train loss: {}".format(tr_loss/nb_tr_steps))
    
    
  # Validation

  # Put model in evaluation mode to measure accuracy on the validation set
  model.eval()

  # Tracking variables 
  eval_loss, eval_accuracy = 0, 0
  nb_eval_steps, nb_eval_examples = 0, 0

  # Evaluate data for one epoch
  for batch in validation_dataloader:
    # Add batch to GPU
    batch = tuple(t.to(device) for t in batch)
    # Unpack the inputs from our dataloader
    b_input_ids, b_input_mask, b_labels = batch
    # Telling the model not to compute or store gradients, saving memory and speeding up validation
    with torch.no_grad():
      # Forward pass, calculate logit predictions
      logits = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
    
    # Move logits and labels to CPU
    logits = logits[0].to('cpu').numpy()
    label_ids = b_labels.to('cpu').numpy()

    tmp_eval_accuracy = flat_accuracy(logits, label_ids)
    
    eval_accuracy += tmp_eval_accuracy
    nb_eval_steps += 1

  print("Validation Accuracy: {}".format(eval_accuracy/nb_eval_steps))

In [0]:
#Training Evaluation

plt.figure(figsize=(15,8))
plt.title("Training loss")
plt.xlabel("Batch")
plt.ylabel("Loss")
plt.plot(train_loss_set)
plt.show()

In [0]:
#Predict and Evaluate

# Upload the test file from your local drive
from google.colab import files
uploaded = files.upload()

In [0]:
df = pd.read_csv("out_of_domain_dev.tsv", delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])

# Create sentence and label lists
sentences = df.sentence.values

tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]

# We need to add special tokens at the beginning and end of each sentence for BERT to work properly
tokenized_texts = [["[CLS]"] + sentence + ["[SEP]"] for sentence in tokenized_texts]
labels = df.label.values

# Use the BERT tokenizer to convert the tokens to their index numbers in the BERT vocabulary
input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]
# Pad our input tokens
input_ids = pad_zeros(input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")
# Create attention masks
attention_masks = []

# Create a mask of 1s for each token followed by 0s for padding
for seq in input_ids:
  seq_mask = [float(i>0) for i in seq]
  attention_masks.append(seq_mask) 

prediction_inputs = torch.tensor(input_ids)
prediction_masks = torch.tensor(attention_masks)
prediction_labels = torch.tensor(labels)

prediction_data = TensorDataset(prediction_inputs, prediction_masks, prediction_labels)
prediction_sampler = SequentialSampler(prediction_data)
prediction_dataloader = DataLoader(prediction_data, sampler=prediction_sampler, batch_size=batch_size)

In [0]:
# Prediction on test set

# Put model in evaluation mode
model.eval()

# Tracking variables 
predictions , true_labels = [], []

# Predict 
for batch in prediction_dataloader:
  # Add batch to GPU
  batch = tuple(t.to(device) for t in batch)
  # Unpack the inputs from our dataloader
  b_input_ids, b_input_mask, b_labels = batch
  # Telling the model not to compute or store gradients, saving memory and speeding up prediction
  with torch.no_grad():
    # Forward pass, calculate logit predictions
    logits = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
  #print('logits: ',logits[0].cpu().numpy())

  # Move logits and labels to CPU
  logits = logits[0].cpu().numpy()
  label_ids = b_labels.to('cpu').numpy()
  
  # Store predictions and true labels
  predictions.append(logits)
  true_labels.append(label_ids)

In [0]:
# Import and evaluate each test batch using the Matthews correlation coefficient
from sklearn.metrics import matthews_corrcoef
matthews_set = []

for i in range(len(true_labels)):
  matthews = matthews_corrcoef(true_labels[i],
                 np.argmax(predictions[i], axis=1).flatten())
  matthews_set.append(matthews)

In [0]:
# Flatten the predictions and true values for an aggregate Matthews correlation evaluation on the whole dataset
flat_predictions = [item for sublist in predictions for item in sublist]
flat_predictions = np.argmax(flat_predictions, axis=1).flatten()
flat_true_labels = [item for sublist in true_labels for item in sublist]

In [0]:
matthews_corrcoef(flat_true_labels, flat_predictions)
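For intuition about the metric, the Matthews correlation coefficient for binary labels can be computed directly from the confusion-matrix counts as (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below is a hand-written illustration with made-up labels; the helper name `mcc_binary` is an assumption, and returning 0 when the denominator is zero is a convention chosen here.

```python
import numpy as np

def mcc_binary(y_true, y_pred):
    """Matthews correlation coefficient for labels in {0, 1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    denom = float(np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    if denom == 0:
        return 0.0  # convention used in this sketch when a row or column of the confusion matrix is empty
    return float((tp * tn - fp * fn) / denom)

print(mcc_binary([1, 1, 0, 0], [1, 1, 0, 0]))  # 1.0  (perfect agreement)
print(mcc_binary([1, 1, 0, 0], [0, 0, 1, 1]))  # -1.0 (perfect disagreement)
print(mcc_binary([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.0  (no better than chance)
```

Unlike accuracy, the score stays near 0 for a classifier that always predicts the majority class, which is why it suits an imbalanced dataset such as CoLA.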

In [0]:
# Separate example: list the 10 most likely next tokens that GPT-2 predicts for a prompt.
# The GPT-2 objects get their own names so the fine-tuned BERT `model` above is not overwritten.
import torch
import torch.nn.functional as F
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tok = GPT2Tokenizer.from_pretrained("gpt2")
gpt2_model = GPT2LMHeadModel.from_pretrained("gpt2")
gpt2_model.eval()

text = "Today is a "
gpt2_input_ids = torch.tensor(tok.encode(text)).unsqueeze(0)
with torch.no_grad():
  # Logits for the token that follows the prompt
  next_token_logits = gpt2_model(gpt2_input_ids)[0][:, -1]
probs = F.softmax(next_token_logits, dim=-1).squeeze()
idxs = torch.argsort(probs, descending=True)
for idx in range(10):
  print(tok.decode([int(idxs[idx])]))