# Using BERT to tokenize sentence pairs

The BERT tokenizer can also encode a pair of text sequences, as in question-answering scenarios. In that case, the token_type_ids tensor
indicates which of the two text sequences each token belongs to:

0 for the first sentence.

1 for the second sentence.
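
To make the layout concrete, here is a minimal plain-Python sketch that rebuilds the segment ids by hand. The helper name `build_pair_inputs` is hypothetical and is not part of the transformers library. It assumes the usual BERT pair layout `[CLS] A [SEP] B [SEP]`, in which `[CLS]` and the first `[SEP]` are counted with the first sequence and the final `[SEP]` with the second.

```python
def build_pair_inputs(tokens_a, tokens_b):
    """Hypothetical helper: mimic BERT's pair layout for already-tokenized text."""
    # First segment: [CLS] + tokens of sequence A + [SEP], all marked 0.
    first = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    # Second segment: tokens of sequence B + closing [SEP], all marked 1.
    second = list(tokens_b) + ["[SEP]"]
    tokens = first + second
    token_type_ids = [0] * len(first) + [1] * len(second)
    return tokens, token_type_ids


tokens, token_type_ids = build_pair_inputs(["question", "1"],
                                           ["answer", "question", "1"])
print(tokens)
print(token_type_ids)
```

This is only an illustration of the numbering scheme; in practice the tokenizer produces both lists itself.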

In the following code, the dataset consists of question/answer pairs. When we tokenize a pair with the BERT tokenizer, both values appear in the
token_type_ids tensor it returns: 0 marks the tokens of the first text sequence in the pair (the question), and 1 marks the tokens of the second
text sequence (the answer). The special tokens follow the same rule: [CLS] and the first [SEP] receive 0, and the final [SEP] receives 1.


In [19]:
from transformers import BertTokenizer, BertModel
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string

In [20]:
Dataset = [["Question 1", "Answer to question 1"],
           ["Question 2", "Answer to question 2"]]
print("Original dataset:\n", Dataset)

Original dataset:
 [['Question 1', 'Answer to question 1'], ['Question 2', 'Answer to question 2']]


In [21]:
# Requires the NLTK resources for tokenization, stopwords, and WordNet
# (for example, nltk.download('punkt'), nltk.download('stopwords'), nltk.download('wordnet')).
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Lowercase, tokenize, drop stopwords and punctuation, then lemmatize.
    tokens = word_tokenize(text.lower())
    tokens = [lemmatizer.lemmatize(token) for token in tokens
              if token not in stop_words and token not in string.punctuation]
    return ' '.join(tokens)

# Preprocess the question and the answer of each pair separately.
preprocessed_dataset = [[preprocess(QApair[0]), preprocess(QApair[1])]
                        for QApair in Dataset]
print("Preprocessed dataset:\n", preprocessed_dataset)

Preprocessed dataset:
 [['question 1', 'answer question 1'], ['question 2', 'answer question 2']]


In [22]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')



In [23]:
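def generate_embedding(QApair):
    # At this stage the function only prints the tokenization; it returns nothing.
    print("Preprocessed QApair:\n", QApair)
    # Passing two strings makes the tokenizer encode them as one sequence pair.
    inputs = tokenizer(QApair[0], QApair[1], return_tensors='pt')
    # Map the input ids back to their WordPiece tokens for inspection.
    bert_tokenized_text = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
    print("BERT tokens:\n", bert_tokenized_text)
    print("BERT token_type_ids:\n", inputs['token_type_ids'])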

In [24]:
for QApair in preprocessed_dataset:
    generate_embedding(QApair)

Preprocessed QApair:
 ['question 1', 'answer question 1']
BERT tokens:
 ['[CLS]', 'question', '1', '[SEP]', 'answer', 'question', '1', '[SEP]']
BERT token_type_ids:
 tensor([[0, 0, 0, 0, 1, 1, 1, 1]])
Preprocessed QApair:
 ['question 2', 'answer question 2']
BERT tokens:
 ['[CLS]', 'question', '2', '[SEP]', 'answer', 'question', '2', '[SEP]']
BERT token_type_ids:
 tensor([[0, 0, 0, 0, 1, 1, 1, 1]])
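
One possible use of these ids is to separate the question tokens from the answer tokens again. The sketch below does this in plain Python on the first output shown above. The helper `split_by_segment` is a hypothetical name introduced only for illustration, and the special tokens are filtered out by name under the assumption that they are `[CLS]` and `[SEP]`.

```python
def split_by_segment(tokens, token_type_ids):
    """Hypothetical helper: group tokens by segment id, skipping special tokens."""
    special = {"[CLS]", "[SEP]"}
    first = [t for t, s in zip(tokens, token_type_ids)
             if s == 0 and t not in special]
    second = [t for t, s in zip(tokens, token_type_ids)
              if s == 1 and t not in special]
    return first, second


# Values copied from the first printed output above.
tokens = ['[CLS]', 'question', '1', '[SEP]', 'answer', 'question', '1', '[SEP]']
token_type_ids = [0, 0, 0, 0, 1, 1, 1, 1]

question_tokens, answer_tokens = split_by_segment(tokens, token_type_ids)
print(question_tokens)  # ['question', '1']
print(answer_tokens)    # ['answer', 'question', '1']
```

With the tokenizer output from the cells above, the id list could be obtained with `inputs['token_type_ids'][0].tolist()` instead of being typed in by hand.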
