## Installing Transformers

In [1]:
# !pip install transformers



## Importing Libraries


In [6]:
import pandas as pd
import numpy as np
import torch
from transformers import BertForQuestionAnswering
from transformers import BertTokenizer
import os

## Get Current Working Directory

In [7]:
my_dir = os.getcwd() + '/'

For question answering, a model that has already been fine-tuned can give decent results even when our text comes from a completely different domain. Here we use a BERT model that is fine-tuned on the SQuAD benchmark.

For our task, we will use the BertForQuestionAnswering class from the transformers library.

Expect the download to take a couple of minutes: BERT-large is a big model with 24 layers and 340M parameters, which makes it about 1.34GB.
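
As a rough sanity check on that size, the sketch below estimates the weight file size from the parameter count. It assumes every parameter is stored as a 32-bit float (4 bytes), which is not stated above, and `estimate_size_gb` is a hypothetical helper written for this illustration. With the rounded 340M figure the estimate lands in the same range as the quoted size.

```python
def estimate_size_gb(num_parameters, bytes_per_parameter=4):
    """Estimate checkpoint size in gigabytes (1 GB = 1e9 bytes).

    bytes_per_parameter=4 assumes float32 weights (an assumption).
    """
    return num_parameters * bytes_per_parameter / 1e9


# Rounded parameter count quoted in the text above.
size_gb = estimate_size_gb(340_000_000)
print(f"Estimated size: {size_gb:.2f} GB")
```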

In [3]:
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

Downloading config.json:   0%|          | 0.00/443 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.25G [00:00<?, ?B/s]

Downloading vocab.txt:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

## Saving Tokenizer and Model

In [8]:
tokenizer.save_pretrained(my_dir+"save_tokenizer")
model.save_pretrained(my_dir+"save_model")
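
The save cell builds its paths by concatenating strings, which is easy to get wrong when a separator is missing. A minimal sketch of the same idea with `pathlib` follows; `save_dirs` is a hypothetical helper, and the directory names are the ones used in this notebook. If a library call expects a plain string, the paths can be wrapped in `str()`.

```python
from pathlib import Path


def save_dirs(base_dir):
    """Return (tokenizer_dir, model_dir) under base_dir without manual '/' handling."""
    base = Path(base_dir)
    return base / "save_tokenizer", base / "save_model"


tokenizer_dir, model_dir = save_dirs(Path.cwd())
print(tokenizer_dir)
print(model_dir)
```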

## Loading Tokenizer From Save Directory

In [10]:
tokenizer = BertTokenizer.from_pretrained('./save_tokenizer/')

## Loading Model From Save Directory

In [12]:
model = BertForQuestionAnswering.from_pretrained("./save_model/")

## Asking a Question

In [27]:
text = """
Stocks give you partial ownership in a corporation, while bonds are a loan from you 
to a company or government. The biggest difference between them is how they generate profit: 
stocks must appreciate in value and be sold later on the stock market, while most bonds pay
fixed interest over time."""

question = "what is Bond?"

Let’s tokenize the question and text as a pair.

In [16]:
input_ids = tokenizer.encode(question, text)
# Let’s see how many tokens this question and text pair have.
print("The input has a total of {} tokens.".format(len(input_ids)))

The input has a total of 63 tokens.


To look at what our tokenizer is doing, let’s just print out the tokens and their IDs.

In [17]:
tokens = tokenizer.convert_ids_to_tokens(input_ids)
for token, token_id in zip(tokens, input_ids):
    # token_id avoids shadowing the built-in id()
    print('{:8}{:8,}'.format(token, token_id))

[CLS]        101
what       2,054
is         2,003
bond       5,416
?          1,029
[SEP]        102
stocks    15,768
give       2,507
you        2,017
partial    7,704
ownership   6,095
in         1,999
a          1,037
corporation   3,840
,          1,010
while      2,096
bonds      9,547
are        2,024
a          1,037
loan       5,414
from       2,013
you        2,017
to         2,000
a          1,037
company    2,194
or         2,030
government   2,231
.          1,012
the        1,996
biggest    5,221
difference   4,489
between    2,090
them       2,068
is         2,003
how        2,129
they       2,027
generate   9,699
profit     5,618
:          1,024
stocks    15,768
must       2,442
appreciate   9,120
in         1,999
value      3,643
and        1,998
be         2,022
sold       2,853
later      2,101
on         2,006
the        1,996
stock      4,518
market     3,006
,          1,010
while      2,096
most       2,087
bonds      9,547
pay        3,477
fixed      4,964
inte

BERT has its own way of processing the tokenized inputs. The output above shows two special tokens, [CLS] and [SEP]. [CLS] stands for classification: it is placed at the start of every input, and its final representation is used for sentence-level classification tasks. [SEP] separates the two pieces of text. The full input contains two [SEP] tokens, one after the question and another after the text.
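
To make the layout concrete, here is a minimal sketch that assembles a question and text pair by hand. `build_pair_tokens` is a hypothetical helper, not part of the transformers library, and the text tokens are shortened for illustration.

```python
def build_pair_tokens(question_tokens, text_tokens):
    """Lay out a question/text pair as [CLS] question [SEP] text [SEP]."""
    return ["[CLS]"] + question_tokens + ["[SEP]"] + text_tokens + ["[SEP]"]


question_tokens = ["what", "is", "bond", "?"]
text_tokens = ["bonds", "are", "a", "loan"]  # shortened text, for illustration only

pair = build_pair_tokens(question_tokens, text_tokens)
print(pair)
print("first [SEP] at index", pair.index("[SEP]"))
```

With a four-token question, the first [SEP] sits at index 5, which is where the question segment ends.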

Apart from the “Token Embeddings”, BERT also uses “Segment Embeddings” and “Position Embeddings” internally. Segment embeddings help BERT tell the question apart from the text: in practice, every token from sentence 1 is marked with a 0 and every token from sentence 2 with a 1. Position embeddings specify the position of each token in the sequence. All of these embeddings are added together and fed to the input layer.
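
The toy sketch below shows how the three embeddings combine for each position: the token, segment and position vectors are looked up and summed element-wise. The table sizes, the random values and the helper names are all assumptions made for illustration; the real model uses learned tables (hidden size 1024 for BERT-large) and applies layer normalization and dropout after the sum.

```python
import random

random.seed(0)
HIDDEN = 4  # toy hidden size; BERT-large uses 1024


def make_table(rows, dim=HIDDEN):
    """Create a toy embedding table filled with random values."""
    return [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(rows)]


token_table = make_table(10)    # toy vocabulary of 10 token ids
segment_table = make_table(2)   # segment 0 (question) and segment 1 (text)
position_table = make_table(8)  # up to 8 positions


def embed(input_ids, segment_ids):
    """Sum the token, segment and position vectors for every position."""
    embedded = []
    for position, (token_id, segment_id) in enumerate(zip(input_ids, segment_ids)):
        vector = [
            t + s + p
            for t, s, p in zip(
                token_table[token_id],
                segment_table[segment_id],
                position_table[position],
            )
        ]
        embedded.append(vector)
    return embedded


toy_input_ids = [1, 5, 2, 7, 2]
toy_segment_ids = [0, 0, 0, 1, 1]
vectors = embed(toy_input_ids, toy_segment_ids)
print(len(vectors), "positions x", len(vectors[0]), "dimensions")
```

Token id 2 appears twice in the toy input, but its two vectors differ because the segment and position contributions differ.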

The Transformers library can create the segment IDs on its own using PreTrainedTokenizer.encode_plus(), but we can also create them ourselves: we just need to specify a 0 or 1 for each token.

In [18]:
#first occurrence of [SEP] token
sep_idx = input_ids.index(tokenizer.sep_token_id)
print("SEP token index: ", sep_idx)

#number of tokens in segment A (question) - this will be one more than the sep_idx as the index in Python starts from 0
num_seg_a = sep_idx+1
print("Number of tokens in segment A: ", num_seg_a)

#number of tokens in segment B (text)
num_seg_b = len(input_ids) - num_seg_a
print("Number of tokens in segment B: ", num_seg_b)

#creating the segment ids
segment_ids = [0]*num_seg_a + [1]*num_seg_b

#making sure that every input token has a segment id
assert len(segment_ids) == len(input_ids)

SEP token index:  5
Number of tokens in segment A:  6
Number of tokens in segment B:  57


## Let’s now feed this to our model.

We pass the token input_ids to represent the input and the token segment_ids to differentiate our two segments, the question and the text.

In [19]:
output = model(torch.tensor([input_ids]),  token_type_ids=torch.tensor([segment_ids]))

We look at the most probable start and end tokens and provide an answer only if the end token does not come before the start token.

In [20]:
#tokens with highest start and end scores
answer_start = torch.argmax(output.start_logits)

answer_end = torch.argmax(output.end_logits)

print("\nQuestion:\n{}".format(question.capitalize()))

if answer_end >= answer_start:
    answer = " ".join(tokens[answer_start:answer_end+1])
    print("\nAnswer:\n{}.".format(answer.capitalize()))
else:
    #no valid span: report it instead of printing an undefined answer
    print("\nI am unable to find the answer to this question. Can you please ask another question?")


Question:
What is bond?

Answer:
A loan from you to a company or government.
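
Taking the argmax of the start and end scores independently can produce an end that falls before the start, in which case no answer is returned. A common alternative, sketched below, is to search for the valid span with the highest combined score. This is not what the cell above does: `best_span` and `max_answer_len` are hypothetical names, and the logits are toy values passed as plain Python lists.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end) maximising start_logits[start] + end_logits[end]
    subject to start <= end < start + max_answer_len."""
    best = None
    best_score = float("-inf")
    for start, start_score in enumerate(start_logits):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best


# Toy logits where independent argmax would give start=3 and end=1 (invalid).
start_logits = [0.1, 2.0, 0.3, 2.5, 0.2]
end_logits = [0.0, 3.0, 0.1, 0.4, 2.0]
print(best_span(start_logits, end_logits))
```

On these toy values the search returns a span whose end is not before its start, whereas the independent argmax would have produced no answer.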


Let us now turn this question-answering process into a function for ease.

In [22]:
def question_answer(question, text):
    
    #tokenize question and text as a pair
    input_ids = tokenizer.encode(question, text)
    
    #string version of tokenized ids
    tokens = tokenizer.convert_ids_to_tokens(input_ids)
    
    #segment IDs
    #first occurrence of [SEP] token
    sep_idx = input_ids.index(tokenizer.sep_token_id)
    #number of tokens in segment A (question)
    num_seg_a = sep_idx+1
    #number of tokens in segment B (text)
    num_seg_b = len(input_ids) - num_seg_a
    
    #list of 0s and 1s for segment embeddings
    segment_ids = [0]*num_seg_a + [1]*num_seg_b
    assert len(segment_ids) == len(input_ids)
    
    #model output using input_ids and segment_ids
    output = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([segment_ids]))
    
    #reconstructing the answer
    answer_start = torch.argmax(output.start_logits)
    answer_end = torch.argmax(output.end_logits)
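    #fallback used when no valid answer span is found
    answer = "Unable to find the answer to your question."
    #a valid span ends at or after its start and does not begin at [CLS]
    if answer_end >= answer_start and tokens[answer_start] != "[CLS]":
        answer = tokens[answer_start]
        for i in range(answer_start+1, answer_end+1):
            #WordPiece continuations start with "##" and attach to the previous token
            if tokens[i][0:2] == "##":
                answer += tokens[i][2:]
            else:
                answer += " " + tokens[i]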
    
    print("\nPredicted answer:\n{}".format(answer.capitalize()))

In [28]:
question = "Where is stock?"
question_answer(question, text)


Predicted answer:
Partial ownership in a corporation
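
The answer reconstruction inside `question_answer` glues WordPiece continuation tokens (those starting with `##`) back onto the preceding token. The same step is sketched below in isolation; `merge_wordpieces` is a hypothetical helper and the token split is made up for illustration.

```python
def merge_wordpieces(tokens):
    """Join WordPiece tokens into text, gluing '##' continuations to the previous piece."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # continuation: append without a space
        else:
            words.append(token)     # new word
    return " ".join(words)


# Hypothetical token split, for illustration only.
print(merge_wordpieces(["the", "token", "##izer", "splits", "rare", "words"]))
```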


Here is a small loop to test how well BERT understands context. It simply wraps the question-answering process in a loop so that we can play around with the model.

In [29]:
question = input("\nPlease enter your question: \n")
while True:
    question_answer(question, text)

    #keep asking until the response starts with Y or N (any case, empty input allowed)
    response = ""
    while not response.startswith(("Y", "N")):
        response = input("\nDo you want to ask another question based on this text (Y/N)? ").strip().upper()

    if response.startswith("N"):
        print("\nBye!")
        break

    question = input("\nPlease enter your question: \n")


Please enter your question: 
What is Stock

Predicted answer:
Stocks give you partial ownership in a corporation

Do you want to ask another question based on this text (Y/N)? Y

Please enter your question: 
What is difference between stock and bond?

Predicted answer:
How they generate profit

Do you want to ask another question based on this text (Y/N)? What is Bond

Do you want to ask another question based on this text (Y/N)? Y

Please enter your question: 
 What is Bond?

Predicted answer:
A loan from you to a company or government

Do you want to ask another question based on this text (Y/N)? N

Bye!
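
In the session above, a reply that is neither Y nor N is simply asked again. One way to keep that check separate from the loop is sketched below; `parse_yes_no` is a hypothetical helper, and accepting the full words "yes" and "no" is an assumption that goes beyond the notebook's prompt.

```python
def parse_yes_no(response):
    """Return True for yes, False for no, and None for anything else."""
    cleaned = response.strip().lower()
    if cleaned in ("y", "yes"):
        return True
    if cleaned in ("n", "no"):
        return False
    return None  # caller should ask again


for reply in ["Y", " n ", "", "What is Bond"]:
    print(repr(reply), "->", parse_yes_no(reply))
```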
