## Question Answering with BERT and Hugging Face

In [1]:
from transformers import pipeline
# If the import above fails, make sure the following dependencies are installed:
# conda install astunparse numpy ninja pyyaml setuptools cmake cffi typing_extensions future six requests dataclasses



In [2]:
#Initialize the pipeline
answerer = pipeline(task="question-answering", model="distilbert-base-cased-distilled-squad")

In [3]:
context= '''
Tea is an aromatic beverage prepared by pouring hot or boiling water over cured or fresh leaves of Camellia sinensis,
an evergreen shrub native to China and East Asia. After water, it is the most widely consumed drink in the world. 
There are many different types of tea; some, like Chinese greens and Darjeeling, have a cooling, slightly bitter, 
and astringent flavour, while others have vastly different profiles that include sweet, nutty, floral, or grassy 
notes. Tea has a stimulating effect in humans primarily due to its caffeine content.

The tea plant originated in the region encompassing today's Southwest China, Tibet, north Myanmar and Northeast India,
where it was used as a medicinal drink by various ethnic groups. An early credible record of tea drinking dates to 
the 3rd century AD, in a medical text written by Hua Tuo. It was popularised as a recreational drink during the 
Chinese Tang dynasty, and tea drinking spread to other East Asian countries. Portuguese priests and merchants 
introduced it to Europe during the 16th century. During the 17th century, drinking tea became fashionable among the 
English, who started to plant tea on a large scale in India.

The term herbal tea refers to drinks not made from Camellia sinensis: infusions of fruit, leaves, or other plant 
parts, such as steeps of rosehip, chamomile, or rooibos. These may be called tisanes or herbal infusions to prevent
confusion with 'tea' made from the tea plant.
'''

In [5]:
result = answerer(question="Where is tea native to?", context=context)
print(result)

{'score': 0.8982149958610535, 'start': 148, 'end': 167, 'answer': 'China and East Asia'}


In [6]:
print(result["answer"])

China and East Asia
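
The `start` and `end` fields in the result are character offsets into `context`: in the output above, 167 - 148 = 19, which is the length of "China and East Asia". The sketch below illustrates that relationship without calling the model. It uses a shortened stand-in context and a hand-built dictionary that only imitates the shape of the pipeline output; the score is a placeholder, not a model prediction.

```python
# Shortened stand-in for the tea context used above (not the full passage).
context = "Camellia sinensis is an evergreen shrub native to China and East Asia."

# Hand-built dictionary imitating the pipeline's output shape.
# Assumption: 'start' and 'end' are character offsets, with 'end' exclusive.
answer = "China and East Asia"
start = context.find(answer)
result = {"score": 1.0, "start": start, "end": start + len(answer), "answer": answer}

# Slicing the context with the offsets recovers the answer text.
recovered = context[result["start"]:result["end"]]
print(recovered)
```

This check is handy when post-processing answers, for example to highlight the answer span inside the original passage.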


In [7]:
questions = ["Where is tea native to?",
             "When was tea discovered?",
             "What is the species name for tea?"]
result = answerer(question=questions, context=context)
for q, r in zip(questions, result):
    print(f"Question: {q} \n Answer:{r['answer']}\n\n")

Question: Where is tea native to? 
 Answer:China and East Asia


Question: When was tea discovered? 
 Answer:3rd century AD


Question: What is the species name for tea? 
 Answer:Camellia sinensis




In [8]:
context1 = '''
The Golden Age of Comic Books describes an era of American comic books from the 
late 1930s to circa 1950. During this time, modern comic books were first published 
and rapidly increased in popularity. The superhero archetype was created and many 
well-known characters were introduced, including Superman, Batman, Captain Marvel 
(later known as SHAZAM!), Captain America, and Wonder Woman.
Between 1939 and 1941 Detective Comics and its sister company, All-American Publications, 
introduced popular superheroes such as Batman and Robin, Wonder Woman, the Flash, 
Green Lantern, Doctor Fate, the Atom, Hawkman, Green Arrow and Aquaman.[7] Timely Comics, 
the 1940s predecessor of Marvel Comics, had million-selling titles featuring the Human Torch,
the Sub-Mariner, and Captain America.[8]
As comic books grew in popularity, publishers began launching titles that expanded 
into a variety of genres. Dell Comics' non-superhero characters (particularly the 
licensed Walt Disney animated-character comics) outsold the superhero comics of the day.[12] 
The publisher featured licensed movie and literary characters such as Mickey Mouse, Donald Duck,
Roy Rogers and Tarzan.[13] It was during this era that noted Donald Duck writer-artist
Carl Barks rose to prominence.[14] Additionally, MLJ's introduction of Archie Andrews
in Pep Comics #22 (December 1941) gave rise to teen humor comics,[15] with the Archie 
Andrews character remaining in print well into the 21st century.[16]
At the same time in Canada, American comic books were prohibited importation under 
the War Exchange Conservation Act[17] which restricted the importation of non-essential 
goods. As a result, a domestic publishing industry flourished during the duration 
of the war which were collectively informally called the Canadian Whites.
The educational comic book Dagwood Splits the Atom used characters from the comic 
strip Blondie.[18] According to historian Michael A. Amundson, appealing comic-book 
characters helped ease young readers' fear of nuclear war and neutralize anxiety 
about the questions posed by atomic power.[19] It was during this period that long-running 
humor comics debuted, including EC's Mad and Carl Barks' Uncle Scrooge in Dell's Four 
Color Comics (both in 1952).[20][21]
'''

In [10]:
questions = ["What popular superheroes were introduced between 1939 and 1941?",
             "What superheroes were introduced between 1939 and 1941 by Detective Comics and its sister company?",
             "What comic book characters were created between 1939 and 1941?",
             "What well-known characters were created between 1939 and 1941?",
             "What well-known superheroes were introduced between 1939 and 1941 by Detective Comics?"]
result = answerer(question=questions, context=context1)
for q, r in zip(questions, result):
    print(f"Question: {q} \n Answer:{r['answer']}\n\n")

Question: What popular superheroes were introduced between 1939 and 1941? 
 Answer:teen humor comics


Question: What superheroes were introduced between 1939 and 1941 by Detective Comics and its sister company? 
 Answer:Archie Andrews


Question: What comic book characters were created between 1939 and 1941? 
 Answer:Archie 
Andrews


Question: What well-known characters were created between 1939 and 1941? 
 Answer:Archie 
Andrews


Question: What well-known superheroes were introduced between 1939 and 1941 by Detective Comics? 
 Answer:Archie Andrews
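
Some of the answers above are printed across two lines ("Archie" / "Andrews"). The pipeline extracts answers as spans of the context, so a line break inside the context is carried into the answer. A minimal sketch of a cleanup step, assuming a hypothetical helper named `normalize_answer` (it is not part of `transformers`):

```python
def normalize_answer(answer: str) -> str:
    """Collapse runs of whitespace, including newlines, into single spaces."""
    return " ".join(answer.split())


# The raw span as it appears in the output above: a trailing space and a newline.
raw_answer = "Archie \nAndrews"
print(normalize_answer(raw_answer))
```

Applying the same normalization to the context before calling the pipeline would be an alternative, at the cost of changing the character offsets.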




###### This model is a fan of Archie Andrews. We will fine-tune it on the TyDi QA dataset to get better answers.

## Fine-Tuning QA with Hugging Face Transformers

In [11]:
from datasets import load_dataset

In [12]:
train_data = load_dataset('tydiqa', 'primary_task')
tydiqa_data = train_data.filter(lambda example: example['language']=='english')

Downloading and preparing dataset tydiqa/primary_task (download: 1.82 GiB, generated: 5.62 GiB, post-processed: Unknown size, total: 7.44 GiB) to C:\Users\Sheraz\.cache\huggingface\datasets\tydiqa\primary_task\1.0.0\b8a6c4c0db10bf5703d7b36645e5dbae821b8c0e902dac9daeecd459a8337148...

Generating train split:   0%|          | 0/166916 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/18670 [00:00<?, ? examples/s]

Dataset tydiqa downloaded and prepared to C:\Users\Sheraz\.cache\huggingface\datasets\tydiqa\primary_task\1.0.0\b8a6c4c0db10bf5703d7b36645e5dbae821b8c0e902dac9daeecd459a8337148. Subsequent calls will reuse this data.

In [13]:
type(tydiqa_data['train'])

datasets.arrow_dataset.Dataset

In [14]:
tydiqa_data['train']

Dataset({
    features: ['passage_answer_candidates', 'question_text', 'document_title', 'language', 'annotations', 'document_plaintext', 'document_url'],
    num_rows: 9211
})

In [15]:
idx = 600
# start index
start_index = tydiqa_data['train'][idx]['annotations']['minimal_answers_start_byte'][0]
# end index
end_index = tydiqa_data['train'][idx]['annotations']['minimal_answers_end_byte'][0]

print("Question: " + tydiqa_data['train'][idx]['question_text'])
print("\nContext (truncated): "+ tydiqa_data['train'][idx]['document_plaintext'][0:512] + '...')
print("\nAnswer: " + tydiqa_data['train'][idx]['document_plaintext'][start_index:end_index])

Question: What mental effects can a mother experience after childbirth?

Context (truncated): 

Postpartum depression (PPD), also called postnatal depression, is a type of mood disorder associated with childbirth, which can affect both sexes.[1][3] Symptoms may include extreme sadness, low energy, anxiety, crying episodes, irritability, and changes in sleeping or eating patterns.[1] Onset is typically between one week and one month following childbirth.[1] PPD can also negatively affect the newborn child.[2]

While the exact cause of PPD is unclear, the cause is believed to be a combination of physi...

Answer: Postpartum depression (PPD)
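
The field names (`minimal_answers_start_byte`, `minimal_answers_end_byte`) suggest that the offsets count bytes, whereas Python string slicing counts characters. The two coincide as long as the text before the answer is plain ASCII, which appears to be the case in the example above. The sketch below shows how they can diverge and one way to convert between them. It assumes UTF-8 encoding, and `byte_to_char_offset` is a hypothetical helper, not a `datasets` function.

```python
def byte_to_char_offset(text: str, byte_offset: int) -> int:
    """Convert a UTF-8 byte offset into a Python string (character) index."""
    prefix = text.encode("utf-8")[:byte_offset]
    # errors="ignore" drops a trailing partial character if the offset splits one.
    return len(prefix.decode("utf-8", errors="ignore"))


# Toy text with one non-ASCII character ("é" takes two bytes in UTF-8).
text = "Café au lait is a coffee drink."
answer = "coffee"

# Byte offsets of the answer, computed on the encoded text.
byte_start = text.encode("utf-8").find(answer.encode("utf-8"))
byte_end = byte_start + len(answer.encode("utf-8"))

char_start = byte_to_char_offset(text, byte_start)
char_end = byte_to_char_offset(text, byte_end)

print(repr(text[byte_start:byte_end]))  # sliced with byte offsets: shifted by one
print(repr(text[char_start:char_end]))  # sliced with character offsets
```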


In [16]:
tydiqa_data['train'][0]['annotations']

{'passage_answer_candidate_index': [-1],
 'minimal_answers_start_byte': [-1],
 'minimal_answers_end_byte': [-1],
 'yes_no_answer': ['NONE']}
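
In this record every offset is `-1`, which the preprocessing code later in this notebook treats as "no minimal answer". A minimal sketch of that check, using the record above plus a second, made-up record whose offsets are purely illustrative; `has_minimal_answer` is a hypothetical helper name:

```python
def has_minimal_answer(annotations: dict) -> bool:
    """Return True if the first annotation marks a minimal answer span."""
    return annotations["minimal_answers_start_byte"][0] != -1


# The annotation shown above: no minimal answer.
no_answer = {
    "passage_answer_candidate_index": [-1],
    "minimal_answers_start_byte": [-1],
    "minimal_answers_end_byte": [-1],
    "yes_no_answer": ["NONE"],
}

# Made-up annotation for illustration only; the offsets are not from the dataset.
with_answer = {
    "passage_answer_candidate_index": [0],
    "minimal_answers_start_byte": [2],
    "minimal_answers_end_byte": [29],
    "yes_no_answer": ["NONE"],
}

examples = [no_answer, with_answer]
answerable = [a for a in examples if has_minimal_answer(a)]
print(f"{len(answerable)} of {len(examples)} examples have a minimal answer")
```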

Next, flatten the dataset so that nested fields become top-level columns (for example, `annotations.minimal_answers_start_byte`) instead of entries in a nested dictionary. A flat, table-like structure simplifies the preprocessing steps that follow.

In [17]:
# Flattening the datasets
flattened_train_data = tydiqa_data['train'].flatten()
flattened_test_data =  tydiqa_data['validation'].flatten()
# Selecting a Subset of Data
flattened_train_data = flattened_train_data.select(range(3000))
flattened_test_data = flattened_test_data.select(range(1000))
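
To make the effect of flattening concrete, here is a pure-Python sketch of the naming convention: nested keys are joined with a dot. It is an illustration only, not the `datasets` implementation, and the record is a trimmed, made-up stand-in for a TyDi QA row.

```python
def flatten_record(record: dict, prefix: str = "") -> dict:
    """Recursively flatten nested dictionaries into dotted top-level keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_record(value, name))
        else:
            flat[name] = value
    return flat


# Trimmed stand-in for one dataset row (placeholder question text).
record = {
    "question_text": "example question",
    "annotations": {
        "minimal_answers_start_byte": [-1],
        "minimal_answers_end_byte": [-1],
    },
}

flat = flatten_record(record)
print(sorted(flat))
```

After flattening, a field is read as `sample['annotations.minimal_answers_start_byte']` rather than `sample['annotations']['minimal_answers_start_byte']`.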

In [18]:
# Load the tokenizer that matches the model used earlier
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased-distilled-squad")
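
The preprocessing function in the next cell has to translate character positions of the answer into token positions, which it does with the tokenizer output's `char_to_token` method. The sketch below is a simplified stand-in for that idea, not the library's implementation. It assumes each token is described by a `(start, end)` character span with `end` exclusive, and it uses a toy whitespace split instead of a real tokenizer.

```python
from typing import List, Optional, Tuple


def char_to_token_sketch(offsets: List[Tuple[int, int]], char_index: int) -> Optional[int]:
    """Return the index of the token whose [start, end) span covers char_index."""
    for token_index, (start, end) in enumerate(offsets):
        if start <= char_index < end:
            return token_index
    # No token covers this character (e.g. whitespace, or text cut off by truncation).
    return None


text = "Tea is popular"
# Toy "tokenization": character span of each whitespace-separated word.
offsets = [(0, 3), (4, 6), (7, 14)]

print(char_to_token_sketch(offsets, 8))   # a character inside "popular"
print(char_to_token_sketch(offsets, 3))   # the space after "Tea"
print(char_to_token_sketch(offsets, 20))  # beyond the end of the text
```

The `None` result is the case the preprocessing code has to handle explicitly when the answer falls outside the truncated input.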

In [None]:
# Process each sample in three steps:
# Step 1 - If the question has no answer in the given context, point both the start and end positions at the [CLS] token.
# Step 2 - Tokenization can misalign the dataset's character offsets with the token positions, so map the start and end
#          offsets of the answer to the indices of the corresponding tokens.
# Step 3 - The tokenizer may truncate a very long sequence. If the start or end token position is None, assume the answer
#          was truncated and assign the tokenizer's maximum length to that position.
def process_samples(sample):
    tokenized_data = tokenizer(sample['document_plaintext'], sample['question_text'], truncation='only_first', padding='max_length')
    # Unanswerable samples are labeled with the position of the [CLS] token
    input_ids = tokenized_data['input_ids']
    cls_index= input_ids.index(tokenizer.cls_token_id)
    
    if sample["annotations.minimal_answers_start_byte"][0] == -1:
        start_position= cls_index
        end_position = cls_index
    else:
        gold_text = sample['document_plaintext'][sample['annotations.minimal_answers_start_byte'][0]:sample['annotations.minimal_answers_end_byte'][0]]
        start_char = sample['annotations.minimal_answers_start_byte'][0]
        end_char = sample['annotations.minimal_answers_end_byte'][0]
        
        if sample['document_plaintext'][start_char-1:end_char-1] == gold_text:
            start_char = start_char - 1
            end_char = end_char - 1     # When the gold label is off by one character
        elif sample['document_plaintext'][start_char-2:end_char-2] == gold_text:
            start_char = start_char - 2
            end_char = end_char - 2     # When the gold label is off by two characters

        start_token = tokenized_data.char_to_token(start_char)
        end_token = tokenized_data.char_to_token(end_char - 1)
        
        # if start position is None, the answer passage has been truncated
        if start_token is None:
            start_token = tokenizer.model_max_length
        if end_token is None:
            end_token = tokenizer.model_max_length
                
        
        