## Day 2 (Topic 9.1): BERT-based QnA Model
---


The Hugging Face transformers package is a widely used Python library that provides pretrained models for a variety of natural language processing (NLP) tasks.

Note: https://huggingface.co/models provides a selection of pretrained models that can be used to quickly build prediction models for various NLP tasks. This demo uses the deepset/bert-base-cased-squad2 model.

---

## Step 2: Import the BertForQuestionAnswering class and use it to define a QnA prediction model based on your selected pretrained BERT model from Hugging Face.

In [1]:
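# Install the Hugging Face transformers package (notebook shell command)
!pip install transformers
from transformers import BertForQuestionAnswering
# Load the BERT checkpoint fine-tuned on SQuAD 2.0; force_download=True re-fetches it even if it is cached
model = BertForQuestionAnswering.from_pretrained('deepset/bert-base-cased-squad2', force_download=True, resume_download=False)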

Collecting transformers
  Downloading transformers-4.34.1-py3-none-any.whl (7.7 MB)
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers)
  Downloading huggingface_hub-0.18.0-py3-none-any.whl (301 kB)
Collecting tokenizers<0.15,>=0.14 (from transformers)
  Downloading tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Col

Downloading (…)lve/main/config.json:   0%|          | 0.00/508 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/433M [00:00<?, ?B/s]

Some weights of the model checkpoint at deepset/bert-base-cased-squad2 were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


---

## Step 3: Import the BertTokenizer class and use it to construct the tokenizer from the same BERT model. The tokenizer prepares string inputs for the prediction model by splitting them into sub-word tokens and converting those tokens into transformer-readable token IDs.

In [2]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('deepset/bert-base-cased-squad2')

Downloading (…)okenizer_config.json:   0%|          | 0.00/152 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

### Categories of special tokens used in the tokenization process and their corresponding token IDs

| Token | Meaning | Token ID |
| --- | --- | --- |
| **[PAD]** | Padding token; lets inputs of different lengths be padded to a common sequence length (up to 512 tokens for BERT) | 0 |
| **[UNK]** | Used when a word is unknown to BERT | 100 |
| **[CLS]** | Appears at the start of every sequence | 101 |
| **[SEP]** | Separator; marks the boundary between the question and the context and appears at the end of each sequence | 102 |
| **[MASK]** | Used when masking tokens, for example in training with masked language modelling (MLM) | 103 |
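
To make the roles of these tokens concrete, here is a minimal sketch of how a question/context pair is typically laid out for BERT. It uses the special-token IDs from the table; the helper name `build_pair_input` and the short ID lists passed to it are hypothetical placeholders, and the real tokenizer does this work for you.

```python
# Special-token IDs copied from the table above.
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102


def build_pair_input(question_ids, context_ids, max_length):
    """Lay out a question/context pair as [CLS] question [SEP] context [SEP].

    question_ids and context_ids are assumed to be already-tokenized ID lists
    (hypothetical placeholders in this sketch).
    """
    input_ids = [CLS_ID] + question_ids + [SEP_ID] + context_ids + [SEP_ID]
    if len(input_ids) > max_length:
        raise ValueError("pair is longer than max_length; shorten the context first")

    # Segment 0 covers [CLS], the question and the first [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(question_ids) + 2) + [1] * (len(context_ids) + 1)
    # 1 marks real tokens, 0 marks padding the model should ignore.
    attention_mask = [1] * len(input_ids)

    pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * pad,
        "token_type_ids": token_type_ids + [0] * pad,
        "attention_mask": attention_mask + [0] * pad,
    }


# Placeholder IDs standing in for a two-token question and a three-token context.
example = build_pair_input([1327, 1110], [1103, 2235, 119], max_length=10)
print(example["input_ids"])
```

The attention mask is what allows padded positions to be ignored, which is why sequences of different lengths can share one batch.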

In [3]:
tokenizer.encode("Import the bert tokenizer class and use it to construct the tokenizer using the same BERT model.", max_length=512, truncation=True, padding=True)

[101,
 146,
 24729,
 3740,
 1103,
 1129,
 3740,
 22559,
 17260,
 1705,
 1105,
 1329,
 1122,
 1106,
 9417,
 1103,
 22559,
 17260,
 1606,
 1103,
 1269,
 139,
 9637,
 1942,
 2235,
 119,
 102]
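
The output above contains more IDs than the sentence has words because some words are split into sub-word pieces. For instance, the pair 22559, 17260 occurs twice, matching the two occurrences of "tokenizer", which suggests that word was split into two pieces. The sketch below illustrates the greedy longest-match-first idea behind WordPiece-style splitting. It is a simplified illustration, not the library's implementation: the toy vocabulary and its small IDs are made up, and only the special-token IDs come from the table.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedily split one word into the longest sub-word pieces found in vocab."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word maps to [UNK]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary; real BERT vocabularies hold tens of thousands of entries.
toy_vocab = {"[UNK]": 100, "[CLS]": 101, "[SEP]": 102, "token": 5, "##izer": 6, "the": 7}


def encode(text):
    """Split on whitespace, apply wordpiece per word, and wrap in [CLS] ... [SEP]."""
    tokens = ["[CLS]"]
    for word in text.split():
        tokens.extend(wordpiece(word, toy_vocab))
    tokens.append("[SEP]")
    return tokens, [toy_vocab[t] for t in tokens]


tokens, ids = encode("the tokenizer xyz")
print(tokens)
print(ids)
```

Here "tokenizer" becomes `token` plus `##izer`, while "xyz" has no matching pieces in the toy vocabulary and falls back to `[UNK]`.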

---

## Step 4: Import the pipeline wrapper class and use it to construct the pipeline for a specific NLP task, in this case the Q&A task, by passing to it the built model and tokenizer.

In [4]:
from transformers import pipeline
qna = pipeline('question-answering', model=model, tokenizer=tokenizer)
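
Conceptually, an extractive QnA model produces one "start" score and one "end" score per token, and the answer is the span whose start and end scores are jointly the highest. The sketch below shows that selection step in plain Python under simplifying assumptions: the logits are invented numbers, `best_span` and `max_answer_len` are hypothetical names, and the actual pipeline performs additional steps (for example, restricting spans to the context and mapping tokens back to character offsets).

```python
import math


def softmax(xs):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (score, start_index, end_index) for the best span with start <= end."""
    start_p, end_p = softmax(start_logits), softmax(end_logits)
    best = (0.0, 0, 0)
    for i, p_start in enumerate(start_p):
        # Only consider ends at or after the start, within the length limit.
        for j in range(i, min(i + max_answer_len, len(end_p))):
            score = p_start * end_p[j]
            if score > best[0]:
                best = (score, i, j)
    return best


# Invented logits for a four-token input.
score, start_tok, end_tok = best_span([0.1, 4.0, 0.2, 0.3], [0.0, 0.5, 3.5, 0.1])
print(start_tok, end_tok, round(score, 3))
```

Multiplying the two probabilities is one common way to score a span; it also explains why a reported score can be well below 1 even when the answer is correct.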

---

## Step 5: Provide the document to be used as context for the QnA, along with the question that the BERT model needs to answer. This model answers questions about a passage of text that it has been given.

In [6]:
context = "The Intergovernmental Panel on Climate Change (IPCC) is a scientific intergovernmental body under the auspices of the United Nations, set up at the request of member governments. It was first established in 1988 by two United Nations organizations, the World Meteorological Organization (WMO) and the United Nations Environment Programme (UNEP), and later endorsed by the United Nations General Assembly through Resolution 43/53. Membership of the IPCC is open to all members of the WMO and UNEP. The IPCC produces reports that support the United Nations Framework Convention on Climate Change (UNFCCC), which is the main international treaty on climate change. The ultimate objective of the UNFCCC is to \"stabilize greenhouse gas concentrations in the atmosphere at a level that would prevent dangerous anthropogenic [i.e., human-induced] interference with the climate system\". IPCC reports cover \"the scientific, technical and socio-economic information relevant to understanding the scientific basis of risk of human-induced climate change, its potential impacts and options for adaptation and mitigation.\""
question = "What organization is the IPCC a part of?"
qna({'question': question, 'context': context})


{'score': 0.4881570637226105,
 'start': 118,
 'end': 133,
 'answer': 'United Nations,'}
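
The `start` and `end` values in the result are character offsets into the context string, so the answer can be recovered by slicing. The sketch below reuses the first sentence of the context and the result shown above. The punctuation clean-up and the `MIN_SCORE` threshold are illustrative assumptions, not part of the pipeline.

```python
# First sentence of the context used above (enough to cover the answer span).
context = (
    "The Intergovernmental Panel on Climate Change (IPCC) is a scientific "
    "intergovernmental body under the auspices of the United Nations, "
    "set up at the request of member governments."
)

# Result dictionary copied from the pipeline output above.
result = {"score": 0.4881570637226105, "start": 118, "end": 133, "answer": "United Nations,"}

# Slicing the context with the offsets reproduces the reported answer.
span = context[result["start"]:result["end"]]

# Optional clean-up: drop trailing punctuation picked up with the span.
clean = span.strip().rstrip(",.;:")

# Hypothetical confidence threshold; choose a value that suits your application.
MIN_SCORE = 0.3
is_confident = result["score"] >= MIN_SCORE

print(repr(span), repr(clean), is_confident)
```

A score threshold like this can be useful with a SQuAD 2.0 model, since that dataset includes questions that have no answer in the passage.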

---

## Step 6: Plug the QnA model into a basic user interface and start testing its capabilities.

In [None]:
import textwrap

# Read the context passage and echo it back, wrapped for readability
context = input("Enter Context Article: ")
dedented_text = textwrap.dedent(context).strip()
print("Context Article:\n")
print(textwrap.fill(dedented_text, width=120))

# Keep answering questions until the user enters '*'
inquiry = input("\nType your question: ")
while inquiry != '*':
    answer = qna({'question': inquiry, 'context': context})

    print("Answer found: " + answer['answer'])
    print("At Index :", answer['start'], " - ", answer['end'])
    print("with Probability:", answer['score'], "\n")
    inquiry = input("Enter another question (* to stop):")


Context Article:

The 1986 People Power Revolution in EDSA marked another time where Batangueños enter the picture. When Corazon Aquino
was inaugurated as president by the bloodless revolution, the Batangueño Salvador Laurel is no less than her Vice-
President.  She also appointed Renato de Villa as the Chief of Constabulary and Director-General of the Integrated
National Police, and later the Chief of Staff of the Armed Forces of the Philippines. It was under his leadership that
the Military remained loyal to Aquino despite the many coup d’etat attempts of Gregorio Honasan. He was also one of the
influences behind the Second People Power in 2001.  During the Presidency of Joseph Estrada, he also chose four
Batangueños to be his closest advisers. The group was composed of Domingo Panganiban (Department of Agriculture),
Benjamin Diokno (Department of Budget and Management), Dong Apacible (Legislative Liaison), Tony “Lepili” Leviste (Board
of Investments Governor), and PedFaytaren (Econo