# Using Transformers for Question Answering

BERT:  https://huggingface.co/transformers/model_doc/bert.html

An example of using BERT for question answering can be found here:
https://huggingface.co/transformers/task_summary.html#extractive-question-answering

To do: study this script for fine-tuning the Transformers library models on question answering:
https://github.com/huggingface/transformers/blob/master/examples/pytorch/question-answering/run_qa.py

**Things to Do**
I want to run BERT on a small set of the paragraphs and questions in SQuAD 2.0 to get a sense of what the files look like.

In [3]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

# Here we use a BERT model that has already been fine-tuned on SQuAD.
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

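# The text is from the Wikipedia article on BERT.
# https://en.wikipedia.org/wiki/BERT_(language_model)
# Note: this is a raw string, so each trailing backslash below stays in the text as a
# literal character instead of acting as a line continuation; one of them shows up in
# the first answer in the output.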

text = r"""BERT has its origins from pre-training contextual representations\
including Semi-supervised Sequence Learning,[12] Generative Pre-Training, ELMo\
,[13] and ULMFit.[14] Unlike previous models, BERT is a deeply bidirectional,\
unsupervised language representation, pre-trained using only a plain text corpus.\
Context-free models such as word2vec or GloVe generate a single word embedding \
representation for each word in the vocabulary, where BERT takes into account the \
context for each occurrence of a given word. For instance, whereas the vector for 
"running" will have the same word2vec vector representation for both of its \
occurrences in the sentences "He is running a company" and "He is running a marathon"\
, BERT will provide a contextualized embedding that will be different according\
to the sentence.
"""

questions = ['What is Bert?',
             'How is BERT pre-trained?',
             'WHat does BERT take into account?',
             'Who is Ernie?',
            ]

n = 50 # number of asterisks printed for a frame around the results
print(n * '*')
for question in questions:
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]
    outputs = model(**inputs)
    answer_start_scores = outputs.start_logits
    answer_end_scores = outputs.end_logits

    # Pick the most likely start and end token positions (argmax of the logits).
    answer_start = torch.argmax(answer_start_scores)
    answer_end = torch.argmax(answer_end_scores) + 1  # +1 so the slice includes the end token

    # Convert the answer span from token ids back to a string.
    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))

    print(f"Question: {question}")
    print(f"Answer: {answer}")
    print(n * '*')

**************************************************
Question: What is Bert?
Answer: a deeply bidirectional, \ unsupervised language representation
**************************************************
Question: How is BERT pre-trained?
Answer: using only a plain text corpus
**************************************************
Question: WHat does BERT take into account?
Answer: context for each occurrence of a given word
**************************************************
Question: Who is Ernie?
Answer: elmo
**************************************************
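
The question-answering loop above takes the argmax of the start and end logits independently. Nothing guarantees that the end position lands at or after the start, and the model always returns some span, even for a question the passage cannot answer (hence "elmo" for the Ernie question). One common refinement is to score every (start, end) pair jointly and keep the best valid one. The following is a minimal sketch of that idea in plain Python. The names `best_span` and `max_answer_len` are hypothetical, and the logits are made-up numbers rather than real model output.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) for the highest-scoring span with start <= end.

    `end` is inclusive; the score is start_logits[start] + end_logits[end].
    """
    best = (0, 0, float("-inf"))
    for start, s_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit.
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = s_score + end_logits[end]
            if score > best[2]:
                best = (start, end, score)
    return best


# Made-up logits for a 6-token input (illustration only).
# Independent argmaxes would give start=2 and end=1, which is not a valid span.
start_logits = [0.1, 0.2, 3.0, 0.5, 0.1, 0.0]
end_logits = [0.0, 4.0, 0.3, 2.5, 0.2, 0.1]

start, end, score = best_span(start_logits, end_logits)
print(start, end, score)
```

In the notebook, the inputs would be `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`, and the answer tokens would be `input_ids[start:end + 1]` because `end` is inclusive here.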


The following code follows a portion of the run_squad.py file, which was written by the creators of BERT and can be found here:

https://github.com/google-research/bert

The README for the BERT repository, maintained by the creators of BERT:

https://github.com/google-research/bert/blob/master/README.md

In [40]:
import json

path = "C:\\Users\\Alex\\nlp\\pytorch\\data_sets\\SQuAD\\train-v2.0.json"


# Similar to code found around line 227 of run_squad.py

# Load the JSON file for SQuAD 2.0
with open(path, 'r', encoding='utf-8') as reader:
    data = json.load(reader)['data']
    print(len(data))  # number of articles

# Print the first few paragraphs of each article, along with their questions.
# Note: the break below exits only the paragraph loop, so every article is visited.
number_of_paragraphs = 5
for entry in data:
    for j, paragraph in enumerate(entry['paragraphs']):
        print(f'*** Paragraph {j+1} ***')
        paragraph_text = paragraph['context']
        print(f'\nparagraph_text = {paragraph_text}\n')
        for qa in paragraph['qas']:
            question_text = qa['question']
            print(f'question_text = {question_text}')
        print(10 * '*')
        if j >= (number_of_paragraphs - 1):
            break
    


442
*** Paragraph 1 ***

paragraph_text = Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".

question_text = When did Beyonce start becoming popular?
question_text = What areas did Beyonce compete in when she was growing up?
question_text = When did Beyonce leave Destiny's Child and become a solo singer?
question_text = In what city and state did Beyonc

question_text = What did those close to him call Bell?
**********
*** Paragraph 5 ***

paragraph_text = As a child, young Bell displayed a natural curiosity about his world, resulting in gathering botanical specimens as well as experimenting even at an early age. His best friend was Ben Herdman, a neighbor whose family operated a flour mill, the scene of many forays. Young Bell asked what needed to be done at the mill. He was told wheat had to be dehusked through a laborious process and at the age of 12, Bell built a homemade device that combined rotating paddles with sets of nail brushes, creating a simple dehusking machine that was put into operation and used steadily for a number of years. In return, John Herdman gave both boys the run of a small workshop in which to "invent".

question_text = What sort of things did Bell collect as a child?
question_text = Who was Bell's closest friend as a child?
question_text = What sort of mill did Bell's neighbors run?
question_text = Bell's de

question_text = What municipality serves as the seat of government of Delhi?
**********
*** Paragraph 2 ***

paragraph_text = The foundation stone of the city was laid by George V, Emperor of India during the Delhi Durbar of 1911. It was designed by British architects, Sir Edwin Lutyens and Sir Herbert Baker. The new capital was inaugurated on 13 February 1931, by India's Viceroy Lord Irwin.

question_text = In what year was the foundation stone of New Delhi laid?
question_text = Who designed the foundation stone of the city of New Delhi?
question_text = On what date was New Delhi inaugurated?
question_text = Who inaugurated the city of New Delhi?
**********
*** Paragraph 3 ***

paragraph_text = Although colloquially Delhi and New Delhi as names are used interchangeably to refer to the jurisdiction of NCT of Delhi, these are two distinct entities, and the latter is a small part of the former.

question_text = What are the two terms colloquially used to refer to the jurisdiction of NCT 

question_text = How many FIFA Club World Cup trophies does football club Barcelona have?
question_text = What club is Barcelona's long time rival?
**********
*** Paragraph 2 ***

paragraph_text = On 14 June 1925, in a spontaneous reaction against Primo de Rivera's dictatorship, the crowd in the stadium jeered the Royal March. As a reprisal, the ground was closed for six months and Gamper was forced to relinquish the presidency of the club. This coincided with the transition to professional football, and, in 1926, the directors of Barcelona publicly claimed, for the first time, to operate a professional football club. On 3 July 1927, the club held a second testimonial match for Paulino Alcántara, against the Spanish national team. To kick off the match, local journalist and pilot Josep Canudas dropped the ball onto the pitch from his airplane. In 1928, victory in the Spanish Cup was celebrated with a poem titled "Oda a Platko", which was written by a member of the Generation of '27, Raf

question_text = In what place with the word "name" in it do most people speak Dutch?
question_text = Islands in the Caribbean that include Dutch as an official language include Curaçao, Sint Maarten, and what other place?
question_text = It's been estimated that up to what number of native Dutch speakers live in Australia, the U.S., and Canada?
question_text = In Southern Africa, Dutch has developed over many years into what daughter language?
question_text = What the low estimate for the number of people who speak Afrikaans?
**********
*** Paragraph 2 ***

paragraph_text = Dutch is one of the closest relatives of both German and English[n 5] and is said to be roughly in between them.[n 6] Dutch, like English, has not undergone the High German consonant shift, does not use Germanic umlaut as a grammatical marker, has largely abandoned the use of the subjunctive, and has levelled much of its morphology, including the case system.[n 7] Features shared with German include the survival of 

question_text = What was one of the Thankful Villages?
question_text = Which village suffered the most First World War casualties?
question_text = How many Somerset soldiers died in all in the First World War?
question_text = How many Somerset soldiers did in the Second World War?
question_text = How many pill boxes can still be seen along the coast?
**********
*** Paragraph 1 ***

paragraph_text = Yale University is an American private Ivy League research university in New Haven, Connecticut. Founded in 1701 in Saybrook Colony as the Collegiate School, the University is the third-oldest institution of higher education in the United States. The school was renamed Yale College in 1718 in recognition of a gift from Elihu Yale, who was governor of the British East India Company. Established to train Congregationalist ministers in theology and sacred languages, by 1777 the school's curriculum began to incorporate humanities and sciences. In the 19th century the school incorporated graduate

question_text = What was censored?
question_text = Whose anti-alcohol program did Gorbachev's remind people of?
question_text = When did Tsar Nicholas II ban alcohol?
**********
*** Paragraph 5 ***

paragraph_text = On July 1, 1985, Gorbachev promoted Eduard Shevardnadze, First Secretary of the Georgian Communist Party, to full member of the Politburo, and the following day appointed him minister of foreign affairs, replacing longtime Foreign Minister Andrei Gromyko. The latter, disparaged as "Mr Nyet" in the West, had served for 28 years as Minister of Foreign Affairs. Gromyko was relegated to the largely ceremonial position of Chairman of the Presidium of the Supreme Soviet (officially Soviet Head of State), as he was considered an "old thinker." Also on July 1, Gorbachev took the opportunity to dispose of his main rival by removing Grigory Romanov from the Politburo, and brought Boris Yeltsin and Lev Zaikov into the CPSU Central Committee Secretariat.

question_text = When did Eduar

question_text = What type of human is God portrayed as in some religions?
question_text = What is a God in monotheism?
question_text = How should monotheism not be portrayed?
question_text = What gender do theologians usually use to define human biological gender?
question_text = What has creating a non-corporeal being caused some religions to do regarding divine simplicity?
question_text = In biological gender, what is the universe conceived of?
question_text = What are some concepts of the universe according to theologians?
**********
*** Paragraph 2 ***

paragraph_text = In theism, God is the creator and sustainer of the universe, while in deism, God is the creator, but not the sustainer, of the universe. Monotheism is the belief in the existence of one God or in the oneness of God. In pantheism, God is the universe itself. In atheism, God is not believed to exist, while God is deemed unknown or unknowable within the context of agnosticism. God has also been conceived as being incor

question_text = The most differences between Czech and Slovak can be found in colloquial vocabulary as well as what?
question_text = What does Slovak have slightly more of than Czech?
question_text = What  does Praha have more of than Prahe?
question_text = How many people speak Czech in Russia?
question_text = Why do 80% of people in Russia speak Czech?
question_text = Where can the most differences in Praha be found besides vocabulary?
question_text = When is Praha morphology more regular than scientific terminology?
**********
*** Paragraph 4 ***

paragraph_text = The similarities between Czech and Slovak led to the languages being considered a single language by a group of 19th-century scholars who called themselves "Czechoslavs" (Čechoslováci), believing that the peoples were connected in a way which excluded German Bohemians and (to a lesser extent) Hungarians and other Slavs. During the First Czechoslovak Republic (1918–1938), although "Czechoslovak" was designated as the republ

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
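
The loop above breaks out of the paragraph loop only, so it still prints up to five paragraphs for every one of the 442 articles, which is enough output to trigger the IOPub rate limit message shown above. A minimal sketch that caps both levels with `itertools.islice` follows. The `sample_data` list is a tiny made-up stand-in with the same nesting as the `data` list loaded from train-v2.0.json (it also uses the `is_impossible` flag that SQuAD 2.0 attaches to each question), and `preview`, `max_articles`, and `max_paragraphs` are hypothetical names chosen for illustration.

```python
from itertools import islice


def preview(data, max_articles=1, max_paragraphs=5):
    """Collect (title, context, questions) for a bounded slice of SQuAD-style data."""
    rows = []
    for entry in islice(data, max_articles):
        for paragraph in islice(entry["paragraphs"], max_paragraphs):
            questions = [
                # Use .get so entries without the SQuAD 2.0 flag count as answerable.
                (qa["question"], qa.get("is_impossible", False))
                for qa in paragraph["qas"]
            ]
            rows.append((entry.get("title", ""), paragraph["context"], questions))
    return rows


# Made-up stand-in with the same nesting as the SQuAD JSON "data" list.
sample_data = [
    {
        "title": "Article_A",
        "paragraphs": [
            {"context": f"Context {k}.",
             "qas": [{"question": f"Question {k}?", "is_impossible": k % 2 == 1}]}
            for k in range(8)
        ],
    },
    {"title": "Article_B", "paragraphs": [{"context": "Other.", "qas": []}]},
]

for title, context, questions in preview(sample_data, max_articles=1, max_paragraphs=3):
    print(f"{title}: {context}")
    for question, impossible in questions:
        print(f"  question_text = {question} (unanswerable: {impossible})")
```

With the real file, calling `preview(data, max_articles=1, max_paragraphs=5)` would keep the printed output to a handful of paragraphs from the first article.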

