# Assignment 3: Question Answering with a Language Model

**Description:** This assignment covers question answering with a language model. There are many ways to formulate the question answering task, and this is one of them. You will use the masked token with T5 to develop a sentence construct that allows the model to answer the question correctly more than 75% of the time. You should also be able to develop an intuition for:


* Working with masked language models
* Working with prompt-based models
* The depths and limits of knowledge in these large models

 
This notebook will run on your GCP instance, because generating sentences does not require a GPU to work in a timely fashion. It can also be run on Google Colab, again without a GPU. By default, when you open the notebook in Colab, it will not configure a GPU.


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2022-fall-main/blob/master/assignment/a3/QuestionAnswering_test.ipynb)

The overall assignment structure is as follows:

1. Setup
  
  1.1 Libraries & Helper Functions

  1.2 Data Acquisition

  1.3 Training/Test/Validation Sets for BERT-based models

**INSTRUCTIONS:** 

* Questions are always indicated as **QUESTION:**, so you can search for this string to make sure you answered all of the questions. You are expected to fill out, run, and submit this notebook, as well as to answer the questions in the **answers** file as you did in a1 and a2.





In [1]:
!pip install -q sentencepiece

In [2]:
!pip install -q transformers

In [3]:
from collections import Counter
import numpy as np
import tensorflow as tf
from tensorflow import keras

In [4]:
from transformers import T5Tokenizer, TFT5ForConditionalGeneration

In [7]:
t5_model = TFT5ForConditionalGeneration.from_pretrained('t5-base')
t5_tokenizer = T5Tokenizer.from_pretrained('t5-base')

t5_model.summary()

"\<extra_id_0\>" is the special token we can use with T5 to invoke its masked word modeling ability.  This means we can construct sentences, like a fill in the blank test, that allow us to probe the knowledge embedded in the model based on its pre-training.  Here's an example that works well.  We can construct with the special token a prompt sentence that says "A poodle is a type of "\<extra_id_0\>"".  We expect the model to fill in the word 'dog' as it predicts the missing word.  Note that it also predicts 'pet' as another possibility as a poodle can be a type of pet.  Also the 
"\<extra_id_0\>" token can appear anywhere in the sentence, not just at the end.

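Because the token is ordinary text inside the prompt, moving the blank is a string operation. The following is a minimal sketch and not part of the assignment code: `mask_word` is a hypothetical helper that swaps the first whole-word match in a sentence for the token, which yields prompts with the blank at the end or in the middle.

```python
import re

MASK_TOKEN = "<extra_id_0>"


def mask_word(sentence: str, word: str) -> str:
    """Replace the first whole-word occurrence of `word` with the mask token.

    Hypothetical helper for illustration only.
    """
    pattern = r"\b" + re.escape(word) + r"\b"
    masked, count = re.subn(pattern, MASK_TOKEN, sentence, count=1)
    if count == 0:
        raise ValueError(f"{word!r} does not occur in the sentence")
    return masked


FULL_SENTENCE = "A poodle is a type of dog ."

# Blank at the end of the sentence, as in the poodle prompt.
print(mask_word(FULL_SENTENCE, "dog"))
# Blank in the middle of the sentence.
print(mask_word(FULL_SENTENCE, "poodle"))
```

Either resulting string can be passed to the tokenizer in the same way as `PROMPT_SENTENCE`.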
In [17]:
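# The prompt contains one <extra_id_0> token; T5 generates the text that fills it.
PROMPT_SENTENCE = ( "A poodle is a type of <extra_id_0> .")
t5_input_text = PROMPT_SENTENCE
# Tokenize the prompt into TensorFlow tensors.
t5_inputs = t5_tokenizer([t5_input_text], return_tensors='tf')
# Beam search: track 9 beams and return the 3 best sequences.
t5_summary_ids = t5_model.generate(t5_inputs['input_ids'],
                                   num_beams=9,
                                   no_repeat_ngram_size=1,
                                   num_return_sequences=3,
                                   min_length=1,
                                   max_length=5)

# Decode each returned sequence, dropping special tokens from the output.
print([t5_tokenizer.decode(g, skip_special_tokens=True, 
                           clean_up_tokenization_spaces=False) for g in t5_summary_ids])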

After you've run it once, try substituting 'beagle' for 'poodle' and you'll see the model gets confused.

Notice too that we are using beam search to generate multiple possibilities. The search tracks nine candidate sequences (`num_beams=9`), and we accept the top three (`num_return_sequences=3`) rather than just the first choice. Each returned sequence should be between 1 and 5 subwords long (`min_length=1`, `max_length=5`).
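To build intuition for the difference between the number of beams and the number of returned sequences, here is a toy beam search over a hand-written probability table. Everything in it is an assumption made for illustration: `TOY_MODEL`, its probabilities, and the `beam_search` helper are hypothetical, and the sketch leaves out details of the real `generate` call such as length normalization and n-gram blocking.

```python
import math

EOS = "</s>"

# Hypothetical next-token probabilities, keyed by the tokens generated so far.
TOY_MODEL = {
    (): {"dog": 0.5, "pet": 0.3, "small": 0.2},
    ("dog",): {EOS: 0.9, "breed": 0.1},
    ("pet",): {EOS: 0.8, "dog": 0.2},
    ("small",): {"dog": 0.7, "pet": 0.3},
}


def next_token_probs(prefix):
    """Return the toy distribution for a prefix; unknown prefixes simply end."""
    return TOY_MODEL.get(tuple(prefix), {EOS: 1.0})


def beam_search(num_beams, num_return_sequences, max_length):
    """Simplified beam search: keep the `num_beams` best open prefixes per step."""
    if num_return_sequences > num_beams:
        raise ValueError("num_return_sequences must not exceed num_beams")
    beams = [((), 0.0)]  # (tokens, summed log-probability)
    finished = []
    for _ in range(max_length):
        candidates = []
        for tokens, score in beams:
            for token, prob in next_token_probs(tokens).items():
                new_score = score + math.log(prob)
                if token == EOS:
                    finished.append((tokens, new_score))
                else:
                    candidates.append((tokens + (token,), new_score))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]  # prune to the beam width
        if not beams:
            break
    finished.extend(beams)  # prefixes cut off at max_length
    finished.sort(key=lambda c: c[1], reverse=True)
    return [" ".join(tokens) for tokens, _ in finished[:num_return_sequences]]


print(beam_search(num_beams=3, num_return_sequences=3, max_length=3))
# ['dog', 'pet', 'small dog']
```

A wider beam explores more prefixes before ranking them, while the number of returned sequences only controls how many of the ranked results are kept.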

With the growth of text generation models, developing a good prompt is an increasingly important skill. 

**QUESTION:**

1.1 Let's test the actual knowledge encoded in the T5 model by constructing prompts that return provably true or false facts, like you might see on a fill-in-the-blank test. Given the following ten countries (England, France, Germany, Russia, Egypt, Thailand, Japan, Canada, India, China), construct **two** different PROMPT_SENTENCEs using the special token and the values of the countries list so that, in at least 7 of the 10 cases, one of the top three answers is a correct fact. Use the string COUNTRY to stand in for each of the elements in the list. For example, "\<extra_id_0\> is the chief export of COUNTRY".

Note that a fact usually takes the form of a noun phrase - verb phrase - noun phrase triple where one of those noun phrases will consist of the country value.   
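The "one of the top three answers is correct" criterion can be checked mechanically. The sketch below is one possible way to do that, under stated assumptions: `fill_template`, `count_hits`, and `fake_generate` are hypothetical names, the template is only an illustration, and the canned outputs are made up for the example rather than real T5 predictions. A stand-in generator is used so the snippet runs without loading the model.

```python
from typing import Callable, Dict, List


def fill_template(template: str, country: str) -> str:
    """Substitute a country name for the COUNTRY placeholder."""
    return template.replace("COUNTRY", country)


def count_hits(template: str,
               answer_key: Dict[str, List[str]],
               generate_fn: Callable[[str], List[str]],
               top_k: int = 3) -> int:
    """Count countries for which an accepted answer is among the top_k outputs."""
    hits = 0
    for country, accepted in answer_key.items():
        prompt = fill_template(template, country)
        candidates = [c.strip().lower() for c in generate_fn(prompt)[:top_k]]
        if any(answer.lower() in candidates for answer in accepted):
            hits += 1
    return hits


# Made-up outputs standing in for the model; NOT real T5 predictions.
CANNED_OUTPUTS = {
    "The capital of France is <extra_id_0> .": ["Paris", "Lyon", "France"],
    "The capital of Japan is <extra_id_0> .": ["Kyoto", "Osaka", "Japan"],
    "The capital of Egypt is <extra_id_0> .": ["Alexandria", "Cairo", "Egypt"],
}


def fake_generate(prompt: str) -> List[str]:
    """Hypothetical stand-in for tokenizing, generating, and decoding."""
    return CANNED_OUTPUTS[prompt]


# Illustrative answer key covering three of the ten countries.
ANSWER_KEY = {"France": ["Paris"], "Japan": ["Tokyo"], "Egypt": ["Cairo"]}
TEMPLATE = "The capital of COUNTRY is <extra_id_0> ."

hits = count_hits(TEMPLATE, ANSWER_KEY, fake_generate)
print(f"{hits} of {len(ANSWER_KEY)} prompts had a correct answer in the top three")
```

In the notebook, `generate_fn` would wrap the tokenizer, `t5_model.generate`, and decode calls shown in the cells, and the answer key would cover all ten countries.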

In [16]:
#Use this space to craft your sentence.  You do NOT need to modify the hyperparameters!
PROMPT_SENTENCE = ( "Fill in the <extra_id_0> is a form of question.")
t5_input_text = PROMPT_SENTENCE
t5_inputs = t5_tokenizer([t5_input_text], return_tensors='tf')
t5_summary_ids = t5_model.generate(t5_inputs['input_ids'], 
                                   num_beams=9,
                                   no_repeat_ngram_size=2,
                                   num_return_sequences=3,
                                   min_length=1,
                                   max_length=3)
                             
print([t5_tokenizer.decode(g, skip_special_tokens=True, 
                           clean_up_tokenization_spaces=False) for g in t5_summary_ids])

In [None]:
#Use this space to craft your second sentence.  You do NOT need to modify the hyperparameters!
PROMPT_SENTENCE2 = ( "<extra_id_0> is another form of test question.")
t5_input_text = PROMPT_SENTENCE2
t5_inputs = t5_tokenizer([t5_input_text], return_tensors='tf')
t5_summary_ids = t5_model.generate(t5_inputs['input_ids'], 
                                   num_beams=9,
                                   no_repeat_ngram_size=2,
                                   num_return_sequences=3,
                                   min_length=1,
                                   max_length=3)
                             
print([t5_tokenizer.decode(g, skip_special_tokens=True, 
                           clean_up_tokenization_spaces=False) for g in t5_summary_ids])