# Jeopardy Notebook

**Please run with a GPU runtime in Colab.** The notebook takes about 3 minutes to execute fully.

Instructions:

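- Ensure sections `Part 1` and `Part 2` are collapsed
- Enter the input in the cell below
- Press the play button beneath `Part 1`
- Wait for the error *"Session Crashed for unknown reason"* (this is expected: the last cell of `Part 1` kills the process to force a runtime restart)
- Press the play button beneath `Part 2`
- Execute the final cell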


> ***You must `Factory Reset Runtime` to run the notebook on a new input***
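
The two-part flow relies on files under `/content` surviving the runtime restart, while Python variables do not. A minimal sketch of that hand-off follows. It uses a temporary directory instead of `/content`, and the helper names `save_state` and `load_state` are illustrative only, not part of the notebook.

```python
import os
import pickle
import tempfile

def save_state(path, state):
    """Write the state dict to disk; the with-block closes and flushes the file."""
    with open(path, 'wb') as f:
        pickle.dump(state, f)

def load_state(path):
    """Read the state dict back, e.g. after the runtime has restarted."""
    with open(path, 'rb') as f:
        return pickle.load(f)

# Hypothetical stand-in for /content/temp.pkl
state_path = os.path.join(tempfile.mkdtemp(), 'temp.pkl')

# "Part 1": store the input and the first model's output
save_state(state_path, {'text': 'some context', 'bert_output': 'some clue'})

# "Part 2": a fresh process only has the file, so reload and extend it
state = load_state(state_path)
state['t5_output'] = 'some question?'
save_state(state_path, state)

print(sorted(load_state(state_path)))
```

Anything that is not written to disk before the restart is lost, which is why a new input requires a full `Factory Reset Runtime` and a rerun from the top.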



In [2]:
import torch
assert torch.cuda.is_available(), 'Please enable GPU runtime in `Runtime>Change Runtime Type`'
text = str(input('Please enter the context (only first 128 words will be used): '))

Please enter the context (only first 128 words will be used): The Balfour Declaration was a public statement issued by the British government in 1917 during the First World War announcing support for the establishment of a "national home for the Jewish people" in Palestine, then an Ottoman region with a small minority Jewish population. The declaration was contained in a letter dated 2 November 1917 from the United Kingdom's Foreign Secretary Arthur Balfour to Lord Rothschild, a leader of the British Jewish community, for transmission to the Zionist Federation of Great Britain and Ireland. The text of the declaration was published in the press on 9 November 1917


> Run the cell above and enter a passage of context from which to generate a clue

# Part 1 - Fine-tuned BERT Model Inference

> Collapse me

In [3]:
!gdown --id 1y267OwUrFRTCHxqet3l7dEEnCMmGZJGK
!gdown --id 1-1YIOyMdCvdhO9Z_DRHRBk20DhLB9XSp
!unzip t5-small-e2e-qg-7k.zip
!rm t5-small-e2e-qg-7k.zip

Downloading...
From: https://drive.google.com/uc?id=1y267OwUrFRTCHxqet3l7dEEnCMmGZJGK
To: /content/final_BERT_model.pt
992MB [00:17, 56.6MB/s]
Downloading...
From: https://drive.google.com/uc?id=1-1YIOyMdCvdhO9Z_DRHRBk20DhLB9XSp
To: /content/t5-small-e2e-qg-7k.zip
224MB [00:02, 90.1MB/s]
Archive:  t5-small-e2e-qg-7k.zip
   creating: t5-small-e2e-qg-7k/
  inflating: t5-small-e2e-qg-7k/special_tokens_map.json  
  inflating: t5-small-e2e-qg-7k/tokenizer_config.json  
  inflating: t5-small-e2e-qg-7k/spiece.model  
  inflating: t5-small-e2e-qg-7k/training_args.bin  
  inflating: t5-small-e2e-qg-7k/added_tokens.json  
  inflating: t5-small-e2e-qg-7k/config.json  
  inflating: t5-small-e2e-qg-7k/pytorch_model.bin  


In [4]:
import os
assert (os.path.exists('/content/final_BERT_model.pt')
    and os.path.exists('/content/t5-small-e2e-qg-7k')), 'download error: rerun the above cell please'

In [5]:
import os, pickle, time, random, logging, json, gc
from datetime import datetime
from tqdm import tqdm

import numpy as np
import pandas as pd

## BERT Model Definition

In [6]:
!pip install --user -U nltk
!python -m nltk.downloader punkt
!pip install --force-reinstall transformers==2.11.0

Collecting nltk
  Downloading https://files.pythonhosted.org/packages/5e/37/9532ddd4b1bbb619333d5708aaad9bf1742f051a664c3c6fa6632a105fd8/nltk-3.6.2-py3-none-any.whl (1.5MB)

In [7]:
import torch
import transformers
assert transformers.__version__ == '2.11.0', 'Wrong transformers version (must be 2.11.0). Please Factory Reset Runtime'
from transformers import BertTokenizer

In [8]:
MAX_SEQ_LEN = 128
BERT_MODEL_PATH = '/content/final_BERT_model.pt'

In [9]:
def get_mask_ids(tokens):
    return [1]*len(tokens) + [0] * (MAX_SEQ_LEN - len(tokens))

def get_segment_ids(tokens):
    segments = []
    first_sep = True
    current_segment_id = 0
    for token in tokens:
        segments.append(current_segment_id)
        if token == "[SEP]":
            if first_sep:
                first_sep = False 
                current_segment_id = 1
    assert current_segment_id == 1
    return segments + [0] * (MAX_SEQ_LEN - len(tokens))

def convert_to_input(tokenizer, text, ans=None):
    text_token = tokenizer.tokenize(text)[:MAX_SEQ_LEN]
    print(len(text_token))  # context token count (capped at MAX_SEQ_LEN)
    if ans:
        ans_token = tokenizer.tokenize(ans)
        # Reserve room for [CLS], two [SEP] tokens and the answer tokens
        text_token = text_token[:MAX_SEQ_LEN - 3 - len(ans_token)]
        all_tokens = ["[CLS]"] + text_token + ["[SEP]"] + ans_token + ["[SEP]"]
    else:
        # Reserve room for [CLS] and a single [SEP]
        text_token = text_token[:MAX_SEQ_LEN - 2]
        all_tokens = ["[CLS]"] + text_token + ["[SEP]"]

    token_ids = tokenizer.convert_tokens_to_ids(all_tokens)
    input_ids = token_ids + [0] * (MAX_SEQ_LEN-len(token_ids))
    
    attention_mask = get_mask_ids(all_tokens)
    token_type_ids = get_segment_ids(all_tokens)
    return (
        torch.tensor(input_ids, dtype=torch.long), 
        torch.tensor(attention_mask, dtype=torch.long), 
        torch.tensor(token_type_ids, dtype=torch.long), 
    )
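
To make the padding layout concrete, here is a small standalone sketch that mirrors `get_mask_ids` and `get_segment_ids` on a toy example. It uses a sequence length of 8 rather than 128 and hand-written tokens instead of a real tokenizer; both are assumptions made purely for illustration.

```python
SEQ_LEN = 8  # toy value; the notebook uses MAX_SEQ_LEN = 128

def mask_ids(tokens, seq_len=SEQ_LEN):
    # 1 for real tokens, 0 for padding positions
    return [1] * len(tokens) + [0] * (seq_len - len(tokens))

def segment_ids(tokens, seq_len=SEQ_LEN):
    # 0 up to and including the first [SEP], 1 afterwards, 0 for padding
    ids, current = [], 0
    for tok in tokens:
        ids.append(current)
        if tok == '[SEP]' and current == 0:
            current = 1
    return ids + [0] * (seq_len - len(tokens))

# Hand-written tokens standing in for tokenizer output
tokens = ['[CLS]', 'balfour', 'declaration', '[SEP]', '1917', '[SEP]']
print(mask_ids(tokens))     # [1, 1, 1, 1, 1, 1, 0, 0]
print(segment_ids(tokens))  # [0, 0, 0, 0, 1, 1, 0, 0]
```

Both lists always have exactly `SEQ_LEN` entries, which is what lets the inputs be stacked into fixed-size tensors.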

In [10]:
def bert_inference(bert_model, text, ans=None):
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', 
                                              do_lower_case=True)
    vocab_size = tokenizer.vocab_size
    input_ids, attention_mask, token_type_ids = (i.unsqueeze(0).to(device) for i in 
                            convert_to_input(tokenizer, text, ans))
    logits = bert_model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        decoder_input_ids=input_ids,
        token_type_ids=token_type_ids,
        masked_lm_labels=None
    )[0]
    logits = logits.view(-1, vocab_size)
    logits = logits.detach().cpu().numpy()

    prediction_raw = logits.argmax(axis=1).flatten().squeeze()
    predicted = list(prediction_raw)
    try:
        length = predicted.index(102) # find first sep token
    except ValueError:
        length = len(predicted)-1
    
    predicted = predicted[:length+1]
    predicted = tokenizer.decode(predicted, skip_special_tokens=True)
    return predicted
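
The post-processing in `bert_inference` is a greedy decode: take the highest-scoring vocabulary id at each position, then cut the sequence at the first `[SEP]` (id 102 in the `bert-base-uncased` vocabulary). A minimal sketch on made-up scores is shown below. It uses plain lists instead of tensors and a toy vocabulary of five ids in which id 2 plays the role of `[SEP]`; the function name `greedy_truncate` and all values are assumptions for illustration.

```python
SEP_ID = 2  # toy stand-in for the real [SEP] id (102)

def greedy_truncate(logits, sep_id=SEP_ID):
    # argmax over the vocabulary axis for each position
    predicted = [max(range(len(row)), key=row.__getitem__) for row in logits]
    try:
        end = predicted.index(sep_id)   # position of the first separator
    except ValueError:
        end = len(predicted) - 1        # no separator: keep everything
    return predicted[:end + 1]

logits = [
    [0.1, 0.2, 0.0, 0.9, 0.3],  # argmax is 3
    [0.8, 0.1, 0.0, 0.0, 0.1],  # argmax is 0
    [0.0, 0.1, 0.7, 0.1, 0.1],  # argmax is 2, the separator
    [0.0, 0.0, 0.0, 0.0, 1.0],  # argmax is 4, dropped after truncation
]
print(greedy_truncate(logits))  # [3, 0, 2]
```

Because each position is decoded independently, nothing discourages repeated tokens, which is one plausible reason a greedy output can degenerate into runs of the same word.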

## BERT Model Inference

In [11]:
assert torch.cuda.is_available(), 'CUDA device is required'
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
bert_model = torch.load(BERT_MODEL_PATH)

In [12]:
bert_output = bert_inference(bert_model, text)
print('Output', bert_output)
with open('/content/temp.pkl', 'wb') as f:
    pickle.dump({'text': text, 'bert_output': bert_output}, f)


107
Output in 1930,,,, the the the to to to to to to to to to to to


## T5 Model

In [None]:
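# GitHub no longer serves the unauthenticated git:// protocol, so install over https
!pip install --force-reinstall git+https://github.com/adamnpeace/nlpt5
import os
# Kill the current process so Colab restarts the runtime with the newly installed packages.
# Colab reports this as "Session Crashed for unknown reason", which is expected.
os.kill(os.getpid(), 9)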

Collecting git+git://github.com/adamnpeace/nlpt5
  Cloning git://github.com/adamnpeace/nlpt5 to /tmp/pip-req-build-lvurbxza
  Running command git clone -q git://github.com/adamnpeace/nlpt5 /tmp/pip-req-build-lvurbxza
Collecting nlp==0.2.0
  Downloading https://files.pythonhosted.org/packages/9c/69/17c95e9bdb431bb5102f331d3d34e0f3aabef14a8041690ad72c2b11d1d0/nlp-0.2.0-py3-none-any.whl (857kB)
Collecting datasets==1.6.2
  Downloading https://files.pythonhosted.org/packages/46/1a/b9f9b3bfef624686ae81c070f0a6bb635047b17cdb3698c7ad01281e6f9a/datasets-1.6.2-py3-none-any.whl (221kB)
Collecting transformers==3.0.0
  Downloading https://files.pythonhosted.org/packages/9c/35/1c3f6e62d81f5f0daff1384e6d5e6c5758682a8357ebc765ece2b9def62b/transformers-3.0.0-py3-none-any.whl (754kB)
Collectin

# Part 2 - Fine-tuned T5-small Model Inference



> Collapse me



In [1]:
import os, pickle, time, random, logging, json, gc
from datetime import datetime
from tqdm import tqdm

import numpy as np
import pandas as pd
from nlpt5 import pipeline
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
# T5_PATH = '/content/drive/MyDrive/NLP/models/t5-small-e2e-qg-7k'
T5_PATH = '/content/t5-small-e2e-qg-7k'

In [3]:
def get_t5_model(path):
    return pipeline('e2e-qg', model=path)

In [4]:
def t5_inference(t5_model, text):
    prediction = t5_model(text)
    prediction = prediction[0]
    return prediction

## T5 Model Inference

In [5]:
t5_model = get_t5_model(T5_PATH)

In [6]:
with open('/content/temp.pkl', 'rb') as f:
    temp = pickle.load(f)
text = temp['text']

In [7]:
t5_output = t5_inference(t5_model, text)
temp['t5_output'] = t5_output
with open('/content/temp.pkl', 'wb') as f:
    pickle.dump(temp, f)

## Scoring

In [8]:
# !pip install -U nltk
# !pip install easy-rouge
# !python -m nltk.downloader punkt

In [9]:
# import string, re
# import nltk
# from rouge.rouge import rouge_n_sentence_level

In [10]:
# def clean_sentence(collection):
#     collection = collection.translate(str.maketrans('','',string.punctuation))
#     collection = re.sub(r'\d+', '', collection)
#     collection = collection.strip()
#     return collection.split()

In [11]:
# def get_metrics(prediction, truth):
#     recall, precision, rouge = rouge_n_sentence_level(
#             truth.split(),
#             prediction.split(),
#             2)
#     bleu = nltk.translate.bleu_score.sentence_bleu(
#         [clean_sentence(truth)], clean_sentence(prediction))
#     meteor = nltk.translate.meteor_score.meteor_score(
#         truth, prediction)
#     return rouge, bleu, meteor
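
The commented-out scoring code relies on `easy-rouge` and NLTK. For reference, the idea behind ROUGE-2 can be sketched in plain Python as clipped bigram overlap between a prediction and a reference. This is an illustrative sketch, not the `easy-rouge` implementation, and `bigram_overlap` is a hypothetical helper name; library scores may differ in tokenization and in how the F-score is weighted.

```python
from collections import Counter

def bigrams(tokens):
    # Multiset of adjacent token pairs
    return Counter(zip(tokens, tokens[1:]))

def bigram_overlap(prediction, truth):
    pred = bigrams(prediction.split())
    ref = bigrams(truth.split())
    # Counter intersection keeps the minimum count per bigram (clipping)
    overlap = sum((pred & ref).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(pred.values()), 1)
    if overlap == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

# Made-up sentences, chosen only to exercise the function
recall, precision, f1 = bigram_overlap(
    'when was the declaration issued',
    'when was the balfour declaration issued',
)
print(recall, precision, f1)
```

Whitespace splitting is the simplest possible tokenization; lowercasing and punctuation stripping, as in the commented `clean_sentence`, would usually be applied first.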

# Results

In [12]:
import pickle
with open('/content/temp.pkl', 'rb') as f:
    results = pickle.load(f)
for k, v in results.items():
    print('{}: {}'.format(k, v))

text: The Balfour Declaration was a public statement issued by the British government in 1917 during the First World War announcing support for the establishment of a "national home for the Jewish people" in Palestine, then an Ottoman region with a small minority Jewish population. The declaration was contained in a letter dated 2 November 1917 from the United Kingdom's Foreign Secretary Arthur Balfour to Lord Rothschild, a leader of the British Jewish community, for transmission to the Zionist Federation of Great Britain and Ireland. The text of the declaration was published in the press on 9 November 1917
bert_output: in 1930,,,, the the the to to to to to to to to to to to
t5_output: When was the Balfour Declaration issued?
