# Jeopardy Notebook

**Please run this notebook with a GPU runtime in Colab.**

Instructions:

- Ensure sections `Part 1` and `Part 2` are collapsed
- Enter input below
- Press play button beneath `Part 1`
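- Wait for the expected error *"Session Crashed for unknown reason"* (the last cell of `Part 1` kills the runtime on purpose so that a different `transformers` version can be loaded)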
- Press play button beneath `Part 2`
- Execute the final cell


> ***You must `Factory Reset Runtime` to run the notebook on a new input***



In [None]:
import torch
assert torch.cuda.is_available(), 'Please enable GPU runtime in `Runtime>Change Runtime Type`'
text = str(input('Please enter the context (only first 128 words will be used): '))

Please enter the context (only first 128 words will be used):  Sweden is a constitutional monarchy and a parliamentary democracy, with legislative power vested in the 349-member unicameral Riksdag. It is a unitary state, currently divided into 21 counties and 290 municipalities. Sweden maintains a Nordic social welfare system that provides universal health care and tertiary education for its citizens. It has the world's eleventh-highest per capita income and ranks very highly in quality of life, health, education, protection of civil liberties, economic competitiveness, income equality, gender equality, prosperity and human development. Sweden joined the European Union on 1 January 1995, but has rejected NATO membership, as well as Eurozone membership following a referendum. It is also a member of the United Nations, the Nordic Council, the Council of Europe, the World Trade Organization and the Organisation for Economic Co-operation and Development (OECD).


> Run the cell above and enter the context from which a clue should be generated

# Part 1 - Fine-tuned BERT Model Inference

> Collapse me

In [None]:
!gdown --id 1y267OwUrFRTCHxqet3l7dEEnCMmGZJGK
!gdown --id 1-1YIOyMdCvdhO9Z_DRHRBk20DhLB9XSp
!unzip t5-small-e2e-qg-7k.zip
!rm t5-small-e2e-qg-7k.zip

Downloading...
From: https://drive.google.com/uc?id=1y267OwUrFRTCHxqet3l7dEEnCMmGZJGK
To: /content/final_BERT_model.pt
992MB [00:07, 127MB/s]
Downloading...
From: https://drive.google.com/uc?id=1-1YIOyMdCvdhO9Z_DRHRBk20DhLB9XSp
To: /content/t5-small-e2e-qg-7k.zip
224MB [00:03, 66.3MB/s]
Archive:  t5-small-e2e-qg-7k.zip
   creating: t5-small-e2e-qg-7k/
  inflating: t5-small-e2e-qg-7k/special_tokens_map.json  
  inflating: t5-small-e2e-qg-7k/tokenizer_config.json  
  inflating: t5-small-e2e-qg-7k/spiece.model  
  inflating: t5-small-e2e-qg-7k/training_args.bin  
  inflating: t5-small-e2e-qg-7k/added_tokens.json  
  inflating: t5-small-e2e-qg-7k/config.json  
  inflating: t5-small-e2e-qg-7k/pytorch_model.bin  


In [None]:
import os, pickle, time, random, logging, json, gc
from datetime import datetime
from tqdm import tqdm

import numpy as np
import pandas as pd

## BERT Model Definition

In [None]:
!pip install --user -U nltk
!python -m nltk.downloader punkt
!pip install --force-reinstall transformers==2.11.0

Collecting nltk
  Downloading https://files.pythonhosted.org/packages/5e/37/9532ddd4b1bbb619333d5708aaad9bf1742f051a664c3c6fa6632a105fd8/nltk-3.6.2-py3-none-any.whl (1.5MB)
Installing collected packages: nltk
Successfully installed nltk-3.6.2
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
Collecting transformers==2.11.0
  Downloading https://files.pythonhosted.org/packages/48/35/ad2c5b1b8f99feaaf9d7cdadaeef261f098c6e1a6a2935d4d07662a6b780/transformers-2.11.0-py3-none-any.whl (674kB)
Collecting packaging
  Downloading https://files.pythonhosted.org/packages/3e/89/7ea760b4daa42653ece2380531c90f64788d979110a2ab51049d92f408af/packaging-20.9-py2.py3-none-any.whl (40kB)
Collecting filelock
  Downloading https://files.python

In [None]:
import torch
import transformers
assert transformers.__version__ == '2.11.0', 'Wrong Transformer Version (must be 2.11.0). Please Factory Reset Runtime'
from transformers import BertTokenizer

In [None]:
MAX_SEQ_LEN = 128
BERT_MODEL_PATH = '/content/final_BERT_model.pt'

In [None]:
def get_mask_ids(tokens):
    return [1]*len(tokens) + [0] * (MAX_SEQ_LEN - len(tokens))

def get_segment_ids(tokens):
    segments = []
    first_sep = True
    current_segment_id = 0
    for token in tokens:
        segments.append(current_segment_id)
        if token == "[SEP]":
            if first_sep:
                first_sep = False 
                current_segment_id = 1
    assert current_segment_id == 1
    return segments + [0] * (MAX_SEQ_LEN - len(tokens))

def convert_to_input(tokenizer, text, ans=None):
    text_token = tokenizer.tokenize(text)[:MAX_SEQ_LEN]
    print(len(text_token))
    if ans:
        ans_token = tokenizer.tokenize(ans)
        # Reserve room for [CLS], the two [SEP] tokens, and the answer tokens.
        text_token = text_token[:MAX_SEQ_LEN - (3 + len(ans_token))]
        all_tokens = ["[CLS]"] + text_token + ["[SEP]"] + ans_token + ["[SEP]"]
    else:
        text_token = text_token[:MAX_SEQ_LEN - 2]
        all_tokens = ["[CLS]"] + text_token + ["[SEP]"]

    token_ids = tokenizer.convert_tokens_to_ids(all_tokens)
    input_ids = token_ids + [0] * (MAX_SEQ_LEN-len(token_ids))
    
    attention_mask = get_mask_ids(all_tokens)
    token_type_ids = get_segment_ids(all_tokens)
    return (
        torch.tensor(input_ids, dtype=torch.long), 
        torch.tensor(attention_mask, dtype=torch.long), 
        torch.tensor(token_type_ids, dtype=torch.long), 
    )
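
To make the padding layout concrete, here is a minimal sketch that runs on a toy token list and needs no tokenizer. The helper `build_mask_and_segments`, the toy tokens, and the small `max_len` are illustrative assumptions and not part of the notebook; the logic is meant to mirror `get_mask_ids` and `get_segment_ids` above.

```python
def build_mask_and_segments(tokens, max_len):
    """Return (attention_mask, token_type_ids), both padded to max_len."""
    if len(tokens) > max_len:
        raise ValueError("token list is longer than max_len")
    pad = max_len - len(tokens)
    # 1 marks a real token, 0 marks padding.
    mask = [1] * len(tokens) + [0] * pad
    segments = []
    segment_id = 0
    for token in tokens:
        segments.append(segment_id)
        # Everything after the first [SEP] belongs to segment 1.
        if token == "[SEP]" and segment_id == 0:
            segment_id = 1
    return mask, segments + [0] * pad


toy_tokens = ["[CLS]", "sweden", "is", "[SEP]", "riksdag", "[SEP]"]
toy_mask, toy_segments = build_mask_and_segments(toy_tokens, max_len=8)
print(toy_mask)      # [1, 1, 1, 1, 1, 1, 0, 0]
print(toy_segments)  # [0, 0, 0, 0, 1, 1, 0, 0]
```

Note that the first `[SEP]` itself still carries segment id 0, and padding positions are also 0, so only the attention mask distinguishes padding from the first segment.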

In [None]:
def bert_inference(bert_model, text, ans=None):
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', 
                                              do_lower_case=True)
    vocab_size = tokenizer.vocab_size
    input_ids, attention_mask, token_type_ids = (i.unsqueeze(0).to(device) for i in 
                            convert_to_input(tokenizer, text, ans))
    logits = bert_model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        decoder_input_ids=input_ids,
        token_type_ids=token_type_ids,
        masked_lm_labels=None
    )[0]
    logits = logits.view(-1, vocab_size)
    logits = logits.detach().cpu().numpy()

    prediction_raw = logits.argmax(axis=1).flatten().squeeze()
    predicted = list(prediction_raw)
    try:
        length = predicted.index(102) # find first sep token
    except ValueError:
        length = len(predicted)-1
    
    predicted = predicted[:length+1]
    predicted = tokenizer.decode(predicted, skip_special_tokens=True)
    return predicted
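
The post-processing in `bert_inference` is a greedy decode: take the highest-scoring vocabulary id at each position, then cut the sequence after the first `[SEP]` (id 102 in the `bert-base-uncased` vocabulary). The sketch below isolates that step on made-up logits; `greedy_ids_until_sep` and the toy array are hypothetical names used only for illustration.

```python
import numpy as np

SEP_ID = 102  # [SEP] id in the bert-base-uncased vocabulary, as used above


def greedy_ids_until_sep(logits, sep_id=SEP_ID):
    """Pick the argmax id per position and keep everything up to the first sep_id."""
    ids = logits.argmax(axis=1).tolist()
    if sep_id in ids:
        end = ids.index(sep_id)
    else:
        # No separator predicted: keep the whole sequence.
        end = len(ids) - 1
    return ids[:end + 1]


# Toy logits for 4 positions over a 200-entry vocabulary (assumed sizes).
toy_logits = np.zeros((4, 200))
for position, token_id in enumerate([7, 9, 102, 5]):
    toy_logits[position, token_id] = 1.0

toy_ids = greedy_ids_until_sep(toy_logits)
print(toy_ids)  # [7, 9, 102]
```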

## BERT Model Inference

In [None]:
assert torch.cuda.is_available(), 'CUDA device is required'
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
bert_model = torch.load(BERT_MODEL_PATH)

In [None]:
bert_output = bert_inference(bert_model, text)
print('Output', bert_output)
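# Persist the input and the BERT output so Part 2 can read them after the runtime restarts.
with open('/content/temp.pkl', 'wb') as f:
    pickle.dump({'text': text, 'bert_output': bert_output}, f)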

128
Output the country s s s,,,,,,,,,,,, in in


## T5 Model

In [None]:
!pip install --force-reinstall transformers==3.0.0
import os
os.kill(os.getpid(), 9) # hack to force restart of runtime

Collecting transformers==3.0.0
  Downloading https://files.pythonhosted.org/packages/9c/35/1c3f6e62d81f5f0daff1384e6d5e6c5758682a8357ebc765ece2b9def62b/transformers-3.0.0-py3-none-any.whl (754kB)
Collecting tqdm>=4.27
  Using cached https://files.pythonhosted.org/packages/72/8a/34efae5cf9924328a8f34eeb2fdaae14c011462d9f0e3fcded48e1266d1c/tqdm-4.60.0-py2.py3-none-any.whl
Collecting sentencepiece
  Using cached https://files.pythonhosted.org/packages/f5/99/e0808cb947ba10f575839c43e8fafc9cc44e4a7a2c8f79c60db48220a577/sentencepiece-0.1.95-cp37-cp37m-manylinux2014_x86_64.whl
Collecting tokenizers==0.8.0-rc4
  Downloading https://files.pythonhosted.org/packages/f7/82/0e82a95bd9db2b32569500cc1bb47aa7c4e0f57aa5e35cceba414096917b/tokenizers-0.8.0rc4-cp37-cp37m-manylinux1_x86_64.whl (3.0MB)
Collecting filelock
  Using cached https://files.pythonh

# Part 2 - Fine-tuned T5-small Model Inference



> Collapse me



In [None]:
!git clone https://github.com/patil-suraj/question_generation.git
%cd question_generation

Cloning into 'question_generation'...
remote: Enumerating objects: 265, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 265 (delta 1), reused 2 (delta 0), pack-reused 259
Receiving objects: 100% (265/265), 298.28 KiB | 9.94 MiB/s, done.
Resolving deltas: 100% (141/141), done.
/content/question_generation


In [None]:
import os, pickle, time, random, logging, json, gc
from datetime import datetime
from tqdm import tqdm

import numpy as np
import pandas as pd
from pipelines import pipeline
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
# T5_PATH = '/content/drive/MyDrive/NLP/models/t5-small-e2e-qg-7k'
T5_PATH = '/content/t5-small-e2e-qg-7k'

In [None]:
def get_t5_model(path):
    return pipeline('e2e-qg', model=path)

In [None]:
def t5_inference(t5_model, text):
    prediction = t5_model(text)
    prediction = prediction[0]
    return prediction

## T5 Model Inference

In [None]:
t5_model = get_t5_model(T5_PATH)

In [None]:
# Reload the input and BERT output saved by Part 1 before the runtime restart.
with open('/content/temp.pkl', 'rb') as f:
    temp = pickle.load(f)
text = temp['text']

In [None]:
t5_output = t5_inference(t5_model, text)
temp['t5_output'] = t5_output
with open('/content/temp.pkl', 'wb') as f:
    pickle.dump(temp, f)
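
The two parts of this notebook run in separate Python processes (the runtime is killed in between), so results are handed over through a pickle file on disk. The following is a minimal sketch of that round trip, assuming a throwaway temporary directory instead of `/content`; `save_state` and `load_state` are hypothetical helpers, not functions defined in the notebook.

```python
import os
import pickle
import tempfile


def save_state(path, state):
    """Write the state dict to disk, closing the file afterwards."""
    with open(path, "wb") as handle:
        pickle.dump(state, handle)


def load_state(path):
    """Read the state dict back from disk."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


demo_path = os.path.join(tempfile.mkdtemp(), "temp.pkl")

# "Part 1": store the input and the first model's output.
save_state(demo_path, {"text": "some context", "bert_output": "a clue"})

# "Part 2": reload, add the second model's output, and store again.
state = load_state(demo_path)
state["t5_output"] = "a question"
save_state(demo_path, state)

final_state = load_state(demo_path)
print(sorted(final_state))  # ['bert_output', 't5_output', 'text']
```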

## Scoring

In [None]:
# !pip install -U nltk
# !pip install easy-rouge
# !python -m nltk.downloader punkt

In [None]:
# import string, re
# import nltk
# from rouge.rouge import rouge_n_sentence_level

In [None]:
# def clean_sentence(collection):
#     collection = collection.translate(str.maketrans('','',string.punctuation))
#     collection = re.sub(r'\d+', '', collection)
#     collection = collection.strip()
#     return collection.split()

In [None]:
# def get_metrics(prediction, truth):
#     recall, precision, rouge = rouge_n_sentence_level(
#             truth.split(),
#             prediction.split(),
#             2)
#     bleu = nltk.translate.bleu_score.sentence_bleu(
#         [clean_sentence(truth)], clean_sentence(prediction))
#     meteor = nltk.translate.meteor_score.meteor_score(
#         truth, prediction)
#     return rouge, bleu, meteor
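
The commented-out `get_metrics` depends on `easy-rouge` and NLTK. As a dependency-free illustration of what a sentence-level bigram (ROUGE-2 style) score measures, the sketch below counts overlapping bigrams between a prediction and a reference. `bigram_overlap` is a hypothetical stand-in, not the `rouge_n_sentence_level` function, and its numbers may differ from that library's; the two example sentences are made up.

```python
from collections import Counter


def bigrams(tokens):
    """Count adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))


def bigram_overlap(prediction, truth):
    """Return (recall, precision, f1) over whitespace-split bigrams."""
    pred = bigrams(prediction.split())
    ref = bigrams(truth.split())
    # Clipped overlap: each bigram counts at most as often as it appears in both.
    overlap = sum((pred & ref).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(pred.values()) if pred else 0.0
    if recall + precision == 0:
        return recall, precision, 0.0
    f1 = 2 * recall * precision / (recall + precision)
    return recall, precision, f1


demo_recall, demo_precision, demo_f1 = bigram_overlap(
    "what country joined the eu",
    "what country joined the european union",
)
print(demo_recall, demo_precision)  # 0.6 0.75
```

Here three of the five reference bigrams are recovered (recall 0.6) and three of the four predicted bigrams are correct (precision 0.75).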

# Results

In [None]:
import pickle
# Load the accumulated results and print each entry.
with open('/content/temp.pkl', 'rb') as f:
    results = pickle.load(f)
for k, v in results.items():
    print('{}: {}'.format(k, v))

text:  Sweden is a constitutional monarchy and a parliamentary democracy, with legislative power vested in the 349-member unicameral Riksdag. It is a unitary state, currently divided into 21 counties and 290 municipalities. Sweden maintains a Nordic social welfare system that provides universal health care and tertiary education for its citizens. It has the world's eleventh-highest per capita income and ranks very highly in quality of life, health, education, protection of civil liberties, economic competitiveness, income equality, gender equality, prosperity and human development. Sweden joined the European Union on 1 January 1995, but has rejected NATO membership, as well as Eurozone membership following a referendum. It is also a member of the United Nations, the Nordic Council, the Council of Europe, the World Trade Organization and the Organisation for Economic Co-operation and Development (OECD).
bert_output: the country s s s,,,,,,,,,,,, in in
t5_output: What country has the wor