<a href="https://colab.research.google.com/github/krishnarevi/END2.0/blob/main/w2_Copy_of_Capstone_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, we show how we can take advantage of these recent advances to train a long form question answering system which takes in a question, fetches relevant passages from a document corpus, and writes a multi-sentence answer based on the question and retrieved passages.In particular, training embedding-based retrieval models to gather supporting evidence for open-domain questions is relatively new research area: the last few months have seen some significant progress in cases where direct supervision is available, or with extensive task-specific pretraining. Here, we show how our custom dataset allows us to train a dense retrieval system without access to either, making dense retrieval models more accessible.

## 1.a - Preliminaries
The implementation presented here relies on the Hugging Face 🤗transformers and 🤗nlp libraries. Wikipedia indexing relies on faiss for the dense version. You can get all of these by running:

<!-- pip install elasticsearch -->
pip install faiss_gpu
pip install nlp
pip install transformers
<!-- 
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.7.1-linux-x86_64.tar.gz
tar -xzvf elasticsearch-7.7.1-linux-x86_64.tar.gz -->

In [1]:
! nvidia-smi

Sun Sep 12 07:09:15 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
!pip install faiss_gpu nlp transformers

Collecting faiss_gpu
  Downloading faiss_gpu-1.7.1.post2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (89.7 MB)
[K     |████████████████████████████████| 89.7 MB 9.1 kB/s 
[?25hCollecting nlp
  Downloading nlp-0.4.0-py3-none-any.whl (1.7 MB)
[K     |████████████████████████████████| 1.7 MB 47.5 MB/s 
[?25hCollecting transformers
  Downloading transformers-4.10.2-py3-none-any.whl (2.8 MB)
[K     |████████████████████████████████| 2.8 MB 54.0 MB/s 
Collecting xxhash
  Downloading xxhash-2.0.2-cp37-cp37m-manylinux2010_x86_64.whl (243 kB)
[K     |████████████████████████████████| 243 kB 90.3 MB/s 
Collecting pyyaml>=5.1
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
[K     |████████████████████████████████| 636 kB 61.6 MB/s 
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
[K     |████████████████████████████████| 3.3 MB 68.

In [3]:
import functools
import math
import os  # noqa: F401
from random import choice, randint
from time import time

import numpy as np
import torch
import torch.utils.checkpoint as checkpoint
from torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler
from tqdm import tqdm

import faiss  # noqa: F401
import nlp  # noqa: F401
import pandas as pd
from transformers import AdamW, AutoModel, AutoModelForSeq2SeqLM, AutoTokenizer, get_linear_schedule_with_warmup
import json

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
import sys
sys.path.append('/content/drive/MyDrive/TSAI/Capstone_1')

In [6]:
# from lfqa_utils import *

In [7]:
import os
print(os.getcwd())
path = '/content/drive/MyDrive/TSAI/Capstone_1'
os.chdir(path)

/content


In [8]:
folder = "retriever_models"
# os.chdir(path)
print("current dir is: %s" % (os.getcwd()))

if os.path.isdir(folder):
    print("retriever_models directory exists")
else:
    print("retriever_models directory Doesn't exists, creating one")
    os.mkdir(folder)

current dir is: /content/drive/MyDrive/TSAI/Capstone_1
retriever_models directory exists


In [9]:
folder = "seq2seq_models"
# os.chdir(path)
print("current dir is: %s" % (os.getcwd()))

if os.path.isdir(folder):
    print("seq2seq_models directory exists")
else:
    print("seq2seq_models directory Doesn't exists, creating one")
    os.mkdir(folder)

current dir is: /content/drive/MyDrive/TSAI/Capstone_1
seq2seq_models directory exists


In [10]:
with open(path+'/train_data.json') as f:
        train = json.load(f)
with open(path+'/test_data.json') as f:
        test = json.load(f)
with open(path+'/context_master.json') as f:
        passage_snippets = json.load(f)

In [11]:
train[0]

{'id': 1,
 'x': 'Maxout Layer',
 'y': 'The Maxout layer can be implemented as follows \npython\nclass Maxout(nn.Module):\n def __init__(self, d_in, d_out, pool_size):\n super().__init__()\n self.d_in, self.d_out, self.pool_size = d_in, d_out, pool_size\n self.lin = nn.Linear(d_in, d_out * pool_size)\n def forward(self, inputs):\n shape = list(inputs.size())\n shape[-1] = self.d_out\n shape.append(self.pool_size)\n max_dim = len(shape) - 1\n out = self.lin(inputs)\n m, i = out.view(*shape).max(max_dim)\n return m\n',
 'z': 'For ones who need Maxout, I changed the above code to make it work. \npython\nclass Maxout(nn.Module):\n def __init__(self, d_in, d_out, pool_size):\n super().__init__()\n self.d_in, self.d_out, self.pool_size = d_in, d_out, pool_size\n self.lin = nn.Linear(d_in, d_out * pool_size)\n def forward(self, inputs):\n shape = list(inputs.size())\n shape[-1] = self.d_out\n shape.append(self.pool_size)\n max_dim = len(shape) - 1\n out = self.lin(inputs)\n m, i = out.view(*sh

In [12]:
len(train)

9140

In [13]:
test[100]

{'id': 101,
 'x': 'What do Variable(tensor, requires_grad) return instead of Variables?',
 'y': 'Tensors',
 'z': 'The Variable API has been deprecated: Variables are no longer necessary to use autograd with tensors. Autograd automatically supports Tensors with requires_grad set to True. Below please find a quick guide on what has changed: Variable(tensor) and Variable(tensor, requires_grad) still work as expected, but they return Tensors instead of Variables. var.data is the same thing as tensor.data. Methods such as var.backward(), var.detach(), var.register_hook() now work on tensors with the same method names.'}

In [14]:
len(test)

2286

### Retrieving Support Documents with an ELI5-Trained Dense Model

The sparse retriever works by finding passages which feature the words from the query. However, it has no way to know a priori which of these words are more important in context, and seems to struggle with understanding the central theme of the query (human-perceived temperature).

Thankfully, some recent works have taken advantage of advances in pre-trained contextual word representations to solve this problem. Models such as DPR or REALM for example learn to compute a vector representation of the query, as well as vector representations of Wikipedia passages in such a way that the passages that best answers a question maximize the dot product between the two representations. Retrieval is then reduced to a Maximum Inner Product Search, which can be executed efficiently using systems like FAISS.

These successes are very encouraging for our Open-Domain Long Form QA application. However, our task and setup do not quite meet the requirements of either of either of these approaches. On the one hand, the DPR system is trained using gold passage annotations: most major QA dataset tell the system which Wikipedia passage contains the answer. Unfortunately, we do not have such annotations for the ELI5 data. On the other hand, while REALM is trained without passage supervision, it requires a pretty expensive pre-training step with an Inverse Cloze Task (100,000 steps with batch size 4096), and the ability to re-compute the embeddings of all Wikipedia passages regularly during training.

In order to train a similar dense retrieval system at reduced cost without having access to gold passage annotation, we will have to take advantage of another unique feature of our dataset, namely the fact that the long form answers are quite similar in style to the Wikipedia passages we want to index. Our hypothesis then is that if we train a system to embed the questions and answers in our dataset in a way that allows us to easily match questions to answers, then using the answer embedder on Wikipedia passages should allow us to similarly match questions to supporting evidence from Wikipedia.

4.a - Contrastive Training with ELI5 In-Batch Negatives
As mentioned above, we want to train a system to produce question and answer embeddings, such that the dot product between the representation of a question and any of its answers is greater than between it and answers of all of the other questions in the dataset.

Unfortunately, actually comparing all questions to all answers before taking every single gradient step is computationally prohibitive: instead, we follow previous work in simply processing medium to large batches of question-answer pairs, and making sure that the dot product of a question with its answer is larger than with all other answers in the batch, and vice versa.

We use a cross-entropy loss for the multinomial distribution over all of the answers (or questions) in a batch, and make use of PyTorch gradient checkpointing to be able to use large batches with limited GPU memory: you can find all implementation details in the RetrievalQAEmbedder class in eli5_utils.py.

We use a single BERT-style pre-trained model to embed the questions and answers, and learn different projection matrices to bring both representations down to dimension 128: the projection matrices are trained from scratch as the sentence embedding model is fine-tuned. We found that the 8-layer distilled version of BERT from the Well-Read Students Learn Better paper performed as well or better as full BERT for a notable gain in computation speed: if you want an even faster model, that work provides pre-trained models spanning the full range of computation/accuracy trade-offs.

The model can than be trained with the following code: with batch size 32/512 on a single 16GB GPU, you can run 10 training epochs in under 6 hours.

In [15]:
###############
# retriever training
###############
class ELI5DatasetQARetriver(Dataset):
    def __init__(self, examples_array, num_rows, extra_answer_threshold=2, min_answer_length=1, training=True, n_samples=None):
        self.data = examples_array
        self.answer_thres = extra_answer_threshold
        self.min_length = min_answer_length
        self.training = training
        self.n_samples = num_rows if n_samples is None else n_samples
        self.num_rows = num_rows

    def __len__(self):
        return self.n_samples

    def make_example(self, idx):
        example = self.data[idx]
        question = example["x"]
        answer = example["y"]
        return (question, answer)

    def __getitem__(self, idx):
        return self.make_example(idx % self.num_rows)


class RetrievalQAEmbedder(torch.nn.Module):
    def __init__(self, sent_encoder, dim):
        super(RetrievalQAEmbedder, self).__init__()
        self.sent_encoder = sent_encoder
        self.output_dim = 128
        self.project_q = torch.nn.Linear(dim, self.output_dim, bias=False)
        self.project_a = torch.nn.Linear(dim, self.output_dim, bias=False)
        self.ce_loss = torch.nn.CrossEntropyLoss(reduction="mean")

    def embed_sentences_checkpointed(self, input_ids, attention_mask, checkpoint_batch_size=-1):
        # reproduces BERT forward pass with checkpointing
        if checkpoint_batch_size < 0 or input_ids.shape[0] < checkpoint_batch_size:
            return self.sent_encoder(input_ids, attention_mask=attention_mask)[1]
        else:
            # prepare implicit variables
            device = input_ids.device
            input_shape = input_ids.size()
            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)
            head_mask = [None] * self.sent_encoder.config.num_hidden_layers
            extended_attention_mask: torch.Tensor = self.sent_encoder.get_extended_attention_mask(
                attention_mask, input_shape, device
            )

            # define function for checkpointing
            def partial_encode(*inputs):
                encoder_outputs = self.sent_encoder.encoder(inputs[0], attention_mask=inputs[1], head_mask=head_mask,)
                sequence_output = encoder_outputs[0]
                pooled_output = self.sent_encoder.pooler(sequence_output)
                return pooled_output

            # run embedding layer on everything at once
            embedding_output = self.sent_encoder.embeddings(
                input_ids=input_ids, position_ids=None, token_type_ids=token_type_ids, inputs_embeds=None
            )
            # run encoding and pooling on one mini-batch at a time
            pooled_output_list = []
            for b in range(math.ceil(input_ids.shape[0] / checkpoint_batch_size)):
                b_embedding_output = embedding_output[b * checkpoint_batch_size : (b + 1) * checkpoint_batch_size]
                b_attention_mask = extended_attention_mask[b * checkpoint_batch_size : (b + 1) * checkpoint_batch_size]
                pooled_output = checkpoint.checkpoint(partial_encode, b_embedding_output, b_attention_mask)
                pooled_output_list.append(pooled_output)
            return torch.cat(pooled_output_list, dim=0)

    def embed_questions(self, q_ids, q_mask, checkpoint_batch_size=-1):
        q_reps = self.embed_sentences_checkpointed(q_ids, q_mask, checkpoint_batch_size)
        return self.project_q(q_reps)

    def embed_answers(self, a_ids, a_mask, checkpoint_batch_size=-1):
        a_reps = self.embed_sentences_checkpointed(a_ids, a_mask, checkpoint_batch_size)
        return self.project_a(a_reps)

    def forward(self, q_ids, q_mask, a_ids, a_mask, checkpoint_batch_size=-1):
        device = q_ids.device
        q_reps = self.embed_questions(q_ids, q_mask, checkpoint_batch_size)
        a_reps = self.embed_answers(a_ids, a_mask, checkpoint_batch_size)
        compare_scores = torch.mm(q_reps, a_reps.t())#cosine similarity
        loss_qa = self.ce_loss(compare_scores, torch.arange(compare_scores.shape[1]).to(device))#cross entrophy loss
        loss_aq = self.ce_loss(compare_scores.t(), torch.arange(compare_scores.shape[0]).to(device))
        loss = (loss_qa + loss_aq) / 2
        return loss


def make_qa_retriever_model(model_name="google/bert_uncased_L-8_H-512_A-8", from_file=None, device="cuda"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    bert_model = AutoModel.from_pretrained(model_name).to(device)
    # run bert_model on a dummy batch to get output dimension
    d_ids = torch.LongTensor(
        [[bert_model.config.bos_token_id if bert_model.config.bos_token_id is not None else 1]]
    ).to(device)
    d_mask = torch.LongTensor([[1]]).to(device)
    sent_dim = bert_model(d_ids, attention_mask=d_mask)[1].shape[-1]
    qa_embedder = RetrievalQAEmbedder(bert_model, sent_dim).to(device)
    if from_file is not None:
        param_dict = torch.load(from_file)  # has model weights, optimizer, and scheduler states
        qa_embedder.load_state_dict(param_dict["model"])
    return tokenizer, qa_embedder


def make_qa_retriever_batch(qa_list, tokenizer, max_len=128, device="cuda"):
    q_ls = [q for q, a in qa_list]
    a_ls = [a for q, a in qa_list]
 
    q_toks = tokenizer.batch_encode_plus(q_ls, max_length=max_len, pad_to_max_length=True)
    
    q_ids, q_mask = (torch.LongTensor(q_toks["input_ids"]).to(device),torch.LongTensor(q_toks["attention_mask"]).to(device),)
    # print(len(a_ls))

    a_toks = tokenizer.batch_encode_plus(a_ls, max_length=max_len, pad_to_max_length=True)
    # TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]
    # print(a_toks)
    a_ids, a_mask = (
        torch.LongTensor(a_toks["input_ids"]).to(device),
        torch.LongTensor(a_toks["attention_mask"]).to(device),
    )

    return (q_ids, q_mask, a_ids, a_mask)


def train_qa_retriever_epoch(model, dataset, tokenizer, optimizer, scheduler, args, e=0):
    model.train()
    # make iterator
    train_sampler = RandomSampler(dataset)
    model_collate_fn = functools.partial(
        make_qa_retriever_batch, tokenizer=tokenizer, max_len=args.max_length, device="cuda"
    )
    
    data_loader = DataLoader(dataset, batch_size=args.batch_size, sampler=train_sampler, collate_fn=model_collate_fn)
    epoch_iterator = tqdm(data_loader, desc="Iteration", disable=True)
    # print(next(iter(data_loader)).shape)
    # accumulate loss since last print
    loc_steps = 0
    loc_loss = 0.0
    st_time = time()
    for step, batch in enumerate(epoch_iterator):
        # print("q_ids",q_ids.shape)
        # print(" q_mask,", q_mask.shape)
        # print("A_id", a_ids.shape)
        q_ids, q_mask, a_ids, a_mask = batch
        pre_loss = model(q_ids, q_mask, a_ids, a_mask, checkpoint_batch_size=args.checkpoint_batch_size)
        loss = pre_loss.sum()
        # optimizer
        loss.backward()
        optimizer.step()
        scheduler.step()
        model.zero_grad()
        # some printing within the epoch
        loc_loss += loss.item()
        loc_steps += 1
        if step % args.print_freq == 0 or step == 1:
            print(
                "{:2d} {:5d} of {:5d} \t L: {:.3f} \t -- {:.3f}".format(
                    e, step, len(dataset) // args.batch_size, loc_loss / loc_steps, time() - st_time,
                )
            )
            loc_loss = 0
            loc_steps = 0


# def train_qa_retriever_joint_epoch(model, dataset_list, tokenizer, optimizer, scheduler, args, e=0):
#     model.train()
#     model_collate_fn = functools.partial(
#         make_qa_retriever_batch, tokenizer=tokenizer, max_len=args.max_length, device="cuda"
#     )
#     # make iterator
#     train_samplers = [RandomSampler(dataset) for dataset in dataset_list]
#     data_loaders = [
#         DataLoader(dataset, batch_size=args.batch_size, sampler=train_sampler, collate_fn=model_collate_fn)
#         for dataset, train_sampler in zip(dataset_list, train_samplers)
#     ]
#     iterators = [iter(dloader) for dloader in data_loaders]
#     joint_iter = zip(*iterators)
#     # accumulate loss since last print
#     loc_steps = 0
#     loc_loss = 0.0
#     st_time = time()
#     for step, (batches,) in enumerate(zip(joint_iter)):
#         for batch in batches:
#             q_ids, q_mask, a_ids, a_mask = batch

#             loss = model(q_ids, q_mask, a_ids, a_mask, checkpoint_batch_size=args.checkpoint_batch_size)
#             # optimizer
#             loss.backward()
#             optimizer.step()
#             scheduler.step()
#             model.zero_grad()
#             # some printing within the epoch
#             loc_loss += loss.item()
#             loc_steps += 1
#         if step % args.print_freq == 0:
#             print(
#                 "{:2d} {:5d} of {:5d} \t L: {:.3f} \t -- {:.3f}".format(
#                     e, step, len(dataset_list[0]) // args.batch_size, loc_loss / loc_steps, time() - st_time,
#                 )
#             )
#             loc_loss = 0
#             loc_steps = 0


def evaluate_qa_retriever(model, dataset, tokenizer, args):
    model.eval()
    # make iterator
    eval_sampler = SequentialSampler(dataset)
    model_collate_fn = functools.partial(
        make_qa_retriever_batch, tokenizer=tokenizer, max_len=args.max_length, device="cuda"
    )
    data_loader = DataLoader(dataset, batch_size=args.batch_size, sampler=eval_sampler, collate_fn=model_collate_fn)
    epoch_iterator = tqdm(data_loader, desc="Iteration", disable=True)
    tot_loss = 0.0
    with torch.no_grad():
        for step, batch in enumerate(epoch_iterator):
            q_ids, q_mask, a_ids, a_mask = batch
            loss = model(q_ids, q_mask, a_ids, a_mask)
            tot_loss += loss.item()
        return tot_loss / (step + 1)


def train_qa_retriever(qar_model, qar_tokenizer, qar_train_dset, qar_valid_dset, qar_args):
    qar_optimizer = AdamW(qar_model.parameters(), lr=qar_args.learning_rate, eps=1e-8)
    qar_scheduler = get_linear_schedule_with_warmup(
        qar_optimizer,
        num_warmup_steps=100,
        num_training_steps=(qar_args.num_epochs + 1) * math.ceil(len(qar_train_dset) / qar_args.batch_size),
    )
    for e in range(qar_args.num_epochs):
        train_qa_retriever_epoch(qar_model, qar_train_dset, qar_tokenizer, qar_optimizer, qar_scheduler, qar_args, e)
        m_save_dict = {
            "model": qar_model.state_dict(),
            "optimizer": qar_optimizer.state_dict(),
            "scheduler": qar_scheduler.state_dict(),
        }
        print("Saving model {}".format(qar_args.model_save_name))
        # torch.save(m_save_dict, "{}_{}.pth".format(qar_args.model_save_name, e))
        eval_loss = evaluate_qa_retriever(qar_model, qar_valid_dset, qar_tokenizer, qar_args)
        print("Evaluation loss epoch {:4d}: {:.3f}".format(e, eval_loss))

In [16]:
# training arguments
class ArgumentsQAR():
    def __init__(self):
        self.batch_size = 512
        self.max_length = 128
        self.checkpoint_batch_size = 32
        self.print_freq = 100
        self.pretrained_model_name = "google/bert_uncased_L-8_H-768_A-12"
        self.model_save_name = "retriever_model_l-8_h-768_b-512-512"
        self.learning_rate = 2e-4
        self.num_epochs =1

qar_args = ArgumentsQAR()

# prepare torch Dataset objects
qar_train_dset = ELI5DatasetQARetriver(train,num_rows=len(train), training=True)
qar_valid_dset = ELI5DatasetQARetriver(test,num_rows=len(test), training=False)

# load pre-trained BERT and make model
qar_tokenizer, qar_model = make_qa_retriever_model(
        model_name=qar_args.pretrained_model_name,
        from_file=None,
        device="cuda"
)

# train the model
train_qa_retriever(qar_model, qar_tokenizer, qar_train_dset, qar_valid_dset, qar_args)

Downloading:   0%|          | 0.00/384 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/327M [00:00<?, ?B/s]

Some weights of the model checkpoint at google/bert_uncased_L-8_H-768_A-12 were not used when initializing BertModel: ['cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncat

 0     0 of    17 	 L: 6.417 	 -- 16.878
 0     1 of    17 	 L: 6.418 	 -- 33.808
Saving model retriever_model_l-8_h-768_b-512-512
Evaluation loss epoch    0: 5.872


In [17]:
# os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

Once the model is trained, it can be used to compute passage embeddings for all document corpus. The make_qa_dense_index method takes advantage of numpy memory-mapping, so embeddings are written directly to disk. Again with a single GPU, computing the full set of passage embeddings should take about 18 hours.

In [18]:
# type(qar_model)

In [19]:
# qar_model.save_pretrained('/content/drive/MyDrive/TSAI/Capstone_1/retriever_models/')

In [20]:
# qar_model = AutoModel.from_pretrained('/content/').to('cuda')

In [21]:
# type(qar)

In [22]:
# type(qar_tokenizer)

In [23]:
# qar_tokenizer.save_pretrained('/content/drive/MyDrive/TSAI/Capstone_1/ret_tokenizer/')

In [24]:
# qar_tokenizer = AutoTokenizer.from_pretrained('/content/drive/MyDrive/TSAI/Capstone_1qa_s2s_tokenizer/')

In [25]:

###############
# ELI5-trained retrieval model usage
###############
def embed_passages_for_retrieval(passages, tokenizer, qa_embedder, max_length=128, device="cuda"):
    a_toks = tokenizer.batch_encode_plus(passages, max_length=max_length, pad_to_max_length=True)
    a_ids, a_mask = (
        torch.LongTensor(a_toks["input_ids"]).to(device),
        torch.LongTensor(a_toks["attention_mask"]).to(device),
    )
    with torch.no_grad():
        a_reps = qa_embedder.embed_answers(a_ids, a_mask).cpu().type(torch.float)
    return a_reps.numpy()
def embed_questions_for_retrieval(q_ls, tokenizer, qa_embedder, device="cuda"):
    q_toks = tokenizer.batch_encode_plus(q_ls, max_length=128, pad_to_max_length=True)
    q_ids, q_mask = (
        torch.LongTensor(q_toks["input_ids"]).to(device),
        torch.LongTensor(q_toks["attention_mask"]).to(device),
    )
    with torch.no_grad():
        q_reps = qa_embedder.embed_questions(q_ids, q_mask).cpu().type(torch.float)
    return q_reps.numpy()
def make_qa_dense_index(
    qa_embedder,
    tokenizer,
    passages_dset,
    batch_size=512,
    max_length=128,
    index_name="kilt_passages_reps.dat",
    dtype="float32",
    device="cuda",
):
    st_time = time()
    fp = np.memmap(index_name, dtype=dtype, mode="w+", shape=(len(passages_dset),128))
    n_batches = math.ceil(len(passages_dset) / batch_size)
    for i in range(n_batches):
        passages = [p["z"] for p in passages_dset[i * batch_size : (i + 1) * batch_size]]
        reps = embed_passages_for_retrieval(passages, tokenizer, qa_embedder, max_length, device)
        fp[i * batch_size : (i + 1) * batch_size] = reps
        if i % 50 == 0:
            print(i, time() - st_time)

In [26]:
os.chdir(r'/content/drive/MyDrive/TSAI/Capstone_1')

In [27]:
if not os.path.isfile('wiki40b_passages_reps_32_l-8_h-768_b-512-512.dat'):
  print("hi")

  make_qa_dense_index(
          qar_model, qar_tokenizer, passage_snippets, device='cuda',
          index_name='wiki40b_passages_reps_32_l-8_h-768_b-512-512.dat' )

### 4.b - Using the Trained Dense Retriever and Wikipedia Index
Now that we have trained our model to compute query and answer embeddings and used it to compute passage embeddings for all our Wikipedia snippets, let's see whether it can actually find supporting evidence for a new question. Recall the the two steps to using the dense retriever: we first compute an embedding for a new question, then do Max Inner Product Search with the pre-computed passage representations.

The MIPS part can be executed efficiently with the faiss library. Additionally, since we computed 128-dimensional passage embeddings, the whole of the representations fits on a GPU, making retrieval even faster. We can create the faiss_gpu index with the following code:

In [28]:
n_ret = 5

In [29]:
faiss_res = faiss.StandardGpuResources()
wiki40b_passage_reps = np.memmap(
            'wiki40b_passages_reps_32_l-8_h-768_b-512-512.dat',
            dtype='float32', mode='r',
            # shape=(wiki40b_snippets.num_rows, 128)
            # wiki40b_snippets.num_rows = 11378343,english sections from wiki40B dataset
            shape=(len(passage_snippets), 128)
)

wiki40b_index_flat = faiss.IndexFlatIP(128)
wiki40b_gpu_index = faiss.index_cpu_to_gpu(faiss_res, 0, wiki40b_index_flat)
wiki40b_gpu_index.add(wiki40b_passage_reps)

In [30]:

# build a support document for the question out of Wikipedia snippets
# def query_qa_dense_index(
#     question, qa_embedder, tokenizer, wiki_passages, wiki_index, n_results=10, min_length=2, device="cuda"
# ):
#     q_rep = embed_questions_for_retrieval([question], tokenizer, qa_embedder, device=device)
#     D, I = wiki_index.search(q_rep, 2 * n_results)
#     res_passages = [wiki_passages[int(i)] for i in I[0]]
#     support_doc = "<P> " + " <P> ".join([p["z"] for p in res_passages])
#     res_list = [p['z'] for p in res_passages]

#     for r, sc in zip(res_list, D[0]):
#         r["score"] = float(sc)
#     return support_doc, res_list
# find nearest neighbors of an answer or declarative text in Wikipedia snippets

# build a support document for the question out of Wikipedia snippets
def query_qa_dense_index(
    question, qa_embedder, tokenizer, wiki_passages, wiki_index, n_results=n_ret, min_length=1, device="cuda"
):
    q_rep = embed_questions_for_retrieval([question], tokenizer, qa_embedder, device=device)
    D, I = wiki_index.search(q_rep, 2 * n_results)
    res_passages = [wiki_passages[int(i)] for i in I[0]]
    support_doc = "<P> " + " <P> ".join([p["z"] for p in res_passages])
    res_list = [dict([(k, p[k]) for k in ["z"]]) for p in res_passages]
    res_list = [res for res in res_list if len(res["z"].split()) > min_length][:n_results]
    for r, sc in zip(res_list, D[0]):
        r["score"] = float(sc)
    return support_doc, res_list

Now we can use the query_qa_dense_index function to query the dense index for our running example question :

In [31]:
question = test[12]['x']
question

'In what platform do the modules Conv2d() and Linear() run?'

In [32]:
doc, res_list = query_qa_dense_index(question, qar_model, qar_tokenizer, passage_snippets, wiki40b_gpu_index, device='cuda')
print(res_list)
df = pd.DataFrame({
    
    'Text': ['--- ' + question] + [res['z'] for res in res_list],
})
df.style.set_properties(**{'text-align': 'left'})

[{'z': "I've put all your functions followed by the corresponding pytorch function. Most are the same name and put in the pytorch docs (https://pytorch.org/docs/stable/index.html)\ntf.cumsum(alpha, axis = 1)  \ntorch.cumsum(alpha, dim=1)\n\ntf.shape(alpha_cumsum)[0]\nalpha_cumsum.shape[0]\n\ntf.random_uniform(shape = [len_batch, 1], minval = 0., maxval = 1.)\ntorch.rand([len_batch,1])\n\ntf.nn.relu(rand_prob - alpha_cumsum)\ntorch.nn.functional.relu(rand_prob - alpha_cumsum)\n\ntf.count_nonzero(alpha_relu, 1)\ntorch.count_nonzero(alpha_relu, dim=1)\n\ntf.one_hot(alpha_index, len(a))\ntorch.nn.functional.one_hot(alpha_index, len(a)) # assuming len(a) is number of classes\n\n", 'score': 4.655016899108887}, {'z': "I'm working on the same thing.\nHere is what I got.\ndef compute_angle(p1, p2):\n    # inner_product = torch.dot(p1, p2)\n    inner_product = (p1*p2).sum(-1)\n    p1_norm = torch.linalg.norm(p1, axis=-1)\n    p2_norm = torch.linalg.norm(p2, axis=-1)\n    cos = inner_product / (p



Unnamed: 0,Text
0,--- In what platform do the modules Conv2d() and Linear() run?
1,"I've put all your functions followed by the corresponding pytorch function. Most are the same name and put in the pytorch docs (https://pytorch.org/docs/stable/index.html) tf.cumsum(alpha, axis = 1) torch.cumsum(alpha, dim=1) tf.shape(alpha_cumsum)[0] alpha_cumsum.shape[0] tf.random_uniform(shape = [len_batch, 1], minval = 0., maxval = 1.) torch.rand([len_batch,1]) tf.nn.relu(rand_prob - alpha_cumsum) torch.nn.functional.relu(rand_prob - alpha_cumsum) tf.count_nonzero(alpha_relu, 1) torch.count_nonzero(alpha_relu, dim=1) tf.one_hot(alpha_index, len(a)) torch.nn.functional.one_hot(alpha_index, len(a)) # assuming len(a) is number of classes"
2,"I'm working on the same thing. Here is what I got. def compute_angle(p1, p2):  # inner_product = torch.dot(p1, p2)  inner_product = (p1*p2).sum(-1)  p1_norm = torch.linalg.norm(p1, axis=-1)  p2_norm = torch.linalg.norm(p2, axis=-1)  cos = inner_product / (p1_norm * p2_norm)  cos = torch.clamp(cos, -0.99999, 0.99999)  angle = torch.acos(cos)  return angle def compute_dihedral(v1,v2,v3,v4):  ab = v1 - v2  cb = v3 - v2  db = v4 - v3  u = torch.cross(ab, cb)  v = torch.cross(db, cb)  w = torch.cross(u, v)  angle = compute_angle(u, v)  # angle = torch.where(compute_angle(cb, w) 0.001, -angle, angle)  angle = torch.where(compute_angle(cb, w) 1, -angle, angle) # try: # if compute_angle(cb, w) 0.001: # angle = -angle # except ZeroDivisionError: # # dihedral=pi # pass  return angle v1 = torch.tensor([-17.0490, 5.9270, 21.5340], requires_grad=True) v2 = torch.tensor([-0.1608, 0.0600, -0.0371], requires_grad=True) v3 = torch.tensor([-0.2000, 0.0007, -0.0927], requires_grad=True) v4 = torch.tensor([-0.1423, 0.0197, -0.0727], requires_grad=True) dihedral = compute_dihedral(v1,v2,v3,v4) target_dihedral = -2 print(dihedral) # should print -2.6387 for i in range(100):  dihedral = compute_dihedral(v1,v2,v3,v4)  loss = (dihedral - target_dihedral)**2  loss.backward()  learning_rate = 0.001  with torch.no_grad():  v1 -= learning_rate * v1.grad  v2 -= learning_rate * v2.grad  v3 -= learning_rate * v3.grad  v4 -= learning_rate * v4.grad  # Manually zero the gradients after updating weights  v1.grad = None  v2.grad = None  v3.grad = None  v4.grad = None print(compute_dihedral(v1,v2,v3,v4)) # should print -2"
3,"Cool! Take a look at how similar functions, like sin, are implemented. See: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/UnaryOps.cpp for the device-independent part of the code. Then you have the CPU and CUDA-specific parts in ATen/native/cpu and ATen/native/cuda, respectively."
4,"torch.baddbmm performs a batch matrix-matrix product of matrices in batch1 and batch2. Performs a batch matrix-matrix product of matrices stored in input and mat2. Returns the matrix product of the NNN 2-D tensors. Computes the Cholesky decomposition of a symmetric positive-definite matrix AAA or for batches of symmetric positive-definite matrices. Computes the inverse of a symmetric positive-definite matrix AAA using its Cholesky factor uuu: returns matrix inv. Solves a linear system of equations with a positive semidefinite matrix to be inverted given its Cholesky factor matrix uuu. Computes the dot product of two 1D tensors. Computes the eigenvalues and eigenvectors of a real square matrix. This is a low-level function for calling LAPACK’s geqrf directly. Alias of torch.outer(). Computes the dot product for 1D tensors. Alias for torch.linalg.inv() Alias for torch.linalg.det() Calculates log determinant of a square matrix or batches of square matrices. Alias for torch.linalg.slogdet() Computes the solution to the least squares and least norm problems for a full rank matrix AAA of size (m × n)(m \times n)(m × n) and a matrix BBB of size (m × k)(m \times k)(m × k). Computes the LU factorization of a matrix or batches of matrices A. Returns the LU solve of the linear system Ax=bAx = bAx=b using the partially pivoted LU factorization of A from torch.lu(). Unpacks the data and pivots from a LU factorization of a tensor into tensors L and U and a permutation tensor P such that LU_data, LU_pivots = (P @ L @ U).lu(). Matrix product of two tensors. Alias for torch.linalg.matrix_power() Returns the numerical rank of a 2-D tensor. Computes the matrix exponential of a square matrix or of each square matrix in a batch. Performs a matrix multiplication of the matrices input and mat2. Performs a matrix-vector product of the matrix input and the vector vec. Alias for torch.linalg.householder_product()."
5,"Performs a matrix-vector product of the matrix mat and the vector vec. Performs the outer-product of vectors vec1 and vec2 and adds it to the matrix input. torch.baddbmm performs a batch matrix-matrix product of matrices in batch1 and batch2. Performs a batch matrix-matrix product of matrices stored in input and mat2. Returns the matrix product of the NNN 2-D tensors. Computes the Cholesky decomposition of a symmetric positive-definite matrix AAA or for batches of symmetric positive-definite matrices. Computes the inverse of a symmetric positive-definite matrix AAA using its Cholesky factor uuu: returns matrix inv. Solves a linear system of equations with a positive semidefinite matrix to be inverted given its Cholesky factor matrix uuu. Computes the dot product of two 1D tensors. Computes the eigenvalues and eigenvectors of a real square matrix. This is a low-level function for calling LAPACK’s geqrf directly. Alias of torch.outer(). Computes the dot product for 1D tensors. Alias for torch.linalg.inv() Alias for torch.linalg.det() Calculates log determinant of a square matrix or batches of square matrices. Alias for torch.linalg.slogdet() Computes the solution to the least squares and least norm problems for a full rank matrix AAA of size (m × n)(m \times n)(m × n) and a matrix BBB of size (m × k)(m \times k)(m × k). Computes the LU factorization of a matrix or batches of matrices A. Returns the LU solve of the linear system Ax=bAx = bAx=b using the partially pivoted LU factorization of A from torch.lu(). Unpacks the data and pivots from a LU factorization of a tensor into tensors L and U and a permutation tensor P such that LU_data, LU_pivots = (P @ L @ U).lu(). Matrix product of two tensors. Alias for torch.linalg.matrix_power() Returns the numerical rank of a 2-D tensor. Computes the matrix exponential of a square matrix or of each square matrix in a batch. Performs a matrix multiplication of the matrices input and mat2."


### 4.c - Retriever Model Evaluation
We have trained a retrieval model that seems to be working a little better than the traditional word-matching based approach, at least on our running example. Before we use it to actually answer questions, however, we would like to be able to get some quantitative evaluation of the performances of both approaches.

For the retriever, we want to favor recall over precision: our first priority is to make sure that all of the information needed to write the answers is present in the support document. If there is unrelated information, the generation model can learn to sort it out. We measure this by computing the proportion of words in the high-scoring answers which are present in the retrieved support document. To focus on important words, we also weigh answer words by their Inverse Document Frequency. This gives us the following IDF-recall scoring function:

## 5. Generating Answers with a Sequence-to-Sequence Model

In [33]:
# ELI5 seq2seq model training
###############
class ELI5DatasetS2S(Dataset):
    def __init__(
        self, examples_array,num_rows, make_doc_fun=None, document_cache=None, training=True
    ):
        self.training = training
        self.data = examples_array
        self.make_doc_function = make_doc_fun
        self.document_cache = {} if document_cache is None else document_cache
        self.num_rows = num_rows
        assert not (make_doc_fun is None and document_cache is None)
        # make index of specific question-answer pairs from multi-answers
        if self.training:
            self.qa_id_list = [(i, 0) for i in range(self.num_rows)]

        else:
            self.qa_id_list = [(i, 0) for i in range(self.num_rows)]

    def __len__(self):
        return len(self.qa_id_list)

    def make_example(self, idx):
        i, j = self.qa_id_list[idx]
        example = self.data[i]
        question = example["x"] 
        answer = example["y"]
        q_id = example["id"]
        if self.make_doc_function is not None:
            self.document_cache[q_id] = self.document_cache.get(q_id, self.make_doc_function(example["x"]))
        document = self.document_cache[q_id]
        in_st = "question: {} context: {}".format(
            question.lower().strip(), document.lower().strip(),
        )
        out_st = answer
        return (in_st, out_st)

    def __getitem__(self, idx):
        return self.make_example(idx)


def make_qa_s2s_model(model_name="facebook/bart-large", from_file=None, device="cuda"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
    if from_file is not None:
        param_dict = torch.load(from_file)  # has model weights, optimizer, and scheduler states
        model.load_state_dict(param_dict["model"])
    return tokenizer, model


def make_qa_s2s_batch(qa_list, tokenizer, max_len=64, max_a_len=128, device="cuda"):
    q_ls = [q for q, a in qa_list]
    a_ls = [a for q, a in qa_list]
    q_toks = tokenizer.batch_encode_plus(q_ls, max_length=max_len, pad_to_max_length=True)
    q_ids, q_mask = (
        torch.LongTensor(q_toks["input_ids"]).to(device),
        torch.LongTensor(q_toks["attention_mask"]).to(device),
    )
    a_toks = tokenizer.batch_encode_plus(a_ls, max_length=min(max_len, max_a_len), pad_to_max_length=True)
    a_ids, a_mask = (
        torch.LongTensor(a_toks["input_ids"]).to(device),
        torch.LongTensor(a_toks["attention_mask"]).to(device),
    )
    labels = a_ids[:, 1:].contiguous().clone()
    labels[a_mask[:, 1:].contiguous() == 0] = -100
    # print(labels)
    model_inputs = {
        "input_ids": q_ids,
        "attention_mask": q_mask,
        "decoder_input_ids": a_ids[:, :-1].contiguous(),
        "labels": labels,
    }
    # print("it'sme",model_inputs)
    return model_inputs


def train_qa_s2s_epoch(model, dataset, tokenizer, optimizer, scheduler, args, e=0, curriculum=True):
    model.train()
    # make iterator
    if curriculum:
        train_sampler = SequentialSampler(dataset)
    else:
        train_sampler = RandomSampler(dataset)
    model_collate_fn = functools.partial(
        make_qa_s2s_batch, tokenizer=tokenizer, max_len=args.max_length, device="cuda"
    )
    data_loader = DataLoader(dataset, batch_size=args.batch_size, sampler=train_sampler, collate_fn=model_collate_fn)
    epoch_iterator = tqdm(data_loader, desc="Iteration", disable=True)

  
    # accumulate loss since last print
    loc_steps = 0
    loc_loss = 0.0
    st_time = time()
    for step, batch_inputs in enumerate(epoch_iterator):
        # print(type(step))
        
        pre_loss = model(**batch_inputs)[0]
        # print(pre_loss.s(),"pre sum")
        # print(pre_loss,"pre loss")
        # print(pre_loss.item(),"pre shape 0")
        loss = pre_loss.sum() / pre_loss.item()
        loss.backward()
        # optimizer
        if step % args.backward_freq == 0:
            optimizer.step()
            scheduler.step()
            model.zero_grad()
        # some printing within the epoch
        loc_loss += loss.item()
        loc_steps += 1
        if step % args.print_freq == 0 or step == 1:
            print(
                "{:2d} {:5d} of {:5d} \t L: {:.3f} \t -- {:.3f}".format(
                    e, step, len(dataset) // args.batch_size, loc_loss / loc_steps, time() - st_time,
                )
            )
            loc_loss = 0
            loc_steps = 0


def eval_qa_s2s_epoch(model, dataset, tokenizer, args):
    model.eval()
    # make iterator
    train_sampler = SequentialSampler(dataset)
    model_collate_fn = functools.partial(
        make_qa_s2s_batch, tokenizer=tokenizer, max_len=args.max_length, device="cuda"
    )
    data_loader = DataLoader(dataset, batch_size=args.batch_size, sampler=train_sampler, collate_fn=model_collate_fn)
    epoch_iterator = tqdm(data_loader, desc="Iteration", disable=True)
    # accumulate loss since last print
    loc_steps = 0
    loc_loss = 0.0
    st_time = time()
    with torch.no_grad():
        for step, batch_inputs in enumerate(epoch_iterator):
            pre_loss = model(**batch_inputs)[0]
            print(pre_loss,'pre_loss')
            print(pre_loss.shape)
            print(pre_loss.sum(),'sum')
            loss = pre_loss.sum() / pre_loss.item()
            loc_loss += loss.item()
            print("loc loss here",loc_loss)
            loc_steps += 1
            if step % args.print_freq == 0:
                print(
                    "{:5d} of {:5d} \t L: {:.3f} \t -- {:.3f}".format(
                        step, len(dataset) // args.batch_size, loc_loss / loc_steps, time() - st_time,
                    )
                )
    print("Total \t L: {:.3f} \t -- {:.3f}".format(loc_loss / loc_steps, time() - st_time,))


def train_qa_s2s(qa_s2s_model, qa_s2s_tokenizer, s2s_train_dset, s2s_valid_dset, s2s_args):
    s2s_optimizer = AdamW(qa_s2s_model.parameters(), lr=s2s_args.learning_rate, eps=1e-8)
    s2s_scheduler = get_linear_schedule_with_warmup(
        s2s_optimizer,
        num_warmup_steps=400,
        num_training_steps=(s2s_args.num_epochs + 1) * math.ceil(len(s2s_train_dset) / s2s_args.batch_size),
    )
    for e in range(s2s_args.num_epochs):
        # print((e == 0))

        train_qa_s2s_epoch(
            qa_s2s_model,
            s2s_train_dset,
            qa_s2s_tokenizer,
            s2s_optimizer,
            s2s_scheduler,
            s2s_args,
            e,
            curriculum=True,
        )
        m_save_dict = {
            "model": qa_s2s_model.state_dict(),
            "optimizer": s2s_optimizer.state_dict(),
            "scheduler": s2s_scheduler.state_dict(),
        }
        print("Saving model {}".format(s2s_args.model_save_name))
        eval_qa_s2s_epoch(qa_s2s_model, s2s_valid_dset, qa_s2s_tokenizer, s2s_args)
        # torch.save(m_save_dict, "\{}_{}.pth".format(s2s_args.model_save_name, e))



In [34]:
# qa_s2s_tokenizer.save_pretrained(path+"qa_s2s_tokenizer")

In [35]:
# qar_tokenizer

In [36]:
n_ret = 2

In [37]:
# pre-computing support documents
eli5_train_docs = []
for example in train:
    support_doc, dense_res_list = query_qa_dense_index(
        example['x'], qar_model, qar_tokenizer,passage_snippets, wiki40b_gpu_index, n_results=n_ret
    )
    eli5_train_docs += [(example['id'], support_doc, dense_res_list)]

eli5_valid_docs = []
for example in test:
    support_doc, dense_res_list = query_qa_dense_index(
        example['x'], qar_model, qar_tokenizer, passage_snippets, wiki40b_gpu_index, n_results=n_ret
    )
    eli5_valid_docs += [(example['id'], support_doc, dense_res_list)]

# training loop proper
class ArgumentsS2S():
    def __init__(self):
        self.batch_size = 2
        self.backward_freq = 16
        self.max_length = 512
        self.print_freq = 100
        self.model_save_name = "eli5_bart_model"
        self.learning_rate = 2e-4
        self.num_epochs =1

s2s_args = ArgumentsS2S()

# eli5_train_docs = json.load(open('precomputed/eli5_train_precomputed_dense_docs.json'))
# eli5_valid_docs = json.load(open('precomputed/eli5_valid_precomputed_dense_docs.json'))
s2s_train_dset = ELI5DatasetS2S(train,num_rows =len(train), document_cache=dict([(k, d) for k, d, src_ls in eli5_train_docs]))
s2s_valid_dset = ELI5DatasetS2S(test,num_rows =len(test), document_cache=dict([(k, d) for k, d, src_ls in eli5_valid_docs]), training=False)

qa_s2s_tokenizer, pre_model = make_qa_s2s_model(
    model_name="facebook/bart-large",
    from_file=None,
    device="cuda"
)
# qa_s2s_model = torch.nn.DataParallel(pre_model)
qa_s2s_model =pre_model
train_qa_s2s(qa_s2s_model, qa_s2s_tokenizer, s2s_train_dset, s2s_valid_dset, s2s_args)



Downloading:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.60k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.02G [00:00<?, ?B/s]

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


 0     0 of  4570 	 L: 1.000 	 -- 0.708
 0     1 of  4570 	 L: 1.000 	 -- 1.228
 0   100 of  4570 	 L: 1.000 	 -- 55.804
 0   200 of  4570 	 L: 1.000 	 -- 112.262
 0   300 of  4570 	 L: 1.000 	 -- 168.675
 0   400 of  4570 	 L: 1.000 	 -- 226.245
 0   500 of  4570 	 L: 1.000 	 -- 283.617
 0   600 of  4570 	 L: 1.000 	 -- 341.312
 0   700 of  4570 	 L: 1.000 	 -- 397.916
 0   800 of  4570 	 L: 1.000 	 -- 454.350
 0   900 of  4570 	 L: 1.000 	 -- 511.381
 0  1000 of  4570 	 L: 1.000 	 -- 568.925
 0  1100 of  4570 	 L: 1.000 	 -- 625.586
 0  1200 of  4570 	 L: 1.000 	 -- 682.082
 0  1300 of  4570 	 L: 1.000 	 -- 738.325
 0  1400 of  4570 	 L: 1.000 	 -- 795.237
 0  1500 of  4570 	 L: 1.000 	 -- 852.131
 0  1600 of  4570 	 L: 1.000 	 -- 910.072
 0  1700 of  4570 	 L: 1.000 	 -- 967.810
 0  1800 of  4570 	 L: 1.000 	 -- 1024.755
 0  1900 of  4570 	 L: 1.000 	 -- 1081.397
 0  2000 of  4570 	 L: 1.000 	 -- 1139.332
 0  2100 of  4570 	 L: 1.000 	 -- 1196.711
 0  2200 of  4570 	 L: 1.000 	 -- 1

We now have everything we need to answer any question! Now let's try the full system on our running example along with the first four questions of the test set:

In [38]:
import psutil
def get_size(bytes, suffix="B"):
    factor = 1024
    for unit in ["", "K", "M", "G", "T", "P"]:
        if bytes < factor:
            return f"{bytes:.2f}{unit}{suffix}"
        bytes /= factor
print("="*40, "Memory Information", "="*40)
svmem = psutil.virtual_memory()
print(f"Total: {get_size(svmem.total)}") ; print(f"Available: {get_size(svmem.available)}")
print(f"Used: {get_size(svmem.used)}") ; print(f"Percentage: {svmem.percent}%")

Total: 25.46GB
Available: 21.75GB
Used: 6.33GB
Percentage: 14.6%


In [39]:
    import torch
    torch.cuda.empty_cache()

In [40]:
# generate answer from input "question: ... context: <p> ..."
def qa_s2s_generate(
    question_doc,
    qa_s2s_model,
    qa_s2s_tokenizer,
    num_answers=1,
    num_beams=None,
    min_len=64,
    max_len=512,
    do_sample=False,
    temp=1.0,
    top_p=None,
    top_k=None,
    max_input_length=1024,
    device="cuda:0",
):
    model_inputs = make_qa_s2s_batch([(question_doc, "A")], qa_s2s_tokenizer, max_input_length, device=device,)
    n_beams = num_answers if num_beams is None else max(num_beams, num_answers)
    generated_ids = qa_s2s_model.generate(
        input_ids=model_inputs["input_ids"],
        attention_mask=model_inputs["attention_mask"],
        min_length=min_len,
        max_length=max_len,
        do_sample=do_sample,
        early_stopping=True,
        num_beams=1 if do_sample else n_beams,
        temperature=temp,
        top_k=top_k,
        top_p=top_p,
        eos_token_id=qa_s2s_tokenizer.eos_token_id,
        no_repeat_ngram_size=3,
        num_return_sequences=num_answers,
        decoder_start_token_id=qa_s2s_tokenizer.bos_token_id,
    )
    return [qa_s2s_tokenizer.decode(ans_ids, skip_special_tokens=True).strip() for ans_ids in generated_ids]

In [41]:
questions = []
answers = []

for i in [10] + [j for j in range(4)]:
    # create support document with the dense index
    question = test[i]['x']
    doc, res_list = query_qa_dense_index(
        question, qar_model, qar_tokenizer,
        passage_snippets, wiki40b_gpu_index, device='cuda'
    )
    # concatenate question and support document into BART input
    question_doc = "question: {} context: {}".format(question, doc)
    # generate an answer with beam search
    answer = qa_s2s_generate(
            question_doc, qa_s2s_model, qa_s2s_tokenizer,
            num_answers=1,
            num_beams=8,
            min_len=64,
            max_len=256,
            max_input_length=1024,
            device="cuda:0"
    )[0]
    questions += [question]
    answers += [answer]

df = pd.DataFrame({
    'Question': questions,
    'Answer': answers,
})
df.style.set_properties(**{'text-align': 'left'})



Unnamed: 0,Question,Answer
0,`with torch.enable_grad` also works outside a `no_grad` context,"enable_grad` also works outside a no_grad context context: def __init__(self):  super(self, self).__init__()  self.len(len) self.len = nn.Linear(2, 2, 2)"
1,"""exp_cuda"" not implemented for 'ComplexDouble'","You can use the following code: import torch.utils as tf def __init__(self):  super(tf, self).__init__() print(self.getattr(self).getattr()) def forward(self, x):  self.setattr(x)"
2,How to correctly use CTC Loss with GRU in pytorch?,"PyTorch is a deep learning deep learning extension library for PyTorch. It supports CTCLoss, which is an extension of the GRU library. You can do something like this: import torch.utils as tf def __init__(self):  super(tf, self).__init__() def forward(self, x):  x = torch.randn(3, 3) x_predicted = x.sum(x) print(x.shape[0]) If you want to do something more complicated, you can do the following: self.x = x[0] = x(x[0].shape[1]"
3,glibc error while importing torch,"You can use the following code: def __init__(self):  super(self, self).__init__() class MyModel(torch.randn(10, 10, 10))  self.train_state = train_state[None, None, None] class TrainingMode(nn.Module):  """""""
4,"TypeError: add(): argument 'other' (position 1) must be Tensor, not numpy.ndarray","If you want to use a GPU for deep learning there is no difference between CUDA and CUDA... You can use PyTorch's nn.Conv2d library. If you are using pytorch, you can use the following code: import torch from torch.utils import nn def __init__(self):  super().__init__() print(self).__getitem__() This will return a tuple of tensor([[1, 2, 3], [4, 5, 6], [7, 8], [9, 9], [10, 11], [12, 13], [13, 14], [14, 15], [16, 16], [17, 17], [18, 18], [19, 19], [20, 20], [21, 21], [22, 22], [23, 23], [24, 24], [25, 25], [26, 26], [27, 27], [28, 28, 28], [29, 28]], [32, 32], [33, 33], [34, 34, 34], [35, 35, 35], [36, 36,"
