# Information Retrieval

Information Retrieval (IR), closely related to _open-domain Question Answering_, is the task performed by a _search engine_.

_Ad hoc retrieval_ is when a user poses a query and a search engine returns an ordered list of documents. Google is the most famous ad hoc search engine.

One __deep learning__ approach to this problem uses _BERT_. A _query encoder_ encodes the query, and the output vector at BERT's special `[CLS]` token is taken as the query representation. A separate _document encoder_ does the same for every document (or for sub-sections of a document when the document is too long). The dot product is taken between the encoded query and each encoded document; documents with higher scores are more relevant:

$$ q = BERT_{Q}(query)[CLS] $$
$$ d = BERT_{D}(document)[CLS], d \in D $$
$$ score(d, q) = d \cdot q $$

The most relevant document:

$$ d_1 = \arg\max_{d \in D} (d \cdot q) $$
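
The scoring and argmax steps above can be sketched with plain NumPy. This is a minimal illustration only: the vectors `q` and `D` below are made-up toy values standing in for encoder outputs, not real BERT embeddings.

```python
import numpy as np

# Toy stand-ins for encoder outputs (hypothetical values, not real BERT embeddings).
q = np.array([1.0, 0.0, 2.0])    # encoded query
D = np.array([
    [0.5, 1.0, 0.0],             # document 0
    [1.0, 0.0, 1.0],             # document 1
    [0.0, 2.0, 0.5],             # document 2
])

scores = D @ q                   # score(d, q) = d . q for every document at once
ranking = np.argsort(-scores)    # document indices, most relevant first
best = int(np.argmax(scores))    # index of the most relevant document, d_1

print(scores, ranking, best)
```

Stacking the document vectors as rows of a matrix lets a single matrix-vector product score the whole collection, which is why the formula is written as $D \cdot q$.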

In [1]:
# Naive Transformers approach
import torch
import numpy as np
from transformers import BertModel, BertTokenizer, BertConfig


In [2]:
from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

In [95]:
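def embed_phrase(phrase):
    """Return the final-layer [CLS] vector for a phrase as a NumPy array."""
    inputs = tokenizer(phrase, return_tensors="pt")
    with torch.no_grad():  # inference only, so skip gradient tracking
        outputs = model(**inputs)
    # First sequence in the batch, first token ([CLS])
    cls_token = outputs.last_hidden_state[0, 0]
    return cls_token.numpy()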

In [143]:
documents = [
    "the dog ran to the doghouse",
    "fido fetched a bone",
    "he wagged his tail",
    "catch the ball and chew on it",
    "rocky mountains and lots of rivers and lakes",
    "Colorado is a beautiful state",
    "I really love hiking, don't you?"
]

In [144]:
import numpy as np
# np.cos is the elementwise cosine, not the cosine similarity between two vectors,
# so scikit-learn's cosine_similarity is used below instead.


In [145]:
embed_phrase(documents[0]).shape

(768,)

In [146]:
from sklearn.metrics.pairwise import cosine_similarity

In [147]:
cosine_similarity([[1,1,1]], [[1,1,1]])

array([[1.]])

In [148]:
cosine_similarity([embed_phrase(documents[0])], [embed_phrase(documents[3])])

array([[0.65469146]], dtype=float32)
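
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it can also be computed for a whole document collection at once. The sketch below uses a hypothetical helper, `cosine_scores`, and toy vectors rather than real embeddings; in the notebook, the rows could instead be the outputs of `embed_phrase`. It assumes no vector has zero length.

```python
import numpy as np

def cosine_scores(query_vec, doc_vecs):
    """Cosine similarity between one query vector and each row of doc_vecs.

    Assumes every vector has non-zero length.
    """
    query_vec = np.asarray(query_vec, dtype=float)
    doc_vecs = np.asarray(doc_vecs, dtype=float)
    dots = doc_vecs @ query_vec
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    return dots / norms

# Toy vectors (hypothetical values, not real embeddings).
toy_docs = [[1, 1, 1], [2, 2, 2], [1, 0, 0], [-1, -1, -1]]
toy_query = [1, 1, 1]

sims = cosine_scores(toy_query, toy_docs)
order = np.argsort(-sims)  # document indices, most similar first
print(sims, order)
```

Unlike the raw dot product, cosine similarity ignores vector length: `[1, 1, 1]` and `[2, 2, 2]` score the same against the query here.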

In [13]:
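# Note: BertModel(BertConfig()) builds a BERT-base-sized model with randomly
# initialized weights; no pretrained weights are loaded here.
configuration = BertConfig()
model = BertModel(configuration)

# The tokenizer comes from a BERT-large checkpoint fine-tuned on SQuAD.
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')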

question = "How many parameters does BERT-large have?"
answer_text = "BERT-large is really big... it has 24-layers and an embedding size of 1,024, for a total of 340M parameters! Altogether it is 1.34GB, so expect it to take a couple minutes to download to your Colab instance."

input_ids = tokenizer.encode(question, answer_text)

tokens = tokenizer.convert_ids_to_tokens(input_ids)

The input has a total of 70 tokens.


In [11]:
# Search the input_ids for the first instance of the `[SEP]` token.
sep_index = input_ids.index(tokenizer.sep_token_id)

# The number of segment A tokens includes the [SEP] token itself.
num_seg_a = sep_index + 1

# The remainder are segment B.
num_seg_b = len(input_ids) - num_seg_a

# Construct the list of 0s and 1s.
segment_ids = [0]*num_seg_a + [1]*num_seg_b

# There should be a segment_id for every input token.
assert len(segment_ids) == len(input_ids)

In [19]:
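# BertModel expects batched torch tensors. Passing a NumPy array raised
# "TypeError: 'int' object is not callable", most likely because the model calls
# input_ids.size(), and ndarray.size is an int attribute rather than a method.
model(torch.tensor([input_ids]), token_type_ids=torch.tensor([segment_ids]))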

In [12]:
# # Run our example through the model.
# start_scores, end_scores = model(torch.tensor([input_ids]), # The tokens representing our input text.
#                                  token_type_ids=torch.tensor([segment_ids])) # The segment IDs to differentiate question from answer_text