## Preparation

### Installation

We assume that the repo is cloned and that all necessary packages are installed, which includes running the script:

```./install_packages.sh```

and the code is compiled:

```./build.sh```

### Changing directory to the repo root

In [1]:
cd ../..

/Users/alp/Documents/FlexNeuART


### Downloading demo data

Download [this file from our Google Drive](https://drive.google.com/file/d/1mDa6J4hNYPyqlS8hVi6bykSbAOMKsDwe/view?usp=sharing), copy it to the source root directory, and unpack it there. As a result, the source root directory should contain a sub-directory ``collections/msmarco_doc``.
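
Before running any scripts, it can help to confirm that the archive was unpacked in the right place. The sketch below only checks that the expected sub-directory exists. It assumes the working directory is the repo root, and the helper name `collection_is_present` is hypothetical (it is not part of FlexNeuART).

```python
from pathlib import Path


def collection_is_present(repo_root, collection="msmarco_doc"):
    """Return True if <repo_root>/collections/<collection> is an existing directory."""
    return (Path(repo_root) / "collections" / collection).is_dir()


# Assumes the current directory is the repo root.
print(collection_is_present("."))
```

If this prints `False`, the archive was most likely unpacked in a different directory.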

### Sanity check: statistics on downloaded data should look like this

In [2]:
!scripts/report/get_basic_collect_stat.sh msmarco_doc

Using collection root: collections
Checking data sub-directory: bitext
Checking data sub-directory: dev
Checking data sub-directory: dev_official
Checking data sub-directory: docs
Found indexable data file: docs/AnswerFields.jsonl.gz
Checking data sub-directory: test2019
Checking data sub-directory: test2020
Checking data sub-directory: train_fusion
Found query file: bitext/QuestionFields.jsonl
Found query file: dev/QuestionFields.jsonl
Found query file: dev_official/QuestionFields.jsonl
Found query file: test2019/QuestionFields.jsonl
Found query file: test2020/QuestionFields.jsonl
Found query file: train_fusion/QuestionFields.jsonl
getIndexQueryDataInfo return value:  docs AnswerFields.jsonl.gz ,bitext,dev,dev_official,test2019,test2020,train_fusion QuestionFields.jsonl
Using the data input files: AnswerFields.jsonl.gz, QuestionFields.jsonl
Index dirs: docs
Query dirs:  bitext dev dev_official test2019 test2020 train_fusion
Queries/questions:
bitext 352013
dev 5000
dev_official 5193
t

## Indexing (each step takes a few hours)

### Lucene index

In [3]:
!scripts/index/create_lucene_index.sh msmarco_doc

Using collection root: collections
Data directory: collections/msmarco_doc/input_data
Index directory: collections/msmarco_doc/lucene_index
Checking data sub-directory: bitext
Checking data sub-directory: dev
Checking data sub-directory: dev_official
Checking data sub-directory: docs
Found indexable data file: docs/AnswerFields.jsonl.gz
Checking data sub-directory: test2019
Checking data sub-directory: test2020
Checking data sub-directory: train_fusion
Found query file: bitext/QuestionFields.jsonl
Found query file: dev/QuestionFields.jsonl
Found query file: dev_official/QuestionFields.jsonl
Found query file: test2019/QuestionFields.jsonl
Found query file: test2020/QuestionFields.jsonl
Found query file: train_fusion/QuestionFields.jsonl
Using the data input file: AnswerFields.jsonl.gz
JAVA_OPTS=-Xms8388608k -Xmx14680064k -server
Creating a new Lucene index, maximum # of docs to process: 2147483647
Input file name: collections/msmarco_doc/input_data/docs/AnswerFields.jsonl.gz
Indexed 100

### Forward indices (text is not really necessary for this notebook)

In [None]:
!scripts/index/create_fwd_index.sh msmarco_doc mapdb "text:parsedText text_raw:raw"

### Download and instantiate the model

In [3]:
!wget boytsov.info/models/msmarco_doc/2019/bert_vanilla/model.best

--2021-02-07 13:44:14--  http://boytsov.info/models/msmarco_doc/2019/bert_vanilla/model.best
Resolving boytsov.info (boytsov.info)... 69.60.127.165
Connecting to boytsov.info (boytsov.info)|69.60.127.165|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 438863972 (419M) [text/plain]
Saving to: ‘model.best’


2021-02-07 13:46:41 (2.86 MB/s) - ‘model.best’ saved [438863972/438863972]
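
Because the model file is large, an interrupted download is easy to miss. A minimal sketch of a size check follows; the expected size is taken from the `Length:` line of the `wget` log above, and the helper name `download_is_complete` is hypothetical.

```python
import os

# Bytes, copied from the "Length:" line in the wget log above.
EXPECTED_SIZE = 438863972


def download_is_complete(path, expected_size=EXPECTED_SIZE):
    """Return True if the file exists and has exactly the expected size in bytes."""
    return os.path.isfile(path) and os.path.getsize(path) == expected_size


print(download_is_complete("model.best"))
```

A size match does not prove the file is intact, but a mismatch is a reliable sign that the download should be repeated.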



### Here, we run inference on CPU, which is quite slow. To use a GPU, change ``DEVICE_NAME``.

In [2]:
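import torch
# Maximum number of query tokens
MAX_QUERY_LEN=32
# The document gets the rest of the 512 input positions, minus 3 positions
# (assumed to be reserved for the special tokens [CLS] and two [SEP])
MAX_DOC_LEN=512 - MAX_QUERY_LEN - 3
BATCH_SIZE=16
DEVICE_NAME='cpu'
#DEVICE_NAME='cuda:0'
MODEL_FILE='model.best'
# Load the model on CPU first, then move it to the target device
model=torch.load(MODEL_FILE, map_location='cpu')
model.to(DEVICE_NAME)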

VanillaBertRanker(
  (bert): CustomBertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): BertLayerNorm()
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
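
The value of `MAX_DOC_LEN` follows from the model's input limit: the position embedding table in the summary above has 512 entries, and the query, the document, and the special tokens must all fit within it. The sketch below spells out that arithmetic. Reading the constant 3 as one `[CLS]` and two `[SEP]` tokens is an assumption based on the usual BERT sentence-pair layout, and the function name `max_doc_tokens` is hypothetical.

```python
def max_doc_tokens(max_positions=512, max_query_len=32, num_special_tokens=3):
    """Number of document tokens that fit next to a full-length query."""
    budget = max_positions - max_query_len - num_special_tokens
    if budget <= 0:
        raise ValueError("query and special tokens leave no room for the document")
    return budget


print(max_doc_tokens())  # 477
```

Increasing the query length shrinks the document budget by the same amount, so the two limits should be changed together.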
         

## Model inference/API demo

In [3]:
COLLECTION='msmarco_doc'

### Execute a query

In [4]:
QUERY_JSON={"DOCNO": "961921",
            "text": "national park system establish",
            "text_raw": "when was the national park system established",
            "text_bert_tok": "when was the national park system established"}
QUERY_JSON

{'DOCNO': '961921',
 'text': 'national park system establish',
 'text_raw': 'when was the national park system established',
 'text_bert_tok': 'when was the national park system established'}

In [5]:
from scripts.config import DOCID_FIELD, TEXT_FIELD_NAME, TEXT_RAW_FIELD_NAME

In [6]:
from scripts.py_flexneuart.setup import *
# add Java JAR to the class path
configure_classpath('target')
# create a resource manager
resource_manager=create_featextr_resource_manager(f'collections/{COLLECTION}/forward_index')

In [7]:
from scripts.py_flexneuart.cand_provider import *
# create a candidate provider/generator
cand_prov = create_cand_provider(resource_manager, PROVIDER_TYPE_LUCENE, f'collections/{COLLECTION}/lucene_index')

In [14]:
query_text=QUERY_JSON[TEXT_FIELD_NAME]
query_id=QUERY_JSON[DOCID_FIELD]
query_res=run_text_query(cand_prov, 20, query_text)
query_id, query_res

('961921',
 (1204206,
  [CandidateEntry(doc_id='D2527574', score=18.659997940063477),
   CandidateEntry(doc_id='D2398015', score=18.492298126220703),
   CandidateEntry(doc_id='D1578785', score=18.234092712402344),
   CandidateEntry(doc_id='D2189735', score=18.2298583984375),
   CandidateEntry(doc_id='D1578782', score=17.947647094726562),
   CandidateEntry(doc_id='D2527573', score=17.892498016357422),
   CandidateEntry(doc_id='D1578784', score=17.88416862487793),
   CandidateEntry(doc_id='D2106902', score=17.869140625),
   CandidateEntry(doc_id='D2591882', score=17.70314598083496),
   CandidateEntry(doc_id='D2443070', score=17.63814926147461),
   CandidateEntry(doc_id='D1578783', score=17.51651382446289),
   CandidateEntry(doc_id='D3525662', score=17.447235107421875),
   CandidateEntry(doc_id='D2769926', score=17.322866439819336),
   CandidateEntry(doc_id='D1737386', score=17.243505477905273),
   CandidateEntry(doc_id='D1514002', score=17.16539192199707),
   CandidateEntry(doc_id='D1455

### Retrieve a document (D1578782 is marked as a relevant entry)

In [9]:
from scripts.py_flexneuart.fwd_index import get_forward_index
raw_indx = get_forward_index(resource_manager, 'text_raw')

In [10]:
DOC_ID='D1578782' # relevant
#DOC_ID='D1462277' # not marked as relevant
doc_text=raw_indx.get_doc_raw(DOC_ID)

In [11]:
print(query_text)
print()
print(doc_text)

national park system establish

national park mashups "national park service's 100 year birthday is in 2016. august 25, 2016 is the 100th birthday of the national park service. starting with yellowstone in 1872 there are over 400 units in the national park service today. how old is the system? the national park service was created by an act of congress and signed by president woodrow wilson on august 25, 1916. yellowstone national park was established by an act signed by president ulysses s. grant on march 1, 1872, as the nation's first national park. the mission of the national park service: the national park service preserves unimpaired the natural and cultural resources and values of the national park system for the enjoyment, education, and inspiration of this and future generations. the national park service cooperates with partners to extend the benefits of natural and cultural resource conservation and outdoor recreation throughout this country and the world. national park mashu

### Score candidate documents

In [15]:
doc_data = {}
bm25_scores = {}
for doc_id, bm25_score in query_res[1]:
    doc_text = raw_indx.get_doc_raw(doc_id)
    doc_data[doc_id] = doc_text
    bm25_scores[doc_id] = bm25_score

query_data = {query_id : query_text}

In [16]:
from scripts.cedr.data import iter_valid_records

data_set = query_data, doc_data
run = {query_id : doc_data.keys()}

for records in iter_valid_records(model, DEVICE_NAME, data_set, run,
                                  BATCH_SIZE,
                                  MAX_QUERY_LEN, MAX_DOC_LEN):
    scores = model(records['query_tok'],
                   records['query_mask'],
                   records['doc_tok'],
                   records['doc_mask'])

    scores = scores.tolist()

    for qid, doc_id, score in zip(records['query_id'], records['doc_id'], scores):
        print(f'{qid} {doc_id} BM25 score: {bm25_scores[doc_id]} model score: {score}')

961921 D2527574 BM25 score: 18.659997940063477 model score: 1.320546269416809
961921 D2398015 BM25 score: 18.492298126220703 model score: 0.9334409236907959
961921 D1578785 BM25 score: 18.234092712402344 model score: 2.141911029815674
961921 D2189735 BM25 score: 18.2298583984375 model score: -0.07196071743965149
961921 D1578782 BM25 score: 17.947647094726562 model score: 0.38077396154403687
961921 D2527573 BM25 score: 17.892498016357422 model score: 0.9013162851333618
961921 D1578784 BM25 score: 17.88416862487793 model score: 1.047125220298767
961921 D2106902 BM25 score: 17.869140625 model score: -2.150390386581421
961921 D2591882 BM25 score: 17.70314598083496 model score: 1.2829278707504272
961921 D2443070 BM25 score: 17.63814926147461 model score: 0.7396841645240784
961921 D1578783 BM25 score: 17.51651382446289 model score: 0.8640072345733643
961921 D3525662 BM25 score: 17.447235107421875 model score: 1.0577561855316162
961921 D2769926 BM25 score: 17.322866439819336 model score: 0.57
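
The model scores do not follow the BM25 order, so the candidate list has to be re-sorted to obtain the final ranking. A minimal sketch of that step is shown below, using the first five candidates with scores rounded from the output above. The function name `rerank` is hypothetical; it is plain Python sorting, not a FlexNeuART API.

```python
# (doc_id, BM25 score, model score) for the first five candidates,
# rounded from the output printed above.
candidates = [
    ("D2527574", 18.660, 1.3205),
    ("D2398015", 18.492, 0.9334),
    ("D1578785", 18.234, 2.1419),
    ("D2189735", 18.230, -0.0720),
    ("D1578782", 17.948, 0.3808),
]


def rerank(entries):
    """Sort (doc_id, bm25, model_score) tuples by model score, best first."""
    return sorted(entries, key=lambda entry: entry[2], reverse=True)


for doc_id, bm25, model_score in rerank(candidates):
    print(f"{doc_id} BM25: {bm25:.3f} model: {model_score:.4f}")
```

Among these five, `D1578785` moves from third place under BM25 to first place under the model score.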

### Score the document against the query (under the hood)

In [17]:
query_bert_tok = model.tokenize(query_text)
query_bert_tok

[2120, 2380, 2291, 5323]

In [18]:
doc_bert_tok = model.tokenize(doc_text)
print(doc_bert_tok, len(doc_bert_tok))

[10117, 8573, 1998, 5680, 1000, 1000, 1000, 2057, 2031, 5357, 15891, 2000, 1996, 2087, 14013, 4348, 1037, 2111, 2412, 2363, 1010, 1998, 2169, 2028, 2442, 2079, 2010, 2112, 2065, 2057, 4299, 2000, 2265, 2008, 1996, 3842, 2003, 11007, 1997, 2049, 2204, 7280, 1012, 1000, 1000, 1011, 10117, 8573, 8573, 2018, 2023, 3746, 2579, 2005, 1996, 3104, 1997, 2010, 2338, 1010, 1000, 1000, 5933, 9109, 1997, 1037, 8086, 2386, 1012, 1000, 1000, 17590, 2110, 2118, 10117, 8573, 2003, 2411, 2641, 1996, 1000, 1000, 5680, 2923, 2343, 1012, 1000, 1000, 2182, 1999, 1996, 2167, 7734, 2919, 8653, 1010, 2073, 2116, 1997, 2010, 3167, 5936, 2034, 2435, 4125, 2000, 2010, 2101, 4483, 4073, 1010, 8573, 2003, 4622, 2007, 1037, 2120, 2380, 2008, 6468, 2010, 2171, 1998, 7836, 1996, 3638, 1997, 2023, 2307, 5680, 2923, 1012, 10117, 8573, 2034, 2234, 2000, 1996, 2919, 8653, 1999, 2244, 7257, 1012, 1037, 27168, 1011, 4477, 2035, 2010, 2166, 1010, 8573, 4912, 1037, 3382, 2000, 5690, 1996, 2502, 2208, 1997, 2167, 2637, 2077, 

### It is important to truncate queries and documents ...

In [19]:
query_bert_tok=query_bert_tok[0:MAX_QUERY_LEN]
doc_bert_tok=doc_bert_tok[0:MAX_DOC_LEN]

### ... and pad queries

In [20]:
from scripts.cedr.data import PAD_CODE

query_bert_tok_pad = query_bert_tok + [PAD_CODE] * (MAX_QUERY_LEN - len(query_bert_tok))
print(query_bert_tok_pad)

[2120, 2380, 2291, 5323, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]
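
The truncation and padding steps above can be combined into one helper. This is a sketch, not the library's own code: the name `truncate_and_pad` is hypothetical, and the value `-1` for the pad code is inferred from the printed output rather than imported from `scripts.cedr.data`.

```python
PAD_CODE = -1  # assumption: value inferred from the padded query printed above


def truncate_and_pad(token_ids, max_len, pad_code=PAD_CODE):
    """Cut token_ids to max_len, then right-pad with pad_code up to max_len."""
    truncated = token_ids[:max_len]
    return truncated + [pad_code] * (max_len - len(truncated))


padded = truncate_and_pad([2120, 2380, 2291, 5323], 32)
print(len(padded), padded[:6])  # 32 [2120, 2380, 2291, 5323, -1, -1]
```

The result always has exactly `max_len` entries, whether the input was shorter or longer than the limit.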


### Calling ``unsqueeze(0)`` is required to create a batch dimension (multiple queries & documents can be batched together)

In [21]:
query_tok_tensor_pad = torch.LongTensor(query_bert_tok_pad).unsqueeze(0).to(DEVICE_NAME)
doc_tok_tensor = torch.LongTensor(doc_bert_tok).unsqueeze(0).to(DEVICE_NAME)
len(query_tok_tensor_pad[0]), len(doc_tok_tensor[0])

(32, 477)

In [22]:
query_tok_tensor_pad.shape, doc_tok_tensor.shape

(torch.Size([1, 32]), torch.Size([1, 477]))

In [23]:
query_mask = torch.FloatTensor([1.0] * len(query_bert_tok) + 
                              [0.] * (MAX_QUERY_LEN - len(query_bert_tok))).unsqueeze(0).to(DEVICE_NAME)
doc_mask = torch.ones_like(doc_tok_tensor).float()

In [24]:
query_mask.shape, doc_mask.shape

(torch.Size([1, 32]), torch.Size([1, 477]))
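
The masks mark which positions hold real tokens (1.0) and which hold padding (0.0). The sketch below builds one mask row with plain Python lists so that it runs without PyTorch; in the notebook the same row is wrapped in a tensor and given a batch dimension. The helper name `build_mask` is hypothetical.

```python
def build_mask(num_real_tokens, max_len):
    """Return a list with 1.0 for real token positions and 0.0 for padding."""
    if num_real_tokens > max_len:
        raise ValueError("more real tokens than available positions")
    return [1.0] * num_real_tokens + [0.0] * (max_len - num_real_tokens)


query_mask_row = build_mask(4, 32)     # 4 query tokens padded to 32 positions
doc_mask_row = build_mask(477, 477)    # the document is not padded here
print(sum(query_mask_row), sum(doc_mask_row))  # 4.0 477.0
```

The document mask is all ones because the document was only truncated, not padded.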

In [25]:
query_mask, doc_mask

(tensor([[1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]),
 tensor([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
          1., 1., 1., 1., 1.,

In [26]:
model(query_tok_tensor_pad, query_mask, doc_tok_tensor, doc_mask)

tensor([-1.8628], grad_fn=<SqueezeBackward1>)