# This script creates the JSON data for training DPR on the MS MARCO dataset.
**Step 1**: Identify hard negatives and write to file.

**Step 2**: Link qrel pids to passages in collection.

**Step 3**: Join all neg_ctxs of a single qid to qids of pos_ctxs.

**Step 4**: Write to JSON with DPR formatting.

### Code to extract potential hard negatives

In [2]:
!wget https://msmarco.blob.core.windows.net/msmarcoranking/top1000.dev.tar.gz
!wget https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz
!tar xzf collectionandqueries.tar.gz
!tar xzf top1000.dev.tar.gz

print('Generate potential hard negatives for the query dev set\n') 

# qrels.dev.small.tsv contains the relevant passages for each query in dev
# the dict qid_pid stores all relevant (query, passage) pairs
qid_pid = {}   # dict for checking if a top 1000 pair is in qrels
pid_2_qid = {} # dict for linking positive passage from collection to qrel pid
qids = set()   # set for checking if a qid in top1000 does not have a qrel. (error checking step)
with open("/content/qrels.dev.small.tsv") as infile:
  for line in infile:
    qid, _, pid, rel = line.rstrip('\n').split('\t') # strip the newline so rel is clean
    qid_pid[(qid, pid)] = rel
    qids.add(qid)

    if pid not in pid_2_qid:
      pid_2_qid[pid] = [qid]
    else: # multiple qids share the same pid
      pid_2_qid[pid] += [qid]
    

# top1000.dev contains the top 1000 passages retrieved by BM25. 
# For each (query, passage) pair in top1000.dev, if it's not contained in the qrels file, it's a potential hard negative 
#   - not relevant but is retrieved by BM25,
#   - or relevant but is not assessed by the assessors
# note: top1000 
with open('/content/potential_strong_neg.dev.tsv', 'w') as outfile:
  with open("/content/top1000.dev") as infile:
    for line in infile:
      qid, pid, query, passage = line.split('\t')
      if qid not in qids:
        print("query {} not in qrels.tsv\n".format(qid))
        continue

      if (qid, pid) not in qid_pid:
        s = '\t'.join([str(qid), str(pid), query, passage.strip()])
        outfile.write(s + '\n')

# Clear some memory
del qid_pid
del qids

# print file size
!ls -sh /content/potential_strong_neg.dev.tsv

--2020-11-17 01:57:43--  https://msmarco.blob.core.windows.net/msmarcoranking/top1000.dev.tar.gz
Resolving msmarco.blob.core.windows.net (msmarco.blob.core.windows.net)... 40.112.152.16
Connecting to msmarco.blob.core.windows.net (msmarco.blob.core.windows.net)|40.112.152.16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 687414398 (656M) [application/x-gzip]
Saving to: ‘top1000.dev.tar.gz’


2020-11-17 01:59:53 (5.07 MB/s) - ‘top1000.dev.tar.gz’ saved [687414398/687414398]

--2020-11-17 01:59:53--  https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz
Resolving msmarco.blob.core.windows.net (msmarco.blob.core.windows.net)... 40.112.152.16
Connecting to msmarco.blob.core.windows.net (msmarco.blob.core.windows.net)|40.112.152.16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1057717952 (1009M) [application/gzip]
Saving to: ‘collectionandqueries.tar.gz’


2020-11-17 02:01:20 (11.7 MB/s) - ‘collectionandquerie
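
The filtering rule used above can be exercised on a few in-memory rows before running it on the full files. The following is a minimal sketch on toy data; the function name `find_potential_hard_negatives` and the sample rows are hypothetical and not part of the notebook.

```python
def find_potential_hard_negatives(qrel_rows, top1000_rows):
  """Return top1000 rows whose (qid, pid) pair is not listed in the qrels."""
  relevant = {(qid, pid) for qid, _, pid, _ in qrel_rows}
  judged_qids = {qid for qid, _, _, _ in qrel_rows}
  negatives = []
  for qid, pid, query, passage in top1000_rows:
    if qid not in judged_qids:
      continue  # the query has no qrel at all, so skip it
    if (qid, pid) not in relevant:
      negatives.append((qid, pid, query, passage))
  return negatives

# Hypothetical toy rows in the same column order as the MS MARCO files.
qrel_rows = [("1", "0", "10", "1")]
top1000_rows = [
  ("1", "10", "what is dpr", "relevant passage"),
  ("1", "11", "what is dpr", "retrieved but unjudged passage"),
  ("2", "12", "other query", "passage for a query without qrels"),
]
hard_negs = find_potential_hard_negatives(qrel_rows, top1000_rows)
print(hard_negs)
```

Only the pair ("1", "11") survives: it was retrieved for a judged query but is not marked relevant.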

### Code to format into JSON file for training

Inputs:
- Potential strong/hard negatives
- top1000.dev.tar.gz
- qrels.dev.small.tsv

For now, JSON files are created only for the dev set, to test DPR's training and the pipeline.

In [3]:
hard_neg_file = "potential_strong_neg.dev.tsv" # Name of hard negative file

Loop over the collection and append each passage to the entry for its pid in pid_2_qid.


In [4]:
with open("/content/collection.tsv") as infile:
  for line in infile:
    pid, passage = line.split('\t')
    if pid in pid_2_qid:
      pid_2_qid[pid] += [passage.rstrip()] # [qid_1, ..., qid_n, passage]
# The with block closes the file, so no explicit close() is needed
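
Appending the passage to the qid list means each pid_2_qid entry mixes qids and text. One alternative, sketched here under the assumption that memory allows a second dict, keeps the passages in a separate mapping. The helper `load_needed_passages` and the in-memory sample are hypothetical and not part of the notebook.

```python
import io

def load_needed_passages(collection_lines, needed_pids):
  """Map pid -> passage text, keeping only pids found in needed_pids."""
  pid_2_passage = {}
  for line in collection_lines:
    pid, passage = line.rstrip('\n').split('\t', 1)
    if pid in needed_pids:
      pid_2_passage[pid] = passage
  return pid_2_passage

# Hypothetical stand-in for collection.tsv (pid <TAB> passage).
sample_collection = io.StringIO(
  "10\tfirst passage\n11\tsecond passage\n12\tthird passage\n")
toy_pid_2_qid = {"10": ["1"], "12": ["2", "3"]}
pid_2_passage = load_needed_passages(sample_collection, toy_pid_2_qid)
print(pid_2_passage)
```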

Convert hard_neg to the final dict for JSON (qid is a numeric string):

    {qid: {"question": query,
           "answers": [""],
           "positive_ctxs": [{"title": "",
                              "text": passage,
                              "score": 1,
                              "title_score": 0,
                              "passage_id": "XXX"}, ...],
           "negative_ctxs": [""],
           "hard_negative_ctxs": [SAME FORMAT AS "positive_ctxs"]
          }
    }
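
To make the target layout concrete, here is a minimal sketch that builds one record and serializes it. The helper `make_ctx` and the sample values are hypothetical; only the field names come from the layout described above.

```python
import json

def make_ctx(pid, passage, score):
  """Build one context entry with the fields used in this notebook."""
  return {"title": "",
          "text": passage,
          "score": score,
          "title_score": 0,
          "passage_id": str(pid)}

# Hypothetical sample values for a single query.
record = {
  "question": "what is dpr",
  "answers": [""],
  "positive_ctxs": [make_ctx(10, "relevant passage", 1)],
  "negative_ctxs": [""],
  "hard_negative_ctxs": [make_ctx(11, "retrieved but unjudged passage", -1)],
}
print(json.dumps(record, indent=2))
```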

In [5]:
qid_2_JSON = {} # These hard negatives may be false negatives, since not every (qid, pid) pair was assessed

with open("/content/" + hard_neg_file) as infile:
  for line in infile:
    qid, pid, query, passage = line.split('\t')
    if qid not in qid_2_JSON:
      qid_2_JSON[qid] = {'question':query,
                         "answers":[""],
                         "positive_ctxs":[],
                         "negative_ctxs":[""],  # No explicit negatives since we sample in-batch
                         "hard_negative_ctxs":[]}
    # Every row of the file is one potential hard negative for its qid
    qid_2_JSON[qid]["hard_negative_ctxs"].append({"title":"",
                                                  "text":passage.rstrip(),
                                                  "score":-1,
                                                  "title_score":0,
                                                  "passage_id":str(pid)})

Store positive_ctxs

    loop over pid_2_qid -> pid
      loop over pid -> qid
        qid_2_JSON[qid] <- positive_ctxs[pid, pid -> passage]
        

In [6]:
qid_no_neg = [] # qids that have a positive passage but no potential hard negatives
for pid, data in pid_2_qid.items():
  passage = data[-1] # last element is the passage text, the rest are qids
  for qid in data[:-1]:
    if qid not in qid_2_JSON:
      # The qid has no potential hard negatives, probably because it is missing
      # from top1000. Only one such qid was found: '983451'.
      # Building a record for it would need a qid -> query lookup, since the
      # question text is required, so it is skipped.
      qid_no_neg.append(qid)
      continue

    qid_2_JSON[qid]["positive_ctxs"] += [{"title":"",
                                          "text": passage,
                                          "score":1,
                                          "title_score":0,
                                          "passage_id":str(pid)
                                          }]
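
Before writing the file, it can help to confirm that every record ended up with at least one positive and one hard negative passage. The check below is a minimal sketch on toy data; `summarize_records` is a hypothetical helper, not part of the notebook.

```python
def summarize_records(qid_2_json):
  """Count records and list qids that lack positives or hard negatives."""
  no_positive = [qid for qid, rec in qid_2_json.items()
                 if not rec["positive_ctxs"]]
  no_hard_negative = [qid for qid, rec in qid_2_json.items()
                      if not rec["hard_negative_ctxs"]]
  return {"total": len(qid_2_json),
          "no_positive": no_positive,
          "no_hard_negative": no_hard_negative}

# Hypothetical records shaped like the entries of qid_2_JSON.
toy_records = {
  "1": {"question": "q1",
        "positive_ctxs": [{"passage_id": "10"}],
        "hard_negative_ctxs": [{"passage_id": "11"}]},
  "2": {"question": "q2",
        "positive_ctxs": [],
        "hard_negative_ctxs": [{"passage_id": "12"}]},
}
summary = summarize_records(toy_records)
print(summary)
```

In the notebook the same function could be called on qid_2_JSON; a non-empty `no_positive` list would suggest a qrel pid that was not found in the collection.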

Finally, write to a JSON file.

**Note:** Many qids that have a relevance judgment for some pid have no potential hard negatives in the top1000 file.

In [7]:
import json

del pid_2_qid # clear memory (ok maybe this doesn't)

out_path = ""
filename = "MSMARCO.dev.json"

with open(out_path + filename, 'w') as json_file: # the with block closes the file
  json.dump(list(qid_2_JSON.values()), json_file)
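
A quick way to check the written file is to load it back and verify that every record carries the expected keys. This is a minimal sketch that round-trips a toy list through a temporary file; `find_malformed` and the toy records are hypothetical, and the key set is taken from the layout used in this notebook.

```python
import json
import os
import tempfile

REQUIRED_KEYS = {"question", "answers", "positive_ctxs",
                 "negative_ctxs", "hard_negative_ctxs"}

def find_malformed(records):
  """Return indices of records that are missing any required key."""
  return [i for i, rec in enumerate(records)
          if not REQUIRED_KEYS.issubset(rec)]

# Hypothetical records: the second one lacks "hard_negative_ctxs".
toy_list = [
  {"question": "q1", "answers": [""], "positive_ctxs": [],
   "negative_ctxs": [""], "hard_negative_ctxs": []},
  {"question": "q2", "answers": [""], "positive_ctxs": [],
   "negative_ctxs": [""]},
]

with tempfile.TemporaryDirectory() as tmp_dir:
  path = os.path.join(tmp_dir, "toy.dev.json")
  with open(path, "w") as f:
    json.dump(toy_list, f)
  with open(path) as f:
    loaded = json.load(f)

bad = find_malformed(loaded)
print(bad)
```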

Compress the file (a zip would probably do just as well).

In [9]:
!tar -czvf MSMARCO.dev.json.tar.gz MSMARCO.dev.json

MSMARCO.dev.json
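
If a zip archive is preferred over tar.gz, Python's standard zipfile module can produce one without a shell command. This is a minimal sketch on a temporary toy file; the helper `zip_single_file` and the file names are hypothetical.

```python
import os
import tempfile
import zipfile

def zip_single_file(src_path, zip_path):
  """Write src_path into a deflate-compressed zip archive at zip_path."""
  with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    # arcname keeps only the file name, not the full directory path
    zf.write(src_path, arcname=os.path.basename(src_path))
  return zip_path

with tempfile.TemporaryDirectory() as tmp_dir:
  src = os.path.join(tmp_dir, "toy.dev.json")
  with open(src, "w") as f:
    f.write('[{"question": "q1"}]')
  archive = zip_single_file(src, os.path.join(tmp_dir, "toy.dev.json.zip"))
  with zipfile.ZipFile(archive) as zf:
    archived_names = zf.namelist()
    archived_text = zf.read("toy.dev.json").decode("utf-8")

print(archived_names)
```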


In [12]:
from google.colab import drive
drive.mount('/content/gdrive')
!cp MSMARCO.dev.json.tar.gz 'gdrive/My Drive/STAT 946 Project/MSMARCO.dev.json.tar.gz'

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
