# Final project IA376E - 2020
# Title: Search using dense vectors

Names: Rafael Gonçalves, Thomas Portugal

## Week 1: Understanding the MSMARCO dataset and the pyserini search library

## General setup

Number of CPUs available in the environment

In [0]:
import os

os.cpu_count()

2

Mount the drive to download the data

In [0]:
# Mount drive
from google.colab import drive

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Pyserini

Pyserini, a Python wrapper for the Anserini search toolkit: https://github.com/castorini/pyserini/

References:
 - https://colab.research.google.com/github/castorini/anserini-notebooks/blob/master/pyserini_robust04_demo.ipynb
 - https://colab.research.google.com/github/castorini/anserini-notebooks/blob/master/pyserini_msmarco_passage_demo.ipynb

### Installation

In [0]:
%%capture
!pip install pyserini

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"

### SimpleSearch

In [0]:
DATA_DIR=r'/content/drive/My\ Drive/pyserini'
data_dir='/content/drive/My Drive/pyserini'

In [0]:
%%capture
!mkdir {DATA_DIR}
!wget -nc https://git.uwaterloo.ca/jimmylin/anserini-indexes/raw/master/index-robust04-20191213.tar.gz -P {DATA_DIR}
!tar xvkfz {DATA_DIR}/index-robust04-20191213.tar.gz -C {DATA_DIR}

In [0]:
from pyserini.search import pysearch

searcher = pysearch.SimpleSearcher(data_dir + '/index-robust04-20191213/')
hits = searcher.search('hubble space telescope')

# Print the first 10 hits:
for i in range(0, 10):
    print(f'{i+1:2} {hits[i].docid:15} {hits[i].score:.5f}')

 1 LA071090-0047   16.85690
 2 FT934-5418      16.75630
 3 FT921-7107      16.68290
 4 LA052890-0021   16.37390
 5 LA070990-0052   16.36460
 6 LA062990-0180   16.19260
 7 LA070890-0154   16.15610
 8 FT934-2516      16.08950
 9 LA041090-0148   16.08810
10 FT944-128       16.01920


Example of a retrieved document:

In [0]:
from IPython.display import display, HTML
display(HTML('<div style="padding-bottom:10px">' + hits[0].raw[:1000] + '</div>'))

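Configuring search options:

In [0]:
# BM25 parameters: k1=0.9, b=0.4
searcher.set_bm25_similarity(0.9, 0.4)
# RM3 query expansion: 10 feedback terms, 10 feedback documents, original query weight 0.5
searcher.set_rm3_reranker(10, 10, 0.5)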

hits2 = searcher.search('hubble space telescope')

# Print the first 10 hits:
for i in range(0, 10):
    print(f'{i+1:2} {hits2[i].docid:15} {hits2[i].score:.5f}')

 1 FT934-5418      5.24090
 2 LA071090-0047   5.19570
 3 LA070990-0052   4.94740
 4 LA052890-0021   4.93000
 5 FT921-7107      4.91800
 6 LA062990-0180   4.89580
 7 FT934-2516      4.80640
 8 LA070890-0154   4.78170
 9 LA041490-0064   4.74820
10 LA040190-0178   4.73550
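
To make the two BM25 parameters above more concrete, here is a minimal sketch of the textbook BM25 score for a single query term. It is an illustration only: `bm25_term_score` and the toy numbers are hypothetical, and the exact formula used inside Anserini/Lucene may differ in constant factors.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=0.9, b=0.4):
    """Score contribution of one query term for one document (textbook BM25)."""
    # Inverse document frequency: rarer terms weigh more.
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalisation: b=0 ignores document length, b=1 normalises fully.
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # Term-frequency saturation: k1 controls how fast repeated terms stop helping.
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * norm)


# Same term frequency, but one document is much longer than average.
short_doc = bm25_term_score(tf=3, doc_len=50, avg_doc_len=100, n_docs=1000, doc_freq=10)
long_doc = bm25_term_score(tf=3, doc_len=400, avg_doc_len=100, n_docs=1000, doc_freq=10)
print(short_doc > long_doc)
```

With `b` greater than zero, the longer document is penalised, so the same term frequency yields a lower score.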


## MSMarco

Available at: https://microsoft.github.io/msmarco/

In [0]:
DATA_DIR='drive/My\ Drive/msmarco'

In [0]:
%%capture
!mkdir {DATA_DIR}
!wget -nc https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz -P {DATA_DIR}
!tar -xzvkf {DATA_DIR}/collectionandqueries.tar.gz -C {DATA_DIR}

In [0]:
import collections

In [0]:
# From https://github.com/nyu-dl/dl4ir-doc2query/blob/master/convert_msmarco_to_opennmt.py
def load_qrels(path):
  """Loads qrels into a dict of key: query id, value: set of relevant doc ids."""
  qrels = collections.defaultdict(set)
  with open(path) as f:
    for i, line in enumerate(f):
      query_id, _, doc_id, relevance = line.rstrip().split('\t')
      if int(relevance) >= 1:
        qrels[query_id].add(doc_id)
      if i % 100000 == 0:
        print('Loading qrels {}'.format(i))
  return qrels


def load_queries(path):
  """Loads queries into a dict of key: query id, value: query text."""
  queries = {}
  with open(path) as f:
    for i, line in enumerate(f):
      query_id, query = line.rstrip().split('\t')
      queries[query_id] = query
      if i % 100000 == 0:
        print('Loading queries {}'.format(i))
  return queries


def load_collection(path):
  """Loads tsv collection into a dict of key: doc id, value: doc text."""
  collection = {}
  with open(path) as f:
    for i, line in enumerate(f):
      doc_id, doc_text = line.rstrip().split('\t')
      collection[doc_id] = doc_text.replace('\n', ' ')
      if i % 1000000 == 0:
        print('Loading collection, doc {}'.format(i))

  return collection

In [0]:
data_dir='drive/My Drive/msmarco'
docs_path = data_dir + '/collection.tsv'
qrels_path = data_dir + '/qrels.train.tsv'
queries_path = data_dir + '/queries.train.tsv'

In [0]:
col = load_collection(docs_path)
queries = load_queries(queries_path)
qrel = load_qrels(qrels_path)

Loading collection, doc 0
Loading collection, doc 1000000
Loading collection, doc 2000000
Loading collection, doc 3000000
Loading collection, doc 4000000
Loading collection, doc 5000000
Loading collection, doc 6000000
Loading collection, doc 7000000
Loading collection, doc 8000000
Loading queries 0
Loading queries 100000
Loading queries 200000
Loading queries 300000
Loading queries 400000
Loading queries 500000
Loading queries 600000
Loading queries 700000
Loading queries 800000
Loading qrels 0
Loading qrels 100000
Loading qrels 200000
Loading qrels 300000
Loading qrels 400000
Loading qrels 500000


## Pyserini + MSMARCO

References:
 - https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-doc.md
 - https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage.md
 - https://github.com/castorini/anserini/blob/master/docs/experiments-doc2query.md
 - https://colab.research.google.com/github/castorini/anserini-notebooks/blob/master/pyserini_msmarco_passage_demo.ipynb#scrollTo=yFZlcqEX0t1f

In [0]:
%%capture
!mkdir {DATA_DIR}/pyserini
!wget -nc https://git.uwaterloo.ca/jimmylin/anserini-indexes/raw/master/index-msmarco-passage-20191117-0ed488.tar.gz -P {DATA_DIR}/pyserini
!tar xvfkz {DATA_DIR}/pyserini/index-msmarco-passage-20191117-0ed488.tar.gz -C {DATA_DIR}/pyserini

In [0]:
!du -h {DATA_DIR}/pyserini/index-msmarco-passage-20191117-0ed488

2.5G	drive/My Drive/msmarco/pyserini/index-msmarco-passage-20191117-0ed488


In [0]:
from pyserini.search import pysearch

topics = pysearch.get_topics('msmarco_passage_dev_subset')
print(f'{len(topics)} queries total')

6980 queries total


Example using a query from pyserini

In [0]:
topics[1102400]['title']

'why do bears hibernate'

In [0]:
from pyserini.search import pysearch

searcher = pysearch.SimpleSearcher(data_dir + '/pyserini/index-msmarco-passage-20191117-0ed488')
hits = searcher.search(topics[1102400]['title'])

# Prints the first 10 hits
for i in range(0, 10):
    print(f'{i+1:2} {hits[i].score:.5f} {hits[i].raw[:70]}...')

 1 17.33580 Why do Bears hibernate? March 31, 2010, Joan, Leave a comment. Why do ...
 2 13.23090 Why do bears hibernate? Watch this to discover how much effort is spen...
 3 13.13570 Technically, as the other anwerer said, bears do not hibernate, but th...
 4 13.01460 It is a common misconception that bears hibernate during the winter. W...
 5 13.00390 To prepare for hibernation, grizzlies must prepare a den, and consume ...
 6 12.68940 Some zoo bears are fed year round, and do not hibernate. Since they do...
 7 12.55450 Bears in zoos will not hibernate if food is available, though they wil...
 8 12.51710 All kinds of bears technically don't hibernate. They enter into a phas...
 9 12.43500 Date: 12-11-2012. It is a common misconception that bears hibernate du...
10 12.37460 While bears tend to slow down during the winter, they are not true hib...


Example using a query from the downloaded TSV file

In [0]:
doc_id = '410621'  # note: despite the variable name, this is a query id

In [0]:
queries[doc_id]

'is frida kahlo mexican'

In [0]:
hits = searcher.search(queries[doc_id], 1000)

In [0]:
# Prints the first 10 hits
for i in range(0, 10):
    print(f'{i+1:2} {hits[i].docid} {hits[i].score:.5f} {hits[i].raw[:70]}...')

 1 5144374 22.64810 Frida Kahlo, Frida Kahlo print, Frida Kahlo Poster, Frida Kahlo Wall A...
 2 5144368 22.49320 Frida Kahlo, Self Portrait, Frida Kahlo print, Frida Kahlo quote, Frid...
 3 1155174 20.99080 Frida Kahlo Birthday: Beloved Mexican Artist Turns 106 (PHOTOS) In hon...
 4 3932807 20.90960 1 Frida Kahlo Quotes Frida Kahlo was a Mexican painter famous for her ...
 5 1122572 19.82170 Frida Kahlo. This name uses Spanish naming customs: the first or pater...
 6 2429097 19.74680 Frida Kahlo and her paintings. Mexican artist Frida Kahlo is remembere...
 7 3232442 19.71270 About Frida Kahlo. Frida Kahlo de Rivera, born Magdalena Carmen Frieda...
 8 3232448 19.38600 Frida Kahlo, in full Frida Kahlo de Rivera, original name Magdalena Ca...
 9 2429101 19.28310 Frida Kahlo's House â¢â¢â¢ La Casa Azul was Frida Kahlo's home. Â© ...
10 3192157 18.94840 Museo Frida Kahlo. Renowned Mexican artist Frida Kahlo was born in, an...


In [0]:
# Print documents in qrel
for docid in qrel[doc_id]:
    print(f'{docid} {col[str(docid)][:70]}...')

6461 Throughout most of her life, however, Frida remained close to her fath...
6458 With slim sable brushes, Frida Kahlo painstakingly rendered her bold u...


In [0]:
from IPython.display import display, HTML
display(HTML('<div style="font-family: Times New Roman; padding-bottom:10px">' + hits[0].raw + '</div>'))

## Doc2query

Technique used to generate queries from documents

References:
- https://github.com/nyu-dl/dl4ir-doc2query
- https://github.com/castorini/docTTTTTquery

In [0]:
%%capture
!wget -nc https://www.dropbox.com/s/709q495d9hohcmh/pred-test_topk10.tar.gz -P {DATA_DIR}/msmarco-passage
!tar -xzvkf {DATA_DIR}/msmarco-passage/pred-test_topk10.tar.gz -C {DATA_DIR}/msmarco-passage

In [0]:
!head {DATA_DIR}/collection.tsv

0	The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated.
1	The Manhattan Project and its atomic bomb helped bring an end to World War II. Its legacy of peaceful uses of atomic energy continues to have an impact on history and science.
2	Essay on The Manhattan Project - The Manhattan Project The Manhattan Project was to see if making an atomic bomb possible. The success of this project would forever change the world forever making it known that something this powerful can be manmade.
3	The Manhattan Project was the name for a project conducted during World War II, to develop the first atomic bomb. It refers specifically to the period of the project from 194 â¦ 2-1946 under the control of the U.S. Army Corps of Engineers

In [0]:
!head {DATA_DIR}/msmarco-passage/pred-test_topk10.txt

why was the research of manhattan used why was manhattan an important factor what was a important achievement of manhattan why was manhattan so important what did the manhattan project do what was the most important modern strategy for north paul impressive what is the most important modern strategy in the atomic what is the most important research of manhattan why was sigma important and why was it important why was manhattan a good language
is manhattan an example which atomic bomb and led the most of the united states that have the future when is the new sigma project railroad used why was the manhattan project energy what was manhattan legacy which two countries did the manhattan project have who created the atomic bomb powers which of the following was considered a result of the atomic bomb on which was created by the atomic why was manhattan communist created what was the manhattan project created
what is considered a project of manhattan what is a project review what is a projec

In [0]:
def load_doc2query(path):
  """Loads doc2query predictions into a dict of key: doc id (line index), value: predicted queries."""
  collection = {}
  with open(path) as f:
    for i, line in enumerate(f):
      doc_text = line.rstrip()
      collection[str(i)] = doc_text.replace('\n', ' ')
      if i % 1000000 == 0:
        print('Loading doc2query, doc {}'.format(i))

  return collection

In [0]:
doc_queries = load_doc2query(data_dir + '/msmarco-passage/pred-test_topk10.txt')

Loading doc2query, doc 0
Loading doc2query, doc 1000000
Loading doc2query, doc 2000000
Loading doc2query, doc 3000000
Loading doc2query, doc 4000000
Loading doc2query, doc 5000000
Loading doc2query, doc 6000000
Loading doc2query, doc 7000000
Loading doc2query, doc 8000000


In [0]:
doc_id = '7423243'
doc_queries[doc_id][:150], col[doc_id][:150]

('what day is credit union what is a credit others hours how many credit hours in a credit hour what is the full time credit union what is Â§35.201, cre',
 'Section 35.201 of the Commissionâ\x80\x99s Regulations, 49 Pa. Code Â§35.201, defines a credit as a period of 15 hours of instruction. The Commission uses b')
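
The predicted queries are aligned with the collection by line index, so they can be used for document expansion, which is how doc2query is typically applied before indexing. The following is a minimal sketch of that idea; `expand_document` and the toy dictionaries are hypothetical and stand in for `col` and `doc_queries`.

```python
def expand_document(doc_text, predicted_queries):
    """Append the predicted queries to the passage text (hypothetical helper)."""
    return (doc_text + ' ' + predicted_queries).strip()


# Toy stand-ins for `col` and `doc_queries`, keyed by document id.
toy_col = {'0': 'The Manhattan Project developed the first atomic bomb.'}
toy_doc_queries = {'0': 'who created the atomic bomb what was the manhattan project'}

expanded = {
    doc_id: expand_document(text, toy_doc_queries.get(doc_id, ''))
    for doc_id, text in toy_col.items()
}
print(expanded['0'])
```

The expanded text would then be indexed in place of the original passage, so that query terms missing from the passage can still match.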

## Evaluation metrics

In [0]:
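# Run BM25 for the first 1,000 query ids that exist in `queries`.
ans = {}
j = 0
while len(ans) < 1_000:
    j += 1
    qid = str(j)
    if qid not in queries:
        continue  # not every integer is a valid query id
    hits = searcher.search(queries[qid], 100)
    ans[qid] = [hit.docid for hit in hits]
    print('.', end='')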

........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

### recall@k

In [0]:
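def recall_at_k(qrel, ans, k=100):
    """Mean recall@k over the queries in ans that have at least one relevant doc."""
    n = 0
    total = 0.0
    for qid, ranked in ans.items():
        relevant = qrel.get(qid)  # .get avoids inserting empty sets into the defaultdict
        if relevant:
            found = sum(1 for docid in ranked[:k] if docid in relevant)
            total += found / len(relevant)
            n += 1
    return total / n if n else 0.0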

In [0]:
recall_at_k(qrel, ans)

0.6052380952380952

### MRR@10

In [0]:
%%capture
!wget -nc https://raw.githubusercontent.com/spacemanidol/MSMARCO/master/Ranking/Baselines/msmarco_eval.py

In [0]:
import msmarco_eval

In [0]:
qrels = load_qrels(qrels_path)

Loading qrels 0
Loading qrels 100000
Loading qrels 200000
Loading qrels 300000
Loading qrels 400000
Loading qrels 500000


In [0]:
msmarco_eval.compute_metrics(qrels, ans)

{'MRR @10': 0.00011735936798419533, 'QueriesRanked': 1000}
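
The value above is much lower than the recall@100 result would suggest. One possible explanation, not verified here, is that the evaluation script averages over every query present in `qrels` rather than over the 1,000 queries that were actually ranked. As a cross-check, here is a minimal sketch of MRR@10 averaged only over the ranked queries that have at least one relevant passage; `mrr_at_k` and the toy data are hypothetical.

```python
def mrr_at_k(qrels, ranked, k=10):
    """Mean reciprocal rank of the first relevant doc within the top k."""
    total = 0.0
    n = 0
    for qid, doc_ids in ranked.items():
        relevant = qrels.get(qid)
        if not relevant:
            continue  # skip queries with no relevance judgments
        n += 1
        for rank, doc_id in enumerate(doc_ids[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / n if n else 0.0


toy_qrels = {'q1': {'d3'}, 'q2': {'d9'}, 'q3': {'d5'}}
toy_ranked = {'q1': ['d3', 'd1'], 'q2': ['d1', 'd9']}
print(mrr_at_k(toy_qrels, toy_ranked))  # (1/1 + 1/2) / 2 = 0.75
```

Note that `q3` has judgments but was never ranked, so it does not enter the average in this sketch.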

# Loading data for transformer training

We split the data into two groups, corresponding to two stages: training and inference. 
At training time we need to perform a task that implicitly builds representations (embeddings) for the questions and the passages, such that questions end up "closer" to their relevant passages. 

For this, the data to be used are: 


*   **qidpidtriples.train.full.tsv.gz**: the MSMARCO training set, structured as (qid, relevant pid, irrelevant pid)
*   **predicted_queries_topk_sampling.zip**: queries generated by doc2query for each document in the collection. 

Each sample would be taken from **qidpidtriples.train.full.tsv.gz** and would carry a label specifying whether or not it is relevant.
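
A minimal sketch of how each triple could be turned into labeled samples, assuming the file is tab-separated with columns in the order (qid, relevant pid, irrelevant pid). `triples_to_labeled_pairs` is a hypothetical helper and the ids below are made up for illustration.

```python
import csv
import io


def triples_to_labeled_pairs(lines):
    """Yield (qid, pid, label): label 1 for the relevant passage, 0 otherwise."""
    reader = csv.reader(lines, delimiter='\t')
    for qid, pos_pid, neg_pid in reader:
        yield qid, pos_pid, 1
        yield qid, neg_pid, 0


# Toy stand-in for the decompressed triples file (ids are invented).
toy_file = io.StringIO('10\t100\t200\n11\t101\t201\n')
pairs = list(triples_to_labeled_pairs(toy_file))
print(pairs)
```

Because the generator reads line by line, the same function could be fed an open file handle without loading the whole file into memory.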

At this point I was a little unsure about how to train this model using two towers simultaneously, but I believe this is explained here: https://arxiv.org/pdf/2004.04906.pdf
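
As a rough illustration of the idea in that paper, both towers can be trained together because a single loss depends on the query embedding and on the passage embeddings at once. The sketch below computes the negative log-likelihood of the relevant passage for one triple, using dot-product similarity. The vectors are toy values standing in for encoder outputs, and `triple_loss` is a hypothetical name.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def triple_loss(q_vec, pos_vec, neg_vec):
    """Negative log-likelihood of the positive passage under a softmax over {pos, neg}."""
    s_pos = dot(q_vec, pos_vec)
    s_neg = dot(q_vec, neg_vec)
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(s_pos, s_neg)
    log_z = m + math.log(math.exp(s_pos - m) + math.exp(s_neg - m))
    return log_z - s_pos


q = [1.0, 0.0]
good = triple_loss(q, [2.0, 0.0], [0.0, 2.0])  # positive passage aligned with the query
bad = triple_loss(q, [0.0, 2.0], [2.0, 0.0])   # negative passage aligned with the query
print(good < bad)
```

In an actual training loop, the gradient of this loss would flow into both encoders, which is what ties the two towers together.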


In the inference phase, we would use only the encoder of the trained model. We would generate the embeddings for the artificial queries of each document (doc2query). For each new query in the training set, we would generate its embedding and take the inner product with the embedding of each document. This yields one score per document, which we can use for ranking. After ranking, we evaluate the results with MRR@10. At this point we can restrict the documents being ranked to the results of a BM25 search with pyserini. The relevant passages are expected to be within the top 1000 (or do not exist, depending on the question).
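
The re-ranking step described above can be sketched as follows. This is a minimal illustration with hand-written toy vectors in place of encoder outputs; `rerank`, `doc_vecs` and `bm25_candidates` are hypothetical names.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rerank(query_vec, candidate_ids, doc_vecs):
    """Sort the candidate documents by inner product with the query, best first."""
    scored = [(doc_id, dot(query_vec, doc_vecs[doc_id])) for doc_id in candidate_ids]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy document embeddings; in practice these would come from the trained encoder.
doc_vecs = {'d1': [0.1, 0.9], 'd2': [0.8, 0.2], 'd3': [0.5, 0.5]}
# Only documents returned by the first-stage BM25 search are scored.
bm25_candidates = ['d1', 'd2']

ranking = rerank([1.0, 0.0], bm25_candidates, doc_vecs)
print([doc_id for doc_id, _ in ranking])  # ['d2', 'd1']
```

Restricting the scoring to the BM25 candidates keeps the cost per query proportional to the candidate list size rather than to the full collection.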

In [0]:
TRIPLES = DATA_DIR + '/triples'

In [0]:
#%%capture
# MSMARCO training set with (qid, positive pid, negative pid)
!mkdir {TRIPLES}
!wget -nc https://msmarco.blob.core.windows.net/msmarcoranking/qidpidtriples.train.full.tsv.gz -P {TRIPLES}
!gunzip {TRIPLES}/qidpidtriples.train.full.tsv.gz 

mkdir: cannot create directory ‘drive/My Drive/msmarco/triples’: File exists
--2020-05-27 19:48:53--  https://msmarco.blob.core.windows.net/msmarcoranking/qidpidtriples.train.full.tsv.gz
Resolving msmarco.blob.core.windows.net (msmarco.blob.core.windows.net)... 40.112.152.16
Connecting to msmarco.blob.core.windows.net (msmarco.blob.core.windows.net)|40.112.152.16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2633557579 (2.5G) [application/octet-stream]
Saving to: ‘drive/My Drive/msmarco/triples/qidpidtriples.train.full.tsv.gz’


2020-05-27 19:51:50 (14.3 MB/s) - ‘drive/My Drive/msmarco/triples/qidpidtriples.train.full.tsv.gz’ saved [2633557579/2633557579]

gzip: drive/My Drive/msmarco/triples/qidpidtriples.train.full.tsv already exists; do you wish to overwrite (y or n)? y
n


## Other references

- https://colab.research.google.com/drive/1NXJZ5TaBj_i_g_0KxzJ9ZMsjn310h2YQ (Colab: BERT fine-tuning with doc2query)
- https://arxiv.org/pdf/2004.04906.pdf (Dense Passage Retrieval for Open-Domain Question Answering)