# Lecture 8 - Exercise solutions - Part 1 - Query generation

Leandro Carísio Fernandes

<br>

Goal: build a training dataset for search models using the InPars technique, then evaluate a reranker trained on this dataset on TREC-COVID:

    - Input: 3-5 few-shot examples + a document sampled from the TREC-COVID collection
    - Output: a query that is relevant to the sampled document
    - The filtering step described in the paper, which keeps the queries with the highest probability, is optional.
    - As the generator model, use one of the following:
        - ChatGPT-3.5-turbo: ~1 USD per 1k examples
        - FLAN-T5 (base, large or XL), LLAMA-(7,13B), Alpaca-(7/13B), which can be run on Colab Pro.
        - There is also the HF inference-api: https://huggingface.co/inference-api.
        - Except for LLAMA, zero-shot can be used instead of few-shot.
    - Given 1k-10k <synthetic query; document> pairs, train a miniLM reranker like the one from lectures 2/3.
    - Negative examples (i.e., <synthetic query; non-relevant doc>) come from BM25: given the synthetic query, retrieve the top 1000 with BM25 and randomly sample a few documents as negatives
    - Start training from the miniLM already trained on MS MARCO
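The negative-sampling step in the list above can be sketched as follows. This is a minimal sketch, not the notebook's code: `amostra_negativos` and the toy ranking are hypothetical, and the BM25 search is assumed to have already produced a ranked list of document ids.

```python
import random

def amostra_negativos(ranked_doc_ids, positive_doc_id, n_negativos=3, top_k=1000, seed=8):
    """Sample negatives at random from the BM25 top-k, excluding the positive document."""
    candidatos = [d for d in ranked_doc_ids[:top_k] if d != positive_doc_id]
    rng = random.Random(seed)  # fixed seed keeps the sampling reproducible
    return rng.sample(candidatos, min(n_negativos, len(candidatos)))

# Toy ranking standing in for a real BM25 result list (hypothetical ids).
ranking = [f"doc{i}" for i in range(10)]
negativos = amostra_negativos(ranking, positive_doc_id="doc0", n_negativos=3)
print(negativos)
```

Excluding the positive document matters because the document that produced the synthetic query is very likely to appear near the top of its own BM25 ranking.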

Evaluate on TREC-COVID and compare against the reranker trained only on MS MARCO.
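The text does not name the evaluation metric. nDCG@10 is the metric usually reported for TREC-COVID in the BEIR benchmark, so the comparison could be sketched as below, assuming that metric with linear gains. The function names and the toy relevance judgments are hypothetical.

```python
import math

def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k graded relevance values."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k for one query; qrels maps doc_id -> graded relevance (0 if absent)."""
    gains = [qrels.get(d, 0) for d in ranked_doc_ids]
    ideal = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy example: two rerankers ordering the same three candidates (made-up data).
qrels = {"a": 2, "b": 1}
baseline = ["c", "a", "b"]
finetuned = ["a", "b", "c"]
print(ndcg_at_k(baseline, qrels), ndcg_at_k(finetuned, qrels))
```

In a real evaluation the score would be averaged over all TREC-COVID queries for each reranker.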

Note: also use your classmates' datasets to get more diverse examples. Once you have generated your synthetic dataset, please add it to the spreadsheet so that other people can use it.

    - To increase randomness, the seed used should be your number in the spreadsheet.

Store the dataset in jsonlines format:
{"query": query, "positive_doc_id": doc_id, "negative_doc_ids": [optional]}\n 
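As an illustration of the format above, the following sketch writes and reads records in memory, treating a missing `negative_doc_ids` as an empty list. The record contents and variable names are made up for the example.

```python
import json

registros = [
    {"query": "toy query one", "positive_doc_id": "doc1", "negative_doc_ids": ["doc7", "doc9"]},
    {"query": "toy query two", "positive_doc_id": "doc2"},  # negatives omitted
]

# Serialize: one JSON object per line.
conteudo = "".join(json.dumps(r) + "\n" for r in registros)

# Parse back, treating missing negatives as an empty list.
lidos = []
for linha in conteudo.splitlines():
    reg = json.loads(linha)
    reg.setdefault("negative_doc_ids", [])
    lidos.append(reg)

print(lidos[1])
```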


## Generate the positive examples

In [1]:
%%time

url_trec_covid = 'https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/trec-covid.zip'


!pip install openai -q
!pip install wget -q
!pip install -qU huggingface_hub datasets

CPU times: total: 0 ns
Wall time: 12.2 s


In [3]:
%%time
from pathlib import Path
import wget
import json
import zipfile

    
if not Path('./collections/trec-covid.zip').is_file():
    !mkdir collections
    wget.download(url_trec_covid, out='./collections/')
    with zipfile.ZipFile('./collections/trec-covid.zip', 'r') as zip_ref:
        zip_ref.extractall('./collections/')

    
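def carrega_corpus_trec_covid():
    # Load the whole TREC-COVID corpus (one JSON document per line) into a list.
    retorno = []
    # Explicit UTF-8 avoids depending on the platform default encoding.
    with open('./collections/trec-covid/corpus.jsonl', encoding='utf-8') as corpus:
        for i, line in enumerate(corpus):
            doc = json.loads(line)
            retorno.append(doc)
            if i % 10000 == 0:
                # Progress message: "Processed {i} documents"
                print(f'Processado {i} documentos')
    return retorno

corpus_trec_covid = carrega_corpus_trec_covid()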

Processado 0 documentos
Processado 10000 documentos
Processado 20000 documentos
Processado 30000 documentos
Processado 40000 documentos
Processado 50000 documentos
Processado 60000 documentos
Processado 70000 documentos
Processado 80000 documentos
Processado 90000 documentos
Processado 100000 documentos
Processado 110000 documentos
Processado 120000 documentos
Processado 130000 documentos
Processado 140000 documentos
Processado 150000 documentos
Processado 160000 documentos
Processado 170000 documentos
CPU times: total: 1.84 s
Wall time: 2.38 s
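Each entry of `corpus_trec_covid` is a dict. The later cells use its `_id` and `text` fields, and BEIR corpora also carry a `title`. Because the generated dataset refers to documents by id, a lookup table keyed on `_id` is handy. A minimal sketch with made-up lines (`doc_por_id` is a hypothetical name):

```python
import json

# Two toy lines in the same shape as corpus.jsonl (contents are made up).
linhas = [
    '{"_id": "abc123", "title": "Toy title", "text": "Toy passage about coronaviruses."}',
    '{"_id": "def456", "title": "", "text": "Another toy passage."}',
]

corpus = [json.loads(linha) for linha in linhas]

# Map each document id to its record for constant-time lookup by `_id`.
doc_por_id = {doc["_id"]: doc for doc in corpus}

print(doc_por_id["abc123"]["text"])
```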


In [107]:
import os
import openai

# The API key is read from the API_KEY_OPENAI environment variable.
openai.api_key = os.getenv('API_KEY_OPENAI')

def adiciona_query_gpt_no_doc(idx):
    # Generate one query for document `idx` and store it in the document's 'query' field.
    texto = corpus_trec_covid[idx]['text']

    msg = f"Formulate ONE query for the following passage. \
            Consider how a human use a search engine. Randomly choose if your question starts with what, how, why or which. \
            \n\n\
            {texto}"

    # openai.ChatCompletion is the interface of openai package versions before 1.0.
    response = openai.ChatCompletion.create(model="gpt-3.5-turbo",
                messages=[
                    {"role": "system", "content": "You're cataloging documents and need to associate queries with document passages."},
                    {"role": "user", "content": msg}
                ],
                temperature=0,
                max_tokens=500)

    corpus_trec_covid[idx]['query'] = response['choices'][0]['message']['content']


def adiciona_query_no_corpus(nome_arquivo, indices):
    # Append one JSON line per document, so an interrupted run keeps what was already written.
    with open(nome_arquivo, 'a', encoding='utf-8') as arquivo:
        for i, idx in enumerate(indices):
            adiciona_query_gpt_no_doc(idx)

            doc = {"positive_doc_id": corpus_trec_covid[idx]["_id"], "query": corpus_trec_covid[idx]['query']}

            arquivo.write(f"{json.dumps(doc)}\n")

            if i % 20 == 0:
                # Progress message: "{i} documents processed"
                print(f'{i} documentos processados')
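The cell above uses a zero-shot instruction. The few-shot variant mentioned in the goal could be assembled roughly as below. This is only a sketch: the template wording, the function name `monta_prompt_few_shot`, and the example pairs are hypothetical and are not the exact InPars prompt.

```python
def monta_prompt_few_shot(exemplos, documento):
    """Build a prompt from (document, query) examples followed by the target document."""
    partes = []
    for i, (doc, query) in enumerate(exemplos, start=1):
        partes.append(f"Example {i}:\nDocument: {doc}\nRelevant query: {query}\n")
    n = len(exemplos) + 1
    # The last block is left open so the model completes the query.
    partes.append(f"Example {n}:\nDocument: {documento}\nRelevant query:")
    return "\n".join(partes)

# Made-up example pairs; real ones would be hand-picked (document, query) pairs.
exemplos = [
    ("Toy passage about topic A.", "what is topic A?"),
    ("Toy passage about topic B.", "how does topic B work?"),
    ("Toy passage about topic C.", "why does topic C matter?"),
]
prompt = monta_prompt_few_shot(exemplos, "Toy passage to generate a query for.")
print(prompt)
```

The resulting string would replace `msg` in the user message of the chat request.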
    


In [62]:
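import numpy as np

np.random.seed(8)

# Draw 2000 candidate indices. Note that `high` is exclusive, so `len(...)-1` never
# selects the last document, and randint samples with replacement (duplicates are possible).
indices_candidatos = np.random.randint(0, high=len(corpus_trec_covid)-1, size=2000)
# Keep only documents with more than 300 characters of text, then take the first 1000.
indices = [i for i in indices_candidatos if len(corpus_trec_covid[i]['text']) > 300]
indices = indices[0:1000]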
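`np.random.randint` draws with replacement, so the same document can be picked more than once. A possible alternative, sketched here with the standard library, filters by length first and then samples distinct indices. It uses a different random generator, so it would not reproduce the indices drawn above. `amostra_indices` and the toy lengths are hypothetical.

```python
import random

def amostra_indices(tamanhos, n_amostras, min_chars=300, seed=8):
    """Sample up to n_amostras distinct indices among documents longer than min_chars."""
    elegiveis = [i for i, tam in enumerate(tamanhos) if tam > min_chars]
    rng = random.Random(seed)
    return rng.sample(elegiveis, min(n_amostras, len(elegiveis)))

# Toy text lengths standing in for len(doc['text']) of each document.
tamanhos = [50, 400, 1200, 299, 301, 800, 10, 5000]
indices_amostrados = amostra_indices(tamanhos, n_amostras=3)
print(indices_amostrados)
```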

In [109]:
#adiciona_query_no_corpus('leandro_carisio_20230428_01.jsonl', indices[600:])

0 documentos processados
20 documentos processados
40 documentos processados
60 documentos processados
80 documentos processados
100 documentos processados
120 documentos processados
140 documentos processados
160 documentos processados
180 documentos processados
200 documentos processados
220 documentos processados
240 documentos processados
260 documentos processados
280 documentos processados
300 documentos processados
320 documentos processados
340 documentos processados
360 documentos processados
380 documentos processados
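The commented-out call above starts at `indices[600:]`, presumably to continue a run that had stopped partway. An alternative that avoids tracking the offset by hand is to skip documents whose ids are already in the output file. A minimal sketch, with `io.StringIO` standing in for the partially written file and hypothetical names throughout:

```python
import io
import json

def ids_ja_processados(arquivo):
    """Collect the positive_doc_id of every record already written."""
    return {json.loads(linha)["positive_doc_id"] for linha in arquivo if linha.strip()}

# io.StringIO stands in for the partially written .jsonl file (toy content).
parcial = io.StringIO(
    '{"positive_doc_id": "d1", "query": "q1"}\n'
    '{"positive_doc_id": "d2", "query": "q2"}\n'
)
feitos = ids_ja_processados(parcial)

todos = ["d1", "d2", "d3", "d4"]
pendentes = [d for d in todos if d not in feitos]
print(pendentes)  # ['d3', 'd4']
```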


In [157]:
#adiciona_query_no_corpus('leandro_carisio_20230428_01.jsonl', indices)

0 documentos processados


## Test reading the dataset from the repository


In [2]:
from huggingface_hub import login
login()


In [3]:
import datasets
ds = datasets.load_dataset('unicamp-dl/trec-covid-experiment')

Downloading and preparing dataset trec-covid-experiment/default to C:/Users/caris/.cache/huggingface/datasets/unicamp-dl___trec-covid-experiment/default/0.0.0/39177de766cb7ace6cb0b5a27f1db36700b181fd775fbd271589004af3109267...


Dataset trec-covid-experiment downloaded and prepared to C:/Users/caris/.cache/huggingface/datasets/unicamp-dl___trec-covid-experiment/default/0.0.0/39177de766cb7ace6cb0b5a27f1db36700b181fd775fbd271589004af3109267. Subsequent calls will reuse this data.



In [4]:
import pandas as pd

In [5]:
df = pd.concat((v.to_pandas().assign(origin=k) for k,v in ds.items()),
               ignore_index=True)
df

| | query | positive_doc_id | negative_doc_ids | origin |
|---|---|---|---|---|
| 0 | This is a example query 1 | doc1 | [xxx, yyy, zzz] | example |
| 1 | This is another example query | doc2 | [aaa, bbb, ccc] | example |
| 2 | Example of query with no negative doc_ids | doc2 | [] | example |
| 3 | This is a example query 1 (file 2) | doc12222 | [xxx, yyy, zzz] | example2 |
| 4 | This is another example query (file 2) | doc12345 | [aaa, bbb, ccc] | example2 |
| ... | ... | ... | ... | ... |
| 9443 | What is the significance of the phylogeny of t... | 28nv3b8a | [0sedbv51, m8lrc4c0, i6puqauk, bp8dqis1, exlsn... | marcus_borela_1k_gptj6b_20230501 |
| 9444 | What is the significance of the accuracy of th... | sffuejo1 | [go4d8jwn, n5gk1xhb, edunzo0f, xk2a7tsw, 6l888... | marcus_borela_1k_gptj6b_20230501 |
| 9445 | What is the significance of the low fecundity ... | 3dyatdlv | [azv9yw9n, r55fe25x, wc23tqv2, vqmqnipq, d0dni... | marcus_borela_1k_gptj6b_20230501 |
| 9446 | What is the significance of regional and natio... | jps1j60a | [uu4k2j2a, mo4luyx6, mnt12ot2, fex8sd1t, fex8s... | marcus_borela_1k_gptj6b_20230501 |
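To train the reranker, each row of this table can be expanded into labeled (query, document id) pairs: label 1 for the positive document and label 0 for each negative. The sketch below uses plain dicts in place of DataFrame rows and skips the placeholder `example` and `example2` splits. `expande_pares` and the toy rows are hypothetical.

```python
def expande_pares(registros, ignorar=("example", "example2")):
    """Turn dataset rows into (query, doc_id, label) tuples, skipping placeholder splits."""
    pares = []
    for reg in registros:
        if reg["origin"] in ignorar:
            continue
        pares.append((reg["query"], reg["positive_doc_id"], 1))
        for neg in reg.get("negative_doc_ids", []):
            pares.append((reg["query"], neg, 0))
    return pares

registros = [
    {"query": "This is a example query 1", "positive_doc_id": "doc1",
     "negative_doc_ids": ["xxx", "yyy", "zzz"], "origin": "example"},
    {"query": "toy query", "positive_doc_id": "p1",
     "negative_doc_ids": ["n1", "n2"], "origin": "toy_split"},
]
pares = expande_pares(registros)
print(pares)
```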


In [6]:
pd.unique(df.origin)

array(['example', 'example2',
       'eduseiti_100_queries_expansion_20230501_01', 'leandro_carisio_01',
       'thales_1k_generated_queries_20230429',
       'manoel_1k_generated_queries_20230430',
       'manoel_2k_generated_queries_20230501', 'thiago_laitz_1k_queries',
       'mirelle_1k_generated_queries_20230501',
       'hugo_padovani_query_generation',
       'marcus_borela_1k_gptj6b_20230501'], dtype=object)

In [7]:
df[df.origin == 'marcus_borela_1k_gptj6b_20230501'][:10]

| | query | positive_doc_id | negative_doc_ids | origin |
|---|---|---|---|---|
| 8448 | What is the significance of the speculation ab... | jmn8ctlt | [n26csuks, 09jyekp5, g5bhkd55, x1xzcacy, qv794... | marcus_borela_1k_gptj6b_20230501 |
| 8449 | What is the significance of the TRAP-18 in ass... | ysz3oocp | [ayt31knx, hycd9zua, xug622im, rjzy3z8t, m54mo... | marcus_borela_1k_gptj6b_20230501 |
| 8450 | €œIn augustus 2011 wijdde de Journal of the Am... | jnuncmd5 | [tfvn7xp8, yxmqmetv, 420px62r, s3xnieig, aey40... | marcus_borela_1k_gptj6b_20230501 |
| 8451 | What is the estimated reproduction number of C... | k1qq4qgg | [axjrmk48, q4a1n1wm, m2bgbqzg, raav221g, oen0y... | marcus_borela_1k_gptj6b_20230501 |
| 8452 | What is the significance of automatic semantic... | c5zewz3m | [70x44y0t, 5b78ow1j, vr1kmria, 5htvzkpt, borka... | marcus_borela_1k_gptj6b_20230501 |
| 8453 | What is the significance of the lack of second... | 7xp143nc | [hp8pswt7, xsaat823, huxbyvfd, dyww1yh0, lf2mf... | marcus_borela_1k_gptj6b_20230501 |
| 8454 | What are the enabling opportunities and challe... | l5441kqp | [994i9rtb, xw244riy, y9n9d6s3, fbppvt75, 4og21... | marcus_borela_1k_gptj6b_20230501 |
| 8455 | 【Analysis on epidemic situation and spatiotemp... | 35roxf0h | [3cj05yc8, qhhssnqk, oqo1xcb3, 8dlnq07a, n30nj... | marcus_borela_1k_gptj6b_20230501 |
| 8456 | What is the significance of environmental sani... | czo533oj | [z59cvvkf, kxwgpymq, b3oh77tu, 95ap34cz, ciuhy... | marcus_borela_1k_gptj6b_20230501 |
| 8457 | What are the challenges posed by the COVID-19 ... | 8kfb9alv | [ozggtbhj, gdhcpmom, vw1lwev0, h76dfo1j, vlzbr... | marcus_borela_1k_gptj6b_20230501 |


In [11]:
ds['marcus_borela_1k_gptj6b_20230501']['query']

['What is the significance of the speculation about the catastrophe that awaits once COVID-19 establishes itself in the poorest communities of South Africa?',
 'What is the significance of the TRAP-18 in assessing the threat of lone-actor terrorism?',
 '€œIn augustus 2011 wijdde de Journal of the American Medical Association (JAMA) een redactioneel commentaar aan het groeiende probleem van morbide obese kinderen: kinderen die aan een dusdanig ernstige obesitas lijden dat hun gezondheid er direct door wordt',
 'What is the estimated reproduction number of COVID-19 in Iran?',
 'What is the significance of automatic semantic description extraction from SBD for emergency management?',
 'What is the significance of the lack of secondary transmission of Ebola virus from healthcare worker to 238 contacts?',
 'What are the enabling opportunities and challenges that IoT brings to quadruple helix actors?',
 '【Analysis on epidemic situation and spatiotemporal changes of COVID-19 in Anhui】.',
 'Wh