# TREC-COVID Data Processing

[TREC-COVID](https://ir.nist.gov/covidSubmit/)

Round 5 will use the July 16, 2020 version of CORD-19.

The query field provides a short keyword statement of the information need, while the question field provides a more complete description of the information need. The narrative provides extra clarification, and is not necessarily a superset of the information in the question.

TREC-COVID uses residual collection evaluation: the ranked lists that you submit should not contain any documents that exist in a qrels file for the topic (even if you did not make use of the judgments). Any pre-judged documents that are submitted will be automatically removed upon submission, so you will have effectively returned fewer documents than you might have.
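A minimal sketch of the residual filtering described above, assuming a run is held as a list of `(topic_id, cord_id)` pairs in rank order and the previously judged documents as a set of the same pairs. The function name `residual_run` and the sample IDs are illustrative and are not part of the TREC-COVID tooling.

```python
def residual_run(run, judged):
    """Drop every (topic_id, cord_id) pair that already appears in earlier qrels."""
    return [pair for pair in run if pair not in judged]


# Hypothetical example data: topic "1" with three ranked documents.
run = [("1", "doc_a"), ("1", "doc_b"), ("1", "doc_c")]
judged = {("1", "doc_b")}  # assumed to have been judged in an earlier round

filtered = residual_run(run, judged)
print(filtered)
```

Filtering before submission lets you backfill the freed positions with unjudged documents instead of losing them.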

The relevance judgments will be made by human annotators who have biomedical expertise. Annotators will use a three-way scale:

- Relevant: the article is fully responsive to the information need as expressed by the topic, i.e. answers the Question in the topic. The article need not contain all information on the topic, but must, on its own, provide an answer to the question.
- Partially Relevant: the article answers part of the question but would need to be combined with other information to get a complete answer.
- Not Relevant: everything else.

Relevance is known to be highly subjective; as in all test collections, the opinion of the particular human annotator making the judgment will be final.

The format of a relevance judgments file ("qrels") is lines of

`topic-id iteration cord-id judgment`

where judgment is 0 for not relevant, 1 for partially relevant, and 2 for fully relevant; and iteration records the round in which the document was judged.
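As a rough illustration of this format, the sketch below parses one qrels line into a typed record. The `Qrel` class, `parse_qrel_line`, and the `LABELS` mapping are names made up for this example; the sample line reuses the first row shown later in this notebook.

```python
from typing import NamedTuple


class Qrel(NamedTuple):
    topic_id: str
    iteration: str  # round in which the document was judged
    cord_id: str
    judgment: int


# Labels for the three-way scale described above.
LABELS = {0: "not relevant", 1: "partially relevant", 2: "fully relevant"}


def parse_qrel_line(line):
    """Split a whitespace-separated qrels line into its four fields."""
    topic_id, iteration, cord_id, judgment = line.split()
    return Qrel(topic_id, iteration, cord_id, int(judgment))


qrel = parse_qrel_line("1 4.5 005b2j4b 2")
# Fall back to "unknown" for any value outside the documented scale.
print(qrel, LABELS.get(qrel.judgment, "unknown"))
```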

[Round 5 submissions](https://ir.nist.gov/covidSubmit/archive/archive-round5.html)

[Round 5 ranking](https://docs.google.com/spreadsheets/d/1n1IJ6gkZQh3lyjJFQZPOm2WL5Ct3GFSWa7ImqBhcvn4/edit?usp=sharing)

P@20 is used as the evaluation metric.
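For reference, a small sketch of how P@k can be computed for a single topic. It assumes that any judgment of 1 or 2 counts as relevant (the same convention as the counting cells later in this notebook) and that unjudged documents count as not relevant; `precision_at_k` and the sample data are illustrative only.

```python
def precision_at_k(ranked_ids, judgments, k=20):
    """Fraction of the top-k ranked documents whose judgment is >= 1.

    The denominator is always k, so a ranking shorter than k is penalized.
    """
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if judgments.get(doc_id, 0) >= 1)
    return hits / k


# Hypothetical ranking and judgments for one topic.
ranked = ["d1", "d2", "d3", "d4"]
judgments = {"d1": 2, "d2": 0, "d3": 1}

p_at_4 = precision_at_k(ranked, judgments, k=4)
print(p_at_4)
```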

In [2]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import xml.etree.ElementTree as ET

In [14]:
df_meta = pd.read_csv("trec_covid/metadata_20200716.csv", dtype="str", na_filter=False)
df_meta = df_meta.drop_duplicates(['cord_uid'])

In [15]:
print(df_meta.shape)
df_meta.head()

(191175, 19)


,cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,mag_id,who_covidence_id,arxiv_id,pdf_json_files,pmc_json_files,url,s2_id
0,ug7v899j,d1aafb70c066a2068b02786f8929fd9c900897fb,PMC,Clinical features of culture-proven Mycoplasma...,10.1186/1471-2334-1-6,PMC35282,11472636,no-cc,OBJECTIVE: This retrospective chart review des...,2001-07-04,"Madani, Tariq A; Al-Ghamdi, Aisha A",BMC Infect Dis,,,,document_parses/pdf_json/d1aafb70c066a2068b027...,document_parses/pmc_json/PMC35282.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3...,
1,02tnwd4m,6b0567729c2143a66d737eb0a2f63f2dce2e5a7d,PMC,Nitric oxide: a pro-inflammatory mediator in l...,10.1186/rr14,PMC59543,11667967,no-cc,Inflammatory diseases of the respiratory tract...,2000-08-15,"Vliet, Albert van der; Eiserich, Jason P; Cros...",Respir Res,,,,document_parses/pdf_json/6b0567729c2143a66d737...,document_parses/pmc_json/PMC59543.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,
2,ejv2xln0,06ced00a5fc04215949aa72528f2eeaae1d58927,PMC,Surfactant protein-D and pulmonary host defense,10.1186/rr19,PMC59549,11667972,no-cc,Surfactant protein-D (SP-D) participates in th...,2000-08-25,"Crouch, Erika C",Respir Res,,,,document_parses/pdf_json/06ced00a5fc04215949aa...,document_parses/pmc_json/PMC59549.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,
3,2b73a28n,348055649b6b8cf2b9a376498df9bf41f7123605,PMC,Role of endothelin-1 in lung disease,10.1186/rr44,PMC59574,11686871,no-cc,Endothelin-1 (ET-1) is a 21 amino acid peptide...,2001-02-22,"Fagan, Karen A; McMurtry, Ivan F; Rodman, David M",Respir Res,,,,document_parses/pdf_json/348055649b6b8cf2b9a37...,document_parses/pmc_json/PMC59574.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,
4,9785vg6d,5f48792a5fa08bed9f56016f4981ae2ca6031b32,PMC,Gene expression in epithelial cells in respons...,10.1186/rr61,PMC59580,11686888,no-cc,Respiratory syncytial virus (RSV) and pneumoni...,2001-05-11,"Domachowske, Joseph B; Bonville, Cynthia A; Ro...",Respir Res,,,,document_parses/pdf_json/5f48792a5fa08bed9f560...,document_parses/pmc_json/PMC59580.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,


In [114]:
df_meta.pubmed_id.value_counts()

            74946
32419926        3
28794858        2
27192395        2
25215100        2
            ...  
29324693        1
29553813        1
32352366        1
23522661        1
32446539        1
Name: pubmed_id, Length: 116178, dtype: int64

In [13]:
with open("trec_covid/topics-rnd5.xml") as f:
    soup = BeautifulSoup(f.read(), "xml")
    topics = {"num":[], "query":[], "question":[], "narrative":[]}
    for topic in soup.find_all('topic'):
        topics["num"].append(topic['number'])
        topics["query"].append(topic.query.text)
        topics["question"].append(topic.question.text)
        topics["narrative"].append(topic.narrative.text)
df_topics = pd.DataFrame(topics)

In [32]:
df_topics

,num,query,question,narrative
0,1,coronavirus origin,what is the origin of COVID-19,seeking range of information about the SARS-Co...
1,2,coronavirus response to weather changes,how does the coronavirus respond to changes in...,seeking range of information about the SARS-Co...
2,3,coronavirus immunity,will SARS-CoV2 infected people develop immunit...,seeking studies of immunity developed due to i...
3,4,how do people die from the coronavirus,what causes death from Covid-19?,Studies looking at mechanisms of death from Co...
4,5,animal models of COVID-19,what drugs have been active against SARS-CoV o...,Papers that describe the results of testing d...
5,6,coronavirus test rapid testing,what types of rapid testing for Covid-19 have ...,Looking for studies identifying ways to diagno...
6,7,serological tests for coronavirus,are there serological tests that detect antibo...,Looking for assays that measure immune respons...
7,8,coronavirus under reporting,how has lack of testing availability led to un...,Looking for studies answering questions of imp...
8,9,coronavirus in Canada,how has COVID-19 affected Canada,"seeking data related to infections (confirm, s..."
9,10,coronavirus social distancing impact,has social distancing had an impact on slowing...,seeking specific information on studies that h...


In [27]:
with open("trec_covid/qrels-covid_d5_j0.5-5.txt") as f:
    qrels = []
    for row in f:
        qrels.append(row.strip().split())
df_qrels_gs = pd.DataFrame(qrels, columns=['topic_id', 'iteration', 'cord_id', 'judgment'])

In [29]:
df_qrels_gs.head()

,topic_id,iteration,cord_id,judgment
0,1,4.5,005b2j4b,2
1,1,4.0,00fmeepz,1
2,1,0.5,010vptx3,2
3,1,2.5,0194oljo,1
4,1,4.0,021q9884,1


In [28]:
pd.crosstab(df_qrels_gs.topic_id, df_qrels_gs.judgment)

topic_id \ judgment,-1,0,1,2
1,0,948,362,337
10,0,644,203,294
11,0,1379,226,216
12,0,978,295,353
13,0,973,656,264
14,0,1023,172,101
15,0,1535,266,180
16,0,1230,236,174
17,0,636,372,345
18,0,659,319,347


# Indexing

Use [Pyserini](https://github.com/castorini/pyserini/) for indexing

Prepare json files for indexing

In [2]:
import json

In [17]:
# prepare json files for indexing
df_meta = pd.read_csv("trec_covid/metadata_20200716.csv", dtype="str", na_filter=False)
df_meta = df_meta.drop_duplicates(['cord_uid'])
trecovid_titles = []
trecovid_abstrs = []
trecovid_titabs = []
for i, row in df_meta.iterrows():
    trecovid_titles.append({'id':row['cord_uid'], 'contents':row['title']})
    trecovid_abstrs.append({'id':row['cord_uid'], 'contents':row['abstract']})
    trecovid_titabs.append({'id':row['cord_uid'], 'contents':row['title']+' '+row['abstract']})

# write each collection inside a context manager so the file is flushed and closed before indexing
for name, docs in [('titles', trecovid_titles), ('abstrs', trecovid_abstrs), ('titabs', trecovid_titabs)]:
    with open(f"trec_covid/articles/{name}/trecovid_{name}.json", "w", encoding='utf-8') as f:
        json.dump(docs, f)

!python -m pyserini.index.lucene \
  --collection JsonCollection \
  --input trec_covid/articles/titles \
  --index trec_covid/indexes/titles \
  --generator DefaultLuceneDocumentGenerator \
  --threads 1 \
  --storePositions --storeDocvectors --storeRaw

!python -m pyserini.index.lucene \
  --collection JsonCollection \
  --input trec_covid/articles/abstrs \
  --index trec_covid/indexes/abstrs \
  --generator DefaultLuceneDocumentGenerator \
  --threads 1 \
  --storePositions --storeDocvectors --storeRaw

!python -m pyserini.index.lucene \
  --collection JsonCollection \
  --input trec_covid/articles/titabs \
  --index trec_covid/indexes/titabs \
  --generator DefaultLuceneDocumentGenerator \
  --threads 1 \
  --storePositions --storeDocvectors --storeRaw

Title
- Indexing Complete! 191,126 documents indexed
- ============ Final Counter Values ============
- indexed:          191,126
- unindexable:            0
- empty:                 49
- skipped:                0
- errors:                 0
- Total 191,126 documents indexed in 00:00:07

Abstract
- Indexing Complete! 136,746 documents indexed
- ============ Final Counter Values ============
- indexed:          136,746
- unindexable:            0
- empty:             54,429
- skipped:                0
- errors:                 0
- Total 136,746 documents indexed in 00:00:27

Title and Abstract
- Indexing Complete! 191,132 documents indexed
- ============ Final Counter Values ============
- indexed:          191,132
- unindexable:            0
- empty:                 43
- skipped:                0
- errors:                 0
- Total 191,132 documents indexed in 00:00:30
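The three counter blocks above can be cross-checked against the 191,175 rows reported by `df_meta.shape`: for each index, the indexed and empty counts should add up to the full collection. A quick sketch of that arithmetic, with the numbers copied from the logs above:

```python
TOTAL_DOCS = 191_175  # rows in df_meta after drop_duplicates

# Counter values copied from the indexing logs above.
counters = {
    "titles": {"indexed": 191_126, "empty": 49},
    "abstrs": {"indexed": 136_746, "empty": 54_429},
    "titabs": {"indexed": 191_132, "empty": 43},
}

for name, c in counters.items():
    accounted = c["indexed"] + c["empty"]
    print(f"{name}: {accounted} accounted for, matches total: {accounted == TOTAL_DOCS}")
```

Every document is therefore either indexed or reported as empty.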

In [3]:
from pyserini.search.lucene import LuceneSearcher

In [19]:
df_topics.question[0]

'what is the origin of COVID-19'

In [20]:
searcher = LuceneSearcher('trec_covid/indexes/titles')
hits = searcher.search(df_topics.question[0], 10) #query, question, narrative

for i in range(len(hits)):
    print(f'{i+1:2} {hits[i].docid:4} {hits[i].score:.5f}')

 1 dv9m19yk 6.07950
 2 h4gi99hn 5.51870
 3 zqx1swhj 5.51870
 4 alx5uc95 5.22930
 5 vq4t53l9 5.22930
 6 gv1k7u7j 4.66420
 7 u7u75sl0 4.66420
 8 4977dzxz 4.57310
 9 4s72gdvq 4.57310
10 kgifmjvb 4.57310


In [23]:
df_topics[['num', 'query']].to_csv('trec_covid/trecovid_topics_query.tsv', sep='\t', header=False, index=False)
df_topics[['num', 'question']].to_csv('trec_covid/trecovid_topics_question.tsv', sep='\t', header=False, index=False)
df_topics[['num', 'narrative']].to_csv('trec_covid/trecovid_topics_narrative.tsv', sep='\t', header=False, index=False)

In [4]:
from text_to_wordlist import text_to_wordlist

In [152]:
df_topics['proc_query'] = df_topics['query'].apply(lambda x:text_to_wordlist(x))
df_topics['proc_question'] = df_topics['question'].apply(lambda x:text_to_wordlist(x))
df_topics['proc_narrative'] = df_topics['narrative'].apply(lambda x:text_to_wordlist(x))

In [153]:
df_topics[['num', 'proc_query']].to_csv('trec_covid/trecovid_topics_proc_query.tsv', sep='\t', header=False, index=False)
df_topics[['num', 'proc_question']].to_csv('trec_covid/trecovid_topics_proc_question.tsv', sep='\t', header=False, index=False)
df_topics[['num', 'proc_narrative']].to_csv('trec_covid/trecovid_topics_proc_narrative.tsv', sep='\t', header=False, index=False)

In [None]:
# query
# query title
python -m pyserini.search.lucene \
  --index trec_covid/indexes/titles \
  --topics trec_covid/trecovid_topics_query.tsv \
  --output trec_covid/trec_eval/test/trecovid_bm25_query_title.txt \
  --bm25

# query abstract
python -m pyserini.search.lucene \
  --index trec_covid/indexes/abstrs \
  --topics trec_covid/trecovid_topics_query.tsv \
  --output trec_covid/trec_eval/test/trecovid_bm25_query_abstr.txt \
  --bm25

# query title and abstract
python -m pyserini.search.lucene \
  --index trec_covid/indexes/titabs \
  --topics trec_covid/trecovid_topics_query.tsv \
  --output trec_covid/trec_eval/test/trecovid_bm25_query_titab.txt \
  --bm25

# question
# question title
python -m pyserini.search.lucene \
  --index trec_covid/indexes/titles \
  --topics trec_covid/trecovid_topics_question.tsv \
  --output trec_covid/trec_eval/test/trecovid_bm25_question_title.txt \
  --bm25

# question abstract
python -m pyserini.search.lucene \
  --index trec_covid/indexes/abstrs \
  --topics trec_covid/trecovid_topics_question.tsv \
  --output trec_covid/trec_eval/test/trecovid_bm25_question_abstr.txt \
  --bm25

# question title and abstract
python -m pyserini.search.lucene \
  --index trec_covid/indexes/titabs \
  --topics trec_covid/trecovid_topics_question.tsv \
  --output trec_covid/trec_eval/test/trecovid_bm25_question_titab.txt \
  --bm25

# narrative
# narrative title
python -m pyserini.search.lucene \
  --index trec_covid/indexes/titles \
  --topics trec_covid/trecovid_topics_narrative.tsv \
  --output trec_covid/trec_eval/test/trecovid_bm25_narrative_title.txt \
  --bm25

# narrative abstract
python -m pyserini.search.lucene \
  --index trec_covid/indexes/abstrs \
  --topics trec_covid/trecovid_topics_narrative.tsv \
  --output trec_covid/trec_eval/test/trecovid_bm25_narrative_abstr.txt \
  --bm25

# narrative title and abstract
python -m pyserini.search.lucene \
  --index trec_covid/indexes/titabs \
  --topics trec_covid/trecovid_topics_narrative.tsv \
  --output trec_covid/trec_eval/test/trecovid_bm25_narrative_titab.txt \
  --bm25

# processed queries
# proc_query title and abstract
python -m pyserini.search.lucene \
  --index trec_covid/indexes/titabs \
  --topics trec_covid/trecovid_topics_proc_query.tsv \
  --output trec_covid/trec_eval/test/trecovid_bm25_proc_query_titab.txt \
  --bm25

# proc_question title and abstract
python -m pyserini.search.lucene \
  --index trec_covid/indexes/titabs \
  --topics trec_covid/trecovid_topics_proc_question.tsv \
  --output trec_covid/trec_eval/test/trecovid_bm25_proc_question_titab.txt \
  --bm25

# proc_narrative title and abstract
python -m pyserini.search.lucene \
  --index trec_covid/indexes/titabs \
  --topics trec_covid/trecovid_topics_proc_narrative.tsv \
  --output trec_covid/trec_eval/test/trecovid_bm25_proc_narrative_titab.txt \
  --bm25
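The twelve commands above differ only in the topic field, the index, and whether the processed topics are used, so they can also be generated in a loop. The sketch below only builds the argument lists and does not run anything; `build_search_command` and `INDEX_SUFFIX` are helper names invented here, while the paths and flags are the ones used in the commands above.

```python
import shlex

# Maps each index directory name to the suffix used in the run file names above.
INDEX_SUFFIX = {"titles": "title", "abstrs": "abstr", "titabs": "titab"}


def build_search_command(topic_field, index_name, processed=False):
    """Return the argument list for one BM25 search run."""
    prefix = "proc_" if processed else ""
    suffix = INDEX_SUFFIX[index_name]
    return [
        "python", "-m", "pyserini.search.lucene",
        "--index", f"trec_covid/indexes/{index_name}",
        "--topics", f"trec_covid/trecovid_topics_{prefix}{topic_field}.tsv",
        "--output", f"trec_covid/trec_eval/test/trecovid_bm25_{prefix}{topic_field}_{suffix}.txt",
        "--bm25",
    ]


fields = ["query", "question", "narrative"]
commands = [build_search_command(f, idx) for f in fields for idx in INDEX_SUFFIX]
# Processed topics are only searched against the title+abstract index.
commands += [build_search_command(f, "titabs", processed=True) for f in fields]

for cmd in commands:
    print(shlex.join(cmd))
```

Each list could then be passed to `subprocess.run(cmd, check=True)` to execute the search.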

In [154]:
que_results = {'query':[], 'index':[], 'topic_id':[], 'top1':[], 'top20':[], 'top100':[]}
for q in ['query', 'question', 'narrative']:
    for idx in ['titab']: #'title', 'abstr', 
        with open(f"trec_covid/trec_eval/test/trecovid_bm25_proc_{q}_{idx}.txt") as f:
            qrels = []
            for row in f:
                qrels.append(row.strip().split()[:5])
            df_qrels_bm25 = pd.DataFrame(qrels, columns=['topic_id', 'iteration', 'cord_id', 'rank', 'score'])
            for topid in range(1, 51):
                que_bm25 = df_qrels_bm25[(df_qrels_bm25.topic_id==str(topid))]
                gs = df_qrels_gs[(df_qrels_gs.topic_id==str(topid))]
                df_join = que_bm25[['cord_id', 'rank', 'score']].set_index('cord_id').join(gs[['cord_id', 'judgment']].set_index('cord_id'), how='inner')
                num_top1 = df_join[(df_join.judgment.isin(['1', '2']))&(df_join['rank'].astype('int')<=1)].shape[0]
                num_top20 = df_join[(df_join.judgment.isin(['1', '2']))&(df_join['rank'].astype('int')<=20)].shape[0]
                num_top100 = df_join[(df_join.judgment.isin(['1', '2']))&(df_join['rank'].astype('int')<=100)].shape[0]
#                 print(f"Topic {topid}: Top1: {num_top1}; Top20: {num_top20}; Top100: {num_top100}")
                que_results['query'].append(q)
                que_results['index'].append(idx)
                que_results['topic_id'].append(topid)
                que_results['top1'].append(num_top1)
                que_results['top20'].append(num_top20)
                que_results['top100'].append(num_top100)

In [155]:
pd.DataFrame(que_results).to_csv('trec_covid/trecovid_bm25_proc_titab_results.csv', index=False)

## TREC-COVID Eval

cd trec_covid/trec_eval

./trec_eval -q -c -M1000 test/qrels-covid_d5_j0.5-5.txt test/trecovid_bm25_query_titab.txt  > test/trecovid_eval_bm25_query_titab.txt

./trec_eval -q -c -M1000 test/qrels-covid_d5_j0.5-5.txt test/trecovid_bm25_question_titab.txt  > test/trecovid_eval_bm25_question_titab.txt

./trec_eval -q -c -M1000 test/qrels-covid_d5_j0.5-5.txt test/trecovid_bm25_narrative_titab.txt  > test/trecovid_eval_bm25_narrative_titab.txt

### Processed Queries

./trec_eval -q -c -M1000 test/qrels-covid_d5_j0.5-5.txt test/trecovid_bm25_proc_query_titab.txt  > test/trecovid_eval_bm25_proc_query_titab.txt

./trec_eval -q -c -M1000 test/qrels-covid_d5_j0.5-5.txt test/trecovid_bm25_proc_question_titab.txt  > test/trecovid_eval_bm25_proc_question_titab.txt

./trec_eval -q -c -M1000 test/qrels-covid_d5_j0.5-5.txt test/trecovid_bm25_proc_narrative_titab.txt  > test/trecovid_eval_bm25_proc_narrative_titab.txt
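With `-q`, `trec_eval` reports each measure per topic as well as for `all` topics, one `metric topic value` triple per line. As a sketch of how P@20 could be pulled out of such a file, the function below keeps only the lines for one metric. The sample lines and their values are made up for illustration and are not results from this notebook.

```python
def parse_trec_eval(lines, metric="P_20"):
    """Return {topic: score} for one metric from trec_eval -q output lines."""
    scores = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 3 and parts[0] == metric:
            scores[parts[1]] = float(parts[2])
    return scores


# Hypothetical trec_eval output; real files contain many more measures.
sample_output = [
    "P_20\t1\t0.5000",
    "P_20\t2\t0.2500",
    "runid\tall\tAnserini",
    "P_20\tall\t0.3750",
]

scores = parse_trec_eval(sample_output)
print(scores)
```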

# Preparing Data for Re-ranking

The re-ranking input file is tab-separated and uses the following columns:

'sentence', 'sentence_pmid', 'pmid', 'citation_pmid', 'articleTitle', 'abstract', 'sentence_year', 'year', 'publicationType', 'sum_citation', 'normalized_citation', 'journal_IF', 'title_score', 'title_abstract_score', 'title_coverage', 'abstract_coverage', 'rank'

In [5]:
# import deepsense text processing functions
from create_pubmed_search_test_dataset import load_normalized_citation_by_year, cal_sum_citation, cal_coverage
from text_to_wordlist import text_to_wordlist

In [121]:
# load deepsense relevant data
citations_dict=json.load(open('/data/shubo/deepsense/project/src_new/input/abstract_citations_by_year.json'))

reader_abs_citation_IF=pd.read_csv('/data/shubo/deepsense/project/src_new/input/title_abstracts_complete.tsv', encoding='utf-8',sep='\t')
reader_abs_citation_IF.set_index(['pmid'], inplace=True)
print(reader_abs_citation_IF.shape)
reader_abs_citation_IF = reader_abs_citation_IF.loc[~reader_abs_citation_IF.index.duplicated(keep='first')]
print(reader_abs_citation_IF.shape)
indexs=set(reader_abs_citation_IF.index)

normalized_citation_by_year=load_normalized_citation_by_year('/data/shubo/deepsense/project/src_new/input/')

(18854309, 17)
(18853210, 17)


In [None]:
df_meta = pd.read_csv("trec_covid/metadata_20200716.csv", dtype="str", na_filter=False)
df_meta = df_meta.drop_duplicates(['cord_uid'])
for q in ['query', 'question', 'narrative']:
    for idx in ['titab']: #'title', 'abstr', 
        with open(f"trec_covid/trec_eval/test/trecovid_bm25_proc_{q}_{idx}.txt") as f, \
             open(f"trec_covid/file_for_rerank/trecovid_topics_proc_{q}_{idx}_for_rerank.csv", "w") as testf:
            testf.write('\t'.join(['sentence', 'sentence_pmid', 'pmid', 'citation_pmid', 'articleTitle',
                                   'abstract', 'sentence_year', 'year', 'publicationType', 'sum_citation',
                                   'normalized_citation', 'journal_IF', 'title_score', 'title_abstract_score',
                                   'title_coverage', 'abstract_coverage', 'rank'])+'\n')
            for row in f:
                items = row.strip().split()[:5]
#                 print(items)
                sentence = df_topics[df_topics.num==items[0]][q].iloc[0]
                sentence = text_to_wordlist(sentence)
                sentence_pmid = items[0]
                
                # look up the article's metadata row once and reuse it for every field
                meta_row = df_meta[df_meta.cord_uid==items[2]].iloc[0]
                
                pmid = meta_row['pubmed_id']
                pmid = int(pmid) if pmid.isnumeric() else items[2]
                
                title = text_to_wordlist(meta_row['title'])
                abstract = text_to_wordlist(meta_row['abstract'])
                
                title_coverage = cal_coverage(sentence, title)
                abstract_coverage = cal_coverage(sentence, abstract)
                
                year = meta_row['publish_time'][:4]
                year = int(year) if year.isnumeric() else 0
                year = 2020 if year>2020 else year
                
                year_diff = 2020-year
                
                publicationType = '0'
                sum_citation = 0
                normalized_citation = 0
                journal_IF = 0.0
                if pmid in indexs:
                    publicationType = reader_abs_citation_IF.loc[pmid]['publicationType']
                    avg_citation_all_abstracts = normalized_citation_by_year[year][year_diff]
                    sum_citation = cal_sum_citation(2020, pmid, citations_dict)
                    normalized_citation = sum_citation - avg_citation_all_abstracts
                    journal_IF = float(reader_abs_citation_IF.loc[pmid]['journal_IF'])
                
                testf.write('\t'.join([sentence, sentence_pmid, items[2], '0', title,
                                   abstract, '2020', str(year), publicationType, str(sum_citation),
                                   str(normalized_citation), str(journal_IF), '0', items[4],
                                   str(title_coverage), str(abstract_coverage), items[3]])+'\n')

#                 testf.write('\t'.join([sentence, sentence_pmid, pmid, '0', title, abstract]+others)+'\n')
#                 print([sentence, sentence_pmid, pmid, 0, title, abstract]+others)
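The publication-year handling in the cell above can be read in isolation as follows: take the first four characters of `publish_time`, fall back to 0 when they are not numeric, and cap the result at 2020. This is a standalone restatement of that logic; the function name `parse_year` and its `latest` parameter are introduced here for illustration.

```python
def parse_year(publish_time, latest=2020):
    """Extract a year from a 'YYYY-MM-DD'-style string, capped at `latest`.

    Missing or malformed dates yield 0, mirroring the cell above.
    """
    year_text = publish_time[:4]
    year = int(year_text) if year_text.isnumeric() else 0
    return min(year, latest)


for value in ["2001-07-04", "", "2021-01-01"]:
    year = parse_year(value)
    print(repr(value), year, 2020 - year)  # last column is year_diff
```

Note that a missing date produces a `year_diff` of 2020, so downstream features that depend on article age may need special handling for those rows.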

In [17]:
rerank_output_path = "/data/shubo/deepsense/project/src_new/src_model_full_sentence/test_dataset_trec_covid_bm25"
ranks = list(range(1, 1001))*50
rank_100 = list(range(1, 101))*50
df_rerank = pd.DataFrame(ranks, columns=['rerank'])
df_rerank_100 = pd.DataFrame(rank_100, columns=['rerank'])
for q in ['question']: #'query', 'question', 'narrative'
    for idx in ['titab']: #'title', 'abstr', 
        with open(f"trec_covid/trec_eval/test/trecovid_bm25_proc_{q}_{idx}.txt") as f:
            qrels = []
            for row in f:
                qrels.append(row.strip().split())
            df_qrels_bm25 = pd.DataFrame(qrels, columns=['topic_id', 'iteration', 'cord_id', 'rank', 'score', 'team'])
            df_qrels_bm25[df_qrels_bm25['rank'].astype(int)<=100].to_csv(f"trec_covid/trec_eval/test/trecovid_bm25_proc_{q}_{idx}_100.txt", sep=' ', header=False, index=False)
            preds = np.load(f"{rerank_output_path}/trecovid_topics_proc_{q}_{idx}_for_rerank.csv.npy")
            df_rerank_score = pd.DataFrame(preds, columns=['rerank_score'])
            df = pd.concat([df_qrels_bm25, df_rerank_score], axis=1).sort_values(['topic_id', 'rerank_score'], ascending=[True, False]).reset_index()
            df_100 = pd.concat([df_qrels_bm25, df_rerank_score], axis=1)
            df_100 = df_100[df_100['rank'].astype(int)<=100].sort_values(['topic_id', 'rerank_score'], ascending=[True, False]).reset_index()
            df_qrels_rerank = pd.concat([df, df_rerank], axis=1)[['topic_id', 'iteration', 'cord_id', 'rerank', 'rerank_score', 'team']]
            df_qrels_rerank_100 = pd.concat([df_100, df_rerank_100], axis=1)[['topic_id', 'iteration', 'cord_id', 'rerank', 'rerank_score', 'team']]
            df_qrels_rerank.to_csv(f"trec_covid/trec_eval/test/trecovid_deepsense_rerank_proc_{q}_{idx}.txt", sep=' ', header=False, index=False)
            df_qrels_rerank_100.to_csv(f"trec_covid/trec_eval/test/trecovid_deepsense_rerank_proc_{q}_{idx}_100.txt", sep=' ', header=False, index=False)
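The cell above assigns new ranks by concatenating a fixed `1..1000` (or `1..100`) sequence repeated 50 times, which only lines up if every topic has exactly that many rows. It also sorts `topic_id` as a string, so topic 10 comes before topic 2. A possible alternative that avoids both assumptions is sketched below in plain Python, so that it stays self-contained; `rerank_rows` and the sample rows are hypothetical.

```python
from itertools import groupby


def rerank_rows(rows):
    """Sort rows by numeric topic, then by descending rerank_score,
    and number them from 1 within each topic."""
    ordered = sorted(rows, key=lambda r: (int(r["topic_id"]), -r["rerank_score"]))
    reranked = []
    for _, group in groupby(ordered, key=lambda r: r["topic_id"]):
        for position, row in enumerate(group, start=1):
            reranked.append({**row, "rerank": position})
    return reranked


# Hypothetical rows: topics of different sizes, given out of order.
rows = [
    {"topic_id": "10", "cord_id": "a", "rerank_score": 0.2},
    {"topic_id": "2", "cord_id": "b", "rerank_score": 0.9},
    {"topic_id": "10", "cord_id": "c", "rerank_score": 0.7},
    {"topic_id": "2", "cord_id": "d", "rerank_score": 0.1},
    {"topic_id": "2", "cord_id": "e", "rerank_score": 0.5},
]

reranked = rerank_rows(rows)
for row in reranked:
    print(row["topic_id"], row["cord_id"], row["rerank"], row["rerank_score"])
```

In pandas, the same effect could be obtained by sorting and then numbering rows within each `topic_id` group.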

In [19]:
df_top100 = pd.concat([df_100, df_rerank_100], axis=1)

In [21]:
df_top100_articles = df_meta[df_meta.cord_uid.isin(df_top100.cord_id.unique())]

In [30]:
(df_top100
 .merge(df_qrels_gs[['topic_id', 'cord_id', 'judgment']], how='left', left_on=['topic_id', 'cord_id'], right_on=['topic_id', 'cord_id'])
 .merge(df_topics[['num', 'question']], how='left', left_on='topic_id', right_on='num')
 .merge(df_top100_articles[['cord_uid', 'title', 'abstract']], how='left', left_on='cord_id', right_on='cord_uid'))

,index,topic_id,iteration,cord_id,rank,score,team,rerank_score,rerank,judgment,num,question,cord_uid,title,abstract
0,0,1,Q0,kgifmjvb,1,4.683900,Anserini,0.902880,1,0,1,what is the origin of COVID-19,kgifmjvb,Tracking the origin of early COVID-19 cases in...,The original coronavirus disease (COVID-19) ou...
1,1,1,Q0,wmfcey6f,2,4.683899,Anserini,0.895439,2,1,1,what is the origin of COVID-19,wmfcey6f,Tracking the origin of early COVID-19 cases in...,Abstract The original coronavirus disease (COV...
2,62,1,Q0,aw0kkvvl,63,3.861098,Anserini,0.861958,3,1,1,what is the origin of COVID-19,aw0kkvvl,The current understanding and potential therap...,The ongoing wreaking global outbreak of the no...
3,64,1,Q0,otal8v81,65,3.861096,Anserini,0.861958,4,1,1,what is the origin of COVID-19,otal8v81,The current understanding and potential therap...,The ongoing wreaking global outbreak of the no...
4,61,1,Q0,3zy5dgxz,62,3.861099,Anserini,0.861597,5,1,1,what is the origin of COVID-19,3zy5dgxz,The current understanding and potential therap...,Abstract The ongoing wreaking global outbreak ...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,8076,9,Q0,ca4c6b8a,77,5.165200,Anserini,0.131632,96,2,9,how has COVID-19 affected Canada,ca4c6b8a,Modeling risk of infectious diseases: a case o...,Background The novel coronavirus (2019-nCOV) o...
4996,8011,9,Q0,pbemaim9,12,6.051000,Anserini,0.125323,97,0,9,how has COVID-19 affected Canada,pbemaim9,Porcine teschovirus polioencephalomyelitis in ...,"Beginning in 2002, a small number of pig farms..."
4997,8090,9,Q0,5m1ch7ls,91,5.009300,Anserini,0.050081,98,0,9,how has COVID-19 affected Canada,5m1ch7ls,Vulnerability of Aboriginal health systems in ...,Abstract Climate change has been identified as...
4998,8042,9,Q0,7n8rrvrs,43,5.424600,Anserini,0.045338,99,1,9,how has COVID-19 affected Canada,7n8rrvrs,"Public health in Canada: Evolution, meaning an...",Chronic disease burden in Canada poses an immi...


In [31]:
(df_top100
 .merge(df_qrels_gs[['topic_id', 'cord_id', 'judgment']], how='left', left_on=['topic_id', 'cord_id'], right_on=['topic_id', 'cord_id'])
 .merge(df_topics[['num', 'question']], how='left', left_on='topic_id', right_on='num')
 .merge(df_top100_articles[['cord_uid', 'title', 'abstract']], how='left', left_on='cord_id', right_on='cord_uid')
 .to_csv('trec_covid/trecovid_bm25_top100_deepsense_rerank_for_manual_review_question.csv', index=False))

In [180]:
que_results = {'query':[], 'index':[], 'topic_id':[], 'top1':[], 'top20':[], 'top100':[]}
for q in ['query', 'question', 'narrative']:
    for idx in ['titab']: #'title', 'abstr', 
        with open(f"trec_covid/trec_eval/test/trecovid_deepsense_rerank_proc_{q}_{idx}.txt") as f:
            qrels = []
            for row in f:
                qrels.append(row.strip().split()[:5])
            df_qrels_bm25 = pd.DataFrame(qrels, columns=['topic_id', 'iteration', 'cord_id', 'rank', 'score'])
            for topid in range(1, 51):
                que_bm25 = df_qrels_bm25[(df_qrels_bm25.topic_id==str(topid))]
                gs = df_qrels_gs[(df_qrels_gs.topic_id==str(topid))]
                df_join = que_bm25[['cord_id', 'rank', 'score']].set_index('cord_id').join(gs[['cord_id', 'judgment']].set_index('cord_id'), how='inner')
                num_top1 = df_join[(df_join.judgment.isin(['1', '2']))&(df_join['rank'].astype('int')<=1)].shape[0]
                num_top20 = df_join[(df_join.judgment.isin(['1', '2']))&(df_join['rank'].astype('int')<=20)].shape[0]
                num_top100 = df_join[(df_join.judgment.isin(['1', '2']))&(df_join['rank'].astype('int')<=100)].shape[0]
#                 print(f"Topic {topid}: Top1: {num_top1}; Top20: {num_top20}; Top100: {num_top100}")
                que_results['query'].append(q)
                que_results['index'].append(idx)
                que_results['topic_id'].append(topid)
                que_results['top1'].append(num_top1)
                que_results['top20'].append(num_top20)
                que_results['top100'].append(num_top100)

In [181]:
pd.DataFrame(que_results).to_csv('trec_covid/trecovid_deepsense_rerank_proc_titab_results.csv', index=False)

cd trec_covid/trec_eval

./trec_eval -q -c -M1000 test/qrels-covid_d5_j0.5-5.txt test/trecovid_deepsense_rerank_query_titab.txt  > test/trecovid_eval_deepsense_rerank_query_titab.txt

./trec_eval -q -c -M1000 test/qrels-covid_d5_j0.5-5.txt test/trecovid_deepsense_rerank_question_titab.txt  > test/trecovid_eval_deepsense_rerank_question_titab.txt

./trec_eval -q -c -M1000 test/qrels-covid_d5_j0.5-5.txt test/trecovid_deepsense_rerank_narrative_titab.txt  > test/trecovid_eval_deepsense_rerank_narrative_titab.txt

## Processed queries

./trec_eval -q -c -M1000 test/qrels-covid_d5_j0.5-5.txt test/trecovid_deepsense_rerank_proc_query_titab.txt  > test/trecovid_eval_deepsense_rerank_proc_query_titab.txt

./trec_eval -q -c -M1000 test/qrels-covid_d5_j0.5-5.txt test/trecovid_deepsense_rerank_proc_question_titab.txt  > test/trecovid_eval_deepsense_rerank_proc_question_titab.txt

./trec_eval -q -c -M1000 test/qrels-covid_d5_j0.5-5.txt test/trecovid_deepsense_rerank_proc_narrative_titab.txt  > test/trecovid_eval_deepsense_rerank_proc_narrative_titab.txt

### Top 100

./trec_eval -q -c -M1000 test/qrels-covid_d5_j0.5-5.txt test/trecovid_bm25_proc_query_titab_100.txt > test/trecovid_eval_bm25_proc_query_titab_100.txt

./trec_eval -q -c -M1000 test/qrels-covid_d5_j0.5-5.txt test/trecovid_bm25_proc_question_titab_100.txt > test/trecovid_eval_bm25_proc_question_titab_100.txt

./trec_eval -q -c -M1000 test/qrels-covid_d5_j0.5-5.txt test/trecovid_bm25_proc_narrative_titab_100.txt > test/trecovid_eval_bm25_proc_narrative_titab_100.txt

Processed

./trec_eval -q -c -M1000 test/qrels-covid_d5_j0.5-5.txt test/trecovid_deepsense_rerank_proc_query_titab_100.txt  > test/trecovid_eval_deepsense_rerank_proc_query_titab_100.txt

./trec_eval -q -c -M1000 test/qrels-covid_d5_j0.5-5.txt test/trecovid_deepsense_rerank_proc_question_titab_100.txt  > test/trecovid_eval_deepsense_rerank_proc_question_titab_100.txt

./trec_eval -q -c -M1000 test/qrels-covid_d5_j0.5-5.txt test/trecovid_deepsense_rerank_proc_narrative_titab_100.txt  > test/trecovid_eval_deepsense_rerank_proc_narrative_titab_100.txt

In [3]:
df = pd.DataFrame()
for md in ['bm25', 'deepsense_rerank']:
    for dt in ['query', 'question', 'narrative', 'proc_query', 'proc_question', 'proc_narrative']:
        with open(f"trec_covid/trec_eval/test/trecovid_eval_{md}_{dt}_titab.txt") as f:
            results = []
            for row in f:
                results.append(row.strip().split())
            df_results = pd.DataFrame(results, columns=[f'metric_{md}_{dt}', f'topic_id_{md}_{dt}', f'score_{md}_{dt}'])
            df_results = df_results[df_results[f'metric_{md}_{dt}'].isin(['P_5', 'P_10', 'P_15', 'P_20', 'P_30', 'P_100', 'P_200', 'P_500', 'P_1000'])]
            df = pd.concat([df, df_results], axis=1)

In [4]:
df.to_csv('trec_covid/trecovid_topics_titab_results_all_R.csv')

In [5]:
df = pd.DataFrame()
for md in ['bm25', 'deepsense_rerank']:
    for dt in ['proc_query', 'proc_question', 'proc_narrative']:
        with open(f"trec_covid/trec_eval/test/trecovid_eval_{md}_{dt}_titab_100.txt") as f:
            results = []
            for row in f:
                results.append(row.strip().split())
            df_results = pd.DataFrame(results, columns=[f'metric_{md}_{dt}', f'topic_id_{md}_{dt}', f'score_{md}_{dt}'])
            df_results = df_results[df_results[f'metric_{md}_{dt}'].isin(['P_5', 'P_10', 'P_15', 'P_20', 'P_30', 'P_100', 'P_200', 'P_500', 'P_1000'])]
            df = pd.concat([df, df_results], axis=1)

In [6]:
df.to_csv('trec_covid/trecovid_topics_titab_results_all_100_R.csv')