In [None]:
!pip install transformers sentencepiece

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl")

In [None]:
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

In [None]:
prompt = """
These are examples of queries with sample relevant documents for each query. 
The query must be specific and detailed.

Example 1:
document: Monophyletic Relationship between Severe Acute Respiratory Syndrome Coronavirus and Group 2 Coronaviruses Although primary genomic analysis has revealed that severe acute respiratory syndrome coronavirus (SARS CoV) is a new type of coronavirus, the different protein trees published in previous reports have provided no conclusive evidence indicating the phylogenetic position of SARS CoV. To clarify the phylogenetic relationship between SARS CoV and other coronaviruses, we compiled a large data set composed of 7 concatenated protein sequences and performed comprehensive analyses, using the maximum-likelihood, Bayesian-inference, and maximum-parsimony methods. All resulting phylogenetic trees displayed an identical topology and supported the hypothesis that the relationship between SARS CoV and group 2 CoVs is monophyletic. Relationships among all major groups were well resolved and were supported by all statistical analyses.
query: what is the origin of COVID-19
 
Example 2:
document: Association between climate variables and global transmission oF SARS-CoV-2 Abstract In this study, we aimed at analyzing the associations between transmission of and deaths caused by SARS-CoV-2 and meteorological variables, such as average temperature, minimum temperature, maximum temperature, and precipitation. Two outcome measures were considered, with the first aiming to study SARS-CoV-2 infections and the second aiming to study COVID-19 mortality. Daily data as well as data on SARS-CoV-2 infections and COVID-19 mortality obtained between December 1, 2019 and March 28, 2020 were collected from weather stations around the world. The country's population density and time of exposure to the disease were used as control variables. Finally, a month dummy variable was added. Daily data by country were analyzed using the panel data model. An increase in the average daily temperature by one degree Fahrenheit reduced the number of cases by approximately 6.4 cases/day. There was a negative correlation between the average temperature per country and the number of cases of SARS-CoV-2 infections. This association remained strong even with the incorporation of additional variables and controls (maximum temperature, average temperature, minimum temperature, and precipitation) and fixed country effects. There was a positive correlation between precipitation and SARS-CoV-2 transmission. Countries with higher rainfall measurements showed an increase in disease transmission. For each average inch/day, there was an increase of 56.01 cases/day. COVID-19 mortality showed no significant association with temperature.
query: how does the coronavirus respond to changes in the weather
 
Example 3:
document: Cross-immunity between respiratory coronaviruses may limit COVID-19 fatalities Of the seven coronaviruses associated with disease in humans, SARS-CoV, MERS-CoV and SARS-CoV-2 cause considerable mortality but also share significant sequence homology, and potentially antigenic epitopes capable of inducing an immune response. The degree of similarity is such that perhaps prior exposure to one virus could confer partial immunity to another. Indeed, data suggests a considerable amount of cross-reactivity and recognition by the hosts immune response between different coronavirus infections. While the ongoing COVID-19 outbreak rapidly overwhelmed medical facilities of particularly Europe and North America, accounting for 78% of global deaths, only 8% of deaths have occurred in Asia where the outbreak originated. Interestingly, Asia and the Middle East have previously experienced multiple rounds of coronavirus infections, perhaps suggesting buildup of acquired immunity to the causative SARS-CoV-2 that underlies COVID-19. This article hypothesizes that a causative factor underlying such low morbidity in these regions is perhaps (at least in part) due to acquired immunity from multiple rounds of coronavirus infections and discusses the mechanisms and recent evidence to support such assertions. Further investigations of such phenomenon would allow us to examine strategies to confer protective immunity, perhaps aiding vaccine development.
query: will SARS-CoV2 infected people develop immunity? Is cross protection possible?
 

Example 4:
document: The activity of the HIV-1 IRES is stimulated by oxidative stress and controlled by a negative regulatory element Initiation of translation of the full-length messenger RNA of HIV-1, which generates the viral structural proteins and enzymes, is cap-dependent but can also use an internal ribosome entry site (IRES) located in the 5′ untranslated region. Our aim was to define, through a mutational analysis, regions of HIV-1 IRES that are important for its activity. A dual-luciferase reporter construct where the Renilla luciferase (Rluc) translation is cap-dependent while the firefly luciferase (Fluc) translation depends on HIV-1 IRES was used. The Fluc/Rluc ratio was measured in lysates of Jurkat T cells transfected with the dual-luciferase plasmid bearing either the wild-type or a mutated IRES. Deletions or mutations in three regions decreased the IRES activity but deletion or mutations of a stem-loop preceding the primer binding site increased the IRES activity. The wild-type IRES activity, but not that of an IRES with a mutated stem-loop, was increased when cells were treated with agents that induce oxidative stress. Such stress is known to be caused by HIV-1 infection and we propose that this stem-loop is involved in a switch that stimulates the IRES activity in cells infected with HIV-1, supporting the suggestion that the IRES activity is up-regulated in the course of HIV-1 replication cycle.
query:

"""

In [None]:
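# Tokenize the few-shot prompt; inputs longer than 1024 tokens are truncated.
input_ids = tokenizer(prompt, return_tensors="pt", max_length=1024, truncation=True).input_ids.to(device)

# Generate at most 32 tokens and decode them back to text.
outputs = model.generate(input_ids, max_length=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))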

query: what is the function of the internal ribosome entry site in hiv-1?


In [None]:
!pip install ir_datasets

In [None]:
import ir_datasets

# Load the BEIR version of TREC-COVID and open its document store for lookups by doc_id.
dataset = ir_datasets.load('beir/trec-covid')
docstore = dataset.docs_store()

# Keep only the judgments with relevance == 2.
qrels = [qrel for qrel in dataset.qrels_iter() if qrel.relevance == 2]
queries = list(dataset.queries_iter())

def generate_example(query, doc, example_id):
    """Format one few-shot example: the document (title + text) followed by its query."""
    return "Example %s:\ndocument: %s %s\nquery: %s\n" % (example_id, doc.title, doc.text, query)


In [None]:
prompt = """These are examples of queries with a sample relevant document for each query. 
The query must be specific and detailed. 

"""
for i, q in enumerate(queries[0:3]):
    # Use the first relevance-2 document judged for this query as its example.
    doc_id = next((d.doc_id for d in qrels if d.query_id == q.query_id), None)
    if doc_id is None:
        raise ValueError("no relevance-2 document found for query %s" % q.query_id)
    doc = docstore.get(doc_id)
    prompt = prompt + generate_example(q.text, doc, i + 1) + " \n"

In [None]:
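# Build one prompt per corpus document: the three few-shot examples followed by
# the new document as "Example 4" with an empty query for the model to complete.
prompt_docs = []
for d in dataset.docs_iter():
    doc_text = "%s %s" % (d.title, d.text)
    gen = "Example 4:\ndocument: %s\nquery:" % (doc_text)
    doc_prompt = prompt + "\n" + gen
    prompt_docs.append((d.doc_id, doc_prompt))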
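
The tokenizer call used for generation truncates from the right by default, so a long document can push the trailing `query:` cue past the 1024-token limit and the model never sees it. A minimal sketch of one way around this is to trim the document text before building the prompt. It assumes a whitespace word count is an acceptable rough proxy for tokens; `build_doc_prompt` and `max_doc_words` are hypothetical names that do not appear in the notebook.

```python
def build_doc_prompt(few_shot_prompt, title, text, max_doc_words=300, example_id=4):
    """Append a new document to the few-shot prompt, trimming the document
    itself so the final "query:" cue is always kept.

    max_doc_words is a hypothetical word budget, a crude stand-in for a token budget.
    """
    words = ("%s %s" % (title, text)).split()
    doc_text = " ".join(words[:max_doc_words])
    return "%s\nExample %s:\ndocument: %s\nquery:" % (few_shot_prompt, example_id, doc_text)


# Toy usage: a very long document is cut to 5 words, and the cue survives.
demo = build_doc_prompt("Example 1:\ndocument: d1\nquery: q1\n", "A title", "word " * 1000, max_doc_words=5)
print(demo)
```

A stricter variant would count tokens with the tokenizer itself instead of words; the word budget here only keeps the sketch free of model dependencies.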

In [None]:
print(prompt_docs[34][1])

These are examples of queries with a sample relevant document for each query. 
The query must be specific and detailed. 

Example 1:
document: Monophyletic Relationship between Severe Acute Respiratory Syndrome Coronavirus and Group 2 Coronaviruses Although primary genomic analysis has revealed that severe acute respiratory syndrome coronavirus (SARS CoV) is a new type of coronavirus, the different protein trees published in previous reports have provided no conclusive evidence indicating the phylogenetic position of SARS CoV. To clarify the phylogenetic relationship between SARS CoV and other coronaviruses, we compiled a large data set composed of 7 concatenated protein sequences and performed comprehensive analyses, using the maximum-likelihood, Bayesian-inference, and maximum-parsimony methods. All resulting phylogenetic trees displayed an identical topology and supported the hypothesis that the relationship between SARS CoV and group 2 CoVs is monophyletic. Relationships among all 

In [None]:
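# Output is tab-separated (doc_id, query). Mode "w" truncates an existing file;
# switch to "a" when resuming a partially completed run.
fw = open("dump.csv", "w", encoding="utf-8")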

In [None]:
from tqdm import tqdm  

In [None]:
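# Generate one query per document for documents 2332 through 49999.
for doc_id, doc_prompt in tqdm(prompt_docs[2332:50000]):
    # Prompts longer than 1024 tokens are truncated.
    input_ids = tokenizer(doc_prompt, return_tensors="pt", max_length=1024, truncation=True).input_ids.to(device)
    outputs = model.generate(input_ids, max_length=32)
    query = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # One tab-separated line per document: doc_id, generated query.
    fw.write("%s\t%s\n" % (doc_id, query))
    fw.flush()  # flush after every row so an interrupted run keeps what was written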

100%|██████████| 47668/47668 [12:43:21<00:00,  1.04it/s]


In [None]:
fw.close()
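
The generation loop writes one tab-separated `doc_id`/query pair per line, and the sample output earlier in the notebook shows the model echoing a `query:` prefix. A minimal sketch for reading those lines back and stripping that prefix follows. `load_generated_queries` is a hypothetical helper, and the two sample rows are made up for illustration rather than taken from the real dump.

```python
import io


def load_generated_queries(lines, prefix="query:"):
    """Parse tab-separated doc_id/query lines into a dict, dropping an echoed prefix."""
    result = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        doc_id, _, query = line.partition("\t")
        query = query.strip()
        if query.lower().startswith(prefix):
            query = query[len(prefix):].strip()
        result[doc_id] = query
    return result


# Illustrative rows only; an in-memory file stands in for dump.csv.
sample_rows = io.StringIO(
    "abc123\tquery: what is the function of the ires in hiv-1?\n"
    "xyz789\thow does weather affect covid\n"
)
queries_by_doc = load_generated_queries(sample_rows)
print(queries_by_doc)
```

For the real output, the same function can be passed an open file handle, for example `open("dump.csv", encoding="utf-8")`.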