In [2]:
import torch
from transformers import GPT2TokenizerFast, GPT2LMHeadModel

CoronaNlpGPT2 = '/home/ego/huggingface-models/finetuned/gpt2-lm-cord19-v2/CoronaNLPGPT2'

In [3]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

tokenizer = GPT2TokenizerFast.from_pretrained(CoronaNlpGPT2)
# GPT-2 defines no padding token, so reuse the end-of-sequence token for padding.
model = GPT2LMHeadModel.from_pretrained(CoronaNlpGPT2, pad_token_id=tokenizer.eos_token_id)
model = model.to(device)

In [22]:
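prompt = 'The coronavirus is related to'  # named `prompt` to avoid shadowing the built-in input()
ids = tokenizer.encode(prompt, return_tensors='pt').to(device)

# Greedy decoding: pick the single most likely token at each step.
greedy = model.generate(input_ids=ids, max_length=50)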
print(tokenizer.decode(greedy.tolist()[0], skip_special_tokens=True))

The coronavirus is related to the severe acute respiratory syndrome (SARS) coronavirus, which is a member of the Coronaviridae family. The SARS-CoV is a single-stranded, positive-sense RNA


In [23]:
# Beam search with 5 beams; forbid repeated 2-grams and stop early once enough beams have finished:
beam = model.generate(
    ids, max_length=50, num_beams=5,
    no_repeat_ngram_size=2, early_stopping=True
)
print(tokenizer.decode(beam.tolist()[0], skip_special_tokens=True))

The coronavirus is related to the severe acute respiratory syndrome (SARS) and Middle East Respiratory Syndrome (MERS) outbreaks in 2012 and 2013, respectively. The SARS-CoV was first identified in Guangdong Province,


In [24]:
beams = model.generate(
    input_ids=ids,
    max_length=50,
    num_beams=5,
    no_repeat_ngram_size=2,
    num_return_sequences=5,
    early_stopping=True,
)
# Use `seq` as the loop variable so the `beam` tensor from the previous cell is not overwritten.
for idx, seq in enumerate(beams):
    print('{}: {}\n'.format(
        idx, tokenizer.decode(seq.tolist(), skip_special_tokens=True)))

0: The coronavirus is related to the severe acute respiratory syndrome (SARS) and Middle East Respiratory Syndrome (MERS) outbreaks in 2012 and 2013, respectively. The SARS-CoV was first identified in Guangdong Province,

1: The coronavirus is related to the severe acute respiratory syndrome (SARS) and Middle East Respiratory Syndrome (MERS) outbreaks in 2012 and 2013, respectively. The SARS-CoV was first identified in Guangdong Province in

2: The coronavirus is related to the severe acute respiratory syndrome (SARS) and Middle East Respiratory Syndrome (MERS) outbreaks in 2012 and 2013, respectively. The SARS-CoV genome encodes a single open reading frame (

3: The coronavirus is related to the severe acute respiratory syndrome (SARS) and Middle East Respiratory Syndrome (MERS) outbreaks in 2012 and 2013, respectively. The SARS-CoV was first identified in Guangdong, China

4: The coronavirus is related to the severe acute respiratory syndrome (SARS) and Middle East Respiratory Syndro

In [25]:
torch.random.manual_seed(0)
# Pure sampling from the full vocabulary (top_k=0 disables top-k filtering):
sample = model.generate(ids, do_sample=True, max_length=50, top_k=0)
print(tokenizer.decode(sample.tolist()[0], skip_special_tokens=True))

The coronavirus is related to FMD, single stranded RNA viruses with a polyprotein (VP1) higher than 10-120 kDa. It is a lineage C novel-order virus with genome size 480.0 kb. The papill


In [26]:
torch.random.manual_seed(0)
# Lower the temperature (< 1) to sharpen the distribution, so low-probability tokens are sampled less often:
sample = model.generate(ids, do_sample=True, max_length=50, top_k=0, temperature=0.7)
print(tokenizer.decode(sample.tolist()[0], skip_special_tokens=True))

The coronavirus is related to many other respiratory and gastrointestinal diseases, such as the severe acute respiratory syndrome (SARS) and Middle East respiratory syndrome (MERS). The coronavirus is a member of the family Coronaviridae,
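
To make the effect of `temperature` concrete, here is a minimal sketch in plain Python. It assumes the usual formulation in which the logits are divided by the temperature before the softmax; the function name `softmax_with_temperature` and the toy logits are hypothetical and are not part of the notebook or of the transformers API.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing them by `temperature`."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate tokens (illustrative values only).
toy_logits = [2.0, 1.0, 0.1]

base = softmax_with_temperature(toy_logits, temperature=1.0)
sharp = softmax_with_temperature(toy_logits, temperature=0.7)

# With temperature < 1 the most likely token gains probability mass
# and the least likely token loses it.
print(base)
print(sharp)
```

A temperature below 1 concentrates probability on the top candidates, which is consistent with the more conservative text produced by the cell above.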


In [35]:
torch.random.manual_seed(0)
# Set top_k=50 to sample only from the 50 most likely next tokens:
sample = model.generate(ids, do_sample=True, max_length=50, top_k=50)
print(tokenizer.decode(sample.tolist()[0], skip_special_tokens=True))

The coronavirus is related to SARS, which causes severe respiratory illness in humans and animals. The SARS corona virus (SARS-CoV) belongs to the SARS Coronavirus genus in the Coronaviridae
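
The idea behind `top_k` can be sketched on a toy distribution. This is an illustrative approximation in plain Python, not the library's implementation; `top_k_filter` and the example probabilities are hypothetical names and values introduced here.

```python
def top_k_filter(probs, k):
    """Keep the k most probable entries, zero the rest, and renormalise."""
    if k <= 0:
        return list(probs)  # mirror the notebook's convention: 0 disables the filter
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = set(ranked[:k])
    total = sum(probs[i] for i in kept)
    return [p / total if i in kept else 0.0 for i, p in enumerate(probs)]

# Hypothetical next-token distribution over four candidates.
toy_probs = [0.5, 0.3, 0.15, 0.05]

# Only the two largest entries survive and are rescaled to sum to 1.
top2 = top_k_filter(toy_probs, k=2)
print(top2)
```

Sampling then draws from the filtered distribution, so tokens outside the top k can never be chosen.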


In [36]:
torch.random.manual_seed(0)
# Deactivate top_k and use nucleus (top-p) sampling: sample only from the smallest
# set of most likely tokens whose cumulative probability reaches 0.90:
sample = model.generate(ids, do_sample=True, max_length=50, top_p=0.90, top_k=0)
print(tokenizer.decode(sample.tolist()[0], skip_special_tokens=True))

The coronavirus is related to C pneumonia, type 3 diabetes and cryptococcal disease. However, the clinical presentation of the patient is often murine-like without significant gross or clinical symptoms. Interestingly, pneumonia can be partially treated with antibiotic.
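
Nucleus (top-p) filtering can be sketched the same way. The snippet below is a simplified illustration in plain Python under the assumption that the token which crosses the threshold is kept; `top_p_filter` and the toy probabilities are hypothetical, and the library's own implementation works on logits and may differ in edge cases.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most probable entries whose mass reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:  # the token that crosses the threshold is kept
            break
    kept_set = set(kept)
    total = sum(probs[i] for i in kept_set)
    return [q / total if i in kept_set else 0.0 for i, q in enumerate(probs)]

# Hypothetical next-token distribution over four candidates.
toy_probs = [0.5, 0.3, 0.15, 0.05]

# The three most likely tokens are needed to reach 0.90; the last one is dropped.
nucleus = top_p_filter(toy_probs, p=0.90)
print(nucleus)
```

Unlike top-k, the number of surviving tokens adapts to the shape of the distribution: a peaked distribution keeps few candidates, a flat one keeps many.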


In [37]:
torch.random.manual_seed(0)
samples = model.generate(
    input_ids=ids,
    do_sample=True,
    max_length=50,
    top_k=50,
    top_p=0.95,
    num_return_sequences=3,
)
print(f"Output:\n{100*'-'}")
for i, sample in enumerate(samples):
    print('{}: {}\n'.format(i, tokenizer.decode(sample.tolist(), skip_special_tokens=True)))

Output:
----------------------------------------------------------------------------------------------------
0: The coronavirus is related to SARS, which causes severe respiratory illness in humans and animals. The SARS-CoV replicates in the respiratory tract, infects cells, and causes acute respiratory distress syndrome (ARDS). The human coronav

1: The coronavirus is related to alphaviruses (A, B and C) and is transmitted through contact with faecal material. In addition, SARS-CoV (severe acute respiratory syndrome coronavirus) has been isolated

2: The coronavirus is related to a series of human-associated coronaviruses known as alpha-, beta-, gamma-, and delta-coronaviruses, as of December 2019, and its genome consists of three segments. CoVs infect



In [43]:
def gpt2chat(flag='quit'):
    """Interactive prompt loop; enter `flag` (default 'quit') to exit."""
    # Import the built-in explicitly in case `input` has been rebound in the notebook namespace.
    from builtins import input

    def generate_text(text: str) -> str:
        """Sample a continuation of `text` and return the decoded string."""
        input_ids = tokenizer.encode(text, return_tensors='pt')
        samples = model.generate(input_ids=input_ids.to(device),
                                 max_length=200,  # total length, prompt tokens included
                                 min_length=15,
                                 do_sample=True,
                                 temperature=0.7,
                                 top_k=50,
                                 top_p=0.95,
                                 repetition_penalty=1.1)
        generated = []
        for gen in samples:
            seq = tokenizer.decode(token_ids=gen.tolist(),
                                   skip_special_tokens=True)
            generated.append(seq)
        return " ".join(generated)

    while True:
        text = input('GPT2 prompt >>> ')
        if text.strip() == flag:
            break
        elif len(text.strip()) == 0:
            print('Prompt should not be empty 🤔')
        else:
            print(f"\n{'='*40} Generated 🤗 {'='*40}\n")
            print(f"\n\n\t{generate_text(text)}\n\n{'='*80}\n")

In [44]:
gpt2chat()

GPT2 prompt >>> The widespread of current exposure is to be able to make immediate policy recommendations on mitigation measures, depends on the following;




	The widespread of current exposure is to be able to make immediate policy recommendations on mitigation measures, depends on the following; (1) public health preparedness should focus on preventing and controlling the spread of the disease; (2) the use of emergency medical resources should be encouraged to minimize the risk for the spread of the disease; (3) the use of effective contact tracing methods should be carried out in order to identify contacts and obtain information regarding their exposure before entering into quarantine; and (4) social distancing measures should be taken to protect the population from the spread.

 = = = Discussion = = = 

 The results of this study provide valuable information for planning and implementing preventive measures for the prevention and control of a pandemic influenza. We found that a l

GPT2 prompt >>> Technology roadmap for diagnostics >>




	Technology roadmap for diagnostics >> What is the current state of the art for diagnostic testing?

 = = = Conclusion = = = 

 The rapid development of diagnostic assays has made it possible to test thousands of samples at once without a single specimen being needed. The advantages of rapid test formats include improved sensitivity, reduced sample volume and rapid turnaround time. The disadvantages of these formats include: (i) the use of multiple laboratories in multiple sites to perform multiple tests, (ii) sample preparation can take up to 1 h; (iii) the cost is relatively low, and (iv) the turnaround time is limited to one day, making it feasible for any test manufacturer to produce a panel. To date, no standardized assay is commercially available for all pathogens, and no commercially available diagnostics are available for bacteria or viruses. However, it should be noted that these limitations may still exist and that cur

GPT2 prompt >>> = = Improvements in testing; Data collection = =




	= = Improvements in testing; Data collection = = = 

 The surveillance systems and reporting system of the Ministry of Health, which was established in 2005, are continuously changing and need to be updated to reflect the changing trends and characteristics. This process is being implemented as a core component of public health research.

 Background: Laboratory-confirmed influenza infections (ILI) are a major public health problem worldwide and affect a substantial proportion of children and adults. We report on the incidence of ILI in a community-based study conducted in Kenya. Methods: We used an electronic surveillance system for ILI from May 2008 to December 2010. A total of 17 surveillance sites were visited. A sample of 50 ILI patients who were diagnosed with ILI at five sites were selected and evaluated by medical staffs. Results: Of the 53 samples tested positive for ILI, 31 samples (n = 473) had ILI-positiv