<a href="https://colab.research.google.com/github/VincentK1991/BERT_summarization_1/blob/master/Copy_of_BERTandGPT2_abstractive_summarization_Apr28_2020.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Generating a summary of COVID19 publication

This notebook introduces an NLP approach to summarizing and paraphrasing a scientific publication. The model is trained specifically on COVID-19 data released as part of the COVID-19 Open Research Dataset Challenge on [Kaggle](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge).

The strategy involves three steps:
1. extract sentences using BERT + clustering
2. extract keyword tokens from the extracted sentences using BERT fine-tuned for token classification
3. generate a paraphrase from the extracted keywords using GPT2 fine-tuned for abstractive summarization from keywords

The fine-tuning is already done, so we only load the model weights to perform the task. A GPU is not necessary, but it should speed things up, especially at the GPT2 sentence generation step.



In [1]:
#@title Setup Environment and helper function
#@markdown Pip install Huggingface transformers

#@markdown if cuda is available, set device = 'cuda'

#@markdown setup pytorch environment

!pip install transformers==2.6.0

import transformers
from transformers import GPT2Tokenizer, GPT2DoubleHeadsModel, DistilBertModel, DistilBertTokenizer, BertTokenizer, BertForTokenClassification
import numpy as np

import nltk
nltk.download('punkt')
from nltk import sent_tokenize
%tensorflow_version 1.x
from keras.preprocessing.sequence import pad_sequences

from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

import json
import matplotlib.pyplot as plt
import timeit
import torch
import textwrap
wrapper = textwrap.TextWrapper(width=70)
SEED = 1234
torch.manual_seed(SEED)


Using TensorFlow backend.


<torch._C.Generator at 0x7f1df8cf1fd0>

In [3]:
#@title change to directory

#@markdown change directory to where the models are kept
#@markdown make sure this dir contains sub dirs for fine-tuned BERT and GPT2 models

%cd '/content/drive/My Drive/Colab Notebooks/BERT_GPT2_summarization'

/content/drive/My Drive/Colab Notebooks/BERT_GPT2_summarization


# A bit about the model

1. For sentence extraction, we can use pre-trained DistilBERT as is; the smaller model also speeds up this step.

2. For token classification, we use bert-base-cased.

3. For the GPT2 model, we use GPT2DoubleHeadsModel. "Double heads" means the model is trained on both language modeling and multiple-choice sentence prediction, so it outputs two losses: the language modeling (LM) loss and the multiple-choice (MC) loss.

In [4]:
#@title Choose Model Config and Weights

#@markdown Distil version is fine for this task
BERT_pretrained_weights = 'distilbert-base-uncased' #@param ["distilbert-base-uncased", "bert-base-uncased", "bert-base-cased"] {allow-input: true}

#@markdown for token classification we used 
BERTforTokenClassification_config_directory = 'BERT_dir' #@param {type:"string"}
token_label_files = 'BERT_dir/POS2idx.json' #@param {type:"string"}

GPT2_config_directory = 'GPT2_dir' #@param {type:"string"}

print('which BERT pre-trained ? ',BERT_pretrained_weights)
print('where is BERT token classifier dir ? ',BERTforTokenClassification_config_directory)
print('where is GPT2 dir ? ',GPT2_config_directory)

which BERT pre-trained ?  distilbert-base-uncased
where is BERT token classifier dir ?  BERT_dir
where is GPT2 dir ?  GPT2_dir


In [5]:
#@title Load models and tokenizers
#@markdown the models are big, these may take a few mins, read [here](https://huggingface.co/transformers/serialization.html) for more information

print('----loading pre-trained BERT----')
BERT_pretrained = DistilBertModel.from_pretrained(
                  BERT_pretrained_weights)
tokenizer_pretrained = DistilBertTokenizer.from_pretrained(
                  BERT_pretrained_weights)
print('----loading token labels----')
with open(token_label_files, 'r') as fp:
    POS2idx = json.load(fp)

POS_values = list(POS2idx.keys())
print('----loading BERT token classifier----')
BERT_token_classifier = BertForTokenClassification.from_pretrained(
                      BERTforTokenClassification_config_directory)
tokenizer_token_classifier = BertTokenizer.from_pretrained(
                      BERTforTokenClassification_config_directory)
#BERT_token_classifier.load_state_dict(torch.load(BERTforTokenClassification_finetuned_weights))
print('----loading GPT2 summary generator----')
tokenizer_GPT2 = GPT2Tokenizer.from_pretrained(
                  GPT2_config_directory)
special_tokens = {'bos_token':'<|startoftext|>','eos_token':'<|endoftext|>','pad_token':'<pad>','additional_special_tokens':['<|keyword|>','<|summarize|>']}
tokenizer_GPT2.add_special_tokens(special_tokens)
GPT2_generator = GPT2DoubleHeadsModel.from_pretrained(
                  GPT2_config_directory)

----loading pre-trained BERT----




----loading token labels----
----loading BERT token classifier----
----loading GPT2 summary generator----


In [6]:
#@title use GPU?

#@markdown check the boxes to choose which models should run on the GPU

use_GPU_BERT_pre_trained = False #@param {type:"boolean"}
use_GPU_BERT_token_classifier = False #@param {type:"boolean"}
use_GPU_GPT_generator = True #@param {type:"boolean"}

if torch.cuda.is_available():
  print('cuda is available')
  device = 'cuda'
  print('device is set to cuda')
if not torch.cuda.is_available():
  print('cuda is not available')
  device = 'cpu'
  print('device is set to cpu')
  use_GPU_BERT_pre_trained = False
  use_GPU_BERT_token_classifier = False
  use_GPU_GPT_generator = False

print(' ')
print('use GPU for pre-trained BERT?' ,use_GPU_BERT_pre_trained)
print('use GPU for BERT token classifier ?' ,use_GPU_BERT_token_classifier)
print('use GPU for GPT2?' ,use_GPU_GPT_generator)

cuda is available
device is set to cuda
 
use GPU for pre-trained BERT? False
use GPU for BERT token classifier ? False
use GPU for GPT2? True


In [34]:
#@title Main text file

#@markdown indicate the text file to be summarized
use_input_text = False

input_file = 'directory/file.txt' #@param {type:"string"}
max_len = 500 #@param {type:"integer",max:512}

#@markdown or copy paste your input here and check the box
use_input_text = True #@param {type:"boolean"}
input_text = "'Two months after it was firstly reported, the novel coronavirus disease COVID-19 has already spread worldwide. However, the vast majority of reported infections have occurred in China. To assess the effect of early travel restrictions adopted by the health authorities in China, we have implemented an epidemic metapopulation model that is fed with mobility data corresponding to 2019 and 2020. This allows to compare two radically different scenarios, one with no travel restrictions and another in which mobility is reduced by a travel ban. Our findings indicate that i) travel restrictions are an effective measure in the short term, however, ii) they are ineffective when it comes to completely eliminate the disease. The latter is due to the impossibility of removing the risk of seeding the disease to other regions. Our study also highlights the importance of developing more realistic models of behavioral changes when a disease outbreak is unfolding.'" #@param {type:"string"}

if not use_input_text:
  # open the txt file that is included
  with open(input_file, 'r') as file:
    input_text = file.read().replace('\n', '')

# split text to sentences
paragraph_split = sent_tokenize(input_text)

print('input text has',len(paragraph_split) ,'sentences.')

print('tokenizing sentences')

input_tokens = []
for i in paragraph_split:
  input_tokens.append(tokenizer_pretrained.encode(i, 
                              add_special_tokens=True))
temp = []
for i in input_tokens:
  temp.append(len(i))
if np.max(temp) > max_len:
  raise ValueError('sentence longer than the max_len')
if np.max(temp) > 512:
  print('warning: sentence longer than 512')
  print('suggest to change max_len to 512, the remainder will be truncated')
input_ids = pad_sequences(input_tokens, 
                          maxlen=max_len, dtype="long", 
                          value=0, 
                          truncating="post", 
                          padding="post")

print('creating attention masks')

attention_masks = []
for sent in input_ids:
  att_mask = [int(token_id > 0) for token_id in sent]  # create a list of 0 and 1.
  attention_masks.append(att_mask)  # basically attention_masks is a list of list

input_ids = torch.tensor(input_ids)  
attention_mask = torch.tensor(attention_masks)

input text has 7 sentences.
tokenizing sentences
creating attention masks


# More on unsupervised BERT embedding

Similar to word2vec, we use BERT to convert English sentences into vectors that capture semantic relationships (i.e. the distance between vectors represents semantic dissimilarity). Unlike word2vec, BERT is not limited to a single word in isolation: it encodes a whole sequence (up to 512 tokens), so we can obtain semantic information about a long sequence.

To get sentence-level information, we take the last-layer hidden representation of the token at the head of the sentence (in BERT this is [CLS]).

We then run k-means clustering on these sentence vectors to group the semantic information into k clusters. The assumption is that each cluster represents one related set of ideas expressed in the text. For each cluster, we extract the sentence closest to the cluster center as the representative of that cluster's meaning. Together, these sentences form *the extractive summary*.
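The clustering and selection step can be sketched in plain Python on toy data. This is only an illustration: the notebook itself uses scikit-learn's `KMeans` and `NearestNeighbors`, while the hand-rolled `kmeans`, the helper `extract_indices`, the first-k initialisation, and the 2-D `toy_embeddings` below are all assumptions made for the sake of a small example.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids (toy choice)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: distance(p, centroids[c]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its assigned points.
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids

def extract_indices(points, centroids):
    """Index of the point closest to each centroid, returned in document order."""
    picked = {min(range(len(points)), key=lambda i: distance(points[i], c))
              for c in centroids}
    return sorted(picked)

# Toy 2-D "sentence embeddings", one per sentence, in document order.
toy_embeddings = [(0.0, 0.0), (5.0, 5.0), (9.0, 0.0), (0.4, 0.0),
                  (0.2, 0.1), (5.6, 5.0), (5.3, 5.2)]
toy_centroids = kmeans(toy_embeddings, k=3)
selected = extract_indices(toy_embeddings, toy_centroids)
print(selected)
```

Sorting the selected indices keeps the extracted sentences in their original order, which is what makes the extractive summary read like the source text.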

In [69]:
#@title Extracting parameters

#@markdown make sure that the number_extract < number of sentences in input text
number_extract = 6 #@param {type:"slider", min:1, max:20, step:1}

if use_GPU_BERT_pre_trained:
  input_ids = input_ids.to(device)
  BERT_pretrained = BERT_pretrained.to(device)
  attention_mask = attention_mask.to(device)

if not use_GPU_BERT_pre_trained:
  input_ids = input_ids.to('cpu')
  BERT_pretrained = BERT_pretrained.to('cpu')
  attention_mask = attention_mask.to('cpu')

with torch.no_grad():
  last_hidden_states = BERT_pretrained(input_ids, 
                             attention_mask=attention_mask)

sentence_features = last_hidden_states[0][:,0,:].detach().cpu().numpy()

print('performing k-medoid clustering with '
        ,number_extract,' clusters')

kmeans = KMeans(n_clusters=number_extract, 
                random_state=0).fit(sentence_features)
cluster_center = kmeans.cluster_centers_
nbrs = NearestNeighbors(n_neighbors= 1, 
                        algorithm='brute').fit(sentence_features)
distances, indices = nbrs.kneighbors(
                  cluster_center.reshape(number_extract,-1))


indices = np.sort(indices.reshape(1,-1))
topic_answer = []
# for i in range(len(indices)):
#   topic_i = []
#   for j in indices[i]:
#     topic_i.append(paragraph_split[j])
#   topic_answer.append(topic_i)

for i in indices[0]:
  topic_answer.append(paragraph_split[i])

print('result:')

print('the ',number_extract,' extracted sentences are')
for i in topic_answer:
  print(i)

topic_answer_string = ''
for topic in topic_answer:
  topic_answer_string = topic_answer_string + ' '+ topic

performing k-medoid clustering with  6  clusters
result:
the  6  extracted sentences are
'Two months after it was firstly reported, the novel coronavirus disease COVID-19 has already spread worldwide.
However, the vast majority of reported infections have occurred in China.
To assess the effect of early travel restrictions adopted by the health authorities in China, we have implemented an epidemic metapopulation model that is fed with mobility data corresponding to 2019 and 2020.
This allows to compare two radically different scenarios, one with no travel restrictions and another in which mobility is reduced by a travel ban.
The latter is due to the impossibility of removing the risk of seeding the disease to other regions.
Our study also highlights the importance of developing more realistic models of behavioral changes when a disease outbreak is unfolding.'


In [70]:
#@title Read the Extracted summary
wrapper.wrap(topic_answer_string)

[" 'Two months after it was firstly reported, the novel coronavirus",
 'disease COVID-19 has already spread worldwide. However, the vast',
 'majority of reported infections have occurred in China. To assess the',
 'effect of early travel restrictions adopted by the health authorities',
 'in China, we have implemented an epidemic metapopulation model that is',
 'fed with mobility data corresponding to 2019 and 2020. This allows to',
 'compare two radically different scenarios, one with no travel',
 'restrictions and another in which mobility is reduced by a travel ban.',
 'The latter is due to the impossibility of removing the risk of seeding',
 'the disease to other regions. Our study also highlights the importance',
 'of developing more realistic models of behavioral changes when a',
 "disease outbreak is unfolding.'"]

# More on fine-tuning the token classification

One approach to abstractive summarization is to break the extracted summary down further to the keyword level, and train a generative model to generate new sentences from the keywords.

To get the keywords, we first use BERT fine-tuned for part-of-speech tagging. We then keep the noun and verb tokens from the sentences and discard other parts of speech, such as adjectives, adverbs, determiners, etc.

The transfer learning of BERT follows an approach adapted from this [blogpost](https://www.depends-on-the-definition.com/named-entity-recognition-with-bert/). The training dataset is obtained from [Kaggle](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus). 

After training BERT for token classification on this dataset for 3 epochs, the validation accuracy is 0.99 and F1-score is 0.93.
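The keyword selection itself is a small filtering step once the tags are predicted. The sketch below shows the idea on hand-written data: the tokens and tags in `toy_tokens` and `toy_tags` are hypothetical stand-ins for the tokenizer and classifier outputs, and the helper names are made up for this example. WordPiece continuation pieces (prefixed with `##`) are glued back onto the previous token, and only words whose tag is a noun or verb tag are kept.

```python
KEEP_TAGS = {"NN", "NNP", "NNPS", "NNS", "VBD", "VB", "VBZ", "VBP"}

def merge_wordpieces(tokens, tags):
    """Glue '##' continuation pieces onto the previous token, keeping the first piece's tag."""
    words, word_tags = [], []
    for token, tag in zip(tokens, tags):
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
            word_tags.append(tag)
    return words, word_tags

def pick_keywords(tokens, tags):
    """Keep only the merged words whose tag is in KEEP_TAGS."""
    words, word_tags = merge_wordpieces(tokens, tags)
    return [w for w, t in zip(words, word_tags) if t in KEEP_TAGS]

# Hypothetical tokenizer output and predicted tags for a short phrase.
toy_tokens = ["the", "meta", "##pop", "##ulation", "model", "is", "fed", "quickly"]
toy_tags = ["DT", "NN", "NN", "NN", "NN", "VBZ", "VBN", "RB"]
keywords = pick_keywords(toy_tokens, toy_tags)
print(keywords)
```

Note that a past participle such as "fed" (tag `VBN`) is dropped here because `VBN` is not in the list of kept tags.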


In [72]:
#@title Keyword extraction

list_to_pick = ['NN','NNP','NNPS','NNS','VBD','VB','VBZ','VBP']

tokenized_sentence = tokenizer_token_classifier.encode(
                      topic_answer_string)
input_ids2 = torch.tensor([tokenized_sentence[:510]])

if use_GPU_BERT_token_classifier:
  BERT_token_classifier = BERT_token_classifier.to(device)
  input_ids2 = input_ids2.to(device)

if not use_GPU_BERT_token_classifier:
  BERT_token_classifier = BERT_token_classifier.to('cpu')
  input_ids2 = input_ids2.to('cpu')

with torch.no_grad():
  output2 = BERT_token_classifier(input_ids2)
label_indices = np.argmax(output2[0].to('cpu').numpy(), axis=2)

list_keywords = []

tokens = tokenizer_token_classifier.convert_ids_to_tokens(
                        input_ids2.to('cpu').numpy()[0])
new_tokens, new_labels = [], []
for token, label_idx in zip(tokens, label_indices[0]):
    if token.startswith("##"):
        new_tokens[-1] = new_tokens[-1] + token[2:]
    else:
        new_labels.append(POS_values[label_idx])
        new_tokens.append(token)
for token, label in zip(new_tokens, new_labels):
    if label in list_to_pick:
      list_keywords.append(token)

print('finished keyword extraction ...')
print('the keywords are')

list_keywords = [i for i in list_keywords if i not in ['[CLS]','[SEP]','?','/','-','.','_','!','@','[',']']]
list_keywords

list_keywords_str = ' '.join(list_keywords)
wrapper.wrap(list_keywords_str)

finished keyword extraction ...
the keywords are


["' months was coronavirus disease covid has worldwide majority",
 'infections have china assess effect travel restrictions health',
 'authorities china have epidemic metapopulation model is mobility data',
 'allows compare scenarios travel restrictions mobility is travel ban is',
 'impossibility risk disease regions study highlights importance models',
 'changes disease outbreak is']

# More on OpenAI GPT-2

Unlike BERT, which is a bi-directional encoder, OpenAI GPT-2 is an auto-regressive decoder. (BERT stands for Bidirectional Encoder Representations from Transformers, and GPT stands for Generative Pretrained Transformer.) The illustration [here](http://jalammar.github.io/illustrated-gpt2/) is helpful. We will use a GPT2 that is specifically fine-tuned for generating a summary from keywords.

But first, how is the training done?

We use the GPT2DoubleHeadsModel, which has 2 heads: one for causal language modeling (LM), and the other for multiple-choice (MC) answering. So for training we have 2 losses to optimize: the LM loss and the MC loss.

The causal language modeling head takes the input up to the current token and outputs the predicted next token.

The rationale is that optimizing two losses forces the model to learn both the local context used to generate the next token and the global semantic meaning needed to answer the multiple-choice question.



---

Here is an illustration of what the training looks like. We create a special token '<|summarize|>' that tells the model to start summarizing using the information from the tokens up to that point.

The training set looks like this:

`<|startoftext|> k_1, k_2, k_3, ..., k_n <|summarize|> gold summary ... <|endoftext|> <pad> <pad> <pad> ...`

where $k_n$ is the $n^{th}$ keyword of the summary;

and `<pad>` is a token for padding the sequences in the dataset to the same length.

We also have a distractor that contains the same keywords but a wrong summary:

`<|startoftext|> k_1, k_2, k_3, ..., k_n <|summarize|> distractor ... <|endoftext|> <pad> <pad> <pad> ...`

The LM loss is the cross-entropy loss computed on the sequence generated after the token <|summarize|> compared to the gold summary.

The MC loss is the cross-entropy loss of classifying the gold summary among distractors.
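To make the layout concrete, here is a minimal sketch of how a gold example and a distractor could be assembled, plus the multiple-choice cross-entropy over per-candidate scores. It is an illustration under assumptions, not the training code: `build_sequence`, `mc_loss`, the whitespace "tokenizer", the example texts, and the scores `[2.0, 0.0]` are all hypothetical.

```python
import math

def build_sequence(keywords, summary, max_len):
    """Lay out one example as a padded list of string tokens (whitespace-split toy tokenizer)."""
    tokens = (["<|startoftext|>"] + keywords + ["<|summarize|>"]
              + summary.split() + ["<|endoftext|>"])
    return tokens + ["<pad>"] * (max_len - len(tokens))

def mc_loss(scores, gold_index):
    """Cross-entropy of picking the gold candidate, given one score per candidate."""
    log_norm = math.log(sum(math.exp(s) for s in scores))
    return log_norm - scores[gold_index]

toy_keywords = ["travel", "ban", "disease"]
gold = build_sequence(toy_keywords, "a travel ban slows the disease", max_len=12)
distractor = build_sequence(toy_keywords, "viruses cause demyelination", max_len=12)

# Hypothetical scores from the multiple-choice head: candidate 0 is the gold one.
loss = mc_loss([2.0, 0.0], gold_index=0)
print(gold)
print(distractor)
print(round(loss, 4))
```

Both candidates share the same keyword prefix, so the only way for the model to lower the MC loss is to judge whether the summary after `<|summarize|>` matches the keywords.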

---

## training of the model

We trained the model on one Tesla P100-PCIE GPU for 2 epochs, which took over 8 hours of GPU time. The batch size is 1. To compensate for such a small batch, we accumulate gradients for 5 iterations before each gradient descent step.
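Gradient accumulation can be illustrated without any deep learning library. The sketch below fits a single weight `w` in `y = w * x` with squared error, processes one example at a time (batch size 1), and only updates `w` after every 5 examples. The toy data, the learning rate, and the choice to average the accumulated gradient are assumptions for this example; the actual training loop is not shown in this notebook.

```python
def grad_single(w, x, y):
    """Gradient of the squared error (w*x - y)**2 with respect to w."""
    return 2.0 * x * (w * x - y)

def train_with_accumulation(data, w=0.0, lr=0.01, accumulation_steps=5):
    accumulated = 0.0
    for step, (x, y) in enumerate(data, start=1):
        # Batch size 1: one example per forward/backward pass.
        accumulated += grad_single(w, x, y) / accumulation_steps
        if step % accumulation_steps == 0:
            w -= lr * accumulated  # one update per accumulation_steps examples
            accumulated = 0.0
    return w

toy_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0), (5.0, 10.0)]
w_after = train_with_accumulation(toy_data)
print(w_after)
```

With averaging, the single update equals what a true mini-batch of 5 examples would have produced, while only one example ever has to fit in memory at a time.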

In [0]:
list_keywords_str2 = 'Pathogenesis of Virus-Induced DemyelinationPublisher Summary Demyelination is component diseases humans are sclerosing panencephalitis SSPE leukoencephalopathy PML are number virus infections animals involve demyelination serve models demyelinating diseases diseases viruses have be demyelination situations demyelinating disease chapter reviews architecture organization CNS considers is interaction viruses CNS cells discusses immunology CNS differs aspects rest body models demyelination have Viruses disease have features include RNA viruses viruses chapter attempts summarize factors demyelination their features mechanisms'

In [0]:
title = 'A data-driven assessment of early travel restrictions related to the spreading of the novel COVID-19 within mainland China '

In [77]:
#@title GPT2 input preparation

GPT2_input = tokenizer_GPT2.encode(
      '<|startoftext|> ' +title + list_keywords_str + ' <|summarize|> ')
GPT2_input_torch = torch.tensor(GPT2_input, dtype=torch.long)

print("the keyword input :")
wrapper.wrap(tokenizer_GPT2.decode(GPT2_input_torch))

the keyword input :


['<|startoftext|>  A data-driven assessment of early travel restrictions',
 'related to the spreading of the novel COVID-19 within mainland',
 "China'months was coronavirus disease covid has worldwide majority",
 'infections have china assess effect travel restrictions health',
 'authorities china have epidemic metapopulation model is mobility data',
 'allows compare scenarios travel restrictions mobility is travel ban is',
 'impossibility risk disease regions study highlights importance models',
 'changes disease outbreak is <|summarize|>']

# More on Sentence generation

To generate a new sentence, we pass the input to the model; the output is a score (logit) for every token in the vocabulary.

To turn the output into a token, we apply a softmax to the scores to get a probability distribution, then choose a token from it. That gives us the word for that time step. To get a sequence of tokens, we repeat this step, appending the chosen token to the input each time.
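The loop described above can be sketched with a toy stand-in for the model. Everything here is hypothetical: `toy_logits` is a lookup table that depends only on the last token (a real model conditions on the whole sequence), `VOCAB` has four entries, and this version simply takes the most likely token at each step.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

VOCAB = ["travel", "ban", "works", "<|endoftext|>"]

def toy_logits(sequence):
    """Hypothetical stand-in for the model: scores depend only on the last token."""
    table = {
        "<|summarize|>": [3.0, 0.5, 0.1, 0.0],
        "travel": [0.0, 3.0, 0.5, 0.1],
        "ban": [0.0, 0.1, 3.0, 0.5],
        "works": [0.0, 0.1, 0.2, 3.0],
    }
    return table[sequence[-1]]

def generate(prompt, max_new_tokens=10):
    sequence = list(prompt)
    for _ in range(max_new_tokens):
        probs = softmax(toy_logits(sequence))
        next_token = VOCAB[probs.index(max(probs))]
        sequence.append(next_token)  # feed the output back in as input
        if next_token == "<|endoftext|>":
            break
    return sequence

generated = generate(["<|summarize|>"])
print(generated)
```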

There are many ways to generate a long sentence. One is greedy search, which always chooses the highest-probability next word given the previous words.

## greedy search

This is not always the best idea, because the correct word could be hiding behind a low-probability word. By rejecting the low-probability word, greedy search also rejects the correct word that comes after it. Note that this method is deterministic.
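A two-step toy example shows how greedy search can miss the more probable sequence. The probability tables `STEP_1` and `STEP_2` are invented for illustration: the first word "the" is locally the most likely, but the best two-word continuation starts with the less likely "travel".

```python
# Hypothetical next-token probabilities for a two-step continuation.
STEP_1 = {"the": 0.6, "travel": 0.4}
STEP_2 = {
    "the": {"model": 0.3, "study": 0.3, "data": 0.4},
    "travel": {"ban": 0.9, "data": 0.1},
}

def greedy_path():
    """Pick the most likely word at each step, one step at a time."""
    first = max(STEP_1, key=STEP_1.get)
    second = max(STEP_2[first], key=STEP_2[first].get)
    return (first, second), STEP_1[first] * STEP_2[first][second]

def best_path():
    """Score every two-word sequence and return the most likely one."""
    candidates = {
        (first, second): p1 * p2
        for first, p1 in STEP_1.items()
        for second, p2 in STEP_2[first].items()
    }
    best = max(candidates, key=candidates.get)
    return best, candidates[best]

greedy_words, greedy_prob = greedy_path()
best_words, best_prob = best_path()
print(greedy_words, greedy_prob)
print(best_words, best_prob)
```

Exhaustive scoring is only feasible for a toy vocabulary like this one; it is used here purely to expose the gap between the greedy choice and the best sequence.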

## top-k sampling/ top-p sampling

These methods sample the next word according to its conditional probability given the previous words, so generation is no longer deterministic.

- Temperature is a scaling factor (always a positive real number) applied to the logits before the softmax. A higher temperature (>1) flattens the distribution, bringing high and low probabilities closer together; the resulting word sequence appears more random (and sounds more creative). A low temperature (<1) sharpens the distribution, raising the probability of likely words and lowering that of unlikely words, which makes the result more deterministic.
- top-k = the number of top word candidates to consider when sampling.
For example, top-k = 5 considers only the 5 most likely words when sampling. A higher top-k value admits more low-probability words, so there is some chance that these words will be used.

- top-p = top-p sampling chooses the top n word candidates such that their cumulative probability exceeds p. For example, if we set p = 0.5, it might choose the top 5 words if their probabilities sum to more than 0.5. In the next round it might choose 6 different words, and so on. This makes the sampling more dynamic, since n can vary depending on the cumulative probability.

- using top-k and top-p sampling together lets us cap n at <= k. This limits how many words we consider when all the words have low probability.
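The three knobs can be sketched together in plain Python. This is a simplified illustration, not the Hugging Face implementation: the helper names are made up, the top-p cut-off is applied to the probabilities before renormalising within the top-k set, and the logits and probabilities in the usage lines are toy values.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by the temperature."""
    scaled = [v / temperature for v in logits]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def filter_top_k_top_p(probs, top_k, top_p):
    """Indices kept after top-k, then top-p, filtering (most likely first)."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept

def sample_next(logits, temperature, top_k, top_p, rng):
    """Sample one token index from the filtered distribution."""
    probs = softmax(logits, temperature)
    kept = filter_top_k_top_p(probs, top_k, top_p)
    weights = [probs[i] for i in kept]  # rng.choices normalises the weights
    return rng.choices(kept, weights=weights, k=1)[0]

toy_probs = [0.5, 0.3, 0.15, 0.05]
kept_both = filter_top_k_top_p(toy_probs, top_k=3, top_p=0.7)
kept_k_only = filter_top_k_top_p(toy_probs, top_k=1, top_p=0.9)
sharp = softmax([2.0, 1.0, 0.0], temperature=0.5)
flat = softmax([2.0, 1.0, 0.0], temperature=2.0)
choice = sample_next([2.0, 1.0, 0.0, -1.0], 1.0, 3, 0.7, random.Random(0))
print(kept_both, kept_k_only, choice)
```

In the first usage line top-p stops after two candidates even though top-k would allow three; in the second, top-k caps the set at one candidate before top-p is reached.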

In [103]:
#@title GPT2 paraphrase generation

#@markdown this step may take a few mins without GPU

temperature =  1#@param {type:"number"}
greedy_search = False #@param {type:"boolean"}
top_k =  50#@param {type:"integer",min:1}
top_p = 0.8 #@param {type:"number",max:1}
max_length = 200 #@param {type:"integer",min:1}

min_length= 20 #@param {type:"integer",min:1}
num_return_sequences=3 #@param {type:"integer",min:1}

if use_GPU_GPT_generator:
  GPT2_generator = GPT2_generator.to(device)
  GPT2_input_torch = GPT2_input_torch.to(device)

do_sample = not greedy_search
if do_sample == False:
  num_return_sequences = 1
  
sampling_output = GPT2_generator.generate(
      input_ids=GPT2_input_torch.unsqueeze(0),
      max_length=max_length + len(GPT2_input_torch),
      min_length = min_length + len(GPT2_input_torch),
      temperature=temperature,
      # decoder_start_token_id is omitted: it only applies to encoder-decoder models, and GPT-2 is decoder-only
      top_k=top_k,
      top_p=top_p,
      do_sample=do_sample,
      num_return_sequences=num_return_sequences, 
      no_repeat_ngram_size=3)

print('finish generating')

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


finish generating


In [106]:
#@title GPT2 generated output

which_output = 2 #@param {type:"slider", min:0, max:10, step:1}
wrapper.wrap(tokenizer_GPT2.decode(
    sampling_output[which_output,len(GPT2_input_torch):], 
    skip_special_tokens=True)[:5000])

[' Abstract In order to assess the impact of travel restrictions, global',
 'health authorities, including the affected china, have established an',
 'epidemic metaperopulation. A multi-species model is developed and',
 'applied to the mobility of the data, which allows us to compare the',
 'scenarios of travel restriction, mobility, and is safe. The rapid',
 'travel ban that is proposed is not without a great obstacle to the',
 'disease regions of this study. Here, we highlight the importance of',
 'multiple models of the epidemiological changes in disease outbreak and',
 'is presented in order to compare in our scenarios with travel',
 'restrictions and mobility. The epidemiological mobility is critical',
 'for predicting travel ban, which is feasible. It is difficult to risk',
 'a disease in these regions. In this study, we also highlights the',
 'importance and potential of multiple and distinct models of regional',
 'and international changes in the disease outbreak, which it is w

In [0]:
gold_label = 'Two months after it was firstly reported, the novel coronavirus disease COVID-19 has already spread worldwide. However, the vast majority of reported infections have occurred in China. To assess the effect of early travel restrictions adopted by the health authorities in China, we have implemented an epidemic metapopulation model that is fed with mobility data corresponding to 2019 and 2020. This allows to compare two radically different scenarios, one with no travel restrictions and another in which mobility is reduced by a travel ban. Our findings indicate that i) travel restrictions are an effective measure in the short term, however, ii) they are ineffective when it comes to completely eliminate the disease. The latter is due to the impossibility of removing the risk of seeding the disease to other regions. Our study also highlights the importance of developing more realistic models of behavioral changes when a disease outbreak is unfolding.'

In [81]:
wrapper.wrap(title + gold_label)

['A data-driven assessment of early travel restrictions related to the',
 'spreading of the novel COVID-19 within mainland China Two months after',
 'it was firstly reported, the novel coronavirus disease COVID-19 has',
 'already spread worldwide. However, the vast majority of reported',
 'infections have occurred in China. To assess the effect of early',
 'travel restrictions adopted by the health authorities in China, we',
 'have implemented an epidemic metapopulation model that is fed with',
 'mobility data corresponding to 2019 and 2020. This allows to compare',
 'two radically different scenarios, one with no travel restrictions and',
 'another in which mobility is reduced by a travel ban. Our findings',
 'indicate that i) travel restrictions are an effective measure in the',
 'short term, however, ii) they are ineffective when it comes to',
 'completely eliminate the disease. The latter is due to the',
 'impossibility of removing the risk of seeding the disease to other',
 'regio

# conclusion

1. The extractive method works really well. From eyeballing samples, the result is essentially human-level, mainly because we extract human-written text without modification. So the human touch is still there.

  - It is most useful when we do not have sample data to train an abstractive summarization model. In domain-specific or highly technical text, the amount of training data may be limited. But since this method is unsupervised, it should not be affected by this limitation as long as the embedding works well.

  - However, it is limited in that it cannot generate a paraphrase. Sometimes this leads to an awkward or clumsy result that looks like a copy-paste extraction.


2. The abstractive method does OK: it generally generates text that sounds like it comes from the correct topic, and one can generally understand what it tries to say. But the performance is not at the human level.

  - For one, the abstractive method is limited by the amount of training resources, namely GPU time and labeled data. We only have 32.1K texts to train on in a highly technical domain, so it is very difficult for the generative model to learn the relationships between these technical terms from such a small set.

  - It often makes syntactical blunders; repeating itself is a big problem. This has to do with the auto-regressive model feeding its generated output back in as new input, a feedback loop that is susceptible to a kind of vicious circle.

# Citation

[COVID19 Open Research Dataset](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge)
- dataset of peer-reviewed medical journals used to do the fine-tuning of GPT2 model

[Huggingface](https://github.com/huggingface/transformers)
- repository of pytorch-based NLP models.

[BERT for first-time users](http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/)
- blogpost on how BERT embedding works.

[Named Entity Recognition with BERT](https://www.depends-on-the-definition.com/named-entity-recognition-with-bert/)
- blogpost on fine-tuning BERT model for token classification. The fine-tuning of BERT for token classification is based on this blogpost.

[Annotated-corpus for Named entity recognition](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus#ner_dataset.csv)
- dataset containing labeled tokens for training NLP models on token classification.

[The Illustrated GPT-2](http://jalammar.github.io/illustrated-gpt2/)
- a blogpost illustrating how GPT-2 works. 

[GPT2 Double Head Model](https://medium.com/huggingface/how-to-build-a-state-of-the-art-conversational-ai-with-transfer-learning-2d818ac26313)
- a blogpost on how the Double Head model works. The training of GPT2 is adapted from this blogpost.

[Different decoding method for language generation](https://huggingface.co/blog/how-to-generate)
- a blogpost on how language generation works, including explanation on greedy search, top-k and top-p sampling. 