# **Automatic text summarization**

Andrea Gatto - Deep Learning Specialist at Harman-Samsung ([contact](https://www.linkedin.com/in/andrea-gatto/))<br>
Machine Learning Milan meetup ApplyAI@7 - Hands-on in NLP part 2 - 9 May 2020 <br><br>
The slides are available [here](https://drive.google.com/open?id=13zb4KPoZyAQVqRWn1xcbHgxNHPg5Lgc5wD7QEbq0K68) and the video of the workshop is [here](https://www.youtube.com/watch?v=1vL3rn2ctuw&feature=youtu.be). <br><br><br>

### Introduction
In this notebook we perform text summarization using TextRank and BART (see the slides for details).
In the first section we build our own version of TextRank from scratch. BART is a large and complex model, and we have neither the time nor the resources to write the code and train it here. In the second part we therefore "simply" use two BART models already fine-tuned on summarization datasets to generate summaries of different input texts.<br>
Disclaimer: the code is deliberately verbose and inefficient; it was written so that everyone can follow it. Feel free to rewrite it in a cleaner and more efficient way (see exercise 1 of the TextRank section). <br><br>
I suggest activating the GPU (on Colab: Runtime -> Change runtime type -> Hardware accelerator -> GPU) for faster inference when using BART.

In [0]:
# Install transformers library
!pip install transformers==2.8.0

In [0]:
# Import libraries
import numpy as np
import textwrap
import pandas as pd
import string
# NB: the gensim.summarization module was removed in gensim 4.0, so this import needs gensim < 4.0
from gensim.summarization.summarizer import summarize as gensim_textrank_summarizer
import transformers
import torch
pd.set_option('display.expand_frame_repr', False)
wrap_size = 150

## **TextRank**
![](https://0701.static.prezi.com/preview/v2/6hakrtevcrup5t3z2lpollz2qp6jc3sachvcdoaizecfr3dnitcq_1_0.png)

**PageRank**

In [0]:
# Fct for computing PageRank scores (compute PR = d*M*PR + (1-d)/N)
def compute_pagerank(M, d=0.85, tol=1.e-6, verbose=False):
  # Number of nodes
  N = M.shape[0]
  # Initialize pagerank scores to 1/N for all nodes
  pr = np.ones(N)/N
  # Initialize diff for stopping iteration
  diff = 1+tol
  # Loop
  counter = 0
  while diff > tol:
    # Update pagerank
    pr_old = pr
    pr = d*np.dot(M, pr) + ((1-d)/N)*np.ones(N)
    # Compute diff wrt previous step
    diff = abs(pr - pr_old).sum()
    # Print values for each iteration
    if verbose:
      counter += 1
      print('Iteration {:>2}: {}'.format(counter, pr))
  return pr

Let's compute the PageRank scores for the graph we saw in the slides.<br>
The output is an array containing [PageRank(A), PageRank(B), PageRank(C), PageRank(D)].
<br><br>
![](https://drive.google.com/uc?id=1k1wO-ceF_b5nZRMFv59iA5PJJYwVk2br)

In [0]:
'''
ROWS = incoming
COLUMNS = outgoing
                FROM
      |  A  |  B  |  C  |  D  
   ----------------------------
    A |  X  | NO  | NO  |  NO     
   ----------------------------
T   B | NO  |  X  | YES |  NO 
O  ----------------------------
    C | YES | YES |  X  |  YES 
   ----------------------------
    D | NO  | YES | NO  |   X 

'''
M = np.array([[0, 0,   0, 0],
              [0, 0,   1, 0],
              [1, 0.5, 0, 1],
              [0, 0.5, 0, 0]])
pagerank = compute_pagerank(M, verbose=True)
print('\n', pagerank)

Iteration  1: [0.0375  0.25    0.56875 0.14375]
Iteration  2: [0.0375    0.5209375 0.2978125 0.14375  ]
Iteration  3: [0.0375     0.29064062 0.41296094 0.25889844]
Iteration  4: [0.0375     0.3885168  0.41296094 0.16102227]
Iteration  5: [0.0375     0.3885168  0.37136356 0.20261964]
Iteration  6: [0.0375     0.35315903 0.40672133 0.20261964]
Iteration  7: [0.0375     0.38321313 0.39169428 0.18759259]
Iteration  8: [0.0375     0.37044014 0.39169428 0.20036558]
Iteration  9: [0.0375     0.37044014 0.3971228  0.19493706]
Iteration 10: [0.0375     0.37505438 0.39250856 0.19493706]
Iteration 11: [0.0375     0.37113228 0.39446961 0.19689811]
Iteration 12: [0.0375     0.37279917 0.39446961 0.19523122]
Iteration 13: [0.0375     0.37279917 0.39376118 0.19593965]
Iteration 14: [0.0375     0.372197   0.39436335 0.19593965]
Iteration 15: [0.0375     0.37270885 0.39410743 0.19568373]
Iteration 16: [0.0375     0.37249131 0.39410743 0.19590126]
Iteration 17: [0.0375     0.37249131 0.39419988 0.195808
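
As a cross-check on the iteration above, note that its fixed point satisfies the linear system `(I - d*M) * PR = (1-d)/N`, which can be solved directly. The sketch below is not part of the original workshop code: the helper name `pagerank_closed_form` is made up for illustration, and it assumes `M` is column-normalized as in the example.

```python
import numpy as np

def pagerank_closed_form(M, d=0.85):
    # Solve (I - d*M) * pr = (1-d)/N for pr instead of iterating
    N = M.shape[0]
    A = np.eye(N) - d * M
    b = np.full(N, (1 - d) / N)
    return np.linalg.solve(A, b)

# Same matrix as in the cell above (rows = incoming, columns = outgoing)
M_example = np.array([[0, 0,   0, 0],
                      [0, 0,   1, 0],
                      [1, 0.5, 0, 1],
                      [0, 0.5, 0, 0]])
pr_exact = pagerank_closed_form(M_example)
print(pr_exact)
```

The result should agree with the last iterations printed above up to the stopping tolerance, and the scores sum to 1 because every column of `M` sums to 1.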

Let's now slightly change the graph by adding an outgoing link from C to D and compute the new PageRanks.<br><br>
![](https://drive.google.com/uc?id=1UVT2Xshqe2KFXr5O1vm-XKkJNFSkJVTD)

In [0]:
'''
ROWS = incoming
COLUMNS = outgoing
                FROM
      |  A  |  B  |  C  |  D  
   ----------------------------
    A |  X  | NO  | NO  |  NO     
   ----------------------------
T   B | NO  |  X  | YES |  NO 
O  ----------------------------
    C | YES | YES |  X  |  YES 
   ----------------------------
    D | NO  | YES | YES |   X 
'''
M = np.array([[0, 0,   0,   0],
              [0, 0,   0.5, 0],
              [1, 0.5, 0,   1],
              [0, 0.5, 0.5, 0]])
pagerank = compute_pagerank(M, verbose=True)
print('\n', pagerank)

Iteration  1: [0.0375  0.14375 0.56875 0.25   ]
Iteration  2: [0.0375     0.27921875 0.34296875 0.3403125 ]
Iteration  3: [0.0375     0.18326172 0.47730859 0.30192969]
Iteration  4: [0.0375     0.24035615 0.40390146 0.31824238]
Iteration  5: [0.0375     0.20915812 0.44203239 0.31130949]
Iteration  6: [0.0375     0.22536377 0.42288027 0.31425597]
Iteration  7: [0.0375     0.21722411 0.43227217 0.31300371]
Iteration  8: [0.0375     0.22121567 0.4277484  0.31353592]
Iteration  9: [0.0375     0.21929307 0.42989719 0.31330973]
Iteration 10: [0.0375     0.22020631 0.42888783 0.31340586]
Iteration 11: [0.0375     0.21977733 0.42935766 0.31336501]
Iteration 12: [0.0375     0.21997701 0.42914062 0.31338237]
Iteration 13: [0.0375     0.21988476 0.42924024 0.31337499]
Iteration 14: [0.0375     0.2199271  0.42919477 0.31337813]
Iteration 15: [0.0375     0.21990778 0.42921543 0.3133768 ]
Iteration 16: [0.0375     0.21991656 0.42920608 0.31337736]
Iteration 17: [0.0375     0.21991258 0.42921029 0.31
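
Both matrices above were typed in by hand. A minimal sketch of how `M` could be derived from the YES/NO table follows: each column of the 0/1 link table is divided by its number of outgoing links. The function name `links_to_transition_matrix` is hypothetical, and leaving columns without outgoing links as zeros is an assumption made here for simplicity.

```python
import numpy as np

def links_to_transition_matrix(links):
    # links[i, j] = 1 if node j links to node i (rows = incoming, columns = outgoing)
    links = np.asarray(links, dtype=float)
    col_sums = links.sum(axis=0)
    # Avoid division by zero for nodes without outgoing links (their column stays all zeros)
    safe_sums = np.where(col_sums > 0, col_sums, 1.0)
    return links / safe_sums

# YES/NO table of the second graph written as 1/0
links_second_graph = [[0, 0, 0, 0],
                      [0, 0, 1, 0],
                      [1, 1, 0, 1],
                      [0, 1, 1, 0]]
M_from_links = links_to_transition_matrix(links_second_graph)
print(M_from_links)
```

For the second graph this reproduces the matrix used in the cell above: B and C each split their weight equally between their two outgoing links.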

**TextRank**

In [0]:
# Compute cosine similarity between two vectors (compute (a*b)/(||a||*||b||))
def cosine_similarity(a, b):
  return np.dot(a, b)/(np.dot(a, a)**0.5 * np.dot(b, b)**0.5)

In [0]:
# Compute similarity matrix (compute S[i, j] = S[j, i] = cosine_similarity(sentence_i, sentence_j))
def get_similarity_matrix(features_array):

  # Create an empty similarity matrix with number of rows = number of columns = number of sentences = number of rows of features matrix
  sim_mat = np.zeros((features_array.shape[0], features_array.shape[0]))

  # Loop over rows and columns. Since cosine_similarity(a, b) = cosine_similarity(b, a) we just need to compute it for half of the matrix
  # S[i, j] = S[j, i] = cosine_similarity(sentence_i, sentence_j)
  for idx_row in range(1, sim_mat.shape[0]):
    for idx_col in range(idx_row):
      sim_mat[idx_row, idx_col] = cosine_similarity(features_array[idx_row], features_array[idx_col])
      sim_mat[idx_col, idx_row] = sim_mat[idx_row, idx_col]

  # Normalize column-wise
  for idx_col in range(sim_mat.shape[1]):
    sim_mat[:, idx_col] /= sim_mat[:, idx_col].sum()
  
  return sim_mat
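
The double loop above can also be expressed with matrix operations. The following is a sketch of one possible vectorized variant (the name `similarity_matrix_vectorized` is made up here); it assumes that no sentence has an all-zero feature vector and that every sentence shares at least one word with some other sentence, so that no division by zero occurs.

```python
import numpy as np

def similarity_matrix_vectorized(features_array):
    features = np.asarray(features_array, dtype=float)
    # Scale every row (sentence) to unit length so that dot products are cosine similarities
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim_mat = unit @ unit.T
    # The loop version never fills the diagonal, so keep it at zero here as well
    np.fill_diagonal(sim_mat, 0.0)
    # Normalize column-wise
    return sim_mat / sim_mat.sum(axis=0, keepdims=True)

toy_features = np.array([[1.0, 0.0],
                         [1.0, 1.0],
                         [0.0, 1.0]])
print(similarity_matrix_vectorized(toy_features))
```

In the toy example the first and third sentences are orthogonal, so each of them sends all of its weight to the second sentence, which in turn splits its weight equally between them.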

In [0]:
# Fct for producing sentence embeddings via TF-IDF (compute TF-IDF(t,d,D) = tf(t,d)*idf(t,D))
def get_tfidf_features(text, add_one_to_idf=True, verbose=False):

  # Remove special chars
  text = text.replace('‘','').replace('’','').replace('“','').replace('”','').replace('\n', ' ')

  # Get vocabulary (= unique words in text)
  # First remove punctuation
  text_no_punct = text.translate(str.maketrans('', '', string.punctuation))
  # Then split text into (lower case) words and remove duplicates
  vocab = set([word.lower() for word in text_no_punct.split(' ') if word])

  # Get sentences from text 
  # (we assume that they are "."-separated, NB not really a good assumption, see e.g. nltk.tokenize.punkt)
  sentence_list = [sentence for sentence in text.split('.') if sentence]
  
  # Loop over sentences
  tf_list = []
  denominator_idf = {word: 0 for word in vocab}
  for sentence in sentence_list:

    # Remove punctuation and split sentence into (lower case) words
    word_list = [word.lower() for word in sentence.translate(str.maketrans('', '', string.punctuation)).split(' ') if word]
    
    # Compute tf = number of times word appears in sentence/total number of words in sentence and
    # idf = ln(total number of sentences/how many sentences contain the word), here we just compute the denominator
    # NB tf depends on the sentence, so we are going to append it to a list, while idf is a measure across all sentences
    tf = {}
    for word in vocab:
      tf[word] = word_list.count(word)/len(word_list)
      if word in word_list:
        # Denominator of idf
        denominator_idf[word] += 1
    tf_list.append(tf)
  
  # Compute idf and check for division by zero in tfidf computation (denominator_idf should be always > 0)
  idf = {}
  for word in vocab:
    if denominator_idf[word] == 0:
      raise ValueError('Denominator IDF is zero for word', word)
    else:
      idf[word] = np.log(len(sentence_list)/denominator_idf[word])
      if add_one_to_idf:
        idf[word] += 1

  # Loop over tf_list entries (sentences) and words to get final tf-idf score for each word
  features_list = []
  for i, tf in enumerate(tf_list):
    tfidf = {}
    for word in tf:
      tfidf[word] = tf[word]*idf[word]
    features_list.append(tfidf)

  # Print features
  if verbose:
    print('Denominator IDF:\n', pd.DataFrame([denominator_idf]), '\n')
    print('IDF:\n', pd.DataFrame([idf]), '\n')
    print('TF:\n', pd.DataFrame(tf_list), '\n')
    print('TF-IDF:\n', pd.DataFrame(features_list), '\n')

  # Convert to numpy array
  features_arr = np.array([list(d.values()) for d in features_list])

  return features_arr

In [0]:
# Get TFIDF features for some sample text
text = 'This is the example sentence. This sentence is another example. The third one is the final example but the sentence is longer.'
features_arr_tfidf = get_tfidf_features(text, verbose=True)

Denominator IDF:
    is  example  one  the  this  another  but  third  final  sentence  longer
0   3        3    1    2     2        1    1      1      1         3       1 

IDF:
     is  example       one       the      this   another       but     third     final  sentence    longer
0  1.0      1.0  2.098612  1.405465  1.405465  2.098612  2.098612  2.098612  2.098612       1.0  2.098612 

TF:
          is   example       one   the  this  another       but     third     final  sentence    longer
0  0.200000  0.200000  0.000000  0.20   0.2      0.0  0.000000  0.000000  0.000000  0.200000  0.000000
1  0.200000  0.200000  0.000000  0.00   0.2      0.2  0.000000  0.000000  0.000000  0.200000  0.000000
2  0.166667  0.083333  0.083333  0.25   0.0      0.0  0.083333  0.083333  0.083333  0.083333  0.083333 

TF-IDF:
          is   example       one       the      this   another       but     third     final  sentence    longer
0  0.200000  0.200000  0.000000  0.281093  0.281093  0.000000  0.0
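
To make the printed numbers concrete, a couple of entries can be recomputed by hand. The snippet below is only an arithmetic check using the counts from the tables above ("the" appears in 2 of the 3 sentences, "one" in 1 of 3, and the first sentence has 5 words); the variable names are illustrative.

```python
import math

n_sentences = 3

# IDF = ln(number of sentences / number of sentences containing the word) + 1
idf_the = math.log(n_sentences / 2) + 1
idf_one = math.log(n_sentences / 1) + 1

# TF of "the" in the first sentence: 1 occurrence out of 5 words
tf_the_first_sentence = 1 / 5
tfidf_the_first_sentence = tf_the_first_sentence * idf_the

print(round(idf_the, 6), round(idf_one, 6), round(tfidf_the_first_sentence, 6))
```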

In [0]:
# Get topk sentences from textrank scores
def select_topk_sentences(text, textrank, k=3):
  
  # Get sentences (as in the TF-IDF function, watch out for ".")
  sentence_list = [sentence.strip() for sentence in text.split('.') if sentence]
  
  # Get indexes of sentences based on textrank score, from highest to lowest
  sorted_idx = np.flip(np.argsort(textrank))

  # Select top-k sentences
  topk_sentences = [sentence_list[i] for i in sorted_idx[:k]]

  return topk_sentences
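
Note that the function above returns the sentences ordered by score, from highest to lowest. If the summary should instead read in the order of the original text, one possible variant (a sketch, not the workshop's code; the name `select_topk_in_document_order` is hypothetical) sorts the selected indexes before picking the sentences:

```python
import numpy as np

def select_topk_in_document_order(sentence_list, scores, k=3):
    # Indexes of the k highest-scoring sentences
    top_idx = np.argsort(scores)[::-1][:k]
    # Re-sort the selected indexes so sentences appear as in the original text
    return [sentence_list[i] for i in sorted(top_idx)]

toy_sentences = ['first', 'second', 'third', 'fourth']
toy_scores = np.array([0.1, 0.4, 0.2, 0.3])
print(select_topk_in_document_order(toy_sentences, toy_scores, k=2))
```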

In [0]:
# Put everything together: process the text, compute the TextRank scores and return the most important sentences
def get_summary(text_to_summarize, k=3, verbose=False):

  # Get features
  features_arr = get_tfidf_features(text_to_summarize['text'], verbose=verbose)

  # Get similarity matrix
  S = get_similarity_matrix(features_arr)
  if verbose:
    print('Similarity matrix:\n', np.array_str(S, precision=3), '\n')

  # Compute textrank
  textrank = compute_pagerank(S, verbose=verbose)
  
  # Extract most important sentences
  sentence_summary = select_topk_sentences(text_to_summarize['text'], textrank, k=k)

  return sentence_summary  

In [0]:
# Print summary and original text
def print_summary(summary, ori_text=''):
  if ori_text:
    if 'title' in ori_text:
      print('*** Title ***')
      print(ori_text['title'], '\n')
    print('*** Text ***')
    print('\n\n'.join([textwrap.fill(s, wrap_size).lstrip() for s in ori_text['text'].split('.') if s]), '\n')
  print('*** Summary ***')
  print('\n\n'.join(textwrap.fill(s, wrap_size) for s in summary))

In [0]:
# From The Guardian article "What is a wet market?"
article = {'title': "What is a wet market? (The Guardian)"}
article['text'] = """At the crack of dawn every day, “wet markets” in China and across Asia come to life, with stall owners touting their wares such as fresh meat, fish, fruits and vegetables, herbs and spices in an open-air setting.
The sights and sounds of the wet market form part of the rich tapestry of community life, where local people buy affordable food, or just go for a stroll and meet their neighbours for a chat. The markets have come under extra scrutiny following the coronavirus outbreak.
While supermarkets selling chilled or frozen meats are increasingly popular in Asia, older shoppers generally prefer buying freshly slaughtered meat for daily consumption, believing it produces flavour in dishes and soup that is superior to frozen meat. Slabs of beef and pork hang from the butchers’ stalls while various cuts are piled on the counters amid lights with a reddish glare and the occasional buzzing of flies. After widespread avian flu outbreaks in the late 1990s however, Hong Kong and many Chinese provinces have banned the sale of live poultry in markets.
While “wet markets”, where water is sloshed on produce to keep it cool and fresh, may be considered unsanitary by western standards, most do not trade in exotic or wild animals and should not be confused with “wildlife markets” – now the focus of vociferous calls for global bans.
The now-infamous Wuhan South China seafood market, suspected to be a primary source for spreading Covid-19 in late 2019, had a wild animal section where live and slaughtered species were for sale, including snakes, beavers, badgers, civet cats, foxes, peacocks and porcupines among other animals.
The Wuhan market was closed in January and the Chinese authorities placed a temporary ban on all trade in wildlife. But according to recent news reports, some wildlife markets in southern China have reopened amid the pandemic, selling dogs, cats, bats, lizards and scorpions among other species.
Many Chinese continue to believe in the health benefits of consuming meat from wild animals. Two leading Hong Kong microbiologists, Professor Yuen Kwok-yung and Dr David Lung, last month condemned the continuing practice of consuming wild game, warning that “Sars 3” could materialise if people do not refrain from eating wild animals."""

In [0]:
# Let's do it!
article_summary_textrank = get_summary(article, verbose=True)
print_summary(article_summary_textrank, article)

Denominator IDF:
0     1       1           1          1         1   2      3    3        1     1     2        1    1    4   1      1   2        1      2   8           1             1     1     1          1         1    1     1    1         1         1   3        1      1       1      1         1        1   6     1  ...         1    1          1   3     1    1     3       1     3            1     1         1        1   7        1        1        1          1     3     2          1          1      1       1      1    1       1     1        1      2       1    1            1             1       1     2        1     1       1        1

[1 rows x 232 columns] 

IDF:
0  3.484907  3.484907    3.484907   3.484907  3.484907  2.791759  2.386294  2.386294  3.484907  3.484907  2.791759  3.484907  3.484907  2.098612  3.484907  3.484907  2.791759  3.484907  2.791759  1.405465    3.484907      3.484907  3.484907  3.484907   3.484907  3.484907  3.484907  3.484907  3.484907  3.484907  3.484907  2.38629

In [0]:
# Summarize article using a more sophisticated TextRank (https://arxiv.org/abs/1602.03606)
article_summary_textrank_gensim = gensim_textrank_summarizer(article['text'])
print_summary(article_summary_textrank_gensim.split('.'))

*** Summary ***
At the crack of dawn every day, “wet markets” in China and across Asia come to life, with stall owners touting their wares such as fresh meat, fish,
fruits and vegetables, herbs and spices in an open-air setting

 Many Chinese continue to believe in the health benefits of consuming meat from wild animals




### TextRank exercises:<br>
Exercise 1: <br>
Rewrite the `get_tfidf_features` function in fewer lines of code, using a better sentence-splitting scheme. <br><br>
Exercise 2: <br>
Produce summaries with your TextRank using different similarity measures (e.g. overlapping n-grams, soft cosine similarity, etc.). <br><br>
Exercise 3: <br>
Produce summaries with your TextRank using more sophisticated sentence embeddings (e.g. Word2Vec, GloVe, Doc2Vec, Universal Sentence Encoder, BERT, etc.). When using word-level embeddings such as Word2Vec or GloVe, compare the sentence embeddings and the resulting summaries obtained by plainly averaging the word embeddings vs. weighting the average via TF-IDF.
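
As a possible starting point for Exercise 2, here is a sketch of one n-gram overlap measure: the Jaccard index between the sets of word n-grams of two sentences. This is just one of several reasonable definitions, and the function names are made up for illustration.

```python
def word_ngrams(words, n):
    # Set of all consecutive word n-grams in a list of words
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def ngram_overlap_similarity(sentence_a, sentence_b, n=1):
    ngrams_a = word_ngrams(sentence_a.lower().split(), n)
    ngrams_b = word_ngrams(sentence_b.lower().split(), n)
    # Sentences shorter than n words have no n-grams: define the similarity as 0
    if not ngrams_a or not ngrams_b:
        return 0.0
    # Jaccard index: shared n-grams divided by all distinct n-grams
    return len(ngrams_a & ngrams_b) / len(ngrams_a | ngrams_b)

print(ngram_overlap_similarity('the cat sat', 'the cat ran', n=1))
print(ngram_overlap_similarity('the cat sat', 'the cat ran', n=2))
```

To plug such a measure into TextRank, the similarity matrix would have to be built from the raw sentences rather than from the TF-IDF feature vectors.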

## **BART**
<div>
<img src="https://drive.google.com/uc?id=1y5TRT-bEpysj8654qrUcgVfU5yKe2wIw" width="600"/>
</div>

In [0]:
# Set device (GPU or CPU)
dev='cuda' if torch.cuda.is_available() else 'cpu'
print(dev)

cuda


In [0]:
# Download and load BART models fine-tuned on the CNN/DailyMail and XSum datasets (the same tokenizer is used for both)
tokenizer = transformers.BartTokenizer.from_pretrained('bart-large-cnn')
model_cnn = transformers.BartForConditionalGeneration.from_pretrained('bart-large-cnn').to(dev)
model_xsum = transformers.BartForConditionalGeneration.from_pretrained('bart-large-xsum').to(dev)

In [0]:
# Fct for generating summary
def generate_summary(tokenizer, model, input_text, device, max_len=100, min_len=50, 
                     ngram_no_repeat=3, n_beams=10, n_return_sequences=1):

  # Tokenize text
  input_ids = tokenizer.encode_plus(input_text['text'], return_tensors='pt')['input_ids']

  # Generate summary
  outputs = model.generate(
      input_ids=input_ids.to(device),           # input to summarize
      max_length=max_len+2,                     # maximum length of output summary
      min_length=min_len+1,                     # minimum length of output summary
      no_repeat_ngram_size=ngram_no_repeat,     # size (n) of n-gram to not be repeated
      num_beams=n_beams,                        # number of beams to be used for beam search, if 1 then greedy search
      num_return_sequences=n_return_sequences   # number of final beams to be returned, if 1 only the first (highest probability) will be returned
  )

  # Decode the generated token IDs back into text
  summaries = [tokenizer.decode(s, skip_special_tokens=True) for s in outputs]

  return summaries
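
To illustrate what the `num_beams` argument changes, here is a toy beam search over a hand-made table of next-token probabilities. It is a simplified sketch, not BART's actual decoding code: the table `NEXT_TOKEN_PROBS` and the function `toy_beam_search` are hypothetical, and the "model" only looks at the previous token.

```python
import math

# Hypothetical toy model: P(next token | previous token)
NEXT_TOKEN_PROBS = {
    '<s>': {'a': 0.6, 'b': 0.4},
    'a': {'x': 0.55, 'y': 0.45},
    'b': {'x': 0.9, 'y': 0.1},
}

def toy_beam_search(probs, start='<s>', steps=2, num_beams=1):
    # Each beam is a pair (tokens so far, cumulative log-probability)
    beams = [([start], 0.0)]
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            for token, p in probs[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(p)))
        # Keep only the num_beams most probable partial sequences
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    # Drop the start token and convert log-probabilities back to probabilities
    return [(tokens[1:], math.exp(score)) for tokens, score in beams]

greedy_result = toy_beam_search(NEXT_TOKEN_PROBS, num_beams=1)
beam_result = toy_beam_search(NEXT_TOKEN_PROBS, num_beams=2)
print('Greedy:', greedy_result[0])
print('Beam  :', beam_result[0])
```

With `num_beams=1` the search commits to the locally best first token `a` and ends with a sequence of probability about 0.33, while with two beams it also keeps `b` alive and finds the more probable sequence `b x` (about 0.36).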

In [0]:
# Get and print article summary using BART from CNN with greedy search
article_summary_bart_cnn_greedy = generate_summary(tokenizer, model_cnn, article, dev, n_beams=1)
print_summary(article_summary_bart_cnn_greedy, article)

*** Title ***
What is a wet market? (The Guardian) 

*** Text ***
At the crack of dawn every day, “wet markets” in China and across Asia come to life, with stall owners touting their wares such as fresh meat, fish,
fruits and vegetables, herbs and spices in an open-air setting

The sights and sounds of the wet market form part of the rich tapestry of community life, where local people buy affordable food, or just go for a
stroll and meet their neighbours for a chat

The markets have come under extra scrutiny following the coronavirus outbreak

While supermarkets selling chilled or frozen meats are increasingly popular in Asia, older shoppers generally prefer buying freshly slaughtered meat
for daily consumption, believing it produces flavour in dishes and soup that is superior to frozen meat

Slabs of beef and pork hang from the butchers’ stalls while various cuts are piled on the counters amid lights with a reddish glare and the
occasional buzzing of flies

After widespread avian flu 

In [0]:
# Get and print article summary using BART from CNN with beam search with beam size = 10
print('*** Beam search ***')
article_summary_bart_cnn_beam = generate_summary(tokenizer, model_cnn, article, dev, n_return_sequences=5)
print_summary(article_summary_bart_cnn_beam)

*** Beam search ***
*** Summary ***
Many Chinese continue to believe in the health benefits of consuming meat from wild animals. After widespread avian flu outbreaks in the late 1990s,
Hong Kong and many Chinese provinces have banned the sale of live poultry in markets. But according to recent news reports, some wildlife markets in
southern China have reopened amid the pandemic.

Many Chinese continue to believe in the health benefits of consuming meat from wild animals. Hong Kong and many Chinese provinces have banned the sale
of live poultry in markets. Some wildlife markets in southern China have reopened amid the pandemic, selling dogs, cats, bats, lizards and scorpions
among other species.

Many Chinese continue to believe in the health benefits of consuming meat from wild animals. Hong Kong and many Chinese provinces have banned the sale
of live poultry in markets. The now-infamous Wuhan South China seafood market is suspected to be a primary source for spreading Covid-19 in late

In [0]:
# Compare summaries found with beam search (beam_size=10), greedy search and TextRank
print('*** Beam search ***')
article_summary_bart_cnn = [article_summary_bart_cnn_beam[0]]
print_summary(article_summary_bart_cnn)
print('\n*** Greedy search ***')
print_summary(article_summary_bart_cnn_greedy)
print('\n*** TextRank ***')
print_summary(article_summary_textrank)

*** Beam search ***
*** Summary ***
Many Chinese continue to believe in the health benefits of consuming meat from wild animals. After widespread avian flu outbreaks in the late 1990s,
Hong Kong and many Chinese provinces have banned the sale of live poultry in markets. But according to recent news reports, some wildlife markets in
southern China have reopened amid the pandemic.

*** Greedy search ***
*** Summary ***
Wet markets are open-air markets where locals buy fresh meat, fish, fruits and vegetables. They are not to be confused with ‘wildlife markets’ which
sell live and slaughtered animals. Many Chinese believe in the health benefits of consuming meat from wild animals.

*** TextRank ***
*** Summary ***
While “wet markets”, where water is sloshed on produce to keep it cool and fresh, may be considered unsanitary by western standards, most do not trade
in exotic or wild animals and should not be confused with “wildlife markets” – now the focus of vociferous calls for global bans


In [0]:
# Let's try to break it
book = {'title': "The immortal (Borges)"}
book['text'] = """My travails, I have said, began in a garden in Thebes. All that night I did not sleep, for there was a combat in my heart. I rose at last a little before dawn. My slaves were sleeping; the moon was the color of the infinite sand. A bloody rider was approaching from the east, weak with exhaustion. A few steps from me, he dismounted and in a faint, insatiable voice asked me, in Latin, the name of the river whose waters laved the city’s walls. I told him it was the Egypt, fed by the rains. “It is another river that I seek,” he replied morosely, “the secret river that purifies men of death”. Dark blood was welling from his breast. He told me that the country of his birth was a mountain that lay beyond the Ganges; it was rumored on that mountain, he told me, that if one traveled westward, to the end of the world, one would come to the river whose waters give immortality. He added that on the far shore of that river lay the City of the Immortals, a city rich in bulwarks and amphitheaters and temples. He died before dawn, but I resolved to go in quest of that city and its river. When interrogated by the torturer, some of the Mauritanian prisoners confirmed the traveler’s tale: One of them recalled the Elysian plain, far at the ends of the earth, where men’s lives are everlasting; another, the peaks from which the Pactolus flows, upon which men live for a hundred years. In Rome, I spoke with philosophers who felt that to draw out the span of a man’s life was to draw out the agony of his dying and multiply the number of his deaths. I am not certain whether I ever believed in the City of the Immortals; I think the task of finding it was enough for me. Flavius, the Getulian proconsul, entrusted two hundred soldiers to me for the venture; I also recruited a number of mercenaries who claimed they knew the roads, and who were the first to desert."""
book_summary_bart_cnn = generate_summary(tokenizer, model_cnn, book, dev)
print_summary(book_summary_bart_cnn, book)

*** Title ***
The immortal (Borges) 

*** Text ***
My travails, I have said, began in a garden in Thebes

All that night I did not sleep, for there was a combat in my heart

I rose at last a little before dawn

My slaves were sleeping; the moon was the color of the infinite sand

A bloody rider was approaching from the east, weak with exhaustion

A few steps from me, he dismounted and in a faint, insatiable voice asked me, in Latin, the name of the river whose waters laved the city’s walls

I told him it was the Egypt, fed by the rains

“It is another river that I seek,” he replied morosely, “the secret river that purifies men of death”

Dark blood was welling from his breast

He told me that the country of his birth was a mountain that lay beyond the Ganges; it was rumored on that mountain, he told me, that if one traveled
westward, to the end of the world, one would come to the river whose waters give immortality

He added that on the far shore of that river lay the City of the Immor

In [0]:
# Let's try to break it - episode 2
song = {'title': "Present tense (Radiohead)"}
song['text'] = """This dance
This dance
It's like a weapon
It's like a weapon
Of self defense
Self defense
Against the present
Against the present
Present tense
I won't get heavy
Don't get heavy
Keep it light and
Keep it moving
I am doing
No harm
As my world
Comes crashing down
I'm dancing
Freaking out
Deaf, dumb, and blind
In you I'm lost
In you I'm lost
I won't turn around when the penny drops
I won't stop now
I won't slack off
Or all this love
Will be in vain
Stop from falling
Down a mine
It's no one's business but mine
That all this love
Has been in vain
In you I'm lost
In you I'm lost
In you I'm lost
In you I'm lost"""

In [0]:
song_summary_bart_cnn = generate_summary(tokenizer, model_cnn, song, dev)
print('*** Title ***')
print(song['title'], '\n')
print('*** Text ***')
print(song['text'], '\n')
print_summary(song_summary_bart_cnn)

*** Title ***
Present tense (Radiohead) 

*** Text ***
This dance
This dance
It's like a weapon
It's like a weapon
Of self defense
Self defense
Against the present
Against the present
Present tense
I won't get heavy
Don't get heavy
Keep it light and
Keep it moving
I am doing
No harm
As my world
Comes crashing down
I'm dancing
Freaking out
Deaf, dumb, and blind
In you I'm lost
In you I'm lost
I won't turn around when the penny drops
I won't stop now
I won't slack off
Or all this love
Will be in vain
Stop from falling
Down a mine
It's no one's business but mine
That all this love
Has been in vain
In you I'm lost
In you I'm lost
In you I'm lost
In you I'm lost 

*** Summary ***
I won't turn around when the penny drops. I won't slack off. Or all this love will be in vain. Stop from falling down a mine. It's no one's business
but mine. I am doing no harm. As my world comes crashing down, I'm dancing.


In [0]:
# Get and print article summary using BART from XSum
article_summary_bart_xsum = generate_summary(tokenizer, model_xsum, article, dev)
print_summary(article_summary_bart_xsum, article)
print('\n*** BART CNN ***')
print_summary(article_summary_bart_cnn)

*** Title ***
What is a wet market? (The Guardian) 

*** Text ***
At the crack of dawn every day, “wet markets” in China and across Asia come to life, with stall owners touting their wares such as fresh meat, fish,
fruits and vegetables, herbs and spices in an open-air setting

The sights and sounds of the wet market form part of the rich tapestry of community life, where local people buy affordable food, or just go for a
stroll and meet their neighbours for a chat

The markets have come under extra scrutiny following the coronavirus outbreak

While supermarkets selling chilled or frozen meats are increasingly popular in Asia, older shoppers generally prefer buying freshly slaughtered meat
for daily consumption, believing it produces flavour in dishes and soup that is superior to frozen meat

Slabs of beef and pork hang from the butchers’ stalls while various cuts are piled on the counters amid lights with a reddish glare and the
occasional buzzing of flies

After widespread avian flu 

In [0]:
# Let's try with a random webpage
random_page = {'title': 'Random webpage (busradar)'}
random_page['text']="""In Europe, public transport carried out by private operators is extremely cost-efficient. According to the European Commission, regulated competition brings advantages to everybody. Travelling by long distance bus is not only a low priced alternative to car sharing or the Saver-fares of the Deutsche Bahn, it is additionally environment-friendly: measured in terms of passenger kilometers the intercity bus produces the lowest CO2-emissions compared to all other means of transportation. The intercity bus enables you to city-hop around Europe at a cheaper price. It is perfect for travelers who want to combine comfort, low carbon footprint, safety and reasonable budget all in one faraway trip.  Already today there is a large number of long distance bus companies, who compete with their intercity bus travels for the favor of customers. This is why the long distance bus price comparison makes definitely sense in order to get the best ticket price. If you would like to book a long distance bus ticket, you do not only have the choice of several payment methods, but also a wide array of long distance bus providers. These include, among others, IDBus, terravision, megabus, and Eurolines. With their intercity bus connections, they offer regular bus trips with long distance buses all over Europe.
Low-cost bus travels across Europe: compare fares, times and ticketing.
The advantages of intercity buses are obvious: no other mode of transport lets you take in so many sights all over Europe and still have money left over for local cuisine and a few souvenirs. To benefit from these advantages, you are more than welcome to check out our long distance bus price comparison portal on busradar: you find all intercity buses, plus you are able to compare the intercity buses in terms of price and comfort and additionally benefit from numerous promotional fares, as for example the “Bahn Spezial” of the Deutsche Bahn. For this reason, busradar is your guidepost to the best bus trip and absolutely reasonable travelling all over Europe by long distance bus.
True to our slogan: “all long distance buses at a glance” you find the long distance buses and their connections of Eurolines, megabus & Co clearly and transparently displayed on one single page: Just a few clicks away to your most suitable bus."""

In [0]:
# Get and print random web page summary using BART from XSum and CNN
random_page_summary_bart_xsum = generate_summary(tokenizer, model_xsum, random_page, dev)
print_summary(random_page_summary_bart_xsum, random_page)
print('\n*** BART CNN ***')
random_page_summary_bart_cnn = generate_summary(tokenizer, model_cnn, random_page, dev)
print_summary(random_page_summary_bart_cnn)

*** Title ***
Random webpage (busradar) 

*** Text ***
In Europe, public transport carried out by private operators is extremely cost-efficient

According to the European Commission, regulated competition brings advantages to everybody

Travelling by long distance bus is not only a low priced alternative to car sharing or the Saver-fares of the Deutsche Bahn, it is additionally
environment-friendly: measured in terms of passenger kilometers the intercity bus produces the lowest CO2-emissions compared to all other means of
transportation

The intercity bus enables you to city-hop around Europe at a cheaper price

It is perfect for travelers who want to combine comfort, low carbon footprint, safety and reasonable budget all in one faraway trip

Already today there is a large number of long distance bus companies, who compete with their intercity bus travels for the favor of customers

This is why the long distance bus price comparison makes definitely sense in order to get the best ticke

In [0]:
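# Get and print book summary using BART from XSum and CNN
book_summary_bart_xsum = generate_summary(tokenizer, model_xsum, book, dev)
print_summary(book_summary_bart_xsum, book)
print('\n*** BART CNN ***')
# Compute the CNN summary here so this cell does not depend on an earlier one
book_summary_bart_cnn = generate_summary(tokenizer, model_cnn, book, dev)
print_summary(book_summary_bart_cnn)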

*** Title ***
The immortal (Borges) 

*** Text ***
My travails, I have said, began in a garden in Thebes

All that night I did not sleep, for there was a combat in my heart

I rose at last a little before dawn

My slaves were sleeping; the moon was the color of the infinite sand

A bloody rider was approaching from the east, weak with exhaustion

A few steps from me, he dismounted and in a faint, insatiable voice asked me, in Latin, the name of the river whose waters laved the city’s walls

I told him it was the Egypt, fed by the rains

“It is another river that I seek,” he replied morosely, “the secret river that purifies men of death”

Dark blood was welling from his breast

He told me that the country of his birth was a mountain that lay beyond the Ganges; it was rumored on that mountain, he told me, that if one traveled
westward, to the end of the world, one would come to the river whose waters give immortality

He added that on the far shore of that river lay the City of the Immor

In [0]:
# Get and print song summary using BART from XSum and CNN
song_summary_bart_xsum = generate_summary(tokenizer, model_xsum, song, dev)
print('*** Title ***')
print(song['title'], '\n')
print('*** Text ***')
print(song['text'], '\n')
print_summary(song_summary_bart_xsum)
print('\n*** BART CNN ***')
# Compute the CNN summary here so this cell does not depend on an earlier one
song_summary_bart_cnn = generate_summary(tokenizer, model_cnn, song, dev)
print_summary(song_summary_bart_cnn)

*** Title ***
Present tense (Radiohead) 

*** Text ***
This dance
This dance
It's like a weapon
It's like a weapon
Of self defense
Self defense
Against the present
Against the present
Present tense
I won't get heavy
Don't get heavy
Keep it light and
Keep it moving
I am doing
No harm
As my world
Comes crashing down
I'm dancing
Freaking out
Deaf, dumb, and blind
In you I'm lost
In you I'm lost
I won't turn around when the penny drops
I won't stop now
I won't slack off
Or all this love
Will be in vain
Stop from falling
Down a mine
It's no one's business but mine
That all this love
Has been in vain
In you I'm lost
In you I'm lost
In you I'm lost
In you I'm lost 

*** Summary ***
In our series of letters from European journalists, film-maker and columnist Quentin Sommerville reflects on the importance of music in our daily
lives and reflects on some of his favourite pieces of musical theatre and dance routines, some of which he has performed on stage.

*** BART CNN ***
*** Summary ***
I won

### BART exercises:<br>
Exercise 1: <br>
Instead of applying the model to one input text at a time (batch size = 1), apply it to several input texts at once (batch size > 1).<br><br>
Exercise 2: <br>
Generate summaries using top-k and top-p sampling (check generate() parameters such as do_sample, top_p, top_k) and compare them with the summaries obtained with greedy and beam search.
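
Before trying Exercise 2, it may help to see what top-k and top-p filtering do to a next-token distribution. The sketch below is a plain-Python illustration, not the code that generate() runs internally. The helper names (`softmax`, `top_k_top_p_filter`, `sample_token`) are hypothetical, and the sketch assumes that top-k is applied first and that top-p is then applied to the renormalised result.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def _renormalise(probs, keep):
    """Zero out every token id not in `keep` and rescale the rest to sum to 1."""
    total = sum(probs[i] for i in keep)
    out = [0.0] * len(probs)
    for i in keep:
        out[i] = probs[i] / total
    return out


def top_k_top_p_filter(probs, top_k=0, top_p=1.0):
    """Illustrative filter: top_k=0 and top_p=1.0 leave `probs` unchanged."""
    # Token ids ordered from most to least probable
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        # Top-k: keep only the k most probable tokens
        ranked = ranked[:top_k]
        probs = _renormalise(probs, ranked)
    if top_p < 1.0:
        # Top-p (nucleus): keep the smallest prefix whose mass reaches top_p
        kept, cumulative = [], 0.0
        for i in ranked:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        probs = _renormalise(probs, kept)
    return probs


def sample_token(logits, top_k=0, top_p=1.0, rng=random):
    """Sample one token id from the filtered distribution."""
    probs = top_k_top_p_filter(softmax(logits), top_k, top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs = [0.5, 0.3, 0.15, 0.05]
print(top_k_top_p_filter(probs, top_k=2))    # only the two most likely tokens survive
print(top_k_top_p_filter(probs, top_p=0.9))  # the least likely token is dropped
print(sample_token([0.1, 2.0, -1.0], top_k=1))
```

With `top_k=1` the sampler always returns the most probable token, which is greedy search. Larger `top_k` or `top_p` values admit more candidates, so the generated summaries become more varied.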