In [1]:
import torch
import syntok.segmenter as segmenter
from pathlib import Path
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from itertools import zip_longest

# Overview

## Summarization Models

At the time of writing, three state-of-the-art text summarization models stand out:
1. Pegasus
2. T5
3. BART

Rather than comparing all three in this notebook, I have chosen to focus on BART.  BART has been reported to slightly outperform the others on summarization tasks (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7720861/).

## BART Models

BART (https://arxiv.org/abs/1910.13461) was trained on a number of tasks.  HuggingFace does not host a single BART model that covers all of the tasks presented in the paper; instead there are separate models, each trained for a specific task.  For summarization, you can find BART models on HuggingFace that were trained on the following datasets:

* CNN/DailyMail - Dataset of news articles.  The summaries in this dataset resemble the source sentences, so extractive summarization models perform well here.  In the paper, BART outperformed the earlier models it was compared against.  Model ID: "facebook/bart-large-cnn"
* XSUM - Dataset of news articles.  The summaries in this dataset are highly abstract and generally much shorter than the original text, so abstractive summarization models perform well here.  In the paper, BART outperformed the earlier models it was compared against.  Model ID: "facebook/bart-large-xsum"

Which of the two models to use depends on what you want the summarizer to do.  If the summary should be very high level and very short, go with XSUM.  If it should be more detailed but still capture the basics of the source text, go with CNN.

### Model Speed

Both models have distilled versions (smaller models for faster inference).  Since these models will be used in production environments, inference speed is important.  For this reason, I will evaluate the distilled versions of the models in this notebook.  Metrics showing how the distilled models compare to the full sized (baseline) models can be found here: https://huggingface.co/sshleifer/distilbart-xsum-12-6#metrics-for-distilbart-models.  

Dynamic quantization can be used to further reduce the model size and increase inference speed.  PyTorch makes this extremely simple with a one-liner.  Quantization compresses the weights of a pre-trained model (here, the weights of the linear layers are stored as 8-bit integers) without much impact on model performance.  See an example with PyTorch here: https://pytorch.org/tutorials/intermediate/dynamic_quantization_bert_tutorial.html.
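
To make the idea concrete, here is a minimal pure-Python sketch of what 8-bit quantization does to a handful of weights.  It is an illustration under simplifying assumptions (one symmetric scale for the whole tensor, made-up weight values, and the hypothetical helper names `quantize_int8` and `dequantize`), not PyTorch's actual implementation.

```python
def quantize_int8(weights):
    """Map floats to integer codes in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the int8 range used here.
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]


# Made-up example weights (assumption, for illustration only).
weights = [0.42, -1.30, 0.07, 0.95, -0.61]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(f"largest round-trip error: {max_error:.4f}")
```

Each weight is now an 8-bit code plus one shared scale, and the rounding error per weight is at most half the scale.  That small error is why model quality usually degrades only slightly.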

# Preparing the Models

In [2]:
# specify the model names for HuggingFace's model hub and the local paths to save/load them
bart_xsum = "sshleifer/distilbart-xsum-12-6"
bart_cnn = "sshleifer/distilbart-cnn-12-3"

bart_xsum_model_path = Path("models/bart_xsum")
bart_cnn_model_path = Path("models/bart_cnn")

In [3]:
# load the tokenizer and model from HuggingFace
bart_xsum_tokenizer = AutoTokenizer.from_pretrained(bart_xsum)
bart_xsum_model = AutoModelForSeq2SeqLM.from_pretrained(bart_xsum)

bart_cnn_tokenizer = AutoTokenizer.from_pretrained(bart_cnn)
bart_cnn_model = AutoModelForSeq2SeqLM.from_pretrained(bart_cnn)

In [4]:
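# make sure the target directories exist before saving; torch.save does not create them
bart_xsum_model_path.mkdir(parents=True, exist_ok=True)
bart_cnn_model_path.mkdir(parents=True, exist_ok=True)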

In [5]:
# quantize and save the compressed models
bart_xsum_model_quantized = torch.quantization.quantize_dynamic(
    bart_xsum_model, {torch.nn.Linear}, dtype=torch.qint8
)
torch.save(bart_xsum_model_quantized, bart_xsum_model_path / "model.pt")


bart_cnn_model_quantized = torch.quantization.quantize_dynamic(
    bart_cnn_model, {torch.nn.Linear}, dtype=torch.qint8
)
torch.save(bart_cnn_model_quantized, bart_cnn_model_path / "model.pt")

# Simple Summarization

In [6]:
articles = [
    ("Denmark has agreed to provide Ukraine with a Harpoon launcher and missiles to 'help Ukraine defend its coast',"
     "US Secretary of Defense Lloyd Austin said at the conclusion of the second Ukraine Contact Group meeting hosted"
     "by Austin on Monday. The second contact group meeting was held virtually. "
     "The Czech Republic also agreed to send 'substantial support' to Ukraine including 'a recent donation of attack "
     "helicopters, tanks and rocket systems,' Austin said at a press conference at the conclusion of the meeting."
     "Overall, 20 countries 'announced new security assistance packages,' after the meeting, Austin said, including "
     "'donating critically needed artillery ammunition, coastal defense systems and tanks and other armored vehicles.'"
     "'Others came forward with new commitments for training Ukraine’s forces and sustaining its military systems,' "
     "Austin added. "
     "A total of 47 countries participated in the contact group’s second meeting, Chairman of the Joint Chiefs of "
     "Staff Gen. Mark Milley said."
     "Secretary Austin will host the third meeting of the Ukraine Contact Group in person in Brussels on June 15, "
     "Austin said at the conclusion of the second virtual meeting of the contact group Monday. "
     "'I will convene the Contact Group for our third meeting next month and will gather in person this time, on "
     "June 15, in the margins of the NATO defense ministerial in Brussels,' Austin said. 'Of course, it won't be a "
     "NATO event, but we want to keep up the, up, keep up the tempo of these meetings and I wanted to use my travel "
     "to Europe to ensure that we're building on our momentum.'"
    ),
    ("Germany's vice chancellor and economy minister told CNN a recession is not inevitable. "
     "Speaking to CNN's Julia Chatterley at the World Economic Forum in Davos, Robert Habeck insisted that "
     "'nothing is inevitable, we are human beings and can change the course of history.'"
     "He also spoke to CNN about the war in Ukraine and Europe’s efforts to lessen dependence on Russian energy. "
     "Asked whether the European Union could reach an agreement on the next round of sanctions, including an oil "
     "embargo, he said he was confident a deal could be reached and could be done within days."
     "'I expect everyone — also Hungary — that they work to find a solution and not saying 'OK we have an exception "
     "and then we will lay back and build on our partnership with Putin,' he said while speaking earlier on a panel "
     "at Davos."
     "Habeck also discussed Germany’s dependence on Russian gas, saying German industry would collapse without "
     "Russian energy. Asked whether Germany would pay for Russian gas in rubles, Habeck said that German companies "
     "would pay for gas in euros, if Russia then decided to exchange those euros into rubles, it was a 'face saving' "
     "measure for Putin."
     "He insisted that any such moves were approved by the EU Commission and did not break sanctions."
     "More background: Russian President Vladimir Putin said in March that 'unfriendly' nations would have to pay "
     "rubles, rather than the euros or dollars stated in contracts. Buyers could make euro or dollar payments into "
     "an account at Russia's Gazprombank, which would then convert the funds into rubles and transfer them to a "
     "second account from which the payment to Russia would be made."
     "Gas supplies to Poland and Bulgaria were cut off, after they refused to pay in rubles. Other big European "
     "gas companies have told CNN they are working on ways to pay for Russian gas, while not breaking EU sanctions."
    ),
    ("Coffee giant Starbucks says it has exited Russia and will no longer have a brand presence there, according "
     "to a press release on Monday. "
     "The coffee company says it has been operating in Russia for 15 years and has now closed its 130 licensed "
     "cafes in the country. Starbucks joins other companies like McDonald’s and Exxon Mobil in taking its business "
     "completely out of Russia. "
     "Starbucks says it will 'support' its nearly 2,000 workers in Russia, including pay for six months and "
     "assistance for partners to transition to new opportunities outside of Starbucks. "
     "This comes after Starbucks CEO Kevin Johnson said in March that it had suspended all business activity in "
     "Russia, including shipment of all Starbucks products. "
    ),
]

In [7]:
summarizer_xsum = pipeline("summarization", model=bart_xsum_model_quantized, tokenizer=bart_xsum_tokenizer)
summarizer_cnn = pipeline("summarization", model=bart_cnn_model_quantized, tokenizer=bart_cnn_tokenizer)

for article in articles:
    print("----------\nNews Article Summary\n")
    print(f"XSUM Summary:\n{summarizer_xsum(article)}")
    print(f"CNN Summary:\n{summarizer_cnn(article)}")

----------
News Article Summary

XSUM Summary:
[{'summary_text': "A meeting has been held to discuss Ukraine's response to the conflict."}]
CNN Summary:
[{'summary_text': " The U.S. is one of Russia's top defense teams . The second meeting of the Ukraine Contact Group will be held at the centre of the country . 'We will be able to make a point point point to the country's defense . 'I would have been given to the Ukraine' says Austin, Texas ."}]
----------
News Article Summary

XSUM Summary:
[{'summary_text': ' The Russian President Vladimir Putin has said that he will put the European Union under the threat of sanctions.'}]
CNN Summary:
[{'summary_text': " Habeck insisted that 'nothing is inevitable, we are human beings and can change the course of history' He also spoke to CNN about the situation in Ukraine, Ukraine, and the country . He said he was confident that a deal could be reached within hours of a day ."}]
----------
News Article Summary



Your max_length is set to 142, but you input_length is only 136. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=68)


XSUM Summary:
[{'summary_text': ' Starbucks has been in Russia for the past five years.'}]
CNN Summary:
[{'summary_text': ' The coffee company has been operating in Russia for 15 years . Starbucks says it will be no longer have a brand brand there . Starbucks has been closed in Russia . The company says it has no plans to do the same same as other Starbucks stores . Starbucks is also a hot hot spot for Starbucks .'}]


# Manipulating the Summary with Constrained Beam Search

Autoregressive models generate one token at a time.  The simplest decoding strategy, greedy search, always picks the token with the highest probability of occurring next.  The problem with a greedy strategy is that it can be repetitive and can produce suboptimal output: a highly probable next word may lead into a branch of the search tree where every continuation is improbable, so the sequence as a whole scores worse than one that started with a less likely word.  Beam search reduces this risk by keeping several candidate sequences (beams) alive at each step.  See: https://huggingface.co/blog/how-to-generate.
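
The difference between greedy search and beam search can be sketched on a toy example.  Everything below is hypothetical: `TOY_LM` is a hand-written table of next-token probabilities, not a real model, and the beam search omits details such as end-of-sequence handling and length penalties.

```python
import math

# Hypothetical toy "language model": next-token probabilities given the tokens so far.
TOY_LM = {
    (): {"The": 0.6, "A": 0.4},
    ("The",): {"cat": 0.4, "dog": 0.3, "end": 0.3},
    ("A",): {"summit": 0.9, "end": 0.1},
}


def next_probs(prefix):
    # Prefixes missing from the table have no continuation.
    return TOY_LM.get(tuple(prefix), {})


def greedy(steps):
    """Always take the single most probable next token."""
    seq, logp = [], 0.0
    for _ in range(steps):
        probs = next_probs(seq)
        if not probs:
            break
        token = max(probs, key=probs.get)
        seq.append(token)
        logp += math.log(probs[token])
    return seq, math.exp(logp)


def beam_search(steps, num_beams):
    """Keep the `num_beams` most probable partial sequences at every step."""
    beams = [([], 0.0)]
    for _ in range(steps):
        candidates = [
            (seq + [token], logp + math.log(p))
            for seq, logp in beams
            for token, p in next_probs(seq).items()
        ]
        if not candidates:
            break
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    best_seq, best_logp = beams[0]
    return best_seq, math.exp(best_logp)


greedy_seq, greedy_p = greedy(steps=2)
beam_seq, beam_p = beam_search(steps=2, num_beams=2)
print("greedy:", greedy_seq, round(greedy_p, 2))
print("beam:  ", beam_seq, round(beam_p, 2))
```

Greedy search commits to "The" because it is the likeliest first token, and ends with a sequence probability of 0.24.  Beam search keeps "A" alive as a second candidate and finds "A summit", which has a higher overall probability of 0.36.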

Constrained beam search forces chosen words to appear in the output sequence: https://huggingface.co/blog/constrained-beam-search.  Here I will try injecting some words into the summaries.

## Injecting Words into the Summary

The words to insert are supplied as a list, one list per article.  The beam search decides where in the summary to place them.

In [8]:
force_words = [
    ["Denmark", "Ukraine", "Russia"],
    ["Germany", "European Union", "sanctions", "Russia"],
    ["Starbucks", "Russia", "employees"],
]
# pad each word with spaces (presumably so that it is tokenized as a standalone word)
force_words = [[" " + str(w) + " " for w in wl] for wl in force_words]
print(force_words)

[[' Denmark ', ' Ukraine ', ' Russia '], [' Germany ', ' European Union ', ' sanctions ', ' Russia '], [' Starbucks ', ' Russia ', ' employees ']]


In [9]:
for article_id, article in enumerate(articles):
    print("----------\nNews Article Summary\n")
    
    input_ids = bart_xsum_tokenizer(article, return_tensors="pt").input_ids
    force_words_ids = bart_xsum_tokenizer(force_words[article_id], add_special_tokens=False).input_ids

    outputs = bart_xsum_model_quantized.generate(
        input_ids,
        force_words_ids=force_words_ids,
        num_beams=5,
        num_return_sequences=1,
        no_repeat_ngram_size=1,
        remove_invalid_values=True,
    )
    print(f"XSUM Summary:\n{bart_xsum_tokenizer.decode(outputs[0], skip_special_tokens=True)}")

    input_ids = bart_cnn_tokenizer(article, return_tensors="pt").input_ids
    force_words_ids = bart_cnn_tokenizer(force_words[article_id], add_special_tokens=False).input_ids

    outputs = bart_cnn_model_quantized.generate(
        input_ids,
        force_words_ids=force_words_ids,
        num_beams=5,
        num_return_sequences=1,
        no_repeat_ngram_size=1,
        remove_invalid_values=True,
    )
    print(f"CNN Summary:\n{bart_cnn_tokenizer.decode(outputs[0], skip_special_tokens=True)}")


----------
News Article Summary

XSUM Summary:
The Ukraine Contact Group has been in Russia for the first time Ukraine  Russia  Denmark's military forces have fought a war with their own allies. and it is now being held at an army base, as they are under pressure to defend themselves against Ukrainian troops on its territory of Crimea.'''Denmark 
CNN Summary:
 The second Ukraine Contact Group meeting was held Monday. Russia's defense Denmark  and other countries, including the U.S., to be seen as a top security risk for Moscow-Ukraine conflict? 'I would have been more accurate' Secretary of Defense said: "We will never see if you're in danger" or at least one point I'd like this year." We want us back on our radar when we are here again...here! And it is not clear how much time that exists around there has come from behind with missiles attached by some people who support Ukrainian troops during their mission missions abroad.' A total number had also emerged emerge emerges through an a
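
One way to confirm that the forced words actually made it into a summary is a plain substring check.  This is a minimal sketch; `missing_force_words` is a hypothetical helper, and the two strings are copied (the first one abbreviated) from the XSUM outputs for the first article shown in this notebook.

```python
def missing_force_words(summary, words):
    """Return the forced words that do not occur in `summary`.

    Simple case-sensitive substring check. The padding spaces around each word
    are ignored, so "Russia" also matches inside "Russian".
    """
    return [w for w in words if w.strip() not in summary]


forced = [" Denmark ", " Ukraine ", " Russia "]

# Opening of the constrained XSUM summary for the first article (abbreviated).
constrained = (
    "The Ukraine Contact Group has been in Russia for the first time "
    "Ukraine  Russia  Denmark's military forces"
)
# Unconstrained XSUM summary for the same article.
unconstrained = "A meeting has been held to discuss Ukraine's response to the conflict."

print(missing_force_words(constrained, forced))
print(missing_force_words(unconstrained, forced))
```

The constrained summary contains every forced word, while the unconstrained one is missing two of them.  The check says nothing about whether the words were placed sensibly.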

## Including Start Words

If you know that you want the summary to begin a certain way, you could enforce start words.  More generally, you could define a template that you want the summary to follow.  Unfortunately, as far as I can tell, that is not possible at this time.

I looked into adding start words.  It seems that this code (https://github.com/huggingface/transformers/blob/518bd02c9b71291333ef374f055a4d1ac3042654/src/transformers/generation_beam_search.py#L389) could be edited to force every beam to begin with the tokens you specify, which would ensure that the beam search results in a summary that begins with your words.  However, it would take me a long time to implement in HuggingFace's source code.  It would also be difficult to ensure that only the first sentence in the summary started with those words.  To implement start words, you would pretty much have to implement the entire template feature mentioned in their blog post.

## Returning Multiple Beams

Below I will change `no_repeat_ngram_size` from 1 to 3, drop the forced words, and return multiple beams to see how they differ.
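
To see why the value of `no_repeat_ngram_size` matters, here is a pure-Python sketch of the rule as I understand it: a token is banned if appending it would repeat an n-gram that already occurs in the generated sequence.  `banned_next_tokens` is a hypothetical helper working on word strings; the library applies the same idea to token ids inside `generate`.

```python
def banned_next_tokens(generated, ngram_size):
    """Tokens that would complete an n-gram already present in `generated`."""
    if ngram_size < 1 or len(generated) + 1 < ngram_size:
        return set()
    # The last (n - 1) tokens are the start of the n-gram about to be completed.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


tokens = "the contact group met and the contact".split()

print(sorted(banned_next_tokens(tokens, 3)))
print(sorted(banned_next_tokens(tokens, 1)))
```

With a size of 3, only "group" is banned, because "the contact group" already occurred.  With a size of 1, every token generated so far is banned, so the model can never reuse even a word like "the".  That plausibly contributes to the rambling output of the constrained search above, which ran with `no_repeat_ngram_size=1`.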

In [18]:
for article_id, article in enumerate(articles):
    print("----------\nNews Article Summary\n")
    
    # no force_words_ids here: this cell runs an unconstrained beam search
    input_ids = bart_xsum_tokenizer(article, return_tensors="pt").input_ids

    outputs = bart_xsum_model_quantized.generate(
        input_ids,
        num_beams=5,
        num_return_sequences=3,
        no_repeat_ngram_size=3,
        remove_invalid_values=True,
    )
    print("XSUM Summaries:\n")
    for i in outputs:
        print(bart_xsum_tokenizer.decode(i, skip_special_tokens=True), "\n")

    input_ids = bart_cnn_tokenizer(article, return_tensors="pt").input_ids

    outputs = bart_cnn_model_quantized.generate(
        input_ids,
        num_beams=5,
        num_return_sequences=3,
        no_repeat_ngram_size=3,
        remove_invalid_values=True,
    )
    print("CNN Summaries: \n")
    for i in outputs:
        print(bart_cnn_tokenizer.decode(i, skip_special_tokens=True), "\n")


----------
News Article Summary

XSUM Summaries:

 The first meeting of the contact group in Ukraine has been held in the UK. 

 The first meeting of the contact group in Ukraine has been held in the country. 

 The first meeting of the contact group in Ukraine has been held in the city of Kiev. 

CNN Summaries: 

 U.S. Secretary of Defense: 'We will always be able to defend its own life in Ukraine' The second meeting of the second Ukraine Contact Group has been held by the same team. Secretary of the Defense Department for the first time. 'I'm just a few hours after the meeting,' says a spokesperson for the project. 

 U.S. Secretary of Defense: 'We will always be able to defend its own life in Ukraine' The second meeting of the second Ukraine Contact Group has been held by the same team. Secretary of the Defense Department for the first time. 'I'm just a few hours after the meeting,' says a spokesperson for the military. 

 U.S. Secretary of Defense: 'We will always be able to defend

# Summarizing by Paragraph

I wondered how the summaries would change if the models were given fewer sentences at a time.  Here I will segment each document into sentences, group the sentences into fixed-size chunks to simulate paragraphs, and summarize each paragraph separately.

In [34]:
def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)
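
A quick illustration of how `grouper` behaves when the number of sentences is not a multiple of the group size.  The sentence labels are placeholders, and the function is repeated so the snippet runs on its own.

```python
from itertools import zip_longest


def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # The same iterator repeated n times, so zip_longest pulls n items per group.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


# Placeholder "sentences" (assumption, for illustration only).
sents = ["s1", "s2", "s3", "s4", "s5", "s6", "s7"]
groups = list(grouper(sents, 3))
# Drop the None padding before joining a group back into one paragraph string.
paragraphs = [" ".join(s for s in g if s is not None) for g in groups]

print(groups)
print(paragraphs)
```

The last group is padded with `None`, which is why the padding has to be filtered out before joining.  Seven sentences in groups of three give three paragraphs, i.e. the ceiling of 7 / 3 rather than `7 // 3`.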

In [47]:
for article_id, article in enumerate(articles):
    sents = [sent for paragraph in segmenter.analyze(article) for sent in paragraph]
    sents = [(''.join(str(t) for t in s).strip()) for s in sents]
    print("----------\nNews Article Summary\n")
    
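    sents_per_paragraph = 3
    # ceiling division, so that a trailing partial group is counted as a paragraph
    n_paragraphs = -(-len(sents) // sents_per_paragraph)
    for pg, grouped in enumerate(grouper(sents, sents_per_paragraph), start=1):
        # drop the None padding that grouper adds to the last group
        grouped_sents = " ".join([g for g in grouped if g is not None])

        print(f"----------\nNews Article Paragraph {pg} of {n_paragraphs}\n")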

        input_ids = bart_xsum_tokenizer(grouped_sents, return_tensors="pt").input_ids

        outputs = bart_xsum_model_quantized.generate(
            input_ids,
            num_beams=5,
            num_return_sequences=1,
            no_repeat_ngram_size=1,
            remove_invalid_values=True,
        )
        print("XSUM Summaries:\n")
        for i in outputs:
            print(bart_xsum_tokenizer.decode(i, skip_special_tokens=True), "\n")

        input_ids = bart_cnn_tokenizer(grouped_sents, return_tensors="pt").input_ids

        outputs = bart_cnn_model_quantized.generate(
            input_ids,
            num_beams=5,
            num_return_sequences=1,
            no_repeat_ngram_size=1,
            remove_invalid_values=True,
        )
        print("CNN Summaries: \n")
        for i in outputs:
            print(bart_cnn_tokenizer.decode(i, skip_special_tokens=True), "\n")


----------
News Article Summary

----------
News Article Paragraph 1 of 2

XSUM Summaries:

 The BBC has been following a series of news reports from the UK and North America, as they are in their own country's most important battle with Ukraine to help improve its prospects for an international security service that will be under pressure on both sides when it is brought back into public space at this weekend. 

CNN Summaries: 

 Denmark has agreed to send'substantial support' for Ukraine. The second contact group meeting was held at a cost of $100,000 per cent in the U.S., or more than any other country where it will be called an anti-Uvoia military official says he is working on his side when they work together with Russia's Ousted from Moscow and Daugur: "I'm not going into this night" A report finds out that there are no plans but we'll see if you're doing anything better". This would have been released by public health officials until late last year (see how much money off). It a

# Conclusions

Constrained beam search tends to make the summaries worse.  They read better when the model is allowed to generate the summary without influence.

Summaries are not guaranteed to contain correct information.  Sometimes they make no sense.  Other times they directly contradict the original text: for example, one XSUM summary says Starbucks has been in Russia for five years, while the article says 15.

The summaries seem to get worse with fewer sentences, partly because the model starts predicting words from outside the domain of the article.  Paragraph-level summarization is likely not feasible with these models at this time.