# **Text summarization using an encoder-decoder architecture**

* Transformer-based encoder-decoder models were proposed by Vaswani et al. (2017) (https://arxiv.org/pdf/1706.03762.pdf) and have recently experienced a surge of interest.

## **Main references:**
* https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/Encoder_Decoder_Model.ipynb
* https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/Leveraging_Pre_trained_Checkpoints_for_Encoder_Decoder_Models.ipynb

## **Dataset:** 
CNN/Dailymail dataset

## **Installing TensorBoard**

In [37]:
!pip install Tensorboard



In [38]:
from time import time
from keras.callbacks import TensorBoard
tensorboard = TensorBoard(log_dir='logs/{}'.format(time()))

In [40]:
from torch.utils.tensorboard import SummaryWriter
tb = SummaryWriter()

### **Data Preprocessing**

Installing the required `datasets` and `transformers` libraries.

In [2]:
%%capture
!pip install datasets==1.0.2
!pip install transformers==4.2.1

Downloading the CNN/Dailymail dataset.

In [3]:
import datasets
train_data = datasets.load_dataset("cnn_dailymail", "3.0.0", split="train")

Downloading and preparing dataset cnn_dailymail/3.0.0 (download: 558.32 MiB, generated: 1.28 GiB, post-processed: Unknown size, total: 1.82 GiB) to /root/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/0128610a44e10f25b4af6689441c72af86205282d26399642f7db38fa7535602...


Dataset cnn_dailymail downloaded and prepared to /root/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/0128610a44e10f25b4af6689441c72af86205282d26399642f7db38fa7535602. Subsequent calls will reuse this data.


In [4]:
## Printing dataset information
train_data.info.description

'CNN/DailyMail non-anonymized summarization dataset.\n\nThere are two features:\n  - article: text of news article, used as the document to be summarized\n  - highlights: joined text of highlights with <s> and </s> around each\n    highlight, which is the target summary\n'

Our input is called *article* and our labels are called *highlights*. Let's now print out the first example of the training data to get a feeling for the data.

In [5]:
import pandas as pd
from IPython.display import display, HTML
from datasets import ClassLabel

df = pd.DataFrame(train_data[:1])
del df["id"]
for column, typ in train_data.features.items():
      if isinstance(typ, ClassLabel):
          df[column] = df[column].transform(lambda i: typ.names[i])
display(HTML(df.to_html()))

Unnamed: 0,article,highlights
0,"It's official: U.S. President Barack Obama wants lawmakers to weigh in on whether to use military force in Syria. Obama sent a letter to the heads of the House and Senate on Saturday night, hours after announcing that he believes military action against Syrian targets is the right step to take over the alleged use of chemical weapons. The proposed legislation from Obama asks Congress to approve the use of military force ""to deter, disrupt, prevent and degrade the potential for future uses of chemical weapons or other weapons of mass destruction."" It's a step that is set to turn an international crisis into a fierce domestic political battle. There are key questions looming over the debate: What did U.N. weapons inspectors find in Syria? What happens if Congress votes no? And how will the Syrian government react? In a televised address from the White House Rose Garden earlier Saturday, the president said he would take his case to Congress, not because he has to -- but because he wants to. ""While I believe I have the authority to carry out this military action without specific congressional authorization, I know that the country will be stronger if we take this course, and our actions will be even more effective,"" he said. ""We should have this debate, because the issues are too big for business as usual."" Obama said top congressional leaders had agreed to schedule a debate when the body returns to Washington on September 9. The Senate Foreign Relations Committee will hold a hearing over the matter on Tuesday, Sen. Robert Menendez said. Transcript: Read Obama's full remarks . Syrian crisis: Latest developments . U.N. inspectors leave Syria . Obama's remarks came shortly after U.N. inspectors left Syria, carrying evidence that will determine whether chemical weapons were used in an attack early last week in a Damascus suburb. 
""The aim of the game here, the mandate, is very clear -- and that is to ascertain whether chemical weapons were used -- and not by whom,"" U.N. spokesman Martin Nesirky told reporters on Saturday. But who used the weapons in the reported toxic gas attack in a Damascus suburb on August 21 has been a key point of global debate over the Syrian crisis. Top U.S. officials have said there's no doubt that the Syrian government was behind it, while Syrian officials have denied responsibility and blamed jihadists fighting with the rebels. British and U.S. intelligence reports say the attack involved chemical weapons, but U.N. officials have stressed the importance of waiting for an official report from inspectors. The inspectors will share their findings with U.N. Secretary-General Ban Ki-moon Ban, who has said he wants to wait until the U.N. team's final report is completed before presenting it to the U.N. Security Council. The Organization for the Prohibition of Chemical Weapons, which nine of the inspectors belong to, said Saturday that it could take up to three weeks to analyze the evidence they collected. ""It needs time to be able to analyze the information and the samples,"" Nesirky said. He noted that Ban has repeatedly said there is no alternative to a political solution to the crisis in Syria, and that ""a military solution is not an option."" Bergen: Syria is a problem from hell for the U.S. Obama: 'This menace must be confronted' Obama's senior advisers have debated the next steps to take, and the president's comments Saturday came amid mounting political pressure over the situation in Syria. Some U.S. lawmakers have called for immediate action while others warn of stepping into what could become a quagmire. Some global leaders have expressed support, but the British Parliament's vote against military action earlier this week was a blow to Obama's hopes of getting strong backing from key NATO allies. 
On Saturday, Obama proposed what he said would be a limited military action against Syrian President Bashar al-Assad. Any military attack would not be open-ended or include U.S. ground forces, he said. Syria's alleged use of chemical weapons earlier this month ""is an assault on human dignity,"" the president said. A failure to respond with force, Obama argued, ""could lead to escalating use of chemical weapons or their proliferation to terrorist groups who would do our people harm. In a world with many dangers, this menace must be confronted."" Syria missile strike: What would happen next? Map: U.S. and allied assets around Syria . Obama decision came Friday night . On Friday night, the president made a last-minute decision to consult lawmakers. What will happen if they vote no? It's unclear. A senior administration official told CNN that Obama has the authority to act without Congress -- even if Congress rejects his request for authorization to use force. Obama on Saturday continued to shore up support for a strike on the al-Assad government. He spoke by phone with French President Francois Hollande before his Rose Garden speech. ""The two leaders agreed that the international community must deliver a resolute message to the Assad regime -- and others who would consider using chemical weapons -- that these crimes are unacceptable and those who violate this international norm will be held accountable by the world,"" the White House said. Meanwhile, as uncertainty loomed over how Congress would weigh in, U.S. military officials said they remained at the ready. 5 key assertions: U.S. intelligence report on Syria . Syria: Who wants what after chemical weapons horror . Reactions mixed to Obama's speech . A spokesman for the Syrian National Coalition said that the opposition group was disappointed by Obama's announcement. ""Our fear now is that the lack of action could embolden the regime and they repeat his attacks in a more serious way,"" said spokesman Louay Safi. 
""So we are quite concerned."" Some members of Congress applauded Obama's decision. House Speaker John Boehner, Majority Leader Eric Cantor, Majority Whip Kevin McCarthy and Conference Chair Cathy McMorris Rodgers issued a statement Saturday praising the president. ""Under the Constitution, the responsibility to declare war lies with Congress,"" the Republican lawmakers said. ""We are glad the president is seeking authorization for any military action in Syria in response to serious, substantive questions being raised."" More than 160 legislators, including 63 of Obama's fellow Democrats, had signed letters calling for either a vote or at least a ""full debate"" before any U.S. action. British Prime Minister David Cameron, whose own attempt to get lawmakers in his country to support military action in Syria failed earlier this week, responded to Obama's speech in a Twitter post Saturday. ""I understand and support Barack Obama's position on Syria,"" Cameron said. An influential lawmaker in Russia -- which has stood by Syria and criticized the United States -- had his own theory. ""The main reason Obama is turning to the Congress: the military operation did not get enough support either in the world, among allies of the US or in the United States itself,"" Alexei Pushkov, chairman of the international-affairs committee of the Russian State Duma, said in a Twitter post. In the United States, scattered groups of anti-war protesters around the country took to the streets Saturday. ""Like many other Americans...we're just tired of the United States getting involved and invading and bombing other countries,"" said Robin Rosecrans, who was among hundreds at a Los Angeles demonstration. What do Syria's neighbors think? Why Russia, China, Iran stand by Assad . Syria's government unfazed . 
After Obama's speech, a military and political analyst on Syrian state TV said Obama is ""embarrassed"" that Russia opposes military action against Syria, is ""crying for help"" for someone to come to his rescue and is facing two defeats -- on the political and military levels. Syria's prime minister appeared unfazed by the saber-rattling. ""The Syrian Army's status is on maximum readiness and fingers are on the trigger to confront all challenges,"" Wael Nader al-Halqi said during a meeting with a delegation of Syrian expatriates from Italy, according to a banner on Syria State TV that was broadcast prior to Obama's address. An anchor on Syrian state television said Obama ""appeared to be preparing for an aggression on Syria based on repeated lies."" A top Syrian diplomat told the state television network that Obama was facing pressure to take military action from Israel, Turkey, some Arabs and right-wing extremists in the United States. ""I think he has done well by doing what Cameron did in terms of taking the issue to Parliament,"" said Bashar Jaafari, Syria's ambassador to the United Nations. Both Obama and Cameron, he said, ""climbed to the top of the tree and don't know how to get down."" The Syrian government has denied that it used chemical weapons in the August 21 attack, saying that jihadists fighting with the rebels used them in an effort to turn global sentiments against it. British intelligence had put the number of people killed in the attack at more than 350. On Saturday, Obama said ""all told, well over 1,000 people were murdered."" U.S. Secretary of State John Kerry on Friday cited a death toll of 1,429, more than 400 of them children. No explanation was offered for the discrepancy. Iran: U.S. 
military action in Syria would spark 'disaster' Opinion: Why strikes in Syria are a bad idea .","Syrian official: Obama climbed to the top of the tree, ""doesn't know how to get down""\nObama sends a letter to the heads of the House and Senate .\nObama to seek congressional approval on military action against Syria .\nAim is to determine whether CW were used, not by whom, says U.N. spokesman ."



* The input data seems to consist of short news articles. 
* Interestingly, the labels appear to be bullet-point-like summaries. 
* At this point, one should probably take a look at a couple of other examples to get a better feeling for the data.
* The text is *case-sensitive*. This means that we have to be careful if we want to use *case-insensitive* models.
* As *CNN/Dailymail* is a summarization dataset, the model will be evaluated using the *ROUGE* metric. 
* As models compute length in *token-length*, we will make use of the `bert-base-uncased` tokenizer to compute the article and summary length.



## **Loading the tokenizer**

In [6]:
from transformers import BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/466k [00:00<?, ?B/s]

* We use `.map()` to compute the token length of each article and its summary.
* Since the maximum length that `bert-base-uncased` can process is 512 tokens, we are also interested in the percentage of input samples that are longer than this maximum.
* Similarly, we compute the percentage of summaries that are longer than 64 and 128 tokens, respectively.

The function passed to `.map()` is defined as follows.

In [7]:
# map article and summary len to dict as well as if sample is longer than 512 tokens
def map_to_length(x):
  x["article_len"] = len(tokenizer(x["article"]).input_ids)
  x["article_longer_512"] = int(x["article_len"] > 512)
  x["summary_len"] = len(tokenizer(x["highlights"]).input_ids)
  x["summary_longer_64"] = int(x["summary_len"] > 64)
  x["summary_longer_128"] = int(x["summary_len"] > 128)
  return x

It should be sufficient to look at the first 10000 samples. We can speed up the mapping by using multiple processes with `num_proc=4`.

In [8]:
sample_size = 10000
data_stats = train_data.select(range(sample_size)).map(map_to_length, num_proc=4)


Token indices sequence length is longer than the specified maximum sequence length for this model (1959 > 512). Running this sequence through the model will result in indexing errors


Having computed the length for the first 10000 samples, we should now average them together. For this, we can make use of the `.map()` function with `batched=True` and `batch_size=-1` to have access to all 10000 samples within the `.map()` function.

In [9]:
def compute_and_print_stats(x):
  if len(x["article_len"]) == sample_size:
    print(
        "Article Mean: {}, %-Articles > 512:{}, Summary Mean:{}, %-Summary > 64:{}, %-Summary > 128:{}".format(
            sum(x["article_len"]) / sample_size,
            sum(x["article_longer_512"]) / sample_size, 
            sum(x["summary_len"]) / sample_size,
            sum(x["summary_longer_64"]) / sample_size,
            sum(x["summary_longer_128"]) / sample_size,
        )
    )

output = data_stats.map(
  compute_and_print_stats, 
  batched=True,
  batch_size=-1,
)

  0%|          | 0/1 [00:00<?, ?ba/s]

Article Mean: 847.6216, %-Articles > 512:0.7355, Summary Mean:57.7742, %-Summary > 64:0.3185, %-Summary > 128:0.0


* About 3/4 of the articles (73.55%) are longer than the model's `max_length` of 512 tokens.
* The summaries are on average about 58 tokens long. Over 30% of the 10000 sampled summaries are longer than 64 tokens, but none are longer than 128 tokens.
* `bert-base-uncased` is limited to 512 tokens, which means we would have to cut possibly important information from the article.
* Because most of the important information is often found at the beginning of articles and because we want to be computationally efficient, we decide to stick to `bert-base-uncased` with a `max_length` of 512 in this notebook.
* Regarding the summary length, we can see that a length of 128 already includes all of the sampled summary labels.
* 128 is easily within the limits of `bert-base-uncased`.

* We use the `.map()` function again, this time to transform each training batch into a batch of model inputs.

* `"article"` and `"highlights"` are tokenized and prepared as the Encoder's `"input_ids"` and Decoder's `"decoder_input_ids"` respectively.

* `"labels"` are shifted automatically to the left for language modeling training.

Lastly, it is very important to ignore the loss of the padded labels. In Transformers this can be done by setting the label to -100.

In [10]:
encoder_max_length=512
decoder_max_length=128

def process_data_to_model_inputs(batch):
  # tokenize the inputs and labels
  inputs = tokenizer(batch["article"], padding="max_length", truncation=True, max_length=encoder_max_length)
  outputs = tokenizer(batch["highlights"], padding="max_length", truncation=True, max_length=decoder_max_length)

  batch["input_ids"] = inputs.input_ids
  batch["attention_mask"] = inputs.attention_mask
  batch["decoder_input_ids"] = outputs.input_ids
  batch["decoder_attention_mask"] = outputs.attention_mask
  batch["labels"] = outputs.input_ids.copy()

  # because BERT automatically shifts the labels, the labels correspond exactly to `decoder_input_ids`. 
  # We have to make sure that the PAD token is ignored
  batch["labels"] = [[-100 if token == tokenizer.pad_token_id else token for token in labels] for labels in batch["labels"]]

  return batch
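
To make the padding and label-masking steps above concrete, here is a minimal pure-Python sketch on toy token ids. The helper names `pad_or_truncate` and `mask_padding` are hypothetical and only mimic, in simplified form, what `padding="max_length"`, `truncation=True`, and the `-100` replacement do; the real tokenizer also keeps the closing `[SEP]` token when it truncates, which this sketch does not.

```python
PAD_ID = 0            # id of [PAD] in the bert-base-uncased vocabulary
IGNORE_INDEX = -100   # label value that the loss function skips


def pad_or_truncate(ids, max_length, pad_id=PAD_ID):
    """Cut `ids` to `max_length` and right-pad with `pad_id` (simplified)."""
    ids = ids[:max_length]
    num_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * num_pad
    return ids + [pad_id] * num_pad, attention_mask


def mask_padding(labels, pad_id=PAD_ID):
    """Replace padded positions by IGNORE_INDEX so they add nothing to the loss."""
    return [IGNORE_INDEX if token == pad_id else token for token in labels]


# toy summary: [CLS], two arbitrary token ids, [SEP]
summary_ids = [101, 2023, 2003, 102]
decoder_input_ids, decoder_attention_mask = pad_or_truncate(summary_ids, max_length=6)
labels = mask_padding(decoder_input_ids)

print(decoder_input_ids)       # [101, 2023, 2003, 102, 0, 0]
print(decoder_attention_mask)  # [1, 1, 1, 1, 0, 0]
print(labels)                  # [101, 2023, 2003, 102, -100, -100]
```

Only the `labels` are masked. The `decoder_input_ids` keep the real padding id, and the `decoder_attention_mask` tells the model which positions are padding.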

* For demonstration, we train and evaluate the model on just a few training examples and set the `batch_size` to 4 to prevent out-of-memory issues.
* The following line reduces the training data to only the first `32` examples.

In [11]:
train_data = train_data.select(range(32))

## **Preparation of Training Data**

In [12]:
# batch_size = 16
batch_size=4

train_data = train_data.map(
    process_data_to_model_inputs, 
    batched=True, 
    batch_size=batch_size, 
    remove_columns=["article", "highlights", "id"]
)

  0%|          | 0/8 [00:00<?, ?ba/s]

In [13]:
train_data

Dataset(features: {'attention_mask': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'decoder_attention_mask': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'decoder_input_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'input_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'labels': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}, num_rows: 32)

Converting the data to PyTorch Tensors to be trained on GPU.

In [14]:
train_data.set_format(
    type="torch", columns=["input_ids", "attention_mask", "decoder_input_ids", "decoder_attention_mask", "labels"],
)


We now do the same for the validation data.
First, we load only 10% of the validation dataset and then keep just the first 8 examples.

In [15]:
val_data = datasets.load_dataset("cnn_dailymail", "3.0.0", split="validation[:10%]")

Reusing dataset cnn_dailymail (/root/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/0128610a44e10f25b4af6689441c72af86205282d26399642f7db38fa7535602)


In [16]:
val_data = val_data.select(range(8))

The mapping function is applied.

In [17]:
val_data = val_data.map(
    process_data_to_model_inputs, 
    batched=True, 
    batch_size=batch_size, 
    remove_columns=["article", "highlights", "id"]
)

  0%|          | 0/2 [00:00<?, ?ba/s]

The entire validation data is also converted to PyTorch tensors.

In [18]:
val_data.set_format(
    type="torch", columns=["input_ids", "attention_mask", "decoder_input_ids", "decoder_attention_mask", "labels"],
)

### **Warm-starting the Encoder-Decoder Model**

Reference:
https://huggingface.co/transformers/model_doc/encoderdecoder.html

In [19]:
from transformers import EncoderDecoderModel

* We warm-start the *BERT2BERT* model.
* Both the encoder and the decoder are warm-started from the `"bert-base-uncased"` checkpoint.

In [20]:
bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased")

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertLMHeadModel: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertLMHeadModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertLMHeadModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertLMHeadModel were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['bert.encoder.layer.0.crossattention.self.query.weight', 'bert.encoder.layer.0.crossattention.self.query.bias', 'bert.encoder.layer.0.crossattention.self.key.weight', 'bert.encoder.layer.0.crossattention.self.key.bias', 'bert.encoder.layer

## **Network architecture**

In [21]:
bert2bert

EncoderDecoderModel(
  (encoder): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_af

* `bert2bert.encoder` is an instance of `BertModel` and `bert2bert.decoder` is an instance of `BertLMHeadModel`.
* Both instances are now combined into a single `torch.nn.Module` and can thus be saved together as a single checkpoint with `.save_pretrained(...)`.


In [22]:
bert2bert.save_pretrained("bert2bert")

Similarly, the model can be reloaded using the standard `.from_pretrained(...)` method.

In [23]:
bert2bert = EncoderDecoderModel.from_pretrained("bert2bert")

In [24]:
bert2bert.config

EncoderDecoderConfig {
  "_name_or_path": "bert2bert",
  "architectures": [
    "EncoderDecoderModel"
  ],
  "decoder": {
    "_name_or_path": "bert-base-uncased",
    "add_cross_attention": true,
    "architectures": [
      "BertForMaskedLM"
    ],
    "attention_probs_dropout_prob": 0.1,
    "bad_words_ids": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "eos_token_id": null,
    "finetuning_task": null,
    "gradient_checkpointing": false,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "initializer_range": 0.02,
    "intermediate_size": 3072,
    "is_decoder": true,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "layer_norm_eps": 1e-12,
    "length_penalty": 1.0,
    "max_length": 20,

To create a shared encoder-decoder model, the parameter `tie_encoder_decoder=True` can additionally be passed.

In [25]:
shared_bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-cased", "bert-base-cased", tie_encoder_decoder=True)


Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertLMHeadModel: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertLMHeadModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertLMHeadModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertLMHeadModel were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['bert.encoder.layer.0.crossattention.self.query.weight', 'bert.encoder.layer.0.crossattention.self.query.bias', 'bert.encoder.layer.0.crossattention.self.key.weight', 'bert.encoder.layer.0.crossattention.self.key.bias', 'bert.encoder.layer.0.c

**As a comparison, we can see that the tied model has far fewer parameters, as expected.**

In [26]:
print(f"\n\nNum Params. Shared: {shared_bert2bert.num_parameters()}, Non-Shared: {bert2bert.num_parameters()}")



Num Params. Shared: 137298244, Non-Shared: 247363386
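
As a quick worked check of the two parameter counts printed above, the following sketch computes how much smaller the tied model is. The variable names are illustrative only. Note that the two models in this notebook were built from different checkpoints (`bert-base-cased` for the shared model, `bert-base-uncased` for the non-shared one), so the comparison is approximate.

```python
# parameter counts copied from the output above
shared_params = 137_298_244
non_shared_params = 247_363_386

saved = non_shared_params - shared_params
reduction = saved / non_shared_params

print(f"Parameters saved by tying: {saved:,}")
print(f"Relative reduction: {reduction:.1%}")
```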


In this notebook we train a non-shared *BERT2BERT* model, so we continue with `bert2bert` and not `shared_bert2bert`.

In [27]:
# free memory
del shared_bert2bert

We have warm-started a `bert2bert` model.

**Setting the special tokens:**

* `bert-base-uncased` does not have a `decoder_start_token_id` or an `eos_token_id`, so we use its `cls_token_id` and `sep_token_id` respectively.
* We should also define a `pad_token_id` on the config and make sure the correct `vocab_size` is set.

In [28]:
bert2bert.config.decoder_start_token_id = tokenizer.cls_token_id
bert2bert.config.eos_token_id = tokenizer.sep_token_id
bert2bert.config.pad_token_id = tokenizer.pad_token_id
bert2bert.config.vocab_size = bert2bert.config.encoder.vocab_size
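
To illustrate why the start and end tokens matter, here is a minimal sketch of a greedy decoding loop in plain Python. It is not the `generate()` implementation from Transformers. `greedy_decode` and `toy_next_token` are hypothetical names, and the toy "model" simply replays a fixed script instead of predicting tokens. The ids 101 and 102 are the `[CLS]` and `[SEP]` ids of the `bert-base-uncased` vocabulary.

```python
CLS_ID = 101  # used as decoder_start_token_id
SEP_ID = 102  # used as eos_token_id


def greedy_decode(next_token_fn, start_id, eos_id, max_length):
    """Start from `start_id` and append tokens until `eos_id` or `max_length`."""
    tokens = [start_id]
    while len(tokens) < max_length:
        token = next_token_fn(tokens)
        tokens.append(token)
        if token == eos_id:
            break
    return tokens


def toy_next_token(tokens):
    """Stand-in for a decoder step: emits 7, 8, 9 and then the end token."""
    script = [7, 8, 9, SEP_ID]
    return script[min(len(tokens) - 1, len(script) - 1)]


print(greedy_decode(toy_next_token, CLS_ID, SEP_ID, max_length=10))  # [101, 7, 8, 9, 102]
print(greedy_decode(toy_next_token, CLS_ID, SEP_ID, max_length=3))   # [101, 7, 8]
```

Without a start token the decoder has nothing to condition its first prediction on, and without an end token generation only stops when `max_length` is reached.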

* Next, we define all parameters related to beam search decoding.
* Since `bart-large-cnn` yields good results on CNN/Dailymail, we copy its beam search decoding parameters.

In [29]:
bert2bert.config.max_length = 142
bert2bert.config.min_length = 56
bert2bert.config.no_repeat_ngram_size = 3
bert2bert.config.early_stopping = True
bert2bert.config.length_penalty = 2.0
bert2bert.config.num_beams = 4
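
Of the parameters above, `no_repeat_ngram_size = 3` is perhaps the least self-explanatory: it forbids any 3-gram from appearing twice in a generated sequence. The sketch below illustrates the idea in plain Python. The function name `banned_next_tokens` is hypothetical and this is not the library's own implementation.

```python
def banned_next_tokens(generated, ngram_size):
    """Return the tokens that would complete an n-gram already present in `generated`."""
    if len(generated) < ngram_size - 1:
        return set()
    # the last (n - 1) tokens form the prefix of the next n-gram
    prefix = tuple(generated[-(ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        if tuple(generated[i:i + ngram_size - 1]) == prefix:
            banned.add(generated[i + ngram_size - 1])
    return banned


# the 3-gram (5, 7, 9) already occurred, so after "... 5 7" the token 9 is banned
print(banned_next_tokens([5, 7, 9, 5, 7], ngram_size=3))  # {9}
print(banned_next_tokens([5, 7, 9, 5, 8], ngram_size=3))  # set()
```

During beam search, the scores of banned tokens would be set to negative infinity before the next token is chosen, which keeps summaries from looping over the same phrase.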

### **Fine-Tuning Warm-Started Encoder-Decoder Models**

Making use of `Seq2SeqTrainer` to fine-tune a warm-started encoder-decoder model.

In [30]:
## Importing the `Seq2SeqTrainer` and its training arguments `Seq2SeqTrainingArguments`
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

In [31]:
%%capture
!pip install git-python==1.0.3
!pip install rouge_score
!pip install sacrebleu

In [32]:
training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy="steps",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    fp16=True, 
    output_dir="./",
    logging_steps=2,
    save_steps=10,
    eval_steps=4,
    # logging_steps=1000,
    # save_steps=500,
    # eval_steps=7500,
    # warmup_steps=2000,
    # save_total_limit=3,
)

Like most summarization tasks, CNN/Dailymail is typically evaluated using the ROUGE score. 

Loading the ROUGE metric using the datasets library.

In [33]:
rouge = datasets.load_metric("rouge")

Downloading:   0%|          | 0.00/1.66k [00:00<?, ?B/s]

In [34]:
def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(predictions=pred_str, references=label_str, rouge_types=["rouge2"])["rouge2"].mid

    return {
        "rouge2_precision": round(rouge_output.precision, 4),
        "rouge2_recall": round(rouge_output.recall, 4),
        "rouge2_fmeasure": round(rouge_output.fmeasure, 4),
    }
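
For intuition about what the three reported numbers mean, here is a simplified pure-Python sketch of ROUGE-2 for a single prediction and reference pair. It is an assumption-laden illustration, not the `rouge_score` implementation: it splits on whitespace only, applies no stemming, and does not perform the aggregation over many samples behind `.mid`. The names `bigrams` and `rouge2` are hypothetical.

```python
from collections import Counter


def bigrams(text):
    """Count the bigrams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2(prediction, reference):
    """Return (precision, recall, f-measure) based on bigram overlap."""
    pred, ref = bigrams(prediction), bigrams(reference)
    overlap = sum((pred & ref).values())  # clipped count of shared bigrams
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if overlap == 0:
        return precision, recall, 0.0
    fmeasure = 2 * precision * recall / (precision + recall)
    return precision, recall, fmeasure


p, r, f = rouge2(
    "obama sends a letter to congress",
    "obama sends a letter to the senate",
)
print(round(p, 4), round(r, 4), round(f, 4))  # 0.8 0.6667 0.7273
```

Precision is the share of predicted bigrams that also occur in the reference, recall is the share of reference bigrams that were predicted, and the F-measure is their harmonic mean.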

Passing all arguments to the `Seq2SeqTrainer` to start finetuning. 

In [44]:
# instantiate trainer
trainer = Seq2SeqTrainer(
    model=bert2bert,
    tokenizer=tokenizer,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_data,
    eval_dataset=val_data,
)
trainer.train()

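| Step | Training Loss | Validation Loss | Rouge2 Precision | Rouge2 Recall | Rouge2 Fmeasure | Runtime | Samples Per Second |
|------|---------------|-----------------|------------------|---------------|-----------------|---------|--------------------|
| 4 | 7.7434 | 7.596498 | 0.0059 | 0.0109 | 0.0076 | 6.3832 | 1.253 |
| 8 | 6.8493 | 7.56575 | 0.0 | 0.0 | 0.0 | 6.3436 | 1.261 |
| 12 | 6.9985 | 7.41835 | 0.0 | 0.0 | 0.0 | 6.3621 | 1.257 |
| 16 | 6.5213 | 7.324226 | 0.0065 | 0.0151 | 0.009 | 6.2697 | 1.276 |
| 20 | 6.4324 | 7.287829 | 0.0045 | 0.0105 | 0.0057 | 6.2886 | 1.272 |
| 24 | 6.0335 | 7.243678 | 0.0019 | 0.0066 | 0.0029 | 6.3721 | 1.255 |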


TrainOutput(global_step=24, training_loss=6.772713144620259, metrics={'train_runtime': 88.5302, 'train_samples_per_second': 0.271, 'total_flos': 91188038615040, 'epoch': 3.0})
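
The step numbers in the log above follow directly from the settings used in this notebook. The short sketch below reproduces that arithmetic, assuming a single device, no gradient accumulation, and 3 training epochs (consistent with `'epoch': 3.0` in the output). The variable names are illustrative.

```python
import math

num_examples = 32   # train_data.select(range(32))
batch_size = 4      # per_device_train_batch_size
num_epochs = 3      # assumed; matches 'epoch': 3.0 in the TrainOutput
eval_steps = 4
save_steps = 10

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
eval_at = list(range(eval_steps, total_steps + 1, eval_steps))
checkpoints = list(range(save_steps, total_steps + 1, save_steps))

print(total_steps)   # 24
print(eval_at)       # [4, 8, 12, 16, 20, 24]
print(checkpoints)   # [10, 20]
```

This matches `global_step=24`, the six evaluation rows, and the checkpoint directories written during training.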

In [None]:
!ls

bert2bert  checkpoint-10  checkpoint-20  runs  sample_data


Loading the checkpoint as usual via the `EncoderDecoderModel.from_pretrained()` method.

In [None]:
dummy_bert2bert = EncoderDecoderModel.from_pretrained("./checkpoint-20")

### **Evaluation**

* We evaluate the *BERT2BERT* model on the test data.

* We load a *BERT2BERT* model that was fine-tuned on the full training dataset, together with its tokenizer, which is just a copy of the `bert-base-uncased` tokenizer.

In [None]:
from transformers import BertTokenizer

bert2bert = EncoderDecoderModel.from_pretrained("patrickvonplaten/bert2bert_cnn_daily_mail").to("cuda")
tokenizer = BertTokenizer.from_pretrained("patrickvonplaten/bert2bert_cnn_daily_mail")

Downloading:   0%|          | 0.00/3.66k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/990M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/156 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/252 [00:00<?, ?B/s]

Next, we load just 2% of *CNN/Dailymail's* test data. For the full evaluation, one should obviously use 100% of the data.

In [None]:
test_data = datasets.load_dataset("cnn_dailymail", "3.0.0", split="test[:2%]")

Reusing dataset cnn_dailymail (/root/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/0128610a44e10f25b4af6689441c72af86205282d26399642f7db38fa7535602)


Using the `map()` function to generate a summary for each test sample.

For each data sample:

- first, tokenizing the article
- second, generating the output token ids, and
- third, decoding the output token ids to obtain our predicted summary.

In [None]:
def generate_summary(batch):
    # cut off at BERT max length 512
    inputs = tokenizer(batch["article"], padding="max_length", truncation=True, max_length=512, return_tensors="pt")
    input_ids = inputs.input_ids.to("cuda")
    attention_mask = inputs.attention_mask.to("cuda")
    outputs = bert2bert.generate(input_ids, attention_mask=attention_mask)
    output_str = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    batch["pred_summary"] = output_str
    return batch

Let's run the map function to obtain the *results* dictionary that has the model's predicted summary stored for each sample.

In [None]:
batch_size = 16  # change to 64 for full evaluation
results = test_data.map(generate_summary, batched=True, batch_size=batch_size, remove_columns=["article"])

  0%|          | 0/15 [00:00<?, ?ba/s]
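
With `batched=True`, the mapped function receives a dictionary that maps each column name to a list of values, one entry per sample in the batch, rather than a single sample. The following is a minimal sketch of that contract in plain Python. It is not how `datasets` implements `map()`; `toy_batched_map` and `fake_summary` are hypothetical names used only for illustration.

```python
import math


def toy_batched_map(rows, fn, batch_size):
    """Apply fn to dict-of-lists batches, mimicking batched=True (illustration only)."""
    out_rows = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of row dicts into one dict of column lists.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        result = fn(batch)
        # Turn the returned dict of lists back into per-row dicts.
        out_rows.extend(
            {key: result[key][i] for key in result} for i in range(len(chunk))
        )
    return out_rows


def fake_summary(batch):
    # Hypothetical stand-in for generate_summary: keep the first three words.
    batch["pred_summary"] = [" ".join(a.split()[:3]) for a in batch["article"]]
    return batch


rows = [{"article": f"article number {i} has some text"} for i in range(5)]
out = toy_batched_map(rows, fake_summary, batch_size=2)
num_batches = math.ceil(len(rows) / 2)
print(num_batches)             # 3
print(out[4]["pred_summary"])  # article number 4
```

The number of batches is the sample count divided by the batch size, rounded up, which is why the last batch may be smaller than `batch_size`.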

## **Computing the ROUGE score**

In [None]:
rouge.compute(predictions=results["pred_summary"], references=results["highlights"], rouge_types=["rouge2"])["rouge2"].mid

Score(precision=0.10389454113300968, recall=0.1564771201053348, fmeasure=0.12175271663717585)
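
To make the three numbers above concrete, here is a rough sketch of how a ROUGE-2 score can be computed for a single prediction/reference pair from bigram overlap. It assumes plain whitespace tokenization and lowercasing, so it will not reproduce the metric's exact values (the library applies its own tokenization and aggregates over all samples, and `.mid` selects the middle value of that aggregate). The function names are illustrative only.

```python
from collections import Counter


def bigrams(text):
    # Whitespace tokenization is an assumption; the real metric tokenizes differently.
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2_sketch(prediction, reference):
    pred, ref = bigrams(prediction), bigrams(reference)
    # Clipped overlap: a bigram counts at most as often as it occurs in both texts.
    overlap = sum((pred & ref).values())
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    denom = precision + recall
    fmeasure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, fmeasure


p, r, f = rouge2_sketch("the cat sat on the mat", "the cat sat on a red mat")
print(round(p, 3), round(r, 3), round(f, 3))  # 0.6 0.5 0.545
```

Precision is the share of predicted bigrams that also appear in the reference, recall is the share of reference bigrams recovered by the prediction, and the F-measure is their harmonic mean.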

In [None]:
predictions = results["pred_summary"]

In [None]:
predictions[0]

'best known for his portrayal of sheriff rosco p. coltrane on tv\'s " the dukes of hazzard " best died monday after a brief illness. best\'s co - stars paid tribute to the late actor on social media. " jimmy best was the most creative person i have ever known, " a friend says.'

## **Gradio**

In [None]:
!pip install -q gradio

In [None]:
import gradio as gr

def summarize(text):
    # Tokenize the input, cutting it off at BERT's max length of 512 tokens.
    inputs = tokenizer(text, padding="max_length", truncation=True, max_length=512, return_tensors="pt")
    input_ids = inputs.input_ids.to("cuda")
    attention_mask = inputs.attention_mask.to("cuda")
    outputs = bert2bert.generate(input_ids, attention_mask=attention_mask)
    output_str = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    # batch_decode returns a list with one entry per input; return the single summary string.
    return output_str[0]

gr.Interface(fn=summarize,
             inputs="textbox",
             outputs="textbox").launch(share=True);

Colab notebook detected. To show errors in colab notebook, set `debug=True` in `launch()`
This share link will expire in 72 hours. If you need a permanent link, visit: https://gradio.app/introducing-hosted
Running on External URL: https://20667.gradio.app


In [None]:
text = 'It\'s official: U.S. President Barack Obama wants lawmakers to weigh in on whether to use military force in Syria. Obama sent a letter to the heads of the House and Senate on Saturday night, hours after announcing that he believes military action against Syrian targets is the right step to take over the alleged use of chemical weapons. The proposed legislation from Obama asks Congress to approve the use of military force "to deter, disrupt, prevent and degrade the potential for future uses of chemical weapons or other weapons of mass destruction." It\'s a step that is set to turn an international crisis into a fierce domestic political battle. There are key questions looming over the debate: What did U.N. weapons inspectors find in Syria? What happens if Congress votes no? And how will the Syrian government react? In a televised address from the White House Rose Garden earlier Saturday, the president said he would take his case to Congress, not because he has to -- but because he wants to. "While I believe I have the authority to carry out this military action without specific congressional authorization, I know that the country will be stronger if we take this course, and our actions will be even more effective," he said. "We should have this debate, because the issues are too big for business as usual." Obama said top congressional leaders had agreed to schedule a debate when the body returns to Washington on September 9. The Senate Foreign Relations Committee will hold a hearing over the matter on Tuesday, Sen. Robert Menendez said. Transcript: Read Obama\'s full remarks . Syrian crisis: Latest developments . U.N. inspectors leave Syria . Obama\'s remarks came shortly after U.N. inspectors left Syria, carrying evidence that will determine whether chemical weapons were used in an attack early last week in a Damascus suburb. \
"The aim of the game here, the mandate, is very clear -- and that is to ascertain whether chemical weapons were used -- and not by whom," U.N. spokesman Martin Nesirky told reporters on Saturday. But who used the weapons in the reported toxic gas attack in a Damascus suburb on August 21 has been a key point of global debate over the Syrian crisis. Top U.S. officials have said there\'s no doubt that the Syrian government was behind it, while Syrian officials have denied responsibility and blamed jihadists fighting with the rebels. British and U.S. intelligence reports say the attack involved chemical weapons, but U.N. officials have stressed the importance of waiting for an official report from inspectors. The inspectors will share their findings with U.N. Secretary-General Ban Ki-moon Ban, who has said he wants to wait until the U.N. team\'s final report is completed before presenting it to the U.N. Security Council. The Organization for the Prohibition of Chemical Weapons, which nine of the inspectors belong to, said Saturday that it could take up to three weeks to analyze the evidence they collected. "It needs time to be able to analyze the information and the samples," Nesirky said. He noted that Ban has repeatedly said there is no alternative to a political solution to the crisis in Syria, and that "a military solution is not an option." Bergen: Syria is a problem from hell for the U.S. Obama:'
inputs = tokenizer(text, padding="max_length", truncation=True, max_length=512, return_tensors="pt")

In [None]:
input_ids = inputs.input_ids.to("cuda")
attention_mask = inputs.attention_mask.to("cuda")
outputs = bert2bert.generate(input_ids, attention_mask=attention_mask)
output_str = tokenizer.batch_decode(outputs, skip_special_tokens=True)

In [None]:
output_str

["new : obama says he will take his case to congress on tuesday. new : u. s. officials have said there's no doubt that the syrian government is behind the attack. the u. n. inspectors leave syria on saturday, carrying evidence that will determine whether chemical weapons were used. the president says he has the authority to carry out the military action without specific congressional authorization."]