In [2]:
import warnings
warnings.filterwarnings('ignore')

# Summarization

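Summarization condenses a long text into a shorter one that keeps the key information. Models exist for many languages. Many English summarization checkpoints were fine-tuned on news articles, typically XSum (BBC) or CNN/DailyMail.

BART is a sequence-to-sequence model that generates its summary token by token, so it can use phrases and terms that do not appear in the source while keeping the overall meaning (abstractive summarization).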
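
One rough way to see how abstractive a summary is: count how many of its word n-grams never occur in the source. The sketch below is a minimal illustration under that assumption; the function name `novel_ngram_ratio` and the toy strings are hypothetical and not part of the notebook.

```python
def novel_ngram_ratio(source: str, summary: str, n: int = 2) -> float:
    """Return the fraction of the summary's word n-grams that never occur in the source."""
    def ngrams(text):
        words = text.lower().split()
        return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

    source_ngrams = set(ngrams(source))
    summary_ngrams = ngrams(summary)
    if not summary_ngrams:
        return 0.0
    novel = sum(1 for gram in summary_ngrams if gram not in source_ngrams)
    return novel / len(summary_ngrams)


# Toy data (hypothetical): one copied summary, one reworded summary
source = "the airport closed on tuesday because of the storm"
extractive = "the airport closed on tuesday"
abstractive = "storm shuts airport"

print(novel_ngram_ratio(source, extractive))   # 0.0: every bigram is copied from the source
print(novel_ngram_ratio(source, abstractive))  # 1.0: no bigram appears in the source
```

A ratio near 0 means the summary mostly copies the source; a ratio near 1 means it is mostly reworded.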

In [3]:
#import needed libraries
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoConfig

#choose model that performs summarization
checkpoint= 'facebook/bart-large-cnn'

#this checkpoint is BART, pretrained and then fine-tuned on the CNN/DailyMail news dataset
model= AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
tokenizer= AutoTokenizer.from_pretrained(checkpoint)
config= AutoConfig.from_pretrained(checkpoint)


In [5]:
#use article found on CNN to test out summarization
sequence= """Airlines, airports and the federal government are bracing for aviation infrastructure to take a major blow from Hurricane Ian. Cancellations and closures are already piling up across the Florida peninsula.
The storm is forecast to make landfall Wednesday afternoon on Florida's west coast as a major hurricane.
Tampa International Airport, where officials are preparing for a major impact, suspended operations at 5 p.m. ET Tuesday.

The Tampa airport said there will be no departing flights through Thursday.
'We will share a reopening date and time when it is determined,' the airport said on Twitter Wednesday. The airport typically handles 450 flights daily.
Miami International Airport was still open midday Wednesday, according to a notice on the airport's website, although some flights had been delayed or canceled.
Operations ceased at 10:30 am ET Wednesday at Orlando International Airport. The airport sees nearly 130,000 passengers daily, according to its website.
The terminal at St. Pete-Clearwater International Airport closed at 1 p.m. Tuesday "due to mandatory evacuation orders from Pinellas County and remain closed until the evacuation order is lifted," according to the verified tweet from the airport.
Sarasota Bradenton International Airport closed at 8 p.m. Tuesday night.
Florida airports lead in US cancellations
By midday Wednesday, FlightAware data showed more than 2,100 US flight cancellations nationwide on Wednesday. About 1,700 Thursday flights had already been canceled.
Orlando, Miami and Tampa airports were the top three trouble spots, with cancellations also mounting at Fort Lauderdale International Airport and Southwest Florida International Airport in Fort Myers.
Effects could ripple through the southeastern United States with Atlanta and Charlotte already seeing cancellations.
Airlines canceling flights
American Airlines, which operates about 250 daily departures out of Miami, its fourth-largest hub, had canceled 583 flights by midday Wednesday, including mainline and regional service.
American customers traveling through 20 airports in the hurricane's path can rebook flights without change fees. The airline has also added "reduced, last-minute fares for cities that will be impacted" in hopes of helping people who are trying to "evacuate via air."
United Airlines is starting to shutter operations on the Atlantic Coast of Florida in anticipation of Hurricane Ian's path after it makes landfall.
By Wednesday afternoon, United says it will halt departures from West Palm Beach, Miami and Fort Lauderdale airports. United will not operate from Jacksonville starting on Thursday.
United said on Wednesday that it had proactively canceled 345 flights since Tuesday, swapping some outbound flights with larger airplanes to help customers who were evacuating from the storm's path.
United and Southwest Airlines also suspended operations at the Fort Myers and Sarasota airports.
United also canceled all Tuesday and Wednesday flights to and from Key West and canceled some flights out of Orlando 'as to minimize crew layovers.'
By midday Wednesday, Southwest Airlines had canceled more than 500 US flights, according to FlightAware data.
"""

In [6]:
#tokenize the article; truncate to the 1024-token limit of BART's encoder
inputs = tokenizer(sequence, truncation=True, max_length=1024, return_tensors="pt")
#generation settings to experiment with: max_length, min_length, length_penalty, num_beams
#**inputs passes the attention mask along with the input ids
outputs = model.generate(**inputs, max_length=100, min_length=40, length_penalty=2.0, num_beams=10, early_stopping=True)
#print summary of article above
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Orlando, Miami and Tampa airports were the top three trouble spots, with cancellations also mounting at Fort Lauderdale International Airport and Southwest Florida International Airport in Fort Myers. United Airlines is starting to shutter operations on the Atlantic Coast of Florida in anticipation of Hurricane Ian's path after it makes landfall.
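
The `length_penalty` argument used above affects how beam search ranks finished candidates. As far as I understand the Hugging Face implementation, each hypothesis is scored by its summed log-probability divided by its length raised to `length_penalty`, so values above 0 favor longer outputs. The sketch below is a simplified illustration of that formula, not the library's code; `beam_score` and the numbers are hypothetical.

```python
def beam_score(sum_logprobs: float, length: int, length_penalty: float) -> float:
    """Length-normalized score used to rank a finished beam hypothesis (simplified)."""
    return sum_logprobs / (length ** length_penalty)


# Two hypothetical candidates: the longer one has a lower total log-probability
short_hyp = {"sum_logprobs": -6.0, "length": 10}
long_hyp = {"sum_logprobs": -12.0, "length": 20}

for penalty in (0.0, 1.0, 2.0):
    short_score = beam_score(short_hyp["sum_logprobs"], short_hyp["length"], penalty)
    long_score = beam_score(long_hyp["sum_logprobs"], long_hyp["length"], penalty)
    print(penalty, short_score, long_score)
```

With a penalty of 0.0 the short candidate wins, at 1.0 the two tie, and at 2.0 (the value used in this notebook) the longer candidate ranks higher.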


# Train model on conversations

In [3]:
import os
#select the GPU before torch initializes CUDA; index "2" is specific to the original machine
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

Fine-tune the model on SAMSum, a dataset of chat-style conversations paired with human-written summaries.

In [4]:
data= load_dataset("samsum")
checkpoint= 'facebook/bart-large-cnn'
    
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
tokenizer= AutoTokenizer.from_pretrained(checkpoint)

device= torch.device('cuda' if torch.cuda.is_available() else "cpu")
model.to(device)

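def preprocess_function(examples):
    #tokenize the dialogues as encoder inputs (BART accepts at most 1024 tokens)
    model_inputs = tokenizer(examples["dialogue"], max_length=1000, truncation=True)
    #tokenize the reference summaries as decoder targets; text_target replaces the
    #deprecated as_target_tokenizer() context manager in recent transformers versions
    labels = tokenizer(text_target=examples["summary"], max_length=150, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs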

tokenized_data= data.map(preprocess_function, batched=True)


#customize the training by specifying attributes
batch_size= 4
model_name= checkpoint
args= Seq2SeqTrainingArguments(
    output_dir= 'txt_summarization',
    evaluation_strategy= 'epoch',
    learning_rate= 2e-5,
    per_device_train_batch_size= batch_size,
    per_device_eval_batch_size= batch_size,
    weight_decay= 0.01,
    save_total_limit= 2,
    num_train_epochs= 3,
    predict_with_generate= True,
    fp16= True)
    
data_collator= DataCollatorForSeq2Seq(tokenizer, model= model)
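
The collator is needed because the summaries in a batch have different lengths. By default `DataCollatorForSeq2Seq` pads the labels with -100, a value the PyTorch cross-entropy loss ignores, so padded positions do not contribute to training. The sketch below mimics only that label-padding step in plain Python; `pad_labels` and the token ids are hypothetical.

```python
LABEL_PAD_ID = -100  # default ignore_index of PyTorch's cross-entropy loss


def pad_labels(batch_labels, pad_id=LABEL_PAD_ID):
    """Right-pad every label sequence to the length of the longest one in the batch."""
    longest = max(len(labels) for labels in batch_labels)
    return [labels + [pad_id] * (longest - len(labels)) for labels in batch_labels]


# Two tokenized summaries of different lengths (made-up token ids)
batch = [[0, 713, 2], [0, 100, 200, 300, 2]]
padded = pad_labels(batch)
print(padded)  # [[0, 713, 2, -100, -100], [0, 100, 200, 300, 2]]
```

The real collator also pads `input_ids` and `attention_mask` and, when given the model, prepares the decoder inputs; those steps are left out here.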


In [5]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer)

trainer.train()

Using cuda_amp half precision backend
The following columns in the training set don't have a corresponding argument in `BartForConditionalGeneration.forward` and have been ignored: dialogue, summary, id. If dialogue, summary, id are not expected by `BartForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 14732
  Num Epochs = 3
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 11049
You're using a BartTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 1.3594        | 1.46021         |
| 2     | 1.0134        | 1.415231        |
| 3     | 0.7606        | 1.493712        |
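
In the per-epoch losses above, training loss keeps falling while validation loss is lowest after epoch 2 and rises in epoch 3, which is a common sign of overfitting. The sketch below simply picks the epoch with the lowest validation loss from those numbers; the variable names are illustrative.

```python
# Losses copied from the training output above
history = [
    {"epoch": 1, "train_loss": 1.3594, "val_loss": 1.46021},
    {"epoch": 2, "train_loss": 1.0134, "val_loss": 1.415231},
    {"epoch": 3, "train_loss": 0.7606, "val_loss": 1.493712},
]

# Epoch with the lowest validation loss
best = min(history, key=lambda row: row["val_loss"])
print(best["epoch"])  # 2

# Gap between validation and training loss; a widening gap suggests overfitting
gaps = [row["val_loss"] - row["train_loss"] for row in history]
print(gaps)
```

If the best checkpoint should be kept automatically, `load_best_model_at_end=True` in the training arguments is the option usually used for that, although the run shown here did not set it.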


Saving model checkpoint to txt_summarization/checkpoint-500
Configuration saved in txt_summarization/checkpoint-500/config.json
Model weights saved in txt_summarization/checkpoint-500/pytorch_model.bin
tokenizer config file saved in txt_summarization/checkpoint-500/tokenizer_config.json
Special tokens file saved in txt_summarization/checkpoint-500/special_tokens_map.json
Saving model checkpoint to txt_summarization/checkpoint-1000
Configuration saved in txt_summarization/checkpoint-1000/config.json
Model weights saved in txt_summarization/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in txt_summarization/checkpoint-1000/tokenizer_config.json
Special tokens file saved in txt_summarization/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to txt_summarization/checkpoint-1500
Configuration saved in txt_summarization/checkpoint-1500/config.json
Model weights saved in txt_summarization/checkpoint-1500/pytorch_model.bin
tokenizer config file saved in txt_summari

TrainOutput(global_step=11049, training_loss=1.0619153366206253, metrics={'train_runtime': 2391.3146, 'train_samples_per_second': 18.482, 'train_steps_per_second': 4.62, 'total_flos': 2.651417773149389e+16, 'train_loss': 1.0619153366206253, 'epoch': 3.0})
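
The numbers in the training log can be checked by hand: the step count is the number of batches per epoch times the number of epochs, and the throughput figures follow from the runtime. A short sketch using only values reported above:

```python
import math

# Values copied from the training log and TrainOutput above
num_examples = 14732
batch_size = 4
num_epochs = 3
train_runtime = 2391.3146  # seconds

# One optimization step per batch (gradient accumulation steps = 1)
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 3683 11049

# Throughput as reported in the metrics
samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime
print(round(samples_per_second, 3), round(steps_per_second, 2))
```

The results match `global_step=11049`, `train_samples_per_second` and `train_steps_per_second` in the output.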

In [11]:
#save new model
trainer.save_model("./conversation_summarizer_model")

Saving model checkpoint to ./conversation_summarizer_model
Configuration saved in ./conversation_summarizer_model/config.json
Model weights saved in ./conversation_summarizer_model/pytorch_model.bin
tokenizer config file saved in ./conversation_summarizer_model/tokenizer_config.json
Special tokens file saved in ./conversation_summarizer_model/special_tokens_map.json


In [19]:
#test out new model
checkpoint= './conversation_summarizer_model'

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
tokenizer= AutoTokenizer.from_pretrained(checkpoint)

sequence= input('What would you like to summarize?')

#truncate to BART's 1024-token input limit and pass the attention mask via **inputs
inputs = tokenizer(sequence, truncation=True, max_length=1024, return_tensors="pt")
outputs = model.generate(**inputs, max_length=100, min_length=15, length_penalty=2.0, num_beams=10, early_stopping=True)
print(tokenizer.decode(outputs[0], skip_special_tokens= True))

loading configuration file ./conversation_summarizer_model/config.json
Model config BartConfig {
  "_name_or_path": "./conversation_summarizer_model",
  "_num_labels": 3,
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "add_final_layer_norm": false,
  "architectures": [
    "BartForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 12,
  "decoder_start_token_id": 2,
  "dropout": 0.1,
  "early_stopping": true,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 12,
  "eos_token_id": 2,
  "force_bos_token_to_be_generated": true,
  "forced_bos_token_id": 0,
  "forced_eos_token_id": 2,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.

Hannah is looking for Betty's number but Amanda can't find it. Larry called Betty the last time they were at the park together. Hannah doesn't know him well but Amanda thinks he's nice.


In [17]:
text= """Hannah: Hey, do you have Betty's number?
Amanda: Lemme check
Amanda: Sorry, can't find it.
Amanda: Ask Larry
Amanda: He called her last time we were at the park together
Hannah: I don't know him well
Amanda: Don't be shy, he's very nice
Hannah: If you say so..
Hannah: I'd rather you texted him
Amanda: Just text him 🙂
Hannah: Urgh.. Alright
Hannah: Bye
Amanda: Bye bye"""

Your max_length is set to 142, but you input_length is only 121. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=60)


[{'summary_text': "Amanda can't find Betty's number. Larry called Betty the last time Amanda and Hannah were at the park together. Hannah doesn't know Larry very well, so Amanda will text him to ask him about Betty's contact details. Hannah and Amanda will talk to him later."}]