<a href="https://colab.research.google.com/github/brandonscolieri/266_final/blob/main/abstractive_base_model%2Bfinetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Standard (Out of the Box) Pegasus Abstractive Summarization

In [1]:
!pip install sentencepiece
!pip install transformers



In [2]:
!pip install --upgrade pip



In [3]:
from transformers import PegasusTokenizer, TFPegasusForConditionalGeneration

In [4]:
model = TFPegasusForConditionalGeneration.from_pretrained('google/pegasus-xsum')

All model checkpoint layers were used when initializing TFPegasusForConditionalGeneration.

All the layers of TFPegasusForConditionalGeneration were initialized from the model checkpoint at google/pegasus-xsum.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFPegasusForConditionalGeneration for predictions without further training.


In [5]:
tokenizer = PegasusTokenizer.from_pretrained('google/pegasus-xsum')

In [6]:
ARTICLE_TO_SUMMARIZE = (
"At 1:51 p.m. on Jan. 6, a right-wing radio host named Michael D. Brown wrote on Twitter that rioters had breached the United States Capitol — and immediately speculated about who was really to blame. 'Antifa or BLM or other insurgents could be doing it disguised as Trump supporters,' Mr. Brown wrote, using shorthand for Black Lives Matter. 'Come on, man, have you never heard of psyops?'"
"Only 13,000 people follow Mr. Brown on Twitter, but his tweet caught the attention of another conservative pundit: Todd Herman, who was guest-hosting Rush Limbaugh’s national radio program. Minutes later, he repeated Mr. Brown’s baseless claim to Mr. Limbaugh’s throngs of listeners: 'It’s probably not Trump supporters who would do that. Antifa, BLM, that’s what they do. Right?'"
"What happened over the next 12 hours illustrated the speed and the scale of a right-wing disinformation machine primed to seize on a lie that served its political interests and quickly spread it as truth to a receptive audience. The weekslong fiction about a stolen election that President Donald J. Trump pushed to his millions of supporters had set the stage for a new and equally false iteration: that left-wing agitators were responsible for the attack on the Capitol."
 )

In [7]:
type(ARTICLE_TO_SUMMARIZE)

str

In [8]:
inputs = tokenizer([ARTICLE_TO_SUMMARIZE], max_length=1024, return_tensors='tf')

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


In [9]:
tokenizer

PreTrainedTokenizer(name_or_path='google/pegasus-xsum', vocab_size=96103, model_max_len=512, is_fast=False, padding_side='right', special_tokens={'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>', 'mask_token': '<mask_2>', 'additional_special_tokens': ['<mask_1>', '<unk_2>', '<unk_3>', '<unk_4>', '<unk_5>', '<unk_6>', '<unk_7>', '<unk_8>', '<unk_9>', '<unk_10>', '<unk_11>', '<unk_12>', '<unk_13>', '<unk_14>', '<unk_15>', '<unk_16>', '<unk_17>', '<unk_18>', '<unk_19>', '<unk_20>', '<unk_21>', '<unk_22>', '<unk_23>', '<unk_24>', '<unk_25>', '<unk_26>', '<unk_27>', '<unk_28>', '<unk_29>', '<unk_30>', '<unk_31>', '<unk_32>', '<unk_33>', '<unk_34>', '<unk_35>', '<unk_36>', '<unk_37>', '<unk_38>', '<unk_39>', '<unk_40>', '<unk_41>', '<unk_42>', '<unk_43>', '<unk_44>', '<unk_45>', '<unk_46>', '<unk_47>', '<unk_48>', '<unk_49>', '<unk_50>', '<unk_51>', '<unk_52>', '<unk_53>', '<unk_54>', '<unk_55>', '<unk_56>', '<unk_57>', '<unk_58>', '<unk_59>', '<unk_60>', '<unk_61>', '<unk_62>

In [10]:
# Generate Summary
summary_ids = model.generate(inputs['input_ids'])
print([tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids])

['In our series of letters from African-American journalists, filmmaker and columnist Richard Pérez-Pea looks at how right-wing disinformation helped Donald Trump win the White House.']


## Re - Fine-Tuned Newsroom Pegasus on NYT Corpus

In [12]:
## loading dependencies
!pip install nlp
from nlp import load_dataset



In [13]:
train = load_dataset("cnn_dailymail", '3.0.0', split="train")
train

Dataset(features: {'article': Value(dtype='string', id=None), 'highlights': Value(dtype='string', id=None), 'id': Value(dtype='string', id=None)}, num_rows: 287113)

In [14]:
print(train.column_names)

['article', 'highlights', 'id']


## NEW CODE FOR FINE TUNING 

In [15]:
encoder_max_length=512
decoder_max_length=128
batch_size = 16

def process_data_to_model_inputs(batch):
  # tokenize the inputs and labels
  inputs = tokenizer(batch["article"], padding="max_length", truncation=True, max_length=encoder_max_length)
  outputs = tokenizer(batch["highlights"], padding="max_length", truncation=True, max_length=decoder_max_length)

  batch["input_ids"] = inputs.input_ids
  batch["attention_mask"] = inputs.attention_mask
  batch["decoder_input_ids"] = outputs.input_ids
  batch["decoder_attention_mask"] = outputs.attention_mask
  batch["labels"] = outputs.input_ids.copy()

  # because BERT automatically shifts the labels, the labels correspond exactly to `decoder_input_ids`. 
  # We have to make sure that the PAD token is ignored
  batch["labels"] = [[-100 if token == tokenizer.pad_token_id else token for token in labels] for labels in batch["labels"]]

  return batch

In [16]:
train = train.map(
    process_data_to_model_inputs, 
    batched=True, 
    batch_size=batch_size, 
    remove_columns=["article", "highlights", "id"]
)

In [17]:
# validation data
val = load_dataset("cnn_dailymail", '3.0.0', split="validation")
val

Dataset(features: {'article': Value(dtype='string', id=None), 'highlights': Value(dtype='string', id=None), 'id': Value(dtype='string', id=None)}, num_rows: 13368)

In [18]:
val = val.map(
    process_data_to_model_inputs, 
    batched=True, 
    batch_size=batch_size, 
    remove_columns=["article", "highlights", "id"]
)

In [19]:
import torch

In [20]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

ft_model = TFPegasusForConditionalGeneration.from_pretrained('google/pegasus-xsum')

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train,         # training dataset
    eval_dataset=val             # evaluation dataset
)

trainer.train()

  return torch._C._cuda_getDeviceCount() > 0
All model checkpoint layers were used when initializing TFPegasusForConditionalGeneration.

All the layers of TFPegasusForConditionalGeneration were initialized from the model checkpoint at google/pegasus-xsum.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFPegasusForConditionalGeneration for predictions without further training.


AttributeError: 'TFPegasusForConditionalGeneration' object has no attribute 'to'

In [None]:
train

In [None]:
from transformers import PegasusTokenizerFast

In [None]:
ft_tokenizer = PegasusTokenizerFast.from_pretrained('google/pegasus-xsum')

In [None]:
ft_model = TFPegasusForConditionalGeneration.from_pretrained('google/pegasus-xsum')

In [None]:
train_encodings = tokenizer(train['arti', truncation=True, padding=True)

## OLD CODE FOR FINE TUNING

In [15]:
# train = train.map(lambda batch: tokenizer(batch["article"], truncation=True, padding=True), batched=True)
# train.rename_column_("highlights", "labels")

HBox(children=(FloatProgress(value=0.0, max=288.0), HTML(value='')))




In [16]:
# train.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
# {key: val.shape for key, val in train[0].items()})

SyntaxError: unmatched ')' (<ipython-input-16-a4b970679dd7>, line 2)