In [1]:
!nvidia-smi
# check which GPU we are using

Mon Dec  2 19:49:48 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 552.46                 Driver Version: 552.46         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 4060 ...  WDDM  |   00000000:01:00.0 Off |                  N/A |
| N/A   43C    P8              1W /   64W |     113MiB /   8188MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [2]:
!pip install transformers[sentencepiece] datasets sacrebleu rouge_score py7zr -q

### Purpose of accelerate:
1. Ease of multi-device training: whether you are using multiple GPUs or TPUs, accelerate makes it easier to scale training across devices with minimal code changes.
2. Mixed precision: it lets models train in mixed precision, which can speed up training and reduce memory usage.
3. Zero Redundancy Optimizer (ZeRO): through its DeepSpeed integration, accelerate helps manage large models by sharding optimizer state, gradients, and parameters across multiple devices.
4. Offload to CPU/SSD: useful for large models that may not fit entirely into GPU memory, since parts of the model or optimizer state can be offloaded to CPU memory or even to disk.
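
To see why memory-saving features matter on the 8 GB GPU reported by `nvidia-smi` above, here is a rough back-of-envelope estimate. It is only a sketch: the parameter count of about 568 million is an assumption (check it with `model.num_parameters()`), it assumes full fp32 training with an Adam-style optimizer that keeps two extra states per parameter, and it ignores activations entirely.

```python
# Rough estimate of training memory for weights, gradients, and optimizer states.
# ASSUMED_PARAMS is an assumption, not a measured value.

def training_memory_gib(num_params, bytes_per_param=4, optimizer_states=2):
    """Return GiB needed for weights + gradients + optimizer states."""
    weights = num_params * bytes_per_param
    grads = num_params * bytes_per_param
    optim = num_params * bytes_per_param * optimizer_states
    return (weights + grads + optim) / 2**30

ASSUMED_PARAMS = 568_000_000       # hypothetical parameter count
GPU_MEMORY_GIB = 8188 / 1024       # 8188 MiB, from the nvidia-smi output

estimate = training_memory_gib(ASSUMED_PARAMS)
print(f"Estimated training state: {estimate:.2f} GiB")
print(f"GPU memory available:     {GPU_MEMORY_GIB:.2f} GiB")
print("Fits without offloading:", estimate < GPU_MEMORY_GIB)
```

Under these assumptions the training state alone exceeds the GPU's memory, which is the situation that offloading and sharding are meant to address.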

In [4]:
!pip install --upgrade accelerate 
# accelerate handles device placement, mixed precision, and distributed training for the Trainer
!pip uninstall -y transformers accelerate
!pip install transformers accelerate

Found existing installation: transformers 4.46.3
Uninstalling transformers-4.46.3:
  Successfully uninstalled transformers-4.46.3
Found existing installation: accelerate 1.1.1
Uninstalling accelerate-1.1.1:
  Successfully uninstalled accelerate-1.1.1
Collecting transformers
  Using cached transformers-4.46.3-py3-none-any.whl.metadata (44 kB)
Collecting accelerate
  Using cached accelerate-1.1.1-py3-none-any.whl.metadata (19 kB)
Using cached transformers-4.46.3-py3-none-any.whl (10.0 MB)
Using cached accelerate-1.1.1-py3-none-any.whl (333 kB)
Installing collected packages: accelerate, transformers
Successfully installed accelerate-1.1.1 transformers-4.46.3


In [6]:
pip install ipywidgets

Collecting ipywidgets
Note: you may need to restart the kernel to use updated packages.

  Using cached ipywidgets-8.1.5-py3-none-any.whl.metadata (2.3 kB)
Collecting widgetsnbextension~=4.0.12 (from ipywidgets)
  Using cached widgetsnbextension-4.0.13-py3-none-any.whl.metadata (1.6 kB)
Collecting jupyterlab-widgets~=3.0.12 (from ipywidgets)
  Using cached jupyterlab_widgets-3.0.13-py3-none-any.whl.metadata (4.1 kB)
Using cached ipywidgets-8.1.5-py3-none-any.whl (139 kB)
Using cached jupyterlab_widgets-3.0.13-py3-none-any.whl (214 kB)
Using cached widgetsnbextension-4.0.13-py3-none-any.whl (2.3 MB)
Installing collected packages: widgetsnbextension, jupyterlab-widgets, ipywidgets
Successfully installed ipywidgets-8.1.5 jupyterlab-widgets-3.0.13 widgetsnbextension-4.0.13


In [8]:
from transformers import pipeline, set_seed
from datasets import load_dataset, load_from_disk
import matplotlib.pyplot as plt
import pandas as pd

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
# AutoTokenizer loads the tokenizer that matches a given checkpoint and converts text into token IDs
# AutoModelForSeq2SeqLM loads the matching sequence-to-sequence model

import nltk
from nltk.tokenize import sent_tokenize

from tqdm import tqdm
import torch

nltk.download("punkt")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ericz\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Basic Functionality of a Hugging Face Model

In [9]:
from transformers import AutoTokenizer, PegasusForConditionalGeneration

model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")
tokenizer = AutoTokenizer.from_pretrained("google/pegasus-xsum")

ARTICLE_TO_SUMMARIZE = (
    "PG&E stated it scheduled the blackouts in response to forecasts for high winds "
    "amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were "
    "scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."
)
inputs = tokenizer(ARTICLE_TO_SUMMARIZE, max_length=1024, return_tensors="pt")

# Generate Summary
summary_ids = model.generate(inputs["input_ids"])
tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


"California's largest electricity provider has turned off power to hundreds of thousands of customers."

In [10]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cpu'

### Fine Tuning

In [12]:
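model_ckpt = "google/pegasus-cnn_dailymail"  # checkpoint name, kept separate from the model object

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)  # load the tokenizer that converts text into tokens

model_pegasus = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)  # load the model and move it to the selected device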

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [13]:
!pip install py7zr



In [14]:
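# Download and unzip the data (wget and unzip must be available in the shell)
# The /raw/ path fetches the file itself; the /blob/ path returns GitHub's HTML page
!wget https://github.com/ez-anthro-tech-design/datasets/raw/main/summarizer_data.zip
!unzip summarizer_data.zip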

'wget' is not recognized as an internal or external command,
operable program or batch file.
'unzip' is not recognized as an internal or external command,
operable program or batch file.
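
Because `wget` and `unzip` are not available in this Windows shell, the same step can be sketched with the Python standard library instead. This is a minimal sketch, not the notebook's original method: the helper name `download_and_extract` is hypothetical, and the `/raw/` URL in the commented usage line is an assumption about where the archive can be fetched directly.

```python
import urllib.request
import zipfile
from pathlib import Path

def download_and_extract(url, archive_path, extract_dir="."):
    """Download a zip archive and extract it; return the extracted file names."""
    archive_path = Path(archive_path)
    urllib.request.urlretrieve(url, archive_path)
    # Guard against saving an HTML page instead of the archive
    if not zipfile.is_zipfile(archive_path):
        raise ValueError(f"{archive_path} is not a zip archive; check the URL")
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(extract_dir)
        return zf.namelist()

# Usage (assumed direct-download URL, uncomment to run):
# download_and_extract(
#     "https://github.com/ez-anthro-tech-design/datasets/raw/main/summarizer_data.zip",
#     "summarizer_data.zip",
# )
```

The `is_zipfile` check is there because a wrong link typically saves a web page under the `.zip` name, which would otherwise fail later with a less obvious error.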


In [16]:
from datasets import load_dataset

# Load the data from JSON files
dataset_samsum = load_dataset('json', data_files={'train': 'train.json', 'test': 'test.json', 'validation': 'val.json'})

# Print the dataset
dataset_samsum

Generating train split: 14732 examples [00:00, 42276.09 examples/s]
Generating test split: 819 examples [00:00, 23190.63 examples/s]
Generating validation split: 818 examples [00:00, 27217.16 examples/s]


DatasetDict({
    train: Dataset({
        features: ['id', 'summary', 'dialogue'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['id', 'summary', 'dialogue'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['id', 'summary', 'dialogue'],
        num_rows: 818
    })
})

In [17]:
split_lengths = [len(dataset_samsum[split]) for split in dataset_samsum]  # number of examples in the train, test, and validation splits

print(f"Split lengths: {split_lengths}")
print(f"Features: {dataset_samsum['train'].column_names}")
print("\nDialogue:")

print(dataset_samsum["test"][21]["dialogue"])

print("\nSummary:")

print(dataset_samsum["test"][21]["summary"])



Split lengths: [14732, 819, 818]
Features: ['id', 'summary', 'dialogue']

Dialogue:
Gloria: This exam is a bit of a lottery in fact
Gloria: You can't really get prepared, it's all about experience
Emma: But there are some rules and some typical texts right?
Gloria: You can see some texts from previous years
Gloria: <file_other>
Emma: Wow that's very useful
Emma: I have never seen this site
Gloria: Yes it's very good
Gloria: Actually it's good to read all the texts because you will see that some phrases repeat very often
Emma: How much time do you have for all 4 parts?
Gloria: 4 hours
Emma: Is it enough?
Gloria: Well it has to be
Gloria: Would be perfect to have 2 more hours... But on the other hand it would be really exhausting
Emma: 4 hours and no breaks?
Gloria: No breaks :/ So it's really important to be really focused and try to write as fast as you can
Gloria: And read it carefully and correct during the last hour
Emma: I'm going to read everything from that website, it's great

S

### Preparing Data For Training For Sequence To Sequence Model

{
  'dialogue': "Hi! How are you?",
  'summary': "The speaker is asking how the other person is."
}
-> converted to
{
  'input_ids': [123, 456, 789, ...],  # token IDs for the dialogue
  'attention_mask': [1, 1, 1, ...],   # attention mask for the input
  'labels': [321, 654, 987, ...],     # token IDs for the summary (target)
}
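
The conversion above can be illustrated with a toy example. This is only a sketch: it uses a whitespace tokenizer and a tiny vocabulary built on the fly, whereas the real Pegasus tokenizer uses subword units, so the IDs here are illustrative. The helper names `build_vocab`, `encode`, and `convert_example` are hypothetical.

```python
def build_vocab(texts):
    """Assign an integer ID to every lower-cased whitespace token."""
    vocab = {"<pad>": 0, "</s>": 1}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(text, vocab, max_length):
    """Map words to IDs, truncate, and append the end-of-sequence ID."""
    ids = [vocab[word] for word in text.lower().split()][: max_length - 1]
    return ids + [vocab["</s>"]]

def convert_example(example, vocab, max_input=1024, max_target=128):
    """Build input_ids, attention_mask, and labels for one example."""
    input_ids = encode(example["dialogue"], vocab, max_input)
    return {
        "input_ids": input_ids,
        "attention_mask": [1] * len(input_ids),  # 1 marks a real token
        "labels": encode(example["summary"], vocab, max_target),
    }

example = {
    "dialogue": "Hi! How are you?",
    "summary": "The speaker is asking how the other person is.",
}
vocab = build_vocab([example["dialogue"], example["summary"]])
features = convert_example(example, vocab)
print(features)
```

The point to take away is the shape of the output: the dialogue becomes `input_ids` plus a matching `attention_mask`, and the summary becomes `labels`.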

In [18]:
def convert_examples_to_features(example_batch):
    # Tokenize the dialogues (model inputs), truncating to the encoder limit
    input_encodings = tokenizer(example_batch['dialogue'], max_length=1024, truncation=True)
    # Tokenize the summaries as targets; text_target replaces the deprecated
    # as_target_tokenizer() context manager in recent transformers releases
    target_encodings = tokenizer(text_target=example_batch['summary'], max_length=128, truncation=True)

    return {
        'input_ids': input_encodings['input_ids'],
        'attention_mask': input_encodings['attention_mask'],
        'labels': target_encodings['input_ids'],
    }

In [19]:
dataset_samsum_pt = dataset_samsum.map(convert_examples_to_features, batched = True)

Map: 100%|██████████| 14732/14732 [00:03<00:00, 4417.17 examples/s]
Map: 100%|██████████| 819/819 [00:00<00:00, 1907.54 examples/s]
Map: 100%|██████████| 818/818 [00:00<00:00, 4165.57 examples/s]


In [20]:
dataset_samsum_pt["train"]

Dataset({
    features: ['id', 'summary', 'dialogue', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 14732
})

In [21]:
dataset_samsum_pt["test"]

Dataset({
    features: ['id', 'summary', 'dialogue', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 819
})

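DataCollatorForSeq2Seq is a data collator designed for sequence-to-sequence models (e.g. Pegasus, T5, BART). It prepares batches for training by padding the inputs and the labels of each batch to a common length, and it pads the labels with -100 so that padded positions are ignored by the loss.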
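
The padding that a seq2seq collator performs can be sketched in plain Python. This is a simplified illustration, not the library's implementation: the function name `collate` and the pad ID of 0 are assumptions, and the real collator also returns tensors and can prepare decoder inputs.

```python
def collate(features, pad_token_id=0, label_pad_id=-100):
    """Pad a list of tokenized examples to the longest length in the batch."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        input_pad = max_input - len(f["input_ids"])
        label_pad = max_label - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * input_pad)
        # 0 in the attention mask tells the model to ignore the padding
        batch["attention_mask"].append(f["attention_mask"] + [0] * input_pad)
        # -100 tells the loss function to skip the padded label positions
        batch["labels"].append(f["labels"] + [label_pad_id] * label_pad)
    return batch

toy_features = [
    {"input_ids": [5, 6, 7, 1], "attention_mask": [1, 1, 1, 1], "labels": [9, 1]},
    {"input_ids": [8, 1], "attention_mask": [1, 1], "labels": [4, 3, 2, 1]},
]
toy_batch = collate(toy_features)
for key, rows in toy_batch.items():
    print(key, rows)
```

Padding per batch rather than to a fixed 1024 tokens keeps short dialogues cheap to process.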

In [22]:
# Training

from transformers import DataCollatorForSeq2Seq  # pads each batch of tokenized examples so the model receives equal-length tensors

seq2seq_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model_pegasus)

In [23]:
# Fine-tuning configuration
from transformers import TrainingArguments, Trainer

trainer_args = TrainingArguments(
    output_dir='pegasus-samsum', num_train_epochs=5, warmup_steps=500,
    per_device_train_batch_size=1, per_device_eval_batch_size=1,
    weight_decay=0.01, logging_steps=10,
    eval_strategy='steps', eval_steps=500, save_steps=int(1e6),
    gradient_accumulation_steps=16  # effective batch size of 16 per device
)




In [24]:
trainer = Trainer(model=model_pegasus, args=trainer_args,
                  tokenizer=tokenizer, data_collator=seq2seq_data_collator,
                  train_dataset=dataset_samsum_pt["test"],  # shortcut: the smaller test split trains faster; use "train" for a real run
                  eval_dataset=dataset_samsum_pt["validation"])




In [25]:
trainer.train()

  4%|▍         | 10/255 [15:26<14:33:40, 213.96s/it]

{'loss': 3.1101, 'grad_norm': 511.05169677734375, 'learning_rate': 1.0000000000000002e-06, 'epoch': 0.2}


  8%|▊         | 20/255 [18:29<1:39:51, 25.50s/it]  

{'loss': 3.0467, 'grad_norm': 253.41697692871094, 'learning_rate': 2.0000000000000003e-06, 'epoch': 0.39}


 12%|█▏        | 30/255 [23:41<1:56:24, 31.04s/it]

{'loss': 3.1613, 'grad_norm': 164.25572204589844, 'learning_rate': 3e-06, 'epoch': 0.59}


 15%|█▍        | 38/255 [27:29<1:43:27, 28.60s/it]

KeyboardInterrupt: 
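
The progress bar total of 255 steps and the logged learning rates can be reproduced from the training arguments with a little arithmetic. This is a sketch under two assumptions: that the number of optimizer steps per epoch is the number of batches integer-divided by the accumulation steps, and that the base learning rate is 5e-5 with linear warmup (no learning rate is set explicitly in the arguments).

```python
import math

num_examples = 819        # size of the split used for training here
per_device_batch = 1
grad_accum = 16
epochs = 5
warmup_steps = 500
base_lr = 5e-5            # assumed default learning rate

batches_per_epoch = math.ceil(num_examples / per_device_batch)
updates_per_epoch = batches_per_epoch // grad_accum
total_steps = updates_per_epoch * epochs

def warmup_lr(step):
    """Learning rate during linear warmup, capped at the base rate."""
    return base_lr * min(1.0, step / warmup_steps)

print("Effective batch size:", per_device_batch * grad_accum)
print("Optimizer steps per epoch:", updates_per_epoch)
print("Total optimizer steps:", total_steps)
print("Learning rate at step 10:", warmup_lr(10))
print("Warmup finishes before training ends:", warmup_steps <= total_steps)
```

With these settings the run has fewer total steps than `warmup_steps` and `eval_steps`, so the learning rate never reaches its full value and no intermediate evaluation is triggered. Training on the full train split, or lowering both values, would change that.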

In [None]:
# Evaluation
# Example: [1, 2, 3, 4, 5, 6] with batch_size=3 -> [1, 2, 3], [4, 5, 6]
def generate_batch_sized_chunks(list_of_elements, batch_size):
    """Split the dataset into smaller batches that can be processed together.

    Yields successive batch-sized chunks from list_of_elements."""
    for i in range(0, len(list_of_elements), batch_size):
        yield list_of_elements[i : i + batch_size]

def calculate_metric_on_test_ds(dataset, metric, model, tokenizer,
                               batch_size=16, device=device,
                               column_text="article",
                               column_summary="highlights"):
    article_batches = list(generate_batch_sized_chunks(dataset[column_text], batch_size))
    target_batches = list(generate_batch_sized_chunks(dataset[column_summary], batch_size))

    for article_batch, target_batch in tqdm(
        zip(article_batches, target_batches), total=len(article_batches)):

        inputs = tokenizer(article_batch, max_length=1024,  truncation=True,
                        padding="max_length", return_tensors="pt")

        summaries = model.generate(input_ids=inputs["input_ids"].to(device),
                  attention_mask=inputs["attention_mask"].to(device),
                  length_penalty=0.8, num_beams=8, max_length=128)
        # A smaller length_penalty makes beam search favor shorter sequences,
        # which keeps the generated summaries from growing too long.

        # Finally, we decode the generated texts, replace the <n> newline
        # token, and add the decoded texts with the references to the metric.
        decoded_summaries = [tokenizer.decode(s, skip_special_tokens=True,
                                clean_up_tokenization_spaces=True)
               for s in summaries]
        decoded_summaries = [d.replace("<n>", " ") for d in decoded_summaries]

        metric.add_batch(predictions=decoded_summaries, references=target_batch)

    #  Finally compute and return the ROUGE scores.
    score = metric.compute()
    return score
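
The batching helper used in the evaluation loop can be tried on its own. The sketch below repeats the chunking function so that it runs standalone, and uses placeholder strings instead of real dialogues and summaries.

```python
def generate_batch_sized_chunks(list_of_elements, batch_size):
    """Yield successive batch-sized chunks from list_of_elements."""
    for i in range(0, len(list_of_elements), batch_size):
        yield list_of_elements[i : i + batch_size]

# Placeholder data standing in for dialogues and reference summaries
dialogues = ["d1", "d2", "d3", "d4", "d5"]
summaries = ["s1", "s2", "s3", "s4", "s5"]

article_batches = list(generate_batch_sized_chunks(dialogues, 2))
target_batches = list(generate_batch_sized_chunks(summaries, 2))

# Inputs and targets are chunked identically, so zip keeps them aligned
for article_batch, target_batch in zip(article_batches, target_batches):
    print(article_batch, target_batch)
```

The last batch is simply shorter when the data does not divide evenly by the batch size.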


In [2]:
!pip install evaluate


Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84.0/84.0 kB 2.6 MB/s eta 0:00:00
Installing collected packages: evaluate
Successfully installed evaluate-0.4.3


In [3]:
import evaluate

rouge_metric = evaluate.load('rouge')
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
# evaluate.load replaces the load_metric function that older versions of datasets provided

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

In [None]:
score = calculate_metric_on_test_ds(
    dataset_samsum['test'][0:10], rouge_metric, trainer.model, tokenizer,
    batch_size=2, column_text='dialogue', column_summary='summary'
)

# evaluate's ROUGE returns plain float scores, so there is no .mid.fmeasure to unpack
rouge_dict = {rn: score[rn] for rn in rouge_names}

# Convert the dictionary to a DataFrame for easy visualization
import pandas as pd
pd.DataFrame(rouge_dict, index=['pegasus'])


# The closer a score is to 1, the better; a ROUGE score of 1 means a perfect match with the reference

### Interpreting Good vs Bad ROUGE scores
1. Scores close to 1: This indicates a strong overlap between the generated summary and the reference summary, which is desirable in summarization tasks. For example, an F1 score of 0.7 or higher across metrics is generally considered good.
2. Scores between 0.5 and 0.7: Indicates moderate overlap. The summary might be capturing key points but is likely missing some structure or important information.
3. Scores below 0.5: Suggests a poor match between the generated and reference summaries. The model might be generating irrelevant or incomplete summaries that don't capture the key ideas well.

These ranges are rough rules of thumb. ROUGE measures n-gram overlap, so an abstractive summary that paraphrases the reference can score lower and still be good; comparing against a baseline on the same dataset is more informative than a fixed threshold.
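
To make the scores more concrete, here is a simplified ROUGE-1 F1 computed by hand. It is a sketch only: it splits on whitespace and lower-cases the text, whereas the `rouge` metric loaded through `evaluate` applies its own tokenization, so real scores can differ. The function name `rouge1_f1` and the example sentences are made up for illustration.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between a prediction and a reference."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

reference_text = "the cat sat on the mat"
print(rouge1_f1("the cat sat on the mat", reference_text))   # identical text
print(rouge1_f1("the cat lay on the mat", reference_text))   # one word differs
print(rouge1_f1("dogs bark loudly", reference_text))         # no shared words
```

Changing a single word out of six already lowers the score noticeably, which shows how sensitive ROUGE is to wording rather than meaning.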

In [None]:
## save model

model_pegasus.save_pretrained("pegasus-samsum-model")

In [None]:
## save tokenizer
tokenizer.save_pretrained("tokenizer")


In [None]:
# Load the tokenizer from the directory saved above
tokenizer = AutoTokenizer.from_pretrained("tokenizer")


In [None]:
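gen_kwargs = {"length_penalty": 0.8, "num_beams": 8, "max_length": 128}  # generation settings
# max_length = maximum number of tokens (not words) in the generated summary
# length_penalty = exponent applied to the sequence length in beam scoring; larger values favor longer summaries, smaller values favor shorter ones
# num_beams = number of candidate sequences kept at each beam search step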

sample_text = dataset_samsum["test"][0]["dialogue"]

reference = dataset_samsum["test"][0]["summary"]

pipe = pipeline("summarization", model="pegasus-samsum-model",tokenizer=tokenizer)
# creating the pipeline based on the model

print("Dialogue:")
print(sample_text)

print("\nReference Summary:")
print(reference)
# human-written reference summary

print("\nModel Summary:")
print(pipe(sample_text, **gen_kwargs)[0]["summary_text"])
# the output generated from the model
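
The effect of `length_penalty` in `gen_kwargs` can be illustrated with a small calculation. This sketch assumes the common beam search scoring rule in which a finished hypothesis is scored as its summed log-probability divided by its length raised to `length_penalty`; the two hypotheses and their numbers are made up for illustration.

```python
def beam_score(sum_logprobs, length, length_penalty):
    """Length-normalized score used to rank finished beam hypotheses."""
    return sum_logprobs / (length ** length_penalty)

# Hypothetical candidates: (summed log-probability, length in tokens)
hypotheses = {
    "short": (-4.0, 8),
    "long": (-7.4, 16),
}

def best_hypothesis(length_penalty):
    """Return the name of the highest-scoring hypothesis."""
    return max(
        hypotheses,
        key=lambda name: beam_score(*hypotheses[name], length_penalty),
    )

for penalty in (0.8, 1.0, 2.0):
    print(f"length_penalty={penalty}: best is {best_hypothesis(penalty)}")
```

In this example the shorter candidate wins at 0.8 while the longer one wins at 1.0 and above, which matches the intent of choosing a value below 1 for concise summaries.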