# Text Summarization Demo

*   This project is done on the local machine by clonning git URL using homebrew on Macbook.

1.   Install homebrew in your terminal: https://brew.sh/. Make sure you run all the recommended links.

2.   Create a new repository on your Github site and copy the **HTTPS** url.

3.   In the same terminal, type 'git clone' and paste the link:
    e.g.: git clone https://github.com/annguyenhuynh/Text-Summarization-Project.git

4.   On your local machine, you will see a folder with the name of the repository you created on Github that contains folders of your license (if you choose one) and README (if you also create one)

5.   The IDE used in this project is VSCode.


 **Project Structure**

*   In VSCode, under the Folder created by cloning your Github, create a file called "template.py". This file contains the template layout for this project.

*   The code below creating folders and files that will be used in the project

*   After building the code, in the terminal, run **python3 template.py** and your folder structure will be automatically created for you on the local machine.

*   You can create more folders and files later as needed by adding in the list 'list_files' and run the template.py in the terminal to make the updates.


import os
from pathlib import Path
import logging

logging.basicConfig(level=logging.INFO, format='[%(asctime)s]: %(message)s:')

project_name = "textSummarizer"

list_files = [
  ".github/workflows/.gitkeep",
  f"src/{project_name}/__init__.py",
  f"src/{project_name}/components/__init__.py",
  f"src/{project_name}/utils/__init__.py"
  f"src/{project_name}/utils/common.py",
  f"src/{project_name}/logging/__init__.py",
  f"src/{project_name}/config/__init__.py",
  f"src/{project_name}/config/configuration.py",
  f"src/{project_name}/pipeline/__init__.py",
  f"src/{project_name}/entity/__init__.py",
  f"src/{project_name}/constants/__init__.py",
  "config/config.yaml",
  "params.yaml",
  "app.py",
  "main.py",
  "Dockerfile",
  "reqirements.txt",
  "setup.py",
  "research/trials.ipynb"
]

for filepath in list_files:
  filepath = Path(filepath)
  filedir,filename = os.path.split(filepath)

  if filedir != "":
    os.makedirs(filedir,exist_ok=True)
    logging.info(f"Creating directory: {filedir} for file: {filename}")

  if(not os.path.exists(filepath)) or (os.path.getsize(filepath) == 0):
    with open(filepath, 'w') as f:
      pass
    logging.info(f"Creating empty file: {filepath}")

  else:
    logging.info(f"{filename} already exists")


**Commit to Github**

*  In your terminal, write and run
    *     git add .
    *     git commit -m "folder structure added"
    *    git push origin main

*   Go to Github site, refresh it and you will see all folders and files there.

**Creating environment for this project**

*   We should create an environment whenever we start a ML or DL project because each project may require different libraries and packages and some have dependency issues so we want to have all the projects separated from one another

*   How to create virtual environment?
    *   conda create -n [venv_name] python=3.8(versioning depends on your choice) -y

*   Why should we create conda env and not python env?
    *   According to my search on Google, Conda environments are more specific and better for reproducibility. Conda combines environment and package management, which can reduce compatibility issues and streamline workflow. Conda environments are often used in data science because they can handle complex dependencies, especially in scientific computing libraries.

*   After creating, you need to activate the virtual environment
    *   conda activate [venv_name]

















**Updating *requirements.txt***

*   In this txt file, we will list all the packages we need to install for this specific project



"""
transformers
transformers[sentencepiece]
datasets
sacrebleu
rouge-score
py7zr
pandas
nltk
tqdm
PyYAML
matplotlib
torch
notebook
boto3
mypy-boto3-s3
python-box==6.0.2
ensure==1.0.2
fastapi==0.78.0
uvicorn==0.18.3
Jinja2==3.1.2
-e .
"""

**Final Setup**

import setuptools

with open("README.md", "r", encoding="utf-8") as f:
  long_description = f.read()

__version__ = "0.0.0"

REPO_NAME = "Text-Summarization-Project"
AUTHOR_USERNAME = "annguyenhuynh"
SRC_REPO = "textSummarizer"
AUTHOR_EMAIL = "hnpa1997@gmail.com"

setuptools.setup(
  name=SRC_REPO,
  version=__version__,
  author=AUTHOR_USERNAME,
  author_email=AUTHOR_EMAIL,
  description="Text summarization project using NLP techniques",
  long_description=long_description,
  long_description_content_type="text/markdown",
  url=f"https://github.com/{AUTHOR_USERNAME}/{REPO_NAME}",
  project_urls={
    "Bug Tracker": f"https://github.com/{AUTHOR_USERNAME}/{REPO_NAME}/issues",
  },
  package_dir={"": "src"},
  packages=setuptools.find_packages(where="src")
)

**Installing all the packages in the .txt file and commit to Github**

*   pip install -r reqirements.txt

*   Run the above command in your terminal in the conda virtual environment. Besides running the .txt file, it will also run the 'setup.py' file. Now, everything is ready for the project.

*   Remember to commit to Github following the same steps used above.


**Set up the logging**

*   src --> textSummarizer --> logging --> open '__init__.py'


import os
import sys
import logging

logging_str = "[%(asctime)s] %(levelname)s: %(module)s: %(message)s]"
log_dir = "logs"
log_filepath = os.path.join(log_dir, "running_logs.log")
os.makedirs(log_dir, exist_ok=True)


logging.basicConfig(
  level=logging.INFO,
  format=logging_str,
  handlers=[
    logging.FileHandler(log_filepath),
    logging.StreamHandler(sys.stdout)]

)


*   In **main.py**, run
    *   from textSummarizer.logging import logger
    *   logger.info("Welcome to our custom logging")

*   In the terminal, if you run **python3 main.py**, you will see something like this:
    *   [2024-08-31 14:12:55,021] INFO: main: Welcome to our custom logging]

*   The results returned follow the setup of the logging_str: timestamp level(info) module(main) and message


**Store all the functions needed in common.py**

*   We will store all the packages needed in a file, and so when we need to use a specific function, we can refer to the file

import os
from box.exceptions import BoxValueError
import yaml
from textSummarizer.logging import logger
from ensure import ensure_annotations
from box import ConfigBox
from pathlib import Path
from typing import Any

@ensure_annotations
def read_yaml(path_to_yaml:Path) -> ConfigBox:
  """read yaml file and return

  Args:
    path_to_yaml (str):path like input

  Raises:
    ValueError: if yaml file is empty
    e: empty file

  Returns:
    ConfigBox: ConfigBox type
  """
  try:
    with open(path_to_yaml) as yaml_file:
      content = yaml.safe.load(yaml_file)
      logger.info(f"yaml file: {path_to_yaml} loaded successfully")
      return ConfigBox(content)
  except BoxValueError:
    raise ValueError("yaml file is empty")
  except Exception as e:
    raise e

  @ensure_annotations
  def create_directories(path_to_directories: list, verbose=True):
    """create list of directories

    Args:
      path_to_directories(list):list of path to directories
      ignore_log (bool, optional): ignore if multiple directories are to be created, default to False.
    """
    for path in path_to_directories:
      os.makedirs(path, exist_ok=True)
      if verbose:
        logger.info(f"Directory created at: {path}")

  @ensure_annotations
  def get_size(path:Path) -> str:
    """get size in KB
    Args:
    path (Path): path to the file

    Returns:
    str: size in KB
    """
    size_in_kb = round(os.path.getsize(path)/1024)
    return f"~ {size_in_kb} KB"

'@': **decorator** with **ensure_annotations**

*   A decorator in Python is a function that modifies or extends the behavior of another function
without changing the original function's code. Decorators are a design pattern that can be used to enhance the functionality of classes, methods, or functions.

*   Decorators can be used for logging and caching. Logging saves information about executed functions, such as arguments and return values. Caching stores arguments and return values so they can be reused

*   For e.g: you define a function
    *   def get_product(x: int, y: int) -> int:
          return x*y
  *  However, you call the function like this get_product(2,'4'). This returns '44' as string. If you add *@ensure_annotation* before your function, even 4 is entered as string, the output is still 8.

'->': **Hints**: tell the developers types of object expects and what it returns

' """ """ ': Everything inside these triple quotes are pseudo-code. It is the outline/logic of the real code

**Configbox**: read Python config values into Python types. We need it here because we have a function that reads YAML files

In [1]:
from transformers import pipeline, set_seed
from datasets import load_dataset, load_from_disk
import matplotlib.pyplot as plt
from datasets import load_dataset, load_metric
import pandas as pd

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

import nltk
from nltk.tokenize import sent_tokenize

from tqdm import tqdm
import torch

nltk.download("punkt")

  from .autonotebook import tqdm as notebook_tqdm
[nltk_data] Downloading package punkt to /Users/AnhHuynh/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [6]:
""" Why CUDA GPU?
CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model developed by NVIDIA
Parallel Computing: CUDA GPUs can run thousands of threads simultaneously, making them highly effective for tasks that can be parallelized.
Accelerated Workloads: Tasks like matrix operations, deep learning model training, and real-time image or video processing are much faster on a CUDA GPU compared to a CPU.
Deep Learning: CUDA is widely used in deep learning frameworks like TensorFlow and PyTorch to speed up model training and inference by utilizing the GPU.

Development Tools: CUDA provides libraries, tools, and frameworks (e.g., cuBLAS, cuDNN) to help developers write high-performance applications that can run on NVIDIA GPUs.
"""

' Why CUDA GPU?\nCUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model developed by NVIDIA\nParallel Computing: CUDA GPUs can run thousands of threads simultaneously, making them highly effective for tasks that can be parallelized.\nAccelerated Workloads: Tasks like matrix operations, deep learning model training, and real-time image or video processing are much faster on a CUDA GPU compared to a CPU.\nDeep Learning: CUDA is widely used in deep learning frameworks like TensorFlow and PyTorch to speed up model training and inference by utilizing the GPU.\n\nDevelopment Tools: CUDA provides libraries, tools, and frameworks (e.g., cuBLAS, cuDNN) to help developers write high-performance applications that can run on NVIDIA GPUs.\n'

In [7]:
device = "cuda" if torch.cuda.is_available() else "cpu"

In [8]:
model_ckpt = "google/pegasus-cnn_dailymail"

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

model_pegasus = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**Code explain**

*   **Model Checkpoint**: google/pegasus-cnn_dailymail. Pegasus pretraining task is intentionally similar to summarization: important sentences are removed/masked from an input document and are generated together as one output sequence from the remaining sentences, similar to an extractive summary

*   **AutoTokenizer**: loads the tokenizer associated with the specified model checkpoint. The tokenizer is responsible for converting text into tokens that the model can understand and later converting model output tokens back into text.

*   **AutoModelForSeq2SeqLM**: AutoModelForSeq2SeqLM can be used to load any seq2seq (or encoder-decoder) model that has a language modeling (LM) head on top

*   [**Auto Classes**](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForSeq2SeqLM)

*   [**Hugging Face - Pegasus**](https://huggingface.co/google/pegasus-cnn_dailymail)

*   [GenAI Udemy course](https://www.udemy.com/share/10bnQZ3@8z0IN-HKfHQ5jQHrBrD4Ipgm39VhFuZdEH4l4rwhoahc4MKvBJCTdPZl-RJT2y2tyQ==/)






In [2]:
from datasets import load_dataset

# Load the SAMSum dataset
dataset = load_dataset("samsum") 

# Display the first few examples from the train split
print(dataset['train'][0])


{'id': '13818513', 'dialogue': "Amanda: I baked  cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)", 'summary': 'Amanda baked cookies and will bring Jerry some tomorrow.'}


In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
})

In [11]:
split_lengths = [len(dataset[split]) for split in dataset]

print(f"Split lengths: {split_lengths}")
print(f"Features: {dataset['train'].column_names}")
print("\nDialogue")

print(dataset['test'][1]['dialogue'])

print('\nSummary')

print(dataset['test'][1]['summary'])

Split lengths: [14732, 819, 818]
Features: ['id', 'dialogue', 'summary']

Dialogue
Eric: MACHINE!
Rob: That's so gr8!
Eric: I know! And shows how Americans see Russian ;)
Rob: And it's really funny!
Eric: I know! I especially like the train part!
Rob: Hahaha! No one talks to the machine like that!
Eric: Is this his only stand-up?
Rob: Idk. I'll check.
Eric: Sure.
Rob: Turns out no! There are some of his stand-ups on youtube.
Eric: Gr8! I'll watch them now!
Rob: Me too!
Eric: MACHINE!
Rob: MACHINE!
Eric: TTYL?
Rob: Sure :)

Summary
Eric and Rob are going to watch a stand-up on youtube.


**Code explain**


*   split: The dataset object is typically a dictionary-like object where each key corresponds to a split (e.g., "train", "validation", "test").

*   len(dataset[split]): For each split in the dataset, len() calculates the number of examples in that split.

*   [len(dataset[split]) for split in dataset] iterates over all the splits in the dataset, calculates their lengths, and stores the results in a list.






In [12]:
def convert_examples_to_features(example_batch):
  input_encodings = tokenizer(example_batch['dialogue'], max_length = 1024, truncation=True)
  
  with tokenizer.as_target_tokenizer():
    target_encodings = tokenizer(example_batch['summary'],max_length = 128,truncation=True)

  return {
      'input_ids': input_encodings['input_ids'],
      'attention_mask': input_encodings['attention_mask'],
      'labels': target_encodings['input_ids']
  }

**Code explain**
*   This function is used to prepocess text data for seq-to-seq task. This function takes a batch of examples  from the dataset loaded earlier and converts them into features that a model can use for training or inference.

*   **tokenizer(example_batch['dialogue'], max_length=1024, truncation=True)**: This line tokenizes the input text, which is the dialogue in this case. The max_length=1024 parameter ensures that the input sequence is truncated if it exceeds 1024 tokens. The result is a dictionary containing input_ids (the tokenized input) and attention_mask (which indicates which tokens are padding and which are actual input).

*   **with tokenizer.as_target_tokenizer()**: This temporarily switches the tokenizer to target mode. This is important for seq2seq tasks where the model needs to generate text.

*   **target_encodings = tokenizer(example_batch['summary'], max_length=128, truncation=True)**: This tokenizes the target text, which is the summary in this case, ensuring the sequence length does not exceed 128 tokens.

*   Returning features:

    *   **input_ids**: The token IDs corresponding to the input dialogue.

    *   **attention_mask**: The attention mask that tells the model which parts of the input are actual tokens and which are padding.

    *   **labels**: The token IDs corresponding to the target summary. These are used as the labels during training.





In [13]:
tokenized_dataset = dataset.map(convert_examples_to_features, batched=True)

Map: 100%|██████████| 818/818 [00:00<00:00, 2016.36 examples/s]


In [14]:
tokenized_dataset['train']

Dataset({
    features: ['id', 'dialogue', 'summary', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 14732
})

In [15]:
# Training
from transformers import DataCollatorForSeq2Seq

seq2seq_data_collator = DataCollatorForSeq2Seq(tokenizer,model=model_pegasus)

**Code explain**

*   DataCollatorForSeq2Seq is a utility in the Hugging Face Transformers library designed to help with batching and preparing data for sequence-to-sequence (seq2seq) tasks. It is particularly useful when working with models like T5, BART, and Pegasus.

*   When training or evaluating seq2seq models, you need to handle batches of input sequences and their corresponding target sequences (labels). These sequences often have different lengths, so padding is necessary to ensure that all sequences in a batch have the same length. DataCollatorForSeq2Seq automates this process, making it easier to prepare your data for the model.


In [16]:
from transformers import TrainingArguments, Trainer

trainer_args = TrainingArguments(
    output_dir = 'pegasus-samsum', num_train_epochs=1, warmup_steps=500,
    per_device_train_batch_size=1, per_device_eval_batch_size=1,
    weight_decay=0.01, logging_steps=10,
    evaluation_strategy ='steps', eval_steps=500, save_steps=16,
    gradient_accumulation_steps=16
)



**Code explain**

*   **output_dir**='pegasus-samsum': This specifies the directory where the model checkpoints, logs, and other outputs will be saved. After training, you'll find the fine-tuned model, configuration files, and any evaluation results in this directory.

*   **num_train_epochs=1**: The number of complete passes through the training dataset. In this case, it's set to 1, meaning the model will be trained for one epoch.

*   **warmup_steps=500**: The number of steps used for a warmup phase, where the learning rate increases linearly from 0 to its target value. This can help stabilize training in the early stage

*   **weight_decay=0.01**: Weight decay is a regularization technique to prevent overfitting by penalizing large weights in the model. This value (0.01) is applied during optimization.

*   **logging_steps=10**: This specifies that the training metrics (like loss, learning rate, etc.) should be logged every 10 steps.

*   **evaluation_strategy='steps'**: This tells the Trainer to run evaluation during training based on steps. Other options include epoch (to evaluate at the end of each epoch) or no (no evaluation during training).

*   **eval_steps=500**: Evaluation will be performed every 500 steps during training. This is useful for monitoring model performance throughout training.

*   **save_steps=16**: This determines how often model checkpoints will be saved, set here to every 16 steps. Frequent checkpointing can help avoid losing progress in case of interruptions.


*   **gradient_accumulation_steps=16**: Gradients will be accumulated over 16 steps before performing a backward pass. This effectively increases the batch size to 16 * per_device_train_batch_size, allowing you to simulate a larger batch size without requiring as much GPU memory.





In [17]:
trainer = Trainer(model=model_pegasus, args=trainer_args,
                  tokenizer=tokenizer, data_collator=seq2seq_data_collator,
                  train_dataset=tokenized_dataset["test"],
                  eval_dataset=tokenized_dataset["validation"])

In [18]:
trainer.train()

 20%|█▉        | 10/51 [21:44<1:35:36, 139.91s/it]

{'loss': 3.1101, 'grad_norm': 31.94026756286621, 'learning_rate': 1.0000000000000002e-06, 'epoch': 0.2}


Non-default generation parameters: {'max_length': 128, 'min_length': 32, 'num_beams': 8, 'length_penalty': 0.8, 'forced_eos_token_id': 1}
 39%|███▉      | 20/51 [42:55<1:02:48, 121.55s/it]

{'loss': 3.0467, 'grad_norm': 15.841309547424316, 'learning_rate': 2.0000000000000003e-06, 'epoch': 0.39}


 59%|█████▉    | 30/51 [1:03:52<45:15, 129.33s/it]

{'loss': 3.1613, 'grad_norm': 10.095331192016602, 'learning_rate': 3e-06, 'epoch': 0.59}


Non-default generation parameters: {'max_length': 128, 'min_length': 32, 'num_beams': 8, 'length_penalty': 0.8, 'forced_eos_token_id': 1}
 78%|███████▊  | 40/51 [1:23:50<21:39, 118.18s/it]

{'loss': 2.9901, 'grad_norm': 20.720611572265625, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.78}


Non-default generation parameters: {'max_length': 128, 'min_length': 32, 'num_beams': 8, 'length_penalty': 0.8, 'forced_eos_token_id': 1}
 98%|█████████▊| 50/51 [1:41:13<01:43, 103.05s/it]

{'loss': 2.8539, 'grad_norm': 51.40227127075195, 'learning_rate': 5e-06, 'epoch': 0.98}


Non-default generation parameters: {'max_length': 128, 'min_length': 32, 'num_beams': 8, 'length_penalty': 0.8, 'forced_eos_token_id': 1}
100%|██████████| 51/51 [1:43:06<00:00, 121.31s/it]

{'train_runtime': 6186.8641, 'train_samples_per_second': 0.132, 'train_steps_per_second': 0.008, 'train_loss': 3.0231569280811383, 'epoch': 1.0}





TrainOutput(global_step=51, training_loss=3.0231569280811383, metrics={'train_runtime': 6186.8641, 'train_samples_per_second': 0.132, 'train_steps_per_second': 0.008, 'total_flos': 313450454089728.0, 'train_loss': 3.0231569280811383, 'epoch': 0.9963369963369964})

In [26]:
#Evaluation
def generate_batch_sized_chunk(list_of_elements, batch_size):
  """split the dataset into smaller batches that we can process simultaneuosly. Yield successive batch-sized chunks from list of elements"""
  for i in range(0, len(list_of_elements), batch_size):
    yield list_of_elements[i:i + batch_size]

def calculate_metric_on_test_ds(dataset, metric,model, tokenizer,
                                batch_size=16, device=device,
                                column_text = 'article',
                                column_summary = 'highlights'):
  article_batches = list(generate_batch_sized_chunk(dataset[column_text], batch_size))
  target_batches = list(generate_batch_sized_chunk(dataset[column_text],batch_size)) 

  for article_batch, target_batch in tqdm(
    zip(article_batches, target_batches), total=len(article_batches)):

    inputs = tokenizer(article_batch,max_length=1024,truncation=True,
                       padding="max_length", return_tensors="pt") 

    summaries = model.generate(input_ids=inputs['input_ids'].to(device), 
                               attention_mask=inputs["attention_mask"].to(device), 
                               length_penalty=0.8, num_beams=8, max_length=128)
    '''parameter for legnth penalty ensures that the model does not generate sequences that are too long.'''
  #Decode generated text,
  # replace the token, and add the decoded texts with the references to the metrics
    decoded_summaries = [tokenizer.decode(s, skip_special_tokens=True,
                            clean_up_tokenization_spaces=True)
                    for s in summaries] 
    decoded_summaries = [d.replace(""," ") for d in decoded_summaries]

    metric.add_batch(predictions = decoded_summaries, references = target_batch)

  # Compute and return the rogue scores
  score = metric.compute() 
  return score

**Code explain**
* **tqdm**: provides a progress bar to track the loop’s progress.
* **zip**: pairs the batches of articles and target summaries for processing.
* **tokenizer**: encodes the input texts (articles) into token IDs that the model can process. It also pads the sequences to a maximum length and returns them as PyTorch tensors.
* **num_beams** (int, optional, defaults to 1) — Number of beams for beam search. 1 means no beam search.
* *Beam Search* is a  search algorithm used in deep learning and natural language processing (NLP) that helps find solutions to problems with large search spaces.
* **Length penalty** normalizes the scores based on the sequence length
* **pt**: pytorch device

In [22]:
rogue_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"] 
rogue_metric = load_metric("rouge")

Downloading builder script: 5.65kB [00:00, 1.85MB/s]                   


**ROUGE SCORE**
* The ROUGE score, or *Recall-Oriented Understudy for Gisting Evaluation*, is a metric used to evaluate the quality of machine translation and summarization models. It compares a machine-generated summary or translation to a human-produced reference. The ROUGE score is a scalar value between 0 and 1, with higher scores indicating greater similarity between the two.

In [27]:
score = calculate_metric_on_test_ds(
  dataset['test'][0:10], rogue_metric, trainer.model, tokenizer, batch_size=2, column_text='dialogue', column_summary='summary')
rogue_dict = dict((rn, score[rn].mid.fmeasure) for rn in rogue_names) 
# score[rn].mid.fmeasure, which is the F1 score for that particular ROUGE metric.

pd.DataFrame(rogue_dict,index=[f'pegasus]'])

100%|██████████| 5/5 [09:41<00:00, 116.31s/it]


Unnamed: 0,rouge1,rouge2,rougeL,rougeLsum
pegasus],0.074672,0.005143,0.067024,0.074331


In [29]:
# Save model 
model_pegasus.save_pretrained("pegasus-samsum-model")

Non-default generation parameters: {'max_length': 128, 'min_length': 32, 'num_beams': 8, 'length_penalty': 0.8, 'forced_eos_token_id': 1}


In [30]:
# Save tokenizer
tokenizer.save_pretrained("tokenizer")

('tokenizer/tokenizer_config.json',
 'tokenizer/special_tokens_map.json',
 'tokenizer/spiece.model',
 'tokenizer/added_tokens.json',
 'tokenizer/tokenizer.json')

In [31]:
#load 
tokenizer = AutoTokenizer.from_pretrained("tokenizer")

In [32]:
#Predictions 

gen_kwargs = {"length_penalty": 0.8, "num_beams": 8, "max_length": 128}
sample_text = dataset["test"][0]["dialogue"]
reference = dataset["test"][0]["summary"] 
pipe = pipeline("summarization", model="pegasus-samsum-model", tokenizer=tokenizer)

print("Dialogue: ", sample_text)

print("\nReference Summary: ", reference)

print("\nModel Summary: ", pipe(sample_text, **gen_kwargs)[0]["summary_text"])

Your max_length is set to 128, but your input_length is only 122. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=61)


Dialogue:  Hannah: Hey, do you have Betty's number?
Amanda: Lemme check
Hannah: <file_gif>
Amanda: Sorry, can't find it.
Amanda: Ask Larry
Amanda: He called her last time we were at the park together
Hannah: I don't know him well
Hannah: <file_gif>
Amanda: Don't be shy, he's very nice
Hannah: If you say so..
Hannah: I'd rather you texted him
Amanda: Just text him 🙂
Hannah: Urgh.. Alright
Hannah: Bye
Amanda: Bye bye

Reference Summary:  Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry.

Model Summary:  Amanda: Ask Larry Amanda: He called her last time we were at the park together .<n>Hannah: I'd rather you texted him .<n>Amanda: Just text him .
