# Summarization Project with T5 and Hyperparameter Tuning

This notebook demonstrates how to use the T5 model for text summarization using the Hugging Face Transformers library. We will also perform hyperparameter tuning using Optuna and track our experiments with MLflow. Data loading will be managed using PyTorch's DataLoader.

## Objectives
1. Load and preprocess the CNN/DailyMail dataset.
2. Tokenize the data using the T5 tokenizer.
3. Set up hyperparameter tuning using Optuna.
4. Train the model with the best hyperparameters.
5. Evaluate the model and save it for later use.
6. Create an interactive widget for text summarization.

## 1. Setup and Installation
First, we need to install the necessary libraries.


In [None]:
pip install transformers[torch]


In [None]:
# Install necessary packages
!pip install datasets transformers optuna mlflow

import datasets
from transformers import T5ForConditionalGeneration, T5Tokenizer, Trainer, TrainingArguments
import optuna
import mlflow
from torch.utils.data import DataLoader, Dataset


## 2. Load and Preprocess the Dataset
We will load the CNN/DailyMail dataset and preprocess it.

In [None]:
# Load the CNN/DailyMail dataset
dataset = datasets.load_dataset('cnn_dailymail', '3.0.0')

# Use small subsets for tuning and training
train_data = dataset['train'].select(range(1000))  # Use the first 1,000 examples
validation_data = dataset['validation'].select(range(200))  # Use the first 200 examples

# Load the model and tokenizer
model_name = 't5-base'  # Upgraded model from small
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = T5Tokenizer.from_pretrained(model_name)

def preprocess_data(examples: dict) -> dict:
    """
    Tokenize a batch of articles and their reference summaries.

    Args:
        examples (dict): A batch from `Dataset.map(batched=True)`, mapping column names to lists of values.

    Returns:
        dict: The tokenized inputs plus a `labels` entry with the target token IDs.
    """
    # T5 is a text-to-text model, so the task is selected with a text prefix
    inputs = [f"summarize: {doc}" for doc in examples['article']]
    model_inputs = tokenizer(inputs, max_length=512, truncation=True, padding='max_length')

    # `text_target` replaces the deprecated `as_target_tokenizer()` context manager
    labels = tokenizer(text_target=examples['highlights'], max_length=150, truncation=True, padding='max_length')

    model_inputs['labels'] = labels['input_ids']
    return model_inputs

# Preprocess the data
train_data = train_data.map(preprocess_data, batched=True)
validation_data = validation_data.map(preprocess_data, batched=True)
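
One detail of the preprocessing above is worth illustrating. With `padding='max_length'`, the label sequences contain pad tokens, and unless they are masked the loss is also computed on those positions. A common remedy is to replace pad IDs in the labels with `-100`, the index that PyTorch's cross-entropy loss ignores by default. The sketch below is a minimal, library-free illustration: `mask_padding` is a hypothetical helper that is not part of the original notebook, and the pad ID of `0` is an assumption chosen to match the T5 tokenizer's `pad_token_id`.

```python
from typing import List


def mask_padding(label_ids: List[List[int]], pad_token_id: int, ignore_index: int = -100) -> List[List[int]]:
    """Replace pad token IDs in each label sequence with the ignore index."""
    return [
        [tok if tok != pad_token_id else ignore_index for tok in seq]
        for seq in label_ids
    ]


# Toy label batch: two sequences padded with 0 (assumed pad ID) to length 5
toy_labels = [[37, 1712, 1, 0, 0], [8, 1, 0, 0, 0]]
masked = mask_padding(toy_labels, pad_token_id=0)
print(masked)
```

Inside `preprocess_data`, this could be applied as `model_inputs['labels'] = mask_padding(labels['input_ids'], tokenizer.pad_token_id)`. Doing so changes the loss values, so they would no longer be comparable with the numbers reported later in this notebook.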



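## 3. Create DataLoaders
We'll create DataLoaders to batch the tokenized examples. The `Trainer` used in the later sections builds its own DataLoaders internally, so these are only needed when iterating over batches manually.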


In [None]:
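# Create DataLoaders
# Expose only the tensor columns so the default collate function can stack them
tensor_columns = ['input_ids', 'attention_mask', 'labels']
train_dataloader = DataLoader(train_data.with_format('torch', columns=tensor_columns), batch_size=4, shuffle=True)
validation_dataloader = DataLoader(validation_data.with_format('torch', columns=tensor_columns), batch_size=4)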


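## 4. Hyperparameter Tuning with Optuna
We will use Optuna to search over the learning rate, batch size, weight decay, warmup steps, and gradient accumulation steps, minimizing the validation loss. Each trial's parameters and metrics are logged to MLflow.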

In [None]:
def objective(trial: optuna.Trial) -> float:
    """
    Objective function for hyperparameter tuning with Optuna.

    Args:
        trial (optuna.Trial): The trial object for hyperparameter suggestions.

    Returns:
        float: The evaluation loss of the model.
    """
    training_args = TrainingArguments(
        output_dir='./results',
        evaluation_strategy='steps',
        eval_steps=50,  # Evaluate every 50 steps
        logging_steps=50,  # Log every 50 steps
        # suggest_float(..., log=True) replaces the deprecated suggest_loguniform
        learning_rate=trial.suggest_float('learning_rate', 1e-5, 5e-5, log=True),
        per_device_train_batch_size=trial.suggest_categorical('per_device_train_batch_size', [2, 4]),
        per_device_eval_batch_size=4,
        num_train_epochs=1,  # Reduced number of epochs for quicker training
        weight_decay=trial.suggest_float('weight_decay', 0.01, 0.1, log=True),
        warmup_steps=trial.suggest_int('warmup_steps', 0, 100),
        fp16=True,  # Use mixed precision (requires a CUDA GPU)
        gradient_accumulation_steps=trial.suggest_int('gradient_accumulation_steps', 1, 2),  # Gradient accumulation
    )

    # Start every trial from the pretrained checkpoint so trials are comparable;
    # reusing the global `model` would carry weights over from earlier trials
    trial_model = T5ForConditionalGeneration.from_pretrained(model_name)

    trainer = Trainer(
        model=trial_model,
        args=training_args,
        train_dataset=train_data,
        eval_dataset=validation_data,
    )

    trainer.train()
    eval_results = trainer.evaluate()

    # Log each trial as its own MLflow run so parameter values do not collide
    with mlflow.start_run(run_name=f"trial-{trial.number}", nested=True):
        mlflow.log_params(trial.params)
        mlflow.log_metrics(eval_results)
    return eval_results['eval_loss']

# Optimize hyperparameters
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=3)

# Best hyperparameters
best_params = study.best_trial.params
print("Best Hyperparameters:", best_params)


[I 2024-06-26 00:27:37,259] A new study created in memory with name: no-name-62253ecd-b1d1-4af7-a6fb-f2fac401f20e


Step,Training Loss,Validation Loss
50,4.6255,0.718179
100,0.8771,0.649758
150,0.8332,0.613636
200,0.7209,0.596988
250,0.7284,0.591885
300,0.7321,0.590039
350,0.6868,0.588389
400,0.7262,0.587784
450,0.7166,0.586748
500,0.7059,0.586672


[I 2024-06-26 00:31:35,216] Trial 0 finished with value: 0.5866720676422119 and parameters: {'learning_rate': 3.867178297525716e-05, 'per_device_train_batch_size': 2, 'weight_decay': 0.04594653833284964, 'warmup_steps': 15, 'gradient_accumulation_steps': 1}. Best is trial 0 with value: 0.5866720676422119.


Step,Training Loss,Validation Loss
50,0.6622,0.586936
100,0.667,0.58793


[I 2024-06-26 00:33:26,554] Trial 1 finished with value: 0.5870904326438904 and parameters: {'learning_rate': 3.731743796809645e-05, 'per_device_train_batch_size': 4, 'weight_decay': 0.04226427923883455, 'warmup_steps': 47, 'gradient_accumulation_steps': 2}. Best is trial 0 with value: 0.5866720676422119.


Step,Training Loss,Validation Loss
50,0.587,0.589927
100,0.5353,0.599582
150,0.5567,0.604789
200,0.5992,0.600141
250,0.6772,0.597392


[I 2024-06-26 00:35:45,658] Trial 2 finished with value: 0.5973921418190002 and parameters: {'learning_rate': 3.1459223287335566e-05, 'per_device_train_batch_size': 4, 'weight_decay': 0.0225238325308319, 'warmup_steps': 99, 'gradient_accumulation_steps': 1}. Best is trial 0 with value: 0.5866720676422119.


Best Hyperparameters: {'learning_rate': 3.867178297525716e-05, 'per_device_train_batch_size': 2, 'weight_decay': 0.04594653833284964, 'warmup_steps': 15, 'gradient_accumulation_steps': 1}
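
The learning rate and weight decay above are sampled on a logarithmic scale, which spreads trials evenly across orders of magnitude rather than across raw values. As a rough sketch of the idea (not Optuna's actual sampler, which by default also adapts to earlier trials), log-uniform sampling draws uniformly between `log(low)` and `log(high)` and exponentiates the result. The function name `sample_loguniform` and the fixed seed are illustrative assumptions.

```python
import math
import random
import statistics


def sample_loguniform(low: float, high: float, rng: random.Random) -> float:
    """Draw one value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # Fixed seed so the sketch is reproducible
samples = [sample_loguniform(1e-5, 5e-5, rng) for _ in range(1000)]

# The median should sit near the geometric mean of the bounds,
# not the arithmetic mean
geometric_mean = math.sqrt(1e-5 * 5e-5)
print(f"median sample:  {statistics.median(samples):.3e}")
print(f"geometric mean: {geometric_mean:.3e}")
```

For a narrow range such as `1e-5` to `5e-5`, the difference from plain uniform sampling is modest; it matters more when the bounds span several orders of magnitude.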


## 5. Train the Model with Best Hyperparameters
Now we will train the model using the best hyperparameters found.

In [None]:
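training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='steps',
    eval_steps=500,  # Evaluate every 500 steps
    logging_steps=500,  # Log every 500 steps
    learning_rate=best_params['learning_rate'],
    per_device_train_batch_size=best_params['per_device_train_batch_size'],
    num_train_epochs=3,
    weight_decay=best_params['weight_decay'],
    warmup_steps=best_params['warmup_steps'],
    gradient_accumulation_steps=best_params['gradient_accumulation_steps'],  # Also tuned above
    fp16=True,  # Use mixed precision
)

# The Trainer builds its own DataLoaders from the datasets passed below;
# these manual loaders over the 1,000/200-example subsets are only for custom loops
train_dataloader = DataLoader(train_data.with_format('torch', columns=['input_ids', 'attention_mask', 'labels']), batch_size=best_params['per_device_train_batch_size'], shuffle=True)
validation_dataloader = DataLoader(validation_data.with_format('torch', columns=['input_ids', 'attention_mask', 'labels']), batch_size=8)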

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=validation_data,
)

# Train the model
trainer.train()

# Evaluate the model
eval_results = trainer.evaluate()
print(f"Evaluation Results: {eval_results}")




Step,Training Loss,Validation Loss
500,0.5467,0.59148
1000,0.5724,0.599549
1500,0.5356,0.601827


Evaluation Results: {'eval_loss': 0.6018274426460266, 'eval_runtime': 5.7231, 'eval_samples_per_second': 34.946, 'eval_steps_per_second': 4.368, 'epoch': 3.0}
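
The step counts in the training logs follow from the dataset size, batch size, gradient accumulation, and number of epochs. The sketch below reproduces that arithmetic for the three tuning trials and the final run. It assumes a single device, no dropped last batch, and that optimizer steps per epoch are the number of batches floor-divided by the accumulation steps; this is an assumption about how the `Trainer` counts steps, although it is consistent with the logs above. `optimizer_steps` is a hypothetical helper.

```python
import math


def optimizer_steps(num_examples: int, batch_size: int, grad_accum: int, epochs: int) -> int:
    """Estimate the total number of optimizer updates for a training run."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return math.ceil(epochs * updates_per_epoch)


# (label, batch size, gradient accumulation steps, epochs) for each run in this notebook
runs = [
    ("trial 0", 2, 1, 1),
    ("trial 1", 4, 2, 1),
    ("trial 2", 4, 1, 1),
    ("final training", 2, 1, 3),
]

for label, batch_size, grad_accum, epochs in runs:
    steps = optimizer_steps(1000, batch_size, grad_accum, epochs)
    print(f"{label}: {steps} optimizer steps")
```

With `eval_steps=50`, a trial of 125 steps is evaluated only at steps 50 and 100, which is why trial 1 shows two rows in its table while trial 0 shows ten.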


## 6. Save and Load the Model
We will save the trained model and tokenizer for future use.

In [None]:
# Save the model
model.save_pretrained('./saved_model')
tokenizer.save_pretrained('./saved_model')

# Load the model
model = T5ForConditionalGeneration.from_pretrained('./saved_model')
tokenizer = T5Tokenizer.from_pretrained('./saved_model')




## 7. Summarization Pipeline
We will create a summarization pipeline using the trained model.

In [None]:
from transformers import pipeline

# Summarization pipeline
summarizer = pipeline('summarization', model=model, tokenizer=tokenizer)

# Example summarization
text = "It seemed like it should have been so simple. There was nothing inherently difficult with getting the project done. It was simple and straightforward enough that even a child should have been able to complete it on time, but that wasn't the case. The deadline had arrived and the project remained unfinished."
summary = summarizer(text, max_length=150, min_length=50, do_sample=False)
print(summary[0]['summary_text'])


Your max_length is set to 150, but your input_length is only 67. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=33)


a child should have been able to complete a project on time . But that wasn't the case . The deadline had arrived and the project remained unfinished . "It was simple and straightforward enough" is the saying of the author .
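
The warning above arises because `max_length=150` exceeds the 67-token input, and a `min_length` of 50 may force the model to keep generating after it has run out of source material. One way to avoid this is to derive both bounds from the input length. The helper below is a hypothetical sketch, not part of the Transformers API: it halves the input length for the upper bound, mirroring the `max_length=33` suggested in the warning, and takes a third of that for the lower bound. The cap of 150 and the floor of 5 are arbitrary assumptions.

```python
from typing import Tuple


def suggest_length_bounds(input_length: int, max_cap: int = 150, min_floor: int = 5) -> Tuple[int, int]:
    """Return (min_length, max_length) scaled to the number of input tokens."""
    # Upper bound: half the input, capped, and always above the floor
    max_length = max(min_floor + 1, min(max_cap, input_length // 2))
    # Lower bound: a third of the upper bound, but never below the floor
    min_length = max(min_floor, max_length // 3)
    return min_length, max_length


print(suggest_length_bounds(67))    # short input, as in the example above
print(suggest_length_bounds(1000))  # long article
```

In the notebook, the input length could be obtained with `len(tokenizer(text)['input_ids'])` and the result passed on as `summarizer(text, min_length=min_len, max_length=max_len, do_sample=False)`. The count may differ slightly from the one the pipeline reports, since the pipeline can add a task prefix before tokenizing.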


## 8. Interactive Widgets
We'll create interactive widgets for text summarization.

In [None]:
import ipywidgets as widgets
from IPython.display import display

# Create input and output widgets
text_area = widgets.Textarea(
    value='',  # Start empty so the hint text is not summarized by accident
    placeholder='Enter text to summarize here...',
    description='Text:',
    disabled=False,
    layout=widgets.Layout(width='100%', height='200px')
)

button = widgets.Button(description="Summarize")
output = widgets.Output()

def summarize_text(b: widgets.Button) -> None:
    """
    Summarize the text from the input widget.

    Args:
        b (widgets.Button): The button widget.
    """
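    with output:
        output.clear_output()
        text = text_area.value.strip()
        if not text:
            # Nothing to summarize; prompt the user instead of calling the model
            print("Please enter some text to summarize.")
            return
        summary = summarizer(text, max_length=150, min_length=50, do_sample=False)
        print(summary[0]['summary_text'])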

button.on_click(summarize_text)

display(text_area, button, output)

