# Summarization of Customer Reviews by Fine-tuning mT5 on Multilingual Amazon Reviews Corpus dataset (PyTorch)


## Installments

In [1]:
import torch
torch.cuda.is_available()

True

In [None]:
!pip install datasets evaluate transformers[sentencepiece]
# !pip install accelerate
# !apt install git-lfs

## logins

In [None]:
# # Setup git
# !git config --global user.email ""
# !git config --global user.name ""

In [3]:
# logged in to the Hugging Face Hub
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

## Preparing a multilingual corpus

We’ll use the [Multilingual Amazon Reviews Corpus](https://www.kaggle.com/datasets/mexwell/amazon-reviews-multi/data) to create our bilingual summarizer for English and Spanish.

### Download the English and Spanish subsets from the Hugging Face Hub

In [4]:
## Not available on hf
# from datasets import load_dataset

# spanish_dataset = load_dataset("amazon_reviews_multi", "es")
# english_dataset = load_dataset("amazon_reviews_multi", "en")
# english_dataset

In [5]:
## Download dataset from Kaggle
!pip install kaggle
!mkdir ~/.kaggle
!mv ~/Downloads/kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

mv: cannot stat '/root/Downloads/kaggle.json': No such file or directory
chmod: cannot access '/root/.kaggle/kaggle.json': No such file or directory


In [6]:
!kaggle datasets download -d mexwell/amazon-reviews-multi

Dataset URL: https://www.kaggle.com/datasets/mexwell/amazon-reviews-multi
License(s): other
Downloading amazon-reviews-multi.zip to /content
 89% 117M/131M [00:01<00:00, 127MB/s] 
100% 131M/131M [00:01<00:00, 111MB/s]


In [7]:
!unzip amazon-reviews-multi.zip -d amazon-reviews-multi

Archive:  amazon-reviews-multi.zip
  inflating: amazon-reviews-multi/test.csv  
  inflating: amazon-reviews-multi/train.csv  
  inflating: amazon-reviews-multi/validation.csv  


In [8]:
# Load the dataset from the CSV file
from datasets import load_dataset

dataset = load_dataset(
    'csv',
    data_files={
        'train': 'amazon-reviews-multi/train.csv',
        'validation': 'amazon-reviews-multi/validation.csv',
        'test': 'amazon-reviews-multi/test.csv'  # Closing quotation mark added
    }
)


Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

In [9]:
# dataset = dataset.map(remove_columns=['Unnamed: 0'])

In [10]:
# # Ensure there are no NoneType objects in the 'validation' split
# def clean_dataset(dataset):
#     def filter_none(example):
#         return {k: (v if v is not None else '') for k, v in example.items()}

#     return dataset.map(filter_none)

# dataset = clean_dataset(dataset)

In [11]:
# Filter the dataset by language
spanish_dataset = dataset.filter(lambda example: example['language'] == 'es')
english_dataset = dataset.filter(lambda example: example['language'] == 'en')

Filter:   0%|          | 0/1200000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/30000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/30000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1200000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/30000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/30000 [00:00<?, ? examples/s]

In [12]:
english_dataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 200000
    })
    validation: Dataset({
        features: ['Unnamed: 0', 'review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['Unnamed: 0', 'review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 5000
    })
})

In [13]:
def show_samples(dataset, num_samples=3, seed=121):
    sample = dataset["train"].shuffle(seed=seed).select(range(num_samples))
    for example in sample:
        print(f"Title: {example['review_title']}")
        print(f"Review: {example['review_body']}")
        print(f"\n")

show_samples(english_dataset)

Title: great size
Review: I was surprised how large this was! it fits out entire patio table and 2 other chairs with room to spare! The quality is great! It was easy to put on and now I can relax a bit knowing my new patio furniture is safe from the elements. The set came packaged very condensed but opens easily. great deal too!!


Title: A lot of rosin on initial use but quickly dissipates after use.
Review: Great bag for keeping your hands dry of moisture and sweat while bowling. Only drawback was on initial use there was a ton of rosin and very shortly after there wasn’t much coming out of the bag when squeezed and patted in hands or over finger holes.


Title: Pictures are not in color
Review: This book is written well for kids. However, it refers to the colors in the pictures but the pictures are all in black and white. Kind of aggravating.




In [14]:
# Compute the number of reviews per product category
english_dataset.set_format("pandas")
english_df = english_dataset["train"][:]

# Show counts for the top 20 products
english_df["product_category"].value_counts()[:20]

Unnamed: 0_level_0,count
product_category,Unnamed: 1_level_1
home,17679
apparel,15951
wireless,15717
other,13418
beauty,12091
drugstore,11730
kitchen,10382
toy,8745
sports,8277
automotive,7506


In [15]:
# Switch the format of english_dataset_dict from "pandas" back to "arrow"
english_dataset.reset_format()

In [16]:
# Function to filter rows involving the book categories
def filter_books(example):
    return (
        example["product_category"] == "book"
        or example["product_category"] == "digital_ebook_purchase"
    )

In [17]:
spanish_books = spanish_dataset.filter(filter_books)
english_books = english_dataset.filter(filter_books)
show_samples(english_books)

Filter:   0%|          | 0/200000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/5000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/5000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/200000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/5000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/5000 [00:00<?, ? examples/s]

Title: Everyone can find their happily ever after - Spoilers Ahead
Review: I received an ARC of this book to provide a review. This is a well written and edited book. Ben’s dad passes away and in in his will there is a marriage clause for him to receive his inheritance. He gets married to Pam who is stewardess he meets to be able to receive his inheritance. When he meets Jessica her daughter her daughter who is in college he falls for her. He falls very hard for Jessica and wants a life with her. I recommend that you read the book to find out how all individuals have find there happily ever after. I woul give this book a 3.5. Great job Amy Brent!!!


Title: Love it!
Review: Much bigger than i expected lol but the book is perfect !


Title: Seven great plots
Review: Dev is a PI with the morals of an alley cat, and the unfortunate talent for being in the wrong place at the wrong time with the wrong female! His tendency to trust the wrong female has him in hot water with the local police 

### Data preparation
To create our bilingual dataset, we’ll loop over each split, concatenate the datasets for that split, and shuffle the result to ensure our model doesn’t overfit to a single language

In [18]:
from datasets import concatenate_datasets, DatasetDict

books_dataset = DatasetDict()

for split in english_books.keys():
    books_dataset[split] = concatenate_datasets(
        [spanish_books[split], english_books[split]]
    )
    books_dataset[split] = books_dataset[split].shuffle(seed=42)

# Peek at a few samples
show_samples(books_dataset)

Title: Estaba buscando el de adultos y encontré un estupendo libro juvenil.
Review: Estaba buscando el de adultos y encontré un estupendo libro juvenil. Me lo leí y se lo he pasado a mis hijos, es una buena historia.


Title: Warm hearted read
Review: Thoroughly enjoyed this book. Love the characters and how the story evolved. Looking forward to the next book on Mistletoe Lane.


Title: Nothing!
Review: Not sure just trying to get out of this so I can get to the home page. No more please!




**The plots below show the word distributions, and we can see that the titles are heavily skewed toward just 1-2 words**

![](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter7/review-lengths-dark.svg)

In [19]:
# Apply a rough heuristic to split the titles on whitespace
books_dataset = books_dataset.filter(
    lambda example: len(example["review_title"].split()) < 2
)

Filter:   0%|          | 0/17612 [00:00<?, ? examples/s]

Filter:   0%|          | 0/424 [00:00<?, ? examples/s]

Filter:   0%|          | 0/442 [00:00<?, ? examples/s]

## Multilingual Models for text summarization

- **mT5:**	A multilingual version of T5, pretrained on the multilingual Common Crawl corpus (mC4), covering 101 languages.

- **mBART-50:**	A multilingual version of BART, pretrained on 50 languages.

- We’ll focus on mT5, an interesting architecture based on T5 that was pretrained in a text-to-text framework

![](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter7/t5-dark.svg)

### Preprocessing the data

In [20]:
# Load mt5-small as our checkpoint
from transformers import AutoTokenizer

model_checkpoint = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/82.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/553 [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/4.31M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/99.0 [00:00<?, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [21]:
# Let’s test out the mT5 tokenizer on a small example
inputs = tokenizer("I loved reading the Hunger Games!")
inputs

{'input_ids': [336, 259, 28387, 11807, 287, 62893, 295, 12507, 309, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [22]:
# Let’s decode input IDs to see use of SentencePiece tokenizer
tokenizer.convert_ids_to_tokens(inputs.input_ids)

['▁I', '▁', 'loved', '▁reading', '▁the', '▁Hung', 'er', '▁Games', '!', '</s>']

In [23]:
# Process inputs and targets for mT5
max_input_length = 512
max_target_length = 30


def preprocess_function(examples):
    model_inputs = tokenizer(
        examples["review_body"],
        max_length=max_input_length,
        truncation=True,
    )
    labels = tokenizer(
        examples["review_title"], max_length=max_target_length, truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [24]:
tokenized_datasets = books_dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/3025 [00:00<?, ? examples/s]

Map:   0%|          | 0/78 [00:00<?, ? examples/s]

Map:   0%|          | 0/77 [00:00<?, ? examples/s]

### Metrics for text summarization
- For summarization, one of the most commonly used metrics is the **ROUGE score (short for Recall-Oriented Understudy for Gisting Evaluation)**

In [None]:
!pip install rouge_score

In [26]:
import evaluate

rouge_score = evaluate.load("rouge")

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

In [27]:
generated_summary = "I absolutely loved reading the Hunger Games"
reference_summary = "I loved reading the Hunger Games"

In [28]:
scores = rouge_score.compute(
    predictions=[generated_summary], references=[reference_summary]
)

scores

{'rouge1': 0.923076923076923,
 'rouge2': 0.7272727272727272,
 'rougeL': 0.923076923076923,
 'rougeLsum': 0.923076923076923}

In [29]:
from datasets import load_metric

# Load ROUGE metric
rouge_score = load_metric("rouge")

# Example predictions and references
predictions = ["This is the hypothesis."]
references = ["This is the reference."]

# Compute ROUGE scores with stemming
scores = rouge_score.compute(predictions=predictions, references=references, use_stemmer=True)

# Round and print results
for key, value in scores.items():
    print(f"{key}: {value}")

  rouge_score = load_metric("rouge")


Downloading builder script:   0%|          | 0.00/2.17k [00:00<?, ?B/s]

The repository for rouge contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/rouge.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y
rouge1: AggregateScore(low=Score(precision=0.75, recall=0.75, fmeasure=0.75), mid=Score(precision=0.75, recall=0.75, fmeasure=0.75), high=Score(precision=0.75, recall=0.75, fmeasure=0.75))
rouge2: AggregateScore(low=Score(precision=0.6666666666666666, recall=0.6666666666666666, fmeasure=0.6666666666666666), mid=Score(precision=0.6666666666666666, recall=0.6666666666666666, fmeasure=0.6666666666666666), high=Score(precision=0.6666666666666666, recall=0.6666666666666666, fmeasure=0.6666666666666666))
rougeL: AggregateScore(low=Score(precision=0.75, recall=0.75, fmeasure=0.75), mid=Score(precision=0.75, recall=0.75, fmeasure=0.75), high=Score(precision=0.75, recall=0.75, fmeasure=0.75))
roug

In [30]:
# Accessing the mid value of the rouge1 score
rouge1_mid = scores["rouge1"].mid
print(rouge1_mid)

Score(precision=0.75, recall=0.75, fmeasure=0.75)


### Creating a strong baseline for text summarization


In [31]:
!pip install nltk



In [32]:
import nltk

nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [33]:
from nltk.tokenize import sent_tokenize

def three_sentence_summary(text):
    return "\n".join(sent_tokenize(text)[:3])

# def three_sentence_summary(text):
#     sentences = sent_tokenize(text)
#     print("Sentences:", sentences)  # Debugging statement to check the split
#     # Check if there are fewer than three sentences and adjust accordingly
#     if len(sentences) < 3:
#         return "\n".join(sentences)
#     else:
#         return "\n".join(sentences[:3])

review_text = books_dataset["train"][10]["review_body"]
print(three_sentence_summary(review_text))


It was a typical light read!
Lots of faith is expressed but not overwhelming.
There are no lurid sex scenes.


In [34]:
# computes the ROUGE scores for the baseline
def evaluate_baseline(dataset, metric):
    summaries = [three_sentence_summary(text) for text in dataset["review_body"]]
    return metric.compute(predictions=summaries, references=dataset["review_title"])

In [35]:
import pandas as pd

# Evaluate baseline
score = evaluate_baseline(books_dataset["validation"], rouge_score)

# Assuming score[rn] is a float value
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
rouge_dict = dict((rn, round(score[rn].mid.fmeasure * 100, 2)) for rn in rouge_names)
rouge_dict

{'rouge1': 3.0, 'rouge2': 0.11, 'rougeL': 3.02, 'rougeLsum': 3.05}

## Fine-tuning mT5 with the Trainer API


In [36]:
# Load the pretrained model from the mt5-small checkpoint
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

pytorch_model.bin:   0%|          | 0.00/1.20G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

In [37]:
from tkinter.constants import N
# Define the hyperparameters and other arguments
from transformers import Seq2SeqTrainingArguments

batch_size = 8
num_train_epochs = 3

# Show the training loss with every epoch
logging_steps = len(tokenized_datasets["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]

args = Seq2SeqTrainingArguments(
    output_dir="mt5-finetuned-amazon-reviews",
    evaluation_strategy="epoch",
    learning_rate=5.6e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=num_train_epochs,
    predict_with_generate=True,
    logging_steps=logging_steps,
    push_to_hub=True,
)



In [38]:
# Function to evaluate our model during training
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Decode generated summaries into text
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    # Decode reference summaries into text
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # ROUGE expects a newline after each sentence
    decoded_preds = ["\n".join(sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(sent_tokenize(label.strip())) for label in decoded_labels]
    # Compute ROUGE scores
    result = rouge_score.compute(
        predictions=decoded_preds, references=decoded_labels, use_stemmer=True
    )
    # Extract the median scores
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    return {k: round(v, 4) for k, v in result.items()}

In [39]:
# Define data collator for our sequence-to-sequence task
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

In [40]:
# Remove the columns with strings for collator
tokenized_datasets = tokenized_datasets.remove_columns(
    books_dataset["train"].column_names
)

In [41]:
# Wrangle the data into the expected format
features = [tokenized_datasets["train"][i] for i in range(3)]
data_collator(features)

{'input_ids': tensor([[ 1659, 12431,  1664,   707,   319,  7656,   835,   260,   260, 69122,
           473,   260,   309,     1,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0],
        [ 1517, 40086,   980,   259, 96883,   261,   259, 49260,   319,  1280,
           259,   262,   658,   261,   319,   259,   262,   658,   603,   921,
             1,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0],
        [ 8739,  3435,   259,  2220, 22677,   288,   259,  4940,  1537,  1371,
           336,  9070,   345,  6117,   304,   259,  3824, 20743,   261,   259,
         16611,   276,   776,   332, 20564,  1711,   287,   259,  2015,   304,
          3034,     1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

In [42]:
# Instantiate the trainer with the standard arguments
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [43]:
# Launch our training
trainer.train()

Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum
1,18.4463,8.544673,0.3663,0.0,0.3663,0.3663
2,9.359,5.067376,0.0,0.0,0.0,0.0
3,6.6153,4.261654,0.0,0.0,0.0,0.0




TrainOutput(global_step=1137, training_loss=11.459027688756992, metrics={'train_runtime': 416.2225, 'train_samples_per_second': 21.803, 'train_steps_per_second': 2.732, 'total_flos': 932146461880320.0, 'train_loss': 11.459027688756992, 'epoch': 3.0})

In [44]:
# Evaluate model
trainer.evaluate()

{'eval_loss': 4.261654376983643,
 'eval_rouge1': 0.0,
 'eval_rouge2': 0.0,
 'eval_rougeL': 0.0,
 'eval_rougeLsum': 0.0,
 'eval_runtime': 1.2282,
 'eval_samples_per_second': 63.507,
 'eval_steps_per_second': 8.142,
 'epoch': 3.0}

In [47]:
# Push the model weights to the Hub
trainer.push_to_hub(
    tags=["summarization", "translation", "text-generation"],
    commit_message="Training complete",
)


CommitInfo(commit_url='https://huggingface.co/ashaduzzaman/mt5-finetuned-amazon-reviews/commit/e2b472dde446b1676d6149b8f307ef23f2c46df8', commit_message='Training complete', commit_description='', oid='e2b472dde446b1676d6149b8f307ef23f2c46df8', pr_url=None, pr_revision=None, pr_num=None)

## Fine-tuning mT5 with 🤗 Accelerate

### Preparing everything for training

In [None]:
# # Create a DataLoader for each of our splits
# tokenized_datasets.set_format("torch")

In [None]:
# # Instantiate the DataCollatorForSeq2Seq
# model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

In [None]:
# # Instantiate the data collator and define dataloaders
# from torch.utils.data import DataLoader

# batch_size = 8

# train_dataloader = DataLoader(
#     tokenized_datasets["train"],
#     shuffle=True,
#     collate_fn=data_collator,
#     batch_size=batch_size,
# )

# eval_dataloader = DataLoader(
#     tokenized_datasets["validation"],
#     collate_fn=data_collator,
#     batch_size=batch_size,
# )

In [None]:
# # Define the AdamW Optimizer
# from torch.optim import AdamW

# optimizer = AdamW(model.parameters(), lr=2e-5)

In [None]:
# # Feed model, optimizer, and dataloaders to the Accelerator
# from accelerate import Accelerator

# accelerator = Accelerator()

# model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
#     model, optimizer, train_dataloader, eval_dataloader
# )

In [None]:
# # Define the learning rate schedule
# from transformers import get_scheduler

# num_train_epochs = 5
# num_update_steps_per_epoch = len(train_dataloader)
# num_training_steps = num_train_epochs * num_update_steps_per_epoch

# lr_scheduler = get_scheduler(
#     "linear",
#     optimizer=optimizer,
#     num_warmup_steps=0,
#     num_training_steps=num_training_steps,
# )

In [None]:
# # Function to post-process the summaries for evaluation
# def postprocess_text(preds, labels):
#     preds = [pred.strip() for pred in preds]
#     labels = [label.strip() for label in labels]

#     # ROUGE expects a newline after each sentence
#     preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds]
#     labels = ["\n".join(nltk.sent_tokenize(label)) for label in labels]

#     return preds, labels

In [None]:
# #Create a repository on the Hub that we can push our model to
# from huggingface_hub import create_repo, get_full_repo_name

# repo_name = "mT5-small-finetuned-amazon-reviews-accelerate"
# create_repo(repo_name)

# model_name = "mT5-small-finetuned-amazon-reviews-accelerate"
# repo_name = get_full_repo_name(model_name)
# repo_name

In [None]:
# # Clone a local version to our results directory
# from huggingface_hub import Repository

# output_dir = "results-mt5-finetuned-squad-accelerate"
# repo = Repository(output_dir, clone_from=repo_name)

In [None]:
# # Training loop
# from tqdm.auto import tqdm
# import torch
# import numpy as np

# progress_bar = tqdm(range(num_training_steps))

# for epoch in range(num_train_epochs):
#     # Training
#     model.train()
#     for step, batch in enumerate(train_dataloader):
#         outputs = model(**batch)
#         loss = outputs.loss
#         accelerator.backward(loss)

#         optimizer.step()
#         lr_scheduler.step()
#         optimizer.zero_grad()
#         progress_bar.update(1)

#     # Evaluation
#     model.eval()
#     for step, batch in enumerate(eval_dataloader):
#         with torch.no_grad():
#             generated_tokens = accelerator.unwrap_model(model).generate(
#                 batch["input_ids"],
#                 attention_mask=batch["attention_mask"],
#             )

#             generated_tokens = accelerator.pad_across_processes(
#                 generated_tokens, dim=1, pad_index=tokenizer.pad_token_id
#             )

#             labels = batch["labels"]

#             # If we didn't pad to max length, we need to pad the labels too
#             labels = accelerator.pad_across_processes(
#                 batch["labels"], dim=1, pad_index=tokenizer.pad_token_id
#             )

#             generated_tokens = accelerator.gather(generated_tokens).cpu().numpy()
#             labels = accelerator.gather(labels).cpu().numpy()

#             # Replace -100 in the labels as we can't decode them
#             labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
#             if isinstance(generated_tokens, tuple):
#                 generated_tokens = generated_tokens[0]
#             decoded_preds = tokenizer.batch_decode(
#                 generated_tokens, skip_special_tokens=True
#             )
#             decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

#             decoded_preds, decoded_labels = postprocess_text(
#                 decoded_preds, decoded_labels
#             )

#             rouge_score.add_batch(predictions=decoded_preds, references=decoded_labels)

#     # Compute metrics
#     results = rouge_score.compute()
#     # Extract the median ROUGE scores
#     result = {key: value.mid.fmeasure * 100 for key, value in results.items()}
#     result = {k: round(v, 4) for k, v in result.items()}
#     print(f"Epoch {epoch}: result")

#     # Save and Upload
#     accelerator.wait_for_everyone()
#     unwrapped_model = accelerator.unwrap_model(model)
#     unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
#     if accelerator.is_main_process:
#         tokenizer.save_pretrained(output_dir)
#         repo.push_to_hub(
#             commit_message=f"Training in progress epoch {epoch}", blocking=False
#         )


## Using the fine-tuned model

In [48]:
from transformers import pipeline

hub_model_id = "ashaduzzaman/mt5-finetuned-amazon-reviews"
summarizer = pipeline("summarization", model=hub_model_id)

config.json:   0%|          | 0.00/802 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.20G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/19.3k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/4.31M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/16.4M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/416 [00:00<?, ?B/s]

You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [49]:
# Let’s implement a simple function to show the review, title, and generated summary
def print_summary(idx):
    review = books_dataset["test"][idx]["review_body"]
    title = books_dataset["test"][idx]["review_title"]
    summary = summarizer(books_dataset["test"][idx]["review_body"])[0]["summary_text"]
    print(f"'>>> Review: {review}'")
    print(f"\n'>>> Title: {title}'")
    print(f"\n'>>> Summary: {summary}'")

In [54]:
# Let’s take a look at one of the English examples
print_summary(10)

Your max_length is set to 20, but your input_length is only 16. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=8)


Review: 
Great story, characters and everything else. I can not recommend this more!
Title: Read!
Summary: <extra_id_0>able


In [55]:
print_summary(20)

Review: 
Algunos laberintos son un poco complicados para Niño de cinco años. Otros no, los hace sin ningún problema. Satisfecha con la compra.
Title: Entretenido
Summary: <extra_id_0>able
