# 🤖 Fine-Tuning T5 for Product Review Generation
This notebook explores the task of generating product reviews using the T5 (Text-to-Text Transfer Transformer) model. It encompasses data preparation, model training and review generation.

## 📦 Setups and Imports

In [19]:
%%capture
!pip install datasets
!pip uninstall wandb -y

In [20]:
import numpy as np
import pandas as pd
from datasets import load_dataset, Dataset
from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers import Trainer, TrainingArguments
from transformers import DataCollatorWithPadding

## 🧹 Data Preparation
Beginning with the preparation of our dataset. We'll use a subset of Amazon product reviews for our analysis and training.

### Loading and Merging

In [21]:
dataset_category = 'Software'

In [22]:
meta_ds = load_dataset('McAuley-Lab/Amazon-Reviews-2023',
                       f'raw_meta_{dataset_category}',
                       trust_remote_code=True,
                       split='full').to_pandas()[['parent_asin',
                                                  'title']]
review_ds = load_dataset('McAuley-Lab/Amazon-Reviews-2023',
                         f'raw_review_{dataset_category}',
                         trust_remote_code=True,
                         split='full').to_pandas()[['parent_asin',
                                                    'rating',
                                                    'text',
                                                    'verified_purchase']]

In [23]:
ds = meta_ds.merge(review_ds,
                   on='parent_asin',
                   how='inner').drop(columns='parent_asin')
ds = ds.rename(columns={'rating':'star_rating',
                        'title':'product_title',
                        'text':'review_body'})

In [24]:
ds = ds[ds['verified_purchase'] & (ds['review_body'].map(len) > 100)]
ds = ds.sample(10000)
ds

Unnamed: 0,product_title,star_rating,review_body,verified_purchase
3972751,USA Network,3.0,I got this so that I could watch a show that I...,True
3653933,Burger,4.0,This game is AWESOME and I think I have become...,True
3810399,Kik Messenger,1.0,"I just got this app, but when I try to make an...",True
1780614,Star Girl,5.0,So the device needed me to rate so yahoo it's ...,True
3730610,Calculator Plus,5.0,Easy to use I like the % function. Big numbers...,True
...,...,...,...,...
286855,Webroot Antivirus Software 2023 | 5 Device | 1...,4.0,I like Webroot Antivirus and have had for 4 or...,True
1644812,Pinterest,2.0,Again why do I have to have a million password...,True
3351724,Minecraft,5.0,My son and I thoroughly enjoy this app. It hel...,True
3183,TurboTax Deluxe 2014 Fed + State + Fed Efile T...,5.0,So far so good. Haven't pulled the trigger to ...,True


### Enconding and Splitting

In [25]:
dataset = Dataset.from_pandas(ds)
dataset = dataset.class_encode_column('star_rating')
dataset = dataset.train_test_split(test_size=0.1,
                                   seed=42,
                                   stratify_by_column='star_rating')

Stringifying the column:   0%|          | 0/10000 [00:00<?, ? examples/s]

Casting to class labels:   0%|          | 0/10000 [00:00<?, ? examples/s]

In [26]:
train_dataset = dataset['train']
train_dataset[0]

{'product_title': 'Township',
 'star_rating': 1,
 'review_body': "The update was over a week ago with my game still crashes at the same spot. I &quot;get&quot; to watch my team he demoted in LAST WEEK'S regatta when I try to collect prize it closes. I haven't been able to participate in this regard at all.😞",
 'verified_purchase': True,
 '__index_level_0__': 4638942}

In [27]:
test_dataset = dataset['test']
test_dataset[0]

{'product_title': 'Mahjong Epic',
 'star_rating': 2,
 'review_body': 'I like the idea of changing the faces on the tiles. On many of the Chinese looking tiles I have a hard time telling some of the tiles apart. So many symbols are too similar.',
 'verified_purchase': True,
 '__index_level_0__': 2180044}

## Model Preparation 🔧
Preparing our T5 model for training.

### Tokenizer Initialization

In [28]:
MODEL_NAME = 't5-base'
tokenizer = T5Tokenizer.from_pretrained('t5-base')

### Data Preprocessing Function

In [29]:
def preprocess_data(examples):
  examples['prompt'] = [f'review: {product_title}, {star_rating} Stars!' for product_title, star_rating in zip(examples['product_title'], examples['star_rating'])]
  examples['response'] = [f'{review_body}' for review_body in examples['review_body']]

  inputs = tokenizer(examples['prompt'],
                     padding='max_length',
                     truncation=True,
                     max_length=128)
  targets = tokenizer(examples['response'],
                      padding='max_length',
                      truncation=True,
                      max_length=128)

  target_input_ids = []
  for ids in targets['input_ids']:
    target_input_ids.append([id if id != tokenizer.pad_token_id else -100 for id in ids])

  inputs.update({'labels':target_input_ids})
  return inputs

### Preprocessing Datasets

In [30]:
train_dataset = train_dataset.map(preprocess_data,
                                  batched=True)
test_dataset = test_dataset.map(preprocess_data,
                                batched=True)

Map:   0%|          | 0/9000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [31]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

## 🎯 Fine-Tuning the Model

In [32]:
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

In [33]:
TRAINING_OUTPUT = './models/t5_fine_tuned_reviews'

In [34]:
training_args = TrainingArguments(output_dir=TRAINING_OUTPUT,
                                  num_train_epochs=3,
                                  per_device_train_batch_size=12,
                                  per_device_eval_batch_size=12,
                                  save_strategy='epoch')

In [35]:
trainer = Trainer(model=model,
                  args=training_args,
                  train_dataset=train_dataset,
                  data_collator=data_collator)

In [36]:
trainer.train()

Step,Training Loss
500,3.578
1000,3.4204
1500,3.3698
2000,3.3194


TrainOutput(global_step=2250, training_loss=3.4118033854166665, metrics={'train_runtime': 1712.6477, 'train_samples_per_second': 15.765, 'train_steps_per_second': 1.314, 'total_flos': 4110465761280000.0, 'train_loss': 3.4118033854166665, 'epoch': 3.0})

## 💾 Saving and Loading the Model

In [37]:
trainer.save_model(TRAINING_OUTPUT)

In [38]:
model = T5ForConditionalGeneration.from_pretrained(TRAINING_OUTPUT)

## ✍️ Generating Reviews

In [53]:
def generate_review(text):
  inputs = tokenizer('review: ' + text,
                     return_tensors='pt',
                     max_length=512,
                     padding='max_length',
                     truncation=True)
  outputs = model.generate(inputs['input_ids'],
                           max_length=128,
                           no_repeat_ngram_size=3,
                           num_beams=6,
                           early_stopping=True)
  summary = tokenizer.decode(outputs[0],
                             skip_special_tokens=True)
  return summary

In [54]:
random_products = test_dataset.shuffle(42).select(range(10))['product_title']
random_products

['Westbound: Pioneer Adventure',
 'Time Gap: Free Hidden Object Mystery Game',
 'Subway Surfers',
 'World Series of Poker - WSOP Texas Holdem Free Poker',
 'Amazon Prime Video',
 'Amazon Music for Android',
 'Supermarket Mania Journey: A Time Management Adventure',
 'Kids Doodle - Movie Kids Drawing',
 'Midnight Mysteries: Devil on the Mississippi (Full)',
 'Guess The Emoji']

In [66]:
generate_review(random_products[0] + ", 1 Stars!")

"This is a great game, but it's a little slow. I'm not sure if this is the game for you, but if you're like me, you'll love this game."

In [56]:
generate_review(random_products[1] + ", 2 Stars!")

"It's a fun game but I don't like that you have to spend money to play it. I'm not sure if it's worth the money or not."

In [57]:
generate_review(random_products[2] + ", 3 Stars!")

"I love this game. It's a lot of fun. I've played it a few times and it's fun to play."

In [58]:
generate_review(random_products[3] + ", 4 Stars!")

"This is a great game to play. It's a lot of fun to play with friends and family. I would recommend this game to anyone."

In [59]:
generate_review(random_products[4] + ", 5 Stars!")

"I love this app. It's easy to use and it's a great way to watch movies and TV shows."

In [60]:
generate_review(random_products[5] + ", 1 Stars!")

"I've been using this app for a few months now and it's still not working. I'm not sure if I'll be able to get it to work or not."

In [61]:
generate_review(random_products[6] + ", 2 Stars!")

"This is a great game, but it's a little slow. I don't know if I'm going to like it or not."

In [62]:
generate_review(random_products[7] + ", 3 Stars!")

"This is a great app for kids. It's easy to use and has a lot of fun. I would recommend it to anyone."

In [63]:
generate_review(random_products[8] + ", 4 Stars!")

"This is a great game. It's a lot of fun to play. I've been playing this game for a long time."

In [64]:
generate_review(random_products[9] + ", 5 Stars!")

"This is a fun game to play. It's a lot of fun. I'm a big fan of emojis and this is one of the best games I've ever played."