<a href="https://colab.research.google.com/github/dileep9968/mastery-in-llm/blob/main/7_3_t5_for_product_reviews.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🤖 Fine-Tuning T5 for Product Review Generation

In this interactive lab, we'll explore the exciting task of generating product reviews using the T5 (Text-to-Text Transfer Transformer) model. We'll dive into data preparation, model training, and ultimately, review generation.

### Setup and Installation

First, we check which GPU is available and then install pinned versions of the required libraries so the environment is ready for the tasks ahead.

In [None]:
!nvidia-smi

Tue Sep  3 05:17:04 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   52C    P8              10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
!pip install numpy==1.25.1
!pip install transformers[torch]==4.41.0
!pip install datasets==2.14.6


### Importing Libraries 📚

Let's import all the necessary modules that will help us load datasets, process data, and utilize the T5 model.


In [None]:
import numpy as np
import pandas as pd
from datasets import load_dataset, Dataset
from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments
from transformers import DataCollatorWithPadding

### Data Preparation 📋

Our journey begins with preparing the dataset. We'll train on a 100,000-review sample of Amazon product reviews.

Loading and Merging Datasets
The original "amazon_us_reviews" dataset is no longer available, so we use "McAuley-Lab/Amazon-Reviews-2023" instead and merge its product metadata with the review data.

In [None]:
# Amazon removed the "amazon_us_reviews" dataset, so we use a replacement here.
# "Electronics" (as in the lesson) also works, but that subset is bigger and takes longer to load.
dataset_category = "Software"

meta_ds = load_dataset("McAuley-Lab/Amazon-Reviews-2023", f"raw_meta_{dataset_category}", split='full').to_pandas()[['parent_asin', 'title']]
review_ds = load_dataset("McAuley-Lab/Amazon-Reviews-2023", f"raw_review_{dataset_category}", split='full').to_pandas()[['parent_asin', 'rating', 'text', 'verified_purchase']]

ds = meta_ds.merge(review_ds, on='parent_asin', how='inner').drop(columns="parent_asin")
ds = ds.rename(columns={"rating":"star_rating", "title":"product_title", "text":"review_body"})

ds = ds[ds['verified_purchase'] & (ds['review_body'].map(len) > 100)].sample(100_000)
ds


| | product_title | star_rating | review_body | verified_purchase |
|---|---|---|---|---|
| 886895 | SLCnow | 1.0 | Pop up foot review, then get a negative one. C... | True |
| 716386 | Beautiful Wallpapers Collection | 5.0 | Love it!At first I got stuck at the loading sc... | True |
| 2420065 | MLB | 4.0 | picture quality comes and goes don't know the ... | True |
| 2020313 | Destination America GO - Fire TV | 1.0 | Cant watch this network unless you give login ... | True |
| 2962630 | Wedding Dash | 5.0 | Fun game, it really gets harder as the levels ... | True |
| ... | ... | ... | ... | ... |
| 2479961 | Mahjong Pro | 5.0 | If you are a Mahjong junkies, you will like th... | True |
| 1428143 | Smart Robots VS Stupid Zombies | 3.0 | This game is OK. Kinda like plants vs zombies.... | True |
| 1697677 | Word Chums | 5.0 | I will never play words with friends again. Th... | True |
| 4772437 | Paramount+ | 2.0 | I was so excited to get this channel cause I m... | True |
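
To make the merge-and-filter step concrete, here is a small plain-Python sketch of the same logic on made-up records: an inner join on `parent_asin`, followed by keeping only verified purchases whose review text is longer than 100 characters. The records and the helper name `join_and_filter` are hypothetical, and the sketch assumes one metadata row per `parent_asin`; the notebook itself does this with pandas.

```python
# Toy illustration of the inner join + filter performed above with pandas.
# All records below are invented for the example.

def join_and_filter(meta_rows, review_rows, min_chars=100):
    """Inner-join reviews to product titles, then keep long, verified reviews."""
    # Assumes each parent_asin appears once in the metadata.
    titles = {row["parent_asin"]: row["title"] for row in meta_rows}
    merged = []
    for review in review_rows:
        title = titles.get(review["parent_asin"])
        if title is None:
            continue  # inner join: drop reviews without matching metadata
        if review["verified_purchase"] and len(review["text"]) > min_chars:
            merged.append({
                "product_title": title,
                "star_rating": review["rating"],
                "review_body": review["text"],
            })
    return merged

meta = [{"parent_asin": "A1", "title": "Word Game"}]
reviews = [
    {"parent_asin": "A1", "rating": 5.0, "text": "x" * 150, "verified_purchase": True},
    {"parent_asin": "A1", "rating": 4.0, "text": "too short", "verified_purchase": True},
    {"parent_asin": "A1", "rating": 1.0, "text": "y" * 150, "verified_purchase": False},
    {"parent_asin": "ZZ", "rating": 3.0, "text": "z" * 150, "verified_purchase": True},
]

kept = join_and_filter(meta, reviews)
print(len(kept), kept[0]["product_title"], kept[0]["star_rating"])
```

Only the first toy review survives: the second is too short, the third is unverified, and the fourth has no matching product.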


Encoding and Splitting
Next, we encode the `star_rating` column as class labels, which `train_test_split` requires in order to stratify on it, and split the dataset 90/10 into training and test sets.

In [None]:
# Convert the pandas DataFrame into a Hugging Face Dataset
dataset = Dataset.from_pandas(ds)

# Encode 'star_rating' as a ClassLabel column (required for stratified splitting)
dataset = dataset.class_encode_column("star_rating")

# Split into training and test sets, keeping the rating distribution the same in both
dataset = dataset.train_test_split(test_size=0.1, seed=42, stratify_by_column="star_rating")

train_dataset = dataset['train']
test_dataset = dataset['test']
print(train_dataset[0])

{'product_title': 'TiVo', 'star_rating': 4, 'review_body': 'I love this app not only is it useful for when I lose the remote but it gives me a laugh when my husband is watching his football I change it to Disney channel and my daughters are just laughing out of control. Lol!', 'verified_purchase': True, '__index_level_0__': 955399}
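
One detail worth checking in the printed example: `class_encode_column` replaces each rating with an integer class index rather than keeping the original value. Assuming the class names are the sorted, stringified unique values (an assumption about the library's behavior, though it is consistent with the integer `star_rating` shown above), ratings 1.0 to 5.0 would map to indices 0 to 4, so a `4` in a prompt would stand for a 5-star review. The hypothetical helper below mimics that mapping in plain Python; it is an illustration, not the library's code.

```python
# Plain-Python imitation of encoding a column as class indices.
# Assumption: class names are the sorted, stringified unique values.

def encode_classes(values):
    names = sorted({str(v) for v in values})
    index_of = {name: i for i, name in enumerate(names)}
    return names, [index_of[str(v)] for v in values]

ratings = [1.0, 5.0, 4.0, 2.0, 3.0, 5.0]
names, encoded = encode_classes(ratings)

print(names)    # class names, in index order
print(encoded)  # each rating replaced by its class index
```

If the original star value is wanted instead of the index, it can be recovered from the class names (`names[index]` in this sketch).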


### Model Preparation 🛠️

Now, let's prepare our T5 model for training.

### Tokenizer Initialization

In [None]:
MODEL_NAME = 't5-base'
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)


### Data Preprocessing Function
We define a function that turns each example into a prompt/response pair and tokenizes both, preparing the data for the model.

In [None]:
# Build prompt/response pairs and tokenize them for T5
def preprocess_data(examples):
    # Prompt format: "review: <product title>, <star rating> Stars!"
    examples['prompt'] = [f"review: {product_title}, {star_rating} Stars!" for product_title, star_rating in zip(examples['product_title'], examples['star_rating'])]
    # The target text is the review itself
    examples['response'] = [str(review_body) for review_body in examples['review_body']]

    # Pad or truncate prompts and targets to a fixed length of 128 tokens
    inputs = tokenizer(examples['prompt'], padding='max_length', truncation=True, max_length=128)
    targets = tokenizer(examples['response'], padding='max_length', truncation=True, max_length=128)

    # Replace padding token ids in the targets with -100 so the loss ignores them
    labels = [
        [token_id if token_id != tokenizer.pad_token_id else -100 for token_id in ids]
        for ids in targets['input_ids']
    ]

    inputs.update({'labels': labels})
    return inputs
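
The two conventions in `preprocess_data` are the prompt template and the `-100` label mask. PyTorch's cross-entropy loss skips targets equal to -100 by default, so padded positions do not contribute to the training loss. Here is a minimal sketch with invented token ids, assuming a pad id of 0 (T5's pad token id); the helper names `build_prompt` and `mask_padding` are hypothetical.

```python
PAD_TOKEN_ID = 0      # T5 uses 0 as its pad token id
IGNORE_INDEX = -100   # default ignore_index of PyTorch's cross-entropy loss

def build_prompt(product_title, star_rating):
    """Format a prompt the same way preprocess_data does."""
    return f"review: {product_title}, {star_rating} Stars!"

def mask_padding(token_ids, pad_token_id=PAD_TOKEN_ID):
    """Replace padding ids with the ignore index; leave real tokens untouched."""
    return [t if t != pad_token_id else IGNORE_INDEX for t in token_ids]

prompt = build_prompt("TiVo", 4)
print(prompt)

# Invented ids: three tokens, the end-of-sequence id (1), then padding
padded_target = [363, 19, 8, 1, 0, 0, 0]
masked = mask_padding(padded_target)
print(masked)
```

Only the trailing padding changes; the real tokens and the end-of-sequence token still count toward the loss.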

### Preprocessing Datasets

In [None]:
train_dataset = train_dataset.map(preprocess_data, batched=True)
test_dataset = test_dataset.map(preprocess_data, batched=True)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


### Fine-Tuning the Model 🎯

With our data ready, we proceed to fine-tune the T5 model on our dataset.

In [None]:
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

TRAINING_OUTPUT = "./models/t5_fine_tuned_reviews"
training_args = TrainingArguments(
    output_dir=TRAINING_OUTPUT,
    num_train_epochs=3,
    per_device_train_batch_size=12,
    per_device_eval_batch_size=12,
    save_strategy='epoch',
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator,
)

trainer.train()


KeyboardInterrupt: 
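
For a sense of how long a full run is, the number of optimizer steps follows from the dataset size, batch size, and epoch count. The sketch below is back-of-the-envelope arithmetic, assuming a single GPU and no gradient accumulation, and using the 90,000 training examples that the 90/10 split of the 100,000-row sample produces. The helper `total_training_steps` is hypothetical, not part of `transformers`.

```python
import math

def total_training_steps(num_examples, per_device_batch_size, num_epochs,
                         num_devices=1, grad_accum_steps=1):
    """Rough optimizer-step count for a full training run."""
    effective_batch = per_device_batch_size * num_devices * grad_accum_steps
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * num_epochs

# Values taken from the TrainingArguments above
steps = total_training_steps(num_examples=90_000, per_device_batch_size=12, num_epochs=3)
print(steps)  # 22500
```

That is 7,500 steps per epoch, which is worth knowing before starting a three-epoch run on a single T4.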

### Saving and Loading the Model 💾

After training, we save our model for later use and demonstrate how to load it.

In [None]:
trainer.save_model(TRAINING_OUTPUT)

In [None]:
# Load the model fine-tuned above from the local output directory:
# model = T5ForConditionalGeneration.from_pretrained(TRAINING_OUTPUT)

# Or load an already fine-tuned checkpoint from the Hugging Face Hub:
model = T5ForConditionalGeneration.from_pretrained("TheFuzzyScientist/T5-base_Amazon-product-reviews")


### Generating Reviews ✍️

Finally, we use our fine-tuned model to generate reviews for new products.

In [None]:
# Generate a review for a prompt of the form "<product title>, <N> Stars!"
def generate_review(text):
    # A single prompt needs no padding; overly long prompts are truncated
    inputs = tokenizer("review: " + text, return_tensors='pt', max_length=512, truncation=True)
    # Pass the attention mask along with the input ids; decode with beam search
    outputs = model.generate(**inputs, max_length=128, no_repeat_ngram_size=3, num_beams=6, early_stopping=True)
    review = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return review
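
The `no_repeat_ngram_size=3` argument asks generation to avoid producing any three-token sequence twice in the same output, which curbs the looping text that beam search can otherwise produce. The hypothetical helper below checks that property on a list of tokens. It splits on whitespace for readability, whereas the model applies the constraint to tokenizer ids.

```python
def has_repeated_ngram(tokens, n=3):
    """Return True if any n-gram occurs more than once in the token list."""
    seen = set()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        if ngram in seen:
            return True
        seen.add(ngram)
    return False

clean = "great product for the price and great value for money".split()
looping = "it is great and it is great".split()

print(has_repeated_ngram(clean))    # False
print(has_repeated_ngram(looping))  # True
```

In the second sentence the trigram "it is great" appears twice, which is the pattern the setting is meant to block.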

In [None]:
# Generating reviews for random products
random_products = test_dataset.shuffle(seed=42).select(range(10))['product_title']

print(generate_review(random_products[0] + ", 3 Stars!"))
print(generate_review(random_products[1] + ", 5 Stars!"))
print(generate_review(random_products[2] + ", 2 Stars!"))

Good for the price I bought these for a couple of reasons. The first is that they are a bit bulky, but the second is a little bulky. The third is that it's a lot bigger than I thought it would be. I'm not sure if they'll hold up or not, but they're a good buy.
Great product for the price I bought this for my husband for Christmas and he loves it. It's easy to use and the sound quality is great.
It's ok, but it's not as good as I thought it would be. I've had it for about a month now, and I'm very happy with it. The only problem I have is that it doesn't work as well as I would like it to.


### Conclusion 🚀

Congratulations! You've just completed a hands-on project on fine-tuning T5 to generate product reviews. Experiment further with different product categories, or tweak the training and generation parameters to see how they affect the output. Happy coding!
