<a href="https://colab.research.google.com/github/AbdelwahedSouiid/Transformers/blob/aymen/Transforming%20Amazon%20Product%20Reviews%20into%20Insights%3A.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div style="text-align: center;">
    <strong>Transforming Amazon Product Reviews into Insights: A Sentiment Analysis Approach Using Transformer Models</strong>
</div>

## Introduction
This notebook focuses on sentiment classification of Amazon product reviews using transformer models. The objective is to analyze customer sentiments and derive insights from product reviews. We will utilize BERT, RoBERTa, and DistilBERT for this task.


## Dataset Overview
In this section, we load the Amazon product reviews dataset. The dataset contains user reviews, product IDs, scores, and additional metadata. We will preprocess this data for sentiment analysis.


In [4]:
!pip install kaggle



In [5]:
from google.colab import files
files.upload()  # Choose your kaggle.json file to upload


Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"aymenmsalmi","key":"<redacted>"}'}

In [6]:
!kaggle datasets download -d arhamrumi/amazon-product-reviews

Dataset URL: https://www.kaggle.com/datasets/arhamrumi/amazon-product-reviews
License(s): CC0-1.0
Downloading amazon-product-reviews.zip to /content
 85% 97.0M/115M [00:00<00:00, 167MB/s]
100% 115M/115M [00:00<00:00, 143MB/s] 


In [7]:
!unzip amazon-product-reviews.zip

Archive:  amazon-product-reviews.zip
  inflating: Reviews.csv             


# Load and Explore Data
First, we will load the dataset and take a quick look at the first few rows.

In [8]:
import pandas as pd

# Load the extracted dataset
df = pd.read_csv('/content/Reviews.csv')
df.head()  # Display the first few rows


| | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | 5 | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | 1 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | 4 | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... |
| 3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | 2 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... |
| 4 | 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | 5 | 1350777600 | Great taffy | Great taffy at a great price. There was a wid... |


In [None]:
# Sample 500 reviews whose 'Text' and 'Summary' columns are both non-empty.
# Assumes Reviews.csv has already been downloaded and unzipped above.
import pandas as pd

# Load the dataset
df = pd.read_csv('/content/Reviews.csv')

# Drop rows with a missing 'Text' or 'Summary'
df_filtered = df.dropna(subset=['Text', 'Summary'])

# Sample 500 rows (random_state fixed for reproducibility)
df_sampled = df_filtered.sample(n=500, random_state=42)

# df_sampled now holds 500 rows with non-empty 'Text' and 'Summary' columns
df_sampled.head()

| | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 41434 | 41435 | B0088YBUOU | A16O0S1QROXGJM | Amy in SC "Amy in SC" | 10 | 11 | 4 | 1141516800 | I like these! | These are actually very tasty. Pure potatoes ... |
| 209481 | 209482 | B000Q75354 | A37V5C3TXIBIHT | Martin Dulberg | 1 | 3 | 4 | 1265414400 | Good but subjectively not 5 star | I realize that taste is a matter of personal p... |
| 247306 | 247307 | B000EOXQS0 | A28NR6KKJGHQXH | Virginia A. Mashensky "photo nut" | 0 | 0 | 5 | 1350345600 | Lipton Cup A Soup, Spring Vegetable.4 oz | This is one of my Favorite cup of soup choices... |
| 80089 | 80090 | B0000D9589 | AJBVY9K7D1AZ4 | Mark Gatzke | 7 | 7 | 4 | 1177027200 | Suited to its purpose, if not quite its goal... | If you like the classic taste of a good margar... |
| 218580 | 218581 | B000UPNK9S | A2G8OM5IXSG97Q | gabbersib | 0 | 0 | 2 | 1314403200 | Tastes artificial! | I was willing to give this a chance even after... |


In [None]:


# Verify that the filtered dataframe does not contain empty values in the specified columns.
empty_text_count_filtered = df_sampled['Text'].isnull().sum()
empty_summary_count_filtered = df_sampled['Summary'].isnull().sum()

print(f"Number of empty values in 'Text' column after filtering: {empty_text_count_filtered}")
print(f"Number of empty values in 'Summary' column after filtering: {empty_summary_count_filtered}")



Number of empty values in 'Text' column after filtering: 0
Number of empty values in 'Summary' column after filtering: 0


In [None]:
df_sampled.shape


(500, 10)

In [None]:
# Keep only the 'Text' and 'Summary' columns of df_sampled
selected_df = df_sampled[['Text', 'Summary']]

selected_df

| | Text | Summary |
|---|---|---|
| 41434 | These are actually very tasty. Pure potatoes ... | I like these! |
| 209481 | I realize that taste is a matter of personal p... | Good but subjectively not 5 star |
| 247306 | This is one of my Favorite cup of soup choices... | Lipton Cup A Soup, Spring Vegetable.4 oz |
| 80089 | If you like the classic taste of a good margar... | Suited to its purpose, if not quite its goal... |
| 218580 | I was willing to give this a chance even after... | Tastes artificial! |
| ... | ... | ... |
| 120391 | I had tried this Cappucino in a package of a v... | Grove Square Cappucino |
| 169731 | A cookie review may initially seem superfluous... | Cookie Crunch Lost in Peanut Butter and Fudge ... |
| 206578 | These dry chews help scrape teeth clean and he... | Fresh Canine Breath |
| 400201 | I received this coffee drink from Amazon and w... | coffee |


In [None]:
selected_df.shape

(500, 2)

# Prompt Engineering

This cell sets up the necessary libraries and functions for text summarization using a pre-trained T5 model. Below is an overview of what each step accomplishes:

- **Library Imports**: The `transformers` library is imported for accessing the T5 model and tokenizer, and `torch` is used for tensor operations.
- **Model Initialization**: The `t5-small` pre-trained model and tokenizer are loaded from Hugging Face, enabling efficient input text processing and summary generation.
- **Function Creation**: A function named `generate_summary_for_one` is defined to:
  - Add a "summarize:" prefix to the input text.
  - Encode the input text and pass it to the model.
  - Generate a concise summary using beam search for improved output quality.
- **Error Handling**: The function includes a `try-except` block to gracefully manage any issues that might occur during the summarization process.

The function is then applied to the first entry of a DataFrame (`selected_df`) to demonstrate how it can generate a sample summary.
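
The beam search mentioned above can be illustrated without loading T5. The sketch below is a toy, not the `transformers` implementation: `NEXT` is a hypothetical next-token probability table and `beam_search` is an illustrative helper. It shows why keeping several candidate sequences (`num_beams > 1`) can find a more probable output than keeping only the single best token at each step.

```python
import math

# Hypothetical toy "language model": next-token probabilities that depend
# only on the previous token. This is not T5, just enough to show the search.
NEXT = {
    "<s>": {"tasty": 0.5, "good": 0.4, "</s>": 0.1},
    "tasty": {"potatoes": 0.3, "</s>": 0.7},
    "good": {"value": 0.9, "</s>": 0.1},
    "potatoes": {"</s>": 1.0},
    "value": {"</s>": 1.0},
}

def beam_search(num_beams, max_length=4):
    """Return (tokens, probability) of the best finished sequence found."""
    beams = [(0.0, ["<s>"])]  # each beam is (log-probability, tokens so far)
    finished = []
    for _ in range(max_length):
        candidates = []
        for score, tokens in beams:
            for token, prob in NEXT[tokens[-1]].items():
                candidate = (score + math.log(prob), tokens + [token])
                if token == "</s>":
                    finished.append(candidate)
                else:
                    candidates.append(candidate)
        if not candidates:
            break
        # Keep only the num_beams most probable unfinished sequences
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:num_beams]
    best_score, best_tokens = max(finished, key=lambda c: c[0])
    return best_tokens[1:-1], math.exp(best_score)

one_beam_tokens, one_beam_prob = beam_search(num_beams=1)
two_beam_tokens, two_beam_prob = beam_search(num_beams=2)
print(one_beam_tokens, round(one_beam_prob, 2))
print(two_beam_tokens, round(two_beam_prob, 2))
```

With one beam the search commits to the locally best first token; with two beams it keeps the runner-up alive long enough to discover that its continuation is more probable overall.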

In [None]:
# Install the necessary library if not already installed
# !pip install transformers

from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch

# Load pre-trained T5 model and tokenizer
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

# Function to generate a summary with a prompt for a single text
def generate_summary_for_one(text, max_length=50, min_length=25):
    try:
        # Add a prompt to the input text
        prompted_text = f"summarize: {text}"

        # Encode the input with proper handling for truncation and tensor type
        input_ids = tokenizer.encode(prompted_text, return_tensors="pt", padding="longest", truncation=True, max_length=512)

        # Generate summary with the model
        summary_ids = model.generate(input_ids, max_length=max_length, min_length=min_length, length_penalty=2.0, num_beams=4, early_stopping=True)

        # Decode and return the summary
        return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    except Exception as e:
        return f"An error occurred: {e}"

# Select the first item from the 'Text' and 'Summary' columns of selected_df
first_text = selected_df['Text'].iloc[0]  # Ensure that selected_df has been created with the relevant data
original_summary = selected_df['Summary'].iloc[0]

# Generate the summary
predicted_summary = generate_summary_for_one(first_text)



In [None]:
# Print the original text, original summary, and generated summary
print("Original Text:\n", first_text)
print("\nOriginal Summary:\n", original_summary)
print("\nGenerated Summary:\n", predicted_summary)


Original Text:
 These are actually very tasty.  Pure potatoes with a great texture and no nasty filler "stuff."  No bacon, no cheese...just tasty potatoes.  They cook well in either the oven or microwave.  I add a touch of either salt & pepper or fajita seasoning to spice it up.  I rated 4 out of 5 stars because they could be a bit bigger portion.  However, this item is a fairly good value for the money.

Original Summary:
 I like these!

Generated Summary:
 pure potatoes with a great texture and no nasty filler "stuff" they cook well in either the oven or microwave. this item is a fairly good value for money.


# Prompt Tuning

### Fine-Tuning and Prompt-Based Text Summarization with T5

In this cell, we set up and apply a custom approach for generating text summaries using a pre-trained T5 model, with optional support for fine-tuning prompt embeddings. Below is an explanation of each step:

- **Library Imports**: `transformers` and `torch` are imported. `T5Tokenizer` and `T5ForConditionalGeneration` handle tokenization and generation; `T5Config` is imported but not used in this cell.

- **Model Initialization**: The `t5-small` model and tokenizer are loaded from Hugging Face's model repository, enabling the transformation of input text into a summary.

- **Custom Function for Generating Summaries**:
  - The `generate_summary_with_learned_prompt` function is defined to handle input text and integrate optional prompt embeddings.
  - **Prompt Embedding**: If provided, the function incorporates a learned prompt embedding with the input. Otherwise, a default text-based prompt (`"summarize:"`) is used.
  - The function encodes the input text, generates a summary using beam search for improved results, and decodes the output for readability.

- **Example Usage**:
  - The code extracts the first entry from a DataFrame (`selected_df`) for testing.
  - A placeholder (`None`) for prompt embedding is used to demonstrate how the function works with or without learned prompts.
  - The generated summary is printed along with the original text and dataset-provided summary for comparison.

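Because `prompt_embedding` is `None` in this example, the function falls back to the textual `"summarize:"` prompt, so the output is identical to the prompt-engineering result above. True prompt tuning learns continuous prompt embeddings while the model weights stay frozen, as the following cells show.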


In [None]:
# Install the necessary library if not already installed
# !pip install transformers

from transformers import T5Tokenizer, T5ForConditionalGeneration, T5Config
import torch

# Load pre-trained T5 model and tokenizer
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

# Generate a summary, optionally prepending a learned (soft) prompt embedding.
# prompt_embedding, if given, is assumed to be a float tensor of shape (prompt_length, d_model).
def generate_summary_with_learned_prompt(text, prompt_embedding=None, max_length=50, min_length=25):
    try:
        gen_kwargs = dict(max_length=max_length, min_length=min_length, length_penalty=2.0, num_beams=4, early_stopping=True)

        # Test against None explicitly: the truth value of a multi-element tensor is ambiguous
        if prompt_embedding is not None:
            # A soft prompt lives in embedding space, so embed the token ids first,
            # then prepend the prompt along the sequence dimension (dim=1)
            input_ids = tokenizer.encode(text, return_tensors="pt", truncation=True, max_length=512)
            token_embeds = model.get_input_embeddings()(input_ids)
            inputs_embeds = torch.cat((prompt_embedding.unsqueeze(0), token_embeds), dim=1)
            summary_ids = model.generate(inputs_embeds=inputs_embeds, **gen_kwargs)
        else:
            # Default to a fixed textual prompt
            prompted_text = f"summarize: {text}"
            input_ids = tokenizer.encode(prompted_text, return_tensors="pt", truncation=True, max_length=512)
            summary_ids = model.generate(input_ids, **gen_kwargs)

        # Decode and return the summary
        return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    except Exception as e:
        return f"An error occurred: {e}"

# Select the first item from the 'Text' and 'Summary' columns of selected_df
first_text = selected_df['Text'].iloc[0]  # Ensure that selected_df has been created with the relevant data
original_summary = selected_df['Summary'].iloc[0]

# Simulate prompt embedding (for demonstration purposes)
# In real scenarios, this would involve training a continuous embedding
prompt_embedding = None  # Placeholder for learned embeddings if available

# Generate the summary with prompt tuning (simulated)
predicted_summary = generate_summary_with_learned_prompt(first_text, prompt_embedding=prompt_embedding)

# Print the original text, original summary, and generated summary
print("Original Text:\n", first_text)
print("\nOriginal Summary:\n", original_summary)
print("\nGenerated Summary with Prompt Tuning:\n", predicted_summary)


Original Text:
 These are actually very tasty.  Pure potatoes with a great texture and no nasty filler "stuff."  No bacon, no cheese...just tasty potatoes.  They cook well in either the oven or microwave.  I add a touch of either salt & pepper or fajita seasoning to spice it up.  I rated 4 out of 5 stars because they could be a bit bigger portion.  However, this item is a fairly good value for the money.

Original Summary:
 I like these!

Generated Summary with Prompt Tuning:
 pure potatoes with a great texture and no nasty filler "stuff" they cook well in either the oven or microwave. this item is a fairly good value for money.


### Training Soft Prompts for Enhanced Summarization with T5

This cell demonstrates how to set up and train soft prompts for enhancing the performance of a pre-trained T5 model in a controlled manner. Soft prompts are trainable embeddings that guide the model without altering its original weights. Here's an overview of the code:

- **Library Imports**: `torch`, `nn` (for neural network operations), and components from `transformers` are imported for model loading, optimization, and training.

- **Model and Tokenizer Loading**:
  - The `T5Tokenizer` and `T5ForConditionalGeneration` are loaded from the Hugging Face library to process input text and generate summaries.
  
- **Freezing Model Weights**:
  - The model weights are frozen using `param.requires_grad = False` to ensure that only the soft prompts are trained while the original model parameters remain unchanged. This allows for efficient training with minimal adjustments to the pre-trained model.

- **Creating Trainable Soft Prompt Embeddings**:
  - A soft prompt of length 20 is defined, with each embedding having a size matching the model's hidden state (`embedding_size`).
  - `nn.Parameter` initializes these embeddings as trainable parameters with random values.
  
- **Optimizer Initialization**:
  - An `Adam` optimizer is set up specifically to train the soft prompt embeddings with a learning rate of `1e-3`.

This approach leverages prompt tuning, a technique that fine-tunes input prompts to guide the model more effectively for specific tasks without modifying the core model weights.
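
To make the shapes concrete, here is a small NumPy sketch of how a soft prompt is prepended to the token embeddings and how few values are trainable. It uses random stand-in arrays instead of the real model; `seq_len` is a hypothetical input length, while `d_model = 512` is the hidden size of `t5-small`.

```python
import numpy as np

d_model = 512            # hidden size of t5-small
soft_prompt_length = 20  # as in the cell described above
seq_len = 12             # hypothetical number of input tokens

# Stand-ins: the real code uses nn.Parameter and model.get_input_embeddings()
soft_prompts = np.random.randn(soft_prompt_length, d_model).astype(np.float32)
token_embeds = np.random.randn(1, seq_len, d_model).astype(np.float32)

# Add a batch dimension to the prompt, then prepend it along the sequence axis
combined = np.concatenate([soft_prompts[None, :, :], token_embeds], axis=1)

trainable = soft_prompts.size  # number of values the optimizer would update
print(combined.shape, trainable)
```

The encoder simply sees a sequence that is 20 positions longer; the only values that receive gradient updates are the 20 × 512 prompt entries, which is tiny compared with the roughly 60 million frozen parameters of `t5-small`.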


In [None]:
import torch
from torch import nn
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load the tokenizer and model
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

# Freeze the model weights to train only the soft prompts
for param in model.parameters():
    param.requires_grad = False

# Create trainable soft prompt embeddings
soft_prompt_length = 20  # Length of the soft prompt
embedding_size = model.config.d_model  # Size of the model's hidden state
soft_prompts = nn.Parameter(torch.randn(soft_prompt_length, embedding_size, requires_grad=True))

# Initialize the optimizer for soft prompts
optimizer = torch.optim.Adam([soft_prompts], lr=1e-3)


### Training Soft Prompts with T5 for Text Summarization

This cell provides a comprehensive example of how to fine-tune trainable soft prompts for text summarization using the T5 model. The following steps are included:

- **Library Installation**:
  - The `transformers` library is installed to ensure that all necessary components for model loading and training are available.

- **Library Imports**:
  - `torch` and `nn` from PyTorch are imported for tensor operations and neural network components.
  - `T5ForConditionalGeneration` and `T5Tokenizer` from `transformers` are imported for working with the pre-trained T5 model and tokenizer.

- **Model and Tokenizer Loading**:
  - The `t5-small` model and tokenizer are loaded from Hugging Face's model hub to handle text input and output for summarization.

- **Freezing Model Weights**:
  - The weights of the T5 model are frozen (`param.requires_grad = False`) so that only the newly added soft prompts are trained, preserving the original model's parameters.

- **Creating Trainable Soft Prompts**:
  - A soft prompt of length 20 with embedding vectors matching the model's hidden state size is created and initialized as a trainable `nn.Parameter`.
  
- **Optimizer Initialization**:
  - An `Adam` optimizer is configured to update only the soft prompt embeddings with a learning rate of `1e-3`.

- **Training Data Preparation**:
  - Training data is constructed as a list of input and target summary pairs from a DataFrame (`selected_df`).

- **Training Loop**:
  - The model is trained for 2 epochs, and for each input text, the soft prompts are combined with the input embeddings.
  - The `model.get_input_embeddings()` method retrieves the embeddings for the input IDs, which are concatenated with the soft prompt tensor.
  - The model's `forward` method processes the combined tensor and computes the loss using the provided target summary.
  - The optimizer updates the soft prompts based on the computed gradients, and the average loss is printed at the end of each epoch to monitor training progress.

- **Completion Message**:
  - The code prints "Training completed." once the training loop finishes.

This approach shows how prompt tuning can be applied to enhance the T5 model's performance by training additional input embeddings while keeping the main model weights fixed.
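
The core idea of the training loop (gradients flow through a frozen model and update only the prompt) can be sketched with a toy model in NumPy. Everything here is an illustrative assumption: the "model" is a single frozen weight vector `w` applied to the mean-pooled sequence, the loss is squared error, and the gradient is written out by hand rather than computed by autograd.

```python
import numpy as np

d, k, L = 8, 4, 6                  # embedding size, prompt length, input length (hypothetical)
w = np.linspace(-1.0, 1.0, d)      # frozen "model" weights, never updated
w_before = w.copy()
X = np.full((L, d), 0.1)           # stand-in for the embedded input tokens
P = np.zeros((k, d))               # trainable soft prompt
target, lr = 1.0, 1.0

losses = []
for step in range(50):
    # Prepend the prompt to the input, then mean-pool over the sequence
    pooled = np.concatenate([P, X], axis=0).mean(axis=0)
    error = w @ pooled - target
    losses.append(error ** 2)
    # d(loss)/dP: every prompt row receives the same gradient 2 * error * w / (k + L)
    grad_P = np.tile(2.0 * error * w / (k + L), (k, 1))
    P -= lr * grad_P               # only the prompt is updated

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.2e}")
```

The loss falls even though `w` never changes, which is the same division of labour as in the T5 loop: the frozen network supplies the gradients, and the prompt absorbs all of the learning.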


In [None]:
# Install the necessary library if not already installed
!pip install transformers

import torch
from torch import nn
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load the tokenizer and model
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

# Freeze the model weights to train only the soft prompts
for param in model.parameters():
    param.requires_grad = False

# Create trainable soft prompt embeddings
soft_prompt_length = 20  # Length of the soft prompt
embedding_size = model.config.d_model  # Size of the model's hidden state
soft_prompts = nn.Parameter(torch.randn(soft_prompt_length, embedding_size, requires_grad=True))

# Initialize the optimizer for soft prompts
optimizer = torch.optim.Adam([soft_prompts], lr=1e-3)

# Create the training data from the DataFrame
training_data = list(zip(selected_df['Text'], selected_df['Summary']))

# Training loop for 2 epochs
num_epochs = 2
for epoch in range(num_epochs):
    total_loss = 0
    for input_text, target_summary in training_data:
        optimizer.zero_grad()

        # Prepare inputs and outputs
        input_ids = tokenizer.encode(input_text, return_tensors="pt", truncation=True, max_length=512)
        target_ids = tokenizer.encode(target_summary, return_tensors="pt", truncation=True, max_length=50)

        # Combine soft prompts with input
        prompt_tensor = soft_prompts.unsqueeze(0)
        input_tensor = torch.cat((prompt_tensor, model.get_input_embeddings()(input_ids)), dim=1)

        # Forward pass and loss computation
        outputs = model(inputs_embeds=input_tensor, labels=target_ids)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    print(f"Epoch {epoch + 1}/{num_epochs} - Loss: {total_loss/len(training_data):.4f}")

print("Training completed.")


Epoch 1/2 - Loss: 4.8087
Epoch 2/2 - Loss: 4.2239
Training completed.


In [None]:
# Function to generate a summary with the trained soft prompts
def generate_with_soft_prompt(text):
    input_ids = tokenizer.encode(text, return_tensors="pt", truncation=True, max_length=512)

    # Combine the trained soft prompts with the input embeddings
    prompt_tensor = soft_prompts.unsqueeze(0)  # Add batch dimension
    input_tensor = torch.cat((prompt_tensor, model.get_input_embeddings()(input_ids)), dim=1)

    # Generate the output
    outputs = model.generate(inputs_embeds=input_tensor, max_length=50, num_beams=4, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Get the first item from the 'Text' column
first_text = selected_df['Text'].iloc[0]

# Generate a summary for the first item
generated_summary = generate_with_soft_prompt(first_text)

# Print the original text and the generated summary
print("Original Text:\n", first_text)
print("\nGenerated Summary with Trained Soft Prompts:\n", generated_summary)


Original Text:
 These are actually very tasty.  Pure potatoes with a great texture and no nasty filler "stuff."  No bacon, no cheese...just tasty potatoes.  They cook well in either the oven or microwave.  I add a touch of either salt & pepper or fajita seasoning to spice it up.  I rated 4 out of 5 stars because they could be a bit bigger portion.  However, this item is a fairly good value for the money.

Generated Summary with Trained Soft Prompts:
 These are really tasty.


# PEFT (Parameter-Efficient Fine-Tuning)


In [None]:
from torch.utils.data import Dataset
from sklearn.model_selection import train_test_split

In [None]:
from peft import get_peft_model, LoraConfig

### Applying LoRA (Low-Rank Adaptation) for Model Fine-Tuning

This cell sets up and applies a LoRA (Low-Rank Adaptation) configuration to the T5 model to enable efficient fine-tuning. Here is what each part of the code does:

- **LoRA Configuration (`LoraConfig`)**:
  - **`r`**: Specifies the rank for the low-rank adaptation. This parameter controls the dimensionality of the low-rank matrices applied to the model. A higher value may improve model capacity but can also increase computation.
  - **`lora_alpha`**: Scaling factor for the low-rank update. The update is multiplied by `lora_alpha / r` before it is added to the frozen layer's output, so it controls how strongly the adapter influences the layer.
  - **`lora_dropout`**: Dropout probability applied to the input of the LoRA branch during training to reduce overfitting.
  - **`target_modules`**: Indicates the specific model layers where LoRA will be applied. For T5, "q" and "v" represent the query and value layers of the attention mechanism, which are common targets for LoRA in transformer models.
  - **`bias`**: `"none"` (the PEFT default) means no bias parameters are trained, so only the low-rank matrices are updated.

- **Applying the PEFT Model**:
  - The `get_peft_model()` function takes the original T5 model and the configured `lora_config` and applies LoRA to the specified layers. This enhances the model's training efficiency by adding lightweight, trainable low-rank matrices without modifying the core model weights.

This approach is useful for adapting pre-trained models to new tasks with minimal computational overhead, allowing for faster and more efficient fine-tuning while preserving most of the pre-trained model's knowledge.
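
The low-rank update itself is small enough to write out. The NumPy sketch below is a simplified illustration, not PEFT's implementation: `W` stands in for one frozen 512 × 512 attention projection of `t5-small`, `A` and `B` are the trainable factors (with a simplified random initialization for `A`), and dropout is omitted.

```python
import numpy as np

d_model, r, lora_alpha = 512, 8, 32   # d_model of t5-small; r and lora_alpha as in the config
rng = np.random.default_rng(0)

W = rng.standard_normal((d_model, d_model))    # frozen pre-trained weight (stand-in values)
A = rng.standard_normal((r, d_model)) * 0.01   # trainable, small random init (simplified)
B = np.zeros((d_model, r))                     # trainable, zero init

def lora_forward(x):
    # Frozen path plus the low-rank update scaled by lora_alpha / r
    return x @ W.T + (lora_alpha / r) * (x @ A.T @ B.T)

x = rng.standard_normal((1, d_model))
full_params = W.size
lora_params = A.size + B.size
print(full_params, lora_params, f"{lora_params / full_params:.2%}")
```

Because `B` starts at zero, the adapted layer initially produces exactly the same output as the frozen layer, and training moves it away from that starting point using only the `A` and `B` entries.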


In [None]:
# Set LoRA configuration for fine-tuning
lora_config = LoraConfig(
    r=8,  # Rank for low-rank adaptation (adjust based on performance)
    lora_alpha=32,  # Scaling factor (the update is scaled by lora_alpha / r)
    lora_dropout=0.1,  # Dropout rate during training
    target_modules=["q", "v"],  # T5's query and value projections in the attention layers
    bias="none",  # Do not train bias terms
)

# Apply the PEFT model (LoRA)
peft_model = get_peft_model(model, lora_config)

In [None]:
df=selected_df

In [None]:
# Initialize the tokenizer and model
model_name = "t5-small"  # You can use "t5-base" or "t5-large" for better quality
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the trained model and tokenizer
model_name = '/content/drive/MyDrive/model_t5_summarization'  # Ensure this is the correct path
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Now you can proceed to summarize

In [None]:
# Load the trained model and tokenizer again if not in the same session
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the model and tokenizer if necessary
# Replace 'path/to/model' with the actual path if the model needs to be reloaded
# model = T5ForConditionalGeneration.from_pretrained('path/to/model')
# tokenizer = T5Tokenizer.from_pretrained('path/to/model')

# Input your specific review and summary
input_text = "These are actually very tasty. Pure potatoes with a great texture and no nasty filler 'stuff.' No bacon, no cheese...just tasty potatoes. They cook well in either the oven or microwave. I add a touch of either salt & pepper or fajita seasoning to spice it up. I rated 4 out of 5 stars because they could be a bit bigger portion. However, this item is a fairly good value for the money."
original_summary = "I like these"

# Prepare the input for the model
input_enc = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True, max_length=512)

# Move the input tensor to the same device as the model (e.g., GPU if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
input_enc = {key: val.to(device) for key, val in input_enc.items()}

# Generate the summary
summary_ids = model.generate(
    input_enc['input_ids'],
    max_length=50,  # Adjust as needed
    min_length=25,  # Adjust as needed
    length_penalty=2.0,
    num_beams=4,
    early_stopping=True
)

# Decode the summary
predicted_summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Print the results
print("Original Review:", input_text)
print("Original Summary:", original_summary)
print("Predicted Summary:", predicted_summary)


Original Review: These are actually very tasty. Pure potatoes with a great texture and no nasty filler 'stuff.' No bacon, no cheese...just tasty potatoes. They cook well in either the oven or microwave. I add a touch of either salt & pepper or fajita seasoning to spice it up. I rated 4 out of 5 stars because they could be a bit bigger portion. However, this item is a fairly good value for the money.
Original Summary: I like these
Predicted Summary: I rated them 4 out of 5 stars because they could be a bit bigger than a bit larger.


# RAG for Summarization

In [None]:
# !pip install datasets




In [None]:
# !pip install faiss-cpu




### Summarizing Text Using a Pre-trained T5 Model

This cell demonstrates how to use a pre-trained T5 model for generating summaries from input text. Below is a breakdown of the code:

- **Library Installation**:
  - Installs the `transformers` library to ensure that all necessary components for working with the T5 model are available.

- **Model and Tokenizer Loading**:
  - Loads the `t5-small` tokenizer and model using `T5Tokenizer.from_pretrained()` and `T5ForConditionalGeneration.from_pretrained()`. Users can swap `t5-small` with `t5-base` or `t5-large` for potentially better summarization quality.

- **Function for Generating Summaries**:
  - A function `generate_t5_summary` is defined to:
    - Accept an input text.
    - Prepend the input with "summarize:" as a prompt for the T5 model.
    - Encode the input text with truncation and a maximum length of 512 tokens.
    - Generate a summary using the model with specified parameters such as `max_length`, `min_length`, `length_penalty`, `num_beams`, and `early_stopping` for better control over output quality.
    - Decode and return the generated summary, excluding special tokens for readability.

- **Generating and Displaying a Summary**:
  - The first entry from the `Text` column of `selected_df` is selected for summarization.
  - The function `generate_t5_summary` is called to generate a summary for the selected text.
  - The original text and the generated summary are printed to compare and evaluate the model's performance.

This step helps illustrate the practical use of T5 for generating concise and meaningful summaries from longer pieces of text.
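
Printing the two texts side by side gives only a qualitative comparison. One way to put a number on it is unigram overlap with the dataset's reference `Summary`. The helper below is a simplified, ROUGE-1-style F1 written for illustration (the name `rouge1_f1` and the regex tokenization are assumptions, not an official ROUGE implementation); the two strings are a reference summary and a generated summary that appear earlier in this notebook.

```python
import re
from collections import Counter

def rouge1_f1(reference, candidate):
    """Unigram-overlap F1 between a reference and a candidate summary."""
    ref = Counter(re.findall(r"\w+", reference.lower()))
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    overlap = sum((ref & cand).values())  # clipped count of shared words
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("I like these!", "These are really tasty.")
print(f"ROUGE-1 F1 (simplified): {score:.3f}")
```

Scores like this are low for very short reference summaries even when the generated text is reasonable, so they are best read alongside the printed examples rather than instead of them.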


In [None]:
# Install necessary library if not already installed
!pip install transformers

from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load pre-trained T5 model and tokenizer
tokenizer = T5Tokenizer.from_pretrained('t5-small')  # You can use 't5-base' or 't5-large' for better results
model = T5ForConditionalGeneration.from_pretrained('t5-small')

# Function to summarize a single text
def generate_t5_summary(text, max_length=50, min_length=25):
    # Prepare the input by adding the prompt
    input_text = "summarize: " + text
    input_ids = tokenizer.encode(input_text, return_tensors="pt", truncation=True, max_length=512)

    # Generate the summary
    summary_ids = model.generate(input_ids, max_length=max_length, min_length=min_length, length_penalty=2.0, num_beams=4, early_stopping=True)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Get the first item from the 'Text' column of selected_df
first_text = selected_df['Text'].iloc[0]  # Ensure selected_df is defined

# Generate a summary for the first item
generated_summary = generate_t5_summary(first_text)

# Print the original text and the generated summary
print("Original Text:\n", first_text)
print("\nGenerated Summary using T5:\n", generated_summary)


Original Text:
 These are actually very tasty.  Pure potatoes with a great texture and no nasty filler "stuff."  No bacon, no cheese...just tasty potatoes.  They cook well in either the oven or microwave.  I add a touch of either salt & pepper or fajita seasoning to spice it up.  I rated 4 out of 5 stars because they could be a bit bigger portion.  However, this item is a fairly good value for the money.

Generated Summary using T5:
 pure potatoes with a great texture and no nasty filler "stuff" they cook well in either the oven or microwave. this item is a fairly good value for money.


In [2]:
!pip install faiss-gpu


Collecting faiss-gpu
  Downloading faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.4 kB)
Downloading faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (85.5 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 85.5/85.5 MB 25.9 MB/s eta 0:00:00
Installing collected packages: faiss-gpu
Successfully installed faiss-gpu-1.7.2


### Building a Simple Retrieval-Augmented Generation (RAG) System

This cell demonstrates how to build a basic Retrieval-Augmented Generation (RAG) system using a combination of FAISS for document retrieval and a pre-trained T5 model for response generation. Below is a breakdown of each step:

- **Library Imports**:
  - `pandas` for handling data in a DataFrame format.
  - `faiss` for creating a vector index to support efficient document retrieval.
  - `numpy` for array manipulations.
  - `T5Tokenizer` and `T5ForConditionalGeneration` from `transformers` for text generation.
  - `TfidfVectorizer` from `sklearn.feature_extraction.text` for creating text vector embeddings.

- **Sample Text Preparation**:
  - A sample text is used to create a small dataset (`selected_df`) with a column for the text and a corresponding summary. In practice, this would be replaced by a larger dataset.

- **Step 1: Vector Embeddings and FAISS Index Creation**:
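  - The `TfidfVectorizer` is fitted on the documents and converts each one into a TF-IDF vector; the sparse result is converted to a dense `float32` array, the format FAISS expects.
  - A FAISS index (`IndexFlatL2`) is created with the vocabulary size as its dimension and populated with these vectors. It performs exact, brute-force nearest-neighbour search by L2 distance.
  - The index is saved to disk as `text_vector_index.faiss` for later use.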

- **Step 2: Loading the T5 Model and Tokenizer**:
  - The `t5-small` tokenizer and model are loaded for generating text responses.

- **Step 3: Function to Retrieve Relevant Documents and Generate a Response**:
  - The `generate_rag_response` function retrieves relevant documents for a query and generates a response:
    - Loads the saved FAISS index and vectorizes the query with the same fitted `TfidfVectorizer`.
    - Searches the index for the `num_retrieved_docs` nearest documents.
    - Concatenates the retrieved documents and the query, prefixed with `summarize: `, as input for the T5 model (truncated to 512 tokens).
    - Generates a summary of the combined input with beam search and returns the decoded text.

- **Step 4: Testing the Function**:
  - The function is tested using the original text as the query. Because the index holds only this one document, retrieval always returns it, so the test exercises the pipeline end to end rather than retrieval quality.

This cell provides a basic implementation of a RAG system that can be extended for more complex applications, such as question answering or document summarization.
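
To make Step 1 more concrete, here is a minimal sketch of the weighting that `TfidfVectorizer` applies with its default settings (raw term counts, smoothed IDF, L2 normalisation of each vector), written in plain Python so the arithmetic is visible. The helper names (`tokenize`, `fit_idf`, `tfidf_vector`) and the three-sentence corpus are made up for illustration; the notebook itself relies on scikit-learn rather than this code.

```python
import math
import re
from collections import Counter

# Tokens are runs of two or more word characters, lowercased
# (this mirrors TfidfVectorizer's default tokenization).
TOKEN_RE = re.compile(r"\b\w\w+\b")


def tokenize(text):
    return TOKEN_RE.findall(text.lower())


def fit_idf(corpus):
    """Build a sorted vocabulary and smoothed IDF weights from a list of documents."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))  # count each term once per document
    vocab = sorted(doc_freq)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    return vocab, idf


def tfidf_vector(text, vocab, idf):
    """Return the L2-normalised TF-IDF vector of `text` over `vocab`."""
    counts = Counter(tokenize(text))
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    # Text with no known terms maps to the zero vector.
    return [x / norm for x in raw] if norm > 0 else raw


# Hypothetical toy corpus, not taken from the Amazon dataset.
corpus = [
    "tasty potatoes with great texture",
    "the coffee tastes bitter and stale",
    "great value potatoes for the money",
]
vocab, idf = fit_idf(corpus)
vectors = [tfidf_vector(doc, vocab, idf) for doc in corpus]

print("vocabulary size (the FAISS index dimension):", len(vocab))
print("idf of 'potatoes' vs 'coffee':", idf["potatoes"], idf["coffee"])
```

Terms that occur in fewer documents receive a larger IDF weight, which is why a rare word in the query pulls retrieval toward the few documents that contain it. Words that were not seen during fitting contribute nothing to the query vector.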


In [10]:
import pandas as pd
import faiss
import numpy as np
from transformers import T5Tokenizer, T5ForConditionalGeneration
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample text for testing
original_text = """These are actually very tasty. Pure potatoes with a great texture and no nasty filler 'stuff.'
No bacon, no cheese...just tasty potatoes. They cook well in either the oven or microwave. I add a touch of either
salt & pepper or fajita seasoning to spice it up. I rated 4 out of 5 stars because they could be a bit bigger portion.
However, this item is a fairly good value for the money."""

# Prepare the dataset (in practice, use a full dataset as shown before)
data = {
    'Text': [original_text],
    'Summary': ["I like these!"]  # Original summary, just for context
}
selected_df = pd.DataFrame(data)

# Step 1: Create vector embeddings and FAISS index
documents = selected_df['Text'].tolist()
vectorizer = TfidfVectorizer().fit(documents)
document_embeddings = vectorizer.transform(documents).toarray()

dimension = document_embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(document_embeddings.astype(np.float32))

# Save the FAISS index to disk
faiss.write_index(index, "text_vector_index.faiss")

# Step 2: Load the T5 model and tokenizer
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

# Step 3: Function to retrieve relevant documents and generate a response
def generate_rag_response(query, num_retrieved_docs=1):
    # Load the saved FAISS index
    index = faiss.read_index("text_vector_index.faiss")

    # Create an embedding for the query
    query_embedding = vectorizer.transform([query]).toarray().astype(np.float32)
    distances, indices = index.search(query_embedding, num_retrieved_docs)

    # Retrieve relevant documents (FAISS pads with -1 when the index holds fewer than k vectors)
    retrieved_docs = [documents[i] for i in indices[0] if i != -1]
    combined_input = " ".join(retrieved_docs) + " " + query

    # Prepare input for T5
    input_text = "summarize: " + combined_input
    input_ids = tokenizer.encode(input_text, return_tensors="pt", truncation=True, max_length=512)

    # Generate the response
    summary_ids = model.generate(
        input_ids,
        max_length=50,  # caps the output at 50 tokens, so longer summaries are cut off
        min_length=25,
        length_penalty=2.0,
        num_beams=4,
        early_stopping=True,
    )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Step 4: Test the function with the original text
query = original_text  # Use the original text as the query
response = generate_rag_response(query)
print("\nGenerated Summary:\n", response)



Generated Summary:
 Pure potatoes with a great texture and no nasty filler'stuff' they cook well in either the oven or microwave. I add a touch of either salt & pepper or fajita seasoning to
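
With a single indexed document, the search in the cell above can only ever return index 0. To show what `IndexFlatL2` computes on a larger index, here is a plain-Python sketch of exhaustive search by squared L2 distance. It is an illustration under assumptions, not FAISS code: `flat_l2_search`, `pick_documents`, and the toy vectors are hypothetical, the function handles one query at a time and returns flat lists (FAISS returns one row per query), and `math.inf` is only a stand-in for the distance value FAISS uses when padding.

```python
import math


def flat_l2_search(vectors, query, k):
    """Exhaustive nearest-neighbour search by squared L2 distance for one query."""
    scored = sorted(
        (sum((a - b) ** 2 for a, b in zip(vec, query)), i)
        for i, vec in enumerate(vectors)
    )
    top = scored[:k]
    distances = [d for d, _ in top]
    indices = [i for _, i in top]
    # When fewer than k vectors are stored, pad the result with index -1.
    missing = k - len(top)
    distances += [math.inf] * missing
    indices += [-1] * missing
    return distances, indices


def pick_documents(documents, indices):
    """Map result indices back to documents, skipping the -1 padding."""
    return [documents[i] for i in indices if i != -1]


# Hypothetical toy index: three documents with 2-dimensional vectors.
documents = ["potato review", "coffee review", "tea review"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]

distances, indices = flat_l2_search(vectors, [0.9, 0.1], k=2)
print(indices, pick_documents(documents, indices))
```

Skipping the `-1` entries matters because Python treats `documents[-1]` as the last element, so unfiltered padding would silently return an unrelated document.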


In [11]:
%%html
<style>
  table {
    width: 100%;
    border-collapse: collapse;
    margin: 20px 0;
  }
  th, td {
    padding: 15px;
    text-align: left;
    border-bottom: 1px solid #ddd;
  }
  th {
    background-color: #f4f4f4;
    font-weight: bold;
  }
  tr:first-child th {
    border-top: 2px solid #000;
  }
  td {
    word-wrap: break-word;
  }
</style>

<table>
  <tr>
    <th>Original Text</th>
    <td>These are actually very tasty.  Pure potatoes with a great texture and no nasty filler 'stuff.'  No bacon, no cheese...just tasty potatoes.  They cook well in either the oven or microwave.  I add a touch of either salt &amp; pepper or fajita seasoning to spice it up.  I rated 4 out of 5 stars because they could be a bit bigger portion.  However, this item is a fairly good value for the money.</td>
  </tr>
  <tr>
    <th>Original Summary</th>
    <td>I like these!</td>
  </tr>
  <tr>
    <th>Generated Summary with Prompt Engineering</th>
    <td>pure potatoes with a great texture and no nasty filler 'stuff' they cook well in either the oven or microwave. this item is a fairly good value for money.</td>
  </tr>
  <tr>
    <th>Generated Summary with Trained Soft Prompts</th>
    <td>These are really tasty.</td>
  </tr>
  <tr>
    <th>Predicted Summary with LoRA PEFT</th>
    <td>I rated them 4 out of 5 stars because they could be a bit bigger than a bit larger.</td>
  </tr>
  <tr>
    <th>Generated Summary using T5 RAG</th>
    <td>Pure potatoes with a great texture and no nasty filler'stuff' they cook well in either the oven or microwave. I add a touch of either salt &amp; pepper or fajita seasoning to</td>
  </tr>
</table>


0,1
Original Text,"These are actually very tasty. Pure potatoes with a great texture and no nasty filler 'stuff.' No bacon, no cheese...just tasty potatoes. They cook well in either the oven or microwave. I add a touch of either salt & pepper or fajita seasoning to spice it up. I rated 4 out of 5 stars because they could be a bit bigger portion. However, this item is a fairly good value for the money."
Original Summary,I like these!
Generated Summary with Prompt Engineering,pure potatoes with a great texture and no nasty filler 'stuff' they cook well in either the oven or microwave. this item is a fairly good value for money.
Generated Summary with Trained Soft Prompts,These are really tasty.
Predicted Summary with LoRA PEFT,I rated them 4 out of 5 stars because they could be a bit bigger than a bit larger.
Generated Summary using T5 RAG,Pure potatoes with a great texture and no nasty filler'stuff' they cook well in either the oven or microwave. I add a touch of either salt & pepper or fajita seasoning to
