<a href="https://colab.research.google.com/github/AbdelwahedSouiid/Transformers/blob/aymen/Transforming%20Amazon%20Product%20Reviews%20into%20Insights%3A.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div style="text-align: center;">
    <strong>Transforming Amazon Product Reviews into Insights: A Sentiment Analysis Approach Using Transformer Models</strong>
</div>

## Introduction
This notebook applies transformer models to Amazon product reviews. The objective is to analyze customer sentiment and derive insights from the reviews. We prepare three-class sentiment labels for classification with BERT, RoBERTa, and DistilBERT, and explore summarization with T5, text generation with GPT-2, and question answering with BERT.


## Dataset Overview
In this section, we load the Amazon product reviews dataset. The dataset contains user reviews, product IDs, scores, and additional metadata. We will preprocess this data for sentiment analysis.


In [None]:
# !pip install kaggle



In [None]:
# from google.colab import files
# files.upload()  # Choose your kaggle.json file to upload


Saving kaggle.json to kaggle.json



In [None]:
# !kaggle datasets download -d arhamrumi/amazon-product-reviews

Dataset URL: https://www.kaggle.com/datasets/arhamrumi/amazon-product-reviews
License(s): CC0-1.0
Downloading amazon-product-reviews.zip to /content


In [None]:
# !unzip amazon-product-reviews.zip

Archive:  amazon-product-reviews.zip
  inflating: Reviews.csv             


# Load and Explore Data
First, we will load the dataset and take a quick look at the first few rows.

In [None]:
import pandas as pd

# Load the extracted reviews file
df = pd.read_csv('/content/Reviews.csv')
df.head()  # Display the first few rows


Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [None]:

import pandas as pd


def choose_top_100_per_score(df):
    """Return up to 100 reviews per score, taking the highest ProductIds first."""
    per_score = [
        df[df['Score'] == score].sort_values(by='ProductId', ascending=False).head(100)
        for score in df['Score'].unique()
    ]
    return pd.concat(per_score)


# Build the reduced dataset (up to 100 rows for each score value)
top_reviews_df = choose_top_100_per_score(df)

print(top_reviews_df.head())

            Id   ProductId          UserId    ProfileName  \
327600  327601  B009WVB40S  A3ME78KVX31T21           K'la   
5702      5703  B009WSNWC4   AMP7K1O84DH1T           ESTY   
328481  328482  B009UUS05I   ARL20DSHGVM1Y          Jamie   
221794  221795  B009SR4OQ2  A32A6X5KCP7ARG        sicamar   
188388  188389  B009SF0TN6  A1L0GWGRK4BYPT  Bety Robinson   

        HelpfulnessNumerator  HelpfulnessDenominator  Score        Time  \
327600                     0                       0      5  1351123200   
5702                       0                       0      5  1351209600   
328481                     0                       0      5  1331856000   
221794                     1                       1      5  1350604800   
188388                     0                       0      5  1350518400   

                                      Summary  \
327600                                 Tasty!   
5702                                DELICIOUS   
328481                             
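Sorting by `ProductId` and keeping the first 100 rows draws the reviews for each score from only a few products. A minimal sketch of an alternative that samples at random within each score is shown below. It works on plain Python dictionaries rather than the DataFrame, and `sample_per_score` and the toy rows are hypothetical names and data introduced for illustration.

```python
import random
from collections import defaultdict


def sample_per_score(rows, n_per_score=100, seed=42):
    """Randomly pick up to n_per_score rows for each distinct 'Score' value."""
    by_score = defaultdict(list)
    for row in rows:
        by_score[row["Score"]].append(row)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    sampled = []
    for score in sorted(by_score):
        group = by_score[score]
        k = min(n_per_score, len(group))  # groups smaller than the quota are kept whole
        sampled.extend(rng.sample(group, k))
    return sampled


# Toy rows standing in for records from Reviews.csv (values are made up)
rows = [{"Score": s, "Text": f"review {i}"}
        for i, s in enumerate([5, 5, 5, 1, 1, 3, 4, 4, 4, 4])]
subset = sample_per_score(rows, n_per_score=2)

counts = {}
for row in subset:
    counts[row["Score"]] = counts.get(row["Score"], 0) + 1
print(counts)  # {1: 2, 3: 1, 4: 2, 5: 2}
```

Which rows are drawn depends on the seed, but the count per score is fixed by the quota and the group sizes.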

In [None]:
df=top_reviews_df

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Map scores to three classes: 0 = scores 1-2, 1 = score 3, 2 = scores 4-5
df['label'] = df['Score'].apply(lambda x: 0 if x in [1, 2] else (1 if x == 3 else 2))
X_train, X_test, y_train, y_test = train_test_split(df['Text'], df['label'], test_size=0.2)


In [None]:
X_train.head()

Unnamed: 0,Text
417943,This is my most favorite fall flavor K-cup. T...
417829,"This is the ONLY apple cider Kcup I'll use, so..."
417985,To describe this product--words escape me--ter...
204270,These are amazing chips but they just cost too...
204281,The crisps are awesome. Give me English crisps...
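The labeling step maps five scores onto three classes. The sketch below writes the same mapping as a named function and counts the resulting classes. The class names negative, neutral and positive are an assumption, since the notebook only uses the integers 0, 1 and 2, and `score_to_label` is a hypothetical helper.

```python
from collections import Counter

# Assumed human-readable names for the three integer labels
LABEL_NAMES = {0: "negative", 1: "neutral", 2: "positive"}


def score_to_label(score):
    """Map a 1-5 score to 0 (scores 1-2), 1 (score 3) or 2 (scores 4-5)."""
    if score in (1, 2):
        return 0
    if score == 3:
        return 1
    return 2


scores = [1, 2, 3, 4, 5, 5]
labels = [score_to_label(s) for s in scores]
distribution = Counter(LABEL_NAMES[label] for label in labels)

print(labels)        # [0, 0, 1, 2, 2, 2]
print(distribution)
```

If every score contributes 100 reviews, the classes come out at roughly 200/100/200, so the middle class is half the size of the others. Passing `stratify=df['label']` to `train_test_split` would keep that ratio in both splits.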


# Text Summarization


In [None]:
# Import necessary libraries
from transformers import T5Tokenizer, T5ForConditionalGeneration

In [None]:
# Data preparation
reviews = df['Text'].tolist()

# Model Selection
tokenizer = T5Tokenizer.from_pretrained('t5-base')
model = T5ForConditionalGeneration.from_pretrained('t5-base')

# Tokenization
inputs = tokenizer(reviews, return_tensors='pt', max_length=512, truncation=True, padding=True)

# Training (This is a placeholder for your actual training code)
# Note: For training, you would need a set of summary texts corresponding to the reviews.

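# Example: summarize one review. The slice [1:2] selects the second review (index 1).
# Note: no "summarize: " task prefix is added here, which T5 normally expects for summarization.
input_ids = inputs['input_ids'][1:2]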
summary_ids = model.generate(input_ids, max_length=150, num_beams=4, early_stopping=True)

summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)



my kids and i love it.br />br />very happy with this product. My kids and i love it.br />very happy with this product.br />very happy with this product.
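The `br />` fragments in this summary come from `<br />` tags in the raw review text. Below is a minimal sketch of stripping such markup before tokenization, using only the standard library. `clean_review` is a hypothetical helper, and the regular expression is a simple approximation rather than a full HTML parser.

```python
import html
import re

_TAG_RE = re.compile(r"<[^>]+>")  # anything that looks like an HTML tag
_WS_RE = re.compile(r"\s+")       # runs of whitespace


def clean_review(text):
    """Remove HTML tags, decode entities and collapse whitespace."""
    no_tags = _TAG_RE.sub(" ", text)
    unescaped = html.unescape(no_tags)
    return _WS_RE.sub(" ", unescaped).strip()


raw = "Great taffy.<br /><br />Very soft &amp; chewy."
print(clean_review(raw))  # Great taffy. Very soft & chewy.
```

Applied to the notebook's data, this could look like `reviews = [clean_review(r) for r in df['Text']]` before the tokenizer call.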


In [None]:
type(summary)

str

# Text Generation

**Generating Responses with GPT-2**

In this section, we use GPT-2 to generate text from a prompt. We import the tokenizer and model, extract the review texts from the dataset, and generate a continuation of a fixed prompt. The reviews are not used in this example, because the pretrained `gpt2` checkpoint is loaded without fine-tuning. With the default decoding settings the continuation tends to repeat itself, as the output of the cell shows.


In [None]:
# Import necessary libraries
from transformers import GPT2Tokenizer, GPT2LMHeadModel


# Data preparation
reviews = df['Text'].tolist()

# Model Selection
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Example of generating a response
input_text = "What do you think about the product?"
input_ids = tokenizer.encode(input_text, return_tensors='pt')

# Generate a response
output = model.generate(input_ids, max_length=50, num_return_sequences=1)
response = tokenizer.decode(output[0], skip_special_tokens=True)
print(response)



The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


What do you think about the product?

I think it's a great product. I think it's a great product. I think it's a great product. I think it's a great product. I think it's a great product.
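The repeated sentence is typical of greedy decoding, and the warnings above ask for an explicit attention mask and pad token. The sketch below gathers generation settings that address both. The parameter names are existing arguments of `model.generate`, but the values are illustrative assumptions and `build_generation_kwargs` is a hypothetical helper.

```python
def build_generation_kwargs(eos_token_id, max_new_tokens=40):
    """Collect keyword arguments for `model.generate` in one place."""
    return {
        "max_new_tokens": max_new_tokens,  # limit on generated tokens, excluding the prompt
        "do_sample": True,                 # sample instead of greedy decoding
        "top_p": 0.9,                      # nucleus sampling threshold (illustrative value)
        "no_repeat_ngram_size": 3,         # never repeat the same 3-gram
        "pad_token_id": eos_token_id,      # GPT-2 has no pad token, so reuse EOS
    }


# 50256 is the EOS id reported in the warning above
kwargs = build_generation_kwargs(eos_token_id=50256)
print(kwargs)

# Hypothetical use with the tokenizer and model loaded in the cell above:
# enc = tokenizer(input_text, return_tensors='pt')
# output = model.generate(enc['input_ids'],
#                         attention_mask=enc['attention_mask'],
#                         **kwargs)
```

Because sampling is random, the generated text differs between runs unless a seed is set.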


# Question Answering

**Question Answering with BERT**

In this section, we implement a question-answering system using the BERT model. We start by importing the necessary libraries and selecting the BERT tokenizer and model for question answering. 

We define a sample question and use the first product review from our dataset as context for the model. The code tokenizes the input question and context, then uses the BERT model to predict the start and end positions of the answer within the context. Finally, we extract and print the answer based on the predicted indices.

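This demonstrates the mechanics of extractive question answering with BERT. Note that `bert-base-uncased` ships without a trained question-answering head, as the warning in the output below indicates, so the predicted span is effectively arbitrary. For meaningful answers, load a checkpoint fine-tuned on a QA dataset such as SQuAD, for example `bert-large-uncased-whole-word-masking-finetuned-squad`.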


In [None]:
# Import necessary libraries
from transformers import BertTokenizer, BertForQuestionAnswering
import torch

# Model Selection
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')

# Example Question and Context
question = "What is the flavor of the product?"
context = df['Text'].iloc[0]  # Use the first review as context

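# Tokenization (truncate the review, not the question, if the pair exceeds BERT's 512-token limit)
inputs = tokenizer(question, context, return_tensors='pt', truncation='only_second', max_length=512)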

# Get answer start and end scores
with torch.no_grad():
    outputs = model(**inputs)
    start_scores = outputs.start_logits
    end_scores = outputs.end_logits

# Get the most likely start and end of answer
start_index = torch.argmax(start_scores)
end_index = torch.argmax(end_scores) + 1  # Inclusive of the end index

# Get the answer
answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0][start_index:end_index]))
print(answer)


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


my son who ' s away at college . it was delivered right to his dorm room with very fast shipping . he loved it so much he called me to
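The extraction step takes the argmax of the start and end logits independently, so the predicted end can fall before the start and produce an empty answer. A minimal sketch of a joint search that only considers valid spans follows. `best_span` is a hypothetical helper, the logits are made-up numbers, and with the notebook's tensors the inputs would be something like `outputs.start_logits[0].tolist()`.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end), end inclusive, maximizing start + end score with start <= end."""
    best_score = float("-inf")
    best = (0, 0)
    for s, s_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_logits[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best


# Independent argmax would pick start=2 and end=0 here, which is not a valid span
start_logits = [0.1, 0.2, 3.0]
end_logits = [2.5, 0.1, 0.3]
print(best_span(start_logits, end_logits))  # (2, 2)
```

Since the returned end index is inclusive, the token slice would be `input_ids[start:end + 1]`.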


In [None]:
# Import necessary libraries
import pandas as pd
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the dataset
df = pd.read_csv('/content/Reviews.csv')

# Function to summarize reviews for a specific product ID
def summarize_reviews(product_id):
    # Filter reviews related to the specified product ID
    product_reviews = df[df['ProductId'] == product_id]['Text'].tolist()

    # Check if there are reviews available for the product
    if not product_reviews:
        return "No reviews found for this product ID."

    # Initialize T5 tokenizer and model
    tokenizer = T5Tokenizer.from_pretrained('t5-base')
    model = T5ForConditionalGeneration.from_pretrained('t5-base')

    # Prepare the input text for summarization
    input_text = " ".join(product_reviews)

    # Tokenization
    inputs = tokenizer([input_text], return_tensors='pt', max_length=512, truncation=True, padding=True)

    # Generate summary
    summary_ids = model.generate(inputs['input_ids'], max_length=150, num_beams=4, early_stopping=True)

    # Decode summary
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    return summary



In [None]:
# Assuming the summarization function is defined as above
product_id = "B006K2ZZ7K"  # Use the extracted product ID
summary = summarize_reviews(product_id)
print(f"Summary for product ID {product_id}:\n{summary}")

Summary for product ID B006K2ZZ7K:
lasted only two weeks! a great deal! very soft and chewy. The candies were individually wrapped well. very soft and chewy. The candies were individually wrapped well. taffy is so good.. taffy!!


# Summarizing Product Reviews with T5

In this section, we implement a function to summarize reviews for a specific product using the T5 model. 

**Overview of the Code:**
1. **Import Necessary Libraries**: We import `pandas` for data manipulation and the T5 tokenizer and model from the Hugging Face Transformers library.
2. **Load the Dataset**: The dataset containing product reviews is loaded from a CSV file.
3. **Function Definition**: 
   - The `summarize_reviews` function takes a `product_id` as input.
   - It filters the reviews and ratings related to the specified product ID.
   - If no reviews are found, a message is returned.
   - The overall rating is calculated by averaging the scores of the reviews.
   - The function initializes the T5 tokenizer and model, prepares the input text for summarization, and generates a summary of the reviews.
   - Finally, it returns the summary, the number of reviews, and the overall rating.
4. **Example Usage**: An example product ID is provided to demonstrate how to use the summarization function, and the results are printed.

This implementation showcases how to extract meaningful insights from customer reviews using T5, providing a concise summary along with relevant metrics about the product.
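The rating part of step 3 does not depend on the model and can be checked on its own. Below is a sketch under the assumption that the scores arrive as a plain list. `rating_stats` is a hypothetical helper, and the per-score distribution is an addition that is not part of the function described above.

```python
from collections import Counter


def rating_stats(scores):
    """Return the review count, mean score and per-score counts for one product."""
    num_reviews = len(scores)
    if num_reviews == 0:
        # Same keys as the non-empty case, so callers can always index the result
        return {"num_reviews": 0, "overall_rating": None, "distribution": {}}
    return {
        "num_reviews": num_reviews,
        "overall_rating": sum(scores) / num_reviews,
        "distribution": dict(sorted(Counter(scores).items())),
    }


print(rating_stats([5, 4, 5, 1]))
# {'num_reviews': 4, 'overall_rating': 3.75, 'distribution': {1: 1, 4: 1, 5: 2}}
print(rating_stats([]))
```

Returning a dictionary with the same keys in both cases avoids a type error when the caller reads `result['summary']`-style fields for a product without reviews.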


In [None]:
# Import necessary libraries
import pandas as pd
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the dataset
df = pd.read_csv('/content/Reviews.csv')

# Function to summarize reviews for a specific product ID
def summarize_reviews(product_id):
    # Filter reviews related to the specified product ID
    product_reviews = df[df['ProductId'] == product_id]['Text'].tolist()
    product_ratings = df[df['ProductId'] == product_id]['Score'].tolist()  # Adjust the column name if necessary

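    # Check if there are reviews available for the product
    num_reviews = len(product_reviews)

    if num_reviews == 0:
        # Return the same dict shape as the success path so callers can index it safely
        return {
            "summary": "No reviews found for this product ID.",
            "num_reviews": 0,
            "overall_rating": 0.0
        }

    # Calculate overall rating (num_reviews is non-zero here)
    overall_rating = sum(product_ratings) / num_reviews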

    # Initialize T5 tokenizer and model
    tokenizer = T5Tokenizer.from_pretrained('t5-base')
    model = T5ForConditionalGeneration.from_pretrained('t5-base')

    # Prepare the input text for summarization
    input_text = " ".join(product_reviews)

    # Tokenization
    inputs = tokenizer([input_text], return_tensors='pt', max_length=512, truncation=True, padding=True)

    # Generate summary
    summary_ids = model.generate(inputs['input_ids'], max_length=150, num_beams=4, early_stopping=True)

    # Decode summary
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    # Return the summary, number of reviews, and overall rating
    return {
        "summary": summary,
        "num_reviews": num_reviews,
        "overall_rating": overall_rating
    }

# Example usage
product_id = "B00004CI84"  # Replace with the actual product ID you want to summarize
result = summarize_reviews(product_id)
print(f"Summary for product ID {product_id}:\n{result['summary']}")
print(f"Number of reviews: {result['num_reviews']}")
print(f"Overall rating: {result['overall_rating']:.2f}")


FileNotFoundError: [Errno 2] No such file or directory: '/content/Reviews.csv'

In [None]:
# prompt: find the products in the data frame that were reviewed more than 5 times and return their IDs

# Assuming 'df' is your DataFrame loaded from 'Reviews.csv'

# Group by 'ProductId' and count the occurrences of each product
product_counts = df.groupby('ProductId')['ProductId'].count()

# Filter for products with more than 5 occurrences (reviewed more than 5 times)
commented_products = product_counts[product_counts > 5]

# Get the product IDs
product_ids = commented_products.index.tolist()

print(product_ids)

['0006641040', '7310172001', '7310172101', 'B00002N8SM', 'B00004CI84', 'B00004CXX9', 'B00004RAMS', 'B00004RAMV', 'B00004RAMX', 'B00004RAMY', 'B00004RBDU', 'B00004RBDW', 'B00004RBDZ', 'B00004RYGX', 'B00004S1C5', 'B00004S1C6', 'B00005344V', 'B0000537KC', 'B00005C2M2', 'B00005C2M3', 'B00005IX96', 'B00005IX97', 'B00005IX98', 'B00005OMWQ', 'B00005U2FA', 'B00006G930', 'B00006L2ZT', 'B00006LL38', 'B00008433V', 'B000084DWM', 'B000084E66', 'B000084E6V', 'B000084E76', 'B000084E9M', 'B000084EHV', 'B000084EL4', 'B000084ESM', 'B000084ETV', 'B000084EZ4', 'B000084F04', 'B000084F0P', 'B000084F1I', 'B000084F1O', 'B000084F3O', 'B000084F44', 'B000084F5E', 'B000084F6F', 'B00008CQVA', 'B00008DF91', 'B00008DFK5', 'B00008DFOG', 'B00008GQ4F', 'B00008JOL0', 'B00008MOIF', 'B00008MOJ2', 'B00008O36H', 'B00008Q3B7', 'B00008WUA9', 'B000093HKC', 'B00009KF1J', 'B00009OLE2', 'B00009OLEP', 'B00009ZIY2', 'B00009ZJ48', 'B0000AH3QW', 'B0000AH3RM', 'B0000AH3UK', 'B0000BXJIO', 'B0000BXJO8', 'B0000C69FB', 'B0000CA4TK', 'B000
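The list above gives the IDs but not how many reviews each product has. The sketch below keeps the counts and orders products from most to fewest reviews, using `collections.Counter` instead of pandas. `frequently_reviewed` and the toy IDs are hypothetical, and with the notebook's data the input could be `df['ProductId'].tolist()`.

```python
from collections import Counter


def frequently_reviewed(product_ids, min_reviews=5):
    """Return (product_id, count) pairs with more than min_reviews reviews, most reviewed first."""
    counts = Counter(product_ids)
    frequent = [(pid, n) for pid, n in counts.items() if n > min_reviews]
    # Sort by descending count, then by ID so ties have a stable order
    return sorted(frequent, key=lambda item: (-item[1], item[0]))


# Toy IDs standing in for the ProductId column (made up for illustration)
ids = ["A"] * 7 + ["B"] * 5 + ["C"] * 6 + ["D"]
print(frequently_reviewed(ids))  # [('A', 7), ('C', 6)]
```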

In [None]:
# Import necessary libraries
import pandas as pd
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the dataset
df = pd.read_csv('/content/Reviews.csv')

# Function to summarize reviews for a specific product ID
def summarize_reviews(product_id):
    # Filter reviews related to the specified product ID
    product_reviews = df[df['ProductId'] == product_id]['Text'].tolist()
    product_ratings = df[df['ProductId'] == product_id]['Score'].tolist()  # Adjust the column name if necessary
    user_profiles = df[df['ProductId'] == product_id]['UserId']  # Collected but not yet included in the result

    # Check if there are reviews available for the product
    num_reviews = len(product_reviews)

    if num_reviews == 0:
        # Return the same dict shape as the success path so callers can index it safely
        return {
            "summary": "No reviews found for this product ID.",
            "num_reviews": 0,
            "overall_rating": 0.0
        }

    # Calculate overall rating (num_reviews is non-zero here)
    overall_rating = sum(product_ratings) / num_reviews

    # Initialize T5 tokenizer and model
    tokenizer = T5Tokenizer.from_pretrained('t5-base')
    model = T5ForConditionalGeneration.from_pretrained('t5-base')

    # Prepare the input text for summarization
    input_text = " ".join(product_reviews)

    # Tokenization
    inputs = tokenizer([input_text], return_tensors='pt', max_length=512, truncation=True, padding=True)

    # Generate summary
    summary_ids = model.generate(inputs['input_ids'], max_length=150, num_beams=4, early_stopping=True)

    # Decode summary
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    return {
        "summary": summary,
        "num_reviews": num_reviews,
        "overall_rating": overall_rating
    }

In [None]:


# Example usage
product_id = "B001E4KFG0"  # Replace with the actual product ID you want to summarize
result = summarize_reviews(product_id)

# Print the summary and rating statistics
print(f"Summary for product ID {product_id}:\n{result['summary']}")
print(f"Number of reviews: {result['num_reviews']}")
print(f"Overall rating: {result['overall_rating']:.2f}")




Summary for product ID B001E4KFG0:
My Labrador loves this canned dog food product. taste better.
Number of reviews: 1
Overall rating: 5.00


In [2]:
# Example usage
product_id = "B00004CI84"  # Replace with the actual product ID you want to summarize
result = summarize_reviews(product_id)

# Print the summary and rating statistics
print(f"Summary for product ID {product_id}:\n{result['summary']}")
print(f"Number of reviews: {result['num_reviews']}")
print(f"Overall rating: {result['overall_rating']:.2f}")

Summary for product ID B00004CI84:
blah blah! It's Tim Burton's Beetlejuice! The cast are great in this movie as they all seem to give a 110% into their characters and the whole movie is so funny haha blah blah! It's Tim Burton's Beetlejuice! It's a great movie. Alec Baldwin and Geena Davis were fine as beetlejuice, but I just couldn't see the point of the movie. My 8-year
Number of reviews: 189
Overall rating: 4.49
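All 189 reviews are joined into one string and truncated to 512 tokens, so the summary above can only reflect the first few reviews. One possible workaround, not used in this notebook, is to split the reviews into chunks that each fit the limit and summarize them separately. The sketch below shows only the chunking step. `chunk_reviews` is a hypothetical helper, and counting whitespace-separated words is a rough stand-in for counting tokens.

```python
def chunk_reviews(reviews, max_words=350):
    """Greedily group consecutive reviews so each chunk stays within max_words words."""
    chunks, current, current_len = [], [], 0
    for review in reviews:
        n = len(review.split())
        # Start a new chunk when adding this review would exceed the budget
        if current and current_len + n > max_words:
            chunks.append(current)
            current, current_len = [], 0
        current.append(review)
        current_len += n
    if current:
        chunks.append(current)
    return chunks


# Toy reviews with a small budget to make the grouping visible
toy_reviews = ["a b c", "d e", "f g h i", "j"]
print(chunk_reviews(toy_reviews, max_words=5))
# [['a b c', 'd e'], ['f g h i', 'j']]
```

Each chunk could then be joined and passed to the model, and the partial summaries combined in a second pass. A single review longer than the budget still gets its own chunk and would be truncated by the tokenizer.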

