# 4. Review Summarization Using Generative AI

This notebook covers:
- Loading clustered review data
- Using a generative model (OpenAI GPT) to produce summaries
- Generating blog-style recommendation articles per meta-category
- Output: Top 3 products, top complaints, worst product per category

**Model used:**
- OpenAI GPT-3.5-turbo, accessed through the OpenAI API

## 4.1 Imports & Setup

In [1]:
import pandas as pd
import numpy as np

from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

Using device: cpu


## 4.2 Load Clustered Data

In [2]:
df = pd.read_csv('data/clustered_reviews.csv')
print(f"Loaded {df.shape[0]} reviews across {df['meta_category'].nunique()} categories")
print(df['meta_category'].value_counts())

Loaded 28232 reviews across 5 categories
meta_category
Tablets                   14405
Batteries and Chargers    12033
E-Readers                  1049
Smart Assistants            643
Accessories                 102
Name: count, dtype: int64


## 4.3 Aggregate Reviews per Product & Category

We aggregate rating and sentiment statistics for each product within its meta-category. These per-product statistics drive the rankings that are later passed to the generative model.

In [3]:
# Aggregate stats per product
product_stats = df.groupby(['meta_category', 'name']).agg(
     avg_rating=('reviews.rating', 'mean'),
     num_reviews=('reviews.rating', 'count'),
     positive_pct=('sentiment', lambda x: (x == 'Positive').mean()),
     negative_pct=('sentiment', lambda x: (x == 'Negative').mean()),
).reset_index()
product_stats = product_stats.sort_values(['meta_category', 'avg_rating'], ascending=[True, False])
product_stats.head(10)

| | meta_category | name | avg_rating | num_reviews | positive_pct | negative_pct |
|---|---|---|---|---|---|---|
| 2 | Accessories | AmazonBasics 16-Gauge Speaker Wire - 100 Feet | 5.0 | 5 | 1.0 | 0.0 |
| 6 | Accessories | AmazonBasics Nespresso Pod Storage Drawer - 50... | 5.0 | 1 | 1.0 | 0.0 |
| 9 | Accessories | AmazonBasics Single-Door Folding Metal Dog Cra... | 5.0 | 1 | 1.0 | 0.0 |
| 10 | Accessories | AmazonBasics USB 3.0 Cable - A-Male to B-Male ... | 5.0 | 6 | 1.0 | 0.0 |
| 12 | Accessories | Cat Litter Box Covered Tray Kitten Extra Large... | 5.0 | 2 | 1.0 | 0.0 |
| 13 | Accessories | Expanding Accordion File Folder Plastic Portab... | 5.0 | 9 | 1.0 | 0.0 |
| 14 | Accessories | Two Door Top Load Pet Kennel Travel Crate Dog ... | 5.0 | 1 | 1.0 | 0.0 |
| 1 | Accessories | AmazonBasics 15.6-Inch Laptop and Tablet Bag | 4.52381 | 21 | 0.904762 | 0.047619 |
| 5 | Accessories | AmazonBasics External Hard Drive Case | 4.5 | 6 | 0.833333 | 0.0 |
| 7 | Accessories | AmazonBasics Nylon CD/DVD Binder (400 Capacity) | 4.25 | 4 | 0.75 | 0.25 |
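
To make the named-aggregation pattern used above easier to follow, here is a minimal sketch on a toy DataFrame. The column names match the notebook, but the rows and the `toy` / `toy_stats` names are made up for illustration only.

```python
import pandas as pd

# Toy data with the notebook's column names; the values are invented.
toy = pd.DataFrame({
    "meta_category": ["Tablets", "Tablets", "Tablets", "E-Readers"],
    "name": ["Tab A", "Tab A", "Tab B", "Reader X"],
    "reviews.rating": [5, 3, 4, 2],
    "sentiment": ["Positive", "Negative", "Positive", "Negative"],
})

# Each keyword becomes an output column: (source column, aggregation).
toy_stats = toy.groupby(["meta_category", "name"]).agg(
    avg_rating=("reviews.rating", "mean"),
    num_reviews=("reviews.rating", "count"),
    # Mean of a boolean Series is the fraction of True values.
    positive_pct=("sentiment", lambda s: (s == "Positive").mean()),
).reset_index()

print(toy_stats)
```

"Tab A" has two reviews (ratings 5 and 3, one positive), so its row should show an average of 4.0 and a positive share of 0.5.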


## 4.4 Identify Top/Worst Products per Category

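Here we identify the top 3 products and the worst product in each category, ranked by average rating. The sentiment percentages are not used in the ranking. Because the ranking uses the average alone, products with very few reviews can top the list: several Accessories entries have a 5.0 average from a single review.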

In [4]:
# For each category, find top 3 and worst product
def get_category_insights(category_df, category_name):
    """Extract the top 3 products and the worst product for a category, ranked by avg_rating."""
    sorted_df = category_df.sort_values('avg_rating', ascending=False)
    top_3 = sorted_df.head(3)
    worst = sorted_df.tail(1)
    
    return {
        'category': category_name,
        'top_3': top_3[['name', 'avg_rating', 'num_reviews']].to_dict('records'),
        'worst': worst[['name', 'avg_rating', 'num_reviews']].to_dict('records')[0],
    }

category_insights = {}
for cat in product_stats['meta_category'].unique():
    cat_df = product_stats[product_stats['meta_category'] == cat]
    category_insights[cat] = get_category_insights(cat_df, cat)
    print(f"\n=== {cat} ===")
    print(f"  Top 3: {[p['name'] for p in category_insights[cat]['top_3']]}")
    print(f"  Worst: {category_insights[cat]['worst']['name']}")


=== Accessories ===
  Top 3: ['AmazonBasics 16-Gauge Speaker Wire - 100 Feet', 'AmazonBasics Nespresso Pod Storage Drawer - 50 Capsule Capacity', 'AmazonBasics Single-Door Folding Metal Dog Crate - Large (42x28x30 Inches)']
  Worst: AmazonBasics Double-Door Folding Metal Dog Crate - Medium (36x23x25 Inches)

=== Batteries and Chargers ===
  Top 3: ['Kindle PowerFast International Charging Kit (for accelerated charging in over 200 countries)', 'Amazon 9W PowerFast Official OEM USB Charger and Power Adapter for Fire Tablets and Kindle eReaders', 'Amazon Kindle Charger Power Adapter Wall Charger And Usb Cable Micro Usb Cord']
  Worst: Oem Amazon Kindle Power Usb Adapter Wall Travel Charger Fire/dx/+micro Usb Cable

=== E-Readers ===
  Top 3: ['Kindle Voyage E-reader, 6 High-Resolution Display (300 ppi) with Adaptive Built-in Light, PagePress Sensors, Free 3G + Wi-Fi - Includes Special Offers', 'All-New Kindle Oasis E-reader - 7 High-Resolution Display (300 ppi), Waterproof, Built-In Audi
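
The ranking above sorts on `avg_rating` alone, so a product with one 5-star review can outrank a product with hundreds of reviews. One possible refinement, shown here only as a sketch, is to require a minimum number of reviews before a product is eligible. The function name `rank_with_min_reviews`, the `min_reviews=5` default, and the toy data are assumptions, not part of the original notebook.

```python
import pandas as pd

def rank_with_min_reviews(category_df, min_reviews=5, top_n=3):
    """Rank products by avg_rating, ignoring those with too few reviews."""
    eligible = category_df[category_df["num_reviews"] >= min_reviews]
    # Fall back to the full table if the threshold removes everything.
    if eligible.empty:
        eligible = category_df
    # Break rating ties by review count so better-supported products rank first.
    ranked = eligible.sort_values(
        ["avg_rating", "num_reviews"], ascending=[False, False]
    )
    return ranked.head(top_n), ranked.tail(1)

# Invented per-product stats in the same shape as product_stats.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "avg_rating": [5.0, 4.6, 4.2, 1.0],
    "num_reviews": [1, 40, 12, 2],
})
top, worst = rank_with_min_reviews(toy, min_reviews=5)
print(top["name"].tolist(), worst["name"].tolist())
```

With the threshold at 5, products "A" and "D" drop out, so neither the single 5-star review nor the two 1-star reviews decide the ranking. The right threshold depends on the category: Accessories has only 102 reviews in total, so a high cutoff would leave very few products.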

## 4.5 Extract Key Complaints

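We collect a small sample of negative reviews for each top-3 product so that its complaints can be included in the summaries. The sample is simply the first few negative reviews in dataset order, not a frequency-ranked list of complaints.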

In [5]:
# Gather negative reviews for top products to identify complaints
def get_complaints(product_name, df, n=5):
    """Return the first n negative reviews (in dataset order) for a product."""
    neg_reviews = df[(df['name'] == product_name) & (df['sentiment'] == 'Negative')]
    return neg_reviews['clean_text'].head(n).tolist()

# Attach complaints to each top-3 product
for cat, insights in category_insights.items():
    for product in insights['top_3']:
        complaints = get_complaints(product['name'], df)
        product['complaints'] = complaints

## 4.6 Load Generative Model

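Here we set up our chosen generative model (GPT-3.5-turbo) for summarization. The model is called through the OpenAI API rather than loaded locally, so the API key must be available in the `OPENAI_API_KEY` environment variable or in a `.env` file.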

In [10]:
import openai
import os
import dotenv
dotenv.load_dotenv()  # Load environment variables from .env file
openai.api_key = os.getenv('OPENAI_API_KEY')

def generate_with_gpt(prompt, max_tokens=500):
    """Send a single-turn prompt to gpt-3.5-turbo and return the generated text."""
    response = openai.chat.completions.create(
        model='gpt-3.5-turbo',
        messages=[{'role': 'user', 'content': prompt}],
        max_tokens=max_tokens,
        temperature=0.7,
    )
    return response.choices[0].message.content
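
API calls can fail transiently (rate limits, network errors), and the loop in 4.7 makes one call per category. A minimal retry helper with exponential backoff is sketched below. `call_with_retries` and `flaky_generate` are hypothetical names; the stand-in function simulates failures so that the sketch runs without an API key. Catching every `Exception` is a simplification, and in practice the `except` clause would be narrowed to the client's transient error types.

```python
import time

def call_with_retries(fn, *args, max_attempts=3, base_delay=1.0, **kwargs):
    """Call fn, retrying on any exception with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception:
            # Re-raise once the final attempt has failed.
            if attempt == max_attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before retrying.
            time.sleep(base_delay * 2 ** (attempt - 1))

# Stand-in for generate_with_gpt that fails twice before succeeding.
calls = {"n": 0}

def flaky_generate(prompt):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated transient error")
    return f"article for: {prompt}"

result = call_with_retries(flaky_generate, "Tablets", base_delay=0.01)
print(result)
```

In the notebook, the equivalent call would be `call_with_retries(generate_with_gpt, prompt, max_tokens=500)`.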

## 4.7 Generate Recommendation Articles

In this section, we will generate blog-style recommendation articles for each meta-category, highlighting the top products and key complaints.

In [11]:
def build_prompt(insights, reviews_df):
    """Build a prompt for the generative model to create a blog-style article."""
    category = insights['category']
    top_3 = insights['top_3']
    worst = insights['worst']
    
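    # Gather sample reviews for context
    review_snippets = []
    for product in top_3:
        product_reviews = reviews_df[reviews_df['name'] == product['name']]['clean_text'].head(3).tolist()
        review_snippets.append(f"Product: {product['name']} (Rating: {product['avg_rating']:.1f})")
        for r in product_reviews:
            review_snippets.append(f"  - {r[:200]}")
        # Include the complaints collected in 4.5 (if any) so the model
        # does not have to invent them for the "top complaints" section
        for c in product.get('complaints', []):
            review_snippets.append(f"  - Complaint: {c[:200]}")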
    
    worst_reviews = reviews_df[(reviews_df['name'] == worst['name']) & 
                               (reviews_df['sentiment'] == 'Negative')]['clean_text'].head(3).tolist()
    
    prompt = f"""Write a product recommendation blog post for the '{category}' category.

Top 3 recommended products:
{chr(10).join(review_snippets)}

Worst product to avoid: {worst['name']} (Rating: {worst['avg_rating']:.1f})
Negative reviews:
{chr(10).join(['  - ' + r[:200] for r in worst_reviews])}

The article should include:
1. A brief intro about the category
2. Top 3 products with key differences
3. Top complaints for each product
4. The worst product and why to avoid it
"""
    return prompt

print("Prompt builder ready.")

Prompt builder ready.


In [12]:
# Generate articles for each category
articles = {}
for cat, insights in category_insights.items():
    print(f"\nGenerating article for: {cat}")
    prompt = build_prompt(insights, df)
    
    # Using OpenAI GPT API:
    article = generate_with_gpt(prompt, max_tokens=500)
    
    articles[cat] = article
    print(article)
    print('---')


Generating article for: Accessories
When it comes to accessories, there are so many options to choose from that can enhance your everyday life. Whether you're looking for speaker wire, storage solutions, or pet crates, AmazonBasics has you covered with high-quality products at affordable prices.

1. AmazonBasics 16-Gauge Speaker Wire - 100 Feet:
This speaker wire is perfect for setting up your home entertainment system. It comes in a 100-foot spool, allowing you to easily connect your speakers to your receiver. With a 5.0 rating, customers rave about how quickly it arrived and how well it performed. The wire is durable and provides a clear sound quality.

Top complaints: Some customers mentioned that the wire can be a bit stiff and difficult to work with, especially if you need to make tight turns or bends.

2. AmazonBasics Nespresso Pod Storage Drawer - 50 Capsule Capacity:
If you're a fan of Nespresso coffee capsules, this storage drawer is a must-have. With a 50-capsule capacity, i

## 4.8 Save Generated Articles

In [13]:
# Save articles to a text file
with open('data/recommendation_articles.txt', 'w', encoding='utf-8') as f:
    for cat, article in articles.items():
        f.write(f"{'='*60}\n")
        f.write(f"Category: {cat}\n")
        f.write(f"{'='*60}\n\n")
        f.write(article)
        f.write('\n\n')
print("Articles saved to data/recommendation_articles.txt")

Articles saved to data/recommendation_articles.txt
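
The text file above is convenient for reading, but it has to be parsed to recover individual categories. As an optional complement, the `articles` dictionary could also be written as JSON so it can be reloaded directly. The sketch below uses a stand-in dictionary and a temporary directory; `articles_example` and the `.json` filename are assumptions, not files the notebook produces.

```python
import json
import os
import tempfile

# Stand-in for the notebook's `articles` dict (category -> article text).
articles_example = {
    "Tablets": "Sample article text.",
    "E-Readers": "Another sample.",
}

# A temporary directory keeps this sketch from touching the data/ folder.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "recommendation_articles.json")
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps any non-ASCII characters readable.
        json.dump(articles_example, f, ensure_ascii=False, indent=2)
    with open(path, encoding="utf-8") as f:
        reloaded = json.load(f)

print(reloaded == articles_example)
```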
