# Part 4 - Generator / Summarizer
In this notebook, we implement the 3rd transformer, in charge or generating a review of the best product per category, from a batch of reviews and sentiments

## Imports

In [1]:
import os
import re
import json
import pandas as pd
from transformers import AutoModelForCausalLM
import torch
import random
from datetime import datetime
from tqdm import tqdm
tqdm.pandas()  # Initialize tqdm with pandas
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# import helpers module
import helpers
import importlib
importlib.reload(helpers)

SEP = 100 * '-'

In [2]:
# load HugginFace token to environment
if not os.environ.get('HF_TOKEN'):
    os.environ['HF_TOKEN'] = input('Enter API token for Hugging Face: ')
else:
    print('Hugging Face token already loaded to environment')

## Load datasets

In [4]:
from functools import reduce

# Load datasets
data_processed = helpers.load_pickled_dataset('pickle/data_processed.pkl')
data_clustered = helpers.load_pickled_dataset('pickle/data_clustered.pkl')
data_sa = helpers.load_pickled_dataset('pickle/data_sentiment_analysis.pkl')

# Select columns
proc_columns = ['name', 'brand', 'reviews.rating', 'reviews.numHelpful', 'review']
cluster_columns = ['review', 'clusterCategories']
sa_columns = ['review', 'reviews.sentiment', 'reviews.ft', 'reviews.ft.sentiment']

# Filter DataFrames by the selected columns
data1 = data_processed[proc_columns]
data2 = data_clustered[cluster_columns]
data3 = data_sa[sa_columns]

# Merge the DataFrames sequentially
data = reduce(lambda left, right: pd.merge(left, right, on='review', how='outer'), [data1, data2, data3])


Dataset loaded from pickle/data_processed.pkl.
Dataset loaded from pickle/data_clustered.pkl.
Dataset loaded from pickle/data_sentiment_analysis.pkl.


In [5]:
helpers.print_random_product_sheet(data)

----------------------------------------------------------------------------------------------------
[name] All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi, 32 GB - Includes Special Offers, Blue
----------------------------------------------------------------------------------------------------
[brand] Amazon
----------------------------------------------------------------------------------------------------
[reviews.rating] 5.0
----------------------------------------------------------------------------------------------------
[reviews.numHelpful] 0.0
----------------------------------------------------------------------------------------------------
[review] My daughter in law loved it!
Not only did my daughter in law love it I was able to experience convenience of having it shipped!
I recommend this product.
----------------------------------------------------------------------------------------------------
[clusterCategories] Home & Smart Devices
-------------------------------------

## METHOD 1: Prompt Fine-Tuning
Let's see what we can get using only prompt fine-tuning. For this I will try [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-v0.3) in 4bit quantization for quick inference.

Using 4-bit quantization will help reduce memory usage significantly while retaining reasonable model performance.

In [6]:
!transformers-cli cache clear

2024-10-16 18:42:22.685511: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-10-16 18:42:22.685682: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-10-16 18:42:22.823240: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-10-16 18:42:23.100075: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
usage: transformers-cli <command> [<args>]
Transforme

### Load Mistral-7b 4bits

In [7]:
from transformers import LlamaTokenizer

model_name = "mistralai/Mistral-7B-v0.3"
tokenizer = LlamaTokenizer.from_pretrained(model_name)

# load 4bit quantization model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,
    device_map="auto"
)

# test model
inputs = tokenizer("Hello my name is", return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))


tokenizer_config.json:   0%|          | 0.00/137k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/587k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.96M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/601 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/4.55G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
2024-10-16 18:44:05.870681: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-10-16 18:44:05.870893: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-10-16 18:44:05.967797: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-10-16 18:44:06.164770: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate c

Hello my name is Katie and I am a 20 year old student at the University of North Carolina at Chap


### Create a prompter

In [8]:
# Set pad_token to eos_token
tokenizer.pad_token = tokenizer.eos_token

def generate_summary(prompt, review, max_tokens=100):
    inputs = tokenizer(prompt.format(review_text=review), return_tensors="pt", padding=True, truncation=True).to(device)
    
    outputs = model.generate(
        inputs.input_ids,
        max_new_tokens=max_tokens,
        temperature=0.3,
        top_p=0.8,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id
    ).to(device)
    
    return tokenizer.decode(outputs[0], skip_special_tokens=True)



### Zero-shot prompting
Let's work on a prompt template so the model extract information from reviews and summarize them

In [28]:
prompt = """Summarize the following product review in one sentence, highlighting the main points.
Then, provide a list of pros and cons and a short overall conclusion.

### Review: 
{review_text}

### One-sentence summary, pros and cons and conclusion:
"""

review_text = "This laptop has a fantastic display and long battery life, but the keyboard is a bit cramped, and it tends to get hot during heavy use. Overall, it performs well for everyday tasks and is good value for the price."

result = generate_summary(prompt, review_text)

print(result)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Summarize the following product review in one sentence, highlighting the main points.
Then, provide a list of pros and cons and a short overall conclusion.

### Review: 
This laptop has a fantastic display and long battery life, but the keyboard is a bit cramped, and it tends to get hot during heavy use. Overall, it performs well for everyday tasks and is good value for the price.

### One-sentence summary, pros and cons and conclusion:

- Pros:
  - Fantastic display
  - Long battery life
  - Good value for the price
- Cons:
  - Keyboard is a bit cramped
  - Tends to get hot during heavy use
- Conclusion:
  - Performs well for everyday tasks






































The result is not exactly what I need. Let's try to include 1 or 2 example in the prompt.

## Few-shot prompting

In [10]:
prompt = """
Summarize the following product review in one sentence, highlighting the main points.
Then, provide a list of pros and cons and a short overall conclusion.

### Example 1:
# Review: 
This phone has a great camera and sleek design, but the battery life is disappointing.
# One-sentence summary: 
"Great camera and design, but poor battery life."
# Pros:
  - Great camera
  - Sleek design
# Cons:
  - Disappointing battery life
# Conclusion:
"Good for photography lovers but might need frequent charging."

### Example 2:
# Review: 
"This vacuum cleaner is powerful and easy to use, but it's noisy and a bit heavy."
# One-sentence summary:
"Effective but noisy and heavy."
# Pros:
  - Powerful
  - Easy to use
# Cons:
  - Noisy
  - Heavy
# Conclusion:
"Ideal for quick, powerful cleaning if noise isn’t an issue."

# Review: 
{review_text}
"""

review_text = "This blender is absolutely fantastic! It blends everything smoothly, whether it's ice, vegetables, or fruits. I love the different speed settings, which give a lot of flexibility for various recipes. However, it’s quite loud, especially on the higher settings, so it can be a bit disruptive in the kitchen. The jar is also a bit heavy, which makes it a little difficult to handle, especially when it's full. Cleaning is easy, but the blade is really sharp, so I need to be careful. Overall, it's a solid blender for anyone who wants power and versatility in their kitchen."

result = generate_summary(prompt, review_text, max_tokens=200)

print(result)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Summarize the following product review in one sentence, highlighting the main points.
Then, provide a list of pros and cons and a short overall conclusion.

### Example 1:
# Review: 
This phone has a great camera and sleek design, but the battery life is disappointing.
# One-sentence summary: 
"Great camera and design, but poor battery life."
# Pros:
  - Great camera
  - Sleek design
# Cons:
  - Disappointing battery life
# Conclusion:
"Good for photography lovers but might need frequent charging."

### Example 2:
# Review: 
"This vacuum cleaner is powerful and easy to use, but it's noisy and a bit heavy."
# One-sentence summary:
"Effective but noisy and heavy."
# Pros:
  - Powerful
  - Easy to use
# Cons:
  - Noisy
  - Heavy
# Conclusion:
"Ideal for quick, powerful cleaning if noise isn’t an issue."

# Review: 
This blender is absolutely fantastic! It blends everything smoothly, whether it's ice, vegetables, or fruits. I love the different speed settings, which give a lot of flexibil

The result is pretty close to what we need. However, the model keeps generating text until it reaches the max_tokens length. I can't manage to get the end of token to work.

## JSON output
Let's try to pack the information into a json object

In [36]:
prompt = """
Summarize the following product review and provide the information in a structured JSON format with the following keys:
- "summary": A single-sentence summary of the review.
- "pros": A list of positive aspects.
- "cons": A list of negative aspects.
- "conclusion": An overall conclusion.

### Review:
{review_text}

### JSON output:
{{
  "summary": "",
  "pros": [],
  "cons": [],
  "conclusion": ""
}}
"""

review_text = "This blender is absolutely fantastic! It blends everything smoothly, whether it's ice, vegetables, or fruits. I love the different speed settings, which give a lot of flexibility for various recipes. However, it’s quite loud, especially on the higher settings, so it can be a bit disruptive in the kitchen. The jar is also a bit heavy, which makes it a little difficult to handle, especially when it's full. Cleaning is easy, but the blade is really sharp, so I need to be careful. Overall, it's a solid blender for anyone who wants power and versatility in their kitchen."

result = generate_summary(prompt, review_text, max_tokens=200)
print(result)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Summarize the following product review and provide the information in a structured JSON format with the following keys:
- "summary": A single-sentence summary of the review.
- "pros": A list of positive aspects.
- "cons": A list of negative aspects.
- "conclusion": An overall conclusion.

### Review:
This blender is absolutely fantastic! It blends everything smoothly, whether it's ice, vegetables, or fruits. I love the different speed settings, which give a lot of flexibility for various recipes. However, it’s quite loud, especially on the higher settings, so it can be a bit disruptive in the kitchen. The jar is also a bit heavy, which makes it a little difficult to handle, especially when it's full. Cleaning is easy, but the blade is really sharp, so I need to be careful. Overall, it's a solid blender for anyone who wants power and versatility in their kitchen.

### JSON output:
{
  "summary": "",
  "pros": [],
  "cons": [],
  "conclusion": ""
}

### JSON output:
{
  "summary": "This

In [None]:
def extract_json_dict(result: str) -> dict:
    """Extract JSON content from the model output and convert it to a Python dictionary.

    Args:
        result (str): The model output containing the JSON content.

    Returns:
        dict: A Python dictionary containing the extracted JSON content.

    Exception:
        json.JSONDecodeError: If there is an error decoding the JSON content.
    """
    # let's extract the json part and store it in a dict
    json_content = re.search(r'json\s*(\{.*?\})\s*', result, re.DOTALL)

    review_dict = {}

    if json_content:
        extracted_json = json_content.group(1) 
        try: 
            # Convert JSON string to Python dictionary 
            review_dict = json.loads(extracted_json)
        except json.JSONDecodeError as e: 
            print(f"Error decoding JSON: {e}")
    else: 
        print("No JSON content found.")

    return review_dict

review_dict = extract_json_dict(result)

print(SEP)
for k, v in review_dict.items():
    print(f"{k.upper()}: {v}")

This is working great. Outputing json contents make it very clear, concise and avoid mixed up information or repetition.


## Multiple review summarization

Let's try to input a list of reviews into the model and see if it can output a json summary.

In [23]:
prompt = """
Summarize the following reviews of the same product in a structured JSON format, combining insights from all three reviews into a single summary. Use the following keys:
- "summary": A single-sentence summary reflecting the general or mix of sentiment and key points from all reviews.
- "pros": A list of positive aspects listed in the reviews.
- "cons": A list of negative aspects listed in the reviews.
- "conclusion": An overall conclusion based on the reviews, pros and cons.
Generate a single JSON output that combines the information from all three reviews.
"""

# let's try with a positive, a neutral and a negative review
reviews_list = [
    "I absolutely love this blender! It blends everything so smoothly, from ice to veggies and fruits. The different speed settings are super useful, and it's easy to clean. The motor is powerful, and it doesn't struggle with anything I throw at it. Definitely worth the price if you're serious about smoothies or cooking!",
    "The blender is decent overall. It does its job well, though it can be a bit noisy at higher speeds. The jar is also a bit heavy, which makes it slightly awkward to handle when full. But it's sturdy, and I appreciate the variety of settings. Not a bad choice, but nothing extraordinary.",
    "Unfortunately, this blender has been a disappointment. It struggles to blend frozen items properly, and it makes a lot of noise. After a few uses, the blade already seems dull. It's hard to clean, especially around the blade area, and I feel it's overpriced for the quality you get. I wouldn’t recommend it."
]

# add reviews to the prompt
for i, review in enumerate(reviews_list):
    prompt += f"\n### Review {i+1}:\n{review}\n"

# add JSON output format
prompt += """
### Unique JSON output:
{{
  "summary": "",
  "pros": [],
  "cons": [],
  "conclusion": ""
}}
"""

print(prompt)


Summarize the following reviews of the same product in a structured JSON format, combining insights from all three reviews into a single summary. Use the following keys:
- "summary": A single-sentence summary reflecting the general or mix of sentiment and key points from all reviews.
- "pros": A list of positive aspects listed in the reviews.
- "cons": A list of negative aspects listed in the reviews.
- "conclusion": An overall conclusion based on the reviews, pros and cons.
Generate a single JSON output that combines the information from all three reviews.

### Review 1:
I absolutely love this blender! It blends everything so smoothly, from ice to veggies and fruits. The different speed settings are super useful, and it's easy to clean. The motor is powerful, and it doesn't struggle with anything I throw at it. Definitely worth the price if you're serious about smoothies or cooking!

### Review 2:
The blender is decent overall. It does its job well, though it can be a bit noisy at hi

In [24]:
result = generate_summary(prompt, review_text, max_tokens=200)
print(result)

review_dict = extract_json_dict(result)
print(SEP)
for k, v in review_dict.items():
    print(f"{k.upper()}: {v}")


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Summarize the following reviews of the same product in a structured JSON format, combining insights from all three reviews into a single summary. Use the following keys:
- "summary": A single-sentence summary reflecting the general or mix of sentiment and key points from all reviews.
- "pros": A list of positive aspects listed in the reviews.
- "cons": A list of negative aspects listed in the reviews.
- "conclusion": An overall conclusion based on the reviews, pros and cons.
Generate a single JSON output that combines the information from all three reviews.

### Review 1:
I absolutely love this blender! It blends everything so smoothly, from ice to veggies and fruits. The different speed settings are super useful, and it's easy to clean. The motor is powerful, and it doesn't struggle with anything I throw at it. Definitely worth the price if you're serious about smoothies or cooking!

### Review 2:
The blender is decent overall. It does its job well, though it can be a bit noisy at hi

We notice the main issue here: Due to contradictory review sentiments, not agreeing on the same topic, the output is a bit erratic and the model strives at summarizing them.

To face this problem, and also the issue of context window and large amount of review text used as input, we could split the task into 2 differemt prompts, using the same model:
- Prompt/Model 1 will ask the model to summarize a review into a few words carrying most of the review ideas.
- Prompt/Model 2 will take all summarized reviews of a product and output the json dict we asked earlier

## Model 1: Summarizing reviews
Let's start with a simple model summarizing reviews.

In [None]:
prompt_review_summary = """
Summarize the following product review.

### Review 1:
"This laptop exceeded my expectations. The battery lasts all day, and it’s super lightweight, which makes it perfect for traveling. The display is clear and bright, and the performance is fast even with multiple programs running. My only complaint is that the keyboard feels a bit shallow. Overall, I’m really happy with my purchase!"
### Summary 1:
"Lightweight and fast laptop with a great display and long-lasting battery, but the keyboard is a bit shallow."

### Review 2:
"The headphones are okay but not as good as I expected. The sound quality is decent, and they’re comfortable to wear for a while. However, they don’t block out background noise as well as I hoped. For the price, I think there are better options out there."
### Summary 2:
"Decent sound quality and comfortable, but poor noise isolation and better options exist for the price."

### Review 3:
{review_text}
### Summary 3:
"""

review_text = "This blender is absolutely fantastic! It blends everything smoothly, whether it's ice, vegetables, or fruits. I love the different speed settings, which give a lot of flexibility for various recipes. However, it’s quite loud, especially on the higher settings, so it can be a bit disruptive in the kitchen. The jar is also a bit heavy, which makes it a little difficult to handle, especially when it's full. Cleaning is easy, but the blade is really sharp, so I need to be careful. Overall, it's a solid blender for anyone who wants power and versatility in their kitchen."

result = generate_summary(prompt_review_summary, review_text, max_tokens=20)

print(result)


In [9]:
prompt_review_summary = """
Summarize the following product review information. Return an empty string if you are unable to generate a summary.

Review 1:
"This laptop exceeded my expectations. The battery lasts all day, and it’s super lightweight, which makes it perfect for traveling. The display is clear and bright, and the performance is fast even with multiple programs running. My only complaint is that the keyboard feels a bit shallow. Overall, I’m really happy with my purchase!"
Summary 1:
->"Lightweight and fast laptop with a great display and long-lasting battery, but the keyboard is a bit shallow."

Review 2:
"The headphones are okay but not as good as I expected. The sound quality is decent, and they’re comfortable to wear for a while. However, they don’t block out background noise as well as I hoped. For the price, I think there are better options out there."
Summary 2:
->"Decent sound quality and comfortable, but poor noise isolation and better options exist for the price."

Review 3:
{review_text}
Summary 3:
"""

review_text = "This blender is absolutely fantastic! It blends everything smoothly, whether it's ice, vegetables, or fruits. I love the different speed settings, which give a lot of flexibility for various recipes. However, it’s quite loud, especially on the higher settings, so it can be a bit disruptive in the kitchen. The jar is also a bit heavy, which makes it a little difficult to handle, especially when it's full. Cleaning is easy, but the blade is really sharp, so I need to be careful. Overall, it's a solid blender for anyone who wants power and versatility in their kitchen."

result = generate_summary(prompt_review_summary, review_text, max_tokens=50)

print(result)


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.



Summarize the following product review information. Return an empty string if you are unable to generate a summary.

Review 1:
"This laptop exceeded my expectations. The battery lasts all day, and it’s super lightweight, which makes it perfect for traveling. The display is clear and bright, and the performance is fast even with multiple programs running. My only complaint is that the keyboard feels a bit shallow. Overall, I’m really happy with my purchase!"
Summary 1:
->"Lightweight and fast laptop with a great display and long-lasting battery, but the keyboard is a bit shallow."

Review 2:
"The headphones are okay but not as good as I expected. The sound quality is decent, and they’re comfortable to wear for a while. However, they don’t block out background noise as well as I hoped. For the price, I think there are better options out there."
Summary 2:
->"Decent sound quality and comfortable, but poor noise isolation and better options exist for the price."

Review 3:
This blender is

In [10]:
# Regex to match text after "[Review 3]" and stop at the next line break
def extract_review_summary(result: str) -> str:
    # Match all summaries marked by '->"' and capture their content
    match = re.findall(r'->"(.*?)"', result, re.DOTALL)
    if match and len(match) >= 3:
        # Return the third summary with leading/trailing spaces removed
        return match[2].strip()
    else:
        print("No review summary found.")
        return ""  # Return empty string if no match is found

extract_review_summary(result)

'Powerful blender with different speed settings, but loud and heavy jar, sharp blade, and difficult to handle.'

Perfect. Now let's try it on our dataset.

In [31]:
for _ in range(10):
    # get random row and extract review
    row_idx = random.randint(0, len(data))
    review_text = data['review'][row_idx]
    print("REVIEW:")
    print(review_text)

    # generate summary
    result = generate_summary(prompt_review_summary, review_text, max_tokens=50)
    review_summary = extract_review_summary(result)
    print("SUMMARY:")
    print(review_summary)

    print(SEP)

REVIEW:
good product
Bought this as a gift for my mother. She absolutely loves it. Easy to set up and use. and the price was right.
I recommend this product.
SUMMARY:
Easy to set up and use, good price, and my mother loves it.
----------------------------------------------------------------------------------------------------
REVIEW:
Nice tablet for a kid
Had to get this tablet it was a great price and came with free amazon movies for a year, I just had to change the case.
I recommend this product.
SUMMARY:
Great price, free movies, and a good case.
----------------------------------------------------------------------------------------------------
REVIEW:
love it
great product, easy to use, love the lighted pad, ease of down loading books,
I recommend this product.
SUMMARY:
Easy to use, lighted pad, and ease of down loading books, recommend this product.
----------------------------------------------------------------------------------------------------
REVIEW:
Excellent
I have ordere

The results are prety good. We can see that the model tends to hallucinate on some summaries and deduct information that are not necessarely true and specified in the original review. Example:

REVIEW:  
- Great tablet. I like how cheap this is. It's great for browsing the Internet.

SUMMARY:  
- Great for browsing the Internet, but cheap and not suitable for other tasks.

## Model 2: Extracting information from reviews
This second model will take as input a list of reviews and extract information into a json file

### Load our product scores
In notebook 5 we compute scores for all products in each category, so we can extract the top 3 or the worst product.

In [11]:
# load product score dataset
data_scored = helpers.load_pickled_dataset('pickle/data_scored.pkl')

data_scored.head()

Dataset loaded from pickle/data_scored.pkl.


Unnamed: 0,clusterCategories,name,reviews.rating
7,Accessories & Adapters,AmazonBasics AAA Performance Alkaline Batterie...,62201.0
6,Accessories & Adapters,AmazonBasics AA Performance Alkaline Batteries...,43195.0
19,Accessories & Adapters,"Fire Kids Edition Tablet, 7 Display, Wi-Fi, 16...",40862.0
18,Accessories & Adapters,"Fire Kids Edition Tablet, 7 Display, Wi-Fi, 16...",25108.0
20,Accessories & Adapters,"Fire Kids Edition Tablet, 7 Display, Wi-Fi, 16...",23945.0


In [12]:
# print unique categories
category_names = data_scored['clusterCategories'].unique()
for cat in category_names:
    print(cat)

Accessories & Adapters
Electronics & Media
Home & Smart Devices
Pet Supplies
Tablets & eReaders
Video & Streaming


In [15]:
# Select the top 3 products per category
def best_products_per_category(data, category_name, top_n=3):
    # subset the category
    category_subset = data[data['clusterCategories'] == category_name]
    # select the top n products
    return category_subset.head(top_n)

best_products_per_category(data_scored, 'Accessories & Adapters', 3)

Unnamed: 0,clusterCategories,name,reviews.rating
7,Accessories & Adapters,AmazonBasics AAA Performance Alkaline Batterie...,62201.0
6,Accessories & Adapters,AmazonBasics AA Performance Alkaline Batteries...,43195.0
19,Accessories & Adapters,"Fire Kids Edition Tablet, 7 Display, Wi-Fi, 16...",40862.0


In [18]:
# store the top 3 products for each category into a dict
top_products = {}
for cat in category_names:
    top_products[cat] = best_products_per_category(data_scored, cat, 3)['name'].tolist()

for k, v in top_products.items():
    print(f"{k.upper()}:")
    for product in v:
        print(f"  - {product}")
    print(SEP)


ACCESSORIES & ADAPTERS:
  - AmazonBasics AAA Performance Alkaline Batteries (36 Count)
  - AmazonBasics AA Performance Alkaline Batteries (48 Count) - Packaging May Vary
  - Fire Kids Edition Tablet, 7 Display, Wi-Fi, 16 GB, Green Kid-Proof Case
----------------------------------------------------------------------------------------------------
ELECTRONICS & MEDIA:
  - Amazon Kindle Paperwhite - eBook reader - 4 GB - 6 monochrome Paperwhite - touchscreen - Wi-Fi - black,,,
  - Kindle Voyage E-reader, 6 High-Resolution Display (300 ppi) with Adaptive Built-in Light, PagePress Sensors, Wi-Fi - Includes Special Offers
  - Kindle Voyage E-reader, 6 High-Resolution Display (300 ppi) with Adaptive Built-in Light, PagePress Sensors, Wi-Fi - Includes Special Offers,
----------------------------------------------------------------------------------------------------
HOME & SMART DEVICES:
  - Fire Kids Edition Tablet, 7 Display, Wi-Fi, 16 GB, Green Kid-Proof Case
  - All-New Fire HD 8 Tablet, 8 

### Extract n reviews for each product (positive, neutral and negative)

In [56]:
# create a function that return a review for a given product in the desired category and sentiment
def get_product_reviews(data, name, category, sentiment, n=1):
    # subset the data
    subset = data[(data['name'] == product_name) & (data['clusterCategories'] == category_name) & (data['reviews.sentiment'] == sentiment)]
    # get random reviews
    reviews = []
    while len(reviews) < n:
        new_review = subset['review'].sample().values[0]
        # remove line breaks (forgotten in preprocessing)
        new_review = new_review.replace('\n', ' ')
        if new_review not in reviews:
            reviews.append(new_review)
    return reviews

category_name = 'Accessories & Adapters'
n_reviews = 3
category_products = []

for product_name in top_products[category_name]:
    product_dict = {}
    product_dict['name'] = product_name
    product_dict['positive_reviews'] = get_product_reviews(data, product_name, category_name, 'positive', n_reviews)
    product_dict['neutral_reviews'] = get_product_reviews(data, product_name, category_name, 'neutral', n_reviews)
    product_dict['negative_reviews'] = get_product_reviews(data, product_name, category_name, 'negative', n_reviews)
    category_products.append(product_dict)

print(SEP)
print('CATEGORY:', category_name)
print(SEP)
for product in category_products:
    print(f"PRODUCT: {product['name']}")
    print()
    print("POSITIVE REVIEWS:")
    for review in product['positive_reviews']:
        print(f"  - {review}")
    print()
    print("NEUTRAL REVIEWS:")
    for review in product['neutral_reviews']:
        print(f"  - {review}")
    print()
    print("NEGATIVE REVIEWS:")
    for review in product['negative_reviews']:
        print(f"  - {review}")
    print(SEP)

----------------------------------------------------------------------------------------------------
CATEGORY: Accessories & Adapters
----------------------------------------------------------------------------------------------------
PRODUCT: AmazonBasics AAA Performance Alkaline Batteries (36 Count)

POSITIVE REVIEWS:
  - Five Stars do a good iob
  - Four Stars They work as expected.
  - same as above batteries, great price and long lasting and appreciate quality and great price

NEUTRAL REVIEWS:
  - Of the four used so far, two already lost ... Of the four used so far, two already lost charge, one so much as giving a negative read on a voltmeter.
  - Three Stars I don't feel Amazon brand batteries are as powerful as other brands.
  - AmazonBasics AA Performance Alkaline Batteries The juice in these batteries aren't long term.

NEGATIVE REVIEWS:
  - The battery burned me!! I opened up this package fresh and put two AA batteries in my Xbox one controller. It only lasted for a few hour

Perfect, now we have everything we need to input into our model.

### Run inference on the model to extract information from the reviews and store them into a json file