# Part 4 - Generator / Summarizer
In this notebook, we implement the third transformer, in charge of generating a review of the best product per category from a batch of reviews and sentiments.

## Imports

In [37]:
import os
import re
import json
import pandas as pd
from transformers import AutoModelForCausalLM
from transformers import LlamaTokenizer
import random
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# import helpers module
import helpers
import importlib
importlib.reload(helpers)

# constants and paths
DATA_PROCESSED_PKL = 'pickle/data_processed.pkl'
DATA_CLUSTERED_PKL = 'pickle/data_clustered.pkl'
DATA_SA_PKL = 'pickle/data_sentiment_analysis.pkl'
DATA_SCORED_PKL = 'pickle/data_scored.pkl'
SEP = 100 * '-'

In [2]:
# load Hugging Face token to environment
if not os.environ.get('HF_TOKEN'):
    os.environ['HF_TOKEN'] = input('Enter API token for Hugging Face: ')
else:
    print('Hugging Face token already loaded to environment')

## Load datasets

In [38]:
data = helpers.load_merge_pickled_datasets([DATA_PROCESSED_PKL, DATA_CLUSTERED_PKL, DATA_SA_PKL])

Datasets loaded from pickle files:
['pickle/data_processed.pkl', 'pickle/data_clustered.pkl', 'pickle/data_sentiment_analysis.pkl'].


In [39]:
helpers.print_random_product_sheet(data)

----------------------------------------------------------------------------------------------------
[name] Amazon Kindle Paperwhite - eBook reader - 4 GB - 6 monochrome Paperwhite - touchscreen - Wi-Fi - black,,,
----------------------------------------------------------------------------------------------------
[brand] Amazon
----------------------------------------------------------------------------------------------------
[reviews.rating] 5.0
----------------------------------------------------------------------------------------------------
[reviews.numHelpful] 0.0
----------------------------------------------------------------------------------------------------
[review] Love it!
I love my new Kindle Paperwhite! Love the lighted pages and the ability to change print sizes.
I recommend this product.
----------------------------------------------------------------------------------------------------
[clusterCategories] Electronics & Media
-----------------------------------------

## METHOD 1: Prompt Fine-Tuning

Note: Due to a lack of time, and since I got acceptable results using this method, I did not explore other methods like LoRA fine-tuning in this notebook.

Let's see what we can get using only prompt fine-tuning, that is, iterating on the prompt without updating the model weights. For this I will try [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-v0.3) in 4-bit quantization for quick inference.

Using 4-bit quantization will help reduce memory usage significantly while retaining reasonable model performance.
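
To give a rough idea of the saving, here is a back-of-the-envelope sketch of the memory needed for the weights alone at different precisions. The helper `weight_memory_gb` and the rounded count of 7 billion parameters are assumptions made for illustration; real usage is higher because of activations, the KV cache and quantization overhead.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to store the weights alone, in GB (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7e9  # assumption: rounded parameter count for a "7B" model

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions, going from 16-bit to 4-bit weights divides the weight footprint by four.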

In [5]:
!transformers-cli cache clear

2024-10-17 11:01:06.921936: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-10-17 11:01:06.922028: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-10-17 11:01:06.923698: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-10-17 11:01:06.934357: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
usage: transformers-cli <command> [<args>]
Transforme

## Load Mistral-7b 4bits

In [None]:
model_name = "mistralai/Mistral-7B-v0.3"
tokenizer = LlamaTokenizer.from_pretrained(model_name)

# load 4bit quantization model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,
    device_map="auto"
)

# test model
inputs = tokenizer("Hello my name is", return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))


In [33]:
print(type(model))
print(type(tokenizer))

<class 'transformers.models.mistral.modeling_mistral.MistralForCausalLM'>
<class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>


## Create a method to infer on the model

In [7]:
# Set pad_token to eos_token
tokenizer.pad_token = tokenizer.eos_token

def run_inference_on_model(prompt: str, max_tokens: int = 100) -> str:
    """Run inference on the model

    Args:
        prompt (str): prompt to generate text from
        max_tokens (int, optional): maximum tokens to generate. Defaults to 100.

    Returns:
        str: generated text
    """

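    # tokenize the prompt and move the tensors to the selected device
    inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True).to(device)
    
    # pass input_ids and attention_mask together; the generated tensor needs no extra .to(device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id
    )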
    
    return tokenizer.decode(outputs[0], skip_special_tokens=True)



## Zero-shot prompting
Let's work on a prompt template so the model extracts information from reviews and summarizes them.

In [8]:
prompt_zero_shot = """Summarize the following product review in one sentence, highlighting the main points.
Then, provide a list of pros and cons and a short overall conclusion.

### Review: 
{review_text}

### One-sentence summary, pros and cons and conclusion:
"""

review_text = "This laptop has a fantastic display and long battery life, but the keyboard is a bit cramped, and it tends to get hot during heavy use. Overall, it performs well for everyday tasks and is good value for the price."
prompt = prompt_zero_shot.format(review_text=review_text)

review_text = run_inference_on_model(prompt, max_tokens=100)

print(review_text)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Summarize the following product review in one sentence, highlighting the main points.
Then, provide a list of pros and cons and a short overall conclusion.

### Review: 
This laptop has a fantastic display and long battery life, but the keyboard is a bit cramped, and it tends to get hot during heavy use. Overall, it performs well for everyday tasks and is good value for the price.

### One-sentence summary, pros and cons and conclusion:

- Pros:
  - Fantastic display
  - Long battery life
  - Good value for the price
- Cons:
  - Keyboard is a bit cramped
  - Tends to get hot during heavy use
- Conclusion:
  - Performs well for everyday tasks


The result is not exactly what I need: the one-sentence summary is missing. Let's try to include one or two examples in the prompt.

## Few-shot prompting

In [9]:
prompt_few_shots = """
Summarize the following product review in one sentence, highlighting the main points.
Then, provide a list of pros and cons and a short overall conclusion.

### Example 1:
# Review: 
This phone has a great camera and sleek design, but the battery life is disappointing.
# One-sentence summary: 
"Great camera and design, but poor battery life."
# Pros:
  - Great camera
  - Sleek design
# Cons:
  - Disappointing battery life
# Conclusion:
"Good for photography lovers but might need frequent charging."

### Example 2:
# Review: 
"This vacuum cleaner is powerful and easy to use, but it's noisy and a bit heavy."
# One-sentence summary:
"Effective but noisy and heavy."
# Pros:
  - Powerful
  - Easy to use
# Cons:
  - Noisy
  - Heavy
# Conclusion:
"Ideal for quick, powerful cleaning if noise isn’t an issue."

# Review: 
{review_text}
"""

review_text = "This blender is absolutely fantastic! It blends everything smoothly, whether it's ice, vegetables, or fruits. I love the different speed settings, which give a lot of flexibility for various recipes. However, it’s quite loud, especially on the higher settings, so it can be a bit disruptive in the kitchen. The jar is also a bit heavy, which makes it a little difficult to handle, especially when it's full. Cleaning is easy, but the blade is really sharp, so I need to be careful. Overall, it's a solid blender for anyone who wants power and versatility in their kitchen."
prompt = prompt_few_shots.format(review_text=review_text)

review_text = run_inference_on_model(prompt, max_tokens=100)

print(review_text)



Summarize the following product review in one sentence, highlighting the main points.
Then, provide a list of pros and cons and a short overall conclusion.

### Example 1:
# Review: 
This phone has a great camera and sleek design, but the battery life is disappointing.
# One-sentence summary: 
"Great camera and design, but poor battery life."
# Pros:
  - Great camera
  - Sleek design
# Cons:
  - Disappointing battery life
# Conclusion:
"Good for photography lovers but might need frequent charging."

### Example 2:
# Review: 
"This vacuum cleaner is powerful and easy to use, but it's noisy and a bit heavy."
# One-sentence summary:
"Effective but noisy and heavy."
# Pros:
  - Powerful
  - Easy to use
# Cons:
  - Noisy
  - Heavy
# Conclusion:
"Ideal for quick, powerful cleaning if noise isn’t an issue."

# Review: 
This blender is absolutely fantastic! It blends everything smoothly, whether it's ice, vegetables, or fruits. I love the different speed settings, which give a lot of flexibil

The result is pretty close to what we need. However, the model keeps generating text until it reaches the `max_tokens` limit; I could not get the end-of-sequence token to stop the generation.
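
One possible workaround, sketched below, is to post-process the decoded text instead of relying on the end-of-sequence token: drop the echoed prompt, then cut the continuation at the first stop marker. The function `extract_completion` and the stop markers are hypothetical and not part of the notebook's helpers; the sketch also assumes the decoded text starts with the exact prompt, which may not always hold after tokenization.

```python
def extract_completion(full_text: str, prompt: str, stop_sequences: list[str]) -> str:
    """Strip the echoed prompt and cut the continuation at the earliest stop sequence."""
    # Drop the prompt only if the decoded text really starts with it
    completion = full_text[len(prompt):] if full_text.startswith(prompt) else full_text

    # Find the earliest position at which any stop sequence appears
    cut = len(completion)
    for stop in stop_sequences:
        idx = completion.find(stop)
        if idx != -1:
            cut = min(cut, idx)

    return completion[:cut].strip()


# Illustrative input: the model answers, then starts inventing a new review
prompt = "# Review:\nGood blender.\n"
decoded = prompt + '# One-sentence summary:\n"Great."\n\n# Review:\nAnother review...'
print(extract_completion(decoded, prompt, stop_sequences=["\n# Review:", "\n### Example"]))
```

Here the stop markers are the section headers used in the few-shot prompt, so the text is cut as soon as the model begins a new example.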

## JSON output
Let's try to pack the information into a JSON object.

In [10]:
prompt_json = """
Summarize the following product review and provide the information in a structured JSON format with the following keys:
- "summary": A single-sentence summary of the review.
- "pros": A list of positive aspects.
- "cons": A list of negative aspects.
- "conclusion": An overall conclusion.

### Review:
{review_text}

### JSON output:
"""

review_text = "This blender is absolutely fantastic! It blends everything smoothly, whether it's ice, vegetables, or fruits. I love the different speed settings, which give a lot of flexibility for various recipes. However, it’s quite loud, especially on the higher settings, so it can be a bit disruptive in the kitchen. The jar is also a bit heavy, which makes it a little difficult to handle, especially when it's full. Cleaning is easy, but the blade is really sharp, so I need to be careful. Overall, it's a solid blender for anyone who wants power and versatility in their kitchen."
prompt = prompt_json.format(review_text=review_text)

review_text = run_inference_on_model(prompt, max_tokens=150)
print(review_text)



Summarize the following product review and provide the information in a structured JSON format with the following keys:
- "summary": A single-sentence summary of the review.
- "pros": A list of positive aspects.
- "cons": A list of negative aspects.
- "conclusion": An overall conclusion.

### Review:
This blender is absolutely fantastic! It blends everything smoothly, whether it's ice, vegetables, or fruits. I love the different speed settings, which give a lot of flexibility for various recipes. However, it’s quite loud, especially on the higher settings, so it can be a bit disruptive in the kitchen. The jar is also a bit heavy, which makes it a little difficult to handle, especially when it's full. Cleaning is easy, but the blade is really sharp, so I need to be careful. Overall, it's a solid blender for anyone who wants power and versatility in their kitchen.

### JSON output:
```json
{
  "summary": "This blender is absolutely fantastic! It blends everything smoothly, whether it's 

In [11]:
def extract_json_dict(result: str) -> dict:
    """Extract JSON content from the model output and convert it to a Python dictionary.

    Args:
        result (str): The model output containing the JSON content.

    Returns:
        dict: A Python dictionary containing the extracted JSON content.

    Note:
        A json.JSONDecodeError is caught and printed; an empty dict is returned in that case.
    """
    # let's extract the json part and store it in a dict
    json_content = re.search(r'json\s*(\{.*?\})\s*', result, re.DOTALL)

    review_dict = {}

    if json_content:
        extracted_json = json_content.group(1) 
        try: 
            # Convert JSON string to Python dictionary 
            review_dict = json.loads(extracted_json)
        except json.JSONDecodeError as e: 
            print(f"Error decoding JSON: {e}")
    else: 
        print("No JSON content found.")

    return review_dict

review_dict = extract_json_dict(review_text)

print(SEP)
for k, v in review_dict.items():
    print(f"{k.upper()}: {v}")

----------------------------------------------------------------------------------------------------
SUMMARY: This blender is absolutely fantastic! It blends everything smoothly, whether it's ice, vegetables, or fruits.
PROS: ['different speed settings', 'flexibility for various recipes']
CONS: ['loud', 'jar is heavy', 'blade is really sharp']
CONCLUSION: It's a solid blender for anyone who wants power and versatility in their kitchen.


This is working great. Outputting JSON keeps the result clear and concise, and avoids mixed-up information or repetition.
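
The regex used in `extract_json_dict` stops at the first closing brace, so it would cut a JSON object that contains nested objects. As a hedged alternative, the sketch below uses the standard library's `json.JSONDecoder.raw_decode`, which parses one complete JSON value starting at a given position. The function name `extract_first_json_object` and its `start_marker` parameter are assumptions made for illustration.

```python
import json


def extract_first_json_object(text: str, start_marker: str = "### JSON output:") -> dict:
    """Return the first JSON object found after the last start_marker, or {} if none parses."""
    # Only look after the marker, so braces in the echoed prompt are ignored
    pos = text.rfind(start_marker)
    search_from = pos + len(start_marker) if pos != -1 else 0

    decoder = json.JSONDecoder()
    idx = text.find("{", search_from)
    while idx != -1:
        try:
            # raw_decode parses one JSON value starting at idx and ignores what follows it
            obj, _ = decoder.raw_decode(text, idx)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # not a complete object here, try the next opening brace
        idx = text.find("{", idx + 1)
    return {}


sample = '### JSON output:\n{"summary": "ok", "details": {"pros": ["fast"], "cons": []}}\n\n\n'
print(extract_first_json_object(sample))
```

A truncated object (for example when `max_tokens` is reached mid-output) still yields an empty dict, so the caller has to handle that case.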


## Multiple review summarization

Let's try to input a list of reviews into the model and see if it can output a JSON summary.

In [12]:
prompt_multiple_reviews = """
Summarize the following reviews of the same product in a structured JSON format, combining insights from all three reviews into a single summary. Use the following keys:
- "summary": A single-sentence summary reflecting the general or mix of sentiment and key points from all reviews.
- "pros": A list of positive aspects listed in the reviews.
- "cons": A list of negative aspects listed in the reviews.
- "conclusion": An overall conclusion based on the reviews, pros and cons.
Generate a single JSON output that combines the information from all three reviews.
"""

# let's try with a positive, a neutral and a negative review
reviews_list = [
    "I absolutely love this blender! It blends everything so smoothly, from ice to veggies and fruits. The different speed settings are super useful, and it's easy to clean. The motor is powerful, and it doesn't struggle with anything I throw at it. Definitely worth the price if you're serious about smoothies or cooking!",
    "The blender is decent overall. It does its job well, though it can be a bit noisy at higher speeds. The jar is also a bit heavy, which makes it slightly awkward to handle when full. But it's sturdy, and I appreciate the variety of settings. Not a bad choice, but nothing extraordinary.",
    "Unfortunately, this blender has been a disappointment. It struggles to blend frozen items properly, and it makes a lot of noise. After a few uses, the blade already seems dull. It's hard to clean, especially around the blade area, and I feel it's overpriced for the quality you get. I wouldn’t recommend it."
]

# add reviews to the prompt
for i, review in enumerate(reviews_list):
    prompt_multiple_reviews += f"\n### Review {i+1}:\n{review}\n"

# add JSON output format
prompt_multiple_reviews += "\n### JSON output:\n"

review_text = run_inference_on_model(prompt_multiple_reviews, max_tokens=100)
print(review_text)

review_dict = extract_json_dict(review_text)
print(SEP)
for k, v in review_dict.items():
    print(f"{k.upper()}: {v}")



Summarize the following reviews of the same product in a structured JSON format, combining insights from all three reviews into a single summary. Use the following keys:
- "summary": A single-sentence summary reflecting the general or mix of sentiment and key points from all reviews.
- "pros": A list of positive aspects listed in the reviews.
- "cons": A list of negative aspects listed in the reviews.
- "conclusion": An overall conclusion based on the reviews, pros and cons.
Generate a single JSON output that combines the information from all three reviews.

### Review 1:
I absolutely love this blender! It blends everything so smoothly, from ice to veggies and fruits. The different speed settings are super useful, and it's easy to clean. The motor is powerful, and it doesn't struggle with anything I throw at it. Definitely worth the price if you're serious about smoothies or cooking!

### Review 2:
The blender is decent overall. It does its job well, though it can be a bit noisy at hi

We notice the main issue here: because the reviews express contradictory sentiments and do not agree on the same topics, the output is a bit erratic and the model struggles to summarize them.

To address this problem, as well as the limited context window when a large amount of review text is used as input, we could split the task into two different prompts using the same model:
- Prompt 1 will ask the model to summarize a review into a few words carrying most of the review's ideas.
- Prompt 2 will take all summarized reviews of a product and output the JSON dict we asked for earlier.
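
The two-prompt split can be sketched as a small pipeline. The function `summarize_product` and the two stand-in callables below are hypothetical; in the notebook, the real callables would wrap the model calls for Prompt 1 and Prompt 2.

```python
from typing import Callable


def summarize_product(
    reviews: list[str],
    summarize_one: Callable[[str], str],
    combine: Callable[[list[str]], dict],
) -> dict:
    """Two-stage summarization: shorten each review, then merge the short summaries."""
    # Stage 1: one call per review; skip reviews for which no summary was produced
    summaries = [s for s in (summarize_one(r) for r in reviews) if s]
    # Stage 2: a single call on the much shorter list of summaries
    return combine(summaries)


# Stand-in callables so the sketch runs without a model (purely illustrative)
def fake_summarize_one(review: str) -> str:
    return review.split(".")[0].strip()


def fake_combine(summaries: list[str]) -> dict:
    return {"summary": " / ".join(summaries), "n_reviews": len(summaries)}


result = summarize_product(
    ["Love it. Blends well.", "", "Too noisy. Hard to clean."],
    fake_summarize_one,
    fake_combine,
)
print(result)
```

Passing the two stages as callables keeps the pipeline independent of the prompts, so each prompt can be tuned separately.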

## Prompt 1: Summarizing a review
Let's start with a simple prompt that summarizes a single review.

In [13]:
prompt_review_summary = """
Summarize the following product review information. Return an empty string if you are unable to generate a summary.

Review 1:
"This laptop exceeded my expectations. The battery lasts all day, and it’s super lightweight, which makes it perfect for traveling. The display is clear and bright, and the performance is fast even with multiple programs running. My only complaint is that the keyboard feels a bit shallow. Overall, I’m really happy with my purchase!"
Summary 1:
->"Lightweight and fast laptop with a great display and long-lasting battery, but the keyboard is a bit shallow."

Review 2:
"The headphones are okay but not as good as I expected. The sound quality is decent, and they’re comfortable to wear for a while. However, they don’t block out background noise as well as I hoped. For the price, I think there are better options out there."
Summary 2:
->"Decent sound quality and comfortable, but poor noise isolation and better options exist for the price."

Review 3:
{review_text}
Summary 3:
"""


In [14]:
def generate_review_summary(prompt: str) -> str:
    """Run inference on the model to generate a review summary based on the provided review.

    Args:
        prompt (str): prompt containing instruction and review

    Returns:
        str: review summary extracted from the model's answer using a regex pattern
    """
    
    # run inference
    result = run_inference_on_model(prompt, max_tokens=50)
    
    # Match all summaries marked by '->"' and capture their content
    match = re.findall(r'->"(.*?)"', result, re.DOTALL)
    if match and len(match) >= 3:
        # Return the third summary with leading/trailing spaces removed
        return match[2].strip()
    else:
        print("No review summary found.")
        return ""  # Return empty string if no match is found

review_text = "This blender is absolutely fantastic! It blends everything smoothly, whether it's ice, vegetables, or fruits. I love the different speed settings, which give a lot of flexibility for various recipes. However, it’s quite loud, especially on the higher settings, so it can be a bit disruptive in the kitchen. The jar is also a bit heavy, which makes it a little difficult to handle, especially when it's full. Cleaning is easy, but the blade is really sharp, so I need to be careful. Overall, it's a solid blender for anyone who wants power and versatility in their kitchen."
prompt = prompt_review_summary.format(review_text=review_text)

review_summary = generate_review_summary(prompt)
print(SEP)
print(review_summary)
print(SEP)

----------------------------------------------------------------------------------------------------
Powerful blender with different speed settings, but loud and heavy jar, sharp blade, and difficult to handle.
----------------------------------------------------------------------------------------------------


Perfect. Now let's try it on our dataset.

In [15]:
for _ in range(5):
    # get random row and extract review
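    row_idx = random.randrange(len(data))  # randint(0, len(data)) could return len(data), which is out of bounds
    review_text = data['review'].iloc[row_idx]  # positional access, independent of the index labels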
    print("REVIEW:")
    print(review_text)
    print()
    
    # generate summary
    prompt = prompt_review_summary.format(review_text=review_text)
    review_summary = generate_review_summary(prompt)
    print("SUMMARY:")
    print(review_summary)

    print(SEP)

REVIEW:
Great tablet for first time user
This was a gift for a senior citizen that is not a fan of computers and technology. After just a few minutes of instruction she picked right up on how much fun she could have with her down loaded games and she also learned how to reach out on the internet for information on subjects that interest her. She is very happy with her gift!
I recommend this product.

SUMMARY:
Great tablet for first time user
----------------------------------------------------------------------------------------------------
REVIEW:
Amazon kid version
Tablet is great, battery satisfactory. Case works great, within 10 minutes of opening the tablet, my grandson dropped it. No damage!!Easy to use. Highly recommend.
I recommend this product.

SUMMARY:
Great tablet, satisfactory battery, case works great, no damage after 10 minutes of opening the tablet, easy to use, highly recommend.
-------------------------------------------------------------------------------------------

The results are pretty good. However, the model tends to hallucinate on some summaries and infer information that is not necessarily true or stated in the original review. Example:

REVIEW:  
- Great tablet. I like how cheap this is. It's great for browsing the Internet.

SUMMARY:  
- Great for browsing the Internet, but cheap and not suitable for other tasks.
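
A crude way to spot such cases is to list the summary words that never appear in the review. This is only a lexical heuristic, sketched here with a hypothetical helper `unsupported_words`: it also flags harmless paraphrases and connectors, so it can at best select summaries for manual inspection.

```python
import re


def unsupported_words(review: str, summary: str) -> list[str]:
    """Return the summary words that never occur in the review (lowercased, sorted)."""
    def tokenize(text: str) -> set[str]:
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    return sorted(tokenize(summary) - tokenize(review))


review = "Great tablet. I like how cheap this is. It's great for browsing the Internet."
summary = "Great for browsing the Internet, but cheap and not suitable for other tasks."
print(unsupported_words(review, summary))
```

For the example above, words such as "suitable" and "tasks" are reported because the review never mentions them.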

## Prompt 2: Extracting information from multiple summarized reviews

### Summarize recurring ideas
Let's ask the model to extract recurring ideas from all reviews

In [16]:
# let's try to extract recurring general ideas from reviews
# Analyze the following list of product reviews. Identify the 3 most frequently mentioned ideas or themes across all reviews and output them in a list.

prompt_reviews_recurring_ideas = """
Analyze the following list of product reviews. Provide a concise summary of the three most frequently mentioned ideas or themes. Ensure each point reflects common feedback without repeating phrasing.

### Reviews:
[Review 1]: "The camera quality is amazing and takes very clear photos, even in low light."
[Review 2]: "Battery life could be better; it drains faster than my previous phone."
[Review 3]: "I’m impressed with the screen clarity and brightness, but the battery doesn’t last very long."
[Review 4]: "The photos are sharp and vibrant, especially in daylight."
[Review 5]: "Battery life is a bit disappointing, but the display is excellent and great for watching videos."

### Recurring Ideas:
1. The camera quality is highly praised, especially for clear and vibrant photos in various lighting.
2. The display quality is appreciated for its clarity and brightness, making it ideal for media.
3. Battery life is a common complaint, with users noting that it drains quickly.

### Reviews:
[Review 1]: "The sound quality on these headphones is outstanding, with deep bass and clear highs."
[Review 2]: "They’re comfortable to wear for long hours, but I wish they blocked out noise better."
[Review 3]: "Amazing audio clarity, but they don’t fully cancel out background sounds."
[Review 4]: "Great sound quality for the price, but I can still hear some outside noise."
[Review 5]: "Comfortable fit and good audio, but not ideal for noisy environments."

### Recurring Ideas:
1. Sound quality is praised for its clarity and depth, especially given the price.
2. Comfort is a major positive, with users finding them suitable for extended wear.
3. Noise cancellation is lacking, with multiple reviews mentioning that they allow outside noise.

### Reviews:
{review_text}

### Recurring Ideas:
"""

reviews = [
    "The vacuum cleaner is lightweight and easy to carry around the house.",
    "I love how portable it is, but the battery life doesn’t last as long as expected.",
    "It’s very light, making it easy to maneuver, but I wish it had a longer battery life.",
    "Great suction power and easy to use, but it needs frequent recharging.",
    "Lightweight design makes it easy to handle, but it doesn’t hold charge for long."
]

reviews_text = ""
for i, review in enumerate(reviews):
    reviews_text += f"[Review {i+1}]: \"{review}\"\n"

prompt = prompt_reviews_recurring_ideas.format(review_text=reviews_text)

review_text = run_inference_on_model(prompt, max_tokens=100)

print(review_text)


Analyze the following list of product reviews. Provide a concise summary of the three most frequently mentioned ideas or themes. Ensure each point reflects common feedback without repeating phrasing.

### Reviews:
[Review 1]: "The camera quality is amazing and takes very clear photos, even in low light."
[Review 2]: "Battery life could be better; it drains faster than my previous phone."
[Review 3]: "I’m impressed with the screen clarity and brightness, but the battery doesn’t last very long."
[Review 4]: "The photos are sharp and vibrant, especially in daylight."
[Review 5]: "Battery life is a bit disappointing, but the display is excellent and great for watching videos."

### Recurring Ideas:
1. The camera quality is highly praised, especially for clear and vibrant photos in various lighting.
2. The display quality is appreciated for its clarity and brightness, making it ideal for media.
3. Battery life is a common complaint, with users noting that it drains quickly.

### Reviews:
[

### Summarize recurring ideas using a JSON output

In [17]:
# let's try to extract recurring general ideas from reviews
# Analyze the following list of product reviews. Identify the 3 most frequently mentioned ideas or themes across all reviews and output them in a list.

prompt_reviews_recurring_ideas_json = """
Analyze the following list of product reviews. Provide a concise summary of the three most frequently mentioned ideas or themes. Ensure each point reflects common feedback without repeating phrasing.
Store the recurring ideas in a structured JSON format.

### Reviews:
[Review 1]: "The camera quality is amazing and takes very clear photos, even in low light."
[Review 2]: "Battery life could be better; it drains faster than my previous phone."
[Review 3]: "I’m impressed with the screen clarity and brightness, but the battery doesn’t last very long."
[Review 4]: "The photos are sharp and vibrant, especially in daylight."
[Review 5]: "Battery life is a bit disappointing, but the display is excellent and great for watching videos."

### Recurring Ideas:
{{
"1": "The camera quality is highly praised, especially for clear and vibrant photos in various lighting."
"2": "The display quality is appreciated for its clarity and brightness, making it ideal for media."
"3": "Battery life is a common complaint, with users noting that it drains quickly."
}}


### Reviews:
[Review 1]: "The sound quality on these headphones is outstanding, with deep bass and clear highs."
[Review 2]: "They’re comfortable to wear for long hours, but I wish they blocked out noise better."
[Review 3]: "Amazing audio clarity, but they don’t fully cancel out background sounds."
[Review 4]: "Great sound quality for the price, but I can still hear some outside noise."
[Review 5]: "Comfortable fit and good audio, but not ideal for noisy environments."

### Recurring Ideas:
```json
{{
"1": "Sound quality is praised for its clarity and depth, especially given the price."
"2": "Comfort is a major positive, with users finding them suitable for extended wear."
"3": "Noise cancellation is lacking, with multiple reviews mentioning that they allow outside noise."
}}
```

### Reviews:
{review_text}

### Recurring Ideas:
"""

reviews = [
    "The vacuum cleaner is lightweight and easy to carry around the house.",
    "I love how portable it is, but the battery life doesn’t last as long as expected.",
    "It’s very light, making it easy to maneuver, but I wish it had a longer battery life.",
    "Great suction power and easy to use, but it needs frequent recharging.",
    "Lightweight design makes it easy to handle, but it doesn’t hold charge for long."
]

reviews_text = ""
for i, review in enumerate(reviews):
    reviews_text += f"[Review {i+1}]: \"{review}\"\n"

prompt = prompt_reviews_recurring_ideas_json.format(review_text=reviews_text)

review_text = run_inference_on_model(prompt, max_tokens=100)

print(review_text)


Analyze the following list of product reviews. Provide a concise summary of the three most frequently mentioned ideas or themes. Ensure each point reflects common feedback without repeating phrasing.
Store the recurring ideas in a structured JSON format.

### Reviews:
[Review 1]: "The camera quality is amazing and takes very clear photos, even in low light."
[Review 2]: "Battery life could be better; it drains faster than my previous phone."
[Review 3]: "I’m impressed with the screen clarity and brightness, but the battery doesn’t last very long."
[Review 4]: "The photos are sharp and vibrant, especially in daylight."
[Review 5]: "Battery life is a bit disappointing, but the display is excellent and great for watching videos."

### Recurring Ideas:
{
"1": "The camera quality is highly praised, especially for clear and vibrant photos in various lighting."
"2": "The display quality is appreciated for its clarity and brightness, making it ideal for media."
"3": "Battery life is a common 

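This prompt, which outputs the recurring ideas in JSON format, only works for a few reviews. With more than 4-5 reviews as input, the model keeps adding more than 3 ideas and does not respect the JSON format, omitting the closing brackets, which makes it an unreliable solution. Note that the few-shot examples in the prompt above also omit the commas between entries and are therefore not valid JSON themselves, which may contribute to the malformed output.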
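
The hand-written JSON examples in the prompt above have no commas between entries, so the model is shown invalid JSON. One way to rule this out, sketched below, is to render the examples with `json.dumps` and escape the braces for `str.format()`. The helper `json_example_block` is hypothetical, and whether valid examples actually improve the model's output has not been tested here.

```python
import json

FENCE = "`" * 3  # three backticks, built this way so the snippet is easy to embed


def json_example_block(ideas: list[str]) -> str:
    """Render ideas as a fenced, valid JSON object with braces escaped for str.format()."""
    payload = {str(i + 1): idea for i, idea in enumerate(ideas)}
    body = json.dumps(payload, indent=0, ensure_ascii=False)
    # Double the braces so that str.format() leaves them in place
    escaped = body.replace("{", "{{").replace("}", "}}")
    return f"{FENCE}json\n{escaped}\n{FENCE}"


template = (
    "### Recurring Ideas:\n"
    + json_example_block([
        "Sound quality is praised for its clarity and depth.",
        "Noise cancellation is lacking.",
    ])
    + "\n\n### Reviews:\n{review_text}\n\n### Recurring Ideas:\n"
)
print(template.format(review_text='[Review 1]: "Very light."'))
```

Every example is then rendered the same way (fenced and comma-separated), which removes one source of inconsistency between the few-shot examples.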

### Summarize reviews into pros and cons bullet points
This is an alternative method, asking the model to summarize all reviews into pros and cons

In [18]:
# let's try to extract clear pros and cons bullet points from reviews
prompt_review_pros_cons = """
Analyze the following list of product reviews. Identify the 3 most frequently mentioned positive aspects (pros) and 3 most frequently mentioned negative aspects (cons) across all reviews.
Store them in a json output.

### Reviews:
{review_text}

### JSON Output:
"""

reviews = [
    "The vacuum cleaner is lightweight and easy to carry around the house.",
    "I love how portable it is, but the battery life doesn’t last as long as expected.",
    "It’s very light, making it easy to maneuver, but I wish it had a longer battery life.",
    "Great suction power and easy to use, but it needs frequent recharging.",
    "Lightweight design makes it easy to handle, but it doesn’t hold charge for long.",
]

review_text = ""
for i, review in enumerate(reviews):
    review_text += f"[Review {i+1}]: {review}\n"

prompt = prompt_review_pros_cons.format(review_text=review_text)

review_text = run_inference_on_model(prompt, max_tokens=100)

print(review_text)


Analyze the following list of product reviews. Identify the 3 most frequently mentioned positive aspects (pros) and 3 most frequently mentioned negative aspects (cons) across all reviews.
Store them in a json output.

### Reviews:
[Review 1]: The vacuum cleaner is lightweight and easy to carry around the house.
[Review 2]: I love how portable it is, but the battery life doesn’t last as long as expected.
[Review 3]: It’s very light, making it easy to maneuver, but I wish it had a longer battery life.
[Review 4]: Great suction power and easy to use, but it needs frequent recharging.
[Review 5]: Lightweight design makes it easy to handle, but it doesn’t hold charge for long.


### JSON Output:

```json
{
  "pros": [
    "lightweight",
    "portable",
    "easy to maneuver"
  ],
  "cons": [
    "battery life",
    "frequent recharging",
    "short battery life"
  ]
}
```

This seems to work pretty well. Now let's try with more reviews in the next section.
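Since the previous prompt sometimes returned JSON with missing closing brackets, it can help to parse the output defensively before relying on it. The following is a minimal sketch, not part of the original notebook: `extract_json_block` is a hypothetical helper, and it assumes the model echoes the prompt and then answers after the `### JSON Output:` marker, optionally inside a json fence, as in the output above.

```python
import json
import re

FENCE = "`" * 3  # built this way to avoid writing a literal fence inside this example

def extract_json_block(model_output, marker="### JSON Output:"):
    """Return the dict found after `marker`, or None if it is missing or invalid."""
    # keep only the text generated after the last marker (the model echoes the prompt)
    generated = model_output.split(marker)[-1]
    # prefer the content of a fenced block when there is one
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, generated, re.DOTALL)
    candidate = fenced.group(1) if fenced else generated
    try:
        parsed = json.loads(candidate.strip())
    except json.JSONDecodeError:
        # truncated or malformed JSON, e.g. missing closing brackets
        return None
    return parsed if isinstance(parsed, dict) else None
```

Returning `None` instead of raising lets the caller decide whether to retry with fewer reviews or fall back to another prompt.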

## Extracting best models and reviews from the dataset
In notebook 5 we compute scores for all products in each category, so we can extract the top 3 or the worst product.

### Load our product scores

In [19]:
# load product score dataset
data_scored = helpers.load_pickled_dataset('pickle/data_scored.pkl')

data_scored.head()

Dataset loaded from pickle/data_scored.pkl.


Unnamed: 0,clusterCategories,name,reviews.rating
7,Accessories & Adapters,AmazonBasics AAA Performance Alkaline Batterie...,62201.0
6,Accessories & Adapters,AmazonBasics AA Performance Alkaline Batteries...,43195.0
19,Accessories & Adapters,"Fire Kids Edition Tablet, 7 Display, Wi-Fi, 16...",40862.0
18,Accessories & Adapters,"Fire Kids Edition Tablet, 7 Display, Wi-Fi, 16...",25108.0
20,Accessories & Adapters,"Fire Kids Edition Tablet, 7 Display, Wi-Fi, 16...",23945.0


In [42]:
# print unique categories
category_names = list(data_scored['clusterCategories'].unique())

for cat in category_names:
    print(cat)

<class 'list'>
Accessories & Adapters
Electronics & Media
Home & Smart Devices
Pet Supplies
Tablets & eReaders
Video & Streaming


In [21]:
# let's use one category for the following analysis
category_name = 'Tablets & eReaders'

In [50]:
# Select the top n products of a category (assumes data is already sorted by score, best first)
def best_products_per_category(data, category_name, top_n=3):
    # subset the category
    category_subset = data[data['clusterCategories'] == category_name]
    # select the top n products
    return category_subset.head(top_n)

best_products_per_category(data_scored, category_name)


Unnamed: 0,clusterCategories,name,reviews.rating
132,Tablets & eReaders,"Echo (White),,,\r\nEcho (White),,,",13557.0
122,Tablets & eReaders,"Amazon Fire Tv,,,\r\nAmazon Fire Tv,,,",11791.0
117,Tablets & eReaders,Amazon Echo Show Alexa-enabled Bluetooth Speak...,6490.0
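`best_products_per_category` takes the first rows of the category, so it relies on `data_scored` already being sorted by score. As a hedged alternative, the sketch below sorts explicitly. `top_n_per_category` is a hypothetical name, and it assumes the score is stored in the `reviews.rating` column, as the table above suggests. The toy values are made up for illustration.

```python
import pandas as pd

def top_n_per_category(scored, category_name, top_n=3, score_col="reviews.rating"):
    """Return the top_n highest-scored rows of one category, sorting explicitly."""
    category_subset = scored[scored["clusterCategories"] == category_name]
    # nlargest sorts by score_col in descending order and keeps the first top_n rows
    return category_subset.nlargest(top_n, score_col)

# toy data, deliberately unsorted
toy = pd.DataFrame({
    "clusterCategories": ["A", "A", "A", "B"],
    "name": ["p1", "p2", "p3", "p4"],
    "reviews.rating": [10.0, 30.0, 20.0, 5.0],
})
print(top_n_per_category(toy, "A", top_n=2)["name"].tolist())
```

Swapping `nlargest` for `nsmallest` would return the worst products instead.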


In [23]:
# store the top 3 products for each category into a dict
top_products = {}
for cat in category_names:
    top_products[cat] = best_products_per_category(data_scored, cat, 3)['name'].tolist()

for k, v in top_products.items():
    print(f"{k.upper()}:")
    for product in v:
        print(f"  - {product}")
    print(SEP)


ACCESSORIES & ADAPTERS:
  - AmazonBasics AAA Performance Alkaline Batteries (36 Count)
  - AmazonBasics AA Performance Alkaline Batteries (48 Count) - Packaging May Vary
  - Fire Kids Edition Tablet, 7 Display, Wi-Fi, 16 GB, Green Kid-Proof Case
----------------------------------------------------------------------------------------------------
ELECTRONICS & MEDIA:
  - Amazon Kindle Paperwhite - eBook reader - 4 GB - 6 monochrome Paperwhite - touchscreen - Wi-Fi - black,,,
  - Kindle Voyage E-reader, 6 High-Resolution Display (300 ppi) with Adaptive Built-in Light, PagePress Sensors, Wi-Fi - Includes Special Offers
  - Kindle Voyage E-reader, 6 High-Resolution Display (300 ppi) with Adaptive Built-in Light, PagePress Sensors, Wi-Fi - Includes Special Offers,
----------------------------------------------------------------------------------------------------
HOME & SMART DEVICES:
  - Fire Kids Edition Tablet, 7 Display, Wi-Fi, 16 GB, Green Kid-Proof Case
  - All-New Fire HD 8 Tablet, 8 

### Extract n reviews for each product (positive, neutral and negative)

In [24]:
# create a function that returns a random review of at least min_word_count words (None if there is none)
def get_random_review(subset, min_word_count=10):
    # filter first: resampling until a long enough review comes up never ends if there is none
    long_enough = subset['review'].str.split().str.len() >= min_word_count
    candidates = subset.loc[long_enough, 'review']
    if candidates.empty:
        return None
    return candidates.sample(random_state=random.randint(0, 2**32 - 1)).values[0]

# create a function that returns up to n distinct reviews for a given product in the desired category and sentiment
def get_product_reviews(data, name, category, sentiment, n=1):
    # subset the data
    subset = data[(data['name'] == name) & (data['clusterCategories'] == category) & (data['reviews.sentiment'] == sentiment)]
    # get random reviews, capping the attempts (arbitrary limit) so the loop ends
    # even when fewer than n distinct reviews are available
    reviews = []
    for _ in range(50 * n):
        if len(reviews) >= n:
            break
        new_review = get_random_review(subset)
        if new_review is None:
            break
        # remove line breaks (forgotten in preprocessing)
        new_review = new_review.replace('\n', ' ')
        # remove multiple dots
        new_review = re.sub(r'\.{2,}', '.', new_review)
        # remove multiple spaces
        new_review = re.sub(r'\s{2,}', ' ', new_review)
        # lowercase letters
        new_review = new_review.lower()
        # append to list if not already in
        if new_review not in reviews:
            reviews.append(new_review)
    return reviews

product_name = top_products[category_name][0]
# n_reviews = 10
category_products = []

for product_name in top_products[category_name]:
    product_dict = {}
    product_dict['name'] = product_name
    product_dict['positive_reviews'] = get_product_reviews(data, product_name, category_name, 'positive', 8)
    product_dict['neutral_reviews'] = get_product_reviews(data, product_name, category_name, 'neutral', 5)
    product_dict['negative_reviews'] = get_product_reviews(data, product_name, category_name, 'negative', 2)
    category_products.append(product_dict)

print(SEP)
print('CATEGORY:', category_name)
print(SEP)
for product in category_products:
    print(f"PRODUCT: {product['name']}")
    print()
    print("POSITIVE REVIEWS:")
    for review in product['positive_reviews']:
        print(f"  - {review}")
    print()
    print("NEUTRAL REVIEWS:")
    for review in product['neutral_reviews']:
        print(f"  - {review}")
    print()
    print("NEGATIVE REVIEWS:")
    for review in product['negative_reviews']:
        print(f"  - {review}")
    print(SEP)

----------------------------------------------------------------------------------------------------
CATEGORY: Tablets & eReaders
----------------------------------------------------------------------------------------------------
PRODUCT: Echo (White),,,
Echo (White),,,

POSITIVE REVIEWS:
  - great organizer i love alexa. she keeps a shopping list for me, schedules appointments plus plays my favorite tunes! and so easy to work with. a must in every home. i recommend this product.
  - very smart i love my echo. i use it everyday. i recommend it. i recommend this product.
  - cool purchase doesn't really answer my questions but can set timers and tell me the weather i recommend this product.
  - love the echo we have it set up in the kitchen. mainly use it to play music and give us the weather forecast. she really is amaxing at answering general questions i recommend this product.
  - great product i purchased the ecco as an addition to my home automation. works great. the wife and kids

Perfect, now we have everything we need to input into our model.
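The numbered `[Review i]: ...` block is built by hand in several cells of this notebook. A small helper could keep that format consistent. This is only a sketch: `build_review_text` is a hypothetical name, and the `shuffle` and `seed` parameters are assumptions added here for reproducibility.

```python
import random

def build_review_text(reviews, shuffle=False, seed=None):
    """Format reviews as numbered '[Review i]: text' lines, skipping empty ones."""
    reviews = [r.strip() for r in reviews if r and r.strip()]
    if shuffle:
        # local Random instance so the global random state is left untouched
        random.Random(seed).shuffle(reviews)
    return "".join(f"[Review {i}]: {r}\n" for i, r in enumerate(reviews, start=1))

print(build_review_text(["great sound", "  ", "battery drains fast"]))
```

Skipping empty entries avoids sending a blank `[Review i]:` line to the model when a summary could not be produced.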

### Try the pros and cons prompt with more reviews

In [25]:
# load pickled reviews if available
pickled_reviews_path = "pickle/reviews.pkl"

if os.path.exists(pickled_reviews_path):
    review_text = helpers.load_pickled_reviews(pickled_reviews_path)
    print(review_text)
else:
    # load reviews
    category_name = 'Tablets & eReaders'  # let's try with a category
    product_name = category_products[0]['name']  # let's try with the first product
    positive_reviews = category_products[0]['positive_reviews']  # get the positive reviews
    neutral_reviews = category_products[0]['neutral_reviews']  # get the neutral reviews
    negative_reviews = category_products[0]['negative_reviews']  # get the negative reviews
    
    # merge the reviews and shuffle them to avoid any bias
    all_reviews = positive_reviews + neutral_reviews + negative_reviews
    random.shuffle(all_reviews)

    # build our review text
    review_text = ""
    for i, review in enumerate(all_reviews):
        # summarize review
        prompt = prompt_review_summary.format(review_text=review)
        review_summary = f"[Review {i+1}]: {generate_review_summary(prompt)}"
        print(review_summary)
        review_text += review_summary + '\n'

    # pickle the review text
    helpers.pickle_list_reviews(review_text, pickled_reviews_path)

[Review 1]: Echo functionality is easy to set up, but speaker quality is not as good as Google Home. Nest thermostat and Inseon hub work well.
No review summary found.
[Review 2]: 
[Review 3]: Great addition to home automation, wife and kids love it, recommend.
[Review 4]: Great organizer with Alexa, keeps shopping list, schedules appointments, plays music, and is easy to work with. Recommended for every home.
[Review 5]: Cool purchase, but doesn't answer questions and can set timers and tell weather.
[Review 6]: Very smart and recommended product.
[Review 7]: Great device from Amazon, even better pricing than from Amazon itself. Recommended.
[Review 8]: Good device but poor operating instructions, and I don't recommend it.
[Review 9]: Not recognizing sentences, damaged product, needs improvement on the api's, don't recommend.
[Review 10]: Good functional product, great service, and price match at Best Buy.
[Review 11]: Limited options and repetitive playlist, but overall a good produc

In [26]:
# infer model using pros and cons prompt
prompt = prompt_review_pros_cons.format(review_text=review_text)
result = run_inference_on_model(prompt, max_tokens=200)
print(result)


Analyze the following list of product reviews. Identify the 3 most frequently mentioned positive aspects (pros) and 3 most frequently mentioned negative aspects (cons) across all reviews.
Store them in a json output.

### Reviews:
[Review 1]: Echo functionality is easy to set up, but speaker quality is not as good as Google Home. Nest thermostat and Inseon hub work well.
[Review 2]: 
[Review 3]: Great addition to home automation, wife and kids love it, recommend.
[Review 4]: Great organizer with Alexa, keeps shopping list, schedules appointments, plays music, and is easy to work with. Recommended for every home.
[Review 5]: Cool purchase, but doesn't answer questions and can set timers and tell weather.
[Review 6]: Very smart and recommended product.
[Review 7]: Great device from Amazon, even better pricing than from Amazon itself. Recommended.
[Review 8]: Good device but poor operating instructions, and I don't recommend it.
[Review 9]: Not recognizing sentences, damaged product, nee

In [34]:
# Infer model using the recurring ideas prompt
review_text = helpers.load_pickled_reviews(pickled_reviews_path)

prompt = prompt_reviews_recurring_ideas.format(review_text=review_text)

result = run_inference_on_model(prompt, max_tokens=150)

print(result)

Reviews loaded from pickle/reviews.pkl.
<class 'str'>

Analyze the following list of product reviews. Provide a concise summary of the three most frequently mentioned ideas or themes. Ensure each point reflects common feedback without repeating phrasing.

### Reviews:
[Review 1]: "The camera quality is amazing and takes very clear photos, even in low light."
[Review 2]: "Battery life could be better; it drains faster than my previous phone."
[Review 3]: "I’m impressed with the screen clarity and brightness, but the battery doesn’t last very long."
[Review 4]: "The photos are sharp and vibrant, especially in daylight."
[Review 5]: "Battery life is a bit disappointing, but the display is excellent and great for watching videos."

### Recurring Ideas:
1. The camera quality is highly praised, especially for clear and vibrant photos in various lighting.
2. The display quality is appreciated for its clarity and brightness, making it ideal for media.
3. Battery life is a common complaint, wit

### Conclusion
Using key ideas seems to give us better results:
- Fewer repetitions
- Fewer contradictions
- Better generalisation of ideas

The max_tokens length is very important: if it is too long, the model starts repeating itself or hallucinating.

In [28]:
# extract recurring ideas
recurrent_ideas = result.split('### Recurring Ideas:\n')[-1].split('\n')
# remove the last one in case it's incomplete
recurrent_ideas = recurrent_ideas[:-1]
# remove the leading numbers ("1. ", "10. ", ...) and skip empty lines
recurrent_ideas = [re.sub(r'^\d+\.\s*', '', idea).strip() for idea in recurrent_ideas if idea.strip()]

for idea in recurrent_ideas:
    print(idea)

Echo functionality is praised for its ease of setup and integration with other smart home devices.
Sound quality is a mixed bag, with some users finding it lacking compared to Google Home.
Noise cancellation is a recurring theme, with multiple reviews mentioning that it doesn't fully block out background sounds.
Comfort is a positive, with users finding them suitable for extended wear.
Battery life is a common complaint, with users noting that it drains quickly.
Sound quality is praised for its clarity and depth, especially given the price.
Comfort is a major positive, with users finding them suitable for extended wear.
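The list printed above contains seven entries although the prompt asks for three, and some of them nearly repeat each other. One possible post-processing step is to drop near-duplicates and cap the list. The sketch below is an assumption, not the notebook's method: `filter_ideas` is a hypothetical helper, it uses `difflib.SequenceMatcher` from the standard library to compare ideas, and the 0.8 threshold is illustrative rather than tuned.

```python
from difflib import SequenceMatcher

def filter_ideas(ideas, max_ideas=3, similarity_threshold=0.8):
    """Keep at most max_ideas ideas, skipping those too similar to one already kept."""
    kept = []
    for idea in ideas:
        idea = idea.strip()
        if not idea:
            continue
        # compare against every idea already kept, ignoring case
        is_duplicate = any(
            SequenceMatcher(None, idea.lower(), other.lower()).ratio() >= similarity_threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(idea)
        if len(kept) == max_ideas:
            break
    return kept
```

This only trims repeated ideas. It cannot detect ideas that are not supported by the reviews, so keeping `max_tokens` short remains important.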


## Prompt 3: Write the product review
In this prompt we input the recurring ideas and generate a product review

In [31]:
prompt_final_product_review = """
Write a positive product title and review from these recurring ideas, that presents the product, highlights its pros, and subtly mentions a few minor downsides reported by some users. 

### Product Name:
Experience V2 Smartphone

### Review recurring ideas:
Screen quality is highly praised, with users noting its vibrant colors and sharp resolution.
Battery life is frequently mentioned, with users appreciating its ability to last through a full day of use.
Build quality is a recurring theme, with users complimenting its sturdy and premium design.
Camera performance is noted, with users enjoying clear, high-quality photos, especially in good lighting.
Price is occasionally mentioned, with some users feeling it’s slightly high for the features offered.

### Product title and review:
Reliable smartphone with vibrant display, long-lasting Battery, and premium build quality
The product has received impressive feedback from users, especially for its vibrant and sharp screen quality. The colors are rich and bring media content to life, making it ideal for streaming and everyday use. Another standout feature is the long-lasting battery life, which comfortably supports a full day of usage without the need for frequent recharging—a major plus for users on the go.
Additionally, the premium build quality gives the device a sturdy, high-end feel, with many appreciating its sleek design and solid construction. The camera has also been a popular highlight, capturing clear and vivid photos, particularly in well-lit conditions.
Though generally well-regarded, some users noted that the price might feel slightly high relative to other models. However, with its reliable performance and standout features, this product offers substantial value and has been positively received overall.

### Product Name:
{product_name}

### Review recurring ideas:
{review_ideas}

### Product title and review:
"""

review_ideas = "\n".join(recurrent_ideas)

product_name = "Echo Bluetooth Speaker"
prompt = prompt_final_product_review.format(product_name=product_name, review_ideas=review_ideas)
review_text = run_inference_on_model(prompt, max_tokens=100)
print(review_text)



Write a positive product title and review from these recurring ideas, that presents the product, highlights its pros, and subtly mentions a few minor downsides reported by some users. 

### Product Name:
Experience V2 Smartphone

### Review recurring ideas:
Screen quality is highly praised, with users noting its vibrant colors and sharp resolution.
Battery life is frequently mentioned, with users appreciating its ability to last through a full day of use.
Build quality is a recurring theme, with users complimenting its sturdy and premium design.
Camera performance is noted, with users enjoying clear, high-quality photos, especially in good lighting.
Price is occasionally mentioned, with some users feeling it’s slightly high for the features offered.

### Product title and review:
Reliable smartphone with vibrant display, long-lasting Battery, and premium build quality
The product has received impressive feedback from users, especially for its vibrant and sharp screen quality. The colo

In [32]:
# extract the review
final_review = review_text.split('### Product title and review:\n')[-1]
# split title and review
title, review = final_review.split('\n', 1)
# remove line breaks
review = review.replace('\n', ' ')
# remove last sentence from review if incomplete
review = '.'.join(review.split('.')[:-1]) + '.'

print(title)
print()
print(review)


Echo Bluetooth Speaker: A solid choice for smart home integration and sound quality

The Echo Bluetooth Speaker has received positive feedback for its ease of setup and integration with other smart home devices. Users have found it easy to connect and control their smart home devices, making it a convenient addition to their home.
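The extraction cell above assumes the generated text contains at least one line break and one full stop. A slightly more defensive variant is sketched below. `split_title_and_review` is a hypothetical helper, and it treats `.`, `!` and `?` as sentence endings, which is a simplification.

```python
def split_title_and_review(model_output, marker="### Product title and review:"):
    """Split the generated text into (title, review), dropping a trailing incomplete sentence."""
    # keep only the text generated after the last marker (the model echoes the prompt)
    generated = model_output.split(marker)[-1].strip()
    # partition never raises, even when there is no line break
    title, _, body = generated.partition("\n")
    # collapse line breaks and repeated spaces
    body = " ".join(body.split())
    # drop the last sentence when generation stopped mid-sentence
    if body and body[-1] not in ".!?":
        last_stop = max(body.rfind(c) for c in ".!?")
        body = body[: last_stop + 1]  # empty string if there is no complete sentence
    return title.strip(), body
```

When the output contains only a title, the function returns an empty review instead of raising a `ValueError`.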


WE DID IT! YEAH MISTRAL AND YEAH ME! :)

Now let's deploy this solution to a Python script so we can run inference using our 3 prompts and generate a product review from a list of Amazon reviews.