<a href="https://colab.research.google.com/github/anshupandey/Generative-AI-opensource/blob/main/Mini_Project3_Solution_Retail_Product_Description_Generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Industry Background

## Company Name: TechRetail Solutions

**TechRetail Solutions** is a leading e-commerce platform that connects vendors with customers worldwide. The company offers a wide range of products across various categories, including electronics, fashion, home appliances, and more. TechRetail Solutions is known for its user-friendly interface, extensive product catalog, and commitment to providing a seamless shopping experience.

# Problem Statement

One of the critical challenges faced by TechRetail Solutions is ensuring that vendors provide high-quality, clear, and accurate product descriptions. Many vendors struggle to write descriptions that meet the platform's standards and format requirements. This inconsistency leads to a suboptimal user experience, as poorly written descriptions can confuse customers, reduce trust, and ultimately impact sales.

The lack of standardized, engaging, and informative product descriptions also hampers the platform's ability to showcase products effectively. To enhance user experience and improve sales, TechRetail Solutions needs a solution that can help vendors generate high-quality product descriptions effortlessly.

# Solution Approach

To address this challenge, TechRetail Solutions aims to leverage Large Language Models (LLMs) to automate the generation of product descriptions. By using advanced LLMs, the company can ensure that product descriptions are not only accurate and informative but also engaging and consistent with the platform's standards.

The solution involves evaluating different LLMs to identify the most suitable model for generating product descriptions. The evaluation will be based on key metrics such as **BLEU**, **ROUGE**, human evaluation, latency, and resource usage. By selecting the best-performing LLM, TechRetail Solutions can provide vendors with a tool that generates high-quality product descriptions, enhancing the overall user experience and boosting sales.

# Dataset Overview

The dataset for evaluating LLMs consists of a diverse set of products across various categories. Each product scenario includes the product name, a brief description, and two high-quality reference descriptions. This dataset will be used to assess the performance of different LLMs based on their ability to generate product descriptions that closely match the reference descriptions.

## Example Dataset:

**Product 1: Wireless Earbuds**

- **Brief Description:** High-fidelity wireless earbuds with noise-canceling technology and long battery life.

- **Reference Description 1:** "Experience the ultimate in wireless freedom with our high-fidelity earbuds. Featuring noise-canceling technology and up to 20 hours of battery life, these earbuds are perfect for music lovers on the go."

- **Reference Description 2:** "Our wireless earbuds offer superior sound quality and comfort. With easy touch controls and a sleek design, enjoy your favorite tunes anytime, anywhere."


Dataset Link: https://github.com/anshupandey/Working_with_Large_Language_models/blob/main/retail_product_description_dataset.json
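
The structure of one record can be pictured as follows. This is a minimal sketch, assuming the two reference descriptions are stored in a single string joined by `|` (which is how the evaluation code later in this notebook splits them). The field names mirror the columns shown when the data is loaded; `split_references` is a hypothetical helper, not part of the dataset or any library.

```python
# Hypothetical record mirroring the dataset columns. The "|" separator is an
# assumption based on how the evaluation code splits the references.
record = {
    "name": "Wireless Earbuds",
    "brief_description": (
        "High-fidelity wireless earbuds with noise-canceling technology "
        "and long battery life."
    ),
    "reference_descriptions": (
        "Experience the ultimate in wireless freedom with our high-fidelity earbuds. "
        "Featuring noise-canceling technology and up to 20 hours of battery life, "
        "these earbuds are perfect for music lovers on the go."
        "|"
        "Our wireless earbuds offer superior sound quality and comfort. "
        "With easy touch controls and a sleek design, enjoy your favorite tunes "
        "anytime, anywhere."
    ),
}


def split_references(raw, sep="|"):
    """Split a separator-joined string into a list of cleaned references."""
    return [part.strip() for part in raw.split(sep) if part.strip()]


references = split_references(record["reference_descriptions"])
print(len(references))
```

Keeping both references available makes it possible to score a generated description against each of them rather than only the first.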

## Solution: Evaluating LLMs for Retail Product Description Generation

### Objective
Select the best LLM for generating high-quality, engaging, and accurate product descriptions.

### Metrics
- **BLEU**: Aim for a BLEU score above 0.3.
- **ROUGE**: Aim for a ROUGE-L score above 0.5.
- **Perplexity**: Average score less than 20.
- **Latency**: Less than 2 seconds per description.
- **Model Size and Resource Usage**: Fit within available computational resources.
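
The numeric targets above can be encoded as a small lookup table and checked mechanically. The sketch below is illustrative only: the dictionary layout, the `latency_s` key, and the helper name `meets_targets` are assumptions, and the example scores are made up rather than measured. The resource-usage criterion is left out because it has no numeric threshold.

```python
# Targets copied from the metric list; "min" means the score must exceed the
# threshold, "max" means it must stay below it.
TARGETS = {
    "bleu": ("min", 0.3),
    "rougeL": ("min", 0.5),
    "perplexity": ("max", 20.0),
    "latency_s": ("max", 2.0),
}


def meets_targets(scores):
    """Return {metric: bool} for every target metric present in `scores`."""
    verdict = {}
    for metric, (kind, threshold) in TARGETS.items():
        if metric not in scores:
            continue  # metric was not measured for this model
        value = scores[metric]
        verdict[metric] = value > threshold if kind == "min" else value < threshold
    return verdict


# Made-up scores for illustration only.
example_scores = {"bleu": 0.35, "rougeL": 0.42, "perplexity": 12.0, "latency_s": 1.4}
print(meets_targets(example_scores))
```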

### Steps

### 1. Define Evaluation Criteria
- BLEU score
- ROUGE score
- Perplexity Score
- Latency
- Model Size and Resource Usage

### 2. Benchmarking
- Select candidate LLMs. (This notebook evaluates Llama 3 8B Chat, Gemma 2B IT, and Phi-2.)
- Load the benchmark dataset.


### 3. Evaluate LLMs
1. **Generate Descriptions**
   - Use each LLM to generate descriptions.
2. **Calculate Metrics**
   - BLEU Score
   - ROUGE Score
   - Perplexity Score
3. **Analyze Results**
   - Compare metrics.
   - Identify the best-performing model.
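
The "Analyze Results" step can be sketched as a simple ranking. This example assumes ROUGE-L is the primary criterion and BLEU the tie-breaker, which is one possible choice rather than a rule stated in this notebook. `rank_models` is a hypothetical helper and the scores are placeholders, not measurements.

```python
def rank_models(results, primary="rougeL", tie_break="bleu"):
    """Return model names sorted from best to worst.

    `results` maps a model name to a dict of metric values where higher is better.
    """
    return sorted(
        results,
        key=lambda model: (results[model][primary], results[model][tie_break]),
        reverse=True,
    )


# Placeholder scores for illustration only.
placeholder_results = {
    "model_a": {"bleu": 0.10, "rougeL": 0.40},
    "model_b": {"bleu": 0.20, "rougeL": 0.40},
    "model_c": {"bleu": 0.05, "rougeL": 0.30},
}
ranking = rank_models(placeholder_results)
print(ranking)
```

Metrics where lower is better, such as latency, would need their sign flipped before being used as a sort key.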



## Environment Setup

In [1]:
!pip install together rouge-score --quiet

In [3]:
import os
from getpass import getpass

# Prompt for the key at runtime instead of hard-coding a secret in the notebook.
os.environ["TOGETHER_API_KEY"] = getpass("Together API key: ")

## Load Data

In [4]:
url = "https://raw.githubusercontent.com/anshupandey/Working_with_Large_Language_models/main/retail_product_description_dataset.json"

In [5]:
import pandas as pd
# load data
df = pd.read_json(url)
df.shape

(10, 3)

In [6]:
df.head()

| | name | brief_description | reference_descriptions |
|---|---|---|---|
| 0 | Wireless Earbuds | High-fidelity wireless earbuds with noise-canc... | Experience the ultimate in wireless freedom wi... |
| 1 | Smartwatch | A stylish smartwatch with fitness tracking and... | Stay connected and track your fitness goals wi... |
| 2 | Electric Kettle | A 1.7-liter electric kettle with rapid boil te... | Boil water quickly and safely with our 1.7-lit... |
| 3 | Gaming Laptop | A high-performance gaming laptop with a powerf... | Unleash your gaming potential with our high-pe... |
| 4 | Yoga Mat | A non-slip, eco-friendly yoga mat with cushion... | Enhance your yoga practice with our eco-friend... |


## Setup Prediction Functions

In [7]:
def get_prompt(name, brief_desc):
  # Build the instruction prompt that is sent to every candidate model.
  prompt = f"""
  For the given product name and brief description, generate a 2-line product description.
  Do not add any information that is not present in the information provided.
  Write the description so that it is easy to read and interpret, and add a call to action at the end.
  Product Name: {name}
  Brief Description: {brief_desc}
  """
  return prompt
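
Because the f-string in `get_prompt` sits inside an indented function body, every line of the prompt carries leading spaces. A possible variant that strips that indentation is sketched below with the standard-library `textwrap.dedent`. `build_prompt` is a hypothetical name and the wording is only one way to phrase the instructions, not the prompt used to produce the results in this notebook.

```python
import textwrap


def build_prompt(name, brief_desc):
    """Return an instruction prompt without leading indentation."""
    # The template is dedented first and filled in afterwards, so the
    # product fields cannot interfere with the indentation detection.
    template = textwrap.dedent("""\
        For the given product name and brief description, generate a 2-line product description.
        Do not add any information that is not present in the information provided.
        Write the description so that it is easy to read and interpret, and add a call to action at the end.
        Product Name: {name}
        Brief Description: {brief_desc}""")
    return template.format(name=name, brief_desc=brief_desc)


prompt = build_prompt(
    "Wireless Earbuds",
    "High-fidelity wireless earbuds with noise-canceling technology and long battery life.",
)
print(prompt)
```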

In [8]:
from together import Together
client = Together()

In [9]:
def get_prediction_llama(prompt, client=client):
  # Llama 3 8B Chat, served through the Together API
  response = client.chat.completions.create(
    model="meta-llama/Llama-3-8b-chat-hf",
    messages=[{"role": "user", "content": prompt}],
  )
  return response.choices[0].message.content


# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-generation", model="google/gemma-2b-it", device=0) # remove device=0 if not using a GPU

def get_prediction_gemma(prompt):
  # Gemma 2B IT runs locally. With chat-style input the pipeline returns the
  # whole conversation, so the model's reply is the last message.
  messages = [{"role": "user", "content": prompt}]
  response = pipe(messages, max_length=2000)
  return response[0]['generated_text'][-1]['content']


def get_prediction_phi2(prompt, client=client):
  # Phi-2, served through the Together API
  response = client.chat.completions.create(
    model="microsoft/phi-2",
    messages=[{"role": "user", "content": prompt}],
  )
  return response.choices[0].message.content

`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.



In [10]:
import math
import nltk
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu

In [11]:
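def calculate_perplexity(predicted_sentence, reference_sentence):
    # Unigram-overlap proxy, not a language-model perplexity: each reference
    # word is scored against word frequencies in the predicted sentence.
    pred_words = predicted_sentence.split()
    ref_words = reference_sentence.split()
    log_prob_sum = 0
    for word in ref_words:
        if word in pred_words:
            # log(1 / p), with p = relative frequency of the word in the prediction
            log_prob_sum += math.log(len(pred_words) / pred_words.count(word))
        else:
            # Unseen word: log(1 / len) is negative, so the final value can
            # drop below 1, which a true perplexity never does.
            log_prob_sum += math.log(1 / len(pred_words))
    return math.exp(log_prob_sum / len(ref_words))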


def calculate_bleu(predicted_sentence, reference_sentence):
    return sentence_bleu([reference_sentence.split()], predicted_sentence.split())

# ROUGE Score Calculation
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

def calculate_rouge(predicted_sentence, reference_sentence):
    scores = scorer.score(reference_sentence, predicted_sentence)
    return scores
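
For comparison, a perplexity computed as the exponential of an average negative log-probability is always at least 1. The sketch below is one sign-consistent alternative to the word-overlap proxy in `calculate_perplexity`, assuming an add-one-smoothed unigram model estimated from the predicted sentence. The function name `unigram_perplexity` and the smoothing choice are assumptions for illustration; this is still not the perplexity of the LLM itself, which would require token log-probabilities from the model.

```python
import math
from collections import Counter


def unigram_perplexity(predicted_sentence, reference_sentence):
    """Perplexity of the reference under a unigram model of the prediction.

    Add-one smoothing gives unseen reference words a small, non-zero
    probability, so every term is -log(p) >= 0 and the result is >= 1.
    """
    pred_counts = Counter(predicted_sentence.split())
    ref_words = reference_sentence.split()
    if not ref_words:
        raise ValueError("reference_sentence must contain at least one word")

    vocab = set(pred_counts) | set(ref_words)
    total = sum(pred_counts.values()) + len(vocab)  # add-one denominator

    neg_log_prob = 0.0
    for word in ref_words:
        prob = (pred_counts[word] + 1) / total
        neg_log_prob -= math.log(prob)
    return math.exp(neg_log_prob / len(ref_words))


print(unigram_perplexity("a b", "a b"))
print(unigram_perplexity("a b", "c d"))
```

Lower values indicate that the prediction's vocabulary covers the reference better; values from this sketch are not comparable with the ones reported by `calculate_perplexity`.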

In [12]:
import time

def get_comparison_report(df):
  print("Evaluating LLMs")
  predictions = {"phi2":[],"llama":[],"gemma":[]}
  reference = []

  for i in range(len(df)):
    row = df.iloc[i]
    time.sleep(20)  # pause between products to throttle API requests
    prompt = get_prompt(row["name"], row["brief_description"])
    predictions["phi2"].append(get_prediction_phi2(prompt))
    predictions["llama"].append(get_prediction_llama(prompt))
    predictions["gemma"].append(get_prediction_gemma(prompt))
    # References are "|"-separated; only the first one is used for scoring.
    reference.append(row["reference_descriptions"].split("|")[0])


  result = {"perplexity":[],"bleu":[],"rouge1":[],"rouge2":[],"rougeL":[]}
  for model in predictions.keys():
    perplexities = [calculate_perplexity(pred, ref) for pred, ref in zip(predictions[model], reference)]
    average_perplexity = sum(perplexities) / len(perplexities)
    result["perplexity"].append(average_perplexity)

    bleus = [calculate_bleu(pred, ref) for pred, ref in zip(predictions[model], reference)]
    average_bleu = sum(bleus) / len(bleus)
    result["bleu"].append(average_bleu)

    rouge_scores = [calculate_rouge(pred, ref) for pred, ref in zip(predictions[model], reference)]

    average_rouge = {
    'rouge1': sum([score['rouge1'].fmeasure for score in rouge_scores]) / len(rouge_scores),
    'rouge2': sum([score['rouge2'].fmeasure for score in rouge_scores]) / len(rouge_scores),
    'rougeL': sum([score['rougeL'].fmeasure for score in rouge_scores]) / len(rouge_scores),
    }
    result["rouge1"].append(average_rouge['rouge1'])
    result["rouge2"].append(average_rouge['rouge2'])
    result["rougeL"].append(average_rouge['rougeL'])
  return pd.DataFrame(result,index=predictions.keys())
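
Latency is listed among the evaluation metrics but is not measured by `get_comparison_report`. A minimal way to time any of the prediction functions is sketched below. `timed_call`, `average_latency`, and `fake_predict` are hypothetical names; `fake_predict` is a stand-in so the sketch runs without API access or a GPU.

```python
import time


def timed_call(predict_fn, prompt):
    """Call predict_fn(prompt) and return (output, elapsed_seconds)."""
    start = time.perf_counter()
    output = predict_fn(prompt)
    elapsed = time.perf_counter() - start
    return output, elapsed


def average_latency(predict_fn, prompts):
    """Average wall-clock seconds per call over a list of prompts."""
    latencies = [timed_call(predict_fn, p)[1] for p in prompts]
    return sum(latencies) / len(latencies)


# Stand-in predictor for illustration; replace it with one of the
# get_prediction_* functions to time a real model.
def fake_predict(prompt):
    time.sleep(0.01)
    return prompt.upper()


output, seconds = timed_call(fake_predict, "hello")
print(output, round(seconds, 3))
```

For the hosted models the measured time includes network round trips, so it reflects end-to-end latency rather than pure inference time.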


In [13]:
get_comparison_report(df)

Evaluating LLMs


Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider usin

| model | perplexity | bleu | rouge1 | rouge2 | rougeL |
|---|---|---|---|---|---|
| phi2 | 0.334851 | 0.005997 | 0.043306 | 0.019309 | 0.039677 |
| llama | 0.9953 | 0.080443 | 0.469597 | 0.293813 | 0.403614 |
| gemma | 1.532803 | 0.055275 | 0.425129 | 0.216824 | 0.331243 |
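
The low BLEU values are partly explained by the warnings printed during evaluation: BLEU is a geometric mean of 1-gram to 4-gram precisions, so a single n-gram order with no overlap drives the whole score to 0. The sketch below illustrates this with a simplified, single-reference BLEU written from scratch. It is not NLTK's implementation, and the `epsilon` parameter is an assumed stand-in for smoothing that replaces a zero match count with a small constant.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def simple_bleu(reference, hypothesis, max_n=4, epsilon=0.0):
    """Simplified single-reference BLEU with optional epsilon smoothing."""
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped matches: an n-gram counts at most as often as in the reference.
        matches = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        if matches == 0:
            if epsilon == 0.0:
                return 0.0  # one empty n-gram order zeroes the geometric mean
            matches = epsilon
        log_precisions.append(math.log(matches / total))
    # Brevity penalty for hypotheses that are not longer than the reference.
    brevity = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * math.exp(sum(log_precisions) / max_n)


reference = "the quick brown fox jumps"
hypothesis = "the quick red fox jumps"
print(simple_bleu(reference, hypothesis))               # no 3-gram overlap
print(simple_bleu(reference, hypothesis, epsilon=0.1))  # smoothed
```

With NLTK itself, smoothing is enabled by passing `smoothing_function=SmoothingFunction().method1` (imported from `nltk.translate.bleu_score`) to `sentence_bleu`. Smoothed scores are not directly comparable with the unsmoothed values in the table above.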
