## Scoring System
The "ground truth" product descriptions from the H&M dataset are fairly short. We only want to make sure that the generated descriptions describe the product accurately and do not hallucinate incorrect qualities. Including more details about the product is actually good for the recommendation system later on, but more traditional metrics (cosine similarity, BERTScore) would penalize these details. So instead, we will use an LLM as a judge to reward accuracy, penalize omissions, and heavily penalize contradictions, without penalizing the inclusion of additional details.
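
To make the concern about overlap-style metrics concrete, here is a toy sketch. It is only a stand-in: it computes a bag-of-words precision/recall/F1, not cosine similarity or BERTScore, and `token_overlap` is a hypothetical helper that is not used elsewhere in this notebook. It shows that a description containing every ground-truth detail plus some extra detail keeps perfect recall but loses precision, and therefore F1.

```python
import re

def token_overlap(reference: str, candidate: str):
    """Bag-of-words precision, recall and F1 of candidate against reference."""
    ref = set(re.findall(r"[a-z]+", reference.lower()))
    cand = set(re.findall(r"[a-z]+", candidate.lower()))
    shared = ref & cand
    precision = len(shared) / len(cand) if cand else 0.0
    recall = len(shared) / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if shared else 0.0
    return precision, recall, f1

ground_truth = "solid dark blue fitted top in soft stretch jersey with a wide neckline and long sleeves"
# Same description plus harmless extra detail
generated = ground_truth + ". The overall style is casual, and is suitable for everyday wear."

print(token_overlap(ground_truth, ground_truth))  # identical text
print(token_overlap(ground_truth, generated))     # extra detail lowers precision and F1
```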

**Base Score:** Start with 3 points if the generated description is entirely accurate to the "ground truth" (additional details are acceptable as long as they don’t contradict the original).

**Missing Details:** Subtract 1 point if key detail(s) from the "ground truth" are missing in the generated description.

**Contradictory Details:** Subtract 2 points if key detail(s) in the generated description contradict the "ground truth."
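
The three rules above amount to a small piece of arithmetic. The sketch below spells it out so the judge's scores can be sanity-checked by hand; `rubric_score` is a hypothetical helper, and it assumes each deduction is applied at most once, however many details are missing or contradicted.

```python
def rubric_score(has_missing_details: bool, has_contradictions: bool) -> int:
    """Score a generated description on the 0-3 rubric."""
    score = 3                    # base score: fully accurate
    if has_missing_details:
        score -= 1               # key detail(s) from the ground truth are absent
    if has_contradictions:
        score -= 2               # key detail(s) contradict the ground truth
    return score

# All four possible outcomes of the rubric
for missing in (False, True):
    for contradiction in (False, True):
        print(missing, contradiction, rubric_score(missing, contradiction))
```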




In [None]:
%%capture
# Install required packages
!pip install git+https://github.com/huggingface/transformers
!pip install accelerate
!pip install torch
!pip install pandas
!pip install matplotlib
!pip install numpy
!pip install scikit_learn
!pip install keras
!pip install tensorflow
!pip install huggingface_hub[hf_transfer]
!pip install qwen-vl-utils[decord]==0.0.8
!pip install bert-score
!pip install nltk

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import os
import re

from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    roc_auc_score,
    recall_score,
    f1_score,
    ConfusionMatrixDisplay,
)
from bert_score import BERTScorer

import torch
from huggingface_hub import hf_hub_download, snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor

In [None]:
os.environ['HF_HUB_ENABLE_HF_TRANSFER'] = '1'

# Define the model repository ID
repo_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"

# Download the full model snapshot (hf_transfer is enabled by the environment variable above)
file_path = snapshot_download(repo_id)

print(f"Model downloaded to: {file_path}")

Model downloaded to: /root/.cache/huggingface/hub/models--deepseek-ai--DeepSeek-R1-Distill-Llama-8B/snapshots/74fbf131a939963dd1e244389bb61ad0d0440a4d


In [None]:
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Llama-8B", torch_dtype="auto", device_map="auto"
)

processor = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Llama-8B")


## Helper Functions

In [None]:
def create_prompt(ground_truth, generated):
  return f"""You are an expert judge tasked with evaluating the accuracy of a generated product description compared to a "ground truth" description. Your goal is to score the generated description based on the following rules:

  1. **Base Score**: Start with 3 points if the generated description is entirely accurate to the "ground truth." Additional details are acceptable as long as they do not contradict the original.
  2. **Missing Details**: Subtract 1 point if key detail(s) from the "ground truth" are missing in the generated description.
  3. **Contradictory Details**: Subtract 2 points if key detail(s) in the generated description contradict the "ground truth."

  ### Instructions:
  - Carefully compare the generated description to the "ground truth."
  - Identify any missing or contradictory details.
  - Calculate the final score based on the rules above.
  - Provide a brief explanation for your scoring, highlighting missing or contradictory details.
  - Variation in wording is okay.

  ### Format:
  - **Score**: [Insert score here]
  - **Explanation**: [Insert explanation here]

  ### Example:
  - **Ground Truth Description**: A red cotton t-shirt with a round neckline and short sleeves.
  - **Generated Description**: A red cotton t-shirt with a V-neck and short sleeves.
  - **Score**: 1
  - **Explanation**: The generated description contradicts the "ground truth" by describing a V-neck instead of a round neckline (-2 points). No details are missing, so the final score is 1.

  Now, evaluate the following:
  - **Ground Truth Description**: {ground_truth}
  - **Generated Description**: {generated}"""

In [None]:
def parse_response(response):
    # Define regex patterns to extract each component
    score_pattern = r"\*\*Score\*\*:\s*(\d+)"
    explanation_pattern = r"\*\*Explanation\*\*:\s*(.*)"

    # Extract the components using regex
    score_match = re.search(score_pattern, response)
    explanation_match = re.search(explanation_pattern, response)

    # Print each match for debugging
    # print("Score Match:", score_match)
    # print("Explanation Match:", explanation_match)

    # Fail loudly if either field is missing from the response
    if not score_match or not explanation_match:
        raise ValueError("The response format is invalid or does not match the expected pattern.")

    score = int(score_match.group(1))
    explanation = explanation_match.group(1).strip()

    # Create and return the dictionary
    return {
        "score": score,
        "explanation": explanation,
    }
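
`parse_response` is strict: it takes the first `**Score**` it finds, keeps only the first line of the explanation, and raises if the format deviates at all. Below is a sketch of a more forgiving variant. It is not the parser used in the rest of this notebook, and `parse_response_lenient` is a hypothetical name. It assumes that the judge's reasoning, if delimited at all, ends with a `</think>` tag, and that the last score in the response is the final verdict.

```python
import re

SCORE_RE = re.compile(r"(?:\*\*)?Score(?:\*\*)?\s*:\s*(?:\*\*)?\s*(-?\d+)")
EXPLANATION_RE = re.compile(
    r"(?:\*\*)?Explanation(?:\*\*)?\s*:\s*(?:\*\*)?\s*(.*)", flags=re.DOTALL
)

def parse_response_lenient(response):
    """Return {'score', 'explanation'} or None if no score can be found."""
    # Assumption: anything before a closing </think> tag is reasoning, not the verdict.
    answer = response.rsplit("</think>", 1)[-1]

    # Use the LAST score so that scores mentioned mid-reasoning do not win.
    score_matches = list(SCORE_RE.finditer(answer))
    if not score_matches:
        return None  # the caller decides how to treat an unparseable response
    last = score_matches[-1]

    # Look for the explanation after that score; DOTALL keeps multi-line text.
    explanation_match = EXPLANATION_RE.search(answer[last.end():])
    explanation = explanation_match.group(1).strip() if explanation_match else ""

    return {"score": int(last.group(1)), "explanation": explanation}
```

Returning `None` instead of raising lets a batch-scoring loop record a missing value and continue.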

Sample Run 1: All correct

In [None]:
sample_ground_truth = "contrast black jacket in fast-drying functional fabric with a ribbed stand-up collar and zip down the front long raglan sleeves with contrasting colour panels down the sides and ribbed cuffs welt side pockets and a ribbed hem the polyester content of the jacket is partly recycled"
sample_generated = "This is a black track jacket with white side panels. The material appears to be a lightweight, textured fabric, possibly nylon or polyester, suitable for layering. It has a relaxed fit with a high collar and a full-length zipper. The sleeves have ribbed cuffs, and there are two patch pockets at the front. The jacket features a sporty design with clean lines and a modern aesthetic.\n\n**Style:** Sporty, Minimalist\n\n**Occasion Suitability:** Casual, Sporty\n\n**Weather Appropriateness:** Cold, All-Season (can be worn under layers for warmth)"

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": create_prompt(sample_ground_truth, sample_generated)}
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = processor(
    text=[text],
    padding=True,
    return_tensors="pt",
)

inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


['Okay, so I need to evaluate the generated product description against the ground truth. Let\'s start by reading both descriptions carefully.\n\nThe ground truth is a detailed description of a jacket: black, fast-drying functional fabric, ribbed stand-up collar, zip down front, long raglan sleeves with contrasting color panels, ribbed cuffs, welt side pockets, ribbed hem, and the fabric is partly recycled polyester. It also mentions it\'s suitable for casual and sporty occasions and appropriate for cold weather and layering.\n\nThe generated description is a bit different. It talks about a black track jacket with white side panels. The material is described as lightweight and textured, possibly nylon or polyester, good for layering. It mentions a relaxed fit with a high collar and full-length zipper. Sleeves have ribbed cuffs, and there are two patch pockets at the front. The design is sporty with clean lines and a modern look. The style is labeled as Sporty and Minimalist, occasion s

In [None]:
parse_response(output_text[0])

ValueError: The response format is invalid or does not match the expected pattern.
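
The chat-template, tokenize, generate, and decode steps in the cell above can be wrapped in a single helper. The sketch below is one way that might look and is not code from this notebook: `run_judge` is a hypothetical name, it takes the model and tokenizer as arguments (here that would be `model` and `processor`), it sends only a user message, and it moves the inputs to `model.device` instead of a hard-coded `"cuda"`.

```python
def run_judge(model, tokenizer, prompt, max_new_tokens=1024):
    """Send one prompt to the judge model and return only the newly generated text."""
    messages = [{"role": "user", "content": prompt}]
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    # Tokenize and move the tensors to wherever the model was placed
    inputs = tokenizer([text], return_tensors="pt").to(model.device)

    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Drop the prompt tokens so that only the judge's answer is decoded
    prompt_length = len(inputs["input_ids"][0])
    return tokenizer.decode(generated_ids[0][prompt_length:], skip_special_tokens=True)
```

With such a helper, a sample run would reduce to something like `run_judge(model, processor, create_prompt(sample_ground_truth, sample_generated))` followed by `parse_response`.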

Sample Run 2: All correct plus extra details

In [None]:
sample_ground_truth = "solid dark blue fitted top in soft stretch jersey with a wide neckline and long sleeves"
sample_generated = "solid dark blue fitted top in soft stretch jersey with a wide neckline and long sleeves. The overall style is casual, and is suitable for everyday wear."

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": create_prompt(sample_ground_truth, sample_generated)}
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = processor(
    text=[text],
    padding=True,
    return_tensors="pt",
)

inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


['Alright, so I need to evaluate the generated product description against the ground truth. Let me start by comparing both descriptions carefully.\n\nFirst, the ground truth is a solid dark blue fitted top made of soft stretch jersey with a wide neckline and long sleeves. The generated description is almost identical: it starts with the same details—solid dark blue fitted top, soft stretch jersey, wide neckline, long sleeves. \n\nNow, looking further, the generated description adds two more points: "The overall style is casual, and is suitable for everyday wear." These additions don\'t contradict the original description; instead, they provide additional information about the intended use or style. \n\nAccording to the rules, the base score is 3 points for complete accuracy. Since the generated description includes all the same details without any omissions, it doesn\'t lose any points for missing details or contradictions. The extra details about style and suitability are acceptable 

In [None]:
print("Sample Ground Truth Description: " + sample_ground_truth)
print("Sample Generated Description: " + sample_generated)
parse_response(output_text[0])

Sample Ground Truth Description: solid dark blue fitted top in soft stretch jersey with a wide neckline and long sleeves
Sample Generated Description: solid dark blue fitted top in soft stretch jersey with a wide neckline and long sleeves. The overall style is casual, and is suitable for everyday wear.


{'score': 3,
 'explanation': 'The generated description matches the ground truth exactly, with no missing or contradictory details. The additional information about the style and suitability does not affect the base score as it is non-contradictory.'}

Sample Run 3: Missing Detail (color)

In [None]:
sample_ground_truth = "solid dark blue fitted top in soft stretch jersey with a wide neckline and long sleeves"
sample_generated = "fitted top in soft stretch jersey with a wide neckline and long sleeves"

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": create_prompt(sample_ground_truth, sample_generated)}
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = processor(
    text=[text],
    padding=True,
    return_tensors="pt",
)

inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


['Okay, so I need to evaluate the generated product description against the ground truth. Let\'s start by reading both carefully.\n\nThe ground truth is: "solid dark blue fitted top in soft stretch jersey with a wide neckline and long sleeves." \n\nThe generated description is: "fitted top in soft stretch jersey with a wide neckline and long sleeves."\n\nFirst, I\'ll check for any key details. The ground truth mentions it\'s solid, dark blue, and describes the neckline as wide and the sleeves as long. The generated description includes "fitted top," "soft stretch jersey," "wide neckline," and "long sleeves." So, the important aspects like color, fabric, neckline, and sleeve length are present.\n\nNext, I should look for any missing details. The ground truth specifies "solid dark blue," but the generated version only mentions "fitted top in soft stretch jersey." Wait, actually, the generated does say "dark blue" in the example, but in the provided example, it\'s written as "fitted top i

In [None]:
parse_response(output_text[0])

{'score': 2,
 'explanation': 'The generated description omits the "solid dark blue" part of the ground truth, which is a key detail. No contradictory information is present.'}

Sample Run 4: Contradicting Detail


In [None]:
sample_ground_truth = "solid dark blue fitted top in soft stretch jersey with a wide neckline and long sleeves"
sample_generated = "red fitted top in soft stretch jersey with a wide neckline and long sleeves"

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": create_prompt(sample_ground_truth, sample_generated)}
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = processor(
    text=[text],
    padding=True,
    return_tensors="pt",
)

inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


['Alright, let\'s tackle this evaluation step by step. I need to compare the generated description with the ground truth and apply the given rules.\n\nFirst, the ground truth is a "solid dark blue fitted top in soft stretch jersey with a wide neckline and long sleeves." The generated description is "red fitted top in soft stretch jersey with a wide neckline and long sleeves."\n\nLooking at the details:\n\n1. **Color**: Ground truth is dark blue, while the generated is red. That\'s a key detail missing in the generated description.\n2. **Fabric Type**: Both mention "soft stretch jersey," so that\'s accurate.\n3. **Neckline**: Both have "wide neckline," which matches.\n4. **Sleeves**: Both have "long sleeves," so that\'s correct.\n\nSince the only difference is the color, which is a key detail missing, I need to apply the rules. The base score is 3. Missing details deduct 1 point, so 3 - 1 = 2.\n\nI should also note that the color discrepancy is the only issue. No contradictory details b

In [None]:
parse_response(output_text[0])

{'score': 2,
 'explanation': 'The generated description misses the key detail of the color, which is dark blue in the ground truth. The fabric type, neckline, and sleeves are accurate, but the missing color detail results in a deduction of 1 point from the base score of 3.'}

Sample Run 5: Missing Details (color) and Contradicting Details (short vs long sleeve)

In [None]:
sample_ground_truth = "solid dark blue fitted top in soft stretch jersey with a wide neckline and long sleeves"
sample_generated = "fitted top in soft stretch jersey with a wide neckline and short sleeves"

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": create_prompt(sample_ground_truth, sample_generated)}
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = processor(
    text=[text],
    padding=True,
    return_tensors="pt",
)

inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


['Okay, so I need to evaluate the accuracy of the generated product description against the ground truth. Let\'s see, the ground truth is a solid dark blue fitted top in soft stretch jersey with a wide neckline and long sleeves. The generated description is a fitted top in soft stretch jersey with a wide neckline and short sleeves.\n\nFirst, I\'ll start by comparing each detail one by one. The ground truth mentions "solid dark blue," while the generated description doesn\'t mention the color at all. That\'s a key detail missing. \n\nNext, the fabric type is "soft stretch jersey" in the ground truth, and the generated description also says "soft stretch jersey." So that\'s accurate. \n\nThen, the neckline is "wide" in both descriptions, so that\'s correct. \n\nThe sleeves in the ground truth are "long," but the generated description says "short sleeves." That\'s a contradictory detail. \n\nNow, according to the rules, the base score is 3 points if everything is accurate. Since there\'s 

In [None]:
parse_response(output_text[0])

{'score': 0,
 'explanation': 'The generated description misses the "solid dark blue" color and contradicts by having short sleeves instead of long sleeves. Missing detail (-1) and contradictory detail (-2) result in a total score of 0.'}
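
As a quick check of the judge against the rubric, the sketch below compares the score each sample run should receive under the rules with the score that was actually parsed above. Run 1 is left out because its response could not be parsed. The expected values are derived from the rubric and the run titles, not from any separate annotation.

```python
# Score implied by the rubric for each sample run
expected = {
    "run 2": 3,          # all correct plus extra details
    "run 3": 3 - 1,      # missing detail
    "run 4": 3 - 2,      # contradicting detail
    "run 5": 3 - 1 - 2,  # missing and contradicting details
}

# Scores parsed from the judge's responses in the cells above
observed = {"run 2": 3, "run 3": 2, "run 4": 2, "run 5": 0}

mismatches = {
    run: (expected[run], observed[run])
    for run in expected
    if expected[run] != observed[run]
}
print(mismatches)  # {'run 4': (1, 2)}
```

In Run 4 the judge described the red versus dark blue colour as a missing detail rather than a contradiction, which is why it deducted 1 point instead of 2.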

# 2/16 Update: Choosing final prompt to use for generating product descriptions
In the eval_on_descriptions notebook, we used 3 different prompts to generate product descriptions. Now we will score those generated descriptions with the LLM judging strategy described above in order to choose the final prompt for our product.

## Update LLM judge prompt
First, let's update the LLM judge prompt to emphasize that extra detail is expected and to describe the task more fully. The scoring wording and the example are also updated for clarity.

In [None]:
def create_prompt_v2(ground_truth, generated):
  return f"""You are an expert judge tasked with evaluating the accuracy of a generated product description compared to a "ground truth" description. The "ground truth" product descriptions are fairly short. The generated product description is expected to be longer and include more detail. We just want to make sure that the generated product description includes all details from the "ground truth" product description and doesn't include inaccurate qualities about the product. Your goal is to score the generated product description based on the following rules:

  1. **Base Score**: Start with 3 points.
  2. **Missing Details**: Subtract 1 point if key detail(s) from the "ground truth" are missing in the generated description. Missing multiple details would only subtract a total of 1 point.
  3. **Contradictory Details**: Subtract 2 points if key detail(s) in the generated description contradict the "ground truth." Multiple contradictory details would only subtract a total of 2 points.

  ### Instructions:
  - Carefully compare the generated description to the "ground truth."
  - Identify any missing or contradictory details.
  - Calculate the final score based on the rules above.
  - Provide a brief explanation for your scoring, highlighting missing or contradictory details.
  - Variation in wording is okay.

  ### Format:
  - **Score**: [Insert score here]
  - **Explanation**: [Insert explanation here]

  ### Example:
  - **Ground Truth Description**: A red cotton t-shirt with a round neckline and short sleeves.
  - **Generated Description**: A cotton t-shirt with a V-neck and short sleeves.
  - **Score**: 0
  - **Explanation**: The generated description contradicts the "ground truth" by describing a V-neck instead of a round neckline (-2 points), and is missing the detail about the red color (-1 point). There are contradictory and missing details, so the final score is 0.

  Now, evaluate the following:
  - **Ground Truth Description**: {ground_truth}
  - **Generated Description**: {generated}"""

## Evaluate descriptions generated by **prompt 1** (from eval_on_description notebook)

In [None]:
# Adds 'score' and 'explanation' columns to the input DataFrame, which must contain 'generated_description' and 'ground_truth' columns
def score_descriptions(df):
  scores = []
  explanations = []

  for index, row in df.iterrows():
    generated_description = row['generated_description']
    ground_truth_description = row['ground_truth']

    messages = [
      {"role": "user", "content": create_prompt_v2(ground_truth_description, generated_description)}
    ]

    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    inputs = processor(
        text=[text],
        padding=True,
        return_tensors="pt",
    )

    inputs = inputs.to("cuda")

    # Inference: Generation of the output
    generated_ids = model.generate(**inputs, max_new_tokens=512)
    generated_ids_trimmed = [
        out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )

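    # save output; a response that cannot be parsed is recorded as missing
    # instead of aborting the whole run
    try:
      parsed_output = parse_response(output_text[0])
      scores.append(parsed_output['score'])
      explanations.append(parsed_output['explanation'])
    except ValueError:
      scores.append(np.nan)
      explanations.append(None)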

  df['score'] = scores
  df['explanation'] = explanations

  return df

In [None]:
# load in descriptions generated from prompt1
df1 = pd.read_csv('/content/drive/MyDrive/MIDS Capstone/prompt1_descriptions.csv')
df1.head(5)

In [None]:
# score
prompt1_scored = score_descriptions(df1)
prompt1_scored.head(5)

In [None]:
# calculate average score and visualize distribution of scores
print("Average Score: " + str(prompt1_scored['score'].mean()))
prompt1_scored['score'].hist()

## Evaluate descriptions generated by **prompt 2** (from eval_on_description notebook)

In [None]:
# load in descriptions generated from prompt2
df2 = pd.read_csv('/content/drive/MyDrive/MIDS Capstone/prompt2_descriptions.csv')
df2.head(5)

In [None]:
# score
prompt2_scored = score_descriptions(df2)
prompt2_scored.head(5)

In [None]:
# calculate average score and visualize distribution of scores
print("Average Score: " + str(prompt2_scored['score'].mean()))
prompt2_scored['score'].hist()

## Evaluate descriptions generated by **prompt 3** (from eval_on_description notebook)

In [None]:
# load in descriptions generated from prompt3
df3 = pd.read_csv('/content/drive/MyDrive/MIDS Capstone/prompt3_descriptions.csv')
df3.head(5)

In [None]:
# score
prompt3_scored = score_descriptions(df3)
prompt3_scored.head(5)

In [None]:
# calculate average score and visualize distribution of scores
print("Average Score: " + str(prompt3_scored['score'].mean()))
prompt3_scored['score'].hist()
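
To choose between the three prompts, it helps to see the averages and score distributions side by side. The sketch below is one possible way to do that in plain Python. `summarize_scores` is a hypothetical helper, and the scores in the demo call are made-up placeholders, not results from this notebook.

```python
from collections import Counter

def summarize_scores(scores_by_prompt):
    """Return one summary row per prompt, highest mean score first."""
    rows = []
    for name, scores in scores_by_prompt.items():
        # Drop None and NaN (NaN is the only value not equal to itself)
        valid = [s for s in scores if s is not None and s == s]
        counts = Counter(valid)
        rows.append({
            "prompt": name,
            "n": len(valid),
            "mean": sum(valid) / len(valid) if valid else None,
            "distribution": {k: counts.get(k, 0) for k in range(4)},
        })
    # Prompts without any valid score sort last
    return sorted(
        rows,
        key=lambda r: r["mean"] if r["mean"] is not None else float("-inf"),
        reverse=True,
    )

# Placeholder scores for illustration only
demo = {"prompt 1": [3, 2, 2, 0], "prompt 2": [3, 3, 2, None]}
for row in summarize_scores(demo):
    print(row)
```

With the real data, the input could be built from the scored frames, for example `{"prompt 1": prompt1_scored["score"].tolist(), ...}`.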