# Synthetic Product Data Generation Using LLMs

This Jupyter Notebook leverages the capabilities of Large Language Models (LLMs) to generate synthetic product data. It is built upon the [LangChain](https://python.langchain.com/docs/get_started/introduction) framework, enabling a seamless and flexible approach to creating realistic data for e-commerce or any other domain requiring product listings.

## Key Features
- **Flexible LLM Integration**: While examples for [Ollama](https://ollama.com/) and [OpenAI](https://platform.openai.com/docs/introduction) are provided, users are encouraged to explore and integrate other LLMs as per their requirements, following the guidance available on the [LangChain Quickstart Guide](https://python.langchain.com/docs/get_started/quickstart#llm-chain).
- **Customizable Data Generation**: Users can tailor the synthetic data generation process through a set of predefined constants, allowing for the adjustment of shop type, the number of categories, vendors, and products, as well as the retry logic for product detail generation.

## Workflow
- **Vendor List Generation**: Initiates the process by creating a list of unique vendor names suitable for the specified shop type.
- **Category List Creation**: Generates a diverse range of product categories to ensure a comprehensive inventory representation.
- **Product Combination Formation**: Constructs random pairings of categories and vendors based on the desired number of products, setting the stage for detailed product information generation.
- **Synthetic Product Detail Generation**: For each product pairing, detailed information including product names, descriptions, and prices is generated.

## Interactivity and Customization
Designed within the interactive environment of a Jupyter Notebook, this tool offers users the flexibility to:

- **Monitor and Adjust**: Follow the generation process step-by-step, making real-time adjustments as needed.
- **Rerun and Refine**: Easily rerun sections to refine the outcomes or explore different configurations.
- **Manual Overrides**: Directly edit the generated data, such as vendor names or category specifics, to align with specific preferences or requirements.

This approach not only automates the creation of synthetic data but also places control in the hands of users, blending automation with customization to meet diverse needs.

In [None]:
# If running in Colab
%pip install langchain

In [None]:
# Standard library imports
from typing import Dict, List, Optional
import random

# Related third-party imports
import pandas as pd
from tqdm import tqdm

# LangChain imports
from langchain.prompts import PromptTemplate
from langchain.output_parsers import CommaSeparatedListOutputParser
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.pydantic_v1 import BaseModel, Field

## Ollama
To use Ollama, run the cell below and set `model` to the name of the model you want to use.

In [None]:
from langchain_community.llms import Ollama

llm = Ollama(model="...")

## OpenAI
To use OpenAI, run the cells below and replace the API key and model name placeholders with your own values:

In [None]:
# If running in Colab
%pip install langchain_openai

In [None]:
import os
from langchain_openai import ChatOpenAI

os.environ['OPENAI_API_KEY'] = '...'

llm = ChatOpenAI(model="...")

## Constants
Set your desired values here.

In [None]:
# Type of shop. Including "a/an". Example: "an online book store"
TYPE_OF_SHOP = "an online book store"

# Number of categories to generate
NUMBER_OF_CATEGORIES = 3

# Number of vendors to generate
NUMBER_OF_VENDORS = 3

# Number of products to generate
NUMBER_OF_PRODUCTS = 200

# Maximum number of attempts for the LLM to generate each product
MAX_ATTEMPTS = 5

## Helper Prompts
These prompts are responsible for creating the lists of vendors and categories.
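An LLM will not always return exactly the requested number of names, and it may repeat entries. The sketch below shows one way to tidy a generated list before using it. The helper `clean_generated_names` and the sample list are hypothetical and not part of this notebook; the sample stands in for what `helper_chain.invoke(...)` might return.

```python
from typing import List


def clean_generated_names(names: List[str], expected: int) -> List[str]:
    """Strip whitespace, drop empty and case-insensitive duplicate names, keep at most `expected`."""
    seen = set()
    cleaned = []
    for name in names:
        candidate = name.strip()
        key = candidate.casefold()
        if candidate and key not in seen:
            seen.add(key)
            cleaned.append(candidate)
    return cleaned[:expected]


# Hypothetical raw output standing in for a generated category list
raw_categories = [" Fiction", "Non-Fiction ", "fiction", "", "Children's Books", "Comics"]
cleaned_categories = clean_generated_names(raw_categories, expected=3)
print(cleaned_categories)
```

If the cleaned list is shorter than expected, rerunning the generation cell or editing the list by hand are both reasonable options.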

In [None]:
# Prepare the helper prompt
helper_system_prompt = f"""
As a synthetic product generator specialized for {TYPE_OF_SHOP}, your role encompasses the creation of categories and vendors that fit this specific shop's theme. 
Your responses should:

- Strictly conform to the instructions provided in both system and user prompts.
- Focus solely on generating content that is directly requested, without introducing extraneous information.
- Reflect the unique context and offerings of {TYPE_OF_SHOP}, ensuring relevance and alignment with its theme.

Remember, your objective is to generate data that is coherent, contextually appropriate, and within the bounds of the prompts. 
Precision and relevance are key.
"""

csv_parser = CommaSeparatedListOutputParser()

helper_prompt = PromptTemplate(
    template="{system_prompt}\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={
        "format_instructions": csv_parser.get_format_instructions(),
        "system_prompt": helper_system_prompt,
    },
)

helper_chain = helper_prompt | llm | csv_parser

# Prepare the categories prompt
categories_prompt = f"""
Imagine you're organizing the inventory for {TYPE_OF_SHOP}, a shop that needs a well-defined set of categories to classify its wide range of products. 
Your task is to create {NUMBER_OF_CATEGORIES} category names that reflect the diversity and specificity of products typically found in such a store. 
These categories should:

- Be reflective of common divisions seen in retail stores, ensuring they are recognizable and straightforward for customers.
- Encompass a comprehensive scope of the shop's inventory, covering all potential product types it might offer.
- Maintain simplicity and clarity, with each name immediately conveying the kind of products it includes.

Very important: List only each category name, focusing on the names alone without additional explanations. 
Do not include line breaks or characters like "()[]/\\:"
"""

categories = helper_chain.invoke({"query": categories_prompt})

# Prepare the vendors prompt
vendors_prompt = f"""
Craft {NUMBER_OF_VENDORS} unique and inventive vendor names that would be a perfect match for {TYPE_OF_SHOP}. 
Each name should:

- Evoke a sense of professionalism and alignment with the shop's range of products.
- Span a variety of styles, suggesting backgrounds from handcrafted origins to modern, technology-driven enterprises.
- Remain entirely fictional, carefully avoiding any similarities to actual brands or vendors in the market.

Focus exclusively on generating the vendor names, without additional details or descriptions.
Do not include line breaks or characters like "()[]/\\:"
"""

vendors = helper_chain.invoke({"query": vendors_prompt})

In [None]:
# Execute this line if you want to see the generated vendors
vendors

In [None]:
# Execute this line if you want to see the generated categories
categories

## Generate a list of product combinations
Constructs random pairings of categories and vendors based on the desired number of products, setting the stage for detailed product information generation.
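If reproducible pairings are wanted across reruns, one option is to draw from a seeded random generator. This is a minimal sketch under that assumption: the `build_combinations` helper, its `seed` parameter, and the sample lists are hypothetical, and the notebook itself uses the unseeded module-level `random.choice`.

```python
import random
from typing import Dict, List, Optional


def build_combinations(
    vendors: List[str], categories: List[str], n: int, seed: Optional[int] = None
) -> List[Dict[str, str]]:
    """Return n random vendor/category pairings; the same seed yields the same pairings."""
    rng = random.Random(seed)  # local generator, leaves the global random state untouched
    return [
        {"vendor": rng.choice(vendors), "category": rng.choice(categories)}
        for _ in range(n)
    ]


# Hypothetical sample data standing in for the generated lists
sample_vendors = ["Vendor A", "Vendor B"]
sample_categories = ["Fiction", "History", "Science"]

first = build_combinations(sample_vendors, sample_categories, 5, seed=42)
second = build_combinations(sample_vendors, sample_categories, 5, seed=42)
print(first == second)  # True: identical seeds give identical pairings
```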

In [None]:
# Generate a list of product combinations
product_combinations = [{
    'vendor': random.choice(vendors),
    'category': random.choice(categories)
} for _ in range(NUMBER_OF_PRODUCTS)]

## Product Prompt
The Product Prompt is designed to generate unique titles, descriptions, and prices for each product, utilizing the vendor name and category for context. To ensure diversity, it cross-references a list of previously generated titles for each vendor, avoiding duplicate titles.

To counteract potential inconsistencies, especially with smaller LLMs that might not adhere strictly to instructions, a retry mechanism is in place. The prompt is executed up to `MAX_ATTEMPTS` times per product until the response contains a title, a description, and a price, and the title is not a duplicate. If every attempt fails, the product is skipped.
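The retry idea can be sketched in isolation, without an LLM. In this illustration, `generate_with_retries` and `fake_invoke` are hypothetical names; `fake_invoke` stands in for a call to the product chain and returns an incomplete response, then a duplicate title, then a valid product.

```python
from typing import Callable, Dict, List, Optional

REQUIRED_KEYS = ("title", "description", "price")


def generate_with_retries(
    invoke: Callable[[], Dict[str, str]],
    existing_titles: List[str],
    max_attempts: int,
) -> Optional[Dict[str, str]]:
    """Call `invoke` until it returns a complete response with a new title, or give up."""
    for _ in range(max_attempts):
        try:
            response = invoke()
        except Exception:
            # Treat any failure (for example, unparsable output) as a failed attempt
            continue
        if (
            isinstance(response, dict)
            and all(key in response for key in REQUIRED_KEYS)
            and response["title"] not in existing_titles
        ):
            return response
    return None


# Hypothetical canned responses standing in for LLM output
_responses = iter([
    {"title": "Old Title"},  # incomplete: no description or price
    {"title": "Old Title", "description": "d", "price": "9.99"},  # duplicate title
    {"title": "New Title", "description": "d", "price": "12.50"},  # valid
])


def fake_invoke() -> Dict[str, str]:
    return next(_responses)


result = generate_with_retries(fake_invoke, ["Old Title"], max_attempts=5)
print(result["title"])  # New Title
```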

In [None]:
class Product(BaseModel):
    title: str = Field(description="Title of the product")
    description: str = Field(description="Description of the product")
    price: str = Field(description="Price of the product in USD")
        
product_system_prompt = f"""
As a synthetic product generator specialized for {TYPE_OF_SHOP}, your role encompasses the creation of products that fit this specific shop's theme. 
You will be given a category and a vendor name for generating the product.
Your responses should:

- Strictly conform to the instructions provided in both system and user prompts.
- Focus solely on generating content that is directly requested, without introducing extraneous information.
- Reflect the unique context and offerings of {TYPE_OF_SHOP}, ensuring relevance and alignment with its theme.

Remember, your objective is to generate data that is coherent, contextually appropriate, and within the bounds of the prompts. 
Precision and relevance are key.
"""

json_parser = JsonOutputParser(pydantic_object=Product)

product_prompt = PromptTemplate(
    template="{system_prompt}\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={
        "format_instructions": json_parser.get_format_instructions(),
        "system_prompt": product_system_prompt,
    },
)

product_chain = product_prompt | llm | json_parser

def generate_product_details(vendor, category, chain, existing_titles: List[str]):
    # Construct the prompt for generating product details
    existing_titles_formatted = ", ".join([f'"{title}"' for title in existing_titles])
    prompt = f"""
Given a vendor "{vendor}", generate a product for the {category} category. 
Avoid using these existing titles: {existing_titles_formatted}.

Include:

- A product title that captures the essence of the item. Do not include the vendor name in the title. Keep it short and concise, and ensure it is unique among these titles: {existing_titles_formatted}.
- A comprehensive description that showcases the product's key features and benefits. If it makes sense for the type of product being generated, include relevant details such as weight, size, color, or other pertinent characteristics to give a clear picture of the product. The length of the description is at your discretion, but it should be thorough enough to inform and entice potential customers.
- A price in USD, considering the product's value and market positioning.
Make sure to always generate a title, a description, and a price.

Guidelines:
- The product must always include a title, description, and price.
- Ensure the title and description are specifically tailored to the category.
- You may choose whether to incorporate the vendor name and category within the description, based on what best suits the product narrative and customer engagement strategy.
- In your answer, output only the title, description, and price.
"""
    for attempt in range(MAX_ATTEMPTS):
        try:
            # Invoke the chain (LLM) with the prompt
            response = chain.invoke({"query": prompt})
        except Exception as error:
            # The output parser may raise if the LLM output is not valid JSON; count it as a failed attempt
            print(f"Attempt {attempt + 1}: Could not parse the LLM output ({error}). Retrying...")
            continue

        # Accept the response only if it is a dict with all required keys and a new title
        if (
            isinstance(response, dict)
            and all(key in response for key in ['title', 'description', 'price'])
            and response['title'] not in existing_titles
        ):
            return {
                'vendor': vendor,
                'category': category,
                'title': response['title'],
                'description': response['description'],
                'price': response['price']
            }

        # Otherwise, print a message and try again
        print(f"Attempt {attempt + 1}: Missing one or more parts or title is duplicated. Retrying...")
        print(response)

    # If the loop exits without returning, it means all attempts failed
    print("Failed to generate complete product details after maximum attempts.")
    return None

## Generating the synthetic product data
Finally, the titles, descriptions, and prices are generated.

In [None]:
product_details_list = [] 

# Dictionary to keep track of titles generated for each vendor
vendor_titles = {}  

for product in tqdm(product_combinations, desc='Generating Product Details'):
    vendor = product['vendor']
    category = product['category']
    
    if vendor not in vendor_titles:
        vendor_titles[vendor] = []
    
    product_details = generate_product_details(vendor, category, product_chain, vendor_titles[vendor])
    if product_details:
        product_details_list.append(product_details)
        vendor_titles[vendor].append(product_details['title'])

## Pandas
The data is loaded into a pandas DataFrame.

The following cells help the user inspect the results and check that there are no duplicates.
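The duplicate check can also be run on the raw list of dictionaries before it is converted. The sketch below uses only the standard library and flags every product whose vendor and title pair occurs more than once; `find_duplicate_titles` and the sample products are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def find_duplicate_titles(products: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return every product whose (vendor, title) pair appears more than once."""
    counts = Counter((p["vendor"], p["title"]) for p in products)
    return [p for p in products if counts[(p["vendor"], p["title"])] > 1]


# Hypothetical sample standing in for product_details_list
sample_products = [
    {"vendor": "Vendor A", "title": "Night Sky Atlas"},
    {"vendor": "Vendor A", "title": "Night Sky Atlas"},
    {"vendor": "Vendor B", "title": "Night Sky Atlas"},
    {"vendor": "Vendor B", "title": "River Tales"},
]

duplicates = find_duplicate_titles(sample_products)
print(len(duplicates))  # 2: only Vendor A repeats the same title
```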

In [None]:
# Convert the list of dictionaries into a DataFrame
df = pd.DataFrame(product_details_list)

# Display the first few rows of the DataFrame to verify
print(df.head())

In [None]:
df['category'].unique()

In [None]:
df['vendor'].unique()

In [None]:
df[df.duplicated('title', keep=False)]

In [None]:
df[df.duplicated(['vendor', 'title'], keep=False)]

## Save to Excel and CSV
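Save the data to `products.xlsx` and `products.csv`. Writing `.xlsx` files with `to_excel` requires an Excel writer engine such as `openpyxl` to be installed.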

In [None]:
df.to_excel('products.xlsx')

In [None]:
df.to_csv('products.csv')