# Data Preparation for Retail GenAI Multi-Modal System

This notebook demonstrates how to gather, clean, and prepare retail data for a multi-modal GenAI system. The overall system draws on four data sources, of which this notebook prepares the first two:

1. Product images (for computer vision)
2. Product descriptions (for text understanding)
3. Sales information (for recommendation engine)
4. Store layout data (for spatial context)

## Overview

Retail data presents unique challenges:
* High variance in product appearances
* Domain-specific terminology
* Seasonal and trend-dependent patterns
* Multi-modal nature (text, images, structured data)

This notebook provides a systematic approach to preparing this data for AI model training.

## Environment Setup

First, let's ensure we have all the required dependencies.

In [None]:
# Install dependencies if needed
!pip install -q pandas numpy torch torchvision pillow matplotlib tqdm albumentations opencv-python

In [None]:
# Import necessary libraries
import os
import sys
import numpy as np
import pandas as pd
import torch
import torchvision
import matplotlib.pyplot as plt
from PIL import Image
from pathlib import Path
from tqdm.notebook import tqdm

# For GPU detection
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
    print(f"CUDA version: {torch.version.cuda}")

## 1. Data Collection

For this demo, we'll use a combination of:
- Sample retail dataset (included in the repository)
- Programmatic generation of additional data
- Optional: Connection to public retail datasets

Let's first examine the sample data structure.

In [None]:
# Define paths to example data
REPO_ROOT = Path("..")
DATA_DIR = REPO_ROOT / "examples" / "product_data"
IMAGES_DIR = REPO_ROOT / "examples" / "images"

# Check if the example catalog exists; if not, generate a small synthetic one
if not (DATA_DIR / "product_catalog.csv").exists():
    print("Example data not found.")
    # A download step would go here in the full notebook, e.g.:
    # !python ../src/utils/download_demo_data.py
    print("For this demo, we'll create some sample data")
    
    # Create directories if they don't exist
    os.makedirs(DATA_DIR, exist_ok=True)
    os.makedirs(IMAGES_DIR, exist_ok=True)
    
    # Generate sample product catalog data
    product_data = {
        "product_id": list(range(1, 101)),
        "name": [f"Product {i}" for i in range(1, 101)],
        "category": np.random.choice(["Electronics", "Clothing", "Groceries", "Home", "Beauty"], 100),
        "price": np.random.uniform(5, 500, 100).round(2),
        "description": [f"This is a detailed description of product {i}" for i in range(1, 101)],
        "in_stock": np.random.choice([True, False], 100, p=[0.8, 0.2])
    }
    
    # Create DataFrame and save to CSV
    product_df = pd.DataFrame(product_data)
    product_df.to_csv(DATA_DIR / "product_catalog.csv", index=False)
    
    # Print sample of generated data
    print("Sample data generated successfully!")
    print(product_df.head())
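
The generation step above draws from NumPy's global random state without a seed, so the categories, prices, and stock flags change on every run. A minimal sketch of a reproducible variant follows. The helper name `make_sample_catalog` and the seed value are our own choices for illustration, not part of the repository.

```python
import numpy as np
import pandas as pd

CATEGORIES = ["Electronics", "Clothing", "Groceries", "Home", "Beauty"]

def make_sample_catalog(n_products=100, seed=42):
    """Build a synthetic catalog; the same seed always yields the same rows."""
    rng = np.random.default_rng(seed)  # local generator, leaves global state alone
    ids = np.arange(1, n_products + 1)
    return pd.DataFrame({
        "product_id": ids,
        "name": [f"Product {i}" for i in ids],
        "category": rng.choice(CATEGORIES, size=n_products),
        "price": rng.uniform(5, 500, size=n_products).round(2),
        "description": [f"This is a detailed description of product {i}" for i in ids],
        "in_stock": rng.choice([True, False], size=n_products, p=[0.8, 0.2]),
    })

catalog_a = make_sample_catalog(seed=42)
catalog_b = make_sample_catalog(seed=42)
print("Identical across runs:", catalog_a.equals(catalog_b))
```

Using a local `Generator` rather than `np.random.seed` keeps the seeding from affecting other cells that rely on the global random state.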

### 1.1 Loading Product Catalog Data

In [None]:
# Load product catalog
product_catalog = pd.read_csv(DATA_DIR / "product_catalog.csv")

# Display basic information
print(f"Total products: {len(product_catalog)}")
print("\nCategory distribution:")
print(product_catalog['category'].value_counts())

# Display a sample of products
product_catalog.head()
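
Before building on the catalog, it can help to check that it has the shape we expect. The sketch below is one possible set of checks, assuming the six columns used in the sample data. `validate_catalog` and `REQUIRED_COLUMNS` are hypothetical names introduced here, and a real catalog may need different rules.

```python
import pandas as pd

REQUIRED_COLUMNS = ["product_id", "name", "category", "price", "description", "in_stock"]

def validate_catalog(df):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems  # the remaining checks need these columns

    dup_count = int(df["product_id"].duplicated().sum())
    if dup_count:
        problems.append(f"{dup_count} duplicate product_id value(s)")

    # Treat unparseable or missing prices the same as non-positive ones
    prices = pd.to_numeric(df["price"], errors="coerce").fillna(-1)
    bad_price = int((prices <= 0).sum())
    if bad_price:
        problems.append(f"{bad_price} row(s) with missing or non-positive price")

    empty_name = int(df["name"].isna().sum())
    if empty_name:
        problems.append(f"{empty_name} row(s) with no name")
    return problems

demo_catalog = pd.DataFrame({
    "product_id": [1, 2, 2],
    "name": ["Product 1", "Product 2", None],
    "category": ["Home", "Beauty", "Home"],
    "price": [19.99, -5.0, 12.5],
    "description": ["a", "b", "c"],
    "in_stock": [True, False, True],
})
for problem in validate_catalog(demo_catalog):
    print(problem)
```

In the notebook, the same function could be called on `product_catalog` right after loading.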

### 1.2 Creating Image Processing Pipeline

For a retail AI system, we need to process product images for both:
- Training our computer vision models
- Creating multi-modal embeddings

Let's build a pipeline to handle these tasks.

In [None]:
import cv2
import albumentations as A
from albumentations.pytorch import ToTensorV2

# Define image transformation pipeline
def get_transforms(mode="train"):
    if mode == "train":
        return A.Compose([
            # Note: newer albumentations releases take size=(224, 224) here instead of height/width
            A.RandomResizedCrop(height=224, width=224, scale=(0.8, 1.0)),
            A.HorizontalFlip(p=0.5),
            A.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1, p=0.5),
            A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
            ToTensorV2(),
        ])
    else:  # validation/test transforms
        return A.Compose([
            A.Resize(height=256, width=256),
            A.CenterCrop(height=224, width=224),
            A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
            ToTensorV2(),
        ])

# Function to load and transform images, then move the batch to the GPU when available.
# Note: the albumentations transforms themselves run on the CPU.
def process_image_batch(image_paths, transforms, device=None):
    if device is None:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    
    processed_images = []
    for img_path in tqdm(image_paths, desc="Processing images"):
        # Read image; cv2.imread returns None for missing or unreadable files
        img = cv2.imread(str(img_path))
        if img is None:
            print(f"Skipping unreadable image: {img_path}")
            continue
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        
        # Apply transformations
        transformed = transforms(image=img)
        processed_images.append(transformed["image"])
    
    # Stack into a batch and move it to the device in a single transfer
    if processed_images:
        return torch.stack(processed_images).to(device)
    return None
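
To make the validation branch of `get_transforms` more concrete, here is a NumPy-only sketch of what center cropping, normalization, and channel reordering do to a single image. It assumes the input is already a 256x256 RGB `uint8` array, so the resize step is left out. The helper names `center_crop` and `normalize_to_chw` are ours, not albumentations APIs.

```python
import numpy as np

# ImageNet channel statistics, the same values passed to A.Normalize above
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def center_crop(img, size):
    """Cut the central size x size window out of an HWC image."""
    h, w = img.shape[:2]
    top = (h - size) // 2
    left = (w - size) // 2
    return img[top:top + size, left:left + size]

def normalize_to_chw(img_uint8):
    """Scale uint8 RGB to [0, 1], standardize per channel, and return CHW float32."""
    scaled = img_uint8.astype(np.float32) / 255.0
    standardized = (scaled - IMAGENET_MEAN) / IMAGENET_STD
    return np.transpose(standardized, (2, 0, 1))  # HWC -> CHW, as ToTensorV2 does

white_image = np.full((256, 256, 3), 255, dtype=np.uint8)
model_input = normalize_to_chw(center_crop(white_image, 224))
print(model_input.shape, model_input.dtype)
```

The output layout (channels first, 224x224, `float32`) is what ImageNet-pretrained vision backbones usually expect.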

### 1.3 Text Processing Pipeline

Next, we'll create a pipeline for processing product descriptions and other text data.

In [None]:
# We'll use the transformers library for text processing
!pip install -q transformers

from transformers import AutoTokenizer, AutoModel

# Load tokenizer and model for text embeddings
TEXT_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(TEXT_MODEL_NAME)
model = AutoModel.from_pretrained(TEXT_MODEL_NAME)

# Function to create text embeddings
def process_text_batch(texts, tokenizer, model, device=None, max_length=128):
    if device is None:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    
    # Tokenize texts
    encoded = tokenizer(texts, padding=True, truncation=True, max_length=max_length, return_tensors="pt")
    
    # Move to device
    input_ids = encoded["input_ids"].to(device)
    attention_mask = encoded["attention_mask"].to(device)
    
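    # Make sure the model sits on the same device as the inputs (no-op if it already does)
    model.to(device)
    
    # Get embeddings: mean-pool token vectors, ignoring padding positions
    # (the pooling described on the all-MiniLM-L6-v2 model card)
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        mask = attention_mask.unsqueeze(-1).to(outputs.last_hidden_state.dtype)
        summed = (outputs.last_hidden_state * mask).sum(dim=1)
        embeddings = summed / mask.sum(dim=1).clamp(min=1e-9)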
    
    return embeddings
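
Sentence-embedding models such as all-MiniLM-L6-v2 are usually pooled by averaging the token vectors under the attention mask, so that padding does not leak into the result. The toy example below illustrates that pooling step in NumPy with made-up numbers. `masked_mean_pool` is an illustrative helper, not a transformers API.

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    token_embeddings: (batch, seq_len, hidden)
    attention_mask:   (batch, seq_len) with 1 for real tokens and 0 for padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

tokens = np.array([
    [[1.0, 1.0], [3.0, 5.0], [100.0, 100.0]],  # third position is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],      # no padding
])
mask = np.array([[1, 1, 0],
                 [1, 1, 1]])
pooled = masked_mean_pool(tokens, mask)
print(pooled)
```

The padded vector `[100, 100]` has no effect on the first row's average, which is the point of weighting by the mask.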

## 2. Data Preparation and Preprocessing

Now that we have our basic pipelines, let's preprocess our dataset for the multi-modal fusion.

In [None]:
# This section would contain:
# 1. Data cleaning (handling missing values, duplicates, etc.)
# 2. Feature engineering (creating new features from existing data)
# 3. Data splitting (training, validation, test sets)
# 4. Creating combined representations

# For brevity, we'll outline the key steps with sample code

# 1. Data cleaning
print("Checking for missing values...")
print(product_catalog.isnull().sum())

# Fill missing descriptions with a default message
product_catalog['description'] = product_catalog['description'].fillna('No description available')

# 2. Feature engineering - Create rich text representation
product_catalog['full_text'] = (
    'Product: ' + product_catalog['name'] + '. ' +
    'Category: ' + product_catalog['category'] + '. ' +
    'Price: $' + product_catalog['price'].map('{:.2f}'.format) + '. ' +
    'Description: ' + product_catalog['description'] + '. ' +
    'In stock: ' + product_catalog['in_stock'].map({True: 'Yes', False: 'No'})
)

# Display a sample of the enriched text
print("\nSample of enriched product text:")
print(product_catalog['full_text'].iloc[0])
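
The outline above lists data splitting, which the cell does not implement. One way to do it is a per-category split, so that each category keeps roughly the same share in train, validation, and test. The sketch below assumes an 80/10/10 split and a fixed seed. The function name `stratified_split` and those defaults are our own choices.

```python
import pandas as pd

def stratified_split(df, label_col, fractions=(0.8, 0.1, 0.1), seed=42):
    """Shuffle rows within each label and cut them into train/val/test parts."""
    train_parts, val_parts, test_parts = [], [], []
    for _, group in df.groupby(label_col):
        shuffled = group.sample(frac=1.0, random_state=seed)
        n_train = int(round(len(shuffled) * fractions[0]))
        n_val = int(round(len(shuffled) * fractions[1]))
        train_parts.append(shuffled.iloc[:n_train])
        val_parts.append(shuffled.iloc[n_train:n_train + n_val])
        test_parts.append(shuffled.iloc[n_train + n_val:])  # whatever remains
    return pd.concat(train_parts), pd.concat(val_parts), pd.concat(test_parts)

split_demo = pd.DataFrame({
    "product_id": range(1, 41),
    "category": ["Home", "Beauty"] * 20,
})
train_df, val_df, test_df = stratified_split(split_demo, "category")
print(len(train_df), len(val_df), len(test_df))
```

On the real data this would be called as `stratified_split(product_catalog, "category")`. Very small categories may end up with empty validation or test parts, so it is worth checking the counts afterwards.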

### 2.1 Preparing Image-Text Pairs

For multi-modal training, we need to pair product images with their textual descriptions.

In [None]:
# Pair each product with its image, assuming filenames like product_{id}.jpg

# Create a mapping of product_id to image path
image_mapping = {}
for img_path in IMAGES_DIR.glob("*.jpg"):
    parts = img_path.stem.split('_')
    # Skip files that do not follow the product_{id} naming pattern
    if len(parts) != 2 or not parts[1].isdigit():
        continue
    image_mapping[int(parts[1])] = img_path

# Add image paths to the product catalog (NaN where no image was found)
product_catalog['image_path'] = product_catalog['product_id'].map(image_mapping)

# Filter to only products with images
products_with_images = product_catalog.dropna(subset=['image_path'])
print(f"Products with images: {len(products_with_images)} out of {len(product_catalog)}")

## 3. Data Export and Integration

Finally, we'll export our processed data for use in the model building notebook.

In [None]:
# Create processed data directory
PROCESSED_DIR = REPO_ROOT / "data" / "processed"
os.makedirs(PROCESSED_DIR, exist_ok=True)

# Export the processed catalog
product_catalog.to_csv(PROCESSED_DIR / "processed_product_catalog.csv", index=False)

print(f"Processed data saved to: {PROCESSED_DIR}")
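
Since the next notebook reads this file back, a quick round-trip check can catch export problems early. The sketch below writes a small frame to a temporary directory and reloads it. `export_and_verify` is a hypothetical helper, and the checks shown (row count and column names) are only a starting point.

```python
import tempfile
from pathlib import Path

import pandas as pd

def export_and_verify(df, out_path):
    """Write df to CSV, read it back, and confirm rows and columns survived."""
    df.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)
    if len(reloaded) != len(df):
        raise ValueError("row count changed during export")
    if list(reloaded.columns) != list(df.columns):
        raise ValueError("columns changed during export")
    return reloaded

export_demo = pd.DataFrame({
    "product_id": [1, 2],
    "full_text": ["Product: A.", "Product: B."],
    "in_stock": [True, False],
})
with tempfile.TemporaryDirectory() as tmp_dir:
    reloaded_demo = export_and_verify(export_demo, Path(tmp_dir) / "processed_product_catalog.csv")
print(reloaded_demo.dtypes)
```

CSV does not store dtypes, so columns are re-inferred on load. Printing `dtypes` after the round trip shows what the downstream notebook will actually see.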

## 4. NVIDIA GPU Acceleration Benchmarks

Let's measure how much an NVIDIA GPU speeds up the text-embedding step of our preprocessing pipeline compared with the CPU.

In [None]:
import time

def benchmark_processing(num_samples=100):
    # Generate dummy text data
    dummy_texts = [
        f"This is a sample product description for product {i}." for i in range(num_samples)
    ]
    
    # Load model for benchmarking
    model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
    
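    # Benchmark on CPU
    cpu_device = torch.device("cpu")
    model.to(cpu_device)
    
    # Warm-up run so one-time setup costs are excluded from the CPU timing too
    _ = process_text_batch(dummy_texts[:10], tokenizer, model, device=cpu_device)
    
    start_time = time.perf_counter()
    _ = process_text_batch(dummy_texts, tokenizer, model, device=cpu_device)
    cpu_time = time.perf_counter() - start_time
    
    # Benchmark on GPU if available
    if torch.cuda.is_available():
        gpu_device = torch.device("cuda")
        model.to(gpu_device)
        
        # Warm-up run
        _ = process_text_batch(dummy_texts[:10], tokenizer, model, device=gpu_device)
        
        # CUDA kernels launch asynchronously, so synchronize before reading the clock
        torch.cuda.synchronize()
        start_time = time.perf_counter()
        _ = process_text_batch(dummy_texts, tokenizer, model, device=gpu_device)
        torch.cuda.synchronize()
        gpu_time = time.perf_counter() - start_time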
        
        print(f"CPU time: {cpu_time:.2f} seconds")
        print(f"GPU time: {gpu_time:.2f} seconds")
        print(f"Speedup: {cpu_time/gpu_time:.1f}x")
    else:
        print(f"CPU time: {cpu_time:.2f} seconds")
        print("GPU not available for comparison")

# Run benchmark
print("Benchmarking text processing performance...")
benchmark_processing()
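
A single timed run can be noisy, so it is common to repeat the measurement and report the median. The helper below is a small sketch of that idea using only the standard library. `time_call` and its parameters are hypothetical. The optional `sync` argument is meant for something like `torch.cuda.synchronize` when timing GPU work.

```python
import statistics
import time

def time_call(fn, repeats=5, warmup=1, sync=None):
    """Return the median wall-clock seconds of fn() over `repeats` timed runs.

    sync: optional zero-argument callable invoked before each clock read,
          so that asynchronous work has finished when the time is taken.
    """
    for _ in range(warmup):
        fn()  # untimed runs to absorb one-time setup costs
    timings = []
    for _ in range(repeats):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

median_seconds = time_call(lambda: sum(i * i for i in range(100_000)))
print(f"Median time: {median_seconds:.4f} s")
```

In the benchmark above, `fn` would wrap the call to `process_text_batch` for each device.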

## 5. Summary and Next Steps

In this notebook, we've:
1. Set up the environment for retail data processing
2. Created data processing pipelines for images and text
3. Prepared and exported the processed data
4. Benchmarked text embedding on CPU versus GPU

In the next notebook, we'll build multi-modal models using this processed data to create a unified retail AI system.