# Batch CLIP Embedding for Crop Images with Delta Tables

This notebook demonstrates how to:
- Load a registered CLIP model from Unity Catalog for batch processing
- Create Spark UDFs to generate embeddings at scale
- Process crop image data from Delta tables with CLIP embeddings
- Save the enriched data back to Delta tables for downstream ML workflows

The resulting table will contain both the original image metadata and high-dimensional CLIP embeddings, enabling semantic search and similarity analysis across the crop image dataset.

## Import Required Libraries

Import necessary libraries for MLflow model loading, Spark DataFrame processing, and UDF creation.


In [0]:
import mlflow
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, FloatType

# Databricks notebooks predefine `spark`; getOrCreate() returns that same session
spark = SparkSession.builder.getOrCreate()

## Dataset Configuration

Configure the dataset-specific parameters. Update these values to match your registered model and table names.


In [0]:
# Dataset Configuration - Update these values for your dataset
CATALOG_NAME = "autobricks"  # Your Unity Catalog name
SCHEMA_NAME = "agriculture"   # Your schema name
MODEL_NAME = "clip_embedding-135"  # Your registered CLIP model name
SOURCE_TABLE_NAME = "crop_images_directory"  # Source table from notebook 00
OUTPUT_TABLE_NAME = "crop_images_directory_embeddings"  # Output table name

# Construct full table names
FULL_MODEL_NAME = f"{CATALOG_NAME}.{SCHEMA_NAME}.{MODEL_NAME}"
SOURCE_TABLE = f"{CATALOG_NAME}.{SCHEMA_NAME}.{SOURCE_TABLE_NAME}"
OUTPUT_TABLE = f"{CATALOG_NAME}.{SCHEMA_NAME}.{OUTPUT_TABLE_NAME}"

# Processing parameters
INPUT_COLUMN = "image_base64"
INPUT_TYPE = "image"
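
The column name `image_base64` suggests that each row carries its image as a base64-encoded string. Assuming that holds (the source table's schema is not shown here), the round trip below sketches what such a value looks like. The helpers `encode_image_bytes` and `decode_image_string` and the sample bytes are hypothetical and not part of the pipeline.

```python
import base64


def encode_image_bytes(raw: bytes) -> str:
    """Encode raw image bytes as an ASCII base64 string."""
    return base64.b64encode(raw).decode("ascii")


def decode_image_string(encoded: str) -> bytes:
    """Decode a base64 string back into the original bytes."""
    return base64.b64decode(encoded)


# Stand-in for real file contents: the 8-byte PNG file signature
sample_bytes = b"\x89PNG\r\n\x1a\n"
encoded = encode_image_bytes(sample_bytes)
print(encoded)

# Decoding recovers exactly the bytes that were encoded
assert decode_image_string(encoded) == sample_bytes
```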

## Load Model and Create Embedding UDF

Load the registered CLIP model from Unity Catalog and create a Spark UDF for distributed embedding generation.


In [0]:
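# Point MLflow at the Unity Catalog registry (no effect if it is already the default)
mlflow.set_registry_uri("databricks-uc")

# Load version 1 of the registered model; change the trailing version number if needed
model_uri = f"models:/{FULL_MODEL_NAME}/1"
model = mlflow.pyfunc.load_model(model_uri)
print(f"Loaded model: {model_uri}")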

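# Create a UDF for distributed embedding generation.
# Spark serializes the function together with the `model` it references
# and ships both to the executors.
def get_embedding(input_data):
    """Return the embedding for one input value, or None if it is missing or fails."""
    if input_data is None:
        return None

    # Wrap the single value in a one-row DataFrame and select the input mode
    input_df = pd.DataFrame({"input_data": [input_data]})
    params = {"input_type": INPUT_TYPE}

    try:
        result = model.predict(input_df, params=params)
        # Convert to plain Python floats to match ArrayType(FloatType());
        # NumPy values may not serialize cleanly from a plain Python UDF
        return [float(x) for x in result[0]]
    except Exception as e:
        # This runs on the executors, so the message lands in executor logs
        print(f"Error generating embedding: {e}")
        return None

embedding_udf = udf(get_embedding, ArrayType(FloatType()))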
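
The null and error handling in `get_embedding` can be checked locally without Spark or the registered model. The sketch below swaps in a hypothetical `StubModel` whose `predict` takes a plain value (not a pandas DataFrame) and either returns a fixed vector or raises. Only the control flow mirrors the UDF; the conversion to plain floats is a precaution assumed for this sketch.

```python
from typing import List, Optional


class StubModel:
    """Hypothetical stand-in for the loaded model; this is not the MLflow API."""

    def predict(self, value: str) -> List[List[float]]:
        if value == "corrupt":
            raise ValueError("cannot decode image")
        return [[0.1, 0.2, 0.3]]


def safe_embedding(model: StubModel, value: Optional[str]) -> Optional[List[float]]:
    """Return one embedding as a list of floats, or None for missing or failing input."""
    if value is None:
        return None
    try:
        result = model.predict(value)
        return [float(x) for x in result[0]]
    except Exception as exc:
        print(f"Error generating embedding: {exc}")
        return None


stub = StubModel()
print(safe_embedding(stub, None))       # missing input is passed through as None
print(safe_embedding(stub, "corrupt"))  # the failure is logged and mapped to None
print(safe_embedding(stub, "ok"))       # normal input yields the stub vector
```

Returning `None` keeps one bad image from failing the whole job, at the cost of silent gaps, so it is worth counting null embeddings after a run.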

## Process Images and Generate Embeddings

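Load the crop images table and apply the CLIP embedding UDF to generate a vector representation for each image. Spark evaluates the new column lazily, so no embeddings are computed until the result is written or displayed.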

In [0]:
# Load source table and apply embedding generation to all rows

df = spark.table(SOURCE_TABLE)
print(f"Processing {df.count()} rows from {SOURCE_TABLE}")

# Apply the embedding UDF to the entire DataFrame
results_df = df.withColumn("embeddings", embedding_udf(col(INPUT_COLUMN)))

# Optionally display a sample of the enriched DataFrame
# display(results_df.select("file_name", "folder", "size_bytes", "embeddings"))
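
The UDF in this notebook calls the model once per row. When per-call overhead dominates, grouping inputs into batches is a common alternative; on Spark, `pandas_udf` hands a function whole batches as pandas Series. The pure-Python sketch below illustrates only the chunking idea. The names `chunked`, `embed_in_batches`, and `fake_predict` are hypothetical, and the sketch assumes the model accepts several inputs per call, which this notebook does not confirm.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(
    items: Sequence[str],
    predict: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int,
) -> List[List[float]]:
    """Call `predict` once per batch and concatenate the results in input order."""
    embeddings: List[List[float]] = []
    for batch in chunked(items, batch_size):
        embeddings.extend(predict(batch))
    return embeddings


# Records the size of every batch the stub receives
call_log: List[int] = []


def fake_predict(batch: Sequence[str]) -> List[List[float]]:
    """Stub model: one single-element vector per input, based on string length."""
    call_log.append(len(batch))
    return [[float(len(value))] for value in batch]


vectors = embed_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], fake_predict, batch_size=2)
print(vectors)
print(call_log)
```

Five inputs with a batch size of 2 result in three model calls instead of five.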

## Save Enriched Data to Delta Table

Save the processed DataFrame with embeddings to a new Delta table for downstream ML workflows and semantic search applications.


In [0]:
# Save enriched data with embeddings to Delta table
(
    results_df.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable(OUTPUT_TABLE)
)

print(f"✅ Enriched data with embeddings saved to {OUTPUT_TABLE}")

# Count rows from the saved table rather than re-evaluating the lazy results_df plan
saved_count = spark.table(OUTPUT_TABLE).count()
print(f"Table contains {saved_count} rows with CLIP embeddings")
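
For the similarity analysis mentioned at the top of this notebook, embeddings are commonly compared with cosine similarity. The sketch below uses plain Python and made-up three-dimensional vectors rather than real CLIP output. `cosine_similarity` and the candidate names are illustrative and not part of the notebook.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for stored embeddings
query = [1.0, 0.0, 0.0]
candidates = {
    "same_direction": [2.0, 0.0, 0.0],
    "orthogonal": [0.0, 3.0, 0.0],
}

# Rank candidates from most to least similar to the query
ranked = sorted(
    candidates,
    key=lambda name: cosine_similarity(query, candidates[name]),
    reverse=True,
)
print(ranked)
```

Cosine similarity ignores vector length, which is why the candidate pointing in the same direction ranks first despite its different magnitude.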