%md
# Batch CLIP Embedding for Crop Images with Delta Tables

This notebook demonstrates how to:
- Load a registered CLIP model from Unity Catalog for batch processing
- Create Spark UDFs to generate embeddings at scale
- Process crop image data from Delta tables with CLIP embeddings
- Save the enriched data back to Delta tables for downstream ML workflows

The resulting table will contain both the original image metadata and high-dimensional CLIP embeddings, enabling semantic search and similarity analysis across the crop image dataset.
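
As a rough illustration of what "similarity analysis" over embeddings can look like downstream, the sketch below ranks a few vectors by cosine similarity to a query vector. It is not part of this notebook's pipeline: the helper names `cosine_similarity` and `top_k_similar` are hypothetical, and the 3-dimensional vectors are made-up stand-ins for real CLIP embeddings, which have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)

def top_k_similar(query, candidates, k=3):
    """Return up to k (name, score) pairs, highest similarity first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy vectors standing in for real CLIP embeddings (made-up data)
toy_embeddings = {
    "wheat_01": [1.0, 0.0, 0.0],
    "wheat_02": [0.9, 0.1, 0.0],
    "maize_01": [0.0, 1.0, 0.0],
}
ranked = top_k_similar([1.0, 0.0, 0.0], toy_embeddings, k=2)
```

For a full table, a dedicated vector index would likely be a better fit than a pairwise scan like this one.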

In [0]:
# Import the libraries needed for MLflow model loading and Spark UDF creation
import mlflow
from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, FloatType

In [0]:
# Configuration: Edit these values for your environment
CATALOG_NAME = "autobricks"  # Unity Catalog name
SCHEMA_NAME = "agriculture"   # Schema name
MODEL_UC_NAME = f"{CATALOG_NAME}.{SCHEMA_NAME}.clip_image_embedding"  # Registered CLIP model (Unity Catalog)
MODEL_VERSION = 1  # Model version to use
SOURCE_TABLE_NAME = "crop_images_directory"  # Source table name
OUTPUT_TABLE_NAME = "crop_images_directory_embeddings"  # Output table name
INPUT_COLUMN = "image_base64"  # Column containing base64-encoded images
INPUT_TYPE = "image"  # Input type for the model

# Construct full table names
SOURCE_TABLE = f"{CATALOG_NAME}.{SCHEMA_NAME}.{SOURCE_TABLE_NAME}"
OUTPUT_TABLE = f"{CATALOG_NAME}.{SCHEMA_NAME}.{OUTPUT_TABLE_NAME}"

%md
## Step 2: Load Model, Define Embedding UDF, and Sample Embedding Generation
This step loads the configured CLIP model, wraps it in a Spark UDF that generates embeddings, and applies that UDF to 10 sample rows to verify the output before the full table is processed. Every cell relies only on the configuration variables defined above.
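
The UDF in this step calls the model once per row. One possible refinement, sketched below in plain Python, is to group rows into batches so the model is called once per batch while `None` inputs still map to `None` outputs. This is only a sketch under assumptions: `iter_batches`, `embed_in_batches`, and `counting_predict` are hypothetical names, `counting_predict` is a stand-in for `model.predict`, and whether the registered model accepts larger batches has not been verified here.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(predict_fn, images, batch_size=32):
    """Call predict_fn once per batch; None inputs produce None outputs."""
    results = [None] * len(images)
    # Positions of the rows that actually hold an image
    valid = [i for i, img in enumerate(images) if img is not None]
    for idx_batch in iter_batches(valid, batch_size):
        outputs = predict_fn([images[i] for i in idx_batch])
        for i, vec in zip(idx_batch, outputs):
            results[i] = [float(x) for x in vec]
    return results

# Hypothetical stand-in for model.predict: one 2-element vector per input string
calls = []
def counting_predict(batch):
    calls.append(len(batch))  # record the batch size of each call
    return [[len(s), 0] for s in batch]

embeddings = embed_in_batches(
    counting_predict, ["aa", None, "bbb", "c", "dddd"], batch_size=2
)
```

In Spark, the same idea is usually expressed with a vectorized (pandas) UDF, which hands the function a whole batch of rows at a time.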

In [0]:
# Load the registered CLIP model from Unity Catalog using version number
model = mlflow.pyfunc.load_model(f"models:/{MODEL_UC_NAME}/{MODEL_VERSION}")

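def clip_image_embedding_udf(image_base64):
    """Return the CLIP embedding for one base64-encoded image, or None on failure."""
    if image_base64 is None:
        return None
    try:
        # Model expects a list of base64 strings
        result = model.predict([image_base64])
        # Assumption: predict() returns one embedding per input (a list or a NumPy array)
        if result is None or len(result) == 0:
            return None
        # Convert to plain Python floats so the value matches ArrayType(FloatType())
        return [float(x) for x in result[0]]
    except Exception as e:
        # Errors are printed to the executor logs and surface as null embeddings
        print(f"Embedding error: {e}")
        return None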

embedding_udf = udf(clip_image_embedding_udf, ArrayType(FloatType()))


In [0]:
# Sample 10 records from the source table for quick validation
sample_df = spark.table(SOURCE_TABLE).limit(10)
sample_embedded_df = sample_df.withColumn("embeddings", embedding_udf(col(INPUT_COLUMN)))

# Display a sample of the output for verification
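# Use `c` as the loop variable so it does not shadow the imported `col` function
other_columns = [c for c in sample_df.columns if c not in (INPUT_COLUMN, "embeddings")][:3]
columns_to_show = [INPUT_COLUMN, "embeddings"] + other_columns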
display(sample_embedded_df.select(*columns_to_show))


In [0]:
# Process the full dataset and save the results.
# Note: mode("overwrite") replaces any existing contents of OUTPUT_TABLE.
df = spark.table(SOURCE_TABLE)
results_df = df.withColumn("embeddings", embedding_udf(col(INPUT_COLUMN)))
results_df.write.format("delta").mode("overwrite").saveAsTable(OUTPUT_TABLE)
print(f"✅ Enriched data with embeddings saved to {OUTPUT_TABLE}")
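
%md
Because the UDF returns `None` whenever embedding fails, failed rows are written to the output table without raising an error. A small sanity check on a sample of the results can make such failures visible. The sketch below is plain Python over a list of embeddings; `summarize_embeddings` is a hypothetical helper and the sample values are made up.

```python
from collections import Counter

def summarize_embeddings(embeddings):
    """Count missing embeddings and tally the vector lengths that are present."""
    missing = sum(1 for e in embeddings if e is None)
    dims = Counter(len(e) for e in embeddings if e is not None)
    return {"total": len(embeddings), "missing": missing, "dimensions": dict(dims)}

# Made-up sample: one failed row (None) and one vector with an unexpected length
sample = [[0.1, 0.2, 0.3], None, [0.4, 0.5, 0.6], [0.7, 0.8]]
summary = summarize_embeddings(sample)
```

In the notebook, the input list could be built from a small collected sample, for example the `embeddings` values of the 10-row `sample_embedded_df`. More than one distinct length in `dimensions`, or a high `missing` count, would suggest inspecting the executor logs for `Embedding error` messages.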