# Crop Image Directory Processing and Metadata Storage

This notebook demonstrates how to:
- Recursively scan a volume directory for crop image files (JPEG format)
- Extract metadata from each image file: name, size, timestamp, and subfolder relative to the volume root
- Encode images as base64 for storage in Delta tables
- Save the processed data to Unity Catalog as a Delta table

The resulting table will serve as the foundation for downstream image processing and ML workflows.

## Import Required Libraries

Import the standard-library modules used for file system access, base64 encoding, and timestamps, plus the Spark `Row` type used to build the DataFrame.

In [0]:
import os
import base64
from datetime import datetime
from pyspark.sql import Row

## Dataset Configuration

Configure the dataset-specific parameters. Update these values to point to your own image dataset and desired output location.


In [None]:
# Dataset Configuration - Update these values for your dataset
VOLUME_ROOT = "/Volumes/autobricks/agriculture/crop_images"  # Path to your image folder
CATALOG_NAME = "autobricks"  # Your Unity Catalog name
SCHEMA_NAME = "agriculture"   # Your schema name
TABLE_NAME = "crop_images_directory"  # Output table name

## Define File Discovery Function

Create a function to recursively scan the volume directory and identify all JPEG image files.


In [None]:
# Recursively list all JPEG files under root_path (case-insensitive match on extension)
def list_jpeg_files(root_path):
    files = []
    for dirpath, _dirnames, filenames in os.walk(root_path):
        for filename in filenames:
            # str.endswith accepts a tuple, so both extensions are checked in one call
            if filename.lower().endswith((".jpg", ".jpeg")):
                files.append(os.path.join(dirpath, filename))
    return files
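
As a quick sanity check, the discovery logic can be exercised against a throwaway directory tree instead of a real volume. This is a minimal sketch: the folder layout and file names are made up for illustration, and the function is repeated so the snippet runs on its own.

```python
import os
import tempfile

def list_jpeg_files(root_path):
    """Same logic as the function above, repeated so this sketch is self-contained."""
    files = []
    for dirpath, _dirnames, filenames in os.walk(root_path):
        for filename in filenames:
            if filename.lower().endswith((".jpg", ".jpeg")):
                files.append(os.path.join(dirpath, filename))
    return files

# Hypothetical layout: two JPEGs in nested folders plus one non-image file.
with tempfile.TemporaryDirectory() as root:
    for rel in ("wheat/a.jpg", "maize/healthy/b.JPEG", "wheat/notes.txt"):
        path = os.path.join(root, rel)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(b"")  # empty placeholder; only the file name matters here
    # Store paths relative to the temp root so the result is easy to read
    found = sorted(os.path.relpath(p, root) for p in list_jpeg_files(root))

print(found)
```

The upper-case `.JPEG` file is picked up because the extension is lower-cased before matching, while `notes.txt` is skipped.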

## Scan Directory for Images

Execute the file discovery function to find all JPEG images in the volume.

In [None]:
jpeg_files = list_jpeg_files(VOLUME_ROOT)
print(f"Found {len(jpeg_files)} JPEG images.")

## Define Metadata Extraction Function

Create a function that, for a single image file, collects file metadata and the base64-encoded image bytes, and returns `None` if the file cannot be processed.

In [None]:
# Collect metadata and base64-encoded image for each file
def get_image_metadata(file_path):
    try:
        # Get file stats
        stat = os.stat(file_path)
        file_name = os.path.basename(file_path)
        # Subfolder relative to the volume root ("" for files directly in the root)
        folder = os.path.relpath(os.path.dirname(file_path), VOLUME_ROOT)
        folder = "" if folder == "." else folder
        size_bytes = stat.st_size
        # Note: on Linux, st_ctime is the last metadata change time, not the creation time
        created_at = datetime.fromtimestamp(stat.st_ctime)
        # Read the raw bytes, then encode them as an ASCII base64 string
        with open(file_path, "rb") as f:
            img_bytes = f.read()
        img_b64 = base64.b64encode(img_bytes).decode("ascii")
        return Row(
            file_path=file_path,
            file_name=file_name,
            folder=folder,
            size_bytes=size_bytes,
            created_at=created_at,
            image_base64=img_b64
        )
    except Exception as e:
        # Skip files that cannot be read; the caller filters out None results
        print(f"Error processing {file_path}: {e}")
        return None
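
Base64 encoding trades size for text safety: every 3 input bytes become 4 output characters, so the `image_base64` column is roughly a third larger than the files on disk. The sketch below illustrates that ratio and the lossless round trip. `base64_length` is a hypothetical helper, not part of the notebook, and the payload is stand-in data rather than a real image.

```python
import base64

def base64_length(n_bytes):
    """Length of standard padded base64 output for n_bytes of input."""
    return 4 * ((n_bytes + 2) // 3)

payload = bytes(range(256)) * 4  # 1024 bytes of stand-in data, not a real image
encoded = base64.b64encode(payload).decode("ascii")

# The encoded length follows the 4-characters-per-3-bytes rule
assert len(encoded) == base64_length(len(payload))
# Decoding restores the original bytes exactly
assert base64.b64decode(encoded) == payload

print(len(payload), len(encoded))
```

This overhead is worth keeping in mind when estimating table size for large image sets.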


## Process Images and Create DataFrame

Apply the metadata extraction function to all discovered images and create a Spark DataFrame with the results.

In [None]:
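# Runs on the driver, so every encoded image is held in driver memory at once
rows = [get_image_metadata(fp) for fp in jpeg_files]
rows = [r for r in rows if r is not None]
if not rows:
    # createDataFrame cannot infer a schema from an empty list
    raise ValueError(f"No readable JPEG images found under {VOLUME_ROOT}")

# Create Spark DataFrame
images_df = spark.createDataFrame(rows)
display(images_df)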
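
Schema inference is convenient, but it leaves column types to be derived from the data and cannot work on an empty list. One possible alternative is to pass an explicit schema string to `spark.createDataFrame`. The sketch below assumes the six fields produced by `get_image_metadata`; `IMAGE_FIELDS`, `IMAGE_SCHEMA`, and `build_images_df` are hypothetical names, and `spark` is the session supplied by the notebook environment.

```python
# Column names and types mirror the Row built in get_image_metadata (assumed layout).
IMAGE_FIELDS = [
    ("file_path", "string"),
    ("file_name", "string"),
    ("folder", "string"),
    ("size_bytes", "bigint"),
    ("created_at", "timestamp"),
    ("image_base64", "string"),
]

# Builds a string of the form "name: type, name: type, ..."
IMAGE_SCHEMA = ", ".join(f"{name}: {dtype}" for name, dtype in IMAGE_FIELDS)

def build_images_df(spark, rows):
    """Create the DataFrame with fixed column types instead of inferred ones."""
    return spark.createDataFrame(rows, schema=IMAGE_SCHEMA)

print(IMAGE_SCHEMA)
```

In the notebook this would be called as `images_df = build_images_df(spark, rows)`. The field order matters: it should match the order in which the `Row` fields are declared.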

## Save to Unity Catalog

Save the processed DataFrame as a Delta table in Unity Catalog for downstream processing and analysis.

In [0]:
# Save to Unity Catalog using the configured names; mode("overwrite") replaces any existing table contents
output_table = f"{CATALOG_NAME}.{SCHEMA_NAME}.{TABLE_NAME}"
images_df.write.format("delta").mode("overwrite").saveAsTable(output_table)
print(f"✅ Table saved to {output_table} in Unity Catalog.")
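
Downstream consumers will need to turn `image_base64` back into bytes before handing it to an image library. A minimal sketch of that step follows, with a cheap signature check before any further decoding. `decode_image` is a hypothetical helper, and the check relies only on the fact that JPEG data begins with the bytes `FF D8`.

```python
import base64

JPEG_SOI = b"\xff\xd8"  # Start Of Image marker that opens a JPEG stream

def decode_image(image_base64):
    """Decode a base64 string and confirm the payload starts like a JPEG."""
    data = base64.b64decode(image_base64, validate=True)
    if not data.startswith(JPEG_SOI):
        raise ValueError("decoded payload does not start with a JPEG marker")
    return data

# Stand-in payload: a JPEG marker followed by filler bytes, not a viewable image.
sample = base64.b64encode(JPEG_SOI + b"\x00" * 8).decode("ascii")
print(len(decode_image(sample)))
```

The returned bytes can then be passed to whatever image library the downstream workflow uses.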