# Croissant Dataset Explorer, DenseNet Embedder, and RO-Crate Packager

This notebook demonstrates how to:
1. Load and explore a Croissant dataset's metadata.
2. Access the actual image data.
3. Use a simple DenseNet from torchvision to generate embeddings.
4. Package the results and the notebook into a research object (RO-Crate).

In [1]:
import mlcroissant as mlc
import json
import os
import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision.models as models
from torch.utils.data import Dataset, DataLoader
from PIL import Image
import pandas as pd
import numpy as np
from tqdm import tqdm
import logging
import warnings
import pathlib
import shutil
from fairscape_cli.models.rocrate import GenerateROCrate, AppendCrate
from fairscape_cli.models.software import GenerateSoftware
from fairscape_cli.models.dataset import GenerateDataset
from fairscape_cli.models.computation import GenerateComputation

logging.basicConfig(level=logging.ERROR)
warnings.filterwarnings("ignore")

## 1. Load and Explore the Croissant Dataset

First, let's load the Croissant metadata file and explore what's inside.

In [2]:
croissant_file = "cm4ai_if_images_croissant.json"

dataset = mlc.Dataset(jsonld=croissant_file)

print(f"Dataset name: {dataset.metadata.name}")
print(f"Dataset URL: {dataset.metadata.url}")

Dataset name: CM4AI_Multi_Treatment_IF_Dataset
Dataset URL: https://cm4ai.org
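
Because a Croissant file is plain JSON-LD, its resources can also be listed without `mlcroissant`. The sketch below assumes the Croissant keys `distribution`, `@id`, and `@type`; `sample_jsonld` is a hypothetical, heavily trimmed stand-in, not the real contents of `cm4ai_if_images_croissant.json`, and `list_resources` is an illustrative helper.

```python
import json

# Hypothetical, heavily trimmed stand-in for a Croissant JSON-LD document.
sample_jsonld = json.loads("""
{
  "name": "CM4AI_Multi_Treatment_IF_Dataset",
  "distribution": [
    {"@type": "cr:FileObject", "@id": "manifest-csv"},
    {"@type": "cr:FileSet", "@id": "all-image-files"}
  ]
}
""")

def list_resources(doc):
    """Return (id, type) pairs for every entry under 'distribution'."""
    return [(r.get("@id"), r.get("@type")) for r in doc.get("distribution", [])]

resources = list_resources(sample_jsonld)
for resource_id, resource_type in resources:
    print(f"{resource_id}: {resource_type}")
```

To try this on the real file, replace the inline string with `json.load(open(croissant_file))`; the key names may differ between Croissant versions.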


### Explore the Dataset Structure

Let's examine the resources (data files) and record sets (logical data groupings) in this dataset.

In [3]:
print("📁 RESOURCES (Data Files):")
print("-" * 50)
for resource in dataset.metadata.distribution:
    print(resource)
    print(f"  UUID: {resource.uuid}")
    print(f"  Description: {getattr(resource, 'description', 'N/A')}")

print("\n\n📊 RECORD SETS (Data Structures):")
print("-" * 50)
for record_set in dataset.metadata.record_sets:
    print(f"\nRecord Set: {record_set.name}")
    if hasattr(record_set, 'description') and record_set.description:
        print(f"  Description: {record_set.description}")
    
    print(f"  Sample Fields:")
    for field in record_set.fields[:5]:
        print(f"    - {field.name}: {field.data_types[0] if field.data_types else 'unknown type'}")
        if hasattr(field, 'description') and field.description:
            print(f"      Description: {field.description}")

📁 RESOURCES (Data Files):
--------------------------------------------------
FileObject(uuid="manifest-csv")
  UUID: manifest-csv
  Description: Manifest file mapping archive paths to experimental metadata
FileObject(uuid="untreated-archive")
  UUID: untreated-archive
  Description: Zip archive containing untreated control images organized by channel
FileObject(uuid="paclitaxel-archive")
  UUID: paclitaxel-archive
  Description: Zip archive containing paclitaxel treatment images organized by channel
FileObject(uuid="vorinostat-archive")
  UUID: vorinostat-archive
  Description: Zip archive containing vorinostat treatment images organized by channel
FileSet(uuid="all-image-files")
  UUID: all-image-files
  Description: All immunofluorescence images across all treatments and channels


📊 RECORD SETS (Data Structures):
--------------------------------------------------

Record Set: all_images
  Description: Complete dataset with all treatments, channels, and experimental metadata. The use

## 2. Load the Actual Data

Loading the records downloads every image archive and unzips it, so this step can take a long time.

In [4]:
all_records = list(dataset.records(record_set="all_images"))
print(f"✅ Total records loaded: {len(all_records)}")

✅ Total records loaded: 51440
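
When only a preview is needed, the record iterable can be truncated instead of materialized in full. The sketch below uses `itertools.islice`; `take` and `fake_records` are hypothetical names, and the generator merely stands in for `dataset.records(record_set="all_images")`. This limits iteration only. Whether it also avoids the full download depends on how `mlcroissant` fetches the archives, which is not verified here.

```python
from itertools import islice

def take(records, n):
    """Return the first n items of any iterable without consuming the rest."""
    return list(islice(records, n))

# Hypothetical stand-in for dataset.records(record_set="all_images").
def fake_records():
    for i in range(51440):
        yield {"all_images/full_path": f"image_{i}.jpg"}

preview = take(fake_records(), 3)
print(len(preview), preview[0])
```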


## 3. Create Dataset for Embedding Generation

We'll create a dataset class that loads images for embedding generation and extracts their ARK identifiers for provenance.

In [5]:
class ImageEmbeddingDataset(Dataset):
    def __init__(self, records, record_set_prefix, transform=None, max_images=None):
        self.transform = transform
        self.image_paths = []
        self.image_ids = []
        self.metadata = []

        def safe_decode(value, default=""):
            # Record values may be bytes; normalize every field to str.
            if isinstance(value, bytes):
                return value.decode('utf-8')
            elif value is None:
                return default
            else:
                return str(value)

        for record in records:
            # Stop once the optional image cap is reached.
            if max_images and len(self.image_paths) >= max_images:
                break

            image_path = record.get(f"{record_set_prefix}/full_path")
            # Skip records whose image file is missing on disk.
            if image_path and os.path.exists(image_path):
                self.image_paths.append(image_path)

                # Use the file name as the image identifier.
                image_id = os.path.basename(image_path)
                if isinstance(image_id, bytes):
                    image_id = image_id.decode('utf-8')
                self.image_ids.append(image_id)

                # Collect experimental metadata, including the ARK used for provenance.
                metadata = {
                    'treatment': safe_decode(record.get(f"{record_set_prefix}/treatment")),
                    'plate': safe_decode(record.get(f"{record_set_prefix}/plate")),
                    'well': safe_decode(record.get(f"{record_set_prefix}/well")),
                    'channel': safe_decode(record.get(f"{record_set_prefix}/channel")),
                    'hpa_antibody_id': safe_decode(record.get(f"{record_set_prefix}/hpa_antibody_id")),
                    'ensembl_id': safe_decode(record.get(f"{record_set_prefix}/ensembl_id")),
                    'ark': safe_decode(record.get(f"{record_set_prefix}/ark"))
                }
                self.metadata.append(metadata)
        
        if len(self.metadata) > 0:
            print("Sample metadata:")
            print(f"  Treatment: {self.metadata[0]['treatment']}")
            print(f"  Plate: {self.metadata[0]['plate']}")
            print(f"  Well: {self.metadata[0]['well']}")
            print(f"  Channel: {self.metadata[0]['channel']}")
            print(f"  ARK ID: {self.metadata[0]['ark']}")
    
    def __len__(self):
        return len(self.image_paths)
    
    def __getitem__(self, idx):
        image_path = self.image_paths[idx]
        image_id = self.image_ids[idx]
        metadata = self.metadata[idx]
        
        try:
            image = Image.open(image_path).convert('RGB')
            if self.transform:
                image = self.transform(image)
        except Exception as e:
            print(f"Error loading image {image_path}: {e}")
            if self.transform:
                image = self.transform(Image.new('RGB', (224, 224), (0, 0, 0)))
            else:
                image = torch.zeros(3, 224, 224)
        
        return image, image_id, metadata
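
The metadata extraction in the class above can be exercised on its own. This is a minimal sketch: `extract_metadata` is a hypothetical helper, and the sample `record` is made up for illustration rather than taken from the dataset.

```python
def safe_decode(value, default=""):
    """Normalize a record field to str: decode bytes, map None to default."""
    if isinstance(value, bytes):
        return value.decode("utf-8")
    if value is None:
        return default
    return str(value)

def extract_metadata(record, prefix, fields):
    """Read prefixed fields from one record and normalize them to strings."""
    return {name: safe_decode(record.get(f"{prefix}/{name}")) for name in fields}

# Hypothetical record; the field values here are illustrative only.
record = {
    "all_images/treatment": b"Vorinostat",
    "all_images/plate": 1,
    "all_images/well": b"A10",
}
meta = extract_metadata(record, "all_images", ["treatment", "plate", "well", "channel"])
print(meta)
```

A field that is absent from the record comes back as an empty string, which is one possible reason for a blank value in the printed sample metadata.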

## 4. Set Up DenseNet Embedding Model

We'll use a pre-trained DenseNet model and extract features from the layer before classification.

In [6]:
class DenseNetEmbedder(nn.Module):
    def __init__(self):
        super(DenseNetEmbedder, self).__init__()
        self.densenet = models.densenet121(weights='IMAGENET1K_V1')
        self.densenet.classifier = nn.Identity()
        
    def forward(self, x):
        return self.densenet(x)

device = torch.device('cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu')
print(f"Using device: {device}")

model = DenseNetEmbedder().to(device)
model.eval()

Using device: mps


DenseNetEmbedder(
  (densenet): DenseNet(
    (features): Sequential(
      (conv0): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
      (norm0): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu0): ReLU(inplace=True)
      (pool0): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
      (denseblock1): _DenseBlock(
        (denselayer1): _DenseLayer(
          (norm1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu1): ReLU(inplace=True)
          (conv1): Conv2d(64, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (norm2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (relu2): ReLU(inplace=True)
          (conv2): Conv2d(128, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        )
        (denselayer16): _DenseLayer(
          (norm1): BatchNorm2d(992, eps=1e-05, 
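
The size of the resulting embedding follows from the DenseNet-121 layout. The sketch below is plain arithmetic, not a model: it assumes torchvision's DenseNet-121 configuration (64 initial features, growth rate 32, dense blocks of 6, 12, 24, and 16 layers, and transitions that halve the channel count), and `densenet_channels` is a hypothetical helper name.

```python
def densenet_channels(num_init_features=64, growth_rate=32, block_config=(6, 12, 24, 16)):
    """Track channel counts through the dense blocks and transitions."""
    channels = num_init_features
    trace = []
    for i, num_layers in enumerate(block_config):
        # Each dense layer concatenates growth_rate new channels.
        channels += num_layers * growth_rate
        trace.append(channels)
        if i != len(block_config) - 1:
            # A transition layer halves the channel count between blocks.
            channels //= 2
    return channels, trace

embedding_dim, per_block = densenet_channels()
print(embedding_dim, per_block)
```

Under these assumptions the last block starts at 512 channels, so its sixteenth layer sees 512 + 15 × 32 = 992 input channels, which is consistent with the `BatchNorm2d(992, ...` entry in the truncated printout if that entry belongs to the last dense block.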

## 5. Generate Image Embeddings

Now let's generate embeddings for a small subset of the images.

In [7]:
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

embedding_dataset = ImageEmbeddingDataset(
    all_records, 
    "all_images", 
    transform=transform, 
    max_images=10 # Using a small number for this demo
)

dataloader = DataLoader(
    embedding_dataset, 
    batch_size=16, 
    shuffle=False, 
    num_workers=0
)

print(f"Dataset size: {len(embedding_dataset)}")
print(f"Batch size: {dataloader.batch_size}")
print(f"Number of batches: {len(dataloader)}")

Sample metadata:
  Treatment: Vorinostat
  Plate: 1
  Well: A10
  Channel: 
  ARK ID: ark:59852/B2AI_1_Vorinostat_A10_R11_z01_blue
Dataset size: 10
Batch size: 16
Number of batches: 1
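
To make the `Normalize` step concrete, the sketch below applies the same per-channel formula, `(value - mean) / std`, to single pixels in pure Python. It assumes pixel values already scaled to `[0, 1]`, as `ToTensor` produces; `normalize_pixel` is a hypothetical helper, not part of torchvision.

```python
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]

def normalize_pixel(rgb):
    """Apply (value - mean) / std per channel to one RGB pixel in [0, 1]."""
    return [(v - m) / s for v, m, s in zip(rgb, MEAN, STD)]

black = normalize_pixel([0.0, 0.0, 0.0])
white = normalize_pixel([1.0, 1.0, 1.0])
print([round(v, 3) for v in black])
print([round(v, 3) for v in white])
```

A black pixel maps to negative values and a white pixel to positive ones, so the all-black fallback image used when loading fails is not an all-zero tensor after the transform.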


In [8]:
embeddings_list = []
image_ids_list = []
metadata_list = []

with torch.no_grad():
    for batch_idx, (images, image_ids, metadata_batch) in enumerate(tqdm(dataloader, desc="Processing batches")):
        images = images.to(device)
        
        embeddings = model(images)
        
        embeddings_list.extend(embeddings.cpu().numpy())
        image_ids_list.extend(image_ids)
        
        batch_size = len(list(metadata_batch.values())[0])  
        
        for i in range(batch_size):
            individual_metadata = {}
            for key, value_list in metadata_batch.items():
                individual_metadata[key] = value_list[i]
            metadata_list.append(individual_metadata)

print(f"✅ Generated {len(embeddings_list)} embeddings")

Processing batches: 100%|██████████| 1/1 [00:04<00:00,  4.33s/it]

✅ Generated 10 embeddings
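
The inner loop above undoes the default collation, which turns a batch of per-image dictionaries into one dictionary of lists. A compact equivalent is sketched below; `unbatch` is a hypothetical helper and the `batch` values are made up for illustration.

```python
def unbatch(metadata_batch):
    """Turn a dict of equal-length lists into a list of per-item dicts."""
    keys = list(metadata_batch)
    return [dict(zip(keys, values)) for values in zip(*metadata_batch.values())]

# Hypothetical collated batch of two items.
batch = {"treatment": ["Vorinostat", "Paclitaxel"], "well": ["A10", "B3"]}
items = unbatch(batch)
print(items)
```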





## 6. Save Embeddings to Files

Let's save the embeddings and metadata to TSV files for further analysis.

In [None]:
output_dir = "./densenet_embeddings"
os.makedirs(output_dir, exist_ok=True)

embeddings_array = np.array(embeddings_list)
embedding_dim = embeddings_array.shape[1]

embeddings_df = pd.DataFrame(embeddings_array, columns=[f"dim_{i}" for i in range(embedding_dim)])
embeddings_df.insert(0, 'image_id', image_ids_list)

embeddings_file = os.path.join(output_dir, "image_embeddings.tsv")
embeddings_df.to_csv(embeddings_file, sep='\t', index=False)

metadata_rows = []
for i, metadata in enumerate(metadata_list):
    row = {'image_id': image_ids_list[i]}
    row.update(metadata)
    metadata_rows.append(row)

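metadata_df = pd.DataFrame(metadata_rows)

# Save the per-image metadata next to the embeddings.
metadata_file = os.path.join(output_dir, "image_metadata.tsv")
metadata_df.to_csv(metadata_file, sep='\t', index=False)

print(f"✅ Embeddings saved to: {embeddings_file}")
print(f"✅ Metadata saved to: {metadata_file}")
print(f"\nEmbeddings shape: {embeddings_array.shape}")
print(f"Metadata shape: {metadata_df.shape}")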

✅ Embeddings saved to: ./densenet_embeddings/image_embeddings.tsv
✅ Metadata saved to: ./densenet_embeddings/image_metadata.tsv

Embeddings shape: (10, 1024)
Metadata shape: (10, 8)
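
Both files share the `image_id` column, so they can be joined again later. The sketch below does this with the standard library only; the two inline TSV strings are hypothetical miniatures of the real files, and `read_tsv` is an illustrative helper.

```python
import csv
import io

# Hypothetical miniature versions of the two TSV files.
embeddings_tsv = "image_id\tdim_0\tdim_1\nimg_a.jpg\t0.1\t0.2\nimg_b.jpg\t0.3\t0.4\n"
metadata_tsv = "image_id\ttreatment\nimg_a.jpg\tVorinostat\nimg_b.jpg\tPaclitaxel\n"

def read_tsv(text):
    """Parse tab-separated text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

# Index the metadata rows by image_id for constant-time lookup.
metadata_by_id = {row["image_id"]: row for row in read_tsv(metadata_tsv)}

joined = []
for row in read_tsv(embeddings_tsv):
    # Collect the dim_* columns, in file order, as the embedding vector.
    vector = [float(v) for k, v in row.items() if k.startswith("dim_")]
    treatment = metadata_by_id[row["image_id"]]["treatment"]
    joined.append((row["image_id"], treatment, vector))
print(joined)
```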


## 7. Create and Register an RO-Crate

Now we will package the notebook and its outputs into an RO-Crate for portability and provenance.

In [16]:
rocrate_path = pathlib.Path('densenet_embeddings')
notebook_name = "demo.ipynb"
author_name = "Notebook User"

crate_params = {
    "guid": "ark:59852/demo-if-embedding-rocrate",
    "name": "DenseNet Image Embeddings from IF Data",
    "organizationName": "CM4AI",
    "projectName": "IF Image Embedding",
    "description": "1024-dimensional image embeddings generated by a DenseNet model from a subset of the CM4AI IF image dataset.",
    "keywords": ["embedding", "immunofluorescence", "densenet", "deep learning"],
    "author": author_name,
    "version":"1.0.0", 
    "path": rocrate_path
}

crate = GenerateROCrate(**crate_params)
print(f"✅ RO-Crate initialized at: {rocrate_path}")

shutil.copy(notebook_name, rocrate_path / notebook_name)

software_params = {
    "guid": "ark:59852/embedding-notebook",
    "name": "Jupyter Notebook for Embedding Generation",
    "author": author_name,
    "version": "1.0.0",
    "description": "A Jupyter Notebook that loads IF images from a Croissant dataset, generates embeddings using DenseNet, and packages the results.",
    "keywords": ["python", "jupyter", "pytorch", "mlcroissant"],
    "fileFormat": "application/x-ipynb+json",
    'dateModified': pd.Timestamp.now().isoformat(),
    "filepath": './densenet_embeddings/' + notebook_name,
    "cratePath": rocrate_path
}
notebook_software = GenerateSoftware(**software_params)


source_image_arks = metadata_df['ark'].unique().tolist()

computation_params = {
    "guid": "ark:59852/embedding-computation",
    "name": "Image Embedding Generation Computation",
    "runBy": author_name,
    "dateCreated": pd.Timestamp.now().isoformat(),
    "description": "Execution of the embedding notebook to generate DenseNet vectors from IF images.",
    "keywords": ["embedding", "densenet", "execution"],
    "usedSoftware": [notebook_software.guid],
    "usedDataset": source_image_arks
}
embedding_computation = GenerateComputation(**computation_params)

dataset_params = {
    "guid": "ark:59852/embedding-vectors",
    "name": "Image Embedding Vectors (TSV)",
    "description": f"A TSV file containing {embedding_dim}-dimensional embeddings for {len(embeddings_df)} images.",
    "author": author_name,
    "version": "1.0.0",
    "keywords": ["embedding", "vector", "tsv"],
    "format": "text/tab-separated-values",
    "datePublished": pd.Timestamp.now().isoformat(),
    "filepath": './densenet_embeddings/' + os.path.basename(embeddings_file),
    "derivedFrom": source_image_arks,
    "generatedBy": [embedding_computation.guid],
    "cratePath": rocrate_path
}
embedding_dataset_component = GenerateDataset(**dataset_params)


AppendCrate(
    cratePath=rocrate_path,
    elements=[
        notebook_software,
        embedding_computation,
        embedding_dataset_component
    ]
)
print(f"✅ Registered {notebook_software.name} with GUID: {notebook_software.guid}")
print(f"✅ Registered {embedding_computation.name} with GUID: {embedding_computation.guid}")
print(f"✅ Registered {embedding_dataset_component.name} with GUID: {embedding_dataset_component.guid}")

✅ RO-Crate initialized at: densenet_embeddings
✅ Registered Jupyter Notebook for Embedding Generation with GUID: ark:59852/embedding-notebook
✅ Registered Image Embedding Generation Computation with GUID: ark:59852/embedding-computation
✅ Registered Image Embedding Vectors (TSV) with GUID: ark:59852/embedding-vectors
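
The registered elements can be checked by reading the crate's metadata file. The sketch below assumes the RO-Crate convention of a top-level `@graph` list; `sample_crate` is a hypothetical, trimmed stand-in, and the `@type` strings in it are illustrative and may differ from what `fairscape_cli` actually writes.

```python
import json

# Hypothetical, trimmed stand-in for ro-crate-metadata.json.
sample_crate = json.loads("""
{
  "@graph": [
    {"@id": "ark:59852/embedding-notebook", "@type": "Software"},
    {"@id": "ark:59852/embedding-computation", "@type": "Computation"},
    {"@id": "ark:59852/embedding-vectors", "@type": "Dataset"}
  ]
}
""")

def summarize_graph(crate):
    """Map each entity's @id to its @type for a quick overview."""
    return {entity["@id"]: entity.get("@type") for entity in crate.get("@graph", [])}

summary = summarize_graph(sample_crate)
for guid, entity_type in summary.items():
    print(f"{guid}: {entity_type}")
```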


## Summary

We've successfully:
1. **Loaded the Croissant dataset** and explored its structure.
2. **Created a PyTorch dataset** for loading images.
3. **Set up a DenseNet embedding model** using torchvision's pre-trained weights.
4. **Generated embeddings** for a subset of images in the dataset.
5. **Saved the results** to TSV files for further analysis.
6. **Packaged the entire workflow** into an RO-Crate for portability and provenance.

The output now includes:

- `./densenet_embeddings/`: An RO-Crate directory containing:
  - `ro-crate-metadata.json`: The metadata file describing the contents of the crate.
  - `demo.ipynb`: A copy of this notebook, registered as the `Software` that performed the work.
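  - `image_embeddings.tsv`: The generated embeddings, registered as a `Dataset` derived from the source images and generated by the notebook.
  - `image_metadata.tsv`: The per-image metadata saved alongside the embeddings. This notebook does not register it as a separate crate element.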

These embeddings and their corresponding RO-Crate can now be used for downstream tasks like clustering, classification, or similarity analysis, with a clear record of their origin.
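
As one example of similarity analysis, the sketch below computes cosine similarity between two vectors in pure Python. It is a minimal illustration on made-up vectors, not part of the notebook's pipeline; `cosine_similarity` is a hypothetical helper, and returning 0.0 for a zero vector is a convention chosen here.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Convention for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

For the saved embeddings, each row of `dim_*` columns in `image_embeddings.tsv` would play the role of one vector.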