# CLIP (Contrastive Language-Image Pre-Training) Model Adapter Tutorial

This notebook walks through fine-tuning a CLIP model with the Dataloop platform and its Python SDK. CLIP models learn a shared embedding space for images and text, which makes them well suited to zero-shot image classification, image-text retrieval, and generating image embeddings for semantic search.

Fine-tuning allows you to adapt a pre-trained CLIP model to your specific dataset and domain, potentially improving its performance on tasks relevant to your data. This tutorial will walk you through:

1. Preparing a dataset with images and their corresponding textual descriptions.
2. Using the Dataloop CLIP model adapter to fine-tune the model on your prepared dataset.
3. Deploying the fine-tuned model and using it to generate embeddings for your images.

### Table of Contents
1. [Install Dependencies](#install-dependencies)
2. [Import Required Libraries](#import-libraries)
3. [Set Up Dataloop Environment](#setup-environment)
4. [Prepare Dataset for Fine-Tuning](#prepare-dataset)
    *   [4.1 Option A: Use Public Dataset (Mars Surface Images)](#use-public-dataset)
        *   [4.1.1 Install Mars Surface Images DPK](#install-mars-dpk)
        *   [4.1.2 Get Captioned Dataset and Split for ML](#get-mars-dataset-split)
    *   [4.2 Option B: (Alternative) Upload and Prepare Your Custom Dataset](#upload-custom-dataset)
    *   [4.3 Convert Image Dataset to Prompt Item Format](#convert-to-prompt-items)
5. [Fine-Tune and Deploy CLIP Model](#finetune-deploy-clip)
    *   [5.1 Install CLIP Model Package (DPK)](#install-clip-model-dpk)
    *   [5.2 Configure and Clone Pretrained CLIP Model](#configure-clone-clip-model)
    *   [5.3 Train the Model](#train-clip-model)
    *   [5.4 Deploy the Fine-Tuned Model](#deploy-clip-model)
    *   [5.5 Embed Datasets using the Fine-Tuned Model](#embed-datasets-clip)
6. [Conclusion](#conclusion)

Let's get started!

## <a id='install-dependencies'></a>1. Install Dependencies

First, ensure that the necessary Python libraries are installed. This notebook requires `dtlpy` for interacting with the Dataloop platform and `pandas` for data manipulation. The following cell will install or upgrade them.

[Back to Top](#clip-contrastive-language-image-pre-training-model-adapter-tutorial)

In [None]:
!pip install dtlpy pandas --upgrade --quiet

## <a id='import-libraries'></a>2. Import Required Libraries

Now, we import all the Python libraries that will be used throughout this tutorial.

[Back to Top](#clip-contrastive-language-image-pre-training-model-adapter-tutorial)

In [11]:
import time
import random
import string
import dtlpy as dl
import pandas as pd

from pathlib import Path
from concurrent.futures import ThreadPoolExecutor


## <a id='setup-environment'></a>3. Set Up Dataloop Environment

We need to set up our Dataloop environment and get our project. You'll need to replace project and dataset names with your own values.

> **_NOTE:_**  This tutorial assumes you are working in a new project which does NOT have the CLIP model previously installed. If it's an existing project and you already have CLIP installed, you will need to get the appropriate app and base CLIP model entity for the rest of the code to work correctly.

[Back to Top](#clip-contrastive-language-image-pre-training-model-adapter-tutorial)

In [2]:
if dl.token_expired():
    dl.login()

PROJECT_NAME = "<your project name here>"
try:
    # Reuse the project if it already exists, otherwise create it
    project = dl.projects.get(project_name=PROJECT_NAME)
except dl.exceptions.NotFound:
    project = dl.projects.create(project_name=PROJECT_NAME)

**Action Required:** In the cell above, replace `"<your project name here>"` with the desired name for your Dataloop project. If a project with this name already exists, the cell retrieves it; otherwise, it creates a new project.

## <a id='prepare-dataset'></a>4. Prepare Dataset for Fine-Tuning

Fine-tuning a CLIP model requires a dataset of images paired with relevant textual descriptions. This section covers two ways to prepare such a dataset:
1.  **Option A:** Use a publicly available dataset from the Dataloop Marketplace (Mars Surface Images with Captions).
2.  **Option B:** Upload your own custom dataset of images and descriptions.

Once the image dataset is ready, we will convert it into a special "prompt item" format suitable for training the CLIP model adapter.

[Back to Top](#clip-contrastive-language-image-pre-training-model-adapter-tutorial)

### <a id='use-public-dataset'></a>4.1 Option A: Use Public Dataset (Mars Surface Images)

For this tutorial, we will install the Mars Surface Images datasets from the Dataloop Marketplace. The captioned dataset in this package includes images with descriptions pre-loaded in the item metadata.

#### <a id='install-mars-dpk'></a>4.1.1 Install Mars Surface Images DPK

This Dataloop Package (DPK) contains datasets related to Mars surface imagery, including one with captions.

In [None]:
dpk = dl.dpks.get(dpk_name="mars-surface-images")
app = project.apps.install(dpk=dpk)
print(f"Mars Surface Datasets installed: {app.name}")

#### <a id='get-mars-dataset-split'></a>4.1.2 Get Captioned Dataset and Split for ML

After installing the DPK, the "Mars Surface Images with Captions" dataset should be available in your project. We will retrieve this dataset and split its items into training, validation, and test subsets. The split keeps validation and test items out of the data the model is trained on, so they can be used to monitor and evaluate the fine-tuned model.

You may need to wait a few minutes after installing the app until the dataset has completed loading into your project.

In [None]:
dataset = project.datasets.get(dataset_name="Mars Surface Images with Captions")

SUBSET_PERCENTAGES = {'train': 80, 'validation': 10, 'test': 10}
dataset.split_ml_subsets(percentages=SUBSET_PERCENTAGES)
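
To make the percentages concrete, here is a minimal local sketch of an 80/10/10 split over a list of item IDs. It only illustrates the arithmetic: `split_ml_subsets` performs the assignment on the platform, and its exact shuffling and rounding behaviour may differ. The helper name `split_ids` and the seed are assumptions made for this example.

```python
import random


def split_ids(item_ids, percentages, seed=0):
    """Shuffle item IDs and cut them into named subsets by percentage (illustrative only)."""
    if sum(percentages.values()) != 100:
        raise ValueError("subset percentages must sum to 100")
    ids = list(item_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    subsets, start = {}, 0
    names = list(percentages)
    for i, name in enumerate(names):
        if i == len(names) - 1:
            end = len(ids)  # last subset absorbs any rounding remainder
        else:
            end = start + len(ids) * percentages[name] // 100
        subsets[name] = ids[start:end]
        start = end
    return subsets


splits = split_ids(range(50), {"train": 80, "validation": 10, "test": 10})
print({name: len(ids) for name, ids in splits.items()})
# → {'train': 40, 'validation': 5, 'test': 5}
```

Every ID lands in exactly one subset, which is the property that matters for keeping evaluation data separate from training data.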

### <a id='upload-custom-dataset'></a>4.2 Option B: (Alternative) Upload and Prepare Your Custom Dataset

If you have your own dataset of images and corresponding text descriptions, you can use the function below to create a new Dataloop dataset, upload your images, and associate the descriptions. 

The function expects a Pandas DataFrame (`pairs_df`) with two columns:
-   `'filepath'`: The local path to each image file.
-   `'description'`: The text description for that image.

It also assumes that for each image file (e.g., `items/image1.jpg`), there is a corresponding JSON annotation file (e.g., `json/image1.json`) if you have existing annotations to upload. If not, the `local_annotations_path` part can be adapted.
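
If your captions live in a simple filename-to-text mapping, the rows for `pairs_df` can be assembled along these lines. This is a sketch under assumptions: the helper `build_pairs`, the accepted image suffixes, and the demo file names are all hypothetical, and the demo runs against a temporary folder rather than real data.

```python
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed set of image types


def build_pairs(items_dir, captions):
    """Return one {'filepath', 'description'} row per image that has a non-empty caption."""
    rows = []
    for path in sorted(Path(items_dir).iterdir()):
        caption = captions.get(path.name, "").strip()
        if path.suffix.lower() in IMAGE_SUFFIXES and caption:
            rows.append({"filepath": str(path), "description": caption})
    return rows


# Demo on a throwaway folder; replace with your own image directory and captions.
with tempfile.TemporaryDirectory() as tmp:
    items_dir = Path(tmp) / "items"
    items_dir.mkdir()
    for name in ("crater.jpg", "dune.png", "notes.txt"):
        (items_dir / name).touch()
    rows = build_pairs(items_dir, {"crater.jpg": "a large impact crater", "dune.png": "  "})
    print([(Path(r["filepath"]).name, r["description"]) for r in rows])
    # → [('crater.jpg', 'a large impact crater')]
```

Passing the result to `pd.DataFrame(rows)` then yields a frame with the two columns the upload function expects. Skipping images with blank captions up front avoids uploading items that would later fall back to a generic caption.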

[Back to Top](#clip-contrastive-language-image-pre-training-model-adapter-tutorial)

In [6]:
def create_new_dataset(dataset_name, pairs_df, subset_percentages={'train': 80, 'validation': 10, 'test': 10}):
    """
    Creates a new dataset from a DataFrame of image paths and descriptions.

    Args:
        dataset_name (str): Name of the dataset to create
        pairs_df (pd.DataFrame): DataFrame containing 'filepath' and 'description' columns
        subset_percentages (dict): Percentages for each ML subset. Defaults to
            80% train, 10% validation, 10% test; other values are allowed
            as long as they sum to 100.
    """

    try:
        dataset = project.datasets.create(dataset_name=dataset_name)
    except dl.exceptions.BadRequest:
        # Generate 5 random alphanumeric characters
        suffix = ''.join(random.choices(string.ascii_letters + string.digits, k=5))
        dataset = project.datasets.create(dataset_name=f"{dataset_name}_{suffix}")

    def upload_item(row):
        file_path = row["filepath"]
        # Map e.g. items/image1.jpg to json/image1.json (assumes parallel 'items' and 'json' folders)
        annots_path = str(Path(file_path.replace("items", "json")).with_suffix(".json"))
        
        # Upload item with annotations
        item = dataset.items.upload(
            local_path=file_path,
            local_annotations_path=annots_path,
            item_metadata=dl.ExportMetadata.FROM_JSON,
            overwrite=True,
        )

        # Set description and update
        item.set_description(text=row["description"])
        item.update()

    # Use ThreadPoolExecutor to upload items in parallel with progress bar
    with ThreadPoolExecutor() as executor:
        from tqdm import tqdm
        list(tqdm(
            executor.map(upload_item, [row for _, row in pairs_df.iterrows()]),
            total=len(pairs_df),
            desc="Uploading items",
            unit="item",
            bar_format='{l_bar}{bar}| {n_fmt}/{total_fmt} [{elapsed}<{remaining}, {rate_fmt}]'
        ))

    # Since model training requires labels, we create a dummy label for the recipe
    dataset.add_labels(label_list=['free-text'])
    # Split the custom dataset into ML subsets, as in Option A
    dataset.split_ml_subsets(percentages=subset_percentages)

    return dataset

### <a id='convert-to-prompt-items'></a>4.3 Convert Image Dataset to Prompt Item Format

The Dataloop CLIP model adapter expects data in a specific "prompt item" format. A prompt item typically links an image (the prompt) to its textual description (the response or annotation).

The `ClipPrepare` class below, adapted from the [CLIP model adapter repository](https://github.com/dataloop-ai-apps/clip-model-adapter/blob/main/utils/prepare_dataset.py), provides the necessary functions to convert your image dataset (whether from Option A or B) into this prompt item format. It creates a new dataset containing these prompt items.

Images are uploaded as regular items first, and then prompt items are created that reference these images and include their descriptions as text annotations.
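
Every prompt item needs a caption, so the conversion has to decide what to do when an image has no description. The fallback order used in `_convert_item` (description first, then the parent folder name, then a generic placeholder) can be sketched as a standalone function. The name `choose_caption` is hypothetical, and `PurePosixPath` is used here on the assumption that remote directories are POSIX-style paths such as `/dunes`.

```python
from pathlib import PurePosixPath


def choose_caption(description, item_dir):
    """Pick a caption: the description if present, else one derived from the folder name."""
    if description is not None and description.strip():
        return description
    dir_name = PurePosixPath(item_dir).name  # '' for the root directory '/'
    if dir_name.strip():
        return f"a photo of a {dir_name.replace('_', ' ')}"
    return "an image"  # placeholder when nothing better is available


print(choose_caption("dark sand dunes near a crater rim", "/dunes"))
# → dark sand dunes near a crater rim
print(choose_caption("", "/impact_crater"))
# → a photo of a impact crater
print(choose_caption(None, "/"))
# → an image
```

Folder-derived and placeholder captions carry far less information than real descriptions, so it is worth checking how many items hit the fallback before training.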

[Back to Top](#clip-contrastive-language-image-pre-training-model-adapter-tutorial)

In [15]:
class ClipPrepare:
    @staticmethod
    def convert_dataset(dataset, keep_subsets=None):
        dataset_to = ClipPrepare.convert_to_prompt_dataset(dataset_from=dataset, keep_subsets=keep_subsets)
        return dataset_to

    @staticmethod
    def convert_to_prompt_dataset(dataset_from: dl.Dataset, keep_subsets):
        items = dataset_from.items.list()
        try:
            dataset_to = dataset_from.project.datasets.create(dataset_name=f"{dataset_from.name} prompt items")
        except Exception as e:
            print(f"Prompt item dataset already exists or error: {e}. Creating new prompt item dataset with suffix.")
            suffix = ''.join(random.choice(string.ascii_letters + string.digits) for _ in range(5))
            dataset_to = dataset_from.project.datasets.create(dataset_name=f"{dataset_from.name} prompt items-{suffix}")

        # Use a thread pool to convert items to prompt items in parallel
        all_items = list(items.all())  # items.all() is a generator; materialize it so len() works
        print(f"Converting {len(all_items)} items from dataset '{dataset_from.name}' to prompt items in dataset '{dataset_to.name}'...")
        with ThreadPoolExecutor() as executor:
            results = list(executor.map(lambda item: ClipPrepare._convert_item(item_id=item.id, dataset=dataset_to, existing_subsets=keep_subsets), all_items))
        # _convert_item returns None when an upload fails, so count only successful uploads
        created = [result for result in results if result is not None]
        print(f"Conversion completed. {len(created)} of {len(all_items)} prompt items created.")

        # Copy recipe from original dataset
        if dataset_from.get_recipe_ids():
            new_recipe_id = dataset_from.get_recipe_ids()[0]
            dataset_to.switch_recipe(recipe_id=new_recipe_id)
            print(f"Switched recipe for '{dataset_to.name}' to recipe ID: {new_recipe_id}")
        else:
            print(f"Warning: Original dataset '{dataset_from.name}' has no recipes. Prompt dataset '{dataset_to.name}' may need a recipe manually.")

        return dataset_to

    @staticmethod
    def _convert_item(item_id, dataset: dl.Dataset, existing_subsets=True):
        item = dl.items.get(item_id=item_id)
        if item.description is not None and item.description.strip() != '':
            caption = item.description
        else:
            # Fallback if description is empty or missing
            # print(f"Item {item.id} ('{item.name}') has no valid description. Trying directory name or using placeholder.")
            item_dir_name = Path(item.dir).name
            if item_dir_name and item_dir_name != '/' and item_dir_name.strip() != '':
                # print(f"Using directory name for item {item.id}: {item_dir_name}")
                caption = f"a photo of a {item_dir_name.replace('_', ' ')}" # Basic caption from directory
            else:
                # print(f"Item {item.id} ('{item.name}') has no description or usable directory name. Using placeholder caption.")
                caption = "an image" # Placeholder caption
        
        new_name = Path(item.name).stem + '.json'

        prompt_item_payload = dl.PromptItem(name=new_name)
        # User (prompt) part: the image
        prompt_item_payload.add(message={"content": [{"mimetype": dl.PromptType.IMAGE, "value": item.stream}]})
        
        # Assistant (response) part: the caption
        prompt_item_payload.add(message={"role": "assistant", 
                                          "content": [{"mimetype": dl.PromptType.TEXT, "value": caption}]})
        
        new_metadata = dict(item.metadata) if item.metadata else {}
        if not existing_subsets and "subsets" in new_metadata.get("system", {}):
            # Drop the ML subset assignment without mutating the source item's metadata
            new_metadata["system"] = {k: v for k, v in new_metadata["system"].items() if k != "subsets"}
        
        try:
            new_item = dataset.items.upload(
                local_path=prompt_item_payload, # Upload the PromptItem object directly
                remote_name=new_name,
                remote_path=item.dir, # Store in the same directory structure
                overwrite=True,
                item_metadata=new_metadata,
            )
            return new_item
        except Exception as e:
            print(f"Error uploading prompt item for original item {item.id} ('{item.name}'): {e}")
            return None

Now that the dataset (either public or custom) is prepared and the `ClipPrepare` class is defined, you can execute the conversion. This step will create a new dataset suffixed with "prompt items", containing the data in the format required for CLIP fine-tuning.

In [None]:
# Ensure 'dataset' variable refers to your prepared image dataset (from Option A or B)
# For example, if you used Option A (Mars dataset):
# dataset = project.datasets.get(dataset_name="Mars Surface Images with Captions")
# Or if you used Option B (custom dataset) and named it 'MyCustomImageData':
# dataset = project.datasets.get(dataset_name='MyCustomImageData')

if 'dataset' in locals() and dataset is not None:
    print(f"Starting conversion for dataset: '{dataset.name}' (ID: {dataset.id})")
    prompt_dataset = ClipPrepare.convert_dataset(dataset=dataset, keep_subsets=True)
    print(f"Successfully created prompt item dataset: '{prompt_dataset.name}' (ID: {prompt_dataset.id})")
    prompt_dataset.open_in_web() # Open the new dataset in the Dataloop platform
else:
    print("Error: 'dataset' variable is not defined. Please ensure you have run the dataset preparation steps (Option A or B)." )

After this step, you should have two datasets in your Dataloop project:
1.  The original dataset with images and their descriptions (e.g., "Mars Surface Images with Captions").
2.  A new dataset with prompt items (e.g., "Mars Surface Images with Captions prompt items"), which will be used for fine-tuning.

## <a id='finetune-deploy-clip'></a>5. Fine-Tune and Deploy CLIP Model

With the prompt item dataset prepared, we can now proceed to fine-tune the CLIP model.

[Back to Top](#clip-contrastive-language-image-pre-training-model-adapter-tutorial)

### <a id='install-clip-model-dpk'></a>5.1 Install CLIP Model Package (DPK)

First, we install the CLIP model DPK from the Dataloop Marketplace. This package provides the necessary components for CLIP model training and inference.

In [None]:
dpk = dl.dpks.get(dpk_name='clip-model-pretrained')
app = project.apps.install(dpk=dpk)
print(f"CLIP App installed: {app.name} (ID: {app.id})")

### <a id='configure-clone-clip-model'></a>5.2 Configure and Clone Pretrained CLIP Model

Next, we retrieve the base pre-trained CLIP model entity ("openai-clip") provided by the DPK. We'll then configure its metadata, specifying which subsets of our `prompt_dataset` to use for training and validation. 
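
The subset filters select items by a dotted metadata path such as `metadata.system.subsets.train`. The platform evaluates these filters itself; the sketch below only illustrates, locally and under assumptions, which items such a path would match. The helper `get_by_path` and the item records are hypothetical, with the record shape inferred from the filter fields used in this notebook.

```python
def get_by_path(document, dotted_path, default=None):
    """Walk a nested dict along a dot-separated path such as 'metadata.system.subsets.train'."""
    current = document
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Hypothetical item records shaped like the metadata written by the subset split.
items = [
    {"name": "a.json", "metadata": {"system": {"subsets": {"train": True}}}},
    {"name": "b.json", "metadata": {"system": {"subsets": {"validation": True}}}},
    {"name": "c.json", "metadata": {}},
]

train_names = [i["name"] for i in items if get_by_path(i, "metadata.system.subsets.train") is True]
print(train_names)
# → ['a.json']
```

An item with no subset metadata, like `c.json` here, matches neither filter and would be ignored by both training and validation.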

You can also adjust model hyperparameters like learning rate, batch size, and number of epochs in the `base_model.configuration` dictionary. The example settings below are a starting point.

In [21]:
base_model = project.models.get(model_name="openai-clip")

# Configure model metadata and subsets
# This tells the model which items from the PROMPT_DATASET to use for training and validation
# Reset only the subsets entry so any other system metadata on the model is kept
base_model.metadata.setdefault("system", {})["subsets"] = {}

train_filters = dl.Filters(field="metadata.system.subsets.train", values=True) # Assuming subsets were marked during prompt_dataset creation
val_filters = dl.Filters(field="metadata.system.subsets.validation", values=True)

base_model.metadata["system"]["subsets"]["train"] = train_filters.prepare()
base_model.metadata["system"]["subsets"]["validation"] = val_filters.prepare()

# Set model configuration (hyperparameters)
base_model.configuration = {
    "model_name": "ViT-B/32",      # CLIP model architecture (e.g., ViT-B/32, RN50)
    "embeddings_size": 512,       # Output embedding dimension for ViT-B/32
    "num_epochs": 10,             # Number of training epochs (adjust based on dataset size and convergence)
    "batch_size": 64,             # Batch size for training (adjust based on GPU memory)
    "learning_rate": 5e-6,        # Learning rate (often smaller for fine-tuning)
    "early_stopping": True,       # Enable early stopping
    "early_stopping_epochs": 3,   # Number of epochs with no improvement before stopping
    "weight_decay": 0.01          # Weight decay for regularization (optional)
}
base_model.output_type = "text" # For CLIP, this usually indicates it's working with text-image pairs
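
The `early_stopping_epochs` setting is described above as the number of epochs with no improvement before stopping. A minimal sketch of that patience rule is shown here, assuming improvement means a strictly lower validation loss; the adapter's actual criterion may differ, and the function name and loss values are made up for illustration.

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None if it runs to the end."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Made-up validation losses for illustration; not results from a real run.
losses = [0.90, 0.72, 0.65, 0.66, 0.67, 0.65, 0.70, 0.60]
print(stopping_epoch(losses, patience=3))
# → 6
```

In this made-up run, training stops at epoch 6 and never reaches the lower loss at epoch 8, which is the usual trade-off when choosing a small patience value.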

### <a id='train-clip-model'></a>5.3 Train the Model

Now we clone the configured base model. This creates a new model entity in your project that will be fine-tuned. We associate our `prompt_dataset` (created in [Section 4.3](#convert-to-prompt-items)) with this new model and then start the training process.

> **NOTE**: The training process can take a significant amount of time, depending on your dataset size, model configuration, and the available compute resources (GPU type).

In [22]:
# Ensure 'prompt_dataset' is defined from the conversion step
if 'prompt_dataset' not in locals() or prompt_dataset is None:
    raise ValueError("Error: 'prompt_dataset' is not defined. Please run section 4.3 to create it.")

finetuned_model_name = base_model.name + "-finetuned-" + ''.join(random.choices(string.ascii_lowercase + string.digits, k=4))
print(f"Cloning model to create: '{finetuned_model_name}' using dataset '{prompt_dataset.name}'")
finetuned_model = base_model.clone(model_name=finetuned_model_name, dataset_id=prompt_dataset.id)

print(f"Starting training for model: '{finetuned_model.name}' (ID: {finetuned_model.id}). This may take a while...")
execution = finetuned_model.train()
print(f"Training initiated. Execution ID: {execution.id}. You can monitor progress in the Dataloop platform.")

The cell below will periodically check the status of the training execution. You can also monitor the training progress, view logs, and see performance metrics directly in the Dataloop platform by navigating to your project, then Models, finding your `finetuned_model`, and checking its 'Executions' or 'Training' tabs.
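
The polling pattern can also be written as a small reusable helper with a timeout, so a stuck execution does not keep the notebook waiting indefinitely. This is a generic sketch, not part of the SDK: `poll_until` is a hypothetical helper, and the status strings in the demo are simulated stand-ins.

```python
import time


def poll_until(get_status, terminal_statuses, interval_s, timeout_s, sleep=time.sleep):
    """Call get_status() until it returns a terminal status or the timeout elapses."""
    waited = 0.0
    status = get_status()
    while status not in terminal_statuses:
        if waited >= timeout_s:
            raise TimeoutError(f"still '{status}' after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s
        status = get_status()
    return status


# Simulated status source standing in for a refreshed execution; no real waiting happens.
fake_statuses = iter(["created", "in-progress", "in-progress", "success"])
final = poll_until(
    get_status=lambda: next(fake_statuses),
    terminal_statuses={"success", "failed", "canceled"},
    interval_s=300,
    timeout_s=6 * 3600,
    sleep=lambda seconds: None,  # skip sleeping in this demo
)
print(final)
# → success
```

In the notebook, `get_status` could wrap the same refresh call used in the polling cell, for example `lambda: dl.executions.get(execution_id=execution.id).status`.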

In [None]:
# Wait for training to complete
print(f"Waiting for training execution {execution.id} to complete...")

while execution.status not in [dl.ExecutionStatus.SUCCESS, dl.ExecutionStatus.FAILED, dl.ExecutionStatus.CANCELED]:
    print(f"Training in progress (Status: {execution.status})... checking again in 5 minutes")
    time.sleep(300)  # Sleep for 5 minutes
    execution = dl.executions.get(execution_id=execution.id) # Refresh execution status

if execution.status == dl.ExecutionStatus.SUCCESS:
    print(f"Training completed successfully! Model ID: {finetuned_model.id}")
    # Update the local model object with the latest status and artifacts from the platform
    finetuned_model = project.models.get(model_id=finetuned_model.id)
elif execution.status == dl.ExecutionStatus.FAILED:
    print(f"Training failed. Execution ID: {execution.id}. Check logs in Dataloop platform for details.")
else:
    print(f"Training ended with status: {execution.status}. Execution ID: {execution.id}.")

### <a id='deploy-clip-model'></a>5.4 Deploy the Fine-Tuned Model

Once the model has successfully trained, you can deploy it as a service. This makes the model available for inference tasks, such as generating embeddings for new images.

[Back to Top](#clip-contrastive-language-image-pre-training-model-adapter-tutorial)

In [None]:
if 'finetuned_model' in locals() and finetuned_model is not None and execution.status == dl.ExecutionStatus.SUCCESS:
    print(f"Deploying fine-tuned model: '{finetuned_model.name}' (ID: {finetuned_model.id})")
    # The service will be created with default settings (e.g., 1 replica, default GPU if needed by model)
    # You can customize deployment configuration if necessary via finetuned_model.deploy(service_config={...})
    service = finetuned_model.deploy()
    print(f"Model deployment initiated. Service ID: {service.id}. Waiting for service to be ready...")
    service.wait_for_ready_state() # This will block until the service is ready or deployment fails
    print(f"Service '{service.name}' is now ready.")
else:
    print("Skipping deployment: Model training was not successful or 'finetuned_model' is not defined.")

### <a id='embed-datasets-clip'></a>5.5 Embed Datasets using the Fine-Tuned Model

After the fine-tuned model is deployed and its service is ready, you can use it to generate embeddings for images in any dataset. These embeddings capture the semantic content of the images as understood by your fine-tuned model and can be used for tasks like semantic search or similarity comparison.

In most cases, you will want to embed the original image dataset (e.g., "Mars Surface Images with Captions" or your custom image dataset), not the `prompt_dataset`.
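
Once embeddings exist, similarity comparison usually comes down to cosine similarity between vectors. The sketch below uses toy three-dimensional vectors and plain Python so it runs anywhere; the function names and values are assumptions for illustration, and it does not show how embeddings are read back from the platform.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return candidate names sorted from most to least similar to the query vector."""
    scored = {name: cosine_similarity(query, vec) for name, vec in candidates.items()}
    return sorted(scored, key=scored.get, reverse=True)


# Toy 3-dimensional vectors; the ViT-B/32 configuration in this notebook uses 512 dimensions.
query = [1.0, 0.0, 0.0]
candidates = {"crater": [0.9, 0.1, 0.0], "dune": [0.0, 1.0, 0.0], "rock": [0.5, 0.5, 0.0]}
print(rank_by_similarity(query, candidates))
# → ['crater', 'rock', 'dune']
```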

[Back to Top](#clip-contrastive-language-image-pre-training-model-adapter-tutorial)

In [None]:
if 'finetuned_model' in locals() and finetuned_model is not None and 'service' in locals() and service.is_ready:
    # Get the latest version of the model entity, which now includes deployment details
    finetuned_model = project.models.get(model_id=finetuned_model.id)
    
    # IMPORTANT: Select the dataset you want to embed. This should be your ORIGINAL image dataset,
    # not the prompt_dataset used for training.
    # For example, if you used the Mars dataset:
    dataset_to_embed = project.datasets.get(dataset_name="Mars Surface Images with Captions") 
    # Or, if you used a custom dataset named 'MyCustomImageData':
    # dataset_to_embed = project.datasets.get(dataset_name='MyCustomImageData')

    if dataset_to_embed:
        print(f"Starting embedding for dataset: '{dataset_to_embed.name}' (ID: {dataset_to_embed.id}) using model '{finetuned_model.name}'.")
        embedding_execution = finetuned_model.embed_datasets(dataset_ids=[dataset_to_embed.id])
        print(f"Embedding process initiated. Execution ID: {embedding_execution.id}. This may take some time.")
        # You can wait for this execution to complete similarly to the training execution if needed.
        # embedding_execution.wait() or a loop with time.sleep()
    else:
        print("Error: 'dataset_to_embed' is not defined. Please specify the correct original image dataset.")
else:
    print("Skipping embedding: Model is not successfully deployed or 'finetuned_model'/'service' are not defined.")

## <a id='conclusion'></a>6. Conclusion

Congratulations! You have successfully walked through the process of fine-tuning a CLIP model using the Dataloop platform. This included:

1.  **Setting up your Dataloop environment.**
2.  **Preparing a dataset** of images and descriptions, either by using a public dataset or your own custom data, and converting it to the required prompt item format.
3.  **Installing the CLIP model package.**
4.  **Configuring and cloning** a pretrained CLIP model.
5.  **Training (fine-tuning)** the model on your dataset.
6.  **Deploying** the fine-tuned model as a service.
7.  **Generating embeddings** for an image dataset using your fine-tuned model.

### Next Steps
*   **Experiment with Hyperparameters:** Adjust learning rate, batch size, number of epochs, and CLIP model architecture (e.g., `ViT-L/14` if available and supported) to potentially improve performance.
*   **Evaluate Your Model:** While this tutorial focuses on the fine-tuning process, a crucial next step is to evaluate your fine-tuned model's performance on a held-out test set for tasks like image-text retrieval or zero-shot classification.
*   **Use Embeddings:** Explore applications of the generated embeddings, such as building a semantic image search engine or performing image clustering.
*   **Explore Advanced Features:** Dataloop offers many more features for MLOps, data management, and AI development. Check out the [Dataloop Developer Documentation](https://developers.dataloop.ai/) for more.
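
For the evaluation step, one common metric for image-text retrieval is recall@K: the fraction of queries whose correct image appears in the top K results. The sketch below assumes you already have, for each test caption, a ranked list of image names; the function name, captions, and file names are hypothetical and not results from a real model.

```python
def recall_at_k(ranked_results, ground_truth, k):
    """Fraction of queries whose correct item appears among the top-k ranked results."""
    if not ground_truth:
        raise ValueError("ground_truth must contain at least one query")
    hits = sum(1 for query, target in ground_truth.items() if target in ranked_results[query][:k])
    return hits / len(ground_truth)


# Hypothetical retrieval output: for each caption, image names ordered by similarity.
ranked_results = {
    "a large impact crater": ["crater.jpg", "rock.jpg", "dune.jpg"],
    "wind-shaped sand dunes": ["rock.jpg", "dune.jpg", "crater.jpg"],
    "scattered rocks": ["dune.jpg", "crater.jpg", "rock.jpg"],
    "a crater rim at dusk": ["rim.jpg", "crater.jpg", "dune.jpg"],
}
ground_truth = {
    "a large impact crater": "crater.jpg",
    "wind-shaped sand dunes": "dune.jpg",
    "scattered rocks": "rock.jpg",
    "a crater rim at dusk": "rim.jpg",
}
print(recall_at_k(ranked_results, ground_truth, k=1))
# → 0.5
print(recall_at_k(ranked_results, ground_truth, k=2))
# → 0.75
```

Computing the same metric for the base model and the fine-tuned model on the test subset gives a direct before-and-after comparison.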

[Back to Top](#clip-contrastive-language-image-pre-training-model-adapter-tutorial)