# Image preference

- **Goal**: Show a standard workflow for working with complex multi-modal preference datasets like image question answering.
- **Dataset**: [RLAIF-V](https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset), a large-scale multimodal feedback dataset. It provides high-quality feedback in the form of 83,132 preference pairs for image question answering.
- **Libraries**: [datasets](https://github.com/huggingface/datasets), [sentence-transformers](https://github.com/UKPLab/sentence-transformers)
- **Components**: [ImageField](https://docs.argilla.io/latest/reference/argilla/settings/fields/#src.argilla.settings._field.ImageField), [TextQuestion](https://docs.argilla.io/latest/reference/argilla/settings/questions/#src.argilla.settings._question.TextQuestion), [VectorField](https://docs.argilla.io/dev/reference/argilla/settings/vectors/#rgvectorfield), [TermsMetadataProperty](https://docs.argilla.io/dev/reference/argilla/settings/metadata_property/?h=#rgtermsmetadataproperty)

## Getting started

### Deploy the Argilla server

If you have already deployed Argilla, you can skip this step. Otherwise, you can quickly deploy Argilla following [this guide](../getting_started/quickstart.md).

### Set up the environment

To complete this tutorial, you need to install the Argilla SDK and a few third-party libraries via `pip`.

In [None]:
!pip install argilla

In [None]:
!pip install "sentence-transformers>3,<4"

Let's make the required imports:

In [24]:
import argilla as rg

import base64
import io
import json
import re

import numpy as np
import torch
from datasets import Dataset, load_dataset, load_metric
from IPython.display import display
from PIL import Image
from sentence_transformers import SentenceTransformer
from transformers import (
    AutoImageProcessor,
    AutoModelForImageClassification,
    Trainer,
    TrainingArguments,
    pipeline,
)

You also need to connect to the Argilla server using the `api_url` and `api_key`.

In [3]:
# Replace api_url with your url if using Docker
# Replace api_key if you configured a custom API key
# Uncomment the last line and set your HF_TOKEN if your space is private
client = rg.Argilla(
    # api_url="https://[your-owner-name]-[your_space_name].hf.space",
    # api_url=
    api_key="argilla.apikey",
    # headers={"Authorization": f"Bearer {HF_TOKEN}"}
)

## Vibe check the dataset

We will have a look at [the dataset](https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset) to understand its structure and the kind of data it contains. We do this by using [the embedded Hugging Face Dataset Viewer](https://huggingface.co/docs/hub/main/en/datasets-viewer-embed).

<iframe
  src="https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset/embed/viewer/default/train"
  frameborder="0"
  width="100%"
  height="560px"
></iframe>

## Configure and create the Argilla dataset

Now, we need to configure the dataset. In the settings, we can specify the guidelines, fields, and questions. Because of the complexity of the task, we will also add metadata and vectors. We will add one vector to represent the semantic meaning of the `question`, and two metadata properties: `origin_dataset`, which stores the dataset each record comes from, and `task_type`, which stores the type of task the question belongs to and can be obtained from `origin_split`.

!!! note
    Check this [how-to guide](../how_to_guides/dataset.md) to know more about configuring and creating a dataset.

In [5]:
settings = rg.Settings(
    guidelines="The goal is to assess if the answers are correct and update them where needed.",
    fields=[
        rg.ImageField(
            name="image",
            title="An image of a certain object, state or action.",
        ),
        rg.TextField(
            name="question",
            title="A question about the image, intended to be answered.",
        )
    ],
    questions=[
        rg.TextQuestion(
            name="chosen",
            title="The chosen answer to the question.",
        ),
        rg.TextQuestion(
            name="rejected",
            title="The rejected answer to the question.",
        )
    ],
    metadata=[
        rg.TermsMetadataProperty(name="origin_dataset", title="Origin dataset"),
        rg.TermsMetadataProperty(name="task_type", title="Task type"),
    ],
    vectors=[
        rg.VectorField(name="question_vector", dimensions=384),
    ]
)

Let's create the dataset with the name and the defined settings:

In [None]:
dataset = rg.Dataset(
    name="image_preference_dataset",
    settings=settings,
)
dataset.create()

## Add records

Even though we have created the dataset, it still lacks the records to be annotated (you can check this in the UI). We will use the `openbmb/RLAIF-V-Dataset` dataset from [the Hugging Face Hub](https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset). Specifically, we will use the `train` split and get `100` examples. Because we are dealing with a large dataset, we will set `streaming=True` to avoid loading the entire dataset into memory and iterate over the data to load it lazily.

!!! tip
    When working with Hugging Face datasets, you can set `Image(decode=False)` to get [public image URLs](https://huggingface.co/docs/datasets/en/image_load#local-files) instead of decoded images. However, this depends on the dataset.

In [None]:
n_rows = 100
hf_dataset = load_dataset("openbmb/RLAIF-V-Dataset", streaming=True)
dataset_rows = []
count = 0
for row in hf_dataset["train"]:
    dataset_rows.append(row)
    count += 1
    if count >= n_rows:
        break
# Build an in-memory Dataset from the collected rows
hf_dataset = Dataset.from_list(dataset_rows)
hf_dataset
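The row-collection loop above can also be expressed with `itertools.islice`, which stops pulling from a lazy iterator after a fixed number of items. The sketch below uses a hypothetical generator, `fake_stream`, as a stand-in for the streamed split so that it runs without downloading anything. The helper `take` is likewise an illustrative name, not part of `datasets`.

```python
from itertools import islice


def fake_stream():
    """Hypothetical stand-in for a streamed split: yields rows lazily, forever."""
    i = 0
    while True:
        yield {"idx": i, "question": f"question {i}"}
        i += 1


def take(iterable, n):
    """Return the first n items of an iterable without consuming the rest."""
    return list(islice(iterable, n))


rows = take(fake_stream(), 100)
print(len(rows), rows[0]["idx"], rows[-1]["idx"])
```

With the real data, the streamed `train` split would take the place of `fake_stream()`. Because `islice` stops after `n` items, only the rows that are actually needed are fetched.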

Let's have a look at the first image in the dataset.

In [19]:
hf_dataset[0]

{'ds_name': 'RLAIF-V',
 'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=640x480>,
 'question': 'Who is more likely to use these tools a leather crafter or a paper crafter?',
 'chosen': 'A leather crafter is more likely to use these tools. The image shows various crafting tools, including scissors and a hole punch, which are commonly used in leatherworking projects. Leather is a material that requires cutting, shaping, and precise hole-punching techniques to create desired designs or patterns. In contrast, paper crafters typically use different types of tools, such as adhesives, decorative papers, or specialized cutting machines like the Silhouette Cameo, for their projects.',
 'rejected': 'A leather crafter is more likely to use these tools as they consist of a hole punch, scissors, and a knife. These items are typically used in crafting projects involving fabric or leather materials for various designs and patterns. Paper crafters may also benefit from some of these to

### Convert PIL image to base64

As we can see, the image is a PIL Image. In order to use it in Argilla, we need to convert it to a base64-encoded data URI.

In [None]:
def pil_to_data_uri(batch):
    data_uri = []
    for image in batch["image"]:
        buffered = io.BytesIO()
        image.save(buffered, format="PNG")
        img_str = base64.b64encode(buffered.getvalue()).decode()
        data_uri.append(f"data:image/png;base64,{img_str}")
    batch["image_data_uri"] = data_uri
    return batch

hf_dataset_with_base64 = hf_dataset.map(pil_to_data_uri, batched=True)
hf_dataset_with_base64[0]
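To make the format of the resulting string more concrete, here is a minimal sketch of how a data URI is assembled from raw bytes and parsed back, using only the standard library. The helper names `bytes_to_data_uri` and `data_uri_to_bytes` are made up for this illustration, and the payload is just the 8-byte PNG signature rather than a full image.

```python
import base64
import re


def bytes_to_data_uri(raw: bytes, mime: str = "image/png") -> str:
    """Encode raw bytes as a base64 data URI with the given MIME type."""
    encoded = base64.b64encode(raw).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def data_uri_to_bytes(uri: str) -> bytes:
    """Strip the 'data:<mime>;base64,' prefix and decode the payload."""
    match = re.match(r"^data:[\w/+.-]+;base64,(?P<payload>.+)$", uri, flags=re.DOTALL)
    if match is None:
        raise ValueError("not a base64 data URI")
    return base64.b64decode(match.group("payload"))


png_signature = b"\x89PNG\r\n\x1a\n"  # the first 8 bytes of every PNG file
uri = bytes_to_data_uri(png_signature)
print(uri)
print(data_uri_to_bytes(uri) == png_signature)
```

Keep in mind that base64 encoding makes the payload roughly a third larger than the raw bytes, which matters when logging many high-resolution images.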

### Retrieve values nested in JSON

The question type values are nested in a JSON object. We can obtain them by looping through the data and getting the `origin_split` column.

In [None]:
def retrieve_type_from_json(batch):
    loaded_json = [json.loads(x) for x in batch["origin_split"]]
    batch["task_type"] = [x["type"] for x in loaded_json]
    return batch

hf_dataset_with_base64_task = hf_dataset_with_base64.map(retrieve_type_from_json, batched=True)
hf_dataset_with_base64_task[0]
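If some rows contain malformed JSON or lack the `type` key, the function above will raise an error. A more defensive variant could look like the sketch below. The helper name `extract_type`, the `"unknown"` fallback, and the sample strings are all assumptions made for illustration and are not taken from the dataset.

```python
import json


def extract_type(origin_split, default="unknown"):
    """Return the 'type' value of a JSON string, or a default if it is unavailable."""
    try:
        parsed = json.loads(origin_split)
    except (TypeError, json.JSONDecodeError):
        return default
    if not isinstance(parsed, dict):
        return default
    return str(parsed.get("type", default))


# Made-up examples: a valid entry, an entry without "type", and invalid JSON
samples = [
    '{"model": "model-a", "type": "detailed_description"}',
    '{"model": "model-b"}',
    "not json",
]
print([extract_type(s) for s in samples])
# ['detailed_description', 'unknown', 'unknown']
```

Falling back to a single placeholder value keeps the `task_type` metadata filterable in the UI instead of failing the whole `map` call on one bad row.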

### Create vectors

We will use the `sentence-transformers` library to create vectors for the questions. We will use the `TaylorAI/bge-micro-v2` model, which strikes a good balance between speed and performance. Note that we also need to convert each vector to a `list` to store it in the Argilla dataset.

In [None]:
model = SentenceTransformer("TaylorAI/bge-micro-v2")

def encode_questions(batch):
    vectors_as_numpy = model.encode(batch["question"])
    batch["question_vector"] = [x.tolist() for x in vectors_as_numpy]
    return batch

hf_dataset_with_base64_task_vectors = hf_dataset_with_base64_task.map(encode_questions, batched=True)
hf_dataset_with_base64_task_vectors[0]
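Since the `VectorField` in the settings was declared with `dimensions=384`, it can be worth checking the vectors before logging them. The following is a small sketch of such a check in plain Python. `check_vectors` is a hypothetical helper, and the example vectors are dummy values rather than real embeddings.

```python
EXPECTED_DIMENSIONS = 384  # matches dimensions=384 in the VectorField settings


def check_vectors(vectors, dimensions=EXPECTED_DIMENSIONS):
    """Return the indices of vectors that are not lists of the expected length."""
    bad_indices = []
    for index, vector in enumerate(vectors):
        if not isinstance(vector, list) or len(vector) != dimensions:
            bad_indices.append(index)
    return bad_indices


# Dummy data: one valid vector, one that is too short, one that is not a list
vectors = [[0.0] * 384, [0.0] * 383, tuple([0.0] * 384)]
print(check_vectors(vectors))
# [1, 2]
```

On the real data, the values of the `question_vector` column would be passed in. An empty result means every vector has the expected shape.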

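### Log to Argilla

We will add the records to the dataset using `log` and a `mapping`, where each key is a column in our Hugging Face dataset and each value is the name of the Argilla field, question, metadata property, or vector it should populate. Columns mapped to a question, such as `chosen` and `rejected`, are added as suggestions. We are also adding an `id` column to each record, so we can easily trace it back to the external data source.

In [None]:
# Use the fully processed dataset, which contains the data URIs, task types, and vectors
hf_dataset_to_log = hf_dataset_with_base64_task_vectors.add_column(
    "id", list(range(len(hf_dataset_with_base64_task_vectors)))
)
dataset.records.log(records=hf_dataset_to_log, mapping={
    "image_data_uri": "image",
    "id": "id",
    "question": "question",
    "chosen": "chosen",
    "rejected": "rejected",
    "ds_name": "origin_dataset",
    "task_type": "task_type",
    "question_vector": "question_vector",
})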
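A mapping key that does not match any column in the source data is an easy mistake to make. The sketch below shows one way to catch it before logging. `unknown_mapping_keys` is a hypothetical helper and not part of the Argilla SDK, and the misspelled key in the example is deliberate.

```python
def unknown_mapping_keys(column_names, mapping):
    """Return the mapping keys that do not exist as source columns, in order."""
    available = set(column_names)
    return [key for key in mapping if key not in available]


# Illustrative column names and a mapping with one deliberately wrong key
columns = ["image_data_uri", "question", "chosen", "rejected", "id"]
mapping = {
    "image_uri": "image",  # typo: the column is called "image_data_uri"
    "question": "question",
    "chosen": "chosen",
}
print(unknown_mapping_keys(columns, mapping))
# ['image_uri']
```

For a Hugging Face dataset, the column names are available through its `column_names` attribute.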

Voilà! We have added the suggestions to the dataset, and they will appear in the UI marked with a ✨. 

## Evaluate with Argilla

Now, we can start the annotation process. Just open the dataset in the Argilla UI and start annotating the records. If the suggestions are correct, you can just click on `Submit`. Otherwise, you can edit the text of the answers before submitting.

!!! note
    Check this [how-to guide](../how_to_guides/annotate.md) to know more about annotating in the UI.

## Train your model

After the annotation, we will have a robust dataset to train the main model. In our case, we will fine-tune using transformers. However, you can select the framework that best fits your requirements. So, let's start by retrieving the annotated records.

!!! note
    Check this [how-to guide](../how_to_guides/query.md) to know more about filtering and querying in Argilla. Also, you can check the Hugging Face docs on [fine-tuning an image classification model](https://huggingface.co/docs/transformers/en/tasks/image_classification).

### Formatting the data

In [16]:
dataset = client.datasets("image_preference_dataset")

In [None]:
status_filter = rg.Query(filter=rg.Filter(("response.status", "==", "submitted")))

submitted = dataset.records(status_filter).to_list(flatten=True)

We then need to convert our base64 images to a format that the model can understand so we will convert them to PIL images again.

In [55]:
def base64_to_pil(base64_string):
    image_data = re.sub(r'^data:image/.+;base64,', '', base64_string)
    image = Image.open(io.BytesIO(base64.b64decode(image_data)))
    return image

Now, let's apply that to the whole dataset.

In [None]:
submitted_pil_image = [
    {
        "id": sample["id"],
        "image": base64_to_pil(sample["image"]),
        "label": sample["image_label.responses"][0],
    }
    for sample in submitted
]
submitted_pil_image[0]

We now need to ensure our images are forwarded with the correct dimensions. The ViT model expects three-channel RGB input, so we convert every image to RGB mode. Images that are already RGB keep their channels, while greyscale images get their single channel replicated.

In [None]:
def greyscale_to_rgb(img) -> Image.Image:
    # convert() handles greyscale images as well as images that are already RGB
    return img.convert("RGB")

submitted_pil_image_rgb = [
    {
        "image": greyscale_to_rgb(sample["image"]),
        "label": sample["label"],
    }
    for sample in submitted_pil_image
]
submitted_pil_image_rgb[0]

Next, we will load the `ImageProcessor` for fine-tuning the model. This processor will handle the image resizing and normalization in order to be compatible with the model we intend to use.

In [None]:
checkpoint = "google/vit-base-patch16-224-in21k"
processor = AutoImageProcessor.from_pretrained(checkpoint)

submitted_pil_image_rgb_processed = [
    {
        "pixel_values": processor(sample["image"], return_tensors='pt')["pixel_values"],
        "label": sample["label"],
    }
    for sample in submitted_pil_image_rgb
]
submitted_pil_image_rgb_processed[0]

We can now convert the images to a Hugging Face datasets Dataset that is ready for fine-tuning.

In [None]:
prepared_ds = Dataset.from_list(submitted_pil_image_rgb_processed)
prepared_ds = prepared_ds.train_test_split(test_size=0.2)
prepared_ds

### The actual training

We then need to define our data collator, which will ensure the data is unpacked and stacked correctly for the model.

In [None]:
def collate_fn(batch):
    return {
        'pixel_values': torch.stack([torch.tensor(x['pixel_values'][0]) for x in batch]),
        'labels': torch.tensor([int(x['label']) for x in batch])
    }

Next, we can define our training metrics. We will use the accuracy metric to evaluate the model's performance.

In [None]:
metric = load_metric("accuracy")
def compute_metrics(p):
    return metric.compute(predictions=np.argmax(p.predictions, axis=1), references=p.label_ids)
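To make clear what `compute_metrics` calculates, here is a plain-Python sketch of the same idea: take the index of the highest logit per row as the prediction and compare it with the reference label. The function names and the toy logits are illustrative only, and the sketch does not use the `accuracy` metric object itself.

```python
def argmax(row):
    """Return the index of the largest value in a sequence."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits, label_ids):
    """Fraction of rows whose highest-scoring class equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, label_ids))
    return {"accuracy": correct / len(label_ids)}


# Toy logits for four samples and three classes
toy_logits = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 0.9], [0.4, 0.3, 0.2]]
toy_label_ids = [1, 0, 1, 0]
print(accuracy(toy_logits, toy_label_ids))
# {'accuracy': 0.75}
```

Three of the four predictions (`[1, 0, 2, 0]`) match the references, hence the value of 0.75.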

We then load our model and configure the labels that we will use for training.

In [None]:
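# Assumption: `labels` is not defined earlier in this tutorial, so we derive it from the annotated data
labels = sorted({int(sample["label"]) for sample in submitted_pil_image_rgb})

model = AutoModelForImageClassification.from_pretrained(
    checkpoint,
    num_labels=len(labels),
    id2label={int(i): int(c) for i, c in enumerate(labels)},
    label2id={int(c): int(i) for i, c in enumerate(labels)}
)
model.config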

Finally, we define the training arguments and start the training process.

In [None]:
training_args = TrainingArguments(
  output_dir="./image-classifier",
  per_device_train_batch_size=16,
  evaluation_strategy="steps",
  num_train_epochs=1,
  fp16=False, # True if you have a GPU with mixed precision support
  save_steps=100,
  eval_steps=100,
  logging_steps=10,
  learning_rate=2e-4,
  save_total_limit=2,
  remove_unused_columns=True,
  push_to_hub=False,
  load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=collate_fn,
    compute_metrics=compute_metrics,
    train_dataset=prepared_ds["train"],
    eval_dataset=prepared_ds["test"],
    tokenizer=processor,
)

train_results = trainer.train()
trainer.save_model()
trainer.log_metrics("train", train_results.metrics)
trainer.save_metrics("train", train_results.metrics)
trainer.save_state()

Because the annotated training data is of higher quality, we can expect a better model. So, we can update the remainder of our original dataset with the new model's suggestions.

In [None]:
pipe = pipeline("image-classification", model=model, image_processor=processor)

def run_inference(batch):
    predictions = pipe(batch["image"])
    batch["image_label"] = [prediction[0]["label"] for prediction in predictions]
    batch["image_label.score"] = [prediction[0]["score"] for prediction in predictions]
    return batch

hf_dataset = hf_dataset.map(run_inference, batched=True)
dataset.records.log(records=hf_dataset[:100], mapping={"image_data_uri": "image"})

## Conclusions

In this tutorial, we presented an end-to-end example of an image preference task. This serves as the base, but it can be performed iteratively and seamlessly integrated into your workflow to ensure high-quality curation of your data and improved results.

We started by configuring the dataset and adding records with suggestions. After the annotation process, we trained a new model with the annotated data and updated the remaining records with the new suggestions.