# **1. Data Generation Pipeline for Metric Learning**

**Project:** FashionCLIP (The Seeker)
**Author:** [Your Name]
**Goal:** This notebook details the first and most critical step of the project: creating a high-quality, large-scale dataset for training a metric learning model.

---

### **Overview**
Standard image datasets are often unsuitable for teaching a model nuanced concepts of similarity. To fine-tune our CLIP-based model, we need to explicitly provide it with examples of what makes two images similar or different.

This pipeline programmatically generates a dataset of **triplets** and **semi-positives** from a small set of source images. The structure is as follows:

-   **Anchor**: A base reference image.
-   **Positive**: A slightly modified version of the anchor (e.g., same image, different text color). It should be "closer" to the anchor than any other image.
-   **Negative**: An image from a completely different base image. It should be "far" from the anchor.
-   **Semi-Positive**: A significantly modified version of the anchor (e.g., same image, different text caption). It should be "further" than the positive but "closer" than the negative.

This structured dataset is the key to training our model with the `TripletSemiPosMarginWithDistanceLoss`.
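
To make the intended ordering concrete, here is a minimal pure-Python sketch of a triplet margin loss whose margin shrinks as the positive becomes more of a semi-positive. The function name `triplet_semipos_loss`, the linear margin scaling, and the toy 2-D embeddings are assumptions made for illustration; the actual `TripletSemiPosMarginWithDistanceLoss` may be defined differently.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_semipos_loss(anchor, pos, neg, semipos, margin=1.0):
    """Hypothetical triplet loss with a semipos-scaled margin.

    semipos is 0.0 for a true positive and grows towards 1.0 as the
    positive drifts away from the anchor, so a semi-positive is asked
    to beat the negative by a smaller margin than a true positive.
    """
    effective_margin = margin * (1.0 - semipos)
    gap = euclidean(anchor, pos) - euclidean(anchor, neg)
    return max(0.0, gap + effective_margin)


# Toy embeddings (assumed values, not produced by the real model)
anchor = [0.0, 0.0]
positive = [0.1, 0.0]
negative = [1.0, 0.0]

loss_true_pos = triplet_semipos_loss(anchor, positive, negative, semipos=0.0)
loss_semi_pos = triplet_semipos_loss(anchor, positive, negative, semipos=0.5)
```

With these toy values the true positive still incurs a small loss, because it has not yet cleared the full margin, while the same geometry with `semipos=0.5` already satisfies the relaxed margin.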

### **2. Configuration and Setup**

First, we'll import the necessary libraries and define the core configuration parameters for our data generation process. This includes setting up paths, defining constants for image manipulation (like text padding and font sizes), and specifying the number of examples to generate. This centralized configuration makes the script easy to modify and reproduce.

In [1]:
import sys
import os
import numpy as np
import pandas as pd

from datasets import Dataset

module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from src.dataprep.data_prep import (TEXT_PATH,
                                    MIN_LENGTH_TEXTS,
                                    MAX_LENGHT_TEXTS,
                                    NUM_TEXTS,
                                    DATA_PATH,
                                    ALL_FONTS,
                                    ALL_COLOURS,
                                    NUM_IMAGES,
                                    NUM_TRIPLETS_PER_EXAMPLE,
                                    DATASET_SAVE_PATH)

from src.dataprep.data_prep import generate_examples, generate_multi_triplets


In [2]:
with open(TEXT_PATH) as f:
    lines = f.readlines()
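# Strip newline characters (and any literal '/n' sequences) from each line,
# then keep the non-empty lines whose raw length is within the configured bounds.
cleaned_lines = [
    cleaned
    for l in lines
    if (cleaned := l.replace('/n', '').replace('\n', '')) != ''
    and MIN_LENGTH_TEXTS < len(l) < MAX_LENGHT_TEXTS
]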

all_texts = cleaned_lines
texts_list = np.random.choice(all_texts, NUM_TEXTS)

image_names_list = [f for _, _, files in os.walk(DATA_PATH) for f in files if f.endswith('.png')][:NUM_IMAGES]
df = generate_examples(image_names_list, fonts_list=ALL_FONTS, colour_list=ALL_COLOURS, texts_list=texts_list, image_will_be_cropped=True)

df

| | file_path | caption | actual_caption | font | colour | semipos | position | font_given |
|---|---|---|---|---|---|---|---|---|
| 0 | /home/fvelasco/data/research/the-seeker/data/p... | lieutenant-general." | lieuteynant-gene ral." | /usr/share/fonts/smc/Meera.ttf | black | 0.100000 | (1267, 444) | 40 |
| 1 | /home/fvelasco/data/research/the-seeker/data/p... | with the idea of his speedy departure. | with the idea of his speedy departure. | /usr/share/fonts/smc/Meera.ttf | black | 0.000000 | (1009, 359) | 40 |
| 2 | /home/fvelasco/data/research/the-seeker/data/p... | "And I also," said Franz. | "And I also," said Franz. | /usr/share/fonts/smc/Meera.ttf | black | 0.000000 | (1118, 438) | 40 |
| 3 | /home/fvelasco/data/research/the-seeker/data/p... | Greeks, and hence arises the calumny." | Greeks, and hence arises the calumny." | /usr/share/fonts/smc/Meera.ttf | black | 0.000000 | (1264, 451) | 40 |
| 4 | /home/fvelasco/data/research/the-seeker/data/p... | grandpapa is again thinking of it." | grandpVpo is fagain thinking of Sit." | /usr/share/fonts/smc/Meera.ttf | black | 0.114286 | (817, 693) | 42 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 295 | /home/fvelasco/data/research/the-seeker/data/p... | lieutenant-general." | lieutenant-general." | /usr/share/fonts/smc/Meera.ttf | red | 0.000000 | (1304, 373) | 40 |
| 296 | /home/fvelasco/data/research/the-seeker/data/p... | with the idea of his speedy departure. | with the idea of his speedy departure. | /usr/share/fonts/smc/Meera.ttf | red | 0.000000 | (1263, 379) | 40 |
| 297 | /home/fvelasco/data/research/the-seeker/data/p... | "And I also," said Franz. | "And also," saidFranJz. | /usr/share/fonts/smc/Meera.ttf | red | 0.120000 | (1116, 349) | 40 |
| 298 | /home/fvelasco/data/research/the-seeker/data/p... | Greeks, and hence arises the calumny." | Greeks, and hence arises the calumny." | /usr/share/fonts/smc/Meera.ttf | red | 0.000000 | (659, 473) | 48 |
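
In the output, rows whose `actual_caption` differs from `caption` carry a non-zero `semipos` value. One plausible way to produce such pairs is to apply a few random character edits to the caption and record the number of edits divided by the caption length. The sketch below illustrates that idea; the helper `perturb_caption` and its scoring rule are assumptions, not the implementation inside `generate_examples`.

```python
import random
import string


def perturb_caption(caption, num_edits, rng):
    """Apply num_edits random character edits; return (new_text, score).

    Hypothetical helper: the score is num_edits / len(caption), which is
    only a guess at how a semipos-like value could be derived.
    """
    chars = list(caption)
    for _ in range(num_edits):
        op = rng.choice(["insert", "delete", "substitute"])
        if op == "insert" or not chars:
            # randint is inclusive on both ends, so appending is allowed
            chars.insert(rng.randint(0, len(chars)), rng.choice(string.ascii_letters))
        elif op == "delete":
            del chars[rng.randrange(len(chars))]
        else:
            chars[rng.randrange(len(chars))] = rng.choice(string.ascii_letters)
    score = num_edits / len(caption) if caption else 0.0
    return "".join(chars), score


rng = random.Random(0)  # fixed seed so the sketch is reproducible
caption = "with the idea of his speedy departure."
actual_caption, semipos = perturb_caption(caption, num_edits=4, rng=rng)
```

Four edits on this 38-character caption give a score of 4/38, roughly 0.105, which is the same order of magnitude as the `semipos` values in the table.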


### **3. Generating Examples and Building Triplets**

With the configuration in place, we proceed to the core logic:

1.  **`generate_examples`**: This function iterates through our source images and programmatically creates many variations by rendering captions onto them with different fonts, colors, and positions. Each unique variation is saved as a new image.
2.  **`generate_multi_triplets`**: For each generated image, which now serves as an anchor, this function draws positive and negative partners from the full set of examples and records a `semipos` score for each triplet. `NUM_TRIPLETS_PER_EXAMPLE` controls how many triplets are built per anchor.

The final output is a Pandas DataFrame which is then saved as a Hugging Face `Dataset` object for efficient loading during training.
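
The sampling step can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: the record layout (`path`, `source`, `caption`, `semipos`), the helper name `build_triplets`, and the pairing rule (positives share the anchor's source image and caption, negatives come from a different source image) are hypothetical and may differ from what `generate_multi_triplets` actually does.

```python
import random


def build_triplets(examples, num_triplets, rng):
    """Sample (anchor, pos, neg, semipos) tuples for every anchor.

    Assumed rule: a positive shares the anchor's source image and caption,
    a negative comes from a different source image.
    """
    triplets = []
    for anchor in examples:
        positives = [e for e in examples
                     if e["source"] == anchor["source"]
                     and e["caption"] == anchor["caption"]
                     and e["path"] != anchor["path"]]
        negatives = [e for e in examples if e["source"] != anchor["source"]]
        if not positives or not negatives:
            continue  # this anchor cannot form a valid triplet
        for _ in range(num_triplets):
            pos = rng.choice(positives)
            neg = rng.choice(negatives)
            triplets.append((anchor["path"], pos["path"], neg["path"], pos["semipos"]))
    return triplets


# Toy records standing in for rows of the examples table (made-up values)
examples = [
    {"path": "a_black.png", "source": "a.png", "caption": "hello", "semipos": 0.0},
    {"path": "a_red.png", "source": "a.png", "caption": "hello", "semipos": 0.2},
    {"path": "b_black.png", "source": "b.png", "caption": "hello", "semipos": 0.0},
    {"path": "b_red.png", "source": "b.png", "caption": "hello", "semipos": 0.0},
]
triplets = build_triplets(examples, num_triplets=2, rng=random.Random(0))
```

Each of the four toy anchors has exactly one candidate positive and two candidate negatives, so the call produces eight triplets.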

In [None]:
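# generate_multi_triplets(df, n) returns the callable that df.apply invokes on each row;
# each call yields a list of triplet records for that anchor row.
part_res = df.apply(generate_multi_triplets(df, NUM_TRIPLETS_PER_EXAMPLE), axis=1).to_list()

# Flatten the per-row lists into a single list of records.
res = [l for p in part_res for l in p]
df_result = pd.DataFrame(res, columns=['anchor', 'pos', 'neg', 'anchor_is_semipos', 'semipos', 'caption', 'actual_caption'])
# Keep only the triplets whose anchor is not itself flagged as a semi-positive.
df_result = df_result[df_result.anchor_is_semipos == 0]
df_result.reset_index(drop=True, inplace=True)
df_result.drop(['anchor_is_semipos', 'actual_caption'], axis=1, inplace=True)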
df_result

| | anchor | pos | neg | semipos | caption |
|---|---|---|---|---|---|
| 0 | /home/fvelasco/data/research/the-seeker/data/p... | /home/fvelasco/data/research/the-seeker/data/p... | /home/fvelasco/data/research/the-seeker/data/p... | 0.105263 | with the idea of his speedy departure. |
| 1 | /home/fvelasco/data/research/the-seeker/data/p... | /home/fvelasco/data/research/the-seeker/data/p... | /home/fvelasco/data/research/the-seeker/data/p... | 0.131579 | with the idea of his speedy departure. |
| 2 | /home/fvelasco/data/research/the-seeker/data/p... | /home/fvelasco/data/research/the-seeker/data/p... | /home/fvelasco/data/research/the-seeker/data/p... | 0.000000 | with the idea of his speedy departure. |
| 3 | /home/fvelasco/data/research/the-seeker/data/p... | /home/fvelasco/data/research/the-seeker/data/p... | /home/fvelasco/data/research/the-seeker/data/p... | 0.000000 | with the idea of his speedy departure. |
| 4 | /home/fvelasco/data/research/the-seeker/data/p... | /home/fvelasco/data/research/the-seeker/data/p... | /home/fvelasco/data/research/the-seeker/data/p... | 0.000000 | with the idea of his speedy departure. |
| ... | ... | ... | ... | ... | ... |
| 1575 | /home/fvelasco/data/research/the-seeker/data/p... | /home/fvelasco/data/research/the-seeker/data/p... | /home/fvelasco/data/research/the-seeker/data/p... | 0.078947 | Greeks, and hence arises the calumny." |
| 1576 | /home/fvelasco/data/research/the-seeker/data/p... | /home/fvelasco/data/research/the-seeker/data/p... | /home/fvelasco/data/research/the-seeker/data/p... | 0.105263 | Greeks, and hence arises the calumny." |
| 1577 | /home/fvelasco/data/research/the-seeker/data/p... | /home/fvelasco/data/research/the-seeker/data/p... | /home/fvelasco/data/research/the-seeker/data/p... | 0.105263 | Greeks, and hence arises the calumny." |
| 1578 | /home/fvelasco/data/research/the-seeker/data/p... | /home/fvelasco/data/research/the-seeker/data/p... | /home/fvelasco/data/research/the-seeker/data/p... | 0.131579 | Greeks, and hence arises the calumny." |


In [None]:

def add_image(self, image_name, raw_folder, num_augment, place_images_randomly,
                predefined_positions, keep_asp_ratio, padding, min_percent, max_percent, train_or_test,
                logo_class, logo_list, processed_folder, logo_folders):
    """
    Generate images with the logo pasted on them and store them on disk.
    A df row with the placement information is also saved for each one.

    Parameters
    ----------
    -image_name: str, name of the image to be considered
    -raw_folder: str, name of the folder where the raw images are
    -num_augment: int, number of augmented images to generate per logo
    -place_images_randomly: bool, whether to place logos randomly or use predefined positions
    -predefined_positions: list of tuples (x, y, x_max, y_max) of floats: boxes where the logos must be placed.
    If there is more than one, the image contains more than one logo
    -keep_asp_ratio: bool, whether to keep the aspect ratio of the logo or not
    -padding: float, minimum distance kept between the logo and the image borders
    -min_percent, max_percent: float, lower and upper bounds of the random scale factor applied to the logo size
    -train_or_test: str, subfolder indicating whether this is a train or a test image
    -logo_class: class (agency) of the logo for storing
    -logo_list: list of logos to be added (could be all PEGI for instance)
    -processed_folder: str, path to where the processed images will be added
    -logo_folders: str, path from the root to the folder where the raw logo images are stored
    """
    im_path = str(self.root / raw_folder / image_name)
    im_name = image_name
    im = Image.open(im_path)

    for logo_name in logo_list:
        logo_path = str(self.root / logo_folders / logo_name)
        logo = Image.open(logo_path).convert("RGBA")  # logo image
        # will not resize image if it is a corner case
        if not (logo_folders in IMAGE_GENERATION_CORNER_CASE_LOGOS):
            if logo.size[0] > IMAGE_GENERATION_MAX_LOGO_PIXELS or \
                    logo.size[1] > IMAGE_GENERATION_MAX_LOGO_PIXELS:
                # scale both axes by the same factor, based on the larger side,
                # so the logo fits within the limit and keeps its aspect ratio
                scale = IMAGE_GENERATION_MAX_LOGO_PIXELS / max(logo.size)
                logo = self.resize_logo_x_y(logo, scale, scale)

        if len(logo_name.split('.')[0].split('-')) > 1:
            # swap order country-logoname to logoname-country
            country = logo_name.split('.')[0].split('-')[0]
            l_name = logo_name.split('.')[0].split('-')[1] + '-' + country
        else:
            l_name = logo_name.split('.')[0]  # remove extension only
        for j in range(num_augment):
            # Logo distortion
            if place_images_randomly:
                if keep_asp_ratio:
                    percentage_x = np.random.uniform(low=min_percent, high=max_percent)
                    percentage_y = percentage_x
                else:
                    percentage_x = np.random.uniform(low=min_percent, high=max_percent)
                    percentage_y = np.random.uniform(low=min_percent, high=max_percent)

                if (padding < int(im.width - (padding + percentage_x * logo.width))) and (
                        padding < int(im.height - (padding + percentage_y * logo.height))):
                    x = np.random.randint(0 + padding,
                                            int(im.width - (padding + percentage_x * logo.width)))
                    y = np.random.randint(0 + padding,
                                            int(im.height - (padding + percentage_y * logo.height)))
                else:
                    break
                new_path = str(self.root / processed_folder / train_or_test / str(
                    l_name + '_' + logo_class + '_' + str(j) + '_' + im_name))

                self.include_and_save_image(new_path, x, y, percentage_x, percentage_y, logo_class, im, logo)

            else:
                # there can be more than a logo iff there are text boxes and/or in_game_purchases
                # please note then, the order will be logo, purchases, text boxes.
                # in case there is no purchases, an empty tuple must be included
                for i, position in enumerate(predefined_positions):
                    logo_classes = [logo_class, 'in_game_purchases', 'text_boxes']
                    logo_names = [logo_name, 'in_game_purchases.png', 'text_box_1.jpg']
                    folders = [logo_folders, 'in_game_purchases', 'text_boxes']
                    # Will only save the image with the actual logo
                    save_image = i < 1
                    if not position:
                        continue
                    else:
                        logo_path = str(self.root / folders[i] / logo_names[i])
                        # In case there is more than one logo in the image, all will be included
                        # in the df, but a single image will be saved
                        logo_df_class = logo_classes[i]
                        logo = Image.open(logo_path).convert("RGBA")  # logo image
                        x, y, x_max, y_max = position
                        percentage_x = (x_max - x) / logo.width
                        percentage_y = (y_max - y) / logo.height

                        new_path = str(self.root / processed_folder / train_or_test / str(
                            l_name + '_' + logo_class + '_' + str(j) + '_' + im_name))

                        self.include_and_save_image(new_path, x, y, percentage_x, percentage_y,
                                                    logo_df_class, im, logo,
                                                    save_image=save_image)
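
The random-placement branch of `add_image` boils down to a bounds check followed by a uniform draw of the top-left corner. The standalone sketch below isolates that logic; the helper name `sample_logo_position` and the use of integer padding are assumptions made so the example runs without PIL or NumPy.

```python
import random


def sample_logo_position(im_size, logo_size, padding, scale_x, scale_y, rng):
    """Return a random top-left (x, y) for the scaled logo, or None if it
    cannot fit inside the image while keeping `padding` pixels on each side.

    Hypothetical helper: padding is assumed to be an int number of pixels.
    """
    im_w, im_h = im_size
    logo_w, logo_h = logo_size
    # Largest admissible coordinate (exclusive) on each axis
    max_x = int(im_w - (padding + scale_x * logo_w))
    max_y = int(im_h - (padding + scale_y * logo_h))
    if padding >= max_x or padding >= max_y:
        return None
    # randrange excludes the upper bound, like np.random.randint
    return rng.randrange(padding, max_x), rng.randrange(padding, max_y)


rng = random.Random(0)
position = sample_logo_position((100, 80), (20, 10), 5, 1.0, 1.0, rng)
too_big = sample_logo_position((30, 30), (40, 40), 5, 1.0, 1.0, rng)
```

Returning `None` lets the caller decide whether to skip this attempt or stop altogether, which is the decision the original loop makes with `break`.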


### **4. Conclusion**

The notebook has generated and saved our structured dataset. Each row holds the paths to an anchor, a positive, and a negative image, together with a `semipos` score and the caption, ready to be fed into our training pipeline.

**Next Step:** Use this dataset in `03_triplet_training_new_metric.ipynb` to fine-tune the `ImageEncoderNetwork`.