## IMAGE EMBEDDER
This notebook implements a complete pipeline to **download a batch of images from a URL list**, use the **OpenAI CLIP (ViT-B/32) model** to **convert them into vector embeddings**, and finally **merge all vectors** into a single HDF5 file.

### 1. Initialization and Setup
Installs all necessary Python libraries, including **torch**, **h5py**, **gdown** *(for Google Drive downloads)*, **Pillow** *(for image processing)*, **ipywidgets** *(for tqdm)*, and most importantly, **clip** *(installed directly from OpenAI's GitHub)*.

In [2]:
%pip install requests boto3 h5py gdown torch tqdm numpy Pillow ipywidgets git+https://github.com/openai/CLIP.git

Collecting git+https://github.com/openai/CLIP.git
  Cloning https://github.com/openai/CLIP.git to /tmp/pip-req-build-bchimaqz
  Running command git clone --filter=blob:none --quiet https://github.com/openai/CLIP.git /tmp/pip-req-build-bchimaqz
  Resolved https://github.com/openai/CLIP.git to commit dcba3cb2e2827b402d2701e7e1c7d9fed8a20ef1
  Preparing metadata (setup.py) ... done
Note: you may need to restart the kernel to use updated packages.


In [1]:
import json
import h5py
from concurrent.futures import ThreadPoolExecutor
import subprocess
import clip
import gdown
from PIL import Image
import torch
from tqdm.auto import tqdm
import numpy as np
import os

### 2. Data and Tool Preparation

Downloads the `downloader.py` utility script, part of the **Open Images dataset** toolkit. This script is used for efficient, parallel downloading of the dataset images and is stored in the `.cache` directory.

In [3]:
os.makedirs(".cache", exist_ok=True)
if not os.path.isfile(".cache/downloader.py"):
    command = ["wget", "-O", ".cache/downloader.py", "https://raw.githubusercontent.com/openimages/dataset/master/downloader.py"]
    subprocess.run(command, check=True)

Downloads metadata files (`image_ids.json`, `image_urls.json`) derived from the **Open Images V7 dataset**, hosted on a Google Drive mirror. This data is used in accordance with the original **CC BY 4.0 License**.

In [4]:
os.makedirs("RAW_DATASET", exist_ok=True)
if not os.path.isfile("RAW_DATASET/image_ids.json"):
    gdown.download(id='1-HcMviWpMn84cDaDkDpPXqEr-6BFc0bv', output='RAW_DATASET/image_ids.json', quiet=False)
if not os.path.isfile("RAW_DATASET/image_urls.json"):
    gdown.download(id='1aaeCYKWQFva8M-ene1whUyZReSnzjoLA', output='RAW_DATASET/image_urls.json', quiet=False)
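The later cells index `image_IDs[i]` and `image_URLs[i]` with the same integer, so the two files are presumably parallel JSON arrays. A minimal sanity check under that assumption might look like the sketch below; `check_parallel_lists` is a hypothetical helper, not part of the notebook.

```python
import json


def check_parallel_lists(ids_path, urls_path):
    """Load both JSON files and verify they are lists of equal length.

    Assumption: each file holds a JSON array, and entry i of one file
    describes the same image as entry i of the other.
    """
    with open(ids_path, "r", encoding="utf-8") as f:
        ids = json.load(f)
    with open(urls_path, "r", encoding="utf-8") as f:
        urls = json.load(f)

    if not isinstance(ids, list) or not isinstance(urls, list):
        raise TypeError("expected both files to contain JSON arrays")
    if len(ids) != len(urls):
        raise ValueError(f"length mismatch: {len(ids)} IDs vs {len(urls)} URLs")
    return ids, urls
```

Calling it as `check_parallel_lists("RAW_DATASET/image_ids.json", "RAW_DATASET/image_urls.json")` right after the download would surface a truncated or mismatched file before any images are fetched.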

### 3. Model Loading and Function Definitions

In [5]:
def read_json(ten_file):
    with open(ten_file, 'r', encoding='utf-8') as f:
        du_lieu = json.load(f)
        return du_lieu

In [6]:
image_IDs = read_json("RAW_DATASET/image_ids.json")
image_URLs = read_json("RAW_DATASET/image_urls.json")
n_images = len(image_URLs)
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

In [7]:
def load_and_preprocess_image(i):
    """
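    Read the image file for index i, preprocess it, and return (url, tensor) with the tensor on the CPU.
    Returns None if the file is missing or cannot be processed.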
    """
    image_path = f".cache/Images/{image_IDs[i]}.jpg"
    if os.path.exists(image_path):
        try:
            # The context manager closes the file handle after preprocessing
            with Image.open(image_path) as image:
                # 'preprocess' is CLIP's transform; it returns a tensor on the CPU
                image_tensor = preprocess(image)
            return image_URLs[i], image_tensor
        except Exception as e:
            # Skip corrupted images
            print(f"Error processing image {image_path}: {e}")
            return None
    return None
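
Because `load_and_preprocess_image` returns `None` for missing or unreadable images, whatever calls it has to filter those results out. The sketch below illustrates that pattern with a thread pool; `fake_load` is a hypothetical stand-in for the real loader, and `collect_valid` is an illustrative helper rather than code from this notebook.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_load(i):
    """Hypothetical stand-in for load_and_preprocess_image: odd indices 'fail'."""
    if i % 2 == 1:
        return None
    return f"url-{i}", i * 10


def collect_valid(indices, loader, max_workers=4):
    """Run loader over indices in a thread pool and drop None results.

    Iterating the futures list (rather than using as_completed) keeps
    the results in submission order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(loader, i) for i in indices]
        results = [future.result() for future in futures]
    return [r for r in results if r is not None]


pairs = collect_valid(range(6), fake_load)
print(pairs)  # [('url-0', 0), ('url-2', 20), ('url-4', 40)]
```

Keeping submission order means each URL stays paired with its own tensor and the output follows the order of the input list.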

In [8]:
def merge_HDF5_files(input_list, output_file):
    """
    Merge data from multiple HDF5 files (all sharing the same structure: urls, embeddings)
    """
    if not input_list:
        print("❌ Error: Input file list is empty.")
        return

    total_records = 0

    # 1. Use the first existing file to determine the structure and initialize the datasets
    first_file = None
    for f_path in input_list:
        if os.path.exists(f_path):
            first_file = f_path
            break

    if not first_file:
        print("❌ Error: No valid input files found.")
        return

    with h5py.File(first_file, 'r') as f_first:
        # Read the embedding dimension and dtypes from the dataset metadata
        # (avoids loading the data and also works when the file holds a single record)
        embed_shape = f_first['embeddings'].shape[-1]
        embed_dtype = f_first['embeddings'].dtype
        url_dtype = f_first['urls'].dtype

    with h5py.File(output_file, 'w') as f_output:
        f_output.create_dataset(
            'urls',
            shape=(0,),
            maxshape=(None,),
            dtype=url_dtype,
            chunks=True
        )
        f_output.create_dataset(
            'embeddings',
            shape=(0, embed_shape),
            maxshape=(None, embed_shape),
            dtype=embed_dtype,
            chunks=True
        )

    pbar = tqdm(total = len(input_list), desc="Merging")

    # 2. Loop over the inputs and append their data to the output file (opened in 'a' mode)
    with h5py.File(output_file, 'a') as f_output:
        for file_path in input_list:
            if not os.path.exists(file_path):
                print(f"⚠️ File not found: {file_path}")
                continue

            try:
                with h5py.File(file_path, 'r') as f_input:
                    current_urls = f_input['urls'][:]
                    num_records = current_urls.shape[0]
                    # Force a 2-D (num_records, dim) layout; a plain squeeze would
                    # also drop the record axis when a file holds a single record
                    current_embeddings = f_input['embeddings'][:].reshape(num_records, -1)

                    if num_records == 0:
                        continue

                    dset_urls = f_output['urls']
                    dset_embeddings = f_output['embeddings']

                    new_size = total_records + num_records

                    # Resize the datasets in the output file
                    dset_urls.resize(new_size, axis=0)
                    dset_embeddings.resize(new_size, axis=0)

                    # Write the data into the newly allocated space
                    dset_urls[total_records:new_size] = current_urls
                    dset_embeddings[total_records:new_size] = current_embeddings

                    # Update the running record count
                    total_records = new_size

            except Exception as e:
                print(f"❌ Error processing file {file_path}: {e}. Skipping this file.")

            pbar.update(1)
    pbar.close()

### 4. Main Processing Pipeline (Loop)

This cell runs the main logic. The pipeline processes the image list in **chunks**, which limits memory and disk usage and means a failed chunk does not affect chunks that were already saved.
* For **each chunk**, the script will:
    1.  **Download Images:** Call the `downloader.py` script to download the images for the current chunk.
    2.  **Preprocess (CPU):** Use a `ThreadPoolExecutor` to run the `load_and_preprocess_image` function in parallel.
    3.  **Encode (GPU):** Gather preprocessed images into large batches, push them to the GPU, and run `model.encode_image` to get the vectors.
    4.  **Save Chunk:** Write the URLs and vectors for this chunk into a temporary HDF5 file (e.g., `OUTPUT/Images_Embedded_0.h5`).
    5.  **Cleanup Chunk:** Delete the temporary image directory (`.cache/Images`).
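
The chunking described above can be sketched as a small generator. `chunk_ranges` is a hypothetical helper written for illustration; it produces the same index ranges and file indices as a loop over `range(START, END, CHUNK)` that names files with `int(x / CHUNK)`, assuming non-negative integer inputs.

```python
def chunk_ranges(start, end, chunk):
    """Yield (chunk_index, lo, hi) for the half-open index ranges [lo, hi)."""
    for lo in range(start, end, chunk):
        # The last chunk is clipped so it never runs past `end`
        yield lo // chunk, lo, min(end, lo + chunk)


print(list(chunk_ranges(0, 25, 10)))
# [(0, 0, 10), (1, 10, 20), (2, 20, 25)]
```

With 25 images and a chunk size of 10, the third chunk holds only the remaining 5 images, and its temporary file would carry index 2.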

In [12]:
START = 0
END = n_images
CHUNK = 10000
GPU_BATCH_SIZE = 512 
NUM_WORKERS = 50
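
These values determine how much memory the preprocessed tensors occupy. A rough estimate, assuming CLIP ViT-B/32 preprocessing yields one 3 x 224 x 224 float32 tensor per image, and counting input tensors only (not model weights or activations):

```python
# Assumption: one 3 x 224 x 224 float32 tensor per preprocessed image
CHANNELS, HEIGHT, WIDTH = 3, 224, 224
BYTES_PER_FLOAT32 = 4


def tensor_bytes(n_images):
    """Bytes needed to hold n_images preprocessed input tensors."""
    return n_images * CHANNELS * HEIGHT * WIDTH * BYTES_PER_FLOAT32


chunk_gib = tensor_bytes(10_000) / 2**30  # a full chunk kept in CPU memory
batch_mib = tensor_bytes(512) / 2**20     # one batch copied to the GPU
print(f"chunk: {chunk_gib:.2f} GiB, batch: {batch_mib:.0f} MiB")
# chunk: 5.61 GiB, batch: 294 MiB
```

If a full chunk of tensors is kept in a Python list before encoding, `CHUNK` is the setting to lower on machines with limited RAM, while `GPU_BATCH_SIZE` is the one to lower when GPU memory runs out.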

In [None]:
os.makedirs("OUTPUT", exist_ok=True) 

pbar = tqdm(total = END - START, desc = "Embedding: ")

for x in range(START, END, CHUNK):

    with open(".cache/list_images.txt", 'w', encoding='utf-8') as file:
        for i in range(x, min(n_images, x + CHUNK)):
            file.write("train/" + image_IDs[i] + "\n")

    # Download the images for this chunk
    os.makedirs(".cache/Images", exist_ok=True)
    subprocess.run([
        "python", ".cache/downloader.py", ".cache/list_images.txt",
        "--download_folder=.cache/Images", "--num_processes=100"
    ])

    # 1. Use a thread pool to READ and PREPROCESS the images (CPU)
    preprocessed_data_cpu = [] 
    
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as executor:
        futures = [executor.submit(load_and_preprocess_image, i) for i in range(x, min(n_images, x + CHUNK))]
        
        for future in futures:
            result = future.result()
            if result:
                preprocessed_data_cpu.append(result)
            pbar.update(1)

    # If no images were processed in this chunk, skip it
    if not preprocessed_data_cpu:
        print(f"No images were processed in chunk {int(x / CHUNK)}")
        subprocess.run(["rm", "-rf", ".cache/Images"])
        continue

    # 2. Encode in BATCHES on the GPU
    final_urls = []
    final_embeddings_list = []

    urls_cpu = [item[0] for item in preprocessed_data_cpu]
    tensors_cpu = [item[1] for item in preprocessed_data_cpu]

    # Disable gradient tracking (VERY IMPORTANT for inference)
    with torch.no_grad():
        for i in range(0, len(tensors_cpu), GPU_BATCH_SIZE):
            batch_tensors_cpu = tensors_cpu[i : i + GPU_BATCH_SIZE]
            batch_urls = urls_cpu[i : i + GPU_BATCH_SIZE]
            
            batch_on_cpu = torch.stack(batch_tensors_cpu)
            batch_on_gpu = batch_on_cpu.to(device)

            vectors_gpu = model.encode_image(batch_on_gpu)
            vectors_gpu = vectors_gpu / vectors_gpu.norm(dim=-1, keepdim=True)
            
            vectors_numpy = vectors_gpu.cpu().detach().numpy().astype(np.float32)

            final_urls.extend(batch_urls)
            final_embeddings_list.append(vectors_numpy)

    # 3. WRITE THE HDF5 FILE
    output_path = f"OUTPUT/Images_Embedded_{int(x / CHUNK)}.h5"
    if not os.path.exists(output_path) and final_embeddings_list:
        all_embeddings_numpy = np.vstack(final_embeddings_list)
        
        with h5py.File(output_path, "w") as outfile:
            outfile.create_dataset("urls", data=np.array(final_urls, dtype='S'))
            outfile.create_dataset("embeddings", data=all_embeddings_numpy)

    # 4. DELETE TEMP IMAGES
    subprocess.run(["rm", "-rf", ".cache/Images"])

pbar.close()
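
The loop divides each embedding by its norm before saving it. A pure-Python sketch of that L2 normalization follows, showing why it is useful: for unit-length vectors, the dot product equals the cosine similarity. The helper names `l2_normalize` and `dot` are illustrative only.

```python
import math


def l2_normalize(vec):
    """Scale vec to unit Euclidean length (the same idea as v / v.norm())."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])
print(a)          # [0.6, 0.8]
print(dot(a, b))  # cosine similarity of the original vectors, about 0.96
```

Storing normalized vectors means a later similarity search over the HDF5 file only needs dot products.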


* **Merge:** Call `merge_HDF5_files` to combine all temporary HDF5 chunk files into a single final file: `OUTPUT/Images_Embedded.h5`.
* **Final Cleanup:** Delete all the temporary HDF5 chunk files.

In [None]:
# Merge the HDF5 chunk files
file_chunks = [
    f"OUTPUT/Images_Embedded_{int(x / CHUNK)}.h5"
    for x in range(START, END, CHUNK)
]
file_gop_cuoi = "OUTPUT/Images_Embedded.h5"
merge_HDF5_files(file_chunks, file_gop_cuoi)

# Delete the chunk files (a chunk with no processed images never produced one)
for file_path in file_chunks:
    if os.path.exists(file_path):
        os.remove(file_path)