# Process the Unsplash dataset with CLIP

This notebook runs every downloaded photo through OpenAI's [CLIP neural network](https://github.com/openai/CLIP). Each image yields a feature vector of 512 floating-point numbers, which we store in a file. Later, these image feature vectors are compared against text feature vectors.
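
To make the later comparison concrete, here is a minimal sketch of how normalized image vectors could be ranked against a single text vector. It uses random NumPy arrays as stand-ins for real CLIP features, and `rank_photos` is a hypothetical helper name, not part of this notebook or of the CLIP library.

```python
import numpy as np

def rank_photos(photo_features, text_features, top_k=3):
    """Return indices of the top_k photos most similar to the text vector.

    Both inputs are assumed to be L2-normalized, so the dot product
    equals the cosine similarity.
    """
    similarities = photo_features @ text_features
    # argsort is ascending, so reverse it and keep the first top_k entries
    return np.argsort(similarities)[::-1][:top_k]

# Toy stand-ins for real CLIP features: 4 photos, 512 dimensions each
rng = np.random.default_rng(0)
photo_features = rng.normal(size=(4, 512))
photo_features /= np.linalg.norm(photo_features, axis=-1, keepdims=True)

# Pretend the text query matches photo 2 exactly
text_features = photo_features[2]
best = rank_photos(photo_features, text_features, top_k=1)
```

Because the vectors have unit length, no extra division by the norms is needed at query time.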

This step will be significantly faster if you have a GPU, but it will also work on the CPU.

## Load the photos

Load all photos from the folder where they were stored.

In [2]:
from pathlib import Path

# Set the path to the photos
dataset_version = "bso" 
photos_path = Path("/home/cluster/fkraeu/data/bso-image-segmentation/data/images")

# List all JPGs in the folder, sorted so that the batch order is the same on every run
photos_files = sorted(photos_path.glob("*.jpg"))

# Print some statistics
print(f"Photos found: {len(photos_files)}")

Photos found: 27473
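
On a case-sensitive file system, the pattern `*.jpg` can miss files named `.JPG` or `.jpeg`. If your folder mixes such names, a variant along these lines may help. This is only a sketch: `list_photos` and its `extensions` parameter are hypothetical names introduced for illustration.

```python
from pathlib import Path

def list_photos(folder, extensions=(".jpg", ".jpeg")):
    """Return a sorted list of files whose suffix matches, ignoring case."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in extensions
    )
```

Calling `list_photos(photos_path)` would then take the place of the `glob` call above.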


## Load the CLIP net

Load the CLIP model and define the function that computes the feature vectors.

In [3]:
import clip
import torch
from PIL import Image

# Load the CLIP model (ViT-B/32) on the GPU if one is available, otherwise on the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Function that computes the feature vectors for a batch of images
def compute_clip_features(photos_batch):
    # Load all the photos from the files
    photos = [Image.open(photo_file) for photo_file in photos_batch]
    
    # Preprocess all photos
    photos_preprocessed = torch.stack([preprocess(photo) for photo in photos]).to(device)

    with torch.no_grad():
        # Encode the photos batch to compute the feature vectors and normalize them
        photos_features = model.encode_image(photos_preprocessed)
        photos_features /= photos_features.norm(dim=-1, keepdim=True)

    # Transfer the feature vectors back to the CPU and convert to numpy
    return photos_features.cpu().numpy()
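
The normalization step inside `compute_clip_features` scales every feature vector to unit length. The NumPy sketch below mirrors that operation on a tiny hand-made array so the effect is easy to inspect. `l2_normalize` is a hypothetical helper, not a function from CLIP or PyTorch.

```python
import numpy as np

def l2_normalize(features):
    """Scale each row to unit length, like x / x.norm(dim=-1, keepdim=True) in PyTorch."""
    norms = np.linalg.norm(features, axis=-1, keepdims=True)
    return features / norms

features = np.array([[3.0, 4.0], [0.0, 2.0]])
normalized = l2_normalize(features)
# Every row of `normalized` now has length 1, e.g. [3, 4] becomes [0.6, 0.8]
```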

## Process all photos

Next, we compute the features for all photos. We do this in batches because encoding many images at once is much more efficient than encoding them one by one. Tune the batch size so that a batch fits into your GPU memory. Processing on the GPU is fairly fast, so the bottleneck will probably be loading the photos from disk.

The feature vectors and the photo IDs of each batch are saved to their own files. This makes the process more robust: if a run is interrupted, batches that were already saved are skipped on the next run. We will merge the data later.
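
As a quick check of the batch arithmetic, the sketch below computes the slice boundaries for the 27473 photos found above, assuming a batch size of 16. `batch_slices` is a hypothetical helper used only for illustration.

```python
import math

def batch_slices(n_items, batch_size):
    """Yield (batch_index, start, stop) for every batch; the last one may be shorter."""
    n_batches = math.ceil(n_items / batch_size)
    for i in range(n_batches):
        start = i * batch_size
        stop = min(start + batch_size, n_items)
        yield i, start, stop

slices = list(batch_slices(27473, 16))
# 1718 batches in total; the last one covers only photo index 27472
```

Python list slicing already clips at the end of the list, so the explicit `min` is there only to make the size of the last batch visible.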

In [5]:
import math
import numpy as np
import pandas as pd

# Define the batch size so that it fits on your GPU. You can also do the processing on the CPU, but it will be slower.
batch_size = 16

# Path where the feature vectors will be stored (created if it does not exist yet)
features_path = Path("unsplash-dataset") / dataset_version / "features"
features_path.mkdir(parents=True, exist_ok=True)

# Compute how many batches are needed
batches = math.ceil(len(photos_files) / batch_size)

# Process each batch
for i in range(batches):
    print(f"Processing batch {i+1}/{batches}")

    batch_ids_path = features_path / f"{i:010d}.csv"
    batch_features_path = features_path / f"{i:010d}.npy"
    
    # Only do the processing if the batch wasn't processed yet
    if not batch_features_path.exists():
        try:
            # Select the photos for the current batch
            batch_files = photos_files[i*batch_size : (i+1)*batch_size]

            # Compute the features and save to a numpy file
            batch_features = compute_clip_features(batch_files)
            np.save(batch_features_path, batch_features)

            # Save the photo IDs to a CSV file
            photo_ids = [photo_file.name.split(".")[0] for photo_file in batch_files]
            photo_ids_data = pd.DataFrame(photo_ids, columns=['photo_id'])
            photo_ids_data.to_csv(batch_ids_path, index=False)
        except Exception as e:
            # Catch problems with the processing to make the process more robust,
            # while still allowing a keyboard interrupt to stop the loop
            print(f'Problem with batch {i}: {e}')

Processing batch 1/1718
Processing batch 2/1718
Processing batch 3/1718
...
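
Failed batches are only reported in the printed log, so it can be useful to check which batches are incomplete before merging. The sketch below assumes the file naming used in the loop above (`{i:010d}.npy` and `{i:010d}.csv`). `find_incomplete_batches` is a hypothetical helper.

```python
from pathlib import Path

def find_incomplete_batches(features_dir, n_batches):
    """Return batch indices that lack either their .npy or their .csv file."""
    features_dir = Path(features_dir)
    incomplete = []
    for i in range(n_batches):
        has_features = (features_dir / f"{i:010d}.npy").exists()
        has_ids = (features_dir / f"{i:010d}.csv").exists()
        if not (has_features and has_ids):
            incomplete.append(i)
    return incomplete
```

In this notebook it could be called as `find_incomplete_batches(features_path, batches)`. A batch that has its `.npy` file but no `.csv` file would shift the photo IDs against the features after merging, so it is worth deleting that `.npy` file and rerunning the loop.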

Merge the features and the photo IDs. The resulting files are `features.npy` and `photo_ids.csv`. Feel free to delete the intermediate results.

In [6]:
import numpy as np
import pandas as pd

# Load the per-batch numpy files in order (the pattern skips a previously merged features.npy)
features_list = [np.load(features_file) for features_file in sorted(features_path.glob("[0-9]*.npy"))]

# Concatenate the features and store in a merged file
features = np.concatenate(features_list)
np.save(features_path / "features.npy", features)

# Load the per-batch photo IDs in the same order (the pattern skips a previously merged photo_ids.csv)
photo_ids = pd.concat([pd.read_csv(ids_file) for ids_file in sorted(features_path.glob("[0-9]*.csv"))])
photo_ids.to_csv(features_path / "photo_ids.csv", index=False)
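
Before deleting the intermediate files, it may be worth checking that the merged results line up. The sketch below checks the feature dimension, the row count against the number of photo IDs, and the unit length of every vector. `check_merged` and its tolerance are assumptions made for illustration. The tolerance is deliberately loose because features computed on a GPU may be stored in half precision.

```python
import numpy as np

def check_merged(features, photo_ids, dim=512, atol=1e-2):
    """Return True if the merged data look consistent, otherwise raise ValueError."""
    if features.ndim != 2 or features.shape[1] != dim:
        raise ValueError(f"Unexpected feature shape: {features.shape}")
    if features.shape[0] != len(photo_ids):
        raise ValueError(
            f"{features.shape[0]} feature rows but {len(photo_ids)} photo IDs"
        )
    # Compute the norms in float32 to avoid rounding issues with float16 input
    norms = np.linalg.norm(features.astype(np.float32), axis=-1)
    if not np.allclose(norms, 1.0, atol=atol):
        raise ValueError("Some feature vectors are not normalized")
    return True
```

With the variables from the merge cell, the call would be `check_merged(features, photo_ids)`.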