# THINGS images - Exploring with CLIP

THINGS-images is a freely available database of 26,107 high-quality, manually curated images of 1,854 diverse object concepts, sampled systematically from everyday American English and collected through a large-scale web search. It includes 27 high-level categories, semantic embeddings for all concepts, and additional metadata.

Many interesting datasets have been released based on this one; you can find them here:  
https://things-initiative.org/

You can explore all the categories and selectively download images you want here:
https://things-initiative.org/projects/things-images/

You're going to be using CLIP to explore this dataset, and hopefully discover some cool things about CLIP along the way.

In [1]:
import os
!git clone https://github.com/Srinivas-R/COGS118B_FA25_Project
%cd ./COGS118B_FA25_Project

Cloning into 'COGS118B_FA25_Project'...
remote: Enumerating objects: 16, done.
remote: Counting objects: 100% (16/16), done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 16 (delta 0), reused 16 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (16/16), 1.15 MiB | 3.45 MiB/s, done.
/content/COGS118B_FA25_Project


In [4]:
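# Collect every .jpg in the image folder
all_images = [x for x in os.listdir('THINGS_images/') if x.endswith('.jpg')]
# Filenames look like 'cat_01b.jpg'; dropping the last 8 characters ('_01b.jpg') leaves the category
categories = {x[:-8] for x in all_images}
# Map each category to the list of its image filenames
category2images = {category: [] for category in categories}
for img in all_images:
    category2images[img[:-8]].append(img)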

In [5]:
all_images

['cat_01b.jpg',
 'dog_06s.jpg',
 'mango_03s.jpg',
 'dog_01b.jpg',
 'mango_01b.jpg',
 'cat_04s.jpg']

In [6]:
categories

{'cat', 'dog', 'mango'}

In [7]:
category2images['mango']

['mango_03s.jpg', 'mango_01b.jpg']
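
The fixed-width slice `[:-8]` used above to derive the category only works while every filename ends in an eight-character suffix such as `_01b.jpg`. Here is a minimal sketch of a less brittle grouping, assuming the `<category>_<id>.jpg` naming seen in the listing. `group_by_category` is a hypothetical helper and not part of the project repo, and the filenames `hot_dog_02s.jpg` and `notes.txt` are made up for illustration.

```python
import os
from collections import defaultdict

def group_by_category(filenames):
    """Group .jpg filenames by the text before their last underscore."""
    groups = defaultdict(list)
    for name in filenames:
        stem, ext = os.path.splitext(name)
        if ext.lower() != ".jpg":
            continue  # ignore anything that is not a .jpg image
        category, sep, _ = stem.rpartition("_")
        if not sep:
            category = stem  # no underscore: treat the whole stem as the category
        groups[category].append(name)
    return dict(groups)

# Hypothetical listing; the last two names are invented for this sketch
example = group_by_category(
    ["cat_01b.jpg", "cat_04s.jpg", "mango_03s.jpg", "hot_dog_02s.jpg", "notes.txt"]
)
print(sorted(example))
```

Splitting on the last underscore keeps multi-word categories intact and does not depend on the length of the image id.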

## Note
It turns out that the text encoder bundled with stable-diffusion-2-1-unclip does not project into the same embedding space as its image encoder.
So we pull the OpenCLIP text encoder from the official source instead, since the unCLIP vision encoder is the exact analogue of the OpenCLIP one.

In [8]:
import requests
import torch
from PIL import Image
import matplotlib.pyplot as plt
from io import BytesIO
import numpy as np
from tqdm import tqdm
import pandas as pd
import os

In [9]:
!pip install --upgrade diffusers[torch]
!pip install transformers



In [10]:
from diffusers import StableUnCLIPImg2ImgPipeline
from transformers import CLIPTextModelWithProjection, CLIPTokenizer

In [11]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [12]:
# ──────────────────────────────────────────────────────────────
# 1.  Load unCLIP – vision side only (projection_dim = 1024)   ─
# ──────────────────────────────────────────────────────────────
pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
    "sd2-community/stable-diffusion-2-1-unclip",
    torch_dtype=torch.float16,
).to(device)

vision_encoder = pipe.image_encoder                       # keep as-is (1024-d)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


`torch_dtype` is deprecated! Use `dtype` instead!


In [None]:
# ──────────────────────────────────────────────────────────────
# 2.  Swap in an OpenCLIP ViT-H/14 text branch (also 1024-d)  ─
# ──────────────────────────────────────────────────────────────
openclip_repo = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"     # projection_dim = 1024
tokenizer = CLIPTokenizer.from_pretrained(openclip_repo)
text_encoder = CLIPTextModelWithProjection.from_pretrained(
    openclip_repo,
    torch_dtype=torch.float16
).to(device)

# optional: stuff them into the pipe so `pipe.tokenizer` etc. work
pipe.tokenizer, pipe.text_encoder = tokenizer, text_encoder


In [None]:
# ──────────────────────────────────────────────────────────────
# 3. Helpers
# ──────────────────────────────────────────────────────────────
def embed_images(paths, batch_size=8):
    """Return (N,1024) image embeddings"""
    out, fe, enc = [], pipe.feature_extractor, pipe.image_encoder
    for i in range(0, len(paths), batch_size):
        imgs = [Image.open(p).convert("RGB") for p in paths[i:i + batch_size]]
        px   = fe(imgs, return_tensors="pt").pixel_values.to(enc.device, enc.dtype)
        with torch.no_grad():
            v = enc(px)[0]                              # (B,1024)
        out.append(v)
    return torch.cat(out)  # (N,1024)

def embed_texts(prompts, batch_size=64):
    """Return (N,1024) text embeddings"""
    vecs = []
    for i in range(0, len(prompts), batch_size):
        toks = tokenizer(prompts[i:i + batch_size],
                         padding=True, truncation=True, max_length=77,
                         return_tensors="pt").to(text_encoder.device)
        with torch.no_grad():
            t = text_encoder(**toks).text_embeds        # (B,1024)
        vecs.append(t)
    return torch.cat(vecs)  # (N,1024)


In [None]:
# ──────────────────────────────────────────────────────────────
# 4.  Sanity check on image-text similarity
# ──────────────────────────────────────────────────────────────
img_vec = embed_images(["./THINGS_images/mango_03s.jpg"])
txt_vec = embed_texts(["mango"])
print("cosine(mango image, \"mango\") →",
      (torch.nn.functional.normalize(img_vec, dim=-1) @ torch.nn.functional.normalize(txt_vec, dim=-1).T).item())
# expect ≳ 0.3
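
The sanity check above computes a cosine similarity: both vectors are normalized to unit length and then multiplied. As a rough illustration of what that number means, here is a minimal pure-Python sketch on toy 3-d vectors. The vectors and the `cosine_similarity` helper are assumptions made for this example and are not CLIP outputs.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, chosen only to make the arithmetic easy to follow
same_direction = cosine_similarity([3.0, 4.0, 0.0], [6.0, 8.0, 0.0])
orthogonal = cosine_similarity([3.0, 4.0, 0.0], [0.0, 0.0, 5.0])
print(same_direction, orthogonal)
```

Vectors pointing the same way score 1 regardless of their length, and orthogonal vectors score 0. Only the direction of an embedding matters for this measure.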

In [None]:
# ──────────────────────────────────────────────────────────────
# 5.  Sanity check on image-image similarity
# ──────────────────────────────────────────────────────────────
img_vec = embed_images(["./THINGS_images/mango_03s.jpg",
                       "./THINGS_images/cat_01b.jpg",
                       "./THINGS_images/dog_01b.jpg"])
img_vec2 = embed_images(["./THINGS_images/mango_01b.jpg",
                       "./THINGS_images/cat_04s.jpg",
                       "./THINGS_images/dog_06s.jpg"])

sims = torch.nn.functional.normalize(img_vec, dim=-1) @ torch.nn.functional.normalize(img_vec2, dim=-1).T
sims = sims.detach().cpu().numpy()

In [None]:
fig, ax = plt.subplots()

# Display the data as an image (heatmap)
im = ax.imshow(sims, cmap='viridis')

# Loop over the data and place text annotations
for i in range(sims.shape[0]):
    for j in range(sims.shape[1]):
        ax.text(j, i, f"{float(sims[i, j]):.2f}", ha='center', va='center', color='black')

# Add a colorbar for reference
plt.colorbar(im)
plt.yticks([0, 1, 2], ['mango1', 'cat1', 'dog1'])
plt.xticks([0, 1, 2], ['mango2', 'cat2', 'dog2'])

# Set title and display the plot
ax.set_title('Pairwise similarities between 6 different images')
plt.show()

# A quick exercise: can you build a classifier using the concepts demonstrated above?

In [None]:
# Here are a few test images, and I can tell you they're either a cat or a mango. Write a classifier function that can predict which one each image is.
os.listdir('THINGS_images/test_images/')

In [None]:
def classify(pipe, image):
    # write your classification code here, ideally using CLIP
    pass
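
One possible starting point is nearest-label classification: compare the image embedding with one embedding per candidate label and pick the most similar. The sketch below runs on made-up 2-d vectors so that it works without loading any model. `nearest_label` and the toy vectors are hypothetical; in the notebook the vectors would presumably come from `embed_images` and `embed_texts`.

```python
import math

def _cosine(a, b):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def nearest_label(image_vec, label_vecs):
    """Return the label whose embedding has the highest cosine similarity to image_vec."""
    return max(label_vecs, key=lambda label: _cosine(image_vec, label_vecs[label]))

# Toy stand-ins for text embeddings of the two candidate labels
toy_label_vecs = {"cat": [1.0, 0.1], "mango": [0.1, 1.0]}
# Toy stand-in for an image embedding that lies close to the "cat" direction
toy_image_vec = [0.9, 0.2]

predicted = nearest_label(toy_image_vec, toy_label_vecs)
print(predicted)
```

Taking the maximum over labels avoids having to choose a similarity threshold, which is convenient when every image is known to belong to one of the candidates.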

# We can reconstruct the embeddings back into images using pretrained diffusion models

In [None]:
recon_pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
    "sd2-community/stable-diffusion-2-1-unclip", torch_dtype=torch.float32
)

In [None]:
recon_pipe = recon_pipe.to(device)  # reuse the device chosen earlier instead of hard-coding "cuda"

In [None]:
def reconstruct_image_from_embedding(recon_pipe, embedding):
    return recon_pipe(image_embeds=embedding).images[0]

In [None]:
# reconstructing the first mango embedding
reconstruct_image_from_embedding(recon_pipe, img_vec[0].unsqueeze(0).float())

In [None]:
text_prompt = embed_texts(["yellow, bird"]) # Text Prompt
reconstruct_image_from_embedding(recon_pipe, text_prompt[0].unsqueeze(0).float())

In [None]:
def text_classify(text, img_path, thresh):
  # text - text prompt to embed in CLIP space for classification, e.g. "yellow, bird"
  # img_path - path of the image to classify
  # thresh - minimum cosine similarity that counts as a positive id
  txt_embed = embed_texts([text])
  img_embed = embed_images([img_path])

  similarity = (torch.nn.functional.normalize(img_embed, dim=-1) @ torch.nn.functional.normalize(txt_embed, dim=-1).T).item()

  # Higher cosine similarity means a closer match, so accept at or above the threshold
  return similarity >= thresh

