CLIP From Zero

Type a phrase, find the image. Cross-modal search running entirely in your browser.

Live demo: clip-from-zero.vercel.app
Series: TechFromZero, Day 38 of 50.

This is a tiny, zero-server, no-API-key app that teaches one of the most important ideas in modern AI: embedding text and images into the same vector space. Once they share a space, "find me an image of X" is just argmax(cosine_similarity(encode(X), encode(every_image))), taken over the images: three lines of math, no fine-tuning, no labels.
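The argmax above can be sketched in a few lines. This is a minimal illustration, not the repo's code: cosineSimilarity and bestMatch are hypothetical names, and the 3-d vectors are made-up stand-ins for real 512-d CLIP embeddings.

```typescript
type Embedding = number[];

// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(a: Embedding, b: Embedding): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// argmax over the gallery: index of the image most similar to the query,
// or -1 when the gallery is empty.
function bestMatch(queryEmbedding: Embedding, imageEmbeddings: Embedding[]): number {
  let bestIndex = -1;
  let bestScore = -Infinity;
  imageEmbeddings.forEach((imageEmbedding, i) => {
    const score = cosineSimilarity(queryEmbedding, imageEmbedding);
    if (score > bestScore) {
      bestScore = score;
      bestIndex = i;
    }
  });
  return bestIndex;
}

// Hypothetical toy embeddings, for illustration only.
const toyQuery: Embedding = [1, 0, 0];
const toyGallery: Embedding[] = [[0, 1, 0], [0.9, 0.1, 0], [-1, 0, 0]];
console.log(bestMatch(toyQuery, toyGallery)); // 1
```

In the real app the vectors would come from the text and vision encoders; nothing else in the search step changes.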

It runs OpenAI's CLIP (ViT-B/32) entirely in the browser via Transformers.js + ONNX. The first visit downloads ~150 MB of weights, which the browser caches for later visits. Image embeddings are cached in IndexedDB, so warm loads are instant.
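The embedding cache follows a get-or-compute pattern, sketched below. Everything here is an assumption for illustration: EmbeddingStore, MemoryStore and getOrComputeEmbedding are hypothetical names, and the Map-backed store is a stand-in so the sketch runs outside a browser. In the app, an IndexedDB wrapper would sit behind the same interface.

```typescript
// Minimal async key-value interface; an IndexedDB wrapper could implement it.
interface EmbeddingStore {
  get(key: string): Promise<Float32Array | undefined>;
  put(key: string, value: Float32Array): Promise<void>;
}

// In-memory stand-in (hypothetical, not the repo's cache.ts).
class MemoryStore implements EmbeddingStore {
  private entries = new Map<string, Float32Array>();

  async get(key: string): Promise<Float32Array | undefined> {
    return this.entries.get(key);
  }

  async put(key: string, value: Float32Array): Promise<void> {
    this.entries.set(key, value);
  }
}

// Return the cached embedding if present; otherwise compute it once and store it.
async function getOrComputeEmbedding(
  store: EmbeddingStore,
  key: string,
  compute: () => Promise<Float32Array>,
): Promise<Float32Array> {
  const cached = await store.get(key);
  if (cached) return cached;
  const fresh = await compute();
  await store.put(key, fresh);
  return fresh;
}
```

One reasonable design choice is to build the key from both the model name and the image URL, so that switching models never serves stale vectors.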

Quick start

git clone https://github.com/dev48v/clip-from-zero.git
cd clip-from-zero
npm install
npm run dev

Open the URL Vite prints, wait for the one-time model download, and start typing. Try the suggestion chips first: "a cat in the sun", "something to eat", "a place to visit".

How it works

CLIP is two encoders glued together by a shared output space:

"a corgi puppy"  ─▶  text encoder (transformer)  ─▶  [0.04, -0.12, …]  ─┐
                                                                         ├─▶  cosine sim
[image bytes]    ─▶  vision encoder (ViT-B/32)   ─▶  [0.05, -0.10, …]  ─┘

Both encoders output a 512-dimensional vector. They were trained on 400M (text, image) pairs scraped from the web so that paired captions and images land near each other. Once that's true, any new sentence and any new image can be compared by distance.

Cosine similarity between L2-normalised vectors reduces to a dot product:

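// a and b are L2-normalised 512-d embeddings (e.g. Float32Array)
let dot = 0
for (let i = 0; i < 512; i++) dot += a[i] * b[i]
// dot now holds the cosine similarity of a and b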
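Why the shortcut works: cosine similarity is a·b / (‖a‖ ‖b‖), and once both vectors have length 1 the denominator is 1. The sketch below checks this numerically on a toy pair. The function names (l2Normalize, dotProduct, cosineFull) are hypothetical and not taken from clip.ts.

```typescript
// Scale a vector to unit length (L2 norm of 1).
function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return v.map((x) => x / norm);
}

function dotProduct(a: number[], b: number[]): number {
  let dot = 0;
  for (let i = 0; i < a.length; i++) dot += a[i] * b[i];
  return dot;
}

// Full cosine formula, for comparison with the normalised dot product.
function cosineFull(a: number[], b: number[]): number {
  return dotProduct(a, b) / Math.sqrt(dotProduct(a, a) * dotProduct(b, b));
}

const u = [3, 4];
const w = [4, 3];
console.log(cosineFull(u, w));                           // 24 / 25 = 0.96
console.log(dotProduct(l2Normalize(u), l2Normalize(w))); // same value, up to rounding
```

Normalising once at encode time means every later comparison skips the two square roots.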

Ranking N images against one query takes N × 512 multiply-adds. For a few thousand images that is a few million operations, which a modern phone finishes well within a single frame.
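A ranking pass might look like the sketch below. It assumes the N gallery embeddings are stored back to back in one Float32Array and are already unit length. rankGallery and RankedHit are hypothetical names, and the repo may lay out its data differently.

```typescript
const DIM = 512; // CLIP ViT-B/32 embedding size

interface RankedHit {
  index: number;
  score: number;
}

// gallery holds N embeddings back to back: N * dim floats in total.
function rankGallery(
  queryVec: Float32Array,
  gallery: Float32Array,
  dim: number = DIM,
): RankedHit[] {
  const count = Math.floor(gallery.length / dim);
  const hits: RankedHit[] = [];
  for (let n = 0; n < count; n++) {
    const offset = n * dim;
    let score = 0;
    // dim multiply-adds per image, so N * dim for the whole gallery.
    for (let i = 0; i < dim; i++) score += queryVec[i] * gallery[offset + i];
    hits.push({ index: n, score });
  }
  // Highest similarity first.
  return hits.sort((a, b) => b.score - a.score);
}

// Toy usage with dim = 2 so the numbers are easy to follow.
const demoHits = rankGallery(
  new Float32Array([1, 0]),
  new Float32Array([0, 1, 1, 0, 0.6, 0.8]),
  2,
);
console.log(demoHits.map((h) => h.index)); // [1, 2, 0]
```

A flat typed array keeps the inner loop free of object lookups, which helps on low-end devices, though for a 24-image gallery any layout is fast enough.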

Step-by-step commits

Each commit on main adds one concept. Read them in order to follow the build:

Step  What lands
1     Vite + React + TS scaffold
2     Load CLIP via Transformers.js, show download progress
3     Text encoder + cosine similarity utility
4     Image encoder + 24-image curated gallery
5     Ranked search: encode query → dot-product → sort
6     Score badges + gold/silver/bronze for the top 3
7     IndexedDB cache for embeddings (warm reloads = instant)
8     README + footer + polish
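Steps 5 and 6 together amount to a sort followed by a little labelling. The sketch below is one way to do it. The types, the addBadges name, and the three-decimal badge format are assumptions, not the repo's actual UI code.

```typescript
type Medal = "gold" | "silver" | "bronze" | null;

interface Scored {
  id: string;
  score: number;
}

interface Badged extends Scored {
  rank: number;  // 1-based position after sorting
  medal: Medal;  // only the top 3 get one
  badge: string; // score formatted for display
}

const MEDALS: Medal[] = ["gold", "silver", "bronze"];

function addBadges(results: Scored[]): Badged[] {
  return [...results] // copy so the caller's array is not reordered
    .sort((a, b) => b.score - a.score)
    .map((r, i) => ({
      ...r,
      rank: i + 1,
      medal: i < MEDALS.length ? MEDALS[i] : null,
      badge: r.score.toFixed(3),
    }));
}
```

For example, scores of 0.2, 0.31, 0.25 and 0.1 would come back ordered 0.31 (gold), 0.25 (silver), 0.2 (bronze), 0.1 (no medal).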

Why this matters

CLIP-style embeddings are the foundation of:

  • Image search at scale (Pinterest, Shopify, Unsplash all use a CLIP-family model)
  • Prompt conditioning in text-to-image models (Stable Diffusion v1 conditions on CLIP's text encoder)
  • Dataset deduplication ("which of these 50M images are near-duplicates?")
  • Zero-shot classification ("which of these 10 labels best fits this image?" — no training needed)
  • Content moderation ("does this image semantically resemble the policy violations we've labelled?")
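Zero-shot classification, for instance, reuses the same dot product: embed each label as text, score it against the image, and turn the scores into probabilities with a softmax. The sketch below assumes unit-length vectors and a logit scale of 100, roughly the temperature CLIP learned during training. zeroShotProbs is a hypothetical name and the 2-d vectors are made up.

```typescript
// Softmax over scaled cosine similarities between one image and several labels.
function zeroShotProbs(
  imageVec: number[],
  labelVecs: number[][],
  logitScale = 100, // assumption: approximates CLIP's learned temperature
): number[] {
  const logits = labelVecs.map((label) => {
    let dot = 0;
    for (let i = 0; i < imageVec.length; i++) dot += imageVec[i] * label[i];
    return logitScale * dot;
  });
  // Subtract the max logit before exponentiating, for numerical stability.
  const maxLogit = Math.max(...logits);
  const exps = logits.map((l) => Math.exp(l - maxLogit));
  const total = exps.reduce((sum, e) => sum + e, 0);
  return exps.map((e) => e / total);
}

// Toy example: the image points mostly along the first label's direction.
const labelProbs = zeroShotProbs([1, 0], [[0.8, 0.6], [0.6, 0.8]]);
console.log(labelProbs); // first label gets nearly all the probability mass
```

The large scale factor is why small gaps in cosine similarity turn into confident predictions.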

If you understand cosine similarity, a 512-d vector, and a single dot product, you understand the heart of many embedding-based vision pipelines built in the last few years.

File map

src/
  main.tsx      ← React entrypoint
  App.tsx       ← search UI, gallery state machine
  clip.ts       ← loadClip(), encodeText(), encodeImage(), cosineSim()
  cache.ts      ← IndexedDB get/put for embeddings
  images.ts     ← 24 curated Pollinations.ai prompts (deterministic seeds)
  styles.css    ← dark theme + ranked-grid layout

License

MIT. Use it, fork it, teach with it.
