CLIP From Zero

Type a phrase, find the image. Cross-modal search running entirely in your browser.

Live demo: clip-from-zero.vercel.app

Series: TechFromZero, Day 38 of 50.

This is a tiny, zero-server, no-API-key app that teaches one of the most important ideas in modern AI: embedding text and images into the same vector space. Once they share a space, "find me an image of X" reduces to picking the image whose embedding has the highest cosine similarity to the query: argmax over images of cosine_similarity(encode(X), encode(image)). That is a few lines of math, with no fine-tuning and no labels.

It runs OpenAI's CLIP (ViT-B/32) entirely in the browser via Transformers.js + ONNX. The first visit downloads ~150 MB of weights, which the browser caches for later visits. Image embeddings are cached in IndexedDB, so warm reloads are instant.
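The embedding cache can be sketched as a read-through helper: look the vector up by key, and run the encoder only on a miss. The `EmbeddingStore` interface, `MemoryStore` class, and `getOrEncode` function below are hypothetical names for illustration, not the contents of the repo's cache.ts. In the browser the store would be backed by IndexedDB; the in-memory stand-in keeps this sketch runnable anywhere.

```typescript
// Hypothetical async key-value interface; in the browser this would wrap IndexedDB.
interface EmbeddingStore {
  get(key: string): Promise<Float32Array | undefined>
  put(key: string, value: Float32Array): Promise<void>
}

// In-memory stand-in (an assumption for this sketch, not the app's real store).
class MemoryStore implements EmbeddingStore {
  private map = new Map<string, Float32Array>()
  async get(key: string): Promise<Float32Array | undefined> {
    return this.map.get(key)
  }
  async put(key: string, value: Float32Array): Promise<void> {
    this.map.set(key, value)
  }
}

// Return the cached embedding if present; otherwise compute it once and store it.
async function getOrEncode(
  store: EmbeddingStore,
  key: string,
  encode: () => Promise<Float32Array>,
): Promise<Float32Array> {
  const hit = await store.get(key)
  if (hit) return hit
  const vec = await encode()
  await store.put(key, vec)
  return vec
}
```

Keying the cache on something stable per image (for example its URL) means the expensive vision-encoder pass runs once per image per browser, and later visits only pay for the lookup.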

Quick start

git clone https://github.com/dev48v/clip-from-zero.git
cd clip-from-zero
npm install
npm run dev

Open the URL Vite prints, wait for the one-time model download, and start typing. Try the suggestion chips first: "a cat in the sun", "something to eat", "a place to visit".

How it works

CLIP is two encoders glued together by a shared output space:

"a corgi puppy"  ─▶  text encoder (transformer)  ─▶  [0.04, -0.12, …]  ─┐
                                                                         ├─▶  cosine sim
[image bytes]    ─▶  vision encoder (ViT-B/32)   ─▶  [0.05, -0.10, …]  ─┘

Each encoder outputs a 512-dimensional vector. The two were trained jointly, with a contrastive objective, on 400M (text, image) pairs scraped from the web, so that matching captions and images land near each other and mismatched ones land far apart. Once that's true, any new sentence and any new image can be compared by the angle between their vectors.

Cosine similarity between L2-normalised vectors reduces to a dot product:

// a and b are L2-normalised embeddings of equal length (512 here)
let dot = 0
for (let i = 0; i < a.length; i++) dot += a[i] * b[i]
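The reduction holds because cosine similarity divides the dot product by both vector lengths, and after L2 normalisation both lengths are 1. A minimal sketch that checks this numerically; the function names are illustrative and are not taken from clip.ts:

```typescript
// Scale a vector to unit length (L2 norm of 1). Assumes v is not the zero vector.
function l2Normalise(v: number[]): number[] {
  const norm = Math.hypot(...v)
  return v.map(x => x / norm)
}

// Plain dot product of two equal-length vectors.
function dotProduct(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) => sum + x * b[i], 0)
}

// Full cosine similarity, including the division by both norms.
function cosine(a: number[], b: number[]): number {
  return dotProduct(a, b) / (Math.hypot(...a) * Math.hypot(...b))
}

const u = [3, 4]
const v = [4, 3]
const full = cosine(u, v)                                  // 24 / (5 * 5)
const viaDot = dotProduct(l2Normalise(u), l2Normalise(v))  // same value, no division at query time
```

Normalising once, when an embedding is created, moves the square roots and divisions out of the search loop.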

Ranking N images against one query costs N × 512 multiply-adds. For a few thousand images that is a few million operations, which a modern phone finishes well within a single frame.
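Put together, ranked search is a map, a sort, and a slice. The sketch below assumes every embedding is already L2-normalised; the `Embedded` shape and the `rank` function are hypothetical and use toy 2-d vectors in place of real 512-d CLIP outputs.

```typescript
// Hypothetical shape for a gallery entry; the app's real types may differ.
interface Embedded {
  id: string
  vec: Float32Array // assumed to be L2-normalised already
}

// Dot product of two equal-length vectors.
function dot(a: Float32Array, b: Float32Array): number {
  let sum = 0
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i]
  return sum
}

// Score every image against the query and return the best k, highest score first.
function rank(
  query: Float32Array,
  images: Embedded[],
  k: number,
): { id: string; score: number }[] {
  return images
    .map(img => ({ id: img.id, score: dot(query, img.vec) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
}

// Toy gallery: "kitten" points mostly in the same direction as "cat".
const gallery: Embedded[] = [
  { id: "cat", vec: new Float32Array([1, 0]) },
  { id: "pizza", vec: new Float32Array([0, 1]) },
  { id: "kitten", vec: new Float32Array([0.8, 0.6]) },
]
const top = rank(new Float32Array([1, 0]), gallery, 2)
```

Here `top` lists "cat" first and "kitten" second. A full sort is O(N log N), which is fine for a gallery of this size; a much larger collection would more likely use a partial selection or an approximate nearest-neighbour index.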

Step-by-step commits

Each commit on main adds one concept. Read them in order to follow the build:

Step	What lands
1	Vite + React + TS scaffold
2	Load CLIP via Transformers.js, show download progress
3	Text encoder + cosine similarity utility
4	Image encoder + 24-image curated gallery
5	Ranked search: encode query → dot-product → sort
6	Score badges + gold/silver/bronze for the top 3
7	IndexedDB cache for embeddings (warm reloads = instant)
8	README + footer + polish

Why this matters

CLIP-style embeddings are the foundation of:

Image search at scale (Pinterest, Shopify, Unsplash all use a CLIP-family model)
Stable Diffusion / Midjourney prompt conditioning
Dataset deduplication ("which of these 50M images are near-duplicates?")
Zero-shot classification ("which of these 10 labels best fits this image?" — no training needed)
Content moderation ("does this image semantically resemble the policy violations we've labelled?")
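Zero-shot classification from the list above is the same dot product run in the other direction: embed each candidate label as text, score the labels against one image embedding, and apply a softmax. The sketch below uses toy 2-d unit vectors, and the `classify` and `softmax` names are hypothetical. The default scale of 100 is an assumption, chosen to mirror the logit scale of the released CLIP models.

```typescript
// Softmax over similarity scores after multiplying by a scale factor.
function softmax(scores: number[], scale: number): number[] {
  const scaled = scores.map(s => s * scale)
  const max = Math.max(...scaled) // subtract the max for numerical stability
  const exps = scaled.map(s => Math.exp(s - max))
  const total = exps.reduce((acc, e) => acc + e, 0)
  return exps.map(e => e / total)
}

// Pick the label whose text embedding is most similar to the image embedding.
// All vectors are assumed to be L2-normalised.
function classify(
  image: number[],
  labels: { name: string; vec: number[] }[],
  scale = 100, // assumption: roughly CLIP's logit scale
): { label: string; probs: number[] } {
  const sims = labels.map(l => l.vec.reduce((sum, x, i) => sum + x * image[i], 0))
  const probs = softmax(sims, scale)
  let best = 0
  for (let i = 1; i < probs.length; i++) if (probs[i] > probs[best]) best = i
  return { label: labels[best].name, probs }
}

// Toy example: the image vector leans towards the "cat" label.
const result = classify(
  [0.6, 0.8],
  [
    { name: "dog", vec: [1, 0] },
    { name: "cat", vec: [0, 1] },
  ],
)
```

In this toy case `result.label` is "cat". The large scale sharpens small similarity gaps into confident probabilities, which is why raw cosine scores that look close can still produce a near-certain prediction.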

If you understand cosine similarity, a 512-d vector, and a single dot product, you understand the core of many production image-retrieval pipelines from the last few years.

File map

src/
  main.tsx      ← React entrypoint
  App.tsx       ← search UI, gallery state machine
  clip.ts       ← loadClip(), encodeText(), encodeImage(), cosineSim()
  cache.ts      ← IndexedDB get/put for embeddings
  images.ts     ← 24 curated Pollinations.ai prompts (deterministic seeds)
  styles.css    ← dark theme + ranked-grid layout

License

MIT. Use it, fork it, teach with it.
