
Vector Visualizer

Store text chunks in a vector database, visualize how they cluster in embedding space, and run similarity searches — all in a single Jupyter notebook.

The experiment

We built 10 synthetic documents: 5 about river banks (erosion, ecosystems, floods) and 5 about financial banks (loans, regulation, fintech). The word "bank" appears in both sets, spelled identically. Yet when we embed the documents into vectors, they form two clearly separated clusters: evidence that embedding models capture meaning, not just keywords.

[Concept figure] Blue = river banks (nature) · Orange = financial banks (institutions)

[Figure] 2D t-SNE visualization of the embedded chunks

What you'll learn

Part A · Chunking for RAG: load 10 markdown docs, split into overlapping chunks
Part B · Embeddings + vector DB: embed chunks with a local model, store in Chroma
Part C · Dimensionality reduction: t-SNE projects 384-dimensional vectors into 2D/3D for visualization
Part D · Semantic retrieval: query the database and see which cluster responds
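The overlapping-chunk idea in Part A can be sketched without any dependencies. This toy character-level splitter is a simplified stand-in for the notebook's actual splitter, but it shows why overlap matters: text cut at a chunk boundary still appears whole in at least one chunk.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks where each chunk repeats the
    last `overlap` characters of the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap  # how far the window advances each time
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

# A 500-character document yields 3 chunks of up to 200 chars,
# each sharing 50 chars with its neighbor.
doc = "abcdefghij" * 50
chunks = chunk_text(doc, chunk_size=200, overlap=50)
```

In the real notebook the splitting is character-count based as well, but a production splitter also tries to break on paragraph and sentence boundaries rather than mid-word.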

Quick start

uv sync

Open vector_visualizer.ipynb and run all cells.

Set your HuggingFace token (free to create in your HuggingFace account settings):

# in your .env file
HF_TOKEN=your_token_here

No paid API required — the default embedding model (all-MiniLM-L6-v2) runs locally via sentence-transformers.
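Under the hood, similarity search is a nearest-neighbor lookup over embedding vectors. A minimal NumPy sketch, using hypothetical 4-dimensional toy vectors in place of the real 384-dimensional MiniLM embeddings:

```python
import numpy as np

def cosine_top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 2):
    """Return the indices and scores of the k most similar documents."""
    # Normalize so a plain dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q
    top = np.argsort(sims)[::-1][:k]  # highest similarity first
    return top, sims[top]

# Toy "embeddings" standing in for 384-d MiniLM vectors.
docs = np.array([
    [1.0, 0.1, 0.0, 0.0],   # river-bank-like
    [0.9, 0.2, 0.1, 0.0],   # river-bank-like
    [0.0, 0.1, 1.0, 0.3],   # finance-like
    [0.1, 0.0, 0.9, 0.4],   # finance-like
])
query = np.array([1.0, 0.0, 0.05, 0.0])  # e.g. "erosion along the bank"
idx, scores = cosine_top_k(query, docs, k=2)
# idx picks the two river-bank-like documents
```

Chroma performs the same kind of nearest-neighbor search, just with an index that scales past a brute-force dot product.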

Project structure

vector_visualizer/
├── vector_visualizer.ipynb       # the notebook — run this
├── knowledge-base/
│   ├── river_banks/              # 5 docs about river banks (nature/geography)
│   └── financial_banks/          # 5 docs about financial banks (institutions)
├── pyproject.toml                # dependencies managed by uv
└── .env                          # optional API keys (HF_TOKEN, OPENAI_API_KEY)

Tech stack

  • LangChain — document loading, text splitting, embedding abstraction
  • Chroma — lightweight vector database with persistence
  • sentence-transformers — local embedding model (all-MiniLM-L6-v2, 384 dimensions)
  • Plotly — interactive 2D/3D scatter plots
  • scikit-learn — t-SNE for dimensionality reduction
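The t-SNE step from the stack above can be sketched with scikit-learn directly. This uses random toy clusters in place of the notebook's real embeddings; note that perplexity must be smaller than the number of samples, so with only 10 points a small value is needed.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
# Two toy clusters standing in for the river-bank and financial-bank chunks.
river = rng.normal(loc=0.0, scale=0.1, size=(5, 384))
finance = rng.normal(loc=1.0, scale=0.1, size=(5, 384))
vectors = np.vstack([river, finance])

# Project 384-d vectors down to 2-d for plotting.
tsne = TSNE(n_components=2, perplexity=3, random_state=42, init="random")
coords = tsne.fit_transform(vectors)  # shape (10, 2)
```

The resulting `coords` array is what gets handed to Plotly as x/y scatter coordinates, with point colors taken from each chunk's source folder.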

Optional: OpenAI embeddings

To use OpenAI's text-embedding-3-large (3072 dimensions) instead of the local model, add your key to .env:

OPENAI_API_KEY=your_key_here

Then in the notebook, uncomment the OpenAI lines and comment out the HuggingFace line.
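A sketch of that swap, assuming the current split-package LangChain integrations (`langchain_huggingface` / `langchain_openai`); exact import paths depend on your installed LangChain version:

```python
# Default: local 384-d model, no API key needed.
from langchain_huggingface import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# OpenAI alternative: 3072-d vectors, reads OPENAI_API_KEY from the environment.
# from langchain_openai import OpenAIEmbeddings
# embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
```

Because both classes implement the same LangChain embeddings interface, the rest of the notebook (Chroma storage, retrieval, t-SNE) works unchanged with either backend.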

Credits

The visualization approach in this notebook is based on work by Ed Donner. The original notebook used a similar LangChain + Chroma + t-SNE + Plotly stack for embedding visualization. This project adapts the concept with synthetic data to illustrate semantic disambiguation of the word "bank".
