Store text chunks in a vector database, visualize how they cluster in embedding space, and run similarity searches — all in a single Jupyter notebook.
We built 10 synthetic documents: 5 about river banks (erosion, ecosystems, floods) and 5 about financial banks (loans, regulation, fintech). The word "bank" appears in both sets, written exactly the same way. Yet when we embed them into vectors, they form two clearly separated clusters — proof that embedding models understand meaning, not just keywords.
Blue = river banks (nature) · Orange = financial banks (institutions)
| Notebook part | Concept | What happens |
|---|---|---|
| Part A | Chunking for RAG | Load 10 markdown docs, split into overlapping chunks |
| Part B | Embeddings + vector DB | Embed chunks with a local model, store in Chroma |
| Part C | Dimensionality reduction | t-SNE projects 384D vectors into 2D/3D for visualization |
| Part D | Semantic retrieval | Query the database and see which cluster responds |
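The overlapping split in Part A can be sketched in a few lines of plain Python. This is a simplified stand-in for LangChain's text splitter used in the notebook, which additionally respects paragraph and sentence boundaries; the sizes here are illustrative:

```python
# Naive fixed-size chunking with overlap, a simplified stand-in for the
# LangChain splitter used in Part A (chunk sizes are illustrative).
def chunk_text(text: str, size: int = 100, overlap: int = 20) -> list[str]:
    step = size - overlap  # advance less than `size` so chunks share text
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "x" * 250
chunks = chunk_text(doc)
# 3 chunks; each consecutive pair shares 20 characters of overlap.
```

The overlap means a sentence cut at a chunk boundary still appears whole in the neighboring chunk, which improves retrieval recall.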
```bash
uv sync
```

Open vector_visualizer.ipynb and run all cells.
Set your HuggingFace token (free, create one here):

```bash
# in your .env file
HF_TOKEN=your_token_here
```

No paid API required — the default embedding model (all-MiniLM-L6-v2) runs locally via sentence-transformers.
```
vector_visualizer/
├── vector_visualizer.ipynb   # the notebook — run this
├── knowledge-base/
│   ├── river_banks/          # 5 docs about river banks (nature/geography)
│   └── financial_banks/      # 5 docs about financial banks (institutions)
├── pyproject.toml            # dependencies managed by uv
└── .env                      # optional API keys (HF_TOKEN, OPENAI_API_KEY)
```
- LangChain — document loading, text splitting, embedding abstraction
- Chroma — lightweight vector database with persistence
- sentence-transformers — local embedding model (all-MiniLM-L6-v2, 384 dimensions)
- Plotly — interactive 2D/3D scatter plots
- scikit-learn — t-SNE for dimensionality reduction
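Under the hood, the semantic retrieval in Part D reduces to cosine similarity between the query vector and each stored chunk vector, which Chroma computes at scale. A minimal stand-in using hypothetical toy 3-D vectors in place of real 384-D embeddings:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Toy 3-D "embeddings" standing in for the notebook's 384-D vectors.
store = {
    "river bank erosion": [0.9, 0.1, 0.0],
    "bank loan approval": [0.1, 0.9, 0.1],
}
query = [0.85, 0.2, 0.05]  # a nature-flavored query vector
best = max(store, key=lambda k: cosine(query, store[k]))
# best == "river bank erosion"
```

The real vector database does the same ranking, just with learned embeddings and an index instead of a linear scan.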
To use OpenAI's text-embedding-3-large (3072 dimensions) instead of the local model, add your key to .env:

```bash
OPENAI_API_KEY=your_key_here
```

Then in the notebook, uncomment the OpenAI lines and comment out the HuggingFace line.
The visualization approach in this notebook is based on work by Ed Donner. The original notebook used a similar LangChain + Chroma + t-SNE + Plotly stack for embedding visualization. This project adapts the concept with synthetic data to illustrate semantic disambiguation of the word "bank".
