Store text chunks in a vector database, visualize how they cluster in embedding space, and run similarity searches — all in a single Jupyter notebook.
We built 10 synthetic documents: 5 about river banks (erosion, ecosystems, floods) and 5 about financial banks (loans, regulation, fintech). The word "bank" appears in both sets, written exactly the same way. Yet when we embed them into vectors, they form two clearly separated clusters — proof that embedding models understand meaning, not just keywords.
Blue = river banks (nature) · Orange = financial banks (institutions)
| Notebook part | Concept | What happens |
|---|---|---|
| Part A | Chunking for RAG | Load 10 markdown docs, split into overlapping chunks |
| Part B | Embeddings + vector DB | Embed chunks with a local model, store in Chroma |
| Part C | Dimensionality reduction | t-SNE projects 384D vectors into 2D/3D for visualization |
| Part D | Semantic retrieval | Query the database and see which cluster responds |
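The overlapping split in Part A can be sketched in a few lines of plain Python. This is a simplified stand-in for LangChain's text splitter used in the notebook, which additionally respects paragraph and sentence boundaries; the sizes here are illustrative:

```python
# Naive fixed-size chunking with overlap, a simplified stand-in for the
# LangChain splitter used in Part A (chunk sizes are illustrative).
def chunk_text(text: str, size: int = 100, overlap: int = 20) -> list[str]:
    step = size - overlap  # advance less than `size` so chunks share text
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "x" * 250
chunks = chunk_text(doc)
# 3 chunks; each consecutive pair shares 20 characters of overlap.
```

The overlap means a sentence cut at a chunk boundary still appears whole in the neighboring chunk, which improves retrieval recall.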
```bash
uv sync
```

Open vector_visualizer.ipynb and run all cells.
Set your HuggingFace token (free, create one here):

```bash
# in your .env file
HF_TOKEN=your_token_here
```

No paid API required — the default embedding model (all-MiniLM-L6-v2) runs locally via sentence-transformers.
```
vector_visualizer/
├── vector_visualizer.ipynb   # the notebook — run this
├── knowledge-base/
│   ├── river_banks/          # 5 docs about river banks (nature/geography)
│   └── financial_banks/      # 5 docs about financial banks (institutions)
├── pyproject.toml            # dependencies managed by uv
└── .env                      # optional API keys (HF_TOKEN, OPENAI_API_KEY)
```
- LangChain — document loading, text splitting, embedding abstraction
- Chroma — lightweight vector database with persistence
- sentence-transformers — local embedding model (all-MiniLM-L6-v2, 384 dimensions)
- Plotly — interactive 2D/3D scatter plots
- scikit-learn — t-SNE for dimensionality reduction
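Under the hood, the semantic retrieval in Part D reduces to cosine similarity between the query vector and each stored chunk vector, which Chroma computes at scale. A minimal stand-in using hypothetical toy 3-D vectors in place of real 384-D embeddings:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Toy 3-D "embeddings" standing in for the notebook's 384-D vectors.
store = {
    "river bank erosion": [0.9, 0.1, 0.0],
    "bank loan approval": [0.1, 0.9, 0.1],
}
query = [0.85, 0.2, 0.05]  # a nature-flavored query vector
best = max(store, key=lambda k: cosine(query, store[k]))
# best == "river bank erosion"
```

The real vector database does the same ranking, just with learned embeddings and an index instead of a linear scan.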
To use OpenAI's text-embedding-3-large (3072 dimensions) instead of the local model, add your key to .env:

```bash
OPENAI_API_KEY=your_key_here
```

Then in the notebook, uncomment the OpenAI lines and comment out the HuggingFace line.
The visualization approach in this notebook is based on work by Ed Donner. The original notebook used a similar LangChain + Chroma + t-SNE + Plotly stack for embedding visualization. This project adapts the concept with synthetic data to illustrate semantic disambiguation of the word "bank".
