A comprehensive, hands-on tutorial that takes you from zero to an advanced understanding of vectors, embeddings, vector databases, and Retrieval-Augmented Generation (RAG) patterns for NLP and LLM applications.
# Clone and navigate to the repository
cd cloned_repo
# Run with Docker Compose - specific module (recommended)
MODULE=1 docker compose up # Module 1: Vector Basics
MODULE=2 docker compose up # Module 2: Text Embeddings
MODULE=3 docker compose up # Module 3: Similarity Search
MODULE=4 docker compose up # Module 4: Vector Databases
MODULE=5 docker compose up # Module 5: Advanced Techniques
MODULE=6 docker compose up # Module 6: RAG Patterns
MODULE=ALL docker compose up # Run all modules sequentially
# Or run locally (requires Python 3.11+)
pip install -r requirements.txt
python main.py
This tutorial is structured into 6 progressive modules:
Module 1: Vector Basics
- What vectors are and why they matter
- Basic vector operations (addition, multiplication, dot product)
- Vector magnitude and normalization
- Introduction to cosine similarity
- Working with high-dimensional vectors
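The vector operations listed for Module 1 can be sketched in a few lines. This is a minimal illustration in pure Python so that it runs without dependencies; the tutorial itself lists NumPy for vector operations, and the function names here are illustrative rather than taken from the module code.

```python
import math

def dot(a, b):
    """Dot product: sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def magnitude(v):
    """Euclidean length (L2 norm) of a vector."""
    return math.sqrt(dot(v, v))

def normalize(v):
    """Scale a vector to unit length, keeping its direction."""
    norm = magnitude(v)
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, in [-1, 1]."""
    return dot(a, b) / (magnitude(a) * magnitude(b))

a = [3.0, 4.0]
b = [4.0, 3.0]

print(dot(a, b))                # 24.0
print(magnitude(a))             # 5.0
print(normalize(a))             # [0.6, 0.8]
print(cosine_similarity(a, b))  # 0.96
```

The same functions work unchanged for vectors with hundreds of dimensions, which is the size typical text embeddings have.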
Module 2: Text Embeddings
- Understanding text embeddings
- From bag-of-words to neural embeddings
- Using Sentence Transformers
- Creating embeddings for real documents
- Embedding properties and characteristics
- Choosing the right embedding model
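To see why Module 2 moves from bag-of-words to neural embeddings, it can help to build the simpler representation first. The sketch below is an assumption-laden toy (whitespace tokenization, made-up job titles, hypothetical helper names) and not the tutorial's own code.

```python
import math
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def cosine_similarity(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Made-up example texts, not taken from example_corpus/
docs = [
    "senior python developer",
    "python developer wanted",
    "experienced software engineer",
]
vocabulary = sorted({word for doc in docs for word in doc.split()})
vectors = [bag_of_words(doc, vocabulary) for doc in docs]

# Shared words give a high score ...
print(round(cosine_similarity(vectors[0], vectors[1]), 3))  # 0.667
# ... but related wording with no shared words scores zero.
print(cosine_similarity(vectors[0], vectors[2]))            # 0.0
```

A "developer" and an "engineer" are close in meaning, yet bag-of-words sees no overlap at all. Neural embedding models are meant to close that gap by placing related texts near each other even when they share no words.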
Module 3: Similarity Search
- Distance metrics (cosine, Euclidean, dot product)
- Building a semantic search engine from scratch
- Semantic search vs keyword search
- Ranking strategies and similarity thresholds
- Advanced features: multi-query, re-ranking, query expansion
- Performance optimization
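The ranking and threshold ideas from Module 3 can be illustrated with a brute-force search. This is a sketch only: the three-dimensional "embeddings" are invented by hand, and `search`, `top_k`, and `min_score` are hypothetical names, not the tutorial's API.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query_vec, doc_vecs, top_k=3, min_score=0.0):
    """Rank documents by cosine similarity and drop weak matches."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    # Similarity threshold: discard anything below min_score.
    scored = [(doc_id, score) for doc_id, score in scored if score >= min_score]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hand-made toy vectors standing in for real embeddings (assumption).
doc_vecs = {
    "job_ad_1": [0.9, 0.1, 0.0],
    "job_ad_2": [0.7, 0.7, 0.0],
    "job_ad_3": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]

results = search(query, doc_vecs, top_k=2, min_score=0.5)
print([(doc_id, round(score, 3)) for doc_id, score in results])
# [('job_ad_1', 0.994), ('job_ad_2', 0.707)]
```

Comparing the query against every document is fine for a small corpus; the cost grows linearly with the number of documents, which is one motivation for the indexing that vector databases provide.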
Module 4: Vector Databases
- Why vector databases are essential
- ChromaDB fundamentals
- Storing and querying embeddings
- CRUD operations
- Collections and data management
- Performance and scaling considerations
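The CRUD operations covered in Module 4 can be pictured with a toy in-memory collection. Note the assumption: `ToyCollection` is a hypothetical stand-in written for this illustration and is not ChromaDB's API; a real vector database adds persistence, indexing, and embedding generation on top of the same basic idea.

```python
import math

class ToyCollection:
    """In-memory stand-in for a vector database collection (not ChromaDB)."""

    def __init__(self, name):
        self.name = name
        self._items = {}  # id -> (embedding, document, metadata)

    def add(self, doc_id, embedding, document, metadata=None):
        """Create: store a new record, refusing duplicate ids."""
        if doc_id in self._items:
            raise KeyError(f"id already exists: {doc_id}")
        self._items[doc_id] = (list(embedding), document, dict(metadata or {}))

    def get(self, doc_id):
        """Read: return the (embedding, document, metadata) tuple."""
        return self._items[doc_id]

    def update(self, doc_id, embedding=None, document=None, metadata=None):
        """Update: replace only the fields that were passed in."""
        old_emb, old_doc, old_meta = self._items[doc_id]
        self._items[doc_id] = (
            list(embedding) if embedding is not None else old_emb,
            document if document is not None else old_doc,
            dict(metadata) if metadata is not None else old_meta,
        )

    def delete(self, doc_id):
        """Delete: remove a record by id."""
        del self._items[doc_id]

    def count(self):
        return len(self._items)

    def query(self, embedding, n_results=2):
        """Brute-force nearest neighbours by Euclidean distance."""
        ranked = sorted(
            self._items.items(),
            key=lambda item: math.dist(embedding, item[1][0]),
        )
        return [doc_id for doc_id, _ in ranked[:n_results]]

jobs = ToyCollection("job_ads")
jobs.add("ad1", [0.9, 0.1], "Python developer", {"city": "Berlin"})
jobs.add("ad2", [0.1, 0.9], "Pastry chef", {"city": "Paris"})
jobs.add("ad3", [0.8, 0.3], "Data engineer", {"city": "Berlin"})

print(jobs.query([1.0, 0.0], n_results=2))  # ['ad1', 'ad3']
jobs.update("ad2", document="Head pastry chef")
jobs.delete("ad3")
print(jobs.count())                         # 2
```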
Module 5: Advanced Techniques
- Complex metadata filtering
- Hybrid search (semantic + keyword + filters)
- Re-ranking strategies
- Multi-collection architecture
- Query optimization techniques
- Production-ready patterns
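Hybrid search, as listed for Module 5, combines a semantic score, a keyword score, and metadata filters. One simple way to do this is a weighted sum, sketched below. The weighting scheme, the `alpha` parameter, the `where` filter format, and the toy data are all assumptions for illustration, not the tutorial's implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def keyword_score(query, text):
    """Fraction of query words that appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0

def hybrid_search(query, query_vec, docs, where=None, alpha=0.7, top_k=3):
    """Filter by exact-match metadata, then rank by a weighted score.

    alpha weights the semantic score; (1 - alpha) weights the keyword score.
    """
    where = where or {}
    results = []
    for doc in docs:
        # Metadata filter: skip documents that miss any required value.
        if any(doc["metadata"].get(key) != value for key, value in where.items()):
            continue
        semantic = cosine_similarity(query_vec, doc["embedding"])
        keyword = keyword_score(query, doc["text"])
        results.append((doc["id"], alpha * semantic + (1 - alpha) * keyword))
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results[:top_k]

# Toy documents with hand-made two-dimensional "embeddings" (assumption).
docs = [
    {"id": "ad1", "text": "remote python developer",
     "embedding": [0.9, 0.1], "metadata": {"remote": True}},
    {"id": "ad2", "text": "python data engineer",
     "embedding": [0.8, 0.3], "metadata": {"remote": False}},
    {"id": "ad3", "text": "remote backend engineer",
     "embedding": [0.7, 0.4], "metadata": {"remote": True}},
]

hits = hybrid_search("python developer", [1.0, 0.0], docs, where={"remote": True})
print([(doc_id, round(score, 3)) for doc_id, score in hits])
# [('ad1', 0.996), ('ad3', 0.608)]
```

Here `ad2` is removed by the filter before any scoring happens, which is usually cheaper than scoring everything and filtering afterwards.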
Module 6: RAG Patterns
- RAG architecture and motivation
- Document chunking strategies
- Building a complete RAG system
- Advanced patterns: query expansion, HyDE, parent document retrieval
- Common challenges and solutions
- Evaluation metrics and monitoring
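Two of the RAG building blocks from Module 6, document chunking and prompt assembly, can be sketched as follows. This is a minimal example assuming word-based chunks with a fixed overlap; the function names, default sizes, and prompt wording are hypothetical, and the retrieval and LLM call that sit between the two steps are left out.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into word-based chunks that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks

def build_prompt(question, retrieved_chunks):
    """Assemble a prompt that grounds the answer in retrieved context."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Twelve placeholder words: w0 w1 ... w11
text = " ".join(f"w{i}" for i in range(12))
chunks = chunk_words(text, chunk_size=5, overlap=2)
for chunk in chunks:
    print(chunk)
# w0 w1 w2 w3 w4
# w3 w4 w5 w6 w7
# w6 w7 w8 w9 w10
# w9 w10 w11

prompt = build_prompt("Which words come after w3?", chunks[:2])
print(prompt)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.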
Zero Knowledge → Beginner    → Intermediate → Advanced
      ↓             ↓              ↓            ↓
   Module 1     Modules 1-2    Modules 1-4  All Modules
Estimated Time:
- Complete tutorial: 3-4 hours
- Individual module: 20-40 minutes
- Python 3.11+: Programming language
- NumPy: Vector operations
- Sentence Transformers: Text embeddings
- ChromaDB: Vector database
- scikit-learn: ML utilities
- Docker: Containerization
cloned_repo/
├── main.py                          # Main tutorial runner
├── module1_vectors_basics.py        # Module 1: Vector fundamentals
├── module2_text_embeddings.py       # Module 2: Text embeddings
├── module3_similarity_search.py     # Module 3: Semantic search
├── module4_vector_databases.py      # Module 4: Vector databases
├── module5_advanced_techniques.py   # Module 5: Advanced techniques
├── module6_rag_patterns.py          # Module 6: RAG patterns
├── example_corpus/                  # Sample job ad documents
│   ├── job_ad_1.txt
│   ├── job_ad_2.txt
│   └── ...
├── outputs/                         # Generated visualizations
├── chroma_db/                       # Persistent vector database
├── requirements.txt                 # Python dependencies
├── Dockerfile                       # Docker configuration
├── docker-compose.yml               # Docker Compose setup
└── README.md                        # This file
MODULE=1 docker compose up # Module 1: Vector Basics
MODULE=2 docker compose up # Module 2: Text Embeddings
MODULE=3 docker compose up # Module 3: Similarity Search
MODULE=4 docker compose up # Module 4: Vector Databases
MODULE=5 docker compose up # Module 5: Advanced Techniques
MODULE=6 docker compose up # Module 6: RAG Patterns
MODULE=ALL docker compose up # Run all modules sequentially
docker compose run --rm vector-tutorial python module1_vectors_basics.py
docker compose run --rm vector-tutorial python main.py
- Python 3.11 or higher
- pip
# Install dependencies
pip install -r requirements.txt
# Run the tutorial
python main.py
# Interactive mode (default)
python main.py
# Run all modules
python main.py --all
# Run specific module
python main.py 1 # Module 1
python main.py 2 # Module 2
# ... etc
# Run individual module directly
python module1_vectors_basics.py
The tutorial generates visualizations and outputs in the outputs/ directory:
- Vector visualizations
- Similarity matrices
- Search result rankings
- Performance comparisons
- Interactive: Press Enter to proceed through lessons
- Hands-on: Work with real job advertisement data
- Progressive: Builds knowledge step-by-step
- Practical: Production-ready patterns and best practices
- Visual: Includes diagrams and visualizations
- Code Examples: Complete, runnable code for every concept
- Vector spaces and dimensionality
- Semantic meaning representation
- Transformer-based embeddings
- Embedding model selection
- Cosine similarity
- Euclidean distance
- Semantic vs keyword search
- Ranking algorithms
- CRUD operations
- Metadata filtering
- Hybrid search
- Indexing strategies
- Scalability patterns
- Document chunking
- Context retrieval
- Prompt engineering
- Query expansion
- Evaluation metrics
- Build Your Own RAG App: Use the patterns learned to build a Q&A system
- Explore Other Vector DBs: Try Pinecone, Weaviate, or Milvus
- Fine-tune Embeddings: Learn to fine-tune models on your domain
- Production Deployment: Scale your vector database for production
- Integrate LLMs: Connect to GPT-4, Claude, or open-source LLMs
This is a learning project. Feel free to:
- Add more example documents
- Create additional modules
- Improve explanations
- Add more visualizations
- Suggest improvements
This tutorial is provided for educational purposes.
This tutorial uses:
- Pre-trained models that will be downloaded on first run (~80MB)
- Local storage for ChromaDB (persistent across runs)
- The example corpus provided (job advertisements)
Built with:
- Sentence Transformers by UKPLab
- ChromaDB by Chroma
- Python open-source ecosystem
- Go at your own pace: Each module is self-contained
- Experiment: Modify the code and see what happens
- Use your own data: Replace the job ads with your own documents
- Ask questions: The code is heavily commented
- Build projects: Apply what you learn to real problems
Happy Learning!
Start your journey into the world of vector databases and modern AI applications!