This repository is a hands-on RAG system built from scratch — every concept explained, every line of code written and understood.
RAG = Retrieval-Augmented Generation
It means: instead of asking an LLM to remember everything, we give it the ability to search a knowledge base and answer from real retrieved documents.
```
User Query
    ↓
Embed query into vector
    ↓
Search vector database for similar documents
    ↓
Pass retrieved documents + query to LLM
    ↓
LLM generates grounded, accurate answer
```
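The steps above can be sketched in a few lines of Python. This is a minimal toy, not the repo's implementation: `embed` here is a bag-of-words stand-in for a real model like SentenceTransformer, and the final LLM call is replaced by just building the prompt string.

```python
import re

def embed(text):
    """Toy 'embedding': a set of lowercase word tokens.
    A real pipeline would use a dense model like SentenceTransformer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two binary bag-of-words vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / (len(a) ** 0.5 * len(b) ** 0.5)

def retrieve(query, docs, top_k=1):
    """Embed the query and rank documents by similarity (steps 2-3)."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_k]

docs = [
    "Jaws is a 1975 thriller directed by Steven Spielberg.",
    "Kota Factory is an Indian series about engineering students.",
]
query = "Which movie did Steven Spielberg direct?"
context = retrieve(query, docs)[0]

# Step 4: the retrieved document plus the query become the LLM prompt
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

Swapping the toy `embed` for a real embedding model and sending `prompt` to an LLM gives you the full loop — exactly what notebooks 01 through 05 build up to.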
```
Cortex-RAG/
├── 01_Embeddings_Basics_to_Advanced.ipynb   ← Start here
├── 02_Netflix_Semantic_Search_Pipeline.ipynb ← Full pipeline
├── 03_Vector_Databases_Chroma.ipynb          ← Coming soon
├── 04_LLM_Response_Generation.ipynb          ← Coming soon
├── 05_Complete_RAG_System.ipynb              ← Final project
├── Netflix_Dataset.csv
├── My_New_Netflix_Dataset.csv
└── README.md
```
What you learn:
| Concept | Description |
|---|---|
| One-Hot Encoding | Why it fails to capture meaning |
| Embedding Matrix | How dense vectors solve the problem |
| Cosine Similarity | Measure meaning — not just spelling |
| SentenceTransformer | State-of-the-art pretrained embeddings |
| Semantic Search | Ask anything — find meaning not keywords |
| Save to CSV | Persist embeddings for reuse |
Key insight: King and Queen are close in embedding space. King and Pizza are far apart. That is how machines understand language.
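A quick way to see this geometry with made-up 3-dimensional vectors (real embeddings have hundreds of dimensions; these values are purely illustrative, not real model output):

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-picked toy vectors — not real model output
king  = np.array([0.9, 0.8, 0.1])
queen = np.array([0.8, 0.9, 0.1])
pizza = np.array([0.1, 0.0, 0.9])

print(cosine(king, queen))  # close to 1: near in meaning
print(cosine(king, pizza))  # close to 0: far apart
```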
Full end-to-end pipeline on real data:
EDA Dashboard — 6 Charts:
| Chart | What it shows |
|---|---|
| Pie chart | Movies vs TV Shows split |
| Horizontal bar | Top 10 content-producing countries |
| Genre bars | Most popular genres on Netflix |
| Line chart | Content release growth over years |
| Bar chart | Audience rating distribution |
| Dual line | Movies vs TV Shows trend over time |
Semantic Search Results:
| Query | Top Result |
|---|---|
| "Romantic movies" | Ankahi Kahaniya — love stories |
| "Action movies" | Prey — survival thriller |
| "Steven Spielberg" | Jaws — exact director match |
| "Comedy movies" | Relevant comedy titles |
| "Indian content" | Kota Factory, Indian productions |
Custom vs scikit-learn cosine similarity: both give identical results. Writing it by hand teaches the math; scikit-learn's vectorized implementation runs faster at scale.
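That equivalence is easy to check on random vectors (assuming scikit-learn is installed; the array shapes are arbitrary):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def custom_cosine(a, b):
    """Cosine similarity from first principles: dot product over norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(42)
doc_embs = rng.normal(size=(5, 384))   # five fake description embeddings
query = rng.normal(size=384)

custom = np.array([custom_cosine(query, d) for d in doc_embs])
library = cosine_similarity(query.reshape(1, -1), doc_embs)[0]

print(np.allclose(custom, library))  # True, up to floating-point error
```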
Week 1 — Notebook 01 + 02 (Embeddings + Semantic Search)
Week 2 — Notebook 03 (Chroma vector database)
Week 3 — Notebook 04 (HuggingFace LLM response)
Week 4 — Notebook 05 (Complete RAG pipeline)
This repo builds on top of foundational ML work done in:
ML with Scikit-Learn — github.com/ather-ops/ML-with-Scikit-Learn
That repo covers the complete classical ML pipeline:
- End-to-end pipelines for classification and regression
- Feature engineering, EDA, model evaluation
- ROC curves, AUC, confusion matrix, threshold tuning
- Production-ready code patterns
Cortex-RAG is the next level — moving from classical ML into modern AI with embeddings and language models.
```bash
pip install pandas numpy matplotlib seaborn
pip install sentence-transformers scikit-learn
pip install chromadb transformers   # for notebooks 03-05
```