ather-ops/Cortex_RAG


What is Cortex-RAG?

This repository is a hands-on RAG system built from scratch, with every concept explained and every line of code written by hand and understood.

RAG = Retrieval-Augmented Generation

Instead of asking an LLM to remember everything, we let it search a knowledge base and answer from the documents it actually retrieves.

User Query
    ↓
Embed query into vector
    ↓
Search vector database for similar documents
    ↓
Pass retrieved documents + query to LLM
    ↓
LLM generates an answer grounded in the retrieved documents
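
The flow above can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebooks' code: the `embed` function is a toy word-count stand-in for a real embedding model, the three documents are hypothetical, and the final LLM call is left out, so the sketch stops at the prompt that would be sent to the model.

```python
import math
from collections import Counter

# Hypothetical mini knowledge base; not taken from the repository's datasets.
DOCUMENTS = [
    "Jaws is a thriller about a shark directed by Steven Spielberg",
    "Kota Factory is an Indian series about students preparing for exams",
    "Chroma is a vector database for storing embeddings",
]

def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, retrieved):
    """Combine the retrieved documents and the query into one LLM prompt."""
    context = "\n".join(f"- {doc}" for doc in retrieved)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

query = "Who directed Jaws"
top_docs = retrieve(query, DOCUMENTS, k=1)
prompt = build_prompt(query, top_docs)
print(prompt)
```

In a real pipeline, `embed` would be replaced by a pretrained model and `prompt` would be passed to an LLM; the retrieve-then-prompt structure stays the same.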

Repository Structure

Cortex-RAG/
├── 01_Embeddings_Basics_to_Advanced.ipynb   ← Start here
├── 02_Netflix_Semantic_Search_Pipeline.ipynb ← Full pipeline
├── 03_Vector_Databases_Chroma.ipynb          ← Coming soon
├── 04_LLM_Response_Generation.ipynb          ← Coming soon
├── 05_Complete_RAG_System.ipynb              ← Final project
├── Netflix_Dataset.csv
├── My_New_Netflix_Dataset.csv
└── README.md

Notebooks

01 — Embeddings: From Zero to Semantic Search

What you learn:

| Concept | Description |
|---|---|
| One-Hot Encoding | Why it fails to capture meaning |
| Embedding Matrix | How dense vectors solve the problem |
| Cosine Similarity | Measure meaning, not just spelling |
| SentenceTransformer | State-of-the-art pretrained embeddings |
| Semantic Search | Ask anything; find meaning, not keywords |
| Save to CSV | Persist embeddings for reuse |
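
As a rough sketch of the "Save to CSV" step, embeddings can be stored one row per item with the vector serialized as a JSON string. The titles, the 3-dimensional vectors, and the helper names `save_embeddings` and `load_embeddings` are all hypothetical; the notebook may well use a different layout (for example a pandas DataFrame).

```python
import csv
import io
import json

# Hypothetical titles and made-up 3-dimensional vectors, for illustration only.
embeddings = {
    "Jaws": [0.12, -0.48, 0.33],
    "Prey": [0.05, 0.91, -0.27],
}

def save_embeddings(rows, handle):
    """Write one row per title, with the vector serialized as a JSON string."""
    writer = csv.writer(handle)
    writer.writerow(["title", "embedding"])
    for title, vector in rows.items():
        writer.writerow([title, json.dumps(vector)])

def load_embeddings(handle):
    """Read the CSV back and parse each JSON string into a list of floats."""
    reader = csv.DictReader(handle)
    return {row["title"]: json.loads(row["embedding"]) for row in reader}

# An in-memory buffer stands in for a file opened with newline="".
buffer = io.StringIO()
save_embeddings(embeddings, buffer)
buffer.seek(0)
restored = load_embeddings(buffer)
print(restored == embeddings)
```

Persisting vectors this way means the (comparatively slow) embedding step only has to run once per dataset.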

Key insight: King and Queen are close in embedding space. King and Pizza are far apart. That distance between vectors is how an embedding model represents similarity in meaning.
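
To make the contrast concrete, here is a small sketch comparing one-hot vectors with dense vectors. The dense values are hand-picked for illustration (imagine one axis for "royalty" and one for "food"); they are not output from any real model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

vocab = ["king", "queen", "pizza"]

# One-hot: each word gets its own axis, so every pair of words is orthogonal.
one_hot = {
    word: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, word in enumerate(vocab)
}

# Dense: hand-picked 2-D values (assumed axes: royalty, food), not model output.
dense = {
    "king": [0.9, 0.1],
    "queen": [0.8, 0.2],
    "pizza": [0.1, 0.9],
}

onehot_king_queen = cosine_similarity(one_hot["king"], one_hot["queen"])
dense_king_queen = cosine_similarity(dense["king"], dense["queen"])
dense_king_pizza = cosine_similarity(dense["king"], dense["pizza"])

print("one-hot king/queen:", onehot_king_queen)
print("dense   king/queen:", round(dense_king_queen, 3))
print("dense   king/pizza:", round(dense_king_pizza, 3))
```

With one-hot vectors every pair of distinct words has similarity zero, so "king" is no closer to "queen" than to "pizza". Dense vectors can place related words near each other, which is what the similarity scores then pick up.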


02 — Netflix Semantic Search Pipeline

Full end-to-end pipeline on real data:

EDA Dashboard — 6 Charts:

| Chart | What it shows |
|---|---|
| Pie chart | Movies vs TV Shows split |
| Horizontal bar | Top 10 content-producing countries |
| Genre bars | Most popular genres on Netflix |
| Line chart | Content release growth over years |
| Bar chart | Audience rating distribution |
| Dual line | Movies vs TV Shows trend over time |

Semantic Search Results:

| Query | Top Result |
|---|---|
| "Romantic movies" | Ankahi Kahaniya (love stories) |
| "Action movies" | Prey (survival thriller) |
| "Steven Spielberg" | Jaws (exact director match) |
| "Comedy movies" | Relevant comedy titles |
| "Indian content" | Kota Factory, Indian productions |

Custom vs Sklearn cosine similarity: Both give the same results, up to floating-point rounding. The custom version helps you understand the math. The Sklearn version is vectorized and runs faster at scale.
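
The comparison can be sketched as follows. The vectors are made up, and `cosine_similarity_custom` is an illustrative name rather than the notebook's own function. The scikit-learn check only runs if the library is installed.

```python
import math

def cosine_similarity_custom(a, b):
    """cos(a, b) = (a . b) / (|a| * |b|), written out step by step."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors for illustration.
query_vec = [1.0, 2.0, 3.0]
doc_vec = [2.0, 4.0, 6.5]

custom_score = cosine_similarity_custom(query_vec, doc_vec)
print("custom:", custom_score)

try:
    # scikit-learn expects 2-D inputs and returns a matrix of pairwise scores.
    from sklearn.metrics.pairwise import cosine_similarity
    sklearn_score = float(cosine_similarity([query_vec], [doc_vec])[0][0])
except ImportError:
    sklearn_score = None  # scikit-learn not installed; skip the comparison

if sklearn_score is not None:
    print("sklearn:", sklearn_score)
    print("match:", math.isclose(custom_score, sklearn_score, rel_tol=1e-6))
```

The scikit-learn function scores whole matrices of vectors in one call, which is where the speed advantage on a full dataset comes from.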


The RAG Learning Path

Week 1  — Notebook 01 + 02 (Embeddings + Semantic Search)
Week 2  — Notebook 03 (Chroma vector database)
Week 3  — Notebook 04 (HuggingFace LLM response)
Week 4  — Notebook 05 (Complete RAG pipeline)

Related Repository

This repo builds on top of foundational ML work done in:

ML with Scikit-Learn — github.com/ather-ops/ML-with-Scikit-Learn

That repo covers the complete classical ML pipeline:

  • End-to-end pipelines for classification and regression
  • Feature engineering, EDA, model evaluation
  • ROC curves, AUC, confusion matrix, threshold tuning
  • Production-ready code patterns

Cortex-RAG is the next level — moving from classical ML into modern AI with embeddings and language models.


Prerequisites

pip install pandas numpy matplotlib seaborn
pip install sentence-transformers scikit-learn
pip install chromadb transformers   # for notebooks 03-05

Skills Demonstrated

Embeddings, Cosine Similarity, Semantic Search, SentenceTransformers, EDA, Error Handling



About

End-to-End RAG System (Embeddings + Vector DB + LLM)
