


## 1. Document Ingestion & Preprocessing

1. **Chunking**

   * **Why overlap?** Preserves context across splits.
   * **Typical parameters**:  chunk size = 500 tokens, overlap = 100 tokens.
   * **Python (using Hugging Face’s `transformers` tokenizer)**:

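     ```python
     from transformers import AutoTokenizer

     tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

     def chunk_text(text, chunk_size=500, overlap=100):
         if not 0 <= overlap < chunk_size:
             raise ValueError("overlap must be non-negative and smaller than chunk_size")
         # Skip [CLS]/[SEP] so they do not end up inside the decoded chunks.
         tokens = tokenizer.encode(text, add_special_tokens=False)
         step = chunk_size - overlap
         chunks = []
         for i in range(0, len(tokens), step):
             chunks.append(tokenizer.decode(tokens[i : i + chunk_size]))
             if i + chunk_size >= len(tokens):
                 break  # this window already reached the end of the text
         return chunks
     ```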

2. **Embedding**

   * **Models**: OpenAI’s `text-embedding-ada-002`, Sentence-Transformers’ `all-mpnet-base-v2`, Cohere’s embedding models.
   * **Batching**: Send chunks in batches of 100–500 to avoid rate limits.

3. **Indexing into a Vector Store**

   * **Popular choices**:

     * **FAISS** (on-prem, GPU-accelerated)
     * **ChromaDB** (open-source, Python-native)
     * **Pinecone / Weaviate** (managed)
   * **Example (Chroma)**:

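     ```python
     import chromadb

     client = chromadb.Client()  # in-memory client; nothing is persisted to disk
     collection = client.get_or_create_collection("my_docs")

     # `chunk_id`, `filename`, `page_number`, `vec`, and `chunk` come from your
     # ingestion loop; `vec` is the embedding of the text in `chunk`.
     collection.add(
         ids=[chunk_id],  # unique string ID per chunk
         metadatas=[{"source": filename, "page": page_number}],
         embeddings=[vec],
         documents=[chunk],  # raw chunk text
     )
     ```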
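
The batching advice under **Embedding** can be sketched as a small helper. This is a minimal sketch under stated assumptions, not any provider's API: `embed_fn` is a hypothetical callable that takes a list of strings and returns one vector per string (for example a thin wrapper around an API client or a local model), and the default batch size of 100 is simply the low end of the range suggested above.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


def embed_in_batches(
    chunks: Sequence[str],
    embed_fn: Callable[[Sequence[str]], List[List[float]]],  # hypothetical embedder
    batch_size: int = 100,
) -> List[List[float]]:
    """Embed `chunks` batch by batch and return one vector per chunk, in order."""
    vectors: List[List[float]] = []
    for batch in batched(chunks, batch_size):
        batch_vectors = embed_fn(batch)
        if len(batch_vectors) != len(batch):
            raise RuntimeError("embedder returned a different number of vectors")
        vectors.extend(batch_vectors)
    return vectors
```

Keeping the embedder behind a plain callable means the same loop works whether the vectors come from a hosted API or a local model; retry and back-off logic for rate limits would wrap `embed_fn` itself.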

---

## 2. Query Encoding & Retrieval

1. **Query Encoder**

   * Usually the same embedding model that encoded the documents, so that query and document vectors live in the same vector space.

2. **Retrieval**

   * **k-NN search**: retrieve top-k (e.g. k = 5–10).
   * **Re-ranking**

     * **Cross-encoder** (e.g. SBERT cross-encoder) to reorder your top-k by deeper semantic match.
     * **Example**:

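       ```python
       from sentence_transformers import CrossEncoder

       # `query` is the user question; `retrieved_texts` is the list of top-k chunk strings.
       cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
       scores = cross_encoder.predict([
           (query, doc_text) for doc_text in retrieved_texts
       ])
       # Sort on the score only, so ties never fall back to comparing the texts.
       pairs = sorted(zip(scores, retrieved_texts), key=lambda pair: pair[0], reverse=True)
       ranked = [doc for _, doc in pairs]
       ```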

3. **Hybrid Retrieval**

   * **Combine sparse + dense**: fuse BM25 scores (e.g. from Elasticsearch) with vector similarity so that exact keyword matches are not lost.
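
The k-NN and hybrid steps above can be illustrated without any vector database. The sketch below is a brute-force version in plain Python: `top_k_dense` scores every document by cosine similarity, and `hybrid_rank` fuses dense and sparse scores with a weighted sum after min-max normalization. The function names, the dictionary layout (`doc_id -> vector` or `doc_id -> score`), and the weight `alpha` are assumptions made for illustration; production systems would typically use an approximate index and may prefer a rank-based fusion instead.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_dense(
    query_vec: Sequence[float], doc_vecs: Dict[str, Sequence[float]], k: int = 5
) -> List[Tuple[str, float]]:
    """Brute-force k-NN: score every document and keep the k most similar."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so sparse and dense values are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_rank(
    dense: Dict[str, float], sparse: Dict[str, float], alpha: float = 0.5
) -> List[Tuple[str, float]]:
    """Weighted sum of normalized scores; a doc missing from one list scores 0 there."""
    d, s = min_max(dense), min_max(sparse)
    fused = {
        doc_id: alpha * d.get(doc_id, 0.0) + (1 - alpha) * s.get(doc_id, 0.0)
        for doc_id in set(d) | set(s)
    }
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)
```

Normalizing first matters because BM25 scores are unbounded while cosine similarity lies in [-1, 1]; without it the sparse scores would dominate the sum.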

---

## 3. LLM Prompting & Generation

1. **Prompt Templates**

   ```jinja
   You are an expert assistant. Use the following context to answer:

   Context:
   {{retrieved_chunks}}

   Question:
   {{user_query}}

   Answer:
   ```

2. **Temperature & Max Tokens**

   * **Temperature** ≈ 0.0–0.3 for factual tasks.
   * **Max tokens** set high enough that the answer and its citations are not truncated.

3. **Streaming vs. Batched**

   * **Streaming** for low latency in chat.
   * **Batched** (non-streaming) when the full answer is needed before display, e.g. for post-processing.

4. **Citation Injection**

   * Append `[source: <metadata.source> | page <metadata.page>]` after each fact.
   * Can be done via post-processing on the LLM’s output.
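
One way the prompt template and the citation post-processing could fit together is sketched below. It rests on assumptions that are not in the notes above: each retrieved chunk is a dict with `text` and `metadata` keys (a hypothetical layout), chunks are numbered `[1]`, `[2]`, ... in the context, and the model has been instructed to cite those markers in its answer. The post-processing step then swaps each marker for the `[source: ... | page ...]` tag.

```python
import re
from typing import Any, Dict, List

PROMPT_TEMPLATE = """You are an expert assistant. Use the following context to answer:

Context:
{retrieved_chunks}

Question:
{user_query}

Answer:"""


def format_citation(metadata: Dict[str, Any]) -> str:
    """Render the citation tag from a chunk's metadata."""
    return f"[source: {metadata['source']} | page {metadata['page']}]"


def build_prompt(chunks: List[Dict[str, Any]], user_query: str) -> str:
    """Fill the template; each chunk is prefixed with a numeric marker like [1]."""
    context = "\n\n".join(
        f"[{i}] {chunk['text']}" for i, chunk in enumerate(chunks, start=1)
    )
    return PROMPT_TEMPLATE.format(retrieved_chunks=context, user_query=user_query)


def inject_citations(answer: str, chunks: List[Dict[str, Any]]) -> str:
    """Post-process: swap numeric markers in the answer for full citation tags."""

    def replace(match):
        index = int(match.group(1)) - 1
        if 0 <= index < len(chunks):
            return format_citation(chunks[index]["metadata"])
        return match.group(0)  # leave markers that match no chunk untouched

    return re.sub(r"\[(\d+)\]", replace, answer)
```

Resolving markers after generation keeps file names and page numbers out of the model's hands, so a citation can only point at a chunk that was actually retrieved.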

---

## 4. System Architecture & Scaling

1. **Microservices**

   * **“Retriever” service**: handles embedding & vector DB reads.
   * **“Generator” service**: calls the LLM API.

2. **Caching**

   * Cache popular queries together with their retrieved contexts to skip re-embedding and re-retrieval.
   * Use Redis with TTL ≈ 24 h for semi-static knowledge bases.

3. **Monitoring**

   * Track **latency** (split between retrieval vs. generation).
   * Track **recall@k** on held-out Q&A pairs.

4. **Security & Privacy**

   * Encrypt proprietary docs at rest.
   * Sanitize user inputs to reduce the risk of prompt injection.
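
The caching idea above can be sketched with an in-memory stand-in for Redis. Everything here is illustrative: `TTLCache`, `cached_retrieve`, and `retrieve_fn` are hypothetical names, the key is a hash of the case- and whitespace-normalized query, and the 24 h default mirrors the TTL suggested above. A real deployment would replace the dictionary with a Redis client and let Redis handle expiry.

```python
import hashlib
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Tiny in-memory stand-in for a Redis cache with per-entry expiry."""

    def __init__(
        self,
        ttl_seconds: float = 24 * 3600,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.ttl_seconds = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[str, Tuple[float, Any]] = {}

    @staticmethod
    def key_for(query: str) -> str:
        """Normalize case and whitespace so trivially different queries share a key."""
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, query: str) -> Optional[Any]:
        key = self.key_for(query)
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: evict and report a miss
            return None
        return value

    def set(self, query: str, value: Any) -> None:
        self._store[self.key_for(query)] = (self._clock() + self.ttl_seconds, value)


def cached_retrieve(query: str, cache: TTLCache, retrieve_fn: Callable[[str], Any]) -> Any:
    """Return cached contexts when present; otherwise retrieve and cache them."""
    hit = cache.get(query)
    if hit is not None:
        return hit
    contexts = retrieve_fn(query)
    cache.set(query, contexts)
    return contexts
```

Exact-match keys only help with repeated queries; paraphrases still miss the cache and go through the full embedding and retrieval path.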

---

## 5. Evaluation & Feedback Loop

1. **Automatic Metrics**

   * **Recall@k**: Did the chunk containing the gold answer appear in the top k results?
   * **Exact match (EM) / token-level F1** on generated answers vs. ground truth.

2. **Human-in-the-Loop**

   * Deploy “thumbs up/down” on answers.
   * Periodically fine-tune the embedding model or re-ranker on the collected feedback.

3. **Continuous Index Refresh**

   * For dynamic corpora (e.g. news), schedule daily or hourly re-indexing.
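
The automatic metrics above are simple enough to compute by hand. The sketch below gives one possible per-example implementation: `recall_at_k` checks whether any gold chunk ID is in the top k, and `exact_match` / `f1_score` compare answers at token level. The normalization (lowercasing, stripping punctuation and English articles) follows the common SQuAD-style convention and is an assumption here, as are the function names; averaging each value over the evaluation set gives the reported metric.

```python
import re
import string
from collections import Counter
from typing import List, Sequence


def recall_at_k(retrieved_ids: Sequence[str], gold_ids: Sequence[str], k: int) -> float:
    """1.0 if any gold chunk ID is among the top-k retrieved IDs, else 0.0."""
    top_k = set(retrieved_ids[:k])
    return 1.0 if any(gold in top_k for gold in gold_ids) else 0.0


def normalize(text: str) -> List[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, truth: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction: str, truth: str) -> float:
    """Harmonic mean of token precision and recall between prediction and truth."""
    pred, gold = normalize(prediction), normalize(truth)
    if not pred or not gold:
        return float(pred == gold)  # both empty counts as a match
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

EM is strict and penalizes correct but verbose answers, which is why it is usually reported alongside F1.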

---

## 6. End-to-End LangChain Example

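```python
# These import paths match older LangChain releases; newer ones expose the same
# classes from `langchain_community` (FAISS) and `langchain_openai` (OpenAI, OpenAIEmbeddings).
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA

# `chunks` is the list of text chunks produced in Section 1 (e.g. by `chunk_text`).
# The OpenAI classes read the API key from the OPENAI_API_KEY environment variable.

# 1. Embed + index (once)
emb = OpenAIEmbeddings()
faiss_index = FAISS.from_texts(chunks, embedding=emb)

# 2. Build retriever + QA chain
retriever = faiss_index.as_retriever(search_kwargs={"k": 7})
llm = OpenAI(temperature=0.1, max_tokens=512)
# "map_reduce" calls the LLM once per retrieved chunk, then again to combine the
# partial answers; "stuff" would put all chunks into a single prompt instead.
qa_chain = RetrievalQA.from_chain_type(
    llm=llm, chain_type="map_reduce", retriever=retriever
)

# 3. Ask a question (newer releases deprecate `run` in favor of `invoke`)
answer = qa_chain.run("What are the main challenges of RAG systems?")
print(answer)
```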