Set up and load

In [1]:
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from transformers import pipeline
import pandas as pd
import os

### 0- Load VecDB and transform it into a retriever interface

In [14]:
embedding = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")
vectordb = Chroma(persist_directory="./chroma_db", embedding_function=embedding)
retriever = vectordb.as_retriever()

### 1- Define Prompt Logic

- temperature: Controls randomness in generation. Lower values make the output more deterministic, while higher values increase diversity.
- top_k: Limits the number of highest-probability tokens considered during generation.
- top_p: Implements nucleus sampling, where only the smallest set of most-probable tokens whose cumulative probability reaches p is considered.
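To make the three parameters above concrete, here is a minimal pure-Python sketch that applies them to a made-up logit vector. It is a toy illustration, not the `transformers` implementation; the function name `filter_distribution` and the logits are assumptions introduced for this example.

```python
import math

def filter_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Return the renormalized token distribution left after temperature, top-k and top-p."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token ids from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # top-k: keep only the k most probable tokens, then renormalize.
    if top_k is not None:
        ranked = ranked[:top_k]
    mass = sum(probs[i] for i in ranked)
    kept = {i: probs[i] / mass for i in ranked}

    # top-p: keep the smallest prefix whose cumulative probability reaches p.
    if top_p is not None:
        nucleus, cumulative = [], 0.0
        for i in ranked:
            nucleus.append(i)
            cumulative += kept[i]
            if cumulative >= top_p:
                break
        mass = sum(kept[i] for i in nucleus)
        kept = {i: kept[i] / mass for i in nucleus}

    return kept

# Hypothetical scores for a 4-token vocabulary (token ids 0..3).
toy_logits = [2.0, 1.0, 0.5, -1.0]
print(filter_distribution(toy_logits, temperature=0.5, top_k=3, top_p=0.9))
```

Lowering `temperature` concentrates probability on the top token, while `top_k` and `top_p` remove the tail before sampling would take place.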

In [17]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

prompt_template = PromptTemplate(
    input_variables=["context", "question"],
    template="""Use the following context to answer the question. Be concise and accurate.

Context:
{context}

Question:
{question}

Answer:"""
)



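# Load the model and tokenizer, then wrap them in a text2text pipeline.
# No sampling parameters are passed here, so the model's default generation settings apply.
model_name = "google/flan-t5-base"
llm_pipeline = pipeline(
    "text2text-generation",
    model=AutoModelForSeq2SeqLM.from_pretrained(model_name),
    tokenizer=AutoTokenizer.from_pretrained(model_name),
    max_new_tokens=100,
)
llm = HuggingFacePipeline(pipeline=llm_pipeline)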


Device set to use cpu


| Parameter            | Typical Range | Purpose/Effect                                                                 | Recommended Values by Use Case         |
|----------------------|---------------|--------------------------------------------------------------------------------|----------------------------------------|
| **Temperature**      | 0.0 – 1.0     | Controls randomness. Lower = more deterministic; higher = more creative.      | - Deterministic: `0.2`<br>- Balanced: `0.5`<br>- Creative: `0.9` |
| **Top-k**            | 1 – 50        | Limits sampling to top-k tokens. Lower = focused; higher = diverse.           | - Deterministic: `5`<br>- Balanced: `20`<br>- Creative: `40`     |
| **Top-p**            | 0.0 – 1.0     | Nucleus sampling. Lower = deterministic; higher = creative.                   | - Deterministic: `0.3`<br>- Balanced: `0.6`<br>- Creative: `0.9` |
| **Repetition Penalty** | 1.0 – 2.0   | Penalizes repeated tokens. Higher = less repetition.                          | - Deterministic: `1.0`<br>- Balanced: `1.2`<br>- Creative: `1.5` |
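The recommended values in the table can be collected into reusable presets. The sketch below is one possible way to do that; the preset names and the helper `generation_kwargs` are hypothetical, while the keyword names (`do_sample`, `temperature`, `top_k`, `top_p`, `repetition_penalty`, `max_new_tokens`) follow the generation arguments in `transformers`.

```python
# Presets copied from the table above; the preset names are illustrative labels.
GENERATION_PRESETS = {
    "deterministic": {"temperature": 0.2, "top_k": 5, "top_p": 0.3, "repetition_penalty": 1.0},
    "balanced": {"temperature": 0.5, "top_k": 20, "top_p": 0.6, "repetition_penalty": 1.2},
    "creative": {"temperature": 0.9, "top_k": 40, "top_p": 0.9, "repetition_penalty": 1.5},
}

def generation_kwargs(preset, max_new_tokens=100):
    """Build keyword arguments for a sampling-based generation call."""
    if preset not in GENERATION_PRESETS:
        raise ValueError(f"Unknown preset: {preset!r}")
    # Sampling must be switched on for temperature, top_k and top_p to have an effect.
    return {"do_sample": True, "max_new_tokens": max_new_tokens, **GENERATION_PRESETS[preset]}

print(generation_kwargs("balanced"))
```

Assuming the pipeline forwards extra keyword arguments to generation (as it does for `max_new_tokens` in the cell above), these could be passed as `pipeline("text2text-generation", ..., **generation_kwargs("balanced"))`.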


### 2- Interface and prompting


In [None]:
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type="stuff",
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt_template}
)

questions = [
    "What is best approach to deal with missing values?",
    "What features are important for time series?",
    "What is the purpose of time series analysis?"
]

results = []
for q in questions:
    res = qa_chain.invoke({"query": q})  
    results.append({
        "question": q,
        "answer": res['result'],
        "context": res['source_documents'][0].page_content if res['source_documents'] else "N/A"
    })

df = pd.DataFrame(results)


df


Token indices sequence length is longer than the specified maximum sequence length for this model (1634 > 512). Running this sequence through the model will result in indexing errors


|   | question | answer | context |
|---|----------|--------|---------|
| 0 | What is best approach to deal with missing val... | Forward fill. | `One\tof\tthe\tsimplest\tways\tto\tfill\tin\tmi...` |
| 1 | What features are important for time series? | (4). | `Chapter\t8.\t Generating\tand\tSelecting\nFeat...` |
| 2 | What is the purpose of time series analysis? | Extracting meaningful summary and statistical ... | `\ta\tsurfeit\tof\nfeatures\tthan\tit\tis\tto\t...` |
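The tokenizer warning in the output above reports a 1634-token input against a 512-token maximum, which suggests that the `stuff` chain concatenated more retrieved text than the model is meant to handle. One possible mitigation, sketched here under assumptions, is to keep retrieved chunks only until a token budget is reached. The helper `fit_chunks_to_budget` and the whitespace token counter are hypothetical; in practice a counter built on the model's own tokenizer would be more accurate.

```python
def whitespace_token_count(text):
    """Rough stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())

def fit_chunks_to_budget(chunks, max_tokens, count_tokens=whitespace_token_count):
    """Keep chunks in their ranked order until adding the next one would exceed the budget."""
    kept, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept

ranked_chunks = ["a b c", "d e", "f g h i"]  # made-up retrieved chunks, best first
print(fit_chunks_to_budget(ranked_chunks, max_tokens=5))  # → ['a b c', 'd e']
```

A simpler alternative may be to retrieve fewer documents, for example with `vectordb.as_retriever(search_kwargs={"k": 2})`, at the cost of less context per question.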


### 3- Evaluation


| **Metric**              | **Measures**                             | **Strengths**                                                                | **Limitations**                                                       | **Scoring Range**  | **Ideal Use Cases**                      | **Implementation Notes**                                     |
| ----------------------- | ---------------------------------------- | ---------------------------------------------------------------------------- | --------------------------------------------------------------------- | ------------------ | ---------------------------------------- | ------------------------------------------------------------ |
| **String Similarity**   | Character-level sequence overlap         | Fast, simple, interpretable                                                  | Sensitive to word order; ignores semantics                            | 0 to 1             | Quick sanity checks, small tweaks        | Uses `difflib.SequenceMatcher` in Python                     |
| **BLEU**                | N-gram precision (1–4 grams)             | Widely used in machine translation, captures short patterns                  | Brevity penalty punishes answers shorter than the reference; often 0 on short texts with no matching 4-gram; penalizes valid rephrasings | 0 to 1             | Factual Q&A, summaries with gold labels  | Use Hugging Face `evaluate` library                          |
| **ROUGE-L**             | Longest common subsequence               | Good for content overlap and partial matches                                 | Sensitive to formatting and verbosity                                 | 0 to 1             | Summarization, Q&A                       | `rougeL` is best suited for factual or summary comparison    |
| **Semantic Similarity** | Cosine similarity of sentence embeddings | Captures paraphrasing, semantically close responses                          | Can be tricked by verbose, vague, or generic answers                  | –1 to 1            | Open-ended answers, QA with synonyms     | Use `sentence-transformers` with `cos_sim`                   |
| **LLM-as-a-Judge**      | Scored by an LLM based on custom prompt  | Human-like judgment; considers coherence, completeness, and factual accuracy | Subjective; expensive (API calls); inconsistent without prompt tuning | 0 to 5 (or custom) | High-stakes eval, human-like ranking     | Requires well-crafted prompts; best for few-shot comparisons |




**3.1- String similarity: Simple similarity match with ground truth**

In [19]:
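# Optionally add ground-truth answers for comparison, if available.
# NOTE: the reference answers below do not correspond to the time-series questions above;
# replace them with real ground truth before interpreting any of the scores.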
df["reference_answer"] = [
    "A framework for building LLM applications.",
    "A method to retrieve documents based on vector similarity.",
    "To split large text into manageable and semantically meaningful pieces."
]

# Simple string similarity score
from difflib import SequenceMatcher
df["similarity"] = df.apply(lambda row: SequenceMatcher(None, row["answer"], row["reference_answer"]).ratio(), axis=1)
df[["question", "answer", "reference_answer", "similarity"]]


|   | question | answer | reference_answer | similarity |
|---|----------|--------|------------------|------------|
| 0 | What is best approach to deal with missing val... | Forward fill. | A framework for building LLM applications. | 0.290909 |
| 1 | What features are important for time series? | (4). | A method to retrieve documents based on vector... | 0.032258 |
| 2 | What is the purpose of time series analysis? | Extracting meaningful summary and statistical ... | To split large text into manageable and semant... | 0.080268 |


**3.2- BLEU**

In [21]:
import evaluate

bleu = evaluate.load("bleu")

df["bleu"] = df.apply(lambda row: bleu.compute(predictions=[row["answer"]], references=[[row["reference_answer"]]])["bleu"], axis=1)

df[["question", "answer", "reference_answer", "bleu"]]


|   | question | answer | reference_answer | bleu |
|---|----------|--------|------------------|------|
| 0 | What is best approach to deal with missing val... | Forward fill. | A framework for building LLM applications. | 0.0 |
| 1 | What features are important for time series? | (4). | A method to retrieve documents based on vector... | 0.0 |
| 2 | What is the purpose of time series analysis? | Extracting meaningful summary and statistical ... | To split large text into manageable and semant... | 0.0 |


**3.3- ROUGE**

In [24]:
rouge = evaluate.load("rouge")
df["rougeL"] = df.apply(lambda row: rouge.compute(predictions=[row["answer"]], references=[row["reference_answer"]])["rougeL"], axis=1)
df[["question", "answer", "reference_answer", "bleu", "rougeL"]]

|   | question | answer | reference_answer | bleu | rougeL |
|---|----------|--------|------------------|------|--------|
| 0 | What is best approach to deal with missing val... | Forward fill. | A framework for building LLM applications. | 0.0 | 0.0 |
| 1 | What features are important for time series? | (4). | A method to retrieve documents based on vector... | 0.0 | 0.0 |
| 2 | What is the purpose of time series analysis? | Extracting meaningful summary and statistical ... | To split large text into manageable and semant... | 0.0 | 0.043956 |
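To show what the `rougeL` score is built on, the following is a minimal sketch of an LCS-based F-measure over whitespace tokens. It is a simplified illustration, not the `evaluate` implementation: tokenization and normalization differ, so values will not necessarily match the column above. The function names are assumptions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(prediction, reference):
    """F-measure of LCS-based precision and recall over lowercased whitespace tokens."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

# LCS is "the cat on the mat" (5 tokens) out of 6 tokens on each side.
print(rouge_l_f1("the cat sat on the mat", "the cat is on the mat"))
```

Because the subsequence does not have to be contiguous, the score rewards answers that keep the reference's words in order even when extra words are inserted between them.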


**3.4- Semantic Evaluation**

In [23]:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

df["semantic_similarity"] = df.apply(lambda row: util.cos_sim(
    model.encode(row["answer"], convert_to_tensor=True),
    model.encode(row["reference_answer"], convert_to_tensor=True)
).item(), axis=1)

df[["question", "answer", "reference_answer", "semantic_similarity"]]


|   | question | answer | reference_answer | semantic_similarity |
|---|----------|--------|------------------|---------------------|
| 0 | What is best approach to deal with missing val... | Forward fill. | A framework for building LLM applications. | 0.056915 |
| 1 | What features are important for time series? | (4). | A method to retrieve documents based on vector... | -0.025989 |
| 2 | What is the purpose of time series analysis? | Extracting meaningful summary and statistical ... | To split large text into manageable and semant... | 0.148686 |


**3.5- LLM as a judge**

In [25]:
# grading prompt

grading_prompt = """
You are a strict evaluator. Given a question, a reference answer, and a model-generated answer, score the model's answer from 0 to 5 based on how accurate and complete it is.

Question: {question}

Reference Answer: {reference}
Model Answer: {model_answer}

Score (0 to 5):
"""

def llm_grade(row):
    input_text = grading_prompt.format(
        question=row["question"],
        reference=row["reference_answer"],
        model_answer=row["answer"]
    )
    return llm_pipeline(input_text)[0]["generated_text"]

df["llm_score"] = df.apply(llm_grade, axis=1)
df[["question", "answer", "reference_answer", "llm_score"]]


|   | question | answer | reference_answer | llm_score |
|---|----------|--------|------------------|-----------|
| 0 | What is best approach to deal with missing val... | Forward fill. | A framework for building LLM applications. | 0 |
| 1 | What features are important for time series? | (4). | A method to retrieve documents based on vector... | 4 |
| 2 | What is the purpose of time series analysis? | Extracting meaningful summary and statistical ... | To split large text into manageable and semant... | 3 |
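`llm_grade` returns the judge's raw generated text, so the `llm_score` column holds strings rather than numbers, and nothing guarantees the text is a valid score. A small parsing step can make the column safer to aggregate. This is a sketch under assumptions: `parse_score` is a hypothetical helper, and the rule of taking the first in-range integer is one choice among several.

```python
import re

def parse_score(generated_text, low=0, high=5):
    """Return the first integer within [low, high] found in the judge output, else None."""
    for match in re.finditer(r"\d+", generated_text):
        value = int(match.group())
        if low <= value <= high:
            return value
    return None

print(parse_score("Score: 3 out of 5"))  # → 3
print(parse_score("n/a"))                # → None
```

It could be applied as `df["llm_score"].apply(parse_score)`, leaving unparseable rows as missing values instead of silently mixing text with numbers.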



# 🧠 Productionizing a RAG System (One-Pager)

## 🧩 Architecture Overview
```
1. Document Ingestion
        ↓
2. Chunking & Embedding (sentence-transformers)
        ↓
3. Store in Vector DB (Qdrant / Weaviate)
        ↓
4. Query Handling + Retrieval (LangChain Retriever)
        ↓
5. Prompt Construction (context injection) + LLM Inference (flan-t5 / mistral + vLLM)
        ↓
6. Evaluation (BLEU / ROUGE / Semantic Similarity)
        ↓
7. Feedback Loop + Monitoring
```
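Steps 1 to 5 of the flow above can be sketched end to end with toy stand-ins. Everything here is an assumption made for illustration: a bag-of-words `embed` replaces the embedding model, `InMemoryStore` replaces the vector DB, the sample text is made up, and the LLM call itself is left out, so the sketch stops at the finished prompt.

```python
import math
from collections import Counter

def chunk(document, size=8):
    """Split a document into fixed-size word windows (toy chunking)."""
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    """Toy embedding: bag-of-words counts, standing in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class InMemoryStore:
    """Minimal stand-in for a vector database."""

    def __init__(self):
        self.items = []

    def add(self, texts):
        self.items.extend((text, embed(text)) for text in texts)

    def search(self, query, k=2):
        query_vec = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(query_vec, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

def build_prompt(context_chunks, question):
    """Inject the retrieved context and the question into a prompt string."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:"

# Made-up sample document; each sentence ends up in its own chunk.
store = InMemoryStore()
store.add(chunk("Forward fill carries the last observed value forward. "
                "Moving averages smooth a noisy series."))
question = "how does forward fill work"
hits = store.search(question, k=1)
prompt = build_prompt(hits, question)
print(prompt)
```

In a real deployment each function would be replaced by the corresponding component from the table of tools, while the data flow between them stays the same.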

---

## 🛠️ Key Components & Tools

| Component           | Description                                                  | Recommended Tech                                      |
|--------------------|--------------------------------------------------------------|-------------------------------------------------------|
| **Chunking**        | Split documents into semantic chunks                         | LangChain, Unstructured                              |
| **Embedding**       | Encode chunks into dense vectors                             | `BAAI/bge-base`, `MiniLM`, `instructor-xl`           |
| **Vector Store**    | Store & retrieve embeddings                                  | Qdrant, Weaviate, Chroma (dev), FAISS                |
| **Retriever**       | Top-k / MMR retrieval of context                             | LangChain, Haystack                                  |
| **Prompting**       | Custom templates to guide LLM output                         | LangChain PromptTemplate                             |
| **LLM Inference**   | Generate answer from context + query                         | flan-t5, mistral, LLaMA2 via vLLM or TGI             |
| **Evaluation**      | Assess answer quality                                        | BLEU, ROUGE, Semantic Similarity, LLM-as-a-Judge     |
| **Serving API**     | Handle external queries                                      | FastAPI, Flask, LangServe                            |
| **Deployment**      | Orchestrate and scale system                                 | Docker, K8s, Terraform, GitHub Actions               |
| **Monitoring**      | Track performance & detect drift                             | Prometheus, Grafana, LangSmith, Evidently            |

---

## 🔁 Feedback Loop

- Capture user feedback (thumbs up/down)
- Log question + context + answer + feedback
- Fine-tune embedding model or LLM based on evaluation drift
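The logging step of the feedback loop can be sketched as an append-only JSON Lines file. This is a minimal illustration under assumptions: the record fields, the file name, and the helpers `log_feedback` and `approval_rate` are hypothetical, and a production system would more likely write to a database or a tracing tool.

```python
import json
import os
import tempfile
import time

def log_feedback(path, question, context, answer, thumbs_up):
    """Append one question/context/answer/feedback record as a JSON line."""
    record = {
        "timestamp": time.time(),
        "question": question,
        "context": context,
        "answer": answer,
        "thumbs_up": bool(thumbs_up),
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record

def approval_rate(path):
    """Share of logged answers that received a thumbs up, or None if the log is empty."""
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    if not records:
        return None
    return sum(r["thumbs_up"] for r in records) / len(records)

# Demo with made-up records in a temporary directory.
log_path = os.path.join(tempfile.mkdtemp(), "feedback.jsonl")
log_feedback(log_path, "What is forward fill?", "retrieved chunk", "model answer", True)
log_feedback(log_path, "What is ROUGE?", "retrieved chunk", "model answer", False)
print(approval_rate(log_path))  # → 0.5
```

Tracking the approval rate over time gives a simple signal for when retrieval or generation quality starts to drift.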

---

## ✅ Deployment Flow (CI/CD)

1. Develop in Jupyter or VS Code, with unit tests
2. Containerize with Docker
3. Deploy via GitHub Actions or Terraform to K8s / SageMaker
4. Monitor endpoints and vector drift
