
**1\. RAG Approach**
====================================================


### **Chunking Strategy**

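-   Use **RecursiveCharacterTextSplitter** with a **chunk size of 400--600** and an **overlap of 80** (the code below uses 500/80). Note that `chunk_size` and `chunk_overlap` are measured in characters by default, not tokens; a token-based length function is needed if these sizes are meant as token counts.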

-   Keep **FAQ documents** chunked into *smaller segments* (~200--300 tokens) because each Q&A should ideally be a single chunk.

-   For product internal docs, keep the chunk metadata simple by including just the document source so each retrieved chunk can be clearly traced back to its origin.

-   For **long guides** (e.g., *dosha guide*), keep them closer to 500 tokens to preserve context.
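
The per-document sizing described above could be captured in a small lookup. This is a minimal sketch, not the notebook's implementation: the helper name `chunk_params_for`, the `faq_` filename prefix rule, and the FAQ values (250 with 40 overlap) are assumptions chosen to fit the ranges in the list.

```python
def chunk_params_for(filename: str) -> dict:
    """Return splitter settings for a corpus file (hypothetical helper)."""
    name = filename.lower()
    if name.startswith("faq_"):
        # Assumed values: smaller chunks so a single Q&A tends to stay together.
        return {"chunk_size": 250, "chunk_overlap": 40}
    # Default used in this notebook for guides and product docs.
    return {"chunk_size": 500, "chunk_overlap": 80}


faq_params = chunk_params_for("faq_general_ayurveda_patients.md")
guide_params = chunk_params_for("dosha_guide_vata_pitta_kapha.md")
```

The returned dictionary can be unpacked into the splitter, for example `RecursiveCharacterTextSplitter(**chunk_params_for(f))`, so each file is split with its own settings.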

### **Retrieval Method**

-   Start with **embedding-based retrieval** using a SentenceTransformer (`all-mpnet-base-v2`).

-   Add **BM25 hybrid** later if users ask keyword-heavy queries ("price", "dosha type", "stress").

-   Use **FAISS `IndexFlatL2`** for dense vector search. It performs exact (brute-force) L2 search, which is fast enough for a corpus this small.
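
If a BM25 ranking is added later, the two result lists need to be merged somehow. One common option is reciprocal rank fusion, sketched below. This is an assumption about how the hybrid step might work, not something the notebook implements; the function name, the constant `k=60`, and the chunk ids are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of chunk ids into one ranking."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) to the chunk's fused score.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties fall back to chunk id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


dense_ranking = [4, 1, 7]  # hypothetical chunk ids from the FAISS search
bm25_ranking = [1, 9, 4]   # hypothetical chunk ids from a BM25 search
fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
```

Chunks that appear in both lists rise to the top, which is the behavior wanted for keyword-heavy queries where dense retrieval alone may miss exact terms.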

### **How Many Chunks to Retrieve**

-   Retrieve **3--4 chunks**.

-   More than 4 tends to reduce relevance and adds noise to prompts.

-   The prompt should explicitly instruct: **"Answer only from this context; cite sources."**

### **Returning Citations**

-   Label each retrieved chunk with its document name (the `source` field in the chunk metadata), ask the model to cite it as `(Source: file_name.md)`, and parse those markers from the answer into the `doc_id` entries of the citations list.

-   This keeps the citation output consistent and easy to trace back to the original document used during retrieval.

In [6]:

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader, CSVLoader

files = [
    "ayurveda_foundations.md",
    "content_style_and_tone_guide.md",
    "dosha_guide_vata_pitta_kapha.md",
    "faq_general_ayurveda_patients.md",
    "product_ashwagandha_tablets_internal.md",
    "product_brahmi_tailam_internal.md",
    "product_triphala_capsules_internal.md",
    "treatment_stress_support_program.md",
]

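# Load each Markdown file as one document; metadata["source"] holds the file name
docs = []
for f in files:
    docs.extend(TextLoader(f, encoding="utf-8").load())

# The CSV loader yields one document per row of the product catalog
docs.extend(CSVLoader("products_catalog.csv").load())

# Split into overlapping chunks; chunk_size and chunk_overlap count characters by default
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=80)
chunks = splitter.split_documents(docs)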


In [9]:
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss

embed_model = SentenceTransformer("all-mpnet-base-v2")

# Embed chunks
texts = [c.page_content for c in chunks]
embs = embed_model.encode(texts, convert_to_numpy=True).astype("float32")

# Create FAISS index
dim = embs.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(embs)

# Store metadata
metadata = [c.metadata for c in chunks]



In [10]:
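def retrieve(query, k=3):
    """Return the k chunks closest to the query in embedding space."""
    q_emb = embed_model.encode([query], convert_to_numpy=True).astype("float32")
    _, indices = index.search(q_emb, k)

    # FAISS pads the result with -1 when the index holds fewer than k vectors
    return [chunks[idx] for idx in indices[0] if idx != -1]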

# Assign retriever so it exists in global scope
retriever = {
    "search": retrieve
}


In [12]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "Qwen/Qwen2.5-1.5B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
hf_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

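class HFWrapper:
    """Minimal adapter exposing an invoke(prompt) -> str interface."""

    def invoke(self, prompt):
        inputs = tokenizer(prompt, return_tensors="pt").to(hf_model.device)
        outputs = hf_model.generate(**inputs, max_new_tokens=300)
        # generate() returns prompt + completion; decode only the newly generated tokens
        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        return tokenizer.decode(new_tokens, skip_special_tokens=True)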

llm = HFWrapper()



## **Function Design**

In [57]:
import re

def answer_user_query(question: str) -> dict:
    """Answer a question from retrieved context; return the answer text and cited doc_ids."""
    retrieved_chunks = retriever["search"](question, k=3)

    context_blocks = []
    for i, chunk in enumerate(retrieved_chunks):
        doc_id = chunk.metadata.get("source", f"doc_{i}")
        context_blocks.append(f"[{doc_id}]\n{chunk.page_content}")

    context_str = "\n\n---\n\n".join(context_blocks)

    prompt = f"""
Use ONLY the context below.

CONTEXT:
{context_str}

QUESTION:
{question}

INSTRUCTIONS:
- Answer strictly using the context.
- Rewrite the answer in a more descriptive, natural-sounding way, but only using details present in the context.
- Do NOT add information not present in the context.
- Include citations ONLY in this exact format: (Source: file_name.md), where file_name.md is the bracketed name shown above the passage you used.
- If the answer isn’t in the context, respond exactly with:
  "I couldn't find evidence for this in the supplied materials."

ANSWER:
"""

    # Call LLM
    response = llm.invoke(prompt)
    answer_text = response.content if hasattr(response, "content") else str(response)

    # Keep only the text after the prompt's "ANSWER:" marker, in case the model echoes the prompt
    if "ANSWER:" in answer_text:
        answer_text = answer_text.split("ANSWER:", 1)[1]
    answer_text = answer_text.strip()

    # Collect cited document names in order of first appearance, without duplicates
    pattern = r"\(Source:\s*([^)]+?)\)"
    cited_ids = [m.strip() for m in re.findall(pattern, answer_text)]
    citations = [{"doc_id": doc_id} for doc_id in dict.fromkeys(cited_ids)]

    # REMOVE fallback text if citations exist
    fallback_text = "I couldn't find evidence for this in the supplied materials."
    if citations and fallback_text in answer_text:
        answer_text = answer_text.replace(fallback_text, "").strip()

    return {
        "answer": answer_text,
        "citations": citations
    }


In [59]:
answer_user_query("What are the benefits of Ashwagandha?")


{'answer': "Ashwagandha is traditionally used to support the body's ability to adapt to stress, promote calmness and emotional balance, support strength and stamina, and help maintain restful sleep. It has been shown to have anti-stress properties that can aid in managing stress levels and promoting relaxation. Additionally, it may contribute to improved cognitive function and overall well-being. The herb is believed to be an adaptogen, which means it helps the body cope better with both physical and psychological stressors. While it is important to note that while these claims are supported by traditional practices, scientific research specifically on its efficacy for stress resilience and sleep quality is limited. (Source: product_ashwagandha_tablets_internal.md)",
 'citations': [{'doc_id': 'product_ashwagandha_tablets_internal.md'}]}

**Failure mode:**\
The model may add benefits that are not in the retrieved document, because Ashwagandha is commonly associated with many uses. It can "fill in the gaps" from general knowledge instead of sticking to the retrieved text.

In [62]:
answer_user_query("Can Ayurveda help with stress and sleep?")

{'answer': "Yes, Ayurveda offers various methods to address stress and sleep issues. It involves daily routines, food choices that are grounding, and specific herbal remedies like Ashwagandha, which has been traditionally used to support the body's stress response and promote calmness and emotional balance. These practices aim to create a balanced state within the individual, helping them manage their stress levels and improve their sleep quality over time rather than offering immediate results. While Ayurveda can be beneficial when combined with other forms of therapy or professional mental health care, it should not be considered a replacement for medical treatment. The effectiveness and duration of these treatments can vary greatly depending on the individual's constitution and the extent of their stress and sleep problems, so patience and consistency are key factors. For instance, some individuals might notice improvements in their sleep and overall well-being after several weeks, 

**Failure mode:**\
The model may blend information from different files and make the answer sound more certain or more clinical than what the documents actually say.

# Part B – Agentic workflow & evaluation

1) Outline Agent (role: structure & scope)
------------------------------------------

**Responsibility:** Expand the brief into a clear outline (headings, 3--6 sections, and key points to cover).

**Input → output (JSON)**

```
{
  "input_brief": "string",
  "preferred_length": "short|medium|long",
  "audience": "string",
  "tone": "string"
}
```
```
{
  "outline": [
    {"heading": "string", "bullets": ["string", ...], "expected_sources": ["doc_id", ...]}
  ],
  "word_count_target": 800
}
```

**One likely failure mode:** Outline is either too shallow or includes sections that require claims not present in the corpus.\
**Guardrail / check:** Require at least one expected_sources entry per heading. If none, mark the heading as **"requires source"** and do not proceed until sources are assigned.
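
The guardrail above can be checked mechanically before the outline is handed to the Writer Agent. A minimal sketch, assuming the outline follows the output schema shown earlier; the function name `headings_missing_sources` and the sample headings are hypothetical.

```python
def headings_missing_sources(outline: list) -> list:
    """Return headings whose expected_sources list is empty or absent."""
    return [
        section["heading"]
        for section in outline
        if not section.get("expected_sources")
    ]


outline = [
    {
        "heading": "What is Ashwagandha?",
        "bullets": ["traditional use"],
        "expected_sources": ["product_ashwagandha_tablets_internal.md"],
    },
    {"heading": "Clinical trial results", "bullets": ["..."], "expected_sources": []},
]

# Headings listed here would be marked "requires source" and block the pipeline.
blocked = headings_missing_sources(outline)
```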

* * * * *

2) Writer Agent (role: draft production)
----------------------------------------

**Responsibility:** Produce a first draft from the outline; keep placeholders for any unverifiable claims.

**Input → output (JSON)**

```
{
  "outline": [ ... ],
  "style": "string",
  "word_count_target": 800
}
```

```
{
  "draft_text": "string",
  "claims": [
    {"claim_id": 1, "sentence": "string", "claim_text": "string"}
  ],
  "placeholders": [
    {"location": "heading 2 -> para 1", "note": "needs citation"}
  ]
}
```

**One likely failure mode:** Writer adds explanatory facts or statistics not sourced.\
**Guardrail / check:** The Writer must tag every factual sentence with a `claim_id`. Sentences without a `claim_id` are treated as *opinion/boilerplate only*. Reject the draft automatically if more than X% of factual sentences have neither a placeholder nor a cited source.
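
The rejection rule could look roughly like the sketch below. The record fields `has_placeholder` and `has_citation` and the function name are assumptions, and the threshold is deliberately a required argument because the value of X is left open above; the `0.2` in the example is illustrative only.

```python
def should_reject_draft(factual_sentences, max_uncovered_fraction):
    """True if too many factual sentences lack both a placeholder and a citation."""
    if not factual_sentences:
        return False
    uncovered = [
        s for s in factual_sentences
        if not s.get("has_placeholder") and not s.get("has_citation")
    ]
    return len(uncovered) / len(factual_sentences) > max_uncovered_fraction


# Hypothetical draft: four factual sentences, one with neither placeholder nor citation.
sentences = [
    {"claim_id": 1, "has_citation": True},
    {"claim_id": 2, "has_placeholder": True},
    {"claim_id": 3, "has_citation": True},
    {"claim_id": 4},
]
rejected = should_reject_draft(sentences, max_uncovered_fraction=0.2)  # illustrative threshold
```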

* * * * *

3) Fact-Checker Agent (role: grounding with RAG)
------------------------------------------------

**Responsibility:** For each claim from the Writer Agent, run retrieval, verify support, and attach citation(s). Mark unsupported claims.

**Input → output (JSON)**

```
{
  "claims": [ {"claim_id": 1, "claim_text": "string"} ],
  "k_retrieve": 4
}
```

```
{
  "claim_verifications": [
    {
      "claim_id": 1,
      "supported": true,
      "evidence": [
         {"doc_id": "string", "snippet": "short text", "score": 0.92}
      ],
      "confidence": "high|medium|low"
    }
  ],
  "unsupported_claims": [ {"claim_id": 3, "claim_text": "string"} ]
}
```

**One likely failure mode:** Retrieval returns tangential evidence and the checker marks the claim supported when the snippet doesn't actually back the claim.\
**Guardrail / check:** Use two-stage verification:

-   Stage A: retrieve the top-k snippets with dense embeddings, then run an exact-match or entailment test (LLM or NLI) between the claim and each snippet.

-   Stage B: require either (a) explicit phrase overlap plus an entailment score above a threshold, or (b) at least two independent docs supporting the same claim. Otherwise mark supported=false.
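
The two stages can be sketched as follows. This is one possible reading of the rule, not a definitive implementation: `entail_fn` stands in for whatever LLM or NLI scorer is used, word-level overlap is a simple proxy for "phrase overlap", and both thresholds are assumed values. The `fake_entail` stub exists only so the example runs without a model.

```python
import re


def token_overlap(claim: str, snippet: str) -> float:
    """Fraction of the claim's word tokens that also appear in the snippet."""
    claim_tokens = set(re.findall(r"[a-z0-9]+", claim.lower()))
    snippet_tokens = set(re.findall(r"[a-z0-9]+", snippet.lower()))
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & snippet_tokens) / len(claim_tokens)


def verify_claim(claim, evidence, entail_fn, overlap_min=0.5, entail_min=0.8):
    """Stage A scores each snippet; Stage B applies the acceptance rule."""
    # Stage A: keep snippets whose entailment score clears the threshold.
    entailing = [e for e in evidence if entail_fn(claim, e["snippet"]) >= entail_min]
    # Stage B (a): one snippet with both entailment and enough word overlap.
    strong = any(token_overlap(claim, e["snippet"]) >= overlap_min for e in entailing)
    # Stage B (b): at least two different documents entail the claim.
    independent = len({e["doc_id"] for e in entailing}) >= 2
    return strong or independent


def fake_entail(claim, snippet):
    """Stand-in scorer for the example; replace with a real NLI or LLM call."""
    return 0.9 if "stress" in snippet.lower() else 0.1


claim = "Ashwagandha supports stress resilience"
good = [{"doc_id": "a.md", "snippet": "Ashwagandha is traditionally used to support stress resilience"}]
tangential = [{"doc_id": "b.md", "snippet": "Capsules are taken after meals"}]

supported = verify_claim(claim, good, fake_entail)
unsupported = verify_claim(claim, tangential, fake_entail)
```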

* * * * *

4) Tone Editor Agent (role: brand voice & final draft)
------------------------------------------------------

**Responsibility:** Apply brand voice, remove placeholders, format citations inline, and produce final editor draft.

**Input → output (JSON)**

```
{
  "draft_text": "string",
  "claim_verifications": [ ... ],
  "style_guidelines": "string"
}
```

```
{
  "final_draft": "string",
  "inline_citations": [
    {"claim_id": 1, "doc_id": "product_ashwagandha_tablets_internal.md"}
  ],
  "flags": [
    {"type": "unsupported_claim", "claim_id": 3, "action": "editor_review_required"}
  ]
}
```

**One likely failure mode:** Tone pass rewrites a sentence in a way that breaks an attached citation (moving claim context away).\
**Guardrail / check:** Any sentence that had supported=true must retain its claim_id and corresponding inline citation. Tone Editor can rephrase, but must not separate claim from its citation. Validate mapping after rewriting.
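
One way to validate the mapping after rewriting is to check that each claim marker still shares a sentence with its citation. The sketch below assumes the Tone Editor keeps markers of the form `[claim:N]` in the text and places `(Source: doc_id)` before the closing punctuation; both conventions, the function name, and the sample draft are hypothetical.

```python
import re


def broken_citation_links(final_draft, inline_citations):
    """Return claim_ids whose marker and citation no longer share a sentence."""
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", final_draft)
    broken = []
    for item in inline_citations:
        marker = f"[claim:{item['claim_id']}]"
        citation = f"(Source: {item['doc_id']})"
        if not any(marker in s and citation in s for s in sentences):
            broken.append(item["claim_id"])
    return broken


draft = (
    "First claim sentence [claim:1] (Source: product_ashwagandha_tablets_internal.md). "
    "Second claim sentence [claim:2]. "
    "An unrelated sentence (Source: product_triphala_capsules_internal.md)."
)
citations = [
    {"claim_id": 1, "doc_id": "product_ashwagandha_tablets_internal.md"},
    {"claim_id": 2, "doc_id": "product_triphala_capsules_internal.md"},
]
broken = broken_citation_links(draft, citations)
```

Any claim id returned here would be flagged for editor review rather than shipped with a detached citation.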

Minimal evaluation loop
==============================================

Tiny golden set idea
--------------------

Create a small curated set (5--8 briefs) representing realistic internal use-cases:

-   Example briefs:

    1.  "Short explainer: Ashwagandha --- benefits, who should use it, safety notes."

    2.  "How Ayurveda supports sleep --- short guide for patients."

    3.  "Product page: Triphala capsules --- what it does and precautions."

-   For each brief, prepare a **gold article** (~400--800 words) written/approved by an editor, with required citations (doc_id + supporting snippet lines).

Score (for each generated draft)
----------------------------------------

-   **Grounding correctness (0--1 per claim):** Is the claim supported by the cited snippet? (binary or 0--1 scale via editor check)

-   **Citation coverage (%):** % of factual claims that have at least one citation.

-   **Structure score (0--1):** Does the draft follow the expected outline and headings?

-   **Brand tone (0--1):** Editor rates how well the draft matches tone guidelines.

-   **Editor workload:** Editor's assessment (None / Minor edits / Major edits / Reject).

Metrics to track over time
--------------------------

-   **Hallucination rate:** % of claims marked unsupported by the Fact-Checker or editors.

-   **Citation coverage:** % of factual claims with a valid citation (target > 95%).

-   **Average editor edit time:** minutes per draft.

-   **False-positive verification rate:** % of claims the Fact-Checker marked supported that editors later mark unsupported.
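
The claim-level metrics above could be computed from per-claim review records, as in this sketch. The record fields and the function name are assumptions. The denominator for the false-positive rate is taken to be the claims the Fact-Checker marked supported, which is one reasonable reading of the definition.

```python
def draft_metrics(claims):
    """Compute claim-level metrics from per-claim review records."""
    total = len(claims)
    if total == 0:
        return {"citation_coverage": 0.0, "hallucination_rate": 0.0, "false_positive_rate": 0.0}
    cited = sum(c["has_citation"] for c in claims)
    # Unsupported if either the Fact-Checker or the editor rejected the claim.
    unsupported = sum(not (c["checker_supported"] and c["editor_supported"]) for c in claims)
    checker_ok = [c for c in claims if c["checker_supported"]]
    false_pos = sum(not c["editor_supported"] for c in checker_ok)
    return {
        "citation_coverage": cited / total,
        "hallucination_rate": unsupported / total,
        "false_positive_rate": false_pos / len(checker_ok) if checker_ok else 0.0,
    }


# Hypothetical review records for one draft.
records = [
    {"has_citation": True, "checker_supported": True, "editor_supported": True},
    {"has_citation": True, "checker_supported": True, "editor_supported": False},
    {"has_citation": False, "checker_supported": False, "editor_supported": False},
    {"has_citation": True, "checker_supported": True, "editor_supported": True},
]
metrics = draft_metrics(records)
```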

**What I would *ship* in the first 2 weeks**
=======================================================

### **1\. Reliable RAG-backed Q&A**

-   Clean chunking, retrieval, and citation extraction.

-   Fixed prompt template that instructs the model to answer only from the retrieved context.

-   Basic guardrails ("If not in context, say so").\
    **Reason:** Fastest path to value --- teams can immediately use it internally.

* * * * *

### **2\. A simple article-generation workflow (Outline → Draft → Citations)**

-   Structured outline generator (small, rule-based + LLM).

-   Writer agent that outputs a draft with **claim sentences tagged** for checking.

-   Fact-checker that uses current RAG to attach citations or flag unsupported claims.\
    **Reason:** This is the core of the Growth/Product content workflow --- good enough even if imperfect.

* * * * *

### **3\. Editor-facing output format**

-   Final JSON + Markdown draft with inline citations.

-   Flags for unsupported claims and areas needing human completion.\
    **Reason:** Editors need something that drops cleanly into their process.

* * * * *


**What I would explicitly postpone**
================================================

### **1\. Full multi-agent orchestration system**

-   No need for LangGraph or complex agent routing initially.\
    **Why:** Adds engineering overhead; a serial pipeline (Outline → Draft → Check → Tone) is enough for week 1--2.

* * * * *

### **2\. Perfect tone/style engine**

-   Avoid building full rule-based tone validators or custom finetunes.\
    **Why:** Early goal is correctness, not perfect brand voice.

* * * * *

### **3\. Advanced retrieval setups**

-   Hybrid BM25 + dense, re-ranking, cross-encoder scoring can come later.\
    **Why:** Simple vector retrieval already works well for the small internal corpus.

# **Short reflection**

-   I spent **around 5 hours** on the assignment, including understanding the corpus structure, designing the RAG approach, and sketching the agentic workflow.

-   The most interesting part was **designing the fact-checking workflow** using RAG, especially thinking about failure modes and guardrails.

-   I used **ChatGPT** to help with **brainstorming architectural options**, **rephrasing explanations**, and **debugging parts of the retrieval**.