pipeline: Load PDF → Split → Embed → Build FAISS → Save → Reload → RetrievalQA

# RetrievalQA pipeline (PDF → chunks → embeddings → FAISS → save/load → QA)

**What this notebook does (end-to-end):**

1. Load a local PDF (`sample.pdf`).
2. Split into chunks.
3. Create embeddings (Sentence-Transformers `all-MiniLM-L6-v2`).
4. Build FAISS vectorstore and **save it locally**.
5. Reload the FAISS index (demonstrates fast startup).
6. Create a RetrievalQA chain using a remote Groq LLM wrapper (if available) and run queries.

**Before running:**
- Put a PDF named `sample.pdf` in the same folder as this notebook, or change the `PDF_PATH` variable below.
- Make sure the required packages are installed and the environment variables `GROQ_API_URL`, `GROQ_API_KEY`, and (optionally) `GROQ_MODEL` are set, either in a `.env` file or via `os.environ` for the current session.
- Recommended packages: `langchain`, `langchain-community`, `sentence-transformers`, `faiss-cpu`, `pypdf`, `python-dotenv`, `requests`.

This notebook is intended for development on a CPU-only machine and uses small embedding models to keep memory modest.
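
To make the retrieve-then-answer flow concrete before running it, here is a minimal sketch in pure Python. It is a toy stand-in, not what the notebook runs: word counts replace the Sentence-Transformers embeddings, a sorted list replaces FAISS, and the sample chunks and helper names (`embed`, `cosine`, `retrieve`, `build_prompt`) are made up for illustration.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy 'embedding': lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query (stand-in for a FAISS search)."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query, context_chunks):
    """Place ('stuff') all retrieved chunks into one prompt for the LLM."""
    context = "\n\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Made-up sample chunks, for illustration only.
chunks = [
    "The study uses a hybrid GRU and CNN classifier.",
    "The conference was held in April 2025.",
    "Wheelchairs can be controlled with voice commands.",
]
question = "Which classifier does the study use?"
top = retrieve(question, chunks, k=1)
print(build_prompt(question, top))
```

The real pipeline has the same shape: embed the chunks once, search by similarity at query time, then hand the top chunks to the LLM along with the question.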

In [2]:
from dotenv import load_dotenv
import os
load_dotenv()   # reads .env from working directory

if not os.getenv("GROQ_API_URL"):
    raise RuntimeError("GROQ_API_URL missing. Please create .env or set os.environ before proceeding.")
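
The check above only covers `GROQ_API_URL`. A small sketch of a check that reports every missing variable at once is shown below; the helper name `missing_env_vars` and the `REQUIRED_VARS` tuple are hypothetical, and the demo passes a plain dict instead of the real environment so it can run anywhere.

```python
import os

# GROQ_MODEL is left out because the LLM wrapper falls back to a default model name.
REQUIRED_VARS = ("GROQ_API_URL", "GROQ_API_KEY")

def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if environ is None else environ
    return [n for n in names if not env.get(n)]

# Demo with a fake environment that has the URL but no key.
missing = missing_env_vars(REQUIRED_VARS, environ={"GROQ_API_URL": "https://example.invalid"})
print("Missing:", missing)
```

In the notebook this could be called with no `environ` argument right after `load_dotenv()`, raising an error if the returned list is non-empty.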

In [3]:
# Optional: install required packages (uncomment to run)
# !pip install langchain langchain-community sentence-transformers faiss-cpu python-dotenv requests pypdf

# If using Streamlit later:
# !pip install streamlit

print('Skip installation if packages already present.')

Skip installation if packages already present.


In [4]:
# GroqRemoteLLM inline fallback (will prefer to import groq_remote_llm.py if present)
try:
    from groq_remote_llm import GroqRemoteLLM
    print('Using groq_remote_llm.py from working directory')
except Exception:
    print('groq_remote_llm.py not found or failed to import — using inline fallback (works for basic calls)')
    from langchain.llms.base import LLM
    from typing import Optional, List, Mapping, Any
    from pydantic import Field

    class GroqRemoteLLM(LLM):
        api_url: str = Field(default_factory=lambda: os.getenv('GROQ_API_URL'))
        api_key: str = Field(default_factory=lambda: os.getenv('GROQ_API_KEY'))
        model: str   = Field(default_factory=lambda: os.getenv('GROQ_MODEL', 'llama-3.3-70b-versatile'))
        timeout: int = 60

        @property
        def _llm_type(self) -> str:
            return 'groq-remote-llm'

        @property
        def _identifying_params(self) -> Mapping[str, Any]:
            return {'model': self.model, 'url': self.api_url}

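        # Accept run_manager/kwargs so newer LangChain versions can pass them through.
        def _call(self, prompt: str, stop: Optional[List[str]] = None, run_manager: Any = None, **kwargs: Any) -> str: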
            import requests
            headers = {'Authorization': f'Bearer {self.api_key}', 'Content-Type': 'application/json'}
            payload = {
                'model': self.model,
                'messages': [
                    {'role': 'system', 'content': 'You are a helpful assistant.'},
                    {'role': 'user', 'content': prompt},
                ],
                'max_tokens': 256,
            }
            resp = requests.post(self.api_url, json=payload, headers=headers, timeout=self.timeout)
            resp.raise_for_status()
            doc = resp.json()
            try:
                return doc['choices'][0]['message']['content']
            except Exception:
                return str(doc)

print('GroqRemoteLLM available as class')

Using groq_remote_llm.py from working directory
GroqRemoteLLM available as class


In [5]:
import os

# Config
PDF_PATH = 'sample.pdf'  # change if needed
CHUNK_SIZE = 800
CHUNK_OVERLAP = 120
EMBED_MODEL = 'all-MiniLM-L6-v2'
SAVE_DIR = 'faiss_index'

# 1) Load PDF
if not os.path.exists(PDF_PATH):
    raise FileNotFoundError(f"Place a PDF named '{PDF_PATH}' in the notebook folder or update PDF_PATH")

from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader(PDF_PATH)
docs = loader.load()
print('Loaded', len(docs), 'pages')

# 2) Split into chunks
from langchain.text_splitter import CharacterTextSplitter
splitter = CharacterTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)
chunks = splitter.split_documents(docs)
print('Created', len(chunks), 'chunks')

# 3) embeddings
from langchain.embeddings import SentenceTransformerEmbeddings
embeddings = SentenceTransformerEmbeddings(model_name=EMBED_MODEL)

# 4) build FAISS vectorstore
from langchain.vectorstores import FAISS
vectorstore = FAISS.from_documents(chunks, embeddings)
print('Vectorstore created with', len(chunks), 'vectors')

Loaded 3 pages
Created 3 chunks
Vectorstore created with 3 vectors
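
The output above shows 3 pages becoming exactly 3 chunks. One likely reason is that `CharacterTextSplitter` only splits on its separator (`"\n\n"` by default), so a page without blank lines stays a single chunk even when it is longer than `CHUNK_SIZE`. The sketch below shows what fixed-size chunking with overlap looks like in plain Python. It is an illustration of the idea, not LangChain's implementation, and `split_with_overlap` is a hypothetical helper.

```python
def split_with_overlap(text, chunk_size=800, chunk_overlap=120):
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters, so a sentence that
    straddles a boundary still appears whole in at least one chunk
    (as long as it is shorter than the overlap).
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks

# Demo on 2000 characters of synthetic text.
text = "".join(str(i % 10) for i in range(2000))
pieces = split_with_overlap(text, chunk_size=800, chunk_overlap=120)
print([len(p) for p in pieces])
```

Within LangChain, `RecursiveCharacterTextSplitter` is the usual choice when chunks must stay close to `chunk_size`, because it falls back to finer separators.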


In [6]:
# Persist the FAISS index to disk
vectorstore.save_local(SAVE_DIR)
print('Saved FAISS index to', SAVE_DIR)

Saved FAISS index to faiss_index


In [7]:
# reload_faiss_allow_pickle.py
from langchain.embeddings import SentenceTransformerEmbeddings
from langchain.vectorstores import FAISS

EMBED_MODEL = "all-MiniLM-L6-v2"
SAVE_DIR = "faiss_index"

emb = SentenceTransformerEmbeddings(model_name=EMBED_MODEL)

# WARNING: allow_dangerous_deserialization=True will load pickled Python objects.
# Only use this if you TRUST the files in SAVE_DIR (you created them locally).
vectorstore = FAISS.load_local(SAVE_DIR, emb, allow_dangerous_deserialization=True)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
print("Loaded FAISS index (pickle deserialization enabled). Retriever ready.")

Loaded FAISS index (pickle deserialization enabled). Retriever ready.
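
Because `allow_dangerous_deserialization=True` unpickles whatever is in `SAVE_DIR`, it can help to confirm that the files have not changed since they were saved. The sketch below records SHA-256 digests after saving and checks them before loading. The helper names and the `manifest.json` filename are assumptions for illustration, and the demo uses a temporary folder with a dummy `index.pkl` rather than a real index.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def file_sha256(path):
    """Hash a file in blocks so large index files need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

def write_manifest(index_dir, manifest_name="manifest.json"):
    """Record a digest for every file in index_dir (call right after save_local)."""
    index_dir = Path(index_dir)
    digests = {
        p.name: file_sha256(p)
        for p in sorted(index_dir.iterdir())
        if p.is_file() and p.name != manifest_name
    }
    (index_dir / manifest_name).write_text(json.dumps(digests, indent=2), encoding="utf-8")
    return digests

def verify_manifest(index_dir, manifest_name="manifest.json"):
    """Return True only if every recorded file exists with an unchanged digest."""
    index_dir = Path(index_dir)
    expected = json.loads((index_dir / manifest_name).read_text(encoding="utf-8"))
    return all(
        (index_dir / name).is_file() and file_sha256(index_dir / name) == digest
        for name, digest in expected.items()
    )

# Demo: a dummy file stands in for the pickled docstore.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "index.pkl").write_bytes(b"demo bytes")
    write_manifest(tmp)
    ok_before = verify_manifest(tmp)
    Path(tmp, "index.pkl").write_bytes(b"tampered")
    ok_after = verify_manifest(tmp)
print(ok_before, ok_after)
```

A manifest stored next to the index only catches accidental changes; to guard against deliberate tampering it would need to live somewhere the index files' writer cannot reach.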


In [8]:
# Run a few queries and print answers
queries = [
    "Summarize the main conclusion of the document in one sentence.",
    "What methods were used in the study?",
    "Who are the authors?",
]

for q in queries:
    print("\nQUESTION:", q)
    try:
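        # `qa` must already exist (it is built in the RetrievalQA cell below).
        # run() only works for single-output chains; if the chain was built with
        # return_source_documents=True, use qa.invoke({"query": q})["result"] instead.
        ans = qa.run(q)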
        print("ANSWER:", ans.strip())
    except Exception as e:
        print("Error while answering:", repr(e))


QUESTION: Summarize the main conclusion of the document in one sentence.
Error while answering: ValueError("`run` not supported when there is not exactly one output key. Got ['result', 'source_documents'].")

QUESTION: What methods were used in the study?
Error while answering: ValueError("`run` not supported when there is not exactly one output key. Got ['result', 'source_documents'].")

QUESTION: Who are the authors?
Error while answering: ValueError("`run` not supported when there is not exactly one output key. Got ['result', 'source_documents'].")




In [9]:
from langchain.chains import RetrievalQA
from groq_remote_llm import GroqRemoteLLM   # or use your inline GroqRemoteLLM fallback

# Create Groq LLM
llm = GroqRemoteLLM()

# Create RetrievalQA chain
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",     # try "map_reduce" if doc is long
    retriever=retriever     # retriever you already built from FAISS
)

print("RetrievalQA chain ready (qa). Now you can run qa.run(query).")

RetrievalQA chain ready (qa). Now you can run qa.run(query).


In [10]:
import os
from dotenv import load_dotenv

# load .env if present
load_dotenv()

print("GROQ_API_URL:", repr(os.getenv("GROQ_API_URL")))
print("GROQ_API_KEY:", bool(os.getenv("GROQ_API_KEY")))   # prints True if key present (keeps it hidden)
print("GROQ_MODEL:", repr(os.getenv("GROQ_MODEL")))

GROQ_API_URL: 'https://api.groq.com/openai/v1/chat/completions'
GROQ_API_KEY: True
GROQ_MODEL: 'llama-3.3-70b-versatile'


In [11]:
import requests, os, json
API_URL = os.getenv("GROQ_API_URL")
API_KEY = os.getenv("GROQ_API_KEY")
MODEL = os.getenv("GROQ_MODEL")

headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}
payload = {
    "model": MODEL,
    "messages": [
        {"role":"system","content":"You are a helpful assistant."},
        {"role":"user","content":"Say HELLO"}
    ],
    "max_tokens": 20
}

resp = requests.post(API_URL, json=payload, headers=headers, timeout=20)
print("HTTP", resp.status_code)
try:
    print(json.dumps(resp.json(), indent=2))
except Exception:
    print(resp.text)

HTTP 200
{
  "id": "chatcmpl-2924523e-1a50-429e-9a91-6c9182c1c55d",
  "object": "chat.completion",
  "created": 1759416019,
  "model": "llama-3.3-70b-versatile",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "HELLO"
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "queue_time": 0.049026797,
    "prompt_tokens": 44,
    "prompt_time": 0.001960063,
    "completion_tokens": 3,
    "completion_time": 0.00963365,
    "total_tokens": 47,
    "total_time": 0.011593713
  },
  "usage_breakdown": null,
  "system_fingerprint": "fp_9e1e8f8435",
  "x_groq": {
    "id": "req_01k6jnvtn6f23befn5pdwaj7pg"
  },
  "service_tier": "on_demand"
}


In [12]:
# reload module if updated on disk
import importlib, sys
if "groq_remote_llm" in sys.modules:
    importlib.reload(sys.modules["groq_remote_llm"])

from groq_remote_llm import GroqRemoteLLM
from langchain.chains import RetrievalQA

llm = GroqRemoteLLM()
print("LLM api_url:", llm.api_url)

# recreate qa (assumes retriever exists)
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)

# invoke() takes a dict input and returns a dict.
# For RetrievalQA the input key is "query" and the answer is under "result".
try:
    out = qa.invoke({"query": "What is the main conclusion of the document?"})
    # Print the whole dict; the answer text is out["result"].
    print(out)
except Exception as e:
    # fallback to .run for quick test (older style)
    print("invoke failed, trying run():", e)
    print("run() output:", qa.run("What is the main conclusion of the document?"))

LLM api_url: https://api.groq.com/openai/v1/chat/completions
{'query': 'What is the main conclusion of the document?', 'result': 'The document does not present a single main conclusion, as it appears to be a collection of abstracts or summaries of various research papers and projects presented at the 15th International Conference on Science and Innovative Engineering 2025. Each section describes a different topic, such as data deduplication, grievance resolution, graphs based on Sidon sets, and smart wheelchairs, among others. Therefore, there is no overarching conclusion that ties the entire document together.'}


In [13]:
import requests, os, json
API_URL = os.getenv("GROQ_API_URL")
API_KEY = os.getenv("GROQ_API_KEY")
MODEL = os.getenv("GROQ_MODEL")

headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}
payload = {
    "model": MODEL,
    "messages": [
        {"role":"system","content":"You are a helpful assistant."},
        {"role":"user","content":"Say HELLO"}
    ],
    "max_tokens": 20
}

try:
    resp = requests.post(API_URL, json=payload, headers=headers, timeout=20)
except Exception as e:
    print("Request failed:", repr(e))
else:
    print("HTTP status:", resp.status_code)
    txt = resp.text[:2000]  # show up to first 2000 chars
    print("Response (first 2000 chars):")
    print(txt)
    # try to extract assistant text if possible
    try:
        doc = resp.json()
        content = None
        if isinstance(doc, dict):
            if "choices" in doc and isinstance(doc["choices"], list) and doc["choices"]:
                c = doc["choices"][0]
                # OpenAI style: c['message']['content']
                if isinstance(c, dict):
                    if "message" in c and isinstance(c["message"], dict) and "content" in c["message"]:
                        content = c["message"]["content"]
                    elif "text" in c:
                        content = c["text"]
            elif "output" in doc:
                content = doc["output"]
            elif "result" in doc:
                content = doc["result"]
        if content:
            print("\nExtracted assistant content:", repr(content))
    except Exception:
        pass


HTTP status: 200
Response (first 2000 chars):
{"id":"chatcmpl-8d52ec5e-beb1-4762-9763-fa16bf4209db","object":"chat.completion","created":1759416022,"model":"llama-3.3-70b-versatile","choices":[{"index":0,"message":{"role":"assistant","content":"HELLO"},"logprobs":null,"finish_reason":"stop"}],"usage":{"queue_time":0.048215622,"prompt_tokens":44,"prompt_time":0.002205348,"completion_tokens":3,"completion_time":0.009568862,"total_tokens":47,"total_time":0.01177421},"usage_breakdown":null,"system_fingerprint":"fp_9e1e8f8435","x_groq":{"id":"req_01k6jnvwyff25bvqhvc558qt3b"},"service_tier":"on_demand"}

Extracted assistant content: 'HELLO'


In [14]:
# reload module in case you edited it
import importlib, sys
if "groq_remote_llm" in sys.modules:
    importlib.reload(sys.modules["groq_remote_llm"])

from groq_remote_llm import GroqRemoteLLM
from langchain.chains import RetrievalQA

llm = GroqRemoteLLM()   # should show api_url when constructed if you added the check
print("LLM api_url:", llm.api_url)

# recreate the QA chain (assumes `retriever` exists)
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)
print("RetrievalQA ready.")


LLM api_url: https://api.groq.com/openai/v1/chat/completions
RetrievalQA ready.


In [15]:
query = "What is the main conclusion of the document?"

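# invoke() returns a dict; for RetrievalQA the answer is under "result"
# (plus "source_documents" when return_source_documents=True).
result = qa.invoke({"query": query})
print("Raw result keys:", list(result.keys()))
# Pretty-print a truncated view; default=str keeps this from failing on
# values that are not JSON-serializable, such as Document objects.
import json
print(json.dumps(result, indent=2, default=str)[:2000])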


Raw result keys: ['query', 'result']
{
  "query": "What is the main conclusion of the document?",
  "result": "The document appears to be a collection of proceedings from the 15th International Conference on Science and Innovative Engineering 2025, featuring various research papers on different topics. As such, there is no single main conclusion that can be drawn from the document as a whole. Each paper presents its own findings and conclusions, but there is no overarching conclusion that ties the entire document together."
}


In [16]:
# Robust extraction
answer = None

# common keys to check
for key in ("output_text", "result", "answer", "text", "output"):
    if key in result:
        answer = result[key]
        break

# some versions embed the text under the chain's output key
if not answer:
    # print everything and then choose a likely field
    # try the first string value
    for v in result.values():
        if isinstance(v, str) and len(v) > 0:
            answer = v
            break

print("ANSWER:\n", answer)

# If the chain was built with return_source_documents=True, the sources come back in the result
docs = result.get("source_documents") or result.get("source_docs")

if docs:
    print(f"\nRetrieved {len(docs)} source documents:")
    for i, d in enumerate(docs, 1):
        src = getattr(d, "metadata", {}).get("source", "unknown")
        print(f"\n--- SOURCE {i} (source: {src}) ---")
        print(d.page_content[:800])
else:
    print("\nNo source_documents returned with the result. You can still fetch them via retriever.get_relevant_documents(query).")


ANSWER:
 The document appears to be a collection of proceedings from the 15th International Conference on Science and Innovative Engineering 2025, featuring various research papers on different topics. As such, there is no single main conclusion that can be drawn from the document as a whole. Each paper presents its own findings and conclusions, but there is no overarching conclusion that ties the entire document together.

No source_documents returned with the result. You can still fetch them via retriever.get_relevant_documents(query).


In [17]:
try:
    out = qa.run(query)
    print("qa.run output:\n", out)
except Exception as e:
    print("qa.run failed:", e)


qa.run output:
 The document appears to be a collection of proceedings from the 15th International Conference on Science and Innovative Engineering 2025, featuring various research papers on different topics. There is no single main conclusion that can be drawn from the document as a whole, as each paper presents its own unique findings and contributions to its respective field.


In [18]:
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever, return_source_documents=True)
res = qa.invoke({"query": query})
# then extract res['source_documents'] as above


In [19]:
# === REBUILD FAISS WITH SMALLER CHUNKS (Notebook cell) ===
import os, json
from dotenv import load_dotenv
load_dotenv()

# Config - change PDF_PATH if needed
PDF_PATH = "sample.pdf"
CHUNK_SIZE = 500
CHUNK_OVERLAP = 80
EMBED_MODEL = "all-MiniLM-L6-v2"
SAVE_DIR = "faiss_index"
CHUNKS_JSONL = os.path.join(SAVE_DIR, "chunks.jsonl")

# Imports (make sure packages are installed)
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import SentenceTransformerEmbeddings
from langchain.vectorstores import FAISS

# 1) Load PDF
if not os.path.exists(PDF_PATH):
    raise FileNotFoundError(f"Put your PDF at '{PDF_PATH}' or change the PDF_PATH variable.")
loader = PyPDFLoader(PDF_PATH)
docs = loader.load()
print("Loaded pages:", len(docs))

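# 2) Split into smaller chunks
# NOTE: CharacterTextSplitter only splits on its separator ("\n\n" by default), so a
# page with no blank lines stays one chunk even if it is longer than CHUNK_SIZE.
# RecursiveCharacterTextSplitter falls back to finer separators in that case.
splitter = CharacterTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)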
chunks = splitter.split_documents(docs)
print("Created chunks:", len(chunks))

# 3) Save chunks to JSONL (safe, portable)
os.makedirs(SAVE_DIR, exist_ok=True)
with open(CHUNKS_JSONL, "w", encoding="utf-8") as f:
    for doc in chunks:
        rec = {"page_content": doc.page_content, "metadata": getattr(doc, "metadata", {})}
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
print("Saved chunks JSONL to:", CHUNKS_JSONL)

# 4) Create embeddings and FAISS vectorstore
embeddings = SentenceTransformerEmbeddings(model_name=EMBED_MODEL)
vectorstore = FAISS.from_documents(chunks, embeddings)
print("Built FAISS vectorstore with vectors:", len(chunks))

# 5) Persist FAISS index to disk
vectorstore.save_local(SAVE_DIR)
print("Saved FAISS index to folder:", SAVE_DIR)

# 6) Create retriever for immediate use (k=2 recommended)
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})
print("Retriever ready with k=2. Done.")

Loaded pages: 3
Created chunks: 3
Saved chunks JSONL to: faiss_index\chunks.jsonl
Built FAISS vectorstore with vectors: 3
Saved FAISS index to folder: faiss_index
Retriever ready with k=2. Done.


In [20]:
from langchain.chains import RetrievalQA
from groq_remote_llm import GroqRemoteLLM

# Create LLM instance (ensure env vars loaded)
llm = GroqRemoteLLM()

# Build a map-reduce RetrievalQA
qa_map = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="map_reduce",   # map_reduce does per-chunk summaries then reduces
    retriever=retriever,
    return_source_documents=True    # ask it to return sources too
)

query = "Summarize the methodology section of the paper."
res = qa_map.invoke({"query": query})
print("Raw keys:", list(res.keys()))
# Extract answer robustly
answer = res.get("output_text") or res.get("result") or next((v for v in res.values() if isinstance(v, str)), None)
print("\n=== MAP-REDUCE ANSWER ===\n", answer)

# If sources returned, print them
docs = res.get("source_documents") or res.get("source_docs")
if docs:
    for i, d in enumerate(docs, 1):
        print(f"\n--- SOURCE {i} ---\n{d.page_content[:800]}")
else:
    print("\nNo source_documents in response — use retriever.get_relevant_documents(query) to inspect sources.")


Token indices sequence length is longer than the specified maximum sequence length for this model (1672 > 1024). Running this sequence through the model will result in indexing errors


Raw keys: ['query', 'result', 'source_documents']

=== MAP-REDUCE ANSWER ===
 Not stated in the document.

--- SOURCE 1 ---
Proceedings of 15th International Conference on Science and Innovative Engineering 2025 
April 26th - 27th, 2025 
Prince Dr.K.Vasudevan college of Engineering and Technology, India  
Manipal University College Malaysia, Melaka, Malaysia           ISBN 978-81-983498-5-9                                                                                                                       
 
 
117. LMS PLATFORM USING GENERATIVE AI 
 
Ruben George Varghese 
Dharshan R E 
Harish Jayaram S S 
R Dheepthi 
Computer Science and Engineering 
Hindustan Institute of Technology and Science, 
Chennai, Tamil Nadu, India 
 
The "LMS Platform Using Generative AI" addresses the lack of personalized and engaging learning 
resources in traditional Learning Management Systems (LMS). By overcoming the limitation

--- SOURCE 2 ---
Proceedings of 15th International Conference on Science a
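
For orientation, the map-reduce pattern used by `chain_type="map_reduce"` can be sketched as follows: one LLM call per retrieved chunk (map), then one more call that combines the partial notes (reduce). This is a simplified illustration with made-up prompts and a fake model, not LangChain's actual prompts or code.

```python
from typing import Callable, List

def map_reduce_answer(question: str, chunks: List[str], llm: Callable[[str], str]) -> str:
    """Answer a question with one LLM call per chunk plus one combining call."""
    # Map: ask the model about each chunk independently.
    partials = [
        llm(f"Context:\n{chunk}\n\nExtract anything relevant to: {question}")
        for chunk in chunks
    ]
    # Reduce: answer from the collected notes only.
    notes = "\n".join(partials)
    return llm(f"Notes:\n{notes}\n\nUsing only these notes, answer: {question}")

# Fake model that records every prompt it receives, for demonstration.
calls: List[str] = []

def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"[reply {len(calls)}]"

answer = map_reduce_answer("What methods were used?", ["chunk A", "chunk B", "chunk C"], fake_llm)
print(len(calls), answer)
```

The practical consequence is cost and latency: with `k` retrieved chunks, each question triggers `k + 1` LLM calls instead of the single call made by `chain_type="stuff"`.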

In [21]:
from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain_community.vectorstores import FAISS
from groq_remote_llm import GroqRemoteLLM
from langchain.chains import RetrievalQA

emb = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.load_local("faiss_index", emb, allow_dangerous_deserialization=True)

retriever = vectorstore.as_retriever(search_kwargs={"k": 2})  # k=2 for precision

llm = GroqRemoteLLM()
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="map_reduce", retriever=retriever, return_source_documents=True)

query = "What is the main conclusion of the paper?"
res = qa.invoke({"query": query})
print("Answer:", res.get("output_text") or res.get("result"))

# Inspect sources
for i, d in enumerate(res.get("source_documents", []), 1):
    print(f"\n--- SOURCE {i} ---\n{d.page_content[:500]}")

KeyError: 'choices'
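
A `KeyError: 'choices'` like the one above usually means the response body did not have the expected chat-completion shape, for example because the API returned an error payload. The sketch below extracts the assistant text defensively and raises a readable error otherwise. The `{"error": {"message": ...}}` shape is an assumption based on OpenAI-compatible APIs, and `extract_chat_content` is a hypothetical helper, not part of `groq_remote_llm.py`.

```python
def extract_chat_content(doc):
    """Return the assistant text from a chat-completion response dict.

    Raises RuntimeError with a descriptive message when the response is an
    error payload or lacks 'choices', instead of a bare KeyError.
    """
    if not isinstance(doc, dict):
        raise RuntimeError(f"Unexpected response type: {type(doc).__name__}")
    if "error" in doc:
        err = doc["error"]
        # Assumed shape: {"error": {"message": "...", ...}}; fall back to the raw value.
        message = err.get("message", err) if isinstance(err, dict) else err
        raise RuntimeError(f"API returned an error: {message}")
    choices = doc.get("choices")
    if not choices:
        raise RuntimeError(f"No 'choices' in response; keys present: {sorted(doc)}")
    return choices[0]["message"]["content"]

# Demo with hand-written payloads.
ok = {"choices": [{"index": 0, "message": {"role": "assistant", "content": "HELLO"}}]}
print(extract_chat_content(ok))

try:
    extract_chat_content({"error": {"message": "rate limit reached"}})
except RuntimeError as exc:
    print("Handled:", exc)
```

Checking `resp.status_code` (or calling `resp.raise_for_status()`) before parsing gives the same information earlier and is worth doing as well.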

In [None]:
from langchain.embeddings import SentenceTransformerEmbeddings
from langchain.vectorstores import FAISS
from groq_remote_llm import GroqRemoteLLM
from langchain.chains import RetrievalQA

emb = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.load_local("faiss_index", emb, allow_dangerous_deserialization=True)  # only if you saved via save_local
retriever = vectorstore.as_retriever(search_kwargs={"k":2})

llm = GroqRemoteLLM()
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="map_reduce", retriever=retriever, return_source_documents=True)


In [22]:
# === QA AUDIT / EVALUATION SCRIPT ===
import csv, datetime, os
from tqdm import tqdm

# Edit or add queries you want to test
queries = [
    "What is the main conclusion of the document?",
    "What methods were used in the study?",
    "Who are the authors of the paper?",
    "What dataset did the authors use for evaluation?",
    "List any limitations mentioned."
]

OUT_CSV = "qa_audit.csv"

def get_top_chunk_previews(question, k=3, chars=300):
    docs = retriever.invoke(question)  # get_relevant_documents() is deprecated in recent LangChain
    previews = [d.page_content[:chars].replace("\n"," ") for d in docs[:k]]
    return " ||| ".join(previews)

# Run and log
rows = []
for q in tqdm(queries, desc="Running queries"):
    try:
        # Use invoke for new API (returns dict) when available
        try:
            res = qa.invoke({"query": q})
            # extract string answer robustly
            answer = res.get("output_text") or res.get("result") or next((v for v in res.values() if isinstance(v, str)), "")
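        except Exception:
            # Fallback to legacy run(). Note that run() raises a ValueError for chains
            # built with return_source_documents=True, and that error then replaces
            # the original invoke() failure in the log.
            answer = qa.run(q)
            res = {}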
        previews = get_top_chunk_previews(q, k=3, chars=300)
        rows.append([datetime.datetime.utcnow().isoformat(), q, answer, previews])
    except Exception as e:
        rows.append([datetime.datetime.utcnow().isoformat(), q, f"ERROR: {e}", ""])

# Append to CSV (write the header only when the file is first created)
write_header = not os.path.exists(OUT_CSV)
with open(OUT_CSV, "a", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    if write_header:
        writer.writerow(["timestamp_utc", "question", "answer", "top_chunk_previews"])
    writer.writerows(rows)

print(f"Logged {len(rows)} rows to {OUT_CSV}")

Running queries: 100%|███████████████████████████████████████████████████████████████████| 5/5 [00:11<00:00,  2.40s/it]

Logged 5 rows to qa_audit.csv
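
Since failed queries are logged as answers starting with `ERROR:`, a quick tally shows how many rows are usable before any scoring. The sketch below uses an in-memory CSV with made-up rows in the same column layout as `qa_audit.csv`; `summarize_audit` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

def summarize_audit(rows):
    """Count rows whose answer is an error marker versus a real answer."""
    status = Counter(
        "error" if (r.get("answer") or "").startswith("ERROR:") else "ok"
        for r in rows
    )
    return dict(status)

# Made-up sample in the audit file's column layout.
sample = io.StringIO(
    "timestamp_utc,question,answer,top_chunk_previews\n"
    "2025-01-01T00:00:00,Q1,Some answer,preview text\n"
    "2025-01-01T00:00:01,Q2,ERROR: timeout,\n"
)
rows = list(csv.DictReader(sample))
summary = summarize_audit(rows)
print(summary)
```

To run it on the real log, replace the `StringIO` object with `open("qa_audit.csv", newline="", encoding="utf-8")`.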





In [28]:
# Robust evaluation cell: reads CSVs with encoding fallback, scores QA outputs vs gold answers.
import csv, difflib, os
from statistics import mean
from datetime import datetime
import pandas as pd

QA_FILE = "qa_audit.csv"
GOLD_FILE = "gold_answers.csv"
OUT_FILE = "qa_eval_results.csv"

def read_csv_with_fallback(path):
    """
    Try utf-8, then latin-1. Return list of dict rows (csv.DictReader).
    """
    if not os.path.exists(path):
        raise FileNotFoundError(path)
    for enc in ("utf-8", "latin-1"):
        try:
            with open(path, "r", encoding=enc, errors="strict") as f:
                reader = csv.DictReader(f)
                rows = [r for r in reader]
            print(f"Read '{path}' with encoding {enc} (rows: {len(rows)})")
            return rows
        except UnicodeDecodeError:
            print(f"Failed to read '{path}' with encoding {enc}, trying next encoding...")
    # Not reached in practice: latin-1 maps every byte, so the loop above always returns.
    raise UnicodeError(f"Could not decode '{path}'")

# 1) Ensure QA file exists
if not os.path.exists(QA_FILE):
    raise FileNotFoundError(f"Expected QA audit file '{QA_FILE}' not found. Run audit step first.")

# 2) If gold file missing, create template from QA questions and exit (so user fills gold answers)
if not os.path.exists(GOLD_FILE):
    print(f"'{GOLD_FILE}' not found. Creating template from audit file questions.")
    qa_rows = read_csv_with_fallback(QA_FILE)
    questions = []
    for r in qa_rows:
        q = r.get("question") or r.get("Question") or r.get("question_text")
        if q:
            questions.append(q.strip())
    questions = sorted(set(questions))
    with open(GOLD_FILE, "w", newline="", encoding="utf-8") as f:
        w = csv.writer(f)
        w.writerow(["question","gold_answer"])
        for q in questions:
            w.writerow([q, ""])  # user fills gold answers
    print(f"Wrote template '{GOLD_FILE}'. Please open it, fill the 'gold_answer' column, save, and re-run this cell.")
    raise SystemExit("Fill gold_answers.csv with expected answers, then re-run this cell.")

# 3) Read files (with fallback)
qa_rows = read_csv_with_fallback(QA_FILE)
gold_rows = read_csv_with_fallback(GOLD_FILE)

# 4) Build gold dict (strip keys) and QA list
gold = {}
for r in gold_rows:
    q = (r.get("question") or r.get("Question") or "").strip()
    g = (r.get("gold_answer") or r.get("gold") or "").strip()
    if q:
        gold[q] = g

# 5) Score each QA row (handle various column names)
scored = []
for r in qa_rows:
    q = (r.get("question") or r.get("Question") or r.get("question_text") or "").strip()
    ans = (r.get("answer") or r.get("Answer") or r.get("answer_text") or "").strip()
    if not q:
        continue
    gold_ans = gold.get(q, "").strip()
    exact = 1 if (gold_ans and ans.lower() == gold_ans.lower()) else 0
    ratio = difflib.SequenceMatcher(None, ans.lower(), gold_ans.lower()).ratio() if gold_ans else 0.0
    scored.append({
        "timestamp_utc": r.get("timestamp_utc") or datetime.utcnow().isoformat(),
        "question": q,
        "answer": ans,
        "gold_answer": gold_ans,
        "exact_match": exact,
        "fuzzy_ratio": round(ratio, 4)
    })

# 6) Save results to OUT_FILE
if scored:
    keys = ["timestamp_utc","question","answer","gold_answer","exact_match","fuzzy_ratio"]
    with open(OUT_FILE, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=keys)
        writer.writeheader()
        writer.writerows(scored)
    print(f"Wrote evaluation results to '{OUT_FILE}' (rows: {len(scored)})")
else:
    print("No QA rows to score. Check your QA audit file.")

# 7) Print summary and preview
df = pd.read_csv(OUT_FILE)
print("\nPreview:")
display(df.head(10))

if len(df):
    exact_rate = df["exact_match"].mean()
    avg_fuzzy = df["fuzzy_ratio"].mean()
    print(f"\nExact match rate: {exact_rate:.3f}  ({int(exact_rate*100)}%)")
    print(f"Average fuzzy ratio: {avg_fuzzy:.3f}")
else:
    print("No scored rows found.")

Read 'qa_audit.csv' with encoding utf-8 (rows: 5)
Failed to read 'gold_answers.csv' with encoding utf-8, trying next encoding...
Read 'gold_answers.csv' with encoding latin-1 (rows: 6)
Wrote evaluation results to 'qa_eval_results.csv' (rows: 5)

Preview:


| | timestamp_utc | question | answer | gold_answer | exact_match | fuzzy_ratio |
|---|---|---|---|---|---|---|
| 0 | 2025-10-02T14:41:20.908400 | What is the main conclusion of the document? | Not stated in the document. | The proposed Hybrid GRU-CNN classification str... | 0 | 0.0074 |
| 1 | 2025-10-02T14:41:24.628699 | What methods were used in the study? | The methods used in the study include:\n\n1. F... | A hybrid GRU (Gated Recurrent Unit) + CNN (Con... | 0 | 0.1216 |
| 2 | 2025-10-02T14:41:27.594518 | Who are the authors of the paper? | ERROR: `run` not supported when there is not e... | Arul Selvam P (Assistant Professor, Dept. of C... | 0 | 0.1953 |
| 3 | 2025-10-02T14:41:28.406473 | What dataset did the authors use for evaluation? | ERROR: `run` not supported when there is not e... | The Insider Threat Test Dataset developed by t... | 0 | 0.2648 |
| 4 | 2025-10-02T14:41:29.283210 | List any limitations mentioned. | ERROR: `run` not supported when there is not e... | Existing research has not taken into account t... | 0 | 0.1825 |



Exact match rate: 0.000  (0%)
Average fuzzy ratio: 0.154
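
The two scores above compare raw strings, so a correct but differently worded answer gets a low score, and error rows still receive a non-zero fuzzy ratio. The sketch below shows the same two metrics with light normalization, plus a token-overlap F1 score. The F1 metric is an addition, not part of the notebook, and all helper names are hypothetical.

```python
import difflib
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def exact_match(answer, gold):
    """1 if the normalized strings are identical and gold is non-empty, else 0."""
    return int(bool(gold) and normalize(answer) == normalize(gold))

def fuzzy_ratio(answer, gold):
    """difflib similarity in [0, 1] on normalized strings (0.0 when gold is empty)."""
    if not gold:
        return 0.0
    return difflib.SequenceMatcher(None, normalize(answer), normalize(gold)).ratio()

def token_f1(answer, gold):
    """Harmonic mean of token precision and recall between answer and gold."""
    a, g = normalize(answer).split(), normalize(gold).split()
    common = sum((Counter(a) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(a), common / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Hybrid GRU-CNN.", "hybrid gru-cnn"))
print(round(token_f1("the cat sat", "the cat"), 3))
```

Rows whose answer starts with `ERROR:` are better excluded before averaging, since their similarity to the gold answer says nothing about answer quality.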


**Notes & troubleshooting**

- If the Groq LLM call fails, check your `.env` and run the connectivity `requests.post` test shown earlier.
- If `SentenceTransformer` import errors occur, set `os.environ['TRANSFORMERS_NO_TF']='1'` before importing or use a clean conda env.
- To reuse the saved FAISS index in a Streamlit app, use `FAISS.load_local` at app startup.

Run the cells sequentially from top to bottom.


In [32]:
# Single cell: rebuild chunks (400/100), rebuild FAISS, create strict Groq LLM, use refine chain, test query + grounding check.

import os, json, traceback
from dotenv import load_dotenv
load_dotenv()

try:
    # ----------------- CONFIG -----------------
    PDF_PATH = "sample.pdf"   # change if needed
    SAVE_DIR = "faiss_index"
    CHUNK_SIZE = 400
    CHUNK_OVERLAP = 100
    EMBED_MODEL = "all-MiniLM-L6-v2"
    RETRIEVER_K = 3
    TEST_QUERY = "What is the main conclusion of the document?"
    # tokens to check for grounding (extract from your gold answer)
    GROUND_TOKENS = ["GRU", "CNN", "GRU-CNN", "98.43", "97.12", "insider", "classification", "accuracy"]
    # ------------------------------------------

    # 1) imports (do here so errors show if libs missing)
    from langchain.document_loaders import PyPDFLoader
    from langchain.text_splitter import CharacterTextSplitter
    from langchain.embeddings import SentenceTransformerEmbeddings
    from langchain.vectorstores import FAISS
    from langchain.schema import Document

    # 2) load PDF and re-split (smaller chunks)
    if not os.path.exists(PDF_PATH):
        raise FileNotFoundError(f"PDF not found: {PDF_PATH}. Place it in the working folder or change PDF_PATH.")
    loader = PyPDFLoader(PDF_PATH)
    pages = loader.load()
    print("Loaded pages:", len(pages))

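    # Note: CharacterTextSplitter splits only on its separator (default "\n\n"), so a page
    # with no blank lines stays one oversized chunk regardless of chunk_size. If the chunk
    # count equals the page count, consider RecursiveCharacterTextSplitter instead.
    splitter = CharacterTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)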
    chunks = splitter.split_documents(pages)
    print("Created chunks:", len(chunks))

    # 3) save raw chunks JSONL (portable, avoids pickle reliance)
    os.makedirs(SAVE_DIR, exist_ok=True)
    chunks_file = os.path.join(SAVE_DIR, "chunks.jsonl")
    with open(chunks_file, "w", encoding="utf-8") as f:
        for d in chunks:
            rec = {"page_content": d.page_content, "metadata": getattr(d, "metadata", {})}
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    print("Saved chunks JSONL to:", chunks_file)

    # 4) create embeddings and FAISS
    embeddings = SentenceTransformerEmbeddings(model_name=EMBED_MODEL)
    vectorstore = FAISS.from_documents(chunks, embeddings)
    vectorstore.save_local(SAVE_DIR)
    print("Built and saved FAISS index in:", SAVE_DIR)

    # 5) reload fresh (simulate fresh session)
    emb = SentenceTransformerEmbeddings(model_name=EMBED_MODEL)
    vectorstore = FAISS.load_local(SAVE_DIR, emb, allow_dangerous_deserialization=True)
    retriever = vectorstore.as_retriever(search_kwargs={"k": RETRIEVER_K})
    print("Retriever ready with k =", RETRIEVER_K)

    # 6) Create a custom Groq LLM with strict system prompt & temp=0.0
    #    We define it inline so you don't need to edit groq_remote_llm.py on disk.
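    # Import directly: a dummy fallback class would only hide the real import error,
    # because RetrievalQA needs a genuine LangChain LLM subclass.
    from langchain.llms.base import LLM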

    from pydantic import BaseModel, Field
    import requests
    from typing import Optional, List, Mapping, Any

    class StrictGroqLLM(LLM, BaseModel):
        api_url: str = Field(default_factory=lambda: os.getenv("GROQ_API_URL"))
        api_key: str = Field(default_factory=lambda: os.getenv("GROQ_API_KEY"))
        model: str   = Field(default_factory=lambda: os.getenv("GROQ_MODEL", "llama-3.3-70b-versatile"))
        timeout: int = 60
        temperature: float = 0.0
        max_tokens: int = 256

        class Config:
            arbitrary_types_allowed = True
            allow_population_by_field_name = True

        @property
        def _llm_type(self) -> str:
            return "strict-groq-llm"

        @property
        def _identifying_params(self) -> Mapping[str, Any]:
            return {"model": self.model, "url": self.api_url}

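        # Accept run_manager and extra kwargs so the signature matches LangChain's LLM._call.
        def _call(self, prompt: str, stop: Optional[List[str]] = None, run_manager: Any = None, **kwargs: Any) -> str: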
            if not self.api_url:
                raise RuntimeError("GROQ_API_URL not set in environment.")
            if not self.api_key:
                raise RuntimeError("GROQ_API_KEY not set in environment.")
            headers = {"Authorization": f"Bearer {self.api_key}", "Content-Type": "application/json"}
            # Strict system prompt: answer only from provided context, else say Not stated...
            
            system_msg = (
                "You are an information extraction assistant. "
                "Carefully read the provided document chunks and answer factually. "
                "If the answer is explicitly mentioned (even partially), extract it and restate clearly. "
                "Include numbers, model names, and datasets if they appear. "
                "Only if the information is completely absent in the context, respond exactly: 'Not stated in the document.'"
            )

            
            payload = {
                "model": self.model,
                "messages": [
                    {"role":"system", "content": system_msg},
                    {"role":"user", "content": prompt},
                ],
                "temperature": float(self.temperature),
                "max_tokens": int(self.max_tokens),
            }
            resp = requests.post(self.api_url, json=payload, headers=headers, timeout=self.timeout)
            if not resp.ok:
                raise RuntimeError(f"Groq API error {resp.status_code}: {resp.text[:1000]}")
            doc = resp.json()
            # robust extraction
            if isinstance(doc, dict):
                if "choices" in doc and isinstance(doc["choices"], list) and doc["choices"]:
                    first = doc["choices"][0]
                    if isinstance(first, dict):
                        # Fall back to "" so a null content/text never violates the str return type.
                        if "message" in first and isinstance(first["message"], dict):
                            return first["message"].get("content") or ""
                        if "text" in first:
                            return first.get("text") or ""
                for key in ("text","output_text","result","output"):
                    if key in doc:
                        v = doc[key]
                        if isinstance(v, str):
                            return v
                        return str(v)
            return str(doc)

    # instantiate it
    llm = StrictGroqLLM()
    print("Created StrictGroqLLM with model:", llm.model)

    # 7) Create RetrievalQA with chain_type = 'refine' and return_source_documents
    from langchain.chains import RetrievalQA
    qa = RetrievalQA.from_chain_type(llm=llm, chain_type="refine", retriever=retriever, return_source_documents=True)
    print("Created RetrievalQA (refine).")

    # 8) Run the test query using invoke() (new API)
    print("\n--- Running test query ---\nQuery:", TEST_QUERY)
    res = qa.invoke({"query": TEST_QUERY})
    print("\nRaw result keys:", list(res.keys()))

    # Extract answer robustly
    answer = res.get("output_text") or res.get("result") or next((v for v in res.values() if isinstance(v, str)), None)
    print("\n=== ANSWER ===\n", answer)

    # Show returned sources when available
    docs = res.get("source_documents") or res.get("source_docs")
    if docs:
        print(f"\nReturned {len(docs)} source documents (first 3 shown):")
        for i, d in enumerate(docs[:3], 1):
            src = getattr(d, "metadata", {}).get("source", "unknown")
            print(f"\n--- SOURCE {i} (source: {src}) ---\n")
            print(d.page_content[:800])
    else:
        print("\nNo source_documents included in response; retrieving top-k via retriever for inspection.")
        docs = retriever.get_relevant_documents(TEST_QUERY)
        for i, d in enumerate(docs[:3], 1):
            print(f"\n--- Retrieved chunk {i} ---\n")
            print(d.page_content[:800])

    # 9) Grounding check: look for target tokens inside the returned docs
    combined_text = " ".join([d.page_content for d in (docs or [])]).lower()
    print("\n--- Grounding check for tokens from gold answer ---")
    for t in GROUND_TOKENS:
        found = (t.lower() in combined_text)
        print(f"Token '{t}':", "FOUND" if found else "NOT FOUND")

    print("\nDone. If the answer looks good and tokens are FOUND, grounding improved.")
except Exception:
    print("ERROR during rebuild+test:")
    traceback.print_exc()

Loaded pages: 3
Created chunks: 3
Saved chunks JSONL to: faiss_index\chunks.jsonl
Built and saved FAISS index in: faiss_index
Retriever ready with k = 3
Created StrictGroqLLM with model: llama-3.3-70b-versatile
Created RetrievalQA (refine).

--- Running test query ---
Query: What is the main conclusion of the document?


* 'allow_population_by_field_name' has been renamed to 'populate_by_name'



Raw result keys: ['query', 'result', 'source_documents']

=== ANSWER ===
 Not stated in the document. 

The new context provided still appears to be a collection of abstracts or summaries of different research papers or presentations, each with its own unique topic and conclusion. The first part discusses Sidon sets and their applications in graph theory, while the second part describes a project for an IoT-enabled smart wheelchair. There is no overarching main conclusion that encompasses the entire document.

Returned 3 source documents (first 3 shown):

--- SOURCE 1 (source: sample.pdf) ---

Proceedings of 15th International Conference on Science and Innovative Engineering 2025 
April 26th - 27th, 2025 
Prince Dr.K.Vasudevan college of Engineering and Technology, India  
Manipal University College Malaysia, Melaka, Malaysia           ISBN 978-81-983498-5-9                                                                                                                       
 
 
Jerus
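
The log above reports 3 chunks from 3 pages even though `CHUNK_SIZE` is 400. That is consistent with how `CharacterTextSplitter` behaves when a page contains none of its separators (default `"\n\n"`): the whole page stays one chunk. To illustrate what fixed-size chunking with overlap would be expected to yield, here is a minimal standard-library sketch. It is not LangChain's algorithm; `sliding_chunks` is a hypothetical helper that cuts on raw character offsets and ignores word boundaries.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int = 400, overlap: int = 100) -> List[str]:
    """Cut text into windows of chunk_size characters; consecutive windows share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Hypothetical 1000-character page with no blank lines.
page = "x" * 1000
pieces = sliding_chunks(page, chunk_size=400, overlap=100)
print(len(pieces), [len(p) for p in pieces])
```

A single page of that length yields several chunks here, so a chunk count equal to the page count is a useful signal that the splitter never actually split.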

In [37]:
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA

# ---- Custom Q&A Prompt ----
prompt_template = """
You are given the following document context:

{context}

Question: {question}

Task: Extract the factual answer as clearly and concisely as possible.
- Include technical terms (e.g., GRU, CNN), dataset names, and numbers (accuracy, percentages).
- If info is not in the context, respond: "Not stated in the document."
- Do not add extra info.

Answer:
"""
QA_PROMPT = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "question"]
)

# ---- Build RetrievalQA chain ----
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="map_reduce",
    retriever=retriever,
    chain_type_kwargs={"question_prompt": QA_PROMPT},  # map_reduce takes question_prompt (map step); `prompt` is only for chain_type="stuff"
    return_source_documents=True
)

print("✅ RetrievalQA built with map_reduce + custom prompt.")

# ---- Run test query ----
query = "What is the main conclusion of the document?"
res = qa.invoke({"query": query})

print("\n=== ANSWER ===\n", res["result"])

if "source_documents" in res:
    print(f"\nReturned {len(res['source_documents'])} sources (showing first 2):")
    for i, d in enumerate(res["source_documents"][:2], 1):
        print(f"\n--- SOURCE {i} ---\n{d.page_content[:500]}")

✅ RetrievalQA built with map_reduce + custom prompt.

=== ANSWER ===
 There are two separate main conclusions presented in the document: 
1. The "LMS Platform Using Generative AI" enhances learning effectiveness by integrating generative AI for personalized content.
2. The Hybrid GRU-CNN approach achieved high accuracy (98.43% in training and 97.12% in testing) and low false positive rates (0.17% in training and 2.88% in testing) for insider attack classification in cloud networks using the Insider Threat Test Dataset.

Returned 3 sources (showing first 2):

--- SOURCE 1 ---
Proceedings of 15th International Conference on Science and Innovative Engineering 2025 
April 26th - 27th, 2025 
Prince Dr.K.Vasudevan college of Engineering and Technology, India  
Manipal University College Malaysia, Melaka, Malaysia           ISBN 978-81-983498-5-9                                                                                                                       
 
 
Jerusalem College of Engi
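
The answer above quotes specific figures (98.43%, 97.12%), so it is worth confirming that they occur in the retrieved sources rather than only in the model's output. The sketch below generalizes the substring check from the earlier cell into a reusable function. `grounding_report` is a hypothetical name, and the sample strings are made up for illustration. It matches case-insensitive substrings only, so short tokens such as "CNN" can also match inside longer words.

```python
from typing import Dict, Iterable


def grounding_report(tokens: Iterable[str], source_texts: Iterable[str]) -> Dict[str, bool]:
    """Map each token to True if it occurs (case-insensitively) in any source text."""
    combined = " ".join(source_texts).lower()
    return {token: token.lower() in combined for token in tokens}


# Made-up source snippets; in the notebook these would be d.page_content values.
sources = [
    "A hybrid GRU-CNN model reached 98.43% training accuracy.",
    "Testing accuracy was lower.",
]
report = grounding_report(["GRU-CNN", "98.43", "97.12"], sources)
for token, found in report.items():
    print(f"Token '{token}':", "FOUND" if found else "NOT FOUND")
```

With the real chain output, pass `[d.page_content for d in res["source_documents"]]` as `source_texts`.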