pipeline: Load PDF → Split → Embed → Build FAISS → Save → Reload → RetrievalQA

# RetrievalQA pipeline (PDF → chunks → embeddings → FAISS → save/load → QA)

**What this notebook does (end-to-end):**

1. Load a local PDF (`sample.pdf`).
2. Split into chunks.
3. Create embeddings (Sentence-Transformers `all-MiniLM-L6-v2`).
4. Build FAISS vectorstore and **save it locally**.
5. Reload the FAISS index (demonstrates fast startup).
6. Create a RetrievalQA chain using a remote Groq LLM wrapper (if available) and run queries.

**Before running:**
- Put a PDF named `sample.pdf` in the same folder as this notebook, or change the `PDF_PATH` variable below.
- Make sure the required packages are installed and the environment variables `GROQ_API_URL`, `GROQ_API_KEY`, and `GROQ_MODEL` are set, either in a `.env` file or via `os.environ` for the current session.
- Recommended packages: `langchain`, `langchain-community`, `sentence-transformers`, `faiss-cpu`, `pypdf`, `python-dotenv`, `requests`.

This notebook is intended for development on a CPU-only machine and uses small embedding models to keep memory modest.
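
The environment check can be made a little more informative than a single `if`. The following is a minimal sketch, assuming the three variable names this notebook reads (`GROQ_API_URL`, `GROQ_API_KEY`, `GROQ_MODEL`); the helper `missing_env` is hypothetical and the commented-out values are placeholders, not real settings.

```python
import os

# Variable names read elsewhere in this notebook.
REQUIRED_VARS = ("GROQ_API_URL", "GROQ_API_KEY", "GROQ_MODEL")


def missing_env(names, env=None):
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Placeholders only: uncomment and fill in to set values for this session.
# setdefault keeps anything that is already exported or loaded from .env.
# os.environ.setdefault("GROQ_API_URL", "<your endpoint URL>")
# os.environ.setdefault("GROQ_API_KEY", "<your key>")

print("Missing variables:", missing_env(REQUIRED_VARS))
```

Listing every missing name at once avoids fixing the configuration one error at a time.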

In [1]:
from dotenv import load_dotenv
import os
load_dotenv()   # reads .env from working directory

if not os.getenv("GROQ_API_URL"):
    raise RuntimeError("GROQ_API_URL missing. Please create .env or set os.environ before proceeding.")

In [2]:
# Optional: install required packages (uncomment to run)
# !pip install langchain langchain-community sentence-transformers faiss-cpu python-dotenv requests pypdf

# If using Streamlit later:
# !pip install streamlit

print('Skip installation if packages already present.')

Skip installation if packages already present.


In [1]:
# GroqRemoteLLM inline fallback (will prefer to import groq_remote_llm.py if present)
try:
    from groq_remote_llm import GroqRemoteLLM
    print('Using groq_remote_llm.py from working directory')
except Exception:
    print('groq_remote_llm.py not found or failed to import — using inline fallback (works for basic calls)')
    from langchain.llms.base import LLM
    from typing import Optional, List, Mapping, Any
    from pydantic import Field

    class GroqRemoteLLM(LLM):
        api_url: str = Field(default_factory=lambda: os.getenv('GROQ_API_URL'))
        api_key: str = Field(default_factory=lambda: os.getenv('GROQ_API_KEY'))
        model: str   = Field(default_factory=lambda: os.getenv('GROQ_MODEL', 'llama-3.3-70b-versatile'))
        timeout: int = 60

        @property
        def _llm_type(self) -> str:
            return 'groq-remote-llm'

        @property
        def _identifying_params(self) -> Mapping[str, Any]:
            return {'model': self.model, 'url': self.api_url}

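        def _call(self, prompt: str, stop: Optional[List[str]] = None, run_manager=None, **kwargs: Any) -> str:
            # run_manager and **kwargs are accepted (and ignored) so the signature matches newer LangChain versions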
            import requests
            headers = {'Authorization': f'Bearer {self.api_key}', 'Content-Type': 'application/json'}
            payload = {
                'model': self.model,
                'messages': [
                    {'role': 'system', 'content': 'You are a helpful assistant.'},
                    {'role': 'user', 'content': prompt},
                ],
                'max_tokens': 256,
            }
            resp = requests.post(self.api_url, json=payload, headers=headers, timeout=self.timeout)
            resp.raise_for_status()
            doc = resp.json()
            try:
                return doc['choices'][0]['message']['content']
            except Exception:
                return str(doc)

print('GroqRemoteLLM available as class')

Using groq_remote_llm.py from working directory
GroqRemoteLLM available as class


In [2]:
import os

# Config
PDF_PATH = 'sample.pdf'  # change if needed
CHUNK_SIZE = 800
CHUNK_OVERLAP = 120
EMBED_MODEL = 'all-MiniLM-L6-v2'
SAVE_DIR = 'faiss_index'

# 1) Load PDF
if not os.path.exists(PDF_PATH):
    raise FileNotFoundError(f"Place a PDF named '{PDF_PATH}' in the notebook folder or update PDF_PATH")

from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader(PDF_PATH)
docs = loader.load()
print('Loaded', len(docs), 'pages')

# 2) Split into chunks
from langchain.text_splitter import CharacterTextSplitter
splitter = CharacterTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)
chunks = splitter.split_documents(docs)
print('Created', len(chunks), 'chunks')

# 3) embeddings
from langchain.embeddings import SentenceTransformerEmbeddings
embeddings = SentenceTransformerEmbeddings(model_name=EMBED_MODEL)

# 4) build FAISS vectorstore
from langchain.vectorstores import FAISS
vectorstore = FAISS.from_documents(chunks, embeddings)
print('Vectorstore created with', len(chunks), 'vectors')

Loaded 3 pages
Created 3 chunks


Vectorstore created with 3 vectors


In [3]:
# Persist the FAISS index to disk
vectorstore.save_local(SAVE_DIR)
print('Saved FAISS index to', SAVE_DIR)

Saved FAISS index to faiss_index
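
Before reloading, it can help to confirm that the save step actually produced its files. A minimal sketch, assuming `save_local` writes `index.faiss` and `index.pkl` into the target folder (the default names in the LangChain versions I am aware of; check your own `faiss_index` folder). The helper `index_is_saved` is hypothetical.

```python
from pathlib import Path

# Assumption: default file names written by save_local(); verify in your SAVE_DIR.
INDEX_FILES = ("index.faiss", "index.pkl")


def index_is_saved(save_dir):
    """True when every expected index file exists inside save_dir."""
    folder = Path(save_dir)
    return all((folder / name).is_file() for name in INDEX_FILES)


if index_is_saved("faiss_index"):
    print("Saved index found; it can be reloaded instead of rebuilt.")
else:
    print("No saved index yet; build it and call save_local first.")
```

A check like this lets a script skip the load, split, and embed steps whenever a saved index is already present.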


In [4]:
# Reload the persisted FAISS index from disk (requires allowing pickle deserialization)
from langchain.embeddings import SentenceTransformerEmbeddings
from langchain.vectorstores import FAISS

EMBED_MODEL = "all-MiniLM-L6-v2"
SAVE_DIR = "faiss_index"

emb = SentenceTransformerEmbeddings(model_name=EMBED_MODEL)

# WARNING: allow_dangerous_deserialization=True will load pickled Python objects.
# Only use this if you TRUST the files in SAVE_DIR (you created them locally).
vectorstore = FAISS.load_local(SAVE_DIR, emb, allow_dangerous_deserialization=True)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
print("Loaded FAISS index (pickle deserialization enabled). Retriever ready.")

Loaded FAISS index (pickle deserialization enabled). Retriever ready.
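
Conceptually, `search_kwargs={"k": 3}` asks the retriever for the three stored chunks whose vectors lie closest to the query embedding. The sketch below is a brute-force illustration of that idea in plain Python; it is not how FAISS is implemented, and the use of Euclidean distance is an assumption about the index configuration. The function `top_k` and the toy vectors are hypothetical.

```python
import math


def top_k(query_vec, doc_vecs, k=3):
    """Return indices of the k vectors closest to query_vec by Euclidean distance."""
    distances = [(math.dist(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    return [idx for _, idx in sorted(distances)[:k]]


# Toy 2-D "embeddings"; real all-MiniLM-L6-v2 vectors have 384 dimensions.
toy_vectors = [(0.0, 1.0), (1.0, 0.0), (0.9, 0.1), (-1.0, 0.0)]
print(top_k((1.0, 0.0), toy_vectors, k=2))  # → [1, 2]
```

With only three chunks in the index, `k=3` returns every chunk, so retrieval does not filter anything until the index grows.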


In [5]:
# Run a few queries and print answers
queries = [
    "Summarize the main conclusion of the document in one sentence.",
    "What methods were used in the study?",
    "Who are the authors?",
]

for q in queries:
    print("\nQUESTION:", q)
    try:
        ans = qa.run(q)   # requires the RetrievalQA chain `qa`; a NameError here means the cell that creates it has not been run yet
        print("ANSWER:", ans.strip())
    except Exception as e:
        print("Error while answering:", repr(e))


QUESTION: Summarize the main conclusion of the document in one sentence.
Error while answering: NameError("name 'qa' is not defined")

QUESTION: What methods were used in the study?
Error while answering: NameError("name 'qa' is not defined")

QUESTION: Who are the authors?
Error while answering: NameError("name 'qa' is not defined")


In [6]:
from langchain.chains import RetrievalQA
from groq_remote_llm import GroqRemoteLLM   # or use your inline GroqRemoteLLM fallback

# Create Groq LLM
llm = GroqRemoteLLM()

# Create RetrievalQA chain
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",     # try "map_reduce" if doc is long
    retriever=retriever     # retriever you already built from FAISS
)

print("RetrievalQA chain ready (qa). Now you can run qa.run(query).")

RetrievalQA chain ready (qa). Now you can run qa.run(query).


In [7]:
import os
from dotenv import load_dotenv

# load .env if present
load_dotenv()

print("GROQ_API_URL:", repr(os.getenv("GROQ_API_URL")))
print("GROQ_API_KEY:", bool(os.getenv("GROQ_API_KEY")))   # prints True if key present (keeps it hidden)
print("GROQ_MODEL:", repr(os.getenv("GROQ_MODEL")))

GROQ_API_URL: 'https://api.groq.com/openai/v1/chat/completions'
GROQ_API_KEY: True
GROQ_MODEL: 'llama-3.3-70b-versatile'


In [8]:
import requests, os, json
API_URL = os.getenv("GROQ_API_URL")
API_KEY = os.getenv("GROQ_API_KEY")
MODEL = os.getenv("GROQ_MODEL")

headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}
payload = {
    "model": MODEL,
    "messages": [
        {"role":"system","content":"You are a helpful assistant."},
        {"role":"user","content":"Say HELLO"}
    ],
    "max_tokens": 20
}

resp = requests.post(API_URL, json=payload, headers=headers, timeout=20)
print("HTTP", resp.status_code)
try:
    print(json.dumps(resp.json(), indent=2))
except Exception:
    print(resp.text)

HTTP 200
{
  "id": "chatcmpl-06224f50-2d76-4568-815a-1707623925a9",
  "object": "chat.completion",
  "created": 1759307089,
  "model": "llama-3.3-70b-versatile",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "HELLO"
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "queue_time": 0.050131119,
    "prompt_tokens": 44,
    "prompt_time": 0.002049261,
    "completion_tokens": 3,
    "completion_time": 0.009642606,
    "total_tokens": 47,
    "total_time": 0.011691867
  },
  "usage_breakdown": null,
  "system_fingerprint": "fp_155ab82e98",
  "x_groq": {
    "id": "req_01k6fdzhe4fjqv8vfad2kt38yd"
  },
  "service_tier": "on_demand"
}


In [9]:
# reload module if updated on disk
import importlib, sys
if "groq_remote_llm" in sys.modules:
    importlib.reload(sys.modules["groq_remote_llm"])

from groq_remote_llm import GroqRemoteLLM
from langchain.chains import RetrievalQA

llm = GroqRemoteLLM()
print("LLM api_url:", llm.api_url)

# recreate qa (assumes retriever exists)
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)

# Use invoke to run (some versions expect dict input)
# For RetrievalQA chain, pass {"query": "..."} or just use qa.run if you prefer
try:
    # prefer qa.invoke to satisfy new API (returns a dict)
    out = qa.invoke({"query": "What is the main conclusion of the document?"})
    # RetrievalQA returns the answer under the "result" key; print the whole dict:
    print(out)
except Exception as e:
    # fallback to .run for quick test (older style)
    print("invoke failed, trying run():", e)
    print("run() output:", qa.run("What is the main conclusion of the document?"))

LLM api_url: https://api.groq.com/openai/v1/chat/completions
{'query': 'What is the main conclusion of the document?', 'result': 'Not stated in the document. \n\nThe document appears to be a collection of abstracts or summaries of various research papers presented at the 15th International Conference on Science and Innovative Engineering 2025, and does not have a single main conclusion. Each paper has its own conclusion or contribution, but there is no overarching conclusion for the entire document.'}


In [10]:
import requests, os, json
API_URL = os.getenv("GROQ_API_URL")
API_KEY = os.getenv("GROQ_API_KEY")
MODEL = os.getenv("GROQ_MODEL")

headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}
payload = {
    "model": MODEL,
    "messages": [
        {"role":"system","content":"You are a helpful assistant."},
        {"role":"user","content":"Say HELLO"}
    ],
    "max_tokens": 20
}

try:
    resp = requests.post(API_URL, json=payload, headers=headers, timeout=20)
except Exception as e:
    print("Request failed:", repr(e))
else:
    print("HTTP status:", resp.status_code)
    txt = resp.text[:2000]  # show up to first 2000 chars
    print("Response (first 2000 chars):")
    print(txt)
    # try to extract assistant text if possible
    try:
        doc = resp.json()
        content = None
        if isinstance(doc, dict):
            if "choices" in doc and isinstance(doc["choices"], list) and doc["choices"]:
                c = doc["choices"][0]
                # OpenAI style: c['message']['content']
                if isinstance(c, dict):
                    if "message" in c and isinstance(c["message"], dict) and "content" in c["message"]:
                        content = c["message"]["content"]
                    elif "text" in c:
                        content = c["text"]
            elif "output" in doc:
                content = doc["output"]
            elif "result" in doc:
                content = doc["result"]
        if content:
            print("\nExtracted assistant content:", repr(content))
    except Exception:
        pass


HTTP status: 200
Response (first 2000 chars):
{"id":"chatcmpl-f9561049-a860-48a3-8c6e-4c4cecc1e023","object":"chat.completion","created":1759307103,"model":"llama-3.3-70b-versatile","choices":[{"index":0,"message":{"role":"assistant","content":"HELLO"},"logprobs":null,"finish_reason":"stop"}],"usage":{"queue_time":0.048795471,"prompt_tokens":44,"prompt_time":0.002932659,"completion_tokens":3,"completion_time":0.008652851,"total_tokens":47,"total_time":0.01158551},"usage_breakdown":null,"system_fingerprint":"fp_155ab82e98","x_groq":{"id":"req_01k6fdzzkmfwfb8pp22egs4b5w"},"service_tier":"on_demand"}

Extracted assistant content: 'HELLO'


In [11]:
# reload module in case you edited it
import importlib, sys
if "groq_remote_llm" in sys.modules:
    importlib.reload(sys.modules["groq_remote_llm"])

from groq_remote_llm import GroqRemoteLLM
from langchain.chains import RetrievalQA

llm = GroqRemoteLLM()   # should show api_url when constructed if you added the check
print("LLM api_url:", llm.api_url)

# recreate the QA chain (assumes `retriever` exists)
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)
print("RetrievalQA ready.")


LLM api_url: https://api.groq.com/openai/v1/chat/completions
RetrievalQA ready.


In [12]:
query = "What is the main conclusion of the document?"

# invoke returns a dict-like result with varying keys depending on LangChain version
result = qa.invoke({"query": query})
print("Raw result keys:", list(result.keys()))
# pretty-print the returned dict (safe small view)
import json
print(json.dumps(result, indent=2)[:2000])


Raw result keys: ['query', 'result']
{
  "query": "What is the main conclusion of the document?",
  "result": "Not stated in the document. \n\nThe document appears to be a collection of abstracts or summaries of various research papers presented at the 15th International Conference on Science and Innovative Engineering 2025, and it does not provide a main conclusion. Each section discusses a different topic, and there is no overarching conclusion that ties the entire document together."
}


In [13]:
# Robust extraction
answer = None

# common keys to check
for key in ("output_text", "result", "answer", "text", "output"):
    if key in result:
        answer = result[key]
        break

# fall back to the first non-empty string value, skipping the echoed query
if not answer:
    for k, v in result.items():
        if k != "query" and isinstance(v, str) and len(v) > 0:
            answer = v
            break

print("ANSWER:\n", answer)

# If the chain returned source documents (under 'source_documents' or a similar key), print them
docs = result.get("source_documents") or result.get("source_docs")

if docs:
    print(f"\nRetrieved {len(docs)} source documents:")
    for i, d in enumerate(docs, 1):
        src = getattr(d, "metadata", {}).get("source", "unknown")
        print(f"\n--- SOURCE {i} (source: {src}) ---")
        print(d.page_content[:800])
else:
    print("\nNo source_documents returned with the result. You can still fetch them via retriever.get_relevant_documents(query).")


ANSWER:
 Not stated in the document. 

The document appears to be a collection of abstracts or summaries of various research papers presented at the 15th International Conference on Science and Innovative Engineering 2025, and it does not provide a main conclusion. Each section discusses a different topic, and there is no overarching conclusion that ties the entire document together.

No source_documents returned with the result. You can still fetch them via retriever.get_relevant_documents(query).
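
The answer-extraction logic recurs in several cells, so it may be convenient to wrap it in a function. A minimal sketch, assuming the result is a plain dict shaped like the outputs shown in this notebook; `extract_answer` is a hypothetical helper, and the key order simply puts `"result"` first because that is the key observed above.

```python
def extract_answer(result, keys=("result", "output_text", "answer", "text", "output")):
    """Pick the answer string out of a chain result dict, or None if absent."""
    for key in keys:
        value = result.get(key)
        if isinstance(value, str) and value:
            return value
    # Last resort: first non-empty string that is not the echoed query.
    for key, value in result.items():
        if key != "query" and isinstance(value, str) and value:
            return value
    return None


sample = {"query": "What is X?", "result": "Not stated in the document."}
print(extract_answer(sample))  # → Not stated in the document.
```

Skipping the `"query"` key in the fallback matters: the result dict echoes the question, and a naive "first string value" scan would return the question instead of the answer.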


In [14]:
try:
    out = qa.run(query)
    print("qa.run output:\n", out)
except Exception as e:
    print("qa.run failed:", e)


qa.run output:
 Not stated in the document. The document appears to be a collection of abstracts or summaries of various research papers presented at a conference, and it does not have a single main conclusion. Each paper has its own conclusion and findings, but there is no overarching conclusion that ties the entire document together.


In [15]:
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever, return_source_documents=True)
res = qa.invoke({"query": query})
# then extract res['source_documents'] as above


In [17]:
# === REBUILD FAISS WITH SMALLER CHUNKS (Notebook cell) ===
import os, json
from dotenv import load_dotenv
load_dotenv()

# Config - change PDF_PATH if needed
PDF_PATH = "sample.pdf"
CHUNK_SIZE = 500
CHUNK_OVERLAP = 80
EMBED_MODEL = "all-MiniLM-L6-v2"
SAVE_DIR = "faiss_index"
CHUNKS_JSONL = os.path.join(SAVE_DIR, "chunks.jsonl")

# Imports (make sure packages are installed)
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import SentenceTransformerEmbeddings
from langchain.vectorstores import FAISS

# 1) Load PDF
if not os.path.exists(PDF_PATH):
    raise FileNotFoundError(f"Put your PDF at '{PDF_PATH}' or change the PDF_PATH variable.")
loader = PyPDFLoader(PDF_PATH)
docs = loader.load()
print("Loaded pages:", len(docs))

# 2) Split into smaller chunks
splitter = CharacterTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)
chunks = splitter.split_documents(docs)
print("Created chunks:", len(chunks))

# 3) Save chunks to JSONL (safe, portable)
os.makedirs(SAVE_DIR, exist_ok=True)
with open(CHUNKS_JSONL, "w", encoding="utf-8") as f:
    for doc in chunks:
        rec = {"page_content": doc.page_content, "metadata": getattr(doc, "metadata", {})}
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
print("Saved chunks JSONL to:", CHUNKS_JSONL)

# 4) Create embeddings and FAISS vectorstore
embeddings = SentenceTransformerEmbeddings(model_name=EMBED_MODEL)
vectorstore = FAISS.from_documents(chunks, embeddings)
print("Built FAISS vectorstore with vectors:", len(chunks))

# 5) Persist FAISS index to disk
vectorstore.save_local(SAVE_DIR)
print("Saved FAISS index to folder:", SAVE_DIR)

# 6) Create retriever for immediate use (k=2 recommended)
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})
print("Retriever ready with k=2. Done.")

Loaded pages: 3
Created chunks: 3
Saved chunks JSONL to: faiss_index\chunks.jsonl
Built FAISS vectorstore with vectors: 3
Saved FAISS index to folder: faiss_index
Retriever ready with k=2. Done.
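
Note that the chunk count stayed at 3 even though `CHUNK_SIZE` dropped from 800 to 500. One plausible explanation is that `CharacterTextSplitter` cuts only at its separator (`"\n\n"` by default), so a page without blank lines stays in one piece regardless of `chunk_size`. The sketch below is a simplified stand-in for that behaviour, not LangChain's code; it ignores overlap, and `split_on_separator` is a hypothetical name.

```python
def split_on_separator(text, chunk_size, separator="\n\n"):
    """Split on `separator`, then greedily merge pieces up to chunk_size characters.

    A piece longer than chunk_size is kept whole, because there is no
    separator inside it to cut at.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if current and len(candidate) > chunk_size:
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


page_without_blank_lines = "word " * 400                   # 2000 characters, no "\n\n"
page_with_blank_lines = "\n\n".join(["word " * 40] * 10)   # ten 200-character paragraphs

print(len(split_on_separator(page_without_blank_lines, chunk_size=500)))  # → 1
print(len(split_on_separator(page_with_blank_lines, chunk_size=500)))     # → 5
```

If your PDF behaves like the first case, `RecursiveCharacterTextSplitter` from `langchain.text_splitter` may be worth trying, since it falls back to finer separators when a piece is still too long.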


In [18]:
from langchain.chains import RetrievalQA
from groq_remote_llm import GroqRemoteLLM

# Create LLM instance (ensure env vars loaded)
llm = GroqRemoteLLM()

# Build a map-reduce RetrievalQA
qa_map = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="map_reduce",   # map_reduce does per-chunk summaries then reduces
    retriever=retriever,
    return_source_documents=True    # ask it to return sources too
)

query = "Summarize the methodology section of the paper."
res = qa_map.invoke({"query": query})
print("Raw keys:", list(res.keys()))
# Extract answer robustly
answer = res.get("output_text") or res.get("result") or next((v for k, v in res.items() if k != "query" and isinstance(v, str)), None)
print("\n=== MAP-REDUCE ANSWER ===\n", answer)

# If sources returned, print them
docs = res.get("source_documents") or res.get("source_docs")
if docs:
    for i, d in enumerate(docs, 1):
        print(f"\n--- SOURCE {i} ---\n{d.page_content[:800]}")
else:
    print("\nNo source_documents in response — use retriever.get_relevant_documents(query) to inspect sources.")


Raw keys: ['query', 'result', 'source_documents']

=== MAP-REDUCE ANSWER ===
 The methodology section of the paper describes the use of a GRU-CNN hybrid ML approach to categorize assaults. The model is trained using pre-existing static patterns and then used to categorize insider attacks using various case studies, with the Insider Threat Test Dataset used to assess its effectiveness. The proposed Hybrid Classification Strategy (HCS) showed high accuracy and low false positive rates in both training and testing phases.

--- SOURCE 1 ---
Proceedings of 15th International Conference on Science and Innovative Engineering 2025 
April 26th - 27th, 2025 
Prince Dr.K.Vasudevan college of Engineering and Technology, India  
Manipal University College Malaysia, Melaka, Malaysia           ISBN 978-81-983498-5-9                                                                                                                       
 
 
117. LMS PLATFORM USING GENERATIVE AI 
 
Ruben George Varghese 
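
The difference between `chain_type="stuff"` and `chain_type="map_reduce"` is mainly how many LLM calls are made and what each one sees. The following is a conceptual sketch only: the prompts are invented, `fake_llm` is a stand-in that never contacts a model, and none of this mirrors LangChain's internal implementation.

```python
def stuff_answer(llm, question, docs):
    """One LLM call: all chunks are placed ("stuffed") into a single prompt."""
    context = "\n\n".join(docs)
    return llm(f"Context:\n{context}\n\nQuestion: {question}")


def map_reduce_answer(llm, question, docs):
    """One LLM call per chunk (map), then one call over the partial answers (reduce)."""
    partials = [llm(f"Context:\n{doc}\n\nQuestion: {question}") for doc in docs]
    combined = "\n\n".join(partials)
    return llm(f"Partial answers:\n{combined}\n\nQuestion: {question}")


calls = []


def fake_llm(prompt):
    """Stand-in for a real model: records the prompt and returns a short tag."""
    calls.append(prompt)
    return f"answer#{len(calls)}"


toy_chunks = ["chunk A", "chunk B"]
stuff_answer(fake_llm, "What methods were used?", toy_chunks)
print("stuff calls:", len(calls))        # → stuff calls: 1
calls.clear()
map_reduce_answer(fake_llm, "What methods were used?", toy_chunks)
print("map_reduce calls:", len(calls))   # → map_reduce calls: 3
```

With `k=2`, a map-reduce query therefore costs roughly three remote calls instead of one, in exchange for keeping each prompt short.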


In [19]:
from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain_community.vectorstores import FAISS
from groq_remote_llm import GroqRemoteLLM
from langchain.chains import RetrievalQA

emb = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.load_local("faiss_index", emb, allow_dangerous_deserialization=True)

retriever = vectorstore.as_retriever(search_kwargs={"k": 2})  # k=2 for precision

llm = GroqRemoteLLM()
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="map_reduce", retriever=retriever, return_source_documents=True)

query = "What is the main conclusion of the paper?"
res = qa.invoke({"query": query})
print("Answer:", res.get("output_text") or res.get("result"))

# Inspect sources
for i, d in enumerate(res.get("source_documents", []), 1):
    print(f"\n--- SOURCE {i} ---\n{d.page_content[:500]}")

Answer: Not stated in the document.

--- SOURCE 1 ---
Proceedings of 15th International Conference on Science and Innovative Engineering 2025 
April 26th - 27th, 2025 
Prince Dr.K.Vasudevan college of Engineering and Technology, India  
Manipal University College Malaysia, Melaka, Malaysia           ISBN 978-81-983498-5-9                                                                                                                       
 
 
 
Sidon sets-SSs are subsets of real numbers possessing different totals for pair wise sums. Simon Sidon 
i

--- SOURCE 2 ---
Proceedings of 15th International Conference on Science and Innovative Engineering 2025 
April 26th - 27th, 2025 
Prince Dr.K.Vasudevan college of Engineering and Technology, India  
Manipal University College Malaysia, Melaka, Malaysia           ISBN 978-81-983498-5-9                                                                                                                       
 
 
Jerusalem College of Engineering, 

In [20]:
from langchain.embeddings import SentenceTransformerEmbeddings
from langchain.vectorstores import FAISS
from groq_remote_llm import GroqRemoteLLM
from langchain.chains import RetrievalQA

emb = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.load_local("faiss_index", emb, allow_dangerous_deserialization=True)  # only if you saved via save_local
retriever = vectorstore.as_retriever(search_kwargs={"k":2})

llm = GroqRemoteLLM()
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="map_reduce", retriever=retriever, return_source_documents=True)


**Notes & troubleshooting**

- If the Groq LLM call fails, check your `.env` and run the connectivity `requests.post` test shown earlier.
- If `SentenceTransformer` import errors occur, set `os.environ['TRANSFORMERS_NO_TF']='1'` before importing or use a clean conda env.
- To reuse the saved FAISS index in a Streamlit app, use `FAISS.load_local` at app startup.

Run the cells sequentially from top to bottom.
