# RAG implementation with LangChain framework with local LLMs - complete class

In this notebook I create Retrieval-Augmented Generation (RAG) system based on my documents (some scientific papers) for by local LLMs implemented with LangChain framework

__What is RAG?__

Retrieval-Augmented Generation (RAG) is an architecture that augments a foundation LLM with external, up-to-date knowledge via a retrieval step before generation. It decouples the LLM's static training data from dynamic, domain-specific information.

__Core Components (Local Setup):__

* Vector Database: Stores embeddings of external documents (e.g., ChromaDB, FAISS, Weaviate).

* Embedding Model: Converts text → dense vectors (e.g., all-MiniLM-L6-v2, BGE-small).

* Retriever: Queries the DB for top-k relevant passages via similarity search (e.g., cosine distance).

* LLM: Takes the retrieved context + user query → generates a truth/fact-grounded response.


__This notebook builds on the previous RAG-LangChain notebook by encapsulating all the separate steps, functions and my trials into a single, concise and re-usable Class.__

_Note: There are still some improvements to apply, but overall it works great!_

## Complete RAG implementation with LangChain, ChromaDB, and options for local LLMs
The solution supports document updates, multiple file types, various retrievers, and flexible LLM integration, all wrapped in a dedicted Class.

### Small example of using sentence embedding models 

In [42]:
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Manual loading of the new Qwen3-0.6B embedding model via Transformers
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
# # Not working at the moment:
# #ValueError: The checkpoint you are trying to load has model type `qwen3` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.

# Mixedbread-ai/mxbai-embed-large-v1 embedding model
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

In [43]:
# The prompt used for query retrieval tasks:
# query_prompt = 'Represent this sentence for searching relevant passages: '
query = "A man is eating a piece of bread"
docs = [
    "A man is eating food.",
    "A man is eating pasta.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
]

In [44]:
query_embedding = model.encode(query, prompt_name="query")
# query_embedding = model.encode(query, prompt=query_prompt)
docs_embeddings = model.encode(docs)

similarities = cos_sim(query_embedding, docs_embeddings)
print('similarities:', similarities)

similarities: tensor([[0.7920, 0.6369, 0.1651, 0.3621]])


### Imports & Configs

In [1]:
my_llms = {'qwen3-coder': 'qwen3-coder-30b-a3b-instruct', 
           'qwen3-thinking': 'qwen3-30b-a3b-thinking-2507',
           'glm-4.5-air': 'glm-4.5-air',
           'glm-4.5-air-m': 'glm-4.5-air-mlx',
           'glm-4.5-air-5': 'glm-4.5-air@5bit',
           'gpt-oss': 'gpt-oss-120b',
           'gpt-oss-us': 'unsloth/gpt-oss-120b',
           }

In [25]:
import os
from pathlib import Path
from typing import List, Dict, Union
import torch
import chromadb

# LangChain/Document Processing Imports
from langchain.document_loaders import (
    DirectoryLoader,
    TextLoader,
    PyPDFLoader,
    UnstructuredMarkdownLoader,
    Docx2txtLoader
)
from langchain_community.document_loaders import Docx2txtLoader

from langchain import hub
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.chat_models import ChatOpenAI

# ChromaDB Vector Store
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings, OpenAIEmbeddings

# Retrievers
from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain.memory import ConversationBufferMemory

# LLM Integration
from langchain.llms import HuggingFacePipeline, OpenAI
from langchain.chains import RetrievalQA, ConversationalRetrievalChain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains.retrieval import create_retrieval_chain

# Prompts
from langchain.prompts import ChatPromptTemplate, PromptTemplate


workdir = Path.home()/"Downloads"/"RAG_personal_knowledge_base"

# Configuration of embdedding and LLM models
PERSIST_DIR = str(workdir/"chroma_db_mxbai-embed-large-v1")
#EMBEDDING_MODEL = "BAAI/bge-small-en-v1.5"
EMBEDDING_MODEL = "mixedbread-ai/mxbai-embed-large-v1"
BASE_URL = "http://localhost:1234/v1"  # LM-Studio local API
API_KEY = "not-needed"

LLM_CONFIG = {
    "provider": "lmstudio",  # Options: "openai", "huggingface", "lmstudio"
    "model_name": my_llms['gpt-oss-us'],
    "openai_api_key": API_KEY,
    "lmstudio_url": BASE_URL
}

# device setup
if torch.backends.mps.is_available():
    DEVICE = torch.device("mps")
else:
    DEVICE = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {DEVICE}")

Using device: mps


In [82]:
# Example of which keys PromptTemplate and ChatPromptTemplate have
prompttemplate = hub.pull("hwchase17/react")
chatprompttemplate = hub.pull("hwchase17/openai-tools-agent")
print(f'How PromptTemplate looks:\n{prompttemplate}\n')
print(f'How ChatPromptTemplate looks:\n{chatprompttemplate}\n')



How PromptTemplate looks:
input_variables=['agent_scratchpad', 'input', 'tool_names', 'tools'] input_types={} partial_variables={} metadata={'lc_hub_owner': 'hwchase17', 'lc_hub_repo': 'react', 'lc_hub_commit_hash': 'd15fe3c426f1c4b3f37c9198853e4a86e20c425ca7f4752ec0c9b0e97ca7ea4d'} template='Answer the following questions as best you can. You have access to the following tools:\n\n{tools}\n\nUse the following format:\n\nQuestion: the input question you must answer\nThought: you should always think about what to do\nAction: the action to take, should be one of [{tool_names}]\nAction Input: the input to the action\nObservation: the result of the action\n... (this Thought/Action/Action Input/Observation can repeat N times)\nThought: I now know the final answer\nFinal Answer: the final answer to the original input question\n\nBegin!\n\nQuestion: {input}\nThought:{agent_scratchpad}'

How ChatPromptTemplate looks:
input_variables=['agent_scratchpad', 'input'] optional_variables=['chat_histo

### RAG Class implementation

In [90]:
class LocalRAGSystem:

    def __init__(self, persist_dir: str = PERSIST_DIR, 
                 embedding_model_name: str = EMBEDDING_MODEL):
        """
        Initialize the local RAG system with Chroma vector store, embedding model and local LLM, plus all configuration parameters.
        """

        self.persist_dir = persist_dir
        self.embedding_model_name = embedding_model_name
        self.CHUNK_SIZE_THRESHOLD = 1200

        # Initialize embedding model (runs locally)
        self.embeddings = HuggingFaceEmbeddings(  # HuggingFace sentence_transformers embedding models
            model_name=self.embedding_model_name,
            model_kwargs={'device': DEVICE},
            encode_kwargs={
            'normalize_embeddings': True,  # Crucial for cosine similarity
            'batch_size': 32,  # Batch processing speedup
            'dtype': torch.float32  # (fp32), can also use: torch.bfloat16,
        })

        # Initialize local LLM with OpenAI-compatible API (LM-Studio)
        self.llm = ChatOpenAI(
            model_name=LLM_CONFIG["model_name"],
            openai_api_key=LLM_CONFIG["openai_api_key"],
            openai_api_base=LLM_CONFIG["lmstudio_url"],
            temperature=1, # 0.7 for most models, for gpt-oss 1 is recommended
            max_tokens=32000,
        )

        # Initialize Chroma vector store
        if os.path.exists(self.persist_dir) and bool(os.listdir(self.persist_dir)):
            print("Loading existing vector store...")
            try:
                self.vectordb = Chroma(
                    persist_directory=self.persist_dir,
                    embedding_function=self.embeddings,
                    collection_name="personal_knowledge"
                )
                self.split_docs = None
                # Check available collections
                client = chromadb.PersistentClient(path=self.persist_dir)
                print(f'Available collections in ChromaDB: {client.list_collections()}')
            except Exception as e:
                print(f"Error loading existing vector store: {e}")

        else:
            print("No existing vector store found.")


    def load_documents(self, source_dir: Union[str, Path]) -> List[Dict]:
        """Load documents from directory with multiple file support"""

        source_dir = Path(source_dir)
        if not source_dir.exists():
            print(f"Source directory does not exist: {source_dir}")
            return []
        print(f'Contents of the documents dir: {list(source_dir.glob("**/*"))}')

        # Define loaders for different file types
        loaders = {
            ".txt": TextLoader,
            ".md": UnstructuredMarkdownLoader,
            ".pdf": PyPDFLoader,
            ".docx": Docx2txtLoader,
        }

        documents = []
        for ext, loader_class in loaders.items():
            files = list(source_dir.glob(f"**/*{ext}"))
            if files:
                loader = DirectoryLoader(
                    str(source_dir),
                    glob=f"**/*{ext}",
                    loader_cls=loader_class,
                    show_progress=True
                )
                documents.extend(loader.load())
                print(f"Loaded {len(loader.load())} pages")

        return documents
    

    def chunk_docs(self, docs: List[Document]) -> List[Document]:
        """
        Split documents into smaller chunks for RAG, optimized for LLMs. 
        chunk_size: Max characters per text segment (e.g., 500). Too small: context loss. Too large: LLM overflows/context noise.
        chunk_overlap: Characters reused from previous chunk (e.g., 50). Prevents splitting mid-sentence → preserves context across chunks.
        """

        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=self.CHUNK_SIZE_THRESHOLD, 
            chunk_overlap=self.CHUNK_SIZE_THRESHOLD/10, # 10-20% overlap
            length_function=len, # Function to measure text length (characters not tokens!)
            is_separator_regex=False,
            separators=["\n\n", "\n", " ", "", "."],
        )
        split_docs = text_splitter.split_documents(docs)
        print(f"Total chunks created: {len(split_docs)}")

        return split_docs


    def build_vector_db_from_docs(self, file_paths: List[str]):
        """Build ChromaDB vector store (persist to disk) with BGE-small embeddings from document chunks."""

        self.documents = self.load_documents(file_paths)
        self.split_docs = self.chunk_docs(self.documents)

        self.vectordb = Chroma.from_documents(
            documents=self.split_docs,
            embedding=self.embeddings,
            persist_directory=self.persist_dir,
            collection_name="personal_knowledge"
        )
        print(f"Vector store persisted at {str(self.persist_dir)}")
        print(f"Added {len(self.split_docs)} documents to vector store")

        return self.vectordb


    def update_vector_db(self, new_docs_path: List[str]):
        """Update existing vector store with new documents"""

        new_documents = self.load_documents(new_docs_path)
        new_split_docs = self.chunk_docs(new_documents)
        print(new_split_docs)

        self.vectordb.add_documents(new_split_docs)
        # texts = [doc.page_content for doc in split_docs]
        # metadatas = [doc.metadata for doc in split_docs]
        # self.vectordb.add_texts(texts=texts, metadatas=metadatas)
        print("\nDatabase updated with new documents")
        self.vectordb.persist()

        return self.vectordb


    def list_existing_documents(self) -> List[str]:
        """
        List all documents (metadata) in the current vector store
        """
        try:
            docs = self.vectordb.get()
            return docs['metadatas'] if docs else []
        except Exception as e:
            print(f"Error listing documents: {e}")
            return []


    def create_retrievers(self):
            """
            Create different types of retrievers
            """

            if self.split_docs is None:
                print("No split documents available for BM25Retriever creation. Need to build vector DB first.")
                # Vector Retriever (semantic search)
                vector_retriever = self.vectordb.as_retriever(
                search_type="similarity",
                search_kwargs={"k": 30}
            )
                return {"vec_semantic": vector_retriever}

            else:
                # BM25 Retriever (for keyword-based search)
                bm25_retriever = BM25Retriever.from_documents(
                    self.split_docs,
                )
                bm25_retriever.k = 30

                # Vector Retriever (semantic search)
                vector_retriever = self.vectordb.as_retriever(
                    search_type="similarity",
                    search_kwargs={"k": 30}
                )

                # Ensemble Retriever combining BM25 and Vector
                ensemble_retriever = EnsembleRetriever(
                    retrievers=[bm25_retriever, vector_retriever],
                    weights=[0.4, 0.6]
                )

                return {
                    "bm25": bm25_retriever,
                    "vec_semantic": vector_retriever,
                    "ensemble": ensemble_retriever
                }


    def build_qa_chain(self,  
              retriever_type: str = "vec_semantic", 
              use_conversation: bool = False):
        """
        Builds a RetrievalQA chain with different retrieval methods, using the given local LLM, vector store and a question template.
        """

        retrievers = self.create_retrievers()

        # template = """
        #             Use ONLY the following context (source documents and metadata) to answer the question. 
        #             Do not use prior knowledge.
        #             If the context doesn't contain the answer, respond with "INSUFFICIENT CONTEXT".

        #             Context: {context}

        #             Question: {question}

        #             Helpful Answer:
        #         """
        # prompt = PromptTemplate.from_template(template)

        # The new implementation
        prompt = ChatPromptTemplate.from_messages(
        [
            ("system",
            '''Use ONLY the following context (source documents and metadata) to answer the question. 
                Do not use prior knowledge.
                If the context doesn't contain the answer, respond with "INSUFFICIENT CONTEXT"'''
             "Context: {context}"),
            ("human", "{question}")
        ])

        if use_conversation:
            memory = ConversationBufferMemory(
                memory_key="chat_history",
                output_key="answer",
                return_messages=True
            )
            conv_chain = ConversationalRetrievalChain.from_llm(
                llm=self.llm,
                retriever=retrievers[retriever_type],
                memory=memory,
                combine_docs_chain_kwargs={"prompt": prompt, 
                    "document_separator": "\n\n"},
                return_source_documents=True,
            )
            return conv_chain

        else:
            qa_chain = RetrievalQA.from_chain_type(
                llm=self.llm,
                chain_type="stuff",  # All context stuffed into prompt
                retriever=retrievers[retriever_type],
                return_source_documents=True,  # For source verification
                chain_type_kwargs={"prompt": prompt,
                    "document_separator": "\n\n"  # Clean context formatting, improves prompt readability for the LLM.
                }
            )
            # # the new LangChain api with "create_*"
            # # “stuff” strategy = concatenate all retrieved chunks
            # combine_chain = create_stuff_documents_chain(self.llm, prompt)
            # retrieval_chain = create_retrieval_chain(
            #     retriever=retrievers[retriever_type],
            #     combine_docs_chain=combine_chain,
            # )
            return qa_chain


### Running the complete RAG implementation

In [91]:
# Initialize the local RAG system, connecting to an existing vector db (if available at PERSIST_DIR)
local_rag_sys = LocalRAGSystem()

# check config setting are correct
print(f'Chroma DB path: {local_rag_sys.persist_dir}')
print(f'Embedding model: {EMBEDDING_MODEL}')
print(f'LLM model: {LLM_CONFIG["model_name"]}')
print(f'Document chunk size: {local_rag_sys.CHUNK_SIZE_THRESHOLD}')

Loading existing vector store...
Available collections in ChromaDB: [Collection(name=personal_knowledge)]
Chroma DB path: /Users/danid/Downloads/RAG_personal_knowledge_base/chroma_db_mxbai-embed-large-v1
Embedding model: mixedbread-ai/mxbai-embed-large-v1
LLM model: unsloth/gpt-oss-120b
Document chunk size: 1200


In [None]:
# List existing documents in the vector store
docs_in_db = local_rag_sys.list_existing_documents()
docs = []
for doc in docs_in_db:
    docs.append(doc['title'])
print(set(docs))  # Display unique document titles in the vector store

In [48]:
# Build a new vector database from pdf-documents
vectordb = local_rag_sys.build_vector_db_from_docs(file_paths=workdir/"pdfs")

Contents of the documents dir: [PosixPath('/Users/danid/Downloads/RAG_personal_knowledge_base/pdfs/Selection of microorganisms capable of polyethylene (PE) and polypropylene (PP) degradation.pdf'), PosixPath('/Users/danid/Downloads/RAG_personal_knowledge_base/pdfs/.DS_Store'), PosixPath('/Users/danid/Downloads/RAG_personal_knowledge_base/pdfs/Diffusion Sequence Models for Enhanced Protein Representation and Generation.pdf'), PosixPath('/Users/danid/Downloads/RAG_personal_knowledge_base/pdfs/Biodegradation of polyethylene and polypropylene by Lysinibacillus species JJY0216 isolated from soil grove.pdf'), PosixPath('/Users/danid/Downloads/RAG_personal_knowledge_base/pdfs/xTrimoPGLM- unified 100-billion-parameter pretrained transformer for deciphering the language of proteins.pdf'), PosixPath('/Users/danid/Downloads/RAG_personal_knowledge_base/pdfs/SaProt- Protein Language Modeling with Structure-aware Vocabulary.pdf'), PosixPath('/Users/danid/Downloads/RAG_personal_knowledge_base/pdfs/Po

100%|██████████| 7/7 [00:10<00:00,  1.47s/it]
100%|██████████| 7/7 [00:09<00:00,  1.42s/it]


Loaded 131 pages
Total chunks created: 528
Vector store persisted at /Users/danid/Downloads/RAG_personal_knowledge_base/chroma_db_mxbai-embed-large-v1
Added 528 documents to vector store


In [6]:
# update the vector db with new documents
vectordb = local_rag_sys.update_vector_db(new_docs_path=workdir/"new_docs_to_add")

Contents of the documents dir: [PosixPath('/Users/danid/Downloads/RAG_personal_knowledge_base/new_docs_to_add/Genomic and Metabolic Insights into Microbial Biodegradation of TNT.docx')]


100%|██████████| 1/1 [00:00<00:00, 109.65it/s]
100%|██████████| 1/1 [00:00<00:00, 153.94it/s]

Loaded 1 pages
Total chunks created: 56
[Document(metadata={'source': '/Users/danid/Downloads/RAG_personal_knowledge_base/new_docs_to_add/Genomic and Metabolic Insights into Microbial Biodegradation of TNT.docx'}, page_content='Genomic and Metabolic Insights into Microbial Biodegradation of TNT\n\n\n\nAbstract'), Document(metadata={'source': '/Users/danid/Downloads/RAG_personal_knowledge_base/new_docs_to_add/Genomic and Metabolic Insights into Microbial Biodegradation of TNT.docx'}, page_content='Abstract\n\n2,4,6-Trinitrotoluene (TNT) is a recalcitrant explosive pollutant whose breakdown in the environment is facilitated by diverse microbes. This document summarizes key findings on TNT biodegradation, including the spectrum of TNT-transforming bacteria, fungi, and yeasts, the major genes and enzymes driving TNT metabolism, and the primary degradation pathways. Genomic analyses have identified critical catabolic genes (e.g. nitroreductases, oxygen-insensitive reductases of the Old Yell





Database updated with new documents


  self.vectordb.persist()


In [7]:
# update the vector db with new documents
vectordb = local_rag_sys.update_vector_db(new_docs_path=workdir/"new_pdfs_to_add")

Contents of the documents dir: [PosixPath('/Users/danid/Downloads/RAG_personal_knowledge_base/new_pdfs_to_add/A marine bacterial community capable of degrading poly(ethylene terephthalate) and polyethylene.pdf')]


100%|██████████| 1/1 [00:00<00:00,  4.25it/s]
100%|██████████| 1/1 [00:00<00:00,  3.76it/s]


Loaded 12 pages
Total chunks created: 64
[Document(metadata={'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'creator': 'Elsevier', 'creationdate': '2021-06-12T16:00:41+00:00', 'moddate': '2021-06-12T16:49:48+00:00', 'title': 'A marine bacterial community capable of degrading poly(ethylene terephthalate) and polyethylene', 'keywords': 'Ocean,Plastics,Bacterial community reconstitution,Degradation,Pollution', 'subject': 'Journal of Hazardous Materials, 416 (2021) 125928. doi:10.1016/j.jhazmat.2021.125928', 'author': 'Rongrong Gao', 'source': '/Users/danid/Downloads/RAG_personal_knowledge_base/new_pdfs_to_add/A marine bacterial community capable of degrading poly(ethylene terephthalate) and polyethylene.pdf', 'total_pages': 12, 'page': 0, 'page_label': '1'}, page_content='Journal of Hazardous Materials 416 (2021) 125928\nAvailable online 24 April 2021\n0304-3894/© 2021 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license\n(http://creativ

In [29]:
# List existing documents in the vector store
docs_in_db = local_rag_sys.list_existing_documents()
docs = []
for doc in docs_in_db:
    docs.append(doc['source'])
print(set(docs))  # Display unique document titles in the vector store

{'/Users/danid/Downloads/RAG_personal_knowledge_base/pdfs/Polyethylene Degradation by a Rhodococcous Strain Isolated from Naturally Weathered Plastic Waste Enrichment .pdf', '/Users/danid/Downloads/RAG_personal_knowledge_base/new_docs_to_add/Genomic and Metabolic Insights into Microbial Biodegradation of TNT.docx', '/Users/danid/Downloads/RAG_personal_knowledge_base/pdfs/Biodegradation of polyethylene and polypropylene by Lysinibacillus species JJY0216 isolated from soil grove.pdf', '/Users/danid/Downloads/RAG_personal_knowledge_base/pdfs/Selection of microorganisms capable of polyethylene (PE) and polypropylene (PP) degradation.pdf', '/Users/danid/Downloads/RAG_personal_knowledge_base/pdfs/SaProt- Protein Language Modeling with Structure-aware Vocabulary.pdf', '/Users/danid/Downloads/RAG_personal_knowledge_base/pdfs/Using artificial intelligence to document the hidden RNA virosphere.pdf', '/Users/danid/Downloads/RAG_personal_knowledge_base/pdfs/Diffusion Sequence Models for Enhanced P

### Single Prompts Chain (not conversation)

In [103]:
# Complete workflow with existing documents vector db, running QA-chain with a query to the model
# Single prompts chain
qa_chain = local_rag_sys.build_qa_chain(retriever_type="vec_semantic", 
                                        use_conversation=False)

No split documents available for BM25Retriever creation. Need to build vector DB first.


In [104]:
result = qa_chain.invoke('''How does the SaProt protein-LLM differs from other protein-LLMs such as ESM-2? 
                         Two new protein LLMs models called: Diffusion Sequence Model (DSM) and xTrimoPGLM are suppose to be state-of-the-art. Compare them to SaProt. 
                        What are the key architectural novelties of these models?
                        Use the given context (source documents) to answer the question.
                        Give the name of the source documents and metadata that supports your answer.''')

In [105]:
print("\n=== ANSWER ===")
print(result['result'])   
print("\n=== SOURCES ===")
for doc in result['source_documents']:
    print(f"Content: {doc.page_content[:200]}...")
    print(f"Metadata title: {doc.metadata['title']}\n")
    print(f"Metadata page number: {doc.metadata['page']}\n")


=== ANSWER ===
**What makes SaProt different from other protein‑LLMs**

| Feature | SaProt (the “structure‑aware” LLM) | ESM‑2 (baseline LLM) |
|---------|-----------------------------------|----------------------|
| **Core novelty** | Introduces a *Structure‑Aware (SA) vocabulary* that concatenates the usual residue token with a structure token derived from Foldseek‑encoded 3‑D information. The model therefore sees “SA‑tokens” instead of pure amino‑acid letters. | Pure sequence‑only tokenisation; no explicit structural token. |
| **Back‑bone** | Uses the exact same transformer architecture, size and pre‑training data as ESM‑2 (≈650 M parameters), but the input embedding layer is enlarged to accommodate the extra structure tokens. | Standard ESM‑2 transformer. |
| **Training data** | ~40 M protein sequences *paired* with predicted structures (Foldseek/AlphaFold). This is far larger in terms of structural examples than any previous PLM. | Only raw amino‑acid sequences; no paired struct

In [14]:
result = qa_chain.invoke('''Which microorganisms (bacteria and fungi) are capable of degradation polyethylene (PE)?. 
                        By using which methods were they isolated and from which environments? 
                        Is Lysinibacillus and also Rhodococcous strains are capable of degrading PE? 
                        Use the given context (source documents) to answer the question.
                        Give the name of the source documents and metadata that supports your answer.''')

In [15]:
print("\n=== ANSWER ===")
print(result['result'])   
print("\n=== SOURCES ===")
for doc in result['source_documents']:
    print(f"Content: {doc.page_content[:200]}...")
    print(f"Metadata title: {doc.metadata['title']}\n")
    print(f"Metadata page number: {doc.metadata['page']}\n")


=== ANSWER ===
**Microorganisms that have been shown in the supplied documents to degrade polyethylene (PE)**  

| Domain | Species / genus (examples) reported as PE‑degraders | How they were obtained (isolation method & source environment) | Document(s) & bibliographic details |
|--------|------------------------------------------------------|--------------------------------------------------------------------------|-------------------------------------|
| **Bacteria** | • *Streptomyces* spp. ( *badius, setonia, viridosporus* )  <br>• *Bacillus* spp. (many strains cited in several surveys)  <br>• *Lysinibacillus sp.* JJY0216  <br>• *Lysinibacillus fusiformis*  <br>• *Methylobacterium paraoxydans*  <br>• *Brevibacillus borstelensis* strain 707  <br>• *Pseudomonas* spp.  <br>• *Arthrobacter* sp. (HDPE degrader)  <br>• *Priestia megaterium* (formerly *Bacillus megaterium*)  <br>• *Klebsiella pneumoniae*  <br>• *Pseudomonas fluorescens*  <br>• *Enterobacter ludwigii*  <br>• *Chryseobacte

In [16]:
result = qa_chain.invoke('''Summarize the main group of TNT degrading microbial species, separate into Aerobic vs. Anaerobic bacteria, what are the main differences? 
                        Use the given context (source documents) to answer the question.
                        Give the name of the source documents and metadata that supports your answer.''')

In [17]:
print("\n=== ANSWER ===")
print(result['result'])   
print("\n=== SOURCES ===")
for doc in result['source_documents']:
    print(f"Content: {doc.page_content[:200]}...")
    print(f"Metadata title: {doc.metadata['source']}\n")


=== ANSWER ===
**Main groups of TNT‑degrading microbes (as described in the document “Comprehensive analysis of microbial TNT degradation”)**

|                     | Representative taxa mentioned in the text | Typical metabolic features |
|---------------------|-------------------------------------------|----------------------------|
| **Aerobic bacteria** | • *Pseudomonas* spp. (e.g., *P. fluorescens*, *P. putida*)  <br>• *Bacillus* spp.<br>• *Staphylococcus* spp.<br>• *Mycobacterium* spp.<br>• *Rhodococcus* spp. (e.g., *R. erythropolis*)<br>• *Serratia marcescens* <br>• *Stenotrophomonas* isolates | – Operate in the presence of O₂.  <br>– Initiate TNT degradation by **nitro‑group reduction** (nitroreductases, azoreductases) producing hydroxylamino‑ and amino‑dinitrotoluenes (ADNTs, DANTs).<br>– Some strains form Meisenheimer σ‑complexes that release nitrite (denitration) and generate dinitrotoluene intermediates that can be further attacked by dioxygenases.<br>– Frequently generate

In [18]:
result = qa_chain.invoke('''What are the main genes and enzymes involved in TNT degradation? 
                        Make a brief summary of the mechanism of action of each such gene/enzyme.
                        Use the given context (source documents) to answer the question.
                        Give the name of the source documents and metadata that supports your answer.''')

In [20]:
print("\n=== ANSWER ===")
print(result['result'])   
print("\n=== SOURCES ===")
for doc in result['source_documents']:
    print(f"Content: {doc.page_content[:200]}...")
    print(f"Metadata title: {doc.metadata['source']}\n")


=== ANSWER ===
**Main genes / enzymes that have been identified in the microbial degradation of 2‑4‑6‑trinitrotoluene (TNT)**  

| Gene / Enzyme (example) | Primary catalytic/mechanistic role in TNT turnover | Where this information appears in the supplied context |
|--------------------------|---------------------------------------------------|--------------------------------------------------------|
| **Type I nitroreductases** – *nfsA, nfsB, nemA* (e.g., NemA from a Citrobacter sp.) | Two‑electron reduction of TNT’s nitro groups (–NO₂ → –NO → –NHOH → –NH₂) using FMN/FAD and NAD(P)H as electron donors. This generates aminodinitrotoluene (ADNT) intermediates but does **not** cleave the aromatic ring. | “Nitroreductases (Type I)… genes encoding such enzymes (e.g., nfsA, nfsB … nemA…) have been found… NemA was most up‑regulated… converts TNT to 4‑ADNT and 2‑ADNT.” – *Genomic Insights: Key Genes and Enzymes*; also in the “TNT‑Degrading Microbial Species” section. |
| **Azoreductase** (a

### Conversational Chain (with ConversationBufferMemory)
For follow‑up calls use the same qa_chain object, and the memory will automatically include everything you said before.

If you want to start a brand‑new chat without creating a new chain object:

`qa_chain.memory.clear()`   # wipes the stored chat_history

Or just instantiate a fresh chain:

```
qa_chain = local_rag_sys.build_qa_chain(
    retriever_type="vec_semantic",
    use_conversation=True,
)
```

In [106]:
# Complete workflow with existing documents vector db, running QA-chain with a query to the model
# Build a conversational chain
qa_chain = local_rag_sys.build_qa_chain(retriever_type="vec_semantic", 
                                        use_conversation=True)

No split documents available for BM25Retriever creation. Need to build vector DB first.


In [107]:
# First turn
ans1 = qa_chain.invoke({"question": '''How does the SaProt protein-LLM differs from other protein-LLMs such as ESM-2? 
                        Use the given context (source documents) to answer the question.
                        Give the name of the source documents and metadata that supports your answer.'''})
print("\n🤖", ans1["answer"])

# Second turn – the model sees the previous answer in its context now
ans2 = qa_chain.invoke({"question": '''Two new protein LLMs models called: Diffusion Sequence Model (DSM) and xTrimoPGLM are suppose to be state-of-the-art. 
                        Compare them to the model you discussed in the previous prompt.
                        What are the key architectural novelties of these models?
                        Use the given context (source documents) to answer the question.
                        Give the name of the source documents and metadata that supports your answer. '''})
print("\n🤖", ans2["answer"])


🤖 **How SaProt differs from other protein‑LLMs such as ESM‑2**

| Aspect | SaProt (SaProt paper) | ESM‑2 (ESM family) |
|--------|----------------------|--------------------|
| **Vocabulary** | Uses a *structure‑aware (SA) vocabulary* that fuses the ordinary amino‑acid token with a Foldseek‑derived 3Di structural token at each position. The embedding layer therefore contains 441 tokens (21 residues × 21 structure symbols). | Uses only the conventional 20 residue tokens (plus special tokens). |
| **Input representation** | Each protein is encoded as an *SA‑token sequence* (e.g., “s₁f₁, s₂f₂ …”), where the second component carries explicit 3‑D conformation information. | Input consists solely of the linear amino‑acid string; any structural signal must be inferred implicitly from the sequence. |
| **Model size & architecture** | Identical transformer backbone and parameter count to ESM‑2 650M; the only architectural change is the expanded embedding matrix for the SA tokens. (SaProt “shar

In [108]:
for doc in ans2['source_documents']:
    print(f"Content: {doc.page_content[:200]}...")
    print(f"Metadata title: {doc.metadata['source']}\n")

Content: process. DSM extends ESM2 ( 33) with a novel language modeling head and training objective, enabling robust denoising
across high corruption rates and sequence generation with global context. After tr...
Metadata title: /Users/danid/Downloads/RAG_personal_knowledge_base/pdfs/Diffusion Sequence Models for Enhanced Protein Representation and Generation.pdf

Content: arXiv:2506.08293v1  [q-bio.BM]  9 Jun 2025
Diffusion Sequence Models for Enhanced Protein Representation and
Generation
Logan Hallee1, 2, Nikolaos Rafailidis1, David B. Bichara2, and Jason P. Gleghorn...
Metadata title: /Users/danid/Downloads/RAG_personal_knowledge_base/pdfs/Diffusion Sequence Models for Enhanced Protein Representation and Generation.pdf

Content: embeddings and valuable downstream tasks (datasets described in Supplemental Figure S3). DSM650 produced the highest
quality embeddings among similarly sized pLMs, generating consistently high F1 scor...
Metadata title: /Users/danid/Downloads/RAG_personal_k

In [80]:
# First turn
ans1 = qa_chain.invoke({"question": '''Summarize the main group of TNT degrading microbial species, separate into Aerobic vs. Anaerobic bacteria, what are the main differences? 
                        Use the given context (source documents) to answer the question.
                        Give the name of the source documents and metadata that supports your answer.'''})
print("\n🤖", ans1["answer"])

# Second turn – the model sees the previous answer in its context now
ans2 = qa_chain.invoke({"question": '''What are the main genes and enzymes involved in TNT degradation of the Aerobic vs. Anaerobic bacteria that you have summarized? 
                        Use the given context (source documents) to answer the question.
                        Give the name of the source documents and metadata that supports your answer. '''})
print("\n🤖", ans2["answer"])


🤖 **Summary of the main groups of TNT‑degrading microbes (bacteria)**  

| **Category** | **Typical genera / species mentioned in the source material** | **Characteristic metabolic features** |
|--------------|---------------------------------------------------------------|---------------------------------------|
| **Aerobic bacteria** | • *Pseudomonas* spp. (e.g., *P. fluorescens*, *P. putida*, strain IIBx)  <br>• *Bacillus* spp.<br>• *Staphylococcus* spp.<br>• *Mycobacterium* spp.<br>• *Rhodococcus* spp.<br>• *Serratia marcescens*<br>• *Stenotrophomonas* isolates | – Operate under oxic conditions. <br>– Primary route is **nitro‑group reduction** (NO₂ → NO → NHOH → NH₂) producing hydroxylamino‑ and amino‑dinitrotoluenes (ADNTs, DANTs). <br>– Some strains can form **Meisenheimer complexes** that eject nitrite, yielding dinitrotoluene (DNT) which can then be attacked by dioxygenases. <br>– Generally stop at partially reduced intermediates (ADNT, azoxy dimers); complete mineralisation o

In [102]:
for doc in ans2['source_documents']:
    print(f"Content: {doc.page_content[:200]}...")
    print(f"Metadata title: {doc.metadata['source']}\n")

Content: Scope of Document: This document provides a comprehensive analysis of microbial TNT degradation. We first catalog the known TNT-degrading species across bacteria, fungi, and yeasts, highlighting their...
Metadata title: /Users/danid/Downloads/RAG_personal_knowledge_base/new_docs_to_add/Genomic and Metabolic Insights into Microbial Biodegradation of TNT.docx

Content: Genomic Insights: Key Genes and Enzymes



Microbial TNT degradation relies on a suite of specialized enzymes that initiate TNT transformation despite its inert aromatic structure. Genomic and biochem...
Metadata title: /Users/danid/Downloads/RAG_personal_knowledge_base/new_docs_to_add/Genomic and Metabolic Insights into Microbial Biodegradation of TNT.docx

Content: Comparative Genomic Insights: Genome sequencing of TNT-degrading strains has shed light on their evolutionary adaptations. For example, three Antarctic Pseudomonas isolates (TNT3, TNT11, TNT19) were s...
Metadata title: /Users/danid/Downloads/RAG_pers