# RAG implementation with LangChain framework with local LLMs - complete class

In this notebook I build a Retrieval-Augmented Generation (RAG) system over my own documents (some scientific papers), answered by local LLMs and implemented with the LangChain framework.

__What is RAG?__

Retrieval-Augmented Generation (RAG) is an architecture that augments a foundation LLM with external, up-to-date knowledge via a retrieval step before generation. It decouples the LLM's static training data from dynamic, domain-specific information.
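
As a minimal sketch of the retrieve-then-generate flow described above (not the LangChain implementation used later), the function below takes two callables. `answer_with_rag`, `toy_retrieve` and `toy_generate` are hypothetical names, and the toy "LLM" simply echoes its prompt so the assembled context is visible.

```python
from typing import Callable, List


def answer_with_rag(question: str,
                    retrieve: Callable[[str], List[str]],
                    generate: Callable[[str], str]) -> str:
    """Retrieve passages first, then ask the LLM to answer from them."""
    passages = retrieve(question)
    context = "\n\n".join(passages)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return generate(prompt)


# Toy stand-ins (hypothetical): a keyword "retriever" and an echoing "LLM".
knowledge = [
    "SaProt uses a structure-aware vocabulary.",
    "Chroma stores embeddings on disk.",
]


def toy_retrieve(question: str) -> List[str]:
    # Keep passages that share at least one lower-cased word with the question.
    words = set(question.lower().replace("?", "").split())
    return [p for p in knowledge if words & set(p.lower().rstrip(".").split())]


def toy_generate(prompt: str) -> str:
    return prompt  # a real LLM would produce an answer here


output = answer_with_rag("What vocabulary does SaProt use?", toy_retrieve, toy_generate)
print(output)
```

The point of the design is that the LLM only sees what the retriever returns, so the knowledge base can change without retraining the model.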

__Core Components (Local Setup):__

* Vector Database: Stores embeddings of external documents (e.g., ChromaDB, FAISS, Weaviate).

* Embedding Model: Converts text → dense vectors (e.g., all-MiniLM-L6-v2, BGE-small).

* Retriever: Queries the DB for top-k relevant passages via similarity search (e.g., cosine distance).

* LLM: Takes the retrieved context + user query → generates a truth/fact-grounded response.
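
To make the retriever step concrete, here is a small sketch of top-k selection by cosine similarity in plain Python. The three-dimensional vectors are made up for illustration; a real embedding model such as BGE-small produces vectors with hundreds of dimensions, and a vector database uses an index rather than this brute-force scan.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          doc_vecs: Sequence[Sequence[float]],
          k: int = 2) -> List[Tuple[int, float]]:
    """Return (document index, similarity) pairs for the k most similar documents."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "embeddings" (illustrative values only)
doc_vecs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
query_vec = [1.0, 0.1, 0.0]

ranked = top_k(query_vec, doc_vecs, k=2)
print(ranked)  # document 0 ranks first, document 1 second
```

When embeddings are normalized to unit length, the cosine similarity reduces to a plain dot product, which is why normalization matters for this kind of search.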


__This notebook builds on the previous RAG-LangChain notebook by encapsulating all the separate steps, functions and my trials into a single, concise and re-usable Class.__

_Note: There are still some improvements to apply, but overall it works great!_

## Complete RAG implementation with LangChain, ChromaDB, and options for local LLMs
The solution supports document updates, multiple file types, various retrievers, and flexible LLM integration, all wrapped in a dedicated class.

### Imports & Configs

In [1]:
my_llms = {'qwen3-coder': 'qwen3-coder-30b-a3b-instruct', 
           'qwen3-thinking': 'qwen3-30b-a3b-thinking-2507',
           'glm-4.5-air': 'glm-4.5-air',
           'glm-4.5-air-m': 'glm-4.5-air-mlx',
           'glm-4.5-air-5': 'glm-4.5-air@5bit',
           'gpt-oss': 'gpt-oss-120b',
           'gpt-oss-us': 'unsloth/gpt-oss-120b',
           }

In [2]:
import os
from pathlib import Path
from typing import List, Dict, Union
import torch
import chromadb

# LangChain/Document Processing Imports
from langchain.document_loaders import (
    DirectoryLoader,
    TextLoader,
    PyPDFLoader,
    UnstructuredMarkdownLoader,
    Docx2txtLoader,
)

from langchain import hub
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.chat_models import ChatOpenAI

# ChromaDB Vector Store
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings, OpenAIEmbeddings

# Retrievers
from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain.memory import ConversationBufferMemory

# LLM Integration
from langchain.llms import HuggingFacePipeline, OpenAI
from langchain.chains import RetrievalQA, ConversationalRetrievalChain

# Prompts
from langchain.prompts import ChatPromptTemplate, PromptTemplate


workdir = Path.home()/"Downloads"/"RAG_personal_knowledge_base"

# Configuration of embedding and LLM models
PERSIST_DIR = str(workdir/"chroma_db_v4")
EMBEDDING_MODEL = "BAAI/bge-small-en-v1.5"
BASE_URL = "http://localhost:1234/v1"  # LM-Studio local API
API_KEY = "not-needed"

LLM_CONFIG = {
    "provider": "lmstudio",  # Options: "openai", "huggingface", "lmstudio"
    "model_name": my_llms['glm-4.5-air-5'],
    "openai_api_key": API_KEY,
    "lmstudio_url": BASE_URL
}

# device setup
if torch.backends.mps.is_available():
    DEVICE = torch.device("mps")
else:
    DEVICE = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {DEVICE}")

Using device: mps


### RAG Class implementation

In [3]:
class LocalRAGSystem:

    def __init__(self, persist_dir: str = PERSIST_DIR, 
                 embedding_model_name: str = EMBEDDING_MODEL):
        """
        Initialize the local RAG system with Chroma vector store, embedding model and local LLM, plus all configuration parameters.
        """

        self.persist_dir = persist_dir
        self.embedding_model_name = embedding_model_name
        self.CHUNK_SIZE_THRESHOLD = 1200

        # Initialize embedding model (runs locally)
        self.embeddings = HuggingFaceEmbeddings(  # HuggingFace sentence_transformers embedding models
            model_name=self.embedding_model_name,
            model_kwargs={'device': DEVICE},
            encode_kwargs={
                'normalize_embeddings': True,  # Crucial for cosine similarity
                'batch_size': 32,  # Batch processing speedup
                'dtype': torch.float32  # (fp32), can also use: torch.bfloat16,
            },
        )

        # Initialize local LLM with OpenAI-compatible API (LM-Studio)
        self.llm = ChatOpenAI(
            model_name=LLM_CONFIG["model_name"],
            openai_api_key=LLM_CONFIG["openai_api_key"],
            openai_api_base=LLM_CONFIG["lmstudio_url"],
            temperature=0.7,
            max_tokens=16000,
        )

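        # Defaults, so later methods can check whether a store / chunks exist
        self.vectordb = None
        self.split_docs = None

        # Initialize Chroma vector store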
        if os.path.exists(self.persist_dir) and bool(os.listdir(self.persist_dir)):
            print("Loading existing vector store...")
            try:
                self.vectordb = Chroma(
                    persist_directory=self.persist_dir,
                    embedding_function=self.embeddings,
                    collection_name="personal_knowledge"
                )
                self.split_docs = None
                # Check available collections
                client = chromadb.PersistentClient(path=self.persist_dir)
                print(f'Available collections in ChromaDB: {client.list_collections()}')
            except Exception as e:
                print(f"Error loading existing vector store: {e}")

        else:
            print("No existing vector store found.")


    def load_documents(self, source_dir: Union[str, Path]) -> List[Document]:
        """Load documents from directory with multiple file support"""

        source_dir = Path(source_dir)
        if not source_dir.exists():
            print(f"Source directory does not exist: {source_dir}")
            return []
        print(f'Contents of the documents dir: {list(source_dir.glob("**/*"))}')

        # Define loaders for different file types
        loaders = {
            ".txt": TextLoader,
            ".md": UnstructuredMarkdownLoader,
            ".pdf": PyPDFLoader,
            ".docx": Docx2txtLoader,
        }

        documents = []
        for ext, loader_class in loaders.items():
            files = list(source_dir.glob(f"**/*{ext}"))
            if files:
                loader = DirectoryLoader(
                    str(source_dir),
                    glob=f"**/*{ext}",
                    loader_cls=loader_class,
                    show_progress=True
                )
                loaded = loader.load()  # load once; a second load() call would re-parse every file
                documents.extend(loaded)
                print(f"Loaded {len(loaded)} pages")

        return documents
    

    def chunk_docs(self, docs: List[Document]) -> List[Document]:
        """
        Split documents into smaller chunks for RAG, optimized for LLMs. 
        chunk_size: Max characters per chunk (CHUNK_SIZE_THRESHOLD, 1200 here). Too small: context loss. Too large: wastes the LLM context window and adds noise.
        chunk_overlap: Characters repeated from the end of the previous chunk (10% of chunk_size here), so content near a chunk boundary keeps some of its surrounding context.
        """

        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=self.CHUNK_SIZE_THRESHOLD, 
            chunk_overlap=self.CHUNK_SIZE_THRESHOLD // 10,  # 10% overlap, as an integer number of characters
            length_function=len, # Function to measure text length (characters not tokens!)
            is_separator_regex=False,
            separators=["\n\n", "\n", ".", " ", ""],  # tried in order; "" must come last because it always matches
        )
        split_docs = text_splitter.split_documents(docs)
        print(f"Total chunks created: {len(split_docs)}")

        return split_docs


    def build_vector_db_from_docs(self, file_paths: Union[str, Path]):
        """Build ChromaDB vector store (persist to disk) with BGE-small embeddings from document chunks."""

        self.documents = self.load_documents(file_paths)
        self.split_docs = self.chunk_docs(self.documents)

        self.vectordb = Chroma.from_documents(
            documents=self.split_docs,
            embedding=self.embeddings,
            persist_directory=self.persist_dir,
            collection_name="personal_knowledge"
        )
        print(f"Vector store persisted at {str(self.persist_dir)}")
        print(f"Added {len(self.split_docs)} documents to vector store")

        return self.vectordb


    def update_vector_db(self, new_docs_path: Union[str, Path]):
        """Update existing vector store with new documents"""

        new_documents = self.load_documents(new_docs_path)
        new_split_docs = self.chunk_docs(new_documents)
        print(new_split_docs)

        self.vectordb.add_documents(new_split_docs)
        # texts = [doc.page_content for doc in split_docs]
        # metadatas = [doc.metadata for doc in split_docs]
        # self.vectordb.add_texts(texts=texts, metadatas=metadatas)
        print("\nDatabase updated with new documents")
        self.vectordb.persist()

        return self.vectordb


    def list_existing_documents(self) -> List[Dict]:
        """
        List all documents (metadata) in the current vector store
        """
        try:
            docs = self.vectordb.get()
            return docs['metadatas'] if docs else []
        except Exception as e:
            print(f"Error listing documents: {e}")
            return []


    def create_retrievers(self):
            """
            Create different types of retrievers
            """

            if self.split_docs is None:
                print("No split documents available for BM25Retriever creation. Need to build vector DB first.")
                # Vector Retriever (semantic search)
                vector_retriever = self.vectordb.as_retriever(
                    search_type="similarity",
                    search_kwargs={"k": 30}
                )
                return {"vec_semantic": vector_retriever}

            else:
                # BM25 Retriever (for keyword-based search)
                bm25_retriever = BM25Retriever.from_documents(
                    self.split_docs,
                )
                bm25_retriever.k = 30

                # Vector Retriever (semantic search)
                vector_retriever = self.vectordb.as_retriever(
                    search_type="similarity",
                    search_kwargs={"k": 30}
                )

                # Ensemble Retriever combining BM25 and Vector
                ensemble_retriever = EnsembleRetriever(
                    retrievers=[bm25_retriever, vector_retriever],
                    weights=[0.4, 0.6]
                )

                return {
                    "bm25": bm25_retriever,
                    "vec_semantic": vector_retriever,
                    "ensemble": ensemble_retriever
                }


    def build_qa_chain(self,  
              retriever_type: str = "vec_semantic", 
              use_conversation: bool = False):
        """
        Builds a RetrievalQA chain with different retrieval methods, using the given local LLM, vector store and a question template.
        """

        retrievers = self.create_retrievers()

        template = """
                    Use ONLY the following context (source documents and metadata) to answer the question. 
                    Do not use prior knowledge.
                    If the context doesn't contain the answer, respond with "INSUFFICIENT CONTEXT".

                    Context:
                    {context}

                    Question: {question}
                    Helpful Answer:
                """
        prompt = PromptTemplate.from_template(template)

        if use_conversation:
            memory = ConversationBufferMemory(
                memory_key="chat_history",
                output_key="answer",
                return_messages=True
            )
            
            qa_chain = ConversationalRetrievalChain.from_llm(  # same name as the non-conversational branch, so the return below works
                llm=self.llm,
                retriever=retrievers[retriever_type],
                memory=memory,
                return_source_documents=True
            )
        else:
            qa_chain = RetrievalQA.from_chain_type(
                llm=self.llm,
                chain_type="stuff",  # All context stuffed into prompt
                retriever=retrievers[retriever_type],
                return_source_documents=True,  # For source verification
                chain_type_kwargs={"prompt": prompt,
                    "document_separator": "\n\n"  # Clean context formatting, improves prompt readability for the LLM.
                }
            )

        return qa_chain
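
For intuition about the `"ensemble"` retriever built in `create_retrievers`: to my understanding, LangChain's `EnsembleRetriever` merges the ranked lists of its sub-retrievers with weighted Reciprocal Rank Fusion. The sketch below shows that scoring rule on toy data. The function name `weighted_rrf`, the chunk ids, and the smoothing constant `c=60` (a commonly used default) are assumptions for illustration, not values read from this notebook.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def weighted_rrf(rankings: Sequence[List[str]],
                 weights: Sequence[float],
                 c: int = 60) -> List[str]:
    """Fuse several ranked lists: each list adds weight / (rank + c) to a document's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical ranked chunk ids from a keyword (BM25) and a vector retriever
bm25_ranking = ["chunk_a", "chunk_b", "chunk_c"]
vector_ranking = ["chunk_c", "chunk_a", "chunk_d"]

fused = weighted_rrf([bm25_ranking, vector_ranking], weights=[0.4, 0.6])
print(fused)  # ['chunk_a', 'chunk_c', 'chunk_d', 'chunk_b']
```

Chunks that appear in both lists rise to the top, while the 0.4/0.6 weights tilt ties towards the semantic retriever.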


### Running the complete RAG implementation

In [4]:
# Initialize the local RAG system, connecting to an existing vector db (if available at PERSIST_DIR)
local_rag_sys = LocalRAGSystem()

# Check that the config settings are correct
print(f'Chroma DB path: {local_rag_sys.persist_dir}')
print(f'Embedding model: {EMBEDDING_MODEL}')
print(f'LLM model: {LLM_CONFIG["model_name"]}')
print(f'Document chunk size: {local_rag_sys.CHUNK_SIZE_THRESHOLD}')


Loading existing vector store...



Available collections in ChromaDB: [Collection(name=personal_knowledge)]
Chroma DB path: /Users/danid/Downloads/RAG_personal_knowledge_base/chroma_db_v4
Embedding model: BAAI/bge-small-en-v1.5
LLM model: glm-4.5-air@5bit
Document chunk size: 1200


In [54]:
# List existing documents in the vector store
docs_in_db = local_rag_sys.list_existing_documents()
docs = []
for doc in docs_in_db:
    docs.append(doc['title'])
print(set(docs))  # Display unique document titles in the vector store

{'Biodegradation of polyethylene and polypropylene by Lysinibacillus species JJY0216 isolated from soil grove', 'xTrimoPGLM: unified 100-billion-parameter pretrained transformer for deciphering the language of proteins', 'Using artificial intelligence to document the hidden RNA virosphere', 'SaProt: Protein Language Modeling with Structure-aware Vocabulary', 'Selection of microorganisms capable of polyethylene (PE) and polypropylene (PP) degradation', 'Diffusion Sequence Models for Enhanced Protein Representation and Generation', 'Polyethylene Degradation by a Rhodococcous Strain Isolated from Naturally Weathered Plastic Waste Enrichment', 'A marine bacterial community capable of degrading poly(ethylene terephthalate) and polyethylene'}
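
The loop above indexes `doc['title']` directly, which raises a `KeyError` for chunks whose metadata only carries a `source` path (as happens for the `.docx` file added later in this notebook). A small, hedged helper that tolerates both shapes and also counts chunks per document could look like this. `chunks_per_document` and the toy metadata values are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def chunks_per_document(metadatas: List[Dict]) -> Counter:
    """Count chunks per document, keyed by 'title' when present, otherwise by 'source'."""
    return Counter(
        m.get("title") or m.get("source") or "<unknown>"
        for m in metadatas
    )


# Toy metadata shaped like the list returned by list_existing_documents()
toy_metadatas = [
    {"title": "Paper A", "page": 0},
    {"title": "Paper A", "page": 1},
    {"source": "/tmp/notes.docx"},
]

counts = chunks_per_document(toy_metadatas)
for name, n_chunks in counts.most_common():
    print(f"{n_chunks:4d}  {name}")
```

With the real store this would be called as `chunks_per_document(local_rag_sys.list_existing_documents())`.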


In [12]:
# Build a new vector database from documents
vectordb = local_rag_sys.build_vector_db_from_docs(file_paths=workdir/"docs")

Contents of the documents dir: [PosixPath('/Users/danid/Downloads/RAG_personal_knowledge_base/docs/Selection of microorganisms capable of polyethylene (PE) and polypropylene (PP) degradation.pdf'), PosixPath('/Users/danid/Downloads/RAG_personal_knowledge_base/docs/.DS_Store'), PosixPath('/Users/danid/Downloads/RAG_personal_knowledge_base/docs/Diffusion Sequence Models for Enhanced Protein Representation and Generation.pdf'), PosixPath('/Users/danid/Downloads/RAG_personal_knowledge_base/docs/Biodegradation of polyethylene and polypropylene by Lysinibacillus species JJY0216 isolated from soil grove.pdf'), PosixPath('/Users/danid/Downloads/RAG_personal_knowledge_base/docs/xTrimoPGLM- unified 100-billion-parameter pretrained transformer for deciphering the language of proteins.pdf'), PosixPath('/Users/danid/Downloads/RAG_personal_knowledge_base/docs/SaProt- Protein Language Modeling with Structure-aware Vocabulary.pdf'), PosixPath('/Users/danid/Downloads/RAG_personal_knowledge_base/docs/Po

100%|██████████| 7/7 [00:09<00:00,  1.42s/it]
100%|██████████| 7/7 [00:09<00:00,  1.42s/it]


Loaded 131 pages
Total chunks created: 528
Vector store persisted at /Users/danid/Downloads/RAG_personal_knowledge_base/chroma_db_v4
Added 528 documents to vector store
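
To illustrate how `chunk_size` and `chunk_overlap` interact, here is a deliberately simplified fixed-window chunker. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to cut at separators such as paragraph and line breaks); it only shows the overlap arithmetic. The class above uses 1200 characters with a 10% overlap; the toy call below uses tiny numbers so the result is easy to read.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters, each sharing chunk_overlap characters with the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


chunks = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=2)
print(chunks)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

Each chunk starts with the last two characters of the previous one, which is the role the 120-character overlap plays at full scale.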


In [78]:
# update the vector db with new documents
vectordb = local_rag_sys.update_vector_db(new_docs_path=workdir/"new_docs_to_add")

Contents of the documents dir: [PosixPath('/Users/danid/Downloads/RAG_personal_knowledge_base/new_docs_to_add/Genomic and Metabolic Insights into Microbial Biodegradation of TNT.docx')]


100%|██████████| 1/1 [00:00<00:00, 109.96it/s]
100%|██████████| 1/1 [00:00<00:00, 171.13it/s]

Loaded 1 pages
Total chunks created: 56
[Document(metadata={'source': '/Users/danid/Downloads/RAG_personal_knowledge_base/new_docs_to_add/Genomic and Metabolic Insights into Microbial Biodegradation of TNT.docx'}, page_content='Genomic and Metabolic Insights into Microbial Biodegradation of TNT\n\n\n\nAbstract'), Document(metadata={'source': '/Users/danid/Downloads/RAG_personal_knowledge_base/new_docs_to_add/Genomic and Metabolic Insights into Microbial Biodegradation of TNT.docx'}, page_content='Abstract\n\n2,4,6-Trinitrotoluene (TNT) is a recalcitrant explosive pollutant whose breakdown in the environment is facilitated by diverse microbes. This document summarizes key findings on TNT biodegradation, including the spectrum of TNT-transforming bacteria, fungi, and yeasts, the major genes and enzymes driving TNT metabolism, and the primary degradation pathways. Genomic analyses have identified critical catabolic genes (e.g. nitroreductases, oxygen-insensitive reductases of the Old Yell





Database updated with new documents
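
One thing to watch with updates: running the same update twice adds the same chunks twice. A possible guard, sketched here under assumptions, is to derive a deterministic id from each chunk's source and text and skip ids that are already known. `chunk_id` is a hypothetical helper. Whether the ids are then passed to the vector store (for example through an `ids=` argument when adding documents) should be checked against the installed LangChain/Chroma version.

```python
import hashlib
from typing import Dict, List, Set, Tuple


def chunk_id(text: str, metadata: Dict) -> str:
    """Deterministic id: the same source + text always hashes to the same value."""
    source = str(metadata.get("source", ""))
    digest = hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()
    return digest[:32]


def only_new_chunks(incoming: List[Tuple[str, Dict]],
                    known_ids: Set[str]) -> List[Tuple[str, Dict]]:
    """Keep only the (text, metadata) pairs whose id is not in known_ids."""
    return [(text, meta) for text, meta in incoming if chunk_id(text, meta) not in known_ids]


# Toy data (hypothetical file name and texts)
meta = {"source": "tnt.docx"}
known_ids = {chunk_id("Abstract ...", meta)}
incoming = [("Abstract ...", meta), ("Introduction ...", meta)]

new_only = only_new_chunks(incoming, known_ids)
print([text for text, _ in new_only])  # ['Introduction ...']
```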


In [5]:
# List existing documents in the vector store
docs_in_db = local_rag_sys.list_existing_documents()
docs = []
for doc in docs_in_db:
    docs.append(doc['source'])
print(set(docs))  # Display unique document source paths in the vector store

{'/Users/danid/Downloads/RAG_personal_knowledge_base/docs/Selection of microorganisms capable of polyethylene (PE) and polypropylene (PP) degradation.pdf', '/Users/danid/Downloads/RAG_personal_knowledge_base/docs/SaProt- Protein Language Modeling with Structure-aware Vocabulary.pdf', '/Users/danid/Downloads/RAG_personal_knowledge_base/new_docs_to_add/Genomic and Metabolic Insights into Microbial Biodegradation of TNT.docx', '/Users/danid/Downloads/RAG_personal_knowledge_base/docs/Polyethylene Degradation by a Rhodococcous Strain Isolated from Naturally Weathered Plastic Waste Enrichment .pdf', '/Users/danid/Downloads/RAG_personal_knowledge_base/docs/Diffusion Sequence Models for Enhanced Protein Representation and Generation.pdf', '/Users/danid/Downloads/RAG_personal_knowledge_base/docs/Biodegradation of polyethylene and polypropylene by Lysinibacillus species JJY0216 isolated from soil grove.pdf', '/Users/danid/Downloads/RAG_personal_knowledge_base/new_docs_to_add/A marine bacterial c

In [7]:
# Complete workflow with existing documents vector db, running QA-chain with a query to the model
qa_chain = local_rag_sys.build_qa_chain(retriever_type="vec_semantic", 
                                        use_conversation=False)

No split documents available for BM25Retriever creation. Need to build vector DB first.


In [None]:
result = qa_chain.invoke('''How does the SaProt protein-LLM differ from other protein-LLMs such as ESM-2? 
                         Two new protein LLMs, Diffusion Sequence Model (DSM) and xTrimoPGLM, are supposed to be state-of-the-art. Compare them to SaProt. 
                        What are the key architectural novelties of these models?
                        Use the given context (source documents) to answer the question.
                        Give the name of the source documents and metadata that supports your answer.''')

No split documents available for BM25Retriever creation. Need to build vector DB first.


In [None]:
print("\n=== ANSWER ===")
print(result['result'])   
print("\n=== SOURCES ===")
for doc in result['source_documents']:
    print(f"Content: {doc.page_content[:200]}...")
    print(f"Metadata title: {doc.metadata.get('title', 'n/a')}\n")  # .get(): not every loader sets 'title'
    print(f"Metadata page number: {doc.metadata.get('page', 'n/a')}\n")

In [45]:
result = qa_chain.invoke('''Which microorganisms (bacteria and fungi) are capable of degrading polyethylene (PE)? 
                        Which methods were used to isolate them, and from which environments? 
                        Are Lysinibacillus and Rhodococcus strains also capable of degrading PE? 
                        Use the given context (source documents) to answer the question.
                        Give the name of the source documents and metadata that supports your answer.''')

In [47]:
print("\n=== ANSWER ===")
print(result['result'])   
print("\n=== SOURCES ===")
for doc in result['source_documents']:
    print(f"Content: {doc.page_content[:200]}...")
    print(f"Metadata title: {doc.metadata.get('title', 'n/a')}\n")  # .get(): not every loader sets 'title'
    print(f"Metadata page number: {doc.metadata.get('page', 'n/a')}\n")


=== ANSWER ===
**Micro‑organisms reported to degrade polyethylene (PE)**  

| Kingdom | Species / Strain (examples) | How they were isolated | Source environment |
|---------|-----------------------------|------------------------|--------------------|
| **Bacteria** | *Rhodococcus* sp. A34 | Enriched for 609 days in carbon‑free basal medium (CFBM) using naturally weathered PE as the sole carbon source; isolates were obtained from the enrichment culture and identified by 16S rRNA sequencing. | Naturally weathered plastic waste collected from a landfill/field site (weathered PE pieces). |
| | *Lysinibacillus* sp. JJY0216 | Screened from hundreds of soil‑derived samples; colonies were selected on minimal medium containing LDPE film as the only carbon source and then tested for weight loss, SEM/FTIR changes. | Soil taken from a grove where a kenaf–plastic composite had been laid (soil under 30 cm depth). |
| | *Priestia megaterium* (formerly *Bacillus megaterium*) | Isolated from landfill

In [8]:
result = qa_chain.invoke('''Summarize the main group of TNT degrading microbial species, separate into Aerobic vs. Anaerobic bacteria, what are the main differences? 
                        Use the given context (source documents) to answer the question.
                        Give the name of the source documents and metadata that supports your answer.''')

In [9]:
print("\n=== ANSWER ===")
print(result['result'])   
print("\n=== SOURCES ===")
for doc in result['source_documents']:
    print(f"Content: {doc.page_content[:200]}...")
    print(f"Metadata source: {doc.metadata['source']}\n")


=== ANSWER ===

<think>Hmm, I need to summarize the main groups of TNT-degrading microbial species, separating them into aerobic and anaerobic bacteria, based solely on the provided context. I must not use any prior knowledge outside of this document.

First, I'll scan the context for sections that list TNT-degrading species. The "TNT-Degrading Microbial Species" section provides a clear overview, starting with anaerobes and then moving to aerobes. It mentions Veillonella alkalescens as an anaerobe and Escherichia coli under facultative anaerobes. For aerobes, it lists Pseudomonas species like P. fluorescens and P. putida, along with Bacillus, Staphylococcus, Mycobacterium, Rhodococcus, and others.

I should organize these into aerobic and anaerobic categories as requested. For anaerobes, the context includes Veillonella alkalescens (strictly anaerobic), Desulfovibrio species (sulfate-reducers), and Clostridium strains. For aerobes, it covers Pseudomonas spp., Bacillus, Staphylococcus
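
The answer above starts with a `<think>` reasoning block emitted by the model. If only the final answer is wanted, one option is to strip that block before printing. This is a sketch that assumes the reasoning is always wrapped in `<think>...</think>` (or left unclosed when the output is cut off); `strip_reasoning` is a hypothetical helper, not part of LangChain.

```python
import re

# Matches a <think> block up to its closing tag, or up to the end of the text if it was never closed.
THINK_BLOCK = re.compile(r"<think>.*?(?:</think>|\Z)", flags=re.DOTALL)


def strip_reasoning(text: str) -> str:
    """Remove <think>...</think> blocks and surrounding whitespace from a model answer."""
    return THINK_BLOCK.sub("", text).strip()


print(strip_reasoning("<think>Let me scan the context first.</think>\n\nAerobes: Pseudomonas spp."))
# Aerobes: Pseudomonas spp.
```

In the cells here it would be applied as `print(strip_reasoning(result['result']))`.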

In [10]:
result = qa_chain.invoke('''What are the main genes and enzymes involved in TNT degradation? 
                        Make a brief summary of the mechanism of action of each such gene/enzyme.
                        Use the given context (source documents) to answer the question.
                        Give the name of the source documents and metadata that supports your answer.''')

In [11]:
print("\n=== ANSWER ===")
print(result['result'])   
print("\n=== SOURCES ===")
for doc in result['source_documents']:
    print(f"Content: {doc.page_content[:200]}...")
    print(f"Metadata source: {doc.metadata['source']}\n")


=== ANSWER ===

<think>Hmm, the user is asking about the main genes and enzymes involved in TNT degradation along with their mechanisms of action. They specifically want me to use only the provided context and not any prior knowledge. The context is quite extensive, covering genomic insights about TNT-degrading microbes.

Looking through the context, I see several key genes and enzymes mentioned that are crucial for TNT degradation. There are nitroreductases, azoreductases, and various support genes involved in stress response. The context also discusses transcriptomic studies showing upregulation of specific enzymes like azoreductase in Pseudomonas sp. TNT3.

For the mechanism of action, I need to extract what each enzyme does based solely on the text. Nitroreductases reduce nitro groups stepwise, azoreductase appears to act on TNT intermediates despite its normal function with azo compounds. There are also stress response genes like those for oxidative and nitrosative damage, transp