# LangChain RAG Search

This notebook demonstrates four retrieval strategies for Retrieval-Augmented Generation (RAG) using LangChain and IBM Watsonx AI.

## Overview
- **LLM**: IBM Watsonx AI (Mistral-Small)
- **Vector Store**: ChromaDB
- **Embeddings**: IBM Slate-125M English Retriever

## Retrieval Strategies Covered
1. Vector Store-Backed Retrieval (similarity search, MMR, score threshold)
2. Multi-Query Retriever
3. Self-Querying Retriever
4. Parent Document Retriever

## Requirements
- Python 3.9+
- IBM Watsonx AI account with API credentials
- Internet connection for downloading sample documents

In [None]:
# Install necessary Python packages
!pip install "ibm-watsonx-ai==1.1.2" | tail -n 1
!pip install "langchain==0.2.1" | tail -n 1
!pip install "langchain-ibm==0.1.11" | tail -n 1
!pip install "langchain-community==0.2.1" | tail -n 1
!pip install "chromadb==0.4.24" | tail -n 1
!pip install "pypdf==4.3.1" | tail -n 1
!pip install 'posthog<6.0.0' | tail -n 1

In [None]:
# Suppress warnings
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

## LLM Configuration


### Build the LLM


In [None]:
# Import the IBM Watsonx AI classes used to build the base LLM
from ibm_watsonx_ai.foundation_models import ModelInference
from ibm_watsonx_ai.metanames import GenTextParamsMetaNames as GenParams
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models.extensions.langchain import WatsonxLLM

In [None]:
# Define the llm function to initialize and return the IBM Watsonx AI LLM
def llm():
    model_id = 'mistralai/mistral-small-3-1-24b-instruct-2503'

    # Generation parameters: cap the output length and set the sampling temperature
    parameters = {
        GenParams.MAX_NEW_TOKENS: 256,
        GenParams.TEMPERATURE: 0.4,
    }

    # Use the Credentials class imported above
    credentials = Credentials(
        url="Input your IBM Watsonx AI URL",  # Replace with your actual IBM Watsonx AI URL
        api_key="Input your IBM Watsonx AI API Key",  # Replace with your actual API key
    )

    project_id = 'your_project_id'  # Replace with your actual project ID

    model = ModelInference(
        model_id=model_id,
        params=parameters,
        credentials=credentials,
        project_id=project_id
    )

    # Wrap the model so that LangChain components can call it
    mistral_llm = WatsonxLLM(model=model)
    return mistral_llm

## Text Splitting Configuration

Configure recursive character text splitter for document chunking.
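To show how `chunk_size` and `chunk_overlap` interact, here is a minimal sketch of a fixed-window character chunker. It is deliberately simpler than `RecursiveCharacterTextSplitter`, which also tries to break on separators such as paragraphs and sentences. The function name `chunk_with_overlap` and the sample string are assumptions made for this illustration.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    # Each new window starts (chunk_size - chunk_overlap) characters after the previous one
    step = chunk_size - chunk_overlap
    # Stop early so the final window is not just a repeat of the previous overlap
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


example_chunks = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(example_chunks)
```

Each chunk repeats the last character of the previous one, which is the same idea the overlap setting serves in the splitter configured below: context at a chunk boundary is not lost entirely.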

In [None]:
# Import necessary libraries
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [None]:
# Define the text_splitter function to split the input data into chunks
def text_splitter(data, chunk_size, chunk_overlap):
    # Use a distinct local name so the splitter instance does not shadow this function
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
    )
    chunks = splitter.split_documents(data)
    return chunks

## Embedding Model Setup


### Create the embedding model


In [None]:
# Import the embedding parameter names and the LangChain wrapper for IBM Watsonx AI embeddings
from ibm_watsonx_ai.metanames import EmbedTextParamsMetaNames
from langchain_ibm import WatsonxEmbeddings

In [None]:
# Define the watsonx_embedding function to initialize and return the IBM Watsonx AI embedding model
def watsonx_embedding():
    embed_params = {
        # Caps how many tokens of each input are embedded; a value of 3 keeps only
        # the first three tokens, so consider raising it toward the model's input limit
        EmbedTextParamsMetaNames.TRUNCATE_INPUT_TOKENS: 3,
        EmbedTextParamsMetaNames.RETURN_OPTIONS: {"input_text": True},
    }

    # Use a distinct local name so the model instance does not shadow this function
    embeddings = WatsonxEmbeddings(
        model_id="ibm/slate-125m-english-rtrvr",
        url="https://us-south.ml.cloud.ibm.com",
        project_id="skills-network",
        params=embed_params,
    )
    return embeddings

## Retrieval Strategies

### Initialize and use different retrievers


#### Vector Store-Backed Retriever


In [None]:
# Download the sample text file for the retriever from a URL
!wget "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/MZ9z1lm-Ui3YBp3SYWLTAQ/companypolicies.txt"

In [None]:
# Import TextLoader to load text data
from langchain_community.document_loaders import TextLoader

In [None]:
# Load the text data from the file
loader = TextLoader("companypolicies.txt")
txt_data = loader.load()

In [None]:
# Chunk the text data into 350-character chunks with 50-character overlap
chunks_txt = text_splitter(txt_data, 350, 50)

In [None]:
# Import Chroma to create a vector store
from langchain_community.vectorstores import Chroma

In [None]:
# Create a vector store from the chunks using the IBM Watsonx AI embedding model and ChromaDB
vectordb = Chroma.from_documents(chunks_txt, watsonx_embedding())

##### Simple similarity search


In [None]:
# Define an example query and create a retriever from the vector store; by default, the retriever returns the top k=4 chunks
query = "email policy"
retriever = vectordb.as_retriever()

In [None]:
# Invoke the retriever with the example query to retrieve relevant documents
docs = retriever.invoke(query)

In [None]:
# Display the retrieved documents
docs

In [None]:
# Now set retriever with a different number of top chunks to retrieve (e.g., k=2)
retriever = vectordb.as_retriever(search_kwargs={"k": 2})
docs = retriever.invoke(query)
docs

##### MMR search


MMR reduces redundancy while maintaining query relevance by balancing how relevant a document is to the query with how different it is from already-selected items. It's useful for RAG systems with limited context windows, selecting relevant but non-duplicate snippets. The main drawbacks are higher computational cost than simple similarity search and the need to tune a diversity parameter for your use case.
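To make the trade-off between relevance and diversity concrete, here is a minimal pure-Python sketch of greedy MMR selection. It is an illustration under stated assumptions, not the code that Chroma or LangChain runs: the helper names (`cosine`, `mmr_select`) and the toy 2-D vectors are invented for this example. The `lambda_mult` argument plays the role of the diversity parameter, where 1.0 means pure relevance and lower values favor diversity.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k document indices, trading relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy is the highest similarity to anything already selected
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy embeddings: documents 0 and 1 are near-duplicates, document 2 points elsewhere
query_vec = [1.0, 0.2]
doc_vecs = [[1.0, 0.0], [0.9, 0.1], [0.6, 0.8]]

balanced_picks = mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5)
relevance_only_picks = mmr_select(query_vec, doc_vecs, k=2, lambda_mult=1.0)
print(balanced_picks, relevance_only_picks)
```

With `lambda_mult=1.0` the two near-duplicates are both returned, while `lambda_mult=0.5` swaps the second one for the more distinct document. In LangChain, the corresponding settings are usually passed through `search_kwargs` (for example `k`, `fetch_k`, and `lambda_mult`) when calling `as_retriever(search_type="mmr")`.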

In [None]:
# Set the retriever to use the MMR (Maximal Marginal Relevance) search type
retriever = vectordb.as_retriever(search_type="mmr")
docs = retriever.invoke(query)
docs

##### Similarity score threshold retrieval


Similarity score threshold retrieval returns every document whose similarity score meets a minimum threshold, which is useful when you don't know the right k value upfront. The main benefit is control: only sufficiently relevant results surface, and less relevant ones are filtered out. However, it may return zero results if nothing meets the threshold, and the threshold value usually needs tuning for your use case.
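The filtering step can be sketched in a few lines of plain Python. This is a simplified illustration: the function name `filter_by_threshold` and the scores are hypothetical, and a real vector store computes the scores from embeddings before any filtering happens.

```python
def filter_by_threshold(scored_docs, score_threshold):
    """Keep (document, score) pairs at or above the threshold, best match first."""
    kept = [(doc, score) for doc, score in scored_docs if score >= score_threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical relevance scores for three chunks
scored = [("internet usage", 0.55), ("email policy", 0.82), ("dress code", 0.21)]

loose_hits = filter_by_threshold(scored, 0.4)
strict_hits = filter_by_threshold(scored, 0.9)
print(loose_hits)
print(strict_hits)
```

The strict threshold returns an empty list, which is the zero-results case that downstream code should be prepared to handle.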

In [None]:
# Set retriever to use a similarity score threshold for retrieval
retriever = vectordb.as_retriever(
    search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.4}
)
docs = retriever.invoke(query)
docs

#### Multi-Query Retriever


The Multi-Query Retriever uses an LLM to generate multiple variations of the user query, searches with each variation, and merges the results. Retrieving documents for several interpretations of the question increases the likelihood of finding relevant answers to vague or imprecisely formulated queries. This can improve recall by capturing documents that match different phrasings and perspectives. However, it increases computational cost because of the extra LLM call and the additional vector searches, and less relevant variations may add noise.
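The merge step can be sketched without an LLM or a vector store. In this minimal sketch the query generator and the retriever are hard-coded stand-ins, and all names (`unique_union`, `multi_query_retrieve`, `fake_generate_variants`, `fake_retrieve`, `FAKE_INDEX`) are assumptions made for illustration only.

```python
def unique_union(result_lists):
    """Merge several result lists, dropping duplicates and keeping first-seen order."""
    seen = set()
    merged = []
    for results in result_lists:
        for doc in results:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


def multi_query_retrieve(query, generate_variants, retrieve):
    """Retrieve with every generated query variant and merge the results."""
    variants = generate_variants(query)
    return unique_union(retrieve(variant) for variant in variants)


def fake_generate_variants(query):
    # Stand-in for the LLM call that rephrases the user query
    return [
        "How is LangChain described in the paper?",
        "What are LangChain's main components?",
    ]


# Stand-in for the vector store: each variant maps to the chunks it would retrieve
FAKE_INDEX = {
    "How is LangChain described in the paper?": ["chunk A", "chunk B"],
    "What are LangChain's main components?": ["chunk B", "chunk C"],
}


def fake_retrieve(variant):
    return FAKE_INDEX.get(variant, [])


merged_chunks = multi_query_retrieve(
    "What does the paper say about langchain?", fake_generate_variants, fake_retrieve
)
print(merged_chunks)
```

The chunk returned by both variants appears only once in the merged list, so the LLM downstream does not see duplicated context.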

In [None]:
# Import PyPDFLoader to load PDF data
from langchain_community.document_loaders import PyPDFLoader

In [None]:
# Load sample PDF for retriever from a URL
loader = PyPDFLoader("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/ioch1wsxkfqgfLLgmd-6Rw/langchain-paper.pdf")
pdf_data = loader.load()

In [None]:
# Chunk the PDF data into 500-character chunks with 20-character overlap
chunks_pdf = text_splitter(pdf_data, 500, 20)

# Create a vector store from the PDF chunks using the IBM Watsonx AI embedding model and ChromaDB
# Get the IDs of the documents already in the vector store
ids = vectordb.get()["ids"]
# Delete them so that embeddings from the previous document are not mixed in
vectordb.delete(ids)
# Create a new vector store with the PDF chunks
vectordb = Chroma.from_documents(documents=chunks_pdf, embedding=watsonx_embedding())

In [None]:
# Import MultiQueryRetriever, which uses an LLM to generate several variants of a query
from langchain.retrievers.multi_query import MultiQueryRetriever
# Define an example query and wrap the vector store retriever in a MultiQueryRetriever
query = "What does the paper say about langchain?"

retriever = MultiQueryRetriever.from_llm(
    retriever=vectordb.as_retriever(), llm=llm()
)

In [None]:
# Set the logging level to INFO so that MultiQueryRetriever logs the query variants it generates
import logging

logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

In [None]:
# Invoke the retriever with the example query to retrieve relevant documents
docs = retriever.invoke(query)
docs

#### Self-Querying Retriever


The Self-Querying Retriever uses an LLM to turn a natural-language question into a structured query. It extracts filters on document metadata from the user query and applies them alongside the semantic similarity search. This can achieve higher precision but depends on the underlying LLM: if the model misinterprets an ambiguous query, or the metadata is poorly structured, filtering accuracy suffers.
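To illustrate what a structured query looks like once the LLM has produced it, here is a minimal sketch that applies metadata conditions to a few records in plain Python. The dictionary layout, the `(comparator, attribute, value)` tuples, and the names `COMPARATORS`, `matches`, and `apply_filter` are assumptions for this example and not LangChain's internal representation. The semantic-similarity part of the search is left out.

```python
import operator

# Comparator names mapped to Python operators (a hypothetical, minimal set)
COMPARATORS = {
    "eq": operator.eq,
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}


def matches(metadata, condition):
    """Check one condition; a record that lacks the attribute never matches."""
    comparator, attribute, value = condition
    if attribute not in metadata:
        return False
    return COMPARATORS[comparator](metadata[attribute], value)


def apply_filter(records, conditions):
    """Keep the records whose metadata satisfies every condition (logical AND)."""
    return [r for r in records if all(matches(r["metadata"], c) for c in conditions)]


records = [
    {"text": "casino heist", "metadata": {"year": 2001, "rating": 8.4, "genre": "heist"}},
    {"text": "sentient robots", "metadata": {"year": 2004, "genre": "science fiction"}},
    {"text": "lone wanderer", "metadata": {"year": 1985, "rating": 9.1}},
]

# What an LLM might produce for "a film from before 2000 rated above 8.0"
structured_query = {
    "query": "film",
    "filter": [("gt", "rating", 8.0), ("lt", "year", 2000)],
}

hits = apply_filter(records, structured_query["filter"])
print([r["text"] for r in hits])
```

The record without a `rating` field is excluded by the rating condition, which shows why incomplete metadata reduces what a filtered query can return.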


In [None]:
# Import necessary libraries for self-querying
from langchain_core.documents import Document
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever

In [None]:
# Define a list of documents with metadata for self-querying
docs = [
    Document(
        page_content="A master thief assembles a crew to pull off an impossible heist in a sprawling casino",
        metadata={"year": 2001, "rating": 8.4, "genre": "heist"},
    ),
    Document(
        page_content="A rogue AI system threatens humanity while scientists race against time to shut it down",
        metadata={"year": 2015, "director": "Alex Garland", "rating": 7.9},
    ),
    Document(
        page_content="A brilliant mathematician discovers hidden patterns in reality that no one else can perceive",
        metadata={"year": 1998, "director": "Darren Aronofsky", "rating": 8.5},
    ),
    Document(
        page_content="A group of astronauts venture into deep space searching for signals of intelligent life",
        metadata={"year": 2014, "director": "Denis Villeneuve", "rating": 8.0},
    ),
    Document(
        page_content="Sentient robots develop consciousness and question their purpose in society",
        metadata={"year": 2004, "genre": "science fiction"},
    ),
    Document(
        page_content="A lone wanderer navigates through an apocalyptic landscape seeking redemption",
        metadata={
            "year": 1985,
            "director": "Ridley Scott",
            "genre": "post-apocalyptic",
            "rating": 9.1,
        },
    ),
]

In [None]:
# Define metadata field information for self-querying
metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="The category or classification of the film. Select from ['science fiction', 'comedy', 'drama', 'thriller', 'romance', 'action', 'animated']",
        type="string",
    ),
    AttributeInfo(
        name="year",
        description="The release date of the film (expressed as a calendar year)",
        type="integer",
    ),
    AttributeInfo(
        name="director",
        description="The filmmaker responsible for directing the production",
        type="string",
    ),
    AttributeInfo(
        name="rating",
        description="A numerical score between 1 and 10 indicating quality",
        type="float",
    ),
]

In [None]:
# Create a vector store from the documents using the IBM Watsonx AI embedding model and ChromaDB
vectordb = Chroma.from_documents(docs, watsonx_embedding())

In [None]:
# Initialize the self-querying retriever with the LLM, vector store, document content description, and metadata field information
document_content_description = "Brief summary of a movie."

retriever = SelfQueryRetriever.from_llm(
    llm(),
    vectordb,
    document_content_description,
    metadata_field_info,
)

In [None]:
# Test the self-querying retriever with a query
retriever.invoke("I want to watch a movie rated higher than 7")

In [None]:
# Test the self-querying retriever with a query that filters on director and genre
retriever.invoke("Find films directed by Denis Villeneuve in the science fiction category")

In [None]:
# This example specifies a composite filter
retriever.invoke("Show me an action film with an excellent rating (above 8.0)")

#### Parent Document Retriever


The Parent Document Retriever splits large documents into smaller child chunks, searches the child chunks for precision, then returns the larger parent document to the LLM. The system benefits from accurate child embeddings for precise retrieval while giving the LLM the broader context of the parent, which supports more complete and detailed answers. The downside is increased storage overhead (both parent and child documents are kept) and the risk of passing too much context to the LLM, which can introduce noise or dilute the signal.
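The child-to-parent lookup can be sketched in plain Python. This is a simplified illustration with assumed names (`build_index`, `retrieve_parents`) and an invented two-paragraph document: paragraphs act as parents, sentences act as children, and a case-insensitive substring match stands in for the vector search.

```python
def build_index(document):
    """Split a document into parents (paragraphs) and children (sentences)."""
    parents = {}
    children = []
    for parent_id, paragraph in enumerate(document.split("\n\n")):
        parents[parent_id] = paragraph
        for sentence in paragraph.split(". "):
            # Each child remembers which parent it came from
            children.append((sentence, parent_id))
    return parents, children


def retrieve_parents(query, parents, children):
    """Match the query against children, then return each matching parent once."""
    parent_ids = []
    for child_text, parent_id in children:
        if query.lower() in child_text.lower() and parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [parents[parent_id] for parent_id in parent_ids]


document = (
    "Smoking is banned indoors. Designated outdoor areas are provided.\n\n"
    "Email is for business use. Personal use should be limited."
)

parents, children = build_index(document)
results = retrieve_parents("smoking", parents, children)
print(results)
```

Only one sentence mentions smoking, but the whole paragraph is returned, including the sentence about outdoor areas that the query alone would not have matched.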

In [None]:
# Import necessary libraries for parent document retriever
from langchain.retrievers import ParentDocumentRetriever
from langchain_text_splitters import CharacterTextSplitter
from langchain.storage import InMemoryStore

In [None]:
# Define two splitters: one with a large chunk size (parent) and one with a small chunk size (child)
parent_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=20, separator='\n')
child_splitter = CharacterTextSplitter(chunk_size=200, chunk_overlap=20, separator='\n')

In [None]:
# Create a vector store to index the child chunks using the IBM Watsonx AI embedding model and ChromaDB
vectordb = Chroma(
    collection_name="split_parents", embedding_function=watsonx_embedding()
)

# The storage layer for the parent documents
store = InMemoryStore()

# Initialize the parent document retriever with the vector store, document store, and splitters
retriever = ParentDocumentRetriever(
    vectorstore=vectordb,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Add the text data to the parent document retriever
retriever.add_documents(txt_data)

In [None]:
# Count the parent documents held in the docstore
len(list(store.yield_keys()))

In [None]:
# Search the vector store directly to inspect the best-matching child chunk
sub_docs = vectordb.similarity_search("smoking policy")
print(sub_docs[0].page_content)

In [None]:
# Retrieve with the parent document retriever, which returns the larger parent chunk for the same query
retrieved_docs = retriever.invoke("smoking policy")
print(retrieved_docs[0].page_content)