# Retrieval Augmented Generation (RAG) with Langchain
*Using IBM Granite Models*

## In this notebook
This notebook contains instructions for performing Retrieval Augmented Generation (RAG). RAG is an architectural pattern that can improve the output of language models by recalling factual information from a knowledge base and adding that information to the model query. The most common approach in RAG is to create dense vector representations of the knowledge base in order to retrieve text chunks that are semantically similar to a given user query.

RAG use cases include:
- Customer service: Answering questions about a product or service using facts from the product documentation.
- Domain knowledge: Exploring a specialized domain (e.g., finance) using facts from papers or articles in the knowledge base.
- News chat: Chatting about current events by calling up relevant recent news articles.

In its simplest form, RAG requires 3 steps:

- Initial setup:
  - Index knowledge-base passages for efficient retrieval. In this recipe, we take embeddings of the passages and store them in a vector database.
- Upon each user query:
  - Retrieve relevant passages from the database. In this recipe, we use an embedding of the query to retrieve semantically similar passages.
  - Generate a response by feeding the retrieved passages into a large language model, along with the user query.
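
The three steps above can be sketched without any external libraries. This is only a toy illustration: word overlap stands in for embedding similarity, the passages are made up, and the final call to a language model is left out, so `build_prompt` stops at the text that would be sent to the model. The names `terms`, `retrieve`, and `build_prompt` are hypothetical and not part of this recipe.

```python
import re


def terms(text: str) -> set[str]:
    """Toy stand-in for an embedding: the set of lowercase words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


# Step 1 (initial setup): index the knowledge-base passages.
passages = [
    "The warranty covers parts and labor for two years.",
    "Returns are accepted within thirty days of purchase.",
    "The device charges fully in about ninety minutes.",
]
index = [(passage, terms(passage)) for passage in passages]


def retrieve(query: str, k: int = 1) -> list[str]:
    """Step 2: return the k passages sharing the most words with the query."""
    query_terms = terms(query)
    ranked = sorted(index, key=lambda item: len(query_terms & item[1]), reverse=True)
    return [passage for passage, _ in ranked[:k]]


def build_prompt(query: str, k: int = 1) -> str:
    """Step 3 (input side): combine the retrieved passages with the user query."""
    context = "\n".join(retrieve(query, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


print(build_prompt("How long does the warranty last?"))
```

The rest of this notebook replaces each toy piece with a real component: an embedding model, a vector database, and a Granite model.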

## Setting up the environment

Ensure you are running Python 3.10, 3.11, or 3.12 in a freshly created virtual environment.

In [5]:
import sys
assert sys.version_info >= (3, 10) and sys.version_info < (3, 13), "Use Python 3.10, 3.11, or 3.12 to run this notebook."

### Install dependencies

Granite utils provides some helpful functions for recipes.

In [1]:
! pip install git+https://github.com/ibm-granite-community/utils \
    transformers \
    langchain_community \
    langchain-huggingface \
    langchain_ollama \
    langchain-milvus \
    replicate \
    wget


### Serving the Granite AI model


This notebook requires IBM Granite models to be served by an AI model runtime so that the models can be invoked or called. This notebook can use a locally accessible [Ollama](https://github.com/ollama/ollama) server to serve the models, or the [Replicate](https://replicate.com) cloud service.

During the pre-work, you may have either started a local Ollama server on your computer, or set up Replicate access and obtained an [API token](https://replicate.com/account/api-tokens).

## Selecting System Components

### Choose your Embeddings Model

Specify the model to use for generating embedding vectors from text.

To use a model from a provider other than Hugging Face, replace this code cell with one from [this Embeddings Model recipe](https://github.com/ibm-granite-community/granite-kitchen/blob/main/recipes/Components/Langchain_Embeddings_Models.ipynb).

In [2]:
from langchain_huggingface import HuggingFaceEmbeddings
from transformers import AutoTokenizer

embeddings_model_path = "ibm-granite/granite-embedding-30m-english"
embeddings_model = HuggingFaceEmbeddings(
    model_name=embeddings_model_path,
)
embeddings_tokenizer = AutoTokenizer.from_pretrained(embeddings_model_path)



### Choose your Vector Database

Specify the database to use for storing and retrieving embedding vectors.

To connect to a vector database other than Milvus, substitute this code cell with one from [this Vector Store recipe](https://github.com/ibm-granite-community/granite-kitchen/blob/main/recipes/Components/Langchain_Vector_Stores.ipynb).

In [3]:
from langchain_milvus import Milvus
import tempfile

db_file = tempfile.NamedTemporaryFile(prefix="milvus_", suffix=".db", delete=False).name
print(f"The vector database will be saved to {db_file}")

vector_db = Milvus(
    embedding_function=embeddings_model,
    connection_args={"uri": db_file},
    auto_id=True,
    index_params={"index_type": "AUTOINDEX"},
)

The vector database will be saved to /var/folders/5x/cztshy892cbf92p2fdgqlxhc0000gn/T/milvus_jlwamqh8.db


## Select your model

Select a Granite model to use. Here we use a Langchain client to connect to the model. If there is a locally accessible Ollama server, we use an Ollama client to access the model. Otherwise, we use a Replicate client to access the model.

When using Replicate, the notebook reads your [Replicate API token](https://replicate.com/account/api-tokens) from the `REPLICATE_API_TOKEN` environment variable or from a `REPLICATE_API_TOKEN` Colab secret. If neither is set, the notebook asks for the token in a dialog box.

In [4]:
import os
import requests
from langchain_ollama.llms import OllamaLLM
from langchain_community.llms import Replicate
from ibm_granite_community.notebook_utils import get_env_var

model_path = "ibm-granite/granite-3.2-8b-instruct"

try: # Look for a locally accessible Ollama server for the model
    response = requests.get(os.getenv("OLLAMA_HOST", "http://127.0.0.1:11434"))
    model = OllamaLLM(
        model="granite3.2:2b",
    )
    model = model.bind(raw=True) # Client side controls prompt
except Exception: # Use Replicate for the model
    model = Replicate(
        model=model_path,
        replicate_api_token=get_env_var('REPLICATE_API_TOKEN'),
    )

tokenizer = AutoTokenizer.from_pretrained(model_path)
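
The token lookup described above follows a common pattern: read the secret from the environment, and prompt only when it is missing. The sketch below shows that pattern with the standard library. It is not the implementation of `get_env_var`; the function name `resolve_token` and its `prompt_fn` parameter are hypothetical, and the Colab secret lookup is omitted.

```python
import os
from getpass import getpass


def resolve_token(name: str, prompt_fn=getpass) -> str:
    """Return the secret from the environment, prompting only if it is missing."""
    token = os.environ.get(name)
    if not token:
        # Hidden input keeps the token out of the notebook output.
        token = prompt_fn(f"Enter {name}: ")
        os.environ[name] = token  # Cache for later cells in the same session
    return token
```

Passing the prompt function as a parameter keeps the lookup easy to test, since a test can supply a stub instead of waiting for keyboard input.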

## Building the Vector Database

In this example, we take the State of the Union speech text, split it into chunks, derive embedding vectors using the embedding model, and load it into the vector database for querying.

### Download the document

Here we use President Biden's State of the Union address from March 1, 2022.

In [14]:
import os
import wget

filename = 'state_of_the_union.txt'
url = 'https://raw.githubusercontent.com/IBM/watson-machine-learning-samples/master/cloud/data/foundation_models/state_of_the_union.txt'

if not os.path.isfile(filename):
    wget.download(url, out=filename)

### Split the document into chunks

Split the document into text segments that fit into the embedding model's context window. The chunk size is measured in tokens, using the embedding model's tokenizer.

In [15]:
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter

loader = TextLoader(filename)
documents = loader.load()
text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer=embeddings_tokenizer,
    chunk_size=embeddings_tokenizer.max_len_single_sentence,
    chunk_overlap=0,
)
texts = text_splitter.split_documents(documents)
for doc_id, text in enumerate(texts):
    text.metadata["doc_id"] = doc_id
print(f"{len(texts)} text document chunks created")
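
To make the chunking idea concrete, here is a minimal, dependency-free sketch of splitting on paragraph breaks and greedily merging paragraphs up to a token budget. It is an illustration rather than the splitter's actual code: whitespace-separated words stand in for tokenizer tokens, and the function name `split_by_token_budget` is hypothetical.

```python
def split_by_token_budget(text: str, max_tokens: int, separator: str = "\n\n") -> list[str]:
    """Greedily merge separator-delimited pieces into chunks of at most max_tokens.

    Tokens are approximated by whitespace-delimited words; a real pipeline
    would count tokens with the embedding model's tokenizer instead.
    """
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for piece in text.split(separator):
        piece_len = len(piece.split())
        # Close the current chunk if adding this piece would exceed the budget.
        if current and current_len + piece_len > max_tokens:
            chunks.append(separator.join(current))
            current, current_len = [], 0
        current.append(piece)
        current_len += piece_len
    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "one two three\n\nfour five\n\nsix seven eight nine"
print(split_by_token_budget(sample, max_tokens=5))
```

In this sketch, a single paragraph longer than the budget is kept whole rather than split further, so the budget is only a guarantee when every paragraph fits within it.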

### Populate the vector database

NOTE: Population of the vector database may take over a minute depending on your embedding model and service.

In [16]:
ids = vector_db.add_documents(texts)
print(f"{len(ids)} documents added to the vector database")


## Querying the Vector Database

### Conduct a similarity search

Search the database for the document chunks whose embedding vectors lie closest to the embedding of the query in vector space.

In [17]:
query = "What did the president say about Ketanji Brown Jackson?"
docs = vector_db.similarity_search(query)
print(f"{len(docs)} documents returned")
for doc in docs:
    print(doc)
    print("=" * 80)  # Separator for clarity

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
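
As a rough picture of what a similarity search computes, the sketch below ranks a few stored vectors by cosine similarity to a query vector. The three-dimensional vectors and chunk labels are made up for illustration, and the function names are hypothetical. A real vector database uses much higher-dimensional embeddings, an approximate index, and whichever distance metric it is configured with.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], stored: dict[str, list[float]], k: int = 2):
    """Return (label, score) pairs for the k stored vectors closest to the query."""
    scored = [(label, cosine_similarity(query_vec, vec)) for label, vec in stored.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up vectors standing in for chunk embeddings.
stored_vectors = {
    "chunk about the Supreme Court": [0.9, 0.1, 0.0],
    "chunk about the economy": [0.1, 0.9, 0.1],
    "chunk about voting rights": [0.6, 0.2, 0.4],
}
query_vector = [1.0, 0.0, 0.1]

for label, score in top_k(query_vector, stored_vectors):
    print(f"{score:.3f}  {label}")
```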


## Answering Questions

### Automate the RAG pipeline

Build a RAG chain with the model and the document retriever.

First we create the prompts for Granite to perform the RAG query. We use the Granite chat template and supply the placeholder values that the LangChain RAG pipeline will replace.

The `{context}` placeholder holds the retrieved chunks, like those returned by the previous search, and passes them to the model as document context for answering the question. The `{input}` placeholder holds the user's question.

Next, we construct the RAG pipeline by using the Granite prompt templates previously created.

In [19]:
from langchain.prompts import PromptTemplate
from langchain.chains.retrieval import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

# Create a Granite prompt for question-answering with the retrieved context
prompt = tokenizer.apply_chat_template(
    conversation=[{
        "role": "user",
        "content": "{input}",
    }],
    documents=[{
        "title": "placeholder",
        "text": "{context}",
    }],
    add_generation_prompt=True,
    tokenize=False,
)
prompt_template = PromptTemplate.from_template(template=prompt)

# Create a Granite document prompt template to wrap each retrieved document
document_prompt_template = PromptTemplate.from_template(template="""\
Document {doc_id}
{page_content}""")
document_separator="\n\n"

# Assemble the retrieval-augmented generation chain
combine_docs_chain = create_stuff_documents_chain(
    llm=model,
    prompt=prompt_template,
    document_prompt=document_prompt_template,
    document_separator=document_separator,
)
rag_chain = create_retrieval_chain(
    retriever=vector_db.as_retriever(),
    combine_docs_chain=combine_docs_chain,
)
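
Conceptually, the document-combining step formats each retrieved chunk with the document template, joins the results with the separator, and substitutes the joined text for `{context}`. The sketch below imitates that behavior with plain string formatting. It is a simplified stand-in, not the LangChain implementation: the `Doc` class, the `stuff_documents` function, and the plain-text prompt are hypothetical, and the real chain uses the Granite chat template instead.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a retrieved document chunk."""
    page_content: str
    metadata: dict = field(default_factory=dict)


document_template = "Document {doc_id}\n{page_content}"
separator = "\n\n"
prompt_template = "Context:\n{context}\n\nQuestion: {input}"


def stuff_documents(docs: list[Doc], question: str) -> str:
    """Format each document, join them, and fill the prompt placeholders."""
    formatted = [
        document_template.format(page_content=doc.page_content, **doc.metadata)
        for doc in docs
    ]
    return prompt_template.format(context=separator.join(formatted), input=question)


docs = [
    Doc("First retrieved chunk.", {"doc_id": 3}),
    Doc("Second retrieved chunk.", {"doc_id": 7}),
]
print(stuff_documents(docs, "What was said?"))
```

The `doc_id` values come from document metadata, which is why the chunking step stores a `doc_id` on every chunk before the chunks are added to the database.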

In [26]:
prompt_template = """\
<|start_of_role|>user<|end_of_role|>Use the following pieces of context to answer the question at the end.

{context}

Question: {input}<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>"""

# Build a combine-documents chain from the hand-written prompt (no retriever attached)
qa_chain_prompt = PromptTemplate.from_template(prompt_template)
combine_docs_chain = create_stuff_documents_chain(model, qa_chain_prompt)

### Generate a retrieval-augmented response to a question

Use the RAG chain to process a question. The document chunks relevant to that question are retrieved and used as context.

In [20]:
output = rag_chain.invoke({"input": query})

print(output['answer'])

The president nominated Ketanji Brown Jackson to serve on the United States Supreme Court. She is a Circuit Court of Appeals Judge and one of our nation's top legal minds. The president described her as continuing Justice Stephen Breyer's legacy of excellence.


In [27]:
combine_docs_chain.invoke({"input": "What did the president say about Ketanji Brown Jackson?", "context": ""})

"I'm sorry for the confusion, but I don't have any information about a specific president or Ketanji Brown Jackson. As an AI language model developed by IBM in 2024, my knowledge is based on the data I was trained on and does not include real-time updates or events that occurred after my training period. Therefore, I cannot provide an answer to your question."

In [21]:
output

{'input': 'What did the president say about Ketanji Brown Jackson?',
 'context': [Document(metadata={'pk': 454514983435501599, 'source': 'state_of_the_union.txt'}, page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.'),
  Document(metadata={'pk'