[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/VectorInstitute/rag-bootcamp/blob/refactor/uv-migration/implementations/document_search/document_search_langchain.ipynb)

# Document Search with LangChain

This example uses the Python [LangChain](https://python.langchain.com/docs/get_started/introduction) library to send a text-generation request to an LLM through the OpenAI SDK, then augments that request with text retrieved from a collection of local PDF documents. Retrieval relies on an open-source embedding model.

### 📝 Requirements

To run this notebook, you will need:

- **OpenAI API key**:  
    - Sign up at [OpenAI](https://platform.openai.com/) and create an API key

## Set up the RAG workflow environment

#### Install libraries (Only in Google Colab)

In [None]:
import os

if 'COLAB_RELEASE_TAG' in os.environ:
    # This is a Google Colab environment
    # Install required dependencies
    !pip3 install faiss-cpu langchain langchain-community langchain-huggingface langchain-openai # aieng-rag-utils

#### Import libraries

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
import os

from aieng.rag.utils import get_device_name
from aieng.rag.utils.search import DocumentReader, pretty_print, download_file

from langchain.chains import RetrievalQA
from langchain_community.vectorstores import FAISS
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI

#### Load OpenAI env variables

In [5]:
OPENAI_BASE_URL = os.getenv("OPENAI_BASE_URL","https://api.openai.com/v1")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY")

#### Download source documents

In [6]:
DIRECTORY_PATH = "./source_documents"
DOCUMENT_URL = "https://vectorinstitute.ai/wp-content/uploads/2023/05/vector-institute-2021-22-annual-report_accessible.pdf"

download_file(DOCUMENT_URL, DIRECTORY_PATH)

Downloaded https://vectorinstitute.ai/wp-content/uploads/2023/05/vector-institute-2021-22-annual-report_accessible.pdf to ./source_documents/vector-institute-2021-22-annual-report_accessible.pdf


#### Choose LLM and embedding model

In [5]:
GENERATOR_MODEL_NAME = "gpt-4.1"
EMBEDDING_MODEL_NAME = "BAAI/bge-base-en-v1.5"

## Start with a basic generation request without RAG augmentation

Let's start by asking the model a difficult, domain-specific question that we don't expect it to be able to answer. A simple question like "*What is the capital of France?*" is a poor choice here, because that is world knowledge we expect the LLM to have.

A good example is an obscure detail buried deep within a company's annual report. For example:

*How many Vector scholarships in AI were awarded in 2022?*

In [6]:
query = "How many Vector scholarships in AI were awarded in 2022?"

## Now send the query to the model through the OpenAI-compatible endpoint

In [7]:
llm = ChatOpenAI(
    model=GENERATOR_MODEL_NAME,
    temperature=0,
    max_tokens=None,
    base_url=OPENAI_BASE_URL,
    api_key=OPENAI_API_KEY
)
message = [
    ("human", query),
]
try:
    result = llm.invoke(message)
    print(f"Result: \n\n{result.content}")
except Exception as err:
    # Python 3 exceptions have no `.message` attribute; check the string form instead
    if "Error code: 503" in str(err):
        print(f"The model {GENERATOR_MODEL_NAME} is not ready yet.")
    else:
        raise

Result: 

In 2022, the **Vector Institute awarded 295 Vector Scholarships in Artificial Intelligence (VSAI)** to students entering AI-related master’s programs at Ontario universities.


Without additional information, the model answers confidently but incorrectly. **Vector in fact awarded 109 AI scholarships in 2022.** Fortunately, that figure appears in Vector's 2021-22 Annual Report, which we downloaded to the `source_documents` folder. Let's see how RAG can augment our question with a document search and produce the correct answer.

## Ingestion: Load and store the documents from `source_documents`

Start by reading in all the PDF files from `source_documents`, then break them up into smaller, digestible chunks. The chunks are encoded as vector embeddings in the steps that follow.

In [8]:
# Load PDFs
document_reader = DocumentReader(directory_path=DIRECTORY_PATH)
docs, chunks = document_reader.load()

print(f"Number of source documents: {len(docs)}")
print(f"Number of text chunks: {len(chunks)}")

Number of source documents: 42
Number of text chunks: 196
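
The chunking step can be pictured with a short sketch. `DocumentReader` comes from `aieng.rag.utils`, and its actual splitting strategy is not shown in this notebook, so the fixed-size character windows with overlap below are an assumption for illustration only. The names `split_into_chunks`, `chunk_size`, and `overlap` are hypothetical.

```python
def split_into_chunks(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Illustrative only: this is not necessarily how DocumentReader splits text.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    sample = "abcdefghij"
    print(split_into_chunks(sample, chunk_size=4, overlap=1))
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.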


#### Define the embeddings model

In [9]:
device = get_device_name()

model_kwargs = {'device': device, 'trust_remote_code': True}
encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity

print("Setting up the embeddings model...")
embeddings = HuggingFaceEmbeddings(
    model_name=EMBEDDING_MODEL_NAME,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)

Setting up the embeddings model...
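
The `normalize_embeddings` comment above rests on a small piece of arithmetic: once two vectors are scaled to unit length, their plain dot product equals their cosine similarity. The sketch below checks this with two toy 2-dimensional vectors in pure Python. The helper names are hypothetical, and real embeddings have many more dimensions.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit (L2) length."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


if __name__ == "__main__":
    a, b = [3.0, 4.0], [4.0, 3.0]
    print(cosine_similarity(a, b))                 # cosine on the raw vectors
    print(dot(l2_normalize(a), l2_normalize(b)))   # dot product after normalizing
```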


## Retrieval: Make the document chunks available via a retriever

The retriever identifies the document chunks that most closely match our original query. (This takes about 1-2 minutes.)

In [10]:
vectorstore = FAISS.from_documents(chunks, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Retrieve the most relevant context from the vector store based on the query
retrieved_docs = retriever.invoke(query)
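
Conceptually, a `k`-nearest retrieval scores every stored chunk vector against the query vector and keeps the best `k`. The brute-force sketch below illustrates that idea on toy unit-length vectors. It is not the FAISS implementation, and the `top_k` helper and the sample vectors are hypothetical.

```python
def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 5) -> list[int]:
    """Return the indices of the k chunk vectors most similar to the query.

    Assumes all vectors are unit length, so the dot product is the
    cosine similarity and a larger score means a closer match.
    """
    scored = [
        (sum(q * c for q, c in zip(query_vec, vec)), idx)
        for idx, vec in enumerate(chunk_vecs)
    ]
    # Sort by score only, best match first; ties keep their original order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


if __name__ == "__main__":
    query = [1.0, 0.0]
    chunks = [[0.0, 1.0], [1.0, 0.0], [0.6, 0.8], [0.8, 0.6]]
    print(top_k(query, chunks, k=2))
```

A real vector store avoids recomputing scores in Python for every query, but the contract is the same: a query vector goes in, and a ranked list of the `k` closest chunks comes out.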

Let's see what results it found. Note that the results are listed in the retriever's ranking order, with the best match first.

In [11]:
pretty_print(retrieved_docs)

Document 1:

5 
Annual Report 2021–22Vector Institute
SPOTLIGHT ON FIVE YEARS OF AI 
LEADERSHIP FOR CANADIANS 
SINCE THE VECTOR INSTITUTE WAS FOUNDED IN 2017: 
2,080+ 
Students have graduated from 
Vector-recognized AI programs and 
study paths 
$6.2 M 
Scholarship funds committed to 
students in AI programs 
3,700+ 
Postings for AI-focused jobs and 
internships ofered on Vector’s 
Digital Talent Hub 
$103 M 
In research funding committed to 
Vector-afliated researchers 
94 
Research awards earned by
----------------------------------------------------------------------------------------------------
Document 2:

26 
 
 
VECTOR SCHOLARSHIPS IN 
AI ATTRACT TOP TALENT 
TO ONTARIO UNIVERSITIES 
109 
Vector Scholarships in AI awarded 
34 
Programs 
13 
Universities 
351 
Scholarships awarded since the 
program launched in 2018 
Supported with funding from the Province of 
Ontario, the Vector Institute Scholarship in Artifcial 
Intelligence (VSAI) helps Ontario universities to attract 
the b

## Now send the query to the RAG pipeline

In [12]:
rag_pipeline = RetrievalQA.from_llm(llm=llm, retriever=retriever)
result = rag_pipeline.invoke(input=query)
print(f"Result: \n\n{result['result']}")

Result: 

In 2022, 109 Vector Scholarships in AI were awarded.


The model provides the correct answer (109) using the retrieved information.

However, the output varies between runs (as of now) due to stochasticity, and two other behaviors show up:
1. It sometimes adds another sentence that appears to be hallucinated, because it confuses the total number of scholarships (351) with the number awarded in 2022 alone.
2. It sometimes just says *I don't know* because it's not sure about the year.
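
Both behaviors are easier to reason about once you picture what the LLM actually receives. Below is a minimal sketch, assuming the retrieved chunks are simply concatenated into the prompt ahead of the question. The template wording and the `build_rag_prompt` helper are hypothetical, and the prompt that `RetrievalQA` really uses may differ.

```python
def build_rag_prompt(question: str, context_chunks: list[str]) -> str:
    """Assemble a single prompt from retrieved chunks and the user question.

    Hypothetical template for illustration, not the one LangChain uses.
    """
    context = "\n\n".join(chunk.strip() for chunk in context_chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


if __name__ == "__main__":
    # Two snippets taken from the retrieved Document 2 shown earlier
    chunks = [
        "109 \nVector Scholarships in AI awarded",
        "351 \nScholarships awarded since the program launched in 2018",
    ]
    print(build_rag_prompt("How many Vector scholarships in AI were awarded in 2022?", chunks))
```

Note that the context contains both 109 and 351 but never states the year 2022 next to the first figure, which is consistent with the model occasionally mixing up the two numbers or declining to answer.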