# OpenAI Document Search with Langchain

This example shows how to use the Python [langchain](https://python.langchain.com/docs/get_started/introduction) library to run a text-generation request on open-source LLMs and embedding models using the OpenAI SDK, then augment that request using the text stored in a collection of local PDF documents.

### <u>Requirements</u>
1. Because you will be accessing the LLMs and embedding models through Vector AI Engineering's Kaleidoscope Service (Vector Inference + Autoscaling), you will need to request a KScope API Key:
- If using **VPN**:
  
  Visit [https://kscope.vectorinstitute.ai/](https://kscope.vectorinstitute.ai/) and select *Request API Key*.
  
- If running **without VPN**:

  Run the following command (replace ```<user_id>``` and ```<password>```) from **within the cluster** to obtain the API Key. The ```access_token``` in the output is your KScope API Key.
  ```bash
  curl -X POST -d "grant_type=password" -d "username=<user_id>" -d "password=<password>" https://kscope.vectorinstitute.ai/token
  ```
2. After obtaining the `.env` configurations, make sure to create the ```.kscope.env``` file in your home directory (```/h/<user_id>```) and set the following env variables:
- For local models through Kaleidoscope (KScope):
    ```bash
    export OPENAI_BASE_URL="https://kscope.vectorinstitute.ai/v1"
    export OPENAI_API_KEY=<kscope_api_key>
    ```
- For OpenAI models:
   ```bash
   export OPENAI_BASE_URL="https://api.openai.com/v1"
   export OPENAI_API_KEY=<openai_api_key>
   ```
3. (Optional) Upload some PDF files into the `source_documents` subfolder under this notebook. We have already provided some sample PDFs, but feel free to replace these with your own.
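
To make the expected shape of `.kscope.env` concrete, here is a minimal sketch of how a file of `export KEY=value` lines could be parsed and checked. The functions `parse_env_file` and `missing_keys` are hypothetical names introduced for illustration; they are not the repository's `load_env_file` helper, whose implementation is not shown in this notebook.

```python
import shlex
from pathlib import Path

# Variables this notebook reads from the environment
REQUIRED_KEYS = ("OPENAI_BASE_URL", "OPENAI_API_KEY")


def parse_env_file(path):
    """Parse lines of the form `export KEY=value` or `KEY=value` into a dict."""
    values = {}
    for raw_line in Path(path).read_text().splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("export "):
            line = line[len("export "):]
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not an assignment
        # shlex.split removes surrounding quotes the way a shell would
        parts = shlex.split(value)
        values[key.strip()] = parts[0] if parts else ""
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]
```

Calling `missing_keys(parse_env_file(path_to_env_file))` before running the notebook would report which of the two variables still needs to be set.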

## Set up the RAG workflow environment

In [1]:
from getpass import getpass
import os
import requests

from langchain.chains import RetrievalQA
from langchain.document_loaders.pdf import PyPDFDirectoryLoader
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import EmbeddingsFilter
from langchain.schema import HumanMessage
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

#### Load config files

In [2]:
import sys
from pathlib import Path

# Add root folder of the rag_bootcamp repo to PYTHONPATH
current_dir = Path().resolve()
parent_dir = current_dir.parent
sys.path.insert(0, str(parent_dir))

In [3]:
from utils.load_secrets import load_env_file
load_env_file()

In [4]:
GENERATOR_BASE_URL = os.environ.get("OPENAI_BASE_URL")
EMBEDDING_BASE_URL = os.environ.get("OPENAI_BASE_URL")

OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")

#### Set up some helper functions

In [5]:
def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )

#### Make sure other necessary items are in place

In [6]:
# Look for the source_documents folder and make sure there is at least 1 pdf file here
contains_pdf = False
directory_path = "./source_documents"
if not os.path.isdir(directory_path):
    print(f"ERROR: The {directory_path} subfolder must exist under this notebook")
else:
    # Only scan the folder once we know it exists; match the file extension, not any substring
    contains_pdf = any(name.lower().endswith(".pdf") for name in os.listdir(directory_path))
    if not contains_pdf:
        print(f"ERROR: The {directory_path} subfolder must contain at least one .pdf file")

#### Set LLM and embedding model

In [7]:
GENERATOR_MODEL_NAME = "Meta-Llama-3.1-8B-Instruct"
EMBEDDING_MODEL_NAME = "e5-mistral-7b-instruct" # "text-embedding-3-small"

## Start with a basic generation request without RAG augmentation

Let's start by asking the LLM a difficult, domain-specific question that we don't expect it to be able to answer. A simple question like "*What is the capital of France?*" is not a good choice here, because that is basic knowledge we expect the LLM to have.

Instead, we want a very domain-specific question whose answer it won't know. A good example would be an obscure detail buried deep within a company's annual report. For example:

"*How many Vector scholarships in AI were awarded in 2022?*"

In [8]:
# query = "How many Vector scholarships in AI were awarded in 2022?"
query = "How many AI master's students began their studies in 2021-22?"

## Now send the query to KScope

In [9]:
llm = ChatOpenAI(model=GENERATOR_MODEL_NAME, base_url=GENERATOR_BASE_URL, api_key=OPENAI_API_KEY)
message = [
    HumanMessage(
        content=query
    )
]
try:
    result = llm.invoke(message)  # invoke() replaces the deprecated direct call llm(message)
    print(f"Result: \n\n{result.content}")
except Exception as err:
    if "Error code: 503" in str(err):  # Python 3 exceptions have no .message attribute
        print(f"The model {GENERATOR_MODEL_NAME} is not yet ready.")
    else:
        raise

Result: 

Unfortunately, I'm a large language model, I don't have direct access to real-time data or specific statistics on the number of AI master's students who began their studies in 2021-22. The number of students varies widely depending on the country, institution, and program.

However, I can suggest some possible sources where you may be able to find the information you're looking for:

1. **National statistics offices**: You can try contacting the national statistics office of the country where you're interested in finding the data. For example, in the United States, you can contact the National Center for Education Statistics (NCES).
2. **University websites**: Many universities publish information on their website about the number of students enrolled in their master's programs, including AI programs. You can try searching for the websites of universities that offer AI master's programs and see if they provide this information.
3. **Professional organizations**: Organizations

Without additional information, Llama-3.1 is unable to answer the question correctly. **Vector in fact awarded 109 AI scholarships in 2022.** Fortunately, we do have that information available in Vector's 2021-22 Annual Report, which is available in the `source_documents` folder. Let's see how we can use RAG to augment our question with a document search and get the correct answer.

-----------------------------------------------------

## Ingestion: Load and store the documents from source_documents

Start by reading in all the PDF files from `source_documents`, then break them up into smaller, digestible chunks and encode them as vector embeddings.
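
To illustrate what "chunks with overlap" means, here is a deliberately simplified sketch that cuts a string into fixed-size windows. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to split on separators such as paragraph and line breaks); `split_with_overlap` is a hypothetical helper that only shows the roles of `chunk_size` and `chunk_overlap`.

```python
def split_with_overlap(text, chunk_size=512, chunk_overlap=32):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears in full in at least one chunk, as long as it is shorter than the overlap.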

In [10]:
# Load the pdfs
directory_path = "./source_documents"
loader = PyPDFDirectoryLoader(directory_path)
docs = loader.load()
print(f"Number of source documents: {len(docs)}")

# Split the documents into smaller chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=32)
chunks = text_splitter.split_documents(docs)
print(f"Number of text chunks: {len(chunks)}")

# Define the embeddings model
# NOTE: encode_kwargs is not passed to OpenAIEmbeddings below; it is kept for reference only
encode_kwargs = {'normalize_embeddings': True}

print("Setting up the embeddings model...")
embeddings = OpenAIEmbeddings(base_url=EMBEDDING_BASE_URL, model=EMBEDDING_MODEL_NAME, api_key=OPENAI_API_KEY)

print("Done")

Number of source documents: 42
Number of text chunks: 228
Setting up the embeddings model...
Done


## Retrieval: Make the document chunks available via a retriever

The retriever will identify the document chunks that most closely match our original query. (This takes about 1-2 minutes)
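
Conceptually, "most closely match" means ranking chunk embeddings by their similarity to the query embedding and keeping the top `k`. The sketch below shows this with cosine similarity in plain Python. The names `cosine_similarity` and `top_k_indices` are hypothetical, and the actual vector store may use a different metric (for instance Euclidean distance); for unit-length embeddings the two metrics produce the same ranking.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k_indices(query_vec, chunk_vecs, k=10):
    """Return the indices of the k chunk vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # best match first
    return [idx for _, idx in scored[:k]]
```

A real vector index avoids comparing the query against every chunk one by one, but the ranking idea is the same.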

In [11]:
try:
    vectorstore = FAISS.from_documents(chunks, embeddings)
    retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
    
    # Retrieve the most relevant context from the vector store based on the query
    docs = retriever.get_relevant_documents(query)
    pretty_print_docs(docs)
except Exception as err:
    if "Error code: 503" in str(err):  # Python 3 exceptions have no .message attribute
        print(f"The model {EMBEDDING_MODEL_NAME} is not yet ready.")
    else:
        raise



Document 1:

ADVANCING LEADING AI RESEARCH BY 
HARNESSING VECTOR’S ENGINEERING RESOURCES 
This year, the AI Engineering team worked directly 
with researchers and their labs, exploring their specifc topics and collaborating with them to build and engineer software solutions and tools to address the technological barriers limiting their work. 
Following a successful pilot project in 2020–21,
----------------------------------------------------------------------------------------------------
Document 2:

engineering. Through workshops, graduate studies and internship placements with government and other public health partners, AI4PH will enable future public health AI specialists to harness real-time data processing, analysis, and visualization from broad and specifc data sources to gain a better understanding of what is happening in the population. NEW CHIEF DATA OFFICER ROLE
----------------------------------------------------------------------------------------------------
Document 3:

Let's see what results it found. Note that the results are listed in the order the retriever ranked them, best match first.

These results seem to somewhat match our original query, but we still can't seem to find the information we're looking for. Let's try sending our LLM query again including these results, and see what it comes up with.
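
Sending the query "including these results" amounts to placing the retrieved chunks into the prompt ahead of the question, which is the idea behind `chain_type="stuff"`. The following sketch shows that assembly step. The function `build_stuffed_prompt` and its template wording are assumptions made for illustration and are not the prompt LangChain actually uses.

```python
def build_stuffed_prompt(question, chunk_texts):
    """Concatenate all retrieved chunks into one context block, followed by the question."""
    context = "\n\n".join(chunk_texts)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

Because every chunk is pasted into a single prompt, the number and size of the retrieved chunks must fit within the model's context window.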

In [12]:
print(f"Sending the RAG generation with query: {query}")
qa = RetrievalQA.from_chain_type(llm=llm,
        chain_type="stuff",
        retriever=retriever)
print(f"Result:\n\n{qa.run(query=query)}") 

Sending the RAG generation with query: How many AI master's students began their studies in 2021-22?
Result:

I don't know. The provided text does not mention the number of AI master's students who began their studies in 2021-22.


## Reranking: Improve the ordering of the document chunks

In [27]:
embeddings = OpenAIEmbeddings(base_url=EMBEDDING_BASE_URL, model=EMBEDDING_MODEL_NAME, api_key=os.environ["OPENAI_API_KEY"])
embeddings_filter = EmbeddingsFilter(embeddings=embeddings, similarity_threshold=0.66)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=embeddings_filter, base_retriever=retriever
)
compressed_docs = compression_retriever.get_relevant_documents(query)
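
The effect of `similarity_threshold=0.66` can be pictured as dropping every retrieved chunk whose similarity to the query falls below the cutoff. The sketch below is a hypothetical illustration of that idea using precomputed scores; `filter_by_similarity` is not the `EmbeddingsFilter` implementation, and the descending sort and the inclusive comparison are assumptions.

```python
def filter_by_similarity(scored_chunks, threshold=0.66):
    """Keep chunks scoring at or above the threshold, best match first.

    scored_chunks is a list of (similarity, text) pairs.
    """
    kept = [(score, text) for score, text in scored_chunks if score >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in kept]
```

A higher threshold passes fewer, more closely matching chunks to the LLM; if it is set too high, nothing may survive the filter.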



Now let's see what the reranked results look like:

In [28]:
pretty_print_docs(compressed_docs)

Document 1:

ADVANCING LEADING AI RESEARCH BY 
HARNESSING VECTOR’S ENGINEERING RESOURCES 
This year, the AI Engineering team worked directly 
with researchers and their labs, exploring their specifc topics and collaborating with them to build and engineer software solutions and tools to address the technological barriers limiting their work. 
Following a successful pilot project in 2020–21,
----------------------------------------------------------------------------------------------------
Document 2:

Talent and Research was renewed in the federal Budget 2021, and funding support associated with that renewal is expected to begin in 2022–23. The federal Budget 2021 also announced support for each of the national AI institutes to accelerate the translation of AI research into commercial or other innovations, and this funding started at the end of 2021–22. 
The Vector Institute’s audited fnancial statements for 
the 2021–22 fscal year are available on our website . 
STATEMENT OF FINANCIA

Lastly, let's run our LLM query a final time with the reranked results:

In [29]:
qa = RetrievalQA.from_chain_type(llm=llm,
        # chain_type="stuff",
        retriever=compression_retriever)

print(f"Result:\n\n {qa.run(query=query)}")



Result:

 I don't know. The provided text does not mention the number of Vector scholarships in AI that were awarded in 2022. It does mention the Talent and Research program, which was renewed in the federal Budget 2021 and received funding support beginning in 2022-23, but it does not provide information on scholarship awards.
