[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/VectorInstitute/rag-bootcamp/blob/refactor/uv-migration/implementations/document_search/document_search_langchain_cohere.ipynb)

# Document Search with LangChain using Cohere

This example shows how to use the Python [LangChain](https://python.langchain.com/docs/get_started/introduction) library to send a text-generation request to a Cohere LLM, then augment that request with text retrieved from a collection of local PDF documents using a local embedding model.

### 📝 Requirements

To run this notebook, you will need:

- **Cohere API key**:  
    - Sign up at [Cohere](https://dashboard.cohere.com/api-keys) and create an API key

## Set up the RAG workflow environment

#### Install libraries (Only in Google Colab)

In [1]:
import os

if 'COLAB_RELEASE_TAG' in os.environ:
    # This is a Google Colab environment
    
    # Check if the notebook is running in a GPU environment and install the appropriate version of faiss
    if 'COLAB_GPU' in os.environ:
        !pip3 install faiss-gpu
    else:
        !pip3 install faiss-cpu

    # Install other dependencies
    !pip3 install langchain langchain-community langchain-huggingface langchain-cohere # aieng-rag-utils

#### Import libraries

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
import os
import faiss

from aieng.rag.utils import get_device_name
from aieng.rag.utils.search import DocumentReader, pretty_print, download_file

from langchain.chains import RetrievalQA
from langchain_community.vectorstores import FAISS
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_cohere import ChatCohere


#### Set Cohere API Key

In [4]:
COHERE_API_KEY = os.getenv("COHERE_API_KEY", "YOUR_COHERE_API_KEY")
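
The cell above falls back to the placeholder string when the environment variable is not set, which would only surface as an authentication error later. A minimal sketch of an early check follows; the helper name `resolve_api_key` is hypothetical and not part of the notebook or of any library.

```python
import os

PLACEHOLDER = "YOUR_COHERE_API_KEY"  # same placeholder string as in the cell above


def resolve_api_key(env=os.environ, name="COHERE_API_KEY"):
    """Return the API key found in `env`, or raise if it is missing or still the placeholder."""
    key = env.get(name, PLACEHOLDER)
    if not key or key == PLACEHOLDER:
        raise ValueError(f"Set the {name} environment variable before running the notebook.")
    return key
```

Calling `resolve_api_key()` right after the cell above would fail fast with a clear message instead of at the first request to Cohere.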

#### Download source documents

In [6]:
DIRECTORY_PATH = "./source_documents"
DOCUMENT_URL = "https://vectorinstitute.ai/wp-content/uploads/2023/05/vector-institute-2021-22-annual-report_accessible.pdf"

download_file(DOCUMENT_URL, DIRECTORY_PATH)

Downloaded https://vectorinstitute.ai/wp-content/uploads/2023/05/vector-institute-2021-22-annual-report_accessible.pdf to source_documents/vector-institute-2021-22-annual-report_accessible.pdf


#### Choose Cohere LLM and local embedding model

In [7]:
GENERATOR_MODEL_NAME = "command-r"
EMBEDDING_MODEL_NAME = "BAAI/bge-base-en-v1.5"

## Start with a basic generation request without RAG augmentation

Let's start by asking Cohere a difficult, domain-specific question we don't expect it to have an answer to. A simple question like "*What is the capital of France?*" is not a good question here, because that's world knowledge that we expect the LLM to know.

Instead, we want a question whose answer is unlikely to be in the model's training data, such as an obscure detail buried deep within a company's annual report. For example:

*How many Vector scholarships in AI were awarded in 2022?*

In [8]:
query = "How many Vector scholarships in AI were awarded in 2022?"

## Now send the query to Cohere

In [9]:
llm = ChatCohere(
    model=GENERATOR_MODEL_NAME,
    temperature=0,
    max_tokens=128,
    cohere_api_key=COHERE_API_KEY,
)
message = [
    ("human", query),
]

result = llm.invoke(message)
print(f"Result: \n\n{result.content}")

Result: 

According to the official Vector Institute website, in 2022, a total of 42 scholarships were awarded to outstanding students across Canada as part of the Vector Institute's Graduate Scholarship Program. These scholarships are aimed at supporting exceptional master's and doctoral students conducting research in the field of artificial intelligence (AI) and related areas. 

The Vector Institute, based in Toronto, Ontario, is a hub for AI research and innovation, focusing on building Canada's capacity in AI. The institute offers these scholarships to attract top talent and foster excellence in AI research. 

The 42 scholarship recipients in 2022 were selected based on their potential to become top researchers and leaders in AI. They each received $10,000 in funding to support their studies. These scholarships are a step towards fostering a vibrant AI ecosystem and building a talented workforce in Canada. 

The number of scholarships awarded can vary each year, so it's recommende

Without additional information, Cohere is unable to answer the question correctly (the response is also cut off mid-sentence by the `max_tokens=128` limit). **Vector in fact awarded 109 AI scholarships in 2022.** Fortunately, that information appears in Vector's 2021-22 Annual Report, which we downloaded to the `source_documents` folder. Let's see how we can use RAG to augment our question with a document search and get the correct answer.

## Ingestion: Load and store the documents from `source_documents`

Start by reading in all the PDF files from `source_documents` and breaking them up into smaller, digestible chunks. The chunks are then encoded as vector embeddings.
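
To illustrate the chunking idea, here is a minimal sketch that splits text into overlapping character windows. The internals of `DocumentReader` are not shown in this notebook, so the function name, `chunk_size`, and `overlap` values below are assumptions chosen for illustration only.

```python
def split_into_chunks(text, chunk_size=500, overlap=50):
    """Split `text` into character windows of at most `chunk_size`,
    where each window repeats the last `overlap` characters of the previous one."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "Vector Scholarships in AI attract top talent. " * 20
sample_chunks = split_into_chunks(sample, chunk_size=200, overlap=20)
print(len(sample_chunks), [len(c) for c in sample_chunks])
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of storing some text twice.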

In [10]:
# Load PDFs
doc_reader = DocumentReader(directory_path=DIRECTORY_PATH)
docs, chunks = doc_reader.load()

print(f"Number of source documents: {len(docs)}")
print(f"Number of text chunks: {len(chunks)}")

Number of source documents: 42
Number of text chunks: 196


#### Define the embeddings model

In [11]:
device = get_device_name()

model_kwargs = {'device': device, 'trust_remote_code': True}
encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity

print("Setting up the embeddings model...")
embeddings = HuggingFaceEmbeddings(
    model_name=EMBEDDING_MODEL_NAME,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)

Setting up the embeddings model...
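
The `normalize_embeddings` option matters because, for unit-length vectors, the inner product equals cosine similarity, and squared Euclidean distance is a simple function of it. The sketch below checks this on two toy vectors in plain Python; the vectors and helper names are made up for illustration and do not come from the embedding model.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale `vec` to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm else list(vec)


def cosine_similarity(a, b):
    """Cosine of the angle between `a` and `b`."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

cos = cosine_similarity(a, b)
inner = dot(a_unit, b_unit)
squared_l2 = sum((x - y) ** 2 for x, y in zip(a_unit, b_unit))

# For unit vectors: inner product == cosine, and squared L2 distance == 2 - 2 * cosine
print(cos, inner, squared_l2, 2 - 2 * cos)
```

Because of this relationship, ranking normalized vectors by inner product or by Euclidean distance gives the same order as ranking by cosine similarity.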


## Retrieval: Make the document chunks available via a retriever

The retriever will identify the document chunks that most closely match our original query. (This takes about 1-2 minutes)
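
Conceptually, retrieval scores every chunk embedding against the query embedding and keeps the `k` best. A brute-force sketch of that idea is shown here; it is not how FAISS is implemented, and the two-dimensional toy vectors are hypothetical stand-ins for real embeddings.

```python
def top_k(query_vec, chunk_vecs, k=5):
    """Return (index, score) pairs for the k chunks with the highest inner product."""
    scores = [
        (i, sum(q * c for q, c in zip(query_vec, vec)))
        for i, vec in enumerate(chunk_vecs)
    ]
    # Highest score first, so the best match comes first in the result
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


toy_chunk_vecs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [-1.0, 0.0]]
toy_query_vec = [0.8, 0.6]
best = top_k(toy_query_vec, toy_chunk_vecs, k=2)
print(best)
```

The `search_kwargs={"k": 5}` argument in the cell below plays the same role as `k` in this sketch.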

In [12]:
vectorstore = FAISS.from_documents(chunks, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Retrieve the most relevant context from the vector store based on the query
retrieved_docs = retriever.invoke(query)

Let's see what results it found. Note that the results are listed in the retriever's ranking order, with the best match first.

In [13]:
pretty_print(retrieved_docs)

Document 1:

5 
Annual Report 2021–22Vector Institute
SPOTLIGHT ON FIVE YEARS OF AI 
LEADERSHIP FOR CANADIANS 
SINCE THE VECTOR INSTITUTE WAS FOUNDED IN 2017: 
2,080+ 
Students have graduated from 
Vector-recognized AI programs and 
study paths 
$6.2 M 
Scholarship funds committed to 
students in AI programs 
3,700+ 
Postings for AI-focused jobs and 
internships ofered on Vector’s 
Digital Talent Hub 
$103 M 
In research funding committed to 
Vector-afliated researchers 
94 
Research awards earned by
----------------------------------------------------------------------------------------------------
Document 2:

26 
 
 
VECTOR SCHOLARSHIPS IN 
AI ATTRACT TOP TALENT 
TO ONTARIO UNIVERSITIES 
109 
Vector Scholarships in AI awarded 
34 
Programs 
13 
Universities 
351 
Scholarships awarded since the 
program launched in 2018 
Supported with funding from the Province of 
Ontario, the Vector Institute Scholarship in Artifcial 
Intelligence (VSAI) helps Ontario universities to attract 
the b

## Now send the query to the RAG pipeline

In [14]:
rag_pipeline = RetrievalQA.from_llm(llm=llm, retriever=retriever)
result = rag_pipeline.invoke(input=query)
print(f"Result: \n\n{result['result']}")

Result: 

I'm unable to provide a definitive answer as the year the scholarships were awarded is not stated within the text provided. However, it does mention that 351 scholarships have been awarded since the program's launch in 2018.

In the section highlighting the 2021-22 annual report, it states that Vector Institute has helped facilitate 2,080+ student graduations from AI programs and study paths, and that $6.2M in scholarship funds have been committed to students in AI programs. It's unclear if the funds mentioned are solely related to the Vector Scholarships in AI or includes additional scholarship programs.


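The retriever surfaces the relevant chunk: Document 2 contains "109 Vector Scholarships in AI awarded". In the run shown above, however, the model declines to give a definitive answer because the retrieved text does not tie that figure to a specific year.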
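
To make the augmentation step concrete, here is a sketch of how retrieved chunks could be combined with the question into a single prompt. The template wording and the `build_prompt` helper are hypothetical; this is not the prompt that `RetrievalQA` uses internally.

```python
def build_prompt(question, contexts):
    """Combine retrieved chunks and the question into one prompt string."""
    context_block = "\n\n".join(
        f"[{i}] {text}" for i, text in enumerate(contexts, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\nAnswer:"
    )


example_prompt = build_prompt(
    "How many Vector scholarships in AI were awarded in 2022?",
    [
        "109 Vector Scholarships in AI awarded",
        "351 Scholarships awarded since the program launched in 2018",
    ],
)
print(example_prompt)
```

Seeing the prompt this way also suggests why the model hesitated: the chunk states the figure but not the year, so adding the report title or date to each chunk is one possible way to give the model that missing context.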