# Document Search with LlamaIndex using Cohere

This example shows how to use the Python [LlamaIndex](https://docs.llamaindex.ai/en/stable/) library to send a text-generation request to a Cohere LLM, then augment that request with text retrieved from a collection of local PDF documents using a local embedding model.

### <u>Requirements</u>
1. Create the ```.cohere.key``` file in your home directory (```/h/<user_id>```) and store your Cohere API key in it as plain text.
2. (Optional) Upload some PDF files into the `source_documents` subfolder under this notebook. We have already provided some sample PDFs, but feel free to replace these with your own.

## Set up the RAG workflow environment

#### Import libraries

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import faiss
import os
import sys

from pathlib import Path

from langchain.text_splitter import RecursiveCharacterTextSplitter

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings, StorageContext
from llama_index.core.llms import ChatMessage
from llama_index.core.node_parser import LangchainNodeParser
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.cohere import Cohere
from llama_index.vector_stores.faiss import FaissVectorStore

#### Read Cohere API Key

In [3]:
try:
    # Read the key from ~/.cohere.key; the context manager closes the file even on error
    with open(Path.home() / ".cohere.key", "r") as f:
        os.environ["COHERE_API_KEY"] = f.read().strip()
except OSError as err:
    print(f"Could not read your Cohere API key. Please make sure this is available in plain text under your home directory in ~/.cohere.key: {err}")

#### Set up some helper functions

In [4]:
def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.text for i, d in enumerate(docs)]
        )
    )

#### Make sure other necessary items are in place

In [5]:
# Look for the source_documents folder and make sure it contains at least one PDF file
directory_path = "./source_documents"
if not os.path.isdir(directory_path):
    print(f"ERROR: The {directory_path} subfolder must exist under this notebook")
# Only list the folder once we know it exists; match the extension case-insensitively
elif not any(name.lower().endswith(".pdf") for name in os.listdir(directory_path)):
    print(f"ERROR: The {directory_path} subfolder must contain at least one .pdf file")

#### Choose Cohere LLM and local embedding model

In [6]:
GENERATOR_MODEL_NAME = "command-r"
EMBEDDING_MODEL_NAME = "BAAI/bge-base-en-v1.5"

## Start with a basic generation request without RAG augmentation

Let's start by asking Cohere a difficult, domain-specific question that we don't expect it to be able to answer. A simple question like "*What is the capital of France?*" is a poor choice here, because that is world knowledge we expect the LLM to have.

Instead, we want a domain-specific question whose answer the model is unlikely to know. A good example is an obscure detail buried deep within a company's annual report. For example:

*How many Vector scholarships in AI were awarded in 2022?*

In [7]:
query = "How many Vector scholarships in AI were awarded in 2022?"

## Now send the query to the Cohere model

In [8]:
llm = Cohere(
    model=GENERATOR_MODEL_NAME,
    temperature=0,
    max_tokens=128,
    api_key=os.environ["COHERE_API_KEY"],
)
message = [
    ChatMessage(
        role="user",
        content=query
    )
]

result = llm.chat(message)
print(f"Result: \n\n{result}")

Result: 

assistant: According to the official Vector Institute website, in 2022, 40 scholarships in AI were awarded. The Vector Institute is a non-profit organization that focuses on research and talent development in the field of artificial intelligence. These scholarships, worth $10,000 each, are offered to outstanding graduate students pursuing research in AI and related fields. 

The Vector Institute also awarded 5 exceptional scholars with the Premier's Awards, which recognizes the top scholars among the Vector Institute scholarship recipients. These awards are named in honor of the Premier of Ontario and carry an additional cash prize of $5,000. 

Would you like more information on the Vector Institute or the scholarships they offer?


Without additional information, Cohere cannot answer the question correctly and instead returns a confident but wrong figure (40). **Vector in fact awarded 109 AI scholarships in 2022.** Fortunately, that information is available in Vector's 2021-22 Annual Report, which is included in the `source_documents` folder. Let's see how we can use RAG to augment our question with a document search and get the correct answer.

## Ingestion: Load and store the documents from `source_documents`

Start by reading in all the PDF files from `source_documents`, then break them up into smaller, digestible chunks. The chunks are encoded as vector embeddings later, when the index is built.

In [9]:
# Load the pdfs
directory_path = "./source_documents"
docs = SimpleDirectoryReader(input_dir=directory_path).load_data()
print(f"Number of source documents: {len(docs)}")

# Split the documents into smaller chunks
parser = LangchainNodeParser(RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=32))
chunks = parser.get_nodes_from_documents(docs)
print(f"Number of text chunks: {len(chunks)}")

Number of source documents: 42
Number of text chunks: 228
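
To make the roles of `chunk_size=512` and `chunk_overlap=32` more concrete, here is a minimal sketch of overlapping chunking. It is a simplification, not what `RecursiveCharacterTextSplitter` does internally: the real splitter prefers to break on separators such as paragraph and line breaks, while this sketch cuts fixed-size character windows. The function name `chunk_text` and the sample string are hypothetical and only serve the illustration.

```python
def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 32) -> list[str]:
    """Cut text into fixed-size character windows that overlap by chunk_overlap.

    Hypothetical helper for illustration only; it ignores word and
    paragraph boundaries, unlike the splitter used in the notebook.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


sample = "abcdefghij" * 100  # 1000 characters of toy text
pieces = chunk_text(sample)
print([len(p) for p in pieces])
```

Each chunk repeats the last 32 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.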


#### Define the embeddings model

In [10]:
print("Setting up the embeddings model...")
embeddings = HuggingFaceEmbedding(
    model_name=EMBEDDING_MODEL_NAME,
    device='cuda',
    trust_remote_code=True,
)

Setting up the embeddings model...
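
The cell above requests `device='cuda'`, which will likely fail on a machine without a GPU. A minimal sketch of a fallback, assuming PyTorch may or may not be installed: the helper `pick_device` is a hypothetical name, and its return value could be passed as the `device` argument instead of the hard-coded string.

```python
def pick_device() -> str:
    """Return "cuda" when a CUDA GPU is usable, otherwise "cpu".

    Hypothetical helper for illustration; it falls back to "cpu"
    if PyTorch is not importable at all.
    """
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```

Running the embedding model on the CPU is slower, so expect the indexing step to take longer in that case.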


#### Set LLM and embedding model [recommended for LlamaIndex]

In [11]:
Settings.llm = llm
Settings.embed_model = embeddings

## Retrieval: Make the document chunks available via a retriever

The retriever identifies the document chunks that most closely match our original query. Building the index first takes about 1-2 minutes, because every chunk has to be embedded.

In [12]:
def get_embed_model_dim(embed_model):
    embed_out = embed_model.get_text_embedding("Dummy Text")
    return len(embed_out)

faiss_dim = get_embed_model_dim(embeddings)
faiss_index = faiss.IndexFlatL2(faiss_dim)

vector_store = FaissVectorStore(faiss_index=faiss_index)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex(chunks, storage_context=storage_context)
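
A flat L2 index such as `faiss.IndexFlatL2` performs an exact search: it compares the query vector against every stored vector and ranks them by squared Euclidean distance. The following pure-Python sketch shows that idea on toy 2-dimensional vectors. It is only an illustration of the technique, not FAISS's implementation, and the names `squared_l2`, `flat_l2_search`, `toy_vectors`, and `toy_query` are hypothetical.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(vectors, query, k):
    """Return (index, squared distance) for the k stored vectors nearest to query.

    Brute force: every stored vector is scored, then the k smallest
    distances are kept, closest first.
    """
    scored = ((squared_l2(vec, query), i) for i, vec in enumerate(vectors))
    return [(i, dist) for dist, i in heapq.nsmallest(k, scored)]


# Toy stand-ins for chunk embeddings and a query embedding
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
toy_query = [0.9, 0.1]

top2 = flat_l2_search(toy_vectors, toy_query, k=2)
print(top2)
```

Here the vector at index 1 is closest to the query, followed by index 0. In the notebook, the stored vectors are the chunk embeddings and `similarity_top_k` plays the role of `k`.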

In [13]:
retriever = index.as_retriever(similarity_top_k=5)

# Retrieve the most relevant context from the vector store based on the query
retrieved_docs = retriever.retrieve(query)

Let's see what it found. Note that the results are listed in the retriever's ranking order, with the best match first.

In [14]:
pretty_print_docs(retrieved_docs)

Document 1:

26 
  VECTOR SCHOLARSHIPS IN 
AI ATTRACT TOP TALENT TO ONTARIO UNIVERSITIES 
109 
Vector Scholarships in AI awarded 
34 
Programs 
13 
Universities 
351 
Scholarships awarded since the 
program launched in 2018 Supported with funding from the Province of Ontario, the Vector Institute Scholarship in Artifcial Intelligence (VSAI) helps Ontario universities to attract the best and brightest students to study in AI-related master’s programs. 
Scholarship recipients connect directly with leading
----------------------------------------------------------------------------------------------------
Document 2:

5 
Annual Report 2021–22 Vector Institute
SPOTLIGHT ON FIVE YEARS OF AI 
LEADERSHIP FOR CANADIANS 
SINCE THE VECTOR INSTITUTE WAS FOUNDED IN 2017: 
2,080+ 
Students have graduated from 
Vector-recognized AI programs and 
study paths $6.2 M 
Scholarship funds committed to 
students in AI programs 3,700+ 
Postings for AI-focused jobs and 
internships ofered on Vector’s 
Digita

## Now send the query to the RAG pipeline

In [15]:
query_engine = RetrieverQueryEngine(retriever=retriever)
result = query_engine.query(query)
print(f"Result: \n\n{result}")

Result: 

According to the Vector Institute's Annual Report of 2021-22, 109 Vector Scholarships in AI were awarded in 2022. 

The Vector Institute Scholarship in Artificial Intelligence (VSAI) helps Ontario universities to attract the best students to study in AI-related master's programs.


The model provides the correct answer (109) using the retrieved information.
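
Conceptually, the query engine gets this result by placing the retrieved chunks into the prompt ahead of the question, so the LLM can read the answer instead of guessing it. The sketch below illustrates that idea under stated assumptions: the prompt wording and the helper `build_rag_prompt` are hypothetical, and LlamaIndex uses its own prompt templates, which differ from this one. The two example chunks are shortened lines from the first retrieved document.

```python
def build_rag_prompt(question: str, context_chunks: list[str]) -> str:
    """Combine retrieved text and the user question into a single prompt.

    Hypothetical template for illustration; not the one LlamaIndex uses.
    """
    # Number each chunk so the context is easy to read and reference
    context = "\n\n".join(
        f"[{i + 1}] {chunk.strip()}" for i, chunk in enumerate(context_chunks)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


example_chunks = [
    "109 Vector Scholarships in AI awarded",
    "351 Scholarships awarded since the program launched in 2018",
]
prompt = build_rag_prompt(
    "How many Vector scholarships in AI were awarded in 2022?", example_chunks
)
print(prompt)
```

Because the figure 109 now appears directly in the prompt, the model no longer needs to rely on what it memorized during training.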