# Cohere Document Search with LlamaIndex

This example shows how to use the Python [LlamaIndex](https://docs.llamaindex.ai/en/stable/) library to run a text-generation request against [Cohere's](https://cohere.com/) API, then augment that request using the text stored in a collection of local PDF documents.

**Requirements:**
- You will need a Cohere API key, which you can sign up for at https://dashboard.cohere.com/welcome/login. A free trial account will suffice, but it is limited to a small number of requests.
- After obtaining this key, store it in plain text in the `~/.cohere.key` file in your home directory.
- (Optional) Upload some PDF files into the source-materials folder that this notebook reads (`./AML_Data/Personal/PDF` in the cells below). We have already provided some sample PDFs, but feel free to replace these with your own.

## Set up the RAG workflow environment

In [2]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

In [1]:
from getpass import getpass
import os
from pathlib import Path

from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.cohereai import CohereEmbedding
from llama_index.llms import Cohere
from llama_index.postprocessor.cohere_rerank import CohereRerank

Set up some helper functions:

In [3]:
def pretty_print_docs(docs):
    # LlamaIndex documents and nodes expose their text via get_content()
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.get_content() for i, d in enumerate(docs)]
        )
    )

Make sure other necessary items are in place:

In [4]:
try:
    # Read the key file once and store it under both environment variable names
    cohere_key = (Path.home() / ".cohere.key").read_text().strip()
    os.environ["COHERE_API_KEY"] = cohere_key
    os.environ["CO_API_KEY"] = cohere_key
except OSError:
    print("ERROR: You must have a Cohere API key available in your home directory at ~/.cohere.key")

In [5]:
# Look for the source-materials folder and make sure there is at least 1 pdf file here
directory_path = "./AML_Data/Personal/PDF"
if not os.path.isdir(directory_path):
    # Report the missing folder instead of letting os.listdir raise
    print(f"ERROR: The {directory_path} subfolder must exist under this notebook")
elif not any(name.lower().endswith(".pdf") for name in os.listdir(directory_path)):
    print(f"ERROR: The {directory_path} subfolder must contain at least one .pdf file")

## Start with a basic generation request without RAG augmentation

Let's start by asking the Cohere LLM a difficult, domain-specific question we don't expect it to have an answer to. A simple question like "*What is the capital of France?*" is not a good question here, because that's basic knowledge that we expect the LLM to know.

Instead, we want to ask a question so domain-specific that the model won't know the answer. A good example would be an obscure detail buried deep within a company's annual report. For example:

"*How many Vector scholarships in AI were awarded in 2022?*"

In [6]:
query = "What are all human trafficking cases in California?"

## Now send the query to Cohere

In [7]:
llm = Cohere(api_key=os.environ["COHERE_API_KEY"])
result = llm.complete(query)
print(f"Result: \n\n{result}")

Result: 

 I cannot provide information on specific human trafficking cases in California due to legal and privacy considerations. 

However, I can provide some general information about human trafficking in California and the resources available to support victims and combat this crime:

According to the California Attorney General's Office, human trafficking is a form of modern-day slavery that involves the manipulation and exploitation of vulnerable individuals for sexual or labor purposes. It is a crime that affects communities across California and worldwide.

If you are seeking information about specific human trafficking cases in California, I recommend visiting the website of the California Department of Justice's Office of the Attorney General. This website provides comprehensive information about human trafficking in California, including statistics, resources for victims, and information about ongoing investigations and prosecutions. You can also find contact information for

## Ingestion: Load and store the documents from source-materials

In [8]:
file_path = './AML_Data/'

In [9]:
file_name = 'personal_data.csv'

In [10]:
df = pd.read_csv(file_path + file_name)

In [11]:
print(df.shape)
df.head()

(324, 4)


| | alert_identifier | customer_name | suspicious_activity | predicate_offense |
|---|---|---|---|---|
| 0 | TMML20240342768 | Sam Waksal | Alert ID: TMML20240342768\nBernie Madoff, who ... | fraud |
| 1 | TMML202403475910 | Mark Denning | Alert ID: TMML202403475910\nPublished\n\nOne o... | broke investment rules |
| 2 | TMML202403405311 | Russell Wasendorf Sr | Alert ID: TMML202403405311\nPublished\n\nThe f... | pleads guilty to fraud |
| 3 | TMML202403479017 | Charlie Shrem | Alert ID: TMML202403479017\nA senior figure in... | arrested for money launering |
| 4 | TMML202403436919 | Shane Whittle | Alert ID: TMML202403436919\nRohan Marley is no... | sanction against smn |


In [12]:
# sample adverse media news
df.iloc[50].suspicious_activity

'Alert ID: TMML2024033036135\nThe former president of an airline at the center of a financial scandal at Newport News-Williamsburg International Airport was sentenced Thursday to two years in prison for fraud in connection with the failure of People Express Airlines in 2014 and the filing of a false income tax return.\n\nMichael Morisi, 59, of Suffolk, is the former president of People Express which engaged in failed start-up operations at the Newport News/Williamsburg International Airport, according to court papers.\n\nHe pleaded guilty last July to wire fraud and filing a false federal income tax return.\n\nHe faced 23 years on the two charges.\n\nFederal prosecutors said Morisi led the push to get the airline operational, despite a failed track record of getting private investments and significant outstanding liabilities.\n\nA switch to a focus on the public commitment of funds led to PEX obtaining a $5 million loan from TowneBank that was guaranteed by the Peninsula Airport Commis

Start by reading in all the PDF files from the `./AML_Data/Personal/PDF` folder.

In [13]:
# Load the pdfs
pdf_folder_path = "./AML_Data/Personal/PDF"
documents = SimpleDirectoryReader(pdf_folder_path).load_data()
print(f"Number of source materials: {len(documents)}\n")

Number of source materials: 461



## Define an embeddings model

This embeddings model will convert the textual data from our PDF files into vector embeddings. These vector embeddings will later enable us to quickly find the chunk of text that most closely corresponds to our original query.
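To make "most closely corresponds" concrete, here is a minimal sketch of similarity search over embeddings. It assumes cosine similarity as the distance measure and uses made-up three-dimensional toy vectors; real Cohere embeddings have far more dimensions, and the helper names `cosine_similarity` and `top_k` are hypothetical, not part of LlamaIndex.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the k chunk names whose vectors are most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), name)
        for name, vec in chunk_vecs.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [(name, score) for score, name in scored[:k]]


# Toy data (assumption): three chunks embedded in a 3-dimensional space
toy_chunks = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.0, 1.0, 0.0],
    "chunk_c": [0.7, 0.7, 0.0],
}
toy_query = [1.0, 0.1, 0.0]

nearest = top_k(toy_query, toy_chunks, k=2)
for name, score in nearest:
    print(f"{name}: {score:.3f}")
```

A vector database does the same thing at scale, typically with an index structure that avoids comparing the query against every stored vector.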

In [15]:
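embed_model = CohereEmbedding(
    model_name="embed-english-v3.0",
    # Note: Cohere's v3 embedding models distinguish "search_document" (text being
    # indexed) from "search_query" (the query itself); this notebook uses
    # "search_query" for both indexing and querying.
    input_type="search_query"
)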
service_context = ServiceContext.from_defaults(
    embed_model=embed_model,
    llm=llm,
    chunk_size=256
)
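The `chunk_size=256` setting controls how the source text is split before it is embedded. The sketch below illustrates the general idea with a simplified splitter that counts whitespace-separated words and uses an assumed overlap of 20 words. It is not LlamaIndex's splitter, which (as far as we understand) counts tokens and tries to respect sentence boundaries; `chunk_words` is a hypothetical helper.

```python
def chunk_words(text, chunk_size=256, overlap=20):
    """Split text into windows of at most chunk_size words.

    Consecutive windows share `overlap` words so that a sentence cut at a
    boundary still appears intact in at least one chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Toy input (assumption): 600 placeholder words
sample = " ".join(f"w{i}" for i in range(600))
pieces = chunk_words(sample)
print(len(pieces), [len(p.split()) for p in pieces])
```

Smaller chunks give more precise matches but carry less surrounding context, so the right value depends on the documents and the questions being asked.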

## Storage: Store the documents in a vector database

In [16]:
index = VectorStoreIndex.from_documents(documents, service_context=service_context, show_progress=True)

Parsing nodes:   0%|          | 0/461 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/6521 [00:00<?, ?it/s]

## Retrieval: Now do a search to retrieve the chunk of document text that most closely matches our original query

In [17]:
queries = ['What are all human trafficking cases in California? State the alert identifiers.',
           'What are all drug trafficking cases in California? State the alert identifiers.',
           'What are cases that has more than $1 million street value of drugs? State the alert identifier of the cases.',
           'What are cases that include females? State the alert identifier and predicate offenses.',
           'What are cases that has minors as victims of human trafficking?',
           'What are the names of individuals or entities involved in alert identifier TMML2024033036135?',
           'what is the predicate offense of alert identifier TMML2024033036135?',
           'what are cases involved Iran? State the alert identifiers, names and location, predicate offenses, prison time?',
           'what are cases with the name of Ashley? State alert identifiers and description of the case.'
          ]

In [18]:
search_query_retriever = index.as_retriever(service_context=service_context)

In [19]:
for query in queries:
    print(f"{query}")
    search_query_retrieved_nodes = search_query_retriever.retrieve(query)
    print(f"Search query retriever found {len(search_query_retrieved_nodes)} results")
    print(f"First result example:\n{search_query_retrieved_nodes[0]}\n")
    print(f"Second result example:\n{search_query_retrieved_nodes[1]}\n")
    print('***************************************************************************')

What are all human trafficking cases in California? State the alert identifiers.
Search query retriever found 2 results
First result example:
Node ID: 2cd6efff-e334-4f3f-841c-38fef4684c3d
Text: Caro Quintero is wanted in the Central District of California on
criminal charges related to the kidnapping and murder of SA Camarena
as well as drug trafficking. The U.S. government is seeking Caro
Quintero’s arrest to face these charges. Today’s action, which
designated 20 entities and one individual linked to Rafael Caro
Quintero pursuant to ...
Score:  0.409


Second result example:
Node ID: 7a5a6df2-eba8-4154-be8f-cbdd28c7339b
Text: According to court documents, Adebara coordinated with overseas
co-conspirators who had assumed false identities on online dating
websites and social media platforms with the intent to defraud
victims. Adebara opened multiple accounts using fraudulent identities
then provided the account and routing numbers to the overseas co-
conspirators. The o...
Score:  0.39

That first result is close but not quite right: it mentions California, but the charges are kidnapping, murder, and drug trafficking rather than human trafficking. Could it be that the retrieval returned the chunks we wanted, just in the wrong order? Let's try using a reranker to check which of our results is the closest match.

## Reranking: Improve the ordering of the document chunks

In [20]:
reranker = CohereRerank()
query_engine = index.as_query_engine(
    node_postprocessors=[reranker]
)
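Conceptually, a reranker takes the candidates returned by the first-stage retrieval and re-sorts them with a second, usually more expensive, relevance score. The sketch below shows that two-stage shape only. The `keyword_overlap` scorer is a deliberately crude stand-in (an assumption for illustration) and bears no relation to how Cohere's rerank model scores text; the candidate strings are made up.

```python
def keyword_overlap(query, text):
    """Crude stand-in scorer: fraction of query words that appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & text_words) / len(query_words)


def rerank(query, candidates, scorer, top_n=2):
    """Re-score every candidate against the query and keep the best top_n."""
    rescored = [(scorer(query, text), text) for text in candidates]
    rescored.sort(key=lambda pair: pair[0], reverse=True)
    return rescored[:top_n]


# Toy candidates (assumption), in the order a first-stage retriever returned them
toy_candidates = [
    "drug trafficking charges filed in california",
    "online dating fraud with overseas accounts",
    "human trafficking case prosecuted in california",
]
toy_query = "human trafficking cases in california"

reranked = rerank(toy_query, toy_candidates, keyword_overlap)
for score, text in reranked:
    print(f"{score:.2f}  {text}")
```

In this toy run the third candidate moves to the top, which is the kind of reordering we hope a real reranker will produce.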

## Final RAG-augmented query
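Before running the query engine, it may help to see what "augmenting" a request means in practice: the retrieved chunks are placed into the prompt alongside the question, so the LLM answers from that context rather than from memory. The sketch below uses a hypothetical template and a hypothetical `build_rag_prompt` helper; the wording of the prompt that LlamaIndex actually sends may differ.

```python
def build_rag_prompt(question, context_chunks):
    """Assemble a single prompt from retrieved chunks and the user's question."""
    context = "\n\n".join(
        f"[{i + 1}] {chunk}" for i, chunk in enumerate(context_chunks)
    )
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        "Answer the question using only the context above.\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Toy chunks (assumption) standing in for the reranked retrieval results
toy_chunks = [
    "First retrieved chunk of text.",
    "Second retrieved chunk of text.",
]
prompt = build_rag_prompt("What does the first chunk say?", toy_chunks)
print(prompt)
```
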

In [21]:
for query in queries:
    print(f"{query}")
    result = query_engine.query(query)
    print(f"Result: {result}\n\n")
    print('***************************************************************************')

What are all human trafficking cases in California? State the alert identifiers.
Result: There are two alerts that warn about human trafficking activities in California. 

Alert Identifier: TMML202403153625

Caro Quintero is wanted in the Central District of California on criminal charges related to the kidnapping and murder of SA Camarena as well as drug trafficking. The U.S. government is seeking Caro Quintero’s arrest to face these charges. Today’s action, which designated 20 entities and one individual linked to Rafael Caro Quintero pursuant to the Kingpin Act, generally prohibits U.S. persons from conducting financial or commercial transactions with these designees, and also freezes any assets they may have under U.S. jurisdiction. This action targeted 20 companies primarily located in or near Guadalajara. Several of these companies are engaged in real estate activities, including Arrendadora Turin, Barsat, and Villas del Colli. Others are gasoline retailers or engaged in agricult