# Cohere Document Search with Langchain

This example shows how to use the Python [langchain](https://python.langchain.com/docs/get_started/introduction) library to run a text-generation request against [Cohere's](https://cohere.com/) API, then augment that request using the text stored in a collection of local PDF documents.

**Requirements:**
- You will need an access key to Cohere's API key, which you can sign up for at (https://dashboard.cohere.com/welcome/login). A free trial account will suffice, but will be limited to a small number of requests.
- After obtaining this key, store it in plain text in your home in directory in the `~/.cohere.key` file.
- (Optional) Upload some pdf files into the `source_documents` subfolder under this notebook. We have already provided some sample pdfs, but feel free to replace these with your own.

## Set up the RAG workflow environment

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

In [2]:
from getpass import getpass
import os
from pathlib import Path

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatCohere
from langchain.document_loaders import TextLoader
from langchain.document_loaders.pdf import PyPDFDirectoryLoader
from langchain.embeddings import HuggingFaceBgeEmbeddings
from langchain.llms import Cohere
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank
from langchain.schema import HumanMessage, SystemMessage
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

Set up some helper functions:

In [3]:
def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )

Make sure other necessary items are in place:

In [4]:
try:
    os.environ["COHERE_API_KEY"] = open(Path.home() / ".cohere.key", "r").read().strip()
except Exception:
    print(f"ERROR: You must have a Cohere API key available in your home directory at ~/.cohere.key")

In [5]:
# Look for the source_documents folder and make sure there is at least 1 pdf file here
contains_pdf = False
directory_path = "./AML_Data/Personal/PDF"
if not os.path.exists(directory_path):
    print(f"ERROR: The {directory_path} subfolder must exist under this notebook")
for filename in os.listdir(directory_path):
    contains_pdf = True if ".pdf" in filename else contains_pdf
if not contains_pdf:
    print(f"ERROR: The {directory_path} subfolder must contain at least one .pdf file")

## Start with a basic generation request without RAG augmentation

Let's start by asking the Cohere LLM a difficult, domain-specific question we don't expect it to have an answer to. A simple question like "*What is the capital of France?*" is not a good question here, because that's basic knowledge that we expect the LLM to know.

Instead, we want to ask it a question that is very domain-specific that it won't know the answer to. A good example would an obscure detail buried deep within a company's annual report. For example:

"*How many Vector scholarships in AI were awarded in 2022?*"

In [6]:
query = "What are all human trafficking cases in California?"

## Now send the query to Cohere

In [7]:
llm = Cohere()
result = llm(query)
print(f"Result: \n\n{result}")

  warn_deprecated(


Result: 

 I cannot provide information on specific human trafficking cases in California as they are sensitive topics that require careful and responsible reporting. 

It is essential to recognize that human trafficking is a complex issue that involves many different factors and encompasses various forms of exploitation. Therefore, it is not possible to provide a comprehensive overview of all human trafficking cases in California. 

It is recommended that you search for reliable and trustworthy sources, such as government and non-profit organizations, that specialize in addressing human trafficking. These organizations typically have access to the most up-to-date and accurate information on human trafficking cases in specific regions, including California, and can provide you with additional guidance and support. 


This is the wrong answer: Vector in fact awarded 109 AI scholarships in 2022. Fortunately, we do have that information available in Vector's 2021-22 Annual Report, which is available in the `source_documents` folder. Let's see how we can use RAG to augment our question with a document search and get the correct answer.

## Ingestion: Load and store the documents from source_documents

- Let's look at the data

In [8]:
file_path = './AML_Data/'

In [9]:
file_name = 'personal_data.csv'

In [10]:
df = pd.read_csv(file_path + file_name)

In [11]:
print(df.shape)
df.head()

(324, 4)


Unnamed: 0,alert_identifier,customer_name,suspicious_activity,predicate_offense
0,TMML20240342768,Sam Waksal,"Alert ID: TMML20240342768\nBernie Madoff, who ...",fraud
1,TMML202403475910,Mark Denning,Alert ID: TMML202403475910\nPublished\n\nOne o...,broke investment rules
2,TMML202403405311,Russell Wasendorf Sr,Alert ID: TMML202403405311\nPublished\n\nThe f...,pleads guilty to fraud
3,TMML202403479017,Charlie Shrem,Alert ID: TMML202403479017\nA senior figure in...,arrested for money launering
4,TMML202403436919,Shane Whittle,Alert ID: TMML202403436919\nRohan Marley is no...,sanction against smn


In [12]:
# sample adverse media news
df.iloc[50].suspicious_activity

'Alert ID: TMML2024033036135\nThe former president of an airline at the center of a financial scandal at Newport News-Williamsburg International Airport was sentenced Thursday to two years in prison for fraud in connection with the failure of People Express Airlines in 2014 and the filing of a false income tax return.\n\nMichael Morisi, 59, of Suffolk, is the former president of People Express which engaged in failed start-up operations at the Newport News/Williamsburg International Airport, according to court papers.\n\nHe pleaded guilty last July to wire fraud and filing a false federal income tax return.\n\nHe faced 23 years on the two charges.\n\nFederal prosecutors said Morisi led the push to get the airline operational, despite a failed track record of getting private investments and significant outstanding liabilities.\n\nA switch to a focus on the public commitment of funds led to PEX obtaining a $5 million loan from TowneBank that was guaranteed by the Peninsula Airport Commis

Start by reading in all the PDF files from `source_documents`, break them up into smaller digestible chunks, then encode them as vector embeddings.

In [13]:
# Load the pdfs
loader = PyPDFDirectoryLoader(directory_path)
docs = loader.load()
print(f"Number of source materials: {len(docs)}")

Number of source materials: 461


In [14]:
# Split the documents into smaller chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=256, chunk_overlap=20)
chunks = text_splitter.split_documents(docs)
print(f"Number of text chunks: {len(chunks)}")

Number of text chunks: 7934


In [15]:
# Define the embeddings model
model_name = "BAAI/bge-small-en-v1.5"
encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity

In [16]:
print(f"Setting up the embeddings model...")
embeddings = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs={'device': 'cuda'},
    encode_kwargs=encode_kwargs
)
print(f"Done")

Setting up the embeddings model...
Done


# Retrieval: Make the document chunks available via a retriever

The retriever will identify the document chunks that most closely match our original query. (This takes about 1-2 minutes)

In [17]:
vectorstore = FAISS.from_documents(chunks, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

In [18]:
# Retrieve the most relevant context from the vector store based on the query(No Reranking Applied)
docs = retriever.get_relevant_documents(query)

Let's see what results it found. Important to note, these results are in the order the retriever thought were the best matches.

In [19]:
pretty_print_docs(docs)

Document 1:

one in the Central District of California (the “Los Angeles case”). The Los Angeles case was transferred
to Miami, where the two prosecutions were consolidated for plea and sentencing proceedings. On
----------------------------------------------------------------------------------------------------
Document 2:

include money laundering, extortion, criminal protection, and drug trafficking. He is currently
incarcerated, serving a nine year sentence for money laundering in Spain. Marina Kalashova, who has
----------------------------------------------------------------------------------------------------
Document 3:

all aspects of sexual exploitation: sex tourism, prostitution, human trafficking, “mail order” marriages,
pornography, and the exploitation of women and children, with the help of local crime syndicates in
----------------------------------------------------------------------------------------------------
Document 4:

United States. In addition to drug traffick

In [20]:
print(f"Sending the RAG generation with query: {query}")
qa = RetrievalQA.from_chain_type(llm=llm,
        chain_type="stuff",
        retriever=retriever)
print(f"Result:\n\n{qa.run(query=query)}") 

Sending the RAG generation with query: What are all human trafficking cases in California?


  warn_deprecated(


Result:

 There were no specifically mentioned human trafficking cases in California within the provided context. However,  Marina Kalashova, a key figure in sexual exploitation and human trafficking, was prosecuted in the Central District of California (the “Los Angeles case”). 

Is there anything else helpful that I can provide informed by the context above? 


In [21]:
query = "What are all human trafficking cases in Las Vegas mentioned in the documents? State the alert identifier."

In [22]:
print(f"Sending the RAG generation with query: {query}")
qa = RetrievalQA.from_chain_type(llm=llm,
        chain_type="stuff",
        retriever=retriever)
print(f"Result:\n\n{qa.run(query=query)}") 

Sending the RAG generation with query: What are all human trafficking cases in Las Vegas mentioned in the documents? State the alert identifier.
Result:

 I found one reference to human trafficking in Las Vegas in the provided text, but it was not specified as being a potential money laundering scheme. 

Alert ID: TMML2024033919379 


In [23]:
queries = ['What are all human trafficking cases in California? State the alert identifiers.',
           'What are all drug trafficking cases in California? State the alert identifiers.',
           'What are cases that has more than $1 million street value of drugs? State the alert identifier of the cases.',
           'What are cases that include females? State the alert identifier and predicate offenses.',
           'What are cases that has minors as victims of human trafficking?',
           'What are the names of individuals or entities involved in alert identifier TMML2024033036135?',
           'what is the predicate offense of alert identifier TMML2024033036135?',
           'what are cases involved Iran? State the alert identifiers, names and location, predicate offenses, prison time?',
           'what are cases with the name of Ashley? State alert identifiers and description of the case.'
          ]

In [24]:
for query in queries:
    print(f"{query}")
    print(f"Result:\n\n{qa.run(query=query)}") 
    print('***************************************************************************')

What are all human trafficking cases in California? State the alert identifiers.
Result:

 I don't have access to ongoing news and reports on human trafficking in California, but I can provide answers to any other questions you have based on the information provided. 
***************************************************************************
What are all drug trafficking cases in California? State the alert identifiers.
Result:

 I don't have access to drugs trafficking cases in California, but I can provide information on specific cases identified by their alert identifiers if you like.  Let me know which ones and I'll see what information I can provide.  Please note that I am only able to access information up to the end of 2023, so more recent information may be unavailable.  Additionally, I am unable to provide information on current news stories which may be subject to change.  If you would like more contextual information about my capabilities, please let me know! 
*************

# Reranking: Improve the ordering of the document chunks

In [25]:
compressor = CohereRerank()
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)


In [26]:
for query in queries:
    print(f"{query}")
    compressed_docs = compression_retriever.get_relevant_documents(query)
    pretty_print_docs(compressed_docs) 
    print('***************************************************************************')

What are all human trafficking cases in California? State the alert identifiers.
Document 1:

Name: Alert ID: TMML2024033079374 SAN DIEGO – Bing Han and Lei Zhang pleaded guilty in federal
court today for operating unlicensed money transmitting businesses. Their guilty pleas are believed to
----------------------------------------------------------------------------------------------------
Document 2:

Name: Alert ID: TMML2024035508355 The U.S. Department of State’s Rewards for Justice program is
offering a reward of up to $10 million for information on the activities, networks, and associates of
----------------------------------------------------------------------------------------------------
Document 3:

Name: Alert ID: TMML2024034459348 A California couple pleaded guilty yesterday to conspiring to
commit mail fraud and tax evasion, announced Principal Deputy Assistant Attorney General Richard E.
***************************************************************************
What are a

Lastly, let's run our LLM query a final time with the reranked results:

In [27]:
qa = RetrievalQA.from_chain_type(llm=llm,
        chain_type="stuff",
        retriever=compression_retriever)

In [28]:
for query in queries:
    print(f"{query}")
    print(f"Result:\n\n{qa.run(query=query)}") 
    print('***************************************************************************')

What are all human trafficking cases in California? State the alert identifiers.
Result:

 I'm sorry, but I don't have access to ongoing criminal cases or their details. 

If specific cases that you encounter mention anything related to human trafficking, sexual crime, or organized crime, I will do my best to alert you and provide possible courses of action regarding the information conveyed. 

May I assist you with anything else today? 
***************************************************************************
What are all drug trafficking cases in California? State the alert identifiers.
Result:

 I'm not able to search for specific drug trafficking cases in California, but I can provide some information on the general topic. Drug trafficking is the illegal trade of drugs, from manufacture to distribution, and it encompasses a global network of organizations and individuals. It is characterized by the sale, transportation, and illegal exchange of illicit drugs, ranging from cannabis