# Multilingual Search with Cohere and Langchain

***Read the accompanying [blog post here](https://txt.cohere.ai/search-cohere-langchain/).***

This notebook contains two examples of multilingual search using Cohere and Langchain. Langchain is a library that supports the development of applications built on top of large language models (LLMs), such as Cohere's models.

In short, Cohere gives developers access to LLMs, and Langchain makes it easier to build applications with these models.

We'll go through the following examples:
- **Example 1 - Basic Multilingual Search**

  This is a simple example of multilingual search over a list of documents.

  The steps in summary:
  - Import a list of documents
  - Embed the documents and store them in an index
  - Enter a query
  - Return the document most similar to the query
- **Example 2 - Search-Based Question Answering**

  This is a more involved example in which search is combined with text generation to answer questions about long-form documents.

  The steps in summary:
  - Add an article and chunk it into smaller passages
  - Embed the passages and store them in an index
  - Enter a question
  - Answer the question based on the most relevant documents

In [None]:
! pip install cohere langchain==0.0.91 chromadb==0.3.2 tfds-nightly> /dev/null

In [None]:

from langchain.embeddings.cohere import CohereEmbeddings
from langchain.llms import Cohere
from langchain.prompts import PromptTemplate
from langchain.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains.question_answering import load_qa_chain
import os
import random

In [None]:
os.environ["COHERE_API_KEY"] = "" # add your Cohere API key here
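
Because the key above is left blank, a later API call would fail with a less obvious error. A minimal sketch of an early check, assuming the key is read from a dict-like environment; `require_api_key` is a hypothetical helper, not part of Cohere or Langchain, and the key value shown is made up:

```python
import os

def require_api_key(env, name="COHERE_API_KEY"):
    """Return the key stored under `name`, or raise if it is missing or blank."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; add your Cohere API key first")
    return key

# In the notebook this would be called as require_api_key(os.environ).
# A made-up key is used here so the sketch runs on its own.
print(require_api_key({"COHERE_API_KEY": "example-key"}))
```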

# Example 1 - Basic Multilingual Search

## Import a list of documents

In [None]:
# Import documents
import tensorflow_datasets as tfds
dataset = tfds.load('trec', split='train')
texts = [item['text'].decode('utf-8') for item in tfds.as_numpy(dataset)]
print(f"Number of documents: {len(texts)}")

Number of documents: 5452


In [None]:
# Print a few sample documents
random.seed(11)
for item in random.sample(texts, 5):
  print(item) 

What is the starting salary for beginning lawyers ?
Where did Bill Gates go to college ?
What task does the Bouvier breed of dog perform ?
What are the top boy names in the U.S. ?
What is a female rabbit called ?


## Embed the documents and store them in an index

In [None]:
# Define the embeddings model
embeddings = CohereEmbeddings(model = "multilingual-22-12")

# Embed the documents and store in index
docsearch = Chroma.from_texts(texts, embeddings, metadatas=[{"source": f"{i}"} for i in range(len(texts))])


## Enter a query

In [None]:
queries = ["How to get in touch with Bill Gates",
           "Comment entrer en contact avec Bill Gates",
           "बिल गेट्स से कैसे संपर्क करें",
           "Cara menghubungi Bill Gates"]

queries_lang = ["English", "French", "Hindi", "Indonesian"] 

## Return the document most similar to the query

In [None]:
# Return document most similar to the query
answers = []
for query in queries:
  docs = docsearch.similarity_search(query)
  answers.append(docs[0].page_content)
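
Conceptually, a similarity search compares the query's embedding with every stored embedding and returns the closest texts. The sketch below illustrates this with cosine similarity, which is one common choice; the metric Chroma actually uses is not shown in this notebook. The 3-dimensional vectors are made-up stand-ins for real embeddings, and `cosine_similarity` and `most_similar` are hypothetical helper names.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query_vec, index, k=1):
    """Return the k (score, text) pairs whose vectors are closest to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index]
    return sorted(scored, reverse=True)[:k]

# Made-up 3-dimensional "embeddings"; real ones come from the embedding model
toy_index = [
    ("What is Bill Gates of Microsoft E-mail address ?", [0.9, 0.1, 0.0]),
    ("What is a female rabbit called ?", [0.0, 0.2, 0.9]),
    ("Where did Bill Gates go to college ?", [0.6, 0.7, 0.1]),
]
toy_query_vec = [1.0, 0.2, 0.0]  # stands in for an embedded query in any language

top = most_similar(toy_query_vec, toy_index, k=1)
print(top[0][1])
```

Since a multilingual model is meant to place queries with the same meaning close together regardless of language, the same nearest document can come back for all four queries.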


In [None]:
# Print the top document match for each query
for idx,query in enumerate(queries):
  print(f"Query language: {queries_lang[idx]}")
  print(f"Query: {query}")
  print(f"Most similar existing question: {answers[idx]}")
  print("-"*20,"\n")

Query language: English
Query: How to get in touch with Bill Gates
Most similar existing question: What is Bill Gates of Microsoft E-mail address ?
-------------------- 

Query language: French
Query: Comment entrer en contact avec Bill Gates
Most similar existing question: What is Bill Gates of Microsoft E-mail address ?
-------------------- 

Query language: Hindi
Query: बिल गेट्स से कैसे संपर्क करें
Most similar existing question: What is Bill Gates of Microsoft E-mail address ?
-------------------- 

Query language: Indonesian
Query: Cara menghubungi Bill Gates
Most similar existing question: What is Bill Gates of Microsoft E-mail address ?
-------------------- 



# Example 2 - Search-Based Question Answering

## Add an article and chunk it into smaller passages

In [None]:
# We'll use Steve Jobs' Stanford University commencement address as the example. Link: https://news.stanford.edu/2005/06/12/youve-got-find-love-jobs-says/

!wget 'https://docs.google.com/uc?export=download&id=1f1INWOfJrHTFmbyF_0be5b4u_moz3a4F' -O steve-jobs-commencement.txt


In [None]:
with open('steve-jobs-commencement.txt') as f:
    text = f.read()
text_splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=0)
texts = text_splitter.split_text(text)
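
To make the chunking step concrete, here is a simplified stand-in for the splitter: it breaks the text on blank lines and greedily packs the pieces into chunks of at most `chunk_size` characters. This is a sketch of the general idea under those assumptions, not Langchain's implementation, and `split_into_chunks` is a hypothetical name; the real splitter may differ in details.

```python
def split_into_chunks(text, chunk_size=300, separator="\n\n"):
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters.

    In this sketch a single piece longer than chunk_size is kept whole
    rather than being cut mid-sentence.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        piece = piece.strip()
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

sample = "Stay Hungry.\n\nStay Foolish.\n\n" + "x" * 40
for chunk in split_into_chunks(sample, chunk_size=30):
    print(len(chunk), repr(chunk))
```

With `chunk_overlap=0`, as in the cell above, no text is shared between neighbouring chunks.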



## Embed the passages and store them in an index

In [None]:
embeddings = CohereEmbeddings(model = "multilingual-22-12")
docsearch = Chroma.from_texts(texts, embeddings, metadatas=[{"source": f"{i}"} for i in range(len(texts))])


## Enter a question

In [None]:
questions = [
           "What did the author liken The Whole Earth Catalog to?",
           "What was Reed College great at?",
           "What was the author diagnosed with?",
           "What is the key lesson from this article?",
           "What did the article say about Michael Jackson?",
           ]

## Answer the question based on the most relevant documents


In [None]:

# Create our own prompt template
prompt_template = """Text: {context}

Question: {question}

Answer the question based on the text provided. If the text doesn't contain the answer, reply that the answer is not available."""

PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)
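
The declared `input_variables` have to match the placeholders in the template string, otherwise filling the template fails. A small standard-library sketch of that check, using a shortened copy of the template; `placeholders` is a hypothetical helper and plain `str.format` stands in for the template class here:

```python
from string import Formatter

def placeholders(template):
    """Return the set of named {fields} that appear in a format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}

qa_template = (
    "Text: {context}\n\n"
    "Question: {question}\n\n"
    "Answer the question based on the text provided."
)
declared = ["context", "question"]

# The declared variables and the placeholders in the string should agree
print(placeholders(qa_template) == set(declared))

# Filling the template with a toy passage and question
filled = qa_template.format(
    context="Reed College at that time offered perhaps the best calligraphy instruction in the country.",
    question="What was Reed College great at?",
)
print(filled)
```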

In [None]:

# Build a separate few-shot prompt template for rephrasing HTML strings
template = 'Rephrase the following HTML STRING so that it reads like a MINIMUM REQUIREMENT.'

# Hand-labeled examples
template += '\n\nHTML STRING: "<orq>Ability to data Lead</orq>"\n=========\n'
template += 'MINIMUM REQUIREMENT:\n'
template += "Skills as a Data Lead are required."
template += '\n\nHTML STRING: "<orq>Ability to data Lake</orq>"\n=========\n'
template += 'MINIMUM REQUIREMENT:\n'
template += "Experience working with a Data Lake is required."
template += '\n\nHTML STRING: "<orq>Ability to no travel.</orq>"\n=========\n'
template += 'MINIMUM REQUIREMENT:\n'
template += "Ability to not have to travel is required."
template += '\n\nHTML STRING: "<orq>Ability to oracle HCM</orq>"\n=========\n'
template += 'MINIMUM REQUIREMENT:\n'
template += "Experience working with Oracle HCM is required."
template += '\n\nHTML STRING: "<orq>Ability to no Travel.</orq>"\n=========\n'
template += 'MINIMUM REQUIREMENT:\n'
template += "Ability to not have to travel is required."

# Suggested and fed back in
# template += '\n\nHTML STRING: "<orq>Ability to lead</orq>"\n=========\n'
# template += 'MINIMUM REQUIREMENT:\n'
# template += "Leadership skills are required."
# template += '\n\nHTML STRING: "<orq>Ability to jIRA).</orq>"\n=========\n'
# template += 'MINIMUM REQUIREMENT:\n'
# template += "Proficiency in JIRA is necessary."
# template += '\n\nHTML STRING: "<orq>Ability to elysian.</orq>"\n=========\n'
# template += 'MINIMUM REQUIREMENT:\n'
# template += "Proficiency in Elysian is required."
# template += '\n\nHTML STRING: "<orq>Ability to pMO Lead</orq>"\n=========\n'
# template += 'MINIMUM REQUIREMENT:\n'
# template += "Leadership skills in project management are necessary."
# template += '\n\nHTML STRING: "<orq>Ability to pVC Lead</orq>"\n=========\n'
# template += 'MINIMUM REQUIREMENT:\n'
# template += "Leadership skills in venture capital are necessary."
template += '\n\nHTML STRING: "<orq>Ability to qA Director</orq>"\n=========\n'
template += 'MINIMUM REQUIREMENT:\n'
template += "Experience as a QA Director is required."
template += '\n\nHTML STRING: "<orq>Ability to tax reports</orq>"\n=========\n'
template += 'MINIMUM REQUIREMENT:\n'
template += "Knowledge of tax reports is required."
template += '\n\nHTML STRING: "<orq>Ability to pMO Analyst</orq>"\n=========\n'
template += 'MINIMUM REQUIREMENT:\n'
template += "Experience as a PMO Analyst is required."
template += '\n\nHTML STRING: "<orq>Ability to team Support</orq>"\n=========\n'
template += 'MINIMUM REQUIREMENT:\n'
template += "Ability to provide team support is required."
template += '\n\nHTML STRING: "<orq>Ability to data Synapse</orq>"\n=========\n'
template += 'MINIMUM REQUIREMENT:\n'
template += "Experience working with Data Synapse is required."
template += '\n\nHTML STRING: "<orq>Ability to data Support</orq>"\n=========\n'
template += 'MINIMUM REQUIREMENT:\n'
template += "Skills in data support are required."
template += '\n\nHTML STRING: "<orq>Ability to oCM Advisory</orq>"\n=========\n'
template += 'MINIMUM REQUIREMENT:\n'
template += "Experience in OCM Advisory is required."
template += '\n\nHTML STRING: "<orq>Ability to training Lead</orq>"\n=========\n'
template += 'MINIMUM REQUIREMENT:\n'
template += "Leadership in training is required."
template += '\n\nHTML STRING: "<orq>Ability to security Lead</orq>"\n=========\n'
template += 'MINIMUM REQUIREMENT:\n'
template += "Experience as a Security Lead is required."

template += "\n\nHTML STRING: {html_str}\n=========\nMINIMUM REQUIREMENT:"
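# Stored under its own name so it does not overwrite PROMPT, which the
# question-answering chain below needs (it expects "context" and "question").
REPHRASE_PROMPT = PromptTemplate(
    input_variables=['html_str'],
    template=template,
)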

In [None]:

# Generate the answer given the context
chain = load_qa_chain(Cohere(model="command-xlarge-nightly", temperature=0), chain_type="stuff", prompt=PROMPT)

# questions = ["What did the text say about Michael Jackson?"]
for question in questions:
    docs = docsearch.similarity_search(question)
    answer = chain.run(input_documents=docs, question=question)
    answer = answer.replace("\n","").replace("Answer:","")
    print("-"*20,"\n")
    print(f"Question: {question}")
    print(f"Answer: {answer}")


-------------------- 

Question: What did the author liken The Whole Earth Catalog to?
Answer: The Whole Earth Catalog was like Google in paperback form, 35 years before Google came along.



-------------------- 

Question: What was Reed College great at?
Answer: Calligraphy instruction



-------------------- 

Question: What was the author diagnosed with?
Answer:  cancer



-------------------- 

Question: What is the key lesson from this article?
Answer: Stay Hungry. Stay Foolish.



-------------------- 

Question: What did the article say about Michael Jackson?
Answer: The article didn't contain the answer to the question.
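
The loop above cleans each answer with `replace("\n","")`, which can fuse words that sit on different lines and leaves stray spaces after the "Answer:" label. A slightly more defensive variant could look like the sketch below; `clean_answer` is a hypothetical helper and the raw strings are made-up stand-ins for model output.

```python
def clean_answer(raw):
    """Drop a leading 'Answer:' label and collapse all whitespace runs into single spaces."""
    text = raw.strip()
    if text.startswith("Answer:"):
        text = text[len("Answer:"):]
    return " ".join(text.split())

raw_outputs = ["\nAnswer:  cancer", "Stay Hungry.\nStay Foolish."]
for raw in raw_outputs:
    print(repr(clean_answer(raw)))
```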


In [None]:
# Here is a complete prompt example to see the passages that go into the context
chain = load_qa_chain(Cohere(model="command-xlarge-nightly", temperature=0), chain_type="stuff", prompt=PROMPT)

question = questions[1] # Take one example, the second question in the list
docs = docsearch.similarity_search(question)
context = ""
for doc in docs:
  context += doc.page_content + "\n"

prompt_template = f"""Text: {context}

Question: {question}

Answer the question based on the text provided. If the text doesn't contain the answer, reply that the answer is not available."""

print(prompt_template)


Text: I am honored to be with you today at your commencement from one of the finest universities in the world. I never graduated from college. Truth be told, this is the closest I’ve ever gotten to a college graduation. Today I want to tell you three stories from my life. That’s it. No big deal. Just three stories.
Stay Hungry. Stay Foolish.

Thank you all very much.
Reed College at that time offered perhaps the best calligraphy instruction in the country. Throughout the campus every poster, every label on every drawer, was beautifully hand calligraphed. Because I had dropped out and didn’t have to take the normal classes, I decided to take a calligraphy class to learn how to do this. I learned about serif and sans serif typefaces, about varying the amount of space between different letter combinations, about what makes great typography great. It was beautiful, historical, artistically subtle in a way that science can’t capture, and I found it fascinating.
The first story is about conn