# Cohere Web Search with Langchain

This example shows how to use the Python [langchain](https://python.langchain.com/docs/get_started/introduction) library to run a text-generation request against [Cohere's](https://cohere.com/) API, then augment that request using the results from a Google web search.

**Requirements:**
- You will need a Cohere API key, which you can sign up for at https://dashboard.cohere.com/welcome/login. A free trial account will suffice, but it is limited to a small number of requests.
- After obtaining the key, store it in plain text in your home directory in the `~/.cohere.key` file.

## Set up the RAG workflow environment

In [1]:
from bs4 import BeautifulSoup
from getpass import getpass
from googlesearch import search
import os
from pathlib import Path
import requests

from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceBgeEmbeddings
from langchain.llms import Cohere
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

Set up some helper functions:

In [2]:
def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )

Make sure other necessary items are in place:

In [3]:
try:
    # read_text() opens and closes the key file for us
    os.environ["COHERE_API_KEY"] = (Path.home() / ".cohere.key").read_text().strip()
except OSError:
    print("ERROR: You must have a Cohere API key available in your home directory at ~/.cohere.key")
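The cell above only reports a missing key file. As one possible alternative, here is a minimal sketch of a loader that also checks the environment and can fall back to an interactive prompt (the notebook imports `getpass` but never uses it). The function name `load_api_key` and its parameters are assumptions made for illustration; they are not part of langchain or Cohere.

```python
import os
from getpass import getpass
from pathlib import Path


def load_api_key(key_file, env_var="COHERE_API_KEY", prompt_fn=getpass):
    """Return an API key from the environment, a key file, or a prompt.

    Hypothetical helper for illustration only.
    """
    # 1. Prefer a key that is already set in the environment.
    key = os.environ.get(env_var, "").strip()
    if key:
        return key

    # 2. Fall back to the plain-text key file, if it exists and is non-empty.
    key_path = Path(key_file)
    if key_path.is_file():
        key = key_path.read_text().strip()
        if key:
            return key

    # 3. Last resort: ask the user (getpass does not echo the input).
    return prompt_fn(f"Enter your {env_var}: ").strip()
```

In the notebook this could be used as `os.environ["COHERE_API_KEY"] = load_api_key(Path.home() / ".cohere.key")`.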

## Start with a basic generation request without RAG augmentation

Let's start by asking the Cohere LLM a question about recent events that it doesn't know about, something that happened after it finished training. At the time I'm writing this notebook in January 2024, Cohere doesn't know who won the last World Series of baseball.

**The correct answer: the Texas Rangers won, in November 2023.**

"*Who won the 2023 World Series of baseball?*"

In [4]:
query = "Who won the 2023 World Series of baseball?"

## Now send the query to Cohere

In [5]:
llm = Cohere()
result = llm(query)
print(f"Result: \n\n{result}")

  warn_deprecated(
unknown field: parameter model is not a valid field


Result: 

 The Houston Astros won the 2023 World Series of baseball, defeating the Philadelphia Phillies in six games. This was the second time the Astros have won the World Series, having previously won in 2017. 


At best, the Cohere LLM admits that it doesn't know. At worst, it states something false and says the Houston Astros won (they won the year before, in 2022).

Let's see how we can use RAG to augment our question with a Google web search and get the correct answer.

## Ingestion: Do a Google web search with the question

Parse through all the websites returned by a Google search, break them up into smaller digestible chunks, then encode them as vector embeddings.

In [6]:
# Do a Google web search and parse the results into a big text string
result_text = ""
for result_url in search(query, tld="com", num=10, stop=10, pause=2):
    response = requests.get(result_url, timeout=10)  # avoid hanging on a slow site
    soup = BeautifulSoup(response.content, 'html.parser')
    result_text = result_text + soup.get_text()

# Split the result text into smaller chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
chunks = text_splitter.split_text(result_text)
print(f"Number of text chunks: {len(chunks)}\n")

# Define Embeddings Model
model_name = "BAAI/bge-small-en-v1.5"
encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity

print("Setting up the embeddings model...\n")
embeddings = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs={'device': 'cuda'},  # use 'cpu' if no GPU is available
    encode_kwargs=encode_kwargs
)

print("Done")

Number of text chunks: 503

Setting up the embeddings model...

Done
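To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of overlapping chunking. It is not how `RecursiveCharacterTextSplitter` works internally (that class also tries to split on separators such as paragraph and line breaks); it only shows a plain sliding window over characters, and `chunk_text` is a hypothetical name.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=150):
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only; it ignores natural boundaries
    such as paragraphs and sentences.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap means that a sentence cut off at the end of one chunk appears again at the start of the next, so a fact that straddles a boundary can still be retrieved whole.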


## Retrieval: Make the document chunks available via a retriever

The retriever will identify the document chunks that most closely match our original query. (This takes about 1-2 minutes.)

In [7]:
vectorstore = FAISS.from_texts(chunks, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 20})

# Retrieve the most relevant context from the vector store based on the query (no reranking applied)
docs = retriever.get_relevant_documents(query)
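Conceptually, the retriever embeds the query and returns the `k` chunks whose embeddings are closest to it. The sketch below shows that idea with plain Python lists and cosine similarity. It is an illustration under simplifying assumptions, not FAISS's actual index or search code, and the names `normalize` and `top_k` are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query_vec, doc_vecs, k=20):
    """Return (index, score) pairs for the k most similar document vectors.

    With unit-length vectors, the dot product equals cosine similarity.
    """
    q = normalize(query_vec)
    scored = []
    for index, doc_vec in enumerate(doc_vecs):
        d = normalize(doc_vec)
        score = sum(a * b for a, b in zip(q, d))
        scored.append((index, score))
    # Highest similarity first, then keep only the first k entries.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

A real vector store does the same ranking over hundreds of chunks with optimized index structures instead of a Python loop.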

Let's see what it found. Note that these results are listed in the order the retriever ranked them, best match first.

In [8]:
pretty_print_docs(docs)

Document 1:

The 2023 World Series was the championship series of Major League Baseball's (MLB) 2023 season, and the 119th edition of the World Series. It was a best-of-seven playoff played between the American League (AL) champion Texas Rangers and the National League (NL) champion Arizona Diamondbacks. The series began on October 27 and ended on November 1 with Texas winning in five games. The Rangers won their first World Series title since their founding in 1961. This marked the first time since 1989 in which consecutive championships were won by different teams from the same state.
----------------------------------------------------------------------------------------------------
Document 2:

The 2023 World Series was the championship series of Major League Baseball's (MLB) 2023 season, and the 119th edition of the World Series. It was a best-of-seven playoff played between the American League (AL) champion Texas Rangers and the National League (NL) champion Arizona Diamondbacks.

These results clearly indicate the correct answer. Let's try sending the query again with these results.

In [9]:
print(f"Sending the RAG generation with query: {query}")
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
)
print(f"Result:\n\n{qa.run(query=query)}")

Sending the RAG generation with query: Who won the 2023 World Series of baseball?


  warn_deprecated(
unknown field: parameter model is not a valid field


Result:

 The Texas Rangers won the 2023 World Series of baseball.  It was played between the American League (AL) champion Texas Rangers and the National League (NL) champion Arizona Diamondbacks. The series began on October 27 and ended on November 1 with Texas winning in five games. This marked the first time since 1989 in which consecutive championships were won by different teams from the same state. 
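The `"stuff"` chain type places all of the retrieved chunks into a single prompt together with the question. The sketch below illustrates that idea only; the prompt wording and the function name `build_stuff_prompt` are assumptions and do not reproduce langchain's actual prompt template.

```python
def build_stuff_prompt(question, docs, separator="\n\n"):
    """Combine retrieved text chunks and a question into one prompt string.

    Hypothetical template for illustration; the real chain uses its own.
    """
    # "Stuffing" simply concatenates every retrieved chunk as context.
    context = separator.join(docs)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Because every chunk goes into one prompt, this approach only works while the combined text fits within the model's context window, which is one reason to keep `k` and `chunk_size` modest.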


## Reranking: Improve the ordering of the document chunks

Strictly speaking, this step isn't necessary here, because the previous query already returned the correct answer. Let's step through it anyway.

In [10]:
compressor = CohereRerank()
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)
compressed_docs = compression_retriever.get_relevant_documents(query)
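The general pattern here is retrieve-then-rerank: take the candidates from the first-stage retriever, score each one against the query with a second scorer, and keep only the best few. The sketch below shows that pattern with a deliberately crude word-overlap scorer standing in for Cohere's rerank model. Both `overlap_score` and `rerank` are hypothetical names, and the scorer is an assumption chosen only to keep the example self-contained.

```python
def overlap_score(query, doc):
    """Fraction of distinct query words that also appear in the document.

    A crude stand-in for a learned relevance model.
    """
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & doc_words) / len(query_words)


def rerank(query, docs, score_fn=overlap_score, top_n=3):
    """Re-order candidate documents by score_fn and keep the top_n."""
    ranked = sorted(docs, key=lambda doc: score_fn(query, doc), reverse=True)
    return ranked[:top_n]
```

The second stage can afford a slower, more accurate scorer because it only sees the handful of candidates the retriever returned, not the whole collection.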

Now let's see what the reranked results look like:

In [11]:
pretty_print_docs(compressed_docs)

Document 1:

The 2023 World Series was the championship series of Major League Baseball's (MLB) 2023 season, and the 119th edition of the World Series. It was a best-of-seven playoff played between the American League (AL) champion Texas Rangers and the National League (NL) champion Arizona Diamondbacks. The series began on October 27 and ended on November 1 with Texas winning in five games. The Rangers won their first World Series title since their founding in 1961. This marked the first time since 1989 in which consecutive championships were won by different teams from the same state.
----------------------------------------------------------------------------------------------------
Document 2:

The 2023 World Series was the championship series of Major League Baseball's (MLB) 2023 season, and the 119th edition of the World Series. It was a best-of-seven playoff played between the American League (AL) champion Texas Rangers and the National League (NL) champion Arizona Diamondbacks.

Lastly, let's run our LLM query a final time with the reranked results:

In [12]:
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=compression_retriever,
)

print(f"Result:\n\n{qa.run(query=query)}")

unknown field: parameter model is not a valid field


Result:

  The Texas Rangers won the 2023 World Series of baseball, winning the best-of-seven series in five games against the Arizona Diamondbacks. 
