# Web Search with LangChain

This example shows how to use the Python [LangChain](https://python.langchain.com/docs/get_started/introduction) library to send a text-generation request to open-source LLMs and embedding models through the OpenAI SDK, then augment that request with results from a Google web search.


## Set up the RAG workflow environment

In [6]:
%%capture
!git clone https://github.com/VectorInstitute/rag_bootcamp /tmp/rag_bootcamp
!cp -ar /tmp/rag_bootcamp/document_search/* ./

# Install langchain and LlamaIndex packages, filtering out version numbers
!grep -Eo '^(langchain|llama)[^=]*' \
    /tmp/rag_bootcamp/envs/rag_dataloaders/requirements.txt > \
    /tmp/requirements.txt
!pip3 install -r /tmp/requirements.txt
!pip3 install faiss-cpu
!pip3 install googlesearch-python

In [None]:
import os

# TODO: fill in the two following lines
os.environ["OPENAI_API_KEY"] = "..."
os.environ["OPENAI_BASE_URL"] = "https://.../v1"

#### Import libraries

In [8]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import os
import requests

from bs4 import BeautifulSoup
from googlesearch import search

from langchain.chains import RetrievalQA
from langchain_community.vectorstores import FAISS
from langchain.docstore.document import Document
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

#### Load config files

In [10]:
GENERATOR_BASE_URL = os.environ["OPENAI_BASE_URL"]
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

#### Set up some helper functions

In [11]:
def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )

#### Choose LLM and embedding model

In [12]:
GENERATOR_MODEL_NAME = "Meta-Llama-3.1-8B-Instruct"

## Select one of the two options:
## - "all-MiniLM-L6-v2" (22M parameters)
## - "bge-base-en-v1.5" (110M parameters)

# EMBEDDING_MODEL_NAME = "bge-base-en-v1.5"
EMBEDDING_MODEL_NAME = "all-MiniLM-L6-v2"

## Start with a basic generation request without RAG augmentation

Let's start by asking Llama-3.1 a question about a recent event that happened after its training data was collected. At the time of writing this notebook (November 2024), Llama-3.1 doesn't know who won the most recent World Series of baseball.

*Who won the 2024 World Series of baseball?*

**The correct answer: the Los Angeles Dodgers won the 2024 World Series in October 2024.**

In [13]:
query = "Who won the 2024 World Series of baseball?"

## Now send the query to the open-source model using KScope

In [14]:
llm = ChatOpenAI(
    model=GENERATOR_MODEL_NAME,
    temperature=0,
    base_url=GENERATOR_BASE_URL,
    api_key=OPENAI_API_KEY
)
message = [
    ("human", query),
]
try:
    result = llm.invoke(message)
    print(f"Result: \n\n{result.content}")
except Exception as err:
    # Python 3 exceptions have no `.message` attribute, so inspect the string form
    if "Error code: 503" in str(err):
        print(f"The model {GENERATOR_MODEL_NAME} is not ready yet.")
    else:
        raise

Result: 

I don't have the ability to predict the future or know the outcome of future events, including the 2024 World Series. The 2024 World Series has not yet occurred, and I don't have any information about it. I can provide information about past World Series winners, though. Would you like to know more about that?


Llama-3.1 admits that it doesn't know the answer, since according to the model it's a future event.

Let's see how we can use RAG to augment our question with a Google web search and get the correct answer.

## Ingestion: Do a Google web search for the query and obtain the necessary information

Fetch the pages returned by a Google search (the first 10 results), split their text into smaller chunks, then encode the chunks as vector embeddings.

In [17]:
# Do a Google web search and collect the text of each result page
web_documents = []
for result_url in list(search(query))[:10]:
    try:
        # A timeout keeps one slow site from stalling the whole loop
        response = requests.get(result_url, timeout=10)
        soup = BeautifulSoup(response.content, 'html.parser')
    except requests.RequestException:
        # Skip this result on connection errors or timeouts
        continue

    web_documents.append(soup.get_text())

# Wrap text as Document object
docs = [Document(page_content=web_txt, metadata={"source": "web"}) for web_txt in web_documents]
print(f"Number of source documents: {len(docs)}")

# Split the result text into smaller chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=32)
chunks = text_splitter.split_documents(docs)
print(f"Number of text chunks: {len(chunks)}\n")

Number of source documents: 10
Number of text chunks: 547
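
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a minimal sketch of overlapping fixed-size windows in plain Python. It is a simplified stand-in, not the algorithm `RecursiveCharacterTextSplitter` uses: the real splitter prefers to break on separators such as paragraph and line boundaries, and by default it measures size in characters rather than tokens. The helper name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 32) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each new window starts (chunk_size - chunk_overlap) characters after the previous one
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_text(sample, chunk_size=10, chunk_overlap=2)
print(pieces)
```

The last two characters of each window reappear at the start of the next one, so a sentence that straddles a boundary is less likely to be cut off from its context.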



#### Define the embeddings model

In [None]:
print(f"Setting up the embeddings model {EMBEDDING_MODEL_NAME} at {GENERATOR_BASE_URL}")
embeddings = OpenAIEmbeddings(
    model=EMBEDDING_MODEL_NAME,
    # Leverage the RoBERTa tokenizer to make sure that
    # the chunks stay within the 512-token context window.
    tiktoken_model_name="roberta-base",
    tiktoken_enabled=False
)

## Retrieval: Make the document chunks available via a retriever

The retriever will identify the document chunks that most closely match our original query.

Depending on the number of chunks provided, generating the embeddings might require:
- about 5 minutes, when using "bge-base-en-v1.5" (110M parameters);
- about 1 minute, when using "all-MiniLM-L6-v2" (22M parameters).
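
As a rough illustration of what "most closely match" means, the sketch below ranks a few toy vectors against a query vector by cosine similarity and keeps the top `k`. The vectors and the helper names (`cosine_similarity`, `top_k`) are made up for illustration. The FAISS index in this notebook may rank by a different metric (such as Euclidean distance), so treat this as a conceptual sketch rather than what the vector store does internally.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Return the indices of the k chunk vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]


# Toy 3-dimensional "embeddings"; real models produce hundreds of dimensions
toy_chunks = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [0.8, 0.2, 0.0]
best = top_k(toy_query, toy_chunks, k=2)
print(best)
```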

In [19]:
vectorstore = FAISS.from_documents(chunks, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Retrieve the most relevant context from the vector store based on the query
retrieved_docs = retriever.invoke(query)

Let's see what it found. Note that the results are listed in the retriever's ranking order, best match first.

In [20]:
pretty_print_docs(retrieved_docs)

Document 1:

2024 World Series - Los Angeles Dodgers over New York Yankees (4-1) | Baseball-Reference.com

Sports Reference ®
Baseball
Football (college)
Basketball (college)
Hockey
Fútbol
Blog
Stathead ®
Immaculate Grid ®
Questions or Comments?
Welcome  · Your Account
Logout
Ad-Free Login
Create Account


MENU

Players
Teams
Seasons
Leaders
Scores
Playoffs
Stathead
Newsletter
Full Site Menu Below
----------------------------------------------------------------------------------------------------
Document 2:

Game 2[edit]
Yoshinobu Yamamoto gave up only one hit through six innings in his World Series debut.

October 26, 2024 5:15 pm (PDT) at Dodger Stadium in Los Angeles, California 77 °F (25 °C), Partly Cloudy


Team
1
2
3
4
5
6
7
8
9
R
H
E


New York
0
0
1
0
0
0
0
0
1
2
4
0


Los Angeles
0
1
3
0
0
0
0
0
X
4
8
0


WP: Yoshinobu Yamamoto (1–0)   LP: Carlos Rodón (0–1)   Sv: Alex Vesia (1)Home 
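
Raw `soup.get_text()` output keeps the page's navigation text and long runs of whitespace, as the first retrieved chunk shows. One possible mitigation, sketched below under the assumption that layout whitespace carries no meaning for this task, is to normalize whitespace before chunking so that each 512-character chunk holds more actual content. The helper `normalize_whitespace` is hypothetical and not part of the original notebook.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs and drop empty lines from scraped page text."""
    # Replace runs of spaces, tabs, and non-breaking spaces with a single space
    text = re.sub(r"[ \t\xa0]+", " ", text)
    # Strip each line and keep only the non-empty ones
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


raw = "Title\n\n\n\n   Menu  \n\t\nPlayers\n"
cleaned = normalize_whitespace(raw)
print(cleaned)
```

In the ingestion cell this could be applied as `normalize_whitespace(soup.get_text())` before the text is wrapped in a `Document`. Doing so would change the chunk counts reported above.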

## Now send the query to the RAG pipeline

In [21]:
rag_pipeline = RetrievalQA.from_llm(llm=llm, retriever=retriever)
result = rag_pipeline.invoke(input=query)
print(f"Result: \n\n{result['result']}")

Result: 

The Los Angeles Dodgers won the 2024 World Series, defeating the New York Yankees 4 games to 1.


The model provides the correct answer based on the information from the web.
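
Conceptually, the RAG pipeline retrieves the best-matching chunks and places them in the prompt as context ahead of the question. The sketch below shows that idea in plain Python. The template wording and the helper `build_rag_prompt` are assumptions for illustration only; the prompt that LangChain actually sends is worded differently.

```python
def build_rag_prompt(question: str, context_chunks: list[str]) -> str:
    """Combine retrieved chunks and the user question into a single prompt string."""
    context = "\n\n".join(context_chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say that you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_rag_prompt(
    "Who won the 2024 World Series of baseball?",
    ["2024 World Series - Los Angeles Dodgers over New York Yankees (4-1)"],
)
print(prompt)
```

Because the answer now sits in the context, the model can read it from the prompt instead of relying on what it learned during training.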