# Exploring The Role of LangChain's Indexes and Retrievers
## Introduction

In LangChain, indexes and retrievers play a crucial role in structuring documents and fetching relevant data for LLMs. We will explore some of the advantages and disadvantages of document-based LLMs (i.e., LLMs that include relevant pieces of documents in their prompts), with a particular focus on the role of indexes and retrievers.

An `index` is a data structure that organizes and stores documents to enable efficient searching, while a `retriever` uses the index to locate and return relevant documents in response to user queries. Within LangChain, the primary index types are built on vector databases, with embeddings-based indexes being the most common.

Retrievers focus on extracting relevant documents to merge with prompts for language models. A retriever exposes a `get_relevant_documents` method, which accepts a query string as input and returns a list of related documents.
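
To make the division of labor concrete, here is a minimal sketch in plain Python. It is not LangChain's implementation: `ToyIndex` and `ToyRetriever` are hypothetical names, and the index is a simple word-based inverted index rather than an embeddings-based one. Only the `get_relevant_documents` method name is borrowed from the text above.

```python
from collections import defaultdict


def normalize(word):
    """Lowercase a word and strip common punctuation."""
    return word.lower().strip(".,?!")


class ToyIndex:
    """Hypothetical inverted index: maps each word to the ids of documents containing it."""

    def __init__(self, documents):
        self.documents = list(documents)
        self.postings = defaultdict(set)
        for doc_id, text in enumerate(self.documents):
            for word in text.split():
                self.postings[normalize(word)].add(doc_id)


class ToyRetriever:
    """Hypothetical retriever that looks up query words in the index."""

    def __init__(self, index, k=2):
        self.index = index
        self.k = k

    def get_relevant_documents(self, query):
        # Score each document by how many distinct query words it contains.
        scores = defaultdict(int)
        for word in {normalize(w) for w in query.split()}:
            for doc_id in self.index.postings.get(word, ()):
                scores[doc_id] += 1
        # Highest score first; ties are broken by document order.
        ranked = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
        return [self.index.documents[doc_id] for doc_id in ranked[: self.k]]


toy_index = ToyIndex([
    "Google is launching an API for PaLM.",
    "Meta released the LLaMA family of models.",
    "Deep Lake stores embeddings in the cloud.",
])
toy_retriever = ToyRetriever(toy_index, k=1)
print(toy_retriever.get_relevant_documents("Which API is Google launching?"))
```

The index is built once when documents are stored, while the retriever does the per-query work. The LangChain components used in the rest of this lesson follow the same split.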

Here we use the TextLoader class to load a text file. Remember to install the required packages with the following command:

In [1]:
%pip install -qU langchain-text-splitters
%pip install -qU langchain-openai
%pip install -qU langchain-community
%pip install -q deeplake==3.9.27 tiktoken


In [2]:
from langchain_community.document_loaders import TextLoader

# text to write to a local file
# taken from https://www.theverge.com/2023/3/14/23639313/google-ai-language-model-palm-api-challenge-openai
text = """Google opens up its AI language model PaLM to challenge OpenAI and GPT-3
Google is offering developers access to one of its most advanced AI language models: PaLM.
The search giant is launching an API for PaLM alongside a number of AI enterprise tools
it says will help businesses “generate text, images, code, videos, audio, and more from
simple natural language prompts.”

PaLM is a large language model, or LLM, similar to the GPT series created by OpenAI or
Meta’s LLaMA family of models. Google first announced PaLM in April 2022. Like other LLMs,
PaLM is a flexible system that can potentially carry out all sorts of text generation and
editing tasks. You could train PaLM to be a conversational chatbot like ChatGPT, for
example, or you could use it for tasks like summarizing text or even writing code.
(It’s similar to features Google also announced today for its Workspace apps like Google
Docs and Gmail.)
"""

# write text to local file
with open("my_file.txt", "w") as file:
    file.write(text)

# use TextLoader to load text from local file
loader = TextLoader("my_file.txt")
docs_from_file = loader.load()

print(len(docs_from_file))
# 1

1


Then, we use `CharacterTextSplitter` to split the loaded documents into smaller chunks.

In [4]:
from langchain_text_splitters import CharacterTextSplitter

# create a text splitter
text_splitter = CharacterTextSplitter(chunk_size=200, chunk_overlap=20)

# split documents into chunks
docs = text_splitter.split_documents(docs_from_file)

print(len(docs))
# 2



2
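
The sketch below illustrates what `chunk_size` and `chunk_overlap` mean, using a simplified fixed-window splitter. It is an illustration only, and `split_with_overlap` is a hypothetical helper. `CharacterTextSplitter` works differently: it first splits on a separator (a blank line by default) and then merges the pieces up to `chunk_size`, which appears to be why the cell above yields two chunks, one per paragraph.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Simplified fixed-window splitter (not LangChain's algorithm)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one. The overlap reduces the chance that a sentence cut at a chunk boundary loses its context.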


Next, we set up an embedding model. Embeddings let us search for documents, or portions of documents, that relate to our query by comparing their semantic similarity.

In [5]:
from langchain_openai import OpenAIEmbeddings
import os
from google.colab import userdata

# Get the API key from Colab's userdata
openai_api_key = userdata.get('OPENAI_API_KEY')

# Set it as an environment variable
os.environ["OPENAI_API_KEY"] = openai_api_key

activeloop_token = userdata.get('ACTIVELOOP_TOKEN')

# Set it as an environment variable
os.environ["ACTIVELOOP_TOKEN"] = activeloop_token

activeloop_org_id = userdata.get('ACTIVELOOP_ORG_ID')

os.environ["ACTIVELOOP_ORG_ID"] = activeloop_org_id

# Before executing the following code, make sure to have
# your OpenAI key saved in the “OPENAI_API_KEY” environment variable.
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
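
As a rough illustration of semantic similarity search, the sketch below ranks two texts against a query using cosine similarity, one common similarity measure. The three-dimensional vectors are made up for the example; a real embedding model returns vectors with far more dimensions, and the vector store may use a different distance metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "embeddings" for two texts (assumed values, not model output).
toy_vectors = {
    "Google launches PaLM API": [0.9, 0.1, 0.0],
    "Recipe for apple pie": [0.0, 0.2, 0.9],
}
# Pretend embedding of a question about Google.
query_vector = [0.8, 0.2, 0.1]

# Sort the texts from most to least similar to the query.
ranked = sorted(
    toy_vectors,
    key=lambda text: cosine_similarity(query_vector, toy_vectors[text]),
    reverse=True,
)
print(ranked[0])
```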



Let’s create an instance of a Deep Lake dataset.

In [None]:
from langchain_community.vectorstores import DeepLake

# Before executing the following code, make sure to have your
# Activeloop key saved in the “ACTIVELOOP_TOKEN” environment variable.

# create Deep Lake dataset
# TODO: use your organization id here. (by default, org id is your username)
my_activeloop_org_id = activeloop_org_id
my_activeloop_dataset_name = "langchain_course_indexers_retrievers"
dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"
db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings)

# add documents to our Deep Lake dataset
db.add_documents(docs)

In this example, we are adding text documents to the dataset. However, because Deep Lake is multimodal, we could also have added images by specifying an image embedding model. This could be useful for searching images with a text query or for using an image as the query (and thus looking for similar images).

As datasets grow, storing them in local memory becomes less manageable. In this example, we could also have used a local vector store, since we are uploading only two documents. In a typical production scenario, however, thousands or millions of documents may be stored and accessed from different programs, which calls for a centralized cloud dataset.

Returning to the code example, we next create a retriever from the vector store.

In [7]:
# create retriever from db
retriever = db.as_retriever()

Once we have the retriever, we can start with question-answering.

In [None]:
from langchain.chains import RetrievalQA
from langchain_openai import OpenAI

# create a retrieval chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(model="gpt-3.5-turbo-instruct"),
    chain_type="stuff",
    retriever=retriever
)
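
The `"stuff"` chain type places all retrieved documents into a single prompt. The sketch below shows the idea in plain Python; `build_stuff_prompt` and its template wording are hypothetical and are not the prompt LangChain actually uses.

```python
def build_stuff_prompt(question, documents):
    """Concatenate every retrieved document into one prompt (hypothetical template)."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Example chunks standing in for what a retriever might return.
retrieved = [
    "Google is launching an API for PaLM.",
    "Google first announced PaLM in April 2022.",
]
prompt = build_stuff_prompt("How does Google plan to challenge OpenAI?", retrieved)
print(prompt)
```

Because every retrieved chunk ends up in the prompt, the quality of what the retriever returns directly affects what the LLM sees.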

We can now ask a question about a specific topic covered in the documents.

In [9]:
query = "How Google plans to challenge OpenAI?"
response = qa_chain.run(query)
print(response)




Google plans to challenge OpenAI by opening up its AI language model PaLM to developers and launching an API for PaLM alongside other AI enterprise tools.


### A Potential Problem
This method has a downside: when you store the data, you do not yet know which queries will be asked, so you cannot tailor the chunks to them. In the Q&A example, we split the text into chunks without regard to any particular question, so a retrieved chunk can contain both useful and irrelevant text when a user asks a question.

Including unrelated information in the LLM prompt is detrimental because:

- It can divert the LLM's focus from pertinent details.
- It occupies valuable space that could be utilized for more relevant information.

### Possible Solution
LangChain introduced a `DocumentCompressor` abstraction to address this issue. It lets you run `compress_documents` on the retrieved documents.

The `ContextualCompressionRetriever` is a wrapper around another retriever in LangChain. It takes a base retriever and a `DocumentCompressor` and automatically compresses the retrieved documents from the base retriever. This means that only the most relevant parts of the retrieved documents are returned, given a specific query.

A popular compressor choice is the `LLMChainExtractor`, which uses an LLMChain to extract only the statements relevant to the query from each document. Wrapping the base retriever with an `LLMChainExtractor` inside a `ContextualCompressionRetriever` means each initially returned document is passed through the extractor, and only the content relevant to the query is kept.
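
The wrapper pattern can be sketched without any LLM. In the example below, `KeywordCompressor` is a crude, hypothetical stand-in for an LLM-based extractor: it keeps only sentences that share a word with the query. `ToyCompressionRetriever` and `base_retriever` are likewise assumptions made for illustration, not LangChain classes.

```python
def words(text):
    """Lowercased words of a text with common punctuation stripped."""
    return {w.strip(".,?!").lower() for w in text.split()}


class KeywordCompressor:
    """Hypothetical compressor: keeps sentences sharing at least one word with the query."""

    def compress_documents(self, documents, query):
        query_words = words(query)
        compressed = []
        for doc in documents:
            sentences = [s.strip() for s in doc.split(".") if s.strip()]
            kept = [s for s in sentences if query_words & words(s)]
            # Documents with no relevant sentence are dropped entirely.
            if kept:
                compressed.append(". ".join(kept) + ".")
        return compressed


class ToyCompressionRetriever:
    """Hypothetical wrapper: retrieve first, then compress the results."""

    def __init__(self, base_compressor, base_retriever):
        self.base_compressor = base_compressor
        self.base_retriever = base_retriever

    def get_relevant_documents(self, query):
        docs = self.base_retriever(query)
        return self.base_compressor.compress_documents(docs, query)


def base_retriever(query):
    # Hypothetical base retriever that always returns the same two chunks.
    return [
        "Google is launching an API for PaLM. The weather was sunny that day.",
        "Lunch was served at noon.",
    ]


toy = ToyCompressionRetriever(KeywordCompressor(), base_retriever)
print(toy.get_relevant_documents("PaLM API"))
```

The irrelevant sentence and the irrelevant chunk are both removed before anything reaches the prompt. An LLM-based extractor plays the same role, judging relevance by meaning rather than by word overlap.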

Here's an example of how to use `ContextualCompressionRetriever` with `LLMChainExtractor`:

In [10]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# create the LLM wrapper (OpenAI was imported in the earlier cell)
llm = OpenAI(model="gpt-3.5-turbo-instruct", temperature=0)

# create compressor for the retriever
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever
)

Once we have created the `compression_retriever`, we can use it to retrieve the compressed documents relevant to a query.

In [11]:
retrieved_docs = compression_retriever.get_relevant_documents(
    "How Google plans to challenge OpenAI?"
)
print(retrieved_docs[0].page_content)




Google opens up its AI language model PaLM to challenge OpenAI and GPT-3
Google is offering developers access to one of its most advanced AI language models: PaLM.
The search giant is launching an API for PaLM alongside a number of AI enterprise tools
