# LangChain’s Indexes and Retrievers

As seen earlier, an index in LangChain is a ***data structure that organizes and stores data to facilitate quick and efficient searches***. A retriever effectively uses this index to find and provide relevant data in response to specific queries. LangChain’s **indexes** and **retrievers** provide modular, adaptable, and customizable options for ***handling unstructured data with LLMs***. The primary index types in LangChain are based on **vector databases**, mainly emphasizing indexes using **embeddings**.

The role of retrievers is ***to extract relevant documents for integration into language model prompts***. In LangChain, a retriever employs a `get_relevant_documents` method, taking a query string as input and generating a list of documents that are relevant to that query.
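The retriever contract described above (a query string goes in, a list of documents comes out) can be sketched without any library. The `Doc` and `KeywordRetriever` classes below are hypothetical stand-ins, and the word-overlap scoring is a deliberately crude substitute for the embedding-based search that LangChain retrievers normally use.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class KeywordRetriever:
    """Toy retriever: ranks documents by how many query words they share."""

    def __init__(self, docs, k=2):
        self.docs = docs
        self.k = k

    def get_relevant_documents(self, query):
        query_words = set(query.lower().split())
        scored = []
        for doc in self.docs:
            overlap = len(query_words & set(doc.page_content.lower().split()))
            if overlap > 0:
                scored.append((overlap, doc))
        # highest overlap first; list.sort is stable, so ties keep input order
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in scored[: self.k]]


toy_docs = [
    Doc("PaLM is a large language model from Google"),
    Doc("Deep Lake is a vector database"),
    Doc("Google is launching an API for PaLM"),
]
toy_retriever = KeywordRetriever(toy_docs, k=2)
results = toy_retriever.get_relevant_documents("Google PaLM API")
print([d.page_content for d in results])
```

The document about the API shares three words with the query and ranks first; the vector database document shares none and is dropped.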

Let’s see how they work with a practical application:

In [None]:
from langchain.document_loaders import TextLoader

# text to write to a local file
# taken from https://www.theverge.com/2023/3/14/23639313/google-ai-language-model-palm-api-challenge-openai
text =""" Google opens up its AI language model PaLM to challenge OpenAI and GPT-3 Google offers developers access to one of its most advanced AI language models: PaLM. The search giant is launching an API for PaLM alongside a number of AI enterprise tools it says will help businesses "generate text, images, code, videos, audio, and more from simple natural language prompts."

PaLM is a large language model, or LLM, similar to the GPT series created by OpenAI or Meta's Llama family of models. Google first announced PaLM in April 2022. Like other LLMs, PaLM is a flexible system that can potentially carry out all sorts of text generation and editing tasks. You could train PaLM to be a conversational chatbot like ChatGPT, for example, or you could use it for tasks like summarizing text or even writing code. (It's similar to features Google also announced today for its Workspace apps like Google Docs and Gmail.)
"""

# write text to local file
with open("my_file.txt", "w") as file:
    file.write(text)

# use TextLoader to load text from local file
loader = TextLoader("my_file.txt")
docs_from_file = loader.load()

print(len(docs_from_file))
# 1

Use `CharacterTextSplitter` to split the documents into text snippets called “chunks.” `chunk_overlap` is the number of characters that overlap between two consecutive chunks. It preserves context and improves coherence by ensuring that important information is not cut off at the boundaries of chunks.

In [None]:
from langchain.text_splitter import CharacterTextSplitter

# create a text splitter
text_splitter = CharacterTextSplitter(chunk_size=200, chunk_overlap=20)

# split documents into chunks
docs = text_splitter.split_documents(docs_from_file)

print(len(docs))
# 2
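To make `chunk_overlap` concrete, here is a minimal sketch of fixed-size windowing with overlap. It is a simplification: `CharacterTextSplitter` first splits on a separator and then merges the pieces up to `chunk_size`, whereas the hypothetical `split_with_overlap` function below cuts purely by character position.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, so a phrase that straddles a boundary still appears whole in at least one chunk.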

Create a **vector embedding** for each text snippet. These embeddings allow us to effectively search for documents or portions of documents that relate to our query by examining their semantic similarities.

Here, we chose ***OpenAI’s embedding*** model to create the embeddings.

In [None]:
from langchain.embeddings import OpenAIEmbeddings

# Before executing the following code, make sure to have
# your OpenAI key saved in the "OPENAI_API_KEY" environment variable.
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
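Semantic similarity between embeddings is commonly measured with cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; real embeddings have far more dimensions, and the exact metric a given vector store uses may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# made-up 3-dimensional "embeddings"; real ones have many more dimensions
query_vec = [1.0, 0.0, 1.0]
close_vec = [2.0, 0.0, 2.0]   # same direction as the query
far_vec = [0.0, 3.0, 0.0]     # orthogonal to the query

print(cosine_similarity(query_vec, close_vec))
print(cosine_similarity(query_vec, far_vec))
```

Vectors pointing in the same direction score close to 1 regardless of their length, while orthogonal vectors score 0.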

We next need a vector store to hold those embeddings. A **vector store** is a system that stores embeddings and lets us query them. In this example, we will use **Deep Lake**, a cloud-based vector database, but others like [Chroma DB](https://www.trychroma.com/) would do.

Let’s create a **Deep Lake** dataset and pass it the embedding model through the `embedding_function` argument, so that documents are embedded as they are added.

You will need a free Activeloop account to follow along:

In [None]:
import os
from langchain_custom_utils.helper import get_openai_api_key, get_activeloop_api_key 
OPENAI_API_KEY = get_openai_api_key()
DEEPLAKE_API_KEY = get_activeloop_api_key()

In [None]:
from langchain.vectorstores import DeepLake

# Before executing the following code, make sure to have your
# Activeloop key saved in the "ACTIVELOOP_TOKEN" environment variable.

# create Deep Lake dataset
# TODO: use your organization id here. (by default, org id is your username)
# Note: this must be the org id, not the API key; the key is read from the environment.
my_activeloop_org_id = "<YOUR-ACTIVELOOP-ORG-ID>"
my_activeloop_dataset_name = "langchain_course_indexers_retrievers"
dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"
db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings)

# add documents to our Deep Lake dataset
db.add_documents(docs)
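Conceptually, a vector store keeps each text next to its embedding and answers queries by comparing vectors. The sketch below is a minimal in-memory version under loud assumptions: `TinyVectorStore` and `toy_embed` are hypothetical, the "embedding" is just word counts over a fixed vocabulary, and none of this reflects how Deep Lake is implemented.

```python
import math
from collections import Counter


def toy_embed(text, vocabulary):
    """Hypothetical embedding: word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


class TinyVectorStore:
    """Keeps (vector, text) pairs in a list and ranks them by cosine similarity."""

    def __init__(self, embed):
        self.embed = embed
        self.rows = []

    def add_texts(self, texts):
        for text in texts:
            self.rows.append((self.embed(text), text))

    def similarity_search(self, query, k=1):
        query_vec = self.embed(query)
        query_norm = math.sqrt(sum(q * q for q in query_vec))

        def score(row):
            vec = row[0]
            dot = sum(q * v for q, v in zip(query_vec, vec))
            norms = query_norm * math.sqrt(sum(v * v for v in vec))
            # an all-zero vector has no direction, so treat it as unrelated
            return dot / norms if norms else 0.0

        ranked = sorted(self.rows, key=score, reverse=True)
        return [text for _, text in ranked[:k]]


vocab = ["google", "palm", "api", "vector", "database"]
store = TinyVectorStore(lambda text: toy_embed(text, vocab))
store.add_texts([
    "google offers an api for palm",
    "a vector database stores embeddings",
])
top = store.similarity_search("palm api", k=1)
print(top)
```

A production vector store adds persistence and approximate nearest-neighbor indexes so that the search does not have to scan every row.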

The next step is to create a LangChain retriever by calling the `.as_retriever()` method on your **vector store instance**.

In [None]:
# create retriever from db
retriever = db.as_retriever()

Once we have the retriever, we can use the `RetrievalQA` class to define a question-answering chain that draws on an external data source.

In [None]:
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# create a retrieval chain
qa_chain = RetrievalQA.from_chain_type(
    # the OpenAI class wraps the completions API, so use a completion-style model
    llm=OpenAI(model="gpt-3.5-turbo-instruct"),
    chain_type="stuff",
    retriever=retriever
)

We can query our document about a specific topic found in the documents.

In [None]:
query = "How does Google plan to challenge OpenAI?"
response = qa_chain.run(query)
print(response)

You should see something like the following:

    Google plans to challenge OpenAI by offering access to its AI language model PaLM, which is similar to OpenAI's GPT series and Meta's Llama family of models. PaLM is a large language model that can be used for tasks like summarizing text or writing code.

When creating the retrieval chain, we set the `chain_type` to “stuff.” This is the most straightforward document chain (“stuff” as in “to stuff” or “to fill”). It takes a list of documents, inserts them all into a prompt, and passes that prompt to an LLM. This approach only works well with shorter documents because of the context length limitations of most LLMs.
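The “stuff” strategy amounts to string concatenation. The sketch below shows the idea with a hypothetical `build_stuff_prompt` function; the prompt wording is illustrative and is not the template LangChain actually uses.

```python
def build_stuff_prompt(documents, question):
    """Concatenate every document into one context block, then append the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


chunks = [
    "Google is launching an API for PaLM.",
    "PaLM is a large language model.",
]
prompt = build_stuff_prompt(chunks, "How does Google plan to challenge OpenAI?")
print(prompt)
```

Because every document lands in a single prompt, the total length grows with the number and size of the retrieved chunks, which is why the context window becomes the limiting factor.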

The process also involves a similarity search over embeddings to find documents that are relevant to the query and can serve as context for the LLM. While this might appear limited in scope with a single document, its effectiveness is enhanced when dealing with multiple documents segmented into chunks. By selecting the most relevant documents based on semantic similarity, we supply the LLM with the relevant information while staying within its context size.

This approach works because the retriever can pinpoint documents that lie close to the user’s query in the embedding space, which gives the LLM the context it needs to answer.

This method does pose a notable challenge, especially with a more extensive data set. In the example, the text was split by size (with a target chunk size of 200 characters) rather than by topic, so a retrieved chunk can mix relevant and irrelevant text in response to a user’s query.

Incorporating unrelated content in the LLM prompt can be problematic because it may distract the LLM from focusing on essential details and it consumes space in the prompt that could be allocated to more relevant information.

A `DocumentCompressor` addresses this issue. Instead of immediately returning retrieved documents as-is, it compresses them so that only the information relevant to the query is returned. “Compressing” here refers to using an LLM to rewrite the retrieved chunk so that it contains only information relevant to the query. This way, the chunks are smaller, and more chunks can be used as contextual information to generate the final answer.

The `ContextualCompressionRetriever` serves as a wrapper that combines a base retriever with a `DocumentCompressor`, ensuring that only the most pertinent segments of the documents retrieved by the base retriever are used.

The `LLMChainExtractor` class is a `DocumentCompressor` that uses an LLM chain to extract relevant parts of documents.
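The idea of compression can be sketched without an LLM. In the hypothetical `extract_relevant` function below, a simple word-overlap filter stands in for the LLM call: it keeps only the sentences of a chunk that share a word with the query. A real extractor judges relevance far more reliably, so treat this only as an illustration of the input and output shapes.

```python
import re


def extract_relevant(chunk, query):
    """Keep only the sentences of chunk that share a word with the query."""
    query_words = set(re.findall(r"\w+", query.lower()))
    # split after sentence-ending punctuation followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    kept = [
        sentence for sentence in sentences
        if query_words & set(re.findall(r"\w+", sentence.lower()))
    ]
    return " ".join(kept)


chunk = (
    "Google first announced PaLM in April 2022. "
    "The weather was mild that week. "
    "PaLM can summarize text or write code."
)
compressed = extract_relevant(chunk, "What can PaLM do?")
print(compressed)
```

The unrelated sentence is dropped, so the compressed chunk takes up less of the prompt than the original.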

The following example demonstrates the application of the `ContextualCompressionRetriever` with the `LLMChainExtractor`:

In [None]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# create the LLM wrapper (the OpenAI class expects a completion-style model)
llm = OpenAI(model="gpt-3.5-turbo-instruct", temperature=0)

# create compressor for the retriever
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever
)

Once the `compression_retriever` is created, we can retrieve the relevant compressed documents for a query.

In [None]:
# retrieving compressed documents
retrieved_docs = compression_retriever.get_relevant_documents(
    "How does Google plan to challenge OpenAI?"
)
print(retrieved_docs[0].page_content)

You should see an output like the following:

    Google is offering developers access to one of its most advanced AI language models: PaLM. The search giant is launching an API for PaLM alongside a number of AI enterprise tools it says will help businesses "generate text, images, code, videos, audio, and more from simple natural language prompts."

Compressors try to simplify the process by sending **only essential** data to the LLM. This also allows you to provide more information to the LLM. Because the compressor takes care of precision, you can focus on recall during the initial retrieval step (for example, by increasing the number of documents returned).

We saw how to create a retriever from a .txt file; however, data can come in different types. The LangChain framework offers diverse classes that enable data to be loaded from multiple sources, including PDFs, URLs, and Google Drive, among others, which we will explore next.

# Data Ingestion

Data ingestion can be simplified with various data loaders, each with its own specialization. The `TextLoader` from `LangChain` excels at handling plain text files. The `PyPDFLoader` is optimized for PDF files, allowing easy access to the content. The `SeleniumURLLoader` is the go-to tool for web-based data, notably HTML documents from URLs that require JavaScript rendering. The `GoogleDriveLoader` integrates seamlessly with Google Drive, allowing for data import from Google Docs or entire folders.

In [None]:
from langchain.document_loaders import TextLoader

loader = TextLoader('file_path.txt')
documents = loader.load()

>💡You can use the `encoding` argument to change the encoding type (for example, `encoding="ISO-8859-1"`).
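The reason the encoding matters can be shown with the standard library alone. The sketch below writes a small file in ISO-8859-1 and reads it back with two different codecs; it assumes the loader’s `encoding` argument is passed through to ordinary file reading, and the file name is made up for the example.

```python
import os
import tempfile

text = "café, naïve, Zürich"

# write the file in ISO-8859-1 (Latin-1), as a legacy export might
path = os.path.join(tempfile.mkdtemp(), "latin1_file.txt")
with open(path, "w", encoding="ISO-8859-1") as f:
    f.write(text)

# reading with the wrong codec fails on the accented characters...
try:
    with open(path, encoding="utf-8") as f:
        f.read()
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

# ...while the matching codec round-trips the text
with open(path, encoding="ISO-8859-1") as f:
    recovered = f.read()

print(decoded_as_utf8, recovered == text)
```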


## Loading Data from PDF Files

The `PyPDFLoader` class can import PDF files and create a list of `LangChain` documents. Each document in this list contains the content and metadata of a single page, including the page number.

Here’s a code snippet to load and split a PDF file using `PyPDFLoader`:

In [None]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("example_data/layout-parser-paper.pdf")
pages = loader.load_and_split()

print(pages[0])
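Because each page carries its own metadata, pages can be filtered before they are indexed. The sketch below uses a hypothetical `PageDoc` stand-in with made-up content; the `source` and `page` metadata keys and the zero-based page numbering are assumptions to verify against the documents your loader actually returns.

```python
from dataclasses import dataclass, field


@dataclass
class PageDoc:
    """Hypothetical stand-in for the documents a PDF loader returns."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# made-up pages; a real loader would fill these from the PDF
pages = [
    PageDoc("Abstract ...", {"source": "paper.pdf", "page": 0}),
    PageDoc("1 Introduction ...", {"source": "paper.pdf", "page": 1}),
    PageDoc("2 Method ...", {"source": "paper.pdf", "page": 2}),
]

# keep only the first two pages, using the page number stored in metadata
front_matter = [doc for doc in pages if doc.metadata["page"] < 2]
print([doc.metadata["page"] for doc in front_matter])
```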

## Loading Data from Webpages

The `SeleniumURLLoader` class in `LangChain` provides a user-friendly solution for importing HTML documents from URLs that require JavaScript rendering.

>The code examples provided have been tested with the unstructured and selenium libraries, versions 0.7.7 and 4.10.0, respectively. For your own applications, we encourage you to install the most recent versions for optimal performance and features; to reproduce the output shown in the book, keep the versions listed here.

Instantiate the `SeleniumURLLoader` class by providing a list of URLs to load, for example:

In [None]:
from langchain.document_loaders import SeleniumURLLoader

urls = [
    "https://www.youtube.com/watch?v=TFa539R09EQ&t=139s",
    "https://www.youtube.com/watch?v=6Zv6A_9urh4&t=112s"
]

loader = SeleniumURLLoader(urls=urls)
data = loader.load()

print(data[0])

The `SeleniumURLLoader` class in `LangChain` offers several attributes: `urls` (`List[str]`), the list of URLs to load; `continue_on_failure` (`bool`, default `True`), which determines whether the loader should continue processing other URLs after a failure; `browser` (`str`, default `"chrome"`), which selects the browser (Chrome or Firefox) used to load the URLs; `executable_path` (`Optional[str]`, default `None`), the path to the browser’s executable file; and `headless` (`bool`, default `True`), which specifies whether the browser runs in headless mode, that is, without a visible user interface.

These attributes can be adjusted during initialization. For example, to use Firefox instead of Chrome, set the `browser` attribute to `"firefox"`:

In [None]:
loader = SeleniumURLLoader(urls=urls, browser="firefox")
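The effect of `continue_on_failure` can be illustrated with a small loop. The sketch below is not the loader’s actual code: `load_all` and `fake_fetch` are hypothetical, and the fetcher simply raises for one URL instead of opening a browser.

```python
def load_all(urls, fetch, continue_on_failure=True):
    """Fetch each URL; on error either skip it or re-raise, depending on the flag."""
    documents = []
    for url in urls:
        try:
            documents.append({"page_content": fetch(url), "metadata": {"source": url}})
        except Exception as error:
            if not continue_on_failure:
                raise
            print(f"Skipping {url}: {error}")
    return documents


def fake_fetch(url):
    """Hypothetical fetcher that fails for one URL instead of opening a browser."""
    if "broken" in url:
        raise RuntimeError("page did not load")
    return f"text of {url}"


loaded = load_all(["https://example.com/a", "https://example.com/broken"], fake_fetch)
print(len(loaded))
```

With the flag left at its default, one bad URL costs a single document; with `continue_on_failure=False`, the same error would stop the whole load.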

When the `load()` method is used with the `SeleniumURLLoader` object, it returns a collection of `Document` instances, each containing the content fetched from the web pages. These `Document` instances have a `page_content` attribute, which includes the text extracted from the HTML, and a `metadata` attribute that stores the source URL.

The `SeleniumURLLoader` class might operate slower than other loaders because it uses a browser instance to render each URL accurately, especially pages that require JavaScript.

>💡This approach will not work in a Google Colab notebook without further configuration, which is outside the scope of this book. Instead, try running the code directly using the Python interpreter.

## Loading Data from Google Drive

The LangChain `GoogleDriveLoader` class can import data directly from Google Drive. It can retrieve data from a list of Google Docs document IDs or a single folder ID on Google Drive.

To use the `GoogleDriveLoader`, you need to set up the necessary credentials and tokens. By default, the loader looks for the credentials file at ***~/.credentials/credentials.json***. You can specify a different path using the `credentials_file` keyword argument. The ***token.json*** file is created automatically on the loader’s first use and follows a similar path convention.

To set up the ***credentials_file***, follow these steps:

1.  Create or select a Google Cloud Platform project by visiting the Google Cloud Console. Make sure billing is enabled for the project.
2.  Activate the Google Drive API from the Google Cloud Console dashboard and click “Enable”.
3.  Follow the steps to set up a service account via the Service Accounts page in the Google Cloud Console.
4.  Assign the necessary roles to the service account. Roles like “Google Drive API - Drive File Access” and “Google Drive API - Drive Metadata Read/Write Access” might be required, depending on your specific use case.
5.  On the Service Accounts page, open the “Actions” menu next to the service account, select “Manage keys,” then click “Add Key” and choose “JSON” as the key type. This generates a JSON key file and downloads it to your computer; use this file as your credentials_file.
6.  Retrieve the folder or document ID from the URL, where it appears like this:
    
    – Folder: https://drive.google.com/drive/u/0/folders/{folder_id}
    
    – Document: https://docs.google.com/document/d/{document_id}/edit
    
7.  Import the `GoogleDriveLoader` class:

In [None]:
from langchain.document_loaders import GoogleDriveLoader

8. Instantiate `GoogleDriveLoader`:

In [None]:
loader = GoogleDriveLoader(
    folder_id="your_folder_id",
    recursive=False
)

9. Load the documents:

In [None]:
docs = loader.load()

It is important to note that currently, only Google Docs are supported.
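Step 6 above can be automated with a small helper that pulls the ID out of either URL shape. The `extract_drive_id` function is a hypothetical convenience, not part of LangChain, and the IDs in the example are made up.

```python
import re


def extract_drive_id(url):
    """Return ("folder" or "document", id) for the two URL shapes listed in step 6."""
    folder = re.search(r"/folders/([^/?#]+)", url)
    if folder:
        return "folder", folder.group(1)
    document = re.search(r"/document/d/([^/?#]+)", url)
    if document:
        return "document", document.group(1)
    raise ValueError(f"No folder or document ID found in {url!r}")


# the IDs below are made up for illustration
print(extract_drive_id("https://drive.google.com/drive/u/0/folders/abc123_FOLDER"))
print(extract_drive_id("https://docs.google.com/document/d/xyz789_DOC/edit"))
```

The folder ID can then be passed as `folder_id` when instantiating the loader.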