# LangChain’s Indexes and Retrievers

As seen earlier, an index in LangChain is a ***data structure that organizes and stores data to facilitate quick and efficient searches***. A retriever effectively uses this index to find and provide relevant data in response to specific queries. LangChain’s **indexes** and **retrievers** provide modular, adaptable, and customizable options for ***handling unstructured data with LLMs***. The primary index types in LangChain are based on **vector databases**, mainly emphasizing indexes using **embeddings**.

The role of retrievers is ***to extract relevant documents for integration into language model prompts***. In LangChain, a retriever employs a `get_relevant_documents` method, taking a query string as input and generating a list of documents that are relevant to that query.
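To make the retriever interface concrete, here is a minimal sketch that does not use LangChain at all. The `Document` dataclass and the `KeywordRetriever` class are hypothetical stand-ins written for this illustration, and ranking by shared words is a deliberate simplification of the embedding-based search used in the rest of this section.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for a LangChain document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class KeywordRetriever:
    """Toy retriever that ranks documents by the number of words shared with the query."""

    def __init__(self, documents, k=2):
        self.documents = documents
        self.k = k

    def get_relevant_documents(self, query):
        query_words = set(query.lower().split())
        scored = []
        for doc in self.documents:
            overlap = len(query_words & set(doc.page_content.lower().split()))
            if overlap > 0:
                scored.append((overlap, doc))
        # Highest overlap first; the sort is stable, so ties keep their input order.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in scored[: self.k]]


toy_docs = [
    Document("PaLM is a large language model from Google.", {"source": "a"}),
    Document("The dog is in the yard.", {"source": "b"}),
]
toy_retriever = KeywordRetriever(toy_docs)
results = toy_retriever.get_relevant_documents("Which language model did Google release?")
print([d.metadata["source"] for d in results])
```

The shape is the important part: a query string goes in and a list of documents comes out.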

Let’s see how they work with a practical application:

In [None]:
from langchain.document_loaders import TextLoader

# text to write to a local file
# taken from https://www.theverge.com/2023/3/14/23639313/google-ai-language-model-palm-api-challenge-openai
text =""" Google opens up its AI language model PaLM to challenge OpenAI and GPT-3 Google offers developers access to one of its most advanced AI language models: PaLM. The search giant is launching an API for PaLM alongside a number of AI enterprise tools it says will help businesses "generate text, images, code, videos, audio, and more from simple natural language prompts."

PaLM is a large language model, or LLM, similar to the GPT series created by OpenAI or Meta's Llama family of models. Google first announced PaLM in April 2022. Like other LLMs, PaLM is a flexible system that can potentially carry out all sorts of text generation and editing tasks. You could train PaLM to be a conversational chatbot like ChatGPT, for example, or you could use it for tasks like summarizing text or even writing code. (It's similar to features Google also announced today for its Workspace apps like Google Docs and Gmail.)
"""

# write text to local file
with open("my_file.txt", "w") as file:
    file.write(text)

# use TextLoader to load text from local file
loader = TextLoader("my_file.txt")
docs_from_file = loader.load()

print(len(docs_from_file))
# 1

Use `CharacterTextSplitter` to split the documents into text snippets called “chunks.” `chunk_overlap` is the number of characters that overlap between two consecutive chunks. It preserves context and improves coherence by ensuring that important information is not cut off at the boundaries of chunks.

In [None]:
from langchain.text_splitter import CharacterTextSplitter

# create a text splitter
text_splitter = CharacterTextSplitter(chunk_size=200, chunk_overlap=20)

# split documents into chunks
docs = text_splitter.split_documents(docs_from_file)

print(len(docs))
# 2

Create a **vector embedding** for each text snippet. These embeddings allow us to effectively search for documents or portions of documents that relate to our query by examining their semantic similarities.

Here, we chose ***OpenAI’s embedding*** model to create the embeddings.

In [None]:
from langchain.embeddings import OpenAIEmbeddings

# Before executing the following code, make sure to have
# your OpenAI key saved in the "OPENAI_API_KEY" environment variable.
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

Next, we need a vector store to hold those embeddings. A **vector store** is a system that stores embeddings and lets us query them. In this example, we will use **Deep Lake**, a cloud-based vector database, but others like [Chroma DB](https://www.trychroma.com/) would also work.

Let’s create an instance of a **Deep Lake** dataset and generate the embeddings by providing the `embedding_function`.

You will need a free Activeloop account to follow along:

In [None]:
import os
from langchain_custom_utils.helper import get_openai_api_key, get_activeloop_api_key 
OPENAI_API_KEY = get_openai_api_key()
ACTIVELOOP_API_KEY = get_activeloop_api_key()

In [None]:
from langchain.vectorstores import DeepLake

# Before executing the following code, make sure to have your
# Activeloop key saved in the "ACTIVELOOP_TOKEN" environment variable.

# create Deep Lake dataset
# TODO: use your organization id here. (by default, org id is your username)
# note: this is the organization id, not the API key
my_activeloop_org_id = "<YOUR-ACTIVELOOP-ORG-ID>"
my_activeloop_dataset_name = "langchain_course_indexers_retrievers"
dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"
db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings)

# add documents to our Deep Lake dataset
db.add_documents(docs)

The next step is to create a LangChain retriever by calling the `.as_retriever()` method on your **vector store instance**.

In [None]:
# create retriever from db
retriever = db.as_retriever()

Once we have the retriever, we can use the `RetrievalQA` class to define a question-answering chain that draws on an external data source.

In [None]:
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# create a retrieval chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(model="gpt-3.5-turbo"),
    chain_type="stuff",
    retriever=retriever
)

We can query our document about a specific topic found in the documents.

In [None]:
query = "How Google plans to challenge OpenAI?"
response = qa_chain.run(query)
print(response)

You should see something like the following:

    Google plans to challenge OpenAI by offering access to its AI language model PaLM, which is similar to OpenAI's GPT series and Meta's Llama family of models. PaLM is a large language model that can be used for tasks like summarizing text or writing code.

When creating the retrieval chain, we set the `chain_type` to “stuff.” This is the most straightforward document chain (“stuff” as in “to stuff” or “to fill”). It takes a list of documents, inserts them all into a prompt, and passes that prompt to an LLM. This approach only works well with shorter documents because of the context length limitations of most LLMs.
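The idea behind “stuffing” can be sketched in a few lines of plain Python. The `build_stuff_prompt` function and the prompt wording below are assumptions made for illustration; they are not the template LangChain uses internally.

```python
def build_stuff_prompt(question, documents):
    """Concatenate every retrieved chunk into a single prompt string."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


retrieved_chunks = [
    "Google offers developers access to PaLM through an API.",
    "PaLM is a large language model similar to the GPT series.",
]
prompt = build_stuff_prompt("How does Google plan to challenge OpenAI?", retrieved_chunks)
print(prompt)
```

Because every chunk ends up in the same prompt, the prompt grows with the number and size of the retrieved chunks, which is why this strategy is limited by the context window.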

The process also involves conducting a similarity search using embeddings to find documents that are relevant to the query and can be used as context for the LLM. While this might appear limited in scope with a single document, its effectiveness is enhanced when dealing with multiple documents segmented into chunks. We supply the LLM with the relevant information within its context size by selecting the most relevant documents based on semantic similarity.

The effectiveness of this approach in enhancing the language comprehension of large language models is underscored by the retriever’s ability to pinpoint documents closely related to a user’s query in the embedding space.

It is important to note that this method poses a notable challenge, especially when dealing with a more extensive data set. In the example, the text was split using a fixed target size of 200 characters, which resulted in both relevant and irrelevant text being presented in response to a user’s query.

Incorporating unrelated content in the LLM prompt can be problematic because it may distract the LLM from focusing on essential details and it consumes space in the prompt that could be allocated to more relevant information.

A `DocumentCompressor` addresses this issue. Instead of immediately returning retrieved documents as-is, it compresses them so that only the information relevant to the query is returned. “Compressing” here refers to using an LLM to rewrite the retrieved chunk so that it contains only information relevant to the query. This way, the chunks are smaller, and more chunks can be used as contextual information to generate the final answer.

The `ContextualCompressionRetriever` serves as a wrapper that combines a base retriever with a `DocumentCompressor`, ensuring that only the most pertinent segments of the documents retrieved by the base retriever are used.

The `LLMChainExtractor` class is a `DocumentCompressor` that uses an LLM chain to extract relevant parts of documents.

The following example demonstrates the application of the `ContextualCompressionRetriever` with the `LLMChainExtractor`:

In [None]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# create the LLM wrapper
llm = OpenAI(model="gpt-3.5-turbo", temperature=0)

# create compressor for the retriever
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever
)

Once the `compression_retriever` is created, we can retrieve the relevant compressed documents for a query.

In [None]:
# retrieving compressed documents
retrieved_docs = compression_retriever.get_relevant_documents(
    "How Google plans to challenge OpenAI?"
)
print(retrieved_docs[0].page_content)

You should see an output like the following:

    Google is offering developers access to one of its most advanced AI language models: PaLM. The search giant is launching an API for PaLM alongside a number of AI enterprise tools it says will help businesses "generate text, images, code, videos, audio, and more from simple natural language prompts."

Compressors try to simplify the process by sending **only essential** data to the LLM. This also allows you to provide more information to the LLM. Because the compressors take care of precision, the initial retrieval step can focus on recall (for example, by increasing the number of documents returned).
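To illustrate what “compressing” a chunk means without calling an LLM, the sketch below swaps the LLM for a simple keyword filter: it keeps only the sentences that share at least one word with the query. The `compress_chunk` function is hypothetical and far cruder than `LLMChainExtractor`; it only shows the shape of the operation (chunk and query in, shorter chunk out).

```python
import re


def compress_chunk(chunk, query, min_shared_words=1):
    """Keep only the sentences of `chunk` that share words with `query`."""
    query_words = set(re.findall(r"\w+", query.lower()))
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    kept = [
        sentence
        for sentence in sentences
        if len(query_words & set(re.findall(r"\w+", sentence.lower()))) >= min_shared_words
    ]
    return " ".join(kept)


chunk = (
    "Google is launching an API for PaLM. "
    "The weather in Paris was sunny. "
    "PaLM competes with models from OpenAI."
)
compressed = compress_chunk(chunk, "Google PaLM OpenAI")
print(compressed)
```

The unrelated sentence is dropped, so the compressed chunk takes up less room in the prompt.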

We saw how to create a retriever from a .txt file; however, data can come in different types. The LangChain framework offers diverse classes that enable data to be loaded from multiple sources, including PDFs, URLs, and Google Drive, among others, which we will explore next.

# Data Ingestion

Data ingestion can be simplified with various data loaders, each with its own specialization. The `TextLoader` from `LangChain` excels at handling plain text files. The `PyPDFLoader` is optimized for PDF files, allowing easy access to the content. The `SeleniumURLLoader` is the go-to tool for web-based data, notably HTML documents from URLs that require JavaScript rendering. The `GoogleDriveLoader` integrates seamlessly with Google Drive, allowing for data import from Google Docs or entire folders.

In [None]:
from langchain.document_loaders import TextLoader

loader = TextLoader('file_path.txt')
documents = loader.load()

>💡You can use the encoding argument to change the encoding type. (For example: encoding="ISO-8859-1")


## Loading Data from PDF Files

The `PyPDFLoader` class can import PDF files and create a list of `LangChain` documents. Each document in this list contains the content and metadata of a single page, including the page number.

Here’s a code snippet to load and split a PDF file using `PyPDFLoader`:

In [None]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("example_data/layout-parser-paper.pdf")
pages = loader.load_and_split()

print(pages[0])

## Loading Data from Webpages

The `SeleniumURLLoader` class in `LangChain` provides a user-friendly solution for importing HTML documents from URLs that require JavaScript rendering.

>The code examples provided have been tested with the unstructured and selenium libraries, versions 0.7.7 and 4.10.0, respectively. You are encouraged to install the most recent versions for optimal performance and features in your application, or to keep these versions if you want your output to match the book.

Instantiate the `SeleniumURLLoader` class by providing a list of URLs to load, for example:

In [None]:
from langchain.document_loaders import SeleniumURLLoader

urls = [
    "https://www.youtube.com/watch?v=TFa539R09EQ&t=139s",
    "https://www.youtube.com/watch?v=6Zv6A_9urh4&t=112s"
]

loader = SeleniumURLLoader(urls=urls)
data = loader.load()

print(data[0])

The `SeleniumURLLoader` class in `LangChain` offers several attributes:

-   `urls` (`List[str]`): the list of URLs to load.
-   `continue_on_failure` (`bool`, default `True`): whether the loader should continue processing the other URLs in case of a failure.
-   `browser` (`str`, default `"chrome"`): the browser (Chrome or Firefox) used for loading the URLs.
-   `executable_path` (`Optional[str]`, default `None`): the path to the browser’s executable file.
-   `headless` (`bool`, default `True`): whether the browser should operate in headless mode, meaning it runs without a visible user interface.

These attributes can be adjusted during initialization. For example, to use Firefox instead of Chrome, set the browser attribute to “firefox”:

In [None]:
loader = SeleniumURLLoader(urls=urls, browser="firefox")

When the `load()` method is used with the `SeleniumURLLoader` object, it returns a collection of Document instances, each containing the content fetched from the web pages. These Document instances have a page_content attribute, which includes the text extracted from the HTML, and a metadata attribute that stores the source URL.

The `SeleniumURLLoader` class might operate slower than other loaders because it drives a real browser to render each page accurately, which matters most for pages that require JavaScript.

>💡This approach will not work in a Google Colab notebook without further configuration, which is outside the scope of this book. Instead, try running the code directly using the Python interpreter.

## Loading Data from Google Drive

The LangChain `GoogleDriveLoader` class can import data directly from Google Drive. It can retrieve data from a list of Google Docs document IDs or a single folder ID on Google Drive.

To use the `GoogleDriveLoader`, you need to set up the necessary credentials and tokens. By default, the loader looks for the credentials file at ***~/.credentials/credentials.json***. You can specify a different path using the `credentials_path` keyword argument. The ***token.json*** file is created automatically on the loader’s first use and follows a similar path convention.

To set up the credentials file, follow these steps:

1.  Create or select a Google Cloud Platform project by visiting the Google Cloud Console. Make sure billing is enabled for the project.
2.  Activate the Google Drive API from the Google Cloud Console dashboard and click “Enable”.
3.  Follow the steps to set up a service account via the Service Accounts page in the Google Cloud Console.
4.  Assign the necessary roles to the service account. Roles like “Google Drive API - Drive File Access” and “Google Drive API - Drive Metadata Read/Write Access” might be required, depending on your specific use case.
5.  Navigate to the “Actions” menu next to the service account, select “Manage keys,” then click “Add Key” and choose “JSON” as the key type. This will generate a JSON key file and download it to your computer; this file will be used as your credentials file.
6.  Retrieve the folder or document ID identified at the end of the URL like this:
    
    – Folder: https://drive.google.com/drive/u/0/folders/{folder_id}
    
    – Document: https://docs.google.com/document/d/{document_id}/edit
    
7.  Import the `GoogleDriveLoader` class:

In [None]:
from langchain.document_loaders import GoogleDriveLoader

8. Instantiate `GoogleDriveLoader`:

In [None]:
loader = GoogleDriveLoader(
    folder_id="your_folder_id",
    recursive=False
)

9. Load the documents:

In [None]:
docs = loader.load()

It is important to note that currently, only Google Docs are supported.

# Text Splitters

• Find the [Notebook](https://colab.research.google.com/github/towardsai/ragbook-notebooks/blob/main/notebooks/Chapter%2007%20-%20What_are_Text_Splitters_and_Why_They_are_Useful_.ipynb) for this section at [towardsai.net/book](http://towardsai.net/book).

A challenge with LLMs is the limited size of the input prompt, which prevents them from including all documents economically and without introducing noise. This can be managed by using text splitters to divide large documents into smaller, cohesive parts that language models can process more effectively. Splitting long documents this way also enhances the effectiveness of vector store searches.

Text splitters help provide a source document to a large language model and, in turn, guide its content generation, reducing the likelihood of producing false or irrelevant information. With access to a reliable source, the LLM can deliver more accurate answers, which is particularly valuable in scenarios demanding high precision. Additionally, users can verify the information generated by cross-referencing it with the source document, ensuring reliability and correctness.

However, relying on a single document can limit the scope of content generated, as the LLM is restricted to the information available in that document. If the document contains errors or biases, the LLM’s output may be misleading or incorrect. Moreover, although referencing a document can reduce the likelihood of hallucinations, it cannot entirely prevent the LLM from generating false or irrelevant content.

A text splitter helps provide adequate context for the LLM to answer the query, as many small relevant segments might be more likely to match a query than a single big segment. Experimenting with different chunk sizes and overlaps can be beneficial in tailoring results to suit your specific needs.

This process can become complicated when retaining the integrity of semantically connected text parts is critical.

Text segmentation typically involves the following steps:

1.  Break the text into smaller, semantically meaningful units, often sentences.
2.  Aggregate these smaller units into larger segments until they reach a certain size, defined by specific criteria.
3.  Once the target size is reached, isolate the segment as a distinct piece.
4.  Repeat the process with some segment overlap to preserve contextual continuity.
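The segmentation procedure described above can be sketched in plain Python. The `split_into_segments` function below is a hypothetical illustration, not a LangChain API: it uses a naive regular expression to find sentence boundaries, measures size in characters, and carries the last sentence of each segment over as overlap whenever that still fits within the limit.

```python
import re


def split_into_segments(text, max_chars, overlap_sentences=1):
    """Group sentences into segments of at most `max_chars` characters."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    segments, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            # The segment is full: emit it as a distinct piece.
            segments.append(" ".join(current))
            # Start the next segment with the trailing sentences as overlap.
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
            # Drop the overlap if it would push the new segment over the limit.
            if len(" ".join(current + [sentence])) > max_chars:
                current = []
        current.append(sentence)
    if current:
        segments.append(" ".join(current))
    return segments


segments = split_into_segments("One. Two. Three. Four.", max_chars=12)
print(segments)
```

Each segment starts with the last sentence of the previous one, which is the overlap that preserves continuity. A single sentence longer than `max_chars` is kept whole in this sketch.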

In customizing text segmentation, consider two key factors: the technique for dividing the text and the criteria used to determine the size of each final text segment.

Below, we discuss the techniques and criteria commonly employed to determine the size of the chunks.

## Splitting Text by Number of Characters

This splitter offers customization in two key areas: *the size of each chunk* and the *extent of overlap between chunks*. This customization balances creating manageable segments and maintaining semantic continuity across them.

To begin processing documents, use the `PyPDFLoader` class. The [sample PDF file](https://github.com/towardsai/rag-ebook-files/blob/main/The%20One%20Page%20Linux%20Manual.pdf) used for this example is accessible at [towardsai.net/book](http://towardsai.net/book).

In [None]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("The One Page Linux Manual.pdf")
pages = loader.load_and_split()

Here, we split the text into “chunks” of up to 1000 characters, with an overlap of 20 characters.

In [None]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
texts = text_splitter.split_documents(pages)

print(texts[0])

print(f"You have {len(texts)} documents")
print("Preview:")
print(texts[0].page_content)

There isn’t a one-size-fits-all method for segmenting text, as the effectiveness of a process can vary widely depending on the documents used. An iterative approach can determine the optimal **chunk size** for your project.

Begin by cleaning your data and removing unnecessary elements like HTML tags from web sources. Next, experiment with different chunk sizes. Evaluate the effectiveness of each size by running queries and analyzing the results. Although this process can be time-consuming, it is an important step in achieving the best outcomes for your project.

## Splitting Text at Logical End Points

The `RecursiveCharacterTextSplitter` splitter segments text into chunks based on a predefined list of strings used as separators, trying to produce chunks that are not longer than a specified max length and that follow logical sections like paragraphs or sentences.

For example, the `RecursiveCharacterTextSplitter` first tries to segment the text by splitting it by paragraphs (using the “\n\n” separator). If a paragraph is shorter than the specified max length, it becomes a chunk. Otherwise, the `RecursiveCharacterTextSplitter` tries splitting the paragraph by newlines (using the “\n” separator). If a line is shorter than the specified max length, it becomes a chunk. Otherwise, the next separator is used (e.g., a whitespace separator), and so on.

To utilize the `RecursiveCharacterTextSplitter` splitter, create an instance with the following parameters:

-   `chunk_size`: This defines the maximum size of each chunk. The size is measured by the `length_function`, and the default value is 4000.
-   `chunk_overlap`: This specifies the maximum overlap between chunks to ensure continuity, with a default of 200.
-   `length_function`: This calculates the length of chunks. The default is `len`, which counts the number of characters.

Using a token counter instead of the default `len` function can be advantageous for specific applications, such as when working with language models with token limits. For instance, for a model limited to ***4096 tokens per request***, such as the original GPT-3.5 Turbo, a token counter might be more effective for managing and optimizing requests.

Here’s an example of how to use `RecursiveCharacterTextSplitter`:

In [None]:
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = PyPDFLoader("The One Page Linux Manual.pdf")
pages = loader.load_and_split()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=50,
    chunk_overlap=10,
    length_function=len,
)

docs = text_splitter.split_documents(pages)
for doc in docs:
    print(doc)

In the example, we set up an instance of the `RecursiveCharacterTextSplitter` class with specific parameters, using the default separator list `["\n\n", "\n", " ", ""]` for splitting the text.

Initially, the text is segmented using two newline characters (\n\n). If the resulting chunks exceed the desired size of 50 characters, the class attempts to divide the text using a single newline character (\n). The result is a series of documents comprising the segmented text.
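The fallback from coarse to fine separators can be illustrated with a simplified sketch. The `recursive_split` function below is an assumption-laden approximation written for this explanation: it merges adjacent pieces greedily and ignores `chunk_overlap`, so it is not the actual `RecursiveCharacterTextSplitter` implementation.

```python
def recursive_split(text, max_length, separators=("\n\n", "\n", " ", "")):
    """Split on the first separator and recurse with finer ones on oversized pieces."""
    if len(text) <= max_length:
        return [text] if text else []
    separator, remaining = separators[0], separators[1:]
    # An empty separator means "split into individual characters".
    pieces = list(text) if separator == "" else text.split(separator)
    chunks, current = [], ""
    for piece in pieces:
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= max_length:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_length:
            current = piece
        elif remaining:
            # Still too long: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, max_length, remaining))
        else:
            chunks.append(piece)  # no finer separator left
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph.\n\nSecond paragraph is a bit longer.\nIt has two lines."
chunks = recursive_split(sample, max_length=25)
print(chunks)
```

The first paragraph fits as-is, while the second one is too long and falls back to newline and then whitespace splitting.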

To incorporate a token counter, you can create a function that determines the token count in a text and use this as the `length_function` parameter. This modification ensures that the chunk lengths are calculated based on tokens rather than character counts.
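As a rough illustration of what such a length function could look like, the sketch below counts words and punctuation marks with a regular expression. The `approx_token_count` name and the regex are assumptions made for this example; they only approximate a real tokenizer, and in practice you would plug in the tokenizer that matches your model.

```python
import re


def approx_token_count(text):
    """Rough stand-in for a tokenizer: counts words and punctuation marks."""
    return len(re.findall(r"\w+|[^\w\s]", text))


sentence = "Hello, world!"
print(len(sentence))                 # length in characters
print(approx_token_count(sentence))  # length in approximate tokens

# The function can be passed to a splitter in place of `len`, for example:
# RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=10,
#                                length_function=approx_token_count)
```

With this change, `chunk_size` and `chunk_overlap` are interpreted in (approximate) tokens rather than characters.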

## Splitting Text with Foreign Linguistic Structures with NLTK

The `NLTKTextSplitter` splitter leverages the capabilities of the Natural Language Toolkit (NLTK) library for text segmentation. This class can make splitting decisions based on linguistic structure, thanks to many hand-written rules created by linguists. This means it can more intelligently identify sentence boundaries, paragraph divisions, and other natural language cues that depend on the specific language used, resulting in more semantically coherent chunks of text.

>💡If it is your first time using this package, you will need to install the NLTK library using !pip install -q nltk.

In [None]:
from langchain.text_splitter import NLTKTextSplitter

# Load a long document
with open('/home/cloudsuperadmin/scrape-chain/langchain/LLM.txt',
          encoding='unicode_escape') as f:
    sample_text = f.read()

text_splitter = NLTKTextSplitter(chunk_size=500)
texts = text_splitter.split_text(sample_text)
print(texts)

Consider using this tokenizer, particularly for foreign languages whose syntax is not based on words separated by whitespaces, such as Chinese (Mandarin and Cantonese), Japanese, and Thai.

## Splitting Text with Foreign Linguistic Structures with Spacy

The `SpacyTextSplitter` splitter is another class for separating large text documents into smaller parts of a specific size. The `SpacyTextSplitter` splitter is an alternative to NLTK-based sentence-splitting algorithms. To use this splitter, first construct a `SpacyTextSplitter` object and set the `chunk_size` property. This size is decided by a length function, which measures the number of characters in the text by default.

In [None]:
from langchain.text_splitter import SpacyTextSplitter

# Load a long document
with open('/home/cloudsuperadmin/scrape-chain/langchain/LLM.txt',
          encoding='unicode_escape') as f:
    sample_text = f.read()

# Instantiate the SpacyTextSplitter with the desired chunk size
text_splitter = SpacyTextSplitter(chunk_size=500, chunk_overlap=20)

# Split the text using SpacyTextSplitter
texts = text_splitter.split_text(sample_text)

# Print the first chunk
print(texts[0])

## Splitting Text with the Markdown Format

The `MarkdownTextSplitter` splitter specializes in segmenting text formatted with Markdown, targeting elements like headers, code blocks, or dividers. This splitter is a specialized version of the `RecursiveCharacterTextSplitter` splitter, adapted for Markdown with specific separators. These separators are, by default, aligned with standard Markdown syntax but can be tailored by supplying a customized list of characters during the initialization of the `MarkdownTextSplitter` instance. The default measurement for chunk size is based on the number of characters, as determined by the provided length function. When creating an instance, an integer value can be specified to adjust the chunk size to specific requirements.

In [None]:
from langchain.text_splitter import MarkdownTextSplitter

markdown_text = """
#

# Welcome to My Blog!

## Introduction
Hello everyone! My name is **John Doe** and I am a _software developer_. I specialize in Python, Java, and JavaScript.

Here's a list of my favorite programming languages:

1. Python
2. JavaScript
3. Java

You can check out some of my projects on [GitHub](https://github.com).

## About this Blog
In this blog, I will share my journey as a software developer. I'll post tutorials, my thoughts on the latest technology trends, and occasional book reviews.

Here's a small piece of Python code to say hello:

\``` python
def say_hello(name):
    print(f"Hello, {name}!")

say_hello("John")
\```

Stay tuned for more updates!

## Contact Me
Feel free to reach out to me on [Twitter](https://twitter.com) or send me an email at johndoe@email.com.

"""

markdown_splitter = MarkdownTextSplitter(chunk_size=100, chunk_overlap=0)
docs = markdown_splitter.create_documents([markdown_text])

print(docs)

Identifying Markdown syntax elements (e.g., headings, lists, and code blocks) enables intelligent content division based on its structural hierarchy, leading to semantically coherent segments.
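The idea of dividing content along its structural hierarchy can be sketched with a small heading-based splitter. The `split_by_headings` function is hypothetical and much simpler than `MarkdownTextSplitter`: it only starts a new section at each heading line and ignores chunk sizes. It would also misread a `#` comment inside a code block as a heading.

```python
import re


def split_by_headings(markdown_text):
    """Start a new section at every Markdown heading line (#, ##, ...)."""
    sections, current = [], []
    for line in markdown_text.strip().splitlines():
        if re.match(r"#{1,6}\s", line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return sections


sample_markdown = "# Title\nIntro text.\n\n## Part A\nBody A.\n\n## Part B\nBody B."
sections = split_by_headings(sample_markdown)
for section in sections:
    print(repr(section))
```

Each section keeps its heading together with the text beneath it, which is what makes the resulting segments coherent.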

## Splitting Text with Tokens

The `TokenTextSplitter` splitter offers a key advantage over splitters like the `CharacterTextSplitter` splitter by ensuring that the resulting chunks contain, at most, a specified number of tokens. This is very useful when using LLMs with a limited context window since it allows us to determine the maximum number of chunks that can be inserted into the prompt without making it bigger than the maximum context size.

This splitter first converts the input text into BPE (Byte Pair Encoding, seen in Chapter 2) tokens and then groups them into chunks. Then, the tokens within each chunk are converted back to their original text.

In [None]:
from langchain.text_splitter import TokenTextSplitter

# Load a long document
with open('/home/cloudsuperadmin/scrape-chain/langchain/LLM.txt',
          encoding='unicode_escape') as f:
    sample_text = f.read()

# Initialize the TokenTextSplitter with desired chunk size and overlap
text_splitter = TokenTextSplitter(chunk_size=100, chunk_overlap=50)

# Split into smaller chunks
texts = text_splitter.split_text(sample_text)
print(texts[0])

The `chunk_size` parameter in `TokenTextSplitter` dictates the maximum number of BPE tokens each chunk can contain, whereas `chunk_overlap` determines the extent of token overlap between successive chunks.
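The encode, group, and decode steps can be sketched with a toy tokenizer. In the example below, whitespace splitting stands in for BPE encoding and joining with spaces stands in for decoding; both are assumptions made to keep the sketch self-contained, and `split_by_tokens` is not part of LangChain.

```python
def split_by_tokens(text, chunk_size, chunk_overlap):
    """Group tokens into windows of `chunk_size` that share `chunk_overlap` tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    tokens = text.split()  # toy "encode" step
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))  # toy "decode" step
        if start + chunk_size >= len(tokens):
            break
    return chunks


token_chunks = split_by_tokens("a b c d e f g", chunk_size=4, chunk_overlap=2)
print(token_chunks)
```

Consecutive chunks share two tokens, and no chunk contains more than four.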

A potential but small downside of the `TokenTextSplitter` splitter is the increased computational effort required to convert text into BPE tokens and vice versa. For quicker and more straightforward text segmentation, the `CharacterTextSplitter` splitter may be a preferable option because it offers a more direct and less computationally intensive approach to dividing text.

The above text splitters are the most commonly used approaches to splitting text. The next section focuses on how these text splitters can be leveraged to enhance your application with an example where we build a customer support Q&A chatbot powered by LLMs.

# Similarity Search and Vector Embeddings

OpenAI’s embedding models are versatile and can generate embeddings that we can use for similarity searches. In this section, we will use the OpenAI API to create embeddings from a collection of documents and then perform a similarity search using cosine similarity.

Now, let’s generate embeddings for our documents and perform a similarity search.

Begin by defining a list of documents as strings. This text data will be used for the subsequent steps.

Next, compute the embeddings for each document using the `OpenAIEmbeddings` class. Set the embedding model to ***"text-embedding-ada-002"***. This model will generate embeddings for each document, transforming them into vector representations of their semantic content.

>💡Computing embeddings using a proprietary model like the "text-embedding-ada-002" model incurs costs due to the usage of the API. Embedding models are usually very cheap compared to using an LLM for inference, but the total cost can become significant if millions of text chunks are used. However, in this tutorial (and in all the other tutorials in this book), we will compute embeddings of a few texts, keeping the costs to a minimum. Check the OpenAI pricing page to see the current pricing for that model.

Similarly, convert the query string to an embedding. The query string contains the text for which we want to find the most similar document.

After obtaining the embeddings for our documents and the query, calculate the cosine similarity between the query embedding and each document embedding. Cosine similarity is a widely used measure of the similarity between two vectors: it is the cosine of the angle between them, so vectors pointing in the same direction score 1. In our context, it provides a series of similarity scores, each indicating how similar the query is to each document.

Once we have these similarity scores, we identify the document that is most similar to the query. This is achieved by finding the index of the highest similarity score and then retrieving the corresponding document from our collection.

In [None]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from langchain.embeddings import OpenAIEmbeddings

# Define the documents
documents = [
    "The cat is on the mat.",
    "There is a cat on the mat.",
    "The dog is in the yard.",
    "There is a dog in the yard.",
]

# Initialize the OpenAIEmbeddings instance
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

# Generate embeddings for the documents
document_embeddings = embeddings.embed_documents(documents)

# Perform a similarity search for a given query
query = "A cat is sitting on a mat."
query_embedding = embeddings.embed_query(query)

# Calculate similarity scores
similarity_scores = cosine_similarity([query_embedding], document_embeddings)[0]

# Find the most similar document
most_similar_index = np.argmax(similarity_scores)
most_similar_document = documents[most_similar_index]

print(f"Most similar document to the query '{query}':")
print(most_similar_document)
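To see what the similarity score measures, the computation can be written out by hand. The sketch below uses tiny two-dimensional vectors invented for illustration instead of real embeddings, and the `cosine` helper is a plain-Python stand-in for the scikit-learn function used above.

```python
import math


def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query_vec = [1.0, 0.0]
doc_vecs = {
    "same direction": [2.0, 0.0],
    "diagonal": [1.0, 1.0],
    "orthogonal": [0.0, 3.0],
}

scores = {name: cosine(query_vec, vec) for name, vec in doc_vecs.items()}
best = max(scores, key=scores.get)
print(scores)
print(best)
```

The score depends only on direction, not on vector length, which is why the vector `[2.0, 0.0]` matches the query `[1.0, 0.0]` perfectly.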

Embedding models can be open-source or proprietary. The choice of embedding model type depends on specific requirements, explained in the next sections.

## Open-Source Embedding Models

As previously discussed, embedding models belong to a specific category of machine learning models designed to transform discrete data points into vector representations. In natural language processing, these discrete elements can be words, sentences, or entire documents. The resulting vector representations, referred to as embeddings, aim to encapsulate the semantic essence of the original data. For instance, words with similar meanings, such as “cat” and “kitten,” are likely to have closely aligned embeddings. These embeddings possess high dimensionality and are utilized to capture subtle semantic differences.

Open-source embedding models offer flexibility, transparency, and cost savings, allowing customization and peer-reviewed improvements. However, they may lack support, and quality can vary. Proprietary models typically provide better performance, stability, and support but come with higher costs, limited customization, and potential vendor lock-in. The choice depends on specific needs like control vs. convenience.

One key advantage of using embeddings is their ability to enable mathematical operations for interpreting semantic meanings. As illustrated, a common application involves calculating the cosine similarity between two embeddings to assess the semantic closeness of associated words or documents. The following example shows how to use an open-source embedding model for this task.

The example uses the model “*sentence-transformers/all-mpnet-base-v2*”, a pre-trained model for converting sentences into semantically meaningful vectors.

The `model_kwargs` argument sets `device` to `cpu`, which ensures the computations are carried out on the CPU.

Before executing the following code, install the sentence transformer library with the command *!pip install sentence_transformers==2.2.2*.

***This library has robust pre-trained models specialized in generating embedding representations.***

Next, define a list of documents (the chunks of text we wish to turn into semantic embeddings) and generate their embeddings. This is accomplished by calling the `embed_documents` method on the `HuggingFaceEmbeddings` instance and supplying our document list as an argument. The method goes through each document and returns a list of embeddings, one per document.

In [None]:
from langchain.embeddings import HuggingFaceEmbeddings

model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {'device': 'cpu'}
hf = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)

documents = ["Document 1", "Document 2", "Document 3"]
doc_embeddings = hf.embed_documents(documents)

These embeddings are now ready for further processing, such as classification, grouping, or similarity analysis. They represent our original documents in a numerical, machine-readable format, allowing us to run semantic computations on them.
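
As one illustration of such further processing, the sketch below groups embeddings by similarity. It is a deliberately simple greedy scheme, not a LangChain feature: each vector joins the first group whose first member is similar enough, otherwise it starts a new group. The function name `group_by_similarity`, the threshold of 0.8, and the two-dimensional toy vectors are all assumptions made for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def group_by_similarity(vectors, threshold=0.8):
    """Greedily group vectors; returns lists of indices into `vectors`."""
    groups = []
    for i, vec in enumerate(vectors):
        for group in groups:
            # Compare against the first member of each existing group
            if cosine_similarity(vec, vectors[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            # No existing group was similar enough: start a new one
            groups.append([i])
    return groups

# Made-up 2-dimensional vectors standing in for real document embeddings
toy_embeddings = [
    [1.0, 0.1],
    [0.9, 0.2],
    [0.1, 1.0],
    [0.2, 0.9],
]

groups = group_by_similarity(toy_embeddings, threshold=0.8)
print(groups)  # [[0, 1], [2, 3]]
```

For real workloads, a dedicated clustering algorithm would usually be a better choice; the point here is only that embeddings can be compared and grouped with ordinary vector arithmetic.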

## Cohere Embeddings

While the choice between closed-source and open-source embedding models will ultimately depend on your specific needs, including budget, control, and flexibility, closed-source models often offer higher accuracy and speed. Several companies offer closed-source embedding models; we chose Cohere because its models are optimized for tasks like semantic search, text classification, and recommendation systems. Cohere provides a multilingual embedding model that maps texts in different languages into the same semantic vector space, which makes it well suited to multilingual applications, especially search functionalities. This model, distinct from Cohere’s English language model, employs dot product computations as a similarity metric for improved performance. The model produces 768-dimensional embeddings.
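
To make the difference between the dot product and cosine similarity concrete, here is a small, general sketch (it does not reflect how Cohere implements its model). The dot product grows with vector length, while cosine similarity depends on direction only; for vectors normalized to unit length the two coincide. The helper names and the toy vectors are assumptions for illustration.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length

raw_dot = dot(a, b)                        # sensitive to vector length
cosine = dot(normalize(a), normalize(b))   # depends on direction only

print(raw_dot)
print(round(cosine, 6))
```

Because a dot product skips the normalization step, it is cheaper to compute than cosine similarity, which is one reason a model might be designed around it.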

An API key is required to use the Cohere API. Navigate to the [Cohere Dashboard](https://dashboard.cohere.ai/api-keys) and create a new account or log in. Once logged in, the dashboard offers an easy-to-use interface for creating and managing API keys.

After acquiring the API key, create an instance of the CohereEmbeddings class with LangChain using the “*embed-multilingual-v2.0*” model.

Next, prepare a list of texts in various languages. Use the `embed_documents()` method to generate distinctive embeddings for each text.

To showcase these embeddings, each text is printed with its corresponding embedding. For clarity, only the first five dimensions of each embedding are displayed.

For this, the Cohere package must be installed by executing *!pip install cohere*.

In [None]:
from langchain_custom_utils.helper import get_cohere_api_key
from langchain.embeddings import CohereEmbeddings

COHERE_API_KEY = get_cohere_api_key()

# Initialize the CohereEmbeddings object
cohere = CohereEmbeddings(
    model="embed-multilingual-v2.0",
    cohere_api_key=COHERE_API_KEY
)

# Define a list of texts
texts = [
    "Hello from Cohere!",
    "مرحبًا من كوهير!",
    "Hallo von Cohere!",  
    "Bonjour de Cohere!",
    "¡Hola desde Cohere!",
    "Olá do Cohere!",  
    "Ciao da Cohere!",
    "您好，来自 Cohere！",
    "कोहेरे से नमस्ते!"
]

# Generate embeddings for the texts
document_embeddings = cohere.embed_documents(texts)

# Print the embeddings
for text, embedding in zip(texts, document_embeddings):
    print(f"Text: {text}")
    print(f"Embedding: {embedding[:5]}")  # print first 5 dimensions

In this example, LangChain proved helpful in simplifying the integration of an embedding model like Cohere’s multilingual embedding model into a developer’s workflow. This is one of the main advantages of working with libraries like LangChain and LlamaIndex: they make it easy to work with different types of models and switch between them without the need for big code changes.

Embeddings are typically computed once and then stored in a vector database for future use. Vector databases, like most systems, can be open-source or proprietary, with respective pros and cons.
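
The "compute once, store, reuse" pattern can be sketched with a tiny in-memory cache. This is only an illustration of the idea: `EmbeddingCache` and `fake_embed` are hypothetical names invented here, and a real vector database additionally provides persistence and fast similarity search.

```python
class EmbeddingCache:
    """Minimal in-memory store mapping text to its embedding (illustrative only)."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn  # function that turns a string into a vector
        self._store = {}
        self.calls = 0             # number of times embed_fn was actually invoked

    def get(self, text):
        # Compute the embedding only the first time a text is seen
        if text not in self._store:
            self.calls += 1
            self._store[text] = self._embed_fn(text)
        return self._store[text]

def fake_embed(text):
    # Stand-in for a real (and billable) call such as embeddings.embed_query(text)
    return [float(len(text)), float(text.count(" "))]

cache = EmbeddingCache(fake_embed)
first = cache.get("The cat is on the mat.")
second = cache.get("The cat is on the mat.")  # served from the cache

print(cache.calls)  # 1
```

With a paid embedding API, avoiding repeated calls for the same text is what keeps costs down as the collection grows.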

We explored how vector embeddings and similarity searches can be performed using embedding models from OpenAI, Hugging Face, and Cohere. In the next section, we’ll see how LangChain, OpenAI, and Deep Lake come together to build a conversational AI system. This system efficiently retrieves relevant information and answers user queries, demonstrating the power of embeddings in real-world applications.