# Semantic Search

This notebook demonstrates LangChain's [document loader](https://python.langchain.com/docs/concepts/document_loaders/), [embedding](https://python.langchain.com/docs/concepts/embedding_models/), and [vector store](https://python.langchain.com/docs/concepts/vectorstores/) abstractions.

These are fundamental abstractions for more advanced LLM use cases, such as retrieval-augmented generation (RAG).

We will build a search engine over a PDF document. This will allow us to retrieve passages in the PDF that are similar to an input query.

## Documents and Document Loaders

The first abstraction we'll use is `Document`, which represents a unit of text together with its associated metadata. It has three attributes:
- `page_content`: a string representing the content
- `metadata`: a dict containing arbitrary metadata
- `id` (optional): a string identifier for the document

Below, we create two sample documents by hand:

In [1]:
# Imports
from langchain_core.documents import Document

documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"source": "mammal-pets-doc"},
    ),
]

In practice, we rarely write documents by hand. The LangChain ecosystem implements document loaders that [integrate with hundreds of common sources](https://python.langchain.com/docs/integrations/document_loaders/), which makes it easy to incorporate those sources into an AI application.

### Loading Documents

Below, we'll load a PDF into a sequence of `Document` objects.

> See [this guide](https://python.langchain.com/docs/how_to/document_loader_pdf/) for more detail on PDF document loaders.

In [5]:
import os

from langchain_community.document_loaders import PyPDFLoader

file_path = "./data/2021_lewis_et_al..pdf"

# Stop early with a clear error instead of continuing with a missing file
if not os.path.exists(file_path):
    raise FileNotFoundError(f"File {file_path} not found")

loader = PyPDFLoader(file_path)

docs = loader.load()

print(len(docs))

19


The `PyPDFLoader` loads one `Document` object per PDF page. We can easily access the string content of the page, and the metadata containing the file name and page number.

In [8]:
import random

# Pick a random page, then preview its first 200 characters and its metadata
page = random.randrange(len(docs))
print(f"{docs[page].page_content[:200]}\n")
print(docs[page].metadata)

Table 4: Human assessments for the Jeopardy
Question Generation Task.
Factuality Speciﬁcity
BART better 7.1% 16.8%
RAG better 42.7% 37.4%
Both good 11.7% 11.8%
Both poor 17.7% 6.9%
No majority 20.8% 2

{'producer': 'pdfTeX-1.40.21', 'creator': 'LaTeX with hyperref', 'creationdate': '2021-04-13T00:48:38+00:00', 'author': '', 'keywords': '', 'moddate': '2021-04-13T00:48:38+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live 2020) kpathsea version 6.3.2', 'subject': '', 'title': '', 'trapped': '/False', 'source': './data/2021_lewis_et_al..pdf', 'total_pages': 19, 'page': 7, 'page_label': '8'}


### Splitting

For information retrieval (e.g. RAG) and downstream question-answering purposes, a page may be too coarse a representation. 

Our goal in the end will be to retrieve Document objects that answer an input query, and further splitting our PDF will help ensure that the meanings of relevant portions of the document are not "washed out" by surrounding text.

We can use text splitters for this purpose. Here we use `RecursiveCharacterTextSplitter`, which recursively splits each document on common separators, such as new lines, until every chunk fits the target size. It is the recommended text splitter for generic text use cases. We split our documents into chunks of up to 1000 characters with 200 characters of overlap between consecutive chunks. The overlap reduces the risk of separating a statement from the context it depends on.

We set `add_start_index=True` so that the character index at which each split `Document` starts within the original `Document` is preserved as the metadata attribute `start_index`.

See [this guide](https://python.langchain.com/docs/how_to/document_loader_pdf/) for more detail about working with PDFs, including how to extract text from specific sections and images.


In [9]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)

all_splits = text_splitter.split_documents(docs)
len(all_splits)

92
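
To make the roles of chunk size, overlap, and `start_index` concrete, here is a minimal sketch of a fixed-window splitter. It is a deliberate simplification and not the algorithm used by `RecursiveCharacterTextSplitter`, which also tries to break on separators. The helper name `split_with_overlap` and the toy text are assumptions made for illustration.

```python
from typing import Dict, List, Union


def split_with_overlap(
    text: str, chunk_size: int, chunk_overlap: int
) -> List[Dict[str, Union[int, str]]]:
    """Toy fixed-window splitter (hypothetical helper, not LangChain's algorithm)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        # Record where the chunk starts, mirroring the idea of `start_index`
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # the window has already reached the end of the text
    return chunks


toy_text = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = split_with_overlap(toy_text, chunk_size=10, chunk_overlap=4)
for chunk in toy_chunks:
    print(chunk["start_index"], chunk["text"])
```

Each chunk begins with the last four characters of the previous one, which is the same overlap mechanism that the 200-character setting provides at a larger scale.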

## Embeddings

To store and search over unstructured data, we can use embeddings. An embedding is a vector in a multidimensional space that "captures" the semantics (meaning) of a piece of text. An embedding model turns each text chunk into such a vector, so that chunks with similar meanings end up close to each other.

Given a query, we can embed it with the same model that we used for the chunks, and use a vector similarity metric (such as cosine similarity) to identify related text.
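
As an illustration of the similarity metric mentioned above, here is a minimal pure-Python sketch of cosine similarity. The function name and the two-dimensional toy vectors are assumptions for the example; real embeddings have many more dimensions, and libraries typically compute this with optimized numerical code.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # same direction: 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # orthogonal: 0.0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction: -1.0
```

The metric depends only on the direction of the vectors, not on their length, which is why the first pair scores 1.0 even though the magnitudes differ.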

LangChain supports embeddings from [many providers](https://python.langchain.com/docs/integrations/text_embedding/).

For our example we'll use OpenAI's `text-embedding-3-large` embedding model, which requires an OpenAI API key.

In [14]:
import getpass
import os

from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings

# Load variables from a local .env file, if one exists
load_dotenv()

# Prompt for the key only when it is not already set in the environment
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

In [17]:
vector_1 = embeddings.embed_query(all_splits[0].page_content)
vector_2 = embeddings.embed_query(all_splits[1].page_content)

assert len(vector_1) == len(vector_2)

print(f"Generated vectors of length {len(vector_2)}\n")
print(vector_1[:10])

Generated vectors of length 3072

[0.012752422131597996, 0.0023397256154567003, -0.011890682391822338, -0.021683795377612114, -0.00503349956125021, 0.03107609786093235, -0.030247757211327553, 0.020855454728007317, -0.038023460656404495, 0.009806472808122635]


## Vector Store

Now that we have embeddings, we can store them in special data structures that support similarity search.

For this example we'll use an in-memory vector store.

In [18]:
from langchain_core.vectorstores import InMemoryVectorStore

vector_store = InMemoryVectorStore(embeddings)

In [None]:
ids = vector_store.add_documents(documents=all_splits)
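
Conceptually, an in-memory vector store keeps (vector, text) pairs and ranks them against an embedded query. The sketch below illustrates that idea only and is not how `InMemoryVectorStore` is implemented. `ToyVectorStore`, `toy_embed`, and `TOY_VOCAB` are hypothetical names, and the term-counting "embedding" is a stand-in for a real embedding model.

```python
import math
from typing import List, Tuple

# Hypothetical vocabulary: each term is one dimension of the toy embedding
TOY_VOCAB = ["dog", "cat", "loyal", "independent", "pet"]


def toy_embed(text: str) -> List[float]:
    """Stand-in embedding: count occurrences of each vocabulary term."""
    lowered = text.lower()
    return [float(lowered.count(term)) for term in TOY_VOCAB]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


class ToyVectorStore:
    """Hypothetical store: a list of (vector, text) pairs searched by brute force."""

    def __init__(self) -> None:
        self._entries: List[Tuple[List[float], str]] = []

    def add_texts(self, texts: List[str]) -> None:
        for text in texts:
            self._entries.append((toy_embed(text), text))

    def similarity_search(self, query: str, k: int = 1) -> List[str]:
        # Embed the query with the same function used for the stored texts
        query_vector = toy_embed(query)
        ranked = sorted(
            self._entries, key=lambda entry: cosine(query_vector, entry[0]), reverse=True
        )
        return [text for _, text in ranked[:k]]


toy_store = ToyVectorStore()
toy_store.add_texts([
    "Dogs are great companions, known for their loyalty and friendliness.",
    "Cats are independent pets that often enjoy their own space.",
])
print(toy_store.similarity_search("Which pet is independent?", k=1))
```

Brute-force ranking like this is fine for a few dozen chunks. Dedicated vector databases generally rely on approximate nearest-neighbor indexes to stay fast on large collections.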

## Usage

Now we have a vector database that we can use to perform queries relating to our PDF.

The PDF that we stored in the database is the [landmark paper on RAG, by Lewis et al.](https://arxiv.org/pdf/2005.11401).

"Querying" the database is now straightforward. We'll write a query, which is embedded in the same vector space as the stored chunks. Then we perform a similarity search, which returns the chunks that are most relevant to the query.

In [21]:
results = vector_store.similarity_search(
    "What are some limitations of pre-trained language models when it comes to knowledge intensive tasks?"
)

print(results[0])

page_content='per token. We ﬁne-tune and evaluate our models on a wide range of knowledge-
intensive NLP tasks and set the state of the art on three open domain QA tasks,
outperforming parametric seq2seq models and task-speciﬁc retrieve-and-extract
architectures. For language generation tasks, we ﬁnd that RAG models generate
more speciﬁc, diverse and factual language than a state-of-the-art parametric-only
seq2seq baseline.
1 Introduction
Pre-trained neural language models have been shown to learn a substantial amount of in-depth knowl-
edge from data [47]. They can do so without any access to an external memory, as a parameterized
implicit knowledge base [51, 52]. While this development is exciting, such models do have down-
sides: They cannot easily expand or revise their memory, can’t straightforwardly provide insight into
their predictions, and may produce “hallucinations” [38]. Hybrid models that combine parametric' metadata={'producer': 'pdfTeX-1.40.21', 'creator': 'LaTeX with hy

## Retrievers

LangChain `VectorStore` objects do not implement the `Runnable` interface, but `Retriever`s do. Retrievers therefore expose a standard set of methods, such as `invoke` and `batch`, that make them easy to chain into a LangChain application.

While we can construct retrievers from vector stores, retrievers can interface with non-vector store sources of data, such as external APIs.

Below, we build a minimal retriever by wrapping `similarity_search` in a function decorated with `@chain`, without subclassing `Retriever`.

In [None]:
from typing import List

from langchain_core.documents import Document
from langchain_core.runnables import chain

@chain
def retriever(query: str) -> List[Document]:
    return vector_store.similarity_search(query, k=2)

results = retriever.batch([
    "What is the cost function that is minimized in RAG training?",
    "In what tasks does the RAG approach surpass the state of the art with seq2seq models?",
    "What are the advantages of RAG vs. purely parametric models?"
])

len(results)

3
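
To show why a shared interface is useful, and that a retriever does not need a vector store behind it, here is a minimal sketch of the idea. `ToyRunnable` is a hypothetical stand-in and not LangChain's `Runnable` class, and the keyword lookup over `toy_corpus` is an assumed example of a non-vector data source.

```python
from typing import Callable, Generic, List, TypeVar

In = TypeVar("In")
Out = TypeVar("Out")


class ToyRunnable(Generic[In, Out]):
    """Hypothetical wrapper that gives any function `invoke` and `batch` methods."""

    def __init__(self, func: Callable[[In], Out]) -> None:
        self._func = func

    def invoke(self, value: In) -> Out:
        return self._func(value)

    def batch(self, values: List[In]) -> List[Out]:
        # Sequential for simplicity: one call to the wrapped function per input
        return [self.invoke(value) for value in values]


toy_corpus = [
    "RAG combines retrieval with generation.",
    "BART is a seq2seq model.",
]


def keyword_lookup(query: str) -> List[str]:
    """Non-vector source: return corpus entries that contain the query term."""
    return [text for text in toy_corpus if query.lower() in text.lower()]


toy_retriever = ToyRunnable(keyword_lookup)
toy_results = toy_retriever.batch(["rag", "bart", "gpt"])
print(len(toy_results))  # one result list per query: 3
```

Because callers only depend on `invoke` and `batch`, the keyword lookup could be swapped for a vector search without changing the calling code. The real interface offers more than this sketch, for example asynchronous and streaming variants.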

In [37]:
from pprint import pprint

for matches in results:
    for match in matches:
        pprint(match.page_content)

('minimize the negative marginal log-likelihood of each target, ∑\n'
 'j−log p(yj|xj) using stochastic\n'
 'gradient descent with Adam [28]. Updating the document encoder BERTd during '
 'training is costly as\n'
 'it requires the document index to be periodically updated as REALM does '
 'during pre-training [20].\n'
 'We do not ﬁnd this step necessary for strong performance, and keep the '
 'document encoder (and\n'
 'index) ﬁxed, only ﬁne-tuning the query encoder BERTq and the BART '
 'generator.\n'
 '2.5 Decoding\n'
 'At test time, RAG-Sequence and RAG-Token require different ways to '
 'approximatearg maxyp(y|x).\n'
 'RAG-Token The RAG-Token model can be seen as a standard, autoregressive '
 'seq2seq genera-\n'
 'tor with transition probability: p′\n'
 'θ(yi|x,y1:i−1) = ∑\n'
 'z∈top-k(p(·|x)) pη(zi|x)pθ(yi|x,zi,y1:i−1) To\n'
 'decode, we can plug p′\n'
 'θ(yi|x,y1:i−1) into a standard beam decoder.\n'
 'RAG-Sequence For RAG-Sequence, the likelihood p(y|x) does not break into a '
 