# Build a semantic search engine

This tutorial will familiarize you with LangChain's document loader, embedding, and vector store abstractions. These abstractions are designed to support retrieval of data, from (vector) databases and other sources, for integration with LLM workflows. They are important for applications that fetch data to be reasoned over as part of model inference, as in the case of retrieval-augmented generation, or RAG (see the LangChain RAG tutorial).

Here we will build a search engine over a PDF document. This will allow us to retrieve passages in the PDF that are similar to an input query.

## Prerequisites

You will need to provision the following Azure resources:

* Azure OpenAI with two model deployments: GPT-4o and an embedding model.
* Azure AI Search.

You can create all of these resources with the Terraform template in the `../400_azure_ai_foundry` folder by running the following commands in that folder.

```sh
terraform init
terraform plan -out tfplan
terraform apply tfplan
```

This tutorial requires the langchain-community and pypdf packages.

In [46]:
%pip install langchain-community pypdf --quiet

Note: you may need to restart the kernel to use updated packages.


## Documents and Document Loaders

LangChain implements a `Document` abstraction, which is intended to represent a unit of text and associated metadata. It has three attributes:

* `page_content`: a string representing the content;
* `metadata`: a dict containing arbitrary metadata;
* `id`: (optional) a string identifier for the document.

The `metadata` attribute can capture information about the source of the document, its relationship to other documents, and other information. Note that an individual `Document` object often represents a chunk of a larger document.

We can generate sample documents when desired.

In [5]:
from langchain_core.documents import Document

documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"source": "mammal-pets-doc"},
    ),
]

## Loading documents

Let's load a PDF into a sequence of `Document` objects. This walkthrough uses a local copy of the book Azure for Architects, Third Edition, stored at `./example_data/azure-for-architects.pdf` (the original LangChain tutorial uses a 2023 10-K filing for Nike instead). We can consult the LangChain documentation for available PDF document loaders. Let's select `PyPDFLoader`, which is fairly lightweight.

In [12]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "./example_data/azure-for-architects.pdf"
loader = PyPDFLoader(file_path)

docs = loader.load()

print(len(docs))

701


PyPDFLoader loads one Document object per PDF page. For each, we can easily access:

* The string content of the page;
* Metadata containing the file name and page number.

In [14]:
print(f"{docs[1].page_content[:200]}\n")
print(docs[1].metadata)

Ritesh Modi, Jack Lee, and Rithin Skaria
Create secure, scalable, high-availability 
applications on the cloud
Azure for Architects
Third Edition

{'producer': 'Adobe PDF Library 15.0', 'creator': 'Adobe InDesign 16.0 (Windows)', 'creationdate': '2021-06-17T13:34:27+05:30', 'author': 'Ritesh Modi', 'moddate': '2021-06-17T14:21:14+05:30', 'subject': 'Create secure, scalable, high-availability applications on the cloud', 'title': 'Azure for Architects, Third Edition', 'trapped': '/False', 'source': './example_data/azure-for-architects.pdf', 'total_pages': 701, 'page': 1, 'page_label': 'a'}


## Splitting

For both information retrieval and downstream question-answering purposes, a page may be too coarse a representation. Our goal in the end will be to retrieve `Document` objects that answer an input query, and further splitting our PDF will help ensure that the meanings of relevant portions of the document are not "washed out" by surrounding text.

We can use text splitters for this purpose. Here we will use a simple text splitter that partitions based on characters. We will split our documents into chunks of 1000 characters with 200 characters of overlap between chunks. The overlap helps mitigate the possibility of separating a statement from important context related to it. We use the RecursiveCharacterTextSplitter, which will recursively split the document using common separators like new lines until each chunk is the appropriate size. This is the recommended text splitter for generic text use cases.

We set `add_start_index=True` so that the character index at which each split `Document` starts within the original `Document` is preserved in the metadata attribute `start_index`.
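
To make the roles of `chunk_size`, `chunk_overlap`, and `start_index` concrete, here is a minimal pure-Python sketch of fixed-size chunking with overlap. It is a simplified illustration, not how `RecursiveCharacterTextSplitter` works internally: it cuts at exact character offsets and ignores separators. The function name `split_with_overlap` and the sample text are hypothetical.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only: it cuts at exact character offsets
    and does not try to break on separators such as newlines.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical 2,500-character sample text
sample = "".join(str(i % 10) for i in range(2500))
for chunk in split_with_overlap(sample):
    print(chunk["start_index"], len(chunk["text"]))
```

With these settings each window starts 800 characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next.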

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

print("Number of splits: ", len(all_splits))

Number of splits:  1473


View two consecutive sample chunks.

In [None]:
# print a sample chunk
print(all_splits[309])
print("---------------")
print(all_splits[310])

page_content='sender, stored in durable storage, and finally consumed by recipients.
The top architectural concerns addressed by messaging patterns are as follows:
• Durability: Messages are stored in durable storage, and applications can read 
them after they are received in case of a failover.
• Reliability: Messages help implement reliability as they are persisted on disk and 
never lost.
• Availability of messages: The messages are available for consumption by 
applications after the restoration of connectivity and before downtime.
Azure provides Service Bus queues and topics to implement messaging patterns within 
applications. Azure Queue storage can also be used for the same purpose. 
Choosing between Azure Service Bus queues and Queue storage is about deciding on 
how long the message should be stored, the size of the message, latency, and cost. 
Azure Service Bus provides support for 256 KB messages, while Queue storage provides' metadata={'producer': 'Adobe PDF Library 15.0',

## Embeddings

Vector search is a common way to store and search over unstructured data (such as unstructured text). The idea is to store numeric vectors that are associated with the text. Given a query, we can embed it as a vector of the same dimension and use vector similarity metrics (such as cosine similarity) to identify related text.
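
As a small illustration of the similarity metric mentioned above, the following sketch computes cosine similarity in plain Python. The three-dimensional vectors are hypothetical stand-ins for real embeddings, which have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings"
query = [1.0, 0.0, 1.0]
passages = {
    "same direction": [2.0, 0.0, 2.0],
    "orthogonal": [0.0, 1.0, 0.0],
    "opposite": [-1.0, 0.0, -1.0],
}
for label, vector in passages.items():
    print(label, round(cosine_similarity(query, vector), 3))
```

Scores close to 1 indicate vectors pointing in nearly the same direction, scores near 0 indicate unrelated directions, and negative scores indicate opposing directions.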

LangChain supports embeddings from dozens of providers. These models specify how text should be converted into a numeric vector. 

In [30]:
%pip install -qU langchain-openai
%pip install python-dotenv

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [48]:
import os
from langchain_openai import AzureOpenAIEmbeddings
from dotenv import load_dotenv

if os.path.exists(".env"):
    load_dotenv(override=True)


embeddings = AzureOpenAIEmbeddings(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    openai_api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_deployment=os.environ["AZURE_OPENAI_EMBEDDING_MODEL"],
    openai_api_version=os.environ["AZURE_OPENAI_EMBEDDING_API_VERSION"],
)
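
Both the embedding client above and the Azure AI Search client used later read their settings from environment variables. A missing variable surfaces as a `KeyError` at the point of use, so it can help to check all of them up front. The sketch below is one possible way to do that; the helper name `missing_env_vars` is hypothetical, and the variable names are the ones this notebook reads.

```python
import os

# Environment variables read by the cells in this notebook
REQUIRED_VARS = [
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_EMBEDDING_MODEL",
    "AZURE_OPENAI_EMBEDDING_API_VERSION",
    "AZURE_SEARCH_SERVICE_ENDPOINT",
    "AZURE_SEARCH_SERVICE_ADMIN_KEY",
    "AZURE_SEARCH_SERVICE_INDEX",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing settings:", ", ".join(missing))
else:
    print("All required settings are present.")
```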

In [43]:
vector_1 = embeddings.embed_query(all_splits[0].page_content)
vector_2 = embeddings.embed_query(all_splits[1].page_content)

assert len(vector_1) == len(vector_2)
print(f"Generated vectors of length {len(vector_1)}\n")
print(vector_1[:10])

Generated vectors of length 3072

[0.010515968315303326, -0.01905210129916668, -0.017833741381764412, 0.022737640887498856, 0.008901641704142094, -0.006392581854015589, -0.011467811651527882, 0.024169212207198143, -0.0178185123950243, 0.02282901667058468]


Armed with a model for generating text embeddings, we can next store them in a special data structure that supports efficient similarity search.

## Vector stores

LangChain VectorStore objects contain methods for adding text and `Document` objects to the store, and querying them using various similarity metrics. They are often initialized with embedding models, which determine how text data is translated to numeric vectors.

LangChain includes a suite of integrations with different vector store technologies. Some vector stores are hosted by a provider (e.g., various cloud providers) and require specific credentials to use; some (such as Postgres) run in separate infrastructure that can be run locally or via a third-party; others can run in-memory for lightweight workloads. This tutorial uses Azure AI Search through the `AzureSearch` integration, which requires the `azure-search-documents` and `azure-identity` packages.

In [41]:
%pip install --upgrade --quiet  azure-search-documents
%pip install --upgrade --quiet  azure-identity

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [None]:
from langchain_community.vectorstores.azuresearch import AzureSearch
from langchain_openai import AzureOpenAIEmbeddings

# The embedding model is served by Azure OpenAI, so use the Azure OpenAI
# endpoint and key here (not the Azure AI Search ones).
embeddings: AzureOpenAIEmbeddings = AzureOpenAIEmbeddings(
    azure_deployment=os.environ["AZURE_OPENAI_EMBEDDING_MODEL"],
    openai_api_version=os.environ["AZURE_OPENAI_EMBEDDING_API_VERSION"],
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
)

## Create vector store instance

Create an instance of the `AzureSearch` class using the embeddings from above.

In [65]:
# Specify additional properties for the Azure client such as the following https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/core/azure-core/README.md#configurations
vector_store: AzureSearch = AzureSearch(
    azure_search_endpoint=os.environ["AZURE_SEARCH_SERVICE_ENDPOINT"],
    azure_search_key=os.environ["AZURE_SEARCH_SERVICE_ADMIN_KEY"],
    index_name=os.environ["AZURE_SEARCH_SERVICE_INDEX"],
    embedding_function=embeddings.embed_query,
    # Configure max retries for the Azure client
    additional_search_client_options={"retry_total": 3},
    relevance_score_fn="cosine",
)

## Insert text and embeddings into vector store

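The sample document has already been loaded and chunked above. This step embeds the chunks and indexes them into a search index on Azure AI Search. To keep the example small, only the first 30 chunks are added, so the searches below run over just those chunks. The call returns the IDs of the indexed documents.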

In [54]:
vector_store.add_documents(documents=all_splits[:30])

['MTcwNzU5NDUtNjJhNy00YjE5LWE2YmYtMWI3Y2Y0YjQ3ZjIy',
 'NmU0MGFkYmItMDA3Ni00OGE1LThjM2ItNDI3ODFmNGRiMzZl',
 'ZjY2NDZkZDgtZTExZi00YWJjLWJmMTQtY2IwOTJhZmM3MDI3',
 'MzBlZTZjNDgtN2ZlNi00N2ZkLWEwMjEtMDVlNjIxNDcxMGEz',
 'OGJhZTFmNTItYjk4OS00MDU1LTlmMDUtNDE5MzYxZTg2Mjdk',
 'MDg1Y2JmMjQtMmZlMS00OTYxLThmZTAtMGI2NzI3NjRmY2Ji',
 'OTFkYjRlMTctOTA5ZC00OThmLTg5MzUtMjI5MzQ3YmUyMWEz',
 'NWRlMWM5MmYtMGE3Ni00MzFjLWI5NDItMjJlMWNlZDQ5YTNl',
 'N2NhYWUxZGMtM2NkMC00OGYyLTg4MDktYWI4NDdiNjcwYWY5',
 'MWI2YTUzNDctYTJmMy00YmMzLTgyM2UtNjUzMDQ3MjMxZjI0',
 'MTExYzc4ZmQtMzczNy00ODAyLWJkNDQtOGYxZTI0NTQyMjNj',
 'ZmEzMTE1NWYtNTVkZS00NGIzLWI0MzAtODNlZGRlODY0ZjM1',
 'MDI4ZDBiOTQtZTFiNy00ZjM5LWEzMWItZWZkOTdjZjM2YWZk',
 'NzA1MWViOGQtMTU1OS00YTk1LWI1N2EtNTlmYzdkMmU4Nzk2',
 'MzQ3MDA4OGUtODc5MC00MDY4LTkyNjUtZWMxYmFhMWZlODFh',
 'ZTQwZThkMGEtNzIxMC00OTZiLWFmM2ItMzcwMTY4NmFmNWUy',
 'MjVjZTFhZmMtNTc0NS00N2I0LWEyYmYtNmJiZGJlMjgxYWU3',
 'ODM0ZTczMmEtYzdmMi00M2I4LTliYjYtZDE0YzdjMmVmMTkw',
 'ZTQ2MmNjYzQtN2Y0Yi00MDgzLWEwMDktMjNlOWRkYTM0

Once we've instantiated a VectorStore that contains documents, we can query it. VectorStore includes methods for querying:

* Synchronously and asynchronously;
* By string query and by vector;
* With and without returning similarity scores;
* By similarity and maximal marginal relevance (which balances similarity to the query against diversity among the retrieved results).
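
To illustrate the last item, here is a minimal sketch of greedy maximal marginal relevance (MMR) selection in plain Python. It is not taken from any vector store implementation; the function names, the toy two-dimensional vectors, and the trade-off weight (called `lambda_mult` here, where 1.0 means pure relevance and lower values favor diversity) are assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query_vec, candidate_vecs, k=2, lambda_mult=0.5):
    """Greedily pick up to k candidate indices by maximal marginal relevance."""
    selected = []
    remaining = list(range(len(candidate_vecs)))

    def mmr_score(i):
        relevance = cosine(query_vec, candidate_vecs[i])
        # Redundancy is the highest similarity to anything already selected
        redundancy = max(
            (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
            default=0.0,
        )
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical vectors: candidates 0 and 1 are near-duplicates, 2 is different
query = [1.0, 0.0]
candidates = [[1.0, 0.1], [1.0, 0.12], [0.8, -0.6]]
print(mmr_select(query, candidates, k=2, lambda_mult=0.5))
print(mmr_select(query, candidates, k=2, lambda_mult=1.0))
```

With the balanced weight, the second pick skips the near-duplicate in favor of the less similar candidate; with `lambda_mult=1.0` the selection reduces to plain similarity ranking.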

## Perform a vector similarity search

Execute a pure vector similarity search using the `similarity_search()` method.

In [66]:
# Perform a similarity search
docs = vector_store.similarity_search(
    query="Who are the authors of the book Azure for Architects",
    k=3,
    search_type="similarity",
)
print(docs[0].page_content)

Ritesh Modi, Jack Lee, and Rithin Skaria
Create secure, scalable, high-availability 
applications on the cloud
Azure for Architects
Third Edition


## Perform a vector similarity search with relevance scores

Execute a pure vector similarity search using the `similarity_search_with_relevance_scores()` method. Results whose score falls below `score_threshold` are excluded.
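
The effect of a score threshold can be sketched in a few lines of plain Python. This is an illustration under the assumption of an inclusive comparison, not the library's code; the helper name, chunk labels, and scores are hypothetical.

```python
def filter_by_score(scored_docs, score_threshold):
    """Keep (document, score) pairs whose score is at least score_threshold."""
    return [(doc, score) for doc, score in scored_docs if score >= score_threshold]


# Hypothetical results as (chunk label, relevance score), best first
scored = [("title page", 0.77), ("copyright page", 0.72), ("table of contents", 0.61)]
kept = filter_by_score(scored, score_threshold=0.70)
print(kept)
```

Because filtering happens after the top `k` results are retrieved, a query can return fewer than `k` results, or none at all.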

In [67]:
from pprint import pprint

docs_and_scores = vector_store.similarity_search_with_relevance_scores(
    query="Who are the authors of the book Azure for Architects",
    k=4,
    score_threshold=0.70,
)

pprint(docs_and_scores)

[(Document(metadata={'id': 'Nzg3Y2JiYzItNjcxZS00NzU1LTlmYzQtMGMxYzY3YmNlYzFi', 'producer': 'Adobe PDF Library 15.0', 'creator': 'Adobe InDesign 16.0 (Windows)', 'creationdate': '2021-06-17T13:34:27+05:30', 'author': 'Ritesh Modi', 'moddate': '2021-06-17T14:21:14+05:30', 'subject': 'Create secure, scalable, high-availability applications on the cloud', 'title': 'Azure for Architects, Third Edition', 'trapped': '/False', 'source': './example_data/azure-for-architects.pdf', 'total_pages': 701, 'page': 1, 'page_label': 'a', 'start_index': 0}, page_content='Ritesh Modi, Jack Lee, and Rithin Skaria\nCreate secure, scalable, high-availability \napplications on the cloud\nAzure for Architects\nThird Edition'),
  0.7745179),
 (Document(metadata={'id': 'MTcwNzU5NDUtNjJhNy00YjE5LWE2YmYtMWI3Y2Y0YjQ3ZjIy', 'producer': 'Adobe PDF Library 15.0', 'creator': 'Adobe InDesign 16.0 (Windows)', 'creationdate': '2021-06-17T13:34:27+05:30', 'author': 'Ritesh Modi', 'moddate': '2021-06-17T14:21:14+05:30', 'su

## Perform a hybrid search

Execute a hybrid search using either the `search_type` parameter or the `hybrid_search()` method. Vector and non-vector text fields are queried in parallel, the results are merged, and the top matches of the unified result set are returned.
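
Azure AI Search documents Reciprocal Rank Fusion (RRF) as the way it merges the vector and keyword rankings in a hybrid query. The sketch below shows the general idea in plain Python; it is not the service's implementation. The function name, the document labels, and the constant `k=60` (a conventional choice) are assumptions made for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranked list.

    Each document earns 1 / (k + rank) from every list it appears in;
    documents are then ordered by their summed score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical rankings from a vector query and a keyword query
vector_ranking = ["title-page", "copyright-page", "preface"]
keyword_ranking = ["copyright-page", "contents", "title-page"]
fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print(fused)
```

In this toy example the document ranked near the top of both lists ends up first, even though the vector ranking alone would have placed another document ahead of it. This is one reason a hybrid query can return a different top result than a pure vector query.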

In [68]:
# Perform a hybrid search using the search_type parameter
docs = vector_store.similarity_search(
    query="Who are the authors of the book Azure for Architects",
    k=3,
    search_type="hybrid",
)
print(docs[0].page_content)

Azure for Architects Third Edition
Copyright © 2020 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, 
or transmitted in any form or by any means, without the prior written permission of the 
publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of 
the information presented. However, the information contained in this book is sold 
without warranty, either express or implied. Neither the authors, nor Packt Publishing, 
and its dealers and distributors will be held liable for any damages caused or alleged to 
be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the 
companies and products mentioned in this book by the appropriate use of capitals. 
However, Packt Publishing cannot guarantee the accuracy of this information.


Queries can also be run asynchronously with `asimilarity_search()`.

In [69]:
results = await vector_store.asimilarity_search("When was this book published ?")

print(results[0])

page_content='Azure for Architects Third Edition
Copyright © 2020 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, 
or transmitted in any form or by any means, without the prior written permission of the 
publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of 
the information presented. However, the information contained in this book is sold 
without warranty, either express or implied. Neither the authors, nor Packt Publishing, 
and its dealers and distributors will be held liable for any damages caused or alleged to 
be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the 
companies and products mentioned in this book by the appropriate use of capitals. 
However, Packt Publishing cannot guarantee the accuracy of this information.' met

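Return documents based on similarity to an embedded query. In the run shown here, `similarity_search_by_vector()` raised `NotImplementedError`, which indicates that the installed version of this vector store integration did not implement the method.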

In [70]:
embedding = embeddings.embed_query("How many chapters in this book ?")

results = vector_store.similarity_search_by_vector(embedding)
print(results[0])

NotImplementedError: 

## Layout analysis and extraction of text from images

If you require a more granular segmentation of text (e.g., into distinct paragraphs, titles, tables, or other structures) or require extraction of text from images, the method below is appropriate. It will return a list of Document objects, where each object represents a structure on the page. The Document's metadata stores the page number and other information related to the object (e.g., it might store table rows and columns in the case of a table object).

The LangChain integration for this is provided by the `langchain-unstructured` library. See the Unstructured integration docs for more information about using Unstructured with LangChain.

Unstructured supports multiple parameters for PDF parsing:

* `strategy` (e.g., `"fast"` or `"hi_res"`);
* API or local processing. You will need an API key to use the API.

The `hi_res` strategy provides support for document layout analysis and OCR. The example below runs Unstructured locally through its `partition` function, which requires the `unstructured[pdf]` extra.

In [1]:
%pip install -qU langchain-unstructured

Note: you may need to restart the kernel to use updated packages.


In [None]:
# Install Unstructured with PDF support for local parsing
%pip install unstructured
%pip install "unstructured[pdf]"

In [None]:
from unstructured.partition.auto import partition

elements = partition(filename="./example_data/azure-for-architects.pdf")

print("\n\n".join([str(el) for el in elements]))