# Build a semantic search engine

This tutorial will familiarize you with LangChain's [document loader](/docs/concepts/document_loaders), [embedding](/docs/concepts/embedding_models), and [vector store](/docs/concepts/vectorstores) abstractions. These abstractions are designed to support retrieval of data--  from (vector) databases and other sources--  for integration with LLM workflows. They are important for applications that fetch data to be reasoned over as part of model inference, as in the case of retrieval-augmented generation, or [RAG](/docs/concepts/rag) (see our RAG tutorial [here](/docs/tutorials/rag)).

Here we will build a search engine over a PDF document. This will allow us to retrieve passages in the PDF that are similar to an input query.

## Concepts

This guide focuses on retrieval of text data. We will cover the following concepts:

- Documents and document loaders;
- Text splitters;
- Embeddings;
- Vector stores and retrievers.

## Setup

### Jupyter Notebook

This and other tutorials are perhaps most conveniently run in a Jupyter notebook. See [here](https://jupyter.org/install) for instructions on how to install.

### Installation

This tutorial requires the `langchain-community` and `pypdf` packages:

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import CodeBlock from "@theme/CodeBlock";

<Tabs>
  <TabItem value="pip" label="Pip" default>
    <CodeBlock language="bash">pip install langchain-community pypdf</CodeBlock>
  </TabItem>
  <TabItem value="conda" label="Conda">
    <CodeBlock language="bash">conda install langchain-community pypdf -c conda-forge</CodeBlock>
  </TabItem>
</Tabs>


For more details, see our [Installation guide](/docs/how_to/installation).

### LangSmith

Many of the applications you build with LangChain will contain multiple steps with multiple invocations of LLM calls.
As these applications get more and more complex, it becomes crucial to be able to inspect what exactly is going on inside your chain or agent.
The best way to do this is with [LangSmith](https://smith.langchain.com).

After you sign up at the link above, make sure to set your environment variables to start logging traces:

```shell
export LANGSMITH_TRACING="true"
export LANGSMITH_API_KEY="..."
```

Or, if in a notebook, you can set them with:

```python
import getpass
import os

os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = getpass.getpass()
```


## Documents and Document Loaders

LangChain implements a [Document](https://python.langchain.com/api_reference/core/documents/langchain_core.documents.base.Document.html) abstraction, which is intended to represent a unit of text and associated metadata. It has three attributes:

- `page_content`: a string representing the content;
- `metadata`: a dict containing arbitrary metadata;
- `id`: (optional) a string identifier for the document.

The `metadata` attribute can capture information about the source of the document, its relationship to other documents, and other information. Note that an individual `Document` object often represents a chunk of a larger document.

We can generate sample documents when desired:
```python
from langchain_core.documents import Document

documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"source": "mammal-pets-doc"},
    ),
]
```

In [1]:
# Import the dotenv library to load environment variables
from dotenv import load_dotenv
import os

# Load the environment variables from .env file
load_dotenv()

# Verify that the environment variables are loaded correctly
# We'll check if they exist without printing the actual values for security
env_vars = [
    "LANGSMITH_TRACING",
    "LANGSMITH_ENDPOINT",
    "LANGSMITH_API_KEY",
    "LANGSMITH_PROJECT",
    "OPENAI_API_KEY"
]

for var in env_vars:
    # Print whether each variable is set or not
    print(f"{var} is {'set' if os.getenv(var) else 'not set'}")

LANGSMITH_TRACING is set
LANGSMITH_ENDPOINT is set
LANGSMITH_API_KEY is set
LANGSMITH_PROJECT is set
OPENAI_API_KEY is set


In [3]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI()
llm.invoke("Hello, world!")

AIMessage(content='Hello! How are you today?', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 8, 'prompt_tokens': 11, 'total_tokens': 19, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-54b87658-741f-4a9d-9f0e-924e4ad1a494-0', usage_metadata={'input_tokens': 11, 'output_tokens': 8, 'total_tokens': 19, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}})

However, the LangChain ecosystem implements [document loaders](/docs/concepts/document_loaders) that [integrate with hundreds of common sources](/docs/integrations/document_loaders/). This makes it easy to incorporate data from these sources into your AI application.

### Loading documents

Let's load a PDF into a sequence of `Document` objects. There is a sample PDF in the LangChain repo [here](https://github.com/langchain-ai/langchain/tree/master/docs/docs/example_data) -- a 10-k filing for Nike from 2023. We can consult the LangChain documentation for [available PDF document loaders](/docs/integrations/document_loaders/#pdfs). Let's select [PyPDFLoader](/docs/integrations/document_loaders/pypdfloader/), which is fairly lightweight.

In [4]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "../example_data/wot-1.pdf"
loader = PyPDFLoader(file_path)

docs = loader.load()

print(len(docs))

518


In [5]:
docs

[Document(metadata={'source': '../example_data/wot-1.pdf', 'page': 0, 'page_label': '1'}, page_content=''),
 Document(metadata={'source': '../example_data/wot-1.pdf', 'page': 1, 'page_label': '2'}, page_content='“\nThe Eye of the World\n is the best of its genre.”\n—\nThe Ottawa Citizen\n“A splendid tale of heroic fantasy, vast in scope, colorful in detail, and convincing in its\npresentation of human character and personality.”\n—L. Sprague De Camp\n“This richly detailed fantasy presents fully realized, complex adventure. Recommended.”\n—\nLibrary Journal\n“This one is as solid as a steel blade and glowing with the true magic. Robert Jordan deserves\ncongratulations.”\n—Fred Saberhagen\n“One hell of a story. [It] kept me up past my bedtime for three nights running—and it’s been a\nlong time since a novel’s done \nthat.\n”\n—Baird Searles,\nIsaac Asimov’s Science Fiction Magazine\n“A future collector’s item. Jordan has brought out a completely new allegory in a fantasy\nconcept that go

:::tip

See [this guide](/docs/how_to/document_loader_pdf/) for more detail on PDF document loaders.

:::

`PyPDFLoader` loads one `Document` object per PDF page. For each, we can easily access:

- The string content of the page;
- Metadata containing the file name and page number.

In [8]:
print(f"{docs[8].page_content[:200]}\n")
print(docs[0].metadata)

PROLOGUE
 
Dragonmount
 
 
T
he palace still shook occasionally as the earth rumbled in memory, groaned as if it would
deny what had happened. Bars of sunlight cast through rents in the walls made mot

{'source': '../example_data/wot-1.pdf', 'page': 0, 'page_label': '1'}


### Splitting

For both information retrieval and downstream question-answering purposes, a page may be too coarse a representation. Our goal in the end will be to retrieve `Document` objects that answer an input query, and further splitting our PDF will help ensure that the meanings of relevant portions of the document are not "washed out" by surrounding text.

We can use [text splitters](/docs/concepts/text_splitters) for this purpose. Here we will use a simple text splitter that partitions based on characters. We will split our documents into chunks of 1000 characters
with 200 characters of overlap between chunks. The overlap helps
mitigate the possibility of separating a statement from important
context related to it. We use the
[RecursiveCharacterTextSplitter](/docs/how_to/recursive_text_splitter),
which will recursively split the document using common separators like
new lines until each chunk is the appropriate size. This is the
recommended text splitter for generic text use cases.

We set `add_start_index=True` so that the character index where each
split Document starts within the initial Document is preserved as
metadata attribute “start_index”.

See [this guide](/docs/how_to/document_loader_pdf/) for more detail about working with PDFs, including how to extract text from specific sections and images. 

In [9]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

len(all_splits)

2285

In [8]:
all_splits

[Document(metadata={'source': '../example_data/nke-10k-2023.pdf', 'page': 0, 'page_label': '1', 'start_index': 0}, page_content="Table of Contents\nUNITED STATES\nSECURITIES AND EXCHANGE COMMISSION\nWashington, D.C. 20549\nFORM 10-K\n(Mark One)\n☑  ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934\nFOR THE FISCAL YEAR ENDED MAY 31, 2023\nOR\n☐  TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934\nFOR THE TRANSITION PERIOD FROM                         TO                         .\nCommission File No. 1-10635\nNIKE, Inc.\n(Exact name of Registrant as specified in its charter)\nOregon 93-0584541\n(State or other jurisdiction of incorporation) (IRS Employer Identification No.)\nOne Bowerman Drive, Beaverton, Oregon 97005-6453\n(Address of principal executive offices and zip code)\n(503) 671-6453\n(Registrant's telephone number, including area code)\nSECURITIES REGISTERED PURSUANT TO SECTION 12(B) OF THE ACT:\nClass B C

In [15]:
all_splits[100]

Document(metadata={'source': '../example_data/wot-1.pdf', 'page': 30, 'page_label': '31', 'start_index': 778}, page_content='“to be mistrusted in the best of times.”\nWith a shrill cry the raven launched itself into the air so violently that two black feathers drifted\ndown from the roof’s edge.\nStartled, Rand and Mat twisted to follow the bird’s swift flight, over the Green and toward the\ncloud-tipped Mountains of Mist, tall beyond the Westwood, until it dwindled to a speck in the west,\nthen vanished from view.\nRand’s gaze fell to the woman who had spoken. She, too, had been watching the flight of the\nraven, but now she turned back, and her eyes met his. He could only stare. This had to be the Lady\nMoiraine, and she was everything that Mat and Ewin had said, everything and more.\nWhen he had heard she called Nynaeve child, he had pictured her as old, but she was not. At\nleast, he could not put any age to her at all. At first he thought she was as young as Nynaeve, but the\nlong

In [12]:
# | output: false
# | echo: false

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
print(f"Current embedding model: {embeddings.model}")

Current embedding model: text-embedding-3-large


In [13]:
print("Embedding configuration:")
for key, value in embeddings.__dict__.items():
    print(f"{key}: {value}")

Embedding configuration:
client: <openai.resources.embeddings.Embeddings object at 0x10da11510>
async_client: <openai.resources.embeddings.AsyncEmbeddings object at 0x10ef86950>
model: text-embedding-3-large
dimensions: None
deployment: text-embedding-ada-002
openai_api_version: None
openai_api_base: None
openai_api_type: None
openai_proxy: None
embedding_ctx_length: 8191
openai_api_key: **********
openai_organization: None
allowed_special: None
disallowed_special: None
chunk_size: 1000
max_retries: 2
request_timeout: None
headers: None
tiktoken_enabled: True
tiktoken_model_name: None
show_progress_bar: False
model_kwargs: {}
skip_empty: False
default_headers: None
default_query: None
retry_min_seconds: 4
retry_max_seconds: 20
http_client: None
http_async_client: None
check_embedding_ctx_length: True


In [14]:
vector_1 = embeddings.embed_query(all_splits[0].page_content)
vector_2 = embeddings.embed_query(all_splits[1].page_content)

assert len(vector_1) == len(vector_2)
print(f"Generated vectors of length {len(vector_1)}\n")
print(vector_1[:10])

Generated vectors of length 3072

[0.009298405610024929, -0.01608305238187313, 0.00028412663959898055, 0.0064094592817127705, 0.020547788590192795, -0.03926966339349747, -0.007359934970736504, 0.04102053865790367, -0.008072791621088982, 0.05998003110289574]


Armed with a model for generating text embeddings, we can next store them in a special data structure that supports efficient similarity search.

## Vector stores

LangChain [VectorStore](https://python.langchain.com/api_reference/core/vectorstores/langchain_core.vectorstores.base.VectorStore.html) objects contain methods for adding text and `Document` objects to the store, and querying them using various similarity metrics. They are often initialized with [embedding](/docs/how_to/embed_text) models, which determine how text data is translated to numeric vectors.

LangChain includes a suite of [integrations](/docs/integrations/vectorstores) with different vector store technologies. Some vector stores are hosted by a provider (e.g., various cloud providers) and require specific credentials to use; some (such as [Postgres](/docs/integrations/vectorstores/pgvector)) run in separate infrastructure that can be run locally or via a third-party; others can run in-memory for lightweight workloads. Let's select a vector store:

import VectorStoreTabs from "@theme/VectorStoreTabs";

<VectorStoreTabs/>

In [6]:
from langchain_milvus import Milvus

# The easiest way is to use Milvus Lite where everything is stored in a local file.
# If you have a Milvus server you can use the server URI such as "http://localhost:19530".
URI = "./milvus_example.db"

vector_store = Milvus(
    embedding_function=embeddings,
    connection_args={"uri": URI},
)

Having instantiated our vector store, we can now index the documents.

In [7]:
ids = vector_store.add_documents(documents=all_splits)

Note that most vector store implementations will allow you to connect to an existing vector store--  e.g., by providing a client, index name, or other information. See the documentation for a specific [integration](/docs/integrations/vectorstores) for more detail.

Once we've instantiated a `VectorStore` that contains documents, we can query it. [VectorStore](https://python.langchain.com/api_reference/core/vectorstores/langchain_core.vectorstores.base.VectorStore.html) includes methods for querying:
- Synchronously and asynchronously;
- By string query and by vector;
- With and without returning similarity scores;
- By similarity and [maximum marginal relevance](https://python.langchain.com/api_reference/core/vectorstores/langchain_core.vectorstores.base.VectorStore.html#langchain_core.vectorstores.base.VectorStore.max_marginal_relevance_search) (to balance similarity with query to diversity in retrieved results).

The methods will generally include a list of [Document](https://python.langchain.com/api_reference/core/documents/langchain_core.documents.base.Document.html#langchain_core.documents.base.Document) objects in their outputs.

### Usage

Embeddings typically represent text as a "dense" vector such that texts with similar meanings are geometrically close. This lets us retrieve relevant information just by passing in a question, without knowledge of any specific key-terms used in the document.

Return documents based on similarity to a string query:

In [8]:
results = vector_store.similarity_search(
    "How many distribution centers does Nike have in the US?"
)

print(results[0])

page_content='direct to consumer operations sell products through the following number of retail stores in the United States:
U.S. RETAIL STORES NUMBER
NIKE Brand factory stores 213 
NIKE Brand in-line stores (including employee-only stores) 74 
Converse stores (including factory stores) 82 
TOTAL 369 
In the United States, NIKE has eight significant distribution centers. Refer to Item 2. Properties for further information.
2023 FORM 10-K 2' metadata={'page': 4, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 3125}


Async query:

In [9]:
results = await vector_store.asimilarity_search("When was Nike incorporated?")

print(results[0])

page_content='Table of Contents
PART I
ITEM 1. BUSINESS
GENERAL
NIKE, Inc. was incorporated in 1967 under the laws of the State of Oregon. As used in this Annual Report on Form 10-K (this "Annual Report"), the terms "we," "us," "our,"
"NIKE" and the "Company" refer to NIKE, Inc. and its predecessors, subsidiaries and affiliates, collectively, unless the context indicates otherwise.
Our principal business activity is the design, development and worldwide marketing and selling of athletic footwear, apparel, equipment, accessories and services. NIKE is
the largest seller of athletic footwear and apparel in the world. We sell our products through NIKE Direct operations, which are comprised of both NIKE-owned retail stores
and sales through our digital platforms (also referred to as "NIKE Brand Digital"), to retail accounts and to a mix of independent distributors, licensees and sales' metadata={'page': 3, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 0}


Return scores:

In [11]:
# Note that providers implement different scores; the score here
# is a distance metric that varies inversely with similarity.

results = vector_store.similarity_search_with_score("What was Nike's revenue in 2023?")
doc, score = results[0]
print(f"Score: {score}\n")
print(doc)

Score: 0.23699893057346344

page_content='Table of Contents
FISCAL 2023 NIKE BRAND REVENUE HIGHLIGHTS
The following tables present NIKE Brand revenues disaggregated by reportable operating segment, distribution channel and major product line:
FISCAL 2023 COMPARED TO FISCAL 2022
•NIKE, Inc. Revenues were $51.2 billion in fiscal 2023, which increased 10% and 16% compared to fiscal 2022 on a reported and currency-neutral basis, respectively.
The increase was due to higher revenues in North America, Europe, Middle East & Africa ("EMEA"), APLA and Greater China, which contributed approximately 7, 6,
2 and 1 percentage points to NIKE, Inc. Revenues, respectively.
•NIKE Brand revenues, which represented over 90% of NIKE, Inc. Revenues, increased 10% and 16% on a reported and currency-neutral basis, respectively. This
increase was primarily due to higher revenues in Men's, the Jordan Brand, Women's and Kids' which grew 17%, 35%,11% and 10%, respectively, on a wholesale
equivalent basis.' metad

Return documents based on similarity to an embedded query:

In [11]:
embedding = embeddings.embed_query("How were Nike's margins impacted in 2023?")

results = vector_store.similarity_search_by_vector(embedding)
print(results[0])

page_content='Table of Contents
GROSS MARGIN
FISCAL 2023 COMPARED TO FISCAL 2022
For fiscal 2023, our consolidated gross profit increased 4% to $22,292 million compared to $21,479 million for fiscal 2022. Gross margin decreased 250 basis points to
43.5% for fiscal 2023 compared to 46.0% for fiscal 2022 due to the following:
*Wholesale equivalent
The decrease in gross margin for fiscal 2023 was primarily due to:
•Higher NIKE Brand product costs, on a wholesale equivalent basis, primarily due to higher input costs and elevated inbound freight and logistics costs as well as
product mix;
•Lower margin in our NIKE Direct business, driven by higher promotional activity to liquidate inventory in the current period compared to lower promotional activity in
the prior period resulting from lower available inventory supply;
•Unfavorable changes in net foreign currency exchange rates, including hedges; and
•Lower off-price margin, on a wholesale equivalent basis.
This was partially offset by:' met

Learn more:

- [API reference](https://python.langchain.com/api_reference/core/vectorstores/langchain_core.vectorstores.base.VectorStore.html)
- [How-to guide](/docs/how_to/vectorstores)
- [Integration-specific docs](/docs/integrations/vectorstores)

## Retrievers

LangChain `VectorStore` objects do not subclass [Runnable](https://python.langchain.com/api_reference/core/index.html#langchain-core-runnables). LangChain [Retrievers](https://python.langchain.com/api_reference/core/index.html#langchain-core-retrievers) are Runnables, so they implement a standard set of methods (e.g., synchronous and asynchronous `invoke` and `batch` operations). Although we can construct retrievers from vector stores, retrievers can interface with non-vector store sources of data, as well (such as external APIs).

We can create a simple version of this ourselves, without subclassing `Retriever`. If we choose what method we wish to use to retrieve documents, we can create a runnable easily. Below we will build one around the `similarity_search` method:

In [14]:
from typing import List

from langchain_core.documents import Document
from langchain_core.runnables import chain


@chain
def retriever(query: str) -> List[Document]:
    return vector_store.similarity_search(query, k=1)


retriever.batch(
    [
        "How many distribution centers does Nike have in the US?",
        "When was Nike incorporated?",
    ],
)

[[Document(metadata={'page': 4, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 3125}, page_content='direct to consumer operations sell products through the following number of retail stores in the United States:\nU.S. RETAIL STORES NUMBER\nNIKE Brand factory stores 213 \nNIKE Brand in-line stores (including employee-only stores) 74 \nConverse stores (including factory stores) 82 \nTOTAL 369 \nIn the United States, NIKE has eight significant distribution centers. Refer to Item 2. Properties for further information.\n2023 FORM 10-K 2')],
 [Document(metadata={'page': 3, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 0}, page_content='Table of Contents\nPART I\nITEM 1. BUSINESS\nGENERAL\nNIKE, Inc. was incorporated in 1967 under the laws of the State of Oregon. As used in this Annual Report on Form 10-K (this "Annual Report"), the terms "we," "us," "our,"\n"NIKE" and the "Company" refer to NIKE, Inc. and its predecessors, subsidiaries and affiliates, collectivel

Vectorstores implement an `as_retriever` method that will generate a Retriever, specifically a [VectorStoreRetriever](https://python.langchain.com/api_reference/core/vectorstores/langchain_core.vectorstores.base.VectorStoreRetriever.html). These retrievers include specific `search_type` and `search_kwargs` attributes that identify what methods of the underlying vector store to call, and how to parameterize them. For instance, we can replicate the above with the following:

In [13]:
retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1},
)

retriever.batch(
    [
        "How many distribution centers does Nike have in the US?",
        "When was Nike incorporated?",
    ],
)

[[Document(metadata={'page': 4, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 3125}, page_content='direct to consumer operations sell products through the following number of retail stores in the United States:\nU.S. RETAIL STORES NUMBER\nNIKE Brand factory stores 213 \nNIKE Brand in-line stores (including employee-only stores) 74 \nConverse stores (including factory stores) 82 \nTOTAL 369 \nIn the United States, NIKE has eight significant distribution centers. Refer to Item 2. Properties for further information.\n2023 FORM 10-K 2')],
 [Document(metadata={'page': 3, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 0}, page_content='Table of Contents\nPART I\nITEM 1. BUSINESS\nGENERAL\nNIKE, Inc. was incorporated in 1967 under the laws of the State of Oregon. As used in this Annual Report on Form 10-K (this "Annual Report"), the terms "we," "us," "our,"\n"NIKE" and the "Company" refer to NIKE, Inc. and its predecessors, subsidiaries and affiliates, collectivel

`VectorStoreRetriever` supports search types of `"similarity"` (default), `"mmr"` (maximum marginal relevance, described above), and `"similarity_score_threshold"`. We can use the latter to threshold documents output by the retriever by similarity score.

Retrievers can easily be incorporated into more complex applications, such as [retrieval-augmented generation (RAG)](/docs/concepts/rag) applications that combine a given question with retrieved context into a prompt for a LLM. To learn more about building such an application, check out the [RAG tutorial](/docs/tutorials/rag) tutorial.

### Learn more:

Retrieval strategies can be rich and complex. For example:

- We can [infer hard rules and filters](/docs/how_to/self_query/) from a query (e.g., "using documents published after 2020");
- We can [return documents that are linked](/docs/how_to/parent_document_retriever/) to the retrieved context in some way (e.g., via some document taxonomy);
- We can generate [multiple embeddings](/docs/how_to/multi_vector) for each unit of context;
- We can [ensemble results](/docs/how_to/ensemble_retriever) from multiple retrievers;
- We can assign weights to documents, e.g., to weigh [recent documents](/docs/how_to/time_weighted_vectorstore/) higher.

The [retrievers](/docs/how_to#retrievers) section of the how-to guides covers these and other built-in retrieval strategies.

It is also straightforward to extend the [BaseRetriever](https://python.langchain.com/api_reference/core/retrievers/langchain_core.retrievers.BaseRetriever.html) class in order to implement custom retrievers. See our how-to guide [here](/docs/how_to/custom_retriever).


## Next steps

You've now seen how to build a semantic search engine over a PDF document.

For more on document loaders:

- [Conceptual guide](/docs/concepts/document_loaders)
- [How-to guides](/docs/how_to/#document-loaders)
- [Available integrations](/docs/integrations/document_loaders/)

For more on embeddings:

- [Conceptual guide](/docs/concepts/embedding_models/)
- [How-to guides](/docs/how_to/#embedding-models)
- [Available integrations](/docs/integrations/text_embedding/)

For more on vector stores:

- [Conceptual guide](/docs/concepts/vectorstores/)
- [How-to guides](/docs/how_to/#vector-stores)
- [Available integrations](/docs/integrations/vectorstores/)

For more on RAG, see:

- [Build a Retrieval Augmented Generation (RAG) App](/docs/tutorials/rag/)
- [Related how-to guides](/docs/how_to/#qa-with-rag)

In [3]:
from dotenv import load_dotenv
import os
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_milvus import Milvus

# Load environment variables
load_dotenv()

# Load and process the PDF
file_path = "../example_data/wot-1.pdf"
loader = PyPDFLoader(file_path)
docs = loader.load()

# Split the documents
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

# Create embeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

# Connect to Milvus
vector_store = Milvus(
    embedding_function=embeddings,
    connection_args={
        "uri": "http://localhost:19530",  # Using port-forwarded address
        "collection_name": "wot_vectors"  # Name for your vector collection
    }, auto_id=True
)

# Add documents to the vector store
vector_store.add_documents(all_splits)

# Verify insertion with a simple similarity search
query = "Test query to verify vectors are searchable"
results = vector_store.similarity_search(query, k=3)
print("Search results:", results)

Search results: [Document(metadata={'pk': 455557701328443538}, page_content='This is a test document three.'), Document(metadata={'pk': 455557701328443537}, page_content='This is a test document two.'), Document(metadata={'pk': 455557701328443536}, page_content='This is a test document one.')]


In [6]:
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_milvus import Milvus
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
import time

class RAGQueryEngine:
    def __init__(self, collection_name="test_vectors", similarity_top_k=4, score_threshold=0.7):
        """Initialize the RAG Query Engine with configurable parameters.
        
        Args:
            collection_name: Name of the Milvus collection to query
            similarity_top_k: Number of similar documents to retrieve
            score_threshold: Minimum similarity score to consider (0-1)
        """
        self.embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
        self.vector_store = Milvus(
            embedding_function=self.embeddings,
            connection_args={
                "uri": "http://localhost:19530",
                "collection_name": collection_name
            },
            auto_id=True
        )
        
        # Initialize GPT-4 with specific settings for better control
        self.llm = ChatOpenAI(
            model="gpt-4o",
            temperature=0.3,  # Lower temperature for more focused answers
            max_tokens=1000
        )
        
        self.similarity_top_k = similarity_top_k
        self.score_threshold = score_threshold
        
        # Create a custom prompt template that enforces using only provided context
        self.qa_prompt = PromptTemplate(
            template="""You are a helpful AI assistant tasked with answering questions based ONLY on the provided context. 
            If the context doesn't contain enough information to answer the question fully, say so explicitly.
            Do not use any knowledge outside of the given context.

            Context: {context}

            Question: {question}

            Please provide a detailed answer based solely on the above context. If you're unsure or if the context lacks sufficient information, 
            say so clearly. Include specific quotes or references from the context to support your answer.

            Answer: """,
            input_variables=["context", "question"]
        )

    def search_with_metadata(self, query, return_raw=False):
        """Perform a similarity search and return documents with their metadata."""
        docs_and_scores = self.vector_store.similarity_search_with_score(
            query, 
            k=self.similarity_top_k
        )
        
        if return_raw:
            return docs_and_scores
        
        # Format results for analysis
        results = []
        for doc, score in docs_and_scores:
            results.append({
                'content': doc.page_content,
                'metadata': doc.metadata,
                'similarity_score': score
            })
        return results

    def answer_question(self, question, debug=False):
        """Answer a question using RAG with optional debug information."""
        start_time = time.time()
        
        # First, let's get the relevant documents with scores for analysis
        docs_and_scores = self.search_with_metadata(question, return_raw=True)
        
        # Create a retrieval QA chain
        qa_chain = RetrievalQA.from_chain_type(
            llm=self.llm,
            chain_type="stuff",  # 'stuff' method: concatenate all docs into one prompt
            retriever=self.vector_store.as_retriever(
                search_kwargs={"k": self.similarity_top_k}
            ),
            chain_type_kwargs={
                "prompt": self.qa_prompt,
                "verbose": debug
            },
            return_source_documents=True
        )
        
        # Get the answer
        result = qa_chain({"query": question})
        
        end_time = time.time()
        
        if debug:
            print("\n=== Debug Information ===")
            print(f"Time taken: {end_time - start_time:.2f} seconds")
            print("\nRetrieved Documents:")
            for i, (doc, score) in enumerate(docs_and_scores, 1):
                print(f"\nDocument {i} (Score: {score:.4f}):")
                print(f"Content: {doc.page_content[:200]}...")
                print(f"Metadata: {doc.metadata}")
            
            print("\n=== Answer ===")
        
        return {
            'answer': result['result'],
            'source_documents': result['source_documents'],
            'execution_time': end_time - start_time
        }

    def tune_parameters(self, question, parameter_sets):
        """Try different parameter combinations for the same question."""
        results = []
        original_top_k = self.similarity_top_k
        original_threshold = self.score_threshold
        
        for params in parameter_sets:
            self.similarity_top_k = params.get('top_k', original_top_k)
            self.score_threshold = params.get('threshold', original_threshold)
            
            result = self.answer_question(question, debug=True)
            results.append({
                'parameters': params,
                'result': result
            })
        
        # Reset to original parameters
        self.similarity_top_k = original_top_k
        self.score_threshold = original_threshold
        
        return results

# Example usage:
def main():
    # Initialize the query engine
    rag_engine = RAGQueryEngine()
    
    # Example question
    question = "Where is the place that first time Rand al'Thor experiences a dream involving Ishamael?"
    
    # Get answer with debug information
    print("Getting answer with default parameters...")
    result = rag_engine.answer_question(question, debug=True)
    print("\nAnswer:", result['answer'])
    
    # Try different parameter combinations
    print("\nTuning parameters...")
    parameter_sets = [
        {'top_k': 3, 'threshold': 0.6},
        {'top_k': 5, 'threshold': 0.8},
        {'top_k': 7, 'threshold': 0.7}
    ]
    
    tuning_results = rag_engine.tune_parameters(question, parameter_sets)
    
    # Analyze tuning results
    for i, res in enumerate(tuning_results, 1):
        print(f"\nTrial {i}:")
        print(f"Parameters: {res['parameters']}")
        print(f"Answer: {res['result']['answer'][:200]}...")

if __name__ == "__main__":
    main()

Getting answer with default parameters...


  result = qa_chain({"query": question})




[1m> Entering new StuffDocumentsChain chain...[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are a helpful AI assistant tasked with answering questions based ONLY on the provided context. 
            If the context doesn't contain enough information to answer the question fully, say so explicitly.
            Do not use any knowledge outside of the given context.

            Context: casually tucked the dagger under his pillow, too. Rand blew out the candle and crawled into his own
bed. He could feel the wrongness from the other bed, not from Mat, but from beneath his pillow. He
was still worrying about it when sleep came.
From the first he knew it was a dream, one of those dreams that was not entirely dream. He
stood staring at the wooden door, its surface dark and cracked and rough with splinters. The air was
cold and dank, thick with the smell of decay. In the distance water dripped, the splashes hollow
echoes down stone corridors.
Den

In [11]:
# Create an instance
rag_engine = RAGQueryEngine()

# Simple question answering
result = rag_engine.answer_question(
    "What is Manetheren?",
    debug=True  # Set to True to see detailed information
)



[1m> Entering new StuffDocumentsChain chain...[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are a helpful AI assistant tasked with answering questions based ONLY on the provided context. 
            If the context doesn't contain enough information to answer the question fully, say so explicitly.
            Do not use any knowledge outside of the given context.

            Context: memory. Weep, for the loss of their blood.”
She fell silent then, but no one spoke. Rand was as bound as the others in the spell she had
created. When she spoke again, he drank it in, and so did the rest.
“For nearly two centuries the Trolloc Wars had ravaged the length and breadth of the world, and
wherever battles raged, the Red Eagle banner of Manetheren was in the forefront. The men of
Manetheren were a thorn to the Dark One’s foot and a bramble to his hand. Sing of Manetheren, that
would never bend knee to the Shadow. Sing of Manetheren, the sword that c

In [14]:
print(result['answer'])

Manetheren was a kingdom known for its bravery and resilience against the forces of the Dark One during the Trolloc Wars. The context describes Manetheren as a place where the Red Eagle banner was always at the forefront of battles, indicating its significant role in resisting the Shadow. The people of Manetheren were described as "a thorn to the Dark One’s foot and a bramble to his hand," emphasizing their fierce opposition to evil forces. The kingdom was led by King Aemon al Caar al Thorin and Queen Eldrene ay Ellan ay Carlan, both of whom were renowned for their courage and beauty. 

The context also notes that Manetheren was ultimately destroyed, with its great city and villages consumed by fire, leaving nothing but memories and a legacy of courage. Despite this destruction, the people remained bound to their land by ties stronger than steel, even as the reasons for their steadfastness faded from memory over time. The narrative concludes with a lament for the loss of Manetheren, ur