# Vector DB and similarity search

Author: Pavel Agurov, pavel_agurov@epam.com

In the previous notebooks we extracted plain text from the data source, split it into chunks and vectorized them. The next step is to find the relevant chunks.

The simplest way is to compare the query with every chunk one by one, but with many chunks this takes a long time and is not a good production solution. So we should use search algorithms adapted for vectors, for example k-NN (k-nearest neighbors) or ANN (approximate nearest neighbors). You can find details here: https://sudhiryelikar.com/articles/109-understanding-similarity-search-and-vector-databases.
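To make the "check each chunk one by one" approach concrete, here is a minimal sketch of a brute-force top-k search. It uses tiny hand-written vectors instead of real embeddings, and the names `brute_force_top_k`, `toy_vectors` and `toy_query` are made up for this illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def brute_force_top_k(query, vectors, k):
    """Score every vector against the query and return the k best (index, score) pairs."""
    scored = [(index, cosine_similarity(query, vector)) for index, vector in enumerate(vectors)]
    scored.sort(key=lambda item: item[1], reverse=True)  # most similar first
    return scored[:k]

# Toy stand-ins for chunk embeddings (real embeddings have many more dimensions)
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [1.0, 0.05, 0.0]

top_2 = brute_force_top_k(toy_query, toy_vectors, k=2)
print(top_2)
```

Every query touches every vector, so the cost grows linearly with the number of chunks. ANN indexes trade a little accuracy for avoiding that full scan.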

There is a special type of database in which similarity search (e.g. cosine similarity calculation) is built in. Let's see how it works.

In [1]:
%pip install openai > /dev/null
%pip install tiktoken > /dev/null
%pip install langchain > /dev/null
%pip install langchain_openai > /dev/null
%pip install langchain_core > /dev/null
%pip install langchain_community > /dev/null
%pip install langchain_text_splitters > /dev/null

Note: you may need to restart the kernel to use updated packages.

First, we create chunks as we did before.

In [1]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf"

loader = PyPDFLoader(file_path)
chunks = loader.load_and_split()
print(len(chunks))

16


Please note: LangChain returns objects of class Document, and we can add our own metadata to the metadata dictionary. This metadata is also uploaded to the vector DB and can be used later for search.

In [2]:
for chunk in chunks:
    # just add a custom metadata field with the length of the page content
    # no specific reason, just to show how to add custom metadata
    chunk.metadata['custom'] = len(chunk.page_content)

We will need an embedding model; let's use Ada (text-embedding-ada-002) from OpenAI.

In [3]:
import os
from langchain_openai.embeddings.azure import AzureOpenAIEmbeddings

# your GPT key should be in OPENAI_API_KEY environment variable
# from langchain_openai import OpenAIEmbeddings
# embedding_model = OpenAIEmbeddings(
#     model          ="text-embedding-ada-002",
#     api_key        = os.environ['OPENAI_API_KEY'],
# )

embedding_model = AzureOpenAIEmbeddings(
    model          ="text-embedding-ada-002",
    api_key        = os.environ['OPENAI_API_KEY'],
    azure_endpoint = "https://ai-proxy.lab.epam.com",
)

We will store our request in the search_query variable.

In [4]:
search_query = "financial results"

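Let's compute the cosine similarity and the Euclidean distance manually for all chunks. Please note: cosine similarity is not the same as Euclidean distance (a higher similarity means closer, while a lower distance means closer), but for normalized vectors both give the same ranking. In the code below the variables are named cosine_distance, but they hold cosine similarity values.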

In [9]:
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

def distance_sentences(emd, s1, s2, distance):
    v1 = emd.embed_documents([s1])[0]
    v2 = emd.embed_documents([s2])[0]
    return distance([v1], [v2])[0][0]

print("Index, Cosine, Euclidean")
distance_array = []
for chunk_index, chunk in enumerate(chunks):
    cosine_distance = distance_sentences(embedding_model, chunk.page_content, search_query, cosine_similarity)
    euclidean_distance = distance_sentences(embedding_model, chunk.page_content, search_query, euclidean_distances)
    distance_array.append((chunk_index, cosine_distance, euclidean_distance))
    print(f"{chunk_index}: {cosine_distance:.3f}, {euclidean_distance:.3f}")

Index, Cosine, Euclidean
0: 0.790, 0.647
1: 0.817, 0.605
2: 0.821, 0.598
3: 0.812, 0.614
4: 0.787, 0.653
5: 0.753, 0.703
6: 0.798, 0.635
7: 0.821, 0.599
8: 0.736, 0.727
9: 0.802, 0.630
10: 0.783, 0.660
11: 0.794, 0.642
12: 0.792, 0.645
13: 0.800, 0.632
14: 0.791, 0.646
15: 0.794, 0.642


In [20]:
# sort by cosine similarity, highest (most similar) first
distance_array.sort(key=lambda x: x[1])
distance_array.reverse()
print("Sorted by cosine distance")
for index, cosine_distance, euclidean_distance in distance_array:
    print(f"{index}: {cosine_distance:.3f}, {euclidean_distance:.3f}")

Sorted by cosine distance
2: 0.821, 0.598
7: 0.821, 0.599
1: 0.817, 0.605
3: 0.812, 0.614
9: 0.802, 0.630
13: 0.800, 0.632
6: 0.798, 0.635
15: 0.794, 0.642
11: 0.794, 0.642
12: 0.792, 0.645
14: 0.791, 0.646
0: 0.790, 0.647
4: 0.787, 0.653
10: 0.783, 0.660
5: 0.753, 0.703
8: 0.736, 0.727


In [21]:
# sort by euclidean distance
distance_array.sort(key=lambda x: x[2])
print("Sorted by euclidean distance")

for index, cosine_distance, euclidean_distance in distance_array:
    print(f"{index}: {cosine_distance:.3f}, {euclidean_distance:.3f}")

Sorted by euclidean distance
2: 0.821, 0.598
7: 0.821, 0.599
1: 0.817, 0.605
3: 0.812, 0.614
9: 0.802, 0.630
13: 0.800, 0.632
6: 0.798, 0.635
15: 0.794, 0.642
11: 0.794, 0.642
12: 0.792, 0.645
14: 0.791, 0.646
0: 0.790, 0.647
4: 0.787, 0.653
10: 0.783, 0.660
5: 0.753, 0.703
8: 0.736, 0.727
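Both orderings above are identical because, for unit-length vectors, the two measures are tied together: ‖a − b‖² = ‖a‖² + ‖b‖² − 2·(a·b) = 2 − 2·cos(a, b). The sketch below checks this identity on two small hand-written vectors and then applies it to two cosine values from the table. It assumes the embeddings are normalized to length 1, which the printed numbers are consistent with. The helper names are illustrative only.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_from_cosine(cosine):
    """Euclidean distance between two unit vectors with the given cosine similarity."""
    return math.sqrt(2.0 - 2.0 * cosine)

# Check the identity on two arbitrary unit vectors
a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])
print(math.isclose(euclidean_distance(a, b), distance_from_cosine(cosine_similarity(a, b))))

# Apply it to the best and worst cosine values from the table above
for cosine in (0.821, 0.736):
    print(f"{cosine:.3f} -> {distance_from_cosine(cosine):.3f}")  # 0.598 and 0.727
```

Because the distance is a strictly decreasing function of the cosine, sorting by highest similarity and sorting by lowest distance always produce the same order for normalized vectors.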


# FAISS

In this example we will use FAISS (https://github.com/facebookresearch/faiss) through its LangChain wrapper.

In [None]:
# Please note that a GPU version of FAISS is also available

%pip install faiss-cpu > /dev/null

In [7]:
from langchain_community.vectorstores.faiss import FAISS

vector_store = FAISS.from_documents(
    documents = chunks,
    embedding = embedding_model
)

In [8]:
# we can save the vector store to disk

VECTORDB_PERSIST_DIRECTORY = '.vector_store/FAISS'
COLLECTION_NAME = 'data'
vector_store.save_local(folder_path = VECTORDB_PERSIST_DIRECTORY, index_name= COLLECTION_NAME)

In [9]:
# and load it back
vector_store = FAISS.load_local(
    folder_path = VECTORDB_PERSIST_DIRECTORY, 
    index_name= COLLECTION_NAME,
    embeddings= embedding_model,
    allow_dangerous_deserialization=True  # part of the store is saved with pickle, so only load files you trust
)

Now we can find the relevant chunks (the parameter k sets the maximum number of chunks to return).

In [32]:
search_results = vector_store.similarity_search_with_relevance_scores(
    query = search_query,
    k     = 20,  # number of results to return
)

print(len(search_results))
for search_result_item in search_results:
    print(f"Source page: {search_result_item[0].metadata['source']}: {search_result_item[0].metadata['page']} Score: {search_result_item[1]:.3f}")

16
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 2 Score: 0.747
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 6 Score: 0.747
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 1 Score: 0.741
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 3 Score: 0.734
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 8 Score: 0.720
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 13 Score: 0.718
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 6 Score: 0.715
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 15 Score: 0.709
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 11

We can define a score threshold to keep only the relevant results.

In [28]:
score_threshold = 0.73

relevant_search_results = [search_result_item for search_result_item in search_results if search_result_item[1] > score_threshold]

print(f"Results with score above {score_threshold} [{len(relevant_search_results)}]:")
for search_result_item in relevant_search_results:
    print(f"Source page: {search_result_item[0].metadata['source']}: {search_result_item[0].metadata['page']} Score: {search_result_item[1]:.3f}")
    print()


Results with score above 0.73 [4]:
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 2 Score: 0.747

Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 6 Score: 0.747

Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 1 Score: 0.741

Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 3 Score: 0.734



If we have more than one document, we can add a simple filter by document name.

In [36]:
search_results = vector_store.similarity_search_with_relevance_scores(
    query = search_query,
    k     = 20,  # number of results to return
    filter = {"source": 'data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf'}
)

print(len(search_results))
for search_result_item in search_results:
    print(f"Source page: {search_result_item[0].metadata['source']}: {search_result_item[0].metadata['page']} Score: {search_result_item[1]:.3f}")

16
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 2 Score: 0.747
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 6 Score: 0.747
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 1 Score: 0.741
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 3 Score: 0.734
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 8 Score: 0.720
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 13 Score: 0.718
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 6 Score: 0.715
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 15 Score: 0.709
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 11

We can also pass the filter as a function. This lets us filter on any metadata field.

In [37]:
def filter_function(doc_metadata):
    # keep only chunks from pages after page 5; chunks without a page number are dropped
    page = doc_metadata.get('page')
    return page is not None and page > 5

search_results = vector_store.similarity_search_with_relevance_scores(
    query = search_query,
    k = 20,
    filter = filter_function
)

print(len(search_results))
for search_result_item in search_results:
    print(f"Source page: {search_result_item[0].metadata['source']}: {search_result_item[0].metadata['page']} Score: {search_result_item[1]:.3f}")

10
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 6 Score: 0.747
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 8 Score: 0.720
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 13 Score: 0.718
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 6 Score: 0.715
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 15 Score: 0.709
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 11 Score: 0.709
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 12 Score: 0.706
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 14 Score: 0.705
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf:
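Since a callable filter only receives the metadata dictionary of a chunk, several conditions can be combined in plain Python. The sketch below builds such a predicate and tries it on hand-written dictionaries rather than on the real vector store. `make_metadata_filter` and its parameters are hypothetical names for this illustration, and `custom` is the length field added to each chunk earlier.

```python
def make_metadata_filter(source=None, page_after=None, min_custom=None):
    """Build a predicate over a chunk's metadata dict; a None argument disables that condition."""
    def metadata_filter(doc_metadata):
        if source is not None and doc_metadata.get("source") != source:
            return False
        if page_after is not None:
            page = doc_metadata.get("page")
            if page is None or page <= page_after:
                return False
        if min_custom is not None:
            custom = doc_metadata.get("custom")
            if custom is None or custom < min_custom:
                return False
        return True
    return metadata_filter

# Pages after 5 whose content is at least 1000 characters long
page_and_length_filter = make_metadata_filter(page_after=5, min_custom=1000)

# Made-up metadata dictionaries, shaped like the ones PyPDFLoader produces
sample_metadata = [
    {"source": "report.pdf", "page": 2, "custom": 2500},
    {"source": "report.pdf", "page": 6, "custom": 800},
    {"source": "report.pdf", "page": 8, "custom": 1800},
    {"source": "report.pdf", "custom": 3000},  # no page number
]

kept_pages = [m["page"] for m in sample_metadata if page_and_length_filter(m)]
print(kept_pages)
```

The resulting function can be passed as the filter argument in the same way as filter_function in the cell above.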

# Qdrant

Qdrant is another vector DB (https://qdrant.tech/qdrant-vector-database/).

In [14]:
%pip install langchain-qdrant


In [16]:
from langchain_community.vectorstores import Qdrant

vector_store = Qdrant.from_documents(
    chunks,
    embedding_model,
    location=":memory:",  # Local mode with in-memory storage only
    collection_name="my_documents",
)

# we can also create a Qdrant index with a persistent storage
# qdrant = Qdrant.from_documents(
#     chunks,
#     embedding_model,
#     path = ...,
#     collection_name= "my_documents",
#     force_recreate=True
# )      


In [66]:
search_results = vector_store.similarity_search_with_score(
    query = search_query,
    k = 20
)

print(len(search_results))
for search_result_item in search_results:
    print(f"Source page: {search_result_item[0].metadata['source']}: {search_result_item[0].metadata['page']} Score: {search_result_item[1]:.3f}")

16
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 2 Score: 0.821
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 6 Score: 0.821
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 1 Score: 0.817
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 3 Score: 0.812
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 8 Score: 0.802
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 13 Score: 0.800
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 6 Score: 0.798
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 15 Score: 0.794
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 11

With Qdrant we can pass the score_threshold parameter directly to the similarity_search_with_score call.

In [26]:
score_threshold = 0.80

search_results = vector_store.similarity_search_with_score(
    query = search_query, 
    k     = 20, 
    score_threshold = score_threshold
)

print(len(search_results))
for search_result_item in search_results:
    print(f"Source page: {search_result_item[0].metadata['source']}: {search_result_item[0].metadata['page']} Score: {search_result_item[1]:.3f}")

6
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 2 Score: 0.821
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 6 Score: 0.821
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 1 Score: 0.817
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 3 Score: 0.812
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 8 Score: 0.802
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 13 Score: 0.800
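The Qdrant scores here (0.821, 0.817, ...) equal the cosine similarities computed manually earlier, while the FAISS relevance scores for the same chunks were lower (0.747, 0.741, ...). The sketch below shows one mapping that reproduces the FAISS numbers: convert the cosine similarity to a squared Euclidean distance (2 − 2·cos for unit vectors) and then apply 1 − d/√2. This assumes that FAISS reports squared L2 distances and that LangChain applies its default Euclidean relevance function. Treat it as an observation about these outputs, not as a guarantee for every version or configuration.

```python
import math

def cosine_to_squared_l2(cosine):
    """Squared Euclidean distance between two unit vectors with the given cosine similarity."""
    return 2.0 - 2.0 * cosine

def euclidean_relevance(distance):
    """Assumed relevance mapping: 1 at distance 0, decreasing as distance grows."""
    return 1.0 - distance / math.sqrt(2)

# Top cosine scores reported by Qdrant above
qdrant_scores = [0.821, 0.817, 0.812, 0.802]

faiss_like_scores = [
    round(euclidean_relevance(cosine_to_squared_l2(cosine)), 3)
    for cosine in qdrant_scores
]
print(faiss_like_scores)
```

The practical consequence is that a score threshold is not portable between stores: 0.73 in the FAISS example and 0.80 here sit on different scales.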


We can also define the filter as a dictionary.

In [78]:
search_results = vector_store.similarity_search_with_score(
    query = search_query, 
    k     = 20,
    filter = {
        "source": 'data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf',
        # "page": 5,  # more conditions can be added here; the output below (one result, page 5) looks like a run with this line enabled
    }
)

print(len(search_results))
for search_result_item in search_results:
    print(f"Source page: {search_result_item[0].metadata['source']}: {search_result_item[0].metadata['page']} Score: {search_result_item[1]:.3f}")

1
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 5 Score: 0.753


A custom filter can also be defined.

In [130]:
from qdrant_client import models as qdrant_models

custom_filter = qdrant_models.Filter(
    must= [
        qdrant_models.FieldCondition(
            key=f"{vector_store.metadata_payload_key}.source",
            match= qdrant_models.MatchValue(value="data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf")
        ),
        qdrant_models.FieldCondition(
            key=f"{vector_store.metadata_payload_key}.custom",
            range= qdrant_models.Range(gte= 1000)
        ),
    ]
)

search_results = vector_store.similarity_search_with_score(
    query  = search_query, 
    k      = 20,
    filter = custom_filter,
)

print(len(search_results))
for search_result_item in search_results:
    print(f"Source page: {search_result_item[0].metadata['source']}: {search_result_item[0].metadata['page']} Score: {search_result_item[1]:.3f}")

12
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 6 Score: 0.821
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 8 Score: 0.802
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 13 Score: 0.800
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 6 Score: 0.798
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 15 Score: 0.794
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 11 Score: 0.794
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 12 Score: 0.792
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf: 14 Score: 0.791
Source page: data/epam-reports-results-for-third-quarter-2023-and-updates-full-year-outlook.pdf:

As we will see later, it is very useful to be able to read all chunks from the collection.

In [151]:
# get all documents
all_documents = vector_store.client.scroll(
    collection_name= vector_store.collection_name,
    offset = None,  # start from the first point; in Qdrant, offset is a point ID, not a row index
    limit  = vector_store.client.get_collection(vector_store.collection_name).points_count,
    with_payload=True,
    with_vectors=False,
)

print(len(all_documents[0]))
for doc in all_documents[0]:
    print(doc)

16
id='00986dcb4e564a488ad16b29dc579067' payload={'page_content': "(d) One -time charges for the three and nine months ended September 30, 2023 include $7.1 million related to the \nCompany's Cost Optimization Program initiated in the third quarter of 2023. Consistent with the Company's historical \nnon-GAAP policy, costs incu rred in connection with formal restructuring initiatives have been excluded from non- GAAP \nresults as these are one -time and unusual in nature.  \n(e) Geographic repositioning includes expenses associated with the relocation to other countries of employees based outside of Ukraine impacted by the war and geopolitical instability in the region, and includes the cost of accommodations, travel and food.  These expenses are incremental to those expenses incurred prior to the crisis, clearly \nseparable from normal operations, and not expected to recur once the crisis has subsided and operations return to \nnormal.  \n(f) As a result of the Company's decision to no

16
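Reading the whole collection in one scroll call works for 16 chunks, but for larger collections it is usually safer to read in batches. A Qdrant scroll call returns a pair: the batch of points and the offset of the next page (None when nothing is left). The sketch below loops until that offset is None. `scroll_all` is a hypothetical helper, and `FakeScrollClient` is a made-up stand-in so the loop can run without a Qdrant instance. With the real store you would pass vector_store.client and vector_store.collection_name instead.

```python
def scroll_all(client, collection_name, batch_size=100):
    """Collect every point of a collection by following the next-page offset."""
    points = []
    offset = None  # None means "start from the beginning"
    while True:
        batch, offset = client.scroll(
            collection_name=collection_name,
            limit=batch_size,
            offset=offset,
            with_payload=True,
            with_vectors=False,
        )
        points.extend(batch)
        if offset is None:  # no further pages
            break
    return points

class FakeScrollClient:
    """Hypothetical stand-in that mimics only the (points, next_offset) return shape of scroll."""
    def __init__(self, points):
        self._points = list(points)

    def scroll(self, collection_name, limit, offset=None, with_payload=True, with_vectors=False):
        start = 0 if offset is None else offset
        batch = self._points[start:start + limit]
        next_offset = start + limit if start + limit < len(self._points) else None
        return batch, next_offset

fake_client = FakeScrollClient([f"chunk-{i}" for i in range(16)])
all_points = scroll_all(fake_client, "my_documents", batch_size=5)
print(len(all_points))
```

In the stand-in the offset is a list index only for simplicity. In Qdrant it is a point ID and should be treated as an opaque value that is passed back unchanged.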