# Benchmark
[Vector Stores](https://python.langchain.com/docs/integrations/vectorstores)

In [1]:
from langchain_community.embeddings import GPT4AllEmbeddings
import markdown_splitter as mdS
import os
import subprocess
import tracemalloc

# data/reusables/open-source/open-source-guide-general.md
query = "Where can I find guidance on creating and nurturing an open source project"

embeddings = GPT4AllEmbeddings()

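folder_name = "Local"

# Clone the GitHub docs repository only when no local copy exists yet.
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

    subprocess.run(["git", "clone", "https://github.com/github/docs.git", folder_name])

# Load the Markdown files from the local copy into documents.
docs = mdS.create_db(path=folder_name, glob="**/*.md", split=None)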

bert_load_from_file: gguf version     = 2
bert_load_from_file: gguf alignment   = 32
bert_load_from_file: gguf data offset = 695552
bert_load_from_file: model name           = BERT
bert_load_from_file: model architecture   = bert
bert_load_from_file: model file type      = 1
bert_load_from_file: bert tokenizer vocab = 30522


100%|█████████▉| 5951/5956 [00:01<00:00, 4892.86it/s]


In [2]:
# print(len(docs)) ; docs = docs[0:2000] ; print(len(docs))
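
Each store below is measured with the same `tracemalloc` start/print/stop pattern. As a minimal sketch, that pattern could be wrapped in one helper; `measure_memory` is a hypothetical name, it is not part of the notebook, and it only covers synchronous constructors such as `from_documents` (the `await ... afrom_documents` cells would need an async variant).

```python
import tracemalloc


def measure_memory(build, *args, **kwargs):
    """Run build(*args, **kwargs) and return (result, current_bytes, peak_bytes)."""
    tracemalloc.start()
    try:
        result = build(*args, **kwargs)
        # get_traced_memory() returns (current, peak) in bytes since start().
        current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, current, peak


# Stand-in workload; in the notebook this would be a vector store constructor.
data, current, peak = measure_memory(lambda n: [float(i) for i in range(n)], 100_000)
print(f"current={current} B, peak={peak} B ({peak / 2**20:.1f} MiB)")
```

`tracemalloc` only sees allocations made through Python's memory allocators, so memory allocated inside native libraries may not show up in these figures.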

### DeepLake

In [3]:
tracemalloc.start()
print(tracemalloc.get_traced_memory())

# Store
from langchain_community.vectorstores import DeepLake as VectorStore
docsearch = await VectorStore.afrom_documents(docs, embeddings)

print(tracemalloc.get_traced_memory())
tracemalloc.stop()

(928, 11633)


Creating 17959 embeddings in 36 batches of size 500:: 100%|██████████| 36/36 [36:35<00:00, 60.98s/it]

Dataset(path='./deeplake/', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype       shape       dtype  compression
  -------    -------     -------     -------  ------- 
   text       text      (17959, 1)     str     None   
 metadata     json      (17959, 1)     str     None   
 embedding  embedding  (17959, 384)  float32   None   
    id        text      (17959, 1)     str     None   
(66911613, 97221769)





In [4]:
%%time
for i in range(10):
    docsearch.similarity_search(query)

docsearch.similarity_search(query)

CPU times: user 9.05 s, sys: 1.48 s, total: 10.5 s
Wall time: 4.58 s


[Document(page_content="For more information on open source, specifically how to create and grow an open source project, we've created [Open Source Guides](https://opensource.guide/) that will help you foster a healthy open source community."),
 Document(page_content="For more information on open source, specifically how to create and grow an open source project, we've created [Open Source Guides](https://opensource.guide/) that will help you foster a healthy open source community by recommending best practices for creating and maintaining repositories for your open source project."),
 Document(page_content='There are a variety of ways that you can contribute to open source projects.', metadata={'Header 2': 'Validating an issue or pull request'}),
 Document(page_content='- "[Your Code of Conduct](https://opensource.guide/code-of-conduct/)" from the Open Source Guides\n- "[Building Welcoming Communities](https://opensource.guide/building-community/)" from the Open Source Guides\n- "[Lea
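
The `%%time` cells report CPU time and wall time for the loop of 10 searches plus the final call, so 11 searches in total. Outside a notebook, a rough equivalent can be sketched with the standard library; `time_calls` is a hypothetical helper and the lambda is a stand-in for `docsearch.similarity_search(query)`.

```python
import time


def time_calls(fn, repeats=11):
    """Call fn() `repeats` times and return (cpu_seconds, wall_seconds)."""
    cpu_start = time.process_time()    # user + system CPU time of this process
    wall_start = time.perf_counter()   # elapsed real time
    for _ in range(repeats):
        fn()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu, wall


# Stand-in for `lambda: docsearch.similarity_search(query)`.
cpu, wall = time_calls(lambda: sum(i * i for i in range(10_000)))
print(f"CPU {cpu:.3f} s, wall {wall:.3f} s, {wall / 11 * 1000:.2f} ms per call")
```

CPU time can exceed wall time, as it does in the cells here, when the work is spread over several threads.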

### Annoy

In [5]:
tracemalloc.start()
print(tracemalloc.get_traced_memory())

# Store
from langchain_community.vectorstores import Annoy as VectorStore
docsearch = await VectorStore.afrom_documents(docs, embeddings)

print(tracemalloc.get_traced_memory())
tracemalloc.stop()

(992, 11697)
(14959921, 280986377)


In [6]:
%%time
for i in range(10):
    docsearch.similarity_search(query)

docsearch.similarity_search(query)

CPU times: user 2.48 s, sys: 19.8 ms, total: 2.5 s
Wall time: 657 ms


[Document(page_content="For more information on open source, specifically how to create and grow an open source project, we've created [Open Source Guides](https://opensource.guide/) that will help you foster a healthy open source community."),
 Document(page_content="For more information on open source, specifically how to create and grow an open source project, we've created [Open Source Guides](https://opensource.guide/) that will help you foster a healthy open source community by recommending best practices for creating and maintaining repositories for your open source project."),
 Document(page_content='- "[Your Code of Conduct](https://opensource.guide/code-of-conduct/)" from the Open Source Guides\n- "[Building Welcoming Communities](https://opensource.guide/building-community/)" from the Open Source Guides\n- "[Leadership and Governance](https://opensource.guide/leadership-and-governance/)" from the Open Source Guides', metadata={'Header 2': 'Further reading'}),
 Document(page_

### Chroma

In [7]:
tracemalloc.start()
print(tracemalloc.get_traced_memory())

# Store
from langchain_community.vectorstores import Chroma as VectorStore
docsearch = await VectorStore.afrom_documents(docs, embeddings)

print(tracemalloc.get_traced_memory())
tracemalloc.stop()

(928, 11633)
(15649323, 312020989)


In [8]:
%%time
for i in range(10):
    docsearch.similarity_search(query)

docsearch.similarity_search(query)

CPU times: user 3.49 s, sys: 37.1 ms, total: 3.53 s
Wall time: 936 ms


[Document(page_content="For more information on open source, specifically how to create and grow an open source project, we've created [Open Source Guides](https://opensource.guide/) that will help you foster a healthy open source community."),
 Document(page_content="For more information on open source, specifically how to create and grow an open source project, we've created [Open Source Guides](https://opensource.guide/) that will help you foster a healthy open source community by recommending best practices for creating and maintaining repositories for your open source project."),
 Document(page_content='There are a variety of ways that you can contribute to open source projects.', metadata={'Header 2': 'Validating an issue or pull request'}),
 Document(page_content='- "[Your Code of Conduct](https://opensource.guide/code-of-conduct/)" from the Open Source Guides\n- "[Building Welcoming Communities](https://opensource.guide/building-community/)" from the Open Source Guides\n- "[Lea

### FAISS

In [9]:
tracemalloc.start()
print(tracemalloc.get_traced_memory())

# Store
from langchain_community.vectorstores import FAISS as VectorStore
docsearch = await VectorStore.afrom_documents(docs, embeddings)

print(tracemalloc.get_traced_memory())
tracemalloc.stop()

(928, 11633)
(246193122, 280880734)


In [10]:
%%time
for i in range(10):
    docsearch.similarity_search(query)

docsearch.similarity_search(query)

CPU times: user 2.33 s, sys: 25 ms, total: 2.35 s
Wall time: 641 ms


[Document(page_content="For more information on open source, specifically how to create and grow an open source project, we've created [Open Source Guides](https://opensource.guide/) that will help you foster a healthy open source community."),
 Document(page_content="For more information on open source, specifically how to create and grow an open source project, we've created [Open Source Guides](https://opensource.guide/) that will help you foster a healthy open source community by recommending best practices for creating and maintaining repositories for your open source project."),
 Document(page_content='There are a variety of ways that you can contribute to open source projects.', metadata={'Header 2': 'Validating an issue or pull request'}),
 Document(page_content='- "[Your Code of Conduct](https://opensource.guide/code-of-conduct/)" from the Open Source Guides\n- "[Building Welcoming Communities](https://opensource.guide/building-community/)" from the Open Source Guides\n- "[Lea

### LanceDB

In [2]:
tracemalloc.start()
print(tracemalloc.get_traced_memory())

# Store
from langchain_community.vectorstores import LanceDB as VectorStore
docsearch = await VectorStore.afrom_documents(docs, embeddings)

print(tracemalloc.get_traced_memory())
tracemalloc.stop()

(928, 11633)
(50379580, 331842358)


In [3]:
%%time
for i in range(10):
    docsearch.similarity_search(query)

docsearch.similarity_search(query)

CPU times: user 3.59 s, sys: 451 ms, total: 4.04 s
Wall time: 1.19 s


[Document(page_content="For more information on open source, specifically how to create and grow an open source project, we've created [Open Source Guides](https://opensource.guide/) that will help you foster a healthy open source community.", metadata={'vector': [0.04793297499418259, -0.0921248272061348, -0.009399300441145897, 0.043598707765340805, 0.1082848459482193, 0.009789415635168552, -0.012989534996449947, 0.08570940792560577, -0.04553161561489105, 0.007563899736851454, -0.022277479991316795, 0.036220841109752655, 0.03735759109258652, -0.06543924659490585, 0.017069373279809952, 0.05354275554418564, -0.07615584880113602, 0.020080016925930977, -0.04739841818809509, -0.06112363934516907, 0.00895136222243309, 0.016217390075325966, 0.0828634649515152, -0.009005185216665268, -0.018939126282930374, 0.02957729995250702, 0.037045180797576904, -0.03912099078297615, 0.04402764141559601, -0.04160189628601074, 0.037625428289175034, 0.03516285866498947, 0.10236716270446777, -0.045192893594503

### Qdrant

In [2]:
tracemalloc.start()
print(tracemalloc.get_traced_memory())

# Store
from langchain_community.vectorstores import Qdrant as VectorStore
docsearch = await VectorStore.afrom_documents(
    docs,
    embeddings,
    path="/tmp/local_qdrant",
    collection_name="my_documents",
)

print(tracemalloc.get_traced_memory())
tracemalloc.stop()

(928, 11633)
(103526131, 142474501)


In [3]:
%%time
for i in range(10):
    docsearch.similarity_search(query)

docsearch.similarity_search(query)

CPU times: user 5.61 s, sys: 1.12 s, total: 6.73 s
Wall time: 1.48 s


[Document(page_content="For more information on open source, specifically how to create and grow an open source project, we've created [Open Source Guides](https://opensource.guide/) that will help you foster a healthy open source community.", metadata={'_id': '917b495d33f54ca1bdba88dc9488077f', '_collection_name': 'my_documents'}),
 Document(page_content="For more information on open source, specifically how to create and grow an open source project, we've created [Open Source Guides](https://opensource.guide/) that will help you foster a healthy open source community by recommending best practices for creating and maintaining repositories for your open source project.", metadata={'_id': '3ceba3ce19704ab39e14f4bd7e2999b6', '_collection_name': 'my_documents'}),
 Document(page_content='There are a variety of ways that you can contribute to open source projects.', metadata={'Header 2': 'Validating an issue or pull request', '_id': '0b6904df39fe45b495be3016d401c276', '_collection_name': '

### SKLearn

In [4]:
tracemalloc.start()
print(tracemalloc.get_traced_memory())

# Store
from langchain_community.vectorstores import SKLearnVectorStore as VectorStore
docsearch = await VectorStore.afrom_documents(docs, embeddings)

print(tracemalloc.get_traced_memory())
tracemalloc.stop()

(928, 11633)
(317917924, 319402979)


In [5]:
%%time
for i in range(10):
    docsearch.similarity_search(query)

docsearch.similarity_search(query)

CPU times: user 6.21 s, sys: 1.42 s, total: 7.63 s
Wall time: 1.55 s


[Document(page_content="For more information on open source, specifically how to create and grow an open source project, we've created [Open Source Guides](https://opensource.guide/) that will help you foster a healthy open source community.", metadata={'id': '94478b70-d171-4cb8-bdb9-0d00a9c2d80f'}),
 Document(page_content="For more information on open source, specifically how to create and grow an open source project, we've created [Open Source Guides](https://opensource.guide/) that will help you foster a healthy open source community by recommending best practices for creating and maintaining repositories for your open source project.", metadata={'id': 'e65d6a8b-78ca-4ad8-b11d-3e037e78e158'}),
 Document(page_content='There are a variety of ways that you can contribute to open source projects.', metadata={'id': '162bf5ae-6436-4955-80fe-4b1066bb3e00', 'Header 2': 'Validating an issue or pull request', '_id': '0b6904df39fe45b495be3016d401c276', '_collection_name': 'my_documents'}),
 Do

### SQLite-VSS (No-Async)

In [2]:
tracemalloc.start()
print(tracemalloc.get_traced_memory())

# Store
from langchain_community.vectorstores import SQLiteVSS as VectorStore
docsearch = VectorStore.from_documents(docs, embeddings)

print(tracemalloc.get_traced_memory())
tracemalloc.stop()

(992, 11697)
(1506173, 381592931)


In [3]:
%%time
for i in range(10):
    docsearch.similarity_search(query)

docsearch.similarity_search(query)

CPU times: user 2.47 s, sys: 26.4 ms, total: 2.49 s
Wall time: 685 ms


[Document(page_content="For more information on open source, specifically how to create and grow an open source project, we've created [Open Source Guides](https://opensource.guide/) that will help you foster a healthy open source community."),
 Document(page_content="For more information on open source, specifically how to create and grow an open source project, we've created [Open Source Guides](https://opensource.guide/) that will help you foster a healthy open source community by recommending best practices for creating and maintaining repositories for your open source project."),
 Document(page_content='There are a variety of ways that you can contribute to open source projects.', metadata={'Header 2': 'Validating an issue or pull request'}),
 Document(page_content='- "[Your Code of Conduct](https://opensource.guide/code-of-conduct/)" from the Open Source Guides\n- "[Building Welcoming Communities](https://opensource.guide/building-community/)" from the Open Source Guides\n- "[Lea

### USearch

In [4]:
tracemalloc.start()
print(tracemalloc.get_traced_memory())

# Store
from langchain_community.vectorstores import USearch as VectorStore
docsearch = await VectorStore.afrom_documents(docs, embeddings)

print(tracemalloc.get_traced_memory())
tracemalloc.stop()

(928, 11633)
(15427078, 295386573)


In [6]:
%%time
for i in range(10):
    docsearch.similarity_search(query)

docsearch.similarity_search(query)

CPU times: user 2.52 s, sys: 21.1 ms, total: 2.54 s
Wall time: 662 ms


[Document(page_content="For more information on open source, specifically how to create and grow an open source project, we've created [Open Source Guides](https://opensource.guide/) that will help you foster a healthy open source community."),
 Document(page_content="For more information on open source, specifically how to create and grow an open source project, we've created [Open Source Guides](https://opensource.guide/) that will help you foster a healthy open source community by recommending best practices for creating and maintaining repositories for your open source project."),
 Document(page_content='There are a variety of ways that you can contribute to open source projects.', metadata={'Header 2': 'Validating an issue or pull request'}),
 Document(page_content='- "[Your Code of Conduct](https://opensource.guide/code-of-conduct/)" from the Open Source Guides\n- "[Building Welcoming Communities](https://opensource.guide/building-community/)" from the Open Source Guides\n- "[Lea

# Results

| Vector Store          | Create Time (Cell Time)        | Search x10 (CPU Time)  | Search x10 (Wall Time)    | Peak Memory, 2K Docs (bytes) | Peak Memory, 18K Docs (bytes) |
|-----------------------|--------------------------------|------------------------|:-------------------------:|------------------------------|-------------------------------|
| FAISS                 |33m 53.7s                       |2.35 s                  |641 ms                     |37281761                      |280880734                      |
| Annoy                 |34m 17.1s                       |2.5 s                   |657 ms                     |31509645                      |280986377                      |
| USearch               |34m 17.2s                       |2.54 s                  |662 ms                     |33104910                      |295386573                      |
|                       |                                |                        |                           |                              |                               |
| SKLearn               |34m 22.6s                       |7.63 s                  |1.55 s                     |64391689                      |319402979                      |
| Chroma                |35m 7.5s                        |3.53 s                  |936 ms                     |45508291                      |312020989                      |
| Qdrant                |35m 53.6s                       |6.73 s                  |1.48 s                     |34226917                      |142474501                      |
| DeepLake              |36m 41.3s                       |10.5 s                  |4.58 s                     |33136137                      |97221769                       |
| SQLite-VSS            |37m 18.9s                       |2.49 s                  |685 ms                     |42413252                      |381592931                      |
| LanceDB               |! Loses metadata                |4.04 s                  |1.19 s                     |72083333                      |331842358                      |

On the tested systems:
- FAISS, Annoy, and USearch perform similarly in document search (641 to 662 ms wall time and 2.35 to 2.54 s CPU time for the timed cell). The differences are too small to rank one above the others.
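
To compare the stores on a per-search basis, the wall times and 18K peak-memory figures from the table can be normalised. This is a small sketch rather than part of the benchmark: it assumes each timed cell ran 11 searches (the loop of 10 plus the final call) and converts the `tracemalloc` byte counts to MiB.

```python
# Wall times (ms) and 18K peak memory (bytes), copied from the table above.
wall_ms = {
    "FAISS": 641, "Annoy": 657, "USearch": 662, "SQLite-VSS": 685,
    "Chroma": 936, "LanceDB": 1190, "Qdrant": 1480, "SKLearn": 1550,
    "DeepLake": 4580,
}
peak_bytes = {
    "FAISS": 280880734, "Annoy": 280986377, "USearch": 295386573,
    "SQLite-VSS": 381592931, "Chroma": 312020989, "LanceDB": 331842358,
    "Qdrant": 142474501, "SKLearn": 319402979, "DeepLake": 97221769,
}

SEARCHES_PER_CELL = 11  # assumption: loop of 10 plus the final call

# Fastest store first.
ranking = sorted(wall_ms, key=wall_ms.get)
for name in ranking:
    per_search = wall_ms[name] / SEARCHES_PER_CELL
    mib = peak_bytes[name] / 2**20
    print(f"{name:<11} {per_search:7.1f} ms/search  {mib:7.1f} MiB peak")
```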

About [Annoy](https://python.langchain.com/docs/integrations/vectorstores/annoy): its index is read-only, so every change in the data means recreating the entire vector store. When the data changes regularly, a store that accepts new entries will be more cost-effective.
> NOTE: Annoy is read-only - once the index is built you cannot add any more embeddings!
> If you want to progressively add new entries to your VectorStore then better choose an alternative!
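
The cost of a rebuild can be estimated from the DeepLake run above, where embedding all 17959 chunks took 36:35. The sketch below assumes embedding time dominates, scales linearly with the number of chunks, and that embeddings are not cached between builds; `update_cost_seconds` and the update size of 100 chunks are hypothetical.

```python
TOTAL_CHUNKS = 17_959          # from the DeepLake progress bar
EMBED_SECONDS = 36 * 60 + 35   # 36:35 reported for embedding all chunks

seconds_per_chunk = EMBED_SECONDS / TOTAL_CHUNKS


def update_cost_seconds(new_chunks, supports_add):
    """Estimated embedding time to bring a store up to date after new chunks arrive."""
    # A read-only index must re-embed everything; an appendable one only the new chunks.
    to_embed = new_chunks if supports_add else TOTAL_CHUNKS + new_chunks
    return to_embed * seconds_per_chunk


new_chunks = 100  # hypothetical update size
print(f"incremental add: {update_cost_seconds(new_chunks, True):.0f} s")
print(f"full rebuild:    {update_cost_seconds(new_chunks, False):.0f} s")
```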