# Semantic Similarity



The underlying vector database in [OnPrem.LLM](https://github.com/amaiya/onprem) can be used for detecting semantic similarity among pieces of text.

You can access the default vectorstore from an `LLM` object:
```python
import tempfile

from onprem import LLM

# Keep the vector database in a fresh temporary directory
vectordb_path = tempfile.mkdtemp()
llm = LLM(
    embedding_model_name="sentence-transformers/nli-mpnet-base-v2",
    embedding_encode_kwargs={"normalize_embeddings": True},
    vectordb_path=vectordb_path,
    store_type='dense',
    verbose=False
)
```
Here, however, we create a `VectorStore` instance explicitly, which avoids loading an LLM that this example does not need.

`VectorStoreFactory` is a convenient way to instantiate vector stores with different backends (e.g., Chroma, Whoosh, Elasticsearch).

In [None]:
# | notest
import os
import tempfile

from onprem.ingest.stores import VectorStoreFactory

store = VectorStoreFactory.create(
    kind='chroma',
    persist_location='/tmp/my_vectordb',
    embedding_model_name="sentence-transformers/nli-mpnet-base-v2",
    embedding_encode_kwargs={"normalize_embeddings": True},
)
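
The `normalize_embeddings` option passed above asks the embedding model to return unit-length vectors. As a rough illustration of why that is convenient for similarity search, the sketch below uses made-up 3-dimensional vectors (not real embeddings) and hypothetical helper names to show that, once vectors are normalized, a plain dot product gives the same value as cosine similarity. It is not how OnPrem.LLM or Chroma compute scores internally.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors invented for illustration; real embeddings have hundreds of dimensions.
query_vec = [3.0, 4.0, 0.0]
doc_vec = [6.0, 8.0, 0.0]    # same direction as the query, different length
other_vec = [0.0, 0.0, 5.0]  # orthogonal to the query

# After normalization, the dot product alone acts as the similarity score.
same_direction = dot(l2_normalize(query_vec), l2_normalize(doc_vec))
orthogonal = dot(l2_normalize(query_vec), l2_normalize(other_vec))
```

Here `same_direction` comes out at approximately 1 and `orthogonal` at 0, matching what `cosine_similarity` returns for the unnormalized vectors.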

In [None]:
# | notest

data = [  # from txtai
    "US tops 5 million confirmed virus cases",
    "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
    "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
    "The National Park Service warns against sacrificing slower friends in a bear attack",
    "Maine man wins $1M from $25 lottery ticket",
    "Make huge profits without work, earn up to $100,000 a day",
]
source_folder = tempfile.mkdtemp()
for i, d in enumerate(data):
    filename = os.path.join(source_folder, f"doc{i}.txt")
    with open(filename, "w") as f:
        f.write(d)

In [None]:
# | notest
store.ingest(source_folder, chunk_size=500, chunk_overlap=0)

Creating new vectorstore at /tmp/my_vectordb
Loading documents from /tmp/tmpeg2wt1z7


Loading new documents: 100%|████████████████████| 6/6 [00:00<00:00, 1540.98it/s]
Processing and chunking 6 new documents: 100%|██████████████████████████████████████████| 1/1 [00:00<00:00, 1940.01it/s]


Split into 6 chunks of text (max. 500 chars each for text; max. 2000 chars for tables)
Creating embeddings. May take some minutes...


100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.13it/s]

Ingestion complete! You can now query your documents using the LLM.ask or LLM.chat methods
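
The log above reports 6 chunks for 6 documents because every headline is far shorter than `chunk_size=500`. The sketch below is a deliberately naive fixed-width character splitter, written only to illustrate what `chunk_size` and `chunk_overlap` mean; the function name `chunk_text` is hypothetical, and the splitter that OnPrem.LLM actually uses may behave differently (for example, by respecting sentence or paragraph boundaries).

```python
def chunk_text(text, chunk_size=500, chunk_overlap=0):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters. Illustrative only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


headline = "Maine man wins $1M from $25 lottery ticket"

# A short headline fits in a single chunk, so one document yields one chunk.
chunks = chunk_text(headline, chunk_size=500, chunk_overlap=0)
```

With these settings `chunks` contains the headline as its only element, which is why each of the six files above ends up as exactly one chunk.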





Next, we query the vector store directly to find the best semantic match for each query.

In [None]:
# | notest

for query in (
    "feel good story",
    "climate change",
    "public health story",
    "war",
    "wildlife",
    "asia",
    "lucky",
    "dishonest junk",
):
    docs = store.semantic_search(query)
    print(f"{query} : {docs[0].page_content}")

feel good story : Maine man wins $1M from $25 lottery ticket
climate change : Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg
public health story : US tops 5 million confirmed virus cases
war : Beijing mobilises invasion craft along coast as Taiwan tensions escalate
wildlife : The National Park Service warns against sacrificing slower friends in a bear attack
asia : Beijing mobilises invasion craft along coast as Taiwan tensions escalate
lucky : Maine man wins $1M from $25 lottery ticket
dishonest junk : Make huge profits without work, earn up to $100,000 a day
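
The loop above indexes `docs[0]` directly, which would raise an `IndexError` if a search returned nothing. As a small sketch, the hypothetical helper below collects the top hit per query and falls back to `None` for empty results. It assumes only what the example above already relies on: that `store.semantic_search(query)` returns a list-like sequence of documents with a `page_content` attribute.

```python
def best_matches(store, queries):
    """Map each query to the text of its top-ranked hit, or None if there is no hit."""
    results = {}
    for query in queries:
        docs = store.semantic_search(query)
        # Guard against an empty result before indexing the first document.
        results[query] = docs[0].page_content if docs else None
    return results
```

For example, `best_matches(store, ["war", "asia"])` would return a dictionary keyed by query, which is easier to inspect or test than printed output.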
