<img src="https://qdrant.tech/images/logo_with_text.svg"/>

<h1 style="font-size:3rem;color:#CE1620;">Introduction to Semantic Search with Qdrant</h1>

### For more on Qdrant vector databases visit: https://qdrant.tech/

### Install Dependencies

You need to process your data so that the search engine can work with it. The [Sentence Transformers](https://www.sbert.net/) framework gives you access to pretrained language models that turn raw text into embeddings.

In [1]:
# !pip install sentence-transformers 

Once encoded, this data needs to be kept somewhere. Qdrant lets you store data as embeddings. You can also use Qdrant to run search queries against this data. This means that you can ask the engine to give you relevant answers that go way beyond keyword matching.

In [2]:
# !pip install qdrant-client
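
To see what "beyond keyword matching" means, here is a minimal sketch of a naive keyword matcher. The helper `keyword_match` is hypothetical and is not part of Qdrant or Sentence Transformers; it only illustrates why exact word overlap can miss a relevant description.

```python
def keyword_match(query, text):
    """Return True only if every query word appears in the text (naive matching)."""
    words = set(text.lower().replace(".", "").split())
    return all(term in words for term in query.lower().split())

# A description that is clearly about an alien invasion, but never says "alien"
description = "A Martian invasion of Earth throws humanity into chaos."

print(keyword_match("alien invasion", description))    # False: "alien" is not in the text
print(keyword_match("martian invasion", description))  # True: both words are present
```

A semantic search engine compares meanings through embeddings instead, so a query such as "alien invasion" can still surface this description.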

### Import the Models

With both libraries installed, import the classes this engine will use. This tutorial uses the Qdrant Python client to communicate with Qdrant. For the Qdrant API documentation, visit: https://qdrant.github.io/qdrant/redoc/index.html#section/Examples

In [3]:
from qdrant_client import models, QdrantClient
from sentence_transformers import SentenceTransformer

### Generate a sentence encoder

The Sentence Transformers framework contains many embedding models. This tutorial uses all-MiniLM-L6-v2 because it is small and fast.

In [4]:
# all-MiniLM-L6-v2 is a distilled (lightweight) model optimized for fast inference
# Full list of available models: https://www.sbert.net/docs/pretrained_models.html
encoder = SentenceTransformer('all-MiniLM-L6-v2', device="cpu")

# Add the Dataset

[all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) will encode the data you provide. Here you will list all the science fiction books in your library. Each book has metadata: a name, an author, a publication year, and a short description.

In [5]:
documents = [
{ "name": "The Time Machine", "description": "A man travels through time and witnesses the evolution of humanity.", "author": "H.G. Wells", "year": 1895 },
{ "name": "Ender's Game", "description": "A young boy is trained to become a military leader in a war against an alien race.", "author": "Orson Scott Card", "year": 1985 },
{ "name": "Brave New World", "description": "A dystopian society where people are genetically engineered and conditioned to conform to a strict social hierarchy.", "author": "Aldous Huxley", "year": 1932 },
{ "name": "The Hitchhiker's Guide to the Galaxy", "description": "A comedic science fiction series following the misadventures of an unwitting human and his alien friend.", "author": "Douglas Adams", "year": 1979 },
{ "name": "Dune", "description": "A desert planet is the site of political intrigue and power struggles.", "author": "Frank Herbert", "year": 1965 },
{ "name": "Foundation", "description": "A mathematician develops a science to predict the future of humanity and works to save civilization from collapse.", "author": "Isaac Asimov", "year": 1951 },
{ "name": "Snow Crash", "description": "A futuristic world where the internet has evolved into a virtual reality metaverse.", "author": "Neal Stephenson", "year": 1992 },
{ "name": "Neuromancer", "description": "A hacker is hired to pull off a near-impossible hack and gets pulled into a web of intrigue.", "author": "William Gibson", "year": 1984 },
{ "name": "The War of the Worlds", "description": "A Martian invasion of Earth throws humanity into chaos.", "author": "H.G. Wells", "year": 1898 },
{ "name": "The Hunger Games", "description": "A dystopian society where teenagers are forced to fight to the death in a televised spectacle.", "author": "Suzanne Collins", "year": 2008 },
{ "name": "The Andromeda Strain", "description": "A deadly virus from outer space threatens to wipe out humanity.", "author": "Michael Crichton", "year": 1969 },
{ "name": "The Left Hand of Darkness", "description": "A human ambassador is sent to a planet where the inhabitants are genderless and can change gender at will.", "author": "Ursula K. Le Guin", "year": 1969 },
{ "name": "The Three-Body Problem", "description": "Humans encounter an alien civilization that lives in a dying system.", "author": "Liu Cixin", "year": 2008 }
]

# Define Storage Location

You need to tell Qdrant where to store embeddings. For this basic demo, the client keeps everything in your local computer's memory, so the data is temporary and disappears when the session ends.

In [6]:
qdrant = QdrantClient(":memory:") 

# Create a Collection

All data in Qdrant is organized into collections. In this case, you are storing books, so the collection is called `my_books`.

In [7]:
qdrant.recreate_collection(
    # The name of the collection to create
    collection_name="my_books",
    
    vectors_config=models.VectorParams(
        # The dimensionality of the vectors created by our encoder model
        size=encoder.get_sentence_embedding_dimension(), 
        
        # The distance metric used to calculate the similarity or distance between vectors in the collection
        distance=models.Distance.COSINE
    )
)

True

- Use `recreate_collection` if you are experimenting and running the script several times. This function first tries to remove an existing collection with the same name.

- The `size` parameter defines the size of the vectors for a specific collection. If two vectors differ in size, the distance between them cannot be calculated. For this encoder the output dimensionality is 384, which `encoder.get_sentence_embedding_dimension()` returns, so you do not have to hard-code it.

- The distance parameter lets you specify the function used to measure the distance between two points. For more on the distance metrics supported by Qdrant visit: https://qdrant.tech/documentation/concepts/search/
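
To make the distance metric concrete, here is a minimal pure-Python sketch of cosine similarity, the measure selected by `models.Distance.COSINE`. The helper name `cosine_similarity` and the three-dimensional toy vectors are assumptions for illustration; Qdrant computes the metric internally on the full-size embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors (hypothetical helper)."""
    if len(a) != len(b):
        # Vectors of different sizes cannot be compared
        raise ValueError(f"dimension mismatch: {len(a)} != {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]   # points the same way, only longer
orthogonal = [0.0, 3.0, 0.0]       # shares no direction with the query

print(cosine_similarity(query, same_direction))  # close to 1: very similar
print(cosine_similarity(query, orthogonal))      # 0.0: unrelated
```

Cosine similarity ignores vector length and looks only at direction, which is why the longer vector still scores close to 1.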

# Upload Data to Collection

Tell the database to upload `documents` to the `my_books` collection. This will give each record (point) an id, a vector, and a payload. The payload is just the metadata from the dataset.

In [8]:
qdrant.upload_records(
    # The collection to add points to
    collection_name="my_books",
    
    # The points to be uploaded
    records=[
        models.Record(
            # A value for the id of the point
            id=idx,
            # Encode the sentence with our encoder model and persist as a vector within the point 
            vector=encoder.encode(doc["description"]).tolist(),
            # Metadata from the sentence previously encoded
            payload=doc
            # Iterate through the 'documents' dataset
        ) for idx, doc in enumerate(documents)
    ]
)
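
The shape of each uploaded point can be sketched with plain dictionaries. In this sketch `toy_encode` is a hypothetical stand-in for `encoder.encode` so that the example runs without a model; it does not produce meaningful embeddings, and the two-book `sample_documents` list is a subset of the dataset above.

```python
def toy_encode(text, size=4):
    """Hypothetical stand-in for encoder.encode: NOT a real embedding.

    Buckets character codes into `size` slots so the example runs without a model.
    """
    vector = [0.0] * size
    for i, ch in enumerate(text):
        vector[i % size] += ord(ch) / 1000.0
    return vector

sample_documents = [
    {"name": "Dune", "year": 1965, "author": "Frank Herbert",
     "description": "A desert planet is the site of political intrigue and power struggles."},
    {"name": "Foundation", "year": 1951, "author": "Isaac Asimov",
     "description": "A mathematician develops a science to predict the future of humanity "
                    "and works to save civilization from collapse."},
]

# One point per document: a numeric id, a vector, and the original metadata as payload
points = [
    {"id": idx, "vector": toy_encode(doc["description"]), "payload": doc}
    for idx, doc in enumerate(sample_documents)
]

for point in points:
    print(point["id"], len(point["vector"]), point["payload"]["name"])
```

Every vector has the same length, which is what lets the collection compare any two points.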

# Query the Search Engine

Now that the data is stored in Qdrant, you can ask it questions and receive semantically relevant results.

In [9]:
import pandas as pd

hits = qdrant.search(
    # The collection to query
    collection_name="my_books",
    # The semantic search string, here we search for "alien invasion" from our book descriptions
    query_vector=encoder.encode("alien invasion").tolist(),
    # And provide a limit of 3 results
    limit=3
)

df_results = pd.DataFrame(columns=['payload', 'score'])

# Display the payload and hit score from our results
for hit in hits:
    df_results.loc[len(df_results)] = [hit.payload, hit.score]

pd.set_option('display.max_colwidth', None)
df_results.head()

| | payload | score |
|---|---|---|
| 0 | {'name': 'The War of the Worlds', 'description': 'A Martian invasion of Earth throws humanity into chaos.', 'author': 'H.G. Wells', 'year': 1898} | 0.570093 |
| 1 | {'name': "The Hitchhiker's Guide to the Galaxy", 'description': 'A comedic science fiction series following the misadventures of an unwitting human and his alien friend.', 'author': 'Douglas Adams', 'year': 1979} | 0.504047 |
| 2 | {'name': 'The Three-Body Problem', 'description': 'Humans encounter an alien civilization that lives in a dying system.', 'author': 'Liu Cixin', 'year': 2008} | 0.459029 |


The search engine returns the three results that are semantically closest to the query "alien invasion". Each result has a score that shows how similar it is to the query; with the cosine metric, a higher score means a closer match.
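
The ranking idea can be sketched as an exhaustive search: score every stored vector against the query vector, sort by score, and keep the top results. The two-dimensional vectors below are made up for illustration and were not produced by the encoder, and `top_k` is a hypothetical helper. Qdrant relies on a vector index rather than a full scan, so treat this as a conceptual sketch only.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vector, points, limit):
    """Score every point against the query and return the `limit` best (score, payload) pairs."""
    scored = [(cosine_similarity(query_vector, p["vector"]), p["payload"]) for p in points]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest similarity first
    return scored[:limit]

# Toy vectors chosen by hand, for illustration only
toy_points = [
    {"vector": [0.9, 0.1], "payload": {"name": "The War of the Worlds"}},
    {"vector": [0.1, 0.9], "payload": {"name": "Dune"}},
    {"vector": [0.6, 0.4], "payload": {"name": "The Three-Body Problem"}},
]

for score, payload in top_k([1.0, 0.0], toy_points, limit=2):
    print(f"{score:.3f}", payload["name"])
```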

## Filter Results

How about restricting the search to books published in 2000 or later?

In [10]:
hits = qdrant.search(
    # The collection to search
    collection_name="my_books",
    # The semantic search string
    query_vector=encoder.encode("alien invasion").tolist(),
    
    # Apply a filter
    query_filter=models.Filter(
        must=[
            models.FieldCondition(
                # Where the year
                key="year",
                range=models.Range(
                    # Is greater than or equal to 2000
                    gte=2000
                )
            )
        ]
    ),
    # Limit to only the top result
    limit=1
)


df_filter_results = pd.DataFrame(columns=['payload', 'score'])

# Display the payload and hit score from our results
for hit in hits:
    df_filter_results.loc[len(df_filter_results)] = [hit.payload, hit.score]

df_filter_results.head()

| | payload | score |
|---|---|---|
| 0 | {'name': 'The Three-Body Problem', 'description': 'Humans encounter an alien civilization that lives in a dying system.', 'author': 'Liu Cixin', 'year': 2008} | 0.459029 |


The filter keeps only books published in 2000 or later, and `limit=1` returns the best match among them: The Three-Body Problem (2008).
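
Conceptually, a range condition is a predicate on a payload field that is combined with the similarity ranking. The sketch below applies such a predicate to the three scored results from the unfiltered "alien invasion" query. The helper `matches_range` is hypothetical, and filtering a short list after the fact is a simplification for illustration, not a description of how Qdrant executes filters.

```python
def matches_range(payload, key, gte=None, lte=None):
    """Hypothetical helper mimicking a range condition on one payload field."""
    value = payload.get(key)
    if value is None:
        return False
    if gte is not None and value < gte:
        return False
    if lte is not None and value > lte:
        return False
    return True

# (score, payload) pairs taken from the unfiltered "alien invasion" results
scored_hits = [
    (0.570093, {"name": "The War of the Worlds", "year": 1898}),
    (0.504047, {"name": "The Hitchhiker's Guide to the Galaxy", "year": 1979}),
    (0.459029, {"name": "The Three-Body Problem", "year": 2008}),
]

# Keep only hits whose year is >= 2000, then take the highest-scoring one
filtered = [hit for hit in scored_hits if matches_range(hit[1], "year", gte=2000)]
best = max(filtered, key=lambda hit: hit[0])
print(best[1]["name"], best[0])  # The Three-Body Problem 0.459029
```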

# What's Next?

Try building an actual [Neural Search Service with a complete API and a dataset](https://qdrant.tech/documentation/tutorials/neural-search/)