Text-based search relies on matching keywords or exact words. This becomes a limitation when working with text-heavy documents, images, audio, video, or code.

Vector search: Vector search can match a query to videos, pictures, or even different words that mean the same thing. It works on the semantic similarity between vectorized representations of the data (embeddings), that is, at the level of meaning.
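
To make "semantic similarity" more concrete, here is a minimal sketch that compares toy embeddings with cosine similarity. The three-dimensional vectors are made up for illustration; a real embedding model produces hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (close to 1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings (not produced by a real model).
embeddings = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.85, 0.15, 0.05],
    "banana": [0.05, 0.1, 0.95],
}

sim_synonym = cosine_similarity(embeddings["car"], embeddings["automobile"])
sim_unrelated = cosine_similarity(embeddings["car"], embeddings["banana"])

print(f"car vs automobile: {sim_synonym:.3f}")
print(f"car vs banana:     {sim_unrelated:.3f}")
```

The two words with the same meaning end up with a much higher score than the unrelated pair, even though they share no keywords.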

Qdrant is an open-source vector search engine designed for fast search over large collections. It is written in Rust for performance.

#### Step 0: Set Up the Environment
1. Pull the Qdrant Docker image and run the container. In a terminal (for example, inside VS Code), run the commands below:
   - Pull the image: *docker pull qdrant/qdrant*
   - Start the container: *docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant*
2. Install the Python libraries with pip: *pip install qdrant-client fastembed*

#### Step 1: Import Required Libraries & Connect to Qdrant
*from qdrant_client import QdrantClient, models*

##### Initialize the client
client = QdrantClient("http://localhost:6333") #connecting to local Qdrant instance

#### Step 2: Study the Dataset
To build a working vector search solution (and, more generally, to understand if, when, and how it's needed), it's good to study the dataset and figure out the nature and structure of the data we're working with, for example:

- modality: is it text, images, videos, or a combination?
- specifics: if it's text, which language is used, how big are the text pieces, are there any special characters, etc.

It will help us define:

- the right data "schema" (what to vectorize, what to store as metadata, etc.);
- the right embedding model (the best fit based on the domain, precision, and resource requirements).
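
As a sketch of what such a schema decision can look like, the snippet below splits one FAQ-style record into the text to vectorize and the metadata to keep alongside it. The record and its field names (*question*, *text*, *section*, *course*) are assumptions for illustration, and vectorizing only the answer text is just one possible choice.

```python
# Hypothetical FAQ record; the field names and contents are assumed.
record = {
    "question": "Can I still join the course after it has started?",
    "text": "Yes, you can join late, but check the homework deadlines.",
    "section": "General course-related questions",
    "course": "mlops-zoomcamp",
}


def split_record(rec, text_field="text", payload_fields=("course", "section")):
    """Return (text_to_embed, payload) for a single record."""
    text_to_embed = rec[text_field]
    payload = {key: rec[key] for key in payload_fields}
    return text_to_embed, payload


text_to_embed, payload = split_record(record)

print("To vectorize:", text_to_embed)
print("Metadata:    ", payload)
```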

#### Step 3: Choosing FastEmbed with Qdrant

FastEmbed is Qdrant's library for turning text into vectors (embeddings). The Qdrant client uses it under the hood, so text can be embedded locally.

#### Step 4: Create a Collection
Before we can build a vector search solution in Qdrant, we need to create a collection. A collection is a container for all data points.

A point is a single data point: in our case, an answer plus its metadata. Each point has 3 items:
1. an id
2. one or more embedding vectors, produced here by the Jina embedding model
3. metadata, called the payload, which in our case is the course and section

#### Step 5: Create, Embed & Insert Points into the Collection

Points are the core data entities in Qdrant. Each point consists of:

- ID. A unique identifier. Qdrant supports both 64-bit unsigned integers and UUIDs.
- Vector. The embedding that represents the data point in vector space.
- Payload (optional). Additional metadata as key-value pairs.

The points are embedded and uploaded into the collection, where the vector index gets built.

**Study the data visually**

Open the Qdrant Web UI at *http://localhost:6333/dashboard* to explore the uploaded data and study semantic similarity visually.

#### Step 6: Running a Similarity Search
We look for the text vector in Qdrant that is most similar to a given query embedding, that is, the most relevant answer to a given question.

**How Similarity Search Works**
1. Qdrant compares the query vector to stored vectors (based on a vector index) using the distance metric defined when creating the collection.

2. The closest matches are returned, ranked by similarity. The vector index is built for approximate nearest neighbor (ANN) search, which makes large-scale vector search feasible.
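
The two steps above can be sketched in plain Python as an exact (brute-force) search. This is not how Qdrant is implemented: it scores every stored vector, which is the full scan an ANN index is built to avoid. The stored vectors and IDs are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Similarity score used to compare the query with a stored vector."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def exact_search(query_vector, stored, limit=1):
    """Score every stored vector against the query and return the top matches."""
    scored = [
        (point_id, cosine_similarity(query_vector, vector))
        for point_id, vector in stored.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # most similar first
    return scored[:limit]


# Hypothetical stored vectors keyed by point ID.
stored_vectors = {
    1: [0.9, 0.1, 0.0],
    2: [0.1, 0.9, 0.1],
    3: [0.0, 0.2, 0.9],
}

top = exact_search([0.8, 0.2, 0.0], stored_vectors, limit=2)
print(top)
```

The cost of this approach grows with the number of stored vectors, which is why an approximate index matters at scale.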

#### Step 7: Running a Similarity Search with Filters

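We can refine the search with filters, so that only points whose payload meets certain conditions are returned, for example points from a specific course. Besides *must*, Qdrant filters support clauses and conditions such as *should*, *must_not*, *range*, and more.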


To enable efficient filtering, we need to turn on indexing of payload fields.

client.create_payload_index(
    collection_name=collection_name,
    field_name="course",
    field_schema="keyword" # exact matching on string metadata fields
)

Update our search function to filter by course:

def search_in_course(query, course="mlops-zoomcamp", limit=1):

    results = client.query_points(
        collection_name=collection_name,
        query=models.Document( #embed the query text locally with "jinaai/jina-embeddings-v2-small-en"
            text=query,
            model=model_handle
        ),
        query_filter=models.Filter( # filter by course name
            must=[
                models.FieldCondition(
                    key="course",
                    match=models.MatchValue(value=course)
                )
            ]
        ),
        limit=limit, # top closest matches
        with_payload=True #to get metadata in the results
    )

    return results