# Searching using Qdrant Search Engine

## Let's begin by installing the libraries required for this notebook.


*   **qdrant-client** is used for communicating with the Qdrant server
*   **datasets** is used for downloading the dataset
*   **tqdm** is used for the progress bars
*   **sentence-transformers** is used for generating and working with sentence embeddings
*   **nltk** is used for stopword removal, tokenization, and stemming during preprocessing

In [46]:
!pip install qdrant-client datasets tqdm sentence-transformers nltk





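With the required packages installed, we can get started. Let's begin by launching the Qdrant service using the docker-compose.yaml file located in the same folder as this notebook. This command launches a standalone Qdrant instance that we will use for this test. Note that the client below is created in in-memory mode (':memory:'), so the searches in this notebook do not depend on the Docker instance.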

In [47]:
! docker compose up -d

 Container milvus-etcd  Creating
 Container milvus-minio  Creating
Error response from daemon: Conflict. The container name "/milvus-minio" is already in use by container "b1747c11006c2da61c7feab994abddf076a2bf2df4ad31bd1d67fdd3a06ccdd8". You have to remove (or rename) that container to be able to reuse that name.


In [48]:
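import os
import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from qdrant_client import QdrantClient, models
from sentence_transformers import SentenceTransformer
from tqdm.notebook import tqdm
# NLTK data for stopwords and word_tokenize (newer NLTK releases use 'punkt_tab' instead of 'punkt')
nltk.download(['stopwords', 'punkt', 'punkt_tab'], quiet=True)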

In [49]:
# Initializing Qdrant Client
qdrant_client = QdrantClient(':memory:')

## Dataset
With Qdrant up and running, we can start gathering our data. Hugging Face Datasets provides a wide range of community datasets, and in this scenario we will use the "netflix-shows" dataset from hugginglearners. This dataset contains metadata for more than 8,000 movies and TV shows. Our objective is to generate an embedding for each description and store it in Qdrant alongside the title's metadata, such as its title, type, release year, and rating.

The raw data can be accessed at https://huggingface.co/datasets/hugginglearners/netflix-shows/tree/main. The cell below loads it from a local copy, netflix_titles.csv.

In [50]:
# Load the dataset from CSV
csv_path = '../netflix_titles.csv'
df = pd.read_csv(csv_path)

## Preprocess the Data
In this step, we preprocess the data before encoding it. The function below drops duplicate rows, fills missing values (the median for numeric columns, the most frequent value for text columns), converts text to lowercase, removes English stopwords, and stems the remaining words with NLTK. A second function applies the same lowercasing, stopword removal, and stemming to search queries, so that queries and descriptions are normalized the same way.
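To make the effect of lowercasing and stopword removal concrete, here is a small dependency-free sketch. The stopword set `TOY_STOPWORDS` and the helper `toy_preprocess` are hypothetical and exist only for illustration; the notebook itself relies on NLTK's full English stopword list and additionally stems each word.

```python
# Hypothetical mini stopword list; NLTK's English list is much longer.
TOY_STOPWORDS = {"a", "an", "the", "of", "and", "in", "to", "is"}


def toy_preprocess(text):
    """Lowercase, split on whitespace, and drop stopwords (no stemming)."""
    tokens = text.lower().split()
    return " ".join(token for token in tokens if token not in TOY_STOPWORDS)


print(toy_preprocess("The Return of the King"))  # return king
```

The same normalization has to be applied to both the stored descriptions and the incoming queries, otherwise the two sides would be embedded from differently shaped text.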

In [59]:
def preprocess_data(df):
    # Drop duplicates, if any
    df = df.drop_duplicates()

    # Handle missing values
    # For numeric columns, fill missing values with the median
    numeric_cols = df.select_dtypes(include='number').columns
    for col in numeric_cols:
        df[col] = df[col].fillna(df[col].median())

    # For categorical columns, fill missing values with the most frequent value
    categorical_cols = df.select_dtypes(include='object').columns
    for col in categorical_cols:
        df[col] = df[col].fillna(df[col].mode()[0])

    # Convert all text data to lowercase; leave non-string values (e.g. release_year) unchanged
    for col in categorical_cols:
        df[col] = df[col].apply(lambda x: x.lower() if isinstance(x, str) else x)

    # Prepare NLTK for English language preprocessing
    eng_stopwords = set(stopwords.words('english'))  # Set of English stopwords
    stemmer = SnowballStemmer('english')  # Snowball Stemmer for English language

    # Preprocess text in categorical columns using NLTK
    for col in categorical_cols:
        if df[col].dtype == 'object':
            df[col] = df[col].apply(lambda x: ' '.join([stemmer.stem(word) for word in word_tokenize(x) if word.lower() not in eng_stopwords]))
            
    return df


df_movies = preprocess_data(df)

In [60]:
def preprocess_query(input_string):
    # Convert the input string to lowercase
    input_string = input_string.lower()

    # Prepare NLTK for English Language Preprocessing
    eng_stopwords = set(stopwords.words('english'))  # Set of English stopwords
    stemmer = SnowballStemmer('english')  # Snowball Stemmer for English language

    # Tokenize the input string
    words = word_tokenize(input_string)

    # Remove English stopwords and apply stemming
    preprocessed_words = [stemmer.stem(word) for word in words if word.lower() not in eng_stopwords]

    # Combine the preprocessed words into a single string
    preprocessed_string = ' '.join(preprocessed_words)

    return preprocessed_string

## Encoding
The code below encodes the movie descriptions with SentenceTransformer, using the pre-trained **multi-qa-MiniLM-L6-cos-v1** model. It iterates over the descriptions in the DataFrame (df_movies) in batches of 64 and encodes each batch into dense vectors. The encoded vectors are then concatenated and saved as a NumPy array in the 'data/vectors_movies.npy' file.

In [61]:
# Encode movie descriptions using SentenceTransformer
model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1', device='cpu')
vectors = []
batch_size = 64
batch = []
for row in tqdm(df_movies.itertuples()):
    description = row.description
    batch.append(description)
    if len(batch) >= batch_size:
        vectors.append(model.encode(batch))
        batch = []

if len(batch) > 0:
    vectors.append(model.encode(batch))
    batch = []

vectors = np.concatenate(vectors)
os.makedirs('data', exist_ok=True)  # np.save does not create missing directories
np.save('data/vectors_movies.npy', vectors, allow_pickle=False)

0it [00:00, ?it/s]
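The batching pattern in the encoding cell (collect items until the batch is full, flush it, then flush whatever remains at the end) can be isolated as a small generator. This is only a sketch: `iter_batches` is a hypothetical helper name, and plain integers stand in for the movie descriptions.

```python
def iter_batches(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    # Flush the final, possibly smaller, batch
    if batch:
        yield batch


sizes = [len(batch) for batch in iter_batches(list(range(150)), 64)]
print(sizes)  # [64, 64, 22]
```

With a helper like this, the encoding loop would reduce to calling `model.encode(batch)` once per yielded batch and concatenating the results.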

## Qdrant Setup

*   **CODE_DIR** = os.getcwd(): Assigns the current working directory to CODE_DIR
*   **ROOT_DIR** = os.path.dirname(CODE_DIR): Assigns the parent directory of CODE_DIR to ROOT_DIR
*   **DATA_DIR** = os.path.join(CODE_DIR, 'data'): Creates a path for data-related files by joining CODE_DIR and the subdirectory name 'data', assigned to DATA_DIR
*   **COLLECTION_NAME** = 'movies': Sets the collection name in Qdrant as 'movies'.
*   **QDRANT_HOST** = os.environ.get('QDRANT_HOST', 'localhost'): Retrieves the QDRANT_HOST value from environment variables, defaulting to 'localhost' if not set.
*   **QDRANT_PORT** = int(os.environ.get('QDRANT_PORT', 6333)): Retrieves the QDRANT_PORT value from environment variables, defaulting to 6333 if not set, and converts it to an integer because environment variables are always strings.
*   **vectors_path** = os.path.join(DATA_DIR, 'vectors_movies.npy'): Creates a path for the encoded vectors file by joining DATA_DIR and the filename 'vectors_movies.npy', assigned to vectors_path.
*   **vectors** = np.load(vectors_path): Loads encoded vectors from vectors_path using np.load(), assigned to vectors
*   **vector_size** = vectors.shape[1]: Retrieves the size of the vectors by accessing the shape of vectors and taking the second element ([1] index).

In [62]:
CODE_DIR = os.getcwd()
ROOT_DIR = os.path.dirname(CODE_DIR)
DATA_DIR = os.path.join(CODE_DIR, 'data')
COLLECTION_NAME = 'movies'
QDRANT_HOST = os.environ.get('QDRANT_HOST', 'localhost')
QDRANT_PORT = int(os.environ.get('QDRANT_PORT', 6333))
vectors_path = os.path.join(DATA_DIR, 'vectors_movies.npy')
vectors = np.load(vectors_path)
vector_size = vectors.shape[1]

In [63]:
# Qdrant client and collection creation
qdrant_client = QdrantClient(':memory:')  # QdrantClient(host=QDRANT_HOST, port=QDRANT_PORT)
qdrant_client.recreate_collection(
    collection_name=COLLECTION_NAME,
    vectors_config=models.VectorParams(size=vector_size, distance=models.Distance.COSINE)
)

True
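The collection above is configured with the Cosine distance, so the scores returned by a search reflect how closely two embedding vectors point in the same direction. As a rough illustration of the metric itself (not of Qdrant's internal implementation), cosine similarity can be computed by hand; `cosine_similarity` is a hypothetical helper and the two-dimensional vectors are toy values.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

Higher values mean the query and the stored description are semantically closer according to the embedding model.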

In [64]:
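# Upload vectors and payload data to Qdrant
BATCH_SIZE = 64
qdrant_client.upload_collection(
    collection_name=COLLECTION_NAME,
    vectors=vectors,
    # Payload keeps the original (unprocessed) text; rows are aligned with the vectors built from df_movies
    payload=df.loc[df_movies.index].to_dict(orient='records'),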
    ids=None,
    batch_size=BATCH_SIZE,
    parallel=2
)

## Make a Search function
Now that all the preparations are complete, let's build the search function.

In [65]:
def search_movies(qdrant_client, collection_name, query, query_filter=None, top_k=3):
    query_text = preprocess_query(query)
    query_vector = model.encode(query_text).tolist()
    hits = qdrant_client.search(
        collection_name=collection_name,
        query_vector=query_vector,
        query_filter=query_filter,
        limit=top_k
    )
    print('Search Results:')
    for i, hit in enumerate(hits):
        print(f'\nResult {i + 1}:')
        print('Title:', hit.payload.get('title', 'N/A'))
        print('Type:', hit.payload.get('type', 'N/A'))
        print('Release Year:', hit.payload.get('release_year', 'N/A'))
        print('Rating:', hit.payload.get('rating', 'N/A'))
        print('Description:', hit.payload.get('description', 'N/A'))
        print('Score:', hit.score)
        print('-'*120)

In [66]:
search_movies(qdrant_client, COLLECTION_NAME, 'fluffy animal')

Search Results:

Result 1:
Title: Pop Team Epic
Type: TV Show
Release Year: 2018
Rating: TV-14
Description: This animated adaptation of the quirky four-panel comic brings the random exploits of Popuko and Pipimi to life.
Score: 0.47200582458503304
------------------------------------------------------------------------------------------------------------------------

Result 2:
Title: Gabriel lglesias: I’m Sorry For What I Said When I Was Hungry
Type: Movie
Release Year: 2016
Rating: TV-14
Description: Hawaiian-shirt enthusiast Gabriel "Fluffy" Iglesias finds the laughs in racist gift baskets, Prius-driving cops and all-female taco trucks.
Score: 0.4275715285325412
------------------------------------------------------------------------------------------------------------------------

Result 3:
Title: Enter the Anime
Type: Movie
Release Year: 2019
Rating: TV-MA
Description: What is anime? Through deep-dives with notable masterminds of this electrifying genre, this fast-paced peek behi