# Movie Search Using Milvus and SentenceTransformers
In this example we walk through a movie plot search using Milvus and the SentenceTransformers library. The dataset we search is the Wikipedia-Movie-Plots Dataset found on [Kaggle](https://www.kaggle.com/datasets/jrobischon/wikipedia-movie-plots). For this example we have rehosted the data in a public Google Drive.

Let's get started.

## Installing Requirements
For this example we use `pymilvus` to connect to Milvus, `sentence-transformers` to embed the movie plots, and `gdown` to download the example dataset.

In [1]:
! pip install pymilvus sentence-transformers gdown

Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
Collecting torchvision
  Using cached torchvision-0.14.1-cp38-cp38-macosx_10_9_x86_64.whl (1.4 MB)
Collecting nltk
  Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for sentence-transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=125919 sha256=4a16738e15b53a4f00b540de1ef0b22c31d87a23ebe3c22714e92cfa653494d8
  Stored in directory: /Users/fzliu/Library/Caches/pip/wheels/5e/6f/8c/d88aec621f3f542d26fac0342bef5e693335d125f4e54aeffe
Successfully built sentence-transformers
Installing collected packages: torchvision, nltk, sentence-transformers
Successfully installed nltk-3.8.1 sentence-trans

## Grabbing the Data
We use `gdown` to grab the zip from Google Drive and then decompress it with the built-in `zipfile` library.

In [2]:
import gdown
url = 'https://drive.google.com/uc?id=11ISS45aO2ubNCGaC3Lvd3D7NT8Y7MeO8'
output = './movies.zip'
gdown.download(url, output)

Access denied with the following error:



 	Cannot retrieve the public link of the file. You may need to change
	the permission to 'Anyone with the link', or have had many accesses. 

You may still be able to access the file from the browser:

	 https://drive.google.com/uc?id=11ISS45aO2ubNCGaC3Lvd3D7NT8Y7MeO8 



In [3]:
import zipfile

with zipfile.ZipFile("./movies.zip","r") as zip_ref:
    zip_ref.extractall("./movies")
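
The download can fail, as the access-denied output above shows, and in that case the extraction cell has no usable archive to open. A minimal sketch of a guard, assuming the same `./movies.zip` and `./movies` paths; the helper name `extract_if_valid` is hypothetical and not part of the original notebook:

```python
import os
import zipfile


def extract_if_valid(zip_path, dest_dir):
    """Extract zip_path into dest_dir; return False if the archive is missing or invalid."""
    # Hypothetical helper: check that the file exists and is a readable zip archive.
    if not os.path.exists(zip_path) or not zipfile.is_zipfile(zip_path):
        return False
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(dest_dir)
    return True


if not extract_if_valid("./movies.zip", "./movies"):
    print("movies.zip is missing or not a valid zip; download it manually first.")
```

If the check fails, downloading the file through the browser link and saving it as `./movies.zip` should let the rest of the notebook run unchanged.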

## Global Arguments
These are the main arguments you need to modify to run the example with your own account. A comment beside each one describes it.

In [7]:
# Milvus Setup Arguments
import os
COLLECTION_NAME = 'movies_db'  # Collection name
DIMENSION = 384  # Embeddings size
URI=os.getenv('VECTOR_DB_URL')  # Endpoint URI obtained from Zilliz Cloud
USER='db_admin'  # Username specified when you created this database
PASSWORD=os.getenv('VECTOR_DB_PASS')  # Password set for that account

# Inference Arguments
BATCH_SIZE = 128

# Search Arguments
TOP_K = 3
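
`os.getenv` returns `None` when a variable is unset, so a missing `VECTOR_DB_URL` or `VECTOR_DB_PASS` would only surface later, at connection time. A small sketch of a fail-fast check; the helper name `require_env` is an assumption for illustration and not part of the original notebook:

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Possible usage (hypothetical), replacing the plain os.getenv calls:
# URI = require_env('VECTOR_DB_URL')
# PASSWORD = require_env('VECTOR_DB_PASS')
```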

## Setting Up Milvus
At this point we are going to begin setting up Milvus. The steps are as follows:

1. Connect to the Milvus instance using the provided URI.
2. If the collection already exists, drop it.
3. Create the collection that holds the id, title of the movie, and the plot embedding.
4. Create an index on the newly created collection and load it into memory.

Once these steps are done, the collection is ready for inserts and searches. Any data added is indexed automatically and is available to search immediately. If the data is very fresh, searches may be slower, because brute-force search is used on data that is still being indexed.


In [9]:
from pymilvus import connections

# Connect to Milvus Database
connections.connect(uri=URI, user=USER, password=PASSWORD, secure=True)

In [10]:
from pymilvus import utility

# Remove any previous collections with the same name
if utility.has_collection(COLLECTION_NAME):
    utility.drop_collection(COLLECTION_NAME)

In [11]:
from pymilvus import FieldSchema, CollectionSchema, DataType, Collection


# Create collection which includes the id, title, and embedding.
fields = [
    FieldSchema(name='id', dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name='title', dtype=DataType.VARCHAR, max_length=200),  # VARCHARS need a maximum length, so for this example they are set to 200 characters
    FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, dim=DIMENSION)
]
schema = CollectionSchema(fields=fields)
collection = Collection(name=COLLECTION_NAME, schema=schema)

In [12]:
# Create an AUTOINDEX index on the embedding field, using L2 distance.
index_params = {
    'metric_type':'L2',
    'index_type':"AUTOINDEX",
    'params':{}
}
collection.create_index(field_name="embedding", index_params=index_params)
collection.load()
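
The cell above uses `AUTOINDEX`, which leaves the choice of index to the service. On a self-managed Milvus deployment you could name an index type explicitly instead. The following is only a sketch of what `IVF_FLAT` parameters might look like; the `nlist` and `nprobe` values are illustrative assumptions, not settings taken from this example:

```python
# Hypothetical alternative to the AUTOINDEX parameters (values are illustrative).
ivf_index_params = {
    'metric_type': 'L2',
    'index_type': 'IVF_FLAT',
    'params': {'nlist': 1024},  # number of clusters the vectors are partitioned into
}

# Matching search-time parameters: nprobe is how many clusters are scanned per query.
ivf_search_params = {
    'metric_type': 'L2',
    'params': {'nprobe': 10},
}
```

These dictionaries would take the place of `index_params` in `create_index` and of `param` in `search`. A larger `nprobe` generally trades speed for recall.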

## Inserting the Data
In these next few steps we will be: 
1. Loading the data.
2. Embedding the plot text data using SentenceTransformers.
3. Inserting the data into Milvus.

For this example we use the SentenceTransformers MiniLM model (`all-MiniLM-L6-v2`) to create embeddings of the plot text. This model returns 384-dimensional embeddings, which matches `DIMENSION`.
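
For step 1 above, the loader in the next cell picks the title and plot by column position (`row[1]` and `row[7]`). A minimal alternative sketch selects them by header name instead. It assumes the CSV has a header row with columns named `Title` and `Plot`, as in the Kaggle dataset; the rehosted file is not shown here, so treat both the column names and the function name `csv_load_by_name` as assumptions:

```python
import csv


def csv_load_by_name(file, title_col='Title', plot_col='Plot'):
    """Yield (title, plot) pairs, skipping rows where either field is empty."""
    with open(file, newline='', encoding='utf-8') as f:
        # DictReader consumes the header row, so it is never yielded as data.
        for row in csv.DictReader(f):
            title, plot = row.get(title_col), row.get(plot_col)
            if not title or not plot:
                continue
            yield (title, plot)
```

Selecting by name keeps the loader working if the column order changes, and it avoids treating the header row as a movie.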

In [13]:
import csv
from sentence_transformers import SentenceTransformer

transformer = SentenceTransformer('all-MiniLM-L6-v2')

# Yield (title, plot) pairs, skipping rows where either field is empty
def csv_load(file):
    with open(file, newline='') as f:
        reader = csv.reader(f, delimiter=',')
        for row in reader:
            if '' in (row[1], row[7]):
                continue
            yield (row[1], row[7])


# Embed a batch of plots with SentenceTransformers and insert titles and embeddings
def embed_insert(data):
    embeds = transformer.encode(data[1]) 
    ins = [
            data[0],
            [x for x in embeds]
    ]
    collection.insert(ins)

2023-03-01 22:14:24.344794: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.



In [None]:
import time

data_batch = [[],[]]

for title, plot in csv_load('./movies/plots.csv'):
    data_batch[0].append(title)
    data_batch[1].append(plot)
    if len(data_batch[0]) % BATCH_SIZE == 0:
        embed_insert(data_batch)
        data_batch = [[],[]]

# Embed and insert the remainder
if len(data_batch[0]) != 0:
    embed_insert(data_batch)

# Flush so that any remaining unsealed segments are sealed and indexed.
collection.flush()
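
The batching logic in the loop above can also be written as a small generator, which keeps the "flush the final partial batch" case in one place. This is a sketch on toy data, not the real dataset; the name `batched_pairs` is hypothetical:

```python
def batched_pairs(pairs, batch_size):
    """Yield [titles, plots] column batches of at most batch_size rows each."""
    titles, plots = [], []
    for title, plot in pairs:
        titles.append(title)
        plots.append(plot)
        if len(titles) == batch_size:
            yield [titles, plots]
            titles, plots = [], []
    if titles:  # final partial batch
        yield [titles, plots]


# Toy rows standing in for (title, plot) pairs.
rows = [("t1", "p1"), ("t2", "p2"), ("t3", "p3")]
for batch in batched_pairs(rows, 2):
    print(batch)
```

With this helper, the insert loop would reduce to calling `embed_insert(batch)` for each `batch` in `batched_pairs(csv_load('./movies/plots.csv'), BATCH_SIZE)`.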


## Performing the Search
With all the data inserted into Milvus, we can start performing searches. In this example we search for movies based on the plot. Because both queries are sent as one batch, the reported search time is the time for the whole batch, not for each query.

In [23]:
# Search for titles that most closely match these phrases.
search_terms = ['A movie about cars', 'A movie about monsters']

In [24]:
# Search the database based on input text
def embed_search(data):
    embeds = transformer.encode(data) 
    return [x for x in embeds]

search_data = embed_search(search_terms)

start = time.time()
res = collection.search(
    data=search_data,  # Embedded search values
    anns_field="embedding",  # Search across embeddings
    param={},
    limit = TOP_K,  # Limit to top_k results per search
    output_fields=['title']  # Include title field in result
)
end = time.time()

for hits_i, hits in enumerate(res):
    print('Title:', search_terms[hits_i])
    print('Search Time:', end-start)
    print('Results:')
    for hit in hits:
        print( hit.entity.get('title'), '----', hit.distance)
    print()

Title: A movie about cars
Search Time: 0.04272913932800293
Results:
Red Line 7000 ---- 0.9104408621788025
The Mysterious Mr. Valentine ---- 0.9127437472343445
Tomboy ---- 0.9254708290100098

Title: A movie about monsters
Search Time: 0.04272913932800293
Results:
Monster Hunt ---- 0.8105474710464478
The Astro-Zombies ---- 0.8998500108718872
Wild Country ---- 0.9238440990447998
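
To interpret the distances printed above: with the `L2` metric, smaller means closer. As far as I know, Milvus reports the squared Euclidean distance for `L2`, and `all-MiniLM-L6-v2` produces unit-length embeddings; both points are assumptions here and worth checking against the respective documentation. Under those assumptions, the squared distance d and the cosine similarity c are related by d = 2 - 2c, which this sketch checks on toy vectors:

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy unit-length vectors, not real embeddings.
a = [1.0, 0.0]
b = [0.6, 0.8]
print(squared_l2(a, b), 2 - 2 * cosine(a, b))  # the two values agree up to rounding
```

Read this way, a distance of about 0.81 would correspond to a cosine similarity of roughly 0.59.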

