# Search a Milvus collection
This notebook walks through the steps to execute a vector search against a Milvus instance.

### Pre-requisites: 
You will need a Milvus instance with data loaded and an index defined, as set up in `3_bootstrap_milvus.ipynb`.

**Author:** Leo Thomas - leo@developmentseed.org\
**Last updated:** 2023/06/15

In [1]:
import os
import numpy as np
import faiss
from pymilvus import connections, Collection

In [2]:
DATA_DIR = os.path.abspath("./one_percent_data_sample")
EMBEDDINGS_DIR = os.path.join(DATA_DIR, "embeddings")
COLLECTION_NAME = "a2o_bioacoustics"

### 1.0. Select vectors to search against

In [3]:
embeddings_files = [
    os.path.join(EMBEDDINGS_DIR, file) 
    for file in os.listdir(EMBEDDINGS_DIR) 
    if os.path.isfile(os.path.join(EMBEDDINGS_DIR, file))
]

embeddings = np.load(embeddings_files[0])
search_vectors = embeddings[np.random.choice(range(len(embeddings)), size=1)]

### 1.1. Apply PCA transformation to search vectors so that they match the dimensionality of the indexed data

In [4]:
# read trained PCA matrix from file
pca_matrix = faiss.read_VectorTransform("1280_to_256_dimensionality_reduction.pca")
    
# apply the dimensionality reduction using the PCA matrix
reduced_search_vectors = pca_matrix.apply(search_vectors)    
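
Conceptually, applying a trained PCA transform is a fixed linear map of the form `y = A x + b`, which is why the search vectors must go through the same transform as the indexed data. The sketch below illustrates that idea in plain Python with a made-up 4-to-2 reduction. The matrix `A`, the offset `b` and the helper `apply_linear_transform` are hypothetical and are not read from the `.pca` file used above.

```python
def apply_linear_transform(A, b, x):
    """Compute y = A x + b for a single input vector x."""
    return [
        sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + b_i
        for row, b_i in zip(A, b)
    ]


# Hypothetical transform that reduces 4 dimensions to 2.
A = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
]
b = [0.0, -1.0]

x = [2.0, 4.0, 6.0, 8.0]
y = apply_linear_transform(A, b, x)

# The output has as many elements as A has rows.
print(len(x), "->", len(y), y)
```

A quick sanity check in the notebook itself would be to compare `reduced_search_vectors.shape[1]` with the dimensionality of the `embedding` field before searching.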

### 2.0. Connect to the Milvus instance

In [5]:
# Connect to the remote Milvus instance using kubectl port forwarding:
# gcloud auth login
# gcloud container clusters get-credentials bioacoustics-devseed-staging-cluster --region=us-central1-f  --project dulcet-clock-385511
# kubectl port-forward service/milvus 9091:9091 & kubectl port-forward service/milvus 19530:19530 &

HOST = "127.0.0.1"
PORT = 19530
connections.connect(host=HOST, port=PORT)
print("Connections: ", connections.list_connections())

Connections:  [('default', <pymilvus.client.grpc_handler.GrpcHandler object at 0x117b69520>)]
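
If the notebook needs to run against more than one Milvus instance, one option is to read the endpoint from the environment and fall back to the port-forwarded defaults used above. This is a minimal sketch: the variable names `MILVUS_HOST` and `MILVUS_PORT` and the helper `milvus_endpoint` are assumptions for illustration, not something Milvus or `pymilvus` defines.

```python
import os


def milvus_endpoint(env=os.environ):
    """Return (host, port), falling back to the local port-forward defaults."""
    host = env.get("MILVUS_HOST", "127.0.0.1")  # hypothetical variable name
    port = int(env.get("MILVUS_PORT", "19530"))  # hypothetical variable name
    return host, port


HOST, PORT = milvus_endpoint()
print(f"Milvus endpoint: {HOST}:{PORT}")
```

The resulting `HOST` and `PORT` can then be passed to `connections.connect(host=HOST, port=PORT)` exactly as in the cell above.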


### 2.1. Load the collection into memory 
The collection will be released afterwards with `collection.release()`

In [6]:
collection = Collection(COLLECTION_NAME)
collection.load()
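
To make sure a loaded collection is always released, even if a search raises an error, the load/release pair can be wrapped in a context manager. The sketch below is one possible way to do this. The helper `loaded` is hypothetical, and `FakeCollection` is a stand-in so that the example runs without a Milvus connection.

```python
from contextlib import contextmanager


@contextmanager
def loaded(collection):
    """Load a collection on entry and release it on exit."""
    collection.load()
    try:
        yield collection
    finally:
        # Runs even if the body of the `with` block raises.
        collection.release()


class FakeCollection:
    """Stand-in that records calls instead of talking to Milvus."""

    def __init__(self):
        self.events = []

    def load(self):
        self.events.append("load")

    def release(self):
        self.events.append("release")


fake = FakeCollection()
with loaded(fake) as c:
    c.events.append("search")

print(fake.events)
```

With a real collection this would read `with loaded(Collection(COLLECTION_NAME)) as collection: ...`, with the searches placed inside the `with` block.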

### 3.0. Execute a search!
Params: 
- `data`: the input search vector(s)
- `anns_field`: which field in the database to execute the vector search against
- `param`: field specific to the index being used
    - `metric_type`: defines the similarity metric to use when searching (e.g. `L2` is Euclidean distance: the square root of the sum of the squared differences between corresponding elements of the search vector and the database vector, essentially the Pythagorean theorem extended to a higher number of dimensions)
    - `nprobe`: number of regions (clusters) to search. The higher the value of `nprobe`, the higher the accuracy of the search, at the expense of longer search times. If `nprobe` is set to the total number of clusters, the search becomes equivalent to a Flat (exhaustive) search.
- `limit`: number of results to return
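
To make `metric_type` and `nprobe` more concrete, here is a toy, pure-Python illustration of a cluster-based search. It is only a sketch of the general idea and not how Milvus implements its indexes. The centroids, vectors and function names are all made up for the example.

```python
import math


def squared_l2(a, b):
    """Sum of squared element-wise differences between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_search(query, vectors, limit):
    """Exhaustive search: rank every stored vector by distance to the query."""
    ranked = sorted(range(len(vectors)), key=lambda i: (squared_l2(query, vectors[i]), i))
    return ranked[:limit]


def ivf_search(query, vectors, centroids, nprobe, limit):
    """Toy cluster-based search: only scan the nprobe clusters nearest the query."""
    # Assign each stored vector to its nearest centroid.
    clusters = {c: [] for c in range(len(centroids))}
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: squared_l2(v, centroids[c]))
        clusters[nearest].append(i)
    # Keep only the nprobe clusters whose centroids are closest to the query.
    probed = sorted(range(len(centroids)), key=lambda c: squared_l2(query, centroids[c]))[:nprobe]
    candidates = [i for c in probed for i in clusters[c]]
    ranked = sorted(candidates, key=lambda i: (squared_l2(query, vectors[i]), i))
    return ranked[:limit]


centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
vectors = [(0.5, 0.5), (1.0, 4.0), (9.0, 1.0), (4.9, 0.0), (5.6, 0.0), (1.0, 9.0), (0.0, 6.0)]
query = (5.2, 0.0)

# Euclidean distance is the square root of the squared L2 sum.
print(math.sqrt(squared_l2((0.0, 0.0), (3.0, 4.0))))

print("exhaustive:", flat_search(query, vectors, limit=2))
print("nprobe=1:  ", ivf_search(query, vectors, centroids, nprobe=1, limit=2))
print("nprobe=3:  ", ivf_search(query, vectors, centroids, nprobe=3, limit=2))
```

In this toy data the closest vector to the query sits just across a cluster boundary, so probing a single cluster misses it, while probing all three clusters returns the same ranking as the exhaustive search. Taking the square root does not change the ranking, since it preserves the order of the distances.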

In [7]:
search_params = {
    "data": reduced_search_vectors,
    "anns_field": "embedding",
    "param": {"metric_type": "L2", "params": {"nprobe": 16}},
    "limit": 20,
}

search_results = collection.search(**search_params)
for result in search_results:
    for i, r in enumerate(result): 
        print(f"{i+1}. Id: {r.id}, distance: {r.distance}")

# flatten the per-query results into a single list of ids
page_1_ids = [r.id for result in search_results for r in result]

1. Id: 441649888463575420, distance: 0.2145068496465683
2. Id: 441649888461139864, distance: 0.2824166417121887
3. Id: 441649888465029954, distance: 0.2849480211734772
4. Id: 441649888465146372, distance: 0.2854631841182709
5. Id: 441649888457424163, distance: 0.28580403327941895
6. Id: 441649888463675072, distance: 0.2887558937072754
7. Id: 441649888466630397, distance: 0.28885337710380554
8. Id: 441649888469014433, distance: 0.2939363420009613
9. Id: 441649888468647003, distance: 0.29609179496765137
10. Id: 441649888460444665, distance: 0.298659086227417
11. Id: 441649888455100366, distance: 0.3015505075454712
12. Id: 441649888460965006, distance: 0.30356737971305847
13. Id: 441649888456058672, distance: 0.3056580424308777
14. Id: 441649888455112596, distance: 0.30653345584869385
15. Id: 441649888454621164, distance: 0.3073357343673706
16. Id: 441649888457254935, distance: 0.30832263827323914
17. Id: 441649888464497169, distance: 0.3085707426071167
18. Id: 441649888469177362, distanc
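
Search results come back as one list of hits per input vector, so collecting ids means flattening a nested structure. In a nested comprehension the `for` clauses are read left to right, outer loop first. The sketch below shows this with stand-in data; the `Hit` tuple here is a hypothetical substitute for the hit objects returned by the search.

```python
from collections import namedtuple

# Hypothetical stand-in for a search hit, with the two attributes used above.
Hit = namedtuple("Hit", ["id", "distance"])

# One inner list per query vector.
fake_results = [
    [Hit(1, 0.1), Hit(2, 0.2)],
    [Hit(3, 0.3), Hit(4, 0.4)],
]

# Outer loop (per query) comes first, inner loop (per hit) second.
flat_ids = [hit.id for hits in fake_results for hit in hits]

# Keeping one list of ids per query is often more useful with several queries.
ids_per_query = [[hit.id for hit in hits] for hits in fake_results]

print(flat_ids)
print(ids_per_query)
```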

### 3.1. Paginate a search
Using the `offset` parameter, we can paginate through the result set. With `offset` == 0 and `limit` == 20, we get the first 20 results (as above); with `offset` == 10 and `limit` == 20, we get results 11-30. We can verify this by asserting that the first 10 results of the paginated search below are identical to the last 10 results from the search above.
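
The relationship between `offset` and `limit` can be pictured as slicing a single ranked list. The sketch below uses plain integers as rank positions rather than real search results, and the helper `page` is hypothetical.

```python
def page(ranked_ids, offset, limit):
    """Return `limit` items starting after the first `offset` items."""
    return ranked_ids[offset:offset + limit]


# Rank positions 1..100 stand in for a ranked result set.
ranked_ids = list(range(1, 101))

page_1 = page(ranked_ids, offset=0, limit=20)   # ranks 1-20
page_2 = page(ranked_ids, offset=10, limit=20)  # ranks 11-30

# The two pages overlap on ranks 11-20.
print(page_1[10:])
print(page_2[:10])
```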

In [8]:
search_params = {
    "data": reduced_search_vectors,
    "anns_field": "embedding",
    "param": {"metric_type": "L2", "params": {"nprobe": 16}, "offset": 10},
    "limit": 20,
}

page_2_search_results = collection.search(**search_params)
for result in page_2_search_results:
    for i, r in enumerate(result): 
        print(f"{i+1}. Id: {r.id}, distance: {r.distance}")

# flatten the per-query results into a single list of ids
page_2_ids = [r.id for result in page_2_search_results for r in result]

assert page_1_ids[10:] == page_2_ids[:10]


1. Id: 441649888455100366, distance: 0.3015505075454712
2. Id: 441649888460965006, distance: 0.30356737971305847
3. Id: 441649888456058672, distance: 0.3056580424308777
4. Id: 441649888455112596, distance: 0.30653345584869385
5. Id: 441649888454621164, distance: 0.3073357343673706
6. Id: 441649888457254935, distance: 0.30832263827323914
7. Id: 441649888464497169, distance: 0.3085707426071167
8. Id: 441649888469177362, distance: 0.30902576446533203
9. Id: 441649888457558552, distance: 0.3094905614852905
10. Id: 441649888463768405, distance: 0.3095042407512665
11. Id: 441649888467530376, distance: 0.3095369338989258
12. Id: 441649888462537005, distance: 0.3102073073387146
13. Id: 441649888463917559, distance: 0.31034770607948303
14. Id: 441649888457242246, distance: 0.31035780906677246
15. Id: 441649888460168757, distance: 0.3109000623226166
16. Id: 441649888467328270, distance: 0.3109663128852844
17. Id: 441649888461803448, distance: 0.31106722354888916
18. Id: 441649888463204582, dista

### 3.2. Use the `output_fields` parameter to specify the fields to be returned alongside the Id and distance from the input vector

In [9]:
search_params = {
    "data": reduced_search_vectors,
    "anns_field": "embedding",
    "param": {"metric_type": "L2", "params": {"nprobe": 16}, "offset": 10},
    "limit": 10,
    "output_fields": ["site_name", "subsite_name", "file_timestamp"]
}

search_results = collection.search(**search_params)
for result in search_results:
    for i, r in enumerate(result): 
        print(f"{i+1}. Id: {r.id}, distance: {r.distance}, entity<{r.entity}>")



1. Id: 441649888455100366, distance: 0.3015505075454712, entity<id: 441649888455100366, distance: 0.3015505075454712, entity: {'site_name': 'Bon-Bon-Station', 'subsite_name': 'Wet-A', 'file_timestamp': 1587141000}>
2. Id: 441649888460965006, distance: 0.30356737971305847, entity<id: 441649888460965006, distance: 0.30356737971305847, entity: {'site_name': 'Bon-Bon-Station', 'subsite_name': 'Dry-A', 'file_timestamp': 1589725800}>
3. Id: 441649888456058672, distance: 0.3056580424308777, entity<id: 441649888456058672, distance: 0.3056580424308777, entity: {'site_name': 'Bon-Bon-Station', 'subsite_name': 'Wet-A', 'file_timestamp': 1587141000}>
4. Id: 441649888455112596, distance: 0.30653345584869385, entity<id: 441649888455112596, distance: 0.30653345584869385, entity: {'site_name': 'Bon-Bon-Station', 'subsite_name': 'Wet-A', 'file_timestamp': 1595449800}>
5. Id: 441649888454621164, distance: 0.3073357343673706, entity<id: 441649888454621164, distance: 0.3073357343673706, entity: {'site_nam
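
The `file_timestamp` values in the output look like Unix epoch seconds. Assuming that is the case (the schema is not shown in this notebook), they can be converted to readable UTC datetimes as sketched below. The helper `to_utc` is hypothetical.

```python
from datetime import datetime, timezone


def to_utc(epoch_seconds):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


# Value taken from the first result above.
print(to_utc(1587141000).isoformat())
```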

### 3.3. Use the `expr` field to specify a metadata filter to apply before executing the vector search
Ref: https://milvus.io/docs/hybridsearch.md
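
Filter expressions are plain strings, so quoting mistakes are easy to make when values come from variables. The sketch below shows one way to assemble an expression from parts, using `json.dumps` to add the double quotes. The helpers `eq`, `one_of` and `all_of` are hypothetical and are not part of `pymilvus`; the field names are the ones used in this notebook.

```python
import json


def eq(field, value):
    """Build an equality clause, quoting string values with double quotes."""
    return f"{field} == {json.dumps(value)}"


def one_of(field, values):
    """Build a membership clause using the `in` operator."""
    return f"{field} in {json.dumps(list(values))}"


def all_of(*clauses):
    """Combine clauses so that every one of them must hold."""
    return " and ".join(f"({clause})" for clause in clauses)


expr = all_of(
    eq("subsite_name", "Wet-A"),
    one_of("site_name", ["Bon-Bon-Station", "Booroopki"]),
    "file_timestamp >= 1583519400",
)
print(expr)
```

The resulting string can be passed as the `expr` value in `search_params`, in place of the hand-written expression.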

In [10]:
search_params = {
    "data": reduced_search_vectors,
    "anns_field": "embedding",
    "param": {"metric_type": "L2", "params": {"nprobe": 16}, "offset": 10},
    "limit": 10,
    "output_fields": ["site_name", "subsite_name", "file_timestamp"],
    "expr": "subsite_name == \"Wet-A\"", 
}

search_results = collection.search(**search_params)
for result in search_results:
    for i, r in enumerate(result): 
        print(f"{i+1}. Id: {r.id}, distance: {r.distance}, entity<{r.entity}>")


1. Id: 441649888463204582, distance: 0.31172290444374084, entity<id: 441649888463204582, distance: 0.31172290444374084, entity: {'site_name': 'Bon-Bon-Station', 'subsite_name': 'Wet-A', 'file_timestamp': 1583519400}>
2. Id: 441649888468543461, distance: 0.31275108456611633, entity<id: 441649888468543461, distance: 0.31275108456611633, entity: {'site_name': 'Booroopki', 'subsite_name': 'Wet-A', 'file_timestamp': 1653573600}>
3. Id: 441649888459760710, distance: 0.31351348757743835, entity<id: 441649888459760710, distance: 0.31351348757743835, entity: {'site_name': 'Matuwa-Indigenous-Protected-Area', 'subsite_name': 'Wet-A', 'file_timestamp': 1625169600}>
4. Id: 441649888456475908, distance: 0.31355029344558716, entity<id: 441649888456475908, distance: 0.31355029344558716, entity: {'site_name': 'Bon-Bon-Station', 'subsite_name': 'Wet-A', 'file_timestamp': 1594737000}>
5. Id: 441649888466145366, distance: 0.3174912929534912, entity<id: 441649888466145366, distance: 0.3174912929534912, ent

### 3.4. Search for multiple input vectors at the same time
Because of how it is implemented internally, Milvus is optimized for searching with multiple input vectors in a single request rather than one vector at a time.
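
When there are many query vectors, one option is to send them in fixed-size batches and keep track of which original query each result list belongs to. The sketch below shows only the batching logic, on plain Python lists. The helper `batched` and the batch size are arbitrary choices for illustration.

```python
def batched(vectors, batch_size):
    """Yield (start_index, batch) pairs covering `vectors` in order."""
    for start in range(0, len(vectors), batch_size):
        yield start, vectors[start:start + batch_size]


# Integers stand in for query vectors.
queries = list(range(5))

for start, batch in batched(queries, batch_size=2):
    # Each batch would be passed as the `data` argument of one search call.
    # Result list i of that call then belongs to original query `start + i`.
    print(start, batch)
```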

In [11]:
search_vectors = embeddings[np.random.choice(range(len(embeddings)), size=5)]
reduced_search_vectors = pca_matrix.apply(search_vectors)

search_params = {
    "data": reduced_search_vectors,
    "anns_field": "embedding",
    "param": {"metric_type": "L2", "params": {"nprobe": 16}, "offset": 10},
    "limit": 10
}

search_results = collection.search(**search_params)
for i, result in enumerate(search_results):
    print(f"Results for input vector {i}")
    for j, r in enumerate(result): 
        print(f"{j+1}. Id: {r.id}, distance: {r.distance}")

Results for input vector 0
1. Id: 441649888469013681, distance: 5.000259876251221
2. Id: 441649888455944961, distance: 5.015172958374023
3. Id: 441649888464328962, distance: 5.031186580657959
4. Id: 441649888455068979, distance: 5.0317792892456055
5. Id: 441649888457707625, distance: 5.050155162811279
6. Id: 441649888460127498, distance: 5.051435470581055
7. Id: 441649888460290776, distance: 5.078517913818359
8. Id: 441649888461726581, distance: 5.128861904144287
9. Id: 441649888454569811, distance: 5.1348419189453125
10. Id: 441649888462262241, distance: 5.137036323547363
Results for input vector 1
1. Id: 441649888462521406, distance: 4.354253768920898
2. Id: 441649888460762331, distance: 4.355628967285156
3. Id: 441649888466915997, distance: 4.36440372467041
4. Id: 441649888458244186, distance: 4.3656206130981445
5. Id: 441649888461532915, distance: 4.365994453430176
6. Id: 441649888468863040, distance: 4.36760139465332
7. Id: 441649888465266265, distance: 4.368870735168457
8. Id: 44