Steps:
- Download the Best Buy sample dataset from GitHub: https://github.com/BestBuyAPIs/open-data-set/blob/master/products.json

In [3]:
import urllib.request
import os
import json

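# Remove any stale copy first; skip the removal on a first run, when the file does not exist yet.
if os.path.exists("./assets/products.json"):
    os.remove("./assets/products.json")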
! wget -P ./assets/ 'https://raw.githubusercontent.com/BestBuyAPIs/open-data-set/master/products.json'


--2023-03-21 10:38:43--  https://raw.githubusercontent.com/BestBuyAPIs/open-data-set/master/products.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 39685207 (38M) [text/plain]
Saving to: ‘./assets/products.json’


2023-03-21 10:38:45 (16,0 MB/s) - ‘./assets/products.json’ saved [39685207/39685207]
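
The download cell relies on `wget` being installed. A minimal pure-Python sketch of the same step, using the `urllib.request` module the notebook already imports, could look like the following. The helper name `download_if_missing` is hypothetical and not part of the original notebook; it skips the download when the file is already present.

```python
import os
import urllib.request

DATASET_URL = "https://raw.githubusercontent.com/BestBuyAPIs/open-data-set/master/products.json"


def download_if_missing(url, dest_path):
    """Download url to dest_path unless the file already exists.

    Returns True if a download happened, False if the existing file was kept.
    (Hypothetical helper, not part of the original notebook.)
    """
    if os.path.exists(dest_path):
        return False
    # Create the target directory (for example ./assets) if it is missing.
    os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
    urllib.request.urlretrieve(url, dest_path)
    return True


# Example call (downloads roughly 38 MB on the first run):
# download_if_missing(DATASET_URL, "./assets/products.json")
```

To force a fresh copy, delete the file before calling the helper.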



Instantiate your MongoDB Atlas client.

In [11]:

import pymongo
connection = pymongo.MongoClient("mongodb+srv://<username>:<password>@realmcluster.jklrj.mongodb.net/?retryWrites=true&w=majority")
collection = connection["vectorSearchDemo"]["productCatalog"]
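
Usernames and passwords that contain characters such as `@`, `:` or `/` need to be percent-encoded before they go into a connection string. Here is a small sketch, assuming the credentials come from environment variables. The variable names `ATLAS_USER` and `ATLAS_PASSWORD` and the helper `build_atlas_uri` are illustrative; the host is the one used in the cell above.

```python
import os
from urllib.parse import quote_plus


def build_atlas_uri(username, password, host):
    """Return a mongodb+srv URI with percent-encoded credentials.

    (Hypothetical helper for illustration.)
    """
    return (
        f"mongodb+srv://{quote_plus(username)}:{quote_plus(password)}"
        f"@{host}/?retryWrites=true&w=majority"
    )


# ATLAS_USER and ATLAS_PASSWORD are assumed environment variable names.
uri = build_atlas_uri(
    os.environ.get("ATLAS_USER", "<username>"),
    os.environ.get("ATLAS_PASSWORD", "<password>"),
    "realmcluster.jklrj.mongodb.net",
)
# connection = pymongo.MongoClient(uri)
```

Reading the credentials from the environment also keeps them out of the notebook itself.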

We will load a DistilRoBERTa model using SentenceTransformer.

In [22]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('sentence-transformers/all-distilroberta-v1')

Now we take the first 5000 items from the list, encode the name field of each one, and load the documents into your MongoDB Atlas deployment. This should take about 2.5 minutes for 5000 documents.

In [23]:

import itertools

collection.delete_many({})

with open('./assets/products.json', 'r') as file:
    json_object = json.load(file)

documents_to_process = itertools.islice(json_object, 5000)
documents_to_insert = []
for doc in documents_to_process:
    encoded = model.encode(doc["name"]).tolist()
    doc["encoded"] = encoded
    documents_to_insert.append(doc)

collection.insert_many(documents_to_insert)

<pymongo.results.InsertManyResult at 0x2c3ff7820>
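
Encoding one name per call is likely the slow part of this cell. `SentenceTransformer.encode` also accepts a list of sentences, so the names can be encoded and inserted in batches. The sketch below works under that assumption. The helper `embed_in_batches` and the batch size of 64 are illustrative choices, and `encode_fn` stands in for `model.encode` so the batching logic can be checked without loading the model.

```python
import itertools


def embed_in_batches(docs, encode_fn, batch_size=64,
                     text_field="name", vector_field="encoded"):
    """Yield lists of documents, each with an embedding under vector_field.

    encode_fn must map a list of strings to a sequence of vectors,
    as model.encode does. (Hypothetical helper for illustration.)
    """
    iterator = iter(docs)
    while True:
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            return
        vectors = encode_fn([doc[text_field] for doc in batch])
        for doc, vector in zip(batch, vectors):
            # NumPy rows expose tolist(); plain sequences are copied as they are.
            doc[vector_field] = (
                vector.tolist() if hasattr(vector, "tolist") else list(vector)
            )
        yield batch


# Usage with the notebook's objects (not executed here):
# for batch in embed_in_batches(itertools.islice(json_object, 5000), model.encode):
#     collection.insert_many(batch)
```

Inserting per batch also avoids holding all 5000 encoded documents in a single list before the write.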

It is worth checking the size of the embedding array, because it must be reported in the Atlas Search vector index definition. With this RoBERTa-based model, it should be 768.

In [28]:
embeddedSize = collection.aggregate([
    {
        '$project': {
            'size': {
                '$size': '$encoded'
            }, 
            '_id': 0
        }
    }, {
        '$limit': 1
    }
])

for size in embeddedSize:
    print(size)

{'size': 768}


Now that the documents are inserted in MongoDB Atlas, we need to create an Atlas Search vector index on the vectorSearchDemo.productCatalog collection. Name the index default, since the query further down refers to it by that name.
Note that the dimensions value is equal to the size of the encoded array. This depends on the model you use.

{
  "mappings": {
    "fields": {
      "encoded": [
        {
          "dimensions": 768,
          "similarity": "euclidean",
          "type": "knnVector"
        }
      ]
    }
  }
}
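
Because `dimensions` has to match the model's output size, it can help to generate the definition from the measured value instead of typing it by hand. The sketch below produces the same JSON as above. `build_knn_index_definition` is a hypothetical helper, and the set of accepted similarity names is an assumption based on the values documented for `knnVector` fields.

```python
import json

# Similarity functions assumed to be accepted by knnVector fields.
ALLOWED_SIMILARITIES = {"euclidean", "cosine", "dotProduct"}


def build_knn_index_definition(path, dimensions, similarity="euclidean"):
    """Return an Atlas Search index definition for one knnVector field.

    (Hypothetical helper for illustration.)
    """
    if similarity not in ALLOWED_SIMILARITIES:
        raise ValueError(f"unsupported similarity: {similarity!r}")
    if dimensions <= 0:
        raise ValueError("dimensions must be a positive integer")
    return {
        "mappings": {
            "fields": {
                path: [
                    {
                        "dimensions": dimensions,
                        "similarity": similarity,
                        "type": "knnVector",
                    }
                ]
            }
        }
    }


# 768 is the size measured in the previous cell.
print(json.dumps(build_knn_index_definition("encoded", 768), indent=2))
```

The printed JSON can be pasted into the index editor in the Atlas UI.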

We can now query the data.
First we need to encode a sentence:


In [31]:
vector_text="I need a device to listen to music"
vector_query = model.encode(vector_text).tolist()

Then we can query the data with the knnBeta operator:

In [32]:
pipeline = [
        {
            "$search": {
                "index": "default",
                "knnBeta": {
                    "vector": vector_query,
                    "path": "encoded",
                    "k": 10
                }
            }
        },
        {
            "$project": {
                "_id": 0,
                'score': {
                    '$meta': 'searchScore'
                },
                "name":1
            }
        }
    ]

docs = list(collection.aggregate(pipeline))

print(docs)

[{'name': 'SONOS - PLAY:1 Wireless Speaker for Streaming Music - White', 'score': 0.4775533676147461}, {'name': 'SONOS - PLAY:1 Wireless Speaker for Streaming Music - Black', 'score': 0.46975255012512207}, {'name': 'Onkyo - On-Ear Headphones', 'score': 0.46310728788375854}, {'name': 'Urbanears - Plattan On-Ear Headphones - Mulberry', 'score': 0.45934388041496277}, {'name': 'iLive - Wireless Over-the-Ear Headphones - Red', 'score': 0.4592013657093048}, {'name': 'Onkyo - On-Ear Headphones - Violet', 'score': 0.45326054096221924}, {'name': 'Urbanears - Plattan On-Ear Headphones - Indigo', 'score': 0.4529149830341339}, {'name': 'Urbanears - Plattan On-Ear Headphones - Moss', 'score': 0.45284903049468994}, {'name': 'iLive - On-Ear Wireless Headphones - Red', 'score': 0.45234042406082153}, {'name': 'iLive - Wireless Earbud Headphones - Red', 'score': 0.4523336887359619}]
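
On a small sample, the results returned by the index can be sanity-checked against an exact nearest-neighbour search computed locally. This is a minimal sketch, assuming the documents (with their `encoded` vectors) are still available in memory. `brute_force_knn` is a hypothetical helper. It ranks by plain Euclidean distance, so the numbers it returns are distances (smaller is closer), not Atlas `searchScore` values.

```python
import heapq
import math


def brute_force_knn(query_vector, docs, k=10, vector_field="encoded"):
    """Return the k documents closest to query_vector as (distance, name) pairs.

    (Hypothetical helper for illustration; exact search, linear in len(docs).)
    """
    scored = (
        (math.dist(query_vector, doc[vector_field]), doc["name"])
        for doc in docs
    )
    return heapq.nsmallest(k, scored)


# Toy data to show the shape of the result.
toy_docs = [
    {"name": "a", "encoded": [0.0, 0.0]},
    {"name": "b", "encoded": [3.0, 4.0]},
    {"name": "c", "encoded": [1.0, 0.0]},
]
print(brute_force_knn([0.0, 0.0], toy_docs, k=2))
```

With the notebook's objects, the call would be `brute_force_knn(vector_query, documents_to_insert, k=10)`. The names it returns should largely match the `$search` results.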
