# Populating Embedding Vectors in MongoDB Atlas

We are going to create embedding attributes for the movies collection.

We will generate the embeddings locally (no API calls).

## References

- https://docs.llamaindex.ai/en/stable/examples/embeddings/huggingface.html#huggingfaceembedding
- Embedding models leaderboard : https://huggingface.co/spaces/mteb/leaderboard
- Explaining leaderboard: https://huggingface.co/blog/mteb

## Basic Setup

In [1]:
## Check if GPU is enabled
import os
import torch

## To disable GPU and experiment, uncomment the following line
## Normally, you would want to use GPU, if one is available.
# os.environ["CUDA_VISIBLE_DEVICES"]=""

print ("using CUDA/GPU: ", torch.cuda.is_available())

for i in range(torch.cuda.device_count()):
   print("device ", i , torch.cuda.get_device_properties(i).name)

using CUDA/GPU:  True
device  0 NVIDIA GeForce RTX 2070


In [None]:
## Set up logging.  To see more logging, set the level to DEBUG

import sys
import logging

# logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

## Step-1: Load Settings

In [1]:
## Load Settings from .env file
from dotenv import find_dotenv, dotenv_values

# _ = load_dotenv(find_dotenv()) # read local .env file
config = dotenv_values(find_dotenv())

# debug
# print (config)

ATLAS_URI = config.get('ATLAS_URI')

if not ATLAS_URI:
    raise Exception ("'ATLAS_URI' is not set.  Please set it in the .env file to continue...")

In [2]:
# Our variables

DB_NAME = 'sample_mflix'
COLLECTION_NAME = 'embedded_movies'

## Step-2: Initialize Mongo Atlas Client

In [3]:
from AtlasClient import AtlasClient

atlas_client = AtlasClient (ATLAS_URI, DB_NAME)
print("Connected to the Mongo Atlas database!")

Connected to the Mongo Atlas database!


In [4]:
collection = atlas_client.get_collection(COLLECTION_NAME)
document_count = collection.count_documents({})

print (f"document count = {document_count:,}")

document count = 3,483


## Step-3: Calculate Embeddings

We are going to generate all embeddings locally on our computer, using open-source models.  No API calls or API keys needed! 😄

**Let's try a few embedding models**

Here are a few select models for comparison, taken from the leaderboard: https://huggingface.co/spaces/mteb/leaderboard

| model name                              | overall score | model size | model params | embedding length | License  | url                                                            |
|-----------------------------------------|---------------|------------|--------------|------------------|----------|----------------------------------------------------------------|
| intfloat/e5-mistral-7b-instruct         | 66.x          | 15 GB      | 7.11 B       | 4096             | MIT      | https://huggingface.co/intfloat/e5-mistral-7b-instruct         |
| BAAI/bge-large-en-v1.5                  | 64.x          | 1.34 GB    | 335 M        | 1024             | MIT      | https://huggingface.co/BAAI/bge-large-en-v1.5                  |
| BAAI/bge-small-en-v1.5                  | 62.x          | 133 MB     | 33.5 M       | 384              | MIT      | https://huggingface.co/BAAI/bge-small-en-v1.5                  |
| sentence-transformers/all-mpnet-base-v2 | 57.8          | 438 MB     |              | 768              | Apache 2 | https://huggingface.co/sentence-transformers/all-mpnet-base-v2 |
| sentence-transformers/all-MiniLM-L12-v2 | 56.x          | 134 MB     |              | 384              | Apache 2 | https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2 |
| sentence-transformers/all-MiniLM-L6-v2  | 56.x          | 91 MB      |              | 384              | Apache 2 | https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2  |

In [5]:
import os
## LlamaIndex will download embedding models as needed.
## Set llamaindex cache dir to ./cache dir here (Default is system tmp)
## This way, we can easily see downloaded artifacts
os.environ['LLAMA_INDEX_CACHE_DIR'] = os.path.join(os.path.abspath(''), 'cache')

In [6]:
from llama_index.embeddings import HuggingFaceEmbedding
import time

## handy function to calculate embeddings, given a model
def create_embeddings (movies, embedding_model, embedding_attr):
    embed_model = HuggingFaceEmbedding(model_name=embedding_model)

    t2a = time.perf_counter()
    for movie in movies:
        movie[embedding_attr] = embed_model.get_text_embedding(movie['plot'])

    t2b = time.perf_counter()
    # print (f'Embeddings generated for {len(movies):,} movies  in {(t2b-t2a)*1000:,.0f} ms')
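
The loop above embeds one plot per call. As a hedged sketch, the same work can be grouped into batches so the model is called once per chunk of plots. `embed_in_batches`, `embed_batch` and `fake_embed_batch` are hypothetical names introduced only for illustration; with LlamaIndex one would likely pass `embed_model.get_text_embedding_batch` as `embed_batch`, assuming the installed version provides it. Whether batching is faster on your hardware is not measured here.

```python
from typing import Callable, List


def embed_in_batches(
    movies: List[dict],
    embed_batch: Callable[[List[str]], List[List[float]]],
    embedding_attr: str,
    batch_size: int = 32,
) -> None:
    """Attach an embedding to each movie, calling `embed_batch` once per chunk."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(movies), batch_size):
        batch = movies[start:start + batch_size]
        # One call embeds every plot in this chunk.
        vectors = embed_batch([m["plot"] for m in batch])
        for movie, vector in zip(batch, vectors):
            movie[embedding_attr] = vector


def fake_embed_batch(texts: List[str]) -> List[List[float]]:
    """Stand-in for a real model (assumption): returns a 2-number 'vector' per text."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]


# Tiny demo so the sketch runs without downloading a model.
demo_movies = [{"plot": "a b c"}, {"plot": "hello"}, {"plot": "x y"}]
embed_in_batches(demo_movies, fake_embed_batch, "plot_embedding_demo", batch_size=2)
print(demo_movies[0]["plot_embedding_demo"])  # [5.0, 2.0]
```

Passing the embedding function as a parameter keeps the batching logic independent of any particular model library.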

In [7]:
# fetch all movies
t1a = time.perf_counter()
movies = [m for m in atlas_client.find (collection_name=COLLECTION_NAME, filter={'plot':{"$exists": True}}, limit=0)]
t1b = time.perf_counter()

print (f'Fetched {len(movies):,} from Atlas in {(t1b-t1a)*1000:,.0f} ms')

Fetched 3,403 from Atlas in 14,260 ms


In [8]:
## Embedding models we want to use.

model_mappings = {
    'BAAI/bge-small-en-v1.5' : {'embedding_attr' : 'plot_embedding_bge_small', 'index_name' : 'idx_plot_embedding_bge_small'},

    'sentence-transformers/all-mpnet-base-v2' : {'embedding_attr' : 'plot_embedding_mpnet_base_v2', 'index_name' : 'idx_plot_embedding_mpnet_base_v2'},

    # 'sentence-transformers/all-MiniLM-L12-v2' : {'embedding_attr' : 'plot_embedding_minilm_l12_v2', 'index_name' : 'idx_plot_embedding_minilm_l12_v2'},

    'sentence-transformers/all-MiniLM-L6-v2' : {'embedding_attr' : 'plot_embedding_minilm_l6_v2', 'index_name' : 'idx_plot_embedding_minilm_l6_v2'},

    ## bge-large takes too long and consumes too much memory!
    # 'BAAI/bge-large-en-v1.5' : {'embedding_attr' : 'plot_embedding_bge_large', 'index_name' : 'idx_plot_embedding_bge_large', 'embedding_length' : 1024},
}

In [9]:
## For the embedding models selected above, we are going to create vectors
## in the movie collection.
## Remember, each embedding model has its own 'plot_embedding' attribute (we don't want to mix them up)

for key in model_mappings.keys():
    embedding_model = key
    embedding_attr = model_mappings[key]['embedding_attr']

    print (f'\n------- embedding model = {embedding_model} ---------')
    t1a = time.perf_counter()
    create_embeddings(movies=movies, embedding_model=embedding_model, embedding_attr=embedding_attr)
    t1b = time.perf_counter()
    avg_time_per_movie = (t1b-t1a)*1000 / len(movies)
    print (f'model={embedding_model}, created embeddings for {len(movies):,} movies in {(t1b-t1a)*1000:,.0f} ms, avg_time_per_movie={avg_time_per_movie:,.0f} ms')




------- embedding model = BAAI/bge-small-en-v1.5 ---------


model=BAAI/bge-small-en-v1.5, created embeddings for 3,403 movies in 33,073 ms, avg_time_per_movie=10 ms

------- embedding model = sentence-transformers/all-mpnet-base-v2 ---------
model=sentence-transformers/all-mpnet-base-v2, created embeddings for 3,403 movies in 25,875 ms, avg_time_per_movie=8 ms

------- embedding model = sentence-transformers/all-MiniLM-L6-v2 ---------
model=sentence-transformers/all-MiniLM-L6-v2, created embeddings for 3,403 movies in 13,702 ms, avg_time_per_movie=4 ms


## Step-4: Inspect Generated Embeddings

Run the cell below a few times to see a different movie each time

In [10]:
import random

movie = random.choice(movies)
# print (movie)
print ('_id :', movie['_id'])
print ('title :', movie['title'])
print ('plot :', movie['plot'])
print (f'plot_embeddings (existing openAI generated), len={len(movie["plot_embedding"])} , {movie["plot_embedding"][:5]}...')
print (f'plot_embedding_bge_small , len={len(movie["plot_embedding_bge_small"])} , {movie["plot_embedding_bge_small"][:5]}...')
print (f'plot_embedding_mpnet_base_v2 , len={len(movie["plot_embedding_mpnet_base_v2"])} , {movie["plot_embedding_mpnet_base_v2"][:5]}...')
print (f'plot_embedding_minilm_l6_v2 , len={len(movie["plot_embedding_minilm_l6_v2"])} , {movie["plot_embedding_minilm_l6_v2"][:5]}...')

_id : 573a1399f29313caabced644
title : Unforgiven
plot : Retired Old West gunslinger William Munny reluctantly takes on one last job, with the help of his old partner and a young man.
plot_embeddings (existing openAI generated), len=1536 , [-0.015232798, -0.024966672, -0.0036355436, -0.008456874, -0.021839323]...
plot_embedding_bge_small , len=384 , [-0.04438990354537964, 0.036243874579668045, 0.037500377744436264, -0.0010036976309493184, 0.0009061899036169052]...
plot_embedding_mpnet_base_v2 , len=768 , [-0.0023269052617251873, 0.1270730048418045, 0.033072661608457565, -0.005018125753849745, -0.037151169031858444]...
plot_embedding_minilm_l6_v2 , len=384 , [-0.03493762016296387, -0.008418717421591282, -0.004047343973070383, 0.01668776012957096, 0.015404606238007545]...
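
Beyond spot-checking one random movie, it can help to confirm that every embedding has the length its model is expected to produce. The following is a minimal sketch, not part of the original notebook: `find_dimension_mismatches` is a hypothetical helper, and the expected lengths are the ones listed in the model table in Step-3. The hand-made `sample` list only exists so the block runs on its own; in the notebook you would pass `movies` instead.

```python
from typing import Dict, List, Optional, Tuple


def find_dimension_mismatches(
    movies: List[dict], expected: Dict[str, int]
) -> List[Tuple[object, str, Optional[int]]]:
    """Return (_id, attribute, actual length) for each embedding whose length
    differs from the expected dimension. A missing attribute reports None."""
    problems = []
    for movie in movies:
        for attr, dims in expected.items():
            vector = movie.get(attr)
            actual = len(vector) if vector is not None else None
            if actual != dims:
                problems.append((movie.get("_id"), attr, actual))
    return problems


# Expected lengths, taken from the model table in Step-3.
expected_dims = {
    "plot_embedding_bge_small": 384,
    "plot_embedding_mpnet_base_v2": 768,
    "plot_embedding_minilm_l6_v2": 384,
}

# Hand-made sample data (assumption) so the sketch is self-contained.
sample = [
    {"_id": 1,
     "plot_embedding_bge_small": [0.0] * 384,
     "plot_embedding_mpnet_base_v2": [0.0] * 768,
     "plot_embedding_minilm_l6_v2": [0.0] * 384},
    {"_id": 2,
     "plot_embedding_bge_small": [0.0] * 383,   # one element short
     "plot_embedding_mpnet_base_v2": [0.0] * 768},  # minilm attribute missing
]

print(find_dimension_mismatches(sample, expected_dims))
# [(2, 'plot_embedding_bge_small', 383), (2, 'plot_embedding_minilm_l6_v2', None)]
```

An empty result means every movie carries all three embeddings at the expected length.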


## Step-5: Now Update Movie Collection in Atlas

We have calculated all embeddings locally.

Let's update the Atlas database

In [11]:
## If we update documents ONE-BY-ONE, it takes about 5 minutes to complete
## So this code is not recommended


# collection = atlas_client.get_collection(COLLECTION_NAME)

# t1a = time.perf_counter()
# for movie in movies:
# 	collection.replace_one({'_id': movie['_id']}, movie)
# t1b = time.perf_counter()

# print (f'Updated {len(movies):,} in Atlas in {(t1b-t1a)*1000:,.0f} ms')


In [12]:
## Let's do a bulk update
from pymongo import  ReplaceOne


collection = atlas_client.get_collection(COLLECTION_NAME)

replacements = [ReplaceOne ({"_id" : movie["_id"]}, movie) for movie in movies]

# print (replacements[:3])

# Perform bulk replacement
print (f'About to update {len(replacements)} movies in Atlas...')
t1a = time.perf_counter()
result = collection.bulk_write(replacements)
t1b = time.perf_counter()

## Print result
print(f"Update matched count: {result.matched_count}")
print(f"Update modified count: {result.modified_count}")
print (f'Updated {len(movies):,} in Atlas in {(t1b-t1a)*1000:,.0f} ms')


About to update 3403 movies in Atlas...
Update matched count: 3403
Update modified count: 0
Updated 3,403 in Atlas in 70,850 ms
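
The bulk write above replaces each document in full. One possible alternative, sketched here under assumptions and not benchmarked, is to send only the new embedding attributes with a `$set` update, leaving the rest of each document untouched. `build_set_updates` is a hypothetical helper that builds plain `(filter, update)` pairs; in the notebook these could be wrapped in `pymongo.UpdateOne` and passed to `collection.bulk_write`, as shown in the trailing comment.

```python
from typing import Iterable, List, Tuple


def build_set_updates(
    movies: List[dict], embedding_attrs: Iterable[str]
) -> List[Tuple[dict, dict]]:
    """Build (filter, update) pairs that set only the given embedding attributes."""
    attrs = list(embedding_attrs)
    pairs = []
    for movie in movies:
        fields = {attr: movie[attr] for attr in attrs if attr in movie}
        if fields:  # skip movies that carry none of the attributes
            pairs.append(({"_id": movie["_id"]}, {"$set": fields}))
    return pairs


# Hand-made sample data (assumption) so the sketch is self-contained.
demo = [
    {"_id": 1, "title": "A", "plot_embedding_bge_small": [0.1, 0.2]},
    {"_id": 2, "title": "B"},
]
pairs = build_set_updates(demo, ["plot_embedding_bge_small"])
print(pairs)
# [({'_id': 1}, {'$set': {'plot_embedding_bge_small': [0.1, 0.2]}})]

# In the notebook (assumes pymongo is installed, as in the cell above):
# from pymongo import UpdateOne
# result = collection.bulk_write([UpdateOne(f, u) for f, u in pairs])
```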


## Step-6: Verify Data in Atlas UI

Let's see if the embeddings are populated in Atlas.

Go to Atlas UI --> Browse Collections --> sample_mflix --> embedded_movies

You should see something like this:

![](images/custom-embeddings-1.png)

## Step-7: Create Indexes

We need to create indexes on embedding attributes before we query.

Refer to this document for detailed steps : [setup-atlas-index.md](setup-atlas-index.md)

Remember, we have several embeddings, and each needs its own index.

The Atlas free tier allows 3 indexes, so we can create 2 additional indexes. That is perfectly OK for this lab.  You can choose which ones to experiment with.

**In Atlas UI, enter the index commands below correctly.  Make sure `path` and `numDimensions` match!**

![](images/atlas-index-5.png)


### Embedding-1: `BAAI/bge-small-en-v1.5`

Index type: **Atlas Vector Search**

Index name: **`idx_plot_embedding_bge_small`**

**Index definition**

```json
{
  "fields": [
    {
      "type": "vector",
      "path": "plot_embedding_bge_small",
      "numDimensions": 384,
      "similarity": "euclidean"
    }
  ]
}
```

### Embedding-2: `sentence-transformers/all-mpnet-base-v2`

Index type: **Atlas Vector Search**

Index name: **`idx_plot_embedding_mpnet_base_v2`**

**Index definition**

```json
{
  "fields": [
    {
      "type": "vector",
      "path": "plot_embedding_mpnet_base_v2",
      "numDimensions": 768,
      "similarity": "euclidean"
    }
  ]
}
```

### (Optional) Embedding-3: `sentence-transformers/all-MiniLM-L6-v2`

Index type: **Atlas Vector Search**

Index name: **`idx_plot_embedding_minilm_l6_v2`**

**Index definition**

```json
{
  "fields": [
    {
      "type": "vector",
      "path": "plot_embedding_minilm_l6_v2",
      "numDimensions": 384,
      "similarity": "euclidean"
    }
  ]
}
```
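
The three definitions above differ only in `path` and `numDimensions`. As a minimal sketch, they can be generated from one table so the two values stay in sync; `vector_index_definition` and `index_specs` are hypothetical names, and the output is meant to be pasted into the Atlas UI rather than sent through any API.

```python
import json


def vector_index_definition(path: str, num_dimensions: int,
                            similarity: str = "euclidean") -> dict:
    """Build a vector index definition in the same shape as the JSON above."""
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": num_dimensions,
                "similarity": similarity,
            }
        ]
    }


# index name -> (embedding attribute, embedding length), as used in this lab
index_specs = {
    "idx_plot_embedding_bge_small": ("plot_embedding_bge_small", 384),
    "idx_plot_embedding_mpnet_base_v2": ("plot_embedding_mpnet_base_v2", 768),
    "idx_plot_embedding_minilm_l6_v2": ("plot_embedding_minilm_l6_v2", 384),
}

for index_name, (path, dims) in index_specs.items():
    print(f"Index name: {index_name}")
    print(json.dumps(vector_index_definition(path, dims), indent=2))
```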




## Step-8: Verifying Indexes

Make sure indexes are ready and active before proceeding to the next step.

![](images/index-verify.png)