# Learning Objectives

- Implement key ideas of building vector databases


# Setup

In [1]:
!pip install -q chromadb==0.4.22 \
                langchain==0.1.9 \
                langchain-community==0.0.32 \
                sentence-transformers==2.3.1 \
                datasets==2.19.1


In [2]:
import pandas as pd

from google.colab import userdata
from google.colab import drive

from datasets import load_dataset

from langchain_core.documents import Document
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_community.vectorstores import Chroma
from scipy.spatial.distance import cosine

# Business Use Case

In the world of online shopping, it's crucial for e-commerce platforms to help customers easily find and buy what they need. But with so many products available, it's a big challenge to make sure the right products are seen by the right people.

Given a large product assortment, designing an effective product search tool becomes paramount.


In this context, consider an e-commerce player like Amazon that has a huge range of products, from electronics to clothes to home items. Enabling customers to easily search for what they want improves their shopping experience. The mandate here is to make searching for products easier and more accurate, so customers can find what they want quickly.

Apart from product search, product recommendation can also be done using similarity search. We are going to do both search and product recommendation in this session.

The first step for both search and recommendation is converting the text (product description) into embeddings and storing them in a vector database. This process runs faster with a GPU. After this step, we can implement search and recommendation on a CPU. In this notebook, we set up the vector database using a GPU and persist it to Google Drive. In the second notebook, we implement a search and recommendation system on a CPU.
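To make the idea of similarity search concrete, here is a minimal sketch that uses hand-made three-dimensional vectors in place of real embeddings. The vectors, the product names, and the `cosine_similarity` and `top_k` helpers are illustrative assumptions; in this notebook the embeddings come from an embedding model and the search itself is handled by the vector database.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, catalog, k=2):
    """Return the k product names whose vectors are most similar to the query."""
    ranked = sorted(
        catalog,
        key=lambda name: cosine_similarity(query_vec, catalog[name]),
        reverse=True,
    )
    return ranked[:k]

# Toy "embeddings" (made up for illustration, not produced by a model)
toy_catalog = {
    "rose perfume": [0.9, 0.1, 0.0],
    "jasmine perfume": [0.8, 0.3, 0.1],
    "herbal shampoo": [0.1, 0.9, 0.2],
    "chocolate gift box": [0.0, 0.1, 0.9],
}

query_vector = [1.0, 0.1, 0.0]  # stands in for the embedding of a perfume query
print(top_k(query_vector, toy_catalog, k=2))
```

The same ranking idea serves both use cases: for search the query vector comes from the customer's text, while for recommendation it could come from a product the customer is already viewing.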

# Creating a Vector DB for Products

## Data

In [34]:
products_data = load_dataset("pgurazada1/amazon_india_products")['train'].to_pandas()

In [35]:
products_data.shape

(30000, 15)

We have 30000 products and 15 columns in this dataset. Let's have a glance at the data.

In [36]:
products_data.sample(5)

| | Uniq Id | Crawl Timestamp | Category | Product Title | Product Description | Brand | Pack Size Or Quantity | Mrp | Price | Site Name | Offers | Combo Offers | Stock Availibility | Product Asin | Image Urls |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 972 | 6cd4340a8e73d25368ac60214548adaf | 2019-10-29 15:59:36 +0000 | Hair Care | Glan Hair Bonding tape/Hair Patch System/Hair... | Red Liner - small Roll - 10 mm x 5 meters for ... | Glan | | 799.0 | 645.0 | Amazon In | 19.27% | | YES | B07QZQY7XX | https://images-na.ssl-images-amazon.com/images... |
| 23689 | 543fa03ee2a33b82057f8ff6a5e84b5a | 2019-10-29 22:46:45 +0000 | Hair Care | Nutricost Biotin (10,000mcg) with Virgin Orga... | Size: 10,000mcg nutricost biotin features one ... | Nutricost | | | | Amazon In | | | NO | B07VPD46DY | https://images-na.ssl-images-amazon.com/images... |
| 20432 | 7e34bf7da6dd69ee603f514b30aed0a5 | 2019-10-31 11:41:39 +0000 | Hair Care | Khadi Herbal Amla & Bhringraj Shampoo SLS & P... | It protects your hair from sun damage, which c... | Khadi Herbal | | 349.0 | 349.0 | Amazon In | 0% | | YES | B07X28P3RP | https://images-na.ssl-images-amazon.com/images... |
| 15500 | 90dadc27028bf7d331d3db809e7427d6 | 2019-10-31 19:10:12 +0000 | Grocery & Gourmet Foods | CHOCOCRAFT - Rakhi special gift - 18 Chocolat... | This Chocolate Gift Box has been specially des... | CHOCOCRAFT | 649 Grams | 995.0 | 995.0 | Amazon In | 0% | | YES | B01JRLNJHS | https://images-na.ssl-images-amazon.com/images... |
| 7090 | 9d7bdac6c7801395de6e2a7377a43cff | 2019-10-31 16:10:14 +0000 | Skin Care | Dr. Woods Skin Lightening English Rose Bar So... | | Dr. Woods | | 5083.0 | 5083.0 | Amazon In | 0% | | YES | B00UU2WERU | https://images-na.ssl-images-amazon.com/images... |


We can see that the Product Description column is the most relevant one for this use case. Additionally, we could add details from other columns, like weight and price, to the product description. This will create embeddings that capture more details about the product.

In [37]:
products_data['Product Description'][52]

"Parag Fragrances Ambery Chandan Eau De Perfume is Long Lasting Perfume By Parag Fragrance Which Are Really Stay Long Last on Clothes. All Notes Are This Perfume is Superb. The head notes form a person's first impression of a perfume. They are fresh and light and represent the story of the fragrance. Their function is to attract, but also to smoothly transit into the heart notes. The foundation of any fragrance lies in its heart notes, they make an appearance once the head notes evaporate. They last longer than the head notes and have a strong influence on the base notes to come. The base notes are the strongest and most robust part of the fragrance, adding to the depth, complexity and long-lasting effect of the fragrance. They mingle with the heart notes to create the full body of the fragrance."

Let's drop any duplicate products that may be present in the dataset.

In [38]:
products_data.drop_duplicates(subset="Uniq Id", inplace=True)

In [39]:
products_data.shape

(30000, 15)

In [40]:
products_data['Price'].isna().sum()

600

In [41]:
products_data.dropna(subset=['Product Description', 'Price'], inplace=True)

There were no duplicate products in the dataset: the shape was unchanged after dropping duplicates. We also removed rows with a missing product description or price.


In [42]:
products_data.shape

(27474, 15)


Let us now enrich the description of each product with other information, namely brand, price and discount.

In [43]:
# Cast Price to str so the concatenation also works if the column was loaded as numeric
products_data["product_description"] = (
    products_data["Product Description"] +
    " This product is available from the company " +
    products_data["Brand"] +
    " at a price of " +
    products_data["Price"].astype(str) +
    " at a discount of " +
    products_data["Offers"] +
    "."
)

In [44]:
products_data.product_description[52]

"Parag Fragrances Ambery Chandan Eau De Perfume is Long Lasting Perfume By Parag Fragrance Which Are Really Stay Long Last on Clothes. All Notes Are This Perfume is Superb. The head notes form a person's first impression of a perfume. They are fresh and light and represent the story of the fragrance. Their function is to attract, but also to smoothly transit into the heart notes. The foundation of any fragrance lies in its heart notes, they make an appearance once the head notes evaporate. They last longer than the head notes and have a strong influence on the base notes to come. The base notes are the strongest and most robust part of the fragrance, adding to the depth, complexity and long-lasting effect of the fragrance. They mingle with the heart notes to create the full body of the fragrance. This product is available from the company Parag fragrances at a price of 749.00 at a discount of 25.1%."

Let's create documents from the product descriptions. Let's also attach a little metadata to each document; it will come in handy if we want to show it in search results or use it as a pre-filter.

In [45]:
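# One Document per product: the enriched description is the text to embed,
# while the product id and price travel along as metadata
docs = [
    Document(
        page_content=description,
        metadata={"id": uniq_id, "price": price}
    ) for description, uniq_id, price in zip(products_data['product_description'], products_data['Uniq Id'], products_data['Price'])
]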
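To illustrate what a metadata pre-filter does, here is a minimal sketch in plain Python. The records below are toy dictionaries that only mimic the shape of the documents above (text plus `id` and `price` metadata); the ids and the `prefilter` helper are hypothetical and not part of LangChain or Chroma.

```python
def prefilter(records, max_price):
    """Keep only records whose metadata price is at most max_price."""
    return [r for r in records if r["metadata"]["price"] <= max_price]

# Toy records shaped like the documents above (ids are made up)
toy_records = [
    {"page_content": "Rose soap bar", "metadata": {"id": "a1", "price": 56.0}},
    {"page_content": "Musk rose perfume roll-on", "metadata": {"id": "b2", "price": 4999.0}},
    {"page_content": "Rose glycerin soap", "metadata": {"id": "c3", "price": 120.0}},
]

# Only the records that pass the filter would go on to similarity ranking
affordable = prefilter(toy_records, max_price=500)
print([r["metadata"]["id"] for r in affordable])
```

With a Chroma vector store, a comparable restriction can typically be passed through the `filter` argument of `similarity_search`, for example `filter={"price": {"$lte": 500}}`, assuming the price metadata is stored as a number.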

## Embedding Model

In [46]:
embedding_model_name = 'thenlper/gte-large'

In [47]:
embedding_model = SentenceTransformerEmbeddings(model_name=embedding_model_name)



## Indexing to Chroma

In [48]:
vectorstore = Chroma.from_documents(
    docs,
    embedding_model,
    collection_name="product_embeddings",
    persist_directory='./products_db'
)

(The above indexing operation takes roughly 15 minutes to run on a GPU.)
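Because indexing runs for several minutes, it can help to feed the documents in batches so that progress is visible along the way. The sketch below shows only the batching logic on placeholder strings; the `batched` helper and the batch size are assumptions for illustration, and the comment marks where a call such as `vectorstore.add_documents(batch)` would go.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Placeholder strings standing in for the Document objects
toy_docs = [f"doc-{i}" for i in range(10)]

for batch_number, batch in enumerate(batched(toy_docs, batch_size=4), start=1):
    # In the notebook, the batch would be added to the vector store here
    print(f"batch {batch_number}: {len(batch)} documents")
```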

Let's do a test query and see what product descriptions are close to the query.

In [49]:
query = "rose scented perfume"  # If there are no rose-scented perfumes, we expect other floral or fruity fragrances to rank ahead of men's perfumes.

In [50]:
docs = vectorstore.similarity_search(query, k=5)

In [51]:
for i, doc in enumerate(docs):
    print(f"Retrieved chunk {i+1}: \n")
    print(doc.page_content.replace('\t', ' '))
    print('\n')

Retrieved chunk 1: 

Enlighten your mood with the instant sense gratification that the fragrance of Rose endows. Embellis the sentiment in the care of Glycerin. Enjoy your own floral paradise everyday This product is available from the company Generic at a price of 56.00 at a discount of 13.85%.


Retrieved chunk 2: 

Secret Scent Musk Rose Perfume Roll on is a long lasting fragrance perfume for men and women, its contain 0% alcohol for giving long lasting fragrance and this perfume is undiluted and natural. All perfume oils are 99.9% same with real addition and giving same fragrance on clothes. This product is available from the company Secret Scent at a price of 4999.00 at a discount of 0%.


Retrieved chunk 3: 

Enlighten your mood with the instant sense gratification that the fragrance of Rose endows. Embellish the sentiment in the care of Glycerin. Enjoy your own floral paradise everyday This product is available from the company Khadi at a price of 120.00 at a discount of 0%.


...

# Save Database State to Google Drive

Creating the index is the most compute-intensive part of operating a vector database, which is why we used a GPU for it. While we persisted the database to the local Colab instance in the previous section, that folder is lost once the notebook is disconnected. To avoid losing data, we can copy the database state from the notebook to Google Drive for later reuse.

Provide Google Drive access to this Colab instance.

In [52]:
drive.mount('/content/drive')

Mounted at /content/drive


In [53]:
!cp -r products_db /content/drive/MyDrive

The above code saves the database state to Google Drive (within My Drive).
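The same copy can be done from Python instead of the shell. The sketch below is one possible approach using the standard library's `shutil.copytree`; the `copy_db` helper is hypothetical, and temporary folders stand in for `products_db` and My Drive so that the example runs anywhere.

```python
import shutil
import tempfile
from pathlib import Path

def copy_db(source_dir, target_parent):
    """Copy source_dir into target_parent, overwriting files in an existing copy."""
    source = Path(source_dir)
    target = Path(target_parent) / source.name
    shutil.copytree(source, target, dirs_exist_ok=True)  # requires Python 3.8+
    return target

# Demonstration with temporary folders standing in for products_db and My Drive
with tempfile.TemporaryDirectory() as workspace:
    fake_db = Path(workspace) / "products_db"
    fake_db.mkdir()
    (fake_db / "index.bin").write_bytes(b"placeholder")
    fake_drive = Path(workspace) / "MyDrive"
    fake_drive.mkdir()
    copied = copy_db(fake_db, fake_drive)
    print(sorted(p.name for p in copied.iterdir()))
```

In Colab, the equivalent call would be `copy_db('products_db', '/content/drive/MyDrive')` after the drive has been mounted.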

Let's check the persisted DB once before we move on to the next task.

In [54]:
persisted_vectordb_location = '/content/drive/MyDrive/products_db'

In [55]:
vectorstore_persisted = Chroma(
    collection_name="product_embeddings",
    persist_directory=persisted_vectordb_location,
    embedding_function=embedding_model
)

In [56]:
docs = vectorstore_persisted.similarity_search(query, k=5)

In [57]:
for i, doc in enumerate(docs):
    print(f"Retrieved chunk {i+1}: \n")
    print(doc.page_content.replace('\t', ' '))
    print('\n')

Retrieved chunk 1: 

Enlighten your mood with the instant sense gratification that the fragrance of Rose endows. Embellis the sentiment in the care of Glycerin. Enjoy your own floral paradise everyday This product is available from the company Generic at a price of 56.00 at a discount of 13.85%.


Retrieved chunk 2: 

Secret Scent Musk Rose Perfume Roll on is a long lasting fragrance perfume for men and women, its contain 0% alcohol for giving long lasting fragrance and this perfume is undiluted and natural. All perfume oils are 99.9% same with real addition and giving same fragrance on clothes. This product is available from the company Secret Scent at a price of 4999.00 at a discount of 0%.


Retrieved chunk 3: 

Enlighten your mood with the instant sense gratification that the fragrance of Rose endows. Embellish the sentiment in the care of Glycerin. Enjoy your own floral paradise everyday This product is available from the company Khadi at a price of 120.00 at a discount of 0%.


...

The similarity search returns the same results as before, but the database is now loaded from Google Drive.

Once a vector database has been created, there is far less dependence on a GPU. During inference, CPU instances can be used to serve similar documents from the database.