# Learning Objectives

- Implement search and recommendation systems built on vector databases


# Setup

In [None]:
!pip install -q chromadb==0.4.22 \
                langchain==0.1.9 \
                langchain-community==0.0.32 \
                sentence-transformers==2.3.1 \
                datasets==2.19.1


In [None]:
import gdown
import pandas as pd

from google.colab import userdata
from google.colab import drive

from datasets import load_dataset

from langchain_core.documents import Document
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_community.vectorstores import Chroma
from scipy.spatial.distance import cosine

In [None]:
embedding_model_name = 'thenlper/gte-large'

In [None]:
embedding_model = SentenceTransformerEmbeddings(model_name=embedding_model_name)


# Business Use Case

In the world of online shopping, it's crucial for e-commerce platforms to help customers easily find and buy what they need. But with so many products available, it's a big challenge to make sure the right products are seen by the right people.

Given a large product assortment, designing an effective product search tool becomes paramount.


In this context, consider an e-commerce player like Amazon that has a huge range of products, from electronics to clothes to home items. Enabling customers to easily search for what they want improves their shopping experience. The mandate here is to make searching for products easier and more accurate, so customers can find what they want quickly.

Beyond product search, similarity search can also drive product recommendations. We cover both search and recommendation in this session.

In this notebook, we implement a search and recommendation system on a CPU using the vector database created and persisted in the previous notebook.

# Loading the saved Vector Database

In [None]:
#get vector database url or path
#vector_db_url = 'https://drive.google.com/drive/folders/1vZFtgFr4CAvWDqSGNGUZGpJlyG_I-0lU?usp=drive_link'

In [None]:
#gdown.download_folder(vector_db_url)

In [None]:
persisted_vectordb_location = 'products_db'

In [None]:
vectorstore_persisted = Chroma(
    collection_name="product_embeddings",
    persist_directory=persisted_vectordb_location,
    embedding_function=embedding_model
)
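Before querying, it can help to confirm that the persisted folder is actually in place. As far as we know, Chroma generally creates a new, empty collection when the directory is missing instead of raising an error, so a wrong path can silently return no results. The helper below, `persisted_db_looks_valid`, is a hypothetical name introduced for this sketch; it only checks that the folder exists and is non-empty and does not validate the Chroma files themselves.

```python
from pathlib import Path

def persisted_db_looks_valid(location):
    """Return True if `location` is an existing, non-empty directory."""
    path = Path(location)
    return path.is_dir() and any(path.iterdir())

# 'products_db' is the folder name used in this notebook.
if not persisted_db_looks_valid("products_db"):
    print("Warning: 'products_db' is missing or empty; "
          "run the previous notebook or download the folder first.")
```

If the warning prints, the searches that follow would run against an empty store.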

Let's run a quick test to confirm that the store loaded correctly.

In [None]:
query = "rose scented perfume"

In [None]:
docs = vectorstore_persisted.similarity_search(query, k=5)

In [None]:
for i, doc in enumerate(docs):
    print(f"Retrieved chunk {i+1}: \n")
    print(doc.page_content.replace('\t', ' '))
    print('\n')

Retrieved chunk 1: 

Enlighten your mood with the instant sense gratification that the fragrance of Rose endows. Embellis the sentiment in the care of Glycerin. Enjoy your own floral paradise everyday This product is available from the company Generic at a price of 56.00 at a discount of 13.85%.


Retrieved chunk 2: 

Secret Scent Musk Rose Perfume Roll on is a long lasting fragrance perfume for men and women, its contain 0% alcohol for giving long lasting fragrance and this perfume is undiluted and natural. All perfume oils are 99.9% same with real addition and giving same fragrance on clothes. This product is available from the company Secret Scent at a price of 4999.00 at a discount of 0%.


Retrieved chunk 3: 

Enlighten your mood with the instant sense gratification that the fragrance of Rose endows. Embellish the sentiment in the care of Glycerin. Enjoy your own floral paradise everyday This product is available from the company Khadi at a price of 120.00 at a discount of 0%.



Now let's test our vector search with some customer search behaviour.

# Search

A frequent pattern in consumer behaviour is searching with long natural-language sentences instead of concise keywords like those in product names.

Let's try a few example searches against our product range.

In [None]:
query = "I have bad odour, what should i do"

Instead of asking for a perfume or deodorant, we describe the problem. Because the search is semantic, relevant products can be retrieved even when the exact phrase "bad odour" does not appear in a description.
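To see why a problem statement can match a product description, recall that similarity search compares embedding vectors rather than keywords. The sketch below uses made-up three-dimensional vectors (real `gte-large` embeddings have far more dimensions) and a hypothetical `cosine_similarity` helper to rank two descriptions against a query. The vector store may use a different distance metric internally; cosine similarity is used here only as a familiar stand-in.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors for illustration only; not real model output.
query_vec = [0.9, 0.1, 0.0]  # stands in for "I have bad odour, what should i do"
toy_index = {
    "deodorant description": [0.8, 0.2, 0.1],
    "flaxseed description": [0.1, 0.1, 0.9],
}

# Rank the descriptions from most to least similar to the query.
ranked = sorted(
    toy_index,
    key=lambda name: cosine_similarity(query_vec, toy_index[name]),
    reverse=True,
)
print(ranked)
```

The description whose vector points in nearly the same direction as the query ranks first, regardless of whether the two texts share any words.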

In [None]:
docs = vectorstore_persisted.similarity_search(query, k=5)

In [None]:
for i, doc in enumerate(docs):
    print(f"Retrieved chunk {i+1}: \n")
    print(doc.page_content.replace('\t', ' '))
    print(doc.metadata)
    print('\n')

Retrieved chunk 1: 

Men Antibacterial Odour Protection Antiperspirant Deodorant helps fight body odour at the source, where it counts UK’s No.1 deodorant brand  Antiperspirant deodorant helps reduce 90%* of odour-causing bacteria  As you move, MotionSenseTM technology helps keep you fresh  Provides up to 48 hours of protection against sweat and body odour  Enjoy all day freshness with a clean fragrance  Alcohol**-free Don’t let sweat and body odour dictate your day: help fight odour at the source with Sure Men Antibacterial Odour Protection Antiperspirant Deodorant. Experience round-the-clock confidence from this men’s antiperspirant deodorant, rich with antibacterial protection that helps to reduce odour causing bacteria by up to 90%*. Lime Oil, Eucalyptus and Orange Turpine and MotionSenseTM technology helps keep you fresh when you need it most. Microcapsules sitting on the skin break when you move, delivering a burst of with every step. With up to 48 hours of protection, an antibac

In [None]:
query = "My skin has gone dry provide shampoo"

Once again, we are searching for a problem rather than a product. Although the query mentions shampoo, an ideal system should surface moisturising products such as lotions and oils.

In [None]:
docs = vectorstore_persisted.similarity_search(query, k=5)

In [None]:
for i, doc in enumerate(docs):
    print(f"Retrieved chunk {i+1}: \n")
    print(doc.page_content.replace('\t', ' '))
    print(doc.metadata)
    print('\n')

Retrieved chunk 1: 

Body can lose upto half a liter of water everyday - leaving skin dehydrated. Dehydrated skin not only looks dull & unhealthy, but also becomes prone to skin infections due to external irritants. Vaseline® Intensive Care Aloe Soothe – with 100% pure aloe vera extracts & microdroplets of Vaseline jelly, restores skin’s moisture & helps maintain 24 hour hydration in skin. Aloe vera has long been used as a home ready to deal with dry flaky skin. It’s believed to have soothing properties to restore smooth supple skin and is effective at improving skin hydration. Aloe vera is an effective skin-conditioning which calms the skin and leaves it feeling deeply moisturized. ven Vaseline® Jelly creates an extra layer of protection, preventing moisture from escaping and helping aid your skin’s natural recovery process. Vaseline Intensive Care Aloe Soothe lotion helps rejuvenates dry skin even in harsh summers. It is a light lotion, that absorbs fast for a non-greasy & non-sticky

In [None]:
query = "diabetes control product at cheap price with high discount"

In [None]:
docs = vectorstore_persisted.similarity_search(query, k=5)

In [None]:
for i, doc in enumerate(docs):
    print(f"Retrieved chunk {i+1}: \n")
    print(doc.page_content.replace('\t', ' '))
    print(doc.metadata)
    print('\n')

Retrieved chunk 1: 

GOKUL’s Diabetes Care Juice is a 100% Natural and Organic remedy for Diabetes. The combined effect of this formula helps to control blood sugar levels effectively without any side effects. This product is available from the company Gokul Herbals at a price of 188.00 at a discount of 20.0%.
{'id': '1a55246823cdbd07ef9454f5ed2cee5a', 'price': '188.00'}


Retrieved chunk 2: 

Aashirvaad sugar release control atta, with low glycaemic index, releases sugar in your body, slowly, thus helps in sustained and steady blood sugar level. This product is available from the company Aashirvaad at a price of 57.00 at a discount of 5.0%.
{'id': '1b724e265c5b3d2c35d5f0004335e7c6', 'price': '57.00'}


Retrieved chunk 3: 

Studies show the blend of green tea with arjun, gurnar, methi and a few more herbs helps control the problems like Blood Pressure and Diabetes This product is available from the company ANDEES at a price of 340.00 at a discount of 0%.
{'id': '69f9e6baa7770436246c7b8

Not every retrieved description mentions the word "diabetes" explicitly (the second result talks about blood sugar instead), yet relevant products are retrieved.
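The "cheap price with high discount" part of the query is matched only as text; similarity search does not apply it as a numeric constraint, which is why a result with a 0% discount still appears above. One possible workaround is to post-process the retrieved descriptions. The sketch below assumes every description ends with the sentence pattern seen in the outputs ("at a price of ... at a discount of ...%"); `parse_price_discount` and `rerank_by_deal` are hypothetical helper names, and the sample texts are shortened versions of the three results shown above.

```python
import re

# Sentence pattern observed at the end of the product descriptions in this notebook.
PRICE_DISCOUNT = re.compile(
    r"at a price of (\d+(?:\.\d+)?) at a discount of (\d+(?:\.\d+)?)%"
)

def parse_price_discount(text):
    """Return (price, discount_percent) parsed from a description, or None."""
    match = PRICE_DISCOUNT.search(text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

def rerank_by_deal(texts, max_price=None):
    """Drop items above max_price, then sort by discount (highest first)."""
    parsed = []
    for text in texts:
        values = parse_price_discount(text)
        if values is None:
            continue  # skip descriptions that do not follow the pattern
        price, discount = values
        if max_price is not None and price > max_price:
            continue
        parsed.append((text, price, discount))
    return sorted(parsed, key=lambda item: item[2], reverse=True)

# Shortened versions of the three results printed above.
sample_texts = [
    "Diabetes Care Juice ... from the company Gokul Herbals at a price of 188.00 at a discount of 20.0%.",
    "Sugar release control atta ... from the company Aashirvaad at a price of 57.00 at a discount of 5.0%.",
    "Green tea blend ... from the company ANDEES at a price of 340.00 at a discount of 0%.",
]

deals = rerank_by_deal(sample_texts, max_price=200)
for text, price, discount in deals:
    print(price, discount)
```

In the notebook, the same function could be called as `rerank_by_deal([doc.page_content for doc in docs], max_price=200)`.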

# Recommendations

For every product page a customer visits, we can recommend products that are similar to the current product. This can be one of the many inputs for a recommendation engine.

In [None]:
query = "Flaxseeds are considered one of the most powerful plant foods on the planet. Rich in heart loving omega 3 essential fatty acids, lignans and both soluble and insoluble fibre, flaxseed bestows health benefits like no other."

In [None]:
docs = vectorstore_persisted.similarity_search(query, k=5)

The result most similar to the text above should be the same product, so we expect it at the top.

In [None]:
for i, doc in enumerate(docs):
    print(f"Retrieved chunk {i+1}: \n")
    print(doc.page_content.replace('\t', ' '))
    print(doc.metadata)
    print('\n')

Retrieved chunk 1: 

Flaxseeds are considered one of the most powerful plant foods on the planet. Rich in heart loving omega 3 essential fatty acids, lignans and both soluble and insoluble fibre, flaxseed bestows health benefits like no other. This product is available from the company Organo Nutri at a price of 221.00 at a discount of 35.0%.
{'id': 'ff9c643890aaaf1b720587610ddfdd6b', 'price': '221.00'}


Retrieved chunk 2: 

Flaxseeds are considered one of the most powerful plant foods on the planet. Rich in heart loving omega 3 essential fatty acids, lignans and both soluble and insoluble fibre, flaxseed bestows health benefits like no other. This product is available from the company Organo Nutri at a price of 130.00 at a discount of 35.0%.
{'id': '25af44b3a0becc2ef0ddf530baea78b0', 'price': '130.00'}


Retrieved chunk 3: 

Kitchen & Health brings to you the world’s first cultivated superfood. Flax seeds help improve digestion, improve skin health, reduce sugar cravings and promote 

To improve the quality of the recommendations, we want products that are similar to the description but not identical to it. We can do this by putting a band around the similarity score (say, 0.78 to 0.80). Descriptions that are exactly the same score close to 1, so they fall outside the band and are not retrieved.

Let's first retrieve a handful of documents (say, 100) which we can then filter to create a good selection.

In [None]:
doc_r = vectorstore_persisted.similarity_search_with_relevance_scores(query, k=100)

`similarity_search_with_relevance_scores` returns a list of `(doc, r)` tuples, where `r` is the relevance score. So, let's rewrite our printing logic to accommodate this.

In [None]:
recommendations = []

In [None]:
for doc, r in doc_r:
    # Keep only documents whose relevance score falls inside the band
    if 0.78 < r < 0.80:
        recommendations.append((doc.page_content.replace('\t', ' '), doc.metadata))

In [None]:
len(recommendations)

11

In [None]:
print(recommendations[0])

('Healthy Planet Combo of Flaxseed Oil 200ml x 4 This product is available from the company Jiwesh Special Tasty Spices at a price of 896.00 at a discount of 10.04%.', {'id': 'd91e5c7876e42f69ad2a01b1a696a71c', 'price': '896.00'})


We got what we wanted: products that are related to the current product but are not the exact same product.
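The band filter can be wrapped in a small reusable function. The sketch below is one possible shape, not the notebook's own implementation: `recommend_in_band` is a hypothetical helper, `Doc` is a stand-in for a LangChain `Document` so the example runs on its own, and the scores are illustrative values rather than real model output. It also skips repeated descriptions, since the same text can appear under several product ids.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def recommend_in_band(scored_docs, low=0.78, high=0.80):
    """Keep docs with low < score < high, skipping repeated descriptions."""
    seen_texts = set()
    picks = []
    for doc, score in scored_docs:
        if not (low < score < high):
            continue
        text = doc.page_content.replace("\t", " ")
        if text in seen_texts:
            continue  # same description listed under another id
        seen_texts.add(text)
        picks.append((text, doc.metadata, score))
    return picks

# Illustrative scores only; real values come from
# similarity_search_with_relevance_scores.
sample = [
    (Doc("Flaxseed description (same product)", {"id": "a"}), 0.99),
    (Doc("Flaxseed oil combo", {"id": "b"}), 0.79),
    (Doc("Flaxseed oil combo", {"id": "c"}), 0.785),
    (Doc("Chia seeds", {"id": "d"}), 0.70),
]

picks = recommend_in_band(sample)
print(picks)
```

With the notebook's variables, the call would be `recommend_in_band(doc_r)`. The band limits are tuning parameters; the 0.78 to 0.80 range used here simply mirrors the values chosen above.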