# Recommender System with LLMs and Weaviate

## 1. Install the requirements

In [1]:
%pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


In [2]:
%pip install --upgrade pip

Note: you may need to restart the kernel to use updated packages.


## 2. Import the required dependencies

In [3]:
import weaviate
import openai
import pandas as pd
from tqdm import tqdm
print(f"Weaviate client library version: {weaviate.__version__}.")
print(f"Openai version: {openai.__version__}.")



Weaviate client library version: 4.7.1.
Openai version: 1.12.0.


## 3. OpenAI setup

In [4]:
# Load the env variable(s)
import os

from dotenv import load_dotenv

load_dotenv()  # take environment variables from .env.

api_key = os.environ.get("OPENAI_API_KEY")

#print("OpenAI API key: ", api_key)

In [5]:
# setting the openai client:
from openai import OpenAI
chat_client = OpenAI()



## 4. Weaviate Setup

1. Go to `console.weaviate.com` and log in or sign up.
2. In the WCD Console dashboard, click the `Create cluster` button.
3. Enter a name for the `Sandbox` and click `Create`.
4. Creating the instance takes a few minutes. Once it is ready, it appears in the dashboard; click the dropdown to see its details.


In [6]:
# Get the .env Variables

WEAVIATE_URL = os.environ.get("WCS_URL")
WEAVIATE_API_KEY = os.environ.get("WCS_API_KEY")

# print("weaviate url: ", WEAVIATE_URL)
# print("weaviate api key: ", WEAVIATE_API_KEY)
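If a variable is missing from `.env`, `os.environ.get` silently returns `None`, and the failure only shows up later when connecting. A minimal sketch of a fail-fast lookup follows; `require_env` is a hypothetical helper name, not part of the original notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper: it only wraps os.environ.get with a check.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; check your .env file."
        )
    return value
```

It could replace the lookups above, for example `WEAVIATE_URL = require_env("WCS_URL")`.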

In [7]:
# Weaviate config

headers = {
 "Content-Type": "application/json",
 "X-OpenAI-Api-Key": os.getenv("OPENAI_API_KEY")
}

# Connect to a WCS instance
client = weaviate.connect_to_wcs(cluster_url=WEAVIATE_URL, auth_credentials=weaviate.auth.AuthApiKey(WEAVIATE_API_KEY), headers=headers)

In [8]:
# Verify that the Weaviate instance is running using the is_live function.
assert client.is_live()


## 5. Dataset Processing

### 5.1 Introduction to Dataset

The Amazon Product Sales Dataset, taken from Kaggle, is a collection of product data scraped from the Amazon website and organized into 142 categories. It includes key features such as product names, categories, images, links, ratings, and prices, as the preview below shows.

Rather than working with 142 separate collections, we use the consolidated file Amazon-Products.csv, which contains the entire dataset, for our recommendation system.

In [9]:
# Load the dataset
file_path = os.path.join('dataset', 'Amazon-Products.csv')  # portable across operating systems
dataset = pd.read_csv(file_path)
# Display the first few rows of the dataset
dataset.head()

Unnamed: 0.1,Unnamed: 0,name,main_category,sub_category,image,link,ratings,no_of_ratings,discount_price,actual_price
0,0,Lloyd 1.5 Ton 3 Star Inverter Split Ac (5 In 1...,appliances,Air Conditioners,https://m.media-amazon.com/images/I/31UISB90sY...,https://www.amazon.in/Lloyd-Inverter-Convertib...,4.2,2255,"₹32,999","₹58,990"
1,1,LG 1.5 Ton 5 Star AI DUAL Inverter Split AC (C...,appliances,Air Conditioners,https://m.media-amazon.com/images/I/51JFb7FctD...,https://www.amazon.in/LG-Convertible-Anti-Viru...,4.2,2948,"₹46,490","₹75,990"
2,2,LG 1 Ton 4 Star Ai Dual Inverter Split Ac (Cop...,appliances,Air Conditioners,https://m.media-amazon.com/images/I/51JFb7FctD...,https://www.amazon.in/LG-Inverter-Convertible-...,4.2,1206,"₹34,490","₹61,990"
3,3,LG 1.5 Ton 3 Star AI DUAL Inverter Split AC (C...,appliances,Air Conditioners,https://m.media-amazon.com/images/I/51JFb7FctD...,https://www.amazon.in/LG-Convertible-Anti-Viru...,4.0,69,"₹37,990","₹68,990"
4,4,Carrier 1.5 Ton 3 Star Inverter Split AC (Copp...,appliances,Air Conditioners,https://m.media-amazon.com/images/I/41lrtqXPiW...,https://www.amazon.in/Carrier-Inverter-Split-C...,4.1,630,"₹34,490","₹67,790"


### 5.2 Dataset cleaning

In [10]:
# Check for missing values and duplicates
print(f"\nUncleaned DataFrame Rows: {dataset.shape[0]}")
print(f"\nAll of the Duplicated Rows: {dataset.duplicated().sum()}" ) 
print(f"\nAll of the N/A Rows: {dataset.isna().sum()}" )


Uncleaned DataFrame Rows: 551585

All of the Duplicated Rows: 0

All of the N/A Rows: Unnamed: 0             0
name                   0
main_category          0
sub_category           0
image                  0
link                   0
ratings           175794
no_of_ratings     175794
discount_price     61163
actual_price       17813
dtype: int64


In [11]:
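# 1. Handle missing values: drop rows with any NaN value.
# .copy() makes df_cleaned an independent DataFrame, so the column
# assignments in later cells do not raise SettingWithCopyWarning.
df_cleaned = dataset.dropna().copy()
# 2. Duplicate removal is skipped: the check above found no duplicated rows.
#df_cleaned = df_cleaned.drop_duplicates()
# Display the number of rows in the cleaned DataFrame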
print(f"\nCleaned DataFrame Rows: {df_cleaned.shape[0]}")


Cleaned DataFrame Rows: 340680


In [12]:
def conversion(amount, exchange_rate=84):
 """
 Convert an INR price string such as '₹32,999' to USD.
 Returns:
 str: Amount in US dollars, formatted like '$392.85'.
 """
 amount_in_inr = float(amount.replace('₹', '').replace(',', ''))
 amount_in_usd = (amount_in_inr / exchange_rate)
 return f"${amount_in_usd:.2f}"
df_cleaned['actual_price'] = df_cleaned['actual_price'].apply(conversion)
df_cleaned['discount_price'] = df_cleaned['discount_price'].apply(conversion)
df_cleaned.head()


Unnamed: 0.1,Unnamed: 0,name,main_category,sub_category,image,link,ratings,no_of_ratings,discount_price,actual_price
0,0,Lloyd 1.5 Ton 3 Star Inverter Split Ac (5 In 1...,appliances,Air Conditioners,https://m.media-amazon.com/images/I/31UISB90sY...,https://www.amazon.in/Lloyd-Inverter-Convertib...,4.2,2255,$392.85,$702.26
1,1,LG 1.5 Ton 5 Star AI DUAL Inverter Split AC (C...,appliances,Air Conditioners,https://m.media-amazon.com/images/I/51JFb7FctD...,https://www.amazon.in/LG-Convertible-Anti-Viru...,4.2,2948,$553.45,$904.64
2,2,LG 1 Ton 4 Star Ai Dual Inverter Split Ac (Cop...,appliances,Air Conditioners,https://m.media-amazon.com/images/I/51JFb7FctD...,https://www.amazon.in/LG-Inverter-Convertible-...,4.2,1206,$410.60,$737.98
3,3,LG 1.5 Ton 3 Star AI DUAL Inverter Split AC (C...,appliances,Air Conditioners,https://m.media-amazon.com/images/I/51JFb7FctD...,https://www.amazon.in/LG-Convertible-Anti-Viru...,4.0,69,$452.26,$821.31
4,4,Carrier 1.5 Ton 3 Star Inverter Split AC (Copp...,appliances,Air Conditioners,https://m.media-amazon.com/images/I/41lrtqXPiW...,https://www.amazon.in/Carrier-Inverter-Split-C...,4.1,630,$410.60,$807.02
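The conversion above formats each price as a `$` string, which the next cell then has to parse back into a float. As a sketch of an alternative, the function below goes from the INR string to a numeric USD value in one step. `inr_to_usd` is a hypothetical name, and the fixed rate of 84 is the same assumption the notebook makes.

```python
def inr_to_usd(amount: str, exchange_rate: float = 84.0) -> float:
    """Convert a price string such as '₹32,999' to a USD float.

    Assumes a fixed exchange rate (84 INR per USD, as in the notebook).
    """
    # Strip the currency symbol and thousands separators before parsing
    cleaned = amount.replace("₹", "").replace(",", "").strip()
    return round(float(cleaned) / exchange_rate, 2)


print(inr_to_usd("₹32,999"))  # 392.85, matching the first row above
```

Keeping prices numeric from the start would make the separate `clean_price` step unnecessary, at the cost of changing how the intermediate table is displayed.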


In [13]:
print(f"datatypes before\n:{df_cleaned.dtypes}")
def clean_price(price):
 # Remove currency symbol and commas, then convert to float
 return float(price.replace('$', '').replace(',', ''))
df_cleaned['discount_price'] = df_cleaned['discount_price'].apply(clean_price)
df_cleaned['actual_price'] = df_cleaned['actual_price'].apply(clean_price)
df_cleaned['ratings'] = pd.to_numeric(df_cleaned['ratings'], errors='coerce')
df_cleaned['no_of_ratings'] = pd.to_numeric(df_cleaned['no_of_ratings'], errors='coerce')
print(f"datatypes after\n:{df_cleaned.dtypes}")
print(f"\nCleaned and Transformed DataFrame Rows: {df_cleaned.shape[0]}")


datatypes before
:Unnamed: 0         int64
name              object
main_category     object
sub_category      object
image             object
link              object
ratings           object
no_of_ratings     object
discount_price    object
actual_price      object
dtype: object

datatypes after
:Unnamed: 0          int64
name               object
main_category      object
sub_category       object
image              object
link               object
ratings           float64
no_of_ratings     float64
discount_price    float64
actual_price      float64
dtype: object

Cleaned and Transformed DataFrame Rows: 340680


## 6. Cluster Vector Database

### 6.1 Create Products Collection in the cluster

In [14]:
import weaviate.classes.config as wc
properties = [
 wc.Property(name="name", data_type=wc.DataType.TEXT),
 wc.Property(name="main_category", data_type=wc.DataType.TEXT),
 wc.Property(name="sub_category", data_type=wc.DataType.TEXT),
 wc.Property(name="image", data_type=wc.DataType.TEXT, skip_vectorization=True),
 wc.Property(name="link", data_type=wc.DataType.TEXT, skip_vectorization=True),
 wc.Property(name="ratings", data_type=wc.DataType.NUMBER, skip_vectorization=True),
 wc.Property(name="no_of_ratings", data_type=wc.DataType.NUMBER,skip_vectorization=True),
 wc.Property(name="discount_price", data_type=wc.DataType.NUMBER,skip_vectorization=True),
 wc.Property(name="actual_price", data_type=wc.DataType.NUMBER,skip_vectorization=True),
]
# Create the Product collection in Weaviate
try:
 client.collections.create(
 name="Products",
 properties=properties,
 vectorizer_config=wc.Configure.Vectorizer.text2vec_openai()
 )
 print("Product collection created successfully.")
 
except Exception as e:
 print(f"Failed to create Product collection: {e}")

Failed to create Product collection: Collection may not have been created properly.! Unexpected status code: 422, with response body: {'error': [{'message': 'class name Products already exists'}]}.
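The 422 error above appears because the cell was re-run after the collection already existed. One way to make the step safe to re-run is to check for the collection first. The sketch below assumes the v4 client's `client.collections.exists`, `create`, and `get` methods; `ensure_collection` itself is a hypothetical helper name.

```python
def ensure_collection(client, name: str, **create_kwargs):
    """Create the collection only if it is missing, then return a handle to it.

    `client` is expected to expose `collections.exists/create/get`,
    as the Weaviate v4 Python client does.
    """
    if not client.collections.exists(name):
        client.collections.create(name=name, **create_kwargs)
    return client.collections.get(name)
```

With the objects defined in the cell above, a call might look like `ensure_collection(client, "Products", properties=properties, vectorizer_config=wc.Configure.Vectorizer.text2vec_openai())`.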


### 6.2 Insert Products to Vector Database

In [15]:
# Since the dataset is large (about 340,000 entries after cleaning), we filter it
# to grocery products only and then sample 1,000 products from that subset.

df_filtered = df_cleaned[df_cleaned['main_category'] == 'grocery & gourmet foods']
df_sampled = df_filtered.sample(n = 1000)


In [16]:
from weaviate.util import generate_uuid5
products = client.collections.get("Products")
# Enter context manager
with products.batch.dynamic() as batch:
    # Loop through the data
    for i, product in tqdm(df_sampled.iterrows(), total=df_sampled.shape[0]):
        # Build the object payload
        product_obj = {
            "name": product["name"],
            "main_category": product["main_category"],
            "sub_category": product["sub_category"],
            "image": product["image"],
            "link": product["link"],
            "ratings": product["ratings"],
            "no_of_ratings": product["no_of_ratings"],
            "discount_price": product["discount_price"],
            "actual_price": product["actual_price"],
        }
        # Add object to batch queue
        batch.add_object(
            properties=product_obj,
            uuid=generate_uuid5(product["link"])
        )
# Check for failed objects once, after the context manager has flushed the batch
if len(products.batch.failed_objects) > 0:
    print(f"Failed to import {len(products.batch.failed_objects)} objects")

100%|██████████| 1000/1000 [00:05<00:00, 168.08it/s]


{'message': 'Failed to send 7 objects in a batch of 48. Please inspect client.batch.failed_objects or collection.batch.failed_objects for the failed objects.'}
{'message': 'Failed to send 3 objects in a batch of 43. Please inspect client.batch.failed_objects or collection.batch.failed_objects for the failed objects.'}
{'message': 'Failed to send 9 objects in a batch of 48. Please inspect client.batch.failed_objects or collection.batch.failed_objects for the failed objects.'}
{'message': 'Failed to send 5 objects in a batch of 48. Please inspect client.batch.failed_objects or collection.batch.failed_objects for the failed objects.'}
{'message': 'Failed to send 8 objects in a batch of 48. Please inspect client.batch.failed_objects or collection.batch.failed_objects for the failed objects.'}
{'message': 'Failed to send 6 objects in a batch of 48. Please inspect client.batch.failed_objects or collection.batch.failed_objects for the failed objects.'}
{'message': 'Failed to send 9 objects in
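The import cell passes `generate_uuid5(product["link"])` so that each product gets an ID derived from its link rather than a random one. The sketch below illustrates the idea with the standard library's `uuid.uuid5`. It is an analogue for illustration only: `stable_id` is a hypothetical name, and the IDs it produces are not claimed to match those from Weaviate's helper.

```python
import uuid


def stable_id(key: str) -> str:
    """Derive a deterministic UUID (version 5) from a string key."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, key))


link = "https://www.amazon.in/example-product"  # made-up link for illustration
# The same key always maps to the same ID, so a re-import targets the same object
print(stable_id(link) == stable_id(link))
```

Because the ID depends only on the key, importing the same product twice does not create two entries with different IDs.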

In [None]:
products = client.collections.get("Products")
for item in products.iterator(include_vector=True):
 print(item.uuid, item.properties, item.vector)

## 7. Hybrid Search in Weaviate

Hybrid search is a strategy that combines multiple search methods to deliver more thorough and precise results. Weaviate uses both sparse and dense vectors: sparse vectors support exact keyword matching, while dense vectors capture the semantic meaning and context of search queries and documents.

Currently, Weaviate’s implementation of hybrid search combines BM25/BM25F keyword scoring with vector search. Since we want the best of both worlds from the recommender system, we use hybrid search here; in Weaviate, it takes a single query call.
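To make the combination concrete, here is a simplified sketch of one way keyword and vector scores can be fused: normalize each score set to the range 0 to 1, then take a weighted sum. The weight `alpha` mirrors the `alpha` parameter of Weaviate's hybrid query (1 favors vector search, 0 favors keyword search), but the functions and sample scores below are illustrative assumptions, not Weaviate's implementation.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every result as a full match
        return {key: 1.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def hybrid_scores(keyword: dict[str, float], vector: dict[str, float],
                  alpha: float = 0.5) -> list[tuple[str, float]]:
    """Fuse keyword and vector scores; alpha weights the vector side."""
    kw, vec = min_max(keyword), min_max(vector)
    fused = {
        doc_id: alpha * vec.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in set(kw) | set(vec)
    }
    # Highest fused score first
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)


# Made-up scores: "a" and "b" matched by keyword, "b" and "c" by vector search
ranking = hybrid_scores({"a": 10.0, "b": 5.0}, {"b": 0.9, "c": 0.1}, alpha=0.75)
print(ranking)
```

A document that appears in only one result set contributes a score of 0 for the other side, which is why items found by both methods tend to rank higher.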



In [18]:
import weaviate.classes.query as wq
products = client.collections.get("Products")
# Perform the query
query = "chicken noodles"
response = products.query.hybrid( query=query, limit=5, return_metadata=wq.MetadataQuery(score=True))

In [41]:
# View the results:
for item in response.objects:
    print(item.properties["name"], item.properties["actual_price"], item.properties["discount_price"], item.properties["image"])
    # Print the hybrid search score of the object from the query
    print(f"Hybrid score: {item.metadata.score:.3f}\n") 

Nongshim Shin Kimchi Instant Noodle Cup, 2.65 oz / 75 g 1.3 1.23 https://m.media-amazon.com/images/I/81LSfIHpKdL._AC_UL320_.jpg
Hybrid score: 0.869

Samyang Carbo Hot Chicken Flavour Raman Cup Noodles, 70mg*4 Pack (Pack of 4) (Imported) 6.43 5.94 https://m.media-amazon.com/images/I/51EN1zzpl8S._AC_UL320_.jpg
Hybrid score: 0.815

Sam Yang Carbo & Cheese Hot Chicken Flavour Raman Cup Noodles, 70 grams*2 Pack (Pack of 2) (Imported) 7.13 3.81 https://m.media-amazon.com/images/W/IMAGERENDERING_521856-T2/images/I/618GOoLaVnL._AC_UL320_.jpg
Hybrid score: 0.656

Samyang Hot Chicken Buldak Carbonara Noodles, 130 grams 3.56 2.02 https://m.media-amazon.com/images/I/817eoiBIqiL._AC_UL320_.jpg
Hybrid score: 0.508

Typhoo Green Tea Lemon & Honey - 25 Heat Sealed enveloped Tea Bags (Pack of 2) 5.0 3.44 https://m.media-amazon.com/images/I/71GX6NkAbXL._AC_UL320_.jpg
Hybrid score: 0.416



In [42]:
import pandas as pd
from IPython.display import HTML

# Example data
data = []

# Collecting the data from your response
for item in response.objects:
    name = item.properties["name"]
    actual_price = item.properties["actual_price"]
    discount_price = item.properties["discount_price"]
    image_url = item.properties["image"]
    score = item.metadata.score
    
    # Append a dictionary to the data list
    data.append({
        "Name": name,
        "Image": f'<img src="{image_url}" width="100"/>',  # HTML image tag
        "Score": score,
        "Actual Price": actual_price,
        "Discount Price": discount_price,
    })

# Create a DataFrame
df = pd.DataFrame(data)

# Display the DataFrame as HTML
HTML(df.to_html(escape=False, index=False))


Name,Image,Score,Actual Price,Discount Price
"Nongshim Shin Kimchi Instant Noodle Cup, 2.65 oz / 75 g",,0.869456,1.3,1.23
"Samyang Carbo Hot Chicken Flavour Raman Cup Noodles, 70mg*4 Pack (Pack of 4) (Imported)",,0.814904,6.43,5.94
"Sam Yang Carbo & Cheese Hot Chicken Flavour Raman Cup Noodles, 70 grams*2 Pack (Pack of 2) (Imported)",,0.65578,7.13,3.81
"Samyang Hot Chicken Buldak Carbonara Noodles, 130 grams",,0.508032,3.56,2.02
Typhoo Green Tea Lemon & Honey - 25 Heat Sealed enveloped Tea Bags (Pack of 2),,0.415906,5.0,3.44


## 8. Suggestions from OpenAI


In this section, we use an LLM as a knowledge base to suggest items to the user. In the code below, we pass the query to gpt-3.5-turbo to generate suggestions. The prompt is as follows:



### 8.1 Prompt setup

In [43]:
prompt = "\
    You are an expert in \
    recommending grocery and gourmet food products. Based on the following query, \
    provide a list of high-quality grocery and gourmet food items that would be ideal for the customer.\
    Just provide the names of products in JSON. Key of it is rec_prod while values are in the list.\
"

The prompt and query are passed to the chat model, and the response format is restricted to a JSON object so the result is easy to parse in Python.



In [44]:
# Reuse chat_client from section 3; rebinding `client` here would shadow the Weaviate client
completion = chat_client.chat.completions.create(
 model="gpt-3.5-turbo",
 messages=[
 {"role": "system", "content": prompt},
 {"role": "user", "content": f"Query: {query} \n" }
 ],
 response_format={ "type": "json_object" }
)
# Extract the message that holds the recommendations
recommendations = completion.choices[0].message

In [45]:
# Parse the JSON string returned by the model into a Python dictionary
import json
rec_obj = json.loads(recommendations.content)
rec_obj

{'rec_prod': ['Organic Chicken Noodle Soup',
  'Homemade Chicken Noodle Soup Mix',
  'Artisanal Egg Noodles',
  'Soba Noodles',
  'Spicy Chicken Ramen Noodles']}
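Model output is JSON rather than Python literal syntax, so values such as `true` or `null` would break `ast.literal_eval`, and the expected key may occasionally be missing. The sketch below parses the response defensively with `json.loads`; `parse_recommendations` is a hypothetical helper, and the key name `rec_prod` comes from the prompt above.

```python
import json


def parse_recommendations(raw: str, key: str = "rec_prod") -> list[str]:
    """Return the list of product names from a JSON response, or [] if absent."""
    payload = json.loads(raw)
    items = payload.get(key, []) if isinstance(payload, dict) else []
    if not isinstance(items, list):
        return []
    # Keep only non-empty strings
    return [item for item in items if isinstance(item, str) and item.strip()]


print(parse_recommendations('{"rec_prod": ["Soba Noodles", "Artisanal Egg Noodles"]}'))
```

Returning an empty list instead of raising lets the caller fall back to the plain hybrid search on the original query.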

In [47]:
# rec_obj holds suggested product names, not real catalog products, so we ask the
# database for the items most similar to one suggestion (here, the last in the list).
response = products.query.hybrid(
 query=rec_obj['rec_prod'][-1], limit=5, return_metadata=wq.MetadataQuery(score=True)
)
for o in response.objects:
    print(o.properties["name"], o.properties["actual_price"], o.properties["discount_price"], o.properties["image"])
    print(f"Hybrid score: {o.metadata.score:.3f}\n")

Samyang Carbo Hot Chicken Flavour Raman Cup Noodles, 70mg*4 Pack (Pack of 4) (Imported) 6.43 5.94 https://m.media-amazon.com/images/I/51EN1zzpl8S._AC_UL320_.jpg
Hybrid score: 0.934

Samyang Hot Chicken Buldak Carbonara Noodles, 130 grams 3.56 2.02 https://m.media-amazon.com/images/I/817eoiBIqiL._AC_UL320_.jpg
Hybrid score: 0.905

Sam Yang Carbo & Cheese Hot Chicken Flavour Raman Cup Noodles, 70 grams*2 Pack (Pack of 2) (Imported) 7.13 3.81 https://m.media-amazon.com/images/W/IMAGERENDERING_521856-T2/images/I/618GOoLaVnL._AC_UL320_.jpg
Hybrid score: 0.810

Top Ramen Cup Noodles Italiano, 70g 0.6 0.49 https://m.media-amazon.com/images/I/91NtaMEIFoL._AC_UL320_.jpg
Hybrid score: 0.722

Indomie Instant Noodles Chicken Flavour-Pack of 20 6.9 6.61 https://m.media-amazon.com/images/I/61ZlxKR3C6L._AC_UL320_.jpg
Hybrid score: 0.721



In [48]:
import pandas as pd
from IPython.display import HTML

# Example data
data = []

# Collecting the data from your response
for item in response.objects:
    name = item.properties["name"]
    actual_price = item.properties["actual_price"]
    discount_price = item.properties["discount_price"]
    image_url = item.properties["image"]
    score = item.metadata.score
    
    # Append a dictionary to the data list
    data.append({
        "Name": name,
        "Image": f'<img src="{image_url}" width="100"/>',  # HTML image tag
        "Score": score,
        "Actual Price": actual_price,
        "Discount Price": discount_price,
    })

# Create a DataFrame
df = pd.DataFrame(data)

# Display the DataFrame as HTML
HTML(df.to_html(escape=False, index=False))


Name,Image,Score,Actual Price,Discount Price
"Samyang Carbo Hot Chicken Flavour Raman Cup Noodles, 70mg*4 Pack (Pack of 4) (Imported)",,0.933647,6.43,5.94
"Samyang Hot Chicken Buldak Carbonara Noodles, 130 grams",,0.90469,3.56,2.02
"Sam Yang Carbo & Cheese Hot Chicken Flavour Raman Cup Noodles, 70 grams*2 Pack (Pack of 2) (Imported)",,0.809984,7.13,3.81
"Top Ramen Cup Noodles Italiano, 70g",,0.722482,0.6,0.49
Indomie Instant Noodles Chicken Flavour-Pack of 20,,0.720561,6.9,6.61


This approach can be problematic if the product catalog is limited and the recommendations are too narrow, since they may not align with the user’s preferences. Here, we can use Weaviate’s nearObject operator, which finds the objects in Weaviate that are most similar to an existing object.

In [50]:
# Find the products most similar to the top result of the previous query
response_near = products.query.near_object(near_object=response.objects[0].uuid, limit=5)
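Conceptually, a nearObject query looks up the stored vector of the given object and ranks other objects by how close their vectors are to it. The sketch below shows that idea on toy two-dimensional vectors, assuming cosine similarity as the measure. The function names and data are illustrative assumptions, not Weaviate's implementation.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_to(object_id: str, vectors: dict[str, list[float]],
               limit: int = 5) -> list[tuple[str, float]]:
    """Rank stored objects by similarity to the vector of `object_id`."""
    anchor = vectors[object_id]
    scored = [(oid, cosine_similarity(anchor, vec)) for oid, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


# Toy vectors standing in for product embeddings
toy = {"ramen": [1.0, 0.0], "noodles": [0.9, 0.1], "tea": [0.0, 1.0]}
print(nearest_to("ramen", toy, limit=2))
```

In this sketch the anchor object itself scores 1.0 and comes first, so a caller that wants only other products would skip the first entry.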

## 9. References


https://medium.com/@haziqa5122/recommender-system-using-llms-and-vector-databases-03fa90e850d1