In [None]:
!pip install pandas sentence_transformers
!pip install -U transformers
!pip install hnswlib
# Install below if using GPU
!pip install accelerate

# Content

In the digital space, personalizing the shopping experience is key for businesses aiming to keep customers happy and engaged. One innovative method to achieve this involves using Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) structures. This blog dives into a hands-on example showing how these technologies can be applied to suggest products that fit each user's unique taste.

# 1. Data Preprocessing
In this section, we'll start by importing libraries and loading the product purchase dataset. We then preprocess the dataset before generating embeddings of the product names.
## 1.1 Libraries
We begin by importing the required libraries. They are used for loading the dataset, generating embeddings, and combining an LLM with RAG for personalized recommendations.

In [4]:
from transformers import AutoTokenizer, AutoModelForCausalLM
from sentence_transformers import SentenceTransformer

import pandas as pd
import numpy as np

import hnswlib
import torch

## 1.2 Dataset
The dataset contains records of purchased products. It can be downloaded from [here](https://www.kaggle.com/datasets/carrie1/ecommerce-data/data). Readers who want to analyze the dataset further can check out this [notebook](https://www.kaggle.com/code/fabiendaniel/customer-segmentation). The attributes of the dataset are given below:
1. **InvoiceNo: Transaction Code** - A unique 6-digit numeric identifier for each transaction. Transactions beginning with 'C' denote cancellations.
2. **StockCode: Item Code** - A unique code assigned to each distinct product, typically 5 digits and sometimes followed by a letter (e.g. `85123A`).
3. **Description: Product Name** - A text descriptor for each product.
4. **Quantity: Product Amount** - The number of units of each product included in a transaction.
5. **InvoiceDate: Transaction Date and Time** - The date and time at which each transaction occurred.
6. **UnitPrice: Price per Unit** - The cost of one unit of the products.
7. **CustomerID: Buyer ID** - A unique 5-digit numeric identifier assigned to each customer.
8. **Country: Customer's Country** - The country where the customer resides.

In [5]:
df_purchases = pd.read_csv('data.csv', encoding='unicode_escape')
print("Row Count:",df_purchases.shape[0])
df_purchases.head()

Row Count: 541909


| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |


## 1.3 Preprocessing
We preprocess the dataset to remove null values, duplicate rows, and cancelled orders, so that embeddings and recommendations are generated from clean data.

In [6]:
# elimination of NaN values
df_purchases.dropna(inplace=True)
# elimination of duplicate rows
df_purchases.drop_duplicates(inplace=True)
# elimination of cancelled orders
df_purchases = df_purchases[~df_purchases['InvoiceNo'].str.startswith('C')]

## 1.4 User History & Product to Description Dictionary Generation
In this part of our blog, we'll focus on summarizing the purchase history of each customer and creating a clear mapping of product codes to their descriptions. This is necessary to get similar products from queries and rerank them with respect to their history.


In [7]:
# User purchase history
customer_history_dict = df_purchases.groupby("CustomerID")['StockCode'].apply(lambda x: sorted(list(set(x)))).to_dict()

# product to description dictionary
df_product_descriptions = df_purchases[["StockCode", "Description"]]
# Keep one row per distinct (StockCode, Description) pair. Reassigning the result
# instead of using inplace=True avoids pandas' SettingWithCopyWarning.
df_product_descriptions = df_product_descriptions.drop_duplicates()
# dictionary generation
product_to_description_dict = dict(zip(df_product_descriptions['StockCode'], df_product_descriptions['Description']))


In [8]:
def get_previous_purchases(user_id, k=3):
  """Gets previous purchases of the user"""
  product_list = customer_history_dict.get(user_id, [])
  purchase_descriptions = ""
  for i, product in enumerate(product_list[:k]):
    product_description = product_to_description_dict.get(product, "")
    purchase_descriptions += f"{i+1}. {product_description}\n"

  return purchase_descriptions
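
As a quick check of what this helper produces, the sketch below runs the same logic on two hand-made dictionaries. The customer IDs, stock codes, and product names are invented for illustration, and `format_previous_purchases` is a stand-in name, not part of the notebook. Note that because each history is stored as a sorted list of unique stock codes, `k` selects the first codes in sorted order rather than the most recent purchases.

```python
# Toy stand-ins for customer_history_dict and product_to_description_dict
# (illustrative values, not taken from the dataset).
toy_history = {1: ["10002", "20001", "30003", "40004"]}
toy_descriptions = {
    "10002": "RED LANTERN",
    "20001": "BLUE MUG",
    "30003": "GREEN BAG CHARM",
    "40004": "PARTY BUNTING",
}

def format_previous_purchases(user_id, history, descriptions, k=3):
    """Return the first k purchased products as a numbered string, one per line."""
    purchase_descriptions = ""
    for i, code in enumerate(history.get(user_id, [])[:k]):
        purchase_descriptions += f"{i + 1}. {descriptions.get(code, '')}\n"
    return purchase_descriptions

print(format_previous_purchases(1, toy_history, toy_descriptions))
# An unknown customer yields an empty string rather than an error.
print(repr(format_previous_purchases(999, toy_history, toy_descriptions)))
```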

# 2. Generating Product Embeddings

Now, our embeddings are ready to be generated. This code segment transforms product descriptions into numerical embeddings, crucial for machine learning models to process text. By mapping descriptions to a vector space, we can evaluate item similarities. The pretrained model, trained on a diverse corpus, is well-suited for tasks like information retrieval and semantic textual similarity. For a detailed explanation of the model, you can check [here](https://huggingface.co/thenlper/gte-small).
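
To make "similarity in a vector space" concrete, here is a minimal, dependency-free sketch of cosine similarity, the measure used later when the index is built. The three short vectors are hand-made stand-ins for real embeddings, which have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made 3-dimensional vectors standing in for real embeddings.
lantern_a = [0.9, 0.1, 0.0]
lantern_b = [0.8, 0.2, 0.1]
cakestand = [0.0, 0.2, 0.9]

print(cosine_similarity(lantern_a, lantern_b))  # close to 1: similar direction
print(cosine_similarity(lantern_a, cakestand))  # close to 0: nearly orthogonal
```

Two descriptions with related meaning should end up with vectors pointing in a similar direction, which is what the score near 1 expresses.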

In [9]:
# Sentence Transformer embedding model
embedding_model = SentenceTransformer("thenlper/gte-small")

def get_embedding(text: str) -> list[float]:
    if not text.strip():
        print("Attempted to get embedding for empty text.")
        return []

    embedding = embedding_model.encode(text)

    return embedding.tolist()

df_product_descriptions["embedding"] = df_product_descriptions["Description"].apply(get_embedding)


## 2.1 Setting Up the Approximate Nearest Neighbor (ANN) Index
In this section, first, we define the dimensions of our embeddings. We then initialize the ANN index with the specified dimensionality and number of elements, leveraging cosine similarity for comparison. After configuring the index parameters for construction and querying efficiency, we load the product embeddings into the index. This setup will enable quick retrieval of similar products based on their embeddings.

In [10]:
# Embedding model dimension
dim = embedding_model.get_sentence_embedding_dimension()

num_elements = df_product_descriptions.shape[0]
# hnswlib initialization with cosine similarity
p = hnswlib.Index(space='cosine', dim=dim)

p.init_index(max_elements=num_elements, ef_construction=100, M=16)

p.set_ef(10)

embeddings = np.vstack(df_product_descriptions["embedding"].values)
p.add_items(embeddings)
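
For intuition about what the index approximates, the sketch below performs an exact (brute-force) nearest-neighbor search in plain Python. It follows the convention of hnswlib's `cosine` space, where distance is 1 minus cosine similarity, and returns labels and distances nearest first. The function names and toy vectors are illustrative assumptions, not part of the notebook.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity, so 0 means same direction and 2 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def exact_knn(query, vectors, k):
    """Return (labels, distances) of the k vectors closest to query, nearest first."""
    scored = sorted((cosine_distance(query, v), i) for i, v in enumerate(vectors))
    top = scored[:k]
    return [i for _, i in top], [d for d, _ in top]

# Labels are simply the positions of the vectors, as with add_items above.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
labels, distances = exact_knn([1.0, 0.0], vectors, k=2)
print(labels, distances)
```

An exact search compares the query with every stored vector, which becomes slow for large catalogs. An HNSW index trades a small amount of accuracy for much faster lookups, with `M`, `ef_construction`, and `ef` controlling that trade-off.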

## 2.2 Query-based Retrieval of Similar Products
With the ANN index in place, we implement query-based product search to retrieve products similar to a user's query. The first function, `vector_search`, takes the query, converts it into an embedding, and then uses the ANN index to find the top k similar items.

In [11]:
def vector_search(user_query, k):
    """Gets user input query and return top k similar items"""

    # Generate embedding for the user query
    query_embedding = get_embedding(user_query)

    # get_embedding returns an empty list (not None) for blank queries
    if not query_embedding:
        return []


    labels, distances = p.knn_query(query_embedding, k=k)
    results = df_product_descriptions.iloc[list(labels[0])].to_dict('records')
    return results

The `get_search_result` function aggregates the search results from `vector_search` and formats them into a readable string. Each product description is numbered and listed so that it can be fed to the LLM.

In [12]:
def get_search_result(query, k):
    """Aggregate similar product descriptions into one string"""
    get_knowledge = vector_search(query, k)

    search_result = ""
    for i, result in enumerate(get_knowledge):
        search_result += f"{i+1}. {result.get('Description', 'N/A')}\n"

    return search_result

A small demonstration of the retrieval step:

In [13]:
# Gets top k similar products w.r.t provided query
k = 3
query = "lantern"
source_information = get_search_result(query, k)
combined_information = f"Similar Results:\n{source_information}"

print(combined_information)

Similar Results:
1. WHITE METAL LANTERN
2. WHITE MOROCCAN METAL LANTERN
3. FRENCH CARRIAGE LANTERN



# 3. Personalized Reranking
Here comes the fancy part: a personalized ranking of recommendations with the LLM. In this section, we demonstrate how to use a large language model for personalization in a recommendation system. We use the instruction-tuned Gemma-2B model (`google/gemma-2b-it`), since our goal is to illustrate how an LLM and RAG fit together for personalized recommendations rather than to showcase groundbreaking innovation.

To use the Gemma-2B model, you need a user access token. The token is used for accessing [huggingface](https://huggingface.co/) and downloading the [model](https://huggingface.co/google/gemma-2b). If you don't have one, check this [link](https://huggingface.co/docs/hub/en/security-tokens).

In [14]:
from huggingface_hub import notebook_login
notebook_login()


In [15]:
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Check the device type explicitly instead of comparing the device object to a string
if device.type == 'cpu':
  model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it")
else:
  model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it", device_map="auto")


## Showcasing Personalized Recommendations
To demonstrate our personalized recommendation approach, we utilize structured prompts for few-shot learning to guide the model in reranking product suggestions based on user preferences. After retrieving related items for a query like "BAG CHARM," we define prompts that instruct the model on how to tailor recommendations to match the user's historical interests.
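
The prompt layout described here (an instruction, a few worked examples, then the real query) can also be assembled programmatically. The sketch below is one way to do that; `format_numbered` and `build_rerank_prompt` are hypothetical helpers introduced for illustration, and the shortened instruction text is an assumption, not the notebook's exact wording.

```python
def format_numbered(items):
    """Render a list of product names as '1. NAME' lines."""
    return "\n".join(f"{i}. {name}" for i, name in enumerate(items, start=1))

def build_rerank_prompt(instruction, examples, previous_purchases, candidates):
    """Assemble the instruction, worked examples, and final query into one prompt."""
    parts = [instruction, ""]
    for n, ex in enumerate(examples, start=1):
        parts += [
            f"Example {n}:",
            "- User Input:",
            "Previous Purchases:",
            format_numbered(ex["purchases"]),
            "Recommended Products:",
            format_numbered(ex["candidates"]),
            "",
            "- Model Output:",
            "Reranked Recommendations:",
            format_numbered(ex["reranked"]),
            "",
        ]
    parts += [
        "Your Turn:",
        "- User Input:",
        "Previous Purchases:",
        format_numbered(previous_purchases),
        "Recommended Products:",
        format_numbered(candidates),
        "- Model Output:",
        "",  # leaves a trailing newline for the model to continue from
    ]
    return "\n".join(parts)

instruction = "Rerank the 'Recommended Products' by relevance to the 'Previous Purchases'."
examples = [{
    "purchases": ["BLUE CALCULATOR RULER", "DOORMAT TOPIARY", "PARTY BUNTING"],
    "candidates": ["CRYSTAL FROG PHONE CHARM", "PINK CRYSTAL SKULL PHONE CHARM",
                   "BLUE LEAVES AND BEADS PHONE CHARM"],
    "reranked": ["BLUE LEAVES AND BEADS PHONE CHARM", "CRYSTAL FROG PHONE CHARM",
                 "PINK CRYSTAL SKULL PHONE CHARM"],
}]
prompt = build_rerank_prompt(
    instruction, examples,
    previous_purchases=["EDWARDIAN PARASOL NATURAL"],
    candidates=["COPPER AND BRASS BAG CHARM", "IVORY GOLD METAL BAG CHARM"],
)
print(prompt)
```

Keeping the examples as data makes it easier to swap them per product category, at the cost of a little more code than a single hand-written string.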

In [35]:
# User id to check result of personalized recommendation
user_id = 15781

query = "BAG CHARM"
k = 3
# get similar items
source_information = get_search_result(query, k)
previous_purchases = get_previous_purchases(user_id)

# Providing example prompts (few-shot learning) to get the desired output
example_prompt = f"""Given a customer's 'Previous Purchases', rerank a list of 'Recommended Products' from most to least relevant to the customer's preferences. Only recommend products from the latest 'Recommended Products' section. The relevance should be determined by considering the types and themes of products the customer has bought before.

Example 1:
- User Input:
Previous Purchases:
1. BLUE CALCULATOR RULER
2. DOORMAT TOPIARY
3. PARTY BUNTING
Recommended Products:
1. CRYSTAL FROG PHONE CHARM
2. PINK CRYSTAL SKULL PHONE CHARM
3. BLUE LEAVES AND BEADS PHONE CHARM

- Model Output:
Reranked Recommendations:
1. BLUE LEAVES AND BEADS PHONE CHARM
2. CRYSTAL FROG PHONE CHARM
3. PINK CRYSTAL SKULL PHONE CHARM

Example 2:
- User Input:
Previous Purchases:
1. PANTRY HOOK SPATULA
2. BIRDCAGE DECORATION TEALIGHT HOLDER
3. REGENCY TEA PLATE PINK
Recommended Products:
1. SWEETHEART CAKESTAND 3 TIER
2. CAKESTAND, 3 TIER, LOVEHEART
3. REGENCY CAKESTAND 3 TIER

- Model Output:
Reranked Recommendations:
1. REGENCY CAKESTAND 3 TIER
2. SWEETHEART CAKESTAND 3 TIER
3. CAKESTAND, 3 TIER, LOVEHEART

"""

combined_information = f"""{example_prompt}

Your Turn:
- User Input:
Previous Purchases:
{previous_purchases}
Recommended Products:
{source_information}
- Model Output:
"""

# will be used to extract last prompt
key_text = 'Your Turn:'

# input ids
input_ids = tokenizer(combined_information, return_tensors="pt").to(device)
response = model.generate(**input_ids, max_new_tokens=500)
output_text = tokenizer.decode(response[0])
output_text = output_text[output_text.index(key_text) + len(key_text):]

print(f"Query: {query}")
print(output_text)

Query: BAG CHARM

- User Input:
Previous Purchases:
1. EDWARDIAN PARASOL NATURAL
2. BLUE STRIPE CERAMIC DRAWER KNOB
3. WHITE LOVEBIRD LANTERN

Recommended Products:
1. COPPER AND BRASS BAG CHARM
2. IVORY GOLD METAL BAG CHARM
3. WHITE WITH METAL BAG CHARM

- Model Output:
Reranked Recommendations:
1. WHITE WITH METAL BAG CHARM
2. COPPER AND BRASS BAG CHARM
3. IVORY GOLD METAL BAG CHARM<eos>
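
Because the model returns free text, a downstream system would usually turn that text back into a list and check it against the retrieved candidates. The sketch below is one possible approach under that assumption; `parse_reranked` is a hypothetical helper, and the fallback of appending dropped candidates is a design choice rather than something the notebook does.

```python
import re

def parse_reranked(output_text, candidates):
    """Extract the reranked product names, keeping only products that were offered."""
    # Look only at the text after the last 'Reranked Recommendations:' marker.
    _, _, tail = output_text.rpartition("Reranked Recommendations:")
    tail = tail.replace("<eos>", "")
    allowed = set(candidates)
    ranked = []
    for line in tail.splitlines():
        match = re.match(r"\s*\d+\.\s*(.+?)\s*$", line)
        if match and match.group(1) in allowed and match.group(1) not in ranked:
            ranked.append(match.group(1))
    # Append any candidate the model left out so the result is a full ranking.
    ranked += [c for c in candidates if c not in ranked]
    return ranked

candidates = [
    "COPPER AND BRASS BAG CHARM",
    "IVORY GOLD METAL BAG CHARM",
    "WHITE WITH METAL BAG CHARM",
]
sample_output = """- Model Output:
Reranked Recommendations:
1. WHITE WITH METAL BAG CHARM
2. COPPER AND BRASS BAG CHARM
3. IVORY GOLD METAL BAG CHARM<eos>"""

print(parse_reranked(sample_output, candidates))
```

Filtering against the candidate list guards against the model inventing a product name that is not in the catalog.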


## Why Utilize LLM for Reranking?
Simpler ranking methods can be extended into robust real-time reranking systems, so using a large language model for this step needs a justification. One reason is that an LLM can explain its reranking decisions, which helps users understand the rationale behind the order of the recommendations.

In [16]:
# User id to check result of personalized recommendation
user_id = 15781

query = "BAG CHARM"
k = 3
# get similar items
source_information = get_search_result(query, k)
previous_purchases = get_previous_purchases(user_id)

# Providing example prompts (few-shot learning) to get the desired output
example_prompt = f"""Given a customer's 'Previous Purchases', rerank a list of 'Recommended Products' from most to least relevant to the customer's preferences. Only recommend products from the latest 'Recommended Products' section. The relevance should be determined by considering the types and themes of products the customer has bought before. Also give a brief explanation of the reason for each rank.

Example 1:
- User Input:
Previous Purchases:
1. BLUE CALCULATOR RULER
2. DOORMAT TOPIARY
3. PARTY BUNTING
Recommended Products:
1. CRYSTAL FROG PHONE CHARM
2. PINK CRYSTAL SKULL PHONE CHARM
3. BLUE LEAVES AND BEADS PHONE CHARM

- Model Output:
Reranked Recommendations:
1. BLUE LEAVES AND BEADS PHONE CHARM - Matches blue theme; visually appealing.
2. CRYSTAL FROG PHONE CHARM - Playful, aligns with fun items.
3. PINK CRYSTAL SKULL PHONE CHARM - Decorative, less color relevance noted.

Example 2:
- User Input:
Previous Purchases:
1. PANTRY HOOK SPATULA
2. BIRDCAGE DECORATION TEALIGHT HOLDER
3. REGENCY TEA PLATE PINK
Recommended Products:
1. SWEETHEART CAKESTAND 3 TIER
2. CAKESTAND, 3 TIER, LOVEHEART
3. REGENCY CAKESTAND 3 TIER

- Model Output:
Reranked Recommendations:
1. REGENCY CAKESTAND 3 TIER - Matches Regency style; highly relevant.
2. SWEETHEART CAKESTAND 3 TIER - Elegant, complements table setting decor.
3. CAKESTAND, 3 TIER, LOVEHEART - Decorative, thematic but less specific.
"""

combined_information = f"""{example_prompt}

Your Turn:
- User Input:
Previous Purchases:
{previous_purchases}
Recommended Products:
{source_information}
- Model Output:
"""

# will be used to extract last prompt
key_text = 'Your Turn:'

# input ids
input_ids = tokenizer(combined_information, return_tensors="pt").to(device)
response = model.generate(**input_ids, max_new_tokens=500)
output_text = tokenizer.decode(response[0])
output_text = output_text[output_text.index(key_text) + len(key_text):]

print(f"Query: {query}")
print(output_text)

Query: BAG CHARM

- User Input:
Previous Purchases:
1. EDWARDIAN PARASOL NATURAL
2. BLUE STRIPE CERAMIC DRAWER KNOB
3. WHITE LOVEBIRD LANTERN

Recommended Products:
1. COPPER AND BRASS BAG CHARM
2. IVORY GOLD METAL BAG CHARM
3. WHITE WITH METAL BAG CHARM

- Model Output:
Reranked Recommendations:
1. WHITE WITH METAL BAG CHARM - Matches white theme; complements previous item.
2. COPPER AND BRASS BAG CHARM - Matches Edwardian style; complements previous item.
3. IVORY GOLD METAL BAG CHARM - Less relevant to the customer's previous purchases.<eos>
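
To show the explanations in a user interface, the generated text needs to be split into product and reason. The sketch below assumes the model keeps the `NAME - reason` layout of the few-shot examples, which is not guaranteed; `parse_explanations` is a hypothetical helper, not part of the notebook.

```python
import re

def parse_explanations(output_text, candidates):
    """Map each offered product to the explanation the model gave for its rank."""
    _, _, tail = output_text.rpartition("Reranked Recommendations:")
    explanations = {}
    for line in tail.replace("<eos>", "").splitlines():
        # Drop the leading '1. ' style numbering.
        body = re.sub(r"^\s*\d+\.\s*", "", line).strip()
        for name in candidates:
            if body.startswith(name):
                # Whatever follows the product name, minus the ' - ' separator.
                explanations[name] = body[len(name):].lstrip(" -").strip()
                break
    return explanations

candidates = [
    "COPPER AND BRASS BAG CHARM",
    "IVORY GOLD METAL BAG CHARM",
    "WHITE WITH METAL BAG CHARM",
]
sample_output = """Reranked Recommendations:
1. WHITE WITH METAL BAG CHARM - Matches white theme; complements previous item.
2. COPPER AND BRASS BAG CHARM - Matches Edwardian style; complements previous item.
3. IVORY GOLD METAL BAG CHARM - Less relevant to the customer's previous purchases.<eos>"""

for product, reason in parse_explanations(sample_output, candidates).items():
    print(f"{product}: {reason}")
```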


# Conclusion
This post has shown how an LLM and RAG can be combined for personalized recommendations. The goal was to clarify how these technologies can be integrated into existing systems, moving through each step from initial data preprocessing to the reranking of products. The insights here are intended to help developers and businesses apply LLMs and RAG to personalized e-commerce experiences. Applied to larger datasets, this approach has the potential to make product recommendations more relevant and personally appealing, and the user journey more engaging as a result.


# Future Work
The purpose of this post was to give an insight into using an LLM and RAG for personalization, and the results can still be improved. Possible directions for future work:
- **Using a larger dataset**: A larger and more varied dataset gives retrieval more candidates to choose from and gives the reranker richer purchase histories to work with.
- **Using more complex models**: Using advanced language models and sophisticated embedding techniques can enhance the quality and relevance of the recommendations.
- **Generating embeddings from item to item relations**: Generating embeddings from the relationships between items can enhance the potential for personalized reranking.
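
As a possible starting point for the last item, item-to-item relations can be derived from which products appear together on the same invoice. The sketch below only counts co-occurring pairs; turning those counts into embeddings is left open. The stock codes appear in the dataset preview above, but their grouping into baskets here is invented for illustration.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(baskets):
    """Count how often each unordered pair of items appears in the same basket."""
    counts = Counter()
    for items in baskets:
        # sorted(set(...)) removes repeats and gives each pair a canonical order.
        for a, b in combinations(sorted(set(items)), 2):
            counts[(a, b)] += 1
    return counts

# Invented baskets: each inner list stands for the stock codes on one invoice.
toy_baskets = [
    ["71053", "85123A", "84406B"],
    ["71053", "85123A"],
    ["84406B", "84029G"],
]
counts = cooccurrence_counts(toy_baskets)
print(counts.most_common(1))
```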

# References

[1] https://qdrant.tech/articles/what-is-rag-in-ai

[2] https://huggingface.co/learn/cookbook/en/rag_with_hugging_face_gemma_mongodb

[3] https://www.kaggle.com/datasets/carrie1/ecommerce-data/data

[4] https://www.kaggle.com/code/fabiendaniel/customer-segmentation

[5] https://huggingface.co/thenlper/gte-small