# SciKit-Learn x LightGBM - Hybrid Recommender System (Content-based Filtering)

Content-Based Recommendations systems are the systems that look for similarity before recommending something. To understand how similarity between different products is computed, there are different techniques or similarity measures that are used to compute the similarity, such as Euclidean distance and cosine similarity. This system uses the metadata item name to make these recommendations, this decision is made as items sold on e-commerce platforms are usually named in a way that includes the product name, brand name, features, and simple descriptions in order to stand out to users and be in favor of the search engines.

In this assignment, I went through the steps of building a hybrid model of content-based recommender systems using the Lazada Indonesia Data set that is publicly available on Kaggle. To achieve this, I utilized computed values of pairwise cosine similarity scores from scikit-learn library, and regression algorithms by LightGBM for all products sold on Lazada Indonesia based on their product name and predicted ratings, and recommend relevant products based on the combination of the cosine similarity score and the predicted ratings value.

(full dataset can be downloaded here) https://www.kaggle.com/datasets/grikomsn/lazada-indonesian-reviews?select=20191002-items.csv

In [1]:
pip install pyspellchecker

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install lightgbm

Note: you may need to restart the kernel to use updated packages.


In [None]:
pip install spacy

In [None]:
pip install streamlit

## Section 1 Data Preparation

"20191002-items.csv" contains information on 10,942 products being sold in the Lazada Indonesia dataset. Features include itemId, category, name, brandName, url, price, averageRating, totalReviews, retrievedDate.

Let's load the products metadata dataset into a pandas DataFrame:

In [3]:
# Import Pandas
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
import numpy as np
import random
import tkinter as tk
from tkinter import Scrollbar, Text
from tkinter import messagebox

import matplotlib.pyplot as plt
import seaborn as sns

import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

# Load Movies Metadata
metadata = pd.read_csv('20191002-items.csv', low_memory=False)

# Print the first three rows
metadata.head()

Unnamed: 0,itemId,category,name,brandName,url,price,averageRating,totalReviews,retrievedDate
0,100002528,beli-harddisk-eksternal,"TOSHIBA Smart HD LED TV 32"" - 32L5650VJ Free B...",Toshiba,https://www.lazada.co.id/products/toshiba-smar...,2499000,4,8,2019-10-02
1,100003785,beli-harddisk-eksternal,"TOSHIBA Full HD Smart LED TV 40"" - 40L5650VJ -...",Toshiba,https://www.lazada.co.id/products/toshiba-full...,3788000,3,3,2019-10-02
2,100004132,beli-harddisk-eksternal,Samsung 40 Inch Full HD Flat LED Digital TV 4...,LG,https://www.lazada.co.id/products/samsung-40-i...,3850000,3,2,2019-10-02
3,100004505,beli-harddisk-eksternal,"Sharp HD LED TV 24"" - LC-24LE175I - Hitam",Sharp,https://www.lazada.co.id/products/sharp-hd-led...,1275000,3,11,2019-10-02
4,100005037,beli-harddisk-eksternal,Lenovo Ideapad 130-15AST LAPTOP MULTIMEDIA I A...,Lenovo,https://www.lazada.co.id/products/lenovo-ideap...,3984100,5,1,2019-10-02


In [5]:
metadata.shape

(10942, 9)

There are 10942 rows and 9 columns

Let's inspect the names of a few products:

In [6]:
#The plot description is available to you as the overview feature in your metadata dataset. 
metadata['name'].head()

0    TOSHIBA Smart HD LED TV 32" - 32L5650VJ Free B...
1    TOSHIBA Full HD Smart LED TV 40" - 40L5650VJ -...
2     Samsung 40 Inch Full HD Flat LED Digital TV 4...
3            Sharp HD LED TV 24" - LC-24LE175I - Hitam
4    Lenovo Ideapad 130-15AST LAPTOP MULTIMEDIA I A...
Name: name, dtype: object

## Section 2 Features Generation


Next, we used scikit-learn's built-in TfIdfVectorizer class to produce the TF-IDF matrix, by:

1. Importing the Tfidf module using scikit-learn;

2. Removing stop words like 'the', 'an', etc. since they do not give any useful information about the topic;

3. Replacing not-a-number values with a blank string; 

4. Constructing the TF-IDF matrix on the data.

In [7]:
# Import TfIdfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

# Define a TF-IDF Vectorizer Object. Remove all English stop words
tfidf = TfidfVectorizer(stop_words='english')

# Replace NaN values with an empty string for the 'name' column (or any other column you want to use for TF-IDF)
metadata['name'] = metadata['name'].fillna('')

# Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(metadata['name'])

# Output the shape of tfidf_matrix
tfidf_matrix.shape


(10942, 3784)

There are 3,784 different vocabularies or words (features) in the dataset that contains 10,942 products.


In [8]:
#Array mapping from feature integer indices to feature name.
tfidf.get_feature_names_out()[1000:1010]

array(['army', 'array', 'art', 'asli', 'asm', 'aspire', 'assus', 'astro',
       'asuransi', 'asus'], dtype=object)

With the matrix, we can now use the cosine similarity to calculate a numeric quantity that denotes the similarity between two products. Cosine similarity is a popular choice because it works well for text-based data and is robust to the magnitude of vectors. It takes into account the importance of words in a text-based document collection.

Since TF-IDF vectorizer is used, calculating the dot product between each vector will directly give you the cosine similarity score. Therefore, we will use sklearn's <i>linear_kernel()</i> instead of <i>cosine_similarities()</i> since it is faster.

In [57]:
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [58]:
cosine_sim.shape

(10942, 10942)

The above returns a matrix of shape 10942x10942, which means each product overview cosine similarity score with every other product overview. Hence, each product will be a 1x10942 column vector where each column will be a similarity score with each product.

In [11]:
#observing the first 6 rows and 6 columns
for i in range(6):
    print(cosine_sim[i][:6])

[1.         0.48307326 0.13742051 0.20957068 0.05027774 0.        ]
[0.48307326 1.         0.23674775 0.18661981 0.05422743 0.        ]
[0.13742051 0.23674775 1.         0.12880672 0.05339002 0.        ]
[0.20957068 0.18661981 0.12880672 1.         0.06132393 0.        ]
[0.05027774 0.05422743 0.05339002 0.06132393 1.         0.        ]
[0. 0. 0. 0. 0. 1.]


Next, we defined a function that takes in a product name as an input and outputs a list of the 10 most similar products. Firstly, for this, we need a reverse mapping of product names and DataFrame indices. In other words, we are generating the ID for each product name using index.

In [12]:
#Construct a reverse map of indices and movie titles
indices = pd.Series(metadata.index, index=metadata['name']).drop_duplicates()

#check the first 10 indices
indices[:]

name
TOSHIBA Smart HD LED TV 32" - 32L5650VJ Free Bracket TV - Hitam - Khusus Jabodetabek                                             0
TOSHIBA Full HD Smart LED TV 40" - 40L5650VJ - Hitam - Khusus Jabodetabek                                                        1
 Samsung 40 Inch Full HD Flat LED Digital TV 40J5000                                                                             2
Sharp HD LED TV 24" - LC-24LE175I - Hitam                                                                                        3
Lenovo Ideapad 130-15AST LAPTOP MULTIMEDIA I AMD A4-9125 I 8GB DDR4 I 1TB HDD I AMD Radeon R3 I 15,6 HD LED I DVDRW I DOS        4
                                                                                                                             ...  
Toshiba 32L3750VJ Digital Tv DVB-T2 LED TV 32" + Free Bracket TV - Khusus JABODETABEK                                        10937
Samsung 43K5002AK Televisi LED - Khusus JABODETABEK                           

## Section 3 Content-Based Filtering Recommender

Next we will build a content filtering recommender by doing the following steps:

1. Getting the index of the product given its name.

2. Getting the list of cosine similarity scores for that particular product with all products. Convert it into a list of tuples where the first element is its position, and the second is the similarity score.

3. Sorting the aforementioned list of tuples based on the similarity scores; that is, the second element.

4. Getting the top 10 elements of this list. Ignore the first element as it refers to self (the product most similar to a particular product is the product itself).

5. Returning the titles corresponding to the indices of the top elements.

In [13]:
def get_recommendations(name, cosine_sim=cosine_sim):
    # Get the index of the product that matches the title
    idx = indices[name]
    if isinstance(idx, pd.Series):
        # Handle multiple products with the same name by selecting the first one
        idx = idx.iloc[0]

    # Get the pairwise similarity scores of all products with that product
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the products based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar products
    sim_scores = sim_scores[1:11]

    # Get the product indices
    product_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return metadata['name'].iloc[product_indices]

In [79]:
get_recommendations('Refurbished Lenovo Thinkpad L430 - 4GB - 320GB - 14 inch - Intel Core i5 2nd Gen')

5075     Refurbished Lenovo Thinkpad L430 - 4GB - 320GB...
6333     Refurbished Lenovo Thinkpad L430 - 4GB - 320GB...
9388     Refurbished Lenovo Thinkpad L430 - 4GB - 320GB...
10864    Refurbished Lenovo Thinkpad L430 - 4GB - 320GB...
4008     Refurbished Lenovo Thinkpad Yoga 11e - 4GB - 1...
5073     Refurbished Lenovo Thinkpad Yoga 11e - 4GB - 1...
6331     Refurbished Lenovo Thinkpad Yoga 11e - 4GB - 1...
9386     Refurbished Lenovo Thinkpad Yoga 11e - 4GB - 1...
10862    Refurbished Lenovo Thinkpad Yoga 11e - 4GB - 1...
4059     Promo Lenovo Thinkpad T420S - Laptop - Noteboo...
Name: name, dtype: object

Since people using e-commerce systems search for their desired by keywords that make up a part of a product's name, and never by a product's full name, a function taking in the full name of a product is unrealistic and far from practical. We need a function that is able to take in a string input that can be a part of the full product name to get recommendations.

In [15]:
def get_recommendations_partial(name, cosine_sim=cosine_sim):
    # Filter products that contain the input name as a substring
    matching_products = metadata[metadata['name'].str.contains(name, case=False)]

    if matching_products.empty:
        return "No matching products found."

    recommendations = {}

    for _, matching_product in matching_products.iterrows():
        idx = indices[matching_product['name']]
        sim_scores = cosine_sim[idx]
        avg_sim_score = sim_scores.mean()

        recommendations[matching_product['name']] = avg_sim_score

    sorted_recommendations = sorted(recommendations.items(), key=lambda x: x[1], reverse=True)
    top_recommendations = [rec[0] for rec in sorted_recommendations][:10]

    return top_recommendations

In [86]:
get_recommendations_partial('shiba')

['Flashdisk USB - Flash disk USB - Flash Drive Toshiba 4GB FREE KABEL OTG',
 'Flashdisk 32GB Toshiba USB Flashdisk [32 GB]',
 'Flashdisk Toshiba 32 gb Flash Drive Flash Disk',
 'TOSHIBA Flashdisk 32GB USB Flash Drive - Putih',
 'Flashdisk USB Flash Drive Toshiba 16 GB Putih',
 'Toshiba Flashdisk 16GB USB Flash Memory + Free Flashdisk 8GB',
 'Flashdisk Toshiba USB [8 GB]',
 'Flashdisk Toshiba 4GB Flash Drive Flash Disk',
 'Toshiba Flashdisk Flash Drive Toshiba 64 GB USB 0.3 Murah',
 'External Toshiba Canvio Ready  1TB - HDD / HD / Hardisk Eksternal Usb 3.0 - Hitam  + Gratis  Pouch Hdd + Flash Drive Toshiba 16Gb Usb 3.0']

Now that our Scikit-Learn's cosine similarity score based content-based recommender system is complete, we may proceed to go into creating our LightGBM regression model.

# LightGBM Regression Model
First we import the LightGBM library and create a new dataframe of our products csv file.

In [4]:
import lightgbm as lgb

In [17]:
productdf = pd.read_csv('20191002-items.csv', low_memory=False)
productdf.head()

Unnamed: 0,itemId,category,name,brandName,url,price,averageRating,totalReviews,retrievedDate
0,100002528,beli-harddisk-eksternal,"TOSHIBA Smart HD LED TV 32"" - 32L5650VJ Free B...",Toshiba,https://www.lazada.co.id/products/toshiba-smar...,2499000,4,8,2019-10-02
1,100003785,beli-harddisk-eksternal,"TOSHIBA Full HD Smart LED TV 40"" - 40L5650VJ -...",Toshiba,https://www.lazada.co.id/products/toshiba-full...,3788000,3,3,2019-10-02
2,100004132,beli-harddisk-eksternal,Samsung 40 Inch Full HD Flat LED Digital TV 4...,LG,https://www.lazada.co.id/products/samsung-40-i...,3850000,3,2,2019-10-02
3,100004505,beli-harddisk-eksternal,"Sharp HD LED TV 24"" - LC-24LE175I - Hitam",Sharp,https://www.lazada.co.id/products/sharp-hd-led...,1275000,3,11,2019-10-02
4,100005037,beli-harddisk-eksternal,Lenovo Ideapad 130-15AST LAPTOP MULTIMEDIA I A...,Lenovo,https://www.lazada.co.id/products/lenovo-ideap...,3984100,5,1,2019-10-02


In [18]:
productdf.shape

(10942, 9)

In [19]:
productdf.columns

Index(['itemId', 'category', 'name', 'brandName', 'url', 'price',
       'averageRating', 'totalReviews', 'retrievedDate'],
      dtype='object')

# Data Cleansing
We check for null values in our dataframe and proceeds to remove it first.

In [20]:
productdf.isnull().any()

itemId           False
category         False
name             False
brandName         True
url              False
price            False
averageRating    False
totalReviews     False
retrievedDate    False
dtype: bool

In [21]:
productdf.dropna(inplace=True)

# Feature Engineering
Feature engineering involves creating features from our dataset that might be relevant for making recommendations. For the recommender system using LightGBM, we can perform feature engineering using the columns available in our dataset.

In [22]:
# Define your target variable
target = 'averageRating'  # You can choose 'averageRating' as a target

# Define your features
# You can select relevant columns as features for your recommender
features = ['category', 'brandName', 'price', 'totalReviews']

# If you have a 'retrievedDate' column, you can convert it to a datetime object
productdf['retrievedDate'] = pd.to_datetime(productdf['retrievedDate'])

# Extract additional datetime features
productdf['year'] = productdf['retrievedDate'].dt.year
productdf['month'] = productdf['retrievedDate'].dt.month
productdf['day'] = productdf['retrievedDate'].dt.day
productdf['dayofweek'] = productdf['retrievedDate'].dt.dayofweek

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(productdf[features], productdf[target], test_size=0.2, random_state=42)

# Check the first few rows of the training data to verify the feature engineering
print(X_train.head())

      category  brandName    price  totalReviews
7534         3        168   290000             3
838          0          3  3990000             5
7134         3        192    67500             1
4901         1          3  4115000             2
4796         1         16  8215600             5


# Label Encoding 
Since our features contain categorical variables (e.g., "category," "brandName"), we'll need to encode them into numerical values using techniques like label encoding or one-hot encoding.

In [None]:
# Label encode categorical features (e.g., 'category' and 'brandName')
from sklearn.preprocessing import LabelEncoder

label_encoders = {}
for feature in ['category', 'brandName']:
    label_encoders[feature] = LabelEncoder()
    productdf[feature] = label_encoders[feature].fit_transform(productdf[feature])


# Train-Test Split
We split our dataset into training and testing sets. We'll use the training set to train the model and the testing set to evaluate its performance.

In [23]:
train_data = lgb.Dataset(X_train, label=y_train)

# Train the Model
We defined our LightGBM model's parameters, and trained our LightGBM model using the training dataset and parameters.

In [24]:
params = {
    'objective': 'regression',  # Since it's a regression problem
    'metric': 'rmse',  # Root Mean Squared Error as the evaluation metric
    'boosting_type': 'gbdt',  # Gradient Boosting Decision Tree
    'num_leaves': 31,  # Maximum number of leaves in one tree
    'learning_rate': 0.05,  # Learning rate
    'feature_fraction': 0.9,  # Feature fraction for each tree
    'bagging_fraction': 0.8,  # Bagging fraction for each tree
    'bagging_freq': 5,  # Frequency for bagging
    'verbose': 0  # Verbosity
}

In [25]:
num_round = 100  # Number of boosting rounds (you can adjust this)
bst = lgb.train(params, train_data, num_round)

# Evaluate the Model
We used the trained model to make predictions on our test data, and evaluated the model's performance using appropriate metrics (e.g., RMSE) and analyzed the results.

In [26]:
y_pred = bst.predict(X_test, num_iteration=bst.best_iteration)

In [27]:
rmse = mean_squared_error(y_test, y_pred, squared=False)
print(f"Root Mean Squared Error (RMSE): {rmse}")

Root Mean Squared Error (RMSE): 0.9460617466426621


The Root Mean Squared Error (RMSE) value of 0.9460617466426621 is a metric used to evaluate the performance of a regression model. In our recommender system, it indicates how well our LightGBM model predicts the ratings or preferences of users for products.

In [28]:
print(y_pred)

[3.96734536 3.90749091 4.51539997 ... 4.81724706 4.42770376 4.35088368]


# Algorithm Merging - Hybridization
Now that we have both a working model of LightGBM model and the content-based model, we will proceed into creatng a hybrid recommender system that combines the predictions from our LightGBM model and the content-based model. We start this section by using our trained LightGBM model to predict ratings for all products in the catalog.

In [29]:
# Predict ratings for all products in the catalog
predicted_ratings = bst.predict(productdf[features])

# Create a DataFrame to store product IDs and their predicted ratings
catalog_predictions = pd.DataFrame({'itemId': productdf['itemId'], 'predicted_rating': predicted_ratings})

# Sort the catalog_predictions DataFrame by predicted ratings in descending order
catalog_predictions = catalog_predictions.sort_values(by='predicted_rating', ascending=False)

In [30]:
catalog_predictions

Unnamed: 0,itemId,predicted_rating
6269,615988593,4.990463
10809,615988593,4.990463
6764,119244473,4.990463
3778,615988593,4.990463
5300,119244473,4.990463
...,...,...
387,108468276,2.331816
1763,340685481,2.331816
1372,173653315,2.331816
2898,452758788,2.331816


Next, we merge the catalog_predictions table (which includes predicted ratings) with the metadata table on the productId into the same dataframe.

In [31]:
# Merge catalog_predictions with metadata on 'productId'
merged_data = pd.merge(catalog_predictions, metadata, on='itemId', how='inner')
merged_data

Unnamed: 0,itemId,predicted_rating,category,name,brandName,url,price,averageRating,totalReviews,retrievedDate
0,615988593,4.990463,beli-harddisk-eksternal,ASUS ROG ZEPHYRUS G GA502DU-R76601T - RYZEN 7 ...,Asus,https://www.lazada.co.id/products/asus-rog-zep...,19495000,5,1,2019-10-02
1,615988593,4.990463,beli-laptop,ASUS ROG ZEPHYRUS G GA502DU-R76601T - RYZEN 7 ...,Asus,https://www.lazada.co.id/products/asus-rog-zep...,19495000,5,1,2019-10-02
2,615988593,4.990463,beli-smart-tv,ASUS ROG ZEPHYRUS G GA502DU-R76601T - RYZEN 7 ...,Asus,https://www.lazada.co.id/products/asus-rog-zep...,19495000,5,1,2019-10-02
3,615988593,4.990463,jual-flash-drives,ASUS ROG ZEPHYRUS G GA502DU-R76601T - RYZEN 7 ...,Asus,https://www.lazada.co.id/products/asus-rog-zep...,19495000,5,1,2019-10-02
4,615988593,4.990463,shop-televisi-digital,ASUS ROG ZEPHYRUS G GA502DU-R76601T - RYZEN 7 ...,Asus,https://www.lazada.co.id/products/asus-rog-zep...,19495000,5,1,2019-10-02
...,...,...,...,...,...,...,...,...,...,...
35605,108468276,2.331816,jual-flash-drives,Flashdisk Toshiba Hayabusa Kapasitas 16GB + Ga...,Toshiba,https://www.lazada.co.id/products/flashdisk-to...,79000,1,1,2019-10-02
35606,627854700,2.391741,beli-harddisk-eksternal,FlashDisk HP 32GB V250W Free Kabel OTG + Lampu...,Flashdisk,https://www.lazada.co.id/products/flashdisk-hp...,34900,1,1,2019-10-02
35607,627854700,2.391741,jual-flash-drives,FlashDisk HP 32GB V250W Free Kabel OTG + Lampu...,Flashdisk,https://www.lazada.co.id/products/flashdisk-hp...,34900,1,1,2019-10-02
35608,627854700,2.351146,beli-harddisk-eksternal,FlashDisk HP 32GB V250W Free Kabel OTG + Lampu...,Flashdisk,https://www.lazada.co.id/products/flashdisk-hp...,34900,1,1,2019-10-02


Then we create a modified get_recommendations function called get_recommendations_hybrid that calculates an index score through a weighted sum approach (0.7 for content-based, 0.3 for LightGBM) that combines both cosine similarity score and predicted ratings of the product of a specified name, and finally, sort the results based on this index score to return as a value.

In [32]:
 def get_recommendations_hybrid(name, cosine_sim=cosine_sim, catalog_predictions=catalog_predictions):
    lgb_weight = 0.1
    cosine_weight = 0.9
    # Get the index of the product that matches the title
    idx = indices[name]
    if isinstance(idx, pd.Series):
        # Handle multiple products with the same name by selecting the first one
        idx = idx.iloc[0]

    # Get the pairwise similarity scores of all products with that product
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the products based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar products
    sim_scores = sim_scores[1:11]

    # Get the product indices
    product_indices = [i[0] for i in sim_scores]
    
    # Initialize empty list to store combined weightage values
    combined_weightage = []

    # Calculate the combined weightage for each product
    for i in range(len(product_indices)):
        combined_score = (lgb_weight * predicted_ratings[i]) + (cosine_weight * sim_scores[i][1])
        combined_weightage.append(combined_score)
        
    # Zip the combined weightage values with their corresponding product indices
    combined_zip = list(zip(combined_weightage, product_indices))

    # Sort the zip list based on combined weightage in descending order
    combined_zip.sort(reverse=True)

    # Unzip the sorted list to separate combined weightage and product indices
    sorted_combined_weightage, sorted_product_indices = zip(*combined_zip)

    # Return the top 10 most similar movies
    return metadata['name'].iloc[product_indices]

In [92]:
get_recommendations_hybrid('Refurbished Lenovo Thinkpad L430 - 4GB - 320GB - 14 inch - Intel Core i5 2nd Gen')

5075     Refurbished Lenovo Thinkpad L430 - 4GB - 320GB...
6333     Refurbished Lenovo Thinkpad L430 - 4GB - 320GB...
9388     Refurbished Lenovo Thinkpad L430 - 4GB - 320GB...
10864    Refurbished Lenovo Thinkpad L430 - 4GB - 320GB...
4008     Refurbished Lenovo Thinkpad Yoga 11e - 4GB - 1...
5073     Refurbished Lenovo Thinkpad Yoga 11e - 4GB - 1...
6331     Refurbished Lenovo Thinkpad Yoga 11e - 4GB - 1...
9386     Refurbished Lenovo Thinkpad Yoga 11e - 4GB - 1...
10862    Refurbished Lenovo Thinkpad Yoga 11e - 4GB - 1...
4059     Promo Lenovo Thinkpad T420S - Laptop - Noteboo...
Name: name, dtype: object

Like I mentioned earlier in this notebook, since people using e-commerce systems search for their desired by keywords that make up a part of a product's name only, we need a function that is able to take in a string input that can be a part of the full product name to get recommendations. We improved upon our get_recommendations_partial to obtain non-duplicating results that disregards the order of the keywords being passed into the function as parameters.

One approach to achieve this is to use Natural Language Processing (NLP) techniques to extract meaningful keywords from the input name and then match these keywords with the product names. We can use libraries like spaCy for this purpose.

In [36]:
import spacy

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

def extract_keywords(text):
    # Tokenize the input text using spaCy
    doc = nlp(text)
    
    # Extract keywords (lemmatized tokens)
    keywords = [token.lemma_ for token in doc if not token.is_stop and token.is_alpha]
    
    return keywords

To remove duplicates from the recommended_products list, you can convert it into a set, and then back into a list.

In [37]:
import re

def get_recommendations_hybrid_partial(name, cosine_sim=cosine_sim, catalog_predictions=catalog_predictions):
    # Extract keywords from the input name
    input_keywords = extract_keywords(name)
    
    if not input_keywords:
        return "No matching products found."

    combined_results = set()

    for _, product_row in metadata.iterrows():
        product_name = product_row['name']
        
        # Extract keywords from the product name
        product_keywords = extract_keywords(product_name)
        
        if any(word in product_keywords for word in input_keywords):
            # Use the hybrid recommendation approach for the matched product
            recommended = get_recommendations_hybrid(product_name, catalog_predictions=catalog_predictions, cosine_sim=cosine_sim)[:10]
            
            # Check if the product name contains all input keywords using regular expressions
            if all(re.search(r'\b{}\b'.format(word), product_name, re.IGNORECASE) for word in input_keywords):
                combined_results.update(recommended)

    return list(combined_results)[:10]


In [94]:
get_recommendations_hybrid_partial('sandisk flash')

['Sandisk USB 3.0 FD Flash Disk Drive FlashDisk Cruzer Glide 32GB 32 GB',
 'SANDISK IXPAND MINI FLASH DRIVE LIGHTNING USB 3.0 OTG - 16GB (FLASHDISK USB , IPHONE & IPAD)',
 'SanDisk Flashdisk Cruzer Blade CZ50 - 8GB ',
 'SANDISK IXPAND MINI FLASH DRIVE LIGHTNING USB 3.0 OTG - 64GB (FLASHDISK USB , IPHONE & IPAD)',
 'Sandisk USB 3.0 Flash Disk FlashDisk Cruzer Glide Flash Drive 16GB 3.0',
 'Sandisk Flashdisk Ultra Flair USB 3.0 CZ73 - 32GB',
 'SanDisk Cruzer Blade 8GB CZ50 Flashdisk',
 'FLASHDISK SANDISK 8GB CRUZER BLADE CZ50',
 'FLASHDISK SANDISK 16 GB 32 GB 64 GB PLUS OTG FD SANDISK 16GB 32GB 64GB MULTI SLOT DUAL USB 3.0 HIGH SPEED FLASH DISK SUPPORT ANDROID OTG FLASH DRIVE',
 'Flashdisk SanDisk Cruzer Spark CZ61 USB [32 GB] #Garansi Resmi#']

And with that, our hybrid algorithm for is complete.

# Spelling Check and Correction
In an attempt to maximize the usability and flexibility of our recommender system to fit realistic user behavior, people using e-commerce systems frequently search for their desired products incorrectly with spelling errors either by mistake or by ignorance, hence we need an input function that is able to account for these user mistakes, and attempt correction on their input before processing it through our recommender system.

To do this, we can use libraries like pyspellchecker in Python to perform spelling correction. This library provides functions to check and correct spelling errors in text.

In [39]:
from spellchecker import SpellChecker

There is no point correcting search inputs into words that do not reflect our product catalogue, ones that will not return any result, so instead of using SpellChecker's internal dictionary, we will create a custom library of words consisting of our product features for our SpellChecker instance in order to minimize false correction errors.

In [None]:
# Create a list of product-related words
product_words = tfidf.get_feature_names_out()

# Initialize a SpellChecker instance
spell = SpellChecker(language=None)

# Set the word bank of the SpellChecker to the custom dictionary
spell.word_frequency.load_words(product_words)

Next we tested the accuracy and effectiveness of our SpellChecker instance.

In [40]:
# Use the SpellChecker to correct text
corrected_text = spell.correction("sandick")

print(corrected_text)  # This will print "sandisk"

corrected_text = spell.correction("lovovo")

print(corrected_text)  # This will print "lenovo"

corrected_text = spell.correction("pilips")

print(corrected_text)  # This will print "philip"

corrected_text = spell.correction("samsang")

print(corrected_text)  # This will print "samsung"

corrected_text = spell.correction("cip")

print(corrected_text)  # This will print "samsung"

sandisk
lenovo
philips
samsung
chip


Users may input apostrophes into their search which may distort our search results, to account for this possibility, we first have to remove all possible text patterns with apostrophe (-'s, -s', -'d, -n't, -'re, -'ll) and revert them back to their basic form before being processed by our recommender system.

In [41]:
def remove_special_phrases(input_string):
    phrases_to_remove = ["'s", "n't", "s'", "'d", "'ll", "'re"]
    for phrase in phrases_to_remove:
        input_string = input_string.replace(phrase, "")
    return input_string

apostrophe_text1 = remove_special_phrases("word's")
print(apostrophe_text1)

apostrophe_text2 = remove_special_phrases("word'd")
print(apostrophe_text2)

apostrophe_text3 = remove_special_phrases("words'")
print(apostrophe_text3)

apostrophe_text4 = remove_special_phrases("word'll")
print(apostrophe_text4)

apostrophe_text5 = remove_special_phrases("wordn't")
print(apostrophe_text5)

apostrophe_text5 = remove_special_phrases("word're")
print(apostrophe_text5)

word
word
word
word
word
word


Next we remove other unwanted characters from the string such as hyphens and remaining apostrophes from the words. One of SpellChecker()'s limitation is that it can only correct one word at a time, so we are unable to pass in an entire phrase or sentence into SpellChecker() to be corrected. We created a function to split our phrases and strings into individual words, to pass into SpellChecker for correction, then join back the returned words into a complete string.

In [64]:
def correct_spelling_in_string(input_string, spell):
    # Split the input string into words
    words = input_string.split()

    # Initialize an empty list to store corrected words
    corrected_words = []

    for word in words:
        # Remove unwanted characters like apostrophes and hyphens
        word = remove_special_phrases(word)
        
        word = ''.join(filter(str.isalnum, word))
        # Correct the spelling of the word
        corrected_word = spell.correction(word)
        # Append the corrected word to the list
        corrected_words.append(corrected_word)

    # Join the corrected words into a string
    corrected_string = ' '.join(corrected_words)

    return corrected_string


In [73]:
# Example usage:
input_phrase = "samsang's leptop lavovo 64gb"
corrected_text = correct_spelling_in_string(input_phrase, spell)
print(corrected_text)

input_phrase = "asuz geming laptip"
corrected_text = correct_spelling_in_string(input_phrase, spell)
print(corrected_text)

input_phrase = "sandiak 64gb"
corrected_text = correct_spelling_in_string(input_phrase, spell)
print(corrected_text)

input_phrase = "seagat extrnal had disskk"
corrected_text = correct_spelling_in_string(input_phrase, spell)
print(corrected_text)

samsung laptop lenovo 64gb
asus gaming laptop
sandisk 64gb
seagate external hard disk


An overall product search input function is created that takes in a string input with possibly incorrectly spelled keywords arranged in a random order, calls the spelling correction function, and pass on the corrected string to our get recommendations function to return the recommended results to the user.

In [43]:
def product_search_input(input_string, spell):
    query = correct_spelling_in_string(input_string, spell)
    results = get_recommendations_hybrid_partial(query)
    return results

In [45]:
product_search_input("lavovo geming", spell)

lavovo
lenovo
geming
gaming
lenovo gaming


['Lenovo IdeaPad 130-15ST (AMD A4-9125 - 4GB DDR4 - 1TB - VGA AMD RADEON R3 - Laptop Gaming - DVDRW - 15.6")',
 'Lenovo Ideapad 330-14AST-38ID AMD A9-9425 (RAM 4GB DDR4 - 1TB HDD - Integrated - Windows 10 Home - 14" HD - 2 year warranty - BLACK)',
 'LAPTOP GAMING LENOVO IP 130-15AST - AMD A4-9125 - VGA AMD RADEON R3 - RAM 4GB - HDD 1TB - DVDRW - 15.6"',
 'Lenovo IP 130 15AST AMD A4 9125 2.3GHZ  RAM 4GB HDD 1TB VGA Radeon R3  15.6 Inch  DVDRW',
 'Lenovo Legion Y530-15ICH-72ID Intel Core i7-8750H (8GB DDR4 + 16GB Intel Optane - 1TB HDD - GTX 1050 4GB - Windows 10 Home - 15,6" FHD IPS - 2 Years Warranty) Gaming Laptop',
 'Lenovo Legion Y530-15ICH-10ID Intel Core i5-8300H (8GB DDR4 + 16GB Intel Optane - 1TB HDD - GTX 1050Ti 4GB - Windows 10 Home - 15,6" FHD IPS - 2 Years Warranty) Gaming Laptop',
 'Lenovo Ideapad 330-14AST-39ID AMD A9-9425 (RAM 4GB DDR4 - 1TB HDD - Integrated - Windows 10 Home - 14" HD - 2 year warranty - GRAY)',
 'Lenovo Ideapad 130-15AST With 8GB DDR4 I AMD A4-9125 I 500

# Graphical User Interface
A simple GUI built using the built-in Tkinter library with a text entry field, a search button, and a text widget to display the results is created. 

When the "Search" button is clicked, it calls our product_search_input function with the user's input and displays the results in the text widget.

In [95]:
# Create the main window
root = tk.Tk()
root.title("Product Search")

# Create a label and an entry widget
label = tk.Label(root, text="Enter your product search:")
label.pack()

entry = tk.Entry(root, width=40)
entry.pack()

# Create a text widget to display results with a larger height
results_text = Text(root, wrap=tk.WORD, height=25, width=120)
results_text.pack()

# Create a scrollbar for the text widget
scrollbar = Scrollbar(root, command=results_text.yview)
scrollbar.pack(side=tk.RIGHT, fill=tk.Y)
results_text.config(yscrollcommand=scrollbar.set)

# Function to handle button click
def search_product():
    input_string = entry.get()
    if input_string:
        results = product_search_input(input_string, spell)  # Call your function here
        results_text.delete(1.0, tk.END)  # Clear previous results
        if results:
            for result in results:
                results_text.insert(tk.END, result + "\n")
        else:
            messagebox.showinfo("No Results", "No matching products found.")
    else:
        messagebox.showinfo("Input Error", "Please enter a product search.")

# Create a search button
search_button = tk.Button(root, text="Search", command=search_product)
search_button.pack()

# Start the Tkinter main loop
root.mainloop()

lenovo gaming
asus gaming laptop
samsung tv


Unfortunately, as there is an absence of actually relevant product data, evaluation methods such as Precision at K and Recall at K could not be performed. Without a comparable truth value in the dataset, Holdout Testing could not be performed as well, except solely for the LightGBM model. 

Despite that, by visual evaluation and informed judgment of the algorithm output in comparison with input string data, the relevance of the recommended items to the input string remains high. That concludes the creation, testing, and demonstration of our SciKit-Learn x LightGBM - Hybrid Recommender System