### Building a search system to find relevant items in a recommender system

The goal of this notebook is to show how to build a search system that finds products related to a search query. Such a system is a useful building block for recommender systems on e-commerce websites or social media platforms.

It involves two stages:
1. Finding items that contain the search terms
2. Assigning a score to these items and picking the best matches
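
The two stages can be sketched on a toy catalog as follows. This is a minimal illustration rather than the notebook's final method: the `toy_items` dictionary, the `search` helper, and the overlap-based score are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical toy catalog: item id -> item name
toy_items = {1: "Google Phone", 2: "Vareebadd Phone", 3: "ThinkPad Laptop"}

# Inverted index: word -> ids of the items whose name contains that word
index = defaultdict(set)
for item_id, name in toy_items.items():
    for word in name.lower().split():
        index[word].add(item_id)


def search(query):
    terms = query.lower().split()
    # Stage 1: collect every item that contains at least one query term
    candidates = set()
    for term in terms:
        candidates |= index.get(term, set())
    # Stage 2: score each candidate by the fraction of query terms it contains
    scored = []
    for item_id in candidates:
        matched = sum(1 for term in terms if item_id in index.get(term, set()))
        scored.append((matched / len(terms), item_id))
    # Highest score first; ties are broken by the lower item id
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


print(search("google phone"))
```

Here "Google Phone" matches both query terms and ranks above "Vareebadd Phone", which matches only one. The rest of the notebook replaces this simple overlap score with TF-IDF weights.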

In [97]:
import numpy as np
import pandas as pd
import sklearn

from collections import defaultdict

Data source 1: https://www.kaggle.com/datasets/knightbearr/sales-product-data

Data source 2: https://www.kaggle.com/datasets/heeraldedhia/groceries-dataset

In [137]:
product_df = pd.read_csv('./item_data.csv')
product_df.head()

Unnamed: 0,id,item
0,1,iPhone
1,2,Lightning Charging Cable
2,3,Wired Headphones
3,4,27in FHD Monitor
4,5,AAA Batteries (4-pack)


In [144]:
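# Initialize the product catalog as an inverted index: word -> list of item ids
product_catalog = defaultdict(list)


# Index one row of product_df under every word of its item name
def add_to_product_catalog(data):
    # Skip rows with a missing item name (NaN), which would fail on .lower()
    if pd.isna(data['item']):
        return
    item_id = data['id']
    # split() without arguments also drops empty tokens caused by stray spaces
    item_name = str(data['item']).lower().split()

    for word in item_name:
        product_catalog[word].append(item_id)


_ = product_df.apply(add_to_product_catalog, axis=1)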

In [145]:
product_catalog

defaultdict(list,
            {'iphone': [1],
             'lightning': [2],
             'charging': [2, 7],
             'cable': [2, 7],
             'wired': [3],
             'headphones': [3, 8, 9],
             '27in': [4, 6],
             'fhd': [4],
             'monitor': [4, 6, 15, 16],
             'aaa': [5],
             'batteries': [5, 13],
             '(4-pack)': [5, 13],
             '4k': [6],
             'gaming': [6],
             'usb-c': [7],
             'bose': [8],
             'soundsport': [8],
             'apple': [9],
             'airpods': [9],
             'macbook': [10],
             'pro': [10],
             'laptop': [10, 17],
             'flatscreen': [11],
             'tv': [11],
             'vareebadd': [12],
             'phone': [12, 14],
             'aa': [13],
             'google': [14],
             '20in': [15],
             '34in': [16],
             'ultrawide': [16],
             'thinkpad': [17],
             'lg': [18, 19],
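
Because the names are split on spaces only, punctuation stays attached to the keys: the catalog above stores `'(4-pack)'`, so a search for `4-pack` would miss it. One possible fix, sketched here under the assumption that a simple regular-expression tokenizer is good enough, is to normalize item names and queries with the same function. The `tokenize` and `lookup` helpers and the three-item `sample_items` dictionary are illustrative only.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Lowercase, then keep runs of letters/digits, allowing inner hyphens ("usb-c", "4-pack")
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


# Three items copied from the product list, keyed by their ids
sample_items = {
    5: "AAA Batteries (4-pack)",
    7: "USB-C Charging Cable",
    13: "AA Batteries (4-pack)",
}

catalog = defaultdict(list)
for item_id, name in sample_items.items():
    for word in tokenize(name):
        catalog[word].append(item_id)


def lookup(query):
    # Return the ids of items that contain every query term (AND semantics)
    id_sets = [set(catalog.get(term, [])) for term in tokenize(query)]
    return sorted(set.intersection(*id_sets)) if id_sets else []


print(lookup("4-pack batteries"))
print(lookup("USB-C"))
```

Using one tokenizer for both indexing and querying keeps the two sides consistent, which matters more than the exact pattern chosen.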
   

In [107]:
for idx, row in product_df.iterrows():
    print(row['item'])

iPhone
Lightning Charging Cable
Wired Headphones
27in FHD Monitor
AAA Batteries (4-pack)
27in 4K Gaming Monitor
USB-C Charging Cable
Bose SoundSport Headphones
Apple Airpods Headphones
Macbook Pro Laptop
Flatscreen TV
Vareebadd Phone
AA Batteries (4-pack)
Google Phone
20in Monitor
34in Ultrawide Monitor
ThinkPad Laptop
LG Dryer
LG Washing Machine
nan
Product
tropical fruit
whole milk
pip fruit
other vegetables
rolls
pot plants
citrus fruit
beef
frankfurter
chicken
butter
fruit juice
packaged fruit
chocolate
specialty bar
butter milk
bottled water
yogurt
sausage
brown bread
hamburger meat
root vegetables
pork
pastry
canned beer
berries
coffee
misc. beverages
ham
turkey
curd cheese
red wine
frozen potato products
flour
sugar
frozen meals
herbs
soda
detergent
grapes
processed cheese
fish
sparkling wine
newspapers
curd
pasta
popcorn
finished products
beverages
bottled beer
dessert
dog food
specialty chocolate
condensed milk
cleaner
white wine
meat
ice cream
hard cheese
cream cheese 
liquor

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
     'This is the first document.',
     'This document is the second document.',
     'And this is the third one.',
     'Is this the first document?',
]
vectorizer = TfidfVectorizer()
score_base = vectorizer.fit_transform(corpus)

In [28]:
df_tfidf_sklearn = pd.DataFrame(score_base.toarray(),columns=vectorizer.get_feature_names_out())
df_tfidf_sklearn

Unnamed: 0,and,document,first,is,one,second,the,third,this
0,0.0,0.469791,0.580286,0.384085,0.0,0.0,0.384085,0.0,0.384085
1,0.0,0.687624,0.0,0.281089,0.0,0.538648,0.281089,0.0,0.281089
2,0.511849,0.0,0.0,0.267104,0.511849,0.0,0.267104,0.511849,0.267104
3,0.0,0.469791,0.580286,0.384085,0.0,0.0,0.384085,0.0,0.384085
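
To see where these numbers come from, the first row can be reproduced by hand. The sketch below assumes the default `TfidfVectorizer` settings: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalization of each row, and lowercased tokens of two or more word characters. The helper names `tokenize` and `tfidf_row` are made up for the example.

```python
import math
import re
from collections import Counter

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]


def tokenize(doc):
    # Lowercased tokens of at least two word characters
    return re.findall(r"\b\w\w+\b", doc.lower())


docs = [tokenize(doc) for doc in corpus]
n_docs = len(docs)

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}


def tfidf_row(tokens):
    # Weight = term count * idf, then scale the row to unit L2 length
    counts = Counter(tokens)
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


row0 = tfidf_row(docs[0])
print({term: round(weight, 6) for term, weight in sorted(row0.items())})
```

Terms that occur in every document ("this", "is", "the") get the minimum IDF of 1, while rarer terms such as "first" receive a larger weight, which is why "first" has the highest value in row 0.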


In [None]:
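# Convert to a nested dict {term: {document_index: score}} for direct lookups.
# Run this cell only once: after it, the name refers to a dict, not a DataFrame.
df_tfidf_sklearn = df_tfidf_sklearn.to_dict()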

In [17]:
# One [score, document_index] pair per document, with all scores starting at 0
result = [[0, idx] for idx in range(len(corpus))]
result

[[0, 0], [0, 1], [0, 2], [0, 3]]

In [30]:
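def find(search_term):
    # Start from fresh [score, document_index] pairs on every call; accumulating
    # into a shared global list would carry scores over between searches
    scores = [[0.0, idx] for idx in range(len(corpus))]
    # Lowercase the query to match the vectorizer's lowercased vocabulary
    for term in search_term.lower().split():
        if term in df_tfidf_sklearn:
            for pair in scores:
                # Look up by document index, not by position in the list
                pair[0] += df_tfidf_sklearn[term][pair[1]]
    return scores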

In [31]:
res = find("first is second")
res

[[2.5090279528829376, 3],
 [2.2197584135232944, 0],
 [0.534207575284336, 2],
 [1.9287421291985014, 1]]

In [32]:
res.sort(reverse=True)
res

[[2.5090279528829376, 3],
 [2.2197584135232944, 0],
 [1.9287421291985014, 1],
 [0.534207575284336, 2]]

Note: documents 0 and 3 have identical TF-IDF rows, so a single search from a fresh state gives them equal scores. The values shown above were accumulated over several calls to `find` on the same shared list, so they illustrate the ranking step rather than the exact scores for this one query.

Improvements

1. Add a related-item category.
2. Remove stop words.
3. Use a better scoring function, such as cosine similarity.
4. Preprocess numbers into words.
5. Handle general search words such as "phone".
6. Apply lemmatization and stemming to handle singular and plural words.
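
As an illustration of item 3, the query and the item names can be embedded in the same TF-IDF space and compared by cosine similarity. This is a sketch, not a tuned implementation: the five product names are taken from the catalog above, and the `rank` helper and its `top_k` parameter are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

items = [
    "iPhone",
    "Google Phone",
    "Vareebadd Phone",
    "ThinkPad Laptop",
    "Macbook Pro Laptop",
]

# Fit the vocabulary and IDF weights on the item names
vectorizer = TfidfVectorizer()
item_matrix = vectorizer.fit_transform(items)


def rank(query, top_k=3):
    # Project the query into the same TF-IDF space as the items
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, item_matrix).ravel()
    # Indices of the highest-scoring items, best first
    order = scores.argsort()[::-1][:top_k]
    # Drop items that share no term with the query
    return [(items[i], float(scores[i])) for i in order if scores[i] > 0]


print(rank("google phone"))
```

Unlike a raw sum of TF-IDF weights, cosine similarity is normalized by vector length, so long item names do not gain an advantage simply by containing more words.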