### Building a search system to find relevant items in a recommender system

The goal of this notebook is to show how to build a search system that finds products related to a search query. This is useful when building recommender systems for e-commerce websites or social media.

It involves two stages:
1. Finding items that contain the search terms
2. Assigning a score to these items and picking the best matches

In [97]:
import numpy as np
import pandas as pd
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer

from collections import defaultdict

Data source 1: https://www.kaggle.com/datasets/knightbearr/sales-product-data

Data source 2: https://www.kaggle.com/datasets/heeraldedhia/groceries-dataset

In [149]:
product_df = pd.read_csv('./item_data.csv')
product_df.head(10)

| | id | item |
|---|---|---|
| 0 | 1 | iPhone |
| 1 | 2 | Lightning Charging Cable |
| 2 | 3 | Wired Headphones |
| 3 | 4 | 27in FHD Monitor |
| 4 | 5 | AAA Batteries (4-pack) |
| 5 | 6 | 27in 4K Gaming Monitor |
| 6 | 7 | USB-C Charging Cable |
| 7 | 8 | Bose SoundSport Headphones |
| 8 | 9 | Apple Airpods Headphones |
| 9 | 10 | Macbook Pro Laptop |


In [147]:
product_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      193 non-null    int64 
 1   item    193 non-null    object
dtypes: int64(1), object(1)
memory usage: 3.1+ KB


We are going to build two indices: one that maps ids to items, and one that maps words to the ids of the items containing them.

First, build a dictionary mapping ids to items, so that looking up an item by id takes O(1) time on average.

In [173]:
product_item_store = pd.Series(product_df.item.values,index=product_df.id).to_dict()
product_item_store

{1: 'iPhone',
 2: 'Lightning Charging Cable',
 3: 'Wired Headphones',
 4: '27in FHD Monitor',
 5: 'AAA Batteries (4-pack)',
 ...
 193: 'cling bags'}
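
To illustrate why the dictionary helps, here is a minimal sketch on toy data. The `rows` list and the `find_by_scan` helper are hypothetical stand-ins, not part of the notebook; they only contrast a linear scan with a dictionary lookup.

```python
# Toy rows standing in for the first three rows of product_df (assumed data).
rows = [(1, "iPhone"), (2, "Lightning Charging Cable"), (3, "Wired Headphones")]


def find_by_scan(item_id):
    """O(n) lookup: walk the rows until the id matches."""
    for row_id, name in rows:
        if row_id == item_id:
            return name
    return None


# O(1) average-case lookup: build the id -> item dictionary once, then index it.
item_store = dict(rows)

print(find_by_scan(3))                      # Wired Headphones
print(item_store[3])                        # Wired Headphones
print(item_store.get(999, "<unknown id>"))  # <unknown id>
```

Using `.get` with a default avoids a `KeyError` when an id is missing from the store.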

Next, build the product catalog: an inverted index that maps each word to the ids of the items whose names contain it. This is the reverse of the mapping built above.

In [150]:
# initialize the product catalog to a dictionary of lists
product_catalog = defaultdict(list)


# add a single row of product_df to the product_catalog
def add_to_product_catalog(data):
    """
    Add the id of one product_df row under every word of its item name.
    
    Expectation -> product_catalog = {word: [item_ids_containing_word]}
    """
    
    item_id = data['id']
    # lowercase so lookups are case-insensitive; punctuation stays attached to words
    item_name = data['item'].lower().split(" ")
    
    for word in item_name:
        product_catalog[word].append(item_id)
    
# apply the function to every row; the return value is not needed
_ = product_df.apply(add_to_product_catalog, axis=1)

product_catalog

defaultdict(list,
            {'iphone': [1],
             'lightning': [2],
             'charging': [2, 7],
             'cable': [2, 7],
             'wired': [3],
             'headphones': [3, 8, 9],
             '27in': [4, 6],
             'fhd': [4],
             'monitor': [4, 6, 15, 16],
             'aaa': [5],
             'batteries': [5, 13],
             '(4-pack)': [5, 13],
             '4k': [6],
             'gaming': [6],
             'usb-c': [7],
             'bose': [8],
             'soundsport': [8],
             'apple': [9],
             'airpods': [9],
             'macbook': [10],
             'pro': [10],
             'laptop': [10, 17],
             'flatscreen': [11],
             'tv': [11],
             'vareebadd': [12],
             'phone': [12, 14],
             'aa': [13],
             'google': [14],
             '20in': [15],
             '34in': [16],
             'ultrawide': [16],
             'thinkpad': [17],
             'lg': [18, 19],
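
Note that splitting on spaces leaves punctuation attached, which is why `'(4-pack)'` appears as a key above. The sketch below shows one possible alternative tokenizer that strips parentheses but keeps hyphenated words. The regular expression, the `tokenize` and `build_catalog` helpers, and the toy store are assumptions for illustration, not the notebook's method.

```python
import re
from collections import defaultdict

# Assumed token rule: runs of letters/digits, optionally joined by hyphens.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def tokenize(text):
    """Lowercase the text and return its word tokens without punctuation."""
    return TOKEN_PATTERN.findall(text.lower())


def build_catalog(item_store):
    """Build {word: [item_ids_containing_word]} from an {id: name} dictionary."""
    catalog = defaultdict(list)
    for item_id, name in item_store.items():
        # dict.fromkeys removes repeated words while keeping their order,
        # so an id is added at most once per word.
        for word in dict.fromkeys(tokenize(name)):
            catalog[word].append(item_id)
    return catalog


# Toy store modelled on the output above (assumed data).
toy_store = {
    5: "AAA Batteries (4-pack)",
    7: "USB-C Charging Cable",
    13: "AA Batteries (4-pack)",
}
catalog = build_catalog(toy_store)

print(tokenize("AAA Batteries (4-pack)"))  # ['aaa', 'batteries', '4-pack']
print(catalog["4-pack"])                   # [5, 13]
print(catalog["usb-c"])                    # [7]
```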
   

Finding a list of similar items

In [158]:
# define the find_similar_items function

def find_similar_items(search_string, product_catalog):
    """
    1. Lowercase the search string and split it on whitespace
    2. For each word in the search string, find the items that contain that word
    3. Add each word's list of item ids to `all_occurrences`
    4. Build `similar_items_id`, the union of the lists in `all_occurrences`
    5. Return the item names for the ids in `similar_items_id`
    """

    all_occurrences = []
    # the catalog keys are lowercase, so lowercase the query to match them
    search_terms = search_string.lower().split()
    for term in search_terms:
        if term in product_catalog:
            all_occurrences.append(product_catalog[term])
    print(f'all_occurrences is {all_occurrences}')
    
    similar_items_id = set().union(*all_occurrences)
    print(f'similar_items_id is {similar_items_id}')
    
    # sort the ids so the order of the result is deterministic
    similar_items = [product_item_store[item_id] for item_id in sorted(similar_items_id)]
    print(f'similar_items is {similar_items}')
    
    return similar_items
        

similar_items = find_similar_items("iphone monitor thinkpad", product_catalog)

all_occurrences is [[1], [4, 6, 15, 16], [17]]
similar_items_id is {1, 4, 6, 15, 16, 17}
similar_items is ['iPhone', '27in FHD Monitor', '27in 4K Gaming Monitor', '20in Monitor', '34in Ultrawide Monitor', 'ThinkPad Laptop']
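
The lookup above takes the union of the id lists, so an item matches if it contains any of the search words. As a hedged sketch, the hypothetical `match_ids` helper below contrasts that with an intersection, which keeps only the items containing every known search word. The toy catalog is a small excerpt modelled on the output above.

```python
# Small excerpt of a word -> item ids catalog (assumed data).
toy_catalog = {"27in": [4, 6], "monitor": [4, 6, 15, 16], "gaming": [6]}


def match_ids(query, catalog, require_all=False):
    """Return sorted item ids matching any word (default) or all known words."""
    # Words missing from the catalog are skipped in both modes.
    postings = [set(catalog[w]) for w in query.lower().split() if w in catalog]
    if not postings:
        return []
    combine = set.intersection if require_all else set.union
    return sorted(combine(*postings))


print(match_ids("27in monitor", toy_catalog))                    # [4, 6, 15, 16]
print(match_ids("27in monitor", toy_catalog, require_all=True))  # [4, 6]
```

The union favours recall, which suits a first retrieval stage that is followed by ranking; the intersection is stricter and may return nothing for long queries.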


Now we rank the matches in order of relevance using term frequency-inverse document frequency (TF-IDF).

Each item in `similar_items` is called a document, and the corpus is the list of all documents.

In [165]:
# corpus = [
#      'This is the first document.',
#      'This document is the second document.',
#      'And this is the third one.',
#      'Is this the first document?',
# ]
corpus = similar_items
vectorizer = TfidfVectorizer()
tfidf_scores = vectorizer.fit_transform(corpus)

In [166]:
tfidf_array = pd.DataFrame(tfidf_scores.toarray(),columns=vectorizer.get_feature_names_out())
tfidf_array

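| | 20in | 27in | 34in | 4k | fhd | gaming | iphone | laptop | monitor | thinkpad | ultrawide |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.576336 | 0.0 | 0.0 | 0.702836 | 0.0 | 0.0 | 0.0 | 0.416964 | 0.0 | 0.0 |
| 2 | 0.0 | 0.471523 | 0.0 | 0.575019 | 0.0 | 0.575019 | 0.0 | 0.0 | 0.341135 | 0.0 | 0.0 |
| 3 | 0.86004 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.510227 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.652057 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.386839 | 0.0 | 0.652057 |
| 5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.707107 | 0.0 | 0.707107 | 0.0 |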
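
To make the numbers in this table less opaque, the sketch below recomputes them in plain Python. It assumes the `TfidfVectorizer` defaults: lowercasing, tokens of at least two word characters, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. The helper names `tokenize` and `tfidf_vectors` are hypothetical.

```python
import math
import re
from collections import Counter

corpus = ['iPhone', '27in FHD Monitor', '27in 4K Gaming Monitor',
          '20in Monitor', '34in Ultrawide Monitor', 'ThinkPad Laptop']


def tokenize(text):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf_vectors(documents):
    """Return one {term: weight} dictionary per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1
           for term, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        # L2-normalise so that every non-empty row has unit length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


vectors = tfidf_vectors(corpus)
print(round(vectors[3]['20in'], 5))     # 0.86004, as in row 3 of the table
print(round(vectors[3]['monitor'], 6))  # 0.510227
```

"monitor" appears in four of the six documents, so its IDF is lower than that of "20in", which appears in only one. That is why "monitor" contributes less to each row than the rarer words do.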


In [171]:
# convert to a nested dictionary, {term: {document_index: score}}, for fast lookup
tfidf_dict = tfidf_array.to_dict()

In [174]:
# rank_items adds up, for each document, the TF-IDF scores of every word in the search term.
# The scored results are sorted in a later cell.

def rank_items(search_term, corpus):
    # the TF-IDF vocabulary is lowercase, so lowercase the query to match it
    search_terms = search_term.lower().split()
    # one [score, document] pair per document, in corpus order
    result = [[0.0, document] for document in corpus]
        
    for term in search_terms:
        if term in tfidf_dict:
            for idx in range(len(result)):
                result[idx][0] += tfidf_dict[term][idx]
    return result

In [177]:
res = rank_items("iphone monitor thinkpad", corpus)
res

[[1.0, 'iPhone'],
 [0.4169638817420929, '27in FHD Monitor'],
 [0.3411349947282554, '27in 4K Gaming Monitor'],
 [0.5102266013323625, '20in Monitor'],
 [0.3868386046677293, '34in Ultrawide Monitor'],
 [0.7071067811865476, 'ThinkPad Laptop']]

In [178]:
res.sort(reverse=True)
res

[[1.0, 'iPhone'],
 [0.7071067811865476, 'ThinkPad Laptop'],
 [0.5102266013323625, '20in Monitor'],
 [0.4169638817420929, '27in FHD Monitor'],
 [0.3868386046677293, '34in Ultrawide Monitor'],
 [0.3411349947282554, '27in 4K Gaming Monitor']]

In [179]:
print("order to return items:")
print([item[1] for item in res])

order to return items:
['iPhone', 'ThinkPad Laptop', '20in Monitor', '27in FHD Monitor', '34in Ultrawide Monitor', '27in 4K Gaming Monitor']
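
Calling `sort(reverse=True)` on a list of `[score, name]` pairs compares the names whenever two scores tie. A possible variation, sketched below, sorts on the score only and uses `heapq.nlargest` when only the top few results are needed. The `scored` list copies the scores shown above; the cut-off of three is an arbitrary choice for illustration.

```python
import heapq

# Scores copied from the rank_items output above.
scored = [[1.0, 'iPhone'],
          [0.4169638817420929, '27in FHD Monitor'],
          [0.3411349947282554, '27in 4K Gaming Monitor'],
          [0.5102266013323625, '20in Monitor'],
          [0.3868386046677293, '34in Ultrawide Monitor'],
          [0.7071067811865476, 'ThinkPad Laptop']]

# Sort on the score only; sorted() returns a new list and leaves `scored` intact.
ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)

# Take just the three best matches without sorting the whole list.
top_3 = heapq.nlargest(3, scored, key=lambda pair: pair[0])

print([name for _, name in top_3])
# ['iPhone', 'ThinkPad Laptop', '20in Monitor']
```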


Possible improvements

1. Add a related-item category
2. Remove stopwords
3. Use a better scoring method, such as cosine similarity
4. Preprocess numbers into words
5. Handle general search words such as "phone", which some people search for
6. Apply lemmatization and stemming to handle singular and plural words
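
As a rough sketch of improvements 2 and 3, the code below drops stopwords and ranks documents by cosine similarity to the query. It uses plain word counts rather than TF-IDF weights to stay short. The `STOPWORDS` set and the helper names are assumptions for illustration only.

```python
import math
from collections import Counter

# Tiny, hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "the", "for", "with"}


def to_vector(text):
    """Bag-of-words count vector with stopwords removed."""
    return Counter(w for w in text.lower().split() if w not in STOPWORDS)


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_cosine(query, documents):
    """Return (score, document) pairs, best match first."""
    query_vector = to_vector(query)
    scored = [(cosine_similarity(query_vector, to_vector(doc)), doc)
              for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


docs = ['27in FHD Monitor', '27in 4K Gaming Monitor', '20in Monitor', 'ThinkPad Laptop']
ranked = rank_by_cosine("the 27in monitor", docs)
for score, name in ranked:
    print(f"{score:.3f}  {name}")
```

Unlike summing the per-term scores, cosine similarity accounts for the length of both the query and the document, so a document that matches more of the query words tends to rank higher.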