# Grocery Cart Recommender System

### I found this dataset on Kaggle and wanted to try my hand at a recommender system. The dataset contains information on the types of groceries that are bought together. Each row is a cart, and each item column holds one item from that cart. Based on that, I can build a grocery cart recommendation system.

### I will build my recommendation system using the Surprise library in Python.

### Then I will test it using the standard method and a few custom methods.

In [1]:
import pandas as pd
import numpy as np
from random import sample, seed, randint
from surprise import Reader, Dataset, SVD

In [2]:
groceries = pd.read_csv('groceries-groceries.csv')

In [3]:
groceries.head()

Unnamed: 0,Item(s),Item 1,Item 2,Item 3,Item 4,Item 5,Item 6,Item 7,Item 8,Item 9,...,Item 23,Item 24,Item 25,Item 26,Item 27,Item 28,Item 29,Item 30,Item 31,Item 32
0,4,citrus fruit,semi-finished bread,margarine,ready soups,,,,,,...,,,,,,,,,,
1,3,tropical fruit,yogurt,coffee,,,,,,,...,,,,,,,,,,
2,1,whole milk,,,,,,,,,...,,,,,,,,,,
3,4,pip fruit,yogurt,cream cheese,meat spreads,,,,,,...,,,,,,,,,,
4,4,other vegetables,whole milk,condensed milk,long life bakery product,,,,,,...,,,,,,,,,,


### Briefly exploring the dataset, I want to know the average number of items per cart and the number of carts with more than 10 items

In [4]:
print(np.mean(groceries[groceries.columns[0]]))
print(len(groceries[groceries[groceries.columns[0]]>10]))

4.409456024402644
650


## Extracting the set of all items, and creating an item dictionary

In [5]:
# %%timeit -n 1 -r 2 a = 2
cols = list(groceries.columns)
cols.pop(0)

items = []

for col in cols:
    items += groceries[col].tolist()
    

items = list(set(items))
items = [item for item in items if str(item) != 'nan']
item_dict = {k: v for v, k in enumerate(items)}

In [6]:
item_dict['citrus fruit']

13
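The id printed above comes from the iteration order of a Python `set`, and string hashing is randomized per interpreter process by default, so the same item can receive a different id on another run. A minimal sketch of one way to get stable ids, assuming it is acceptable to number the items alphabetically (the toy list `raw_items` and the names `names`, `item_ids`, and `id_to_name` are illustrative, not from the notebook):

```python
# Toy stand-in for the flattened item columns, including a NaN for an empty cell.
raw_items = ['yogurt', 'whole milk', float('nan'), 'coffee', 'yogurt']

# Keep only real item names, deduplicate, and sort so ids are stable across runs.
names = sorted({x for x in raw_items if isinstance(x, str)})

# Forward and reverse lookups between item name and integer id.
item_ids = {name: i for i, name in enumerate(names)}
id_to_name = {i: name for name, i in item_ids.items()}

print(item_ids)
```

With sorted names, any stored id keeps pointing at the same item when the notebook is rerun.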

## Now I want the dataframe in a format with one column for the cart number, and one column for the item number

### Recommender systems are usually built on items that were given a rating or a score, so we need a column for that. But the customers didn't score any of these items; they either bought them or they didn't. So every item that was bought gets a score of 1.

In [7]:
#%%timeit -n 1 -r 2 a = 2
# Collect one record per purchased item, then build the dataframe once
# (DataFrame.append was removed in pandas 2.0).
records = []

for index, row in groceries.iterrows():
    for col in cols:
        if row[col] in item_dict:
            records.append({'cart': index, 'item': item_dict[row[col]], 'score': 1})
        else:
            # Stop at the first empty cell in this cart
            break

df_long = pd.DataFrame(records, columns=['cart', 'item', 'score'])
df_long.head()

Unnamed: 0,cart,item,score
0,0,13,1
1,0,80,1
2,0,37,1
3,0,59,1
4,1,72,1
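The same wide-to-long reshaping can also be done without a row loop. The sketch below uses `DataFrame.melt` on a small made-up frame laid out like `groceries`; the frame `wide`, the helper column `item_name`, and the result name `df_long_sketch` are illustrative. It assumes every non-empty cell should be kept, which matches the loop above only if the items in each row have no gaps between them.

```python
import pandas as pd

# Toy frame in the same wide layout as `groceries` (values are illustrative).
wide = pd.DataFrame({
    'Item(s)': [2, 1, 3],
    'Item 1': ['whole milk', 'yogurt', 'coffee'],
    'Item 2': ['yogurt', None, 'whole milk'],
    'Item 3': [None, None, 'butter'],
})

# Only the per-item columns, not the item count column.
item_cols = [c for c in wide.columns if c.startswith('Item ')]

# One row per (cart, item) pair; empty cells are dropped.
long = (
    wide[item_cols]
    .rename_axis('cart')
    .reset_index()
    .melt(id_vars='cart', value_name='item_name')
    .dropna(subset=['item_name'])
)

# Map item names to integer ids and add the constant score.
item_ids = {name: i for i, name in enumerate(sorted(long['item_name'].unique()))}
long['item'] = long['item_name'].map(item_ids)
long['score'] = 1

df_long_sketch = (
    long[['cart', 'item', 'score']]
    .sort_values('cart', kind='stable')
    .reset_index(drop=True)
)
print(df_long_sketch)
```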


## Now we set up our reader, giving it the (0,1) scale mentioned above

## Then we do some cross-validation 

In [8]:
from surprise.model_selection.validation import cross_validate
reader = Reader(rating_scale=(0,1))

In [9]:
data = Dataset.load_from_df(df_long[['cart', 'item', 'score']], reader)
svd = SVD()
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=10, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 10 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Fold 6  Fold 7  Fold 8  Fold 9  Fold 10 Mean    Std     
RMSE (testset)    0.0293  0.0293  0.0274  0.0277  0.0267  0.0282  0.0271  0.0282  0.0283  0.0276  0.0280  0.0008  
MAE (testset)     0.0129  0.0124  0.0119  0.0119  0.0116  0.0121  0.0117  0.0121  0.0123  0.0118  0.0121  0.0004  
Fit time          6.21    7.00    6.90    7.73    4.78    9.37    10.78   8.79    9.22    9.07    7.98    1.69    
Test time         0.32    0.08    0.14    0.07    0.05    0.11    0.14    0.12    0.11    0.13    0.13    0.07    


{'fit_time': (6.211331844329834,
  7.00226902961731,
  6.896723031997681,
  7.72976016998291,
  4.782274961471558,
  9.366635799407959,
  10.777338027954102,
  8.78551197052002,
  9.21639108657837,
  9.06538200378418),
 'test_mae': array([0.01287845, 0.01241733, 0.01185041, 0.01190586, 0.01157082,
        0.01213777, 0.01166065, 0.01211242, 0.01229705, 0.01179108]),
 'test_rmse': array([0.02925785, 0.02928431, 0.02740502, 0.02770427, 0.02672813,
        0.02816916, 0.02712351, 0.02822738, 0.02830823, 0.02763876]),
 'test_time': (0.3245689868927002,
  0.08448100090026855,
  0.1384110450744629,
  0.07311606407165527,
  0.0470890998840332,
  0.1108698844909668,
  0.13843607902526855,
  0.11655998229980469,
  0.11436581611633301,
  0.13244390487670898)}

## The cross-validation gave us good scores. Let's double check this with a train-test split

In [10]:
from surprise import accuracy
from surprise.model_selection import train_test_split

# reader = Reader(rating_scale=(0,1))

# data = Dataset.load_from_df(df_long[['cart', 'item', 'score']], reader)

# sample random trainset and testset
# test set is made of 25% of the ratings.
trainset, testset = train_test_split(data, test_size=.25)

# We'll use the famous SVD algorithm.
algo = SVD()

# Train the algorithm on the trainset, and predict ratings for the testset
algo.fit(trainset)
predictions = algo.test(testset)

# Then compute RMSE
accuracy.rmse(predictions)

RMSE: 0.0304


0.03039695708407548

### The RMSE is slightly larger than the CV results, most likely because the training set was smaller (75% of the data vs. 90% in each CV fold)

# Custom tests:

## The standard tests that just give us back RMSE and MAE can tell us whether the model works better or worse than some other method, but in my opinion they are not very informative.
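One way to see why RMSE says little on this data: every observed score is 1, so a predictor that ignores the cart and item entirely can still score very well. The sketch below is a small arithmetic illustration with made-up prediction lists (`always_one` and `near_one` are hypothetical, not model output from this notebook).

```python
import math

def rmse(predictions, targets):
    """Root mean squared error between two equal-length sequences."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))

# Every observed score in the long dataframe is 1.
targets = [1.0] * 1000

# A predictor that always says 1, and one that always says 0.97.
always_one = [1.0] * 1000
near_one = [0.97] * 1000

print(rmse(always_one, targets))  # 0.0
print(rmse(near_one, targets))
```

A constant guess of 0.97 already lands at an RMSE of about 0.03, so a low RMSE here does not by itself show that the model can rank items for a given cart.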

### Here is my suggestion for testing: take a subset of carts that have more than 10 items. For each cart in that subset, remove 1 item before training, then take suggestions for those carts and see how often the missing item is suggested.

### There are 650 carts with more than 10 items, so we will take a random subset of 100 from them
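The hold-out step can be sketched on plain Python data as follows. This is an illustration under assumptions rather than the notebook's code: the function `hold_out_one`, its `seed` parameter, and the toy `carts` dictionary are hypothetical names chosen for the example.

```python
from random import Random

def hold_out_one(carts, test_ids, seed=0):
    """Remove one random item from each test cart.

    Returns the training carts (test carts minus one item each)
    and a dict mapping each test cart id to its held-out item.
    """
    rng = Random(seed)
    train = {}
    held_out = {}
    for cart_id, items in carts.items():
        items = list(items)  # copy so the input is left untouched
        if cart_id in test_ids:
            held_out[cart_id] = items.pop(rng.randrange(len(items)))
        train[cart_id] = items
    return train, held_out

# Toy carts keyed by cart id (values are illustrative).
carts = {
    0: ['whole milk', 'yogurt', 'coffee'],
    1: ['butter', 'rolls/buns'],
    2: ['napkins', 'chocolate', 'mustard', 'dessert'],
}

train, held_out = hold_out_one(carts, test_ids={0, 2}, seed=90210)
print(held_out)
```

Using a dedicated `Random` instance keeps the choice of held-out items reproducible without touching the global random state.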

In [11]:
big_carts = groceries[groceries[groceries.columns[0]]>10]

seed(90210)
big_carts_indexes = list(big_carts.index)

## This randomly selects 100 of the 650 carts that have more than 10 items
test_cart_indexes = sample(big_carts_indexes, 100)

In [12]:
cols = list(groceries.columns)
cols.pop(0)

items = []

for col in cols:
    items += groceries[col].tolist()
    

items = list(set(items))
items = [item for item in items if str(item) != 'nan']
item_dict = {k: v for v, k in enumerate(items)}
item_dict_rev = {v: k for k, v in item_dict.items()}

# Collect records in a list and build the dataframe once at the end
# (DataFrame.append was removed in pandas 2.0).
records = []

items_removed = {}

## Here we are creating the same dataset to feed to the SVD as before, but for all the "test carts" selected,
## we are removing one item and storing it in "items_removed"

test_cart_set = set(test_cart_indexes)

for index, row in groceries.iterrows():
    if index in test_cart_set:
        # Gather this cart's items, then hold one out at random
        test_items = [row[col] for col in cols if row[col] in item_dict]
        item_to_remove = randint(0, len(test_items) - 1)
        items_removed[index] = item_dict[test_items.pop(item_to_remove)]
        for item in test_items:
            records.append({'cart': index, 'item': item_dict[item], 'score': 1})
    else:
        for col in cols:
            if row[col] in item_dict:
                records.append({'cart': index, 'item': item_dict[row[col]], 'score': 1})
            else:
                # Stop at the first empty cell in this cart
                break

df_long = pd.DataFrame(records, columns=['cart', 'item', 'score'])

In [13]:
## Setting up and training our SVD as before

reader = Reader(rating_scale=(0,1))

data = Dataset.load_from_df(df_long[['cart', 'item', 'score']], reader)
svd = SVD()

trainset = data.build_full_trainset()
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x110e1f1d0>

In [14]:
## Just looking at some data, making sure things are working

for item in df_long[df_long['cart']==test_cart_indexes[0]]['item']:
    print(item_dict_rev[item])
list(df_long[df_long['cart']==test_cart_indexes[0]]['item'])

pip fruit
whole milk
dessert
butter milk
yogurt
cream cheese
tidbits
frozen potato products
rolls/buns
white bread
mustard
fruit/vegetable juice
salty snack
long life bakery product
chocolate
specialty bar
napkins


[49, 128, 25, 145, 82, 154, 133, 165, 92, 111, 73, 52, 63, 62, 156, 33, 30]

In [15]:
svd.predict(test_cart_indexes[0], items_removed[test_cart_indexes[0]]).est

0.9826785613427292

In [16]:
# suggs = pd.DataFrame(columns=['index', 'item', 'score'])

# cart_items = list(df_long[df_long['cart']==test_cart_indexes[0]]['item'])

# for ind in item_dict_rev.keys():
#     if ind not in cart_items:
#         est = svd.predict(test_cart_indexes[0], ind).est
#         item = item_dict_rev[ind]
#         suggs = suggs.append({'index': ind, 'item': item, 'score': est}, ignore_index=True)
        
# suggs = suggs.sort_values('score', ascending=False)
# suggs[:5]['index']

## Now we go through all the test carts, calculate the error for the missing item (1 - its suggestion estimate), and whether or not the missing item showed up in the top 5, 10, 20, and 50 suggested items

In [17]:
errors = []
top5 = []
top10 = []
top20 = []
top50 = []

for cart in test_cart_indexes:
    
    suggs = pd.DataFrame(columns=['index', 'item', 'score'])
    
    cart_items = list(df_long[df_long['cart']==cart]['item'])
    
    est = svd.predict(test_cart_indexes[0], items_removed[cart]).est
    error = 1.0-est
    errors.append(error)
    
    for ind in item_dict_rev.keys():
        if ind not in cart_items:
            est = svd.predict(test_cart_indexes[0], ind).est
            item = item_dict_rev[ind]
            suggs = suggs.append({'index': ind, 'item': item, 'score': est}, ignore_index=True)
        
    suggs = suggs.sort_values('score', ascending=False)
    
    if items_removed[cart] in suggs[:5]['index']:
        top5.append(1)
    else:
        top5.append(0)
    
    if items_removed[cart] in suggs[:10]['index']:
        top10.append(1)
    else:
        top10.append(0)
        
    if items_removed[cart] in suggs[:20]['index']:
        top20.append(1)
    else:
        top20.append(0)
        
    if items_removed[cart] in suggs[:50]['index']:
        top50.append(1)
    else:
        top50.append(0)

rmse = np.sqrt(np.mean([error*error for error in errors]))
print(rmse)
print(np.mean(top5), np.mean(top10), np.mean(top20), np.mean(top50))

0.03978365413967616
0.04 0.1 0.15 0.3
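Two details of the loop above are worth rechecking before reading much into these numbers. First, both `svd.predict` calls pass `test_cart_indexes[0]` rather than `cart`, so every estimate is made for the first test cart. Second, `x in some_series` in pandas tests the row labels, not the values, so the top-N checks compare the held-out item id against row labels of `suggs`. The sketch below illustrates the second point on a made-up suggestions frame; the helper `hit_at_k` is a hypothetical name, and the hit rates printed above would likely change if the loop were adjusted this way.

```python
import pandas as pd

def hit_at_k(suggs, held_out_item, k):
    """True if held_out_item is among the k highest-scoring item ids."""
    top_k = suggs.sort_values('score', ascending=False).head(k)
    # Compare against the column's values, not the row labels.
    return bool(held_out_item in top_k['index'].values)

# Toy suggestions: item ids 7, 3, 9 with made-up scores.
suggs = pd.DataFrame({'index': [7, 3, 9], 'score': [0.2, 0.9, 0.5]})

# After sorting, the top 2 rows carry row labels 1 and 2 and item ids 3 and 9.
top2 = suggs.sort_values('score', ascending=False)[:2]
label_check = 9 in top2['index']         # tests row labels (1 and 2)
value_check = 9 in top2['index'].values  # tests item ids (3 and 9)

print(label_check, value_check)
print(hit_at_k(suggs, 9, 2), hit_at_k(suggs, 7, 2))
```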


## Our way of calculating the RMSE gives a similar, but slightly larger, value

## But the other measures are much more informative

## The missing item from the test carts showed up in the top 5 recommended items only 4% of the time, and in the top 50 less than 1/3 of the time. This isn't quite as good as the low RMSE errors led me to believe, but maybe this is par for the course in recommender systems