In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
color = sns.color_palette()

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics.pairwise import cosine_similarity
# from sklearn.linear_model import LogisticRegression
# from sklearn.linear_model import Ridge, Lasso
# from sklearn.model_selection import train_test_split

from sklearn.metrics import f1_score
from sklearn.metrics import jaccard_score
from sklearn.metrics import accuracy_score

import random
import statistics
import time
#from ast import literal_eval

import pickle

import warnings
warnings.filterwarnings("ignore")

np.random.seed(12345)

In [2]:
aisles = pd.read_csv("aisles.csv")
departments = pd.read_csv("departments.csv")
orders = pd.read_csv("orders.csv")
prior = pd.read_csv("order_products__prior.csv")
train = pd.read_csv("order_products__train.csv")
products = pd.read_csv("products.csv")

# Data Preparation and Modeling

Instacart provided order data split into 'prior', 'train' and 'test' sets. The goal of the competition was to build a model trained on the 'train' set, using the 'prior' set as purchase history, in order to predict the 'test' orders. Therefore, no information on the contents of the 'test' orders was provided, in order to keep the competition fair and avoid any 'data peeking' for better results.

As a result, we do not have a usable 'test' set to track progress. We will therefore remove the 'test' order_ids, re-assign the last recorded order of each user as 'test', and treat all remaining orders as 'train' (purchase histories). This lets us measure the accuracy of each approach and make adjustments as we go.

In [3]:
orders['eval_set'].value_counts()

prior    3214874
train     131209
test       75000
Name: eval_set, dtype: int64

In [4]:
# drop all orders from 'test' set
orders = orders.drop(orders[orders.eval_set == 'test'].index)

# re-assign all last orders per customer to be the 'test'
orders.loc[orders.groupby('user_id')['eval_set'].tail(1).index, 'eval_set'] = 'test'

# all remaining ones assign to be 'train' set
orders.loc[orders['eval_set'] == 'prior', 'eval_set'] = 'train'

In [5]:
# sanity check
orders['eval_set'].value_counts()

train    3139874
test      206209
Name: eval_set, dtype: int64

As you can see above, the orders data frame now contains only train and test orders. As a sanity check: since we used the last recorded order of each customer as 'test', we now have exactly as many test orders as there are users in the data set.
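The re-labelling and the sanity check can be reproduced on a toy table. This is a minimal sketch: `toy_orders` and its values are made up for illustration, and the explicit sort on `order_number` is an added assumption that makes sure `tail(1)` really picks each user's most recent order.

```python
import pandas as pd

# Toy stand-in for the orders table (values are invented for illustration)
toy_orders = pd.DataFrame({
    "order_id":     [10, 11, 12, 20, 21, 30],
    "user_id":      [1, 1, 1, 2, 2, 3],
    "order_number": [1, 2, 3, 1, 2, 1],
    "eval_set":     ["prior", "prior", "train", "prior", "prior", "prior"],
})

# Sort explicitly so that tail(1) is each user's latest order
toy_orders = toy_orders.sort_values(["user_id", "order_number"])

# Last order per user becomes 'test', everything else becomes 'train'
last_idx = toy_orders.groupby("user_id").tail(1).index
toy_orders.loc[last_idx, "eval_set"] = "test"
toy_orders.loc[toy_orders["eval_set"] != "test", "eval_set"] = "train"

# Sanity check: exactly one test order per user
n_test = int((toy_orders["eval_set"] == "test").sum())
n_users = toy_orders["user_id"].nunique()
assert n_test == n_users
```

Note that a user with a single order (user 3 here) ends up with a test order and no training history, which is worth checking for on the real data.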

Instacart also offers almost 50 thousand products, which is impressive but adds huge complexity to the model we are attempting to build. To predict the next basket for each customer, we would have to treat each product as its own class and then use machine learning methods to decide which classes will be included in the next basket. In other words, we would have up to 50 thousand separate classifiers per customer, which is a complexity nightmare. To simplify the task, we assign a category to each product. Instead of building a separate machine learning model to derive the categories, we simply use the aisle names. This approach not only reduces the problem from classifying 50 thousand items to 134 categories, it also has another benefit: it accounts for changes in preferences. For instance, suppose a customer buys chocolate chip and mint cookies all the time and the model knows that. When that person switches to, say, caramel or any other cookies, a product-level model will (temporarily) give inaccurate predictions. With categories, neither the model nor the customer has to care about those details: as long as the category is on the shopping list, cookies are cookies, and the specific kind is up to the customer at any given moment.


In [6]:
# in order to assign a correct aisle name we need to merge 2 tables together on aisle_id
product_df = pd.merge(products, aisles, how = 'left', on = 'aisle_id').drop(['aisle_id', 'department_id'], axis = 1)

product_df.columns = ['product_id', 'product_name', 'category']

In [7]:
product_df.head(3)

Unnamed: 0,product_id,product_name,category
0,1,Chocolate Sandwich Cookies,cookies cakes
1,2,All-Seasons Salt,spices seasonings
2,3,Robust Golden Unsweetened Oolong Tea,tea


The table above illustrates the point about changing preferences and how categories account for it: the tea category covers dozens of teas, and customers are free to choose any of them without lowering the prediction accuracy.


The 'prior' and 'train' sets only contain information about specific orders and the products they included, so to attach a category to every order line we need to concatenate both data frames and merge the result with product_df.


In [8]:
order_products = pd.concat([prior, train], ignore_index = True)
order_products.shape

(33819106, 4)

In [9]:
order_products.head(3)

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0


In [10]:
order_products = order_products.merge(product_df, how = 'left', on = 'product_id')
order_products.head(3)

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,product_name,category
0,2,33120,1,1,Organic Egg Whites,eggs
1,2,28985,2,1,Michigan Organic Kale,fresh vegetables
2,2,9327,3,0,Garlic Powder,spices seasonings


In [11]:
order_products.drop(['add_to_cart_order', 'reordered', 'product_name', 'product_id'], axis = 1, inplace = True)

print(f"Number of rows: {order_products.shape[0]} and number of duplicates: {order_products.duplicated().sum()}")
order_products.head()

Number of rows: 33819106 and number of duplicates: 9489884


Unnamed: 0,order_id,category
0,2,eggs
1,2,fresh vegetables
2,2,spices seasonings
3,2,oils vinegars
4,2,baking ingredients


By assigning categories to products we run into a new issue: people often buy multiple products within the same category (e.g. salt and pepper from spices), which would inflate the weight of that category and negatively affect the prediction. Luckily, the binarization step with `MultiLabelBinarizer()` produces a binary vector for each order indicating whether a given category was included or not, so all duplicated values collapse into a single 1.
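A small illustration of how binarization handles repeated categories. The baskets below are made up for the example; only `MultiLabelBinarizer` itself comes from the notebook.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Two toy baskets; the first contains the same category twice
baskets = [
    ["spices seasonings", "spices seasonings", "eggs"],
    ["eggs", "tea"],
]

mlb = MultiLabelBinarizer()
binary = mlb.fit_transform(baskets)

# Columns follow the sorted class labels
classes = list(mlb.classes_)
rows = binary.tolist()
print(classes)
print(rows)
```

The repeated 'spices seasonings' entry yields a single 1 in its column, so the number of items bought within a category does not affect the vector.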

But before we proceed, first, we need to split the data into train and test sets in order to avoid data leakage (including test data in a training set).

_Remember: the prior and train tables only contain details about particular orders; to find out which user an order belongs to, we have to refer to the orders data frame._

In [37]:
# And finally split the orders into 'train' and 'test'
train_orders = orders.loc[orders['eval_set'] == 'train', 'order_id'].values
test_orders = orders.loc[orders['eval_set'] == 'test', 'order_id'].values

order_products_train = order_products.loc[order_products['order_id'].isin(train_orders),:]
order_products_test = order_products.loc[order_products['order_id'].isin(test_orders),:]

In [38]:
order_products_train.shape

(31653689, 2)

Now that we have properly separated data into 'train'/'test' data frames we proceed with vectorizing the categories for each order using `MultiLabelBinarizer()`. What this will do is create a table with order_ids as rows and binary (1 for purchased and 0 otherwise) representation of product categories purchased as columns. But first, we need to format the data so that we have a single row for each order with a list of all purchased categories.

In [39]:
order_products_train_vec = order_products_train.groupby('order_id').apply(lambda order: order['category'].tolist()).to_frame().reset_index()
order_products_train_vec.columns = ['order_id', 'basket']
order_products_train_vec.head(3)

Unnamed: 0,order_id,basket
0,2,"[eggs, fresh vegetables, spices seasonings, oi..."
1,3,"[yogurt, soy lactosefree, packaged vegetables ..."
2,4,"[breakfast bakery, cold flu allergy, energy gr..."


In [15]:
# lets do the same for test set for easier query of the data
order_products_test_vec = order_products_test.groupby('order_id').apply(lambda order: order['category'].tolist()).to_frame().reset_index()
order_products_test_vec.columns = ['order_id', 'basket']
order_products_test_vec.head(3)

Unnamed: 0,order_id,basket
0,1,"[yogurt, other creams cheeses, fresh vegetable..."
1,36,"[specialty cheeses, water seltzer sparkling wa..."
2,38,"[nuts seeds dried fruit, packaged vegetables f..."


In [16]:
# instantiate the model
mlb_train = MultiLabelBinarizer()

#create a data frame where index = order_id, columns = categories (mlb treats them as classes)
binary_train = pd.DataFrame(mlb_train.fit_transform(order_products_train_vec['basket']),
                      columns = mlb_train.classes_,
                      index = order_products_train_vec['order_id'].values)
# display the outcome
binary_train.head(3)

Unnamed: 0,air fresheners candles,asian foods,baby accessories,baby bath body care,baby food formula,bakery desserts,baking ingredients,baking supplies decor,beauty,beers coolers,...,spreads,tea,tofu meat alternatives,tortillas flat bread,trail mix snack mix,trash bags liners,vitamins supplements,water seltzer sparkling water,white wines,yogurt
2,0,0,0,0,0,0,1,0,0,0,...,1,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0


In [17]:
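# and let's do the same for the test set for faster and easier check-ups.
# Reuse the training classes so that binary_test has exactly the same columns,
# in the same order, as binary_train (needed when rows of the two frames are compared).

mlb_test = MultiLabelBinarizer(classes = mlb_train.classes_)
binary_test = pd.DataFrame(mlb_test.fit_transform(order_products_test_vec['basket']),
                      columns = mlb_test.classes_,
                      index = order_products_test_vec['order_id'].values)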

binary_test.head(3)

Unnamed: 0,air fresheners candles,asian foods,baby accessories,baby bath body care,baby food formula,bakery desserts,baking ingredients,baking supplies decor,beauty,beers coolers,...,spreads,tea,tofu meat alternatives,tortillas flat bread,trail mix snack mix,trash bags liners,vitamins supplements,water seltzer sparkling water,white wines,yogurt
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
36,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
38,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now we have a binary representation for each category in the basket for each order in both train and test sets.

# Simple Methodology

To predict the upcoming grocery list (basket), we begin with a few simple prediction methods to use as benchmarks:

- **same as last order**: use customer's last recorded order as a prediction for the next purchase
- **random guess**: random selection of categories from an individual (user specific) set of categories
- **most frequently bought categories (top _N_)**: use N of customer's most frequently purchased categories as a prediction, where N stands for the user's average basket size.


To keep the approach as simple and intuitive as possible, we define all functions as if a person walks into a store and scans their loyalty card: the algorithm retrieves the user's previous orders and outputs a predicted basket.

In [18]:
def same_as_last(user_id):
    # get last order_id
    last_order_id = orders.loc[(orders['user_id'] == user_id) & (orders['eval_set'] == 'train'),'order_id'].tail(1).values[0]
    
    # retrieve the basket
    last_order = order_products_train_vec.loc[order_products_train_vec['order_id'] == last_order_id,'basket'].values[0]
    
    return last_order, last_order_id # second return for jaccard_score and f1_score
    
    

In [19]:
same_as_last(1)

(['soft drinks',
  'soft drinks',
  'candy chocolate',
  'yogurt',
  'packaged cheese',
  'nuts seeds dried fruit',
  'soy lactosefree',
  'cereal',
  'popcorn jerky'],
 2550362)

For the second baseline approach, **random guess**, we first define a global helper function that may also be useful outside of the random guess model.

In [21]:

def get_user_categories(user_id):
    # get a list of all orders (except for the test set)
    user_orders = orders.loc[(orders['user_id'] == user_id) & (orders['eval_set'] != 'test'), ['order_id']].values.tolist()
    
    # instantiate an empty list for categories from each order
    user_categories = [] 
    
    # loop through the user_orders and append all categories to the list
    for order in user_orders:
        user_categories.extend(order_products[order_products['order_id'] == order[0]]['category'].values.tolist())
    
    # leave only unique categories
    user_categories = list(set(user_categories))
    
    return user_categories

In [22]:
# Now, if we would like to see all categories purchased by a given user, 
# in a human readable format, we can use this function.

get_user_categories(1)

['soy lactosefree',
 'spreads',
 'candy chocolate',
 'paper goods',
 'fresh fruits',
 'nuts seeds dried fruit',
 'cream',
 'yogurt',
 'packaged cheese',
 'cereal',
 'popcorn jerky',
 'soft drinks']

In [23]:
def random_guess(user_id):
    # get customer categories
    categories = get_user_categories(user_id)
    
    # generate a random binary mask of the same length as the categories
    # (named `mask` so it does not shadow the imported `random` module)
    mask = np.random.randint(2, size = len(categories))
    
    # for each '1' in the mask, append the category at the same position to the final output
    random_basket = []
    for i in range(len(categories)):
        if mask[i] == 1:
            random_basket.append(categories[i])
            
    return random_basket

In [24]:
random_guess(1)

['spreads',
 'candy chocolate',
 'paper goods',
 'nuts seeds dried fruit',
 'packaged cheese',
 'popcorn jerky',
 'soft drinks']

In [25]:
# its better to define a function that will extract average basket size and categories of a customer separately 

def average_basket_size(user_id):
    # get all order ids for a customer (excluding the test order)
    all_orders = orders.loc[(orders['user_id'] == user_id) & (orders['eval_set'] == 'train'), 'order_id'].values
    number_of_orders = all_orders.shape[0]
    baskets = order_products_train_vec.loc[order_products_train_vec['order_id'].isin(all_orders),'basket'].values
    # to incorporate the 'value' of each category for the user, we use the lists that keep
    # duplicated values (i.e. multiple products within the same category). The vectorized data frame
    # would only capture how regularly a category is purchased, not how many items it contributes.
    
    # loop through each basket, adding its size to a running sum
    total_size = 0
    for i in range(number_of_orders):
        total_size += len(baskets[i])
        
    # calculate average size (rounding the answer to nearest number)
    average_basket_size = np.round(total_size/number_of_orders, 0)
    
    # return the value as int for the use in slicing
    return int(average_basket_size)

In [26]:
def top_N(user_id):
    
    # get a list of all orders (leave out the orders for test set)
    user_orders = orders.loc[(orders['user_id'] == user_id) & (orders['eval_set'] != 'test'), ['order_id']].values.tolist()
    
    # instantiate an empty list for categories from each order
    user_categories = [] 
    
    # loop through the user_orders and append all categories to the list
    for order in user_orders:
        user_categories.extend(order_products[order_products['order_id'] == order[0]]['category'].values.tolist())
    
    # create a dictionary to get counts for each category   
    counts = {}
    
    # loop through categories and update their counts
    for category in user_categories:
        if category in counts.keys():
            counts[category] += 1
        else:
            counts[category] = 1
    
    # sort the dictionary by values (counts)
    top = {key: value for key, value in sorted(counts.items(), key = lambda item: item[1], reverse = True)}
    
    # get average size of customer baskets (N)
    avg = average_basket_size(user_id)
    
    # return first N keys from dict
    return list(top.keys())[:avg]
        

In [27]:
top_N(1)

['soft drinks',
 'popcorn jerky',
 'nuts seeds dried fruit',
 'packaged cheese',
 'fresh fruits',
 'cereal']
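The counting logic of the top-N baseline can also be expressed with `collections.Counter`. The sketch below is not the notebook's function: `top_n_categories` is a hypothetical helper that takes a user's baskets directly as lists (duplicates kept) instead of looking them up by `user_id`, and the toy baskets are invented.

```python
from collections import Counter

def top_n_categories(baskets, n=None):
    """Return the n most frequent categories across a user's baskets.

    `baskets` is a list of category lists with duplicates kept.
    If `n` is omitted, the rounded average basket size is used.
    """
    counts = Counter(cat for basket in baskets for cat in basket)
    if n is None:
        n = int(round(sum(len(b) for b in baskets) / len(baskets)))
    return [cat for cat, _ in counts.most_common(n)]

# Toy purchase history: basket sizes 3, 2 and 5 give an average of 3.33, so N = 3
toy_baskets = [
    ["soft drinks", "soft drinks", "yogurt"],
    ["soft drinks", "cereal"],
    ["yogurt", "cereal", "cereal", "tea", "soft drinks"],
]

top = top_n_categories(toy_baskets)
print(top)
```

Here 'soft drinks' appears four times, 'cereal' three times and 'yogurt' twice, so those three categories form the predicted basket.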

Now that we have some baseline models we need to have a way to test how accurate the predictions are. Since the task is to predict a basket with categories (a vector) and check how different the prediction is from the actual order, we introduce an intuitive metric to test the accuracy:

$$ \text{base accuracy} = \frac{\text{number of categories predicted correctly}}{\text{length of actual basket}} \times 100 $$


For that, we define another function:

In [28]:
def get_base_accuracy(user_id, predicted_basket):
    
    # get users test order_id
    test_order_id = orders.loc[(orders['user_id'] == user_id)&(orders['eval_set'] == 'test'), 'order_id'].values[0]
    
    # get actual (test) basket
    actual_basket = order_products_test_vec.loc[order_products_test_vec['order_id'] == test_order_id,'basket'].values[0]

    # compare two baskets and store the result
    number_correct = 0
    
    for item in predicted_basket:
        if item in actual_basket:
            number_correct += 1
        else:
            continue
    
    accuracy = (number_correct/len(actual_basket))*100

    # return accuracy
    return accuracy, len(actual_basket), len(predicted_basket)
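One property of list-based counting is worth keeping in mind: baskets keep duplicated categories, so a category that appears twice in the prediction is counted twice. The sketch below shows a set-based variant for comparison. `base_accuracy_unique` is a hypothetical helper, not part of the notebook, and it takes both baskets directly rather than looking them up by `user_id`.

```python
def base_accuracy_unique(predicted_basket, actual_basket):
    """Share of distinct actual categories that were predicted, in percent."""
    predicted = set(predicted_basket)
    actual = set(actual_basket)
    if not actual:
        return 0.0
    return len(predicted & actual) / len(actual) * 100

predicted = ["soft drinks", "soft drinks", "yogurt"]
actual = ["soft drinks", "cereal"]

# List-based counting: both 'soft drinks' entries match, giving 2 hits / 2 items = 100%
list_based = sum(item in actual for item in predicted) / len(actual) * 100

# Set-based counting: 1 distinct hit / 2 distinct categories = 50%
set_based = base_accuracy_unique(predicted, actual)
print(list_based, set_based)
```

With duplicates kept, the list-based value can even exceed 100 when the prediction repeats a matching category more often than the actual basket has items.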

While researching ways to measure the accuracy of vector predictions, we came across the paper "Personalized Purchase Prediction of Market Baskets with Wasserstein-Based Sequence Matching" by Mathias Kraus and Stefan Feuerriegel. The paper uses very similar baseline approaches, along with a highly technical approach for the final model: product embeddings and cosine similarity to measure the distance between products, the Wasserstein distance to measure the distance between baskets, and finally K-Nearest Neighbors with Dynamic Time Warping to find the closest purchase histories (Kraus & Feuerriegel, 2019).

As for metrics, they used the F1-score, defined as the harmonic mean of precision and recall:

<img src = "https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcQKbbc4tA5lxCGgIul_G7QztvVLT2VDYDpJvg&usqp=CAU" width = "500">


as well as the Jaccard coefficient, the ratio of the baskets' intersection to their union:


<img src = "https://upload.wikimedia.org/wikipedia/commons/thumb/1/1f/Intersection_of_sets_A_and_B.svg/1200px-Intersection_of_sets_A_and_B.svg.png" width = "300">


where A is a predicted basket and B is the actual basket.
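For reference, both metrics can be written out in terms of the two baskets. These are the standard set-based definitions, with A and B treated as sets of categories:

$$ \text{precision} = \frac{|A \cap B|}{|A|}, \qquad \text{recall} = \frac{|A \cap B|}{|B|}, \qquad F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{2\,|A \cap B|}{|A| + |B|} $$

$$ J(A, B) = \frac{|A \cap B|}{|A \cup B|} $$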


Therefore, we need to incorporate calculations of these metrics for our case:


In [29]:
def get_jaccard_acc(user_id, predicted_basket_id):
    # get users test order_id
    test_order_id = orders.loc[(orders['user_id'] == user_id)&(orders['eval_set'] == 'test'), 'order_id'].values[0]
    
    # get actual (test) basket (binary)
    # here we need to use binary representation of each basket to make the test work.
    actual_basket = binary_test.loc[binary_test.index == test_order_id].values
    predicted_basket = binary_train.loc[binary_train.index == predicted_basket_id].values
    
    j_score = jaccard_score(actual_basket, predicted_basket, average = 'samples')
    
    return j_score

In [30]:
def get_f1_acc(user_id, predicted_basket_id):
    # get users test order_id
    test_order_id = orders.loc[(orders['user_id'] == user_id)&(orders['eval_set'] == 'test'), 'order_id'].values[0]
    
    # get actual (test) basket (binary)
    # here we need to use binary representation of each basket to make the test work.
    actual_basket = binary_test.loc[binary_test.index == test_order_id].values
    predicted_basket = binary_train.loc[binary_train.index == predicted_basket_id].values
    
    f1score = f1_score(actual_basket, predicted_basket, average = 'samples')
    
    return f1score
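For a single pair of baskets, the sample-averaged Jaccard and F1 scores reduce to simple set arithmetic. The sketch below computes both directly from category lists, which can serve as a quick cross-check of the binarized versions. `jaccard_and_f1` is a hypothetical helper and the baskets are invented.

```python
def jaccard_and_f1(predicted_basket, actual_basket):
    """Jaccard coefficient and F1-score for one predicted/actual basket pair."""
    a, b = set(predicted_basket), set(actual_basket)
    inter = len(a & b)
    union = len(a | b)
    jaccard = inter / union if union else 0.0
    f1 = 2 * inter / (len(a) + len(b)) if (a or b) else 0.0
    return jaccard, f1

predicted = ["soft drinks", "yogurt", "cereal"]
actual = ["soft drinks", "cereal", "tea", "eggs"]

# 2 shared categories, 5 in the union, 3 + 4 in total
j, f = jaccard_and_f1(predicted, actual)
print(round(j, 4), round(f, 4))
```

The two baskets share 2 of 5 distinct categories, giving a Jaccard coefficient of 0.4 and an F1-score of 4/7.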


Now that we have functions that give us results on an individual level, let's combine them to get complete information for every approach.

In [31]:
# for same as last its easier to provide jaccard and f1 scores since we already have 
# binarized vectors for calculations
def get_same_as_last_accuracy(user_id):
    
    # unpack the output of same_as_last method:
    predicted_basket, predicted_basket_id = same_as_last(user_id)
    
    # unpack the output of a base_accuracy
    accuracy, len_test, len_pred =  get_base_accuracy(user_id, predicted_basket)
    
    # since same_as_last approach uses train order (that is binarized) we simply take
    # the respective order_ids from same_as_last output
    
    jacc_score = get_jaccard_acc(user_id, predicted_basket_id)
    f1score = get_f1_acc(user_id, predicted_basket_id)
    
    return accuracy, len_test, len_pred, jacc_score, f1score

In [32]:
def get_random_guess_accuracy(user_id):
    
    # unpack the output of random_guess method
    predicted_basket = random_guess(user_id)
    
    # unpack the output of a base_accuracy
    accuracy, len_test, len_pred =  get_base_accuracy(user_id, predicted_basket)
    
    # get users test order_id
    test_order_id = orders.loc[(orders['user_id'] == user_id)&(orders['eval_set'] == 'test'), 'order_id'].values[0]
    
    # get actual (test) basket (binary)
    actual_basket = binary_test.loc[binary_test.index == test_order_id].values
    

    # use fitted mlb to transform predicted basket since the output format for this approach is not binarized
    
    jacc_score = jaccard_score(actual_basket, mlb_train.transform([predicted_basket]), average = 'samples')
    f1score = f1_score(actual_basket, mlb_train.transform([predicted_basket]), average = 'samples')
    
    return accuracy, len_test, len_pred, jacc_score, f1score

In [33]:
def get_top_N_accuracy(user_id):
    
    # unpack the output of top_N method
    predicted_basket = top_N(user_id)
    
    # unpack the output of a base_accuracy
    accuracy, len_test, len_pred =  get_base_accuracy(user_id, predicted_basket)
    

    # get users test order_id
    test_order_id = orders.loc[(orders['user_id'] == user_id)&(orders['eval_set'] == 'test'), 'order_id'].values[0]
    
    # get actual (test) basket (binary)
    actual_basket = binary_test.loc[binary_test.index == test_order_id].values
    

    # use fitted mlb to transform predicted basket since the output format for this approach is not binarized
    
    jacc_score = jaccard_score(actual_basket, mlb_train.transform([predicted_basket]), average = 'samples')
    f1score = f1_score(actual_basket, mlb_train.transform([predicted_basket]), average = 'samples')
    
    
    
    return accuracy, len_test, len_pred, jacc_score, f1score

Now we are ready to test the baseline methods. Because of the long runtime, we run the test on a random sample of 3000 users from the dataset.
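A likely contributor to the runtime is that the helper functions filter the full order tables once per order or per user. One possible remedy, shown here only as a sketch on invented data, is to build the lookups once and reuse them. The names `order_rows`, `user_orders`, `categories_by_order` and `user_categories` are hypothetical and not part of the notebook.

```python
from collections import defaultdict

# Toy (order_id, category) rows and a toy user -> training order ids mapping
order_rows = [(2, "eggs"), (2, "fresh vegetables"), (3, "yogurt"), (2, "eggs"), (4, "tea")]
user_orders = {1: [2, 3], 2: [4]}

# Single pass over the rows: order_id -> list of categories (duplicates kept)
categories_by_order = defaultdict(list)
for order_id, category in order_rows:
    categories_by_order[order_id].append(category)

def user_categories(user_id):
    """Distinct categories across a user's training orders, via dictionary lookups."""
    cats = set()
    for order_id in user_orders.get(user_id, []):
        cats.update(categories_by_order.get(order_id, []))
    return sorted(cats)

result = user_categories(1)
print(result)
```

Each per-user call then touches only that user's orders instead of scanning every row, at the cost of holding the lookup tables in memory.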

In [34]:
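# random.sample needs a sequence; sampling from a set is no longer supported in recent Python versions
sample_users = random.sample(list(orders['user_id'].unique()), 3000)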

In [35]:
# initiate the run-time variable
start = time.time()

# instantiate an empty list to store the accuracies
same_as_last_accuracies = []

# save the length of sample users for a print of progress
len_sample = len(sample_users)
# instantiate a counter for the progress status
i = 1

# loop through the sample users and append the accuracies as tuples
for user in sample_users:
    same_as_last_accuracies.append(get_same_as_last_accuracy(user))
    
    # print the progress status
    print(f"Done with {i} of {len_sample}")
    i += 1
    
end = time.time()
print(f"Run time: {end - start} seconds")
# save the results into a dataframe
same_as_last_accuracies_df = pd.DataFrame(same_as_last_accuracies, columns = ['accuracy', 'len_test', 'len_pred', 'jacc_score', 'f1_score'])


Done with 1 of 3000
Done with 2 of 3000
Done with 3 of 3000
...

In [36]:
# this cell follows the same steps as the previous one, but for the random_guess method

start = time.time()

random_guess_accuracies = []

len_sample = len(sample_users)
i = 1
for user in sample_users:
    random_guess_accuracies.append(get_random_guess_accuracy(user))
    print(f"Done with {i} of {len_sample}")
    i += 1
    
end = time.time()
print(f"Run time: {end - start} seconds")
random_guess_accuracies_df = pd.DataFrame(random_guess_accuracies, columns = ['accuracy', 'len_test', 'len_pred', 'jacc_score', 'f1_score'])




In [37]:
# this cell follows the same steps as the previous one, but for the top_N method

start = time.time()

top_N_accuracies = []

len_sample = len(sample_users)
i = 1
for user in sample_users:
    top_N_accuracies.append(get_top_N_accuracy(user))
    print(f"Done with {i} of {len_sample}")
    i += 1
    
end = time.time()
print(f"Run time: {end - start} seconds")
top_N_accuracies_df = pd.DataFrame(top_N_accuracies, columns = ['accuracy', 'len_test', 'len_pred', 'jacc_score', 'f1_score'])




In [147]:
baseline_results = pd.concat([same_as_last_accuracies_df.mean().to_frame(),
                              random_guess_accuracies_df.mean().to_frame(),
                              top_N_accuracies_df.mean().to_frame()],
                            axis = 1,
                            sort = False)
baseline_results.columns = ['same_as_last', 'random_guess', 'top_N']
baseline_results.index = ['base_accuracy', 'len_test', 'len_pred', 'jacc_score', 'f1_score']
baseline_results = baseline_results.T
baseline_results.sort_values(['jacc_score', 'f1_score'], ascending = [False, False])

              base_accuracy   len_test   len_pred  jacc_score  f1_score
top_N             40.982577  10.631667      9.847     0.30494  0.438114
same_as_last      49.466675  10.631667     10.309    0.286835  0.405653
random_guess      30.839796  10.631667  13.587333    0.170351   0.27311


Based on the results above, using the previous basket as the prediction yields the highest base accuracy, with top_N second. However, on the commonly used metrics (F1 score and Jaccard coefficient) the two methods switch places. Since the F1 score and Jaccard coefficient are the 'industry standard', we will focus on improving these metrics from now on.
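
To make the two metrics concrete, here is a minimal sketch that computes them for a single pair of baskets treated as sets. The helper names `jaccard` and `f1` and the toy baskets are illustrative assumptions; the notebook's own `get_jaccard_acc` and `get_f1_acc` are defined earlier and may compute the scores differently (for example, on binarized vectors).

```python
def jaccard(actual, predicted):
    """Jaccard coefficient: |intersection| / |union| of the two baskets."""
    actual, predicted = set(actual), set(predicted)
    union = actual | predicted
    if not union:
        return 0.0
    return len(actual & predicted) / len(union)


def f1(actual, predicted):
    """F1 score: harmonic mean of precision and recall over basket items."""
    actual, predicted = set(actual), set(predicted)
    hits = len(actual & predicted)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)  # share of predicted items that were bought
    recall = hits / len(actual)        # share of bought items that were predicted
    return 2 * precision * recall / (precision + recall)


# hypothetical baskets, for illustration only
actual_basket = ["banana", "milk", "eggs", "bread"]
predicted_basket = ["banana", "milk", "yogurt"]

print(jaccard(actual_basket, predicted_basket))
print(f1(actual_basket, predicted_basket))
```

Both scores penalize a prediction for missed items as well as for extra items, so a predicted basket that is much longer or shorter than the actual one is scored down even when it contains several correct products.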

Also note that there appears to be a connection between the difference in basket lengths and accuracy. The random guess method scores lowest on every metric and has the largest gap between predicted and actual basket sizes (13.59 vs. 10.63 items on average). As was pointed out during the EDA for this data, basket sizes follow a right-skewed distribution ranging from 1 to 145 items, which was assumed to be a potential issue for predictions.


The main hypothesis is that we can use people with similar purchase histories to predict upcoming shopping lists. For example, suppose a customer with a history of 5 orders has purchasing patterns very similar to those of another customer with, say, 10 orders. If the match occurs in the middle of the longer history (orders 3 through 7 out of 10), we can use order #8 as a prediction for the customer in question. Order #8 is the basket that follows a series of orders with a similar pattern; the assumption is that if historical purchases were similar, the following purchases will be similar as well.


Let's start testing the hypothesis with the simplest version of the above: use cosine similarity to find the most similar order across the dataset, and use that user's next order (if it exists) as the prediction.
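
As a small illustration of the idea, the sketch below binarizes three toy baskets and ranks them by cosine similarity to a query order. The order ids, products, and the names `toy_baskets`, `toy_binary`, and `ranked` are made up for this example; the real `binary_train` matrix is built earlier in the notebook.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics.pairwise import cosine_similarity

# hypothetical orders: order_id -> list of products
toy_baskets = {
    101: ["banana", "milk", "eggs"],
    102: ["banana", "milk", "bread"],
    103: ["soap", "shampoo"],
}

# one row per order, one 0/1 column per product
mlb = MultiLabelBinarizer()
toy_binary = pd.DataFrame(
    mlb.fit_transform(list(toy_baskets.values())),
    index=list(toy_baskets.keys()),
    columns=mlb.classes_,
)

# similarity of order 101 to every order (including itself)
query = toy_binary.loc[[101]].to_numpy()
sims = cosine_similarity(query, toy_binary.to_numpy())[0]

# drop the query order by id, then rank the rest from most to least similar
ranked = (
    pd.Series(sims, index=toy_binary.index)
    .drop(101)
    .sort_values(ascending=False)
)
print(ranked)
```

Here the query order is removed by its id rather than by position. If another order has an identical basket, its similarity is also 1, so the query is not guaranteed to sit at position zero after sorting.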

In [41]:
def get_cosine_ids(user_id):
    
    # get the last order of the user
    last_order_id = orders.loc[(orders['user_id'] == user_id) & (orders['eval_set'] == 'train'),'order_id'].tail(1).values[0]
    
    # find cosine similarity for that basket and all other orders
    cosine = cosine_similarity(binary_train.loc[last_order_id,:].values.reshape(1,-1), binary_train)
    
    # save and output the results as a dataframe
    cosine_df = pd.DataFrame({'order_id': binary_train.index, 'similarity': cosine[0]})
    
    # sort by cosine similarity (descending)
    cosine_df = cosine_df.sort_values(by = 'similarity', ascending=False)
    
    return cosine_df

In [42]:
def get_cosine_basket(user_id):

    # get cosine similarity df for that user's last order
    cosine = get_cosine_ids(user_id)
    
    
    # use a flag and a while loop to handle the case when the user of the closest basket (whose
    # next order will be used as a prediction) does not have a next order.
    # this ensures the second-best match is used when that happens.
    flag = True
    
    # index of the most similar basket
    n_best = 1
    
    while flag:
        
        # assign order_id of the closest basket
        # note: index zero in the sorted cosine similarity output is expected to be the similarity
        # of the order to itself, so we start indexing from 1
        
        similar_order_id = cosine['order_id'].values[n_best] 

        # retrieves user_id of the similar order (whose next order we will use as a prediction)
        similar_order_user = orders.loc[orders['order_id'] == similar_order_id, ['user_id']].values[0][0]

        # retrieves that order's order_number 
        similar_order_number = orders.loc[(orders['user_id'] == similar_order_user) & (orders['order_id'] == similar_order_id), 'order_number'].values[0]
        
        # order numbers indicate the sequence of purchases by a user,
        # i.e. how many times that user has previously shopped at the store
        # make sure not to capture test set orders
        order_numbers = orders.loc[(orders['user_id'] == similar_order_user)&(orders['eval_set'] != 'test'), 'order_number'].values

        # check if that user has a follow-up order:
        if (similar_order_number + 1) in order_numbers:
            predicted_order_id = orders.loc[(orders['user_id'] == similar_order_user)& (orders['order_number'] == (similar_order_number+1)), ['order_id']].values[0][0]
            flag = False
        # if that user does not have a next purchase, move on to the next most similar order
        else:
            n_best += 1
    
    # now get the actual order from order_products_train_vec:
    predicted_basket = order_products_train_vec.loc[order_products_train_vec['order_id'] == predicted_order_id, 'basket'].values[0]
    
    # return predicted basket and its id for jaccard and f1 tests
    return predicted_basket, predicted_order_id
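
The fallback logic above can also be sketched as a bounded loop over ranked candidates. This is an illustrative variant on a toy `orders`-like frame, not the code used for the results in this notebook; `toy_orders`, `next_train_order_id`, and `first_usable_prediction` are hypothetical names.

```python
import pandas as pd

# hypothetical orders table with the same columns the notebook relies on
toy_orders = pd.DataFrame({
    "order_id":     [11, 12, 13, 21, 22],
    "user_id":      [1, 1, 1, 2, 2],
    "order_number": [1, 2, 3, 1, 2],
    "eval_set":     ["train", "train", "test", "train", "test"],
})


def next_train_order_id(orders_df, order_id):
    """Return the id of the order that follows `order_id` for the same user,
    or None if the follow-up is missing or belongs to the 'test' set."""
    row = orders_df.loc[orders_df["order_id"] == order_id].iloc[0]
    follow_up = orders_df.loc[
        (orders_df["user_id"] == row["user_id"])
        & (orders_df["order_number"] == row["order_number"] + 1)
        & (orders_df["eval_set"] != "test"),
        "order_id",
    ]
    return int(follow_up.iloc[0]) if not follow_up.empty else None


def first_usable_prediction(orders_df, ranked_order_ids):
    """Walk candidates from most to least similar and return the first
    follow-up order found; None if no candidate has one."""
    for candidate in ranked_order_ids:
        predicted = next_train_order_id(orders_df, candidate)
        if predicted is not None:
            return predicted
    return None


print(first_usable_prediction(toy_orders, [21, 12, 11]))
```

Iterating over a finite candidate list means the search stops with `None` when no similar order has a usable follow-up, instead of running past the end of the ranking.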


In [43]:
def get_cosine_accuracy(user_id):
    
    # unpack the output of the cosine basket function
    predicted_basket, predicted_basket_id = get_cosine_basket(user_id)
    
    # unpack the output of the base accuracy function
    accuracy, len_test, len_pred =  get_base_accuracy(user_id, predicted_basket)
    
    # retrieve jaccard and f1 scores
    jacc_score = get_jaccard_acc(user_id, predicted_basket_id)
    f1score = get_f1_acc(user_id, predicted_basket_id)
    
    return accuracy, len_test, len_pred, jacc_score, f1score

In [44]:
# this cell follows the same structure as the baseline method checks
start = time.time()

cosine_accuracies = []
len_users = len(sample_users)
i = 1
for user in sample_users:
    cosine_accuracies.append(get_cosine_accuracy(user))
    print(f"Done with {i} of {len_users}")
    i += 1
    
end = time.time()
print(f"Run time: {end - start} seconds")
cosine_accuracies_df = pd.DataFrame(cosine_accuracies, 
                                    columns = ['accuracy', 'len_test', 'len_pred', 'jacc_score', 'f1_score'])



Done with 2578 of 3000
Done with 2579 of 3000
Done with 2580 of 3000
Done with 2581 of 3000
Done with 2582 of 3000
Done with 2583 of 3000
Done with 2584 of 3000
Done with 2585 of 3000
Done with 2586 of 3000
Done with 2587 of 3000
Done with 2588 of 3000
Done with 2589 of 3000
Done with 2

Done with 2904 of 3000
Done with 2905 of 3000
Done with 2906 of 3000
Done with 2907 of 3000
Done with 2908 of 3000
Done with 2909 of 3000
Done with 2910 of 3000
Done with 2911 of 3000
Done with 2912 of 3000
Done with 2913 of 3000
Done with 2914 of 3000
Done with 2915 of 3000
Done with 2916 of 3000
Done with 2917 of 3000
Done with 2918 of 3000
Done with 2919 of 3000
Done with 2920 of 3000
Done with 2921 of 3000
Done with 2922 of 3000
Done with 2923 of 3000
Done with 2924 of 3000
Done with 2925 of 3000
Done with 2926 of 3000
Done with 2927 of 3000
Done with 2928 of 3000
Done with 2929 of 3000
Done with 2930 of 3000
Done with 2931 of 3000
Done with 2932 of 3000
Done with 2933 of 3000
Done with 2934 of 3000
Done with 2935 of 3000
Done with 2936 of 3000
Done with 2937 of 3000
Done with 2938 of 3000
Done with 2939 of 3000
Done with 2940 of 3000
Done with 2941 of 3000
Done with 2942 of 3000
Done with 2943 of 3000
Done with 2944 of 3000
Done with 2945 of 3000
Done with 2946 of 3000
Done with 2

In [72]:
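# average the per-order metrics into a single summary row
cos_df = cosine_accuracies_df.mean().to_frame().T

# use the same column names as the baseline results
cos_df.columns = ['base_accuracy', 'len_test', 'len_pred', 'jacc_score', 'f1_score']
cos_df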

Unnamed: 0,base_accuracy,len_test,len_pred,jacc_score,f1_score
0,32.009458,10.631667,9.550333,0.155608,0.243585


In [78]:
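# stack the cosine similarity summary under the baseline results
updated_results = pd.concat([baseline_results, cos_df],
                           axis = 0,
                           sort = False)

updated_results.index = ['same_as_last', 'random_guess', 'top_N', 'cosine_1d']
# rank the approaches from best to worst by Jaccard, then F1
updated_results.sort_values(['jacc_score', 'f1_score'], ascending = [False, False])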

Unnamed: 0,base_accuracy,len_test,len_pred,jacc_score,f1_score
top_N,40.982577,10.631667,9.847,0.30494,0.438114
same_as_last,49.466675,10.631667,10.309,0.286835,0.405653
random_guess,30.839796,10.631667,13.587333,0.170351,0.27311
cosine_1d,32.009458,10.631667,9.550333,0.155608,0.243585


Based on the updated results, the simple cosine similarity approach performs the worst on both F1-score (0.244) and Jaccard coefficient (0.156), falling even below the random guess. This leaves the top_N approach as the best predictor so far.
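For reference, both metrics compare a predicted basket with the actual basket as sets of items. The sketch below is a minimal, self-contained illustration. The helper names `jaccard` and `f1` and the example aisles are hypothetical and are not the scoring code used earlier in this notebook.

```python
def jaccard(actual, predicted):
    """Jaccard coefficient: size of the intersection over size of the union."""
    actual, predicted = set(actual), set(predicted)
    union = actual | predicted
    if not union:
        return 0.0  # assumption: two empty baskets score 0
    return len(actual & predicted) / len(union)


def f1(actual, predicted):
    """F1-score: harmonic mean of precision and recall over basket items."""
    actual, predicted = set(actual), set(predicted)
    hits = len(actual & predicted)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(actual)
    return 2 * precision * recall / (precision + recall)


# hypothetical baskets, expressed as aisle names
actual_basket = {"yogurt", "tea", "spreads", "fresh fruits"}
predicted_basket = {"yogurt", "tea", "white wines"}

j = jaccard(actual_basket, predicted_basket)
f = f1(actual_basket, predicted_basket)
print(f"Jaccard: {j:.3f}, F1: {f:.3f}")
```

Here two of the three predicted aisles are correct and two of the four actual aisles are recovered, so both scores reward overlap while penalizing missed and extra items.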


To sum up this notebook, the top_N approach performs best in terms of both F1-score (0.438) and Jaccard coefficient (0.305). For comparison, the aforementioned paper reports 0.497 and 0.567 for F1 and Jaccard, respectively, using a far more sophisticated approach. This suggests that predicting order baskets is a hard problem that is still an active topic of academic research. Much of the difficulty likely comes from trying to predict human behavior, which is noisy and only partly regular, especially in shopping. Impulsive purchases, quick stops to pick up a couple of items, and shopping for special occasions all add variation. These factors make it harder to normalize behavior in this context and to extract predictive patterns.


As for future work, we will attempt to use the same logic and approach to compare whole purchase histories (N dimensions). The binary basket table would probably have to be sorted so that each user's orders (baskets) appear in purchase order, which would let us compare baskets sequentially. For now, the next notebook explores a series of logistic regressions for predicting the upcoming shopping list.
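The sequential comparison idea can be sketched on a toy table. This is only an illustration under stated assumptions: the `toy` data frame, its values, the `item_cols` list, and the `consecutive_similarity` helper are all hypothetical, and the sketch assumes each basket row carries a `user_id` and an `order_number` that defines the purchase order.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for a binary basket table (hypothetical values).
toy = pd.DataFrame({
    "user_id":      [1, 1, 1, 2, 2],
    "order_number": [2, 1, 3, 1, 2],
    "tea":          [1, 1, 0, 0, 0],
    "yogurt":       [1, 0, 1, 1, 1],
    "spreads":      [0, 0, 1, 0, 1],
})
item_cols = ["tea", "yogurt", "spreads"]

# Order each user's baskets chronologically.
toy = toy.sort_values(["user_id", "order_number"]).reset_index(drop=True)


def consecutive_similarity(history):
    """Cosine similarity between each basket and the next one for one user."""
    baskets = history[item_cols].to_numpy()
    if len(baskets) < 2:
        return []
    # Pairwise similarities; the diagonal pairs basket i with basket i + 1.
    sims = cosine_similarity(baskets[:-1], baskets[1:])
    return np.diag(sims).tolist()


seq_sims = {
    user: consecutive_similarity(history)
    for user, history in toy.groupby("user_id")
}
print(seq_sims)
```

Each user ends up with one similarity value per pair of consecutive orders, which could then serve as a feature or as a weighting when comparing purchase histories.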

-----

But first, I will save `binary_train` and `binary_test` for future use, retaining the user and order information in both data frames.

In [46]:
# attach order_id and user_id to each binary basket row
train_binary = orders.loc[orders['eval_set'] == 'train', ['order_id', 'user_id']].merge(
    binary_train, left_on = 'order_id', right_index = True)
test_binary = orders.loc[orders['eval_set'] == 'test', ['order_id', 'user_id']].merge(
    binary_test, left_on = 'order_id', right_index = True)


In [47]:
train_binary.head()

Unnamed: 0,order_id,user_id,air fresheners candles,asian foods,baby accessories,baby bath body care,baby food formula,bakery desserts,baking ingredients,baking supplies decor,...,spreads,tea,tofu meat alternatives,tortillas flat bread,trail mix snack mix,trash bags liners,vitamins supplements,water seltzer sparkling water,white wines,yogurt
0,2539329,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2398795,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,473747,1,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
3,2254736,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,431534,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
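One way to persist such data frames between notebooks is pandas' pickle support. The sketch below is a minimal example on a toy frame. The file name, the temporary directory, and the `toy_train_binary` values are assumptions for illustration, not the paths or data used in this project.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for train_binary (hypothetical values).
toy_train_binary = pd.DataFrame({
    "order_id": [2539329, 2398795],
    "user_id":  [1, 1],
    "tea":      [0, 0],
    "yogurt":   [0, 1],
})

# Hypothetical output location; a project folder would work the same way.
out_path = os.path.join(tempfile.gettempdir(), "train_binary_example.pkl")

# Write the frame to disk, then read it back to confirm a lossless round trip.
toy_train_binary.to_pickle(out_path)
restored = pd.read_pickle(out_path)
print(restored.equals(toy_train_binary))
```

Pickle keeps dtypes and the index intact, which a CSV round trip does not guarantee. Pickle files should only be loaded from trusted sources.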


References:

1. Kraus, M., & Feuerriegel, S. (2019, June 14). Personalized Purchase Prediction of Market Baskets with Wasserstein-Based Sequence Matching. arXiv:1905.13131. Retrieved from https://arxiv.org/abs/1905.13131

In [144]:
%load_ext watermark

%watermark -v -m -p numpy,pandas,sklearn -g

CPython 3.7.6
IPython 7.12.0

numpy 1.18.1
pandas 1.0.1
sklearn 0.22.1

compiler   : Clang 4.0.1 (tags/RELEASE_401/final)
system     : Darwin
release    : 19.5.0
machine    : x86_64
processor  : i386
CPU cores  : 8
interpreter: 64bit
Git hash   :
