## Building a Recommendation System for Reoccurring Products



### Dataset: Instacart Online Grocery Shopping Dataset

#### Link: https://www.instacart.com/datasets/grocery-shopping-2017

The dataset consists of 6 interrelated files. We will be using 3 of them, described below.

#### 1) Orders Data
This CSV file lists every order placed by every user. The columns are:
- order_id: order identifier
- user_id: customer identifier
- eval_set: which evaluation set this order belongs to (prior, train or test)
- order_number: the order sequence number for this user (1 = first, n = nth)
- order_dow: the day of the week the order was placed on
- order_hour_of_day: the hour of the day the order was placed on
- days_since_prior_order: days since the last order, capped at 30 (NA for order_number = 1)

#### 2) Products Data
This file contains the list of all the products available for purchase. The columns are:
- product_id: product identifier
- product_name: name of the product
- aisle_id: foreign key
- department_id: foreign key

#### 3) Order_product_set Data
This CSV file records which products were purchased in each order. If n products are purchased in order x, the file has n rows with the same order_id, one per product. The columns are:
- order_id: foreign key
- product_id: foreign key
- add_to_cart_order: order in which each product was added to cart
- reordered: 1 if this product has been ordered by this user in the past, 0 otherwise


In [1]:
# Importing library
import numpy as np
import pandas as pd
from pandas import isnull

In [2]:
# Importing Orders Data
df_orders = pd.read_csv("orders.csv")
df_orders.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0


In [3]:
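# Keeping only the rows that belong to the 'prior' and 'train' sets (the 'test' set is dropped).
# .copy() gives an independent DataFrame, so adding columns later does not raise SettingWithCopyWarning.
orders = df_orders.loc[df_orders['eval_set'] != 'test'].copy()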

## Train - Test Split

The train-test split is based on the order number. The last order of each user is the one to be predicted, so it forms the test data; all earlier orders are used for training.
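
The split described above can be sketched on a toy table. This is only an illustration under assumptions: `toy_orders` and its values are hypothetical, and `idxmax` is used here as a compact alternative to the sort-and-take-first approach of the cells that follow.

```python
import pandas as pd

# Toy orders table (hypothetical values, same column names as orders.csv)
toy_orders = pd.DataFrame({
    "order_id":     [11, 12, 13, 21, 22],
    "user_id":      [1, 1, 1, 2, 2],
    "order_number": [1, 2, 3, 1, 2],
})

# Row label of each user's highest order_number, i.e. that user's last order
last_idx = toy_orders.groupby("user_id")["order_number"].idxmax()

toy_test = toy_orders.loc[last_idx]          # one row per user: the last order
toy_train = toy_orders.drop(index=last_idx)  # every earlier order

print(toy_test["order_id"].tolist())
print(toy_train["order_id"].tolist())
```

Every user contributes exactly one order to the test set, so users with a single order would have no training rows at all.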

In [4]:
# Giving New ID to efficiently make a train test split
orders['New_ID'] = orders.index + 1
orders = orders[['order_id', 'user_id', 'order_number', 'days_since_prior_order', 'New_ID']]
orders.head()


Unnamed: 0,order_id,user_id,order_number,days_since_prior_order,New_ID
0,2539329,1,1,,1
1,2398795,1,2,15.0,2
2,473747,1,3,21.0,3
3,2254736,1,4,29.0,4
4,431534,1,5,28.0,5


In [5]:
# Making Test Data --> The orders with highest order number for each user are filtered out
test = orders.groupby(['user_id']).apply(lambda x: (x.sort_values(['user_id', 'order_number'], 
                                                  ascending=[True, False]).head(1)))

test.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,order_id,user_id,order_number,days_since_prior_order,New_ID
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1,10,1187899,1,11,14.0,11
2,25,1492625,2,15,30.0,26
3,37,1402502,3,12,15.0,38
4,43,2557754,4,5,0.0,44
5,49,2196797,5,5,6.0,50


In [6]:
# Making Train Data --> Rows whose New_ID appears in the test data are removed from the orders data.
# The remaining rows form the train data. Boolean indexing returns a new DataFrame,
# so the original orders DataFrame is left unchanged.
in_test = orders['New_ID'].isin(test['New_ID'])
train = orders[~in_test].reset_index(drop=True)
train.head()

Unnamed: 0,order_id,user_id,order_number,days_since_prior_order,New_ID
0,2539329,1,1,,1
1,2398795,1,2,15.0,2
2,473747,1,3,21.0,3
3,2254736,1,4,29.0,4
4,431534,1,5,28.0,5


In [7]:
# Loading Orders_product data for 'prior' set
df_set1 = pd.read_csv("order_products_prior.csv")

# Loading Orders_product data for 'train' set
df_set2 = pd.read_csv("order_products_train.csv")

# Concatenating the Orders_product datasets
prod_order = pd.concat([df_set1, df_set2], axis = 0)

# Sorting based on order_id and add_to_cart_order
prod_order = prod_order.sort_values(["order_id", "add_to_cart_order"],ascending = [True, True])

prod_order.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,1,49302,1,1
1,1,11109,2,1
2,1,10246,3,0
3,1,49683,4,0
4,1,43633,5,1


In [8]:
# Joining Train Data with the order_product data. 
# Inner join was used to make sure that only those orders are kept whose product data is available.
op_train = pd.merge(train, prod_order, on='order_id', how='inner')
op_train.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order,New_ID,product_id,add_to_cart_order,reordered
0,2539329,1,prior,1,2,8,,1,196,1,0
1,2539329,1,prior,1,2,8,,1,14084,2,0
2,2539329,1,prior,1,2,8,,1,12427,3,0
3,2539329,1,prior,1,2,8,,1,26088,4,0
4,2539329,1,prior,1,2,8,,1,26405,5,0


In [10]:
# Joining Test Data with the order_product data. 
# Inner join was used to make sure that only those orders are kept whose product data is available.
op_test = pd.merge(test, prod_order, on='order_id', how='inner')
op_test.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order,New_ID,product_id,add_to_cart_order,reordered
0,1187899,1,train,11,4,8,14.0,11,196,1,1
1,1187899,1,train,11,4,8,14.0,11,25133,2,1
2,1187899,1,train,11,4,8,14.0,11,38928,3,1
3,1187899,1,train,11,4,8,14.0,11,26405,4,1
4,1187899,1,train,11,4,8,14.0,11,39657,5,1


# Algorithm: Weighted Basket Method

In [11]:
# Extracting useful columns. No need of order_id as it is represented by the order_number for each user.
data = op_train[['user_id', 'product_id', 'order_number']]
data.head()

Unnamed: 0,user_id,product_id,order_number
0,1,196,1
1,1,14084,1
2,1,12427,1
3,1,26088,1
4,1,26405,1


In [12]:
# Calculating the weightage of each product for the respective user.
# The oldest order has the least weightage while the most recent order has the maximum weightage,
# so the order_number itself serves as the weight of an order.
# The weightage of a product is the sum of the order_numbers of the orders it appears in:
sum_df = data.groupby(['user_id','product_id']).agg({'order_number': 'sum'})
sum_df = sum_df.reset_index()
sum_df.head()

Unnamed: 0,user_id,product_id,order_number
0,1,196,55
1,1,10258,54
2,1,10326,5
3,1,12427,55
4,1,13032,19


In [13]:
# Computing the number of orders of the user to calculate the total weightage of each user
max_basket = data.groupby(['user_id']).agg({'order_number': 'max'})
max_basket.head()

Unnamed: 0_level_0,order_number
user_id,Unnamed: 1_level_1
1,10
2,14
3,11
4,4
5,4


In [14]:
# Calculating the total weightage of each user using formula: (n * (n +1)) / 2 
# Where n = total number of orders for that user
max_basket['weight'] = (max_basket.order_number * (max_basket.order_number + 1)) / 2
max_basket = max_basket.reset_index()
max_basket.head()

Unnamed: 0,user_id,order_number,weight
0,1,10,55.0
1,2,14,105.0
2,3,11,66.0
3,4,4,10.0
4,5,4,10.0


In [15]:
# Merging the product weightage data with total weightage data
weight_data = pd.merge(sum_df, max_basket, on = 'user_id', how='outer')
weight_data = weight_data.drop(['order_number_y'], axis = 1)
weight_data.head()

Unnamed: 0,user_id,product_id,order_number_x,weight
0,1,196,55,55.0
1,1,10258,54,55.0
2,1,10326,5,55.0
3,1,12427,55,55.0
4,1,13032,19,55.0


In [16]:
# Calculating the weighted score of the user purchased products
# Formula: Product weightage / Total weightage of user
weight_data['imp_score'] = weight_data.order_number_x / weight_data.weight
weight_data.head()

Unnamed: 0,user_id,product_id,order_number_x,weight,imp_score
0,1,196,55,55.0,1.0
1,1,10258,54,55.0,0.981818
2,1,10326,5,55.0,0.090909
3,1,12427,55,55.0,1.0
4,1,13032,19,55.0,0.345455
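
The weighted basket score computed above can be checked by hand on a small example. The sketch below assumes a hypothetical user with three orders; the frame `toy` and its values are made up for illustration, and only the formula (sum of order numbers divided by n(n+1)/2) comes from the cells above.

```python
import pandas as pd

# Toy purchase history (hypothetical): one row per (user, product, order_number)
toy = pd.DataFrame({
    "user_id":      [1, 1, 1, 1, 1],
    "product_id":   [196, 196, 196, 10326, 13032],
    "order_number": [1, 2, 3, 2, 3],
})

# Product weightage: sum of the order numbers in which the product appears
scores = (toy.groupby(["user_id", "product_id"])["order_number"]
             .sum()
             .reset_index(name="prod_weight"))

# Total weightage per user: 1 + 2 + ... + n = n * (n + 1) / 2
n_orders = toy.groupby("user_id")["order_number"].max()
total_weight = n_orders * (n_orders + 1) / 2

# Look up each row's user total and divide
scores["imp_score"] = scores["prod_weight"] / scores["user_id"].map(total_weight)
print(scores)
```

A product bought in every order sums to the full 1 + 2 + 3 = 6 and scores 1.0, while a product bought only in the most recent order scores 3/6 = 0.5, higher than one bought only in the second order (2/6).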


# Algorithm: Product Transition Probability

In [17]:
ttrain = train

In [18]:
# prev_order_id column has the preceding order_id of the current order_id for the given user
# Done by shifting the order_id column by 1
ttrain['prev_order_id'] = ttrain.sort_values(['user_id', 'order_number'])\
.groupby('user_id')['order_id'].shift().fillna(0).astype(np.uint32)

ttrain.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order,New_ID,prev_order_id
0,2539329,1,prior,1,2,8,,1,0
1,2398795,1,prior,2,3,7,15.0,2,2539329
2,473747,1,prior,3,3,12,21.0,3,2398795
3,2254736,1,prior,4,4,7,29.0,4,473747
4,431534,1,prior,5,4,15,28.0,5,2254736


In [19]:
# This is done to make the dataset ready for lookup on the basis of order_id
ttrain = ttrain.set_index('order_id')

In [20]:
# Get product list for all the orders and make a new column prod_list
ttrain['prod_list'] = prod_order.groupby('order_id').aggregate(
    {'product_id':lambda x: set(x)})

ttrain.head()

Unnamed: 0_level_0,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order,New_ID,prev_order_id,prod_list
order_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
2539329,1,prior,1,2,8,,1,0,"{196, 26405, 14084, 26088, 12427}"
2398795,1,prior,2,3,7,15.0,2,2539329,"{196, 26088, 13032, 12427, 10258, 13176}"
473747,1,prior,3,3,12,21.0,3,2398795,"{196, 12427, 25133, 30450, 10258}"
2254736,1,prior,4,4,7,29.0,4,473747,"{196, 26405, 12427, 25133, 10258}"
431534,1,prior,5,4,15,28.0,5,2254736,"{17122, 196, 12427, 25133, 10258, 10326, 13176..."


In [21]:
# Make a new dataset that has all rows except where order_number is 1
# This is done because the first order of every user has no preceding order to compare with.
ords = ttrain[(ttrain.order_number > 1)]
ords = ords.reset_index()

In [22]:
# Mapping the order_id column with the index of the ttrain dataset, which was set to order_id previously
# Done to get the list of products in the current order_id
ords['prod_list'] = ords.order_id.map(ttrain.prod_list)
ords.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order,New_ID,prev_order_id,prod_list
0,2398795,1,prior,2,3,7,15.0,2,2539329,"{196, 26088, 13032, 12427, 10258, 13176}"
1,473747,1,prior,3,3,12,21.0,3,2398795,"{196, 12427, 25133, 30450, 10258}"
2,2254736,1,prior,4,4,7,29.0,4,473747,"{196, 26405, 12427, 25133, 10258}"
3,431534,1,prior,5,4,15,28.0,5,2254736,"{17122, 196, 12427, 25133, 10258, 10326, 13176..."
4,3367565,1,prior,6,2,7,19.0,6,431534,"{10258, 12427, 196, 25133}"


In [23]:
# Mapping the prev_order_id column with the index of ttrain dataset 
# Done to get the list of products in the previous order_id
ords['prev_prod_list'] = ords.prev_order_id.map(ttrain.prod_list)
ords.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order,New_ID,prev_order_id,prod_list,prev_prod_list
0,2398795,1,prior,2,3,7,15.0,2,2539329,"{196, 26088, 13032, 12427, 10258, 13176}","{196, 26405, 14084, 26088, 12427}"
1,473747,1,prior,3,3,12,21.0,3,2398795,"{196, 12427, 25133, 30450, 10258}","{196, 26088, 13032, 12427, 10258, 13176}"
2,2254736,1,prior,4,4,7,29.0,4,473747,"{196, 26405, 12427, 25133, 10258}","{196, 12427, 25133, 30450, 10258}"
3,431534,1,prior,5,4,15,28.0,5,2254736,"{17122, 196, 12427, 25133, 10258, 10326, 13176...","{196, 26405, 12427, 25133, 10258}"
4,3367565,1,prior,6,2,7,19.0,6,431534,"{10258, 12427, 196, 25133}","{17122, 196, 12427, 25133, 10258, 10326, 13176..."


In [24]:
# fill N/A values: na -> empty set
ords.loc[:, ['prod_list', 
               'prev_prod_list']] \
= ords.loc[:, ['prod_list', 
               'prev_prod_list']].applymap(lambda x: set() if isnull(x) else x)

In [25]:
# Making a set T11: Common products in the current and previous orders.
ords['T11'] = ords.apply(lambda r: r['prod_list'] & r['prev_prod_list'], axis=1)
ords.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order,New_ID,prev_order_id,prod_list,prev_prod_list,T11
0,2398795,1,prior,2,3,7,15.0,2,2539329,"{196, 26088, 13032, 12427, 10258, 13176}","{196, 26405, 14084, 26088, 12427}","{26088, 12427, 196}"
1,473747,1,prior,3,3,12,21.0,3,2398795,"{196, 12427, 25133, 30450, 10258}","{196, 26088, 13032, 12427, 10258, 13176}","{10258, 12427, 196}"
2,2254736,1,prior,4,4,7,29.0,4,473747,"{196, 26405, 12427, 25133, 10258}","{196, 12427, 25133, 30450, 10258}","{10258, 12427, 196, 25133}"
3,431534,1,prior,5,4,15,28.0,5,2254736,"{17122, 196, 12427, 25133, 10258, 10326, 13176...","{196, 26405, 12427, 25133, 10258}","{10258, 12427, 196, 25133}"
4,3367565,1,prior,6,2,7,19.0,6,431534,"{10258, 12427, 196, 25133}","{17122, 196, 12427, 25133, 10258, 10326, 13176...","{10258, 12427, 196, 25133}"


In [27]:
# product count -> No. of bins needed for N1 and N11
df_prod = pd.read_csv("products.csv")
n_products = len(df_prod)

In [28]:
# N1 ----------------------------
# flatten list of sets of the prev_prod_list column  --> f1 
f1 = [val for sublist in [list(i) for i in ords.prev_prod_list.values] for val in sublist]

# N1: number of times a product occurs in the prev_prod_list column; count its recurrence in f1
N1 = np.bincount(f1, minlength=n_products+1)

In [29]:
# N11 ----------------------------
# flatten list of sets of the T11 column --> f11
f11 = [val for sublist in [list(i) for i in ords.T11.values] for val in sublist]

# N11: number of times a product occurs in the T11 column; count its recurrence in f11
N11 = np.bincount(f11, minlength=n_products+1)

In [30]:
# Calculate P11
# Probability that the product will be purchased in the next order given that it was purchased in the past order:
#   P11 = N11 / N1
#   N11 = no. of times the product was present in both the current and the past order
#   N1  = no. of times the product was present in the past order

product_probs = pd.DataFrame(
    data={
        'product_id': np.array(range(0, n_products+1)),
        'P11': (N11) / (N1)
    }
)
product_probs = product_probs[1:] # No Product No 0 exists.
product_probs.head()


Unnamed: 0,product_id,P11
1,1,0.254545
2,2,0.049383
3,3,0.389513
4,4,0.235714
5,5,0.266667
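
The counting behind P11 can be traced on a tiny example. The sketch below is an assumption-laden illustration: `user_baskets` and `n_products` are hypothetical, and `np.divide` with a `where` mask is a swapped-in way to leave products that never appear in a previous basket (N1 = 0) as NaN without a division warning; the notebook cell itself divides the arrays directly.

```python
import numpy as np

n_products = 3  # hypothetical catalogue with product ids 1..3

# One user's baskets in chronological order (hypothetical)
user_baskets = [{1, 2}, {1, 3}, {1, 2}]

# (previous basket, current basket) pairs for consecutive orders
pairs = list(zip(user_baskets[:-1], user_baskets[1:]))

# Flatten: products seen in a previous basket, and those repeated in the next one
f1 = [p for prev, _ in pairs for p in prev]
f11 = [p for prev, cur in pairs for p in prev & cur]

N1 = np.bincount(f1, minlength=n_products + 1)
N11 = np.bincount(f11, minlength=n_products + 1)

# P11 = N11 / N1, left as NaN where N1 == 0 (index 0 is unused: no product 0)
P11 = np.divide(N11, N1, out=np.full(n_products + 1, np.nan), where=N1 > 0)
print(P11)
```

Product 1 is carried over in both transitions, so its P11 is 2/2 = 1.0; products 2 and 3 each appear once in a previous basket and are never repeated in the following order, so their P11 is 0.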


# Combining both algorithms 

In [31]:
# Merging the weighted score data with the product probability data using a left join.
# weight_data only contains products the user has purchased before,
# so only previously purchased products receive both scores.
score_data = pd.merge(weight_data, product_probs, on = 'product_id', how='left')
score_data.head()

Unnamed: 0,user_id,product_id,order_number_x,weight,imp_score,P11
0,1,196,55,55.0,1.0,0.452491
1,1,10258,54,55.0,0.981818,0.359473
2,1,10326,5,55.0,0.090909,0.36446
3,1,12427,55,55.0,1.0,0.423948
4,1,13032,19,55.0,0.345455,0.254822


In [32]:
# Multiplying Weighted score(imp_score) with P11 to get final score
score_data['score'] = score_data.imp_score * score_data.P11
score_data.head()

Unnamed: 0,user_id,product_id,order_number_x,weight,imp_score,P11,score
0,1,196,55,55.0,1.0,0.452491,0.452491
1,1,10258,54,55.0,0.981818,0.359473,0.352938
2,1,10326,5,55.0,0.090909,0.36446,0.033133
3,1,12427,55,55.0,1.0,0.423948,0.423948
4,1,13032,19,55.0,0.345455,0.254822,0.088029


In [33]:
# Filter the top 10 products for each user based on the final score
rec = score_data.groupby(['user_id']).apply(lambda x: (x.sort_values(['user_id', 'score'], 
                                                  ascending=[True, False]).head(10)))
rec = rec.reset_index(drop = True)
rec.head()

Unnamed: 0,user_id,product_id,order_number_x,weight,imp_score,P11,score
0,1,196,55,55.0,1.0,0.452491,0.452491
1,1,12427,55,55.0,1.0,0.423948,0.423948
2,1,10258,54,55.0,0.981818,0.359473,0.352938
3,1,25133,52,55.0,0.945455,0.242766,0.229524
4,1,46149,27,55.0,0.490909,0.423815,0.208055


# Validation: Calculate Precision & Recall

In [34]:
# Filter out only the repurchased products in the test order for every user
reordered_test =op_test[(op_test['reordered'] == 1)]
reordered_test.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order,New_ID,product_id,add_to_cart_order,reordered
0,1187899,1,train,11,4,8,14.0,11,196,1,1
1,1187899,1,train,11,4,8,14.0,11,25133,2,1
2,1187899,1,train,11,4,8,14.0,11,38928,3,1
3,1187899,1,train,11,4,8,14.0,11,26405,4,1
4,1187899,1,train,11,4,8,14.0,11,39657,5,1


In [35]:
# data1 --> recommended product list
data1 = rec[['user_id', 'product_id']]

# data2 --> Actually purchased products in the test order
data2 = reordered_test[['user_id', 'product_id']]

In [36]:
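# Function to filter the products that were both recommended and actually purchased.
def dataframe_difference(df1, df2, which= 'both' ):
    """Outer-merge two DataFrames and select rows by the merge indicator.

    which='both' keeps rows present in both DataFrames;
    which=None keeps rows present in only one of them.
    """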
    comparison_df = df1.merge(df2,
                              indicator=True,
                              how='outer')
    if which is None:
        diff_df = comparison_df[comparison_df['_merge'] != 'both']
    else:
        diff_df = comparison_df[comparison_df['_merge'] == which]

    return diff_df

In [37]:
diff_df = dataframe_difference(data2, data1) # calling the function
diff_df = diff_df.reset_index(drop = True)
diff_df.head()

Unnamed: 0,user_id,product_id,_merge
0,1,196,both
1,1,25133,both
2,1,38928,both
3,1,39657,both
4,1,10258,both


In [38]:
# Number of Recommended Products --> Mostly 10 for all users 
# But can change if less than 10 products were purchased by the user in all baskets combined
rec = data1.groupby('user_id')['product_id'].count().to_frame('rec_count')

# Number of relevant products --> actually repurchased products in the test order
rel = data2.groupby('user_id')['product_id'].count().to_frame('rel_count')

# Count of the common products in both datasets
common = diff_df.groupby('user_id')['product_id'].count().to_frame('true_count')

# Dataset to calculate the accuracy ---> Merge above 3 datasets
accuracy = pd.concat([rec, rel, common], axis = 1)
accuracy = accuracy.fillna(0)
accuracy = accuracy.reset_index()
accuracy.head()

Unnamed: 0,user_id,rec_count,rel_count,true_count
0,1,10,10.0,8.0
1,2,10,12.0,3.0
2,3,10,6.0,3.0
3,4,10,0.0,0.0
4,5,10,4.0,2.0


In [39]:
# Calculate Precision and Recall
precision = []
recall = []
for i in range(len(accuracy)) :
    if accuracy["rec_count"][i] != 0 and accuracy["rel_count"][i] != 0: # Should not be zero as a denominator
        pre = accuracy["true_count"][i] / accuracy["rec_count"][i]
        re = accuracy["true_count"][i] / accuracy["rel_count"][i]
        precision.append(pre)
        recall.append(re)
    else:
        precision.append(0)
        recall.append(0)
        
accuracy["Precision"] = precision
accuracy["Recall"] = recall
accuracy.head()

Unnamed: 0,user_id,rec_count,rel_count,true_count,Precision,Recall
0,1,10,10.0,8.0,0.8,0.8
1,2,10,12.0,3.0,0.3,0.25
2,3,10,6.0,3.0,0.3,0.5
3,4,10,0.0,0.0,0.0,0.0
4,5,10,4.0,2.0,0.2,0.5
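
The per-user precision and recall above can also be computed without an explicit loop. The sketch below is a vectorized variant under assumptions: `toy_rec` and `toy_actual` are hypothetical stand-ins for the recommended and actually reordered products, and undefined ratios (0/0) are set to 0, which matches the convention used in the loop above.

```python
import pandas as pd

# Hypothetical recommendations and actual reorders, one row per (user, product)
toy_rec = pd.DataFrame({"user_id": [1, 1, 1, 2, 2],
                        "product_id": [10, 20, 30, 10, 40]})
toy_actual = pd.DataFrame({"user_id": [1, 1, 2],
                           "product_id": [10, 20, 50]})

# Hits: products that were both recommended and actually reordered
hits = toy_rec.merge(toy_actual, on=["user_id", "product_id"], how="inner")

# Per-user counts, aligned on user_id; users missing from a table get 0
counts = pd.concat([
    toy_rec.groupby("user_id").size().rename("rec_count"),
    toy_actual.groupby("user_id").size().rename("rel_count"),
    hits.groupby("user_id").size().rename("true_count"),
], axis=1).fillna(0)

# precision = hits / recommended, recall = hits / relevant; 0/0 becomes 0
counts["precision"] = (counts["true_count"] / counts["rec_count"]).fillna(0)
counts["recall"] = (counts["true_count"] / counts["rel_count"]).fillna(0)
print(counts)
```

User 1 gets 2 of 3 recommendations right and recovers both reordered products (precision 2/3, recall 1.0); user 2 has no hits, so both metrics are 0.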


In [41]:
pre = accuracy["Precision"].mean()
print(f"Precision = {pre*100} %")
re = accuracy["Recall"].mean()
print(f"Recall = {re*100} %")

Precision = 29.662976005144063 %
Recall = 53.11609499215047 %
