There are two commonly used approaches to product recommendation: user-based collaborative filtering and content-based filtering, each with its own advantages and disadvantages. Industry practice often combines both methods to produce an effective recommendation system. In this project, we focus on building a product recommendation system based on the collaborative filtering approach because it is better suited to our dataset.

Collaborative filtering algorithm:

Collaborative filtering recommends products or services to customers based on the behavior of similar previous customers (i.e., the products and services they viewed or purchased).

Our assumption: customers who have viewed and bought similar products or services are likely to view and buy similar products or services in the future. Based on market research and real case studies, this assumption tends to hold true.

Step 1: Load the dataset

In [1]:
import pandas as pd
df = pd.read_csv('onlinepurchase.csv')

Step 2: Examine the data

In [2]:
df.head()

,Invoice,StockCode,Description,Quantity,Invoice_time,Price,CustomerID,Country,Purchase_dt
0,493410,TEST001,This is a test product.,5,2010-01-04 9:24,4.5,12346.0,United Kingdom,2010-01-04
1,C493411,21539,RETRO SPOTS BUTTER DISH,-1,2010-01-04 9:43,4.25,14590.0,United Kingdom,2010-01-04
2,493412,TEST001,This is a test product.,5,2010-01-04 9:53,4.5,12346.0,United Kingdom,2010-01-04
3,493413,21724,PANDA AND BUNNIES STICKER SHEET,1,2010-01-04 9:54,0.85,,United Kingdom,2010-01-04
4,493413,84578,ELEPHANT TOY WITH BLUE T-SHIRT,1,2010-01-04 9:54,3.75,,United Kingdom,2010-01-04


In [3]:
df.shape

(1022664, 9)

In [4]:
df.columns

Index(['Invoice', 'StockCode', 'Description', 'Quantity', 'Invoice_time',
       'Price', 'CustomerID', 'Country', 'Purchase_dt'],
      dtype='object')

In [5]:
df['CustomerID'].nunique() # View the number of unique customers

5887

In [6]:
df.duplicated().sum() # View the number of duplicated records

11585

Step 3: Clean the data

In [7]:
df = df.drop_duplicates() # Remove duplicates from data

In [8]:
df = df.loc[df['Quantity'] > 0] # Keep only purchase records with Quantity > 0

In [9]:
# Remove records with missing values for columns CustomerID and Description

df = df.dropna(subset=['CustomerID', 'Description'])

In [10]:
# Export cleaned data to csv file

df.to_csv('onlinepurchase_clean.csv', sep = '|', index = False)
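
To see what the three cleaning steps do, here is a minimal sketch on a made-up six-row frame. The rows and the names `toy` and `clean` are illustrative assumptions, not taken from `onlinepurchase.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the purchase data; all values are invented for illustration.
toy = pd.DataFrame({
    "Invoice": ["1", "1", "C2", "3", "4", "5"],
    "StockCode": ["A", "A", "B", "C", "D", "E"],
    "Description": ["item a", "item a", "item b", None, "item d", "item e"],
    "Quantity": [2, 2, -1, 1, 3, 4],
    "CustomerID": [10.0, 10.0, 11.0, 12.0, np.nan, 13.0],
})

clean = (
    toy.drop_duplicates()                             # row 1 is an exact copy of row 0
       .loc[lambda d: d["Quantity"] > 0]              # drops the record with Quantity -1
       .dropna(subset=["CustomerID", "Description"])  # drops rows missing either field
)
print(clean)
```

Only the rows for stock codes `A` and `E` survive in this toy case: one duplicate, one non-positive quantity, one missing description and one missing customer ID are removed.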

Step 4: Build a customer-item matrix

In this step, we transform the data into a customer-item matrix, where each row represents a customer and each column corresponds to a product.

In [11]:
customer_item_matrix = df.pivot_table(
    index = 'CustomerID',
    columns = 'StockCode',
    values = 'Quantity',
    aggfunc = 'sum'
)
customer_item_matrix.head()

CustomerID,10002,10080,10120,10123C,10123G,10124A,10124G,10125,10133,10134,...,BANK CHARGES,C2,D,DOT,M,PADS,POST,SP1002,TEST001,TEST002
12346.0,,,,,,,,,,,...,,,,,,,,,45.0,1.0
12347.0,,,,,,,,,,,...,,,,,,,,,,
12348.0,,,,,,,,,,,...,,,,,,,10.0,,,
12349.0,,,,,,,,,,,...,,,,,,,3.0,,,
12350.0,,,,,,,,,,,...,,,,,,,1.0,,,
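
The behavior of `pivot_table` with `aggfunc = 'sum'` is easier to see on a small example. The sketch below uses made-up customer IDs and stock codes (the frame `toy` is an assumption for illustration): repeated purchases of the same item by the same customer are added together, and customer-item pairs with no purchase come out as NaN.

```python
import pandas as pd

# Hypothetical purchase records; customer IDs and stock codes are made up.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 3],
    "StockCode": ["A", "A", "B", "B", "C"],
    "Quantity": [2, 3, 1, 4, 5],
})

toy_matrix = toy.pivot_table(
    index="CustomerID",
    columns="StockCode",
    values="Quantity",
    aggfunc="sum",
)
print(toy_matrix)
```

Customer 1 bought item `A` twice (2 and 3 units), so that cell holds 5, while customer 2 never bought `A`, so that cell is NaN.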


Next, we 1-0 encode the customer-item matrix to indicate whether a customer has bought a given product (1 means purchased, 0 means not purchased).

In [12]:
# 1-0 encode the customer-item matrix: 1 if the summed quantity is positive, else 0
# (NaN > 0 evaluates to False, so items never bought become 0)
customer_item_matrix = (customer_item_matrix > 0).astype(int)
customer_item_matrix.head()

# Disadvantage: this method of 1-0 encoding does not take into account the volume
# (quantity) of a particular purchase

CustomerID,10002,10080,10120,10123C,10123G,10124A,10124G,10125,10133,10134,...,BANK CHARGES,C2,D,DOT,M,PADS,POST,SP1002,TEST001,TEST002
12346.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1
12347.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12348.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
12349.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
12350.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
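
The comment in the cell above notes that 1-0 encoding discards purchase volume. One possible alternative, which this notebook does not use, is to keep a damped version of the quantity. The sketch below contrasts the two on made-up numbers; the choice of `log1p` damping and the names `quantities`, `binary` and `damped` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical summed quantities; NaN means the customer never bought the item.
quantities = pd.DataFrame(
    {"A": [5.0, np.nan], "B": [1.0, 120.0]},
    index=pd.Index([1, 2], name="CustomerID"),
)

# The 1-0 encoding used in this notebook: any positive quantity becomes 1.
binary = (quantities > 0).astype(int)

# Assumption: log(1 + quantity) as a volume-aware alternative, so that a
# 120-unit purchase counts for more than a 1-unit purchase without dominating.
damped = np.log1p(quantities.fillna(0))

print(binary)
print(damped.round(3))
```

Whether volume should influence similarity at all depends on the business question, so this is a design option rather than a correction.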


Now we use scikit-learn's cosine_similarity function to build a user-based recommendation. This function computes the pairwise cosine similarities between samples (here, customers) and returns the results as an array, so we also need to convert the result back to a data frame.
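
For reference, the cosine similarity between two customers' rows $u$ and $v$ of the customer-item matrix follows the standard definition. The second equality holds only for 1-0 vectors, where $I_u$ and $I_v$ (notation introduced here, not used elsewhere in the notebook) denote the sets of items each customer bought.

```latex
\[
\cos(u, v) \;=\; \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
\;=\; \frac{\lvert I_u \cap I_v \rvert}{\sqrt{\lvert I_u \rvert \, \lvert I_v \rvert}}
\]
```

Two customers with no items in common therefore score 0, and a customer compared with itself scores 1.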

In [13]:
from sklearn.metrics.pairwise import cosine_similarity

In [23]:
customer_similarity_matrix = pd.DataFrame(cosine_similarity(customer_item_matrix))
customer_similarity_matrix

,0,1,2,3,4,5,6,7,8,9,...,5822,5823,5824,5825,5826,5827,5828,5829,5830,5831
0,1.000000,0.000000,0.000000,0.131060,0.000000,0.000000,0.023002,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.070535,0.000000
1,0.000000,1.000000,0.053452,0.045502,0.043214,0.038881,0.031944,0.055728,0.023395,0.090351,...,0.037987,0.000000,0.067344,0.064820,0.102869,0.113961,0.067344,0.000000,0.076186,0.024398
2,0.000000,0.053452,1.000000,0.017025,0.048507,0.000000,0.023905,0.000000,0.026261,0.067612,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.138580,0.000000,0.000000,0.000000,0.054772
3,0.131060,0.045502,0.017025,1.000000,0.041292,0.037152,0.152617,0.071000,0.044710,0.057555,...,0.054447,0.000000,0.032174,0.020646,0.049147,0.140654,0.016087,0.000000,0.062399,0.038854
4,0.000000,0.043214,0.048507,0.041292,1.000000,0.000000,0.028989,0.050572,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.058824,0.000000,0.038782,0.000000,0.000000,0.029630,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5827,0.000000,0.113961,0.138580,0.140654,0.038782,0.034893,0.140153,0.066683,0.125976,0.018019,...,0.034091,0.044348,0.030218,0.077563,0.015386,1.000000,0.090655,0.015386,0.110698,0.087581
5828,0.000000,0.067344,0.000000,0.016087,0.000000,0.041239,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.090655,1.000000,0.054554,0.069264,0.120761
5829,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.034503,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.015386,0.054554,1.000000,0.105802,0.026352
5830,0.070535,0.076186,0.000000,0.062399,0.029630,0.026660,0.087612,0.000000,0.016042,0.082602,...,0.000000,0.000000,0.069264,0.000000,0.070535,0.110698,0.069264,0.105802,1.000000,0.066915


After the cosine similarity computation, each row and each column of the result represents an individual customer, but both are labeled with positional integers. We therefore relabel the index and the columns with the CustomerID values.

In [18]:
customer_item_matrix.index

Float64Index([12346.0, 12347.0, 12348.0, 12349.0, 12350.0, 12351.0, 12352.0,
              12353.0, 12354.0, 12355.0,
              ...
              18278.0, 18279.0, 18280.0, 18281.0, 18282.0, 18283.0, 18284.0,
              18285.0, 18286.0, 18287.0],
             dtype='float64', name='CustomerID', length=5832)

In [24]:
customer_similarity_matrix.index = customer_item_matrix.index
customer_similarity_matrix.columns = customer_item_matrix.index
customer_similarity_matrix

CustomerID,12346.0,12347.0,12348.0,12349.0,12350.0,12351.0,12352.0,12353.0,12354.0,12355.0,...,18278.0,18279.0,18280.0,18281.0,18282.0,18283.0,18284.0,18285.0,18286.0,18287.0
12346.0,1.000000,0.000000,0.000000,0.131060,0.000000,0.000000,0.023002,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.070535,0.000000
12347.0,0.000000,1.000000,0.053452,0.045502,0.043214,0.038881,0.031944,0.055728,0.023395,0.090351,...,0.037987,0.000000,0.067344,0.064820,0.102869,0.113961,0.067344,0.000000,0.076186,0.024398
12348.0,0.000000,0.053452,1.000000,0.017025,0.048507,0.000000,0.023905,0.000000,0.026261,0.067612,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.138580,0.000000,0.000000,0.000000,0.054772
12349.0,0.131060,0.045502,0.017025,1.000000,0.041292,0.037152,0.152617,0.071000,0.044710,0.057555,...,0.054447,0.000000,0.032174,0.020646,0.049147,0.140654,0.016087,0.000000,0.062399,0.038854
12350.0,0.000000,0.043214,0.048507,0.041292,1.000000,0.000000,0.028989,0.050572,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.058824,0.000000,0.038782,0.000000,0.000000,0.029630,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18283.0,0.000000,0.113961,0.138580,0.140654,0.038782,0.034893,0.140153,0.066683,0.125976,0.018019,...,0.034091,0.044348,0.030218,0.077563,0.015386,1.000000,0.090655,0.015386,0.110698,0.087581
18284.0,0.000000,0.067344,0.000000,0.016087,0.000000,0.041239,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.090655,1.000000,0.054554,0.069264,0.120761
18285.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.034503,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.015386,0.054554,1.000000,0.105802,0.026352
18286.0,0.070535,0.076186,0.000000,0.062399,0.029630,0.026660,0.087612,0.000000,0.016042,0.082602,...,0.000000,0.000000,0.069264,0.000000,0.070535,0.110698,0.069264,0.105802,1.000000,0.066915
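
As a possible shortcut, the labels can be passed when the data frame is constructed rather than assigned afterwards. The sketch below shows this on a made-up 1-0 matrix; `toy_matrix` and `toy_similarity` are hypothetical names, and the customer IDs are invented.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical 1-0 customer-item matrix with three customers and three items.
toy_matrix = pd.DataFrame(
    [[1, 1, 0], [1, 0, 0], [0, 0, 1]],
    index=pd.Index([101.0, 102.0, 103.0], name="CustomerID"),
    columns=["A", "B", "C"],
)

# Pass the customer labels directly instead of renaming index and columns later.
toy_similarity = pd.DataFrame(
    cosine_similarity(toy_matrix),
    index=toy_matrix.index,
    columns=toy_matrix.index,
)
print(toy_similarity.round(3))
```

Customers 101 and 102 share one item out of two and one respectively, giving 1/sqrt(2), while customer 103 shares nothing with either and scores 0.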


The cosine similarity values here range from 0 to 1, where 0 means the two customers have no purchased items in common and 1 means their sets of purchased items are identical (as on the diagonal, where each customer is compared with itself). On this scale we can interpret the relative similarity between customers. For example, the cosine similarity between customers 12347 and 12348 is 0.053452, whereas the value between customers 12347 and 12349 is 0.045502. We can therefore say that customer 12347's purchase behavior is more similar to that of customer 12348 than to that of customer 12349.

Step 5: Generate recommendation list of items

These pairwise cosine similarities are what we will use as the measure for recommending a list of items. We start by looking at the cosine similarities for a particular customer and picking out the other customer who is most similar to the chosen one.

In [25]:
# Choose a particular customer to be compared with other customers.
# In this case, we're using customer 12350

customer_similarity_matrix.loc[12350.0]

CustomerID
12346.0    0.000000
12347.0    0.043214
12348.0    0.048507
12349.0    0.041292
12350.0    1.000000
             ...   
18283.0    0.038782
18284.0    0.000000
18285.0    0.000000
18286.0    0.029630
18287.0    0.000000
Name: 12350.0, Length: 5832, dtype: float64

In [26]:
# Sort the cosine similarity values from highest to lowest
# Then pick out the customer that is most similar to customer 12350
# In this case, it's customer 12568 with the cosine similarity of 0.21693

customer_similarity_matrix.loc[12350.0].sort_values(ascending=False)

CustomerID
12350.0    1.000000
12568.0    0.216930
16886.0    0.171499
12503.0    0.171499
12814.0    0.171499
             ...   
15835.0    0.000000
15829.0    0.000000
15828.0    0.000000
15827.0    0.000000
12346.0    0.000000
Name: 12350.0, Length: 5832, dtype: float64
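
The top entry of the sorted result is always the customer itself, so the most similar other customer has to be read from the second row. A minimal sketch of doing this programmatically is below; the Series `row` is a hand-built stand-in that reuses a few of the values printed above, not the full similarity row.

```python
import pandas as pd

# Stand-in for customer_similarity_matrix.loc[12350.0], using values from the output above.
row = pd.Series(
    {12350.0: 1.000000, 12568.0: 0.216930, 16886.0: 0.171499, 12346.0: 0.000000},
    name=12350.0,
)

# Drop the customer's own entry (always 1.0), then take the label of the maximum.
most_similar = row.drop(12350.0).idxmax()
print(most_similar)  # 12568.0
```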

With these two customers, we use the following strategy to build a recommendation list for the target customer 12568:
First, we find the items previously purchased by customer 12350 and those purchased by customer 12568.
Then, we recommend to the target customer 12568 the items that he or she has not bought but that customer 12350 has.
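
The two steps above can be sketched as a small reusable function. This is an illustrative helper under stated assumptions, not the notebook's own code: the name `recommend_from_neighbour` and the toy matrix are hypothetical, and the function expects a 1-0 customer-item matrix like the one built in Step 4.

```python
import pandas as pd

def recommend_from_neighbour(matrix: pd.DataFrame, source, target) -> set:
    """Return items bought by `source` that `target` has not bought (hypothetical helper)."""
    bought_by_source = set(matrix.columns[matrix.loc[source] > 0])
    bought_by_target = set(matrix.columns[matrix.loc[target] > 0])
    return bought_by_source - bought_by_target

# Made-up 1-0 matrix: rows are customers, columns are stock codes.
toy_matrix = pd.DataFrame(
    [[1, 1, 1, 0], [0, 1, 0, 1]],
    index=pd.Index([1.0, 2.0], name="CustomerID"),
    columns=["A", "B", "C", "D"],
)

# Customer 1.0 bought A, B, C; customer 2.0 bought B, D; so A and C are recommended.
print(recommend_from_neighbour(toy_matrix, source=1.0, target=2.0))
```

Selecting column labels with a boolean mask avoids the intermediate positional indexes, at the cost of being slightly less explicit about each step.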

In [29]:
customer_item_matrix.head()

CustomerID,10002,10080,10120,10123C,10123G,10124A,10124G,10125,10133,10134,...,BANK CHARGES,C2,D,DOT,M,PADS,POST,SP1002,TEST001,TEST002
12346.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1
12347.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12348.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
12349.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
12350.0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0


In [53]:
# Get the indexes of items purchased by customer 12350

item_index_12350 = customer_item_matrix.loc[12350].to_numpy().nonzero()
item_index_12350

(array([ 172,  179,  556, 1104, 1120, 1122, 1161, 1167, 1528, 1582, 1713,
        1718, 1780, 3257, 3277, 3381, 4585]),)

In [83]:
# Get the StockCodes of items bought by customer 12350

items_bought_12350 = set(customer_item_matrix.loc[12350].iloc[item_index_12350].index)
items_bought_12350

{'20615',
 '20652',
 '21171',
 '21832',
 '21864',
 '21866',
 '21908',
 '21915',
 '22348',
 '22412',
 '22551',
 '22557',
 '22620',
 '79066K',
 '79191C',
 '84086C',
 'POST'}

We apply the same code to the target customer 12568.

In [87]:
item_index_12568 = customer_item_matrix.loc[12568.0].to_numpy().nonzero()
item_index_12568

(array([  58,  201, 1528, 3037, 4585]),)

In [88]:
items_bought_12568 = set(customer_item_matrix.loc[12568.0].iloc[item_index_12568].index)
items_bought_12568

{'16161P', '20676', '22348', '47570', 'POST'}
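
As a sanity check, the two item sets can be used to reproduce the similarity of 0.216930 reported for this pair. For 1-0 vectors, cosine similarity reduces to the number of shared items divided by the square root of the product of the two item counts. The sketch below copies the StockCodes from the two outputs above; the variable names are chosen here for illustration.

```python
import math

# StockCodes copied from the outputs of the two cells above.
items_12350 = {'20615', '20652', '21171', '21832', '21864', '21866', '21908', '21915',
               '22348', '22412', '22551', '22557', '22620', '79066K', '79191C', '84086C',
               'POST'}
items_12568 = {'16161P', '20676', '22348', '47570', 'POST'}

shared = items_12350 & items_12568
similarity = len(shared) / math.sqrt(len(items_12350) * len(items_12568))
print(sorted(shared), round(similarity, 6))
```

The two customers share `22348` and `POST`, so the value is 2 / sqrt(17 × 5) ≈ 0.216930, which agrees with the similarity matrix.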

To find the items that customer 12350 has bought but the target customer 12568 has not, we can take a simple set difference:

In [90]:
items_recommended_to_12568 = items_bought_12350 - items_bought_12568
items_recommended_to_12568

{'20615',
 '20652',
 '21171',
 '21832',
 '21864',
 '21866',
 '21908',
 '21915',
 '22412',
 '22551',
 '22557',
 '22620',
 '79066K',
 '79191C',
 '84086C'}

Next, we look up the item descriptions for these recommended StockCodes.

In [105]:
# Flag the rows of the dataset whose StockCode is in the recommended set

df['StockCode'].isin(items_recommended_to_12568)

0          False
2          False
6          False
7          False
8          False
           ...  
1022655    False
1022656    False
1022657    False
1022658    False
1022663    False
Name: StockCode, Length: 763716, dtype: bool

In [107]:
# Extract the StockCode and Description columns for all records matching
# the recommended items

df.loc[df['StockCode'].isin(items_recommended_to_12568), ['StockCode', 'Description']]

,StockCode,Description
449,21171,BATHROOM METAL SIGN
915,21864,UNION JACK FLAG PASSPORT COVER
993,21908,CHOCOLATE THIS WAY METAL SIGN
1231,21908,CHOCOLATE THIS WAY METAL SIGN
1764,21908,CHOCOLATE THIS WAY METAL SIGN
...,...,...
1019881,20615,BLUE SPOTTY PASSPORT COVER
1019903,21832,CHOCOLATE CALCULATOR
1021303,21908,CHOCOLATE THIS WAY METAL SIGN
1022275,21171,BATHROOM METAL SIGN


In [1]:
# Since these are records of historical purchases, we need to drop duplicates to get
# the unique descriptions of recommended items

desc_item_recomm_12568 = \
df.loc[df['StockCode'].isin(items_recommended_to_12568), ['StockCode', 'Description']]\
.drop_duplicates().set_index('StockCode')

desc_item_recomm_12568

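
To illustrate what the de-duplication step is expected to produce, here is a minimal sketch on a small made-up purchase history. The frame `history`, the stock code `99999` and its description are hypothetical; the other two descriptions are taken from the output of the previous cell.

```python
import pandas as pd

# Hypothetical purchase history with repeated rows for the same product.
history = pd.DataFrame({
    "StockCode": ["21171", "21908", "21908", "99999"],
    "Description": ["BATHROOM METAL SIGN", "CHOCOLATE THIS WAY METAL SIGN",
                    "CHOCOLATE THIS WAY METAL SIGN", "NOT RECOMMENDED"],
})
recommended = {"21171", "21908"}

descriptions = (
    history.loc[history["StockCode"].isin(recommended), ["StockCode", "Description"]]
           .drop_duplicates()        # one row per distinct (StockCode, Description) pair
           .set_index("StockCode")
)
print(descriptions)
```

Note that `drop_duplicates` works on the (StockCode, Description) pair, so a StockCode that appears with more than one description in the data would still occupy several rows.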

Disadvantage: the customer-based recommendation method depends on historical purchase data. This poses a problem for newly acquired customers, about whose purchasing behavior we may know little. We can mitigate this problem with content-based recommendation, in which we might ask new customers a set of questions to establish a baseline of their preferences and then make recommendations based on that information. In practice, collaborative and content-based recommendation are often used together.