# Introduction

Why is this important?
- Studies have shown that personalized product recommendations improve conversion rates and customer retention rates.

**Collaborative filtering and product recommendation**
According to a study conducted by Salesforce, customers who are shown personalized product recommendations drive 24% of orders and 26% of revenue.

Product recommendations also lead to repeat visits, purchases made through recommendations yield a higher average order value, and customers do buy recommended items.

**Product recommender system**
> A system whose goal is to predict and compile a list of items that a customer is likely to purchase.

Two ways to produce a list of recommendations:
1. Collaborative filtering
> Based on previous user behavior, e.g. pages viewed, products purchased, or ratings given to different items.

Assumption:
Customers who have viewed or purchased similar content or products in the past are likely to view or purchase similar kinds of content or products in the future.

2. Content-based filtering
> Based on the characteristics of an item or a user. It typically looks at the keywords that describe the characteristics of an item.

Assumption:
Users are likely to view or purchase items whose characteristics are similar to those of items they have bought or viewed in the past.

![collaborative-vs-content.png](attachment:collaborative-vs-content.png)
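
To make the content-based idea concrete, here is a minimal sketch that ranks items by keyword overlap. Everything in it is an assumption for illustration: the catalog, the keywords, and the choice of Jaccard similarity as the overlap measure (this notebook does not prescribe one).

```python
# Hypothetical catalog: item name -> set of descriptive keywords (illustrative only).
item_keywords = {
    "ceramic storage jar": {"kitchen", "ceramic", "storage"},
    "spice tins": {"kitchen", "tin", "storage"},
    "passport cover": {"travel", "accessory"},
}


def jaccard(a, b):
    """Share of keywords two items have in common (0 = none, 1 = identical)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_by_keywords(item, catalog):
    """Rank every other item by keyword overlap with `item`, best match first."""
    target = catalog[item]
    scores = {
        other: jaccard(target, keywords)
        for other, keywords in catalog.items()
        if other != item
    }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


ranking = rank_by_keywords("ceramic storage jar", item_keywords)
print(ranking)
```

In this toy catalog the spice tins share two of four distinct keywords with the storage jar, so they rank above the passport cover, which shares none.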

# Deep Dive into Collaborative Filtering

The algorithm:
1. Build a user-to-item matrix. It has individual users in the rows and individual items in the columns.

![user-to-item-matrix.png](attachment:user-to-item-matrix.png)

2. Choose an approach. There are two:
- User-based approach ==> Similarity between users
- Item-based approach ==> Similarity between items (transpose the user-to-item matrix, then calculate the similarities).

3. Measure the similarities (according to your approach). One common formula is cosine similarity:


The cosine similarity between two vectors $\mathbf{a}$ and $\mathbf{b}$ is calculated using the formula:

$$
\text{similarity}(\mathbf{a}, \mathbf{b}) = \frac{\sum_{i=1}^{n} a_i \cdot b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \cdot \sqrt{\sum_{i=1}^{n} b_i^2}}
$$

Where:

- $a_i$ and $b_i$ are the components of vectors $\mathbf{a}$ and $\mathbf{b}$ respectively.
- $n$ is the dimensionality of the vectors.
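
As a quick check of the formula, the sketch below computes cosine similarity by hand with NumPy on a made-up user-to-item matrix. The matrix values and the function name `cosine_sim` are assumptions for illustration. Returning 0 for an all-zero vector is also a choice made here, not part of the formula.

```python
import numpy as np


def cosine_sim(a, b):
    """Cosine similarity of two 1-D vectors, following the formula above."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.sqrt(np.sum(a ** 2)) * np.sqrt(np.sum(b ** 2))
    # Assumption: similarity with an all-zero vector is reported as 0
    # instead of dividing by zero.
    return float(np.dot(a, b) / denom) if denom else 0.0


# Toy user-to-item matrix (made-up data): rows are users, columns are items.
user_item = np.array([
    [1, 1, 0, 0],  # user 0
    [1, 1, 1, 0],  # user 1
    [0, 0, 0, 1],  # user 2
])

# User-based approach: compare rows.
sim_users_0_1 = cosine_sim(user_item[0], user_item[1])  # 2 / sqrt(6), about 0.816
sim_users_0_2 = cosine_sim(user_item[0], user_item[2])  # no items in common

# Item-based approach: transpose so that items become rows, then compare.
item_user = user_item.T
sim_items_0_1 = cosine_sim(item_user[0], item_user[1])  # bought by the same users

print(sim_users_0_1, sim_users_0_2, sim_items_0_1)
```

Users 0 and 1 share two items, so their similarity is high, while users 0 and 2 share none and score 0. The same function works for items once the matrix is transposed.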



# Exercise

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
df = pd.read_csv('data/Online retail.csv')
print(df.info())
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB
None


| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |


In [4]:
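# Data preparation
# Keep only rows with a positive quantity
df = df.loc[df['Quantity'] > 0]
# Drop rows without a CustomerID, since they cannot be attributed to a customer
df = df.dropna(subset=['CustomerID'])
df.info()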

<class 'pandas.core.frame.DataFrame'>
Int64Index: 397924 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    397924 non-null  object 
 1   StockCode    397924 non-null  object 
 2   Description  397924 non-null  object 
 3   Quantity     397924 non-null  int64  
 4   InvoiceDate  397924 non-null  object 
 5   UnitPrice    397924 non-null  float64
 6   CustomerID   397924 non-null  float64
 7   Country      397924 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 27.3+ MB


**Building Customer-Item Matrix**

In [8]:
# Building a customer-item matrix
customer_item_matrix = df.pivot_table(index='CustomerID', 
                                      columns='StockCode',
                                      values='Quantity',
                                      aggfunc='sum')

customer_item_matrix.head()

| CustomerID (rows) / StockCode (columns) | 10002 | 10080 | 10120 | 10123C | 10124A | 10124G | 10125 | 10133 | 10135 | 11001 | ... | 90214V | 90214W | 90214Y | 90214Z | BANK CHARGES | C2 | DOT | M | PADS | POST |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 12346.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 12347.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 12348.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9.0 |
| 12349.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 |
| 12350.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 |


In [9]:
# Encode as 1 if the customer bought the item at least once, else 0 (NaN > 0 is False)
customer_item_matrix = (customer_item_matrix > 0).astype(int)

customer_item_matrix.head()

| CustomerID (rows) / StockCode (columns) | 10002 | 10080 | 10120 | 10123C | 10124A | 10124G | 10125 | 10133 | 10135 | 11001 | ... | 90214V | 90214W | 90214Y | 90214Z | BANK CHARGES | C2 | DOT | M | PADS | POST |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 12346.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 12347.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 12348.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 12349.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 12350.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
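
The two steps above (pivot, then encode as 0/1) are easier to follow on a tiny example. The sketch below uses made-up transactions whose column names mirror the dataset. The values are assumptions for illustration only.

```python
import pandas as pd

# Made-up transactions; column names mirror the dataset used in this notebook.
toy = pd.DataFrame({
    "CustomerID": [1.0, 1.0, 1.0, 2.0, 3.0],
    "StockCode": ["A", "A", "B", "B", "C"],
    "Quantity": [2, 3, 1, 4, 6],
})

# Sum quantities per (customer, item); pairs that never occur become NaN.
toy_matrix = toy.pivot_table(index="CustomerID", columns="StockCode",
                             values="Quantity", aggfunc="sum")

# 1 if the customer bought the item at least once, else 0 (NaN > 0 is False).
toy_binary = (toy_matrix > 0).astype(int)
print(toy_binary)
```

Customer 1.0 bought item A twice (2 + 3 units), which the pivot sums to 5 and the encoding reduces to a single 1. The binary matrix records whether a purchase happened, not how much was bought.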


**User-based approach**

In [12]:
from sklearn.metrics.pairwise import cosine_similarity

# User-based approach: cosine similarity between every pair of customers (rows)
user_user_sim_matrix = pd.DataFrame(
    cosine_similarity(customer_item_matrix),
    index=customer_item_matrix.index,
    columns=customer_item_matrix.index,
)
user_user_sim_matrix.head()

| CustomerID | 12346.0 | 12347.0 | 12348.0 | 12349.0 | 12350.0 | 12352.0 | 12353.0 | 12354.0 | 12355.0 | 12356.0 | ... | 18273.0 | 18274.0 | 18276.0 | 18277.0 | 18278.0 | 18280.0 | 18281.0 | 18282.0 | 18283.0 | 18287.0 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 12346.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 12347.0 | 0.0 | 1.0 | 0.063022 | 0.04613 | 0.047795 | 0.038484 | 0.0 | 0.025876 | 0.136641 | 0.094742 | ... | 0.0 | 0.029709 | 0.052668 | 0.0 | 0.032844 | 0.062318 | 0.0 | 0.113776 | 0.109364 | 0.012828 |
| 12348.0 | 0.0 | 0.063022 | 1.0 | 0.024953 | 0.051709 | 0.027756 | 0.0 | 0.027995 | 0.118262 | 0.146427 | ... | 0.0 | 0.064282 | 0.113961 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.170905 | 0.083269 |
| 12349.0 | 0.0 | 0.04613 | 0.024953 | 1.0 | 0.056773 | 0.137137 | 0.0 | 0.030737 | 0.032461 | 0.144692 | ... | 0.0 | 0.105868 | 0.0 | 0.0 | 0.039014 | 0.0 | 0.0 | 0.067574 | 0.137124 | 0.030475 |
| 12350.0 | 0.0 | 0.047795 | 0.051709 | 0.056773 | 1.0 | 0.031575 | 0.0 | 0.0 | 0.0 | 0.033315 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.044866 | 0.0 |


In [15]:
# Example: the customers most similar to customer 12350

user_user_sim_matrix.loc[12350.0].sort_values(ascending=False).head()

CustomerID
12350.0    1.000000
17935.0    0.183340
12414.0    0.181902
12652.0    0.175035
16692.0    0.171499
Name: 12350.0, dtype: float64
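
The top row of this output is always the customer itself, with similarity 1. A small helper can drop that row before ranking. The sketch below shows the idea on a made-up matrix. The helper name `most_similar_customers` and the toy data are assumptions, not part of the notebook.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Made-up binary customer-item matrix (rows: customers, columns: items).
toy_matrix = pd.DataFrame(
    [[1, 1, 0, 0],
     [1, 1, 1, 0],
     [0, 0, 1, 1]],
    index=pd.Index([101, 102, 103], name="CustomerID"),
    columns=["A", "B", "C", "D"],
)

toy_sim = pd.DataFrame(
    cosine_similarity(toy_matrix),
    index=toy_matrix.index,
    columns=toy_matrix.index,
)


def most_similar_customers(sim_matrix, customer_id, n=5):
    """Hypothetical helper: top-n neighbours of customer_id, excluding the customer itself."""
    return (sim_matrix.loc[customer_id]
            .drop(customer_id)
            .sort_values(ascending=False)
            .head(n))


neighbours = most_similar_customers(toy_sim, 101, n=2)
print(neighbours)
```

Customer 102 shares two items with customer 101 and ranks first. Customer 103 shares none and ranks last with similarity 0.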

In [23]:
# Build recommendations for customer 17935.0 from the purchases of customer 12350.0.
# A = 12350.0
# B = 17935.0

items_bought_by_A = set(customer_item_matrix.loc[12350.0][
    customer_item_matrix.loc[12350.0] != 0].index)
items_bought_by_A

{'20615',
 '20652',
 '21171',
 '21832',
 '21864',
 '21866',
 '21908',
 '21915',
 '22348',
 '22412',
 '22551',
 '22557',
 '22620',
 '79066K',
 '79191C',
 '84086C',
 'POST'}

In [24]:
items_bought_by_B = set(customer_item_matrix.loc[17935.0][
    customer_item_matrix.loc[17935.0] != 0].index)
items_bought_by_B

{'20657',
 '20659',
 '20828',
 '20856',
 '21051',
 '21866',
 '21867',
 '22208',
 '22209',
 '22210',
 '22211',
 '22449',
 '22450',
 '22551',
 '22553',
 '22557',
 '22640',
 '22659',
 '22749',
 '22752',
 '22753',
 '22754',
 '22755',
 '23290',
 '23292',
 '23309',
 '85099B',
 'POST'}

In [25]:
items_to_recommend_to_B = items_bought_by_A - items_bought_by_B
items_to_recommend_to_B

{'20615',
 '20652',
 '21171',
 '21832',
 '21864',
 '21908',
 '21915',
 '22348',
 '22412',
 '22620',
 '79066K',
 '79191C',
 '84086C'}

In [31]:
df.loc[df['StockCode'].isin(items_to_recommend_to_B)]\
    .set_index('StockCode')['Description']\
    .drop_duplicates()

StockCode
21832                CHOCOLATE CALCULATOR
21915              RED  HARMONICA IN BOX 
22620         4 TRADITIONAL SPINNING TOPS
79066K                     RETRO MOD TRAY
21864     UNION JACK FLAG PASSPORT COVER 
79191C        RETRO PLASTIC ELEPHANT TRAY
21908       CHOCOLATE THIS WAY METAL SIGN
20615        BLUE POLKADOT PASSPORT COVER
20652          BLUE POLKADOT LUGGAGE TAG 
22348         TEA BAG PLATE RED RETROSPOT
22412     METAL SIGN NEIGHBOURHOOD WITCH 
21171                BATHROOM METAL SIGN 
84086C            PINK/PURPLE RETRO RADIO
Name: Description, dtype: object
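
The set-difference steps above can be wrapped into reusable functions. This is a minimal sketch on a made-up matrix. The function names `items_bought` and `recommend_from_neighbour` are hypothetical and only illustrate the logic used in the cells above.

```python
import pandas as pd

# Made-up binary customer-item matrix (rows: customers, columns: items).
toy_matrix = pd.DataFrame(
    [[1, 1, 0, 0],
     [1, 0, 1, 0]],
    index=pd.Index(["A", "B"], name="CustomerID"),
    columns=["item1", "item2", "item3", "item4"],
)


def items_bought(matrix, customer_id):
    """Set of item codes with a non-zero entry for customer_id."""
    row = matrix.loc[customer_id]
    return set(row[row != 0].index)


def recommend_from_neighbour(matrix, source_id, target_id):
    """Hypothetical helper: items that source_id bought and target_id has not."""
    return items_bought(matrix, source_id) - items_bought(matrix, target_id)


to_recommend = recommend_from_neighbour(toy_matrix, "A", "B")
print(to_recommend)
```

Both customers bought `item1`, so it is excluded. Only `item2`, which A bought and B did not, is recommended to B.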

NOTE:
- For new customers, we will not have enough purchase history to compare them against other customers.
- To handle this problem, we can use item-based collaborative filtering.

**Item-based approach**

In [32]:
# Item-based approach: transpose so that items are rows, then compare items
item_item_sim_matrix = pd.DataFrame(
    cosine_similarity(customer_item_matrix.T),
    index=customer_item_matrix.columns,
    columns=customer_item_matrix.columns,
)

item_item_sim_matrix.head()

| StockCode | 10002 | 10080 | 10120 | 10123C | 10124A | 10124G | 10125 | 10133 | 10135 | 11001 | ... | 90214V | 90214W | 90214Y | 90214Z | BANK CHARGES | C2 | DOT | M | PADS | POST |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10002 | 1.0 | 0.0 | 0.094868 | 0.091287 | 0.0 | 0.0 | 0.090351 | 0.062932 | 0.098907 | 0.095346 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.029361 | 0.0 | 0.066915 | 0.0 | 0.078217 |
| 10080 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.032774 | 0.045655 | 0.047836 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.016182 | 0.0 | 0.0 |
| 10120 | 0.094868 | 0.0 | 1.0 | 0.11547 | 0.0 | 0.0 | 0.057143 | 0.059702 | 0.041703 | 0.060302 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.070535 | 0.0 | 0.010993 |
| 10123C | 0.091287 | 0.0 | 0.11547 | 1.0 | 0.0 | 0.0 | 0.164957 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 10124A | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.447214 | 0.063888 | 0.044499 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


NOTES:
The strategy for recommending products with this item-to-item similarity matrix:
1. For a product that the target customer bought, find the most similar items.
2. Recommend these similar items to the customer, since they were bought by other customers who also bought that product.
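
The two-step strategy above can be sketched on a tiny made-up matrix. The item names, the helper `top_similar_items`, and the choice to exclude the purchased item from its own list are assumptions for illustration.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Made-up binary customer-item matrix (rows: customers, columns: items).
toy_matrix = pd.DataFrame(
    [[1, 1, 0],
     [1, 1, 0],
     [1, 0, 1]],
    index=pd.Index([1, 2, 3], name="CustomerID"),
    columns=pd.Index(["jar", "lid", "tray"], name="StockCode"),
)

# Items become rows after transposing, so the similarities are item-to-item.
toy_item_sim = pd.DataFrame(
    cosine_similarity(toy_matrix.T),
    index=toy_matrix.columns,
    columns=toy_matrix.columns,
)


def top_similar_items(item_sim, stock_code, n=10):
    """Hypothetical helper: the n items most similar to stock_code, excluding itself."""
    return list(item_sim.loc[stock_code]
                .drop(stock_code)
                .sort_values(ascending=False)
                .head(n)
                .index)


suggestions = top_similar_items(toy_item_sim, "jar", n=2)
print(suggestions)
```

Two of the three customers who bought `jar` also bought `lid`, and only one bought `tray`, so `lid` ranks first. Dropping the queried item avoids recommending something the customer already bought.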


In [35]:
# Assume a new customer just bought a product with StockCode 23166

top_10_similar_items = list(item_item_sim_matrix.loc['23166']\
                           .sort_values(ascending=False)\
                           .iloc[:10].index)

top_10_similar_items

['23166',
 '23165',
 '23167',
 '22993',
 '23307',
 '22722',
 '22720',
 '22666',
 '23243',
 '22961']

In [37]:
df.loc[df['StockCode'].isin(top_10_similar_items)]\
    .set_index('StockCode')['Description']\
    .drop_duplicates().loc[top_10_similar_items]

StockCode
23166         MEDIUM CERAMIC TOP STORAGE JAR
23165          LARGE CERAMIC TOP STORAGE JAR
23167         SMALL CERAMIC TOP STORAGE JAR 
22993           SET OF 4 PANTRY JELLY MOULDS
23307    SET OF 60 PANTRY DESIGN CAKE CASES 
22722      SET OF 6 SPICE TINS PANTRY DESIGN
22720      SET OF 3 CAKE TINS PANTRY DESIGN 
22666        RECIPE BOX PANTRY YELLOW DESIGN
23243    SET OF TEA COFFEE SUGAR TINS PANTRY
22961                 JAM MAKING SET PRINTED
Name: Description, dtype: object

NOTES:
- The first item is the one the target customer just bought. The other nine are items frequently bought by other customers who also bought the first item.
- With this data, you can include these items in your marketing messages for this target customer as further product recommendations.
- Personalizing the marketing messages with targeted product recommendations typically yields higher conversion rates from customers.
- Using an item-based collaborative filtering algorithm, you can now easily do product recommendations for both new and existing customers.