# Collaborative filtering with Surprise
The first part uses cosine similarity and KNN, as in the other CF notebook.
## The second part (user-to-user CF) uses:
- Matrix factorization: factorize the item-user matrix into an item matrix and a user matrix
- Co-clustering: cluster the rows and the columns simultaneously
- Singular value decomposition (SVD): similar to PCA

# What is Collaborative Filtering?

Collaborative filtering is the predictive process behind recommendation engines. Recommendation engines analyze information about users with similar tastes to assess the probability that a target individual will enjoy something.

Collaborative filtering uses algorithms to filter data from user reviews to make personalized recommendations for users with similar preferences. Collaborative filtering is also used to select content and advertising for individuals on social media.

Collaborative filtering filters information by using the interactions and data collected by the system from other users. For example, when we want to find a new movie to watch, we often ask our friends for recommendations.

Naturally, we place greater trust in recommendations from friends whose tastes are similar to our own. Collaborative filtering does the same job: it finds users who are similar to each other and recommends to each of them the items the others liked. Similarity can be measured in several ways, including cosine similarity, Pearson correlation and Jaccard similarity.
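
To make the three measures concrete, here is a minimal sketch on two made-up purchase vectors. The function names (`cosine_sim`, `pearson_sim`, `jaccard_sim`) and the data are illustrative assumptions and are not part of this notebook.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def pearson_sim(a, b):
    """Pearson correlation: cosine similarity of the mean-centred vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_sim([x - mean_a for x in a], [y - mean_b for y in b])

def jaccard_sim(a, b):
    """Items bought by both users divided by items bought by either (binary vectors)."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0

# Hypothetical purchase flags of two users over five items
user_a = [1, 1, 0, 1, 0]
user_b = [1, 0, 0, 1, 1]

cos_ab = cosine_sim(user_a, user_b)
pear_ab = pearson_sim(user_a, user_b)
jac_ab = jaccard_sim(user_a, user_b)
print(cos_ab, pear_ab, jac_ab)
```

The three measures give different values for the same pair of users, so the choice of measure can change which neighbours are selected.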

# Importing required libraries

In [1]:
# Installing surprise Library
!pip install surprise



In [2]:
# Importing basic libraries
import pandas as pd
import numpy as np
import random

# Importing scipy.sparse.csr_matrix for kNN data preparation
from scipy.sparse import csr_matrix

# Importing kNN algorithm
from sklearn.neighbors import NearestNeighbors

# Importing cosine_similarity to calculate cosine similarity in memory based collaborative filtering
from sklearn.metrics.pairwise import cosine_similarity

In [3]:
# Importing surprise.Reader,Dataset for surprise data preparation
from surprise import Reader, Dataset

# Importing for surprise model customizations
from surprise.model_selection import train_test_split, cross_validate, GridSearchCV

In [4]:
# Importing algorithms from Surprise package
from surprise.prediction_algorithms import NMF,CoClustering,SVD

# Importing accuracy to get metrics such as RMSE and MAE
from surprise import accuracy

# Importing the dataset as data

In [5]:
import os
!ls

Rec_sys_data.xlsx  sample_data


In [6]:
data = pd.read_excel('/content/Rec_sys_data.xlsx')

In [7]:
data.head()

Unnamed: 0,InvoiceNo,StockCode,Quantity,InvoiceDate,DeliveryDate,Discount%,ShipMode,ShippingCost,CustomerID
0,536365,84029E,6,2010-12-01 08:26:00,2010-12-02 08:26:00,0.2,ExpressAir,30.12,17850
1,536365,71053,6,2010-12-01 08:26:00,2010-12-02 08:26:00,0.21,ExpressAir,30.12,17850
2,536365,21730,6,2010-12-01 08:26:00,2010-12-03 08:26:00,0.56,Regular Air,15.22,17850
3,536365,84406B,8,2010-12-01 08:26:00,2010-12-03 08:26:00,0.3,Regular Air,15.22,17850
4,536365,22752,2,2010-12-01 08:26:00,2010-12-04 08:26:00,0.57,Delivery Truck,5.81,17850


## About the Dataset

The 9 columns or features are as follows:
1. InvoiceNo: the invoice number of a particular transaction
2. StockCode: the unique code of a particular item
3. Quantity: the quantity of the item bought by the customer
4. InvoiceDate: the date and time when the transaction was made
5. DeliveryDate: the date and time of delivery
6. Discount%: the discount applied to the item
7. ShipMode: the shipping mode (for example ExpressAir, Regular Air or Delivery Truck)
8. ShippingCost: the cost of shipping
9. CustomerID: the unique ID of the customer who bought the item

In [8]:
data.shape

(272404, 9)

The dataset has 272,404 rows and 9 columns.
Each row is one line item of an invoice, so the dataset contains 272,404 purchase records.

In [9]:
data.isnull().sum().sort_values(ascending=False)

InvoiceNo       0
StockCode       0
Quantity        0
InvoiceDate     0
DeliveryDate    0
Discount%       0
ShipMode        0
ShippingCost    0
CustomerID      0
dtype: int64

None of the 272,404 records has a missing value in any column, including CustomerID.

Collaborative filtering needs both CustomerID and StockCode to establish relations based on user behaviour, so we still call dropna() as a safeguard; on this dataset it removes nothing.

In [10]:
data1 = data.dropna()

In [11]:
data1.describe()

Unnamed: 0,InvoiceNo,Quantity,Discount%,ShippingCost,CustomerID
count,272404.0,272404.0,272404.0,272404.0,272404.0
mean,553740.733319,13.579536,0.300092,17.053491,15284.323523
std,9778.082879,149.136756,0.176023,10.01321,1714.478624
min,536365.0,1.0,0.0,5.81,12346.0
25%,545312.0,2.0,0.15,5.81,13893.0
50%,553902.0,6.0,0.3,15.22,15157.0
75%,562457.0,12.0,0.45,30.12,16788.0
max,569629.0,74215.0,0.6,30.12,18287.0


Here the minimum Quantity is 1, so there are no negative quantities. Negative quantities would indicate incorrect data, so we keep the filter below as a safeguard.

In [12]:
data1 = data1[data1.Quantity > 0]

In [13]:
data1.describe()

Unnamed: 0,InvoiceNo,Quantity,Discount%,ShippingCost,CustomerID
count,272404.0,272404.0,272404.0,272404.0,272404.0
mean,553740.733319,13.579536,0.300092,17.053491,15284.323523
std,9778.082879,149.136756,0.176023,10.01321,1714.478624
min,536365.0,1.0,0.0,5.81,12346.0
25%,545312.0,2.0,0.15,5.81,13893.0
50%,553902.0,6.0,0.3,15.22,15157.0
75%,562457.0,12.0,0.45,30.12,16788.0
max,569629.0,74215.0,0.6,30.12,18287.0


In [14]:
data1.shape

(272404, 9)

In [15]:
data1.head()

Unnamed: 0,InvoiceNo,StockCode,Quantity,InvoiceDate,DeliveryDate,Discount%,ShipMode,ShippingCost,CustomerID
0,536365,84029E,6,2010-12-01 08:26:00,2010-12-02 08:26:00,0.2,ExpressAir,30.12,17850
1,536365,71053,6,2010-12-01 08:26:00,2010-12-02 08:26:00,0.21,ExpressAir,30.12,17850
2,536365,21730,6,2010-12-01 08:26:00,2010-12-03 08:26:00,0.56,Regular Air,15.22,17850
3,536365,84406B,8,2010-12-01 08:26:00,2010-12-03 08:26:00,0.3,Regular Air,15.22,17850
4,536365,22752,2,2010-12-01 08:26:00,2010-12-04 08:26:00,0.57,Delivery Truck,5.81,17850


Our final dataset has 272,404 records, which are clean and complete.

In [16]:
data1.StockCode = data1.StockCode.astype(str)

# Memory-Based Approach

In the memory-based approach, the closest users or items are found using only similarity measures such as cosine similarity or the Pearson correlation coefficient, which rely on simple arithmetic operations.

A common similarity measure is cosine similarity. It can be thought of geometrically if one treats a given user’s (item’s) row (column) of the ratings matrix as a vector. For user-based collaborative filtering, two users’ similarity is measured as the cosine of the angle between the two users’ vectors. For users u and u′ with row vectors $\mathbf{r}_u$ and $\mathbf{r}_{u'}$, the cosine similarity is:

$$\mathrm{sim}(u, u') = \cos(\theta) = \frac{\mathbf{r}_u \cdot \mathbf{r}_{u'}}{\lVert \mathbf{r}_u \rVert \, \lVert \mathbf{r}_{u'} \rVert}$$



Because no training or optimization is involved, the approach is easy to use. However, its performance degrades on sparse data, which limits its scalability for most real-world problems.

The memory-based approach is further divided into:
1. User-to-User Collaborative Filtering
2. Item-to-Item Collaborative Filtering

## User-to-User Collaborative Filtering

User-based collaborative filtering predicts the items a user might like from the ratings given to those items by other users whose tastes are similar to the target user's.

In user-based collaborative filtering, we create a matrix that describes the behaviour of every user towards every item.
We then relate users to one another to identify similar users.

### Implementation

Using groupby, we create a matrix indexed by CustomerID that records how many units of each product each customer has purchased.

In [17]:
purchase_df = (data1.groupby(['CustomerID', 'StockCode'])['Quantity'].sum().unstack().reset_index().fillna(0).set_index('CustomerID'))

In [18]:
purchase_df.head()

CustomerID,10002,10080,10120,10123C,10124A,10124G,10125,10133,10135,11001,...,90214R,90214S,90214V,90214Y,BANK CHARGES,C2,DOT,M,PADS,POST
12346,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12347,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12348,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0
12350,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
12352,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,5.0


The matrix holds the quantity ordered (for example 48, 24 or 126), whereas we only want to know whether a particular item was purchased or not.

Thus we encode each cell as 1 (purchased) or 0 (not purchased).
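
The same 0/1 encoding can also be expressed as a vectorized comparison instead of a per-cell function. The sketch below uses a made-up two-customer frame (`toy_quantities` is hypothetical, not the notebook's `purchase_df`).

```python
import pandas as pd

# Hypothetical summed quantities: rows are customers, columns are stock codes
toy_quantities = pd.DataFrame(
    {"A": [48.0, 0.0], "B": [0.0, 24.0], "C": [126.0, 0.0]},
    index=pd.Index([1, 2], name="CustomerID"),
)

# True where at least one unit was bought, then cast to 0/1
toy_flags = (toy_quantities >= 1).astype(int)
print(toy_flags)
```

On a matrix with thousands of rows and columns, the comparison runs as a single array operation, which is usually much faster than calling a Python function on every cell.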


In [19]:
def encode_units(x):
    if x < 1:   # quantity below 1
        return 0  # not purchased
    if x >= 1:  # quantity of 1 or more
        return 1  # purchased


purchase_df = purchase_df.applymap(encode_units)

In [20]:
purchase_df.head()

CustomerID,10002,10080,10120,10123C,10124A,10124G,10125,10133,10135,11001,...,90214R,90214S,90214V,90214Y,BANK CHARGES,C2,DOT,M,PADS,POST
12346,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12347,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
12348,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
12350,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
12352,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,1


The purchase matrix is now ready, which describes the behaviour of Customers corresponding to all the items.

We can now apply Collaborative filtering on it.

In [21]:
# Applying cosine_similarity on the purchase matrix
user_similarities = cosine_similarity(purchase_df)

In [22]:
# Storing the similarity scores in a dataframe, i.e., the similarity scores matrix
user_similarity_data = pd.DataFrame(user_similarities,index=purchase_df.index,columns=purchase_df.index)

In [23]:
user_similarity_data.head()

CustomerID,12346,12347,12348,12350,12352,12353,12354,12355,12356,12358,...,18269,18270,18272,18273,18278,18280,18281,18282,18283,18287
12346,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.114708,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12347,0.0,1.0,0.070632,0.053567,0.048324,0.0,0.029001,0.091885,0.075845,0.0,...,0.041739,0.0,0.050669,0.0,0.036811,0.069843,0.0,0.0,0.087667,0.021253
12348,0.0,0.070632,1.0,0.051709,0.031099,0.0,0.027995,0.118262,0.146427,0.061546,...,0.0,0.0,0.024456,0.0,0.0,0.0,0.0,0.0,0.123091,0.082061
12350,0.0,0.053567,0.051709,1.0,0.035377,0.0,0.0,0.0,0.033315,0.070014,...,0.0,0.0,0.027821,0.0,0.0,0.0,0.0,0.0,0.052511,0.0
12352,0.0,0.048324,0.031099,0.035377,1.0,0.0,0.095765,0.040456,0.10018,0.084215,...,0.110264,0.065233,0.133855,0.0,0.0,0.0,0.0,0.0,0.094742,0.056143


This is what user_similarity_data looks like. It contains the pairwise similarity scores of users, with 0 being the least similar and 1 the most similar.
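
Given such a matrix, the nearest neighbours of a user can be read directly off that user's row. The sketch below is one possible way to do this on a made-up 4-user matrix; `toy_similarity` and `top_k_similar` are hypothetical names, and this is a simpler alternative to, not a copy of, the function defined in the next cell.

```python
import pandas as pd

# Hypothetical 4-user similarity matrix (symmetric, ones on the diagonal)
users = [101, 102, 103, 104]
toy_similarity = pd.DataFrame(
    [[1.0, 0.2, 0.7, 0.0],
     [0.2, 1.0, 0.1, 0.5],
     [0.7, 0.1, 1.0, 0.3],
     [0.0, 0.5, 0.3, 1.0]],
    index=users, columns=users,
)

def top_k_similar(similarity, user_id, k=2):
    """Return the k user IDs with the highest similarity to user_id."""
    scores = similarity.loc[user_id].drop(user_id)  # exclude the user itself
    return scores.nlargest(k).index.tolist()

neighbours = top_k_similar(toy_similarity, 101)
print(neighbours)  # [103, 102]
```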

#### Making Recommendations

In [24]:
def fetch_similar_users(user_id,k=5):
    # separating data rows for the entered user id
    user_similarity = user_similarity_data[user_similarity_data.index == user_id]

    # a data of all other users
    other_users_similarities = user_similarity_data[user_similarity_data.index != user_id]

    # calc cosine similarity between user and each other user
    similarities = cosine_similarity(user_similarity,other_users_similarities)[0].tolist()

    # create list of indices of these users
    user_indices = other_users_similarities.index.tolist()

    # create key/values pairs of user index and their similarity
    index_similarity_pair = dict(zip(user_indices, similarities))

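    # sort by similarity score, highest first
    # (without key=..., the pairs would be ordered by user id instead)
    sorted_index_similarity_pair = sorted(index_similarity_pair.items(), key=lambda pair: pair[1], reverse=True)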

    # grab k users off the top
    top_k_users_similarities = sorted_index_similarity_pair[:k]
    similar_users = [u[0] for u in top_k_users_similarities]

    print('The users with behaviour similar to that of user {0} are:'.format(user_id))
    return similar_users

In [25]:
similar_users = fetch_similar_users(12347)
similar_users

The users with behaviour similar to that of user 12347 are:


[18287, 18283, 18282, 18281, 18280]

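The similar users can then be stored in a list, and the items they purchased can be displayed, as done below. Note that the list shown above consists of the five largest customer IDs, which is what ordering the (ID, similarity) pairs by ID returns; ranking the pairs by their similarity score is what selects the genuinely closest users.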

In [26]:
def simular_users_recommendation(userid):

    similar_users = fetch_similar_users(userid)

    #obtaining all the items bought by similar users
    simular_users_recommendation_list = []
    for j in similar_users:
        item_list = data1[data1["CustomerID"]==j]['StockCode'].to_list()
        simular_users_recommendation_list.append(item_list)

    #this gives us multi-dimensional list
    # we need to flatten it
    flat_list = []
    for sublist in simular_users_recommendation_list:
        for item in sublist:
            flat_list.append(item)
    final_recommendations_list = list(dict.fromkeys(flat_list))

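    # storing up to 10 random recommendations in a list
    # (random.sample raises ValueError if asked for more items than are available)
    n_recommendations = min(10, len(final_recommendations_list))
    ten_random_recommendations = random.sample(final_recommendations_list, n_recommendations)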

    print('Items bought by Similar users based on Cosine Similarity')

    #returning 10 random recommendations
    return ten_random_recommendations

In [27]:
simular_users_recommendation(12347)

The users with behaviour similar to that of user 12347 are:
Items bought by Similar users based on Cosine Similarity


['22469',
 '22554',
 '82581',
 '22499',
 '22720',
 '23007',
 '22180',
 '23196',
 '22716',
 '47421']
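
The flatten, de-duplicate and sample steps used above can be written more compactly. The following is a sketch under stated assumptions: `sample_recommendations` is a hypothetical helper, the baskets are made up, and the step that removes items the target user already bought is an addition that the function above does not perform.

```python
import random
from itertools import chain

def sample_recommendations(neighbour_baskets, already_bought, n=10, seed=None):
    """Pick up to n distinct items bought by neighbours but not by the target user."""
    # Flatten the per-neighbour lists and drop duplicates while keeping order
    candidates = list(dict.fromkeys(chain.from_iterable(neighbour_baskets)))
    # Remove items the target user already owns (an assumption of this sketch)
    owned = set(already_bought)
    candidates = [item for item in candidates if item not in owned]
    # random.sample raises ValueError if asked for more items than exist
    rng = random.Random(seed)
    return rng.sample(candidates, min(n, len(candidates)))

# Hypothetical baskets of three similar users
baskets = [["22469", "22554"], ["22554", "82581"], ["22499"]]
picks = sample_recommendations(baskets, already_bought=["22499"], n=10, seed=0)
print(picks)
```

Capping the sample size at the number of candidates keeps the function from failing for users whose neighbours bought fewer than ten distinct items.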

## Item-to-Item Collaborative Filtering

An item-to-item filtering process uses a matrix to determine the likeness of pairs of items. It then compares the items the current user has shown a preference for against that matrix and bases its recommendations on the most similar items.
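
As a rough illustration of that idea, the sketch below scores every item a user has not bought by its summed similarity to the items the user has bought. The matrix `toy_item_sim`, the helper `recommend_from_items` and the summed-similarity scoring rule are all assumptions made for this example; the notebook's own implementation follows in the next section.

```python
import pandas as pd

# Hypothetical item-item similarity matrix
items = ["A", "B", "C", "D"]
toy_item_sim = pd.DataFrame(
    [[1.0, 0.6, 0.1, 0.0],
     [0.6, 1.0, 0.4, 0.2],
     [0.1, 0.4, 1.0, 0.5],
     [0.0, 0.2, 0.5, 1.0]],
    index=items, columns=items,
)

def recommend_from_items(item_sim, purchased, k=2):
    """Rank unpurchased items by their summed similarity to the purchased ones."""
    scores = item_sim.loc[purchased].sum(axis=0)  # one score per item
    scores = scores.drop(labels=purchased)        # keep only items not yet bought
    return scores.nlargest(k).index.tolist()

suggested = recommend_from_items(toy_item_sim, purchased=["A", "B"])
print(suggested)  # ['C', 'D']
```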

### Implementation

Using groupby, we create a matrix indexed by StockCode that records how many units of each item each customer has purchased.

In [28]:
items_purchase_df = (data1.groupby(['StockCode','CustomerID'])['Quantity'].sum().unstack().reset_index().fillna(0).set_index('StockCode'))

In [29]:
items_purchase_df.head()

StockCode,12346,12347,12348,12350,12352,12353,12354,12355,12356,12358,...,18269,18270,18272,18273,18278,18280,18281,18282,18283,18287
10002,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10080,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10120,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10123C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10124A,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Again, the matrix holds the quantity ordered (for example 48, 24 or 126), whereas we only want to know whether a particular item was purchased or not.

Thus we encode each cell as 1 (purchased) or 0 (not purchased).

In [30]:
items_purchase_df = items_purchase_df.applymap(encode_units)

The item_purchase matrix is now ready, which describes if the item was purchased by particular customer or not.

We can now apply Collaborative filtering on it.

In [31]:
# Applying Cosine similarity on the items
item_similarities = cosine_similarity(items_purchase_df)

In [32]:
# Storing the similarity scores in a dataframe
item_similarity_data = pd.DataFrame(item_similarities,index=items_purchase_df.index,columns=items_purchase_df.index)

In [33]:
item_similarity_data.head()

StockCode,10002,10080,10120,10123C,10124A,10124G,10125,10133,10135,11001,...,90214R,90214S,90214V,90214Y,BANK CHARGES,C2,DOT,M,PADS,POST
10002,1.0,0.0,0.108821,0.091287,0.0,0.0,0.094281,0.062932,0.091902,0.110096,...,0.0,0.0,0.0,0.0,0.0,0.032275,0.0,0.079333,0.0,0.066986
10080,0.0,1.0,0.0,0.0,0.0,0.0,0.043033,0.028724,0.067116,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10120,0.108821,0.0,1.0,0.132453,0.0,0.0,0.068399,0.068483,0.026669,0.079872,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.076739,0.0,0.013885
10123C,0.091287,0.0,0.132453,1.0,0.0,0.0,0.172133,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10124A,0.0,0.0,0.0,0.0,1.0,0.288675,0.074536,0.049752,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Making Recommendations

In [34]:
def fetch_similar_items(item_id,k=10):
    # separating data rows of the selected item
    item_similarity = item_similarity_data[item_similarity_data.index == item_id]

    # a data of all other items
    other_items_similarities = item_similarity_data[item_similarity_data.index != item_id]

    # calculate cosine similarity between selected item with other items
    similarities = cosine_similarity(item_similarity,other_items_similarities)[0].tolist()

    # create list of indices of these items
    item_indices = other_items_similarities.index.tolist()

    # create key/values pairs of item index and their similarity
    index_similarity_pair = dict(zip(item_indices, similarities))

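    # sort by similarity score, highest first
    # (without key=..., the pairs would be ordered by item code instead)
    sorted_index_similarity_pair = sorted(index_similarity_pair.items(), key=lambda pair: pair[1], reverse=True)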

    # grab k items from the top
    top_k_item_similarities = sorted_index_similarity_pair[:k]
    similar_items = [u[0] for u in top_k_item_similarities]

    print('Similar items based on purchase behaviour (item-to-item collaborative filtering)')
    return similar_items

In [35]:
similar_items = fetch_similar_items('10002')
similar_items

Similar items based on purchase behaviour (item-to-item collaborative filtering)


['10080',
 '10120',
 '10123C',
 '10124A',
 '10124G',
 '10125',
 '10133',
 '10135',
 '11001',
 '15030']

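The similar items can then be stored in a list, and the items similar to those purchased by the selected user can be displayed, as below. Note that the list shown above is simply the run of stock codes that follow '10002' in sorted order, which is what ordering the (code, similarity) pairs by code returns; ranking the pairs by their similarity score is what selects the genuinely closest items.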

In [36]:
def simular_item_recommendation(userid):

    simular_items_recommendation_list = []

    # obtaining the items similar to each distinct item bought by the user
    item_list = data1[data1["CustomerID"]==userid]['StockCode'].to_list()
    for item in dict.fromkeys(item_list):
        similar_items = fetch_similar_items(item)
        # collect the similar items, not the user's own item_list
        simular_items_recommendation_list.append(similar_items)

    # this gives us a list of lists, so we flatten it
    flat_list = []
    for sublist in simular_items_recommendation_list:
        for item in sublist:
            flat_list.append(item)
    # removing duplicates while keeping the order
    final_recommendations_list = list(dict.fromkeys(flat_list))

    # storing up to 10 random recommendations in a list
    # (random.sample raises ValueError if asked for more items than are available)
    n_recommendations = min(10, len(final_recommendations_list))
    ten_random_recommendations = random.sample(final_recommendations_list, n_recommendations)

    print('Similar Items bought by our users based on Cosine Similarity')

    # returning the random recommendations
    return ten_random_recommendations

In [37]:
simular_item_recommendation(12347)

Similar items based on purchase behaviour (item-to-item collaborative filtering)

['22550',
 '22699',
 '22497',
 '21171',
 '84558A',
 '23170',
 '47559B',
 '21578',
 '23147',
 '84991']

# Model-Based Approach

In this approach, collaborative filtering models are built with machine learning algorithms to predict whether a user is likely to purchase an item, based on their past behaviour.

The possible approaches are:
1. k-Nearest Neighbors (kNN): a machine learning algorithm that finds the users most similar to a target user based on past behaviour and makes predictions using the average of the top-k nearest neighbours.
2. Matrix Factorization (MF): the idea behind such models is that the attitudes or preferences of a user can be determined by a small number of hidden factors. These factors are called embeddings.
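
To illustrate the hidden-factor idea, the sketch below factorizes a small made-up ratings matrix into user and item embeddings with plain gradient descent. Everything here (the matrix `R`, two factors, the learning rate and the regularization strength) is an assumption chosen for the example; it is not how Surprise implements NMF or SVD.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4 users x 5 items ratings matrix; 0 marks an unobserved entry
R = np.array([
    [5, 3, 0, 1, 0],
    [4, 0, 0, 1, 1],
    [1, 1, 0, 5, 4],
    [0, 1, 5, 4, 0],
], dtype=float)
observed = R > 0

n_factors = 2
P = rng.normal(scale=0.1, size=(R.shape[0], n_factors))  # user embeddings
Q = rng.normal(scale=0.1, size=(R.shape[1], n_factors))  # item embeddings

def masked_rmse(R, P, Q, mask):
    """RMSE computed over the observed entries only."""
    err = (R - P @ Q.T)[mask]
    return float(np.sqrt(np.mean(err ** 2)))

rmse_before = masked_rmse(R, P, Q, observed)

learning_rate, reg = 0.01, 0.02
for _ in range(2000):
    err = np.where(observed, R - P @ Q.T, 0.0)  # ignore unobserved cells
    P_step = err @ Q - reg * P                  # descent direction for users
    Q_step = err.T @ P - reg * Q                # descent direction for items
    P += learning_rate * P_step
    Q += learning_rate * Q_step

rmse_after = masked_rmse(R, P, Q, observed)
print(rmse_before, rmse_after)
```

After training, `P @ Q.T` also contains values for the cells that were unobserved, and those values are the model's predictions.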

## Collaborative Filtering using k-Nearest Neighbors

To pass our sparse matrix into KNN, we convert it to Compressed Sparse Row (CSR) format.

CSR stores a sparse matrix as 3 arrays: the non-zero values, the row pointers (where each row starts and ends), and the column indices of the values.
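
The three arrays can be built by hand for a small matrix, which makes the layout easier to see. This is a pure-Python sketch with a hypothetical helper `to_csr`; SciPy's `csr_matrix` exposes the same arrays as its `.data`, `.indices` and `.indptr` attributes.

```python
def to_csr(dense):
    """Build the three CSR arrays from a list of equal-length rows."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)   # the non-zero value itself
                indices.append(col)  # its column index
        indptr.append(len(data))     # where the next row starts in `data`
    return data, indices, indptr

# Hypothetical 3 x 4 purchase matrix
dense = [
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
data, indices, indptr = to_csr(dense)
print(data)     # [1, 1, 1, 1]
print(indices)  # [0, 3, 1, 2]
print(indptr)   # [0, 2, 2, 4]
```

Row `i` occupies positions `indptr[i]` to `indptr[i + 1]` of `data` and `indices`, so the empty second row shows up as two equal consecutive pointers.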

### Model building

In [38]:
purchase_matrix = csr_matrix(purchase_df.values)

# Creating KNN Model with metric parameter as euclidean distance
knn_model = NearestNeighbors(metric = 'euclidean', algorithm = 'brute')

# Fitting the model on purchase_matrix
knn_model.fit(purchase_matrix)

### Finding similar users

In [39]:
def fetch_similar_users_knn(purchase_df,query_index):

    # Creating empty list where we will store user id of similar users
    simular_users_knn = []

    # Storing the distances and indices of the nearest neighbors (the first one is the query user itself)
    distances, indices = knn_model.kneighbors(purchase_df.iloc[query_index,:].values.reshape(1, -1), n_neighbors = 5)
    for i in range(0, len(distances.flatten())):
        if i == 0:
            print('Recommendations for {0}:\n'.format(purchase_df.index[query_index]))
        else:
            print('{0}: {1}, with distance of {2}:'.format(i, purchase_df.index[indices.flatten()[i]], distances.flatten()[i]))

            simular_users_knn.append(purchase_df.index[indices.flatten()[i]])
    return simular_users_knn

In [40]:
simular_users_knn = fetch_similar_users_knn(purchase_df,1497)
simular_users_knn

Recommendations for 14729:

1: 16917, with distance of 8.12403840463596:
2: 16989, with distance of 8.12403840463596:
3: 15124, with distance of 8.12403840463596:
4: 12897, with distance of 8.246211251235321:


[16917, 16989, 15124, 12897]
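
For intuition about what the brute-force neighbour search does, here is a small NumPy sketch on made-up data; `nearest_by_euclidean` and `toy_purchases` are hypothetical names, not part of scikit-learn. It also shows why the loop above skips the first result: the query user is always its own nearest neighbour, at distance 0. On 0/1 purchase vectors the squared Euclidean distance equals the number of items on which two users differ, so the distance of about 8.124 reported above corresponds to 66 differing items.

```python
import numpy as np

def nearest_by_euclidean(matrix, query_row, n_neighbors=3):
    """Return (row indices, distances) of the rows closest to matrix[query_row]."""
    diffs = matrix - matrix[query_row]
    distances = np.sqrt((diffs ** 2).sum(axis=1))
    order = np.argsort(distances, kind="stable")[:n_neighbors]
    return order, distances[order]

# Hypothetical binary purchase matrix: 4 users x 5 items
toy_purchases = np.array([
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [1, 0, 0, 0, 1],
])

idx, dist = nearest_by_euclidean(toy_purchases, query_row=0)
print(idx, dist)
```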

### Making Recommendations

In [41]:
def knn_recommendation(simular_users_knn):

    #obtaining all the items bought by similar users
    knn_recommnedations = []
    for j in simular_users_knn:
        item_list = data1[data1["CustomerID"]==j]['StockCode'].to_list()
        knn_recommnedations.append(item_list)

    #this gives us multi-dimensional list
    # we need to flatten it
    flat_list = []
    for sublist in knn_recommnedations:
        for item in sublist:
            flat_list.append(item)
    final_recommendations_list = list(dict.fromkeys(flat_list))

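    # storing up to 10 random recommendations in a list
    # (random.sample raises ValueError if asked for more items than are available)
    n_recommendations = min(10, len(final_recommendations_list))
    ten_random_recommendations = random.sample(final_recommendations_list, n_recommendations)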

    print('Items bought by Similar users based on KNN')

    #returning 10 random recommendations
    return ten_random_recommendations

In [42]:
knn_recommendation(simular_users_knn)

Items bought by Similar users based on KNN


['22957',
 '22920',
 '22927',
 '22605',
 '23298',
 '22919',
 '22916',
 '22921',
 '84978',
 '22470']

## Collaborative Filtering using Matrix Factorization

For Matrix Factorization, we are using the Surprise Package.

Surprise package: this package was developed specifically to make recommendation based on collaborative filtering easy. It ships with implementations of a variety of collaborative filtering algorithms such as NMF, kNN, Co-Clustering and SVD.

In [43]:
data3 = items_purchase_df.stack().to_frame()

In [44]:
#Renaming the column as Quantity
data3 = data3.reset_index().rename(columns={0:"Quantity"})

In [45]:
data3

Unnamed: 0,StockCode,CustomerID,Quantity
0,10002,12346,0
1,10002,12347,0
2,10002,12348,0
3,10002,12350,0
4,10002,12352,0
...,...,...,...
12903081,POST,18280,0
12903082,POST,18281,0
12903083,POST,18282,0
12903084,POST,18283,0


In [46]:
print(items_purchase_df.shape)
print(data3.shape)

(3538, 3647)
(12903086, 3)


3,538 unique items x 3,647 unique customer IDs

The total number of records in data3 is therefore 3,538 x 3,647 = 12,903,086, which matches data3.shape.

This is too large to pass into an algorithm, so we reduce the size of the dataset by shortlisting.

### Shortlisting customers & items based on no. of orders

In [47]:
# Storing the customer id of every row in customer_ids
customer_ids = data1['CustomerID']

# Storing the stock code of every row in item_ids
item_ids = data1['StockCode']

In [48]:
from collections import Counter

In [49]:
# counting the no. of rows (invoice line items) for each customer
count_orders = Counter(customer_ids)

# storing the customer id and its count in a dataframe (the count column is named Quantity)
customer_count_df = pd.DataFrame.from_dict(count_orders, orient='index').reset_index().rename(columns={0:"Quantity"})

# keeping only the customer ids that appear in more than 120 rows
customer_count_df = customer_count_df[customer_count_df["Quantity"]>120]

# renaming the index column as CustomerID for inner join
customer_count_df.rename(columns={'index':'CustomerID'},inplace=True)

In [50]:
customer_count_df

Unnamed: 0,CustomerID,Quantity
0,17850,297
1,13047,140
2,12583,182
6,14688,265
8,15311,1892
...,...,...
3308,14096,1170
3367,16910,261
3392,16360,226
3413,17728,133


In [51]:
# counting the no. of rows in which each item was ordered
count_items = Counter(item_ids)

# storing the stock code and its count in a dataframe (the count column is named Quantity)
item_count_df = pd.DataFrame.from_dict(count_items, orient='index').reset_index().rename(columns={0:"Quantity"})

# keeping only the items that were ordered more than 120 times
item_count_df = item_count_df[item_count_df["Quantity"]>120]

# renaming the index column as StockCode for inner join
item_count_df.rename(columns={'index':'StockCode'},inplace=True)

In [52]:
item_count_df

Unnamed: 0,StockCode,Quantity
0,84029E,161
1,71053,220
3,84406B,213
4,22752,229
5,85123A,1606
...,...,...
3295,23294,181
3296,23295,213
3363,23328,129
3373,23356,148
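
The same shortlisting can be sketched with `value_counts` and `isin` instead of `Counter`. The frame `toy_lines` and the threshold `min_rows` below are made up for illustration; the cells above use data1 and a threshold of 120.

```python
import pandas as pd

# Hypothetical invoice line items
toy_lines = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2, 3],
    "StockCode": ["A", "B", "A", "A", "C", "A"],
})
min_rows = 1  # keep IDs that appear in more than this many rows

customer_counts = toy_lines["CustomerID"].value_counts()
item_counts = toy_lines["StockCode"].value_counts()

frequent_customers = customer_counts[customer_counts > min_rows].index
frequent_items = item_counts[item_counts > min_rows].index

# Keep rows whose customer AND item both pass the threshold
shortlisted = toy_lines[
    toy_lines["CustomerID"].isin(frequent_customers)
    & toy_lines["StockCode"].isin(frequent_items)
]
print(shortlisted)
```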


Applying inner join

In [53]:
data4 = pd.merge(data3, item_count_df, on='StockCode', how='inner')
data4 = pd.merge(data4, customer_count_df, on='CustomerID', how='inner')

In [54]:
data4

Unnamed: 0,StockCode,CustomerID,Quantity_x,Quantity_y,Quantity
0,10133,12347,0,124,124
1,15036,12347,0,278,124
2,15056BL,12347,0,223,124
3,15056N,12347,0,325,124
4,16156S,12347,0,137,124
...,...,...,...,...,...
385667,85132C,18283,0,127,447
385668,85150,18283,1,264,447
385669,85152,18283,1,466,447
385670,M,18283,1,198,447


In [55]:
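# dropping the purchase flag (Quantity_x) and the item count (Quantity_y);
# the remaining Quantity column is the customer's row count from customer_count_df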
data4.drop(['Quantity_y','Quantity_x'],axis=1,inplace=True)

In [56]:
data4

Unnamed: 0,StockCode,CustomerID,Quantity
0,10133,12347,124
1,15036,12347,124
2,15056BL,12347,124
3,15056N,12347,124
4,16156S,12347,124
...,...,...,...
385667,85132C,18283,447
385668,85150,18283,447
385669,85152,18283,447
385670,M,18283,447


In [57]:
data4.describe()

Unnamed: 0,CustomerID,Quantity
count,385672.0,385672.0
mean,15360.985915,279.089789
std,1719.468125,337.879413
min,12347.0,121.0
25%,13996.25,151.0
50%,15413.0,198.0
75%,16840.0,290.0
max,18283.0,5095.0


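This is what data4 looks like. We have reduced the size from 12,903,086 to 385,672 rows. Note that the remaining Quantity column is the per-customer row count from customer_count_df (minimum 121, maximum 5095), not the 0/1 purchase flag from data3, so it takes the same value for every row of a given customer.

The three-column layout is what the Surprise library expects: it reads the columns as user, item and rating in that order, so here StockCode fills the user slot and CustomerID the item slot.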
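
If the purchase flag rather than the customer count were wanted as the rating, one option is to merge on the key columns only, so that the count frames act purely as filters. The sketch below uses made-up stand-ins (`toy_data3`, `toy_item_counts`, `toy_customer_counts`) for the notebook's frames. It is an alternative design, not the method used in this notebook, and it would produce different metrics from those reported below.

```python
import pandas as pd

# Hypothetical stand-ins for data3, item_count_df and customer_count_df
toy_data3 = pd.DataFrame({
    "StockCode": ["A", "A", "B", "B"],
    "CustomerID": [1, 2, 1, 2],
    "Quantity": [1, 0, 0, 1],  # 0/1 purchase flag
})
toy_item_counts = pd.DataFrame({"StockCode": ["A", "B"], "Quantity": [150, 130]})
toy_customer_counts = pd.DataFrame({"CustomerID": [1, 2], "Quantity": [124, 447]})

# Merge on the key columns only, so no Quantity_x / Quantity_y columns appear
filtered = (
    toy_data3
    .merge(toy_item_counts[["StockCode"]], on="StockCode", how="inner")
    .merge(toy_customer_counts[["CustomerID"]], on="CustomerID", how="inner")
    .rename(columns={"Quantity": "Purchased"})
)

# Put the columns in user, item, rating order
toy_ratings = filtered[["CustomerID", "StockCode", "Purchased"]]
print(toy_ratings)
```

With a 0/1 rating, the matching rating scale would be (0, 1) rather than (0, 5095).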

In [58]:
# reading the data in a format supported by surprise library.
reader = Reader(rating_scale=(0,5095))
# the range has been set as 0,5095 as the maximum value of quantity is 5095.

# loading Dataset in a format supported by surprise library.
formated_data = Dataset.load_from_df(data4, reader)

In [59]:
# performing train test split on the dataset
train_set, test_set = train_test_split(formated_data, test_size= 0.2)


### Implementing NMF

In [60]:
# defining the model
algo1 = NMF()

# model fitting
algo1.fit(train_set)

# model prediction
pred1 = algo1.test(test_set)

In [61]:
accuracy.rmse(pred1)

accuracy.mae(pred1)

RMSE: 426.9061
MAE:  272.8512


272.8512099789018
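
For reference, the two metrics reported above can be computed by hand as follows. This is a plain-Python sketch on made-up numbers; the helpers `rmse` and `mae` are hypothetical and separate from `surprise.accuracy`.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: penalises large errors more heavily."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error: the average size of the error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical true values and model estimates
actual = [3.0, 5.0, 2.0, 4.0]
predicted = [2.0, 5.0, 4.0, 4.0]

print(rmse(actual, predicted))
print(mae(actual, predicted))  # 0.75
```

Both metrics are in the units of the rating, so they are only comparable between models trained on the same rating column.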

In [62]:
cross_validate(algo1, formated_data, verbose=True)

Evaluating RMSE, MAE of algorithm NMF on 5 split(s).

                  Fold 1    Fold 2    Fold 3    Fold 4    Fold 5    Mean      Std
RMSE (testset)    427.7283  434.5767  423.4518  423.0231  432.3452  428.2250  4.6349
MAE (testset)     271.8921  274.0884  271.7189  272.0515  273.8082  272.7118  1.0189
Fit time          8.09      8.70      8.79      8.91      8.82      8.66      0.29
Test time         0.35      0.64      0.58      0.57      0.34      0.50      0.13


{'test_rmse': array([427.72833993, 434.57670986, 423.45175685, 423.02314339,
        432.34520488]),
 'test_mae': array([271.8920799 , 274.08836808, 271.71891375, 272.05145829,
        273.80819943]),
 'fit_time': (8.094773530960083,
  8.700079917907715,
  8.788331508636475,
  8.905122995376587,
  8.81670069694519),
 'test_time': (0.3502819538116455,
  0.6423804759979248,
  0.5753815174102783,
  0.5734255313873291,
  0.3366849422454834)}

### Implementing Co-Clustering

In [63]:
# defining the model
algo2 = CoClustering()

# model fitting
algo2.fit(train_set)

# model prediction
pred2 = algo2.test(test_set)

In [64]:
accuracy.rmse(pred2)

accuracy.mae(pred2)

RMSE: 6.8761
MAE:  5.5924


5.592397875559905

In [65]:
cross_validate(algo2, formated_data, verbose=True)

Evaluating RMSE, MAE of algorithm CoClustering on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    7.1291  7.1412  6.7947  7.4263  7.1779  7.1338  0.2013  
MAE (testset)     5.6799  5.7101  5.3956  6.0318  5.7654  5.7166  0.2031  
Fit time          8.61    8.51    8.73    8.18    8.28    8.46    0.21    
Test time         0.63    0.62    0.65    0.61    0.65    0.64    0.02    


{'test_rmse': array([7.12914535, 7.14120894, 6.79472015, 7.42626997, 7.17786393]),
 'test_mae': array([5.67986754, 5.71007694, 5.39562709, 6.03181201, 5.76537598]),
 'fit_time': (8.61397933959961,
  8.507719278335571,
  8.730175971984863,
  8.17567753791809,
  8.28481674194336),
 'test_time': (0.6312966346740723,
  0.624396562576294,
  0.6525390148162842,
  0.6138842105865479,
  0.6543314456939697)}

### Implementing SVD

In [66]:
# defining the model
algo3 = SVD()

# model fitting
algo3.fit(train_set)

# model prediction
pred3 = algo3.test(test_set)

In [67]:
accuracy.rmse(pred3)
accuracy.mae(pred3)

RMSE: 4827.5643
MAE:  4815.8686


4815.8686199520325

In [68]:
cross_validate(algo3, formated_data, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1     Fold 2     Fold 3     Fold 4     Fold 5     Mean       Std        
RMSE (testset)    4827.6147  4828.5903  4827.7059  4827.5253  4827.3050  4827.7482  0.4416     
MAE (testset)     4816.2443  4816.7867  4816.0172  4815.5607  4814.9421  4815.9102  0.6246     
Fit time          5.21       5.49       5.73       5.40       6.14       5.60       0.32       
Test time         0.93       0.48       1.09       0.40       0.42       0.66       0.29       


{'test_rmse': array([4827.6147191 , 4828.59031679, 4827.70588707, 4827.52529067,
        4827.3050293 ]),
 'test_mae': array([4816.24433785, 4816.78667272, 4816.01721679, 4815.56067363,
        4814.94213965]),
 'fit_time': (5.214406728744507,
  5.488094091415405,
  5.732736587524414,
  5.40395712852478,
  6.137359619140625),
 'test_time': (0.9257731437683105,
  0.4804704189300537,
  1.0864524841308594,
  0.40100955963134766,
  0.4244561195373535)}

### Testing the models

In [69]:
# Taking item 47590B and customer 15738 as a test case: total quantity actually purchased
data1[(data1['StockCode']=='47590B')&(data1['CustomerID']==15738)].Quantity.sum()

78

In [70]:
algo1.test([['47590B',15738,78]])

# Predicted value given out by model is 3.11 while actual was 78

[Prediction(uid='47590B', iid=15738, r_ui=78, est=3.1065784409792787, details={'was_impossible': False})]

In [71]:
algo2.test([['47590B',15738,78]])

# Predicted value given out by model is 133.38 while actual was 78

[Prediction(uid='47590B', iid=15738, r_ui=78, est=133.38205283085563, details={'was_impossible': False})]

In [72]:
algo3.test([['47590B',15738,78]])

# Predicted value given out by model is 5095 while actual was 78

[Prediction(uid='47590B', iid=15738, r_ui=78, est=5095, details={'was_impossible': False})]
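
To compare the three single-pair predictions side by side, the absolute error of each estimate against the actual quantity of 78 can be computed directly. This is a minimal sketch using the `est` values printed above; the dictionary keys simply reuse the notebook's variable names.

```python
# Actual quantity and the estimates printed by the three models above
actual_quantity = 78
estimates = {
    "algo1": 3.1065784409792787,
    "algo2": 133.38205283085563,  # CoClustering
    "algo3": 5095,                # SVD
}

# Absolute error of each estimate
abs_errors = {name: abs(est - actual_quantity) for name, est in estimates.items()}

# Model whose estimate is closest to the actual quantity for this one pair
closest_model = min(abs_errors, key=abs_errors.get)

for name, err in abs_errors.items():
    print(f"{name}: absolute error = {err:.2f}")
print("closest:", closest_model)
```

A single pair says little about overall quality, so this complements rather than replaces the cross-validated scores. An estimate of exactly 5095 may also indicate that the prediction was clipped at the upper end of the rating scale; that scale is defined outside this section, so this is only a guess.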

### Giving out predictions

In [73]:
pred2

# Predictions given out by Co-Clustering

[Prediction(uid='22969', iid=16928, r_ui=209.0, est=208.41734122374402, details={'was_impossible': False}),
 Prediction(uid='22962', iid=14307, r_ui=180.0, est=190.44411347110452, details={'was_impossible': False}),
 Prediction(uid='22383', iid=16283, r_ui=149.0, est=148.0551383455823, details={'was_impossible': False}),
 Prediction(uid='22630', iid=16474, r_ui=223.0, est=224.41626137151326, details={'was_impossible': False}),
 Prediction(uid='22027', iid=15867, r_ui=388.0, est=384.1215121321375, details={'was_impossible': False}),
 Prediction(uid='21955', iid=14936, r_ui=300.0, est=298.09878291697675, details={'was_impossible': False}),
 Prediction(uid='22699', iid=16327, r_ui=267.0, est=270.0319677722105, details={'was_impossible': False}),
 Prediction(uid='21479', iid=16942, r_ui=197.0, est=177.47962924214374, details={'was_impossible': False}),
 Prediction(uid='22437', iid=16764, r_ui=276.0, est=287.4295778074657, details={'was_impossible': False}),
 Prediction(uid='21390', iid=165

In [77]:
def get_item_orders(item_id):
    try:
        # for an item, return the number of training rows (orders) it appears in;
        # items occupy the uid slot in this dataset
        return len(train_set.ur[train_set.to_inner_uid(item_id)])
    except ValueError:
        # item not present in the training set
        return 0

def get_customer_orders(customer_id):
    try:
        # for a customer, return the number of training rows (orders) they appear in;
        # customers occupy the iid slot in this dataset
        return len(train_set.ir[train_set.to_inner_iid(customer_id)])
    except ValueError:
        # customer not present in the training set
        return 0
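
The two helpers count entries in `train_set.ur` and `train_set.ir`. In Surprise these are dictionaries keyed by inner id: `ur` maps each user to a list of `(item_inner_id, rating)` pairs and `ir` maps each item to a list of `(user_inner_id, rating)` pairs. The sketch below mimics that structure with plain dictionaries, using raw ids instead of inner ids and hypothetical toy rows. It keeps this notebook's convention that the stock code sits in the user slot; `count_rows` is an illustrative helper, not part of Surprise.

```python
from collections import defaultdict

# Hypothetical (StockCode, CustomerID, Quantity) rows, not from the dataset
rows = [
    ("22969", 16928, 209.0),
    ("22969", 14307, 180.0),
    ("22383", 16928, 149.0),
]

ur = defaultdict(list)  # uid slot (stock code) -> [(customer, quantity), ...]
ir = defaultdict(list)  # iid slot (customer)   -> [(stock code, quantity), ...]
for stock_code, customer_id, quantity in rows:
    ur[stock_code].append((customer_id, quantity))
    ir[customer_id].append((stock_code, quantity))

def count_rows(index, raw_id):
    """Number of training rows for raw_id, or 0 if it was never seen."""
    return len(index.get(raw_id, []))

print(count_rows(ur, "22969"))  # rows for one stock code
print(count_rows(ir, 16928))    # rows for one customer
print(count_rows(ur, "00000"))  # unseen id
```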

### Best and Worst Predictions made by Co-Clustering

In [78]:
predictions_data = pd.DataFrame(pred2, columns=['item_id', 'customer_id', 'quantity', 'prediction', 'details'])
predictions_data['item_orders'] = predictions_data.item_id.apply(get_item_orders)
predictions_data['customer_orders'] = predictions_data.customer_id.apply(get_customer_orders)
predictions_data['error'] = abs(predictions_data.prediction - predictions_data.quantity)
best_predictions = predictions_data.sort_values(by='error')[:10]
worst_predictions = predictions_data.sort_values(by='error')[-10:]

In [79]:
predictions_data

Unnamed: 0,item_id,customer_id,quantity,prediction,details,item_orders,customer_orders,error
0,22969,16928,209.0,208.417341,{'was_impossible': False},459,532,0.582659
1,22962,14307,180.0,190.444113,{'was_impossible': False},468,525,10.444113
2,22383,16283,149.0,148.055138,{'was_impossible': False},446,532,0.944862
3,22630,16474,223.0,224.416261,{'was_impossible': False},460,550,1.416261
4,22027,15867,388.0,384.121512,{'was_impossible': False},443,548,3.878488
...,...,...,...,...,...,...,...,...
77130,22900,15301,205.0,198.249498,{'was_impossible': False},450,545,6.750502
77131,22556,17472,181.0,174.250255,{'was_impossible': False},452,550,6.749745
77132,21892,16265,194.0,193.831720,{'was_impossible': False},450,559,0.168280
77133,21916,15290,139.0,131.666336,{'was_impossible': False},468,542,7.333664


In [80]:
best_predictions

Unnamed: 0,item_id,customer_id,quantity,prediction,details,item_orders,customer_orders,error
2954,20984,17841,5095.0,5095.0,{'was_impossible': False},435,552,0.0
10036,22996,17841,5095.0,5095.0,{'was_impossible': False},433,552,0.0
56107,21078,17841,5095.0,5095.0,{'was_impossible': False},435,552,0.0
18956,84050,17841,5095.0,5095.0,{'was_impossible': False},445,552,0.0
18689,23192,17841,5095.0,5095.0,{'was_impossible': False},460,552,0.0
57289,22712,17841,5095.0,5095.0,{'was_impossible': False},433,552,0.0
64339,22045,14505,524.0,523.984575,{'was_impossible': False},469,546,0.015425
57959,22045,17337,543.0,542.984575,{'was_impossible': False},469,542,0.015425
17838,22045,16904,608.0,607.984575,{'was_impossible': False},469,554,0.015425
27123,22045,17675,538.0,537.984575,{'was_impossible': False},469,554,0.015425


In [81]:
worst_predictions

Unnamed: 0,item_id,customer_id,quantity,prediction,details,item_orders,customer_orders,error
24896,22295,18118,972.0,950.996367,{'was_impossible': False},462,543,21.003633
50978,22295,15005,896.0,874.996367,{'was_impossible': False},462,521,21.003633
55023,22295,16033,904.0,882.996367,{'was_impossible': False},462,547,21.003633
75782,22295,13263,1314.0,1292.996367,{'was_impossible': False},462,537,21.003633
55444,22295,14456,582.0,560.996367,{'was_impossible': False},462,547,21.003633
62517,22295,14606,2185.0,2163.996367,{'was_impossible': False},462,554,21.003633
10939,22295,14796,904.0,882.996367,{'was_impossible': False},462,530,21.003633
68053,22295,13089,1511.0,1489.996367,{'was_impossible': False},462,534,21.003633
62875,22295,17337,543.0,521.996367,{'was_impossible': False},462,542,21.003633
65286,22295,17675,538.0,516.996367,{'was_impossible': False},462,554,21.003633
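
Sorting the whole frame and slicing both ends works, but pandas can also select the extremes directly. The following is an alternative sketch with `nsmallest` and `nlargest` on a hypothetical toy frame; the column names follow `predictions_data`, the values do not.

```python
import pandas as pd

# Hypothetical toy predictions (not from the dataset)
toy = pd.DataFrame({
    "item_id": ["a", "b", "c", "d"],
    "quantity": [10.0, 20.0, 30.0, 40.0],
    "prediction": [12.0, 20.5, 25.0, 41.0],
})
toy["error"] = (toy["prediction"] - toy["quantity"]).abs()

best = toy.nsmallest(2, "error")   # smallest errors first
worst = toy.nlargest(2, "error")   # largest errors first

print(best[["item_id", "error"]])
print(worst[["item_id", "error"]])
```

Note that `nlargest` lists the largest error first, whereas slicing the last rows of an ascending sort lists it last.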


In [82]:
# Getting the list of items for customer 12347
item_list = predictions_data[predictions_data['customer_id']==12347]['item_id'].values.tolist()
item_list

['23077',
 '22430',
 '85049A',
 '21770',
 '22771',
 '22916',
 '21718',
 '21507',
 '23243',
 '22077',
 '23144',
 '22772',
 '21519',
 '22551',
 '23356',
 '23287',
 '22965',
 '21876',
 '22662',
 '22855',
 '21703',
 '22672',
 '82583',
 '21041',
 '21877',
 '22997',
 '22147',
 '22273',
 '22752',
 '21212',
 '23192',
 '82484',
 '21967',
 '21485',
 '21242',
 '21922',
 '22665',
 '22895',
 '22499',
 '23198',
 '21989',
 '23148',
 '23201',
 '82578',
 '84879',
 '85099C',
 '21231',
 '20724',
 '22295',
 '22729',
 '21484',
 '22470',
 '22348',
 '21878',
 '21390',
 '22398',
 '22086',
 '21658',
 '22730',
 '22794',
 '22768',
 '21240',
 '23200',
 '23209',
 '23078',
 '84945',
 '84988',
 '20977',
 '22847',
 '21670',
 '21754',
 '21791',
 '22112',
 '22491',
 '23203',
 '21499',
 '21340',
 '23146',
 '22690',
 '23170',
 '22507',
 '22548',
 '21984',
 '22909',
 '37446',
 '22951',
 '22352',
 '21974',
 '23032',
 '21928',
 '22577',
 '23049',
 '22192',
 '21506',
 '21908',
 '21829',
 '22212',
 '85132C',
 '22417',
 '22141

In [83]:
# Getting the list of unique customers who also bought the same items (item_list)
customer_list = predictions_data[predictions_data['item_id'].isin(item_list)]['customer_id'].values
customer_list = np.unique(customer_list).tolist()
customer_list

[12347,
 12359,
 12362,
 12370,
 12378,
 12415,
 12417,
 12428,
 12431,
 12433,
 12444,
 12449,
 12451,
 12471,
 12472,
 12474,
 12476,
 12477,
 12481,
 12484,
 12490,
 12501,
 12502,
 12517,
 12520,
 12539,
 12540,
 12553,
 12567,
 12583,
 12621,
 12626,
 12627,
 12637,
 12662,
 12668,
 12681,
 12682,
 12683,
 12688,
 12700,
 12705,
 12708,
 12709,
 12714,
 12720,
 12721,
 12731,
 12743,
 12744,
 12748,
 12749,
 12753,
 12757,
 12766,
 12836,
 12839,
 12841,
 12867,
 12921,
 12949,
 12957,
 12971,
 13001,
 13004,
 13013,
 13018,
 13047,
 13050,
 13069,
 13078,
 13081,
 13089,
 13093,
 13097,
 13098,
 13102,
 13113,
 13124,
 13137,
 13139,
 13141,
 13148,
 13174,
 13178,
 13184,
 13198,
 13209,
 13230,
 13232,
 13263,
 13266,
 13267,
 13268,
 13269,
 13285,
 13317,
 13319,
 13334,
 13381,
 13408,
 13418,
 13448,
 13451,
 13458,
 13468,
 13488,
 13505,
 13507,
 13527,
 13534,
 13548,
 13555,
 13571,
 13593,
 13610,
 13634,
 13668,
 13694,
 13700,
 13709,
 13742,
 13764,
 13767,
 13777,


In [84]:
# filtering those customers from predictions data
filtered_data = predictions_data[predictions_data['customer_id'].isin(customer_list)]

# removing the items already bought
filtered_data = filtered_data[~filtered_data['item_id'].isin(item_list)]

# getting the top items (prediction)
recommended_items = filtered_data.sort_values('prediction',ascending=False).reset_index(drop=True).head(10)['item_id'].values.tolist()
recommended_items

['21078',
 '20984',
 '22996',
 '22712',
 '84050',
 '22759',
 '23084',
 '21907',
 '21623',
 '22178']
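
The last three cells can be folded into one reusable function: collect the customer's items, find customers who share any of them, drop the items already bought, and rank what remains by predicted quantity. This is a sketch under stated assumptions; `recommend_items` and the toy frame are hypothetical, and unlike the cells above it removes duplicate item ids before taking the top `top_n`.

```python
import pandas as pd

def recommend_items(predictions, customer_id, top_n=10):
    """Top-N item ids for a customer, ranked by predicted quantity among peers."""
    # Items that appear for this customer
    bought = set(predictions.loc[predictions["customer_id"] == customer_id, "item_id"])
    # Customers who share at least one of those items
    peers = set(predictions.loc[predictions["item_id"].isin(bought), "customer_id"])
    # Peer rows for items the customer does not already have
    candidates = predictions[
        predictions["customer_id"].isin(peers) & ~predictions["item_id"].isin(bought)
    ]
    ranked = candidates.sort_values("prediction", ascending=False)
    # Assumption: each item should be recommended at most once
    return ranked["item_id"].drop_duplicates().head(top_n).tolist()

# Hypothetical toy predictions (not from the dataset)
toy_predictions = pd.DataFrame({
    "item_id":     ["A", "B", "A", "C", "D", "E", "C", "B"],
    "customer_id": [1,   1,   2,   2,   2,   3,   4,   4],
    "prediction":  [10., 20., 15., 50., 30., 99., 60., 5.],
})

toy_recommendations = recommend_items(toy_predictions, customer_id=1)
print(toy_recommendations)
```

In the toy data, item `E` has the highest prediction but belongs only to customer 3, who shares no items with customer 1, so it is excluded.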