## Content-Based Filtering Recommender

As its name suggest, this type of recommender uses the similarity among the background information of the items or users to propose recommendations to users. For instance, if User A generally gives Sci-Fi movies a good rating, the recommender system would recommend more movies of the Sci-Fi genre to User A. In another instance, if User B is a male of the higher income group, a bank recommender system would likely label User B as a potential customer of the premium investment plan.

### Pros

Easy to overcome a cold start problem — when there is zero or few user-item interactions, the recommender is still able to provide good recommendations to the user.

### Cons

Requires the background information of items/users. Whenever there are new items/users, these background information has to be catalogued and added in.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_excel('Sample 5000.xlsx')
df

Unnamed: 0,event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session
0,2019-10-01 08:47:35 UTC,view,1001588,2053013555631879936,electronics.smartphone,samsung,460.50,244951053,91769fdf-461b-4e43-9c73-88a07481b75c
1,2019-10-01 08:48:28 UTC,view,1003535,2053013555631879936,electronics.smartphone,samsung,460.50,244951053,91769fdf-461b-4e43-9c73-88a07481b75c
2,2019-10-01 17:06:51 UTC,view,4100129,2053013561218690048,,sony,463.02,292071852,0051531b-c007-442f-88c8-2cbf9537bd02
3,2019-10-01 16:48:28 UTC,view,6400036,2053013554121929984,computers.components.cpu,intel,338.23,295655799,eb8f2cea-4c5b-4e00-880f-3bcfa28549ff
4,2019-10-01 17:07:37 UTC,view,1004870,2053013555631879936,electronics.smartphone,samsung,286.84,306087674,a15f469a-968f-4c8c-8317-6dffed3f5523
...,...,...,...,...,...,...,...,...,...
9996,2019-10-01 16:09:57 UTC,view,6701210,2053013554247759872,computers.components.videocards,msi,939.03,512393615,6b08f33f-53fa-431d-99f6-b06c9b10d6af
9997,2019-10-01 16:10:50 UTC,view,6701210,2053013554247759872,computers.components.videocards,msi,939.03,512393615,6b08f33f-53fa-431d-99f6-b06c9b10d6af
9998,2019-10-01 11:12:38 UTC,view,3900217,2053013552326769920,appliances.environment.water_heater,garanterm,90.32,512393698,bac58664-4b1f-414f-a914-afb0caceae36
9999,2019-10-01 11:13:30 UTC,view,1005067,2053013555631879936,electronics.smartphone,samsung,1209.53,512393698,39b68180-94ba-4615-85a8-f8b76f6a84af


In [2]:
df.isnull().sum()

event_time          0
event_type          0
product_id          0
category_id         0
category_code    3378
brand            1406
price               0
user_id             0
user_session        0
dtype: int64

In [3]:
df.dropna()

Unnamed: 0,event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session
0,2019-10-01 08:47:35 UTC,view,1001588,2053013555631879936,electronics.smartphone,samsung,460.50,244951053,91769fdf-461b-4e43-9c73-88a07481b75c
1,2019-10-01 08:48:28 UTC,view,1003535,2053013555631879936,electronics.smartphone,samsung,460.50,244951053,91769fdf-461b-4e43-9c73-88a07481b75c
3,2019-10-01 16:48:28 UTC,view,6400036,2053013554121929984,computers.components.cpu,intel,338.23,295655799,eb8f2cea-4c5b-4e00-880f-3bcfa28549ff
4,2019-10-01 17:07:37 UTC,view,1004870,2053013555631879936,electronics.smartphone,samsung,286.84,306087674,a15f469a-968f-4c8c-8317-6dffed3f5523
8,2019-10-01 01:32:09 UTC,view,2501614,2053013564003709952,appliances.kitchen.oven,redmond,164.71,306441847,47641f8a-3aba-471a-8d07-014deccec567
...,...,...,...,...,...,...,...,...,...
9996,2019-10-01 16:09:57 UTC,view,6701210,2053013554247759872,computers.components.videocards,msi,939.03,512393615,6b08f33f-53fa-431d-99f6-b06c9b10d6af
9997,2019-10-01 16:10:50 UTC,view,6701210,2053013554247759872,computers.components.videocards,msi,939.03,512393615,6b08f33f-53fa-431d-99f6-b06c9b10d6af
9998,2019-10-01 11:12:38 UTC,view,3900217,2053013552326769920,appliances.environment.water_heater,garanterm,90.32,512393698,bac58664-4b1f-414f-a914-afb0caceae36
9999,2019-10-01 11:13:30 UTC,view,1005067,2053013555631879936,electronics.smartphone,samsung,1209.53,512393698,39b68180-94ba-4615-85a8-f8b76f6a84af


In [4]:
# provide a user score based on these user-item interactions
# view: 1, cart:10, purchase: 50
df['user_score'] = df['event_type'].map({'view':1,'cart':10,'purchase':50})
df['user_purchase'] = df['event_type'].apply(lambda x: 1 if x=='purchase' else 0)

In [5]:
df.head()

Unnamed: 0,event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session,user_score,user_purchase
0,2019-10-01 08:47:35 UTC,view,1001588,2053013555631879936,electronics.smartphone,samsung,460.5,244951053,91769fdf-461b-4e43-9c73-88a07481b75c,1,0
1,2019-10-01 08:48:28 UTC,view,1003535,2053013555631879936,electronics.smartphone,samsung,460.5,244951053,91769fdf-461b-4e43-9c73-88a07481b75c,1,0
2,2019-10-01 17:06:51 UTC,view,4100129,2053013561218690048,,sony,463.02,292071852,0051531b-c007-442f-88c8-2cbf9537bd02,1,0
3,2019-10-01 16:48:28 UTC,view,6400036,2053013554121929984,computers.components.cpu,intel,338.23,295655799,eb8f2cea-4c5b-4e00-880f-3bcfa28549ff,1,0
4,2019-10-01 17:07:37 UTC,view,1004870,2053013555631879936,electronics.smartphone,samsung,286.84,306087674,a15f469a-968f-4c8c-8317-6dffed3f5523,1,0


In [5]:
# segregate items into 5 different price categories relative to their item categories
df['price_category'] = 0
for i in df['category_code'].unique():
    df.loc[df['category_code']==i,'price_category'] = pd.qcut(x=df['price'][df['category_code']==i],q=5,labels=[1,2,3,4,5])

AssertionError: 

In [6]:
# there could be multiple interactions per user per item
# perform a groupby operation to discover the sum of user scores for each unique user-item interaction
group = df.groupby(['user_id','product_id'])['user_score','user_purchase'].sum().reset_index()
group['user_purchase'] = group['user_purchase'].apply(lambda x: 1 if x>1 else x)
group['user_score'] = group['user_score'].apply(lambda x: 100 if x>100 else x)

# apply MinMaxScaler to the user scores to obtain an interaction score with a value between 0 and 1
# >=0.5: a very high probability that a purchase has occurred
# <0.5: no purchase occurs below the threshold of 0.5
from sklearn.preprocessing import MinMaxScaler

std = MinMaxScaler(feature_range=(0.025, 1))
std.fit(group['user_score'].values.reshape(-1,1))
group['interaction_score'] = std.transform(group['user_score'].values.reshape(-1,1))

group = group.merge(df[['product_id','category_code','brand','price','price_category']].drop_duplicates('product_id'),on=['product_id'])


  group = df.groupby(['user_id','product_id'])['user_score','user_purchase'].sum().reset_index()


In [7]:
group.head()

Unnamed: 0,user_id,product_id,user_score,user_purchase,interaction_score,category_code,brand,price,price_category
0,244951053,1001588,1,0,0.025,electronics.smartphone,samsung,460.5,0
1,495589687,1001588,1,0,0.025,electronics.smartphone,samsung,460.5,0
2,244951053,1003535,1,0,0.025,electronics.smartphone,samsung,460.5,0
3,292071852,4100129,1,0,0.025,,sony,463.02,0
4,503283510,4100129,77,1,0.773485,,sony,463.02,0


In [8]:
inputs = group.drop('interaction_score', axis =1)
X = inputs
y = group['interaction_score']

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, 
                                                          random_state = 1)

In [11]:
X_train_matrix = pd.pivot_table(X_train,values='user_score',index='user_id',columns='product_id')
X_train_matrix = X_train_matrix.fillna(0)

In [12]:
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.feature_extraction.text import TfidfVectorizer

In [13]:
# filtering by item category, price category and brand

product_cat = X_train[['product_id','price_category','category_code','brand']].drop_duplicates('product_id')
product_cat = product_cat.sort_values(by='product_id')

price_cat_matrix = np.reciprocal(euclidean_distances(np.array(product_cat['price_category']).reshape(-1,1))+1)
euclidean_matrix = pd.DataFrame(price_cat_matrix,columns=product_cat['product_id'],index=product_cat['product_id'])

tfidf_vectorizer = TfidfVectorizer()
doc_term = tfidf_vectorizer.fit_transform(list(product_cat['category_code']))
dt_matrix = pd.DataFrame(doc_term.toarray().round(3), index=[i for i in product_cat['product_id']], columns=tfidf_vectorizer.get_feature_names())
cos_similar_matrix = pd.DataFrame(cosine_similarity(dt_matrix.values),columns=product_cat['product_id'],index=product_cat['product_id'])

tfidf_vectorizer = TfidfVectorizer()
doc_term = tfidf_vectorizer.fit_transform(list(product_cat['brand']))
dt_matrix1 = pd.DataFrame(doc_term.toarray().round(3), index=[i for i in product_cat['product_id']], columns=tfidf_vectorizer.get_feature_names())
dt_matrix1 = dt_matrix1 + 0.01
cos_similar_matrix1 = pd.DataFrame(cosine_similarity(dt_matrix1.values),columns=product_cat['product_id'],index=product_cat['product_id'])

similarity_matrix = cos_similar_matrix.multiply(euclidean_matrix).multiply(cos_similar_matrix1)
content_matrix = X_train_matrix.dot(similar_matrix)

# apply MinMaxScaler again to obtain the trained User-Item Matrix of predicted interaction scores
std = MinMaxScaler(feature_range=(0, 1))
std.fit(content_matrix.values)
content_matrix = std.transform(content_matrix.values)

ValueError: np.nan is an invalid document, expected byte or unicode string.