### Content-Based Recommenders

This notebook is an exercise on content-based filtering (CBF), also known as content-based recommenders. In CBF, we model users' stable preferences, measured in terms of content attributes. The key idea is therefore to build **vector representations** of items and match them against each user's profile.

How do we build a profile? Users can build their own profiles, or we can infer a profile from user actions, both explicit (ratings) and implicit (reads, clicks, purchases). A simple method is to count the number of times the user chooses items with each keyword.
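The counting method can be sketched as follows. This is a minimal illustration; the item names, keywords, and interaction log are hypothetical and are not part of the exercise data.

```python
from collections import Counter

# Hypothetical items, each tagged with a few keywords (not from the exercise data).
item_keywords = {
    'article_a': ['baseball', 'politics'],
    'article_b': ['baseball', 'family'],
    'article_c': ['economics'],
}

# Hypothetical log of items the user chose (read, clicked, or bought).
chosen_items = ['article_a', 'article_b', 'article_a']

# Profile: how many times the user chose an item carrying each keyword.
profile = Counter()
for item in chosen_items:
    profile.update(item_keywords[item])

print(dict(profile))
```

Keywords the user never encountered simply stay at a count of zero.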

In general, CBF depends on well-structured attributes that align with preferences, and on a reasonable distribution of attributes across items. CBF is unlikely to find surprising connections, and it tends to find substitutes rather than complements.

For vector representations, we can use boolean values, counts, TF-IDF weights, and so on. What if we need to update a profile? There are several ways:

* recompute each time;

* weight new / old similarity, keep track of total weight in profile and mix in new rating;

* decay old profile and mix in new profile.
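The last two update strategies might look like the sketch below. The function names, the `decay` parameter, and the choice of giving each new rating a weight of 1 are assumptions made for illustration; they are one possible reading of the list above, not a prescribed method.

```python
import numpy as np

def running_update(profile, total_weight, item_vector, rating):
    """Mix one new rating into a profile kept as a weighted average.

    Assumes each rating carries weight 1 (an illustrative choice).
    """
    new_total = total_weight + 1
    new_profile = (profile * total_weight + rating * item_vector) / new_total
    return new_profile, new_total

def decay_update(old_profile, new_profile, decay=0.8):
    """Shrink the old profile by `decay`, then add the new evidence."""
    return decay * old_profile + new_profile

# Weighted-average update with toy numbers.
profile, weight = running_update(np.array([1.0, 0.0]), 1, np.array([0.0, 1.0]), 1)

# Decay update with toy numbers.
updated = decay_update(np.array([3.0, -2.0, 0.0]), np.array([0.0, 1.0, 1.0]), decay=0.5)
print(profile, weight, updated)
```

The decay variant forgets old behaviour gradually, which may suit users whose interests drift; the weighted average treats all ratings equally regardless of age.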

In [1]:
import numpy as np
import pandas as pd

Import the data: a table of content attributes for 20 documents across 10 attributes, plus ratings from 2 users. In this exercise, content attributes are boolean values (1 if the document has the attribute, 0 otherwise), and user ratings are encoded as `1` (liked it), `-1` (disliked it), and `0` (never saw it).

In [2]:
content_tbl = pd.read_csv('HW2-data.csv', index_col=0)
content_tbl.shape

(20, 10)

In [3]:
content_tbl.index.name = 'doc'
content_tbl

| doc | baseball | economics | politics | Europe | Asia | soccer | war | security | shopping | family |
|---|---|---|---|---|---|---|---|---|---|---|
| doc1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
| doc2 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| doc3 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| doc4 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
| doc5 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| doc6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| doc7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| doc8 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| doc9 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| doc10 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |


In [4]:
user1 = np.array([1, -1, 0, 0, 0, 1, 0, 0, 0, 0,
                  0, 0, 0, 0, 0, 1, 0, 0, -1, 0]).reshape(20, -1)
user2 = np.array([-1, 1, 0, 1, 0, 0, 0, 0, 0, 0,
                  0, -1, 0, 0, 0, 0, 1, 0, 0, 0]).reshape(20, -1)

#### 1. Build a basic user profile

In [5]:
# user profiles: simple dot product
user1_profile = content_tbl.values.T.dot(user1)
user2_profile = content_tbl.values.T.dot(user2)

In [6]:
print('user 1 profile: ', user1_profile.reshape(-1))
print('user 2 profile: ', user2_profile.reshape(-1))

user 1 profile:  [ 3 -2 -1  0  0  2 -1 -1  1  0]
user 2 profile:  [-2  2  2  3 -1 -2  0  3  0 -1]


Here we simply take the dot product of the user profile vector and each item's content vector to predict the user's preference. The same approach is used in the following two parts.
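As a small check of the dot-product prediction, the sketch below recomputes the scores for the first three documents using the rows from the content table and the two profiles printed above. The variable names `docs`, `profiles`, and `scores` are introduced here for illustration only.

```python
import numpy as np

# Rows for doc1, doc2 and doc3, copied from the content table above.
docs = np.array([
    [1, 0, 1, 0, 1, 1, 0, 0, 0, 1],
    [0, 1, 1, 1, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
])

# Profiles printed above, one row per user.
profiles = np.array([
    [3, -2, -1, 0, 0, 2, -1, -1, 1, 0],
    [-2, 2, 2, 3, -1, -2, 0, 3, 0, -1],
])

# One matrix product gives every (document, user) score at once:
# scores[i, u] is the dot product of document i with user u's profile.
scores = docs.dot(profiles.T)
print(scores)
```

Each row of `scores` holds one document's score for user 1 and user 2, so the per-document loop can be replaced by a single matrix product.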

In [7]:
# predictions
for i in range(content_tbl.shape[0]):
    print('user 1 on doc {}: {}'.format(i + 1, user1_profile.reshape(-1).dot(content_tbl.iloc[i])))

print()

for i in range(content_tbl.shape[0]):
    print('user 2 on doc {}: {}'.format(i + 1, user2_profile.reshape(-1).dot(content_tbl.iloc[i])))

user 1 on doc 1: 4
user 1 on doc 2: -4
user 1 on doc 3: 2
user 1 on doc 4: -3
user 1 on doc 5: -1
user 1 on doc 6: 3
user 1 on doc 7: -1
user 1 on doc 8: -2
user 1 on doc 9: 3
user 1 on doc 10: -3
user 1 on doc 11: 0
user 1 on doc 12: 4
user 1 on doc 13: -2
user 1 on doc 14: -2
user 1 on doc 15: 0
user 1 on doc 16: 6
user 1 on doc 17: -4
user 1 on doc 18: 1
user 1 on doc 19: -4
user 1 on doc 20: -1

user 2 on doc 1: -4
user 2 on doc 2: 10
user 2 on doc 3: 0
user 2 on doc 4: 8
user 2 on doc 5: 1
user 2 on doc 6: 1
user 2 on doc 7: 2
user 2 on doc 8: 4
user 2 on doc 9: -2
user 2 on doc 10: 1
user 2 on doc 11: 1
user 2 on doc 12: -4
user 2 on doc 13: 7
user 2 on doc 14: 7
user 2 on doc 15: 4
user 2 on doc 16: -4
user 2 on doc 17: 10
user 2 on doc 18: 3
user 2 on doc 19: 2
user 2 on doc 20: 5


#### 2. Normalized attributes

We add the constraint that each content vector has length 1. Otherwise, a document with many attributes checked would have more influence on the overall profile than one with only a few. For instance, doc 1 and doc 19 each have five attributes, while docs 6 and 7 have only two each. We may consider that doc 6 says more about liking baseball (since it is one of only two attributes for the doc, along with Europe) than doc 1 does (since doc 1 has more attributes).
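To make the effect concrete, the sketch below scales the rows for doc 1 and doc 6 (copied from the content table) by one over the square root of their attribute counts, then checks that both rows end up with unit Euclidean length. The variable names are illustrative.

```python
import numpy as np

# doc1 (five attributes) and doc6 (two attributes), copied from the content table.
docs = np.array([
    [1, 0, 1, 0, 1, 1, 0, 0, 0, 1],
    [1, 0, 0, 1, 0, 0, 0, 0, 0, 0],
], dtype=float)

# For a 0/1 row with n ones, dividing by sqrt(n) gives a vector of length 1.
scale = 1.0 / np.sqrt(docs.sum(axis=1, keepdims=True))
unit_docs = docs * scale

norms = np.linalg.norm(unit_docs, axis=1)
baseball_doc1, baseball_doc6 = unit_docs[0, 0], unit_docs[1, 0]
print(norms, baseball_doc1, baseball_doc6)
```

After scaling, the baseball entry of doc 6 (`1/sqrt(2)`) is larger than that of doc 1 (`1/sqrt(5)`), which is the intended effect: a focused document says more about each of its attributes.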

In [8]:
# treat all articles as having unit weight (normalization)
row_sums = content_tbl.sum(axis=1)
normalized_factor = np.sqrt(1 / row_sums).values

# scale each row by its factor so that every document vector has unit length
content_tbl_normalized = content_tbl.values * normalized_factor[:, np.newaxis]

In [9]:
# user profiles: simple dot product
user1_normalized_profile = content_tbl_normalized.T.dot(user1)
user2_normalized_profile = content_tbl_normalized.T.dot(user2)

In [10]:
# new user profile
print('user 1 normalized profile: \n', user1_normalized_profile.reshape(-1))
print('\n')
print('user 2 normalized profile: \n', user2_normalized_profile.reshape(-1))

user 1 normalized profile: 
 [ 1.73167065 -0.9472136  -0.5         0.20710678  0.          1.02456386
 -0.4472136  -0.5         0.57735027  0.        ]


user 2 normalized profile: 
 [-1.02456386  1.          1.0527864   1.5        -0.4472136  -1.02456386
 -0.07735027  1.5         0.         -0.4472136 ]


In [11]:
# predictions
pred_user1, pred_user2 = [], []
for i in range(content_tbl.shape[0]):
    # compute each score once, then store and print it
    score = user1_normalized_profile.reshape(-1).dot(content_tbl_normalized[i])
    pred_user1.append(score)
    print('user 1 on doc {}: {}'.format(i + 1, score))

print()

for i in range(content_tbl.shape[0]):
    score = user2_normalized_profile.reshape(-1).dot(content_tbl_normalized[i])
    pred_user2.append(score)
    print('user 2 on doc {}: {}'.format(i + 1, score))

user 1 on doc 1: 1.0090187477611812
user 1 on doc 2: -0.8700534071567052
user 1 on doc 3: 0.7111053789495445
user 1 on doc 4: -0.6200534071567052
user 1 on doc 5: -0.21354069100864065
user 1 on doc 6: 1.3709226658874274
user 1 on doc 7: -0.3535533905932738
user 1 on doc 8: -0.37005340715670515
user 1 on doc 9: 1.1327243469445638
user 1 on doc 10: -0.8050729140891351
user 1 on doc 11: 0.04465819873852045
user 1 on doc 12: 1.3331138468776906
user 1 on doc 13: -0.3964466094067262
user 1 on doc 14: -0.3313782725618923
user 1 on doc 15: 0.1422285251880866
user 1 on doc 16: 1.924646069958185
user 1 on doc 17: -0.8700534071567052
user 1 on doc 18: 0.5546948998705893
user 1 on doc 19: -0.8472135954999579
user 1 on doc 20: -0.08137827256189228

user 2 on doc 1: -0.8455773862443852
user 2 on doc 2: 2.526393202250021
user 2 on doc 3: 0.016294290956783142
user 2 on doc 4: 1.9877180676552082
user 2 on doc 5: 0.31915137944246463
user 2 on doc 6: 0.3361841152991205
user 2 on doc 7: 0.7444324057629834

#### 3. Apply TF-IDF 

IDF measures how few documents contain a term, in other words, how rare the term is. With the logarithmic form, a term that appears in every document gets an IDF of 0, since it is not rare at all.

Typically, inverse document frequency is defined as `log(#docs / #docs with term)`. Here we skip the logarithm and simply use `1 / df`.

In [12]:
# apply idf
# df: document frequency of each attribute (number of documents that have it)
df = np.array([4, 6, 10, 11, 6, 6, 7, 6, 7, 5])
idf = 1 / df
idf

array([ 0.25      ,  0.16666667,  0.1       ,  0.09090909,  0.16666667,
        0.16666667,  0.14285714,  0.16666667,  0.14285714,  0.2       ])
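For comparison, the sketch below computes both the `1 / df` weights used in this notebook and the logarithmic variant, using the document frequencies above and the 20 documents in the table. The names `n_docs`, `idf_inverse`, and `idf_log` are introduced here for illustration.

```python
import numpy as np

n_docs = 20  # number of documents in the content table
df = np.array([4, 6, 10, 11, 6, 6, 7, 6, 7, 5])

idf_inverse = 1 / df            # variant used in this notebook
idf_log = np.log(n_docs / df)   # common logarithmic variant

# A term present in every document: the log form gives 0,
# while the 1/df form gives a small but nonzero weight.
idf_log_everywhere = np.log(n_docs / n_docs)
idf_inverse_everywhere = 1 / n_docs

print(idf_inverse)
print(idf_log)
```

Both variants rank the attributes in the same order of rarity, since each is a decreasing function of `df`; they differ in how strongly rare attributes are boosted.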

In [13]:
# predictions
for i in range(content_tbl.shape[0]):
    print('user 1 on doc {}: {}'\
          .format(i + 1, np.sum(user1_normalized_profile.reshape(-1) * content_tbl_normalized[i] * idf)))

print()

for i in range(content_tbl.shape[0]):
    print('user 2 on doc {}: {}'\
          .format(i + 1, np.sum(user2_normalized_profile.reshape(-1) * content_tbl_normalized[i] * idf)))

user 1 on doc 1: 0.2476124657905287
user 1 on doc 2: -0.13618718835894128
user 1 on doc 3: 0.10945899074393543
user 1 on doc 4: -0.08919655031727514
user 1 on doc 5: -0.043526623104614706
user 1 on doc 6: 0.3194323422480595
user 1 on doc 7: -0.05892556509887896
user 1 on doc 8: -0.04752988365060847
user 1 on doc 9: 0.17906719376543057
user 1 on doc 10: -0.12803122640182818
user 1 on doc 11: 0.01875153415956633
user 1 on doc 12: 0.31164827655467253
user 1 on doc 13: -0.05725272206727814
user 1 on doc 14: -0.053281216750158504
user 1 on doc 15: 0.021183771740190156
user 1 on doc 16: 0.396152879851886
user 1 on doc 17: -0.13618718835894128
user 1 on doc 18: 0.07163451247986463
user 1 on doc 19: -0.12153324130475628
user 1 on doc 20: -0.006290578708492353

user 2 on doc 1: -0.21716749806965674
user 2 on doc 2: 0.3291544717401536
user 2 on doc 3: -0.06289226997572088
user 2 on doc 4: 0.24029611917898988
user 2 on doc 5: 0.04458526691550539
user 2 on doc 6: -0.08469536214019145
user 2 on doc