### Item-item Collaborative Filtering

User-user CF produces good results, but it has two main problems: (1) sparsity: with large item sets and few ratings per user, a user often has nothing in common with other users, so in many cases no recommendation can be made; (2) computational performance: with many users, computing pairwise similarities is slow, and because user profiles change quickly, the similarities have to be recomputed in near real time.

This is where item-item CF comes in. Item-item similarities are more stable, since the average item has many more ratings than the average user. Similarity here does not refer to item attributes (that would be content-based filtering); it means **do people treat these items the same way in terms of likes and dislikes**. Do people who like this item tend to like the other? Do people who purchase this item tend to purchase the other?

Core assumption:
1. item-item relationships need to be stable

This leads to the main limitation: the recommendations have **lower serendipity** (they are less surprising). Items with a short life cycle are also a problem, because there is little time for their similarities to stabilize.

The algorithm is similar to user-user CF, but instead of averaging over users, we **average over items**. Item similarities can be pre-computed over all pairs of items; we then look for items similar to those the user likes, has in their basket, or has purchased.

$$s(u, i) = \frac{\sum_{j\in N(i: u)}w_{ij}(r_{uj}-\bar{r}_{j})}{\sum_{j\in N(i: u)}w_{ij}} + \bar{r}_{i},$$

here $\bar{r}_{i}$ is the average rating of item $i$, and $w_{ij}$ is the similarity between item $i$ and item $j$:

$$w_{ij} = \frac{r_i^\top r_j}{||r_i||_{2}||r_j||_{2}},$$

and item $j$ belongs to the neighborhood $N(i: u)$ of item $i$, i.e. the items similar to $i$ that user $u$ has also rated.
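As a small illustration of the similarity formula, the sketch below computes $w_{ij}$ for a few item rating vectors with NumPy. The ratings and the helper name `cosine_sim` are made up for this example, and missing ratings are assumed to be filled with 0, which is one common convention rather than the only choice.

```python
import numpy as np

def cosine_sim(r_i, r_j):
    """Cosine similarity between two item rating vectors (one entry per user)."""
    denom = np.linalg.norm(r_i) * np.linalg.norm(r_j)
    # Guard against items that have no ratings at all.
    return float(r_i @ r_j / denom) if denom > 0 else 0.0

# Hypothetical ratings from four users; 0 marks a missing rating (assumption).
item_a = np.array([5.0, 4.0, 0.0, 1.0])
item_b = np.array([4.0, 5.0, 0.0, 2.0])
item_c = np.array([1.0, 0.0, 5.0, 4.0])

w_ab = cosine_sim(item_a, item_b)  # users rate a and b alike
w_ac = cosine_sim(item_a, item_c)  # users rate a and c very differently
```

Here `w_ab` comes out much larger than `w_ac`, because the users who liked `item_a` also liked `item_b` but mostly disliked or skipped `item_c`.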

The overall process is the same as in user-user CF: compute similarities, then predict scores.
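The prediction step can be sketched on a toy example as follows. This is a minimal illustration, not the assignment's implementation: the function name `predict`, the ratings matrix `R`, and the similarity matrix `S` are all hypothetical. It also makes three assumptions: the neighborhood is every item the user has rated (no top-k cut), the target item itself is excluded, and the denominator uses absolute similarities so that negative weights cannot cancel it out.

```python
import numpy as np

def predict(sim, ratings, item_means, user, item):
    """Mean-centered item-item prediction for one (user, item) pair.

    sim        : (n_items, n_items) item-item similarity matrix
    ratings    : (n_users, n_items) ratings, np.nan where missing
    item_means : (n_items,) average rating of each item
    """
    rated = ~np.isnan(ratings[user])   # items this user has rated
    rated[item] = False                # assumption: skip the target item itself
    w = sim[item, rated]
    dev = ratings[user, rated] - item_means[rated]
    denom = np.abs(w).sum()
    if denom == 0:
        return item_means[item]        # no usable neighbors: fall back to item mean
    return w @ dev / denom + item_means[item]

# Hypothetical data: 3 users x 3 items.
R = np.array([[5.0, 4.0, np.nan],
              [3.0, np.nan, 2.0],
              [4.0, 5.0, 1.0]])
means = np.nanmean(R, axis=0)          # per-item means, ignoring missing ratings
S = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.3],
              [0.1, 0.3, 1.0]])

score = predict(S, R, means, user=0, item=2)
```

For user 0 and item 2, the user's deviations on items 0 and 1 are weighted by their similarities to item 2 and added to item 2's mean rating.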

In short, item-item CF is much faster and more stable (more conservative) than user-user CF, which is why it is generally considered the more efficient method. It can also be viewed as an aggregated product-association recommender, weighted by similarities.
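To make the "product association" view concrete, here is a rough sketch that ranks items for a basket by summing their precomputed similarities to the basket's contents. The similarity values, item names, and the function `recommend_for_basket` are hypothetical, and summing is only one possible way to aggregate.

```python
import pandas as pd

# Hypothetical precomputed item-item similarity matrix.
items = ["item_a", "item_b", "item_c", "item_d"]
sim = pd.DataFrame(
    [[1.0, 0.6, 0.1, 0.3],
     [0.6, 1.0, 0.2, 0.5],
     [0.1, 0.2, 1.0, 0.4],
     [0.3, 0.5, 0.4, 1.0]],
    index=items, columns=items,
)

def recommend_for_basket(sim, basket, k=2):
    """Rank items outside the basket by their summed similarity to basket items."""
    scores = sim.loc[basket].sum(axis=0).drop(labels=basket)
    return scores.sort_values(ascending=False).head(k)

recs = recommend_for_basket(sim, ["item_a", "item_b"])
```

Because the similarity matrix is computed ahead of time, serving a recommendation only needs a lookup and a sum, which is where much of the speed advantage comes from.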

In [1]:
import numpy as np
import pandas as pd

**Note**: the dataset for this assignment is not entirely correct; the **NormRatings** sheet is not calculated correctly.

In [2]:
ratings = pd.read_excel('Assignment 5.xls', sheet_name='Ratings')

# normalized ratings should use item mean rather than user mean
norm_ratings = pd.read_excel('Assignment 5.xls', sheet_name='NormRatings')
matrix = pd.read_excel('Assignment 5.xls', sheet_name='Matrix')
filter_matrix = pd.read_excel('Assignment 5.xls', sheet_name='FilterMatrix')
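Since the comment above says the normalized ratings should use the item mean rather than the user mean, here is a minimal sketch of item-mean centering in pandas. It uses a small hypothetical table (`toy`) rather than the assignment spreadsheet, and filling missing values with 0 after centering is an assumption (it treats a missing rating as "no deviation from the item mean").

```python
import numpy as np
import pandas as pd

# Hypothetical ratings table: rows are users, columns are items.
toy = pd.DataFrame(
    {"item_a": [5.0, 3.0, np.nan], "item_b": [4.0, np.nan, 2.0]},
    index=[101, 102, 103],
)

item_means = toy.mean(axis=0)            # per-item mean, NaNs are skipped
centered = toy.sub(item_means, axis=1)   # subtract each column's own mean
centered_filled = centered.fillna(0)     # assumption: missing = no deviation
```

After centering, each item column averages to zero over the users who rated it, so the cosine similarity of two columns reflects agreement in deviations from the item mean rather than in raw rating levels.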

In [3]:
ratings.head(3)

| | User | 1: Toy Story (1995) | 1210: Star Wars: Episode VI - Return of the Jedi (1983) | 356: Forrest Gump (1994) | 318: Shawshank Redemption, The (1994) | 593: Silence of the Lambs, The (1991) | 3578: Gladiator (2000) | 260: Star Wars: Episode IV - A New Hope (1977) | 2028: Saving Private Ryan (1998) | 296: Pulp Fiction (1994) | ... | 2916: Total Recall (1990) | 780: Independence Day (ID4) (1996) | 541: Blade Runner (1982) | 1265: Groundhog Day (1993) | 2571: Matrix, The (1999) | 527: Schindler's List (1993) | 2762: Sixth Sense, The (1999) | 1198: Raiders of the Lost Ark (1981) | 34: Babe (1995) | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 755 | 2.0 | 5.0 | 2.0 | NaN | 4.0 | 4.0 | 1.0 | 2.0 | NaN | ... | NaN | 5.0 | 2.0 | 5.0 | 4.0 | 2.0 | 5.0 | NaN | NaN | 3.2 |
| 1 | 5277 | 1.0 | NaN | NaN | 2.0 | 4.0 | 2.0 | 5.0 | NaN | NaN | ... | 2.0 | 2.0 | NaN | 2.0 | NaN | 5.0 | 1.0 | 3.0 | NaN | 2.769231 |
| 2 | 1577 | NaN | NaN | NaN | 5.0 | 2.0 | NaN | NaN | NaN | NaN | ... | 1.0 | 4.0 | 4.0 | 1.0 | 1.0 | 2.0 | 3.0 | 1.0 | 3.0 | 2.333333 |


In [4]:
# the last row holds each movie's L2 norm (square root of the sum of squared ratings)
ratings.iloc[-1]

User                                                            L2
1: Toy Story (1995)                                        11.8322
1210: Star Wars: Episode VI - Return of the Jedi (1983)    11.6619
356: Forrest Gump (1994)                                   9.89949
318: Shawshank Redemption, The (1994)                      12.4097
593: Silence of the Lambs, The (1991)                      12.4499
3578: Gladiator (2000)                                     11.3578
260: Star Wars: Episode IV - A New Hope (1977)             14.2478
2028: Saving Private Ryan (1998)                           10.9087
296: Pulp Fiction (1994)                                   10.6301
1259: Stand by Me (1986)                                   9.94987
2396: Shakespeare in Love (1998)                           10.7703
2916: Total Recall (1990)                                  6.78233
780: Independence Day (ID4) (1996)                         11.7473
541: Blade Runner (1982)                                   10.

#### Top 5 movies most similar to Toy Story

(1) Raw

In [5]:
ratings.columns

Index(['User', '1: Toy Story (1995)',
       '1210: Star Wars: Episode VI - Return of the Jedi (1983)',
       '356: Forrest Gump (1994)', '318: Shawshank Redemption, The (1994)',
       '593: Silence of the Lambs, The (1991)', '3578: Gladiator (2000)',
       '260: Star Wars: Episode IV - A New Hope (1977)',
       '2028: Saving Private Ryan (1998)', '296: Pulp Fiction (1994)',
       '1259: Stand by Me (1986)', '2396: Shakespeare in Love (1998)',
       '2916: Total Recall (1990)', '780: Independence Day (ID4) (1996)',
       '541: Blade Runner (1982)', '1265: Groundhog Day (1993)',
       '2571: Matrix, The (1999)', '527: Schindler's List (1993)',
       '2762: Sixth Sense, The (1999)', '1198: Raiders of the Lost Ark (1981)',
       '34: Babe (1995)', 'Mean'],
      dtype='object')

Use cosine similarity.

In [6]:
# calculate cosine similarity
from sklearn.metrics.pairwise import cosine_similarity

sim_matrix = pd.DataFrame(cosine_similarity(ratings.iloc[:-1, 1:-1].fillna(0).T),
                          columns=ratings.columns[1:-1], index=ratings.columns[1:-1])
sim_matrix['1: Toy Story (1995)'].sort_values(ascending=False)[1:6]

260: Star Wars: Episode IV - A New Hope (1977)    0.747409
780: Independence Day (ID4) (1996)                0.690665
296: Pulp Fiction (1994)                          0.667846
318: Shawshank Redemption, The (1994)             0.667424
1265: Groundhog Day (1993)                        0.661016
Name: 1: Toy Story (1995), dtype: float64

(2) Normalized

Based on normalized rating data.

In [7]:
sim_matrix_norm = pd.DataFrame(cosine_similarity(norm_ratings.iloc[:-1, 1:].T),
                          columns=norm_ratings.columns[1:], index=norm_ratings.columns[1:])
sim_matrix_norm['1: Toy Story (1995)'].sort_values(ascending=False)[1:6]

34: Babe (1995)                          0.554448
356: Forrest Gump (1994)                 0.355780
296: Pulp Fiction (1994)                 0.295013
318: Shawshank Redemption, The (1994)    0.215975
2028: Saving Private Ryan (1998)         0.192799
Name: 1: Toy Story (1995), dtype: float64

#### Top 5 movies for user 5277 

(1) Raw

In [8]:
user = 5277

# user ratings
user_ratings = ratings.loc[ratings['User'] == user].iloc[:, 1:-1]
user_ratings_norm = norm_ratings.loc[norm_ratings['User'] == user].iloc[:, 1:-1]

# item neighbors: only the items this user has rated
item_neighbors = user_ratings.dropna(axis=1).columns
item_neighbors

Index(['1: Toy Story (1995)', '318: Shawshank Redemption, The (1994)',
       '593: Silence of the Lambs, The (1991)', '3578: Gladiator (2000)',
       '260: Star Wars: Episode IV - A New Hope (1977)',
       '1259: Stand by Me (1986)', '2396: Shakespeare in Love (1998)',
       '2916: Total Recall (1990)', '780: Independence Day (ID4) (1996)',
       '1265: Groundhog Day (1993)', '527: Schindler's List (1993)',
       '2762: Sixth Sense, The (1999)',
       '1198: Raiders of the Lost Ark (1981)'],
      dtype='object')

Apply the prediction formula to the raw ratings (no mean-centering).

In [9]:
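def calc_score(user, item, sim_matrix, user_ratings):
    # similarity-weighted average of the user's raw ratings over the items
    # they have rated (relies on the global `item_neighbors`)
    w = sim_matrix.loc[item_neighbors, item].values
    r = user_ratings[item_neighbors].values.reshape(-1)
    return w.dot(r) / np.sum(w)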

In [10]:
score = []
for item in matrix.columns.tolist():
    score.append(calc_score(user, item, sim_matrix, user_ratings))

preds = pd.DataFrame({'item': matrix.columns.tolist(), 'score': score})
preds.sort_values(by='score', ascending=False).head()

| | item | score |
|---|---|---|
| 16 | 527: Schindler's List (1993) | 2.973883 |
| 9 | 1259: Stand by Me (1986) | 2.928801 |
| 6 | 260: Star Wars: Episode IV - A New Hope (1977) | 2.92224 |
| 4 | 593: Silence of the Lambs, The (1991) | 2.883304 |
| 10 | 2396: Shakespeare in Love (1998) | 2.852131 |


(2) Normalized

In [11]:
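def calc_score_norm(user, item, sim_matrix, user_ratings):
    # item means; note that `ratings` still contains the trailing L2 row, which is averaged in
    item_means = ratings.mean(numeric_only=True)
    w = sim_matrix.loc[item_neighbors, item].values
    # center the user's ratings by each neighbor item's mean
    r = user_ratings[item_neighbors].values.reshape(-1) - item_means[item_neighbors].values
    return w.dot(r) / np.sum(np.abs(w)) + item_means[item]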

In [12]:
score = []
for item in matrix.columns.tolist():
    score.append(calc_score_norm(user, item, sim_matrix_norm, user_ratings))

preds = pd.DataFrame({'item': matrix.columns.tolist(), 'score': score})
preds.sort_values(by='score', ascending=False).head()

| | item | score |
|---|---|---|
| 6 | 260: Star Wars: Episode IV - A New Hope (1977) | 5.041868 |
| 16 | 527: Schindler's List (1993) | 4.950092 |
| 9 | 1259: Stand by Me (1986) | 4.202127 |
| 10 | 2396: Shakespeare in Love (1998) | 4.073123 |
| 7 | 2028: Saving Private Ryan (1998) | 3.946249 |
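The rankings above include movies that user 5277 has already rated (Schindler's List, for example, appears in `item_neighbors`). If the goal is to recommend only unseen movies, one option is to filter the predictions before taking the top N. The sketch below shows the idea on hypothetical data; `toy_preds` and `already_rated` are illustrative names, not part of the assignment.

```python
import pandas as pd

# Hypothetical predictions and the set of items the user has already rated.
toy_preds = pd.DataFrame({
    "item": ["item_a", "item_b", "item_c", "item_d"],
    "score": [4.6, 4.1, 3.9, 2.5],
})
already_rated = {"item_a", "item_c"}

# Keep only unseen items, then take the top N by predicted score.
top_n = (toy_preds[~toy_preds["item"].isin(already_rated)]
         .sort_values(by="score", ascending=False)
         .head(2))
```

In the notebook, the same filter could be applied to `preds` using `item_neighbors` as the set of rated items.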
