In [1]:
import pandas as pd
import numpy as np

---
### About this notebook:
In this chapter we explore collaborative filtering, in particular item-item based CF.

---
### Create toy dataset:
(see page 189 of the text)

In [2]:
toy_df = pd.DataFrame({'users':['Sara', 'Jesper', 'Therese', 'Helle', 'Pietro', 'Ekaterina'],
                      'MIB':[5, 4, 5, 3, 3, 2],
                      'ST':[3, 3, 2, 5, 3, 3],
                      'AV':[np.nan, 4, 5, 3, 3, 2],
                      'BH':[2, np.nan, 2, np.nan, 2, 3],
                      'SS':[2, 3, 1, 1, 4, 5],
                      'LM':[2, 3, 1, 1, 5, 5]})
toy_df = toy_df.set_index('users')
toy_df

users,MIB,ST,AV,BH,SS,LM
Sara,5,3,,2.0,2,2
Jesper,4,3,4.0,,3,3
Therese,5,2,5.0,2.0,1,1
Helle,3,5,3.0,,1,1
Pietro,3,3,3.0,2.0,4,5
Ekaterina,2,3,2.0,3.0,5,5


---
### Normalize the ratings by each user's mean:

In [3]:
# first, find the average rating for each user
toy_df.mean(axis=1)

users
Sara         2.800000
Jesper       3.400000
Therese      2.666667
Helle        2.600000
Pietro       3.333333
Ekaterina    3.333333
dtype: float64

In [4]:
# transpose the original df to make it easy to normalize the correct values:
normalized_t = toy_df.transpose() - np.round(toy_df.mean(axis=1), 2)
normalized_t

users,Sara,Jesper,Therese,Helle,Pietro,Ekaterina
MIB,2.2,0.6,2.33,0.4,-0.33,-1.33
ST,0.2,-0.4,-0.67,2.4,-0.33,-0.33
AV,,0.6,2.33,0.4,-0.33,-1.33
BH,-0.8,,-0.67,,-1.33,-0.33
SS,-0.8,-0.4,-1.67,-1.6,0.67,1.67
LM,-0.8,-0.4,-1.67,-1.6,1.67,1.67


In [5]:
# return the transposed dataframe to its original axes
normalized_df = normalized_t.transpose()
normalized_df

users,MIB,ST,AV,BH,SS,LM
Sara,2.2,0.2,,-0.8,-0.8,-0.8
Jesper,0.6,-0.4,0.6,,-0.4,-0.4
Therese,2.33,-0.67,2.33,-0.67,-1.67,-1.67
Helle,0.4,2.4,0.4,,-1.6,-1.6
Pietro,-0.33,-0.33,-0.33,-1.33,0.67,1.67
Ekaterina,-1.33,-0.33,-1.33,-0.33,1.67,1.67


**Note:** the matrix above matches the one shown in Table 8.3 on page 191.
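
The double transpose above can be avoided. The following is a minimal sketch, assuming the same toy ratings, that subtracts each user's (rounded) mean row-wise with `DataFrame.sub(..., axis=0)`. The names `ratings`, `user_means`, and `normalized` are illustrative and are not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# same toy ratings as above, rebuilt here so the sketch is self-contained
ratings = pd.DataFrame(
    {'MIB': [5, 4, 5, 3, 3, 2],
     'ST': [3, 3, 2, 5, 3, 3],
     'AV': [np.nan, 4, 5, 3, 3, 2],
     'BH': [2, np.nan, 2, np.nan, 2, 3],
     'SS': [2, 3, 1, 1, 4, 5],
     'LM': [2, 3, 1, 1, 5, 5]},
    index=pd.Index(['Sara', 'Jesper', 'Therese', 'Helle', 'Pietro', 'Ekaterina'],
                   name='users'))

# per-user mean (NaN ratings are skipped), rounded to 2 decimals as in In [4]
user_means = ratings.mean(axis=1).round(2)

# axis=0 aligns the means with the row index, so no transpose is needed
normalized = ratings.sub(user_means, axis=0)
print(normalized.round(2))
```

Missing ratings stay `NaN` after the subtraction, as they do in the transposed version.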

---
### Define the item-item function:

In [6]:
def item_item_similarity(ratings_df, item_1, item_2):
    # normalize the ratings data:
    normalized_t = ratings_df.transpose() - np.round(ratings_df.mean(axis=1), 2)
    normalized_df = normalized_t.transpose()
    # normalized_df = normalized_df.fillna(0)
    
    # calculate numerator and denominator (pandas sums skip NaN, so missing ratings are ignored):
    numerator = np.sum(normalized_df[item_1]*normalized_df[item_2])
    denominator = np.sqrt(np.sum(normalized_df[item_1]**2))*np.sqrt(np.sum(normalized_df[item_2]**2))
    
    # define the adjusted cosine similarity:
    adjusted_cosine_sim = np.round(numerator / denominator, 3)
    
    return adjusted_cosine_sim

In [7]:
item_item_similarity(toy_df, 'MIB', 'ST')

0.016

**Note:** while the results here agree with the main example in the text, some of the values shown in Table 8.4 on page 192 do not. I carried out the calculations described in the text in Excel and verified that the results obtained using the function given above are in fact correct.
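
For reference, the quantity that `item_item_similarity` computes can be written as follows. This is my reading of the code above rather than a quotation from the text, and the symbols are assumptions: $r_{u,i}$ is user $u$'s rating of item $i$ and $\bar{r}_u$ is that user's mean rating.

$$
\mathrm{sim}(i, j) = \frac{\sum_{u} (r_{u,i} - \bar{r}_u)(r_{u,j} - \bar{r}_u)}{\sqrt{\sum_{u} (r_{u,i} - \bar{r}_u)^2}\,\sqrt{\sum_{u} (r_{u,j} - \bar{r}_u)^2}}
$$

In this implementation the sum in the numerator effectively runs over users who rated both items, while each sum in the denominator runs over all users who rated that one item, because missing ratings are skipped.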

---
### Serving Predictions:
The function below finds the neighbors (most similar items) of a target item whose similarity meets a threshold value.

In [8]:
def find_neighborhood_thresh(ratings_df, target_item, thresh):
    # set list to hold (item, similarity) tuples:
    sims = []
    
    # calculate similarities (iterate over the columns of the dataframe passed in):
    for item in ratings_df.columns:
        if item != target_item:
            sim = item_item_similarity(ratings_df, target_item, item)
            # apply threshold:
            if sim >= thresh:
                sims.append((item, sim))
            
    # convert tuples to df and sort by score:
    df = pd.DataFrame(sims, columns=['item', 'similarity_score']).set_index('item')
    neighborhood_df = df.sort_values(by='similarity_score', ascending=False) 
    
    return neighborhood_df
    

In [9]:
# closest items in terms of similarity:
find_neighborhood_thresh(ratings_df=toy_df, target_item='ST', thresh=0.01)

item,similarity_score
BH,0.189
MIB,0.016


---
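### Define Prediction Function:
The function below uses the similarities of the neighboring items, together with the target user's ratings for those items, to predict the rating for a target item. Note that the code adds the similarity-weighted average of the user's raw ratings (with unrated items counted as 0) to the user's mean rating.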

In [10]:
def predict_item_rating_for_user(ratings_df, target_user, target_item, thresh):
    # get user's mean rating:
    user_mean_rating = ratings_df.mean(axis=1).loc[target_user]
    
    # find the neighborhood of items with scores above threshold for target item
    # (use the function arguments rather than hard-coded values):
    neighbors_df = find_neighborhood_thresh(ratings_df=ratings_df, target_item=target_item, thresh=thresh).fillna(0)
    item_neighbors_sims = neighbors_df['similarity_score'].values
    
    # get target user's raw ratings for neighbor items (unrated items count as 0):
    neighbors_lst = neighbors_df.index.tolist()
    user_neighbors_ratings = ratings_df[neighbors_lst].loc[target_user].fillna(0).values
    
    # compute prediction: user mean plus similarity-weighted average of the ratings
    prediction = np.round(user_mean_rating + np.sum(item_neighbors_sims*user_neighbors_ratings)/np.sum(item_neighbors_sims), 2)
    
    return prediction

In [11]:
predict_item_rating_for_user(ratings_df=toy_df, target_user='Helle', target_item='ST', thresh=0.01)

2.83
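
To see where this number comes from, the arithmetic can be reproduced by hand. This is a small sketch using the values printed earlier in the notebook: Helle's mean rating (2.6), the similarities of BH and MIB to ST (0.189 and 0.016), and Helle's ratings for those two items (BH is missing and is filled with 0, MIB is 3). The variable names are illustrative.

```python
import numpy as np

user_mean = 2.6                       # Helle's mean rating, from the output of In [3]
sims = np.array([0.189, 0.016])       # similarity of BH and MIB to ST, from In [9]
ratings = np.array([0.0, 3.0])        # Helle's ratings: BH missing (counted as 0), MIB = 3

# similarity-weighted average of the ratings, added to the user's mean
weighted_avg = (sims * ratings).sum() / sims.sum()
prediction = float(np.round(user_mean + weighted_avg, 2))
print(prediction)  # 2.83
```

Because the missing BH rating is counted as 0 while its similarity still enters the denominator, the unrated item pulls the weighted average down.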

---


Unfortunately, we cannot test the similarity values produced by the functions above against the values obtained by the author in the table named "similarity", since the dataset is too large to pivot.