# Recommender Systems with Surprise
- **Created by Andrés Segura Tinoco**
- **Created on May 23, 2019**

## Experiment description
- Model built from a plain text file
- The algorithm used is: KNNBasic
- Model trained and evaluated with 5-fold cross-validation
- The RMSE and MAE metrics were used to estimate the model error
- Type of filtering: collaborative
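
As a reference for the two error metrics listed above, here is a minimal plain-Python sketch of how RMSE and MAE are computed from (true rating, estimated rating) pairs. The function names and the sample pairs are illustrative assumptions, not part of the notebook or of Surprise.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    return math.sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true_rating, estimated_rating) pairs."""
    return sum(abs(r - est) for r, est in pairs) / len(pairs)

# Hypothetical sample pairs, for illustration only
pairs = [(4.0, 4.15), (3.0, 3.5), (5.0, 4.2)]
print('RMSE:', rmse(pairs))
print('MAE:', mae(pairs))
```

RMSE penalizes large errors more heavily than MAE, so it is never smaller than MAE on the same predictions.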

In [1]:
# Load standard-library modules
import os
import io
from collections import defaultdict

In [2]:
# Load Surprise libraries
from surprise import KNNBasic
from surprise import Reader
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import cross_validate

## 1. Loading data

In [3]:
# Path to dataset file
file_path = os.path.expanduser('../data/u.data')

In [4]:
# Read the data into a Surprise dataset
reader = Reader(line_format = 'user item rating timestamp', sep = '\t', rating_scale = (1, 5))
data = Dataset.load_from_file(file_path, reader = reader)
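
To make the `line_format` and `sep` arguments more concrete, the sketch below parses one tab-separated line by hand. It is only an illustration of the file layout, not how Surprise reads the file internally; `parse_rating_line` and the sample line are assumptions introduced here.

```python
def parse_rating_line(line, sep='\t'):
    """Split one 'user item rating timestamp' line into typed fields."""
    user, item, rating, timestamp = line.rstrip('\n').split(sep)
    # Raw ids stay as strings; only the rating is converted to a number
    return user, item, float(rating), timestamp

# Illustrative line in the u.data layout (not necessarily a real record)
sample = '196\t242\t3\t881250949\n'
parsed = parse_rating_line(sample)
print(parsed)
```

Because the data comes from a text file, raw user and item ids are strings, which is why the predictions later in this notebook use `uid = '13'` rather than `uid = 13`.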

## 2. Train the model and measure its error

In [5]:
# Use k-NN inspired algorithms
kk = 50
algo = KNNBasic(k = kk, verbose = True)
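
The training log in this notebook reports an MSD (mean squared difference) similarity matrix. The sketch below shows the idea behind a basic user-based k-NN prediction with that similarity: the estimate is a similarity-weighted average of the ratings that the k most similar users gave to the item. It is a simplified illustration under stated assumptions (the helper names and toy ratings are hypothetical, and options such as `min_k` are ignored), not Surprise's actual implementation.

```python
def msd_similarity(ratings_u, ratings_v):
    """MSD-based similarity between two users given {item: rating} dicts."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0
    msd = sum((ratings_u[i] - ratings_v[i]) ** 2 for i in common) / len(common)
    # Adding 1 avoids division by zero when the two users agree exactly
    return 1.0 / (msd + 1.0)

def knn_predict(ratings, user, item, k=2):
    """Similarity-weighted average of the k most similar users' ratings."""
    neighbors = [
        (msd_similarity(ratings[user], other_ratings), other_ratings[item])
        for other, other_ratings in ratings.items()
        if other != user and item in other_ratings
    ]
    neighbors = sorted(neighbors, key=lambda t: t[0], reverse=True)[:k]
    total_sim = sum(sim for sim, _ in neighbors)
    if total_sim == 0:
        return None  # no usable neighbor rated this item
    return sum(sim * r for sim, r in neighbors) / total_sim

# Hypothetical toy ratings: {user: {item: rating}}
ratings = {
    'a': {'x': 4, 'y': 5},
    'b': {'x': 4, 'y': 5, 'z': 3},
    'c': {'x': 2, 'y': 3, 'z': 5},
    'd': {'x': 1, 'z': 1},
}
print(knn_predict(ratings, 'a', 'z', k=2))
```

In this toy data, user `b` agrees with `a` on every common item, so its rating dominates the weighted average.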

In [6]:
# Run 5-fold cross-validation and print results
cv = cross_validate(algo, data, measures = ['RMSE', 'MAE'], cv = 5, verbose = True)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9824  0.9819  0.9809  0.9833  0.9742  0.9806  0.0033  
MAE (testset)     0.7778  0.7750  0.7755  0.7769  0.7701  0.7751  0.0027  
Fit time          0.65    0.67    0.72    0.96    0.71    0.74    0.11    
Test time         4.99    5.61    11.32   5.97    6.00    6.78    2.30    
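
The table above has one column per fold. A minimal sketch of how a k-fold split partitions the data is shown below; `k_fold_splits` is a hypothetical helper written for illustration and is not the splitter Surprise uses internally.

```python
import random

def k_fold_splits(n_samples, n_splits=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_splits roughly equal folds
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i in range(n_splits):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_splits(10, n_splits=5))
for train, test in splits:
    print('train size:', len(train), 'test size:', len(test))
```

Each rating lands in exactly one test fold, so every fold's error is measured on data the model did not see during that fit.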


## 3. Make some predictions

In [7]:
# Without real rating
p1 = algo.predict(uid = '13', iid = '181', verbose = True)

user: 13         item: 181        r_ui = None   est = 4.20   {'actual_k': 50, 'was_impossible': False}


In [8]:
# With real rating
p2 = algo.predict(uid = '196', iid = '302', r_ui = 4, verbose = True)

user: 196        item: 302        r_ui = 4.00   est = 4.15   {'actual_k': 50, 'was_impossible': False}


## 4. Get the k nearest neighbors of an item

In [9]:
# Return two mappings to convert raw ids into movie names and movie names into raw ids
def read_item_names(file_path):
    rid_to_name = {}
    name_to_rid = {}
    
    with io.open(file_path, 'r', encoding = 'ISO-8859-1') as f:
        for line in f:
            line = line.split('|')
            rid_to_name[line[0]] = line[1]
            name_to_rid[line[1]] = line[0]
    
    return rid_to_name, name_to_rid

In [10]:
# Read the mappings raw id <-> movie name
item_filepath = '../data/u.item'
rid_to_name, name_to_rid = read_item_names(item_filepath)

In [11]:
# Retrieve inner id of the movie Toy Story
toy_story_raw_id = name_to_rid['Toy Story (1995)']
toy_story_inner_id = algo.trainset.to_inner_iid(toy_story_raw_id)
print('Toy Story (1995):', toy_story_inner_id)

Toy Story (1995): 120


In [12]:
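# Retrieve inner ids of the nearest neighbors of Toy Story
# Note: KNNBasic defaults to user-based similarities; for true item neighbors,
# the model would need sim_options = {'user_based': False}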
toy_story_neighbors = algo.get_neighbors(toy_story_inner_id, k = 10)
toy_story_neighbors

[227, 294, 342, 356, 482, 655, 754, 791, 724, 834]

In [13]:
# The 10 nearest neighbors of Toy Story are:
for inner_id in toy_story_neighbors:
    raw_id = algo.trainset.to_raw_iid(inner_id)
    movie = rid_to_name[raw_id]
    print(raw_id, '-', movie)

27 - Bad Boys (1995)
592 - True Crime (1995)
1074 - Reality Bites (1994)
255 - My Best Friend's Wedding (1997)
539 - Mouse Hunt (1997)
1118 - Up in Smoke (1978)
26 - Brothers McMullen, The (1995)
486 - Sabrina (1954)
1316 - Horse Whisperer, The (1998)
212 - Unbearable Lightness of Being, The (1988)


## 5. Get the top-N recommendations

In [14]:
# Return the top-N recommendations for each user from a set of predictions.
def get_top_n(predictions, n = 10):
    
    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))
        
    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]
    
    return top_n
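
Since only the n best items per user are kept, a full sort is not strictly necessary. The variant below is a sketch that uses `heapq.nlargest` from the standard library instead; `get_top_n_heap` and the toy predictions are hypothetical names and data introduced for illustration.

```python
import heapq
from collections import defaultdict

def get_top_n_heap(predictions, n=10):
    """Return the n highest-estimated items per user without a full sort."""
    by_user = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        by_user[uid].append((iid, est))
    # nlargest returns the items in descending order of estimated rating
    return {
        uid: heapq.nlargest(n, user_ratings, key=lambda x: x[1])
        for uid, user_ratings in by_user.items()
    }

# Hypothetical predictions as (uid, iid, true_r, est, details) tuples
toy_predictions = [
    ('u1', 'i1', 3.5, 4.2, {}),
    ('u1', 'i2', 3.5, 4.9, {}),
    ('u1', 'i3', 3.5, 2.1, {}),
    ('u2', 'i1', 3.5, 3.0, {}),
]
toy_top = get_top_n_heap(toy_predictions, n=2)
print(toy_top)
```

For small n and many candidate items per user, this avoids sorting the full list, although for a dataset of this size the difference is unlikely to matter much.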

In [15]:
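# Build the full training set and the anti-testset
# (all user-item pairs that have no rating in the training set)
train_set = data.build_full_trainset()
test_set = train_set.build_anti_testset()

# Train the KNN algorithm on the whole dataset, then predict every unrated pair
algo.fit(train_set)
predictions = algo.test(test_set)

# The anti-testset has no real ratings (r_ui is filled with the global mean),
# so this RMSE is not a true estimate of the model error
accuracy.rmse(predictions, verbose = True)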

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9169


0.9168591886191365
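
To clarify what the anti-testset contains, the plain-Python sketch below builds one from a tiny list of ratings: every (user, item) pair without a known rating, with the global mean used as a placeholder for the unknown rating. The `anti_testset` helper and the toy data are assumptions for illustration, not Surprise's code.

```python
def anti_testset(ratings):
    """Return (user, item, fill) for every pair missing from ratings."""
    global_mean = sum(r for _, _, r in ratings) / len(ratings)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    seen = {(u, i) for u, i, _ in ratings}
    # The placeholder rating is the global mean, not a real observation
    return [(u, i, global_mean)
            for u in users for i in items if (u, i) not in seen]

# Hypothetical toy ratings as (user, item, rating) tuples
toy_ratings = [('u1', 'i1', 4.0), ('u1', 'i2', 2.0), ('u2', 'i1', 3.0)]
toy_anti = anti_testset(toy_ratings)
print(toy_anti)
```

An error metric computed on such pairs compares the estimates with the placeholder value only, which is why it says little about real prediction quality.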

In [16]:
# Then keep the top-N predicted items for each user
top_n = 10
top_pred = get_top_n(predictions, n = top_n)

In [17]:
# Print the recommended items for a specific user
uid_list = ['196']

for uid, user_ratings in top_pred.items():
    if uid in uid_list:
        for (iid, rating) in user_ratings:
            movie = rid_to_name[iid]
            print('Movie:', iid, '-', movie, ', rating:', str(rating))

Movie: 1189 - Prefontaine (1997) , rating: 5
Movie: 1500 - Santa with Muscles (1996) , rating: 5
Movie: 814 - Great Day in Harlem, A (1994) , rating: 5
Movie: 1536 - Aiqing wansui (1994) , rating: 5
Movie: 1599 - Someone Else's America (1995) , rating: 5
Movie: 1653 - Entertaining Angels: The Dorothy Day Story (1996) , rating: 5
Movie: 1467 - Saint of Fort Washington, The (1993) , rating: 5
Movie: 1122 - They Made Me a Criminal (1939) , rating: 5
Movie: 1201 - Marlene Dietrich: Shadow and Light (1996)  , rating: 5
Movie: 1293 - Star Kid (1997) , rating: 4.999999999999999

