# TP10 (Student version): a recommender system

We can use the following libraries.

In [None]:
import matplotlib.pyplot as plt
import math
import sys
import random
import time
import copy
print(sys.version)

The purpose of this practical work is to build a basic recommender system and apply it to a MovieLens dataset.

## Exercise 1: data preparation

Download the rating data extracted from MovieLens http://lioneltabourier.fr/documents/rating_list.txt

This file is organised as follows:

<pre>
user_id   movie_id   rating
</pre>

It contains 100836 ratings of 9724 movies by 610 different users. Ratings on MovieLens go from 0.5 to 5.

The corresponding movie index is available here: http://lioneltabourier.fr/documents/movies.csv

### Question 1

Select **randomly** 1% of the ratings (so 1008 ratings). This will be your test set for the rest of this lab: these ratings are considered unknown, and we aim to predict them from the learning set, which is the remaining 99% of the ratings.

Create two files, one containing the learning ratings and the other containing the test ratings (please attach them to the .ipynb file when sending your TP).
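The random split described above can be sketched on an in-memory list of lines. This is only an illustration under assumptions: the helper name `split_ratings` and its `seed` parameter are hypothetical and not part of the lab statement, and the toy lines stand in for the real rating file.

```python
import random

def split_ratings(lines, seed=None):
    """Return (learning, test) lists; test is a random 1% sample of the lines."""
    rng = random.Random(seed)                 # local generator; seed is optional
    test_size = len(lines) // 100             # 1% of the ratings, rounded down
    test_indices = set(rng.sample(range(len(lines)), test_size))
    learning = [line for k, line in enumerate(lines) if k not in test_indices]
    test = [line for k, line in enumerate(lines) if k in test_indices]
    return learning, test

# Toy data: 200 fake "user_id movie_id rating" lines
toy_lines = ["{} {} 3.0\n".format(k % 10, k) for k in range(200)]
learning, test = split_ratings(toy_lines, seed=0)
print(len(learning), len(test))
```

Sampling indices without replacement guarantees that no rating ends up in both sets, and fixing the seed makes the split reproducible from one run to the next.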

In [None]:
def data_preparation(rating_list):
    with open(rating_list, "r") as file:
        lines = file.readlines()
        test_set_length = len(lines) // 100
        test_set = []
        while test_set_length > 0:
            random_choice = random.randint(0, len(lines) - 1)
            test_set.append(lines.pop(random_choice))
            test_set_length -= 1
        with open("test_set.txt", "w") as test_set_file:
            for line in test_set:
                test_set_file.write(line)
        with open("learning_set.txt", "w") as learning_set_file:
            for line in lines:
                learning_set_file.write(line)

In [None]:
data_preparation("res/rating_list.txt")

## Exercise 2: benchmark recommender 

The benchmark recommender that you will create works as follows: for a user $u$ and an item $i$, the predicted score is

$$ r^*(u,i) = \overline{r} + ( \overline{r(u)} - \overline{r}) + ( \overline{r(i)} - \overline{r})$$

$\overline{r}$ is the average rating over the whole learning dataset.

$\overline{r(u)}$ is the average rating over the learning dataset of user $u$. In case $u$ is not present in the learning set, consider that $\overline{r(u)} = \overline{r}$.

$\overline{r(i)}$ is the average rating over the learning dataset of item $i$. In case $i$ is not present in the learning set, consider that $\overline{r(i)} = \overline{r}$.
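As a quick sanity check of the formula, here is a small sketch that evaluates it on made-up averages. The function name `benchmark_score` and the numeric values are assumptions chosen for illustration only.

```python
def benchmark_score(mean_all, mean_user, mean_item):
    """r*(u,i) = r + (r(u) - r) + (r(i) - r), with r the global average."""
    return mean_all + (mean_user - mean_all) + (mean_item - mean_all)

# Hypothetical averages: global 3.5, user 4.0 (generous rater), item 3.0 (below-average movie)
score = benchmark_score(3.5, 4.0, 3.0)
print(score)

# A user absent from the learning set gets the global average as user average
score_unknown_user = benchmark_score(3.5, 3.5, 3.0)
print(score_unknown_user)
```

Note that the expression simplifies to $\overline{r(u)} + \overline{r(i)} - \overline{r}$: the user and item terms each shift the global average by their own bias.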

### Question 2

Load the learning data in memory.

Hint: an adequate format for the rest of this TP is to create two dictionaries of lists (warning: a dictionary of sets won't work, because a set would merge identical ratings):

1) keys = user ids , values = list of ratings 

2) keys = item ids , values = list of ratings 

In [None]:
def load_data(learning_data):
    user_set, movie_set = {}, {}
    with open(learning_data, "r") as file:
        for line in file:
            user, movie, rating = line.split()
            user, movie, rating = int(user), int(movie), float(rating)
            if user not in user_set:
                user_set[user] = []
            if movie not in movie_set:
                movie_set[movie] = []
            user_set[user].append(rating)
            movie_set[movie].append(rating)
    return user_set, movie_set
            

In [None]:
user_set, item_set = load_data("learning_set.txt")

### Question 3

Create a function which given a user $u$ and an item $i$ returns the value of $r^*(u,i)$ computed on the learning set.


In [None]:
def predict_score(my_user, my_item, user_set, movie_set):
    rating_sum, rating_count = 0, 0
    for user in user_set:
        rating_sum += sum(user_set[user])
        rating_count += len(user_set[user])
    average_rating = rating_sum / rating_count
    user_rating = sum(user_set[my_user]) / len(user_set[my_user]) if my_user in user_set else average_rating
    item_rating = sum(movie_set[my_item]) / len(movie_set[my_item]) if my_item in movie_set else average_rating
    return user_rating + item_rating - average_rating 

In [None]:
predict_score(610, 170875, user_set, item_set)

## Exercise 3: evaluation

Now that we have a prediction process, we evaluate its performance on the test set.

### Question 4

For each rating in the test set, compute the rating predicted by the function defined above and compare it to the actual score. If an item has not been rated in the learning set or a user has made no rating in the learning set, don't do any prediction.

To present your results, you can print them in the form:

<pre>
user_id item_id real_rating predicted_rating
</pre>

At first sight, what is your opinion about the ratings that you obtained?


In [None]:
def evaluate(user_set, item_set, test_file):
    my_evaluations = []
    with open(test_file, "r") as file:
        for line in file:
            user, item, real_rating = line.split()
            user, item, real_rating = int(user), int(item), float(real_rating)
            if user in user_set and item in item_set:
                predicted_rating = predict_score(user, item, user_set, item_set)
                #print("{} {} {} {}".format(user, item, real_rating, round(predicted_rating,2)))
                my_evaluations.append((user, item, real_rating, predicted_rating))
    return my_evaluations

In [None]:
my_evaluations = evaluate(user_set, item_set, "test_set.txt")
for user, item, real_rating, predicted_rating in my_evaluations:
    print("{:>7} {:>7} {:>7} {:>7}".format(user, item, real_rating, round(predicted_rating,2)))

### Question 5

Using the previous question, compute the _Root Mean Square Error_, as defined in the course for the whole set of predictions:

$$RMSE = \sqrt{\frac{\sum _{k=1} ^K (r^*_k - r_k)^2 }{K}} $$

Here $K$ is the number of predictions, $ r^*_k $ the predicted rating,  $ r_k $ the real rating.
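To make the definition concrete, the sketch below computes the RMSE of four made-up (real, predicted) pairs. The helper name `rmse` and the toy values are assumptions for illustration.

```python
import math

def rmse(pairs):
    """pairs: list of (real_rating, predicted_rating) tuples."""
    squared_errors = [(predicted - real) ** 2 for real, predicted in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Errors are -1, +1, 0 and -2, so the squared errors sum to 6 over K = 4 predictions
toy_pairs = [(4.0, 3.0), (2.0, 3.0), (5.0, 5.0), (3.0, 1.0)]
value = rmse(toy_pairs)
print(value)  # sqrt(1.5)
```

Because errors are squared before averaging, the single error of 2 contributes as much as four errors of 1 would.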

In [None]:
def compute_rmse(evaluation):
    return math.sqrt(sum([pow(predicted - real, 2) for _,_, real, predicted in evaluation]) / len(evaluation))

In [None]:
compute_rmse(my_evaluations)

# Part 2: user-based collaborative filtering

Using the same learning and test files as in Part 1, we aim to build a collaborative filtering method to improve the results.

For this purpose, we define a distance between users: $ u_1 $ and $ u_2 $ will be close if they rate movies similarly and far away if they rate movies differently.

When predicting a score $ r^*_{CF}(u,i)$, we take into account this distance such that close users have more influence than distant users.

## Exercise 1: loading data

### Question 1

To build a collaborative filtering recommender system, we need more information than in Part 1.

So for Part 2, create two dictionaries of lists from the learning file:

1) keys = user ids , values = list of couples (item , rating) 

2) keys = item ids , values = list of couples (user , rating)

In [None]:
def load_data_improved(learning_data):
    user_set, item_set = {}, {}
    with open(learning_data, "r") as file:
        for line in file:
            user, item, rating = line.split()
            user, item, rating = int(user), int(item), float(rating)
            if user not in user_set:
                user_set[user] = {}
            if item not in item_set:
                item_set[item] = {}
            user_set[user][item] = rating
            item_set[item][user] = rating
    return user_set, item_set

In [None]:
user_set_improved, item_set_improved = load_data_improved("learning_set.txt")

## Exercise 2: computing distance

The distance between users is defined as follows:

$$ d(u_1,u_2) = \frac{1}{|I_1 \cap I_2|} \sum _{i \in I_1 \cap I_2} | r(u_1,i)  - r(u_2,i)| $$

where $ I_1 $ is the set of items rated by $u_1$ and $ I_2 $ is the set of items rated by $u_2$.

### Question 2

Compute in a 2D matrix (you can either use numpy or a list of lists to simulate a matrix) of size 610x610 (there are 610 users in the dataset) the distance between all pairs of users.

**Warning:** This is the difficult part of the lab, as your code needs to be reasonably efficient. I advise you to create two matrices, one storing $\sum _{i \in I_1 \cap I_2} | r(u_1,i)  - r(u_2,i)|$ and the other storing $|I_1 \cap I_2|$. Then go through each item using the second dictionary, and update the values in both matrices for each pair of users who have rated the same movie.
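The two-matrix approach suggested above could look like the following sketch. It assumes a dictionary of the form `{item: {user: rating}}` with users numbered from 1, and it uses -1 as an arbitrary marker for pairs of users without any common item; the function name `pairwise_distances` is hypothetical.

```python
def pairwise_distances(item_ratings, n_users):
    """Return an n_users x n_users list of lists of distances (-1 = no common item)."""
    diff_sum = [[0.0] * n_users for _ in range(n_users)]  # sum of |r(u1,i) - r(u2,i)|
    common = [[0] * n_users for _ in range(n_users)]      # number of common items
    for ratings in item_ratings.values():
        raters = list(ratings.items())
        # Visit each unordered pair of users who rated this item exactly once
        for a in range(len(raters)):
            u1, r1 = raters[a]
            for b in range(a + 1, len(raters)):
                u2, r2 = raters[b]
                gap = abs(r1 - r2)
                diff_sum[u1 - 1][u2 - 1] += gap
                diff_sum[u2 - 1][u1 - 1] += gap
                common[u1 - 1][u2 - 1] += 1
                common[u2 - 1][u1 - 1] += 1
    # In this sketch the diagonal stays at -1, since a user is never paired with itself
    return [[diff_sum[i][j] / common[i][j] if common[i][j] else -1
             for j in range(n_users)] for i in range(n_users)]

# Toy data: item 10 rated by users 1 and 2, item 20 rated by users 1, 2 and 3
toy_items = {10: {1: 4.0, 2: 2.0}, 20: {1: 5.0, 2: 4.0, 3: 1.0}}
distances = pairwise_distances(toy_items, 3)
print(distances[0][1], distances[0][2], distances[1][2])
```

The cost is driven by the number of pairs of raters per item rather than by an intersection of item sets for every pair of users, which is what makes this formulation comparatively fast on sparse rating data.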


In [None]:
def compute_distance(user1, user2, user_set, item_set):
    I1, I2 = set(user_set[user1].keys()), set(user_set[user2].keys())
    my_intersection = I1.intersection(I2)
    if len(my_intersection) == 0:
        return -1
    my_distance = 0
    for item in my_intersection:
        my_distance += abs(item_set[item][user1] - item_set[item][user2])
    return my_distance / len(my_intersection)


def compute_all_distances(user_set, item_set):
    matrix = [[0 for j in range(610)] for i in range(610)]
    for i in range(len(matrix)):
        for j in range(len(matrix[i])):
            if i >= j:
                my_distance = compute_distance(i + 1, j + 1, user_set, item_set)
                matrix[i][j], matrix[j][i] = my_distance, my_distance
    return matrix

In [None]:
matrix = compute_all_distances(user_set_improved, item_set_improved)

### Question 3

Using the matrix of distances, compute a dictionary which contains for each user $ u $ its average distance to other users $ \overline{d(u)} $. 

Note that if a user $v$ has no common rating with user $u$, it is not taken into account in the average.

Formally:

$$ \overline{d(u)} = \frac{1}{|N(u)|} \sum _{v \in N(u)} d(u,v)$$

where $ N(u) $ is the set of users who share at least 1 rating with $u$.
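The averaging rule can be illustrated on a single row of distances. The sketch assumes that -1 marks a user with no common rating (a convention chosen for this example) and that the user's own entry must be skipped; `average_distance` is a hypothetical helper name.

```python
def average_distance(row, self_index):
    """row: distances from one user to every user; -1 means 'no common rating'."""
    neighbours = [d for k, d in enumerate(row) if k != self_index and d != -1]
    # Return None when N(u) is empty, since the average is undefined in that case
    return sum(neighbours) / len(neighbours) if neighbours else None

# User at index 0: distance 1.5 to index 1, nothing in common with index 2, 4.5 to index 3
toy_row = [0, 1.5, -1, 4.5]
result = average_distance(toy_row, 0)
print(result)  # (1.5 + 4.5) / 2
```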

In [None]:
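def compute_average(user, matrix):
    # User ids start at 1, matrix rows start at 0
    user -= 1
    average_distance, count = 0, 0
    for index in range(len(matrix[user])):
        # Skip the user itself and users without any common rating (-1)
        if matrix[user][index] != -1 and index != user:
            average_distance += matrix[user][index]
            count += 1
    return average_distance / count

def compute_all_average(matrix):
    # User ids run from 1 to len(matrix), as expected by compute_average
    return dict((user, compute_average(user, matrix)) for user in range(1, len(matrix) + 1))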
        
            

In [None]:
compute_all_average(matrix)

## Exercise 3: evaluation

The predicted score is computed as follows:

$$ r^*(u,i) = \overline{r} + ( \overline{r(u)} - \overline{r}) + ( \overline{r_u(i)} - \overline{r})$$ 

You can observe that this score is similar to the benchmark, except for the term $ \overline{r_u(i)} $, which is defined as

$$ \overline{r_u(i)} = \frac{\sum _{v \in U} w(u,v) r(v,i)}{\sum _{v \in U} w(u,v)} $$

It is a weighted average of the scores of the other users who have rated item $i$ (here $U$ denotes this set of users). The weight is based on the distance: $ w(u,v) = \frac{\overline{d(u)}}{d(u,v)} $, and $ w(u,v) = 1 $ if $u$ and $v$ don't share any rating. In this way, if user $u$ has no common rating with any other user, all weights are equal to 1 and we fall back on the benchmark score.
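The effect of the weights can be seen on a toy case. In the sketch below, the function name `weighted_item_average` and the input dictionaries are assumptions; a missing entry in `distances_from_u` stands for "no common rating", and using weight 1 when $d(u,v) = 0$ is a choice made here only to avoid a division by zero, since the statement does not cover that case.

```python
def weighted_item_average(raters, distances_from_u, mean_distance_u):
    """raters: {v: r(v,i)}; distances_from_u: {v: d(u,v)} for users sharing a rating with u."""
    numerator, denominator = 0.0, 0.0
    for v, rating in raters.items():
        d = distances_from_u.get(v)
        # w(u,v) = d(u) / d(u,v); weight 1 if no common rating (or, by assumption, if d = 0)
        weight = mean_distance_u / d if d else 1.0
        numerator += weight * rating
        denominator += weight
    return numerator / denominator

# Users 2, 3 and 4 rated the item; user 4 shares no rating with u
toy_raters = {2: 5.0, 3: 1.0, 4: 3.0}
toy_distances = {2: 0.5, 3: 2.0}
result = weighted_item_average(toy_raters, toy_distances, 1.0)
print(result)  # weights 2.0, 0.5 and 1.0 give 13.5 / 3.5
```

The unweighted average of the three ratings is 3.0; the weighted value is higher because user 2, who is closest to $u$, gave the highest rating.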


### Question 4

For each rating in the test set, compute the rating predicted by the function defined above and compare it to the actual score. If an item has not been rated in the learning set or a user has made no rating in the learning set, don't do any prediction.

To present your results, you can print them in the form:

<pre>
user_id item_id real_rating predicted_rating
</pre>

In [None]:
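def compute_weigh_average(my_user, my_item, item_set, matrix):
    weighted_average = 0
    sum_weight = 0
    average_distance = None  # average distance of my_user, computed at most once
    for user in item_set[my_item].keys():
        distance = matrix[my_user - 1][user - 1]
        # w(u,v) = 1 if u and v share no rating (distance == -1);
        # weight 1 is also kept when the distance is 0, to avoid dividing by zero
        weight = 1
        if distance > 0:
            if average_distance is None:
                average_distance = compute_average(my_user, matrix)
            weight = average_distance / distance
        sum_weight += weight
        weighted_average += weight * item_set[my_item][user]
    return weighted_average / sum_weight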
        

def predict_score_improved(my_user, my_item, user_set, item_set, matrix):
    rating_sum, rating_count = 0, 0
    for user in user_set:
        rating_sum += sum(user_set[user].values())
        rating_count += len(user_set[user])
    average_rating = rating_sum / rating_count
    user_rating = sum(user_set[my_user].values()) / len(user_set[my_user]) if my_user in user_set else average_rating
    weighted_average = compute_weigh_average(my_user, my_item, item_set, matrix) if my_item in item_set else average_rating
    return user_rating + weighted_average - average_rating

def evaluate_improved(user_set, item_set, test_file, matrix):
    my_evaluations = []
    with open(test_file, "r") as file:
        for line in file:
            user, item, real_rating = line.split()
            user, item, real_rating = int(user), int(item), float(real_rating)
            if user in user_set and item in item_set:
                predicted_rating = predict_score_improved(user, item, user_set, item_set, matrix)
                #print("{} {} {} {}".format(user, item, real_rating, round(predicted_rating,2)))
                my_evaluations.append((user, item, real_rating, predicted_rating))
    return my_evaluations

In [None]:
my_evaluations_improved = evaluate_improved(user_set_improved, item_set_improved, "test_set.txt", matrix)
for user, item, real_rating, predicted_rating in my_evaluations_improved:
    print("{:>7} {:>7} {:>7} {:>7}".format(user, item, real_rating, round(predicted_rating,2)))

### Question 5

Using the previous question, compute the _Root Mean Square Error_, as defined in the course for the whole set of predictions:

$$RMSE = \sqrt{\frac{\sum _{k=1} ^K (r^*_{CF}(k) - r(k))^2 }{K}} $$

Here $K$ is the number of predictions, $ r^*_{CF}(k) $ the predicted rating,  $ r(k) $ the real rating.

You should observe only a very slight improvement over the RMSE that you computed in Part 1 with the benchmark. Can you explain why this is the case?

In [None]:
compute_rmse(my_evaluations_improved)