# TP10 (Student version): a recommender system

We can use the following libraries.

In [None]:
import matplotlib.pyplot as plt
import math
import sys
import random
import time
import copy
print(sys.version)

The purpose of this practical work is to build a basic recommender system and apply it to a MovieLens dataset.

## Exercise 1: data preparation

Download the rating data extracted from MovieLens http://lioneltabourier.fr/documents/rating_list.txt

This file is organised as follows:

<pre>
user_id   movie_id   rating
</pre>

It contains 100836 ratings of 9724 movies by 610 different users. Ratings on MovieLens range from 0.5 to 5.

The corresponding movie index is available here: http://lioneltabourier.fr/documents/movies.csv

### Question 1

Select **randomly** 1% of the ratings (so 1008 ratings). This will be your test set for the rest of this lab: these ratings are treated as unknown, and we aim to predict them from the learning set, which consists of the remaining 99% of the ratings.

Create two files, one containing the learning ratings and the other containing the test ratings (please attach them to the .ipynb file when submitting your TP).

In [None]:
def data_preparation(rating_list):
    # Read every rating line ("user_id movie_id rating") into memory.
    with open(rating_list, "r") as file:
        lines = file.readlines()
    # Move 1% of the lines, drawn uniformly at random, into the test set.
    test_set = []
    for _ in range(len(lines) // 100):
        test_set.append(lines.pop(random.randrange(len(lines))))
    # The popped lines form the test set; the remaining lines form the learning set.
    with open("test_set.txt", "w") as test_set_file:
        test_set_file.writelines(test_set)
    with open("learning_set.txt", "w") as learning_set_file:
        learning_set_file.writelines(lines)

In [None]:
data_preparation("res/rating_list.txt")
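
A random split changes at every run, which makes later results hard to compare. The sketch below is one possible way to make the split repeatable, assuming the rating lines are already in memory. The function name `split_ratings` and its parameters `test_percent` and `seed` are hypothetical, and the toy lines are made up.

```python
import random

def split_ratings(lines, test_percent=1, seed=0):
    """Return (learning, test) lists; names and defaults are illustrative."""
    rng = random.Random(seed)  # private generator, so the split is repeatable
    test_size = len(lines) * test_percent // 100
    # Sample line indices without replacement.
    test_indices = set(rng.sample(range(len(lines)), test_size))
    learning = [line for k, line in enumerate(lines) if k not in test_indices]
    test = [line for k, line in enumerate(lines) if k in test_indices]
    return learning, test

# Toy input standing in for the lines of rating_list.txt (made-up ratings).
toy_lines = ["{} {} {}\n".format(u, m, 3.0) for u in range(1, 21) for m in range(1, 11)]
learning, test = split_ratings(toy_lines)
print(len(learning), len(test))
```

Calling `split_ratings` again with the same `seed` returns the same two lists, whereas a different seed gives a different test set.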

## Exercise 2: benchmark recommender 

The benchmark recommender that you will create works as follows: for a user $u$ and an item $i$, the predicted score is

$$ r^*(u,i) = \overline{r} + ( \overline{r(u)} - \overline{r}) + ( \overline{r(i)} - \overline{r})$$

$\overline{r}$ is the average rating over the whole learning dataset.

$\overline{r(u)}$ is the average rating over the learning dataset of user $u$. If $u$ is not present in the learning set, use $\overline{r(u)} = \overline{r}$.

$\overline{r(i)}$ is the average rating over the learning dataset of item $i$. If $i$ is not present in the learning set, use $\overline{r(i)} = \overline{r}$.
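
As a quick sanity check of the formula, the following sketch evaluates it on made-up averages. The function name `benchmark_prediction` and the numbers are illustrative only; they do not come from the dataset.

```python
def benchmark_prediction(global_mean, user_mean, item_mean):
    # r* = r + (r(u) - r) + (r(i) - r)
    return global_mean + (user_mean - global_mean) + (item_mean - global_mean)

# Made-up averages: a generous user (4.0 against 3.5 overall)
# and a movie rated below average (3.0).
prediction = benchmark_prediction(global_mean=3.5, user_mean=4.0, item_mean=3.0)
print(prediction)  # 3.5: the user's +0.5 offset cancels the movie's -0.5 offset

# Unknown user: the user average falls back to the global average,
# so the prediction reduces to the movie average.
print(benchmark_prediction(global_mean=3.5, user_mean=3.5, item_mean=3.0))  # 3.0
```

The expression simplifies to $\overline{r(u)} + \overline{r(i)} - \overline{r}$. Nothing bounds the result: a user average of 4.8 and an item average of 4.9 with a global average of 3.5 give a prediction of about 6.2, above the maximum rating of 5.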

### Question 2

Load the learning data in memory.

Hint: a suitable format for the rest of this TP is two dictionaries of lists (warning: a dictionary of sets won't work, because a set discards repeated rating values):

1) keys = user ids, values = list of ratings

2) keys = item ids, values = list of ratings
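
To see why sets are unsuitable here, consider this small sketch with made-up ratings: a set keeps each distinct value only once, so the average computed from it is wrong as soon as a rating value is repeated.

```python
ratings = [4.0, 4.0, 4.0, 1.0]  # made-up ratings given by one user

mean_from_list = sum(ratings) / len(ratings)
mean_from_set = sum(set(ratings)) / len(set(ratings))  # the set is {1.0, 4.0}

print(mean_from_list)  # 3.25
print(mean_from_set)   # 2.5
```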

In [None]:
def load_data(learning_data):
    # user_set: user id -> list of the ratings given by that user
    # movie_set: movie id -> list of the ratings received by that movie
    user_set, movie_set = {}, {}
    with open(learning_data, "r") as file:
        for line in file:
            user, movie, rating = line.split()
            user, movie, rating = int(user), int(movie), float(rating)
            # Lists (not sets), so that repeated rating values are all kept.
            user_set.setdefault(user, []).append(rating)
            movie_set.setdefault(movie, []).append(rating)
    return user_set, movie_set

In [None]:
user_set, item_set = load_data("learning_set.txt")

### Question 3

Create a function which given a user $u$ and an item $i$ returns the value of $r^*(u,i)$ computed on the learning set.


In [None]:
def predict_score(my_user, my_item, user_set, movie_set):
    # Global average: every rating appears exactly once across the user lists.
    rating_sum, rating_count = 0, 0
    for ratings in user_set.values():
        rating_sum += sum(ratings)
        rating_count += len(ratings)
    average_rating = rating_sum / rating_count
    # Fall back to the global average for an unknown user or item.
    user_rating = sum(user_set[my_user]) / len(user_set[my_user]) if my_user in user_set else average_rating
    item_rating = sum(movie_set[my_item]) / len(movie_set[my_item]) if my_item in movie_set else average_rating
    # r + (r(u) - r) + (r(i) - r) simplifies to r(u) + r(i) - r.
    return user_rating + item_rating - average_rating

In [None]:
predict_score(610, 170875, user_set, item_set)
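
The global average and the per-user and per-item averages do not depend on the pair being predicted, so they can be computed once and reused. The sketch below is one possible way to do this; `mean` and `make_predictor` are hypothetical helper names, and the toy dictionaries contain made-up ratings in the dictionary-of-lists format of Question 2.

```python
def mean(values):
    return sum(values) / len(values)

def make_predictor(user_ratings, item_ratings):
    """Hypothetical helper: precompute every average, return a prediction function."""
    all_ratings = [r for ratings in user_ratings.values() for r in ratings]
    global_mean = mean(all_ratings)
    user_means = {u: mean(rs) for u, rs in user_ratings.items()}
    item_means = {i: mean(rs) for i, rs in item_ratings.items()}

    def predict(user, item):
        # Unknown users or items fall back to the global average.
        user_mean = user_means.get(user, global_mean)
        item_mean = item_means.get(item, global_mean)
        return user_mean + item_mean - global_mean

    return predict

# Toy dictionaries of lists (made-up ratings).
toy_users = {1: [4.0, 2.0], 2: [3.0]}
toy_items = {10: [4.0, 3.0], 20: [2.0]}
predict = make_predictor(toy_users, toy_items)
print(predict(1, 10))  # 3.5: user average 3.0 + item average 3.5 - global average 3.0
```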

## Exercise 3: evaluation

Now that we have a prediction process, we evaluate its performance on the test set.

### Question 4

For each rating in the test set, compute the rating predicted by the function defined above and compare it to the actual rating. If the item has not been rated in the learning set, or the user has no rating in the learning set, do not make a prediction.

To present your results, you can print them in the form:

<pre>
user_id item_id real_rating predicted_rating
</pre>

At first sight, what is your opinion about the ratings that you obtained?


In [None]:
def evaluate(user_set, item_set, test_file):
    # Returns a list of (user, item, real_rating, predicted_rating) tuples.
    my_evaluations = []
    with open(test_file, "r") as file:
        for line in file:
            user, item, real_rating = line.split()
            user, item, real_rating = int(user), int(item), float(real_rating)
            # Skip pairs whose user or item is absent from the learning set.
            if user in user_set and item in item_set:
                predicted_rating = predict_score(user, item, user_set, item_set)
                my_evaluations.append((user, item, real_rating, predicted_rating))
    return my_evaluations

In [None]:
my_evaluations = evaluate(user_set, item_set, "test_set.txt")
for user, item, real_rating, predicted_rating in my_evaluations:
    print("{} {} {} {}".format(user, item, real_rating, round(predicted_rating,2)))

### Question 5

Using the previous question, compute the _Root Mean Square Error_ (RMSE), as defined in the course, over the whole set of predictions:

$$RMSE = \sqrt{\frac{\sum _{k=1} ^K (r^*_k - r_k)^2 }{K}} $$

Here $K$ is the number of predictions, $r^*_k$ the predicted rating, and $r_k$ the real rating.
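
A small worked example may help to check an implementation. The four (real, predicted) pairs below are made up for illustration; they are not taken from the dataset.

```python
import math

# Made-up (real, predicted) pairs.
pairs = [(4.0, 3.5), (3.0, 2.0), (4.0, 4.5), (2.0, 3.0)]

squared_errors = [(predicted - real) ** 2 for real, predicted in pairs]
rmse = math.sqrt(sum(squared_errors) / len(pairs))

print(squared_errors)  # [0.25, 1.0, 0.25, 1.0]
print(round(rmse, 3))  # 0.791
```

Because the errors are squared before averaging, the two errors of size 1 contribute four times as much as the two errors of size 0.5.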

In [None]:
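def compute_rmse(evaluation):
    # evaluation: list of (user, item, real_rating, predicted_rating) tuples
    squared_errors = [(predicted - real) ** 2 for _, _, real, predicted in evaluation]
    return math.sqrt(sum(squared_errors) / len(evaluation))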

In [None]:
compute_rmse(my_evaluations)