# Practice PS06: Recommendation engines (interactions-based)

For this assignment we will build and apply an item-based collaborative filtering recommender for movies.

Author: <font color="blue">Your name here</font>

E-mail: <font color="blue">Your e-mail here</font>

Date: <font color="blue">The current date here</font>

# 1. The Movies dataset

We will use the same dataset as in PS05, the [32M version of Movielens](https://www.kaggle.com/datasets/justsahil/movielens-32m), which was released in 2024. We will use a subset containing only movies released in 2000 or later, and only 10% of the users with all of their ratings.

**MOVIES** are described in `ml32m-movies-2000s.csv.gz` in the following format: `movieId,title,genres`.

**RATINGS** are contained in `ml32m-ratings-2000s.csv.gz` in the following format: `userId,movieId,rating`

**TAGS** are contained in `ml32m-tags-2000s.csv.gz` in the following format: `userId,movieId,tag`

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

## 1.1. Load the input files

In [None]:
# LEAVE THIS CODE AS-IS
# But feel free to add imports in an extra cell if needed

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
from math import *
import random
from scipy.sparse.linalg import svds
from sklearn.metrics.pairwise import linear_kernel

In [None]:
# LEAVE THIS CODE AS-IS

FILENAME_MOVIES = "ml32m-movies-2000s.csv.gz"
FILENAME_RATINGS = "ml32m-ratings-2000s.csv.gz"
FILENAME_TAGS = "ml32m-tags-2000s.csv.gz"

In [None]:
# LEAVE THIS CODE AS-IS

# Load movies
movies = pd.read_csv(FILENAME_MOVIES, 
                    compression='gzip',
                    sep=',', 
                    engine='python', 
                    encoding='utf-8',
                    names=['movie_id', 'title', 'genres'])

# Remove header row from this file
movies.drop(index=0, inplace=True)

# Make sure the movie id is numeric
movies["movie_id"] = pd.to_numeric(movies["movie_id"])
display(movies.head(5))

In [None]:
# LEAVE THIS CODE AS-IS

# Load ratings
ratings_raw = pd.read_csv(FILENAME_RATINGS, 
                    sep=',', 
                    compression='gzip',
                    encoding='utf-8',
                    engine='python',
                    names=['user_id', 'movie_id', 'rating'])
display(ratings_raw.head(5))

## 1.2. Merge the data into a single dataframe

Join the data into a single dataframe that should contain columns: user_id, movie_id, rating, title, genres.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code from the previous practice that joined these three dataframes using "merge" into a single dataframe named "ratings". Print the first 5 rows of the resulting dataframe, which should contain columns "user_id", "movie_id", "rating", "title", and "genres".</font>

<font size="+1" color="red">Replace this cell with your code from the previous practice for "find_movies" that list movies containing a keyword</font>

In [None]:
# LEAVE AS-IS

# For testing, this should print 6 movies
find_movies("Final Destination", movies)

The following function, which you can leave as-is, prints the title of a movie given its movie_id.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [None]:
# LEAVE AS-IS

def get_title(movie_id, movies):
    return movies[movies['movie_id'] == movie_id].title.iloc[0]

In [None]:
# LEAVE AS-IS

# For testing, should print "Final Destination 5 (2011)"
print(get_title(88932, movies))

## 1.3. Count unique users and movies

Count the number of unique users and unique movies in the `ratings` variable. Use [unique()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.unique.html). Also print the total number of movies in the `movies` variable. Your code should print:

```
Number of users who have rated a movie : 16348
Number of movies that have been rated  : 2878
Total number of movies                 : 51444
```

Note that ratings are heavily concentrated on a few popular movies.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your own code to indicate the number of unique users and unique movies in the "ratings" variable.</font>

# 2. Item-based Collaborative Filtering

The two main types of interactions-based recommender systems, also known as *collaborative filtering* algorithms, are:

1. **User-based Collaborative Filtering**: To recommend items for user A, we first look at other users B1, B2, ..., Bk with a similar behavior to A, and aggregate their preferences. For instance, if all Bi like a movie that A has not watched, it would be a good candidate to be recommended. 


2. **Item-based Collaborative Filtering**: To recommend items for user A, we first look at all the items I1, I2, ..., Ik that the user A has consumed, and find items that elicit similar ratings from other users. For instance, an item that is rated positively by the same users that rate positively the Ii items, and negatively by the same users that rate negatively the Ii items, would be a good candidate to be recommended.

In both cases, a similarity matrix needs to be built. For user-based filtering, the **user-similarity matrix** contains a **similarity metric** computed for every pair of users. For item-based filtering, the **item-similarity matrix** contains the similarity between every pair of items.

As we already know, there are several metrics for measuring the "similarity" of two items. Some of the most used are Jaccard, Cosine and Pearson. Jaccard similarity is the number of users who have rated both items A and B divided by the number of users who have rated either A or B; it is very useful when there is no numeric rating but just a boolean value, such as whether a product was bought. Pearson and Cosine similarities, in contrast, measure the similarity between two vectors of ratings.
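
As a rough illustration of these three metrics, the sketch below computes them for two made-up item rating vectors. The ratings and variable names are invented for this example and are not taken from the dataset. Jaccard only looks at who rated each item, while Cosine and Pearson are computed here on the ratings of users who rated both items.

```python
import numpy as np

# Hypothetical ratings by five users for two items; NaN means "not rated"
item_a = np.array([5.0, 4.0, np.nan, 2.0, 1.0])
item_b = np.array([4.0, 5.0, 3.0, np.nan, 2.0])

rated_a = ~np.isnan(item_a)
rated_b = ~np.isnan(item_b)

# Jaccard: users who rated both items / users who rated at least one of them
jaccard = (rated_a & rated_b).sum() / (rated_a | rated_b).sum()

# Cosine and Pearson use the rating values of users who rated both items
both = rated_a & rated_b
a, b = item_a[both], item_b[both]
cosine = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))
pearson = np.corrcoef(a, b)[0, 1]

print(f"Jaccard: {jaccard:.2f}  Cosine: {cosine:.2f}  Pearson: {pearson:.2f}")
```

Pearson subtracts each item's mean rating before comparing, which is why it tends to be lower than Cosine when all ratings are positive numbers.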

For the purpose of this assignment, we will use **Pearson similarity** and implement **item-based collaborative filtering**.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

## 2.1. Data pre-processing

First, create a new dataframe called "rated_movies" that is simply the "ratings" dataframe with the column "genres" removed using the [drop](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) function.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to generate "rated_movies" and print the first ten rows. This should have columns user_id, movie_id, rating, title</font>

Now, using the `rated_movies` dataframe, create a new dataframe named `ratings_summary` containing the following columns:

* movie_id
* title
* ratings_mean (average rating)
* ratings_median (median rating)
* ratings_count (number of people who have rated this movie)

You can use the following operations:

* Initialize `ratings_summary` to be only the movie_id and title of all movies in `rated_movies`
   * To group dataframe `df` by column `a` and keep only one unique row per value of `a`, use: `df.groupby('a').first()`
* Compute three series: `ratings_mean`, `ratings_median`, `ratings_count`:
   * To obtain a series with the XX of column `a` for each distinct value of column `b` in dataframe `df`, use: `df.groupby('b')['a'].XX()` (XX=mean, median, count)
* Add these series to the `ratings_summary`
   * To add a series `s` with column name `a` to dataframe `df`, use: `df['a'] = s`
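
The operations listed above can be tried out on a small toy dataframe first. The sketch below uses invented column names (`item`, `name`, `score`) rather than the assignment's columns, so it only shows the pattern.

```python
import pandas as pd

# Toy dataframe (invented data) with one row per rating
df = pd.DataFrame({
    "item": ["x", "x", "y", "y", "y"],
    "name": ["Item X", "Item X", "Item Y", "Item Y", "Item Y"],
    "score": [4.0, 2.0, 5.0, 3.0, 4.0],
})

# One row per item, keeping the first value seen in the "name" column
summary = df.groupby("item")[["name"]].first()

# One series per statistic, indexed by item; assignment aligns on that index
summary["score_mean"] = df.groupby("item")["score"].mean()
summary["score_count"] = df.groupby("item")["score"].count()

print(summary)
```

Because both the grouped dataframe and the computed series are indexed by the grouping column, `df['a'] = s` places each statistic on the right row.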
    
<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to generate "ratings_summary" and print the first 10 rows.</font>

To select from dataframe A the rows whose column C is greater than or equal to N, you can do `A[A.C >= N]`.

To sort dataframe A by decreasing values of column C, you can do `A.sort_values(by='C', ascending=False)`.
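
The two operations can be chained. A minimal sketch on invented data:

```python
import pandas as pd

# Toy dataframe (invented values)
A = pd.DataFrame({"name": ["p", "q", "r", "s"], "C": [10, 250, 40, 120]})

# Keep rows with C >= 100, then sort them by C from largest to smallest
selected = A[A.C >= 100].sort_values(by="C", ascending=False)
print(selected)
```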

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with code to print the top 10 highest rated movies by average rating and the top 10 highest rated movies by median rating, considering only movies receiving at least 1000 ratings.</font>

<font size="+1" color="red">Repeat this, but this time consider movies receiving at least 3 ratings, and having a median of 4.5 or above.</font>

<font size="+1" color="red">Replace this cell with a brief commentary, in your own words, comparing the three lists above.</font>

## 2.2. Compute the user-movie matrix

Before calculating the **similarity matrix**, we create a table where columns are movies and rows are users, and each movie-user cell contains the score of that user for that movie.

We will use the [pivot_table](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html) function of Pandas, which receives a dataframe plus the variable that will make the rows, the variable that will make the columns, and the variable that will make the cells, and transforms it into a matrix of the specified rows, columns, and cells.

For instance, if you have a dataframe D containing:

```
U V W
1 a 3.0
1 b 2.0
2 a 1.0
2 c 4.0
```

Calling `D.pivot_table(index='U', columns='V', values='W')` will create the following:

```
V  a   b   c
U
1 3.0 2.0 NaN
2 1.0 NaN 4.0
```
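
The example above can be reproduced with the following runnable sketch, which builds the same dataframe `D` by hand:

```python
import pandas as pd

D = pd.DataFrame({
    "U": [1, 1, 2, 2],
    "V": ["a", "b", "a", "c"],
    "W": [3.0, 2.0, 1.0, 4.0],
})

# Rows come from U, columns from V, and cell values from W;
# (U, V) combinations that do not appear in D become NaN
M = D.pivot_table(index="U", columns="V", values="W")
print(M)
```

Note that if a (U, V) pair appeared more than once, `pivot_table` would by default store the mean of its W values.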

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with code to generate a "user_movie" matrix by calling "pivot_table" on "rated_movies". Print the first 5 rows. It might take about one minute to compute, depending on your computer.</font>

<font size="+1" color="red">Replace this cell with a brief commentary indicating why you think the "user_movie" matrix has so many "NaN" values. What do we call this characteristic of user ratings in recommender systems?</font>

## 2.3. Explore some correlations in the user-movie matrix

Now let us explore whether correlations in this matrix make sense.

1. Locate the movie_id for the following three movies:
  * *Finding Nemo (2003)* -- name this id_pivot
  * *Animatrix, The (2003)* -- name this id_m1
  * *Hey Arnold! The Movie (2002)* -- name this id_m2
2. Obtain the ratings for each of these movies: `user_movie[movie_id].dropna()`. You will obtain a column, containing a series of ratings for each movie.
3. Consolidate these three series into a single dataframe: `ratings3 = pd.concat([s1, s2, s3], axis=1)`
4. Drop from `ratings3` all rows containing a *NaN*. This will keep only the users that have rated all the 3 movies.
5. Display the first 10 rows from this table.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with code to compute and display the first 10 rows of the "ratings3" table as described above.</font>

<font size="+1" color="red">Replace this cell with an explanation, in your own words, of the contents of *ratings3*.</font>

To compute Pearson correlation, we use the [corr](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.corr.html) method.

To compute the correlation between two columns `a`, `b` in dataframe `df`, we use: `df[a].corr(df[b])`.
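
A small sketch of this call on invented data. It also shows that `corr` only uses rows in which both columns have a value, so rows with a *NaN* in either column are ignored.

```python
import numpy as np
import pandas as pd

# Toy dataframe (invented values); the last row has no value for "a"
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, np.nan],
    "b": [2.0, 4.0, 5.0, 9.0, 1.0],
})

# Pearson correlation between the two columns
r = df["a"].corr(df["b"])

# Same value computed with NumPy on the four complete rows only
complete = df.dropna()
r_check = np.corrcoef(complete["a"], complete["b"])[0, 1]

print(f"{r:.2f} {r_check:.2f}")
```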

Compute the correlations between all pairs of columns of the `ratings3` table. You should display:

```
Similarity between 'Finding Nemo (2003)' and 'Animatrix, The (2003)': 0.74
Similarity between 'Finding Nemo (2003)' and 'Hey Arnold! The Movie (2002)': 0.96
Similarity between 'Animatrix, The (2003)' and 'Hey Arnold! The Movie (2002)': 0.90
```

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with code to compute all correlations between these three movies, as described above.</font>

<font size="+1" color="red">Replace this cell with a brief commentary on the correlations you find.</font>

Now let us take the first movie selected above, the one with movie_id `id_pivot`.

Select the column corresponding to this movie in `user_movie` and compute its correlation with all other columns in `user_movie`. This can be done with [corrwith](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corrwith.html).

To extract the ratings for a movie into a dataframe containing a single column "rating", you can use:

```
df = pd.DataFrame(user_movie[id_movie].dropna()).rename(columns={id_movie: "rating"})
```

To compute the correlation between two single-column dataframes `df1` and `df2`, you can use:

```
corr = df1.corrwith(df2).iloc[0]
```
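
To see how these two snippets fit together, here is a sketch on a tiny invented user-item matrix. The matrix `toy`, the item ids 10, 20 and 30, and the helper `single_column` are made up for illustration. The result of `corrwith` is read with `.iloc[0]`, which takes its first (and only) element by position.

```python
import numpy as np
import pandas as pd

# Toy user-item matrix (invented ratings); rows are users, columns are items
toy = pd.DataFrame(
    {
        10: [5.0, 4.0, 1.0, 2.0],
        20: [4.0, 5.0, 2.0, np.nan],
        30: [1.0, 2.0, 5.0, 4.0],
    },
    index=["u1", "u2", "u3", "u4"],
)

def single_column(matrix, item_id):
    # Ratings of one item as a dataframe with a single column named "rating"
    return pd.DataFrame(matrix[item_id].dropna()).rename(columns={item_id: "rating"})

pivot_df = single_column(toy, 10)

# Correlation of item 10 with every other item
correlations = {}
for other in toy.columns:
    if other != 10:
        correlations[other] = pivot_df.corrwith(single_column(toy, other)).iloc[0]

print(correlations)
```

Both dataframes share the column name "rating", so `corrwith` matches them by that column and by user, using only the users present in both.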

Store the result in a new dataframe named `similarity_to_pivot` containing two columns: `movie_id` and `corr_with_pivot`.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with code to create a "similarity_to_pivot" series that contains the computed correlations, dropping the NaNs in the series.</font>

Next, create a dataframe `corr_with_pivot` by using `similarity_to_pivot` and `ratings_summary`. This dataframe should have the following columns:

* movie_id
* corr_with_pivot - the correlation between movies movie_id and id_pivot
* title
* ratings_mean
* ratings_count

Keep only rows in which *ratings_count* > 500, i.e., popular movies. To filter a dataframe `df` and keep only rows having column `c` larger than `x`, use `df[df[c] > x]`.

Display the top 10 rows with the largest correlation. To select the largest `n` rows from dataframe `df` according to column `c`, use `df.sort_values(c, ascending=False).head(n)`. 

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with code to create a "corr_with_pivot" dataframe as specified above, and to print the 20 movies (rated 1000 times or more, and with a median rating of 4.0 or better) with the highest correlation with the selected movie.</font>

<font size="+1" color="red">Replace this cell with a brief commentary about the movies you see on this list. What happens if you set the condition on *ratings_count* to a much larger value? What happens if you set it to a much smaller value?</font>

## 2.4. Implement the item-based recommendations

Now that we believe that this type of correlation sort of makes sense, let us implement the item-based recommender. We need all correlations between columns in `user_movie`.

To compute all correlations between columns in a dataframe, use [corr](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html). This function receives a matrix with *r* rows and *c* columns, and returns a square matrix of *c x c* containing all pair-wise correlations between columns.

**This process may take a few minutes.**

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to compute all correlations between columns (movies) in the matrix user_movie. Store this in "item_similarity", and print the first 10 rows.</font>

Similarities between movies that do not have many ratings in common are unreliable. Fortunately, the `corr` method includes a parameter `min_periods` that establishes a minimum number of elements in common that two columns must have to compute the correlation.

Re-generate the item similarity matrix, setting `min_periods` to 100.
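
The effect of `min_periods` can be seen on a tiny invented matrix. In the sketch below, the threshold is 3 instead of 100 because the toy data has only five users; the names `toy`, `all_pairs` and `reliable` are made up for this example.

```python
import numpy as np
import pandas as pd

# Toy user-movie matrix (invented ratings): m3 shares only two users with m1 and m2
toy = pd.DataFrame({
    "m1": [5.0, 4.0, 1.0, 2.0, 3.0],
    "m2": [4.0, 5.0, 2.0, 1.0, np.nan],
    "m3": [np.nan, np.nan, 5.0, 4.0, np.nan],
})

# All pair-wise Pearson correlations between columns
all_pairs = toy.corr()

# Same, but pairs with fewer than 3 ratings in common are set to NaN
reliable = toy.corr(min_periods=3)

print(all_pairs.round(2))
print(reliable.round(2))
```

Without the threshold, m1 and m3 get a correlation of -1 from just two shared users; with the threshold, that entry becomes *NaN* and no longer influences later computations.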

This process will also take a few minutes.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to compute all correlations between columns (movies) in the matrix user_movie, but considering only movies having at least 100 ratings in common. Store this in "item_similarity_min_ratings", and print the first 10 rows</font>

Next, we will need some auxiliary functions that are provided below. These give us the list of movies that a user has rated. You can leave them as-is.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [None]:
# Leave this code as-is

# Gets the rating a user_id has given to a movie_id
def get_rating(user_movie, user_id, movie_id):
    return user_movie[movie_id][user_id]

# Gets a list of rated movies for a user_id
def get_rated_movies(user_movie, user_id):
    return list(user_movie.loc[user_id].dropna().sort_values(ascending=False).index)
    
# Print rated movies
def print_rated_movies(user_movie, movies, user_id):
    for movie_id in get_rated_movies(user_movie, user_id):
        print("%d %.1f %s " %
          (movie_id, get_rating(user_movie, user_id, movie_id), get_title(movie_id, movies)))


We will need to test our function so let us select a couple of interesting users.

Our first user, `user_id_super` will be someone who has given the following 3 films a rating higher than 4.5:

* super_movie_1=5349: *Spider-Man (2002)*
* super_movie_2=3793: *X-Men (2000)*
* super_movie_3=8961: *Incredibles, The (2004)* 	

Our second user, `user_id_drama` will be someone who has given the following 3 films a rating higher than 4.5:

* drama_movie_1=3408: *Erin Brockovich (2000)*
* drama_movie_2=5995: *Pianist, The (2002)*
* drama_movie_3=4995: *Beautiful Mind, A (2001)*
* and that has NOT rated the first superhero movie, i.e., having `user_movie[super_movie_1].isnull()`.

Print the number of users satisfying each condition and choose one at random.

*Tip:* To filter a dataframe by multiple conditions you can use, e.g., `df[(a > 1) & (b > 2)]`. 
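
A sketch of this kind of filter on an invented matrix, combining two "rated above a threshold" conditions with one "not rated" condition. The matrix `toy` and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy user-movie matrix (invented ratings); the index holds user ids
toy = pd.DataFrame(
    {
        "m1": [5.0, 5.0, 3.0, 5.0],
        "m2": [5.0, 4.0, 5.0, 5.0],
        "m3": [np.nan, 2.0, np.nan, 4.0],
    },
    index=[101, 102, 103, 104],
)

# Users who rated m1 and m2 above 4.5 and have not rated m3.
# Each condition needs its own parentheses when combined with &
mask = (toy["m1"] > 4.5) & (toy["m2"] > 4.5) & (toy["m3"].isnull())
matching_users = list(toy[mask].index)

print(len(matching_users), matching_users)
```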

**Important**: these particular users have watched lots of movies, so we cannot tell for sure they have only these interests.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to find userids of two example users: user_id_super (the one who liked the three superhero movies), and user_id_drama (the one who liked the three dramas and did not rate the first superhero movie). Print the number of users satisfying the conditions, and choose one at random using `random.choice()`. Print the user ids.</font>

The next code, that you should leave as-is, checks that the users you selected are correct.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

In [None]:
# LEAVE THIS CODE AS-IS
# We use this to check that the user ids you selected are correct

assert get_rating(user_movie, user_id_super, super_movie_1) > 4.5
assert get_rating(user_movie, user_id_super, super_movie_2) > 4.5
assert get_rating(user_movie, user_id_super, super_movie_3) > 4.5

assert get_rating(user_movie, user_id_drama, drama_movie_1) > 4.5
assert get_rating(user_movie, user_id_drama, drama_movie_2) > 4.5
assert get_rating(user_movie, user_id_drama, drama_movie_3) > 4.5


<font size="+1" color="red">Given that the rest of the practice requires you to fix the user ids, so that your results are consistent, replace this cell with code to assign them to fixed values (user_id_super=XXX, user_id_drama=YYY).</font>

In [None]:
# LEAVE AS-IS (TESTING CODE)

print_rated_movies(user_movie, movies, user_id_super)

In [None]:
# LEAVE AS-IS (TESTING CODE)

print_rated_movies(user_movie, movies, user_id_drama)

For every user, we will consider that the importance of a new movie (a movie s/he has not rated) will be equal to the sum of the similarities between that new movie and all the movies the user has already rated.

To further improve this, we will compute a weighted sum, in which the weight will be the rating given to the movie.

For instance, suppose a user has rated movies as follows:

```
movie_id rating
1        2.0
2        3.0
3        NaN
4        NaN
```

And that movie similarities are as follows (values with a "." do not matter in this example):

```
movie_id   1   2   3   4
1         ...............
2         ...............
3         0.1 0.2 NaN ...
4         0.9 0.8 ... NaN
```

The importance of movie 3 to this user will be:

```
2.0 * 0.1 + 3.0 * 0.2 = 0.8
```

While the importance of movie 4 to this user will be:

```
2.0 * 0.9 + 3.0 * 0.8 = 4.2
```

As we can see, we are favoring movies that are highly similar to many movies that the user has rated highly.
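
The weighted sum (rating times similarity, added up over the movies the user has rated) can be checked numerically. The sketch below encodes only the values that matter in the example; it is a vectorized illustration of the arithmetic, not the loop-based function requested below.

```python
import numpy as np
import pandas as pd

# Ratings of the example user; NaN marks movies not yet rated
user_ratings = pd.Series([2.0, 3.0, np.nan, np.nan], index=[1, 2, 3, 4])

# Similarities from the example: rows are the unrated movies 3 and 4,
# columns are the rated movies 1 and 2
similarity = pd.DataFrame([[0.1, 0.2], [0.9, 0.8]], index=[3, 4], columns=[1, 2])

# Multiply each similarity column by the rating given to that movie,
# then add up the terms for each unrated movie
rated = user_ratings.dropna()
importance = similarity[rated.index].mul(rated, axis=1).sum(axis=1)

print(importance)
```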

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

Create a function `get_movies_relevance` that returns a dataframe with columns `movie_id` and `relevance`. You can use the following template:

```python
def get_movies_relevance(user_id, user_movie, item_similarity_matrix):
    
    # Create an empty series
    movies_relevance = ...
    
    # Iterate through the movies the user has rated
    for rated_movie in ...:
        
        # Obtain the rating given
        rating_given = ...
        
        # Obtain the vector containing the similarities of rated_movie
        # with all other movies in item_similarity_matrix
        similarities = ...
        
        # Multiply this vector by the given rating
        weighted_similarities = ...
        
        # Append these terms to movies_relevance
        movies_relevance = pd.concat([movies_relevance, weighted_similarities])
    
    # Compute the sum for each movie
    movies_relevance = movies_relevance.groupby(movies_relevance.index).sum()
    
    # Convert to a dataframe
    movies_relevance_df = pd.DataFrame(movies_relevance, columns=['relevance'])
    movies_relevance_df['movie_id'] = movies_relevance_df.index
    
    return movies_relevance_df

```

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code for "get_movies_relevance"</font>

Apply `get_movies_relevance` to the two users we have selected, `user_id_super` and `user_id_drama`.

The result will contain only `movie_id` and `relevance`, so you will have to merge it with the `movies` dataframe on the `movie_id` attribute.

Sort the results by descending relevance and print the top 10 for each case.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code to obtain the 10 most relevant movies for the users user_id_super (who likes superhero movies) and user_id_drama (who likes dramas)</font>

<font size="+1" color="red">Replace this cell with a brief commentary on the movies you see on these lists. How many of them look relevant for the intended users? Feel free to use IMDB or Wikipedia to get info on these movies.</font>

<font size="-1" color="gray">All those trivial facts you learned about 2000s pop culture were supposed to be useful one day; that day has arrived :-)</font>

Finally, we need to remove the movies that the user has already watched. To do so:

* Obtain the dataframe of relevant movies with `get_movies_relevance`
* Set this dataframe index to 'movie_id'
* Obtain the list of movie_ids of watched movies with `get_rated_movies`
* Drop from the relevant movies dataframe the watched movies
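
The index-and-drop pattern in these steps looks as follows on invented data. The relevance values and the list `already_rated` are made up, and `errors="ignore"` is an optional choice that makes `drop` skip ids that are missing from the index instead of raising an error.

```python
import pandas as pd

# Toy relevance table (invented values)
relevance = pd.DataFrame({
    "movie_id": [1, 2, 3, 4],
    "relevance": [7.5, 6.0, 0.8, 4.2],
})

# Hypothetical list of movies the user has already rated
already_rated = [1, 2]

# Index by movie_id so the rated movies can be dropped by label
remaining = relevance.set_index("movie_id").drop(already_rated, errors="ignore")
remaining = remaining.sort_values("relevance", ascending=False)

print(remaining)
```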

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Replace this cell with your code implementing "get_recommended_movies"</font>

<font size="+1" color="red">Replace this cell with your code to obtain the 20 most recommended movies for the users user_id_super and user_id_drama</font>

<font size="+1" color="red">Replace this cell with a brief commentary on these recommendations. (1) What percentage of recommendations would you say are relevant for the user who likes superhero movies? And for the user who likes drama movies? (2) After removing the movies the user has already watched, are the relevance scores of the remaining items comparable to the previous lists that contained all relevant movies?</font>

# DELIVER (individually)

Remember to read the section on "delivering your code" in the [course evaluation guidelines](https://github.com/chatox/data-mining-course/blob/master/upf/upf-evaluation.md).

Deliver a zip file containing:

* This notebook

## Extra points available

For more learning and extra points, use the [surprise](http://surpriselib.com/) library to generate recommendations for the same two users. Display the generated recommendations and comment on them.

**Note:** if you go for the extra points, add ``<font size="+2" color="blue">Additional results: surprise library</font>`` at the top of your notebook.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+2" color="#003300">I hereby declare that I completed this practice myself, that my answers were not written by an AI-enabled code assistant, and that except for the code provided by the course instructors, all of my code, report, and figures were produced by myself.</font>