In [7]:
# Move to the directory where the project is located
import os
os.chdir('/work/notebooks/enstit/Steflix')

<div align="center">
  <img src="./assets/steflix.png" width="225px">
</div>

# 🍿 STEFLIX: A Movie-based Recommender System

The aim of this repository is to build a **Recommender System** that uses Matrix Factorization to learn user and item embeddings from a (sparse) review matrix, and uses them to make user-specific suggestions.

## Reading data

The data used to populate our Recommender System is located in the [data](./data/) folder. In particular, the [movies.csv](./data/movies.csv) file contains the list of all the movies in the collection, and the [ratings.csv](./data/ratings.csv) file contains all the ratings that users gave to the movies. In our system, a *user* can give any *movie* a score from 0 to 5, and half points are allowed (e.g. 4.5 for an almost-perfect movie).

In [8]:
import pandas as pd

# Read movies and ratings from the related CSV files
movies = pd.read_csv('./data/movies.csv', usecols=['movieId', 'title', 'genres'])
ratings = pd.read_csv('./data/ratings.csv', usecols=['userId', 'movieId', 'rating'])

users_list = ratings.userId.unique().astype(str).tolist()
movies_list = movies.title.unique().astype(str).tolist()

C = pd.pivot_table(ratings, values='rating', index=['userId'], columns=['movieId']).values

Now we have:
* `users_list`, which contains the unique IDs (as strings) of all the 610 users that reviewed a movie,
* `movies_list`, with the titles of all the 9737 movies in the collection, and
* the matrix `C`, which holds the rating that each user gave to each movie. If a user has not reviewed a specific movie, the related cell contains a `nan` value.
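
One detail worth checking is how the pivot handles missing data. The sketch below uses made-up toy frames (`toy_movies`, `toy_ratings` and every value in them are hypothetical, not taken from the real CSV files) to show two things: unrated user/movie pairs become `nan`, and a movie with no ratings at all gets no column in the pivot. If the real data contains such movies, the columns of `C` may be fewer than the entries of `movies_list`, so it can be safer to derive the titles from the pivot columns.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): movie 30 is in the catalogue but was never rated
toy_movies = pd.DataFrame({
    "movieId": [10, 20, 30],
    "title": ["Movie A", "Movie B", "Movie C"],
})
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 20, 10],
    "rating": [4.0, 2.5, 5.0],
})

# Same kind of pivot as in the cell above: rows are users, columns are rated movies
toy_pivot = pd.pivot_table(toy_ratings, values="rating", index="userId", columns="movieId")
toy_C = toy_pivot.values

# Titles listed in the same order as the columns of the matrix
toy_titles = toy_movies.set_index("movieId").loc[toy_pivot.columns, "title"].tolist()

print(toy_C.shape)            # (2, 2): the unrated movie has no column
print(np.isnan(toy_C).sum())  # 1: user 2 never rated movie 20
print(toy_titles)             # ['Movie A', 'Movie B']
```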

## Building the Recommender System

The `utils` folder contains all the class definitions our system needs. In particular, the `recsys` package contains the definition of the `RecommenderSystem` class.

After initializing the `rs` object, we have to call the `build_embeddings()` method to compute the *user* and *item* embeddings (using Weighted Matrix Factorization).

In [9]:
from utils.recsys import RecommenderSystem

try:
    # Load the recommender system from filesystem
    rs = RecommenderSystem(name="Steflix")
    rs = rs.load()
except FileNotFoundError:
    rs = RecommenderSystem(name="Steflix", reviews=C, users=users_list, items=movies_list) # Create a new recommender system
    rs.build_embeddings() # Build the users and items embeddings
    rs.save() # Save the recommender system to disk for persistence

2024-01-24 11:04:53,461 - DEBUG - Loading Steflix from filesystem...
2024-01-24 11:04:53,462 - DEBUG - No filename provided, using default Steflix.pkl...


Now our Recommender System contains information about the user names, the movie titles and the related ratings.

But what about movies that a user has not watched yet? The embeddings we built are useful for exactly this reason: they allow us to infer how much a user is attracted to certain *latent features* and, at the same time, how well each movie is described by each of these *latent features*. By putting the two together, we can try to guess which movies a user might enjoy before that user has even seen them.

Let's start by looking at the 20 movies that user *569* rated highest, for example:
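
To make the idea concrete, here is a minimal sketch of Weighted Matrix Factorization solved with Alternating Least Squares on a toy matrix. It is not the code of the `RecommenderSystem` class: the function names, the weight `w0` given to unobserved cells, the regularization `reg` and the toy ratings are all assumptions chosen for illustration. Observed cells get weight 1, unobserved cells are treated as 0 with a small weight, and the predicted score for any user/movie pair is the dot product of the two embeddings.

```python
import numpy as np

def solve_side(R, W, fixed, reg):
    """Solve one regularized weighted least-squares problem per row of R,
    keeping the other side's embeddings (`fixed`) constant."""
    k = fixed.shape[1]
    solved = np.empty((R.shape[0], k))
    for i in range(R.shape[0]):
        weighted = fixed * W[i][:, None]           # apply the weights of row i
        A = weighted.T @ fixed + reg * np.eye(k)
        b = weighted.T @ R[i]
        solved[i] = np.linalg.solve(A, b)
    return solved

def wmf_loss(R, W, U, V, reg):
    """Weighted squared error plus L2 penalty on both embedding matrices."""
    return float((W * (R - U @ V.T) ** 2).sum() + reg * ((U ** 2).sum() + (V ** 2).sum()))

def weighted_als(C, k=2, w0=0.05, reg=0.1, n_iter=10, seed=0):
    observed = ~np.isnan(C)
    R = np.where(observed, C, 0.0)     # unobserved cells are treated as 0 ...
    W = np.where(observed, 1.0, w0)    # ... but count much less than observed ones
    V = np.random.default_rng(seed).normal(scale=0.1, size=(C.shape[1], k))
    for _ in range(n_iter):
        U = solve_side(R, W, V, reg)       # update users with items fixed
        V = solve_side(R.T, W.T, U, reg)   # update items with users fixed
    return U, V, wmf_loss(R, W, U, V, reg)

# Toy review matrix (hypothetical): 3 users, 4 movies, nan = not rated
toy_C = np.array([[5.0, 4.0, np.nan, 1.0],
                  [4.5, np.nan, 1.0, 0.5],
                  [np.nan, 1.0, 5.0, 4.0]])

U, V, loss = weighted_als(toy_C)
predicted = U @ V.T   # scores for every pair, including the unrated ones
print(np.round(predicted, 2))
```

Each half-step solves its sub-problem exactly, so the loss cannot increase from one iteration to the next, which is one reason this scheme tends to settle quickly.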

In [10]:
# Print the 20 most liked movies by user "569"
rs.get_user_chart(user="569", chart_len=20, print_chart=True)

  Position  Item Name                                 Rating
         1  Batman Forever (1995)                          5
         2  Die Hard: With a Vengeance (1995)              5
         3  Pulp Fiction (1994)                            5
         4  Star Trek: Generations (1994)                  5
         5  Speed (1994)                                   5
         6  GoldenEye (1995)                               4
         7  Net, The (1995)                                4
         8  Stargate (1994)                                4
         9  Ace Ventura: Pet Detective (1994)              4
        10  Clear and Present Danger (1994)                4
        11  Cliffhanger (1993)                             4
        12  Jurassic Park (1993)                           4
        13  Aladdin (1992)                                 4
        14  Dances with Wolves (1990)                      4
        15  Beauty and the Beast (1991)                    4
        16  Usual Suspec

Oh, they are a fan of thriller and action movies! Such good taste...

Now, knowing which movies the user watched, and especially which ones they enjoyed the most and the least, the Recommender System can suggest **new** movies that are similar (i.e., have similar *latent features*) to the movies they liked.\
Let's see this list in practice:

In [11]:
rs.contentbased_filtering(user="569", rec_len=10, print_chart=True)

Top 10 content-based recommendations for user 569:
  Position  Item Name                                                  Similarity
         1  Crimson Tide (1995)                                        49.26%
         2  Man of the Year (1995)                                     46.86%
         3  Outbreak (1995)                                            46.00%
         4  Disclosure (1994)                                          44.58%
         5  Firm, The (1993)                                           41.02%
         6  While You Were Sleeping (1995)                             40.86%
         7  Apollo 13 (1995)                                           37.75%
         8  Fugitive, The (1993)                                       35.96%
         9  Interview with the Vampire: The Vampire Chronicles (1994)  35.72%
        10  Quiz Show (1994)                                           33.30%
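
As a rough illustration of how such a ranking could be produced, the sketch below builds a profile for the user from the embeddings of the movies they liked and ranks unseen movies by cosine similarity to that profile. This is an assumption about one reasonable approach, not necessarily what `contentbased_filtering` does internally; the function name, the `like_threshold` parameter and the toy embeddings are hypothetical.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def content_based(user_ratings, item_emb, rec_len=2, like_threshold=4.0):
    seen = ~np.isnan(user_ratings)
    filled = np.where(seen, user_ratings, 0.0)
    liked = filled >= like_threshold
    # User profile: rating-weighted mean of the liked movies' embeddings
    profile = np.average(item_emb[liked], axis=0, weights=filled[liked])
    # Score only the movies the user has not rated yet
    scores = {int(j): cosine(profile, item_emb[j]) for j in np.flatnonzero(~seen)}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:rec_len]

# Toy item embeddings (hypothetical): 5 movies, 2 latent features
toy_item_emb = np.array([[1.0, 0.0],
                         [0.9, 0.1],
                         [0.0, 1.0],
                         [0.8, 0.3],
                         [0.1, 0.9]])
# The user liked movies 0 and 1, disliked movie 2, and has not seen movies 3 and 4
toy_user_ratings = np.array([5.0, 4.5, 1.0, np.nan, np.nan])

recs = content_based(toy_user_ratings, toy_item_emb)
print(recs)   # movie 3 ranks above movie 4, since it resembles the liked ones
```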


The approach used to build the list above is usually called *Content-based filtering*, because new items are chosen by their similarity to the items the user already liked.\
There is another form of filtering, called *Collaborative filtering*: the Recommender System collects movies from users that have tastes in common with the selected user, and ranks them with respect to the user's *latent features*. Here is a list of movies suggested for user *569* using *Collaborative filtering*:

In [12]:
rs.collaborative_filtering(user="569", rec_len=10, print_chart=True)

Top 10 collaborative recommendations for user 569:
  Position  Item Name                                                  Similarity
         1  Disclosure (1994)                                          74.12%
         2  Crimson Tide (1995)                                        71.17%
         3  Firm, The (1993)                                           70.33%
         4  Outbreak (1995)                                            69.56%
         5  While You Were Sleeping (1995)                             64.33%
         6  Apollo 13 (1995)                                           62.03%
         7  Fugitive, The (1993)                                       61.11%
         8  Interview with the Vampire: The Vampire Chronicles (1994)  57.74%
         9  Legends of the Fall (1994)                                 57.03%
        10  Waterworld (1995)                                          56.85%
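
The collaborative variant can be sketched in the same spirit: find the users whose embeddings are closest to the target user, collect the movies those neighbours rated highly and the target has not seen, then rank the candidates by the dot product with the target user's embedding. Again, this is a hedged illustration rather than the actual body of `collaborative_filtering`; `n_neighbours`, `like_threshold` and all toy values are assumptions.

```python
import numpy as np

def collaborative(user, C, user_emb, item_emb, n_neighbours=1, rec_len=2, like_threshold=4.0):
    # Cosine similarity between the target user and every user
    norms = np.linalg.norm(user_emb, axis=1)
    sims = user_emb @ user_emb[user] / (norms * norms[user])
    sims[user] = -np.inf                      # never pick the user themselves
    neighbours = np.argsort(sims)[::-1][:n_neighbours]

    # Candidates: movies liked by at least one neighbour and unseen by the target
    seen = ~np.isnan(C[user])
    filled = np.nan_to_num(C, nan=0.0)
    liked_by_neighbours = (filled[neighbours] >= like_threshold).any(axis=0)
    candidates = np.flatnonzero(liked_by_neighbours & ~seen)

    # Rank the candidates with the target user's own latent features
    scores = item_emb[candidates] @ user_emb[user]
    order = np.argsort(scores)[::-1][:rec_len]
    return [(int(candidates[i]), float(scores[i])) for i in order]

# Toy data (hypothetical): 3 users, 4 movies, 2 latent features
toy_user_emb = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]])
toy_item_emb = np.array([[1.0, 0.0], [0.8, 0.2], [0.6, 0.5], [0.0, 1.0]])
toy_C = np.array([[5.0, np.nan, np.nan, 1.0],
                  [5.0, 4.0, 4.5, np.nan],
                  [1.0, np.nan, 5.0, 5.0]])

recs = collaborative(0, toy_C, toy_user_emb, toy_item_emb)
print(recs)   # movies 1 and 2, both liked by the closest user (user 1)
```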


In the real world, these two approaches are used together: the list of suggested items consists of both items that are similar to those liked by the user, and items that have been seen and rated positively by users who have tastes in common with the subject.

## Possible future improvements

The Recommender System we described is a good starting point, but it is far from perfect.\
Here are some ideas for expanding and improving the project:
* for the Weighted Matrix Factorization, *Alternating Least Squares* was used to find the embedding matrices. It would be useful to also try other methods, e.g. *Stochastic Gradient Descent*;
* *Alternating Least Squares* is known to converge very quickly. It would be nice to have a visual representation of the learning curve, also trying different hyperparameters;
* the collections of users, movies and ratings have been treated as static and immutable over time. In a real-world scenario, this is never the case. Methods for updating the embeddings after a new user, movie or rating has been recorded could be studied and implemented;
* our Recommender System does not include a **candidate selection** step: the corpus of items could potentially be huge, so retrieval must be fast. Having a separate function for this step also allows for a more precise selection in the **ranking the candidates** step, taking into consideration further information (e.g., the date of the ratings, movie genres, ...);
* the code of the `__update_users_matrix` and `__update_items_matrix` methods of the `WeightedMatrixFactorization` class can be optimized with parallelization.
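
As a starting point for the first two ideas, here is a minimal sketch of matrix factorization trained with Stochastic Gradient Descent that records the error after every epoch, so the learning curve can be inspected or plotted. It fits the observed ratings only (a plain, unweighted factorization, simpler than the Weighted Matrix Factorization used in the project), and the function name, learning rate, regularization and toy matrix are assumptions for illustration.

```python
import numpy as np

def sgd_mf(C, k=2, lr=0.02, reg=0.02, n_epochs=300, seed=0):
    rng = np.random.default_rng(seed)
    rows, cols = np.nonzero(~np.isnan(C))     # coordinates of the observed ratings
    U = rng.normal(scale=0.1, size=(C.shape[0], k))
    V = rng.normal(scale=0.1, size=(C.shape[1], k))
    losses = []
    for _ in range(n_epochs):
        # Visit the observed ratings in a new random order at every epoch
        for idx in rng.permutation(len(rows)):
            i, j = rows[idx], cols[idx]
            err = C[i, j] - U[i] @ V[j]
            u_old = U[i].copy()               # use the pre-update user vector for V
            U[i] += lr * (err * V[j] - reg * U[i])
            V[j] += lr * (err * u_old - reg * V[j])
        # Mean squared error on the observed ratings after this epoch
        residual = C[rows, cols] - np.sum(U[rows] * V[cols], axis=1)
        losses.append(float(np.mean(residual ** 2)))
    return U, V, losses

# Toy review matrix (hypothetical): nan = not rated
toy_C = np.array([[5.0, 4.0, np.nan, 1.0],
                  [4.5, np.nan, 1.0, 0.5],
                  [np.nan, 1.0, 5.0, 4.0],
                  [1.0, 0.5, 4.5, np.nan]])

U, V, losses = sgd_mf(toy_C)
print(f"MSE after first epoch: {losses[0]:.3f}, after last epoch: {losses[-1]:.3f}")
```

Plotting `losses` against the epoch number for a few values of `lr`, `reg` and `k` would give the kind of learning curve mentioned above, and the same per-iteration bookkeeping could be added to an Alternating Least Squares loop for a side-by-side comparison.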