In [1]:
# Move to the directory where the project is located
import os
os.chdir('/work/notebooks/enstit/Steflix')

<div align="center">
  <img src="./assets/steflix.png" width="225px">
</div>

# 🍿 STEFLIX: A Movie-based Recommender System

The aim of this repository is to build a **Recommender System** that uses Matrix Factorization to learn user and item embeddings from a (sparse) review matrix, and uses them to make user-specific suggestions through both *Content-based filtering* and *Collaborative filtering*.

## Reading the data

The data used to populate our Recommender System is located in the [data](./data/) folder. In particular, the [movies.csv](./data/movies.csv) file contains the list of all the movies in the collection, and the [ratings.csv](./data/ratings.csv) file contains all the ratings that users gave to the movies. In our system, a *user* can give any *movie* a score greater than 0 and up to 5; half points are allowed, e.g. 4.5 for an almost-perfect movie.

In [2]:
import pandas as pd

# Read movies and ratings from the related CSV files
movies = pd.read_csv('./data/movies.csv', usecols=['movie_id', 'title'])
ratings = pd.read_csv('./data/ratings.csv', usecols=['user_id', 'movie_id', 'rating'])

users_list = [f"User {user_idx}" for user_idx in sorted(ratings.user_id.unique())]
movies_list = movies.title.unique().astype(str).tolist()

feedbacks_matrix = pd.pivot_table(ratings, values='rating', index=['user_id'], columns=['movie_id']).values

Now we have:
* the `users_list`, which contains the unique names of all the 943 users that reviewed a movie,
* the `movies_list`, with the names of all the 1682 movies in the reviewed collection, and
* the `feedbacks_matrix`, which holds the rating that each user gave to each movie in the collection. If a user has not rated a specific movie, the corresponding cell contains a `nan` value.
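
To make the shape of this matrix concrete, here is a minimal sketch of the same `pd.pivot_table` call on a tiny, made-up ratings table. The `toy_ratings` data and the `toy_matrix` name are hypothetical and are not taken from `ratings.csv`.

```python
import numpy as np
import pandas as pd

# Toy ratings (hypothetical data, for illustration only)
toy_ratings = pd.DataFrame({
    "user_id":  [1, 1, 2, 3],
    "movie_id": [10, 20, 10, 30],
    "rating":   [4.0, 5.0, 3.5, 2.0],
})

# Same call as in the cell above: one row per user, one column per movie
toy_matrix = pd.pivot_table(
    toy_ratings, values="rating", index=["user_id"], columns=["movie_id"]
).values

print(toy_matrix.shape)                 # 3 users x 3 movies
print(int(np.isnan(toy_matrix).sum()))  # number of (user, movie) pairs with no rating
```

Note that `pivot_table` sorts rows and columns by their key and, by default, averages duplicate (user, movie) pairs, so the row and column order follows the sorted `user_id` and `movie_id` values.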

## Building the Recommender System

The `utils` folder contains all the class definitions that our system needs to work. In particular, the `recsys` package contains the definition of the `RecommenderSystem` class.

After initializing the `rs` object, we have to call the `build_embeddings()` method to compute the *user* and *item* embeddings (using Weighted Matrix Factorization).

In [3]:
from utils.recsys import RecommenderSystem

try:
    # Load the recommender system from filesystem
    rs = RecommenderSystem(name="Steflix")
    rs = rs.load()
except FileNotFoundError:
    # Create a new recommender system
    rs = RecommenderSystem(
        name="Steflix", reviews=feedbacks_matrix, users=users_list, items=movies_list
    )
    rs.build_embeddings()  # Build the users and items embeddings with 50 latent factors
    rs.save()  # Save the recommender system to disk for persistence

2024-01-25 09:54:27,038 - DEBUG - Loading Steflix_2 from filesystem...
2024-01-25 09:54:27,039 - DEBUG - No filename provided, using default Steflix_2.pkl...


Now our Recommender System contains information about the user names, the movie titles and the related ratings.

But what about movies that a user has not watched yet? The embeddings we built are useful for exactly this reason: they allow us to infer how much a user is attracted to certain *latent features*, and at the same time how well each movie is described by each of these *latent features*. By putting the two together, we can try to guess which movies a user might enjoy before that user has even seen them.
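
As a rough illustration of "putting the two together", the sketch below scores every (user, item) pair with the dot product of the two embeddings and picks the best item a user has not rated yet. The embeddings here are random placeholders and the `seen` mask is made up; in the notebook the embeddings are learned by `build_embeddings()`, and how `RecommenderSystem` stores or combines them internally is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_factors = 4, 6, 3  # toy sizes (hypothetical)

# Placeholder embeddings: one row of latent features per user / per item
user_emb = rng.normal(size=(n_users, n_factors))
item_emb = rng.normal(size=(n_items, n_factors))

# Predicted affinity of every user for every item
scores = user_emb @ item_emb.T  # shape (n_users, n_items)

# Items that user 0 has already rated (made up for illustration)
seen = np.array([True, False, True, False, False, False])

# Exclude already-rated items, then pick the highest predicted score
candidate_scores = np.where(seen, -np.inf, scores[0])
best_unseen = int(np.argmax(candidate_scores))
print(best_unseen)
```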

Let's start by looking at the 20 movies that *User 475* rated highest, for example:

In [4]:
# Print the 20 most liked movies by the user "User 475"
rs.get_user_chart(user="User 475", chart_len=20, print_chart=True)

  Position  Item Name                                    Rating
         1  Star Wars (1977)                                  5
         2  Fargo (1996)                                      5
         3  George of the Jungle (1997)                       5
         4  FairyTale: A True Story (1997)                    5
         5  Schindler's List (1993)                           5
         6  Nil By Mouth (1997)                               5
         7  Four Weddings and a Funeral (1994)                4
         8  Godfather, The (1972)                             4
         9  Gattaca (1997)                                    4
        10  In the Name of the Father (1993)                  4
        11  Desperate Measures (1998)                         4
        12  Fallen (1998)                                     4
        13  Naked Gun 33 1/3: The Final Insult (1994)         4
        14  Fly Away Home (1996)                              3
        15  Misérables, Les (1995)      

Oh, they are a fan of thriller and action movies! Such good taste...

### Content-based filtering

Now, knowing which movies the user watched, and especially which ones they enjoyed the most and the least, the Recommender System can suggest **new** movies to the user that are similar (i.e., have similar *latent features*) to the movies they liked.\
Let's see this list in practice:

In [5]:
rs.contentbased_filtering(user="User 475", rec_len=20, print_chart=True)

Top 20 content-based recommendations for user User 475:
  Position  Item Name                                    Similarity
         1  Man Who Knew Too Little, The (1997)          40.48%
         2  Heat (1995)                                  38.93%
         3  Return of the Jedi (1983)                    38.60%
         4  L.A. Confidential (1997)                     37.03%
         5  Grosse Pointe Blank (1997)                   35.17%
         6  Shall We Dance? (1996)                       35.00%
         7  Jackal, The (1997)                           34.95%
         8  Treasure of the Sierra Madre, The (1948)     34.68%
         9  To Catch a Thief (1955)                      34.66%
        10  Men in Black (1997)                          34.44%
        11  Once Upon a Time in America (1984)           33.82%
        12  Chasing Amy (1997)                           33.80%
        13  3 Ninjas: High Noon At Mega Mountain (1998)  33.32%
        14  Jackie Brown (1997)             

This is usually called *Content-based filtering*, because of the way new items are chosen: by their similarity to the items the user already liked.
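
The idea can be sketched as follows. This is only one plausible way to do it, not necessarily what `contentbased_filtering()` does: the function name `content_based_sketch`, the rating-weighted user profile and the use of cosine similarity are all assumptions made for illustration.

```python
import numpy as np

def content_based_sketch(item_emb, user_ratings, top_k=2):
    """Rank unseen items by cosine similarity to a rating-weighted profile."""
    seen = ~np.isnan(user_ratings)
    # Weight each rated item's embedding by its rating (unrated items get weight 0)
    weights = np.where(seen, user_ratings, 0.0)
    profile = weights @ item_emb

    # Cosine similarity between the profile and every item embedding
    norms = np.linalg.norm(item_emb, axis=1) * np.linalg.norm(profile)
    sims = (item_emb @ profile) / np.where(norms == 0, 1.0, norms)

    # Never recommend something the user has already rated
    sims[seen] = -np.inf
    order = np.argsort(-sims)[:top_k]
    return [(int(i), float(sims[i])) for i in order if np.isfinite(sims[i])]

# Toy data (hypothetical): 4 items with 2 latent features each
toy_item_emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
toy_user_ratings = np.array([5.0, np.nan, 1.0, np.nan])  # rated items 0 and 2

recs = content_based_sketch(toy_item_emb, toy_user_ratings)
print(recs)  # item 1 (close to the 5-star item) ranks above item 3
```

With raw ratings as weights, a low-rated item still pulls the profile slightly towards itself; centering the ratings around the user's mean would be one way to turn disliked items into negative evidence.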

### Collaborative filtering

There exists another form of filtering, called *Collaborative filtering*: the Recommender System collects movies from users that have tastes in common with the selected user, and ranks them w.r.t. the user's *latent features*. Here's a list of movies that are suggested for *User 475* using *Collaborative filtering*:

In [6]:
rs.collaborative_filtering(user="User 475", rec_len=20, print_chart=True)

Top 20 collaborative recommendations for user User 475:
  Position  Item Name                                                                    Similarity
         1  Return of the Jedi (1983)                                                    59.92%
         2  Heat (1995)                                                                  57.86%
         3  Toy Story (1995)                                                             56.89%
         4  Lone Star (1996)                                                             56.74%
         5  Men in Black (1997)                                                          55.90%
         6  Liar Liar (1997)                                                             55.88%
         7  Grosse Pointe Blank (1997)                                                   55.50%
         8  Treasure of the Sierra Madre, The (1948)                                     55.49%
         9  Leaving Las Vegas (1995)                                        

In the real world, these two approaches are often used together: the list of suggested items consists of both items that are similar to those liked by the user, and items that have been seen and rated positively by users who have tastes in common with that user.
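
The collaborative approach described above can be sketched in a few lines. Again, this is an illustration under stated assumptions rather than the actual `collaborative_filtering()` implementation: the function name `collaborative_sketch`, the cosine-based choice of neighbours, and the `n_neighbours` and `min_rating` parameters are all hypothetical.

```python
import numpy as np

def collaborative_sketch(user_emb, item_emb, ratings, user,
                         n_neighbours=2, min_rating=4.0):
    """Collect items liked by similar users, ranked by the target user's embedding."""
    # Cosine similarity between the target user and every other user
    unit = user_emb / np.linalg.norm(user_emb, axis=1, keepdims=True)
    sims = unit @ unit[user]
    sims[user] = -np.inf  # a user is not their own neighbour
    neighbours = np.argsort(-sims)[:n_neighbours]

    # Items that at least one neighbour rated highly (missing ratings count as 0)
    liked = (np.nan_to_num(ratings[neighbours], nan=0.0) >= min_rating).any(axis=0)
    unseen = np.isnan(ratings[user])
    candidates = np.flatnonzero(liked & unseen)

    # Rank the candidates w.r.t. the target user's latent features
    scores = item_emb[candidates] @ user_emb[user]
    return [int(i) for i in candidates[np.argsort(-scores)]]

# Toy data (hypothetical): 4 users, 5 items, 2 latent features
toy_user_emb = np.array([[1.0, 0.0], [0.9, 0.2], [0.0, 1.0], [0.8, 0.1]])
toy_item_emb = np.array([[1.0, 0.0], [0.7, 0.3], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]])
toy_ratings = np.array([
    [5.0,    np.nan, np.nan, np.nan, 4.0],
    [4.0,    5.0,    np.nan, 2.0,    np.nan],
    [np.nan, np.nan, 5.0,    5.0,    np.nan],
    [5.0,    4.0,    4.0,    np.nan, np.nan],
])

collab_recs = collaborative_sketch(toy_user_emb, toy_item_emb, toy_ratings, user=0)
print(collab_recs)  # items liked by users 1 and 3 that user 0 has not rated
```

A combined list could then be obtained, for example, by merging these candidates with the content-based ones and re-ranking them with a single score.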

## Possible future improvements

The Recommender System we described is a good starting point, but far from being a perfect system.\
Here are some ideas for expanding and improving the project:
* for the Weighted Matrix Factorization, *Alternating Least Squares* was used to find the embedding matrices. It would be useful to also try other methods, e.g. *Stochastic Gradient Descent*;
* *Alternating Least Squares* is known to converge very quickly. It would be nice to have a visual representation of the learning curve, also trying different hyperparameters;
* the collections of users, movies and ratings have been treated as static and immutable over time. In a real-world scenario, this is never the case. Methods for updating the embeddings after a new user, movie or rating is recorded could be studied and implemented;
* our Recommender System does not include a **candidate selection** step: the corpus of items could potentially be huge, so retrieval must be fast. Having a separate function for this step also allows for a more precise selection in the **candidate ranking** step, taking into consideration further information (e.g., the date of the ratings, movie genres, ...);
* the code of the `__update_users_matrix` and `__update_items_matrix` methods of the `WeightedMatrixFactorization` class could be optimized with parallelization.
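
As a starting point for the learning-curve idea, here is a minimal sketch of Alternating Least Squares that records the training loss after every iteration. It is a simplification and not the code of the `WeightedMatrixFactorization` class: only observed ratings contribute to the loss (unobserved entries get zero weight), an L2 penalty `reg` is added, and the names `als_sketch`, `solve_rows` and `history` are hypothetical.

```python
import numpy as np

def als_sketch(R, n_factors=2, n_iters=10, reg=0.1, seed=0):
    """Regularized ALS on the observed entries of R; returns U, V and the loss history."""
    rng = np.random.default_rng(seed)
    observed = ~np.isnan(R)
    R0 = np.where(observed, R, 0.0)
    U = rng.normal(scale=0.1, size=(R.shape[0], n_factors))
    V = rng.normal(scale=0.1, size=(R.shape[1], n_factors))
    ridge = reg * np.eye(n_factors)

    def solve_rows(fixed, mask, targets):
        # One small ridge regression per row, using only that row's observed entries
        out = np.zeros((mask.shape[0], n_factors))
        for i in range(mask.shape[0]):
            F = fixed[mask[i]]
            out[i] = np.linalg.solve(F.T @ F + ridge, F.T @ targets[i, mask[i]])
        return out

    history = []
    for _ in range(n_iters):
        U = solve_rows(V, observed, R0)      # update users with items fixed
        V = solve_rows(U, observed.T, R0.T)  # update items with users fixed
        err = (R0 - U @ V.T)[observed]
        history.append(float(err @ err + reg * ((U ** 2).sum() + (V ** 2).sum())))
    return U, V, history

# Toy ratings matrix (hypothetical), with nan for missing ratings
toy_R = np.array([
    [5.0,    np.nan, 4.0,    np.nan],
    [4.0,    5.0,    np.nan, 2.0],
    [np.nan, 1.0,    5.0,    5.0],
])
U, V, history = als_sketch(toy_R)
print(history)  # regularized training loss after each iteration
```

Plotting `history` against the iteration index for several values of `n_factors` and `reg` gives the learning curves mentioned above. Each row in `solve_rows` is solved independently of the others, which is what makes the update steps natural candidates for parallelization.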