In [1]:
# Move to the directory where the project is located
import os
os.chdir('/work/notebooks/enstit/RecommenderSystem')

<div align="center">
  <img src="./assets/steflix.png" width="225px">
</div>

# 🍿 STEFLIX: A Movie-based Recommender System

The aim of this repository is to build a **Recommender System** that uses Matrix Factorization to learn user and item embeddings from a (sparse) review matrix, and then uses them to make user-specific suggestions.

## Reading data

The data we use to populate our Recommender System is located in the `data` folder. In particular, the `movies.csv` file lists all the movies in the collection, and the `ratings.csv` file contains all the ratings that users gave to those movies. In our system, a *user* can give any *movie* a score from 0 to 5, and half points are allowed (e.g. 4.5 for an almost-perfect movie).

In [2]:
import pandas as pd

# Read movies and ratings from the related CSV files
movies = pd.read_csv('./data/movies.csv', usecols=['movieId', 'title', 'genres'])
ratings = pd.read_csv('./data/ratings.csv', usecols=['userId', 'movieId', 'rating'])

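# Keep all the user IDs (as strings) and the first 1,000 movie titles
users_list = ratings.userId.unique().astype(str).tolist()
movies_list = movies.title.unique().astype(str).tolist()[:1_000]

# Users x movies rating matrix, limited to the first 1,000 movie columns
C = pd.pivot_table(ratings, values='rating', index=['userId'], columns=['movieId']).values[:, :1_000]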

Now we have:
* the `users_list`, which contains the unique IDs (as strings) of all the users who rated at least one movie,
* the `movies_list`, which contains the titles of the first 1,000 movies in the collection, and
* the matrix `C`, which holds the rating that each user gave to each movie (limited to the first 1,000 movie columns). If a user has left no rating for a specific movie, the related cell contains a `NaN` value.
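To make the shape of `C` concrete, here is a minimal sketch on made-up data. The `toy_*` names and all values are hypothetical and only stand in for the real CSV files. It also shows one possible way to keep the titles aligned with the matrix columns, since the pivot table only creates columns for movies that received at least one rating.

```python
import numpy as np
import pandas as pd

# Toy data standing in for movies.csv and ratings.csv (values are made up)
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Movie A", "Movie B", "Movie C"],
})
toy_ratings = pd.DataFrame({
    "userId": [10, 10, 20],
    "movieId": [1, 3, 3],
    "rating": [4.5, 2.0, 5.0],
})

# Rows are users, columns are movies; missing (user, movie) pairs become NaN.
# Movie 2 has no ratings at all, so it gets no column.
toy_pivot = pd.pivot_table(toy_ratings, values="rating", index="userId", columns="movieId")
toy_C = toy_pivot.values

# Look up titles in the pivot's own column order, so that column j of toy_C
# always corresponds to toy_titles[j]
toy_titles = toy_movies.set_index("movieId").loc[toy_pivot.columns, "title"].tolist()

print(toy_C)
print(toy_titles)
```

Whether this kind of alignment is needed for the real dataset depends on whether every movie in `movies.csv` has at least one rating, which is not checked here.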

## Building the Recommender System

The `utils` folder contains all the class definitions that our system needs to work. In particular, the `recsys` package contains the definition of the `RecommenderSystem` class.

After initializing the `rs` object, we have to call the `build_embeddings()` method to compute the *user* and *item* embeddings (using Weighted Matrix Factorization).
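The internals of `build_embeddings()` are not shown in this notebook. As a rough illustration of the idea behind Weighted Matrix Factorization, the sketch below factorizes a small made-up matrix with plain gradient descent, treating missing cells as zeros that carry a small weight `w0`. The function name `wmf_sketch`, its parameters and the optimizer are all assumptions made for this example, not the project's actual implementation.

```python
import numpy as np

def wmf_sketch(C, n_factors=2, w0=0.1, lr=0.01, reg=0.01, n_iters=2000, seed=0):
    """Approximate C with U @ V.T, weighting observed cells 1 and missing cells w0."""
    rng = np.random.default_rng(seed)
    observed = ~np.isnan(C)
    target = np.where(observed, C, 0.0)    # missing cells are treated as 0...
    weights = np.where(observed, 1.0, w0)  # ...but count much less in the loss
    U = rng.normal(scale=0.1, size=(C.shape[0], n_factors))
    V = rng.normal(scale=0.1, size=(C.shape[1], n_factors))
    for _ in range(n_iters):
        err = weights * (target - U @ V.T)  # weighted residuals
        # Descent directions of the weighted squared error with L2 regularization
        U_step = err @ V - reg * U
        V_step = err.T @ U - reg * V
        U += lr * U_step
        V += lr * V_step
    return U, V

# Made-up review matrix: 3 users x 4 movies, NaN where no rating exists
toy_reviews = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, np.nan, 1.0, 1.0],
    [1.0, 1.0, np.nan, 5.0],
])
U, V = wmf_sketch(toy_reviews)
predicted = U @ V.T  # dense matrix: also has values where toy_reviews is NaN
print(np.round(predicted, 2))
```

The weight `w0` controls how strongly unobserved cells pull predictions toward zero. Production implementations usually rely on alternating least squares or a dedicated library instead of this hand-written loop.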

In [7]:
from utils.recsys import RecommenderSystem

try:
    # Load the recommender system from filesystem
    rs = RecommenderSystem(name="Steflix")
    rs = rs.load()
except FileNotFoundError:
    rs = RecommenderSystem(name="Steflix", reviews=C, users=users_list, items=movies_list) # Create a new recommender system
    rs.build_embeddings() # Build the users and items embeddings
    rs.save() # Save the recommender system to disk for persistence

2024-01-19 15:26:21,327 - DEBUG - Loading Steflix from filesystem...
2024-01-19 15:26:21,329 - DEBUG - No filename provided, using default (object_name.pkl)...


Now our Recommender System contains information about the user names, the movie titles and the related ratings.

But what about movies that a user has not watched yet? The embeddings we built are useful for exactly this reason: they let us infer how strongly a user is attracted to certain *latent features* and, at the same time, how well each movie is described by each of those *latent features*. By putting the two together, we can try to guess which movies a user might enjoy before that user has even seen them.
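One common way to "put the two together" is to take the dot product of a user embedding and a movie embedding. The sketch below assumes that scoring rule and uses made-up embeddings with two latent features. The variable names and values are hypothetical, and the `RecommenderSystem` class may combine embeddings differently.

```python
import numpy as np

# Hypothetical embeddings with 2 latent features (values are made up)
user_embeddings = np.array([
    [0.9, 0.1],  # user 0: mostly drawn to latent feature 0
    [0.2, 0.8],  # user 1: mostly drawn to latent feature 1
])
item_embeddings = np.array([
    [1.0, 0.0],  # item 0: described purely by latent feature 0
    [0.0, 1.0],  # item 1: described purely by latent feature 1
    [0.5, 0.5],  # item 2: an even mix of both
])

# Entry (i, j) is the dot product of user i's and item j's embeddings,
# i.e. an estimated affinity of user i for item j
scores = user_embeddings @ item_embeddings.T

# Rank the items for user 0 from highest to lowest estimated affinity
ranking_user0 = np.argsort(-scores[0])
print(scores)
print(ranking_user0)
```

A user whose embedding leans toward a latent feature gets higher scores for items described by that same feature, which is what lets the system rank movies the user has never rated.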

Let's start by looking at the 20 movies that user *569* rated highest, for example:

In [12]:
# Print the 20 most liked movies by user "569"
rs.print_user_chart(user="569", first_n=20)

  Position  Item Name                                 Rating
         1  Batman Forever (1995)                          5
         2  Die Hard: With a Vengeance (1995)              5
         3  Pulp Fiction (1994)                            5
         4  Star Trek: Generations (1994)                  5
         5  Speed (1994)                                   5
         6  GoldenEye (1995)                               4
         7  Net, The (1995)                                4
         8  Stargate (1994)                                4
         9  Ace Ventura: Pet Detective (1994)              4
        10  Clear and Present Danger (1994)                4
        11  Cliffhanger (1993)                             4
        12  Jurassic Park (1993)                           4
        13  Aladdin (1992)                                 4
        14  Dances with Wolves (1990)                      4
        15  Beauty and the Beast (1991)                    4
        16  Usual Suspec

Oh, they are a fan of thriller and action movies! Such good taste...

In [9]:
rs.contentbased_filtering(user="569", rec_len=10, print_chart=True)

Top 10 content-based recommendations for user 569:
  Position  Item Name                             Expected rating  Similarity
         1  Crimson Tide (1995)                          1.77123   23.63%
         2  Fatal Instinct (1993)                        0.429341  19.14%
         3  Bad Company (1995)                           0.133276  18.22%
         4  Quick and the Dead, The (1995)               0.625093  17.98%
         5  Hard Target (1993)                           0.335905  16.97%
         6  Disclosure (1994)                            0.832943  15.76%
         7  Outbreak (1995)                              1.13676   15.46%
         8  In the Line of Fire (1993)                   0.950335  14.18%
         9  Around the World in 80 Days (1956)           0.382406  14.05%
        10  Mixed Nuts (1994)                            0.133761  14.05%


In [10]:
rs.collaborative_filtering(user="569", rec_len=10, print_chart=True)

Top 10 collaborative recommendations for user 569:
  Position  Item Name                                                    Expected rating  Similarity
         1  Crimson Tide (1995)                                                 1.77123   52.61%
         2  Apollo 13 (1995)                                                    1.19906   46.85%
         3  Outbreak (1995)                                                     1.13676   43.24%
         4  Fugitive, The (1993)                                                0.342668  36.26%
         5  Disclosure (1994)                                                   0.832943  35.24%
         6  Quiz Show (1994)                                                    0.594029  31.78%
         7  Waterworld (1995)                                                   0.52041   29.85%
         8  While You Were Sleeping (1995)                                      0.656457  28.18%
         9  Shawshank Redemption, The (1994)                            