# Building a Recommender Engine

"Users who liked ... also liked ...": nowadays, **recommender engines** are everywhere on the web. A recommender engine is any of a large variety of algorithms that recommend items to users while trying to maximize the likelihood that the user will select them. A widely used family of such algorithms is known as **collaborative filtering**, because these algorithms let a user draw on the input of many previous users to help them sift through the data.

In this example, we are going to build a simple recommender engine for movies. Given the ratings (0.5 to 5 stars, in half-star steps) that a user has given to movies, the engine is going to predict the ratings that the user is likely to give to previously unseen movies.

## Preamble

In [1]:
%matplotlib inline

In [2]:
import pandas

In [3]:
import datascience101

## Example: Generating Movie Recommendations

In [4]:
import surprise

### Loading the Data

Our training data comes from the [MovieLens](https://grouplens.org/datasets/movielens/) dataset.

In [5]:
data_dir = "../.assets/data/movielens/small"

In [6]:
movies = pandas.read_csv(f"{data_dir}/movies.csv")
ratings = pandas.read_csv(f"{data_dir}/ratings.csv")

In [7]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [8]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [9]:
!head {data_dir}/ratings.csv

userId,movieId,rating,timestamp
1,31,2.5,1260759144
1,1029,3.0,1260759179
1,1061,3.0,1260759182
1,1129,2.0,1260759185
1,1172,4.0,1260759205
1,1263,2.0,1260759151
1,1287,2.0,1260759187
1,1293,2.0,1260759148
1,1339,3.5,1260759125


In [10]:
ratings = surprise.Dataset.load_from_file(
    file_path=f"{data_dir}/ratings.csv",
    reader=surprise.Reader(
        line_format="user item rating timestamp", 
        sep=",", 
        skip_lines=1
    )
)

In [11]:
ratings

<surprise.dataset.DatasetAutoFolds at 0x1a1d132710>

### Training a Recommendation Model

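We use the **SVD** algorithm from the [**surprise**](http://surpriselib.com/) library and estimate its accuracy with 5-fold cross-validation, reporting the root mean squared error (RMSE) and the mean absolute error (MAE).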

In [12]:
%%time
surprise.model_selection.cross_validate(
    surprise.SVD(), 
    ratings, 
    measures=['RMSE', 'MAE'], 
    cv=5, 
    verbose=True
)


Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9035  0.8955  0.8937  0.8944  0.8927  0.8960  0.0039  
MAE (testset)     0.6943  0.6906  0.6873  0.6876  0.6868  0.6893  0.0028  
Fit time          5.78    5.74    5.96    5.58    5.86    5.79    0.13    
Test time         0.16    0.15    0.23    0.15    0.22    0.18    0.03    
CPU times: user 29.9 s, sys: 358 ms, total: 30.3 s
Wall time: 30.9 s


{'test_rmse': array([0.90350182, 0.8954691 , 0.89373232, 0.89442176, 0.89265432]),
 'test_mae': array([0.69429678, 0.69059403, 0.68725414, 0.687614  , 0.68679825]),
 'fit_time': (5.781611919403076,
  5.741915941238403,
  5.960292816162109,
  5.580376148223877,
  5.861799955368042),
 'test_time': (0.1619250774383545,
  0.1517181396484375,
  0.22964978218078613,
  0.14802789688110352,
  0.21787786483764648)}
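
The two error measures in the table can be reproduced by hand. The following is a minimal sketch in plain Python; the helper names `rmse` and `mae` and the four ratings are made up for illustration and are not taken from MovieLens or from surprise.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: large misses are penalized more strongly."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mae(actual, predicted):
    """Mean absolute error: the average size of a miss, in stars."""
    absolute_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute_errors) / len(absolute_errors)


# Hypothetical true and predicted ratings for four user/movie pairs.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]

print(f"RMSE: {rmse(actual, predicted):.4f}")  # RMSE: 0.7500
print(f"MAE:  {mae(actual, predicted):.4f}")   # MAE:  0.6250
```

Read this way, the mean MAE of roughly 0.69 in the table means that the predictions miss the true rating by about 0.7 stars on average.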

### Example Recommendations

As a sanity check, let's hold out a quarter of the ratings, pick out a couple of users, and look at the ratings the model predicts for their held-out movies:

In [13]:
from surprise.model_selection import train_test_split

In [14]:
ratings_train, ratings_test = train_test_split(ratings, test_size=.25)


In [20]:
predictions = surprise.SVD().fit(ratings_train).test(ratings_test)

In [48]:
predicted_ratings = pandas.DataFrame(
    [
        {"userId": pred.uid, "movieId": pred.iid, "rating": pred.est} for pred in predictions
    ],
    columns=["userId", "movieId", "rating"],
)

In [39]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [42]:
movies.dtypes

movieId     int64
title      object
genres     object
dtype: object

In [49]:
# surprise read the IDs from the CSV file as strings, so cast movieId to match before joining.
movies["movieId"] = movies["movieId"].astype("str")

In [52]:
predicted_ratings = predicted_ratings.join(movies.set_index("movieId"), on="movieId")

In [53]:
predicted_ratings.head()

Unnamed: 0,userId,movieId,rating,title,genres
0,624,112183,3.119373,Birdman: Or (The Unexpected Virtue of Ignoranc...,Comedy|Drama
1,254,223,3.89247,Clerks (1994),Comedy
2,596,1333,3.917722,"Birds, The (1963)",Horror|Thriller
3,237,1,3.591756,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
4,219,2791,4.055229,Airplane! (1980),Comedy


In [54]:
example_user = "642"

In [55]:
predicted_ratings[predicted_ratings["userId"] == example_user]

Unnamed: 0,userId,movieId,rating,title,genres
2998,642,954,4.282427,Mr. Smith Goes to Washington (1939),Drama
8629,642,1600,3.248652,She's So Lovely (1997),Drama|Romance
8932,642,1633,3.466873,Ulee's Gold (1997),Drama
10835,642,1218,4.220335,"Killer, The (Die xue shuang xiong) (1989)",Action|Crime|Drama|Thriller
11334,642,1199,3.827965,Brazil (1985),Fantasy|Sci-Fi
11709,642,1644,2.842919,I Know What You Did Last Summer (1997),Horror|Mystery|Thriller
12473,642,924,3.893607,2001: A Space Odyssey (1968),Adventure|Drama|Sci-Fi
13789,642,906,3.826673,Gaslight (1944),Drama|Thriller
15432,642,457,4.000352,"Fugitive, The (1993)",Thriller
15648,642,1230,4.146343,Annie Hall (1977),Comedy|Romance


In [56]:
example_user = "42"

In [57]:
predicted_ratings[predicted_ratings["userId"] == example_user]

Unnamed: 0,userId,movieId,rating,title,genres
752,42,1370,3.485613,Die Hard 2 (1990),Action|Adventure|Thriller
1351,42,122886,3.39753,Star Wars: Episode VII - The Force Awakens (2015),Action|Adventure|Fantasy|Sci-Fi|IMAX
5802,42,7153,4.405511,"Lord of the Rings: The Return of the King, The...",Action|Adventure|Drama|Fantasy
6444,42,58559,4.278184,"Dark Knight, The (2008)",Action|Crime|Drama|IMAX
9258,42,1200,4.065065,Aliens (1986),Action|Adventure|Horror|Sci-Fi
12713,42,112852,3.954363,Guardians of the Galaxy (2014),Action|Adventure|Sci-Fi
16721,42,1291,4.170588,Indiana Jones and the Last Crusade (1989),Action|Adventure
18127,42,589,4.050335,Terminator 2: Judgment Day (1991),Action|Sci-Fi
20410,42,3793,4.036874,X-Men (2000),Action|Adventure|Sci-Fi
20883,42,508,4.131817,Philadelphia (1993),Drama
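
To turn predicted ratings like these into actual recommendations, one option is to sort each user's rows by predicted rating and keep the top few. The sketch below assumes a small hand-made DataFrame in the same shape as `predicted_ratings` (values rounded from the tables above); the helper `top_n` is hypothetical and not part of surprise. Note that the candidates here are only movies that happened to land in the test split; a real recommender would score movies the user has not rated yet.

```python
import pandas

# Hypothetical predicted ratings, shaped like predicted_ratings above.
predicted = pandas.DataFrame(
    {
        "userId": ["42", "42", "42", "642", "642"],
        "title": [
            "Aliens (1986)",
            "X-Men (2000)",
            "Philadelphia (1993)",
            "Brazil (1985)",
            "Annie Hall (1977)",
        ],
        "rating": [4.07, 4.04, 4.13, 3.83, 4.15],
    }
)


def top_n(predictions, user_id, n=2):
    """Return the n titles with the highest predicted rating for one user."""
    user_rows = predictions[predictions["userId"] == user_id]
    return user_rows.nlargest(n, "rating")["title"].tolist()


print(top_n(predicted, "42"))  # ['Philadelphia (1993)', 'Aliens (1986)']
```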


## So how does it actually work?

In this course we do not go deep into the mathematics or algorithmics of machine learning, but since you asked: the SVD algorithm used above is based on a mathematical technique called **matrix factorization**. [This blogpost](https://beckernick.github.io/matrix-factorization-recommender/) explains the approach, also using the movie ratings data set. As usual in machine learning, matrix factorization entails an optimization problem. The SVD implementation in surprise solves it with stochastic gradient descent; **alternating least squares** (ALS) is another fast and parallelizable way of solving it, as [explained here](https://www.quora.com/What-is-the-Alternating-Least-Squares-method-in-recommendation-systems-And-why-does-this-algorithm-work-intuition-behind-this).
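
To make the idea concrete, here is a minimal NumPy sketch of alternating least squares on a tiny, made-up ratings matrix. It is an illustration under stated assumptions, not what surprise does internally: the function name `als`, the hyperparameters, and the toy data are all hypothetical, and the sketch leaves out the bias terms that practical models usually include. The ratings matrix is approximated as the product of a user-factor matrix and an item-factor matrix; the algorithm alternates between fixing one and solving a small regularized least-squares problem for each row of the other.

```python
import numpy as np


def als(ratings, observed, n_factors=2, n_iters=20, reg=0.1, seed=0):
    """Factorize ratings ~ users @ items.T using only the observed entries."""
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    users = rng.normal(scale=0.1, size=(n_users, n_factors))
    items = rng.normal(scale=0.1, size=(n_items, n_factors))
    ridge = reg * np.eye(n_factors)

    for _ in range(n_iters):
        # Fix the item factors and solve a small ridge regression per user.
        for u in range(n_users):
            seen = observed[u]
            V = items[seen]
            users[u] = np.linalg.solve(V.T @ V + ridge, V.T @ ratings[u, seen])
        # Fix the user factors and solve a small ridge regression per item.
        for i in range(n_items):
            seen = observed[:, i]
            U = users[seen]
            items[i] = np.linalg.solve(U.T @ U + ridge, U.T @ ratings[seen, i])
    return users, items


# Hypothetical 4 users x 5 movies; 0 marks a missing rating.
ratings = np.array(
    [
        [5, 4, 0, 1, 0],
        [4, 0, 5, 1, 2],
        [1, 1, 0, 5, 4],
        [0, 2, 1, 4, 5],
    ],
    dtype=float,
)
observed = ratings > 0

users, items = als(ratings, observed)
predicted = users @ items.T  # includes estimates for the missing entries
rmse = np.sqrt(np.mean((ratings[observed] - predicted[observed]) ** 2))
print(f"RMSE on observed ratings: {rmse:.3f}")
print(np.round(predicted, 1))
```

Each per-user and per-item solve is independent of the others within a step, which is why ALS parallelizes well.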

---
_This notebook is licensed under a [Creative Commons Attribution 4.0 International License (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/). Copyright © 2019 [Point 8 GmbH](https://point-8.de)_