## Recommendation Engine

The aim of this notebook is to build a movie recommendation engine. We'll use the MovieLens 25M [dataset](https://www.grouplens.org/datasets/movielens/), which contains 25 million ratings. We'll look at two main approaches:
1. Content-based filtering - generate recommendations based on movie features
2. Collaborative filtering - generate recommendations based on movie features and similar users' ratings

In [28]:
import pandas as pd
import numpy as np
from pathlib import Path
import re

from sklearn import base
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestNeighbors

We'll start by loading in the movie datasets and exploring the contents.

In [29]:
df_movies = pd.read_csv(Path('./ml-25m/movies.csv'))
df_movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [30]:
df_ratings = pd.read_csv(Path('./ml-25m/ratings.csv'))
df_ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510


In [31]:
df_tags = pd.read_csv(Path('./ml-25m/tags.csv'))
df_tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,3,260,classic,1439472355
1,3,260,sci-fi,1439472256
2,4,1732,dark comedy,1573943598
3,4,1732,great dialogue,1573943604
4,4,7569,so bad it's good,1573943455


In [32]:
df_links = pd.read_csv(Path('./ml-25m/links.csv'))
df_links.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [33]:
df_gtags = pd.read_csv(Path('./ml-25m/genome-tags.csv'))
df_gtags.head()

Unnamed: 0,tagId,tag
0,1,007
1,2,007 (series)
2,3,18th century
3,4,1920s
4,5,1930s


In [34]:
df_gscores = pd.read_csv(Path('./ml-25m/genome-scores.csv'))
df_gscores.head()

Unnamed: 0,movieId,tagId,relevance
0,1,1,0.02875
1,1,2,0.02375
2,1,3,0.0625
3,1,4,0.07575
4,1,5,0.14075


### Preprocessing
Starting with the movies dataframe, let's extract the year from the title.

In [35]:
def get_year(string):
    """Return the first four-digit year found in parentheses, or 0 if there is none."""
    year = re.findall(r"\(([0-9]{4})\)", string)
    if year:
        return int(year[0])
    return 0  # sentinel value: returning None would coerce the column to float

movies = df_movies.copy()
movies['year'] = df_movies['title'].apply(get_year)
movies.sort_values('year', ascending=False).head()

Unnamed: 0,movieId,title,genres,year
59184,200300,Triple Threat (2019),Action|Thriller,2019
58190,198087,Non ci resta che il crimine (2019),Comedy,2019
60410,203242,Framing John DeLorean (2019),Documentary|Drama,2019
59710,201452,A March to Remember (2019),Drama,2019
59709,201450,Buñuel in the Labyrinth of the Turtles (2019),Animation,2019


In [36]:
# Earliest movies with valid year listed
movies.query('year > 0').sort_values('year', ascending=True).head()

Unnamed: 0,movieId,title,genres,year
35536,148054,Passage de Venus (1874),Documentary,1874
35533,148048,Sallie Gardner at a Gallop (1878),(no genres listed),1878
59938,202045,Athlete Swinging a Pick (1880),Documentary,1880
43771,166800,Buffalo Running (1883),(no genres listed),1883
35529,148040,Man Walking Around a Corner (1887),(no genres listed),1887
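
The `0` sentinel in `get_year` keeps the column integer-typed, at the cost of needing a `year > 0` filter afterwards. As an alternative sketch (not what this notebook does), and assuming a pandas version that supports the nullable `Int64` dtype, the year can be extracted in a vectorized way while keeping missing values explicit. The `titles` series is made-up sample data.

```python
import pandas as pd

# Made-up sample titles; the second has no year in parentheses.
titles = pd.Series(["Toy Story (1995)", "Untitled Project", "2001: A Space Odyssey (1968)"])

# str.extract returns the first match of the capture group, or NaN when nothing matches.
raw_year = titles.str.extract(r"\((\d{4})\)", expand=False)

# to_numeric yields floats with NaN; the nullable Int64 dtype stores integers plus <NA>.
year = pd.to_numeric(raw_year).astype("Int64")
print(year)
```

With this variant, missing years are filtered with `year.notna()` rather than `year > 0`.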


Next we'll convert the genres field from a pipe-separated string into a list so that we can do feature engineering with it.

In [37]:
movies['genres'] = df_movies['genres'].apply(lambda x: x.split('|'))
movies.set_index('movieId')

movieId,title,genres,year
1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]",1995
2,Jumanji (1995),"[Adventure, Children, Fantasy]",1995
3,Grumpier Old Men (1995),"[Comedy, Romance]",1995
4,Waiting to Exhale (1995),"[Comedy, Drama, Romance]",1995
5,Father of the Bride Part II (1995),[Comedy],1995
...,...,...,...
209157,We (2018),[Drama],2018
209159,Window of the Soul (2001),[Documentary],2001
209163,Bad Poems (2018),"[Comedy, Drama]",2018
209169,A Girl Thing (2001),[(no genres listed)],2001


We'll do the same for tags after grouping by movie.

In [38]:
tags = df_tags.groupby('movieId')['tag'].apply(lambda x: x.tolist())
tags.head()

movieId
1    [Owned, imdb top 250, Pixar, Pixar, time trave...
2    [Robin Williams, time travel, fantasy, based o...
3    [funny, best friend, duringcreditsstinger, fis...
4    [based on novel or book, chick flick, divorce,...
5    [aging, baby, confidence, contraception, daugh...
Name: tag, dtype: object

And finally let's merge the dataframes.

In [39]:
combined = movies.merge(tags.to_frame(), on='movieId', how='left')
combined.head()

Unnamed: 0,movieId,title,genres,year,tag
0,1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]",1995,"[Owned, imdb top 250, Pixar, Pixar, time trave..."
1,2,Jumanji (1995),"[Adventure, Children, Fantasy]",1995,"[Robin Williams, time travel, fantasy, based o..."
2,3,Grumpier Old Men (1995),"[Comedy, Romance]",1995,"[funny, best friend, duringcreditsstinger, fis..."
3,4,Waiting to Exhale (1995),"[Comedy, Drama, Romance]",1995,"[based on novel or book, chick flick, divorce,..."
4,5,Father of the Bride Part II (1995),[Comedy],1995,"[aging, baby, confidence, contraception, daugh..."


In [40]:
combined.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 62423 entries, 0 to 62422
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  62423 non-null  int64 
 1   title    62423 non-null  object
 2   genres   62423 non-null  object
 3   year     62423 non-null  int64 
 4   tag      45251 non-null  object
dtypes: int64(2), object(3)
memory usage: 2.9+ MB
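
The `tag` column has 45,251 non-null entries out of 62,423 because the left merge keeps every movie, including those with no tags. The toy sketch below uses made-up frames (`toy_movies` and `toy_tags` are illustrative names) to show how an untagged movie ends up with `NaN` in the merged column.

```python
import pandas as pd

# Made-up data: movie 2 has no tags at all.
toy_movies = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
toy_tags = pd.Series(
    [["pixar", "funny"], ["western"]],
    index=pd.Index([1, 3], name="movieId"),
    name="tag",
)

# how='left' keeps all rows of toy_movies; unmatched rows get NaN in 'tag'.
toy_combined = toy_movies.merge(toy_tags.to_frame(), on="movieId", how="left")
print(toy_combined)
print(toy_combined["tag"].isna().sum(), "movie(s) without tags")
```

Any downstream step that iterates over the tag lists therefore has to cope with a float `NaN` as well as a list.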


### Model Development
We're now ready to start building the recommendation engine. We'll vectorize both genres and tags with one-hot encoding via a custom transformer.

In [41]:
class DictEncoder(base.BaseEstimator, base.TransformerMixin):
    """Transform list into dictionary with one-hot encoding"""
    def __init__(self, col):
        self.col = col
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        def to_dict(l):
            try:
                return {x: 1 for x in l}
            except TypeError:
                return {}
        
        return X[self.col].apply(to_dict)
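
To see what the encoder produces for a single row, here is a minimal sketch that applies the same `to_dict` logic as a standalone function. The inputs are made up: one tag list with a duplicate and one missing value.

```python
import numpy as np

def to_dict(values):
    """Same logic as the inner helper of DictEncoder.transform."""
    try:
        return {v: 1 for v in values}
    except TypeError:
        # A float NaN (movie without tags) is not iterable.
        return {}

with_duplicates = to_dict(["Pixar", "Pixar", "funny"])
missing = to_dict(np.nan)

print(with_duplicates)  # {'Pixar': 1, 'funny': 1}
print(missing)          # {}
```

Repeated tags collapse to a single key, so the encoding records whether a tag is present, not how often it was applied.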

Next let's build the pipeline. We'll use `DictVectorizer()` to transform the list of dictionaries from `DictEncoder()` into columns, where each key gets a column and the value is either 0 or 1.

In [42]:
genre_pipe = Pipeline([('encode_g', DictEncoder('genres')),
                       ('vectorizer', DictVectorizer())])
genres = genre_pipe.fit_transform(combined)
genres

<62423x20 sparse matrix of type '<class 'numpy.float64'>'
	with 112307 stored elements in Compressed Sparse Row format>
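
As a small illustration of the vectorizing step, the sketch below runs `DictVectorizer` on three made-up rows, including an empty dictionary such as the one produced for a movie without tags.

```python
from sklearn.feature_extraction import DictVectorizer

# Made-up encoder output: one dict per movie.
rows = [{"Comedy": 1, "Romance": 1}, {"Comedy": 1}, {}]

vec = DictVectorizer()  # sort=True by default, so columns are ordered by key
X = vec.fit_transform(rows)

print(vec.feature_names_)  # ['Comedy', 'Romance']
print(X.toarray())
# [[1. 1.]
#  [1. 0.]
#  [0. 0.]]
```

Each distinct key becomes one column, and a row gets a 1 only where its dictionary contains that key.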

In [43]:
tag_pipe = Pipeline([('encode_t', DictEncoder('tag')),
                     ('vectorizer', DictVectorizer(sort=False))])

union = FeatureUnion([('categories', genre_pipe),
                      ('tags', tag_pipe)])

features = union.fit_transform(combined)
features

<62423x73071 sparse matrix of type '<class 'numpy.float64'>'
	with 585169 stored elements in Compressed Sparse Row format>

The transformation yields 20 features from the genres, but roughly 73k once the tags are added! That is far more than the number of distinct concepts we would expect the tags to encode; many tags, and many combinations of genres and tags, likely represent the same concept. In such a large and sparsely populated feature space, it's unlikely that similar movies will sit near each other based on overlapping genres and tags alone. To mitigate this problem, we'll add some **dimensionality reduction** to compress the sparse tag space.

In [44]:
genre_pipe = Pipeline([('encode_g', DictEncoder('genres')),
                       ('vectorizer', DictVectorizer())])

tag_pipe = Pipeline([('encode_t', DictEncoder('tag')),
                     ('vectorizer', DictVectorizer(sort=False)),
                     ('svd', TruncatedSVD(n_components=80))])

union = FeatureUnion([('categories', genre_pipe),
                      ('tags', tag_pipe)])

features = union.fit_transform(combined)
features

<62423x100 sparse matrix of type '<class 'numpy.float64'>'
	with 3732387 stored elements in Compressed Sparse Row format>

There are significantly fewer features now: 20 genre columns plus 80 SVD components. The number of components can be tuned later based on outputs from the recommender.
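
One possible way to guide the choice of `n_components` is to look at how much variance the leading components retain. The sketch below is an assumption about how that tuning could be approached, not a step from this notebook, and it runs on a random sparse matrix standing in for the tag features.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Toy stand-in for the one-hot tag matrix: 300 "movies" x 1000 "tags", about 1% filled.
X = sparse.random(300, 1000, density=0.01, format="csr", random_state=0)
X.data[:] = 1.0  # one-hot style entries

svd = TruncatedSVD(n_components=50, random_state=0)
reduced = svd.fit_transform(X)

# Cumulative share of variance retained by the first k components.
cumulative = np.cumsum(svd.explained_variance_ratio_)
for k in (10, 25, 50):
    print(f"{k:>3} components keep {cumulative[k - 1]:.1%} of the variance")
```

On real tag data the curve usually flattens at some point, which suggests a reasonable range to try before checking the recommendations themselves.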

### Nearest Neighbors
Next let's implement an unsupervised learning algorithm for finding similar movies. `NearestNeighbors` is a simple method that, for a given query point, returns the points at the smallest distance from it. By default scikit-learn uses the Minkowski metric with `p=2`, which is the Euclidean distance.

In [45]:
nn = NearestNeighbors(n_neighbors=20).fit(features)
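
To make the behaviour concrete, here is a minimal sketch on four made-up 2-D points. Note that querying with a point that is part of the fitted data returns that point itself as the first neighbor, at distance 0.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Made-up points; index 0 is the query.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])

toy_nn = NearestNeighbors(n_neighbors=2).fit(points)
toy_dists, toy_indices = toy_nn.kneighbors(points[:1])

print(toy_indices[0])  # [0 1]
print(toy_dists[0])    # [0. 1.]
```

This is why the queried movie appears at the top of its own neighbor list in the spot checks.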

To assess how the model is performing, let's spot check some results. We'll start with the movie Toy Story:

In [46]:
dists, indices = nn.kneighbors(features[0])
combined.iloc[indices[0]]

Unnamed: 0,movieId,title,genres,year,tag
0,1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]",1995,"[Owned, imdb top 250, Pixar, Pixar, time trave..."
4780,4886,"Monsters, Inc. (2001)","[Adventure, Animation, Children, Comedy, Fantasy]",2001,"[Owned, imdb top 250, Katottava, funny, cute, ..."
8246,8961,"Incredibles, The (2004)","[Action, Adventure, Animation, Children, Comedy]",2004,"[imdb top 250, Pixar, superhero, animation, co..."
1996,2085,101 Dalmatians (One Hundred and One Dalmatians...,"[Adventure, Animation, Children]",1961,"[60s, children, Disney, animation, dalmatian, ..."
5110,5218,Ice Age (2002),"[Adventure, Animation, Children, Comedy]",2002,"[Owned, animated, Disney, Katottava, pixar, Ic..."
3021,3114,Toy Story 2 (1999),"[Adventure, Animation, Children, Comedy, Fantasy]",1999,"[Disney, Owned, imdb top 250, original, animat..."
6258,6377,Finding Nemo (2003),"[Adventure, Animation, Children, Comedy]",2003,"[Nice, imdb top 250, short-term memory loss, h..."
3650,3751,Chicken Run (2000),"[Animation, Children, Comedy]",2000,"[animation, Dreamworks, Mel Gibson, aardman, a..."
2264,2355,"Bug's Life, A (1998)","[Adventure, Animation, Children, Comedy]",1998,"[Owned, Pixar, Animated, Pixar, animation, ant..."
14477,76093,How to Train Your Dragon (2010),"[Adventure, Animation, Children, Fantasy, IMAX]",2010,"[accent, action, adventure, animated, animatio..."


The results look reasonable: a decent mix of genres, and of Pixar and other animated films. Let's try another movie.

In [47]:
# Find movie index for spot check
mask = combined['title'].str.contains('Kill Bill')
combined[mask]

Unnamed: 0,movieId,title,genres,year,tag
6751,6874,Kill Bill: Vol. 1 (2003),"[Action, Crime, Thriller]",2003,"[imdb top 250, Tarantino, dark hero, revenge, ..."
7299,7438,Kill Bill: Vol. 2 (2004),"[Action, Drama, Thriller]",2004,"[imdb top 250, Tarantino, Quentin Tarantino, r..."


In [48]:
dists, indices = nn.kneighbors(features[6751])
combined.iloc[indices[0]]

Unnamed: 0,movieId,title,genres,year,tag
6751,6874,Kill Bill: Vol. 1 (2003),"[Action, Crime, Thriller]",2003,"[imdb top 250, Tarantino, dark hero, revenge, ..."
7299,7438,Kill Bill: Vol. 2 (2004),"[Action, Drama, Thriller]",2004,"[imdb top 250, Tarantino, Quentin Tarantino, r..."
3892,3996,"Crouching Tiger, Hidden Dragon (Wo hu cang lon...","[Action, Drama, Romance]",2000,"[buddhism, china, imdb top 250, dragon, Ang Le..."
11647,53519,Death Proof (2007),"[Action, Adventure, Crime, Horror, Thriller]",2007,"[Quentin Tarantino, Rosario Dawson, slow, slow..."
1178,1209,Once Upon a Time in the West (C'era una volta ...,"[Action, Drama, Western]",1968,"[imdb top 250, ennio morricone, italo western,..."
6965,7090,Hero (Ying xiong) (2002),"[Action, Adventure, Drama]",2002,"[imdb top 250, amazing photography, amazing ph..."
21755,112171,"Equalizer, The (2014)","[Action, Crime, Thriller]",2014,"[one man army, cliché, dark hero, revenge, rus..."
15,16,Casino (1995),"[Crime, Drama]",1995,"[Mafia, Mafia, Martin Scorsese, organized crim..."
5356,5464,Road to Perdition (2002),"[Crime, Drama]",2002,"[Jude Law, organized crime, 1930s, al capone, ..."
49612,179247,Revenge (2017),"[Action, Thriller]",2017,"[badass lead, desert, female lead, gore, rape,..."


Results again look reasonable.
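
The lookup-then-query steps used in the spot checks could be wrapped in a small helper. The sketch below is hypothetical: the `recommend` function, its parameters, and the toy frame and features are all illustrative assumptions, not part of the notebook. It also drops the queried movie from its own results.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def recommend(title_fragment, frame, feats, model, k=5):
    """Return the k nearest movies to the first title containing title_fragment."""
    matches = frame.index[frame["title"].str.contains(title_fragment, regex=False)]
    if len(matches) == 0:
        raise ValueError(f"no title contains {title_fragment!r}")
    pos = frame.index.get_loc(matches[0])
    # Ask for one extra neighbor because the query movie is returned as well.
    _, idx = model.kneighbors(feats[pos:pos + 1], n_neighbors=k + 1)
    neighbours = [i for i in idx[0] if i != pos][:k]
    return frame.iloc[neighbours]

# Made-up frame and 2-D features standing in for `combined` and `features`.
toy_frame = pd.DataFrame({"title": ["Toy Story (1995)", "Toy Story 2 (1999)",
                                    "Casino (1995)", "Heat (1995)"]})
toy_feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
toy_model = NearestNeighbors().fit(toy_feats)

result = recommend("Casino", toy_frame, toy_feats, toy_model, k=2)
print(result["title"].tolist())  # ['Heat (1995)', 'Toy Story 2 (1999)']
```

Slicing with `feats[pos:pos + 1]` keeps the query two-dimensional, so the same helper should work for both dense arrays and sparse matrices.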