<a href="https://colab.research.google.com/github/Zernach/BetterReads-DS/blob/master/hybrid_sample.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet

!pip install scikit-surprise

from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate

import warnings; warnings.simplefilter('ignore')

In [0]:
!curl --output movieLens.zip http://files.grouplens.org/datasets/movielens/ml-latest-small.zip

In [0]:
!unzip movieLens.zip

## Content Based Recommender 
This is the NLP part. It uses only scikit-learn's `TfidfVectorizer` to vectorize each movie's genre string, so the signal is limited: two movies count as similar only when their genre tags overlap.

In [0]:
smd = pd.read_csv('data/movies.csv')
smd.head()

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [0]:
smd.shape

(9742, 3)

In [0]:
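# Replace the pipe separators with spaces so tfidf can tokenize each genre;
# regex=False treats '|' as a literal character rather than regex alternation
smd['genres'] = smd['genres'].str.replace('|', ' ', regex=False)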
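
The separator swap can be checked on a couple of rows in isolation. This is a minimal sketch with made-up rows (`sample` is a hypothetical Series, not the MovieLens file). Passing `regex=False` makes the literal reading of `'|'` explicit, since the default for `regex` has differed between pandas versions.

```python
import pandas as pd

# Hypothetical sample rows in the same pipe-delimited format as movies.csv
sample = pd.Series(["Adventure|Children|Fantasy", "Comedy"])

# regex=False replaces the literal '|' character with a space
cleaned = sample.str.replace("|", " ", regex=False)
print(cleaned.tolist())
```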

In [0]:
smd.isnull().sum()

movieId    0
title      0
genres     0
dtype: int64

### Description Based 



In [0]:
# Rename the "genres" column to "description"
smd = smd.rename(columns={'genres': 'description'})
smd.head()

,movieId,title,description
0,1,Toy Story (1995),Adventure Animation Children Comedy Fantasy
1,2,Jumanji (1995),Adventure Children Fantasy
2,3,Grumpier Old Men (1995),Comedy Romance
3,4,Waiting to Exhale (1995),Comedy Drama Romance
4,5,Father of the Bride Part II (1995),Comedy


In [0]:
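# Create tfidf matrix from unigrams and bigrams of the genre words.
# min_df=1 (the default) keeps every term; recent scikit-learn releases reject an integer min_df of 0
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')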
tfidf_matrix = tf.fit_transform(smd['description'])

In [0]:
tfidf_matrix.shape

(9742, 177)

#### Cosine Similarity


In [0]:
# TfidfVectorizer L2-normalizes each row by default, so the plain dot
# product computed by linear_kernel is already the cosine similarity
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [0]:
cosine_sim[0]

array([1.        , 0.31379419, 0.0611029 , ..., 0.        , 0.16123168,
       0.16761358])
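
The `linear_kernel` shortcut relies on the TF-IDF rows being unit length. A quick check on toy strings, assuming the default `norm='l2'` setting of `TfidfVectorizer`, suggests that `linear_kernel` and `cosine_similarity` agree:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Toy genre strings, not the full dataset
toy_docs = ["Action Adventure Sci-Fi Thriller", "Action Sci-Fi", "Comedy Romance"]
toy_matrix = TfidfVectorizer().fit_transform(toy_docs)

dot = linear_kernel(toy_matrix, toy_matrix)      # plain dot products
cos = cosine_similarity(toy_matrix, toy_matrix)  # normalizes, then dot products

same = np.allclose(dot, cos)
print(same)
print(np.round(dot, 3))
```

Documents with no words in common, such as the first and third strings, score 0.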

We now have a pairwise cosine similarity matrix for all the movies in our dataset. The next step is to write a function that returns the 30 most similar movies based on the cosine similarity score.

In [0]:
# Create an index map for movies so we can access them by
# title later
smdcop = smd.copy().reset_index()
titles = smdcop['title']
indices = pd.Series(smd.index, index=smdcop['title'])

In [0]:
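def get_recommendations(title):
    """Return the 30 titles most similar to `title` by cosine similarity"""
    idx = indices[title]
    # pair every movie's row position with its similarity to the query
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # skip the first entry, which is normally the query movie itself (score 1.0)
    sim_scores = sim_scores[1:31]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]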
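
One caveat with slicing off the first sorted entry: when several movies have identical genre strings, they all score 1.0 against each other, and the query movie is not guaranteed to sort first. The sketch below, on a toy frame with hypothetical titles, drops the query by position instead. `top_similar` and `toy` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical movies; "A" and "B" share exactly the same genres
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "description": ["Comedy Romance", "Comedy Romance", "Comedy", "Horror"],
})
sim = linear_kernel(*[TfidfVectorizer().fit_transform(toy["description"])] * 2)

def top_similar(title, k=2):
    """Return the k most similar titles, excluding the query row itself."""
    idx = toy.index[toy["title"] == title][0]
    order = np.argsort(-sim[idx], kind="stable")  # highest similarity first
    order = order[order != idx][:k]               # drop the query by position
    return toy["title"].iloc[order]

print(list(top_similar("B")))
```

With the slice-based approach, querying "B" would drop "A" (which ties at 1.0 and sorts first) and keep "B" in its own results.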

A basic example: recommendations based only on the simple genre tags (the "description" column).

In [0]:
get_recommendations('Jurassic Park (1993)').head(10)

615       Independence Day (a.k.a. ID4) (1996)
656                    Escape from L.A. (1996)
856                          Abyss, The (1989)
858                Escape from New York (1981)
1044           Star Trek: First Contact (1996)
1057    Star Trek II: The Wrath of Khan (1982)
1164     Lost World: Jurassic Park, The (1997)
1194                              Spawn (1997)
2193                       Total Recall (1990)
2712                          Moonraker (1979)
Name: title, dtype: object

In [0]:
get_recommendations('Aladdin (1992)').head(10)

578                            Oliver & Company (1988)
1177                                   Hercules (1997)
2287                                 Robin Hood (1973)
1543                           Jungle Book, The (1967)
1564                           Steamboat Willie (1928)
3727             Ferngully: The Last Rainforest (1992)
5232        Claymation Christmas Celebration, A (1987)
7216    Alvin and the Chipmunks: The Squeakquel (2009)
8275               All Dogs Christmas Carol, An (1998)
5160                                    Shrek 2 (2004)
Name: title, dtype: object

## Collaborative Filtering
This part uses the Surprise library (`scikit-surprise`) for off-the-shelf collaborative filtering with its `SVD` (singular value decomposition) algorithm.
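
For intuition about what the model learns: Surprise documents its `SVD` algorithm as the matrix-factorization approach popularized by Simon Funk during the Netflix Prize. A rating is estimated as the global mean plus a user bias, an item bias, and the dot product of a user factor vector and an item factor vector, all fitted by stochastic gradient descent. The NumPy sketch below illustrates that idea on six made-up ratings. It is not Surprise's implementation, and the hyperparameters (`lr`, `reg`, `n_factors`, the epoch count) are arbitrary choices for the toy data.

```python
import numpy as np

# Made-up (user, item, rating) triples, not MovieLens data
toy_ratings = [(0, 0, 4.0), (0, 1, 5.0), (1, 0, 3.0),
               (1, 2, 2.0), (2, 1, 4.5), (2, 2, 1.5)]
n_users, n_items, n_factors = 3, 3, 2

rng = np.random.default_rng(0)
mu = np.mean([r for _, _, r in toy_ratings])   # global mean rating
b_u = np.zeros(n_users)                        # user biases
b_i = np.zeros(n_items)                        # item biases
P = rng.normal(0, 0.1, (n_users, n_factors))   # user factors
Q = rng.normal(0, 0.1, (n_items, n_factors))   # item factors

def predict(u, i):
    """Estimated rating: mean + user bias + item bias + factor dot product."""
    return mu + b_u[u] + b_i[i] + Q[i] @ P[u]

def rmse():
    return np.sqrt(np.mean([(r - predict(u, i)) ** 2 for u, i, r in toy_ratings]))

lr, reg = 0.01, 0.02   # illustrative learning rate and regularization
rmse_before = rmse()
for epoch in range(200):
    for u, i, r in toy_ratings:
        err = r - predict(u, i)
        b_u[u] += lr * (err - reg * b_u[u])
        b_i[i] += lr * (err - reg * b_i[i])
        p_old = P[u].copy()                    # use the pre-update user factors
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * p_old - reg * Q[i])
rmse_after = rmse()

print(rmse_before, rmse_after)
```

Nothing in this model looks at titles or genres; it only sees user IDs, item IDs, and ratings.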


In [0]:
# MovieLens ratings run from 0.5 to 5.0; Reader() alone assumes a 1-5 scale
reader = Reader(rating_scale=(0.5, 5.0))

In [0]:
ratings = pd.read_csv('data/ratings.csv')
ratings.head()

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [0]:
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

In [0]:
svd = SVD()

# Cross validate using SVD for evaluation
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8778  0.8793  0.8716  0.8638  0.8740  0.8733  0.0055  
MAE (testset)     0.6738  0.6763  0.6708  0.6632  0.6725  0.6713  0.0044  
Fit time          4.63    4.59    4.64    4.58    4.61    4.61    0.02    
Test time         0.14    0.14    0.14    0.24    0.14    0.16    0.04    


{'fit_time': (4.627185821533203,
  4.593142509460449,
  4.643988370895386,
  4.576023101806641,
  4.614572525024414),
 'test_mae': array([0.67381364, 0.6762578 , 0.67076808, 0.66315986, 0.67248728]),
 'test_rmse': array([0.87778543, 0.8792761 , 0.87156758, 0.86379673, 0.8739533 ]),
 'test_time': (0.13834142684936523,
  0.13794636726379395,
  0.13556718826293945,
  0.23809576034545898,
  0.13572955131530762)}
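
The dictionary returned by `cross_validate` holds one value per fold, so the Mean and Std columns of the printed table can be recomputed directly. The sketch below copies the `test_rmse` values from the output above rather than re-running Surprise; `fold_rmse` is an illustrative name.

```python
import numpy as np

# Per-fold RMSE values copied from the 'test_rmse' entry printed above
fold_rmse = np.array([0.87778543, 0.8792761, 0.87156758, 0.86379673, 0.8739533])

mean_rmse = float(fold_rmse.mean())
std_rmse = float(fold_rmse.std())   # population standard deviation (ddof=0)

print(round(mean_rmse, 4), round(std_rmse, 4))
```

The rounded values line up with the Mean and Std entries of the RMSE row, which indicates the table reports the population standard deviation across folds.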

In [0]:
# build a training set from all of the ratings loaded above
# (no hold-out split)
trainset = data.build_full_trainset()

In [0]:
# fit the model on the full training set so it can make predictions later
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fbfbd67f0f0>

Predicted ratings for one particular user:



In [0]:
ratings[ratings['userId'] == 1]

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
227,1,3744,4.0,964980694
228,1,3793,5.0,964981855
229,1,3809,4.0,964981220
230,1,4006,4.0,964982903


In [0]:
uid, iid, r_ui, est, details = svd.predict(1, 302)

print("For user #{0}, the predicted rating for item {1} is {2}"
      .format(uid, iid, round(est, 2))
)

For user #1, the predicted rating for item 302 is 4.15


For the movie with ID 302, the model predicts a rating of **4.15** for user #1. One striking feature of this recommender system is that it doesn't care what the movie is (or what it contains). It works purely on the basis of an assigned movie ID and predicts ratings based on how other users have rated the movie.

## Hybrid Recommender

![](https://www.toonpool.com/user/250/files/hybrid_20095.jpg)

In this section, I will try to build a simple hybrid recommender that combines the techniques implemented in the content-based and collaborative-filtering engines. This is how it will work:

* **Input:** User ID and the Title of a Movie
* **Output:** Similar movies sorted on the basis of expected ratings by that particular user.
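
The input/output contract above amounts to two steps: pick candidates by content similarity, then reorder them by the predicted rating for the given user. The sketch below shows that flow on toy data. `predict_rating` is a hypothetical stand-in for `svd.predict(userId, movieId).est`, and the similarity row and estimates are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical catalogue and one row of a similarity matrix (query is "A")
movies = pd.DataFrame({"id": [10, 20, 30, 40], "title": ["A", "B", "C", "D"]})
sim_to_query = np.array([1.0, 0.9, 0.2, 0.8])

# Made-up predicted ratings keyed by (user id, movie id)
toy_estimates = {(1, 20): 3.0, (1, 30): 4.8, (1, 40): 4.5}

def predict_rating(user_id, movie_id):
    """Hypothetical stand-in for the fitted collaborative-filtering model."""
    return toy_estimates.get((user_id, movie_id), 3.5)

def hybrid_rerank(user_id, query_pos, k=2):
    # Step 1: the k most similar movies, excluding the query itself
    order = np.argsort(-sim_to_query, kind="stable")
    candidates = order[order != query_pos][:k]
    # Step 2: sort those candidates by the user's predicted rating
    out = movies.iloc[candidates].copy()
    out["est"] = [predict_rating(user_id, m) for m in out["id"]]
    return out.sort_values("est", ascending=False)

result = hybrid_rerank(user_id=1, query_pos=0)
print(result)
```

Note that "C" has the highest made-up estimate but never appears, because it is filtered out in the similarity step before any rating is predicted.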

In [0]:
# use 'id' in place of 'movieId' (this was to help me understand the code)
smd['id'] = smd['movieId']

In [0]:
# didn't really end up using tmdbId
links = pd.read_csv('data/links.csv')
links['tmdbId'] = links['tmdbId'].fillna(1000000).astype(int)
links.head()

,movieId,imdbId,tmdbId
0,1,114709,862
1,2,113497,8844
2,3,113228,15602
3,4,114885,31357
4,5,113041,11862


In [0]:
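# run this cell just in case, but I don't think it's necessary (hybrid() below doesn't use it).
# `convert_int` is never defined in this notebook, so pd.to_numeric is used as a
# stand-in, assuming the helper turned unparseable values into NaN
id_map = pd.read_csv('data/links.csv')[['movieId', 'tmdbId']]
id_map['tmdbId'] = pd.to_numeric(id_map['tmdbId'], errors='coerce')
id_map.columns = ['movieId', 'id']
# note: smd['id'] holds movieId values, so this merge pairs tmdbId with movieId
id_map = id_map.merge(smd[['title', 'id']], on='id').set_index('title')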

In [0]:
# run this cell just in case, but I don't think it's necessary
indices_map = id_map.set_index('id')

In [0]:
# the last two cells just map the index to the movie id;
# the small MovieLens dataset has a lot of movies removed (to make
# the file smaller)
indices_map

id,movieId
8844,2
949,6
710,10
1408,15
524,16
...,...
44238,158398
5434,159510
26797,169912
129229,170597


In [0]:
def hybrid(userId, title):
    """
    Take the content-based recommendations for `title` and
    sort them by the svd-predicted rating for `userId`,
    from highest down to lowest
    """
    idx = indices[title]
    
    # rank every movie by cosine similarity to "title" and keep the
    # 25 most similar, skipping the first entry (normally "title" itself)
    scores = list(enumerate(cosine_sim[int(idx)]))
    scores = sorted(scores, key=lambda x: x[1], reverse=True)
    scores = scores[1:26]
    m_indices = [i[0] for i in scores]

    # takes the rows of the most similar movies, uses the fitted
    # svd model to predict each rating, and sorts to send the
    # highest to the top (.copy() avoids assigning into a slice of smd)
    mov = smd.iloc[m_indices][['title', 'id']].copy()
    mov['est'] = mov['id'].apply(lambda x: svd.predict(userId, x).est)
    mov = mov.sort_values('est', ascending=False)

    return mov.head(10)

In [0]:
indices['TMNT (Teenage Mutant Ninja Turtles) (2007)']

6448
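
A title lookup like the one above returns a single integer only when the title is unique. If the same title appears on more than one row, indexing the Series returns several positions, and the `int(idx)` call in `hybrid` would fail. The sketch below uses hypothetical titles to show the difference and one possible way to pick the first match; `first_position` is an illustrative helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical titles; "Remake (2005)" appears on two rows
toy_titles = pd.Series(["Remake (2005)", "Original (1953)", "Remake (2005)"])
toy_indices = pd.Series(toy_titles.index, index=toy_titles)

unique_hit = toy_indices["Original (1953)"]    # a single integer
duplicate_hit = toy_indices["Remake (2005)"]   # a Series with two positions

def first_position(title):
    """Return the first row position for a title, unique or not."""
    return int(np.atleast_1d(toy_indices[title])[0])

print(unique_hit, len(duplicate_hit), first_position("Remake (2005)"))
```

Taking the first match is only one option; de-duplicating the titles before building the index would be another.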

### Results of Hybrid Model

For "Jurassic Park (1993)", users #2 and #3 get different recommendations.

In [0]:
hybrid(2, 'Jurassic Park (1993)')

,title,id,est
8681,Mad Max: Fury Road (2015),122882,4.141148
1044,Star Trek: First Contact (1996),1356,4.053966
2765,"Road Warrior, The (Mad Max 2) (1981)",3703,4.009604
858,Escape from New York (1981),1129,3.881543
856,"Abyss, The (1989)",1127,3.833315
615,Independence Day (a.k.a. ID4) (1996),780,3.81758
2193,Total Recall (1990),2916,3.780283
1057,Star Trek II: The Wrath of Khan (1982),1374,3.759745
4965,You Only Live Twice (1967),7569,3.724447
3819,Spider-Man (2002),5349,3.627381


In [0]:
hybrid(3, 'Jurassic Park (1993)')

,title,id,est
2765,"Road Warrior, The (Mad Max 2) (1981)",3703,3.843435
1044,Star Trek: First Contact (1996),1356,3.090902
656,Escape from L.A. (1996),849,3.0328
4965,You Only Live Twice (1967),7569,2.918548
4334,X2: X-Men United (2003),6333,2.851717
5265,"I, Robot (2004)",8644,2.82619
1057,Star Trek II: The Wrath of Khan (1982),1374,2.816335
858,Escape from New York (1981),1129,2.762906
5931,War of the Worlds (2005),34048,2.676813
856,"Abyss, The (1989)",1127,2.668423


In [0]:
hybrid(2, 'Aladdin (1992)')

,title,id,est
4360,Finding Nemo (2003),6377,4.376036
8275,"All Dogs Christmas Carol, An (1998)",105540,3.828176
6015,Wallace & Gromit in The Curse of the Were-Rabb...,38038,3.807093
2287,Robin Hood (1973),3034,3.780194
1543,"Jungle Book, The (1967)",2078,3.747368
1177,Hercules (1997),1566,3.708163
5666,"Lion King 1½, The (2004)",27619,3.690446
3745,Ice Age (2002),5218,3.671226
6205,Over the Hedge (2006),45431,3.642699
6696,Horton Hears a Who! (2008),58299,3.575727


In [0]:
hybrid(45, 'Aladdin (1992)')

,title,id,est
4360,Finding Nemo (2003),6377,4.399013
1543,"Jungle Book, The (1967)",2078,4.390593
6015,Wallace & Gromit in The Curse of the Were-Rabb...,38038,4.301428
1757,"Bug's Life, A (1998)",2355,4.271374
5666,"Lion King 1½, The (2004)",27619,4.258307
5160,Shrek 2 (2004),8360,4.167289
3745,Ice Age (2002),5218,4.028209
6696,Horton Hears a Who! (2008),58299,3.996429
3727,Ferngully: The Last Rainforest (1992),5159,3.995498
5901,Madagascar (2005),33615,3.98281


We see that for our hybrid recommender, we get different recommendations for different users although the movie is the same. Hence, our recommendations are more personalized and tailored towards particular users.

In [0]:
smd.head()

,movieId,title,genres,description,id
0,1,Toy Story (1995),Adventure Animation Children Comedy Fantasy,Adventure Animation Children Comedy Fantasy,1
1,2,Jumanji (1995),Adventure Children Fantasy,Adventure Children Fantasy,2
2,3,Grumpier Old Men (1995),Comedy Romance,Comedy Romance,3
3,4,Waiting to Exhale (1995),Comedy Drama Romance,Comedy Drama Romance,4
4,5,Father of the Bride Part II (1995),Comedy,Comedy,5
