# Recommendation System
---

Recommend similar items to users based on their past preferences.

Used widely in e-commerce, social media, news, etc.

Types:
- Content-based (Producer, Author, Genre, etc.)
- Collaborative filtering (Demographic, Item-based, User-based)


## Content-based

Recommend items similar to those a particular user liked in the past. The characteristics of the items (for movies: genre, director, and so on) are used to find similar ones.


Using:
- A similarity measure, which scores how alike two items are based on their content features:
    - Euclidean distance
    - Cosine similarity*
    - Pearson correlation
    - Spearman correlation
    - Jaccard similarity/distance*

\* Measures applied later in this notebook.
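
As a small illustration of two of the measures listed above, here is a minimal pure-Python sketch. The helper names `jaccard_pair` and `cosine_pair` and the example vectors are assumptions made for this example, not part of the notebook.

```python
import math

def jaccard_pair(a, b):
    """Jaccard similarity of two equal-length 0/1 vectors."""
    both = sum(1 for x, y in zip(a, b) if x and y)     # flags set in both
    either = sum(1 for x, y in zip(a, b) if x or y)    # flags set in at least one
    return both / either if either else 0.0

def cosine_pair(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical genre flags: [Adventure, Animation, Children, Comedy, Fantasy]
item_a = [1, 1, 1, 1, 1]
item_b = [1, 0, 1, 0, 1]

jaccard_ab = jaccard_pair(item_a, item_b)   # 3 shared flags out of 5 distinct
cosine_ab = cosine_pair(item_a, item_b)     # 3 / (sqrt(5) * sqrt(3))
print(jaccard_ab, round(cosine_ab, 4))
```

The two measures generally give different values for the same pair, so rankings built on them can differ.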

In [56]:
## EDA Standard Library

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.stats as ss

In [74]:
#Load Data

movies = pd.read_csv('/Users/Dwika/My Projects/DATASETS/Movie Ratings Data/movies.csv')
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [54]:
#split genre into a new list column (keeps the original 'genres' string)

movies['genre_splitted'] = movies['genres'].apply(lambda x: x.split('|'))

In [55]:
movies

Unnamed: 0,movieId,title,genres,genre
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,"[Adventure, Animation, Children, Comedy, Fantasy]"
1,2,Jumanji (1995),Adventure|Children|Fantasy,"[Adventure, Children, Fantasy]"
2,3,Grumpier Old Men (1995),Comedy|Romance,"[Comedy, Romance]"
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,"[Comedy, Drama, Romance]"
4,5,Father of the Bride Part II (1995),Comedy,[Comedy]
...,...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,"[Action, Animation, Comedy, Fantasy]"
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,"[Animation, Comedy, Fantasy]"
9739,193585,Flint (2017),Drama,[Drama]
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,"[Action, Animation]"


In [45]:
movies['title'][0][-5:-1]

'1995'

In [46]:
#Extract year from title
movies['year'] = movies['title'].apply(lambda x: x[-5:-1])
movies.head()

Unnamed: 0,movieId,title,genres,genre_splitted,year
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,"[Adventure, Animation, Children, Comedy, Fantasy]",1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,"[Adventure, Children, Fantasy]",1995
2,3,Grumpier Old Men (1995),Comedy|Romance,"[Comedy, Romance]",1995
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,"[Comedy, Drama, Romance]",1995
4,5,Father of the Bride Part II (1995),Comedy,[Comedy],1995
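
The slice `x[-5:-1]` only works when the title ends exactly with `(YYYY)`. If any title has trailing whitespace or no year, it returns garbage. A more defensive sketch uses a regular expression; the helper name `extract_year` and the sample titles are assumptions for illustration.

```python
import re

# Four digits inside parentheses at the very end of the title (trailing spaces allowed)
YEAR_AT_END = re.compile(r"\((\d{4})\)\s*$")

def extract_year(title):
    """Return the trailing 4-digit year as an int, or None if the title has none."""
    match = YEAR_AT_END.search(title)
    return int(match.group(1)) if match else None

print(extract_year("Toy Story (1995)"))    # year found
print(extract_year("Toy Story (1995) "))   # trailing space is tolerated
print(extract_year("Untitled Movie"))      # hypothetical title without a year
```

It could be applied with `movies['title'].apply(extract_year)`. Titles without a year would then come back as missing values instead of arbitrary characters.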


In [47]:
movies.sample(10)

Unnamed: 0,movieId,title,genres,genre_splitted,year
5246,8604,Taxi (1998),Action|Comedy,"[Action, Comedy]",1998
2297,3044,Dead Again (1991),Mystery|Romance|Thriller,"[Mystery, Romance, Thriller]",1991
2910,3900,Crime and Punishment in Suburbia (2000),Comedy|Drama,"[Comedy, Drama]",2000
5111,8136,Indestructible Man (1956),Crime|Horror|Sci-Fi,"[Crime, Horror, Sci-Fi]",1956
9498,170777,There Once Was a Dog (1982),Animation|Children|Comedy,"[Animation, Children, Comedy]",1982
7086,70159,Orphan (2009),Drama|Horror|Mystery|Thriller,"[Drama, Horror, Mystery, Thriller]",2009
8650,120635,Taken 3 (2015),Action|Crime|Thriller,"[Action, Crime, Thriller]",2015
7188,72407,"Twilight Saga: New Moon, The (2009)",Drama|Fantasy|Horror|Romance|Thriller,"[Drama, Fantasy, Horror, Romance, Thriller]",2009
9051,141844,12 Chairs (1971),Adventure|Comedy,"[Adventure, Comedy]",1971
9528,172215,Saved by the Bell: Hawaiian Style (1992),Comedy,[Comedy],1992


In [11]:
#Load Data

ratings = pd.read_csv('/Users/Dwika/My Projects/DATASETS/Movie Ratings Data/ratings.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,5,1,4.0,847434962
2,7,1,4.5,1106635946
3,15,1,2.5,1510577970
4,17,1,4.5,1305696483


In [42]:
#Number of unique movies
ratings['movieId'].nunique()

9724

In [20]:
#Group by movieId
ratings.groupby('movieId')['rating'].mean()

movieId
1         3.920930
2         3.431818
3         3.259615
4         2.357143
5         3.071429
            ...   
193581    4.000000
193583    3.500000
193585    3.500000
193587    3.500000
193609    4.000000
Name: rating, Length: 9724, dtype: float64

In [48]:
movies

Unnamed: 0,movieId,title,genres,genre_splitted,year
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,"[Adventure, Animation, Children, Comedy, Fantasy]",1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,"[Adventure, Children, Fantasy]",1995
2,3,Grumpier Old Men (1995),Comedy|Romance,"[Comedy, Romance]",1995
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,"[Comedy, Drama, Romance]",1995
4,5,Father of the Bride Part II (1995),Comedy,[Comedy],1995
...,...,...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,"[Action, Animation, Comedy, Fantasy]",2017
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,"[Animation, Comedy, Fantasy]",2017
9739,193585,Flint (2017),Drama,[Drama],2017
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,"[Action, Animation]",2018


## Content-based Filtering

In [75]:
movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


In [76]:
movies_exp = movies.copy()


In [77]:
#split genre into list
movies_exp['genres'] = movies_exp['genres'].apply(lambda x: x.split('|'))

In [78]:
#Explode genre
movies_exp = movies_exp.explode('genres', ignore_index=True) #Breaks down the list into rows

In [79]:
#Extract year from title
movies_exp['year'] = movies_exp['title'].apply(lambda x: x[-5:-1])
movies_exp.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story (1995),Adventure,1995
1,1,Toy Story (1995),Animation,1995
2,1,Toy Story (1995),Children,1995
3,1,Toy Story (1995),Comedy,1995
4,1,Toy Story (1995),Fantasy,1995


In [80]:
movies_exp

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story (1995),Adventure,1995
1,1,Toy Story (1995),Animation,1995
2,1,Toy Story (1995),Children,1995
3,1,Toy Story (1995),Comedy,1995
4,1,Toy Story (1995),Fantasy,1995
...,...,...,...,...
22079,193583,No Game No Life: Zero (2017),Fantasy,2017
22080,193585,Flint (2017),Drama,2017
22081,193587,Bungo Stray Dogs: Dead Apple (2018),Action,2018
22082,193587,Bungo Stray Dogs: Dead Apple (2018),Animation,2018


In [81]:
#Check on genre
movies_exp['genres'].value_counts()

genres
Drama                 4361
Comedy                3756
Thriller              1894
Action                1828
Romance               1596
Adventure             1263
Crime                 1199
Sci-Fi                 980
Horror                 978
Fantasy                779
Children               664
Animation              611
Mystery                573
Documentary            440
War                    382
Musical                334
Western                167
IMAX                   158
Film-Noir               87
(no genres listed)      34
Name: count, dtype: int64

In [82]:
#Drop no genre listed
movies_exp = movies_exp[movies_exp['genres'] != '(no genres listed)']
movies_exp['genres'].value_counts()

genres
Drama          4361
Comedy         3756
Thriller       1894
Action         1828
Romance        1596
Adventure      1263
Crime          1199
Sci-Fi          980
Horror          978
Fantasy         779
Children        664
Animation       611
Mystery         573
Documentary     440
War             382
Musical         334
Western         167
IMAX            158
Film-Noir        87
Name: count, dtype: int64

In [87]:
#Movies crosstab

movies_crosstab = pd.crosstab(movies_exp['title'], movies_exp['genres'])
movies_crosstab.sample(10)

title \ genres,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
Comme un chef (2012),0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
"Exterminating Angel, The (Ángel exterminador, El) (1962)",0,0,0,0,1,0,0,1,1,0,0,0,0,1,0,0,0,0,0
How to Deal (2003),0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0
Titan A.E. (2000),1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
Cocoon: The Return (1988),0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0
Cat People (1942),0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,1,0,0
You're Next (2011),0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0
Thor (2011),1,1,0,0,0,0,0,1,1,0,0,1,0,0,0,0,0,0,0
A Bad Moms Christmas (2017),0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Kiss Me Deadly (1955),0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0


In [106]:
movies_crosstab.shape

(9703, 19)
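
`pd.crosstab` counts occurrences, so two different movies that share a title would be merged into one row, with entries that can exceed 1. Whether that happens in this dataset is not checked here. As an alternative sketch, the genre flags can be built straight from the pipe-separated string and indexed by `movieId`. The three-row frame below is a hypothetical stand-in for `movies`.

```python
import pandas as pd

# Hypothetical rows in the same shape as movies.csv
toy = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# One 0/1 column per genre, one row per movieId
genre_flags = toy.set_index("movieId")["genres"].str.get_dummies(sep="|")
print(genre_flags)
```

Indexing by `movieId` keeps one row per movie. The title can be joined back afterwards for display.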

> The genre matrix is binary (0/1), so **Jaccard similarity**, which compares the sets of genres two movies have in common, is a natural fit.
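
For reference, the Jaccard similarity of two genre sets $A$ and $B$ is the size of their intersection divided by the size of their union. As a worked case from the first two rows of `movies`: Toy Story and Jumanji share Adventure, Children and Fantasy, out of five distinct genres between them.

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad
J(\text{Toy Story}, \text{Jumanji}) = \frac{3}{5} = 0.6, \qquad
d_J(A, B) = 1 - J(A, B)
```

Here $d_J$ is the Jaccard distance, which is what `pdist(..., metric='jaccard')` returns.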

### Similarity Measure

In [92]:
#Sample check

toystory = movies_crosstab.loc['Toy Story (1995)']
jumanji = movies_crosstab.loc['Jumanji (1995)']
skyfall= movies_crosstab.loc['Skyfall (2012)']
display(toystory)
toystory.values

genres
Action         0
Adventure      1
Animation      1
Children       1
Comedy         1
Crime          0
Documentary    0
Drama          0
Fantasy        1
Film-Noir      0
Horror         0
IMAX           0
Musical        0
Mystery        0
Romance        0
Sci-Fi         0
Thriller       0
War            0
Western        0
Name: Toy Story (1995), dtype: int64

array([0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [96]:
#Jaccard Similarity for Toy Story, Skyfall and Jumanji

from sklearn.metrics import jaccard_score

print(jaccard_score(toystory, jumanji))
print(jaccard_score(toystory, skyfall))
print(jaccard_score(jumanji, skyfall))

0.6
0.125
0.16666666666666666


In [100]:
#Create jaccard matrix for all movies

from scipy.spatial.distance import pdist, squareform


jaccard_dist = pdist(movies_crosstab.values, metric='jaccard') #create pairwise distance array
pd.DataFrame(jaccard_dist)

Unnamed: 0,0
0,0.875000
1,0.800000
2,0.666667
3,0.800000
4,1.000000
...,...
47069248,1.000000
47069249,1.000000
47069250,1.000000
47069251,1.000000


In [103]:
#Matrix form of Jaccard distance
jaccard_matrix = squareform(jaccard_dist) #convert to matrix
pd.DataFrame(jaccard_matrix)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,9693,9694,9695,9696,9697,9698,9699,9700,9701,9702
0,0.000000,0.875000,0.800000,0.666667,0.800000,1.0,1.0,0.75,0.833333,1.000000,...,0.60,0.60,0.80,0.80,0.800000,0.600000,0.600000,0.600000,1.000000,1.000000
1,0.875000,0.000000,1.000000,1.000000,1.000000,1.0,0.8,1.00,0.857143,0.714286,...,1.00,1.00,1.00,1.00,1.000000,0.857143,0.857143,0.857143,0.833333,0.833333
2,0.800000,1.000000,0.000000,0.800000,0.666667,1.0,1.0,0.50,0.750000,1.000000,...,0.75,0.75,1.00,1.00,0.666667,1.000000,1.000000,1.000000,1.000000,0.666667
3,0.666667,1.000000,0.800000,0.000000,0.800000,1.0,1.0,0.75,0.833333,1.000000,...,0.60,0.25,0.50,0.50,0.800000,0.833333,0.833333,0.833333,1.000000,1.000000
4,0.800000,1.000000,0.666667,0.800000,0.000000,0.5,1.0,0.50,0.333333,1.000000,...,0.75,0.75,1.00,1.00,0.666667,1.000000,1.000000,1.000000,1.000000,1.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9698,0.600000,0.857143,1.000000,0.833333,1.000000,1.0,1.0,1.00,1.000000,0.833333,...,0.80,0.80,0.75,0.75,1.000000,0.000000,0.500000,0.500000,1.000000,1.000000
9699,0.600000,0.857143,1.000000,0.833333,1.000000,1.0,1.0,1.00,1.000000,1.000000,...,0.50,0.80,0.75,0.75,1.000000,0.500000,0.000000,0.000000,1.000000,1.000000
9700,0.600000,0.857143,1.000000,0.833333,1.000000,1.0,1.0,1.00,1.000000,1.000000,...,0.50,0.80,0.75,0.75,1.000000,0.500000,0.000000,0.000000,1.000000,1.000000
9701,1.000000,0.833333,1.000000,1.000000,1.000000,1.0,0.5,1.00,0.750000,0.800000,...,1.00,1.00,1.00,1.00,1.000000,1.000000,1.000000,1.000000,0.000000,0.666667


In [105]:
#Jaccard Score Matrix
jaccard_sc = 1 - jaccard_matrix
pd.DataFrame(jaccard_sc)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,9693,9694,9695,9696,9697,9698,9699,9700,9701,9702
0,1.000000,0.125000,0.200000,0.333333,0.200000,0.0,0.0,0.25,0.166667,0.000000,...,0.40,0.40,0.20,0.20,0.200000,0.400000,0.400000,0.400000,0.000000,0.000000
1,0.125000,1.000000,0.000000,0.000000,0.000000,0.0,0.2,0.00,0.142857,0.285714,...,0.00,0.00,0.00,0.00,0.000000,0.142857,0.142857,0.142857,0.166667,0.166667
2,0.200000,0.000000,1.000000,0.200000,0.333333,0.0,0.0,0.50,0.250000,0.000000,...,0.25,0.25,0.00,0.00,0.333333,0.000000,0.000000,0.000000,0.000000,0.333333
3,0.333333,0.000000,0.200000,1.000000,0.200000,0.0,0.0,0.25,0.166667,0.000000,...,0.40,0.75,0.50,0.50,0.200000,0.166667,0.166667,0.166667,0.000000,0.000000
4,0.200000,0.000000,0.333333,0.200000,1.000000,0.5,0.0,0.50,0.666667,0.000000,...,0.25,0.25,0.00,0.00,0.333333,0.000000,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9698,0.400000,0.142857,0.000000,0.166667,0.000000,0.0,0.0,0.00,0.000000,0.166667,...,0.20,0.20,0.25,0.25,0.000000,1.000000,0.500000,0.500000,0.000000,0.000000
9699,0.400000,0.142857,0.000000,0.166667,0.000000,0.0,0.0,0.00,0.000000,0.000000,...,0.50,0.20,0.25,0.25,0.000000,0.500000,1.000000,1.000000,0.000000,0.000000
9700,0.400000,0.142857,0.000000,0.166667,0.000000,0.0,0.0,0.00,0.000000,0.000000,...,0.50,0.20,0.25,0.25,0.000000,0.500000,1.000000,1.000000,0.000000,0.000000
9701,0.000000,0.166667,0.000000,0.000000,0.000000,0.0,0.5,0.00,0.250000,0.200000,...,0.00,0.00,0.00,0.00,0.000000,0.000000,0.000000,0.000000,1.000000,0.333333
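
The same similarity matrix can also be computed directly with matrix products, without going through the condensed distance vector. The following is a sketch, not the notebook's method. The function name `jaccard_similarity_matrix` and the toy rows are assumptions, and any positive count is treated as a set flag.

```python
import numpy as np

def jaccard_similarity_matrix(matrix):
    """Pairwise Jaccard similarity between the rows of a 0/1 (or count) matrix."""
    binary = (np.asarray(matrix) > 0).astype(np.int64)   # any positive count -> 1
    intersection = binary @ binary.T                     # shared flags per row pair
    row_sums = binary.sum(axis=1)
    union = row_sums[:, None] + row_sums[None, :] - intersection
    # Pairs where neither row has any flag have union 0; their similarity is set to 0 here
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(union > 0, intersection / union, 0.0)

# Hypothetical flags: [Adventure, Animation, Children, Comedy, Fantasy, Romance]
toy_flags = np.array([
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 0, 1, 0],
    [0, 0, 0, 1, 0, 1],
])
toy_sim = jaccard_similarity_matrix(toy_flags)
print(toy_sim.round(3))
```

On the real data this would be called as `jaccard_similarity_matrix(movies_crosstab.values)`. Either way the full 9703 x 9703 float matrix takes roughly 0.75 GB, so memory is the main constraint.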


In [107]:
#Convert jaccard score and title to dataframe
jaccard_df = pd.DataFrame(jaccard_sc, index=movies_crosstab.index, columns=movies_crosstab.index)
jaccard_df

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
'71 (2014),1.000000,0.125000,0.200000,0.333333,0.200000,0.0,0.0,0.25,0.166667,0.000000,...,0.40,0.40,0.20,0.20,0.200000,0.400000,0.400000,0.400000,0.000000,0.000000
'Hellboy': The Seeds of Creation (2004),0.125000,1.000000,0.000000,0.000000,0.000000,0.0,0.2,0.00,0.142857,0.285714,...,0.00,0.00,0.00,0.00,0.000000,0.142857,0.142857,0.142857,0.166667,0.166667
'Round Midnight (1986),0.200000,0.000000,1.000000,0.200000,0.333333,0.0,0.0,0.50,0.250000,0.000000,...,0.25,0.25,0.00,0.00,0.333333,0.000000,0.000000,0.000000,0.000000,0.333333
'Salem's Lot (2004),0.333333,0.000000,0.200000,1.000000,0.200000,0.0,0.0,0.25,0.166667,0.000000,...,0.40,0.75,0.50,0.50,0.200000,0.166667,0.166667,0.166667,0.000000,0.000000
'Til There Was You (1997),0.200000,0.000000,0.333333,0.200000,1.000000,0.5,0.0,0.50,0.666667,0.000000,...,0.25,0.25,0.00,0.00,0.333333,0.000000,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
eXistenZ (1999),0.400000,0.142857,0.000000,0.166667,0.000000,0.0,0.0,0.00,0.000000,0.166667,...,0.20,0.20,0.25,0.25,0.000000,1.000000,0.500000,0.500000,0.000000,0.000000
xXx (2002),0.400000,0.142857,0.000000,0.166667,0.000000,0.0,0.0,0.00,0.000000,0.000000,...,0.50,0.20,0.25,0.25,0.000000,0.500000,1.000000,1.000000,0.000000,0.000000
xXx: State of the Union (2005),0.400000,0.142857,0.000000,0.166667,0.000000,0.0,0.0,0.00,0.000000,0.000000,...,0.50,0.20,0.25,0.25,0.000000,0.500000,1.000000,1.000000,0.000000,0.000000
¡Three Amigos! (1986),0.000000,0.166667,0.000000,0.000000,0.000000,0.0,0.5,0.00,0.250000,0.200000,...,0.00,0.00,0.00,0.00,0.000000,0.000000,0.000000,0.000000,1.000000,0.333333


### Building Recommendations

In [110]:
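#Create function to recommend movies

def recommend(movie):
    #Sort by similarity and skip position 0, assumed to be the movie itself; returns 9 titles.
    #With ties at 1.0 the movie itself may sort elsewhere and appear in the results.
    return jaccard_df[movie].sort_values(ascending=False)[1:10]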

recommend('Toy Story (1995)')

recommend('Jumanji (1995)')

title
Darby O'Gill and the Little People (1959)    1.0
The Cave of the Golden Rose (1991)           1.0
Santa Claus: The Movie (1985)                1.0
Jumanji (1995)                               1.0
Golden Compass, The (2007)                   1.0
Indian in the Cupboard, The (1995)           1.0
Seventh Son (2014)                           1.0
Pete's Dragon (2016)                         1.0
Gulliver's Travels (1996)                    1.0
Name: Jumanji (1995), dtype: float64
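
Note that `Jumanji (1995)` appears in its own recommendations above. Several movies tie at a similarity of 1.0, so the query title is not guaranteed to sort into position 0, which is the one position the `[1:10]` slice skips. A sketch that removes the query title explicitly follows; the name `recommend_top_n` and the toy matrix are assumptions for illustration.

```python
import pandas as pd

def recommend_top_n(similarity_df, title, n=10):
    """Top-n most similar titles, with the query title removed before ranking."""
    scores = similarity_df[title].drop(labels=title)
    return scores.sort_values(ascending=False).head(n)

# Hypothetical 3x3 similarity matrix in which "A" and "B" tie at 1.0
titles = ["A", "B", "C"]
toy_sim_df = pd.DataFrame(
    [[1.0, 1.0, 0.2],
     [1.0, 1.0, 0.2],
     [0.2, 0.2, 1.0]],
    index=titles, columns=titles,
)
top = recommend_top_n(toy_sim_df, "A", n=2)
print(top)
```

With the notebook's data this would be called as `recommend_top_n(jaccard_df, 'Jumanji (1995)')`. Because many movies share exactly the same genre set, ties remain common, and a secondary criterion such as average rating could be used to break them.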

In [121]:
#Merge with movie rating

mov_rating = pd.merge(ratings, movies_exp, on='movieId', how='inner')
mov_rating

Unnamed: 0,userId,movieId,rating,timestamp,title,genres,year
0,1,1,4.0,964982703,Toy Story (1995),Adventure,1995
1,1,1,4.0,964982703,Toy Story (1995),Animation,1995
2,1,1,4.0,964982703,Toy Story (1995),Children,1995
3,1,1,4.0,964982703,Toy Story (1995),Comedy,1995
4,1,1,4.0,964982703,Toy Story (1995),Fantasy,1995
...,...,...,...,...,...,...,...
274428,610,160836,3.0,1493844794,Hazard (2005),Drama,2005
274429,610,160836,3.0,1493844794,Hazard (2005),Thriller,2005
274430,610,163937,3.5,1493848789,Blair Witch (2016),Horror,2016
274431,610,163937,3.5,1493848789,Blair Witch (2016),Thriller,2016


--- 

## Collaborative Filtering


Recommendations based on the preferences of similar users.

How to measure similarity between users?
- Observing items rated by both users
- Similarity measure:
    - Euclidean distance
    - Cosine similarity
    - Pearson correlation
    - Spearman correlation
    - Jaccard similarity/distance


Methods:
- Memory-based
    - User-based: find users with similar rating patterns, then recommend items those users liked
    - Item-based: find items that the same users rate similarly, then recommend items similar to the ones a user already liked
- Model-based
    - Matrix factorization (e.g. SVD, ALS)
    - Deep learning
    - WLS
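
To give a feel for the model-based family, here is a minimal matrix factorization sketch using a truncated SVD in NumPy. It is illustrative only. The matrix values are invented, and treating unrated cells as 0 is a simplification, since methods such as ALS typically fit only the observed ratings.

```python
import numpy as np

# Hypothetical mean-centered user x movie matrix (0 = unrated); values invented for illustration
R = np.array([
    [ 1.0,  0.5,  0.0, -1.5],
    [ 0.5,  1.0, -1.5,  0.0],
    [-1.0,  0.0,  1.0,  0.0],
    [ 0.0, -1.0,  0.5,  0.5],
])

# Factorize R = U * diag(s) * Vt, then keep only the k largest singular values
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
R_hat = (U[:, :k] * s[:k]) @ Vt[:k, :]     # rank-k reconstruction of R

approx_error = np.linalg.norm(R - R_hat)   # Frobenius norm of the discarded part
print(R_hat.round(2))
print(round(float(approx_error), 4))
```

The entries of `R_hat` in positions that were 0 in `R` can be read as predicted (centered) ratings. `k` controls how much structure is kept.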

In [122]:
mov_rating['title'].value_counts()

title
Forrest Gump (1994)                                     1316
Pulp Fiction (1994)                                     1228
Toy Story (1995)                                        1075
Lion King, The (1994)                                   1032
Shrek (2001)                                            1020
                                                        ... 
Mackenna's Gold (1969)                                     1
Prime of Miss Jean Brodie, The (1969)                      1
Unprecedented: The 2000 Presidential Election (2002)       1
Mr. Blandings Builds His Dream House (1948)                1
31 (2016)                                                  1
Name: count, Length: 9685, dtype: int64

In [124]:
#Pivot user x Movie
user_rating_pivot = pd.pivot_table(mov_rating, index='userId', columns='title', values='rating')
user_rating_pivot

userId \ title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
1,,,,,,,,,,,...,,,,,,,,,4.0,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,,,,,,,,,,,...,,,,,,,,,,
607,,,,,,,,,,,...,,,,,,,,,,
608,,,,,,,,,,,...,,,,,,4.5,3.5,,,
609,,,,,,,,,,,...,,,,,,,,,,


> Most cells are NaN because each user rates only a small share of the movies. Missing values can be handled with **mean imputation** or **k-NN imputation**.

In [125]:
#Average rating per user (used to mean-center the ratings in the next step)

avg_rating = user_rating_pivot.mean(axis=1)
avg_rating

userId
1      4.366379
2      3.948276
3      2.435897
4      3.555556
5      3.636364
         ...   
606    3.657399
607    3.786096
608    3.134176
609    3.270270
610    3.688556
Length: 610, dtype: float64

In [127]:
#Mean-center each user's ratings, then fill NaN with 0

#Subtract each user's mean rating; unrated movies become 0, which is the user's average on the centered scale
rating_norm_pivot = user_rating_pivot.sub(avg_rating, axis=0).fillna(0) #each user's rated entries now average 0 (not rescaled to std = 1)
rating_norm_pivot

userId \ title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,-0.366379,0.0
2,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0
3,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0
4,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0
5,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0
607,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0
608,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,1.365824,0.365824,0.000000,0.000000,0.0
609,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0
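
Subtracting the user mean and then filling NaN with 0 gives the same matrix as imputing each missing rating with that user's mean and then centering. The toy example below sketches this equivalence; the array values are invented for illustration. Centering shifts the ratings but does not rescale them, so the per-user standard deviation is generally not 1.

```python
import numpy as np

# Hypothetical 2 users x 3 movies; np.nan marks an unrated movie
toy_ratings = np.array([
    [4.0, np.nan, 2.0],
    [np.nan, 5.0, 3.0],
])

user_mean = np.nanmean(toy_ratings, axis=1, keepdims=True)   # mean over rated movies only

# Route 1: center first, then replace NaN with 0 (what the cell above does)
centered = np.nan_to_num(toy_ratings - user_mean, nan=0.0)

# Route 2: impute each NaN with the user's mean, then center
imputed_then_centered = np.where(np.isnan(toy_ratings), user_mean, toy_ratings) - user_mean

print(centered)
print(np.allclose(centered, imputed_then_centered))
print(centered.std(axis=1))   # spread is unchanged by centering, so not 1 in general
```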


In [133]:
#Transpose to movies x users for item-based collaborative filtering
rating_norm_pivot = rating_norm_pivot.T
rating_norm_pivot

title \ userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
'71 (2014),0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.311444
'Hellboy': The Seeds of Creation (2004),0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000
'Round Midnight (1986),0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000
'Salem's Lot (2004),0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000
'Til There Was You (1997),0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
eXistenZ (1999),0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,...,0.0,0.0,1.492047,0.0,0.0,0.0,0.0,1.365824,0.0,0.000000
xXx (2002),0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-2.26087,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.365824,0.0,-1.688556
xXx: State of the Union (2005),0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,-2.188556
¡Three Amigos! (1986),-0.366379,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000


### Similarity Measure: Cosine Similarity

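We use cosine similarity because the mean-centered ratings are **continuous** (real-valued) rather than binary, so a set-based measure such as Jaccard no longer applies.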
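
The quantity being computed can be written out as follows. The symbols are introduced here for illustration: $\tilde r_{u,i}$ is the entry of the transposed `rating_norm_pivot` for movie $i$ and user $u$, $r_{u,i}$ is the raw rating, and $\bar r_u$ is the user's average rating.

```latex
\operatorname{sim}(i, j) =
\frac{\sum_{u} \tilde r_{u,i}\, \tilde r_{u,j}}
     {\sqrt{\sum_{u} \tilde r_{u,i}^{2}}\; \sqrt{\sum_{u} \tilde r_{u,j}^{2}}},
\qquad
\tilde r_{u,i} =
\begin{cases}
r_{u,i} - \bar r_{u} & \text{if user } u \text{ rated movie } i \\
0 & \text{otherwise}
\end{cases}
```

Because ratings are centered per user before movies are compared, this is close to what is often called adjusted cosine similarity. One difference is that the sums here run over all users, not only over users who rated both movies.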

In [137]:
#Cosine similarity

from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity(rating_norm_pivot)
cosine_sim_df = pd.DataFrame(cosine_sim, index=rating_norm_pivot.index, columns=rating_norm_pivot.index)

In [143]:
cosine_sim.shape

(9685, 9685)

In [141]:
#Recommendation function

def recommend(item):
    #Sort by similarity and skip position 0, assumed to be the movie itself; returns 9 titles
    return cosine_sim_df[item].sort_values(ascending=False)[1:10]

recommend('Toy Story (1995)')

title
Toy Story 2 (1999)                             0.403454
Aladdin (1992)                                 0.327261
Toy Story 3 (2010)                             0.327233
Wallace & Gromit: The Wrong Trousers (1993)    0.305441
Back to the Future (1985)                      0.276805
Incredibles, The (2004)                        0.274821
Blazing Saddles (1974)                         0.271634
Finding Nemo (2003)                            0.262828
Ghostbusters (a.k.a. Ghost Busters) (1984)     0.249702
Name: Toy Story (1995), dtype: float64