<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Recommender Systems

_Authors: Riley Dallas (AUS)_

---

In [11]:
import pandas as pd
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

## Load `movies.csv` and `ratings.csv`
---

We'll be using the [MovieLens](https://grouplens.org/datasets/movielens/) dataset for building our recommendation engine. There are two CSVs (`movies.csv` and `ratings.csv`) that we'll eventually inner join. 

In [2]:
movies = pd.read_csv('../datasets/movies.csv')
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [3]:
ratings = pd.read_csv('../datasets/ratings.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


The `movieId` column is common to both tables, so we can use it as the join key to merge them into a single DataFrame.
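
To make the join concrete, here is a minimal sketch on made-up data. The frames `toy_movies` and `toy_ratings` and their values are assumptions for illustration only; they are not rows from MovieLens. It shows that an inner join keeps only the ratings whose `movieId` exists in both tables.

```python
import pandas as pd

# Toy stand-ins for the two tables (values are made up for illustration)
toy_movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Movie A', 'Movie B', 'Movie C'],
})
toy_ratings = pd.DataFrame({
    'userId': [10, 10, 11, 12],
    'movieId': [1, 2, 1, 99],   # movieId 99 has no match in toy_movies
    'rating': [4.0, 3.5, 5.0, 2.0],
})

# Inner join: keep only rows whose movieId appears in both tables.
# validate='many_to_one' raises an error if movieId is duplicated in toy_movies.
toy_df = pd.merge(toy_ratings, toy_movies, on='movieId', how='inner',
                  validate='many_to_one')
print(toy_df)
```

The rating for `movieId` 99 is dropped because no movie with that ID exists, and "Movie C" does not appear because nobody rated it.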

## Drop unnecessary columns
---

We won't need the `timestamp` column from `ratings`, nor will we need the `genres` column from `movies`. Drop both columns in the cells below.

In [4]:
ratings.drop(columns=['timestamp'], inplace=True)
movies.drop(columns=['genres'], inplace=True)

## Merge `movies` and `ratings`
---

Use `pd.merge` to **inner join** `movies` with `ratings` on the `movieId` column.

In [5]:
df = pd.merge(ratings, movies, on='movieId')
print(df.shape)
df.head()

(100836, 4)


Unnamed: 0,userId,movieId,rating,title
0,1,1,4.0,Toy Story (1995)
1,5,1,4.0,Toy Story (1995)
2,7,1,4.5,Toy Story (1995)
3,15,1,2.5,Toy Story (1995)
4,17,1,4.5,Toy Story (1995)


## Item-Based Collaborative Filtering
---

Because we're building an item-based collaborative recommender (where the items are our movies), we'll set up our pivot table as follows:
1. The `title` will be the index (row)
2. The `userId` will be the columns
3. The `rating` will be the value

![](images/item_based_collaborative_filtering.png)


<details><summary>If we were building a user-based collaborative recommender, what would change about this pivot table?</summary>

1. The `userId` will be the index (row)
2. The `title` will be the columns
3. The `rating` will be the value
</details>
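
As a quick check of the two layouts, the sketch below builds both pivot tables from a tiny made-up ratings frame (the `toy` data is an assumption, not taken from MovieLens). Under that assumption, the user-based table is simply the transpose of the item-based one.

```python
import pandas as pd

# Made-up ratings in the same long format as `df`
toy = pd.DataFrame({
    'userId': [1, 1, 2, 3],
    'title': ['Movie A', 'Movie B', 'Movie A', 'Movie B'],
    'rating': [4.0, 3.0, 5.0, 2.5],
})

# Item-based layout: one row per movie, one column per user
item_pivot = pd.pivot_table(toy, index='title', columns='userId', values='rating')

# User-based layout: one row per user, one column per movie
user_pivot = pd.pivot_table(toy, index='userId', columns='title', values='rating')

print(item_pivot.shape)   # (movies, users)
print(user_pivot.shape)   # (users, movies)
print(item_pivot.T)
```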

In [6]:
pivot = pd.pivot_table(df, index='title', columns='userId', values='rating')

print(pivot.shape)
pivot.head()

# Optionally perform mean centering using the function we wrote in the previous class

(9719, 610)


title \ userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
'71 (2014),,,,,,,,,,,...,,,,,,,,,,4.0
'Hellboy': The Seeds of Creation (2004),,,,,,,,,,,...,,,,,,,,,,
'Round Midnight (1986),,,,,,,,,,,...,,,,,,,,,,
'Salem's Lot (2004),,,,,,,,,,,...,,,,,,,,,,
'Til There Was You (1997),,,,,,,,,,,...,,,,,,,,,,
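
The comment in the cell above mentions optional mean centering. The helper from the previous class is not shown in this notebook, so the following is only a sketch of one common variant: subtracting each user's mean rating from that user's ratings. The function name `mean_center_by_user` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def mean_center_by_user(pivot):
    """Subtract each user's mean rating (the column mean) from that user's ratings.

    Unrated cells (NaN) stay NaN, because pandas skips NaN when averaging.
    """
    return pivot - pivot.mean(axis=0)

# Made-up pivot: rows are movies, columns are userIds
toy_pivot = pd.DataFrame(
    {1: [4.0, 2.0, np.nan], 2: [5.0, np.nan, 3.0]},
    index=['Movie A', 'Movie B', 'Movie C'],
)
centered = mean_center_by_user(toy_pivot)
print(centered)
```

One reason to consider this step: after centering, filling `NaN` with 0 treats an unrated movie as "average for that user" rather than as a rating below the lowest possible score.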


## Create sparse matrix
---

- In real-world use cases there could be millions of users, so the pivot table would have millions of columns.
- Computing cosine similarity on a dense matrix of that size can exhaust memory and crash the kernel.
- Instead, we can exploit the sparsity of the data to compute cosine similarity more efficiently. Most users have `NaN` for most movies: here, roughly 100,000 ratings fill under 2% of the 9,719 × 610 cells.
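
To see what "sparse" means in practice, the sketch below converts a small made-up pivot table to CSR format and measures how many cells are actually stored. The names `toy_pivot` and `toy_sparse` and the values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Made-up pivot: 3 movies x 4 users, mostly unrated
toy_pivot = pd.DataFrame(
    [[4.0, np.nan, np.nan, 5.0],
     [np.nan, np.nan, 3.0, np.nan],
     [np.nan, 2.5, np.nan, np.nan]],
    index=['Movie A', 'Movie B', 'Movie C'],
    columns=[1, 2, 3, 4],
)

# Fill unrated cells with 0 so that CSR stores only the actual ratings
toy_sparse = sparse.csr_matrix(toy_pivot.fillna(0))

n_cells = toy_sparse.shape[0] * toy_sparse.shape[1]
density = toy_sparse.nnz / n_cells   # nnz = number of stored (non-zero) values
print(f'stored values: {toy_sparse.nnz} of {n_cells} cells')
print(f'density: {density:.2%}')
```

The same two lines (`nnz` divided by the number of cells) can be run on the real sparse matrix to check how sparse it is.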

In [7]:
# First create a sparse matrix from our pivot DataFrame.
# Remember, only non-zero values are stored in a sparse matrix.
sparse_pivot = sparse.csr_matrix(pivot.fillna(0))
print(sparse_pivot)

  (0, 609)	4.0
  (1, 331)	4.0
  (2, 331)	3.5
  (2, 376)	3.5
  (3, 344)	5.0
  (4, 112)	3.0
  (4, 344)	5.0
  (5, 20)	1.5
  (6, 11)	5.0
  (6, 18)	2.0
  (6, 90)	2.0
  (6, 94)	3.0
  (6, 171)	4.0
  (6, 216)	4.0
  (6, 287)	3.0
  (6, 293)	1.0
  (6, 306)	3.5
  (6, 376)	3.5
  (6, 413)	3.0
  (6, 473)	1.0
  (6, 476)	3.5
  (6, 519)	4.0
  (6, 554)	5.0
  (6, 560)	4.5
  (6, 598)	2.0
  :	:
  (9717, 26)	5.0
  (9717, 41)	5.0
  (9717, 56)	2.0
  (9717, 67)	4.0
  (9717, 87)	3.5
  (9717, 140)	3.5
  (9717, 197)	2.0
  (9717, 214)	2.5
  (9717, 216)	2.0
  (9717, 220)	3.5
  (9717, 238)	3.0
  (9717, 281)	4.0
  (9717, 293)	4.0
  (9717, 306)	2.5
  (9717, 312)	1.0
  (9717, 413)	3.0
  (9717, 420)	3.0
  (9717, 447)	3.0
  (9717, 473)	3.0
  (9717, 476)	3.5
  (9717, 554)	3.0
  (9717, 560)	4.0
  (9717, 596)	3.0
  (9717, 598)	2.5
  (9718, 526)	1.0


## Calculate cosine similarity
---

`sklearn` has a built-in `cosine_similarity` function that we can use for our recommender. 

In [8]:
# Use the sparse matrix to calculate cosine similarity more efficiently!
similarities = cosine_similarity(sparse_pivot)
print(similarities.shape)

(9719, 9719)
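
For intuition, the cosine similarity of two rating vectors `a` and `b` is their dot product divided by the product of their lengths, `a·b / (‖a‖‖b‖)`. The sketch below computes this by hand with NumPy on a made-up 3 × 3 matrix. The helper name `cosine_sim_matrix` and the toy values are assumptions for illustration; this is not how `sklearn` is implemented internally.

```python
import numpy as np

def cosine_sim_matrix(X):
    """Pairwise cosine similarity between the rows of a dense 2-D array."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # avoid dividing by zero for all-zero rows
    unit = X / norms                 # scale every row to length 1
    return unit @ unit.T             # dot products of unit vectors

# Rows are movies, columns are users; 0 means "not rated"
toy = np.array([
    [4.0, 0.0, 0.0],   # rated by user 0 only
    [4.0, 4.0, 0.0],   # rated by users 0 and 1
    [0.0, 0.0, 5.0],   # rated by user 2 only
])
sims = cosine_sim_matrix(toy)
print(np.round(sims, 6))
```

In this toy case the first two movies share one rater, giving a similarity of 1/√2 ≈ 0.707107, while movies with no raters in common get 0. Every movie has similarity 1 with itself.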


## Create Similarities DataFrame
---

At this point, we essentially have a recommender. We'll load it into a `pandas` DataFrame for readability. 

You'll notice that each movie has a "similarity" of 1 with itself (along the diagonal).

In [9]:
recommender_df = pd.DataFrame(similarities, 
                              columns=pivot.index, 
                              index=pivot.index)
recommender_df.head()

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
'71 (2014),1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.141653,0.0,...,0.0,0.342055,0.543305,0.707107,0.0,0.0,0.139431,0.327327,0.0,0.0
'Hellboy': The Seeds of Creation (2004),0.0,1.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Round Midnight (1986),0.0,0.707107,1.0,0.0,0.0,0.0,0.176777,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Salem's Lot (2004),0.0,0.0,0.0,1.0,0.857493,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Til There Was You (1997),0.0,0.0,0.0,0.857493,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Evaluate recommender performance
---

Now comes the fun part! Let's check out a few movies to see if the recommender aligns with our intuition. In the cell below we'll do the following:
1. Create a search term
2. Use that to find all titles matching the search query
3. For each title, we'll list off the following:
   1. The average rating
   2. The number of ratings
   3. The ten most similar movies
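
The "most similar movies" step can be sketched as a small reusable function. The helper name `top_similar` and the `toy_sims` values are hypothetical; the function expects a square similarity DataFrame shaped like `recommender_df`, with titles on both axes.

```python
import pandas as pd

def top_similar(sim_df, title, n=10):
    """Return the n titles most similar to `title`, excluding the title itself."""
    return (sim_df[title]
            .drop(labels=title)              # remove the movie's self-similarity
            .sort_values(ascending=False)
            .head(n))

# Made-up similarity matrix with titles on both axes
toy_sims = pd.DataFrame(
    [[1.0, 0.8, 0.1],
     [0.8, 1.0, 0.3],
     [0.1, 0.3, 1.0]],
    index=['Movie A', 'Movie B', 'Movie C'],
    columns=['Movie A', 'Movie B', 'Movie C'],
)
print(top_similar(toy_sims, 'Movie A', n=2))
```

Dropping the title by label, rather than skipping the first sorted row, stays correct even if another movie happens to tie with a similarity of 1.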

In [15]:
search = 'Matrix'
pivot[pivot.index.str.contains(search)].index

Index(['Matrix Reloaded, The (2003)', 'Matrix Revolutions, The (2003)',
       'Matrix, The (1999)'],
      dtype='object', name='title')

In [17]:
# Find all the movies whose titles match the search query
search = 'Matrix'
titles = pivot[pivot.index.str.contains(search)].index
print(titles)

Index(['Matrix Reloaded, The (2003)', 'Matrix Revolutions, The (2003)',
       'Matrix, The (1999)'],
      dtype='object', name='title')


In [19]:
print(titles[0])

Matrix Reloaded, The (2003)


In [18]:
pivot.loc[titles[0]]

userId
1      NaN
2      NaN
3      NaN
4      NaN
5      NaN
      ... 
606    2.0
607    NaN
608    5.0
609    NaN
610    4.0
Name: Matrix Reloaded, The (2003), Length: 610, dtype: float64

In [23]:
# Loop over each movie in `titles` and find 10 most similar movies
for title in titles:
    print(title)
    print(f'Average rating {pivot.loc[title].dropna().mean()}')
    print(f'Number of ratings {pivot.loc[title].dropna().count()}')
    print('')
    print('10 closest movies')
    print(recommender_df[title].sort_values(ascending=False)[1:11]) # Not starting at 0 because 0 is the movie itself!
    print('')
    print('*******************************************************************************************')
    print('')

Matrix Reloaded, The (2003)
Average rating 3.3541666666666665
Number of ratings 96

10 closest movies
title
Matrix Revolutions, The (2003)                                   0.760446
X2: X-Men United (2003)                                          0.645495
Star Wars: Episode II - Attack of the Clones (2002)              0.630319
Spider-Man (2002)                                                0.612109
Batman Begins (2005)                                             0.610665
Pirates of the Caribbean: The Curse of the Black Pearl (2003)    0.610229
I, Robot (2004)                                                  0.607284
Minority Report (2002)                                           0.606937
Sin City (2005)                                                  0.604875
X-Men: The Last Stand (2006)                                     0.602943
Name: Matrix Reloaded, The (2003), dtype: float64

*******************************************************************************************

Matrix R