# Collaborative Filtering

#### Some Theoretical Background

[Real Python: Build a Recommendation Engine With Collaborative Filtering](https://realpython.com/build-recommendation-engine-collaborative-filtering/)

### User-Based vs Item-Based Collaborative Filtering

The two approaches are mathematically quite similar, but there is a conceptual difference between the two. Here’s how the two compare:

+ **User-based**: For a user U, a set of similar users is determined from rating vectors that hold each user's item ratings. The rating for an item I that U has not rated is then computed from the ratings of the N most similar users who have rated I.

+ **Item-based**: For an item I, a set of similar items is determined from rating vectors that hold each item's received user ratings. The rating by a user U who has not rated I is then computed from U's ratings of the N most similar items.
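
The two variants above can be sketched with a few lines of plain Python. This is only an illustration under stated assumptions: the toy ratings, the helper names (`cosine`, `transpose`, `predict`) and the choice of cosine similarity with a similarity-weighted mean are all made up for this example and are not what the Surprise library does internally.

```python
from math import sqrt

# Toy ratings: user -> {item: rating}. All values are invented for illustration.
toy_ratings = {
    "alice": {"A": 5.0, "B": 3.0, "C": 4.0},
    "bob": {"A": 4.0, "B": 2.0, "C": 5.0, "D": 4.0},
    "carol": {"A": 1.0, "B": 5.0, "D": 2.0},
}


def cosine(v, w):
    """Cosine similarity computed over the keys both rating vectors share."""
    common = set(v) & set(w)
    if not common:
        return 0.0
    dot = sum(v[k] * w[k] for k in common)
    norm_v = sqrt(sum(v[k] ** 2 for k in common))
    norm_w = sqrt(sum(w[k] ** 2 for k in common))
    return dot / (norm_v * norm_w)


def transpose(r):
    """Turn user -> {item: rating} into item -> {user: rating}."""
    out = {}
    for user, items in r.items():
        for item, rating in items.items():
            out.setdefault(item, {})[user] = rating
    return out


def predict(vectors, target, missing, n=2):
    """Similarity-weighted mean over the n neighbours of `target`
    that are most similar to it and have a value for `missing`."""
    neighbours = [
        (cosine(vectors[target], vec), vec[missing])
        for name, vec in vectors.items()
        if name != target and missing in vec
    ]
    neighbours.sort(key=lambda pair: pair[0], reverse=True)
    top = neighbours[:n]
    total_sim = sum(sim for sim, _ in top)
    if total_sim == 0:
        return None
    return sum(sim * rating for sim, rating in top) / total_sim


# User-based: compare alice with the other users, then use their ratings of D.
user_based = predict(toy_ratings, target="alice", missing="D")

# Item-based: compare D with the other items, then use alice's ratings of them.
item_based = predict(transpose(toy_ratings), target="D", missing="alice")

print(round(user_based, 2), round(item_based, 2))
```

The same `predict` function serves both variants; only the orientation of the rating matrix changes, which is the sense in which the two approaches are mathematically similar.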

## Import Packages

In [293]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

### Surprise Library

In [294]:
from surprise import Dataset
from surprise import Reader

from surprise import BaselineOnly
from surprise import KNNBasic
from surprise import KNNWithMeans
from surprise import SVD
from surprise import accuracy
from surprise import SlopeOne
from surprise import SVDpp
from surprise import NMF
from surprise import NormalPredictor
from surprise import KNNBaseline
from surprise import KNNWithZScore
from surprise import CoClustering

from surprise.model_selection import cross_validate
from surprise.model_selection import train_test_split
from surprise.model_selection import GridSearchCV


RSEED = 42

#### Import Data

In [295]:
movies = pd.read_csv('../data/ml-latest-small/movies.csv')
ratings = pd.read_csv('../data/ml-latest-small/ratings.csv')
#links = pd.read_csv('../data/ml-latest-small/links.csv')
#tags = pd.read_csv('../data/ml-latest-small/tags.csv')

In [296]:
ratings['rating'].describe()

count    100836.000000
mean          3.501557
std           1.042529
min           0.500000
25%           3.000000
50%           3.500000
75%           4.000000
max           5.000000
Name: rating, dtype: float64

In [297]:
# total number of ratings per movie
movie_rat_count = ratings.groupby('movieId').count()['rating'].reset_index()
movie_rat_count.head(2)

|   | movieId | rating |
|---|---------|--------|
| 0 | 1 | 215 |
| 1 | 2 | 110 |


In [298]:
# extract one rating count for a given movie (here movie 2)
num = movie_rat_count[movie_rat_count['movieId'] == 2].reset_index()
num = num.loc[0, 'rating']
num

110

#### Define the Reader and load the data frame (here: userId, movieId and rating columns)

In [299]:
reader = Reader(rating_scale=(0.5,5))
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

#### Function for Top N Recommendations

In [300]:
# ORIGINAL
# from the surprise documentation
# with the extension of a recommendation dictionary

from collections import defaultdict


def get_top_n(predictions, n=10):
    """Return the top-N recommendations for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendations to output for each user. Default
            is 10.

    Returns:
    A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    """

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))  # store (item id, estimated rating)

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n
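
To see what the function returns without training a model, it can be fed a few hand-made predictions. The sketch below is self-contained, so it repeats a compact copy of `get_top_n`. The `Prediction` namedtuple is a stand-in that mirrors the five fields of Surprise's prediction objects, and all IDs and estimates are invented.

```python
from collections import defaultdict, namedtuple

# Stand-in for Surprise's Prediction objects (same five fields, same order).
Prediction = namedtuple("Prediction", ["uid", "iid", "r_ui", "est", "details"])


def get_top_n(predictions, n=10):
    """Compact copy of the function above so this example runs on its own."""
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]
    return top_n


# Invented predictions for two users and three movies.
toy_predictions = [
    Prediction(1, 10, None, 3.2, {}),
    Prediction(1, 20, None, 4.1, {}),
    Prediction(1, 30, None, 4.8, {}),
    Prediction(2, 10, None, 4.5, {}),
    Prediction(2, 30, None, 2.0, {}),
]

toy_top = get_top_n(toy_predictions, n=2)
for uid, recs in toy_top.items():
    print(uid, recs)
```

Each user keeps only the `n` highest estimates, ordered from best to worst.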

## Recommendations with SVD (Singular Value Decomposition)
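
For orientation: Surprise's documentation describes its `SVD` class as a matrix-factorization model in the style popularized during the Netflix Prize, not as a literal singular value decomposition of the rating matrix. A predicted rating is composed of a global mean, two bias terms and a dot product of latent factor vectors:

$$\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u$$

Here $\mu$ is the mean of all training ratings, $b_u$ and $b_i$ are the user and item biases, and $p_u$ and $q_i$ are the learned factor vectors of user $u$ and item $i$. The parameters are fitted by stochastic gradient descent on the known ratings.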

In [301]:
# First train an SVD algorithm on the movielens dataset.
#data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
trainset = data.build_full_trainset()

algo = SVD()

algo.fit(trainset)

# Then predict ratings for all pairs (u, i) that are NOT in the training set.
testset = trainset.build_anti_testset()
predictions = algo.test(testset)

top_n = get_top_n(predictions, n=100)

# Print the recommended items for each user (uncomment to see the output).
# for uid, user_ratings in top_n.items():
#     print(uid, [iid for (iid, _) in user_ratings])
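
The idea behind `build_anti_testset()` can be sketched in plain Python: collect every (user, item) pair that has no known rating. The sketch assumes the library's default behaviour of filling the unknown rating with the global mean of the training ratings. The toy triples are invented.

```python
# Known (user, item, rating) triples; the values are invented for illustration.
known = [(1, "a", 4.0), (1, "b", 3.0), (2, "a", 5.0), (2, "c", 2.0)]

users = sorted({u for u, _, _ in known})
items = sorted({i for _, i, _ in known})
rated = {(u, i) for u, i, _ in known}
global_mean = sum(r for _, _, r in known) / len(known)

# Every (user, item) pair without a known rating, filled with the global mean.
anti_testset = [
    (u, i, global_mean) for u in users for i in items if (u, i) not in rated
]

print(anti_testset)
```

The result has (number of users × number of items) minus (number of known ratings) entries, so on a full dataset the prediction step is by far the most expensive part of the cell above.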

`top_n` is a dictionary with one entry per user:

+ key: user ID
+ value: list of tuples (movieId, predicted rating), sorted by predicted rating in descending order

In [302]:
# for example
top_n[1][0] # user 1 recommendation 1 (movieId, prediction)

(318, 5)

In [303]:
# if you want to look at the titles of the recommended movies
def get_recomm_for_user(user):
    for i,j in enumerate(top_n[user]):
        movie = movies[movies['movieId']==j[0]].title
        print('{}. Movie: {}'.format(i+1, movie.iloc[0]))

get_recomm_for_user(1)

1. Movie: Shawshank Redemption, The (1994)
2. Movie: Dark Knight, The (2008)
3. Movie: Rear Window (1954)
4. Movie: Casablanca (1942)
5. Movie: Notorious (1946)
6. Movie: Brazil (1985)
7. Movie: Bridge on the River Kwai, The (1957)
8. Movie: Unforgiven (1992)
9. Movie: Seven Samurai (Shichinin no samurai) (1954)
10. Movie: Outlaw Josey Wales, The (1976)
11. Movie: Spirited Away (Sen to Chihiro no kamikakushi) (2001)
12. Movie: Captain Phillips (2013)
13. Movie: Wallace & Gromit: The Wrong Trousers (1993)
14. Movie: To Kill a Mockingbird (1962)
15. Movie: Grand Day Out with Wallace and Gromit, A (1989)
16. Movie: Cool Hand Luke (1967)
17. Movie: Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001)
18. Movie: Godfather: Part II, The (1974)
19. Movie: Raging Bull (1980)
20. Movie: Double Indemnity (1944)
21. Movie: Touch of Evil (1958)
22. Movie: Yojimbo (1961)
23. Movie: Great Escape, The (1963)
24. Movie: Guess Who's Coming to Dinner (1967)
25. Movie: It's a Wonderful Life (1946)
26. Mo

# Save recommendations as a CSV file:

In [305]:
top_recomm = pd.DataFrame.from_dict(top_n, orient='index')

In [306]:
# export the recommendations to a CSV file
# note: index=False drops the DataFrame index, which holds the user IDs
top_recomm.to_csv('../data/recommendations.csv',index=False)
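
If the user IDs should survive the export, one alternative is a long format with one row per recommendation. The following is a minimal sketch using only the standard library. The dictionary `toy_top_n` stands in for the real `top_n`, its values are invented, and the column names are an assumption rather than a format the rest of this notebook relies on.

```python
import csv
import io

# Stand-in for top_n: user ID -> [(movieId, estimate), ...]. Values are invented.
toy_top_n = {1: [(318, 4.9), (101, 4.8)], 2: [(202, 4.7)]}

# An in-memory buffer keeps the example side-effect free;
# swap it for open(path, "w", newline="") to write a real file.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["userId", "rank", "movieId", "estimate"])
for user_id, recs in toy_top_n.items():
    for rank, (movie_id, est) in enumerate(recs, start=1):
        writer.writerow([user_id, rank, movie_id, est])

csv_text = buffer.getvalue()
print(csv_text)
```

A long table like this is also easier to join with `movies` on `movieId` than a wide table with one column per rank.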

# A closer look at the data:

In [307]:
# get all ratings for a certain user:

user_1 = ratings.query('userId == 1')
user_1.head()

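|   | userId | movieId | rating | timestamp |
|---|--------|---------|--------|-----------|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |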


In [308]:
# put all the recommended movies for that user in a list:

recomm_movies = [movie_id for movie_id, _ in top_n[1]]

#recomm_movies

In [309]:
# check if all recommended movies are new to the user:

user_1[user_1['movieId'].isin(recomm_movies)]

|   | userId | movieId | rating | timestamp |
|---|--------|---------|--------|-----------|


### If the output above shows no rows, every recommended movie is new to the user.

In [310]:
# get the count of ratings for every recommended movie
for i in recomm_movies :
    num = movie_rat_count[movie_rat_count['movieId'] == i].reset_index()
    num = num.loc[0, 'rating']
    #print(i, num) # activate print statement to see output

### With the print statement activated, the loop above shows how many ratings each recommended movie received in total.

### Below, you can set a threshold (here: 10) to see how many of the recommended movies have at most that many ratings and how many have more:

In [311]:
below = []
above = []

for i in recomm_movies :
    num = movie_rat_count[movie_rat_count['movieId'] == i].reset_index()
    num = num.loc[0, 'rating']
    if num <= 10 :
        below.append((i, num))
    else :
        above.append((i, num))

print('10 or less ratings: ', len(below))
print('More than 10 ratings: ', len(above))


10 or less ratings:  6
More than 10 ratings:  94
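
The loop above filters the count table once per recommended movie. A lookup table built once gives the same split with less work. The sketch below uses only the standard library. The rating events, the helper name `split_by_popularity` and the threshold of 2 are invented for illustration.

```python
from collections import Counter

# Invented (userId, movieId) rating events; only the movie IDs matter here.
toy_events = [(1, 10), (2, 10), (3, 10), (1, 20), (2, 30), (3, 30)]

# Number of ratings per movie, built in a single pass.
rating_counts = Counter(movie_id for _, movie_id in toy_events)


def split_by_popularity(movie_ids, counts, threshold):
    """Split movie IDs into those with at most `threshold` ratings and the rest."""
    below = [(m, counts[m]) for m in movie_ids if counts[m] <= threshold]
    above = [(m, counts[m]) for m in movie_ids if counts[m] > threshold]
    return below, above


below, above = split_by_popularity([10, 20, 30], rating_counts, threshold=2)
print("At most 2 ratings:", below)
print("More than 2 ratings:", above)
```

With the real data, the counts could come from `movie_rat_count` instead of a `Counter`, for example by converting its two columns into a dictionary first.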


# END of RECOMMENDATIONS with SVD