# Collaborative Filtering

#### Some Theoretical Background

[Real Python: Build a Recommendation Engine With Collaborative Filtering](https://realpython.com/build-recommendation-engine-collaborative-filtering/)

### User-Based vs Item-Based Collaborative Filtering

The two approaches are mathematically quite similar, but there is a conceptual difference between the two. Here’s how the two compare:

+ **User-based**: For a user U, with a set of similar users determined based on rating vectors consisting of given item ratings, the rating for an item I, which hasn’t been rated, is found by picking out N users from the similarity list who have rated the item I and calculating the rating based on these N ratings.

+ **Item-based**: For an item I, with a set of similar items determined based on rating vectors consisting of received user ratings, the rating by a user U, who hasn’t rated it, is found by picking out N items from the similarity list that have been rated by U and calculating the rating based on these N ratings.
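
To make the two definitions concrete, here is a minimal sketch of the user-based case in plain Python. It assumes cosine similarity over co-rated items and a similarity-weighted average of the N nearest neighbours' ratings; both are common choices but not the only ones, and the tiny `ratings` dictionary and the function names are made up for illustration. The item-based case works the same way with the roles of users and items swapped.

```python
from math import sqrt

# Hypothetical toy data: ratings[user][item] = rating
ratings = {
    "A": {"i1": 5.0, "i2": 3.0, "i3": 4.0},
    "B": {"i1": 4.0, "i2": 3.0, "i3": 5.0, "i4": 4.0},
    "C": {"i1": 1.0, "i2": 5.0, "i4": 2.0},
}


def cosine_sim(u, v):
    """Cosine similarity of two users over the items both have rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    norm_u = sqrt(sum(ratings[u][i] ** 2 for i in common))
    norm_v = sqrt(sum(ratings[v][i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def predict_user_based(user, item, n=2):
    """Similarity-weighted mean of the ratings given to `item` by the
    n users most similar to `user` (only users who rated `item`)."""
    neighbours = [
        (cosine_sim(user, other), ratings[other][item])
        for other in ratings
        if other != user and item in ratings[other]
    ]
    top = sorted(neighbours, reverse=True)[:n]
    weight_sum = sum(sim for sim, _ in top)
    if weight_sum == 0:
        return None  # no usable neighbours
    return sum(sim * r for sim, r in top) / weight_sum


# User A has not rated i4; B is the closer neighbour, so B's rating weighs more.
estimate = predict_user_based("A", "i4")
```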

## Import Packages

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

### Surprise Library

In [2]:
from surprise import Dataset
from surprise import Reader

from surprise import BaselineOnly
from surprise import KNNBasic
from surprise import KNNWithMeans
from surprise import SVD
from surprise import accuracy
from surprise import SlopeOne
from surprise import SVDpp
from surprise import NMF
from surprise import NormalPredictor
from surprise import KNNBaseline
from surprise import KNNWithZScore
from surprise import CoClustering

from surprise.model_selection import cross_validate
from surprise.model_selection import train_test_split
from surprise.model_selection import GridSearchCV

RSEED = 42

#### Import Data

In [3]:
movies = pd.read_csv('../data/ml-latest-small/movies.csv')
ratings = pd.read_csv('../data/ml-latest-small/ratings.csv')
#links = pd.read_csv('../data/ml-latest-small/links.csv')
#tags = pd.read_csv('../data/ml-latest-small/tags.csv')

In [4]:
ratings['rating'].describe()
ratings.movieId.nunique()

9724

In [5]:
df = pd.read_csv('../data/df_features.csv')
movieIds = df.movieId.to_list()

len(movieIds)

9543

In [6]:
ratings = ratings[ratings['movieId'].isin(movieIds)]
ratings.movieId.nunique()

9525

In [21]:
# total number of ratings per movie
#movie_rat_count = ratings.groupby('movieId').count()['rating'].reset_index()
#movie_rat_count.head(2)

In [22]:
# extract one rating count for a given movie (here movie 2)
#num = movie_rat_count[movie_rat_count['movieId'] == 2].reset_index()
#num = num.loc[0, 'rating']
#num

#### Define the Reader and Load the Data Frame (here: userId, movieId and rating columns)

In [7]:
reader = Reader(rating_scale=(0.5,5))
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

#### Function for Top N Recommendations

In [8]:
# ORIGINAL
# from the surprise documentation
# with the extension of a recommendation dictionary

from collections import defaultdict


def get_top_n(predictions, n=10):
    """Return the top-N recommendation for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendation to output for each user. Default
            is 10.

    Returns:
    A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    """

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))  # keep (item id, estimated rating)

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n
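
A quick way to check what such a function returns is to feed it a few hand-made predictions. The sketch below uses a stand-in `Prediction` namedtuple with the same five fields that Surprise's prediction objects unpack into (`uid`, `iid`, `r_ui`, `est`, `details`) and a compact variant, `top_n_heap` (a hypothetical name), built on `heapq.nlargest`, which avoids sorting each user's full list. It is an illustration under those assumptions, not a replacement for the function above.

```python
from collections import defaultdict, namedtuple
import heapq

# Stand-in for surprise's Prediction tuple (same field order).
Prediction = namedtuple("Prediction", ["uid", "iid", "r_ui", "est", "details"])


def top_n_heap(predictions, n=10):
    """Group (iid, est) pairs by user and keep the n highest estimates."""
    by_user = defaultdict(list)
    for uid, iid, _true_r, est, _details in predictions:
        by_user[uid].append((iid, est))
    return {
        uid: heapq.nlargest(n, pairs, key=lambda pair: pair[1])
        for uid, pairs in by_user.items()
    }


# Made-up predictions for two users.
fake_predictions = [
    Prediction(1, 318, 3.5, 4.8, {}),
    Prediction(1, 908, 3.5, 4.2, {}),
    Prediction(1, 912, 3.5, 4.6, {}),
    Prediction(2, 318, 3.5, 3.9, {}),
]

# top_2[1] holds the two best items for user 1, highest estimate first.
top_2 = top_n_heap(fake_predictions, n=2)
```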

## Recommendations with KNNBaseline (Item-Based)

In [9]:
# First train a KNNBaseline algorithm on the full MovieLens dataset.
#data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
trainset = data.build_full_trainset()

# KNNBaseline
sim_options = {'name': 'pearson_baseline',
               'user_based': False  # compute  similarities between items
               }
algo = KNNBaseline(sim_options=sim_options, random_state=RSEED)

algo.fit(trainset)

# Then predict ratings for all pairs (u, i) that are NOT in the training set.
testset = trainset.build_anti_testset()
predictions = algo.test(testset)

top_n = get_top_n(predictions, n=100)

# Print the recommended items for each user
#for uid, user_ratings in top_n.items():
#    print(uid, [iid for (iid, _) in user_ratings])  # prints recommendations for each user

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
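
The anti-testset is simply every (user, item) pair that has no rating in the training data. The following sketch rebuilds that idea in plain Python on a toy rating list; the data and the function name `anti_testset` are made up, and it mirrors Surprise's default of using the global mean rating as a placeholder for the unknown true rating. With about 610 users and roughly 9,500 movies, the real list has several million entries, so the prediction step can take a while.

```python
# Hypothetical toy data: (user, item, rating) triples.
toy_ratings = [
    ("u1", "m1", 4.0),
    ("u1", "m2", 3.0),
    ("u2", "m2", 5.0),
    ("u2", "m3", 2.0),
]


def anti_testset(triples):
    """Return (user, item, fill) for every pair absent from `triples`."""
    rated = {(u, i) for u, i, _ in triples}
    users = sorted({u for u, _, _ in triples})
    items = sorted({i for _, i, _ in triples})
    fill = sum(r for _, _, r in triples) / len(triples)  # global mean
    return [(u, i, fill) for u in users for i in items if (u, i) not in rated]


# Only the unrated pairs remain: (u1, m3) and (u2, m1).
pairs = anti_testset(toy_ratings)
```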


`top_n` is a dictionary with one entry per user:

+ key: user ID
+ value: list of tuples (movieId, predicted rating), sorted by predicted rating in descending order

In [10]:
# for example
top_n[1][0] # user 1 recommendation 1 (movieId, prediction)

(318, 5)

In [11]:
# if you want to look at the titles of the recommended movies
def get_recomm_for_user(user):
    """Return the titles of the movies recommended to `user`, best first."""
    titles = []
    for movie_id, _ in top_n[user]:
        title = movies.loc[movies['movieId'] == movie_id, 'title']
        titles.append(title.iloc[0])
    return titles

titles_user_1 = get_recomm_for_user(1)

### Save Recommendations as CSV Files

Make two dictionaries, e.g.:
1. key : [1, 2, 3]
2. key : [[1, 2, 3]]
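
The difference between the two shapes shows up once they are written to CSV. The sketch below uses the standard-library `csv` module instead of pandas to stay dependency-free, with made-up recommendation lists: shape 1 yields one column per recommendation, while shape 2 yields a single cell that holds the string form of the whole list.

```python
import csv
import io

# Made-up recommendations for two users.
recs = {1: ["318", "908", "912"], 2: ["160718", "7327", "3992"]}

# Shape 1: key -> [a, b, c]; every recommendation becomes its own column.
wide = io.StringIO()
writer = csv.writer(wide, lineterminator="\n")
for uid, items in recs.items():
    writer.writerow([uid, *items])

# Shape 2: key -> [[a, b, c]]; the whole list lands in one (quoted) cell.
nested = io.StringIO()
writer = csv.writer(nested, lineterminator="\n")
for uid, items in recs.items():
    writer.writerow([uid, items])

wide_rows = wide.getvalue().splitlines()
nested_rows = nested.getvalue().splitlines()
```

Reading shape 2 back requires parsing that string (for example with `ast.literal_eval`), whereas shape 1 can be read column by column.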

In [12]:
# Map each userId to its recommended movieIds (as strings), best first.
recommendations_knn = {}
for uid in sorted(top_n):
    recommendations_knn[uid] = [str(iid) for iid, _ in top_n[uid]]

In [23]:
""" recommendations2 = {}
for i in range(1, 611) :
    l = []
    for j in range(100) :
        reco = top_n[i][j][0]
        l.append(str(reco))
    recommendations2[i] = [l] """

Make dataframes:

In [13]:
top_recomm_knn = pd.DataFrame.from_dict(recommendations_knn, orient='index')

In [24]:
#top_recomm2 = pd.DataFrame.from_dict(recommendations2, orient='index')

The first one has the userId as its index and one column per movie recommendation:

In [14]:
top_recomm_knn.head(10)

| userId | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 318 | 908 | 912 | 1276 | 1262 | 4466 | 750 | 4294 | 7153 | 164917 | ... | 3897 | 5995 | 541 | 2788 | 54503 | 913 | 57669 | 741 | 1203 | 51255 |
| 2 | 160718 | 7327 | 3992 | 100882 | 112460 | 136602 | 50003 | 73042 | 1034 | 109850 | ... | 46972 | 4349 | 1271 | 63276 | 99437 | 6 | 48161 | 2297 | 805 | 1687 |
| 3 | 1004 | 2733 | 3064 | 3190 | 3370 | 3901 | 4079 | 85 | 104925 | 634 | ... | 2287 | 168248 | 120799 | 3550 | 3432 | 73211 | 61 | 84414 | 5165 | 71304 |
| 4 | 26524 | 71429 | 5113 | 779 | 70946 | 4092 | 43869 | 40870 | 568 | 26116 | ... | 116797 | 1228 | 1495 | 5782 | 6807 | 1035 | 3147 | 3671 | 3471 | 3089 |
| 5 | 6239 | 112804 | 389 | 3330 | 26810 | 51937 | 67186 | 6235 | 8782 | 6616 | ... | 2791 | 52579 | 8254 | 104879 | 4959 | 6244 | 50274 | 99145 | 1225 | 98122 |
| 6 | 53 | 32289 | 60516 | 80615 | 4429 | 70641 | 70305 | 3042 | 74282 | 6858 | ... | 238 | 141890 | 160567 | 72641 | 80241 | 93443 | 8605 | 1270 | 7888 | 80549 |
| 7 | 511 | 2977 | 4337 | 4708 | 1341 | 92422 | 4711 | 142115 | 1624 | 5113 | ... | 94130 | 2947 | 6858 | 148652 | 1192 | 8477 | 527 | 172547 | 143257 | 945 |
| 8 | 7486 | 549 | 5279 | 1085 | 4297 | 27664 | 635 | 3963 | 6884 | 31431 | ... | 3362 | 497 | 1196 | 1276 | 1272 | 4021 | 92259 | 2324 | 2599 | 8014 |
| 9 | 70946 | 64032 | 134368 | 159817 | 4079 | 74508 | 151739 | 26184 | 64116 | 100507 | ... | 8643 | 62970 | 1219 | 1252 | 8405 | 1596 | 1617 | 112421 | 140174 | 48516 |
| 10 | 130087 | 7395 | 80727 | 6143 | 112460 | 5529 | 109848 | 68959 | 97866 | 26052 | ... | 122896 | 7142 | 33499 | 89582 | 150 | 2477 | 595 | 6374 | 26606 | 42 |


The second one has a single column that holds the list of all recommendations:

In [25]:
#top_recomm2.head(11)

Export as CSV files:

#### Recommendations KNN
+ index: userId; every recommendation in a separate column

In [15]:
# export the new data to a CSV file
top_recomm_knn.to_csv('../data/recommendations_knn.csv', index_label='userId')

#### recommendations2
+ index: userId; one column with a list of all recommendations

In [26]:
#top_recomm2.to_csv('../data/recommendations2.csv',index=True, index_label='userId', header=['recommendations'])

# END of RECOMMENDATIONS with KNN