
# Introduction

We combine two recommender system algorithms into a hybrid recommender system:
- a content-based RecSys (using KNN);
- a collaborative filtering RecSys (using SVD).

Then, we compare 4 versions of recommender systems:
- random recommender  
- content based  
- collaborative filtering  
- hybrid approach.

The hybrid approach has several advantages:  
- it compensates for the ratings a user has not provided by using content-specific information;  
- it partially mitigates the cold-start problem.

This notebook draws extensively on the Sundog Education material about RecSys.


# Analysis preparation

In [1]:
import numpy as np
import pandas as pd
import re
import os
import heapq
from surprise import accuracy
from surprise import Dataset
from surprise import Reader
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from surprise.model_selection import cross_validate
from surprise.model_selection import train_test_split
from surprise.model_selection import LeaveOneOut
from surprise import NormalPredictor, SVD, KNNBasic

In [2]:
from recommender_metrics import RecommenderMetrics
from movie_lens_data import MovieLensData
from evaluator import Evaluator

# Read the data

In [3]:
path = "/kaggle/input/movielens-100k-dataset/ml-100k"
movie_lens_data = MovieLensData(
    users_path = os.path.join(path, "u.user"),
    ratings_path = os.path.join(path, "u.data"), 
    movies_path = os.path.join(path, "u.item"), 
    genre_path = os.path.join(path, "u.genre") 
    )

evaluation_data = movie_lens_data.read_ratings_data()
movie_data = movie_lens_data.read_movies_data()
popularity_rankings = movie_lens_data.get_popularity_ranks()
ratings = movie_lens_data.get_ratings()

# Prepare evaluator

In [4]:
evaluator = Evaluator(evaluation_data, popularity_rankings)

Number of full trainset users: 943
Number of full trainset items: 1682
Number of trainset users: 943
Number of trainset items: 1641
Size of testset: 25000
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.


# Add random recommender to evaluator

In [5]:
algo_np = NormalPredictor()
evaluator.add_algorithm(algo_np, "Random")
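
As a rough illustration of what such a random baseline does: Surprise documents `NormalPredictor` as predicting a random rating drawn from a normal distribution fitted to the training ratings. The sketch below mimics that idea in plain Python. The names `fit_normal` and `predict_random` and the toy ratings are hypothetical, not part of Surprise, and clipping to the 1 to 5 scale is an assumption about the rating range.

```python
import random
import statistics

def fit_normal(ratings):
    """Estimate the mean and population standard deviation of the ratings."""
    return statistics.mean(ratings), statistics.pstdev(ratings)

def predict_random(mu, sigma, rng, low=1.0, high=5.0):
    """Draw a rating from N(mu, sigma^2) and clip it to [low, high]."""
    return min(high, max(low, rng.gauss(mu, sigma)))

# Hypothetical training ratings, for illustration only
toy_ratings = [1, 2, 3, 4, 5, 4, 3]
mu, sigma = fit_normal(toy_ratings)

rng = random.Random(0)  # seeded so the sketch is repeatable
samples = [predict_random(mu, sigma, rng) for _ in range(5)]
```

Because the predictions ignore both the user and the item, this kind of recommender is only useful as a lower bound for the other algorithms.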

# Define and add Content based RecSys

In [6]:
from surprise import AlgoBase
from surprise import PredictionImpossible

class ContentBasedRecSysKNN(AlgoBase):

    def __init__(self, k=40, movie_data=None, sim_options={}):
        AlgoBase.__init__(self)
        self.k = k
        self.movie_data = movie_data

    def fit(self, trainset):
        """
        Fit function - calculate genre similarities
        Args
            trainset: the training set
        Returns
            self
        """
        AlgoBase.fit(self, trainset)
        
        self.compute_genre_similarities()
    
        return self
    
    def estimate(self, u, i):
        """
        Estimate the rating of item i by user u.
        Process:
            - gather the ratings given by user u
            - look up the genre similarity between item i and each item
              the user rated (the neighboring items)
            - keep the k most similar neighbors
            - return the average of their ratings, weighted by similarity
        Args
            u: current user (inner id)
            i: current item (inner id)
        Returns
            predicted rating
        """
        if not (self.trainset.knows_user(u) and self.trainset.knows_item(i)):
            raise PredictionImpossible('User and/or item is unknown.')
        
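        # Build up similarity scores between this item and everything the user rated.
        # Note: i and rating[0] are Surprise inner ids, so this lookup assumes the
        # rows of movie_data follow the same order as the inner item ids.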
        neighbors = []
        for rating in self.trainset.ur[u]:
            genre_similarity = self.genre_similarities[i,rating[0]]
            neighbors.append( (genre_similarity, rating[1]) )
        
        # Extract the top-K most-similar ratings (k is set at init)
        k_neighbors = heapq.nlargest(self.k, neighbors, key=lambda t: t[0])
        
        # Compute the similarity-weighted average of the K neighbors' ratings
        sim_total = weighted_sum = 0
        for (sim_score, rating) in k_neighbors:
            if (sim_score > 0):
                sim_total += sim_score
                weighted_sum += sim_score * rating
            
        if (sim_total == 0):
            raise PredictionImpossible('No neighbors')

        predicted_rating = weighted_sum / sim_total

        return predicted_rating
    
    def compute_genre_similarities(self):
        # compute similarities
        genre_columns = self.movie_data.columns[6:-1]
        genres = self.movie_data[genre_columns]
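        # linear_kernel returns the dot product of the genre vectors (the number
        # of shared genres for 0/1 genre flags); it equals cosine similarity
        # only when the rows are L2-normalized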
        self.genre_similarities = linear_kernel(genres, genres)
        return self
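
To make the estimate step concrete, here is a minimal, self-contained sketch of the same idea on toy data: score each item the user rated by genre similarity to the target item, keep the top k, and return the similarity-weighted average rating. The helper names (`dot`, `cosine`, `estimate_rating`) and the three-genre vectors are hypothetical. The sketch also contrasts the plain dot product with cosine similarity, since the two give different weights when the vectors are not normalized.

```python
import heapq
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0

def estimate_rating(target_genres, rated_items, k=2, similarity=dot):
    """rated_items: list of (genre_vector, rating) pairs for one user."""
    neighbors = [(similarity(target_genres, g), r) for g, r in rated_items]
    top_k = heapq.nlargest(k, neighbors, key=lambda t: t[0])
    sim_total = sum(s for s, _ in top_k if s > 0)
    if sim_total == 0:
        return None  # no neighbor shares a genre with the target
    return sum(s * r for s, r in top_k if s > 0) / sim_total

# Hypothetical genre flags: [action, comedy, drama]
target = [1, 1, 0]
rated = [
    ([1, 1, 0], 5.0),  # shares both genres with the target
    ([1, 0, 0], 3.0),  # shares one genre
    ([0, 0, 1], 1.0),  # shares none
]

dot_estimate = estimate_rating(target, rated, k=2, similarity=dot)
cos_estimate = estimate_rating(target, rated, k=2, similarity=cosine)
```

With the dot product, the neighbor sharing two genres counts twice as much as the one sharing a single genre; with cosine similarity the gap is smaller, so the two estimates differ slightly.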


In [7]:
algo_content = ContentBasedRecSysKNN(movie_data=movie_data)
evaluator.add_algorithm(algo_content, "ContentKNN")

# Add Collaborative Filtering using SVD

In [8]:
algo_colab_filter_SVD = SVD()
evaluator.add_algorithm(algo_colab_filter_SVD, "SVD")

# Define Hybrid RecSys

In [9]:
class HybridRecSysAlgorithm(AlgoBase):

    def __init__(self, algorithms, weights, sim_options={}):
        AlgoBase.__init__(self)
        self.algorithms = algorithms
        self.weights = weights

    def fit(self, trainset):
        AlgoBase.fit(self, trainset)
        
        for algorithm in self.algorithms:
            algorithm.fit(trainset)
                
        return self

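    def estimate(self, u, i):
        sum_scores = 0
        sum_weights = 0

        # Blend the estimates of the algorithms that can score this pair
        for algorithm, weight in zip(self.algorithms, self.weights):
            try:
                sum_scores += algorithm.estimate(u, i) * weight
                sum_weights += weight
            except PredictionImpossible:
                # Skip algorithms that cannot predict for this user/item
                continue

        try:
            return sum_scores / sum_weights
        except ZeroDivisionError:
            # No algorithm produced an estimate
            return 0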
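
The weighted blending above can be sketched without Surprise to show how the hybrid falls back to the remaining algorithm when one cannot predict, which is the cold-start case. Everything in this sketch is hypothetical: `PredictionUnavailable` stands in for `PredictionImpossible`, and `cf_like` and `content_like` are toy estimators rather than the SVD and ContentKNN models.

```python
class PredictionUnavailable(Exception):
    """Hypothetical stand-in for surprise.PredictionImpossible."""

def cf_like(user, item):
    """Toy collaborative estimator: only knows pairs seen in 'training'."""
    known = {("u1", "m1"): 4.0}
    if (user, item) not in known:
        raise PredictionUnavailable("unknown user/item pair")
    return known[(user, item)]

def content_like(user, item):
    """Toy content estimator: always able to return a score."""
    return 3.0

def blend(estimators, weights, user, item):
    """Weighted average over the estimators that can score this pair."""
    total = weight_sum = 0.0
    for estimator, weight in zip(estimators, weights):
        try:
            total += estimator(user, item) * weight
        except PredictionUnavailable:
            continue  # skip estimators that cannot predict
        weight_sum += weight
    if weight_sum == 0:
        raise PredictionUnavailable("no estimator could predict")
    return total / weight_sum

estimators = [cf_like, content_like]
weights = [0.5, 0.5]

both = blend(estimators, weights, "u1", "m1")        # both estimators contribute
cold_start = blend(estimators, weights, "u2", "m9")  # only the content estimator
```

Renormalizing by the weights that were actually used keeps the blended score on the rating scale even when one estimator drops out. Unlike the class above, the sketch raises when no estimator can predict instead of returning 0.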

# Initialize the Hybrid algorithm

In [10]:
hybrid = HybridRecSysAlgorithm([algo_colab_filter_SVD, algo_content], weights = [0.5, 0.5])

# Add the Hybrid algorithm

In [11]:
evaluator.add_algorithm(hybrid, "Hybrid")

# Evaluate algorithms

In [12]:
evaluator.evaluate(do_top_n=False)

Evaluating  Random ...
Evaluating accuracy...
Analysis complete.
Evaluating  ContentKNN ...
Evaluating accuracy...
Analysis complete.
Evaluating  SVD ...
Evaluating accuracy...
Analysis complete.
Evaluating  Hybrid ...
Evaluating accuracy...
Analysis complete.


Algorithm  RMSE       MAE       
Random     1.5240     1.2256    
ContentKNN 1.0735     0.8536    
SVD        0.9442     0.7432    
Hybrid     0.9731     0.7723    

Legend:

RMSE:      Root Mean Squared Error. Lower values mean better accuracy.
MAE:       Mean Absolute Error. Lower values mean better accuracy.
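
For reference, both metrics can be computed by hand from (predicted, actual) rating pairs. The sketch below uses hypothetical toy pairs, not the notebook's test set, and the function names `rmse` and `mae` are illustrative.

```python
import math

def rmse(pairs):
    """Root mean squared error over (predicted, actual) pairs."""
    return math.sqrt(sum((p - a) ** 2 for p, a in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (predicted, actual) pairs."""
    return sum(abs(p - a) for p, a in pairs) / len(pairs)

# Hypothetical (predicted, actual) ratings: errors are -1, 0 and -2
pairs = [(4.0, 5.0), (3.0, 3.0), (2.0, 4.0)]

rmse_value = rmse(pairs)  # sqrt((1 + 0 + 4) / 3)
mae_value = mae(pairs)    # (1 + 0 + 2) / 3
```

RMSE squares each error before averaging, so it penalizes large misses more heavily than MAE does; this is why RMSE is never smaller than MAE on the same predictions.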


# Evaluate topN recommendations

In [13]:
evaluator.sample_top_n_recs(movie_lens_data, test_subject=85, k=10)


Using recommender  Random

Building recommendation model...
Computing recommendations...

We recommend:
Endless Summer 2, The 5
Sabrina 5
Die Hard 5
Rear Window 5
Get Shorty 5
Executive Decision 5
Shining, The 5
Rock, The 5
Cool Runnings 5
Jaws 2 5

Using recommender  ContentKNN

Building recommendation model...
Computing recommendations...

We recommend:
Get Shorty 4.0
That Darn Cat! 4.0
Up Close and Personal 4.0
Escape to Witch Mountain 4.0
Forget Paris 4.0
Crucible, The 4.0
Private Parts 4.0
2 Days in the Valley 4.0
Life Less Ordinary, A 4.0
Mr. Jones 4.0

Using recommender  SVD

Building recommendation model...
Computing recommendations...

We recommend:
Secrets & Lies 4.3600072773456855
Wrong Trousers, The 4.309806971603653
Rear Window 4.298641864930647
L.A. Confidential 4.2882632753481165
Close Shave, A 4.262845055951264
12 Angry Men 4.204997375586606
Shall We Dance? 4.1980901619607724
Apt Pupil 4.15597026952816
Sling Blade 4.145763159493655
Thin Man, The 4.1400478992880485

Usi

In [14]:
evaluator.sample_top_n_recs(movie_lens_data, test_subject=314, k=10)


Using recommender  Random

Building recommendation model...
Computing recommendations...

We recommend:
Kolya 5
Silence of the Lambs, The 5
Last of the Mohicans, The 5
Angels and Insects 5
Jean de Florette 5
That Darn Cat! 5
Raiders of the Lost Ark 5
Annie Hall 5
Shining, The 5
Pump Up the Volume 5

Using recommender  ContentKNN

Building recommendation model...
Computing recommendations...

We recommend:
Lamerica 4.5
Ninotchka 4.222222222222222
Breakfast at Tiffany's 4.222222222222222
Once Were Warriors 4.222222222222222
Surviving the Game 4.222222222222222
Female Perversions 4.222222222222222
Relative Fear 4.222222222222222
Great Day in Harlem, A 4.222222222222222
Pie in the Sky 4.222222222222222
Robocop 3 4.222222222222222

Using recommender  SVD

Building recommendation model...
Computing recommendations...

We recommend:
Some Like It Hot 4.895451297777243
Dances with Wolves 4.82303501664165
Rear Window 4.769434665510084
True Lies 4.733063037600801
Good Will Hunting 4.706933893668