## <h1> **`KNN Collaborative` Recommender System `item-based`** </h1>  
_Code by Victor Silvis_   
_500777168_ 
  
This notebook contains two versions of KNN collaborative Filtering recommender systems. One based on the traditional concept of only using item-user rating interactions, and the second one adds additional item features on top of these ratings, to try and get a more accurate rating prediction. Both of them have two types of predictions methodologies, regression and classification, which can be set in the recommenders system initialisation.  
  
**RecSys In Notebook:**
1. KNN CF ITEM-BASED with only ratings (Classification or Regression)  
2. KNN CF ITEM-BASED with ratings AND features (Classification or Regression)  
  
**Contents:**
1. Introduction Item-Based
2. Loading Data
3. Recommender system 1 KNN CF without features
4. Recommender system 2 KNN CF with features
5. Comparison of results
6. Conclusion
  
Datasets: Movielens & Netflix

---

### **Introduction Item-Based**  
In contrary to the user-based approach of the other KNN notebook, the recommendations are based upon finding similar movies, based on the rating patterns. So as an example, Instead of consulting with his peers, Eric instead determines whether the movie “Titanic” is right for him by considering the movies that he has already seen. He notices that people that have rated this movie have given similar ratings to the movies “Forrest Gump” and “Wall-E”. Since Eric liked these two movies he concludes that he will also like the movie “Titanic”. This concept will be used in this system. Similar items will be determined based on the favorite item of a user. We then check  the user rated items that are similar to the ones we are going to recommend. These ratings in combination with the distance are used to give a weighted predicted rating for movies the user hasn't seen yet. The recommender systems in this notebook take an user like Eric as input, and recommends items based on his personal favorite item. 
  
**Regression vs Classification**  
Secondly, based on (Nikolakopoulos et al. 2021), there a two methodologies of predicting the rating, classification and regression. For regression we take the ratings from Eric for Forrest Gump and Wall-E, and together with the similarity (distance) to the Titanic, we calculate a weighted score the the Titanic for Eric. With classification we dont calculate a weighted score, we classify an item (in this case Forrest Gump or Wall-E) as the top voter, and we take that rating directly, as this item is the most similar to the Titanic. Choosing the rating that is closest, and classify that one as the most appropriate (weight 1 or 0), has been proven to be a risky but viable approach in literature (Nikolakopoulos et al. 2021). This is one of the things this notebook will analyse.  
  
**Features vs without item Features**  
Finally this notebook will compare two systems. The first one will be solely based on the rating patterns, and utilising it to find similar items. A proven traditional approach of collaborative filtering. But the second system will also include the item features, in this case the genre. It will resemble more like a hybrid system (incoperating aspects of content-based). This notebook will analyse if this more complex system will also perform better in terms of accuracy and speed.

---

#### **Packages and Data**  
First the necessary packages are imported. Together with packages from the selfmade utils folder for distance calculations. In addition the netflix and movielens datasets are imported (see notebook about sampling).

In [2]:
#Packages
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.neighbors import NearestNeighbors

#Import package from own folder
from Utils.distance import Distance

In [93]:
#Data Movielens (~100.000 Rows)
ratings_ml = pd.read_csv('../../Data/Movielens/ratings.csv')
movies_ml = pd.read_csv('../../Data/Movielens/movies.csv')
ratings_ml.columns = ['user_id', 'movie_id', 'rating', 'timestamp']

#Data Netflix (~400.000 rows)
ratings_nf = pd.read_parquet('../../Data/Netflix/NetflixSample.gzip')
ratings_nf[['user_id','movie_id']] = ratings_nf[['user_id','movie_id']].astype(int)
movies_nf = pd.read_parquet('../../Data/Netflix/Netflix_movies.gzip')

In [94]:
#Create a helper dictonary, for demonstration later on
title_dict = dict(zip(movies_ml['movieId'], movies_ml['title']))

----

## **``Recsys 1:`` KNN Collaborative Item-Based**
##### Collaborative Filtering **Without** item Features

**Explanation:**  
The first system is the system without features as explained in the introduction. The system is made to be used on any type of dataset, if the init arguments are set correctly. Among the arguments is the type specification, by default its set to regression, but will later be changed during the comparison section. A few notes about the system. First of all this system does not utilize a similarity matrix, as it we found that it slows down the performance of the system, while not producing better results. The KNN system, finds the similar items to the input vector based on the user-item matrix. For this user-item matrix a CSR or sparse matrix is used. It saves the matrix in a 3 row format, significantly saving memory, especially with the sparse matrices we are dealing with. These factors lead to the fact that this system, can use the 24 million dataset of Netflix, and still is able to be fitted and get quick recommendations (2 a 3 seconds) on a medium laptop. We tested this also with a system using a similarity matrix, and it took significantly longer. Finally, both the systems are build in an object oriented approach. Apart from the benefits of easy transfering of variables it also allows for effeciently use later on. All the functions within the class are explained first in multiline comment within the function.

In [121]:
# Regression / Classification Recsys without features

class KNN_CF_ITEM:

    def __init__(self, userField, itemField, valueField, type='regression'):
        self.userField = userField
        self.itemField = itemField
        self.valueField = valueField
        self.predict_type = type
        self.set_distance_metrics()
        self.set_n_recommendations_and_k()

    '''
    args of recommender:
    userfield   : Name of column in which userid is located
    itemfield   : Name of column in which itemid is located
    valuefield  : Name of column in which ratings are located
    type        : Type of prediction, either regression or classification
    '''

    def set_distance_metrics(self, primary_metric = 'cosine', prediction_metric = 'manhattan'):
        '''set initial distance metrics'''
        self.primary_distance = primary_metric
        self.prediction_metric = prediction_metric

    def set_n_recommendations_and_k(self, n=20, k=10):
        '''Set initial parameters'''
        self.n_neighbors = n
        self.k = k

    def create_matrix(self, data):
        ''' First making some maps, as csr matrix requires range of 0 to N, with increments of 1.
        the id's might not comply to this (e.g. gaps). So mapping is made for both user/item to index
        and inverse. Secondly CSR matrix is made. A CSR matrix is chosen because of the sparsity.
        it makes the recommender perform faster, compared to normal pivot table'''

        #Mapping
        N = data[self.userField].nunique()
        M = data[self.itemField].nunique()
        user_list = np.unique(data[self.userField])
        item_list = np.unique(data[self.itemField])
        self.user_to_index = dict(zip(user_list, range(0, N)))
        self.item_to_index = dict(zip(item_list, range(0,M)))
        self.index_to_item = dict(zip(range(0,M), item_list))
        user_index = [self.user_to_index[i] for i in data[self.userField]]
        item_index = [self.item_to_index[i] for i in data[self.itemField]]

        #Create CSR matrix, items on rows, users on columns
        self.matrix = csr_matrix((data[self.valueField], (item_index, user_index)), shape=(M, N))

    def rated_by_user(self, user):
        '''Helper function, to get the already rated items of the user. From this the favourite
        items will be taken later on, which will be the input for finding similar items. Secondly,
        this is later used to filter out the items already rated. '''

        user_items = self.matrix.getcol(self.user_to_index[user]).A #Get vector of user, with ratings
        rated_items = list(zip(np.where(user_items > 0)[0],user_items[user_items> 0])) #combine index and rating when above 0
        self.rated_items = sorted(rated_items, key=lambda x: x[1], reverse=True) #sort the list on the ratings
        self.avg_user = np.mean([item[1] for item in self.rated_items]) #Take average of user, for later


    def fit(self, metric=None, k=None):
        '''Fit the nearest neighbors model with the matrix. Nearest neighbors was chosen over a
        similarity matrix due to faster performance, especially in combination with CSR'''
        
        if metric is None:
            metric = self.primary_distance
        if k is None:
            k = self.k
        self.KNN = NearestNeighbors(n_neighbors=k, algorithm='brute',metric= metric)
        self.KNN.fit(self.matrix) 
    

    def find_similar_items(self, user):
        """ Based on the favourite items of the user, we recommend similar items.     
        To find enough items that the user has NOT rated yet. We will loop through
        his/her favourite movies untill we have 10 items that the user has not seen yet
        to create a pool of unseen movies. this ensure us that we do not end up with 0 
        recommended items because the user has already seen them all."""

        self.rated_by_user(user) #run function to get favourite movies
        self.favorite_indices = [item[0] for item in self.rated_items] #Retreive the favorite indeces
        unseen_idx = [] #init pool in which items that has not been seen will be stored

        for favorite_index in self.favorite_indices: # Loop through favorite items of user
            item_vector = self.matrix[favorite_index].reshape(1, -1) # Get the vector
            distances, indices = self.KNN.kneighbors(item_vector, n_neighbors=self.n_neighbors) # Find similar items
            combined_list = list(zip(indices[0].tolist(), distances[0].tolist()))[1:] #combine and drop 1st
            filtered_list = [(index, distance) for index, distance in combined_list if index not in self.favorite_indices] #filter out the ones user has rated
            
            for index, distance in filtered_list: #Loop through the filtered list of unseen items
                if index not in [item[0] for item in unseen_idx]: #Check if not already in pool of items
                   unseen_idx.append((index, distance)) #Add to unseen pool of items
            if len(unseen_idx) >= self.n_neighbors: #Stop if we have enough recommendations
                break #stop if we have enough unwatched items
        
        #store the (index, distance) of similar (unseen) items to  his/her favourites
        self.similar_items = unseen_idx 


    def ratings_similar_items(self,user, n=100):
        '''Now that we have the N similar items of the favourite items of the user. We look into
        items that the user has ALREADY rated, and are similar to the ones we are going to recommend.
        This will give us a good indication of how the user is going to rate the unseen item, that
        we are going to recommend. We retreive those ratings and distance, to calculate the prediction
        later on. Secondly as we are looking for neighbors of the top N recommendations, we can assign
        an additional distance metric to hypertune. For this distance calculation we use our own made
        package for calculating the distances'''

        distance_calc = Distance()      #init our distance package
        result_list = []                #init list to store results

        #We look into the already rated items, and pick the ones that are closest to the unseen item
        for index, _ in self.similar_items: 
            input_vector = self.matrix[index].A[0]
            distances = []
            for idx, rating in self.rated_items[:n]:
                target_vector = self.matrix[idx].A[0]
                distance = distance_calc.calculate(vector1 = input_vector, 
                                                   vector2 = target_vector, 
                                                   metric=self.prediction_metric)
                distances.append((idx, distance, rating))
            distances = sorted(distances, key=lambda x: x[1])[:10] #sort the list on the distance
            result_list.append((index, distances))

        self.ratings_neighbors = result_list
        return result_list
    

    def predict_ratings(self):
        '''This function will go through the ratings of the user for the neighbours of the N recommended items.
        Next, there a two ways of predicting the ratings for the top N unseen items (Nikolakopoulos et al. 2021).

        1.Regression:
        if chosen for regression a prediction will be made for recommended item X by calculating a 
        weighted score. We take the scores that the user has given to items that are very similar to item X.
        Then, by using the distance, we come to a weighted score for item X. This is less risky to 
        classification, as the final rating is based on more ratings, and therefore a safer bet.

        Formula:  sum(w1 * rating1 + w2 * rating2) / sum(w1 + w2)

        2.Classification:
        if chosen for classification. Instead of calculating a weighted average, we take the rating of the
        item that is most similar to the unseen item we are going to recommend. This is riskier, but can
        actually give better results. Especially, if the discrete rating scale (e.g. 1 to 5) is a small range.
        Its less recommended for ratings scales like 1 to 25. Choosing the rating that is closest, and classify
        that one as the most appropriate (weight 1 or 0), has been proven to be a viable approach in literature.'''
          
        # 1. Regression prediction method
        if self.predict_type == 'regression':
            weighted_averages = [] 
            for idx, idx_dist_rating in self.ratings_neighbors:
                weighted_sum = sum_inverse_distances = 0
                for idx_nb, distance, rating in idx_dist_rating:
                    if distance != 0:
                        inverse_distance = 1 / distance
                        weighted_sum += inverse_distance * rating
                        sum_inverse_distances += inverse_distance
                if sum_inverse_distances != 0:
                    weighted_avg = weighted_sum / sum_inverse_distances
                else:
                    weighted_avg = self.avg_user #predict avg of user because no data (e.g. only rated 1 item)
                weighted_averages.append((self.target_user, idx, weighted_avg))
            self.recommendations_predictions = sorted(weighted_averages, key=lambda x: x[1], reverse=True)
        

        # 2. Classification Prediction Method
        elif self.predict_type == 'classification':
            sorted_voters = [] #list of sorted, and only the top voters (with weight 1 not 0)

            for item in self.ratings_neighbors:
                idx, entries = item #each item in recommendations format:
                entries.sort(key=lambda x: x[1]) #sort on weight (distance)
                top_vote = entries[0] #Take the one with the highest weight
                sorted_voters.append((idx, top_vote)) #take the rating of the top vote
            
            #append user idx, recommended item and predicted score to the list
            idx_and_predictions = []
            for item in sorted_voters:
                idx_and_predictions.append((self.target_user, item[0], item[1][2]))
        
            #Sort the list, so the indices with highest pred ratings are on top
            self.recommendations_predictions = sorted(idx_and_predictions, key=lambda x: x[2], reverse=True)
        
        #If not chosen for either regression or classification, display error.
        else:
            print('Please select valid prediction type (regression or classification)')


    def recommend(self, userlist, prints=True):
        ''' This is the main function that combines most helper functions to recommend N new items 
        to a user, with the predicted ratings. This function will convert them also back to ID's
        from indices, as its easier to evaluate '''  

        recommendations = []
        for user in userlist:
            self.target_user = user
            self.find_similar_items(user)
            self.ratings_similar_items(user)
            self.predict_ratings()
            recommendations.append([(userid, self.index_to_item[idx], rating) for userid, idx, rating in self.recommendations_predictions][:self.n_neighbors]) #convert back to ID's   
        recommendations = [sorted(sublist, key=lambda x: x[2], reverse=True) for sublist in recommendations]
        if prints == True:
            print(recommendations)
        else:
            return recommendations
        

    def evaluate(self, data, limit=0.5, prints=False):
        ''' This is the evaluate function, the input should be the test data, if called manually.
        The hypertune function will use the validation data automatically. This function compares
        the predicted rating for the unseen items based on the training set, with the actual ratings
        in the test (or validation) data set. It returns RMSE & MAE.

        data    :   Actual data (usally test or validation in case of hypertune function)
        limit   :   Manual limit to limit amount of users to be tested (for big testdata)
        prints  :   Default is True, if True it outputs the results, for hypertune its turned Off
        '''

        user_list = data[self.userField].unique()
        self.count = 0
        if limit is not None:
            user_list = user_list[:int(len(user_list) * limit)]
        if prints == True: #If True output the prints for information
            print(len(user_list), f' : Users are going to be evaluted. {limit} of input data')
        predictions, targets = [], []
        for user in user_list:
            output = self.recommend([user], prints=False)
            for rec in output:
                for item in rec:
                    actual_rating = actual_rating = data[(data[self.userField] == item[0]) & (data[self.itemField] == item[1])][self.valueField].values
                    if (actual_rating.size > 0 & (item[2] > 0)):
                        self.count += 1
                        targets.append(actual_rating[0]) #store actual rating
                        predictions.append(item[2])
        rmse = np.sqrt(mean_squared_error(predictions, targets))
        mae = mean_absolute_error(predictions, targets)
        
        if prints == True: #if True print the RMSE score
            print('RMSE: '.ljust(20), round(rmse, 3))
            print('MAE: '.ljust(20), round(mae, 3))
            print(f'{self.count} Valid Ratings Evaluated')
        else: # else output the rmse
            return rmse
        
    def hypertune(self, data, k_folds=5, prints=True, limit=0.5):
        ''' Hyperparameter Tuning function. Utilizes the evaluation function with a K-Fold
        cross validation approach. One of the K fold acts as validation dataset to calculate
        the RMSE for the particular parameters. This is done K times for each parameter combination.

        data    :   (Train) Data to perform hypertuning on,
        k_folds :   How many times to split the train data, and cross validate
        prints  :   If True, outputs best metrics and RMSE found
        '''

        #Default set of parameters to test
        k_list = [10,15,20]
        n_rec_metric = ['cosine', 'manhattan', 'euclidean']
        prediction_metric = ['manhattan','cosine', 'euclidean']

        #dictonary to store the results and best_rmse that will get updated
        best_params_dict = {}
        best_rmse = 999

        #Loop through every combination, and for each perform K-Fold cross validation
        for k in k_list:
            for rec_dist in n_rec_metric:
                for pred_dist in prediction_metric:
                    kf = KFold(n_splits=k_folds, shuffle=True)
                    rmse_scores = []
                    for train_idx, validation_idx in kf.split(data):
                        train_data = data.iloc[train_idx]
                        val_data = data.iloc[validation_idx]
                        self.create_matrix(train_data)
                        self.set_distance_metrics(primary_metric=rec_dist, 
                                                prediction_metric=pred_dist)
                        self.set_n_recommendations_and_k(k=k)
                        self.fit()
                        rmse = self.evaluate(val_data, limit=limit, prints=False)
                        rmse_scores.append(rmse)
                    avg_rmse = np.mean(rmse_scores)
                    if avg_rmse < best_rmse:
                        best_rmse = avg_rmse
                        best_params_dict = {'K':k,
                                            'primary_metric': rec_dist,
                                            'prediction_metric': pred_dist,
                                            'RMSE': round(best_rmse,3)}
        
        #update the system with new best parameters
        self.set_n_recommendations_and_k(k=best_params_dict['K'])
        self.set_distance_metrics(prediction_metric=best_params_dict['prediction_metric'],
                                  primary_metric=best_params_dict['primary_metric'])


        #Print the best results and parameters found
        if prints == True:
            print(f"Best RMSE: {round(best_rmse, 3)}")
            print("-------------------")
            print("Best Parameters:")
            for param, value in best_params_dict.items():
                print(f"{param}:".ljust(30) + f"{value}")

## **``Recsys 2:`` KNN Collaborative Item-Based**
##### Collaborative Filtering **With** Features Genre

**Explanation:**  
The second system is the system WITH item features. This requires an additional step and argument compared to the other system. This system needs an item feature matrix as input, in addition to the ratings data. The item feature matrix are made first below, by using a simple one-hot encoding technique to get the genres per movie. This is done for both the movielens and netflix data. Futhermore, this system shares alot of similarities with the other system, only the process of creating the matrix has been changed to incorperate the addition of the hot encoded genres.

#### **2.1 Feature Engineering**

In [5]:
#Create a feature matrix as input for additional features next to collaborative (MovieLens)
movies_ml = movies_ml[['movieId', 'genres']]
movies_ml.columns = ['movie_id', 'genres']
genre_one_hot = movies_ml['genres'].str.get_dummies("|")
item_features_ml = pd.concat([movies_ml,genre_one_hot],axis=1).drop(columns='genres', axis=0)

In [6]:
#Create a feature matrix as input for additional features next to collaborative (Netflix)
movies_nf = movies_nf[['movie_id', 'genres']]
genre_one_hot = movies_nf['genres'].str.get_dummies("|")
item_features_nf = pd.concat([movies_nf,genre_one_hot],axis=1).drop(columns='genres', axis=0)

#### **2.2 RecSys with Features added**

In [134]:
# Regression / Classification Recsys with features

class KNN_CF_ITEM_FEATURES:


    def __init__(self, userField, itemField, valueField, featurematrix, type='regression'):
        self.userField = userField
        self.itemField = itemField
        self.valueField = valueField
        self.feature_matrix = featurematrix
        self.predict_type = type
        self.set_distance_metrics()
        self.set_n_recommendations_and_k()
        
    def set_distance_metrics(self, primary_metric = 'cosine', prediction_metric = 'manhattan'):
        ''' Function to set the default distance metrics, can be called later (by hypertune)
        to set the distance metrics to something different.'''
        self.primary_distance = primary_metric
        self.prediction_metric = prediction_metric
    
    def set_n_recommendations_and_k(self,n=10, k=10):
        self.n_neighbors = n
        self.k = k

    def create_matrix(self, data):
        ''' As with the other system this function creates some usefull maps. However
        for this system it also includes combining the item vectors (genre) with the
        normal rating vectors and storing them in a combined matrix.'''

        N = data[self.userField].nunique()
        M = data[self.itemField].nunique()
        user_list = np.unique(data[self.userField])
        item_list = np.unique(data[self.itemField])
        self.user_to_index = dict(zip(user_list, range(0, N)))
        self.item_to_index = dict(zip(item_list, range(0,M)))
        self.index_to_user = dict(zip(range(0,N), user_list))
        self.index_to_item = dict(zip(range(0,M), item_list))
        user_index = [self.user_to_index[i] for i in data[self.userField]]
        item_index = [self.item_to_index[i] for i in data[self.itemField]]

        #Create CSR matrix, items on rows, users on columns
        self.rating_matrix = csr_matrix((data[self.valueField], (item_index, user_index)), shape=(M, N))

        #combine vectors features and ratings
        combined_item_vectors = np.zeros((M, N + (self.feature_matrix.shape[1]-1)))

        # For each item in the standard user-item matrix, combine the vector for that item taken from the 
        # Feature matrix, with the vectors of the hot-encoded genres
        for item_id in item_list:
            item_index = self.item_to_index[item_id] #Get the item index from the dictonary
            rating_vector = self.rating_matrix[item_index].toarray().flatten()  # Convert to dense array
            feature_vector = self.feature_matrix[self.feature_matrix[self.itemField] == item_id].values[:, 1:].flatten() #Get vector from feature matrix for the specific item
            combined_vector = np.concatenate((rating_vector, feature_vector)) #combine rating vector with genre vector
            combined_item_vectors[item_index, :len(combined_vector)] = combined_vector #Store the combined vector in the combined matrix.
        self.combined_matrix = combined_item_vectors

    def rated_by_user(self, user):
        #Helper function to get items already rated by user
        user_items = self.rating_matrix.getcol(self.user_to_index[user]).A #Get vector of user, with ratings
        rated_items = list(zip(np.where(user_items > 0)[0],user_items[user_items> 0])) #combine index and rating when above 0
        self.rated_items = sorted(rated_items, key=lambda x: x[1], reverse=True) #sort the list on the ratings
        self.avg_user = np.mean([item[1] for item in self.rated_items])

    def fit(self, metric = None, k = None):
        if metric is None:
            metric = self.primary_distance
        if k is None:
            k = self.k
        self.KNN = NearestNeighbors(n_neighbors=k, algorithm='brute',metric= metric)
        self.KNN.fit(self.combined_matrix)

    def find_similar_items(self, user):
        #Helper function to find k similar items

        """ To find enough items that the user has NOT rated yet. We will loop through
        his/her favourite movies untill we have 10 movies that the user has not seen yet
        to create a pool of unseen movies. this ensure us that we do not end up with 0 
        recommended items because the user has already seen them all."""

        self.rated_by_user(user) #run function to get favourite movies
        self.favorite_indices = [item[0] for item in self.rated_items]
        unseen_idx = [] #init pool in which items that has not been seen will be stored


        for favorite_index in self.favorite_indices:
            item_vector = self.combined_matrix[favorite_index].reshape(1, -1)
            distances, indices = self.KNN.kneighbors(item_vector, n_neighbors=self.n_neighbors)
            combined_list = list(zip(indices[0].tolist(), distances[0].tolist()))[1:] #combine and drop 1st
            filtered_list = [(index, distance) for index, distance in combined_list if index not in self.favorite_indices]
            for index, distance in filtered_list:
                if index not in [item[0] for item in unseen_idx]:
                   unseen_idx.append((index, distance))
            if len(unseen_idx) >= self.n_neighbors:
                break #stop if we have enough unwatched movies
        
        self.similar_items = unseen_idx
        return self.similar_items
    
    def ratings_similar_items(self,user, n=100):
        '''Now that we have the N similar items of the favourite items of the user. We look into
        items that the user has ALREADY rated, and are similar to the ones we are going to recommend.
        This will give us a good indication of how the user is going to rate the unseen item, that
        we are going to recommend. We retreive those ratings and distance, to calculate the prediction
        later on. Secondly as we are looking for neighbors of the top N recommendations, we can assign
        an additional distance metric to hypertune. For this distance calculation we use our own made
        package for calculating the distances'''

        distance_calc = Distance()      #init our distance package
        result_list = []                #init list to store results

        #We look into the already rated items, and pick the ones that are closest to the unseen item
        for index, _ in self.similar_items: 
            input_vector = self.combined_matrix[index]
            distances = []
            for idx, rating in self.rated_items[:n]:
                target_vector = self.combined_matrix[idx]
                distance = distance_calc.calculate(vector1 = input_vector, 
                                                   vector2 = target_vector, 
                                                   metric=self.prediction_metric)
                distances.append((idx, distance, rating))
            distances = sorted(distances, key=lambda x: x[1])[:10] #sort the list on the distance
            result_list.append((index, distances))

        self.ratings_neighbors = result_list
        return result_list

    def predict_ratings(self):
        '''This function will go through the ratings of the user for the neighbours of the N recommended items.
        Next, there a two ways of predicting the ratings for the top N unseen items (Nikolakopoulos et al. 2021).

        1.Regression:
        if chosen for regression a prediction will be made for recommended item X by calculating a 
        weighted score. We take the scores that the user has given to items that are very similar to item X.
        Then, by using the distance, we come to a weighted score for item X. This is less risky to 
        classification, as the final rating is based on more ratings, and therefore a safer bet.

        Formula:  sum(w1 * rating1 + w2 * rating2) / sum(w1 + w2)

        2.Classification:
        if chosen for classification. Instead of calculating a weighted average, we take the rating of the
        item that is most similar to the unseen item we are going to recommend. This is riskier, but can
        actually give better results. Especially, if the discrete rating scale (e.g. 1 to 5) is a small range.
        Its less recommended for ratings scales like 1 to 25. Choosing the rating that is closest, and classify
        that one as the most appropriate (weight 1 or 0), has been proven to be a viable approach in literature.'''
          
        # 1. Regression prediction method
        if self.predict_type == 'regression':
            weighted_averages = [] 
            for idx, idx_dist_rating in self.ratings_neighbors:
                weighted_sum = sum_inverse_distances = 0
                for idx_nb, distance, rating in idx_dist_rating:
                    if distance != 0:
                        inverse_distance = 1 / distance
                        weighted_sum += inverse_distance * rating
                        sum_inverse_distances += inverse_distance
                if sum_inverse_distances != 0:
                    weighted_avg = weighted_sum / sum_inverse_distances
                else:
                    weighted_avg = self.avg_user #predict avg of user because no data (e.g. only rated 1 item)
                weighted_averages.append((self.target_user, idx, weighted_avg))
            self.recommendations_predictions = sorted(weighted_averages, key=lambda x: x[1], reverse=True)
        

        # 2. Classification Prediction Method
        elif self.predict_type == 'classification':
            sorted_voters = [] #list of sorted, and only the top voters (with weight 1 not 0)

            for item in self.ratings_neighbors:
                idx, entries = item #each item in recommendations format:
                entries.sort(key=lambda x: x[1]) #sort on weight (distance)
                top_vote = entries[0] #Take the one with the highest weight
                sorted_voters.append((idx, top_vote)) #take the rating of the top vote
            
            #append user idx, recommended item and predicted score to the list
            idx_and_predictions = []
            for item in sorted_voters:
                idx_and_predictions.append((self.target_user, item[0], item[1][2]))
        
            #Sort the list, so the indices with highest pred ratings are on top
            self.recommendations_predictions = sorted(idx_and_predictions, key=lambda x: x[2], reverse=True)
        
        #If not chosen for either regression or classification, display error.
        else:
            print('Please select valid prediction type (regression or classification)')

    
    def recommend(self, userlist, prints=True):
        ''' This is the main function that combines most helper functions to recommend N new items 
        to a user, with the predicted ratings. This function will convert them also back to ID's
        from indices '''  
        n_neighbors = self.n_neighbors
        recommendations = []
        for user in userlist:
            self.target_user = user
            self.find_similar_items(user)
            self.ratings_similar_items(user)
            self.predict_ratings()
            recommendations.append([(userid, self.index_to_item[idx], rating) for userid, idx, rating in self.recommendations_predictions][:n_neighbors]) #convert back to ID's
        recommendations = [sorted(sublist, key=lambda x: x[2], reverse=True) for sublist in recommendations]
        if prints == True:
            print(recommendations)
        else:
            return recommendations
        
    def evaluate(self, data, limit=0.5, prints=False):
        ''' This is the evaluate function, the input should be the test data, if called manually.
        The hypertune function will use the validation data automatically. This function compares
        the predicted rating for the unseen items based on the training set, with the actual ratings
        in the test (or validation) data set. It returns RMSE & MAE.

        data    :   Actual data (usally test or validation in case of hypertune function)
        limit   :   Manual limit to limit amount of users to be tested (for big testdata)
        prints  :   Default is True, if True it outputs the results, for hypertune its turned Off
        '''

        user_list = data[self.userField].unique()
        self.count = 0
        if limit is not None:
            user_list = user_list[:int(len(user_list) * limit)]
        if prints == True: #If True output the prints for information
            print(len(user_list), f' : Users are going to be evaluted. {limit} of input data')
        predictions, targets = [], []
        for user in user_list:
            output = self.recommend([user], prints=False)
            for rec in output:
                for item in rec:
                    actual_rating = actual_rating = data[(data[self.userField] == item[0]) & (data[self.itemField] == item[1])][self.valueField].values
                    if (actual_rating.size > 0 & (item[2] > 0)):
                        self.count += 1
                        targets.append(actual_rating[0]) #store actual rating
                        predictions.append(item[2])
        rmse = np.sqrt(mean_squared_error(predictions, targets))
        mae = mean_absolute_error(predictions, targets)
        
        if prints == True: #if True print the RMSE score
            print('RMSE: '.ljust(20), round(rmse, 3))
            print('MAE: '.ljust(20), round(mae, 3))
            print(f'{self.count} Valid Ratings Evaluated')
        else: # else output the rmse
            return rmse
        
    def hypertune(self, data, k_folds=5, prints=True, limit=0.5):
        ''' Hyperparameter Tuning function. Utilizes the evaluation function with a K-Fold
        cross validation approach. One of the K fold acts as validation dataset to calculate
        the RMSE for the particular parameters. This is done K times for each parameter combination.

        data    :   (Train) Data to perform hypertuning on,
        k_folds :   How many times to split the train data, and cross validate
        prints  :   If True, outputs best metrics and RMSE found
        '''

        #Default set of parameters to test
        k_list = [10,15,20]
        n_rec_metric = ['cosine', 'manhattan', 'euclidean']
        prediction_metric = ['manhattan''cosine', 'euclidean']

        #dictonary to store the results and best_rmse that will get updated
        best_params_dict = {}
        best_rmse = 999

        #Loop through every combination, and for each perform K-Fold cross validation
        for k in k_list:
            for rec_dist in n_rec_metric:
                for pred_dist in prediction_metric:
                    kf = KFold(n_splits=k_folds, shuffle=True)
                    rmse_scores = []
                    for train_idx, validation_idx in kf.split(data):
                        train_data = data.iloc[train_idx]
                        val_data = data.iloc[validation_idx]
                        self.create_matrix(train_data)
                        self.set_distance_metrics(primary_metric=rec_dist, 
                                                prediction_metric=pred_dist)
                        self.set_n_recommendations_and_k(k=k)
                        self.fit()
                        rmse = self.evaluate(val_data, limit=limit, prints=False)
                        rmse_scores.append(rmse)
                    avg_rmse = np.mean(rmse_scores)
                    if avg_rmse < best_rmse:
                        best_rmse = avg_rmse
                        best_params_dict = {'K':k,
                                            'primary_metric': rec_dist,
                                            'prediction_metric': pred_dist,
                                            'RMSE': round(best_rmse,3)}
        
        #update the system with new best parameters
        self.set_n_recommendations_and_k(k=best_params_dict['K'])
        self.set_distance_metrics(prediction_metric=best_params_dict['prediction_metric'],
                                  primary_metric=best_params_dict['primary_metric'])


        #Print the best results and parameters found
        if prints == True:
            print(f"Best RMSE: {round(best_rmse, 3)}")
            print("-------------------")
            print("Best Parameters:")
            for param, value in best_params_dict.items():
                print(f"{param}:".ljust(30) + f"{value}")

---

## **3. Example Usage**
Before going to the comparisons of the models, we will show an example usage of the recommender, to show some of its features. We will show an example using, the regression methodology on the movielens data. We get the recommendations for a user, based on his/her favorite movie. 

#### **3.1 Train, test split**
The train and test split is done via the traditional way. The reasoning behind this is that splitting it on e.g. time, resulted in to few actual interactions, to comprehensively evaluate the systems.

In [88]:
#Split Train Test movielens
train_data, test_data = train_test_split(ratings_ml, test_size=0.2, random_state=42)

#### **3.2 Init the Recommender and Hypertune**
We give the recommender the names of the columns in which the relevant items are located. Specify the type of prediction methodology it needs to use, and give it the train data. In the hypertune function this data is splitted again during the K-Fold crossvalidation into train and validation. 

In [135]:
#Init the recommender
recommender = KNN_CF_ITEM('user_id','movie_id','rating', type='regression')
recommender.create_matrix(train_data)
recommender.fit()
recommender.hypertune(train_data, limit=0.05, prints=True)

Best RMSE: 0.962
-------------------
Best Parameters:
K:                            10
primary_metric:               cosine
prediction_metric:            cosine
RMSE:                         0.962


#### **3.3 Get recommendations**

In [129]:
#Get recommendations for user 3, based on his/her favourite movies
output = recommender.recommend([3], prints=False)

print('Recommended Unseen Movies:')
for _, item, rating in output[0][:10]:
    movie = title_dict[item]
    print(f'{movie}'.ljust(50), round(rating,2))

Recommended Unseen Movies:
Predator 2 (1990)                                  4.85
Death Wish 4: The Crackdown (1987)                 4.58
Hollywood Knights, The (1980)                      4.58
Bustin' Loose (1981)                               4.58
Bronco Billy (1980)                                4.57
Crimson Pirate, The (1952)                         4.57
Black Sabbath (Tre volti della paura, I) (1963)    4.57
Iron Eagle (1986)                                  4.54
RoboCop (1987)                                     4.44
Hellraiser (1987)                                  4.43


---

# **4. Comparing The Recommender Systems**
Now in the following section we will compare the recommender systems. For each system and prediction methodology, we fit the system, hypertune it, and evaluate it on the test data. We split the comparison for the two datasets. Futhermore, on each dataset we will also try the classification vs regression methodology. This will give us some nice insights into which system works best (item features vs no item features) and which prediction methodology. The insights are discussed at the end of each dataset analysis. This is the structure for comparing the systems:  
  
**Movielens:**
1) Regression system standard
2) Classification system standard
3) Regression system with item features
4) Classification system with item features
5) Findings

**Netflix:**
1) Regression system standard
2) Classification system standard
3) Regression system with item features
4) Classification system with item features
5) Findings

---

## **4.1 Results `MovieLens`**

First we will test the recommender systems and their type of rating prediciton (regression or classification) on the `movielens` dataset. For each recommender system we will input the necessary arguments, and specify the type of prediction methodology, and hypertune the model. Before that we will split the movielens data. We split the normal way, the reasoning behind this is that splitting it on e.g. time, resulted in to few actual interactions, to comprehensively evaluate the systems.

In [None]:
#Split Train Test movielens
train_data_ml, test_data_ml = train_test_split(ratings_ml, test_size=0.2, random_state=42)

### **``Recsys 1:`` Regression**  
KNN item-based CF, without extra item features

In [9]:
#Init the recsys, with movielens traindata and type regression
KNN_CF_1 = KNN_CF_ITEM('user_id', 'movie_id', 'rating', type='regression')
KNN_CF_1.create_matrix(train_data_ml)
KNN_CF_1.hypertune(train_data_ml, limit=0.5, k_folds=3, prints=False)
KNN_CF_1.fit()

In [10]:
#Evaluate the recommender on test data
KNN_CF_1.evaluate(test_data_ml, limit=1, prints=True)

610  : Users are going to be evaluted. 1 of input data
RMSE:                0.909
MAE:                 0.682
1255 Valid Ratings Evaluated


### **``Recsys 2:`` Classification**  
KNN item-based CF, without extra item features

In [11]:
#Init the recsys, with movielens traindata and type classification
KNN_CF_2 = KNN_CF_ITEM('user_id', 'movie_id', 'rating', type='classification')
KNN_CF_2.create_matrix(train_data_ml)
KNN_CF_2.hypertune(train_data_ml, limit=0.5, k_folds=3, prints=False)
KNN_CF_2.fit()

In [12]:
#Evaluate the recommender on test data
KNN_CF_2.evaluate(test_data_ml, limit=1, prints=True)

610  : Users are going to be evaluted. 1 of input data
RMSE:                1.045
MAE:                 0.716
1384 Valid Ratings Evaluated


### **``Recsys 3 With Features:`` Regression**  
KNN item-based CF, with extra item features

In [13]:
#Init the recsys, with movielens traindata and type classification
KNN_CF_3 = KNN_CF_ITEM_FEATURES('user_id', 'movie_id', 'rating', featurematrix=item_features_ml, type='regression')
KNN_CF_3.create_matrix(train_data_ml)
KNN_CF_3.hypertune(train_data_ml, limit=0.03, k_folds=3, prints=False)
KNN_CF_3.fit()

In [14]:
#Evaluate the recommender on test data
KNN_CF_3.evaluate(test_data_ml, limit=1, prints=True)

610  : Users are going to be evaluted. 1 of input data
RMSE:                0.886
MAE:                 0.673
768 Valid Ratings Evaluated


### **``Recsys 4 With Features:`` Classification**  
KNN item-based CF, with extra item features

In [61]:
#Init the recsys, with movielens traindata and type classification
KNN_CF_4 = KNN_CF_ITEM_FEATURES('user_id', 'movie_id', 'rating', featurematrix=item_features_ml, type='classification')
KNN_CF_4.create_matrix(train_data_ml)
KNN_CF_4.hypertune(train_data_ml, limit=0.03, k_folds=3, prints=False)
KNN_CF_4.fit()

In [62]:
#Evaluate the recommender on test data
KNN_CF_4.evaluate(test_data_ml, limit=1, prints=True)

610  : Users are going to be evaluted. 1 of input data
RMSE:                0.986
MAE:                 0.667
815 Valid Ratings Evaluated


## **Findings:** Movielens 
  

**Regression vs Classification:**
For both with and without additional features, the classification had worse accuracy. This is understandable, as classification is a riskier approach. However it can also fall the other way, because if its correct, it will have a closer predicted rating than is possible with the regression methodology, as that is based and therefore spread among more ratings. However, because many ratings are evaluated, it might have an effect on the evaluation of the classification, as it will bring it closer to the average. In terms of speed, both methods are similar; both are able to give recommendations for 5 people within 0.2 seconds. Also, the evaluation takes around the same time for both ways of predicting ratings. Therefore, for this dataset, the traditional regression method is the preferred option, as it returns a higher RMSE while having no significant impact on performance.
  
**With vs Without additional Features:**  
Secondly, let's take a look at the different results when using only ratings to find similar items versus using also the genres of the items. Here, we observe that by adding item features, we are able to achieve a lower RMSE (0.886 vs 0.909). Furthermore, we see the same trend in the classification task, where the model incorporating the extra features outperforms the one without (0.986 vs 1.045). However, a drawback of adding additional features is the impact on speed. Particularly during the fitting process, the time has increased from 0.3 seconds to 1.8 seconds, as a larger and additional feature matrix is required. Additionally, the recommendation time has also increased; when recommending for 10 users, the model that uses the extra features takes 0.2 seconds longer (0.8 seconds vs 1.0 second). This is something to keep in mind when scaling up to a larger dataset in the future. However, for this analysis. The system that incorperates additional item features, is the preferred system, which is expected as it has more information to base its similarity on. However, we should keep in mind that mixing content based principles with the collaborative approach has its drawbacks, that cannot be measured at this moment. It can be that due to the implementation of genres, people will get stuck with recommendetions within a specific genre, as we are moving more towards features then rating patterns as foundation. This might effect customer satisfaction and overall novelty of recommendations in the long run.
  
**Best Recommender system Movielens:**  
Regression with additional features


---

# **4.2 Results `Netflix`**  
Now we will do the same tests on the Netflix dataset. This dataset is 4x bigger, with also a 3.8 times the amount of users. The ratio item user is 7 to 1, while the movielens data is 15 to 1. (See EDA for more information). This means that Item-based should work in theory a bit better as its closer to its ideal scenario of having less items than users. We will now perform the same tests and write our conclusions below. Once again we start with splitting the Netflix data.

In [17]:
#Split Train Test Netflix
train_data_nf, test_data_nf = train_test_split(ratings_nf, test_size=0.2, random_state=42)

### **``Recsys 1:`` Regression**  
KNN item-based CF, without extra item features

In [18]:
#Init the recsys, with netflix traindata and type regression
KNN_CF_N1 = KNN_CF_ITEM('user_id', 'movie_id', 'rating', type='regression')
KNN_CF_N1.create_matrix(train_data_nf)
KNN_CF_N1.hypertune(train_data_nf, limit=0.03, k_folds=3, prints=False)
KNN_CF_N1.fit()

In [26]:
#Evaluate the recommender on test data
KNN_CF_N1.evaluate(test_data_nf, limit=0.5, prints=True)

982  : Users are going to be evaluted. 0.5 of input data
RMSE:                1.037
MAE:                 0.73
2779 Valid Ratings Evaluated


### **``Recsys 2:`` Classification**  
KNN item-based CF, without extra item features

In [27]:
#Init the recsys, with netflix traindata and type classification
KNN_CF_N2 = KNN_CF_ITEM('user_id', 'movie_id', 'rating', type='classification')
KNN_CF_N2.create_matrix(train_data_nf)
KNN_CF_N2.hypertune(train_data_nf, limit=0.03, k_folds=3, prints=False)
KNN_CF_N2.fit()

In [30]:
#Evaluate the recommender on test data
KNN_CF_N2.evaluate(test_data_nf, limit=0.4, prints=True)

786  : Users are going to be evaluted. 0.4 of input data
RMSE:                1.091
MAE:                 0.701
2443 Valid Ratings Evaluated


### **``Recsys 3 With Features:`` Regression**  
KNN item-based CF, with extra item features

In [68]:
#Init the recsys with features, with netflix traindata and type regression
KNN_CF_N3 = KNN_CF_ITEM_FEATURES('user_id', 'movie_id', 'rating', featurematrix= item_features_nf, type='regression')
KNN_CF_N3.create_matrix(train_data_nf)
KNN_CF_N3.hypertune(train_data_nf, limit=0.03, k_folds=3, prints=False)
KNN_CF_N3.fit()

In [69]:
#Evaluate the recommender on test data
KNN_CF_N3.evaluate(test_data_nf, limit=0.4, prints=True)

786 : Users are going to be evaluted. 0.4 of input data
RMSE:                0.922
MAE:                 0.634
2503 Valid Ratings Evaluated


### **``Recsys 4 With Features:`` Classification**  
KNN item-based CF, with extra item features

In [75]:
#Init the recsys with features, with netflix traindata and type classification
KNN_CF_N4 = KNN_CF_ITEM_FEATURES('user_id', 'movie_id', 'rating', featurematrix=item_features_nf, type='classification')
KNN_CF_N4.create_matrix(train_data_nf)
KNN_CF_N4.hypertune(train_data_nf, limit=0.03, k_folds=3, prints=False)
KNN_CF_N4.fit()

In [76]:
#Evaluate the recommender on test data
KNN_CF_N4.evaluate(test_data_nf, limit=0.4, prints=True)

786 : Users are going to be evaluted. 0.4 of input data
RMSE:                0.935
MAE:                 0.6
2342 Valid Ratings Evaluated


## **Findings:** Netflix
  
  
**Regression vs Classification:**
As with MovieLens, the classification performs slightly worse, but not to the same extent as observed with MovieLens. With the Netflix dataset, the classification methodology RMSE performance drop is actually negliable, but nevertheless it continues the trend that it performs worse than the regression method. In general this suggests that for these datasets, classification may not be the preferred method. One possibility would for future research would be to test a system that tests for which users this method works best and implement a mixed recommender that utilizes the most suitable methodology for certain types of users. Finally, similar to MovieLens, the performance remains comparable in terms of speed between the two predictions methodologies.
  
**With vs Without additional Features:**  
As with MovieLens, using additional features results in a better RMSE. However, similar to MovieLens, fitting the model takes longer with the additional features (3 seconds with features, 0.3 seconds without). The model without features is almost unaffected by the larger dataset, while the model with features does experience some impact (1 second to 3 seconds). As a test, we applied the model without features to the Netflix dataset with 24,000,000 rows, and it remained very fast, taking only 3 seconds for recommendations on a medium laptop. Conversely, the model with features would struggle to perform at that scale, due to a bigger matrix. However, in this scenario, apart from fitting the model, the recommendation speed is negligibly different compared to the model without features. Therefore, as with MovieLens, the regression-based model with additional features is the preferred choice, but ofcourse has might have the same drawbacks in the long run as discussed in the findings of the movielens.
  
**Best Recommender system Netflix:**  
regression based model with features

---

# **5 Final Conclusions** 
In general, the KNN-based item-based collaborative system performed reasonably well. As a benchmark, the off-the-shelf 'Surprise' algorithms achieved an RMSE of 0.98 on the MovieLens data, while this system often performed even better. Furthermore, some valuable insights have been gleaned regarding which type of system performs best. For both datasets, systems that incorporated extra item features and used regression methods for predicting ratings consistently outperformed others. Despite the small discrete rating scale of 1-5, classification didn't perform as well as expected in this analysis.Additionally, as mentioned, systems that included additional features performed better, albeit at a cost. The performance of utilizing an additional feature matrix slowed down, especially during fitting and to some extent in recommendation speed. Nonetheless, the systems generally performed quite well, capable of recommending items very quickly to multiple users, even on a medium-performing laptop. However, it's worth noting that when dealing with even larger datasets, the model incorporating features might experience slower performance compared to the alternative model. And could on the long run, restrict novelity of items recommended to users, as the system leans more towards a hybrid with content based, which are known to struggle with this issue. In conclusion, in this static analysis, for both datasets, a KNN-based algorithm that incorporates item features in addition to ratings, using a regression approach instead of classification, emerged as the most effective choice.