# 1 Introduction

The applications of recommender systems in businesses have become increasingly popular. Recommender systems apply various sources of information including demographics, social, and preferences to provide users with tailored recommended items. Moreover, the favoured technique for building recommender systems is Collaborative filtering. This technique is further divided into three main categories including Memory-Based, Model-Based, and Hybrid-BAsed. Therefore, as large businesses realise the advantages of recommending personalized items to users, the research for techniques, sources of information, and implementation grows. Netflix is among the big organisations interested in the expansion of recommmender system. The application of recommender systems at Netflix is widely known, however improvements are continiously being investigated to provide users with the best movie and series recommendations. Thus, this notebook aims to investigate the application of collaborative filtering techniques for Netflix.      

#### Which type of RecSys based on CF could Netflix user to provide the most accurate recommendations to users?

- What are the different type of RecSys based on CF?
- Which types could be used for Netflix' dataset?
- How do KNN and SVD compare?

# 2 Importing Libraries

Importing the necessary libraries

In [1]:
# General imports
import os
import pandas as pd
import numpy as np

# Used for visualisations in the EDA
import seaborn as sns

# Used in reducing the memory storage of sparse matrices
from scipy.sparse import csr_matrix

# Used for creating a KNN and SVD RecSys model
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from scipy.sparse.linalg import svds

# Used for performance evaluation
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

# As working with NaN values in matrices overwelmed the output with warnings 
# these warnings will be ignored.
import warnings
warnings.filterwarnings('ignore')

# 3 Data Processing

In [2]:
debugging = True

## 3.1 Netflix Dataset

### 3.1.1 Import Source Data

Code to append movieId to each record in all of the source files if this has not been executed earlier. This will allow all the source files to be loaded into a dataframe with one line of code and without having to add the movieId seperately before concatting the sourcefiles. This additionally resulted in a faster importing time of the source data.

In [3]:
def format_netflix_source():
    # Variable to keep track of which movieId has to be appended
    x = 0
    string = ","+str(x)

    # Loop through each file in the directory
    for filename in os.listdir(directory):
        with open(os.path.join(directory, filename), 'r') as f:
            # Check if the first line of the first file already has been formatted
            # If true formatting the source will be skipped
            if(f.readlines()[0] == '1:,1\n'):
                print('Source is already formatted, continuing \n')
                return
            else:
                print('Started formatting source files \n')
                # Append a ',', movieId and \n (newline) to each line resulting in an extra column with movieId when using pandas read_csv 
                file_lines = [''.join([x.strip(), string, '\n']) for x in f.readlines()]
        with open(os.path.join(directory, filename), 'w') as f:
            # Save the formatted file
            f.writelines(file_lines)
            print('Completed formatting source files')
    return

Creating the movie dataframe by concatting all the sourcefiles without their title (skiprows=1). Concluding with naming the columns.

In [4]:
# Define the folder in which all the seperate movie files are located
directory = "/Users/vbraun/Downloads/training_set/"

# Run the function to format the source data if necessary
format_netflix_source()

print('Started concatting all source files to DataFrame \n')
# Performing 1 concat on all sourcefiles to create one dataframe (including itemId)
# As the files do not have a header and differ between each other 'header' = None and the first row will be skipped
netflix_df = pd.concat(pd.read_csv(os.path.join(directory, fname), skiprows=1,header=None) for fname in os.listdir(directory)).rename(columns={0:'userId',1:'rating',2:'date',3:'itemId'})
print('Completed concatting all source files to DataFrame \n')

# Dropping the date column as this is not relevant for this recommender system
netflix_df = netflix_df.drop(columns='date')

display(netflix_df.head(3))

Source is already formatted, continuing 

Started concatting all source files to DataFrame 

Completed concatting all source files to DataFrame 



Unnamed: 0,userId,rating,itemId
0,1488844,3,1
1,822109,5,1
2,885013,4,1


### 3.1.2 Data Filtering

To allow for faster development a debugging variable is used. If debugging is True the dataset will only consist of the first 100 movies. For the final model, debugging will be set to False.

In [5]:
# In order to achieve faster execution time of the while developing a selection of the total dataset is made when debugging = True
if debugging == True:
    print('Debugging is set to True: limiting the dataset to the first 100 movies')
    filtered_netflix_df = netflix_df[netflix_df['itemId'] <= 100]
else:
    filtered_netflix_df = netflix_df
print('Length of selected dataset: {0} \n'.format(len(filtered_netflix_df)))

Debugging is set to True: limiting the dataset to the first 100 movies
Length of selected dataset: 352771 



In order to filter the dataset based on activity and reduce the sparsity of the data, the data will be grouped and filtered based on movies and users. The datasets will show how many ratings each movie has gotten and how many rating each user has given.

To reduce the sparcity of data in the dataset, we will filter out the users that have rated fewer than 5% of the total amount of movies.

Finally, the movies that have been rated by fewer than 50 people will be filtered out of the dataset.

In [6]:
# Making two dataframes: One grouping items to show how many ratings an item has gotten and one grouping users to show how many ratings an user has given
filtered_movie_count = filtered_netflix_df[['itemId','userId']].groupby('itemId').count().reset_index().rename(columns={'userId':'user_count'})
filtered_user_count = filtered_netflix_df[['itemId','userId']].groupby('userId').count().reset_index().rename(columns={'itemId':'item_count'})

# Defining the minimal percentage of items that a user has to have rated to be included
required_rated_percentage = 0.05
print('Filtering out users that have rated less than {0}% of all movies'.format(int(required_rated_percentage*100)))

# Check if the userId exists when the total amount of items rated by a user (filtered_user_count['item_count']), divided by the total amount of movies (len(filtered_movie_count)' is bigger than the required_rated_percentage
filtered_netflix_df = filtered_netflix_df[filtered_netflix_df['userId'].isin(filtered_user_count[filtered_user_count['item_count']/len(filtered_movie_count) > required_rated_percentage]['userId'])]
print('Length of filtered dataset: {0} \n'.format(len(filtered_netflix_df)))

print('Filtering out movies that have been rated by fewer than {0} users'.format(50))
filtered_netflix_df = filtered_netflix_df[filtered_netflix_df['itemId'].isin(filtered_movie_count[filtered_movie_count['user_count']>50]['itemId'])]
print('Length of filtered dataset:',len(filtered_netflix_df))

Filtering out users that have rated less than 5% of all items
Length of filtered dataset: 44209 

Filtering out movies that have been rated by fewer than 50 users
Length of filtered dataset: 44209


## 3.2 Jester Dataset

### 3.2.1 Import Source Data

In [7]:
# Creating dataframes for the jester datasets
jester_items = pd.read_csv(r'C:\Users\vbraun\Downloads\SDM-Datasets\jester_items.csv')
jester_ratings = pd.read_csv(r'C:\Users\vbraun\Downloads\SDM-Datasets\jester_ratings.csv')

# Rename jokeId to itemId to have one uniform format for the Netflix and Jester dataset
jester_df = jester_ratings.rename(columns={'jokeId':'itemId'})

# The ratings given in the jester dataset range from -10 to 10. Since calculations including average ratings of users could be influenced by this, the ratings will be increase with 10 to range from 0 - 20. 
jester_df['rating'] = jester_df['rating'] + 10

### 3.2.2 Data filtering

In [8]:
# In order to achieve faster execution time of the while developing a selection of the total dataset is made when debugging = True
if debugging == True:
    print('Debugging is set to True: limiting the dataset to the first 50 jokes')
    filtered_jester_df = jester_df[jester_df['itemId'] <= 50]
else:
    filtered_jester_df = jester_df
print('Length of selected dataset: {0} \n'.format(len(filtered_jester_df)))

Debugging is set to True: limiting the dataset to the first 50 jokes
Length of selected dataset: 715403 



In [9]:
# Making two dataframes: One grouping items to show how many ratings an item has gotten and one grouping users to show how many ratings an user has given
jester_item_count = filtered_jester_df[['itemId','userId']].groupby('itemId').count().reset_index().rename(columns={'userId':'user_count'})
jester_user_count = filtered_jester_df[['itemId','userId']].groupby('userId').count().reset_index().rename(columns={'itemId':'item_count'})

# Defining the minimal percentage of items that a user has to have rated to be included
required_rated_percentage = 0.05
print('Filtering out users that have rated less than {0}% of all jokes'.format(int(required_rated_percentage*100)))

# Check if the userId exists when the total amount of items rated by a user (jester_user_count['item_count']), divided by the total amount of movies (len(jester_item_count)' is bigger than the required_rated_percentage
filtered_jester_df = filtered_jester_df[filtered_jester_df['userId'].isin(jester_user_count[jester_user_count['item_count']/len(jester_item_count) > required_rated_percentage]['userId'])]
print('Length of filtered dataset: {0} \n'.format(len(filtered_jester_df)))

print('Filtering out jokes that have been rated by fewer than {0} users'.format(20))
filtered_jester_df = filtered_jester_df[filtered_jester_df['itemId'].isin(jester_item_count[jester_item_count['user_count']>20]['itemId'])]
print('Length of filtered dataset:',len(filtered_jester_df))

# 4 Exploratory Data Analysis (EDA)

## 4.1 EDA Netflix Dataset

In [10]:
# print('The filtered dataset has', filtered_df['userId'].nunique(), 'unique users')
# print('The filtered dataset has', filtered_df['itemId'].nunique(), 'unique movies')
# print('The filtered dataset has', filtered_df['rating'].nunique(), 'unique ratings')
# print('The unique ratings are', sorted(filtered_df['rating'].unique()))

In [11]:
# display(filtered_df.head(),filtered_df.tail())

In [12]:
# filtered_df.describe()

In [13]:
# print('Amount of NaN values in the dataset:',filtered_df.loc[lambda x: x.isnull().any(axis=1)].shape[0])

The following graph shows for each movie (as a dot) what its mean rating is in comparison to the total amount of ratings. 

In [14]:
# plt = sns.jointplot(x='rating_mean', y='rating_amount', data=filtered_df.groupby('itemId').agg(rating_mean = ('rating', 'mean'), rating_amount = ('rating', 'count')).reset_index())
# plt.fig.suptitle("Rating Mean and Rating Amount of Movies")
# plt.fig.subplots_adjust(top=0.95)

Notably, movies with low mean ratings have generally been rated a low number of times in relation to movies with a mean rating higher than 3.0.

The next graph shows the same variables as the graph seen above for each user. 

In [15]:
# plt = sns.jointplot(x='rating_mean', y='rating_amount', data=filtered_df.groupby('userId').agg(rating_mean = ('rating', 'mean'), rating_amount = ('rating', 'count')).reset_index())
# plt.fig.suptitle("Rating Mean and Rating Amount of Users")
# plt.fig.subplots_adjust(top=0.95)

The graph for users show that there are several outliers of users that have rated many movies while having a low mean of their ratings. Additionally, the graph shows that mean of the rating mean approaches normality. 

## 4.2 EDA Jester Dataset

In [16]:
# print('The filtered dataset has', filtered_df['userId'].nunique(), 'unique users')
# print('The filtered dataset has', filtered_df['itemId'].nunique(), 'unique movies')
# print('The filtered dataset has', filtered_df['rating'].nunique(), 'unique ratings')
# print('The unique ratings are', sorted(filtered_df['rating'].unique()))

In [17]:
# display(filtered_df.head(),filtered_df.tail())

In [18]:
# filtered_df.describe()

In [19]:
# print('Amount of NaN values in the dataset:',filtered_df.loc[lambda x: x.isnull().any(axis=1)].shape[0])

In [20]:
# plt = sns.jointplot(x='rating_mean', y='rating_amount', data=filtered_df.groupby('itemId').agg(rating_mean = ('rating', 'mean'), rating_amount = ('rating', 'count')).reset_index())
# plt.fig.suptitle("Rating Mean and Rating Amount of Movies")
# plt.fig.subplots_adjust(top=0.95)

In [21]:
# plt = sns.jointplot(x='rating_mean', y='rating_amount', data=filtered_df.groupby('userId').agg(rating_mean = ('rating', 'mean'), rating_amount = ('rating', 'count')).reset_index())
# plt.fig.suptitle("Rating Mean and Rating Amount of Users")
# plt.fig.subplots_adjust(top=0.95)

# 5 Recommender Systems

Both models will be evaluated using the same function that calculates the root mean squared error and the mean absolute error.

In [22]:
def evaluate_predictions(pred, truth):
    # In order to only compare y's that were not 0, a selection is made from y and corresponding the y^ where y != 0 by using .nonzero() 
    pred = pred[truth.nonzero()].flatten()
    truth = truth[truth.nonzero()].flatten()

    # Standard RMSE and MAE calculations
    rmse = np.sqrt(mean_squared_error(pred,truth))
    mae = mean_absolute_error(pred,truth)
    
    return rmse, mae

## 5.1 K Nearest Neighbors (KNN)

### 5.1.1 KNN Model

In [23]:
def train_test(filtered_df):
    # Since the data is sparse the pivot_table is performed in the csr_matrix to reduce the required memory 
    sparse_matrix = csr_matrix(filtered_df.pivot_table(index='userId', columns='itemId', values='rating').fillna(0).values)
    print(sparse_matrix.check_format)

    # To train and evaluate the KNN model the sparse matrix is split into 70% train, 15% validation and 15% test data
    train_data, test_data = train_test_split(sparse_matrix, test_size=.30)
    test_data, validation_data = train_test_split(test_data, test_size=.50)

    return train_data, validation_data, test_data

In [24]:
def calculate_knn_prediction(train_data, test_data, k = 5, metric = 'cosine', n_neighbors = 20):
    knn_model = NearestNeighbors(metric=metric,algorithm='brute',n_neighbors=n_neighbors,n_jobs=-1)

    knn_model_fitted = knn_model.fit(train_data.toarray())
    distance, indices = knn_model_fitted.kneighbors(test_data.toarray(),k)

    raw_recommends = sorted(list(zip(indices.squeeze().tolist(), distance.squeeze().tolist())), key=lambda x: x[1])[:0:-1]
    knn_prediction = []
    for i, (idx, dist) in enumerate(raw_recommends):
        td = pd.DataFrame(train_data.toarray())
        sim_users = np.array(td[td.index.isin(idx)])
        sim_users[sim_users == 0] = np.nan
        average_rat = np.nan_to_num(np.nanmean(sim_users,axis=0))
        knn_prediction.append(average_rat)
    
    return np.array(knn_prediction)


### 5.1.2 KNN Hyper Parameter Tuning

In [25]:
def hyper_parameter_tuning_knn(train_data, validation_data):
    n_neighbors = [5,10,20,50]
    recommendation_amount = [3,5,10]
    metric = ['euclidean','cosine','minkowski']

    hpt_results = []
    for met in metric:
        for k in recommendation_amount:
            for n in n_neighbors:
                rmse, mae = evaluate_predictions(validation_data.toarray(),calculate_knn_prediction(train_data = train_data, test_data = validation_data, metric = met, k = k, n_neighbors = n))
                print(rmse,met,k,n)
                hpt_results.append([rmse,met,k,n])

    best_parameters_knn = sorted(hpt_results, key=lambda x: x[0])[0]
    print(best_parameters_knn)

    return best_parameters_knn

1. Create NearestNeighbors model
1. Fit the model with train data
1. Use kneighbors to find the k amount of neighbors of the jokes in the test data
1. Calculate the prediction by taking the average score of the k most similar jokes
1. Evaluate the model by comparing the actual ratings with the predicted ratings

Algorithm is set at brute (force) because the inputdata is sparse

## 5.2 Singular Value Decomposition (SVD)

### 5.2.1 SVD Model

Pivot the dataset into a matrix with index='userId', columns='itemId', values='rating' in order to later perform user-based collaborative filtering. Moreover, fill_value = 0 in order to remove NaN values and save them as 0. Finally, the matrix is directly stored as a sparse matrix to save memory, instead of first saving the entire matrix into memory. 

Scipy.sparse.linalg.svds was used to perform a partial singular value decomposition of a sparse matrix. This function allows us to specify 'k' which is the number of singular values and singular vectors that have to be computed. 

In [26]:
def calculate_svd_prediction(data, k = 5):
    # Performing the SVD matrix factorisation giving: u (m x r) orthogonal matrix, 
    # s (r x r) diagonal matrix, and vt(ransposed) (r x n) orthogonal matrix.
    u, s, vt = svds(data.toarray(), k = k)

    # A diagonal matrix has to be created for s in order to recreate a matrix from u, s, and vt
    s_diagonal = np.diag(s)

    # Recreate the matrix by performing matrix multiplications of u, s, and vt
    predictions = np.dot(np.dot(u, s_diagonal), vt)
    
    return predictions

In order to evaluate the performance of the recommendations following SVD we only need the $\hat{y}$ of existing $y$. Therefore, all other values will be filtered out of the prediction dataset by using pred[truth.nonzero()]. Afterwards we are able to evaluate the performance of our model by comparing $\hat{y}$ with their corresponding $y$. 

### 5.2.2 SVD Hyper Parameter Tuning

Different k values lead to different predictions

We will perform hyperparameter tuning to find the k with the lowest rmse. For each k we will perform multiple iterations, in which a random sample of the data will be masked and used to calculate the rmse. The rmse of a k value will be the average rmse of all iterations of that k. 

In [53]:
def hyper_parameter_tuning_svd(dataset, iterations = 10):
    results = []
    print('Calculating the average rmse over {0} iterations'.format(iterations))

    # List with k values that will be tested
    k_list = [1,2,3,4,5,6,7,8,9,10,20,30,50,80]

    # As k has to be 0 < k < min(Matrix.shape) in SVD the values in k_list that are higher are filtered out.
    # The shape of the items in the matrix is passed with csr_matrix(dataset.pivot_table(index='userId', columns='itemId', values='rating').fillna(0).values).shape[1]
    k_list = list(filter(lambda num: num < csr_matrix(dataset.pivot_table(index='userId', columns='itemId', values='rating').fillna(0).values).shape[1], k_list))
    print(k_list)
    
    for k in k_list:
        rmse_list = []
        for i in range(0,iterations):
            dataset_ex_masked, masked_data = train_test_split(dataset, test_size=.05)

            dataset_ex_masked_csr = csr_matrix(dataset_ex_masked.pivot_table(index='userId', columns='itemId', values='rating').fillna(0).values)
            masked_data_csr = csr_matrix(masked_data.pivot_table(index='userId', columns='itemId', values='rating').fillna(0).values)

            rmse, mae = evaluate_predictions(calculate_svd_prediction(dataset_ex_masked_csr,k),masked_data_csr.toarray())
            
            rmse_list.append(rmse)
    
        results.append([k,(sum(rmse_list)/len(rmse_list))])
        print('For k = {0}, the average rmse = {1}'.format(k,(sum(rmse_list)/len(rmse_list))))

    best_parameters_svd = sorted(results, key=lambda x: x[1])[0]
    print('The rmse is lowest for k = {0} at = {1}'.format(best_parameters_svd[0],best_parameters_svd[1]))
    
    return best_parameters_svd
    

# 6 Model Evaluation

## 6.1 Performance Netflix Dataset

### 6.1.1 Results KNN model

In [30]:
knn_train, knn_validation, knn_test = train_test(filtered_netflix_df)

In [31]:
best_params_knn = hyper_parameter_tuning_knn(train_data = knn_train, validation_data = knn_validation)

2.623055009576771 euclidean 3 5
2.623055009576771 euclidean 3 10
2.623055009576771 euclidean 3 20
2.623055009576771 euclidean 3 50
2.538732452423058 euclidean 5 5
2.538732452423058 euclidean 5 10
2.538732452423058 euclidean 5 20
2.538732452423058 euclidean 5 50
2.4509703519004 euclidean 10 5
2.4509703519004 euclidean 10 10
2.4509703519004 euclidean 10 20
2.4509703519004 euclidean 10 50
2.7771249981708976 cosine 3 5
2.7771249981708976 cosine 3 10
2.7771249981708976 cosine 3 20
2.7771249981708976 cosine 3 50
2.7501424757697457 cosine 5 5
2.7501424757697457 cosine 5 10
2.7501424757697457 cosine 5 20
2.7501424757697457 cosine 5 50
2.7139979666763248 cosine 10 5
2.7139979666763248 cosine 10 10
2.7139979666763248 cosine 10 20
2.7139979666763248 cosine 10 50
2.623055009576771 minkowski 3 5
2.623055009576771 minkowski 3 10
2.623055009576771 minkowski 3 20
2.623055009576771 minkowski 3 50
2.538732452423058 minkowski 5 5
2.538732452423058 minkowski 5 10
2.538732452423058 minkowski 5 20
2.5387324

In [32]:
evaluate_predictions(knn_test.toarray(),calculate_knn_prediction(train_data = knn_train, test_data = knn_test, metric = best_params_knn[1], k = best_params_knn[2], n_neighbors = best_params_knn[3]))

(2.429720432043772, 2.1248197640488926)

### 6.1.2 Results SVD model

In [33]:
best_params_svd = hyper_parameter_tuning_svd(dataset = filtered_netflix_df, iterations = 10)

Calculating the average rmse over 10 iterations
For k = 1, the average rmse = 3.2914045440830813
For k = 2, the average rmse = 3.3032158167079757
For k = 3, the average rmse = 3.3190726110889743
For k = 4, the average rmse = 3.313065654238047
For k = 5, the average rmse = 3.3721363127371595
For k = 6, the average rmse = 3.315092224347321
For k = 7, the average rmse = 3.2735355994253643
For k = 8, the average rmse = 3.3544884783067004
For k = 9, the average rmse = 3.311079299177652
For k = 10, the average rmse = 3.337231803280313
For k = 20, the average rmse = 3.2723853079844942
For k = 30, the average rmse = 3.3463953145666445
For k = 50, the average rmse = 3.3863219001530007
For k = 80, the average rmse = 3.39113859615619
The rmse is lowest for k = 20 at = 3.2723853079844942


### 6.1.3 Comparison KNN & SVD for Netflix

#### 6.1.3.1 Performance

#### 6.1.3.2 Recommendations

## 6.2 Performance Jester Dataset

### 6.2.1 Results KNN model

In [34]:
knn_train, knn_validation, knn_test = train_test(filtered_jester_df)

In [35]:
best_params_knn = hyper_parameter_tuning_knn(train_data = knn_train, validation_data = knn_validation)

9.033372224544578 euclidean 3 5
9.033372224544578 euclidean 3 10
9.033372224544578 euclidean 3 20
9.033372224544578 euclidean 3 50
8.910130010204323 euclidean 5 5
8.910130010204323 euclidean 5 10
8.910130010204323 euclidean 5 20
8.910130010204323 euclidean 5 50
8.76778828052912 euclidean 10 5
8.76778828052912 euclidean 10 10
8.76778828052912 euclidean 10 20
8.76778828052912 euclidean 10 50
9.463540730309592 cosine 3 5
9.463540730309592 cosine 3 10
9.463540730309592 cosine 3 20
9.463540730309592 cosine 3 50
9.349449156099086 cosine 5 5
9.349449156099086 cosine 5 10
9.349449156099086 cosine 5 20
9.349449156099086 cosine 5 50
9.23628444832392 cosine 10 5
9.23628444832392 cosine 10 10
9.23628444832392 cosine 10 20
9.23628444832392 cosine 10 50
9.033372224544578 minkowski 3 5
9.033372224544578 minkowski 3 10
9.033372224544578 minkowski 3 20
9.033372224544578 minkowski 3 50
8.910130010204323 minkowski 5 5
8.910130010204323 minkowski 5 10
8.910130010204323 minkowski 5 20
8.910130010204323 min

In [36]:
evaluate_predictions(knn_test.toarray(),calculate_knn_prediction(train_data = knn_train, test_data = knn_test, metric = best_params_knn[1], k = best_params_knn[2], n_neighbors = best_params_knn[3]))

(8.73842878841048, 7.3428941454196375)

### 6.2.2 Results SVD model

In [54]:
best_params_svd = hyper_parameter_tuning_svd(dataset = filtered_jester_df, iterations = 2)

Calculating the average rmse over 2 iterations
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30]
For k = 1, the average rmse = 8.530042083820124
For k = 2, the average rmse = 8.868934041400944
For k = 3, the average rmse = 8.988529151964887
For k = 4, the average rmse = 9.11775329691272
For k = 5, the average rmse = 9.142603298280465
For k = 6, the average rmse = 9.23717779901505
For k = 7, the average rmse = 9.299385331053614
For k = 8, the average rmse = 9.325726077453089
For k = 9, the average rmse = 9.452114322926636
For k = 10, the average rmse = 9.543293105701935
For k = 20, the average rmse = 9.899036402990543
For k = 30, the average rmse = 9.949725169304354
The rmse is lowest for k = 1 at = 8.530042083820124


### 6.2.3 Comparison KNN & SVD for Jester

#### 6.1.3.1 Performance

#### 6.1.3.2 Recommendations

KNN:

In [55]:
recommend_for_user = 50

user_pred_df = pd.DataFrame(predictions)
user_sel_pred_df = user_pred_df.loc[recommend_for_user].sort_values(ascending=False)

user_df = pd.DataFrame(final_csr_matrix.toarray())
selected = pd.DataFrame(user_df.loc[recommend_for_user])
rated_movies = selected.loc[~(selected==0).all(axis=1)].index.values.tolist()

recommended_movies = user_sel_pred_df.loc[~user_sel_pred_df.index.isin(rated_movies)]

print('Rated items are:',rated_movies)
print(recommended_movies[:3])

NameError: name 'predictions' is not defined

SVD:

In [None]:

# final_csr_matrix = csr_matrix(filtered_df.pivot_table(index='userId', columns='itemId', values='rating').fillna(0).values)
# predictions = hyper_svd(final_csr_matrix,best_parameters_svd[0])

In [None]:
# recommend_for_user = 650

# user_pred_df = pd.DataFrame(predictions)
# user_sel_pred_df = user_pred_df.loc[recommend_for_user].sort_values(ascending=False)

# user_df = pd.DataFrame(final_csr_matrix.toarray())
# selected = pd.DataFrame(user_df.loc[recommend_for_user])
# rated_movies = selected.loc[~(selected==0).all(axis=1)].index.values.tolist()

# recommended_movies = user_sel_pred_df.loc[~user_sel_pred_df.index.isin(rated_movies)]

# print('Rated items are:',rated_movies)
# print(recommended_movies[:3])

In order to answer the subquestion: "How do the KNN and SVD models compare?" we will compare compare the rmse of both models for the same dataset.

# 7 Conclusion

#### Which type of RecSys based on CF could Netflix user to provide the most accurate recommendations to users?

- How do KNN and SVD compare?