### Collaborative Filtering-based Movie Recommendation Systems

Collaborative filtering bases its recommendations on past interactions between users and movies.

To obtain recommendations for our users, we will **predict their ratings** for movies they haven’t watched yet. Movies are then ranked by predicted rating, and the highest-ranked ones are suggested to each user.

In [40]:
import pandas as pd
ratings = pd.read_csv('Datasets/ratings_small.csv')
ratings.shape

(100004, 4)

In [41]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [42]:
ratings.drop('timestamp', axis=1, inplace=True)

In [43]:
print(ratings['userId'].nunique())
print(ratings.shape)
ratings.head()

671
(100004, 3)


Unnamed: 0,userId,movieId,rating
0,1,31,2.5
1,1,1029,3.0
2,1,1061,3.0
3,1,1129,2.0
4,1,1172,4.0


In [44]:
# Splitting into train and test
from sklearn.model_selection import train_test_split
train, test = train_test_split(ratings, test_size=0.2, random_state=42)

print(train.shape)
print(test.shape)

(80003, 3)
(20001, 3)


# Matrix Factorization-based Algorithm

Matrix factorization is a class of collaborative filtering algorithms used in recommender systems.

It works by decomposing the `user-movie interaction matrix` (the dataset) into the product of two lower-dimensional matrices: a `user matrix` and a `movie matrix`, each with k latent features that capture the underlying factors influencing user preferences.
The decomposition is chosen so that the product closely approximates the known values of the user-movie interaction matrix.

We then multiply the two matrices to predict the empty cells of the user-movie interaction matrix.
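
To make the idea concrete, here is a minimal sketch on a toy matrix. It is not the Surprise implementation used below: the 5x4 matrix, `k = 2`, the learning rate and the regularization strength are all assumptions chosen for illustration, and the factors are fitted with plain stochastic gradient descent on the observed cells only.

```python
import numpy as np

# Toy user-movie matrix (illustrative data); 0 marks a rating we do not know yet.
R = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [1.0, 0.0, 0.0, 4.0],
    [0.0, 1.0, 5.0, 4.0],
])
observed = R > 0

k = 2                                   # number of latent features (assumed)
rng = np.random.default_rng(0)
P = rng.random((R.shape[0], k))         # user matrix, one row per user
Q = rng.random((R.shape[1], k))         # movie matrix, one row per movie

lr, reg = 0.01, 0.02                    # learning rate and L2 penalty (assumed)
for _ in range(3000):
    for u, i in zip(*np.nonzero(observed)):
        err = R[u, i] - P[u] @ Q[i]     # error on one known rating
        p_u = P[u].copy()               # keep the old user factors for the movie update
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * p_u - reg * Q[i])

R_hat = P @ Q.T                         # dense prediction, including the empty cells
rmse_observed = np.sqrt(np.mean((R[observed] - R_hat[observed]) ** 2))
print(np.round(R_hat, 2))
print("RMSE on observed cells:", round(float(rmse_observed), 3))
```

The cells that were 0 in `R` now hold predicted ratings in `R_hat`, which is what the recommender ranks.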

In [45]:
from surprise import SVD
import numpy as np
import surprise
from surprise import Reader, Dataset

In [46]:
# train set

# The Reader tells Surprise how to interpret the ratings.
# rating_scale is the (min, max) of the rating range; predictions are clipped to it.
reader = Reader(rating_scale=(1,5))

# Load the training ratings from the dataframe (columns in user, item, rating order).
train_data_mf = Dataset.load_from_df(train[['userId', 'movieId', 'rating']], reader)

# Build a Surprise Trainset from all of the loaded training ratings.
trainset = train_data_mf.build_full_trainset() 

In [47]:
trainset

<surprise.trainset.Trainset at 0x225655b4970>

In [48]:
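# test set: same steps as above. It is wrapped in a Trainset so that
# build_testset() can later turn it into (user, movie, rating) tuples.
reader = Reader(rating_scale=(1,5))
test_data_mf = Dataset.load_from_df(test[['userId', 'movieId', 'rating']], reader)
testset = test_data_mf.build_full_trainset() 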

In [49]:
svd = SVD(n_factors=100, biased=True, random_state=15, verbose=True)
svd.fit(trainset)

Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 10
Processing epoch 11
Processing epoch 12
Processing epoch 13
Processing epoch 14
Processing epoch 15
Processing epoch 16
Processing epoch 17
Processing epoch 18
Processing epoch 19


<surprise.prediction_algorithms.matrix_factorization.SVD at 0x225650d5f40>

In [50]:
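# getting predictions for the trainset
# Note: build_testset() lists ratings in Surprise's internal order (grouped by user),
# which can differ from the row order of `train`.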
train_preds = svd.test(trainset.build_testset())

train_pred_mf = np.array([pred.est for pred in train_preds])

In [51]:
train_pred_mf.shape

(80003,)

In [52]:
train_pred_mf

array([4.05703641, 3.8292154 , 4.29416284, ..., 3.73216477, 4.18028913,
       2.99374156])

In [53]:
def get_error_metrics(y_true, y_pred):
    # root mean squared error
    rmse = np.sqrt(np.mean((y_true - y_pred)**2))
    # mean absolute percentage error, in percent
    mape = np.mean(np.abs( (y_true - y_pred)/y_true )) * 100
    return rmse, mape
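
As a quick sanity check of the two metrics, here is a small worked example. The function is restated in vectorized form so the snippet runs on its own, and the three ratings are made up for illustration.

```python
import numpy as np

def get_error_metrics(y_true, y_pred):
    """Return (RMSE, MAPE in percent) for two equally long 1-D arrays."""
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    return rmse, mape

y_true = np.array([4.0, 2.0, 5.0])   # illustrative true ratings
y_pred = np.array([3.0, 2.0, 4.0])   # illustrative predictions

rmse, mape = get_error_metrics(y_true, y_pred)
# Errors are 1, 0, 1: RMSE = sqrt(2/3), about 0.816
# Relative errors are 0.25, 0, 0.20: MAPE = 15%
print(round(float(rmse), 3), round(float(mape), 2))
```

MAPE divides by the true rating, so it is only defined when no true rating is 0.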

In [54]:
# error
rmse_train, mape_train = get_error_metrics(train.rating.values, train_pred_mf)
print( {'rmse': rmse_train,
        'mape' : mape_train} )

{'rmse': 1.245609623806668, 'mape': 42.06257415522802}
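
Each object returned by `svd.test` carries both the true rating (`r_ui`) and the estimate (`est`). Reading both from the same object keeps every pair aligned, even if the prediction order differs from the dataframe's row order. The sketch below shows this under stated assumptions: `metrics_from_predictions` is a hypothetical helper, and `Prediction` is a stand-in namedtuple with the same field names as Surprise's, so the snippet runs without Surprise installed.

```python
from collections import namedtuple
import numpy as np

# Stand-in with the same fields as surprise.Prediction (assumption for this sketch).
Prediction = namedtuple("Prediction", ["uid", "iid", "r_ui", "est", "details"])

def metrics_from_predictions(preds):
    """RMSE and MAPE from the (true, estimated) pair stored on each prediction."""
    y_true = np.array([p.r_ui for p in preds])
    y_pred = np.array([p.est for p in preds])
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    return rmse, mape

# Made-up predictions, only to exercise the helper.
toy_preds = [
    Prediction(uid=1, iid=31, r_ui=2.5, est=3.0, details={}),
    Prediction(uid=1, iid=1029, r_ui=3.0, est=3.0, details={}),
    Prediction(uid=2, iid=31, r_ui=4.0, est=3.5, details={}),
]
rmse, mape = metrics_from_predictions(toy_preds)
print(round(float(rmse), 3), round(float(mape), 2))
```

With the notebook's objects, the call would be `metrics_from_predictions(train_preds)` or `metrics_from_predictions(test_preds)`.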


In [55]:
# getting predictions for the testset
test_preds = svd.test(testset.build_testset())
test_pred_mf = np.array([pred.est for pred in test_preds])

In [56]:
# error
rmse_test, mape_test = get_error_metrics(test.rating.values, test_pred_mf)
print( {'rmse': rmse_test,
        'mape' : mape_test} )

{'rmse': 1.2061597070667478, 'mape': 41.15563553771098}


The final model below reduces this error further.

# Feature Engineering
We create some handcrafted features for the final model.

We'll be creating **3 sets of new features:**
1. Global averages
    - The average of all ratings (all movies, all users)
    - The average rating of a particular movie across all users
    - The average rating given by a particular user across all movies
2. Top 5 similar users
3. Top 5 similar movies

In [57]:
# Creating a sparse matrix
# Matrices in this kind of problem are generally sparse, since most users rate only a few movies.
from scipy import sparse
train_sparse_matrix = sparse.csr_matrix((train.rating.values, 
                                        (train.userId.values, train.movieId.values)))

In [58]:
train_sparse_matrix

<672x163950 sparse matrix of type '<class 'numpy.float64'>'
	with 80003 stored elements in Compressed Sparse Row format>
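
The shape above follows from how `csr_matrix` infers its dimensions when none are passed: each dimension is the largest index plus one, so user and movie ids are used directly as row and column positions. A tiny illustration with made-up ids:

```python
import numpy as np
from scipy import sparse

# Illustrative ids and ratings (not taken from the dataset).
user_ids = np.array([1, 1, 3])
movie_ids = np.array([2, 5, 2])
toy_ratings = np.array([4.0, 2.5, 5.0])

m = sparse.csr_matrix((toy_ratings, (user_ids, movie_ids)))

print(m.shape)   # (4, 6): largest user id + 1, largest movie id + 1
print(m.nnz)     # 3 stored ratings
print(m[1, 5])   # 2.5, the rating of movie 5 by user 1
```

Row 0 and column 0 stay empty because the ids start at 1.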

In [59]:
train_averages = dict()

# Global avg of all ratings (all movies by all users)
train_global_average = train_sparse_matrix.sum()/train_sparse_matrix.count_nonzero()
train_averages['global'] = train_global_average
train_averages

{'global': 3.5417484344337087}

In [60]:
# Average rating per user or per movie, returned as a dictionary
# (key: user id or movie id, value: average rating)
def get_average_ratings(sparse_matrix, of_users):

    # Axis to sum over: 1 gives one total per user (row), 0 one total per movie (column)
    ax = 1 if of_users else 0

    # ".A1" flattens the resulting matrix into a 1-D numpy array
    sum_of_ratings = sparse_matrix.sum(axis=ax).A1

    # Boolean matrix: True where a user has rated a movie
    is_rated = sparse_matrix!=0
    # number of ratings for each user or movie
    no_of_ratings = is_rated.sum(axis=ax).A1

    # Matrix dimensions are (max user id + 1, max movie id + 1)
    num_users, num_movies = sparse_matrix.shape
    # Dictionary of averages, skipping ids that have no ratings
    average_ratings = {
        i : sum_of_ratings[i]/no_of_ratings[i]
        for i in range(num_users if of_users else num_movies)
        if no_of_ratings[i] !=0
    }

    return average_ratings

In [61]:
# Average ratings given by a user
train_averages['user'] = get_average_ratings(train_sparse_matrix, of_users=True)
print('Average rating of user 10 :',train_averages['user'][10])

Average rating of user 10 : 3.6944444444444446


In [62]:
# Average ratings given for a movie
train_averages['movie'] =  get_average_ratings(train_sparse_matrix, of_users=False)
print('Average rating of movie 15 :',train_averages['movie'][15])

Average rating of movie 15 : 2.0
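
The same three kinds of averages can also be computed directly from a ratings dataframe, which is a convenient cross-check of the sparse-matrix version. A minimal sketch on made-up data (the `toy` frame and the `averages` dictionary are illustrative names):

```python
import pandas as pd

# Illustrative ratings, in the same column layout as the notebook's dataframe.
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 20],
    "rating":  [4.0, 2.0, 5.0, 3.0, 4.0],
})

averages = {
    "global": toy["rating"].mean(),                             # all ratings
    "user": toy.groupby("userId")["rating"].mean().to_dict(),   # per user
    "movie": toy.groupby("movieId")["rating"].mean().to_dict(), # per movie
}

print(averages["global"])     # 3.6
print(averages["user"][1])    # 3.0
print(averages["movie"][10])  # 4.5
```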


In [63]:
# get users, movies and ratings from the train sparse matrix
train_users, train_movies, train_ratings = sparse.find(train_sparse_matrix)

In [64]:
train_users

array([  7,   9,  13, ..., 287, 611, 547], dtype=int32)

In [65]:
train_ratings

array([3., 4., 5., ..., 5., 5., 5.])

**Final dataframe:**

New features:
- GAvg: Average of all ratings
- Similar users' ratings of this movie: sur1, sur2, sur3, sur4, sur5 (ratings from the top 5 similar users who rated the movie)
- Similar movies rated by this user: smr1, smr2, smr3, smr4, smr5 (this user's ratings of the top 5 similar movies)
- UAvg: Average rating given by this user
- MAvg: Average rating of this movie
- rating: Rating of this movie by this user (the target)
- mf_svd: Prediction by the matrix factorization (SVD) model
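
To illustrate how the `sur1`..`sur5` features are formed, here is a minimal sketch on a small dense matrix. It is an assumption-laden simplification: `similar_user_ratings` is a hypothetical helper, cosine similarity is computed by hand with NumPy instead of scikit-learn, and the user is removed from the ranking by id rather than by position.

```python
import numpy as np

def similar_user_ratings(R, user, movie, movie_avg, n=5):
    """Ratings of `movie` by the users most similar to `user`, padded to length n."""
    norms = np.linalg.norm(R, axis=1)
    norms[norms == 0] = 1.0                        # avoid dividing by zero for empty rows
    sims = (R @ R[user]) / (norms * norms[user])   # cosine similarity to every user
    order = np.argsort(-sims, kind="stable")       # most similar first
    order = order[order != user]                   # drop the user itself
    ratings = R[order, movie]
    top = [float(r) for r in ratings[ratings != 0][:n]]  # only users who rated the movie
    top.extend([movie_avg] * (n - len(top)))       # pad with the movie's average rating
    return top

# Illustrative 4-user, 3-movie matrix; 0 means "not rated".
R = np.array([
    [5.0, 4.0, 1.0],
    [5.0, 5.0, 2.0],
    [1.0, 0.0, 5.0],
    [4.0, 4.0, 0.0],
])
movie_avg = R[:, 2][R[:, 2] != 0].mean()           # average of movie 2 = 8/3

features = similar_user_ratings(R, user=0, movie=2, movie_avg=movie_avg)
print(features)
```

Users 1 and 3 are the most similar to user 0, but only users 1 and 2 rated movie 2, so the result starts with their ratings (2.0, then 5.0) and is padded with the movie average. The `smr1`..`smr5` features follow the same pattern with the roles of users and movies swapped and the user's average used for padding.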

In [78]:
from datetime import datetime
from sklearn.metrics.pairwise import cosine_similarity

def getFinalData(_users, _movies, _ratings, _sparse_matrix, _averages_dict):
    # final_data = pd.DataFrame()
    final_data = pd.DataFrame(columns=['user', 'movie', 'GAvg', 'sur1', 'sur2', 'sur3', 'sur4', 'sur5',
                'smr1', 'smr2', 'smr3', 'smr4', 'smr5', 'UAvg', 'MAvg', 'rating'])

    count = 0
    start = datetime.now()
    for (user, movie, rating)  in zip(_users, _movies, _ratings):
        row = list()
        # print(user, movie)
        row.append(user)
        row.append(movie)

        # global average rating
        row.append(_averages_dict['global']) # first feature

        # ---------- Ratings of "movie" by the users most similar to "user" ----------

        # cosine similarity between this user's row and every user's row
        user_sim = cosine_similarity(_sparse_matrix[user], _sparse_matrix).ravel()

        # argsort()[::-1] sorts by descending similarity; [1:] skips the first entry,
        # which is normally the user itself (similarity 1)
        top_sim_users = user_sim.argsort()[::-1][1:]

        # ratings given to this movie by the similar users, most similar first
        top_ratings = _sparse_matrix[top_sim_users, movie].toarray().ravel()

        # keep the first 5 non-zero ratings (users who actually rated the movie)
        top_sim_users_ratings = list(top_ratings[top_ratings != 0][:5])
        # if fewer than 5 were found, pad with the movie's average rating
        top_sim_users_ratings.extend([_averages_dict['movie'][movie]]*(5 - len(top_sim_users_ratings)))

        # next 5 features are similar_users "movie" ratings
        row.extend(top_sim_users_ratings)


        # ---------- Ratings by "user" of the movies most similar to "movie" ----------

        # cosine similarity between this movie's column and every movie's column
        movie_sim = cosine_similarity(_sparse_matrix[:,movie].T, _sparse_matrix.T).ravel()

        # descending similarity; [1:] skips the first entry, normally the movie itself
        top_sim_movies = movie_sim.argsort()[::-1][1:]

        # ratings given by this user to the similar movies, most similar first
        top_ratings = _sparse_matrix[user, top_sim_movies].toarray().ravel()

        # keep the first 5 non-zero ratings; pad with the user's average rating if needed
        top_sim_movies_ratings = list(top_ratings[top_ratings != 0][:5])
        top_sim_movies_ratings.extend([_averages_dict['user'][user]]*(5-len(top_sim_movies_ratings)))

        # next 5 features are "user" ratings for similar_movies
        row.extend(top_sim_movies_ratings)


        # ---------- Remaining features ----------
        # average rating given by this user
        row.append(_averages_dict['user'][user])
        # average rating received by this movie
        row.append(_averages_dict['movie'][movie])

        # finally, the actual rating of this user-movie pair (the target)
        row.append(rating)

        count = count + 1

        # append the finished row to the dataframe
        final_data.loc[len(final_data)] = row


        if (count)%10000 == 0:
            print("Done for {} rows----- {}".format(count, datetime.now() - start))
            start = datetime.now()
            
    return final_data
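
Growing a dataframe one row at a time with `.loc[len(df)] = row` adds overhead on every iteration. A common alternative, sketched here with a shortened and purely illustrative column list, is to collect the rows in a plain Python list and build the dataframe once at the end. The similarity computations are likely the larger cost in the function above, so this is only one possible saving.

```python
import pandas as pd

columns = ["user", "movie", "rating"]      # shortened column list for the sketch
samples = [(7, 1, 3.0), (9, 1, 4.0), (13, 1, 5.0)]   # made-up (user, movie, rating) triples

rows = []
for user, movie, rating in samples:
    row = [user, movie, rating]            # the real function would add the features here
    rows.append(row)                       # appending to a list is cheap

frame = pd.DataFrame(rows, columns=columns)   # build the dataframe in one step
print(frame)
```

One visible difference from the row-by-row version is that integer ids stay integers instead of being shown as floats.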

In [67]:
# final_data.columns=['user', 'movie', 'GAvg', 'sur1', 'sur2', 'sur3', 'sur4', 'sur5',
#             'smr1', 'smr2', 'smr3', 'smr4', 'smr5', 'UAvg', 'MAvg', 'rating']

In [68]:
final_data = getFinalData(train_users, train_movies, train_ratings, train_sparse_matrix, train_averages)
final_data.head()

Done for 10000 rows----- 0:06:37.323159
Done for 20000 rows----- 0:03:11.794999
Done for 30000 rows----- 0:03:09.781131
Done for 40000 rows----- 0:03:36.685106
Done for 50000 rows----- 0:03:34.759069
Done for 60000 rows----- 0:03:42.528459
Done for 70000 rows----- 0:03:58.472888
Done for 80000 rows----- 0:03:51.946963


Unnamed: 0,user,movie,GAvg,sur1,sur2,sur3,sur4,sur5,smr1,smr2,smr3,smr4,smr5,UAvg,MAvg,rating
0,7.0,1.0,3.541748,4.0,3.5,5.0,4.0,5.0,5.0,3.0,4.0,3.0,5.0,3.478261,3.851485,3.0
1,9.0,1.0,3.541748,5.0,4.0,4.5,4.5,5.0,5.0,4.0,5.0,4.0,4.0,3.725,3.851485,4.0
2,13.0,1.0,3.541748,4.5,5.0,4.5,4.0,5.0,3.0,3.5,3.0,5.0,4.0,3.732558,3.851485,5.0
3,15.0,1.0,3.541748,3.0,4.0,3.5,4.0,3.5,2.5,5.0,5.0,4.0,4.0,2.603257,3.851485,2.0
4,19.0,1.0,3.541748,5.0,4.0,4.0,5.0,4.0,4.0,3.0,5.0,4.0,5.0,3.538462,3.851485,3.0


In [69]:
final_data.shape

(80003, 16)

In [70]:
final_data['mf_svd']=train_pred_mf
final_data.head()

Unnamed: 0,user,movie,GAvg,sur1,sur2,sur3,sur4,sur5,smr1,smr2,smr3,smr4,smr5,UAvg,MAvg,rating,mf_svd
0,7.0,1.0,3.541748,4.0,3.5,5.0,4.0,5.0,5.0,3.0,4.0,3.0,5.0,3.478261,3.851485,3.0,4.057036
1,9.0,1.0,3.541748,5.0,4.0,4.5,4.5,5.0,5.0,4.0,5.0,4.0,4.0,3.725,3.851485,4.0,3.829215
2,13.0,1.0,3.541748,4.5,5.0,4.5,4.0,5.0,3.0,3.5,3.0,5.0,4.0,3.732558,3.851485,5.0,4.294163
3,15.0,1.0,3.541748,3.0,4.0,3.5,4.0,3.5,2.5,5.0,5.0,4.0,4.0,2.603257,3.851485,2.0,4.399865
4,19.0,1.0,3.541748,5.0,4.0,4.0,5.0,4.0,4.0,3.0,5.0,4.0,5.0,3.538462,3.851485,3.0,4.12075
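
Assigning the prediction array as a column relies on the predictions being in the same row order as the feature dataframe. Because `sparse.find` and Surprise's `build_testset()` need not list the ratings in the same order, a safer pattern is to look each estimate up by its `(user, movie)` pair. The sketch below uses made-up triples in place of `[(p.uid, p.iid, p.est) for p in train_preds]`; `pred_triples`, `est_by_pair` and `features` are illustrative names.

```python
import pandas as pd

# Hypothetical (user, movie, estimate) triples in the order the model returned them.
pred_triples = [(9, 1, 3.8), (7, 1, 4.1), (7, 31, 2.9)]
est_by_pair = {(u, m): est for u, m, est in pred_triples}

# Feature rows in a different order, with ids stored as floats as in final_data.
features = pd.DataFrame({"user": [7.0, 7.0, 9.0], "movie": [1.0, 31.0, 1.0]})

# Look up each row's estimate by its (user, movie) key instead of by position.
features["mf_svd"] = [
    est_by_pair[(int(u), int(m))]
    for u, m in zip(features["user"], features["movie"])
]
print(features)
```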


#### Preparing test data

In [71]:
# Creating a sparse matrix
test_sparse_matrix = sparse.csr_matrix((test.rating.values, (test.userId.values,
                                               test.movieId.values)))

In [72]:
test_averages = dict()

# Global avg of all movies by all users
test_global_average = test_sparse_matrix.sum()/test_sparse_matrix.count_nonzero()
test_averages['global'] = test_global_average
test_averages

{'global': 3.5510474476276186}

In [73]:
# Average ratings given by a user
test_averages['user'] = get_average_ratings(test_sparse_matrix, of_users=True)
print('Average rating of user 10 :', test_averages['user'][10])

# Average ratings given for a movie
test_averages['movie'] =  get_average_ratings(test_sparse_matrix, of_users=False)
print('Average rating of movie 15 :', test_averages['movie'][15])

Average rating of user 10 : 3.7
Average rating of movie 15 : 2.875


In [79]:
# get users, movies and ratings from test sparse matrix
test_users, test_movies, test_ratings = sparse.find(test_sparse_matrix)

In [80]:
final_test_data = getFinalData(test_users, test_movies, test_ratings, test_sparse_matrix, test_averages)
final_test_data.head()

Done for 10000 rows----- 0:02:35.688095
Done for 20000 rows----- 0:02:35.690510


Unnamed: 0,user,movie,GAvg,sur1,sur2,sur3,sur4,sur5,smr1,smr2,smr3,smr4,smr5,UAvg,MAvg,rating
0,43.0,1.0,3.551047,5.0,5.0,3.0,5.0,4.0,3.0,4.0,4.0,4.0,4.0,3.26087,3.966667,4.0
1,56.0,1.0,3.551047,5.0,5.0,3.0,5.0,5.0,4.0,2.0,2.0,4.0,4.0,3.592593,3.966667,4.0
2,69.0,1.0,3.551047,5.0,5.0,5.0,3.0,5.0,3.5,4.0,4.5,4.5,5.0,4.25,3.966667,5.0
3,73.0,1.0,3.551047,5.0,4.0,5.0,4.0,5.0,4.0,3.5,3.5,3.0,3.5,3.343939,3.966667,5.0
4,94.0,1.0,3.551047,3.5,3.5,2.5,5.0,4.5,1.5,3.0,3.5,4.5,3.0,3.410256,3.966667,4.0


In [81]:
test_pred_mf.shape

(20001,)

In [82]:
final_test_data['mf_svd']=test_pred_mf
final_test_data.head()

Unnamed: 0,user,movie,GAvg,sur1,sur2,sur3,sur4,sur5,smr1,smr2,smr3,smr4,smr5,UAvg,MAvg,rating,mf_svd
0,43.0,1.0,3.551047,5.0,5.0,3.0,5.0,4.0,3.0,4.0,4.0,4.0,4.0,3.26087,3.966667,4.0,3.981476
1,56.0,1.0,3.551047,5.0,5.0,3.0,5.0,5.0,4.0,2.0,2.0,4.0,4.0,3.592593,3.966667,4.0,3.640555
2,69.0,1.0,3.551047,5.0,5.0,5.0,3.0,5.0,3.5,4.0,4.5,4.5,5.0,4.25,3.966667,5.0,3.932812
3,73.0,1.0,3.551047,5.0,4.0,5.0,4.0,5.0,4.0,3.5,3.5,3.0,3.5,3.343939,3.966667,5.0,3.908617
4,94.0,1.0,3.551047,3.5,3.5,2.5,5.0,4.5,1.5,3.0,3.5,4.5,3.0,3.410256,3.966667,4.0,3.031592


# XGBoost

In [83]:
# Prepare train data
# dropping the user/movie ids and the target column
x_train = final_data.drop(['user', 'movie', 'rating'], axis=1)
y_train = final_data['rating']

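So the ratings are predicted from:
- the global, user and movie averages (GAvg, UAvg, MAvg)
- ratings of the same movie by similar users (sur1 to sur5)
- ratings by the same user of similar movies (smr1 to smr5)
- the matrix factorization prediction (mf_svd)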

In [84]:
# Prepare Test data
x_test = final_test_data.drop(['user', 'movie', 'rating'], axis=1)
y_test = final_test_data['rating']

In [91]:
# initialize XGBoost model
import xgboost as xgb
xgb_model = xgb.XGBRegressor(n_jobs=13, random_state=15, n_estimators=100)

In [92]:
print('Training the model..')
start = datetime.now()
xgb_model.set_params(eval_metric='rmse')
xgb_model.fit(x_train, y_train)
print('Done. Time taken={}\n'.format(datetime.now()-start))

Training the model..
Done. Time taken=0:00:09.332620



In [93]:
# Prediction
y_train_pred = xgb_model.predict(x_train)

# get the rmse and mape of train data..
rmse_train, mape_train = get_error_metrics(y_train.values, y_train_pred)
    
# store the results in train_results dictionary..
train_results = {'rmse': rmse_train, 
                 'mape' : mape_train, 
                 'predictions' : y_train_pred}
train_results

{'rmse': 0.708574116876655,
 'mape': 21.42636439286759,
 'predictions': array([3.8829207, 4.182848 , 3.7492647, ..., 4.9218345, 5.090525 ,
        4.913607 ], dtype=float32)}

In [94]:
# Test prediction
y_test_pred = xgb_model.predict(x_test) 
rmse_test, mape_test = get_error_metrics(y_true=y_test.values, y_pred=y_test_pred)

test_results = {'rmse': rmse_test, 
                'mape' : mape_test, 
                'predictions':y_test_pred}
test_results

{'rmse': 0.7700471894305032,
 'mape': 22.610169216035032,
 'predictions': array([3.6912205, 3.5940406, 4.1798253, ..., 4.993876 , 4.161826 ,
        2.8621824], dtype=float32)}
