# **Recommender Systems using Collaborative Filtering**

Recommender systems are one of the topics that ignited my curiosity and my interest in pursuing data science as a career. Many modern recommender systems employ collaborative filtering in one form or another. So, what is **collaborative filtering**?

Let's say we have data on users and movies, along with the ratings the users have given to the movies they have watched. Now suppose the company wants to recommend movies to me using collaborative filtering. It first finds the users who are similar to me by comparing the ratings that they and I have given to the same movies. It then uses the ratings of those similar users to predict the ratings I would give to the movies I haven't watched, and recommends the ones with the highest predicted ratings.

**Netflix** uses **collaborative filtering** to show us movies that 'users like us' have watched and liked. Similarly, **Amazon** recommends products based on what similar users have bought.

The dataset we will be using is the MovieLens 100k dataset on Kaggle :

https://www.kaggle.com/prajitdatta/movielens-100k-dataset

Now, let us build a recommender system using collaborative filtering.



In [0]:
#importing necessary libraries

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from surprise import Reader, Dataset, KNNBasic
from surprise.model_selection import cross_validate
from surprise import SVD

In [0]:
from google.colab import files
uploaded = files.upload()

Saving u.data to u.data


In [0]:
r_cols = ['user_id', 'movie_id', 'rating', 'timestamp']

ratings = pd.read_csv('u.data',  sep='\t', names=r_cols,
 encoding='latin-1')

ratings.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


The ratings dataframe contains the user_id, movie_id, rating and timestamp for every rating in the dataset.

In [0]:
from google.colab import files
uploaded = files.upload()

Saving u.item to u.item


In [0]:
i_cols = ['movie_id', 'title' ,'release date','video release date', 'IMDb URL', 'unknown', 'Action', 'Adventure',
 'Animation', 'Children\'s', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']


movies = pd.read_csv('u.item',  sep='|', names=i_cols, encoding='latin-1')


movies.head()

Unnamed: 0,movie_id,title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0


The **movies** dataframe will help us map the name of the movie to the movie id to get the ratings from the ratings dataframe.

In [0]:
from google.colab import files
uploaded = files.upload()

Saving u.user to u.user


In [0]:
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']

users = pd.read_csv('u.user', sep='|', names=u_cols,
 encoding='latin-1')

users.head()

Unnamed: 0,user_id,age,sex,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


In [0]:
ratings.shape

(100000, 4)

In [0]:
movies.shape

(1682, 24)

In [0]:
users.shape

(943, 5)

In [0]:
sum(ratings['movie_id'].isnull())

0

So, we have **1682 unique movies** and **100,000 total ratings** for these unique movies by **943 users**.

Now, we need to split our ratings dataframe into two parts: the first will be used to train the algorithm to predict ratings, and the second to test whether the predicted ratings are close to the actual ones. This will help us evaluate our models.

We will use user_id as y so that the split is **stratified** by user: each user's ratings are divided between the two sets in roughly the same 75/25 proportion, which keeps every user_id in the training set.

In [0]:
#Assign X as the original ratings dataframe and y as the user_id column of ratings.
X = ratings.copy()
y = ratings['user_id']

#Split into training and test datasets, stratified along user_id
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, stratify=y, random_state=42)

In [0]:
X_train.shape

(75000, 4)

In [0]:
X_test.shape

(25000, 4)

We have **75k ratings** in the **training** set and **25k** in the **test** set to evaluate our models.
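
To see why stratifying on user_id keeps every user in the training set, here is a minimal pure-Python sketch of a per-user split. It is an illustration only, not how `train_test_split` is implemented; the function `stratified_split` and the toy data are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(rows, key, test_fraction=0.25, seed=42):
    """Split rows so that every key value contributes to both parts in a similar ratio."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)

    train, test = [], []
    for members in groups.values():
        rng.shuffle(members)
        # Rounding down means a group never loses all of its rows to the test part.
        n_test = int(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

# Hypothetical toy data: (user_id, movie_id, rating); user 3 has only two ratings.
toy_ratings = [(u, m, 1 + (u + m) % 5) for u in (1, 2) for m in range(8)]
toy_ratings += [(3, 0, 4), (3, 1, 2)]

train, test = stratified_split(toy_ratings, key=lambda row: row[0])
train_users = {row[0] for row in train}
print(sorted(train_users), len(train), len(test))
```

Even the user with only two ratings stays in the training part, which is the property the similarity matrix later relies on.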

In [0]:
df_ratings = X_train.pivot(index='user_id', columns='movie_id', values='rating')


Now, our df_ratings dataframe is indexed by user_id, with one column per movie_id, and its values are the ratings. Most of the values are NaN, since each user watches and rates only a few movies. It's a **sparse** dataframe.

In [0]:
df_ratings.head()

movie_id,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,...,1633,1634,1635,1636,1637,1638,1639,1640,1641,1642,1643,1644,1645,1647,1648,1649,1651,1652,1653,1654,1656,1657,1658,1659,1660,1661,1662,1663,1664,1668,1669,1670,1671,1673,1674,1675,1676,1679,1681,1682
user_id
1,5.0,3.0,4.0,3.0,3.0,5.0,4.0,1.0,5.0,3.0,2.0,5.0,,5.0,,,,4.0,5.0,4.0,1.0,,4.0,3.0,4.0,3.0,,4.0,1.0,,3.0,5.0,4.0,,1.0,2.0,,3.0,4.0,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,,,,,,,,,,2.0,,,4.0,4.0,,,,,3.0,,,,,,4.0,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,,,,,,,,,,,4.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
5,,3.0,,,,,,,,,,,,,,,4.0,,,,3.0,,,4.0,3.0,,,,4.0,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [0]:
#943 users * 1647 unique movies (the other 35 of the 1682 movies have no ratings in the training set)
df_ratings.shape

(943, 1647)
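
A quick back-of-the-envelope check of how sparse this matrix is, using only the shapes reported above. It assumes each (user, movie) pair appears at most once in the training set, which `pivot` requires anyway.

```python
n_users, n_movies = 943, 1647   # shape of df_ratings reported above
n_train_ratings = 75_000        # number of rows in X_train

n_cells = n_users * n_movies
density = n_train_ratings / n_cells
print(f"{n_cells} cells, {density:.2%} filled, {1 - density:.2%} NaN")
```

Fewer than 5% of the cells hold a rating, so almost every entry that the later steps touch is a missing value.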

Now, we are going to use two different methods for collaborative filtering. In the first, we will predict ratings with a **weighted average** of other users' ratings. In the second, we will use the model-based approaches **KNN** and **SVD**, which we will talk about later.

In the first method, we will use the weighted average of the ratings, with cosine similarity as the weight. Users who are more similar to the input_user get a higher weight when computing the rating for the input_user.
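
As a small worked example of the weighted average, the sketch below predicts one rating from three other users. The helper `weighted_rating` and all numbers are hypothetical and only illustrate the arithmetic.

```python
def weighted_rating(neighbour_ratings, similarities):
    """Similarity-weighted mean of the ratings given by other users."""
    total_weight = sum(similarities)
    if total_weight == 0:
        return None  # nobody who rated the movie is similar to the user
    weighted_sum = sum(r * s for r, s in zip(neighbour_ratings, similarities))
    return weighted_sum / total_weight

# Hypothetical values: three users rated the movie 5, 3 and 1.
ratings_for_movie = [5, 3, 1]
similarity_to_me = [0.5, 0.25, 0.25]

prediction = weighted_rating(ratings_for_movie, similarity_to_me)
plain_mean = sum(ratings_for_movie) / len(ratings_for_movie)
print(prediction, plain_mean)  # 3.5 versus a plain mean of 3.0
```

The prediction is pulled above the plain mean because the most similar user gave the highest rating.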

Let's first replace the NaN values with 0s, since the similarity computation doesn't work with NaN values, and then build the recommender function using the weighted average of ratings. Note that this makes cosine similarity treat an unrated movie as a rating of 0.
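
To get a feel for what cosine similarity does with zero-filled vectors, here is a minimal pure-Python sketch. The function `cosine` and the four users are hypothetical; the notebook itself uses scikit-learn's `cosine_similarity`.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a user with no ratings is treated as similar to nobody
    return dot / (norm_a * norm_b)

# Hypothetical ratings over four movies; 0 stands for "not rated" after filling.
alice = [5, 4, 0, 0]
bob   = [5, 4, 0, 0]   # same ratings on the same movies
carol = [5, 4, 3, 3]   # agrees with alice, but has rated two more movies
dave  = [0, 0, 5, 4]   # no movie in common with alice

print(cosine(alice, bob), cosine(alice, carol), cosine(alice, dave))
```

Carol agrees with Alice on every movie they share, yet her similarity is lower than Bob's because her extra ratings face zeros in Alice's vector. This is a side effect of the zero fill worth keeping in mind.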

In [0]:
df_ratings_dummy = df_ratings.copy().fillna(0)

In [0]:
df_ratings_dummy.head()

movie_id,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,...,1633,1634,1635,1636,1637,1638,1639,1640,1641,1642,1643,1644,1645,1647,1648,1649,1651,1652,1653,1654,1656,1657,1658,1659,1660,1661,1662,1663,1664,1668,1669,1670,1671,1673,1674,1675,1676,1679,1681,1682
user_id
1,5.0,3.0,4.0,3.0,3.0,5.0,4.0,1.0,5.0,3.0,2.0,5.0,0.0,5.0,0.0,0.0,0.0,4.0,5.0,4.0,1.0,0.0,4.0,3.0,4.0,3.0,0.0,4.0,1.0,0.0,3.0,5.0,4.0,0.0,1.0,2.0,0.0,3.0,4.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,4.0,4.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,3.0,0.0,0.0,4.0,3.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [0]:
type(df_ratings_dummy)

pandas.core.frame.DataFrame

In [0]:
df_ratings_dummy.columns

Int64Index([   1,    2,    3,    4,    5,    6,    7,    8,    9,   10,
            ...
            1669, 1670, 1671, 1673, 1674, 1675, 1676, 1679, 1681, 1682],
           dtype='int64', name='movie_id', length=1647)

In [0]:
#cosine similarity of the ratings
similarity_matrix = cosine_similarity(df_ratings_dummy, df_ratings_dummy)

In [0]:
similarity_matrix.shape

(943, 943)

In [0]:
similarity_matrix_df = pd.DataFrame(similarity_matrix, index=df_ratings.index, columns=df_ratings.index)

In [0]:
type(similarity_matrix_df)

pandas.core.frame.DataFrame

In [0]:
#calculate ratings using the weighted mean of ratings, weighted by cosine similarity

def calculate_ratings(id_movie, id_user):
  #movies with no ratings in the training set get a default rating
  if id_movie not in df_ratings:
    return 2.5

  cosine_scores = similarity_matrix_df[id_user] #similarity of id_user with every other user
  ratings_scores = df_ratings[id_movie].dropna() #ratings of the users who have rated id_movie

  #keep only the similarity scores of the users who have rated id_movie
  cosine_scores = cosine_scores.loc[ratings_scores.index]

  #if no rater is similar to id_user, the weighted mean is undefined, so fall back to the default
  if cosine_scores.sum() == 0:
    return 2.5

  #weighted mean of the ratings, weighted by the cosine scores of the users who have rated the movie
  return np.dot(ratings_scores, cosine_scores)/cosine_scores.sum()


Now that we have written a function to calculate the rating given a user and a movie, let's see how it performs on the test set.

In [0]:
calculate_ratings(3,150)  #predicts rating for user_id 150 and movie_id 3

2.9926409218795715

In [0]:
X_test.shape

(25000, 4)

In [0]:
X_test.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
33745,237,489,4,879376381
93837,535,156,2,879617613
19779,176,303,3,886047118
76325,83,756,4,883867791
10309,232,204,4,888549515


Let's build a function score_on_test_set that evaluates our model on the test set using the **root mean squared error (RMSE)**.
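
As a small illustration of the metric, the sketch below computes RMSE by hand and compares it with the mean absolute error. The helper names and the numbers are hypothetical.

```python
import math

def rmse(true_ratings, predicted_ratings):
    """Root mean squared error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(true_ratings, predicted_ratings)]
    return math.sqrt(sum(squared) / len(squared))

def mae(true_ratings, predicted_ratings):
    """Mean absolute error, shown for comparison."""
    absolute = [abs(t - p) for t, p in zip(true_ratings, predicted_ratings)]
    return sum(absolute) / len(absolute)

# Hypothetical values: the errors are 0, -1, 1 and -2.
true_r = [4, 3, 5, 1]
pred_r = [4, 4, 4, 3]

print(rmse(true_r, pred_r), mae(true_r, pred_r))
```

RMSE comes out higher than MAE here because squaring gives the single two-point miss more weight than the smaller errors.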

In [0]:
#evaluates on test set
def score_on_test_set():
  #calculate_ratings expects (movie, user), so the pairs are built in that order
  movie_user_pairs = zip(X_test['movie_id'], X_test['user_id'])
  predicted_ratings = np.array([calculate_ratings(movie, user) for (movie, user) in movie_user_pairs])
  true_ratings = np.array(X_test['rating'])
  score = np.sqrt(mean_squared_error(true_ratings, predicted_ratings))
  return score

In [0]:
test_set_score = score_on_test_set()
print(test_set_score)

1.0172812824757378


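The **test set's root mean squared error is about 1.02**, which means that on a 1 to 5 scale our predictions are typically off by about one rating point, a reasonable baseline for such a simple method. Note that the test ratings come from users who also appear in the training set, so this measures how well the weighted average predicts unseen ratings of known users, not of new users. Let's now use the model-based approaches and see how far we can improve the RMSE.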

**Model based approaches**

In the model-based approaches, we will use two methods, KNN and SVD. The surprise package ships with ready-made models for building recommender systems, and we are going to use them.

In the **KNN-based approach**, the prediction is made by finding the k users most similar to the input_user who have rated the movie and taking an average of their ratings, weighted by similarity. KNN is best known as a classification algorithm, but here it is used to predict a numeric rating.
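
The idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the code surprise runs: it uses a mean-squared-difference similarity of the form 1 / (1 + msd), and the functions `msd_similarity` and `knn_predict` as well as the toy data are hypothetical.

```python
def msd_similarity(ratings_a, ratings_b):
    """1 / (1 + mean squared difference) over the movies both users rated."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    msd = sum((ratings_a[m] - ratings_b[m]) ** 2 for m in common) / len(common)
    return 1.0 / (1.0 + msd)

def knn_predict(user, movie, all_ratings, k=2):
    """Similarity-weighted mean over the k most similar users who rated `movie`."""
    neighbours = [
        (msd_similarity(all_ratings[user], other_ratings), other_ratings[movie])
        for other, other_ratings in all_ratings.items()
        if other != user and movie in other_ratings
    ]
    top_k = sorted(neighbours, reverse=True)[:k]
    total_sim = sum(sim for sim, _ in top_k)
    if total_sim == 0:
        return None  # no usable neighbour
    return sum(sim * rating for sim, rating in top_k) / total_sim

# Hypothetical toy data: {user: {movie: rating}}
toy = {
    "ann": {"A": 5, "B": 4},
    "ben": {"A": 5, "B": 4, "C": 5},   # identical taste to ann
    "cat": {"A": 4, "B": 4, "C": 4},   # close to ann
    "dan": {"A": 1, "B": 1, "C": 1},   # very different from ann
}
print(knn_predict("ann", "C", toy, k=2))
print(knn_predict("ann", "C", toy, k=3))
```

With k=2 only ben and cat contribute. With k=3 the dissimilar user dan is included as well, but his low similarity gives him little influence on the result.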

In the **SVD (Singular Value Decomposition) method**, the sparse user-movie (ratings) matrix is compressed into dense, lower-dimensional matrices by applying **matrix factorization techniques**. If M is a user × movie matrix, SVD decomposes it into 3 parts: M = U Z Vᵀ, where intuitively U is the user × concept matrix, Z is a diagonal matrix holding the weights of the different concepts and Vᵀ is the concept × movie matrix. A concept can be understood intuitively as a group of similar movies; for example, the suspense thriller genre could be a concept.

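Once the matrix is factorized, each user and each movie is represented by a dense vector of concept weights, and the rating for a (user, movie) pair is predicted from the product of the two vectors, that is, from how strongly the user likes the concepts the movie belongs to. Note that the SVD class in surprise does not compute an exact decomposition of the full matrix; it learns the user and movie factors, together with bias terms, from the observed ratings only.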
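
To make the "concept" idea concrete, here is a minimal sketch of matrix factorization trained by stochastic gradient descent on the observed ratings only. It is a simplification for illustration: it has no bias terms, and the functions `train_factors`, `dot` and `rmse`, the hyperparameters and the toy matrix are all hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def train_factors(observed, n_users, n_movies, n_factors=2,
                  lr=0.03, reg=0.02, epochs=300, seed=0):
    """Learn one small dense vector per user and per movie by SGD."""
    rng = random.Random(seed)
    user_f = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    movie_f = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_movies)]
    for _ in range(epochs):
        for u, m, r in observed:
            err = r - dot(user_f[u], movie_f[m])
            for f in range(n_factors):
                pu, qm = user_f[u][f], movie_f[m][f]
                # Gradient step on the squared error with L2 regularization.
                user_f[u][f] += lr * (err * qm - reg * pu)
                movie_f[m][f] += lr * (err * pu - reg * qm)
    return user_f, movie_f

def rmse(observed, user_f, movie_f):
    errors = [(r - dot(user_f[u], movie_f[m])) ** 2 for u, m, r in observed]
    return math.sqrt(sum(errors) / len(errors))

# Hypothetical toy matrix with two "concepts": users 0-1 like movies 0-1,
# users 2-3 like movies 2-3. Two cells are held out as unknown.
full = [[5, 5, 1, 1],
        [5, 5, 1, 1],
        [1, 1, 5, 5],
        [1, 1, 5, 5]]
held_out = {(0, 1), (3, 0)}
observed = [(u, m, full[u][m]) for u in range(4) for m in range(4)
            if (u, m) not in held_out]

untrained = train_factors(observed, 4, 4, epochs=0)
user_f, movie_f = train_factors(observed, 4, 4)

print("training RMSE before/after:", rmse(observed, *untrained), rmse(observed, user_f, movie_f))
print("held-out (0, 1), rated 5 in `full`:", dot(user_f[0], movie_f[1]))
print("held-out (3, 0), rated 1 in `full`:", dot(user_f[3], movie_f[0]))
```

The two held-out cells are never seen during training; the model fills them in from the factors it learned for the other users and movies that share the same pattern.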


In [0]:
# installing surprise library
!pip install surprise

Collecting surprise
  Downloading https://files.pythonhosted.org/packages/61/de/e5cba8682201fcf9c3719a6fdda95693468ed061945493dea2dd37c5618b/surprise-0.1-py2.py3-none-any.whl
Collecting scikit-surprise
  Downloading https://files.pythonhosted.org/packages/f5/da/b5700d96495fb4f092be497f02492768a3d96a3f4fa2ae7dea46d4081cfa/scikit-surprise-1.1.0.tar.gz (6.4MB)
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.0-cp36-cp36m-linux_x86_64.whl size=1678246 sha256=b2790a8be77ef9cb2a3c290321534ae24d6a29a1df0cdf285ce069d21e75e337
  Stored in directory: /root/.cache/pip/wheels/cc/fa/8c/16c93fccce688ae1bde7d979ff102f7bee980d9cfeb8641bcf
Successfully built scikit-surprise
Installing collected packages: scikit-surprise, surprise
Successfully installed scikit-surprise-1.1.0 surprise-0.1


In [0]:
#Define a Reader object
#The Reader object helps in parsing the file or dataframe containing ratings

ratings = ratings.drop(columns='timestamp')
reader = Reader()

#dataset creation
data = Dataset.load_from_df(ratings, reader)

#model
knn = KNNBasic()

#Evaluating the performance in terms of RMSE
cross_validate(knn, data, measures=['RMSE', 'mae'], cv = 3)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.


{'fit_time': (0.16745638847351074, 0.19007349014282227, 0.19052672386169434),
 'test_mae': array([0.78058569, 0.78245415, 0.78244863]),
 'test_rmse': array([0.9881952 , 0.98896545, 0.99125656]),
 'test_time': (3.9972915649414062, 4.044982194900513, 3.9820525646209717)}

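We can see that the **root mean squared error** for KNN is lower, at about **0.99** averaged over the three folds, compared with 1.02 for the weighted mean approach. KNN appears to perform better than the weighted mean approach at predicting movie ratings, although the two numbers are not strictly comparable: KNN was scored with 3-fold cross-validation, while the weighted mean was scored on a single 75/25 split.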

Now, let's see how SVD performs.

In [0]:
#Define the SVD algorithm object
svd = SVD()

#Evaluate the performance in terms of RMSE
cross_validate(svd, data, measures=['RMSE'], cv = 3)

{'fit_time': (3.4192357063293457, 3.4798734188079834, 3.466024160385132),
 'test_rmse': array([0.94628708, 0.94844318, 0.94181908]),
 'test_time': (0.2957947254180908, 0.1984567642211914, 0.30214881896972656)}

**The error has reduced further, to an RMSE of about 0.946 averaged over the three folds, which is the best result among the 3 approaches we used.**
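
Since cross_validate reports one RMSE per fold, a fair summary is the mean across folds. The snippet below does that arithmetic on the per-fold values copied from the two outputs above.

```python
from statistics import mean

# Per-fold RMSE values copied from the cross_validate outputs above.
knn_rmse = [0.9881952, 0.98896545, 0.99125656]
svd_rmse = [0.94628708, 0.94844318, 0.94181908]

knn_mean, svd_mean = mean(knn_rmse), mean(svd_rmse)
improvement = (knn_mean - svd_mean) / knn_mean
print(f"KNN {knn_mean:.4f}  SVD {svd_mean:.4f}  relative improvement {improvement:.1%}")
```

Both models were scored with the same 3-fold procedure, so this comparison between KNN and SVD is like for like.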

In [0]:
trainset = data.build_full_trainset()

In [0]:
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fc412952c18>

In [0]:
ratings[ratings['user_id'] == 5]

Unnamed: 0,user_id,movie_id,rating
172,5,2,3
439,5,17,4
673,5,439,1
679,5,225,2
922,5,110,1
...,...,...,...
93172,5,419,3
94436,5,375,3
95021,5,373,3
96918,5,368,1


In [0]:
svd.predict(1, 110)

Prediction(uid=1, iid=110, r_ui=None, est=2.146343065683422, details={'was_impossible': False})

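**The SVD model predicts a rating of about 2.15 for user_id 1 and movie 110, against an actual rating of 2.** Note that the table above lists the ratings of user 5, not user 1, and that the model was fit on the full dataset, so this rating was part of the training data: the close match is an in-sample check rather than evidence of how the model generalizes.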
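
Turning single predictions into recommendations amounts to scoring the movies a user has not rated and keeping the best ones. The sketch below shows that step with a stand-in predictor; `top_n_for_user`, `fake_predict` and the estimates are hypothetical. With the fitted model above, one could pass something like `lambda u, m: svd.predict(u, m).est` instead.

```python
def top_n_for_user(predict_fn, user_id, candidate_movies, already_rated, n=3):
    """Rank unseen movies for one user by predicted rating, highest first."""
    unseen = [m for m in candidate_movies if m not in already_rated]
    scored = [(predict_fn(user_id, m), m) for m in unseen]
    scored.sort(reverse=True)
    return [(m, est) for est, m in scored[:n]]

# Stand-in for a fitted model: hypothetical fixed estimates keyed by (user, movie).
fake_estimates = {(1, 10): 4.2, (1, 11): 3.1, (1, 12): 4.8, (1, 13): 2.0}

def fake_predict(user_id, movie_id):
    return fake_estimates[(user_id, movie_id)]

recommendations = top_n_for_user(fake_predict, 1, [10, 11, 12, 13, 14],
                                 already_rated={14}, n=2)
print(recommendations)  # [(12, 4.8), (10, 4.2)]
```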

Thank you. 

I hope this notebook has ignited your interest in recommender systems.