# Developing a Prediction-Based Recommendation System using Model-Based Matrix Factorization

# Dataset story

### The dataset is provided by MovieLens; it contains movies and the ratings given to them. It holds more than 20,000,000 ratings for approximately 27,000 movies.

# Variables

### The dataset contains several tables, but we will use only two CSV files.

#### movie.csv
* movieId - Unique movie number
* title - Movie name
* genres - Pipe-separated list of the movie's genres

#### rating.csv
* userId - Unique user number
* movieId - Unique movie number
* rating - The rating the user gave to the movie
* timestamp - Date and time of the rating

# Importing the libraries

In [1]:
import pandas as pd
from surprise import Reader, SVD, Dataset, accuracy
from surprise.model_selection import train_test_split, GridSearchCV, cross_validate
pd.set_option('display.max_columns', None)

# Reading and combining the dataset

In [2]:
movie = pd.read_csv('/kaggle/input/movies-ratings/movie.csv')
rating = pd.read_csv('/kaggle/input/movies-ratings/rating.csv')
df = movie.merge(rating, how='left', on='movieId')
df.head()

,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3.0,4.0,1999-12-11 13:36:47
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,6.0,5.0,1997-03-13 17:50:52
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,8.0,4.0,1996-06-05 13:37:51
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,10.0,4.0,1999-11-25 02:44:47
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,11.0,4.5,2009-01-02 01:13:41
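
The merge above uses `how='left'`, which is why `userId` shows up as a float. A minimal sketch on made-up data (the toy frames and their values are assumptions, not part of the dataset) illustrates the effect: a movie without ratings is kept and filled with `NaN`.

```python
import pandas as pd

# Toy stand-ins for movie.csv and rating.csv (values are made up for illustration).
movie_toy = pd.DataFrame({"movieId": [1, 2, 3],
                          "title": ["A", "B", "C"]})
rating_toy = pd.DataFrame({"userId": [10, 11, 10],
                           "movieId": [1, 1, 2],
                           "rating": [4.0, 5.0, 3.5]})

# how='left' keeps every movie, including movie 3, which nobody rated.
merged = movie_toy.merge(rating_toy, how="left", on="movieId")
print(merged)

# The unrated movie gets NaN in the rating columns, which turns userId into a float column.
print(merged["userId"].dtype)
print(merged["rating"].isna().sum())

# Dropping the NaN rows leaves only (user, movie) pairs that carry a rating.
rated_only = merged.dropna(subset=["rating"])
print(len(rated_only))
```

If rows without ratings are not needed, `how='inner'` or a `dropna` on `rating` would probably be a reasonable alternative.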


# Data preparation

### To keep the example easy to follow, let's pick four movie IDs and the titles they belong to

In [3]:
movie_ids = [130219, 356, 4422, 541]
movies = ["The Dark Knight (2011)",
          "Forrest Gump (1994)",
          "Cries and Whispers (Viskningar och rop) (1972)",
          "Blade Runner (1982)"]   # listed in the same order as movie_ids

### Let's restrict the dataset to those four movies

In [4]:
sample_df = df[df['movieId'].isin(movie_ids)]
sample_df.head()

,movieId,title,genres,userId,rating,timestamp
2457839,356,Forrest Gump (1994),Comedy|Drama|Romance|War,4.0,4.0,1996-08-24 09:28:42
2457840,356,Forrest Gump (1994),Comedy|Drama|Romance|War,7.0,4.0,2002-01-16 19:02:55
2457841,356,Forrest Gump (1994),Comedy|Drama|Romance|War,8.0,5.0,1996-06-05 13:44:19
2457842,356,Forrest Gump (1994),Comedy|Drama|Romance|War,9.0,4.0,2001-07-01 20:26:38
2457843,356,Forrest Gump (1994),Comedy|Drama|Romance|War,10.0,3.0,1999-11-25 02:32:02


In [5]:
sample_df.shape

(97343, 6)

### Let's create `user_movie_df`, a user-by-movie rating matrix, using `pivot_table`

In [6]:
user_movie_df = sample_df.pivot_table(index='userId', columns=['title'], values='rating')
user_movie_df.head()

userId,Blade Runner (1982),Cries and Whispers (Viskningar och rop) (1972),Forrest Gump (1994),The Dark Knight (2011)
1.0,4.0,,,
2.0,5.0,,,
3.0,5.0,,,
4.0,,,4.0,
7.0,,,4.0,


In [7]:
user_movie_df.shape

(76918, 4)
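
The pivot turns one row per rating into one row per user, so most cells end up empty. If every row of `sample_df` is a distinct user-movie rating, roughly 97,343 of the 76,918 × 4 = 307,672 cells (about 32%) would be filled. The sketch below shows the same idea on made-up data; the frame `long_df` and its values are assumptions for illustration only.

```python
import pandas as pd

# Made-up long-format ratings, one row per (user, movie) pair.
long_df = pd.DataFrame({
    "userId": [1, 1, 2, 3, 3],
    "title": ["Blade Runner (1982)", "Forrest Gump (1994)",
              "Forrest Gump (1994)", "Blade Runner (1982)",
              "Forrest Gump (1994)"],
    "rating": [4.0, 5.0, 3.0, 4.5, 2.0],
})

# One row per user, one column per title; missing pairs become NaN.
wide_df = long_df.pivot_table(index="userId", columns="title", values="rating")

# Fraction of user-movie cells that actually hold a rating.
filled = wide_df.notna().to_numpy().mean()
print(wide_df)
print(f"filled cells: {filled:.2%}")
```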

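### Let's tell `surprise` the range of the rating variable using the `Reader` class. The code below uses (1, 5); MovieLens ratings actually run from 0.5 to 5 in half-star steps, so `rating_scale=(0.5, 5)` would match the data more closely.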

In [8]:
reader = Reader(rating_scale = (1, 5))

### Now, let's convert the DataFrame into the format the `surprise` library expects. `Dataset.load_from_df` needs exactly three columns, in the order user, item, rating.

In [9]:
data = Dataset.load_from_df(sample_df[['userId', 'movieId', 'rating']], reader)

# Modeling

### Let's split the data into a training set to fit the model and a test set to evaluate it

In [10]:
trainset, testset = train_test_split(data, test_size=0.25)
svd_model = SVD().fit(trainset)         # instantiate the model and fit it on the training set
predictions = svd_model.test(testset)   # predict the ratings of the held-out test set
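
To make the idea behind `SVD` more concrete, here is a minimal pure-Python sketch of biased matrix factorization trained with stochastic gradient descent, where a rating is estimated as global mean + user bias + item bias + dot product of latent factors. It is an illustration under assumptions, not the library's implementation: the function names `fit_mf` and `predict`, the toy ratings, and the hyperparameter values are all made up for this example.

```python
import random

def fit_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit a biased matrix factorization on (user, item, rating) triples with SGD."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)      # global mean rating
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    b_u = {u: 0.0 for u in users}                          # user biases
    b_i = {i: 0.0 for i in items}                          # item biases
    # Latent factors start as small random numbers.
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    for _ in range(n_epochs):
        for u, i, r in ratings:
            est = mu + b_u[u] + b_i[i] + sum(a * b for a, b in zip(p[u], q[i]))
            err = r - est
            # Move each parameter along the error, shrunk by L2 regularization.
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            for f in range(n_factors):
                pu, qi = p[u][f], q[i][f]
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)
    return mu, b_u, b_i, p, q

def predict(model, u, i, low=1.0, high=5.0):
    """Estimate a rating; unknown users or items contribute no bias or factors."""
    mu, b_u, b_i, p, q = model
    est = mu + b_u.get(u, 0.0) + b_i.get(i, 0.0)
    if u in p and i in q:
        est += sum(a * b for a, b in zip(p[u], q[i]))
    return min(high, max(low, est))                        # clip to the rating scale

# Made-up ratings for illustration.
toy = [("u1", "A", 5.0), ("u1", "B", 3.0), ("u2", "A", 4.0), ("u2", "C", 1.0),
       ("u3", "B", 2.0), ("u3", "C", 1.5), ("u4", "A", 4.5), ("u4", "B", 3.5)]
model = fit_mf(toy)
mu = model[0]
base_rmse = (sum((r - mu) ** 2 for _, _, r in toy) / len(toy)) ** 0.5
train_rmse = (sum((r - predict(model, u, i)) ** 2 for u, i, r in toy) / len(toy)) ** 0.5
print(f"baseline RMSE {base_rmse:.3f}, fitted RMSE {train_rmse:.3f}")
print(predict(model, "u3", "A"))   # a pair that does not appear in the toy data
```

The point of the sketch is that the model never fills in the sparse matrix directly; it learns a handful of numbers per user and per movie and combines them to estimate any missing rating.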

### Let's calculate the RMSE and MAE values on the test set

In [11]:
print(f'The RMSE value of the model is {round(accuracy.rmse(predictions), 3)}')
print(f'The MAE value of the model is {round(accuracy.mae(predictions), 3)}')

RMSE: 0.9411
The RMSE value of the model is 0.941
MAE:  0.7242
The MAE value of the model is 0.724
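
Both metrics compare the true rating with the estimate: RMSE is the square root of the mean squared error, MAE the mean absolute error. A small sketch of the two formulas on made-up (true, estimated) pairs follows; the helper names `rmse` and `mae` and the numbers are assumptions for illustration.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true, estimated) rating pairs."""
    return math.sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true, estimated) rating pairs."""
    return sum(abs(r - est) for r, est in pairs) / len(pairs)

# Made-up (true rating, estimated rating) pairs.
pairs = [(4.0, 3.5), (5.0, 4.0), (3.0, 3.0), (2.0, 4.0)]
rmse_value = rmse(pairs)
mae_value = mae(pairs)
print(round(rmse_value, 4), round(mae_value, 4))
```

RMSE penalizes large misses more heavily than MAE, which is why it is never smaller than MAE on the same predictions. With `surprise`, the pairs could be built from the test results as `[(p.r_ui, p.est) for p in predictions]`.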


### Let's predict the rating a specific user would give to a specific movie

In [12]:
svd_model.predict(uid=1.0, iid=541, verbose=True)

user: 1.0        item: 541        r_ui = None   est = 4.31   {'was_impossible': False}


Prediction(uid=1.0, iid=541, r_ui=None, est=4.31059354884639, details={'was_impossible': False})

# Model tuning

### Let's tune the hyperparameters to improve the model's prediction performance

In [13]:
param_grid = {'n_epochs': [5, 10, 20, 30, 50],
             'lr_all': [0.002, 0.005, 0.007, 0.01]}

In [14]:
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=5, n_jobs=-1, joblib_verbose=True)
gs.fit(data)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:  5.1min finished
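
The "100 out of 100" in the log follows from the grid: 5 values of `n_epochs` times 4 values of `lr_all` gives 20 combinations, and each one is fitted once per fold with `cv=5`. A short sketch of that enumeration, using only the standard library (the variable names `combos` and `n_fits` are illustrative):

```python
from itertools import product

param_grid = {'n_epochs': [5, 10, 20, 30, 50],
              'lr_all': [0.002, 0.005, 0.007, 0.01]}
cv = 5

# Every combination of one value per hyperparameter.
names = list(param_grid)
combos = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Each combination is trained and evaluated once per fold.
n_fits = len(combos) * cv
print(len(combos), n_fits)
```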


### Let's look at the best cross-validated RMSE and MAE values

In [15]:
print('RMSE:', gs.best_score['rmse'], 'and', 'MAE:', gs.best_score['mae'])

RMSE: 0.9312019912711843 and MAE: 0.718338452797967
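
These scores are averages over the 5 folds: each fold serves once as the test set while the other four are used for training. The sketch below shows one simple way such a split can be produced; the helper `k_fold_indices` is hypothetical and is not how `surprise` implements its folds.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k shuffled folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]          # k roughly equal parts
    for f in range(k):
        test = folds[f]
        train = [j for g, fold in enumerate(folds) if g != f for j in fold]
        yield train, test

splits = list(k_fold_indices(23, k=5))
for train, test in splits:
    print(len(train), len(test))
```

Because every rating is used for testing exactly once, the averaged score is usually a steadier estimate than the single 75/25 split used earlier.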


### Let's determine the best parameters for RMSE and MAE

In [16]:
print('The best parameters for RMSE are', gs.best_params['rmse'])

The best parameters for RMSE are {'n_epochs': 5, 'lr_all': 0.002}


In [17]:
print('The best parameters for MAE are', gs.best_params['mae'])

The best parameters for MAE are {'n_epochs': 10, 'lr_all': 0.002}


# Final model and prediction

### Since the tuned hyperparameters gave better results, let's create the SVD model again with them.

In [18]:
svd_model_rmse = SVD(**gs.best_params['rmse'])
svd_model_mae = SVD(**gs.best_params['mae'])

### Let's train on the whole dataset so the model can learn from more data

In [19]:
data = data.build_full_trainset()

### Let's predict using the above defined movie ids

In [20]:
movie_ids = [130219, 356, 4422, 541]
svd_model_rmse.fit(data)
for i in movie_ids:
    # verbose=True already prints the prediction, so no extra print() is needed
    svd_model_rmse.predict(uid=1.0, iid=i, verbose=True)

user: 1.0        item: 130219     r_ui = None   est = 3.92   {'was_impossible': False}
user: 1.0        item: 356        r_ui = None   est = 4.07   {'was_impossible': False}
user: 1.0        item: 4422       r_ui = None   est = 4.07   {'was_impossible': False}
user: 1.0        item: 541        r_ui = None   est = 4.21   {'was_impossible': False}


In [21]:
movie_ids = [130219, 356, 4422, 541]
svd_model_mae.fit(data)
for i in movie_ids:
    # verbose=True already prints the prediction, so no extra print() is needed
    svd_model_mae.predict(uid=1.0, iid=i, verbose=True)

user: 1.0        item: 130219     r_ui = None   est = 4.14   {'was_impossible': False}
user: 1.0        item: 356        r_ui = None   est = 4.06   {'was_impossible': False}
user: 1.0        item: 4422       r_ui = None   est = 3.90   {'was_impossible': False}
user: 1.0        item: 541        r_ui = None   est = 4.16   {'was_impossible': False}


### We now have a tuned model that can predict the rating for any user-movie pair. Given a chosen subset of users and movies, we can therefore estimate which movie suits which user. A predicted rating may be low, however, so after generating the estimates we should filter them and recommend to each user only the movies with high predicted ratings.
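
The filtering step described above could look roughly like the sketch below. The function `recommend`, the threshold of 4.0, and `top_n=2` are assumptions chosen for illustration; the estimates are the ones printed for user 1.0 by the RMSE-tuned model.

```python
def recommend(predicted, threshold=4.0, top_n=2):
    """Map each user to the top_n items whose estimate reaches the threshold.

    predicted: {user: {item: estimated rating}}
    returns:   {user: [(item, estimate), ...]} sorted from highest to lowest
    """
    recs = {}
    for user, scores in predicted.items():
        keep = [(item, est) for item, est in scores.items() if est >= threshold]
        keep.sort(key=lambda pair: pair[1], reverse=True)
        recs[user] = keep[:top_n]
    return recs

# Estimates for user 1.0 taken from the RMSE-tuned model's output.
predicted = {1.0: {130219: 3.92, 356: 4.07, 4422: 4.07, 541: 4.21}}
recs = recommend(predicted, threshold=4.0, top_n=2)
print(recs)
```

In practice the threshold and the number of recommendations would depend on the application, and movies the user has already rated would normally be excluded first.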

# Thank you for checking my notebook!