## 3-Collaborative-Based-Filtering (ML Example)

### Loading data

In [3]:
import pandas as pd

ratings = pd.read_csv("ratings.csv")[["userId", "movieId", "rating"]]
ratings.head()

Unnamed: 0,userId,movieId,rating
0,1,31,2.5
1,1,1029,3.0
2,1,1061,3.0
3,1,1129,2.0
4,1,1172,4.0


### Create the dataset

In [2]:
!pip install scikit-surprise

Collecting scikit-surprise
  Downloading scikit_surprise-1.1.4.tar.gz (154 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (pyproject.toml) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.4-cp39-cp39-linux_x86_64.whl size=2527007 sha256=08401bc581022d592b28d3b4717494a23b9195d8f613d6133589265aa1594b1d
  Stored in directory: /root/.cache/pip/wheels/42/41/d3/a56ae864ad22cc6583ec9312be43fbc611c87e53dc49aac953
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.4

In [5]:
from surprise import Dataset, Reader

reader = Reader(rating_scale=(1, 5))
dataset = Dataset.load_from_df(df=ratings, reader=reader)

In [6]:
dataset

<surprise.dataset.DatasetAutoFolds at 0x7f7d00b5d400>

### Build the trainset

In [7]:
trainset = dataset.build_full_trainset()
trainset

<surprise.trainset.Trainset at 0x7f7cf99f6af0>

In [8]:
list(trainset.all_ratings())

[(0, 0, 2.5),
 (0, 1, 3.0),
 (0, 2, 3.0),
 (0, 3, 2.0),
 (0, 4, 4.0),
 (0, 5, 2.0),
 (0, 6, 2.0),
 (0, 7, 2.0),
 (0, 8, 3.5),
 (0, 9, 2.0),
 (0, 10, 2.5),
 (0, 11, 1.0),
 (0, 12, 4.0),
 (0, 13, 4.0),
 (0, 14, 3.0),
 (0, 15, 2.0),
 (0, 16, 2.0),
 (0, 17, 2.5),
 (0, 18, 1.0),
 (0, 19, 3.0),
 (1, 20, 4.0),
 (1, 21, 5.0),
 (1, 22, 5.0),
 (1, 23, 4.0),
 (1, 24, 4.0),
 (1, 25, 3.0),
 (1, 26, 3.0),
 (1, 27, 4.0),
 (1, 28, 3.0),
 (1, 29, 5.0),
 (1, 30, 4.0),
 (1, 31, 3.0),
 (1, 32, 3.0),
 (1, 33, 3.0),
 (1, 34, 3.0),
 (1, 35, 3.0),
 (1, 36, 3.0),
 (1, 37, 5.0),
 (1, 38, 1.0),
 (1, 39, 3.0),
 (1, 40, 3.0),
 (1, 41, 3.0),
 (1, 42, 4.0),
 (1, 43, 4.0),
 (1, 44, 5.0),
 (1, 45, 5.0),
 (1, 46, 3.0),
 (1, 47, 4.0),
 (1, 48, 3.0),
 (1, 49, 4.0),
 (1, 50, 3.0),
 (1, 51, 4.0),
 (1, 52, 2.0),
 (1, 53, 1.0),
 (1, 54, 3.0),
 (1, 55, 4.0),
 (1, 56, 4.0),
 (1, 57, 3.0),
 (1, 58, 3.0),
 (1, 59, 3.0),
 (1, 60, 3.0),
 (1, 61, 2.0),
 (1, 62, 3.0),
 (1, 63, 3.0),
 (1, 64, 3.0),
 (1, 65, 3.0),
 (1, 66, 2.0),
 (1, 
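The triples above use inner ids rather than the raw `userId` and `movieId` values from `ratings.csv`: raw user 1 became user 0, and raw movies 31, 1029, 1061, ... became items 0, 1, 2, .... A minimal sketch of such a remapping, assuming ids are assigned in order of first appearance (which is consistent with the output above), might look like the following. The helper `to_inner_ids` is hypothetical and is not part of Surprise.

```python
# Hypothetical helper (not Surprise's API): map raw ids to contiguous inner ids,
# assuming ids are handed out in order of first appearance.
def to_inner_ids(rows):
    user_ids = {}  # raw user id -> inner user id
    item_ids = {}  # raw item id -> inner item id
    inner_rows = []
    for raw_uid, raw_iid, rating in rows:
        # setdefault assigns the next free integer the first time an id is seen
        uid = user_ids.setdefault(raw_uid, len(user_ids))
        iid = item_ids.setdefault(raw_iid, len(item_ids))
        inner_rows.append((uid, iid, rating))
    return inner_rows, user_ids, item_ids


# The first five rows shown by ratings.head() earlier in the notebook
head_rows = [
    (1, 31, 2.5),
    (1, 1029, 3.0),
    (1, 1061, 3.0),
    (1, 1129, 2.0),
    (1, 1172, 4.0),
]

inner_rows, user_ids, item_ids = to_inner_ids(head_rows)
print(inner_rows)
```

On a fitted trainset, Surprise itself exposes the mapping through `trainset.to_raw_uid`, `trainset.to_raw_iid`, `trainset.to_inner_uid`, and `trainset.to_inner_iid`.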

### Training the model (Singular Value Decomposition (SVD) algorithm - from surprise library)

In [9]:
from surprise import SVD

In [10]:
svd = SVD()
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f7d00c7b9a0>

In [15]:
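# A notebook cell only displays its last expression, so this first filter
# (all ratings by user 15) is evaluated but not shown.
ratings[ratings["userId"] == 15]
# Displayed: every rating given to movie 1956, including user 15's 4.0.
ratings[ratings["movieId"] == 1956]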

Unnamed: 0,userId,movieId,rating
1379,15,1956,4.0
4276,23,1956,1.0
5012,28,1956,4.0
5402,30,1956,4.0
8923,57,1956,4.0
9554,65,1956,3.0
15686,102,1956,4.0
18047,119,1956,4.0
22845,160,1956,4.0
30288,214,1956,5.0


In [16]:
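# Arguments are the raw user id, the raw movie id, and (optionally) the known
# true rating r_ui, which is stored in the result but not used for the estimate.
svd.predict(15, 1956, 4.0)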

Prediction(uid=15, iid=1956, r_ui=4.0, est=3.5338476501934215, details={'was_impossible': False})

In [17]:
svd.predict(15, 1956).est

3.5338476501934215
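As a rough illustration of where `est` comes from: Surprise's `SVD` estimates a rating as the global mean plus a user bias, an item bias, and the dot product of the user and item factor vectors, and `predict` clips the result to the rating scale by default. The sketch below reproduces that rule in plain Python. The function name `predict_rating` and all numeric values are hypothetical and chosen only for illustration; they are not the fitted parameters of the model above.

```python
# Sketch of the biased matrix-factorization prediction rule:
#   est = global_mean + user_bias + item_bias + dot(user_factors, item_factors)
# followed by clipping to the rating scale. All inputs below are made up.
def predict_rating(global_mean, user_bias, item_bias,
                   user_factors, item_factors, rating_scale=(1, 5)):
    dot = sum(p * q for p, q in zip(user_factors, item_factors))
    est = global_mean + user_bias + item_bias + dot
    low, high = rating_scale
    # Keep the estimate inside the valid rating range
    return min(high, max(low, est))


# Hypothetical parameters for one (user, item) pair
example = predict_rating(
    global_mean=3.5,
    user_bias=0.1,
    item_bias=-0.2,
    user_factors=[0.1, -0.3],
    item_factors=[0.2, 0.1],
)
print(example)
```

On the fitted model, the corresponding quantities are available as `trainset.global_mean`, `svd.bu`, `svd.bi`, `svd.pu`, and `svd.qi`, all indexed by inner id. Because `SVD` is initialized randomly, the exact estimate changes between runs unless `random_state` is set.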

### Validate model / Evaluation metric

In [18]:
from surprise import model_selection

In [19]:
model_selection.cross_validate(svd, dataset, measures=["RMSE", "MAE"], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8978  0.8976  0.8925  0.8985  0.8972  0.8967  0.0021  
MAE (testset)     0.6895  0.6931  0.6870  0.6945  0.6906  0.6909  0.0026  
Fit time          0.68    0.62    0.72    0.63    0.73    0.68    0.05    
Test time         0.10    0.08    0.08    0.08    0.08    0.08    0.01    


{'test_rmse': array([0.89783257, 0.89759996, 0.8925306 , 0.89846171, 0.89718292]),
 'test_mae': array([0.68949324, 0.69314958, 0.68700218, 0.69447191, 0.69061942]),
 'fit_time': (0.6822845935821533,
  0.6231322288513184,
  0.7216906547546387,
  0.6312494277954102,
  0.7328755855560303),
 'test_time': (0.10038518905639648,
  0.07878947257995605,
  0.08070802688598633,
  0.07885193824768066,
  0.08064770698547363)}
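The Mean and Std columns of the table can be recovered from the per-fold arrays in the returned dictionary. The sketch below copies the fold scores from the output above into plain lists and summarizes them with the standard library. The population standard deviation (`statistics.pstdev`) is used because it matches the values printed in the table; the variable names are illustrative.

```python
import statistics

# Per-fold scores copied from the cross_validate output above
results = {
    "test_rmse": [0.89783257, 0.89759996, 0.8925306, 0.89846171, 0.89718292],
    "test_mae": [0.68949324, 0.69314958, 0.68700218, 0.69447191, 0.69061942],
}

# Mean and population standard deviation for each measure
summary = {
    name: (statistics.mean(scores), statistics.pstdev(scores))
    for name, scores in results.items()
}

for name, (mean, std) in summary.items():
    print(f"{name}: mean={mean:.4f} std={std:.4f}")
```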

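> 1. A lower RMSE (Root Mean Squared Error) indicates a better fit; an RMSE close to 0 means the predictions are very close to the actual ratings. Because errors are squared before averaging, large misses are penalized more heavily than small ones. Here the mean RMSE across the five folds is 0.8967.
> 2. Like RMSE, a lower MAE (Mean Absolute Error) is better; an MAE close to 0 means the predictions align well with the actual values. Here the mean MAE is 0.6909, so predictions are off by about 0.7 rating points on average.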
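To make the two metrics concrete, here is a small sketch that computes them by hand from their definitions. The function names `rmse` and `mae` and the three rating pairs are made up for illustration and do not come from the dataset.

```python
import math

def rmse(actual, predicted):
    # Root of the mean squared error: large errors weigh more
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(actual, predicted):
    # Mean of the absolute errors: every error weighs proportionally
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


# Hypothetical true ratings and model estimates
actual = [4.0, 3.0, 5.0]
predicted = [3.5, 3.0, 4.0]

print(f"RMSE={rmse(actual, predicted):.4f}  MAE={mae(actual, predicted):.4f}")
```

RMSE is never smaller than MAE on the same predictions, which is why the RMSE row in the table above sits higher than the MAE row.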
