<a href="https://colab.research.google.com/github/XLingTong/movielens-recommender_uts2025/blob/main/01_cf_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notebook 01: Collaborative Filtering using Surprise SVD
This notebook trains an SVD-based collaborative filtering model with the Surprise library, evaluates it on a held-out 20% of the MovieLens 100k ratings, and saves the test-set rating predictions for use in a later hybrid model.

In [1]:
# Pin NumPy to 1.24.4 (this replaces the preinstalled 2.x build), then reinstall scikit-surprise
%pip install numpy==1.24.4
%pip install --force-reinstall scikit-surprise


Collecting numpy==1.24.4
  Using cached numpy-1.24.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Using cached numpy-1.24.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.2.5
    Uninstalling numpy-2.2.5:
      Successfully uninstalled numpy-2.2.5
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
blosc2 3.3.2 requires numpy>=1.26, but you have numpy 1.24.4 which is incompatible.
thinc 8.3.6 requires numpy<3.0.0,>=2.0.0, but you have numpy 1.24.4 which is incompatible.
treescope 0.1.9 requires numpy>=1.25.2, but you have numpy 1.24.4 which is incompatible.
pymc 5.22.0 requires numpy>=1.25.0, but you have numpy 1.24.4 which is incompatible.
tensorflow 2.18.0 requires numpy<2.1.0,>=1.26.0, but you have

Collecting scikit-surprise
  Using cached scikit_surprise-1.1.4-cp311-cp311-linux_x86_64.whl
Collecting joblib>=1.2.0 (from scikit-surprise)
  Using cached joblib-1.5.0-py3-none-any.whl.metadata (5.6 kB)
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/pip/_internal/cli/base_command.py", line 179, in exc_logging_wrapper
    status = run_func(*args)
             ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pip/_internal/cli/req_command.py", line 67, in wrapper
    return func(self, options, args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pip/_internal/commands/install.py", line 377, in run
    requirement_set = resolver.resolve(
                      ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pip/_internal/resolution/resolvelib/resolver.py", line 95, in resolve
    result = self._result = resolver.resolve(
                            ^^^^^^^^^^^^^^^^^
  File "/usr/local

In [2]:
# Install Surprise if needed
# !pip install scikit-surprise
from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split
from surprise.accuracy import rmse
import pandas as pd

## Load and Prepare Data

In [4]:
# Load u_data.csv using pandas
ratings_df = pd.read_csv(
    "https://raw.githubusercontent.com/XLingTong/movielens-recommender_uts2025/refs/heads/main/u_data.csv",
    sep=",",
    header=0
)
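# Enforce integer dtypes for IDs and ratings
ratings_df["userID"] = ratings_df["userID"].astype(int)
ratings_df["itemID"] = ratings_df["itemID"].astype(int)
ratings_df["rating"] = ratings_df["rating"].astype(int)
# Keep only the three columns Surprise needs, in (user, item, rating) order
ratings_df = ratings_df[["userID", "itemID", "rating"]]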

## Prepare Data for Surprise

In [5]:
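# MovieLens ratings are on a 1-5 scale
reader = Reader(rating_scale=(1, 5))
# load_from_df expects the columns in (user, item, rating) order
data = Dataset.load_from_df(ratings_df[["userID", "itemID", "rating"]], reader)
# Hold out 20% of the ratings for testing; the fixed seed makes the split repeatable
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)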
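Conceptually, a seeded random split shuffles the ratings once and slices off a fraction for testing. The snippet below is a minimal pure-Python sketch of that idea on toy data; `split_ratings` and `toy_ratings` are hypothetical names used for illustration, and this is not Surprise's own implementation.

```python
import random

def split_ratings(ratings, test_size=0.2, seed=42):
    """Shuffle (user, item, rating) tuples and hold out a fraction for testing."""
    shuffled = list(ratings)                 # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)    # seeded shuffle gives a repeatable split
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

# Toy data: 10 ratings from hypothetical users and items, values in 1..5
toy_ratings = [(u, i, 1 + (u + i) % 5) for u in range(5) for i in range(2)]
train, test = split_ratings(toy_ratings)
print(len(train), len(test))
```

Calling `split_ratings` again with the same seed returns the same partition, which is the property `random_state=42` is relied on for above.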

## Train SVD Model

In [6]:
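# n_factors=50 halves Surprise's default of 100 latent factors; the learning rate,
# regularization, and epoch count match the library defaults.
# No random_state is set, so the fitted factors (and the RMSE) can vary slightly between runs.
svd_model = SVD(n_factors=50, lr_all=0.005, reg_all=0.02, n_epochs=20)
svd_model.fit(trainset)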

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7b7af61b6790>
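To make the hyperparameters above more concrete, the following is a minimal sketch of the biased matrix factorization that Surprise documents for `SVD`: the estimate is the global mean plus user and item biases plus a dot product of latent factors, and each observed rating triggers one regularized gradient step. The function names and toy values are hypothetical, and this single-rating version is for illustration only, not the library's optimized code.

```python
def raw_estimate(mu, b_u, b_i, p_u, q_i):
    """Unclipped estimate: global mean + user bias + item bias + factor dot product."""
    return mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))

def predict(mu, b_u, b_i, p_u, q_i, low=1.0, high=5.0):
    """Clip the raw estimate to the rating scale, here assumed to be (1, 5)."""
    return min(high, max(low, raw_estimate(mu, b_u, b_i, p_u, q_i)))

def sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, lr=0.005, reg=0.02):
    """One stochastic gradient step on a single observed rating r_ui."""
    err = r_ui - raw_estimate(mu, b_u, b_i, p_u, q_i)
    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)
    # Both factor updates use the values from before the step
    new_p_u = [p + lr * (err * q - reg * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr * (err * p - reg * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i

# Toy values (assumptions): global mean 3.5, zero biases, two latent factors
mu, b_u, b_i = 3.5, 0.0, 0.0
p_u, q_i = [0.1, 0.1], [0.1, 0.1]

before = predict(mu, b_u, b_i, p_u, q_i)
b_u, b_i, p_u, q_i = sgd_step(5.0, mu, b_u, b_i, p_u, q_i)
after = predict(mu, b_u, b_i, p_u, q_i)
print(before, after)  # the estimate moves toward the observed rating of 5
```

Here `lr` and `reg` play the roles of `lr_all` and `reg_all`, the length of `p_u` and `q_i` corresponds to `n_factors`, and `n_epochs` is the number of passes such steps make over the training ratings.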

## Evaluate RMSE on Test Set

In [7]:
predictions = svd_model.test(testset)
rmse(predictions)

RMSE: 0.9345


0.9344772684238207
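RMSE is the square root of the mean squared difference between true and estimated ratings. The sketch below computes it by hand on toy pairs; `rmse_manual` and the values are hypothetical. In this notebook the pairs would come from each prediction's `r_ui` (true rating) and `est` (estimate).

```python
import math

def rmse_manual(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy (true, estimated) pairs; squared errors are 0.25, 0.25 and 1.0
toy_pairs = [(4.0, 3.5), (2.0, 2.5), (5.0, 4.0)]
toy_rmse = rmse_manual(toy_pairs)
print(round(toy_rmse, 4))  # sqrt(0.5), about 0.7071
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.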

## Save Predictions for Hybrid Model

In [10]:
import os

# Convert predictions into a DataFrame
pred_df = pd.DataFrame([{
    "userID": int(pred.uid),
    "itemID": int(pred.iid),
    "cf_pred": round(pred.est, 4)
} for pred in predictions])

# Create the 'models' directory if it doesn't exist
os.makedirs("models", exist_ok=True)

# Save for use in hybrid model
pred_df.to_csv("models/cf_predictions.csv", index=False)
print("Saved CF predictions to models/cf_predictions.csv")

Saved CF predictions to models/cf_predictions.csv
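The hybrid model itself is outside this notebook. As one possible way a downstream step could consume the saved file, the sketch below reads rows with the same three columns and blends `cf_pred` with a second score using a fixed weight. Everything except the column names is an assumption: the sample rows, the `content_scores` lookup, the `blend` function, and `alpha=0.7` are all hypothetical.

```python
import csv
import io

# Stand-in for models/cf_predictions.csv, using the same three columns
sample_csv = "userID,itemID,cf_pred\n1,10,3.8123\n2,20,4.2500\n"

# Hypothetical content-based scores keyed by (userID, itemID); not from this notebook
content_scores = {(1, 10): 3.0, (2, 20): 5.0}

def blend(cf_pred, content_pred, alpha=0.7):
    """Weighted average: alpha on the CF estimate, the remainder on the other score."""
    return alpha * cf_pred + (1 - alpha) * content_pred

hybrid = {}
for row in csv.DictReader(io.StringIO(sample_csv)):
    key = (int(row["userID"]), int(row["itemID"]))
    if key in content_scores:  # only blend pairs that have both scores
        hybrid[key] = blend(float(row["cf_pred"]), content_scores[key])

print(hybrid)
```

Keying both sources on `(userID, itemID)` is why the predictions are saved with integer IDs rather than as raw prediction objects.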
