# Movie Recommender System

## Datasets

The data was collected through the MovieLens website (movielens.umn.edu) during the seven-month period from September 19, 1997 through April 22, 1998. The dataset consists of 100,000 ratings (1-5) from 943 users on 1,682 movies. The data has been cleaned up: users who had fewer than 20 ratings or did not have complete demographic information were removed. Data download: https://www.dropbox.com/s/ip7x5v26a5kvixg/ml-100k.zip?dl=0

***u.data***: The full dataset: 100,000 ratings by 943 users on 1,682 items. Each user has rated at least 20 movies. Users and items are numbered consecutively from 1. The data is randomly ordered. Each line is a tab-separated record of user id | item id | rating | timestamp. The timestamps are Unix seconds since 1/1/1970 UTC.

***u.item***: Information about the items (movies). Each line is a pipe-separated (`|`) record of movie id | movie title | release date | video release date | IMDb URL | unknown | Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western.
The last 19 fields are the genres: a 1 indicates the movie belongs to that genre, a 0 indicates it does not. A movie can belong to several genres at once. The movie ids are the ones used in the *u.data* dataset.

***u.user***: Demographic information about the users. Each line is a pipe-separated (`|`) record of user id | age | gender | occupation | zip code. The user ids are the ones used in the *u.data* dataset.
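
As a rough sketch of how the raw `u.data` layout described above could be parsed with pandas, the snippet below reads three tab-separated rows from an in-memory string and converts the Unix timestamps to datetimes. The rows are copied from the sample output shown later in this notebook. The column names, the `raw_ratings` variable, and the `rated_at` column are assumptions chosen to match the CSV used later. For the real file, pass its path in place of the `StringIO` object.

```python
import io

import pandas as pd

# Three tab-separated rows in the u.data layout: user id, item id, rating, timestamp
sample = (
    "196\t242\t3\t881250949\n"
    "186\t302\t3\t891717742\n"
    "22\t377\t1\t878887116\n"
)

# The raw file has no header row, so column names are supplied explicitly
raw_ratings = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    header=None,
    names=["userID", "itemID", "rating", "timestamp"],
)

# Timestamps are Unix seconds since 1970-01-01 UTC
raw_ratings["rated_at"] = pd.to_datetime(raw_ratings["timestamp"], unit="s", utc=True)
print(raw_ratings)
```

The converted dates should fall inside the collection period stated above, which is a quick sanity check on the parsing.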

In [5]:
import pandas as pd
import numpy as np

In [6]:
# Load the ratings file
ratings = pd.read_csv('https://raw.githubusercontent.com/XLingTong/movielens-recommender_uts2025/refs/heads/main/u_data.csv')

# Display the first few rows to check
ratings.head()


Unnamed: 0,userID,itemID,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [7]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype
---  ------     --------------   -----
 0   userID     100000 non-null  int64
 1   itemID     100000 non-null  int64
 2   rating     100000 non-null  int64
 3   timestamp  100000 non-null  int64
dtypes: int64(4)
memory usage: 3.1 MB


In [8]:
from sklearn.model_selection import train_test_split

# Split ratings into 80% train, 20% test
train_data, test_data = train_test_split(ratings, test_size=0.2, random_state=42)

# Check the size of each set
print(f"Training set size: {train_data.shape[0]} ratings")
print(f"Testing set size: {test_data.shape[0]} ratings")


Training set size: 80000 ratings
Testing set size: 20000 ratings
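
A random row-level split can leave some users or movies with no ratings in the training set. Their latent factors would then stay at their random initial values, and predictions for them would be close to noise. The sketch below shows one way to count such test rows. It uses small made-up frames (`toy_train`, `toy_test`), not the notebook's data, and the same check could be run on `train_data` and `test_data`.

```python
import pandas as pd

# Toy stand-ins for train_data and test_data (illustrative values only)
toy_train = pd.DataFrame(
    {"userID": [1, 1, 2, 3], "itemID": [10, 11, 10, 12], "rating": [4, 3, 5, 2]}
)
toy_test = pd.DataFrame(
    {"userID": [1, 4, 2], "itemID": [12, 10, 13], "rating": [3, 4, 1]}
)

# A test row is "cold" if its user or its item never appears in the training set
unseen_user = ~toy_test["userID"].isin(toy_train["userID"])
unseen_item = ~toy_test["itemID"].isin(toy_train["itemID"])
unseen_mask = unseen_user | unseen_item

print(f"{unseen_mask.sum()} of {len(toy_test)} test rows involve an unseen user or item")
```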


# Build Matrix Factorization model

        

Parameters:

- `n_users` (int): Number of unique users
- `n_items` (int): Number of unique movies
- `n_factors` (int): Number of latent features. Each latent factor (each dimension of a user or item vector) is an abstract concept: it is not labeled directly, but it captures hidden patterns in user behavior and movie properties.
- `learning_rate` (float): Step size for gradient descent
- `n_epochs` (int): Number of training epochs
- `reg` (float): Regularization strength
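
The way these parameters interact can be written out as equations. The notation here is an assumption introduced for illustration: $p_u$ is the latent vector of user $u$, $q_i$ the latent vector of item $i$, $r_{ui}$ the observed rating, $\lambda$ stands for `reg`, and $\eta$ for `learning_rate`. The per-rating loss is inferred from the gradient expressions in the `train` method of the class in this section.

$$
\hat{r}_{ui} = p_u^\top q_i, \qquad e_{ui} = r_{ui} - \hat{r}_{ui}
$$

$$
L_{ui} = e_{ui}^2 + \lambda \left( \lVert p_u \rVert^2 + \lVert q_i \rVert^2 \right)
$$

$$
\nabla_{p_u} L_{ui} = -2\, e_{ui}\, q_i + 2 \lambda\, p_u, \qquad
\nabla_{q_i} L_{ui} = -2\, e_{ui}\, p_u + 2 \lambda\, q_i
$$

$$
p_u \leftarrow p_u - \eta\, \nabla_{p_u} L_{ui}, \qquad
q_i \leftarrow q_i - \eta\, \nabla_{q_i} L_{ui}
$$

Both gradients are evaluated with the values of $p_u$ and $q_i$ from before the update, which matches the order of operations in the code.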

In [9]:
class MatrixFactorization:
    def __init__(self, n_users, n_items, n_factors=20, learning_rate=0.01, n_epochs=20, reg=0.02):
       
        # Initialize the Matrix Factorization model.

        # Parameters:
        # n_users (int): Number of unique users
        # n_items (int): Number of unique movies
        # n_factors (int): Number of latent features (default 20)
        # learning_rate (float): Step size for gradient descent
        # n_epochs (int): Number of training epochs
        # reg (float): Regularization strength
       
        self.n_users = n_users
        self.n_items = n_items
        self.n_factors = n_factors
        self.learning_rate = learning_rate
        self.n_epochs = n_epochs
        self.reg = reg

        # Initialize user and item latent matrices with small random values
        self.user_factors = np.random.normal(scale=1./self.n_factors, size=(n_users, n_factors)) 
        self.item_factors = np.random.normal(scale=1./self.n_factors, size=(n_items, n_factors))
    
    # Predict a single rating from the latent factors
    def predict(self, user_id, item_id):
        
        # Predict the rating of a user for a given item (movie).

        # Parameters:
        # user_id (int): ID/index of the user
        # item_id (int): ID/index of the item

        # Returns:
        # float: Predicted rating
       
        return np.dot(self.user_factors[user_id], self.item_factors[item_id])
    
    # Loss function: Mean Squared Error (MSE)
    def compute_mse(self, data):
        # Compute Mean Squared Error (MSE) over given data.

        # Parameters:
        # data (DataFrame): A pandas DataFrame with columns ['userID', 'itemID', 'rating']

        # Returns:
        # float: MSE value
        
        mse = 0
        for _, row in data.iterrows():
            user_id = int(row['userID']) - 1  # IDs start from 1
            item_id = int(row['itemID']) - 1
            true_rating = row['rating']
            pred_rating = self.predict(user_id, item_id)
            mse += (true_rating - pred_rating) ** 2

        mse /= len(data)
        return mse
    
    
    # Training loop using stochastic gradient descent (SGD): one update per rating
    def train(self, train_data):
        # Train the matrix factorization model using SGD.

        # Parameters:
        # train_data (DataFrame): DataFrame with columns ['userID', 'itemID', 'rating']
        # 
        for epoch in range(self.n_epochs):
            # Shuffle training data
            train_data = train_data.sample(frac=1).reset_index(drop=True)
            
            total_loss = 0
            for _, row in train_data.iterrows():
                user_id = int(row['userID']) - 1
                item_id = int(row['itemID']) - 1
                true_rating = row['rating']

                # Prediction
                pred_rating = self.predict(user_id, item_id)
                error = true_rating - pred_rating

                # Gradients and updates
                user_grad = -2 * error * self.item_factors[item_id] + 2 * self.reg * self.user_factors[user_id]
                item_grad = -2 * error * self.user_factors[user_id] + 2 * self.reg * self.item_factors[item_id]

                self.user_factors[user_id] -= self.learning_rate * user_grad
                self.item_factors[item_id] -= self.learning_rate * item_grad

                total_loss += error**2

            # Print average loss for this epoch, monitor if the model is learning properly.
            mse = total_loss / len(train_data)
            print(f"Epoch {epoch+1}/{self.n_epochs} - Training MSE: {mse:.4f}")   
    
    # Predict ratings for every row of the test set
    def predict_testset(self, test_data):
        # 
        # Predict ratings for all user-item pairs in the test set.

        # Parameters:
        # test_data (DataFrame): A pandas DataFrame with columns ['userID', 'itemID', 'rating']

        # Returns:
        # list: A list of predicted ratings
        # 
        predictions = []
        for _, row in test_data.iterrows():
            user_id = int(row['userID']) - 1  # IDs start from 1
            item_id = int(row['itemID']) - 1
            pred_rating = self.predict(user_id, item_id)
            predictions.append(pred_rating)
        return predictions
    
    # Calculate RMSE on the test set
    def calculate_rmse(self, test_data):
        # 
        # Calculate RMSE (Root Mean Squared Error) on the test data.

        # Parameters:
        # test_data (DataFrame): Test data with ['userID', 'itemID', 'rating']

        # Returns:
        # float: RMSE score
        # 
        predictions = self.predict_testset(test_data)
        true_ratings = test_data['rating'].values

        mse = np.mean((true_ratings - predictions) ** 2)
        rmse = np.sqrt(mse)
        return rmse
    
    #Calculate Recall for the test data based on a rating threshold.
    def calculate_recall(self, test_data, threshold=3.5):
        # Threshold = 3.5 by default (ratings 4 and 5 are usually considered liked).
        # Counts how many liked movies were recommended.
        # Calculates recall score between 0 and 1.
        # Parameters:
        # test_data (DataFrame): Test data with ['userID', 'itemID', 'rating']
        # threshold (float): Minimum rating to consider as a 'liked' item

        # Returns:
        # float: Recall score
        # 
        predictions = self.predict_testset(test_data)
        true_ratings = test_data['rating'].values

        # Define "liked" movies
        true_positives = 0
        false_negatives = 0

        for true, pred in zip(true_ratings, predictions):
            if true >= threshold:
                if pred >= threshold:
                    true_positives += 1
                else:
                    false_negatives += 1

        if true_positives + false_negatives == 0:
            return 0  # To avoid division by zero

        recall = true_positives / (true_positives + false_negatives)
        return recall

    
    #recommend Top-N movies for a given user based on predicted ratings
    def recommend_top_n(self, user_id, movie_titles, n=5):
        # 
        # Recommend top N movies for a given user based on predicted ratings.

        # Parameters:
        # user_id (int): ID/index of the user (starting from 1)
        # movie_titles (DataFrame): A DataFrame mapping movieID to movie titles
        # n (int): Number of recommendations to return

        # Returns:
        # list: List of (movie title, predicted rating) tuples
        # 
        user_index = user_id - 1  # Adjust index if user IDs start from 1
        scores = []

        for item_index in range(self.n_items):
            pred_rating = self.predict(user_index, item_index)
            scores.append((item_index, pred_rating))

        # Sort movies by predicted rating, descending
        scores.sort(key=lambda x: x[1], reverse=True)

        top_n = scores[:n]
        
        recommendations = []
        for movie_id, score in top_n:
            movie_title = movie_titles.loc[movie_titles['itemID'] == movie_id + 1, 'title'].values[0]
            recommendations.append((movie_title, round(score, 2)))

        return recommendations
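
The row-by-row `iterrows()` loops in the class are easy to follow but slow on 100,000 ratings. One possible alternative is to compute all predictions at once with NumPy indexing. This is a sketch that assumes the same 1-based ID convention as the class. The `predict_batch` helper and the toy matrices `P` and `Q` are hypothetical and are not part of the class above.

```python
import numpy as np
import pandas as pd


def predict_batch(user_factors, item_factors, data):
    """Return one predicted rating per row of `data` without a Python loop."""
    user_idx = data["userID"].to_numpy() - 1  # IDs start from 1
    item_idx = data["itemID"].to_numpy() - 1
    # Row-wise dot product between the selected user and item vectors
    return np.einsum("ij,ij->i", user_factors[user_idx], item_factors[item_idx])


# Toy factors standing in for model.user_factors and model.item_factors
rng = np.random.default_rng(0)
P = rng.normal(scale=0.05, size=(3, 4))  # 3 users, 4 latent factors
Q = rng.normal(scale=0.05, size=(5, 4))  # 5 items, 4 latent factors
toy = pd.DataFrame({"userID": [1, 3, 2], "itemID": [5, 1, 2], "rating": [4, 2, 5]})

batch_preds = predict_batch(P, Q, toy)

# Reference result computed the slow way, one rating at a time
loop_preds = np.array(
    [P[u - 1] @ Q[i - 1] for u, i in zip(toy["userID"], toy["itemID"])]
)

rmse = np.sqrt(np.mean((toy["rating"].to_numpy() - batch_preds) ** 2))
print(np.allclose(batch_preds, loop_preds), rmse)
```

The SGD updates in `train` still have to run one rating at a time, so this kind of batching applies only to the prediction and evaluation methods.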


# Implementation

Load movie titles

In [None]:
# Load the file
movie_titles = pd.read_csv('https://raw.githubusercontent.com/XLingTong/movielens-recommender_uts2025/refs/heads/main/u_item.csv')

# Display the first few rows to check
movie_titles.head()


Unnamed: 0,itemID,title,release_date,video_release_date,IMDb_URL,unknown,Action,Adventure,Animation,Children's,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


Instantiate the Model

In [11]:
# Number of users and movies
n_users = ratings['userID'].nunique()
n_items = ratings['itemID'].nunique()

# Create the model
mf_model = MatrixFactorization(
    n_users=n_users,
    n_items=n_items,
    n_factors=20,
    learning_rate=0.01,
    n_epochs=20,
    reg=0.02
)
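
Sizing the factor matrices with `nunique()` works here because, as the dataset description states, MovieLens 100K IDs are numbered consecutively from 1. If a dataset had gaps in its IDs, the `id - 1` indexing used in the class could point past the end of the matrix. A small illustration with a made-up frame (`gappy`, an assumption for this example only) shows the difference between counting IDs and taking the largest ID:

```python
import pandas as pd

# Made-up ratings where user IDs 3 and 4 are missing
gappy = pd.DataFrame({"userID": [1, 2, 5], "itemID": [1, 3, 3], "rating": [4, 3, 5]})

n_users_unique = gappy["userID"].nunique()  # number of distinct users
n_users_max = int(gappy["userID"].max())    # largest ID, safe size for id - 1 indexing

largest_index = n_users_max - 1
print(f"nunique: {n_users_unique}, max ID: {n_users_max}")
print(f"Index {largest_index} fits a matrix with {n_users_unique} rows: "
      f"{largest_index < n_users_unique}")
```

For datasets with sparse IDs, another common option is to map each raw ID to a dense index first.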


Train the Model

In [12]:
mf_model.train(train_data)

Epoch 1/20 - Training MSE: 8.2740
Epoch 2/20 - Training MSE: 1.2399
Epoch 3/20 - Training MSE: 0.9832
Epoch 4/20 - Training MSE: 0.9180
Epoch 5/20 - Training MSE: 0.8665
Epoch 6/20 - Training MSE: 0.8181
Epoch 7/20 - Training MSE: 0.7728
Epoch 8/20 - Training MSE: 0.7253
Epoch 9/20 - Training MSE: 0.6789
Epoch 10/20 - Training MSE: 0.6370
Epoch 11/20 - Training MSE: 0.6013
Epoch 12/20 - Training MSE: 0.5677
Epoch 13/20 - Training MSE: 0.5403
Epoch 14/20 - Training MSE: 0.5174
Epoch 15/20 - Training MSE: 0.4983
Epoch 16/20 - Training MSE: 0.4809
Epoch 17/20 - Training MSE: 0.4666
Epoch 18/20 - Training MSE: 0.4545
Epoch 19/20 - Training MSE: 0.4431
Epoch 20/20 - Training MSE: 0.4336


Evaluate the Model

In [13]:
rmse = mf_model.calculate_rmse(test_data)
print(f"Test RMSE: {rmse:.4f}")

Test RMSE: 0.9891
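
The training log reports MSE while the test metric is RMSE, so the two numbers are not directly comparable. The short calculation below puts the final values reported above on the same scale. The variable names are illustrative. A noticeably lower training error than test error may indicate some overfitting, which stronger regularization or fewer epochs might reduce.

```python
import math

final_train_mse = 0.4336  # Epoch 20 training MSE from the log above
test_rmse = 0.9891        # Test RMSE reported above

train_rmse = math.sqrt(final_train_mse)
test_mse = test_rmse ** 2

print(f"Train RMSE ≈ {train_rmse:.4f} vs. test RMSE = {test_rmse:.4f}")
print(f"Train MSE = {final_train_mse:.4f} vs. test MSE ≈ {test_mse:.4f}")
```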


In [14]:
# Calculate Recall on test set
recall_score = mf_model.calculate_recall(test_data, threshold=3.5)
print(f"Test Recall (threshold=3.5): {recall_score:.4f}")

Test Recall (threshold=3.5): 0.6897
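
Recall on its own can be inflated by a model that predicts high ratings for everything, so it is often read alongside precision. The sketch below computes both with the same threshold rule as `calculate_recall`. The `precision_recall` function and the example arrays are hypothetical and are not part of the class above.

```python
import numpy as np


def precision_recall(true_ratings, predictions, threshold=3.5):
    """Precision and recall for 'liked' items, using one threshold for both sides."""
    true_ratings = np.asarray(true_ratings, dtype=float)
    predictions = np.asarray(predictions, dtype=float)

    liked = true_ratings >= threshold       # items the user actually liked
    recommended = predictions >= threshold  # items the model would recommend

    true_positives = np.sum(liked & recommended)
    precision = true_positives / recommended.sum() if recommended.sum() else 0.0
    recall = true_positives / liked.sum() if liked.sum() else 0.0
    return float(precision), float(recall)


# Made-up ratings and predictions for five user-item pairs
precision, recall = precision_recall([5, 4, 2, 1, 4], [4.2, 3.1, 3.8, 2.0, 3.6])
print(f"Precision: {precision:.4f}, Recall: {recall:.4f}")
```

With the model in this notebook, the inputs would be `test_data['rating'].values` and the list returned by `predict_testset(test_data)`.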


Make Recommendations

In [15]:
recommendations = mf_model.recommend_top_n(user_id=10, movie_titles=movie_titles, n=5)

print("Top 5 recommended movies for User 10:")
for title, score in recommendations:
    print(f"{title}: predicted rating {score}")

Top 5 recommended movies for User 10:
Pather Panchali (1955): predicted rating 5.29
Paradise Lost: The Child Murders at Robin Hood Hills (1996): predicted rating 5.25
Schindler's List (1993): predicted rating 5.18
Casablanca (1942): predicted rating 5.13
Boot, Das (1981): predicted rating 5.04
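
The predicted ratings above exceed 5 because the dot product of two latent vectors is unbounded. In addition, `recommend_top_n` ranks every movie, including ones the user has already rated. A possible refinement, sketched here with a hypothetical `top_n_unseen` helper and toy factors, masks already-rated items before ranking and clips the displayed scores to the 1-5 rating scale.

```python
import numpy as np


def top_n_unseen(user_vector, item_factors, rated_item_ids, n=5):
    """Return (itemID, clipped score) pairs for the best items the user has not rated."""
    scores = (item_factors @ user_vector).astype(float)

    # Exclude items the user has already rated (IDs start from 1)
    rated_idx = np.asarray(list(rated_item_ids), dtype=int) - 1
    scores[rated_idx] = -np.inf

    # Rank by raw score; clip only for display so the ordering is unchanged
    ranked = np.argsort(-scores)[:n]
    return [
        (int(i) + 1, float(np.clip(scores[i], 1.0, 5.0)))
        for i in ranked
        if np.isfinite(scores[i])
    ]


# Toy example: one latent factor, four items, user has already rated item 3
toy_item_factors = np.array([[1.0], [2.0], [3.0], [6.0]])
toy_user_vector = np.array([1.0])
recs = top_n_unseen(toy_user_vector, toy_item_factors, rated_item_ids=[3], n=2)
print(recs)
```

With the trained model, `user_vector` would be `mf_model.user_factors[user_id - 1]`, and the rated IDs could be taken from the rows of `train_data` for that user.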
