We will use collaborative filtering. It is a part of unsupervised learning

**Key Steps:**
  1. Split the dataset into training (80%) and testing (20%) sets.
  2. Train the model using only the training data.
  3. Use the model to predict ratings for movies in the test set.
  4. Calculate errors between predicted and actual ratings.

  If CF is used with only user-item interactions (e.g., movie watch history, clicks, purchases) without explicit labels, it’s considered unsupervised learning.
memory-based collaborative filtering
  You used ``````, which relies on similarity between movies (item-based filtering) to make recommendations. While this is a Machine Learning technique, it's not a model that "learns" from data in the way that deep learning does. Instead, it computes similarities and makes predictions based on existing ratings.




In [30]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
from math import sqrt


script_dir = os.getcwd() 

print(f"Current working directory: {script_dir}")

# Load ratings data
ratings_file = os.path.join(script_dir, "Cleaned Datasets", "Audience_Ratings.csv")
df_ratings = pd.read_csv(ratings_file)

Current working directory: c:\Users\willi\OneDrive\Documents\GitHub\Movie-Recommendations


In [31]:
print(df_ratings.columns)

print("Dataset shape:", df_ratings.shape)

# Drop NA values if any
df_ratings.dropna(inplace=True)

# Check unique users and movies
print(f"Unique users: {df_ratings['userId'].nunique()}")
print(f"Unique movies: {df_ratings['imdbId'].nunique()}")

# Filter users or movies with very few interactions
min_user_ratings = 5
min_movie_ratings = 5

user_counts = df_ratings['userId'].value_counts()
movie_counts = df_ratings['imdbId'].value_counts()

df = df_ratings[df_ratings['userId'].isin(user_counts[user_counts >= min_user_ratings].index)]
df = df_ratings[df_ratings['imdbId'].isin(movie_counts[movie_counts >= min_movie_ratings].index)]


Index(['userId', 'imdbId', 'rating'], dtype='object')
Dataset shape: (100836, 3)
Unique users: 610
Unique movies: 9724


From above we can see that the total rating 100,836 with 610 unique users and 9,724 unique movies. 

In [None]:
# Mapping for userId and imdbId to index-based values
user_ids = df['userId'].unique()
movie_ids = df['imdbId'].unique()

user2idx = {user_id: idx for idx, user_id in enumerate(user_ids)}
movie2idx = {movie_id: idx for idx, movie_id in enumerate(movie_ids)}

df['user_idx'] = df['userId'].map(user2idx)
df['movie_idx'] = df['imdbId'].map(movie2idx)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['user_idx'] = df['userId'].map(user2idx)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['movie_idx'] = df['imdbId'].map(movie2idx)


In [33]:
n_users = len(user2idx)
n_movies = len(movie2idx)
n_factors = 20  # Number of latent features

# Initialize user and movie matrices
np.random.seed(42)
P = np.random.normal(scale=0.1, size=(n_users, n_factors))  # User latent matrix
Q = np.random.normal(scale=0.1, size=(n_movies, n_factors))  # Movie latent matrix

# Bias terms
user_bias = np.zeros(n_users)
movie_bias = np.zeros(n_movies)
global_bias = df['rating'].mean()


In [34]:
def train_svd(df, P, Q, user_bias, movie_bias, global_bias, n_factors, epochs=20, lr=0.01, reg=0.1):
    for epoch in range(epochs):
        for row in df.itertuples():
            u = row.user_idx
            m = row.movie_idx
            rating = row.rating

            pred = global_bias + user_bias[u] + movie_bias[m] + np.dot(P[u], Q[m])
            error = rating - pred

            # Update biases
            user_bias[u] += lr * (error - reg * user_bias[u])
            movie_bias[m] += lr * (error - reg * movie_bias[m])

            # Update latent factors
            P[u] += lr * (error * Q[m] - reg * P[u])
            Q[m] += lr * (error * P[u] - reg * Q[m])
        
        # Optional: evaluate performance after each epoch
        preds = predict_all(df, P, Q, user_bias, movie_bias, global_bias)
        rmse = sqrt(mean_squared_error(df['rating'], preds))
        print(f"Epoch {epoch+1}/{epochs}, RMSE: {rmse:.4f}")
    
    return P, Q, user_bias, movie_bias


In [35]:
def predict_all(df, P, Q, user_bias, movie_bias, global_bias):
    preds = []
    for row in df.itertuples():
        u = row.user_idx
        m = row.movie_idx
        pred = global_bias + user_bias[u] + movie_bias[m] + np.dot(P[u], Q[m])
        preds.append(pred)
    return np.array(preds)


In [36]:
P, Q, user_bias, movie_bias = train_svd(df, P, Q, user_bias, movie_bias, global_bias, n_factors=20, epochs=20, lr=0.01, reg=0.1)


Epoch 1/20, RMSE: 0.1815
Epoch 2/20, RMSE: 0.1755
Epoch 3/20, RMSE: 0.1723
Epoch 4/20, RMSE: 0.1703
Epoch 5/20, RMSE: 0.1689
Epoch 6/20, RMSE: 0.1679
Epoch 7/20, RMSE: 0.1671
Epoch 8/20, RMSE: 0.1666
Epoch 9/20, RMSE: 0.1661
Epoch 10/20, RMSE: 0.1657
Epoch 11/20, RMSE: 0.1654
Epoch 12/20, RMSE: 0.1652
Epoch 13/20, RMSE: 0.1649
Epoch 14/20, RMSE: 0.1648
Epoch 15/20, RMSE: 0.1646
Epoch 16/20, RMSE: 0.1645
Epoch 17/20, RMSE: 0.1644
Epoch 18/20, RMSE: 0.1643
Epoch 19/20, RMSE: 0.1642
Epoch 20/20, RMSE: 0.1642


In [37]:
def predict_rating(user_id, movie_id):
    u = user2idx.get(user_id)
    m = movie2idx.get(movie_id)
    if u is None or m is None:
        return global_bias  # Fallback to global average
    pred = global_bias + user_bias[u] + movie_bias[m] + np.dot(P[u], Q[m])
    return pred

# Example:
predict_rating(1, 1)


np.float64(0.7074716972771784)