We will use collaborative filtering (CF), a recommendation technique that is often grouped with unsupervised learning.

**Key Steps:**
  1. Split the dataset into training (80%) and testing (20%) sets.
  2. Train the model using only the training data.
  3. Use the model to predict ratings for movies in the test set.
  4. Calculate errors between predicted and actual ratings.

  If CF is used with only user-item interactions (e.g., movie watch history, clicks, purchases) without explicit labels, it is considered unsupervised learning.

**Memory-based collaborative filtering**

  This notebook uses item-based filtering, which relies on similarity between movies to make recommendations. While this is a machine learning technique, it is not a model that "learns" from data in the way that deep learning does. Instead, it computes similarities and makes predictions directly from existing ratings.




In [1]:
import os
import pandas as pd
import numpy as np

script_dir = os.getcwd() 

print(f"Current working directory: {script_dir}")

# Load ratings data
ratings_file = os.path.join(script_dir, "Cleaned Datasets", "ratings_imdb_matched.csv")
df_ratings = pd.read_csv(ratings_file)

Current working directory: c:\Users\willi\OneDrive\Documents\GitHub\Movie-Recommendations


### Creating User-Item Matrices

To perform collaborative filtering, we need to convert our rating data into a **user-item matrix**, where:

- **Rows represent users (`userId`)**
- **Columns represent movies** (`imdbId` in the code below, shown as `movieId` in the example table)
- **Values represent ratings given by users to movies**
- Missing values (movies that a user hasn't rated) are filled with `0`.

Example User-Item Matrix:

| userId | movieId=1 | movieId=2 | movieId=3 | movieId=4 |
|--------|----------|----------|----------|----------|
| 1      | 4.0      | 0.0      | 3.5      | 5.0      |
| 2      | 0.0      | 2.5      | 5.0      | 3.0      |
| 3      | 1.0      | 0.0      | 4.0      | 2.0      |

The matrix allows us to perform **collaborative filtering** by finding patterns in user ratings. We can then use it to compute **movie similarities** and predict missing ratings.
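
As a small illustration of the conversion described above, the sketch below builds the example matrix from a long-format table with `pivot`. The `ratings` DataFrame and its values are made up to match the example table; they are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, movie, rating).
ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "movieId": [1, 3, 4, 2, 3, 4, 1, 3, 4],
    "rating":  [4.0, 3.5, 5.0, 2.5, 5.0, 3.0, 1.0, 4.0, 2.0],
})

# Pivot to wide format: unrated (user, movie) pairs become NaN, then 0.
user_item = ratings.pivot(index="userId", columns="movieId", values="rating").fillna(0)
print(user_item)
```

Note that `pivot` raises an error if the same (user, movie) pair appears more than once, so this sketch assumes each pair is unique.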


In [2]:
from sklearn.model_selection import train_test_split

# Shuffle and split dataset into train (80%) and test (20%)
train_df, test_df = train_test_split(df_ratings, test_size=0.2, random_state=42)

# User-item matrices for training and testing
train_matrix = train_df.pivot(index="userId", columns="imdbId", values="rating").fillna(0)
test_matrix = test_df.pivot(index="userId", columns="imdbId", values="rating").fillna(0)

# Convert to NumPy arrays
train_array = train_matrix.values
test_array = test_matrix.values

print(f"Train set size: {len(train_df)}")
print(f"Test set size: {len(test_df)}")


Train set size: 80668
Test set size: 20168


### Computing Movie Similarities

To find how similar two movies are based on user ratings, **cosine similarity** is used. This measures how close two movies are in rating patterns. 

- **Formula**:  

```math
\cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|}
```
where **A** and **B** are rating vectors for two movies.

- **Key Steps:**
  1. Extract user ratings for each movie.
  2. Compute cosine similarity between every pair of movies.
  3. Store these values in a **similarity matrix** for quick lookup.

This similarity matrix helps in predicting user ratings based on movies they have already rated.
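
To make the formula concrete, here is a minimal sketch that applies it to two pairs of columns from the example user-item matrix shown earlier. The helper name `cosine` and the toy matrix `R` are illustrative only.

```python
import numpy as np

# Toy user-item matrix from the example table (rows: users, columns: movies).
R = np.array([
    [4.0, 0.0, 3.5, 5.0],
    [0.0, 2.5, 5.0, 3.0],
    [1.0, 0.0, 4.0, 2.0],
])

def cosine(a, b):
    """Cosine of the angle between two rating vectors (0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom != 0 else 0.0

sim_1_4 = cosine(R[:, 0], R[:, 3])  # movieId=1 vs movieId=4
sim_1_2 = cosine(R[:, 0], R[:, 1])  # movieId=1 vs movieId=2 (no user rated both)
print(round(sim_1_4, 4), sim_1_2)
```

Because unrated movies are stored as `0`, two movies with no raters in common get a similarity of exactly zero.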

In [None]:
'''

from numpy.linalg import norm

# Compute cosine similarity manually
def cosine_similarity(movie1, movie2):
    dot_product = np.dot(movie1, movie2)
    norm_product = norm(movie1) * norm(movie2)
    return dot_product / norm_product if norm_product != 0 else 0

# Create similarity matrix based on the training set
num_movies = train_array.shape[1]
similarity_matrix_train = np.zeros((num_movies, num_movies))

for i in range(num_movies):
    for j in range(num_movies):
        similarity_matrix_train[i, j] = cosine_similarity(train_array[:, i], train_array[:, j])

# Convert to DataFrame
movie_similarity_train_df = pd.DataFrame(similarity_matrix_train, index=train_matrix.columns, columns=train_matrix.columns)

'''

The cell above takes around 30 minutes to run, which is far too long. The cell below instead uses the `cosine_similarity` function from scikit-learn, which produces the same similarity matrix in just over a second.

The original code computes cosine similarity one pair of movies at a time in nested Python loops, which is inefficient for large datasets.
`cosine_similarity(train_array.T)` from sklearn performs the same computation in a highly optimised way using matrix operations.
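
The sketch below illustrates why matrix operations can replace the nested loops: scaling every column to unit length and taking a single matrix product yields all pairwise cosine similarities at once. It uses plain NumPy on a small random matrix (hypothetical data) and is not a claim about how scikit-learn implements the function internally.

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.integers(0, 6, size=(20, 5)).astype(float)  # hypothetical ratings, 0 = unrated

# Loop version: one pair of columns at a time.
n = R.shape[1]
loop_sim = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        denom = np.linalg.norm(R[:, i]) * np.linalg.norm(R[:, j])
        loop_sim[i, j] = R[:, i] @ R[:, j] / denom if denom != 0 else 0.0

# Vectorised version: scale each column to unit length, then one matrix product.
norms = np.linalg.norm(R, axis=0)
norms[norms == 0] = 1.0  # avoid division by zero for all-zero columns
unit = R / norms
vec_sim = unit.T @ unit

print(np.allclose(loop_sim, vec_sim))
```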

In [5]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute similarity matrix in one step
similarity_matrix_train = cosine_similarity(train_array.T)

# Convert to DataFrame
movie_similarity_train_df = pd.DataFrame(similarity_matrix_train, index=train_matrix.columns, columns=train_matrix.columns)


### Making Predictions

Once we have the similarity matrix, we can predict how a user would rate a movie they haven't seen yet. This is done by looking at similar movies they have already rated.

- **Key Steps:**
  1. Identify movies the user has rated.
  2. Find similar movies using the similarity matrix.
  3. Compute the weighted average of ratings from similar movies.
  4. Predict the rating for the unseen movie.

The formula for predicting a rating \( \hat{r}_{u,m} \) for user \( u \) and movie \( m \) is:

```math
\hat{r}_{u,m} = \frac{\sum_{n \in N} \text{similarity}(m, n) \times r_{u,n}}{\sum_{n \in N} |\text{similarity}(m, n)|}
```


where:
- \( N \) is the set of movies similar to \( m \) that the user has rated.
- \( \text{similarity}(m, n) \) is the cosine similarity between movies \( m \) and \( n \).
- \( r_{u,n} \) is the rating given by user \( u \) to movie \( n \).

This allows us to estimate how much a user might like a movie based on their past ratings.
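
As a worked instance of the formula, the sketch below computes a prediction for one user and one target movie. The similarity scores and ratings are hypothetical numbers chosen for illustration.

```python
import numpy as np

# Hypothetical values: similarity of the target movie m to three movies
# the user has rated, and the user's ratings for those three movies.
similarities = np.array([0.9, 0.5, 0.1])
user_ratings = np.array([5.0, 3.0, 1.0])

weighted_sum = np.sum(similarities * user_ratings)  # numerator
sim_sum = np.sum(np.abs(similarities))              # denominator
predicted = weighted_sum / sim_sum if sim_sum != 0 else 0.0
print(f"Predicted rating: {predicted:.4f}")
```

The prediction is pulled towards the rating of the most similar movie (5.0) and is barely affected by the least similar one (1.0).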


### Predicting User Ratings

This function, `predict_rating(user_id, movie_id)`, estimates how a user would rate a given movie based on their past ratings and the similarity between movies.

#### **How it Works:**
1. **Check if the movie exists in the training set:**  
   - If the movie is missing, return a default rating of `0`.

2. **Retrieve the user's past ratings:**  
   - Extract all movies that the user has already rated.

3. **Find similar movies:**  
   - Get similarity scores between the target movie and other movies the user has rated.

4. **Compute a weighted average rating:**  
   - Multiply each similarity score by the user's rating for that movie.  
   - Sum these weighted values.  
   - Normalise by the total similarity sum.

5. **Return the predicted rating:**  
   - If there are no similar movies, return `0`.  
   - Otherwise, return the weighted average rating.

#### **Why Use This Approach?**
This method is a **memory-based collaborative filtering technique** that makes predictions based on past user behavior. It helps recommend movies that are similar to those the user already likes, improving personalization.


In [5]:
def predict_rating(user_id, movie_id):
    # If the movie or the user is not in the training set, return 0
    if movie_id not in train_matrix.columns or user_id not in train_matrix.index:
        return 0
    
    # Get movies rated by the user
    user_ratings = train_matrix.loc[user_id]
    
    # Get similarity scores for the target movie
    similar_movies = movie_similarity_train_df[movie_id]
    
    # Compute weighted average of similar movie ratings
    weighted_sum = 0
    sim_sum = 0
    for rated_movie, rating in user_ratings[user_ratings > 0].items():
        if rated_movie in similar_movies:
            similarity = similar_movies[rated_movie]
            weighted_sum += similarity * rating
            sim_sum += abs(similarity)
    
    # Normalise by similarity sum
    return weighted_sum / sim_sum if sim_sum != 0 else 0

### Evaluating the Model

To measure how well our recommendation system performs, we compare its predicted ratings with actual ratings from a test dataset.

Two common evaluation metrics are the **Mean Absolute Error** (MAE) and the **Root Mean Square Error** (RMSE). Lower MAE and RMSE values indicate better accuracy, meaning our predicted ratings are closer to the ratings users actually gave.



  **Mean Absolute Error (MAE):** Measures the average absolute difference between predicted and actual ratings.
  
  ```math
    MAE = \frac{1}{N} \sum_{i=1}^{N} | \hat{r}_i - r_i |
  ```

In [6]:
# Compute MAE (Mean Absolute Error)
actual_ratings = []
predicted_ratings = []

for _, row in test_df.iterrows():
    user_id = row["userId"]
    movie_id = row["imdbId"]
    actual_rating = row["rating"]
    
    predicted_rating = predict_rating(user_id, movie_id)
    
    actual_ratings.append(actual_rating)
    predicted_ratings.append(predicted_rating)

# Calculate MAE
mae = np.mean(np.abs(np.array(actual_ratings) - np.array(predicted_ratings)))

print(f"Mean Absolute Error (MAE): {mae:.4f}")

Mean Absolute Error (MAE): 0.1627


  **Root Mean Square Error (RMSE):** Penalises larger errors more heavily.
  ```math
    RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\hat{r}_i - r_i)^2}
  ```

In [7]:
# Compute RMSE (Root Mean Squared Error) manually
squared_errors = [(actual - predicted) ** 2 for actual, predicted in zip(actual_ratings, predicted_ratings)]
rmse = np.sqrt(sum(squared_errors) / len(squared_errors))

print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")


Root Mean Squared Error (RMSE): 0.2272
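
To see how the two metrics differ, here is a small sketch on four hypothetical rating pairs, one of which has a large error. The numbers are invented for illustration and are unrelated to the results above.

```python
import numpy as np

actual = np.array([4.0, 3.0, 5.0, 2.0])     # hypothetical true ratings
predicted = np.array([3.5, 3.0, 4.5, 4.0])  # hypothetical predictions

errors = predicted - actual
mae = np.mean(np.abs(errors))
rmse = np.sqrt(np.mean(errors ** 2))
print(f"MAE:  {mae:.4f}")
print(f"RMSE: {rmse:.4f}")
```

The single error of 2.0 dominates the RMSE because errors are squared before averaging, whereas the MAE treats it in proportion to its size.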


## Improvements

To improve, I can use:
- Different similarity metrics (Pearson correlation).
- Weighted collaborative filtering (account for user similarity too).
- Matrix Factorisation (SVD, ALS).

### 1. Using Pearson Correlation Similarity

Cosine similarity only considers the angle between vectors, but it does not account for differences in user rating scales. 
Pearson correlation, on the other hand, measures how well two rating patterns correlate, adjusting for individual biases.
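
A minimal sketch of the difference, using two hypothetical movies rated by the same four users with exactly opposite patterns. The helper names `cosine` and `pearson` are illustrative, and `np.corrcoef` is used here in place of `scipy.stats.pearsonr` to keep the example NumPy-only.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pearson(a, b):
    """Pearson correlation: cosine similarity after subtracting each vector's mean."""
    return float(np.corrcoef(a, b)[0, 1])

# Hypothetical ratings for two movies from the same four users.
movie_a = np.array([1.0, 2.0, 3.0, 4.0])
movie_b = np.array([4.0, 3.0, 2.0, 1.0])  # exactly the opposite pattern

cos_ab = cosine(movie_a, movie_b)
pear_ab = pearson(movie_a, movie_b)
print(f"cosine:  {cos_ab:.4f}")   # positive, because all ratings are above zero
print(f"pearson: {pear_ab:.4f}")  # close to -1, because the patterns are opposite
```

Cosine similarity stays positive whenever all ratings are positive, while Pearson correlation centres each vector on its mean first and so can report that the two patterns disagree.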


In [None]:
from scipy.stats import pearsonr

# Compute Pearson correlation similarity
def pearson_similarity(movie1, movie2):
    common_users = (movie1 > 0) & (movie2 > 0)
    if common_users.sum() < 2:
        return 0
    
    # Check for constant arrays
    if np.std(movie1[common_users]) == 0 or np.std(movie2[common_users]) == 0:
        return 0  # No variation in one or both movies
    
    try:
        return pearsonr(movie1[common_users], movie2[common_users])[0]
    except Exception:
        return 0  # Handle any other exceptions

# Create similarity matrix using Pearson correlation
num_movies = train_array.shape[1]
pearson_similarity_matrix = np.zeros((num_movies, num_movies))

for i in range(num_movies):
    for j in range(num_movies):
        pearson_similarity_matrix[i, j] = pearson_similarity(train_array[:, i], train_array[:, j])

# Convert to DataFrame
pearson_similarity_df = pd.DataFrame(pearson_similarity_matrix, index=train_matrix.columns, columns=train_matrix.columns)


print("Pearson similarity matrix computed!")






In [None]:
from scipy.stats import pearsonr

# Compute Pearson similarity manually
def pearson_similarity(movie1, movie2):
    # Ignore missing values (zeroes)
    common_users = (movie1 > 0) & (movie2 > 0)
    
    if np.sum(common_users) < 2:  # Need at least 2 common ratings
        return 0
    
    return pearsonr(movie1[common_users], movie2[common_users])[0]




# Define num_movies before using it
num_movies = train_array.shape[1]  # Number of movies in train_array

# Create similarity matrix using Pearson
similarity_matrix_pearson = np.zeros((num_movies, num_movies))

for i in range(num_movies):
    for j in range(num_movies):
        similarity_matrix_pearson[i, j] = pearson_similarity(train_array[:, i], train_array[:, j])

# Convert to DataFrame
movie_similarity_pearson_df = pd.DataFrame(similarity_matrix_pearson, index=train_matrix.columns, columns=train_matrix.columns)

print("Pearson similarity matrix computed.")




In [None]:
from scipy.stats import pearsonr

# Compute Pearson similarity manually
def pearson_similarity(movie1, movie2):
    # Ignore missing values (zeroes)
    mask = (movie1 > 0) & (movie2 > 0)
    
    if np.sum(mask) < 2:  # Need at least 2 common ratings
        return 0
    
    return pearsonr(movie1[mask], movie2[mask])[0]

# Define num_movies before using it
num_movies = train_array.shape[1]  # Number of movies

# Create similarity matrix using Pearson
similarity_matrix_pearson = np.zeros((num_movies, num_movies))

for i in range(num_movies):
    for j in range(num_movies):
        similarity_matrix_pearson[i, j] = pearson_similarity(train_array[:, i], train_array[:, j])

# Convert to DataFrame
movie_similarity_pearson_df = pd.DataFrame(similarity_matrix_pearson, index=train_matrix.columns, columns=train_matrix.columns)

print("Pearson similarity matrix computed.")


### 2. Weighted Collaborative Filtering

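Instead of simply averaging the ratings of similar movies, we weight each rating by its similarity score, so movies with higher similarity contribute more to the predicted rating.
This is the same weighted average that `predict_rating` computes. The version below additionally takes the similarity matrix as a parameter, so it can be used with the Pearson matrix as well as the cosine one.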


In [None]:
def predict_weighted_rating(user_id, movie_id, similarity_matrix):
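    # If the movie or the user is not in the training set, return 0
    if movie_id not in train_matrix.columns or user_id not in train_matrix.index:
        return 0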

    user_ratings = train_matrix.loc[user_id]
    similar_movies = similarity_matrix[movie_id]

    # Compute weighted sum
    weighted_sum = 0
    sim_sum = 0
    for rated_movie, rating in user_ratings[user_ratings > 0].items():
        if rated_movie in similar_movies:
            similarity = similar_movies[rated_movie]
            weighted_sum += similarity * rating
            sim_sum += abs(similarity)

    return weighted_sum / sim_sum if sim_sum != 0 else 0

# Test with Pearson similarity
predicted_rating = predict_weighted_rating(user_id=1, movie_id='tt0133093', similarity_matrix=movie_similarity_pearson_df)
print(f"Predicted rating (Pearson, Weighted): {predicted_rating:.4f}")


### 3. Matrix Factorization (SVD)

Memory-based filtering works well but struggles with sparse datasets. 
SVD (Singular Value Decomposition) reduces the user-movie rating matrix into a lower-dimensional space, revealing hidden relationships. 
This allows us to make better recommendations even when explicit ratings are missing.
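
The idea of keeping only a few latent factors can be sketched on the small example matrix from earlier. This version uses `np.linalg.svd` and truncates by hand, rather than `scipy.sparse.linalg.svds`, purely to keep the example self-contained. The choice of `k = 2` is arbitrary.

```python
import numpy as np

# Toy user-item matrix (0 = unrated), same layout as the earlier example table.
R = np.array([
    [4.0, 0.0, 3.5, 5.0],
    [0.0, 2.5, 5.0, 3.0],
    [1.0, 0.0, 4.0, 2.0],
])

# Full SVD, then keep only the k largest singular values (latent factors).
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
R_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.round(R_approx, 2))
```

The entries that were `0` in `R` generally become non-zero in `R_approx`, and these are the values read off as predicted ratings. Note that this treats the zeros as real ratings during the decomposition, which is a simplification.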


In [None]:
from scipy.sparse.linalg import svds

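# Decompose train_matrix using truncated SVD
# Convert the DataFrame to a float NumPy array first, since svds expects an array or sparse matrix
U, sigma, Vt = svds(train_matrix.to_numpy(dtype=float), k=50)  # k = number of latent factors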

# Convert sigma to diagonal matrix
sigma = np.diag(sigma)

# Reconstruct ratings matrix
predicted_ratings_matrix = np.dot(np.dot(U, sigma), Vt)

# Convert to DataFrame
predicted_ratings_df = pd.DataFrame(predicted_ratings_matrix, index=train_matrix.index, columns=train_matrix.columns)

print("SVD-based rating predictions computed.")


### 4. Predict Ratings using SVD

Once we decompose the user-movie matrix into latent factors, we can reconstruct an approximation of the original matrix.
This allows us to make predictions based on learned patterns rather than explicit similarity scores.


In [None]:
def predict_svd_rating(user_id, movie_id):
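    # If the movie or the user is not in the reconstructed matrix, return 0
    if movie_id not in predicted_ratings_df.columns or user_id not in predicted_ratings_df.index:
        return 0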

    return predicted_ratings_df.loc[user_id, movie_id]

predicted_rating_svd = predict_svd_rating(user_id=1, movie_id='tt0133093')
print(f"Predicted rating (SVD): {predicted_rating_svd:.4f}")

### 5. Comparing MAE and RMSE Across Methods

To evaluate our different approaches, we will calculate MAE and RMSE for:
- Cosine similarity (original)
- Pearson correlation (using the weighted prediction function)
- SVD (Matrix Factorization)

Lower MAE/RMSE values indicate better prediction accuracy.


In [None]:
def evaluate_model(predict_function, similarity_matrix=None):
    actual_ratings = []
    predicted_ratings = []

    for _, row in test_df.iterrows():
        user_id = row["userId"]
        movie_id = row["imdbId"]
        actual_rating = row["rating"]

        if similarity_matrix is not None:
            predicted_rating = predict_function(user_id, movie_id, similarity_matrix)
        else:
            predicted_rating = predict_function(user_id, movie_id)

        actual_ratings.append(actual_rating)
        predicted_ratings.append(predicted_rating)

    actual_ratings = np.array(actual_ratings)
    predicted_ratings = np.array(predicted_ratings)

    mae = np.mean(np.abs(actual_ratings - predicted_ratings))
    rmse = np.sqrt(np.mean((actual_ratings - predicted_ratings) ** 2))

    return mae, rmse

# Evaluate all models
models = {
    # predict_rating uses the cosine matrix internally and takes no matrix argument
    "Cosine Similarity": (predict_rating, None),
    "Pearson Correlation": (predict_weighted_rating, movie_similarity_pearson_df),
    "SVD Matrix Factorization": (predict_svd_rating, None),
}

for model_name, (predict_func, sim_matrix) in models.items():
    mae, rmse = evaluate_model(predict_func, sim_matrix)
    print(f"{model_name} - MAE: {mae:.4f}, RMSE: {rmse:.4f}")
