We will use item-based collaborative filtering to predict how users would rate movies.

In [None]:
import os
import pandas as pd
import numpy as np

script_dir = os.getcwd() 

print(f"Current working directory: {script_dir}")

# Load ratings data
ratings_file = os.path.join(script_dir, "Cleaned Datasets", "ratings_imdb_matched.csv")
df_ratings = pd.read_csv(ratings_file)

Current working directory: c:\Users\willi\OneDrive\Documents\GitHub\Test\Movie-Recommendation


### Creating User-Item Matrices

To perform collaborative filtering, we need to convert our rating data into a **user-item matrix**, where:

- **Rows represent users (`userId`)**
- **Columns represent movies** (`movieId` in the example below; the code uses the `imdbId` column)
- **Values represent ratings given by users to movies**
- Missing values (movies that a user hasn't rated) are filled with `0`.

Example User-Item Matrix:

| userId | movieId=1 | movieId=2 | movieId=3 | movieId=4 |
|--------|----------|----------|----------|----------|
| 1      | 4.0      | 0.0      | 3.5      | 5.0      |
| 2      | 0.0      | 2.5      | 5.0      | 3.0      |
| 3      | 1.0      | 0.0      | 4.0      | 2.0      |

The matrix allows us to perform **collaborative filtering** by finding patterns in user ratings. We can then use it to compute **movie similarities** and predict missing ratings.
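
As a minimal sketch of this conversion, the toy ratings below (made-up values that mirror the example table, not rows from the dataset) are pivoted into a user-item matrix. The names `toy_ratings` and `toy_matrix` are illustrative only.

```python
import pandas as pd

# Toy long-format ratings (hypothetical values matching the example table)
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "movieId": [1, 3, 4, 2, 3, 4, 1, 3, 4],
    "rating":  [4.0, 3.5, 5.0, 2.5, 5.0, 3.0, 1.0, 4.0, 2.0],
})

# One row per user, one column per movie; unrated pairs become 0
toy_matrix = toy_ratings.pivot(index="userId", columns="movieId", values="rating").fillna(0)
print(toy_matrix)
```

Note that `pivot` raises a `ValueError` if the same (user, movie) pair appears more than once; `pivot_table` with an aggregation function is one way to handle duplicates.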


In [None]:
from sklearn.model_selection import train_test_split

# Shuffle and split dataset into train (80%) and test (20%)
train_df, test_df = train_test_split(df_ratings, test_size=0.2, random_state=42)

# User-item matrices for training and testing
train_matrix = train_df.pivot(index="userId", columns="imdbId", values="rating").fillna(0)
test_matrix = test_df.pivot(index="userId", columns="imdbId", values="rating").fillna(0)

# Convert to NumPy arrays
train_array = train_matrix.values
test_array = test_matrix.values

print(f"Train set size: {len(train_df)}")
print(f"Test set size: {len(test_df)}")


Train set size: 80668
Test set size: 20168


### Computing Movie Similarities

We use **cosine similarity** to measure how similar two movies are based on user ratings. Two movies whose rating vectors point in nearly the same direction score close to 1; because ratings are non-negative, the score never drops below 0.

- **Formula**:

```math
\cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|}
```

where **A** and **B** are the rating vectors (one entry per user) for two movies.

- **Key Steps:**
  1. Extract user ratings for each movie.
  2. Compute cosine similarity between every pair of movies.
  3. Store these values in a **similarity matrix** for quick lookup.

This similarity matrix helps in predicting user ratings based on movies they have already rated.
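
The steps above can also be expressed with matrix operations. The sketch below is one possible vectorized formulation on a hypothetical 3-user by 3-movie matrix (the `toy` values are invented for illustration); it is not necessarily how the similarity matrix has to be built.

```python
import numpy as np

# Hypothetical 3-user x 3-movie rating matrix (0 = not rated)
toy = np.array([
    [4.0, 3.5, 0.0],
    [0.0, 5.0, 0.0],
    [1.0, 4.0, 0.0],
])

# Pairwise dot products between movie columns (the A . B numerators)
dots = toy.T @ toy

# Outer product of column norms gives every ||A|| * ||B|| denominator
norms = np.linalg.norm(toy, axis=0)
denom = np.outer(norms, norms)

# Divide only where the denominator is non-zero; movies with no ratings keep similarity 0
toy_similarity = np.divide(dots, denom, out=np.zeros_like(dots), where=denom != 0)
print(np.round(toy_similarity, 4))
```

Each movie with at least one rating has similarity 1 with itself, and the third movie, which nobody rated, has similarity 0 with everything, including itself.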

In [None]:
from numpy.linalg import norm

# Compute cosine similarity manually
def cosine_similarity(movie1, movie2):
    """Return the cosine of the angle between two rating vectors (0.0 if either is all zeros)."""
    dot_product = np.dot(movie1, movie2)
    norm_product = norm(movie1) * norm(movie2)
    return dot_product / norm_product if norm_product != 0 else 0.0

# Create similarity matrix based on the training set
num_movies = train_array.shape[1]
similarity_matrix_train = np.zeros((num_movies, num_movies))

# Cosine similarity is symmetric, so compute each pair once and mirror it
for i in range(num_movies):
    for j in range(i, num_movies):
        sim = cosine_similarity(train_array[:, i], train_array[:, j])
        similarity_matrix_train[i, j] = sim
        similarity_matrix_train[j, i] = sim

# Convert to DataFrame
movie_similarity_train_df = pd.DataFrame(similarity_matrix_train, index=train_matrix.columns, columns=train_matrix.columns)

### Making Predictions

Once we have the similarity matrix, we can predict how a user would rate a movie they haven't seen yet. This is done by looking at similar movies they have already rated.

- **Key Steps:**
  1. Identify movies the user has rated.
  2. Find similar movies using the similarity matrix.
  3. Compute the weighted average of ratings from similar movies.
  4. Predict the rating for the unseen movie.

The formula for predicting a rating \( \hat{r}_{u,m} \) for user \( u \) and movie \( m \) is:

```math
\hat{r}_{u,m} = \frac{\sum_{n \in N} \text{similarity}(m, n) \times r_{u,n}}{\sum_{n \in N} |\text{similarity}(m, n)|}
```

where:
- \( N \) is the set of movies that user \( u \) has rated; each one contributes in proportion to its similarity to \( m \).
- \( \text{similarity}(m, n) \) is the cosine similarity between movies \( m \) and \( n \).
- \( r_{u,n} \) is the rating given by user \( u \) to movie \( n \).

This allows us to estimate how much a user might like a movie based on their past ratings.
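
To make the formula concrete, here is a small worked example with invented numbers: a user has rated three movies, and each has a known similarity to the target movie. The names `toy_user_ratings` and `toy_similarities` are hypothetical.

```python
import numpy as np

# Hypothetical inputs: ratings the user gave to three movies, and each
# movie's cosine similarity to the target movie m
toy_user_ratings = np.array([4.0, 2.0, 5.0])
toy_similarities = np.array([0.9, 0.1, 0.5])

weighted_sum = np.sum(toy_similarities * toy_user_ratings)  # numerator: 3.6 + 0.2 + 2.5
sim_sum = np.sum(np.abs(toy_similarities))                  # denominator: 1.5

# Guard against division by zero when no rated movie is similar to m
toy_prediction = weighted_sum / sim_sum if sim_sum != 0 else 0.0
print(f"predicted rating: {toy_prediction:.2f}")  # prints "predicted rating: 4.20"
```

The prediction lands closest to the ratings of the most similar movies: the movie with similarity 0.1 barely moves the result.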


In [None]:
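def predict_rating(user_id, movie_id):
    # If the movie or the user is not in the training set, fall back to 0
    # (without the user check, train_matrix.loc[user_id] would raise a KeyError)
    if movie_id not in train_matrix.columns or user_id not in train_matrix.index:
        return 0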
    
    # Get movies rated by the user
    user_ratings = train_matrix.loc[user_id]
    
    # Get similarity scores for the target movie
    similar_movies = movie_similarity_train_df[movie_id]
    
    # Compute weighted average of similar movie ratings
    weighted_sum = 0
    sim_sum = 0
    for rated_movie, rating in user_ratings[user_ratings > 0].items():
        if rated_movie in similar_movies:
            similarity = similar_movies[rated_movie]
            weighted_sum += similarity * rating
            sim_sum += abs(similarity)
    
    # Normalize by similarity sum
    return weighted_sum / sim_sum if sim_sum != 0 else 0


### Evaluating the Model

To measure how well our recommendation system performs, we compare its predicted ratings with actual ratings from a test dataset.

- **Key Steps:**
  1. **Split the dataset** into training (80%) and testing (20%) sets.
  2. Train the model using only the training data.
  3. Use the model to predict ratings for movies in the test set.
  4. Calculate errors between predicted and actual ratings.

- **Common Evaluation Metrics:**
  - **Mean Absolute Error (MAE):** Measures the average absolute difference between predicted and actual ratings.
    \[
    MAE = \frac{1}{N} \sum_{i=1}^{N} | \hat{r}_i - r_i |
    \]
  - **Root Mean Square Error (RMSE):** Penalizes larger errors more heavily.
    \[
    RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\hat{r}_i - r_i)^2}
    \]

Lower MAE and RMSE values indicate better accuracy, meaning our recommendations are closer to actual user preferences.
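
As a quick illustration of both metrics, the sketch below evaluates them on four invented rating pairs (the `toy_` arrays are hypothetical and unrelated to the dataset).

```python
import numpy as np

# Hypothetical actual and predicted ratings for four test rows
toy_actual = np.array([4.0, 3.0, 5.0, 2.0])
toy_predicted = np.array([3.5, 3.0, 4.0, 3.0])

errors = toy_predicted - toy_actual          # [-0.5, 0.0, -1.0, 1.0]

toy_mae = np.mean(np.abs(errors))            # (0.5 + 0 + 1 + 1) / 4 = 0.625
toy_rmse = np.sqrt(np.mean(errors ** 2))     # sqrt((0.25 + 0 + 1 + 1) / 4) = 0.75

print(f"MAE:  {toy_mae:.3f}")
print(f"RMSE: {toy_rmse:.3f}")
```

RMSE is never smaller than MAE; the gap widens when a few predictions are far off, which is why the two are often reported together.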


In [13]:
# Collect actual and predicted ratings for every row in the test set
actual_ratings = []
predicted_ratings = []

for _, row in test_df.iterrows():
    user_id = row["userId"]
    movie_id = row["imdbId"]
    actual_rating = row["rating"]
    
    predicted_rating = predict_rating(user_id, movie_id)
    
    actual_ratings.append(actual_rating)
    predicted_ratings.append(predicted_rating)

# Calculate MAE
mae = np.mean(np.abs(np.array(actual_ratings) - np.array(predicted_ratings)))

print(f"Mean Absolute Error (MAE): {mae:.4f}")


Mean Absolute Error (MAE): 0.1627
