# Collaborative Filtering

Our content-based engine has some severe limitations. It can only suggest movies that are similar to a given movie, so it cannot capture a user's tastes or recommend across genres.

The engine is also not personal: it does not capture the tastes and biases of an individual user. Anyone who queries it with a given movie receives the same recommendations, regardless of who they are.

Therefore, in this section, we will use a technique called Collaborative Filtering to make recommendations to movie watchers. Collaborative Filtering is based on the idea that users similar to me can be used to predict how much I will like a product or service that those users have experienced but I have not.
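To make the idea concrete, here is a minimal user-based sketch in plain Python. It is not what the rest of this section uses (that is Surprise's SVD); the toy ratings, the user names, and the helper names `cosine_similarity` and `predict_from_neighbours` are all made up for illustration.

```python
from math import sqrt

# Hypothetical toy data: user -> {movie: rating on a 1-5 scale}
toy_ratings = {
    "me":    {"Heat": 5, "Alien": 4, "Up": 1},
    "alice": {"Heat": 5, "Alien": 5, "Up": 1, "Jaws": 5},
    "bob":   {"Heat": 1, "Alien": 2, "Up": 5, "Jaws": 1},
}

def cosine_similarity(a, b):
    """Cosine similarity computed over the movies both users have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[m] * b[m] for m in common)
    norm_a = sqrt(sum(a[m] ** 2 for m in common))
    norm_b = sqrt(sum(b[m] ** 2 for m in common))
    return dot / (norm_a * norm_b)

def predict_from_neighbours(user, movie):
    """Similarity-weighted average of other users' ratings for `movie`."""
    weighted_sum = 0.0
    total_weight = 0.0
    for other, their_ratings in toy_ratings.items():
        if other == user or movie not in their_ratings:
            continue
        sim = cosine_similarity(toy_ratings[user], their_ratings)
        weighted_sum += sim * their_ratings[movie]
        total_weight += sim
    # No neighbour has rated the movie: no prediction is possible
    return weighted_sum / total_weight if total_weight else None

# "me" has not rated Jaws; alice (similar tastes) loved it, bob (different tastes) did not
print(predict_from_neighbours("me", "Jaws"))
```

Because "me" agrees with alice far more than with bob, the estimate lands closer to alice's rating than to bob's.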

I will not implement Collaborative Filtering from scratch. Instead, I will use the Surprise library, which provides powerful algorithms such as Singular Value Decomposition (SVD) that are trained to minimise RMSE (Root Mean Square Error) and can give good recommendations.
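As a rough picture of what an SVD-style model learns, the sketch below shows the biased matrix-factorisation prediction rule (global mean plus user bias plus item bias plus a dot product of latent factors) and one stochastic gradient descent step on the squared error. This is a simplified illustration of the technique rather than Surprise's actual code; every number and the helper names `predict_rating` and `sgd_step` are assumptions made for the example.

```python
# Toy values; all numbers here are made up for illustration.
global_mean = 3.5            # mu: mean of all training ratings
user_bias = 0.2              # b_u: this user rates a bit above average
item_bias = -0.1             # b_i: this movie is rated a bit below average
user_factors = [0.5, -0.3]   # p_u: latent taste vector of the user
item_factors = [0.4, 0.1]    # q_i: latent trait vector of the movie

def predict_rating(mu, b_u, b_i, p_u, q_i):
    """r_hat = mu + b_u + b_i + q_i . p_u"""
    return mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))

def sgd_step(rating, mu, b_u, b_i, p_u, q_i, lr=0.005, reg=0.02):
    """One regularised gradient step towards the observed rating."""
    err = rating - predict_rating(mu, b_u, b_i, p_u, q_i)
    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)
    new_p_u = [p + lr * (err * q - reg * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr * (err * p - reg * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i

before = predict_rating(global_mean, user_bias, item_bias, user_factors, item_factors)
b_u, b_i, p_u, q_i = sgd_step(5.0, global_mean, user_bias, item_bias,
                              user_factors, item_factors)
after = predict_rating(global_mean, b_u, b_i, p_u, q_i)
print(before, after)  # the prediction moves towards the observed rating of 5
```

Repeating such steps over all known ratings for several passes is what drives the training error (and hopefully the test RMSE) down.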

In [48]:
import pandas as pd
from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split
from surprise import accuracy

In [49]:
# Load the ratings CSV file
ratings = pd.read_csv('ratings.csv')

# Define a Reader object to parse the ratings data
reader = Reader(rating_scale=(1, 5))

# Load the data into the Surprise Dataset
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

# Split the data into train and test sets
trainset, testset = train_test_split(data, test_size=0.2)


In [50]:
# Initialize the SVD model
svd = SVD()

# Train the model on the trainset
svd.fit(trainset)

# Predict ratings for the testset
predictions = svd.test(testset)

In [51]:
# Evaluate the predictions using RMSE and MAE metrics
# (verbose=False stops accuracy.rmse/mae from printing the values a second time)
rmse = accuracy.rmse(predictions, verbose=False)
mae = accuracy.mae(predictions, verbose=False)

print("RMSE:", rmse)
print("MAE:", mae)

RMSE: 1.480244353303389
MAE: 1.2756915449418076


The RMSE is about 1.48 and the MAE about 1.28. Both measure the typical size of the error between the predicted and the actual ratings, with RMSE penalising large errors more heavily. On a 1 to 5 rating scale these errors are large: for movie recommendation, an RMSE below 1.0 and an MAE below 0.8 would generally be considered very good, so this model leaves clear room for improvement.
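For reference, both metrics are easy to compute by hand. The sketch below uses made-up ratings and hypothetical helper names, and also scores a naive baseline that always predicts the mean rating. Comparing a model against such a baseline is a quick sanity check, since a useful model should beat it. For brevity the mean is taken from the same toy ratings; in practice it would come from the training set.

```python
from math import sqrt

def root_mean_squared_error(actual, predicted):
    """Square root of the mean squared difference; penalises large errors more."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mean_absolute_error(actual, predicted):
    """Mean of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Made-up ratings for illustration only
actual = [5, 3, 4, 1]
predicted = [4, 3, 2, 2]

# Naive baseline: always predict the mean rating
mean_rating = sum(actual) / len(actual)
baseline = [mean_rating] * len(actual)

print("model    RMSE:", root_mean_squared_error(actual, predicted))
print("model    MAE: ", mean_absolute_error(actual, predicted))
print("baseline RMSE:", root_mean_squared_error(actual, baseline))
print("baseline MAE: ", mean_absolute_error(actual, baseline))
```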

Movies rated by user 1 (the five most popular)

In [54]:
# Load the merged data CSV file
merged_data = pd.read_csv('movie_data_with_keywords.csv')

# Load the ratings CSV file
ratings = pd.read_csv('ratings.csv')

# Specify the target user ID
target_user_id = 1

# Get rated movies for the target user from the ratings data
rated_movies = ratings[ratings['userId'] == target_user_id]['movieId']

# Filter the merged data for rated movies by the target user
rated_movies_data = merged_data[merged_data['movie_id'].isin(rated_movies)]

# Sort the rated movies by popularity (you can change this to another metric)
sorted_rated_movies = rated_movies_data.sort_values(by='popularity', ascending=False)

# Display the top 5 original titles of rated movies
top_rated_original_titles = sorted_rated_movies.head(5)['original_title']
print("Top 5 Rated Movies by User", target_user_id)
print(top_rated_original_titles)



Top 5 Rated Movies by User 1
68                                   The Dark Knight
110                                    The Godfather
3                                       Forrest Gump
127                         The Shawshank Redemption
56     The Lord of the Rings: The Return of the King
Name: original_title, dtype: object


In [41]:
# Load the merged data CSV file
merged_data = pd.read_csv('movie_data_with_keywords.csv')

# Recommend movies for a specific user (e.g., user with ID 1)
target_user_id = 1

# Get unrated movies for the target user
# (.copy() gives an independent DataFrame, which avoids pandas' SettingWithCopyWarning below)
rated_ids = ratings[ratings['userId'] == target_user_id]['movieId']
unrated_movies = merged_data[~merged_data['movie_id'].isin(rated_ids)].copy()

# Predict ratings for unrated movies
unrated_movies['predicted_rating'] = unrated_movies['movie_id'].apply(
    lambda movie_id: svd.predict(target_user_id, movie_id).est
)


In [43]:
# Sort unrated movies by predicted rating
recommended_movies = unrated_movies.sort_values(by='predicted_rating', ascending=False)

# Display the recommended movie names and predicted ratings
print("Recommended Movies:")
for index, row in recommended_movies.iterrows():
    print(f"{row['movie_name']} - Predicted Rating: {row['predicted_rating']}")

Recommended Movies:
Four Rooms - Predicted Rating: 2.6904338889185144
Ironclad - Predicted Rating: 2.6904338889185144
Ondine - Predicted Rating: 2.6904338889185144
Gandhi, My Father - Predicted Rating: 2.6904338889185144
Bran Nue Dae - Predicted Rating: 2.6904338889185144
Grown Ups - Predicted Rating: 2.6904338889185144
Fair Game - Predicted Rating: 2.6904338889185144
The Last Exorcism - Predicted Rating: 2.6904338889185144
Morning Glory - Predicted Rating: 2.6904338889185144
Transformers: Dark of the Moon - Predicted Rating: 2.6904338889185144
Big Mommas: Like Father, Like Son - Predicted Rating: 2.6904338889185144
Priest - Predicted Rating: 2.6904338889185144
Your Highness - Predicted Rating: 2.6904338889185144
Zookeeper - Predicted Rating: 2.6904338889185144
You Again - Predicted Rating: 2.6904338889185144
Harriet the Spy - Predicted Rating: 2.6904338889185144
Eat Pray Love - Predicted Rating: 2.6904338889185144
The Divide - Predicted Rating: 2.6904338889185144
The 41–Year–Old Virgi

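In this example, we load the ratings data, train the SVD model, and use the trained model's predict method to estimate ratings for the movies that user 1 has not rated. The recommended movies are then sorted by predicted rating. Note that every movie in the output above receives the same predicted rating (about 2.69). This suggests that the model has no information about these movies, probably because their IDs do not appear in the training ratings, so it falls back to a baseline estimate and the resulting ranking is not yet meaningful.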
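Since every title in the output above has the same predicted rating, it can help to rank only the movies that the model actually saw during training. The sketch below shows that filtering and top-N step in plain Python. The IDs, the estimates, and the helper `top_n` are hypothetical stand-ins, not values from the trained model.

```python
import heapq

# Hypothetical stand-ins for a trained model's state
known_items = {10, 20, 30}                 # movie IDs seen during training
estimates = {10: 4.2, 20: 3.1, 30: 4.8}    # made-up predicted ratings
candidates = [10, 20, 30, 40, 50]          # movies the user has not rated

def top_n(candidate_ids, n=2):
    """Return the n candidate IDs with the highest estimates, skipping unknown movies."""
    scored = [(estimates[m], m) for m in candidate_ids if m in known_items]
    return [movie_id for _, movie_id in heapq.nlargest(n, scored)]

print(top_n(candidates))
```

Movies 40 and 50 are dropped because the model knows nothing about them, so they cannot crowd the list with identical fallback scores.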