# Collaborative Filtering Analysis


This notebook implements and evaluates user-based and item-based collaborative filtering for movie recommendations.


In [24]:

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import mean_squared_error
from math import sqrt


## Load and Inspect Data

In [25]:

# Load datasets
ratings = pd.read_csv('../data/cleaned_ratings.csv')
users = pd.read_csv('../data/cleaned_users.csv')
movies = pd.read_csv('../data/cleaned_movies.csv')

# Display dataset information
ratings.head(), users.head(), movies.head()


(   user_id  item_id  rating            timestamp
 0      196      242       3  1997-12-04 15:55:49
 1      186      302       3  1998-04-04 19:22:22
 2       22      377       1  1997-11-07 07:18:36
 3      244       51       2  1997-11-27 05:02:03
 4      166      346       1  1998-02-02 05:33:16,
    user_id  age gender  occupation zip_code
 0        1   24      M  technician    85711
 1        2   53      F       other    94043
 2        3   23      M      writer    32067
 3        4   24      M  technician    43537
 4        5   33      F       other    15213,
    movie_id        movie_title release_date  \
 0         1   Toy Story (1995)  01-Jan-1995   
 1         2   GoldenEye (1995)  01-Jan-1995   
 2         3  Four Rooms (1995)  01-Jan-1995   
 3         4  Get Shorty (1995)  01-Jan-1995   
 4         5     Copycat (1995)  01-Jan-1995   
 
                                             imdb_url  unknown  Action  \
 0  http://us.imdb.com/M/title-exact?Toy%20Story%2...        0  

## User-Based Collaborative Filtering

In [38]:

# Create user-movie matrix; unrated cells are filled with 0 so that cosine similarity can be computed
user_movie_matrix = ratings.pivot(index='user_id', columns='item_id', values='rating').fillna(0)

# Calculate cosine similarity between users
user_similarity = cosine_similarity(user_movie_matrix)
user_similarity_df = pd.DataFrame(user_similarity, index=user_movie_matrix.index, columns=user_movie_matrix.index)

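# Predict a rating as the similarity-weighted average over the k most similar users
def predict_user_rating(user_id, movie_id, user_movie_matrix, user_similarity_df, k=5):
    if movie_id not in user_movie_matrix.columns:
        return None
    user_similarities = user_similarity_df.loc[user_id]
    # NOTE: the matrix was zero-filled, so notna() is True for every user and this mask filters nothing;
    # neighbors who never rated the movie enter the average with a rating of 0.
    # nlargest(k+1).iloc[1:] drops the top match, which is normally the user themselves (similarity 1).
    similar_users = user_similarities[user_movie_matrix[movie_id].notna()].nlargest(k+1).iloc[1:]
    similar_users_ratings = user_movie_matrix.loc[similar_users.index, movie_id]
    weighted_sum = (similar_users_ratings * similar_users).sum()
    similarity_sum = similar_users.sum()
    # Return None when all neighbor similarities are 0 (no usable neighbors)
    return weighted_sum / similarity_sum if similarity_sum != 0 else None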

# Example: Predict a rating
predict_user_rating(1, 2, user_movie_matrix, user_similarity_df)


3.1998799373359197

The predicted rating for user 1 on movie 2 is about 3.20. It is the weighted average of the ratings given by the most similar users, with cosine similarity as the weight.
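In `predict_user_rating`, the `notna()` mask is applied to a matrix that was already filled with zeros, so it keeps every user. Below is a minimal sketch of a variant that keeps missing ratings as NaN and averages only over neighbors who actually rated the movie. The names `cosine_sim_df` and `predict_user_rating_masked` and the toy data are hypothetical and not part of the notebook. Similarity is still computed on a zero-filled copy, an assumption carried over from the approach above.

```python
import numpy as np
import pandas as pd

def cosine_sim_df(matrix):
    """Cosine similarity between the rows of a DataFrame, treating NaN as 0."""
    filled = matrix.fillna(0).to_numpy(dtype=float)
    norms = np.linalg.norm(filled, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-empty rows
    unit = filled / norms
    return pd.DataFrame(unit @ unit.T, index=matrix.index, columns=matrix.index)

def predict_user_rating_masked(user_id, movie_id, ratings_matrix, similarity_df, k=5):
    """Weighted average over the k most similar users who rated movie_id."""
    if movie_id not in ratings_matrix.columns or user_id not in ratings_matrix.index:
        return None
    # Only users with a real (non-NaN) rating for this movie, excluding the target user
    rated_by = ratings_matrix[movie_id].dropna().index.drop(user_id, errors="ignore")
    if len(rated_by) == 0:
        return None
    neighbours = similarity_df.loc[user_id, rated_by].nlargest(k)
    total = neighbours.sum()
    if total == 0:
        return None
    weighted = (ratings_matrix.loc[neighbours.index, movie_id] * neighbours).sum()
    return float(weighted / total)

# Toy data (hypothetical): three users, three movies, user 1 has not rated movie 30
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3, 3],
    "item_id": [10, 20, 10, 30, 10, 20, 30],
    "rating":  [5, 4, 5, 2, 1, 2, 5],
})
toy_matrix = toy.pivot(index="user_id", columns="item_id", values="rating")  # NaN = unrated
toy_sim = cosine_sim_df(toy_matrix)
toy_pred = predict_user_rating_masked(1, 30, toy_matrix, toy_sim, k=2)
print(toy_pred)
```

Because only real ratings enter the average, the prediction always falls between the lowest and highest rating given by the selected neighbors.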

## Item-Based Collaborative Filtering

In [39]:

# Create item-user matrix (transpose of the user-movie matrix): one row per movie, one column per user
item_movie_matrix = user_movie_matrix.T

# Calculate cosine similarity between items (movies)
item_similarity = cosine_similarity(item_movie_matrix)
item_similarity_df = pd.DataFrame(item_similarity, index=item_movie_matrix.index, columns=item_movie_matrix.index)

# Predict a rating as the similarity-weighted average over the k most similar movies
def predict_item_rating(user_id, movie_id, user_movie_matrix, item_similarity_df, k=5):
    if movie_id not in user_movie_matrix.columns:
        return None
    user_ratings = user_movie_matrix.loc[user_id]
    movie_similarities = item_similarity_df[movie_id]
    # NOTE: the matrix was zero-filled, so notna() keeps every movie, whether the user rated it or not
    rated_movies = user_ratings[user_ratings.notna()].index
    # The k most similar movies typically include movie_id itself (similarity 1 with itself)
    similar_movies = movie_similarities[rated_movies].nlargest(k)
    weighted_sum = (user_ratings[similar_movies.index] * similar_movies).sum()
    similarity_sum = similar_movies.sum()
    # Return None when all similarities are 0 (no usable neighbors)
    return weighted_sum / similarity_sum if similarity_sum != 0 else None

# Example: Predict a rating
predict_item_rating(1, 2, user_movie_matrix, item_similarity_df)


1.9375113053433601

The predicted rating for user 1 on movie 2 using item-based (movie-based) collaborative filtering is about 1.94.
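`predict_item_rating` selects neighbors from every movie, including unrated ones (stored as 0) and the target movie itself. Here is a hedged sketch of a variant that averages only over movies the user actually rated and leaves the target movie out. The function name `predict_item_rating_masked`, the hand-written similarity table, and the toy ratings are all hypothetical.

```python
import numpy as np
import pandas as pd

def predict_item_rating_masked(user_id, movie_id, ratings_matrix, item_similarity_df, k=5):
    """Weighted average over the k movies most similar to movie_id that the user rated."""
    if movie_id not in ratings_matrix.columns or user_id not in ratings_matrix.index:
        return None
    user_ratings = ratings_matrix.loc[user_id].dropna()  # real ratings only
    # Exclude the target movie so its own rating cannot leak into the prediction
    candidates = user_ratings.index.drop(movie_id, errors="ignore")
    if len(candidates) == 0:
        return None
    neighbours = item_similarity_df.loc[movie_id, candidates].nlargest(k)
    total = neighbours.sum()
    if total == 0:
        return None
    weighted = (user_ratings.loc[neighbours.index] * neighbours).sum()
    return float(weighted / total)

# Toy data (hypothetical): one user, three movies, movie 20 unrated
items = [10, 20, 30]
toy_item_sim = pd.DataFrame(
    [[1.0, 0.9, 0.1],
     [0.9, 1.0, 0.5],
     [0.1, 0.5, 1.0]],
    index=items, columns=items,
)
toy_ratings = pd.DataFrame([[5.0, np.nan, 1.0]], index=[1], columns=items)

# Unrated movie 20: (0.9 * 5 + 0.5 * 1) / (0.9 + 0.5)
pred_unseen = predict_item_rating_masked(1, 20, toy_ratings, toy_item_sim, k=2)
# Already-rated movie 10: only movie 30 is used, so the known rating of 5 is not reused
pred_seen = predict_item_rating_masked(1, 10, toy_ratings, toy_item_sim, k=2)
print(pred_unseen, pred_seen)
```

Excluding the target movie matters mainly for evaluation: a held-out rating that is still present in the matrix would otherwise be its own nearest neighbor.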

## Evaluation of Collaborative Filtering Models

In [28]:

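# Split the data for evaluation
# NOTE: user_movie_matrix and both similarity matrices were built from all ratings,
# so the held-out test ratings are visible to the predictors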
train_data = ratings.sample(frac=0.8, random_state=42)
test_data = ratings.drop(train_data.index)

# Predict ratings and evaluate with RMSE for user-based filtering; missing predictions (None) are scored as 0
test_data['predicted_user_rating'] = test_data.apply(
    lambda row: predict_user_rating(row['user_id'], row['item_id'], user_movie_matrix, user_similarity_df) or 0, axis=1
)
rmse_user = sqrt(mean_squared_error(test_data['rating'], test_data['predicted_user_rating']))
rmse_user


2.0268093184405904

In [29]:

# Predict ratings and evaluate with RMSE for item-based filtering; missing predictions (None) are scored as 0
test_data['predicted_item_rating'] = test_data.apply(
    lambda row: predict_item_rating(row['user_id'], row['item_id'], user_movie_matrix, item_similarity_df) or 0, axis=1
)
rmse_item = sqrt(mean_squared_error(test_data['rating'], test_data['predicted_item_rating']))
rmse_item


1.2241869648288641
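Both RMSE figures rely on matrices built from the full ratings table, including the held-out 20%, and score a missing prediction as 0. The sketch below shows a stricter setup under stated assumptions: the rating matrix is pivoted from the training rows only, and RMSE is computed over rows that received a prediction, with coverage reported separately. The helpers `split_and_pivot` and `rmse_with_coverage` and the toy data are hypothetical, not part of the notebook.

```python
import numpy as np
import pandas as pd

def split_and_pivot(ratings_df, frac=0.8, seed=42):
    """Split ratings into train/test and build the rating matrix from train only."""
    train = ratings_df.sample(frac=frac, random_state=seed)
    test = ratings_df.drop(train.index)
    # NaN marks unrated cells; test ratings never enter this matrix
    train_matrix = train.pivot(index="user_id", columns="item_id", values="rating")
    return train, test, train_matrix

def rmse_with_coverage(actual, predicted):
    """RMSE over rows that have a prediction, plus the share of such rows."""
    actual = pd.Series(actual, dtype=float).reset_index(drop=True)
    predicted = pd.Series(predicted, dtype=float).reset_index(drop=True)  # None -> NaN
    covered = predicted.notna()
    if not covered.any():
        return None, 0.0
    err = actual[covered] - predicted[covered]
    return float(np.sqrt((err ** 2).mean())), float(covered.mean())

# Toy data (hypothetical): five users, two movies
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "item_id": [1, 2] * 5,
    "rating":  [5, 3, 4, 2, 1, 5, 3, 3, 2, 4],
})
toy_train, toy_test, toy_train_matrix = split_and_pivot(toy)

# One of three test rows has no prediction and is left out of the RMSE
toy_rmse, toy_coverage = rmse_with_coverage([4, 3, 5], [3.0, None, 5.0])
print(toy_rmse, toy_coverage)
```

Reporting coverage next to RMSE keeps the two effects apart: a model that rarely predicts is not penalized as if it had predicted 0, but its low coverage stays visible.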

## Recommendations with Fallback

In [30]:

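# Recommend movies with fallback to popular movies if no predictions are available
def recommend_movies_with_fallback(user_id, user_movie_matrix, item_similarity_df, movies_df, top_n=10, k=5):
    # NOTE: when the matrix is zero-filled, isna() is never True, so unrated_movies is empty
    # and the popularity fallback below is what gets returned
    unrated_movies = user_movie_matrix.loc[user_id][user_movie_matrix.loc[user_id].isna()].index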
    predicted_ratings = {movie_id: predict_item_rating(user_id, movie_id, user_movie_matrix, item_similarity_df, k=k)
                         for movie_id in unrated_movies}
    predicted_ratings = {movie_id: rating for movie_id, rating in predicted_ratings.items() if rating is not None}
    if not predicted_ratings:
        popular_movies = user_movie_matrix.sum(axis=0).nlargest(top_n).index
        return movies_df[movies_df['movie_id'].isin(popular_movies)][['movie_title']]
    sorted_movies = sorted(predicted_ratings.items(), key=lambda x: x[1], reverse=True)[:top_n]
    recommended_df = pd.DataFrame(sorted_movies, columns=['movie_id', 'Predicted_Rating'])
    return recommended_df.merge(movies_df[['movie_id', 'movie_title']], on='movie_id')[['movie_title', 'Predicted_Rating']]

# Example: Generate recommendations for user 1
recommend_movies_with_fallback(1, user_movie_matrix, item_similarity_df, movies)


                          movie_title
0                    Toy Story (1995)
49                   Star Wars (1977)
97   Silence of the Lambs, The (1991)
99                       Fargo (1996)
126             Godfather, The (1972)
173    Raiders of the Lost Ark (1981)
180         Return of the Jedi (1983)
257                    Contact (1997)
284       English Patient, The (1996)
286                     Scream (1996)
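The output above lists titles only, with no `Predicted_Rating` column, which suggests the popularity fallback was returned: in a zero-filled matrix, `isna()` finds no unrated movies. Here is a minimal sketch of the same idea on a matrix that keeps NaN for unrated cells. The function `recommend_with_fallback`, its `predict_fn` callback, and the toy data are hypothetical. Popularity is measured by the sum of ratings, as in the function above.

```python
import numpy as np
import pandas as pd

def recommend_with_fallback(user_id, ratings_matrix, predict_fn, top_n=10):
    """Return (movie_ids, source), where source is 'predicted' or 'popular'."""
    row = ratings_matrix.loc[user_id]
    unrated = row[row.isna()].index  # NaN marks movies the user has not rated
    scored = {m: predict_fn(user_id, m) for m in unrated}
    scored = {m: r for m, r in scored.items() if r is not None}
    if not scored:
        # Fallback: most popular movies by total rating (NaN cells are skipped by sum)
        popular = ratings_matrix.sum(axis=0).nlargest(top_n)
        return list(popular.index), "popular"
    ranked = sorted(scored.items(), key=lambda x: x[1], reverse=True)[:top_n]
    return [m for m, _ in ranked], "predicted"

# Toy data (hypothetical): user 1 has rated only movie 10
toy_matrix = pd.DataFrame(
    [[5.0, np.nan, np.nan],
     [4.0, 3.0, 5.0]],
    index=[1, 2], columns=[10, 20, 30],
)
toy_scores = {20: 2.5, 30: 4.5}  # stand-in for a real predictor

recs, source = recommend_with_fallback(1, toy_matrix, lambda u, m: toy_scores.get(m), top_n=2)
fallback_recs, fallback_source = recommend_with_fallback(1, toy_matrix, lambda u, m: None, top_n=2)
print(recs, source, fallback_recs, fallback_source)
```

Returning the source alongside the ids makes it explicit which branch produced the list, something the titles-only output cannot show.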
