# Assignment 4 - Collaborative Filterting

Use collaborative filtering technique to generate a list of movies for a specific user.

The ratings_small.csv file contains the userid, movieid, and rating value. For example:

| userid | movieid | rating |
|:------:|:-------:|:------:|
| 1      | 31      | 2.5    |

User 1 rate movie 31 as 2.5

Implement the collaborative filtering technique using the training set read from the file: ratings_small_training.csv

1. Download ratings_small_training.csv.
2. Use this training set to predict the rating of the user and movie from the test set: ratings_small_test.csv
3. Download ratings_small_test.csv.
4. Add the third column to ratings_small_test.csv for the ratings of each user and each movie.

## Approach - Hybrid approach: User-Based Collaborative Filtering and Item-Based Collaborative Filtering

In the hybrid approach, I combined two collaborative filtering techniques: User-Based Collaborative Filtering (UBCF) and Item-Based Collaborative Filtering (IBCF).

1. **User-Based Collaborative Filtering (UBCF):** UBCF predicts a user's ratings for items based on the ratings of similar users. It calculates the _similarity between users_ based on their ratings and uses this similarity to predict a user's rating for an item by taking a weighted average of the ratings of similar users.

2. **Item-Based Collaborative Filtering (IBCF):** IBCF predicts a user's ratings for items based on the ratings of similar items. It calculates the _similarity between items_ based on user ratings and uses this similarity to predict a user's rating for an item by taking a weighted average of the user's ratings for similar items.

For the hybrid model, I combined the predictions from UBCF and IBCF by assigning weights to each approach. I used a weight of 0.7 for UBCF and 0.3 for IBCF.

- **UBCF Weight (0.7):** I assigned a higher weight to UBCF (0.7) because it tends to perform better when there are more users in the dataset, as it relies on finding similar users to make predictions. By giving UBCF a higher weight, we are emphasizing the predictions based on user similarities, which can be more reliable when there are enough users to compare.

- **IBCF Weight (0.3):** I assigned a lower weight to IBCF (0.3) because it can be more sensitive to sparsity in the dataset, especially when there are fewer ratings per item. By giving IBCF a lower weight, we are still considering the predictions based on item similarities, but to a lesser extent compared to UBCF.

These weights are not fixed and can be tuned based on the characteristics of the dataset and the performance of each approach.

In [1]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
ratings_small_training_csv = "../data/ratings_small_training.csv"
ratings_small_test_csv = "../data/ratings_small_test.csv"

with open(ratings_small_training_csv, "rb") as training_csv:
    train_data = pd.read_csv(training_csv)

with open(ratings_small_test_csv, "rb") as test_csv:
    test_data = pd.read_csv(test_csv)

train_data.rename(columns={'userId': 'userid', 'movieId': 'movieid'}, inplace=True)
test_data.rename(columns={'userid': 'userid', 'movieid': 'movieid'}, inplace=True)

In [3]:
def predict_using_ubcf(user_id, movie_id, train_data, user_similarity_matrix, k=4):
    """
    Predict the rating of a user for a given movie using User-Based Collaborative Filtering (UBCF).

    Parameters:
    - user_id (int): The ID of the target user.
    - movie_id (int): The ID of the target movie.
    - train_data (DataFrame): The training dataset containing user-item ratings.
    - user_similarity_matrix (ndarray): The similarity matrix between users.
    - k (int): The number of similar users to consider. Default is 4.

    Returns:
    - float: The predicted rating for the target user and movie, or the average rating of the item if no similar items are found.
    """
    user_ratings = train_data[train_data['userid'] == user_id].set_index('movieid')['rating']
    user_similarities = user_similarity_matrix[user_id - 1]
    similar_users = user_similarities.argsort()[::-1][1:k+1]

    weighted_sum = 0
    similarity_sum = 0
    for similar_user_id in similar_users:
        similar_user_rating = train_data[(train_data['userid'] == similar_user_id) & (train_data['movieid'] == movie_id)]['rating']
        if not similar_user_rating.empty:
            similarity = user_similarities[similar_user_id - 1]
            weighted_sum += similarity * similar_user_rating.values[0]
            similarity_sum += similarity

    if similarity_sum == 0:
        # Return the average rating of the user if no similar users are found
        return user_ratings.mean()

    predicted_rating = weighted_sum / similarity_sum
    return predicted_rating

In [4]:
def predict_using_ibcf(user_id, movie_id, train_data, item_similarity_matrix, k=4):
    """
    Predict the rating of a user for a given movie using Item-Based Collaborative Filtering (IBCF).

    Parameters:
    - user_id (int): The ID of the target user.
    - movie_id (int): The ID of the target movie.
    - train_data (DataFrame): The training dataset containing user-item ratings.
    - item_similarity_matrix (ndarray): The similarity matrix between items.
    - k (int): The number of similar items to consider. Default is 4.

    Returns:
    - float: The predicted rating for the target user and movie, or the average rating of the item if no similar items are found.
    """
    user_ratings = train_data[train_data['userid'] == user_id].set_index('movieid')['rating']
    item_similarities = item_similarity_matrix[movie_id - 1]
    similar_items = item_similarities.argsort()[::-1][1:k+1]

    weighted_sum = 0
    similarity_sum = 0
    for similar_item_id in similar_items:
        similar_item_rating = user_ratings.get(similar_item_id)
        if similar_item_rating is not None:
            similarity = item_similarities[similar_item_id - 1]
            weighted_sum += similarity * similar_item_rating
            similarity_sum += similarity

    if similarity_sum == 0:
        return train_data[train_data['movieid'] == movie_id]['rating'].mean()

    predicted_rating = weighted_sum / similarity_sum
    return predicted_rating

In [5]:
user_similarity_matrix = cosine_similarity(train_data.pivot(index='userid', columns='movieid', values='rating').fillna(0))

print(user_similarity_matrix)

[[1.         0.         0.         ... 0.06291708 0.         0.01746565]
 [0.         1.         0.10155486 ... 0.02425089 0.17137938 0.11369591]
 [0.         0.10155486 1.         ... 0.08152754 0.11141105 0.17133542]
 ...
 [0.06291708 0.02425089 0.08152754 ... 1.         0.04260878 0.08520194]
 [0.         0.17137938 0.11141105 ... 0.04260878 1.         0.22867673]
 [0.01746565 0.11369591 0.17133542 ... 0.08520194 0.22867673 1.        ]]


In [6]:
item_similarity_matrix = cosine_similarity(train_data.pivot(index='movieid', columns='userid', values='rating').fillna(0))

print(item_similarity_matrix)

[[1.         0.39451145 0.30651588 ... 0.         0.         0.05582876]
 [0.39451145 1.         0.21749153 ... 0.         0.         0.        ]
 [0.30651588 0.21749153 1.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 1.         1.         0.        ]
 [0.         0.         0.         ... 1.         1.         0.        ]
 [0.05582876 0.         0.         ... 0.         0.         1.        ]]


In [7]:
test_data['predicted_rating'] = test_data.apply(lambda x: 
    0.7 * predict_using_ubcf(x['userid'], x['movieid'], train_data, user_similarity_matrix) +
    0.3 * predict_using_ibcf(x['userid'], x['movieid'], train_data, item_similarity_matrix), axis=1)

In [8]:
test_data

Unnamed: 0,userid,movieid,predicted_rating
0,1,1339,2.774423
1,2,377,3.52095
2,3,527,4.092593
3,4,112,4.019504
4,5,150,3.620603
5,6,2072,3.17093
6,7,333,3.422136
7,8,805,3.756466
8,9,608,3.969122
9,10,1127,3.664365


In [9]:
ratings_small_test_predicted_csv = "../data/ratings_small_test_predicted.csv"
with open(ratings_small_test_predicted_csv, "wb") as output_data:
    test_data.to_csv(output_data, index=False)