In [1]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from sklearn.preprocessing import MinMaxScaler

In [2]:
# (2: 30 pts) On Canvas, there is the fake_rmp.csv file, which is a simulated data set of 100 students and their
# ratings of 15 professors (on a scale of 1-5). Read the data into Python, making sure that the first column is treated
# as the index:
# rmp_df = pd.read_csv('fake_rmp.csv', index_col=0)
# rmp_df.head()

rmp_df = pd.read_csv('fake_rmp.csv', index_col = 0)
rmp_df.head()

Unnamed: 0,Dr. Gerber,Dr. Johnson,Dr. Craig,Dr. Zhu,Dr. Wang,Dr. Murphy,Dr. Sun,Dr. Eagan,Dr. Ness,Dr. Sudyanti,Dr. Zhang,Dr. Wasab,Dr. Diallo,Dr. Hernandez,Dr. King
Student_1,,3.0,,5.0,4.0,4.0,5.0,2.0,,4.0,2.0,4.0,5.0,2.0,3.0
Student_2,,4.0,5.0,5.0,5.0,4.0,4.0,5.0,,4.0,4.0,3.0,5.0,1.0,4.0
Student_3,4.0,4.0,2.0,4.0,3.0,5.0,2.0,4.0,4.0,5.0,2.0,3.0,3.0,5.0,3.0
Student_4,2.0,5.0,5.0,5.0,5.0,2.0,4.0,3.0,5.0,4.0,5.0,4.0,4.0,4.0,5.0
Student_5,3.0,,4.0,,,4.0,5.0,2.0,5.0,3.0,5.0,5.0,2.0,4.0,4.0


In [3]:
rmp_df.loc['Student_5', 'Dr. Johnson']

nan

In [4]:
# (a) In Python, write a function called collab_filter() which:
# • Takes as arguments a pandas data frame (such as the fake_rmp.csv file), the name of a target user (the user
# id, e.g. "Student_1"), a string which specifies which similarity metric (L2 or Cosine) to use, and a value k
# which specifies the number of most similar users to predict the target user with.
# • Performs collaborative filtering given the specified arguments and outputs the predicted ratings of the
# professors for the target user.

# – Notes and Hints:
# ∗ By specifying the target user at the outset, it should simplify the computation time; your function
# should only need to calculate similarity scores for the target user with all other users (rather than
# calculating all pairwise similarity scores)
# ∗ You’ll want to center the user ratings before filling NaN values with 0.
# ∗ For the differences between L2 and Cosine:
# · After calculating all the similarity scores, you should MinMax Scale them, so that the weighted
# average prediction works for both similarity metrics (L2 or Cosine).
# · BUT before scaling, you’ll want to take the negative of the L2 norm, so that smaller (more
# negative) values are less similar (and will map to 0).
# ∗ The .nlargest() function will be helpful for identifying the k largest similarity scores.
# ∗ When calculating the weighted similarity scores, if a similar user has not rated the same item that
# the target user has, simply fill that missing value with the similar user’s mean rating.
# ∗ Some students will not have missing ratings. You should include a check for this in the function
# which exits the function and gives a message pointing this out before running the algorithm.
# ∗ You probably also want a check to see if your target user is even in the data. If the user names are
# the matrix rows, something like:
#     if target_user not in data.index:
#         print(f"Error: {target_user} not found in the dataset.")
#         return None

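def collab_filter(data: pd.DataFrame, target_user: str, similarity_metric: str, k: int):
    """User-user collaborative filtering: predict professor ratings for target_user."""
    # Guard clauses: unknown user, or nothing to predict.
    if target_user not in data.index:
        return f"Error: {target_user} not found in the dataset."

    if data.loc[target_user].isnull().sum() == 0:
        return f"{target_user} has no missing ratings."

    # Center each user's ratings on that user's mean, then treat missing ratings as 0.
    data_centered = data.sub(data.mean(axis=1), axis=0).fillna(0)
    target_user_ratings = data_centered.loc[target_user].values.reshape(1, -1)

    # Compare every user with the target user only (not all pairs of users).
    if similarity_metric == "Cosine":
        similarity_scores = cosine_similarity(data_centered, target_user_ratings).flatten()
    elif similarity_metric == "L2":
        # Negate the distance so that larger values mean "more similar".
        similarity_scores = -np.linalg.norm(data_centered.values - target_user_ratings, axis=1)
    else:
        return "Error: Invalid similarity metric. Use 'Cosine' or 'L2'."

    # Drop the target user's similarity with themselves before scaling.
    sim_df = pd.DataFrame(similarity_scores, index=data.index, columns=["Similarity"])
    sim_df = sim_df.drop(index=target_user)

    # Rescale similarities to [0, 1] so they work as weights for either metric.
    scaler = MinMaxScaler()
    sim_df["Similarity"] = scaler.fit_transform(sim_df[["Similarity"]])

    # Keep the k most similar users; fill each one's missing ratings with their own mean.
    k = min(k, len(sim_df))
    top_k_users = sim_df.nlargest(k, "Similarity")
    filled_ratings = data.loc[top_k_users.index].T.fillna(data.mean(axis=1))

    # Similarity-weighted average of the neighbors' ratings for every professor.
    pred_ratings = filled_ratings.dot(top_k_users["Similarity"]) / top_k_users["Similarity"].sum()

    return pred_ratings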
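
# The negate-then-MinMax hint can be checked on a toy vector. This is a minimal sketch
# using made-up distances (not values from fake_rmp.csv), and it rescales by hand with
# NumPy instead of MinMaxScaler purely to stay self-contained: the closest user should
# map to a weight of 1 and the farthest to 0.

```python
import numpy as np

# Hypothetical L2 distances from a target user to four other users.
distances = np.array([0.5, 2.0, 3.5, 1.0])

# Negate so that a larger value means "more similar".
negated = -distances

# Min-max scale to [0, 1]: (x - min) / (max - min).
scaled = (negated - negated.min()) / (negated.max() - negated.min())

print(scaled)
```

# With these weights, the user at distance 3.5 contributes nothing to the weighted
# average, which is the behavior the hint describes for the least similar user.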

In [5]:
# (b) Test that your function works by verifying:
# • "Student_5", with k = 3 and using Cosine similarity, has a predicted value of approximately 4.15 for Dr.
# Johnson.
# • "Student_3" returns the message that they have no missing ratings.

print(collab_filter(rmp_df, 'Student_5', 'Cosine', 3)['Dr. Johnson'])
print(collab_filter(rmp_df, 'Student_5', 'L2', 3)['Dr. Johnson'])
print(collab_filter(rmp_df, 'Student_3', 'Cosine', 3))
print(collab_filter(rmp_df, 'Student_3', 'L2', 3))

4.149623259275851
4.320137677982919
Student_3 has no missing ratings.
Student_3 has no missing ratings.


In [6]:
# (c) Students 1, 2, and 13 are all considering taking Dr. Gerber (the first professor in the data) next Fall (they
# have not taken him yet). Choose some value of k > 4. Use your function to predict what their ratings would be
# under both the L2 and Cosine similarity metrics, and then decide if there is consensus for each student and
# if so, would you recommend they take Dr. Gerber.

print(collab_filter(rmp_df, 'Student_1', 'Cosine', 5)['Dr. Gerber'])
print(collab_filter(rmp_df, 'Student_1', 'L2', 5)['Dr. Gerber'])
print(collab_filter(rmp_df, 'Student_2', 'Cosine', 5)['Dr. Gerber'])
print(collab_filter(rmp_df, 'Student_2', 'L2', 5)['Dr. Gerber'])
print(collab_filter(rmp_df, 'Student_13', 'Cosine', 5)['Dr. Gerber'])
print(collab_filter(rmp_df, 'Student_13', 'L2', 5)['Dr. Gerber'])

# 4.388437826579327
# 4.205697580213907
# 4.251769420851983
# 4.557496943759391
# 2.867398438491762
# 4.358856705459793

# For Students 1 and 2 there is consensus: the two metrics give similar predictions (4.388 vs 4.206 for Student_1 and 4.252 vs 4.557 for Student_2), all well above the scale midpoint of 3, so I would recommend that both take Dr. Gerber. For Student_13 the predictions disagree (2.867 vs 4.359) and the Cosine prediction falls below 3, so there is no consensus and I cannot recommend Dr. Gerber for this student.

4.388437826579327
4.205697580213907
4.251769420851983
4.557496943759391
2.867398438491762
4.358856705459793
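
# The consensus decision can be written down as an explicit rule. This is only a sketch:
# the helper name consensus(), the midpoint threshold of 3.0, and the maximum gap of 0.5
# are illustrative assumptions, not part of the assignment. The predictions are the
# Dr. Gerber values printed in this cell (k = 5).

```python
def consensus(cosine_pred, l2_pred, threshold=3.0, max_gap=0.5):
    """Return (agree, recommend) for one student's two predictions.

    threshold and max_gap are hypothetical cutoffs chosen for illustration.
    """
    # The metrics "agree" when their predictions are within max_gap of each other.
    agree = abs(cosine_pred - l2_pred) <= max_gap
    # Recommend only when they agree and both predictions exceed the threshold.
    recommend = agree and min(cosine_pred, l2_pred) > threshold
    return agree, recommend


# (Cosine, L2) predictions for Dr. Gerber.
preds = {
    "Student_1": (4.388437826579327, 4.205697580213907),
    "Student_2": (4.251769420851983, 4.557496943759391),
    "Student_13": (2.867398438491762, 4.358856705459793),
}

decisions = {student: consensus(*pair) for student, pair in preds.items()}
for student, (agree, recommend) in decisions.items():
    print(f"{student}: agree={agree}, recommend={recommend}")
```

# A different max_gap or threshold could change the decision, so the cutoffs should be
# stated alongside any recommendation.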


In [7]:
# (3: 15 pts) Given the table from problem (1):
# (a) By hand, apply item-item collaborative filtering with the cosine similarity score to predict the missing
# ratings (on a scale of 1-10). Use the k = 2 most similar items to predict.
# (b) Discuss in a few sentences how you would go about determining which algorithm (User-User or Item-Item)
# is more accurate/produces better predictions.

# Overall, I would determine the more accurate algorithm by hiding a subset of the known ratings, predicting them with each algorithm, and computing the root mean squared error (RMSE) and the mean absolute error (MAE) on those held-out ratings.
# The algorithm with the lower RMSE and MAE would be the more accurate algorithm.
# I would also consider computational cost when determining which algorithm is better.
# While this example has only 6 items and 4 users, a real-world data set could contain thousands of items and thousands of users.
# In such a data set, item-item collaborative filtering would be more computationally efficient when users outnumber items, because there are fewer item pairs to compare.
# Conversely, user-user collaborative filtering would be more computationally efficient when the number of items is greater than the number of users.
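
# A hold-out comparison along these lines might look like the sketch below. Everything in
# it is an assumption for illustration: the ratings matrix is randomly generated, and the
# two predictors are simple mean baselines standing in for the real user-user and
# item-item algorithms, which would be plugged in through the same interface.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 6-user x 5-item ratings matrix on a 1-10 scale (made-up data).
ratings = rng.integers(1, 11, size=(6, 5)).astype(float)

# Hide a few known ratings; these (user, item) cells act as the test set.
held_out = [(0, 1), (2, 3), (4, 0), (5, 4)]
train = ratings.copy()
for u, i in held_out:
    train[u, i] = np.nan


def predict_user_mean(train, u, i):
    # Stand-in for a user-user predictor: the user's mean observed rating.
    return np.nanmean(train[u, :])


def predict_item_mean(train, u, i):
    # Stand-in for an item-item predictor: the item's mean observed rating.
    return np.nanmean(train[:, i])


def rmse_mae(predict, train, ratings, held_out):
    # Prediction error on each hidden cell, summarized as (RMSE, MAE).
    errors = np.array([predict(train, u, i) - ratings[u, i] for u, i in held_out])
    return np.sqrt(np.mean(errors ** 2)), np.mean(np.abs(errors))


results = {
    "user": rmse_mae(predict_user_mean, train, ratings, held_out),
    "item": rmse_mae(predict_item_mean, train, ratings, held_out),
}
for name, (rmse, mae) in results.items():
    print(f"{name}: RMSE={rmse:.3f}, MAE={mae:.3f}")
```

# Repeating this over several random hold-out sets would give a steadier comparison than
# a single split.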

In [8]:
# (4: 30 pts)
# (a) In Python, update your collab_filter() function from problem (2) by adding an additional argument type
# which can be either "User" or "Item", where "User" runs the user-user collaborative filtering algorithm from (2),
# while "Item" runs the item-item collaborative filtering algorithm. Recall that for item-item CF:
# • We center the data based on the items (so you are calculating the mean of the professors' ratings now and
# centering by column instead of by row).
# • We are calculating pairwise similarity scores for all items/professors, while ignoring any users who have
# not rated them. You might want to write a helper function that does this to keep your collab_filter()
# function from getting too messy...
# • When calculating the weighted similarity scores, if the target user has not rated one of the most similar items,
# simply fill that missing value with the similar item’s mean.

def item_similarity(data_centered, similarity_metric):
    """Pairwise item (column) similarities, computed over users with a value for both items."""
    n_items = data_centered.shape[1]
    sim_matrix = np.zeros((n_items, n_items))

    # The matrix is symmetric, so only the upper triangle is computed.
    for i in range(n_items):
        for j in range(i + 1, n_items):
            itemA = data_centered.iloc[:, i]
            itemB = data_centered.iloc[:, j]
            # Users whose entries for both items are non-NaN.
            shared = ~np.isnan(itemA) & ~np.isnan(itemB)

            if shared.sum() == 0:
                sim = 0
            else:
                if similarity_metric == "Cosine":
                    sim = cosine_similarity(itemA[shared].values.reshape(1, -1),
                                            itemB[shared].values.reshape(1, -1))[0, 0]
                elif similarity_metric == "L2":
                    # Negated distance, so that larger values mean "more similar".
                    sim = -np.linalg.norm(itemA[shared].values - itemB[shared].values)
                else:
                    raise ValueError("Invalid similarity metric. Requires 'Cosine' or 'L2'.")

            sim_matrix[i, j] = sim
            sim_matrix[j, i] = sim

    # Each item is treated as perfectly similar to itself.
    np.fill_diagonal(sim_matrix, 1)
    return pd.DataFrame(sim_matrix, index=data_centered.columns, columns=data_centered.columns)


In [9]:
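# `type` shadows the Python built-in, but it is the argument name the assignment specifies.
def collab_filter_user_item(data: pd.DataFrame, target_user: str, similarity_metric: str, k: int, type: str = "User"):
    if type == "User":
        # Delegate to the user-user implementation from problem (2).
        return collab_filter(data, target_user, similarity_metric, k)

    elif type == "Item":
        if target_user not in data.index:
            return f"Error: {target_user} not found in the dataset."

        if data.loc[target_user].isnull().sum() == 0:
            return f"{target_user} has no missing ratings."

        # Center each professor's ratings on that professor's mean.
        # Note: fillna(0) removes every NaN here, so the shared-rater mask inside
        # item_similarity() is always all True and missing ratings enter as 0.
        data_centered = data.sub(data.mean(axis=0), axis=1).fillna(0)

        sim_df = item_similarity(data_centered, similarity_metric)
        k = min(k, len(sim_df))

        # Professors the target user has not rated yet.
        target_ratings = data.loc[target_user]
        missing_items = target_ratings[target_ratings.isna()].index

        preds = {}
        for item in missing_items:
            # The k professors most similar to the one being predicted.
            similar_items = sim_df[item].drop(index=item).dropna()
            top_k_similar = similar_items.nlargest(min(k, len(similar_items)))

            if top_k_similar.empty or top_k_similar.sum() == 0:
                # No usable neighbors: fall back to the professor's mean rating.
                preds[item] = data.mean(axis=0)[item]
            else:
                # Target user's ratings of the similar professors, with gaps filled by item means.
                filled_ratings = data[top_k_similar.index].loc[target_user].fillna(data.mean(axis=0)[top_k_similar.index])
                preds[item] = np.dot(filled_ratings, top_k_similar) / top_k_similar.sum()

        return pd.Series(preds)

    else:
        return "Error: Invalid type. Use 'User' or 'Item'."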

In [10]:
# (b) Test that your function works for item-item CF by verifying that "Student_5", with k = 2 and using Cosine
# similarity with type = "Item", has a predicted value of approximately 4.47 for Dr. Johnson.

print(collab_filter_user_item(rmp_df, 'Student_5', 'Cosine', 2, 'Item')['Dr. Johnson'])

4.402834432170603


In [11]:

# (c) Repeat problem (2: c) but with item-item collaborative filtering and discuss if there are any differences.
# CF Problem Ideation (15 points)
# (c) Students 1, 2, and 13 are all considering taking Dr. Gerber (the first professor in the data) next Fall (they
# have not taken him yet). Choose some value of k > 4. Use your function to predict what their ratings would be
# under both the L2 and Cosine similarity metrics, and then decide if there is consensus for each student and
# if so, would you recommend they take Dr. Gerber.

print(collab_filter_user_item(rmp_df, 'Student_1', 'Cosine', 5, 'User')['Dr. Gerber'])
print(collab_filter_user_item(rmp_df, 'Student_1', 'L2', 5, 'User')['Dr. Gerber'])
print(collab_filter_user_item(rmp_df, 'Student_2', 'Cosine', 5, 'User')['Dr. Gerber'])
print(collab_filter_user_item(rmp_df, 'Student_2', 'L2', 5, 'User')['Dr. Gerber'])
print(collab_filter_user_item(rmp_df, 'Student_13', 'Cosine', 5, 'User')['Dr. Gerber'])
print(collab_filter_user_item(rmp_df, 'Student_13', 'L2', 5, 'User')['Dr. Gerber'], "\n")


print(collab_filter_user_item(rmp_df, 'Student_1', 'Cosine', 5, 'Item')['Dr. Gerber'])
print(collab_filter_user_item(rmp_df, 'Student_1', 'L2', 5, 'Item')['Dr. Gerber'])
print(collab_filter_user_item(rmp_df, 'Student_2', 'Cosine', 5, 'Item')['Dr. Gerber'])
print(collab_filter_user_item(rmp_df, 'Student_2', 'L2', 5, 'Item')['Dr. Gerber'])
print(collab_filter_user_item(rmp_df, 'Student_13', 'Cosine', 5, 'Item')['Dr. Gerber'])
print(collab_filter_user_item(rmp_df, 'Student_13', 'L2', 5, 'Item')['Dr. Gerber'])

# 4.388437826579327
# 4.205697580213907
# 4.251769420851983
# 4.557496943759391
# 2.867398438491762
# 4.358856705459793 

# 3.1213236504783155
# 3.211987076168631
# 3.6148643667643126
# 3.823153706935148
# 3.532088582183623
# 3.593581983313002

# Overall, Item-Item collaborative filtering gives lower predictions than User-User collaborative filtering in five of the six cases; the exception is Student_13 under Cosine (3.532 vs 2.867).
# Under Item-Item, the Cosine and L2 predictions agree closely for all 3 students and are all greater than 3, so both metrics point toward recommending Dr. Gerber, although Student_1's predictions (3.121 and 3.212) are only slightly above 3.
# In each Item-Item case, the L2 metric predicted a slightly higher rating than the Cosine metric.

4.388437826579327
4.205697580213907
4.251769420851983
4.557496943759391
2.867398438491762
4.358856705459793 

3.1213236504783155
3.211987076168631
3.6148643667643126
3.823153706935148
3.532088582183623
3.593581983313002
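
# To make the User-User vs Item-Item comparison easier to read, the twelve printed
# predictions can be lined up and differenced. This sketch copies the values from the
# output above; the dictionary layout and variable names are only one convenient,
# assumed arrangement.

```python
# (Cosine, L2) predictions for Dr. Gerber with k = 5, copied from the cell output.
user_user = {
    "Student_1": (4.388437826579327, 4.205697580213907),
    "Student_2": (4.251769420851983, 4.557496943759391),
    "Student_13": (2.867398438491762, 4.358856705459793),
}
item_item = {
    "Student_1": (3.1213236504783155, 3.211987076168631),
    "Student_2": (3.6148643667643126, 3.823153706935148),
    "Student_13": (3.532088582183623, 3.593581983313002),
}

# Item-Item minus User-User for every (student, metric) pair.
diffs = {}
for student in user_user:
    for idx, metric in enumerate(("Cosine", "L2")):
        diffs[(student, metric)] = item_item[student][idx] - user_user[student][idx]

for (student, metric), diff in diffs.items():
    print(f"{student:<10} {metric:<6} item - user = {diff:+.3f}")

# Number of cases where Item-Item predicts a lower rating than User-User.
n_lower = sum(diff < 0 for diff in diffs.values())
print(n_lower, "of", len(diffs), "predictions are lower under Item-Item")
```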


In [81]:
# (5) Discuss a real data problem and identify a corresponding data set, which you think would be appropriate to
# analyze with Collaborative Filtering (or hybrid CF and Content-Based Filtering). You may think of using
# Collaborative Filtering (either user-user, or item-item) in either recommendation system or non-recommendation system
# settings (you may want to discuss with the TAs or Dr. Gerber how that might work). The data set must, to
# your knowledge, NOT have had Collaborative Filtering applied to it before (the TAs and Dr. Gerber will be
# checking). Provide:
# (a) A working link to the data set or to the website(s) from which the data set may be collected.
# (b) A description of the problem of interest and how the data could be used to answer the problem with
# Collaborative Filtering.
# (c) Your intuition as to why Collaborative Filtering would be appropriate for this problem.

# https://www.kaggle.com/datasets/ofurkancoban/discogs-datasets-january-2025/data

# This dataset contains information about albums and artists on the Discogs platform. 
# Many listeners struggle to find new artists that still match their current preferences. With the Discogs dataset, users could input their favorite artist (assuming that artist exists in the dataset) and receive recommendations for similar artists. The data includes a description of each artist, which could be used to create a content-based filtering system.

# I have previously created a web app project called MarinoBuddy that used cosine similarity to pair students with matching gym interests, and I think this could work similarly. Because the artist description contains biographical information, such as the city and year of birth, it may be necessary to preprocess the data to optimize the search query.
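
# As a rough illustration of matching artists by their descriptions, the sketch below
# compares bag-of-words vectors with cosine similarity. The artist names, the description
# text, and the stop-word list are all made up for the example; the actual Discogs fields
# and a sensible preprocessing pipeline would need to be checked against the dataset.

```python
import math
from collections import Counter

# Hypothetical artist descriptions (not taken from the Discogs data).
profiles = {
    "Artist A": "british rock band formed in london in 1968",
    "Artist B": "rock band from london known for live albums",
    "Artist C": "jazz pianist born in new orleans in 1935",
}

# Small illustrative stop-word list so common words do not dominate the match.
STOP_WORDS = {"in", "from", "for", "the", "a", "of"}


def bag_of_words(text):
    # Lowercase, split on whitespace, and count the remaining words.
    return Counter(w for w in text.lower().split() if w not in STOP_WORDS)


def cosine(a, b):
    # Cosine similarity between two word-count vectors stored as Counters.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(target, profiles):
    # Rank every other artist by description similarity to the target artist.
    target_vec = bag_of_words(profiles[target])
    scores = {
        name: cosine(target_vec, bag_of_words(text))
        for name, text in profiles.items()
        if name != target
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


ranking = most_similar("Artist A", profiles)
print(ranking)
```

# Without the stop-word filter, the repeated word "in" would make Artist C look closer to
# Artist A than Artist B is, which is one concrete reason the preprocessing step matters.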