$$ ITI \space AI-Pro: \space Intake \space 45 $$
$$ Recommender \space Systems $$
$$ Lab \space no. \space 1 $$

# `01` Import Necessary Libraries

## `i` Default Libraries

In [1]:
import numpy as np
import pandas as pd

## `ii` Additional Libraries
Add imports for additional libraries you used throughout the notebook

----------------------------

# `02` Load Data

In [2]:
ratings = pd.read_csv("Data/songsDataset.csv", names=['userID', 'songID', 'rating'], skiprows=[0])
ratings.head()

,userID,songID,rating
0,0,90409,5
1,4,91266,1
2,5,8063,2
3,5,24427,4
4,5,105433,4


---------------------------------

# `03` Similarity Metrics

## `0` Utility Matrix
Construct utility matrix for the loaded data `ratings`
- Users as Index
- Songs as Columns

**Hint**: you can use `pandas.DataFrame.pivot` method (see [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html))

In [3]:
utility_matrix = pd.pivot_table(ratings, index='userID', columns='songID', values='rating')
utility_matrix.head()

songID,2263,2726,3785,8063,12709,13859,16548,17029,19299,19670,...,113954,119103,120147,122065,123176,125557,126757,131048,132189,134732
userID
0,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,2.0,,,,,,,...,,,,,,,,,,
7,,,,,,,,,,,...,,,,,,,,,,3.0
14,,,,,,,,,,,...,,,,,,,,,,
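The hint mentions `DataFrame.pivot`, while the cell above uses `pd.pivot_table`. A minimal sketch of the difference, on hypothetical toy ratings (not taken from `songsDataset.csv`): `pivot_table` aggregates repeated user-song pairs (mean by default), whereas `pivot` raises a `ValueError` when a pair repeats.

```python
import pandas as pd

# Toy ratings (hypothetical values) with one duplicated (userID, songID) pair
toy = pd.DataFrame({
    "userID": [1, 1, 2, 2],
    "songID": [10, 20, 10, 10],
    "rating": [5, 3, 4, 2],
})

# pivot_table aggregates duplicates with the mean: user 2, song 10 -> 3.0
toy_table = pd.pivot_table(toy, index="userID", columns="songID", values="rating")

# pivot refuses to reshape when an index/column pair repeats
try:
    toy.pivot(index="userID", columns="songID", values="rating")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# After dropping duplicates, pivot yields a utility matrix of the same shape
toy_pivot = toy.drop_duplicates(["userID", "songID"]).pivot(
    index="userID", columns="songID", values="rating"
)
```

If the ratings file holds at most one rating per user-song pair, both calls should give the same matrix.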


## `i` Cosine Similarity
Finish implementing the function below to calculate the `Cosine Similarity` between two vectors

In [4]:
def cosine_sim(vec_a, vec_b):
    """
    Returns the raw cosine similarity score between two vectors.

            Parameters:
                vec_a (pandas.Series): Vector A
                vec_b (pandas.Series): Vector B

            Returns:
                sim_score (float): Similarity score between vectors vec_a and vec_b
    """
    vec_a = vec_a.fillna(0)
    vec_b = vec_b.fillna(0)

    n1 = len(vec_a)

    if n1 != len(vec_b):
        raise ValueError("Vectors must be of the same length.")

    dot_product = 0
    m1 = 0
    m2 = 0

    for i in range(n1):
        dot_product += vec_a.values[i] * vec_b.values[i]
        m1 += vec_a.values[i] ** 2
        m2 += vec_b.values[i] ** 2

    if m1 == 0 or m2 == 0:
        return 0.0

    sim_score = dot_product / (m1 ** 0.5 * m2 ** 0.5)

    return sim_score


In [5]:
print(f'Cosine Similarity between userID 56 and userID 227 is: {cosine_sim(utility_matrix.iloc[56].copy(), utility_matrix.iloc[227].copy())}')

Cosine Similarity between userID 56 and userID 227 is: 0.7808688094430304
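The element-wise loop in `cosine_sim` can also be written with NumPy array operations. The sketch below is one possible vectorized form, assuming the same convention of treating missing ratings as 0; the name `cosine_sim_vectorized` and the toy vectors are illustrative only.

```python
import numpy as np
import pandas as pd

def cosine_sim_vectorized(vec_a, vec_b):
    # Treat missing ratings as 0, the same convention cosine_sim uses
    a = vec_a.fillna(0).to_numpy(dtype=float)
    b = vec_b.fillna(0).to_numpy(dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # Return 0.0 when either vector has no ratings at all
    return 0.0 if denom == 0 else float(a @ b / denom)

a = pd.Series([1.0, np.nan, 2.0])
b = pd.Series([2.0, 5.0, 4.0])
sim_ab = cosine_sim_vectorized(a, b)  # 10 / (sqrt(5) * sqrt(45)) = 2/3
```

On a wide utility matrix this avoids a Python-level loop over every song, which is usually the slow part.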


## `ii` Adjusted Cosine Similarity
Finish implementing the function below to calculate the `Adjusted Cosine Similarity` between two vectors

In [6]:
def adjusted_cosine_sim(vec_a, vec_b):
    """
    Returns the cosine similarity between two mean-centered vectors.

    Missing ratings are filled with 0 before each vector's mean is computed,
    so the means are taken over all positions, rated or not.
    """
    if len(vec_a) != len(vec_b):
        raise ValueError("vectors must be of the same length.")

    # Fill missing ratings once and reuse the filled vectors
    vec_a = vec_a.fillna(0)
    vec_b = vec_b.fillna(0)

    adjusted_a = vec_a - vec_a.mean()
    adjusted_b = vec_b - vec_b.mean()

    dot_product = np.dot(adjusted_a, adjusted_b)
    m1 = np.sqrt(np.sum(adjusted_a ** 2))
    m2 = np.sqrt(np.sum(adjusted_b ** 2))

    if m1 == 0 or m2 == 0:
        return 0.0

    sim_score = dot_product / (m1 * m2)

    return sim_score


In [7]:
print(f'Adjusted Cosine Similarity between userID 56 and userID 227 is: {adjusted_cosine_sim(utility_matrix.iloc[56].copy(), utility_matrix.iloc[227].copy())}')

Adjusted Cosine Similarity between userID 56 and userID 227 is: 0.7764278070396684
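In item-based filtering, adjusted cosine is commonly described as subtracting each user's own mean rating, taken over the songs that user actually rated, before comparing two item columns. The sketch below illustrates that variant on a hypothetical toy matrix. The helper `adjusted_cosine_items` is an assumption for illustration, not the lab's reference solution, and it will generally not reproduce the score printed above.

```python
import numpy as np
import pandas as pd

def adjusted_cosine_items(utility, item_a, item_b):
    # Subtract each user's mean rating, computed over rated songs only
    centered = utility.sub(utility.mean(axis=1), axis=0)
    # Unrated entries become 0, i.e. no deviation from the user's mean
    a = centered[item_a].fillna(0).to_numpy()
    b = centered[item_b].fillna(0).to_numpy()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 0.0 if denom == 0 else float(a @ b / denom)

# Toy utility matrix (hypothetical values): users as index, songs as columns
toy_utility = pd.DataFrame(
    {10: [5, 3, np.nan], 20: [4, np.nan, 2], 30: [1, 1, 5]},
    index=pd.Index([1, 2, 3], name="userID"),
)
sim_10_30 = adjusted_cosine_items(toy_utility, 10, 30)
```

Here songs 10 and 30 come out negatively related, because users 1 and 2 rate song 10 above their own average and song 30 below it.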


## `iii` Pearson Correlation Coefficient
Finish implementing the function below to calculate the `Pearson Correlation Coefficient` between two vectors

In [8]:
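def pearson_sim(vec_a, vec_b):
    """
    Returns the Pearson correlation coefficient between two vectors.

    Missing ratings are filled with 0 before the means are computed.
    """
    if len(vec_a) != len(vec_b):
        raise ValueError("vectors must be of the same length.")

    vec_a = vec_a.fillna(0)
    vec_b = vec_b.fillna(0)

    mean_a = vec_a.mean()
    mean_b = vec_b.mean()

    # Unnormalized covariance; the 1/n factors cancel in the final ratio
    covariance = np.sum((vec_a - mean_a) * (vec_b - mean_b))

    # Unnormalized standard deviations (square roots of summed squares)
    std_a = np.sqrt(np.sum((vec_a - mean_a) ** 2))
    std_b = np.sqrt(np.sum((vec_b - mean_b) ** 2))

    if std_a == 0 or std_b == 0:
        return 0.0

    sim_score = covariance / (std_a * std_b)

    return sim_score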

In [9]:
print(f'Pearson Similarity between songID 3785 and songID 17029 is: {pearson_sim(utility_matrix[3785].copy(), utility_matrix[17029].copy())}')

Pearson Similarity between songID 3785 and songID 17029 is: -0.015085785303531213
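Since `pearson_sim` applies the standard Pearson formula to zero-filled vectors, its result should agree with `np.corrcoef` on the same zero-filled inputs up to floating-point error. The sketch below, on hypothetical toy vectors, also contrasts this with `pandas.Series.corr`, which drops positions where either value is missing and can therefore give a quite different number.

```python
import numpy as np
import pandas as pd

# Toy rating vectors (hypothetical values)
x = pd.Series([5.0, np.nan, 3.0, 1.0, np.nan])
y = pd.Series([4.0, 2.0, np.nan, 2.0, np.nan])

# Same convention as pearson_sim: missing ratings become 0
r_filled = np.corrcoef(x.fillna(0).to_numpy(), y.fillna(0).to_numpy())[0, 1]

# pandas' built-in keeps only positions where both values are present,
# here the pairs (5, 4) and (1, 2)
r_pairwise = x.corr(y)
```

With only two co-rated positions the pairwise result is exactly on a line, which is a reminder that correlations over very few common ratings are not very informative.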


## `iv` Mean Squared Difference
Finish implementing the function below to calculate the `Mean Squared Difference` between two vectors

**Note**: Make sure you calculate the difference for common dimensions only (i.e. the dimensions both items/users have non-zero values in)

In [10]:
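def msd_sim(vec_a, vec_b):
    # Labels present in both vectors; for two rows (or two columns) of the
    # same utility matrix this is the full shared index
    common_items = vec_a.index.intersection(vec_b.index)

    if len(common_items) == 0:
        return 0.0

    # np.sum on a pandas Series skips NaN, so only positions rated in
    # both vectors add to the numerator
    diff_square = np.sum((vec_a[common_items] - vec_b[common_items]) ** 2)

    # The denominator is the length of the shared index
    msd = diff_square / len(common_items)

    sim_score = 1 / (1 + msd)

    return sim_score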

In [11]:
print(f'MSD Similarity between userID 56 and userID 227 is: {msd_sim(utility_matrix.iloc[56].copy(), utility_matrix.iloc[227].copy())}')
print(f'MSD Similarity between songID 3785 and songID 17029 is: {msd_sim(utility_matrix[3785].copy(), utility_matrix[17029].copy())}')

MSD Similarity between userID 56 and userID 227 is: 1.0
MSD Similarity between songID 3785 and songID 17029 is: 0.9999258806307558
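The note above asks for the difference over common dimensions only. When both vectors come from the same utility matrix, `vec_a.index.intersection(vec_b.index)` returns the whole index, so dividing by its length also counts positions that neither user rated. A minimal sketch of one way to restrict both the sum and the count to co-rated positions follows; the name `msd_sim_common` and the toy vectors are hypothetical, and this variant will generally not reproduce the scores printed above.

```python
import numpy as np
import pandas as pd

def msd_sim_common(vec_a, vec_b):
    # Positions where both vectors hold a rating
    common = vec_a.notna() & vec_b.notna()
    n_common = int(common.sum())
    if n_common == 0:
        return 0.0
    # Mean squared difference over the co-rated positions only
    msd = float(((vec_a[common] - vec_b[common]) ** 2).sum() / n_common)
    return 1 / (1 + msd)

a = pd.Series([5.0, np.nan, 3.0, np.nan])
b = pd.Series([4.0, 2.0, np.nan, np.nan])
sim_common = msd_sim_common(a, b)  # one co-rated position, MSD = 1 -> 0.5
```

Dividing the same squared difference by the full length of 4 would instead give 1 / (1 + 0.25) = 0.8, so sparse vectors look more similar than their shared ratings justify.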


--------------------------

# `04` Collaborative Filtering

Practice for item-based collaborative filtering

## `0` Utility Matrix
Construct utility matrix for the loaded data `ratings`
- Users as Index
- Songs as Columns

In [12]:
utility_matrix = pd.pivot_table(ratings, index='userID', columns='songID', values='rating')

In [13]:
utility_matrix.head()

songID,2263,2726,3785,8063,12709,13859,16548,17029,19299,19670,...,113954,119103,120147,122065,123176,125557,126757,131048,132189,134732
userID
0,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,2.0,,,,,,,...,,,,,,,,,,
7,,,,,,,,,,,...,,,,,,,,,,3.0
14,,,,,,,,,,,...,,,,,,,,,,


## `i` Item-Item Similarity Matrix

Construct an item-item (Cosine/Adjusted Cosine) similarity matrix from the utility matrix above.

In [14]:
def item_item_cosine_similarity(utility_matrix):
    items = utility_matrix.columns
    # Fill missing ratings once instead of inside the nested loop
    filled = utility_matrix.fillna(0)
    similarity_matrix = pd.DataFrame(index=items, columns=items, dtype=float)

    for item1 in items:
        for item2 in items:
            similarity_matrix.loc[item1, item2] = cosine_sim(filled[item1], filled[item2])

    return similarity_matrix

sim_mat = item_item_cosine_similarity(utility_matrix)


In [15]:
sim_df = sim_mat.copy()
sim_df.head()

songID,2263,2726,3785,8063,12709,13859,16548,17029,19299,19670,...,113954,119103,120147,122065,123176,125557,126757,131048,132189,134732
songID
2263,1.0,0.019204,0.002648,0.007362,0.002131,0.010461,0.001127,0.013909,0.004832,0.00313,...,0.014558,0.011182,0.011221,0.017239,0.003774,0.004995,0.010107,0.007416,0.003489,0.011051
2726,0.019204,1.0,0.006077,0.01569,0.005398,0.015407,0.003485,0.037177,0.006026,0.003684,...,0.023848,0.030855,0.036434,0.008127,0.009455,0.01278,0.01328,0.010879,0.000748,0.018534
3785,0.002648,0.006077,1.0,0.021914,0.01454,0.013389,0.010612,0.004021,0.006226,0.003725,...,0.003643,0.008918,0.002273,0.00403,0.015357,0.006775,0.006063,0.010293,0.010658,0.016858
8063,0.007362,0.01569,0.021914,1.0,0.016481,0.022418,0.011467,0.016541,0.030483,0.00557,...,0.003372,0.018185,0.006622,0.006871,0.020601,0.013036,0.01158,0.014453,0.007626,0.029462
12709,0.002131,0.005398,0.01454,0.016481,1.0,0.008663,0.005604,0.01374,0.016231,0.011711,...,0.003969,0.004713,0.010613,0.01116,0.00751,0.005068,0.002381,0.008103,0.017943,0.014836
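The nested loop computes every pair of songs one at a time. As a sketch of an alternative, assuming the same convention of filling missing ratings with 0, the whole cosine matrix can be obtained from one matrix product. The function name `item_item_cosine_matrix` and the toy matrix are illustrative only.

```python
import numpy as np
import pandas as pd

def item_item_cosine_matrix(utility):
    # Rows are users, columns are items; missing ratings count as 0
    m = utility.fillna(0).to_numpy(dtype=float)
    norms = np.linalg.norm(m, axis=0)
    # Avoid division by zero for items nobody rated (their row stays 0)
    safe = np.where(norms == 0, 1.0, norms)
    sims = (m.T @ m) / np.outer(safe, safe)
    return pd.DataFrame(sims, index=utility.columns, columns=utility.columns)

# Toy utility matrix (hypothetical values): users as index, songs as columns
toy_utility = pd.DataFrame(
    {10: [5, 3, np.nan], 20: [4, np.nan, 2], 30: [1, 1, 5]},
    index=pd.Index([1, 2, 3], name="userID"),
)
toy_sims = item_item_cosine_matrix(toy_utility)
```

The result is symmetric with ones on the diagonal for every rated item, which is a quick sanity check for any item-item matrix.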


## `ii` Candidate Generation and Filtering

Filter out the items that user 199988 has already rated from the similarity matrix above.

In [32]:
user_id = 199988

# Get items the user has rated
rated_items = utility_matrix.loc[user_id].dropna().index

rated_items

Index([2726, 19299, 43267, 56660], dtype='int64', name='songID')

In [35]:
# Get all items
all_items = utility_matrix.columns

# Find items the user has not rated
potential_items = all_items.difference(rated_items)

# Keep rows for rated items and columns for unrated candidates
filtered_sim_df = sim_mat.loc[rated_items, potential_items]

In [36]:
filtered_sim_df

songID,2263,3785,8063,12709,13859,16548,17029,19670,22763,24427,...,113954,119103,120147,122065,123176,125557,126757,131048,132189,134732
songID
2726,0.019204,0.006077,0.01569,0.005398,0.015407,0.003485,0.037177,0.003684,0.006715,0.009673,...,0.023848,0.030855,0.036434,0.008127,0.009455,0.01278,0.01328,0.010879,0.000748,0.018534
19299,0.004832,0.006226,0.030483,0.016231,0.019555,0.021715,0.009046,0.015246,0.030448,0.024521,...,0.003317,0.013556,0.004492,0.018581,0.037828,0.005302,0.004005,0.019138,0.008508,0.029209
43267,0.012299,0.010692,0.012279,0.004785,0.01074,0.006431,0.027851,0.002085,0.007528,0.015044,...,0.013484,0.034512,0.037962,0.005459,0.009854,0.012405,0.017465,0.009158,0.001073,0.026027
56660,0.005823,0.011976,0.012521,0.02241,0.008185,0.002408,0.012608,0.010788,0.016244,0.015606,...,0.001654,0.011231,0.008555,0.009993,0.019749,0.008598,0.006939,0.010053,0.021829,0.022804


## `iii` Top-K Candidate Selection

Select the top-K (a K of your choice) most similar items for each item that user 199988 rated, using the filtered similarity matrix above.

In [42]:
k = 5

top_k_candidates = {}

for rated_item in rated_items:
    # Rated items are the rows of filtered_sim_df; candidates are its columns
    if rated_item in filtered_sim_df.index:

        # Similarities between this rated item and every unrated candidate
        sims = filtered_sim_df.loc[rated_item].astype(float)

        top_k = sims.sort_values(ascending=False).head(k)

        top_k_candidates[rated_item] = top_k
    else:
        print(f"Warning: Item '{rated_item}' not found in similarity DataFrame.")

for item, candidates in top_k_candidates.items():
    print(f"Rated item {item} -> Top-{k} candidates:")
    print(candidates)
    print("-" * 30)
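The filtering and top-K steps can be checked on a small example where the answer is easy to see by eye. The sketch below uses a hypothetical 4-song similarity matrix and `Series.nlargest` as a compact alternative to sorting. All names prefixed with `toy_` are illustrative.

```python
import pandas as pd

# Toy similarity matrix (hypothetical values): rows and columns are song IDs
toy_sims = pd.DataFrame(
    [[1.0, 0.2, 0.5, 0.1],
     [0.2, 1.0, 0.3, 0.7],
     [0.5, 0.3, 1.0, 0.4],
     [0.1, 0.7, 0.4, 1.0]],
    index=[10, 20, 30, 40], columns=[10, 20, 30, 40],
)
toy_rated = pd.Index([10, 20])
toy_candidates = toy_sims.columns.difference(toy_rated)

# Rows: rated songs, columns: unrated candidates
toy_filtered = toy_sims.loc[toy_rated, toy_candidates]

toy_k = 1
# For each rated song, keep its toy_k most similar unrated songs
toy_top_k = {song: row.nlargest(toy_k) for song, row in toy_filtered.iterrows()}
```

Song 10 is closest to song 30 (0.5) and song 20 is closest to song 40 (0.7), so those are the two candidates selected.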




## `iv` Candidate Rating Prediction

Calculate the predicted rating for each of the candidate items.

candidate,predicted_rating,ref_1,ref_1_similarity,ref_1_rating,ref_2,ref_2_similarity,ref_2_rating
45026,3.0,43267,0.010135,3,,,
86341,5.0,2726,0.009534,5,,,
17029,4.279497,2726,0.01324,5,43267.0,0.007456,3.0
12709,5.0,56660,0.004105,5,,,
40712,5.0,2726,0.012574,5,,,
123176,5.0,19299,0.014827,5,,,
90409,5.0,56660,0.020505,5,,,
134732,5.0,19299,0.00685,5,,,
60465,5.0,56660,0.003673,5,,,
120147,3.844362,2726,0.013797,5,43267.0,0.018883,3.0


------------------------------------------------------

# `05` Additional Tasks

## `i` Explore Surprise Library

- Install Scikit Surprise library.
- Explore the Library Documentation

## `ii` Implement Item-Based KNN Approach [Bonus]

- Follow the steps explained in the sessions to prepare the KNN approach.
- Generate prediction ratings for user $199988$ on all songs.

----------------------------------------------

$$ Wish \space you \space all \space the \space best \space ♡ $$
$$ Abdelrahman \space Eid $$