$$ ITI \space AI-Pro: \space Intake \space 43 $$
$$ Recommender \space Systems $$
$$ Lab \space no. \space 2 $$

# `01` Import Necessary Libraries

## `i` Default Libraries

In [2]:
import numpy as np
import pandas as pd
from surprise.reader import Reader
from surprise.dataset import Dataset
from surprise.model_selection import train_test_split
from surprise.prediction_algorithms.knns import KNNWithMeans

## `ii` Additional Libraries
Add imports for additional libraries you used throughout the notebook

----------------------------

# `02` Load Data

 The dataset will have the following columns : 
   - song_id (String) : Unique identified for the song
   - user_id (String) : Unique identifier for the user
   - song_genre (Integer) : An integer representing a genre for the song, 
                              value is between 1 and 5, indicating that 
                              there are 5 unique genres. Each song can only
                              have 1 genre
   - artist_id (String) : Unique identifier for the author of the song
   - n_listen (Integer) : The number of times this user has heard the song (0 -> 15)
   - publish_year (Integer) : The year of song publishing

In [2]:
data = pd.read_csv("Data/songs_data.csv")
data.head()

Unnamed: 0,song_id,artist_id,song_genre,user_id,n_listen,publish_year
0,537,368,4,2066,13,2002
1,921,107,1,1179,5,2006
2,352,188,1,1468,11,2013
3,853,370,4,460,9,2020
4,479,408,2,1125,3,2020


--------------------------

# `03` Content-based Filtering

Practice for content-based filtering on dummy data

## `i` Feature Engineering/Selection
Construct the item vector representation matrix from the `data` above

In [82]:
# make a copy of the dataframe
item_matrix = data.copy(deep=True)

# Get unique genres,artists,songs and sort them
unique_genres = sorted(item_matrix['song_genre'].unique())
unique_artists = sorted(item_matrix['artist_id'].unique())
unique_songs = sorted(item_matrix['song_id'].unique())

# Construct binary vector representations for song_genre
genre_vectors = {}
for song_id,genre in zip(item_matrix['song_id'].values,item_matrix['song_genre'].values):
    genre_vector = [int(genre == g) for g in unique_genres]
    genre_vectors[song_id] = genre_vector 

# Calculate total number of listens for each song and do min max scaling
total_listens = item_matrix.groupby('song_id').agg({'n_listen':'sum','artist_id':'last','song_genre':'last','publish_year':'last'})
total_listens['publish_year'] = total_listens['publish_year'].apply(lambda x: x - 2000)
total_listens['n_listen'] = (total_listens['n_listen'] - total_listens['n_listen'].min())/(total_listens['n_listen'].max() - total_listens['n_listen'].min())
#total_listens.set_index('song_id',inplace=True)

# Concatenate genre_df with genre vectors
genre_df = pd.DataFrame(genre_vectors.values(),index=genre_vectors.keys(), columns=[f'song_genre_{g}' for g in unique_genres])

item_df = pd.concat([total_listens, genre_df], axis=1)
item_df.drop('song_genre',axis=1,inplace=True)
item_df.drop('artist_id',axis=1,inplace=True)

In [83]:
item_df

Unnamed: 0,n_listen,publish_year,song_genre_1,song_genre_2,song_genre_3,song_genre_4,song_genre_5
1,0.669333,2,0,1,0,0,0
2,0.498667,16,0,0,1,0,0
3,0.518222,15,0,0,0,1,0
4,0.512000,4,0,1,0,0,0
5,0.760000,4,0,0,0,0,1
...,...,...,...,...,...,...,...
996,0.488889,16,1,0,0,0,0
997,0.241778,14,1,0,0,0,0
998,0.603556,19,0,0,1,0,0
999,0.419556,16,0,0,0,0,1


## `ii` Utility Matrix
Construct utility matrix for the loaded dataframe `data`

In [16]:
utility_matrix = data.pivot(index='user_id', columns='song_id', values='n_listen')
utility_matrix = utility_matrix.fillna(0)
utility_matrix.head()

song_id,1,2,3,4,5,6,7,8,9,10,...,991,992,993,994,995,996,997,998,999,1000
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,0.0,0.0,0.0,6.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0
2,15.0,0.0,6.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,8.0,0.0,0.0,0.0,0.0,11.0,0.0,6.0
3,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,...,11.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,11.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,13.0


## `iii` Item-Item Similarity Matrix

Construct item-item (Cosine/Adjusted Cosine) similarity matrix.

In [55]:
def adjusted_cosine_sim(vec_a, vec_b):
    """
    Returns the raw cosine similarity score between two vectors.

            Parameters:
                vec_a (pandas.Series): Vector A
                vec_b (pandas.Series): Vector B

            Returns:
                sim_score (float): Similarity score between vectors vec_a and vec_b 
    """
    vec_a = vec_a.values
    vec_b = vec_b.values
    
    # Get the lengths of the two pandas.Series.
    n1 = len(vec_a)
    n2 = len(vec_b)

    # Check if the two pandas.Series have the same length.
    if n1 != n2:
        raise ValueError("The two pandas.Series must have the same length.")
    
    # Calculate the mean of each vector.
    mean_a = vec_a.mean()
    mean_b = vec_b.mean()

    #calculate the normalized vector
    normalized_a = vec_a - mean_a
    normalized_b = vec_b - mean_b
    
    # Calculate the dot product of the two pandas.Series.
    dot_product = 0
    for i in range(n1):
        dot_product += normalized_a[i] * normalized_b[i]

    # Calculate the magnitudes of the two pandas.Series.
    magnitude1 = 0
    magnitude2 = 0
    for i in range(n1):
        magnitude1 += normalized_a[i] ** 2
        magnitude2 += normalized_b[i] ** 2

    # Calculate the adjusted cosine similarity.
    sim_score = dot_product / (magnitude1 ** 0.5 * magnitude2 ** 0.5)
    #sim_score = cosine_sim(normalized_a, normalized_b)
    return sim_score

In [88]:
similarity_matrix = np.array([[None]*1000]*1000)
for i in range(1000):
    for j in range(i,1000):
        sim_mat[i][j] = adjusted_cosine_sim(item_df.iloc[i,:], item_df.iloc[j,:])
        sim_mat[j][i] = sim_mat[i][j]

similarity_matrix_df = pd.DataFrame(sim_mat, index=item_df.index, columns=item_df.index)

In [89]:
similarity_matrix_df

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,991,992,993,994,995,996,997,998,999,1000
1,1.0,0.84431,0.844136,0.956247,0.816056,-0.264803,0.840046,0.705671,0.985168,0.453524,...,0.874654,0.838256,0.874151,0.876967,0.843916,0.844181,0.839362,0.846472,0.843256,0.845215
2,0.84431,1.0,0.994987,0.960996,0.954652,-0.240689,0.992749,0.903378,0.8694,0.621991,...,0.99598,0.992761,0.996331,0.996012,0.99995,0.995317,0.994493,0.999949,0.995308,0.99601
3,0.844136,0.994987,1.0,0.960638,0.954458,-0.241107,0.992402,0.866317,0.868781,0.684161,...,0.995638,0.992393,0.996013,0.995697,0.995455,0.994986,0.994115,0.995697,0.994966,0.999885
4,0.956247,0.960996,0.960638,1.0,0.920318,-0.266163,0.957438,0.821282,0.970065,0.566835,...,0.977551,0.956934,0.977061,0.97825,0.961345,0.960959,0.95883,0.962283,0.960687,0.961922
5,0.816056,0.954652,0.954458,0.920318,1.0,-0.252328,0.951378,0.820204,0.81895,0.563766,...,0.954656,0.978196,0.956157,0.956002,0.954539,0.954576,0.951492,0.956099,0.973114,0.955359
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
996,0.844181,0.995317,0.994986,0.960959,0.954576,-0.167846,0.999588,0.866816,0.869375,0.621981,...,0.995985,0.992764,0.996332,0.996011,0.9958,1.0,0.999849,0.996008,0.995311,0.996012
997,0.839362,0.994493,0.994115,0.95883,0.951492,-0.16241,0.999608,0.864092,0.867105,0.619605,...,0.995316,0.991945,0.995561,0.995182,0.995101,0.999849,1.0,0.995167,0.994562,0.995261
998,0.846472,0.999949,0.995697,0.962283,0.956099,-0.237992,0.99352,0.899399,0.871143,0.624258,...,0.996627,0.993511,0.996985,0.996686,0.999948,0.996008,0.995167,1.0,0.995988,0.996671
999,0.843256,0.995308,0.994966,0.960687,0.973114,-0.241657,0.992719,0.8665,0.869189,0.621907,...,0.996015,0.999602,0.996329,0.995994,0.995822,0.995311,0.994562,0.995988,1.0,0.996018


## `iv` Top-K Candidate Generation

Selet top-K (a k of your choice) similar items for each item (a user of your choice) rated from the similarity matrix above.

In [99]:
utility_matrix

song_id,1,2,3,4,5,6,7,8,9,10,...,991,992,993,994,995,996,997,998,999,1000
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,0.0,0.0,0.0,6.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0
2,15.0,0.0,6.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,8.0,0.0,0.0,0.0,0.0,11.0,0.0,6.0
3,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,...,11.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,11.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,13.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2996,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2997,0.0,9.0,0.0,0.0,0.0,0.0,10.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,9.0,0.0,0.0,0.0,0.0,0.0
2998,0.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,...,0.0,0.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2999,0.0,5.0,0.0,9.0,0.0,3.0,0.0,6.0,0.0,0.0,...,14.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [121]:
user_number = 10
rated_items_index = utility_matrix.iloc[user_number][utility_matrix.iloc[user_number] != 0].index
unrated_items_index = utility_matrix.iloc[user_number][utility_matrix.iloc[user_number] == 0].index
rated_similarity_df = similarity_matrix_df.iloc[rated_items_index][unrated_items_index]
rated_similarity_df

Unnamed: 0,1,2,3,4,5,7,8,9,10,11,...,991,992,993,994,995,996,997,998,999,1000
7,0.840046,0.992749,0.992402,0.957438,0.951378,1.0,0.861932,0.864344,0.615262,0.986484,...,0.993442,0.989982,0.993859,0.993519,0.993244,0.999588,0.999608,0.99352,0.992719,0.993495
16,0.886953,0.994523,0.994215,0.982342,0.954923,0.991944,0.866058,0.910012,0.620051,0.961637,...,0.999612,0.991857,0.999634,0.999775,0.994932,0.994516,0.993497,0.995267,0.994456,0.995198
18,0.843736,0.993387,0.993077,0.959087,0.953731,0.999911,0.864075,0.866153,0.61722,0.985562,...,0.993964,0.990624,0.994456,0.994159,0.99379,0.999653,0.999441,0.994169,0.993302,0.994078
27,0.841414,0.999804,0.995288,0.960526,0.953004,0.993058,0.898901,0.869664,0.623169,0.96239,...,0.996467,0.993228,0.996671,0.996303,0.999936,0.995669,0.995139,0.999775,0.99574,0.996397
32,0.881802,0.995301,0.994985,0.980328,0.95524,0.992754,0.86719,0.906048,0.622058,0.962563,...,0.999858,0.992725,0.999872,0.999947,0.995745,0.995298,0.994402,0.996007,0.995268,0.995978
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
934,0.846867,0.995538,0.995237,0.961953,0.956204,0.99948,0.868105,0.870363,0.622948,0.981018,...,0.99611,0.992968,0.996527,0.996241,0.995943,0.999944,0.999611,0.996246,0.995483,0.996191
972,0.299286,-0.227698,-0.228439,0.041164,-0.255293,-0.233752,-0.297004,0.279792,-0.321924,-0.263823,...,-0.163585,-0.234424,-0.167783,-0.162647,-0.226686,-0.227747,-0.230938,-0.225241,-0.228089,-0.225716
975,0.877119,0.995548,0.995198,0.978696,0.953862,0.992965,0.86677,0.903822,0.622652,0.962456,...,0.999981,0.993059,0.999906,0.999918,0.99609,0.995554,0.994883,0.996208,0.995587,0.996265
979,0.845491,0.995817,0.995496,0.961813,0.955476,0.999373,0.868033,0.87057,0.623568,0.980083,...,0.996463,0.993309,0.996807,0.996496,0.996284,0.999976,0.999754,0.996494,0.995808,0.996494


## `v` Candidate Filtering

Filter out items (your user) has rated from the candidates above.

In [123]:
k = 3
topk = rated_similarity_df.apply(lambda row: row.sort_values(ascending=False).head(k), axis=1)
topk = topk.fillna(0)
topk

Unnamed: 0,7,10,11,13,16,18,19,20,22,24,...,975,976,979,981,986,987,989,990,991,993
7,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
16,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
18,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
27,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
32,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
934,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
972,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
975,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
979,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## `vi` Candidate Rating Prediction

Calculate the predicted rating for each of the candidate items.

In [125]:
ratings = utility_matrix.iloc[user_number][utility_matrix.iloc[user_number] != 0]
def create_new_rating(sim_matrix, song, ratings):
    # Get similar songs
    similar_songs = []
    for idx, sim in enumerate(sim_matrix[song]):
        if sim != 0:
            similar_songs.append((idx, sim))
            
    # Calculate new rating
    numerator = 0
    denominator = 0
    for item, sim in similar_songs:
        numerator += ratings.iloc[item] * sim
        denominator += sim
    new_rating = numerator / denominator
    
    return new_rating

In [129]:
predicted_ratings = {}
""" for user_song in topk.keys():
    for song in topk[user_song].keys():
        if song in predicted_ratings.keys():
            continue
        else:
         predicted_ratings[song] = create_new_rating(topk, song, potential_rating) """
for song in topk.columns:
    predicted_ratings[song] = create_new_rating(topk, song, ratings)

predictions = pd.DataFrame(predicted_ratings.values(), index=predicted_ratings.keys(), columns=['Predicted Ratings'])
predictions

Unnamed: 0,Predicted Ratings
7,12.000000
10,8.000000
11,13.000000
13,4.000000
16,12.000000
...,...
987,6.495554
989,2.000000
990,3.000000
991,11.000000


--------------------------

# `04` KNN Item-based Colaborative Filtering

Practice for Using Scikit Surprise Library

## `i` Data Loading

Load `songsModifiedDataset.csv` file into a dataframe

In [6]:
df = pd.read_csv('Data/songsModifiedDataset.csv')
df.head()

Unnamed: 0,userID,songID,rating
0,0,90409,5
1,4,91266,1
2,5,8063,2
3,5,24427,4
4,5,105433,4


## `ii` Prepare Data

Procedures to Follow:
- Instantiate the Reader Object (see, [Documentation](https://surprise.readthedocs.io/en/stable/reader.html))
- Load the Data into `surprise.dataset.Dataset` (see, [Documentation](https://surprise.readthedocs.io/en/stable/dataset.html))
- Build the full (i.e. without folds) `surprise.Trainset` (see, [Documentation](https://surprise.readthedocs.io/en/stable/trainset.html#:~:text=It%20is%20used%20by%20the%20fit()%20method%20of%20every%20prediction%20algorithm.%20You%20should%20not%20try%20to%20build%20such%20an%20object%20on%20your%20own%20but%20rather%20use%20the%20Dataset.folds()%20method%20or%20the%20DatasetAutoFolds.build_full_trainset()%20method.)) 

In [4]:
reader = Reader() #default value for rating_scale is (1,5)

In [8]:
data = Dataset.load_from_df(df, reader)
data

<surprise.dataset.DatasetAutoFolds at 0x1103564d0>

In [9]:
train_set = data.build_full_trainset()

## `iii` Initialize the `KNNWithMeans` Model

**Note**: `KNNWithMeans` uses the normalized ratings instead of the raw ones. (See [Documentation](https://surprise.readthedocs.io/en/stable/knn_inspired.html#surprise.prediction_algorithms.knns.KNNWithMeans))

**Hint**: Use $k=10$ and configure `sim_options` to be: 
- item_based
- pearson

In [1]:
sim_options = {
    "name": "pearson",
    "user_based": False
}

In [3]:
# default min k is 1
knn_model = KNNWithMeans(k=10,sim_options=sim_options)

## `iv` Fit the Model on Data

In [10]:
knn_model.fit(train_set)

Computing the pearson similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x137f37c10>

## `v` Calculate Predicted Rating $\hat{r}$ for User $199988$

**Hine**: you can use `.predict()` method of the model (see [Documentaion](https://surprise.readthedocs.io/en/stable/getting_started.html?highlight=.predict#train-on-a-whole-trainset-and-the-predict-method:~:text=pred%20%3D%20algo.predict(uid%2C%20iid%2C%20r_ui%3D4%2C%20verbose%3DTrue)))

In [15]:
user_id = 199988

# Find items rated by the user
items_rated_by_user = df.loc[df['userID'] == user_id, 'songID'].unique()

# Find items not rated by the user
items_not_rated_by_user = df.loc[~df['songID'].isin(items_rated_by_user), 'songID'].unique()

# Create the test set for items not rated by the user
test_set = [(user_id, item, 0) for item in items_not_rated_by_user]

In [30]:
predictions = []
for item in test_set:
    prediction = knn_model.predict(*item)
    predictions.append(prediction.est)

In [33]:
song_predictions = pd.DataFrame(items_not_rated_by_user,columns=['songID'])
song_predictions.head()

Unnamed: 0,songID
0,90409
1,91266
2,8063
3,24427
4,105433


In [34]:
song_predictions['predicted_rating'] = predictions
song_predictions.head()

Unnamed: 0,songID,predicted_rating
0,90409,4.808493
1,91266,4.70561
2,8063,4.2398
3,24427,4.549136
4,105433,4.872347


## `vi` Recommend Top 10 Songs

In [36]:
song_predictions_sorted = song_predictions.sort_values('predicted_rating', ascending=False)
song_predictions_sorted.head(10)

Unnamed: 0,songID,predicted_rating
31,40712,5.0
50,122065,5.0
47,132189,5.0
19,71582,5.0
24,52611,5.0
30,60888,5.0
29,62954,5.0
17,112023,4.999623
25,126757,4.983563
27,92881,4.941095


----------------------------------------------

$$ Wish \space you \space all \space the \space best \space ♡ $$
$$ Abdelrahman \space Eid $$