$$ ITI \space AI-Pro: \space Intake \space 45 $$
$$ Recommender \space Systems $$
$$ Lab \space no. \space 2 $$

# `01` Import Necessary Libraries

In [3]:
# %pip install --no-cache-dir --force-reinstall numpy==1.23.5 scipy==1.9.3
# %pip install scikit-surprise==1.1.3

Collecting scikit-surprise==1.1.3
  Using cached scikit-surprise-1.1.3.tar.gz (771 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.3-cp311-cp311-linux_x86_64.whl size=3310845 sha256=ab154e81d1286453483c5fc7cfd3b26321f1ff18fc279842ec2ece3fe490670b
  Stored in directory: /root/.cache/pip/wheels/f4/2b/26/e2a5eae55d3b7688995e66abe7f40473aac6c95ddd8ee174a8
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.3


## `i` Default Libraries

In [4]:
import numpy as np
import pandas as pd
from surprise.reader import Reader
from surprise.dataset import Dataset
from surprise.model_selection import train_test_split
from surprise.prediction_algorithms.knns import KNNWithMeans
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity

----------------------------

# `02` Load Data

 The dataset has the following columns:
   - song_id (String) : Unique identifier for the song
   - user_id (String) : Unique identifier for the user
   - song_genre (Integer) : An integer from 1 to 5 representing the song's genre (5 unique genres). Each song has exactly one genre
   - artist_id (String) : Unique identifier for the artist of the song
   - n_listen (Integer) : The number of times this user has listened to the song (0 to 15)
   - publish_year (Integer) : The year the song was published

In [5]:
data = pd.read_csv("songs_data.csv")
data.head()

Unnamed: 0,song_id,artist_id,song_genre,user_id,n_listen,publish_year
0,537,368,4,2066,13,2002
1,921,107,1,1179,5,2006
2,352,188,1,1468,11,2013
3,853,370,4,460,9,2020
4,479,408,2,1125,3,2020


--------------------------

# `03` Content-based Filtering

Practice for content-based filtering on dummy data

## `i` Feature Engineering/Selection
Construct the item vector representation matrix from the `data` above

In [6]:
item_data = data[['song_id', 'artist_id', 'song_genre', 'publish_year']].drop_duplicates()

In [7]:
categorical_features = ['artist_id', 'song_genre']
encoder = OneHotEncoder(sparse_output=False)
encoded_cat = encoder.fit_transform(item_data[categorical_features])

encoded_feature_names = encoder.get_feature_names_out(categorical_features)

In [8]:
scaler = StandardScaler()
scaled_year = scaler.fit_transform(item_data[['publish_year']])

In [9]:
all_feature_names = list(encoded_feature_names) + ['publish_year_scaled']

item_vectors = np.hstack([encoded_cat, scaled_year])

item_vector_df = pd.DataFrame(item_vectors, index=item_data['song_id'], columns=all_feature_names)

item_vector_df.head()

song_id,artist_id_1,artist_id_2,artist_id_3,artist_id_4,artist_id_5,artist_id_6,artist_id_8,artist_id_9,artist_id_10,artist_id_12,...,artist_id_497,artist_id_498,artist_id_499,artist_id_500,song_genre_1,song_genre_2,song_genre_3,song_genre_4,song_genre_5,publish_year_scaled
537,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,-1.282489
921,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,-0.653278
352,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.447841
853,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.54896
479,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.54896
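
As a rough illustration of what `publish_year_scaled` holds, the sketch below standardizes a few years by hand. The `years` array is a hypothetical sample rather than the full dataset, so its numbers will not match the table above. It only shows that `StandardScaler` subtracts the mean and divides by the population standard deviation (`ddof=0`), which is also NumPy's default for `std`.

```python
import numpy as np

# Hypothetical sample of publish years (not the full dataset).
years = np.array([2002, 2006, 2013, 2020, 2020], dtype=float)

# Standardize: zero mean, unit (population) standard deviation.
scaled = (years - years.mean()) / years.std()

print(scaled)
```

Because the one-hot columns only take the values 0 and 1 while the scaled year can exceed 1 in magnitude, the year column can carry a lot of weight in any similarity computed on these vectors.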


## `ii` Utility Matrix
Construct utility matrix for the loaded dataframe `data`

In [10]:
utility_matrix = data.pivot_table(
    index='song_id',
    columns='user_id',
    values='n_listen',
    fill_value=0
)

utility_matrix.head()

song_id \ user_id,1,2,3,4,5,6,7,8,9,10,...,2991,2992,2993,2994,2995,2996,2997,2998,2999,3000
1,0.0,15.0,6.0,0.0,0.0,0.0,0.0,0.0,7.0,0.0,...,0.0,0.0,0.0,11.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,8.0,0.0,0.0,0.0,...,0.0,0.0,0.0,6.0,0.0,0.0,9.0,8.0,5.0,0.0
3,0.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,6.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,4.0,0.0,0.0,2.0,0.0,0.0,9.0,1.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.0,0.0,0.0,...,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0
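
Two details of `pivot_table` are easy to miss here: duplicate (song, user) pairs are aggregated with the mean by default, and `fill_value=0` makes "never listened" look the same as a listen count of 0. A minimal sketch on a made-up frame (the `toy` values are hypothetical and not taken from `songs_data.csv`):

```python
import pandas as pd

# Hypothetical toy interactions; user 10 appears twice for song 1.
toy = pd.DataFrame({
    "song_id":  [1, 1, 1, 2],
    "user_id":  [10, 10, 20, 10],
    "n_listen": [4, 6, 3, 5],
})

toy_utility = toy.pivot_table(
    index="song_id",
    columns="user_id",
    values="n_listen",
    fill_value=0,  # (song, user) pairs with no row become 0
)                  # default aggfunc is the mean

print(toy_utility)
```

If the real data contains repeated pairs and a total count is wanted instead of an average, passing `aggfunc="sum"` would be one option.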


## `iii` Item-Item Similarity Matrix

Construct item-item (Cosine/Adjusted Cosine) similarity matrix.

In [11]:
cosine_sim_matrix = pd.DataFrame(
    cosine_similarity(item_vector_df),
    index=item_vector_df.index,
    columns=item_vector_df.index
)

cosine_sim_matrix.head()

song_id,537,921,352,853,479,759,146,307,776,137,...,815,324,648,670,657,910,490,319,889,234
537,1.0,0.28171,-0.202803,-0.246367,-0.496098,-0.264272,0.502932,0.379406,0.334053,-0.471177,...,-0.496098,-0.441736,-0.441736,0.45127,0.725635,0.222318,-0.202803,0.73879,0.28171,-0.441736
921,0.28171,1.0,0.30613,-0.309694,-0.309694,-0.164974,0.614877,0.236848,0.208536,0.029396,...,-0.309694,-0.275758,-0.275758,0.61795,0.28171,0.567117,-0.126602,0.299176,0.58793,-0.275758
352,-0.202803,0.30613,1.0,0.22295,0.22295,0.118765,0.089984,-0.170507,-0.150125,0.551505,...,0.22295,0.198519,0.557638,0.150297,-0.202803,0.349899,0.091141,-0.215377,0.30613,0.198519
853,-0.246367,-0.309694,0.22295,1.0,0.54538,0.290524,-0.552892,-0.417096,-0.367237,0.517983,...,0.54538,0.485617,0.485617,-0.496098,-0.246367,-0.244403,0.22295,-0.290617,-0.309694,0.485617
479,-0.496098,-0.309694,0.22295,0.54538,1.0,0.600469,-0.552892,-0.417096,-0.367237,0.517983,...,0.54538,0.485617,0.739606,-0.496098,-0.496098,-0.244403,0.544347,-0.526856,-0.309694,0.739606
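
The cell above computes plain cosine similarity on the content feature vectors. Adjusted cosine is instead defined on ratings: each rating is centered by its user's mean before the cosine is taken over the users who rated both items. The sketch below is one possible way to write it, assuming an items x users matrix `R` in which 0 means "not rated"; the function name `adjusted_cosine` and the toy matrix are hypothetical.

```python
import numpy as np

def adjusted_cosine(R, i, j):
    """Adjusted cosine between item rows i and j of R (items x users).

    Assumption of this sketch: a 0 in R means the user did not rate the item.
    """
    rated = R > 0
    counts = rated.sum(axis=0)
    # Mean of each user's observed ratings; users with no ratings get 0.
    user_mean = np.divide(
        R.sum(axis=0), counts, out=np.zeros(R.shape[1]), where=counts > 0
    )
    both = rated[i] & rated[j]  # users who rated both items
    if not both.any():
        return 0.0
    di = R[i, both] - user_mean[both]
    dj = R[j, both] - user_mean[both]
    denom = np.linalg.norm(di) * np.linalg.norm(dj)
    return float(di @ dj / denom) if denom > 0 else 0.0

# Hypothetical toy ratings: 3 items (rows) x 3 users (columns).
R = np.array([
    [5, 3, 0],
    [4, 2, 1],
    [1, 5, 4],
], dtype=float)

sim_01 = adjusted_cosine(R, 0, 1)
print(sim_01)
```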


## `iv` Top-K Candidate Generation

Select the top-K (with a K of your choice) most similar items for each item rated by a user of your choice, using the similarity matrix above.

In [12]:
target_user = 99
K = 10

user_songs = utility_matrix[utility_matrix[target_user] > 0].index

top_k_similar = {}

for song_id in user_songs:
    similar_scores = cosine_sim_matrix.loc[song_id].drop(labels=[song_id])
    top_k = similar_scores.sort_values(ascending=False).head(K)
    top_k_similar[song_id] = top_k


for song, sim_songs in top_k_similar.items():
    print(f"\nTop {K} songs similar to song {song}:")
    print(sim_songs)



Top 10 songs similar to song 3:
song_id
36     0.957717
68     0.920856
637    0.647207
433    0.647207
403    0.647207
239    0.647207
896    0.647207
483    0.647207
497    0.647207
176    0.647207
Name: 3, dtype: float64

Top 10 songs similar to song 4:
song_id
974    0.934079
964    0.696383
972    0.696383
346    0.696383
76     0.696383
914    0.696383
861    0.696383
281    0.692067
730    0.692067
523    0.692067
Name: 4, dtype: float64

Top 10 songs similar to song 5:
song_id
21     0.696383
804    0.696383
142    0.696383
775    0.696383
204    0.696383
661    0.696383
816    0.696383
412    0.696383
30     0.696383
148    0.692067
Name: 5, dtype: float64

Top 10 songs similar to song 6:
song_id
718    0.780254
327    0.780254
254    0.780254
645    0.780254
165    0.780254
162    0.780254
593    0.780254
591    0.780254
944    0.780254
920    0.780254
Name: 6, dtype: float64

Top 10 songs similar to song 9:
song_id
972    0.748473
861    0.748473
346    0.748473
964    0.74

## `v` Candidate Filtering

Filter out the items your user has already rated from the candidates above.

In [13]:
filtered_candidates = {}

for song, similar_songs in top_k_similar.items():
    filtered_candidates[song] = similar_songs[~similar_songs.index.isin(user_songs)]

for song, filtered_songs in filtered_candidates.items():
    print(f"\nFiltered candidates for song {song}:")
    print(filtered_songs)


Filtered candidates for song 3:
song_id
36     0.957717
433    0.647207
403    0.647207
239    0.647207
483    0.647207
497    0.647207
176    0.647207
Name: 3, dtype: float64

Filtered candidates for song 4:
song_id
974    0.934079
964    0.696383
972    0.696383
346    0.696383
76     0.696383
914    0.696383
861    0.696383
281    0.692067
730    0.692067
523    0.692067
Name: 4, dtype: float64

Filtered candidates for song 5:
song_id
21     0.696383
804    0.696383
142    0.696383
775    0.696383
204    0.696383
661    0.696383
816    0.696383
412    0.696383
30     0.696383
148    0.692067
Name: 5, dtype: float64

Filtered candidates for song 6:
song_id
718    0.780254
327    0.780254
254    0.780254
645    0.780254
162    0.780254
593    0.780254
591    0.780254
944    0.780254
Name: 6, dtype: float64

Filtered candidates for song 9:
song_id
972    0.748473
861    0.748473
346    0.748473
964    0.748473
76     0.748473
914    0.748473
149    0.738790
845    0.738790
887    0.73

## `vi` Candidate Rating Prediction

Calculate the predicted rating for each of the candidate items.

In [14]:
predicted_ratings = {}

for song, filtered_songs in filtered_candidates.items():
    user_rating = utility_matrix.at[song, target_user]
    for candidate_song, similarity in filtered_songs.items():
        if candidate_song not in predicted_ratings:
            predicted_ratings[candidate_song] = {"weighted_sum": 0.0, "similarity_sum": 0.0}

        predicted_ratings[candidate_song]["weighted_sum"] += similarity * user_rating
        predicted_ratings[candidate_song]["similarity_sum"] += similarity


In [17]:
final_scores = {
    song: vals["weighted_sum"] / vals["similarity_sum"] if vals["similarity_sum"] > 0 else 0
    for song, vals in predicted_ratings.items()
}

predicted_ratings_df = pd.DataFrame.from_dict(final_scores, orient='index', columns=['predicted_rating'])
predicted_ratings_df.sort_values(by='predicted_rating', ascending=False, inplace=True)

predicted_ratings_df.head(15)

Unnamed: 0,predicted_rating
717,15.0
203,15.0
16,15.0
309,15.0
759,15.0
747,15.0
454,15.0
227,15.0
140,15.0
177,15.0
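
The prediction above is a similarity-weighted average of the user's ratings. When a candidate appears in the top-K list of only one rated song, the similarity cancels out and the prediction equals that song's rating, which is one likely reason for the many 15.0 values. A minimal sketch with made-up numbers (the helper `predict_rating` and its inputs are hypothetical):

```python
def predict_rating(neighbours):
    """Similarity-weighted average of the user's ratings.

    neighbours: list of (similarity, user_rating) pairs, one per rated song
    that lists the candidate among its top-K similar songs.
    """
    weighted_sum = sum(sim * rating for sim, rating in neighbours)
    similarity_sum = sum(sim for sim, _ in neighbours)
    return weighted_sum / similarity_sum if similarity_sum > 0 else 0.0

# One neighbour: the similarity cancels, so the result is that rating.
single = predict_rating([(0.65, 15)])

# Two neighbours: the result is pulled toward the more similar song's rating.
double = predict_rating([(0.9, 15), (0.3, 3)])

print(single, double)
```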


--------------------------

# `04` KNN Item-based Collaborative Filtering

Practice using the scikit-surprise library

## `i` Data Loading

Load `songsDataset.csv` file into a dataframe

In [18]:
df = pd.read_csv('songsDataset.csv')
df.head()

Unnamed: 0,userID,songID,rating
0,0,90409,5
1,4,91266,1
2,5,8063,2
3,5,24427,4
4,5,105433,4


## `ii` Prepare Data

Procedures to Follow:
- Instantiate the Reader Object (see, [Documentation](https://surprise.readthedocs.io/en/stable/reader.html))
- Load the Data into `surprise.dataset.Dataset` (see, [Documentation](https://surprise.readthedocs.io/en/stable/dataset.html))
- Build the full (i.e. without folds) `surprise.Trainset` (see, [Documentation](https://surprise.readthedocs.io/en/stable/trainset.html#:~:text=It%20is%20used%20by%20the%20fit()%20method%20of%20every%20prediction%20algorithm.%20You%20should%20not%20try%20to%20build%20such%20an%20object%20on%20your%20own%20but%20rather%20use%20the%20Dataset.folds()%20method%20or%20the%20DatasetAutoFolds.build_full_trainset()%20method.))

In [19]:
df['rating'].min(), df['rating'].max()

(1, 5)

In [20]:
reader = Reader(rating_scale=(1, 5))

In [21]:
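# Surprise expects exactly three columns, in the order: user, item, rating
data = Dataset.load_from_df(df[['userID', 'songID', 'rating']], reader)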
data

<surprise.dataset.DatasetAutoFolds at 0x7e6774311010>

In [22]:
trainset = data.build_full_trainset()
trainset.n_users, trainset.n_items, trainset.n_ratings

(53963, 56, 72046)

## `iii` Initialize the `KNNWithMeans` Model

**Note**: `KNNWithMeans` works with mean-centered ratings (each rating minus the mean rating of its user, or of its item when `user_based` is `False`) instead of the raw ones. (See [Documentation](https://surprise.readthedocs.io/en/stable/knn_inspired.html#surprise.prediction_algorithms.knns.KNNWithMeans))

**Hint**: Use $k=10$ and configure `sim_options` to be:
- item_based
- pearson

In [23]:
knn_model = KNNWithMeans(k=10, sim_options={'name': 'pearson', 'user_based': False})
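
For intuition, the item-based estimate described in the Surprise documentation starts from the target item's mean rating and adds a similarity-weighted average of the neighbours' mean-centered ratings. The sketch below mirrors that formula under simplifying assumptions: the helper `knn_with_means_estimate` and all numbers are hypothetical, and neighbour selection and the `min_k` handling are omitted. The final clipping step reflects that `.predict()` clips estimates to the `rating_scale` by default.

```python
def knn_with_means_estimate(target_mean, neighbours):
    """Item-based mean-centered KNN estimate (simplified sketch).

    target_mean: mean rating of the item being predicted.
    neighbours:  list of (similarity, user_rating, neighbour_item_mean)
                 for the similar items the user has rated.
    """
    num = sum(sim * (rating - mean) for sim, rating, mean in neighbours)
    den = sum(sim for sim, _, _ in neighbours)
    if den == 0:
        return target_mean  # no usable neighbours: fall back to the mean
    return target_mean + num / den

def clip(value, low=1.0, high=5.0):
    """Clip an estimate to the rating scale, here (1, 5)."""
    return max(low, min(high, value))

# Hypothetical inputs.
est_a = clip(knn_with_means_estimate(4.2, [(0.8, 5, 3.9), (0.4, 4, 4.1)]))
est_b = clip(knn_with_means_estimate(4.6, [(0.9, 5, 3.5)]))  # raw value is 6.1

print(est_a, est_b)
```

The clipping step is one plausible reason several songs can tie at exactly 5.0 in a top-N list.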

## `iv` Fit the Model on Data

In [24]:
knn_model.fit(trainset)

Computing the pearson similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x7e67743e2f50>

## `v` Calculate Predicted Rating $\hat{r}$ for User $199988$

**Hint**: You can use the `.predict()` method of the model (see [Documentation](https://surprise.readthedocs.io/en/stable/getting_started.html?highlight=.predict#train-on-a-whole-trainset-and-the-predict-method:~:text=pred%20%3D%20algo.predict(uid%2C%20iid%2C%20r_ui%3D4%2C%20verbose%3DTrue)))

In [25]:
song_predictions = df[['songID']].drop_duplicates()
song_predictions.head()

Unnamed: 0,songID
0,90409
1,91266
2,8063
3,24427
4,105433


In [26]:
song_predictions['predicted_rating'] = song_predictions['songID'].apply(
    lambda song_id: knn_model.predict(uid=199988, iid=song_id).est
)
song_predictions.head()

Unnamed: 0,songID,predicted_rating
0,90409,4.808493
1,91266,4.70561
2,8063,4.2398
3,24427,4.549136
4,105433,4.872347


## `vi` Recommend Top 10 Songs

In [27]:
song_predictions_sorted = song_predictions.sort_values(by='predicted_rating', ascending=False)
song_predictions_sorted.head(10)

Unnamed: 0,songID,predicted_rating
41,60888,5.0
167,122065,5.0
123,132189,5.0
21,71582,5.0
29,52611,5.0
37,62954,5.0
45,40712,5.0
19,112023,4.999623
30,126757,4.983563
32,92881,4.941095


----------------------------------------------

$$ Wish \space you \space all \space the \space best \space ♡ $$
$$ Abdelrahman \space Eid $$