# **Collaborative Filtering**

# 0: Theory

Collaborative filtering builds a user-item matrix whose values correspond to the users' preferences. Using a chosen similarity metric, the similarities between users' preferences are then used to generate recommendations for each user. Each user is recommended items that they have not given feedback for, but that have received positive feedback from similar users. These recommendations may also be expressed as predictions, such as a predicted rating.

### Similarity
Several similarity measures can be chosen. The Pearson correlation coefficient measures the linear relationship between two variables, and the cosine similarity measures the similarity between two vectors based on the angle between them in a vector space. The similarity measure may also be referred to as the distance metric or correlation metric.
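As a small illustration of the two measures, the sketch below computes both with NumPy. The vectors `u`, `v` and `w` are made-up listening counts, and the helper names `cosine_sim` and `pearson_sim` are hypothetical. Pearson correlation is written here as the cosine similarity of the mean-centred vectors.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pearson_sim(a, b):
    """Pearson correlation, i.e. cosine similarity after mean-centring."""
    return cosine_sim(a - a.mean(), b - b.mean())

# Made-up listening counts of three users over the same four artists
u = np.array([5.0, 0.0, 3.0, 1.0])
v = np.array([10.0, 0.0, 6.0, 2.0])   # same taste as u, twice the volume
w = np.array([0.0, 4.0, 0.0, 0.0])    # no artists in common with u

cos_uv = cosine_sim(u, v)
pear_uv = pearson_sim(u, v)
cos_uw = cosine_sim(u, w)
print(f"cosine(u, v)={cos_uv:.3f}, pearson(u, v)={pear_uv:.3f}, cosine(u, w)={cos_uw:.3f}")
```

Because cosine similarity depends only on the angle, a user who listens twice as much to the same artists is treated as identical in taste.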

The two types of collaborative filtering techniques are model-based and memory-based.

### Memory-based Methods
Memory-based collaborative filtering can be user-based or item-based. User-based techniques compute the similarities between users based on their feedback (here, implicit feedback) for the same items. The predicted rating or feedback for an item is then calculated as a weighted average of the ratings given to that item by similar users. The weights are the similarities between those users and the chosen user. Item-based techniques work similarly but use the similarity between items instead of the similarity between users. Both of these methods form a similarity matrix.
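The weighted average described above can be sketched as follows. This is a toy example under stated assumptions: the `ratings` matrix is invented, a value of 0 is taken to mean "no feedback", and `predict_user_based` is a hypothetical helper rather than part of the notebook.

```python
import numpy as np

def predict_user_based(ratings, sims, user, item):
    """Similarity-weighted average of other users' feedback for `item`."""
    # Only users (other than the target) who gave feedback for the item contribute
    others = [v for v in range(ratings.shape[0])
              if v != user and ratings[v, item] > 0]
    if not others:
        return 0.0
    weights = sims[user, others]
    denom = np.abs(weights).sum()
    if denom == 0:
        return 0.0
    return float(weights @ ratings[others, item] / denom)

# Invented feedback: rows are users, columns are items, 0 = no feedback
ratings = np.array([
    [4.0, 0.0, 2.0],   # user 0 has no feedback for item 1
    [4.0, 5.0, 2.0],   # similar to user 0
    [1.0, 1.0, 5.0],   # less similar to user 0
])

# User-user cosine similarity matrix
unit = ratings / np.linalg.norm(ratings, axis=1, keepdims=True)
sims = unit @ unit.T

pred = predict_user_based(ratings, sims, user=0, item=1)
print(f"Predicted feedback of user 0 for item 1: {pred:.2f}")
```

The prediction lands between the two known values for item 1 and is pulled towards the value given by the more similar user.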

### Model-based Methods
Model-based collaborative filtering can be a lot quicker than memory-based methods at producing recommendations once the model has been fitted. An example of this is the singular value decomposition (SVD). These methods use the user-item matrix to find rules between items and use these rules to give a list of recommendations. If data is sparse, then model-based methods are recommended. More advanced model-based recommendation systems can use clustering, neural networks and elements of graph theory. The main drawback of model-based methods is that fitting them typically has a high computational cost and may require a large amount of memory.
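To make the SVD idea concrete, the sketch below builds a rank-$k$ approximation of a small invented matrix using `np.linalg.svd`. The matrix `R` and the helper `low_rank_approximation` are hypothetical, and the full SVD is used here only because the example is tiny.

```python
import numpy as np

def low_rank_approximation(matrix, k):
    """Keep the k largest singular values and rebuild the matrix."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

# Invented user-item matrix with two loose groups of users
R = np.array([
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 0.0, 1.0],
    [0.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
])

R2 = low_rank_approximation(R, k=2)
err_k2 = np.linalg.norm(R - R2)
err_k4 = np.linalg.norm(R - low_rank_approximation(R, k=4))
print(np.round(R2, 2))
print(f"Frobenius error: rank 2 = {err_k2:.3f}, rank 4 = {err_k4:.1e}")
```

Entries that were 0 in `R` generally become non-zero in `R2`. Those filled-in values are what a model of this kind would rank to produce recommendations.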

One of the most popular algorithms for collaborative filtering when the user-item matrix is sparse is Alternating Least Squares (ALS) minimisation. Put simply, it aims to estimate the entries of a matrix $M=UV^T$ when only a subset of these entries is observed. The algorithm minimises the squared error on the observed entries by alternating between optimising $U$ with $V$ fixed and optimising $V$ with $U$ fixed. This allows us to predict entries for artists which a given user has not listened to yet.
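A minimal NumPy sketch of this alternation is given below. It minimises $\sum_{(i,j)\,\text{observed}} (M_{ij} - u_i^T v_j)^2 + \lambda(\lVert U\rVert_F^2 + \lVert V\rVert_F^2)$, where the regularisation term $\lambda$ is an assumption added for numerical stability and is not part of the description above. The function `als`, the boolean `mask` of observed entries and the synthetic data are all hypothetical, and this is not how the `implicit` or PySpark implementations are written.

```python
import numpy as np

def als(M, mask, rank=2, reg=0.1, iterations=20, seed=0):
    """Alternating least squares on the observed entries of M (mask == True)."""
    rng = np.random.default_rng(seed)
    n_users, n_items = M.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_items, rank))
    ridge = reg * np.eye(rank)
    losses = []
    for _ in range(iterations):
        # Fix V and solve a small ridge regression for each user row
        for i in range(n_users):
            obs = mask[i]
            if obs.any():
                Vo = V[obs]
                U[i] = np.linalg.solve(Vo.T @ Vo + ridge, Vo.T @ M[i, obs])
        # Fix U and solve a small ridge regression for each item row
        for j in range(n_items):
            obs = mask[:, j]
            if obs.any():
                Uo = U[obs]
                V[j] = np.linalg.solve(Uo.T @ Uo + ridge, Uo.T @ M[obs, j])
        residual = (M - U @ V.T)[mask]
        losses.append(float(residual @ residual
                            + reg * ((U ** 2).sum() + (V ** 2).sum())))
    return U, V, losses

# Synthetic rank-2 matrix with three entries hidden from the algorithm
rng = np.random.default_rng(1)
M_true = rng.normal(size=(6, 2)) @ rng.normal(size=(5, 2)).T
mask = np.ones_like(M_true, dtype=bool)
mask[0, 1] = mask[2, 3] = mask[4, 0] = False

U, V, losses = als(M_true, mask, rank=2, reg=0.01, iterations=30)
predicted_hidden = (U @ V.T)[~mask]   # estimates for the unobserved entries
print(f"objective: first iteration {losses[0]:.4f}, last iteration {losses[-1]:.4f}")
```

Each half-step solves its least-squares problem exactly, so the objective cannot increase from one iteration to the next.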

### Pros & Cons
Collaborative filtering can be used when data is difficult to analyse, since it can use implicit feedback. However, there are a few problems. Firstly, there is the cold-start problem: a new user has no data, so the system cannot make meaningful recommendations for them. Also, if data is sparse, then recommendations can be less accurate and many items may not be recommended at all. Finally, the method must be scalable in order to stay efficient. The basic collaborative filtering methods can struggle with this, but model-based methods like SVD can be used to give efficient and robust recommendations.
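The cold-start and sparsity problems can be seen on a toy matrix. In the sketch below the data is invented, `cosine_similarity_matrix` is a hypothetical helper, and giving an all-zero row a similarity of 0 to everyone is a convention assumed here.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Row-wise cosine similarity; all-zero rows get similarity 0 to everyone."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    safe_norms = np.where(norms == 0, 1.0, norms)   # avoid division by zero
    unit = X / safe_norms
    return unit @ unit.T

# Invented user-item matrix; the last row is a brand-new user with no data
X = np.array([
    [3.0, 0.0, 1.0],
    [2.0, 0.0, 2.0],
    [0.0, 0.0, 0.0],
])

S = cosine_similarity_matrix(X)
density = np.count_nonzero(X) / X.size   # fraction of observed entries

print("Similarities of the new user:", S[2])
print(f"Matrix density: {density:.2f}")
```

The new user is equally dissimilar to everyone, so a memory-based method has nothing to rank. The middle item has no feedback at all and would therefore never be recommended.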

# 1: Data Preparation
## 1.1 Loading Data & Importing Libraries

In [None]:
# Load libraries
import pandas as pd
import numpy as np
import os
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import csr_matrix
import implicit
import time
from sklearn.model_selection import train_test_split

In [2]:
# Import datasets
artists = pd.read_csv(os.path.join('..','data','artists.dat'), delimiter='\t')
tags = pd.read_csv(os.path.join('..','data','tags.dat'), delimiter='\t',encoding='ISO-8859-1')
user_artists = pd.read_csv(os.path.join('..','data','user_artists.dat'), delimiter='\t')
user_friends = pd.read_csv(os.path.join('..','data','user_friends.dat'), delimiter='\t')
user_taggedartists_timestamps = pd.read_csv(os.path.join('..','data','user_taggedartists-timestamps.dat'), delimiter='\t')
user_taggedartists = pd.read_csv(os.path.join('..','data','user_taggedartists.dat'), delimiter='\t')

## 1.2 Data Cleaning

In [3]:
# Drop irrelevant columns from the Artists dataset
artists_cleaned = artists.drop(columns=['url', 'pictureURL']).drop_duplicates(keep='first') 

# Drop duplicates from the Tags dataset (no columns need removing)
tags_cleaned = tags.drop_duplicates(keep='first') 

# For the User-Artists dataset, we can filter out rows with a weight of 0, as they show no meaningful interaction
user_artists_cleaned = user_artists[user_artists['weight'] > 0]
user_artists_cleaned = user_artists_cleaned.drop_duplicates(keep='first') 

# Drop duplicates from the User-Tagged Artists Timestamps dataset
user_taggedartists_timestamps_cleaned = user_taggedartists_timestamps.drop_duplicates(keep='first') 

# Convert timestamps from ms to datetime format
user_taggedartists_timestamps_cleaned['timestamp'] = pd.to_datetime(user_taggedartists_timestamps_cleaned['timestamp'], unit='ms')

# Drop duplicates from the User-Friends dataset
user_friends_cleaned = user_friends.drop_duplicates(keep='first')

In [4]:
# Uncomment to output cleaned datasets for inspection
# print("Cleaned Artists dataset:", artists_cleaned.info(), artists_cleaned.head())
# print("Cleaned Tags dataset:", tags_cleaned.info(), tags_cleaned.head())
print("Cleaned User-Artists dataset:", user_artists_cleaned.info(), user_artists_cleaned.head())
# print("Cleaned User-Tagged Artists Timestamps dataset:", user_taggedartists_timestamps_cleaned.info(), user_taggedartists_timestamps_cleaned.head())
# print("Cleaned User-Friends dataset:", user_friends_cleaned.info(), user_friends_cleaned.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92834 entries, 0 to 92833
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   userID    92834 non-null  int64
 1   artistID  92834 non-null  int64
 2   weight    92834 non-null  int64
dtypes: int64(3)
memory usage: 2.1 MB
Cleaned User-Artists dataset: None    userID  artistID  weight
0       2        51   13883
1       2        52   11690
2       2        53   11351
3       2        54   10300
4       2        55    8983


---

# 2: Implementing Collaborative Filtering

We now implement collaborative filtering using some of the different techniques that we have described. We will use some memory-based and some model-based methods.
We will investigate how these methods give recommendations and how efficient they are at doing so.

First, we create a dictionary which will allow us to map artistID recommendations to the corresponding names of the artists.

In [5]:
# Create a dictionary to map artistID to artistName
artist_id_to_name = dict(zip(artists['id'], artists['name']))

We initialise the user-artist matrix which has values corresponding to the listening counts.

In [6]:
# Create a user-artist interaction matrix using the user_artists_cleaned dataset
user_artist_matrix = user_artists_cleaned.pivot(index='userID', columns='artistID', values='weight')

# Fill NaN values with 0s (0 = no recorded listens; non-zero values are listening counts used as implicit feedback)
user_artist_matrix = user_artist_matrix.fillna(0)

print(user_artist_matrix)

artistID  1      2      3      4      5      6      7      8      9      \
userID                                                                    
2           0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0   
3           0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0   
4           0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0   
5           0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0   
6           0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0   
...         ...    ...    ...    ...    ...    ...    ...    ...    ...   
2095        0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0   
2096        0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0   
2097        0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0   
2099        0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0   
2100        0.0    0.0  408.0    0.0    0.0  404.0    0.0    0.0    0.0   

artistID  10     ...  18

In [None]:
# Perform train-test split
train_data, test_data = train_test_split(user_artists_cleaned, test_size=0.2, random_state=27)

# Initialize train and test matrices directly with 0s
train_matrix = pd.DataFrame(0, index=user_artist_matrix.index, columns=user_artist_matrix.columns)
test_matrix = pd.DataFrame(0, index=user_artist_matrix.index, columns=user_artist_matrix.columns)

# Populate train and test matrices
for _, row in train_data.iterrows():
    train_matrix.at[row['userID'], row['artistID']] = row['weight']

for _, row in test_data.iterrows():
    test_matrix.at[row['userID'], row['artistID']] = row['weight']

# Compute cosine similarity
user_similarity = cosine_similarity(train_matrix)
user_similarity_df = pd.DataFrame(user_similarity, index=train_matrix.index, columns=train_matrix.index)

# Dictionary to map artistID to artistName
artist_id_to_name = dict(zip(artists['id'], artists['name']))

# User-based recommendation function
def get_user_based_recommendations(user_id, user_similarity_df, train_matrix, artist_id_to_name, top_n=10):
    similar_users = user_similarity_df[user_id].sort_values(ascending=False).index[1:]
    recommendations = {}

    for similar_user in similar_users:
        interacted_artists = train_matrix.loc[similar_user][train_matrix.loc[similar_user] > 0].index.tolist()
        for artist in interacted_artists:
            if artist not in train_matrix.loc[user_id][train_matrix.loc[user_id] > 0].index.tolist():
                recommendations[artist] = recommendations.get(artist, 0) + user_similarity_df[user_id][similar_user]

    sorted_recommendations = sorted(recommendations.items(), key=lambda x: x[1], reverse=True)
    return [(artist, artist_id_to_name.get(artist, "Unknown"), score) for artist, score in sorted_recommendations[:top_n]]

# Start timing
start_time = time.time()

# Example: Get recommendations for user 2
user_id = 2
user_based_recommendations = get_user_based_recommendations(user_id, user_similarity_df, train_matrix, artist_id_to_name, top_n=5)

# Display recommendations
print("Top User-Based Recommendations for User 2:")
for artist_id, artist_name, score in user_based_recommendations:
    print(f"Artist ID: {artist_id}, Artist: {artist_name}, Similarity Score: {score:.2f}")

# End timing
end_time = time.time()

# Calculate and display elapsed time
elapsed_time = end_time - start_time
print(f"Time elapsed: {elapsed_time:.4f} seconds")

Top User-Based Recommendations for User 2:
Artist ID: 227, Artist: The Beatles, Similarity Score: 10.22
Artist ID: 159, Artist: The Cure, Similarity Score: 9.35
Artist ID: 89, Artist: Lady Gaga, Similarity Score: 9.13
Artist ID: 154, Artist: Radiohead, Similarity Score: 9.00
Artist ID: 190, Artist: Muse, Similarity Score: 8.79
Time elapsed: 8.6982 seconds


## 2.1 Memory-based Collaborative Filtering

### 2.1.1 User-Based Implementation
For the user-based implementation, we must compute the similarity matrix using the cosine similarity between users.

In [8]:
# Compute the cosine similarity between users
user_similarity = cosine_similarity(user_artist_matrix)

# Convert the similarity matrix into a DataFrame for easy inspection
user_similarity_df = pd.DataFrame(user_similarity, index=user_artist_matrix.index, columns=user_artist_matrix.index)

# Display a portion of the user similarity matrix
print(user_similarity_df.head())

userID      2     3         4         5         6         7         8     \
userID                                                                     
2       1.000000   0.0  0.144786  0.028692  0.007016  0.030219  0.008964   
3       0.000000   1.0  0.000000  0.000000  0.000000  0.000000  0.000000   
4       0.144786   0.0  1.000000  0.081193  0.006609  0.000000  0.000000   
5       0.028692   0.0  0.081193  1.000000  0.000000  0.000000  0.000000   
6       0.007016   0.0  0.006609  0.000000  1.000000  0.012713  0.018881   

userID  9         10        11    ...      2090      2091      2092      2093  \
userID                            ...                                           
2        0.0  0.000000  0.021267  ...  0.000000  0.043405  0.000000  0.004625   
3        0.0  0.000000  0.000000  ...  0.000000  0.000000  0.000000  0.000000   
4        0.0  0.009072  0.013407  ...  0.000000  0.000000  0.003776  0.006178   
5        0.0  0.169078  0.004639  ...  0.010993  0.000000  0.2

We create a function which gives user-based recommendations. This function uses the user-artist matrix and the similarity matrix to give the top-$N$ recommendations for the given user. It then maps the corresponding ID recommendations of artists to the artist names using the previously defined dictionary.

The implementation is shown for user 2.

In [9]:
# Function to get user-based recommendations
def get_user_based_recommendations(user_id, user_similarity_df, user_artist_matrix, artist_id_to_name, top_n=10):    
    # Get the most similar users (excluding the user itself)
    similar_users = user_similarity_df[user_id].sort_values(ascending=False).index[1:]

    recommendations = {}
    for similar_user in similar_users:
        # Get the artists this similar user has interacted with (non-zero values)
        interacted_artists = user_artist_matrix.loc[similar_user][user_artist_matrix.loc[similar_user] > 0].index.tolist()

        for artist in interacted_artists:
            # Only consider artists the target user has not interacted with
            if artist not in user_artist_matrix.loc[user_id][user_artist_matrix.loc[user_id] > 0].index.tolist():
                # Add the artist to recommendations with a score (using the scaled similarity as a weight)
                if artist not in recommendations:
                    recommendations[artist] = user_similarity_df[user_id][similar_user]
                else:
                    # Add the weight of similarity to the current score
                    recommendations[artist] += user_similarity_df[user_id][similar_user]

    # Sort recommendations by score (highest first)
    sorted_recommendations = sorted(recommendations.items(), key=lambda x: x[1], reverse=True)

    # Convert artist IDs to names and prepare the final list with IDs, names, and scores
    recommended_artists = [(artist, artist_id_to_name.get(artist, "Unknown"), score) for artist, score in sorted_recommendations[:top_n]]

    return recommended_artists

In [10]:
# start timing
start_time = time.time()

# Example: Get top 5 user-based recommendations for user with userID=2
user_id = 2
user_based_recommendations = get_user_based_recommendations(user_id, user_similarity_df, user_artist_matrix, artist_id_to_name, top_n=5)

# Display user-based recommendations
print("Top User-Based Recommendations for User 2:")
for artist_id, artist_name, score in user_based_recommendations:
    print(f"Artist ID: {artist_id}, Artist: {artist_name}, Similarity Score: {score:.2f}")

# end timing
end_time = time.time()

# calculate and display elapsed time
elapsed_time = end_time - start_time
print(f"Time elapsed: {elapsed_time:.4f} seconds")

Top User-Based Recommendations for User 2:
Artist ID: 289, Artist: Britney Spears, Similarity Score: 20.67
Artist ID: 288, Artist: Rihanna, Similarity Score: 20.10
Artist ID: 295, Artist: Beyoncé, Similarity Score: 16.92
Artist ID: 292, Artist: Christina Aguilera, Similarity Score: 16.73
Artist ID: 300, Artist: Katy Perry, Similarity Score: 15.50
Time elapsed: 10.7437 seconds


The output is the top 5 recommendations for user 2 with their scores. Each score is the sum of the similarities of the users who have listened to that artist. All five recommendations are female pop singers, which suggests that the users most similar to user 2 share a taste for mainstream pop. We will now give recommendations for user 3. We also take note of the time taken to give the recommendations.

In [11]:
# start timing
start_time = time.time()

# Example: Get top 5 user-based recommendations for user with userID=3
user_id = 3
user_based_recommendations = get_user_based_recommendations(user_id, user_similarity_df, user_artist_matrix, artist_id_to_name, top_n=5)

# Display user-based recommendations
print("Top User-Based Recommendations for User 3:")
for artist_id, artist_name, score in user_based_recommendations:
    print(f"Artist ID: {artist_id}, Artist: {artist_name}, Similarity Score: {score:.2f}")

# end timing
end_time = time.time()

# calculate and display elapsed time
elapsed_time = end_time - start_time
print(f"Time elapsed: {elapsed_time:.4f} seconds")

Top User-Based Recommendations for User 3:
Artist ID: 757, Artist: Crystal Castles, Similarity Score: 0.25
Artist ID: 603, Artist: Aphex Twin, Similarity Score: 0.25
Artist ID: 1222, Artist: Venetian Snares, Similarity Score: 0.23
Artist ID: 2174, Artist: edIT, Similarity Score: 0.23
Artist ID: 154, Artist: Radiohead, Similarity Score: 0.21
Time elapsed: 11.3848 seconds


The artists recommended for user 3 have much lower scores than those recommended to user 2. The top 4 recommendations could all be categorised as electronic artists, whereas the fifth recommendation, Radiohead, is not very similar to the others. The time taken (about 11.4 seconds) is close to that for user 2 (about 10.7 seconds).
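The nested loops in `get_user_based_recommendations` add one user's similarity to an artist's score each time that user has listened to the artist. The same quantity can be written as a single matrix-vector product. The sketch below shows this on an invented matrix: `user_based_scores` is a hypothetical helper, and it works with row and column positions rather than userID and artistID labels.

```python
import numpy as np

def user_based_scores(R, sims, user):
    """Sum of user-user similarities over the users who listened to each artist."""
    listened = R > 0                          # boolean user x artist mask
    weights = sims[user].astype(float).copy()
    weights[user] = 0.0                       # exclude the target user
    scores = weights @ listened.astype(float)
    scores[listened[user]] = -np.inf          # never recommend known artists
    return scores

# Invented listening counts: rows are users, columns are artists
R = np.array([
    [5.0, 0.0, 0.0, 2.0],
    [3.0, 4.0, 0.0, 0.0],
    [0.0, 1.0, 7.0, 0.0],
])
unit = R / np.linalg.norm(R, axis=1, keepdims=True)
sims = unit @ unit.T

scores = user_based_scores(R, sims, user=0)
top = np.argsort(-scores)[:2]                 # column positions of the best artists
print("Scores:", scores)
print("Top columns:", top)
```

On a matrix of the size used in this notebook, one product of this kind would be expected to run far faster than the Python loops, although that was not timed here.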

### 2.1.2 Item-based Implementation
For the item-based implementation, we must compute the similarity matrix using the cosine similarity between artists.

In [12]:
# Compute the cosine similarity between artists (transpose the matrix to compare artists)
artist_similarity = cosine_similarity(user_artist_matrix.T)  # Transpose to compare artists (columns)

# Convert the similarity matrix into a DataFrame for easy inspection
artist_similarity_df = pd.DataFrame(artist_similarity, index=user_artist_matrix.columns, columns=user_artist_matrix.columns)

# Display a portion of the artist similarity matrix
print(artist_similarity_df.head())

artistID  1        2      3      4        5         6         7         8      \
artistID                                                                        
1           1.0  0.00000    0.0    0.0  0.00000  0.000000  0.008784  0.032075   
2           0.0  1.00000    0.0    0.0  0.20774  0.000000  0.010696  0.000000   
3           0.0  0.00000    1.0    0.0  0.00000  0.205607  0.000000  0.000000   
4           0.0  0.00000    0.0    1.0  0.00000  0.000000  0.019742  0.049547   
5           0.0  0.20774    0.0    0.0  1.00000  0.000000  0.042728  0.000000   

artistID     9         10     ...  18736  18737  18738  18739  18740  18741  \
artistID                      ...                                             
1         0.000000  0.000000  ...    0.0    0.0    0.0    0.0    0.0    0.0   
2         0.102094  0.387653  ...    0.0    0.0    0.0    0.0    0.0    0.0   
3         0.000000  0.000000  ...    0.0    0.0    0.0    0.0    0.0    0.0   
4         0.000000  0.000000  ...    

We create a function which gives item-based recommendations. This function uses the user-artist matrix and the similarity matrix to give the top-$N$ recommendations for the given user. It then maps the corresponding ID recommendations of artists to the artist names using the previously defined dictionary.

The implementation is shown for user 2.

In [13]:
# Function to get item-based recommendations
def get_item_based_recommendations(user_id, user_artist_matrix, artist_similarity_df, artist_id_to_name, top_n=10):
    # Get the artists the user has interacted with (non-zero values)
    interacted_artists = user_artist_matrix.loc[user_id][user_artist_matrix.loc[user_id] > 0].index.tolist()
    
    recommendations = {}
    for artist in interacted_artists:
        # Get the most similar artists to the ones the user interacted with
        similar_artists = artist_similarity_df[artist].sort_values(ascending=False).index[1:]  # Exclude the artist itself

        for similar_artist in similar_artists:
            # Add the similar artist to recommendations with a score (using the similarity as a weight)
            if similar_artist not in recommendations:
                recommendations[similar_artist] = artist_similarity_df[artist][similar_artist]
            else:
                recommendations[similar_artist] += artist_similarity_df[artist][similar_artist]

    # Sort recommendations by score (highest first)
    sorted_recommendations = sorted(recommendations.items(), key=lambda x: x[1], reverse=True)

    # Convert artist IDs to names using artist_id_to_name
    recommended_artists = [(artist_id, artist_id_to_name.get(artist_id, "Unknown"), score) for artist_id, score in sorted_recommendations[:top_n]]

    return recommended_artists

In [None]:
# start timing
start_time = time.time()

# Example: Get top 5 item-based recommendations for user with userID=2
user_id = 2
item_based_recommendations = get_item_based_recommendations(user_id, user_artist_matrix, artist_similarity_df, artist_id_to_name, top_n=5)

# Display item-based recommendations
print("\nTop Item-Based Recommendations for User 2:")
for artist_id, artist_name, score in item_based_recommendations:
    print(f"Artist ID: {artist_id}, Artist: {artist_name}, Similarity Score: {score:.2f}")

# end timing
end_time = time.time()

# print the elapsed time
print(f"Time elapsed: {end_time - start_time:.4f} seconds")


Top Item-Based Recommendations for User 2:
Artist ID: 74, Artist: Basia, Similarity Score: 24.97
Artist ID: 92, Artist: Vitamin Z, Similarity Score: 24.97
Artist ID: 79, Artist: Fiction Factory, Similarity Score: 24.97
Artist ID: 87, Artist: Deacon Blue, Similarity Score: 24.97
Artist ID: 60, Artist: Matt Bianco, Similarity Score: 23.97
Time elapsed: 6.9654 seconds


The item-based method gives different recommendations from the user-based method. The scores for the top 5 recommendations are higher for the item-based model. The item-based method is also faster than the user-based method for this dataset (about 7.0 seconds against 10.7 seconds for user 2). This is because the user-based function iterates over both similar users and their artists, whereas the item-based function only iterates over the artists similar to those the user has listened to.

In [15]:
# start timing
start_time = time.time()

# Example: Get top 5 item-based recommendations for user with userID=3
user_id = 3
item_based_recommendations = get_item_based_recommendations(user_id, user_artist_matrix, artist_similarity_df, artist_id_to_name, top_n=5)

# Display item-based recommendations
print("\nTop Item-Based Recommendations for User 3:")
for artist_id, artist_name, score in item_based_recommendations:
    print(f"Artist ID: {artist_id}, Artist: {artist_name}, Similarity Score: {score:.2f}")

# end timing
end_time = time.time()

# print the elapsed time
print(f"Time elapsed: {end_time - start_time:.4f} seconds")


Top Item-Based Recommendations for User 3:
Artist ID: 134, Artist: Big Brotherz, Similarity Score: 41.77
Artist ID: 131, Artist: Part Timer, Similarity Score: 41.77
Artist ID: 130, Artist: Philippe Lamy, Similarity Score: 41.77
Artist ID: 129, Artist: Aless, Similarity Score: 41.77
Artist ID: 128, Artist: strom noir, Similarity Score: 41.77
Time elapsed: 6.0982 seconds


Again, we can see that the item-based method gives different recommendations to the user-based method. The similarity scores for the top 5 recommendations are much higher for the item-based model for user 3.

---

## 2.2 Model-based Methods

### 2.2.1 Singular Value Decomposition
We compute the SVD of the user-artist matrix using `scikit-learn` and use this to give recommendations. The SVD may help to identify patterns in the data, and knowledge of these could improve our recommendations. We apply an SVD model to the user-artist matrix to get the SVD components (artist features), and then approximate the original user-artist matrix. This approximation matrix is what we use to make our recommendations.

In [16]:
def get_svd_recommendations(user_id, user_artist_matrix, artist_id_to_name, top_n=10, n_components=50):
    # Apply SVD to the user-artist matrix
    svd = TruncatedSVD(n_components=n_components, random_state=42)
    svd_matrix = svd.fit_transform(user_artist_matrix)
    svd_components = svd.components_

    # Reconstruct the user-artist interaction matrix
    reconstructed_matrix = np.dot(svd_matrix, svd_components)
    
    recommendations = {}
        
    # Get the user's interaction vector from the reconstructed matrix
    reconstructed_user_vector = reconstructed_matrix[user_id - 2]  # User IDs start at 2, so subtract 2
    
    # Iterate through all artists to recommend
    for i, score in enumerate(reconstructed_user_vector):
        # Check if the artist has been interacted with (score > 0) and if the artist ID is valid
        if user_artist_matrix.iloc[user_id - 2, i] == 0:  # Ensure we only recommend non-interacted artists
            artist_id = i  # The index of the artist in the matrix
            if artist_id not in recommendations:
                recommendations[artist_id] = score
            else:
                recommendations[artist_id] += score
    
    # Sort recommendations by score (highest first)
    sorted_recommendations = sorted(recommendations.items(), key=lambda x: x[1], reverse=True)
    
    # Convert artist IDs to names using the artist_id_to_name mapping
    recommended_artists = [(artist_id, artist_id_to_name.get(artist_id, "Unknown"), score)
                           for artist_id, score in sorted_recommendations[:top_n]]
    
    return recommended_artists

In [17]:
# start timing
start_time = time.time()

# Example: Get top 5 SVD-based recommendations for user with userID=2
user_id = 2
svd_recommendations = get_svd_recommendations(user_id, user_artist_matrix, artist_id_to_name, top_n=5)

# Display SVD-based recommendations
print("\nTop SVD-Based Recommendations for User 2:")
for artist_id, artist_name, score in svd_recommendations:
    print(f"Artist ID: {artist_id}, Artist: {artist_name}, Similarity Score: {score:.2f}")

# end timing
end_time = time.time()

# print the elapsed time
print(f"Time elapsed: {end_time - start_time:.4f} seconds")


Top SVD-Based Recommendations for User 2:
Artist ID: 3464, Artist: Counting Crows, Similarity Score: 2346.16
Artist ID: 1089, Artist: Suede, Similarity Score: 1826.24
Artist ID: 259, Artist: 9th Wonder, Similarity Score: 1581.01
Artist ID: 153, Artist: De/Vision, Similarity Score: 1536.43
Artist ID: 992, Artist: Chris Rea, Similarity Score: 1110.52
Time elapsed: 7.0510 seconds


In [18]:
# start timing
start_time = time.time()

# Example: Get top 5 SVD-based recommendations for user with userID=3
user_id = 3
svd_recommendations = get_svd_recommendations(user_id, user_artist_matrix, artist_id_to_name, top_n=5)

# Display SVD-based recommendations
print("\nTop SVD-Based Recommendations for User 3:")
for artist_id, artist_name, score in svd_recommendations:
    print(f"Artist ID: {artist_id}, Artist: {artist_name}, Similarity Score: {score:.2f}")

# end timing
end_time = time.time()

# print the elapsed time
print(f"Time elapsed: {end_time - start_time:.4f} seconds")


Top SVD-Based Recommendations for User 3:
Artist ID: 184, Artist: James Blunt, Similarity Score: 6.19
Artist ID: 148, Artist: The Boats, Similarity Score: 4.82
Artist ID: 1089, Artist: Suede, Similarity Score: 3.29
Artist ID: 151, Artist: Deep Forest, Similarity Score: 2.92
Artist ID: 298, Artist: Lily Allen, Similarity Score: 2.87
Time elapsed: 2.6960 seconds


The recommendations for user 2 had significantly higher scores than those from the user-based and item-based methods. However, for user 3, the SVD gave higher scores than the user-based method but lower scores than the item-based method. The SVD scores are entries of the reconstructed matrix rather than sums of similarities, so the scales are not directly comparable. Even so, this suggests that the behaviour of the method is highly dependent on the available data and that this must be considered when choosing which technique we use. In terms of speed, the SVD method was the fastest so far for user 3 (2.70 seconds) and comparable to the item-based method for user 2 (7.05 seconds against 6.97 seconds). The SVD reduces the dimensionality of the user-artist matrix by keeping only the most important features, which makes computations more efficient.
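The `get_svd_recommendations` function looks up the user with `user_id - 2` and uses the column position `i` as the artist ID. That only matches the real IDs when they are contiguous, and the matrix printed earlier skips from userID 2097 to 2099. A minimal sketch of a label-based lookup is shown below. The toy DataFrame, the helper `top_n_from_reconstruction` and the use of `np.linalg.svd` in place of `TruncatedSVD` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def top_n_from_reconstruction(user_id, user_artist_matrix, reconstructed, top_n=5):
    """Rank unseen artists for `user_id` using index and column labels."""
    row = user_artist_matrix.index.get_loc(user_id)          # label -> row position
    scores = pd.Series(reconstructed[row], index=user_artist_matrix.columns)
    unseen = user_artist_matrix.loc[user_id] == 0            # artists with no listens
    return scores[unseen].nlargest(top_n)                    # indexed by artistID

# Toy matrix with gaps in both the user IDs and the artist IDs
toy = pd.DataFrame(
    [[10.0, 0.0, 3.0, 0.0],
     [0.0, 4.0, 0.0, 6.0],
     [8.0, 0.0, 0.0, 1.0]],
    index=pd.Index([2, 3, 5], name="userID"),
    columns=pd.Index([51, 52, 70, 99], name="artistID"),
)

# Rank-2 reconstruction of the toy matrix
U, s, Vt = np.linalg.svd(toy.values, full_matrices=False)
recon = (U[:, :2] * s[:2]) @ Vt[:2, :]

recs = top_n_from_reconstruction(5, toy, recon, top_n=2)
print(recs)
```

The returned Series is indexed by artistID, so the IDs can be passed straight to `artist_id_to_name` without any position arithmetic.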

### 2.2.2 ALS Using `implicit`
`implicit` is a Python library developed for efficient recommendation systems. We can use it with a sparse matrix of user or item weights to give recommendations. We initialise the model using `implicit.als.AlternatingLeastSquares()` and then use `.fit()` and `.recommend()` to fit the model and give recommendations.

We now implement the very popular ALS method using the `implicit` library for efficiency. We use the `csr_matrix()` function from `scipy` to convert the user-artist matrix to a sparse format that is suitable for ALS.
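A sparse matrix has no index labels, so a model fitted on it only sees row and column positions. The sketch below shows one way to build the sparse matrix directly from long-format (userID, artistID, weight) triples while keeping the lookup arrays needed to translate between IDs and positions. The helper `build_sparse` and the small arrays are hypothetical, and no `implicit` call is made here.

```python
import numpy as np
from scipy.sparse import csr_matrix

def build_sparse(user_ids, artist_ids, weights):
    """Build a user x artist CSR matrix plus the ID arrays for each axis."""
    # np.unique returns the sorted distinct IDs and, for every input row,
    # the position of its ID in that sorted array
    users, row_pos = np.unique(user_ids, return_inverse=True)
    artists, col_pos = np.unique(artist_ids, return_inverse=True)
    matrix = csr_matrix((weights, (row_pos, col_pos)),
                        shape=(len(users), len(artists)))
    return matrix, users, artists

# Made-up interactions with gaps in both ID ranges
user_ids = np.array([2, 2, 3, 5])
artist_ids = np.array([51, 52, 51, 99])
weights = np.array([10.0, 4.0, 7.0, 1.0])

matrix, users, artists = build_sparse(user_ids, artist_ids, weights)

# ID -> position for looking up a user's row; position -> ID is artists[pos]
user_pos = {int(u): i for i, u in enumerate(users)}
artist_pos = {int(a): j for j, a in enumerate(artists)}

row = user_pos[5]
print("Row of user 5:", row, "| artist ID in column 2:", artists[2])
```

Any positions returned by a model fitted on `matrix` can then be mapped back to artist IDs with `artists[position]` before looking up names.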

In [19]:
def get_als_recommendations(user_id, user_artist_matrix, artist_id_to_name, top_n=5, factors=50, regularization=0.1, iterations=20):
    # Convert the user-artist matrix to sparse format (csr_matrix)
    sparse_matrix = csr_matrix(user_artist_matrix.values)
    
    # Initialize and train the ALS model
    model = implicit.als.AlternatingLeastSquares(factors=factors, regularization=regularization, iterations=iterations)
    model.fit(sparse_matrix)

    # Get the user's interaction vector (row from sparse matrix)
    user_vector = sparse_matrix[user_id]

    # Get top N artist recommendations (returns artist IDs and scores)
    recommendations = model.recommend(user_id, user_vector, N=top_n)

    # Convert artist IDs to artist names using the provided dictionary
    recommended_artists = [(artist_id_to_name[artist_id], score) for artist_id, score in zip(recommendations[0], recommendations[1])]

    return recommended_artists

In [20]:
# start timing
start_time = time.time()

# Example: Get top 5 ALS-based recommendations for user with userID=2
user_id = 2
als_recommendations = get_als_recommendations(user_id, user_artist_matrix, artist_id_to_name, top_n=5)

# Display ALS-based recommendations
print(f"\nTop ALS-Based Recommendations for User {user_id}:")
for artist_name, score in als_recommendations:
    print(f"Artist: {artist_name}, Predicted Listening Count: {score:.2f}")

# end timing
end_time = time.time()

# print the elapsed time
print(f"Time elapsed: {end_time - start_time:.4f} seconds")


Top ALS-Based Recommendations for User 2:
Artist: Billy Ray Cyrus, Predicted Listening Count: 1.28
Artist: Gothminister, Predicted Listening Count: 1.12
Artist: Jeff Buckley, Predicted Listening Count: 1.10
Artist: Marc Almond, Predicted Listening Count: 1.10
Artist: Deftones, Predicted Listening Count: 1.09
Time elapsed: 2.4704 seconds


In [21]:
# start timing
start_time = time.time()

# Example: Get top 5 ALS-based recommendations for user with userID=3
user_id = 3
als_recommendations = get_als_recommendations(user_id, user_artist_matrix, artist_id_to_name, top_n=5)

# Display ALS-based recommendations
print(f"\nTop ALS-Based Recommendations for User {user_id}:")
for artist_name, score in als_recommendations:
    print(f"Artist: {artist_name}, Predicted Listening Count: {score:.2f}")

# end timing
end_time = time.time()

# print the elapsed time
print(f"Time elapsed: {end_time - start_time:.4f} seconds")


Top ALS-Based Recommendations for User 3:
Artist: Camouflage, Predicted Listening Count: 1.28
Artist: Parliament, Predicted Listening Count: 1.21
Artist: Renato Carosone, Predicted Listening Count: 1.06
Artist: Michael Jackson, Predicted Listening Count: 1.05
Artist: The Montesas, Predicted Listening Count: 1.04
Time elapsed: 2.3793 seconds


Unlike our other methods, this method returns scores that the code labels as predicted listening counts. The scores for the top 5 recommendations are all similar, lying in the range 1.04-1.28 across both users, which is a far smaller scale than the raw listening counts. The artists are quite different in terms of genre. This method took about 2.4 seconds for each user, which is similar to the SVD method for user 3 and faster than it for user 2. This suggests that ALS is an efficient method for giving recommendations.

# 3: Collaborative Filtering with PySpark
Apache Spark is an engine for processing data efficiently at a large scale. It has APIs in several languages, including Python and R, and PySpark is its Python API. PySpark's features include Spark SQL, DataFrames and machine learning.

Using PySpark DataFrames allows us to efficiently analyse and transform data by using Python and Spark SQL together. Spark SQL is the Apache Spark module for working with structured data, such as DataFrames.

## 3.1 ALS Collaborative Filtering with PySpark
We can use the `pyspark.ml` library to implement ALS. We import the `ALS` class from `pyspark.ml.recommendation` to create our model, then use `.fit()` to train it and `.recommendForAllUsers()` and `.recommendForAllItems()` to make recommendations.

We now implement ALS using PySpark to improve the efficiency of our recommender system, since Spark is highly scalable and would let us apply the system to very large datasets. We will try user-based and item-based implementations.
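
Before using the Spark implementation, it may help to see the alternating idea in isolation. The sketch below is a deliberately simplified, pure-Python version: it uses a single latent factor (rank 1), treats the values as explicit ratings, and runs on a tiny hypothetical dataset. The function name `als_rank1` and the toy data are our own assumptions for illustration; this is not how `pyspark.ml` is implemented internally.

```python
def als_rank1(ratings, n_users, n_items, reg=0.01, n_iters=20):
    """Rank-1 alternating least squares on observed (user, item, value) triples.

    Each pass fixes the item factors and solves a small ridge regression for
    every user factor, then fixes the user factors and does the same for items.
    """
    user_factors = [1.0] * n_users
    item_factors = [1.0] * n_items
    for _ in range(n_iters):
        # Update each user factor with the item factors held fixed
        for u in range(n_users):
            rows = [(item, value) for (user, item, value) in ratings if user == u]
            numerator = sum(value * item_factors[item] for item, value in rows)
            denominator = sum(item_factors[item] ** 2 for item, _ in rows) + reg
            user_factors[u] = numerator / denominator
        # Update each item factor with the user factors held fixed
        for i in range(n_items):
            cols = [(user, value) for (user, item, value) in ratings if item == i]
            numerator = sum(value * user_factors[user] for user, value in cols)
            denominator = sum(user_factors[user] ** 2 for user, _ in cols) + reg
            item_factors[i] = numerator / denominator
    return user_factors, item_factors


# Hypothetical toy data: user 1 behaves like "twice user 0",
# and the entry for (user 1, item 2) is deliberately left unobserved.
observed = [(0, 0, 3.0), (0, 1, 1.0), (0, 2, 2.0), (1, 0, 6.0), (1, 1, 2.0)]

user_f, item_f = als_rank1(observed, n_users=2, n_items=3)
predicted = user_f[1] * item_f[2]
print(f"Predicted value for user 1, item 2: {predicted:.2f}")
```

The unobserved entry is filled in from the pattern shared by the two users, which is the same principle the Spark model applies with more latent factors and far more data.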

In [22]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.recommendation import ALS

# Start Spark session
spark = SparkSession.builder.appName("CollaborativeFilteringALS").getOrCreate()

# Convert cleaned pandas DataFrames to PySpark DataFrames
artists_spark_df = spark.createDataFrame(artists_cleaned)
user_artists_spark_df = spark.createDataFrame(user_artists_cleaned)

### 3.1.1 User-based PySpark ALS

In [23]:
# start timing
start_time = time.time()

# ALS model setup for user-based collaborative filtering
# listening counts are implicit feedback, we are not starting from a cold-start
als = ALS(userCol="userID", itemCol="artistID", ratingCol="weight", coldStartStrategy="drop", implicitPrefs=True)

# Fit the ALS model
model = als.fit(user_artists_spark_df)

# Generate recommendations
user_recommendations = model.recommendForAllUsers(5)

In [24]:
# Create a dictionary to map artistID to artistName
artist_id_to_name = {row['id']: row['name'] for row in artists_spark_df.collect()}

# Function to map artistID to artistName and round scores to 2 decimal places
def map_recommendations(user_recommendations):
    def map_row(row):
        recommendations_with_names = [
            (artist_id_to_name.get(rec[0], "Unknown"), round(rec[1], 2)) for rec in row['recommendations']
        ]
        return (row['userID'], recommendations_with_names)

    mapped_recommendations = user_recommendations.rdd.map(map_row).toDF(["userID", "recommendations"])
    return mapped_recommendations

# Apply the artistID to name mapping function
user_recommendations_with_names = map_recommendations(user_recommendations)

# Show the final recommendations with artist names and rounded scores
user_recommendations_with_names.show(truncate=False)

# end timing
end_time = time.time()

# print the elapsed time
print(f"Time elapsed: {end_time - start_time:.4f} seconds")

+------+--------------------------------------------------------------------------------------------------------------------------------+
|userID|recommendations                                                                                                                 |
+------+--------------------------------------------------------------------------------------------------------------------------------+
|3     |[{Danny Elfman, 1.26}, {Dolly Parton, 1.16}, {Michael Giacchino, 1.15}, {嵐, 1.08}, {Hande Yener, 1.07}]                        |
|5     |[{God Is an Astronaut, 1.11}, {She Wants Revenge, 1.11}, {Portishead, 1.1}, {Cat Power, 1.1}, {Ladytron, 1.1}]                  |
|6     |[{Eminem, 0.53}, {Jay-Z, 0.52}, {Drake, 0.51}, {T.I., 0.51}, {Kanye West, 0.51}]                                                |
|12    |[{Eminem, 1.46}, {Jay-Z, 1.27}, {Scorpions, 1.21}, {2Pac, 1.2}, {Lil' Wayne, 1.17}]                                             |
|13    |[{Lady Gaga, 0.92}, {Coldpl

We can see that the output contains recommendations for all users, so the use of PySpark is effective. Note that the elapsed time is much longer than for the previous methods, but this method has given recommendations for every user at once, not just one user at a time. This shows the efficiency and scalability of PySpark ALS for recommender systems.
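
Since the same start/stop timing pattern is repeated in every cell, it could be wrapped in a small helper. The following is a minimal sketch using only the standard library; the name `timed` and the example workload are hypothetical and are not part of the notebook above.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Time the enclosed block and print the elapsed wall-clock time."""
    result = {"label": label, "elapsed": None}
    start = time.perf_counter()
    try:
        yield result
    finally:
        # Record the duration even if the block raises an exception
        result["elapsed"] = time.perf_counter() - start
        print(f"{label}: {result['elapsed']:.4f} seconds")


# Stand-in workload; in the notebook this would be the fit and recommend calls
with timed("Example workload") as timing:
    total = sum(i * i for i in range(100_000))
```

Keeping the elapsed time in `timing["elapsed"]` would also make it easier to compare methods afterwards, for example by dividing by the number of users served.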

Also, each recommendation in the output comes with a 'score'. This is the relative confidence that a user will like a given artist. Some users, like user 16, have higher scores, while others, like user 28, have much lower scores. This would indicate that the recommendations for user 16 are much stronger than those for user 28, which likely depends on how much data is available for each user.
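
To see why the scores are not on the same scale as the listening counts, it helps to look at how implicit feedback is usually modelled. As we understand it from the Spark documentation, `implicitPrefs=True` follows the approach of Hu, Koren and Volinsky, in which each count is split into a binary preference and a confidence weight controlled by the `alpha` parameter (default 1.0). The helper below is our own illustrative sketch of that transformation, not a Spark API.

```python
def preference_and_confidence(count, alpha=1.0):
    """Split a raw listening count into a binary preference and a confidence.

    preference: 1.0 if the user listened at all, otherwise 0.0
    confidence: grows linearly with the count, starting from 1.0
    """
    preference = 1.0 if count > 0 else 0.0
    confidence = 1.0 + alpha * count
    return preference, confidence


# Hypothetical listening counts, for illustration only
for count in [0, 35, 13883]:
    pref, conf = preference_and_confidence(count)
    print(f"count={count}: preference={pref}, confidence={conf}")
```

The model is trained to reproduce the preference, with the confidence deciding how much each observation matters, which is why the recommendation scores stay close to the 0 to 1 range however large the raw counts are.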

We will analyse the distribution of the scores.

In [25]:
# Explode the recommendations column into individual rows
exploded_user_recommendations = user_recommendations.withColumn("recommendation", F.explode("recommendations"))

# Extract artistID and rating
exploded_user_recommendations = exploded_user_recommendations.select(
    F.col("userID"),
    F.col("recommendation.artistID").alias("artistID"),  # Extract artistID
    F.col("recommendation.rating").alias("score")        # Extract rating as score
)

# Generate summary statistics for the scores
summary_stats = exploded_user_recommendations.select("score").summary()

# Display the summary statistics
summary_stats.show()

+-------+------------------+
|summary|             score|
+-------+------------------+
|  count|              9460|
|   mean|1.1207548875480557|
| stddev|0.2733402298878301|
|    min|        4.2475E-40|
|    25%|         1.0323408|
|    50%|         1.1315149|
|    75%|         1.2711515|
|    max|         2.0517335|
+-------+------------------+



We can see that the scores range from around 0 to 2, with the middle half of the recommendations lying between about 1.03 and 1.27 (the 25th and 75th percentiles). We will therefore say that the best recommendations are those with scores above about 1.27.
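
The idea of keeping only the strongest recommendations can be sketched with a cut-off taken from the score distribution, here the 75th percentile. This is a small standard-library illustration on hypothetical scores; the function name `strongest_recommendations` and the artist labels are assumptions, and in the notebook the same filter could be applied to the exploded Spark DataFrame instead.

```python
import statistics


def strongest_recommendations(recommendations):
    """Keep the (artist, score) pairs whose score exceeds the 75th percentile."""
    scores = [score for _, score in recommendations]
    # quantiles(..., n=4) returns the three quartile cut points
    upper_quartile = statistics.quantiles(scores, n=4, method="inclusive")[2]
    return [(artist, score) for artist, score in recommendations if score > upper_quartile]


# Hypothetical scores, for illustration only
example_recommendations = [
    ("Artist A", 0.5),
    ("Artist B", 1.0),
    ("Artist C", 1.25),
    ("Artist D", 1.5),
    ("Artist E", 2.0),
]

best = strongest_recommendations(example_recommendations)
print(best)
```

Using a percentile rather than a fixed number means the cut-off adapts if the model is retrained and the score distribution shifts.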

### 3.1.2 Item-based PySpark ALS

In [26]:
# start timing
start_time = time.time()

# ALS model setup for item-based collaborative filtering
# Swap userCol and itemCol for item-based filtering
# listening counts are implicit feedback, we are not starting from a cold-start
als = ALS(userCol="artistID", itemCol="userID", ratingCol="weight", coldStartStrategy="drop", implicitPrefs=True)

# Fit the ALS model
model = als.fit(user_artists_spark_df)

# Generate item-based recommendations for each artist (item)
item_recommendations = model.recommendForAllItems(5)

In [27]:
# Apply the artistID to name mapping function
item_recommendations_with_names = map_recommendations(item_recommendations)

# Show the final recommendations with artist names and rounded scores
item_recommendations_with_names.show(truncate=False)

# end timing
end_time = time.time()

# print the elapsed time
print(f"Time elapsed: {end_time - start_time:.4f} seconds")

+------+----------------------------------------------------------------------------------------------------------------------------+
|userID|recommendations                                                                                                             |
+------+----------------------------------------------------------------------------------------------------------------------------+
|3     |[{Superchic[k], 1.21}, {Carter Burwell, 1.17}, {Jonathan Larson, 1.14}, {Tracy Chapman, 1.13}, {Jónsi, 1.06}]               |
|5     |[{John Frusciante, 1.16}, {Beirut, 1.09}, {Yann Tiersen, 1.08}, {Sigur Rós, 1.07}, {Radiohead, 1.06}]                       |
|6     |[{Mary J. Blige, 0.5}, {Eminem, 0.5}, {Whitney Houston, 0.49}, {Ciara, 0.48}, {Brandy, 0.48}]                               |
|12    |[{Epica, 1.41}, {Theatre of Tragedy, 1.27}, {In This Moment, 1.22}, {Alesana, 1.21}, {Serj Tankian, 1.19}]                  |
|13    |[{Lady Gaga, 0.92}, {Avril Lavigne, 0.91}, {Green Day,

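Again, the item-based implementation seems to be more efficient than the user-based one, which may be due to the different amount of computation each orientation requires. Note that, because the user and item columns were swapped when setting up the model, the `userID` column in this output plays the role of the model's item column.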

In [28]:
# Explode the recommendations column into individual rows
exploded_item_recommendations = item_recommendations.withColumn("recommendation", F.explode("recommendations"))

# Extract artistID and rating from the struct fields
exploded_item_recommendations = exploded_item_recommendations.select(
    F.col("userID"),
    F.col("recommendation.artistID").alias("artistID"),  # Extract artistID
    F.col("recommendation.rating").alias("score")        # Extract rating as score
)

# Generate summary statistics for the scores
summary_stats = exploded_item_recommendations.select("score").summary()

# Display the summary statistics
summary_stats.show()

+-------+------------------+
|summary|             score|
+-------+------------------+
|  count|              9460|
|   mean|1.1090791217136604|
| stddev| 0.263306523427101|
|    min|     1.0707016E-38|
|    25%|         1.0336491|
|    50%|         1.1219879|
|    75%|         1.2467672|
|    max|          2.249456|
+-------+------------------+



The scores for the item-based implementation have a similar distribution to the user-based implementation.

## 3.2 Evaluating ALS Collaborative Filtering in PySpark
Now that we have implemented memory-based and model-based methods with both Python and PySpark, we can see that the PySpark ALS method is the most efficient, giving recommendations for all users at once. Of the collaborative filtering methods we have seen, this is the most suitable for our recommender system in practice, since scalability and efficiency are very important.

We will now split the data into training and test sets to evaluate the performance of this model.


In [None]:
# Start timing
start_time = time.time()

# ALS model setup for user-based collaborative filtering
als = ALS(userCol="userID", itemCol="artistID", ratingCol="weight", coldStartStrategy="drop", implicitPrefs=True, regParam=1.0)

# Split data into training and test sets (80% training, 20% testing)
train_data, test_data = user_artists_spark_df.randomSplit([0.8, 0.2], seed=27)

# Fit the ALS model
model = als.fit(train_data)

# Make predictions on the test set
predictions = model.transform(test_data)

# Calculate RMSE (Root Mean Squared Error)
rmse = predictions.withColumn("squared_error", (F.col("prediction") - F.col("weight"))**2)
rmse_value = rmse.select(F.sqrt(F.avg("squared_error"))).first()[0]

# Print evaluation metrics
print(f"Root Mean Squared Error (RMSE): {rmse_value:.4f}")

# Generate recommendations
user_recommendations = model.recommendForAllUsers(5)

# Reuse artist_id_to_name and map_recommendations() defined in cell In [24]
# to attach artist names and round the scores to 2 decimal places
user_recommendations_with_names = map_recommendations(user_recommendations)

# Show the final recommendations with artist names and rounded scores
user_recommendations_with_names.show(truncate=False)

# End timing
end_time = time.time()

# Print the elapsed time
print(f"Time elapsed: {end_time - start_time:.4f} seconds")

Root Mean Squared Error (RMSE): 4492.4848
+------+------------------------------------------------------------------------------------------------------------------------------+
|userID|recommendations                                                                                                               |
+------+------------------------------------------------------------------------------------------------------------------------------+
|3     |[{Hande Yener, 1.14}, {Ani DiFranco, 1.05}, {ムック, 1.0}, {Pleq, 1.0}, {Klaus Badelt, 0.98}]                                 |
|5     |[{Beck, 1.09}, {Kings of Convenience, 1.08}, {Beirut, 1.06}, {The Smiths, 1.05}, {The White Stripes, 1.05}]                   |
|6     |[{Brandy, 0.49}, {Danity Kane, 0.44}, {50 Cent, 0.44}, {B.o.B, 0.43}, {Joss Stone, 0.43}]                                     |
|12    |[{Eminem, 1.57}, {Lenny Kravitz, 1.31}, {No Doubt, 1.24}, {Scooter, 1.24}, {Guano Apes, 1.2}]                                 |
|13    |[

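The RMSE is very high, suggesting that the recommendations are not very meaningful. This is likely due in part to the sparsity of the dataset. Note also that, with `implicitPrefs=True`, the predictions are scores of roughly 0 to 2 (as in the score summaries above), whereas `weight` is a raw listening count, so a large part of this error reflects the difference in scale rather than the quality of the ranking.

We will see whether changing the regularisation parameter `regParam` reduces the RMSE.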

In [30]:
# Experiment 1: With regParam=0.5
als_0_5 = ALS(userCol="userID", itemCol="artistID", ratingCol="weight", coldStartStrategy="drop", implicitPrefs=True, regParam=0.5)
model_0_5 = als_0_5.fit(train_data)
predictions_0_5 = model_0_5.transform(test_data)

# Calculate RMSE for Experiment 1
squared_error_0_5 = predictions_0_5.withColumn("squared_error", (F.col("prediction") - F.col("weight")) ** 2)
rmse_0_5_value = squared_error_0_5.select(F.sqrt(F.avg("squared_error"))).first()[0]
print(f"RMSE with regParam=0.5: {rmse_0_5_value:.4f}")

# Experiment 2: With regParam=1.0
als_1_0 = ALS(userCol="userID", itemCol="artistID", ratingCol="weight", coldStartStrategy="drop", implicitPrefs=True, regParam=1.0)
model_1_0 = als_1_0.fit(train_data)
predictions_1_0 = model_1_0.transform(test_data)

# Calculate RMSE for Experiment 2
squared_error_1_0 = predictions_1_0.withColumn("squared_error", (F.col("prediction") - F.col("weight")) ** 2)
rmse_1_0_value = squared_error_1_0.select(F.sqrt(F.avg("squared_error"))).first()[0]
print(f"RMSE with regParam=1.0: {rmse_1_0_value:.4f}")

RMSE with regParam=0.5: 4492.4849
RMSE with regParam=1.0: 4492.4848


The regularisation does not seem to have had a significant effect on the RMSE: the two values differ only in the fourth decimal place. The RMSE remains very high, so by this measure the model does not seem to give meaningful recommendations.
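
One possible reason the RMSE barely moves is the difference in scale between the two columns being compared. The sketch below uses hypothetical listening counts (not taken from our dataset) to show that, when the targets are in the thousands and the predictions stay between 0 and 1, the RMSE is almost identical whatever the predictions are. It illustrates the arithmetic only and is not an evaluation of our model.

```python
import math


def rmse(predictions, targets):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical raw listening counts, for illustration only
listening_counts = [13883, 11690, 4400, 120, 35]

# Two very different sets of preference-style predictions
all_zero = [0.0] * len(listening_counts)
all_one = [1.0] * len(listening_counts)

rmse_zero = rmse(all_zero, listening_counts)
rmse_one = rmse(all_one, listening_counts)
relative_change = abs(rmse_zero - rmse_one) / rmse_zero

print(f"RMSE with all-zero predictions: {rmse_zero:.2f}")
print(f"RMSE with all-one predictions:  {rmse_one:.2f}")
print(f"Relative change: {relative_change:.6f}")
```

Under this assumption, an error measure that compares like with like, or a ranking-based measure, may be more informative for implicit feedback than RMSE against raw counts.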

# 4: Conclusion
We have investigated collaborative filtering through memory-based and model-based methods. First, we implemented simple user-based and item-based methods, which were slow but successfully gave recommendations. The item-based methods seem to run quicker than the user-based methods due to the structure of the data. We then implemented two model-based methods, SVD and ALS, which were much more efficient than the memory-based methods and allowed recommendations to be given quickly.

All of these methods gave recommendations which ranged in quality and performance depending on the availability of data. Next, we used PySpark to implement ALS and this increased the computational efficiency significantly, allowing us to quickly give recommendations for all of the users. This feature would allow us to scale our recommendation methods to large datasets and would be vital in developing our music recommendation system, which could theoretically have millions of users.

Finally, since it was the most efficient method, we evaluated the performance of the PySpark ALS method using RMSE and found this to be very high, leading us to conclude that there is significant room for improvement in our recommendations. However, although our recommender system may not be as good as we would have liked, we have implemented a number of different collaborative filtering methods and seen the benefits of the scalability and efficiency of PySpark.
