# Content-Based Filtering with PySpark


## 0: Introducing The Model

In this section, we explore content-based filtering using the built in PySpark. We use a dataset which includes information about music artists, their associated tags, and how users interacted with them. We aim to generate artist recommendations for users based on the tags associated with artists they have interacted with. This involves loading the data, creating profiles for the artists, vectorising the tags, calculating cosine similarities and finally generating recommendations based on those similarities.

 Our process for calculating the cosine similarities will involve using the following formula: 

$$

\text{Cosine Similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}}.

$$

This is where:

- A and B represent the feature vectors, each artist profile vectorised.


We have chosen to use PySpark due to seeing PySpark being the most efficient in a content recommender systems, specifically for a large dataset, along with having literature to back up its speed.

## 1: Importing Modules, Requirements, Creating our PySpark Session and Loading Data Data

We have our model requirements listed below.

In [1]:
# Run this cell to automatically generate requirements

!pip install pipreqsnb

!pipreqsnb --savepath report/requirements/04-requirements.txt --encoding utf-8 report/05-Graph-Neural-Network.ipynb
!pipreqsnb --savepath report/requirements/LightGCN-requirements.txt --encoding utf-8 scripts/LightGCN

Traceback (most recent call last):
  File "/Users/harrywilson/anaconda3/bin/pipreqsnb", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/harrywilson/anaconda3/lib/python3.11/site-packages/pipreqsnb/pipreqsnb.py", line 98, in main
    is_file, is_nb = path_is_file(input_path)
                     ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/harrywilson/anaconda3/lib/python3.11/site-packages/pipreqsnb/pipreqsnb.py", line 80, in path_is_file
    raise Exception('{} if an invalid path'.format(path))
Exception: report/05-Graph-Neural-Network.ipynb if an invalid path
Traceback (most recent call last):
  File "/Users/harrywilson/anaconda3/bin/pipreqsnb", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/harrywilson/anaconda3/lib/python3.11/site-packages/pipreqsnb/pipreqsnb.py", line 98, in main
    is_file, is_nb = path_is_file(input_path)
                     ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/harrywilson/anaconda3/lib/python3.11/site-pack

In [2]:
import os

print("REQUIREMENTS for 04-Content-Based-Filtering.ipynb:\n")
with open(os.path.join('..','report','requirements','04-requirements.txt')) as f:
    print(f.read())

REQUIREMENTS for 04-Content-Based-Filtering.ipynb:



FileNotFoundError: [Errno 2] No such file or directory: '../report/requirements/04-requirements.txt'

In [None]:
# Run this cell to automatically install requirements from saved .txt files

!pip install -r report/requirements/04-requirements.txt

[31mERROR: Could not open requirements file: [Errno 2] No such file or directory: 'report/requirements/04-requirements.txt'[0m[31m
[0m

In [1]:
# Importing Required Libraries

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import CountVectorizer, IDF
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.stat import Correlation
from pyspark.mllib.linalg.distributed import RowMatrix
import random

Here, we must intialise our spark session. This is fundamental for using any of spark's great features.

In [2]:
# Initialising our Spark Session

spark = SparkSession.builder \
    .appName("Content-Based Filtering") \
    .getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/11/26 13:12:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
base_directory = "/Users/harrywilson/Desktop/DataScienceToolbox/Assessment2Data"


# FOR THE READER
# base_directory = "your_path"

# Function to load a .dat file as a DataFrame
def load_data(filename):
    file_path = f"{base_directory}/{filename}"
    return spark.read.csv(file_path, sep="\t", header=True, inferSchema=True)

# Load datasets
user_artists_df = load_data("user_artists.dat")
artists_df = load_data("artists.dat")
tags_df = load_data("tags.dat")
user_taggedartists_df = load_data("user_taggedartists.dat")

# Display a few rows from the datasets to ensure everything is working
print("User-Artists Dataset:")
user_artists_df.show(5)

print("Artists Dataset:")
artists_df.show(5)

print("Tags Dataset:")
tags_df.show(5)

print("User-Tagged Artists Dataset:")
user_taggedartists_df.show(5)


                                                                                

User-Artists Dataset:
+------+--------+------+
|userID|artistID|weight|
+------+--------+------+
|     2|      51| 13883|
|     2|      52| 11690|
|     2|      53| 11351|
|     2|      54| 10300|
|     2|      55|  8983|
+------+--------+------+
only showing top 5 rows

Artists Dataset:
+---+-----------------+--------------------+--------------------+
| id|             name|                 url|          pictureURL|
+---+-----------------+--------------------+--------------------+
|  1|     MALICE MIZER|http://www.last.f...|http://userserve-...|
|  2|  Diary of Dreams|http://www.last.f...|http://userserve-...|
|  3|Carpathian Forest|http://www.last.f...|http://userserve-...|
|  4|     Moi dix Mois|http://www.last.f...|http://userserve-...|
|  5|      Bella Morte|http://www.last.f...|http://userserve-...|
+---+-----------------+--------------------+--------------------+
only showing top 5 rows

Tags Dataset:
+-----+-----------------+
|tagID|         tagValue|
+-----+-----------------+


## 2: Preparing Our Data

### 2.1: Creating Artist Profiles

Firstly, we need to create artist profiles for our artists. This involves adding the arists, along with their tags into a common dataframe. 

In [4]:
# Join user_taggedartists with tags for tag information
artist_tags_df = user_taggedartists_df.join(tags_df, on="tagID", how="inner")

# Join artist tags with artists to get artist details and tag names
artist_tags_info_df = artist_tags_df.join(
    artists_df, artist_tags_df.artistID == artists_df.id
).select(
    artist_tags_df["artistID"],
    artists_df["name"].alias("artist_name"),
    artist_tags_df["tagValue"].alias("tag")
)


artist_profiles_df = artist_tags_info_df.groupBy("artistID", "artist_name") \
    .agg(F.collect_list("tag").alias("tags"))

# Display artist profiles
artist_profiles_df.show(5, truncate=False)

[Stage 14:>                                                         (0 + 1) / 1]

+--------+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|artistID|artist_name          |tags                                                                                                                                                                                                                  

                                                                                

### 2.2: Numerically Interpreting our Data

To apply our cosine similarity calculation we need numerical interpretations for our tags. This is because to perform our calculation, we would not be able to put any letters into our formula. To do this we use 'CountVectorizer', which creates a vector of word counts for each artist's tags.

In [5]:
# Ensure tag_text is an array of strings
tags_df = artist_profiles_df.select(
    F.col("artistID"),
    F.col("artist_name"),
    F.col("tags").alias("tag_text")
)

# Vectorise tags
vectoriser = CountVectorizer(inputCol="tag_text", outputCol="raw_features")
vectorised_model = vectoriser.fit(tags_df)
vectorised_df = vectorised_model.transform(tags_df)

# Display vectorized features
vectorised_df.show(5, truncate=False)


[Stage 28:>                                                         (0 + 1) / 1]

+--------+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|artistID|artist_name          |tag_text                            

                                                                                

#### Explaining Our Data

For the 'raw_features' column, we have the format (total_features, [indices], [values]). This is where:

'total_features' represents the total number of unique tags in the dataset

'indices' represents the indices of non-zero features (tags present for the artist in our case)

'values' represents the corresponding count of each tag.

### 2.3 Term Frequency - Inverse Document Frequency Calculation

In this sub-section we compute the TF-IDF for our tags associated with each artist. This is used to evaluate how important a tag is to an artist, in our case specifically. 

We combine two metrics, the term frequency and the inverse document frequency, which are worked out as such:

Term Frequency: $$ \text{TF}(t, d) = \frac{\text{Frequency of term } t \text{ in document } d}{\text{Total terms in document } d} $$

Inverse Document Frequency: $$ \text{IDF}(t) = \log\left(\frac{N}{1 + \text{DF}(t)}\right) $$

This is where:

- t is our term (or tag)
- d is our document (or artist)
- N is the total numbe of artists

Note also: The "+1" is to avoid division by zero

We would then multiply the values, however in our case we will use a built in function. This calculation emphasises terms that are frequent in a specific artist, however rare across all artists. This then helps to identify more unique tags.



In [6]:
# Compute TF-IDF
idf = IDF(inputCol="raw_features", outputCol="features")
idf_model = idf.fit(vectorised_df)
tfidf_df = idf_model.transform(vectorised_df)

# Display TF-IDF features
tfidf_df.show(5, truncate=False)

[Stage 38:>                                                         (0 + 1) / 1]

+--------+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------

                                                                                

### 2.4: Normalising the Feature Vectors

Here, we ensure all the feature vectors are on the same scale. This is because some TF-IDF scores will have entirely different ranges. However, by normalising, we are ensuring calculations on a larger scale will not dominate calculations, as they are not inherently more significant. We use 'MinMaxScaler' to normalise these tag vectors. This function is doing the following calculation: 

$$
x' = \frac{x - x_{\text{min}}}{x_{\text{max}} - x_{\text{min}}}
$$

Where:

- x is the original value of our feature
- x_min is the minimum value of our feature
- x_max is the maximum value of our feature
- x' is the normalised value of our feature

This ensures each feature lies in the [0,1] range, which we can see from our output.



In [7]:
# Normalise the feature vectors
scaler = MinMaxScaler(inputCol="features", outputCol="scaled_features")
scaler_model = scaler.fit(tfidf_df)
scaled_tfidf_df = scaler_model.transform(tfidf_df)

# Display scaled features
scaled_tfidf_df.show(5, truncate=False)

[Stage 51:>                                                         (0 + 1) / 1]

+--------+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------

                                                                                

## 3: Computing Similarities

### 3.1: Cosine Simalarities
Now we compute cosine similarity between artists. We do this using the built in functions 'RowMatrix' and 'columnSimilarities'.

In [8]:
# Convert the scaled features column to RDD of dense vectors
row_matrix_rdd = scaled_tfidf_df.select("scaled_features").rdd.map(lambda row: row.scaled_features.toArray())

# Create a RowMatrix from the RDD
row_matrix = RowMatrix(row_matrix_rdd)

# Compute pairwise cosine similarities
similarities = row_matrix.columnSimilarities()

# Convert the similarities result back to a DataFrame for better readability
similarities_df = similarities.entries.toDF(["artistID_1", "artistID_2", "cosine_similarity"])

# Show 5 results
similarities_df.show(5, truncate=False)

24/11/26 13:14:34 WARN Executor: Managed memory leak detected; size = 36737740 bytes, task 0.0 in stage 63.0 (TID 56)
[Stage 66:>                                                         (0 + 1) / 1]

+----------+----------+--------------------+
|artistID_1|artistID_2|cosine_similarity   |
+----------+----------+--------------------+
|261       |1529      |0.07548513560963972 |
|379       |1752      |0.049629166698546515|
|1644      |7078      |0.35355339059327373 |
|1792      |2863      |0.5298129428260175  |
|20        |8414      |0.0386493975840497  |
+----------+----------+--------------------+
only showing top 5 rows



24/11/26 13:14:35 WARN Executor: Managed memory leak detected; size = 36737740 bytes, task 0.0 in stage 66.0 (TID 57)
                                                                                

### 3.2: Combining Tag Overlap

After investigating our datasets, we found that many artists were being recommended to users, despite not sharing any tags with artists they listen to. This is due to us having to numerically vectorise our tags, and common patterns in these vectors may lead to these mistakes. 

Therefore, to combat this we will also include tag overlap between artist pairs into our similarity calculations along with our cosine similarity, which we have already worked out.

The tag overlap counts shared tags between artist artist pairs. We will then create a hybrid similarity metric which accounts for both relationships.

In [9]:
# Aggregate artist tags into sets
artist_tags_df = artist_tags_info_df.groupBy("artistID") \
    .agg(F.collect_set("tag").alias("tags"))

# Self-join to compute tag overlap between artist pairs
tag_overlap_df = artist_tags_df.alias("a1").join(
    artist_tags_df.alias("a2"),
    F.col("a1.artistID") < F.col("a2.artistID")  # Ensure unique pairs
).select(
    F.col("a1.artistID").alias("artistID_1"),
    F.col("a2.artistID").alias("artistID_2"),
    F.size(F.array_intersect(F.col("a1.tags"), F.col("a2.tags"))).alias("tag_overlap_count")
)

Here, we combine our cosine similarities, with our tag overlap to make our hybrid similarity.

In [10]:
# Join cosine similarity with tag overlap
hybrid_similarity_df = similarities_df.join(
    tag_overlap_df, ["artistID_1", "artistID_2"]
).withColumn(
    "hybrid_similarity",
    F.col("cosine_similarity") * (1 + F.col("tag_overlap_count"))  # Adjusting with tag count
)


Here we want to normalise our similarity, as we again want to make sure our similarity remains comparable. 

In [11]:
# Normalise the hybrid similarity scores
max_similarity = hybrid_similarity_df.agg(F.max("hybrid_similarity")).collect()[0][0]
normalised_similarity_df = hybrid_similarity_df.withColumn(
    "normalised_hybrid_similarity",
    F.col("hybrid_similarity") / max_similarity
)


                                                                                

## 4: Our Recommender  Models

### 4.1: Recommending Artists to a User

In this subsection, we complete the process of generating artist recommendations to a specific user, based on their listening history. We aim to recommend artists the user may enjoy, excluding anybody they have interacted with. We use our hybrid similarity that we calculated above. 

In [12]:
# Input a user ID
input_user_id = 5

# Get artists the user has already interacted with
interacted_artists = user_artists_df.filter(F.col("userID") == input_user_id).select("artistID")

# Collect artistIDs as a list for filtering
interacted_artist_ids = [row["artistID"] for row in interacted_artists.collect()]

# Generate recommendations for the user using normalised similarity
user_recommendations_filtered = (
    user_artists_df.filter(F.col("userID") == input_user_id)
    .join(normalised_similarity_df, user_artists_df.artistID == normalised_similarity_df.artistID_1)
    .filter(~F.col("artistID_2").isin(interacted_artist_ids))  # Exclude already interacted artists
    .orderBy(F.col("normalised_hybrid_similarity").desc())  # Order by normalised similarity
)

# Add artist names to the recommendations
user_recommendations_with_names = (
    user_recommendations_filtered
    .join(artists_df, user_recommendations_filtered.artistID_2 == artists_df.id, how="inner")
    .select(
        "userID",
        "artistID_2",
        F.col("name").alias("artist_name"),
        "normalised_hybrid_similarity"
    )
    .orderBy(F.col("normalised_hybrid_similarity").desc())  # Confirm descending order for display
)

# Display the recommendations with artist names
user_recommendations_with_names.show(10, truncate=False)


                                                                                

+------+----------+-------------------+----------------------------+
|userID|artistID_2|artist_name        |normalised_hybrid_similarity|
+------+----------+-------------------+----------------------------+
|5     |546       |The Ting Tings     |0.35323035418430376         |
|5     |523       |Lindsay Lohan      |0.3441227408202204          |
|5     |72        |Depeche Mode       |0.317667529596587           |
|5     |63        |Enigma             |0.31420141662077566         |
|5     |504       |HIM                |0.2866998642021329          |
|5     |704       |The Pretty Reckless|0.2844757992986386          |
|5     |716       |Kaiser Chiefs      |0.2701736252851966          |
|5     |84        |Cut Copy           |0.2610187508090726          |
|5     |193       |Tears for Fears    |0.24234016869726982         |
|5     |1239      |Weezer             |0.23629649201188965         |
+------+----------+-------------------+----------------------------+
only showing top 10 rows



### 4.2: Finding Similar Artists

In this model, we will input an artist, and assuming they are in our dataset, we will try and find the most similar artists available, based on their tags.

In [13]:
# Input: Artist Name
input_artist_name = "Kanye West"  # Change this to any artist name for testing

# Check if the artist exists in the dataset
artist_exists = artists_df.filter(F.col("name") == input_artist_name).count()

if artist_exists == 0:
    print(f"The artist '{input_artist_name}' is not in the list.")
else:
    # Get the artist ID for the input artist
    input_artist_id = artists_df.filter(F.col("name") == input_artist_name).select("id").first()["id"]

    # Find similar artists using the normalised similarity matrix
    similar_artists = normalised_similarity_df.filter(
        (F.col("artistID_1") == input_artist_id) | (F.col("artistID_2") == input_artist_id)
    ).withColumn(
        "similar_artist_id",
        F.when(F.col("artistID_1") == input_artist_id, F.col("artistID_2")).otherwise(F.col("artistID_1"))
    ).join(
        artists_df, F.col("similar_artist_id") == artists_df.id
    ).select(
        F.col("similar_artist_id"),
        F.col("name").alias("similar_artist_name"),
        F.col("normalised_hybrid_similarity").alias("similarity")
    ).orderBy(F.col("similarity").desc())

    # Limit to the top similar artists
    top_similar_artists = similar_artists.limit(10)

    # Get tags for the input artist
    input_artist_tags = artist_tags_info_df.filter(
        F.col("artistID") == input_artist_id
    ).select("tag").distinct()

    # Get tags for the top similar artists
    similar_artist_tags = artist_tags_info_df.filter(
        F.col("artistID").isin([row["similar_artist_id"] for row in top_similar_artists.collect()])
    ).select("artistID", "tag").distinct()

    # Find shared tags between the input artist and top similar artists
    shared_tags = top_similar_artists.join(
        similar_artist_tags.join(
            input_artist_tags, ["tag"], "inner"
        ).groupBy("artistID").agg(
            F.collect_list("tag").alias("shared_tags")
        ),
        top_similar_artists.similar_artist_id == similar_artist_tags.artistID,
        how="left"
    ).select(
        F.col("similar_artist_name"),
        F.coalesce(F.col("shared_tags"), F.array()).alias("shared_tags"),
        F.col("similarity")
    ).orderBy(F.col("similarity").desc())  # Ensure order matches top_similar_artists

    # Display Results
    print(f"Artists similar to '{input_artist_name}':")
    top_similar_artists.show(10, truncate=False)

    print(f"Shared tags with '{input_artist_name}':")
    shared_tags.show(10, truncate=False)


                                                                                

Artists similar to 'Kanye West':


                                                                                

+-----------------+-------------------+--------------------+
|similar_artist_id|similar_artist_name|similarity          |
+-----------------+-------------------+--------------------+
|89               |Lady Gaga          |0.14288972265061317 |
|289              |Britney Spears     |0.11401473531925992 |
|377              |Linkin Park        |0.07405474289417847 |
|265              |Céline Dion        |0.049226760081588175|
|475              |Eminem             |0.04588529551812003 |
|279              |Brandy             |0.04501347750183131 |
|355              |Jason Mraz         |0.04455178860487453 |
|327              |Chris Brown        |0.04371509259061781 |
|209              |My Chemical Romance|0.04146190612350378 |
|59               |New Order          |0.03996776448679086 |
+-----------------+-------------------+--------------------+

Shared tags with 'Kanye West':




+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+
|similar_artist_name|shared_tags                                                                                                                                                                                                                                                                                                            |similarity          |
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

                                                                                

## 4: Testing the Performance of Our Models

### 4.1: Analysing Performance of Our User Recommendation Model

In this section, we will evalutate the performance of the user recommendation model using Root Mean Squared Error (RMSE). This measures how well the predicted artist similarity scores align with actual user weights in our test dataset. 

I would like to note, we kept the same performance metric as our Collaborative Based Filtering model to ensure comparability.


In [14]:
from pyspark.sql import functions as F

# Split the dataset into training and test sets
train_data, test_data = user_artists_df.randomSplit([0.8, 0.2], seed=27)

# Generate recommendations for the test set users
test_user_ids = [row["userID"] for row in test_data.select("userID").distinct().collect()]

recommendations = (
    user_artists_df.filter(F.col("userID").isin(test_user_ids))
    .join(normalised_similarity_df, user_artists_df.artistID == normalised_similarity_df.artistID_1)
    .select("userID", "artistID_2", "normalised_hybrid_similarity")
)

# Merge recommendations with actual ratings from the test data
predictions_and_ratings = (
    recommendations.join(test_data, (recommendations.userID == test_data.userID) & 
                         (recommendations.artistID_2 == test_data.artistID), "inner")
    .select(F.col("normalised_hybrid_similarity").alias("prediction"), F.col("weight").alias("actual"))
)

# Calculate RMSE
rmse_df = predictions_and_ratings.withColumn("squared_error", (F.col("prediction") - F.col("actual")) ** 2)
rmse_value = rmse_df.select(F.sqrt(F.avg("squared_error"))).first()[0]

print(f"Root Mean Squared Error (RMSE) for user-based predictions: {rmse_value:.4f}")




Root Mean Squared Error (RMSE) for user-based predictions: 3495.1663


                                                                                

While our RMSE remains relatively high, at around 3500, we can see this performs better than our collaborative based filtering model which is also based on artist recommendations to users, which gave us an RMSE of around 4500. WHile better, this indicates there is much room for improvement.

I would also like to note, the collaborative bassed filtering model performed more efficiently, so for a larger dataset, this would have to be a consideration when picking a model.

### 4.2: Analysing the Peformance of Our Artist Similarity Model

Here we evaluate how well our artist similarity model performs in finding artists similar to a given artist input, based on actual user interaction data. This involves comparing the model's predictions to data derived from user behaviour.

We achieve this user behaviour data by identifying co-interacted artists. We achieve this by:

- Identifying users interacted with the input artist
- Identifying other artists these users have interacted with 
- Agreggating play counts to calculate an average co-interaction score for each similar artist. This co-interaction score is 'avg_weight'

We then calculate the RMSE using actual co-interaction scores, and our predicted scores.



In [15]:
# Input: Artist Name
input_artist_name = "Kanye West"  # Replace with the desired artist name

# Check if the artist exists in the dataset
artist_exists = artists_df.filter(F.col("name") == input_artist_name).count()

if artist_exists == 0:
    print(f"The artist '{input_artist_name}' is not in the dataset.")
else:
    # Get the artist ID for the input artist
    input_artist_id = artists_df.filter(F.col("name") == input_artist_name).select("id").first()["id"]

    # Get similar artists from the model
    predicted_similar_artists = normalised_similarity_df.filter(
        (F.col("artistID_1") == input_artist_id) | (F.col("artistID_2") == input_artist_id)
    ).withColumn(
        "similar_artist_id",
        F.when(F.col("artistID_1") == input_artist_id, F.col("artistID_2")).otherwise(F.col("artistID_1"))
    ).select("similar_artist_id", "normalised_hybrid_similarity")

    # Get actual co-interaction scores based on user behavior
actual_similarities = (
    user_artists_df.filter(F.col("artistID") == input_artist_id)
    .select(F.col("userID").alias("interacted_userID"))  # Alias userID for clarity
    .join(
        user_artists_df.alias("other"),  # Alias for the second instance of the dataset
        F.col("interacted_userID") == F.col("other.userID")
    )
    .groupBy(F.col("other.artistID").alias("similar_artist_id"))  # Alias artistID for clarity
    .agg(F.avg("other.weight").alias("avg_weight"))
    .filter(F.col("similar_artist_id") != input_artist_id)  # Exclude the input artist
)

# Combine predictions with actuals
comparison = predicted_similar_artists.join(
    actual_similarities, predicted_similar_artists.similar_artist_id == actual_similarities.similar_artist_id, "inner"
).select(
    F.col("normalised_hybrid_similarity").alias("prediction"),
    F.col("avg_weight").alias("actual")
)

# Calculate RMSE
rmse_artist_df = comparison.withColumn("squared_error", (F.col("prediction") - F.col("actual")) ** 2)
rmse_artist_value = rmse_artist_df.select(F.sqrt(F.avg("squared_error"))).first()[0]

print(f"Root Mean Squared Error (RMSE) for similar artists to '{input_artist_name}': {rmse_artist_value:.4f}")





Root Mean Squared Error (RMSE) for similar artists to 'Kanye West': 1321.4567


                                                                                

Here, with an input of Kanye West, we see a RMSE of ~1300. This is a much better RMSE than our previous model, implying our user recommendation is performing much better. However, we are cherry picking our input artist here, so I would like to see how the model would perform with an input of a random selection of artists. However, we cannot do many as this is proving to be computationally expensive, another problem for a larger dataset.

I would like to note, we have only included artists with more than 12 tags, this is because due to the sparsity of the tags in some artists, they have no similar artists for this specific content based filtering.



In [31]:
random.seed(55)

# Filter artists who have more than 12 tags
artists_with_tags_count = artist_tags_info_df.groupBy("artistID").count()
artists_with_more_than_12_tags = artists_with_tags_count.filter(F.col("count") > 12)

# Select 3 random artists with more than 10 tags
random_artists = artists_df.join(artists_with_more_than_12_tags, artists_df.id == artists_with_more_than_12_tags.artistID) \
                           .select("id", "name") \
                           .orderBy(F.rand(seed=55)) \
                           .limit(3)

# Initialize a list to store RMSE values for each artist
rmse_values = []

# Loop over each artist and compute RMSE
for row in random_artists.collect():
    input_artist_name = row["name"]
    input_artist_id = row["id"]

    # Check if the artist exists in the dataset
    artist_exists = artists_df.filter(F.col("name") == input_artist_name).count()

    if artist_exists == 0:
        print(f"The artist '{input_artist_name}' is not in the dataset.")
    else:
        # Get similar artists from the model (predicted similarities)
        predicted_similar_artists = normalised_similarity_df.filter(
            (F.col("artistID_1") == input_artist_id) | (F.col("artistID_2") == input_artist_id)
        ).withColumn(
            "similar_artist_id",
            F.when(F.col("artistID_1") == input_artist_id, F.col("artistID_2")).otherwise(F.col("artistID_1"))
        ).select("similar_artist_id", "normalised_hybrid_similarity")

        # Get actual co-interaction scores (based on user behavior)
        actual_similarities = (
            user_artists_df.filter(F.col("artistID") == input_artist_id)
            .select(F.col("userID").alias("interacted_userID"))
            .join(
                user_artists_df.alias("other"),
                F.col("interacted_userID") == F.col("other.userID")
            )
            .groupBy(F.col("other.artistID").alias("similar_artist_id"))
            .agg(F.avg("other.weight").alias("avg_weight"))
            .filter(F.col("similar_artist_id") != input_artist_id)  # Exclude the input artist
        )

        # Combine predicted similarities with actual similarities
        comparison = predicted_similar_artists.join(
            actual_similarities,
            predicted_similar_artists.similar_artist_id == actual_similarities.similar_artist_id,
            "inner"
        ).select(
            F.col("normalised_hybrid_similarity").alias("prediction"),
            F.col("avg_weight").alias("actual")
        )

        # Check if the comparison DataFrame is empty
        if comparison.count() == 0:
            print(f"No similar artists found for '{input_artist_name}'. Skipping RMSE calculation.")
            rmse_values.append(None)  # Append None or a placeholder if no comparison is available
        else:
            # Calculate RMSE for this artist
            rmse_artist_df = comparison.withColumn("squared_error", (F.col("prediction") - F.col("actual")) ** 2)
            rmse_artist_value = rmse_artist_df.select(F.sqrt(F.avg("squared_error"))).first()[0]

            # Handle case if RMSE is None
            if rmse_artist_value is not None:
                print(f"Root Mean Squared Error (RMSE) for similar artists to '{input_artist_name}': {rmse_artist_value:.4f}")
            else:
                print(f"RMSE could not be calculated for '{input_artist_name}' due to insufficient data.")
            
            # Append RMSE value for this artist to the list
            rmse_values.append(rmse_artist_value)

# Calculate the average RMSE, excluding None values
rmse_values_filtered = [x for x in rmse_values if x is not None]
average_rmse = sum(rmse_values_filtered) / len(rmse_values_filtered) if rmse_values_filtered else 0
print(f"\nAverage RMSE for the 3 random artists (with more than 12 tags): {average_rmse:.4f}")


Exception in thread "serve-DataFrame" java.net.SocketTimeoutException: Accept timed out
	at java.base/java.net.PlainSocketImpl.socketAccept(Native Method)
	at java.base/java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:474)
	at java.base/java.net.ServerSocket.implAccept(ServerSocket.java:565)
	at java.base/java.net.ServerSocket.accept(ServerSocket.java:533)
	at org.apache.spark.security.SocketAuthServer$$anon$1.run(SocketAuthServer.scala:65)
                                                                                

Root Mean Squared Error (RMSE) for similar artists to 'Maroon 5': 1000.5391


                                                                                

Root Mean Squared Error (RMSE) for similar artists to 'Limp Bizkit': 1195.4482




Root Mean Squared Error (RMSE) for similar artists to 'Gorgoroth': 410.7118

Average RMSE for the 3 random artists (with more than 12 tags): 868.8997


                                                                                

We have produce a much better RMSE, relatively, here. This indicates our model is working well. However, this is massively limited, as after lots of testing, only allowing artists with more than 12 tags was when we started to make sure every artist had a similar recommendation.

This model is limited with this data set. It would require a larger amount of data per artist for this content based recommender system to be more effective.

## 5: Conclusion

