## Content-Based Filtering with PySpark

In this section, we explore content-based filtering using PySpark's built-in machine learning tools. We use a dataset that contains information about music artists, the tags associated with them, and how users interacted with them. The aim is to generate artist recommendations for each user based on the tags attached to the artists they have interacted with. This involves loading the data, creating artist profiles, vectorising the tags, calculating similarities, and finally generating recommendations from those similarities.

# Chapter 0: Importing Required Libraries

All the libraries we use come from PySpark, which distributes computation across workers and therefore scales to large datasets.

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import CountVectorizer, IDF
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.stat import Correlation
from pyspark.mllib.linalg.distributed import RowMatrix

# Chapter 1: Loading Our Data

First, we initialise our Spark session, which is required for any operation in PySpark. The session is the entry point to Spark's DataFrame API and its machine learning capabilities.

In [2]:
spark = SparkSession.builder \
    .appName("Content-Based Filtering") \
    .getOrCreate()



Then, we define the base directory for our dataset files, so that every file can be loaded from a single location.

In [3]:
base_directory = "/Users/harrywilson/Desktop/DataScienceToolbox/Assessment2Data"

Finally, we load our datasets, which cover artists, their tags, and user interactions. Each tab-separated .dat file is read into a Spark DataFrame.

In [4]:
# Function to load a .dat file as a DataFrame
def load_data(filename):
    file_path = f"{base_directory}/{filename}"
    return spark.read.csv(file_path, sep="\t", header=True, inferSchema=True)

# Load datasets
user_artists_df = load_data("user_artists.dat")
artists_df = load_data("artists.dat")
tags_df = load_data("tags.dat")
user_taggedartists_df = load_data("user_taggedartists.dat")

# Display a few rows from the datasets to ensure everything is working
print("User-Artists Dataset:")
user_artists_df.show(5)

print("Artists Dataset:")
artists_df.show(5)

print("Tags Dataset:")
tags_df.show(5)

print("User-Tagged Artists Dataset:")
user_taggedartists_df.show(5)


                                                                                

User-Artists Dataset:
+------+--------+------+
|userID|artistID|weight|
+------+--------+------+
|     2|      51| 13883|
|     2|      52| 11690|
|     2|      53| 11351|
|     2|      54| 10300|
|     2|      55|  8983|
+------+--------+------+
only showing top 5 rows

Artists Dataset:
+---+-----------------+--------------------+--------------------+
| id|             name|                 url|          pictureURL|
+---+-----------------+--------------------+--------------------+
|  1|     MALICE MIZER|http://www.last.f...|http://userserve-...|
|  2|  Diary of Dreams|http://www.last.f...|http://userserve-...|
|  3|Carpathian Forest|http://www.last.f...|http://userserve-...|
|  4|     Moi dix Mois|http://www.last.f...|http://userserve-...|
|  5|      Bella Morte|http://www.last.f...|http://userserve-...|
+---+-----------------+--------------------+--------------------+
only showing top 5 rows

Tags Dataset:
+-----+-----------------+
|tagID|         tagValue|
+-----+-----------------+
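
To make the loading step more concrete, here is a minimal plain-Python sketch of what reading a tab-separated file with a header row amounts to. The `parse_dat` helper and the inline `sample` text are hypothetical and only for illustration; the sample rows reuse values from the User-Artists output above, and the integer conversion stands in for schema inference under the assumption that every column is numeric.

```python
import csv
import io

# Hypothetical sample in the same layout as user_artists.dat (tab-separated, header row)
sample = "userID\tartistID\tweight\n2\t51\t13883\n2\t52\t11690\n"

def parse_dat(text):
    """Parse tab-separated text with a header row into a list of dicts of ints."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    # Assumption: all columns are integers, as in user_artists.dat
    return [{key: int(value) for key, value in row.items()} for row in reader]

rows = parse_dat(sample)
print(rows[0])
```

Spark performs the same kind of parsing in parallel across partitions, which is why the notebook uses spark.read.csv rather than a single-machine reader.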


# Chapter 2: Creating Artist Profiles

The first step is to create a profile for each artist from the tags that users have applied to them. We join the tagging records with the tag names and the artist names, then aggregate the tags per artist to form the profile.

In [5]:
# Join user_taggedartists with tags for tag information
artist_tags_df = user_taggedartists_df.join(tags_df, on="tagID", how="inner")

# Join artist tags with artists to get artist details and tag names
artist_tags_info_df = artist_tags_df.join(
    artists_df, artist_tags_df.artistID == artists_df.id
).select(
    artist_tags_df["artistID"],
    artists_df["name"].alias("artist_name"),
    artist_tags_df["tagValue"].alias("tag")
)


# Aggregate all tags applied to each artist into one list (duplicates are kept)
artist_profiles_df = artist_tags_info_df.groupBy("artistID", "artist_name") \
    .agg(F.collect_list("tag").alias("tags"))

# Display artist profiles
artist_profiles_df.show(5, truncate=False)


+--------+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|artistID|artist_name          |tags                                                                                                                                                                                                                  
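
The aggregation above can be sketched in plain Python to show how the choice of aggregate affects the profile. The `tag_events` data below is hypothetical; the two dictionaries mirror the behaviour of `F.collect_list` (duplicates kept) and `F.collect_set` (duplicates dropped) respectively.

```python
from collections import defaultdict

# Hypothetical (artistID, tag) pairs; each pair is one user's tagging action
tag_events = [(1, "gothic"), (1, "j-rock"), (1, "gothic"), (2, "darkwave")]

tags_as_list = defaultdict(list)  # mirrors F.collect_list: duplicates kept
tags_as_set = defaultdict(set)    # mirrors F.collect_set: duplicates dropped
for artist_id, tag in tag_events:
    tags_as_list[artist_id].append(tag)
    tags_as_set[artist_id].add(tag)

print(tags_as_list[1])
print(sorted(tags_as_set[1]))
```

Keeping duplicates means a tag applied by many users contributes a higher count to the artist's profile, whereas a set treats every tag as simply present or absent.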

                                                                                

To calculate similarities, which is central to content-based filtering, we need to convert the tags into a numerical representation. We therefore use CountVectorizer, which creates a vector of tag counts for each artist.

In [6]:
# Select the tag list for each artist; CountVectorizer expects an array of strings.
# A new name is used so that the tags dataset loaded earlier (tags_df) is not overwritten.
artist_tag_text_df = artist_profiles_df.select(
    F.col("artistID"),
    F.col("artist_name"),
    F.col("tags").alias("tag_text")
)

# Vectorise tags
vectoriser = CountVectorizer(inputCol="tag_text", outputCol="raw_features")
vectorised_model = vectoriser.fit(artist_tag_text_df)
vectorised_df = vectorised_model.transform(artist_tag_text_df)

# Display vectorised features
vectorised_df.show(5, truncate=False)



+--------+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|artistID|artist_name          |tag_text                            

                                                                                

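The raw_features column has the format (total_features, [indices], [values]), where:
- total_features: the size of the vocabulary, i.e. the number of unique tags in the dataset
- indices: the indices of the non-zero features (the tags present for the artist)
- values: the corresponding count of each tag. Because the profiles are built with collect_list, a tag applied to an artist several times is counted once per application; the values would all be 1 only if duplicates were removed first.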
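
The sparse format described above can be reproduced with a short plain-Python sketch. The `profiles` data and the `to_sparse` helper are hypothetical. The vocabulary is ordered by total frequency, most frequent first, as CountVectorizer does; breaking ties alphabetically is an assumption made here only to keep the example deterministic.

```python
from collections import Counter

# Hypothetical artist profiles (artistID -> list of tags, duplicates kept)
profiles = {
    1: ["rock", "gothic", "rock"],
    2: ["electronic", "rock"],
}

# Build a vocabulary ordered by total frequency (ties broken alphabetically here)
totals = Counter(tag for tags in profiles.values() for tag in tags)
vocabulary = [tag for tag, _ in sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))]
index_of = {tag: i for i, tag in enumerate(vocabulary)}

def to_sparse(tags):
    """Return (total_features, indices, values) for one artist's tag list."""
    counts = Counter(index_of[tag] for tag in tags)
    indices = sorted(counts)
    return (len(vocabulary), indices, [float(counts[i]) for i in indices])

print(vocabulary)              # ['rock', 'electronic', 'gothic']
print(to_sparse(profiles[1]))  # (3, [0, 2], [2.0, 1.0])
```

Artist 1 was tagged "rock" twice, so the value at index 0 is 2.0 rather than 1.0, which illustrates how duplicate tags show up as counts.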


Next, we compute the Term Frequency-Inverse Document Frequency (TF-IDF) weights. This down-weights tags that appear across many artists, so that the more distinctive tags in an artist's profile carry more weight.
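
As a small worked example of the weighting, the sketch below applies the smoothed formula given in the Spark ML documentation, idf = log((m + 1) / (df + 1)), where m is the number of artists and df is the number of artists that carry the tag. The `raw_counts` matrix is hypothetical, and the sketch assumes the default setting in which no minimum document frequency is applied.

```python
import math

# Hypothetical tag counts per artist (rows) over a 3-tag vocabulary (columns)
raw_counts = [
    [2.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
]

num_docs = len(raw_counts)
# Document frequency: number of artists whose count for the tag is non-zero
doc_freq = [sum(1 for row in raw_counts if row[j] > 0) for j in range(3)]
# Smoothed IDF: log((m + 1) / (df + 1)), using the natural logarithm
idf = [math.log((num_docs + 1) / (df + 1)) for df in doc_freq]
# TF-IDF is the raw count multiplied by the tag's IDF weight
tfidf = [[count * weight for count, weight in zip(row, idf)] for row in raw_counts]

print(doc_freq)
print(idf)
```

The first tag appears on every artist, so its IDF is log(4 / 4) = 0 and it contributes nothing to any profile, while the two rarer tags each receive a weight of log(2).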

In [7]:
# Compute TF-IDF
idf = IDF(inputCol="raw_features", outputCol="features")
idf_model = idf.fit(vectorised_df)
tfidf_df = idf_model.transform(vectorised_df)

# Display TF-IDF features
tfidf_df.show(5, truncate=False)


Now, we normalise the feature vectors with MinMaxScaler, which rescales each feature (tag) to the range [0, 1] across all artists. This puts every feature on the same scale before we calculate similarities.
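
The rescaling can be illustrated with a plain-Python sketch of per-feature min-max scaling. The `min_max_scale` helper and the `features` matrix are hypothetical; the handling of constant columns (mapped to the midpoint of the target range) follows the rule stated in the Spark ML documentation for MinMaxScaler.

```python
def min_max_scale(rows, lower=0.0, upper=1.0):
    """Rescale each column of `rows` to [lower, upper] independently."""
    columns = list(zip(*rows))
    col_min = [min(c) for c in columns]
    col_max = [max(c) for c in columns]
    scaled = []
    for row in rows:
        new_row = []
        for value, lo, hi in zip(row, col_min, col_max):
            if hi == lo:
                # Constant column: map to the midpoint of the target range
                new_row.append(0.5 * (upper + lower))
            else:
                new_row.append((value - lo) / (hi - lo) * (upper - lower) + lower)
        scaled.append(new_row)
    return scaled

# Hypothetical TF-IDF features: 3 artists (rows) x 3 tags (columns)
features = [[0.0, 2.0, 5.0], [1.0, 2.0, 0.0], [4.0, 2.0, 10.0]]
scaled = min_max_scale(features)
print(scaled)
```

Note that scaling works column by column, so a zero entry can become non-zero (the constant second column becomes 0.5 everywhere). This is why the scaled vectors are generally dense even when the input vectors are sparse.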

In [8]:

# Normalise the feature vectors
scaler = MinMaxScaler(inputCol="features", outputCol="scaled_features")
scaler_model = scaler.fit(tfidf_df)
scaled_tfidf_df = scaler_model.transform(tfidf_df)

# Display scaled features
scaled_tfidf_df.show(5, truncate=False)



In [9]:
# Step 1: Compute pairwise cosine similarity.
# columnSimilarities() compares the columns of the RowMatrix and returns each
# pair of column indices once (upper triangle only).
row_matrix_rdd = scaled_tfidf_df.select("scaled_features").rdd.map(lambda row: row.scaled_features.toArray())
row_matrix = RowMatrix(row_matrix_rdd)
similarities = row_matrix.columnSimilarities()
similarities_df = similarities.entries.toDF(["artistID_1", "artistID_2", "cosine_similarity"])

# Step 2: Get the artist tags for tag overlap calculation
artist_tags_df = artist_tags_info_df.groupBy("artistID") \
    .agg(F.collect_set("tag").alias("tags"))

# Step 3: Self-join artist_tags_df to calculate tag overlap between artist pairs
tag_overlap_df = artist_tags_df.alias("a1").join(
    artist_tags_df.alias("a2"),
    F.col("a1.artistID") < F.col("a2.artistID")  # Ensures unique pairs
).select(
    F.col("a1.artistID").alias("artistID_1"),
    F.col("a2.artistID").alias("artistID_2"),
    F.size(F.array_intersect(F.col("a1.tags"), F.col("a2.tags"))).alias("tag_overlap_count")  # Calculate shared tags
)

# Step 4: Combine cosine similarity with tag overlap
hybrid_similarity_df = similarities_df.join(
    tag_overlap_df,
    ["artistID_1", "artistID_2"]
).withColumn(
    "hybrid_similarity",
    F.col("cosine_similarity") * (1 + F.col("tag_overlap_count"))  # Hybrid metric
)

# Step 5: Normalize the hybrid similarity scores
max_similarity = hybrid_similarity_df.agg(F.max("hybrid_similarity")).collect()[0][0]
normalized_similarity_df = hybrid_similarity_df.withColumn(
    "normalized_hybrid_similarity",
    F.col("hybrid_similarity") / max_similarity
)

# Step 6: Find similar artists for a specific artist (same logic as before)
input_artist_name = "Eminem"
input_artist_id = artists_df.filter(F.col("name") == input_artist_name).select("id").first()["id"]

similar_artists = normalized_similarity_df.filter(
    (F.col("artistID_1") == input_artist_id) | (F.col("artistID_2") == input_artist_id)
).withColumn(
    "similar_artist_id",
    F.when(F.col("artistID_1") == input_artist_id, F.col("artistID_2")).otherwise(F.col("artistID_1"))
).join(
    artists_df, F.col("similar_artist_id") == artists_df.id
).select(
    F.col("similar_artist_id"),
    F.col("name").alias("similar_artist_name"),
    F.col("normalized_hybrid_similarity").alias("similarity")
).orderBy(F.col("similarity").desc())

# Display results
print(f"Artists similar to '{input_artist_name}':")
similar_artists.show(10, truncate=False)



Artists similar to 'Eminem':




+-----------------+-------------------+-------------------+
|similar_artist_id|similar_artist_name|similarity         |
+-----------------+-------------------+-------------------+
|1613             |Jay-Z              |0.2612198716513835 |
|271              |Mos Def            |0.22742034675157372|
|195              |Bright Eyes        |0.14916947047403017|
|56               |Daft Punk          |0.14817306100618768|
|7                |Marilyn Manson     |0.14631602443488229|
|1081             |Ace of Base        |0.1299499714712334 |
|190              |Muse               |0.12390243627135306|
|288              |Rihanna            |0.11899538945412451|
|53               |Air                |0.11511055113085385|
|248              |Pharoahe Monch     |0.10867699752388749|
+-----------------+-------------------+-------------------+
only showing top 10 rows
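
To see how the hybrid metric in the cell above behaves, here is a small worked example in plain Python. The pair labels and the similarity and overlap values are hypothetical; the formula, cosine similarity multiplied by one plus the number of shared tags and then divided by the maximum score, is the one used in Steps 4 and 5.

```python
# Hypothetical (cosine_similarity, tag_overlap_count) values for three artist pairs
pairs = {
    ("A", "B"): (0.50, 3),
    ("A", "C"): (0.80, 0),
    ("B", "C"): (0.25, 7),
}

# Hybrid metric: cosine similarity boosted by the number of shared tags
hybrid = {pair: cos * (1 + overlap) for pair, (cos, overlap) in pairs.items()}

# Divide by the largest score so that the best pair scores exactly 1.0
max_score = max(hybrid.values())
normalised = {pair: score / max_score for pair, score in hybrid.items()}

for pair, score in sorted(normalised.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

In this example the pair with the highest cosine similarity (A, C) ends up ranked last because it shares no tags, which shows how strongly the overlap count can dominate the combined score.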



                                                                                

# Chapter : Similarities Between Artists

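We can compute cosine similarity using RowMatrix and columnSimilarities. Note that columnSimilarities compares the columns of the matrix, not its rows, and returns each pair only once (the upper triangle, with the first index smaller than the second). For the indices in the result to identify artists, each artist must therefore be a column of the matrix; when each row is an artist, the indices refer to feature (tag) positions instead.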
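
The difference between comparing columns and comparing rows is easy to check on a tiny matrix. The sketch below is plain Python rather than Spark; the `cosine` and `column_similarities` helpers and the `artist_by_tag` matrix are hypothetical and only mimic the upper-triangular, column-wise behaviour described above.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def column_similarities(matrix):
    """Upper-triangular cosine similarities between the columns of `matrix`."""
    columns = list(zip(*matrix))
    return {
        (i, j): cosine(columns[i], columns[j])
        for i in range(len(columns))
        for j in range(i + 1, len(columns))
    }

# Hypothetical matrix: 2 artists (rows) x 3 tags (columns)
artist_by_tag = [
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
]

tag_pairs = column_similarities(artist_by_tag)              # compares the 3 tags
tag_by_artist = [list(col) for col in zip(*artist_by_tag)]  # transpose the matrix
artist_pairs = column_similarities(tag_by_artist)           # compares the 2 artists

print(sorted(tag_pairs))
print(artist_pairs)
```

With artists as rows, the result contains three pairs indexed by tag position; only after transposing does it contain the single artist pair (0, 1). The same reasoning applies to the distributed matrix, where the transpose has to be taken before the column similarities are computed.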

In [9]:

# Convert the scaled features column to RDD of dense vectors
row_matrix_rdd = scaled_tfidf_df.select("scaled_features").rdd.map(lambda row: row.scaled_features.toArray())

# Create a RowMatrix from the RDD
row_matrix = RowMatrix(row_matrix_rdd)

# Compute pairwise cosine similarities
similarities = row_matrix.columnSimilarities()

# Convert the similarities result back to a DataFrame for better readability
similarities_df = similarities.entries.toDF(["artistID_1", "artistID_2", "similarity"])

# Show the top 5 results
similarities_df.show(5, truncate=False)



+----------+----------+--------------------+
|artistID_1|artistID_2|similarity          |
+----------+----------+--------------------+
|261       |1529      |0.07548513560963972 |
|379       |1752      |0.049629166698546515|
|1644      |7078      |0.35355339059327373 |
|1792      |2863      |0.5298129428260175  |
|20        |8414      |0.0386493975840497  |
+----------+----------+--------------------+
only showing top 5 rows




In [10]:
# Keep only the interactions whose artist appears in the artists table
user_artist_df = user_artists_df.join(artists_df, user_artists_df.artistID == artists_df.id).select(
    user_artists_df["userID"], 
    user_artists_df["artistID"]
)

# Join user interactions with the similarity pairs and average per candidate.
# Only pairs in which the user's artist is artistID_1 are matched here.
user_recommendations = user_artist_df.join(
    similarities_df,
    user_artist_df.artistID == similarities_df.artistID_1
).groupBy("userID", "artistID_2") \
 .agg(F.mean("similarity").alias("avg_similarity")) \
 .orderBy("userID", "avg_similarity", ascending=False)

# Display recommendations
user_recommendations.show(50, truncate=False)


+------+----------+-------------------+
|userID|artistID_2|avg_similarity     |
+------+----------+-------------------+
|2100  |9551      |1.0                |
|2100  |7300      |1.0                |
|2100  |6776      |1.0                |
|2100  |8846      |1.0                |
|2100  |9346      |1.0                |
|2100  |6104      |0.7071067811865475 |
|2100  |8534      |0.7071067811865475 |
|2100  |9181      |0.7071067811865475 |
|2100  |9074      |0.5054855699880143 |
|2100  |8843      |0.5054855699880143 |
|2100  |8605      |0.5054855699880143 |
|2100  |9055      |0.5054855699880143 |
|2100  |9194      |0.5054855699880143 |
|2100  |9104      |0.5054855699880143 |
|2100  |8678      |0.5054855699880143 |
|2100  |9221      |0.5054855699880143 |
|2100  |8883      |0.5054855699880143 |
|2100  |9483      |0.5054855699880143 |
|2100  |8825      |0.5054855699880143 |
|2100  |9695      |0.5054855699880143 |
|2100  |9640      |0.5054855699880143 |
|2100  |9573      |0.5054855699880143 |


                                                                                

In [11]:
# Input a user ID
input_user_id = 5

# Get artists the user has already interacted with
interacted_artists = user_artists_df.filter(F.col("userID") == input_user_id).select("artistID")

# Collect artistIDs as a list for filtering
interacted_artist_ids = [row["artistID"] for row in interacted_artists.collect()]

# Generate recommendations for the user
user_recommendations_filtered = (
    user_artist_df.filter(F.col("userID") == input_user_id)
    .join(similarities_df, user_artist_df.artistID == similarities_df.artistID_1)
    .filter(~F.col("artistID_2").isin(interacted_artist_ids))  # Exclude already interacted artists
    .groupBy("userID", "artistID_2")
    .agg(F.mean("similarity").alias("avg_similarity"))
    .orderBy(F.col("avg_similarity").desc())  # Ensure descending order
)

# Add artist names to the recommendations
user_recommendations_with_names = (
    user_recommendations_filtered
    .join(artists_df, user_recommendations_filtered.artistID_2 == artists_df.id, how="inner")
    .select(
        "userID",
        "artistID_2",
        F.col("name").alias("artist_name"),
        "avg_similarity"
    )
    .orderBy(F.col("avg_similarity").desc())  # Reconfirm descending order for display
)

# Display the recommendations with artist names
user_recommendations_with_names.show(10, truncate=False)



+------+----------+---------------------+------------------+
|userID|artistID_2|artist_name          |avg_similarity    |
+------+----------+---------------------+------------------+
|5     |165       |Planet Funk          |0.687188672645603 |
|5     |84        |Cut Copy             |0.6281262398957679|
|5     |66        |Faithless            |0.5990630007927742|
|5     |149       |The Sound Of Lucrecia|0.5960655954426938|
|5     |63        |Enigma               |0.5781996164107921|
|5     |87        |Deacon Blue          |0.5724274243327524|
|5     |57        |Thievery Corporation |0.5466830866235932|
|5     |79        |Fiction Factory      |0.5421251649009027|
|5     |94        |Ministry of Sound    |0.5251414738134   |
|5     |4617      |Bow Wow              |0.5132002392796673|
+------+----------+---------------------+------------------+
only showing top 10 rows
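
The recommendation logic of the two cells above (match the user's artists against the similarity pairs, drop artists the user already knows, average the similarities per candidate, and sort) can be sketched in plain Python. The `listened` set and `similarity_pairs` list are hypothetical data chosen for illustration.

```python
from collections import defaultdict

# Hypothetical data: artists one user has listened to, and similarity pairs
listened = {10, 11}
similarity_pairs = [  # (artistID_1, artistID_2, similarity)
    (10, 11, 0.9),
    (10, 20, 0.6),
    (11, 20, 0.2),
    (11, 30, 0.5),
]

scores = defaultdict(list)
for artist_1, artist_2, similarity in similarity_pairs:
    # Match on artistID_1 only, as the cells above do, and skip known artists
    if artist_1 in listened and artist_2 not in listened:
        scores[artist_2].append(similarity)

# Average the similarities per candidate and rank from highest to lowest
recommendations = sorted(
    ((candidate, sum(vals) / len(vals)) for candidate, vals in scores.items()),
    key=lambda item: item[1],
    reverse=True,
)
print(recommendations)
```

Candidate 20 is similar to both of the user's artists but averages to roughly 0.4, so candidate 30, with a single similarity of 0.5, is ranked first. Averaging rewards consistently high similarity rather than the number of matching artists.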



                                                                                

Next, we write code that takes an artist name as input, finds similar artists, and displays the tags they share with the input artist.
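
The shared-tag step boils down to a set intersection, which the following plain-Python sketch illustrates. The `shared_tags` helper, the artist names, and the tag sets are all hypothetical; candidates with no tags in common are kept with an empty list.

```python
# Hypothetical tag sets per artist
artist_tags = {
    "Input Artist": {"electronic", "ambient", "instrumental"},
    "Artist A": {"electronic", "ambient", "pop"},
    "Artist B": {"rock"},
}

def shared_tags(input_name, candidate_names, tags_by_artist):
    """Map each candidate to the sorted tags it shares with the input artist."""
    input_tags = tags_by_artist[input_name]
    return {
        name: sorted(input_tags & tags_by_artist.get(name, set()))
        for name in candidate_names
    }

result = shared_tags("Input Artist", ["Artist A", "Artist B"], artist_tags)
print(result)
```

Listing the shared tags next to each similar artist is a quick sanity check: a high similarity score paired with an empty list of shared tags suggests that the similarity is not really driven by tag content.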

In [12]:
from pyspark.sql import functions as F

# Input: Artist Name
input_artist_name = "Jean-Michel Jarre"  # Change this to any artist name for testing

# Step 1: Check if the artist exists in the dataset
artist_exists = artists_df.filter(F.col("name") == input_artist_name).count()

if artist_exists == 0:
    print(f"The artist '{input_artist_name}' is not in the list.")
else:
    # Step 2: Get the artist ID for the input artist
    input_artist_id = artists_df.filter(F.col("name") == input_artist_name).select("id").first()["id"]

    # Step 3: Find similar artists using the similarity matrix
    similar_artists = similarities_df.filter(
        (F.col("artistID_1") == input_artist_id) | (F.col("artistID_2") == input_artist_id)
    ).withColumn(
        "similar_artist_id",
        F.when(F.col("artistID_1") == input_artist_id, F.col("artistID_2")).otherwise(F.col("artistID_1"))
    ).join(
        artists_df, F.col("similar_artist_id") == artists_df.id
    ).select(
        F.col("similar_artist_id"),
        F.col("name").alias("similar_artist_name"),
        F.col("similarity")
    ).orderBy(F.col("similarity").desc())

    # Step 4: Get tags for the input artist and similar artists
    input_artist_tags = artist_tags_info_df.filter(
        F.col("artistID") == input_artist_id
    ).select("tag").distinct()

    similar_artist_tags = artist_tags_info_df.filter(
        F.col("artistID").isin([row["similar_artist_id"] for row in similar_artists.collect()])
    ).select("artistID", "tag").distinct()

    # Step 5: Find shared tags between the input artist and similar artists
    shared_tags = similar_artist_tags.join(
        input_artist_tags, ["tag"], "inner"
    ).groupBy("artistID").agg(
        F.collect_list("tag").alias("shared_tags")
    ).join(
        artists_df, F.col("artistID") == artists_df.id
    ).select(
        F.col("name").alias("artist_name"),
        "shared_tags"
    )

    # Display Results
    print(f"Artists similar to '{input_artist_name}':")
    similar_artists.show(10, truncate=False)

    print(f"Shared tags with '{input_artist_name}':")
    shared_tags.show(10, truncate=False)


                                                                                

Artists similar to 'Jean-Michel Jarre':


                                                                                

+-----------------+----------------------+-------------------+
|similar_artist_id|similar_artist_name   |similarity         |
+-----------------+----------------------+-------------------+
|56               |Daft Punk             |0.4072535235842014 |
|2                |Diary of Dreams       |0.37057201966529973|
|4                |Moi dix Mois          |0.34505095261119995|
|52               |Morcheeba             |0.3295423117461275 |
|66               |Faithless             |0.3254526604505519 |
|44               |Das Ich               |0.3197749222590756 |
|169              |אביתר בנאי            |0.3014104402085912 |
|19               |:wumpscut:            |0.2987023063005894 |
|296              |Sugababes             |0.29250089551853053|
|647              |Eleftheria Eleftheriou|0.2898754521821014 |
+-----------------+----------------------+-------------------+
only showing top 10 rows

Shared tags with 'Jean-Michel Jarre':



+---------------------+-----------------------------------+
|artist_name          |shared_tags                        |
+---------------------+-----------------------------------+
|Diary of Dreams      |[ambient, seen live, electronic]   |
|Moonspell            |[seen live, instrumental]          |
|Marilyn Manson       |[90s, instrumental, seen live]     |
|Combichrist          |[electronic, electro, seen live]   |
|Grendel              |[electronic]                       |
|Agonoize             |[electronic]                       |
|Hocico               |[electronic]                       |
|London After Midnight|[electronic]                       |
|The Crüxshadows      |[electronica, electronic, synthpop]|
|:wumpscut:           |[electronic]                       |
+---------------------+-----------------------------------+
only showing top 10 rows



                                                                                

In [13]:
from pyspark.sql import functions as F

# Input: Artist Name
input_artist_name = "Jean-Michel Jarre"  # Change this to any artist name for testing

# Step 1: Check if the artist exists in the dataset
artist_exists = artists_df.filter(F.col("name") == input_artist_name).count()

if artist_exists == 0:
    print(f"The artist '{input_artist_name}' is not in the list.")
else:
    # Step 2: Get the artist ID for the input artist
    input_artist_id = artists_df.filter(F.col("name") == input_artist_name).select("id").first()["id"]

    # Step 3: Find similar artists using the similarity matrix
    similar_artists = similarities_df.filter(
        (F.col("artistID_1") == input_artist_id) | (F.col("artistID_2") == input_artist_id)
    ).withColumn(
        "similar_artist_id",
        F.when(F.col("artistID_1") == input_artist_id, F.col("artistID_2")).otherwise(F.col("artistID_1"))
    ).join(
        artists_df, F.col("similar_artist_id") == artists_df.id
    ).select(
        F.col("similar_artist_id"),
        F.col("name").alias("similar_artist_name"),
        F.col("similarity")
    ).orderBy(F.col("similarity").desc())

    # Step 4: Limit to the top similar artists
    top_similar_artists = similar_artists.limit(10)

    # Step 5: Get tags for the input artist
    input_artist_tags = artist_tags_info_df.filter(
        F.col("artistID") == input_artist_id
    ).select("tag").distinct()

    # Step 6: Get tags for the top similar artists
    similar_artist_tags = artist_tags_info_df.filter(
        F.col("artistID").isin([row["similar_artist_id"] for row in top_similar_artists.collect()])
    ).select("artistID", "tag").distinct()

    # Step 7: Find shared tags between the input artist and top similar artists
    shared_tags = similar_artist_tags.join(
        input_artist_tags, ["tag"], "inner"
    ).join(
        artists_df, similar_artist_tags.artistID == artists_df.id
    ).groupBy("name").agg(
        F.collect_list("tag").alias("shared_tags")
    )

    # Display Results
    print(f"Artists similar to '{input_artist_name}':")
    top_similar_artists.show(10, truncate=False)

    print(f"Shared tags with '{input_artist_name}':")
    shared_tags.show(10, truncate=False)


                                                                                

Artists similar to 'Jean-Michel Jarre':


                                                                                

+-----------------+----------------------+-------------------+
|similar_artist_id|similar_artist_name   |similarity         |
+-----------------+----------------------+-------------------+
|56               |Daft Punk             |0.4072535235842014 |
|2                |Diary of Dreams       |0.37057201966529973|
|4                |Moi dix Mois          |0.34505095261119995|
|52               |Morcheeba             |0.3295423117461275 |
|66               |Faithless             |0.3254526604505519 |
|44               |Das Ich               |0.3197749222590756 |
|169              |אביתר בנאי            |0.3014104402085912 |
|19               |:wumpscut:            |0.2987023063005894 |
|296              |Sugababes             |0.29250089551853053|
|647              |Eleftheria Eleftheriou|0.2898754521821014 |
+-----------------+----------------------+-------------------+

Shared tags with 'Jean-Michel Jarre':


                                                                                

+----------------------+-------------------------------------------------------------------------------------------+
|name                  |shared_tags                                                                                |
+----------------------+-------------------------------------------------------------------------------------------+
|:wumpscut:            |[electronic]                                                                               |
|Sugababes             |[seen live, pop, electronic]                                                               |
|Faithless             |[electronica, 90s, electronic, chillout, pop, ambient, seen live, electro]                 |
|Eleftheria Eleftheriou|[pop]                                                                                      |
|Morcheeba             |[electronic, chillout, pop, seen live]                                                     |
|Das Ich               |[seen live]                             

In [14]:
from pyspark.sql import functions as F

# Input: Artist Name
input_artist_name = "Kanye West"  # Change this to any artist name for testing

# Step 1: Check if the artist exists in the dataset
artist_exists = artists_df.filter(F.col("name") == input_artist_name).count()

if artist_exists == 0:
    print(f"The artist '{input_artist_name}' is not in the list.")
else:
    # Step 2: Get the artist ID for the input artist
    input_artist_id = artists_df.filter(F.col("name") == input_artist_name).select("id").first()["id"]

    # Step 3: Find similar artists using the similarity matrix
    similar_artists = similarities_df.filter(
        (F.col("artistID_1") == input_artist_id) | (F.col("artistID_2") == input_artist_id)
    ).withColumn(
        "similar_artist_id",
        F.when(F.col("artistID_1") == input_artist_id, F.col("artistID_2")).otherwise(F.col("artistID_1"))
    ).join(
        artists_df, F.col("similar_artist_id") == artists_df.id
    ).select(
        F.col("similar_artist_id"),
        F.col("name").alias("similar_artist_name"),
        F.col("similarity")
    ).orderBy(F.col("similarity").desc())

    # Step 4: Limit to the top similar artists
    top_similar_artists = similar_artists.limit(10)

    # Step 5: Get tags for the input artist
    input_artist_tags = artist_tags_info_df.filter(
        F.col("artistID") == input_artist_id
    ).select("tag").distinct()

    # Step 6: Get tags for the top similar artists
    similar_artist_tags = artist_tags_info_df.filter(
        F.col("artistID").isin([row["similar_artist_id"] for row in top_similar_artists.collect()])
    ).select("artistID", "tag").distinct()

    # Step 7: Find shared tags between the input artist and top similar artists
    shared_tags = top_similar_artists.join(
        similar_artist_tags.join(
            input_artist_tags, ["tag"], "inner"
        ).groupBy("artistID").agg(
            F.collect_list("tag").alias("shared_tags")
        ),
        top_similar_artists.similar_artist_id == similar_artist_tags.artistID,
        how="left"
    ).select(
        F.col("similar_artist_name"),
        F.coalesce(F.col("shared_tags"), F.array()).alias("shared_tags"),
        F.col("similarity")
    ).orderBy(F.col("similarity").desc())  # Ensure order matches top_similar_artists

    # Display Results
    print(f"Artists similar to '{input_artist_name}':")
    top_similar_artists.show(10, truncate=False)

    print(f"Shared tags with '{input_artist_name}':")
    shared_tags.show(10, truncate=False)


                                                                                

Artists similar to 'Kanye West':


                                                                                

+-----------------+------------------------+-------------------+
|similar_artist_id|similar_artist_name     |similarity         |
+-----------------+------------------------+-------------------+
|177              |Rock Star Supernova     |0.3078493817275997 |
|2876             |Ann Rabson              |0.227429413073671  |
|45               |Mindless Self Indulgence|0.22348526162540228|
|4                |Moi dix Mois            |0.20912846762611448|
|2                |Diary of Dreams         |0.2053797338222069 |
|60               |Matt Bianco             |0.18968144847130214|
|1729             |Craig Armstrong         |0.17507524381296338|
|93               |Jean-Michel Jarre       |0.17057205980525328|
|52               |Morcheeba               |0.16863213272256558|
|637              |Helena Paparizou        |0.16508332072280235|
+-----------------+------------------------+-------------------+

Shared tags with 'Kanye West':




+------------------------+----------------------------------------------------------+-------------------+
|similar_artist_name     |shared_tags                                               |similarity         |
+------------------------+----------------------------------------------------------+-------------------+
|Rock Star Supernova     |[]                                                        |0.3078493817275997 |
|Ann Rabson              |[]                                                        |0.227429413073671  |
|Mindless Self Indulgence|[alternative, electronic, i want to dance in my underwear]|0.22348526162540228|
|Moi dix Mois            |[]                                                        |0.20912846762611448|
|Diary of Dreams         |[electronic]                                              |0.2053797338222069 |
|Matt Bianco             |[]                                                        |0.18968144847130214|
|Craig Armstrong         |[classic, electronic

                                                                                