# **Hybrid Filtering**

## Chapter 0: Theory
Hybrid filtering combines several recommendation techniques, with the goal of giving more accurate and meaningful recommendations. The aim is to benefit from the strengths of each technique while limiting the effect of its drawbacks. A common approach is to combine collaborative filtering with content-based filtering, which can be done in several ways. Weighted hybridisation blends the scores of the methods, using a tuned hyperparameter that determines the contribution of each method. Switching hybridisation chooses which method to use according to a decision rule that depends on the data. Finally, mixed hybridisation presents the recommendations produced by each method separately, side by side.
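
The three strategies above can be illustrated with a minimal sketch. It is not the implementation used later in this notebook: the function names, the toy scores, and the `min_interactions` threshold are hypothetical, and both score vectors are assumed to be already scaled to the same range.

```python
import numpy as np

def weighted_hybrid(cf_scores, cb_scores, alpha=0.5):
    """Blend two score vectors; alpha is the weight of the collaborative scores."""
    cf = np.asarray(cf_scores, dtype=float)
    cb = np.asarray(cb_scores, dtype=float)
    return alpha * cf + (1.0 - alpha) * cb

def switching_hybrid(cf_scores, cb_scores, n_interactions, min_interactions=5):
    """Use collaborative scores only if the user has enough history (assumed rule)."""
    chosen = cf_scores if n_interactions >= min_interactions else cb_scores
    return np.asarray(chosen, dtype=float)

def top_k(scores, k):
    """Indices of the k highest scores, best first."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    return [int(i) for i in order[:k]]

def mixed_hybrid(cf_scores, cb_scores, k=2):
    """Show the top-k items of each method together, skipping duplicates."""
    merged = []
    for item in top_k(cf_scores, k) + top_k(cb_scores, k):
        if item not in merged:
            merged.append(item)
    return merged

# Made-up scores for four items (index = item id)
cf = [0.9, 0.1, 0.4, 0.2]   # collaborative filtering scores
cb = [0.2, 0.8, 0.3, 0.7]   # content-based scores

weighted = weighted_hybrid(cf, cb, alpha=0.5)
switched = switching_hybrid(cf, cb, n_interactions=2)
mixed = mixed_hybrid(cf, cb, k=2)
print(weighted, switched, mixed)
```

In practice `alpha` would be tuned on held-out data, and the switching rule could depend on any property of the data, for example how many artists a user has interacted with.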

## Chapter 1: Data

In [None]:
# Load libraries
import pandas as pd
import numpy as np
import os
from sklearn.metrics.pairwise import cosine_similarity
import time
from sklearn.model_selection import train_test_split

In [None]:
# Import datasets
artists = pd.read_csv(os.path.join('..','data','artists.dat'), delimiter='\t')
tags = pd.read_csv(os.path.join('..','data','tags.dat'), delimiter='\t',encoding='ISO-8859-1')
user_artists = pd.read_csv(os.path.join('..','data','user_artists.dat'), delimiter='\t')
user_friends = pd.read_csv(os.path.join('..','data','user_friends.dat'), delimiter='\t')
user_taggedartists_timestamps = pd.read_csv(os.path.join('..','data','user_taggedartists-timestamps.dat'), delimiter='\t')
user_taggedartists = pd.read_csv(os.path.join('..','data','user_taggedartists.dat'), delimiter='\t')

In [None]:
# Drop irrelevant columns from the Artists dataset
artists_cleaned = artists.drop(columns=['url', 'pictureURL']).drop_duplicates(keep='first') 

# Drop duplicates from the Tags dataset (no columns are removed here)
tags_cleaned = tags.drop_duplicates(keep='first')

# For the User-Artists dataset, we can filter out rows with a weight of 0, as they show no meaningful interaction
user_artists_cleaned = user_artists[user_artists['weight'] > 0]
user_artists_cleaned = user_artists_cleaned.drop_duplicates(keep='first') 

# Drop duplicates from the User-Tagged Artists Timestamps dataset
# (.copy() makes the result an independent DataFrame, so the assignment below does not trigger a SettingWithCopyWarning)
user_taggedartists_timestamps_cleaned = user_taggedartists_timestamps.drop_duplicates(keep='first').copy()

# Convert timestamps from ms to datetime format
user_taggedartists_timestamps_cleaned['timestamp'] = pd.to_datetime(user_taggedartists_timestamps_cleaned['timestamp'], unit='ms')

# Drop duplicates from the User-Friends dataset
user_friends_cleaned = user_friends.drop_duplicates(keep='first')

## Chapter 2: Implementation

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.recommendation import ALS

# Start Spark session
spark = SparkSession.builder.appName("CollaborativeFilteringALS").getOrCreate()

# Convert cleaned pandas DataFrames to PySpark DataFrames
artists_spark_df = spark.createDataFrame(artists_cleaned)
user_artists_spark_df = spark.createDataFrame(user_artists_cleaned)

In [None]:
# Start timing
start_time = time.time()

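# ALS model setup: matrix-factorisation collaborative filtering on implicit feedback
# (implicitPrefs=True treats each weight as confidence in a preference, not as an explicit rating)
als = ALS(userCol="userID", itemCol="artistID", ratingCol="weight", coldStartStrategy="drop", implicitPrefs=True, regParam=1.0)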

# Split data into training and test sets (80% training, 20% testing)
train_data, test_data = user_artists_spark_df.randomSplit([0.8, 0.2], seed=27)

# Fit the ALS model
model = als.fit(train_data)

# Make predictions on the test set
predictions = model.transform(test_data)

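# Calculate RMSE (Root Mean Squared Error) against the raw weights
# Note: with implicitPrefs=True the predictions are preference scores rather than estimates of
# the raw weight, so this value mainly reflects the scale of the weights
rmse = predictions.withColumn("squared_error", (F.col("prediction") - F.col("weight"))**2)
rmse_value = rmse.select(F.sqrt(F.avg("squared_error"))).first()[0]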

# Print evaluation metrics
print(f"Root Mean Squared Error (RMSE): {rmse_value:.4f}")

# Generate recommendations
user_recommendations = model.recommendForAllUsers(5)

# Create a dictionary to map artistID to artistName
artist_id_to_name = {row['id']: row['name'] for row in artists_spark_df.collect()}

# Function to map artistID to artistName and round scores to 2 decimal places
def map_recommendations(user_recommendations):
    def map_row(row):
        recommendations_with_names = [
            (artist_id_to_name.get(rec[0], "Unknown"), round(rec[1], 2)) for rec in row['recommendations']
        ]
        return (row['userID'], recommendations_with_names)

    mapped_recommendations = user_recommendations.rdd.map(map_row).toDF(["userID", "recommendations"])
    return mapped_recommendations

# Apply the artistID to name mapping function
user_recommendations_with_names = map_recommendations(user_recommendations)

# Show the final recommendations with artist names and rounded scores
user_recommendations_with_names.show(truncate=False)

# End timing
end_time = time.time()

# Print the elapsed time
print(f"Time elapsed: {end_time - start_time:.4f} seconds")


Root Mean Squared Error (RMSE): 4492.4848
+------+------------------------------------------------------------------------------------------------------------------------------+
|userID|recommendations                                                                                                               |
+------+------------------------------------------------------------------------------------------------------------------------------+
|3     |[{Hande Yener, 1.14}, {Ani DiFranco, 1.05}, {ムック, 1.0}, {Pleq, 1.0}, {Klaus Badelt, 0.98}]                                 |
|5     |[{Beck, 1.09}, {Kings of Convenience, 1.08}, {Beirut, 1.06}, {The Smiths, 1.05}, {The White Stripes, 1.05}]                   |
|6     |[{Brandy, 0.49}, {Danity Kane, 0.44}, {50 Cent, 0.44}, {B.o.B, 0.43}, {Joss Stone, 0.43}]                                     |
|12    |[{Eminem, 1.57}, {Lenny Kravitz, 1.31}, {No Doubt, 1.24}, {Scooter, 1.24}, {Guano Apes, 1.2}]                                 |
|13    |[