## Item-based Collaborative Filtering

## PERSONAL NOTES:
Running with PySpark
- If you get a Py4JJavaError, check that the PySpark environment variables are set correctly (Windows command prompt):
    - echo %PYSPARK_PYTHON%
    - echo %PYSPARK_DRIVER_PYTHON%
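
The same check can be made from inside Python before the Spark session is created. This is a minimal sketch; `report_pyspark_env` is a hypothetical helper name, and pointing both variables at the current interpreter (the commented-out lines) is only one common remedy, assumed here rather than taken from these notes.

```python
import os
import sys

def report_pyspark_env(environ=os.environ):
    """Return the two PySpark interpreter variables, with None for any that are unset."""
    names = ("PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON")
    return {name: environ.get(name) for name in names}

for name, value in report_pyspark_env().items():
    print(f"{name} = {value!r}")

# Assumption: if a variable is unset or points at a different Python installation,
# setting both to the interpreter that runs this notebook may resolve the error.
# os.environ["PYSPARK_PYTHON"] = sys.executable
# os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable
```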


#### Rationale
1. The number of users is large relative to the number of relevant news articles, so comparing items is computationally cheaper than comparing users.
2. Item stability > user stability. Once a news article is published, its content is fixed, while a user's taste may change often. This can make user-based collaborative filtering less accurate with respect to the user's current taste. Similarity between items stays comparatively constant, i.e. item-based collaborative filtering needs fewer recalculations.
3. Few news article interactions per user. This makes it harder to find similar users, which user-based collaborative filtering depends on.


#### Item-based collaborative filtering in a nutshell (MIND)
"Find articles that are likely to be of interest, based on shared user interest patterns. Return the top N articles that are most similar to any of the news articles the user has clicked on, based on the similarity calculations between items."
1. For each news article a user has clicked, get an overview of articles other users have also clicked
2. Matrix factorization for efficiency - Alternating Least Squares (ALS)
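3. Calculate the similarity of each article (similarity of interactions) - Locality-Sensitive Hashing (LSH)
4. Repeat the steps for each news article, and sort the recommendation list so that the most similar articles come first (the code below measures similarity as the Euclidean distance between ALS item factors rather than cosine similarity, so a smaller distance means a more similar article)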
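
The idea can be illustrated without Spark on a handful of clicks. The sketch below is a toy version under stated assumptions: the click histories are invented, each article is treated as a binary vector over users, and exact cosine similarity replaces the ALS + LSH approximation used later in this notebook.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical click histories: user -> set of clicked article IDs
clicks = {
    "U1": {"N1", "N2", "N3"},
    "U2": {"N1", "N2"},
    "U3": {"N2", "N3"},
    "U4": {"N4"},
}

def item_similarities(clicks):
    """Cosine similarity between articles, each seen as a binary vector over users."""
    users_per_item = defaultdict(set)
    for user, items in clicks.items():
        for item in items:
            users_per_item[item].add(user)
    sims = {}
    for a, users_a in users_per_item.items():
        for b, users_b in users_per_item.items():
            shared = len(users_a & users_b)
            if a != b and shared:
                sims[(a, b)] = shared / sqrt(len(users_a) * len(users_b))
    return sims

def recommend(user, clicks, sims, n=5):
    """Top-n unseen articles, scored by their best similarity to any clicked article."""
    seen = clicks[user]
    best = {}
    for (a, b), s in sims.items():
        if a in seen and b not in seen:
            best[b] = max(best.get(b, 0.0), s)
    return sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

sims = item_similarities(clicks)
print(recommend("U2", clicks, sims))  # N3 is the only unseen article co-clicked with U2's history
```

Computing every pair exactly like this grows quadratically with the number of articles, which is the motivation for the factorization and hashing steps.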


#### Preparation of data

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import explode, split, col, lit, desc, sum, udf, broadcast
from pyspark.ml.recommendation import ALS
from pyspark.ml.feature import StringIndexer, BucketedRandomProjectionLSH
from pyspark.ml.linalg import Vectors, VectorUDT


spark = SparkSession.builder \
    .appName("MINDItemBasedFiltering") \
    .getOrCreate()
# Optional: insert .config("spark.driver.memory", "4g") before .getOrCreate() to raise the driver memory
    
# Define the schema of the dataset
schema = StructType([
    StructField("ImpressionID", IntegerType(), True),
    StructField("UserID", StringType(), True),
    StructField("Time", StringType(), True),
    StructField("History", StringType(), True),
    StructField("Impressions", StringType(), True)
])

# Load the dataset with the defined schema
data = spark.read.csv("data/MINDsmall_dev/behaviors.tsv", sep="\t", schema=schema)

data.show(5, truncate=False)

# Explode the History column into one row per article per user,
# i.e. UserID | NewsArticle (an article stored in that user's history)
data = data.withColumn("NewsArticle", explode(split(col("History"), " "))) \
    .select(col("UserID").alias("user_id"), col("NewsArticle").alias("news_article"))

data.show(5, truncate=False)

+------------+------+----+-------+-----------+
|ImpressionID|UserID|Time|History|Impressions|
+------------+------+----+-------+-----------+
[remaining output truncated]
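
To make the `explode(split(...))` step concrete, here is a plain-Python sketch of the same reshaping. The rows and article IDs are invented for illustration, and skipping users with a missing history is a choice made in this sketch.

```python
def explode_history(rows):
    """Yield one (user_id, news_article) pair per article in each history string."""
    for user_id, history in rows:
        if not history:  # missing history: this user contributes no rows
            continue
        for article in history.split(" "):
            yield user_id, article

# Hypothetical rows of (UserID, History)
rows = [("U1", "N10 N20 N30"), ("U2", None), ("U3", "N20")]
pairs = list(explode_history(rows))
print(pairs)
```

Each user with k articles in their history becomes k rows, which is the long (user, item) format that ALS expects.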

#### Alternating Least Squares (ALS)

In [2]:
# Prepare data to work with ALS
# Add a dummy 'rating' column to indicate interaction
data = data.withColumn("rating", lit(1))

# Index the user_id and news_article columns
user_indexer = StringIndexer(inputCol="user_id", outputCol="user_id_index").fit(data)
item_indexer = StringIndexer(inputCol="news_article", outputCol="news_article_id_index").fit(data)

data = user_indexer.transform(data)
data = item_indexer.transform(data)

# Extract mappings from StringIndexer models
user_id_index_mapping = user_indexer.labels
news_article_id_index_mapping = item_indexer.labels

# Convert mappings to DataFrames for easier use
user_id_index_df = spark.createDataFrame([(i, user_id_index_mapping[i]) for i in range(len(user_id_index_mapping))], ["user_id_index", "user_id"])
news_article_id_index_df = spark.createDataFrame([(i, news_article_id_index_mapping[i]) for i in range(len(news_article_id_index_mapping))], ["news_article_id_index", "news_article"])

# Select the final columns for ALS
data = data.select("user_id_index", "news_article_id_index", "rating")

data.show(5, truncate=False)

# Train the ALS model
# Note: We use implicitPrefs=True to indicate that we are working with implicit feedback (clicks)
als = ALS(maxIter=5, regParam=0.01, userCol="user_id_index", itemCol="news_article_id_index", ratingCol="rating", coldStartStrategy="drop", implicitPrefs=True)
model = als.fit(data)

# Extract the item factors from the ALS model
#item_factors = model.itemFactors
item_factors = model.itemFactors.limit(100) #Subset for testing

item_factors.show(5)

+-------------+---------------------+------+
|user_id_index|news_article_id_index|rating|
+-------------+---------------------+------+
|10460.0      |6.0                  |1     |
|10460.0      |279.0                |1     |
|10460.0      |1243.0               |1     |
|10460.0      |201.0                |1     |
|10460.0      |1734.0               |1     |
+-------------+---------------------+------+
only showing top 5 rows

+---+--------------------+
| id|            features|
+---+--------------------+
|  0|[-0.16858101, 0.0...|
| 10|[0.35777473, -0.5...|
| 20|[0.08643527, -0.3...|
| 30|[0.17135058, 0.15...|
| 40|[0.13522969, 0.14...|
+---+--------------------+
only showing top 5 rows
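
The index columns above come from `StringIndexer`, and the `labels` list is what maps an index back to the original ID. The sketch below mimics that mapping in plain Python, assuming the default ordering (most frequent label first, with ties broken alphabetically in recent Spark versions). `fit_string_indexer` is a hypothetical name and the article IDs are invented.

```python
from collections import Counter

def fit_string_indexer(values):
    """Return labels ordered by descending frequency, ties broken alphabetically."""
    counts = Counter(values)
    return sorted(counts, key=lambda label: (-counts[label], label))

labels = fit_string_indexer(["N3", "N1", "N3", "N2", "N1", "N3"])

# Forward mapping (the indexer emits doubles, hence values such as 10460.0 above)
label_to_index = {label: float(i) for i, label in enumerate(labels)}
# Reverse mapping, equivalent to looking up position i in `labels`
index_to_label = dict(enumerate(labels))

print(labels)
print(label_to_index["N1"], index_to_label[0])
```

Because the index is simply a position in `labels`, the two mapping DataFrames built in the cell above are enough to translate recommendations back to MIND article IDs.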



#### Calculating Similarity - Locality-Sensitive Hashing (LSH)

In [3]:
# In order to calculate the similarity between items by using Spark's LSH, we need to convert the item factors into a DenseVector
# Define a UDF that converts an array of floats into a DenseVector
to_vector = udf(lambda x: Vectors.dense(x), VectorUDT())

# Apply the UDF to the 'features' column
item_factors = item_factors.withColumn("features", to_vector("features"))

# Prepare for calculating similarity
# Initialize the LSH model
brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes", bucketLength=2.0, numHashTables=3)

# Fit the LSH model on the item factors
model_lsh = brp.fit(item_factors)

# Transform item factors to hash table
item_factors_hashed = model_lsh.transform(item_factors)

# Calculate similarity
# Approximate similarity join: keep pairs whose Euclidean distance is below the threshold
similar_items = model_lsh.approxSimilarityJoin(item_factors_hashed, item_factors_hashed, threshold=1.5, distCol="EuclideanDistance")

# Show some results
similar_items.select(col("datasetA.id").alias("idA"), col("datasetB.id").alias("idB"), "EuclideanDistance").show()


+---+---+------------------+
|idA|idB| EuclideanDistance|
+---+---+------------------+
|  0|990|1.4072517684549435|
|  0|930|1.4205228862444568|
|  0|910|1.4034149854427065|
|  0|900| 1.300589241076896|
|  0|880|1.4031998195466355|
|  0|870|1.3720967811574778|
|  0|850|1.3959277875122378|
|  0|840|1.3732905159559319|
|  0|830|  1.31786995895055|
|  0|820|1.3897629722269176|
|  0|810|1.3124629287894602|
|  0|800|1.3907342338714066|
|  0|780|1.3034421385096169|
|  0|760|1.4344558859378924|
|  0|750|1.3773762779332417|
|  0|740|1.4013291473390703|
|  0|730|1.4315972423762529|
|  0|720|1.4093930188192254|
|  0|710|1.4241073268436972|
|  0|700|1.3347027024321076|
+---+---+------------------+
only showing top 20 rows
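
For intuition about what the hash tables contain, the sketch below reproduces the bucketed random projection hash described in the Spark documentation, h(x) = floor((x · v) / bucketLength), in plain Python. It is an illustration only: the vectors are invented, the projection vectors are drawn with Python's `random` module rather than Spark's generator, and the values 2.0 and 3 simply mirror `bucketLength` and `numHashTables` above.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Draw a random direction and normalise it to length 1."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def brp_hashes(x, projections, bucket_length):
    """One bucket index per hash table: floor((x . v) / bucket_length)."""
    return [
        math.floor(sum(a * b for a, b in zip(x, v)) / bucket_length)
        for v in projections
    ]

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

rng = random.Random(0)
projections = [random_unit_vector(3, rng) for _ in range(3)]  # 3 hash tables

a = [0.10, 0.20, 0.30]   # hypothetical item factors
b = [0.12, 0.18, 0.31]   # close to a
c = [5.0, -4.0, 6.0]     # far from a

for name, vec in (("a", a), ("b", b), ("c", c)):
    print(name, brp_hashes(vec, projections, 2.0), round(euclidean(a, vec), 3))
```

Vectors that are close together tend to fall into the same bucket in at least one table, so only those candidate pairs need an exact distance computation before the threshold is applied.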



#### Prepare similarity data for recommendations

In [10]:
user_item_interactions = data.select("user_id_index", "news_article_id_index").distinct()

# Step 1: Flatten similar_items for easier handling
# You might need to adjust this part based on your exact schema of similar_items
flat_similar_items = similar_items.select(
    col("datasetA.id").alias("article_id"),
    col("datasetB.id").alias("similar_article_id"),
    col("EuclideanDistance")
)

# Step 2: Filter for new recommendations per user
# Join user interactions with similar items to find potential recommendations
potential_recommendations = user_item_interactions.join(
    broadcast(flat_similar_items),
    user_item_interactions.news_article_id_index == flat_similar_items.article_id,
    "inner"
).select(
    "user_id_index",
    "similar_article_id",
    "EuclideanDistance"
).distinct()

# Step 3: Filter out articles the user has already interacted with
filtered_recommendations = potential_recommendations.join(
    broadcast(user_item_interactions),
    (potential_recommendations.user_id_index == user_item_interactions.user_id_index) & 
    (potential_recommendations.similar_article_id == user_item_interactions.news_article_id_index),
    "left_anti"
)
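
The two joins above can be hard to picture, so here is a plain-Python sketch of the same logic on toy data. The indices and distances are invented. The set comprehension plays the role of the inner join plus `distinct()`, and the final filter plays the role of the `left_anti` join.

```python
# Hypothetical toy data mirroring the DataFrames above
interactions = {(1, 10), (1, 20), (2, 10)}  # (user_id_index, news_article_id_index)
similar = [                                  # (article_id, similar_article_id, distance)
    (10, 20, 0.4), (10, 30, 0.9),
    (20, 30, 0.5), (20, 10, 0.4),
]

def candidate_recommendations(interactions, similar):
    # Inner join: every article a user clicked pulls in that article's neighbours
    candidates = {
        (user, sim_article, dist)
        for user, article in interactions
        for art, sim_article, dist in similar
        if article == art
    }
    # Left anti join: drop (user, article) pairs the user has already interacted with
    return sorted(c for c in candidates if (c[0], c[1]) not in interactions)

for row in candidate_recommendations(interactions, similar):
    print(row)
```

Note that user 1 reaches article 30 twice, once through each clicked article, which is why the next cell aggregates per `similar_article_id`.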


In [14]:
def get_top_n_recommendations(user_id, N=5):
    # Check that the user_id maps to a user_id_index, i.e. that the user exists in the dataset
    user_id_index_row = user_id_index_df.filter(col("user_id") == user_id).select("user_id_index").first()
    if user_id_index_row is None:
        print(f"No user_id_index found for user_id {user_id}")
        return None
    user_id_index = user_id_index_row["user_id_index"]
    print(f"Found user_id_index {user_id_index} for user_id {user_id}")
    
    # Filter for recommendations specific to this user_id_index
    specific_user_recommendations = filtered_recommendations.filter(
        filtered_recommendations.user_id_index == user_id_index
    )
    #specific_user_recommendations.show()

    
    # Aggregate and rank recommendations for the user
    ranked_recommendations = specific_user_recommendations.groupBy("similar_article_id").agg(
        (1 / sum("EuclideanDistance")).alias("score")
    ).orderBy(desc("score"))

    
    # Fetch the top N recommendations
    top_n_recommendations = ranked_recommendations.limit(N)
    #top_n_recommendations.show()
    
    # Alias the DataFrames to clearly distinguish between them in the join condition
    top_n_recommendations_alias = top_n_recommendations.alias("top_n")
    news_article_id_index_df_alias = news_article_id_index_df.alias("article_id_index")

    # Perform the join using the aliased DataFrames
    top_n_recommendations_mapped = top_n_recommendations_alias.join(
        news_article_id_index_df_alias, 
        col("top_n.similar_article_id") == col("article_id_index.news_article_id_index")  # Use the aliased column references
    )

    # Select the desired columns from the joined DataFrame.
    # The join does not guarantee that the earlier ordering is preserved, so sort by score again.
    top_n_recommendations_mapped = top_n_recommendations_mapped.select(
        col("top_n.similar_article_id"), 
        col("article_id_index.news_article"), 
        col("top_n.score")
    ).orderBy(desc("score"))

    
    return top_n_recommendations_mapped

# Usage example:
top_n_recommendations = get_top_n_recommendations("U80234", 5)
top_n_recommendations.show()


Found user_id_index 10460 for user_id U80234
+------------------+------------+------------------+
|similar_article_id|news_article|             score|
+------------------+------------+------------------+
|                50|      N26026|1.9272633605596365|
|               160|      N18094|1.8533538436282824|
|               960|      N25677|1.7231495800966772|
|               440|      N63855|1.6698685785949838|
|               240|       N3046|1.6369071560230017|
+------------------+------------+------------------+
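
One property of the score `1 / sum(EuclideanDistance)` is worth checking on a small example: a candidate reached from several clicked articles accumulates a larger distance sum and therefore a lower score. The sketch below compares it with summing the inverse distances instead. That alternative is an assumption offered for comparison, not the method used in this notebook, and both functions assume strictly positive distances.

```python
from collections import defaultdict

def rank(candidates, score_fn, n=5):
    """Group distances by candidate article, score each group, return the top n."""
    by_article = defaultdict(list)
    for article, distance in candidates:
        by_article[article].append(distance)
    scored = {article: score_fn(ds) for article, ds in by_article.items()}
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

def inverse_of_sum(distances):
    """Score used in the cell above: 1 / sum(d)."""
    return 1.0 / sum(distances)

def sum_of_inverses(distances):
    """Alternative (assumption): sum(1 / d), which rewards repeated support."""
    return sum(1.0 / d for d in distances)

# Hypothetical candidates: A is close to two clicked articles, B is further from one
candidates = [("A", 0.5), ("A", 0.5), ("B", 0.8)]

print(rank(candidates, inverse_of_sum))   # B ranks above A
print(rank(candidates, sum_of_inverses))  # A ranks above B
```

Which behaviour is preferable depends on whether agreement between several clicked articles should count as stronger evidence.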

