## Item-based Collaborative Filtering

#### Rationale
1. Users greatly outnumber the relevant news articles, so comparing items is computationally cheaper than comparing users.
2. Item stability > user stability. Once a news article is published, its content is fixed, while a user's taste may change often. This can make user-based collaborative filtering less accurate with respect to the user's present taste. Similarity between items is largely constant, so item-based collaborative filtering needs fewer recalculations.
3. Few news article interactions per user. This makes it harder to find similar users, which user-based collaborative filtering depends on.


#### Item-based collaborative filtering in a nutshell (MIND)
"Find articles that are likely to be of interest, based on shared user interest patterns. Return the top N articles that are most similar to any of the news articles the user has clicked on, based on the similarity calculations between items."
1. For each news article a user has clicked, get an overview of articles other users have also clicked
2. Matrix factorization for efficiency - Alternating Least Squares (ALS)
3. Calculate the similarity between articles (similarity of interactions) - Locality-Sensitive Hashing (LSH)
4. Repeat the steps for each news article, and sort the recommendation list so that the most similar articles (smallest Euclidean distance between item factors) come first
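
The idea behind these steps can be illustrated without Spark. The following is a minimal pure-Python sketch, not the pipeline used in this notebook: it computes cosine similarity directly on co-click sets instead of using ALS factors and LSH, and the toy click log and the function names `item_cosine_similarity` and `recommend` are made up for illustration.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical toy click log: user -> set of clicked articles.
clicks = {
    "U1": {"N1", "N2"},
    "U2": {"N1", "N2", "N3"},
    "U3": {"N2", "N3"},
    "U4": {"N4"},
}

def item_cosine_similarity(clicks):
    """Cosine similarity between articles, based on which users clicked them."""
    users_by_item = defaultdict(set)
    for user, items in clicks.items():
        for item in items:
            users_by_item[item].add(user)
    sims = {}
    for a in users_by_item:
        for b in users_by_item:
            if a == b:
                continue
            shared = len(users_by_item[a] & users_by_item[b])
            if shared:
                sims[(a, b)] = shared / sqrt(len(users_by_item[a]) * len(users_by_item[b]))
    return sims

def recommend(user, clicks, sims, n=5):
    """Rank unseen articles by their best similarity to any clicked article."""
    seen = clicks[user]
    best = {}
    for (a, b), s in sims.items():
        if a in seen and b not in seen:
            best[b] = max(best.get(b, 0.0), s)
    return sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

sims = item_cosine_similarity(clicks)
print(recommend("U1", clicks, sims))
```

Comparing every pair of articles like this grows quadratically with the number of articles, which is the cost the matrix factorization and LSH steps are meant to reduce.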


##### If an error occurs when running with PySpark:
- If you get a Py4JJavaError, check that the PySpark environment variables are set correctly (Windows command prompt):
    - echo %PYSPARK_PYTHON%
    - echo %PYSPARK_DRIVER_PYTHON%
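
The same check can be done from inside Python, which also works outside the Windows command prompt. This is a small sketch; the helper name `pyspark_python_settings` is hypothetical, and the assumption is that the error stems from these variables being unset or pointing at a different interpreter than the one running the notebook.

```python
import os
import sys

def pyspark_python_settings():
    """Return the interpreter-related variables PySpark reads (None if unset)."""
    names = ("PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON")
    return {name: os.environ.get(name) for name in names}

for name, value in pyspark_python_settings().items():
    print(f"{name} = {value}")

# Compare against the interpreter that is running this notebook.
print(f"Current interpreter: {sys.executable}")
```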


### Step 1: Preparation of data

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import explode, split, col, lit, desc, sum, udf, broadcast
from pyspark.ml.recommendation import ALS
from pyspark.ml.feature import StringIndexer, BucketedRandomProjectionLSH
from pyspark.ml.linalg import Vectors, VectorUDT
import pandas as pd


spark = SparkSession.builder \
    .appName("MINDItemBasedFiltering") \
    .config("spark.executor.memory", "8g") \
    .config("spark.driver.memory", "8g") \
    .config("spark.driver.extraJavaOptions", "-XX:+UseG1GC") \
    .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC") \
    .getOrCreate()

def prepare_data_with_spark_df(pandas_df):
    # Define the schema of the MIND behaviors data (one row per impression)
    schema = StructType([
        StructField("ImpressionID", IntegerType(), True),
        StructField("UserID", StringType(), True),
        StructField("Time", StringType(), True),
        StructField("History", StringType(), True),
        StructField("Impressions", StringType(), True)
    ])
    
    # Convert the pandas dataframe to a spark dataframe using the defined schema
    spark_df = spark.createDataFrame(pandas_df, schema=schema)
    
    # Explode the space-separated History column into one row per clicked article per user
    spark_df = spark_df.withColumn("NewsArticle", explode(split(col("History"), " "))) \
                       .select(col("UserID").alias("user_id"), col("NewsArticle").alias("news_article"))
    
    # Return the processed Spark DataFrame
    return spark_df

#data.show(5, truncate=False)
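
To make the effect of `explode(split(...))` concrete, here is a pure-Python sketch of the same reshaping on a few made-up rows. The function name `explode_history` and the sample data are hypothetical. Note one difference: `str.split()` without arguments collapses repeated whitespace, whereas splitting on a single space can yield empty strings.

```python
def explode_history(rows):
    """Turn (user_id, history_string) rows into one (user_id, article) pair per click."""
    pairs = []
    for user_id, history in rows:
        if not history:  # users with a missing or empty history contribute no pairs
            continue
        for article in history.split():
            pairs.append((user_id, article))
    return pairs

rows = [("U1", "N10 N20 N30"), ("U2", "N20"), ("U3", None)]
print(explode_history(rows))
```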



### Step 2: Alternating Least Squares (ALS)

In [2]:
def alternating_least_squares(data):
    # Add a 'rating' column to indicate interaction
    data = data.withColumn("rating", lit(1))

    # Index the user_id and news_article columns
    user_indexer = StringIndexer(inputCol="user_id", outputCol="user_id_index").fit(data)
    item_indexer = StringIndexer(inputCol="news_article", outputCol="news_article_id_index").fit(data)

    # Transform data with indexers
    data = user_indexer.transform(data)
    data = item_indexer.transform(data)

    # Extract mappings from StringIndexer models
    user_id_index_mapping = user_indexer.labels
    news_article_id_index_mapping = item_indexer.labels

    # Convert mappings to DataFrames for easier use
    user_id_index_df = spark.createDataFrame(
        [(i, user_id_index_mapping[i]) for i in range(len(user_id_index_mapping))],
        ["user_id_index", "user_id"]
    )
    news_article_id_index_df = spark.createDataFrame(
        [(i, news_article_id_index_mapping[i]) for i in range(len(news_article_id_index_mapping))],
        ["news_article_id_index", "news_article"]
    )

    # Select the final columns for ALS
    data = data.select("user_id_index", "news_article_id_index", "rating")

    # Train the ALS model
    als = ALS(maxIter=5, regParam=0.01, userCol="user_id_index", itemCol="news_article_id_index", 
              ratingCol="rating", coldStartStrategy="drop", implicitPrefs=True)
    model = als.fit(data)

    # Extract the item factors from the ALS model and limit to 100 for testing
    item_factors = model.itemFactors.limit(100)
    num_item_factors = item_factors.count()
    print(f"Number of item factors: {num_item_factors}")

    # Return the indexed data, both index-to-ID mapping DataFrames, and the item factors
    return data, user_id_index_df, news_article_id_index_df, item_factors
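
ALS needs numeric IDs, which is why the string IDs are indexed first. By default `StringIndexer` orders labels by descending frequency, so the position of a label in `labels` is its index; this is what lets the function above rebuild the mappings by enumerating `labels`. The sketch below mimics that behaviour in plain Python. The name `frequency_index` is hypothetical, and the alphabetical tie-breaking is an assumption about how equal frequencies are ordered.

```python
from collections import Counter

def frequency_index(values):
    """Map each distinct string to an index, most frequent first (ties alphabetical)."""
    counts = Counter(values)
    labels = sorted(counts, key=lambda v: (-counts[v], v))
    return labels, {label: i for i, label in enumerate(labels)}

labels, to_index = frequency_index(["N20", "N10", "N20", "N30", "N20", "N10"])
print(labels)                    # index -> original ID
print(to_index["N30"])           # original ID -> index
print(labels[to_index["N30"]])   # round trip back to the original ID
```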

### Step 3: Calculating Similarity - Locality-Sensitive Hashing (LSH)

In [3]:
def calculate_similarity(item_factors):
    # Define a UDF that converts an array of floats into a DenseVector
    to_vector = udf(lambda x: Vectors.dense(x), VectorUDT())

    # Apply the UDF to the 'features' column
    item_factors = item_factors.withColumn("features", to_vector("features"))

    # Initialize the LSH model
    brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes", bucketLength=3.0, numHashTables=2)

    # Fit the LSH model on the item factors
    model_lsh = brp.fit(item_factors)

    # Transform item factors to hash table
    item_factors_hashed = model_lsh.transform(item_factors)

    # Calculate approximate similarity join
    similar_items = model_lsh.approxSimilarityJoin(item_factors_hashed, item_factors_hashed, threshold=1.5, distCol="EuclideanDistance")

    # Select the relevant columns to clarify the output
    similar_items = similar_items.select(
        col("datasetA.id").alias("idA"), 
        col("datasetB.id").alias("idB"), 
        col("EuclideanDistance")
    )
    
    # Inspecting results with show() is an action: it triggers the full (lazily evaluated)
    # similarity join, which can run for a very long time, so it is left commented out.
    #similar_items.show(5)

    return similar_items
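
A minimal sketch of the idea behind bucketed random projection, assuming the hash of a vector is the floor of its projection onto a random unit vector divided by the bucket length. It is pure Python for illustration, not Spark's implementation; `make_hash_function`, `euclidean`, and the sample vectors are hypothetical.

```python
import math
import random

def make_hash_function(dim, bucket_length, rng):
    """Build one random-projection hash: floor((x . v) / bucket_length)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    v = [c / norm for c in v]  # unit-length projection direction

    def h(x):
        return math.floor(sum(a * b for a, b in zip(x, v)) / bucket_length)

    return h

def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

rng = random.Random(42)
# Two hash functions, mirroring numHashTables=2 and bucketLength=3.0 above.
hashes = [make_hash_function(dim=3, bucket_length=3.0, rng=rng) for _ in range(2)]

a = [0.10, 0.20, 0.30]
b = [0.11, 0.19, 0.31]   # close to a
c = [9.00, -7.00, 8.00]  # far from a

for name, vec in (("a", a), ("b", b), ("c", c)):
    print(name, [h(vec) for h in hashes], round(euclidean(a, vec), 3))
```

Nearby vectors usually land in the same bucket, but a pair can straddle a bucket boundary. Using several hash tables lowers the chance of missing such a pair, at the cost of more candidate comparisons.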


### Step 4: Prepare similarity data for recommendations

In [4]:
def all_recommendations(data, similar_items, user_id_index_df, news_article_id_index_df):
    # Step 1: Rename the similarity columns for easier handling
    flat_similar_items = similar_items.select(
        col("idA").alias("article_id"),
        col("idB").alias("similar_article_id"),
        col("EuclideanDistance")
    )
    
    # Get distinct user-item interactions
    user_item_interactions = data.select("user_id_index", "news_article_id_index").distinct().cache()

    # Step 2: Filter for new recommendations per user
    # Join user interactions with similar items to find potential recommendations
    potential_recommendations = user_item_interactions.join(
        broadcast(flat_similar_items),
        user_item_interactions.news_article_id_index == flat_similar_items.article_id,
        "inner"
    ).select(
        "user_id_index",
        "similar_article_id",
        "EuclideanDistance"
    ).distinct()

    # Step 3: Filter out articles the user has already interacted with
    filtered_recommendations = potential_recommendations.join(
        broadcast(user_item_interactions),
        (potential_recommendations.user_id_index == user_item_interactions.user_id_index) & 
        (potential_recommendations.similar_article_id == user_item_interactions.news_article_id_index),
        "left_anti"
    )

    # Join with user_id_index_df to convert user_id_index back to user_id
    filtered_recommendations = filtered_recommendations.join(
        broadcast(user_id_index_df),
        filtered_recommendations.user_id_index == user_id_index_df.user_id_index
    )

    # Join with news_article_id_index_df to convert similar_article_id back to news_article
    filtered_recommendations = filtered_recommendations.join(
        broadcast(news_article_id_index_df),
        filtered_recommendations.similar_article_id == news_article_id_index_df.news_article_id_index
    )

    # Select the original user and news article IDs, along with the EuclideanDistance
    filtered_recommendations = filtered_recommendations.select(
        col("user_id"), 
        col("news_article"),
        col("EuclideanDistance")
    )
    
    return filtered_recommendations
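
The two joins above (an inner join to find candidates, then a left anti join to drop articles the user has already clicked) can be sketched on plain Python tuples. The function name `candidate_recommendations` and the sample data are hypothetical; the self-pair with distance 0 is included on the assumption that a self-join can return such pairs.

```python
def candidate_recommendations(interactions, similar_items):
    """Join clicks with similar-item pairs, then drop already-clicked articles."""
    seen = set(interactions)  # distinct (user, article) pairs
    candidates = set()
    for user, article in seen:
        for item_a, item_b, distance in similar_items:
            # Inner join on article == item_a, anti join on (user, item_b).
            if item_a == article and (user, item_b) not in seen:
                candidates.add((user, item_b, distance))
    return sorted(candidates)

interactions = [("U1", "N1"), ("U1", "N2"), ("U2", "N3")]
similar_items = [
    ("N1", "N1", 0.0),  # self-pair, removed by the anti join
    ("N1", "N2", 0.4),
    ("N1", "N3", 0.9),
    ("N2", "N3", 0.5),
    ("N3", "N2", 0.5),
]
print(candidate_recommendations(interactions, similar_items))
```

The same article can appear several times for one user, once for each clicked article it is similar to; the ranking step is what aggregates those rows.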

### (Execution) Train collaborative filtering model

In [5]:
def train_collaborative_filtering_model(pandas_df):
    # Prepare the data as a Spark DataFrame
    data = prepare_data_with_spark_df(pandas_df)

    # Train the ALS model
    data, user_id_index_df, news_article_id_index_df, item_factors = alternating_least_squares(data)

    # Calculate the similarity between items
    similar_items = calculate_similarity(item_factors)

    # Generate recommendations for all users
    recommendations = all_recommendations(data, similar_items, user_id_index_df, news_article_id_index_df)

    return recommendations

In [6]:
def get_top_n_recommendations(user_id, filtered_recommendations, N=5):
    # Fetch recommendations for the specific user
    specific_user_recommendations = filtered_recommendations.filter(
        filtered_recommendations.user_id == user_id
    )
    
    # Aggregate and rank recommendations
    # (sum here is pyspark.sql.functions.sum, which shadows the Python builtin)
    ranked_recommendations = specific_user_recommendations.groupBy("news_article").agg(
        (1 / sum("EuclideanDistance")).alias("score")
    ).orderBy(desc("score")).limit(N)
    
    return ranked_recommendations
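
The scoring rule `1 / sum(EuclideanDistance)` can be tried on a few made-up candidates in plain Python. The function name `rank_by_inverse_distance_sum` and the data are hypothetical, and the guard against a zero total is an addition of this sketch.

```python
from collections import defaultdict

def rank_by_inverse_distance_sum(candidates, n=5):
    """Score each article as 1 / sum(distances) and return the top n."""
    totals = defaultdict(float)
    for article, distance in candidates:
        totals[article] += distance
    # Skip zero totals to avoid division by zero (guard added for this sketch).
    scores = {a: 1.0 / t for a, t in totals.items() if t > 0}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

# N3 is similar to two clicked articles, N7 and N9 to one each.
candidates = [("N3", 0.5), ("N3", 0.5), ("N7", 0.8), ("N9", 0.25)]
print(rank_by_inverse_distance_sum(candidates, n=3))
```

In this example N3 matches two clicked articles at distance 0.5 yet scores below N7, which matches only one at distance 0.8. Summing distances therefore penalizes articles that are similar to several clicked articles. Whether that is the intended behaviour is a design choice; averaging the distances, or summing inverse distances, would rank such articles differently.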


### Item-Based Collaborative Filtering Method Example
1. Trains the model
2. Calculates and prints the top 5 recommendations for user "U80234"

In [7]:
# Example:
pandas_df = pd.read_csv("data/MINDsmall_dev/behaviors.tsv", sep="\t", header=None, names=["ImpressionID", "UserID", "Time", "History", "Impressions"])
filtered_recommendations = train_collaborative_filtering_model(pandas_df)
ranked_recommendations = get_top_n_recommendations("U80234", filtered_recommendations)
ranked_recommendations.show()

Number of item factors: 100
+------------+------------------+
|news_article|             score|
+------------+------------------+
|      N64631| 1.732766934639083|
|      N17109|1.6911987136771351|
|      N57812|1.3562689985581646|
|      N63276|1.0071215188203804|
|      N56469|0.9443795341933542|
+------------+------------------+

