# Worksheet 17

Name:  ShanshangZeng
UID: U87820096

### Topics

- Recommender Systems

### Recommender Systems

In the in-class example of recommending movies to users, we used the movie rating as a measure of similarity between users and movies, so the predicted rating for a user serves as a proxy for how highly a movie should be recommended: the higher the predicted rating, the stronger the recommendation.
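As a small illustration of using the predicted rating as a ranking signal, the sketch below sorts one user's predicted ratings and keeps the top few. The helper name `top_n` and the movie titles and scores are made up for the example.

```python
def top_n(predicted_ratings, n=2):
    """Return the n titles with the highest predicted rating."""
    ranked = sorted(predicted_ratings.items(), key=lambda kv: kv[1], reverse=True)
    return [title for title, _ in ranked[:n]]

# Hypothetical predicted ratings for a single user
predicted = {"Movie A": 3.1, "Movie B": 4.7, "Movie C": 2.4, "Movie D": 4.2}
print(top_n(predicted))  # ['Movie B', 'Movie D']
```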

a) Consider a streaming platform that only has "like" or "dislike" (no 1-5 rating). Describe how you would build a recommender system in this case.

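Treat each like as 1 and each dislike as 0, leaving unseen items missing, to form a binary user-item matrix. Collaborative filtering still applies: compare users or items with a set-based similarity such as Jaccard or cosine, or factorize the matrix and predict the probability of a like instead of a 1-5 rating. Content-based filtering on item characteristics or a deep learning model are alternatives. In every case, items are recommended in order of how likely the user is to like them.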
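One possible way to do user-based collaborative filtering with only likes and dislikes is sketched below. It assumes Jaccard similarity between users' sets of liked items, which is one choice among several. The function names and the sample data are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(likes, dislikes, user, n=3):
    """Score unseen items by the summed similarity of the users who liked them."""
    seen = likes[user] | dislikes.get(user, set())
    scores = {}
    for other, other_likes in likes.items():
        if other == user:
            continue
        sim = jaccard(likes[user], other_likes)
        if sim == 0.0:
            continue
        # Only items the target user has neither liked nor disliked are candidates
        for item in other_likes - seen:
            scores[item] = scores.get(item, 0.0) + sim
    # Highest score first; ties broken alphabetically for a stable order
    return sorted(scores, key=lambda item: (-scores[item], item))[:n]

# Hypothetical like/dislike data
likes = {
    "ann": {"Alien", "Blade Runner", "Dune"},
    "bob": {"Alien", "Blade Runner", "Arrival"},
    "cat": {"Notting Hill", "Dune"},
}
dislikes = {"ann": {"Notting Hill"}}
print(recommend(likes, dislikes, "ann"))  # ['Arrival']
```

Here "Notting Hill" is never suggested to ann because she already disliked it, even though a similar user liked it.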

b) Describe 3 challenges of building a recommender system

Building a recommender system presents challenges such as scalability to manage large data volumes, the cold start problem for new users or items with limited data, and data sparsity with too few interactions to make reliable predictions.
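Two of these challenges can be measured directly from the interaction log. The sketch below computes the sparsity of the user-item matrix and flags users with too few interactions (a rough cold-start check). The threshold `min_interactions` and the toy data are assumptions for illustration.

```python
from collections import Counter

def sparsity(interactions, n_users, n_items):
    """Fraction of the user-item matrix that has no observed interaction."""
    return 1.0 - len(set(interactions)) / (n_users * n_items)

def cold_start_users(interactions, all_users, min_interactions=2):
    """Users with fewer than min_interactions observed interactions."""
    counts = Counter(user for user, _ in interactions)
    return [u for u in all_users if counts[u] < min_interactions]

# Hypothetical (user, item) pairs for 4 users and 5 items
interactions = [(0, 0), (0, 1), (0, 3), (1, 1), (1, 4), (2, 2)]
print(sparsity(interactions, n_users=4, n_items=5))        # 6 of 20 cells observed
print(cold_start_users(interactions, all_users=range(4)))  # [2, 3]
```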

c) Why is SVD not an option for collaborative filtering?

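Plain SVD is not a good fit for collaborative filtering because it is defined only for a complete matrix, while user-item interaction data is mostly missing. Filling the gaps (for example with zeros or means) before decomposing treats unobserved entries as real ratings and distorts the factors, so methods that fit only the observed entries, such as ALS, are used instead.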
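To make the contrast concrete, here is a minimal NumPy sketch of alternating least squares that fits only the observed entries of a small rating matrix. It is an illustration under simplifying assumptions (dense arrays, a fixed regularization term, 0 meaning "not rated") and is not Spark's ALS implementation. The function name `masked_als` is made up.

```python
import numpy as np

def masked_als(R, mask, rank=2, reg=0.1, n_iters=20, seed=0):
    """Factor R ~ U @ V.T using only the entries where mask is True."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_items, rank))
    eye = reg * np.eye(rank)
    for _ in range(n_iters):
        # Fix V: solve a small ridge regression per user over that user's rated items
        for u in range(n_users):
            idx = mask[u]
            if idx.any():
                Vu = V[idx]
                U[u] = np.linalg.solve(Vu.T @ Vu + eye, Vu.T @ R[u, idx])
        # Fix U: solve per item over the users who rated it
        for i in range(n_items):
            idx = mask[:, i]
            if idx.any():
                Ui = U[idx]
                V[i] = np.linalg.solve(Ui.T @ Ui + eye, Ui.T @ R[idx, i])
    return U, V

# Toy rating matrix; assumption: 0 marks "not rated"
R = np.array([[5, 4, 0, 1],
              [4, 0, 1, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
mask = R > 0
U, V = masked_als(R, mask)
pred = U @ V.T  # includes predictions for the unrated cells
rmse = np.sqrt(np.mean((pred[mask] - R[mask]) ** 2))
print("RMSE on observed entries:", rmse)
```

The missing cells never enter the loss, so no imputation is needed, and `pred` still supplies a score for every user-item pair.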

d) Use the code below to train a recommender system on a dataset of Amazon movies.

In [None]:
import findspark
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

findspark.init()
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

# Initialize Spark session with required resources
spark = SparkSession.builder \
    .appName("Amazon Movie Recommender") \
    .config("spark.executor.memory", "28g") \
    .config("spark.driver.memory", "28g") \
    .config("spark.driver.cores", "8") \
    .getOrCreate()

# Load and preprocess the dataset
init_df = pd.read_csv("./train.csv").dropna()
init_df['UserId_fact'] = init_df['UserId'].astype('category').cat.codes
init_df['ProductId_fact'] = init_df['ProductId'].astype('category').cat.codes

# Split the dataset into training and testing sets
X_train_processed, X_test_processed, Y_train, Y_test = train_test_split(
    init_df[['UserId_fact', 'ProductId_fact']],  # directly use only needed columns
    init_df['Score'],
    test_size=0.25,
    random_state=0
)
# assign() returns a new DataFrame, avoiding pandas' SettingWithCopyWarning
X_train_processed = X_train_processed.assign(Score=Y_train)

# Convert the pandas DataFrame to Spark DataFrame
train_df = spark.createDataFrame(X_train_processed)

# Configure the ALS model
als = ALS(
    userCol="UserId_fact",
    itemCol="ProductId_fact",
    ratingCol="Score",
    coldStartStrategy="drop",
    nonnegative=True,
    rank=100,
    maxIter=10,
    regParam=0.1
)

# Fit the ALS model
model = als.fit(train_df)

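# Prepare test data (keeping the true Score as the label) and make predictions
X_test_processed = X_test_processed.assign(Score=Y_test)
test_df = spark.createDataFrame(X_test_processed)
predictions = model.transform(test_df)

# Evaluate the model using RMSE
evaluator = RegressionEvaluator(metricName="rmse", labelCol="Score", predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root Mean Square Error = " + str(rmse))

# Optional: Save the model for later use
# model.save("model_path")

# Prepare data for confusion matrix visualization.
# coldStartStrategy="drop" removes rows for unseen users/items and Spark does not
# guarantee row order, so use the Score column carried through transform()
# instead of re-attaching Y_test by position.
predictions_pandas = predictions.toPandas()

# Round predictions to whole scores; clipping assumes Score is on a 1-5 scale
predicted_score = predictions_pandas['prediction'].round().clip(1, 5).astype(int)
true_score = predictions_pandas['Score'].astype(int)

# Compute and plot the confusion matrix
cm = confusion_matrix(true_score, predicted_score, normalize='true')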
sns.heatmap(cm, annot=True)
plt.title('Confusion Matrix of the Recommender System')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()