# Worksheet 17

Name:  JINGYI ZHANG
UID: U26578499

### Topics

- Recommender Systems

### Recommender Systems

In the in-class example of recommending movies to users, we used the movie rating as a measure of similarity between users and movies, so the predicted rating for a user is a proxy for how highly a movie should be recommended: the higher the predicted rating, the stronger the recommendation.

a) Consider a streaming platform that only has "like" or "dislike" (no 1-5 rating). Describe how you would build a recommender system in this case.

On a streaming platform where users can only "like" or "dislike" movies, the recommender can be framed as a binary classification problem. Here's a concise overview of how to build such a system:

1. Data Collection: Collect data on which movies users have liked or disliked.
2. Feature Engineering: Develop features based on user behavior (e.g., genres liked, viewing times, frequency of likes/dislikes) and movie characteristics (e.g., genre, director, cast).
3. Model Selection: Choose a model suitable for binary classification. Logistic regression, decision trees, or more complex models like support vector machines (SVM) or neural networks could be used, depending on the complexity and size of the dataset.
4. Training: Train the model on the historical data of likes and dislikes. The model will learn the patterns of user preferences.
5. Prediction and Recommendation: Use the trained model to predict whether a user would like or dislike a new or unseen movie. Recommend movies that the model predicts the user would "like".
6. Evaluation: Continuously evaluate and refine the model using new user data. Metrics like accuracy, precision, recall, or F1-score can be helpful to assess performance.

This approach simplifies the recommendation task to a binary classification problem while still allowing personalized recommendations based on user preferences.
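
The steps above can be sketched in a few lines. This is a minimal illustration, not a production design: the two features, the toy training rows, the movie names, and the helper names (`train_logistic`, `like_probability`, `recommend`) are all hypothetical, and a small hand-written logistic regression stands in for whichever classifier is actually chosen.

```python
import math

# Hypothetical toy data: one row per (user, movie) pair.
# Features: [user's like-rate for the movie's genre, movie's overall like-rate]
# Label: 1 = like, 0 = dislike.
TRAIN = [
    ([0.9, 0.8], 1),
    ([0.8, 0.4], 1),
    ([0.7, 0.9], 1),
    ([0.2, 0.7], 0),
    ([0.1, 0.3], 0),
    ([0.3, 0.2], 0),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(data, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(data[0][0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in data:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(data)
    return w, b

def like_probability(model, x):
    """Predicted probability that the user likes the movie."""
    w, b = model
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def recommend(model, candidates, top_k=2):
    """Rank unseen movies by predicted like-probability, highest first."""
    scored = [(title, like_probability(model, feats)) for title, feats in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

model = train_logistic(TRAIN)
candidates = {"Movie A": [0.85, 0.6], "Movie B": [0.15, 0.9], "Movie C": [0.5, 0.5]}
ranking = recommend(model, candidates, top_k=3)
```

Ranking by the predicted probability, rather than by the hard like/dislike label, gives an ordering among the movies predicted as "like", which plays the same role as the predicted rating in the 1-5 setting.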

b) Describe 3 challenges of building a recommender system

Here are three key challenges of building a recommender system:

1. Scalability: As the number of users and items grows, the system must efficiently handle increased data without sacrificing performance, which can be technically demanding.
2. Sparsity: Often, the data available for preferences (like/dislike, ratings) is sparse, meaning most users have interacted with only a small fraction of the total content, making it hard to accurately predict preferences.
3. Cold Start: New users or items in the system have little or no historical data, making it challenging to provide accurate recommendations. This is particularly difficult for new users who haven't expressed any preferences yet.
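
Sparsity and cold start can both be measured directly from the interaction log. The sketch below assumes interactions are stored as `(user, item, label)` tuples; the sample data and the helper names `sparsity` and `cold_start_users` are made up for illustration.

```python
def sparsity(interactions, n_users, n_items):
    """Fraction of the user-item matrix with no recorded interaction."""
    observed = {(user, item) for user, item, _ in interactions}
    return 1.0 - len(observed) / (n_users * n_items)

def cold_start_users(train, test):
    """Users that appear in the test interactions but never in training."""
    seen = {user for user, _, _ in train}
    return sorted({user for user, _, _ in test} - seen)

# Hypothetical like/dislike log: 4 users, 5 movies.
train = [("u1", "m1", 1), ("u1", "m2", 0), ("u2", "m1", 1), ("u3", "m4", 1), ("u3", "m5", 0)]
test = [("u2", "m3", 1), ("u4", "m1", 0)]

train_sparsity = sparsity(train, n_users=4, n_items=5)  # 5 of 20 cells are observed
new_users = cold_start_users(train, test)               # u4 has no training history
```

Here 75% of the matrix is empty even in this tiny example, and `u4` is a cold-start user for whom a model trained on `train` has no history to draw on.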

c) Why is SVD not an option for collaborative filtering?

SVD (Singular Value Decomposition) cannot be applied directly to collaborative filtering because it is defined only for a fully populated matrix, while real-world user-item interaction matrices are sparse (most entries are missing). Filling the gaps with zeros or averages before decomposing would treat those made-up values as real ratings and distort the factors. Matrix factorization methods that fit latent factors to the observed entries only, such as Funk-SVD, are generally used instead for collaborative filtering tasks.
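
To make the difference concrete, here is a minimal Funk-SVD-style sketch: stochastic gradient descent over the observed ratings only, so missing cells never enter the loss. The toy ratings, the hyperparameters, and the function names are assumptions chosen for illustration, not a tuned implementation.

```python
import math
import random

# Hypothetical toy data: (user index, item index, rating).
# Four of the 16 cells in the 4x4 matrix are missing.
RATINGS = [
    (0, 0, 5), (0, 1, 4), (0, 2, 1),
    (1, 0, 4), (1, 1, 5), (1, 3, 1),
    (2, 0, 1), (2, 2, 5), (2, 3, 4),
    (3, 1, 2), (3, 2, 4), (3, 3, 5),
]

def predict(P, Q, u, i):
    """Predicted rating: dot product of user and item latent factors."""
    return sum(pu * qi for pu, qi in zip(P[u], Q[i]))

def funk_svd(ratings, n_users, n_items, rank=2, lr=0.01, reg=0.02, epochs=1000, seed=0):
    """Learn latent factors by SGD on the observed ratings only."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(P, Q, u, i)
            for k in range(rank):
                pu, qi = P[u][k], Q[i][k]
                # Gradient step on squared error with L2 regularization.
                P[u][k] += lr * (err * qi - reg * pu)
                Q[i][k] += lr * (err * pu - reg * qi)
    return P, Q

def train_rmse(P, Q, ratings):
    """Root mean squared error over the observed ratings."""
    return math.sqrt(sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings) / len(ratings))

P0, Q0 = funk_svd(RATINGS, 4, 4, epochs=0)  # untrained factors, for comparison
P, Q = funk_svd(RATINGS, 4, 4)
rmse_before = train_rmse(P0, Q0, RATINGS)
rmse_after = train_rmse(P, Q, RATINGS)
missing_prediction = predict(P, Q, 0, 3)    # user 0 never rated item 3
```

Because the loop iterates over `RATINGS` rather than over every cell of the matrix, the unobserved pairs such as `(0, 3)` only show up at prediction time.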

d) Use the code below to train a recommender system on a dataset of amazon movies

In [None]:
# Note: requires py3.10
import findspark
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, confusion_matrix

from pyspark.sql import SparkSession
from pyspark import SparkConf, SparkContext
from pyspark.ml.recommendation import ALS
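from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import RegressionEvaluator  # used by the optional cross-validation block below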

findspark.init()
conf = SparkConf()
conf.set("spark.executor.memory","28g")
conf.set("spark.driver.memory", "28g")
conf.set("spark.driver.cores", "8")
sc = SparkContext.getOrCreate(conf)
spark = SparkSession.builder.getOrCreate()

init_df = pd.read_csv("./train.csv").dropna()
init_df['UserId_fact'] = init_df['UserId'].astype('category').cat.codes
init_df['ProductId_fact'] = init_df['ProductId'].astype('category').cat.codes

# Split training set into training and testing set
X_train_processed, X_test_processed, Y_train, Y_test = train_test_split(
        init_df.drop(['Score'], axis=1),
        init_df['Score'],
        test_size=1/4.0,
        random_state=0
    )

X_train_processed['Score'] = Y_train
df = spark.createDataFrame(X_train_processed[['UserId_fact', 'ProductId_fact', 'Score']])
als = ALS(
    userCol="UserId_fact",
    itemCol="ProductId_fact",
    ratingCol="Score",
    coldStartStrategy="drop",
    nonnegative=True,
    rank=100
)
# Optional hyperparameter search with cross-validation (left disabled).
# RegressionEvaluator lives in pyspark.ml.evaluation.
# param_grid = (ParamGridBuilder()
#     .addGrid(als.rank, [10, 50])
#     .addGrid(als.regParam, [.1])
#     .addGrid(als.maxIter, [10])
#     .build())
# evaluator = RegressionEvaluator(
#     metricName="rmse",
#     labelCol="Score",
#     predictionCol="prediction")
# cv = CrossValidator(estimator=als, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=3, parallelism=6)
# cv_fit = cv.fit(df)
# rec_sys = cv_fit.bestModel

rec_sys = als.fit(df)
# rec_sys.save('rec_sys.obj') # so we don't have to re-train it
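# ALS does not preserve row order, and coldStartStrategy="drop" removes users/items
# unseen in training, so carry the true Score through transform() instead of
# aligning predictions with Y_test by position.
test_pdf = X_test_processed[['UserId_fact', 'ProductId_fact']].assign(Score=Y_test)
rec = rec_sys.transform(spark.createDataFrame(test_pdf)).toPandas()

# RMSE on the rows that received a prediction (sqrt of MSE works across sklearn versions)
print("Kaggle RMSE = ", mean_squared_error(rec['Score'], rec['prediction']) ** 0.5)

# confusion_matrix needs discrete labels: round predictions to the nearest star in 1..5
pred_labels = rec['prediction'].round().clip(1, 5).astype(int)
cm = confusion_matrix(rec['Score'].astype(int), pred_labels, normalize='true')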
sns.heatmap(cm, annot=True)
plt.title('Confusion matrix of the classifier')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()
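
The evaluation at the end of the cell compares continuous predictions against whole-star scores. As a small stand-alone illustration (with made-up numbers and hypothetical helper names), the sketch below computes RMSE on the raw predictions and then rounds them to star labels before counting a confusion table, since a confusion matrix is only defined for discrete labels.

```python
import math
from collections import Counter

def rmse(y_true, y_pred):
    """Root mean squared error between true scores and continuous predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def to_star(pred, lo=1, hi=5):
    """Round a continuous prediction to the nearest whole star, clipped to [lo, hi]."""
    return min(hi, max(lo, int(math.floor(pred + 0.5))))

def confusion_counts(y_true, y_pred_labels):
    """Count (true, predicted) label pairs."""
    return Counter(zip(y_true, y_pred_labels))

# Made-up example values, not results from the dataset above.
y_true = [5, 4, 1, 3, 5]
y_pred = [4.6, 3.2, 0.4, 3.4, 5.7]

error = rmse(y_true, y_pred)
labels = [to_star(p) for p in y_pred]   # [5, 3, 1, 3, 5]
counts = confusion_counts(y_true, labels)
```

Predictions outside the valid range (0.4 and 5.7 here) are clipped to 1 and 5, so every predicted label is one of the five star classes.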