# Worksheet 17

Name:  Xincheng Han
UID: U19101927

### Topics

- Recommender Systems

### Recommender Systems

In the in-class example of recommending movies to users, we used the movie rating as a measure of similarity between users and movies, so the predicted rating for a user is a proxy for how highly a movie should be recommended: the higher the predicted rating, the stronger the recommendation.

a) Consider a streaming platform that only has "like" or "dislike" (no 1-5 rating). Describe how you would build a recommender system in this case.

To build a recommender system for a platform that only records "like" or "dislike", treat each interaction as a binary label (like = 1, dislike = 0) and keep unrated titles as missing rather than counting them as dislikes. Collect these interactions along with user demographics and content metadata, and engineer features from them. Then either train a binary classifier such as logistic regression or a decision tree on the user and content features, or apply collaborative filtering to the binary user-item matrix to exploit similarities among users or among titles. Evaluate the model on held-out interactions with classification metrics such as accuracy and precision, use it to predict the probability that a user likes each unwatched title, and recommend the titles with the highest predicted probability. Retrain or update the model regularly with new feedback so that it stays accurate and relevant.
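As a minimal sketch of the collaborative filtering option, the snippet below does item-based filtering on like/dislike votes using only the standard library. The user names, titles, and the helper names `item_vectors`, `cosine`, and `recommend` are hypothetical, and encoding a dislike as -1 is one possible choice rather than the only one.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical toy feedback: +1 = like, -1 = dislike.
# A missing (user, title) pair means "unknown", not "dislike".
feedback = {
    "ana":  {"Alien": 1, "Heat": 1, "Up": -1},
    "ben":  {"Alien": 1, "Heat": 1, "Blade Runner": 1},
    "cara": {"Up": 1, "Coco": 1, "Alien": -1},
    "dev":  {"Up": 1, "Coco": 1, "Heat": -1},
}

def item_vectors(feedback):
    """Invert user -> title votes into title -> {user: vote} vectors."""
    vectors = defaultdict(dict)
    for user, votes in feedback.items():
        for item, vote in votes.items():
            vectors[item][user] = vote
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse +1/-1 vectors stored as dicts."""
    dot = sum(a[u] * b[u] for u in a.keys() & b.keys())
    norm = sqrt(len(a)) * sqrt(len(b))  # every stored entry has magnitude 1
    return dot / norm if norm else 0.0

def recommend(user, feedback, k=2):
    """Rank unseen titles by the similarity-weighted sum of the user's own votes."""
    vectors = item_vectors(feedback)
    seen = feedback[user]
    scores = {
        item: sum(vote * cosine(vec, vectors[j]) for j, vote in seen.items())
        for item, vec in vectors.items()
        if item not in seen
    }
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    # Only recommend titles with a positive score (more "like" evidence than "dislike").
    return [item for item, score in ranked[:k] if score > 0]

print(recommend("ana", feedback))  # ['Blade Runner']
```

Here "ana" shares her likes with "ben", so his extra like is recommended, while "Coco" is suppressed because the users who liked it disagree with her elsewhere.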

b) Describe 3 challenges of building a recommender system

Three key challenges are cold start, scalability, and diversity. With cold start, new users or items have too little interaction data for accurate recommendations, so the first recommendations they receive or appear in are often poor. With scalability, the system must keep serving fast, high-quality recommendations as the number of users and items grows. Finally, a system tuned only for accuracy tends to recommend more of what a user already knows; offering varied and unexpected content (diversity and serendipity) can increase engagement, but it requires explicitly balancing well-established preferences against novel suggestions.

c) Why is SVD not an option for collaborative filtering?

Classical SVD (Singular Value Decomposition) is defined only for fully observed matrices, whereas a user-item rating matrix consists mostly of missing entries, because each user rates only a tiny fraction of the items. To apply SVD directly, the missing entries must first be imputed (for example with zeros or mean ratings), which treats "not rated" as an actual rating and biases the resulting factors. Computing a full SVD is also expensive on very large, frequently changing matrices; Truncated SVD lowers that cost but still needs a complete matrix. In practice, collaborative filtering therefore relies on specialized matrix factorization techniques that fit the latent factors only on the observed ratings, such as ALS or SGD-based factorization.
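To make the contrast with SVD concrete, here is a minimal sketch of matrix factorization trained by stochastic gradient descent on the observed ratings only, so missing entries are never imputed. The function names `factorize` and `rmse`, the toy ratings, and the hyperparameter values are all assumptions chosen for illustration.

```python
import random

def factorize(ratings, n_users, n_items, rank=2, lr=0.05, reg=0.02, epochs=500, seed=0):
    """Fit user factors P and item factors Q by SGD over observed (user, item, rating) triples."""
    rng = random.Random(seed)
    P = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:  # only observed entries contribute to the loss
            err = r - sum(pu * qi for pu, qi in zip(P[u], Q[i]))
            for f in range(rank):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)  # gradient step with L2 regularization
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def rmse(P, Q, ratings):
    """Root mean squared error of the factor model on a list of rating triples."""
    sq = [(r - sum(pu * qi for pu, qi in zip(P[u], Q[i]))) ** 2 for u, i, r in ratings]
    return (sum(sq) / len(sq)) ** 0.5

# Hypothetical 4x4 rating matrix with 12 of 16 entries observed.
observed = [
    (0, 0, 5), (0, 2, 1),
    (1, 0, 5), (1, 1, 5), (1, 2, 1), (1, 3, 1),
    (2, 0, 1), (2, 1, 1), (2, 2, 5), (2, 3, 5),
    (3, 1, 1), (3, 3, 5),
]
P, Q = factorize(observed, n_users=4, n_items=4)
print("training RMSE:", rmse(P, Q, observed))
```

The ALS estimator used in part d) optimizes the same kind of objective, but alternates closed-form least-squares updates for the user and item factors instead of taking gradient steps.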

d) Use the code below to train a recommender system on a dataset of Amazon movies.

In [6]:
# Note: requires py3.10
import findspark
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, confusion_matrix

from pyspark.sql import SparkSession
from pyspark import SparkConf, SparkContext
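from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator  # needed only by the optional cross-validation block
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator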

findspark.init()
conf = SparkConf()
conf.set("spark.executor.memory","28g")
conf.set("spark.driver.memory", "28g")
conf.set("spark.driver.cores", "8")
sc = SparkContext.getOrCreate(conf)
spark = SparkSession.builder.getOrCreate()

init_df = pd.read_csv("./train.csv").dropna()
init_df['UserId_fact'] = init_df['UserId'].astype('category').cat.codes
init_df['ProductId_fact'] = init_df['ProductId'].astype('category').cat.codes

# Split training set into training and testing set
X_train_processed, X_test_processed, Y_train, Y_test = train_test_split(
        init_df.drop(['Score'], axis=1),
        init_df['Score'],
        test_size=1/4.0,
        random_state=0
    )

X_train_processed['Score'] = Y_train
df = spark.createDataFrame(X_train_processed[['UserId_fact', 'ProductId_fact', 'Score']])
als = ALS(
    userCol="UserId_fact",
    itemCol="ProductId_fact",
    ratingCol="Score",
    coldStartStrategy="drop",
    nonnegative=True,
    rank=100
)
# Optional: tune hyperparameters with 3-fold cross-validation instead of the single fit below.
# param_grid = (
#     ParamGridBuilder()
#     .addGrid(als.rank, [10, 50])
#     .addGrid(als.regParam, [0.1])
#     .addGrid(als.maxIter, [10])
#     .build()
# )
# evaluator = RegressionEvaluator(
#     metricName="rmse",
#     labelCol="Score",
#     predictionCol="prediction",
# )
# cv = CrossValidator(estimator=als, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=3, parallelism=6)
# cv_fit = cv.fit(df)
# rec_sys = cv_fit.bestModel

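rec_sys = als.fit(df)
# rec_sys.save('rec_sys.obj') # so we don't have to re-train it

# Spark does not guarantee row order, and coldStartStrategy="drop" removes rows whose
# user or product was unseen in training, so join predictions back on the key columns
# rather than assigning them by position.
keys = ['UserId_fact', 'ProductId_fact']
rec = rec_sys.transform(spark.createDataFrame(X_test_processed[keys].drop_duplicates())).toPandas()
pred = X_test_processed[keys].merge(rec, on=keys, how='left')['prediction']
# Fall back to the mean training score for the dropped (cold-start) rows.
X_test_processed['Score'] = pred.fillna(Y_train.mean()).values

# Take the square root explicitly: `squared=False` was deprecated and later removed in newer scikit-learn releases.
print("Kaggle RMSE = ", mean_squared_error(Y_test, X_test_processed['Score']) ** 0.5)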

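# confusion_matrix needs discrete labels, so round predictions to the nearest star (assumes Score is a 1-5 rating).
star_labels = [1, 2, 3, 4, 5]
pred_labels = X_test_processed['Score'].round().clip(1, 5).astype(int)
cm = confusion_matrix(Y_test.astype(int), pred_labels, labels=star_labels, normalize='true')
sns.heatmap(cm, annot=True, xticklabels=star_labels, yticklabels=star_labels)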
plt.title('Confusion matrix of the classifier')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

PySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number.
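
This error means PySpark could not start its JVM. A common cause is that no Java runtime is installed or that `JAVA_HOME` points to the wrong location, though the output above does not confirm which applies here. The sketch below is one hedged way to check what this Python process can see before calling `SparkContext.getOrCreate`; the helper name `java_diagnostics` is hypothetical.

```python
import os
import shutil

def java_diagnostics():
    """Report where, if anywhere, a Java runtime is visible to this Python process."""
    return {
        "java_on_path": shutil.which("java"),      # None if no `java` executable is on PATH
        "JAVA_HOME": os.environ.get("JAVA_HOME"),  # None if the variable is unset
    }

print(java_diagnostics())
```

If both values are `None`, installing a JDK supported by the installed Spark version and setting `JAVA_HOME` before the notebook kernel starts is a reasonable first thing to try.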