# Worksheet 17

Name:  
UID:

### Topics

- Recommender Systems

### Recommender Systems

In the movie-recommendation example from class, we used the movie rating as a measure of similarity between users and movies, so the predicted rating for a user serves as a proxy for how strongly a movie should be recommended: the higher the predicted rating, the higher the movie ranks in that user's recommendations.
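As a minimal sketch of the idea above, the snippet below ranks a user's unseen movies by predicted rating. The movie titles, the predicted values, and the helper name `top_n` are all made up for illustration.

```python
def top_n(predicted, seen, n=2):
    """Return the n unseen items with the highest predicted rating."""
    candidates = [(item, score) for item, score in predicted.items() if item not in seen]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in candidates[:n]]

# Hypothetical predicted ratings for one user (illustrative values only).
predicted = {"Alien": 4.6, "Up": 3.1, "Heat": 4.2, "Cars": 2.4}
seen = {"Alien"}  # already watched, so it is excluded from the recommendations

recommendations = top_n(predicted, seen, n=2)
print(recommendations)
```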

a) Consider a streaming platform that only has "like" or "dislike" (no 1-5 rating). Describe how you would build a recommender system in this case.

With only "like" and "dislike" feedback, the recommender becomes a binary classification task instead of a rating prediction task. Using user and movie features such as demographics or genres, models like logistic regression, random forests, or neural networks can predict whether a user will like a given movie. These models are trained on historical like/dislike data and evaluated with measures such as accuracy and F1-score. The predicted probability of a "like" then plays the role the predicted rating played before: the higher it is, the higher the movie is recommended. Finally, create a feedback loop that improves predictions as new interactions arrive, tune the model through feature engineering and hyperparameter tuning, and continuously monitor and update the system to keep up with shifting user preferences.
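The classification approach described above can be sketched with a tiny logistic regression trained by gradient descent. This is only an illustration under stated assumptions: the two features (`genre_match`, `popularity`) and all data values are hypothetical, and the model is written in plain Python so it runs without extra packages. In practice a library implementation such as scikit-learn's `LogisticRegression` would be the usual choice.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def like_probability(w, b, x):
    """Predicted probability that the user likes the item described by x."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Hypothetical features per (user, movie) pair: [genre_match, popularity].
X = [[1, 0.9], [1, 0.4], [1, 0.7], [0, 0.8], [0, 0.2], [0, 0.5], [1, 0.2], [0, 0.9]]
y = [1, 1, 1, 0, 0, 0, 1, 0]  # 1 = like, 0 = dislike

w, b = train_logistic(X, y)
p_match = like_probability(w, b, [1, 0.9])     # movie in a genre the user favors
p_no_match = like_probability(w, b, [0, 0.1])  # movie outside that genre
print(p_match, p_no_match)
```

Candidate movies would then be ranked by `like_probability`, exactly as they were ranked by predicted rating in the 1-5 setting.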

b) Describe 3 challenges of building a recommender system

Data Sparsity: There is insufficient data on how users engage with particular items. Consider a brand-new online store with scant customer information: product recommendations are hard to make because the system cannot yet learn customer preferences accurately.

Cold Start Issue: This arises with brand-new items or users. It is difficult to recommend relevant content to a new user on a streaming platform because they have no watch history. Likewise, it is hard to recommend a new product to customers when it has no interaction data yet.

Balancing Exploration and Exploitation: Recommender systems tend to get stuck recommending content similar to what users already enjoy, which limits exposure to new things and creates a filter bubble. This is the "exploration vs. exploitation" dilemma: the difficulty lies in balancing recommendations of known favorites (exploitation) against exposing users to novel content (exploration).
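One common way to trade off exploration and exploitation is an epsilon-greedy rule, sketched below. This is an illustrative technique, not something prescribed in class: the function name, the `epsilon` value, and the example scores are assumptions.

```python
import random

def epsilon_greedy(scores, catalog, epsilon=0.1, rng=random):
    """Pick the best-scored item most of the time, a random catalog item otherwise."""
    if rng.random() < epsilon:
        return rng.choice(catalog)        # explore: any item, including unscored ones
    return max(scores, key=scores.get)    # exploit: the current favorite

# Hypothetical predicted scores for items the system already knows about.
scores = {"Alien": 4.1, "Heat": 4.6, "Up": 3.0}
catalog = ["Alien", "Heat", "Up", "NewRelease"]  # includes an item with no history

always_exploit = epsilon_greedy(scores, catalog, epsilon=0.0)
always_explore = epsilon_greedy(scores, catalog, epsilon=1.0, rng=random.Random(0))
print(always_exploit, always_explore)
```

A larger `epsilon` surfaces more unfamiliar items (which also helps cold-start items collect interactions) at the cost of showing fewer known favorites.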

c) Why is SVD not an option for collaborative filtering?

Classical SVD requires a fully specified matrix, but in collaborative filtering most user-item ratings are missing, so it cannot be applied directly without first imputing the missing entries. It also scales poorly to large matrices, and incorporating new ratings requires recomputing the decomposition. In practice, variants such as Funk-SVD address these problems: they fit the factors using only the known ratings, allow incremental updates, and cope with partial data, which makes them better suited to dynamic, large-scale recommender systems.
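To make the contrast concrete, here is a minimal Funk-SVD-style sketch that learns user and item factors by stochastic gradient descent over the known ratings only. The toy ratings, the hyperparameters, and the global-mean offset are assumptions chosen for illustration, not a reference implementation.

```python
import random

def predict(mu, P, Q, u, i):
    """Predicted rating: global mean plus the dot product of the latent factors."""
    return mu + sum(pu * qi for pu, qi in zip(P[u], Q[i]))

def funk_svd(ratings, n_users, n_items, rank=2, lr=0.05, reg=0.02, epochs=200, seed=0):
    """Learn factors with SGD, visiting only the observed (user, item, rating) triples."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    P = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(mu, P, Q, u, i)
            for k in range(rank):
                pu, qi = P[u][k], Q[i][k]
                P[u][k] += lr * (err * qi - reg * pu)
                Q[i][k] += lr * (err * pu - reg * qi)
    return mu, P, Q

def rmse(ratings, mu, P, Q):
    return (sum((r - predict(mu, P, Q, u, i)) ** 2 for u, i, r in ratings) / len(ratings)) ** 0.5

# Toy data: 12 known ratings out of 16 possible (user, item) pairs.
ratings = [(0, 0, 5), (0, 1, 4), (0, 2, 1), (1, 0, 4), (1, 1, 5), (1, 3, 1),
           (2, 2, 5), (2, 3, 4), (2, 0, 1), (3, 2, 4), (3, 3, 5), (3, 1, 1)]

baseline = rmse(ratings, *funk_svd(ratings, 4, 4, epochs=0))  # untrained factors
trained = rmse(ratings, *funk_svd(ratings, 4, 4))
print(baseline, trained)
```

Because each update touches only one user vector and one item vector, a newly arrived rating can be folded in with a few extra SGD steps rather than a full recomputation.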

d) Use the code below to train a recommender system on a dataset of Amazon movies.

In [None]:
!pip install findspark

Collecting findspark
  Downloading findspark-2.0.1-py2.py3-none-any.whl (4.4 kB)
Installing collected packages: findspark
Successfully installed findspark-2.0.1


In [None]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488491 sha256=e20cb56b1b3c213bbe6f8f8395fe33c86311bdd18f112460b0d2c02cf2e0c316
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


In [None]:
!ls

sample_data  train.csv


In [None]:
import findspark
import pandas as pd
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

# Locate the Spark installation, then start a session with enlarged memory limits.
findspark.init()
conf = (
    SparkConf()
    .set("spark.executor.memory", "28g")
    .set("spark.driver.memory", "28g")
    .set("spark.driver.cores", "8")
)
sc = SparkContext.getOrCreate(conf)
spark = SparkSession.builder.getOrCreate()

# Encode user and product IDs as integer category codes, since Spark's ALS expects numeric IDs.
init_df = pd.read_csv("./train.csv").dropna()
init_df['UserId_fact'] = init_df['UserId'].astype('category').cat.codes
init_df['ProductId_fact'] = init_df['ProductId'].astype('category').cat.codes


In [None]:
from sklearn.model_selection import train_test_split

X_train_processed, X_test_processed, Y_train, Y_test = train_test_split(
    init_df.drop(['Score'], axis=1),
    init_df['Score'],
    test_size=1/4.0,
    random_state=0
)

# Reattach the labels so the training frame carries the ratings that ALS fits.
X_train_processed['Score'] = Y_train


In [None]:
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import RegressionEvaluator

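# ALS only needs the (user, item, rating) triples.
df = spark.createDataFrame(X_train_processed[['UserId_fact', 'ProductId_fact', 'Score']])
# coldStartStrategy="drop" discards predictions for users/items unseen during fitting,
# which keeps NaN predictions out of the cross-validation metric.
als = ALS(userCol="UserId_fact", itemCol="ProductId_fact", ratingCol="Score", coldStartStrategy="drop", nonnegative=True)

# Search over the number of latent factors (rank) and the regularization strength.
param_grid = ParamGridBuilder()\
    .addGrid(als.rank, [50, 100])\
    .addGrid(als.regParam, [0.05, 0.1])\
    .build()

# 3-fold cross-validation on RMSE; keep the best-scoring model.
evaluator = RegressionEvaluator(metricName="rmse", labelCol="Score")
cv = CrossValidator(estimator=als, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=3)
rec_sys = cv.fit(df).bestModel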


In [None]:

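from sklearn.metrics import mean_squared_error

# `rec` was never defined in the original cell. Assumption: it holds the tuned
# model's predictions for the held-out (user, product) pairs, as a pandas frame.
test_pairs = X_test_processed[['UserId_fact', 'ProductId_fact']].drop_duplicates()
rec = rec_sys.transform(spark.createDataFrame(test_pairs)).toPandas()

# coldStartStrategy="drop" removes pairs whose user or product was unseen in
# training, so `rec` can have fewer rows than there are test pairs.
if len(rec) != len(test_pairs):
    print("Warning: Mismatched number of predictions and test instances.")

# Join on the ID columns instead of assigning by index: the test frame keeps its
# shuffled original index, while `rec` is indexed 0..n-1.
merged = X_test_processed.merge(rec, on=['UserId_fact', 'ProductId_fact'], how='left')
X_test_processed['predicted_Score'] = merged['prediction'].to_numpy()

# Fall back to the mean training score wherever no prediction is available.
mean_score = Y_train.mean()
if X_test_processed['predicted_Score'].isna().any() or Y_test.isna().any():
    print("Error: NaN values found in predictions or actual scores.")
    X_test_processed['predicted_Score'] = X_test_processed['predicted_Score'].fillna(mean_score)

try:
    # Take the square root explicitly; the `squared=False` argument is not
    # available in all scikit-learn versions.
    rmse = mean_squared_error(Y_test, X_test_processed['predicted_Score']) ** 0.5
    print("Kaggle RMSE = ", rmse)
except ValueError as e:
    print("Error during RMSE calculation:", e)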


Error: NaN values found in predictions or actual scores.
Kaggle RMSE =  1.2638309053449663
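
The evaluation above fills missing predictions with the mean training score before computing RMSE. The arithmetic can be sketched in plain Python as follows. The ratings, the predictions, and the helper name `rmse_with_fallback` are made up for illustration and are unrelated to the result reported above.

```python
import math

def rmse_with_fallback(actual, predicted, fallback):
    """RMSE where missing predictions (None) are replaced by a fallback value."""
    filled = [fallback if p is None else p for p in predicted]
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, filled)) / len(actual))

# Hypothetical values: the second prediction is missing (a cold-start pair).
actual = [5.0, 3.0, 4.0, 1.0]
predicted = [4.5, None, 4.0, 2.0]
train_mean = 3.5

score = rmse_with_fallback(actual, predicted, train_mean)
print(score)
```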
