 <img src="./img/uva_seal.png"> 

## Recommender Systems

### University of Virginia
### DS 7200: Distributed Computing
### Last Updated: October 10, 2025

---

#### Sources:

- Advanced Analytics with Spark: Chapter 3
- [Recommendation engine with Amazon Personalize](https://aws.amazon.com/blogs/architecture/automating-recommendation-engine-training-with-amazon-personalize-and-aws-glue/)
- [Non-negative matrix factorization](https://en.wikipedia.org/wiki/Non-negative_matrix_factorization)

#### Objectives
Introduction to recommender systems

#### Concepts

- Required data: user data, item data, interaction data
- Collaborative filtering
- Alternating least squares
- Implicit vs. explicit feedback
- Exploration vs. exploitation
- Finding similar users with cosine similarity

---

#### I. Introduction

Recommender systems are a major application of AI.

They have found widespread use across domains to recommend products to users:

- Amazon uses them in their e-commerce platform to promote products to users
- Netflix recommends streaming content to viewers
- Education technology companies build systems to recommend articles/blogs/videos to students and teachers

---

#### II. Required Data

Three datasets are required for recommender systems:

- **User Data**  
USER_ID (string), metadata fields

- **Item Data**  
ITEM_ID (string), metadata fields

- **Interaction Data**  
USER_ID (string), ITEM_ID (string), TIMESTAMP

---

**Collecting Interactions: Implicit vs. Explicit Feedback**

Recommender systems use interactions between users and products to make relevant recommendations.

Assumption: historical interactions will be useful for predicting future preferences.

**Implicit feedback** can be collected from activity, such as listens, clicks and purchases.  
- often automated with *event triggers*
- relatively easy to collect

**Explicit feedback** is often collected by asking the user to rate a product.  
- generally harder to collect
- might not match the true feeling of the user  
  Example: a user says they prefer classical music (explicit), yet all of their recent listens are hard rock (implicit)

In practice, implicit feedback can work better in recommender systems.  
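
As a minimal sketch of how implicit feedback might be turned into interaction data, the snippet below counts events per user-item pair. The event records and the function name `count_interactions` are hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical event log: (USER_ID, ITEM_ID, TIMESTAMP) records,
# e.g., one row per click, listen, or purchase.
events = [
    ("u1", "rock_song", 1),
    ("u1", "rock_song", 2),
    ("u1", "classical_song", 3),
    ("u2", "rock_song", 4),
]

def count_interactions(events):
    """Count how many times each user interacted with each item."""
    return Counter((user, item) for user, item, _ts in events)

counts = count_interactions(events)
print(counts[("u1", "rock_song")])  # 2
```

The resulting counts can serve as the "ratings" in an implicit-feedback model, with larger counts read as stronger evidence of preference.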

---

#### III. Common Algorithms

**A. Collaborative Filtering**  

The concept is simple and appealing: based on user preferences for items, recommend similar items the user is likely to prefer.

Users assign items a rating (e.g., number of stars).

The interaction matrix is generally sparse: most users interact with only a small fraction of the items.

We need a similarity metric so that, for item *i*, we can return the top *k* most similar items.
A popular choice is *cosine similarity*.

The cosine of two non-zero vectors **A** and **B** can be derived by using the Euclidean dot product formula:


 <img src="./img/cosine_sim.png"> 

Vectors that are closer (smaller angle) have higher cosine and higher similarity.
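
The sketch below computes cosine similarity with the standard formula $\cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|}$ and uses it to return the top *k* most similar items. The rating vectors and the helper name `top_k_similar` are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_similar(target, vectors, k):
    """Return the k keys in `vectors` most similar to `target`, excluding itself."""
    scores = [
        (cosine_similarity(vectors[target], vec), name)
        for name, vec in vectors.items()
        if name != target
    ]
    scores.sort(reverse=True)
    return [name for _score, name in scores[:k]]

# Hypothetical data: each item is a vector of ratings from three users
# (0 means the user did not rate the item).
item_vectors = {
    "item_a": [5, 4, 0],
    "item_b": [4, 5, 0],
    "item_c": [0, 1, 5],
}

print(top_k_similar("item_a", item_vectors, k=1))  # ['item_b']
```

The same function finds similar users if each vector holds one user's ratings across items instead.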

See [here](https://en.wikipedia.org/wiki/Collaborative_filtering) for more details.

---

**B. Alternating Least Squares (ALS)**  

Alternating Least Squares (ALS) is popular for recommendation as it scales well.

It uses **latent factors**, which are factors that are unobservable and small in number.  
In essence, the factors compress data from a high dimensional space into a much lower dimension (think PCA).

The user-item (or interaction) matrix **𝑅** is represented by the two smaller matrices **𝑈** and **𝑉**, which, when multiplied, approximately reconstruct **𝑅**.


 <img src="./img/R.png" width=500> 

 <img src="./img/UV.png" width=500> 

The matrices **U** and **V** need to be estimated.

**Strategy:**
 
1. Specify an objective: minimize the sum of squared errors plus a regularization term:

 <img src="./img/objective.png" width=500> 

2. Fix $V$ and solve for $U$ by applying ridge regression
3. Fix $U$ and solve for $V$ by applying ridge regression
4. Repeat steps 2 and 3 until convergence or a maximum number of iterations

Steps 2 and 3 give alternating least squares its name.
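
The strategy can be sketched in NumPy as follows. This is a small single-machine illustration, not Spark's implementation. It assumes the objective is the squared error over observed entries plus an L2 penalty on the factors. The function name `als`, its parameters, and the toy matrix are all hypothetical.

```python
import numpy as np

def als(R, mask, rank=2, reg=0.1, n_iters=20, seed=0):
    """Factor R (users x items) into U (users x rank) and V (items x rank).

    Only entries where mask is True count toward the squared error.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_items, rank))
    ridge = reg * np.eye(rank)

    for _ in range(n_iters):
        # Fix V: solve a ridge regression for each user's row of U
        for u in range(n_users):
            Vo = V[mask[u]]  # factors of the items this user rated
            U[u] = np.linalg.solve(Vo.T @ Vo + ridge, Vo.T @ R[u, mask[u]])
        # Fix U: solve a ridge regression for each item's row of V
        for i in range(n_items):
            Uo = U[mask[:, i]]  # factors of the users who rated this item
            V[i] = np.linalg.solve(Uo.T @ Uo + ridge, Uo.T @ R[mask[:, i], i])
    return U, V

# Hypothetical 4-user x 3-item rating matrix; 0 marks a missing rating
R = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 0.0, 1.0],
    [1.0, 1.0, 5.0],
    [0.0, 1.0, 4.0],
])
mask = R > 0

U, V = als(R, mask)
R_hat = U @ V.T  # approximate reconstruction, including the missing cells
rmse = np.sqrt(np.mean((R - R_hat)[mask] ** 2))
print("RMSE on observed entries:", float(rmse))
```

Each row update depends only on the fixed matrix and that row's own observed ratings, which is why the inner loops can be distributed across workers.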

---

Once **U** and **V** have been estimated, we can predict ratings as follows:

 <img src="./img/prediction.png" width=500> 

---

**Cold Start Problem** 

How do we make recommendations for new users? We are missing the row.  

How do we recommend new products? We are missing the column.  

This is the *cold start problem*.

Common strategies:

- Use global averages (e.g., mean rating) for unseen users/items

- Use content-based features (item metadata or user attributes) to estimate latent factors

- Use hybrid approaches that combine ALS with content-based or popularity-based recommendations


 <img src="./img/cold_start.png" width=500> 
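
The first strategy, falling back to a global average, might look like the sketch below. The function `predict_with_fallback`, the toy ratings, and the stand-in model are all hypothetical.

```python
def predict_with_fallback(user, item, predict_fn, known_users, known_items, global_mean):
    """Use the model for known user-item pairs; otherwise return the global mean."""
    if user in known_users and item in known_items:
        return predict_fn(user, item)
    return global_mean

# Hypothetical training ratings keyed by (user, item)
ratings = {("u1", "i1"): 5.0, ("u1", "i2"): 3.0, ("u2", "i1"): 4.0}
known_users = {u for u, _ in ratings}
known_items = {i for _, i in ratings}
global_mean = sum(ratings.values()) / len(ratings)

# Stand-in for a fitted model: returns the stored rating if there is one
def toy_model(user, item):
    return ratings.get((user, item), global_mean)

args = (toy_model, known_users, known_items, global_mean)
print(predict_with_fallback("u1", "i1", *args))        # 5.0
print(predict_with_fallback("new_user", "i1", *args))  # 4.0
```

A per-item mean (for new users) or per-user mean (for new items) is a natural refinement when one side of the pair is known.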

---

#### IV. Recommendation in Spark

**A. Workflow**  

Spark MLlib supports recommendation algorithms (both the RDD and DataFrame APIs).  

Relevant code for the ALS algorithm:
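```python
# Imports for the DataFrame-based ALS API and the evaluation metric
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

# ratings_df is assumed to be an existing DataFrame with columns
# user_id, item_id, and rating. Spark's ALS expects integer-valued IDs,
# so string IDs need to be indexed first (e.g., with StringIndexer).
train_df, test_df = ratings_df.randomSplit([0.8, 0.2], seed=42)

# Instantiate the model
als = ALS(
    userCol="user_id",
    itemCol="item_id",
    ratingCol="rating",
    coldStartStrategy="drop"  # drop rows whose prediction would be NaN
)

# Fit on the training split
model = als.fit(train_df)

# Predict on held-out interactions rather than the training data
predictions = model.transform(test_df)

# Evaluate with root mean squared error
evaluator = RegressionEvaluator(
    metricName="rmse", labelCol="rating", predictionCol="prediction"
)
rmse = evaluator.evaluate(predictions)
```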

---

**B. ALS Hyperparameters**  

Some important ALS hyperparameters in the Spark implementation:

- `rank`  
The number of latent factors in the model, or equivalently, the number of columns $k$ in the user-feature and product-feature matrices.

- `maxIter`   
The number of iterations that the factorization runs. More iterations take more time but may produce a better factorization.

- `regParam`  
The regularization parameter. Higher values resist overfitting, but values that are too high hurt the factorization’s accuracy.

- `alpha`  
Controls the relative weight of observed versus unobserved user-product interactions in the factorization. Applies only to implicit feedback (`implicitPrefs=True`).

- `coldStartStrategy`  
Set to `"drop"` to remove rows with `NaN` predictions, which arise for users/items not seen during training. The default is `"nan"`.
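
To make the role of `alpha` more concrete: in the implicit-feedback formulation that Spark's ALS follows (Hu, Koren, and Volinsky), each raw interaction count is converted into a binary preference and a confidence weight of `1 + alpha * count`. The sketch below illustrates that mapping. The function name and the value `alpha=40` are chosen for illustration only.

```python
def preference_and_confidence(count, alpha):
    """Map a raw interaction count to (binary preference, confidence weight)."""
    preference = 1.0 if count > 0 else 0.0
    confidence = 1.0 + alpha * count
    return preference, confidence

# Unobserved pair: preference 0 with baseline confidence 1
print(preference_and_confidence(0, alpha=40))  # (0.0, 1.0)

# Observed pair with 3 interactions: preference 1 with much higher confidence
print(preference_and_confidence(3, alpha=40))  # (1.0, 121.0)
```

A larger `alpha` makes the model trust observed interactions more strongly relative to the unobserved ones.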

---

**C. Spark Computing Details**  

*Cold Start*  
Spark does not automatically generate latent factors for unseen users/items, so their predictions are `NaN` unless `coldStartStrategy="drop"` is set.

*Distributed computing*  
ALS in Spark is designed for parallel updates:

- Each row of 𝑈 (user factors) can be solved independently across workers

- Each row of 𝑉 (item factors) can also be solved independently

- Uses block partitioning of matrices to reduce network communication and scale to large datasets

- Scales to millions of users/items

---

#### V. Amazon Personalize and Example Architecture Diagram

Amazon Personalize is a fully-featured recommender system.  

It includes different algorithms called *recipes*. See [here](https://docs.aws.amazon.com/personalize/latest/dg/native-recipe-new-item-USER_PERSONALIZATION.html) for more details.

It provides a parameter that controls *exploration* versus *exploitation*:  

**Explore**: items with less interaction data or lower relevance are recommended more frequently  
**Exploit**: recommendations are based on what is already known about relevance

Personalize tests different item recommendations, learns from user interactions, and boosts recommendations for items driving better engagement.
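
The explore/exploit trade-off can be illustrated with a generic epsilon-greedy rule. This is not Personalize's actual mechanism, which is not described here. It is only a sketch of the idea, and the function name, the `epsilon` parameter, and the scores are hypothetical.

```python
import random

def recommend(scores, epsilon, rng):
    """With probability epsilon explore a random item; otherwise exploit the top-scored item."""
    if rng.random() < epsilon:
        return rng.choice(sorted(scores))  # explore: any item, regardless of score
    return max(scores, key=scores.get)     # exploit: the best-known item

# Hypothetical relevance scores learned from past interactions
scores = {"popular_item": 0.9, "niche_item": 0.4, "new_item": 0.1}
rng = random.Random(0)

print(recommend(scores, epsilon=0.0, rng=rng))  # popular_item (pure exploitation)
print(recommend(scores, epsilon=1.0, rng=rng))  # a randomly chosen item (pure exploration)
```

Values of `epsilon` between 0 and 1 mix the two behaviors, which gives new items a chance to accumulate interaction data.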

**Example of high-level architecture for a retail customer**  
Source: Amazon Web Services

 <img src="./img/amazon_personalize.png"> 

---