 <img src="uva_seal.png"> 

## Recommender Systems

### University of Virginia
### DS 7200: Distributed Computing
### Last Updated: October 8, 2024

---

#### Sources:

- Advanced Analytics with Spark: Chapter 3
- [Recommendation engine with Amazon Personalize](https://aws.amazon.com/blogs/architecture/automating-recommendation-engine-training-with-amazon-personalize-and-aws-glue/)
- [Non-negative matrix factorization](https://en.wikipedia.org/wiki/Non-negative_matrix_factorization)

#### Objectives
Introduction to recommender systems

#### Concepts

- Required data: user data, item data, interaction data
- Collaborative filtering
- Alternating least squares
- Implicit vs. explicit feedback
- Exploration vs. exploitation
- Finding similar users with cosine similarity

---

#### Introduction

Recommender systems are a major application of AI.

They have found widespread use across domains to recommend products to users:

- Amazon uses them in their e-commerce platform to promote products to users
- Netflix recommends streaming content to viewers
- Education technology companies build systems to recommend articles/blogs/videos to students and teachers

---

#### Required Data

Three datasets are required for recommender systems:

- **User Data**  
USER_ID (string), metadata fields

- **Item Data**  
ITEM_ID (string), metadata fields

- **Interaction Data**  
USER_ID (string), ITEM_ID (string), TIMESTAMP
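
As a minimal sketch of how the three datasets fit together, the records below use made-up values (the IDs, metadata fields, and timestamps are hypothetical, and the timestamp is assumed to be Unix epoch seconds). The interaction records are the only ones that link a user to an item, so each one should reference a known `USER_ID` and `ITEM_ID`.

```python
from dataclasses import dataclass


@dataclass
class User:
    user_id: str
    metadata: dict


@dataclass
class Item:
    item_id: str
    metadata: dict


@dataclass
class Interaction:
    user_id: str
    item_id: str
    timestamp: int  # assumed here to be Unix epoch seconds


# Hypothetical example records
users = [User("u1", {"age_group": "18-24"}), User("u2", {"age_group": "35-44"})]
items = [Item("i1", {"genre": "jazz"}), Item("i2", {"genre": "rock"})]
interactions = [
    Interaction("u1", "i1", 1700000000),
    Interaction("u1", "i2", 1700000300),
    Interaction("u2", "i2", 1700000600),
]

# Keep only interactions that reference a known user and a known item.
known_users = {u.user_id for u in users}
known_items = {i.item_id for i in items}
valid = [
    x for x in interactions
    if x.user_id in known_users and x.item_id in known_items
]
print(len(valid), "of", len(interactions), "interactions are valid")
```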

---

**Collecting Interactions: Implicit vs. Explicit Feedback**

Recommender systems use interactions between users and products to make relevant recommendations.

Assumption: historical interactions will be useful for predicting future preferences.

**Implicit feedback** can be collected from activity, such as listens, clicks and purchases.  
- often automated with *event triggers*
- relatively easy to collect

**Explicit feedback** is often collected by asking the user to rate a product.  
- generally harder to collect
- might not match the user's true preferences  
  Example: a user reports preferring classical music (explicit), yet all of their recent listens are hard rock (implicit)

In practice, implicit feedback can work better in recommender systems.  
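
One way implicit feedback might be turned into interaction strengths is to count events per user-item pair. The sketch below assumes a hypothetical event log and hypothetical per-event weights (`EVENT_WEIGHTS`); a real system would choose its own events and weights.

```python
from collections import Counter

# Hypothetical event log: (user_id, item_id, event_type)
events = [
    ("u1", "song_a", "listen"),
    ("u1", "song_a", "listen"),
    ("u1", "song_b", "click"),
    ("u2", "song_a", "purchase"),
    ("u2", "song_b", "listen"),
    ("u2", "song_b", "listen"),
    ("u2", "song_b", "listen"),
]

# Assumed weights: a purchase counts more than a click or a listen.
EVENT_WEIGHTS = {"click": 1, "listen": 1, "purchase": 3}


def implicit_strength(events, weights):
    """Sum the event weights for each (user, item) pair."""
    strength = Counter()
    for user_id, item_id, event_type in events:
        strength[(user_id, item_id)] += weights.get(event_type, 0)
    return dict(strength)


strengths = implicit_strength(events, EVENT_WEIGHTS)
for pair, value in sorted(strengths.items()):
    print(pair, value)
```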

---

#### Common Algorithms

**Collaborative Filtering**  

The concept is simple and appealing: based on users' preferences for items, recommend similar items that the user is likely to prefer.

Users assign items a rating (e.g., a number of stars).

The interaction matrix is generally sparse: most users interact with only a small fraction of the items.

A similarity metric is needed so that, for item *i*, the system can return the top *k* most similar items.  
A popular choice is *cosine similarity*.

The cosine of two non-zero vectors **A** and **B** can be derived by using the Euclidean dot product formula:


 <img src="cosine_sim.png"> 

Vectors that are closer (smaller angle) have higher cosine and higher similarity.

See [here](https://en.wikipedia.org/wiki/Collaborative_filtering) for more details.
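
The similarity computation can be sketched in plain Python as follows. The item vectors are hypothetical (each position holds one user's rating, with 0 standing in for "no rating", which is a simplification), and returning 0.0 for a zero vector is a convention chosen here, since the cosine is undefined in that case.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here; cosine is undefined for zero vectors
    return dot / (norm_a * norm_b)


def top_k_similar(target, vectors, k):
    """Return the k items most similar to `target` as (name, score) pairs."""
    scores = [
        (name, cosine_similarity(vectors[target], vec))
        for name, vec in vectors.items()
        if name != target
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Hypothetical item vectors: one rating per user, 0 = no rating
item_vectors = {
    "item_a": [5, 4, 0, 1],
    "item_b": [4, 5, 0, 0],
    "item_c": [0, 0, 5, 4],
    "item_d": [1, 0, 4, 5],
}

top = top_k_similar("item_a", item_vectors, k=2)
for name, score in top:
    print(name, round(score, 3))
```

The same function can compare users instead of items by building one vector per user (one position per item).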

---

**Alternating Least Squares (ALS)**  

Alternating Least Squares (ALS) is popular for recommendation as it scales well.

It uses *latent factors*, which are factors that are unobservable and small in number.  
In essence, the factors compress data from a high dimensional space into a much lower dimension (think PCA).

ALS estimates a *matrix factorization*. When the factors are constrained to be non-negative, the method is a form of *non-negative matrix factorization* (NMF).  
Non-negative here because the factor matrices have no negative entries, so the approximation obtained by multiplying them also has no negative entries. This suits data such as the number of listens to a song, which is always 0 or greater.

**Non-Negative Matrix Factorization Example**  
Source: Wikipedia  

The matrix **V** is represented by the two smaller matrices **W** and **H**, which, when multiplied, approximately reconstruct **V**.


 <img src="nnf.png"> 

The matrices **W** and **H** need to be estimated. This is done iteratively, with the two estimates updated in alternating fashion: hold **W** fixed and solve for **H**, then hold **H** fixed and solve for **W**, and repeat.  
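
The alternating procedure can be sketched in a few lines of plain Python. This is a deliberately simplified illustration, not Spark's implementation: it assumes a single latent factor (rank 1, so **W** is one column `w` and **H** is one row `h`), treats every entry of **V** as observed, and uses a hypothetical regularization argument `reg`. With one factor, each least-squares step has a simple closed form.

```python
def als_rank1(V, iterations=20, reg=0.01):
    """Approximate V (a list of rows) by the outer product of vectors w and h."""
    m, n = len(V), len(V[0])
    w = [1.0] * m  # one latent factor per row (e.g., per user)
    h = [1.0] * n  # one latent factor per column (e.g., per item)
    for _ in range(iterations):
        # Hold h fixed: each w[i] has a closed-form least-squares solution.
        h_norm = sum(x * x for x in h) + reg
        w = [sum(V[i][j] * h[j] for j in range(n)) / h_norm for i in range(m)]
        # Hold w fixed: solve for each h[j] the same way.
        w_norm = sum(x * x for x in w) + reg
        h = [sum(V[i][j] * w[i] for i in range(m)) / w_norm for j in range(n)]
    return w, h


def squared_error(V, w, h):
    """Sum of squared differences between V and its rank-1 approximation."""
    return sum(
        (V[i][j] - w[i] * h[j]) ** 2
        for i in range(len(V))
        for j in range(len(V[0]))
    )


# A small matrix that is exactly rank 1, so one factor can reconstruct it well
V = [
    [2.0, 4.0, 6.0],
    [1.0, 2.0, 3.0],
    [3.0, 6.0, 9.0],
]

w, h = als_rank1(V)
print("squared reconstruction error:", squared_error(V, w, h))
```

With more than one factor, each step becomes a small regularized linear system per row or column instead of a single division.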

---

#### Amazon Personalize and Example Architecture Diagram

Amazon Personalize is a fully-featured recommender system.  

It includes several algorithms, called *recipes*. See [here](https://docs.aws.amazon.com/personalize/latest/dg/native-recipe-new-item-USER_PERSONALIZATION.html) for more details.

It provides a parameter that controls *exploration* versus *exploitation*:  

**Explore**: items with less interaction data or lower relevance are recommended more frequently  
**Exploit**: recommendations are based on what is already known about relevance

Personalize tests different item recommendations, learns from user interactions, and boosts recommendations for items driving better engagement.
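
The explore/exploit trade-off can be illustrated with a generic *epsilon-greedy* rule. This is not how Amazon Personalize works internally (the notes do not describe its mechanism); it is a standard textbook technique used here only to show the idea. The relevance scores and the name `epsilon` are hypothetical.

```python
import random


def recommend(scores, epsilon, rng):
    """Explore with probability epsilon, otherwise exploit the best-known item."""
    if rng.random() < epsilon:
        # Explore: pick any item uniformly, including ones with little data.
        return rng.choice(sorted(scores))
    # Exploit: pick the item with the highest known relevance score.
    return max(scores, key=scores.get)


# Hypothetical relevance scores; item_new has no interaction history yet
scores = {"item_a": 0.9, "item_b": 0.5, "item_new": 0.0}

rng = random.Random(0)  # seeded so the run is repeatable
picks = [recommend(scores, epsilon=0.3, rng=rng) for _ in range(1000)]
for name in sorted(scores):
    print(name, picks.count(name))
```

Setting `epsilon` to 0 always recommends the top-scoring item, so `item_new` would never be shown and the system would never learn whether users like it.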

**Example of high-level architecture for a retail customer**  
Source: Amazon Web Services

 <img src="amazon_personalize.png"> 

---

#### Recommendation in Spark

Spark MLlib supports recommendation algorithms (both RDD and DF APIs).  

Some relevant code for the ALS algorithm:
```python
# Import ALS from the RDD-based API
from pyspark.mllib.recommendation import ALS

# trainData is assumed to be an RDD of Rating(user, product, rating) records
# Train the model on implicit feedback
model = ALS.trainImplicit(trainData, rank=10, iterations=5, lambda_=0.01, alpha=0.01)

```

ALS parameters in the Spark implementation:

- `rank`  
The number of latent factors in the model, or equivalently, the number of columns $k$ in the user-feature and product-feature matrices.

- `iterations`  
The number of iterations that the factorization runs. More iterations take more time but may produce a better factorization.

- `lambda` (named `lambda_` in PySpark, because `lambda` is a reserved word in Python)  
The regularization parameter. Higher values resist overfitting, but values that are too high hurt the factorization’s accuracy.

- `alpha`  
Controls the relative weight of observed versus unobserved user-product interactions in the factorization.
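
To make the roles of `lambda` and `alpha` concrete, here is one common formulation of ALS for implicit feedback, following Hu, Koren, and Volinsky's "Collaborative Filtering for Implicit Feedback Datasets". The symbols are introduced here for illustration and are not from the notes above: $r_{ui}$ is the observed interaction strength for user $u$ and item $i$, and $x_u$ and $y_i$ are the user and item factor vectors of length `rank`. Treat this as a sketch; Spark's exact scaling of the regularization term may differ.

$$
\min_{X,\,Y} \; \sum_{u,i} c_{ui} \left( p_{ui} - x_u^\top y_i \right)^2 + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)
$$

$$
p_{ui} = \begin{cases} 1 & \text{if } r_{ui} > 0 \\ 0 & \text{otherwise} \end{cases}
\qquad
c_{ui} = 1 + \alpha \, r_{ui}
$$

Here $p_{ui}$ is a binary preference and $c_{ui}$ is the confidence in it. A larger $\alpha$ gives observed interactions more weight relative to unobserved ones, and a larger $\lambda$ penalizes large factor values more strongly.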

There is a programming assignment where you will do an end-to-end implementation.

---