<left><img width=25% src="img/gw_monogram_2c.png"></left>

# Lecture 2: Designing State of the Art Recommender Systems

### CS4907/CS6365 Machine Learning

__Sardar Hamidian__<br>The George Washington University

__Armin Mehrabian__<br>The George Washington University

# ToC

1. <span style="color:lightgray; font-size: 0.9em;">Session 1: Introduction to recommender systems: Basics and classic techniques</span>
2. **Session 2: Beyond Rating Prediction**
3. <span style="color:lightgray; font-size: 0.9em;">Session 3: Other advanced approaches to recommending content</span>
4. <span style="color:lightgray; font-size: 0.9em;">Session 4: Recommender Systems in Industry I</span>
5. <span style="color:lightgray; font-size: 0.9em;">Session 5: Recommender Systems in Industry II</span>


## Session 2: Beyond Rating Prediction

1. Beyond Rating Prediction
2. Ranking
3. Other approaches
   - a. Recommending Similars
   - b. Social recommendations
   - c. Explore/exploit
   - d. Page Optimization
   - e. Deep Learning
4. Context-aware recommendations
5. Hybrid recommendations
6. Takeaways


## 1.1 Select Objective and Metrics

This is an essential first step.
- Choose data and metrics that connect to your business goal.
- Sample negatives smartly.
- Select validation and test set carefully (e.g., avoid time traveling).
- For metrics, prefer ranking or ranking-related metrics.
- We will discuss this topic at length in future sessions.
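As a minimal sketch of what "avoid time traveling" can mean in practice, the split below holds out everything at or after a cutoff time, so the model is never evaluated on interactions older than its training data. The event layout `(user, item, timestamp)` and the function name `temporal_split` are assumptions made for illustration.

```python
def temporal_split(events, cutoff):
    """Split (user, item, timestamp) events at a cutoff time.

    Everything before the cutoff is training data; everything at or after
    it is held out, so the model never sees the future.
    """
    train = [e for e in events if e[2] < cutoff]
    test = [e for e in events if e[2] >= cutoff]
    return train, test

# Hypothetical click log: (user_id, item_id, day)
events = [("u1", "a", 1), ("u2", "b", 2), ("u1", "c", 5), ("u3", "a", 7)]
train, test = temporal_split(events, cutoff=5)
```

A random split of the same log could place day 7 in training and day 1 in test, which leaks future behavior into the model.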

# Training, testing, metrics - our use case

- **In our example, and to simplify, we could choose:**
  - **Training data:** ad text, userID, click/no-click, position of ad in email
  - **Metric:** NDCG or recall @3
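To make the two candidate metrics concrete, here is a small sketch of recall@k and binary-relevance NDCG@k for a single user. The ad identifiers and the click set are hypothetical; graded relevance would need a gain term instead of the 0/1 hit used here.

```python
import math

def recall_at_k(ranked_items, relevant_items, k):
    """Fraction of the relevant items that appear in the top-k positions."""
    if not relevant_items:
        return 0.0
    hits = sum(1 for item in ranked_items[:k] if item in relevant_items)
    return hits / len(relevant_items)

def ndcg_at_k(ranked_items, relevant_items, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranked_items[:k])
              if item in relevant_items)
    ideal_hits = min(k, len(relevant_items))
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical example: ads ranked for one user, two of which were clicked.
ranking = ["ad_7", "ad_2", "ad_9", "ad_4"]
clicked = {"ad_2", "ad_4"}
recall3 = recall_at_k(ranking, clicked, k=3)   # one of two clicked ads is in the top 3
ndcg3 = ndcg_at_k(ranking, clicked, k=3)
```

Recall@3 only asks whether a clicked ad made the cut, while NDCG@3 also rewards placing it higher in the list.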


## 1.2 Start with (Implicit) Matrix Factorization


- Experience suggests that the best single (simple) approach is implicit matrix factorization:
   - **ALS:** Alternating Least Squares (Hu et al., 2008).
   https://www.researchgate.net/publication/254464370_Alternating_least_squares_for_personalized_ranking
   - **BPR:** Bayesian Personalized Ranking (Rendle et al., 2009).
   https://arxiv.org/abs/1205.2618
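The BPR idea can be sketched in a few lines of plain Python: for each observed (user, item) pair, sample an item the user has not interacted with and push the observed item's score above it. This is a toy illustration, not the `implicit` or QMF implementation; the hyperparameters, the uniform negative sampling, and the tiny dataset are all assumptions.

```python
import math
import random

def train_bpr(interactions, n_users, n_items, k=8, lr=0.05, reg=0.01,
              epochs=200, seed=0):
    """Toy BPR-MF trained with SGD. Assumes every user has at least one unseen item."""
    rng = random.Random(seed)
    U = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    seen = {}
    for u, i in interactions:
        seen.setdefault(u, set()).add(i)
    for _ in range(epochs):
        for u, i in interactions:
            # Sample a negative item j that user u has not interacted with.
            j = rng.randrange(n_items)
            while j in seen[u]:
                j = rng.randrange(n_items)
            x_uij = sum(U[u][f] * (V[i][f] - V[j][f]) for f in range(k))
            g = 1.0 / (1.0 + math.exp(x_uij))  # derivative of ln(sigmoid(x)) at x_uij
            for f in range(k):
                uf, vif, vjf = U[u][f], V[i][f], V[j][f]
                U[u][f] += lr * (g * (vif - vjf) - reg * uf)
                V[i][f] += lr * (g * uf - reg * vif)
                V[j][f] += lr * (-g * uf - reg * vjf)
    return U, V

def score(U, V, u, i):
    """Predicted preference of user u for item i (dot product of factors)."""
    return sum(a * b for a, b in zip(U[u], V[i]))

# Hypothetical implicit feedback: users 0-1 like items 0-1, users 2-3 like items 2-3.
interactions = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 2), (2, 3), (3, 2), (3, 3)]
U, V = train_bpr(interactions, n_users=4, n_items=4)
```

After training, `score(U, V, 0, 0)` should exceed `score(U, V, 0, 2)`: only the relative order of scores matters for BPR, not their absolute values.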

# Recommended Implementations

- **Implicit**
  - Efficient
  - Python
  - Well-maintained
<center><img src="img/Implicit2.jpg" style="width:50%; height:auto;"/></center>
https://github.com/benfred/implicit

- **Quora’s QMF**
  - Efficient compiled C++ code
  - Supports many evaluation metrics
<center><img src="img/Implicit3.jpg" style="width:50%; height:auto;"/></center>
https://github.com/quora/qmf


# Modeling in our example

- **Question for the class:** Can we use basic CF in the case of daily news recommendation?


# 1.3 Decide on a simple candidate selection strategy

# Candidate selection and filtering
- **Question for the class:** How do you find your candidate news?

# 1.4 A/B Test

# A/B Test

- **So, you have your first implementation:**
  - Have tuned hyperparameters to optimize offline metric
  - How do you know this is working?

- **Run an A/B test!**
  - Make sure offline metric (somewhat) correlates to online effect
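The slide does not prescribe how to read the test results. One common choice for click-through rates is a two-proportion z-test, sketched below with only the standard library; the click counts are made up for illustration.

```python
import math

def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Return (z, two-sided p-value) for the difference in click-through rate."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return z, 2 * (1 - cdf)

# Hypothetical experiment: control (A) vs. new recommender (B).
z, p_value = two_proportion_z_test(clicks_a=500, n_a=10_000,
                                   clicks_b=560, n_b=10_000)
```

Here a 5.0% versus 5.6% click-through rate on 10,000 users per arm gives a p-value just above 0.05, a reminder that small lifts need large samples before they can be trusted.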


# 1.5 Retrain

- **Retrain**
  - Data changes over time
  - Every recommendation that is shown and acted upon (or not) is new data
  - You need to retrain models with new data (incrementally or not)

# 1.6 Ensemble

- **Ensemble**
  - Now, it’s time to turn the model into a signal
  - Brainstorm about some simple potential features that you could combine with implicit MF
    - E.g., user tenure, average rating for the item, price of the item...
  - Add to MF through an ensemble

- **What model to use at the ensemble layer?**
  - Always favor the simplest model -> L2-regularized Logistic Regression
  - Eventually introduce models that can benefit from non-linear effects and many features -> Gradient Boosted Decision Trees
  - Explore Learning-to-rank models -> LambdaRank
  - Deep learning
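As a sketch of the simplest ensemble layer, the code below fits an L2-regularized logistic regression by batch gradient descent on two signals: the MF score and one extra feature. The feature choice (a scaled average item rating), the toy labels, and the hyperparameters are assumptions for illustration; in practice a library implementation would be used.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.5, l2=0.01, epochs=500):
    """Batch gradient descent on the L2-regularized logistic loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
            for j in range(d):
                grad_w[j] += (p - yi) * xi[j]
            grad_b += p - yi
        for j in range(d):
            w[j] -= lr * (grad_w[j] / n + l2 * w[j])  # L2 penalty on weights only
        b -= lr * grad_b / n
    return w, b

# Hypothetical rows: [mf_score, scaled_avg_item_rating]; label = clicked or not.
X = [[0.9, 0.8], [0.8, 0.4], [0.7, 0.9], [0.3, 0.7], [0.2, 0.5], [0.1, 0.3]]
y = [1, 1, 1, 0, 0, 0]
w, b = train_logreg(X, y)
p_high = sigmoid(b + w[0] * 0.9 + w[1] * 0.8)
p_low = sigmoid(b + w[0] * 0.1 + w[1] * 0.3)
```

The learned weight on `mf_score` shows how much the ensemble trusts the MF signal relative to the added feature.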

# Ensemble in our example

- **Question for the class:** Besides our basic text representation-based CF, what other features could we add?

---

# 1.7 Iterate, Feature Engineering

- **Iterate**
  - Experiment/add more features
  - Experiment with more complex models
  - Do both things in parallel
  - Continue A/B testing

# Part 2. Beyond Rating Prediction
<br>
<center><img src="img/netflix21.jpg" style="width:50%; height:auto;"/></center>


# Evolution of the Recommender Problem
<center><img src="img/eval21.jpg" style="width:80%; height:auto;"/></center>


# Ranking by ratings
<br>
<center><img src="img/eval23.jpg" style="width:80%; height:auto;"/></center>


# RMSE
<br>
<center><img src="img/rmse.jpg" style="width:40%; height:auto;"/></center>


# Part 3. Ranking
* Most recommendations are presented in a sorted list
* Recommendation can be understood as a ranking problem
* Popularity is the obvious baseline
* Ratings prediction is a clear secondary data input that allows for personalization
* Many other features can be added
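A minimal sketch of ranking with two features, popularity and predicted rating, combined linearly. The weights and item values are hypothetical; in a real system the weights would be learned from data.

```python
def rank_items(items, w_pop=0.3, w_pred=0.7):
    """Sort items by a linear combination of popularity and predicted rating."""
    def score(item):
        return w_pop * item["popularity"] + w_pred * item["predicted_rating"]
    return sorted(items, key=score, reverse=True)

# Hypothetical catalog with both features scaled to [0, 1].
items = [
    {"name": "A", "popularity": 0.9, "predicted_rating": 0.2},
    {"name": "B", "popularity": 0.4, "predicted_rating": 0.8},
    {"name": "C", "popularity": 0.6, "predicted_rating": 0.5},
]
personalized = [it["name"] for it in rank_items(items)]
popularity_only = [it["name"] for it in rank_items(items, w_pop=1.0, w_pred=0.0)]
```

With all weight on popularity the order is A, C, B; shifting weight to the predicted rating moves B to the top, which is the personalization effect described above.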

# Example: Two features, linear model

<center><img src="img/pred21.jpg" style="width:80%; height:auto;"/></center>


# Example: Two features, linear model
<center><img src="img/rankpop.jpg" style="width:80%; height:auto;"/></center>


# Ranking - Quora Feed

**Goal:** Present the most *interesting stories* for a user at a given time

- **Interesting** = topical relevance + social relevance + timeliness
- **Stories** = questions + answers

**ML:** Personalized learning-to-rank approach

**Relevance-ordered vs time-ordered** = big gains in engagement
<center><img src="img/rqf1.jpg" style="width:50%; height:auto;"/></center>
<center><img src="img/rqf2.jpg" style="width:50%; height:auto;"/></center>


# Learning to rank

- Machine learning problem: goal is to construct a ranking model from training data
- Training data can be a partial order or binary judgments (relevant/not relevant)
- Resulting order of the items typically induced from a numerical score
- Learning to rank is a key element for personalization
- You can treat the problem as a standard supervised classification problem

**What is learning to rank?**
https://opensourceconnections.com/blog/2017/02/24/what-is-learning-to-rank/


# Learning to rank - Metrics

- **Quality of ranking is measured using metrics such as:**
  - Normalized Discounted Cumulative Gain (NDCG)
  - Mean Reciprocal Rank (MRR)
  - Fraction of Concordant Pairs (FCP)
  - Others...
- But, it is hard to optimize machine-learned models directly on these measures (e.g., non-differentiable)
- Recent research on models that directly optimize ranking measures
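For reference, MRR and FCP can be sketched as follows. The inputs are hypothetical, and counting tied predicted scores as discordant is an assumption; other conventions exist.

```python
def mean_reciprocal_rank(rankings, relevant_sets):
    """Average over users of 1 / (position of the first relevant item)."""
    total = 0.0
    for ranked, relevant in zip(rankings, relevant_sets):
        for pos, item in enumerate(ranked, start=1):
            if item in relevant:
                total += 1.0 / pos
                break
    return total / len(rankings)

def fraction_concordant_pairs(scores, ratings):
    """Share of item pairs (with different true ratings) ordered correctly."""
    items = list(ratings)
    concordant = discordant = 0
    for a in range(len(items)):
        for b in range(a + 1, len(items)):
            i, j = items[a], items[b]
            if ratings[i] == ratings[j]:
                continue  # no preference between i and j in the ground truth
            if (ratings[i] - ratings[j]) * (scores[i] - scores[j]) > 0:
                concordant += 1
            else:
                discordant += 1  # includes tied predicted scores
    total = concordant + discordant
    return concordant / total if total else 0.0

mrr = mean_reciprocal_rank([["a", "b", "c"], ["b", "c", "a"]], [{"b"}, {"b"}])
fcp = fraction_concordant_pairs({"a": 0.9, "b": 0.2, "c": 0.4},
                                {"a": 5, "b": 3, "c": 1})
```

Both functions rely on sorting or comparisons, so a small change in a score usually leaves the metric unchanged, which is why gradient-based optimization of these measures is hard.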


# Learning to rank - Approaches

### 1. Pointwise
  - Ranking function minimizes loss function defined on individual relevance judgment
  - Ranking score based on regression or classification
  - Ordinal regression, Logistic regression, SVM, GBDT, ...

### 2. Pairwise
  - Loss function is defined on pair-wise preferences
  - Goal: minimize number of inversions in ranking
  - Ranking problem is then transformed into a binary classification problem
  - RankSVM, RankBoost, RankNet, FRank...
  - BPR is a pairwise learning to rank approach that can be applied to different methods like kNN and MF

**Learning to rank approaches:**  
https://opensourceconnections.com/blog/2017/08/03/search-as-machine-learning-prob/
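The pairwise view can be sketched as follows: every pair of items with different labels becomes one binary example, and a logistic loss on the score difference serves as a smooth stand-in for the number of inversions. This is the loss form used by RankNet-style models, shown without any model around it; the scores and labels are hypothetical.

```python
import math

def preference_pairs(labels):
    """All (i, j) index pairs where item i should rank above item j."""
    n = len(labels)
    return [(i, j) for i in range(n) for j in range(n) if labels[i] > labels[j]]

def count_inversions(scores, labels):
    """Number of preference pairs that the scores order incorrectly."""
    return sum(1 for i, j in preference_pairs(labels) if scores[i] <= scores[j])

def pairwise_logistic_loss(scores, labels):
    """Mean of log(1 + exp(-(s_i - s_j))) over all preference pairs."""
    pairs = preference_pairs(labels)
    if not pairs:
        return 0.0
    return sum(math.log1p(math.exp(-(scores[i] - scores[j])))
               for i, j in pairs) / len(pairs)

labels = [2, 1, 0]             # graded relevance of three items
good_scores = [3.0, 2.0, 1.0]  # agrees with the labels
bad_scores = [1.0, 2.0, 3.0]   # reverses the order
```

`count_inversions` is 0 for `good_scores` and 3 for `bad_scores`; the logistic loss moves in the same direction but is differentiable in the scores.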



### 3. Listwise

  - **Indirect Loss Function**
    - RankCosine: similarity between ranking list and ground truth as loss function
    - ListNet: KL-divergence as loss function by defining a probability distribution
    - Problem: optimization of listwise loss function may not optimize IR metrics
  - **Directly optimizing IR metric** (difficult since they are not differentiable)
    - Genetic Programming or Simulated Annealing
    - LambdaMART weights RankNet's pairwise gradients by the change in the IR metric
    - Gradient descent on smoothed version of objective function (e.g., CLiMF or TFMAP)
    - SVM-MAP relaxes MAP metric by adding to SVM constraints
    - AdaRank uses boosting to optimize NDCG

# Part 4. Other approaches

# 4.1 Similarity

### Similars 
   - **Displayed in many different contexts**
        - In response to user actions/context (search, list add…)
   - **Because you watched… rows**
<center><img src="img/similar2.jpg" style="width:70%; height:auto;"/></center>

## Similars: Related Questions

- Given interest in question A (source), what other questions will be interesting?
- Not only about similarity, but also "interestingness"
- Features such as:
  - Textual
  - Co-visit
  - Topics
  - ...
- Important for logged-out use case

<div style="text-align: center;">
    <img src="img/similar21.jpg" style="width:40%; height:auto; display: inline-block; float: right; margin-left: 2%;" />
</div>


## Similars: Graph-Based

<center><img src="img/gsimilar.jpg" style="width:60%; height:auto;"/></center>

## Example of Graph-Based Similarity: SimRank

- **SimRank** (Jeh & Widom, 2002): "two objects are similar if they are referenced by similar objects."

<div style="text-align: center;">
    <img src="img/gsimex12.jpg" style="width:45%; height:auto; display: inline-block; margin-right: 2%;" />
    <img src="img/gsimex1.jpg" style="width:45%; height:auto; display: inline-block;" />
</div>
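A naive iterative SimRank can be sketched directly from the definition: a node is maximally similar to itself, and the similarity of two different nodes is the decayed average similarity of their in-neighbors. The decay factor `C = 0.8`, the iteration count, and the toy user-to-item graph are assumptions for illustration; this quadratic version does not scale to large graphs.

```python
def simrank(in_neighbors, C=0.8, iterations=10):
    """Naive SimRank. in_neighbors maps each node to the list of nodes pointing to it."""
    nodes = list(in_neighbors)
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iterations):
        new = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[a][b] = 1.0
                elif not in_neighbors[a] or not in_neighbors[b]:
                    new[a][b] = 0.0  # nodes without in-links have no evidence
                else:
                    total = sum(sim[x][y]
                                for x in in_neighbors[a]
                                for y in in_neighbors[b])
                    new[a][b] = C * total / (len(in_neighbors[a]) * len(in_neighbors[b]))
        sim = new
    return sim

# Hypothetical graph: users point to the items they viewed.
in_neighbors = {
    "u1": [], "u2": [], "u3": [],
    "i1": ["u1", "u2"],
    "i2": ["u1", "u2"],
    "i3": ["u3"],
}
sim = simrank(in_neighbors)
```

Items `i1` and `i2` are viewed by the same users and come out similar, while `i3` shares no viewers with `i1` and gets a similarity of zero.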


## Similarity Ensembles

- Similarity can refer to different dimensions:
  - Similar in metadata/tags
  - Similar in user play behavior
  - Similar in user rating behavior
  - ...
- Combine them using an ensemble:
  - Weights are learned using regression over existing response
  - Or... some MAB explore/exploit approach
- The final concept of "similarity" responds to what users vote as similar

## 4.2 Social Recommendations



## Examples

- Quora people recommendations
- Spotify’s Friend activity
- LinkedIn’s feed recommendations
- ...

<div style="text-align: center;">
    <img src="img/exp22.jpg" style="width:45%; height:auto; display: inline-block; margin-right: 2%;" />
    <img src="img/exp21.jpg" style="width:45%; height:auto; display: inline-block;" />
</div>


## Social and Trust-based Recommenders

- A social recommender system recommends items that are "popular" in the social proximity of the user.
- Social proximity = trust (can also be topic-specific)
- Given two individuals - the source (node A) and sink (node C) - derive how much the source should trust the sink.
- Classic algorithms:
  - Advogato (Levien)
  - Appleseed (Ziegler and Lausen)
  - MoleTrust (Massa and Avesani)
  - TidalTrust (Golbeck)

<div style="text-align: center;">
    <img src="img/str.jpg" style="width:45%; height:auto; display: inline-block; margin-right: 2%;" />
</div>



## Other Ways to Use Social

- Social connections can be used in combination with other approaches.
- In particular, "friendships" can be fed into collaborative filtering methods in different ways:
  - Replace or modify user-user "similarity" by using social network information.
  - Use social connections as a regularizer in the ML objective function
  - ...


## User Trust/Expertise Inference at Quora

- **Goal:** Infer user’s trustworthiness in relation to a given topic.
- We take into account:
  - Answers written on topic
  - Upvotes/downvotes received
  - Endorsements
  - ...
- Trust/expertise propagates through the network
- Must be taken into account by other algorithms
<div style="text-align: center;">
    <img src="img/ppl.jpg" style="width:45%; height:auto; display: inline-block; margin-right: 2%;" />
</div>


## 4.3 Page Optimization


## Page Composition

![Page Composition](img/page_composition_1.jpg)

## Full-page Optimization

- Recommendations are rarely displayed in isolation
  - Rankings are combined with many other elements to make a page
- Want to optimize the whole page
- Jointly solving for set of items and their placement
- While incorporating:
  - Diversity, freshness, exploration
  - Depth and coverage of the item set
  - Non-recommendation elements (navigation, editorial, etc.)
- Needs to work hand-in-hand with the UX

![Full-page Optimization](img/full_page_optimization1.jpg)
![Full-page Optimization](img/full_page_optimization2.jpg)

## Page Composition

![Page Composition](img/page_composition_2.jpg)

> From "Modeling User Attention and Interaction on the Web" 2014 - PhD Thesis by Dmitry Lagun (Emory U.)


## User Attention Modeling

![User Attention Modeling](img/user_attention_modeling1.jpg)
![User Attention Modeling](img/user_attention_modeling2.jpg)

> From "Modeling User Attention and Interaction on the Web" 2014 - PhD Thesis by Dmitry Lagun (Emory U.)


## Page Composition

<div style="font-size: 1.5em; color: #4F3BB8; text-align: center;">

**Accurate** vs. **Diverse**<br>
**Discovery** vs. **Continuation**<br>
**Depth** vs. **Coverage**<br>
**Freshness** vs. **Stability**<br>
**Recommendations** vs. **Tasks**
</div>

* To put things together we need to combine different elements:
  - Navigational/Attention Model
  - Personalized Relevance Model
  - Diversity Model

<img src="img/page_composition1.jpg" alt="Page Composition 1" style="width:50%; height:auto;" />
<img src="img/page_composition22.jpg" alt="Page Composition 2" style="width:50%; height:auto;" />


# 4.4 Explore/Exploit


## Explore/Exploit

- One of the key issues when building any kind of personalization algorithm is how to trade off:
  - **Exploitation**: *Cashing in on what we know about the user right now*  
  - **Exploration**: *Using the interaction as an opportunity to learn more about the user*

- We need to have informed and optimal strategies to drive that tradeoff:
  - **Solution**: *Pick a reasonable set of candidates and show users only "enough" to gather information on them*


# Multi-armed Bandits

<img src="img/arm1.jpg" alt="Multi-armed Bandits 1" style="width:50%; height:auto;" />
<img src="img/arm2.jpg" alt="Multi-armed Bandits 2" style="width:50%; height:auto;" />
<img src="img/arm3.jpg" alt="Multi-armed Bandits 3" style="width:50%; height:auto;" />
<img src="img/arm4.jpg" alt="Multi-armed Bandits 4" style="width:50%; height:auto;" />
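One widely used bandit strategy is Thompson sampling with a Beta prior per arm, sketched below on simulated clicks. The slides do not say which strategy they have in mind, so this is only one possible instance; the true click rates, the round count, and the uniform Beta(1, 1) prior are assumptions.

```python
import random

def thompson_sampling(true_ctrs, rounds=2000, seed=0):
    """Simulate Bernoulli bandit arms; return how often each arm was shown."""
    rng = random.Random(seed)
    n = len(true_ctrs)
    wins = [0] * n    # observed clicks per arm
    losses = [0] * n  # observed non-clicks per arm
    for _ in range(rounds):
        # Explore/exploit in one step: sample a plausible CTR per arm
        # from its Beta posterior and show the arm with the best sample.
        samples = [rng.betavariate(wins[a] + 1, losses[a] + 1) for a in range(n)]
        arm = max(range(n), key=lambda a: samples[a])
        if rng.random() < true_ctrs[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return [wins[a] + losses[a] for a in range(n)]

# Hypothetical candidates with unknown (to the algorithm) click rates.
pulls = thompson_sampling([0.1, 0.3, 0.7])
```

Arms with little data have wide posteriors and are still sampled occasionally (exploration), while traffic concentrates on the arm that keeps paying off (exploitation).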

## 4.5 Deep Learning

## Deep Learning for Collaborative Filtering

- Spotify uses Recurrent Networks for Playlist Prediction ([link](http://erikbern.com/?p=589))

<img src="img/DCF.jpg" alt="Deep Learning for Collaborative Filtering" style="width:50%; height:auto;" />

## Deep Learning for Collaborative Filtering

- In order to predict the next track or movie a user is going to watch, we need to define a distribution $P(y_i|h_i)$
  - If we choose Softmax as is common practice, we get:  
  $P(y_i = j|h_i) = \frac{\exp(h_i^\top a_j)}{\sum_k \exp(h_i^\top a_k)}$  
  where $h_i$ is the network's hidden state at step $i$ and $a_j$ is the output vector of item $j$
      - **Problem**: Denominator (a sum over all candidate items $k$) is very expensive to compute
      - **Solution**: Build a tree that implements a hierarchical softmax
- More details in the blog post
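To see where the cost comes from, here is a plain softmax over item vectors: the normalizer touches every item in the catalog for every prediction. The hidden state and item vectors are hypothetical toy values, and the hierarchical softmax itself is not implemented here.

```python
import math

def next_item_distribution(h, item_vectors):
    """Softmax over dot products between hidden state h and each item vector."""
    logits = [sum(hf * af for hf, af in zip(h, a)) for a in item_vectors]
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)                     # the expensive part: one term per item
    return [e / total for e in exps]

h = [1.0, 0.0]                                    # hypothetical hidden state
item_vectors = [[2.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
probs = next_item_distribution(h, item_vectors)
```

With three items the sum is trivial; with millions of tracks it dominates training time, which motivates the tree-based approximation.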


## Deep Learning for Content-based Recommendations

- Another application of Deep Learning to recommendations, also from Spotify:
  - [Recommending music on Spotify with deep learning](http://benanne.github.io/2014/08/05/spotify-cnns.html) (blog post); see also [Deep content-based music recommendation](https://papers.nips.cc/paper_files/paper/2013/hash/b3ba8f1bee1238a2f37603d90b58898d-Abstract.html)
- Application to cold-start of new titles when very little CF information is available
- Using mel-spectrograms from the audio signal as input
- Training the deep neural network to predict 40 latent factors coming from Spotify’s CF solution
<img src="img/DMCF.jpg" alt="Deep content-based music recommendation" style="width:50%; height:auto;" />

<img src="img/DLS1.jpg" alt="Deep content-based music recommendation 1" style="width:50%; height:auto;" />
<img src="img/DLS2.jpg" alt="Deep content-based music recommendation 2" style="width:50%; height:auto;" />
<img src="img/DLS3.jpg" alt="Deep content-based music recommendation 3" style="width:50%; height:auto;" />
<img src="img/DLS4.jpg" alt="Deep content-based music recommendation 4" style="width:50%; height:auto;" />
<img src="img/DLS5.jpg" alt="Deep content-based music recommendation 5" style="width:50%; height:auto;" />
<img src="img/DLS6.jpg" alt="Deep content-based music recommendation 6" style="width:50%; height:auto;" />

# 5. Context-Aware Recommendations


# N-dimensional model

![N-dimensional model](img/CAW1.jpg)


# Tensor Factorization

<img src="img/TF1.jpg" alt="Tensor Factorization 1" style="width:50%; height:auto;" />
<img src="img/TF2.jpg" alt="Tensor Factorization 2" style="width:50%; height:auto;" />

**HOSVD**: Higher Order Singular Value Decomposition

<center><small>Multiverse Recommendation: N-dimensional Tensor Factorization for Context-aware Collaborative Filtering</small></center>


# Factorization Machines

- Generalization of regularized matrix (and tensor) factorization approaches combined with linear (or logistic) regression
- Problem: Each new adaptation of matrix or tensor factorization requires deriving new learning algorithms
    - Hard to adapt to new domains and add data sources
    - Hard to advance the learning algorithms across approaches
    - Hard to incorporate non-categorical variables


# Factorization Machines

- Approach: Treat input as a real-valued feature vector
    - Model both linear and pair-wise interactions of the d features (i.e. polynomial regression)
    - A separate weight for every feature pair will overfit on sparse data
    - Factor pairwise interactions between features
    - Reduced dimensionality of interactions promotes generalization
    - Different matrix factorizations become different feature representations
    - Tensors: Additional higher-order interactions
- Combines "generality of machine learning/regression with quality of factorization models"


# Factorization Machines

- Each feature gets a weight value and a factor vector
  - $O(dk)$ parameters
      - $b \in \mathbb{R}, \mathbf{w} \in \mathbb{R}^d, \mathbf{V} \in \mathbb{R}^{d \times k}$

- Model equation:

  $f(\mathbf{x}) = b + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d} \sum_{j=i+1}^{d} x_i x_j \mathbf{v}_i^\top \mathbf{v}_j$ (naive evaluation over $O(d^2)$ pairs, i.e. $O(kd^2)$ operations)

  $= b + \sum_{i=1}^{d} w_i x_i + \frac{1}{2} \sum_{f=1}^{k} \left( \left( \sum_{i=1}^{d} x_i v_{i,f} \right)^2 - \sum_{i=1}^{d} x_i^2 v_{i,f}^2 \right)$ (reformulated: $O(kd)$ operations)
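The two forms of the model equation can be checked against each other numerically. The sketch below implements both in plain Python on random (hypothetical) parameters; the function names are illustrative.

```python
import random

def fm_naive(x, b, w, V):
    """Pairwise form: loops over all feature pairs, O(k d^2)."""
    d, k = len(x), len(V[0])
    out = b + sum(w[i] * x[i] for i in range(d))
    for i in range(d):
        for j in range(i + 1, d):
            dot = sum(V[i][f] * V[j][f] for f in range(k))
            out += x[i] * x[j] * dot
    return out

def fm_fast(x, b, w, V):
    """Reformulated form: one pass over features per factor, O(k d)."""
    d, k = len(x), len(V[0])
    out = b + sum(w[i] * x[i] for i in range(d))
    for f in range(k):
        s = sum(x[i] * V[i][f] for i in range(d))
        s_sq = sum((x[i] * V[i][f]) ** 2 for i in range(d))
        out += 0.5 * (s * s - s_sq)
    return out

rng = random.Random(0)
d, k = 6, 3
x = [rng.uniform(-1, 1) for _ in range(d)]
b = rng.uniform(-1, 1)
w = [rng.uniform(-1, 1) for _ in range(d)]
V = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(d)]
difference = abs(fm_naive(x, b, w, V) - fm_fast(x, b, w, V))
```

The difference is zero up to floating-point rounding, because the square of a sum minus the sum of squares equals twice the sum over distinct pairs.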



# Factorization Machines

- Two categorical variables (u, i) encoded as real values:
    - ![Factorization Machines Formula](img/FMT1.jpg)
- FM becomes identical to MF with biases:

  $f(\mathbf{x}) = b + w_u + w_i + \mathbf{v}_u^\top \mathbf{v}_i$

  *From Rendle (2012) KDD Tutorial*



# Factorization Machines

- Makes it easy to add a time signal
    - ![Factorization Machines Time Signal Equation](img/FMT2.jpg)

- Equivalent equation:

  $f(\mathbf{x}) = b + w_u + w_i + x_t w_t + \mathbf{v}_u^\top \mathbf{v}_i + x_t \mathbf{v}_u^\top \mathbf{v}_t + x_t \mathbf{v}_i^\top \mathbf{v}_t$

  *From Rendle (2012) KDD Tutorial*

# Factorization Machines (Rendle, 2010)

- **L2 regularized**
  - Regression: Optimize RMSE
  - Classification: Optimize logistic log-likelihood
  - Ranking: Optimize scores

- **Can be trained using:**
  - SGD
  - Adaptive SGD
  - ALS
  - MCMC

- **Gradient:**

  $\frac{\partial}{\partial \theta} f(\mathbf{x}) = \begin{cases}
  1 & \text{if } \theta \text{ is } b \\
  x_i & \text{if } \theta \text{ is } w_i \\
  x_i \sum_{j=1}^{d} v_{j,f} x_j - v_{i,f} x_i^2 & \text{if } \theta \text{ is } v_{i,f}
  \end{cases}$

- **Least squares SGD:**

  $\theta' = \theta - \eta \left( \left( f(\mathbf{x}) - y \right) \frac{\partial}{\partial \theta} f(\mathbf{x}) + \lambda \theta \right)$
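The gradient and the least-squares update above translate into a short SGD step. This is a sketch on a single hypothetical training example; the learning rate, regularization strength, and initialization scale are assumptions, and the regularization term is applied to every parameter exactly as in the update rule.

```python
import random

def fm_predict(x, b, w, V):
    """FM prediction using the O(kd) form of the model equation."""
    d, k = len(x), len(V[0])
    out = b + sum(w[i] * x[i] for i in range(d))
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(d))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(d))
        out += 0.5 * (s * s - s_sq)
    return out

def fm_sgd_step(x, y, b, w, V, lr=0.05, reg=0.01):
    """One least-squares SGD step: theta -= lr * ((f(x) - y) * grad + reg * theta)."""
    d, k = len(x), len(V[0])
    err = fm_predict(x, b, w, V) - y
    # Precompute sum_j v_{j,f} x_j once per factor; it is shared by all v_{i,f}.
    sums = [sum(V[j][f] * x[j] for j in range(d)) for f in range(k)]
    new_b = b - lr * (err + reg * b)
    new_w = [w[i] - lr * (err * x[i] + reg * w[i]) for i in range(d)]
    new_V = [[V[i][f] - lr * (err * (x[i] * sums[f] - V[i][f] * x[i] ** 2)
                              + reg * V[i][f])
              for f in range(k)] for i in range(d)]
    return new_b, new_w, new_V

rng = random.Random(0)
d, k = 4, 2
x, y = [1.0, 0.0, 1.0, 0.5], 2.0  # hypothetical feature vector and target
b, w = 0.0, [0.0] * d
V = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(d)]
initial_error = abs(fm_predict(x, b, w, V) - y)
for _ in range(100):
    b, w, V = fm_sgd_step(x, y, b, w, V)
final_error = abs(fm_predict(x, b, w, V) - y)
```

Features with `x_i = 0` receive no gradient signal in this step, only the regularization shrinkage, which is what keeps SGD cheap on sparse inputs.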


# Factorization Machines (Rendle, 2010)

- Learning parameters:
    - Number of factors
    - Iterations
    - Initialization scale
    - Regularization (SGD, ALS) – Multiple
    - Step size (SGD, A-SGD)
    - MCMC removes the need to set those hyperparameters


## 6. Warning: The right evaluation might matter more than the model you choose


## Offline/Online Testing Process
<img src="img/OF1.jpg" alt="Offline/Online Testing Process" style="width:100%; height:auto;" />


## Executing A/B Tests
<img src="img/OF2.jpg" alt="Executing A/B Tests" style="width:100%; height:auto;" />

## Offline Testing

<img src="img/OF4.jpg" alt="Offline Testing" style="width:100%; height:auto;" />



## Offline Metrics

- For baseline metrics, prefer ranking or ranking-related metrics
- You might want to measure other aspects of the recommendation such as diversity (Maximum Marginal Relevance) or novelty
- In practice, you should keep a bank of metrics that you measure offline and, over time, connect them to your online A/B test results through post hoc analysis
- Find short-term surrogates that map to long-term improvements
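Maximum Marginal Relevance, mentioned above for diversity, can be sketched as a greedy re-ranker: each step picks the item that best trades relevance against similarity to what is already selected. The relevance scores, the pairwise similarities, and the trade-off `lam` are hypothetical values.

```python
def mmr_rerank(relevance, similarity, k, lam=0.7):
    """Greedy MMR: maximize lam * relevance - (1 - lam) * max similarity to selected."""
    selected = []
    candidates = set(relevance)
    while candidates and len(selected) < k:
        def mmr_score(item):
            max_sim = max((similarity(item, s) for s in selected), default=0.0)
            return lam * relevance[item] - (1 - lam) * max_sim
        # Sort candidates first so ties are broken deterministically.
        best = max(sorted(candidates), key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical items: "a" and "b" are near-duplicates, "c" is different.
relevance = {"a": 0.9, "b": 0.85, "c": 0.5}
pair_sim = {frozenset(("a", "b")): 0.95}

def sim(x, y):
    return pair_sim.get(frozenset((x, y)), 0.1)

diverse = mmr_rerank(relevance, sim, k=3, lam=0.5)
by_relevance = mmr_rerank(relevance, sim, k=3, lam=1.0)
```

With `lam=1.0` the list is sorted purely by relevance; lowering `lam` promotes "c" above the near-duplicate "b".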

<img src="img/OF41.jpg" alt="Offline Metrics 1" style="width:50%; height:auto;" />
<img src="img/OF42.jpg" alt="Offline Metrics 2" style="width:50%; height:auto;" />
<img src="img/OF43.jpg" alt="Offline Metrics 3" style="width:50%; height:auto;" />