# Sampling Bias Corrected Neural Modeling for Large Corpus Item Recommendations

## Modeling Framework: Two-tower

We have a set of queries and a set of items, each represented by feature vectors:

- Queries: $\{x_i\}^N_{i=1}$
- Items: $\{y_j\}^M_{j=1}$

Here $x_i$ and $y_j$ are both mixtures of a wide variety of features (e.g. sparse IDs and dense features) and could lie in a very high-dimensional space. The goal is to retrieve a subset of items given a query. In a personalization scenario, we assume the user and context are fully captured in $x_i$. Note that we begin with a finite number of queries and items to explain the intuition; the modeling framework works without this assumption.

We aim to build a model with two parameterized embedding functions:

- $u: X \times \mathbb{R}^d \rightarrow \mathbb{R}^k$
- $v: Y \times \mathbb{R}^d \rightarrow \mathbb{R}^k$

They map the model parameters $\theta \in \mathbb{R}^d$ and the features of queries and candidate items to a $k$-dimensional embedding space. The output of the model is the inner product of the two embeddings.

$$
s(x, y) = \left \langle u(x, \theta), v(y, \theta) \right \rangle
$$
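As a minimal sketch of the scoring step, assume each tower is a single linear layer. This is a simplification for illustration; the text only requires that $u$ and $v$ be parameterized embedding functions. The names `embed`, `score`, `W_query`, and `W_item` are hypothetical.

```python
def embed(features, weights):
    """Linear tower: map a feature vector to a k-dimensional embedding.

    `weights` is a k x d matrix stored as a list of rows, an assumed
    stand-in for a neural network tower.
    """
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def score(x, y, W_query, W_item):
    """s(x, y) = <u(x, theta), v(y, theta)> with theta = (W_query, W_item)."""
    u = embed(x, W_query)
    v = embed(y, W_item)
    return sum(ui * vi for ui, vi in zip(u, v))


# Toy example: 3 query features, 2 item features, k = 2.
W_query = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
W_item = [[2.0, 0.0], [0.0, 0.5]]
x = [1.0, 2.0, 3.0]  # u = [1.0, 5.0]
y = [1.0, 4.0]       # v = [2.0, 2.0]
s = score(x, y, W_query, W_item)  # 1*2 + 5*2 = 12.0
```

Because the two towers only meet at the inner product, item embeddings can be computed independently of any query.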

The goal is to learn the model parameters $\theta$ from a training dataset of $T$ examples.

$$
\tau = \{(x_i, y_i, r_i)\}^T_{i=1}
$$

where $(x_i, y_i)$ denotes the pair of query and item, and $r_i \in \mathbb{R}$ is the associated reward for each pair. The reward does not have to be a user rating. It could be user engagement time or clicks.

Given a query $x$, a common choice for the probability of picking candidate $y$ from the $M$ items is the softmax function.

$$
P(y \mid x;\theta) = \frac{e^{s(x,y)}}{\sum_{j=1}^{M} e^{s(x,y_j)}}
$$

By further incorporating rewards $r_i$, we consider the following weighted log-likelihood as the loss function.

$$
L_{\tau}(\theta) = -\frac{1}{T} \sum_{i \in T} r_i \cdot \log \left( P(y_i \mid x_i; \theta) \right)
$$
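The full softmax and the reward-weighted loss can be sketched as follows. The sketch assumes the scores $s(x_i, y_j)$ against all $M$ items are already available as rows of numbers; `log_softmax_row`, `weighted_nll`, and the toy inputs are illustrative names and values, not part of the source.

```python
import math


def log_softmax_row(scores, target):
    """log P(y_target | x) for one query, given its scores against all M items."""
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[target] - log_z


def weighted_nll(score_rows, targets, rewards):
    """-(1/T) * sum_i r_i * log P(y_i | x_i; theta)."""
    T = len(targets)
    total = sum(
        r * log_softmax_row(row, t)
        for row, t, r in zip(score_rows, targets, rewards)
    )
    return -total / T


# Two training examples, each scored against M = 3 items.
score_rows = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
loss = weighted_nll(score_rows, targets=[0, 2], rewards=[1.0, 2.0])
# With uniform scores every item has probability 1/3, so
# loss = (1.0 + 2.0) * log(3) / 2.
```

Each call to `log_softmax_row` touches all $M$ scores, which is the cost that becomes prohibitive for a large corpus.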

When $M$ is very large, it is not feasible to include all candidate items when computing the denominator. A common idea is to build the denominator from a subset of items instead. Given a mini-batch of $B$ pairs $(x_i, y_i, r_i)$, the batch softmax for each pair $i$ in the batch is

$$
P_B(y_i \mid x_i; \theta) = \frac{e^{s(x_i, y_i)}}{\sum_{j \in B} e^{s(x_i, y_j)}}
$$
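A small sketch of the batch softmax, assuming the in-batch scores are given as a $B \times B$ matrix `S` with `S[i][j]` $= s(x_i, y_j)$ so that the positive for query $i$ sits on the diagonal. The function name and the toy matrix are illustrative.

```python
import math


def batch_softmax_probs(S):
    """P_B(y_i | x_i) for each i, where S[i][j] = s(x_i, y_j) within the batch.

    The positive for query i is the item at the same index i; the other
    items in the batch act as negatives.
    """
    probs = []
    for i, row in enumerate(S):
        m = max(row)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in row]
        probs.append(exps[i] / sum(exps))
    return probs


# Batch of B = 2 pairs; a score of log(3) on the diagonal makes the
# positive three times as likely as the single negative.
S = [[math.log(3.0), 0.0],
     [0.0, math.log(3.0)]]
probs = batch_softmax_probs(S)  # each entry is 3 / (3 + 1) = 0.75
```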

In-batch items are normally sampled from a power-law distribution. As a result, the batch softmax is a heavily biased estimate of the full softmax: popular items are overly penalized as negatives because they are more likely to be included in a batch. Inspired by the logQ correction used in sampled softmax models, we correct each logit $s(x_i,y_j)$ by the following equation.

$$
s^c(x_i, y_j) = s(x_i, y_j) - \log(p_j)
$$

Here $p_j$ denotes the sampling probability of item $j$ in a random batch. With the correction, we have

$$
P_B^c(y_i \mid x_i; \theta) = \frac{e^{s^c(x_i, y_i)}}{e^{s^c(x_i, y_i)} + \sum_{j \in B, j \neq i} e^{s^c(x_i, y_j)}}
$$

The mini-batch loss function is

$$
L_B(\theta) = -\frac{1}{B} \sum_{i \in B} r_i \cdot \log\left( P_B^c (y_i \mid x_i; \theta) \right)
$$
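The corrected mini-batch loss can be sketched as below, assuming the in-batch scores are given as a matrix `S` with `S[i][j]` $= s(x_i, y_j)$ and that the sampling probabilities $p_j$ are supplied by the caller. The function name and toy values are hypothetical.

```python
import math


def corrected_batch_loss(S, rewards, sampling_probs):
    """Mini-batch loss with the logQ correction s^c = s - log(p_j).

    S[i][j] = s(x_i, y_j) for the B in-batch pairs, rewards[i] = r_i, and
    sampling_probs[j] = p_j is the (estimated) probability that item j
    appears in a random batch.
    """
    B = len(S)
    total = 0.0
    for i, row in enumerate(S):
        corrected = [s - math.log(p) for s, p in zip(row, sampling_probs)]
        m = max(corrected)  # stabilise the log-sum-exp
        log_z = m + math.log(sum(math.exp(c - m) for c in corrected))
        total += rewards[i] * (corrected[i] - log_z)
    return -total / B


S = [[1.0, 1.0], [1.0, 1.0]]
rewards = [1.0, 1.0]
uniform = corrected_batch_loss(S, rewards, [0.5, 0.5])  # log(2)
skewed = corrected_batch_loss(S, rewards, [0.8, 0.2])   # -0.5 * log(0.2 * 0.8)
```

With equal raw scores, the item with the higher sampling probability receives the smaller corrected logit, so it contributes less to the denominator when it appears as a negative.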

Running SGD with learning rate $\gamma$ yields the model parameter update as

$$
\theta \leftarrow \theta - \gamma \cdot \nabla L_B(\theta)
$$

### Algorithm 1 Training

**Inputs**

- Two parameterized embedding functions $u(\cdot,\theta)$ and $v(\cdot,\theta)$, each of which maps input features to an embedding space through a neural network.

- Learning rate $\gamma$, either fixed or adaptive.

**Repeat**

- Sample or receive a batch of training data from a stream
- Obtain the estimated sampling probability $p_i$ for each $y_i$ from the frequency estimation algorithm below.
- Construct loss $L_B(\theta)$
- Apply backpropagation and update $\theta$

**Until** stopping criterion
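The loop in Algorithm 1 can be sketched in plain Python under strong simplifying assumptions: the towers are replaced by free embedding vectors `U` and `V` (one per in-batch query and item), the same batch is reused at every step, and the sampling probabilities `p` are fixed by hand rather than estimated. The gradient is written out manually using $\partial L_B / \partial s^c(x_i, y_j) = \frac{r_i}{B}\left(P_{ij} - \mathbb{1}[i = j]\right)$, where $P_{ij}$ is the corrected softmax probability. All names are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def loss_and_grads(U, V, rewards, p):
    """Corrected batch loss and its gradients w.r.t. the embeddings."""
    B, k = len(U), len(U[0])
    loss, G = 0.0, []
    for i in range(B):
        logits = [dot(U[i], V[j]) - math.log(p[j]) for j in range(B)]
        m = max(logits)
        exps = [math.exp(c - m) for c in logits]
        z = sum(exps)
        probs = [e / z for e in exps]
        loss -= rewards[i] * math.log(probs[i]) / B
        # G[i][j] is the derivative of the loss w.r.t. the logit (i, j).
        G.append([rewards[i] * (probs[j] - (1.0 if j == i else 0.0)) / B
                  for j in range(B)])
    grad_U = [[sum(G[i][j] * V[j][d] for j in range(B)) for d in range(k)]
              for i in range(B)]
    grad_V = [[sum(G[i][j] * U[i][d] for i in range(B)) for d in range(k)]
              for j in range(B)]
    return loss, grad_U, grad_V


def sgd_step(U, V, rewards, p, lr):
    """One in-place update; returns the loss measured before the update."""
    loss, grad_U, grad_V = loss_and_grads(U, V, rewards, p)
    for params, grads in ((U, grad_U), (V, grad_V)):
        for vec, g in zip(params, grads):
            for d in range(len(vec)):
                vec[d] -= lr * g[d]
    return loss


rng = random.Random(0)
B, k = 4, 3
U = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(B)]
V = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(B)]
rewards = [1.0] * B
p = [0.4, 0.3, 0.2, 0.1]  # assumed sampling probabilities, fixed for the demo
losses = [sgd_step(U, V, rewards, p, lr=0.1) for _ in range(50)]
```

In practice an automatic differentiation framework would replace the hand-written gradient, and each step would draw a fresh batch and fresh probability estimates.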


### Algorithm 2 Streaming Frequency Estimation

**Inputs**

- Learning rate $\alpha$.
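- Arrays `A` and `B` of size `H`: `A[h(y)]` records the latest step at which item $y$ was sampled, and `B[h(y)]` holds the estimated number of steps between two consecutive occurrences of $y$.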
- Hash function `h` with output space `H`

**Training**

For steps `t = 1, 2, ...`, sample a batch of items $\beta$. For each $y \in \beta$ do:

$$
B[h(y)] = (1 - \alpha) \cdot B[h(y)] + \alpha \cdot (t - A[h(y)])
$$

$$
A[h(y)] = t
$$

**Until** stopping criterion

During the inference step, for any item $y$, the estimated sampling probability is

$$
\hat{p} = \frac{1}{B[h(y)]}
$$
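A minimal sketch of Algorithm 2 follows. Several details are assumptions rather than part of the source: Python's built-in `hash` stands in for the hash function `h`, duplicates within a batch are collapsed so an item is updated at most once per step, and an item that has never been seen gets probability `0.0`. The class and method names are hypothetical.

```python
class StreamingFrequencyEstimator:
    """Estimate sampling probability from the average gap between occurrences."""

    def __init__(self, num_buckets, alpha):
        self.H = num_buckets
        self.alpha = alpha
        self.A = [0] * num_buckets    # A[h(y)]: last step at which y was seen
        self.B = [0.0] * num_buckets  # B[h(y)]: moving average of the gap in steps

    def _bucket(self, item):
        # Assumption: built-in hash modulo H plays the role of h.
        return hash(item) % self.H

    def update(self, step, batch_items):
        # Assumption: each distinct item is updated once per step.
        for y in set(batch_items):
            idx = self._bucket(y)
            gap = step - self.A[idx]
            self.B[idx] = (1 - self.alpha) * self.B[idx] + self.alpha * gap
            self.A[idx] = step

    def probability(self, item):
        gap = self.B[self._bucket(item)]
        # Assumption: report 0.0 for buckets that were never updated.
        return 1.0 / gap if gap > 0 else 0.0


est = StreamingFrequencyEstimator(num_buckets=1024, alpha=0.1)
for t in range(1, 201):
    # Item 3 appears in every batch, item 7 in every second batch.
    batch = [3, 7] if t % 2 == 0 else [3]
    est.update(t, batch)
p3 = est.probability(3)  # close to 1.0
p7 = est.probability(7)  # close to 0.5
```

Items that hash to the same bucket share one estimate, so a small `H` tends to overestimate how often colliding items occur.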