# Comparing Minimum Risk Training and REINFORCE
## Diego Molla-Aliod
LTG, 6 May 2019

In this talk I share some notes and reflections comparing Minimum Risk Training (MRT) and REINFORCE, based on the following papers:

- Ayana, Shen, S., Zhao, Y., Liu, Z., & Sun, M. (2016). Neural Headline Generation with Sentence-wise Optimization. Retrieved from http://arxiv.org/abs/1604.01904
- Shen, S., Cheng, Y., He, Z., He, W., Wu, H., Sun, M., & Liu, Y. (2015). Minimum Risk Training for Neural Machine Translation. Retrieved from http://arxiv.org/abs/1512.02433


# Minimum Risk Training (Shen et al., 2015)
MRT has the following advantages over maximum likelihood estimation (MLE):
1. **Direct optimization with respect to evaluation metrics**: MRT introduces evaluation metrics as loss functions and aims to minimize expected loss on the training data.
2. **Applicable to arbitrary loss functions**, which are not necessarily differentiable.
3. **Transparent to architectures**: MRT does not assume a specific NMT architecture and can be applied to any end-to-end NMT system.

# End-to-end NMT: Using MLE


![NMT-background](images/NMT-background.png)

![NMT-background2](images/NMT-background2.png)

![NMT-background3](images/NMT-background3.png)

# Problems with MLE

1. **Exposure bias**: During training, the model only sees prefixes drawn from the training data distribution, but at test time it generates each target word conditioned on its own previous predictions, which can be erroneous.
2. **Loss function**: MLE usually uses a word-level cross-entropy loss that maximizes the probability of the next correct word, which may correlate poorly with corpus-level and sentence-level evaluation metrics such as BLEU.

# Enter MRT

In MRT, the *risk* is defined as the expected loss with respect to the posterior distribution:

![MRT-risk](images/MRT-risk.png)

where $\cal Y(x^{(s)})$ is the set of all possible translations for $x^{(s)}$ and $\Delta(y,y^{(s)})$ is a loss function that compares a candidate $y$ with the reference $y^{(s)}$. **The loss function does not depend on the model parameters $\theta$ and thus it does not need to be differentiable.**
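
As a toy illustration of this definition, the sketch below computes the risk of a single source sentence as a probability-weighted sum of losses. The three-candidate posterior and the loss values are made-up numbers (the losses could be, for example, 1 minus a sentence-level BLEU score); nothing here comes from the papers.

```python
def risk(posterior, losses):
    """Expected loss: sum over candidates y of P(y|x) * Delta(y, y_ref)."""
    return sum(p * loss for p, loss in zip(posterior, losses))

# Hypothetical posterior P(y|x) over three candidate translations.
posterior = [0.5, 0.3, 0.2]
# Hypothetical losses Delta(y, y_ref) for the same three candidates.
losses = [0.1, 0.6, 0.9]

sentence_risk = risk(posterior, losses)
print(sentence_risk)
```

The loss values enter only as constants that weight the probabilities, which is why the loss itself never has to be differentiated.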


The training objective of MRT is to minimise the risk on the training data:
![MRT-objective](images/MRT-objective.png)

In MRT, the partial derivative with respect to a model parameter $\theta_i$ is given by:
![MRT-derivative](images/MRT-derivative.png)
* Since Eq. (10) suggests there is no need to differentiate $\Delta(y,y^{(s)})$, MRT allows arbitrary non-differentiable loss functions.
* In addition, the approach is transparent to architectures and can be applied to arbitrary end-to-end NMT models.

# Solving MRT
However, the expectations in Eq. (10) are usually intractable to calculate due to:
1. the exponential search space of $\cal Y(x^{(s)})$,
2. the non-decomposability of the loss function $\Delta(y,y^{(s)})$, and
3. the context sensitiveness of NMT.

To alleviate this problem, Shen et al. (2015) propose to **use a subset of the full search space to approximate the posterior distribution** and introduce a new training objective:
![MRT-risk-estimate](images/MRT-risk-estimate.png)![MRT-risk-estimate2](images/MRT-risk-estimate2.png) 
where:
1. $S(x^{(s)}) \subset \cal Y(x^{(s)})$ is a sampled subset of the search space (see Algorithm 1),
2. $Q(y|x^{(s)};\theta,\alpha)$ is a distribution defined on the subspace $S(x^{(s)})$, and
3. $\alpha$ is a hyper-parameter that controls the sharpness of the $Q$ distribution.
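
To see what the sharpness parameter does, here is a minimal sketch that assumes $Q$ is obtained by raising each candidate's model probability to the power $\alpha$ and renormalising over the sampled subset only. The function names and the numbers are illustrative, not taken from the paper.

```python
import math

def q_distribution(log_probs, alpha):
    """Renormalise P(y|x)^alpha over the sampled candidates only."""
    scaled = [alpha * lp for lp in log_probs]
    top = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def approx_risk(log_probs, losses, alpha):
    """Expected loss under Q instead of the full posterior."""
    q = q_distribution(log_probs, alpha)
    return sum(qi * loss for qi, loss in zip(q, losses))

# Hypothetical sequence log-probabilities and losses for three candidates.
log_probs = [math.log(0.5), math.log(0.3), math.log(0.2)]
losses = [0.1, 0.6, 0.9]

q_flat = q_distribution(log_probs, alpha=0.005)   # close to uniform
q_sharp = q_distribution(log_probs, alpha=5.0)    # mass on the best candidate
```

A small $\alpha$ flattens $Q$ towards a uniform distribution over the subset, while a large $\alpha$ concentrates it on the candidate the model already prefers.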

![Sampling](images/Sampling.png)
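
One way to build such a subset can be sketched as follows. This is not a transcription of Algorithm 1: it assumes that candidates are drawn word by word from the model's next-word distribution and that the reference translation is always added to the subset. `toy_next_token_probs` is a hypothetical stand-in for a real decoder.

```python
import random

EOS = "</s>"

def toy_next_token_probs(prefix):
    """Hypothetical stand-in for the decoder's next-word distribution."""
    if len(prefix) >= 3:
        return {EOS: 1.0}
    return {"a": 0.4, "b": 0.4, EOS: 0.2}

def sample_sequence(next_token_probs, max_len, rng):
    """Draw one candidate by sampling a word at a time until EOS or max_len."""
    seq = []
    while len(seq) < max_len:
        probs = next_token_probs(seq)
        tokens = list(probs)
        token = rng.choices(tokens, weights=[probs[t] for t in tokens])[0]
        if token == EOS:
            break
        seq.append(token)
    return tuple(seq)

def sample_subset(reference, next_token_probs, k, max_len, rng):
    """Collect up to k distinct samples plus the reference (an assumption)."""
    subset = {tuple(reference)}
    for _ in range(k):
        subset.add(sample_sequence(next_token_probs, max_len, rng))
    return subset

rng = random.Random(0)
candidates = sample_subset(["a", "b"], toy_next_token_probs, k=20, max_len=5, rng=rng)
```

Because the subset is a set, repeated samples collapse into one candidate, so its size can be smaller than `k + 1`.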

Given the sampled space, the partial derivative with respect to a model parameter $\theta_i$ of $\tilde{\cal R}(\theta)$ is given by
![MRT-derivative2](images/MRT-derivative2.png)
Since $|S(x^{(s)})|$ can be relatively small (in their experiments, 100 was good enough), the expectations in Eq. (14) can be efficiently calculated by explicitly enumerating all candidates in $S(x^{(s)})$.
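
The enumeration can be illustrated with a small sketch. It assumes that $Q$ is proportional to $P(y|x^{(s)};\theta)^\alpha$ over the subset, and it differentiates the approximate risk with respect to each candidate's sequence log-probability rather than with respect to $\theta_i$; the remaining chain-rule step through the network is left out. This is my own derivation from that assumption, with illustrative names and numbers.

```python
import math

def q_distribution(log_probs, alpha):
    """Renormalise P(y|x)^alpha over the sampled candidates only."""
    scaled = [alpha * lp for lp in log_probs]
    top = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def approx_risk(log_probs, losses, alpha):
    """Approximate risk: expected loss under Q over the subset."""
    q = q_distribution(log_probs, alpha)
    return sum(qi * loss for qi, loss in zip(q, losses))

def risk_grad_wrt_log_probs(log_probs, losses, alpha):
    """d(risk)/d(log P(y)) = alpha * Q(y) * (Delta(y) - E_Q[Delta])."""
    q = q_distribution(log_probs, alpha)
    expected_loss = sum(qi * loss for qi, loss in zip(q, losses))
    return [alpha * qi * (loss - expected_loss) for qi, loss in zip(q, losses)]

# Hypothetical sequence log-probabilities and losses for three candidates.
log_probs = [-0.7, -1.2, -1.6]
losses = [0.1, 0.6, 0.9]
alpha = 0.5

grad = risk_grad_wrt_log_probs(log_probs, losses, alpha)
```

Candidates whose loss is below the expected loss get a negative gradient, so a gradient descent step raises their log-probability. The loss values appear only as constants, so any sentence-level metric can be plugged in.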

# REINFORCE