# Evaluating Topic Models

This document straightens out how we're going to handle scoring our models. There are two things we look at:

1. Predictive Likelihood
2. Document Completion

We then compare what we've written to Heinrich, Wallach, and Teh.

# Predictive Likelihood

We consider models whose "global" parameters are $\Phi$ and whose document-local parameters -- if they exist -- are $\theta_d$ for each document $d$. We also consider document-local indicators $z_d$.

We want to predict the likelihood of a new observation $x_*$ according to our inferred parameters.

## ML and MAP 

For maximum likelihood, and MAP, we learn a _value_ $\Phi$ (not a distribution) and then use that value, so we have

$$
\begin{align*}
p(x_*|\Phi) = \int_{\theta_*} \sum_{z_*} p(x_*|z_*, \Phi)p(z_*|\theta_*)p(\theta_*|\alpha)
\end{align*}
$$

which for mixture models, simplifies to


$$
\begin{align*}
p(x_*|\Phi) = \sum_{z_*} p(x_*|z_*, \Phi)p(z_*|\alpha)
\end{align*}
$$

### Mixture Models (Point Estimate)

For a mixture model trained using EM, the expression above applies exactly, and simplifies to

$$
p(x_*|\Phi) = \sum_{k=1}^{K} p(x_* | \phi_k)p(z_*=k|\alpha)
$$

which is the _exact_ solution for ML / MAP inference.
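As a minimal sketch of this sum, assume each component is a categorical distribution over terms and the document $x_*$ is a bag of words. The names `phi`, `pi` and `counts` are illustrative choices, not part of the notes above; `pi[k]` stands in for $p(z_*=k|\alpha)$.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def mixture_log_predictive(counts, phi, pi):
    """log p(x_* | Phi) for a bag-of-words document under a mixture of categoricals.

    counts[t] : number of times term t occurs in x_*
    phi[k][t] : point estimate of p(term t | component k)
    pi[k]     : point estimate of p(z_* = k | alpha)  (assumed name)
    """
    per_component = []
    for phi_k, pi_k in zip(phi, pi):
        log_lik = sum(n * math.log(p) for n, p in zip(counts, phi_k) if n > 0)
        per_component.append(math.log(pi_k) + log_lik)
    return log_sum_exp(per_component)

# Toy parameters: K = 2 components over a 3-term vocabulary.
phi = [[0.7, 0.2, 0.1],
       [0.1, 0.3, 0.6]]
pi = [0.5, 0.5]
print(mixture_log_predictive([2, 1, 0], phi, pi))
```

Working in log space matters for real documents, where the per-component likelihoods underflow quickly.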

## Bayesian Methods

For Bayesian methods we want to fully capture our uncertainty, and so _avoid_ doing point estimates.

Hence our predictive likelihood is

$$
\begin{align*}
p(x_*|\mathcal{X}) = \int_\Phi \int_{\theta_*} \sum_{z_*}
p(x_*|\Phi, z_*)p(z_*|\theta_*)p(\theta_*|\alpha)p(\Phi, \Theta, \alpha|\mathcal{X})
\end{align*}
$$

We treat the hyper-parameters $\alpha$ as fixed parameters rather than random variables, so in fact we're using $p(x_*|\mathcal{X}; \alpha)$.

There are three approaches to this. 

### Point Estimates
The first is just to use point estimates, i.e. let

$$
\begin{align*}
\hat{\Phi} & = \mathbb{E}_{\Phi|\mathcal{X}}\left[\Phi\right]
\end{align*}
$$

Then for a mixture model we have
$$
\begin{align*}
p(x_*|\mathcal{X}) & = \int_\Phi \sum_{z_*}
p(x_*|\Phi, z_*)p(z_*|\alpha)p(\Phi|\mathcal{X}) \\
 & \approx \int_\Phi \sum_{z_*}
p(x_*|\hat{\Phi}, z_*)p(z_*|\alpha)p(\Phi|\mathcal{X}) \\
 & = \sum_k
p(x_*|\hat{\phi}_k)p(z_* = k|\alpha)\int_\Phi p(\Phi|\mathcal{X}) \\
 & = \sum_k
p(x_*|\hat{\phi}_k)p(z_* = k|\alpha) \\
\end{align*}
$$

In this case we're using the Bayesian framework _only_ to generate better parameter estimates (and so avoid overfitting), not to generate proper predictive distributions.

For a topic model this is FIXME

### Exact Estimates using Sampling

For mixture models we approximate

$$
\begin{align*}
p(x_*|\mathcal{X}) & = \int_\Phi \sum_{z_*}
    p(x_*|\Phi, z_*)p(z_*|\alpha)p(\Phi|\mathcal{X}) \\
 & \approx \frac{1}{S} \sum_s \sum_{z_*}
    p(x_*|\Phi_s, z_*)p(z_*|\alpha) & \Phi_s \sim p(\Phi|\mathcal{X})
\end{align*}
$$

For this to work we need a sampling distribution $p(\Phi|\mathcal{X})$. 
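A minimal sketch of the sample average for a mixture of categoricals follows. It assumes the samples `phi_samples` were already drawn from $p(\Phi|\mathcal{X})$ by some sampler. The names and toy numbers are illustrative only, and `pi[k]` again stands in for $p(z_*=k|\alpha)$.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_lik_given_phi(counts, phi, pi):
    """log sum_k p(x_*|phi_k) p(z_*=k|alpha) for one parameter setting Phi."""
    return log_sum_exp([
        math.log(pi_k) + sum(n * math.log(p) for n, p in zip(counts, phi_k) if n > 0)
        for phi_k, pi_k in zip(phi, pi)
    ])

def mc_log_predictive(counts, phi_samples, pi):
    """log of (1/S) sum_s p(x_* | Phi_s), with Phi_s assumed drawn from p(Phi | X)."""
    per_sample = [log_lik_given_phi(counts, phi_s, pi) for phi_s in phi_samples]
    return log_sum_exp(per_sample) - math.log(len(per_sample))

# Two made-up posterior samples of Phi (K = 2 components, 3 terms).
phi_samples = [
    [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]],
    [[0.5, 0.4, 0.1], [0.2, 0.2, 0.6]],
]
print(mc_log_predictive([1, 0, 0], phi_samples, [0.5, 0.5]))
```

Note that the average is taken over likelihoods, not log-likelihoods, which is why the per-sample values go through `log_sum_exp` before the $\log S$ correction.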

Frequently you might use an approximation such as $q(\Phi)$ as a proposal, accepting or rejecting its draws by comparison against $p(\Phi|\mathcal{X})$ to get samples $\Phi_s$. Metropolis methods, and Gibbs sampling in particular, can be used for this purpose. In general, if you have used Gibbs sampling for inference, you will already have a source of samples that characterise the posterior and can be used here.

Since we are only interested in evaluating the expectation $\mathbb{E}_{\Phi|\mathcal{X}}\left[p(x_*|\Phi)\right]$, we might be tempted to use importance sampling, in which samples $\Phi_s \sim q(\Phi)$ are weighted by $p(\Phi|\mathcal{X}) / q(\Phi)$. However, note that for text in particular the dimensionality of $\Phi$ is very large (typically more than $K \times 10000$ for $K$ topics), and importance sampling is notoriously bad in high dimensions, with potentially infinite variance (FIXME why). This degenerate behaviour can in particular be triggered by a mismatch in the tails of the distributions, and since variational methods are known (cite Bishop) to often understate variance due to their mode-seeking behaviour, importance sampling is particularly unsuited to estimating expectations using variational proposal distributions.
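The degeneracy can be seen in a toy setting that is an assumption of this sketch, not the topic-model posterior itself: the target is a standard Gaussian in `dim` dimensions, and the proposal is a Gaussian that understates the variance. The effective sample size of the importance weights, $(\sum_s w_s)^2 / \sum_s w_s^2$, is reported as a fraction of the number of samples.

```python
import math
import random

def effective_sample_fraction(dim, proposal_std, n_samples=2000, seed=0):
    """ESS / n for importance weights with target N(0, I) and proposal N(0, proposal_std^2 I)."""
    rng = random.Random(seed)
    log_w = []
    for _ in range(n_samples):
        x = [rng.gauss(0.0, proposal_std) for _ in range(dim)]
        sq = sum(v * v for v in x)
        log_p = -0.5 * sq                       # target density, up to a constant
        log_q = -0.5 * sq / proposal_std ** 2   # proposal density, up to a constant
        log_w.append(log_p - log_q)
    m = max(log_w)
    w = [math.exp(v - m) for v in log_w]        # rescaling the weights leaves the ESS unchanged
    return sum(w) ** 2 / (n_samples * sum(v * v for v in w))

for dim in (1, 10, 100):
    print(dim, effective_sample_fraction(dim, proposal_std=0.8))
```

With a proposal that is only slightly too narrow in each dimension, the fraction should fall sharply as `dim` grows, because the per-dimension mismatch multiplies across dimensions and a handful of samples end up carrying almost all the weight.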

For topic models not much changes; we approximate

$$
\begin{align*}
p(x_*|\mathcal{X}) & = \int_\Phi \int_{\theta_*} \sum_{z_*}
    p(x_*|\Phi, z_*)p(z_*|\theta_*)p(\theta_*, \Phi|\mathcal{X}) \\
 & \approx \frac{1}{S} \sum_s \sum_{z_*}
    p(x_*|\Phi_s, z_*)p(z_*|\theta_*^{(s)}) & \theta_*^{(s)}, \Phi_s \sim p(\theta_*, \Phi|x_*, \mathcal{X})
\end{align*}
$$

In the case where Gibbs sampling has been used as an inference mechanism, and so independent approximate posteriors exist for $\Phi$ and $\theta_*$, it should be possible to generate new samples for $\Phi$ and $\theta_*$ jointly. Assuming the inference was run to convergence in the "training" phase, the sampling distribution for $\Phi$ should be stationary (FIXME), and future samples drawn from it should represent the same distribution as previous samples (excluding "burn-in"). For the same reason it is possible -- if preferred -- to re-use the existing set of samples for $\Phi$ to estimate $\theta_*$.

As before, importance sampling using approximate distributions $q(\Phi)$ and $q(\theta_*)$ is not recommended.

### Exact Estimates using Deterministic Approximations

Assume we have an approximation for our posterior, which is $q(\Phi) \approx p(\Phi|\mathcal{X})$, or -- in the case of topic models -- $q(\Phi, \Theta) \approx p(\Phi, \Theta|\mathcal{X})$. Typically this approximation is chosen to simplify analytical inference, and so often factorises as $q(\Phi, \Theta) = q(\Phi)q(\Theta)$.

For mixture models we approximate

$$
\begin{align*}
p(x_*|\mathcal{X}) & = \int_\Phi \sum_{z_*}
    p(x_*|\Phi, z_*)p(z_*|\alpha)p(\Phi|\mathcal{X}) \\
 & \approx \int_\Phi \sum_{z_*}
    p(x_*|\Phi, z_*)p(z_*|\alpha)q(\Phi)
\end{align*}
$$

at which point, given an appropriate selection of $q(\Phi)$, it should be possible to analytically evaluate the integral.
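As a worked instance of such a selection (an assumption for illustration, not something fixed above), suppose each component is a categorical distribution over terms, $q(\Phi) = \prod_k \text{Dir}(\phi_k|\lambda_k)$ for hypothetical variational parameters $\lambda_k$, and $x_*$ contains term $t$ exactly $n_{*t}$ times, with $N_* = \sum_t n_{*t}$. Conjugacy then gives each component's integral in closed form:

$$
\begin{align*}
\int_{\phi_k} p(x_*|\phi_k)q(\phi_k)
 & = \int_{\phi_k} \prod_t \phi_{kt}^{n_{*t}} \text{Dir}(\phi_k|\lambda_k) \\
 & = \frac{\Gamma\left(\sum_t \lambda_{kt}\right)}{\Gamma\left(N_* + \sum_t \lambda_{kt}\right)}
     \prod_t \frac{\Gamma(n_{*t} + \lambda_{kt})}{\Gamma(\lambda_{kt})}
\end{align*}
$$

Since the remaining factors $q(\phi_j)$, $j \neq k$, each integrate to one, the predictive likelihood under this assumption would be

$$
p(x_*|\mathcal{X}) \approx \sum_k p(z_* = k|\alpha)
    \frac{\Gamma\left(\sum_t \lambda_{kt}\right)}{\Gamma\left(N_* + \sum_t \lambda_{kt}\right)}
    \prod_t \frac{\Gamma(n_{*t} + \lambda_{kt})}{\Gamma(\lambda_{kt})}
$$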

A similar approach applies to topic models:

$$
\begin{align*}
p(x_*|\mathcal{X}) & = \int_\Phi \int_{\theta_*} \sum_{z_*}
    p(x_*|\Phi, z_*)p(z_*|\theta_*)p(\Phi, \theta_*|x_*, \mathcal{X}) \\
 & \approx \int_\Phi \int_{\theta_*} \sum_{z_*}
    p(x_*|\Phi, z_*)p(z_*|\theta_*)q(\theta_*)q(\Phi)
\end{align*}
$$

# Document Completion

In practice topic models are often used to create latent representations of documents. The predictive likelihood doesn't measure the performance of the model for this task.

Document completion is a metric which obtains a low-rank representation of a portion of a document $x_*^{(q)}$, and then evaluates the predictive likelihood of the remainder of the document $x_*^{(e)}$ given that fixed representation. The better the representation, the better it should explain the remaining words.

For a mixture model, the document-completion metric involves first obtaining the posterior over the topic assignment $z_*^{(q)}$ given $x_*^{(q)}$ and the model parameters. In the case of ML and MAP methods that can be obtained directly:

$$
p(z_*^{(q)} = k | x_*^{(q)}, \Phi) = \frac{p(x_*^{(q)}|\phi_k)p(z_*=k|\alpha)}{\sum_j p(x_*^{(q)}|\phi_j)p(z_*=j|\alpha)}
$$

In the case of a Bayesian inference scheme, you need to evaluate

$$
\begin{align*}
p(z_*^{(q)}| x_*^{(q)}, \mathcal{X})  & \propto \int_\Phi p(x_*^{(q)}, z_*^{(q)}, \Phi| \mathcal{X}) \\
& = \int_\Phi p(x_*^{(q)} | z_*^{(q)}, \Phi) p(z_*^{(q)}|\alpha) p(\Phi|\mathcal{X})
\end{align*}
$$
and re-normalise to obtain a categorical distribution. As with inference for the posterior-predictive, you can either use sampling, or approximate $q(\Phi) \approx p(\Phi|\mathcal{X})$ in such a way that the integral can be analytically evaluated.



To evaluate the metric one then calculates

$$
\begin{align*}
p(x_*^{(e)}|x_*^{(q)}, \mathcal{X}) = \mathbb{E}_{z_*^{(q)} | x_*^{(q)}}\left[
    \int_\Phi p(x_*^{(e)}|\Phi, z_*^{(q)}) p(\Phi|\mathcal{X})
\right]
\end{align*}
$$

In the case where there's no posterior distribution on $\Phi$ -- e.g. you used a point-estimation method like EM -- then this simplifies to


$$
\begin{align*}
p(x_*^{(e)}|x_*^{(q)}, \Phi) & = \mathbb{E}_{z_*^{(q)} | x_*^{(q)}}\left[
    p(x_*^{(e)}|\Phi, z_*^{(q)})
\right] \\
 &= \sum_k p(x_*^{(e)}|\phi_k)p(z_*^{(q)} = k | x_*^{(q)}, \Phi)
\end{align*}
$$
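The point-estimate case can be sketched in two steps: normalise the joint over the query words to get the posterior over $z_*^{(q)}$, then average the likelihood of the evaluation words under that posterior. This is only an illustration for a mixture of categoricals; `phi`, `pi`, `query_counts` and `eval_counts` are assumed names, with `pi[k]` standing in for $p(z_*=k|\alpha)$.

```python
import math

def log_lik(counts, phi_k):
    """log p(x | phi_k) for term counts under one categorical component."""
    return sum(n * math.log(p) for n, p in zip(counts, phi_k) if n > 0)

def topic_posterior(query_counts, phi, pi):
    """p(z_* = k | x^(q), Phi): normalised p(x^(q)|phi_k) p(z_*=k|alpha)."""
    log_joint = [math.log(pi_k) + log_lik(query_counts, phi_k)
                 for phi_k, pi_k in zip(phi, pi)]
    m = max(log_joint)
    unnorm = [math.exp(v - m) for v in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def document_completion(query_counts, eval_counts, phi, pi):
    """p(x^(e) | x^(q), Phi) = sum_k p(x^(e)|phi_k) p(z_*=k | x^(q), Phi)."""
    post = topic_posterior(query_counts, phi, pi)
    # Plain exp is fine for a toy document; long documents need log-space sums.
    return sum(r * math.exp(log_lik(eval_counts, phi_k))
               for r, phi_k in zip(post, phi))

phi = [[0.7, 0.2, 0.1],
       [0.1, 0.3, 0.6]]
pi = [0.5, 0.5]
print(topic_posterior([1, 0, 0], phi, pi))
print(document_completion([1, 0, 0], [1, 0, 0], phi, pi))
```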

For topic models, instead of obtaining, and employing, a categorical posterior distribution over the choice of which _single_ topic $z_*$ was assigned to the document, we have a Dirichlet posterior distribution over which _mixture_ of topics was assigned to the document, $\theta_*$.

This changes the method only very slightly. As before the document-completion method is

$$
\begin{align*}
p(x_*^{(e)}|x_*^{(q)}, \mathcal{X}) = \mathbb{E}_{\theta_*^{(q)} | x_*^{(q)}}\left[
    \int_\Phi \sum_{z_*} p(x_*^{(e)}|\Phi, z_*) p(z_*|\theta_*) p(\theta_*,\Phi|x_*^{(q)}, \mathcal{X})
\right]
\end{align*}
$$

As with the predictive posterior, the integral can be approximated either by sampling, or by employing a deterministic approximation $q(\theta_*, \Phi) \approx p(\theta_*, \Phi|x_*^{(q)}, \mathcal{X})$ that permits the derivation of an analytical solution.

We assume Bayesian methods are being employed to obtain parameter estimates. In that case obtaining the posterior over $\theta_*$ means marginalizing out $\Phi$:

$$
\begin{align*}
p(\theta_*|x_*^{(q)}, \mathcal{X}) = \int_\Phi p(\theta_*, \Phi|x_*^{(q)}, \mathcal{X})
\end{align*}
$$


In cases where sampling-based inference was performed, the samples can be used to create the marginal posterior distribution $p(\theta_*|x_*^{(q)}, \mathcal{X})$. If a mean-field approximation $q(\theta_*)q(\Phi) \approx p(\theta_*, \Phi|x_*^{(q)}, \mathcal{X})$ was employed, the approximate $q$ can be plugged in


$$
\begin{align*}
p(x_*^{(e)}|x_*^{(q)}, \mathcal{X}) = \mathbb{E}_{\theta_*^{(q)} | x_*^{(q)}}\left[
    \int_\Phi \sum_{z_*} p(x_*^{(e)}|\Phi, z_*) p(z_*|\theta_*) q(\theta_*)q(\Phi)
\right]
\end{align*}
$$

and used to analytically evaluate the integral. In cases where analytically evaluating the integral is itself impossible, we can substitute in point-estimates

$$
\begin{align*}
\hat{\theta}_* & = \mathbb{E}_q\left[\theta_*\right] \\
\hat{\Phi} & = \mathbb{E}_q\left[\Phi\right] \\
\end{align*}
$$

and then we get 
$$
\begin{align*}
p(x_*^{(e)}|x_*^{(q)}, \mathcal{X}) & \approx \mathbb{E}_{\theta_*^{(q)} | x_*^{(q)}}\left[
    \int_\Phi \sum_{z_*} p(x_*^{(e)}|\hat{\Phi}, z_*) p(z_*|\hat{\theta}_*) q(\theta_*)q(\Phi)
\right] \\
& = \sum_{z_*} p(x_*^{(e)}|\hat{\Phi}, z_*) p(z_*|\hat{\theta}_*) \int_\Phi \int_{\theta_*} q(\theta_*)q(\Phi) \\
& = \sum_{z_*} p(x_*^{(e)}|\hat{\Phi}, z_*) p(z_*|\hat{\theta}_*)
\end{align*}
$$
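A rough sketch of document completion with point estimates follows. It makes one substitution that should be read as an assumption: instead of the variational mean $\mathbb{E}_q\left[\theta_*\right]$, it obtains $\hat{\theta}_*$ from the query words by a plain maximum-likelihood EM "fold-in" with $\hat{\Phi}$ held fixed. The function names are hypothetical, and the evaluation words are scored per word as $\sum_k \hat{\phi}_{kt}\hat{\theta}_{*k}$.

```python
import math

def fold_in_theta(query_counts, phi, n_iter=50):
    """ML estimate of theta_* from x^(q) with Phi held fixed, by EM."""
    K = len(phi)
    theta = [1.0 / K] * K
    total = sum(query_counts)
    for _ in range(n_iter):
        new_theta = [0.0] * K
        for t, n in enumerate(query_counts):
            if n == 0:
                continue
            joint = [theta[k] * phi[k][t] for k in range(K)]
            norm = sum(joint)
            for k in range(K):
                new_theta[k] += n * joint[k] / norm   # expected topic counts
        theta = [v / total for v in new_theta]
    return theta

def completion_log_lik(eval_counts, phi, theta):
    """log prod_t (sum_k phi[k][t] * theta[k]) ** n_t over the evaluation words."""
    return sum(
        n * math.log(sum(theta[k] * phi[k][t] for k in range(len(phi))))
        for t, n in enumerate(eval_counts) if n > 0
    )

phi_hat = [[0.9, 0.1],
           [0.1, 0.9]]
theta_hat = fold_in_theta([6, 4], phi_hat)      # query portion x^(q)
print(theta_hat)
print(completion_log_lik([3, 2], phi_hat, theta_hat))   # evaluation portion x^(e)
```

Because $\hat{\theta}_*$ is fitted on the query words only, the score on the evaluation words is a held-out quantity for this document.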

# How does this compare with other methods?

## Heinrich

He used the point-estimate substitution trick:

$$
\begin{align*}
p(x_*|\mathcal{M}) 
& = \prod_n \sum_k p(x_{*n}|\phi_k)p(z_{*n} = k|\theta_*) \\
& = \prod_t \left( \sum_k \phi_{kt} \theta_{*k} \right)^{n_{*t}}
\end{align*}
$$

So why does this work then?

## Teh

Teh used point estimates, the same as Heinrich.

## Wallach 

It's a really funny paper. Given a fixed point estimate for $\Phi$ and $\alpha$, she's concerned with estimating

$$
p(x|\Phi, \alpha)
$$

which is just a component of the broader method

$$
p(x|\mathcal{X}) = \int_\Phi \int_\alpha p(x|\Phi, \alpha)p(\Phi, \alpha|\mathcal{X})
$$

It's possible to obtain samples via Gibbs sampling:

$$
p(z_{*n} = k | z_*^{(\setminus n)}, \Phi, \alpha) \propto \phi_{kw_{*n}} \frac{
    n_{*k}^{(\setminus n)} + \alpha_k
}{
    n - 1 + \sum_j \alpha_j
}
$$

And of course, once that's done, it's possible to estimate a sample of $\theta_{*}$ as:

$$
\theta_{*k}^{(s)} = \frac{
    n_{*k}^{(s)} + \alpha_k
}{
    n + \sum_j \alpha_j
}
$$

And so determine the likelihood just as in the Heinrich and Teh method (section "5.1 Estimated $\theta$"):

$$
\begin{align*}
p(x|\Phi, \alpha) \approx \frac{1}{S}\sum_s
\left(
    \prod_t \left(
        \sum_k \phi_{kt} \theta^{(s)}_{*k}
    \right)^{n_{*t}}
  \right)
\end{align*}
$$
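The sampling steps of this section can be sketched for a single held-out document as follows. This is an illustration under stated assumptions, not the paper's code: `words` is a list of term indices, `phi` and `alpha` are fixed, and the function names are hypothetical. The denominator of the Gibbs conditional does not depend on $k$, so the sketch leaves it out of the unnormalised weights.

```python
import math
import random

def gibbs_theta_samples(words, phi, alpha, n_samples=200, burn_in=50, seed=0):
    """Sample z_* for one document with Phi and alpha fixed; return theta_* estimates."""
    rng = random.Random(seed)
    K = len(phi)
    z = [rng.randrange(K) for _ in words]
    counts = [0] * K
    for k in z:
        counts[k] += 1
    samples = []
    for sweep in range(burn_in + n_samples):
        for n, w in enumerate(words):
            counts[z[n]] -= 1                      # remove word n from the counts
            weights = [phi[k][w] * (counts[k] + alpha[k]) for k in range(K)]
            z[n] = rng.choices(range(K), weights=weights)[0]
            counts[z[n]] += 1
        if sweep >= burn_in:
            # All words are counted here, so each theta estimate sums to one.
            denom = len(words) + sum(alpha)
            samples.append([(counts[k] + alpha[k]) / denom for k in range(K)])
    return samples

def estimated_theta_log_lik(words, phi, theta_samples):
    """log of (1/S) sum_s prod_n sum_k phi[k][w_n] * theta_s[k]."""
    lls = [sum(math.log(sum(phi[k][w] * th[k] for k in range(len(phi)))) for w in words)
           for th in theta_samples]
    m = max(lls)
    return m + math.log(sum(math.exp(v - m) for v in lls)) - math.log(len(lls))

phi = [[0.7, 0.2, 0.1],
       [0.1, 0.3, 0.6]]
alpha = [0.5, 0.5]
words = [0, 0, 1, 0, 2]
thetas = gibbs_theta_samples(words, phi, alpha, n_samples=100, burn_in=20)
print(estimated_theta_log_lik(words, phi, thetas))
```

Note that this sketch samples $\theta_*$ using the same words it then scores; in the document-completion setting one would instead sample from the query words and score the evaluation words.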

So the aim of the paper is how to estimate the embedded document specific $\theta_*$ (and so $z_{*n}$) for all 

<font color='red'>
    Something that this illustrates is that neither Teh nor Heinrich is giving the proper posterior probability $p(x|\mathcal{X})$ in the context of document completion: instead of reporting $p(x^{(e)}|x^{(q)}, \mathcal{X})$ they are reporting $p(x^{(e)}|x^{(q)}, \Phi, \alpha)$.
</font>

Another thing that follows is that you can't really use the variational bound as a proxy for predictive likelihood. The variational bound is an approximation of the marginal likelihood according to the prior distributions, but a different bound arises when doing posterior prediction, as shown by Bishop with the Variational Mixture of Gaussians example.

Another issue is that she goes straight to sampling without ever looking at analytical solutions. $p(z|\alpha)$ has a Polya distribution (since $\theta_{*}$ got marginalized out).
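For reference, the Polya (Dirichlet-multinomial) probability of an ordered assignment sequence has the closed form $p(z|\alpha) = \frac{\Gamma(\sum_k \alpha_k)}{\Gamma(n + \sum_k \alpha_k)} \prod_k \frac{\Gamma(n_k + \alpha_k)}{\Gamma(\alpha_k)}$, where $n_k$ counts the assignments to topic $k$. A small sketch, with `log_polya` as a hypothetical helper name:

```python
import math

def log_polya(z, alpha):
    """log p(z | alpha) with theta integrated out, for an ordered sequence z of topic indices."""
    counts = [0] * len(alpha)
    for k in z:
        counts[k] += 1
    a0 = sum(alpha)
    out = math.lgamma(a0) - math.lgamma(len(z) + a0)
    for n_k, a_k in zip(counts, alpha):
        out += math.lgamma(n_k + a_k) - math.lgamma(a_k)
    return out

# Two words both assigned to topic 0 under a uniform Dirichlet prior over two topics.
print(math.exp(log_polya([0, 0], [1.0, 1.0])))
```

The toy value can be checked by hand as a Polya urn: the first assignment has probability $1/2$ and the second $2/3$, giving $1/3$.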

After that though it's the usual issue. She's dissatisfied with the decomposition

$$
p(w|\Phi, \alpha) = \int_\theta \sum_z p(w|\Phi, z)p(z|\theta)p(\theta|\alpha)
$$

as it essentially means sampling from the prior. However by analogy with mixture models, this is in fact the proper way to go about things. All other methods to avoid this are a bit squiffy.

> Importance sampling does not work well when sampling from high-dimensional distributions. Unless the proposal distribution is a near-perfect approximation to the target distribution, the variance of the estimator will be very large. When sampling continuous values, such as $\theta$, the estimator may have infinite variance.

The harmonic mean method is probably the most justifiable, if still a little squiffy.

What makes this all justifiable -- ish -- is that in document completion you are using one set of words to sample a posterior which you then use to evaluate the remaining set of words.