[SPARK-9888][MLlib]User guide for new LDA features #8254

Closed · wants to merge 3 commits
Changes from 1 commit
111 changes: 97 additions & 14 deletions docs/mllib-clustering.md
@@ -443,23 +443,106 @@ LDA can be thought of as a clustering algorithm as follows:
* Rather than estimating a clustering using a traditional distance, LDA uses a function based
on a statistical model of how text documents are generated.

LDA takes in a collection of documents as vectors of word counts.
It supports different inference algorithms via `setOptimizer` function. EMLDAOptimizer learns clustering using [expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
on the likelihood function and yields comprehensive results, while OnlineLDAOptimizer uses iterative mini-batch sampling for [online variational inference](https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf) and is generally memory friendly. After fitting on the documents, LDA provides:

* Topics: Inferred topics, each of which is a probability distribution over terms (words).
* Topic distributions for documents: For each non empty document in the training set, LDA gives a probability distribution over topics. (EM only). Note that for empty documents, we don't create the topic distributions. (EM only)
LDA supports different inference algorithms via the `setOptimizer` function.
`EMLDAOptimizer` learns clustering using
[expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
on the likelihood function and yields comprehensive results, while
`OnlineLDAOptimizer` uses iterative mini-batch sampling for
[online variational inference](https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf)
and is generally memory-friendly.

LDA takes the following parameters:
LDA takes in a collection of documents as vectors of word counts and the
following parameters (a brief configuration sketch follows the list):

* `k`: Number of topics (i.e., cluster centers)
* `maxIterations`: Limit on the number of iterations of EM used for learning
* `docConcentration`: Hyperparameter for prior over documents' distributions over topics. Currently must be > 1, where larger values encourage smoother inferred distributions.
* `topicConcentration`: Hyperparameter for prior over topics' distributions over terms (words). Currently must be > 1, where larger values encourage smoother inferred distributions.
* `checkpointInterval`: If using checkpointing (set in the Spark configuration), this parameter specifies the frequency with which checkpoints will be created. If `maxIterations` is large, using checkpointing can help reduce shuffle file sizes on disk and help with failure recovery.

*Note*: LDA is a new feature with some missing functionality. In particular, it does not yet
support prediction on new documents, and it does not have a Python API. These will be added in the future.
* `LDAOptimizer`: Optimizer to use for learning the LDA model, either
Member: Actually just called "optimizer" in public API

Contributor Author: OK

`EMLDAOptimizer` or `OnlineLDAOptimizer`
* `docConcentration`: Dirichlet parameter for prior over documents'
distributions over topics. Larger values encourage smoother inferred
distributions.
* `topicConcentration`: Dirichlet parameter for prior over topics'
distributions over terms (words). Larger values encourage smoother
inferred distributions.
* `maxIterations`: Limit on the number of iterations.
* `checkpointInterval`: If using checkpointing (set in the Spark
configuration), this parameter specifies the frequency with which
checkpoints will be created. If `maxIterations` is large, using
checkpointing can help reduce shuffle file sizes on disk and help with
failure recovery.
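
The following is a minimal, illustrative sketch of setting these parameters, assuming the Spark 1.5-era `org.apache.spark.mllib.clustering.LDA` API, an existing `SparkContext` named `sc`, and a tiny placeholder corpus; the specific values carry no recommendation.

```scala
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Corpus of (document id, vector of word counts); contents are placeholders.
val corpus: RDD[(Long, Vector)] = sc.parallelize(Seq(
  (0L, Vectors.dense(1.0, 2.0, 0.0, 5.0)),
  (1L, Vectors.dense(0.0, 3.0, 1.0, 2.0))))

val lda = new LDA()
  .setK(3)                      // number of topics
  .setDocConcentration(1.1)     // prior over documents' distributions over topics
  .setTopicConcentration(1.1)   // prior over topics' distributions over terms
  .setMaxIterations(50)         // iteration limit
  .setCheckpointInterval(10)    // only takes effect if a checkpoint dir is configured
  .setOptimizer("em")           // or "online"; an LDAOptimizer instance also works

val model = lda.run(corpus)
```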


All of MLlib's LDA models support the following (a short usage sketch follows the list):

* `describeTopics(n: Int)`: Prints `n` of the inferred topics, each of
Member: "n" is not the parameter name, and the parameter actually specifies the number of terms per topic. For all of these, I don't think you need to specify a parameter name, just the gist of what the method does.

Contributor Author: OK

which is a probability distribution over terms (words).
* `topicsMatrix`: For each non-empty document in the
Member: The topicsMatrix returns the topics (k x vocabSize), not the document-topic distributions.

Contributor Author: OK

training set, LDA gives a probability distribution over topics. Note
that for empty documents, we don't create the topic distributions.
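
As a brief usage sketch of these two accessors (assuming a fitted `model` as in the configuration sketch above; the printing logic is illustrative only):

```scala
// describeTopics returns, for each inferred topic, the top-weighted terms
// (as vocabulary indices) paired with their weights.
model.describeTopics(5).zipWithIndex.foreach { case ((termIndices, termWeights), topic) =>
  val terms = termIndices.zip(termWeights)
    .map { case (term, weight) => s"$term -> $weight" }
    .mkString(", ")
  println(s"Topic $topic: $terms")
}

// topicsMatrix packs the inferred topics into a single Matrix
// (see the review note above regarding what it actually contains).
val topicsMat = model.topicsMatrix
println(s"topicsMatrix dimensions: ${topicsMat.numRows} x ${topicsMat.numCols}")
```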

*Note*: LDA is still an experimental feature under active development.
As a result, certain features are only available in one of the two
optimizers (and the model it produces). The following
discussion describes each optimizer/model pair separately.
Member: Also comment that the distributed model can be converted into a local model, but not vice versa.

Contributor Author: OK


**EMLDAOptimizer and DistributedLDAModel**
Member: I would name these sections after the algorithm to be more recognizable: Expectation-Maximization and Online Variational Bayes. The optimizer and model names can be featured just below the section titles.

Contributor Author: OK


For the parameters provided to `LDA`:

* `docConcentration`: Only symmetric priors are supported, so all values
in the provided `k`-dimensional vector must be identical. All values
must also be $> 1.0$. Providing `Vector(-1)` results in default behavior
(uniform `k`-dimensional vector with value $(50 / k) + 1$).
* `topicConcentration`: Only symmetric priors are supported. Values must be
$> 1.0$. Providing `-1` results in defaulting to a value of $0.1 + 1$.
* `maxIterations`: Interpreted as maximum number of EM iterations.
Member: Remove "Interpreted as"

Contributor Author: OK


`EMLDAOptimizer` produces a `DistributedLDAModel`, which stores not only
the inferred topics but also the full training corpus and topic
distributions for each document in the training corpus. A
`DistributedLDAModel` supports the following (a usage sketch follows the list):

* `topTopicsPerDocument(k)`: The top `k` topics and their weights for
each document in the training corpus
* `topDocumentsPerTopic(k)`: The top `k` documents for each topic and
the corresponding weight of the topic in the documents.
* `logPrior`: log probability of the estimated topics and
document-topic distributions given the hyperparameters
`docConcentration` and `topicConcentration`
* `logLikelihood`: log likelihood of the training corpus, given the
inferred topics and document-topic distributions
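
A sketch of the EM path end to end, reusing the placeholder `corpus` from the earlier sketch; the cast is needed because `run` is typed to return the base `LDAModel`, and `toLocal` reflects the conversion mentioned in the review comment above:

```scala
import org.apache.spark.mllib.clustering.{DistributedLDAModel, EMLDAOptimizer, LDA}

// EM path: docConcentration and topicConcentration must be symmetric and > 1.0.
val distributedModel = new LDA()
  .setK(3)
  .setOptimizer(new EMLDAOptimizer)
  .setDocConcentration(1.2)
  .setTopicConcentration(1.2)
  .setMaxIterations(20)
  .run(corpus)
  .asInstanceOf[DistributedLDAModel]

// Per-document and per-topic summaries retained by the distributed model.
val topTopics = distributedModel.topTopicsPerDocument(3)  // RDD of (docId, topic indices, weights)
val topDocs = distributedModel.topDocumentsPerTopic(3)    // one (docIds, weights) entry per topic

// Fit diagnostics with respect to the training corpus and the hyperparameters.
println(s"logPrior = ${distributedModel.logPrior}")
println(s"logLikelihood = ${distributedModel.logLikelihood}")

// Collapse to a LocalLDAModel (the reverse conversion is not possible).
val localModel = distributedModel.toLocal
```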

**OnlineLDAOptimizer and LocalLDAModel**

For the parameters provided to `LDA`:

* `docConcentration`: Asymmetric priors can be used by passing in a
vector with values equal to the Dirichlet parameter in each of the `k`
dimensions. Values should be $\geq 0$. Providing `Vector(-1)` results in
default behavior (uniform `k`-dimensional vector with value $(1.0 / k)$)
* `topicConcentration`: Only symmetric priors are supported. Values must be
$\geq 0$. Providing `-1` results in defaulting to a value of $(1.0 / k)$.
* `maxIterations`: Interpreted as maximum number of minibatches to
submit.

In addition, `OnlineLDAOptimizer` accepts the following parameters (a configuration sketch follows the list):

* `miniBatchFraction`: Fraction of corpus sampled and used at each
iteration
* `optimizeAlpha`: If set to true, performs maximum-likelihood
Member: I don't know how I missed this earlier, but a few issues with alpha:

  • I don't want us to use "alpha." We should use docConcentration instead to be consistent.
  • The public getter and setter are misspelled: "getOptimzeAlpha"

Can you please fix the getter/setter in a separate PR since it's an API change? Thanks!

Contributor Author: SPARK-10230 will track that

Should we deprecate the public APIs that reference alpha?

Member: Discussed offline: We'll rename get/setOptimizeAlpha, but will leave the others alone.

estimation of the hyperparameter `alpha` (aka `docConcentration`)
after each minibatch and returns the optimized `alpha` in the resulting
`LDAModel`
* `tau0` and `kappa`: Used for learning-rate decay, which is computed by
$(\tau_0 + iter)^{-\kappa}$ where $iter$ is the current number of iterations.
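
A configuration sketch for the online path; the optimizer-specific setters shown (`setMiniBatchFraction`, `setTau0`, `setKappa`) are assumed from the Spark 1.5-era `OnlineLDAOptimizer`, the setter for the optimize-docConcentration flag is omitted because its name is still being settled in the review thread above, and all values are placeholders:

```scala
import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}
import org.apache.spark.mllib.linalg.Vectors

// Optimizer-specific knobs live on the OnlineLDAOptimizer instance itself.
val onlineOptimizer = new OnlineLDAOptimizer()
  .setMiniBatchFraction(0.05)  // fraction of the corpus sampled per iteration
  .setTau0(1024.0)             // learning-rate decay: (tau0 + iter)^(-kappa)
  .setKappa(0.51)

val onlineModel = new LDA()
  .setK(3)
  .setOptimizer(onlineOptimizer)
  .setDocConcentration(Vectors.dense(0.1, 0.2, 0.3))  // asymmetric prior allowed here
  .setTopicConcentration(0.5)
  .setMaxIterations(100)       // number of minibatches to submit
  .run(corpus)                 // corpus as in the earlier sketch
```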

`OnlineLDAOptimizer` produces a `LocalLDAModel`, which only stores the
inferred topics. A `LocalLDAModel` supports the following (an evaluation sketch follows the list):

* `logLikelihood(documents)`: Calculates a lower bound on the log likelihood of
the provided `documents` given the inferred topics.
* `logPerplexity(documents)`: Calculates an upper bound on the
perplexity of the provided `documents` given the inferred topics.
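
A short evaluation sketch (assuming the `onlineModel` from the previous sketch and reusing the placeholder `corpus` as stand-in held-out documents; real held-out data would normally be used):

```scala
import org.apache.spark.mllib.clustering.LocalLDAModel

// run() is typed as LDAModel, so cast the online result to LocalLDAModel.
val localLDAModel = onlineModel.asInstanceOf[LocalLDAModel]

// Held-out documents, shaped like the training corpus (placeholder here).
val testCorpus = corpus

// Lower bound on the log likelihood of the provided documents.
println(s"logLikelihood bound = ${localLDAModel.logLikelihood(testCorpus)}")

// Upper bound on the perplexity of the provided documents.
println(s"logPerplexity bound = ${localLDAModel.logPerplexity(testCorpus)}")
```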

**Examples**

@@ -420,7 +420,6 @@ object LocalLDAModel extends Loader[LocalLDAModel] {
}
val topicsMat = Matrices.fromBreeze(brzTopics)

// TODO: initialize with docConcentration, topicConcentration, and gammaShape after SPARK-9940
new LocalLDAModel(topicsMat, docConcentration, topicConcentration, gammaShape)
}
}
@@ -68,6 +68,7 @@ class LDASuite extends SparkFunSuite with MLlibTestSparkContext {
// Train a model
val lda = new LDA()
lda.setK(k)
.setOptimizer(new EMLDAOptimizer)
.setDocConcentration(topicSmoothing)
.setTopicConcentration(termSmoothing)
.setMaxIterations(5)