Commit 7401012: Code review comments
Feynman Liang committed Aug 25, 2015 (1 parent: b8b9f9a)
1 changed file with 38 additions and 24 deletions: docs/mllib-clustering.md

## Latent Dirichlet allocation (LDA)

[Latent Dirichlet allocation (LDA)](http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)
is a topic model which infers topics from a collection of text documents.
LDA can be thought of as a clustering algorithm as follows:

* Topics correspond to cluster centers, and documents correspond to
examples (rows) in a dataset.
* Topics and documents both exist in a feature space, where feature
vectors are vectors of word counts (bag of words; see the sketch below).
* Rather than estimating a clustering using a traditional distance, LDA
uses a function based on a statistical model of how text documents are
generated.
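
To make the bag-of-words representation concrete, here is a minimal sketch;
the four-term vocabulary is made up for illustration:

```scala
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical vocabulary: Array("spark", "graph", "topic", "model").
// The document "spark topic spark model" becomes the word-count vector:
val doc = Vectors.dense(2.0, 0.0, 1.0, 1.0) // one count per vocabulary term
```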

LDA supports different inference algorithms via the `setOptimizer` function.
`EMLDAOptimizer` learns clustering using
[expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
on the likelihood function and yields comprehensive results, while
`OnlineLDAOptimizer` uses iterative mini-batch sampling for [online
variational
inference](https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf)
and is generally memory friendly.

LDA takes in a collection of documents as vectors of word counts and the
following parameters (set using the builder pattern; see the sketch after
this list):

* `k`: Number of topics (i.e., cluster centers)
* `optimizer`: Optimizer to use for learning the LDA model, either
`EMLDAOptimizer` or `OnlineLDAOptimizer`
* `docConcentration`: Dirichlet parameter for prior over documents'
distributions over topics. Larger values encourage smoother inferred
distributions.
* `topicConcentration`: Dirichlet parameter for prior over topics'
distributions over terms (words). Larger values encourage smoother
inferred distributions.
* `maxIterations`: Limit on the number of iterations.
* `checkpointInterval`: If using checkpointing (set in the Spark
configuration), this parameter determines the frequency with which
checkpoints will be created. If `maxIterations` is large, using
checkpointing can help reduce shuffle file sizes on disk and help with
failure recovery.
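
As a sketch of how a corpus is built and these parameters are set, assuming
an existing `SparkContext` `sc` and a whitespace-separated file of word
counts, one document per line (the path and parameter values are
placeholders):

```scala
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Parse each line into a vector of word counts.
val data = sc.textFile("data/mllib/sample_lda_data.txt")
val parsed: RDD[Vector] =
  data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))

// LDA expects an RDD of (documentId, wordCountVector) pairs.
val corpus: RDD[(Long, Vector)] = parsed.zipWithIndex.map(_.swap).cache()

// Set the parameters described above using the builder pattern.
val lda = new LDA()
  .setK(10)                  // number of topics
  .setDocConcentration(-1)   // -1 selects the optimizer-specific default
  .setTopicConcentration(-1)
  .setMaxIterations(20)
  .setCheckpointInterval(10)
  .setOptimizer("em")        // "em" or "online"

val ldaModel = lda.run(corpus)
```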

All of MLlib's LDA models support:

* `describeTopics`: Returns the top terms and their weights for each topic
* `topicsMatrix`: Returns a `vocabSize` by `k` matrix where each column
is a topic (see the sketch below)
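
A brief sketch of both calls, continuing from the `ldaModel` built above:

```scala
// Top 5 weighted terms for each inferred topic.
val topics: Array[(Array[Int], Array[Double])] =
  ldaModel.describeTopics(maxTermsPerTopic = 5)
topics.zipWithIndex.foreach { case ((termIndices, termWeights), topicId) =>
  println(s"Topic $topicId: " + termIndices.zip(termWeights).mkString(", "))
}

// topicsMatrix is vocabSize x k; entry (w, t) is the weight of term w in topic t.
val topicsMat = ldaModel.topicsMatrix
println(s"${topicsMat.numCols} topics over a ${topicsMat.numRows}-term vocabulary")
```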

*Note*: LDA is still an experimental feature under active development.
As a result, certain features are only available in one of the two
optimizers / models generated by the optimizer. Currently, a distributed
model can be converted into a local model (during which we assume a
uniform `docConcentration` document-topic prior), but not vice-versa.

The following discussion will describe each optimizer/model pair
separately.

**Expectation Maximization**

Implemented in
[`EMLDAOptimizer`](api/scala/index.html#org.apache.spark.mllib.clustering.EMLDAOptimizer)
and
[`DistributedLDAModel`](api/scala/index.html#org.apache.spark.mllib.clustering.DistributedLDAModel).

For the parameters provided to `LDA` (see the sketch after this list):

Expand All @@ -495,24 +505,29 @@ must also be $> 1.0$. Providing `Vector(-1)` results in default behavior
(uniform `k` dimensional vector with value $(50 / k) + 1$
* `topicConcentration`: Only symmetric priors supported. Values must be
$> 1.0$. Providing `-1` results in defaulting to a value of $0.1 + 1$.
* `maxIterations`: The maximum number of EM iterations.
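
A minimal sketch of running EM with these defaults, reusing the `corpus`
from the earlier sketch; the cast is needed because `run` is typed to
return the base `LDAModel`, and `toLocal` performs the distributed-to-local
conversion noted earlier:

```scala
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}

// With the "em" optimizer, run() produces a DistributedLDAModel.
val emModel = new LDA()
  .setK(10)
  .setOptimizer("em")
  .setMaxIterations(20)
  .run(corpus)
  .asInstanceOf[DistributedLDAModel]

// A distributed model can be converted to a local one, but not vice-versa.
val localModel = emModel.toLocal
```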

`EMLDAOptimizer` produces a `DistributedLDAModel`, which stores not only
the inferred topics but also the full training corpus and topic
distributions for each document in the training corpus. A
`DistributedLDAModel` supports (see the sketch after this list):

* `topTopicsPerDocument`: The top topics and their weights for
each document in the training corpus
* `topDocumentsPerTopic`: The top documents for each topic and
the corresponding weight of the topic in the documents.
* `logPrior`: log probability of the estimated topics and
document-topic distributions given the hyperparameters
`docConcentration` and `topicConcentration`
* `logLikelihood`: log likelihood of the training corpus, given the
inferred topics and document-topic distributions
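
A short sketch of these queries on the `emModel` above; the arguments `3`
and `5` are arbitrary illustration values:

```scala
// Top 3 topics per training document: RDD of (docId, topicIndices, topicWeights).
val topTopics = emModel.topTopicsPerDocument(3)

// Top 5 documents per topic: one (docIds, docWeights) pair per topic.
val topDocs = emModel.topDocumentsPerTopic(5)

// Log prior of the estimated distributions and training log likelihood.
println(s"logPrior = ${emModel.logPrior}")
println(s"logLikelihood = ${emModel.logLikelihood}")
```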

**Online Variational Bayes**

Implemented in
[`OnlineLDAOptimizer`](api/scala/index.html#org.apache.spark.mllib.clustering.OnlineLDAOptimizer)
and
[`LocalLDAModel`](api/scala/index.html#org.apache.spark.mllib.clustering.LocalLDAModel).

For the parameters provided to `LDA`:

* `docConcentration`: Asymmetric priors can be used by passing in a
vector with values equal to the Dirichlet parameter in each of the `k`
dimensions. Values should be $\geq 0$. Providing `Vector(-1)` results in
default behavior (uniform `k`-dimensional vector with value $(1.0 / k)$)
* `topicConcentration`: Only symmetric priors supported. Values must be
$\geq 0$. Providing `-1` results in defaulting to a value of $(1.0 / k)$.
* `maxIterations`: Maximum number of minibatches to submit.

In addition, `OnlineLDAOptimizer` accepts the following parameters (see the
sketch after this list):

* `miniBatchFraction`: Fraction of corpus sampled and used at each
iteration
* `optimizeDocConcentration`: If set to true, performs maximum-likelihood
estimation of the hyperparameter `docConcentration` (aka `alpha`)
after each minibatch and sets the optimized `docConcentration` in the
returned `LocalLDAModel`
* `tau0` and `kappa`: Used for learning-rate decay, which is computed by
$(\tau_0 + iter)^{-\kappa}$ where $iter$ is the current number of iterations.
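
A sketch of configuring these optimizer-specific parameters, again reusing
the `corpus` from the earlier sketch (the particular values are arbitrary):

```scala
import org.apache.spark.mllib.clustering.{LDA, LocalLDAModel, OnlineLDAOptimizer}

val onlineOptimizer = new OnlineLDAOptimizer()
  .setMiniBatchFraction(0.05)          // sample 5% of the corpus per minibatch
  .setOptimizeDocConcentration(true)   // learn docConcentration during training
  .setTau0(1024)                       // learning-rate decay: (tau0 + iter)^(-kappa)
  .setKappa(0.51)

// With the online optimizer, run() produces a LocalLDAModel.
val onlineModel = new LDA()
  .setK(10)
  .setOptimizer(onlineOptimizer)
  .setMaxIterations(50)   // maximum number of minibatches to submit
  .run(corpus)
  .asInstanceOf[LocalLDAModel]
```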

