Commit 7401012: Code review comments
Feynman Liang committed Aug 25, 2015 (1 parent: b8b9f9a)
1 changed file with 38 additions and 24 deletions: docs/mllib-clustering.md

## Latent Dirichlet allocation (LDA)

[Latent Dirichlet allocation (LDA)](http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)
is a topic model which infers topics from a collection of text documents.
LDA can be thought of as a clustering algorithm as follows:

* Topics correspond to cluster centers, and documents correspond to
examples (rows) in a dataset.
* Topics and documents both exist in a feature space, where feature
vectors are vectors of word counts (bag of words; see the sketch below).
* Rather than estimating a clustering using a traditional distance, LDA
uses a function based on a statistical model of how text documents are
generated.
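
To make the bag-of-words representation concrete, here is a minimal sketch;
the four-term vocabulary is made up for illustration:

```scala
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical vocabulary: Array("spark", "graph", "topic", "model").
// The document "spark topic spark model" becomes the word-count vector:
val doc = Vectors.dense(2.0, 0.0, 1.0, 1.0) // one count per vocabulary term
```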

LDA supports different inference algorithms via the `setOptimizer` function.
`EMLDAOptimizer` learns clustering using
[expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
on the likelihood function and yields comprehensive results, while
`OnlineLDAOptimizer` uses iterative mini-batch sampling for [online
variational
inference](https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf)
and is generally memory friendly.

LDA takes in a collection of documents as vectors of word counts and the
following parameters (set using the builder pattern; see the sketch after
this list):

* `k`: Number of topics (i.e., cluster centers)
* `optimizer`: Optimizer to use for learning the LDA model, either
`EMLDAOptimizer` or `OnlineLDAOptimizer`
* `docConcentration`: Dirichlet parameter for prior over documents'
distributions over topics. Larger values encourage smoother inferred
distributions.
* `topicConcentration`: Dirichlet parameter for prior over topics'
distributions over terms (words). Larger values encourage smoother
inferred distributions.
* `maxIterations`: Limit on the number of iterations.
* `checkpointInterval`: If using checkpointing (set in the Spark
configuration), this parameter determines the frequency with which
checkpoints will be created. If `maxIterations` is large, using
checkpointing can help reduce shuffle file sizes on disk and help with
failure recovery.
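
As a sketch of how a corpus is built and these parameters are set, assuming
an existing `SparkContext` `sc` and a whitespace-separated file of word
counts, one document per line (the path and parameter values are
placeholders):

```scala
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Parse each line into a vector of word counts.
val data = sc.textFile("data/mllib/sample_lda_data.txt")
val parsed: RDD[Vector] =
  data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))

// LDA expects an RDD of (documentId, wordCountVector) pairs.
val corpus: RDD[(Long, Vector)] = parsed.zipWithIndex.map(_.swap).cache()

// Set the parameters described above using the builder pattern.
val lda = new LDA()
  .setK(10)                  // number of topics
  .setDocConcentration(-1)   // -1 selects the optimizer-specific default
  .setTopicConcentration(-1)
  .setMaxIterations(20)
  .setCheckpointInterval(10)
  .setOptimizer("em")        // "em" or "online"

val ldaModel = lda.run(corpus)
```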

All of MLlib's LDA models support:

* `describeTopics`: Returns the top terms and their weights for each topic
* `topicsMatrix`: Returns a `vocabSize` by `k` matrix where each column
is a topic (see the sketch below)
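
A brief sketch of both calls, continuing from the `ldaModel` built above:

```scala
// Top 5 weighted terms for each inferred topic.
val topics: Array[(Array[Int], Array[Double])] =
  ldaModel.describeTopics(maxTermsPerTopic = 5)
topics.zipWithIndex.foreach { case ((termIndices, termWeights), topicId) =>
  println(s"Topic $topicId: " + termIndices.zip(termWeights).mkString(", "))
}

// topicsMatrix is vocabSize x k; entry (w, t) is the weight of term w in topic t.
val topicsMat = ldaModel.topicsMatrix
println(s"${topicsMat.numCols} topics over a ${topicsMat.numRows}-term vocabulary")
```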

*Note*: LDA is still an experimental feature under active development.
As a result, certain features are only available in one of the two
optimizers / models generated by the optimizer. Currently, a distributed
model can be converted into a local model (during which we assume a
uniform `docConcentration` document-topic prior), but not vice-versa.

The following discussion will describe each optimizer/model pair
separately.

**Expectation Maximization**

Implemented in
[`EMLDAOptimizer`](api/scala/index.html#org.apache.spark.mllib.clustering.EMLDAOptimizer)
and
[`DistributedLDAModel`](api/scala/index.html#org.apache.spark.mllib.clustering.DistributedLDAModel).

For the parameters provided to `LDA` (see the sketch after this list):

Expand All @@ -495,24 +505,29 @@ must also be $> 1.0$. Providing `Vector(-1)` results in default behavior
(uniform `k` dimensional vector with value $(50 / k) + 1$
* `topicConcentration`: Only symmetric priors supported. Values must be
$> 1.0$. Providing `-1` results in defaulting to a value of $0.1 + 1$.
* `maxIterations`: The maximum number of EM iterations.
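
A minimal sketch of running EM with these defaults, reusing the `corpus`
from the earlier sketch; the cast is needed because `run` is typed to
return the base `LDAModel`, and `toLocal` performs the distributed-to-local
conversion noted earlier:

```scala
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}

// With the "em" optimizer, run() produces a DistributedLDAModel.
val emModel = new LDA()
  .setK(10)
  .setOptimizer("em")
  .setMaxIterations(20)
  .run(corpus)
  .asInstanceOf[DistributedLDAModel]

// A distributed model can be converted to a local one, but not vice-versa.
val localModel = emModel.toLocal
```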

`EMLDAOptimizer` produces a `DistributedLDAModel`, which stores not only
the inferred topics but also the full training corpus and topic
distributions for each document in the training corpus. A
`DistributedLDAModel` supports (see the sketch after this list):

* `topTopicsPerDocument`: The top topics and their weights for
each document in the training corpus
* `topDocumentsPerTopic`: The top documents for each topic and
the corresponding weight of the topic in the documents.
* `logPrior`: log probability of the estimated topics and
document-topic distributions given the hyperparameters
`docConcentration` and `topicConcentration`
* `logLikelihood`: log likelihood of the training corpus, given the
inferred topics and document-topic distributions
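
A short sketch of these queries on the `emModel` above; the arguments `3`
and `5` are arbitrary illustration values:

```scala
// Top 3 topics per training document: RDD of (docId, topicIndices, topicWeights).
val topTopics = emModel.topTopicsPerDocument(3)

// Top 5 documents per topic: one (docIds, docWeights) pair per topic.
val topDocs = emModel.topDocumentsPerTopic(5)

// Log prior of the estimated distributions and training log likelihood.
println(s"logPrior = ${emModel.logPrior}")
println(s"logLikelihood = ${emModel.logLikelihood}")
```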

**Online Variational Bayes**

Implemented in
[`OnlineLDAOptimizer`](api/scala/index.html#org.apache.spark.mllib.clustering.OnlineLDAOptimizer)
and
[`LocalLDAModel`](api/scala/index.html#org.apache.spark.mllib.clustering.LocalLDAModel).

For the parameters provided to `LDA`:

* `docConcentration`: Asymmetric priors can be used by passing in a
vector with values equal to the Dirichlet parameter in each of the `k`
dimensions. Values should be $\geq 0$. Providing `Vector(-1)` results in
default behavior (uniform `k`-dimensional vector with value $(1.0 / k)$)
* `topicConcentration`: Only symmetric priors supported. Values must be
$\geq 0$. Providing `-1` results in defaulting to a value of $(1.0 / k)$.
* `maxIterations`: Maximum number of minibatches to submit.

In addition, `OnlineLDAOptimizer` accepts the following parameters (see the
sketch after this list):

* `miniBatchFraction`: Fraction of corpus sampled and used at each
iteration
* `optimizeDocConcentration`: If set to true, performs maximum-likelihood
estimation of the hyperparameter `docConcentration` (aka `alpha`)
after each minibatch and sets the optimized `docConcentration` in the
returned `LocalLDAModel`
* `tau0` and `kappa`: Used for learning-rate decay, which is computed by
$(\tau_0 + iter)^{-\kappa}$ where $iter$ is the current number of iterations.
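
A sketch of configuring these optimizer-specific parameters, again reusing
the `corpus` from the earlier sketch (the particular values are arbitrary):

```scala
import org.apache.spark.mllib.clustering.{LDA, LocalLDAModel, OnlineLDAOptimizer}

val onlineOptimizer = new OnlineLDAOptimizer()
  .setMiniBatchFraction(0.05)          // sample 5% of the corpus per minibatch
  .setOptimizeDocConcentration(true)   // learn docConcentration during training
  .setTau0(1024)                       // learning-rate decay: (tau0 + iter)^(-kappa)
  .setKappa(0.51)

// With the online optimizer, run() produces a LocalLDAModel.
val onlineModel = new LDA()
  .setK(10)
  .setOptimizer(onlineOptimizer)
  .setMaxIterations(50)   // maximum number of minibatches to submit
  .run(corpus)
  .asInstanceOf[LocalLDAModel]
```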

