From 7401012e597d51113605134b3080b8735c9239c4 Mon Sep 17 00:00:00 2001
From: Feynman Liang
Date: Tue, 25 Aug 2015 10:49:12 -0700
Subject: [PATCH] Code review comments

---
 docs/mllib-clustering.md | 62 ++++++++++++++++++++++++----------------
 1 file changed, 38 insertions(+), 24 deletions(-)

diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md
index 8c1562c9a1a57..280dff10d6b7b 100644
--- a/docs/mllib-clustering.md
+++ b/docs/mllib-clustering.md
@@ -438,10 +438,13 @@ sameModel = PowerIterationClusteringModel.load(sc, "myModelPath")
 is a topic model which infers topics from a collection of text documents.
 LDA can be thought of as a clustering algorithm as follows:
 
-* Topics correspond to cluster centers, and documents correspond to examples (rows) in a dataset.
-* Topics and documents both exist in a feature space, where feature vectors are vectors of word counts.
-* Rather than estimating a clustering using a traditional distance, LDA uses a function based
-  on a statistical model of how text documents are generated.
+* Topics correspond to cluster centers, and documents correspond to
+examples (rows) in a dataset.
+* Topics and documents both exist in a feature space, where feature
+vectors are vectors of word counts (bag of words).
+* Rather than estimating a clustering using a traditional distance, LDA
+uses a function based on a statistical model of how text documents are
+generated.
 
 LDA supports different inference algorithms via `setOptimizer` function.
 `EMLDAOptimizer` learns clustering using
@@ -453,10 +456,10 @@ inference](https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf)
 and is generally memory friendly.
 
 LDA takes in a collection of documents as vectors of word counts and the
-following parameters:
+following parameters (set using the builder pattern):
 
 * `k`: Number of topics (i.e., cluster centers)
-* `LDAOptimizer`: Optimizer to use for learning the LDA model, either
+* `optimizer`: Optimizer to use for learning the LDA model, either
 `EMLDAOptimizer` or `OnlineLDAOptimizer`
 * `docConcentration`: Dirichlet parameter for prior over documents'
 distributions over topics. Larger values encourage smoother inferred
@@ -474,18 +477,25 @@ failure recovery.
 
 All of MLlib's LDA models support:
 
-* `describeTopics(n: Int)`: Prints `n` of the inferred topics, each of
-which is a probability distribution over terms (words).
-* `topicsMatrix`: For each non empty document in the
-training set, LDA gives a probability distribution over topics. Note
-that for empty documents, we don't create the topic distributions.
+* `describeTopics`: Returns the top terms and their weights for each topic
+* `topicsMatrix`: Returns a `vocabSize` by `k` matrix where each column
+is a topic
 
 *Note*: LDA is still an experimental feature under active development.
 As a result, certain features are only available in one of the two
-optimizers / models generated by the optimizer. The following
-discussion will describe each optimizer/model pair separately.
+optimizers / models generated by the optimizer. Currently, a distributed
+model can be converted into a local model (during which we assume a
+uniform `docConcentration` document-topic prior), but not vice-versa.
 
-**EMLDAOptimizer and DistributedLDAModel**
+The following discussion will describe each optimizer/model pair
+separately.
+
+**Expectation Maximization**
+
+Implemented in
+[`EMLDAOptimizer`](api/scala/index.html#org.apache.spark.mllib.clustering.EMLDAOptimizer)
+and
+[`DistributedLDAModel`](api/scala/index.html#org.apache.spark.mllib.clustering.DistributedLDAModel).
 
 For the parameters provided to `LDA`:
 
@@ -495,16 +505,16 @@ must also be $> 1.0$. Providing `Vector(-1)` results in default
 behavior (uniform `k` dimensional vector with value $(50 / k) + 1$
 * `topicConcentration`: Only symmetric priors supported. Values must be
 $> 1.0$. Providing `-1` results in defaulting to a value of $0.1 + 1$.
-* `maxIterations`: Interpreted as maximum number of EM iterations.
+* `maxIterations`: The maximum number of EM iterations.
 
 `EMLDAOptimizer` produces a `DistributedLDAModel`, which stores not only
 the inferred topics but also the full training corpus and topic
 distributions for each document in the training corpus. A
 `DistributedLDAModel` supports:
 
-  * `topTopicsPerDocument(k)`: The top `k` topics and their weights for
+  * `topTopicsPerDocument`: The top topics and their weights for
 each document in the training corpus
-  * `topDocumentsPerTopic(k)`: The top `k` documents for each topic and
+  * `topDocumentsPerTopic`: The top documents for each topic and
 the corresponding weight of the topic in the documents.
 * `logPrior`: log probability of the estimated topics and document-topic
 distributions given the hyperparameters
@@ -512,7 +522,12 @@ distributions for each document in the training corpus. A
 * `logLikelihood`: log likelihood of the training corpus, given the
 inferred topics and document-topic distributions
 
-**OnlineLDAOptimizer and LocalLDAModel**
+**Online Variational Bayes**
+
+Implemented in
+[`OnlineLDAOptimizer`](api/scala/org/apache/spark/mllib/clustering/OnlineLDAOptimizer.html)
+and
+[`LocalLDAModel`](api/scala/org/apache/spark/mllib/clustering/LocalLDAModel.html).
 
 For the parameters provided to `LDA`:
 
@@ -522,17 +537,16 @@ dimensions. Values should be $>= 0$. Providing `Vector(-1)` results in
 default behavior (uniform `k` dimensional vector with value $(1.0 / k)$)
 * `topicConcentration`: Only symmetric priors supported. Values must be
 $>= 0$. Providing `-1` results in defaulting to a value of $(1.0 / k)$.
-* `maxIterations`: Interpreted as maximum number of minibatches to
-submit.
+* `maxIterations`: Maximum number of minibatches to submit.
 
 In addition, `OnlineLDAOptimizer` accepts the following parameters:
 
 * `miniBatchFraction`: Fraction of corpus sampled and used at each
 iteration
-* `optimizeAlpha`: If set to true, performs maximum-likelihood
-estimation of the hyperparameter `alpha` (aka `docConcentration`)
-after each minibatch and returns the optimized `alpha` in the resulting
-`LDAModel`
+* `optimizeDocConcentration`: If set to true, performs maximum-likelihood
+estimation of the hyperparameter `docConcentration` (aka `alpha`)
+after each minibatch and sets the optimized `docConcentration` in the
+returned `LocalLDAModel`
 * `tau0` and `kappa`: Used for learning-rate decay, which is computed
 by $(\tau_0 + iter)^{-\kappa}$ where $iter$ is the current number of
 iterations.
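
Below is a minimal usage sketch of the API surface this patch documents, assuming an existing `SparkContext` `sc`, the `spark.mllib` clustering package as of Spark 1.5, and the bundled `data/mllib/sample_lda_data.txt` sample file; the parameter values are illustrative, not recommendations.

{% highlight scala %}
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA, OnlineLDAOptimizer}
import org.apache.spark.mllib.linalg.Vectors

// Load term-count vectors and key each document by a unique ID.
val data = sc.textFile("data/mllib/sample_lda_data.txt")
val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
val corpus = parsedData.zipWithIndex.map(_.swap).cache()

// Configure LDA via the builder pattern; -1 requests each optimizer's default prior.
val lda = new LDA()
  .setK(3)
  .setDocConcentration(-1)
  .setTopicConcentration(-1)
  .setMaxIterations(20)
  .setOptimizer("em")

// With the EM optimizer, run() yields a DistributedLDAModel.
val distModel = lda.run(corpus).asInstanceOf[DistributedLDAModel]

// Supported by all LDA models.
val topics = distModel.topicsMatrix      // vocabSize x k matrix, one topic per column
val topTerms = distModel.describeTopics(5)

// DistributedLDAModel-specific summaries described above.
val topTopics = distModel.topTopicsPerDocument(3)
val topDocs = distModel.topDocumentsPerTopic(3)
println(s"logLikelihood = ${distModel.logLikelihood}, logPrior = ${distModel.logPrior}")

// Online variational Bayes instead of EM; setter names assumed to mirror the
// parameters listed in the patch.
val onlineModel = new LDA()
  .setK(3)
  .setMaxIterations(50)
  .setOptimizer(new OnlineLDAOptimizer()
    .setMiniBatchFraction(0.05)
    .setOptimizeDocConcentration(true))
  .run(corpus)
{% endhighlight %}

`setOptimizer` accepts either the strings `"em"`/`"online"` or an explicit `LDAOptimizer` instance; the instance form is what exposes the optimizer-specific parameters such as `miniBatchFraction`.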