
JIRA issue: [SPARK-1405] Gibbs sampling based Latent Dirichlet Allocation (LDA) for MLlib #476

Closed
wants to merge 5 commits into from

Conversation

yinxusen
Contributor

(This PR is based on joint work with @liancheng four months ago.)

Overview

LDA is a classical topic model in machine learning that extracts topics from a corpus. Gibbs sampling (GS for short) is a common way to train an LDA model.

The LDA model consists of four count structures, two 1-dimensional vectors:

  • Document counts
  • Topic counts

plus two 2-dimensional matrices:

  • Document-Topic counts
  • Topic-Term counts
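The four count structures above might be sketched as a plain Scala case class. This is a simplified sketch using Long count arrays; the PR itself stores them as MLlib Vectors, and only the field names mirror the diff:

```scala
// Sketch of the LDA model state: four count structures.
// docCounts(d): total tokens in document d
// topicCounts(k): total tokens assigned to topic k
// docTopicCounts(d)(k): tokens in document d assigned to topic k
// topicTermCounts(k)(w): occurrences of term w assigned to topic k
case class LdaModelCounts(
    docCounts: Array[Long],
    topicCounts: Array[Long],
    docTopicCounts: Array[Array[Long]],
    topicTermCounts: Array[Array[Long]])

object LdaModelCounts {
  // Initialize all counts to zero for the given corpus dimensions.
  def empty(numDocs: Int, numTopics: Int, vocabSize: Int): LdaModelCounts =
    LdaModelCounts(
      new Array[Long](numDocs),
      new Array[Long](numTopics),
      Array.fill(numDocs)(new Array[Long](numTopics)),
      Array.fill(numTopics)(new Array[Long](vocabSize)))
}
```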

Implementation details

  • An accumulator is used to aggregate all updated values and apply them to the old model computed in the previous iteration.
  • Chalk is used for term segmentation. Though it would be easy to rewrite this with Lucene analyzers, I think MLlib should not take on the burden of maintaining a tokenizer implementation.
  • SparkContext.wholeTextFiles() is convenient for offline experimentation, while SparkContext.textFile() is better for online applications.
  • The document dictionary and term dictionary are broadcast to translate document names and terms into Int IDs.
  • The topic assignment matrix from the previous iteration is cached for the current iteration, then unpersisted to release memory.
  • LDA suffers from a stack overflow problem similar to that of MLlib ALS (SPARK-1006). To work around this issue, we checkpoint every few iterations.
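The checkpointing workaround in the last bullet boils down to truncating the RDD lineage at a fixed interval. A minimal sketch of the decision logic, assuming a hypothetical `checkpointInterval` parameter that is not part of the PR:

```scala
// Decide whether to checkpoint at this iteration, truncating the RDD lineage
// to avoid the SPARK-1006-style stack overflow described above.
// `checkpointInterval` is an illustrative parameter, not from the PR.
def shouldCheckpoint(iteration: Int, checkpointInterval: Int): Boolean =
  checkpointInterval > 0 && iteration > 0 && iteration % checkpointInterval == 0

// In the driver loop this would guard a call such as:
//   if (shouldCheckpoint(iter, 10)) topicAssignments.checkpoint()
```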

@yinxusen yinxusen changed the title JIRA issue: [SPARK-1405](https://issues.apache.org/jira/browse/SPARK-1405) Gibbs sampling based Latent Dirichlet Allocation (LDA) for MLlib JIRA issue: [SPARK-1405] Gibbs sampling based Latent Dirichlet Allocation (LDA) for MLlib Apr 22, 2014
@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14318/

```scala
docCounts: Vector,
topicCounts: Vector,
docTopicCounts: Array[Vector],
topicTermCounts: Array[Vector])
```
Contributor

I expect that this will be really big - maybe the last two variables should be RDDs - similar to what we do with ALS.

Contributor Author

That makes sense. I think docTopicCounts could be sliced easily with respect to document partitions, but topicTermCounts is hard to slice. I'll find a way to address it.

@etrain
Contributor

etrain commented Apr 30, 2014

Before I get too deep into this review - I want to step back and think about whether we expect the model in this case to be on the order of the size of the data - I think it is, and if so, we may want to consider representing the model as RDD[DocumentTopicFeatures] and RDD[TopicWordFeatures], similar to what we do with ALS. This may change the algorithm substantially.

Separately, maybe it makes sense to have a concrete use case to work with (the Reuters dataset or something) so that we can evaluate how much memory actually gets used on a reasonably sized corpus.

Perhaps @mengxr or @jegonzal has a strong opinion on this.
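The distributed representation suggested above might look like the following sketch; the type and field names are illustrative, not from the PR:

```scala
// Hedged sketch of the RDD-backed model @etrain suggests, analogous to ALS:
// each document and each term carries its own per-topic count vector, so the
// model scales with the data instead of living on the driver.
case class DocumentTopicFeatures(docId: Int, topicCounts: Array[Double])
case class TopicWordFeatures(termId: Int, topicCounts: Array[Double])

// In Spark these would be distributed collections, e.g.:
//   val docTopics: RDD[DocumentTopicFeatures] = ...
//   val topicWords: RDD[TopicWordFeatures] = ...
```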

@etrain
Contributor

etrain commented Apr 30, 2014

Also, speaking of @jegonzal maybe this is a natural first point of integration between MLlib and GraphX - I know the GraphX folks have an implementation of LDA, and maybe this is a chance for us to leverage that work.

@yinxusen
Contributor Author

Yep, I know @jegonzal from his paper on parallel Gibbs sampling, but I'm only familiar with the GraphLab implementation and couldn't find one in GraphX. It would be great to have the chance to talk with Joseph offline.

Besides, I will add a use case for the Reuters dataset and try to fix the issues raised above.

@jegonzal
Contributor

I would be happy to talk more about this after the OSDI deadline. As for storing the model (or more precisely the counts and samples) as an RDD, I think this really is necessary: the model in this case should be on the order of the size of the data.

Essentially what you want is the ability to join the term topic counts with the document topic counts for each token in a given document. Given these two counts tables (along with the background distribution of topics in the entire corpus) you can compute the new topic assignment.

Here is an implementation of the collapsed Gibbs sampler for LDA using GraphX: amplab/graphx#113
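The per-token update described above is the standard collapsed Gibbs conditional, p(z = k | rest) ∝ (n_dk + α)(n_kw + β) / (n_k + Vβ). A pure-Scala sketch with illustrative names and symmetric priors:

```scala
// Collapsed Gibbs sampling conditional for one token:
//   p(z = k | rest) ∝ (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta)
// where n_dk = document-topic count, n_kw = topic-term count,
// n_k = corpus-wide topic count, V = vocabulary size.
def topicConditional(
    docTopic: Array[Int],    // n_dk for the current document
    topicTerm: Array[Int],   // n_kw for the current term
    topicTotals: Array[Int], // n_k over the whole corpus
    vocabSize: Int,
    alpha: Double,
    beta: Double): Array[Double] = {
  val unnorm = Array.tabulate(docTopic.length) { k =>
    (docTopic(k) + alpha) * (topicTerm(k) + beta) /
      (topicTotals(k) + vocabSize * beta)
  }
  val z = unnorm.sum
  unnorm.map(_ / z) // normalized distribution over topics
}
```

The new topic assignment for the token is then drawn from this distribution, and the three count tables are updated accordingly.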

@yinxusen
Contributor Author

Yep, thanks @jegonzal and @etrain. I'll try to fix these issues and look forward to the next round of updates and discussion.


```scala
// Tokenize and filter terms
val almostData = sc.wholeTextFiles(dir, minSplits).map { case (fileName, content) =>
  val tokens = JavaWordTokenizer(content)
```
Contributor

We should allow users to customize here. We could add a parameter `tokenizer: (String) => Iterable[String]` to `loadCorpus`, so that `dirStopWords` would no longer be required.
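That suggested signature might look like this sketch; `defaultTokenizer` and the `loadCorpus` outline are illustrative, not the PR's actual API:

```scala
// Hedged sketch of a pluggable tokenizer, as the review comment suggests.
// `defaultTokenizer` is an illustrative stand-in for JavaWordTokenizer.
def defaultTokenizer(content: String): Iterable[String] =
  content.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq

// loadCorpus would then accept the tokenizer as a parameter, e.g.:
// def loadCorpus(sc: SparkContext, dir: String, minSplits: Int,
//                tokenizer: String => Iterable[String] = defaultTokenizer) = ...
```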

@mengxr
Contributor

mengxr commented Sep 26, 2014

@yinxusen Per discussion on https://issues.apache.org/jira/browse/SPARK-1405, we want to have a GraphX-based implementation and distributed representation of the topic model. Do you mind closing this PR? Thanks for your contribution and @etrain @jegonzal and @witgo for code review!

@asfgit asfgit closed this in f341e1c Oct 2, 2014
@liancheng liancheng deleted the lda branch October 2, 2014 07:21