
JIRA issue: [SPARK-1405] Gibbs sampling based Latent Dirichlet Allocation (LDA) for MLlib #476

Closed
wants to merge 5 commits into from

Conversation

yinxusen
Contributor

(This PR is based on joint work with @liancheng four months ago.)

Overview

LDA is a classical topic model in machine learning that extracts topics from a corpus. Gibbs sampling (GS for short) is a common way to train an LDA model.

The LDA model consists of four count structures, two 1-dimensional vectors:

  • Document counts
  • Topic counts

plus two 2-dimensional matrices:

  • Document-Topic counts
  • Topic-Term counts
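The four count structures above might be sketched as a plain Scala case class. This is a simplified sketch using Long count arrays; the PR itself stores them as MLlib Vectors, and only the field names mirror the diff:

```scala
// Sketch of the LDA model state: four count structures.
// docCounts(d): total tokens in document d
// topicCounts(k): total tokens assigned to topic k
// docTopicCounts(d)(k): tokens in document d assigned to topic k
// topicTermCounts(k)(w): occurrences of term w assigned to topic k
case class LdaModelCounts(
    docCounts: Array[Long],
    topicCounts: Array[Long],
    docTopicCounts: Array[Array[Long]],
    topicTermCounts: Array[Array[Long]])

object LdaModelCounts {
  // Initialize all counts to zero for the given corpus dimensions.
  def empty(numDocs: Int, numTopics: Int, vocabSize: Int): LdaModelCounts =
    LdaModelCounts(
      new Array[Long](numDocs),
      new Array[Long](numTopics),
      Array.fill(numDocs)(new Array[Long](numTopics)),
      Array.fill(numTopics)(new Array[Long](vocabSize)))
}
```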

Implementation details

  • An accumulator is used to aggregate all updated values and apply them to the old model computed in the previous iteration.
  • Chalk is used for term segmentation. Though it would be easy to rewrite this with Lucene analyzers, I think MLlib should not take on the burden of maintaining a tokenizer implementation.
  • SparkContext.wholeTextFiles() is convenient for offline experimentation, while SparkContext.textFile() is better for online applications.
  • The document dictionary and term dictionary are broadcast to translate document names and terms into Int IDs.
  • The topic assignment matrix from the previous iteration is cached for the current iteration, then unpersisted to release memory.
  • LDA suffers from a stack overflow problem similar to that of MLlib ALS (SPARK-1006). To work around this issue, we checkpoint every few iterations.
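The checkpointing workaround in the last bullet boils down to truncating the RDD lineage at a fixed interval. A minimal sketch of the decision logic, assuming a hypothetical `checkpointInterval` parameter that is not part of the PR:

```scala
// Decide whether to checkpoint at this iteration, truncating the RDD lineage
// to avoid the SPARK-1006-style stack overflow described above.
// `checkpointInterval` is an illustrative parameter, not from the PR.
def shouldCheckpoint(iteration: Int, checkpointInterval: Int): Boolean =
  checkpointInterval > 0 && iteration > 0 && iteration % checkpointInterval == 0

// In the driver loop this would guard a call such as:
//   if (shouldCheckpoint(iter, 10)) topicAssignments.checkpoint()
```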

@yinxusen yinxusen changed the title JIRA issue: [SPARK-1405](https://issues.apache.org/jira/browse/SPARK-1405) Gibbs sampling based Latent Dirichlet Allocation (LDA) for MLlib JIRA issue: [SPARK-1405] Gibbs sampling based Latent Dirichlet Allocation (LDA) for MLlib Apr 22, 2014
@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14318/

```scala
docCounts: Vector,
topicCounts: Vector,
docTopicCounts: Array[Vector],
topicTermCounts: Array[Vector])
```
Contributor

I expect that this will be really big - maybe the last two variables should be RDDs - similar to what we do with ALS.

Contributor Author

That makes sense. I think docTopicCounts could be sliced easily with respect to document partitions, but topicTermCounts is hard to slice. I'll find a way to address it.

@etrain
Contributor

etrain commented Apr 30, 2014

Before I get too deep into this review - I want to step back and think about whether we expect the model in this case to be on the order of the size of the data - I think it is, and if so, we may want to consider representing the model as RDD[DocumentTopicFeatures] and RDD[TopicWordFeatures], similar to what we do with ALS. This may change the algorithm substantially.

Separately, maybe it makes sense to have a concrete use case to work with (the Reuters dataset or something) so that we can evaluate how much memory actually gets used on a reasonably sized corpus.

Perhaps @mengxr or @jegonzal has a strong opinion on this.
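The distributed representation suggested above might look like the following sketch; the type and field names are illustrative, not from the PR:

```scala
// Hedged sketch of the RDD-backed model @etrain suggests, analogous to ALS:
// each document and each term carries its own per-topic count vector, so the
// model scales with the data instead of living on the driver.
case class DocumentTopicFeatures(docId: Int, topicCounts: Array[Double])
case class TopicWordFeatures(termId: Int, topicCounts: Array[Double])

// In Spark these would be distributed collections, e.g.:
//   val docTopics: RDD[DocumentTopicFeatures] = ...
//   val topicWords: RDD[TopicWordFeatures] = ...
```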

@etrain
Contributor

etrain commented Apr 30, 2014

Also, speaking of @jegonzal maybe this is a natural first point of integration between MLlib and GraphX - I know the GraphX folks have an implementation of LDA, and maybe this is a chance for us to leverage that work.

@yinxusen
Contributor Author

Yep, I know @jegonzal from his paper on parallel Gibbs sampling, but I'm only familiar with the GraphLab implementation and couldn't find one in GraphX. It would be great to have the chance to talk with Joseph offline.

Besides, I will add a use case for the Reuters dataset and try to fix the issues raised above.

@jegonzal
Contributor

I would be happy to talk more about this after the OSDI deadline. As for storing the model (or more precisely the counts and samples) as an RDD, I think this really is necessary: the model in this case should be on the order of the size of the data.

Essentially what you want is the ability to join the term topic counts with the document topic counts for each token in a given document. Given these two counts tables (along with the background distribution of topics in the entire corpus) you can compute the new topic assignment.

Here is an implementation of the collapsed Gibbs sampler for LDA using GraphX: amplab/graphx#113
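The per-token update described above is the standard collapsed Gibbs conditional, p(z = k | rest) ∝ (n_dk + α)(n_kw + β) / (n_k + Vβ). A pure-Scala sketch with illustrative names and symmetric priors:

```scala
// Collapsed Gibbs sampling conditional for one token:
//   p(z = k | rest) ∝ (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta)
// where n_dk = document-topic count, n_kw = topic-term count,
// n_k = corpus-wide topic count, V = vocabulary size.
def topicConditional(
    docTopic: Array[Int],    // n_dk for the current document
    topicTerm: Array[Int],   // n_kw for the current term
    topicTotals: Array[Int], // n_k over the whole corpus
    vocabSize: Int,
    alpha: Double,
    beta: Double): Array[Double] = {
  val unnorm = Array.tabulate(docTopic.length) { k =>
    (docTopic(k) + alpha) * (topicTerm(k) + beta) /
      (topicTotals(k) + vocabSize * beta)
  }
  val z = unnorm.sum
  unnorm.map(_ / z) // normalized distribution over topics
}
```

The new topic assignment for the token is then drawn from this distribution, and the three count tables are updated accordingly.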

@yinxusen
Contributor Author

Yep, thanks @jegonzal and @etrain. I'll try to fix these issues and look forward to the next round of updates and discussion.


```scala
// Tokenize and filter terms
val almostData = sc.wholeTextFiles(dir, minSplits).map { case (fileName, content) =>
  val tokens = JavaWordTokenizer(content)
```
Contributor

We should allow users to customize here. We could add a parameter `tokenizer: (String) => Iterable[String]` to `loadCorpus`, so that `dirStopWords` would no longer be required.
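That suggested signature might look like this sketch; `defaultTokenizer` and the `loadCorpus` outline are illustrative, not the PR's actual API:

```scala
// Hedged sketch of a pluggable tokenizer, as the review comment suggests.
// `defaultTokenizer` is an illustrative stand-in for JavaWordTokenizer.
def defaultTokenizer(content: String): Iterable[String] =
  content.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq

// loadCorpus would then accept the tokenizer as a parameter, e.g.:
// def loadCorpus(sc: SparkContext, dir: String, minSplits: Int,
//                tokenizer: String => Iterable[String] = defaultTokenizer) = ...
```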

@mengxr
Contributor

mengxr commented Sep 26, 2014

@yinxusen Per discussion on https://issues.apache.org/jira/browse/SPARK-1405, we want to have a GraphX-based implementation and distributed representation of the topic model. Do you mind closing this PR? Thanks for your contribution and @etrain @jegonzal and @witgo for code review!

@asfgit asfgit closed this in f341e1c Oct 2, 2014
@liancheng liancheng deleted the lda branch October 2, 2014 07:21