This repository has been archived by the owner on Apr 21, 2023. It is now read-only.

Spot-165 Expose optional online optimizer #54

Merged

Conversation

rabarona

This PR implements the changes requested in Jira ticket SPOT-165: it adds the option to select the LDA algorithm implementation, online or EM.

Main changes

  • Added a new parameter for running spot-ml. This parameter selects the LDA implementation: the EM optimizer or the Online optimizer.
  • Modified spot-setup/spot.conf: added LDA_OPTIMIZER, LDA_ALPHA and LDA_BETA to select the implementation and configure the LDA optimizer parameters.
  • Modified ml_ops.sh to accept the new parameters mentioned above.
  • Modified SpotLDAWrapper to receive the optimizer parameter and run LDA with the selected optimizer.
  • Updated the unit tests to run with both the EM and the Online optimizer.
  • Updated ML_OPS.md with information about the new parameters.
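The selection described above can be sketched as follows. This is an illustrative outline, not the exact PR code; the helper name `buildLDA` and its parameters are hypothetical, while the Spark MLlib classes are real:

```scala
// Hypothetical sketch of selecting the LDA optimizer from a configuration
// value (e.g. LDA_OPTIMIZER), using Spark MLlib's real optimizer classes.
import org.apache.spark.mllib.clustering.{EMLDAOptimizer, LDA, LDAOptimizer, OnlineLDAOptimizer}

def buildLDA(optimizerName: String, topicCount: Int, alpha: Double, beta: Double): LDA = {
  val optimizer: LDAOptimizer = optimizerName.toLowerCase match {
    case "em"     => new EMLDAOptimizer
    case "online" => new OnlineLDAOptimizer
    case other    => throw new IllegalArgumentException(s"Unsupported LDA optimizer: $other")
  }
  new LDA()
    .setK(topicCount)
    .setDocConcentration(alpha)   // LDA_ALPHA
    .setTopicConcentration(beta)  // LDA_BETA
    .setOptimizer(optimizer)
}
```

Keeping the optimizer choice in one place like this means the rest of the pipeline only deals with a configured `LDA` instance.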

@lujacab

lujacab commented Jun 15, 2017

+1

@raypanduro
Contributor

+1

val ldaCorpus: RDD[(Long, Vector)] =
  formatSparkLDAInput(docWordCountCache,
    documentDictionary,
    wordDictionary,
    sqlContext)

val corpusSize = ldaCorpus.count()
Contributor

This is only used in the online case; might as well put it by line 90.

Author

I will change this, thanks.


// If the caller does not provide a seed to LDA (i.e. ldaSeed is empty), the seed is automatically set to the hash value of the class name

if (ldaSeed.nonEmpty) {
lda.setSeed(ldaSeed.get)
}

val (wordTopicMat, docTopicDist) = ldaOptimizer match {
case _: EMLDAOptimizer => {
Contributor

shouldn't there be a space between the _ and the : ?

case _ : EMLDAOptimizer

Author

A quick reformat will fix this. Thanks.

Author

Did a quick check with Reformat Code and it didn't change anything; the spacing seems correct as is.


// If the caller does not provide a seed to LDA (i.e. ldaSeed is empty), the seed is automatically set to the hash value of the class name

if (ldaSeed.nonEmpty) {
lda.setSeed(ldaSeed.get)
}

val (wordTopicMat, docTopicDist) = ldaOptimizer match {
case _: EMLDAOptimizer => {
val ldaModel = lda.run(ldaCorpus).asInstanceOf[DistributedLDAModel]
Contributor

It's awful that we have to do a case here, but I don't know of a way to avoid it. I think it's Spark's fault.

Author

Well, this is because each optimizer returns a different implementation of LDAModel. We could obtain either a DistributedLDAModel or an online (local) LDAModel and, if it's a DistributedLDAModel, convert it to the local form and then use only the local model's methods. However, my tests showed that the results changed when converting the distributed model to the local one and then getting the topic distributions.
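For context, the conversion being discussed would look roughly like this. This is a sketch, not the PR code; it assumes Spark MLlib on the classpath and the `lda`/`ldaCorpus` values from the quoted diff. `DistributedLDAModel.toLocal` and `LocalLDAModel.topicDistributions` are real MLlib APIs:

```scala
// Sketch: unify both optimizer paths on a local model. The caveat raised
// above is that topic distributions obtained this way can differ from
// DistributedLDAModel's own topicDistributions.
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LocalLDAModel}

val distModel: DistributedLDAModel =
  lda.run(ldaCorpus).asInstanceOf[DistributedLDAModel]
val localModel: LocalLDAModel = distModel.toLocal
val docTopicDist = localModel.topicDistributions(ldaCorpus)
```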

Contributor

I think this is already solved in the spark.ml library; from what I remember, both implementations return the same model, just called LDAModel, from org.apache.spark.ml.clustering.
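The spark.ml API referred to above does indeed hide the optimizer difference behind a single model type. A hedged sketch, assuming a SparkSession and a DataFrame `docs` with a "features" vector column (the variable names are illustrative):

```scala
// Sketch of the newer spark.ml clustering API: fit returns LDAModel
// regardless of whether the optimizer is "em" or "online".
import org.apache.spark.ml.clustering.{LDA, LDAModel}

val model: LDAModel = new LDA()
  .setK(20)
  .setOptimizer("online") // or "em"; the fitted type is LDAModel either way
  .fit(docs)

val topics = model.describeTopics(maxTermsPerTopic = 10)
val transformed = model.transform(docs) // adds a "topicDistribution" column
```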


// Get word topic mix: columns = topic (in no guaranteed order), rows = words (# rows = vocab size)
val wordTopicMat: Matrix = distLDAModel.topicsMatrix
case _: OnlineLDAOptimizer => {
Contributor


should there be a space between _ and : ?

Author

A quick reformat will fix this. Thanks.

Contributor

@NathanSegerlind left a comment

Minor quibbles in spacing, otherwise good.

@NathanSegerlind
Contributor

I just noticed the conflicts.

Ricardo - could you resolve the conflicts and resubmit?

@rabarona
Author

Right, I need to rebase my branch onto the current master branch.

Ricardo Barona added 9 commits June 16, 2017 17:30

… Proxy implementations.
Refactored code in SpotLDAWrapper to implement one or the other LDA optimizer.
…, alpha and beta.
Updated spot.conf to include the same parameters.
…NathanSegerlind
Fixed inline comment format, added one space after //.
Fixed conflicts after rebasing with incubator-spot/master.
Fixed minor typos in DNSSuspiciousConnectsAnalysisTest.scala.
@rabarona rabarona force-pushed the SPOT-165-Expose_optional_online_optimizer branch from a444d9e to cdf1eee Compare June 17, 2017 00:35
Contributor

@lujangus left a comment

Good work.


// If the caller does not provide a seed to LDA (i.e. ldaSeed is empty), the seed is automatically set to the hash value of the class name

if (ldaSeed.nonEmpty) {
lda.setSeed(ldaSeed.get)
}

val (wordTopicMat, docTopicDist) = ldaOptimizer match {
case _: EMLDAOptimizer => {
val ldaModel = lda.run(ldaCorpus).asInstanceOf[DistributedLDAModel]
Contributor

I think this is already solved in the spark.ml library; from what I remember, both implementations return the same model, just called LDAModel, from org.apache.spark.ml.clustering.

val logger = LogManager.getLogger("SuspiciousConnectsAnalysis")
logger.setLevel(Level.WARN)

Contributor

After all these changes are done, I still have the alpha and beta tuning pending. I still think these parameters are quite low if we consider that the default for both parameters is 1.0/K.

I also suggest for the future to call the parameters: docConcentration and topicConcentration to avoid confusion. Even the LDA committers recognize that different papers call the parameters differently.
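As a reference point for the naming suggestion, Spark's own MLlib API already uses exactly those names, and passing -1 requests Spark's built-in defaults (which for the online optimizer are based on 1.0/K). A minimal sketch, assuming Spark MLlib; the value of K is illustrative:

```scala
// Sketch: alpha and beta under Spark's preferred names,
// docConcentration and topicConcentration. A value of -1
// tells Spark to use its optimizer-specific default.
import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}

val lda = new LDA()
  .setK(20)
  .setOptimizer(new OnlineLDAOptimizer)
  .setDocConcentration(-1)    // alpha: -1 selects the default
  .setTopicConcentration(-1)  // beta:  -1 selects the default
```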


@asfgit asfgit merged commit cdf1eee into apache:master Jun 17, 2017