This repository has been archived by the owner on Apr 21, 2023. It is now read-only.

Spot-165 Expose optional online optimizer #54

Merged

Conversation

rabarona

This PR implements the changes requested in Jira ticket SPOT-165: it adds the option to select the LDA algorithm implementation, online or EM.

Main changes

  • Added a new parameter for running spot-ml. This parameter selects the LDA implementation: the EM optimizer or the Online optimizer.
  • Modified spot-setup/spot.conf: added LDA_OPTIMIZER, LDA_ALPHA and LDA_BETA to select the implementation and configure the LDA optimizer parameters.
  • Modified ml_ops.sh to accept the new parameters mentioned above.
  • Modified SpotLDAWrapper to receive the optimizer parameter and run LDA with the selected optimizer.
  • Updated the unit tests to run with both the EM and the Online optimizer.
  • Updated ML_OPS.md with information about the new parameters.
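The selection described above can be sketched as follows. This is an illustrative outline, not the exact PR code; the helper name `buildLDA` and its parameters are hypothetical, while the Spark MLlib classes are real:

```scala
// Hypothetical sketch of selecting the LDA optimizer from a configuration
// value (e.g. LDA_OPTIMIZER), using Spark MLlib's real optimizer classes.
import org.apache.spark.mllib.clustering.{EMLDAOptimizer, LDA, LDAOptimizer, OnlineLDAOptimizer}

def buildLDA(optimizerName: String, topicCount: Int, alpha: Double, beta: Double): LDA = {
  val optimizer: LDAOptimizer = optimizerName.toLowerCase match {
    case "em"     => new EMLDAOptimizer
    case "online" => new OnlineLDAOptimizer
    case other    => throw new IllegalArgumentException(s"Unsupported LDA optimizer: $other")
  }
  new LDA()
    .setK(topicCount)
    .setDocConcentration(alpha)   // LDA_ALPHA
    .setTopicConcentration(beta)  // LDA_BETA
    .setOptimizer(optimizer)
}
```

Keeping the optimizer choice in one place like this means the rest of the pipeline only deals with a configured `LDA` instance.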

@lujacab

lujacab commented Jun 15, 2017

+1

@raypanduro
Contributor

+1

val ldaCorpus: RDD[(Long, Vector)] =
  formatSparkLDAInput(docWordCountCache,
    documentDictionary,
    wordDictionary,
    sqlContext)

val corpusSize = ldaCorpus.count()
Contributor

This is only used in the online case; might as well put it by line 90.

Author

I will change this, thanks.


// If the caller does not provide a seed to LDA (i.e. ldaSeed is empty), the seed is automatically set to the hash value of the class name

if (ldaSeed.nonEmpty) {
lda.setSeed(ldaSeed.get)
}

val (wordTopicMat, docTopicDist) = ldaOptimizer match {
case _: EMLDAOptimizer => {
Contributor

shouldn't there be a space between the _ and the : ?

case _ : EMLDAOptimizer

Author

A quick reformat will fix this. Thanks.

Author

Did a quick check with Reformat Code and it didn't change anything; the spacing seems correct as is.


// If the caller does not provide a seed to LDA (i.e. ldaSeed is empty), the seed is automatically set to the hash value of the class name

if (ldaSeed.nonEmpty) {
lda.setSeed(ldaSeed.get)
}

val (wordTopicMat, docTopicDist) = ldaOptimizer match {
case _: EMLDAOptimizer => {
val ldaModel = lda.run(ldaCorpus).asInstanceOf[DistributedLDAModel]
Contributor

It's awful that we have to do a case here, but I don't know of a way to avoid it. I think it's Spark's fault.

Author

Well, this is because each optimizer returns a different implementation of LDAModel. We could obtain either a DistributedLDAModel or an online (local) LDAModel and, if it's a DistributedLDAModel, convert it to the local form and then use only the local model's methods. However, my tests showed that the results changed when converting the distributed model to the local one and then getting the topic distributions.
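For context, the conversion being discussed would look roughly like this. This is a sketch, not the PR code; it assumes Spark MLlib on the classpath and the `lda`/`ldaCorpus` values from the quoted diff. `DistributedLDAModel.toLocal` and `LocalLDAModel.topicDistributions` are real MLlib APIs:

```scala
// Sketch: unify both optimizer paths on a local model. The caveat raised
// above is that topic distributions obtained this way can differ from
// DistributedLDAModel's own topicDistributions.
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LocalLDAModel}

val distModel: DistributedLDAModel =
  lda.run(ldaCorpus).asInstanceOf[DistributedLDAModel]
val localModel: LocalLDAModel = distModel.toLocal
val docTopicDist = localModel.topicDistributions(ldaCorpus)
```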

Contributor

I think this is already solved in the spark.ml library; from what I remember, both implementations return the same model, just called LDAModel, from org.apache.spark.ml.clustering.
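The spark.ml API referred to above does indeed hide the optimizer difference behind a single model type. A hedged sketch, assuming a SparkSession and a DataFrame `docs` with a "features" vector column (the variable names are illustrative):

```scala
// Sketch of the newer spark.ml clustering API: fit returns LDAModel
// regardless of whether the optimizer is "em" or "online".
import org.apache.spark.ml.clustering.{LDA, LDAModel}

val model: LDAModel = new LDA()
  .setK(20)
  .setOptimizer("online") // or "em"; the fitted type is LDAModel either way
  .fit(docs)

val topics = model.describeTopics(maxTermsPerTopic = 10)
val transformed = model.transform(docs) // adds a "topicDistribution" column
```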


// Get word topic mix: columns = topic (in no guaranteed order), rows = words (# rows = vocab size)
val wordTopicMat: Matrix = distLDAModel.topicsMatrix
case _: OnlineLDAOptimizer => {
Contributor


should there be a space between _ and : ?

Author

A quick reformat will fix this. Thanks.

Contributor

@NathanSegerlind left a comment

Minor quibbles in spacing, otherwise good.

@NathanSegerlind
Contributor

I just noticed the conflicts.

Ricardo - could you resolve the conflicts and resubmit?

@rabarona
Author

Right, I need to rebase my branch onto the current master branch.

Ricardo Barona added 9 commits June 16, 2017 17:30

… Proxy implementations.
Refactored code in SpotLDAWrapper to implement one or the other LDA optimizer.
…, alpha and beta.
Updated spot.conf to include the same parameters.
…NathanSegerlind
Fixed inline comment format, added one space after //.
Fixed conflicts after rebasing with incubator-spot/master.
Fixed minor typos in DNSSuspiciousConnectsAnalysisTest.scala.
@rabarona rabarona force-pushed the SPOT-165-Expose_optional_online_optimizer branch from a444d9e to cdf1eee Compare June 17, 2017 00:35
Contributor

@lujangus left a comment

Good work.


// If the caller does not provide a seed to LDA (i.e. ldaSeed is empty), the seed is automatically set to the hash value of the class name

if (ldaSeed.nonEmpty) {
lda.setSeed(ldaSeed.get)
}

val (wordTopicMat, docTopicDist) = ldaOptimizer match {
case _: EMLDAOptimizer => {
val ldaModel = lda.run(ldaCorpus).asInstanceOf[DistributedLDAModel]
Contributor

I think this is already solved in the spark.ml library; from what I remember, both implementations return the same model, just called LDAModel, from org.apache.spark.ml.clustering.

val logger = LogManager.getLogger("SuspiciousConnectsAnalysis")
logger.setLevel(Level.WARN)

Contributor

After all these changes are done, I still have the alpha and beta tuning pending. I still think these parameters are quite low if we consider that the default for both parameters is 1.0/K.

I also suggest for the future to call the parameters: docConcentration and topicConcentration to avoid confusion. Even the LDA committers recognize that different papers call the parameters differently.
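As a reference point for the naming suggestion, Spark's own MLlib API already uses exactly those names, and passing -1 requests Spark's built-in defaults (which for the online optimizer are based on 1.0/K). A minimal sketch, assuming Spark MLlib; the value of K is illustrative:

```scala
// Sketch: alpha and beta under Spark's preferred names,
// docConcentration and topicConcentration. A value of -1
// tells Spark to use its optimizer-specific default.
import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}

val lda = new LDA()
  .setK(20)
  .setOptimizer(new OnlineLDAOptimizer)
  .setDocConcentration(-1)    // alpha: -1 selects the default
  .setTopicConcentration(-1)  // beta:  -1 selects the default
```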


@asfgit asfgit merged commit cdf1eee into apache:master Jun 17, 2017