-
Notifications
You must be signed in to change notification settings - Fork 224
Spot-165 Expose optional online optimizer #54
Spot-165 Expose optional online optimizer #54
Conversation
+1 |
1 similar comment
+1 |
val ldaCorpus: RDD[(Long, Vector)] = | ||
formatSparkLDAInput(docWordCountCache, | ||
documentDictionary, | ||
wordDictionary, | ||
sqlContext) | ||
|
||
val corpusSize = ldaCorpus.count() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is only used in the online case, might as well put it by line 90
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will change this, thanks.
|
||
// If caller does not provide seed to lda, ie. ldaSeed is empty, lda is seeded automatically set to hash value of class name | ||
|
||
if (ldaSeed.nonEmpty) { | ||
lda.setSeed(ldaSeed.get) | ||
} | ||
|
||
val (wordTopicMat, docTopicDist) = ldaOptimizer match { | ||
case _: EMLDAOptimizer => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't there be a space between the _ and the : ?
case _ : EMLDAOptimizer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A quick reformat will fix this. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did a quick check with Reformat Code and it didn't change, seems correct.
|
||
// If caller does not provide seed to lda, ie. ldaSeed is empty, lda is seeded automatically set to hash value of class name | ||
|
||
if (ldaSeed.nonEmpty) { | ||
lda.setSeed(ldaSeed.get) | ||
} | ||
|
||
val (wordTopicMat, docTopicDist) = ldaOptimizer match { | ||
case _: EMLDAOptimizer => { | ||
val ldaModel = lda.run(ldaCorpus).asInstanceOf[DistributedLDAModel] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's awful we have to do a case, but I don't know of a way to avoid it
I think it's Spark's fault
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, this is because each optimizer will return a different implementation of LDAModel. We could just get either DistributedLDAModel or OnlineLDAModel and then from there if it's DistributedLDAModel convert to Online and if it's Online leave it as it is and then use only OnlineLDAOptimizer methods, however, my tests showed results changed when you convert Distributed model to Online and then get topic distribution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is already solved in the spark.ml library, for what I remember both implementations return the same model called just: LDAModel from the org.apache.spark.ml.clustering.
|
||
//Get word topic mix: columns = topic (in no guaranteed order), rows = words (# rows = vocab size) | ||
val wordTopicMat: Matrix = distLDAModel.topicsMatrix | ||
case _: OnlineLDAOptimizer => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should there be a space between _ and : ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A quick reformat will fix this. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor quibbles in spacing, ow good
I just noticed the conflicts. Ricardo - could you resolve the conflicts and resubmit? |
Right, I need to rebase my branch with the current master branch. |
… Proxy implementations. Refactored code in SpotLDAWrapper to implement one or the other LDA optimizers.
…, alpha and beta. Updated spot.conf to include the same parameters.
…NathanSegerlind Fixed inline comments format, added one space after //.
Fixed conflicts after rebasing with incubator-spot/master Fixed minor typos in DNSSuspiciousConnectsAnalysisTest.scala
a444d9e
to
cdf1eee
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good work.
|
||
// If caller does not provide seed to lda, ie. ldaSeed is empty, lda is seeded automatically set to hash value of class name | ||
|
||
if (ldaSeed.nonEmpty) { | ||
lda.setSeed(ldaSeed.get) | ||
} | ||
|
||
val (wordTopicMat, docTopicDist) = ldaOptimizer match { | ||
case _: EMLDAOptimizer => { | ||
val ldaModel = lda.run(ldaCorpus).asInstanceOf[DistributedLDAModel] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is already solved in the spark.ml library, for what I remember both implementations return the same model called just: LDAModel from the org.apache.spark.ml.clustering.
val logger = LogManager.getLogger("SuspiciousConnectsAnalysis") | ||
logger.setLevel(Level.WARN) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After all these changes are done I still have pending the alpha and beta tuning. I still think these paramaters are quite low if we consider that the default for both parameters is 1.0/K.
I also suggest for the future to call the parameters: docConcentration and topicConcentration to avoid confusion. Even the LDA committers recognize that different papers call the parameters differently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Jira issue created
https://issues.apache.org/jira/browse/SPOT-177
This PR implements changes requested in Jira Spot-165. Adds the option to select the algorithm implementation for LDA, online or EM.
Main changes