[SPARK-14077][ML][FOLLOW-UP] Minor refactor and cleanup for NaiveBayes#15826
yanboliang wants to merge 4 commits into apache:master from
…te extra pass on data

## What changes were proposed in this pull request?

[SPARK-14077](https://issues.apache.org/jira/browse/SPARK-14077) copied the ```NaiveBayes``` implementation from mllib to ml and left mllib as a wrapper. However, mllib and ml handle labels differently:

* mllib allows input labels such as {-1, +1}, whereas ml assumes the input labels are in the range [0, numClasses).
* mllib ```NaiveBayesModel``` exposes ```labels```, but ml does not, because of the assumption mentioned above.

During the copy in [SPARK-14077](https://issues.apache.org/jira/browse/SPARK-14077), we used ```val labels = data.map(_.label).distinct().collect().sorted``` to get the distinct labels first and then encoded the labels for training. This involves an extra Spark job compared with the original implementation. Since ```NaiveBayes``` does only a one-pass aggregation during training, adding another pass seems inefficient. We can get the labels in a single pass along with ```NaiveBayes``` training and send them to the MLlib side.

## How was this patch tested?

Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15402 from yanboliang/spark-17835.
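The single-pass idea in the commit message above can be sketched without Spark. This is a minimal illustration only, not the code from the patch: `SinglePassLabels`, `Point`, and `onePass` are hypothetical names, and a plain `Seq` with `foldLeft` stands in for the distributed aggregation. The point it shows is that the distinct labels fall out of the keys of the per-label aggregate, so no separate `distinct` traversal is needed.

```scala
object SinglePassLabels {
  // Hypothetical stand-in for a labeled point: (label, features).
  type Point = (Double, Array[Double])

  // One traversal: accumulate per-label feature sums, then read the
  // sorted distinct labels off the keys of the aggregated map.
  def onePass(data: Seq[Point]): (Array[Double], Map[Double, Array[Double]]) = {
    val sums = data.foldLeft(Map.empty[Double, Array[Double]]) {
      case (acc, (label, features)) =>
        val updated = acc.get(label) match {
          case Some(sum) => sum.zip(features).map { case (a, b) => a + b }
          case None      => features.clone()
        }
        acc.updated(label, updated)
    }
    (sums.keys.toArray.sorted, sums)
  }
}
```

With labels in {-1, +1}, `onePass` returns the labels as a sorted array alongside the sums, which is the information the mllib wrapper needs in order to expose `labels`.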
private[spark] val supportedModelTypes = Set(Multinomial, Bernoulli)
private[classification] val supportedModelTypes = Set(Multinomial, Bernoulli)

private[NaiveBayes] val requireNonnegativeValues: Vector => Unit = (v: Vector) => {
@thunterdb I still define these two functions with `val`, which should be more appropriate, since there is a difference between using `val` and `def` to define a function: http://stackoverflow.com/questions/18887264/what-is-the-difference-between-def-and-val-to-define-a-function
@yanboliang indeed, it makes a difference when defining methods inside another method. In the current case, you moved the function to the companion object as a value. You can safely turn it into a method.
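To make the `val` versus `def` point concrete, here is a small sketch under stated assumptions: `Validators` is a hypothetical object, and `Array[Double]` stands in for Spark's `Vector` so the snippet compiles on its own. Both forms can be passed where a function is expected; the method is converted to a function value (eta-expansion) at the use site.

```scala
object Validators {
  // Function value: a Function1 instance created once, when the object initializes.
  val requireNonnegativeVal: Array[Double] => Unit = (v: Array[Double]) =>
    require(v.forall(_ >= 0.0),
      s"Naive Bayes requires nonnegative feature values but found ${v.mkString("[", ",", "]")}.")

  // Method: a plain member of the object; no function object exists until a use site needs one.
  def requireNonnegativeDef(v: Array[Double]): Unit =
    require(v.forall(_ >= 0.0),
      s"Naive Bayes requires nonnegative feature values but found ${v.mkString("[", ",", "]")}.")

  // Both can be handed to a higher-order function; the method is eta-expanded automatically.
  def checkAll(rows: Seq[Array[Double]]): Unit = {
    rows.foreach(requireNonnegativeVal)
    rows.foreach(requireNonnegativeDef)
  }
}
```

Because neither form refers to an enclosing instance once it lives in a standalone or companion object, the choice here is mostly stylistic, which matches the reviewer's suggestion.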
Test build #68398 has finished for PR 15826 at commit
thunterdb left a comment
@yanboliang thanks! I have a few small comments.
s"Naive Bayes requires nonnegative feature values but found $v.")
}

private[NaiveBayes] val requireZeroOneBernoulliValues: Vector => Unit = (v: Vector) => {

if (isML) {
private[spark] def trainWithLabelCheck(
    dataset: Dataset[_],
    positiveLabel: Boolean = true): NaiveBayesModel = {
Since it is an internal method, I suggest not using a default argument and instead specifying it explicitly above.
Test build #68443 has finished for PR 15826 at commit
@thunterdb Any more comments? Thanks!

@yanboliang that looks great, thank you. LGTM.

Merged into master and branch-2.1. Thanks for reviewing.
## What changes were proposed in this pull request?

* Refactor out ```trainWithLabelCheck``` and make ```mllib.NaiveBayes``` call into it.
* Avoid capturing the outer object for ```modelType```.
* Move ```requireNonnegativeValues``` and ```requireZeroOneBernoulliValues``` to companion object.

## How was this patch tested?

Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15826 from yanboliang/spark-14077-2.

(cherry picked from commit 22cb3a0)
Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
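The "avoid capturing the outer object" item can be illustrated with a small sketch. It is not the patch itself: `Trainer`, `capturing`, `nonCapturing`, and `ClosureCapture.serializes` are hypothetical names, and Java serialization stands in for what Spark does when it ships a closure to executors. The technique shown is copying a field into a local `val` so the closure holds only that value rather than a reference to the enclosing instance.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for an estimator; deliberately NOT Serializable.
class Trainer(val modelType: String) {
  // Reads the field inside the lambda, so the lambda keeps a reference to `this`.
  def capturing: Double => Boolean =
    (x: Double) => modelType == "bernoulli" && x > 1.0

  // Copies the field into a local val first, so the lambda keeps only the String.
  def nonCapturing: Double => Boolean = {
    val localModelType = modelType
    (x: Double) => localModelType == "bernoulli" && x > 1.0
  }
}

object ClosureCapture {
  // Returns true if the object (and everything it references) can be Java-serialized.
  def serializes(f: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(f)
      true
    } catch {
      case _: NotSerializableException => false
    }
}
```

Both closures compute the same result; they differ only in what has to travel with them when serialized.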