
[SPARK-10017] [MLlib]: ML model broadcasts should be stored in private vars: mllib NaiveBayes #8241

Closed
wants to merge 7 commits

Conversation

sabhyankar

ML model broadcasts should be stored in private vars: mllib NaiveBayes

```diff
 testData.mapPartitions { iter =>
-  val model = bcModel.value
+  val model = bcModel.get.value
```
Contributor

I believe we will still want to make a local reference to bcModel here so we don't end up shipping the entire object across the wire.
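A minimal sketch of the pattern being suggested here. The names (`ExampleModel`, `localBc`) are illustrative, not the actual Spark source; the point is that reading a field such as `bcModel` inside a closure captures `this`, while a local `val` captures only the small broadcast handle:

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// Hypothetical model class, standing in for NaiveBayesModel.
class ExampleModel(val weights: Array[Double]) extends Serializable {
  @transient private var bcModel: Option[Broadcast[ExampleModel]] = None

  def predict(testData: RDD[Array[Double]]): RDD[Double] = {
    if (bcModel.isEmpty) {
      bcModel = Some(testData.sparkContext.broadcast(this))
    }
    // Copy the broadcast handle into a local val first. Referring to the
    // field `bcModel` inside the closure below would capture `this`,
    // shipping the entire model with every task instead of the handle.
    val localBc = bcModel.get
    testData.mapPartitions { iter =>
      val model = localBc.value
      iter.map { features =>
        model.weights.zip(features).map { case (w, x) => w * x }.sum
      }
    }
  }
}
```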

@sabhyankar
Author

Thanks for pointing that out @holdenk ! I have pushed a change to the PR!

@holdenk
Contributor

holdenk commented Aug 18, 2015

Awesome, it also seemed to be in many of your related PRs, you might want to update those as well.

@sabhyankar
Author

@holdenk yep :) I am updating those as well!

@holdenk
Contributor

holdenk commented Aug 18, 2015

Great :)

@sabhyankar
Author

@holdenk Not sure if you are reviewing the other PRs, but the fix should now be in all of them. Thx!

@holdenk
Contributor

holdenk commented Aug 18, 2015

I can take a look at some of the others too. I'm not a committer, though, so this will still need a follow-up review, but I can take a first pass :)

```diff
@@ -19,6 +19,8 @@ package org.apache.spark.mllib.classification

 import java.lang.{Iterable => JIterable}

+import org.apache.spark.broadcast.Broadcast
```
Contributor

Import ordering: this should probably be moved down to sit with the other org.apache.spark imports.

@sabhyankar
Author

Hi @holdenk I have moved the common logic to a trait and am mixing it in with the model. Let me know if you see something that needs to be corrected. I am still getting familiar with Spark and its coding standards, so I may have missed something!

@holdenk
Contributor

holdenk commented Aug 19, 2015

Awesome, great work :) Looking at the trait, it seems like you could maybe even move the var inside the trait. Also, passing the RDD around just to get the context out of it seems a little funky; I'd probably pass the context instead.

@holdenk
Contributor

holdenk commented Aug 19, 2015

I know my own PRs get triggered by Jenkins; let's see if I can tell Jenkins this is good to test, or if we will need to get someone else.

@sabhyankar
Author

Thanks @holdenk ! That makes sense. I moved the private var to the trait. Let me know if you see something else that is out of place. I believe the generic types I am using are ok. However, let me know if there is a problem. Thanks again for these suggestions :)

@sabhyankar
Author

I forgot to mention, I am also only passing in the SparkContext now. I should have done that to begin with :(

@holdenk
Contributor

holdenk commented Aug 20, 2015

Glad to help. Maybe @jkbradley can say whether this is OK to test on Jenkins, as he was the author of the related JIRA :)

@jkbradley
Member

I'll make some comments, but can you please resolve conflicts here?

@jkbradley
Member

ok to test

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.SparkContext

import scala.reflect.ClassTag
```
Member

Organize imports (please follow the style guide: https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide)

@jkbradley
Member

I like this approach. I'll check again after the conflicts have been resolved. Thanks!


```scala
import scala.reflect.ClassTag

private[mllib] trait Broadcastable[T] {
```
Contributor

Since this is for broadcasting models, it shouldn't be mixed in with anything else. Also, the inner type of bcModel should always be the same as the class mixing in Broadcastable.

Thus, how about we rename Broadcastable -> BroadcastableModel, upper-bound the type T, and constrain which types can mix this trait in with a self-type annotation (i.e. altogether `trait BroadcastableModel[T <: Model] { self: T => }`)?
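A self-contained sketch of the shape being proposed. `Model` here is a local stand-in class, not Spark's, and `NB` is a made-up subclass; the point is only how the self-type restricts which classes may mix the trait in:

```scala
// Stand-in for an abstract model base class.
abstract class Model

trait BroadcastableModel[T <: Model] { self: T =>
  // The self-type guarantees the mixing class *is* a T, so a cached
  // Broadcast[T] would always hold the model's own concrete type.
}

// Compiles: NB is a Model and is its own T.
class NB extends Model with BroadcastableModel[NB]

// Would not compile: Wrong is not an NB, so it cannot claim
// BroadcastableModel[NB].
// class Wrong extends Model with BroadcastableModel[NB]
```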

@sabhyankar
Author

@feynmanliang @jkbradley thank you for the review, guys. Traits do not allow a context bound (to ClassTag) on their type parameters, so I ended up using implicits in the method. I am going to try an upper bound with a self-type annotation; however, I suspect we will still need an implicit in the method (I'll see if the compiler complains).

As for the other changes, I'll fix the imports/newline issues and resolve the merge conflicts. I will also close #8248 #8247 #8243 #8241 and #8249 and merge all of those into a single PR. However, since I am new to Spark, can you tell me which JIRA I should open that PR against? Perhaps the umbrella SPARK-10014?

@feynmanliang
Contributor

@sabhyankar why is the ClassTag necessary? What's the error you're getting?

You can tag multiple JIRA issues in a single PR, see #7507 for example

@sabhyankar
Author

@feynmanliang without the ClassTag, the compiler complains that no ClassTag is available for our type T:

```
No ClassTag available for T
[error] case None => bcModel = Some(sc.broadcast(modelToBc))
```

Another, unrelated question: there is no common trait for the models under mllib, so what would I upper-bound the type T with in the trait?
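A self-contained sketch of the workaround discussed above. Scala does not allow a context bound on a trait's type parameter (`trait Broadcastable[T: ClassTag]` is rejected), but a method inside the trait can demand the ClassTag implicitly at the call site. Here an `Array[T]` stands in for `Broadcast[T]` so the example runs without Spark, since `Array` construction also needs a ClassTag:

```scala
import scala.reflect.ClassTag

trait Broadcastable[T] {
  // Stand-in for Option[Broadcast[T]] in the real trait.
  protected var cached: Option[Array[T]] = None

  // The implicit ClassTag is requested per-method, since the trait's
  // type parameter cannot carry a context bound.
  protected def getOrCache(value: T)(implicit ct: ClassTag[T]): Array[T] = {
    cached match {
      case Some(arr) => arr
      case None =>
        val arr = Array(value) // Array construction requires ClassTag[T]
        cached = Some(arr)
        arr
    }
  }
}
```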

@jkbradley
Member

@sabhyankar Actually, I'd recommend just keeping 1 PR per JIRA. I think that's easier for avoiding conflicts, even if the changes are all pretty similar.

@jkbradley
Member

I think the ClassTag is needed because Broadcast needs the ClassTag.

@feynmanliang This trait is for spark.mllib too, not just spark.ml. (Model is only a concept in spark.ml)

@sabhyankar
Author

@jkbradley I have resolved the conflicts and updated the PR with the changes identified. Let me know if you see any issues. Since I am not combining PRs, this will have to go in before I update #8248 #8247 #8243 #8241 and #8249

@feynmanliang
Contributor

OK, I understand why the ClassTag is necessary, thanks.

@jkbradley @sabhyankar I have some minor concerns:

  1. I'm a little uncomfortable that we don't restrict the types this trait can be mixed in to; this might lead to people using it in places we don't expect. This isn't a problem for ML models, but since there's no upper bound for mllib models in general, we have to decide what's better:

    a. duplicate the broadcast code in each mllib model to prevent unintended use of this trait
    b. leave it to each developer's best judgement

    Personally I'm indifferent, leaning slightly towards (a).

  2. We should make the filename consistent with the attribute name

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.json4s.JsonDSL._
```
Contributor

Why do we need these imports now?

Author

@feynmanliang These json* imports were always there. They were simply reordered by the import order plugin I used for IntelliJ. The only additional import is for the Broadcastable trait.

Contributor

Oh, those need to go back to where they were before, see style guide

Author

Cool, I'll take care of that.

asfgit pushed a commit that referenced this pull request Aug 26, 2015
…ndom and mllib.stat

The same as #8241 but for `mllib.stat` and `mllib.random`.

cc feynmanliang

Author: Xiangrui Meng <meng@databricks.com>

Closes #8439 from mengxr/SPARK-10242.

(cherry picked from commit c3a5484)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
@sabhyankar
Author

@jkbradley @feynmanliang I can certainly update the PR and change the filename and trait name to be the same (Broadcastable).

I understand the concern about the Broadcastable trait not being restricted. However, since we don't have a common trait for the ml and mllib models, I am not sure there is any option other than duplicating the code in each of the models. Let me know what you advise. For now, the PR is updated with the earlier recommendations and the merge conflicts have been resolved.

@feynmanliang
Contributor

@jkbradley and I talked offline. Summary: let's make it private[mllib] and clearly document in the trait's scaladoc that it exists to prevent rebroadcasting during prediction.
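One hypothetical shape for what is being described, not the exact code from the PR (the visibility and method name `getOrBroadcast` are assumptions): the narrowed visibility plus a scaladoc stating the trait's single purpose.

```scala
import scala.reflect.ClassTag

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

/**
 * Caches a broadcast copy of a model so that repeated RDD `predict`
 * calls do not rebroadcast the model each time. This trait exists only
 * to prevent rebroadcasting during prediction; it is not intended as a
 * general-purpose mix-in.
 */
private[mllib] trait Broadcastable[T] {
  @transient protected var bcModel: Option[Broadcast[T]] = None

  protected def getOrBroadcast(sc: SparkContext, model: T)
                              (implicit ct: ClassTag[T]): Broadcast[T] = {
    // Broadcast exactly once; reuse the cached handle afterwards.
    if (bcModel.isEmpty) bcModel = Some(sc.broadcast(model))
    bcModel.get
  }
}
```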

@sabhyankar
Author

@feynmanliang - That sounds good - I can document the intent of the Broadcastable trait in the scaladoc. The reason I made this trait private[spark] is that we want to use it in both mllib and ml models (e.g. #8243). If I make it private[mllib], we won't be able to reuse it in the ml models without duplicating it... Any suggestions on how to work around this?

@jkbradley
Member

I'm OK with it being private to spark (not mllib) as long as the docs warn about usage.

@sabhyankar
Author

Sounds good @jkbradley . I don't think there is anything pending on my side for this PR. Let me know if you see something otherwise. I will update the other PRs once this is in.

@feynmanliang
Contributor

LGTM pending tests

@jkbradley @mengxr can you trigger tests?

@mengxr
Contributor

mengxr commented Aug 27, 2015

ok to test

@SparkQA

SparkQA commented Aug 27, 2015

Test build #41697 has finished for PR 8241 at commit 965aaec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LogisticRegressionModel @Since("1.3.0") (
    • class SVMModel @Since("1.1.0") (
    • class BinaryClassificationMetrics @Since("1.3.0") (
    • class MulticlassMetrics @Since("1.1.0") (predictionAndLabels: RDD[(Double, Double)])
    • class MultilabelMetrics @Since("1.2.0") (predictionAndLabels: RDD[(Array[Double], Array[Double])])
    • class RegressionMetrics @Since("1.2.0") (
    • class FPGrowthModel[Item: ClassTag] @Since("1.3.0") (
    • class FreqItemset[Item] @Since("1.3.0") (
    • class FreqSequence[Item] @Since("1.5.0") (
    • class PrefixSpanModel[Item] @Since("1.5.0") (
    • class DenseMatrix @Since("1.3.0") (
    • class SparseMatrix @Since("1.3.0") (
    • class DenseVector @Since("1.0.0") (
    • class SparseVector @Since("1.0.0") (
    • class BlockMatrix @Since("1.3.0") (
    • class CoordinateMatrix @Since("1.0.0") (
    • class IndexedRowMatrix @Since("1.0.0") (
    • class RowMatrix @Since("1.0.0") (
    • abstract class SetOperation(left: LogicalPlan, right: LogicalPlan) extends BinaryNode
    • case class Union(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right)
    • case class Intersect(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right)
    • case class Except(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right)

@mengxr
Contributor

mengxr commented Aug 28, 2015

Couple issues. Some were already mentioned by @feynmanliang :

  1. Broadcasting the entire model. We might need only some pieces of it; for linear models, we need only the weights, not the summary statistics.
  2. Using a trait. What if we have two objects to broadcast? For example, if we have a local matrix factorization model, are we going to make them a tuple and then broadcast it?
  3. Models are mutable, and we didn't put a contract in place saying users shouldn't modify them. After this PR, if users modify the model after an RDD prediction, local prediction will see the changes but RDD prediction will not. Have we discussed this yet?
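Issue 3 can be illustrated with a pure-Scala sketch (no Spark; the class names are made up, and a plain cached copy stands in for a `Broadcast`): once the one-time snapshot is taken, mutating the model makes local and "distributed" prediction disagree.

```scala
// A mutable model whose "broadcast" is a snapshot taken on first use.
class MutableModel(var threshold: Double) {
  // Stand-in for Option[Broadcast[MutableModel]].
  private var cachedCopy: Option[MutableModel] = None

  def predictLocal(x: Double): Boolean = x > threshold

  def predictDistributed(xs: Seq[Double]): Seq[Boolean] = {
    // Snapshot once; later calls reuse the stale copy, as executors
    // holding a broadcast would.
    if (cachedCopy.isEmpty) cachedCopy = Some(new MutableModel(threshold))
    val m = cachedCopy.get
    xs.map(x => x > m.threshold)
  }
}

object Demo extends App {
  val model = new MutableModel(0.5)
  model.predictDistributed(Seq(0.6)) // snapshots threshold = 0.5
  model.threshold = 0.9              // user mutates the model afterwards
  println(model.predictLocal(0.6))            // false: local sees 0.9
  println(model.predictDistributed(Seq(0.6))) // List(true): stale 0.5
}
```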

@sabhyankar
Author

Hi @mengxr The current approach of rebroadcasting on every predict also broadcasts the entire model, so I suppose the first issue you identified applies to the current approach as well. The trait currently doesn't restrict where it can be mixed in (beyond being private to spark) or the type of object it can broadcast, so we could broadcast multiple objects as a tuple if needed. I agree that, since models are mutable, avoiding a rebroadcast in this PR means changes to the model won't be seen by RDD prediction. However, given the JIRA requirement, I assumed that this was OK; let me know if that is not the case.

@mengxr
Contributor

mengxr commented Aug 28, 2015

@sabhyankar Yes, the first issue also applies to the current approach, but I was expecting to solve it with this effort (broadcast less, and only once). It is awkward to declare a class with `Broadcastable[(Double, Array[Double], Vector)]`; it doesn't provide much information but makes the code a little harder to read.

Issue 3 is actually the most important one because it changes the behavior. A safer approach would be making a new class for NaiveBayesModel with immutable (or by-contract immutable) data, which only needs broadcast once. This definitely requires more discussion before we make the change.

@jkbradley
Member

@sabhyankar Thanks for discussing these issues. Given the various arguments, I'd vote for:

  • eliminating the trait --- it's a nice idea, but it seems to save too little code to be worthwhile, since there is no easy one-trait-fits-all solution.
  • discussing mutability of models on the JIRA before proceeding [https://issues.apache.org/jira/browse/SPARK-10014]

@sabhyankar
Author

Closing per the discussion on the parent JIRA.

@sabhyankar sabhyankar closed this Sep 11, 2015