
[SPARK-28159][ML] Make the transform natively in ml framework to avoid extra conversion #24963

Closed
wants to merge 10 commits into from

Conversation

@zhengruifeng (Contributor) commented Jun 25, 2019

What changes were proposed in this pull request?

Make the transforms native in the ml framework to avoid extra conversion.
There are many TODOs in the current ml module, like // TODO: Make the transformer natively in ml framework to avoid extra conversion. in ChiSqSelector.
This PR makes the ml algorithms no longer need to convert an ml-vector to an mllib-vector in their transforms.
Including: LDA/ChiSqSelector/ElementwiseProduct/HashingTF/IDF/Normalizer/PCA/StandardScaler.
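For context, the per-row conversion this PR removes can be sketched roughly as follows. MlVector, MllibVector, ConversionSketch, and the scaling logic are simplified stand-ins, not Spark's actual classes; the sketch only illustrates the extra per-row allocation that the old pattern paid inside a transform:

```scala
// Simplified stand-ins for org.apache.spark.ml.linalg.Vector and
// org.apache.spark.mllib.linalg.Vector (illustrative only).
final case class MlVector(values: Array[Double])
final case class MllibVector(values: Array[Double])

object ConversionSketch {
  // Old pattern: convert the .ml vector to an .mllib vector on every row,
  // just so the .mllib implementation can do the actual math.
  def oldTransform(v: MlVector, scale: Array[Double]): MlVector = {
    val converted = MllibVector(v.values.clone()) // extra per-row allocation
    MlVector(converted.values.zip(scale).map { case (a, b) => a * b })
  }

  // New pattern: operate on the .ml vector's values directly.
  def newTransform(v: MlVector, scale: Array[Double]): MlVector =
    MlVector(v.values.zip(scale).map { case (a, b) => a * b })
}
```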

How was this patch tested?

Existing test suites.

@zhengruifeng (Contributor Author) commented Jun 25, 2019

KMeans and BisectingKMeans are left alone, since many classes would need to be created on the ml side.

@SparkQA commented Jun 25, 2019

Test build #106872 has finished for PR 24963 at commit e9e9c65.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@SparkQA commented Jun 25, 2019

Test build #106879 has finished for PR 24963 at commit f34112c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@srowen (Member) commented Jun 25, 2019

Before I review, can you update the JIRA and PR with detail about what you're trying to do here? There's no real info.

@zhengruifeng zhengruifeng force-pushed the zhengruifeng:to_ml_vector branch Jun 26, 2019
@SparkQA commented Jun 26, 2019

Test build #106912 has finished for PR 24963 at commit 51ee85c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
A Member left a comment

So, this copies the implementation of a lot of these algorithms? Hm, that seems bad from a maintenance standpoint. Is this just to avoid the conversion of vector classes? I wonder if there are easier answers. For example, many .mllib implementations probably just use the vector as an array of doubles immediately. If so, could they expose that directly so that the .ml implementation can call the logic more directly? Or is the vector conversion overhead really so significant? It should mostly be re-wrapping the same values and indices.

@dongjoon-hyun dongjoon-hyun added the ML label Jun 26, 2019
@zhengruifeng (Contributor Author) commented Jun 27, 2019

@srowen Your approach is more reasonable; it is better to maintain only one implementation. I will try to add a method that takes an array of doubles as input, on the .mllib side.

@zhengruifeng (Contributor Author) commented Jun 27, 2019

I mean that, on the .mllib side, we can directly return a UDF of type .ml.Vector => Double for the call on the .ml side.

@zhengruifeng zhengruifeng force-pushed the zhengruifeng:to_ml_vector branch to 92d555c Jun 27, 2019
@SparkQA commented Jun 27, 2019

Test build #106962 has finished for PR 24963 at commit 92d555c.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.
@SparkQA commented Jun 28, 2019

Test build #106997 has finished for PR 24963 at commit 10ba449.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
}
}

private[spark] def compress(features: NewVector): NewVector = {

@srowen (Member) Jun 28, 2019

These seem like general methods, not specific to chi-squared. Do we not already do some of this work in the Vector constructors or an existing utility method?

@srowen (Member) Jun 28, 2019

Likewise here, I don't think we want to handle .ml vectors in .mllib. I think the idea is to make this .mllib method more generic, perhaps just operating on indices and values?

@@ -28,6 +28,7 @@ import org.apache.spark.SparkContext
import org.apache.spark.annotation.Since
import org.apache.spark.api.java.{JavaPairRDD, JavaRDD}
import org.apache.spark.graphx.{Edge, EdgeContext, Graph, VertexId}
import org.apache.spark.ml.linalg.{Vector => NewVector, Vectors => NewVectors}

@srowen (Member) Jun 28, 2019

Ah OK, I think we don't want to import .ml vectors in .mllib here. But the method below is only used in .ml now. Just move it to .ml.clustering.LDAModel with your changes?

k: Int,
seed: Long): (BDV[Double], BDM[Double], List[Int]) = {
val (ids: List[Int], cts: Array[Double]) = termCounts match {
case v: NewDenseVector => ((0 until v.size).toList, v.values)

@srowen (Member) Jun 28, 2019

I think we want to avoid materializing this list of indices. In the dense case it's redundant. If not passed, assume the dense case?

@SparkQA commented Jul 1, 2019

Test build #107058 has finished for PR 24963 at commit bd813db.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@srowen (Member) approved these changes Jul 1, 2019 and left a comment

OK yes I think that's the right way to make this change.

@zhengruifeng (Contributor Author) commented Jul 2, 2019

@srowen PCA is an exception, since it uses matrix multiplication.

@zhengruifeng (Contributor Author) commented Jul 2, 2019

@srowen In this PR, I found that it would be more convenient (e.g. for IDF/ElementwiseProduct/StandardScaler) if there were methods in linalg like:

def mapActive(f: (Int, Double) => Double): Vector
Returns a new vector whose values are computed from the original vector by the function f, where f is applied only to active elements.

def updateActive(f: (Int, Double) => Double): Unit
Like mapActive, but updates the values in place.

What do you think about this?
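A minimal sketch of what the proposed mapActive could look like. DenseVec and SparseVec are simplified stand-ins, not Spark's real DenseVector/SparseVector, and mapActive is only a proposal in this thread, not an existing API:

```scala
// Sketch of the proposed mapActive on simplified stand-in vector classes.
sealed trait Vec {
  // Returns a new vector whose active (explicitly stored) elements are
  // replaced by f(index, value); implicit zeros of a sparse vector stay zero.
  def mapActive(f: (Int, Double) => Double): Vec
}

final case class DenseVec(values: Array[Double]) extends Vec {
  // Every element of a dense vector is active.
  def mapActive(f: (Int, Double) => Double): Vec =
    DenseVec(Array.tabulate(values.length)(i => f(i, values(i))))
}

final case class SparseVec(size: Int, indices: Array[Int], values: Array[Double]) extends Vec {
  // Only the stored (index, value) pairs are active.
  def mapActive(f: (Int, Double) => Double): Vec =
    SparseVec(size, indices, Array.tabulate(values.length)(k => f(indices(k), values(k))))
}
```

updateActive would be the same traversal, but mutating the values array in place instead of allocating a new one.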

@SparkQA commented Jul 2, 2019

Test build #107092 has finished for PR 24963 at commit 5730ab7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@SparkQA commented Jul 2, 2019

Test build #107094 has finished for PR 24963 at commit d54d073.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
A Member left a comment

Looking good, one more question.

So, am I right that generally you have:

  • Broken out parts of the implementation in .mllib to expose indices/values methods
  • Call those methods from .mllib and .ml implementations directly to avoid vector conversion?
k: Int,
seed: Long): (BDV[Double], BDM[Double], List[Int]) = {
val (ids: List[Int], cts: Array[Double]) = termCounts match {
case v: DenseVector => ((0 until v.size).toList, v.values)

@srowen (Member) Jul 2, 2019

Here and elsewhere, as an optimization, can we avoid (0 until v.size).toList? Pass an empty list in this case or something, and then deduce that the indices are just the same length as the values?

You're generally solving this with separate sparse/dense methods which could be fine too if it doesn't result in too much code duplication and improves performance in the dense case.
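The separate sparse/dense approach being discussed might be sketched like this. The names (TransformImpl, transformDense, transformSparse) and the element-wise scaling are illustrative assumptions, not Spark's actual API: the dense path needs no index array at all, while the sparse path touches only the stored entries.

```scala
// Hypothetical low-level methods that both the .ml and .mllib wrappers
// could call directly, avoiding vector-class conversion entirely.
object TransformImpl {
  // Dense case: indices are implicitly 0 until values.length,
  // so no index list ever needs to be materialized.
  def transformDense(values: Array[Double], factor: Array[Double]): Array[Double] =
    Array.tabulate(values.length)(i => values(i) * factor(i))

  // Sparse case: only the stored (index, value) pairs are touched.
  def transformSparse(indices: Array[Int], values: Array[Double],
                      factor: Array[Double]): Array[Double] =
    Array.tabulate(values.length)(k => values(k) * factor(indices(k)))
}
```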

@srowen (Member) Jul 3, 2019

Looks good then except we might be able to make one more optimization here?

@zhengruifeng (Author, Contributor) Jul 3, 2019

I just looked into the usage of the indices ids, and found that they are used as slicing indices, as in val expElogbetad = expElogbeta(indices, ::).toDenseMatrix.
I will give it a try.

@zhengruifeng (Author, Contributor) Jul 3, 2019

I am afraid that an empty list may not help to simplify the implementation,
since in places like private[clustering] def submitMiniBatch(batch: RDD[(Long, Vector)]): OnlineLDAOptimizer, we still have to create a List for slicing.

@zhengruifeng (Contributor Author) commented Jul 3, 2019

@srowen Yes, I broke out the implementations in .mllib to expose methods for dense and sparse vectors (except PCA), and call them from .ml to avoid conversion.

@SparkQA commented Jul 3, 2019

Test build #107148 has finished for PR 24963 at commit f1314fb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@SparkQA commented Jul 3, 2019

Test build #107153 has finished for PR 24963 at commit 096d204.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@srowen (Member) commented Jul 8, 2019

Merged to master
