
[SPARK-19208][ML][WIP] MaxAbsScaler and MinMaxScaler are very inefficient #16571

Closed
wants to merge 4 commits

Conversation

zhengruifeng
Contributor

What changes were proposed in this pull request?

eliminate the usage of MultivariateOnlineSummarizer

How was this patch tested?

existing tests, manual tests in spark-shell

@SparkQA

SparkQA commented Jan 13, 2017

Test build #71310 has finished for PR 16571 at commit dd0e0c3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 13, 2017

Test build #71315 has finished for PR 16571 at commit 671e566.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 13, 2017

Test build #71318 has finished for PR 16571 at commit 5839ac2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val input: RDD[OldVector] = dataset.select($(inputCol)).rdd.map {
  case Row(v: Vector) => OldVectors.fromML(v)
}
val summary = Statistics.colStats(input)
Contributor

Is it the call to colStats and the vector conversion that was so inefficient? Do you have any performance numbers to justify the change, since it does make the code more complicated?

case _ =>
}
max
}, combOp = {
Contributor

I think it may make the code clearer to move the seqOp and combOp to separate methods
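
A minimal sketch of what that could look like (not the PR's actual code; the object and method names are hypothetical, and it ignores the bookkeeping for implicit zeros in sparse vectors that the real aggregation needs):

import org.apache.spark.ml.linalg.Vector

object MinMaxAgg {
  // seqOp: fold one vector into the running per-column (min, max) arrays.
  def updateMinMax(
      acc: (Array[Double], Array[Double]),
      vec: Vector): (Array[Double], Array[Double]) = {
    val (min, max) = acc
    vec.foreachActive { (i, v) =>
      if (v < min(i)) min(i) = v
      if (v > max(i)) max(i) = v
    }
    acc
  }

  // combOp: merge two partial (min, max) results element-wise.
  def mergeMinMax(
      a: (Array[Double], Array[Double]),
      b: (Array[Double], Array[Double])): (Array[Double], Array[Double]) = {
    var i = 0
    while (i < a._1.length) {
      a._1(i) = math.min(a._1(i), b._1(i))
      a._2(i) = math.max(a._2(i), b._2(i))
      i += 1
    }
    a
  }
}

// usage (hypothetical): input.treeAggregate((mins, maxs))(MinMaxAgg.updateMinMax, MinMaxAgg.mergeMinMax)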

val maxAbs = Array.tabulate(n) { i => math.max(math.abs(minVals(i)), math.abs(maxVals(i))) }

val maxAbs = dataset.select($(inputCol)).rdd.map {
  row => row.getAs[Vector](0)
Contributor

Would moving this inside the treeAggregate possibly make the computation faster, e.g. by operating on each row's vector there, so that you wouldn't have to pass through the data twice?
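
For reference, a one-pass version might look roughly like this (a hypothetical sketch, not the PR's code; the object and method names are invented, and "vectors" stands for the mapped RDD of input vectors):

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD

object MaxAbsAgg {
  // Compute the per-column max absolute value in a single treeAggregate pass.
  def maxAbsOnePass(vectors: RDD[Vector], numFeatures: Int): Array[Double] = {
    vectors.treeAggregate(Array.fill(numFeatures)(0.0))(
      seqOp = (maxAbs, vec) => {
        // Only active entries can raise the running max, since |0.0| is 0.0.
        vec.foreachActive { (i, v) =>
          val abs = math.abs(v)
          if (abs > maxAbs(i)) maxAbs(i) = abs
        }
        maxAbs
      },
      combOp = (a, b) => {
        var i = 0
        while (i < numFeatures) {
          if (b(i) > a(i)) a(i) = b(i)
          i += 1
        }
        a
      })
  }
}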

val maxAbs = Array.tabulate(n) { i => math.max(math.abs(minVals(i)), math.abs(maxVals(i))) }

val maxAbs = dataset.select($(inputCol)).rdd.map {
  row => row.getAs[Vector](0)
Contributor

Actually, I think it might make the code clearer to:
1.) map to Array[Double], similar to what you did with the vector, but take the absolute value
2.) instead of using treeAggregate, just do a simple reduce on the arrays, taking the max for each slot
That would simplify the code more. Would that be worse performance-wise?
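
Roughly, that suggestion might look like the following (a hypothetical sketch; the names are invented, and "vectors" stands for the dataset.select($(inputCol)).rdd.map(...) RDD above):

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD

object MaxAbsByReduce {
  // Map each vector to its absolute values, then reduce with an
  // element-wise max across all rows.
  def maxAbsByReduce(vectors: RDD[Vector]): Array[Double] = {
    vectors
      .map(vec => Array.tabulate(vec.size)(i => math.abs(vec(i))))
      .reduce((a, b) => Array.tabulate(a.length)(i => math.max(a(i), b(i))))
  }
}

One trade-off worth noting for the performance question: mapping to a dense Array[Double] allocates a full-length array per row, so for wide, sparse input a treeAggregate over active entries would likely be cheaper.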

require(min.length == vec.size,
  s"Dimensions mismatch when adding new sample: ${min.length} != ${vec.size}")
vec.foreachActive {
  case (i, v) if v != 0.0 =>
Contributor

This looks the same as the code above; can you refactor it into a separate function and call it from both places?
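
One way that refactor might look (a hypothetical sketch; the name updateBounds is invented, and it only mirrors the snippet above, leaving out any nonzero-count bookkeeping the full implementation may need):

import org.apache.spark.ml.linalg.Vector

object BoundsUpdate {
  // Shared helper for the duplicated update logic: fold one vector into the
  // running per-column min/max arrays, skipping explicit zeros as above.
  def updateBounds(min: Array[Double], max: Array[Double], vec: Vector): Unit = {
    require(min.length == vec.size,
      s"Dimensions mismatch when adding new sample: ${min.length} != ${vec.size}")
    vec.foreachActive {
      case (i, v) if v != 0.0 =>
        if (v < min(i)) min(i) = v
        if (v > max(i)) max(i) = v
      case _ =>
    }
  }
}

Both of the duplicated spots could then simply call BoundsUpdate.updateBounds(min, max, vec).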

@sethah
Contributor

sethah commented Jan 14, 2017

Can we please discuss, on the JIRA, whether this is something we actually want to do? @srowen raises a point that I tend to agree with, so I'd prefer not to proceed with code review until we are sure about it.

@imatiach-msft
Contributor

@sethah Thank you for your concern; I added my thoughts to the JIRA.

@zhengruifeng zhengruifeng changed the title [SPARK-19208][ML] MaxAbsScaler and MinMaxScaler are very inefficient [SPARK-19208][ML][WIP] MaxAbsScaler and MinMaxScaler are very inefficient Jan 16, 2017
@zhengruifeng
Contributor Author

In the JIRA, we decided to optimize MultivariateOnlineSummarizer first, so this PR will be closed.

@zhengruifeng zhengruifeng deleted the new_ma branch January 19, 2017 12:57
@WeichenXu123
Contributor

This PR is very similar to my earlier PR #14950. Is that right? @jkbradley
