[SPARK-19208][ML][WIP] MaxAbsScaler and MinMaxScaler are very inefficient #16571
Conversation
Test build #71310 has finished for PR 16571 at commit.
Test build #71315 has finished for PR 16571 at commit.
Test build #71318 has finished for PR 16571 at commit.
val input: RDD[OldVector] = dataset.select($(inputCol)).rdd.map {
  case Row(v: Vector) => OldVectors.fromML(v)
}
val summary = Statistics.colStats(input)
Is it the call to colStats and the vector conversion that was so inefficient? Do you have any performance numbers to justify the change, since it does make the code more complicated?
  case _ =>
}
max
}, combOp = {
I think it may make the code clearer to move the seqOp and combOp to separate methods.
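A minimal sketch of what this suggestion could look like, using plain Scala arrays rather than Spark types (the object and method names here are assumptions, not code from the PR). The named `seqOp` and `combOp` would then be passed to `treeAggregate` instead of inline closures:

```scala
// Hypothetical refactor: element-wise max-abs aggregation split into
// named seqOp/combOp methods, e.g. for data.treeAggregate(zero)(seqOp, combOp).
object MaxAggregation {
  // seqOp: fold one vector (as Array[Double]) into the running max-abs array
  def seqOp(max: Array[Double], vec: Array[Double]): Array[Double] = {
    var i = 0
    while (i < vec.length) {
      val a = math.abs(vec(i))
      if (a > max(i)) max(i) = a
      i += 1
    }
    max
  }

  // combOp: merge two partial max arrays from different partitions
  def combOp(a: Array[Double], b: Array[Double]): Array[Double] = {
    var i = 0
    while (i < a.length) {
      if (b(i) > a(i)) a(i) = b(i)
      i += 1
    }
    a
  }
}
```

Naming the two operations also makes each one unit-testable on its own, which inline closures inside an aggregate call are not.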
val maxAbs = Array.tabulate(n) { i => math.max(math.abs(minVals(i)), math.abs(maxVals(i))) }

val maxAbs = dataset.select($(inputCol)).rdd.map {
  row => row.getAs[Vector](0)
Would moving this inside the treeAggregate possibly make the computation faster? E.g., operate on rows of vectors? That way you wouldn't have to pass through the data twice.
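The single-pass idea can be illustrated with plain Scala collections (this is a sketch of the technique only; variable names and the `foldLeft` stand-in for Spark's aggregation are assumptions). Min and max are tracked together in one traversal instead of two:

```scala
// Hypothetical sketch: track min and max in a single pass over the data,
// so the dataset is traversed once rather than once per statistic.
val data: Seq[Array[Double]] = Seq(Array(1.0, -4.0), Array(-3.0, 2.0))
val n = data.head.length

val (minVals, maxVals) = data.foldLeft(
  (Array.fill(n)(Double.PositiveInfinity), Array.fill(n)(Double.NegativeInfinity))
) { case ((mins, maxs), vec) =>
  var i = 0
  while (i < n) {
    if (vec(i) < mins(i)) mins(i) = vec(i)  // update per-slot minimum
    if (vec(i) > maxs(i)) maxs(i) = vec(i)  // update per-slot maximum
    i += 1
  }
  (mins, maxs)
}
```

In Spark the same pairing of accumulators would go inside one `treeAggregate` call, which is what the comment is suggesting.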
val maxAbs = Array.tabulate(n) { i => math.max(math.abs(minVals(i)), math.abs(maxVals(i))) }

val maxAbs = dataset.select($(inputCol)).rdd.map {
  row => row.getAs[Vector](0)
Actually, I think it might make the code clearer to:
1.) map to Array[Double], similar to what you did with the vector, but take the abs
2.) instead of using treeAggregate, just do a simple reduce on the arrays, taking the max for each slot
That would simplify the code more. Would that be worse performance-wise?
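A small sketch of this abs-then-reduce suggestion, using a plain Scala `Seq` in place of the RDD (the `vectors` name and sample data are illustrative assumptions):

```scala
// Hypothetical sketch of the reviewer's suggestion:
// 1) map each vector to its element-wise absolute values,
// 2) reduce with an element-wise max, one slot at a time.
val vectors: Seq[Array[Double]] = Seq(
  Array(1.0, -4.0, 2.0),
  Array(-3.0, 2.0, -5.0)
)

val maxAbs: Array[Double] = vectors
  .map(_.map(math.abs))  // abs of every element
  .reduce((a, b) => Array.tabulate(a.length)(i => math.max(a(i), b(i))))
```

On an RDD the same shape would be `rdd.map(...).reduce(...)`; a `reduce` allocates a fresh array per merge where `treeAggregate` can mutate an accumulator in place, which is presumably the performance trade-off the comment asks about.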
require(min.length == vec.size,
  s"Dimensions mismatch when adding new sample: ${min.length} != ${vec.size}")
vec.foreachActive {
  case (i, v) if v != 0.0 =>
This looks the same as the code above; can you refactor it into a separate function and call it from both places?
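One possible shape for that shared helper, sketched on plain arrays rather than Spark vectors (the function name `updateMinMax` is an assumption; the zero-skipping mirrors the `foreachActive`-style handling of sparse entries quoted above):

```scala
// Hypothetical shared helper: the duplicated min/max update loop pulled
// into one function that both call sites could use.
def updateMinMax(min: Array[Double], max: Array[Double], vec: Array[Double]): Unit = {
  require(min.length == vec.length,
    s"Dimensions mismatch when adding new sample: ${min.length} != ${vec.length}")
  var i = 0
  while (i < vec.length) {
    val v = vec(i)
    if (v != 0.0) {        // only non-zero (active) entries, as in foreachActive
      if (v < min(i)) min(i) = v
      if (v > max(i)) max(i) = v
    }
    i += 1
  }
}
```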
Can we please discuss, on the JIRA, whether this is something we actually want to do? @srowen raises a point that I tend to agree with, so I'd prefer not to proceed with code review until we are sure about it.
@sethah thank you for your concern; I added my thoughts to the JIRA.
In the JIRA, we decided to optimize MultivariateOnlineSummarizer first, so this PR will be closed.
This PR is very similar to my earlier PR #14950. Is that right? @jkbradley
What changes were proposed in this pull request?
Eliminate the usage of MultivariateOnlineSummarizer.
How was this patch tested?
Existing tests, plus manual tests in spark-shell.