-
Notifications
You must be signed in to change notification settings - Fork 28.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-21481][ML][FOLLOWUP] HashingTF Cleanup #25324
Conversation
Test build #108515 has finished for PR 25324 at commit
|
I see what you're doing, but does it make enough difference to justify the extra complexity? The only optimization that seemed possibly helpful was not referencing the boolean property each time, but pulling it into a variable. The rest doesn't seem like it would matter. Making the method private is OK, and removing the unused var. |
f73729c
to
f5c5291
Compare
Test build #108619 has finished for PR 25324 at commit
|
if (isBinary) { | ||
val localNumFeatures = $(numFeatures) | ||
|
||
val hashUDF = if ($(binary)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I'm saying, is it worth duplicating the UDF definition to lift this check out? just make a local val binary = $(binary)
if you want to avoid referencing the property every time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I misunderstood the comments
Test build #108736 has finished for PR 25324 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WDYT @huaxingao ?
val termFrequencies = mutable.HashMap.empty[Int, Double].withDefaultValue(0.0) | ||
terms.foreach { term => | ||
val i = indexOf(term) | ||
if (isBinary) { | ||
val i = Utils.nonNegativeMod(hashFunc(term), localNumFeatures) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, yeah I think we had discussed whether it's better to reuse indexOf or call this directly with a local var. I'm OK with either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we now have indexOf
, I guess it might be slight better to reuse this method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I'm neutral on it. As a matter of software design, reusing indexOf
makes sense. It may be more performant to inline it. Is there evidence it makes a performance difference?
Test build #108797 has finished for PR 25324 at commit
|
retest this please |
Test build #108803 has finished for PR 25324 at commit
|
Merged to master |
What changes were proposed in this pull request?
some cleanup and tiny optimization
1, since the
transformImpl
method in the .mllib side is no longer used in the .ml side, the scope should be limited;2, in the
hashUDF
, valnumOfFeatures
is never used;3, in the udf, it is inefficient to involve param getter (
$(numFeatures)
/$(binary)
) directly or via methodindexOf
(($(numFeatures)
) . instead, the getter should be called outside of the udf;How was this patch tested?
existing suites