[SPARK-21481][ML][FOLLOWUP] HashingTF Cleanup #25324

Closed
wants to merge 4 commits into from


zhengruifeng
Contributor

What changes were proposed in this pull request?

Some cleanup and a tiny optimization:
1. Since the transformImpl method on the .mllib side is no longer used on the .ml side, its scope should be limited.
2. In hashUDF, val numOfFeatures is never used.
3. Inside the UDF, it is inefficient to call the param getters ($(numFeatures) / $(binary)) directly, or indirectly via the indexOf method (which reads $(numFeatures)). Instead, the getters should be called once, outside the UDF.
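
To illustrate point 3, here is a minimal, Spark-free sketch of reading the params once and capturing only local vals in the per-row function. `SketchParams`, `getNumFeatures`, `getBinary`, `buildHashFunc`, and the `reads` counter are hypothetical names invented for this example, and `term.hashCode` stands in for whatever hash function the real code uses; this is not the actual HashingTF implementation.

```scala
import scala.collection.mutable

object HoistedParamsSketch {
  // Hypothetical stand-in for a Params-backed stage; NOT Spark's HashingTF.
  final class SketchParams(numFeatures: Int, binary: Boolean) {
    var reads = 0 // counts getter calls, for illustration only
    def getNumFeatures: Int = { reads += 1; numFeatures }
    def getBinary: Boolean = { reads += 1; binary }
  }

  // Maps any Int into the range [0, mod).
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Reads both params once, then returns a function that touches only local vals.
  def buildHashFunc(params: SketchParams): Seq[String] => Map[Int, Double] = {
    val localNumFeatures = params.getNumFeatures
    val isBinary = params.getBinary
    (terms: Seq[String]) => {
      val termFrequencies = mutable.HashMap.empty[Int, Double].withDefaultValue(0.0)
      terms.foreach { term =>
        // Assumption: plain hashCode replaces the real hash function here.
        val i = nonNegativeMod(term.hashCode, localNumFeatures)
        if (isBinary) {
          termFrequencies(i) = 1.0
        } else {
          termFrequencies(i) = termFrequencies(i) + 1.0
        }
      }
      termFrequencies.toMap
    }
  }
}
```

However many rows the returned function processes, the getters in this sketch are called exactly twice, which is the effect the change is after.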

How was this patch tested?

existing suites

@SparkQA

SparkQA commented Aug 1, 2019

Test build #108515 has finished for PR 25324 at commit f73729c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Aug 2, 2019

I see what you're doing, but does it make enough difference to justify the extra complexity? The only optimization that seemed possibly helpful was not referencing the boolean property each time, but pulling it into a variable. The rest doesn't seem like it would matter.

Making the method private is OK, and removing the unused var.

@SparkQA

SparkQA commented Aug 4, 2019

Test build #108619 has finished for PR 25324 at commit f5c5291.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if (isBinary) {
val localNumFeatures = $(numFeatures)

val hashUDF = if ($(binary)) {
Member


I guess I'm saying, is it worth duplicating the UDF definition to lift this check out? Just make a local val binary = $(binary) if you want to avoid referencing the property every time.

Contributor Author


ok, I misunderstood the comments
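
The trade-off discussed in this thread can be sketched without Spark as two ways of building the same per-row function. Everything here is an assumption made for illustration: `readBinary` stands in for `$(binary)`, `index` uses plain `hashCode` rather than the real hash function, and `duplicated` / `single` are hypothetical names, not the code in the patch.

```scala
object SingleUdfSketch {
  // Hypothetical helper; plain hashCode stands in for the real hash function.
  private def index(term: String, n: Int): Int = {
    val raw = term.hashCode % n
    if (raw < 0) raw + n else raw
  }

  // Variant A: the check is lifted out, at the cost of two near-identical closures.
  def duplicated(readBinary: () => Boolean, n: Int): Seq[String] => Map[Int, Double] =
    if (readBinary()) {
      (terms: Seq[String]) => terms.map(t => index(t, n) -> 1.0).toMap
    } else {
      (terms: Seq[String]) =>
        terms.groupBy(index(_, n)).map { case (i, ts) => i -> ts.size.toDouble }
    }

  // Variant B: one closure; the flag is read once into a local val and captured.
  def single(readBinary: () => Boolean, n: Int): Seq[String] => Map[Int, Double] = {
    val isBinary = readBinary()
    (terms: Seq[String]) =>
      terms.groupBy(index(_, n)).map { case (i, ts) =>
        i -> (if (isBinary) 1.0 else ts.size.toDouble)
      }
  }
}
```

Both variants read the flag only once. Variant B keeps a single definition and pays for a local boolean test per index, which is the simpler shape the review comment asks for.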

@SparkQA

SparkQA commented Aug 7, 2019

Test build #108736 has finished for PR 25324 at commit 0d5ca8c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@srowen srowen left a comment


WDYT @huaxingao ?

val termFrequencies = mutable.HashMap.empty[Int, Double].withDefaultValue(0.0)
terms.foreach { term =>
val i = indexOf(term)
if (isBinary) {
val i = Utils.nonNegativeMod(hashFunc(term), localNumFeatures)
Member


OK, yeah I think we had discussed whether it's better to reuse indexOf or call this directly with a local var. I'm OK with either.

Contributor


Since we now have indexOf, I guess it might be slightly better to reuse this method.

Member


Yeah I'm neutral on it. As a matter of software design, reusing indexOf makes sense. It may be more performant to inline it. Is there evidence it makes a performance difference?
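
For reference, the two options weighed in this thread can be put side by side in a small, Spark-free sketch. The class `Sketch`, its constructor arguments `getNumFeatures` and `hashFunc`, and the methods `viaIndexOf` and `inlined` are hypothetical names made up for this example; only the shape of the two call paths follows the diff above.

```scala
object IndexOfSketch {
  // Non-negative modulo, written out so the block is self-contained.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Hypothetical stand-in for the transformer; getNumFeatures plays the role of $(numFeatures).
  final class Sketch(getNumFeatures: () => Int, hashFunc: Any => Int) {
    // Option 1: a shared method that reads the param on every call.
    def indexOf(term: Any): Int = nonNegativeMod(hashFunc(term), getNumFeatures())

    def viaIndexOf(terms: Seq[Any]): Seq[Int] = terms.map(indexOf)

    // Option 2: read the param once and inline the arithmetic.
    def inlined(terms: Seq[Any]): Seq[Int] = {
      val localNumFeatures = getNumFeatures()
      terms.map(t => nonNegativeMod(hashFunc(t), localNumFeatures))
    }
  }
}
```

The two paths return the same indices, so the choice is between reuse and a possibly cheaper inner loop. This sketch makes no claim about speed; answering the performance question would need a proper benchmark harness such as JMH.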

@zhengruifeng zhengruifeng changed the title from "[SPARK-21481][ML][FOLLOWUP] HashingTF Cleanup and Tiny Optimizations" to "[SPARK-21481][ML][FOLLOWUP] HashingTF Cleanup" on Aug 8, 2019
@SparkQA

SparkQA commented Aug 8, 2019

Test build #108797 has finished for PR 25324 at commit 66436f0.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor Author

retest this please

@SparkQA

SparkQA commented Aug 8, 2019

Test build #108803 has finished for PR 25324 at commit 66436f0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Aug 9, 2019

Merged to master

@srowen srowen closed this in 8b08e14 Aug 9, 2019
@zhengruifeng zhengruifeng deleted the hashingtf_cleanup branch August 10, 2019 02:53
5 participants