Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-21481][ML] Add indexOf method in ml.feature.HashingTF #25250

Closed
wants to merge 4 commits into from

Conversation

huaxingao
Copy link
Contributor

What changes were proposed in this pull request?

Add indexOf method for ml.feature.HashingTF.

How was this patch tested?

Add Unit test.

@SparkQA
Copy link

SparkQA commented Jul 25, 2019

Test build #108141 has finished for PR 25250 at commit 242ac86.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just some perf ideas while we're here.
Does this need to go in pyspark too?

val func = (terms: Seq[_]) => {
val seq = hashingTF.transformImpl(terms)
Vectors.sparse(hashingTF.numFeatures, seq)
val hashUDF = udf { (terms: Seq[_]) => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiny nit: I think you can omit the inner set of braces here

val termFrequencies = mutable.HashMap.empty[Int, Double]
val setTF =
if ($(binary)) (i: Int) => 1.0 else (i: Int) => termFrequencies.getOrElse(i, 0.0) + 1.0
terms.foreach { term =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the lines from here be unindented one unit?

if ($(binary)) (i: Int) => 1.0 else (i: Int) => termFrequencies.getOrElse(i, 0.0) + 1.0
terms.foreach { term =>
val i = Utils.nonNegativeMod(hashFunc(term), $(numFeatures))
termFrequencies.put(i, setTF(i))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be overthinking this, but it might be faster to not get the value for the key and then update the value for the key. If the mutable map has .withDefaultValue(0.0) then...

if ($(binary)) {
  termFrequences(i) = 1.0
} else {
  termFrequences(i) += 1.0
}

It may mean reading $(binary) to a local val to make sure it's not accessed repeatedly. Or just duplicate the foreach over terms in two cases so that binary is checked only once.

val setTF =
if ($(binary)) (i: Int) => 1.0 else (i: Int) => termFrequencies.getOrElse(i, 0.0) + 1.0
terms.foreach { term =>
val i = Utils.nonNegativeMod(hashFunc(term), $(numFeatures))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this now call indexOf?
The reason you might not is to avoid accessing $(numFeatures) every time, but if so, then that can be a local val.

@huaxingao
Copy link
Contributor Author

I will work on the python part.

@SparkQA
Copy link

SparkQA commented Jul 26, 2019

Test build #108187 has finished for PR 25250 at commit c1f413c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 26, 2019

Test build #108188 has finished for PR 25250 at commit 86b02d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if (isBinary) {
termFrequencies(i) = 1.0
} else {
termFrequencies(i) = termFrequencies(i) + 1.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be += 1.0 ?

val isBinary = $(binary)
val termFrequencies = mutable.HashMap.empty[Int, Double].withDefaultValue(0.0)
terms.foreach { term =>
val i = indexOf (term)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: remove space after indexOf

val isBinary = $(binary)
val termFrequencies = mutable.HashMap.empty[Int, Double].withDefaultValue(0.0)
terms.foreach { term =>
val i = indexOf (term)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove extra space after indexOf?

@@ -912,6 +912,10 @@ class HashingTF(JavaTransformer, HasInputCol, HasOutputCol, HasNumFeatures, Java
>>> loadedHashingTF = HashingTF.load(hashingTFPath)
>>> loadedHashingTF.getNumFeatures() == hashingTF.getNumFeatures()
True
>>> df = spark.createDataFrame([(["a", "a", "b", "b", "c", "d"],)], ["words"])
>>> hashingTF = HashingTF(numFeatures=100, inputCol="words", outputCol="features")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe a smaller number like 10?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or, maybe just reuse hashingTF above?

@@ -912,6 +912,10 @@ class HashingTF(JavaTransformer, HasInputCol, HasOutputCol, HasNumFeatures, Java
>>> loadedHashingTF = HashingTF.load(hashingTFPath)
>>> loadedHashingTF.getNumFeatures() == hashingTF.getNumFeatures()
True
>>> df = spark.createDataFrame([(["a", "a", "b", "b", "c", "d"],)], ["words"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

df isn't used, right?

@SparkQA
Copy link

SparkQA commented Jul 26, 2019

Test build #108223 has finished for PR 25250 at commit a6938b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Jul 28, 2019

Merged to master

@srowen srowen closed this in 70f82fd Jul 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
5 participants