[SPARK-21481][ML] Add indexOf method in ml.feature.HashingTF #25250

huaxingao · 2019-07-25T00:01:34Z

What changes were proposed in this pull request?

Add indexOf method for ml.feature.HashingTF.

How was this patch tested?

Added a unit test.

SparkQA · 2019-07-25T01:23:58Z

Test build #108141 has finished for PR 25250 at commit 242ac86.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen

Looks good, just some perf ideas while we're here.
Does this need to go in pyspark too?

srowen · 2019-07-25T16:51:43Z

mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala

-    val func = (terms: Seq[_]) => {
-      val seq = hashingTF.transformImpl(terms)
-      Vectors.sparse(hashingTF.numFeatures, seq)
+    val hashUDF = udf { (terms: Seq[_]) => {


Tiny nit: I think you can omit the inner set of braces here

srowen · 2019-07-25T16:52:17Z

mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala

+      val termFrequencies = mutable.HashMap.empty[Int, Double]
+      val setTF =
+        if ($(binary)) (i: Int) => 1.0 else (i: Int) => termFrequencies.getOrElse(i, 0.0) + 1.0
+        terms.foreach { term =>


Should the lines from here be unindented one unit?

srowen · 2019-07-25T17:01:41Z

mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala

+        if ($(binary)) (i: Int) => 1.0 else (i: Int) => termFrequencies.getOrElse(i, 0.0) + 1.0
+        terms.foreach { term =>
+          val i = Utils.nonNegativeMod(hashFunc(term), $(numFeatures))
+          termFrequencies.put(i, setTF(i))


I may be overthinking this, but it might be faster to avoid a separate lookup followed by an update for the same key. If the mutable map has .withDefaultValue(0.0), then:

if ($(binary)) { termFrequencies(i) = 1.0 } else { termFrequencies(i) += 1.0 }

It may mean reading $(binary) to a local val to make sure it's not accessed repeatedly. Or just duplicate the foreach over terms in two cases so that binary is checked only once.
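As a rough illustration of the suggestion above, here is a Python analogue rather than the Scala code under review. It assumes `collections.defaultdict(float)` as a stand-in for a map with `.withDefaultValue(0.0)`, and the names `term_frequencies` and `index_of` are hypothetical. The binary flag is checked once and the loop is duplicated for the two cases.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable


def term_frequencies(
    terms: Iterable[Hashable],
    index_of: Callable[[Hashable], int],
    binary: bool,
) -> Dict[int, float]:
    """Count terms per hashed index, checking `binary` once outside the loop."""
    # defaultdict(float) makes a missing index read as 0.0, so the loop body
    # needs no explicit "get, then put" pair (an assumption-level analogue of
    # withDefaultValue(0.0), not the Spark implementation).
    counts: Dict[int, float] = defaultdict(float)
    if binary:
        # Binary mode: presence only, every seen index is set to 1.0.
        for term in terms:
            counts[index_of(term)] = 1.0
    else:
        # Count mode: in-place increment on the defaulted value.
        for term in terms:
            counts[index_of(term)] += 1.0
    return dict(counts)
```

Duplicating the loop keeps the branch out of the per-term path. Whether that is measurably faster than reading the flag into a local variable would need a benchmark.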

srowen · 2019-07-25T17:02:14Z

mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala

+      val setTF =
+        if ($(binary)) (i: Int) => 1.0 else (i: Int) => termFrequencies.getOrElse(i, 0.0) + 1.0
+        terms.foreach { term =>
+          val i = Utils.nonNegativeMod(hashFunc(term), $(numFeatures))


Can this now call indexOf?
The reason you might not is to avoid accessing $(numFeatures) every time, but if so, then that can be a local val.
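A minimal Python sketch of the idea in this comment, not the Scala code in the patch: the index function reads the feature count once into a local and reuses it for every term. `zlib.crc32` is only a stand-in hash for illustration, and `make_index_of` and `non_negative_mod` are hypothetical names that loosely mirror `indexOf` and `Utils.nonNegativeMod`.

```python
import zlib
from typing import Callable


def non_negative_mod(x: int, mod: int) -> int:
    """Map any integer, including a negative hash, into the range [0, mod)."""
    raw = x % mod
    # Python's % is already non-negative for a positive modulus; the guard
    # mirrors what is needed in languages where % can return a negative value.
    return raw + mod if raw < 0 else raw


def make_index_of(num_features: int) -> Callable[[str], int]:
    """Build an index function that closes over a local copy of num_features."""
    n = num_features  # read once here, not once per term

    def index_of(term: str) -> int:
        # Stand-in hash for this sketch only; any deterministic hash would do.
        return non_negative_mod(zlib.crc32(term.encode("utf-8")), n)

    return index_of
```

With this shape, the transform loop and a public lookup method can share one index function, so a term always maps to the same slot in both places.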

huaxingao · 2019-07-26T00:11:01Z

I will work on the python part.

SparkQA · 2019-07-26T01:20:38Z

Test build #108187 has finished for PR 25250 at commit c1f413c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-07-26T02:43:38Z

Test build #108188 has finished for PR 25250 at commit 86b02d3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2019-07-26T13:07:04Z

mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala

+        if (isBinary) {
+          termFrequencies(i) = 1.0
+        } else {
+          termFrequencies(i) = termFrequencies(i) + 1.0


I think this can be += 1.0?

srowen · 2019-07-26T13:07:16Z

mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala

+      val isBinary = $(binary)
+      val termFrequencies = mutable.HashMap.empty[Int, Double].withDefaultValue(0.0)
+      terms.foreach { term =>
+        val i = indexOf (term)


Nit: remove space after indexOf

viirya · 2019-07-26T13:38:58Z

mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala

+      val isBinary = $(binary)
+      val termFrequencies = mutable.HashMap.empty[Int, Double].withDefaultValue(0.0)
+      terms.foreach { term =>
+        val i = indexOf (term)


Remove extra space after indexOf?

viirya · 2019-07-26T13:49:37Z

python/pyspark/ml/feature.py

@@ -912,6 +912,10 @@ class HashingTF(JavaTransformer, HasInputCol, HasOutputCol, HasNumFeatures, Java
    >>> loadedHashingTF = HashingTF.load(hashingTFPath)
    >>> loadedHashingTF.getNumFeatures() == hashingTF.getNumFeatures()
    True
+    >>> df = spark.createDataFrame([(["a", "a", "b", "b", "c", "d"],)], ["words"])
+    >>> hashingTF = HashingTF(numFeatures=100, inputCol="words", outputCol="features")


nit: maybe a smaller number like 10?

or, maybe just reuse hashingTF above?

viirya · 2019-07-26T13:51:58Z

python/pyspark/ml/feature.py

@@ -912,6 +912,10 @@ class HashingTF(JavaTransformer, HasInputCol, HasOutputCol, HasNumFeatures, Java
    >>> loadedHashingTF = HashingTF.load(hashingTFPath)
    >>> loadedHashingTF.getNumFeatures() == hashingTF.getNumFeatures()
    True
+    >>> df = spark.createDataFrame([(["a", "a", "b", "b", "c", "d"],)], ["words"])


df isn't used, right?

SparkQA · 2019-07-26T17:01:11Z

Test build #108223 has finished for PR 25250 at commit a6938b2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2019-07-28T13:32:51Z

Merged to master

[SPARK-21481][ML] Add indexOf method in ml.feature.HashingTF · 242ac86

dongjoon-hyun added the ML label Jul 25, 2019

srowen reviewed Jul 25, 2019

address comments · c1f413c

add indexOf in python version of HashingTF · 86b02d3

srowen reviewed Jul 26, 2019

viirya reviewed Jul 26, 2019

address comments · a6938b2

srowen approved these changes Jul 27, 2019

viirya approved these changes Jul 28, 2019

srowen closed this in 70f82fd Jul 28, 2019
