[SPARK-21481][ML] Add indexOf method in ml.feature.HashingTF #25250
Test build #108141 has finished for PR 25250.
Looks good, just some perf ideas while we're here.
Does this need to go in pyspark too?
val func = (terms: Seq[_]) => {
val seq = hashingTF.transformImpl(terms)
Vectors.sparse(hashingTF.numFeatures, seq)
val hashUDF = udf { (terms: Seq[_]) => {
Tiny nit: I think you can omit the inner set of braces here
val termFrequencies = mutable.HashMap.empty[Int, Double]
val setTF =
if ($(binary)) (i: Int) => 1.0 else (i: Int) => termFrequencies.getOrElse(i, 0.0) + 1.0
terms.foreach { term =>
Should the lines from here be unindented one unit?
if ($(binary)) (i: Int) => 1.0 else (i: Int) => termFrequencies.getOrElse(i, 0.0) + 1.0
terms.foreach { term =>
val i = Utils.nonNegativeMod(hashFunc(term), $(numFeatures))
termFrequencies.put(i, setTF(i))
I may be overthinking this, but it might be faster to avoid reading the value for the key and then writing it back in a separate step. If the mutable map has .withDefaultValue(0.0), then...
if ($(binary)) {
  termFrequencies(i) = 1.0
} else {
  termFrequencies(i) += 1.0
}
It may mean reading $(binary) into a local val to make sure it's not accessed repeatedly. Or just duplicate the foreach over terms in two cases so that binary is checked only once.
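The suggested update pattern can be sketched in plain Python, the PR's other language. This is a minimal illustration under stated assumptions, not Spark's code: `count_indices` is a hypothetical helper, the bucket indices are assumed to be already computed, and `collections.defaultdict(float)` stands in for Scala's `withDefaultValue(0.0)`.

```python
from collections import defaultdict


def count_indices(indices, binary=False):
    """Accumulate term frequencies for pre-hashed bucket indices.

    Hypothetical helper for illustration; not Spark's implementation.
    """
    # Missing keys read as 0.0, so no explicit get-then-put is needed.
    frequencies = defaultdict(float)
    if binary:
        # The flag is checked once, outside the loop.
        for i in indices:
            frequencies[i] = 1.0
    else:
        for i in indices:
            frequencies[i] += 1.0
    return dict(frequencies)
```

For example, `count_indices([3, 3, 7])` returns `{3: 2.0, 7: 1.0}`, and the same call with `binary=True` returns `{3: 1.0, 7: 1.0}`.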
val setTF =
if ($(binary)) (i: Int) => 1.0 else (i: Int) => termFrequencies.getOrElse(i, 0.0) + 1.0
terms.foreach { term =>
val i = Utils.nonNegativeMod(hashFunc(term), $(numFeatures))
Can this now call indexOf?
The reason you might not is to avoid accessing $(numFeatures) every time, but if so, then that can be a local val.
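A toy plain-Python version may help show why routing the transform through the lookup method is attractive: the index reported for a term and the bucket that term is counted in can never drift apart. Everything here is an assumption for illustration. `SketchHashingTF` is a hypothetical class, and `zlib.crc32` is a deterministic stand-in, not the hash Spark uses.

```python
import zlib


class SketchHashingTF:
    """Toy hashing term-frequency transformer (hypothetical, not Spark's class)."""

    def __init__(self, num_features):
        self.num_features = num_features

    def index_of(self, term):
        """Return the bucket index in [0, num_features) for a term."""
        # zlib.crc32 is a deterministic stand-in hash chosen for this sketch.
        digest = zlib.crc32(str(term).encode("utf-8"))
        return digest % self.num_features

    def transform(self, terms):
        """Return {index: count}, reusing index_of so both always agree."""
        index_of = self.index_of  # bind once instead of looking it up per term
        counts = {}
        for term in terms:
            i = index_of(term)
            counts[i] = counts.get(i, 0.0) + 1.0
        return counts
```

With `tf = SketchHashingTF(16)`, the only key in `tf.transform(["a", "a"])` is `tf.index_of("a")`, which is the property a unit test for an `indexOf`-style method would want to check.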
I will work on the Python part.
Test build #108187 has finished for PR 25250.
Test build #108188 has finished for PR 25250.
if (isBinary) {
termFrequencies(i) = 1.0
} else {
termFrequencies(i) = termFrequencies(i) + 1.0
I think this can be += 1.0?
val isBinary = $(binary)
val termFrequencies = mutable.HashMap.empty[Int, Double].withDefaultValue(0.0)
terms.foreach { term =>
val i = indexOf (term)
Nit: remove space after indexOf
val isBinary = $(binary)
val termFrequencies = mutable.HashMap.empty[Int, Double].withDefaultValue(0.0)
terms.foreach { term =>
val i = indexOf (term)
Remove extra space after indexOf?
python/pyspark/ml/feature.py
Outdated
@@ -912,6 +912,10 @@ class HashingTF(JavaTransformer, HasInputCol, HasOutputCol, HasNumFeatures, Java
>>> loadedHashingTF = HashingTF.load(hashingTFPath)
>>> loadedHashingTF.getNumFeatures() == hashingTF.getNumFeatures()
True
>>> df = spark.createDataFrame([(["a", "a", "b", "b", "c", "d"],)], ["words"])
>>> hashingTF = HashingTF(numFeatures=100, inputCol="words", outputCol="features")
nit: maybe a smaller number like 10?
Or maybe just reuse hashingTF above?
python/pyspark/ml/feature.py
Outdated
@@ -912,6 +912,10 @@ class HashingTF(JavaTransformer, HasInputCol, HasOutputCol, HasNumFeatures, Java
>>> loadedHashingTF = HashingTF.load(hashingTFPath)
>>> loadedHashingTF.getNumFeatures() == hashingTF.getNumFeatures()
True
>>> df = spark.createDataFrame([(["a", "a", "b", "b", "c", "d"],)], ["words"])
df isn't used, right?
Test build #108223 has finished for PR 25250.
Merged to master.
What changes were proposed in this pull request?
Add indexOf method for ml.feature.HashingTF.
How was this patch tested?
Added a unit test.