-
Notifications
You must be signed in to change notification settings - Fork 28k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-3614][MLLIB] Add minimumOccurence filtering to IDF #2494
Conversation
*/ | ||
if(df(j) >= minimumOccurence) { | ||
inv(j) = math.log((m + 1.0)/ (df(j) + 1.0)) | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
else branch is not needed hers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the branch is needed. I don't modify the df vector -- I perform the filtering in idf().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when you allocate your inv array, by default, all values are 0. You do not need to set it to 0 again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. I didn't consider that. I'll remove the else and add a comment to clarify the default behavior. Thanks.
One question, with this parameter set, it also filter out words that is very important to some documents. Say, that if some word occurs many times in 1 or 2 documents, queries with that word should return these 2 documents. In other words, this approach may filter out words with high tf*idf values. How do you handle this case? Also, even with low df terms skipped, the output still take the same space. Any further thoughts on this to reduce the space? |
@Ishiihara If you look at the original JIRA, this was the functionality requested by the user. For the case you mention (high TF in a couple of documents), you would want to handle that separately in the transform() function where you could consider both the IDF and TF values. As per space, it could be beneficial to create sparser vectors as a result of the filtering. However, I chose not to make that change since it may cause problems for some users since they would expect the resulting TF-IDF vectors to have the same values as the sparse or dense TF vectors. The way I've implemented the changes minimizes the overall effect on the user. I believe a separate PR should be created for considering space optimizations if they are going to change the API. |
I removed an unnecessary else blockpointed out by @Ishiihara and added a comment for clarification. Thanks @Ishiihara ! |
QA tests have started for PR 2494 at commit
|
QA tests have finished for PR 2494 at commit
|
Test FAILed. |
@rnowling Please run sbt/sbt scalastyle on your local machine to clear out style issues. |
*/ | ||
@Experimental | ||
class IDF { | ||
class IDF(minimumOccurence: Long) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can add a val before minimumOccurence. Alternatively, if you want to set set minimumOccurence after new IDF(), you can define a private field and use a setter to set the value.
QA tests have started for PR 2494 at commit
|
QA tests have finished for PR 2494 at commit
|
Test FAILed. |
QA tests have started for PR 2494 at commit
|
@Ishiihara Thanks for pointing out the style check -- I found and fixed the style error in IDF.scala. Thanks for mentioning options for the mimimumOccurence members. I decided to add the val keyword over adding a setter. Earlier, I had considered several approaches including making it an optional parameter and adding a Scala-style setter, however I found that neither provided clean Java interoperability. As a result, I settled on the overloaded constructor approach, which is also a better match for Scala's emphasis on immutability. Since creating IDF's is inexpensive, I don't think performance will be an issue. |
QA tests have finished for PR 2494 at commit
|
Test FAILed. |
Jenkins failed because I broke backwards compatibility in DocumentFrequencyAggregator by adding a required parameter to the constructor. I added a second parameter-less constructor that should fix the problem. |
QA tests have started for PR 2494 at commit
|
QA tests have finished for PR 2494 at commit
|
Test PASSed. |
* | ||
* @param minimumOccurence minimum of documents in which a term | ||
* should appear for filtering | ||
* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove extra lines
Test FAILed. |
Tests failed because of failing to fetch from GitHub. Can we re-run them? Thanks. |
test this please |
Test FAILed. |
@mengxr tests failed again. Here's where the error occurs:
Is it possible that a change was made to Jenkins that isn't populating the git command correctly? |
@rnowling let's retry :) test this please |
@mengxr doesn't look like the tests started -- maybe Jenkins ignores comments that address users? Thanks! |
test this please |
QA tests have started for PR 2494 at commit
|
QA tests have finished for PR 2494 at commit
|
Test FAILed. |
The flume sink test failed -- unrelated to my PR. |
@rnowling We have to see Jenkins happy before merge. @tdas @harishreedharan Could you take a look at the failed test? Thanks! |
test this please |
QA tests have started for PR 2494 at commit
|
QA tests have finished for PR 2494 at commit
|
Test FAILed. |
@mengxr Flume test failed again. And I agree with you -- needs to pass tests (even if failures are unrelated) before we commit. We went through this with one of my recent doc fixes, too. :) |
I am working on the test that failed in the first run (with the SparkSinkSuite). I didn't write the test (or the suite) that failed in the 2nd run, but I will take a look at it later today. |
test this please |
QA tests have started for PR 2494 at commit
|
QA tests have finished for PR 2494 at commit
|
Test PASSed. |
LGTM (and Jenkins finally). Merged into master. Thanks! |
Thanks @mengxr and @Ishiihara ! |
This PR for SPARK-3614 adds functionality for filtering out terms which do not appear in at least a minimum number of documents.
This is implemented using a minimumOccurence parameter (default 0). When terms' document frequencies are less than minimumOccurence, their IDFs are set to 0, just like when the DF is 0. As a result, the TF-IDFs for the terms are found to be 0, as if the terms were not present in the documents.
This PR makes the following changes: