Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-3614][MLLIB] Add minimumOccurence filtering to IDF #2494

Closed
wants to merge 15 commits into from
Closed

[SPARK-3614][MLLIB] Add minimumOccurence filtering to IDF #2494

wants to merge 15 commits into from

Conversation

rnowling
Copy link
Contributor

This PR for SPARK-3614 adds functionality for filtering out terms which do not appear in at least a minimum number of documents.

This is implemented using a minimumOccurence parameter (default 0). When terms' document frequencies are less than minimumOccurence, their IDFs are set to 0, just like when the DF is 0. As a result, the TF-IDFs for the terms are found to be 0, as if the terms were not present in the documents.

This PR makes the following changes:

  • Add a minimumOccurence parameter to the IDF and DocumentFrequencyAggregator classes.
  • Create a parameter-less constructor for IDF with a default minimumOccurence value of 0 to remain backwards-compatibility with the original IDF API.
  • Sets the IDFs to 0 for terms which DFs are less than minimumOccurence
  • Add tests to the Spark IDFSuite and Java JavaTfIdfSuite test suites
  • Updated the MLLib Feature Extraction programming guide to describe the new feature

*/
if(df(j) >= minimumOccurence) {
inv(j) = math.log((m + 1.0)/ (df(j) + 1.0))
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

else branch is not needed hers

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the branch is needed. I don't modify the df vector -- I perform the filtering in idf().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when you allocate your inv array, by default, all values are 0. You do not need to set it to 0 again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. I didn't consider that. I'll remove the else and add a comment to clarify the default behavior. Thanks.

@Ishiihara
Copy link
Contributor

One question, with this parameter set, it also filter out words that is very important to some documents. Say, that if some word occurs many times in 1 or 2 documents, queries with that word should return these 2 documents. In other words, this approach may filter out words with high tf*idf values. How do you handle this case? Also, even with low df terms skipped, the output still take the same space. Any further thoughts on this to reduce the space?

@rnowling
Copy link
Contributor Author

@Ishiihara If you look at the original JIRA, this was the functionality requested by the user. For the case you mention (high TF in a couple of documents), you would want to handle that separately in the transform() function where you could consider both the IDF and TF values.

As per space, it could be beneficial to create sparser vectors as a result of the filtering. However, I chose not to make that change since it may cause problems for some users since they would expect the resulting TF-IDF vectors to have the same values as the sparse or dense TF vectors. The way I've implemented the changes minimizes the overall effect on the user. I believe a separate PR should be created for considering space optimizations if they are going to change the API.

@rnowling
Copy link
Contributor Author

I removed an unnecessary else blockpointed out by @Ishiihara and added a comment for clarification. Thanks @Ishiihara !

@SparkQA
Copy link

SparkQA commented Sep 22, 2014

QA tests have started for PR 2494 at commit a200bab.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 22, 2014

QA tests have finished for PR 2494 at commit a200bab.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class IDF(minimumOccurence: Long)
    • class DocumentFrequencyAggregator(minimumOccurence: Long) extends Serializable

@SparkQA
Copy link

SparkQA commented Sep 22, 2014

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20671/

@Ishiihara
Copy link
Contributor

@rnowling Please run sbt/sbt scalastyle on your local machine to clear out style issues.

*/
@Experimental
class IDF {
class IDF(minimumOccurence: Long) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can add a val before minimumOccurence. Alternatively, if you want to set set minimumOccurence after new IDF(), you can define a private field and use a setter to set the value.

@SparkQA
Copy link

SparkQA commented Sep 23, 2014

QA tests have started for PR 2494 at commit 6897252.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 23, 2014

QA tests have finished for PR 2494 at commit 6897252.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class IDF(val minimumOccurence: Long)
    • class DocumentFrequencyAggregator(val minimumOccurence: Long) extends Serializable

@SparkQA
Copy link

SparkQA commented Sep 23, 2014

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20675/

@SparkQA
Copy link

SparkQA commented Sep 23, 2014

QA tests have started for PR 2494 at commit 1801fd2.

  • This patch merges cleanly.

@rnowling
Copy link
Contributor Author

@Ishiihara Thanks for pointing out the style check -- I found and fixed the style error in IDF.scala.

Thanks for mentioning options for the mimimumOccurence members. I decided to add the val keyword over adding a setter. Earlier, I had considered several approaches including making it an optional parameter and adding a Scala-style setter, however I found that neither provided clean Java interoperability. As a result, I settled on the overloaded constructor approach, which is also a better match for Scala's emphasis on immutability. Since creating IDF's is inexpensive, I don't think performance will be an issue.

@SparkQA
Copy link

SparkQA commented Sep 23, 2014

QA tests have finished for PR 2494 at commit 1801fd2.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class IDF(val minimumOccurence: Long)
    • class DocumentFrequencyAggregator(val minimumOccurence: Long) extends Serializable

@SparkQA
Copy link

SparkQA commented Sep 23, 2014

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20676/

@rnowling
Copy link
Contributor Author

Jenkins failed because I broke backwards compatibility in DocumentFrequencyAggregator by adding a required parameter to the constructor. I added a second parameter-less constructor that should fix the problem.

@SparkQA
Copy link

SparkQA commented Sep 23, 2014

QA tests have started for PR 2494 at commit 1fc09d8.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 23, 2014

QA tests have finished for PR 2494 at commit 1fc09d8.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class IDF(val minimumOccurence: Long)
    • class DocumentFrequencyAggregator(val minimumOccurence: Long) extends Serializable

@SparkQA
Copy link

SparkQA commented Sep 23, 2014

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20703/

*
* @param minimumOccurence minimum of documents in which a term
* should appear for filtering
*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove extra lines

@SparkQA
Copy link

SparkQA commented Sep 23, 2014

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20716/

@rnowling
Copy link
Contributor Author

Tests failed because of failing to fetch from GitHub. Can we re-run them? Thanks.

@mengxr
Copy link
Contributor

mengxr commented Sep 23, 2014

test this please

@SparkQA
Copy link

SparkQA commented Sep 23, 2014

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20728/

@rnowling
Copy link
Contributor Author

@mengxr tests failed again.

Here's where the error occurs:

Caused by: hudson.plugins.git.GitException: Command "git fetch --tags --progress https://github.com/apache/spark.git +refs/pull/*:refs/remotes/origin/pr/*" returned status code 143

Is it possible that a change was made to Jenkins that isn't populating the git command correctly?

@mengxr
Copy link
Contributor

mengxr commented Sep 23, 2014

@rnowling let's retry :)

test this please

@rnowling
Copy link
Contributor Author

@mengxr doesn't look like the tests started -- maybe Jenkins ignores comments that address users? Thanks!

@mengxr
Copy link
Contributor

mengxr commented Sep 25, 2014

test this please

@SparkQA
Copy link

SparkQA commented Sep 25, 2014

QA tests have started for PR 2494 at commit 0aa3c63.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 25, 2014

QA tests have finished for PR 2494 at commit 0aa3c63.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class IDF(val minDocFreq: Int)
    • class DocumentFrequencyAggregator(val minDocFreq: Int) extends Serializable

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20787/

@rnowling
Copy link
Contributor Author

The flume sink test failed -- unrelated to my PR.

@mengxr
Copy link
Contributor

mengxr commented Sep 25, 2014

@rnowling We have to see Jenkins happy before merge. @tdas @harishreedharan Could you take a look at the failed test? Thanks!

@mengxr
Copy link
Contributor

mengxr commented Sep 25, 2014

test this please

@SparkQA
Copy link

SparkQA commented Sep 25, 2014

QA tests have started for PR 2494 at commit 0aa3c63.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 25, 2014

QA tests have finished for PR 2494 at commit 0aa3c63.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class IDF(val minDocFreq: Int)
    • class DocumentFrequencyAggregator(val minDocFreq: Int) extends Serializable

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20810/

@rnowling
Copy link
Contributor Author

@mengxr Flume test failed again.

And I agree with you -- needs to pass tests (even if failures are unrelated) before we commit. We went through this with one of my recent doc fixes, too. :)

@harishreedharan
Copy link
Contributor

I am working on the test that failed in the first run (with the SparkSinkSuite). I didn't write the test (or the suite) that failed in the 2nd run, but I will take a look at it later today.

@mengxr
Copy link
Contributor

mengxr commented Sep 26, 2014

test this please

@SparkQA
Copy link

SparkQA commented Sep 26, 2014

QA tests have started for PR 2494 at commit 0aa3c63.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 26, 2014

QA tests have finished for PR 2494 at commit 0aa3c63.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class IDF(val minDocFreq: Int)
    • class DocumentFrequencyAggregator(val minDocFreq: Int) extends Serializable

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20850/

@asfgit asfgit closed this in ec9df6a Sep 26, 2014
@mengxr
Copy link
Contributor

mengxr commented Sep 26, 2014

LGTM (and Jenkins finally). Merged into master. Thanks!

@rnowling
Copy link
Contributor Author

Thanks @mengxr and @Ishiihara !

@rnowling rnowling deleted the spark-3614-idf-filter branch September 26, 2014 17:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants