[SPARK-15938]Adding "support" property to MLlib Association Rule #13656
Closed
This is intentionally typed as Double. In the future, it could be a fraction value (< 1.0).
I don't think the meaning of this should ever be overloaded. Support is a count.
Two major considerations:
1. support is a fraction value in [0.0, 1.0]. It's possible for us to align with it in the future.
2. freqUnion: Double and freqAntecedent: Double can be fraction values in [0.0, 1.0], although they are both counts now.
I don't want to destroy the flexibility and break the API in the future.
Ah right, we use support as a fraction. Well, then best to be consistent and return it as a fraction of the data set size. I can't imagine having a method sometimes return a value with one type of semantics and sometimes another. Just make two methods.
freqUnion however appears to be a count only, and is even explicitly called a 'frequency'.
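The "just make two methods" suggestion above could look roughly like the following sketch. It is plain Scala, not the code in this PR: RuleSketch, supportCount, and numTransactions are hypothetical names, and it assumes the total number of transactions is available to the rule.

```scala
// Hypothetical stand-in for a rule; this is not Spark's actual Rule class.
final class RuleSketch(
    val freqUnion: Long,       // transactions containing antecedent and consequent
    val freqAntecedent: Long,  // transactions containing the antecedent
    val numTransactions: Long  // assumed: total number of transactions in the data set
) {
  require(freqAntecedent > 0, "freqAntecedent must be positive")
  require(freqUnion >= 0 && freqUnion <= freqAntecedent,
    "freqUnion must lie in [0, freqAntecedent]")
  require(freqAntecedent <= numTransactions,
    "freqAntecedent cannot exceed numTransactions")

  /** Absolute count: always a Long, never a fraction. */
  def supportCount: Long = freqUnion

  /** Fraction of the data set, always in [0.0, 1.0]. */
  def support: Double = freqUnion.toDouble / numTransactions

  /** Confidence of the rule: freqUnion / freqAntecedent. */
  def confidence: Double = freqUnion.toDouble / freqAntecedent
}
```

With freqUnion = 30, freqAntecedent = 40, and numTransactions = 120, supportCount is 30, support is 0.25, and confidence is 0.75. Each accessor has one meaning, so no caller has to guess whether a Double is a count or a fraction.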
I suppose freqUnion is made a Double on purpose for the same reason (flexibility for the future). Making support a fraction now requires that we keep the dataset size info in FPGrowthModel and AssociationRule. Yet that would introduce an API change in the constructor. I thought we should avoid breaking the API between 2.0 and 2.1.
Dunno, that seems like a mistake to me. It should be a Long if it's a count, and should expose alternative factory methods to accept input of different types if needed. Overloading one argument seems like a hack and I'd prefer not to extend it (or fix it).
See SPARK-15930, which concerns adding the input size just for this reason, I assume. We haven't released 2.0, and so could in theory still put in a change to the constructor. I agree, we might however have to deprecate the existing one, add a new one, and still deal with calls to the old constructor, which would mean it's not possible to compute values that are a fraction of the whole data set. This in turn may argue for clearly separating inputs/outputs that are counts vs percentages.
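To make the trade-off above concrete, here is a minimal sketch of the "deprecate the old entry point, add a new one" option. It is an illustration under assumptions, not Spark code: CountedRule, fromCounts, and numTransactions are hypothetical names, and the "since" string in the annotation is a placeholder.

```scala
// Hypothetical sketch: counts are Long, and the data set size is optional
// because callers of the legacy entry point never supplied it.
final class CountedRule private (
    val freqUnion: Long,
    val freqAntecedent: Long,
    val numTransactions: Option[Long]) {

  /** Absolute count; always available. */
  def supportCount: Long = freqUnion

  /** Fractional support; None when the data set size is unknown. */
  def support: Option[Double] =
    numTransactions.map(n => freqUnion.toDouble / n)
}

object CountedRule {
  /** New factory: requires the data set size, so support is computable. */
  def apply(freqUnion: Long, freqAntecedent: Long, numTransactions: Long): CountedRule = {
    require(numTransactions > 0, "numTransactions must be positive")
    new CountedRule(freqUnion, freqAntecedent, Some(numTransactions))
  }

  /** Legacy-style factory kept for compatibility; cannot yield a fraction. */
  @deprecated("Pass numTransactions so that fractional support can be computed", "hypothetical")
  def fromCounts(freqUnion: Long, freqAntecedent: Long): CountedRule =
    new CountedRule(freqUnion, freqAntecedent, None)
}
```

The Option return type is one way to surface the problem described above: a rule built through the old path has no data set size, so a fractional support simply does not exist for it.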
I find it hard to just deprecate the old constructor and still keep support as a fraction if no dataset size is passed in.
Yes, it's not possible to implement in that case. There's an argument for just adding the parameter and removing the old constructor for 2.0 in order to support this without the convolutions. I'd love to get a thumbs up from @jkbradley or @mengxr though.
support should be a fraction to be consistent with the semantics of minSupport in FPGrowth and PrefixSpan. There should be a compatible way to add support. Rule is not a case class and its constructor is package private, so this should be easy to add. Another approach is to add the total number of records in the model, so people can calculate the support easily.