[SPARK-15938]Adding "support" property to MLlib Association Rule #13656

hhbyyh · 2016-06-14T07:23:53Z

What changes were proposed in this pull request?

jira: https://issues.apache.org/jira/browse/SPARK-15938

Support is an indication of how frequently the item-set appears in the database. Besides confidence, support is another critical property of an association rule.
References:
https://en.wikipedia.org/wiki/Association_rule_learning
http://www.philippe-fournier-viger.com/spmf/index.php?link=documentation.php#allassociationrules
https://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf

Support can be either the count of appearances or the fraction of the dataset. I chose to use the count for two reasons:

1. API compatibility: currently neither FPGrowthModel nor Association Rule has information about the size of the dataset, and I'd like to avoid breaking a list of public APIs.
2. It matches the API of SPMF: http://www.philippe-fournier-viger.com/spmf/index.php?link=documentation.php#allassociationrules
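
To make the count/fraction distinction concrete, here is a minimal sketch in plain Scala with no Spark dependency. The object name `SupportSketch`, the helper names, and the toy transactions are illustrative assumptions and are not part of this patch.

```scala
// Illustrative sketch only: names and data are assumptions, not Spark APIs.
object SupportSketch {
  // A tiny transaction database; each transaction is a set of items.
  val transactions: Seq[Set[String]] = Seq(
    Set("a", "b", "c"),
    Set("a", "b"),
    Set("a", "c"),
    Set("b", "c")
  )

  // Support as a count: number of transactions containing every item of the itemset.
  def supportCount(itemset: Set[String]): Long =
    transactions.count(t => itemset.subsetOf(t)).toLong

  // Support as a fraction: the count divided by the dataset size.
  def supportFraction(itemset: Set[String]): Double =
    supportCount(itemset).toDouble / transactions.size

  // Confidence of antecedent => consequent: freq(union) / freq(antecedent).
  def confidence(antecedent: Set[String], consequent: Set[String]): Double =
    supportCount(antecedent ++ consequent).toDouble / supportCount(antecedent)
}
```

For the rule {a} => {b} on this toy data, the count-based support is 2, the fractional support is 0.5, and the confidence is 2/3. Only the fractional form needs `transactions.size`, which is the piece of information the model does not currently carry.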

As next steps, we could add a constraint such as minSupport, as in other libraries. FPGrowthModel should also contain the size of the dataset.

How was this patch tested?

Existing unit tests.

hhbyyh · 2016-06-14T07:26:35Z

mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala

+     */
+    @Since("2.1.0")
+    def support: Double = freqUnion.toDouble
+


This is intentionally typed as Double. In the future, it could be a fraction value (< 1.0).

srowen · 2016-06-14

I don't think the meaning of this should ever be overloaded. Support is a count.

hhbyyh · 2016-06-14

Two major considerations:

1. In most definitions and textbooks, support is a fraction value in [0.0, 1.0]. It's possible for us to align with that in the future.
2. The current implementation of Association Rule actually allows both freqUnion: Double and freqAntecedent: Double to be fraction values in [0.0, 1.0], although they are both counts now. I don't want to lose that flexibility and break the API in the future.

srowen · 2016-06-14

Ah right, we use support as a fraction. Well, then best to be consistent and return it as a fraction of the data set size. I can't imagine having a method sometimes return a value with one type of semantics and sometimes another. Just make two methods.

freqUnion however appears to be a count only, and is even explicitly called a 'frequency'.

hhbyyh · 2016-06-14

I suppose freqUnion was made a Double on purpose for the same reason (flexibility for the future).

Making support a fraction now requires that we keep the dataset size in FPGrowthModel and AssociationRule. Yet that would introduce an API change in the constructor. I thought we should avoid breaking the API between 2.0 and 2.1.

srowen · 2016-06-14

Dunno, that seems like a mistake to me. It should be a Long if it's a count, and should expose alternative factory methods to accept input of different types if needed. Overloading one argument seems like a hack and I'd prefer not to extend it (or fix it).

See SPARK-15930 which concerns adding the input size just for this reason, I assume. We haven't released 2.0, and so could in theory still put in a change to the constructor. I agree, we might however have to deprecate the existing one, add a new one, and still deal with calls to the old constructor, which would mean it's not possible to compute values that are a fraction of the whole data set. This in turn may argue for clearly separating inputs/outputs that are counts vs percentages.
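
For illustration only (this is not code from the PR), the separation of counts from fractions could be sketched roughly as follows. `RuleSketch`, its fields, and the deprecated secondary constructor are hypothetical stand-ins, not the actual `AssociationRules.Rule` class.

```scala
// Hypothetical sketch, not the Spark class: RuleSketch and its members are assumptions.
class RuleSketch(
    val antecedent: Seq[String],
    val consequent: Seq[String],
    val freqUnion: Long,           // count of transactions containing antecedent and consequent
    val freqAntecedent: Long,      // count of transactions containing the antecedent
    val numRecords: Option[Long]   // dataset size, unknown for callers of the old constructor
) {
  // Old-style constructor without the dataset size, kept for compatibility.
  @deprecated("Pass the dataset size so fractional support can be computed", "sketch")
  def this(antecedent: Seq[String], consequent: Seq[String], freqUnion: Long, freqAntecedent: Long) =
    this(antecedent, consequent, freqUnion, freqAntecedent, None)

  // Always available: a ratio of two counts.
  def confidence: Double = freqUnion.toDouble / freqAntecedent

  // Always a count, never a fraction.
  def supportCount: Long = freqUnion

  // Always a fraction in [0, 1]; None when the dataset size was never supplied.
  def support: Option[Double] = numRecords.map(n => freqUnion.toDouble / n)
}
```

In this sketch each method keeps a single meaning: `supportCount` is always a count and `support` is always a fraction. A rule built through the old constructor can still report its confidence and count, but its fractional support is simply absent.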

hhbyyh · 2016-06-14

I find it hard to just deprecate the old constructor and still keep support as a fraction if no dataset size is passed in.

srowen · 2016-06-14

Yes, it's not possible to implement in that case. There's an argument for just adding the parameter and removing the old constructor for 2.0 in order to support this without the convolutions. I'd love to get a thumbs up from @jkbradley or @mengxr though.

mengxr · 2016-06-22

support should be a fraction, to be consistent with the semantics of minSupport in FPGrowth and PrefixSpan. There should be a compatible way to add support. Rule is not a case class and its constructor is package private, so this should be easy to add. Another approach is to add the total number of records to the model, so people can calculate the support easily.
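
As a rough sketch of the second approach (keeping the record total on the model), fractional support could be derived by the caller as below. `ModelSketch` and `ruleSupport` are hypothetical names for illustration and do not correspond to the real `FPGrowthModel` API.

```scala
// Hypothetical sketch: ModelSketch is an assumption, not FPGrowthModel.
final case class ModelSketch(
    freqItemsets: Map[Set[String], Long], // itemset -> number of transactions containing it
    numRecords: Long                      // total number of transactions the model was fit on
) {
  // Fractional support of a rule, derived from the union's count and the record total.
  // Returns None when the union was not mined as a frequent itemset.
  def ruleSupport(antecedent: Set[String], consequent: Set[String]): Option[Double] =
    freqItemsets.get(antecedent ++ consequent).map(_.toDouble / numRecords)
}
```

A value computed this way is a fraction of the dataset, so it is directly comparable to a minSupport threshold expressed as a fraction.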

hhbyyh · 2016-06-14T08:07:16Z

Thanks for the review @srowen

SparkQA · 2016-06-14T08:10:47Z

Test build #60475 has finished for PR 13656 at commit 60efd05.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2016-06-14T08:18:32Z

Hm, I suppose the problem is that you're returning a count here though. See also SPARK-15930 which is related, and concerns tracking the total size of the input.

hhbyyh · 2016-06-14T08:35:48Z

Yes, I've linked the two issues and provided some illustration about the fraction/count choice in the description.

hhbyyh · 2016-06-14T08:58:01Z

@srowen I'm also working on ml.fpm, where it's easier to include more information in the model and rules. I would suggest one of the following:

1. Use support as a count, and avoid any API break.
2. Keep mllib.fpm unchanged, and I'll close the PR.

SparkQA · 2016-06-23T21:48:29Z

Test build #61130 has finished for PR 13656 at commit 8b16676.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

hhbyyh · 2016-06-23T21:52:02Z

I made a quick change to demo what it looks like if we pass the data size along through FPGrowthModel and AssociationRules.

SparkQA · 2016-06-23T22:06:53Z

Test build #61131 has finished for PR 13656 at commit ed384c7.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk · 2016-10-07T20:50:28Z

I'm not sure if this is something that would still be considered since we aren't doing new development for MLlib anymore. It might make more sense to work on https://issues.apache.org/jira/browse/SPARK-14503 and then implement this after.

hhbyyh · 2017-03-14T05:29:09Z

Closing this and adding the support to ml.fpm in #17280.

add support for association rule

60efd05

hhbyyh reviewed Jun 14, 2016

hhbyyh added 2 commits June 23, 2016 07:45

Merge remote-tracking branch 'upstream/master' into supportAsso

f638d25

add data size

8b16676

java style

ed384c7

hhbyyh closed this Mar 14, 2017

hhbyyh Jun 14, 2016

srowen Jun 14, 2016

hhbyyh Jun 14, 2016 •

edited

srowen Jun 14, 2016

hhbyyh Jun 14, 2016 •

edited

srowen Jun 14, 2016

hhbyyh Jun 14, 2016

srowen Jun 14, 2016

mengxr Jun 22, 2016
