[SPARK-6006][SQL]: Optimize count distinct for high cardinality columns #4764
Conversation
@marmbrus can you please guide me on how to rewrite this in a better way?
test this please
Test build #27960 has finished for PR 4764 at commit
Can we test this again please?
Jenkins, retest this please.
Test build #27994 has finished for PR 4764 at commit
Fixed the null count test failure. The optimization works only in the case of a single count distinct in the select clause.
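For example (illustrative queries with a hypothetical sqlContext and table t, not taken from the PR), the first query below would be eligible while the second would not, since a single hash partitioning on the distinct column cannot serve two different columns at once:

// Hypothetical sqlContext and table t; illustrates the constraint only.
sqlContext.sql("SELECT COUNT(DISTINCT colA) FROM t")                       // eligible: one count distinct
sqlContext.sql("SELECT COUNT(DISTINCT colA), COUNT(DISTINCT colB) FROM t") // not eligible: two distinct columns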
please retest
please retest
ok to test
Test build #29635 has finished for PR 4764 at commit
… distinct in select clause (since it relies on hash partitioning of column values)
Fixed the test case of zero count when there is no data. Rebased with the latest master. Please retest.
Test FAILed.
Fixed test failures caused by class cast exceptions. Please retest.
Thanks for working on this, and sorry for the delay in reviewing it. My high-level feedback is that I think we should optimize the handling of distinct aggregation, but there are already plans to do this more holistically instead of as a point solution. If this is really important to you for some specific production workload, we could consider adding something simple now and removing it later, but otherwise I'd prefer to wait for the full solution. More specifically, I have some advice on how I would structure this if we were to move forward with this approach.
As a very rough sketch (this is totally untested and I'm probably missing cases), I'd hope the solution could look something like the following:

object OptimizeSimpleDistincts extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // Only whole-table aggregates with a single aggregate expression qualify.
    case a @ Aggregate(Nil, Seq(agg), child) =>
      val rewritten = agg transform {
        case CountDistinct(Seq(expr)) => Count(expr)
        case SumDistinct(expr) => Sum(expr)
      }
      if (rewritten != agg) {
        // Push the distinct into the child so it is planned as a regular
        // (distributed) aggregate, then apply the plain Count/Sum on top.
        Aggregate(Nil, rewritten.asInstanceOf[NamedExpression] :: Nil, Distinct(child))
      } else {
        a // nothing was rewritten; keep the original node
      }
  }
}

With tests of course :) See
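For intuition, the rule is the logical-plan analogue of the following SQL rewrite (a sketch using a hypothetical sqlContext, table t, and column c; like the rough sketch above, it assumes the aggregate's child produces only the column being aggregated):

// Hypothetical names; illustration only.
val before = sqlContext.sql("SELECT COUNT(DISTINCT c) FROM t")
val after  = sqlContext.sql("SELECT COUNT(c) FROM (SELECT DISTINCT c FROM t) d")
// Both return the same answer, but the second form plans the distinct as a
// regular hash-partitioned aggregate, so deduplication stays distributed
// instead of funnelling every value through one partition.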
Hi @marmbrus, can you share the other plans for modifying aggregates that you mentioned earlier? Can I help with that? Otherwise I'll modify this one for now as you have suggested.
Here is the JIRA: SPARK-4366. Unless you think you will have something in the next day or two, would you mind closing this PR? I'd like to keep the PR queue limited to active issues so that we don't miss things. Thanks!
Thanks @marmbrus. Let me refactor this then and open another PR later.
Currently, the plan for count distinct looks like this:
Aggregate false, [snAppProtocol#448], [CombineAndCount(partialSets#513) AS _c0#437L]
Exchange SinglePartition
Aggregate true, [snAppProtocol#448], [snAppProtocol#448,AddToHashSet(snAppProtocol#448) AS partialSets#513]
!OutputFaker [snAppProtocol#448]
ParquetTableScan [snAppProtocol#587], (ParquetRelation hdfs://192.168.160.57:9000/data/collector/13/11/14, Some(Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml), org.apache.spark.sql.hive.HiveContext@6b1ed434, [ptime#443], ptime=2014-11-13 00%3A55%3A00), []
This can be slow when a column has many distinct values, because all the partial sets are merged in a single partition (the Exchange SinglePartition step above). This PR changes the above plan to:
Aggregate false, [], [SUM(_c0#437L) AS totalCount#514L]
Exchange SinglePartition
Aggregate false, [snAppProtocol#448], [CombineAndCount(partialSets#513) AS _c0#437L]
Exchange (HashPartitioning [snAppProtocol#448], 200)
Aggregate true, [snAppProtocol#448], [snAppProtocol#448,AddToHashSet(snAppProtocol#448) AS partialSets#513]
!OutputFaker [snAppProtocol#448]
ParquetTableScan [snAppProtocol#587], (ParquetRelation hdfs://192.168.160.57:9000/data/collector/13/11/14, Some(Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml), org.apache.spark.sql.hive.HiveContext@6b1ed434, [ptime#443], ptime=2014-11-13 00%3A55%3A00), []
This way, even if there are many distinct values, we insert them into hash-partitioned partial sets, and the computation remains distributed and thus faster.
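For intuition, here is a minimal RDD-level sketch of the same scheme (a hypothetical helper, not the PR's code): hash-partition the values so each distinct value lands on exactly one partition, count the distinct values within each partition, and sum the partial counts, mirroring the Exchange (HashPartitioning ...), CombineAndCount, and SUM steps in the plan above.

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Hypothetical helper mirroring the optimized plan; for intuition only.
def distributedCountDistinct(values: RDD[String], numPartitions: Int = 200): Long = {
  values
    .map(v => (v, ()))                               // key each row by the column value
    .partitionBy(new HashPartitioner(numPartitions)) // Exchange (HashPartitioning [col], 200)
    .mapPartitions { iter =>                         // partial aggregate: a hash set per partition
      val seen = scala.collection.mutable.HashSet.empty[String]
      iter.foreach { case (v, _) => seen += v }
      Iterator(seen.size.toLong)                     // CombineAndCount within the partition
    }
    .sum()                                           // final Aggregate: SUM of the partial counts
    .toLong
}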