
[SPARK-6006][SQL]: Optimize count distinct for high cardinality columns #4764

Closed
wants to merge 6 commits

Conversation

@saucam commented Feb 25, 2015

Currently, the plan for count distinct looks like this:

Aggregate false, [snAppProtocol#448], [CombineAndCount(partialSets#513) AS _c0#437L]
Exchange SinglePartition
Aggregate true, [snAppProtocol#448], [snAppProtocol#448,AddToHashSet(snAppProtocol#448) AS partialSets#513]
!OutputFaker [snAppProtocol#448]
ParquetTableScan [snAppProtocol#587], (ParquetRelation hdfs://192.168.160.57:9000/data/collector/13/11/14, Some(Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml), org.apache.spark.sql.hive.HiveContext@6b1ed434, [ptime#443], ptime=2014-11-13 00%3A55%3A00), []

This can be slow if a column has a large number of distinct values. This PR changes the above plan to:

Aggregate false, [], [SUM(_c0#437L) AS totalCount#514L]
Exchange SinglePartition
Aggregate false, [snAppProtocol#448], [CombineAndCount(partialSets#513) AS _c0#437L]
Exchange (HashPartitioning [snAppProtocol#448], 200)
Aggregate true, [snAppProtocol#448], [snAppProtocol#448,AddToHashSet(snAppProtocol#448) AS partialSets#513]
!OutputFaker [snAppProtocol#448]
ParquetTableScan [snAppProtocol#587], (ParquetRelation hdfs://192.168.160.57:9000/data/collector/13/11/14, Some(Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml), org.apache.spark.sql.hive.HiveContext@6b1ed434, [ptime#443], ptime=2014-11-13 00%3A55%3A00), []

This way, even if there are many distinct values, we insert them into partial hash sets, so the computation remains distributed and thus faster.
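
For reference, plans of this shape come from a single count distinct query, roughly like the following (the table name logs is hypothetical; the column name is taken from the plans above):

// Hypothetical query that produces a plan like the ones shown; assumes a
// HiveContext named hiveContext with a registered table "logs".
hiveContext.sql("SELECT COUNT(DISTINCT snAppProtocol) FROM logs")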

@saucam changed the title from "SPARK-6006: Optimize count distinct for high cardinality columns" to "[SPARK-6006][SQL]: Optimize count distinct for high cardinality columns" on Feb 25, 2015
@saucam (Author) commented Feb 25, 2015

@marmbrus, can you please suggest how to rewrite this in a better way?

@marmbrus (Contributor) commented:

test this please

@SparkQA commented Feb 25, 2015

Test build #27960 has finished for PR 4764 at commit 4125e2e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@saucam (Author) commented Feb 26, 2015

Can we test this again, please?

@srowen (Member) commented Feb 26, 2015

Jenkins, retest this please.

@SparkQA commented Feb 26, 2015

Test build #27994 has finished for PR 4764 at commit edee0d2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@saucam (Author) commented Feb 26, 2015

Fixed the null count test failure. Note that the optimization applies only when there is a single count distinct in the select clause.
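
To illustrate the restriction (hypothetical queries; sqlContext and table t are assumed, not from this PR):

// Covered: a single count distinct in the select clause.
sqlContext.sql("SELECT COUNT(DISTINCT a) FROM t")

// Not covered: more than one distinct aggregate in the same select clause.
sqlContext.sql("SELECT COUNT(DISTINCT a), COUNT(DISTINCT b) FROM t")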

@saucam (Author) commented Feb 27, 2015

Please retest.

@saucam (Author) commented Mar 6, 2015

Please retest.

@marmbrus (Contributor) commented Apr 3, 2015

ok to test

@SparkQA commented Apr 3, 2015

Test build #29635 has finished for PR 4764 at commit 6883b42.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@saucam (Author) commented Apr 4, 2015

Fixed the test case for zero count when there is no data, and rebased onto the latest master. Please retest.

@AmplabJenkins commented:

Test FAILed.
Refer to this link for build results (access rights to the CI server are needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29709/

@saucam (Author) commented Apr 5, 2015

Fixed test failures caused by class cast exceptions. Please retest.

@marmbrus (Contributor) commented Apr 9, 2015

Thanks for working on this, and sorry for the delay in reviewing it. My high-level feedback is that I think we should optimize the handling of distinct aggregation, but there are already plans to do this more holistically instead of as a point solution. If this is really important to you for some specific production workload, we could consider adding something simple now and removing it later, but otherwise I'd prefer to wait for the full solution.

More specifically, I have some advice on how I would structure this if we were to move forward with this approach.

  • Code style: In general, across the whole optimizer we try to avoid vars and while loops, preferring functional constructs where possible (vars and while loops are okay in performance-critical code).
  • Placement: Instead of making changes to analysis (only resolution and type coercion should happen there) and planning, I think this should be a single rule inside the Optimizer, since it starts with a valid logical plan and ends with a valid logical plan, but rewrites it to be more efficient.
  • SumZero: Where possible, prefer to compose existing constructs; i.e., I think this could just be a coalesce(sum(...), 0) instead of duplicating a significant amount of code (see the sketch after this list).
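
As a minimal sketch of that last suggestion, assuming Catalyst's Coalesce, Sum, Cast, and Literal expressions (the sumZero helper name is made up here, not part of the PR):

import org.apache.spark.sql.catalyst.expressions._

// Sum that yields 0 instead of null on empty input, composed from existing
// expressions rather than a new SumZero aggregate.
def sumZero(e: Expression): Expression =
  Coalesce(Seq(Sum(e), Cast(Literal(0), e.dataType)))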

@marmbrus (Contributor) commented Apr 9, 2015

As a very rough sketch (this is totally untested and I'm probably missing cases), I'd hope the solution could look something like the following:

object OptimizeSimpleDistincts extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // Only consider global aggregates (no grouping keys) with a single
    // aggregate expression.
    case a @ Aggregate(Nil, Seq(aggExpr), child) =>
      val rewritten = aggExpr transform {
        case CountDistinct(Seq(e)) => Count(e)
        case SumDistinct(e) => Sum(e)
      }

      if (rewritten != aggExpr) {
        // Push the distinct below the aggregate so it is computed as a
        // distributed aggregation, then apply the plain aggregate on top.
        Aggregate(Nil, rewritten.asInstanceOf[NamedExpression] :: Nil, Distinct(child))
      } else {
        a // return the unchanged node, not the whole plan
      }
  }
}

With tests of course :) See FilterPushdownSuite for an example.
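
For what it's worth, a test in that style could look roughly like this (an untested sketch; the suite and test names are made up, and the exact test DSL, e.g. Optimize(plan) vs. Optimize.execute(plan), varies across Spark versions):

import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.dsl.plans._
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.plans.PlanTest
import org.apache.spark.sql.catalyst.plans.logical._
import org.apache.spark.sql.catalyst.rules.RuleExecutor

class OptimizeSimpleDistinctsSuite extends PlanTest {
  object Optimize extends RuleExecutor[LogicalPlan] {
    val batches = Batch("Simple distincts", Once, OptimizeSimpleDistincts) :: Nil
  }

  val testRelation = LocalRelation('a.int)

  test("single count distinct becomes count over Distinct") {
    val originalQuery = testRelation.groupBy()(CountDistinct('a.attr :: Nil) as 'cnt)

    // The rule should rewrite count(distinct a) into count(a) over Distinct.
    val optimized = Optimize(originalQuery.analyze)
    val correctAnswer = Distinct(testRelation)
      .groupBy()(Count('a.attr) as 'cnt)
      .analyze

    comparePlans(optimized, correctAnswer)
  }
}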

@saucam (Author) commented Apr 12, 2015

Hi @marmbrus, can you share the other plans for modifying aggregates that you mentioned earlier? Can I help with that? Otherwise I'll modify this one for now as you suggested.

@marmbrus (Contributor) commented:

Here is the JIRA: SPARK-4366. Unless you think you will have something in the next day or two, would you mind closing this PR? I'd like to keep the PR queue to only active issues so that we don't miss things. Thanks!

@saucam (Author) commented Apr 14, 2015

Thanks, @marmbrus. Let me refactor this then and open another PR later.

@saucam closed this Apr 14, 2015