
[SPARK-13749][SQL] Faster pivot implementation for many distinct values with two phase aggregation #11583

Closed
aray wants to merge 16 commits into apache:master from aray:fast-pivot

Conversation

aray
Contributor

aray commented Mar 8, 2016

What changes were proposed in this pull request?

The existing implementation of pivot translates into a single aggregation with one aggregate per distinct pivot value. When the number of distinct pivot values is large (say 1000+), this can get extremely slow, since each input row is evaluated against every aggregate even though it affects the value of only one of them.

I'm proposing an alternate strategy for when there are 10+ distinct pivot values (a somewhat arbitrary threshold). We do two phases of aggregation. In the first, we group by the grouping columns plus the pivot column and perform the specified aggregations (one or sometimes more). In the second, we group by the grouping columns and use the new (non-public) PivotFirst aggregate, which rearranges the outputs of the first aggregation into an array indexed by the pivot value. Finally, we do a project to extract the array entries into the appropriate output columns.
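
As a rough sketch of the rewrite using only the public DataFrame API (the dataset, column names, and the first/when trick below are illustrative assumptions; the actual implementation uses the internal PivotFirst aggregate instead):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("pivot-sketch").getOrCreate()
import spark.implicits._

// Made-up data: earnings per course per year; we pivot on `course`.
val df = Seq(
  (2015, "dotNET", 10000), (2015, "Java", 20000),
  (2016, "dotNET", 5000),  (2016, "Java", 30000)
).toDF("year", "course", "earnings")

// Phase 1: group by the grouping columns plus the pivot column, so the
// requested aggregate runs once per (group, pivot value) pair.
val phase1 = df.groupBy($"year", $"course").agg(sum($"earnings").as("agg"))

// Phase 2: group by the grouping columns only. Each group now holds at most
// one pre-aggregated row per pivot value, so a conditional `first` stands in
// for PivotFirst's rearrangement of values into output columns.
val result = phase1.groupBy($"year").agg(
  first(when($"course" === "dotNET", $"agg"), ignoreNulls = true).as("dotNET"),
  first(when($"course" === "Java", $"agg"), ignoreNulls = true).as("Java"))

result.show()

The point of the second phase is that each pre-aggregated row contributes to exactly one output column, instead of being evaluated against an aggregate for every pivot value.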

How was this patch tested?

Additional unit tests in DataFramePivotSuite and manual larger scale testing.

@aray
Contributor Author

aray commented Mar 8, 2016

cc @rxin and @yhuai since you two were involved in the original version

@SparkQA

SparkQA commented Mar 8, 2016

Test build #52681 has finished for PR 11583 at commit bffc7aa.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 8, 2016

Test build #52682 has finished for PR 11583 at commit 359a374.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@aray
Contributor Author

aray commented Mar 10, 2016

Here are some quick benchmark results on a ~1 million row dataset

@yhuai
Contributor

yhuai commented Mar 10, 2016

@aray Thank you for working on it. The results look very cool! I may not have time to review this PR this week, but I will try to find time next week to take a look.

@aray
Contributor Author

aray commented Mar 23, 2016

@yhuai do you have time this week to look at this patch?

@yhuai
Contributor

yhuai commented Mar 23, 2016

Sorry.

I will review this one this week.

override lazy val inputAggBufferAttributes: Seq[AttributeReference] =
  aggBufferAttributes.map(_.newInstance())

override lazy val inputTypes: Seq[AbstractDataType] = children.map(_.dataType)
@yhuai
Contributor

How about we use inputTypes to ask the analyzer to do type casting? Then, if a value column has an invalid data type, the analyzer will complain.

@aray
Contributor Author

I'm not sure what you mean by this, but no casting is needed.
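
For context on what declaring inputTypes buys: it lets the analyzer check (and, where possible, cast) the value column's type at analysis time. A toy illustration of such a check, not Spark's actual internals:

import org.apache.spark.sql.types._

// Compare declared expected types against actual child types; a mismatch is
// reported at analysis time rather than failing at execution.
def checkInputDataTypes(expected: Seq[DataType], actual: Seq[DataType]): Option[String] =
  expected.zip(actual).collectFirst {
    case (exp, act) if exp != act => s"type mismatch: expected $exp, found $act"
  }

// checkInputDataTypes(Seq(IntegerType), Seq(StringType))
//   => Some("type mismatch: expected IntegerType, found StringType")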

* Remove threshold of 10 for pivot values in Analyzer
* Change updateFunction into a partial function so that support can be checked without try/catch (sketched below, after this commit list)
* Scaladoc for PivotFirst
* Move children, inputTypes, nullable, and dataType to beginning
* Added comments
# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
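
The partial-function change in the second bullet above can be sketched as follows (hypothetical names and routines, not the actual PivotFirst code):

import org.apache.spark.sql.types._

// Mapping a value's DataType to an update routine as a PartialFunction means
// supported types can be probed with isDefinedAt instead of try/catch.
val updateFunction: PartialFunction[DataType, (Int, Any) => Unit] = {
  case IntegerType => (pos, v) => println(s"buffer($pos) = int $v")
  case LongType    => (pos, v) => println(s"buffer($pos) = long $v")
}

def supportsDataType(dt: DataType): Boolean = updateFunction.isDefinedAt(dt)

// supportsDataType(IntegerType) => true; supportsDataType(StringType) => false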
@SparkQA

SparkQA commented Apr 18, 2016

Test build #56095 has finished for PR 11583 at commit 32e97a2.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 18, 2016

Test build #56103 has finished for PR 11583 at commit 1723046.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@aray
Contributor Author

aray commented Apr 18, 2016

@yhuai I've addressed all your comments, ready for you to take another look. Sorry for the delay.

@aray
Contributor Author

aray commented May 2, 2016

@yhuai can we get this merged for 2.0?

@yhuai
Contributor

yhuai commented May 2, 2016

test this please



override lazy val aggBufferAttributes: Seq[AttributeReference] =
  pivotIndex.toList.sortBy(_._2).map(kv => AttributeReference(kv._1.toString, valueDataType)())
@yhuai
Contributor

How about we avoid using lazy val for aggBufferAttributes, aggBufferSchema, and inputAggBufferAttributes?
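
A minimal sketch of the suggested change, which the follow-up PR made by turning the three lazy vals into plain vals (class and member names here are illustrative, not the actual PivotFirst source):

import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.types.DataType

class PivotBufferSketch(valueDataType: DataType, pivotIndex: Map[Any, Int]) {
  // Eager val instead of lazy val: the buffer attributes are computed once
  // at construction rather than on first access.
  val aggBufferAttributes: Seq[AttributeReference] =
    pivotIndex.toList.sortBy(_._2).map { case (k, _) =>
      AttributeReference(k.toString, valueDataType)()
    }
}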

@yhuai
Contributor

yhuai commented May 2, 2016

@aray This PR looks good. I will merge this after it passes tests. Can you send out a follow-up PR to address my comments?

@aray
Contributor Author

aray commented May 2, 2016

Sure, will do tonight.

@SparkQA

SparkQA commented May 2, 2016

Test build #57537 has finished for PR 11583 at commit 1723046.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor

yhuai commented May 2, 2016

Merging to master and 2.0 branch.

asfgit closed this in 9927441 on May 2, 2016
asfgit pushed a commit that referenced this pull request May 2, 2016
[SPARK-13749][SQL] Faster pivot implementation for many distinct values with two phase aggregation


Author: Andrew Ray <ray.andrew@gmail.com>

Closes #11583 from aray/fast-pivot.

(cherry picked from commit 9927441)
Signed-off-by: Yin Huai <yhuai@databricks.com>
asfgit pushed a commit that referenced this pull request May 3, 2016
…stinct values with two phase aggregation

## What changes were proposed in this pull request?

This is a follow-up PR for #11583. It makes 3 lazy vals into just vals and adds unit test coverage.

## How was this patch tested?

Existing unit tests and additional unit tests.

Author: Andrew Ray <ray.andrew@gmail.com>

Closes #12861 from aray/fast-pivot-follow-up.

(cherry picked from commit d8f528c)
Signed-off-by: Yin Huai <yhuai@databricks.com>