Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-9240] [SQL] Hybrid aggregate operator using unsafe row #7813

Closed
wants to merge 19 commits into from
Closed

[SPARK-9240] [SQL] Hybrid aggregate operator using unsafe row #7813

wants to merge 19 commits into from

Conversation

yhuai
Copy link
Contributor

@yhuai yhuai commented Jul 31, 2015

This PR adds a base aggregation iterator AggregationIterator, which is used to create SortBasedAggregationIterator (for sort-based aggregation) and UnsafeHybridAggregationIterator (first it tries hash-based aggregation and falls back to the sort-based aggregation (using external sorter) if we cannot allocate memory for the map). With these two iterators, we will not need existing iterators and I am removing those. Also, we can use a single physical Aggregate operator and it internally determines what iterators to used.

https://issues.apache.org/jira/browse/SPARK-9240

@SparkQA
Copy link

SparkQA commented Jul 31, 2015

Test build #39133 has finished for PR 7813 at commit 84ceb3a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 31, 2015

Test build #39171 has finished for PR 7813 at commit 1912097.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Copy link
Contributor Author

yhuai commented Jul 31, 2015

Seems even if we use a single element of an array or a struct (e.g. array[0]) in the grouping expression, we do not get the item until we start the aggregate operator, which caused the problem of java.lang.UnsupportedOperationException: Not supported DataType: ArrayType(StringType,true).

@yhuai
Copy link
Contributor Author

yhuai commented Jul 31, 2015

The failed query is SELECT value, myCol from (SELECT key, array(value[0]) AS value FROM tmp_pyang_src_rcfile GROUP BY value[0], key) a LATERAL VIEW explode(value) myTable AS myCol;

@SparkQA
Copy link

SparkQA commented Jul 31, 2015

Test build #39240 has finished for PR 7813 at commit 6cedb51.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 3, 2015

Test build #39480 has finished for PR 7813 at commit 964f88b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 3, 2015

Test build #39481 has finished for PR 7813 at commit fee0eef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai yhuai changed the title [SPARK-9240] [SQL] [WIP] Hybrid aggregate operator using unsafe row [SPARK-9240] [SQL] Hybrid aggregate operator using unsafe row Aug 3, 2015
@@ -50,7 +50,7 @@ import scala.collection.JavaConversions._
object TestHive
extends TestHiveContext(
new SparkContext(
System.getProperty("spark.sql.test.master", "local[32]"),
System.getProperty("spark.sql.test.master", "local[2]"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change this back to 32

@@ -409,6 +409,9 @@ private[spark] object SQLConf {
val USE_SQL_AGGREGATE2 = booleanConf("spark.sql.useAggregate2",
defaultValue = Some(true), doc = "<TODO>")

val USE_HYBRID_AGGREGATE = booleanConf("spark.sql.aggregate.useHybridAggregate",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we want a config flag here?

@SparkQA
Copy link

SparkQA commented Aug 3, 2015

Test build #39497 has finished for PR 7813 at commit 21fd15f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 3, 2015

Test build #39498 has finished for PR 7813 at commit 0f1b06f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
@SparkQA
Copy link

SparkQA commented Aug 3, 2015

Test build #39502 has finished for PR 7813 at commit c9cf3b6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case (None, None) => (currentBuffer: MutableRow, row: InternalRow) => {}

case other =>
sys.error(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is an error case -- we should throw IllegalStateException, and make it clear that if we hit this path, it's a bug.

Right now it sounds as if this operator just cannot handle a legitimate case.

@SparkQA
Copy link

SparkQA commented Aug 3, 2015

Test build #39509 has finished for PR 7813 at commit 74d93c5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 3, 2015

Test build #39517 has finished for PR 7813 at commit e317e2b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Copy link
Contributor

rxin commented Aug 3, 2015

I'm going to merge this. I think this needs more refactoring, but we can do those as follow-ups.

@asfgit asfgit closed this in 1ebd41b Aug 3, 2015

// The value of the input KV Iterator has the format of groupingExprs + aggregation buffer.
// We need to project the aggregation buffer out.
private def projectInputBufferToUnsafe(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should just remove this function and inline it. We don't want an extra iterator overhead to process the rows.

Each iterator actually adds a lot of overhead, and here it doesn't buy you any code reduction (on the contrary it increases complexity due to the extra abstraction).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

asfgit pushed a commit that referenced this pull request Aug 6, 2015
…w up)

This is the followup of #7813. It renames `HybridUnsafeAggregationIterator` to `TungstenAggregationIterator` and makes it only work with `UnsafeRow`. Also, I add a `TungstenAggregate` that uses `TungstenAggregationIterator` and make `SortBasedAggregate` (renamed from `SortBasedAggregate`) only works with `SafeRow`.

Author: Yin Huai <yhuai@databricks.com>

Closes #7954 from yhuai/agg-followUp and squashes the following commits:

4d2f4fc [Yin Huai] Add comments and free map.
0d7ddb9 [Yin Huai] Add TungstenAggregationQueryWithControlledFallbackSuite to test fall back process.
91d69c2 [Yin Huai] Rename UnsafeHybridAggregationIterator to  TungstenAggregateIteraotr and make it only work with UnsafeRow.
asfgit pushed a commit that referenced this pull request Aug 6, 2015
…w up)

This is the followup of #7813. It renames `HybridUnsafeAggregationIterator` to `TungstenAggregationIterator` and makes it only work with `UnsafeRow`. Also, I add a `TungstenAggregate` that uses `TungstenAggregationIterator` and make `SortBasedAggregate` (renamed from `SortBasedAggregate`) only works with `SafeRow`.

Author: Yin Huai <yhuai@databricks.com>

Closes #7954 from yhuai/agg-followUp and squashes the following commits:

4d2f4fc [Yin Huai] Add comments and free map.
0d7ddb9 [Yin Huai] Add TungstenAggregationQueryWithControlledFallbackSuite to test fall back process.
91d69c2 [Yin Huai] Rename UnsafeHybridAggregationIterator to  TungstenAggregateIteraotr and make it only work with UnsafeRow.

(cherry picked from commit 3504bf3)
Signed-off-by: Reynold Xin <rxin@databricks.com>
kiszk pushed a commit to kiszk/spark-gpu that referenced this pull request Dec 26, 2015
…w up)

This is the followup of apache/spark#7813. It renames `HybridUnsafeAggregationIterator` to `TungstenAggregationIterator` and makes it only work with `UnsafeRow`. Also, I add a `TungstenAggregate` that uses `TungstenAggregationIterator` and make `SortBasedAggregate` (renamed from `SortBasedAggregate`) only works with `SafeRow`.

Author: Yin Huai <yhuai@databricks.com>

Closes #7954 from yhuai/agg-followUp and squashes the following commits:

4d2f4fc [Yin Huai] Add comments and free map.
0d7ddb9 [Yin Huai] Add TungstenAggregationQueryWithControlledFallbackSuite to test fall back process.
91d69c2 [Yin Huai] Rename UnsafeHybridAggregationIterator to  TungstenAggregateIteraotr and make it only work with UnsafeRow.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants