HIVE-26184: COLLECT_SET with GROUP BY is very slow when some keys are highly skewed #3253

Merged
kgyrtkirk merged 1 commit into apache:master from okumin:HIVE-26184-collect-set on Jun 13, 2022
okumin (Contributor) commented Apr 28, 2022

What changes were proposed in this pull request?

This change reduces the time complexity of COLLECT_SET from O({maximum length} * {num rows}) to O({maximum length} + {num rows}).

https://issues.apache.org/jira/browse/HIVE-26184
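
To make the complexity claim concrete, here is a minimal Java sketch. It is not the patch itself, and none of the names below come from Hive. It assumes, purely for illustration, that the multiplicative cost comes from copying the accumulated buffer once per input row, and contrasts that with adding rows in place and copying the buffer once at the end.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; none of these names come from Hive.
class CollectSetSketch {

    /** The collected values plus the number of element copies performed. */
    static final class Result {
        final List<String> values;
        final long elementCopies;

        Result(List<String> values, long elementCopies) {
            this.values = values;
            this.elementCopies = elementCopies;
        }
    }

    /** Assumed slow pattern: snapshot the whole buffer after every row. */
    static Result copyPerRow(List<String> rows) {
        Set<String> buffer = new LinkedHashSet<>();
        List<String> snapshot = new ArrayList<>();
        long copies = 0;
        for (String row : rows) {
            buffer.add(row);
            snapshot = new ArrayList<>(buffer); // costs O(current buffer size)
            copies += snapshot.size();
        }
        return new Result(snapshot, copies);
    }

    /** Linear pattern: add each row in place, then copy the buffer once. */
    static Result copyOnce(List<String> rows) {
        Set<String> buffer = new LinkedHashSet<>();
        for (String row : rows) {
            buffer.add(row); // amortized O(1) per row
        }
        List<String> result = new ArrayList<>(buffer);
        return new Result(result, result.size());
    }

    public static void main(String[] args) {
        // One heavily skewed key with 1000 distinct values.
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            rows.add("value-" + i);
        }
        System.out.println(copyPerRow(rows).elementCopies); // prints 500500
        System.out.println(copyOnce(rows).elementCopies);   // prints 1000
    }
}
```

Both methods return the same set of values. Only the number of element copies differs, and under this assumption the gap grows with the size of the largest group, which is why skewed keys would be hit hardest.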

Why are the changes needed?

I have observed some reducers taking a long time because of this issue.

Does this PR introduce any user-facing change?

No

How was this patch tested?

I have run the reproduction case in HIVE-26184 with this patch and confirmed the reduce vertex finished more than 30x faster.

okumin (Contributor, Author) commented May 9, 2022

CI failed, but the failure does not appear to be caused by this PR.

[2022-05-08T14:33:40.267Z] [ERROR] Failures: 
[2022-05-08T14:33:40.267Z] [ERROR]   TestRpc.testServerPort:234 Port should match configured one:22 expected:<32951> but was:<22>

DongWei-4 pushed a commit to DongWei-4/hive that referenced this pull request Oct 28, 2022
… highly skewed (apache#3253) (okumin reviewed by Zoltan Haindrich)
dengzhhu653 pushed a commit to dengzhhu653/hive that referenced this pull request Dec 15, 2022
… highly skewed (apache#3253) (okumin reviewed by Zoltan Haindrich)
3 participants