
Commit f581e81

pan3793 authored and yaooqinn committed
[KYUUBI #1215][DOC] Document incremental collection
### _Why are the changes needed?_

Document incremental collection, close #1215

### _How was this patch tested?_

- [ ] Add some test cases that check the changes thoroughly including negative and positive cases if possible
- [x] Add screenshots for manual tests if appropriate

  <img width="1914" alt="1" src="https://user-images.githubusercontent.com/26535726/157191008-451bee2a-eb6b-4bb6-869b-5b0f75a21448.png">
  <img width="1919" alt="2" src="https://user-images.githubusercontent.com/26535726/157191016-e183bbf5-aa4a-496d-a250-5d14cf04101d.png">
  <img width="1920" alt="3" src="https://user-images.githubusercontent.com/26535726/157191026-343a39d7-0ab8-4886-9a51-3670394ef6be.png">

- [ ] [Run test](https://kyuubi.apache.org/docs/latest/develop_tools/testing.html#running-tests) locally before make a pull request

Closes #2057 from pan3793/doc.

Closes #1215

82677a0 [Cheng Pan] grammar
3be0e3f [Cheng Pan] fix
b467e97 [Cheng Pan] Update
d256cba [Cheng Pan] compress picture
b3c5fb6 [Cheng Pan] Fix
4c06307 [Cheng Pan] [KYUUBI #1215][DOC] Document incremental collection

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
(cherry picked from commit 54dfb4b)
Signed-off-by: Kent Yao <yao@apache.org>
1 parent 1cca417

File tree

3 files changed: +127 −0 lines changed
Lines changed: 126 additions & 0 deletions
@@ -0,0 +1,126 @@
<!--
 - Licensed to the Apache Software Foundation (ASF) under one or more
 - contributor license agreements. See the NOTICE file distributed with
 - this work for additional information regarding copyright ownership.
 - The ASF licenses this file to You under the Apache License, Version 2.0
 - (the "License"); you may not use this file except in compliance with
 - the License. You may obtain a copy of the License at
 -
 -   http://www.apache.org/licenses/LICENSE-2.0
 -
 - Unless required by applicable law or agreed to in writing, software
 - distributed under the License is distributed on an "AS IS" BASIS,
 - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 - See the License for the specific language governing permissions and
 - limitations under the License.
 -->

<div align=center>

![](../../imgs/kyuubi_logo.png)

</div>

# Solution for Big Result Sets

Typically, when a user submits a SELECT query to the Spark SQL engine, the Driver calls `collect` to trigger computation and
collect the entire data set from all tasks (a.k.a. partitions of an RDD). After all partition data has arrived, the
client pulls the result set from the Driver through the Kyuubi Server in small batches.

Therefore, for a query with a big result set, the bottleneck is the Spark Driver. To avoid OOM, Spark provides the configuration
`spark.driver.maxResultSize`, which defaults to `1g`; you should enlarge it, as well as `spark.driver.memory`, if your
query produces a result set of several GB. But what if the result set is dozens or even hundreds of GB? That is where
incremental collection mode helps.

## Incremental collection

Since v1.4.0-incubating, Kyuubi supports incremental collection mode as a solution for big result sets. This feature
is disabled by default; you can turn it on by setting the configuration `kyuubi.operation.incremental.collect` to `true`.

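For example, to enable the mode globally on the server side, the property can be placed in `$KYUUBI_HOME/conf/kyuubi-defaults.conf` (a sketch; the same property can also be set per connection or per session, as shown below):

```
# kyuubi-defaults.conf: enable incremental collection for all sessions
kyuubi.operation.incremental.collect=true
```
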
Incremental collection changes the gather method from `collect` to `toLocalIterator`. `toLocalIterator` is a Spark
action that sequentially submits Jobs to retrieve partitions. As each partition is retrieved, the client pulls
the result set from the Driver through the Kyuubi Server in a streaming manner. This significantly reduces the Driver
memory footprint, from the size of the complete result set to that of the largest partition.

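The difference between the two gather methods can be sketched in plain Python (a conceptual model of the memory behavior, not Spark code; `partitions` stands in for the partitions of an RDD):

```python
# Conceptual sketch: how `collect` and `toLocalIterator` differ in what the
# Driver must hold in memory at once. Not Spark's actual implementation.

def gather_collect(partitions):
    # collect: the Driver materializes every partition at once, so its peak
    # memory scales with the size of the complete result set.
    rows = []
    for part in partitions:
        rows.extend(part)
    return rows

def gather_to_local_iterator(partitions):
    # toLocalIterator: partitions are fetched sequentially, so the Driver holds
    # only one partition at a time; peak memory scales with the largest partition.
    for part in partitions:
        yield from part

partitions = [[0, 5], [2, 6, 7], [3], [4, 1], [8, 9]]
# Both strategies deliver the same rows; only the Driver-side buffering differs.
print(gather_collect(partitions) == list(gather_to_local_iterator(partitions)))
```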
Incremental collection is not a silver bullet; you should turn it on carefully, because it can significantly hurt
performance. And even in incremental collection mode, when multiple queries execute concurrently, each query still holds
one partition of data in Driver memory. Therefore, it is still important to control the number of concurrent queries to
avoid OOM.

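As a rough back-of-envelope estimate (an illustration, not a formula from Kyuubi), the Driver's peak result-buffer footprint under incremental collection grows with the number of concurrent queries:

```python
# Rough estimate: under incremental collection, each concurrent query holds at
# most one partition of its result set in Driver memory at a time, so the peak
# result-buffer footprint is roughly their product.
def driver_peak_result_mb(concurrent_queries: int, largest_partition_mb: int) -> int:
    return concurrent_queries * largest_partition_mb

print(driver_peak_result_mb(8, 512))  # 8 queries x 512 MB partitions -> 4096
```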
## Use in single connections

As explained above, incremental collection mode is not suitable for common query scenarios, so you can enable
it only for specific queries, for example:

```
beeline -u 'jdbc:hive2://kyuubi:10009/?spark.driver.maxResultSize=8g;spark.driver.memory=12g#kyuubi.engine.share.level=CONNECTION;kyuubi.operation.incremental.collect=true' \
    --incremental=true \
    -f big_result_query.sql
```

`--incremental=true` is required for the beeline client; otherwise, the entire result set is fetched and buffered before
being displayed, which may cause client-side OOM.

## Change incremental collection mode in session

The configuration `kyuubi.operation.incremental.collect` can also be changed using `SET` within a session.

```
~ beeline -u 'jdbc:hive2://localhost:10009'
Connected to: Apache Kyuubi (Incubating) (version 1.5.0-SNAPSHOT)

0: jdbc:hive2://localhost:10009/> set kyuubi.operation.incremental.collect=true;
+---------------------------------------+--------+
|                  key                  | value  |
+---------------------------------------+--------+
| kyuubi.operation.incremental.collect  | true   |
+---------------------------------------+--------+
1 row selected (0.039 seconds)

0: jdbc:hive2://localhost:10009/> select /*+ REPARTITION(5) */ * from range(1, 10);
+-----+
| id  |
+-----+
| 2   |
| 6   |
| 7   |
| 0   |
| 5   |
| 3   |
| 4   |
| 1   |
| 8   |
| 9   |
+-----+
10 rows selected (1.929 seconds)

0: jdbc:hive2://localhost:10009/> set kyuubi.operation.incremental.collect=false;
+---------------------------------------+--------+
|                  key                  | value  |
+---------------------------------------+--------+
| kyuubi.operation.incremental.collect  | false  |
+---------------------------------------+--------+
1 row selected (0.027 seconds)

0: jdbc:hive2://localhost:10009/> select /*+ REPARTITION(5) */ * from range(1, 10);
+-----+
| id  |
+-----+
| 2   |
| 6   |
| 7   |
| 0   |
| 5   |
| 3   |
| 4   |
| 1   |
| 8   |
| 9   |
+-----+
10 rows selected (0.128 seconds)
```

From the Spark UI, we can see that in incremental collection mode the query produces 5 jobs (in the red square), while in
normal mode it produces only 1 job (in the blue square).

![](../../imgs/spark/incremental_collection.png)

docs/deployment/spark/index.rst

Lines changed: 1 addition & 0 deletions

@@ -32,3 +32,4 @@ Even if you don't use Kyuubi, as a simple Spark user, I'm sure you'll find the n

   dynamic_allocation
   aqe
   incremental_collection
(binary image file, 116 KB)
