
[fix](spark load)partition column is not duplicate key, spark load IndexOutOfBounds error #14661

Merged · 4 commits · Nov 29, 2022

Conversation

@xiaoDjun (Contributor) commented on Nov 29, 2022

Proposed changes

Issue Number: close #14600

Problem summary

In Spark Load, the job fails when the partition column is not in the duplicate key column list.

We traced SparkDpp.java and found that the partition columns were built using the key indexes from the base index metadata. When a partition column is not a key column, fetching it through DppColumns therefore throws an IndexOutOfBoundsException.

The exception is:

java.lang.IndexOutOfBoundsException: Index: 3, Size: 1 
at java.util.ArrayList.rangeCheck(ArrayList.java:657) 
at java.util.ArrayList.get(ArrayList.java:433) 
at org.apache.doris.load.loadv2.dpp.DppColumns.<init>(DppColumns.java:39) 
at org.apache.doris.load.loadv2.dpp.DorisRangePartitioner.getPartition(DorisRangePartitioner.java:56) 
at org.apache.doris.load.loadv2.dpp.SparkDpp$2.call(SparkDpp.java:487) 
at org.apache.doris.load.loadv2.dpp.SparkDpp$2.call(SparkDpp.java:460) 
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$3$1.apply(JavaRDDLike.scala:143) 
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$3$1.apply(JavaRDDLike.scala:143) 
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) 
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) 
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) 
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) 
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) 
at org.apache.spark.scheduler.Task.run(Task.scala:109) 
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) 
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748) 
Driver stacktrace:

Checklist (Required)

  1. Does it affect the original behavior:
    • Yes
    • No
    • I don't know
  2. Have unit tests been added:
    • Yes
    • No
    • No Need
  3. Has documentation been added or modified:
    • Yes
    • No
    • No Need
  4. Does it need to update dependencies:
    • Yes
    • No
  5. Are there any changes that cannot be rolled back:
    • Yes (If Yes, please explain WHY)
    • No

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

@xiaoDjun changed the title from "[fix](spark load)partition column is not duplicate key, spark load IndexOutOfBoundsException error" to "[fix](spark load)partition column is not duplicate key, spark load IndexOutOfBounds error" on Nov 29, 2022
@hf200012 (Contributor) previously approved these changes on Nov 29, 2022:

LGTM

@github-actions bot commented:

PR approved by at least one committer and no changes requested.

github-actions bot added the "approved" and "reviewed" labels on Nov 29, 2022
@github-actions bot commented:

PR approved by anyone and no changes requested.

github-actions bot removed the "approved" label on Nov 29, 2022

@hf200012 added the "area/spark-load", "usercase", and "dev/1.1.5" labels on Nov 29, 2022
@hf200012 (Contributor) left a comment:

LGTM

@github-actions bot commented:

PR approved by at least one committer and no changes requested.

github-actions bot added the "approved" label on Nov 29, 2022
@hello-stephen (Contributor) commented:

TeamCity pipeline, ClickBench performance test result:
the sum of best hot time: 35.61 seconds
load time: 486 seconds
storage size: 17123333158 Bytes
https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/tmp/20221129051721_clickbench_pr_54806.html

@yiguolei merged commit c5f9fd5 into apache:master on Nov 29, 2022
@yiguolei mentioned this pull request on Dec 13, 2022
Labels: approved, area/spark-load, dev/1.1.5-merged, reviewed, usercase
Development

Successfully merging this pull request may close these issues.

[Bug] Spark load failed when partition column is not a key in duplicate model
4 participants