
[fix](spark load)partition column is not duplicate key, spark load IndexOutOfBounds error #14661

Merged · 4 commits · Nov 29, 2022

Conversation

@xiaoDjun (Contributor) commented on Nov 29, 2022

Proposed changes

Issue Number: close #14600

Problem summary

In Spark Load, the job fails when the partition column is not in the duplicate key column list.

We traced SparkDpp.java and found that the partition columns were built using the key indexes from the base index metadata. When a partition column is not a key column, fetching it through DppColumns therefore throws an IndexOutOfBoundsException.

The exception is:

java.lang.IndexOutOfBoundsException: Index: 3, Size: 1 
at java.util.ArrayList.rangeCheck(ArrayList.java:657) 
at java.util.ArrayList.get(ArrayList.java:433) 
at org.apache.doris.load.loadv2.dpp.DppColumns.<init>(DppColumns.java:39) 
at org.apache.doris.load.loadv2.dpp.DorisRangePartitioner.getPartition(DorisRangePartitioner.java:56) 
at org.apache.doris.load.loadv2.dpp.SparkDpp$2.call(SparkDpp.java:487) 
at org.apache.doris.load.loadv2.dpp.SparkDpp$2.call(SparkDpp.java:460) 
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$3$1.apply(JavaRDDLike.scala:143) 
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$3$1.apply(JavaRDDLike.scala:143) 
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) 
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) 
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) 
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) 
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) 
at org.apache.spark.scheduler.Task.run(Task.scala:109) 
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) 
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748) 
Driver stacktrace:

Checklist (Required)

  1. Does it affect the original behavior:
    • Yes
    • No
    • I don't know
  2. Have unit tests been added:
    • Yes
    • No
    • No Need
  3. Has documentation been added or modified:
    • Yes
    • No
    • No Need
  4. Does it need to update dependencies:
    • Yes
    • No
  5. Are there any changes that cannot be rolled back:
    • Yes (If Yes, please explain WHY)
    • No

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

@xiaoDjun changed the title from "[fix](spark load)partition column is not duplicate key, spark load IndexOutOfBoundsException error" to "[fix](spark load)partition column is not duplicate key, spark load IndexOutOfBounds error" on Nov 29, 2022
@hf200012 (Contributor) previously approved these changes on Nov 29, 2022:

LGTM

@github-actions bot commented:

PR approved by at least one committer and no changes requested.

github-actions bot added the "approved" and "reviewed" labels on Nov 29, 2022
@github-actions bot commented:

PR approved by anyone and no changes requested.

github-actions bot removed the "approved" label on Nov 29, 2022

@hf200012 added the "area/spark-load", "usercase", and "dev/1.1.5" labels on Nov 29, 2022
@hf200012 (Contributor) left a comment:

LGTM

@github-actions bot commented:

PR approved by at least one committer and no changes requested.

github-actions bot added the "approved" label on Nov 29, 2022
@hello-stephen (Contributor) commented:

TeamCity pipeline, ClickBench performance test result:
the sum of best hot time: 35.61 seconds
load time: 486 seconds
storage size: 17123333158 Bytes
https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/tmp/20221129051721_clickbench_pr_54806.html

@yiguolei merged commit c5f9fd5 into apache:master on Nov 29, 2022
@yiguolei mentioned this pull request on Dec 13, 2022
Labels: approved, area/spark-load, dev/1.1.5-merged, reviewed, usercase
Development

Successfully merging this pull request may close these issues.

[Bug] Spark load failed when partition column is not a key in duplicate model
4 participants