[HUDI-1720] StackOverflowError when querying the incremental view of a MOR table with many delete records via Spark SQL / Hive Beeline #2721
Conversation
Test steps, before patch:

Step 1: bulk_insert 1,000,000 rows (keyid from 0 to 1000000):
val df = spark.range(0, 1000000).toDF("keyid")
merge(df, 4, "default", "hive_9b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")

Step 2: delete 900,000 rows (keyid from 0 to 900000):
val df = spark.range(0, 900000).toDF("keyid")
delete(df, 4, "default", "hive_9b")

Step 3: query on beeline/spark-sql:
select count(col3) from hive_9b_rt

After patch: results of the delete function and merge function runs (original attachments not reproduced here).
@garyli1019 could you help me review this PR? Thanks.
Codecov Report
@@             Coverage Diff              @@
##             master     #2721       +/-  ##
=============================================
- Coverage      51.72%     9.40%   -42.33%
+ Complexity      3601        48     -3553
=============================================
  Files            476        54      -422
  Lines          22595      1989    -20606
  Branches        2409       236     -2173
=============================================
- Hits           11687       187    -11500
+ Misses          9889      1789     -8100
+ Partials        1019        13     -1006
Flags with carried forward coverage won't be shown.
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/RealtimeCompactedRecordReader.java
@garyli1019: fix the label as required (sev:critical or sev:high). Also fix the corresponding JIRA if applicable.
LGTM
Tips
What is the purpose of the pull request
Fix the StackOverflowError described in [HUDI-1720].
Currently, RealtimeCompactedRecordReader.next handles delete records by recursion, see:
hudi/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/RealtimeCompactedRecordReader.java
Line 106 in 6e803e0
However, when the log file contains many delete records, this recursive logic in RealtimeCompactedRecordReader.next leads to a StackOverflowError.
We can use a loop instead of recursion.
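To illustrate the idea, here is a minimal, self-contained sketch of the loop-based approach. The class, field, and method names below (SkippingRecordReader, Record, deltaLog) are hypothetical simplifications, not the actual Hudi code; the point is only that skipping deleted records with `continue` keeps the stack depth constant, whereas a recursive call adds one stack frame per consecutive delete.

```java
// Hypothetical simplified reader, illustrating the pattern only.
import java.io.IOException;
import java.util.Iterator;
import java.util.Map;

class SkippingRecordReader {
  private final Iterator<Record> baseIterator; // records from the base file
  private final Map<String, Record> deltaLog;  // merged log records keyed by record key

  SkippingRecordReader(Iterator<Record> baseIterator, Map<String, Record> deltaLog) {
    this.baseIterator = baseIterator;
    this.deltaLog = deltaLog;
  }

  // Recursive version (the problem): on a deleted record it would call `return next(out)`,
  // so a long run of deletes grows the call stack until StackOverflowError.
  // Loop version (the fix): `continue` past deleted records in place,
  // so stack depth stays constant no matter how many deletes the log contains.
  boolean next(Record out) throws IOException {
    while (baseIterator.hasNext()) {
      Record base = baseIterator.next();
      Record merged = deltaLog.get(base.key);
      if (merged != null && merged.isDelete) {
        continue; // skip the deleted record and read the next one instead of recursing
      }
      out.copyFrom(merged != null ? merged : base);
      return true;
    }
    return false; // no more records in the base file
  }

  static class Record {
    String key;
    boolean isDelete;
    Object value;

    void copyFrom(Record other) {
      this.key = other.key;
      this.isDelete = other.isDelete;
      this.value = other.value;
    }
  }
}
```

In the reproduction above, roughly 900,000 consecutive records are deletes, so the recursive version would need that many nested next() calls, which is what overflows the stack; the loop version handles them in a single frame.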
Brief change log
Verify this pull request
Manually verified the change by running a job locally
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.