[CARBONDATA-4040] Fix data mismatch in case of compaction failure and retry success #3994
Conversation
@QiangCai : please check this.
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2876/
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4632/
retest this please
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4640/
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2884/
@@ -398,27 +398,29 @@ public static void mergeIndexAndWriteSegmentFile(CarbonTable carbonTable, String
 * @throws IOException
 */
public static String writeSegmentFile(CarbonTable carbonTable, String segmentId, String UUID,
for the update, it will have more than one timestamp, right?
Agree, let me check and modify.
I have pushed now. Waiting for the build.
@QiangCai : I have thought about it; I can handle this issue only for the non-update scenario. For update, there is currently no easy way to tell which files are stale and which are not. One way for update is to read the old segment file and take its content, plus the files whose timestamp is greater than the old segment file's timestamp.
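The update-case idea described above can be sketched as follows. This is a hedged illustration, not the actual CarbonData code: the method name `rebuildFileList` and its parameters are hypothetical, standing in for "old segment file content plus any file in the folder newer than the old segment file".

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the proposed update handling: start from the files
// recorded in the old segment file, then add only the files in the segment
// folder whose timestamp is newer than the old segment file's timestamp.
// Files older than the old segment file but not recorded in it are stale.
public class UpdateSegmentFileSketch {

  static Set<String> rebuildFileList(Set<String> oldSegmentFileContent,
      Map<String, Long> filesInFolderWithTimestamp, long oldSegmentFileTimestamp) {
    Set<String> result = new LinkedHashSet<>(oldSegmentFileContent);
    filesInFolderWithTimestamp.forEach((file, ts) -> {
      if (ts > oldSegmentFileTimestamp) {
        // Newer than the old segment file, so it was written by the current update.
        result.add(file);
      }
    });
    return result;
  }

  public static void main(String[] args) {
    Set<String> old = new LinkedHashSet<>(Arrays.asList("a.carbonindex"));
    Map<String, Long> folder = new LinkedHashMap<>();
    folder.put("a.carbonindex", 100L);      // already tracked by the old segment file
    folder.put("stale.carbonindex", 150L);  // stale leftover, older than the segment file
    folder.put("new.carbonindex", 300L);    // written by the current update
    // Prints [a.carbonindex, new.carbonindex]
    System.out.println(rebuildFileList(old, folder, 200L));
  }
}
```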
Build Failed with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/2903/
Build Failed with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/4659/
Hi, Ajantha. I have fixed this in #3999, please take a look.
@marchpure: I think your PR is not handling all the scenarios, so the update scenario and test case will fail (mine does too).
Yes, all the update test cases pass.
@marchpure : If you don't assign a UUID for all update cases or compaction-after-update cases, all test cases will pass. But the original data mismatch will still happen for a segment that has stale files and then goes through update and compaction (you can test that scenario).
Yes, I have tested it. In PR #3999, update will generate a new segment, like merge into, which can avoid recreating the segment file.
@marchpure : Yes, if update writes into a new segment, it can fix the issue. If you handle this in your PR, then I will close mine.
Handled in #3999, I will close this.
Why is this PR needed?
For compaction, we don't register the in-progress segment, so compaction can fail when it is unable to acquire the table status lock. In that case the partial compaction segment needs to be cleaned up. If cleaning up the partial segment also fails (unable to acquire the lock, or IO issues) and the user retries the compaction, Carbon reuses the same segment id. So, while writing the segment file for the new compaction, we must list only the files belonging to the current compaction, not all the files in the folder, which would include the stale files.
What changes were proposed in this PR?
While writing the segment file, consider only the index files in the segment folder that belong to the current load.
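The fix described above can be sketched as a filename filter. This is a minimal illustration, not the actual patch: the method `filterCurrentLoadIndexFiles` is hypothetical, and it assumes (as CarbonData index file names do) that the load timestamp is embedded in the file name, so stale index files from a previously failed compaction attempt carry a different timestamp.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch of the proposed fix: when writing the segment file
// after a compaction retry, keep only the index files whose name carries
// the current load's timestamp, skipping stale files left behind by a
// previous failed attempt on the same segment id.
public class SegmentIndexFilter {

  static List<String> filterCurrentLoadIndexFiles(List<String> filesInSegmentFolder,
      String currentLoadTimestamp) {
    return filesInSegmentFolder.stream()
        .filter(name -> name.endsWith(".carbonindex")
            || name.endsWith(".carbonindexmerge"))
        .filter(name -> name.contains(currentLoadTimestamp))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> files = Arrays.asList(
        "part-0-2_batchno0-0-2-1600000000000.carbonindex",  // stale, from the failed attempt
        "part-0-2_batchno0-0-2-1600000099999.carbonindex",  // written by the current retry
        "part-0-2_batchno0-0-2-1600000099999.carbondata");  // data file, not an index file
    // Prints [part-0-2_batchno0-0-2-1600000099999.carbonindex]
    System.out.println(filterCurrentLoadIndexFiles(files, "1600000099999"));
  }
}
```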
Does this PR introduce any user interface change?
Is any new test case added?