[FLINK-39778][s3] Recoverable writer silently loses the in-flight tail on resume#28268
[FLINK-39778][s3] Recoverable writer silently loses the in-flight tail on resume#28268Samrat002 wants to merge 2 commits into
Conversation
Izeren
left a comment
There was a problem hiding this comment.
I haven't reviewed all tests in details yet, but I would like first to understand better the logic about partially uploaded subparts. My initial impression from the FLIP was that we would like to store incomplete parts in-line in state and treat part upload as atomic operation. Did we choose not to do that? I wonder because I am not sure in reliability of incomplete parts. Are they subject to the same lifecycle policies as incomplete MPUs or some different policy?
| s3AccessHelper.getObject(s3recoverable.incompleteObjectName(), target); | ||
| if (downloaded != s3recoverable.incompleteObjectLength()) { | ||
| throw new IOException( | ||
| "Incomplete-tail object " |
There was a problem hiding this comment.
This exception doesn't tell what are the implications. Does it mean that state is corrupted and can't be recovered unless object on S3 is restored? If so, would be useful to explain it. Would help both oncall engineer and to classify such errors correctly (retriable/non-retriable)
| * <p><b>Thread safety:</b> not thread-safe. Use a single thread per instance, matching the | ||
| * single-thread invariant of the production {@link NativeS3RecoverableFsDataOutputStream}. | ||
| */ | ||
| public final class InMemoryNativeS3Operations extends NativeS3ObjectOperations { |
There was a problem hiding this comment.
I think it would it be better to have NativeS3ObjectOperations as interface otherwise you still would need to bring in all sdk based implementation dependencies.
There was a problem hiding this comment.
Agreed. An interface would make the test seam cleaner and keep SDK types off the test classpath. The reason I didn't do it here is scope:
NativeS3ObjectOperations returns SDK types directly (CompletedPart, UploadPartResponse, PartETag-like records), so introducing an interface means either
(a) extracting Flink-owned DTOs to replace those return types across NativeS3RecoverableFsDataOutputStream / NativeS3Committer / NativeS3RecoverableWriter, or
(b) leaving the SDK types in the interface signatures , which doesn't actually remove the dependency.
Both options are meaningful refactors that I'd rather not bundle into a data-loss bugfix. For now the test subclass passes null for the SDK ctor args and overrides every method it touches, so no SDK client is constructed at test time (the SDK is only on the compile classpath, which it already is for main code). I'll file a follow-up to do the interface extraction properly — happy to take it on right after this lands. WDYT?
Checked https://github.com/localstack/localstack. It is not maintained anymore.
| * <p><b>Thread safety:</b> not thread-safe. Use a single thread per instance, matching the | ||
| * single-thread invariant of the production {@link NativeS3RecoverableFsDataOutputStream}. | ||
| */ | ||
| public final class InMemoryNativeS3Operations extends NativeS3ObjectOperations { |
There was a problem hiding this comment.
If this is meant to be used as test harness for FileSystem testing (as replacement of localStack). Arguably it is good to have tests for it too
| InMemoryNativeS3Operations s3 = new InMemoryNativeS3Operations(); | ||
| NativeS3RecoverableWriter writer1 = | ||
| NativeS3RecoverableWriter.writer( | ||
| s3, tmp.toString(), MIN_PART_SIZE, /* maxConcurrent */ 1); |
There was a problem hiding this comment.
What is the impact of concurrency on these incomplete subparts? How many of incomplete subparts can exist per file path at the same time?
There was a problem hiding this comment.
- Flink's RecoverableWriter contract gives a single writer instance ownership of a single open stream per file path. There is no concurrent writer for the same path.
- Each persist() call uploads a fresh side object under /.incomplete/ (unique per call, see NativeS3RecoverableFsDataOutputStream). So back-to-back persist()s do not race. they produce distinct objects.
- At any point in time, the number of live side objects for a given path equals the number of un-retired checkpoints that carried tail bytes for that file. When Flink retires a checkpoint it calls cleanupRecoverableState(), which deletes the corresponding side object
NativeS3RecoverableWriter#cleanupRecoverableState. - On recovery, only the side object referenced by the restored ResumeRecoverable is consulted. Older side objects from earlier checkpoints are independent. Flink's retention policy governs their lifetime, not the recover path.
So there's no concurrency on the side object itself. It's written once during persist(), read at most once during recover(), and deleted once during checkpoint retirement.
Yes, you are right. That was initially discussed. what we are observing at the scale of production, users don't really set policies. There are billions of MPU get accumulated and, leading to high cost. |
What is the purpose of the change
NativeS3RecoverableWriter.recover()silently discarded the sub-part-size tail thatpersist()had durably uploaded to S3 as a side object. After a crash-and-restore cycle, any bytes written since the last full-part boundary were permanently lost, violating Flink's exactly-once guarantee.This patch fixes the data loss by downloading the side object during
recover()and seeding the resumed output stream with those bytes before accepting further writes.Brief change log
recover()inNativeS3RecoverableWriternow downloads the incomplete-tail side object and seeds the resumed stream with those bytes before accepting new writes. AdownloadIncompleteTail()helper validates the length and cleans up the local file on failure.NativeS3RecoverableFsDataOutputStreamgains a resume constructor that opens the seed file in append mode so position accounting is correct from the start.Verifying this change
UT to showcase the bug and fix working
Does this pull request potentially affect one of the following parts:
@Public(Evolving): (yes / no) noDocumentation
Was generative AI tooling used to co-author this PR?