[FLINK-28513] Fix Flink Table API CSV streaming sink throws SerializedThrowable exception #21458
Conversation
Force-pushed from fb0fead to 34fa290.
```java
fileStream.sync();
// for s3 there is no sync supported.
// instead calling persist() to put data into s3.
persist();
}
```
It does not look like these are equivalent. It seems as though `.sync()` is blocking and `persist()` is async. Is there a way to wait for persist to complete, to retain the semantics here?
Also, no tests failed or were added for this change. Can we add a test please?
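If `persist()` really were asynchronous, one hedged way to keep the blocking contract of `sync()` would be to wait on a completion handle. This is a self-contained sketch, not Flink code: `BlockingSyncSketch`, `persistAsync()`, and the executor are hypothetical names used only for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: an async persist whose completion sync() blocks on,
// so callers of sync() still observe blocking semantics.
public class BlockingSyncSketch {
    private final ExecutorService uploader = Executors.newSingleThreadExecutor();
    private final AtomicBoolean persisted = new AtomicBoolean(false);

    // Stand-in for an asynchronous persist() that uploads buffered bytes.
    CompletableFuture<Void> persistAsync() {
        return CompletableFuture.runAsync(() -> persisted.set(true), uploader);
    }

    // sync() keeps its blocking contract by waiting for the upload to finish.
    public void sync() throws Exception {
        persistAsync().get(); // blocks until the persist future completes
        uploader.shutdown();
    }

    public boolean isPersisted() {
        return persisted.get();
    }
}
```

After `sync()` returns, the data is guaranteed to have been "persisted" in this toy model, which is the semantic the reviewer is asking to retain.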
Added a test and validated manually on an EMR cluster by writing CSV data (10 GB and 70 GB) to S3.
Made the changes and added the test.
.../src/test/java/org/apache/flink/fs/s3/common/writer/S3RecoverableFsDataOutputStreamTest.java
LGTM!
@dannycranmer please review whenever you have time.
@@ -126,7 +126,16 @@ public long getPos() throws IOException {

```java
@Override
public void sync() throws IOException {
    fileStream.sync();
```
@Samrat002 this concerns me: "The S3 file system connector: (yes / no / don't know) maybe". When is this method called? Is it possible we can violate the semantics of the 2-phase commit File Sink here?
The `sync` method is called in the following scenarios:
- S3RecoverableWriter: FlinkS3FileSystem creates a new instance of S3RecoverableWriter when its createRecoverableWriter() method is called.
- CsvBulkWriter uses FlinkS3FileSystem and calls the recoverable writer through BulkWriter.

This change will not alter any processing guarantee. In the new sync() implementation, the method first takes the lock, then flushes the file system and commits the remaining blocks (writes them to S3). This flow results in exactly-once semantics. The same code flow is implemented for AzureBlobFsRecoverableDataOutputStream.
From the class BlockBlobAppendStream:

```java
public void hsync() throws IOException {
    if (this.compactionEnabled) {
        this.flush();
    }
}
```
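The lock-then-flush-then-commit flow described above can be sketched in plain Java. This is a hedged illustration only: `SyncFlowSketch` and its in-memory buffers are hypothetical stand-ins for the real file stream and the S3 upload, not the actual Flink implementation.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the described sync() flow:
// take the lock, flush buffered data, then commit the remaining blocks.
public class SyncFlowSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private final ByteArrayOutputStream committed = new ByteArrayOutputStream();

    public void write(byte[] data) {
        lock.lock();
        try {
            buffer.write(data, 0, data.length); // buffered, not yet durable
        } finally {
            lock.unlock();
        }
    }

    // sync(): lock -> flush -> commit remaining blocks (stand-in for S3 write).
    public void sync() throws IOException {
        lock.lock();
        try {
            buffer.flush();
            committed.write(buffer.toByteArray()); // "commit" the buffered blocks
            buffer.reset();
        } finally {
            lock.unlock();
        }
    }

    public int committedBytes() {
        return committed.size();
    }
}
```

Holding the lock across flush and commit is what makes the flow atomic with respect to concurrent writers, which is the property the exactly-once argument relies on.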
Force-pushed from 3ecd545 to dd9b2db.
@dannycranmer please review whenever you have time.
I have taken an example where a datagen table is created with 2 fields.
Here is the flink-conf file used for the cluster (these configs are also picked up by the job). Attaching the jobmanager log for the insertion of data into a CSV-formatted S3 path, which uses CsvBulkWriter and maintains the 2-phase commit. It can be noted that the 2-phase commit happens at checkpoint trigger. An additional job was executed separately to read the data back from name_table. @dannycranmer @hlteoh37 please review whether this satisfies the exactly-once guarantee.
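The two-phase commit behaviour being verified here can be sketched without Flink. This is a hedged illustration only: `TwoPhaseCommitSketch` and its lists are hypothetical stand-ins for pending part files and the committed, reader-visible S3 objects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of two-phase commit at checkpoint time:
// writes land in a "pending" part; the checkpoint triggers the commit
// phase that makes pending data visible to readers.
public class TwoPhaseCommitSketch {
    private final List<String> pending = new ArrayList<>();
    private final List<String> committed = new ArrayList<>();

    // Phase 1 (pre-commit): rows are written but not yet visible.
    public void write(String row) {
        pending.add(row);
    }

    // Phase 2 (commit): a checkpoint promotes pending rows to committed.
    public void onCheckpoint() {
        committed.addAll(pending);
        pending.clear();
    }

    public List<String> visibleRows() {
        return committed;
    }
}
```

A separate reading job, as described above, would only ever observe rows that survived a checkpoint, which is what gives exactly-once output.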
Ok this looks good to me. Thanks for fixing and testing @Samrat002 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the deep dive @Samrat002
In hindsight I'm quite concerned that we have merged this without any change to the tests. We run nightly tests for the FileSink and StreamingFileSink against S3. Why have those not failed? Why haven't we made improvements to them before merging this in?
Thanks for flagging @MartijnVisser. I'd agree that it would be good to update the tests to reflect this discovered bug in the Filesystem S3 integration. I had forgotten that we have a test suite for S3 Filesystem integration! I see it has already been flagged in the newer PR #23725 (review). Let's use that JIRA + PR to track the test suite updates.
What is the purpose of the change
CsvBulkWriter calls the sync() function at close time. sync() works for file systems that are syncable in nature, such as HDFS. S3 currently does not support any sync() function.

Brief change log

This change modifies the sync() method to flush all buffered data, close the file, and commit the write.

Verifying this change
Does this pull request potentially affect one of the following parts:
- @Public(Evolving): (yes / no) no

Documentation