
[FLINK-28513] Fix Flink Table API CSV streaming sink throws SerializedThrowable exception #21458

Merged
merged 1 commit into from
Sep 4, 2023

Conversation

Samrat002
Contributor

@Samrat002 Samrat002 commented Dec 6, 2022

What is the purpose of the change

CsvBulkWriter calls the sync() function at close time. sync() works for file systems that are syncable in nature, such as HDFS. S3 currently does not support any sync() function.

Brief change log

This change modifies the sync() method to flush all buffered data, close the file, and commit the write.
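The flush-then-commit behavior described above can be sketched as follows. This is an illustrative sketch only: the class and method names are hypothetical stand-ins, not the actual Flink CsvBulkWriter or S3 writer internals.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch (not the real Flink classes): a bulk writer whose
 * close path flushes all buffered data and commits it, rather than relying
 * on a sync() call that the underlying file system may not support.
 */
class SketchCsvBulkWriter {
    private final StringBuilder buffer = new StringBuilder();
    private final List<String> committedParts = new ArrayList<>(); // stands in for S3 objects

    void addRow(String... fields) {
        buffer.append(String.join(",", fields)).append('\n');
    }

    /** Flush buffered data and commit the write, instead of calling sync(). */
    void close() {
        if (buffer.length() > 0) {
            committedParts.add(buffer.toString()); // "upload" the remaining bytes
            buffer.setLength(0);                   // buffer is now empty
        }
    }

    String committedData() {
        return String.join("", committedParts);
    }
}
```

Note that close() here is idempotent: a second call finds an empty buffer and commits nothing further, which matters when a close path can be reached more than once.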

Verifying this change

  1. Added a unit test for sync().
  2. Verified on a sample EMR cluster.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no) no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no) no
  • The serializers: (yes / no / don't know) no
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know) no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know) no
  • The S3 file system connector: (yes / no / don't know) no

Documentation

  • Does this pull request introduce a new feature? (yes / no) no
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented) no

@Samrat002 Samrat002 marked this pull request as draft December 6, 2022 07:24
@flinkbot
Collaborator

flinkbot commented Dec 6, 2022

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run azure — re-run the last Azure build

@Samrat002 Samrat002 changed the title [FLINK-28513][hotfix] stream.sync is not supported for all fileformats [FLINK-28513][hotfix] stream.sync is not supported for s3 fileformat Dec 7, 2022
@Samrat002 Samrat002 marked this pull request as ready for review December 8, 2022 07:22
Comment on lines 129 to 139

    fileStream.sync();
    // for s3 there is no sync supported.
    // instead calling persist() to put data into s3.
    persist();
    }
Contributor
It does not look like these are equivalent. It seems that .sync() is blocking while persist() is async. Is there a way to wait for persist() to complete so that the semantics are retained here?

Also, no tests failed or were added for this change. Can we add a test, please?
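If persist() were indeed asynchronous, one way to retain the blocking semantics of fileStream.sync() would be to block on the upload's completion before returning. The sketch below is hypothetical: the names and the future-returning persist signature are illustrative assumptions, not the actual Flink S3 writer API.

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

/**
 * Hypothetical sketch: sync() waits for an asynchronous persist to finish,
 * so callers still observe blocking behavior. Names are illustrative only.
 */
class BlockingPersistSketch {
    final StringBuilder store = new StringBuilder(); // stands in for S3

    CompletableFuture<Void> persistAsync(String data) {
        // simulated asynchronous upload to the object store
        return CompletableFuture.runAsync(() -> store.append(data));
    }

    void sync(String buffered) throws IOException {
        try {
            persistAsync(buffered).get(); // block until the upload completes
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException("interrupted while persisting", e);
        } catch (ExecutionException e) {
            throw new IOException("persist failed", e.getCause());
        }
    }
}
```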

Contributor Author

Added a test and validated manually on an EMR cluster by writing CSV data (10 GB and 70 GB) to S3.

@Samrat002 Samrat002 changed the title [FLINK-28513][hotfix] stream.sync is not supported for s3 fileformat [FLINK-28513] Fix Flink Table API CSV streaming sink throws SerializedThrowable exception May 1, 2023
@Samrat002
Contributor Author

Samrat002 commented May 1, 2023

Made changes and added a test.
@dannycranmer please review when you have time.

@Samrat002 Samrat002 requested a review from hlteoh37 May 4, 2023 04:18
Contributor

@hlteoh37 hlteoh37 left a comment


LGTM!

@Samrat002
Contributor Author

@dannycranmer please review when you have time.

@@ -126,7 +126,16 @@ public long getPos() throws IOException {

    @Override
    public void sync() throws IOException {
        fileStream.sync();
Contributor

@Samrat002 this is concerning me: "The S3 file system connector: (yes / no / don't know) maybe". When is this method called? Is it possible that we violate the semantics of the 2-phase-commit File Sink here?

Contributor Author

The sync() method is called along the following path:

  1. S3RecoverableWriter
  2. FlinkS3FileSystem creates a new instance of S3RecoverableWriter when its createRecoverableWriter() method is called.
  3. CsvBulkWriter uses FlinkS3FileSystem and calls the recoverable writer.
  4. BulkWriter

This change will not alter any processing guarantees.

In the current change, the sync() method first takes the lock, then calls the filesystem flush and commits the remaining blocks (writes them to S3). This flow preserves exactly-once semantics. The same code flow is implemented in AzureBlobFsRecoverableDataOutputStream.

From the class BlockBlobAppendStream:

    public void hsync() throws IOException {
        if (this.compactionEnabled) {
            this.flush();
        }

    }
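The lock-then-flush-then-commit flow described in the comment above can be sketched like this. The names are hypothetical stand-ins for illustration, not the actual Flink S3 recoverable stream implementation.

```java
/**
 * Illustrative sketch of the described flow: sync() takes the lock first,
 * then flushes the buffer and commits the remaining blocks to the store.
 * Names are hypothetical, not the real Flink S3 stream classes.
 */
class LockedSyncSketch {
    private final Object lock = new Object();
    private final StringBuilder buffer = new StringBuilder();
    private final StringBuilder committedBlocks = new StringBuilder(); // stands in for S3

    void write(String data) {
        synchronized (lock) {
            buffer.append(data);
        }
    }

    void sync() {
        synchronized (lock) {               // take the lock first
            committedBlocks.append(buffer); // flush and commit remaining blocks
            buffer.setLength(0);
        }
    }

    String committed() {
        return committedBlocks.toString();
    }
}
```

Because both write() and sync() serialize on the same lock, no record can slip in between the flush and the commit, which is what keeps the flow consistent with exactly-once delivery at a checkpoint boundary.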

@Samrat002 Samrat002 force-pushed the FLINK-28513 branch 2 times, most recently from 3ecd545 to dd9b2db Compare August 4, 2023 15:22
@Samrat002
Contributor Author

@dannycranmer please review when you have time.

@Samrat002
Contributor Author

I have taken an example where a datagen table is created with two fields, fname and lname. I also created another table of type filesystem that points to a specific S3 path and uses the csv format.

 -- create a generator table
CREATE TABLE generator (
    fname STRING,
    lname STRING
) WITH (
  'connector' = 'datagen'
  
);

-- create a sample dynamic table with connector filesystem. It supports csv as format. 
CREATE TABLE `name_table` (
  `fname` STRING,
  `lname` STRING
) with (
'connector'='filesystem',
'format' = 'csv',
'path' = 's3://dbsamrat-flink-dev/data/default/name_table'
);

-- run a job to insert data in table (s3)
insert into name_table select * from generator;

Below is the flink-conf file used for the cluster (these configs are also picked up by the job).

Attaching the jobmanager log for the insertion of data into the CSV-formatted S3 path, which uses CsvBulkWriter and maintains the 2-phase commit.
jobmanager.log

It can be noted that the 2-phase commit happens at the checkpoint trigger.

An additional job was executed separately to read data from name_table.
count_jobmanager.log

@dannycranmer @hlteoh37 please review whether this satisfies the exactly-once guarantee.

@hlteoh37
Contributor

hlteoh37 commented Sep 4, 2023

Ok this looks good to me. Thanks for fixing and testing @Samrat002

Contributor

@dannycranmer dannycranmer left a comment


Thanks for the deep dive @Samrat002

@Samrat002 Samrat002 deleted the FLINK-28513 branch September 4, 2023 16:10
@MartijnVisser
Contributor

In hindsight I'm quite concerned that we merged this without any change to the tests. We run nightly tests for the FileSink and StreamingFileSink against S3. Why did those not fail? Why didn't we improve them before merging this?

@hlteoh37
Contributor

hlteoh37 commented Jan 2, 2024

Thanks for flagging, @MartijnVisser. I agree that it would be good to update the tests to reflect this discovered bug in the S3 filesystem integration. I had forgotten that we have a test suite for the S3 filesystem integration!

I see it has already been flagged in the newer PR #23725 (review). Let's use that JIRA + PR to track the test suite updates.
