
[FLINK-14266][table] Introduce RowCsvInputFormat to new CSV module #9884

Merged

merged 5 commits into apache:master on Apr 7, 2020

Conversation

JingsongLi
Contributor

What is the purpose of the change

Currently we have an old CSV format, but it is not standard CSV support. We should support the RFC-compliant CSV format for Table/SQL.

Brief change log

Add RowCsvInputFormat and use Jackson's ObjectReader.readValues(InputStream). We need to deal with half-line reads when splitting large files into multiple splits. The difficulties are:

  1. ObjectReader does not know the current read offset, because it keeps a buffer to cache more bytes, but we need to stop at the right place when reading a FileSplit. We use a BoundedInputStream for this.
  2. For the half-line reads: in open, we look for the next line delimiter after the split start and discard the first partial line; at the split end, we look for the next delimiter past the boundary to complete the last line.
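As an illustration of point 2, the scan for the next line start can be sketched like this. This is a standalone, hypothetical sketch over an in-memory byte array, not the actual Flink code; the name findNextLineStartOffset is only illustrative:

```java
// Hypothetical sketch of the half-line handling described above.
public class SplitStartSketch {

    // Scan forward from 'from' in 'data' and return the offset of the first
    // byte AFTER the next line delimiter ('\n', '\r', or "\r\n").
    static int findNextLineStartOffset(byte[] data, int from) {
        for (int i = from; i < data.length; i++) {
            if (data[i] == '\n') {
                return i + 1;
            }
            if (data[i] == '\r') {
                // deal with "\r\n": the next byte may be '\n', so skip it too
                if (i + 1 < data.length && data[i + 1] == '\n') {
                    return i + 2;
                }
                return i + 1;
            }
        }
        // no delimiter found: the next line starts past the end of the data
        return data.length;
    }
}
```

In the real input format the scan would run over the (bounded) stream rather than a byte array, and a split starting at offset 0 skips the search entirely because the file begins at a line start.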

Verifying this change

RowCsvInputFormatTest and RowCsvInputFormatSplitTest

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? JavaDocs

@flinkbot
Collaborator

flinkbot commented Oct 12, 2019

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit a15a359 (Wed Dec 04 14:56:42 UTC 2019)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot
Collaborator

flinkbot commented Oct 12, 2019

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build
  • @flinkbot run azure re-run the last Azure build

Contributor

@twalthr twalthr left a comment

Thanks @JingsongLi for the PR. I added an initial set of comments. It would be great if we could further reduce the number of limitations. The CSV format is one of the most important batch connectors and should have a feature set similar to the (de)serialization schema. Otherwise we need to document a lot of limitations in descriptors and docs.

private int[] selectedFields;

/**
* Creates a CSV deserialization schema for the given {@link TypeInformation} and file paths
Contributor

Please make sure to update all comments if you copy code.

/**
* Test split logic for {@link RowCsvInputFormat}.
*/
public class RowCsvInputFormatSplitTest {
Contributor

I'm wondering if these tests are sufficient. What about strings surrounded by " or escaped delimiters? Could you copy more tests around escaping from the deserialization schema tests?
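For reference, these are the RFC 4180 escaping cases the comment is asking about: a field surrounded by `"` may contain the field delimiter, and a doubled quote `""` inside a quoted field denotes a literal quote character. The following is a minimal, hypothetical sketch of those quoting rules for a single line, not Flink or Jackson code:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of RFC 4180 quoting: quoted fields, delimiters
// inside quotes, and "" as an escaped quote.
public class CsvEscapingSketch {

    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        cur.append('"'); // "" inside quotes is an escaped quote
                        i++;
                    } else {
                        inQuotes = false; // closing quote of the field
                    }
                } else {
                    cur.append(c); // delimiters are literal inside quotes
                }
            } else if (c == '"') {
                inQuotes = true; // opening quote of the field
            } else if (c == ',') {
                fields.add(cur.toString()); // unquoted delimiter ends the field
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        fields.add(cur.toString());
        return fields;
    }
}
```

For example, parsing the line a,"b,c","say ""hi""" yields the three fields a, b,c, and say "hi" — the kind of inputs the review suggests covering in the split tests.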

assertTrue(format.reachedEnd());
}

// NOTE: the new CSV format does not support configuring a multi-character field delimiter.
Contributor

All limitations mentioned here should also be mentioned in the input format class as well as in the descriptor.

@twalthr
Contributor

twalthr commented Oct 17, 2019

Btw, did you have a look at #4660? Isn't this issue a duplicate?

@JingsongLi
Contributor Author

Thanks @JingsongLi for the PR. I added an initial set of comments. It would be great if we could further reduce the number of limitations. The CSV format is one of the most important batch connectors and should have a feature set similar to the (de)serialization schema. Otherwise we need to document a lot of limitations in descriptors and docs.

Thanks @twalthr for your review.
These limitations are compared with the previous CsvInputFormat in flink-java, not the RFC (de)serialization schema in flink-csv. Some can be improved further; some are more difficult (they rely on Jackson).
You are right, we need to document a lot of limitations in docs.

@JingsongLi
Contributor Author

Btw, did you have a look at #4660? Isn't this issue a duplicate?

The differences are:

  • This PR is in flink-csv instead of flink-table.
  • This PR is consistent with the existing (de)serialization schema.
  • This PR deals with escaping characters with line delimiter.

There are some other reasons why I put forward this PR:

@JingsongLi
Contributor Author

Hi @twalthr , fixed comments, please take a look~

Contributor

@leonardBang leonardBang left a comment

@JingsongLi, thanks for your great contribution. I left some minor comments.


long realStart = splitStart;
if (splitStart != 0) {
    realStart = findNextLegalSeparator();
Contributor

maybe findLegalSplitStart will be more meaningful?

Contributor Author

I think we can use findNextLineStartOffset.


if (splitLength != READ_WHOLE_SPLIT_FLAG) {
    stream.seek(splitStart + splitLength);
    long firstByteOfNextLine = findNextLegalSeparator();
Contributor

how about using startOfNextSplit?

Contributor Author

nextLineStartOffset

*/
public class RowCsvInputFormat extends AbstractCsvInputFormat<Row> {

private static final long serialVersionUID = 1L;
Contributor

Would generating the serialVersionUID using the IDE be better?

Contributor Author

According to the Flink code style, 1 is the first uid.


long pos = stream.getPos();

// deal with "\r\n": the next char may be '\n', so we need to skip it.
Contributor

Are there other similar special character combinations in CSV?

Contributor Author

Jackson/standard CSV only supports '\r', '\n', and '\r\n'. There is no '\n\r'.

OutputStreamWriter wrt = new OutputStreamWriter(new FileOutputStream(tempFile), StandardCharsets.UTF_8);
wrt.write(content);
wrt.close();
return new FileInputSplit(0, new Path(tempFile.toURI().toString()), start,
Contributor

We should add tests where the input splits come from a file with more than one split, so that findNextLegalSeparator is covered.
E.g.:
FileInputSplit split1 = new FileInputSplit(0, split.getPath(), 0, split.getLength() / 2, split.getHostnames());
FileInputSplit split2 = new FileInputSplit(1, split.getPath(), split1.getLength(), split.getLength(), split.getHostnames());

Contributor Author

RowCsvInputFormatSplitTest?

Contributor

RowCsvInputFormatSplitTest?
ok

@JingsongLi
Contributor Author

Hi @twalthr , do you have other concerns?

@JingsongLi
Contributor Author

@JingsongLi JingsongLi merged commit 04096fc into apache:master Apr 7, 2020
KarmaGYZ pushed a commit to KarmaGYZ/flink that referenced this pull request Apr 10, 2020
leonardBang pushed a commit to leonardBang/flink that referenced this pull request Apr 10, 2020
@JingsongLi JingsongLi deleted the csv branch April 26, 2020 05:28