[FLINK-14266][table] Introduce RowCsvInputFormat to new CSV module #9884
Conversation
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community review your pull request.
Automated Checks: last check on commit a15a359 (Wed Dec 04 14:56:42 UTC 2019). Warnings: Mention the bot in a comment to re-run the automated checks.
Review Progress: Please see the Pull Request Review Guide for a full explanation of the review process. The bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.
Bot commands: The @flinkbot bot supports the following commands:
Thanks @JingsongLi for the PR. I added an initial set of comments. It would be great if we could further reduce the number of limitations. The CSV format is one of the most important batch connectors and should have a feature set similar to the (de)serialization schema. Otherwise we need to document a lot of limitations in descriptors and docs.
private int[] selectedFields;

/**
 * Creates a CSV deserialization schema for the given {@link TypeInformation} and file paths
Please make sure to update all comments if you copy code.
/**
 * Test split logic for {@link RowCsvInputFormat}.
 */
public class RowCsvInputFormatSplitTest {
I'm wondering if these tests are sufficient. What about strings surrounded by `"` or escaped delimiters? Could you copy more tests around escaping from the deserialization schema tests?
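For illustration, the quoting cases mentioned here can be exercised with a minimal RFC-4180-style field splitter. This is a hypothetical sketch with made-up names, not code from this PR or from Jackson: a field wrapped in double quotes may contain the delimiter, and a doubled quote inside a quoted field denotes a literal quote.

```java
import java.util.ArrayList;
import java.util.List;

public class QuotedCsvSketch {

    /** Splits one CSV line into fields, honoring quotes and "" escapes. */
    static List<String> parseLine(String line, char delimiter) {
        List<String> fields = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        cur.append('"');   // "" inside quotes is an escaped quote
                        i++;
                    } else {
                        inQuotes = false;  // closing quote
                    }
                } else {
                    cur.append(c);         // a delimiter inside quotes is data
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == delimiter) {
                fields.add(cur.toString());
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        fields.add(cur.toString());
        return fields;
    }

    public static void main(String[] args) {
        // Quoted delimiter and escaped quotes in one line.
        System.out.println(parseLine("a,\"b,c\",\"say \"\"hi\"\"\"", ','));
        // [a, b,c, say "hi"]
    }
}
```

Tests copied over from the deserialization schema side would want to cover exactly these inputs against the split-based reader.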
assertTrue(format.reachedEnd());
}

// NOTE: the new CSV format does not support configuring a multi-character field delimiter.
All limitations mentioned here should also be mentioned in the input format class as well as in the descriptor.
Btw, did you have a look at #4660? Isn't this issue a duplicate?
Thanks @twalthr for your review.
The differences are:
There are some other reasons why I put forward this PR:
Hi @twalthr, I fixed the comments, please take a look~
@JingsongLi, thanks for your great contribution. I left some minor comments.
long realStart = splitStart;
if (splitStart != 0) {
    realStart = findNextLegalSeparator();
Maybe findLegalSplitStart would be more meaningful?
I think we can use findNextLineStartOffset.
if (splitLength != READ_WHOLE_SPLIT_FLAG) {
    stream.seek(splitStart + splitLength);
    long firstByteOfNextLine = findNextLegalSeparator();
how about using startOfNextSplit?
nextLineStartOffset
 */
public class RowCsvInputFormat extends AbstractCsvInputFormat<Row> {

    private static final long serialVersionUID = 1L;
Wouldn't generating the serialVersionUID with the IDE be better?
According to the Flink code style, 1 is the first uid.
long pos = stream.getPos();

// Deal with "\r\n": the next character may be '\n', so we need to skip it.
Are there other similar special character combinations in CSV?
Jackson/standard CSV only supports '\r', '\n', and '\r\n'. There is no '\n\r'.
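To make the delimiter discussion concrete, here is a minimal sketch (illustrative names, not the PR's actual code) of scanning for the next line start: a lone '\r', a lone '\n', and the pair '\r\n' must each count as a single record delimiter, which is why the optional '\n' after '\r' has to be skipped.

```java
public class SplitStartSketch {

    /** Returns the offset of the first byte of the next line at or after 'from'. */
    static int findNextLineStartOffset(byte[] data, int from) {
        int i = from;
        while (i < data.length) {
            byte b = data[i++];
            if (b == '\n') {
                return i;                      // lone '\n' delimiter
            }
            if (b == '\r') {
                // "\r\n" is one delimiter: skip the '\n' that may follow '\r'.
                if (i < data.length && data[i] == '\n') {
                    i++;
                }
                return i;
            }
        }
        return data.length;                    // no further line in this data
    }

    public static void main(String[] args) {
        byte[] csv = "a,b\r\nc,d\ne,f".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        // A split starting inside "a,b" should begin parsing at "c,d".
        System.out.println(findNextLineStartOffset(csv, 1)); // 5
        System.out.println(findNextLineStartOffset(csv, 5)); // 9
    }
}
```

Without the skip, a split boundary landing between '\r' and '\n' would produce a spurious empty record.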
OutputStreamWriter wrt = new OutputStreamWriter(new FileOutputStream(tempFile), StandardCharsets.UTF_8);
wrt.write(content);
wrt.close();
return new FileInputSplit(0, new Path(tempFile.toURI().toString()), start,
We should add tests where the input splits come from a file split into more than one input split, so we can cover findNextLegalSeparator, e.g.:
FileInputSplit split1 = new FileInputSplit(0, split.getPath(), 0, split.getLength() / 2, split.getHostnames());
FileInputSplit split2 = new FileInputSplit(1, split.getPath(), split1.getLength(), split.getLength(), split.getHostnames());
RowCsvInputFormatSplitTest?
RowCsvInputFormatSplitTest?
ok
Hi @twalthr, do you have any other concerns?
What is the purpose of the change
We currently have an old CSV format, but it does not provide standard CSV support. We should support the RFC-compliant CSV format for table/sql.
Brief change log
Add RowCsvInputFormat, using Jackson's ObjectReader.readValues(InputStream). We need to deal with half-line reading when splitting large files into multiple splits. The difficulties are:
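The half-line problem can be illustrated with a self-contained sketch (hypothetical names, not this PR's implementation): each split starts parsing at the first line start at or after its start offset, and its logical end is the first line start at or after splitStart + splitLength, so a record straddling a byte boundary is read by exactly one split.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class HalfLineSplitSketch {

    /** Offset of the first byte after the next record delimiter at or after 'from'. */
    static int nextLineStart(byte[] data, int from) {
        int i = from;
        while (i < data.length) {
            byte b = data[i++];
            if (b == '\n') {
                return i;
            }
            if (b == '\r') {
                // "\r\n" counts as one delimiter, so skip the optional '\n'.
                if (i < data.length && data[i] == '\n') {
                    i++;
                }
                return i;
            }
        }
        return data.length;
    }

    /** Records owned by the split [splitStart, splitStart + splitLen). */
    static List<String> readSplit(byte[] data, int splitStart, int splitLen) {
        // Parsing starts at the first full line inside the split ...
        int pos = (splitStart == 0) ? 0 : nextLineStart(data, splitStart);
        // ... and stops at the first line start past the split's end, so the
        // line straddling the boundary is finished here, not by the next split.
        int stop = (splitStart + splitLen >= data.length)
                ? data.length
                : nextLineStart(data, splitStart + splitLen);
        List<String> records = new ArrayList<>();
        while (pos < stop) {
            int lineEnd = pos;
            while (lineEnd < data.length && data[lineEnd] != '\n' && data[lineEnd] != '\r') {
                lineEnd++;
            }
            records.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = nextLineStart(data, pos);
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "a,1\nbb,2\nccc,3\n".getBytes(StandardCharsets.UTF_8);
        int mid = data.length / 2;  // cuts the record "bb,2" in half
        System.out.println(readSplit(data, 0, mid));                 // [a,1, bb,2]
        System.out.println(readSplit(data, mid, data.length - mid)); // [ccc,3]
    }
}
```

Because both the stop offset of one split and the start offset of the next are computed with the same scan, the splits partition the records exactly, wherever the byte boundary falls.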
Verifying this change
RowCsvInputFormatTest and RowCsvInputFormatSplitTest
Does this pull request potentially affect one of the following parts:
@Public(Evolving): no
Documentation