[FLINK-21406][RecordFormat] build AvroParquetRecordFormat for the new FileSource #17501
Conversation
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community by running automated checks.
Last check on commit 9d5ac3c (Tue Dec 14 09:07:22 UTC 2021). No warnings.
Mention the bot in a comment to re-run the automated checks.
Review progress: please see the Pull Request Review Guide for a full explanation of the review process. The bot tracks the review progress through labels, which are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.
Thanks for providing this pull request, but I have a few preliminary questions about the design.
Every time I read something about Parquet formats, I think the format should be based on the BulkFormat interface. Why did you base your implementation on the StreamFormat?
As a second point, I'd like to see an IT case using the new format with the FileSource. Did you already test this?
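The contrast behind this question can be sketched in reduced form. The interfaces below are simplified, hypothetical stand-ins (plain Java, no Flink dependencies), not the actual StreamFormat/BulkFormat APIs: the stream style surfaces one record at a time from a plain stream, while the bulk style returns whole batches per call, which is closer to how Parquet row groups are read.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class FormatStyles {

    /** StreamFormat style: the framework hands the format a (possibly
     * decompressed) stream and pulls records one at a time. */
    interface StreamStyleReader<T> {
        T read() throws IOException; // null once the stream is exhausted
    }

    /** BulkFormat style: the format returns whole batches per call, which is
     * what a columnar layout such as Parquet's row groups produces naturally. */
    interface BulkStyleReader<T> {
        List<T> readBatch() throws IOException; // null once exhausted
    }

    /** Record-at-a-time line reader in the StreamFormat style. */
    static StreamStyleReader<String> lineReader(InputStream in) {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        return reader::readLine;
    }

    /** Batch-at-a-time line reader in the BulkFormat style. */
    static BulkStyleReader<String> batchedLineReader(InputStream in, int batchSize) {
        StreamStyleReader<String> lines = lineReader(in);
        return () -> {
            List<String> batch = new ArrayList<>();
            String line;
            while (batch.size() < batchSize && (line = lines.read()) != null) {
                batch.add(line);
            }
            return batch.isEmpty() ? null : batch;
        };
    }
}
```

The relevance to Parquet is that a BulkFormat-style reader can hand back a whole decoded row group at once, whereas a StreamFormat-style reader must flatten it into single records behind the scenes.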
Thanks for the nice discussion! I think I now understand the contribution better.
Can you rename the class AvroParquetRecordFormat to ParquetAvro...? The codebase already has a ParquetAvroWriters class, and it would be nice to keep the naming consistent.
I named it after the naming convention of the Apache Parquet library.
R: @echauchot
@echauchot Thanks for your interest and effort. It would be great if you could review this PR. Please be aware that the PR is currently in draft status. There is a lot of information in the PR description that should hopefully give you the background for the review.
Sure, I saw the comments about split and data types etc. But I feel uncomfortable about draft PRs because they usually cannot be merged as-is. In the case of your PR, merging it without the split support could not be done. So I guess the correct way to proceed is to use this PR as an environment for design discussions and add extra commits until the PR is ready for production. @JingGe @fapaul WDYT?
You are right, that was the idea of the draft PR. Speaking of the splitting support specifically, which would make the implementation considerably more complicated: this PR might be merged without it, because we didn't get any requirement for it from the business side. If you have any strong requirement w.r.t. splitting, we'd like to know and will reconsider it.
I think splitting is mandatory, because if you read a big Parquet file with no split support, all the content will end up in a single task manager, which will lead to an OOM.
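For context, splitting a Parquet file usually means assigning each row group (whose start offsets are recorded in the file footer metadata) to its own split, so several readers can process one file in parallel. A minimal, hypothetical helper sketching that idea follows; the names and shapes are illustrative only, not Flink or Parquet API.

```java
import java.util.ArrayList;
import java.util.List;

class RowGroupSplits {
    /** One {offset, length} split per row group; the last split runs to EOF. */
    static List<long[]> splits(long[] rowGroupOffsets, long fileLength) {
        List<long[]> out = new ArrayList<>();
        for (int i = 0; i < rowGroupOffsets.length; i++) {
            long start = rowGroupOffsets[i];
            long end = (i + 1 < rowGroupOffsets.length) ? rowGroupOffsets[i + 1] : fileLength;
            out.add(new long[] {start, end - start});
        }
        return out;
    }
}
```

Each split can then be read independently, so a large file no longer has to fit through a single task manager.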
Hi @JingGe, thanks for your work! I must admit I have mixed feelings about this PR: it feels very java-stream and single-split oriented. Like Fabian, I think implementing BulkFormat would be better.
I agree with you; I actually have the same concern, especially from the SQL perspective. I didn't really understand your concern about OOM, because on the upper side we can control it via the fetch size.
The goal of this PR is to get everyone on the same page w.r.t. the implementation design. Once the design is settled, the splitting support could easily be added as a feature in a follow-up PR. That is why I wrote explicitly at the beginning of the PR description that "Splitting is not supported in this version". I will update it with more background info.
Thank you very much for your contribution. The general approach looks good. There are some things that I'd like to reconsider, though:
- I'm not sure the answer to the confusion around the existing 3 format interfaces should be to add a 4th.
- An ITCase is missing; the current code is not working and will always yield an NPE during execution.
Yes, I see that there is a countermeasure against possible OOM (the fetch size), but for performance reasons the split is still important; otherwise the parallelism is sub-optimal, and Flink focuses on performance. I'm not a committer on the Flink project, so it is not my decision whether to merge this PR without split support, but I would tend not to merge without it, to avoid users suffering from a lack of performance that does not seem to meet project quality standards. @AHeise WDYT?
Force-pushed f9f0630 to 9ae5e4d.
I'm fine with a follow-up ticket/PR on that one to keep things going. Having any support for AvroParquet is better than having none. But it should be done before the 1.15 release anyhow, so that end users only ever see the splittable version.
We plan to support splitting for all formats with sync marks, but in general the role of splitting has shifted, since big data processing has moved from block-based storages to cloud storages. Earlier, splitting was also needed to support data locality, which doesn't apply anymore. Now it is only needed to speed up ingestion (you can always rebalance after the source), so it is necessary only for the most basic pipelines.
TL;DR: while splitting is still a should-have feature, the days of must-have are gone.
Having the split in place before any user could see the non-split version was all I was interested in. So delivering it before the 1.15 release looks perfect!
Force-pushed b481049 to 8b1a7f9.
@flinkbot run azure
The current state already looks very good to me.
@StephanEwen asked offline why we are targeting a feature that is specific to DataStream, and whether we are not repeating the same mistake of the past. So could you please try and spike what would need to be done to port the solution to the existing ParquetVectorizedInputFormat? I'm a bit skeptical here:
- We have no concept of vectorized data in DataStream (should we?). Exposing the table types to DataStream is not a good option.
- On the other hand, integrating AvroParquet into Table also has no value, as the Table API expects table types.

Maybe we have really found out that for columnar formats, we simply can't share the implementation for DataStream and Table.
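The mismatch described above can be made concrete with a toy sketch (hypothetical types, not Flink API): DataStream programs consume row-shaped events, while a vectorized reader produces column-shaped batches, and bridging the two requires a per-row transposition step.

```java
// One event with all its fields together: the shape DataStream operators see.
class RowRecord {
    long id;
    String name;
}

// Many events with one array per field: the shape a vectorized Parquet
// reader produces. Handing rows to DataStream means transposing on access.
class ColumnarBatch {
    long[] ids;
    String[] names;

    RowRecord row(int i) { // materialize row i by gathering one value per column
        RowRecord r = new RowRecord();
        r.id = ids[i];
        r.name = names[i];
        return r;
    }
}
```

That per-row gather is exactly the overhead a shared implementation would either pay on the DataStream side or avoid by exposing columnar batches, which DataStream has no type for.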
The current state looks good. I'm wondering if the ITCase should also cover specific and reflective records, but it might already be enough to cover those with unit tests.
I have made a proposal on how to address the unsatisfying recovery and how to merge it with ParquetVectorizedInputFormat (probably by extracting a common ancestor). PTAL.
Force-pushed 51d9550 to 1364bd6.
Probably the last round of review. I will do the final check after your next push.
Force-pushed b8a96d9 to d3ffea2.
Force-pushed 0507c16 to 84e5e70.
AvroParquetRecordFormat impl with UT and IT
Force-pushed 84e5e70 to b8f5219.
@flinkbot run azure
LGTM. Thank you very much for enhancing Parquet.
…nto Avro types. This closes apache#17501.
What is the purpose of the change
The goal of this PR is to provide an AvroParquetRecordFormat implementation to read Avro GenericRecords from Parquet files via the new Flink FileSource.
Another goal of this PR is to get everyone on the same page w.r.t. the implementation design. Once the design is settled, splitting support and more Avro record types beyond GenericRecord can easily be added as new features in follow-up PRs.
This is a draft PR; there is one failing unit test case, which is commented out for now with a TODO comment. I am still working on it.
Brief change log
Open Questions
To give you some background: the original idea was to let AvroParquetRecordFormat implement FileRecordFormat. After considering that FileRecordFormat and StreamFormat have too much in common, and that StreamFormat has more built-in features such as compression support (via the StreamFormatAdapter), the current design is based on StreamFormat. In order to keep the separation of responsibilities clear, two levels of interfaces have been defined: StreamFormat focuses on the abstract input stream, while RecordFormat pays attention to the concrete FileSystem, i.e. the Path. RecordFormat provides default implementations for the overloaded createReader(...) methods, so subclasses are not forced to implement them.
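The two-level design described above can be sketched in reduced form. The names and signatures below are hypothetical simplifications; the real Flink interfaces also carry configuration and restore positions for checkpointing.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Lower level: works purely against an abstract input stream.
interface SketchStreamFormat<T> {
    Reader<T> createReader(InputStream stream) throws IOException;

    interface Reader<T> {
        T read() throws IOException; // null once the input is exhausted
    }
}

// Upper level: knows about the concrete file system, i.e. the Path. The
// default method resolves the Path to a stream and delegates, so a concrete
// format only has to implement the stream-based method.
interface SketchRecordFormat<T> extends SketchStreamFormat<T> {
    default Reader<T> createReader(Path path) throws IOException {
        return createReader(Files.newInputStream(path));
    }
}
```

With this layering, a concrete format implements only the stream-based createReader and inherits the Path-based convenience overload.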
Following are some questions open for discussion:
Verifying this change
This change added tests and can be verified as follows:
New unit test validates that:
Does this pull request potentially affect one of the following parts:
- @Public(Evolving): (yes / no)

Documentation