[FLINK-28876][python][format/orc] Support writing RowData into Orc files #20505
HuangXingBo
left a comment
@Vancior Thanks a lot for the PR. Looks good overall. I have left only two comments.
{{< /tabs >}}

For PyFlink users, `OrcBulkWriters.for_row_data_vectorization` could be used to create `BulkWriterFactory` to write `Row` records to files in Orc format.
It should be noted that if the preceding operator of sink is an operator producing `RowData` records, e.g. CSV source, it needs to be converted to `Row` records before writing to sink.
Maybe we need to give some description to help PyFlink users understand `RowData`, which could be a doc link or something similar.
I'll try to eliminate this in another PR.
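For context, the usage that the documentation snippet under review describes could look roughly like the sketch below. It assumes a PyFlink release that already contains this change and an Orc format jar on the classpath; the module paths, the field names `name` and `age`, and the helper names `build_orc_sink` and `sink_row_data_stream` are illustrative assumptions, not taken from the PR.

```python
def build_orc_sink(output_dir: str):
    """Build a FileSink that writes Row records into Orc files (sketch)."""
    # Imported lazily so this module can be loaded without PyFlink installed.
    # Module paths are assumptions about the PyFlink package layout.
    from pyflink.datastream.connectors.file_system import FileSink
    from pyflink.datastream.formats.orc import OrcBulkWriters
    from pyflink.table import DataTypes

    # Hypothetical two-column schema used only for illustration.
    row_type = DataTypes.ROW([
        DataTypes.FIELD('name', DataTypes.STRING()),
        DataTypes.FIELD('age', DataTypes.INT()),
    ])
    return FileSink.for_bulk_format(
        output_dir,
        OrcBulkWriters.for_row_data_vectorization(row_type=row_type),
    ).build()


def sink_row_data_stream(ds, sink):
    """Convert a stream of RowData records to Row records, then attach the sink."""
    from pyflink.common import Types

    # An identity map with an explicit Row output type performs the conversion
    # that the documentation snippet asks for.
    row_type_info = Types.ROW_NAMED(['name', 'age'], [Types.STRING(), Types.INT()])
    return ds.map(lambda e: e, output_type=row_type_info).sink_to(sink)
```

If the upstream operator already produces `Row` records, the identity `map` is unnecessary and `ds.sink_to(sink)` can be called directly.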
)

@staticmethod
def _create_properties(conf: Configuration):
Can this method be moved to `datastream.utils`?
What is the purpose of the change
This PR adds the `OrcBulkWriters.for_row_data_vectorization` API to create a `BulkWriterFactory` that writes rows into Orc files in a batch fashion. This branch will be rebased after #20499 is merged.
Verifying this change
This change added tests and can be verified as follows:
`FileSinkOrcBulkWritersTests` in test_orc.py
Does this pull request potentially affect one of the following parts:
@Public(Evolving): (yes)
Documentation