Skip to content

Conversation

@Vancior
Copy link
Contributor

@Vancior Vancior commented Aug 8, 2022

What is the purpose of the change

This PR supports OrcBulkWriters.for_row_data_vectorization API to create a BulkWriterFactory that writes rows into Orc files in a batch fashion. This branch will be rebased after #20499 merged.

Verifying this change

This change added tests and can be verified as follows:

  • FileSinkOrcBulkWritersTests in test_orc.py

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (yes)
  • If yes, how is the feature documented? (docs & Python Sphinx doc)

@Vancior Vancior force-pushed the feat/py_orc_bulk_write branch from 9783b83 to f35f82a Compare August 8, 2022 15:42
@flinkbot
Copy link
Collaborator

flinkbot commented Aug 8, 2022

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

@Vancior Vancior force-pushed the feat/py_orc_bulk_write branch 3 times, most recently from 2cd5e87 to 30e8b26 Compare August 9, 2022 02:38
Copy link
Contributor

@HuangXingBo HuangXingBo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Vancior Thanks a lot for the PR. Looks good overall. I only have left two comments.

{{< /tabs >}}

For PyFlink users, `OrcBulkWriters.for_row_data_vectorization` could be used to create `BulkWriterFactory` to write `Row` records to files in Orc format.
It should be noted that if the preceding operator of sink is an operator producing `RowData` records, e.g. CSV source, it needs to be converted to `Row` records before writing to sink.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we need to give a some description to help pyflink users to understand RowData, which can be a doc link or something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try to eliminate this in another PR.

)

@staticmethod
def _create_properties(conf: Configuration):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method can be moved to datastream.utils?

@Vancior Vancior force-pushed the feat/py_orc_bulk_write branch 2 times, most recently from 801ea8a to 5fdb220 Compare August 9, 2022 15:26
@Vancior Vancior force-pushed the feat/py_orc_bulk_write branch from 5fdb220 to 42a3f34 Compare August 9, 2022 15:52
@dianfu dianfu closed this in cf3beff Aug 10, 2022
huangxiaofeng10047 pushed a commit to huangxiaofeng10047/flink that referenced this pull request Nov 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants