[FLINK-16483][python] Add Python building blocks to make sure the basic functionality of vectorized Python UDF could work #11342
Conversation
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community review your pull request.

Automated Checks
Last check on commit 27f93cc (Sat Mar 07 14:01:14 UTC 2020)
Mention the bot in a comment to re-run the automated checks.

Review Progress
Please see the Pull Request Review Guide for a full explanation of the review process. The bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:
@dianfu Thanks a lot for the PR. Left some minor comments below.
flink-python/setup.py
Outdated
@@ -224,7 +224,8 @@ def remove_if_exists(file_path):
     author_email='dev@flink.apache.org',
     python_requires='>=3.5',
     install_requires=['py4j==0.10.8.1', 'python-dateutil==2.8.0', 'apache-beam==2.19.0',
-                      'cloudpickle==1.2.2', 'avro-python3>=1.8.1,<=1.9.1', 'jsonpickle==1.2'],
+                      'cloudpickle==1.2.2', 'avro-python3>=1.8.1,<=1.9.1', 'jsonpickle==1.2',
+                      'pandas>=0.23.4, <=0.25.3', 'pyarrow>=0.15.1, <=0.16.0'],
Remove the space inside the version ranges? Version specifiers usually don't contain spaces, e.g. `'pandas>=0.23.4,<=0.25.3'`.
 def _create_udtf(f, input_types, result_types, deterministic, name):
     return UserDefinedTableFunctionWrapper(f, input_types, result_types, deterministic, name)

-def udf(f=None, input_types=None, result_type=None, deterministic=None, name=None):
+def udf(f=None, input_types=None, result_type=None, deterministic=None, name=None,
+        udf_type="general"):
How about adding a `PythonFunctionKind` class which contains `GENERAL` and `PANDAS`? It is consistent with Java and would also make it easier for users to pass the parameter.
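The suggestion above could look roughly like this minimal sketch. The class and attribute names mirror the reviewer's proposal; the `udf` signature shown is only illustrative and not the actual PyFlink API:

```python
# Hypothetical PythonFunctionKind holder, mirroring the Java-side enum.
# Using named constants instead of raw "general"/"pandas" strings makes
# typos fail loudly at lookup time rather than silently at runtime.
class PythonFunctionKind(object):
    GENERAL = "general"
    PANDAS = "pandas"


def udf(f=None, input_types=None, result_type=None, deterministic=None, name=None,
        udf_type=PythonFunctionKind.GENERAL):
    # The real decorator would wrap f based on udf_type; shown here only
    # to illustrate how the constant replaces the bare string default.
    return f


kind = PythonFunctionKind.PANDAS
```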
@@ -46,6 +46,12 @@

     private static final String SCHEMA_ARROW_CODER_URN = "flink:coder:schema:scalar_function:arrow:v1";

     static {
Maybe we should not include this commit in this PR, as it is already included in FLINK-16273? Otherwise there will be two commits addressing the same problem.
        self._batch_reader = load_from_stream(self._resettable_io)

        self._resettable_io.set_input_bytes(in_stream.read_all())
        table = pa.Table.from_batches([next(self._batch_reader)])
Why do we only take the first batch from the batch list?
        self._resettable_io = ResettableIO()

    def encode_to_stream(self, cols, out_stream, nested):
        if not hasattr(self, "_batch_writer"):
Why must we initialize `_batch_writer` here instead of in `__init__`?
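One likely reason for the lazy construction is that the writer wraps the output stream, which is only known when the first encode call arrives, not at construction time. A plain-Python sketch of the pattern (names are assumptions, and a list stands in for a real batch writer); note that initializing the attribute to `None` in `__init__` would also avoid the `hasattr` check:

```python
class LazyWriterCoder(object):
    """Sketch of lazy writer initialization.

    The underlying writer needs the output stream, which is only
    available on the first encode call, so it cannot be created in
    __init__. Pre-declaring the attribute as None keeps __init__
    honest and avoids hasattr() checks.
    """

    def __init__(self):
        self._writer = None  # created lazily on first encode

    def encode_to_stream(self, value, out_stream):
        if self._writer is None:
            # out_stream first becomes available here
            self._writer = out_stream  # stand-in for a real batch writer
        self._writer.append(value)


coder = LazyWriterCoder()
stream = []
coder.encode_to_stream("batch-1", stream)
coder.encode_to_stream("batch-2", stream)
```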
        self._batch_writer.write_batch(self._create_batch(cols))

    def decode_from_stream(self, in_stream, nested):
        if not hasattr(self, "_batch_reader"):
I'm wondering if we can avoid these `hasattr` checks. Can we use `pa.ipc.open_stream` to read the stream directly and return the batches?
@hequn8128 Thanks a lot for the review. I have updated the PR.
The Python e2e tests on JDK 11 have also passed: https://api.travis-ci.org/v3/job/660451220/log.txt
@dianfu Thanks a lot for the update and for providing the test link. LGTM.
Merging...
What is the purpose of the change
This pull request adds the Python building blocks to make sure the basic functionality of vectorized Python UDFs can work.
Brief change log
Verifying this change
This change added tests and can be verified as follows:
Does this pull request potentially affect one of the following parts:
@Public(Evolving): (no)
Documentation