[FLINK-1927][py] Operator distribution rework #931
Python operators are no longer serialized and shipped across the cluster. Instead, the plan file is executed on each node, followed by usage of the respective operator object.
- removed dill library
- [FLINK-2173] file paths are always explicitly passed to Python
process.getOutputStream().write((id + "\n").getBytes());
process.getOutputStream().write((inputFilePath + "\n").getBytes());
process.getOutputStream().write((outputFilePath + "\n").getBytes());
process.getOutputStream().flush();
We could reuse process.getOutputStream() here by saving it to a variable.
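A minimal sketch of that suggestion, holding the stream in a local variable instead of calling `process.getOutputStream()` four times. The class name `HandshakeSketch`, the helper `sendHandshake`, the `int` type of `id`, and the explicit UTF-8 charset are assumptions made for illustration; they are not taken from the patch.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class HandshakeSketch {

    // Hypothetical helper: writes the operator id and both file paths,
    // one per line, to a stream that is obtained once by the caller.
    static void sendHandshake(OutputStream out, int id,
                              String inputFilePath, String outputFilePath) throws IOException {
        out.write((id + "\n").getBytes(StandardCharsets.UTF_8));
        out.write((inputFilePath + "\n").getBytes(StandardCharsets.UTF_8));
        out.write((outputFilePath + "\n").getBytes(StandardCharsets.UTF_8));
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for process.getOutputStream(); with a real Process this would be
        //   OutputStream out = process.getOutputStream();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        sendHandshake(out, 7, "/tmp/in", "/tmp/out");
        System.out.print(new String(out.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

Passing the stream in as a parameter also makes the write sequence easy to check against an in-memory stream, as `main` does here.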
Thanks for the pull request @zentol!
+1 for removing the dill library. As far as I can see, we handle all the serialization ourselves now. We only used dill to serialize the user-defined function along with the operator. Now the operator is extracted from the plan, which has been distributed to the nodes in the Python files beforehand. The plan is only sent once, to generate the Java execution plan. The old behavior was to serialize the operator, pass it to Java, and send it back again during execution. Performance-wise, the new implementation could even be a bit faster.
+1 for explicitly passing the file paths. Java and Python have different ways to determine temporary file paths, and this has been a problem in the past on some platforms.
Your changes are sensible and this looks good to merge.
Thanks for the review @mxm. I've addressed the cosmetic issue you mentioned, and added a small fix for a separate issue as well (error reporting was partially broken).
I think we can merge this later on.
…aths Python operators are no longer serialized and shipped across the cluster. Instead the plan file is executed on each node, followed by usage of the respective operator object. - removed dill library - filepaths are always explicitly passed to python - fix error reporting This closes apache#931.
Python operators are no longer serialized and shipped across the
cluster. Instead the plan file is executed on each node, followed by
usage of the respective operator object.
- removed dill library
- also fixed [FLINK-2173] by always passing file paths explicitly to Python