Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-1927][py] Operator distribution rework #931

Closed
wants to merge 4 commits into from

Conversation

zentol
Copy link
Contributor

@zentol zentol commented Jul 22, 2015

Python operators are no longer serialized and shipped across the
cluster. Instead the plan file is executed on each node, followed by
usage of the respective operator object.

removed dill library
also fixed [FLINK-2173] by always passing file paths explicitly to python

Python operators are no longer serialized and shipped across the
cluster. Instead the plan file is executed on each node, followed by
usage of the respective operator object.

removed dill library
[FLINK-2173] filepaths are always explicitly passed to python
process.getOutputStream().write((id + "\n").getBytes());
process.getOutputStream().write((inputFilePath + "\n").getBytes());
process.getOutputStream().write((outputFilePath + "\n").getBytes());
process.getOutputStream().flush();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could reuse process.getOutputStream() here by saving it to a variable.

@mxm
Copy link
Contributor

mxm commented Jul 29, 2015

Thanks for the pull request @zentol!

+1 for removing the dill library. As far as I can see, we handle all the serialization ourselves now. We only used the Dill library to serialize the user-defined function alongside with the operator. Now, the operator is extracted from the plan which has been distributed in the Python files to the nodes beforehand. The plan is only send once to generate the Java execution plan. The old behavior was to serialize the operator, pass it to Java, and sent it back again during execution. Performance-wise the new implementation could even be a bit faster.

+1 for explicitly passing the file paths. Java and Python have different ways to determine temporary file paths and this has been a problem in the passed on some platforms.

Your changes are sensible and this looks good to merge.

@zentol
Copy link
Contributor Author

zentol commented Jul 29, 2015

Thanks for the review @mxm .

I've addressed the cosmetic issue you mentioned, and added a small fix for a separate issue as well (error reporting was partially broken).

@mxm
Copy link
Contributor

mxm commented Jul 30, 2015

I think we can merge this later on.

@asfgit asfgit closed this in 68b1559 Jul 30, 2015
@zentol zentol deleted the python_operator4 branch September 19, 2015 11:43
nikste pushed a commit to nikste/flink that referenced this pull request Sep 29, 2015
…aths

Python operators are no longer serialized and shipped across the
cluster. Instead the plan file is executed on each node, followed by
usage of the respective operator object.

- removed dill library
- filepaths are always explicitly passed to python
- fix error reporting

This closes apache#931.
nltran pushed a commit to nltran/flink that referenced this pull request Jan 8, 2016
…aths

Python operators are no longer serialized and shipped across the
cluster. Instead the plan file is executed on each node, followed by
usage of the respective operator object.

- removed dill library
- filepaths are always explicitly passed to python
- fix error reporting

This closes apache#931.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants