Support staging binary distributions (wheel files) of Beam SDK. #5110
aaltay merged 3 commits into apache:master
R: @charlesccychen
    if os.path.exists(tgz_expected):
      return tgz_expected
    raise RuntimeError(
        'Failed to download a distribution for the running SDK. Expected '
Should we keep "source distribution", since the other if branch has a corresponding error mentioning "binary distribution"?
Thanks for such a prompt review, @charlesccychen.
Thanks Valentyn! LGTM. R: @aaltay
    """
    if sdk_location.endswith('.whl'):
      if sdk_location.startswith('http'):
        raise RuntimeError('Staging SDK wheel from an HTTP location is currently'
_stage_beam_sdk is doing the staging. Do you want to check for this error there? It sounds like this method should be concerned about picking a good staging name only.
(Also, do we support reading from gcs locations?)
Good point, this check does not belong here, and it is actually useless, since the HTTP downloader creates its own file name after download:
To support downloading wheels from HTTP we need to save the wheel file under its original name. I was not sure if the URL download path is used by anyone, so I didn't change that logic. I can add a TODO there if you think we need to support it.
Reading wheels from GCS is supported with this PR.
You can update the JIRA, or open a new one. It is OK to not change the PR now.
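A minimal sketch of the "pick a good staging name" responsibility discussed in this thread. It uses `posixpath` instead of Beam's `FileSystems` so it runs standalone, and the function name and the error raised for unrecognized wheels are assumptions for illustration, not the code from this PR:

```python
import posixpath

# Default staged name for source tarballs, as named in the PR description.
DATAFLOW_SDK_TARBALL_FILE = 'dataflow_python_sdk.tar'


def desired_sdk_filename(sdk_location):
  """Returns the file name to use for the SDK in the staging location."""
  if sdk_location.endswith('.whl'):
    # A wheel keeps its original name, because the name carries its tags.
    wheel_filename = posixpath.basename(sdk_location)
    if wheel_filename.startswith('apache_beam'):
      return wheel_filename
    # Assumption: reject unrecognized wheels rather than renaming them.
    raise RuntimeError('Unrecognized SDK wheel file: %s' % sdk_location)
  # Anything else is treated as a source distribution and gets the fixed name.
  return DATAFLOW_SDK_TARBALL_FILE
```

With this shape, whether a given location (local, GCS, HTTP) can be read stays a concern of the staging code, not of the naming helper.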
        raise RuntimeError('Staging SDK wheel from an HTTP location is currently'
                           'not supported.')
      _, wheel_filename = FileSystems.split(sdk_location)
      if (wheel_filename.startswith('apache_beam') or
google_cloud_dataflow does not have a wheel file (and will probably not have it in the future). We do not need to check for that.
    _, wheel_filename = FileSystems.split(sdk_location)
    if (wheel_filename.startswith('apache_beam') or
        wheel_filename.startswith('google_cloud_dataflow')):
      return wheel_filename
Should we also check that wheel_filename has the right architecture name etc. in it that matches Dataflow workers?
My plan was to leave the decision to the worker. We also control the naming in another part of this module where we download the file.
That is fine, but it is a delayed error. Maybe open a JIRA to reconsider this.
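For reference, an early check along the lines suggested above could look roughly like the sketch below. It relies on the standard wheel file naming scheme (`name-version[-build]-python-abi-platform.whl`). The helper names and the expected worker tags are hypothetical, and exact tuple comparison is a simplification, since wheel names may carry compressed tag sets such as `py2.py3`:

```python
# Hypothetical tags for the worker environment; an assumption for illustration.
EXPECTED_WORKER_TAGS = ('cp27', 'cp27mu', 'manylinux1_x86_64')


def parse_wheel_tags(wheel_filename):
  """Splits a wheel file name into its (python, abi, platform) tags."""
  if not wheel_filename.endswith('.whl'):
    raise ValueError('Not a wheel file: %s' % wheel_filename)
  parts = wheel_filename[:-len('.whl')].split('-')
  # name-version-python-abi-platform, with an optional build tag after version.
  if len(parts) not in (5, 6):
    raise ValueError('Unexpected wheel file name: %s' % wheel_filename)
  return tuple(parts[-3:])


def wheel_matches_worker(wheel_filename, expected_tags=EXPECTED_WORKER_TAGS):
  """Returns True if the wheel's tags equal the expected worker tags."""
  return parse_wheel_tags(wheel_filename) == expected_tags
```

Failing at submission time would trade the delayed worker-side error for a check that has to be kept in sync with the worker image.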
    staged_path = FileSystems.join(google_cloud_options.staging_location,
                                   names.DATAFLOW_SDK_TARBALL_FILE)
    if stage_tarball_from_remote_location:
      staged_path = FileSystems.join(
Should staged_path be moved inside the else? I do not see it being used outside that branch.
Thanks, I like that.
        sdk_remote_location, staged_path)
    local_download_file = _dependency_file_download(
        sdk_remote_location, temp_dir)
    staged_name = _desired_sdk_filename_in_staging_location(local_download_file)
Wouldn't this support staging a wheel file from an HTTP location, since we just downloaded it above as a local file?
As I said above, as long as we preserve the original wheel name, it will work.
        '%s==%s' % (package_name, version),
        '--no-binary', ':all:', '--no-deps']

    logging.info('Executing command: %s', cmd_args)
Is it possible to restructure this code to reduce duplication? For example, as follows:
- Set common pip args
- Optionally add binary-related args
- Execute common code (check_call ...)
- Check the output
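The restructuring suggested above might look roughly like this sketch. The function name and the tag parameters other than `fetch_binary` and `language_version_tag` are assumptions, as are their defaults. The pip options used (`--only-binary`, `--python-version`, `--implementation`, `--abi`, `--platform`) are existing `pip download` flags:

```python
def build_pip_download_args(package_name, version, temp_dir,
                            fetch_binary=False,
                            language_version_tag='27',
                            language_implementation_tag='cp',
                            abi_tag='cp27mu',
                            platform_tag='manylinux1_x86_64',
                            python_executable='python'):
  """Builds the pip command line for a source or binary SDK download."""
  # Arguments shared by the source and the binary download.
  cmd_args = [
      python_executable, '-m', 'pip', 'download', '--dest', temp_dir,
      '%s==%s' % (package_name, version), '--no-deps']
  if fetch_binary:
    # Ask for a wheel built for the target interpreter and platform.
    cmd_args.extend([
        '--only-binary', ':all:',
        '--python-version', language_version_tag,
        '--implementation', language_implementation_tag,
        '--abi', abi_tag,
        '--platform', platform_tag])
  else:
    # Ask for the source distribution only.
    cmd_args.extend(['--no-binary', ':all:'])
  return cmd_args
```

The caller would then run the returned command once (for example with `subprocess.check_call`) and check that the expected file exists in `temp_dir`, so that both steps are shared between the two paths.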
    -def _download_pypi_sdk_package(temp_dir):
    +def _download_pypi_sdk_package(temp_dir, fetch_binary=False,
    +                               language_version_tag='27',
Except for fetch_binary, do you need the other args for now? For Dataflow they should be fixed for the foreseeable future. Maybe the language version is an exception though.
I think we should keep the other args here since this keeps the method general, while the wheel package is defined as all of these properties together. It is equivalent to hard-code these in the pip download method call, but I think it's a reasonable choice to keep these as defaults in kwargs.
        staged_sdk_files.append(sdk_binary_staged_name)
      except RuntimeError as e:
        logging.warn('Failed to download requested binary distribution '
                     'of the SDK: %s', repr(e))
For some reason repr(e) does not keep the exception message that CalledProcessError contains:
This is what it looks like: WARNING:root:Failed to download requested binary distribution of the SDK: RuntimeError('CalledProcessError()',). There is a command line in the logs above, so it's clear what goes wrong, but I was wondering what would be a better way of passing the exception message here?
traceback.format_exc() perhaps? Please do a follow up update or open a JIRA.
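A small sketch of the `traceback.format_exc()` idea, assuming Python 3 behavior, where `str(e)` on a `CalledProcessError` includes the command and its exit status. The `try_download` helper is hypothetical and is not the code in this PR:

```python
import logging
import subprocess
import sys
import traceback


def try_download(cmd_args):
  """Runs cmd_args; returns None on success or a failure description."""
  try:
    subprocess.check_call(cmd_args)
    return None
  except subprocess.CalledProcessError as e:
    # format_exc() captures the full traceback of the exception being handled.
    details = traceback.format_exc()
    logging.warning(
        'Failed to download requested binary distribution of the SDK: %s\n%s',
        e, details)
    return str(e)


if __name__ == '__main__':
  # Simulate a failing command with the current interpreter.
  print(try_download([sys.executable, '-c', 'import sys; sys.exit(3)']))
```

Logging `str(e)` keeps the one-line summary readable, while the traceback is there for anyone who needs the full context.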
This LGTM. Thanks @tvalentyn for fixing this problem!
@aaltay Do you have any other comments I should address before we can merge this?
Thanks for review & merge, @aaltay, @charlesccychen. Will follow up with remaining action items.
[BEAM-3950] Support staging binary (wheel) distributions of the Beam SDK to Dataflow. This can help reduce worker startup latency for Beam users on the Dataflow runner.
This PR introduces two changes to current behavior:
- When --sdk_location points to a wheel, the staged SDK file will not be renamed to dataflow_python_sdk.tar; the original name will be used instead.
- When --sdk_location is not set, try to download and stage both the SDK sources and the SDK wheel from PyPI. The wheel file is downloaded and staged on a best-effort basis: if PyPI does not have the desired wheel, only the sources will be staged, and pipeline execution will continue.
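The best-effort behavior described above can be sketched as follows. The function and its three callables are hypothetical stand-ins for the download and staging helpers. Only the `except RuntimeError` plus warning pattern is taken from the diff under review:

```python
import logging


def stage_sdk_from_pypi(download_sources, download_wheel, stage):
  """Stages SDK sources, then the SDK wheel if one can be downloaded.

  All three arguments are hypothetical callables: the two download functions
  return a local path, and stage(path) returns the staged file name.
  """
  staged_sdk_files = []
  # The source distribution is required, so a failure here propagates.
  staged_sdk_files.append(stage(download_sources()))
  try:
    wheel_path = download_wheel()
  except RuntimeError as e:
    # A missing wheel is not fatal: warn and continue with sources only.
    logging.warning('Failed to download requested binary distribution '
                    'of the SDK: %s', e)
  else:
    staged_sdk_files.append(stage(wheel_path))
  return staged_sdk_files
```

Keeping the wheel download inside its own `try` block is what lets pipeline submission continue when only the sources are available.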