Skip to content

Conversation

@ostrokach
Copy link
Contributor

@ostrokach ostrokach commented Jun 7, 2019

Currently, the --temp_location argument is parsed by the GoogleCloudOptions class. However, --temp_location or a similar argument may be useful for other runners as well.

For example, in the case of DirectRunner, temp_location may be the place where the runner stores its temporary files, if it ever supports out of process shuffles, combines, etc. In the case of InteractiveRunner, temp_location may be a good default location for the runner to store its cache files. In the case of the WriteToFiles transform, setting temp_location to p.options.view_as(GoogleCloudOptions).temp_location makes it appear as if temp_location has to be on GCS, which from my understanding, is not the case unless we are using the DataflowRunner?

CC: @ivant @ananvay


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

Post-Commit Tests Status (on master branch)

Lang SDK Apex Dataflow Flink Gearpump Samza Spark
Go Build Status --- --- Build Status --- --- Build Status
Java Build Status Build Status Build Status Build Status
Build Status
Build Status
Build Status Build Status Build Status
Build Status
Python Build Status
Build Status
--- Build Status
Build Status
Build Status --- --- Build Status

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website
Non-portable Build Status Build Status Build Status Build Status
Portable --- Build Status --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

@aaltay
Copy link
Member

aaltay commented Jun 10, 2019

Should we have a gcp_temp_location which defaults to temp_location (as in Java :

)

IIRC @udim had some taught on this.

@ostrokach ostrokach force-pushed the feature/temp_location_to_standard branch 7 times, most recently from 3d407ac to 63cf8ae Compare July 28, 2019 01:16
@aaltay aaltay requested a review from udim July 29, 2019 15:51
@ostrokach
Copy link
Contributor Author

ostrokach commented Aug 1, 2019

@aaltay @udim I updated the PR to include both --temp_location and --gcp_temp_location. However, I am not sure if the division of concerns between those to options matches what we have with the Java SDK?

I would think that most people would continue to use --temp_location even when using GCP, unless they want to separate their Dataflow temp files from the SDK temp files?

@aaltay
Copy link
Member

aaltay commented Aug 2, 2019

Another issues with this change is that, it is backward incompatible to move an option from its current class. This will break user pipelines that are setting these options in their code.

Unless there is a workaround we may want to delay this until beam 3.0.

@ostrokach are you blocked on this change?

@robertwb
Copy link
Contributor

robertwb commented Aug 2, 2019 via email

@ostrokach
Copy link
Contributor Author

ostrokach commented Aug 2, 2019

@aaltay I am not blocked on this change; I just figured that it is strange to use GoogleCloudOptions for things that are not GCloud-related, like cache location. But there are things I can do to get around this, like having a separate --cache_location argument for example.

@robertwb The PipelineOptions class collects the arguments defined by every subclass into a single _BeamArgumentParser, which is basically just an argparse.ArgumentParser. We could over-write the add_argument method of _BeamArgumentParser in order to skip duplicate arguments, but there does not seem to be a clean API for doing this without relying on "private" attributes, and then there are edge cases where maybe the type or the destination of the new argument is different from that of the old, and those cases may end up being hard to debug.

@robertwb
Copy link
Contributor

robertwb commented Aug 9, 2019 via email

@aaltay
Copy link
Member

aaltay commented Aug 9, 2019

@robertwb suggestion sounds good. We can change it so that it will be

StandardOptions.temp_location
GcpOptions.gcp_temp_location (defaulting to StandardOptions.temp_location)

Also, how does this options, their defaults, and which ones are being required works in Java SDK?

@ostrokach ostrokach force-pushed the feature/temp_location_to_standard branch 6 times, most recently from f5a4a9d to a318de2 Compare September 11, 2019 20:37
@ostrokach ostrokach force-pushed the feature/temp_location_to_standard branch from a318de2 to 7eea599 Compare September 25, 2019 21:06
@ostrokach
Copy link
Contributor Author

@aaltay @robertwb I updated the PR so that:

  • Move temp_location attribute to StandardOptions.
  • Calling GoogleCloudOptions.temp_location falls back to StandardOptoins.temp_locatioin and emits a deprecation warning.
  • GoogleCloudOptions.gcp_temp_location is used for Dataflow related tasks requiring a GCS bucket. If GoogleCloudOptions.gcp_temp_location is not provided, we fall back to StandardOptoins.temp_locatioin.

Copy link
Contributor

@robertwb robertwb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for not getting back to this. The change looks good, just two suggestions.

since='2.16.0',
custom_message=(
'GoogleCloudOptions.temp_location is deprecated since %since%. '
'Use GoogleCloudOptions.gcp_temp_location instead.'))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly also suggest StandardOptions.temp_location?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the message to read:

          'GoogleCloudOptions.temp_location is deprecated since %since%. '
          'Please use StandardOptions.temp_location or '
          'GoogleCloudOptions.gcp_temp_location, as appropriate.'))

Does that sound better?

choices=['COST_OPTIMIZED', 'SPEED_OPTIMIZED'],
help='Set the Flexible Resource Scheduling mode')

def __getattr__(self, name):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: consider using @property here.

Copy link
Contributor Author

@ostrokach ostrokach Nov 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was my first thought as well, but I couldn't figure out how to get it to work.

The problem is that attribute access of the GoogleCloudOptions class ends up being performed using PipelineOptions.__getattr__ and PipelineOptions.__setattr__, and those only check if the argument is set in the (child) class's _add_argparse_args(cls, parser) method, not if the (child) class actually has that attribute.

Since we can't have both StandardOptions._add_argparse_args and GoogleCloudOptions._add_argparse_args define the temp_location argument, I ended up patching the GoogleCloudOptions.__getattr__ and GoogleCloudOptions.__setattr__ methods to add an additional check for the temp_location attribute (and, if not, default to the parent's __getattr__ / __setattr__).

@ostrokach ostrokach force-pushed the feature/temp_location_to_standard branch 4 times, most recently from 92e71ee to 548d546 Compare November 18, 2019 23:13
@robertwb
Copy link
Contributor

@ostrokach Can you rebase this on the latest master?

`temp_location` can potentially be a relevant argument for all runners.
For example, if the DirectRunner were to perform out-of-core shuffles or
combines, it would need a place to store temporary files. Spark also
distinguishes between `SPARK_LOCAL_DIR` and `SPARK_WORKER_DIR` environment variables.
@ostrokach ostrokach force-pushed the feature/temp_location_to_standard branch from 548d546 to e9ee05b Compare January 14, 2020 16:10
@ostrokach
Copy link
Contributor Author

retest this please

1 similar comment
@ostrokach
Copy link
Contributor Author

retest this please

@stale
Copy link

stale bot commented Mar 16, 2020

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions.

@stale stale bot added the stale label Mar 16, 2020
@stale
Copy link

stale bot commented Mar 24, 2020

This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@stale stale bot closed this Mar 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants