Skip to content

[BEAM-7577] Allow ValueProviders in Datastore Query filters#8950

Merged
udim merged 10 commits intoapache:masterfrom
EDjur:BEAM-7577/allow-valueproviders-ReadFromDatastore
Aug 2, 2019
Merged

[BEAM-7577] Allow ValueProviders in Datastore Query filters#8950
udim merged 10 commits intoapache:masterfrom
EDjur:BEAM-7577/allow-valueproviders-ReadFromDatastore

Conversation

@EDjur
Copy link
Contributor

@EDjur EDjur commented Jun 26, 2019

I have a use case where I need to supply Datastore Query filters at runtime. This PR allows the usage of ValueProviders when constructing the Datastore Query and converts them to their expected str-equivalents when running in a pipeline in _to_client_query().

Related Jira ticket: https://issues.apache.org/jira/browse/BEAM-7577

I have tested this by building my local version of beam and using it in the sdk_location flag when running a Dataflow job on GCP.


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

Post-Commit Tests Status (on master branch)

Lang SDK Apex Dataflow Flink Gearpump Samza Spark
Go Build Status --- --- Build Status --- --- Build Status
Java Build Status Build Status Build Status Build Status
Build Status
Build Status
Build Status Build Status Build Status
Build Status
Python Build Status
Build Status
--- Build Status
Build Status
Build Status --- --- Build Status

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website
Non-portable Build Status Build Status Build Status Build Status
Portable --- Build Status --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

@EDjur
Copy link
Contributor Author

EDjur commented Jun 26, 2019

R: @aaltay Appreciate any feedback and comments!

@aaltay
Copy link
Member

aaltay commented Jun 26, 2019

R: @udim for reviewing related to Datastore
R: @azurezyq for ValueProvider related review.

Thank you @EDjur for sending this.

@EDjur
Copy link
Contributor Author

EDjur commented Jun 27, 2019

Thanks for the feedback! Will push an update once I've run tests locally.

Edit: Looks like my local pylint didn't catch the same issues as the one in Jenkins. Should be fixed now.

@EDjur
Copy link
Contributor Author

EDjur commented Jul 1, 2019

@udim Made some edits based on your comments, let me know what you think!

@EDjur
Copy link
Contributor Author

EDjur commented Jul 3, 2019

I've noticed that a small change might be needed in datastoreio.py or alternatively in query_splitter.py in order to use this together with ReadFromDatastore. Specifically, the validate_split function in query_splitter.py is causing issues when using value providers as a filter:

  for filter in query.filters:
    if filter[1] in ['<', '<=', '>', '>=']:
      raise SplitNotPossibleError('Query cannot have any inequality filters.')

Since this function is run before the query is converted to a client_query by calling the _to_client_query method, filter here will be of type ValueProvider, which does not support indexing, therefore raising a TypeError.

I'm thinking that we should perhaps evaluate the values of our ValueProvider-filter before calculating the split. But this means we cannot evaluate in _to_client_query, which I thought was a neat solution that wasn't particularly hacky.

For context, the flow is essentially the expand method in ReadFromDatastore that calls the SplitQuery before Read, and Read is what causes the _to_client_query method to be called.

Question is basically where the best place is to evaluate these filters.

@udim What's your take on this?

Edit: Will explore this again after fixing the other issue first.

@udim
Copy link
Member

udim commented Jul 24, 2019

run python 2 postcommit

@udim
Copy link
Member

udim commented Jul 24, 2019

run python 3.5 postcommit

@udim
Copy link
Member

udim commented Jul 24, 2019

I've noticed that a small change might be needed in datastoreio.py or alternatively in query_splitter.py in order to use this together with ReadFromDatastore. Specifically, the validate_split function in query_splitter.py is causing issues when using value providers as a filter:

  for filter in query.filters:
    if filter[1] in ['<', '<=', '>', '>=']:
      raise SplitNotPossibleError('Query cannot have any inequality filters.')

Since this function is run before the query is converted to a client_query by calling the _to_client_query method, filter here will be of type ValueProvider, which does not support indexing, therefore raising a TypeError.

I'm thinking that we should perhaps evaluate the values of our ValueProvider-filter before calculating the split. But this means we cannot evaluate in _to_client_query, which I thought was a neat solution that wasn't particularly hacky.

For context, the flow is essentially the expand method in ReadFromDatastore that calls the SplitQuery before Read, and Read is what causes the _to_client_query method to be called.

Question is basically where the best place is to evaluate these filters.

@udim What's your take on this?

Edit: Will explore this again after fixing the other issue first.

I would put

self.filters = self._set_runtime_filters(filters)

in Query.__init__. I believe that solves both issues.

Copy link
Member

@udim udim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I accidentally approved this thinking the code was reverted to using Iterable[Tuple[ValueProvider, ValueProvider, ValueProvider]].

@EDjur
Copy link
Contributor Author

EDjur commented Jul 25, 2019

I accidentally approved this thinking the code was reverted to using Iterable[Tuple[ValueProvider, ValueProvider, ValueProvider]].

Yep noticed, I will revert the changes now :)

@EDjur
Copy link
Contributor Author

EDjur commented Jul 25, 2019

I would put

self.filters = self._set_runtime_filters(filters)

in Query.__init__. I believe that solves both issues.

Will this not raise an error due to calling .get() on ValueProvider from a non-runtime context? As the Query is instantiated before executing the pipeline?

One solution could be to just check in the query splitter if it is a ValueProvider and then execute .get() on it if it is.

@EDjur
Copy link
Contributor Author

EDjur commented Jul 27, 2019

Run Python PreCommit

@udim
Copy link
Member

udim commented Aug 2, 2019

Will this not raise an error due to calling .get() on ValueProvider from a non-runtime context? As the Query is instantiated before executing the pipeline?

Yes, you're right. (I haven't written code that uses ValueProviders and it's trickier than it seems.)

@udim
Copy link
Member

udim commented Aug 2, 2019

run python 3.5 postcommit

@EDjur
Copy link
Contributor Author

EDjur commented Aug 2, 2019

(I haven't written code that uses ValueProviders and it's trickier than it seems.)

My thoughts exactly 😬

Cheers for the code duplication fix!

@udim udim merged commit 7a0bc8d into apache:master Aug 2, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants