
[BEAM-7917] Fix datastore writes failing on retry #9294

Merged · 3 commits · Nov 1, 2019

Conversation

@sadovnychyi (Contributor)
When we call batch.begin(), the status is set to IN_PROGRESS, and then to FINISHED after *any* call to commit(), successful or not.
commit() can only be called on a batch whose status is IN_PROGRESS -- it fails with a ValueError otherwise.

I cannot say how safe it is to use those private properties of another library from here, but it fixes the issue. We could also re-create the batch from scratch (so we don't have to change that private status), but I'm not sure whether that would cost us any performance.

https://github.com/googleapis/google-cloud-python/blob/master/datastore/google/cloud/datastore/batch.py
https://issues.apache.org/jira/browse/BEAM-7917
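
For illustration, here is a minimal sketch (not the Beam code) of the failure mode this PR fixes; the client and entity setup below is assumed:

    from google.cloud import datastore

    client = datastore.Client()  # assumes default project and credentials
    entity = datastore.Entity(key=client.key('Kind', 'example'))

    batch = client.batch()
    batch.begin()        # internal status becomes IN_PROGRESS
    batch.put(entity)

    try:
        batch.commit()   # status flips to FINISHED even if the RPC fails
    except Exception:
        batch.commit()   # naive retry raises ValueError: not IN_PROGRESS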


@sadovnychyi (Contributor, Author)

R: @tvalentyn

@tvalentyn (Contributor)

R: @udim who is more familiar with this IO.

@sadovnychyi (Contributor, Author)

Pushed an update, please take another look.

If we stop constructing the batch object in DatastoreMutateFn, it becomes more complicated to track the batch size in bytes as we go -- we would still have to convert each entity/key into a protobuf, so there's a performance hit from doing the conversion twice (once when adding a new element and once when building the batch). This could be an issue when writing a lot of data.

There's also no public API for committing a bunch of protobufs directly (which would be perfect for us).

So what I came up with is: we do everything just like before, but we also keep the original non-protobuf entity/key, so we can use those to reconstruct the batch only on retry. There should be no performance hit unless we are already failing.

(I haven't actually tested this in the wild yet; I will once we are clear on the implementation.)
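
A hypothetical sketch of that approach (names and structure are illustrative, not the actual patch): keep the raw elements next to the batch and rebuild the batch from them only when a commit fails:

    def flush_with_rebuild(client, batch_elements, max_attempts=5):
        # Commit a batch; on failure, rebuild it from the retained raw
        # elements, since a batch that has seen commit() cannot be reused.
        for attempt in range(1, max_attempts + 1):
            batch = client.batch()  # fresh batch, status starts at IN_PROGRESS
            batch.begin()
            for element in batch_elements:
                batch.put(element)  # re-populate from raw (non-protobuf) copies
            try:
                batch.commit()
                return
            except Exception:
                if attempt == max_attempts:
                    raise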

@udim (Member) commented Aug 30, 2019

> Pushed an update, please take another look.

Made a pass today. Sorry for the delay!

> If we stop constructing the batch object in DatastoreMutateFn, it becomes more complicated to track the batch size in bytes as we go -- we would still have to convert each entity/key into a protobuf, so there's a performance hit from doing the conversion twice. This could be an issue when writing a lot of data.

I believe it is possible to avoid converting and calculating more than once. See my other comments.

> There's also no public API for committing a bunch of protobufs directly (which would be perfect for us).

Yes, that's the tradeoff with using the new client.

@@ -340,12 +340,13 @@ def finish_bundle(self):
def _init_batch(self):
self._batch_bytes_size = 0
self._batch = self._client.batch()
self._batch.begin()
self._batch_mutations = []

def _flush_batch(self):
# Flush the current batch of mutations to Cloud Datastore.
latency_ms = helper.write_mutations(
@udim (Member) commented on this diff:
Having both _batch_mutations and _batch seems redundant.
I would move write_mutations into this class and create the batch object in it.
You could then take advantage of add_element_to_batch to populate the batch.

@udim (Member):

Regarding ByteSize(), we could split add_element_to_batch into two parts (just a suggestion):

    def element_to_client_batch_item(self, element):
      if not isinstance(element, types.Entity):
        raise ValueError('apache_beam.io.gcp.datastore.v1new.datastoreio.Entity'
                         ' expected, got: %s' % type(element))
      if not element.key.project:
        element.key.project = self._project
      client_entity = element.to_client_entity()
      if client_entity.key.is_partial:
        raise ValueError('Entities to be written to Cloud Datastore must '
                         'have complete keys:\n%s' % client_entity)
      return client_entity

    def add_to_batch(self, client_batch_item):
      self._batch.put(client_batch_item)
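
A hedged sketch of how a caller might use that split so each element is converted and measured only once (the size helper and limit attribute here are hypothetical, not part of the suggestion):

    # Convert once, measure once, then add to the current batch.
    client_entity = self.element_to_client_batch_item(element)
    size = self._entity_byte_size(client_entity)  # hypothetical size helper
    if self._batch_bytes_size + size > self._max_batch_bytes:  # assumed limit
        self._flush_batch()
    self.add_to_batch(client_entity)
    self._batch_bytes_size += size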

@sadovnychyi (Contributor, Author):

I've made the related changes, but I still couldn't avoid storing the batch and the raw batch elements separately.
We could skip the remaining elements (those over the bytes limit) when building the batch in write_mutations and let them be picked up by the next batch (so we wouldn't purge _batch, but there might be some elements left over from the previous batch) -- but that seems to create more confusion.

@udim (Member) commented Nov 1, 2019:

I think your choice is a valid compromise. As for what to do with the results of element_to_client_batch_item, I see two options (a sketch contrasting them follows below):

  1. Save the client_element items.
  2. Discard the client_element items after calling ByteSize() on them.

The first option seems more CPU efficient, while the second seems more memory efficient. I don't know which option is faster, but from my limited experience, saving CPU at the expense of RAM seems like a good tradeoff.
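
A minimal sketch contrasting the two options (all names are illustrative, and it assumes the converted item exposes a protobuf-style ByteSize(), as discussed above):

    # Option 1: keep the converted item for reuse at commit time (CPU-friendly).
    client_item = self.element_to_client_batch_item(element)
    self._batch_bytes_size += client_item.ByteSize()
    self._client_items.append(client_item)  # retained for the commit/retry path

    # Option 2: record only the size, then discard the item (RAM-friendly).
    client_item = self.element_to_client_batch_item(element)
    self._batch_bytes_size += client_item.ByteSize()
    del client_item  # reconverted later if needed again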

@aaltay (Member) commented Oct 4, 2019

What are the next steps for this PR?

@sadovnychyi (Contributor, Author)

Another review pass would be helpful.

@aaltay (Member) commented Oct 4, 2019

@udim could you please make another pass when you get a chance?

@udim (Member) commented Oct 29, 2019

run python precommit

@udim (Member) left a review:

Thanks! LGTM

@udim udim merged commit c870c63 into apache:master Nov 1, 2019
11moon11 pushed a commit to 11moon11/beam that referenced this pull request Nov 4, 2019
* Fix datastore writes failing on retry

* Reconstruct datastore batch from scratch after failures

* Move write_mutations into datastoreio
pl04351820 pushed a commit to pl04351820/beam that referenced this pull request Dec 20, 2023