Skip to content

Conversation

@yixiaoshen
Copy link
Contributor

Allows DatastoreIO to be configured with a read_time, which will make the Beam workload issue all its Datastore reads using the same read_time and get the consistent read result from Datastore across all its split reads.

addresses #22051


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests

See CI.md for more information about GitHub Actions CI.

@asf-ci
Copy link

asf-ci commented Jun 24, 2022

Can one of the admins verify this patch?

4 similar comments
@asf-ci
Copy link

asf-ci commented Jun 24, 2022

Can one of the admins verify this patch?

@asf-ci
Copy link

asf-ci commented Jun 24, 2022

Can one of the admins verify this patch?

@asf-ci
Copy link

asf-ci commented Jun 24, 2022

Can one of the admins verify this patch?

@asf-ci
Copy link

asf-ci commented Jun 24, 2022

Can one of the admins verify this patch?

@codecov
Copy link

codecov bot commented Jun 24, 2022

Codecov Report

Merging #22052 (af7e828) into master (54b0784) will decrease coverage by 0.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #22052      +/-   ##
==========================================
- Coverage   74.19%   74.17%   -0.02%     
==========================================
  Files         706      706              
  Lines       93229    93229              
==========================================
- Hits        69168    69154      -14     
- Misses      22793    22807      +14     
  Partials     1268     1268              
Flag Coverage Δ
python 83.54% <ø> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...n/apache_beam/ml/gcp/recommendations_ai_test_it.py 73.46% <0.00%> (-2.05%) ⬇️
.../python/apache_beam/transforms/periodicsequence.py 98.38% <0.00%> (-1.62%) ⬇️
...eam/runners/portability/fn_api_runner/execution.py 92.44% <0.00%> (-0.65%) ⬇️
...ks/python/apache_beam/runners/worker/data_plane.py 87.57% <0.00%> (-0.57%) ⬇️
sdks/python/apache_beam/runners/direct/executor.py 96.46% <0.00%> (-0.55%) ⬇️
...hon/apache_beam/runners/worker/bundle_processor.py 93.30% <0.00%> (-0.50%) ⬇️

Help us with your feedback. Take ten seconds to tell us how you rate us.

@yixiaoshen
Copy link
Contributor Author

R: @pcostell

@pcostell
Copy link

pcostell commented Jul 7, 2022

LGTM

@yixiaoshen
Copy link
Contributor Author

R: @chamikaramj

@chamikaramj chamikaramj self-requested a review July 19, 2022 20:29
Copy link
Contributor

@chamikaramj chamikaramj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

Timestamp readTimeProto = Timestamps.fromMillis(readTime.getMillis());
return querySplitter.getSplits(query, partitionId, numSplits, datastore, readTimeProto);
}
return querySplitter.getSplits(query, partitionId, numSplits, datastore);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to confirm, do sub-queries maintain the same read time ?

Also, I wonder if we can implement a better splitter function here by using different read times (but that can be separate).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sub-queries all maintain the same read time, in that way we can guarantee a consistent read across the entire database from the beam workload.

Split function queries Datastore using the split points (a special index), having the split function use the same read time as all other queries will make sure the split points are accurate as of that read time, this can impact how well and balanced the beam workload can be partitioned.

.read()
.withProjectId(PROJECT_ID)
.withQuery(QUERY)
.withReadTime(TIMESTAMP);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do existing tests fail without setting the new parameter ?
If so, could customers run into similar issues ?

Copy link
Contributor Author

@yixiaoshen yixiaoshen Jul 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the added readTime is optional and does not need to be specified if customers don't care about read consistency across sub-queries. e.g. just above this test (testBuildReadWithReadTime) there's another test (testBuildRead) that builds a DatastoreIO read with exactly the same arguments except that it doesn't specify readTime.

Also in V1ReadIT test, it does a read without specifying read time, and does another read with read time, both can work. So customers existing workload don't need to be changed and it won't break.

@chamikaramj
Copy link
Contributor

Run Java PreCommit

@chamikaramj
Copy link
Contributor

Run Java_PVR_Flink_Batch PreCommit

@chamikaramj
Copy link
Contributor

Retest this please

@chamikaramj
Copy link
Contributor

Run SQL_Java11 PreCommit

@chamikaramj chamikaramj merged commit edd915d into apache:master Jul 27, 2022
@yixiaoshen yixiaoshen deleted the ys-datastore-readtime branch July 27, 2022 18:51
@apilloud
Copy link
Member

@yixiaoshen
Copy link
Contributor Author

attempting to fix with #22484

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants