New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BEAM-3342] Create a Cloud Bigtable IO connector for Python #8457
Conversation
…bility requirement)
…c Jenkins build errors
@chamikaramj and @aaltay, can you PTAL? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
Please let me know when this is good for another review. |
I asked @mf2199 to remove the use of the |
Either way is fine. But if you need my help to get the dynamic work rebanancing issue resolved for the current (mostly ready) version happy to help with that as well. If we go PTransforms / DoFn's. route we'll have to wait till SDF to support dynamic work rebalancing. |
@chamikaramj, Cloud Bigtable isn't an ideal candidate for dynamic work rebalancing in general. Java does its best to approximate dynamic work rebalancing, but that had some unintended consequences. I think that we ought to approach the Python connector with as simple of an implementation as possible until we know for sure that it absolutely needs the fancy Dataflow features. |
…rd BoundedSource class recommended for a general case.
@mf2199 please let me know when this is ready for review again. |
@chamikaramj You could review it now. |
@mf2199, the tests fail, and it looks like there is a conflicting file. |
sample_row_keys.insert(0, first_key) | ||
sample_row_keys = list(sample_row_keys) | ||
|
||
def split_source(unused_impulse): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method doesn't need to exist. I would think that you could use bundles
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the way it's implemented in Apache Beam iobase.Read(PTransform). The FlatMap needs a callable object to process the elements in parallel, and the split_source
makes up that callable. I'd also suggest we use similar naming convention for better unification/readability. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Would it make sense to call this _split_source
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed it too. The iobase.Read
version doesn't have the underscore. It's whether we prefer the "proper" underscored way or the "unified" non-underscored one.
Is there an ETA for landing this? Thanks for all the work! |
Sorry about the delay here. Will do another review round early next week. |
Thanks for the update! |
Sorry about the dalay. Will take a look this week. |
Can you please address test failures and conflicts ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
@chamikaramj For some reason there no longer appear to be any conflicts. |
+1 for starting a new PR. It's surprising to hear that Jenkins IT trigger does not capture your updates. Hopefully you'll not run into this in the new PR. If you do prob. worth an email to the dev list to check if someone else has run into that. |
@chamikaramj Fixed an error and tried to re-trigger Jenkins with PR #11295 - still no luck. Maybe it's really worth asking around. |
Only committers can trigger Jenkins tests. I triggered Python PreCommit and PostCommit for the new PR. Lemme know if tests should be re-triggered or a different test suite should be triggered. |
@aaltay That PR was opened mostly to re-test the build errors. As it turned out, I'm unable to run those from my end, like it's normally done with some other Google repos. Anyway, closed that for now, will reopen if needed. |
I reopened that PR and triggered tests. Please address any failures. Let's continue the review there. Closing this PR. |
@chamikaramj The other PR uses a different branch. I'm gonna update it then. |
Thanks. |
The initial version of Google Cloud Bigtable IO connector.
The connector implements BigtableSource() class as the BoundedSource, using LexicographicKeyRangeTracker() class as the corresponding RangeTracker.At this stage, the table is read as a whole. The two supplementary files, 'bigtableio_test.py' and 'bigtableio_it_test.py', provide the code for unit and integration tests, respectively.Note about the unit test: As the evidence suggests, the assert_split_at_fraction_exhaustive() function of 'source_test_utils.py' fails to work properly with the LexicographicKeyRangeTracker() class. Patching the 'source_test_utils.py' eliminated some but not all of the errors. Since all the other tests are passed, including the integration test, and the issue seems to be unrelated to the BigtableSource() code, it was decided to temporarily bypass the test_dynamic_work_rebalancing() function until the 'source_test_utils.py' is fully debugged.
Note about the integration test: The test script requires certain command line arguments. Refer to 'bigtableio_it_test.py' for more specifics.
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
R: @username
).[BEAM-XXX] Fixes bug in ApproximateQuantiles
, where you replaceBEAM-XXX
with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.Post-Commit Tests Status (on master branch)
Pre-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.