[BEAM-3342] Create a Cloud Bigtable IO connector for Python #8457

Closed
wants to merge 45 commits into from

@mf2199 mf2199 commented May 1, 2019

The initial version of the Google Cloud Bigtable IO connector. The connector implements the BigtableSource() class as a BoundedSource, using the LexicographicKeyRangeTracker() class as the corresponding RangeTracker. At this stage, the table is read as a whole. The two supplementary files, 'bigtableio_test.py' and 'bigtableio_it_test.py', provide the code for unit and integration tests, respectively.

Note about the unit test: The evidence suggests that the assert_split_at_fraction_exhaustive() function in 'source_test_utils.py' does not work properly with the LexicographicKeyRangeTracker() class. Patching 'source_test_utils.py' eliminated some but not all of the errors. Since all the other tests pass, including the integration test, and the issue seems to be unrelated to the BigtableSource() code, it was decided to temporarily bypass the test_dynamic_work_rebalancing() function until 'source_test_utils.py' is fully debugged.

Note about the integration test: The test script requires certain command line arguments. Refer to 'bigtableio_it_test.py' for more specifics.
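
For readers unfamiliar with splitting a lexicographic key range at a fraction, which is the operation the unit-test note above refers to, here is a minimal pure-Python sketch of the idea. It is not Beam's LexicographicKeyRangeTracker implementation; the helper names `_key_to_int` and `split_key_at_fraction` are hypothetical, and the sketch assumes non-empty keys that do not end in zero bytes.

```python
from fractions import Fraction


def _key_to_int(key: bytes, width: int) -> int:
    """Read a key as a big-endian integer, right-padded with zero bytes."""
    return int.from_bytes(key.ljust(width, b"\x00"), "big")


def split_key_at_fraction(start: bytes, end: bytes, fraction: float) -> bytes:
    """Return a key roughly `fraction` of the way from `start` to `end`.

    Hypothetical helper: assumes start sorts before end and that keys do not
    end in zero bytes (trailing zero bytes are stripped from the result).
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be within [0, 1]")
    if start >= end:
        raise ValueError("start must sort before end")
    # One extra byte of width gives finer resolution than the inputs have.
    width = max(len(start), len(end)) + 1
    lo = _key_to_int(start, width)
    hi = _key_to_int(end, width)
    # Fraction keeps the interpolation exact for arbitrarily long keys.
    offset = int(Fraction(fraction) * (hi - lo))
    return (lo + offset).to_bytes(width, "big").rstrip(b"\x00")


midpoint = split_key_at_fraction(b"a", b"c", 0.5)
```

A split test of the kind described would typically check that any such split key falls inside the original range, so that the primary and residual ranges together cover every row exactly once.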


@mf2199 mf2199 changed the title Initial version of Google Cloud Bigtable IO connector [BEAM-7212] Initial version of Google Cloud Bigtable IO connector May 2, 2019
@sduskis
Contributor

sduskis commented May 3, 2019

@chamikaramj and @aaltay, can you PTAL?

@mf2199 mf2199 changed the title [BEAM-7212] Initial version of Google Cloud Bigtable IO connector [BEAM-3342] Initial version of Google Cloud Bigtable IO connector May 3, 2019
@chamikaramj chamikaramj self-requested a review May 3, 2019 17:03
Contributor

@chamikaramj chamikaramj left a comment

Thanks.

sdks/python/apache_beam/io/gcp/bigtableio.py (outdated, resolved)
sdks/python/apache_beam/io/gcp/bigtableio.py (outdated, resolved)
sdks/python/apache_beam/io/gcp/bigtableio.py (outdated, resolved)
sdks/python/apache_beam/io/gcp/bigtableio.py (outdated, resolved)
sdks/python/apache_beam/io/gcp/bigtableio.py (outdated, resolved)
sdks/python/apache_beam/io/gcp/bigtableio_it_test.py (outdated, resolved)
sdks/python/apache_beam/io/gcp/bigtableio_it_test.py (outdated, resolved)
sdks/python/apache_beam/io/gcp/bigtableio_test.py (outdated, resolved)
sdks/python/apache_beam/io/gcp/bigtableio_test.py (outdated, resolved)
sdks/python/apache_beam/io/gcp/bigtableio_test.py (outdated, resolved)
@chamikaramj
Contributor

Please let me know when this is good for another review.

@sduskis
Contributor

sduskis commented May 19, 2019

I asked @mf2199 to remove the use of the BoundedSource and use PTransforms / DoFn's. I don't think we ought to support desired bundle size or dynamic rebalancing at this point in time in the Python connector.

@chamikaramj
Contributor

Either way is fine. But if you need my help to resolve the dynamic work rebalancing issue in the current (mostly ready) version, I'm happy to help with that as well.

If we go the PTransforms / DoFns route, we'll have to wait for SDF to support dynamic work rebalancing.

@sduskis
Contributor

sduskis commented May 20, 2019

@chamikaramj, Cloud Bigtable isn't an ideal candidate for dynamic work rebalancing in general. Java does its best to approximate dynamic work rebalancing, but that had some unintended consequences. I think that we ought to approach the Python connector with as simple an implementation as possible until we know for sure that it absolutely needs the fancy Dataflow features.

@iemejia iemejia changed the title [BEAM-3342] Initial version of Google Cloud Bigtable IO connector [BEAM-3342] Create a Cloud Bigtable IO connector for Python Jun 14, 2019
…rd BoundedSource class recommended for a general case.
@chamikaramj
Contributor

@mf2199 please let me know when this is ready for review again.

@mf2199
Author

mf2199 commented Jun 25, 2019

@chamikaramj You could review it now.

@sduskis
Contributor

sduskis commented Jul 1, 2019

@mf2199, the tests fail, and it looks like there is a conflicting file.

.gitignore (outdated, resolved)
sdks/python/apache_beam/io/gcp/bigtableio.py (outdated, resolved)
sdks/python/apache_beam/io/gcp/bigtableio.py (outdated, resolved)
sample_row_keys.insert(0, first_key)
sample_row_keys = list(sample_row_keys)

def split_source(unused_impulse):
Contributor

This method doesn't need to exist. I would think that you could use bundles

Author

@mf2199 mf2199 Jul 2, 2019

This is the way it's implemented in Apache Beam's iobase.Read(PTransform). The FlatMap needs a callable object to process the elements in parallel, and split_source is that callable. I'd also suggest we use a similar naming convention for consistency and readability. What do you think?

Contributor

Got it. Would it make sense to call this _split_source?

Author

I noticed it too. The iobase.Read version doesn't have the underscore. The question is whether we prefer the "proper" underscored way or the "unified" non-underscored one.
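
To make the thread above concrete, here is a hedged pure-Python sketch of what a split_source-style callable could look like: it ignores its single input element and yields one key range per pair of adjacent sample row keys. This is not the code from the PR. The factory name `make_split_source` and the `KeyRange` alias are hypothetical, and the sketch assumes an empty key marks the open start or end of the table.

```python
from typing import Callable, Iterator, List, Tuple

# Hypothetical alias: (start_key, end_key); b"" means "open-ended".
KeyRange = Tuple[bytes, bytes]


def make_split_source(
    sample_row_keys: List[bytes],
) -> Callable[[object], Iterator[KeyRange]]:
    """Build a callable that emits key ranges between adjacent sample keys."""
    # Drop empty sample keys, then add the open start and end boundaries.
    inner = [key for key in sample_row_keys if key != b""]
    boundaries = [b""] + inner + [b""]

    def split_source(unused_impulse: object) -> Iterator[KeyRange]:
        # The input element only triggers the split; its value is ignored.
        for start, end in zip(boundaries, boundaries[1:]):
            yield (start, end)

    return split_source


split_source = make_split_source([b"g", b"p"])
ranges = list(split_source(None))
```

In a pipeline, a callable like this would be passed to a FlatMap step fed by a single trigger element, so each emitted range can then be read in parallel by a downstream step.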

@eddie-scio

Is there an ETA for landing this? Thanks for all the work!

@chamikaramj
Contributor

Sorry about the delay here. Will do another review round early next week.

@eddie-scio

Thanks for the update!

@mf2199 mf2199 requested a review from pabloem March 4, 2020 16:11
@chamikaramj
Contributor

Sorry about the delay. Will take a look this week.

@chamikaramj
Contributor

Can you please address test failures and conflicts?

Contributor

@chamikaramj chamikaramj left a comment

Thanks!

sdks/python/apache_beam/io/gcp/bigtableio.py (resolved)
sdks/python/apache_beam/io/gcp/bigtableio.py (outdated, resolved)
sdks/python/apache_beam/io/gcp/bigtableio.py (outdated, resolved)
sdks/python/apache_beam/io/gcp/bigtableio.py (outdated, resolved)
sdks/python/apache_beam/io/gcp/bigtableio.py (outdated, resolved)
@mf2199
Author

mf2199 commented Apr 17, 2020

> Can you please address test failures and conflicts?

@chamikaramj For some reason there no longer appear to be any conflicts.

@chamikaramj
Contributor

+1 for starting a new PR. It's surprising to hear that the Jenkins IT trigger does not capture your updates. Hopefully you won't run into this in the new PR. If you do, it's probably worth an email to the dev list to check whether someone else has run into that.

mf2199 added a commit to MaxxleLLC/beam that referenced this pull request May 8, 2020
@mf2199 mf2199 requested a review from chamikaramj May 8, 2020 22:25
@mf2199
Author

mf2199 commented May 8, 2020

@chamikaramj Fixed an error and tried to re-trigger Jenkins with PR #11295 - still no luck. Maybe it's really worth asking around.

@chamikaramj
Contributor

Only committers can trigger Jenkins tests. I triggered Python PreCommit and PostCommit for the new PR. Lemme know if tests should be re-triggered or a different test suite should be triggered.

@aaltay
Member

aaltay commented May 21, 2020

There are still failing tests on #11295. @mf2199 - What is the next step for this PR?

@aaltay
Member

aaltay commented Jun 6, 2020

> There are still failing tests on #11295. @mf2199 - What is the next step for this PR?

Ping on this? What is our plan for this PR?

@mf2199
Author

mf2199 commented Jun 6, 2020

@aaltay That PR was opened mostly to re-test the build errors. As it turned out, I'm unable to run those from my end, like it's normally done with some other Google repos. Anyway, closed that for now, will reopen if needed.

@chamikaramj
Contributor

I reopened that PR and triggered tests. Please address any failures. Let's continue the review there.

Closing this PR.

@chamikaramj chamikaramj closed this Jun 7, 2020
@mf2199
Author

mf2199 commented Jun 7, 2020

@chamikaramj The other PR uses a different branch. I'm gonna update it then.

@chamikaramj
Contributor

Thanks.

10 participants