[BEAM-3342] Create a Cloud Bigtable IO connector for Python #8457

mf2199 · 2019-05-01T18:51:02Z

The initial version of the Google Cloud Bigtable IO connector. ~~The connector implements BigtableSource() class as the BoundedSource, using LexicographicKeyRangeTracker() class as the corresponding RangeTracker.~~ At this stage, the table is read as a whole. The two supplementary files, 'bigtableio_test.py' and 'bigtableio_it_test.py', provide the code for unit and integration tests, respectively.

Note about the unit test: The evidence suggests that the assert_split_at_fraction_exhaustive() function of 'source_test_utils.py' does not work properly with the LexicographicKeyRangeTracker() class. Patching 'source_test_utils.py' eliminated some but not all of the errors. Since all the other tests pass, including the integration test, and the issue seems to be unrelated to the BigtableSource() code, the test_dynamic_work_rebalancing() function is temporarily bypassed until 'source_test_utils.py' is fully debugged.
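For readers unfamiliar with what "splitting at a fraction" means for a lexicographic key range, here is a minimal, standard-library-only sketch. It is not Beam's implementation of LexicographicKeyRangeTracker; the function names and the `precision` parameter are hypothetical, and it assumes both keys are non-empty byte strings with `start < end`.

```python
from fractions import Fraction


def key_to_int(key: bytes, width: int) -> int:
    """Read `key`, right-padded with zero bytes to `width`, as a big-endian integer."""
    return int.from_bytes(key.ljust(width, b"\x00"), "big")


def position_at_fraction(start: bytes, end: bytes, fraction: float,
                         precision: int = 2) -> bytes:
    """Return a key lying roughly `fraction` of the way from `start` to `end`.

    Hypothetical helper for illustration only. `precision` adds extra bytes
    so that a midpoint exists even between adjacent short keys.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be within [0, 1]")
    if start >= end:
        raise ValueError("start must sort strictly before end")
    width = max(len(start), len(end)) + precision
    lo = key_to_int(start, width)
    hi = key_to_int(end, width)
    # Fraction keeps the arithmetic exact for arbitrarily long keys.
    mid = lo + int((hi - lo) * Fraction(fraction))
    # Trailing zero bytes only come from padding, so drop them.
    return mid.to_bytes(width, "big").rstrip(b"\x00")


if __name__ == "__main__":
    print(position_at_fraction(b"a", b"c", 0.5))
```

A dynamic-rebalancing test exercises this kind of interpolation at many fractions and checks that the two resulting sub-ranges together cover exactly the original range.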

Note about the integration test: The test script requires certain command line arguments. Refer to 'bigtableio_it_test.py' for more specifics.

sduskis · 2019-05-03T11:44:17Z

@chamikaramj and @aaltay, can you PTAL?

chamikaramj

Thanks.

sdks/python/apache_beam/io/gcp/bigtableio.py

sdks/python/apache_beam/io/gcp/bigtableio_it_test.py

sdks/python/apache_beam/io/gcp/bigtableio_test.py

chamikaramj · 2019-05-17T16:59:59Z

Please let me know when this is good for another review.

sduskis · 2019-05-19T18:24:20Z

I asked @mf2199 to remove the use of the BoundedSource and use PTransforms / DoFn's. I don't think we ought to support desired bundle size or dynamic rebalancing at this point in time in the Python connector.

chamikaramj · 2019-05-20T17:33:52Z

Either way is fine. But if you need my help to get the dynamic work rebalancing issue resolved for the current (mostly ready) version, I'm happy to help with that as well.

If we go the PTransforms / DoFns route, we'll have to wait for SDF to support dynamic work rebalancing.

sduskis · 2019-05-20T17:48:01Z

@chamikaramj, Cloud Bigtable isn't an ideal candidate for dynamic work rebalancing in general. Java does its best to approximate dynamic work rebalancing, but that had some unintended consequences. I think that we ought to approach the Python connector with as simple an implementation as possible until we know for sure that it absolutely needs the fancy Dataflow features.

chamikaramj · 2019-06-24T17:52:44Z

@mf2199 please let me know when this is ready for review again.

mf2199 · 2019-06-25T07:19:20Z

@chamikaramj You could review it now.

sduskis · 2019-07-01T12:58:49Z

@mf2199, the tests fail, and it looks like there is a conflicting file.

.gitignore

sdks/python/apache_beam/io/gcp/bigtableio.py

sduskis · 2019-07-01T13:59:28Z

sdks/python/apache_beam/io/gcp/bigtableio.py

+        sample_row_keys.insert(0, first_key)
+        sample_row_keys = list(sample_row_keys)
+
+    def split_source(unused_impulse):


sduskis · Jul 1, 2019

This method doesn't need to exist. I would think that you could use bundles

mf2199 · Jul 2, 2019 (edited)

This is the way it's implemented in Apache Beam's iobase.Read(PTransform). The FlatMap needs a callable object to process the elements in parallel, and split_source is that callable. I'd also suggest we use a similar naming convention for better unification/readability. What do you think?

sduskis · Jul 2, 2019

Got it. Would it make sense to call this _split_source?

mf2199 · Jul 2, 2019

I noticed it too. The iobase.Read version doesn't have the underscore. The question is whether we prefer the "proper" underscored way or the "unified" non-underscored one.
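To make the discussion above concrete, here is a minimal sketch of what a split callable of this kind might do with sampled row keys: turn them into contiguous key ranges that downstream workers can read in parallel. It uses only the standard library and is not the code from this PR. `key_ranges_from_samples` and the extra `sample_keys` argument are hypothetical, and the sketch assumes an empty key stands for an open table boundary.

```python
from typing import Iterable, Iterator, List, Tuple

KeyRange = Tuple[bytes, bytes]


def key_ranges_from_samples(sample_keys: Iterable[bytes]) -> List[KeyRange]:
    """Build contiguous (start, end) ranges from sampled split points.

    Assumption of this sketch: b"" as a start key means "beginning of the
    table" and b"" as an end key means "end of the table".
    """
    # Deduplicate and sort the interior split points, ignoring empty keys.
    boundaries = sorted({key for key in sample_keys if key != b""})
    points = [b""] + boundaries + [b""]
    # Pair each boundary with the next one to get adjacent ranges.
    return list(zip(points[:-1], points[1:]))


def split_source(unused_impulse, sample_keys: Iterable[bytes]) -> Iterator[KeyRange]:
    """Hypothetical FlatMap-style callable: one input element, many ranges out."""
    for key_range in key_ranges_from_samples(sample_keys):
        yield key_range


if __name__ == "__main__":
    for start, end in split_source(None, [b"m", b"f", b""]):
        print(start, end)
```

Because the callable emits one element per range, a FlatMap over a single impulse element fans out into as many work items as there are ranges, which is the role split_source plays in the thread above.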

eddie-scio · 2019-07-17T22:05:04Z

Is there an ETA for landing this? Thanks for all the work!

chamikaramj · 2019-07-19T21:21:28Z

Sorry about the delay here. Will do another review round early next week.

eddie-scio · 2019-07-19T21:23:07Z

Thanks for the update!

chamikaramj · 2020-03-18T15:02:27Z

Sorry about the delay. Will take a look this week.

chamikaramj · 2020-03-23T18:10:22Z

Can you please address test failures and conflicts?

chamikaramj

Thanks!

sdks/python/apache_beam/io/gcp/bigtableio.py

sdks/python/apache_beam/io/gcp/bigtableio_read_it_test.py

mf2199 · 2020-04-17T17:35:37Z

> Can you please address test failures and conflicts?

@chamikaramj For some reason there no longer appear to be any conflicts.

chamikaramj · 2020-05-08T17:56:42Z

+1 for starting a new PR. It's surprising to hear that the Jenkins IT trigger does not capture your updates. Hopefully you'll not run into this in the new PR. If you do, it's probably worth an email to the dev list to check whether someone else has run into that.

mf2199 · 2020-05-08T22:28:48Z

@chamikaramj Fixed an error and tried to re-trigger Jenkins with PR #11295 - still no luck. Maybe it's really worth asking around.

chamikaramj · 2020-05-08T22:32:23Z

Only committers can trigger Jenkins tests. I triggered Python PreCommit and PostCommit for the new PR. Lemme know if tests should be re-triggered or a different test suite should be triggered.

aaltay · 2020-05-21T18:32:11Z

There are still failing tests on #11295. @mf2199 - What is the next step for this PR?

aaltay · 2020-06-06T00:19:47Z

> There are still failing tests on #11295. @mf2199 - What is the next step for this PR?

Ping on this? What is our plan for this PR?

mf2199 · 2020-06-06T16:37:55Z

@aaltay That PR was opened mostly to re-test the build errors. As it turned out, I'm unable to run those from my end, like it's normally done with some other Google repos. Anyway, closed that for now, will reopen if needed.

chamikaramj · 2020-06-07T00:24:31Z

I reopened that PR and triggered tests. Please address any failures. Let's continue the review there.

Closing this PR.

mf2199 · 2020-06-07T00:27:34Z

@chamikaramj The other PR uses a different branch. I'm gonna update it then.

chamikaramj · 2020-06-07T00:35:18Z

Thanks.

mf2199 added 7 commits May 1, 2019 14:34

Initial version of Google Cloud Bigtable IO connector

6fa024d

Changed import package name

156358a

Fixed paranthesis formatting for 'print' commands (Python 3.x compatibility requirement)

90697db

changes to 'import' directives

ad9d3b9

[no comments]

8cc6485

experimenting with the code format - an attempt to tackle non-specific Jenkins build errors

99b0efd

Partially undoing the latest changes.

c2270cb

mf2199 changed the title ~~Initial version of Google Cloud Bigtable IO connector~~ [BEAM-7212] Initial version of Google Cloud Bigtable IO connector May 2, 2019

mf2199 changed the title ~~[BEAM-7212] Initial version of Google Cloud Bigtable IO connector~~ [BEAM-3342] Initial version of Google Cloud Bigtable IO connector May 3, 2019

chamikaramj self-requested a review May 3, 2019 17:03

chamikaramj requested changes May 10, 2019

iemejia changed the title ~~[BEAM-3342] Initial version of Google Cloud Bigtable IO connector~~ [BEAM-3342] Create a Cloud Bigtable IO connector for Python Jun 14, 2019

Implemented the PTransform/DoFn scheme as a replacement to the standard BoundedSource class recommended for a general case.

b8c85bb

Chenged default num_workers value from 300 to 10

acc5a20

Removed private classes from the interface; disabled the experiments (left commented)

8819ee8

Merge branch 'master' into bigtableio

c4bd3ef

sduskis reviewed Jul 1, 2019

sduskis suggested changes Jul 1, 2019

Addressing the PR coments

89d9424

mf2199 requested a review from pabloem March 4, 2020 16:11

chamikaramj reviewed Mar 23, 2020

mf2199 added 11 commits April 2, 2020 15:18

Code rearranged

c3d4249

Updated docs

fdc426d

Minor fix

1962866

Minor fix

096953c

Minor fix

a8878d5

Minor fix

7eca091

Formatting...

d8c0130

Merge branch 'master' into bigtableio

3f6f37c

Removed usage of iobase.SourceBundle

e9874c4

refactoring

0d964ac

Added unit tests

bc2a8dd

Minor lint fix

7bb69e1

mf2199 added a commit to MaxxleLLC/beam that referenced this pull request May 8, 2020

Code transfer from apache#8457

254913c

mf2199 requested a review from chamikaramj May 8, 2020 22:25

Merge branch 'master' of https://github.com/apache/beam into bigtableio

2f3e37b

chamikaramj closed this Jun 7, 2020
