
[BEAM-11266] Python IO MongoDB: add bucket_auto aggregation option for bundling in Atlas. #13350

Merged

Conversation

nikie (Contributor) commented Nov 15, 2020

This fixes BEAM-11266, allowing ReadFromMongoDB to be used with MongoDB Atlas by optionally using the $bucketAuto MongoDB aggregation instead of splitVector:

  pipeline | ReadFromMongoDB(uri='mongodb+srv://user:pwd@cluster0.mongodb.net',
                             db='testdb',
                             coll='input',
                             bucket_auto=True)

This enhancement is based on the solution provided by Susumu Asaga in a comment on BEAM-4567, which concerns the Java MongoDB connector.
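For reference, $bucketAuto is a standard MongoDB aggregation stage and is available on Atlas, where splitVector is not permitted. A minimal sketch of the kind of pipeline involved follows; the exact stage parameters used by the connector are not shown in this thread, so treat the values here as illustrative:

```python
# Illustrative $bucketAuto aggregation pipeline, built as plain dicts
# (no server needed). bucket_count is hypothetical; the connector derives
# it from the collection size and the desired bundle size.
bucket_count = 8

pipeline = [
    {
        "$bucketAuto": {
            "groupBy": "$_id",        # split the collection along _id
            "buckets": bucket_count,  # ask for roughly equal-sized buckets
        }
    }
]

# Each result document has the shape:
#   {"_id": {"min": <first _id>, "max": <last _id>}, "count": <n>}
# which gives the connector _id ranges to use as bundle boundaries.
```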


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make the review process smoother.


nikie (Contributor, Author) commented Nov 15, 2020

Hi! Could someone review this, please?
R: @chamikaramj
R: @aaltay
R: @y1chi

chamikaramj (Contributor)

@y1chi can you take a look ?

iemejia (Member) commented Nov 16, 2020

@nikie, in case you have extra cycles to create a PR for the Java connector, I will be glad to take a look.

  # single document not splittable
  return []
  size = self.estimate_size()
  bucket_count = size // desired_chunk_size
y1chi (Contributor) commented Nov 16, 2020

The split function will likely be called recursively for dynamic rebalancing: a range with start_pos and end_pos can be further split upon backend request, so it might not be reasonable to always use the total collection size divided by desired_chunk_size to calculate the bucket count. Is it possible to get only the buckets within the given _id range? We could probably use the average document size times the number of documents to estimate the size of the range being split.
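One way to realize this suggestion (a sketch of the idea, not the code under review): prepend a $match on the _id range so that $bucketAuto only buckets documents within the split being re-split. The function and parameter names here are hypothetical:

```python
def bucket_auto_pipeline(start_pos, end_pos, bucket_count):
    """Build an aggregation that buckets only documents in [start_pos, end_pos)."""
    id_range = {}
    if start_pos is not None:
        id_range["$gte"] = start_pos
    if end_pos is not None:
        id_range["$lt"] = end_pos

    pipeline = []
    if id_range:
        # Restrict bucketing to the range being re-split, so recursive
        # splits never produce buckets outside [start_pos, end_pos).
        pipeline.append({"$match": {"_id": id_range}})
    pipeline.append({"$bucketAuto": {"groupBy": "$_id", "buckets": bucket_count}})
    return pipeline
```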

nikie (Contributor, Author) replied:

Thanks, @y1chi!
I will look more closely at how it works in the Java SDK and will try to filter by _id ranges.

  if size % desired_chunk_size != 0:
    bucket_count += 1
  with beam.io.mongodbio.MongoClient(self.uri, **self.spec) as client:
    buckets = list(
y1chi (Contributor) commented:

The returned buckets should be guaranteed to stay within the start_pos/end_pos _id range; otherwise the same document could be read multiple times.

nikie (Contributor, Author) replied:

Fixed: the returned buckets now cover exactly the requested range.

nikie (Contributor, Author) commented Nov 17, 2020

> @nikie in case you have the extra cycles to create a PR for the Java connector I will be glad to take a look

@iemejia, the Java connector already has this feature implemented: the withBucketAuto method.

iemejia (Member) commented Nov 18, 2020

Thanks @nikie, I had forgotten this was already fixed (and didn't check 🥵)!

Commit: [BEAM-11266] Python IO MongoDB: add bucket_auto aggregation option for bundling in Atlas.

Bucket_auto mode:
- Respects dynamic rebalancing by filtering on the _id range in each split.
- Respects custom filter by merging it with the _id range filter, so
  that splits hold similar number of docs actually matching the filter
  (not possible in splitVector mode).
- Estimates bundle size for non-initial splits by counting docs with
  filters applied and using 'avgObjSize' from MongoDB collstats.
- Uses a bundle generator shared with splitVector mode, making it clear
  that both modes cover the same cases.

Misc:
- Refactor _merge_id_filter to use '$and' only if necessary.
- Fix an off-by-one issue with the single-document-not-splittable checks
  for both bucket_auto and splitVector modes (unit test added;
  before the fix the if branches were unreachable).

Unit tests:
- Increase coverage and sanity.
- Refactor collection mock filter and projection handling.

Integration tests:
- Add read cases: splitVector/bucket_auto * filter/no-filter.
- Add checks for expected docs count.
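The "_merge_id_filter uses '$and' only if necessary" refactor mentioned in the commit message can be sketched roughly as follows. This is a hypothetical reconstruction, not the PR's actual code; the function name mirrors the one in the commit message:

```python
def merge_id_filter(id_filter, custom_filter=None):
    """Merge an _id range filter with a user-supplied filter.

    Wraps the two in $and only when a custom filter is actually present.
    When there is no custom filter, the _id filter is returned as-is,
    since MongoDB already treats top-level keys as an implicit AND.
    """
    if not custom_filter:
        return id_filter
    # $and is needed here because the custom filter may itself constrain
    # _id, and a plain dict merge would silently drop one constraint.
    return {"$and": [id_filter, custom_filter]}
```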
nikie (Contributor, Author) commented Nov 22, 2020

@y1chi
I have implemented your suggested changes and more (see the last commit message for more details):

  • auto-bucketing respects not only the _id range but also the custom filter, for both document counting and the aggregation (this might feel like overhead, but it should produce more precise splits);
  • improved unit and integration tests.
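The bundle-size estimate described in the commit message (the matching document count times 'avgObjSize' from MongoDB collstats) can be sketched as below. The function names are illustrative, not the connector's actual API:

```python
def estimate_split_size(doc_count, avg_obj_size):
    """Estimate a split's size in bytes from the count of documents
    matching the filters and the collection's average document size
    (the 'avgObjSize' field of MongoDB collstats)."""
    return doc_count * avg_obj_size


def bucket_count_for(split_size_bytes, desired_chunk_size_bytes):
    """Number of buckets needed so each bundle is at most the desired
    chunk size: ceiling division, with a floor of one bucket."""
    return max(1, -(-split_size_bytes // desired_chunk_size_bytes))
```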

Java's MongoDBIO works differently:

  • there is a numSplits option which controls the number of auto buckets (10 by default) and the number of splitVector buckets if set;
  • it estimates the desired bundle size only for splitVector mode when numSplits is not provided, and recalculates the bundle size from numSplits when it is provided;
  • it does not apply the custom filter during auto bucketing; it only filters the actual reads according to the split buckets;
  • does not have start/stop logic for dynamic rebalancing.

y1chi (Contributor) commented Nov 23, 2020

Run Python MongoDBIO_IT

y1chi (Contributor) left a review:

LGTM, thanks for the contribution!

y1chi (Contributor) commented Nov 23, 2020

@chamikaramj Cham, do you mind helping merge the PR?

aaltay merged commit 67339a9 into apache:master on Nov 24, 2020