[BEAM-11266] Python IO MongoDB: add bucket_auto aggregation option for bundling in Atlas. #13350
…r bundling in Atlas.
Hi! Could someone review this, please?
@y1chi can you take a look?
@nikie in case you have the extra cycles to create a PR for the Java connector, I will be glad to take a look.
# single document not splittable
return []
size = self.estimate_size()
bucket_count = size // desired_chunk_size
The split function will likely be called recursively for dynamic rebalancing, so a range with start_pos and end_pos can be split further upon backend request. It might therefore not be reasonable to always use the total collection size divided by desired_chunk_size to calculate the bucket count. Is it possible to get only the buckets within the given _id range? We could probably use the average document size times the number of documents to calculate the size of the range being split.
Thanks, @y1chi!
I will look more closely at how it works in the Java SDK and will try to filter by _id ranges.
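The size estimate suggested in this thread (average document size times the number of documents in the range, divided by the desired chunk size and rounded up) can be sketched as follows. This is a minimal illustration, not the code from the PR: the function name and parameters are hypothetical, and the inputs are assumed to come from a document count over the _id range and from the collection statistics.

```python
import math


def estimate_bucket_count(doc_count, avg_obj_size, desired_chunk_size):
    """Hypothetical helper: how many buckets a range should be split into.

    doc_count          -- number of documents inside the _id range being split
    avg_obj_size       -- average document size in bytes (assumed to come
                          from the collection statistics)
    desired_chunk_size -- target bundle size in bytes
    """
    if desired_chunk_size <= 0:
        raise ValueError("desired_chunk_size must be positive")
    # Estimated size of the range, not of the whole collection.
    range_size = doc_count * avg_obj_size
    # Round up so that a partial chunk still gets its own bucket,
    # and never return fewer than one bucket.
    return max(1, math.ceil(range_size / desired_chunk_size))
```

Rounding up replaces the separate "if size % desired_chunk_size != 0" increment seen in the diff, and basing the estimate on the range keeps recursive splits from over-counting.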
if size % desired_chunk_size != 0:
bucket_count += 1
with beam.io.mongodbio.MongoClient(self.uri, **self.spec) as client:
buckets = list(
The returned buckets should guarantee that the _id range is exactly start_pos to end_pos; otherwise the same document could be read multiple times.
Fixed: the returned buckets now cover the entire requested range.
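One way to meet the requirement above is to use the bucket boundaries only as interior split points and to pin the outer edges to start_pos and end_pos. The sketch below is an illustration under assumptions, not the code merged in this PR: the function name is hypothetical, and the buckets are assumed to have the $bucketAuto output shape, where each document carries "_id": {"min": ..., "max": ...} and max is an exclusive bound for every bucket except the last.

```python
def buckets_to_ranges(buckets, start_pos, end_pos):
    """Hypothetical helper: contiguous [start, end) ranges over a split.

    The first range always starts at start_pos and the last always ends at
    end_pos, so the ranges neither overlap nor leave gaps.
    """
    ranges = []
    lower = start_pos
    # The last bucket's max is ignored; end_pos closes the final range.
    for bucket in buckets[:-1]:
        upper = bucket["_id"]["max"]
        # Skip boundaries that would create an empty or out-of-range piece.
        if lower < upper < end_pos:
            ranges.append((lower, upper))
            lower = upper
    ranges.append((lower, end_pos))
    return ranges


example = [
    {"_id": {"min": 3, "max": 10}},
    {"_id": {"min": 10, "max": 20}},
    {"_id": {"min": 20, "max": 27}},
]
ranges = buckets_to_ranges(example, 0, 30)
```

Because each range's upper bound is reused as the next range's lower bound, every _id in the requested range falls into exactly one range.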
@iemejia, the Java connector already has this feature implemented -
Thanks @nikie, I had forgotten this was already fixed (and didn't check 🥵)!
…r bundling in Atlas.
Bucket_auto mode:
- Respects dynamic rebalancing, filter by _id range in each split.
- Respects custom filter by merging it with the _id range filter, so that splits hold similar number of docs actually matching the filter (not possible in splitVector mode).
- Estimates bundle size for non-initial splits by counting docs with filters applied and using 'avgObjSize' from MongoDB collstats.
- Uses bundle generator common with splitVector mode for clarity of covering all the same cases.
Misc:
- Refactor _merge_id_filter to use '$and' only if necessary.
- Fix one-off issue with single-document-not-splittable checks for both bucket_auto and splitVector modes (unit test added, before the fix if branches were unreachable).
Unit tests:
- Increase coverage and sanity.
- Refactor collection mock filter and projection handling.
Integration tests:
- Add read cases: splitVector/bucket_auto * filter/no-filter.
- Add checks for expected docs count.
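The "use '$and' only if necessary" idea from the commit message can be illustrated with plain dictionaries. This is a sketch under assumptions rather than the actual _merge_id_filter from the PR: the function name, its signature, and the choice of a half-open $gte/$lt range are all hypothetical.

```python
def merge_id_filter(user_filter, start_pos=None, end_pos=None):
    """Hypothetical sketch: combine a user filter with an _id range filter."""
    id_range = {}
    if start_pos is not None:
        id_range["$gte"] = start_pos
    if end_pos is not None:
        id_range["$lt"] = end_pos

    if not id_range:
        # No range restriction: the user filter is used unchanged.
        return dict(user_filter or {})
    if not user_filter:
        return {"_id": id_range}
    if "_id" not in user_filter:
        # No key clash, so a flat filter is enough (implicit AND).
        merged = dict(user_filter)
        merged["_id"] = id_range
        return merged
    # The user already constrains _id: keep both conditions via $and.
    return {"$and": [dict(user_filter), {"_id": id_range}]}
```

Keeping the filter flat when there is no clash on the _id key makes the resulting query easier to read in logs, while the $and branch avoids silently overwriting a user-supplied _id condition.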
@y1chi
Java's
Run Python MongoDBIO_IT
LGTM, thanks for the contribution!
@chamikaramj Cham, do you mind helping merge the PR?
This fixes issue BEAM-11266, allowing ReadFromMongoDB to be used with MongoDB Atlas by optionally using the $bucketAuto MongoDB aggregation instead of splitVector.
This enhancement is based on the solution provided by Susumu Asaga in this comment to the issue BEAM-4567, related to the Java MongoDB connector.
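For readers unfamiliar with the aggregation mentioned above, the following sketch builds a pipeline that asks MongoDB to group documents by _id into a given number of roughly equal buckets. It illustrates the $bucketAuto stage itself and is not taken from the PR: the helper name and its parameters are hypothetical, and the resulting list is assumed to be passed to a collection's aggregate call by the caller.

```python
def bucket_auto_pipeline(bucket_count, match_filter=None):
    """Hypothetical helper: aggregation pipeline for splitting by _id."""
    if bucket_count < 1:
        raise ValueError("bucket_count must be at least 1")
    pipeline = []
    if match_filter:
        # Restrict the documents first so buckets reflect the filtered set.
        pipeline.append({"$match": match_filter})
    # $bucketAuto picks the boundaries so that buckets hold a similar
    # number of documents.
    pipeline.append(
        {"$bucketAuto": {"groupBy": "$_id", "buckets": bucket_count}}
    )
    return pipeline


pipeline = bucket_auto_pipeline(4, {"_id": {"$gte": 0, "$lt": 100}})
```

Unlike the splitVector command, an aggregation pipeline of this kind runs with ordinary read access, which is why it is a practical alternative on managed deployments such as Atlas.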