🎉 improvement: speed up mongodb schema discovery #2851

Merged: 2 commits, Apr 15, 2021

Conversation

Contributor

@FUT FUT commented Apr 12, 2021

Signed-off-by: fut <fut.wrk@gmail.com>

What

Speed up the discovery phase for MongoDB source.

How

A full DB scan takes a huge amount of time for large DBs. So we take only 10k documents to search for props and 1k documents to determine prop types.
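The two-pass, bounded scan described above can be sketched roughly as follows. This is a minimal illustration, not the connector's actual code: the method names, the constants, and the use of plain Hashes as stand-ins for MongoDB documents are all assumptions. With the real Ruby driver the bounded read would presumably be a `find` with a `limit` rather than `Enumerable#first`.

```ruby
require 'set'

# Hypothetical limits mirroring the numbers in the description above.
DISCOVER_LIMIT = 10_000 # documents scanned to collect prop names
TYPES_LIMIT    = 1_000  # documents scanned to collect prop types

# documents: any Enumerable of Hashes (an assumed stand-in for a cursor).
# Returns the set of prop names seen in the first `limit` documents.
def discover_fields(documents, limit: DISCOVER_LIMIT)
  fields = Set.new
  documents.first(limit).each { |doc| fields.merge(doc.keys) }
  fields
end

# Returns a Hash of prop name => Set of Ruby classes observed for that prop
# in the first `limit` documents. Props never seen in that window get no entry.
def discover_types(documents, fields, limit: TYPES_LIMIT)
  types = Hash.new { |hash, key| hash[key] = Set.new }
  documents.first(limit).each do |doc|
    fields.each do |field|
      types[field] << doc[field].class if doc.key?(field)
    end
  end
  types
end
```

Any prop that first appears after the scanned window is missed, which is the tradeoff raised in the review below.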

Pre-merge Checklist

  • Run integration tests
  • Publish Docker images (docker pull narrativebi/airbyte-source-mongodb:0.3.0)

FUT added 2 commits April 12, 2021 14:00
Signed-off-by: fut <fut.wrk@gmail.com>
Signed-off-by: fut <fut.wrk@gmail.com>
@auto-assign auto-assign bot requested review from davinchia and jrhizor April 12, 2021 11:12
@davinchia
Contributor

Nice! By limiting the scan to only 10k, could we be potentially missing some documents + props?

@FUT
Contributor Author

FUT commented Apr 12, 2021

That might happen, but it is a tradeoff we cannot avoid for unstructured databases. Just imagine a MongoDB database with 1B documents like {1: 1} {2: 2} ... {1000000000: 1000000000}. That case cannot be covered, not only because of the insane discovery time but also because we would hit an out-of-memory error trying to keep and process all these keys.
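To make the worst case concrete, here is a small sketch, assuming documents are modeled as plain Hashes and using hypothetical helper names. When every document carries its own unique key, the number of distinct keys held in memory grows one-for-one with the documents scanned, so only a cap on the scan keeps it bounded.

```ruby
require 'set'

# Lazily yields {"1" => 1}, {"2" => 2}, ... one document at a time,
# so the generator itself never holds the whole "collection" in memory.
def unique_key_documents(count)
  Enumerator.new do |yielder|
    1.upto(count) { |i| yielder << { i.to_s => i } }
  end
end

# Counts the distinct keys seen in the first `limit` documents.
# Enumerable#first stops pulling from the enumerator after `limit` items.
def count_distinct_keys(documents, limit)
  keys = Set.new
  documents.first(limit).each { |doc| keys.merge(doc.keys) }
  keys.size
end

# With a 10k cap, a 1B-document collection still costs only 10k keys.
capped = count_distinct_keys(unique_key_documents(1_000_000_000), 10_000)
puts capped
```

Without the cap, the key set would have to hold one entry per document, which is the out-of-memory scenario described above.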

Contributor

@sherifnada sherifnada left a comment

thanks @FUT

@@cache[collection.name] = {}
airbyte_types.each_pair do |field, types|
  # Has one specific type
  if types.count == 1
Contributor

Does it make sense to have a fallback type here in case there is more than one type? We typically use string as the type in other connectors.
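A fallback along those lines might look like the sketch below. The method name and the type labels ('string', 'number') are illustrative assumptions, not the connector's actual API or the code that was merged.

```ruby
# types: the schema type names observed for one field, e.g. ['number'].
# Returns the single observed type, or `fallback` when the field has
# zero or several distinct types.
def resolve_type(types, fallback: 'string')
  distinct = types.uniq
  distinct.size == 1 ? distinct.first : fallback
end

resolve_type(['number'])           # single type is kept
resolve_type(['number', 'string']) # mixed types fall back to 'string'
```

Falling back to string keeps the field in the schema instead of dropping it, at the cost of losing the more specific type.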

@sherifnada sherifnada changed the title fix: speed up mongodb source discovery 🐛 improvement: speed up mongodb source discovery Apr 12, 2021
@sherifnada sherifnada changed the title 🐛 improvement: speed up mongodb source discovery 🎉 improvement: speed up mongodb schema discovery Apr 14, 2021
@sherifnada sherifnada merged commit 6bdceef into airbytehq:master Apr 15, 2021

3 participants