
Support parallel reads from a single source. #7749

Open
cgardens opened this issue Nov 8, 2021 · 8 comments
Labels
area/connectors (Connector related issues) · area/platform (issues related to the platform) · frozen (Not being actively worked on) · team/platform-move · type/enhancement (New feature or request)

Comments

@cgardens
Contributor

cgardens commented Nov 8, 2021

Tell us about the problem you're trying to solve

Not to be confused with #4081: here we are trying to make it possible to replicate multiple streams from a single source at the same time. I.e. if a source has two tables, both of those tables could be replicated in parallel.

@cgardens cgardens added type/enhancement New feature or request area/platform issues related to the platform labels Nov 8, 2021
@sherifnada sherifnada added the area/connectors Connector related issues label Nov 15, 2021
@hpias

hpias commented May 2, 2022

One should only ever split a source's load across multiple connections to give different streams different sync schedules, never to achieve parallelism.
This should probably be a connection-based, or even schedule-based, setting. Workers, sources, and destinations shouldn't care about it; it is strictly an orchestration/scheduler implementation responsibility. E.g. split the configured streams into a configured number of queues (connection parallelism / max parallel streams) and assign each queue to a worker. Or, even better, queue only a single stream per parallel worker at a time to account for differences in stream load and achieve round-robin scheduling, provided the cost of instantiating a single-stream worker is not too high.
The scheduler could be further improved by either using sync stats or extending the source spec to include estimated row counts or similar metrics as a stream weighting factor.
Normalization should optionally be postponed for successfully completed EL streams if it ties up workers (not sure how it is currently implemented). The idea is to be able to scale the k8s pool back in once EL has completed, since dbt is a destination-side workload.
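
A minimal sketch of the queueing idea above (hypothetical names, not Airbyte code), assuming the orchestrator knows an estimated row count per stream: streams are greedily packed into a configured number of queues by weight, and one worker drains each queue.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def partition_streams(streams, max_parallel_streams):
    """Greedily assign each stream to the currently lightest queue,
    using estimated row counts as the weighting factor."""
    heap = [(0, i) for i in range(max_parallel_streams)]  # (total_rows, queue_index)
    queues = [[] for _ in range(max_parallel_streams)]
    for stream in sorted(streams, key=lambda s: s["estimated_rows"], reverse=True):
        load, idx = heapq.heappop(heap)
        queues[idx].append(stream["name"])
        heapq.heappush(heap, (load + stream["estimated_rows"], idx))
    return queues

def sync_stream(stream_name):
    # Placeholder for handing a single-stream EL job to a worker.
    print(f"syncing {stream_name}")

def drain(queue):
    # Each worker syncs the streams in its queue sequentially.
    for name in queue:
        sync_stream(name)

if __name__ == "__main__":
    configured_streams = [
        {"name": "orders", "estimated_rows": 5_000_000},
        {"name": "users", "estimated_rows": 200_000},
        {"name": "events", "estimated_rows": 12_000_000},
        {"name": "invoices", "estimated_rows": 800_000},
    ]
    queues = partition_streams(configured_streams, max_parallel_streams=2)
    with ThreadPoolExecutor(max_workers=len(queues)) as pool:
        for q in queues:
            pool.submit(drain, q)
```

The round-robin variant described above would instead have each worker pull one stream at a time from a single shared queue, which tolerates bad weight estimates at the cost of more worker start-ups.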

@toandm
Contributor

toandm commented Sep 5, 2022

This will be a great addition to Airbyte. I currently sync hundreds of tables from a MySQL source, and it takes ages to load them all in.

@blake-enyart

@cgardens are there any updates on progress on this issue? Is this coming in tandem with the per-stream sync update or afterwards?

@luancaarvalho

Any news?

@Jordonkopp

Any updates on this feature request?

@ggam

ggam commented Apr 23, 2023

This is much needed as the workaround is to maintain multiple connections, splitting streams between them, which increases maintenance effort.

@samsipe

samsipe commented Nov 9, 2023

+1 on this. Would be a huge help!

@mcivorsteiner

+1 on this, this is a big issue for us currently.

@bleonard bleonard added the frozen Not being actively worked on label Mar 22, 2024