Reads (in both Source and Destination) should batch on bytes read instead of records read. #3439
Comments
This is a very high priority for me. I have 2 databases that I need to sync that make heavy use of jsonb columns and long strings, which causes Airbyte to crash. Perhaps making the batch size configurable per connection would be an intermediate solution? (if smart batching is more complex and will take more time)
@rparrapy does Airbyte work with those tables at all? Or are those tables left out of syncs?
@rparrapy Right now we have fixed batching by number of rows. Our first step will probably be to have fixed batching by byte size, and then to add more configurability later on.
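For illustration, here is a minimal sketch of what fixed batching by byte size could look like on the destination side. The class and method names are hypothetical, not Airbyte's actual API; the idea is simply to flush once the accumulated serialized size crosses a byte threshold instead of after a fixed record count.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical buffer that flushes when the accumulated serialized size
// exceeds a byte threshold, rather than after a fixed number of records.
public class ByteLimitedBuffer {
    private final long maxBufferBytes;
    private final Consumer<List<String>> flusher;
    private final List<String> records = new ArrayList<>();
    private long bufferedBytes = 0;

    public ByteLimitedBuffer(long maxBufferBytes, Consumer<List<String>> flusher) {
        this.maxBufferBytes = maxBufferBytes;
        this.flusher = flusher;
    }

    public void accept(String serializedRecord) {
        records.add(serializedRecord);
        bufferedBytes += serializedRecord.getBytes(StandardCharsets.UTF_8).length;
        if (bufferedBytes >= maxBufferBytes) {
            flush();
        }
    }

    public void flush() {
        if (!records.isEmpty()) {
            flusher.accept(new ArrayList<>(records));
            records.clear();
            bufferedBytes = 0;
        }
    }
}
```

A record-count cap and per-connection configuration could be layered on top of the same structure later.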
Regarding batching on the source side:
Destination batching is where people most commonly encounter issues. DB sources stream the way you're describing, and some API sources do as well. For APIs, some have configurable "page" sizes where they load a fixed number of records at a time. I actually don't know of a specific case where users have run into memory problems on the source side like this. @davinchia do you have any examples of sources that are exhibiting memory problems due to source-side batching strategies?
I had to remove those tables in order to complete a sync.
@jrhizor any DB Source with a large average row size will run into this issue. This isn't a problem for APIs since those mostly consume data greedily.
@ajzo90 Cursors don't apply to full refresh. For incremental, this is an issue because the system still needs to decide how many records to read into memory at once, e.g. if there are 50k records within the WHERE clause, some sort of batching still needs to happen under the hood.
Sorry for the confusion, you can ignore the cursor in the query example. It's not that important; I just put it there as an example in case batching originated from queries like that. It doesn't change the point regarding streamability. Given that the DB protocol supports streaming, you don't need to buffer anything on the receiving side. In your example, the result of the query that returns 50k records can be streamed, emitting records one by one to stdout. I don't see why batching would be required.
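To illustrate the streaming point, a JDBC-based source can keep a small fixed fetch window and emit each row as it arrives instead of materializing the whole result set. This is a generic JDBC sketch, not Airbyte's source code; the connection string, table name, and fetch size are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class StreamingReadExample {
    public static void main(String[] args) throws SQLException {
        // Placeholder connection details; real values would come from the connector config.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/db", "user", "pass")) {
            conn.setAutoCommit(false); // required by the Postgres driver for cursor-based fetching
            try (PreparedStatement stmt = conn.prepareStatement("SELECT * FROM large_table")) {
                stmt.setFetchSize(1000); // the driver keeps only ~1000 rows in memory at a time
                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next()) {
                        // Emit one record at a time to stdout; nothing beyond the
                        // driver's fetch window is buffered in the source process.
                        System.out.println(rs.getString(1));
                    }
                }
            }
        }
    }
}
```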
I should be clearer. Batching within the system can be understood in 2 parts:
Does that make sense?
@davinchia any updates?
This PR solves this for destinations: https://github.com/airbytehq/airbyte/pull/7719/files |
Tell us about the problem you're trying to solve
Today, we continue reading records until we hit a batch size - currently 10k records. This is fine for most cases. However, it can cause OOM errors for tables with a large row size, e.g. a table with an average row size of 2MB would require 20GB of RAM per 10k-record batch.
This is pretty simple for Destinations - since Destinations read record by record, they can check memory usage after each record and stop at a preconfigured limit relative to the maximum available heap size.
This is slightly trickier for Sources. Sources read data in batches - the only way to know how much memory a batch of data requires is to read the data. We'd probably need some sort of dynamic batching algorithm here, and a way to recover from memory exceptions.
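One possible shape for such a dynamic batching algorithm, purely as a sketch and not a concrete proposal: start with a conservative fetch size and derive the next one from the average serialized row size observed so far, so each batch stays under a byte budget. All names below are hypothetical.

```java
// Hypothetical helper: derive the next fetch size from a per-batch byte budget
// and the average serialized row size observed so far.
public class AdaptiveFetchSize {
    private final long targetBatchBytes;
    private final int minFetchSize;
    private final int maxFetchSize;

    public AdaptiveFetchSize(long targetBatchBytes, int minFetchSize, int maxFetchSize) {
        this.targetBatchBytes = targetBatchBytes;
        this.minFetchSize = minFetchSize;
        this.maxFetchSize = maxFetchSize;
    }

    public int nextFetchSize(long bytesReadSoFar, long rowsReadSoFar) {
        if (rowsReadSoFar == 0) {
            return minFetchSize; // no observations yet, stay conservative
        }
        long avgRowBytes = Math.max(1, bytesReadSoFar / rowsReadSoFar);
        long candidate = targetBatchBytes / avgRowBytes;
        return (int) Math.max(minFetchSize, Math.min(maxFetchSize, candidate));
    }
}
```

With a 200MB byte budget and rows averaging 2MB, this would settle around 100 rows per fetch instead of a fixed 10k. Recovering from memory exceptions would still need separate handling.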
Describe the solution you’d like
Batching should take byte size into account as well, e.g. insert the buffered records when either the record limit or the byte limit is hit, whichever comes first.
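A small sketch of that "whichever comes first" rule, with illustrative names rather than Airbyte's actual implementation:

```java
// Hypothetical flush predicate: flush when either the record limit or the
// byte limit is reached, whichever comes first.
public final class FlushPolicy {
    private final long maxRecords;
    private final long maxBytes;

    public FlushPolicy(long maxRecords, long maxBytes) {
        this.maxRecords = maxRecords;
        this.maxBytes = maxBytes;
    }

    public boolean shouldFlush(long bufferedRecords, long bufferedBytes) {
        return bufferedRecords >= maxRecords || bufferedBytes >= maxBytes;
    }
}
```

For example, with a 10,000-record limit and a 200MB byte limit, a stream of 2MB rows would flush after roughly 100 records, while a stream of small rows would still flush at 10,000.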