
Reads (in both Source and Destination) should batch on bytes read instead of records read. #3439

Open
davinchia opened this issue May 17, 2021 · 11 comments
Labels: area/reliability · frozen (Not being actively worked on) · lang/java · team/db-dw-sources (Backlog for Database and Data Warehouse Sources team) · type/enhancement (New feature or request)

Comments

davinchia (Contributor) commented May 17, 2021

Tell us about the problem you're trying to solve

Today, we continue reading records until we hit a batch size, currently 10k. This is fine for most cases. However, it can cause OOM errors for tables with a large row size, e.g. a table with an average row size of 2MB will require 20GB of RAM.

This is pretty simple for Destinations: since Destinations read record by record, they can check memory usage after each record and stop at a preconfigured limit relative to the maximum available heap size.

This is slightly trickier for Sources. Sources read data in batches - the only way to know how much memory a batch of data requires is to read the data. We'd probably need some sort of dynamic batching algorithm here, and a way to recover from memory exceptions.
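One possible shape for that dynamic batching, as a rough sketch only: estimate the average serialized row size as rows are read, and recompute the fetch size so the next batch stays under a byte budget. The class and constants below are hypothetical illustrations, not part of the Airbyte codebase.

```java
// Hypothetical sketch: adjust the fetch size between batches so the next
// batch stays under a fixed byte budget, based on the average row size seen so far.
public class DynamicFetchSizeEstimator {

  private static final int MIN_FETCH_SIZE = 10;
  private static final int MAX_FETCH_SIZE = 10_000;

  private final long targetBatchBytes;
  private long observedBytes = 0;
  private long observedRows = 0;

  public DynamicFetchSizeEstimator(final long targetBatchBytes) {
    this.targetBatchBytes = targetBatchBytes;
  }

  /** Record the serialized size of a row that was just read. */
  public void accept(final long rowBytes) {
    observedBytes += rowBytes;
    observedRows += 1;
  }

  /** Fetch size to use for the next batch, clamped to sane bounds. */
  public int nextFetchSize() {
    if (observedRows == 0) {
      return MIN_FETCH_SIZE; // start conservatively until we know the row size
    }
    final long avgRowBytes = Math.max(1, observedBytes / observedRows);
    final long estimate = targetBatchBytes / avgRowBytes;
    return (int) Math.min(MAX_FETCH_SIZE, Math.max(MIN_FETCH_SIZE, estimate));
  }
}
```

A source would call accept(...) for each row it reads and apply nextFetchSize() before pulling the next batch; recovering from an actual OutOfMemoryError would still need separate handling.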

Describe the solution you’d like

Batching should take byte size into account as well, e.g. insert the records when either the record limit or the byte limit is hit, whichever comes first.
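A minimal sketch of that dual limit, assuming a simple in-memory buffer of serialized records (hypothetical class and thresholds; the real buffered consumer will differ):

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a buffer that flushes on whichever limit is hit first:
// a maximum record count or a maximum number of buffered bytes.
public class RecordOrByteBuffer {

  private final int maxRecords;
  private final long maxBytes;
  private final List<String> buffer = new ArrayList<>();
  private long bufferedBytes = 0;

  public RecordOrByteBuffer(final int maxRecords, final long maxBytes) {
    this.maxRecords = maxRecords;
    this.maxBytes = maxBytes;
  }

  public void add(final String serializedRecord) {
    buffer.add(serializedRecord);
    bufferedBytes += serializedRecord.getBytes(StandardCharsets.UTF_8).length;
    if (buffer.size() >= maxRecords || bufferedBytes >= maxBytes) {
      flush();
    }
  }

  private void flush() {
    // Insert/copy the buffered records into the destination here.
    buffer.clear();
    bufferedBytes = 0;
  }
}
```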


davinchia added the type/enhancement label May 17, 2021
davinchia changed the title from "Buffered Stream Consumer should batch on bytes read instead of records read." to "Reads (in both Source and Destination) should batch on bytes read instead of records read." May 19, 2021
rparrapy (Contributor) commented Jun 3, 2021

This is a very high priority for me. I have 2 databases that I need to sync that make heavy use of jsonb columns and long strings, which crash in Airbyte. Perhaps making the batch size configurable per connection would be an intermediate solution (if smart batching is more complex and will take more time)?

davinchia (Contributor, Author) commented

@rparrapy does Airbyte work with those tables at all? Or are those tables 'left' out of syncs?

jrhizor (Contributor) commented Jun 8, 2021

@rparrapy Right now we have fixed batching by number of rows. Our first step will probably be to move to fixed batching by byte size, and then to add more configurability later on.

ajzo90 (Contributor) commented Jun 8, 2021

Regarding batching on the source side:
Is this for some specific implementation?
I'm assuming this issue is about databases, since they are more generic in character than APIs.
Maybe I'm missing something, but why is batching required in the first place? All read queries should be of the kind select a,b,c,... from X where y >= cursor, and that should be perfectly fine to stream from most databases.
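For context, streaming a query like that through JDBC looks roughly like the sketch below. The connection URL and the driver-specific settings (e.g. the PostgreSQL driver wants autocommit off plus a positive fetch size to use a cursor) are assumptions for illustration, not Airbyte's actual code.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Sketch: stream rows from a JDBC source without materializing the full result set.
public class StreamingReadExample {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/db", "user", "pass")) {
      conn.setAutoCommit(false); // required by the PostgreSQL driver for cursor-based fetching
      try (PreparedStatement stmt = conn.prepareStatement("SELECT a, b, c FROM x WHERE y >= ?")) {
        stmt.setFetchSize(1_000);  // rows pulled from the server per round trip
        stmt.setLong(1, 0L);       // cursor value
        try (ResultSet rs = stmt.executeQuery()) {
          while (rs.next()) {
            // Emit one record at a time (e.g. serialize to stdout) instead of buffering.
            System.out.println(rs.getString("a"));
          }
        }
      }
    }
  }
}
```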

jrhizor (Contributor) commented Jun 8, 2021

Destination batching is where people most commonly encounter issues. DB sources stream the way you're describing, and some API sources do as well.

Some APIs have configurable "page" sizes where they load a fixed number of records at a time. I actually don't know of a specific case where users have run into memory problems on the source side like this. @davinchia do you have any examples of sources that are exhibiting memory problems due to source-side batching strategies?

rparrapy (Contributor) commented Jun 8, 2021

@rparrapy does Airbyte work with those tables at all? Or are those tables 'left' out of syncs?

I had to remove those tables in order to complete a sync.

davinchia (Contributor, Author) commented

@jrhizor any DB Source with a large average row size will run into this issue. This isn't a problem for APIs since those mostly consume data greedily.

Maybe I'm missing something, but why is batching required in the first place? All read queries should be of kind select a,b,c,... from X where y >= cursor, and that should be perfectly fine to stream from most databases.

@ajzo90 cursors don't apply to full refresh. For incremental, this is an issue because the system still needs to decide how many records to read into memory at once, e.g. if there are 50k records matching the where clause, some sort of batching still needs to happen under the hood.

ajzo90 (Contributor) commented Jun 9, 2021

Sorry for the confusion; you can ignore the cursor in the query example. It's not that important, I just put it there as an example in case batching originated from queries like that.

It doesn't change the point regarding streamability. Given that the DB protocol supports streaming, you don't need to buffer anything on the receiving side.

In your example, the result of the query that returns 50k records can be streamed, and it's possible to emit record by record to stdout. I don't see why batching would be required.

davinchia (Contributor, Author) commented Jun 9, 2021

I should be clearer. Batching within the system can be understood in 2 parts:

  1. The query size when JDBC executes its SQL statement and reads from the DB. This is currently set to 1k. This affects Source memory requirements.
  2. The batch size used when the Destination reads from the Source. We have two kinds of inserts: copying into the destination warehouse via a temp file in cloud/local storage, and manual inserts. For simplicity, both use the same queueing mechanism and flush after a batch limit, which is currently 50k (see the sketch below). In the copy case, it should be possible to cleanly remove this batching. In the insert case, removing it will result in slower overall inserts. This batch size affects Destination memory requirements.
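As a rough picture of point 2, both insert paths can share one queueing mechanism and differ only in how a flushed batch is written. The names below are illustrative, not the actual Airbyte classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch of the shared queueing described in point 2: records accumulate until a
// batch limit is reached, then a pluggable flush writes them out.
public class SharedBatchQueue {

  private final int batchLimit;                 // e.g. 50_000 records today
  private final Consumer<List<String>> flusher; // copy-based or insert-based strategy
  private final List<String> queue = new ArrayList<>();

  public SharedBatchQueue(final int batchLimit, final Consumer<List<String>> flusher) {
    this.batchLimit = batchLimit;
    this.flusher = flusher;
  }

  public void accept(final String record) {
    queue.add(record);
    if (queue.size() >= batchLimit) {
      flusher.accept(new ArrayList<>(queue));
      queue.clear();
    }
  }
}
```

A copy-based destination would pass a flusher that stages the batch in cloud/local storage and issues a COPY, while an insert-based one would build direct INSERT statements.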

Does that make sense?

danieldiamond (Contributor) commented

@davinchia any updates?

davinchia (Contributor, Author) commented

This PR solves this for destinations: https://github.com/airbytehq/airbyte/pull/7719/files
