
Reads (in both Source and Destination) should batch on bytes read instead of records read. #3439

Open
davinchia opened this issue May 17, 2021 · 11 comments
Labels: area/reliability · frozen (Not being actively worked on) · lang/java · team/db-dw-sources (Backlog for Database and Data Warehouse Sources team) · type/enhancement (New feature or request)

Comments

davinchia (Contributor) commented May 17, 2021

Tell us about the problem you're trying to solve

Today, we continue reading records until we hit a batch size, currently 10k. This is fine for most cases. However, it can cause OOM errors for tables with a large row size, e.g. a table with an average row size of 2MB will require 20GB of RAM.

This is pretty simple for Destinations: since Destinations read record by record, they can check memory usage after each record and stop at a preconfigured limit relative to the maximum available heap size.

This is slightly trickier for Sources. Sources read data in batches - the only way to know how much memory a batch of data requires is to read the data. We'd probably need some sort of dynamic batching algorithm here, and a way to recover from memory exceptions.
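One possible shape for that dynamic batching, as a rough sketch only: estimate the average serialized row size as rows are read, and recompute the fetch size so the next batch stays under a byte budget. The class and constants below are hypothetical illustrations, not part of the Airbyte codebase.

```java
// Hypothetical sketch: adjust the fetch size between batches so the next
// batch stays under a fixed byte budget, based on the average row size seen so far.
public class DynamicFetchSizeEstimator {

  private static final int MIN_FETCH_SIZE = 10;
  private static final int MAX_FETCH_SIZE = 10_000;

  private final long targetBatchBytes;
  private long observedBytes = 0;
  private long observedRows = 0;

  public DynamicFetchSizeEstimator(final long targetBatchBytes) {
    this.targetBatchBytes = targetBatchBytes;
  }

  /** Record the serialized size of a row that was just read. */
  public void accept(final long rowBytes) {
    observedBytes += rowBytes;
    observedRows += 1;
  }

  /** Fetch size to use for the next batch, clamped to sane bounds. */
  public int nextFetchSize() {
    if (observedRows == 0) {
      return MIN_FETCH_SIZE; // start conservatively until we know the row size
    }
    final long avgRowBytes = Math.max(1, observedBytes / observedRows);
    final long estimate = targetBatchBytes / avgRowBytes;
    return (int) Math.min(MAX_FETCH_SIZE, Math.max(MIN_FETCH_SIZE, estimate));
  }
}
```

A source would call accept(...) for each row it reads and apply nextFetchSize() before pulling the next batch; recovering from an actual OutOfMemoryError would still need separate handling.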

Describe the solution you’d like

Batching should take byte size into account as well, e.g. insert the records when either the record limit or the byte limit is hit, whichever comes first.
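A minimal sketch of that dual limit, assuming a simple in-memory buffer of serialized records (hypothetical class and thresholds; the real buffered consumer will differ):

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a buffer that flushes on whichever limit is hit first:
// a maximum record count or a maximum number of buffered bytes.
public class RecordOrByteBuffer {

  private final int maxRecords;
  private final long maxBytes;
  private final List<String> buffer = new ArrayList<>();
  private long bufferedBytes = 0;

  public RecordOrByteBuffer(final int maxRecords, final long maxBytes) {
    this.maxRecords = maxRecords;
    this.maxBytes = maxBytes;
  }

  public void add(final String serializedRecord) {
    buffer.add(serializedRecord);
    bufferedBytes += serializedRecord.getBytes(StandardCharsets.UTF_8).length;
    if (buffer.size() >= maxRecords || bufferedBytes >= maxBytes) {
      flush();
    }
  }

  private void flush() {
    // Insert/copy the buffered records into the destination here.
    buffer.clear();
    bufferedBytes = 0;
  }
}
```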


davinchia added the type/enhancement label May 17, 2021
davinchia changed the title from "Buffered Stream Consumer should batch on bytes read instead of records read." to "Reads (in both Source and Destination) should batch on bytes read instead of records read." May 19, 2021
rparrapy (Contributor) commented Jun 3, 2021

This is a very high priority for me. I have 2 databases that I need to sync that make heavy use of jsonb columns and long strings, which crash in Airbyte. Perhaps making the batch size configurable per connection would be an intermediate solution (if smart batching is more complex and will take more time)?

davinchia (Contributor, Author) commented

@rparrapy does Airbyte work with those tables at all? Or are those tables 'left' out of syncs?

jrhizor (Contributor) commented Jun 8, 2021

@rparrapy Right now we have fixed batching by number of rows. Our first step will probably be to move to fixed batching by byte size, and then to add more configurability later on.

ajzo90 (Contributor) commented Jun 8, 2021

Regarding batching on the source side:
Is this for some specific implementation?
I'm assuming this issue is about databases, since they are more generic in character than APIs.
Maybe I'm missing something, but why is batching required in the first place? All read queries should be of the kind select a,b,c,... from X where y >= cursor, and that should be perfectly fine to stream from most databases.
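For context, streaming a query like that through JDBC looks roughly like the sketch below. The connection URL and the driver-specific settings (e.g. the PostgreSQL driver wants autocommit off plus a positive fetch size to use a cursor) are assumptions for illustration, not Airbyte's actual code.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Sketch: stream rows from a JDBC source without materializing the full result set.
public class StreamingReadExample {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/db", "user", "pass")) {
      conn.setAutoCommit(false); // required by the PostgreSQL driver for cursor-based fetching
      try (PreparedStatement stmt = conn.prepareStatement("SELECT a, b, c FROM x WHERE y >= ?")) {
        stmt.setFetchSize(1_000);  // rows pulled from the server per round trip
        stmt.setLong(1, 0L);       // cursor value
        try (ResultSet rs = stmt.executeQuery()) {
          while (rs.next()) {
            // Emit one record at a time (e.g. serialize to stdout) instead of buffering.
            System.out.println(rs.getString("a"));
          }
        }
      }
    }
  }
}
```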

jrhizor (Contributor) commented Jun 8, 2021

Destination batching is where people most commonly encounter issues. DB sources stream the way you're describing, and some API sources do as well.

Some APIs have configurable "page" sizes where they load a fixed number of records at a time. I actually don't know of a specific case where users have run into memory problems on the source side like this. @davinchia do you have any examples of sources that are exhibiting memory problems due to source-side batching strategies?

rparrapy (Contributor) commented Jun 8, 2021

@rparrapy does Airbyte work with those tables at all? Or are those tables 'left' out of syncs?

I had to remove those tables in order to complete a sync.

davinchia (Contributor, Author) commented

@jrhizor any DB Source with a large average row size will run into this issue. This isn't a problem for APIs since those mostly consume data greedily.

Maybe I'm missing something, but why is batching required in the first place? All read queries should be of kind select a,b,c,... from X where y >= cursor, and that should be perfectly fine to stream from most databases.

@ajzo90 cursors don't apply to full refresh. For incremental, this is an issue because the system still needs to decide how many records to read into memory at once, e.g. if there are 50k records matching the where clause, some sort of batching still needs to happen under the hood.

ajzo90 (Contributor) commented Jun 9, 2021

Sorry for the confusion; you can ignore the cursor in the query example. It's not that important, I just put it there as an example in case batching originated from queries like that.

It doesn't change the point regarding streamability. Given that the DB protocol supports streaming, you don't need to buffer anything on the receiving side.

In your example, the result of the query that returns 50k records can be streamed, and it's possible to emit record by record to stdout. I don't see why batching would be required.

davinchia (Contributor, Author) commented Jun 9, 2021

I should be clearer. Batching within the system can be understood in 2 parts:

  1. The query size when JDBC executes its SQL statement and reads from the DB. This is currently set to 1k. This affects Source memory requirements.
  2. The batch size used when the Destination reads from the Source. We have two kinds of inserts: copying into the destination warehouse via a temp file in cloud/local storage, and manual inserts. For simplicity, both use the same queueing mechanism and flush after a batch limit, which is currently 50k (see the sketch below). In the copy case, it should be possible to cleanly remove this batching. In the insert case, removing it will result in slower overall inserts. This batch size affects Destination memory requirements.
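As a rough picture of point 2, both insert paths can share one queueing mechanism and differ only in how a flushed batch is written. The names below are illustrative, not the actual Airbyte classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch of the shared queueing described in point 2: records accumulate until a
// batch limit is reached, then a pluggable flush writes them out.
public class SharedBatchQueue {

  private final int batchLimit;                 // e.g. 50_000 records today
  private final Consumer<List<String>> flusher; // copy-based or insert-based strategy
  private final List<String> queue = new ArrayList<>();

  public SharedBatchQueue(final int batchLimit, final Consumer<List<String>> flusher) {
    this.batchLimit = batchLimit;
    this.flusher = flusher;
  }

  public void accept(final String record) {
    queue.add(record);
    if (queue.size() >= batchLimit) {
      flusher.accept(new ArrayList<>(queue));
      queue.clear();
    }
  }
}
```

A copy-based destination would pass a flusher that stages the batch in cloud/local storage and issues a COPY, while an insert-based one would build direct INSERT statements.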

Does that make sense?

danieldiamond (Contributor) commented

@davinchia any updates?

davinchia (Contributor, Author) commented

This PR solves this for destinations: https://github.com/airbytehq/airbyte/pull/7719/files
