
Destination Redshift: Limit Standard insert statement size < 16MB #36973

Merged
merged 1 commit into master on Apr 10, 2024

Conversation

gisripa
Contributor

@gisripa gisripa commented Apr 10, 2024

Limit the multi-value insert statement so that it does not exceed 16MB.

Contributor Author

gisripa commented Apr 10, 2024

This stack of pull requests is managed by Graphite.

@gisripa gisripa changed the title destination-redshift-16mb-stmt-fix Destination Redshift: Limit Standard insert statement size < 16MB Apr 10, 2024
@gisripa gisripa marked this pull request as ready for review April 10, 2024 18:33
@gisripa gisripa requested a review from a team as a code owner April 10, 2024 18:33
@octavia-squidington-iii octavia-squidington-iii added connectors/destination/redshift area/connectors Connector related issues CDK Connector Development Kit labels Apr 10, 2024

vercel bot commented Apr 10, 2024

The latest updates on your projects:

| Name | Status | Updated (UTC) |
| --- | --- | --- |
| airbyte-docs | ✅ Ready | Apr 10, 2024 11:19pm |

Contributor

@edgao edgao left a comment


for my understanding: this is basically just telling the async framework to send us batches of <14MB, which hopefully gives us enough wiggle room to avoid generating sql > 16MB?

(I also have no idea how the async framework measures record size, is it using the serialized json string size?)

// If the flush size allows the max batch of 10k records, then the net overhead is ~1.5MB.
// Let's round that up to 2MB and keep a max buffer of 14MB.
// This avoids sending a record set larger than 14MB, keeping the batch insert statement under the 16MB limit.
private static final Long MAX_BATCH_SIZE_FOR_FLUSH = 14 * 1024 * 1024L;
Contributor


nit: how is the optimalBatchSize parameter used? I.e. does the async framework treat it as a recommendation, or a hard limit? (and let's align the naming to either optimalBatchSize + OPTIMAL_BATCH_SIZE_FOR_FLUSH, or maxBatchSize + MAX_BATCH_SIZE_FOR_FLUSH)
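
(Not part of the PR.) To make the guardrail concrete, a minimal sketch, assuming the framework simply flushes whenever either the record-count cap or the accumulated serialized-size cap is hit first; the class and method names are illustrative, not the CDK's actual API:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: flush on whichever limit is reached first, the 10k-record cap
// or the 14MB serialized-size cap, so the generated INSERT stays below 16MB.
class FlushGuardrailSketch {
  private static final long MAX_BATCH_SIZE_FOR_FLUSH = 14 * 1024 * 1024L;
  private static final int MAX_RECORDS_PER_BATCH = 10_000;

  private final List<String> buffer = new ArrayList<>();
  private long bufferedBytes = 0;

  void accept(String serializedRecord) {
    buffer.add(serializedRecord);
    bufferedBytes += serializedRecord.length(); // size measured on the serialized string
    if (buffer.size() >= MAX_RECORDS_PER_BATCH || bufferedBytes >= MAX_BATCH_SIZE_FOR_FLUSH) {
      flush();
    }
  }

  private void flush() {
    if (buffer.isEmpty()) {
      return;
    }
    // The real connector would build one multi-value INSERT from the buffered records here.
    System.out.printf("flushing %d records, ~%d bytes%n", buffer.size(), bufferedBytes);
    buffer.clear();
    bufferedBytes = 0;
  }
}
```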

@@ -16,4 +27,49 @@ protected ObjectNode getBaseConfig() {
return (ObjectNode) Jsons.deserialize(IOs.readFile(Path.of("secrets/1s1t_config.json")));
}

@Test
public void testStandardInsertBatchSizeGtThan16Mb() throws Exception {
Contributor


should we actually move this into the base TD test class? Seems useful for all destinations to verify handling of nontrivial data size

Contributor Author


Do we have limitations on staging batch size, do you recall? I put it in the standard-insert flow because of the SQL statement limit, but we could probably pull it up to AbstractRedshiftTD too.

Contributor


🤷

we probably target a specific size, but I'm not aware of redshift (or any staging dest) setting a hard limit on file size (and we're probably not reaching that limit anyway...)
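
For readers without the full diff: the gist of such a test is simply to push more than 16MB of serialized records through the standard-insert path. A rough fixture sketch, with hypothetical names and record shapes (not the code actually added in this PR):

```java
import java.util.ArrayList;
import java.util.List;

class LargeBatchFixtureSketch {
  // Hypothetical helper: build enough ~100KB records that their combined
  // serialized size comfortably exceeds the 16MB Redshift statement limit.
  static List<String> recordsLargerThan16Mb() {
    final String padding = "x".repeat(100_000);
    final List<String> records = new ArrayList<>();
    long totalBytes = 0;
    int id = 0;
    while (totalBytes <= 16L * 1024 * 1024) {
      final String record = "{\"id\": " + (id++) + ", \"payload\": \"" + padding + "\"}";
      records.add(record);
      totalBytes += record.length();
    }
    // A real test would run a sync with these records through the standard-insert
    // flow and assert that every record lands in the destination table.
    return records;
  }
}
```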

@gisripa
Contributor Author

gisripa commented Apr 10, 2024

for my understanding: this is basically just telling the async framework to send us batches of <14MB, which hopefully gives us enough wiggle room to avoid generating sql > 16MB?

(I also have no idea how the async framework measures record size, is it using the serialized json string size?)

Yeah, your understanding is correct. It uses the raw string size of each record as consumed from stdin, passed along as the record size. If the records are small enough that 10k of them stay under 14MB, then our 10k-record batch insert kicks in first; otherwise this acts as the guardrail and the batch ends up well below 10k records.
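
(Not in the PR.) To spell out the headroom arithmetic behind the 14MB constant: the "~1.5MB" in the code comment reads like roughly 150 bytes of per-row SQL syntax overhead times 10k records; that per-row figure is my own assumption, not something stated in the diff.

```java
public class InsertHeadroomSketch {
  public static void main(String[] args) {
    final long maxRecordsPerBatch = 10_000;
    final long perRecordSqlOverheadBytes = 150;            // assumed: quoting, commas, parentheses per VALUES tuple
    final long estimatedOverhead = maxRecordsPerBatch * perRecordSqlOverheadBytes; // ~1.5MB
    final long overheadBudget = 2 * 1024 * 1024L;           // the code comment rounds the overhead up to 2MB
    final long maxBatchSizeForFlush = 14 * 1024 * 1024L;    // raw record-data budget
    final long redshiftStatementLimit = 16 * 1024 * 1024L;  // Redshift's max SQL statement size

    System.out.println("estimated per-batch overhead: " + estimatedOverhead + " bytes");
    System.out.println("worst-case statement size:    " + (maxBatchSizeForFlush + overheadBudget) + " bytes");
    System.out.println("stays within the 16MB limit:  "
        + (maxBatchSizeForFlush + overheadBudget <= redshiftStatementLimit));
  }
}
```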

@@ -52,7 +52,8 @@ import org.slf4j.LoggerFactory
object JdbcBufferedConsumerFactory {
private val LOGGER: Logger = LoggerFactory.getLogger(JdbcBufferedConsumerFactory::class.java)

@JvmOverloads
Contributor Author

@gisripa gisripa Apr 10, 2024


This is no longer used from any Java context, so I removed the @JvmOverloads annotation.

@gisripa gisripa force-pushed the gireesh/04-10-destination-redshift-16mb-stmt-fix branch 2 times, most recently from f5252bb to 5bdd24f Compare April 10, 2024 21:55
@gisripa gisripa requested a review from a team as a code owner April 10, 2024 21:55
@octavia-squidington-iii octavia-squidington-iii added the area/documentation Improvements or additions to documentation label Apr 10, 2024
@gisripa gisripa force-pushed the gireesh/04-10-destination-redshift-16mb-stmt-fix branch from 5bdd24f to 3d94949 Compare April 10, 2024 22:05
@gisripa
Contributor Author

gisripa commented Apr 10, 2024

/publish-java-cdk

🕑 https://github.com/airbytehq/airbyte/actions/runs/8638486506
✅ Successfully published Java CDK version=0.29.12!

@gisripa gisripa force-pushed the gireesh/04-10-destination-redshift-16mb-stmt-fix branch from 3d94949 to e31cd1e Compare April 10, 2024 23:14
@gisripa gisripa merged commit e16b0d2 into master Apr 10, 2024
28 checks passed
@gisripa gisripa deleted the gireesh/04-10-destination-redshift-16mb-stmt-fix branch April 10, 2024 23:36
@stephane-airbyte
Contributor

@gisripa can we create an issue to solve this in a more holistic way? I really don't think this should be solved at the individual connector level. We need to start thinking of a way for various connectors to advertise their capabilities to the platform rather than pass more and more parameters to the CDK.

@edgao
Contributor

edgao commented Apr 11, 2024

+1 on solving this generically, but I do think this needs to be solved in-connector (and therefore probably in the cdk). It's a limitation of the length of the sql statement insert into <table> (cols...) values (rec1...), (rec2...), ... - platform has no idea that we're converting record messages into a giant sql blob

But right now there's a lot of indirection from stdin -> async framework (?) -> jdbc sql operations (I think??) -> redshift sql operations (???) -> actual insert statement, which makes this kind of painful :(
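
To illustrate the statement shape being discussed (a sketch, not the connector's actual SQL builder): every record appends another VALUES tuple, so the length of the SQL text grows with the batch regardless of anything the platform can observe.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

public class MultiValueInsertSketch {
  // Builds: INSERT INTO <table> (col1, col2) VALUES ('a', 'b'), ('c', 'd'), ...
  static String buildInsert(String table, List<List<String>> rows) {
    final String values = rows.stream()
        .map(row -> row.stream()
            .map(v -> "'" + v.replace("'", "''") + "'") // naive escaping, illustration only
            .collect(Collectors.joining(", ", "(", ")")))
        .collect(Collectors.joining(", "));
    return "INSERT INTO " + table + " (col1, col2) VALUES " + values;
  }

  public static void main(String[] args) {
    final String sql = buildInsert("my_table", List.of(List.of("a", "b"), List.of("c", "d")));
    System.out.println(sql);
    // It is this statement's byte length that has to stay below Redshift's 16MB cap.
    System.out.println("statement bytes: " + sql.getBytes(StandardCharsets.UTF_8).length);
  }
}
```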
