
Destination Redshift: Limit Standard insert statement size < 16MB #36973

Merged
merged 1 commit into master on Apr 10, 2024

Conversation

gisripa
Contributor

@gisripa gisripa commented Apr 10, 2024

Limit the multi-value insert statement so that it does not exceed 16MB.

Contributor Author

gisripa commented Apr 10, 2024

This stack of pull requests is managed by Graphite.

@gisripa gisripa changed the title destination-redshift-16mb-stmt-fix Destination Redshift: Limit Standard insert statement size < 16MB Apr 10, 2024
@gisripa gisripa marked this pull request as ready for review April 10, 2024 18:33
@gisripa gisripa requested a review from a team as a code owner April 10, 2024 18:33
@octavia-squidington-iii octavia-squidington-iii added connectors/destination/redshift area/connectors Connector related issues CDK Connector Development Kit labels Apr 10, 2024

vercel bot commented Apr 10, 2024

The latest updates on your projects:

| Name | Status | Updated (UTC) |
| --- | --- | --- |
| airbyte-docs | ✅ Ready | Apr 10, 2024 11:19pm |

Contributor

@edgao edgao left a comment


for my understanding: this is basically just telling the async framework to send us batches of <14MB, which hopefully gives us enough wiggle room to avoid generating sql > 16MB?

(I also have no idea how the async framework measures record size, is it using the serialized json string size?)

// If the flush size allows the max batch of 10k records, then the net overhead is ~1.5MB.
// Let's round that up to 2MB and keep a max buffer of 14MB.
// This avoids sending a record set larger than 14MB, keeping the batch insert statement under the 16MB limit.
private static final Long MAX_BATCH_SIZE_FOR_FLUSH = 14 * 1024 * 1024L;
Contributor


nit: how is the optimalBatchSize parameter used? I.e. does the async framework treat it as a recommendation, or a hard limit? (and let's align the naming to either optimalBatchSize + OPTIMAL_BATCH_SIZE_FOR_FLUSH, or maxBatchSize + MAX_BATCH_SIZE_FOR_FLUSH)
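
(Not part of the PR.) To make the guardrail concrete, a minimal sketch, assuming the framework simply flushes whenever either the record-count cap or the accumulated serialized-size cap is hit first; the class and method names are illustrative, not the CDK's actual API:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: flush on whichever limit is reached first, the 10k-record cap
// or the 14MB serialized-size cap, so the generated INSERT stays below 16MB.
class FlushGuardrailSketch {
  private static final long MAX_BATCH_SIZE_FOR_FLUSH = 14 * 1024 * 1024L;
  private static final int MAX_RECORDS_PER_BATCH = 10_000;

  private final List<String> buffer = new ArrayList<>();
  private long bufferedBytes = 0;

  void accept(String serializedRecord) {
    buffer.add(serializedRecord);
    bufferedBytes += serializedRecord.length(); // size measured on the serialized string
    if (buffer.size() >= MAX_RECORDS_PER_BATCH || bufferedBytes >= MAX_BATCH_SIZE_FOR_FLUSH) {
      flush();
    }
  }

  private void flush() {
    if (buffer.isEmpty()) {
      return;
    }
    // The real connector would build one multi-value INSERT from the buffered records here.
    System.out.printf("flushing %d records, ~%d bytes%n", buffer.size(), bufferedBytes);
    buffer.clear();
    bufferedBytes = 0;
  }
}
```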

@@ -16,4 +27,49 @@ protected ObjectNode getBaseConfig() {
return (ObjectNode) Jsons.deserialize(IOs.readFile(Path.of("secrets/1s1t_config.json")));
}

@Test
public void testStandardInsertBatchSizeGtThan16Mb() throws Exception {
Contributor


should we actually move this into the base TD test class? Seems useful for all destinations to verify handling of nontrivial data size

Contributor Author


Do we have limitations on staging batch size, do you recall? I put it in the standard-insert flow because of the SQL statement limit, but we could probably pull it up to AbstractRedshiftTD too.

Contributor


🤷

we probably target a specific size, but I'm not aware of redshift (or any staging dest) setting a hard limit on file size (and we're probably not reaching that limit anyway...)
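
For readers without the full diff: the gist of such a test is simply to push more than 16MB of serialized records through the standard-insert path. A rough fixture sketch, with hypothetical names and record shapes (not the code actually added in this PR):

```java
import java.util.ArrayList;
import java.util.List;

class LargeBatchFixtureSketch {
  // Hypothetical helper: build enough ~100KB records that their combined
  // serialized size comfortably exceeds the 16MB Redshift statement limit.
  static List<String> recordsLargerThan16Mb() {
    final String padding = "x".repeat(100_000);
    final List<String> records = new ArrayList<>();
    long totalBytes = 0;
    int id = 0;
    while (totalBytes <= 16L * 1024 * 1024) {
      final String record = "{\"id\": " + (id++) + ", \"payload\": \"" + padding + "\"}";
      records.add(record);
      totalBytes += record.length();
    }
    // A real test would run a sync with these records through the standard-insert
    // flow and assert that every record lands in the destination table.
    return records;
  }
}
```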

@gisripa
Contributor Author

gisripa commented Apr 10, 2024

for my understanding: this is basically just telling the async framework to send us batches of <14MB, which hopefully gives us enough wiggle room to avoid generating sql > 16MB?

(I also have no idea how the async framework measures record size, is it using the serialized json string size?)

Yeah, your understanding is correct. It uses the raw string size of each record as consumed from stdin, passed along as the record size. If the records are small enough that 10k of them stay under 14MB, then our 10k-record batch insert kicks in first; otherwise this acts as the guardrail and the batch ends up well below 10k records.
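
(Not in the PR.) To spell out the headroom arithmetic behind the 14MB constant: the "~1.5MB" in the code comment reads like roughly 150 bytes of per-row SQL syntax overhead times 10k records; that per-row figure is my own assumption, not something stated in the diff.

```java
public class InsertHeadroomSketch {
  public static void main(String[] args) {
    final long maxRecordsPerBatch = 10_000;
    final long perRecordSqlOverheadBytes = 150;            // assumed: quoting, commas, parentheses per VALUES tuple
    final long estimatedOverhead = maxRecordsPerBatch * perRecordSqlOverheadBytes; // ~1.5MB
    final long overheadBudget = 2 * 1024 * 1024L;           // the code comment rounds the overhead up to 2MB
    final long maxBatchSizeForFlush = 14 * 1024 * 1024L;    // raw record-data budget
    final long redshiftStatementLimit = 16 * 1024 * 1024L;  // Redshift's max SQL statement size

    System.out.println("estimated per-batch overhead: " + estimatedOverhead + " bytes");
    System.out.println("worst-case statement size:    " + (maxBatchSizeForFlush + overheadBudget) + " bytes");
    System.out.println("stays within the 16MB limit:  "
        + (maxBatchSizeForFlush + overheadBudget <= redshiftStatementLimit));
  }
}
```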

@@ -52,7 +52,8 @@ import org.slf4j.LoggerFactory
object JdbcBufferedConsumerFactory {
private val LOGGER: Logger = LoggerFactory.getLogger(JdbcBufferedConsumerFactory::class.java)

@JvmOverloads
Contributor Author

@gisripa gisripa Apr 10, 2024


This is no longer used from any Java context, so I removed the @JvmOverloads annotation.

@gisripa gisripa force-pushed the gireesh/04-10-destination-redshift-16mb-stmt-fix branch 2 times, most recently from f5252bb to 5bdd24f Compare April 10, 2024 21:55
@gisripa gisripa requested a review from a team as a code owner April 10, 2024 21:55
@octavia-squidington-iii octavia-squidington-iii added the area/documentation Improvements or additions to documentation label Apr 10, 2024
@gisripa gisripa force-pushed the gireesh/04-10-destination-redshift-16mb-stmt-fix branch from 5bdd24f to 3d94949 Compare April 10, 2024 22:05
@gisripa
Contributor Author

gisripa commented Apr 10, 2024

/publish-java-cdk

🕑 https://github.com/airbytehq/airbyte/actions/runs/8638486506
✅ Successfully published Java CDK version=0.29.12!

@gisripa gisripa force-pushed the gireesh/04-10-destination-redshift-16mb-stmt-fix branch from 3d94949 to e31cd1e Compare April 10, 2024 23:14
@gisripa gisripa merged commit e16b0d2 into master Apr 10, 2024
28 checks passed
@gisripa gisripa deleted the gireesh/04-10-destination-redshift-16mb-stmt-fix branch April 10, 2024 23:36
@stephane-airbyte
Contributor

@gisripa can we create an issue to solve this in a more holistic way? I really don't think this should be solved at the individual connector level. We need to start thinking of a way for various connectors to advertise their capabilities to the platform rather than pass more and more parameters to the CDK.

@edgao
Contributor

edgao commented Apr 11, 2024

+1 on solving this generically, but I do think this needs to be solved in-connector (and therefore probably in the cdk). It's a limitation of the length of the sql statement insert into <table> (cols...) values (rec1...), (rec2...), ... - platform has no idea that we're converting record messages into a giant sql blob

But right now there's a lot of indirection from stdin -> async framework (?) -> jdbc sql operations (I think??) -> redshift sql operations (???) -> actual insert statement, which makes this kind of painful :(
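
To illustrate the statement shape being discussed (a sketch, not the connector's actual SQL builder): every record appends another VALUES tuple, so the length of the SQL text grows with the batch regardless of anything the platform can observe.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

public class MultiValueInsertSketch {
  // Builds: INSERT INTO <table> (col1, col2) VALUES ('a', 'b'), ('c', 'd'), ...
  static String buildInsert(String table, List<List<String>> rows) {
    final String values = rows.stream()
        .map(row -> row.stream()
            .map(v -> "'" + v.replace("'", "''") + "'") // naive escaping, illustration only
            .collect(Collectors.joining(", ", "(", ")")))
        .collect(Collectors.joining(", "));
    return "INSERT INTO " + table + " (col1, col2) VALUES " + values;
  }

  public static void main(String[] args) {
    final String sql = buildInsert("my_table", List.of(List.of("a", "b"), List.of("c", "d")));
    System.out.println(sql);
    // It is this statement's byte length that has to stay below Redshift's 16MB cap.
    System.out.println("statement bytes: " + sql.getBytes(StandardCharsets.UTF_8).length);
  }
}
```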
