Destination Redshift: Limit Standard insert statement size < 16MB #36973
Conversation
For my understanding: this is basically just telling the async framework to send us batches of <14MB, which hopefully gives us enough wiggle room to avoid generating SQL > 16MB?
(I also have no idea how the async framework measures record size — is it using the serialized JSON string size?)
// If the flush size allows the max batch of 10k records, the net overhead is ~1.5MB.
// Let's round that up to 2MB and keep a max buffer of 14MB.
// This avoids sending a record set larger than 14MB, keeping the batch insert statement under the limit.
private static final Long MAX_BATCH_SIZE_FOR_FLUSH = 14 * 1024 * 1024L;
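For context, a small sketch of the arithmetic behind that constant. The ~150-byte per-record overhead is an assumed figure implied by the "~1.5MB for 10k records" note above, not a measured value:

```java
// Illustrative sketch only: how the 14MB flush cap relates to the 16MB Redshift statement limit.
public class FlushSizeMath {

  private static final long REDSHIFT_STATEMENT_LIMIT = 16L * 1024 * 1024; // hard limit on SQL text size
  private static final long MAX_RECORDS_PER_BATCH = 10_000;               // existing batch record cap
  private static final long PER_RECORD_OVERHEAD_BYTES = 150;              // assumed: quoting, commas, parens

  public static void main(String[] args) {
    long worstCaseOverhead = MAX_RECORDS_PER_BATCH * PER_RECORD_OVERHEAD_BYTES; // ~1.5MB
    long headroom = 2L * 1024 * 1024;                                           // rounded up to 2MB
    long maxDataBytes = REDSHIFT_STATEMENT_LIMIT - headroom;                    // = 14MB flush cap
    System.out.printf("overhead=%d headroom=%d flushCap=%d%n", worstCaseOverhead, headroom, maxDataBytes);
  }
}
```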
nit: how is the optimalBatchSize parameter used? I.e. does the async framework treat it as a recommendation, or a hard limit? (and let's align the naming to either optimalBatchSize + OPTIMAL_BATCH_SIZE_FOR_FLUSH, or maxBatchSize + MAX_BATCH_SIZE_FOR_FLUSH)
@@ -16,4 +27,49 @@ protected ObjectNode getBaseConfig() {
return (ObjectNode) Jsons.deserialize(IOs.readFile(Path.of("secrets/1s1t_config.json")));
}

@Test
public void testStandardInsertBatchSizeGtThan16Mb() throws Exception {
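The test body is elided by the diff above. Purely as a hedged illustration of the idea (the class, method, and field names here are hypothetical, not the actual test), a fixture like this would generate enough raw record data to exceed the 16MB statement limit:

```java
import static org.junit.jupiter.api.Assertions.assertTrue;

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import org.junit.jupiter.api.Test;

class LargeBatchFixtureSketch {

  @Test
  void illustrateBatchLargerThan16Mb() {
    final int recordCount = 20_000;
    final String padding = "x".repeat(1024); // ~1KB of data per record => ~20MB total
    final List<String> recordJson = new ArrayList<>();
    long totalBytes = 0;
    for (int i = 0; i < recordCount; i++) {
      final String json = String.format("{\"id\":%d,\"blob\":\"%s\"}", i, padding);
      totalBytes += json.getBytes(StandardCharsets.UTF_8).length;
      recordJson.add(json);
    }
    // The fixture must comfortably exceed the 16MB limit for the test to be meaningful.
    assertTrue(totalBytes > 16L * 1024 * 1024);
    // The real test would push these records through a sync and verify the
    // standard-inserts path splits them across multiple INSERT statements.
  }
}
```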
should we actually move this into the base TD test class? Seems useful for all destinations to verify handling of nontrivial data size
Do we have limitations on staging batch size, do you recall? I put this in the standard-insert flow because of the SQL statement limit; we can probably pull it up to AbstractRedshiftTD too.
🤷
We probably target a specific size, but I'm not aware of Redshift (or any staging destination) setting a hard limit on file size (and we're probably not reaching that limit anyway...)
Yeah, your understanding is correct. It's using the raw string size consumed from stdin, passed along as the record size. If the records are small enough that it takes a large number of them to add up to 14MB, then the existing 10k-record batch insert kicks in first; otherwise this acts as the guardrail and the batch ends up well under 10k records.
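Purely as an illustration of the guardrail described here (the names and structure below are made up, not the actual async-framework API), the flush decision amounts to whichever cap is hit first:

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: the flush decision as described in the comment above.
final class FlushGuardSketch {

  private static final long MAX_BATCH_BYTES = 14L * 1024 * 1024; // size guardrail
  private static final int MAX_BATCH_RECORDS = 10_000;           // existing record-count cap

  private long bufferedBytes = 0;
  private int bufferedRecords = 0;

  /** Returns true when the caller should flush the current batch before adding this record. */
  boolean shouldFlushBefore(String serializedRecord) {
    long size = serializedRecord.getBytes(StandardCharsets.UTF_8).length;
    return bufferedBytes + size > MAX_BATCH_BYTES || bufferedRecords + 1 > MAX_BATCH_RECORDS;
  }

  void add(String serializedRecord) {
    bufferedBytes += serializedRecord.getBytes(StandardCharsets.UTF_8).length;
    bufferedRecords++;
  }

  void reset() {
    bufferedBytes = 0;
    bufferedRecords = 0;
  }
}
```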
@@ -52,7 +52,8 @@ import org.slf4j.LoggerFactory
object JdbcBufferedConsumerFactory {
private val LOGGER: Logger = LoggerFactory.getLogger(JdbcBufferedConsumerFactory::class.java)

@JvmOverloads
No longer used in any Java context, so removed the @JvmOverloads
/publish-java-cdk
@gisripa can we create an issue to solve this in a more holistic way? I really don't think this should be solved at the individual connector level. We need to start thinking of a way for various connectors to advertise their capabilities to the platform rather than pass more and more parameters to the CDK.
+1 on solving this generically, but I do think this needs to be solved in-connector (and therefore probably in the CDK): it's a limitation on the length of the SQL statement. But right now there's a lot of indirection from stdin -> async framework (?) -> JDBC SQL operations (I think??) -> Redshift SQL operations (???) -> actual insert statement, which makes this kind of painful :(
Limit the multi-value insert statement so it does not exceed 16MB.
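For illustration, a simplified stand-in (not the connector's actual SqlOperations code; the table and column names are just examples) showing why the statement text grows with the batch and why it has to stay under Redshift's 16MB statement-size limit:

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

// Simplified stand-in for how a multi-value INSERT's size tracks the batch size.
final class MultiValueInsertSketch {

  static String buildInsert(String table, List<String> serializedRecords) {
    // Each record becomes one (...) tuple; real quoting/escaping is more involved.
    final String values = serializedRecords.stream()
        .map(json -> "('" + json.replace("'", "''") + "')")
        .collect(Collectors.joining(",\n"));
    return "INSERT INTO " + table + " (_airbyte_data) VALUES\n" + values + ";";
  }

  static boolean fitsRedshiftLimit(String statement) {
    // Redshift rejects SQL text larger than 16MB, hence the 14MB flush cap upstream.
    return statement.getBytes(StandardCharsets.UTF_8).length < 16L * 1024 * 1024;
  }
}
```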