Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-7237] Hudi Streamer: Handle edge case with null schema, minor cleanups #10342

Merged
merged 8 commits into from
Jan 24, 2024

Conversation

the-other-tim-brown
Copy link
Contributor

Change Logs

  • Reduces number of calls to schema provider by properly using orElseGet in one case and reusing existing writer config instead of recomputing
  • Reuses existing hadoop configuration instance instead of creating new one when possible
  • Handles case where schema provider returns null and there is an empty batch

Impact

  • Reduces number of calls to schema providers which can reduce load on the provider or cost of accessing file system
  • Fixes edge case where an empty batch could write a commit with a null value as the schema

Risk level (write none, low medium or high below)

low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

Schema sourceSchema = schemaProvider.getSourceSchema();
Schema targetSchema = schemaProvider.getTargetSchema();
Schema targetSchema = getSchemaForWriteConfig(schemaProvider.getTargetSchema());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we call reInitWriteClient twice in this class. If we wer fixing the target schema here to go through getSchemaForWriteConfig, should we fix the other caller as well.
Also, looks like we are calling getSchemaForWriteConfig within reInitWriteClient(), should we remove that call if callers already take care of it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't call getSchemaForWriteConfig from reInitWriteClient(). There is a call to getHoodieClientConfig that takes in a schema that must be non-null to hit the getSchemaForWriteConfig path though.

I think one way to simplify this is to make this call always happen if the schema is required in the config. When we initialize the StreamSync class, we will pass in some schema or null potentially but I don't think we really even need the writer schema for this case and want to avoid a read from the filesystem if possible to fetch the latest commit schema. If you agree, I can clean this up so that getHoodieClientConfig takes in a schema and a boolean indicating whether the schema must be set in the config.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do call.

reInitWriteClient {

       final HoodieWriteConfig initialWriteConfig = getHoodieClientConfig(targetSchema); 
}

within getHoodieClientConfig, we do call

    if (schema != null) {
      builder.withSchema(getSchemaForWriteConfig(schema).toString());
    }

but I agree, that if we had already called getSchemaForWriteConfig and fixed the target schema, it will be no-op. Was just pointing out we are making repeated calls.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nsivabalan can you link the line where we make the call? I am missing something or have some outdated code somehow.

Copy link
Contributor

@nsivabalan nsivabalan Dec 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

private void reInitWriteClient(Schema sourceSchema, Schema targetSchema, Option<JavaRDD<HoodieRecord>> recordsOpt) throws IOException {

final HoodieWriteConfig initialWriteConfig = getHoodieClientConfig(targetSchema);

private HoodieWriteConfig getHoodieClientConfig(Schema schema) {

https://github.com/apache/hudi/blob/50f0d9f3baeafe92c38dfb0b59b337d83391ea42/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java#L1076C26-L1076C49

private Schema getSchemaForWriteConfig(Schema targetSchema) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getHoodieClientConfig is also called from the constructor so you need to consider that path too if you are looking simply at paths that can eventually hit this getSchemaForWriteConfig method. I've updated the code so that there is no repeated call anymore and the call from the constructor avoids a potential schema lookup entirely.

@nsivabalan nsivabalan added priority:critical production down; pipelines stalled; Need help asap. release-0.14.1 labels Dec 17, 2023
@@ -55,7 +55,7 @@ public SchemaProvider getSchemaProvider() {
if (batch.isPresent() && schemaProvider == null) {
throw new HoodieException("Please provide a valid schema provider class!");
}
return Option.ofNullable(schemaProvider).orElse(new NullSchemaProvider());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor. do you think we can statically define once and use it rather than creating new objects everytime ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, updating now

@the-other-tim-brown
Copy link
Contributor Author

@hudi-bot run azure

@hudi-bot
Copy link

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@nsivabalan
Copy link
Contributor

image

@nsivabalan nsivabalan merged commit b420850 into apache:master Jan 24, 2024
31 checks passed
@the-other-tim-brown the-other-tim-brown deleted the HUDI-7237 branch January 24, 2024 21:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority:critical production down; pipelines stalled; Need help asap. release-0.15.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants