-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[HUDI-7237] Hudi Streamer: Handle edge case with null schema, minor cleanups #10342
Conversation
Schema sourceSchema = schemaProvider.getSourceSchema(); | ||
Schema targetSchema = schemaProvider.getTargetSchema(); | ||
Schema targetSchema = getSchemaForWriteConfig(schemaProvider.getTargetSchema()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we call reInitWriteClient twice in this class. If we wer fixing the target schema here to go through getSchemaForWriteConfig, should we fix the other caller as well.
Also, looks like we are calling getSchemaForWriteConfig within reInitWriteClient(), should we remove that call if callers already take care of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't call getSchemaForWriteConfig
from reInitWriteClient()
. There is a call to getHoodieClientConfig
that takes in a schema that must be non-null to hit the getSchemaForWriteConfig
path though.
I think one way to simplify this is to make this call always happen if the schema is required in the config. When we initialize the StreamSync class, we will pass in some schema or null potentially but I don't think we really even need the writer schema for this case and want to avoid a read from the filesystem if possible to fetch the latest commit schema. If you agree, I can clean this up so that getHoodieClientConfig
takes in a schema and a boolean indicating whether the schema must be set in the config.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do call.
reInitWriteClient {
final HoodieWriteConfig initialWriteConfig = getHoodieClientConfig(targetSchema);
}
within getHoodieClientConfig, we do call
if (schema != null) {
builder.withSchema(getSchemaForWriteConfig(schema).toString());
}
but I agree, that if we had already called getSchemaForWriteConfig and fixed the target schema, it will be no-op. Was just pointing out we are making repeated calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nsivabalan can you link the line where we make the call? I am missing something or have some outdated code somehow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hudi/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java
Line 1008 in 50f0d9f
private void reInitWriteClient(Schema sourceSchema, Schema targetSchema, Option<JavaRDD<HoodieRecord>> recordsOpt) throws IOException { |
hudi/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java
Line 1014 in 50f0d9f
final HoodieWriteConfig initialWriteConfig = getHoodieClientConfig(targetSchema); |
hudi/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java
Line 1049 in 50f0d9f
private HoodieWriteConfig getHoodieClientConfig(Schema schema) { |
hudi/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java
Line 1111 in 50f0d9f
private Schema getSchemaForWriteConfig(Schema targetSchema) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getHoodieClientConfig
is also called from the constructor so you need to consider that path too if you are looking simply at paths that can eventually hit this getSchemaForWriteConfig
method. I've updated the code so that there is no repeated call anymore and the call from the constructor avoids a potential schema lookup entirely.
hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java
Show resolved
Hide resolved
@@ -55,7 +55,7 @@ public SchemaProvider getSchemaProvider() { | |||
if (batch.isPresent() && schemaProvider == null) { | |||
throw new HoodieException("Please provide a valid schema provider class!"); | |||
} | |||
return Option.ofNullable(schemaProvider).orElse(new NullSchemaProvider()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor. do you think we can statically define once and use it rather than creating new objects everytime ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, updating now
@hudi-bot run azure |
Change Logs
orElseGet
in one case and reusing existing writer config instead of recomputingImpact
Risk level (write none, low medium or high below)
low
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change
ticket number here and follow the instruction to make
changes to the website.
Contributor's checklist