
[HUDI-7622] Optimize HoodieTableSource's sanity check #11031

Open

wants to merge 5 commits into master from the HUDI-7622 branch
Conversation

Contributor
@zhuanshenbsj1 commented Apr 16, 2024

Change Logs

The existing exception does not indicate which table or which specific columns caused the failure:
(screenshot of the original exception)

Modify as follows:

  1. Print the offending column names and the table name in the exception thrown by MergeOnReadTableState#getRequiredPositions, and move the check forward from the operator execution stage to the operator initialization stage (a sketch of this check follows the screenshots below).

check columns (screenshot of the new exception)

check pks (screenshot of the new exception)
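For illustration, a minimal sketch of the kind of check described in point 1, using assumed parameter names rather than the exact fields of MergeOnReadTableState:

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.hudi.exception.HoodieException;

// Sketch only: resolve the positions of the queried columns in the table schema,
// failing fast with the missing column names and the table name.
static int[] getRequiredPositions(List<String> tableFieldNames, String[] requiredFields, String tableName) {
  List<String> missing = Arrays.stream(requiredFields)
      .filter(field -> !tableFieldNames.contains(field))
      .collect(Collectors.toList());
  if (!missing.isEmpty()) {
    throw new HoodieException("Column(s) [" + String.join(", ", missing)
        + "] does not exist in the table " + tableName + ".");
  }
  return Arrays.stream(requiredFields).mapToInt(tableFieldNames::indexOf).toArray();
}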

Impact

Describe any public API or user-facing feature change or any performance impact.

Risk level (write none, low, medium or high below)

If medium or high, explain what verification was done to mitigate the risks.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of a config is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instructions to make changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:S PR with lines of changes in (10, 100] label Apr 16, 2024
.toArray();
if (!expColumns.isEmpty()) {
throw new HoodieException("Column(s) " + String.join(", ", expColumns) + " does not exists in the hudi table " + this.tableName + ".");
}
Contributor

Column(s) [$col_a, $col_b, $col_c ...] does not exist in the table $tableName.

Contributor Author

Done.

@zhuanshenbsj1 zhuanshenbsj1 force-pushed the HUDI-7622 branch 6 times, most recently from 5ef17f7 to a7270a9 on April 18, 2024 01:46
@github-actions github-actions bot added size:L PR with lines of changes in (300, 1000] and removed size:S PR with lines of changes in (10, 100] labels Apr 22, 2024
@zhuanshenbsj1 zhuanshenbsj1 force-pushed the HUDI-7622 branch 3 times, most recently from a487acf to 0b3c631 on April 22, 2024 16:14
@@ -86,12 +84,14 @@ public DynamicTableSource createDynamicTableSource(Context context) {
     setupTableOptions(conf.getString(FlinkOptions.PATH), conf);
     ResolvedSchema schema = context.getCatalogTable().getResolvedSchema();
     setupConfOptions(conf, context.getObjectIdentifier(), context.getCatalogTable(), schema);
-    return new HoodieTableSource(
+    HoodieTableSource source = new HoodieTableSource(
         schema,
         path,
         context.getCatalogTable().getPartitionKeys(),
         conf.getString(FlinkOptions.PARTITION_DEFAULT_NAME),
Contributor

Let's keep the sanity check in the HoodieTableFactory?

Contributor Author

Following what you said, the MetaClient needs to be initialized in the factory (the Hudi source sanity check needs it). Wouldn't it be more reasonable to initialize it in the source?

Contributor

the MetaClient needs to be initialized in the factory

That's okay, we already do that for the sink sanity check of table config.

@zhuanshenbsj1 zhuanshenbsj1 force-pushed the HUDI-7622 branch 3 times, most recently from e667c4b to 20af652 on April 26, 2024 06:55
@@ -518,7 +499,7 @@ private MergeOnReadInputFormat mergeOnReadInputFormat(
         tableAvroSchema.toString(),
         AvroSchemaConverter.convertToSchema(requiredRowType).toString(),
         inputSplits,
-        conf.getString(FlinkOptions.RECORD_KEY_FIELD).split(","));
+        OptionsResolver.getRecordKeyField(conf));
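For context, a plausible sketch of the helper used here, assuming it simply centralizes the comma split that was previously inlined at this call site (the real OptionsResolver method may do more):

import org.apache.flink.configuration.Configuration;
import org.apache.hudi.configuration.FlinkOptions;

// Sketch: parse the record key fields in one place instead of splitting at every call site.
public static String[] getRecordKeyField(Configuration conf) {
  return conf.getString(FlinkOptions.RECORD_KEY_FIELD).split(",");
}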
Contributor

This change is way too complicated; can you illustrate the issue again? What is the user's use case, and what is the expected correct behavior?

Contributor Author

  1. The table created by the upstream writer (recorded in the existing metadata) does not match the columns configured for the downstream streaming read; for example, some configured columns do not exist, so they cannot be found.
    -> Verification fails and an exception is thrown.
  2. The configured record key does not exist (a sketch of this check follows this list).
    -> Verification fails and an exception is thrown.
  3. Case problem: the columns created by Calcite upstream are all lowercase, so if the downstream uses uppercase, such as "eventTime", the columns will not be found.
    -> Uniformly convert to lowercase.
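A minimal sketch of the record-key check from point 2, using illustrative method and parameter names rather than the PR's exact code:

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.flink.configuration.Configuration;
import org.apache.hudi.configuration.FlinkOptions;
import org.apache.hudi.exception.HoodieException;

// Sketch: verify that every configured record key field exists in the resolved schema.
public static void checkRecordKey(Configuration conf, List<String> schemaFields, String tableName) {
  String[] recordKeys = conf.getString(FlinkOptions.RECORD_KEY_FIELD).split(",");
  List<String> missing = Arrays.stream(recordKeys)
      .filter(key -> !schemaFields.contains(key))
      .collect(Collectors.toList());
  if (!missing.isEmpty()) {
    throw new HoodieException("Record key(s) [" + String.join(", ", missing)
        + "] does not exist in the table " + tableName + ".");
  }
}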

Contributor

Case problem: the columns created by Calcite upstream are all lowercase, so if the downstream uses uppercase, such as "eventTime", the columns will not be found.
-> Uniformly convert to lowercase.

This is not expected to be handled by Hudi, I think. At least, from the catalog layer, we should keep case-sensitivity agnostic to specific engines.

The table created by the upstream writer (recorded in the existing metadata) does not match the columns configured for the downstream streaming read.

In HoodieTableFactory#createDynamicTableSource, add a sanity check comparing the catalog table's resolved schema against the existing Hudi table schema; that should be enough, I guess. Similarly for the primary key definition.
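A rough sketch of such a factory-level check, assuming the latest table fields can be read through the meta client (the helper names here are illustrative; StreamerUtil.getLatestTableFields is the helper referenced later in this thread):

import java.util.List;
import java.util.stream.Collectors;

import org.apache.flink.table.catalog.ResolvedSchema;
import org.apache.hudi.common.table.HoodieTableMetaClient;
import org.apache.hudi.exception.HoodieException;
import org.apache.hudi.util.StreamerUtil;

// Sketch: compare the catalog table's resolved schema against the existing Hudi table schema.
private static void checkSourceSchema(ResolvedSchema schema, HoodieTableMetaClient metaClient, String tableName) {
  List<String> tableFields = StreamerUtil.getLatestTableFields(metaClient);
  if (tableFields == null) {
    return; // no committed schema yet, nothing to validate against
  }
  List<String> missing = schema.getColumnNames().stream()
      .filter(col -> !tableFields.contains(col))
      .collect(Collectors.toList());
  if (!missing.isEmpty()) {
    throw new HoodieException("Column(s) [" + String.join(", ", missing)
        + "] does not exist in the table " + tableName + ".");
  }
}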


@zhuanshenbsj1 zhuanshenbsj1 changed the title [MINOR] Optimization function MergeOnReadTableState#getRequiredPositions [HUDI-7622] Add sanity check for HoodieTableSource Apr 28, 2024
@danny0405
Contributor

Thanks for the contribution, here is a patch for the fix:
7622.patch.zip

It would be great if you could help add some UTs.

@zhuanshenbsj1 zhuanshenbsj1 force-pushed the HUDI-7622 branch 2 times, most recently from 14a352d to a7ab936 on May 16, 2024 07:41
@zhuanshenbsj1
Contributor Author

zhuanshenbsj1 commented May 16, 2024

Thanks for the contribution, here is a patch for the fix: 7622.patch.zip

It would be great if you could help add some UTs.

Made some adjustments:

  1. Created a new utility class called SanityCheckUtil for validation, and moved the checkKeygenGenerator and checkPreCombineKey methods from StreamerUtil into this class.
  2. Instantiating the MetaClient is a heavy operation, so the Hudi source constructor takes a metaClient parameter to avoid initializing it multiple times.
  3. Merged the source and sink validation into the same function, differentiating them by checking whether the metaClient is null (a rough skeleton is sketched after this list).
  4. The source validates whether the columns and the record key are consistent with the table metadata.
  5. All exception messages produced by the validation include the table name.
  6. Adjusted some UTs for the change.
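Putting these adjustments together, a rough skeleton of the merged check (names follow the snippets shown later in this thread; bodies are simplified and several helpers are only sketched):

import java.util.List;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.table.catalog.ResolvedSchema;
import org.apache.hudi.common.table.HoodieTableMetaClient;
import org.apache.hudi.util.StreamerUtil;

/**
 * Utilities for the HoodieTableFactory sanity check (sketch).
 */
public class SanityChecks {

  // Shared entry point for source and sink: the source passes a meta client so that the
  // columns and the record key can be validated against the existing table metadata,
  // while the sink passes null and skips those checks.
  public static void sanitCheck(Configuration conf, ResolvedSchema schema, HoodieTableMetaClient metaClient) {
    checkTableType(conf);
    checkIndexType(conf);
    List<String> schemaFields = schema.getColumnNames();
    checkRecordKey(conf, schemaFields);
    checkPreCombineKey(conf, schemaFields);
    if (metaClient != null) {
      List<String> latestTableFields = StreamerUtil.getLatestTableFields(metaClient);
      if (latestTableFields != null) {
        checkSourceSchema(schemaFields, latestTableFields, conf); // illustrative helper name
      }
    }
  }

  // checkTableType, checkIndexType, checkRecordKey, checkPreCombineKey, checkKeygenGenerator
  // and checkSourceSchema are omitted here; some of them are shown further down in this thread.
}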

@zhuanshenbsj1 zhuanshenbsj1 changed the title [HUDI-7622] Add sanity check for HoodieTableSource [HUDI-7622] Optimize HoodieTableSource's sanity check May 16, 2024
/**
 * Utilities for HoodieTableFactory sanity check.
 */
public class SanityCheckUtil {
Contributor
@danny0405 commented May 16, 2024

Rename class name to SanityChecks.

Contributor Author

Rename class name to SanityChecks.

Done.

     return new HoodieTableSource(
         schema,
         path,
         context.getCatalogTable().getPartitionKeys(),
         conf.getString(FlinkOptions.PARTITION_DEFAULT_NAME),
-        conf);
+        conf,
+        metaClient);
Contributor

The metaClient initialization is costly, but the table source is only created once, at Job Graph compile time, so let's not reuse it here, to reduce complexity.

Contributor Author

The metaClient initialization is costly, but the table source is only created once, at Job Graph compile time, so let's not reuse it here, to reduce complexity.

Done.

@zhuanshenbsj1 zhuanshenbsj1 force-pushed the HUDI-7622 branch 2 times, most recently from f0936bc to 30f50eb on May 16, 2024 08:41
@@ -86,6 +86,8 @@ public DynamicTableSource createDynamicTableSource(Context context) {
     setupTableOptions(conf.getString(FlinkOptions.PATH), conf);
     ResolvedSchema schema = context.getCatalogTable().getResolvedSchema();
     setupConfOptions(conf, context.getObjectIdentifier(), context.getCatalogTable(), schema);
+    HoodieTableMetaClient metaClient = StreamerUtil.metaClientForReader(conf, HadoopConfigurations.getHadoopConf(conf));
+    SanityChecksUtil.sanitCheck(conf, schema, metaClient);
Contributor

Move the instantiation of the meta client into the SanityCheckUtil. And can we rename the class to SanityChecks?
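i.e., with the instantiation moved inside, the two factory call sites would reduce to something like this sketch (the method name follows the later diffs in this thread):

// HoodieTableFactory#createDynamicTableSource: metadata checks enabled,
// the meta client is created inside the check only when it is actually needed.
SanityChecks.sanitCheck(conf, schema, true);

// HoodieTableFactory#createDynamicTableSink: no metadata read required.
SanityChecks.sanitCheck(conf, schema, false);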

     checkTableType(conf);
     List<String> schemaFields = schema.getColumnNames();
-    if (metaClient != null) {
+    if (checkMetaData) {
+      HoodieTableMetaClient metaClient = StreamerUtil.metaClientForReader(conf, HadoopConfigurations.getHadoopConf(conf));
       List<String> latestTablefields = StreamerUtil.getLatestTableFields(metaClient);
       if (latestTablefields != null) {
Contributor

The logic for the sink has been changed by this patch. The original sink had this code:

    if (!OptionsResolver.isAppendMode(conf)) {
      checkRecordKey(conf, schema);
    }

Contributor Author

I put this logic into the checkRecordKey function; both the source and the sink need this check.

  public static void checkRecordKey(Configuration conf, List<String> existingFields) {
    if (OptionsResolver.isAppendMode(conf)) {
      return;
    }
    ....
  }

And the same is done in the checkIndexType function:

  public static void checkIndexType(Configuration conf) {
    if (OptionsResolver.isAppendMode(conf)) {
      return;
    }
    ....
  }

     setupConfOptions(conf, context.getObjectIdentifier(), context.getCatalogTable(), schema);
     setupSortOptions(conf, context.getConfiguration());
+    SanityChecks.sanitCheck(conf, schema, false);
Contributor

This line should be moved to before line 105.

@hudi-bot

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

Labels
size:L PR with lines of changes in (300, 1000]

3 participants