[HUDI-5829] Optimize conversion from json to row format when sanitizing field names #11941

Merged
jonvex merged 9 commits into apache:master from vamsikarnika:json_to_row on Oct 1, 2024
Collaborator

@vamsikarnika vamsikarnika commented Sep 13, 2024

Change Logs

Currently, when source data has to be read in row format and sanitization is enabled, we first read the data in Avro format (which supports sanitization) and then convert from Avro to row. The new approach simplifies this process by converting directly from JSON to row while applying sanitization.
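
To illustrate what sanitizing field names during a direct JSON-to-row pass might look like, here is a minimal sketch in plain Java. It is not the code from this PR: the class name, the method names, and the "__" replacement mask are assumptions made for the example. The only rule taken as given is the Avro naming rule (a name starts with a letter or underscore and continues with letters, digits, or underscores).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not Hudi's implementation.
class SanitizeSketch {
    // Assumed replacement for characters that are invalid in a field name.
    static final String MASK = "__";

    static boolean isAsciiLetter(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }

    static boolean isAsciiDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Replace every character that the Avro naming rule does not allow at its position.
    static String sanitizeName(String name) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            boolean valid = c == '_' || isAsciiLetter(c) || (i > 0 && isAsciiDigit(c));
            if (valid) {
                sb.append(c);
            } else {
                sb.append(MASK);
            }
        }
        return sb.toString();
    }

    // Sanitize the keys of a parsed JSON object in a single pass, recursing into nested objects.
    static Map<String, Object> sanitizeKeys(Map<String, ?> record) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, ?> e : record.entrySet()) {
            Object v = e.getValue();
            if (v instanceof Map) {
                @SuppressWarnings("unchecked")
                Map<String, ?> nested = (Map<String, ?>) v;
                v = sanitizeKeys(nested);
            }
            out.put(sanitizeName(e.getKey()), v);
        }
        return out;
    }
}
```

The point of the sketch is that the name fix-up happens while the parsed JSON is walked once, so no intermediate Avro record is needed before the row is built.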

Impact

When source data has to be read in row format and sanitization is enabled, this change should make the conversion faster by converting directly from JSON to row.

This change directly affects existing streams. It is currently behind a flag, which can be disabled if any issues are found.

Risk level (write none, low medium or high below)

None

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:XL PR with lines of changes > 1000 label Sep 13, 2024
@vamsikarnika vamsikarnika changed the title from "convert json to row using MercifulJsonToRowConverter" to "[HUDI-5829] Optimize conversion from json to row format when sanitizing field names" Sep 16, 2024
Comment on lines -313 to -325
// As we don't do rounding, the validation will enforce the scale part and the integer part are all within the
// limit. As a result, if scale is 2 precision is 5, we only allow 3 digits for the integer.
// Allowed: 123.45, 123, 0.12
// Disallowed: 1234 (4 digit integer while the scale has already reserved 2 digit out of the 5 digit precision)
//             123456, 0.12345
if (bigDecimal.scale() > decimalType.getScale()
    || (bigDecimal.precision() - bigDecimal.scale()) > (decimalType.getPrecision() - decimalType.getScale())) {
  // Correspond to case
  // org.apache.avro.AvroTypeException: Cannot encode decimal with scale 5 as scale 2 without rounding.
  // org.apache.avro.AvroTypeException: Cannot encode decimal with scale 3 as scale 2 without rounding
  return Pair.of(false, null);
}

Collaborator Author

I have moved this code to the DecimalFieldProcessor class so that both the JsonToAvro and JsonToRow processors can reuse it.
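
For readers who want to try the check quoted above in isolation, here is a minimal standalone sketch using only java.math.BigDecimal. The class and method names are hypothetical and this is not the DecimalFieldProcessor code itself; it only mirrors the condition shown in the removed lines, with the schema's precision and scale passed in as plain integers.

```java
import java.math.BigDecimal;

// Hypothetical sketch of the "no rounding" validation, not the PR's class.
class DecimalFitSketch {
    // True when the value can be stored with the given precision and scale without rounding.
    static boolean fitsWithoutRounding(BigDecimal value, int precision, int scale) {
        // Digits to the left of the decimal point in the input.
        int integerDigits = value.precision() - value.scale();
        // Digits the schema leaves for the integer part once the scale is reserved.
        int allowedIntegerDigits = precision - scale;
        return value.scale() <= scale && integerDigits <= allowedIntegerDigits;
    }

    public static void main(String[] args) {
        String[] samples = {"123.45", "123", "0.12", "1234", "123456", "0.12345"};
        for (String s : samples) {
            System.out.println(s + " -> " + fitsWithoutRounding(new BigDecimal(s), 5, 2));
        }
    }
}
```

With precision 5 and scale 2, the first three samples pass and the last three fail, which matches the "Allowed" and "Disallowed" lists in the quoted comment.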

Comment on lines +41 to +46
static Stream<Object> decimalBadCases() {
  return Stream.of(
      // Invalid schema definition.
      Arguments.of(DECIMAL_AVRO_FILE_INVALID_PATH, "123.45", null, false),
      // Schema set precision as 5, input overwhelmed the precision.
      Arguments.of(DECIMAL_AVRO_FILE_PATH, "123333.45", null, false),
Collaborator Author

@vamsikarnika vamsikarnika Sep 16, 2024

I have moved the test data generators from TestMercifulJsonConvertor to this base class so that both the row and Avro conversions can use the same test data.

Comment on lines -76 to +71
- @ValueSource(strings = {
-     "{\"first\":\"John\",\"last\":\"Smith\"}",
-     "[{\"first\":\"John\",\"last\":\"Smith\"}]",
-     "{\"first\":\"John\",\"last\":\"Smith\",\"suffix\":3}",
- })
+ @MethodSource("dataNestedJsonAsString")
Collaborator Author

I have converted this to a MethodSource and moved it to the test base class so that it can be reused by MercifulJsonToRowConverter.

Contributor

@jonvex jonvex left a comment

A few comments.


import org.apache.hudi.exception.HoodieJsonConversionException;

public class HoodieJsonToRowConversionException extends HoodieJsonConversionException {
Contributor

@jonvex jonvex self-assigned this Oct 1, 2024
Collaborator

hudi-bot commented Oct 1, 2024

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@jonvex jonvex self-requested a review October 1, 2024 14:50
Contributor

@jonvex jonvex left a comment

LGTM

@jonvex jonvex merged commit 2aad0a8 into apache:master Oct 1, 2024
Labels

size:XL PR with lines of changes > 1000

3 participants