
[MINOR] Removing dead code in HiveAvroSerializer#11577

Closed
voonhous wants to merge 2 commits into apache:master from voonhous:minor_remove_dead_code

Conversation

@voonhous
Member

@voonhous voonhous commented Jul 5, 2024

Change Logs

Removing the CASE-INT + DATE if-block, as DATE types coming from the Hive-related entrypoints will have a DATE PrimitiveObjectInspector type.


From my understanding, this logic is mainly copied over from:

case INT:
if (schema.getLogicalType() != null && schema.getLogicalType().getName().equals("date")) {
return HoodieHiveUtils.getDateWriteable((Integer) value);
}

This logic was introduced in this PR:
https://github.com/apache/hudi/pull/2634/files

That switch-case is applied at a different entrypoint (switching on an Avro Schema type). In that case, DATE types will indeed enter the CASE-INT block.

However, in HiveAvroSerializer, the code switches on PrimitiveObjectInspector.PrimitiveCategory instead. When reading DATE types, the CASE-DATE branch is taken instead of CASE-INT, making the block being deleted here code that will never be executed.

On top of that, the entire CASE-INT block can be removed, as it is already handled by the default block.

Therefore, removing it to avoid further confusion.

TL;DR: Remove the case-blocks, as we are switching on different types and the entrypoints are different.
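To illustrate the difference between the two dispatches, here is a minimal, self-contained Java sketch. The enums and method names below are simplified stand-ins, not the real Avro/Hive classes: in Avro, DATE is a logical type carried on an INT schema, while in Hive's PrimitiveCategory, DATE is its own category.

```java
// Self-contained sketch (NOT the real Hudi/Hive classes) of why the two
// switch statements behave differently for DATE values.
public class DispatchSketch {

  // Stand-in for org.apache.avro.Schema.Type: a date column arrives as INT
  // plus a "date" logical type, so the Avro-side switch sees INT.
  enum AvroType { INT, LONG, STRING }

  // Stand-in for PrimitiveObjectInspector.PrimitiveCategory: DATE is its own
  // category, so the Hive-side switch never reaches the INT branch for dates.
  enum PrimitiveCategory { INT, DATE, LONG, STRING }

  // Avro-side dispatch: the date check inside CASE-INT is reachable here.
  static String avroDispatch(AvroType type, String logicalType) {
    switch (type) {
      case INT:
        if ("date".equals(logicalType)) {
          return "date-handling inside CASE-INT";
        }
        return "plain int";
      default:
        return "other";
    }
  }

  // Hive-side dispatch (the HiveAvroSerializer situation): DATE values take
  // CASE-DATE, so a date check inside CASE-INT would be dead code.
  static String hiveDispatch(PrimitiveCategory category) {
    switch (category) {
      case DATE:
        return "CASE-DATE";
      case INT:
        return "CASE-INT";
      default:
        return "default";
    }
  }
}
```

The sketch shows the crux of the argument: the same "date" value hits different branches depending on which type system the switch is written against.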

Impact

None

Risk level (write none, low, medium or high below)

None

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:XS PR with lines of changes in <= 10 label Jul 5, 2024
@danny0405
Contributor

Thanks for the contribution, can you check the Azure CI failures?

@voonhous
Member Author

voonhous commented Jul 6, 2024

Will check on it and ping you again. It seems my initial claim that this section is no longer required is wrong; let me confirm this.

Stack trace of test failure:

2024-07-05T12:00:22.7907465Z [ERROR] Tests run: 296, Failures: 0, Errors: 4, Skipped: 0, Time elapsed: 1,364.399 s <<< FAILURE! - in JUnit Vintage
2024-07-05T12:00:22.7921852Z [ERROR] [1] true, true(testWriteLogDuringCompaction(boolean, boolean))  Time elapsed: 3.232 s  <<< ERROR!
2024-07-05T12:00:22.7922367Z java.lang.ClassCastException: org.apache.hadoop.hive.serde2.io.DateWritable cannot be cast to org.apache.hadoop.io.IntWritable
2024-07-05T12:00:22.7922891Z 	at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableIntObjectInspector.getPrimitiveJavaObject(WritableIntObjectInspector.java:46)
2024-07-05T12:00:22.7923390Z 	at org.apache.hudi.hadoop.utils.HiveAvroSerializer.serializePrimitive(HiveAvroSerializer.java:314)
2024-07-05T12:00:22.7923796Z 	at org.apache.hudi.hadoop.utils.HiveAvroSerializer.serialize(HiveAvroSerializer.java:210)
2024-07-05T12:00:22.7924221Z 	at org.apache.hudi.hadoop.utils.HiveAvroSerializer.setUpRecordFieldFromWritable(HiveAvroSerializer.java:119)
2024-07-05T12:00:22.7924909Z 	at org.apache.hudi.hadoop.utils.HiveAvroSerializer.serialize(HiveAvroSerializer.java:106)
2024-07-05T12:00:22.7925499Z 	at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.convertArrayWritableToHoodieRecord(RealtimeCompactedRecordReader.java:203)
2024-07-05T12:00:22.7925986Z 	at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.mergeRecord(RealtimeCompactedRecordReader.java:188)
2024-07-05T12:00:22.7926448Z 	at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:128)
2024-07-05T12:00:22.7926893Z 	at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:55)
2024-07-05T12:00:22.7927332Z 	at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:86)
2024-07-05T12:00:22.7927751Z 	at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:36)
2024-07-05T12:00:22.7928213Z 	at org.apache.hudi.testutils.HoodieMergeOnReadTestUtils.getRecordsUsingInputFormat(HoodieMergeOnReadTestUtils.java:159)
2024-07-05T12:00:22.7928698Z 	at org.apache.hudi.testutils.HoodieMergeOnReadTestUtils.getRecordsUsingInputFormat(HoodieMergeOnReadTestUtils.java:112)
2024-07-05T12:00:22.7929171Z 	at org.apache.hudi.testutils.HoodieMergeOnReadTestUtils.getRecordsUsingInputFormat(HoodieMergeOnReadTestUtils.java:106)
2024-07-05T12:00:22.7929628Z 	at org.apache.hudi.testutils.HoodieMergeOnReadTestUtils.getRecordsUsingInputFormat(HoodieMergeOnReadTestUtils.java:101)
2024-07-05T12:00:22.7930142Z 	at org.apache.hudi.table.functional.TestHoodieSparkMergeOnReadTableCompaction.readTableTotalRecordsNum(TestHoodieSparkMergeOnReadTableCompaction.java:197)
2024-07-05T12:00:22.7930707Z 	at org.apache.hudi.table.functional.TestHoodieSparkMergeOnReadTableCompaction.testWriteLogDuringCompaction(TestHoodieSparkMergeOnReadTableCompaction.java:189)
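The ClassCastException at the top of the trace comes from an unconditional cast: the int inspector assumes it is handed an IntWritable, but the record actually carries a DateWritable. A minimal, self-contained reproduction of that pattern (with stand-in classes, not the real Hadoop writables):

```java
// Stand-ins for org.apache.hadoop.io.IntWritable and
// org.apache.hadoop.hive.serde2.io.DateWritable: both share a Writable-like
// interface but are otherwise unrelated types.
public class CastSketch {
  interface Writable {}

  static class IntWritable implements Writable {
    final int value;
    IntWritable(int value) { this.value = value; }
    int get() { return value; }
  }

  static class DateWritable implements Writable {
    final int daysSinceEpoch;
    DateWritable(int daysSinceEpoch) { this.daysSinceEpoch = daysSinceEpoch; }
  }

  // Mirrors the shape of WritableIntObjectInspector.getPrimitiveJavaObject:
  // it casts straight to IntWritable, so handing it a DateWritable throws
  // ClassCastException -- the same failure mode as in the CI stack trace.
  static Object getPrimitiveJavaObject(Writable o) {
    return ((IntWritable) o).get();
  }
}
```

This is why declaring the date column as int in the test schema routes a DateWritable through the int inspector and blows up at runtime rather than at schema-resolution time.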

@voonhous
Member Author

voonhous commented Jul 8, 2024

@danny0405 I checked the code and realised that the reason the tests were failing is that the DATE column is defined with an INT type in org.apache.hudi.common.testutils.HoodieTestDataGenerator#TRIP_HIVE_COLUMN_TYPES.

public static final String TRIP_HIVE_COLUMN_TYPES = "bigint,string,string,string,string,string,double,double,double,double,int,bigint,float,binary,int,bigint,decimal(10,6),"
+ "map<string,string>,struct<amount:double,currency:string>,array<struct<amount:double,currency:string>>,boolean";

In our productionised test environments, the HIVE_TYPES for dates are passed as DATE instead of INT types.

On top of that, if values were indeed passed as INT, we should be seeing the same error as described in #11576.
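As a toy illustration (hypothetical helper, not Hudi/Hive code) of why the declared column type matters: the object inspector is selected from the declared Hive type string, so a date column declared as "int" in TRIP_HIVE_COLUMN_TYPES gets an int inspector even though the runtime value is a date writable.

```java
import java.util.Map;

// Toy sketch: inspector choice is driven by the *declared* Hive column type,
// not by the runtime writable. Inspector names mirror the real Hive classes,
// but the mapping itself is a simplified illustration.
public class InspectorSketch {
  static String inspectorFor(String declaredHiveType) {
    Map<String, String> byType = Map.of(
        "int", "WritableIntObjectInspector",
        "date", "WritableDateObjectInspector",
        "bigint", "WritableLongObjectInspector");
    return byType.getOrDefault(declaredHiveType, "unknown");
  }
}
```

Under this mapping, declaring the column as "date" would have selected the date inspector and avoided the cast failure seen in the CI run.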

The return types are as follows:

Hive2

org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableDateObjectInspector#getPrimitiveJavaObject


Hive3

org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableDateObjectInspector#getPrimitiveJavaObject


@danny0405
Contributor


Great findings, maybe we can just keep the pattern match for INT in case there are some rare cases.

@hudi-bot
Collaborator

hudi-bot commented Jul 8, 2024

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@voonhous
Member Author

voonhous commented Jul 8, 2024

Sure, will close this then. I'll revisit this again when I have time.

@voonhous voonhous closed this Jul 8, 2024
@voonhous voonhous deleted the minor_remove_dead_code branch December 20, 2025 10:09
