[SUPPORT] Column misalignment occurs when reading the copy on write type of hudi table through Flink #2976

@AirToSupply

To Reproduce

Steps to reproduce the behavior:

1. Build from source on the [master] branch; the version is 0.9.0-SNAPSHOT.
2. Start a Flink 1.12.x streaming job that reads data from a Hudi table.
3. Observe the running job: an exception causes the Flink job to fail.

Expected behavior
The job should read the Hudi table without error. Instead, it fails with:
java.lang.IllegalArgumentException: Unexpected type: ...

Environment Description

  • Hudi version : 0.9.0-SNAPSHOT

  • Spark version : None

  • Hive version : None

  • Hadoop version : 2.9.2

  • Storage (HDFS/S3/GCS..) : HDFS

  • Running on Docker? (yes/no) : no

Additional context

The exception occurs when the partition column is not the last field in the sequence of fields written to the Hudi table.

For example, suppose the fields written to the Hudi table (including partition columns) are, in order: col1, col2, col3. If the partition column is col1, the exception is thrown. If the partition column is col3, the read works normally.
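The trigger condition above can be written as a small check. This is a minimal sketch for reasoning about which tables are affected; `PartitionOrderCheck` and `partitionColumnsAreLast` are hypothetical names and are not part of Hudi or Flink.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of Hudi or Flink.
class PartitionOrderCheck {

    /**
     * Returns true when the partition columns occupy the trailing positions
     * of the field list, which is the layout reported to work in this issue.
     */
    static boolean partitionColumnsAreLast(List<String> fields, List<String> partitionColumns) {
        int firstPartitionSlot = fields.size() - partitionColumns.size();
        if (firstPartitionSlot < 0) {
            return false; // more partition columns than fields: invalid input
        }
        List<String> tail = fields.subList(firstPartitionSlot, fields.size());
        // The tail must hold exactly the partition columns (in any order).
        return tail.containsAll(partitionColumns) && partitionColumns.containsAll(tail);
    }

    public static void main(String[] args) {
        List<String> fields = Arrays.asList("col1", "col2", "col3");
        // Partition column first: the layout that fails in this report.
        System.out.println(partitionColumnsAreLast(fields, Arrays.asList("col1"))); // false
        // Partition column last: the layout that works.
        System.out.println(partitionColumnsAreLast(fields, Arrays.asList("col3"))); // true
    }
}
```

A table for which this check returns false matches the failing layout described here; whether every such table actually fails is not confirmed by this report.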

The observed behavior is as follows.
First, in a real-time computing application, we register a Hudi table (type: COW) with a DDL statement in Flink SQL. The exception arises in two cases:
[case-1] When querying only some columns of the Hudi table (for example: select col1,col2 from table), the bug occurs intermittently and does not always cause an exception.
[case-2] When querying all fields of the Hudi table (for example: select * from table), the bug always occurs.


Stacktrace

The exception stack is as follows:

[Screenshot: exception stack trace, attachment BB0B7B65-BC82-40da-ABD9-6550956AAFDD]

The local debugging is as follows:

[Screenshot: local debugging session, attachment C10E0226-BBAD-4ef3-B3AE-161586449B35]

Initial diagnosis: when a Hudi table is read through Flink, org.apache.hudi.table.format.cow.ParquetSplitReaderUtil#genPartColumnarRowReader is called. In the ParquetColumnarRowSplitReader object that this method returns, the selectedTypes and selectedFieldNames arrays are misaligned.
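To make the idea of misaligned arrays concrete, here is a minimal sketch. It is not Hudi's code. It assumes, purely for illustration, that one array follows the table's field order while the other lists partition columns last; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: NOT Hudi's implementation. The reordering rule below is an
// assumption chosen to match the symptoms described in this issue.
class MisalignmentSketch {

    /** Returns the fields with partition columns moved to the end, other order preserved. */
    static List<String> partitionColumnsLast(List<String> fields, List<String> partitionColumns) {
        List<String> reordered = new ArrayList<>();
        for (String f : fields) {
            if (!partitionColumns.contains(f)) reordered.add(f);
        }
        for (String f : fields) {
            if (partitionColumns.contains(f)) reordered.add(f);
        }
        return reordered;
    }

    /** Returns the indexes at which two orderings of the same length disagree. */
    static List<Integer> misalignedPositions(List<String> a, List<String> b) {
        if (a.size() != b.size()) {
            throw new IllegalArgumentException("orderings must have the same length");
        }
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < a.size(); i++) {
            if (!a.get(i).equals(b.get(i))) positions.add(i);
        }
        return positions;
    }

    public static void main(String[] args) {
        List<String> tableOrder = Arrays.asList("col1", "col2", "col3");

        // Partition column col1 is first: the two orderings disagree at every index.
        List<String> byCol1 = partitionColumnsLast(tableOrder, Arrays.asList("col1"));
        System.out.println(byCol1 + " " + misalignedPositions(tableOrder, byCol1)); // [col2, col3, col1] [0, 1, 2]

        // Partition column col3 is already last: the orderings agree.
        List<String> byCol3 = partitionColumnsLast(tableOrder, Arrays.asList("col3"));
        System.out.println(byCol3 + " " + misalignedPositions(tableOrder, byCol3)); // [col1, col2, col3] []
    }
}
```

Under this assumption, a type looked up by position would belong to a different column than the name at the same position whenever the partition column is not last, which is consistent with the "Unexpected type" failure described in this report.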
