
[SUPPORT] HoodieRealtimeRecordReader can only work on RealtimeSplit and not with hdfs://111.parquet:0+4 #2813

Closed
MyLanPangzi opened this issue Apr 13, 2021 · 13 comments
Labels
hive Issues related to hive

@MyLanPangzi
Contributor

MyLanPangzi commented Apr 13, 2021

Describe the problem you faced

Flink writes to a MOR table, but the newest data cannot be read via Hive aggregate queries.

To Reproduce

Steps to reproduce the behavior:

1. Write to a MOR table with Flink.
2. Create a Hive external table using org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat:

CREATE EXTERNAL TABLE dwd_sale_sale_detail_rt
(
    `_hoodie_commit_time`    STRING,
    `_hoodie_commit_seqno`   STRING,
    `_hoodie_record_key`     STRING,
    `_hoodie_partition_path` STRING,
    `_hoodie_file_name`      STRING,
    shopid                   STRING,
    salevalue                DECIMAL(10, 2) -- precision assumed; DECIMAL(1,2) is invalid since scale cannot exceed precision
) partitioned by (`sdt` string)
    ROW FORMAT SERDE
        'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    STORED AS INPUTFORMAT
        'org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat'
        OUTPUTFORMAT
            'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION 'hdfs://nameservice1/user/xiebo/hudi/dwd/dwd_sale_sale_detail_rt';

3. Run an aggregate query from the Hive shell; the error below occurs.

Expected behavior

Hive queries the MOR table correctly and returns the aggregate result.

Environment Description

  • Hudi version : 0.9.0

  • Spark version :

  • Hive version : 1.1 cdh 5.6.12

  • Hadoop version : 2.6 cdh 5.6.12

  • Storage (HDFS/S3/GCS..) : hdfs

  • Running on Docker? (yes/no) : no


Stacktrace

2021-04-13 17:05:45,815 INFO [IPC Server handler 6 on 46363] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1594105654926_12624744_m_000012_0 is : 0.0
2021-04-13 17:05:45,818 FATAL [IPC Server handler 8 on 46363] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1594105654926_12624744_m_000012_0 - exited : java.io.IOException: java.lang.reflect.InvocationTargetException
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:267)
at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.(HadoopShimsSecure.java:213)
at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getRecordReader(HadoopShimsSecure.java:334)
at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:734)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.(MapTask.java:169)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:438)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:253)
... 11 more
Caused by: java.lang.IllegalArgumentException: HoodieRealtimeRecordReader can only work on RealtimeSplit and not with hdfs://nameservice1/user/hudi/dwd/dwd_sale_sale_detail_rt/20210413/ab5a8ff3-4647-46ae-ba13-7b6eb7914516_8-10-0_20210413170058.parquet:0+57883047
at org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:40)
at org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.getRecordReader(HoodieParquetRealtimeInputFormat.java:117)
at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.(CombineHiveRecordReader.java:68)

@bvaradar
Contributor

cc @n3nash

Can you try setting hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat and running the query again?
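In the Hive shell, that suggestion would look like the following (a sketch; the aggregate query itself is illustrative, reusing the table from the report):

```sql
-- Session-level setting, run before the query (value from the suggestion above)
SET hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;

-- Illustrative aggregate query against the table from the report
SELECT sdt, SUM(salevalue)
FROM dwd_sale_sale_detail_rt
GROUP BY sdt;
```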

@n3nash n3nash self-assigned this Apr 13, 2021
@n3nash n3nash added awaiting-community-help hive Issues related to hive labels Apr 13, 2021
@qianjiangbing

I tried setting hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat, but another error occurred:

Diagnostic Messages for this Task:
Error: java.lang.ClassCastException: org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch cannot be cast to org.apache.hadoop.io.ArrayWritable
at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.createValue(RealtimeCompactedRecordReader.java:150)
at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.createValue(RealtimeCompactedRecordReader.java:43)
at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.createValue(HoodieRealtimeRecordReader.java:89)
at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.createValue(HoodieRealtimeRecordReader.java:36)
at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.createValue(RealtimeCompactedRecordReader.java:150)
at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.createValue(RealtimeCompactedRecordReader.java:43)
at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.createValue(HoodieRealtimeRecordReader.java:89)
at org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.createValue(HoodieCombineRealtimeRecordReader.java:87)
at org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.createValue(HoodieCombineRealtimeRecordReader.java:41)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.createValue(MapTask.java:186)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapRunner.run(ExecMapRunner.java:37)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:465)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:349)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1731)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)

@n3nash
Contributor

n3nash commented Apr 14, 2021

@qianjiangbing @bvaradar There is a ticket created for this -> https://issues.apache.org/jira/browse/HUDI-1036. I will look into this later this week. This looks like a legitimate issue

@MyLanPangzi
Contributor Author

I've updated the issue description. The error occurs when I run an aggregate query on the MOR table using org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.

@qianjiangbing

I tested with Hive 2.3.8 and it works.

@n3nash
Contributor

n3nash commented Apr 26, 2021

@qianjiangbing Thanks for confirming.

@MyLanPangzi I just noticed that you were using Hive version : 1.1 cdh 5.6.12. This is a very old version of Hive. The latest Hudi builds only work with Hive 2.x+ versions. Are you able to migrate to a higher version of Hive ?

@MyLanPangzi
Contributor Author

> @qianjiangbing Thanks for confirming.
>
> @MyLanPangzi I just noticed that you were using Hive version : 1.1 cdh 5.6.12. This is a very old version of Hive. The latest Hudi builds only work with Hive 2.x+ versions. Are you able to migrate to a higher version of Hive ?

Sorry, I can't upgrade the cluster version, so the only option in my cluster is to use org.apache.hudi.hadoop.HoodieParquetInputFormat for MOR tables.
Can I close this issue if there is no fix planned?
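For reference, a read-optimized fallback DDL along those lines could look like this (a sketch: the `_ro` table name and DECIMAL precision are assumptions, the column list is abbreviated, and queries through this input format see only compacted base files, not the latest log data):

```sql
-- Same table redefined with the read-optimized input format
-- (dwd_sale_sale_detail_ro is an illustrative name)
CREATE EXTERNAL TABLE dwd_sale_sale_detail_ro
(
    `_hoodie_commit_time` STRING,
    shopid                STRING,
    salevalue             DECIMAL(10, 2)
) PARTITIONED BY (`sdt` STRING)
    ROW FORMAT SERDE
        'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    STORED AS INPUTFORMAT
        'org.apache.hudi.hadoop.HoodieParquetInputFormat'
    OUTPUTFORMAT
        'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION 'hdfs://nameservice1/user/xiebo/hudi/dwd/dwd_sale_sale_detail_rt';
```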

@nsivabalan
Contributor

CC @n3nash

@stayrascal
Contributor

@n3nash @nsivabalan May I check whether this feature (aggregate queries on the RT table) works in Hive 3.1.2? I hit the same class-cast exception, as shown below:

Caused by: java.lang.ClassCastException: org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch cannot be cast to org.apache.hadoop.io.ArrayWritable
	at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.createValue(RealtimeCompactedRecordReader.java:183)
	at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.createValue(RealtimeCompactedRecordReader.java:47)
	at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.createValue(HoodieRealtimeRecordReader.java:89)
	at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.createValue(HoodieRealtimeRecordReader.java:36)
	at org.apache.hadoop.hive.ql.io.HiveRecordReader.createValue(HiveRecordReader.java:58)
	at org.apache.hadoop.hive.ql.io.HiveRecordReader.createValue(HiveRecordReader.java:33)
	at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.createValue(TezGroupedSplitsInputFormat.java:160)
	at org.apache.tez.mapreduce.lib.MRReaderMapred.setupOldRecordReader(MRReaderMapred.java:168)
	at org.apache.tez.mapreduce.lib.MRReaderMapred.setSplit(MRReaderMapred.java:83)
	at org.apache.tez.mapreduce.input.MRInput.initFromEventInternal(MRInput.java:706)
	at org.apache.tez.mapreduce.input.MRInput.initFromEvent(MRInput.java:665)
	at org.apache.tez.mapreduce.input.MRInputLegacy.checkAndAwaitRecordReaderInitialization(MRInputLegacy.java:150)
	at org.apache.tez.mapreduce.input.MRInputLegacy.init(MRInputLegacy.java:114)
	at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.getMRInput(MapRecordProcessor.java:525)
	at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:171)
	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:266)

I'm not sure whether this feature has been verified on Hive 3; I'm using Hive 3.1.2 and Hudi 0.12.2.

@danny0405
Contributor

You may need to turn off vectorized execution in Hive.
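In Hive this is controlled by the vectorized-execution properties, e.g. at the session level (a sketch of the suggested workaround):

```sql
-- Disable vectorized execution so the realtime reader receives
-- ArrayWritable rows instead of VectorizedRowBatch
SET hive.vectorized.execution.enabled=false;
SET hive.vectorized.execution.reduce.enabled=false;
```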

@stayrascal
Contributor

thanks @danny0405, it works.

@stayrascal
Contributor

stayrascal commented Feb 24, 2023

@danny0405 One more question: after disabling vectorization, count(*) on the RT table works, but select * returns empty content for an RT table that has not been compacted yet (only log files, no base file). Is there any parameter I need to set?

0: jdbc:hive2://xxxxxx-1:10000/> select count(*) from flink_hudi_mor_tbl_rt;
+------+
| _c0  |
+------+
| 16   |
+------+
1 row selected (6.141 seconds)

0: jdbc:hive2://xxxxxx-1:10000/> select * from flink_hudi_mor_tbl_rt;
+--------------------------------------------+---------------------------------------------+-------------------------------------------+-----------------------------------------------+------------------------------------------+-----------------------------+-----------------------------+----------------------------+---------------------------+----------------------------------+
| flink_hudi_mor_tbl_rt._hoodie_commit_time  | flink_hudi_mor_tbl_rt._hoodie_commit_seqno  | flink_hudi_mor_tbl_rt._hoodie_record_key  | flink_hudi_mor_tbl_rt._hoodie_partition_path  | flink_hudi_mor_tbl_rt._hoodie_file_name  | flink_hudi_mor_tbl_rt.uuid  | flink_hudi_mor_tbl_rt.name  | flink_hudi_mor_tbl_rt.age  | flink_hudi_mor_tbl_rt.ts  | flink_hudi_mor_tbl_rt.partition  |
+--------------------------------------------+---------------------------------------------+-------------------------------------------+-----------------------------------------------+------------------------------------------+-----------------------------+-----------------------------+----------------------------+---------------------------+----------------------------------+
+--------------------------------------------+---------------------------------------------+-------------------------------------------+-----------------------------------------------+------------------------------------------+-----------------------------+-----------------------------+----------------------------+---------------------------+----------------------------------+
No rows selected (0.137 seconds)

0: jdbc:hive2://xxxx-1:10000/> set hive.input.format;
+----------------------------------------------------+
|                        set                         |
+----------------------------------------------------+
| hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat |
+----------------------------------------------------+
1 row selected (0.006 seconds)
0: jdbc:hive2://xxxx-1:10000/> set hive.tez.input.format;
+----------------------------------------------------+
|                        set                         |
+----------------------------------------------------+
| hive.tez.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat |
+----------------------------------------------------+
1 row selected (0.006 seconds)

It seems that select * from flink_hudi_mor_tbl_rt; didn't submit any job, but it works when querying an RT table that has had some compaction commits.

@danny0405
Contributor

An RT table backed by pure log files is not well supported for Hive queries; you may need to switch to the RO table instead.
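Assuming the sync also registered a read-optimized companion table (the `_ro` suffix is the usual Hudi naming convention; the exact table name here is an assumption), the query would be:

```sql
-- Query the read-optimized view, which reads only compacted base files
SELECT * FROM flink_hudi_mor_tbl_ro;
```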
