Skip to content

[SUPPORT] Global Indexes return old partition value when querying Hive tables #1958

@rmpifer

Description

@rmpifer

Describe the problem you faced

When using global indexes (GLOBAL_BLOOM, HBASE) and a records partition value has been updated during an upsert the record gets written to the previous partition folder. This causes issues when querying Hive _ro and _rt tables as it displays the new data but with the previous partition value. We can see this by looking at the result of the query for 'ad' partition column and the Hudi data stored

scala> spark.sql("select * from gb_partition_update_ro").show()
+-------------------+--------------------+------------------+----------------------+--------------------+--------+---------+----------------+-------------+-------------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|     wbn|    status|     scan_date|   ad_updated|           ad|
+-------------------+--------------------+------------------+----------------------+--------------------+--------+---------+----------------+-------------+-------------+
|     20200812204404|  20200812204404_0_1|          12345678|         2020-08-06-12|5ecf97ef-b9d9-4e3...|12345678|Delivered|1596716921000605|2020-08-06-13|2020-08-06-12|

scala> spark.read.format("parquet").load("s3://ryanpife-emr-dev/hudi/data/hudi_mor_global_bloom/1/2020-08-06-12/5ecf97ef-b9d9-4e3f-b586-0140f6ebdd3a-0_0-97-39036_20200812204404.parquet").show()
+-------------------+--------------------+------------------+----------------------+--------------------+--------+---------+----------------+-------------+-------------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|     wbn|    status|     scan_date|           ad|   ad_updated|
+-------------------+--------------------+------------------+----------------------+--------------------+--------+---------+----------------+-------------+-------------+
|     20200812204404|  20200812204404_0_1|          12345678|         2020-08-06-12|5ecf97ef-b9d9-4e3...|12345678|Delivered|1596716921000605|2020-08-06-13|2020-08-06-13|

This is probably due to the table definition in hive reading in partition value from folder location

hive> SHOW CREATE TABLE gb_partition_update_ro;
OK
CREATE EXTERNAL TABLE `gb_partition_update_ro`(
  `_hoodie_commit_time` string, 
  `_hoodie_commit_seqno` string, 
  `_hoodie_record_key` string, 
  `_hoodie_partition_path` string, 
  `_hoodie_file_name` string, 
  `wbn` int, 
  `status` string, 
  `scan_date` bigint, 
  `ad_updated` string)
PARTITIONED BY ( 
  `ad` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hudi.hadoop.HoodieParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://ryanpife-emr-dev/hudi/data/hudi_mor_global_bloom/1'
TBLPROPERTIES (
  'last_commit_time_sync'='20200812204404', 
  'transient_lastDdlTime'='1597264976')

This does not seem like correct behavior as the query results return new data with stale partition column value

To Reproduce

Steps to reproduce the behavior:

  1. Create table with global index
  2. Insert a record
  3. Upsert same record key with new partition value
  4. Query Hive table (_ro or _rt) for record key and previous partition value will be displayed

Expected behavior

A clear and concise description of what you expected to happen.

Environment Description

  • Hudi version : 0.5.2

  • EMR version : 5.30.1

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) : no

Metadata

Metadata

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions