[SUPPORT] Hudi datastore missing updates for many records #1384

Closed
utk-spartan opened this issue Mar 7, 2020 · 9 comments
@utk-spartan

Tips before filing an issue

  • Have you gone through our FAQs?

  • Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.

  • If you have triaged this as a bug, then file an issue directly.

Describe the problem you faced

Overview of the flow
MySQL (Maxwell) -> Kafka -> Spark preprocessing (sorting, dedup, etc.) -> Hudi upsert via the Spark datasource writer (with Hive sync)

Hudi tables in S3 are missing updates for some records.

To pinpoint where in the flow the issue occurs, we write the dataframe to S3 after each stage. We observed that all the updates are present in the dataframe on which the Hudi datasource writer is called, but only some of them are applied to the data in the Hudi table.
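
As a rough illustration, the per-key comparison we run between stages looks like the sketch below (the paths and the `id` / `updated_at` columns are placeholders for our actual key and ordering fields):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("hudi-update-check").getOrCreate()

// Dataframe persisted to S3 right before the Hudi datasource writer is called
// (path and column names are placeholders).
val input = spark.read.parquet("s3://bucket/debug/pre_hudi_batch/")

// Snapshot read of the Hudi COW table after the upsert (0.5.x glob-path style).
val hudi = spark.read.format("org.apache.hudi").load("s3://bucket/hudi/table/*/*")

// Any record whose ordering field in the table is older than in the input
// batch is a missed update.
val missed = input.as("i")
  .join(hudi.as("h"), col("i.id") === col("h.id"))
  .where(col("h.updated_at") < col("i.updated_at"))

println(s"missed updates: ${missed.count()}")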

We were initially on 0.4.7, upgraded to Hudi 0.5.1, and recreated the entire Hudi table, but the issue still persists.

The record count matches exactly, but we are not sure whether inserts are also being dropped: since everything is treated as an upsert, any one of the captured update events for a record will create its entry. We are currently analyzing our data for this scenario.

The records with inconsistent updates don't seem to correlate with any pattern, table size, or batch size.
Upon replaying a batch, some of the missed updates get applied, i.e. only some arbitrary percentage of updates is applied each time the batch is processed.

We will keep digging into the Hudi code and try to find a way to replicate this in a non-S3 environment.

To Reproduce

Steps to reproduce the behavior:

We are currently not able to reproduce this behaviour reliably in our dev environment; we will update here when we can.

Expected behavior

Both updates and inserts should be 100% consistent with source db.

Environment Description

  • Hudi version : 0.5.1

  • Spark version : 2.4.0

  • Hive version : 2.3.0

  • Hadoop version : 2.6.5

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) : no

Additional context

Logs generated by Hudi and the AWS SDK for S3 contain no WARN or ERROR level statements, and nothing out of the ordinary in the INFO level logs.

Config params for the datasource writer:

DataSourceWriteOptions.OPERATION_OPT_KEY, "upsert"
"hoodie.bulkinsert.shuffle.parallelism", "100"
"hoodie.upsert.shuffle.parallelism", "100"
"hoodie.insert.shuffle.parallelism", "100"
HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT_BYTES, 256 * 1024 * 1024
HoodieStorageConfig.PARQUET_BLOCK_SIZE_BYTES, 64 * 1024 * 1024
HoodieCompactionConfig.CLEANER_COMMITS_RETAINED_PROP, 2
HIVE_SYNC_ENABLED_OPT_KEY, true
PARQUET_COMPRESSION_CODEC, "uncompressed"
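
For context, a sketch of how these options are passed to the writer (the record key, precombine and partition fields, table name, and target path below are placeholders, not our real values):

import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieCompactionConfig
import org.apache.spark.sql.SaveMode

// df is the preprocessed dataframe from the stage above.
df.write
  .format("org.apache.hudi")
  .option(OPERATION_OPT_KEY, "upsert")
  .option(RECORDKEY_FIELD_OPT_KEY, "id")            // placeholder key field
  .option(PRECOMBINE_FIELD_OPT_KEY, "updated_at")   // placeholder ordering field
  .option(PARTITIONPATH_FIELD_OPT_KEY, "dt")        // placeholder partition field
  .option(HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option("hoodie.table.name", "my_table")          // placeholder table name
  .option("hoodie.upsert.shuffle.parallelism", "100")
  .option(HoodieCompactionConfig.CLEANER_COMMITS_RETAINED_PROP, "2")
  .mode(SaveMode.Append)
  .save("s3://bucket/hudi/table")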

Stacktrace

None; as noted above, no WARN/ERROR statements are logged.

@vinothchandar
Member

Does sound weird, and the fact that you can't repro in a dev environment suggests this may be a data issue? Can you see any errors reported in the commit metadata?

@vinothchandar
Member

Hmmm, the datasource does fail the commit if there are such errors:

} else {
  log.error(s"$operation failed with $errorCount errors :")
  if (log.isTraceEnabled) {
    log.trace("Printing out the top 100 errors")
    writeStatuses.rdd.filter(ws => ws.hasErrors)
      .take(100)
      .foreach(ws => {
        log.trace("Global error :", ws.getGlobalError)
        if (ws.getErrors.size() > 0) {
          ws.getErrors.foreach(kt =>
            log.trace(s"Error for key: ${kt._1}", kt._2))
        }
      })
  }
  false
}

In any case, having some information on the workload, MOR vs COW, and the % of missing records would help us debug more. Did you also have the issue on 0.4.7, or only after you upgraded to 0.5.1?

@bvaradar
Contributor

bvaradar commented Mar 9, 2020

@utk-spartan : Please provide more details for us to help here.

@jainnidhi703

jainnidhi703 commented Mar 11, 2020

The issue was already present on 0.4.7, so we thought it might be due to #418 and upgraded to Hudi 0.5.1, but the issue was still not resolved.

@bvaradar
Contributor

@jainnidhi703 : Is this a MOR or COW table? Can you give us some idea of the % of missing records? Also, can you inspect the Spark logs to see if there are any other failures?

@utk-spartan
Author

This is for COW tables. Upon analyzing the data, missing record updates were below 0.01% for older data, but have recently increased to around 20-30%.

We can't find any failures in the Spark logs.

@utk-spartan
Author

utk-spartan commented Mar 11, 2020

Could this be related to https://issues.apache.org/jira/browse/HUDI-409? We recently encountered Parquet corruption errors (magic number mismatch) while reading a fresh Hudi table from Presto, and there were no errors/warnings reported by Spark or in the Hudi commit metadata files.

@bvaradar
Contributor

@utk-spartan : HUDI-409 is for MOR tables and unrelated to your scenario. Hmmm, COW is one of the most battle-tested parts of Hudi :) so this is very surprising. Just to be clear: you are using Hudi 0.5.x and started with a clean dataset, right? And I am assuming these are valid updates, not deletes, right? One way to debug is to use the CLI to print out per-file commit stats for each commit and check whether you see a drop in numWrites (and whether the other stats, numUpdateWrites, numDeletes, ..., are sane). You may have to write a custom script to suit your needs or work with the existing CLI commands.
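
For instance, an untested sketch of such a script that reads the commit metadata JSON directly (the base path is a placeholder, and the stat field names follow HoodieWriteStat's JSON serialization):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import com.fasterxml.jackson.databind.ObjectMapper
import scala.collection.JavaConverters._

// Placeholder base path; .commit files under .hoodie are plain JSON, so we
// parse them generically instead of depending on Hudi's internal classes.
val basePath = new Path("s3://bucket/hudi/table/.hoodie")
val fs = FileSystem.get(basePath.toUri, new Configuration())
val mapper = new ObjectMapper()

fs.listStatus(basePath)
  .filter(_.getPath.getName.endsWith(".commit"))
  .sortBy(_.getPath.getName)
  .foreach { status =>
    val in = fs.open(status.getPath)
    val root = try mapper.readTree(in) finally in.close()
    // partitionToWriteStats maps partition -> list of per-file write stats.
    val stats = root.path("partitionToWriteStats").elements().asScala
      .flatMap(_.elements().asScala).toList
    def total(field: String) = stats.map(_.path(field).asLong()).sum
    println(s"${status.getPath.getName}: " +
      s"numWrites=${total("numWrites")} numUpdateWrites=${total("numUpdateWrites")} " +
      s"numDeletes=${total("numDeletes")} totalWriteErrors=${total("totalWriteErrors")}")
  }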

@vinothchandar
Member

Closing due to inactivity
