[SUPPORT] Upserting 100,000 records into a table of 100 million records hangs #2676

**Environment Description**

* Hudi version : 0.7.0/0.6.0

* Spark version : 2.4.4 

* Hive version : 2.3.1

* Hadoop version : 2.7.5

* Storage (HDFS/S3/GCS..) : HDFS

* Running on Docker? (yes/no) : no

When I upsert 100,000 records into a table that holds 100 million records, the job hangs and makes no further progress. The table type is MOR (Merge On Read). The job's execution diagrams are shown below:
![image](https://user-images.githubusercontent.com/34565079/111167633-48772800-85dc-11eb-9072-1f4f7a3a2c54.png)
![image](https://user-images.githubusercontent.com/34565079/111169120-a48e7c00-85dd-11eb-8d1c-979eaffce76f.png)

The Hudi parameters are as follows:

      TABLE_TYPE_OPT_KEY -> MOR_TABLE_TYPE_OPT_VAL,
      // OPERATION_OPT_KEY -> WriteOperationType.UPSERT.value,
      OPERATION_OPT_KEY -> "upsert",
      RECORDKEY_FIELD_OPT_KEY -> pkCol,  
      PRECOMBINE_FIELD_OPT_KEY -> preCombineCol,  
      "hoodie.embed.timeline.server" -> "false",
      "hoodie.cleaner.commits.retained" -> "1",
      "hoodie.cleaner.fileversions.retained" -> "1",
      "hoodie.cleaner.policy" -> HoodieCleaningPolicy.KEEP_LATEST_FILE_VERSIONS.name(),
      "hoodie.keep.min.commits" -> "3",
      "hoodie.keep.max.commits" -> "4",
      "hoodie.compact.inline" -> "true",
      "hoodie.compact.inline.max.delta.commits" -> "1",
      //      "hoodie.copyonwrite.record.size.estimate" -> String.valueOf(500),
      PARTITIONPATH_FIELD_OPT_KEY -> "dt", 
      HIVE_PARTITION_FIELDS_OPT_KEY -> "dt",
      HIVE_URL_OPT_KEY -> "jdbc:hive2:/0.0.0.0:10000",
      HIVE_USER_OPT_KEY -> "",
      HIVE_PASS_OPT_KEY -> "",
      HIVE_DATABASE_OPT_KEY -> hiveDatabaseName,
      HIVE_TABLE_OPT_KEY -> hiveTableName,
      HIVE_SYNC_ENABLED_OPT_KEY -> "true",
      HoodieIndexConfig.BLOOM_INDEX_UPDATE_PARTITION_PATH -> "true",
      HoodieWriteConfig.TABLE_NAME -> hiveTableName,  
      HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> classOf[MultiPartKeysValueExtractor].getName,
      HoodieIndexConfig.INDEX_TYPE_PROP -> HoodieIndex.IndexType.GLOBAL_BLOOM.name(),
      "hoodie.insert.shuffle.parallelism" -> parallelism,
      "hoodie.upsert.shuffle.parallelism" -> parallelism

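For anyone trying to reproduce this, the options above can be sketched as a plain `Map[String, String]` with no Hudi classes on the classpath. The string keys are what I believe the 0.6.0/0.7.0 constants resolve to, so treat them as assumptions and check them against your Hudi version. The object name `HudiUpsertOptions`, the `build` helper, and the sample arguments in `main` are hypothetical and not part of the original job.

```scala
// Sketch only: string keys are assumed equivalents of the Hudi constants used in the issue.
object HudiUpsertOptions {

  def build(
      pkCol: String,
      preCombineCol: String,
      hiveDatabaseName: String,
      hiveTableName: String,
      parallelism: String
  ): Map[String, String] = Map(
    // Table and write operation
    "hoodie.datasource.write.table.type" -> "MERGE_ON_READ",
    "hoodie.datasource.write.operation" -> "upsert",
    "hoodie.datasource.write.recordkey.field" -> pkCol,
    "hoodie.datasource.write.precombine.field" -> preCombineCol,
    "hoodie.datasource.write.partitionpath.field" -> "dt",
    "hoodie.table.name" -> hiveTableName,
    // Timeline server, cleaning and archival
    "hoodie.embed.timeline.server" -> "false",
    "hoodie.cleaner.commits.retained" -> "1",
    "hoodie.cleaner.fileversions.retained" -> "1",
    "hoodie.cleaner.policy" -> "KEEP_LATEST_FILE_VERSIONS",
    "hoodie.keep.min.commits" -> "3",
    "hoodie.keep.max.commits" -> "4",
    // Inline compaction is requested after every delta commit
    "hoodie.compact.inline" -> "true",
    "hoodie.compact.inline.max.delta.commits" -> "1",
    // Global index: record keys are looked up across all partitions
    "hoodie.index.type" -> "GLOBAL_BLOOM",
    "hoodie.bloom.index.update.partition.path" -> "true",
    // Hive sync (JDBC URL copied exactly as posted in the issue)
    "hoodie.datasource.hive_sync.enable" -> "true",
    "hoodie.datasource.hive_sync.jdbcurl" -> "jdbc:hive2:/0.0.0.0:10000",
    "hoodie.datasource.hive_sync.username" -> "",
    "hoodie.datasource.hive_sync.password" -> "",
    "hoodie.datasource.hive_sync.database" -> hiveDatabaseName,
    "hoodie.datasource.hive_sync.table" -> hiveTableName,
    "hoodie.datasource.hive_sync.partition_fields" -> "dt",
    "hoodie.datasource.hive_sync.partition_extractor_class" ->
      "org.apache.hudi.hive.MultiPartKeysValueExtractor",
    // Shuffle parallelism
    "hoodie.insert.shuffle.parallelism" -> parallelism,
    "hoodie.upsert.shuffle.parallelism" -> parallelism
  )

  def main(args: Array[String]): Unit = {
    // Hypothetical sample values, used only to print the resolved options.
    val opts = build("id", "ts", "default", "my_table", "200")
    opts.toSeq.sortBy(_._1).foreach { case (k, v) => println(s"$k=$v") }
  }
}
```

A map built this way would typically be passed to the Spark writer with `df.write.format("hudi").options(opts).mode(SaveMode.Append).save(basePath)`, where `df` and `basePath` stand in for the update DataFrame and the table location, neither of which is shown in this report.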

