
[SUPPORT] HDFSParquetImport performance issues #2303

@lucprosa

Description


Hi all,
We're trying to use the HDFSParquetImporter to import data from Parquet (source) into a Hudi table (target) on S3, but we're facing performance problems.

Hudi version 0.5.3 (EMR 5.30.1).

Here are some important details about our data:

  • Parquet size on S3: 536.8 GiB
  • Parquet file count: 8,785
  • Total rows: more than 8 billion
  • Not partitioned

About our partition rules on Hudi:

  • We're using multi-level partitions: organization/year/month/day
  • We have thousands of organizations
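One thing worth quantifying about a layout like this is the total partition cardinality: every distinct (organization, year, month, day) combination becomes its own directory with at least one file, and very high cardinality means many small files and many per-partition writes. A back-of-envelope estimate (the numbers below are hypothetical, not from this issue):

```python
# Back-of-envelope partition count for an organization/year/month/day layout.
# These cardinalities are illustrative placeholders; plug in the real ones.
organizations = 2_000     # "thousands of organizations"
days_of_history = 365     # one year of daily data, for illustration

# Upper bound: one directory (and at least one file) per combination.
partitions = organizations * days_of_history
print(partitions)         # 730000
```

Even with these modest assumptions the layout produces hundreds of thousands of partitions, which is useful context for the write performance discussed below.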

Our data schema in AVRO format:
{
  "type": "record",
  "name": "UsageFact",
  "doc": "Usage Fact",
  "fields": [
    { "name": "sk_usage_id", "type": "string" },
    { "name": "sk_comm_capability_id", "type": "string" },
    { "name": "time", "type": "string" },
    { "name": "mt_load_time", "type": "string" },
    { "name": "direction", "type": "string" },
    { "name": "channel", "type": "string" },
    { "name": "provider", "type": "string" },
    { "name": "metric", "type": "string" },
    { "name": "sk_comm_capability_name", "type": "string" },
    { "name": "sk_operation_id", "type": "string" },
    { "name": "sk_operation_name", "type": "string" },
    { "name": "country", "type": "string" },
    { "name": "subcategory", "type": "string" },
    { "name": "category", "type": "string" },
    { "name": "quantity", "type": "int" },
    { "name": "partition_path", "type": "string" }
  ]
}

The "partition_path" column holds the precomputed partition value for each row (for example: "organization=ABC/year=2020/month=01/day=01").
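For clarity, a minimal sketch of how such a partition_path value can be derived per record (the helper name and signature are hypothetical, purely to illustrate the string format):

```python
from datetime import date

def partition_path(organization: str, d: date) -> str:
    """Hypothetical helper: build a Hudi-style multi-level partition value."""
    return (f"organization={organization}"
            f"/year={d.year:04d}/month={d.month:02d}/day={d.day:02d}")

print(partition_path("ABC", date(2020, 1, 1)))
# organization=ABC/year=2020/month=01/day=01
```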

So we're executing the hdfsparquetimport command:

hdfsparquetimport --upsert false --srcPath "[PARQUET_SOURCE_PATH]" --targetPath "[HUDI_TARGET_PATH]" --tableName [TABLE_NAME] --tableType COPY_ON_WRITE --rowKeyField [ROW_IDENTIFIER] --partitionPathField "partition_path" --parallelism 5000 --schemaFilePath "[AVRO SCHEMA]" --format parquet --sparkMemory 20g --retry 3
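Since the importer ultimately launches a Spark application on YARN, one avenue worth checking (this is an assumption about how the importer is wired, not something stated in the Hudi docs quoted here) is whether standard Spark resource properties can be passed through, e.g. via spark-defaults.conf or EMR cluster configuration. The --sparkMemory flag above appears to control memory only, not executor count or cores. A sketch of the relevant properties; the values are illustrative placeholders, not recommendations:

```properties
# Illustrative Spark-on-YARN resource settings (spark-defaults.conf style).
# Size these to your actual EMR cluster; values below are placeholders.
spark.executor.instances        50
spark.executor.cores            4
spark.executor.memory           20g
spark.executor.memoryOverhead   2g

# Alternatively, let Spark scale executor count itself:
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled   true
```

On EMR, the maximizeResourceAllocation classification is another way to let the cluster size executors automatically.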

The problem:

In our first test, the importer took 33 minutes to import 20 million rows, so we're concerned about using this importer to ingest all 8 billion rows.
We tried changing some performance arguments (sparkMemory and parallelism) but saw no improvement. The Spark job created by the importer only uses about 30% of the cluster resources, and we just can't make it use more.
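To make the concern concrete, a linear extrapolation of the observed rate to the full dataset (a rough estimate only; real runs rarely scale perfectly linearly):

```python
# Linear extrapolation of observed importer throughput to the full dataset.
minutes_observed = 33
rows_observed = 20_000_000
total_rows = 8_000_000_000   # "more than 8 billion", so this is a lower bound

projected_minutes = minutes_observed * total_rows / rows_observed
print(projected_minutes / 60)   # 220.0 hours, i.e. roughly nine days
```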

We already tried writing to Hudi with Spark and BULK INSERT, but the performance was worse than using the importer (one million rows took more than one hour).

So our questions are:

  • How can we tune this importer? How can we allocate more resources to this job on YARN?
  • Can we run multiple parallel importers on the same Hudi table using bulk mode?
  • Is there another good alternative for importing a large amount of data into Hudi? Which option is best in terms of performance?
