
[SUPPORT] HDFSParquetImport performance issues #2303

@lucprosa

Description


Hi all,
We're trying to use the HDFSParquetImporter to import data from Parquet (source) into a Hudi table (target) on S3, but we're facing performance problems.

Hudi version 0.5.3 (EMR 5.30.1).

Here are some important details about our data:

  • Parquet size on S3: 536.8 GiB
  • Parquet file count: 8,785
  • Total rows: more than 8 billion
  • Not partitioned

About our partition rules on Hudi:

  • We're using multi-level partitions: organization/year/month/day
  • We have thousands of organizations
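One thing worth quantifying about a layout like this is the total partition cardinality: every distinct (organization, year, month, day) combination becomes its own directory with at least one file, and very high cardinality means many small files and many per-partition writes. A back-of-envelope estimate (the numbers below are hypothetical, not from this issue):

```python
# Back-of-envelope partition count for an organization/year/month/day layout.
# These cardinalities are illustrative placeholders; plug in the real ones.
organizations = 2_000     # "thousands of organizations"
days_of_history = 365     # one year of daily data, for illustration

# Upper bound: one directory (and at least one file) per combination.
partitions = organizations * days_of_history
print(partitions)         # 730000
```

Even with these modest assumptions the layout produces hundreds of thousands of partitions, which is useful context for the write performance discussed below.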

Our data schema in AVRO format:
{
  "type": "record",
  "name": "UsageFact",
  "doc": "Usage Fact",
  "fields": [
    { "name": "sk_usage_id", "type": "string" },
    { "name": "sk_comm_capability_id", "type": "string" },
    { "name": "time", "type": "string" },
    { "name": "mt_load_time", "type": "string" },
    { "name": "direction", "type": "string" },
    { "name": "channel", "type": "string" },
    { "name": "provider", "type": "string" },
    { "name": "metric", "type": "string" },
    { "name": "sk_comm_capability_name", "type": "string" },
    { "name": "sk_operation_id", "type": "string" },
    { "name": "sk_operation_name", "type": "string" },
    { "name": "country", "type": "string" },
    { "name": "subcategory", "type": "string" },
    { "name": "category", "type": "string" },
    { "name": "quantity", "type": "int" },
    { "name": "partition_path", "type": "string" }
  ]
}

The "partition_path" column holds the precomputed partition value for each row (for example: "organization=ABC/year=2020/month=01/day=01").
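For clarity, a minimal sketch of how such a partition_path value can be derived per record (the helper name and signature are hypothetical, purely to illustrate the string format):

```python
from datetime import date

def partition_path(organization: str, d: date) -> str:
    """Hypothetical helper: build a Hudi-style multi-level partition value."""
    return (f"organization={organization}"
            f"/year={d.year:04d}/month={d.month:02d}/day={d.day:02d}")

print(partition_path("ABC", date(2020, 1, 1)))
# organization=ABC/year=2020/month=01/day=01
```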

So we're executing the hdfsparquetimport command:

hdfsparquetimport --upsert false --srcPath "[PARQUET_SOURCE_PATH]" --targetPath "[HUDI_TARGET_PATH]" --tableName [TABLE_NAME] --tableType COPY_ON_WRITE --rowKeyField [ROW_IDENTIFIER] --partitionPathField "partition_path" --parallelism 5000 --schemaFilePath "[AVRO SCHEMA]" --format parquet --sparkMemory 20g --retry 3
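Since the importer ultimately launches a Spark application on YARN, one avenue worth checking (this is an assumption about how the importer is wired, not something stated in the Hudi docs quoted here) is whether standard Spark resource properties can be passed through, e.g. via spark-defaults.conf or EMR cluster configuration. The --sparkMemory flag above appears to control memory only, not executor count or cores. A sketch of the relevant properties; the values are illustrative placeholders, not recommendations:

```properties
# Illustrative Spark-on-YARN resource settings (spark-defaults.conf style).
# Size these to your actual EMR cluster; values below are placeholders.
spark.executor.instances        50
spark.executor.cores            4
spark.executor.memory           20g
spark.executor.memoryOverhead   2g

# Alternatively, let Spark scale executor count itself:
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled   true
```

On EMR, the maximizeResourceAllocation classification is another way to let the cluster size executors automatically.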

The problem:

In our first test, the importer took 33 minutes to import 20 million rows, so we're concerned about using this importer to ingest all 8 billion rows.
We tried changing some performance arguments (sparkMemory and parallelism) but saw no improvement. The Spark job created by the importer only uses about 30% of the cluster resources, and we just can't make it use more.
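To make the concern concrete, a linear extrapolation of the observed rate to the full dataset (a rough estimate only; real runs rarely scale perfectly linearly):

```python
# Linear extrapolation of observed importer throughput to the full dataset.
minutes_observed = 33
rows_observed = 20_000_000
total_rows = 8_000_000_000   # "more than 8 billion", so this is a lower bound

projected_minutes = minutes_observed * total_rows / rows_observed
print(projected_minutes / 60)   # 220.0 hours, i.e. roughly nine days
```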

We already tried writing to Hudi with Spark and BULK INSERT, but the performance was worse than using the importer (one million rows took more than one hour).

So our questions are:

  • How can we tune this importer? How can we allocate more resources to this job on YARN?
  • Can we run multiple parallel importers on the same Hudi table using bulk mode?
  • Is there another good alternative for importing a large amount of data into Hudi? Which option is best in terms of performance?
