
[SUPPORT] Hudi Write Performance #2484

Closed
rubenssoto opened this issue Jan 24, 2021 · 6 comments

@rubenssoto commented Jan 24, 2021

Hello,

I want to start using Hudi on my data lake, so I'm running some performance tests comparing current processing times with and without Hudi. We have a lot of tables in our data lake, so we process them in groups within the same Spark context using different threads.
I ran a test that reprocesses all source tables: with regular Parquet it took 15 minutes, while with Hudi bulk insert it took 29 minutes. Hudi performs some operations that plain Parquet doesn't, such as sorting, but the biggest difference was in the Parquet write stage itself. Is there any difference between writing Parquet through Hudi and writing regular Parquet? I used the gzip codec in both.

For Hudi I set the bulk insert parallelism to 20, and for regular Parquet I did a coalesce(20).
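
A minimal sketch of what the plain-Parquet baseline write looks like under these assumptions (sourceDf and the output path are illustrative, not the actual production code):

    import org.apache.spark.sql.SaveMode

    // Hypothetical baseline: coalesce to 20 output files and write gzip-compressed Parquet.
    sourceDf
      .coalesce(20)
      .write
      .mode(SaveMode.Overwrite)
      .option("compression", "gzip")
      .parquet("s3://dl/tables/example_table") // illustrative path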

Hudi Version: 0.8.0-SNAPSHOT
Spark Version: 3.0.1
11 Executors with 5 cores each and 35g of memory

spark-submit command:
spark-submit --deploy-mode cluster --conf spark.executor.cores=5 --conf spark.executor.memoryOverhead=3000 --conf spark.yarn.maxAppAttempts=1 --conf spark.executor.memory=35g --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --packages org.apache.spark:spark-avro_2.12:2.4.4 --jars s3://dl/lib/spark-daria_2.12-0.38.2.jar,s3://dl/lib/hudi-spark-bundle_2.12-0.8.0-SNAPSHOT.jar --class TableProcessorWrapper s3://dl/code/projects/data_projects/batch_processor_engine/batch-processor-engine_2.12-3.0.1_0.5.jar courier_api_group01

val hudiOptions = Map[String, String](
      "hoodie.table.name"                        -> tableName,
      "hoodie.datasource.write.operation"        -> "bulk_insert",
      "hoodie.bulkinsert.shuffle.parallelism"    -> "20",
      "hoodie.parquet.small.file.limit"          -> "536870912",
      "hoodie.parquet.max.file.size"             -> "1073741824",
      "hoodie.parquet.block.size"                -> "536870912",
      "hoodie.copyonwrite.record.size.estimate"  -> "1024",
      "hoodie.datasource.write.precombine.field" -> deduplicationColumn,
      "hoodie.datasource.write.recordkey.field"  -> primaryKey.mkString(","),
      "hoodie.datasource.write.keygenerator.class" -> (if (primaryKey.size == 1) {
                                                         "org.apache.hudi.keygen.SimpleKeyGenerator"
                                                       } else { "org.apache.hudi.keygen.ComplexKeyGenerator" }),
      "hoodie.datasource.write.partitionpath.field"           -> partitionColumn,
      "hoodie.datasource.write.hive_style_partitioning"       -> "true",
      "hoodie.datasource.write.table.name"                    -> tableName,
      "hoodie.datasource.hive_sync.table"                     -> tableName,
      "hoodie.datasource.hive_sync.database"                  -> databaseName,
      "hoodie.datasource.hive_sync.enable"                    -> "true",
      "hoodie.datasource.hive_sync.partition_fields"          -> partitionColumn,
      "hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor",
      "hoodie.datasource.hive_sync.jdbcurl"                   -> "jdbc:hive2://ip-10-0-19-157.us-west-2.compute.internal:10000"    )

Regular Parquet:
[Screenshot: Spark UI stages for the regular Parquet write]

Hudi has an RDD conversion part:
[Screenshot: Spark UI stage for the RDD conversion]

The Hudi write took double the time:
[Screenshots: Spark UI stages for the Hudi write]

This was one real-world job that I tried, but I notice this slow write on every job where I use Hudi.

Is this normal? Is there any way to tune it? Am I doing something wrong?

Thank you so much!!!!!

@rubenssoto
Author

Hello,

I enabled the option hoodie.datasource.write.row.writer.enable and the job took only 21 minutes, about 30% faster, great!
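
For reference, a hedged sketch of how that flag could be added to the options map shown above (the config key is the one named in this comment; the rest is illustrative):

    // Enable the bulk-insert row writer path, avoiding the df.rdd conversion discussed below.
    val rowWriterOptions = hudiOptions + ("hoodie.datasource.write.row.writer.enable" -> "true")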

@vinothchandar
Member

@rubenssoto yes, the row writer is the difference; the df.rdd conversion in Spark takes that hit. I recommend sorting the data on the initial write, since it gives you a lot of returns for query performance.

@rubenssoto
Author

Do you mean an orderBy before df.write.format("hudi").save()?

@vinothchandar
Member

No, I mean the sorting Hudi does internally, which you mentioned before. It isn't even configurable for row writing, so all good. That should explain the extra time (21 vs. 15 minutes).

    // (Hudi internals) Sort rows by partition path, then record key, and coalesce
    // down to the configured bulk-insert parallelism before writing.
    return colOrderedDataset
        .sort(functions.col(HoodieRecord.PARTITION_PATH_METADATA_FIELD), functions.col(HoodieRecord.RECORD_KEY_METADATA_FIELD))
        .coalesce(config.getBulkInsertShuffleParallelism());

@rubenssoto
Author

Great, thank you for the explanation, it makes sense.

If I understand this code correctly, Hudi will order by partition path and record key, so if I have an unpartitioned table the data in my files should be ordered by record key (primary key), is that right?

I checked one of my tables, and these are the min and max of the primary key in each file:
[Screenshot: min/max primary key per file]

Using that code as a reference, I imagined the files like:

file01: keys 1–1000
file02: keys 1001–2000
file03: keys 2001–3000
... and so on
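
A hedged way to verify that layout (a sketch, assuming the table was written by Hudi so every record carries the _hoodie_record_key metadata column; the base path is illustrative):

    import org.apache.spark.sql.functions._

    // Group records by the physical file they came from and show the record-key range per file.
    spark.read.format("hudi").load("s3://dl/tables/example_table")
      .withColumn("file", input_file_name())
      .groupBy("file")
      .agg(min("_hoodie_record_key").as("min_key"), max("_hoodie_record_key").as("max_key"))
      .orderBy("min_key")
      .show(false)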

@vinothchandar
Member

Yes, that's correct. Notice that they are lexicographically sorted.

This is a trick we used at Uber even before Hudi. It lays the data out sorted from the start, so range pruning is faster, and when dealing with partitions of unequal size, sorting by partition path ensures we write the smallest total number of files. Otherwise, if you hash partition 1000 ways across 1000 partition paths, you'll end up with 1M files; with this approach you end up with at most about 2000 files, which is a huge benefit. From there on, when doing upserts/inserts, Hudi will maintain the file sizes.
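
A back-of-the-envelope sketch of that file-count argument (the numbers are the illustrative ones from the comment above, not measured values):

    val partitionPaths = 1000      // distinct partition paths in the table
    val writeTasks     = 1000      // shuffle/write parallelism

    // Hash partitioning: in the worst case every task holds rows for every partition path,
    // so each (task, partition path) pair produces its own file.
    val hashPartitionedFiles = writeTasks * partitionPaths       // 1,000,000

    // Sort by partition path first: each task writes one contiguous range, so new files
    // only start at partition-path boundaries or task boundaries.
    val sortedUpperBound = partitionPaths + (writeTasks - 1)     // ~2,000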
