
[SUPPORT] Hudi Write Performance #2484

Closed
rubenssoto opened this issue Jan 24, 2021 · 6 comments

@rubenssoto commented Jan 24, 2021

Hello,

I want to start using Hudi on my data lake, so I'm running some performance tests comparing current processing times with and without Hudi. We have a lot of tables in our data lake, so we process them in groups within the same Spark context using different threads.
I ran a test that reprocesses all source tables: with regular Parquet it took 15 minutes, while with Hudi bulk insert it took 29 minutes. Hudi performs some operations that plain Parquet doesn't, such as sorting, but the biggest difference was in the Parquet write stage itself. Is there any difference between writing Parquet through Hudi and writing regular Parquet? I used the gzip codec in both.

For Hudi I set the bulk insert parallelism to 20, and for regular Parquet I did a coalesce(20).
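
A minimal sketch of what the plain-Parquet baseline write looks like under these assumptions (sourceDf and the output path are illustrative, not the actual production code):

    import org.apache.spark.sql.SaveMode

    // Hypothetical baseline: coalesce to 20 output files and write gzip-compressed Parquet.
    sourceDf
      .coalesce(20)
      .write
      .mode(SaveMode.Overwrite)
      .option("compression", "gzip")
      .parquet("s3://dl/tables/example_table") // illustrative path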

Hudi Version: 0.8.0-SNAPSHOT
Spark Version: 3.0.1
11 Executors with 5 cores each and 35g of memory

spark-submit command:
spark-submit --deploy-mode cluster --conf spark.executor.cores=5 --conf spark.executor.memoryOverhead=3000 --conf spark.yarn.maxAppAttempts=1 --conf spark.executor.memory=35g --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --packages org.apache.spark:spark-avro_2.12:2.4.4 --jars s3://dl/lib/spark-daria_2.12-0.38.2.jar,s3://dl/lib/hudi-spark-bundle_2.12-0.8.0-SNAPSHOT.jar --class TableProcessorWrapper s3://dl/code/projects/data_projects/batch_processor_engine/batch-processor-engine_2.12-3.0.1_0.5.jar courier_api_group01

val hudiOptions = Map[String, String](
      "hoodie.table.name"                        -> tableName,
      "hoodie.datasource.write.operation"        -> "bulk_insert",
      "hoodie.bulkinsert.shuffle.parallelism"    -> "20",
      "hoodie.parquet.small.file.limit"          -> "536870912",
      "hoodie.parquet.max.file.size"             -> "1073741824",
      "hoodie.parquet.block.size"                -> "536870912",
      "hoodie.copyonwrite.record.size.estimate"  -> "1024",
      "hoodie.datasource.write.precombine.field" -> deduplicationColumn,
      "hoodie.datasource.write.recordkey.field"  -> primaryKey.mkString(","),
      "hoodie.datasource.write.keygenerator.class" -> (if (primaryKey.size == 1) {
                                                         "org.apache.hudi.keygen.SimpleKeyGenerator"
                                                       } else { "org.apache.hudi.keygen.ComplexKeyGenerator" }),
      "hoodie.datasource.write.partitionpath.field"           -> partitionColumn,
      "hoodie.datasource.write.hive_style_partitioning"       -> "true",
      "hoodie.datasource.write.table.name"                    -> tableName,
      "hoodie.datasource.hive_sync.table"                     -> tableName,
      "hoodie.datasource.hive_sync.database"                  -> databaseName,
      "hoodie.datasource.hive_sync.enable"                    -> "true",
      "hoodie.datasource.hive_sync.partition_fields"          -> partitionColumn,
      "hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor",
      "hoodie.datasource.hive_sync.jdbcurl"                   -> "jdbc:hive2://ip-10-0-19-157.us-west-2.compute.internal:10000"    )

Regular Parquet:
[Screenshot: Spark UI stages for the regular Parquet write]

Hudi has an RDD conversion part:
[Screenshot: Spark UI stage for the RDD conversion]

The Hudi write took double the time:
[Screenshots: Spark UI stages for the Hudi write]

This was one real-world job that I tried, but I notice this slow write on every job where I use Hudi.

Is this normal? Is there any way to tune it? Am I doing something wrong?

Thank you so much!!!!!

@rubenssoto
Author

Hello,

I enabled the option hoodie.datasource.write.row.writer.enable and the job took only 21 minutes, about 30% faster, great!
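
For reference, a hedged sketch of how that flag could be added to the options map shown above (the config key is the one named in this comment; the rest is illustrative):

    // Enable the bulk-insert row writer path, avoiding the df.rdd conversion discussed below.
    val rowWriterOptions = hudiOptions + ("hoodie.datasource.write.row.writer.enable" -> "true")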

@vinothchandar
Member

@rubenssoto yes, the row writer is the difference; the df.rdd conversion in Spark takes that hit. I recommend sorting the data on the initial write, since it gives you a lot of returns for query performance.

@rubenssoto
Author

Do you mean an orderBy before df.write.format("hudi").save()?

@vinothchandar
Member

No, I mean the sorting Hudi does internally, which you mentioned before. It isn't even configurable for row writing, so all good. That should explain the extra time (21 vs. 15 minutes).

    // (Hudi internals) Sort rows by partition path, then record key, and coalesce
    // down to the configured bulk-insert parallelism before writing.
    return colOrderedDataset
        .sort(functions.col(HoodieRecord.PARTITION_PATH_METADATA_FIELD), functions.col(HoodieRecord.RECORD_KEY_METADATA_FIELD))
        .coalesce(config.getBulkInsertShuffleParallelism());

@rubenssoto
Author

Great, thank you for the explanation, it makes sense.

If I understand this code correctly, Hudi will order by partition path and record key, so if I have an unpartitioned table the data in my files should be ordered by record key (primary key), is that right?

I checked one of my tables, and these are the min and max of the primary key in each file:
[Screenshot: min/max primary key per file]

Using that code as a reference, I imagined the files like:

file01: keys 1–1000
file02: keys 1001–2000
file03: keys 2001–3000
... and so on
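
A hedged way to verify that layout (a sketch, assuming the table was written by Hudi so every record carries the _hoodie_record_key metadata column; the base path is illustrative):

    import org.apache.spark.sql.functions._

    // Group records by the physical file they came from and show the record-key range per file.
    spark.read.format("hudi").load("s3://dl/tables/example_table")
      .withColumn("file", input_file_name())
      .groupBy("file")
      .agg(min("_hoodie_record_key").as("min_key"), max("_hoodie_record_key").as("max_key"))
      .orderBy("min_key")
      .show(false)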

@vinothchandar
Member

Yes, that's correct. Notice that they are lexicographically sorted.

This is a trick we used at Uber even before Hudi. It lays the data out sorted from the start, so range pruning is faster, and when dealing with partitions of unequal size, sorting by partition path ensures we write the smallest total number of files. Otherwise, if you hash partition 1000 ways across 1000 partition paths, you'll end up with 1M files; with this approach you end up with at most about 2000 files, which is a huge benefit. From there on, when doing upserts/inserts, Hudi will maintain the file sizes.
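
A back-of-the-envelope sketch of that file-count argument (the numbers are the illustrative ones from the comment above, not measured values):

    val partitionPaths = 1000      // distinct partition paths in the table
    val writeTasks     = 1000      // shuffle/write parallelism

    // Hash partitioning: in the worst case every task holds rows for every partition path,
    // so each (task, partition path) pair produces its own file.
    val hashPartitionedFiles = writeTasks * partitionPaths       // 1,000,000

    // Sort by partition path first: each task writes one contiguous range, so new files
    // only start at partition-path boundaries or task boundaries.
    val sortedUpperBound = partitionPaths + (writeTasks - 1)     // ~2,000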
