[SUPPORT] Hudi Write Performance #2484
Hello, I enabled the option `hoodie.datasource.write.row.writer.enable` and the job took only 21 minutes, about 30% faster. Great!!!!
@rubenssoto Yes, the row writer is the difference.
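A minimal sketch of what enabling that option looks like (not the exact code from this thread; the table name and save path are hypothetical):

```python
# Hudi write options for bulk_insert with the row writer enabled.
hudi_options = {
    "hoodie.table.name": "my_table",                      # hypothetical table name
    "hoodie.datasource.write.operation": "bulk_insert",
    # Row writer: bulk_insert writes Spark Rows to parquet directly,
    # skipping the DataFrame -> RDD[HoodieRecord] (Avro) conversion step.
    "hoodie.datasource.write.row.writer.enable": "true",
}
# df.write.format("hudi").options(**hudi_options).mode("append").save("s3://bucket/my_table/")
```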
Do you mean an Order By before df.write.format('hudi').save()?
No, I mean the sorting Hudi does internally, which you mentioned before. So this is not even configurable for row writing; all good. That should explain the extra time.
Yes, that's correct. They are lexicographically sorted, if you notice. This is a trick we used at Uber even before Hudi. It helps lay out the data initially sorted, so range pruning is faster. Also, when dealing with partitions of unequal size, sorting by partition path ensures we write the smallest number of files in total. Otherwise, if you hash partition 1000 ways across 1000 partition paths, you'll end up with 1M files; with this approach you end up with at most 2000 files. Huge benefit. And from there on, when doing upserts/inserts, Hudi will maintain the file sizes.
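The file-count arithmetic above can be sketched in a few lines (a back-of-the-envelope bound, using the worst case for hash partitioning and the upper bound stated above for the sorted layout):

```python
num_tasks = 1000  # parallel write tasks (shuffle partitions)
num_paths = 1000  # distinct partition paths (e.g. one per date)

# Hash partitioning: every task can receive rows for any partition path,
# so in the worst case each task writes one file per path.
hash_files = num_tasks * num_paths

# Sorting by partition path: each path's rows are contiguous across tasks,
# so total files are bounded by tasks plus paths (the "at most 2000" above).
sorted_files = num_tasks + num_paths

print(hash_files, sorted_files)  # prints: 1000000 2000
```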
Hello,
I want to start using Hudi in my data lake, so I'm running performance tests comparing current processing time with and without Hudi. We have many tables in the data lake, so we process them in groups in the same Spark context, using different threads.
I ran a test reprocessing all source tables. With regular parquet it took 15 minutes; with Hudi bulk insert, 29 minutes. Hudi performs some operations that regular parquet doesn't, such as sorting, but the big difference was in the parquet write itself. Is there any difference between writing parquet through Hudi and writing regular parquet? I used the gzip codec in both.
For Hudi I set the bulk insert parallelism to 20; for regular parquet I used coalesce(20).
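The two setups being compared can be sketched roughly as follows (a hedged sketch, not the code from this job; the table name and S3 paths are hypothetical):

```python
# Plain parquet baseline: 20 output files via coalesce, gzip codec.
# df.coalesce(20).write.option("compression", "gzip").parquet("s3://bucket/plain/")

# Hudi bulk insert with matching parallelism and codec.
hudi_options = {
    "hoodie.table.name": "my_table",                  # hypothetical table name
    "hoodie.datasource.write.operation": "bulk_insert",
    "hoodie.bulkinsert.shuffle.parallelism": "20",    # counterpart of coalesce(20)
    "hoodie.parquet.compression.codec": "gzip",
}
# df.write.format("hudi").options(**hudi_options).mode("overwrite").save("s3://bucket/hudi/")
```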
Hudi Version: 0.8.0-SNAPSHOT
Spark Version: 3.0.1
11 Executors with 5 cores each and 35g of memory
spark submit:
spark-submit --deploy-mode cluster --conf spark.executor.cores=5 --conf spark.executor.memoryOverhead=3000 --conf spark.yarn.maxAppAttempts=1 --conf spark.executor.memory=35g --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --packages org.apache.spark:spark-avro_2.12:2.4.4 --jars s3://dl/lib/spark-daria_2.12-0.38.2.jar,s3://dl/lib/hudi-spark-bundle_2.12-0.8.0-SNAPSHOT.jar --class TableProcessorWrapper s3://dl/code/projects/data_projects/batch_processor_engine/batch-processor-engine_2.12-3.0.1_0.5.jar courier_api_group01
Regular Parquet
![Captura de Tela 2021-01-24 às 12 42 13](https://user-images.githubusercontent.com/36298331/105635487-c40cf200-5e41-11eb-99f8-7dd069b26b4e.png)
Hudi has an RDD conversion part
![Captura de Tela 2021-01-24 às 12 45 14](https://user-images.githubusercontent.com/36298331/105635542-1cdc8a80-5e42-11eb-8c2c-e0d394a4f8c5.png)
Hudi write took twice as long
![Captura de Tela 2021-01-24 às 12 46 37](https://user-images.githubusercontent.com/36298331/105635569-4e555600-5e42-11eb-85fe-4b924f61b024.png)
![Captura de Tela 2021-01-24 às 12 47 48](https://user-images.githubusercontent.com/36298331/105635590-67f69d80-5e42-11eb-9a6f-7be470417ab8.png)
This was one real-world job I tried, but I notice this slow writing in every job where I use Hudi.
Is this normal? Is there any way to tune it? Am I doing something wrong?
Thank you so much!