[SUPPORT] Upsert for S3 Hudi dataset with large partitions takes a lot of time in writing #1371
Comments
cc @umehrot2 is this similar to what you were mentioning as well? @abhaygupta3390 IIUC you are facing this as part of a streaming write, using the structured streaming sink? If you can share a code snippet, it will help reproduce this.
@vinothchandar this is exactly what I was talking about. This easily becomes a bottleneck, as the driver spends time filtering the files it gets from the listing.
Created a Jira for this issue: https://issues.apache.org/jira/browse/HUDI-656
@vinothchandar No, I am using the DataFrame write API. Sample code snippet:
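A minimal spark-shell style sketch of what such an upsert write typically looks like with the DataFrame write API (the record key, precombine field, partition columns, table name, and S3 path below are placeholders, not the reporter's actual values):

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("hudi-upsert-sketch")
  .getOrCreate()
import spark.implicits._

// Placeholder upsert batch with three partition columns, mirroring the layout described in the issue.
val upserts = Seq(
  ("id-1", "2020-03-01 00:00:00", "a", "x", "01"),
  ("id-2", "2020-03-01 00:05:00", "b", "y", "02")
).toDF("record_key", "ts", "part1", "part2", "part3")

upserts.write
  .format("org.apache.hudi")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.recordkey.field", "record_key")            // placeholder key field
  .option("hoodie.datasource.write.precombine.field", "ts")                   // placeholder precombine field
  .option("hoodie.datasource.write.partitionpath.field", "part1,part2,part3") // three partition columns; a complex key generator is typically also configured (omitted here)
  .option("hoodie.table.name", "my_table")                                    // placeholder table name
  .mode(SaveMode.Append)
  .save("s3://my-bucket/hudi/my_table")                                       // placeholder base path
```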
@abhaygupta3390 we will get this fixed in the next release. @umehrot2 do you have a patch to share already?
@umehrot2: Assigning this GitHub issue to you. Corresponding Jira: https://jira.apache.org/jira/browse/HUDI-672. If there is already a tracking Jira, please feel free to close this one.
#1394 merged. Resolving this ticket.
Describe the problem you faced
I have a Spark application that processes a stream of upserts in batches and writes the result to an S3 location in Hudi format. The application runs on an EMR cluster.
The dataset has 3 partition columns, and the overall partition cardinality is roughly 200 × 2 × 12 (about 4,800 partitions).
After the commit and clean are done, the method createRelation is invoked, which takes roughly 9-10 minutes, and this time keeps increasing as the partition cardinality grows.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Since I am writing the DataFrame to the path in append mode, I expect the write to be complete once the commit happens.
Environment Description
Hudi version : 0.5.1-incubating
Spark version : 2.4.4
Hive version : Hive 2.3.6-amzn-1
Hadoop version : Amazon 2.8.5
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : no
Additional context
EMR version: emr-5.29.0
Stacktrace
Info logs for one iteration of org.apache.hudi.hadoop.HoodieROTablePathFilter#accept in org.apache.spark.sql.execution.datasources.InMemoryFileIndex#listLeafFiles:
Below is a screenshot of the Spark UI showing the time gap between the above step and processing the next batch of updates:
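For background, HoodieROTablePathFilter is the same filter Hudi's docs suggest registering when reading the read-optimized view with the plain parquet datasource; during listing, Spark's InMemoryFileIndex calls its accept method once per leaf file found under the base path, which is why the cost grows with partition count. A rough sketch of that registration (the path and glob are placeholders, not the reporter's configuration):

```scala
import org.apache.hadoop.fs.PathFilter
import org.apache.hudi.hadoop.HoodieROTablePathFilter
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hudi-path-filter-sketch")
  .getOrCreate()

// Register Hudi's path filter so plain parquet reads only see files from the latest file slices.
// Spark's InMemoryFileIndex then invokes HoodieROTablePathFilter#accept for every leaf file it
// lists under the base path, so the driver-side filtering cost scales with the number of
// partitions and files; this is the step that takes 9-10 minutes in this report.
spark.sparkContext.hadoopConfiguration.setClass(
  "mapreduce.input.pathFilter.class",
  classOf[HoodieROTablePathFilter],
  classOf[PathFilter])

val snapshot = spark.read.parquet("s3://my-bucket/hudi/my_table/*/*/*") // placeholder path and partition glob
```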