[SUPPORT] Hudi Job fails fast in concurrent write even with high retries and long wait time #9728
Comments
OCC is not designed for high-frequency concurrent access; we are working hard towards non-blocking concurrency control: #7907
Thanks! That's good to know. Meanwhile, we can build our own "lock". However, does this symptom also mean these retry parameters do not work for the DynamoDB-based lock provider? I was looking into the source code for 0.13.0, and it seems DynamoDBLockProvider is not picking up some of the parameters like "write.lock.num_retries". Is that true?
@Jason-liujc I did a fix around the same lines which will be included in 0.14. The DynamoDB config class had some configuration issues which were not allowing it to pick up default params, although I haven't tried this with num_retries. Can you try with this PR: #8868
Thanks. We've tried the newest update from that PR. Just going through the source code, I'm seeing that the retry parameters are being picked up here: Line 58 in 9e9f768
Whereas DynamoDB is not: hudi/hudi-aws/src/main/java/org/apache/hudi/aws/transaction/lock/DynamoDBBasedLockProvider.java Line 62 in 9e9f768
My theory right now is that when we are using the DynamoDB lock provider, the retry parameters are not used, hence the job fails pretty fast. Let me know if that's the case or expected behavior from Hudi.
@Jason-liujc I believe the issue you're seeing is unrelated to lock acquisition. From the multi-writer docs:
From the error you posted, it seems like both writers write to the same file ID, so one of them fails. The PR that @danny0405 linked addresses this issue by resolving the conflicts during compaction/read-time.
Can you please confirm whether this issue still persists?
@SamarthRaval @Jason-liujc As discussed, the retry configuration is unrelated to the problem you are facing. The only way to handle such scenarios at the moment is to handle retries in your application-level code. Hopefully #7907 will solve this problem in Hudi 1.0.
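To make "handling retries at your application level" concrete, here is a minimal sketch of a generic retry wrapper with exponential backoff and jitter. This is not code from Hudi or this thread; `write_fn` stands in for whatever call performs the Hudi write (e.g. a Spark `df.write.format("hudi").save(...)` wrapped in a function), and it is assumed to raise an exception on a write conflict.

```python
import random
import time


def retry_write(write_fn, max_attempts=8, base_delay_s=5.0, max_delay_s=300.0):
    """Retry write_fn until it succeeds or max_attempts is exhausted.

    write_fn is a hypothetical zero-argument callable that performs the
    write and raises on conflict; swap in your real Hudi write call.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return write_fn()
        except Exception:
            if attempt == max_attempts:
                raise
            # Exponential backoff with full jitter, so several concurrent
            # writers don't retry in lockstep and re-collide.
            delay = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))
```

In practice you would catch only the conflict exception your writer actually throws, rather than bare `Exception`, so that genuine failures surface immediately.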
Describe the problem you faced
We have a use case where we might have a lot of concurrent writes to the same partition under special scenarios. We are testing whether Hudi supports this natively by changing some of the lock retry/wait-time parameters.
We are trying to allow all these writers to eventually succeed via optimistic retries by using really high num_retries and wait_time_ms parameters.
To Reproduce
We are using DynamoDB as the lock provider, running these loads on AWS EMR
We use the following options related to concurrency control:
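(The original config block did not survive the page scrape. For context, a representative set of writer options for optimistic concurrency control with the DynamoDB lock provider might look like the following; the table name, partition key, region, and retry values here are illustrative, not the reporter's actual settings.)

```properties
hoodie.write.concurrency.mode=optimistic_concurrency_control
hoodie.cleaner.policy.failed.writes=LAZY
hoodie.write.lock.provider=org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider
hoodie.write.lock.dynamodb.table=hudi-locks
hoodie.write.lock.dynamodb.partition_key=my_table
hoodie.write.lock.dynamodb.region=us-east-1
hoodie.write.lock.num_retries=240
hoodie.write.lock.wait_time_ms=60000
```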
We spun up 7 jobs that write to the same table. Each job should take around ~20 minutes to finish on its own.
Expected behavior
These 7 jobs will have conflicting writes, but will retry and eventually succeed.
Based on the retry parameters I have set, I'd expect them to run for at least 4 hours.
Environment Description
Hudi version : 0.13.0
Spark version : 3.3
Hive version : 3.1.3
Hadoop version : 3.3.3
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : no
Additional context
Running these workloads on EMR. This is a follow-up to this issue: #9512
Stacktrace
Seeing these errors after 0.5 hours for 40% of the jobs: