@zhuangchong zhuangchong commented Dec 9, 2025

Purpose

Linked issue: close #6782

“Hive lock may encounter deadlock” can have many possible causes, for example:
1. The first table task is still running, so subsequent tasks cannot acquire the Hive table lock and time out.
2. Acquiring the Hive metastore lock is delayed, which also causes timeouts.

In my latest changes:

1. Added detailed logs to show whether a lock acquisition failure was caused by a timeout or by another lock state.

    String msg =
            String.format(
                    "for table %s.%s (lockId=%d) after %dms. Final lock state: %s",
                    database, table, lockId, duration, lockState);
    LOG.info("Acquire lock {}", msg);

2. Fixed an issue where, if lockResponse = clients.run(client -> client.checkLock(lockId)); threw an exception, the lock was never released, which prevented subsequent tasks from acquiring it.

    try {
        while (lockResponse.getState() == LockState.WAITING) {
            long elapsed = System.currentTimeMillis() - startMs;
            if (elapsed >= acquireTimeout) {
                break;
            }
            nextSleep = Math.min(nextSleep * 2, checkMaxSleep);
            Thread.sleep(nextSleep);
            lockResponse = clients.run(client -> client.checkLock(lockId));
        }
    } finally {
        if (lockResponse.getState() != LockState.ACQUIRED) {
            // unlock if not acquired
            unlock(lockId);
        }
    }
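
For readers who want to try the release-on-failure behavior in isolation, here is a minimal, self-contained sketch of the pattern described above. It is not the code from this PR: `LockPollingSketch`, `LockClient`, `ScriptedClient`, and the `State` enum are hypothetical stand-ins for the real catalog lock and metastore client, and the timeout and sleep values are arbitrary. It only illustrates that the `finally` block releases the lock when polling times out or when the check call throws.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Illustrative sketch only; every name in this class is hypothetical. */
class LockPollingSketch {

    enum State { WAITING, ACQUIRED, NOT_ACQUIRED }

    /** Hypothetical stand-in for the metastore client (not the Hive API). */
    interface LockClient {
        State checkLock(long lockId);

        void unlock(long lockId);
    }

    /** Polls with exponential backoff; releases the lock unless it ends up ACQUIRED. */
    static State acquire(
            LockClient client, long lockId, State initial,
            long timeoutMs, long minSleepMs, long maxSleepMs) {
        State state = initial;
        long startMs = System.currentTimeMillis();
        long nextSleep = minSleepMs;
        try {
            while (state == State.WAITING) {
                if (System.currentTimeMillis() - startMs >= timeoutMs) {
                    break; // timed out, state stays WAITING
                }
                nextSleep = Math.min(nextSleep * 2, maxSleepMs);
                try {
                    Thread.sleep(nextSleep);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting for lock " + lockId, e);
                }
                state = client.checkLock(lockId); // may throw; finally still runs
            }
        } finally {
            if (state != State.ACQUIRED) {
                client.unlock(lockId); // release on timeout, failure state, or exception
            }
        }
        return state;
    }

    /** Test double: replays scripted states; a null entry simulates a failing check call. */
    static final class ScriptedClient implements LockClient {
        private final Iterator<State> script;
        final List<Long> unlocked = new ArrayList<>();

        ScriptedClient(State... states) {
            this.script = Arrays.asList(states).iterator();
        }

        @Override
        public State checkLock(long lockId) {
            State next = script.hasNext() ? script.next() : State.WAITING;
            if (next == null) {
                throw new IllegalStateException("simulated metastore failure");
            }
            return next;
        }

        @Override
        public void unlock(long lockId) {
            unlocked.add(lockId);
        }
    }
}
```

With a `ScriptedClient` whose second check throws, `acquire` propagates the exception and the lock id still appears in `unlocked`. Without the `finally` block, that lock would stay held until it expired on the server side.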

Tests

API and Format

Documentation

- long lockId = lock(database, table);
+ Long lockId = null;
+ try {
+     lockId = lock(database, table);

Regarding the runWithLock method only:
I don't understand why this was changed. What is the difference?

@zhuangchong zhuangchong Dec 10, 2025

“Hive lock may encounter deadlock” can have many possible causes, for example:
1. The first table task is still running, so subsequent tasks cannot acquire the Hive table lock and time out.
2. Acquiring the Hive metastore lock is delayed, which also causes timeouts.

In my latest changes:

1. Added detailed logs to show whether a lock acquisition failure was caused by a timeout or by another lock state.
2. Fixed an issue where, if lockResponse = clients.run(client -> client.checkLock(lockId)); threw an exception, the lock was never released, which prevented subsequent tasks from acquiring it.

@JingsongLi

Can you update the Purpose in the description? I cannot find any information about the changes to the try lock method.

@zhuangchong

Can you update the Purpose in the description? I cannot find any information about the changes to the try lock method.

done.

@JingsongLi

Thanks @zhuangchong +1

@JingsongLi JingsongLi merged commit f3f7bd3 into apache:master Dec 11, 2025
22 checks passed
@zhuangchong zhuangchong deleted the hive-catalog-lock branch December 11, 2025 11:13
Development

Successfully merging this pull request may close these issues.

[Bug] HiveCatalogLock may experience deadlock
