Hive: make sure to unlock table when the finally block is not executed #3606

Closed
smallx wants to merge 3 commits into apache:master from smallx:hive-lock-fix

Conversation

@smallx (Contributor) commented Nov 25, 2021

Ensure that the Hive table lock is released even when the finally block that normally unlocks it is not executed, which can happen when:

  • System.exit(N) is called
  • all non-daemon threads exit

The old PR is #3000.
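
To illustrate the failure mode, here is a minimal, self-contained Java demo (not from the PR; FinallySkippedDemo is a hypothetical name): a second thread calls System.exit while the first is inside the try block, and the JVM halts without running finally.

public class FinallySkippedDemo {
  public static void main(String[] args) throws InterruptedException {
    try {
      System.out.println("lock acquired");
      // e.g. a fatal-error handler running in another thread
      new Thread(() -> System.exit(1)).start();
      Thread.sleep(1000); // still "holding the lock" when exit fires
    } finally {
      System.out.println("unlock"); // never printed: exit skips finally blocks
    }
  }
}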

@smallx (Contributor, Author) commented Nov 25, 2021

@kbendick @pvary Can you help me run the GitHub workflows? Thanks.

 * Also, to avoid memory leaks caused by addShutdownHook, we use a class level hook.
 */
private static synchronized void initUnlockTableHook(ClientPool metaClients) {
  globalMetaClients = metaClients;
Contributor (review comment):

What happens if you have multiple catalogs configured with different URIs? Having a global variable storing the clients could be a problem.

Contributor (review comment):

+1. This is a use case that I know exists at some companies (multiple catalogs configured with different URIs).

Contributor (Author) reply:

Thanks. I've fixed it.

LockResponse lockResponse = metaClients.run(client -> client.lock(lockRequest));
AtomicReference<LockState> state = new AtomicReference<>(lockResponse.getState());
long lockId = lockResponse.getLockid();
tableLocksById.put(lockId, fullName);
@pvary (Contributor) commented Nov 30, 2021:

Maybe store the triplet, so we can unlock it (see the sketch after this list):

  • lockId
  • fullName
  • metaClients
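
A minimal sketch of what storing that triplet could look like, assuming hypothetical names (HeldLock, LockRegistry, TABLE_LOCKS_BY_ID) and a simplified stand-in for Iceberg's ClientPool; this illustrates the suggestion rather than the PR's actual code.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Simplified stand-in for Iceberg's client pool abstraction (assumption).
interface ClientPool<C> {
  <R> R run(Function<C, R> action);
}

// Everything needed to release a lock later, even from a shutdown hook.
final class HeldLock<C> {
  final long lockId;               // HMS lock id to release
  final String fullName;           // fully qualified table name, for logging
  final ClientPool<C> metaClients; // the pool that issued the lock (per catalog)

  HeldLock(long lockId, String fullName, ClientPool<C> metaClients) {
    this.lockId = lockId;
    this.fullName = fullName;
    this.metaClients = metaClients;
  }
}

final class LockRegistry {
  // One entry per outstanding lock, keyed by lock id.
  static final Map<Long, HeldLock<?>> TABLE_LOCKS_BY_ID = new ConcurrentHashMap<>();
}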

if (lockId.isPresent()) {
  try {
    doUnlock(lockId.get());
    tableLocksById.remove(lockId.get());
Contributor (review comment):

If there is a problem with the unlock, this could cause a memory leak: when doUnlock throws, the remove call is skipped and the entry stays in the map.

Contributor (Author) reply:

Thanks. I've fixed it.
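
One way such a leak can be avoided is to move the map cleanup into a finally block so it runs whether or not the unlock succeeds; this is a sketch based on the excerpt above, not necessarily the PR's exact fix.

if (lockId.isPresent()) {
  try {
    doUnlock(lockId.get());
  } finally {
    // Always drop the bookkeeping entry so a failed unlock cannot
    // keep the map growing; the HMS lock reaper is the backstop.
    tableLocksById.remove(lockId.get());
  }
}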

@pvary (Contributor) commented Nov 30, 2021

I am still wary of this change: the issue seems quite rare and could be fixed by running the Hive housekeeper thread, which removes old locks without a heartbeat. On the other hand, this introduces additional complexity and possible leaks and problems.

Could you explain the use case that causes this issue, and help me understand why the other solutions are not sufficient?

Thanks,
Peter

baseline.gradle Outdated
'-Xep:EqualsGetClass:OFF',
// patterns that are allowed
'-Xep:MissingCasesInEnumSwitch:OFF',
'-Xep:ShutdownHook:OFF',
Contributor (review comment):

Instead of adding this, is it possible to add an annotation suppressing the error in just that one part of the code?

If not, then that's ok.

Contributor (Author) reply:

Thanks. I've fixed it.
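
For illustration, a local suppression might look like the following. Error Prone checks (which baseline.gradle's -Xep flags configure) honor @SuppressWarnings with the check name; the placement on initUnlockTableHook is an assumption, not the PR's final code.

// Suppress the ShutdownHook check only here, instead of disabling it
// project-wide in baseline.gradle.
@SuppressWarnings("ShutdownHook") // intentional: the hook releases leaked HMS locks
private static synchronized void initUnlockTableHook(ClientPool metaClients) {
  // hook registration as in the diff above
}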

@smallx (Contributor, Author) commented Dec 5, 2021

Thanks @pvary @kbendick

Our HMS does not enable the txn and housekeeper services, which is Hive's default behavior. When our streaming task hits a fatal error, it exits from another thread, which with small probability can leave the lock unreleased.

At the same time, other users may also call System.exit(N) to exit a Spark application, although this is not recommended.

@pvary (Contributor) commented Dec 6, 2021

Our HMS does not enable the txn and housekeeper services, which is Hive's default behavior.

May I know why the housekeeper threads are not enabled? This seems brittle to me, and forcing every user to clean up everything, every time, seems counterproductive.

Thanks, Peter

@smallx (Contributor, Author) commented Dec 6, 2021

Thanks @pvary

We haven't encountered similar lock problems before, and we will also try the housekeeper service.

If we can proactively release the locks when the program exits abnormally, that may be better for users who do not start the housekeeper threads.

When the program exits normally, tableLocksById should be empty, so the shutdown hook does nothing; the lock-cleaning action runs only when the program exits abnormally (see the sketch below). If needed, I can also add a property to control whether the shutdown hook is added.
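
A self-contained sketch of that behavior, with hypothetical names (UnlockOnExit, PENDING_UNLOCKS) and a Runnable per lock standing in for the real unlock call through the catalog's client pool.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class UnlockOnExit {
  // lockId -> action that releases that lock against the right metastore
  private static final Map<Long, Runnable> PENDING_UNLOCKS = new ConcurrentHashMap<>();

  static {
    // Registered once per class load, so repeated catalog creation cannot
    // accumulate hooks (the addShutdownHook memory-leak concern above).
    Runtime.getRuntime().addShutdownHook(
        new Thread(UnlockOnExit::unlockAll, "hive-unlock-hook"));
  }

  static void register(long lockId, Runnable unlockAction) {
    PENDING_UNLOCKS.put(lockId, unlockAction);
  }

  static void unregister(long lockId) {
    PENDING_UNLOCKS.remove(lockId);
  }

  private static void unlockAll() {
    // Empty on a normal exit; only an abnormal exit leaves entries behind.
    PENDING_UNLOCKS.forEach((id, unlock) -> {
      try {
        unlock.run(); // best effort while the JVM shuts down
      } catch (RuntimeException e) {
        // nothing more to do; the HMS lock reaper is the ultimate backstop
      }
    });
  }
}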

@smallx (Contributor, Author) commented Jan 13, 2022

Closing this because it is not a general problem and can be solved by starting the Hive housekeeper services. Thanks, everyone.

@smallx closed this on Jan 13, 2022