
Hive: Fix concurrent transactions overwriting commits by adding hive lock heartbeats. #5036

Merged

3 commits merged into apache:master on Aug 4, 2022

Conversation

SinghAsDev
Contributor

When using the Hive Metastore catalog, a long-running transaction can exceed the transaction timeout of the Hive Metastore (hive.txn.timeout, or metastore.txn.timeout in newer versions). In those cases, concurrent transactions can overwrite each other's commits, leading to data corruption. This change prevents that by adding heartbeats for the Hive lock held during a commit.
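For illustration, here is a minimal sketch of the idea behind the fix (not the PR's actual code): once the Hive lock is acquired, a background task keeps renewing it via IMetaStoreClient.heartbeat until the commit finishes and releases the lock. The class name, interval wiring, and error handling below are assumptions.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.hive.metastore.IMetaStoreClient;

class HiveLockHeartbeatSketch {
  private final ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
  private ScheduledFuture<?> future;

  // Renew the lock at a fraction of the metastore txn timeout so it cannot expire mid-commit.
  void start(IMetaStoreClient client, long lockId, long intervalMs) {
    future = executor.scheduleAtFixedRate(
        () -> {
          try {
            // txnId 0: the lock was taken outside of a Hive transaction.
            client.heartbeat(0, lockId);
          } catch (Exception e) {
            // In the actual change a failed heartbeat should surface as a commit failure;
            // see the review discussion below.
          }
        },
        intervalMs, intervalMs, TimeUnit.MILLISECONDS);
  }

  // Stop heartbeating once the commit has released the lock.
  void stop() {
    if (future != null) {
      future.cancel(true);
    }
    executor.shutdown();
  }
}
```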

github-actions bot added the hive label on Jun 14, 2022
@SinghAsDev
Contributor Author

@rdblue @pvary, can you help review? Thanks!

@rdblue
Contributor

rdblue commented Jun 14, 2022

@szehon-ho and @RussellSpitzer, you may be interested in this one, too.

@pvary
Contributor

pvary commented Jun 15, 2022

@SinghAsDev: Quick question: did you face a specific issue around this, or are we just fixing a long-outstanding TODO here?

I am asking because I have a feeling that the need for a heartbeat is more of a theoretical issue than a concrete one. The commit code looks roughly like this:

lock
loadTable
if needed createTable
some object value settings
persistTable

The acquireLock part should heartbeat the lock until the lock is acquired, and the other parts should be fast.

I was trying to find situations where we need the heartbeat, and I came up with these:

  • Wrong configuration: Too low heartbeat interval configured on the HMS
  • JVM pause on the client, or the HMS side

For the JVM pause situations, the problem is still unlikely to happen (even without the heartbeat), and the heartbeating thread is also suspended in that case, so there is no guarantee that it would help.

Since Hive 4 is close to release, we might try to add a checkAndSet method to the HMS API for setting the metadata_location atomically. That could be the ultimate solution for this issue, if we find the fix important.

All that said: with the default HMS configuration the timeout is set to 300s. If we cannot finish the commit in 5 minutes, we are already in serious trouble.
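As a side note, here is a small illustrative snippet (an assumption for this discussion, not code from the PR) showing how a client could read that timeout from the configuration and derive a heartbeat interval safely below it:

```java
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.hive.conf.HiveConf;

class HeartbeatIntervalSketch {
  // hive.txn.timeout defaults to 300s; heartbeating at half that keeps the lock alive.
  static long heartbeatIntervalMs(HiveConf conf) {
    long timeoutMs = conf.getTimeVar(HiveConf.ConfVars.HIVE_TXN_TIMEOUT, TimeUnit.MILLISECONDS);
    return timeoutMs / 2;
  }
}
```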

@SinghAsDev
Contributor Author

SinghAsDev commented Jun 15, 2022

Hi @pvary, we did hit this issue and it led to data corruption. I think a lot of platforms are going to remain on older versions of HMS, and may not want to increase the txn timeout.

To be honest, we should have been rolling back the transaction, or at least failing the job, whenever unlocking the Hive lock fails. Many users may be hitting this issue without even realizing it.

Btw, there can be many more reasons for a transaction to take far longer than the txn timeout. For instance, HMS does a fast scan for alter table operations even when stats collection is turned off; this fast stats scan lists all files from S3 and can take hours. Another example would be an HMS pre-commit hook taking a long time for whatever reason. I would argue that we cannot safely assume transactions will finish within the txn timeout.

@pvary
Contributor

pvary commented Jun 15, 2022

For instance, HMS does a fast scan for alter table operations even when stats collection is turned off

That's seriously bad and unnecessary. It would cause seriously long commit times.
Which version of Hive are you using?

@SinghAsDev
Contributor Author

For instance, HMS does a fast scan for alter table operations even when stats collection is turned off

That's seriously bad and unnecessary. It would cause seriously long commit times.

Which version of Hive are you using?

We are on Hive 1.2, but I think the issue still exists in later versions. I agree this is more of a bug; we have fixed it for our HMS deployment. Btw, I forgot to mention that this issue is limited to unpartitioned tables. All Iceberg tables are stored as unpartitioned tables in HMS, so Iceberg tables are affected by it.

@pvary
Contributor

pvary commented Jun 15, 2022

For instance, HMS does a fast scan for alter table operations even when stats collection is turned off

That's seriously bad and unnecessary. It would cause seriously long commit times.
Which version of Hive are you using?

We are on Hive 1.2, but I think the issue still exists in later versions. I agree this is more of a bug; we have fixed it for our HMS deployment. Btw, I forgot to mention that this issue is limited to unpartitioned tables. All Iceberg tables are stored as unpartitioned tables in HMS, so Iceberg tables are affected by it.

After fixing this issue, do you still see long alter table HMS calls? I am trying to understand the situation better, and find other places that we may have to fix too.

@SinghAsDev
Contributor Author

After fixing this issue, do you still see long alter table HMS calls? I am trying to understand the situation better, and find other places that we may have to fix too.

Yea, though it likely won't be as frequent anymore. The point I am trying to make is that we cannot guarantee that an Iceberg commit will finish within hive.txn.timeout; you and I both listed a few reasons why that can happen. I think heartbeats on locks are a must-have, or we should fail or roll back the commit if unlock fails with 'lock not found'. I prefer the former, as it would be a waste of compute to simply fail or roll back.

      return null;
    });
  } catch (TException | InterruptedException e) {
    LOG.error("Fail to heartbeat for lock: {}", lockId, e);
Contributor


Not sure about this, but I am thinking about the situation where we find that the heartbeat has failed.
Shall we make sure that persistTable is not called, and throw a CommitFailedException?

Similarly, as you have already mentioned, we might want to throw a CommitStateUnknownException if the unlock has failed. This is a little different, because the metadata_location might be set as the current transaction expects, but a concurrent transaction's change might have been lost in the meantime.

Contributor Author


Thanks, I also think we should fail the commit on heartbeat failure. However, to do so we would have to add retries on heartbeats as well, so that transient failures don't fail the entire job. I will add that.

For the unlock failure handling, I think that needs a bit more discussion; if it is OK with you, let's have that discussion in a follow-up PR.

Contributor


RetryingMetaStoreClient does the retries for us; no need to do them by hand.

Contributor Author


Oh yea, I forgot about the niceties of RetryingMetaStoreClient. Thanks!

Contributor Author


Hey @pvary, I was thinking about this more, and I am wondering if throwing CommitFailedException would really be of any benefit. Within the acquired Hive lock, the only expensive operation we do is the alter table call, which is blocking; once it is issued, we won't be able to abort it even if the heartbeat fails. Please correct me if I am missing something here. I think it would make more sense to handle 'lock not found' during the unlock operation and try to revert the commit, or at least throw CommitStateUnknownException. Thoughts?

Contributor


@SinghAsDev: We could do something more elaborate: throw CommitFailedException if we are still before the alter operation, and CommitStateUnknownException if we have already issued the alter command. But I hope this would be a rare occurrence, so it might not be worth the extra complexity. The question is how we can communicate back to the original thread that the heartbeat has failed; depending on that, the decision above could be an easy one or a more complex one, and then we can decide which to choose.
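To make the trade-off concrete, here is a hedged sketch of one way the heartbeat task could report a failure back to the committing thread; the class, field, and method names are illustrative assumptions, not the PR's code:

```java
import java.util.concurrent.atomic.AtomicReference;
import org.apache.iceberg.exceptions.CommitFailedException;
import org.apache.iceberg.exceptions.CommitStateUnknownException;

class HeartbeatStateSketch {
  private final AtomicReference<Exception> failure = new AtomicReference<>();

  // Called from the heartbeat thread when renewing the lock throws.
  void recordFailure(Exception e) {
    failure.compareAndSet(null, e);
  }

  // Called from the commit thread before alter_table: still safe to abort cleanly.
  void checkBeforeAlter() {
    Exception e = failure.get();
    if (e != null) {
      throw new CommitFailedException(e, "Hive lock heartbeat failed before persisting the table");
    }
  }

  // Called after alter_table was issued: the outcome can no longer be known for sure.
  void checkAfterAlter() {
    Exception e = failure.get();
    if (e != null) {
      throw new CommitStateUnknownException(e);
    }
  }
}
```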

Contributor Author


@pvary I agree that this added complexity does not really buy us much. All the lengthy operations I have seen that needed heartbeats are within the alter table operation, so it is very unlikely we will see a heartbeat failure before the alter operation. After alter table is done, any issue with the lock will show up in the unlock operation anyway. As such, I still think we don't need the added complexity. If we agree on this reasoning, I can add this information to the code as well.

Contributor


Ok. So we would likely need a test where a failing heartbeat fails the commit with a CommitStateUnknownException. It is fine if we have CommitStateUnknownException for all of these cases.
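For reference, a sketch of the shape such a test could take (the spy-client wiring and the table/data-file setup are assumptions here, not the test actually added in this PR):

```java
import static org.assertj.core.api.Assertions.assertThatThrownBy;
import static org.mockito.ArgumentMatchers.anyLong;
import static org.mockito.Mockito.doThrow;

import org.apache.hadoop.hive.metastore.IMetaStoreClient;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.Table;
import org.apache.iceberg.exceptions.CommitStateUnknownException;
import org.apache.thrift.TException;

class HeartbeatFailureTestSketch {
  // spyClient is assumed to be a Mockito spy of the IMetaStoreClient used by the table operations.
  static void assertHeartbeatFailureFailsCommit(
      IMetaStoreClient spyClient, Table table, DataFile dataFile) throws TException {
    // Make every lock heartbeat fail with a metastore error.
    doThrow(new TException("injected heartbeat failure"))
        .when(spyClient)
        .heartbeat(anyLong(), anyLong());

    // The commit that holds the Hive lock should then surface the failure.
    assertThatThrownBy(() -> table.newAppend().appendFile(dataFile).commit())
        .isInstanceOf(CommitStateUnknownException.class);
  }
}
```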

Contributor Author


Added a test. Can you help take a look, @pvary @szehon-ho? Thanks!

@pvary
Contributor

pvary commented Jun 16, 2022

After fixing this issue, do you still see long alter table HMS calls? I am trying to understand the situation better, and find other places that we may have to fix too.

Yea, though it likely won't be as frequent anymore. The point I am trying to make is that we cannot guarantee that an Iceberg commit will finish within hive.txn.timeout; you and I both listed a few reasons why that can happen. I think heartbeats on locks are a must-have, or we should fail or roll back the commit if unlock fails with 'lock not found'. I prefer the former, as it would be a waste of compute to simply fail or roll back.

You have convinced me here; I am just trying to collect the things that we have to fix in Hive to make this work as fast as possible 😄

@@ -43,8 +43,12 @@ public abstract class HiveMetastoreTest {
  @BeforeClass
  public static void startMetastore() throws Exception {
    HiveMetastoreTest.metastore = new TestHiveMetastore();
    metastore.start();
    HiveMetastoreTest.hiveConf = metastore.hiveConf();
    if (HiveMetastoreTest.hiveConf != null) {
Collaborator


Trying to understand, how would this happen?

Contributor Author


Since this runs before the class is actually instantiated, HiveMetastoreTest.hiveConf can be null. I actually saw this in the unit test I added as part of this PR.

@szehon-ho
Collaborator

Since Hive 4 is close to release, we might try to add a checkAndSet method to the HMS API for setting the metadata_location atomically. That could be the ultimate solution for this issue, if we find the fix important.

I think that would be awesome.

@rdblue
Contributor

rdblue commented Jun 29, 2022

The 0.14.0 release is coming soon. I'll try to include this if it is done in time.

@pvary
Contributor

pvary commented Jul 19, 2022

@SinghAsDev: Could you please check why the test is failing? I also left some minor comments.

Thanks,
Peter

SinghAsDev force-pushed the lock_heartbeat_oss branch from 07895a4 to f6356ef on July 28, 2022 02:06
@SinghAsDev
Contributor Author

@SinghAsDev: Could you please check why the test is failing? I also left some minor comments.

Thanks,

Peter

@pvary, mind looking again? Thanks!

@SinghAsDev
Contributor Author

@pvary @szehon-ho @rdblue I think this is ready now. The one test failure is not related to this diff; I will force-push to re-trigger the checks.

pvary changed the title from "Fix concurrent transactions overwriting commits by adding hive lock heartbeats." to "Hive: Fix concurrent transactions overwriting commits by adding hive lock heartbeats." on Aug 4, 2022
pvary merged commit 28b5d0e into apache:master on Aug 4, 2022
@pvary
Contributor

pvary commented Aug 4, 2022

Thanks @SinghAsDev for the PR and @szehon-ho for the review!
