Why do resource recovery in ResourceRegistrar.register? #59

Closed
klr8 opened this issue Apr 15, 2016 · 5 comments


klr8 commented Apr 15, 2016

Simple (potentially dumb) question: why does the ResourceRegistrar.register() method attempt resource recovery if the transaction manager is already running?

The reason I ask is the following:
Our application fails to start up because of a single transaction causing a RecoveryException. This exception is raised when we try to register a resource with the transaction manager. At that point the transaction manager is already running because it was started by the Tomcat BTMLifecycleListener, and consequently ResourceRegistrar.register() attempts recovery:

Caused by: bitronix.tm.recovery.RecoveryException: error recovering resource '1460720339996_3D_DOCCLE' due to an incompatible heuristic decision
    at bitronix.tm.recovery.IncrementalRecoverer.recover(IncrementalRecoverer.java:94) ~[btm-2.1.2.jar:2.1.2]
    at bitronix.tm.resource.ResourceRegistrar.register(ResourceRegistrar.java:78) ~[btm-2.1.2.jar:2.1.2]
    at grid.storage.transaction.btm.PoolingSessionFactory.buildXAPool(PoolingSessionFactory.java:68) ~[honeycomb-bitronix-1.8.23.jar:na]
    at grid.storage.transaction.btm.PoolingSessionFactory.init(PoolingSessionFactory.java:52) ~[honeycomb-bitronix-1.8.23.jar:na]
    ... 42 common frames omitted

Because of this, the application cannot register the resource with the transaction manager and consequently fails to start. It seems a bit harsh to prevent full application startup just because of a recovery failure on a single transaction.
Is there a better way to handle this kind of issue?
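
For context, the registration path that blows up looks roughly like this. A minimal sketch, assuming a plain BTM PoolingDataSource setup; only the unique resource name comes from the stack trace above, everything else (class name, pool size) is illustrative:

```java
import bitronix.tm.TransactionManagerServices;
import bitronix.tm.resource.jdbc.PoolingDataSource;

public class Startup {
    public static void main(String[] args) {
        // The TM is already running here, e.g. started by BTMLifecycleListener.
        TransactionManagerServices.getTransactionManager();

        PoolingDataSource ds = new PoolingDataSource();
        ds.setClassName("com.mysql.jdbc.jdbc2.optional.MysqlXADataSource"); // illustrative
        ds.setUniqueName("1460720339996_3D_DOCCLE");
        ds.setMaxPoolSize(5);

        // init() ends up in ResourceRegistrar.register(); because the TM is
        // already started, registration immediately runs incremental recovery
        // on the new resource, and a RecoveryException there aborts startup.
        ds.init();
    }
}
```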


lorban commented Apr 15, 2016

Hi,

Let's start by trying to answer your "simple" question.

When a resource is added to the ResourceRegistrar, it is recovered because
it might be coming back after a crash. BTM has to make sure that every
pending in-doubt transaction related to that resource gets finished before
you attempt starting new ones, and the only way to do that is to run
recovery on the resource and reconcile the result against BTM's journal.
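
To make that concrete, here is a simplified sketch of what incremental recovery conceptually does, using only the standard javax.transaction.xa API (this is not BTM's actual code; journalCommittedGtrids is a hypothetical stand-in for the commit decisions recorded in BTM's journal):

```java
import java.math.BigInteger;
import java.util.Set;
import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;

// Ask the resource for its in-doubt branches, then replay the decision
// recorded in the TM journal: commit what the journal committed, roll
// back everything else ("presumed abort").
void recover(XAResource xaResource, Set<String> journalCommittedGtrids) throws Exception {
    Xid[] inDoubt = xaResource.recover(XAResource.TMSTARTRSCAN | XAResource.TMENDRSCAN);
    for (Xid xid : inDoubt) {
        String gtrid = new BigInteger(1, xid.getGlobalTransactionId()).toString(16);
        if (journalCommittedGtrids.contains(gtrid)) {
            xaResource.commit(xid, false); // journal says COMMITTED -> finish the commit
        } else {
            xaResource.rollback(xid);      // no recorded decision -> roll back
        }
    }
}
```

The RecoveryException you hit is what happens when one of those commit() or rollback() calls reports that the resource already took a different, heuristic outcome.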

According to the exception you posted, BTM is refusing to work with that
resource because it detected an inconsistency between its own decision and
what your '1460720339996_3D_DOCCLE' resource actually did. I only have very
partial data here, so my attempt at explaining what has been going on might
be a bit off, but here is what I believe happened (a rough sketch of the
crash window follows the list):

  • your app started an XA transaction on 1460720339996_3D_DOCCLE plus at
    least one other resource.
  • your app called commit() on BTM, which started the 2PC protocol.
  • your app crashed right after BTM finished the 1st phase of the 2PC and
    durably saved a commit decision in its journal, but before the 2nd phase
    (the actual commit) had a chance to run, leaving a transaction in-doubt
    (or prepared) at least in 1460720339996_3D_DOCCLE.
  • you went to the 1460720339996_3D_DOCCLE database and manually rolled back
    the in-doubt transaction, creating a heuristic decision.
  • you restarted your app, and during the 1460720339996_3D_DOCCLE
    registration BTM detected that a transaction it was still expecting to
    commit had actually been rolled back.
  • BTM threw that RecoveryException, indicating that it refuses to work with
    that resource as its data integrity is now compromised.
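
Here is the sketch of where that crash window sits in the protocol (hypothetical code; Journal stands in for BTM's transaction log and is not a real BTM type):

```java
import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;

interface Journal { void logCommitDecision(); } // hypothetical stand-in for BTM's journal

class TwoPhaseCommitSketch {
    void commit(XAResource resA, Xid xidA, XAResource resB, Xid xidB,
                Journal journal) throws Exception {
        resA.prepare(xidA);          // phase 1: both branches vote to commit
        resB.prepare(xidB);

        journal.logCommitDecision(); // the commit decision is durably recorded

        // <-- a crash here leaves both branches in-doubt (prepared) on the
        //     resources, while the journal already says COMMITTED

        resA.commit(xidA, false);    // phase 2: the actual commit never ran
        resB.commit(xidB, false);
    }
}
```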

BTM should have logged an error message looking like "unable to commit
in-doubt branch on resource '1460720339996_3D_DOCCLE' - error=XA_something";
that should tell you a bit more about how it detected the inconsistency.

You might find it harsh that BTM now refuses to work with your resource,
but it is actually doing its best to avoid worsening the situation. When
such a situation occurs, what you should do is take manual corrective
action to repair the integrity of the spoiled data (there is unfortunately
no information BTM could collect to help you figure out what data was
touched on this resource by that transaction), and then manually go to the
Oracle server and tell it to forget about the XA transaction with an Oracle
tool like the dbms_xa.xa_forget PL/SQL procedure (
http://www.morganslibrary.org/reference/pkgs/dbms_xa.html#dxa4) or
something equivalent.
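
If you can get hold of an XAResource for the database from Java, the vendor-neutral equivalent is the standard XAResource.forget() call. A sketch only; isTheDamagedBranch() is a hypothetical check you would have to write yourself, e.g. by comparing GTRIDs against the one in BTM's error message:

```java
import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;

// Scan the resource for in-doubt branches and tell it to forget the
// heuristically completed one.
void forgetHeuristicBranch(XAResource xaResource) throws Exception {
    for (Xid xid : xaResource.recover(XAResource.TMSTARTRSCAN | XAResource.TMENDRSCAN)) {
        if (isTheDamagedBranch(xid)) {
            xaResource.forget(xid); // clears the heuristic outcome on the resource
        }
    }
}

boolean isTheDamagedBranch(Xid xid) {
    return false; // hypothetical: compare xid.getGlobalTransactionId() to the known GTRID
}
```

Whether this works depends on the database's XA implementation, which is why the DB-specific tooling is usually the safer route.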

I hope this helps.

Ludovic


klr8 commented Apr 16, 2016

Hello Ludovic,

Thank you for your insightful answer.

You were almost right in your analysis of what happened. There was indeed an in-flight transaction involving two resources when we had a full crash of the machine. From the data I have, it seems the crash happened during the 2nd phase of the 2PC protocol, right after the first resource committed but before the second one was able to commit. There was no manual intervention on the first resource. When the machine restarted, Bitronix restarted and tried to recover the transaction. At that point the 2nd resource refused to commit, which is of course problematic and not Bitronix's fault.

I agree that for the data involved in this f***ed-up transaction we need manual intervention after studying the whole situation. In our case, operational support people have insufficient knowledge to do that. Consequently the machine will refuse to restart until the problem is resolved, which happens quite a bit later, after consulting with, for instance, development people.

In this case it was a crash of a single machine in a cluster, and the other instances kept working happily with the problematic resource. That's where my "Isn't it a bit harsh to prevent startup for just one problematic transaction?" remark originated.

I clearly see that in general Bitronix should err on the side of safety. I also see that in theory a single 'partially committed' transaction could cause serious damage down the road for the rest of the system. However, the other machines in the cluster would first have to know about the problem before they could refuse further actions on the resource. How would that work?
And if it doesn't work, transaction manager instances on other machines will happily keep working with the resource, which of course makes the strict 'full recovery on startup' prerequisite of Bitronix a bit of a moot point.


lorban commented Apr 18, 2016

Hi Erwin,

There was no manual intervention on the first resource.

It is possible that the DB was misconfigured (or maybe that's the default
config?) in such a way that it automatically takes a commit or rollback
decision on in-doubt transactions after a (probably configurable) timeout;
I believe the Oracle RECO process is capable of such a thing. I'm no Oracle
XA expert, but the docs (here:
https://docs.oracle.com/cd/B19306_01/server.102/b14231/ds_txnman.htm#i1007799
and there:
https://blogs.oracle.com/db/entry/oracle_support_master_note_for_troubleshooting_managed_distributed_transactions_doc_id_1006641)
seem to indicate that there are plenty of things that can be done at that
level, and that the DB could be recording information about any heuristic
decision it made, which may help you track down what really happened.
FYI, some databases have weird side effects when stored procedures
perform a commit or a rollback while executed in an XA transaction context: I
vaguely remember that some versions of Sybase ASE used to forcibly roll back
the XA transaction's local branch when a stored procedure called rollback.
Maybe Oracle has something similar, but I honestly don't know. I would
strongly advise you to engage Oracle Support and ask them to help you
figure out what went on and provide recommendations about what to do.

In our case, operational support people have insufficient knowledge to do
that. Consequently the machine will refuse to restart until the problem is
resolved, which happens quite a bit later, after consulting with, for
instance, development people.

Strictly speaking, the above problem is caused by a lack of documentation /
procedure for crash recovery. It is "well-known" that XA transactions can
cause all sorts of problems that might need manual intervention, so any
operational guide for an application that makes use of XA transactions should
contain a chapter explaining what to do after a reported recovery failure.
While we're on that topic, are the sysadmins aware that the BTM journal
(both *.tlog files) should lie on safe storage and should be backed up
as part of normal DB backup routines? Since they record the decisions,
losing those files means BTM would blindly roll back everything it sees
during a crash recovery, as per the XA "presumed abort" optimization.
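
For reference, the journal locations can be pointed at safe storage before the TM starts. A sketch using BTM's Configuration setters, assuming they mirror the bitronix.tm.journal.disk.logPart1Filename / logPart2Filename properties; the paths are illustrative:

```java
import bitronix.tm.Configuration;
import bitronix.tm.TransactionManagerServices;

public class JournalSetup {
    public static void main(String[] args) {
        Configuration conf = TransactionManagerServices.getConfiguration();
        conf.setLogPart1Filename("/safe-storage/btm1.tlog"); // illustrative paths
        conf.setLogPart2Filename("/safe-storage/btm2.tlog");
        // The journal is opened when the TM starts, so configure it first.
        TransactionManagerServices.getTransactionManager();
    }
}
```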

Realistically speaking, I understand your grief and do understand that this
resource quarantining feature isn't working the way it should in a
clustered environment, due to the way journaling is sharded. I wish I had
the time to write a true clustered store working on top of the Terracotta
server that would solve that problem, but I unfortunately don't.

I do agree that the BTM behavior and docs leave a bit to be desired in this
particular case, but honestly speaking, if BTM hadn't prevented that
server from starting and had emitted an error log instead, would you
or your team have heard about the problem at all? And if yes, would you
have analyzed it so deeply?

Understand me here: I'm not trying to blame anyone, nor to over-zealously
defend BTM: I'm just listing cold facts and asking genuine questions. Under
a support contract, I would have immediately done whatever was necessary to
fix the BTM pitfall you report here, which I agree should be addressed. But
I would also have pointed out your lack of a crash-recovery procedure (and
maybe monitoring / backup gaps) to hopefully help you prevent that problem
from ever happening again.

Ludovic



klr8 commented Apr 18, 2016

About what went wrong: we're actually using MySQL in this app, which, as you know, has notoriously lackluster XA support. In this transaction MySQL was the first participating resource, while Redcurrant (http://www.redcurrant.io/) was the second. However, Redcurrant actually piggy-backs its XA support on top of MySQL, so that's also a bit iffy.

The sysadmins are indeed aware of the requirements related to properly managing the journal, but you do have a point that they need better documentation and training as far as dealing with transaction recovery goes. BTM's strict recovery policy did of course force us to deal with the problem immediately. Furthermore, since just one instance in our cluster was affected, no real damage was done.

You've more than adequately answered my questions, providing some much-appreciated conceptual insight into this whole matter. I consider this question answered and will close the ticket.

Again, thank you for all the feedback on this! I owe you a beer next time we meet IRL! :-)

klr8 closed this as completed Apr 18, 2016

lorban commented Apr 18, 2016

I have no idea why I believed you actually were using Oracle, there's no trace of any kind referring to Oracle in any of your messages.

If you're willing to contribute a fix, please feel welcome to submit a PR. If you feel heroic, we can even discuss you becoming an official maintainer. ;)
