Why do resource recovery in ResourceRegistrar.register? #59
Comments
Hi, Let's first start by trying to answer your "simple" question. When a resource is added to the ResourceRegistrar, it is recovered because According to the exception you posted, BTM is refusing to work with that
BTM should have logged an error message looking like "unable to commit You might find it harsh that BTM now refuses to work with your resource, I hope this helps. Ludovic On Fri, Apr 15, 2016 at 8:56 AM, Erwin Vervaet notifications@github.com
Hello Ludovic, Thank you for your insightful answer. You were almost right as far as your analysis of what happened goes. There was indeed an in-flight transaction involving two resources when we had a full crash of the machine. From the data I have, it seems the crash happened during the 2nd phase of the 2PC protocol, right after the first resource committed but before the second one was able to commit. There was no manual intervention on the first resource. When the machine restarted, Bitronix restarted and tried to recover the transaction. At that point the 2nd resource refused to commit, which is of course problematic and not Bitronix's fault. I agree that for the data involved in this f***-up transaction we need manual intervention after studying the whole situation. In our case, operational support people have insufficient knowledge to do that. Consequently the machine will refuse to restart until the problem is resolved, which is quite a bit later, after consulting with, for instance, development people. In this case it was a crash of a single machine in a cluster, and the other instances kept working happily with the problematic resource. That's where my "Isn't it a bit harsh to prevent startup for just one problematic transaction?" remark originated from. I clearly see that in general Bitronix should err on the side of safety. I also see that in theory a single 'partially committed' transaction could cause serious damage down the road for the rest of the system. However, the other machines in the cluster would first have to know about the problem before they would be able to refuse further actions on the resource. How would that work?
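To make the failure window described above concrete (a crash in phase 2, after the first resource committed but before the second one could), here is a minimal, self-contained simulation. It is only a sketch: `ToyResource` and `run` are hypothetical names invented for this illustration, and nothing here uses or reflects Bitronix's or JTA's actual classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of a two-phase commit interrupted between two commit calls. */
class TwoPhaseCommitCrashDemo {

    /** Hypothetical stand-in for an XA resource; not a Bitronix or JTA type. */
    static class ToyResource {
        final String name;
        String state = "ACTIVE";

        ToyResource(String name) {
            this.name = name;
        }

        void prepare() {
            state = "PREPARED";
        }

        void commit() {
            state = "COMMITTED";
        }
    }

    /**
     * Runs 2PC over the resources and simulates a machine crash once
     * crashAfterCommits resources have committed in phase 2.
     * Returns "name=state" for each resource, in order.
     */
    static List<String> run(List<ToyResource> resources, int crashAfterCommits) {
        // Phase 1: every resource prepares (votes to commit).
        for (ToyResource r : resources) {
            r.prepare();
        }
        // Phase 2: commit one by one until the simulated crash.
        int committed = 0;
        for (ToyResource r : resources) {
            if (committed == crashAfterCommits) {
                break; // the machine dies here; later resources stay prepared
            }
            r.commit();
            committed++;
        }
        List<String> states = new ArrayList<>();
        for (ToyResource r : resources) {
            states.add(r.name + "=" + r.state);
        }
        return states;
    }

    public static void main(String[] args) {
        List<ToyResource> resources =
                Arrays.asList(new ToyResource("first"), new ToyResource("second"));
        // Crash after exactly one commit, as in the incident described above.
        System.out.println(run(resources, 1)); // prints [first=COMMITTED, second=PREPARED]
    }
}
```

After the simulated crash, the first resource is committed while the second is still only prepared. That is the state recovery has to resolve on restart, and if the second resource then refuses to commit, the transaction is left partially committed.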
Hi Erwin, There was no manual intervention on the first resource. It is possible that the DB was mis-configured (or maybe that's the default In our case, operational support people have insufficient knowledge to do Strictly speaking, the above problem is caused by a lack of documentation / Realistically speaking, I understand your grief and do understand that this I do agree that the BTM behavior and doc leaves a bit to be desired in this Understand me here: I'm not trying to blame anyone nor to over-zealously Ludovic On Sat, Apr 16, 2016 at 7:04 PM, Erwin Vervaet notifications@github.com
About what went wrong: we're actually using MySQL in this app, which, as you know, has notoriously lackluster XA support. In this transaction MySQL was the first participating resource, while Redcurrant (http://www.redcurrant.io/) was the second. However, Redcurrant actually piggy-backs its XA support on top of MySQL, so that's also a bit iffy. The sysadmins are indeed aware of the requirements related to properly managing the journal, but you do have a point that they need better documentation and training as far as dealing with transaction recovery goes. BTM's strict recovery policy did of course force us to deal with the problem immediately. Furthermore, since just one instance in our cluster was affected, no real damage was done. You've more than adequately answered my questions, providing me with some much appreciated conceptual insights into this whole matter. I consider this question answered and will close the ticket. Again, thank you for all the feedback on this! I owe you a beer next time we meet IRL! :-)
I have no idea why I believed you actually were using Oracle, there's no trace of any kind referring to Oracle in any of your messages. If you're willing to contribute a fix, please feel welcome to submit a PR. If you feel heroic, we can even discuss you becoming an official maintainer. ;)
Simple (potentially dumb) question: why does the ResourceRegistrar.register() method attempt resource recovery if the transaction manager is already running?
The reason I ask is the following:
Our application fails to start up because a single transaction causes a RecoveryException. This exception is raised when we try to register a resource with the transaction manager. At that point the transaction manager is already running because it was started using the Tomcat BTMLifecycleListener, and consequently ResourceRegistrar.register() attempts recovery:
Caused by: bitronix.tm.recovery.RecoveryException: error recovering resource '1460720339996_3D_DOCCLE' due to an incompatible heuristic decision
    at bitronix.tm.recovery.IncrementalRecoverer.recover(IncrementalRecoverer.java:94) ~[btm-2.1.2.jar:2.1.2]
    at bitronix.tm.resource.ResourceRegistrar.register(ResourceRegistrar.java:78) ~[btm-2.1.2.jar:2.1.2]
    at grid.storage.transaction.btm.PoolingSessionFactory.buildXAPool(PoolingSessionFactory.java:68) ~[honeycomb-bitronix-1.8.23.jar:na]
    at grid.storage.transaction.btm.PoolingSessionFactory.init(PoolingSessionFactory.java:52) ~[honeycomb-bitronix-1.8.23.jar:na]
    ... 42 common frames omitted
So because of this the application cannot register the resource with the transaction manager and of course needs to fail. It seems a bit harsh to prevent full application startup just because of a recovery failure on a single transaction.
Is there a better way to handle this kind of issue?
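For readers trying to picture the flow described in this question, the following self-contained sketch models it: registering a resource while the transaction manager is already running triggers recovery for that resource, and a recovery failure aborts the registration. Every type here (`ToyRegistrar`, `ToyResource`, `ToyRecoveryException`) is a hypothetical stand-in written for this illustration. It models the behaviour as reported above and is not Bitronix's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of "register triggers recovery when the manager is already running". */
class RegistrationRecoveryDemo {

    /** Hypothetical exception; only named after the one in the stack trace. */
    static class ToyRecoveryException extends RuntimeException {
        ToyRecoveryException(String message) {
            super(message);
        }
    }

    /** Hypothetical resource that may hold a transaction recovery cannot complete. */
    static class ToyResource {
        final String name;
        final boolean hasUnresolvableTransaction;

        ToyResource(String name, boolean hasUnresolvableTransaction) {
            this.name = name;
            this.hasUnresolvableTransaction = hasUnresolvableTransaction;
        }
    }

    /** Hypothetical registrar; assumed behaviour, based only on the report above. */
    static class ToyRegistrar {
        private final boolean transactionManagerRunning;
        private final List<String> registered = new ArrayList<>();

        ToyRegistrar(boolean transactionManagerRunning) {
            this.transactionManagerRunning = transactionManagerRunning;
        }

        void register(ToyResource resource) {
            // Recovery is only attempted here when the manager is already running;
            // recovery at manager startup is not modelled in this sketch.
            if (transactionManagerRunning && resource.hasUnresolvableTransaction) {
                throw new ToyRecoveryException(
                        "error recovering resource '" + resource.name + "'");
            }
            registered.add(resource.name);
        }

        List<String> registered() {
            return registered;
        }
    }

    public static void main(String[] args) {
        ToyRegistrar registrar = new ToyRegistrar(true);
        registrar.register(new ToyResource("healthy", false));
        try {
            registrar.register(new ToyResource("stuck", true));
        } catch (ToyRecoveryException e) {
            System.out.println("registration refused: " + e.getMessage());
        }
        System.out.println(registrar.registered());
    }
}
```

In this model the healthy resource registers normally, while the resource with the unresolvable transaction is refused and never becomes usable. This mirrors the startup failure reported in the question.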