
AMQNET-656 failover stalls without reconnecting in some cases - ensure an await statement is not inside a lock statement #59

Closed
wants to merge 6 commits

Conversation

brudo (Contributor) commented Jan 27, 2021

This is in followup to #45. It appears that the Task.Run that was removed from the PR may have been needed after all. At least, I have seen failover fail to reconnect occasionally in our environment, and it appears to be related to this.

I can't offer a full repro to illustrate that the issue was present and that this change resolves it, as it is very timing-dependent. However, errors are to be expected when an await statement is inside a lock statement.

See also AMQNET-656, opened for this issue, and related cases AMQNET-624 and AMQNET-626, both closed.

brudo (Contributor, Author) commented Jan 27, 2021

So far I have tested this locally - no regressions on initial connect, and I have not observed any stalls yet when failover is induced by stopping / starting brokers.

michaelpearce-gain (Contributor):

@brudo this seems to have failed one of the CI builds. Also a quick note: we've rewritten a lot of the internals ready for NMS 2.0, see #56. Without a test case, it's not possible to validate that we won't regress your change. You may want to look there and send a PR to that branch, as it's likely we will merge it soon.

brudo (Contributor, Author) commented Feb 10, 2021

@michaelpearce-gain I wasn't able to replicate the issue with Travis CI. Not sure what's going on there. I was able to build in Ubuntu 18.04 and run the test suite, without error, using the same commands as from the script section in travis.yml.

I'll look at creating a PR for the 2.0.x branch with the same change, but hope to sort out this issue first, or it will likely repeat over there.

michaelandrepearce (Contributor):

@brudo a small note: the 2.0.x work is now merged here: https://github.com/apache/activemq-nms-amqp/tree/2.0.x. I think the aim is to release the API in the next week or two, with the implementation following after that.

brudo (Contributor, Author) commented Feb 12, 2021

I don't think the additional commit is what fixed it - in theory it should have the same run time behavior either way.

michaelandrepearce (Contributor):

@Havret wdyt

michaelandrepearce (Contributor):

@lukeabsent would you mind quickly looking at this as a second pair of eyes?

Havret (Contributor) commented Feb 16, 2021

I'm not sure if this solves anything. Currently you will leave the lock on the first await. With the proposed fix you will leave it instantly. In either case the lock doesn't work as it should. I'm not sure why https://issues.apache.org/jira/browse/AMQNET-626 was closed, so I reopened it.

As I said in #45

I think we should revisit this part of code (mainly TriggerReconnectionAttempt) after we hopefully deliver the initial release. Calling async method without await or GetResult() is always a red flag.

This was a bit too naively ported from Java and definitely requires a second look. A more reliable implementation might look as follows but this wouldn't be a simple change.

brudo (Contributor, Author) commented Feb 17, 2021

@Havret thanks for your comments. The issue is that when you hit that first await, on a thread that is currently in a synchronized block anywhere up the call stack, that is already an invalid condition that may lead to deadlocks (and has been observed doing so - assuming I identified the right root cause). I agree that Task.Run is not the most elegant way to resolve it, but it might be the right solution for patching 1.8.x.

A more correct way, assuming this kind of synchronization is still required in 2.0.x, would be to change to use an async-friendly alternative such as a SemaphoreSlim wherever the synchronized statement appears. Then Task.Run would not be needed. I'd be willing to help with a change like that for 2.0.x as a separate PR.
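A minimal sketch of that SemaphoreSlim idea, assuming a hypothetical coordinator class (the class and method names are illustrative, not taken from the NMS codebase):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class ReconnectCoordinator
{
    // SemaphoreSlim(1, 1) acts as an async-friendly mutex: unlike
    // lock/Monitor it is not thread-affine, so it is safe to hold
    // it across an await.
    private readonly SemaphoreSlim _sync = new SemaphoreSlim(1, 1);
    private int _attempts;

    public int Attempts => _attempts;

    public async Task TriggerReconnectionAttemptAsync()
    {
        await _sync.WaitAsync();
        try
        {
            _attempts++;          // only one caller at a time gets here
            await Task.Delay(50); // stand-in for the actual reconnect work
        }
        finally
        {
            _sync.Release();
        }
    }
}
```

Concurrent callers are serialized without blocking a thread while they wait, which is what lock cannot give you once an await is involved.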

I don't see any locks of any kind in your Artemis-specific client, but putting the recovery loop in its own long-running task also provides a form of synchronization / coordination, as that Task will only be running on one thread at a time. I like that approach.

Havret (Contributor) commented Feb 17, 2021

Hi @brudo, what you're describing here could theoretically happen, when there is an await inside of a lock. In reality, the compiler prevents you from writing this kind of statement. It simply doesn't compile. You can give it a try. :) What we are dealing with here though is a slightly different situation. TriggerReconnectionAttempt is invoked inside of a lock in a fire and forget fashion. On the first await inside of it, it will simply return from the caller's perspective. The continuation will kick off on another thread, but the calling code will not be aware of that. This may cause some problems, but a deadlock is not one of them.

The solution that I proposed is a rather standard approach to synchronization/coordination when one needs to deal with asynchronous code.

brudo (Contributor, Author) commented Feb 17, 2021

@Havret the compiler will catch it if the await appears directly inside the same code block where the lock statement appeared; but as far as I understand, it's still invalid to await when inside a lock, even in the cases where the compiler can't detect it - as in the case where the await occurs in a different block, that gets invoked from inside the lock.

Havret (Contributor) commented Feb 17, 2021

@brudo Please take a look at the following snippet:

using System;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    private static object syncRoot = new object();

    static async Task Main(string[] args)
    {
        lock (syncRoot)
        {
            Console.WriteLine($"Before not awaited on thread {Thread.CurrentThread.ManagedThreadId}.");
            NotAwaitedMethod();
            Console.WriteLine($"After not awaited on thread {Thread.CurrentThread.ManagedThreadId}.");
        }

        await Task.Delay(TimeSpan.FromSeconds(2));
    }

    private static async Task NotAwaitedMethod()
    {
        Console.WriteLine($"Inside before await on thread {Thread.CurrentThread.ManagedThreadId}.");
        await Task.Delay(TimeSpan.FromSeconds(1));
        Console.WriteLine($"Inside after await on thread {Thread.CurrentThread.ManagedThreadId}.");
    }
}

The output will be as follows:

Before not awaited on thread 1.
Inside before await on thread 1.
After not awaited on thread 1.
Inside after await on thread 4.

In order to preserve the async-await behavior, you need to push it all the way up the call stack. If you do that and at the same time try to fool the compiler by changing the lock statement to Monitor.Enter/Monitor.Exit, we would in fact end up with a deadlock.

@Havret Havret closed this Feb 17, 2021
@Havret Havret reopened this Feb 17, 2021
brudo (Contributor, Author) commented Feb 18, 2021

@Havret I will have to study this further, look at some IL, etc. I do note that "Inside after await" would / could be on a different thread (scheduled on the thread pool), while all the other WriteLines would be running on the initial thread (which might not be on the thread pool). And, now as I think about it, it is the "other side" (on that other thread) which actually has the await / continuation on it.

Havret (Contributor) commented Feb 18, 2021

@brudo The most important thing in this particular case is that "After not awaited" will be on the same thread as "Before not awaited"; this is what guarantees that there won't be any deadlock here.

I've updated my example a little bit. I'm not sure if it would be cost-effective to dive into IL. This is just how async-await works.

brudo (Contributor, Author) commented Feb 18, 2021

@Havret the thread IDs are consistent with what I said. Good, so I understand that much at least! But fair enough, without any await, there is at least a guarantee that the lock will be exited from the same thread where it was entered. I'll have to try to find more evidence or a more complete theory, or better yet a repro, of what actually caused it to stall in our case.

Havret (Contributor) commented Feb 18, 2021

Unfortunately I don't know how you're using this library. I recall that we were experiencing some deadlock situations, but they were caused by the Spring.NET wrapper that we are using for our services. With Spring.NET it is unfortunately quite easy to shoot yourself in the foot. :(

Have you tried taking a memory dump and seeing where your threads are actually stuck?

lukeabsent (Contributor) commented Feb 22, 2021

Had a little look, and overall it seems there could be more than one Reconnect() running at the same time (just one was, I believe, the original intention), and that is one risk. If you suspect a deadlock happening around the lock(SyncRoot) in HandleProviderError, it may be possible because there is a provider.Close inside, though deeper inside that seems to time out after 1 minute.
The other risk is that HandleProviderError looks like it may be called from the underlying AsyncPump (AmqpNetLite) thread (inline execution); if it locks or waits on something, the pump stops pumping, and if the thing being waited on was supposed to be delivered by the AsyncPump, then there is a deadlock.
In general, it looks like this requires some work; it would be good to be able to reproduce it in a test.
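The second risk has a recognizable shape. As an illustration (a hypothetical sketch, not the actual AmqpNetLite code): a single pump thread that invokes handlers inline cannot deliver anything else while a handler blocks, so a handler that waits for something only the pump can deliver deadlocks it.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

class Pump
{
    private readonly BlockingCollection<string> _frames = new BlockingCollection<string>();
    public event Action<string> OnFrame;

    // A single pump thread invokes handlers inline, so a handler
    // that blocks stops all further frame delivery.
    public void Start() => new Thread(() =>
    {
        foreach (var frame in _frames.GetConsumingEnumerable())
            OnFrame?.Invoke(frame);
    }) { IsBackground = true }.Start();

    public void Deliver(string frame) => _frames.Add(frame);
}

// Deadlock shape: if the handler for an "error" frame blocks waiting
// for a "close-ack" frame, that frame can only be dispatched by the
// (now blocked) pump thread, so it never arrives.
```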

brudo (Contributor, Author) commented Feb 22, 2021

@Havret in our case, the stall occurs very rarely, and I do not have a memory dump from when it last occurred. The failover AMQP connection is initialized from a thread pool thread. We do have some logging code in ConnectionInterruptedListener and ConnectionResumedListener events, which currently runs synchronously - I wonder if that could be a factor as it might affect the timing.

I will try to create a test case and add to FailoverIntegrationTest.cs and see if I can get the issue to occur more consistently.

michaelpearce-gain (Contributor):

@brudo any update on your progress?

michaelandrepearce (Contributor):

@brudo we have merged the 2.0.x feature into master and will be looking to cut a release this week. Is there any progress on this we should hold up on? If not we will proceed with 2.0.0 as is. And come back to this after

brudo (Contributor, Author) commented Apr 20, 2021

Sorry for the delay following up... no, other things have taken priority and I did not have any progress on this. I won't be able to switch to NMS 2 until activemq-nms-openwire is also updated - right now it is missing the async methods defined in the 2.0.x interfaces.

michaelpearce-gain (Contributor) commented Apr 20, 2021

@brudo the OpenWire NMS client isn't the most active at the moment. Apart from bug fixes and bits like ensuring it works with the latest language support (e.g. .NET Standard), I'm not seeing much activity adding new features to it. Most recent effort is going into AMQP for NMS, as AMQP became an industry-standard protocol supported by both the ActiveMQ 5 (Classic) and ActiveMQ Artemis brokers, as well as many other commercial offerings such as Solace, IBM MQ, Azure, etc.

We will probably move ahead with cutting the NMS AMQP 2.0 client soon. I believe some final edge cases picked up by the integration tests still need fixing; given time, I suspect one of the main current contributors will pick this up in the next month.

brudo (Contributor, Author) commented Apr 20, 2021

@michaelandrepearce We had dropped openwire support from our product and pivoted to AMQP when we were migrating to .NET Standard and .NET Core, but had to add it back in due to the impacts of AMQNET-656.

Perhaps if we can get a test and a fix in for this failover issue included in NMS 2.0 AMQP, then we could pull the plug on the openwire support again, particularly if it will not be moving back into a more-active status.

Although I don't have any deep knowledge of the AmqpNetLite implementation at all, what @lukeabsent observed about the AsyncPump thread making synchronous calls to HandleProviderError sounded like a very strong lead toward the root cause. I'll take a look.

brudo (Contributor, Author) commented Apr 21, 2021

I tried to see if I could understand this better, including digging into AmqpNetLite. In FailoverProvider, the underlying provider and its listener are cleared before calling provider.Close(), while Amqp.AmqpObject.NotifyClosed protects against calling back more than once from the same instance, so it's probably safe from triggering a second call to HandleProviderError there.

In any case my initial approach with adding a call to Task.Run may be more heavy-handed than it needed to be, and might be in the wrong place for all we know. I notice there is not a single Task.Run, other than the one I added, and the ones in the tests.

One approach that doesn't use Task.Run might be something like this: have a dedicated background thread, that waits for a manual reset event that is set when a reconnect is required. I don't know if that is better, though. It guarantees only one reconnect at a time, and doesn't tie up a thread from the thread pool, and doesn't require blocking the AsyncPump, but uses a kernel handle and an idle thread per failover provider instance.
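A hypothetical sketch of that dedicated-thread idea (the names are illustrative, not taken from FailoverProvider; the reconnect work is passed in as a delegate for clarity):

```csharp
using System;
using System.Threading;

class ReconnectLoop : IDisposable
{
    // ManualResetEvent is kernel-handle based, as noted above.
    private readonly ManualResetEvent _reconnectRequested = new ManualResetEvent(false);
    private readonly Action _reconnect;
    private readonly Thread _thread;
    private volatile bool _running = true;

    public ReconnectLoop(Action reconnect)
    {
        _reconnect = reconnect;
        _thread = new Thread(Loop) { IsBackground = true, Name = "failover-reconnect" };
        _thread.Start();
    }

    // Never blocks, so it is safe to call from a transport callback.
    public void RequestReconnect() => _reconnectRequested.Set();

    private void Loop()
    {
        while (_running)
        {
            _reconnectRequested.WaitOne();
            _reconnectRequested.Reset(); // coalesce triggers that raced in
            if (!_running) break;
            _reconnect(); // only this one thread ever runs the reconnect
        }
    }

    public void Dispose()
    {
        _running = false;
        _reconnectRequested.Set();
        _thread.Join();
        _reconnectRequested.Dispose();
    }
}
```

Multiple triggers that arrive while a reconnect is in progress coalesce into one further pass, so at most one reconnect runs at a time and none are lost.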

The next step though should be to come up with a test that confirms the issue... hopefully the test will point to the right way to fix it, and if it's a quick fix that stands up to review, then perhaps it will make it into the 2.0 release. I'll be looking into that this afternoon.

brudo (Contributor, Author) commented Apr 23, 2021

Thus far I haven't been successful in creating a unit test that would fail with a timeout indicating a stall. I wonder if other changes in 2.0.x might have cleared up the issue. However, it's at least as likely that my test attempts were flawed, e.g. that synchronization within the test may have interfered with the results.

brudo (Contributor, Author) commented Jul 26, 2021

Although I did not fully understand the root cause, and the change I proposed in this PR may not have been the correct way to fix it, a stall was nevertheless observed where failover did not occur when the current active broker crashed and another broker took over.

It seems the following bug in AMQPNetLite, addressed in 2.4.3, could have been the true root cause: Azure/amqpnetlite#460 "NotifyClosed not being called on connection loss detected by HeartBeat..."

If so, the issue should no longer occur in 1.8.2 as it now references AMQPNetLite 2.4.3. Likewise, even before 1.8.2 is released, perhaps a workaround could be to reference AMQPNetLite 2.4.3 explicitly.

@michaelandrepearce @lukeabsent @michaelpearce-gain

michaelpearce-gain (Contributor):

@brudo thanks for the update. It's good news, in a way, that it turns out to be an issue in the underlying AmqpNetLite, as then, as you say, people can patch this themselves by forcing a dependency update. I know that 1.8.2 is in the process of release; @Havret has done two RCs, but I think a third is needed to fix some bits like headers. @Havret, are you able to give @brudo an idea of how long until we fix the RC and go again on votes, to hopefully get 1.8.2 out? Or are we now just focused on 2.0.0?

Havret (Contributor) commented Jul 27, 2021

@michaelandrepearce I will submit RC3 on Friday.

michaelandrepearce (Contributor) commented Sep 21, 2021

@brudo I'm closing this out now that 1.8.2 is released. Re-open if you disagree.
