
Client Stream Producers are not being removed #1421

Closed
amccool opened this issue Feb 10, 2016 · 32 comments

amccool commented Feb 10, 2016

I'm getting failures when subscribing to a stream from a client.
The use case: a client is the producer, and a client is the subscriber. No custom grains are being used.

The problem is that the producers (the client) are not being removed from the storage data.
So when a subscriber is added, the storage still contains the old producer, and the system attempts to contact the old producer and a failure (System.Collections.Generic.KeyNotFoundException: No activation for client *cli/97a45295) occurs.

The error is occurring at https://github.com/dotnet/orleans/blob/master/src/OrleansRuntime/Streams/PubSub/PubSubRendezvousGrain.cs#L185

We probably need a better exception for the disappeared client, and then catch that exception and remove the producer.
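The fix proposed above can be sketched as a small model (illustrative Python, not Orleans code; all names here are hypothetical): a rendezvous registry keeps the set of producers per stream, a call to a vanished producer raises a dedicated exception, and the registry catches it to drop the stale entry instead of failing the subscription.

```python
# Hypothetical model of the proposed behavior: catch a dedicated
# "client not available" exception and remove the dead producer.

class ClientNotAvailableError(Exception):
    """Raised when a registered client can no longer be reached."""

class PubSubRegistry:
    def __init__(self):
        self.producers = {}     # stream_id -> set of producer ids
        self.reachable = set()  # producer ids that can still be contacted

    def register_producer(self, stream_id, producer_id):
        self.producers.setdefault(stream_id, set()).add(producer_id)
        self.reachable.add(producer_id)

    def notify_producer(self, producer_id):
        # Stand-in for the grain call that fails for disappeared clients.
        if producer_id not in self.reachable:
            raise ClientNotAvailableError(producer_id)

    def add_subscriber(self, stream_id):
        """Contact every registered producer; drop the ones that are gone."""
        dead = []
        for producer_id in self.producers.get(stream_id, set()):
            try:
                self.notify_producer(producer_id)
            except ClientNotAvailableError:
                dead.append(producer_id)
        for producer_id in dead:
            self.producers[stream_id].discard(producer_id)
        return dead
```

With this model, a subscriber added after a producer client crashes no longer hits an unhandled KeyNotFoundException; the stale producer entry is simply pruned.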

@gabikliot
Contributor

This is the same issue as #982.

As in #982, I am asking to have a test with clear repro first and then we can discuss the solution.
The reason we did not fix it till now is that the solution is actually not as easy as you propose.
Simply removing a client producer from the pub sub table upon exception has the potential to also remove a temporarily disconnected, but actually alive, client. If/when this client reconnects, we will lose the producer (or consumer) entry for it in the pub sub, which will violate streaming semantics.

The problem fundamentally stems from the fact that we don't track client liveness status, as we do for silos, so we can't easily and authoritatively decide if the client is up or down. One solution I thought about is to remove, like you suggested, but upon client reconnecting, it will re-subscribe to all its streams. There are some drawbacks to that approach as well, and that is why it is not a clear cut what is the right thing to do and if there is a better solution.

@gabikliot gabikliot added the bug label Feb 10, 2016
amccool commented Feb 10, 2016

At first glance #982 appears to be the same issue. However, to be clear, I will describe our use case.

  • No custom grains
  • three silos using SQLServer membership
  • Redis storage for PubSubStore (we have confirmed this is working correctly: Write, Read, and Clear are all good; it uses SerializationManager for binary rather than JSON)
  • using Orleans.Providers.Streams.SimpleMessageStream.SimpleMessageStreamProvider
  • FireAndForgetDelivery="false"

Windows service connects as a client, upon an external network connection it begins using OnNextAsync (to a single common streamId)

  • this windows service is currently NOT calling OnCompletedAsync
  • multiple services run on multiple nodes

WinForm client connects as a client
Calls SubscribeAsync (to the single common streamId)

  • these clients are run remotely on multiple workstations.

The issue appears when a windows service (producer) is either restarted, or has to reinitialize. I await confirmation on this statement from our testing team.

I agree a repro with this use case would be a good idea. I also have not attempted to create a unit test using the TestingSilo to see if the use case can be replicated that way.

amccool commented Feb 10, 2016

I have confirmed that the producer client is NOT calling GrainClient.UnInitialize on its graceful exit.
Rectifying this, and will update this issue with the results.

However, this would only be a mitigation, as the producer client could still crash out, correct?

@jason-bragg
Contributor

Attempted a fix in "Remove dead client producer from pub sub #1429"
Tests look good so far, but I've not run a full pass.

@jason-bragg
Contributor

@gabikliot
Oops, I was looking at this last night, but didn't see your comment until after posting the PR.
I'll leave it up for now, for discussion purposes, but note not to merge it until we've settled on a solution.

@gabikliot
Contributor

@amccool - indeed, this happens when the producer "is either restarted, or has to reinitialize." The same would hold for a client consumer.
Indeed, GrainClient.UnInitialize is only a partial mitigation. In the case of crashes, we will still have the same problem.

Let's do the repro test first, and then get back to discussing the solution options.

amccool commented Feb 12, 2016

I'm trying to think whether it's possible to repro this within the TestingSiloHost context. At least it would be easier to grok and run compared to a unique solution with multiple executables.

Does the TestingSiloHost have a mechanism to spawn or destroy clients?

@jason-bragg
Contributor

At an earlier point I put forward a possible solution (#1429) that was not ideal. It removed client producers from the pubsub system if the client had been dropped. The main problem with this is that clients may reconnect after being dropped, at which time, events they generated could be lost, as they'd been removed as producers from the pubsub system. This is certainly a problem, but IMO less harmful than the current behavior wherein any client that fails to unregister as a producer permanently blocks all future subscriptions. I am of the view that a solution akin to #1429 is preferable to the existing behavior, and we shouldn't let the incomplete nature of the solution block it.

While this issue describes a problem with subscribing to streams produced by Orleans clients, I believe a similar problem exists for producing onto streams consumed by Orleans clients, though in that case, the application can choose to ignore the KeyNotFound exceptions, so it can be less harmful.

@gabikliot

The solution to just remove a client producer from the pub sub table upon exception has a potential to also remove temporally disconnected, but really alive, clients.

The KeyNotFoundException (to be changed to ClientNotAvailableException) is only thrown after the client is unreachable for 1 minute. This logic, which drops 'dead clients', is consistent for all activity with clients, not just the streaming infrastructure.

If/when this client reconnects, we will lose the producer (or consumer) entry for it in the pub sub, which will violate streaming semantics

The virtual actor semantics does not strictly apply to clients, so I'm unclear how the virtual stream semantics can apply. That is, if a grain obtains a reference that points to a grain, that grain is virtual and should be available even if the silo the target grain's activation was on goes down. This is not true for a reference to a client. Clients are not 'virtual'. If one becomes unavailable, it is unavailable. I'm not of the opinion that it's reasonable (or even possible) to support the same semantics for streams for clients as grains. I believe it is necessary to have clear semantics, but I don't think it's necessary to have the same semantics for clients.

@gabikliot
Contributor

The virtual actor semantics does not strictly apply to clients,
I'm not of the opinion that it's reasonable (or even possible) to support the same semantics for streams for clients as grains.

Of course they are not, and no one suggested supporting "virtual clients"; I am not sure how you reached that conclusion.
The problem is not with virtual actors/clients, but with streaming semantics. Our pub sub guarantees that once you have subscribed, and as long as you are alive (which for a grain is forever and for a client is until it dies), you will keep getting events. If a client disconnects and then reconnects, it did not die; it is the same client. But if we do what you suggested, the client will behave as if it had unsubscribed and will stop receiving new events, without the application knowing about it (I am talking here about removing subscribers from the pub sub). Removing producers is a similar/symmetric problem.

That all said, you might be right that a first step in the solution could be to start like you did here. I am not actually suggesting ANY solution yet.
Please look at what I suggested before:
a) Fix that test to throw the right exception, but not remove from the table yet.
b) implement a test
c) discuss possible solutions

The outcome of c may be to start with what you did here, or maybe we will find a better way.

amccool commented Mar 29, 2016

@gabikliot I'm getting the impression that I should avoid using a grain client as a stream producer. Are the stream semantics (based on looking at PubSubRendezvousGrain) more forgiving, if you will, to grain clients as subscribers?

While we do want to find a gold-plated solution, I am asking whether users should design around this issue.

@gabikliot
Contributor

I would definitely recommend against the strategy of avoiding grain clients as stream producers, and even more strongly against "designing around this issue".
It is just a bug; we should fix it quickly and move on.
I believe we can get the fix in within a week, tops, if we put effort into this ASAP.
We put a lot of effort into making streams work on the client the same way as on grains, and it would be very unfortunate to now declare that unusable.

The reason I did not fix it myself a while ago is that I want @jason-bragg and @sergeybykov to be in the crux of this fix, be part of the different tradeoff decisions that we will have to make, as they are the ones who would also need to maintain that code later on.
I can guarantee you that the fix will not be blocked on my side.

In the meanwhile, you can help by forming a test that reproduces that problem.

@jason-bragg, @sergeybykov , we can chat over Skype tonight and discuss the various options for the fix.

@jason-bragg
Contributor

@jason-bragg, @sergeybykov , we can chat over Skype tonight and discuss the various options for the fix.

@gabriel, if you're available, let's try the Skype call tomorrow night.

I agree that we should fix the exception first, then get tests set up, but for triage reasons we're trying to get to a point where we have a viable plan to fix this sooner rather than later. Once we have general agreement on what the streaming semantics should be on the client, and an approach that seems viable, it will be easier to scope and prioritize the work needed to resolve this.

Semantics

Our pub sub guarantees that once you have subscribed, and as long as you are alive (which for a grain is forever and for a client is until it dies), you will keep getting events. If a client disconnects and then reconnects, it did not die; it is the same client.

I agree in principle, but I’d like to explore what this means in a little more detail. “until he (the client) dies” – What does this mean? As you’ve previously stated “The problem fundamentally stems from the fact that we don't track client liveness status, as we do for silos, so we can't easily and authoritatively decide if the client is up or down.”

I suggest the following semantics for client life/dead status.
Consistency – Except for a very narrow window while a client is 'dying', a client is either alive or dead to all gateways and clients.
Single life cycle – A client is alive after successful initialization, but once it dies, it cannot return to the alive state. The application level must initialize a new client (or reinitialize?).
Unilateral – The client or any gateway can declare the client 'dead', and all must respect the declaration.

With these semantics, the streaming semantics you described become clearer. Any stream consumers or producers should function until the client 'dies'. If the application is still running, but the Orleans client is 'dead', the application must reinitialize the client and re-establish stream producers and consumers before functionality is restored.
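The single-life-cycle semantics proposed above can be sketched as a tiny state machine (an illustrative Python model under the stated assumptions, not the Orleans client API): a client is alive after initialization, any party can declare it dead, and once dead that instance can never return to the alive state.

```python
# Minimal model of the proposed client life/dead semantics:
# uninitialized -> alive -> dead, with dead as a terminal state.

class OrleansClientModel:
    def __init__(self):
        self.state = "uninitialized"

    def initialize(self):
        # Single life cycle: a dead instance cannot be revived; the
        # application must construct a new client instead.
        if self.state == "dead":
            raise RuntimeError("dead client cannot be revived; create a new one")
        self.state = "alive"

    def declare_dead(self):
        # Unilateral: the client itself or any gateway may call this,
        # and all parties must respect the declaration.
        self.state = "dead"

    def call(self, request):
        # Application calls throw until a live client exists.
        if self.state != "alive":
            raise RuntimeError("client is not alive")
        return f"handled {request}"
```

The terminal `dead` state is an assumption resolving the "(or reinitialize?)" question in the strictest direction; the alternative would allow `initialize` to reset a dead instance.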

Detailed Technical solution
To support the previously described semantics, I suggest the following.

  1. Replace the KeyNotFound exception thrown by the catalog when attempting to make calls to dead clients with a ClientNotAvailable exception.
  2. Add test cases for stream producer and consumer scenarios wherein the client is lost. Lost consumers and producers should be removed automatically.
  3. Stream infrastructure removes client consumers and producers on ClientNotAvailable exceptions. (Tests in 2 should start passing.)
  4. Update Client/Gateway handshake to introduce separate handshakes for initial connection and reconnection. This allows gateways to distinguish between new clients and reconnecting clients without having to maintain state indefinitely.
  5. Reconnecting clients that are not in the list of live clients are considered dead. The reconnect handshake protocol communicates this to the client.
  6. A gateway that has lost contact with a client for more than a configurable period of time (currently 1 minute) will declare that client ‘dead’. This is currently the case, except the time period is not configurable.
  7. A client, upon receiving notification that it is dead from the reconnect handshake, will notify all other gateways that it is dead and put itself in an uninitialized state. Some sort of application layer notification callback should be supported. Application calls to the client should throw exceptions until the client is reinitialized.
  8. Add tests where client is declared dead, verify that application layer callback is made, and that calls to the client throw exceptions. Verify operation returns to normal once application layer reinitializes the client.
  9. Add tests where client stream producers and consumers are setup, client is declared dead, application layer detects this and reinitializes the client, then reestablishes stream producers and consumers. Verify operation returns to normal.

Some scenarios to consider:

Client connects to several gateways. A silo hosting one of the gateways cycles. The client would reconnect, but the gateway on that silo would not recognize it, and would declare it dead.

Client connects to several gateways. Client loses connection to a single gateway. That gateway declares the client dead, but since the client can't reach it and all other gateways are fine, neither the client nor the other gateways would have any way of knowing it has been declared dead. (Should the client declare itself dead if it can't reach a gateway?)

Client loses connection and dies. Each gateway declares it dead, but without the client to communicate this to other gateways in a timely manner, the window of time for all gateways to declare the client dead would vary by the configurable period of inactivity required to declare a client dead.

Thoughts?

@gabikliot
Contributor

@jason-bragg , the solution could be something along those lines.
The 2 main differences are:
a) We don't have any way or mechanism right now to "declare a client dead". Every GW tracks whether the client is currently connected to it or not, but there is no global notion of client live/dead. We could build something like that, but it might be hard. An option, but maybe too complicated.
b) "Some sort of application layer notification callback should be supported" for the client to re-init itself. I think this is a major problem, since it will significantly complicate the client-side application programming model. I don't think it will fly with @sergeybykov. Ideally we would fix this bug without introducing this new complication to the programming model.

Basically, one of my ideas, which is quite similar to what you suggested (but not identical), is to buffer/store all streams to which the client subscribed/produced (we already have them in the streaming runtime, basically) and re-establish them (reconnect/refresh) every time we connect to a new GW. In addition, a couple of optimizations can be added to make it more lightweight. There are a couple of complications to that plan, but all doable, I think.
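The re-establish-on-reconnect idea can be sketched as follows (an illustrative Python model with entirely hypothetical names; it assumes the client runtime already tracks its stream registrations and that registration is idempotent on the pub/sub side):

```python
# Model of "replay tracked registrations on each gateway connection":
# even if a disconnect removed the client's entries from the pub/sub
# table, reconnecting restores them.

class InMemoryPubSub:
    def __init__(self):
        self.entries = set()  # (kind, stream_id, client_id) triples

    def register(self, kind, stream_id, client_id):
        self.entries.add((kind, stream_id, client_id))  # idempotent

    def drop_client(self, client_id):
        # What the cluster does when it declares a client dead.
        self.entries = {e for e in self.entries if e[2] != client_id}

class ReconnectingStreamClient:
    def __init__(self, pubsub):
        self.pubsub = pubsub
        self.tracked = []  # (kind, stream_id) pairs kept client-side

    def produce(self, stream_id):
        self.tracked.append(("producer", stream_id))
        self.pubsub.register("producer", stream_id, id(self))

    def subscribe(self, stream_id):
        self.tracked.append(("consumer", stream_id))
        self.pubsub.register("consumer", stream_id, id(self))

    def on_gateway_connected(self):
        # Re-register everything this client believes it holds.
        for kind, stream_id in self.tracked:
            self.pubsub.register(kind, stream_id, id(self))
```

This captures why the approach preserves the streaming semantics: a temporarily disconnected client repopulates its own pub/sub entries instead of relying on the cluster never having removed them.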

Let's chat tomorrow evening.

@jason-bragg
Contributor

we don't have any way or mechanism right now to "declare client as dead" - no global notion of client live/dead

From the gateway side, we do have a mechanism for determining that a client is dead, it's just not global, it's per gateway. Currently we consider a client dead if a gateway can't communicate with it for more than 1 minute. After this time, we unregister the activation from the directory and all calls to that client (from streams in this case) begin to throw the KeyNotFound (ClientNotAvailable) exception.
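The per-gateway detection described above amounts to a last-seen timestamp with a configurable timeout (an illustrative Python sketch; the 60-second default matches the 1 minute mentioned in this thread, and times are plain numbers to keep the model deterministic):

```python
# Per-gateway liveness model: a client is considered dead by a gateway
# once it has been silent longer than the configured timeout.

class GatewayLivenessTracker:
    def __init__(self, timeout_seconds=60.0):
        self.timeout = timeout_seconds
        self.last_seen = {}  # client_id -> last contact time

    def heartbeat(self, client_id, now):
        self.last_seen[client_id] = now

    def is_dead(self, client_id, now):
        last = self.last_seen.get(client_id)
        return last is None or (now - last) > self.timeout
```

Note this is exactly the "not global" property discussed: each gateway holds its own `last_seen` map, so two gateways can disagree about a client's status within the timeout window.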

The suggestion for making it global was (7). The client, being the only component currently capable of coordinating between all the gateways, would be responsible for communicating its dead status globally (to all gateways) when any gateway, or the client itself, declares the client dead. A complication with this approach was explored in the second scenario of the "Some scenarios to consider" section.

"complication on the client-side application programming"

I agree; how much of a problem this is, however, is unclear. The application layer must deal with the problem of the Orleans client losing connectivity with the cluster. This changes how applications handle that case, but in my view does not complicate it significantly. I do agree we should think this through, though.

A way to think about this is that when a client temporarily loses connection to the cluster the application will start seeing errors, but there is reason to believe that these errors are transient. When the application is notified that the client has died (however we want to communicate that) the application layer now knows that the errors are in fact not transient, and it should act accordingly.

If we're adding capabilities for clients and actors to call functionality on the client (like observers and streams, ..) we are supporting a more robust set of client behaviors and cluster integration. The programming model must reflect and support this. Clients are no longer just external entry points for grain calls. The proposed approach is not likely the best we can come up with, but the capabilities that have been added to the client should prompt a reevaluation of the client programming model.

re-establish them (reconnect/refresh) every time we connect to a new GW.

I like this idea, but I suspect that it is less straightforward than it seems. Stream providers are extension points to Orleans, so this must be done in a general way to work with any stream provider that developers may devise. How would you suggest we do this?

When thinking through the above proposal, I anticipated your interest in this area, as you'd alluded to it in previous posts, and concluded that it would be a natural extension of what was proposed. Assume we've implemented all of the details in the proposal, but we wanted to make it easier for application developers using streams on the client to recover from a connection loss to the cluster. We could add a capability to the client that automatically reestablishes stream links upon re-initialization of the client. This is not exactly what you're suggesting, but it would be comparable, given the change to the client programming model.

This was referenced Apr 1, 2016
@gabikliot
Contributor

@amccool , I promised one week and with a great push by @jason-bragg this issue has now been fixed.

amccool commented Apr 5, 2016

@jason-bragg @gabikliot hats off to you guys. I've got testers just itching to try this out. I will have to merge some differences between master and 1.1.3 and try to have some results out quickly.

@sergeybykov
Contributor

👍

@jason-bragg
Contributor

@amccool. This is a partial fix that should unblock most users, but we've still got more work to do on this. I'll be adding an issue soon to cover roughly what we're thinking.

With this fix, if a client dies, the streaming infrastructure will remove any producers it had from the pubsub system. However, if the client loses connection with the cluster for a period of time (configurable, but the default is 1 minute) and then reconnects, its producers will be removed but not re-added when it reconnects. This means a streaming client that reconnects after the configured period will likely not function correctly. For this reason, if a client using streaming loses connectivity to the cluster for more than the configured time, it should probably restart.

In regards to your testing, the producers on the dead client are only removed after the client has been disconnected (or dead) for greater than the configured time period (1 minute). Also, I'd be downright giddy if those tests were contributed back to our test harnesses. :) Most of the infrastructure for this (but not all) was added in #1637.

We've also not yet done the work to cleanup stream consumers on clients that died, only producers.

amccool commented Apr 5, 2016

if a client using streaming loses connectivity to the cluster for more than the configured time, it should probably restart.

Does restart mean a full uninitialize and initialize, or just a new OnNextAsync?

@jason-bragg
Contributor

It means a full uninitialize/initialize of the client, and re-acquiring the stream from the stream provider. Basically, anything one acquired from the client (StreamProviders, AsyncObservers, AsyncObservables, ..) would all become invalid after the client is reinitialized, so all would need to be re-acquired from the reinitialized client.
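The "everything becomes invalid" point can be modeled as follows (an illustrative Python sketch with entirely hypothetical names — not the Orleans client API): a restart constructs a fresh client and re-acquires every stream handle from it, reusing nothing from before.

```python
# Model of the full-restart pattern: handles carry the "generation" of
# the client that produced them, so handles from before a restart can
# never be confused with handles from the new client.

class FakeStream:
    def __init__(self, stream_id, generation):
        self.stream_id = stream_id
        self.generation = generation

class FakeClient:
    generation = 0  # class-level counter of client instances

    def __init__(self):
        FakeClient.generation += 1
        self.generation = FakeClient.generation

    def get_stream(self, stream_id):
        return FakeStream(stream_id, self.generation)

def restart_client(stream_ids):
    # Full uninitialize/initialize, modeled as constructing a new client,
    # then re-acquiring every stream handle from the new instance.
    client = FakeClient()
    streams = {sid: client.get_stream(sid) for sid in stream_ids}
    return client, streams
```

The generation tag makes the rule concrete: any cached StreamProvider, AsyncObserver, or AsyncObservable from a previous generation must be discarded, not reused.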

My initial suggestion though was to cycle the service. I assumed that the Orleans client was probably a web front end with some redundancy in the same data center as the cluster, so connection loss for such an extended period of time is not likely, and in that case, cycling the process wouldn't really be much more disruptive than the connection loss that necessitated it. If these assumptions hold, cycling the process would be less work and safer than resetting the grain client and restarting all the producers.

@gabikliot
Contributor

@amccool , I want to clarify something.
The failure case @jason-bragg is talking about is an extremely rare one: the client disconnects from ALL silos for longer than 1 minute and then reconnects. Given that you are probably running with multiple silos, I think it is extremely unlikely you will see that happen. I think even without that fix, you have a 99.9XXX% availability SLA.

We do need to fix that, to be 100% clean, and we will in the near future, but practically I don't think you need to worry about it. I would just optimistically ignore it and NOT bother implementing any of the workarounds, such as the full client re-initialize @jason-bragg suggested.

@jason-bragg
Contributor

@amccool
@gabikliot stated more clearly and directly what I was alluding to: that this should be so rare as not to be an issue, and if it is, a simple restart is probably the most practical approach. Thanks, @gabikliot.

I added the clarifications more as information, to provide deeper insight into the behaviors you should expect, as I would want to know these nuances were I running a production system on Orleans. I am not of the opinion that there is anything you need to do to account for the missing functionality. You know your environment and scenarios better than I do, though :)

@onionhammer
Contributor

So what's the status of this in the 1.2.0 beta? I got the latest, and the stream subscriptions seem much more reliable (I haven't hit that key-not-found exception, so my client has been able to subscribe to streams without me clearing out the stream table so far), but I'm still seeing lots of exceptions in the log:

[2016-04-20 13:23:02.740 GMT    17  ERROR   100513  Catalog 10.0.0.8:11111] !!!!!!!!!! Error calling grain's OnActivateAsync() method - Grain type = Orleans.Streams.PubSubRendezvousGrain Activation = [Activation: S10.0.0.8:11111:198805990*grn/716E8E94/e1d315a1+SMSProvider_SiteStatus@8e1013b1 #GrainType=Orleans.Streams.PubSubRendezvousGrain Placement=RandomPlacement State=Activating]   
Exc level 0: System.NullReferenceException: Object reference not set to an instance of an object.
   at Orleans.Streams.PubSubRendezvousGrain.<OnActivateAsync>d__9.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Orleans.Runtime.Catalog.<CallGrainActivate>d__65.MoveNext()

gabikliot commented Apr 20, 2016

Is this the full exception in the log, or is there more stack trace?

onionhammer commented Apr 20, 2016

This has a bit more stack


Forwarding failed: tried to forward message NewPlacement Request S10.0.0.4:11111:198880849*grn/FAEFC6C8/cd01e998@f36746b0->S10.0.0.4:11111:198880849*grn/716E8E94/11590da3+SMSProvider_DeviceStatus@a6179db0 #4015[ForwardCount=2]: global::Orleans.Streams.IPubSubRendezvousGrain:GetAllSubscriptions() for 2 times after Failed InvokeActivate to invalid activation. Rejecting now. System.NullReferenceException: Object reference not set to an instance of an object.
   at Orleans.Streams.PubSubRendezvousGrain.<OnActivateAsync>d__9.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Orleans.Runtime.Catalog.<CallGrainActivate>d__65.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Orleans.Runtime.Scheduler.SchedulerExtensions.<>c__DisplayClass1_0.<<QueueTask>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Orleans.Runtime.Catalog.<InitActivation>d__50.MoveNext()

This seems related, but it's not the exact same error as the one I pasted earlier. That's all the information I have logged about them.

@jason-bragg
Contributor

I've not found anything in the grain logic itself that would account for this. I have a theory I'm trying to repro anyway.

What storage provider are you using?

@jason-bragg
Contributor

@onionhammer,
I'm convinced the issue is in the AzureTableStorage storage provider, but I'm still working on a repro. I'll create an issue and link this thread when I've reproduced it.

In the interim, if this is causing issues in your production environment and you are in great need of a workaround, you can try the following.

  1. Add the setting DeleteStateOnClear=true to your AzureTableStorage storage provider settings.
  2. Go to table storage and delete any pubsub entities (Table=OrleansGrainState, RowKey=OrleansStreams.PubSubRendezvousGrain) that have empty or missing "Data" and "StringData" properties. If either property has a value, leave it. Remove only the entities that have no data in either property.
  3. Bounce the service.

Please keep in mind that the above workaround will only help if my current working theory is correct. If you can wait for my confirmation of the issue, it would be safer to do so.
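The selection rule in step 2 can be expressed as a small filter (illustrative Python; entities are modeled as plain dicts rather than real Azure Table rows, and the helper name is hypothetical): an entity is a deletion candidate only when both "Data" and "StringData" are empty or missing.

```python
# Filter matching the workaround's deletion rule: keep only entities
# where BOTH "Data" and "StringData" are empty or absent.

def entities_safe_to_delete(entities):
    def empty(entity, prop):
        value = entity.get(prop)
        return value is None or value == "" or value == b""

    return [e for e in entities
            if empty(e, "Data") and empty(e, "StringData")]
```

The point of the two-property check is the "if either property has a value, leave it" caveat: an entity with binary state but no string state (or vice versa) still carries live data and must not be removed.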

onionhammer commented Apr 20, 2016

I am indeed using Azure table storage. I will try those out tomorrow and get back. Thanks!

@onionhammer
Contributor

I am no longer seeing these errors in the log; I've had DeleteStateOnClear=true set for the past 16 hours, and the log is looking much cleaner.

@jason-bragg
Contributor

Documented issue: "Receiving null reference error when accessing grain state" #1703
PR for fix: "Azure table storage provider returning null grain state - fixed" #1702

@jason-bragg
Contributor

@onionhammer
Glad you're unblocked.
Thanks for letting us know about this issue.

@sergeybykov
Contributor

Resolved via #1429.

@ghost ghost locked as resolved and limited conversation to collaborators Sep 29, 2021