Client Stream Producers are not being removed #1421
This is the same issue as #982. As in #982, I am asking to have a test with a clear repro first, and then we can discuss the solution. The problem fundamentally stems from the fact that we don't track client liveness status, as we do for silos, so we can't easily and authoritatively decide whether a client is up or down. One solution I thought about is to remove the producer, as you suggested, but have the client re-subscribe to all its streams upon reconnecting. There are drawbacks to that approach as well, which is why it is not clear-cut what the right thing to do is, or whether there is a better solution.
At first glance #982 appears to be the same issue. However, to be clear, I will describe our use case.
A Windows service connects as a client; upon receiving an external network connection, it begins calling OnNextAsync (on a single common streamId).
A WinForms client also connects as a client.
The issue appears when the Windows service (producer) is either restarted or has to reinitialize. I await confirmation of this statement from our testing team. I agree a repro with this use case would be a good idea. I also have not attempted to create a unit test using the TestingSilo to see if the use case can be replicated that way.
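The producer side of this use case can be sketched roughly as follows, using the Orleans 1.x client API. The provider name "SMSProvider", the "events" namespace, and the GUID are placeholders for illustration, not values taken from this thread:

```csharp
using System;
using System.Threading.Tasks;
using Orleans;
using Orleans.Streams;

public static class ProducerService
{
    // The single common streamId mentioned above; the value here is a placeholder.
    private static readonly Guid CommonStreamId = Guid.Parse("00000000-0000-0000-0000-000000000001");

    public static async Task OnExternalConnectionAsync(string payload)
    {
        // The Windows service is assumed to have called GrainClient.Initialize(...) earlier.
        IStreamProvider provider = GrainClient.GetStreamProvider("SMSProvider");
        IAsyncStream<string> stream = provider.GetStream<string>(CommonStreamId, "events");

        // Producing onto the common stream registers this client as a
        // producer in the pub/sub system -- the registration at issue here.
        await stream.OnNextAsync(payload);
    }
}
```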
I have confirmed that the producer client is NOT calling GrainClient.UnInitialize on its graceful exit. However, fixing this would only be a mitigation, as the producer client could crash out, correct?
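For reference, the missing cleanup on graceful exit would look roughly like this (a sketch; as noted above, a crashed client never reaches this code, so it is only a mitigation):

```csharp
// On graceful shutdown, tearing down the client gives the cluster a chance
// to clean up this client's pub/sub registrations in a timely way.
try
{
    // ... stop producing and flush any pending stream events ...
}
finally
{
    // Orleans 1.x client teardown.
    GrainClient.Uninitialize();
}
```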
Attempted a fix in "Remove dead client producer from pub sub #1429"
@gabikliot
@amccool - indeed, this happens when the "(producer) is either restarted, or has to reinitialize." The same would hold for a client consumer. Let's do the repro test first, and then get back to discussing the solution options.
I'm trying to think if it's possible to repro this within the Does the
At an earlier point I put forward a possible solution (#1429) that was not ideal. It removed client producers from the pubsub system if the client had been dropped. The main problem with this is that clients may reconnect after being dropped, at which time events they generated could be lost, as they'd been removed as producers from the pubsub system. This is certainly a problem, but IMO less harmful than the current behavior wherein any client that fails to unregister as a producer permanently blocks all future subscriptions.

I am of the view that a solution akin to #1429 is preferable to the existing behavior, and we shouldn't let the incomplete nature of the solution block it. While this issue describes a problem with subscribing to streams produced by Orleans clients, I believe a similar problem exists for producing onto streams consumed by Orleans clients, though in that case the application can choose to ignore the KeyNotFound exceptions, so it can be less harmful.
The KeyNotFoundException (to be changed to ClientNotAvailableException) is only thrown after the client is unreachable for 1 minute. This logic, which drops 'dead clients', is consistent for all activity with clients, not just the streaming infrastructure.
Virtual actor semantics do not strictly apply to clients, so I'm unclear how virtual stream semantics can apply. That is, if a grain obtains a reference that points to another grain, that grain is virtual and should be available even if the silo hosting the target grain's activation goes down. This is not true for a reference to a client. Clients are not 'virtual': if one becomes unavailable, it is unavailable. I'm not of the opinion that it's reasonable (or even possible) to support the same stream semantics for clients as for grains. I believe it is necessary to have clear semantics, but I don't think it's necessary to have the same semantics for clients.
Of course they are not, and no one suggested supporting "virtual clients"; I am not sure how you reached that conclusion. That all said, you might be right that a first step in the solution could be to start like you did here. I am not actually suggesting ANY solution yet. The outcome of our discussion may be to start with what you did here, or maybe we will find a better way.
@gabikliot I'm getting the impression that I should avoid using a grain client as a stream producer. Are the stream semantics (based on looking at While we do want to find a gold-plated solution, I am asking whether users should design around this issue.
I would definitely recommend against the strategy of avoiding the use of a grain client as a stream producer, and even more strongly against "designing around this issue". The reason I did not fix it myself a while ago is that I want @jason-bragg and @sergeybykov to be at the crux of this fix and be part of the different tradeoff decisions we will have to make, as they are the ones who would need to maintain that code later on. In the meantime, you can help by forming a test that reproduces the problem. @jason-bragg, @sergeybykov, we can chat over Skype tonight and discuss the various options for the fix.
@gabriel, if you’re available, let’s try the Skype call tomorrow night. I agree that we should fix the exception first, then get tests set up, but for triage reasons we’re trying to get to a point where we have a viable plan to fix this sooner rather than later. Once we have general agreement on what the streaming semantics should be on the client, and an approach that seems viable, it will be easier to scope and prioritize the work needed to resolve this.

Semantics
I agree in principle, but I’d like to explore what this means in a little more detail. “until he (the client) dies” – what does this mean? As you’ve previously stated, “The problem fundamentally stems from the fact that we don't track client liveness status, as we do for silos, so we can't easily and authoritatively decide if the client is up or down.” I suggest the following semantics for client live/dead status. With these semantics, the streaming semantics you described become clearer: any stream consumers or producers should function until the client ‘dies’. If the application is still running but the Orleans client is ‘dead’, the application must reinitialize the client and re-establish stream producers and consumers before functionality will be restored.

Detailed technical solution
Some scenarios to consider:
- Client connects to several gateways. A silo hosting one of the gateways cycles. The client would reconnect, but the gateway on the silo would not recognize it, and would declare it dead.
- Client connects to several gateways. The client loses connection to a single gateway. That gateway declares the client dead, but since the client can’t reach it and all other gateways are fine, neither the client nor the other gateways have any way of knowing it has been declared dead. (Should the client declare itself dead if it can't reach a gateway?)
- Client loses connection and dies. Each gateway declares it dead, but without the client to communicate this to the other gateways in a timely manner, the window of time for all gateways to declare the client dead would vary by the configurable period of inactivity required to declare a client dead.

Thoughts?
@jason-bragg, the solution could be something along those lines. Basically, one of my ideas, which is quite similar to what you suggested (but not identical), is to buffer/store all streams to which the client subscribed/produced (we already have them in the streaming runtime, basically) and re-establish them (reconnect/refresh) every time we connect to a new gateway. In addition, a couple of optimizations can be added to make it more lightweight. There are a couple of complications to that plan, but all doable, I think. Let's chat tomorrow evening.
From the gateway side, we do have a mechanism for determining that a client is dead; it's just not global, it's per gateway. Currently we consider a client dead if a gateway can't communicate with it for more than 1 minute. After this time, we unregister the activation from the directory, and all calls to that client (from streams, in this case) begin to throw the KeyNotFound (ClientNotAvailable) exception. The suggestion for making it global was (7). The client, being the only component currently capable of coordinating between all the gateways, would be responsible for communicating its dead status globally (to all gateways) when any gateway, or the client itself, declares the client dead. A complication with this approach was explored in the second scenario of the "Some scenarios to consider" section.
I agree, how much of a problem this is however is unclear. The application layer must deal with the problem of the Orleans Client losing connectivity with the cluster. This changes how applications handle this case, but in my view does not complicate it significantly. I do agree we should think this through though. A way to think about this is that when a client temporarily loses connection to the cluster the application will start seeing errors, but there is reason to believe that these errors are transient. When the application is notified that the client has died (however we want to communicate that) the application layer now knows that the errors are in fact not transient, and it should act accordingly. If we're adding capabilities for clients and actors to call functionality on the client (like observers and streams, ..) we are supporting a more robust set of client behaviors and cluster integration. The programming model must reflect and support this. Clients are no longer just external entry points for grain calls. The proposed approach is not likely the best we can come up with, but the capabilities that have been added to the client should prompt a reevaluation of the client programming model.
I like this idea, but I suspect that it is less straightforward than it seems. Stream providers are extension points to Orleans, so this must be done in a general way that works with any stream provider developers may devise. How would you suggest we do this? When thinking through the above proposal, I anticipated your interest in this area, as you'd alluded to it in previous posts, and concluded that it would be a natural extension of what was proposed. Assume we've implemented all of the details in the proposal, but we wanted to make it easier for application developers using streams on the client to recover from a connection loss to the cluster. We could add a capability to the client that automatically re-establishes stream links upon re-initialization of the client. This is not exactly what you're suggesting, but it would be comparable, given the change to the client programming model.
@amccool, I promised one week, and with a great push by @jason-bragg this issue has now been fixed.
@jason-bragg @gabikliot hats off to you guys. I've got testers just itching to try this out. I will have to merge some differences between master and 1.1.3 and try to have some results out quickly.
👍
@amccool This is a partial fix that should unblock most users, but we've still got more work to do on this. I'll be adding an issue soon to cover roughly what we're thinking.

With this fix, if a client dies, the streaming infrastructure will remove any producers it had from the pubsub system. However, if the client loses connection with the cluster for a period of time (configurable, but the default is 1 minute) and then reconnects, its producers will be removed but not re-added when it reconnects. This means a streaming client that reconnects after the configured period will likely not function correctly. For this reason, if a client using streaming loses connectivity to the cluster for more than the configured time, it should probably restart.

Regarding your testing, the producers on the dead client are only removed after the client has been disconnected (or dead) for greater than the configured time period (1 minute). Also, I'd be downright giddy if those tests were contributed back to our test harnesses. :) Most of the infrastructure for this (but not all) was added in #1637. We've also not yet done the work to clean up stream consumers on clients that died, only producers.
Does restart mean a full uninitialize and initialize, or just a new
It means a full uninitialize/initialize of the client, and re-acquiring the stream from the stream provider. Basically, anything one acquired from the client (StreamProviders, AsyncObservers, AsyncObservables, ...) becomes invalid after the client is reinitialized, so all of it needs to be re-acquired from the reinitialized client.

My initial suggestion, though, was to cycle the service. I assumed that the Orleans client was probably a web front end with some redundancy in the same data center as the cluster, so connection loss for such an extended period of time is not likely, and in that case cycling the process wouldn't really be much more disruptive than the connection loss that necessitated it. If these assumptions hold, cycling the process would be less work and safer than resetting the grain client and restarting all the producers.
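That full restart sequence can be sketched roughly as follows, under the assumptions above (the provider name "SMSProvider", the "events" namespace, and the observer are placeholders for illustration):

```csharp
using System;
using System.Threading.Tasks;
using Orleans;
using Orleans.Streams;

public static class ClientRecovery
{
    // Full client restart: everything previously acquired from the client
    // (stream providers, streams, subscription handles) is invalid afterwards
    // and must be re-acquired from the reinitialized client.
    public static async Task RestartClientAsync(Guid streamGuid, IAsyncObserver<string> observer)
    {
        GrainClient.Uninitialize();
        GrainClient.Initialize("ClientConfiguration.xml");

        // Re-acquire the stream from the freshly initialized client.
        IStreamProvider provider = GrainClient.GetStreamProvider("SMSProvider");
        IAsyncStream<string> stream = provider.GetStream<string>(streamGuid, "events");

        // Re-establish the consumer side...
        await stream.SubscribeAsync(observer);

        // ...and the producer side resumes by simply producing again.
        await stream.OnNextAsync("first event after restart");
    }
}
```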
@amccool, I want to clarify something. We do need to fix this to be 100% clean, and we will in the near future, but practically I don't think you need to worry about it. I would just optimistically ignore it and NOT bother implementing any of the workarounds, such as the full client re-initialize @jason-bragg suggested.
@amccool I added the clarifications more as information, to provide deeper insight into the behaviors you should expect, as I would want to know these nuances were I running a production system on Orleans. I am not of the opinion that there is anything you need to do to account for the missing functionality. You know your environment and scenarios better than I do, though. :)
So what's the status of this in 1.2.0 beta? I pulled the latest and the stream subscriptions seem much more reliable (I haven't hit that key-not-found exception, so my client has been able to subscribe to streams without me clearing out the stream table so far), but I'm still seeing lots of exceptions in the log.
Is this the full exception in the log, or is there more stack trace?
This has a bit more stack
This seems related, but not the exact same errors as the one I pasted earlier. That's all the information I have logged about them.
I've not found anything in the grain logic itself that would account for this. I've a theory I'm trying to repro anyway. What storage provider are you using?
@onionhammer, in the interim, if this is causing issues in your production environment and you are in great need of a workaround, you can try the following.
Please keep in mind that the above workaround will only help if my current working theory is correct. If you can wait for my confirmation of the issue, it would be safer to do so.
I am indeed using azure table storage. I will try those out tomorrow, and get back, thanks!
I am no longer seeing these errors in the log; I've had DeleteStateOnClear=true set for the past 16 hours, and the log is looking much cleaner. |
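For anyone else hitting this: in an Orleans 1.x XML configuration, that setting is a property on the Azure table storage provider entry, roughly like the following (the provider `Name` and connection string are placeholders, not taken from this thread):

```xml
<StorageProviders>
  <Provider Type="Orleans.Storage.AzureTableStorage"
            Name="PubSubStore"
            DataConnectionString="DefaultEndpointsProtocol=https;AccountName=...;AccountKey=..."
            DeleteStateOnClear="true" />
</StorageProviders>
```

With `DeleteStateOnClear="true"`, clearing grain state deletes the underlying table row instead of leaving a tombstone record behind.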
@onionhammer
Resolved via #1429. |
I'm getting failures when subscribing to a stream from a client.
The use case is a client is the producer, and a client is a subscriber. No custom grains are being used.
The problem is that the producers (the client) are not being removed from the storage data.
So when a subscriber is added, the storage still contains the old producer; the system attempts to contact the old producer, and a failure (System.Collections.Generic.KeyNotFoundException: No activation for client *cli/97a45295) occurs. The error occurs at https://github.com/dotnet/orleans/blob/master/src/OrleansRuntime/Streams/PubSub/PubSubRendezvousGrain.cs#L185
We probably need a better exception for the disappeared client, and then catch that exception and remove the producer.
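A hedged sketch of that proposed fix, loosely modeled on the notification loop in PubSubRendezvousGrain (the exception type `ClientNotAvailableException` and the helper `NotifyProducerAsync` are illustrative names here, not confirmed Orleans APIs):

```csharp
// Sketch: when notifying a producer of a new consumer fails because the
// producing client is gone, remove that producer instead of failing the
// whole subscribe call.
var deadProducers = new List<PubSubPublisherState>();
foreach (var producer in State.Producers)
{
    try
    {
        await NotifyProducerAsync(producer, newConsumer);
    }
    catch (ClientNotAvailableException)
    {
        // The producing client has disappeared; mark it for removal rather
        // than letting it permanently block future subscriptions.
        deadProducers.Add(producer);
    }
}

foreach (var dead in deadProducers)
{
    State.Producers.Remove(dead);
}

// Persist the cleaned-up producer set so future subscribers don't hit
// the dead client again.
await WriteStateAsync();
```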