Skip to content

Loading…

Reports not being sent. #2

Closed
satyamshekhar opened this Issue · 24 comments

2 participants

@satyamshekhar
Collaborator

I found a bug yesterday, https://github.com/dhruvbird/node-xmpp-bosh/blob/master/src/session.js#L975 always evaluates to false. No, reports are ever sent -- since node.attrs.ack is not an integer.

@dhruvbird
Owner

dang! We should sanitize all user sent attributes before we even start processing them.

@satyamshekhar
Collaborator

:).. totally.. however when I fixed this a lot reports were being sent continuously.. since two response are sent simultaneously (initially) and the client is never able to recover. We should do this with a timeout.

@dhruvbird
Owner

didn't get this bit...

@satyamshekhar
Collaborator

actually there was another bug.. https://github.com/dhruvbird/node-xmpp-bosh/blob/master/src/session.js#L975 -- I changed line from if (node.attrs.ack < this.max_rid_sent && this.unacked_responses[node.attrs.ack]) to if (node.attrs.ack < this.max_rid_sent && this.unacked_responses[node.attrs.ack + 1]), which I know realise is not correct either. response corresponding to rid=node.attrs.ack will always be deleted in handle acks. So this will never evaluate to true. We need to decide when should we send a report and without how much timeout.

@dhruvbird
Owner

I don't see what the problem is??

@dhruvbird
Owner
@satyamshekhar
Collaborator

I am really sorry for not being clear. I have this tendency.

Check out the change I made here 9f0b359#L0L977 (though this is a mistake). Lets consider the code as it was before this change. We have three issues -

  • First is sanitation of all the input attrs of a request, since we never make node.attrs.ack an integer. (wondering how/why does acking work correctly)

  • bug in the code: 9f0b359#L0L973

This line is wrong since we are checking for this.unacked_responses[node.attrs.ack]. This will always be false since while handling acks we delete from unacked responses the response for which we have received an ack (all response for rid <= node.attrs.ack).

  • Even if we correct the above two issues, it still wont work properly for clients that dont have acks, since for such clients we assume the ack to rid - window. Due to this assumption we will end up sending a report everytime (since ack < max_rid_sent).

-- Also, we should see if there is a certain amount of time has pass before we enqueue a report. For example -

  1. server is holding r1.
  2. server sends something on r1.
  3. before receiving r1 the client sends something on r2.
  4. server sends a report for r1 (r1 wont be acked).

Probably its an extreme edge case - and also the cost is only one extra response being sent from the server in case it has nothing to send.

@dhruvbird
Owner
@satyamshekhar
Collaborator

Right, so what I had done was a mistake - will correct.

That means for a client with no acks we will end up sending a lot of reports, since we assume that for such a client ack = rid - window. Maybe we should disable acks for such clients.

Its just that at times there will an extra RTT incurred since we sent a report when it was not required. Shouldn't we wait for some time out (~5 secs: > RTT/2) before sending a report?

@dhruvbird
Owner

hmm... yes, for non-ack clients, we'll land up sending a report every time.

It makes sense to just have a flag which check if acks are enabled for this client and not do anything if acks aren't enabled (session.ack should reflect this).

As far as sending a report is concerned, we can do it immediately. To save traffic, we can wait for 1 rid instead of a difference of 0. i.e. If the client has ACKed 1 less than the current, we can live with it. Introducing a delayed event would probably complicate things since a lot of state would need to be updated if the ACK does come in within that window. (trying to keep it simple) - what do you think?

@satyamshekhar
Collaborator

I was thinking about it differently - Cant we send a report only if the time we sent the prev response is greater than a threshold (~5secs)?

@dhruvbird
Owner

Dunno if that's a good idea since potentially many packets can be exchanged within that time frame. Plus, the point of a report is to let the client know sooner than later about a potentially missed response... Basically, I'm trying to stay away from anything that involves "guessing" a good constant/time value since that is where things tend to go - esp. when there is a decent deterministic solution.

@satyamshekhar
Collaborator

I know what you mean.. Cool..

Also, the timestamp we send in the report is the timestamp for the acked response. Shouldn't it be timestamp for the least rid unacked response?

@dhruvbird
Owner

"In this case it SHOULD include a 'report' attribute set to one greater than the 'ack' attribute it received from the client, and a 'time' attribute set to the number of milliseconds since it sent the response associated with the 'report' attribute."

I am guessing that "the response associated with the 'report' attribute" is the same as ack from client + 1.

@satyamshekhar
Collaborator

Ya.. so I think we were sending timestamp for node.attrs.ack - the change currently in the master is the change I made thinking we should check for node.attrs.ack + 1 instead of node.attrs.ack -- which you corrected.

The situation is like this:
i) We check if we have node.attrs.ack in our unacked_responses - so that we dont keep sending the same report again and again.
ii) We check if we have node.attrs.ack + 1 in our unacked_responses - so that we can to send time stamp for that.

So, now this brings us to the case where we dont have node.attrs.ack + 1 in our unacked responses(the request got lost) but max_rid_sent > node.attrs.ack. How do we handle this?

@dhruvbird
Owner

Something doesn't seem right here.

Suppose we just responded on rid=11. The previous rid=10. Before our response (rid=11) reached the client, it ACKed rid=10 on (request) rid=12.

The check node.attrs.rid(10) < max_rid_sent(11) will succeed and a report will be sent out. This is because we are processing acknowledgments after the reports are sent out. Doesn't this seem wasteful?

Also, do you think it's right to process ACKs ''after'' handling the 'report sending?

@dhruvbird
Owner

After giving this some more thoughts, these are my observations:

  1. We should process RID/unacked_responses updates before sending out the report.

  2. To avoid too many reports being sent, we should check if (rid < max_rid_sent - 1 && unacked_responses[rid + 1]) -- @satyamshekhar you were right about the latter conditional all along. Additionally, we need only worry about the object in unacked_responses[rid + 1] since THAT is the earliest unacknowledged response (according to our logic).

  3. There is an invariant that we MUST maintain on unacked_responses. Namely: if unacked_responses[k] is truthy, then unacked_responses[k .. max_rid_sent] MUST also be truthy. This allows us to freely index unacked_responses and avoid a situation where unacked_responses[rid+1] is NOT present, but unacked_responses[rid+2] is, etc...

Do you think this makes more sense?

@satyamshekhar
Collaborator

Consider the following situation:

If we have a check for rid < max_rid_sent - 1

  1. The client sent request 10 and 11.
  2. Server replied on 10 and held 11.
  3. Before wait period the server sent something on the response for 11 which got lost.
  4. Now the client sends request 12, he ll ack request 10.
  5. max_rid_sent is 11, ack is 10. -- we should send a report for this.

I think the check (acked) rid < max_rid_sent was correct. The only problem with this is that in an ongoing chat, we might end up enqueing a lot of reports - and thus degrading our performance -- since a lot requests will be exchanged,

The check unacked_responses[rid + 1] or unacked_response[rid] seemed similar to me initially but they are not. Ideally, we should have these methods(enqueue_report and handle_acks) independent of each other -- their order should not matter. We achieve this by using node.attrs.ack + 1. Another, consequence of using rid + 1 is that we might send multiple reports for the same rid, if rid + 1 is not acked. However, if we use the check unacked_responses[rid] it forces us to send only one report.

The ideal solution to the problem would be to immediately send a report if we receive a request, which doesn't ack a response that we had sent after even after RTT secs (its median) -- this can be many times. Otherwise, we dont enqueue any report. We need the check for unacked_responses for misbehaving clients(in case it has already acked it before).

@dhruvbird
Owner

max_rid_sent is 11, ack is 10. -- we should send a report for this.

According to the spec, we should. Do we have any numbers on how often this happens for well-behaved clients with a good network? (false negatives?)

I agree with you about the fact that their order should not matter. However, this is true if we use the node.attrs.ack+1 logic instead of keying on node.attrs.ack since as you mentioned if we use the latter, and we delete ACKed responses before handling 'report', we shall never find the unacked response.

I guess it is "okay" to send out multiple reports (barring performance). I mean it's not a correctness issue. In fact, it is "correct" to send multiple reports. The reason being that the first report might be lost.

@satyamshekhar
Collaborator

According to the spec, we should. Do we have any numbers on how often this happens for well-behaved clients with a good network? (false negatives?)

No numbers yet, but are reports meant for good networks. Shouldn't they optimize bad networks without hindering performance for good networks?

I am still biased towards the timeout solution :). Though, I also get why you think its not at all clean.

@dhruvbird
Owner

Maybe we need to create another "report queue" to queue up all report-type messages. Reports can be clubbed with normal responses. Send a report along with a normal response if one is to be sent?

@satyamshekhar
Collaborator

We en-queue reports with normal responses right now as well. Why do you think a separate queue is required?

@dhruvbird
Owner

We can (potentially) club:
1. A normal xmpp response
2. A bosh response (not report type)
3. A 'report' type bosh response

Currently, we don't distinguish between [2] & [3].

@dhruvbird
Owner

@satyamshekhar This is almost fossilized. Is there anything actionable that came out of this? Closing for now. Re-open if needed.

@dhruvbird dhruvbird closed this
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.