Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: Pending inbound jobs counter is not set to proper value after connection's SerialExecutor shut down #2065

Closed
Pinioo opened this issue Sep 12, 2022 · 15 comments

Comments

@Pinioo
Copy link

Pinioo commented Sep 12, 2022

Version info

The issue was faced while using Scandium in version 3.6.0 as a DTLS Server.

Failing scenario with the root cause

During a handshake, the server sends a CertificateRequest, and the client responds with a datagram that contains 4 records, starting with an invalid Certificate record (it contains no certificate).

DTLSConnector puts a new job on the connection's serial executor for every record, decrementing pendingInboundJobsCountdown. Job scheduled for Certificate record throws HandshakeException causing connection's serial executor to shut down, therefore counter is decremented 4 times and incremented only once (incrementing happens at the end of scheduled runnables that couldn't be executed).

Fix

I have a piece of code that seems to fix this problem and I'm going to issue a PR shortly.

@Pinioo Pinioo changed the title Bug: Pending inbound jobs counter is not set to proper value after shutdowning connection's SerialExecutor Bug: Pending inbound jobs counter is not set to proper value after connection's SerialExecutor shut down Sep 12, 2022
@boaks boaks added the bug label Sep 12, 2022
@boaks
Copy link
Contributor

boaks commented Sep 12, 2022

Nice catch.

An alternative idea would be additional pending-in/out counter per connection.
That would also allow, to limit pending jobs on single connections.

WDYT?

@Pinioo
Copy link
Author

Pinioo commented Sep 13, 2022

I was also considering a similar solution but I find it more difficult to inject into the codebase (and I didn't want to introduce any significant changes).

If you think a solution with counters per connection is better and more flexible, I can implement it tomorrow.

@boaks
Copy link
Contributor

boaks commented Sep 13, 2022

I guess, it requires some changes to "add to queue and count or reject and don't count" in an atomic way. That may require something in ExecutionListener or SerialExecutor or in both.
I hope I can spend some time tomorrow.

If you think a solution with counters per connection is better and more flexible, I can implement it tomorrow.

If checked your PR and it's not bad. But give me the time tomorrow to consider an alternative solution/implementation.

boaks added a commit to boaks/californium that referenced this issue Sep 14, 2022
Ensure, that jobs removed on shutdown, are considered.

Fixes issue eclipse-californium#2065

Signed-off-by: Achim Kraus <achim.kraus@cloudcoap.net>
boaks added a commit to boaks/californium that referenced this issue Sep 14, 2022
Ensure, that jobs removed on shutdown, are considered.

Fixes issue eclipse-californium#2065

Signed-off-by: Achim Kraus <achim.kraus@cloudcoap.net>
@boaks
Copy link
Contributor

boaks commented Sep 14, 2022

FMPOV, the QueueingListener makes it easier to add additional control over jobs.
The PendingJobthen uses that control to do the counting.
(I didn't go for the additonal per-connection limit, maybe we do that later.)

@Pinioo
Copy link
Author

Pinioo commented Sep 14, 2022

I went through your PR and I like it, nice approach!

@boaks
Copy link
Contributor

boaks commented Sep 14, 2022

OK, I will polish it and add some unit-tests.
And I will check, if that could be applied to 2.7/2.8.

boaks added a commit to boaks/californium that referenced this issue Sep 14, 2022
Ensure, that jobs removed on shutdown, are considered.

Fixes issue eclipse-californium#2065

Signed-off-by: Achim Kraus <achim.kraus@cloudcoap.net>
boaks added a commit to boaks/californium that referenced this issue Sep 14, 2022
Ensure, that jobs removed on shutdown, are considered.

Fixes issue eclipse-californium#2065

Signed-off-by: Achim Kraus <achim.kraus@cloudcoap.net>
boaks added a commit to boaks/californium that referenced this issue Sep 15, 2022
Ensure, that jobs removed on shutdown, are considered.

Fixes issue eclipse-californium#2065

Signed-off-by: Achim Kraus <achim.kraus@cloudcoap.net>
boaks added a commit to boaks/californium that referenced this issue Sep 15, 2022
Ensure, that jobs removed on shutdown, are considered.

Fixes issue eclipse-californium#2065

Signed-off-by: Achim Kraus <achim.kraus@cloudcoap.net>
boaks added a commit to boaks/californium that referenced this issue Sep 15, 2022
Ensure, that jobs removed on shutdown, are considered.

Fixes issue eclipse-californium#2065

Signed-off-by: Achim Kraus <achim.kraus@cloudcoap.net>
boaks added a commit to boaks/californium that referenced this issue Sep 15, 2022
Ensure, that jobs removed on shutdown, are considered.

Fixes issue eclipse-californium#2065

Signed-off-by: Achim Kraus <achim.kraus@cloudcoap.net>
boaks added a commit to boaks/californium that referenced this issue Sep 15, 2022
Run pending jobs on connection shutdown to align the counters.

Fixes issue eclipse-californium#2065

Signed-off-by: Achim Kraus <achim.kraus@cloudcoap.net>
@boaks
Copy link
Contributor

boaks commented Sep 15, 2022

Working on unit tests to ensure the proper function, I found, that the jobs are not only handed over to the SerialExecutor. There was also an issue left with the head of the jobs of the SerialExecutor, which has already been removed from the queue there, but it was not safe that these jobs are processed proper.

I replaced therefore the approach by one, which ensures, that the pending jobs are called on connection shutdown and the actual executed functions are blocked. With that, only the "onDequeueing" is executed.

I've tested that approach and will continue to do some load-tests tomorrow.

boaks added a commit to boaks/californium that referenced this issue Sep 15, 2022
Run pending jobs on connection shutdown to align the counters.

Fixes issue eclipse-californium#2065

Signed-off-by: Achim Kraus <achim.kraus@cloudcoap.net>
boaks added a commit to boaks/californium that referenced this issue Sep 17, 2022
Run pending jobs on connection shutdown to align the counters.

Fixes issue eclipse-californium#2065

Signed-off-by: Achim Kraus <achim.kraus@cloudcoap.net>
boaks added a commit to boaks/californium that referenced this issue Sep 17, 2022
Run pending jobs on connection shutdown to align the counters.

Fixes issue eclipse-californium#2065

Signed-off-by: Achim Kraus <achim.kraus@cloudcoap.net>
@boaks
Copy link
Contributor

boaks commented Sep 19, 2022

The LimitedRunnableis no split into a separate class in order to "reuse" it.

I consider to schedule a 2.7.4 release for Wednesday 21.9. and a 3.7.0 for Thursday 22.9.
It's also possible to postpone that to Tuesday 27.9. and Wednesday 28.9., if you prefer to test it more.

@Pinioo
Copy link
Author

Pinioo commented Sep 19, 2022

I'm currently checking the solution in action, so I'm going to give you feedback later today.

@Pinioo
Copy link
Author

Pinioo commented Sep 19, 2022

Everything looks fine to me. If you are okay with releasing it this week, it would be great to see it.

Thank you very much for your engagement!

@boaks
Copy link
Contributor

boaks commented Sep 19, 2022

Thanks for reporting this nasty bug.

If nobody else raises an other opinion, I will do the releases as scheduled.

boaks added a commit that referenced this issue Sep 19, 2022
Run pending jobs on connection shutdown to align the counters.

Fixes issue #2065

Signed-off-by: Achim Kraus <achim.kraus@cloudcoap.net>
@boaks
Copy link
Contributor

boaks commented Sep 23, 2022

Fixed with PR #2068. Contained in minor release 3.7.0.

Please close this bug, after confirming that it "works for you".

@TerenceYi
Copy link

What's the impact of using PSK(Pre-shared key) during a handshake from the view server side?

@boaks
Copy link
Contributor

boaks commented Nov 16, 2022

Until now, I didn't found such a vulnerability for PSK on the server side.
The intended process flow of a PSK client flight 5 should not stop processing before the Finish.
The left risk seems to be, that someone finds an additional bug, maybe using some malicious messages, which then causes to abort the processing of a flight.
To minimize that risk, this bugfix is required.
Are there any topics, which blocks you using 3.7.0 or 2.7.4?

@boaks boaks reopened this Nov 16, 2022
@TerenceYi
Copy link

Appreciate your comments, we are working on the impact and will plan the upgrade accordingly

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants