Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cannot dial webtransport-go server #258

Closed
Tracked by #2422
achingbrain opened this issue Mar 4, 2024 · 23 comments · Fixed by #267
Closed
Tracked by #2422

Cannot dial webtransport-go server #258

achingbrain opened this issue Mar 4, 2024 · 23 comments · Fixed by #267

Comments

@achingbrain
Copy link
Contributor

I'm trying to dial a very simple echo server using webtransport-go but the opening handshake always times out so the session never becomes ready.

I've put a small repro case together here.

Running npm start will:

  1. Start a webtransport-go echo server as a separate process
  2. Dial the server using the client from @fails-components/webtransport.
  3. Starts a bidirectional stream
  4. Write one buffer
  5. Close the writable
  6. Read from the readable until it closes.

npm run browser starts the webtransport-go server but then spits out some JS you can paste into a browser to run the client portion of the test.

This works in Chrome but in node it fails with a handshake timeout after a short delay.

% npm start 

> webtransport-dial-go-server@1.0.0 start
> node index.js

SERVER start
SERVER ready
CLIENT create session https://127.0.0.1:12345/echo
CLIENT wait for session
WebTransportError: Opening handshake failed.
    at Timeout._onTimeout (file:///Users/alex/Documents/Workspaces/achingbrain/webtransport-go-server/node_modules/@fails-components/webtransport/lib/client.js:91:11)
    at listOnTimeout (node:internal/timers:573:17)
    at process.processTimers (node:internal/timers:514:7) {
  [Symbol(Symbol.toStringTag)]: 'WebTransportError'
}
@martenrichter
Copy link
Member

It will take some until I can look into it.
You may want to put some debug code into the socket classes on the js side.
Then one can see, if at least packets are being sent or received, to sort out if the ports or firewall issues occur.

@achingbrain
Copy link
Contributor Author

This appears to be a regression as dialling the go-webtransport server works as expected with @fails-components/webtransport@0.3.1.

I'm pretty sure there are no port of firewall issues as the only change to make to get it to work is to change the @fails-components/webtransport version in package.json, remove node_modules and reinstall deps.

@martenrichter
Copy link
Member

Sure, this is correct. Let me explain the change.
Before 1.0 libquiche used its own sockets, it now uses node.js sockets. So this code is completely replaced by standard node.js code. So the regression may happen in this completely rewritten code. That is why I am asking to put some debugging code in serversocket.js, clientsocket.js etc. Since this is the new code.
(Btw. I hoped, that this replacement solves more issues then introducing new issues.)

@martenrichter
Copy link
Member

Probably, the problems are caused by these lines:

@achingbrain
Copy link
Contributor Author

I've made the following changes:

diff --git a/transports/http3-quiche/lib/clientsocket.js b/transports/http3-quiche/lib/clientsocket.js
index 5bfa22d..a2c0836 100644
--- a/transports/http3-quiche/lib/clientsocket.js
+++ b/transports/http3-quiche/lib/clientsocket.js
@@ -49,6 +49,7 @@ export class Http3WebTransportClientSocket extends Http3WebTransportSocket {
           //
         })
         this.socketInt.on('message', (msg, rinfo) => {
+          console.info('clientsocket incoming', msg, rinfo)
           this.cobj.recvPaket({
             msg,
             rinfo,
@@ -56,6 +57,16 @@ export class Http3WebTransportClientSocket extends Http3WebTransportSocket {
           })
         })
         this.socketInt.bind(this.localPort)
+
+        const send = this.socketInt.send
+
+        // @ts-expect-error incorrect types
+        this.socketInt.send = (...args) => {
+          console.info('clientsocket outgoing', ...args)
+          // @ts-expect-error incorrect types
+          return send.call(this.socketInt, ...args)
+        }
+
         setImmediate(() => this.cobj.onCanWrite())
       })
       .catch((error) => {

If I run the example I get:

% DEBUG=websocket*, DEBUG_TRACE=* npm start

> webtransport-dial-go-server@1.0.0 start
> node index.js

SERVER start
SERVER ready
CLIENT create session https://127.0.0.1:12345/echo
CLIENT wait for session
clientsocket outgoing <Buffer c3 00 00 00 01 08 fa ab e7 73 c0 3f f0 29 00 00 44 d0 75 9d c7 63 76 4f f6 63 ba e6 04 0c 25 0f 01 68 4a 4a ec fd 5e 7d 04 40 71 15 b5 1e ec 7f 80 91 ... 1200 more bytes> 0 1250 12345 127.0.0.1 [Function: bound packetSendCB]
clientsocket incoming <Buffer ca 00 00 00 01 00 04 cd d1 77 00 00 42 3e 63 17 bc e2 1f 6e f4 8a b6 18 a6 6b f0 c1 82 7f 95 ba 28 09 c9 4b 2b b1 8c bb d9 40 1a a9 37 ec 28 ce 3f 47 ... 1202 more bytes> { address: '127.0.0.1', family: 'IPv4', port: 12345, size: 1252 }
clientsocket outgoing <Buffer ee 00 00 00 01 04 cd d1 77 00 00 40 3d 35 d1 0b 09 cc a8 79 84 b7 6e c9 b0 05 82 7b e6 9f 1e 57 18 9d ad 80 8e 02 9b 6b 1e 80 0b 62 50 87 f4 ab 24 4b ... 114 more bytes> 0 164 12345 127.0.0.1 [Function: bound packetSendCB]
clientsocket incoming <Buffer 47 3d 8f fd 7f 4b 4b b2 5e 45 b2 e8 13 c5 37 8e 05 c1 f1 ef 2e 62 39 78 20 9c be a1 c2 d6 5b> { address: '127.0.0.1', family: 'IPv4', port: 12345, size: 31 }
clientsocket incoming <Buffer 43 5a b8 ba 5d bd 9b 12 e4 b4 c0 0c 12 e2 cd fe e1 a9 f6 35 bf 29 cb 43 37 db ee 8c ad 8a 0d 91 f7 f7 41 17 98 43 30 46 d6 ae 2d 80 cc 06 18 ae 4b ad ... 246 more bytes> { address: '127.0.0.1', family: 'IPv4', port: 12345, size: 296 }
clientsocket outgoing <Buffer 4c cd d1 77 00 7c 69 e6 fe b2 8c 33 04 c9 d4 8d 4e d7 85 d6 f6 41 1a af 46 88 fd> 0 27 12345 127.0.0.1 [Function: bound packetSendCB]
WebTransportError: Opening handshake failed.
    at Timeout._onTimeout (file:///Users/alex/Documents/Workspaces/achingbrain/webtransport-go-server/node_modules/@fails-components/webtransport/lib/client.js:91:11)
    at listOnTimeout (node:internal/timers:573:17)
    at process.processTimers (node:internal/timers:514:7) {
  [Symbol(Symbol.toStringTag)]: 'WebTransportError'
}

@martenrichter
Copy link
Member

Ok, that means at least network communication works.
Then, it must be something within the handshake process. I can debug it at the weekend (Friday is already a holiday).
Anyway, there is another possibility: quiche dropped an older WebTransport version; if I had a quick guess, I would look there. You should also consider increasing the quiche logging level; it is now possible to see the C++ logging by passing an argument to create the first object that uses libquiche.

@martenrichter
Copy link
Member

martenrichter commented Mar 5, 2024

Another comment, if you want to see what is going on:


and the following lines handle the different stages of connecting. So with printfs you can analyze, where it is stuck.

@achingbrain
Copy link
Contributor Author

I think maybe the bug is that Http3Client::handleConnecting() returns recheck as true to get the client to recheck the session, but in Http3ClientJS::recvPaket the return is negated when set as client_->connectionrecheck_?

That is, after the first packet is received, the recheck variable is set here.

Then after the second packet is received client_->connectionrecheck_ is false so client_->handleConnecting() is not called again.

If I remove the negation it seems to work?

@martenrichter
Copy link
Member

Yes, this can be... But if it is wrong, it must also be changed in other places.
Are the unit tests all passing regardless of whether it is changed?
Then, the packets will probably arrive so fast in unit tests that they do not have to be rechecked.
It makes sense, as this is the part I had to rewrite.

@achingbrain
Copy link
Contributor Author

They all still pass if I remove the negation here but some fail if I also remove the negation here.

@martenrichter
Copy link
Member

Ok, then I have to look into it.
You nailed it down, but the connection process will probably starve somewhere if some fail.
I believe I did not return the correct recheck value for some cases. (You may add a debug message before recheck is set to determine in which situation it breaks.)
Anyway, the connection process is seriously broken.

@achingbrain
Copy link
Contributor Author

I have to agree, I can't make it work in anything but the most trivial demo app so I think there are certainly some timing bugs in there.

For example my ultimate goal here is to integrate @fails-compononents/webtransport with this js-libp2p PR and even after removing the first negation, it's possible for the onReady callback of HttpWTSession to never be invoked, which means the visitor isn't being set so I think it's possible for the connection to get into a state where that branch is missed.

@martenrichter
Copy link
Member

I am away from my dev machine.
But


maybe the problem. I believe it must be a true.

@achingbrain
Copy link
Contributor Author

That seems to make the tests pass with both negations removed.

Does this need to be false too:

if (checkSession())
{
recheck = true;
}

@achingbrain
Copy link
Contributor Author

achingbrain commented Mar 5, 2024

I do wonder if the checkSession method needs a bit of attention, it's possible for it to return true even if the wtsession has not had it's visitor set, which seems to be the reason the onReady is never called in js-land.

Edit: I'm not sure this is true now, I can see the visitor being set but onReady is still not called.

@martenrichter
Copy link
Member

That seems to make the tests pass with both negations removed.

Does this need to be false too:

if (checkSession())
{
recheck = true;
}

No I think this is correct, if we assume that checkSession returns a value for rechecking.

I do wonder if the checkSession method needs a bit of attention, it's possible for it to return true even if the wtsession has not had it's visitor set, which seems to be the reason the onReady is never called in js-land.

That, I agree. We probably need another bool indicating whether a session has been acquired.
In the moment, it is only rechecking, if there are other pending streams, but this may change.

@martenrichter
Copy link
Member

I am starting to work on the problem.
First, without your example, I have recognized that if I only run the bidirectional streams, they also fail, as you said. Btw. so you know a way to influence the order of the tests, if the bidi streams always run first, I would have recognized the problems earlier.

@martenrichter
Copy link
Member

Here are my attempts. I haven't tested the Go server yet, but I found another issue: The add-on was not loaded quickly enough.
https://github.com/fails-components/webtransport/tree/fix_http3_client_connect

@martenrichter martenrichter linked a pull request Mar 8, 2024 that will close this issue
@martenrichter
Copy link
Member

Ok, the changes discussed seem to work.
But before merging, it needs to be tested in my real-world application, as the fixed timings in the test do not uncover all problems.
One thing I recognized is that the output shows the following:

SERVER start
SERVER ready
CLIENT create session https://127.0.0.1:12345/echo
CLIENT wait for session
CLIENT session ready
CLIENT create bidi stream
CLIENT get writer
CLIENT wait for writer
CLIENT write
CLIENT close writer
CLIENT read from stream
CLIENT got from stream { value: Uint8Array(4) [ 0, 1, 2, 3 ], done: false }
CLIENT read from stream

So, the fin is not received. I suspect the echo server is the source of the problem, but I will recheck if a fin is sent and/or received. (These echo applications are not trivial; both Firefox and Chrome had bugs that I had to file and patch, but for go I am afraid I have no motivation).

@martenrichter
Copy link
Member

Also, the real-world application seems to work. merging next...

@martenrichter
Copy link
Member

And regarding the fin, it does not seem to be reported by the go code.

@achingbrain
Copy link
Contributor Author

And regarding the fin, it does not seem to be reported by the go code.

Good news on that front - it was user error - the go server needs to explicitly close streams rather than io.Copy doing it implicitly as I'd thought.

Thanks for fixing this up.

@martenrichter
Copy link
Member

Ok, that is good.... so no work to do here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants