Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cannot send on macos #27

Closed
fdagnat opened this issue Nov 25, 2022 · 25 comments · Fixed by #33
Closed

Cannot send on macos #27

fdagnat opened this issue Nov 25, 2022 · 25 comments · Fixed by #33

Comments

@fdagnat
Copy link

fdagnat commented Nov 25, 2022

Hello,

Using a locally built version of bob (using opam pin) on a macos 12.6.1 (x86-64), the send does not work see below:

> bob send <file> -r <our relay> -vv
Password: wheayantelt-moadth
[ 62196][     DEBUG][            bob.pack]: Compute <file>.
[ 62196][     DEBUG][          bob.stream]: End of the file: <file>
[ 62196][     DEBUG][           bob.fiber]: Close the file-description 4
[ 62196][     DEBUG][            bob.pack]: <file> -> 299192bd26006baa2b1704c6924368a21600dbff
[ 62196][     DEBUG][            bob.pack]: Store of <file>.
[ 62196][     DEBUG][            bob.pack]: The PACK stream is ready to be sent.
[ 62196][     DEBUG][           bob.fiber]: Call to connect() (operating system: Unix)
[ 62196][     DEBUG][           bob.fiber]: Set file-descriptor into a non-blocking mode.
[ 62196][     DEBUG][           bob.fiber]: Try to connect to <our relay>
[ 62196][     ERROR][           bob.fiber]: Got an error for a non-blocking connect(): Operation now in progress.
bob: connect(): Operation now in progress.

[ 62196][     DEBUG][           bob.fiber]: Close remaining file-descriptions:
[]

Any hint would be appreciated.

@dinosaure
Copy link
Owner

This seems to be related to the behaviour of connect. We had already reported different behaviour on different platforms (see #13) but it seems that my previous version of me had already suspected a possible error on macOS:

when is_freebsd () (* TODO(dinosaure): check for MacOS *) ->

Thank you for your issue, I am unfortunately not a Mac owner, can you test this version (#28) of Bob which should (a priori) fix your issue?

We should probably also look at Cosmopolitan support. It seems to me that the "bob.com" version works on MacOS (surprising) which describes a behaviour I had not imagined (maybe on MacOS, IsFreeBSD returns true...).

@fdagnat
Copy link
Author

fdagnat commented Nov 28, 2022

Hi,

Testing is still not satisfying. Now it can connect but it only send 64KiB and then terminate.
The debug trace is too big for this forum (1.1 M). You can find it there https://gist.github.com/fdagnat/ae32ea9e64fc3f465d8c909a7b2a5c48

Have a good day

@dinosaure
Copy link
Owner

it only send 64KiB and then terminate.

I will try to figure out why. It will take a time because it seems that this bug appears only on MacOS (and I need to talk to some friends to reproduce & test your case). Can you tell me which version of OS X do you have and which Mac (to be able to reproduce then)?

@fdagnat
Copy link
Author

fdagnat commented Nov 29, 2022

As already said above, I run macos 12.6.1 on a x86-64 mac. More precisely it is a macbook pro 16 of 2019 with a 2,3 GHz Intel Core i9 (8 cores).

@dinosaure
Copy link
Owner

With the help from @kit-ty-kate (who she has a MacOS too), I can understand a bit more what is going on. It seems that problem is much more related to how we read the given file than the transmission and/or the protocol (which is a kind of good news). More concretely, it seems that bob only read 64Kib of your file and terminate by an End_of_file here:

bob/lib/stream.ml

Lines 748 to 752 in f898b50

Fiber.read ~len fd >>= function
| Ok `End ->
Log.debug (fun m ->
m "End of the file: %a" Bob_fpath.pp path);
Fiber.return r

(Confirmed by your log here).

This signal can come from read() or a possible double-close of the file-descriptor or simply from an exception leak. Unfortunately, I am blindfolded on this issue and need more information on Bob's behaviour on macOS. In the PR proposed above, I have added a commit that adds logs for the three cases described. Would it be possible to rerun your example and get the log? Finally, only the logs for bob.fiber and bob.stream (probably bob.pack as well) are interesting.

@fdagnat
Copy link
Author

fdagnat commented Nov 29, 2022

Hi again,

No problem for testing. Do not hesitate. My only question is how do I limit the logs to the bob.fiber, bob.stream and bob.pack?

@dinosaure
Copy link
Owner

We don't have yet a standalone mechanism to filter logs. You probably can just save all logs into a file and do a grep like:

$ bob send -p <password> file.txt &> log.txt
$ grep "bob\.\(fiber\|pack\|stream\)" log.txt

@fdagnat
Copy link
Author

fdagnat commented Nov 29, 2022

For the moment the complete debug is here: https://gist.github.com/fdagnat/38c48d12b392e5fd2e1c68185c3eb805

@dinosaure
Copy link
Owner

For the moment the complete debug is here: https://gist.github.com/fdagnat/38c48d12b392e5fd2e1c68185c3eb805

Arf, I forgot to notice you to re-update the version of bob:

$ opam pin add bob https://github.com/dinosaure/bob.git#fix-connect-macos-native
$ opam reinstall bob

@fdagnat
Copy link
Author

fdagnat commented Nov 29, 2022

@dinosaure
Copy link
Owner

Yes, it's better, I will try to figure out from this outcome what is going on precisely 👍.

@dinosaure
Copy link
Owner

I think I finally found the issue. It seems that the End value is given by something else than our read() syscalls. Now, we have 2 possibilities: an exception leak or a signal from select about an exception.

bob/lib/fiber.ml

Lines 386 to 391 in f898b50

let sigexcept fd =
match Hashtbl.find_opt prd fd with
| None -> ()
| Some (`Read (ivar, _len)) ->
Hashtbl.remove prd fd;
Ivar.fill ivar (Ok `End)

The documentation about the select() behavior and the possibile difference with a BSD/Linux system is not clear. So I decided to avoir the handler of select()'s exceptions. I updated #28 with that. I added some logs too to really figure out the execution path in details. Can you re-update Bob and show the same kind of logs?

$ opam pin add bob https://github.com/dinosaure/bob.git#fix-connect-macos-native
$ opam reinstall bob
$ bob send -p <password> file.txt &> log.txt
$ grep "bob\.\(fiber\|pack\|stream\)" log.txt

@fdagnat
Copy link
Author

fdagnat commented Nov 29, 2022

@dinosaure
Copy link
Owner

It seems to work, probably, on the other side, you must have the same version of bob but from what I see, bob transmitted the whole file. Can you confirm that?

@fdagnat
Copy link
Author

fdagnat commented Nov 29, 2022

Hum, the other party was a colleague on Linux using the official bob release. He get a file but with an error about the hash of the file. We can do more testing tomorrow if it is needed.

Thanks for your work

@fdagnat
Copy link
Author

fdagnat commented Dec 7, 2022

Hello,
I wanted to test again today but bob does not build anymore:

# (cd _build/default && /Users/fdagnat/.opam/default/bin/ocamlc.opt -w -40 -g -bin-annot -I lib/.bob_dns.objs/byte -I /Users/fdagnat/.opam/default/lib/asn1-combinators -I /Users/fdagnat/.opam/default/lib/astring -I /Users/fdagnat/.opam/default/lib/base/caml -I /Users/fdagnat/.opam/default/lib/base64 -I /Users/fdagnat/.opam/default/lib/bheap -I /Users/fdagnat/.opam/default/lib/cstruct -I /Users/[...]
# File "lib/bob_dns.ml", line 72, characters 6-47:
# 72 |     | Happy_eyeballs.Connect_failed (_host, id) ->
#            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# Error: The constructor Happy_eyeballs.Connect_failed expects 3 argument(s),
#        but is applied here to 2 argument(s)
# (cd _build/default && /Users/fdagnat/.opam/default/bin/ocamlopt.opt -w -40 -g -I lib/.bob_dns.objs/byte -I lib/.bob_dns.objs/native -I /Users/fdagnat/.opam/default/lib/asn1-combinators -I /Users/fdagnat/.opam/default/lib/astring -I /Users/fdagnat/.opam/default/lib/base/caml -I /Users/fdagnat/.opam/default/lib/base64 -I /Users/fdagnat/.opam/default/lib/bheap -I /Users/fdagnat/.opam/default/lib[...]
# File "lib/bob_dns.ml", line 72, characters 6-47:
# 72 |     | Happy_eyeballs.Connect_failed (_host, id) ->
#            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# Error: The constructor Happy_eyeballs.Connect_failed expects 3 argument(s),
#        but is applied here to 2 argument(s)

It seems that Happy eyeballs changed its API.

Have a good day

@dinosaure
Copy link
Owner

Thansk, I just made a PR (#30) for that and the upstream version of Bob should work. I relaunched a build on https://builds.osau.re/ to integrate this PR into the bob.com binary. I rebased #28 with this PR too. If we take in account your last log it seems that the error is more about the transmission than the MacOS support now.

It will be useful to get a log from the sender (with a MacOS & the fix available on #28) and the receiver (whatever the system & bob~upstream).

On the sender, something like:

$ opam pin add bob https://github.com/dinosaure/bob.git#fix-connect-macos-native
$ opam install bob
$ ob send -p <password> file.txt &> log.txt

And on the receiver:

$ wget https://builds.osau.re/job/bob/build/latest/f/bin/bob.com
$ chmod +x bob.com
$ ./bob.com recv <password> -vvv &> log.txt

It will be super useful.

@fdagnat
Copy link
Author

fdagnat commented Dec 7, 2022

Hello again,
Testing the new code, my file is systematically corrupted. On the receiver side we get:

> bob recv ...
Accept from ...:55967 [Y/n]:
>>> Received a file: 2211.16212.pdf.
>>> ⠇  128.0 KiB    1.4 MiB/s
[ 72890][     ERROR][           bob.fiber]: close(): Bad file descriptor
bob: Corrupted file (unexpected hash).

I tested with both send and receiver on my mac and with sender mac and receiver linux and got the same result.

By consequence, I think there is another issue. Notice that now the receiver gets 2 blocks of 64K. In case it may serve, I let the trace there https://gist.github.com/fdagnat/b537392382c67dc56ba8deedaf59d478.

Thanks again

@reynir
Copy link
Contributor

reynir commented Dec 7, 2022

I don't know if it's relevant, but it could be an idea to log a warning here:

| None -> Fiber.close socket)

@dinosaure
Copy link
Owner

I don't know if it's relevant, but it could be an idea to log a warning here:

You probably right but I don't think the issue about the corrupted file is related to this probable double-close. It's more about: how Bob deflates and transmits the file to the relay. The DNS part was already done before (just to resolve osau.re). My intuition is (only on Mac):

  • we start to deflate and transmit the file
  • at the end, we close the connection
  • the relay close abruptely the connection
  • the receiver close the connection too but it did not receive everything...

It's probable that, on Mac, when we close the socket, we did not send everything for whatever reason. Again, it's highly related to the system's behavior and Mac (probably) temporise TCP/IP packets. Something like a flush is probably needed.

@dinosaure
Copy link
Owner

To confirm my hypothesis, I just added a new commit on #28 which set the socket for the secure room with TCP_NODELAY. By this way, we ensure that the system should sends everything and properly close the socket then. I'm not sure that it's the right fix - and probably not the right solution - but it permits to restrict the system's behavior about the transmission.

@fdagnat Would it be possible to test with this new version (again, do the opam pin add and opam reinstall to make sure you have the right version). If the problem persist, I definitely need a MacOS to see what is wrong...

@fdagnat
Copy link
Author

fdagnat commented Dec 7, 2022

The behavior is becoming more and more intriguing. From macos to linux, it never works (same 128K only transfer). From my machine to my machine (macos), without -vv it produces the same error (as Linux) and with -vv, it works correctly.

@dinosaure
Copy link
Owner

Good news! I have an access to a Mac and I can reproduce your error. I will try to fix that as soon as I can but the iteration between logs and bugs will be better! I will come back to you when I will fix the initial issue. Thanks for your patience.

@dinosaure
Copy link
Owner

I tried to play a bit with a Mac and see what happens. Currently, I can notice a real corruption between the sender and the receiver. As far as I can say, the relay is not the source of the bug but it seems that the sender craft a packet and send something which was changed by an other part of the code. This is indeed very intriguing. I'll try to look more closely at Bob's behaviour on the Mac, but it seems that the bug is related to the notion of block ownership.

Like you, locally (and the relay), everything works. Like you, if you force some scheduling with -vvv which does I/O, Bob works too. Otherwise, it doesn't work.

@dinosaure dinosaure mentioned this issue Dec 13, 2022
@dinosaure
Copy link
Owner

@fdagnat, I just found the initial bug and I just tested on a MacOS that the commit fixes the issue. Can you confirm the fix? You need to opam pin the right version:

$ opam pin add -y bob https://github.com/dinosaure/bob.git#good-fix

We changed a bit the protocol so the usage of the current relay is probably buggy (but it seems fine, I just tested it).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants