Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

propagation (of new files in a directory?) uses 6 stack frames per file, rather than iterating (NOT ABOUT ocaml 5.x!!) #990

Closed
gdt opened this issue Jan 19, 2024 · 10 comments · Fixed by #1016
Labels
crash unison crashes (a specific kind of defect; use crash when matches) defect unison fails to meet its specification (but doesn't crash; see also "crash") effort-medium issue is likely resolvable with <= 20h of effort impact-low low importance

Comments

@gdt
Copy link
Collaborator

gdt commented Jan 19, 2024

As reported by @olafhering on unison-users@ after discussion, propagation seems to use excessive stack depth.
The repro recipe is roughly:

  • have a directory a with about 350K files, synced to another system
  • mv a b
  • sync again
  • probably, observe that despite a crash progress is made, so rerunning will converge, making this relatively low impact, given that it takes a vast number of files

Looking at the stack trace below, it smells like there is a recursive function which does not do tail recursion when it should, but that is speculation. The bug is that while 350K files might reasonably take 350K * A bytes of RAM for some A, it should not result in stack usage proportional to the number of files.

#0  caml_hash (count=<optimized out>, limit=<optimized out>, seed=<optimized out>, obj=15) at hash.c:196
#1  0x000055bdf7673810 in camlStdlib__Hashtbl__key_index_1273 () at hashtbl.ml:504
#2  0x000055bdf7673ac2 in camlStdlib__Hashtbl__find_1304 () at hashtbl.ml:538
#3  0x000055bdf75efcdc in camlAbort__errorCountCell_421 () at src/abort.ml:40
#4  0x000055bdf75eff98 in camlAbort__check_547 () at src/abort.ml:46
#5  0x000055bdf760aed2 in camlFiles__fun_4454 () at src/files.ml:665
#6  0x000055bdf7627853 in camlLwt_util__fun_607 () at src/lwt/lwt_util.ml:77
#7  0x000055bdf76262e3 in camlLwt__apply_356 () at src/lwt/lwt.ml:61
#8  0x000055bdf76277a2 in camlLwt_util__run_in_region_1_449 () at src/lwt/lwt.ml:95
#9  0x000055bdf76262e3 in camlLwt__apply_356 () at src/lwt/lwt.ml:61
#10 0x000055bdf76265fc in camlLwt__fun_547 () at src/lwt/lwt.ml:80
#11 0x000055bdf763eda0 in camlStdlib__List__iter_507 () at list.ml:110
#12 0x000055bdf76260ba in camlLwt__restart_281 () at src/lwt/lwt.ml:31
#13 0x000055bdf76276c7 in camlLwt_util__leave_region_443 () at src/lwt/lwt.ml:69
#14 0x000055bdf7627898 in camlLwt_util__fun_622 () at src/lwt/lwt_util.ml:78
...
#285323 0x000055bdf763eda0 in camlStdlib__List__iter_507 () at list.ml:110
#285324 0x000055bdf76260ba in camlLwt__restart_281 () at src/lwt/lwt.ml:31
#285325 0x000055bdf76276c7 in camlLwt_util__leave_region_443 () at src/lwt/lwt.ml:69
#285326 0x000055bdf7627898 in camlLwt_util__fun_622 () at src/lwt/lwt_util.ml:78
#285327 0x000055bdf76262e3 in camlLwt__apply_356 () at src/lwt/lwt.ml:61
#285328 0x000055bdf76265fc in camlLwt__fun_547 () at src/lwt/lwt.ml:80
#285329 0x000055bdf763eda0 in camlStdlib__List__iter_507 () at list.ml:110
#285330 0x000055bdf76260ba in camlLwt__restart_281 () at src/lwt/lwt.ml:31
#285331 0x000055bdf76276c7 in camlLwt_util__leave_region_443 () at src/lwt/lwt.ml:69
#285332 0x000055bdf7627898 in camlLwt_util__fun_622 () at src/lwt/lwt_util.ml:78
#285333 0x000055bdf76262e3 in camlLwt__apply_356 () at src/lwt/lwt.ml:61
#285334 0x000055bdf76265fc in camlLwt__fun_547 () at src/lwt/lwt.ml:80
#285335 0x000055bdf763eda0 in camlStdlib__List__iter_507 () at list.ml:110
#285336 0x000055bdf76260ba in camlLwt__restart_281 () at src/lwt/lwt.ml:31
#285337 0x000055bdf76276c7 in camlLwt_util__leave_region_443 () at src/lwt/lwt.ml:69
#285338 0x000055bdf7627898 in camlLwt_util__fun_622 () at src/lwt/lwt_util.ml:78
#285339 0x000055bdf76262e3 in camlLwt__apply_356 () at src/lwt/lwt.ml:61
#285340 0x000055bdf76265fc in camlLwt__fun_547 () at src/lwt/lwt.ml:80
#285341 0x000055bdf763eda0 in camlStdlib__List__iter_507 () at list.ml:110
#285342 0x000055bdf76260ba in camlLwt__restart_281 () at src/lwt/lwt.ml:31
#285343 0x000055bdf76276c7 in camlLwt_util__leave_region_443 () at src/lwt/lwt.ml:69
#285344 0x000055bdf76277e8 in camlLwt_util__fun_616 () at src/lwt/lwt_util.ml:77
#285345 0x000055bdf76262e3 in camlLwt__apply_356 () at src/lwt/lwt.ml:61
#285346 0x000055bdf76265fc in camlLwt__fun_547 () at src/lwt/lwt.ml:80
#285347 0x000055bdf763eda0 in camlStdlib__List__iter_507 () at list.ml:110
#285348 0x000055bdf76260ba in camlLwt__restart_281 () at src/lwt/lwt.ml:31
#285349 0x000055bdf7627da3 in camlLwt_unix_impl__restart_threads_621 () at src/lwt/lwt.ml:69
#285350 0x000055bdf762838c in camlLwt_unix_impl__run_715 () at src/lwt/generic/lwt_unix_impl.ml:125
#285351 0x000055bdf76140ae in camlUicommon__transportItems_1958 () at src/uicommon.ml:736
#285352 0x000055bdf74e2707 in camlDune__exe__Uigtk3__synchronize_8710 () at src/uigtk3.ml:3736
#285353 0x000055bdf74b4482 in camlDune__exe__Uigtk3__getLock_945 () at src/uigtk3.ml:138
#285354 0x000055bdf74f53d1 in camlGtkSignal__safe_call_inner_801 () at src/gtkSignal.ml:61
@gdt gdt added defect unison fails to meet its specification (but doesn't crash; see also "crash") impact-low low importance crash unison crashes (a specific kind of defect; use crash when matches) effort-medium issue is likely resolvable with <= 20h of effort labels Jan 19, 2024
@tleedjarv
Copy link
Contributor

I am unable to reproduce this stack growth.

@olafhering could you please report the full version you were using, including the version of the compiler used. Are you using a different version on the remote end? Please also report any non-default preferences that might impact the syncing process (you can leave out prefs such as ignore, paths, etc.).

Is the remote (in this case, the receiving) end significantly slower than the sending side, or is there high latency in the network?

If you can reproduce the crash with the GUI, could you try and see if the stack starts growing with the text UI.

If you can reproduce the crash then some things to try:

  • have the same version on local and remote ends;
  • see if you can reporduce locally (instead of a remote sync, sync locally to a different location);
  • if you were using a 5.x series compiler, please try with a 4.x compiler;
  • try a build of the latest unreleased code on GH.

@olafhering
Copy link
Contributor

Both unison binaries are identical. They are built with dune, using OCaml 4.14.1. Both systems are in the same LAN, connected via 1G.

While looking at the .prf file, I realized there are ignore = Path {a/large-dir} lines. I forgot about them while renaming a to b. As a result the new initial sync processed b/large-dir. The only "relevant/non-standard" entry in .prf seems to be times=true.

I will see if I can reproduce it again with this new finding.

@tleedjarv
Copy link
Contributor

The ignore preferences should not play any role here (neither should times).

I guess you already answered it, but just in case: have you used any preference like maxthreads or stream?

If you manage to reliably reproduce the issue, could you then try with version 2.53.0 and see if it works there?

@tleedjarv
Copy link
Contributor

I've done some more debugging and can see the same code being executed, yet the stack is not growing (and I'm testing with far larger directory trees). I see this as unlikely but it could be that your binary is not properly optimized. If you manage to reproduce the issue, could you try with a binary that is built in a different environment (for example, download from GH)? More precisely, you should try with a binary built without opam or with a completely clean opam env. I know this is far-fetched but it could be that you have some compiler settings that produce a non-optimized or differently optimized binary. (Even if that turns out to be the case, it could still be a bug to fix then.)

@olafhering
Copy link
Contributor

The binaries come from https://build.opensuse.org/package/show/Archiving:unison/unison and will always be a clean build using https://build.opensuse.org/project/show/Archiving:unison:buildrequires.

It will take some time until I find the time to reproduce it.

@tleedjarv
Copy link
Contributor

I can easily reproduce with your build and also the 2.53.3 build from GH. I was myself testing with the latest code (unreleased, but available at GH) and this worked even with a tiny stack. It is thus likely that the next release fixes this issue but it would be great if you could actually test the latest code already now.

@gdt
Copy link
Collaborator Author

gdt commented Feb 18, 2024

I'm going to close this as believe fixed in the current code base. If you can repro with 2.53.4 (RC2 today, release tomorrow), please post and I'm happy to reopen.

@gdt gdt closed this as completed Feb 18, 2024
@norbusan
Copy link

I was told on unison-users to report here a similar case:

Version 2.53.4 (ocaml 5.1.0)
OS Arch Linux

I got reproducible crashes syncing my laptop against my desktop, both running linux. The crash happened while the sync was going on, no user interaction was necessary. I could repeat it three times, all three times the backtrace in gdb showed caml_try_realloc_stack as the top entry.

I can also confirm that using the text ui unison worked without any problem.

I attach three different traces I got from the three crashes.

gdb-trace1.log
gdb-trace2.log
gdb-trace3.log

@gdt
Copy link
Collaborator Author

gdt commented Mar 15, 2024

We now believe @norbusan 's comment is the same thing as #1006 . I am therefore closing #900. @norbusan : please follow up in #1006 if useful/appropriate.

@gdt gdt closed this as completed Mar 15, 2024
@gdt gdt changed the title propagation (of new files in a directory?) uses 6 stack frames per file, rather than iterating propagation (of new files in a directory?) uses 6 stack frames per file, rather than iterating (NOT ABOUT ocaml 5.x!!) Mar 23, 2024
@tleedjarv
Copy link
Contributor

I have prepared a tentative fix which I believe might be common to both this ticket and #1006, even though archlinux seems to be hit harder and might still be a different issue.

@olafhering if you are interested in testing then see details at #1006 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
crash unison crashes (a specific kind of defect; use crash when matches) defect unison fails to meet its specification (but doesn't crash; see also "crash") effort-medium issue is likely resolvable with <= 20h of effort impact-low low importance
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants