Linking LibreOffice with WASM EH and SjLj fails since 3.1.6 #16572

Open
jmglogow opened this issue Mar 23, 2022 · 38 comments

@jmglogow

** Version of emscripten/emsdk: **

emcc (Emscripten gcc/clang-like replacement + linker emulating GNU ld) 3.1.7 (48a1620)
clang version 15.0.0 (https://github.com/llvm/llvm-project fbce4a78035c32792b0a13cf1f169048b822c06b)
Target: wasm32-unknown-emscripten
Thread model: posix
InstalledDir: /home/jmg/Development/libreoffice/git_emsdk/upstream/bin

** Linking EH command flags: **

Failed build: -fwasm-exceptions -s SUPPORT_LONGJMP=wasm. Building with -s DISABLE_EXCEPTION_CATCHING=0 works fine.

** Failing command line in full: **

S=/home/jmg/Development/libreoffice/wasm && B=$S/build-dbg-neh && I=$B/instdir && W=$B/workdir && /usr/bin/ccache /home/jmg/Development/libreoffice/git_emsdk/upstream/emscripten/em++ -fno-stack-protector -pthread -s USE_PTHREADS=1 -s TOTAL_MEMORY=1GB -s PTHREAD_POOL_SIZE=4 --bind -s FORCE_FILESYSTEM=1 -s WASM_BIGINT=1 -s ERROR_ON_UNDEFINED_SYMBOLS=1 -s FETCH=1 -s ASSERTIONS=1 -s EXIT_RUNTIME=0 -s EXPORTED_RUNTIME_METHODS=["UTF16ToString","stringToUTF16","printErr"] -pthread -s USE_PTHREADS=1 -fwasm-exceptions -s SUPPORT_LONGJMP=wasm -L$W/LinkTarget/StaticLibrary -L$I/sdk/lib -L$I/program -L$I/program -O1 -fstrict-aliasing -fstrict-overflow -g -gseparate-dwarf --pre-js $S/static/emscripten/environment.js --pre-js $W/CustomTarget/static/emscripten_fs_image/soffice.data.js.link --pre-js $S/static/emscripten/soffice_args.js $W/CObject/desktop/source/app/main.o -Wl,--start-group -luno_sal -lsofficeapp -luno_sal -lsofficeapp -lcomphelper -luno_cppu -luno_cppuhelpergcc3 -ldeploymentmisclo -leditenglo -lfwklo -li18nlangtag -luno_salhelpergcc3 -lsblo -lsfxlo -lsvllo -lsvxlo -lsvxcorelo -lsvtlo -ltklo -ltllo -lucbhelper -lutllo -lvcllo -lreglo -lunoidllo -lxmlreaderlo -lstorelo -lxmlscriptlo -lbasegfxlo -ldrawinglayercorelo -li18nutil -lsotlo -lepoxy -lxolo -llnglo -lsaxlo -ldrawinglayerlo -lavmedialo -lcomponentslo -lsvgfilterlo -lgraphicfilterlo -lhyphenlo -llnthlo -lspelllo -lbiblo -lchartcorelo -lchartcontrollerlo -lcmdmaillo -lconfigmgrlo -lctllo -ldbtoolslo -ldesktopbe1lo -levtattlo -lexpwraplo -lfilterconfiglo -lfps_officelo -lforlo -lfsstoragelo -li18npoollo -li18nsearchlo -llocalebe1lo -lloglo -lmigrationoo2lo -lmigrationoo3lo -lmsfilterlo -lnumbertextlo -lodfflatxmllo -loffacclo -looxlo -lpasswordcontainerlo -lpdffilterlo -lstoragefdlo -lsvgiolo -lemfiolo -lswlo -lsysshlo -ltextconversiondlgslo -ltextfdlo -lucpexpand1lo -lucpextlo -lucpimagelo -lucptdoc1lo -lunordflo -lunoxmllo -luuilo -lxmlfalo -lxmlfdlo -lxoflo -lxsltdlglo -lxsltfilterlo -lcuilo -lhwplo 
-lmswordlo -lswdlo -lt602filterlo -lwpftwriterlo -lwriterfilterlo -lcached1 -ldeployment -ldeploymentgui -lembobj -lemboleobj -lpackage2 -lsrtrs1 -lucb1 -lucpfile1 -lucphier1 -lucppkg1 -lxmlsecurity -lxsec_xmlsec -lxstor -lbinaryurplo -lbootstraplo -lintrospectionlo -linvocadaptlo -linvocationlo -liolo -lnamingservicelo -lproxyfaclo -lreflectionlo -lstocserviceslo -luuresolverlo -lwriterperfectlo -lgcc3_uno -lvclplug_qt5lo -lcollator_data -ldict_ja -ldict_zh -lindex_data -llocaledata_en -llocaledata_es -llocaledata_euro -llocaledata_others -ltextconv_dict -lswuilo -lepoxy $W/LinkTarget/StaticLibrary/libdtoa.a $W/LinkTarget/StaticLibrary/libzlib.a $W/LinkTarget/StaticLibrary/libboost_locale.a $W/LinkTarget/StaticLibrary/libgraphite.a $W/LinkTarget/StaticLibrary/liblibjpeg-turbo.a $W/LinkTarget/StaticLibrary/liblibpng.a $W/LinkTarget/StaticLibrary/libzlib.a $W/LinkTarget/StaticLibrary/libexpat.a $W/LinkTarget/StaticLibrary/libdtoa.a $W/LinkTarget/StaticLibrary/libzlib.a $W/LinkTarget/StaticLibrary/libfindsofficepath.a $W/LinkTarget/StaticLibrary/libboost_locale.a $W/LinkTarget/StaticLibrary/libgraphite.a $W/LinkTarget/StaticLibrary/liblibjpeg-turbo.a $W/LinkTarget/StaticLibrary/liblibpng.a $W/LinkTarget/StaticLibrary/libulingu.a $W/LinkTarget/StaticLibrary/libexpat.a $W/LinkTarget/StaticLibrary/libshell_xmlparser.a $W/LinkTarget/StaticLibrary/libboost_filesystem.a -L$W/UnpackedTarball/icu/source/lib -licui18n -licuuc $W/UnpackedTarball/openssl/libssl.a $W/UnpackedTarball/openssl/libcrypto.a -L$W/UnpackedTarball/liblangtag/liblangtag/.libs -llangtag -L$W/UnpackedTarball/libxml2/.libs -lxml2 -lm -L$W/UnpackedTarball/harfbuzz/src/.libs -lharfbuzz -L$W/UnpackedTarball/lcms2/src/.libs -llcms2 -L$W/UnpackedTarball/libwebp/src/.libs -lwebp -L$W/UnpackedTarball/cairo/src/.libs -lcairo -L$W/UnpackedTarball/pixman/pixman/.libs -lpixman-1 -L$W/UnpackedTarball/fontconfig/src/.libs -lfontconfig -L$W/UnpackedTarball/freetype/instdir/lib -lfreetype 
-L$W/UnpackedTarball/liborcus/src/liborcus/.libs -lorcus-0.17 -L$W/UnpackedTarball/liborcus/src/parser/.libs -lorcus-parser-0.17 -L$W/UnpackedTarball/hunspell/src/hunspell/.libs -lhunspell-1.7 -L$W/UnpackedTarball/hyphen/.libs -lhyphen -L$W/UnpackedTarball/mythes/.libs -lmythes-1.2 $W/UnpackedTarball/libnumbertext/src/.libs/libnumbertext-1.0.a -L$W/UnpackedTarball/redland/src/.libs -lrdf -L$W/UnpackedTarball/raptor/src/.libs -lraptor2 -L$W/UnpackedTarball/rasqal/src/.libs -lrasqal -L$W/UnpackedTarball/libxslt/libxslt/.libs -lxslt -L$W/UnpackedTarball/libxslt/libexslt/.libs -lexslt $W/UnpackedTarball/libabw/src/lib/.libs/libabw-0.1.a $W/UnpackedTarball/libebook/src/lib/.libs/libe-book-0.1.a -L$W/UnpackedTarball/libmwaw/src/lib/.libs -lmwaw-0.3 -L$W/UnpackedTarball/libodfgen/src/.libs -lodfgen-0.1 -L$W/UnpackedTarball/librevenge/src/lib/.libs -lrevenge-0.0 -L$W/UnpackedTarball/libstaroffice/src/lib/.libs -lstaroffice-0.0 -L$W/UnpackedTarball/libwpd/src/lib/.libs -lwpd-0.10 -L$W/UnpackedTarball/libwpg/src/lib/.libs -lwpg-0.3 -L$W/UnpackedTarball/libwps/src/lib/.libs -lwps-0.4 $W/UnpackedTarball/xmlsec/src/.libs/libxmlsec1.a -ldl -L/home/jmg/Development/libreoffice/git_qt5/install-5.15.2/lib -lQt5Core -lQt5Gui -lQt5Widgets -lQt5Network -lqtpcre2 -lQt5EventDispatcherSupport -lQt5FontDatabaseSupport -L/home/jmg/Development/libreoffice/git_qt5/install-5.15.2/plugins/platforms -lqwasm -L$W/UnpackedTarball/icu/source/lib -licui18n -L$W/UnpackedTarball/icu/source/lib -licuuc $W/UnpackedTarball/openssl/libssl.a $W/UnpackedTarball/openssl/libcrypto.a -L$W/UnpackedTarball/liblangtag/liblangtag/.libs -llangtag -L$W/UnpackedTarball/libxml2/.libs -lxml2 -lm -L$W/UnpackedTarball/harfbuzz/src/.libs -lharfbuzz -L$W/UnpackedTarball/icu/source/lib -licuuc -L$W/UnpackedTarball/lcms2/src/.libs -llcms2 -L$W/UnpackedTarball/libwebp/src/.libs -lwebp -L$W/UnpackedTarball/cairo/src/.libs -lcairo -L$W/UnpackedTarball/pixman/pixman/.libs -lpixman-1 -L$W/UnpackedTarball/fontconfig/src/.libs 
-lfontconfig -L$W/UnpackedTarball/freetype/instdir/lib -lfreetype -L$W/UnpackedTarball/liborcus/src/liborcus/.libs -lorcus-0.17 -L$W/UnpackedTarball/liborcus/src/parser/.libs -lorcus-parser-0.17 -L$W/UnpackedTarball/hunspell/src/hunspell/.libs -lhunspell-1.7 -L$W/UnpackedTarball/hyphen/.libs -lhyphen -L$W/UnpackedTarball/mythes/.libs -lmythes-1.2 $W/UnpackedTarball/libnumbertext/src/.libs/libnumbertext-1.0.a -L$W/UnpackedTarball/redland/src/.libs -lrdf -L$W/UnpackedTarball/raptor/src/.libs -lraptor2 -L$W/UnpackedTarball/rasqal/src/.libs -lrasqal -L$W/UnpackedTarball/libxslt/libxslt/.libs -lxslt -L$W/UnpackedTarball/libxslt/libexslt/.libs -lexslt $W/UnpackedTarball/libabw/src/lib/.libs/libabw-0.1.a $W/UnpackedTarball/libebook/src/lib/.libs/libe-book-0.1.a -L$W/UnpackedTarball/libmwaw/src/lib/.libs -lmwaw-0.3 -L$W/UnpackedTarball/libodfgen/src/.libs -lodfgen-0.1 -L$W/UnpackedTarball/librevenge/src/lib/.libs -lrevenge-0.0 -L$W/UnpackedTarball/libstaroffice/src/lib/.libs -lstaroffice-0.0 -L$W/UnpackedTarball/libwpd/src/lib/.libs -lwpd-0.10 -L$W/UnpackedTarball/libwpg/src/lib/.libs -lwpg-0.3 -L$W/UnpackedTarball/libwps/src/lib/.libs -lwps-0.4 -L/home/jmg/Development/libreoffice/git_qt5/install-5.15.2/lib -lQt5Core -lQt5Gui -lQt5Widgets -lQt5Network -lqtpcre2 -lQt5EventDispatcherSupport -lQt5FontDatabaseSupport -L/home/jmg/Development/libreoffice/git_qt5/install-5.15.2/plugins/platforms -lqwasm -L$W/UnpackedTarball/icu/source/lib -licudata -Wl,--end-group -o $I/program/soffice.html ; RC=$? ; rm -f $W/LinkTarget/link.lock; if test $RC -ne 0; then exit $RC; fi

error: undefined symbol: emscripten_longjmp (referenced by top-level compiled C/C++ code)
warning: Link with -s LLD_REPORT_UNDEFINED to get more information on undefined symbols
warning: To disable errors for undefined symbols use -s ERROR_ON_UNDEFINED_SYMBOLS=0
warning: _emscripten_longjmp may need to be added to EXPORTED_FUNCTIONS if it arrives from a system library
Error: Aborting compilation due to previous errors
em++: error: '/home/jmg/Development/libreoffice/git_emsdk/node/14.15.5_64bit/bin/node /home/jmg/Development/libreoffice/git_emsdk/upstream/emscripten/src/compiler.js /tmp/tmpu_78bn8k.json' failed (returned 1)
make[1]: *** [/home/jmg/Development/libreoffice/wasm/desktop/Executable_soffice_bin.mk:10: /home/jmg/Development/libreoffice/wasm/build-dbg-neh/instdir/program/soffice.html] Fehler 1

** Full link command and output with -v appended: **

"/home/jmg/Development/libreoffice/git_emsdk/upstream/bin/wasm-ld" @/tmp/emscripten_9txe174p.rsp.utf-8
"/home/jmg/Development/libreoffice/git_emsdk/upstream/bin/wasm-emscripten-finalize" -g --bigint --no-dyncalls --no-legalize-javascript-ffi --dwarf /home/jmg/Development/libreoffice/wasm/build-dbg-neh/instdir/program/soffice.wasm --detect-features
"/home/jmg/Development/libreoffice/git_emsdk/node/14.15.5_64bit/bin/node" /home/jmg/Development/libreoffice/git_emsdk/upstream/emscripten/src/compiler.js /tmp/tmpijnjxqwv.json

When linking with EMCC_DEBUG=1, diff -u link-3.1.5.log link-3.1.6.log shows an interesting difference in the 'declares': array:

@@ -3040,6 +3040,7 @@
               'exit',
               'emscripten_asm_const_int',
               'setTempRet0',
+              'emscripten_longjmp',
               'gethostbyname',
               'emscripten_log',
               'getentropy',
@@ -3337,7 +3338,6 @@
               '__syscall_truncate64',
               '__syscall_utimensat',
               'emscripten_resize_heap',
-              '_emscripten_throw_longjmp',
               'strftime_l',
               '__syscall_accept4',
               '__syscall_bind',

The rest is just temporary files and time differences.

It looks like the origin of the problem is #15792, which added #ifndef __USING_WASM_SJLJ__ for the missing symbol.

FWIW, the WASM EH LibreOffice build has been compiling for some time but doesn't run yet, because of a function call mismatch, which interestingly doesn't happen with the Emscripten EH build of the same code...

@kripken
Member

kripken commented Mar 24, 2022

Interesting, not many projects using wasm EH + wasm sjlj AFAIK, this might be one of the first...

The first issue is hopefully not complex, but the function call mismatch mentioned on the last line is more worrying to me. Does it happen only in certain optimization levels perhaps? (can vary at both compile and link time)

cc @aheejin

@sbc100
Collaborator

sbc100 commented Mar 25, 2022

Can you see which object file contains the reference to the undefined emscripten_longjmp function? It seems like maybe whatever file that was might not have been compiled using -sSUPPORT_LONGJMP=wasm?

(sometimes -sLLD_REPORT_UNDEFINED will give you information on which object contains the undefined reference, but I think it doesn't work in all cases).

@aheejin
Member

aheejin commented Mar 25, 2022

I also suspect one of the object files linked wasn't built with -sSUPPORT_LONGJMP=wasm. The default is -sSUPPORT_LONGJMP=emscripten or -sSUPPORT_LONGJMP=1 (which is currently the same as =emscripten), and if -sSUPPORT_LONGJMP=wasm was not explicitly given, it would have been built with the emscripten option. emscripten_longjmp is not used in Wasm SjLj.

I am also curious about the function call mismatch error... Do you have a reproducer?

@jmglogow
Author

I also suspect one of the object files linked wasn't built with -sSUPPORT_LONGJMP=wasm

Maybe the parameter is missing for some external library. I just know 3.1.5 links and 3.1.6 doesn't. And WASM EH fails with a strange error early in the LO job scheduler, while Emscripten EH works. WASM Qt5 is supposed to be built without EH, but I didn't rebuild that; maybe it needs this flag nevertheless.

The following builds are a month old, used for testing FF nightly with WASM EH back then. The "About LO" dialog should show the Emscripten version used. The broken call is a class function, Task::UpdateMinPeriod, called from https://github.com/LibreOffice/core/blob/c4cb1d1dd581a5f120d9cf8b1d4274ec38f3eabe/vcl/source/app/scheduler.cxx#L395. And it's not the first call of it. I just know it doesn't happen with the Emscripten EH build.

opt-build: https://drive.google.com/file/d/1JAWj0S7gB6kWej3i3xaZUjYPN2bxg6su/view?usp=sharing
opt-neh-build: https://drive.google.com/file/d/1s35pZOsUSgiecnOkO-IhUF-x2KU7JTkB/view?usp=sharing

@jmglogow
Author

I found that libpng wasn't built with -s SUPPORT_LONGJMP=wasm, because I had only added that flag to the CXX exception flags in case of WASM EH. The changelog.md 3.1.3 entry just talks about exception and SjLj support, so I simply missed the fact that now all code must be built with -s SUPPORT_LONGJMP=wasm, not just the CXX (exception) code. AFAIK the whole LO code only uses SjLj to include the PNG and JPEG libs; both graphics import filters don't use exceptions.

So now my LO build doesn't have any more emscripten_longjmp ... but Qt WASM also has an internal libpng / libjpeg, which now need -s SUPPORT_LONGJMP=wasm...

@sbc100
Collaborator

sbc100 commented Mar 25, 2022

I think you would only need to supply -sSUPPORT_LONGJMP=wasm to code that itself uses setjmp/longjmp, right? The undefined emscripten_longjmp symbol would only exist in an object file that uses setjmp/longjmp (IIUC). This is a little surprising because I doubt libpng / libjpeg use setjmp/longjmp.

Did you try -sLLD_REPORT_UNDEFINED? It would tell you exactly which object file contains the undefined reference to emscripten_longjmp... you could also use llvm-nm to discover this. It sounds like you've solved the issues, but perhaps there is some kind of bug where object files that don't use setjmp/longjmp are generating references to emscripten_longjmp. That shouldn't happen, right @aheejin ?

@kleisauke
Collaborator

kleisauke commented Mar 25, 2022

This is a little surprising because I doubt libpng / libjpeg use setjmp/longjmp.

FWIW, libpng in fact does use setjmp/longjmp internally, see:
https://github.com/glennrp/libpng/blob/a37d4836519517bdce6cb9d956092321eca3e73b/png.c#L301-L303

libjpeg(-turbo) also uses setjmp/longjmp for its error handling, but not internally. Though, any API usage is supposed to use setjmp/longjmp.
https://github.com/libjpeg-turbo/libjpeg-turbo/blob/f3c716a2bc1dce073ed729619a141b9927653d72/example.txt#L347-L348

This means that, for example, all libraries that link against libjpeg (e.g. libtiff) must also be compiled with -pthread, -sSUPPORT_LONGJMP=wasm or -sSHARED_MEMORY if the final link also contains one of these options.

(I had a similar issue with wasm-vips when I updated Emscripten to 3.1.6, see commit kleisauke/wasm-vips@396c85b).
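To make the pattern concrete, here is a minimal, self-contained sketch of the setjmp/longjmp error-recovery style described above. It does not use libpng or libjpeg; `Decoder`, `DecodeStatus`, `decoder_fail`, `decode_header` and `try_decode` are hypothetical names invented for illustration. The point is only that the translation unit calling `setjmp` and the one calling `longjmp` both contain SjLj code, so under Emscripten both would presumably need to be built with the same `-sSUPPORT_LONGJMP` mode.

```cpp
#include <cassert>
#include <csetjmp>
#include <string>

// Hypothetical error codes passed through longjmp (must be non-zero).
enum DecodeStatus { kTruncatedHeader = 1, kBadMagic = 2 };

// Hypothetical stand-in for a C library's decoder state;
// this is not the real libpng or libjpeg structure.
struct Decoder {
    std::jmp_buf on_error;
};

// "Library" side: report an error by jumping back to the caller's setjmp.
[[noreturn]] static void decoder_fail(Decoder &dec, DecodeStatus status) {
    assert(status != 0);  // setjmp returns 0 only on the initial call
    std::longjmp(dec.on_error, status);
}

// "Library" side: returns the payload length, or bails out via longjmp.
static int decode_header(Decoder &dec, const std::string &data) {
    if (data.size() < 4)
        decoder_fail(dec, kTruncatedHeader);
    if (data[0] != 'P')
        decoder_fail(dec, kBadMagic);
    return static_cast<int>(data.size()) - 4;
}

// "Application" side: establishes the jump target before calling the library.
std::string try_decode(const std::string &data) {
    Decoder dec;
    switch (setjmp(dec.on_error)) {
    case 0:
        break;  // initial call, continue with decoding
    case kTruncatedHeader:
        return "error: truncated header";
    default:
        return "error: bad magic";
    }
    const int payload = decode_header(dec, data);
    return "ok: " + std::to_string(payload);
}
```

No object with a non-trivial destructor is alive between the `setjmp` and the `longjmp`, which is what keeps this well-defined in C++; calling `try_decode("PN")` takes the error path, while `try_decode("PNG!abc")` returns normally.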

@jmglogow
Author

jmglogow commented Mar 25, 2022

Back when doing the initial LO WASM port, I had the idea to replace LO's SjLj usage in the PNG and JPEG filters with exception handling. Then I decided to explicitly build those two files without EH. Sure, AFAIK you could build just the sources containing SjLj calls with -s SUPPORT_LONGJMP=wasm, but that simply isn't manageable for external libraries with their own autotools buildchain. The only realistic option is to provide additional C*FLAGS to the configure calls, which I have now done for Cairo, Freetype, numbertext, Cppunit and LO itself. And FYI: WASM is currently just building Writer; the other LO components will need more external libraries when they are ported.

Now llvm-nm finds no more emscripten_longjmp symbols for all of LO's static libraries, but still for my WASM Qt libs. Adding the flag to Qt's mkspecs/wasm-emscripten/qmake.conf seems straight-forward enough for a test build; something for tomorrow.

@jmglogow
Author

jmglogow commented Mar 26, 2022

After rebuilding Qt with -s SUPPORT_LONGJMP=wasm, LO now links. Unfortunately the "Application exit (RuntimeError: indirect call signature mismatch)" still happens, so it looks like it was not a problem of the mixed exception handling. It still happens at the same point as in the provided opt-neh-build (neh = native exceptions) download (the failing job is still desktop::Desktop m_firstRunTimer, if logging is built in) :-(

FWIW if you run LO via the qt_soffice.html in the provided downloads, the opt build works, but the opt-neh build still fails the same way AFAI can see it.

I really have no idea how to provide a simpler example. I could provide a DWARF debug build, if that would help.

Since the original problem is now fixed / invalid, you / we could eventually close this issue. I would be happy to get some further ideas, how to debug my runtime problem. Still can be a bug with FF nightly, but Chrome fails at the same point; no difference between the old and my current build from my POV.

@jmglogow
Author

I'm planning to merge https://gerrit.libreoffice.org/c/core/+/132139

Would be nice, if someone can verify the commit message, so I can adapt it if I'm still misunderstanding something.

@kripken
Member

kripken commented Mar 28, 2022

Unfortunately the "Application exit (RuntimeError: indirect call signature mismatch)" still happens

Those can be tricky to debug. I'd make sure you have a deterministic testcase, then make sure it happens in all browsers (to rule out a browser bug, which it sounds like you have). If this is a regression (I'm not sure if this is, or if the linking issue is), you can bisect on emscripten. Otherwise perhaps you can bisect on something in LO. If you can't bisect, you can try the sanitizers. If all of the above fails, manual debugging might be needed (find out exactly where it fails, and add some debug logging etc.).

@aheejin
Member

aheejin commented Mar 28, 2022

@jmglogow If what the commit does is to add -s SUPPORT_LONGJMP=wasm to external libraries, the message looks good to me.

For the other bug, can you provide a reproducer? It doesn't have to be small; if we need to download a repo, that's also fine. As long as you have a deterministic reproducer and steps to reproduce, that will be helpful.

@jmglogow
Author

jmglogow commented Mar 29, 2022

@aheejin I've linked the reproducers above, but the following are new builds with Emscripten 3.1.8 and current LO source + my -s SUPPORT_LONGJMP=wasm patch:

  1. Build with Emscripten EH: https://drive.google.com/file/d/1Y-nGDe9Ba61IoVwjrw0P4cC0gP9CaDVm/view?usp=sharing
  2. Build with WASM EH: https://drive.google.com/file/d/1uXZkoh6oOeAuu3BNqVgxkcPKd7ct13jB/view?usp=sharing

Both contain a qt_soffice.html (in addition to the Emscripten-generated soffice.html, which has better backtraces), and both are built from the same codebase, just with different flags. The first "works", the 2nd fails early.

If you modify ENV.SAL_LOG in soffice.js to be +WARN+INFO.vcl.schedule, you can see that LO stops at the desktop::Desktop m_firstRunTimer log output. The object itself should be of type Timer, but the dynamic_cast<Timer*> from the logging code fails for it (it is a member of a larger class). And then the later call of some class function of the base Task class fails. I have no idea, why the dynamic_cast fails. And it's not a nullptr, as LO logs the "description" of the task as desktop::Desktop m_firstRunTimer.

Both builds are done with -gseparate-dwarf , but I didn't include the additional soffice.wasm.debug.wasm (1GB uncompressed). I can also provide these if that would help.

@aheejin
Member

aheejin commented Mar 29, 2022

Can you provide a reproducer and not the resulting object files? The former usually helps a lot more. What I mean by the reproducer is your directory before the build and the step to reproduce the resulting (erroneous) object files.

@jmglogow
Author

Can you provide a reproducer and not the resulting object files?

The reproducer would be you building LO WASM with WASM EH. Since the linked patch hasn't passed CI yet (I updated a minor detail), it's only on the features/wasm branch so far (with some other test stuff). You can clone from https://github.com/LibreOffice/core/ and add https://git.libreoffice.org/core as a 2nd remote, as GitHub is only synced once a day.

Then you can follow the static/README.wasm.md. This has a section "Experimental (AKA currently broken) WASM exception + SjLj build".

More info about the LO build on Linux is https://wiki.documentfoundation.org/Development/BuildingOnLinux. The WASM build is a cross build, because LO needs tooling to generate source and config files.

My current autogen.input for a build with the LO source in /libreoffice/git_core is

QT5DIR=/libreoffice/git_qt5/install-5.15.2-neh

--host=wasm64-local-emscripten

--with-external-tar=/libreoffice/tarballs
--enable-ccache
--enable-dbgutil
--enable-symbols
--enable-wasm-exceptions

I'm jmux in #libreoffice-dev on libera.chat; that would be faster than trying to handle build problems here.

@jmglogow
Author

jmglogow commented Apr 23, 2022

I just spent some time trying to debug this problem a bit further. While I found no smaller reproducer, I found that the dynamic_cast failure already happens much earlier. The following is the structure of the failing code that I tried to reproduce:

#define SAL_DEBUG(a) std::cout << a << std::endl;

#include <iostream>
#include <cassert>
#include <cstdio>  // for printf in main()

typedef signed long int sal_Int64;
typedef unsigned long int sal_uInt64;

class Task
{
    const char *mpDebugName;

public:
    Task( const char *pDebugName );
    virtual ~Task();

    const char *GetDebugName() const { return mpDebugName; }
    virtual void Start(const bool bStartTimer = true);
};

class Timer : public Task
{
    sal_uInt64 mnTimeout;
    const bool mbAuto;

protected:
    Timer( bool bAuto, const char *pDebugName );

public:
    Timer( const char *pDebugName );
    virtual ~Timer() override;
    virtual void Start(const bool bStartTimer = true) override;
};

namespace desktop {

class Desktop final
{
private:
    Timer m_aTimer;

public:
    Desktop();
    void Start();
};
}

Task::Task( const char *pDebugName )
    : mpDebugName( pDebugName )
{
    assert(mpDebugName);
}

Task::~Task() {}

void Task::Start(const bool bStartTimer)
{
    // dynamic_cast fails
    SAL_DEBUG(this << " " << __func__ << " 1 dynamic_cast<Timer*> ==> " << dynamic_cast<Timer*>(this));
}

Timer::Timer( bool bAuto, const char *pDebugName )
    : Task( pDebugName )
    , mnTimeout(0)
    , mbAuto( bAuto )
{
}

Timer::Timer( const char *pDebugName )
    : Timer( false, pDebugName )
{
}

Timer::~Timer() {}

void Timer::Start(const bool bStartTimer)
{
    // dynamic_cast ok
    SAL_DEBUG(this << " " << __func__ << " 2 dynamic_cast<Timer*> ==> " << dynamic_cast<Timer*>(this));
    Task::Start(false);
}

desktop::Desktop::Desktop() : m_aTimer("Test") {}

void desktop::Desktop::Start() { m_aTimer.Start(); }

int main()
{
    Timer aTimer("Hello");
    aTimer.Start();
    desktop::Desktop aDesktop;
    aDesktop.Start();
    printf("hello, world!\n");
    return 0;
}

The output from LO is:

qtloader.js:383 debug:42:1: 0x34fbf60 pre-Start dynamic_cast<Timer*> ==> 0x34fbf60
qtloader.js:383 debug:42:1: 0x34fbf60 Timer::Start dynamic_cast<Timer*> ==> 0x34fbf60
qtloader.js:383 debug:42:1: 0x34fbf60 Task::Start dynamic_cast<Timer*> ==> 0
qtloader.js:383 debug:42:1: 0x34fbf60 Timer::Start dynamic_cast<Timer*> ==> 0x34fbf60
qtloader.js:383 debug:42:1: 0x34fbf60 post-Start dynamic_cast<Timer*> ==> 0x34fbf60

Task::Start is a lot more complex, but the dynamic_cast instantly fails at the start of the function, as in the example code above. I used the same flags to compile it and even split it into multiple files for different compilation units, like the original code, without managing to reproduce the failure. Also, in the real code there is no other code between the function calls Desktop::Start => Timer::Start => Task::Start, except for my added debug output.

FWIW I found a workaround for the assumed compiler / linker bug: instead of embedding the Timer into the Desktop class, I used a Timer* and just m_aTimer(new Timer("Test")), and LO runs. But there are more classes with embedded Task-based objects, so unfortunately it's more of a POC than a workaround...
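For readers who want to see the two layouts side by side, the following is a rough sketch of the embedded-member variant versus the heap-allocated variant described above. It is a simplified stand-in, not LO code: `DesktopEmbedded`, `DesktopHeap`, `IsTimer` and `TimerIsTimer` are hypothetical names, and in a conforming standalone build both variants are expected to behave identically, so this does not reproduce the failure.

```cpp
#include <cassert>

class Timer;

class Task {
    const char *mpDebugName;

public:
    explicit Task(const char *pDebugName) : mpDebugName(pDebugName) {
        assert(mpDebugName);
    }
    virtual ~Task() = default;

    const char *GetDebugName() const { return mpDebugName; }
    // Hypothetical helper mirroring the debug output: is this Task a Timer?
    bool IsTimer() const;
};

class Timer : public Task {
public:
    explicit Timer(const char *pDebugName) : Task(pDebugName) {}
};

bool Task::IsTimer() const {
    return dynamic_cast<const Timer *>(this) != nullptr;
}

// Variant 1: the Timer is embedded as a member (the layout that fails in LO).
class DesktopEmbedded {
    Timer m_aTimer;

public:
    DesktopEmbedded() : m_aTimer("Test") {}
    bool TimerIsTimer() const { return m_aTimer.IsTimer(); }
};

// Variant 2: the Timer lives on the heap (the reported workaround).
// The pointer is intentionally never freed in this sketch, mirroring the hack.
class DesktopHeap {
    Timer *m_aTimer;

public:
    DesktopHeap() : m_aTimer(new Timer("Test")) {}
    DesktopHeap(const DesktopHeap &) = delete;
    DesktopHeap &operator=(const DesktopHeap &) = delete;
    bool TimerIsTimer() const { return m_aTimer->IsTimer(); }
};
```

The only difference between the two classes is where the Timer object is stored; the dynamic type, and therefore the result of the cast, should be the same in both.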

Any other suggestions?

@jmglogow
Author

jmglogow commented Apr 23, 2022

I just tried to "fix" my hack by adding a delete m_aTimer; to Desktop::~Desktop, and then Chrome fails with Application exit (RuntimeError: table index is out of bounds) and FF nightly fails with Application exit (RuntimeError: index out of bounds) just after the first debug line debug:42:1: 0x3565db0 pre-Start dynamic_cast<Timer*> ==> 0x3565db0... The Desktop object is part of LO's main (kind of), so it is only deleted when LO actually quits, which means the added code is not even called. As expected, nothing shows up if I add a debug statement there. I'm referring to the real LO code, not my non-broken reproducer.

@jmglogow
Author

jmglogow commented Jun 1, 2022

Unrelated to this problem, I updated emsdk from 3.1.10 to 3.1.12 today and now get linking problems due to AFAIK missing libc++ symbols :-(

shared:INFO: (Emscripten: Running sanity checks)
error: undefined symbol: _ZNKSt3__220__vector_base_commonILb1EE20__throw_length_errorEv (referenced by top-level compiled C/C++ code)
warning: Link with `-sLLD_REPORT_UNDEFINED` to get more information on undefined symbols
warning: To disable errors for undefined symbols use `-sERROR_ON_UNDEFINED_SYMBOLS=0`
warning: __ZNKSt3__220__vector_base_commonILb1EE20__throw_length_errorEv may need to be added to EXPORTED_FUNCTIONS if it arrives from a system library
error: undefined symbol: _ZNKSt3__221__basic_string_commonILb1EE20__throw_length_errorEv (referenced by top-level compiled C/C++ code)
warning: __ZNKSt3__221__basic_string_commonILb1EE20__throw_length_errorEv may need to be added to EXPORTED_FUNCTIONS if it arrives from a system library
Error: Aborting compilation due to previous errors

The build itself was done using 3.1.12. Just downgrading to 3.1.10 for the link fixes this. Adding a manual -lc++ doesn't help.

@sbc100
Collaborator

sbc100 commented Jun 1, 2022

Did you do a complete rebuild of all object files when upgrading from 3.1.10 to 3.1.12? It's likely this is related to the libc++ upgrade that happened in 3.1.11 (#1700).

@sbc100
Collaborator

sbc100 commented Jun 1, 2022

(Yes, I think a full rebuild should fix it)

@jmglogow
Author

jmglogow commented Jun 1, 2022

It was a full rebuild. But now I realized I didn't rebuild my Qt WASM libraries, the only external dependency not built by LO itself. I'll rebuild Qt and I guess it'll link then. Maybe I'm lucky and the newer clang will even fix this bug (which was the main reason to try the Emscripten upgrade; I still believe it's a compiler / linker bug).

@jmglogow
Author

jmglogow commented Jun 1, 2022

So the Qt rebuild fixed the linking problem (as expected) and this bug still exists in the updated clang (which I expected too). Still, the WASM EH build feels much "snappier" than the Emscripten EH build. I still have no minimal reproducer. I somehow expect the bug to be related to the size of the code, but this is just blind guessing.

@aheejin
Member

aheejin commented Jun 7, 2022

Sorry for the late reply. This thread got long and I'm not sure what the remaining bugs are. The link errors you reported have been resolved, right? Then the remaining bugs are

  1. "Application exit (RuntimeError: indirect call signature mismatch)" at runtime
  2. dynamic_cast failure you described in Linking LibreOffice with WASM EH and SjLj fails since 3.1.6 #16572 (comment)

Is this correct? Are these two different bugs? Or 2 is a reducer for 1?

For 1, I tried to setup the environment following #16572 (comment), but there were many documents to follow and something didn't work in the middle. I don't really remember what that was, because I tried that more than 2 months ago when you posted it. I'd appreciate a smaller reproducer or, even if it's large, an already set-up build directory with the actual emcc command line.

For 2 (#16572 (comment)), I'm not sure what the bug is. I compiled the C++ code you attached 1. without EH 2. with Emscripten EH 3. with Wasm EH, and all of them seemed to run fine. This is my shell printout:

// Without EH
aheejin@aheejin:~/test/lo$ em++ test.cpp -sENVIRONMENT=node -o test-noeh.js
aheejin@aheejin:~/test/lo$ node test-noeh.js
0x5051c8 Start 2 dynamic_cast<Timer*> ==> 0x5051c8
0x5051c8 Start 1 dynamic_cast<Timer*> ==> 0x5051c8
0x5051b8 Start 2 dynamic_cast<Timer*> ==> 0x5051b8
0x5051b8 Start 1 dynamic_cast<Timer*> ==> 0x5051b8
hello, world!

// With Emscripten EH
aheejin@aheejin:~/test/lo$ em++ test.cpp -sENVIRONMENT=node -fexceptions -o test-emeh.js
aheejin@aheejin:~/test/lo$ node test-emeh.js
0x505df8 Start 2 dynamic_cast<Timer*> ==> 0x505df8
0x505df8 Start 1 dynamic_cast<Timer*> ==> 0x505df8
0x505de0 Start 2 dynamic_cast<Timer*> ==> 0x505de0
0x505de0 Start 1 dynamic_cast<Timer*> ==> 0x505de0
hello, world!

// With Wasm EH
aheejin@aheejin:~/test/lo$ em++ test.cpp -sENVIRONMENT=node -fwasm-exceptions -o test-wasmeh.js
aheejin@aheejin:~/test/lo$ node test-wasmeh.js
0x506048 Start 2 dynamic_cast<Timer*> ==> 0x506048
0x506048 Start 1 dynamic_cast<Timer*> ==> 0x506048
0x506038 Start 2 dynamic_cast<Timer*> ==> 0x506038
0x506038 Start 1 dynamic_cast<Timer*> ==> 0x506038
hello, world!

Is this printout not correct? Or am I testing this in the wrong way?

@jmglogow
Author

jmglogow commented Jun 8, 2022

Sorry for the late reply. This thread got long and I'm not sure what the remaining bugs are. The link errors you reported have been resolved, right? Then the remaining bugs are

1. "Application exit (RuntimeError: indirect call signature mismatch)" at runtime

2. `dynamic_cast` failure you described in [Linking LibreOffice with WASM EH and SjLj fails since 3.1.6 #16572 (comment)](https://github.com/emscripten-core/emscripten/issues/16572#issuecomment-1107313559)

Is this correct? Are these two different bugs? Or 2 is a reducer for 1?

This is correct. We're back at the original bug (I should have opened a new one for the additional problem; sorry for that). I think this is one bug. IMHO 2. is the reason for 1. Somehow the embedded Timer object's type / class information is broken in the 2nd compile unit. So first the dynamic_cast<Timer*>() fails and then a call to the virtual function Task::UpdateMinPeriod generates the RuntimeError: indirect call signature mismatch. These were my initial observations.

Further debugging showed that a dynamic cast already fails much earlier, when calling a function from a different compile unit (all the info about Start()). Literally, the dynamic_cast instantly fails if I put it in a debug output at the start of the function call in the 2nd compile unit. I have no real idea why changing the Timer to Timer* fixes this. It looks like some memory / address management error from the compiler, especially since putting a delete in the destructor results in the same error as the original bug. That bug happens even without the actual destructor being called; the additional generated code alone triggers it.

For 1, I tried to setup the environment following #16572 (comment), but there were many documents to follow and something didn't work in the middle. I don't really remember what that was, because I tried that more than 2 months ago when you posted it. I'd appreciate a smaller reproducer or, even if it's large, an already set-up build directory with the actual emcc command line.

Hmm - let's see; I can provide my NEH Qt5 build. That is IMHO the hardest to set up, because qmake's Makefile generation is a bit brittle. If you don't want ccache, use --disable-ccache to drop it, if it's installed. Use the following autogen.input:

QT5DIR=<absolute path to>/qt5-5.15.2-neh

--host=wasm64-local-emscripten
--with-build-platform-configure-options=--enable-ccache --without-system-libxml --without-system-fontconfig --without-system-freetype --without-system-zlib

--enable-ccache
--enable-dbgutil
--enable-symbols
--enable-wasm-exceptions

Then just run ./autogen.sh and make. It'll download a lot of dependency source code. You can alternatively install the system font libraries dev packages, but these are just used to build the tooling to build LO, so that shouldn't matter.

It'll miss some basic tooling, like flex, bison and g++ (or clang) and lib(std)c++, but otherwise should just finish the build.

emrun instdir/program/qt_soffice.html should then run WASM LO.

For 2 (#16572 (comment)), I'm not sure what the bug is. I compiled the C++ code you attached 1. without EH 2. with Emscripten EH 3. with Wasm EH, and all of them seemed to run fine.

Is this printout not correct? Or am I testing this in the wrong way?

Yeah, as I wrote, I failed to reproduce it and just posted the code to give an idea about the code structure, so sorry for not expressing this well enough.

@aheejin
Member

aheejin commented Jun 9, 2022

I'm not sure if I did this right, but I downloaded your QT build and extracted it in ~/test/lo/core/qt5-5.15.2-neh. Then the autogen.input you provided... I guess I need to use them as options to autogen.sh right?

Also I cloned https://github.com/LibreOffice/core/ in ~/test/lo/core.

Then I ran this within ~/test/lo/core:

aheejin@aheejin:~/test/lo/core$ QT5DIR=$HOME/test/lo/qt5-5.15.2-neh ./autogen.sh --host=wasm64-local-emscripten --with-build-platform-configure-options=--enable-ccache --without-system-libxml --without-system-fontconfig --without-system-freetype --without-system-zlib --enable-ccache -enable-dbgutil --enable-symbols --enable-wasm-exceptions

It crashes with this:

checking for Qt5 libraries... /usr/local/google/home/aheejin/test/lo/qt5-5.15.2-neh/lib
configure: error: Missing llvm-nm expected to be found at "/usr/local/google/home/aheejin/emscripten/../bin/llvm-nm".
emconfigure: error: './configure --host=wasm64-local-emscripten --with-build-platform-configure-options=--enable-ccache --without-system-libxml --without-system-fontconfig --without-system-freetype --without-system-zlib --enable-ccache -enable-dbgutil --enable-symbols --enable-wasm-exceptions --srcdir=/usr/local/google/home/aheejin/test/lo/core --enable-option-checking=fatal' failed (returned 1)
Error running configure at ./autogen.sh line 322.

I have llvm-nm in my PATH, but I'm not sure why it tries to find it in /usr/local/google/home/aheejin/emscripten/../bin/llvm-nm, which is basically just ~/bin/llvm-nm. Where can I change this setting?

@sbc100
Collaborator

sbc100 commented Jun 9, 2022

I think that is because llvm-nm (and other llvm tools) live in emsdk/upstream/bin in the emsdk, which is ../bin relative to the emscripten directory emsdk/upstream/emscripten (where emscripten lives). So it looks like this script expects that emsdk layout.

IIRC the way to find the path to the llvm tools based on an emscripten install is to run ./em-config LLVM_ROOT. This will print the LLVM_ROOT config setting to stdout.

@aheejin
Member

aheejin commented Jun 9, 2022

I see, thanks. I don't use emsdk and use my local build of LLVM/binaryen/emscripten directly..

But em-config LLVM_ROOT correctly prints my current LLVM root though:

aheejin@aheejin:~/test/lo/core$ em-config LLVM_ROOT
/usr/local/google/home/aheejin/llvm-git/install.release/bin

I guess this autogen script or QT or something doesn't read this LLVM_ROOT...?

@jmglogow
Author

jmglogow commented Jun 9, 2022

I'm not sure if I did this right, but I downloaded your QT build and extracted it in ~/test/lo/core/qt5-5.15.2-neh. Then the autogen.input you provided... I guess I need to use them as options to autogen.sh right?

It's the same. You can just create the file autogen.input in the core git root with my pasted content and autogen.sh will use these flags as input for configure. That file is easier to manage than remembering all the configuration parameters. And yup, the alternative is to pass all of them to autogen.sh (which is actually a perl script...).

Also I cloned https://github.com/LibreOffice/core/ in ~/test/lo/core.

Then I ran this within ~/test/lo/core:

aheejin@aheejin:~/test/lo/core$ QT5DIR=$HOME/test/lo/qt5-5.15.2-neh ./autogen.sh --host=wasm64-local-emscripten --with-build-platform-configure-options=--enable-ccache --without-system-libxml --without-system-fontconfig --without-system-freetype --without-system-zlib --enable-ccache -enable-dbgutil --enable-symbols --enable-wasm-exceptions

It crashes with this:

checking for Qt5 libraries... /usr/local/google/home/aheejin/test/lo/qt5-5.15.2-neh/lib
configure: error: Missing llvm-nm expected to be found at "/usr/local/google/home/aheejin/emscripten/../bin/llvm-nm".
emconfigure: error: './configure --host=wasm64-local-emscripten --with-build-platform-configure-options=--enable-ccache --without-system-libxml --without-system-fontconfig --without-system-freetype --without-system-zlib --enable-ccache -enable-dbgutil --enable-symbols --enable-wasm-exceptions --srcdir=/usr/local/google/home/aheejin/test/lo/core --enable-option-checking=fatal' failed (returned 1)
Error running configure at ./autogen.sh line 322.

I have llvm-nm in my PATH, but I'm not sure why it tries to find it in /usr/local/google/home/aheejin/emscripten/../bin/llvm-nm, which is basically just ~/bin/llvm-nm. Where can I change this setting?

configure.ac expects to work with an emsdk upstream folder. All the paths are currently "hardcoded", based on em-config EMSCRIPTEN_ROOT. FWIW llvm-nm is used to check the Qt libs for the symbols of Emscripten EH or WASM EH, so you don't get late link errors, which normally don't explain the problem (see EMSDK_LLVM_NM usage in configure.ac). So you can just comment out the whole EMSDK_LLVM_NM block in configure.ac. Besides that, LO also uses "$(em-config EMSCRIPTEN_ROOT)"/tools/file_packager. I honestly don't know if there are better ways to find these executables.

When I posted my comment yesterday night, my build hadn't finished. I got additional linking errors, which I have now fixed with https://gerrit.libreoffice.org/c/core/+/135519. You can either git pull with a few more changes and eventually a larger rebuild, or just do a git fetch origin + git cherry-pick 136fac12eb9752f1072f852cc193d6a9accdc4a7. That does link here and qt_soffice.html shows the error.

Thanks for looking into this.

@jmglogow
Author

jmglogow commented Jun 9, 2022

Yikes - I always read em-config EMSCRIPTEN_ROOT when you actually wrote LLVM_ROOT. So now there is https://gerrit.libreoffice.org/c/core/+/135565. Will take approximately an hour to get through LO CI to be merged.

@aheejin
Member

aheejin commented Jun 10, 2022

I was able to build it, thanks! Now I'm seeing the same error message:
[screenshot of the error message]

But I'm having trouble finding where I should start debugging this or even where this error occurs or anything. I don't even have a stack trace after it crashes. I checked "Pause on exceptions" in Chrome developer tools but got nothing. Also "Console" pane doesn't seem to contain that string RuntimeError: null function or function signature mismatch or something similar.

Do you have any advice on how to proceed? Or where should I set a breakpoint?

@jmglogow
Author

jmglogow commented Jun 11, 2022

Yikes - I always read em-config EMSCRIPTEN_ROOT when you actually wrote LLVM_ROOT. So now there is https://gerrit.libreoffice.org/c/core/+/135565. Will take approximately an hour to get through LO CI to be merged.

I just merged this; it took a little bit longer than the hour I originally assumed (LO CI had some Windows troubles, which seem to be resolved now).

I was able to build it, thanks!

Great. Sorry that you had to weed-out some stuff I wasn't yet aware of. OTOH and FWIW, the LO WASM build should be easier now for others ;-)

But I'm having trouble finding where I should start debugging this or even where this error occurs or anything. I don't even have a stack trace after it crashes. I checked "Pause on exceptions" in Chrome developer tools but got nothing. Also "Console" pane doesn't seem to contain that string RuntimeError: null function or function signature mismatch or something similar.

Do you have any advice on how to proceed? Or where should I set a breakpoint?

A first step would be to cherry-pick commit e5572ca83a15be900aaecefd415d3ad31d34200c. That should work around the bug and actually let LO start in the browser. Its diff also shows the problematic Start() call. It goes to vcl/source/app/timer.cxx. If you're using vim or some other editor with ctags support, make tags may be helpful. If you want to use "printf debugging", for LO that is encapsulated by SAL_DEBUG("my var: " << myvar);. LO has built-in logging, which can be enabled by the environment variable SAL_LOG. This also exists in instdir/program/soffice.js as ENV.SAL_LOG. The original bug happens in LO's job scheduler, which can be logged via SAL_LOG=+WARN+INFO.vcl.schedule (or drop the other warnings by removing +WARN). The whole documentation is in include/sal/log.hxx:207. The logging targets are in include/sal/log-areas.dox. There is https://opengrok.libreoffice.org/, if you prefer this for searching in the code.

Generally you can have a look at make help. You can build a single module, like desktop with make desktop. The build flags for Emscripten are in solenv/gbuild/platform/EMSCRIPTEN_INTEL_GCC.mk, if you want to change some generic stuff. You probably want to see the real command output, which is visible using make <optional target> verbose=t.

The originally reported bug happens at vcl/source/app/scheduler.cxx:658 - the UpdateMinPeriod call is the crash. In the log you'll notice, that at that point m_firstRunTimer was already not logged by line 369 as a Timer, but by line 376 as a Task, showing the dynamic_cast<Timer*> failure from line 367.
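As a rough illustration of the "logged as a Timer vs. logged as a Task" distinction, here is a minimal stand-alone sketch. It is not the code from scheduler.cxx: the two classes only borrow the names Task and Timer, and describeTask is a hypothetical helper introduced for this example.

```cpp
#include <cassert>
#include <string>

// Simplified stand-ins for LO's classes; only the names are borrowed.
struct Task
{
    virtual ~Task() = default;
    virtual int UpdateMinPeriod() const { return -1; }
};

struct Timer : Task
{
    int UpdateMinPeriod() const override { return 100; }
};

// Hypothetical helper: report the most derived type a logger can see.
// With intact type information a Timer is reported as "Timer"; falling
// back to "Task" for an object that was created as a Timer is the
// symptom described above.
std::string describeTask(const Task* pTask)
{
    if (!pTask)
        return "null";
    if (dynamic_cast<const Timer*>(pTask))
        return "Timer";
    return "Task";
}
```

For a live Timer this returns "Timer"; seeing "Task" for an object known to be a Timer suggests that the object (or its vtable pointer) is no longer valid at that point.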

There is also the Emscripten-generated instdir/program/soffice.html. That normally has better debugging and backtrace support. Since you don't need to interact with LO, this might be the better entry point for debugging.

HTH to get you started.

P.S. there is https://wiki.documentfoundation.org/Development/How_to_debug, but that won't help you with WASM.

@aheejin
Member

aheejin commented Jun 11, 2022

Sorry, I have zero knowledge about LibreOffice internals, so it's not easy to follow what you say.. 😢

I cherry-picked the commit e5572ca83a15be900aaecefd415d3ad31d34200c as you suggested, and now it works. But this doesn't help me with anything, because it just works, and it doesn't reproduce the error. Without the cherry-picked commit, it crashes, but as I said, I don't have any stack traces or anything, so it's hard to know where to start..

The originally reported bug happens at vcl/source/app/scheduler.cxx:658 - the UpdateMinPeriod call is the crash

vcl/source/app/scheduler.cxx:658 has nothing but comment. It's the last line of the file.

What I would like is some more info than the text RuntimeError: null function or function signature mismatch. Stack traces, or really anything about the point where the crash happens. Things like, if there's a null function, at which callsite does the error occur? If function signatures mismatch between a callsite and a callee, what's the callsite? I think you told me about this, but the line number seems incorrect. Can you please check?

@jmglogow
Author

Sorry, I have zero knowledge about LibreOffice internals, so it's not easy to follow what you say.. 😢

No worries. I'm happy someone actually invests time in this, who might be able to fix the bug (or produce a smaller reproducer, or give any other insight, even helping me to debug this further). I would be a bit embarrassed, if it were some LO specific problem… but then I also have no real way to debug / detect this. I know the general concepts of WASM (stack VM, those function tables, which verify function signatures, etc.), but I generally found porting LO to WASM more like trial-and-error, compared to the Windows Arm64 port I also did (including learning Arm64 assembler to implement LO's own FFI implementation).

I cherry-picked the commit e5572ca83a15be900aaecefd415d3ad31d34200c as you suggested, and now it works. But this doesn't help me with anything, because it just works, and it doesn't reproduce the error. Without the cherry-picked commit, it crashes, but as I said, I don't have any stack traces or anything, so it's hard to know where to start..

The originally reported bug happens at vcl/source/app/scheduler.cxx:658 - the UpdateMinPeriod call is the crash

vcl/source/app/scheduler.cxx:658 has nothing but comment. It's the last line of the file.

A copy and paste error: currently it's line 395. It's the first call of UpdateMinPeriod in Scheduler::CallbackTaskScheduling, when LO walks the task lists to find the next one to run / process / execute or to sleep until the next is ready.

What I would like is some more info than the text RuntimeError: null function or function signature mismatch. Stack traces, or really anything about the point where the crash happens. Things like, if there's a null function, at which callsite does the error occur? If function signatures mismatch between a callsite and a callee, what's the callsite? I think you told me about this, but the line number seems incorrect. Can you please check?

The original callsite is the one described above. Scheduler::CallbackTaskScheduling is LO's job / task scheduling function. More info can be found in vcl/README.scheduler.md. But the gist is that LO can post a Task to be done at some time to the scheduler. These have a priority, a function to tell when they are ready and a function to execute the task. LO then has a single timer to process these tasks. Almost everything LO does happens via Scheduler Tasks.
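To make the gist concrete, here is a heavily simplified sketch of such a scheduler. It is only a model under assumptions: SimpleTask, SimpleScheduler and runOnce are hypothetical names, "lower number means more urgent" is an arbitrary choice for this example, and the real implementation in vcl/source/app/scheduler.cxx is organised differently.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical, simplified task: a lower nPriority value means more urgent.
struct SimpleTask
{
    int nPriority;
    // Milliseconds until the task is ready at time nNowMs; 0 means ready now.
    std::function<std::uint64_t(std::uint64_t)> aMsUntilReady;
    // The work to do once the task is picked.
    std::function<void()> aInvoke;
};

class SimpleScheduler
{
public:
    void add(SimpleTask aTask) { m_tasks.push_back(std::move(aTask)); }

    // Run the most urgent ready task, if any. Returns 0 if a task ran,
    // otherwise the shortest wait (in ms) until some task becomes ready,
    // or the maximum value if there are no tasks at all.
    std::uint64_t runOnce(std::uint64_t nNowMs)
    {
        std::uint64_t nMinWait = std::numeric_limits<std::uint64_t>::max();
        auto itBest = m_tasks.end();
        for (auto it = m_tasks.begin(); it != m_tasks.end(); ++it)
        {
            const std::uint64_t nWait = it->aMsUntilReady(nNowMs);
            if (nWait == 0)
            {
                if (itBest == m_tasks.end() || it->nPriority < itBest->nPriority)
                    itBest = it;
            }
            else if (nWait < nMinWait)
                nMinWait = nWait;
        }
        if (itBest == m_tasks.end())
            return nMinWait;
        // Remove the task before invoking it, so it runs exactly once.
        SimpleTask aTask = std::move(*itBest);
        m_tasks.erase(itBest);
        aTask.aInvoke();
        return 0;
    }

private:
    std::vector<SimpleTask> m_tasks;
};
```

A single timer would call runOnce whenever it fires and re-arm itself with the returned wait time. The point relevant to this bug is that the scheduler only holds references to tasks, so a task that is destroyed while still reachable from the scheduler or from a pending event becomes a dangling reference.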

So AFAIK a backtrace won't help you much, because it will be the system event loop, triggering the Scheduler timeout, the Scheduler searching for the next task to process. Here is the backtrace I see in the JavaScript console, if I run soffice.html instead of qt_soffice.html:

Uncaught RuntimeError: null function or function signature mismatch
    at Scheduler::CallbackTaskScheduling() (scheduler.cxx:395)
    at SalTimer::CallCallback() (saltimer.hxx:54)
    at QtTimer::timeoutActivated() (QtTimer.cxx:51)
    at QtTimer::qt_static_metacall(QObject*, QMetaObject::Call, int, void**) (QtTimer.moc:91)
    at void doActivate<false>(QObject*, int, void**) (soffice.wasm:0x54e471b)
    at QMetaObject::activate(QObject*, QMetaObject const*, int, void**) (soffice.wasm:0x54eabed)
    at QTimer::timerEvent(QTimerEvent*) (soffice.wasm:0x54ed048)
    at QObject::event(QEvent*) (soffice.wasm:0x54e4d9f)
    at QApplicationPrivate::notify_helper(QObject*, QEvent*) (soffice.wasm:0x580f35a)
    at QApplication::notify(QObject*, QEvent*) (soffice.wasm:0x5812b32)
    at QCoreApplication::notifyInternal2(QObject*, QEvent*) (soffice.wasm:0x5498234)

Hope this helps anyway.

@aheejin
Member

aheejin commented Jun 23, 2022

I dug into this for some time. I still haven't found out why this fails, but here are some observations.

I have confirmed the behavior you described in #16572 (comment), #16572 (comment), and #16572 (comment).

I think the reason the last Timer cast failed was memory corruption. If you log each Timer's address from the constructor, the destructor, and Scheduler::CallbackTaskScheduling where that cast fails, you can see the program is trying to use a Timer that was already deleted.
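The kind of address logging described here can be sketched as below. This is a generic illustration rather than the instrumentation actually used: TracedTimer, liveTimers and checkBeforeUse are hypothetical names, and the sketch assumes single-threaded use.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <set>

// Addresses are stored as integers, so a stale address is never dereferenced.
using Address = std::uintptr_t;

inline Address toAddress(const void* p) { return reinterpret_cast<Address>(p); }

// Hypothetical registry of the addresses of all currently alive timers.
std::set<Address>& liveTimers()
{
    static std::set<Address> aLive;
    return aLive;
}

// Stand-in for a timer that logs its address on construction and destruction.
struct TracedTimer
{
    TracedTimer()
    {
        liveTimers().insert(toAddress(this));
        std::fprintf(stderr, "%#llx ctor\n",
                     static_cast<unsigned long long>(toAddress(this)));
    }
    ~TracedTimer()
    {
        liveTimers().erase(toAddress(this));
        std::fprintf(stderr, "%#llx dtor\n",
                     static_cast<unsigned long long>(toAddress(this)));
    }
    TracedTimer(const TracedTimer&) = delete;
    TracedTimer& operator=(const TracedTimer&) = delete;
};

// Log an address at the point of use and report whether it is still alive.
bool checkBeforeUse(Address nAddr)
{
    const bool bAlive = liveTimers().count(nAddr) != 0;
    std::fprintf(stderr, "%#llx use (%s)\n",
                 static_cast<unsigned long long>(nAddr),
                 bAlive ? "alive" : "ALREADY DESTROYED");
    return bAlive;
}
```

A "use" line that follows a "dtor" line for the same address, with no "ctor" in between, points at a use after destruction. Addresses can be reused by later objects, so this is a debugging aid and not a reliable liveness check.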

The task that was deleted was Desktop::m_firstRunTimer: https://github.com/LibreOffice/core/blob/b901fd3b27191d5565376c5a708da43a2ac0f6ee/desktop/inc/app.hxx#L169
and the reason it was deleted was that Desktop's destructor was called. It is the same situation when you change it to Timer *m_firstRunTimer and do delete m_firstRunTimer within the destructor, which also fails.

The reason Desktop's destructor was called was that soffice_main was terminated and with that its stack objects were destroyed, including this Desktop object:
https://github.com/LibreOffice/core/blob/b901fd3b27191d5565376c5a708da43a2ac0f6ee/desktop/source/app/sofficemain.cxx#L67

I'm not sure how long the main thread that runs soffice_main is supposed to live, but it looks like the main thread ends first, destroying objects including the Desktop (and with it its m_firstRunTimer), while other threads continue to run and access the memory the main thread freed. I don't know much about the structure of the program so I'm not sure where the other threads are spawned.

The reason soffice_main ends is because this call throws:
https://github.com/LibreOffice/core/blob/b901fd3b27191d5565376c5a708da43a2ac0f6ee/desktop/source/app/sofficemain.cxx#L94

I tracked down the leaf level function that actually throws:
soffice_main -> SVMain:
https://github.com/LibreOffice/core/blob/b901fd3b27191d5565376c5a708da43a2ac0f6ee/desktop/source/app/sofficemain.cxx#L94
SVMain -> ImplSVMain:
https://github.com/LibreOffice/core/blob/b901fd3b27191d5565376c5a708da43a2ac0f6ee/vcl/source/app/svmain.cxx#L234
ImplSVMain -> Desktop::Main
https://github.com/LibreOffice/core/blob/b901fd3b27191d5565376c5a708da43a2ac0f6ee/vcl/source/app/svmain.cxx#L202
Desktop::Main -> Application::Execute
https://github.com/LibreOffice/core/blob/b901fd3b27191d5565376c5a708da43a2ac0f6ee/desktop/source/app/app.cxx#L1600
Application::Execute -> QtInstance::DoExecute
https://github.com/LibreOffice/core/blob/b901fd3b27191d5565376c5a708da43a2ac0f6ee/vcl/source/app/svapp.cxx#L444
QtInstance::DoExecute -> QApplication::exec
https://github.com/LibreOffice/core/blob/b901fd3b27191d5565376c5a708da43a2ac0f6ee/vcl/qt5/QtInstance.cxx#L734

I couldn't track down further because QApplication::exec is from an external library. I'm not familiar with that and I have no idea why it would throw. Also when the main thread throws, my Chrome developer tools somehow didn't show any backtraces, so I couldn't see the backtrace lower than QApplication::exec.

Anyway, to sum up, because QApplication::exec throws, the main thread crashes with an exception and deletes its stack Desktop object in soffice_main with its Timer member variable, which is still accessed later in other threads. I don't know if the reason QApplication::exec throws is due to an application bug or compiler bug at this point though.

@jmglogow
Author

I dug into this for some time. I still haven't found out why this fails, but here are some observations.

I have confirmed the behavior you described in #16572 (comment), #16572 (comment), and #16572 (comment).

I think the reason the last Timer cast failed was memory corruption. If you log each Timer's address from the constructor, the destructor, and Scheduler::CallbackTaskScheduling where that cast fails, you can see the program is trying to use a Timer that was already deleted.

I'll try to verify this later with latest Emscripten and a fresh build. Thing is Task::~Task() has https://github.com/LibreOffice/core/blob/master/vcl/source/app/scheduler.cxx#L652, so it should not be possible to have a memory corruption, because the ImplSchedulerData would contain nullptr. If the Timer (which is a Task: https://github.com/LibreOffice/core/blob/master/include/vcl/timer.hxx#L26) is destroyed, it's automatically removed from the Scheduler. The Scheduler is protected by a Mutex, so I can't see, how this is happening. Nothing at this point should be multi-threaded anyway. I'll push a debug patch into the LO repo, so you can verify my own findings easier.

I couldn't track down further because QApplication::exec is from an external library.

Yeah, I didn't include the debug WASM into the Qt upload.

Honestly, I'm now as puzzled as before.

@aheejin
Member

aheejin commented Jun 26, 2022

I'll try to verify this later with latest Emscripten and a fresh build. Thing is Task::~Task() has https://github.com/LibreOffice/core/blob/master/vcl/source/app/scheduler.cxx#L652, so it should not be possible to have a memory corruption, because the ImplSchedulerData would contain nullptr. If the Timer (which is a Task: https://github.com/LibreOffice/core/blob/master/include/vcl/timer.hxx#L26) is destroyed, it's automatically removed from the Scheduler. The Scheduler is protected by a Mutex, so I can't see, how this is happening. Nothing at this point should be multi-threaded anyway. I'll push a debug patch into the LO repo, so you can verify my own findings easier.

The timeline is:

  1. The Timer is created as the member variable m_firstRunTimer when the Desktop instance is created here. At this point its mpSchedulerData member is nullptr: https://github.com/LibreOffice/core/blob/b901fd3b27191d5565376c5a708da43a2ac0f6ee/desktop/source/app/sofficemain.cxx#L67
  2. The Timer is deleted because QApplication::exec throws, terminating soffice_main and deleting all its stack objects, including the Desktop created in 1. At this point, because its mpSchedulerData is still nullptr, no removal from the scheduler occurs: https://github.com/LibreOffice/core/blob/423f277cc0c185ff7eaf79aa9237585c52e0c652/vcl/source/app/scheduler.cxx#L651
  3. Task::Start is called for this deleted Timer, which shouldn't occur. There it creates a new ImplSchedulerData instance and assigns it to mpSchedulerData. https://github.com/LibreOffice/core/blob/423f277cc0c185ff7eaf79aa9237585c52e0c652/vcl/source/app/scheduler.cxx#L569-L574
  4. The Timer comes up in the scheduler: https://github.com/LibreOffice/core/blob/423f277cc0c185ff7eaf79aa9237585c52e0c652/vcl/source/app/scheduler.cxx#L367

So the fact that the destructor removes the task from the scheduler doesn't matter here.

I tried to find who calls Timer::Start, which begins the process 3 above. I think it's called by some event loop which I'm not familiar with. What I found is,
Desktop::Main calls Application::PostUserEvent to schedule OpenClients_Impl callback:
https://github.com/LibreOffice/core/blob/423f277cc0c185ff7eaf79aa9237585c52e0c652/desktop/source/app/app.cxx#L1576-L1579
Application::PostUserEvent calls QtFrame::PostEvent:
https://github.com/LibreOffice/core/blob/423f277cc0c185ff7eaf79aa9237585c52e0c652/vcl/source/app/svapp.cxx#L1139
QtFrame::PostEvent calls SalUserEventList::PostEvent:
https://github.com/LibreOffice/core/blob/423f277cc0c185ff7eaf79aa9237585c52e0c652/vcl/qt5/QtFrame.cxx#L313
SalUserEventList::PostEvent calls QtInstance::TriggerUserEventProcessing:
https://github.com/LibreOffice/core/blob/423f277cc0c185ff7eaf79aa9237585c52e0c652/vcl/inc/salusereventlist.hxx#L118
QtInstance::TriggerUserEventProcessing calls QAbstractEventDispatcher::wakeUp, which is an external library function:
https://github.com/LibreOffice/core/blob/423f277cc0c185ff7eaf79aa9237585c52e0c652/vcl/qt5/QtInstance.cxx#L478

That OpenClients_Impl function, which was scheduled as a callback above, is what causes Timer::Start to run. OpenClients_Impl calls CheckFirstRun:
https://github.com/LibreOffice/core/blob/423f277cc0c185ff7eaf79aa9237585c52e0c652/desktop/source/app/app.cxx#L1912
CheckFirstRun calls Timer::Start on m_firstRunTimer, the Timer we are talking about:
https://github.com/LibreOffice/core/blob/423f277cc0c185ff7eaf79aa9237585c52e0c652/desktop/source/app/app.cxx#L2557-L2559

So basically we need to figure out why QApplication::exec throws. And probably, when a Task is deleted, we need to make sure it's also removed from the event handling loop above, so its Start function doesn't get called later..?
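One common way to keep a deferred callback from touching a destroyed object is a liveness token. The sketch below shows that generic technique and is not a proposed LO patch: EventQueue, GuardedTask and postStart are hypothetical names, and it assumes everything runs on a single thread.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <queue>
#include <utility>

// Hypothetical minimal event queue; a stand-in for the real event loop.
class EventQueue
{
public:
    void post(std::function<void()> aCallback) { m_queue.push(std::move(aCallback)); }

    void processAll()
    {
        while (!m_queue.empty())
        {
            auto aCallback = std::move(m_queue.front());
            m_queue.pop();
            aCallback();
        }
    }

private:
    std::queue<std::function<void()>> m_queue;
};

// Stand-in for a task whose Start() may be called from a posted event.
class GuardedTask
{
public:
    explicit GuardedTask(int& rStartCount)
        : m_rStartCount(rStartCount)
        , m_pAlive(std::make_shared<bool>(true))
    {
    }

    void Start() { ++m_rStartCount; }

    // Post a deferred Start(). The callback holds only a weak reference to
    // the liveness token, so it becomes a no-op once the task is destroyed.
    void postStart(EventQueue& rQueue)
    {
        std::weak_ptr<bool> pWeak = m_pAlive;
        rQueue.post([this, pWeak] {
            if (pWeak.lock())
                Start();
        });
    }

private:
    int& m_rStartCount;
    std::shared_ptr<bool> m_pAlive;
};
```

When the task is destroyed, its token dies with it and any callback still in the queue silently does nothing. This only hides the symptom, though; the underlying question of why the object is destroyed early remains.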

@aheejin
Member

aheejin commented Jun 27, 2022

By the way, have you tried the address sanitizer? It might help you with diagnosing memory issues.

@jmglogow
Author

So basically we need to figure out why QApplication::exec throws. And probably, when a Task is deleted, we need to make sure it's also removed from the event handling loop above, so its Start function doesn't get called later..?

So I'm back in this old thread. OTOH it was probably good to get away from the problem to come back with new "energy". And sorry for the late reply. Today I started adding qWarnings to the Qt WASM code and found the exception was coming from the emscripten_set_main_loop_arg call with simulate_infinite_loop = true. The function has additional documentation since PR #16871, which explains the behavior nicely:

When simulate_infinite_loop is true, emscripten_set_main_loop does not work with Wasm EH, because throwing a JS exception will cause destructors in the Wasm stack frames to run.

It was a little bit more frustrating, because I added a try {} catch (...) {} around that Qt call in LO (and various others), which didn't catch anything, but Qt was built without exception support… in fact, when I enabled / hacked in exceptions in Qt, I got a warning from its main loop that Qt "had caught an exception from an event handler"… And that is actually what I was seeing and always wondering about: why the Desktop object was destructed in the first place. And it also explains why Emscripten exceptions work.
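The effect can be modelled in plain C++, without any Emscripten API. This is only an analogy under assumptions: FakeUnwind stands in for the JS exception, fakeEnterMainLoop for a main-loop entry that never returns normally, and StackObject for a stack object like Desktop; all of these names are made up for the example.

```cpp
#include <cassert>

// Stand-in for the JS exception that is thrown to leave the caller's frames.
struct FakeUnwind
{
};

// Model of a "never returns" main-loop entry: it leaves by throwing.
[[noreturn]] inline void fakeEnterMainLoop() { throw FakeUnwind{}; }

// Stand-in for a stack object like Desktop; records whether it is alive.
class StackObject
{
public:
    explicit StackObject(bool& rAlive)
        : m_rAlive(rAlive)
    {
        m_rAlive = true;
    }
    ~StackObject() { m_rAlive = false; }

private:
    bool& m_rAlive;
};

// Model of soffice_main: creates the stack object, then enters the loop.
void fakeMain(bool& rAlive)
{
    StackObject aObject(rAlive);
    fakeEnterMainLoop();
}

// Returns whether the stack object is still alive after the loop was entered.
bool objectSurvivesLoopEntry()
{
    bool bAlive = false;
    try
    {
        fakeMain(bAlive);
    }
    catch (const FakeUnwind&)
    {
        // The "runtime" swallows the unwind and would keep running callbacks.
    }
    return bAlive;
}
```

In this model objectSurvivesLoopEntry() returns false: the unwind runs the destructor, so callbacks that fire afterwards would see a destroyed object. That mirrors the quoted documentation for Wasm EH; if the throw does not run destructors in the Wasm frames, the object stays alive, which fits the observation that Emscripten exceptions work.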

In Qt 6.4, WASM got a new main loop that still uses simulate_infinite_loop = true, but its callback is just an emscripten_pause_main_loop: https://github.com/qt/qtbase/blob/dbf6e2db3bb724669b60fdd22221ed023ec1d739/src/corelib/kernel/qeventdispatcher_wasm.cpp#L380

No idea if this could prevent the documented JS exception problem. I'll try to build Qt6 WASM again, which I failed to do at the beginning of the year. 6.4 should be much less buggy w.r.t. WASM.

"Magic" exception handling without unwinding the stack is definitely something unexpected for me.

I still might be wrong, as I found no way to catch any exception in the C++ code, but my guess is this issue can be now closed, if there is one already tracking the missing implementation.
