Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

emscripten demo broken, probably highlights underlying problem linked to dictionary compilation #584

Closed
BenTalagan opened this issue Feb 14, 2019 · 12 comments

Comments

@BenTalagan
Copy link
Contributor

@BenTalagan BenTalagan commented Feb 14, 2019

Hi all,

The emscripten demo has been broken this year : currently the symptom is that it will block for quite a while and then crash with a :

Bad rules data in 'en_dict' at 0xff948a30

It took me a while to bisect it but the demo started to behave badly on this precise commit : 55c6403 although it would work on the precedent commit 0e91fcb.

Seems to me that having a crash in one environment (emscripten) and not the other (unix/mac) may highlight something nastier behind the hood.

One remark, the demo is also out of date, emscripten has evolved and some modifications affect the compilation. These flags should be added to the Makefile for EM_CXXFLAGS : -s FORCE_FILESYSTEM=1 -s WASM=0 and in the post.js, Runtime. should be removed before addFunction and removeFunction. I can prepare a PR if it helps, but it won't fix the underlying dictionary compilation pb.

Ben

@valdisvi

This comment has been minimized.

Copy link
Member

@valdisvi valdisvi commented Feb 15, 2019

With this commit (and some other changes later) espeak-ng has changed file format for compiled rules (..dict files located in espeak-ng-data folder). This error may occur when newer engine is used with older data files or vice versa.

@BenTalagan

This comment has been minimized.

Copy link
Contributor Author

@BenTalagan BenTalagan commented Feb 15, 2019

When I compile for emscripten, I compile from scratch each time so as to avoid taking the risk of embedding old data. This is my complete build flow (from root directory) :

make clean
make distclean
./configure --prefix=/usr --without-async --without-mbrola --without-sonic
make en
cd src/ucd-tools
make clean
emconfigure ./configure
emmake make clean
emmake make
cd ../..
emconfigure ./configure --prefix=/usr --without-async --without-mbrola --without-sonic
emmake make clean
emmake make src/libespeak-ng.la
cd emscripten
emmake make clean
emmake make

So I may be wrong, but imho I don't think I'm using outdated data, just those who are present at the commit in question, and looking at the date of the generated data files, these data seem to be definitely repacked at each build. So if it's not a problem of version compatibility like I think, my first guess would be some problem of packing and alignment which is common between compilers if you don't take extra care when generating binary data.

@BenTalagan

This comment has been minimized.

Copy link
Contributor Author

@BenTalagan BenTalagan commented Feb 15, 2019

By the way, here is a PR #586 to be able to compile again with newer versions of emscripten.

@valdisvi

This comment has been minimized.

Copy link
Member

@valdisvi valdisvi commented Feb 18, 2019

I tried to test your pull request #586, but I got problem executing emmake make in emscripten folder (step 4. in emscripten documentation):

python /usr/share/emscripten/tools/webidl_binder.py espeakng_glue.idl glue
Traceback (most recent call last):
  File "/usr/share/emscripten/tools/webidl_binder.py", line 15, in <module>
    import WebIDL
ImportError: No module named WebIDL
Makefile:122: recipe for target 'glue.cpp' failed
make: *** [glue.cpp] Error 1

How did you solve this?
(I'm using 64-bit (L)Ubuntu 16.04)

@BenTalagan

This comment has been minimized.

Copy link
Contributor Author

@BenTalagan BenTalagan commented Feb 18, 2019

On my side, I'm on MacOs Mojave with the default python install, and I use emscripten via a git pull of emsdk (the tool for managing multiple versions of emscripten). From emsdk, I have installed and activated the 1.38.26 version of emscripten.

Reading from the net, your problem looks more like a problem of python installation itself (used by the emscripten toolchain) - it's the first time I see this error.

You can check on the installation page of emscripten that you've not missed a point during the install, here : https://emscripten.org/docs/getting_started/downloads.html

or maybe try to update your python installation ? What is your version? 2.7.12 is at least required (stated by the emscripten page).

@BenTalagan

This comment has been minimized.

Copy link
Contributor Author

@BenTalagan BenTalagan commented Mar 6, 2019

One possible explanation that occurs to me is that emscripten uses 32-bit technology. I may be wrong (I haven't had a look in detail in dictionary compiler sourcecode) but it seems to me that part of the compiled dict format is integer-aligned, without taking care of sizeof(int). Thus, compiling the rules on a 64-bit platform, then packaging them for a 32-bit emscripten is risky, isn't it?

@BenTalagan

This comment has been minimized.

Copy link
Contributor Author

@BenTalagan BenTalagan commented Nov 18, 2019

After taking some time to investigate, I think I have found the problem. It comes from the following lines :

while (*(unsigned int *)p != 0)
p++;

They behave differently when compiled with llvm and emscripten. Under llvm, like with gcc, this will have what I would call an 'expected' behaviour : the cast to unsigned int from any position in the char* buffer will take into account the fact that we are not aligned to a multiple of 4 bytes. Under emscripten it doesn't : shifting by n+0, n+1, n+2 or n+3 bytes leads indifferently to the same result when casting to an int. One of the rules of the 'en' dictionary falls under this case, so the condition of having 4 successive bytes at 0 is not met and the rule parser explodes.

@rhdunn, I'd like your opinion on that issue : should we implement a simple fix for this (like testing the four bytes instead of casting to unsigned int), are there any other part of the code that may be concerned?

@BenTalagan

This comment has been minimized.

Copy link
Contributor Author

@BenTalagan BenTalagan commented Nov 18, 2019

@BenTalagan

This comment has been minimized.

Copy link
Contributor Author

@BenTalagan BenTalagan commented Nov 18, 2019

After implementing a temp fix :

while (p[0] != 0 && p[1] != 0 && p[2] != 0 && p[3] != 0) {
				p++;
			}

the parsing of the rules looks ok now, but the translation is still messed up. Found at least one suspicious place (within commit 55c6403) :

static const char *FindReplacementChars(Translator *tr, const char **pfrom, unsigned int c, const char *next, int *ignore_next_n)
{
const char *from = *pfrom;
while (*(unsigned int *)from != 0) {
unsigned int fc = 0; // from character
unsigned int nc = c; // next character
const char *match_next = next;

@BenTalagan

This comment has been minimized.

Copy link
Contributor Author

@BenTalagan BenTalagan commented Nov 18, 2019

Ok, after fixing the condition in FindReplacementChars, it seems I can get back a working generation/transcription with emscripten. I'd still need some expertise to tell me if I'm missing some potential similar alignment problems.

@rhdunn

This comment has been minimized.

Copy link
Member

@rhdunn rhdunn commented Nov 18, 2019

Thanks for the analysis. It looks like a version of Read4Bytes (https://github.com/espeak-ng/espeak-ng/blob/master/src/libespeak-ng/readclause.c#L280) for a const char * is needed to fix this -- renaming Read4Bytes to fread_uint32 and create a read_uint32 function. The code would then need to be audited to avoid direct casting to unsigned int *.

@BenTalagan

This comment has been minimized.

Copy link
Contributor Author

@BenTalagan BenTalagan commented Nov 19, 2019

@rhdunn : Thanks for your answer ! I have prepared a PR (#676), and limited myself to add a function to test sequential bytes to zero. It's very close to what was intended originally and non intrusive (the original code only tests four bytes, but after that they are still read one by one, not 4 by 4).

@rhdunn rhdunn closed this Nov 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.