Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Build fails on MacOS Catalina #674

Closed
BenTalagan opened this issue Nov 15, 2019 · 41 comments
Closed

Build fails on MacOS Catalina #674

BenTalagan opened this issue Nov 15, 2019 · 41 comments

Comments

@BenTalagan
Copy link
Contributor

@BenTalagan BenTalagan commented Nov 15, 2019

Trying to build espeak-ng on Catalina today (on two different machines) resulted in a fail during the compilation of phoneme data. The tool seems to perform strange operations :

...
  CC       src/libespeak-ng/la-event.lo
  CC       src/libespeak-ng/la-fifo.lo
  CCLD     src/libespeak-ng.la
  CCLD     src/espeak-ng
ESPEAK_DATA_PATH=/Users/ben/poub/espeak-ng src/espeak-ng --compile-intonations && \
		ESPEAK_DATA_PATH=/Users/ben/poub/espeak-ng src/espeak-ng --compile-phonemes && \
		touch phsource/phonemes.stamp
Compiled 26 intonation tunes: 0 errors.
Unknown phoneme table: 'en'
Can't read dictionary file: '/Users/ben/poub/espeak-ng/espeak-ng-data/en_dict'
Compiling phoneme data: /Users/ben/poub/espeak-ng/espeak-ng-data/../phsource/phonemes
phonemes(129): Expected a number
phonemes(129): Expected ')'
phonemes(129): The phoneme feature is not recognised: '1='.
phonemes(129): The phoneme feature is not recognised: '2'.
phonemes(129): The phoneme feature is not recognised: '2700'.
phonemes(129): The phoneme feature is not recognised: '400'.
phonemes(129): The phoneme feature is not recognised: '600'.
phonemes(129): The phoneme feature is not recognised: '300'.
phonemes(129): The phoneme feature is not recognised: '80'.
phonemes(131): The phoneme feature is not recognised: '1'.
phonemes(132): The phoneme feature is not recognised: '_'.
phonemes(132): The phoneme feature is not recognised: 'engthmod'.
phonemes(133): The phoneme feature is not recognised: '3'.
phonemes(137): Expected AND, OR, THEN
phonemes(137): Expected a condition, not 'phonemetable'
phonemes(137): Unexpected keyword 'phonemetable'
phonemes(137): Expected AND, OR, THEN
phonemes(138): Expected a condition, not 'base1'
phonemes(138): Unexpected keyword 'base1'
phonemes(138): Expected AND, OR, THEN
phonemes(140): Expected a condition, not 'ph_icelandic'
phonemes(140): Unexpected keyword 'ph_icelandic'
phonemes(140): Expected AND, OR, THEN
phonemes(140): Expected a condition, not 'ja'
phonemes(140): Unexpected keyword 'ja'
phonemes(141): Expected AND, OR, THEN
... etc ... etc ...

Strangely, I get similar bad results on older commits (for which I'm sure it was working before under MacOS Mojave). Does anyone have a clue on how to investigate this? Could it be related to the fact that 64-bits applications and compilation have been ditched from MacOS (is espeak-ng 64-bits bullet proof) ?

Ben

@rhdunn

This comment has been minimized.

Copy link
Member

@rhdunn rhdunn commented Nov 15, 2019

That's strange. It should work on 64-bit.

The error messages relate to compiling the phsource/phonemes file, but the line numbers look wrong. See CompilePhonemeFiles in src/libespeak-ng/compiledata.c.

"Expected a number" is from https://github.com/espeak-ng/espeak-ng/blob/master/src/libespeak-ng/compiledata.c#L819.

"Expected ')'" is from https://github.com/espeak-ng/espeak-ng/blob/master/src/libespeak-ng/compiledata.c#L877.

The "The phoneme feature is not recognised" messages come from the ENS_UNKNOWN_PHONEME_FEATURE error code. It is coming from a call to the phoneme_add_feature function in phoneme.c at https://github.com/espeak-ng/espeak-ng/blob/master/src/libespeak-ng/compiledata.c#L2024.

Looking at that code, we know several things:

  1. the values passed to phoneme_add_feature appear to be missing the first character (e.g. "engthmod" instead of "lengthmod");
  2. from the code, NextItem (https://github.com/espeak-ng/espeak-ng/blob/master/src/libespeak-ng/compiledata.c#L740) is returning a value less than 0 as that is what this if branch is checking;
  3. this appears to be after processing whitespace.

I would start by printing values and control flow in NextItem to see what is happening there. NOTE: item_string is the string value of the token being read.


My intuition given the above information would be that the issue is at https://github.com/espeak-ng/espeak-ng/blob/master/src/libespeak-ng/compiledata.c#L760 in that it is checking for \n when identifying the end of a comment, so is likely incorrectly processing the file when using Mac (\r) line endings.

There are likely other cases like that, which would also explain why the line numbers are wrong in the output.

@BenTalagan

This comment has been minimized.

Copy link
Contributor Author

@BenTalagan BenTalagan commented Nov 16, 2019

Hello Reece and thanks a lot for having taken the time to provide some clues. I am currently investigating ; adding a printf("%s\n", item_string); at :

will yield the following output :

... a few dozens of numbers
35
5
1
1
75
1
150
1
200
1
10
1
10
3
welStarts
phonemes(129): Expected a number
phonemes(129): Expected ')'
... etc ... (after that the parsing is messy)

So everything looks good until the parsing gets messy, the strangest thing being that the first occurence of a bad item parsing yields welStarts which looks like a chunk of 'NextVowelStarts'. I'm continuing my investigation.

@BenTalagan

This comment has been minimized.

Copy link
Contributor Author

@BenTalagan BenTalagan commented Nov 16, 2019

Second remark, investigating on the LF ('\n') and CR ('\r') track : I have only found two phoneme files with '\r', namely ph_nahuatiand ph_setswana. I have regularized these files with LF but this has no impact ; so the problem might be elsewhere.

@BenTalagan

This comment has been minimized.

Copy link
Contributor Author

@BenTalagan BenTalagan commented Nov 16, 2019

Ok, further investigation. Adding some debug here :

printf("%d: %s : %d\n", type, item_string, strlen(item_string));

And also at the start of each phone compilation gives me this :

... everything looks ok before ...
Compile phoneme: _;_
7: pause : 5
7: starttype : 9
5: _ : 1
7: endtype : 7
5: _ : 1
7: lengthmod : 9
3: 1 : 1
7: length : 6
3: 200 : 3
7: endphoneme : 10
7: phoneme : 7
2: _^_ : 3
Compile phoneme: _^_
7: pause : 5
7: starttype : 9
5: _ : 1
7: endtype : 7
5: _ : 1
7: lengthmod : 9
3: 1 : 1
7: length : 6
3: 10 : 2
7: endphoneme : 10
7: phoneme : 7
2: _X1 : 3
Compile phoneme: _X1
7: pause : 5
7: starttype : 9
5: _ : 1
7: endtype : 7
5: _ : 1
7: lengthmod : 9
3: 1 : 1
7: length : 6
3: 10 : 2
7: endphoneme : 10
7: phoneme : 7
2: ? : 1
Compile phoneme: ?
7: vls : 3
7: glt : 3
7: stp : 3
7: lengthmod : 9
3: 3 : 1
7: nolink : 6
7: Vowelin : 7
7: glstop : 6
7: Vowelout : 8
7: Vowelout : 8
7: glstop : 6
7: WAV : 3
7: WAV : 3
2: ustop/null : 10
7: brk : 3
7: FMT : 3
2: r3/r_trill : 10
7: EndSwitch : 9
7: VowelEnding : 11
2: w/xw : 4
4: welStarts : 9
phonemes(129): Expected a number
phonemes(129): Expected ')'
... etc ...

The parsing of the ? does not go well, it starts to read some tokens multiple times (Vowelout and WAV) and after that it does not manage to fall back on its feet (no endphoneme). For memory, here is the phoneme file around that point :

phoneme  _X1  //  a language specific action
  pause
  starttype _ endtype _
  lengthmod 1
  length 10
endphoneme 

phoneme ?  // glottal stp
  vls glt stp
  lengthmod 3   // ??
  nolink
  Vowelin  glstop
  Vowelout glstop
  WAV(ustop/null)
endphoneme

The funny thing is that the next token brk is quite far in the file.

@BenTalagan

This comment has been minimized.

Copy link
Contributor Author

@BenTalagan BenTalagan commented Nov 16, 2019

My suspicions go to the ungetc function :

This is how I instrumented it :

static unsigned int get_char()
{
	unsigned int c;
	c = fgetc(f_in);
	if (c == '\n')
		linenum++;

        printf("Got '%c'\n", c);

	return c;
}

static void unget_char(unsigned int c)
{
	ungetc(c, f_in);
	if (c == '\n')
		linenum--;
  
        printf("Ungot '%c'\n", c);
}

For the parsing of the ? phoneme this is what I get :

Compile phoneme: ?
Got '/'
Got '/'
Got ' '
Got 'g'
Got 'l'
Got 'o'
Got 't'
Got 't'
Got 'a'
Got 'l'
Got ' '
Got 's'
Got 't'
Got 'p'
Got '
'
Got ' '
Got ' '
Got 'v'
Got 'l'
Got 's'
Got ' '
Got 'g'
7: vls -> 'g'
Ungot 'g'
Got 'g'
Got 'l'
Got 't'
Got ' '
Got 's'
7: glt -> 's'
Ungot 's'
Got 's'
Got 't'
Got 'p'
Got '
'
Got ' '
Got ' '
Got 'l'
7: stp -> 'l'
Ungot 'l'
Got 'l'
Got 'e'
Got 'n'
Got 'g'
Got 't'
Got 'h'
Got 'm'
Got 'o'
Got 'd'
Got ' '
Got '3'
7: lengthmod -> '3'
Ungot '3'
Got '3'
Got ' '
Got ' '
Got ' '
Got '/'
3: 3 -> '/'
Ungot '/'
Got '/'
Got '/'
Got ' '
Got '?'
Got '?'
Got '
'
Got ' '
Got ' '
Got 'n'
Got 'o'
Got 'l'
Got 'i'
Got 'n'
Got 'k'
Got '
'
Got ' '
Got ' '
Got 'V'
7: nolink -> 'V'
Ungot 'V'
Got 'V'
Got 'o'
Got 'w'
Got 'e'
Got 'l'
Got 'i'
Got 'n'
Got ' '
Got ' '
Got 'g'
7: Vowelin -> 'g'
Ungot 'g'
Got 'g'
Got 'l'
Got 's'
Got 't'
Got 'o'
Got 'p'
Got '
'
Got ' '
Got ' '
Got 'V'
7: glstop -> 'V'
Ungot 'V'
Got 'V'
Got 'o'
Got 'w'
Got 'e'
Got 'l'
Got 'o'
Got 'u'
Got 't'
Got ' '
Got 'g'
7: Vowelout -> 'g'
Ungot 'g'
Got 'V'
Got 'o'
Got 'w'
Got 'e'
Got 'l'
Got 'o'
Got 'u'
Got 't'
Got ' '
Got 'g'
7: Vowelout -> 'g'
Ungot 'g'
Got 'g'
Got 'l'
Got 's'
Got 't'
Got 'o'
Got 'p'
Got '
'
Got ' '
Got ' '
Got 'W'
7: glstop -> 'W'
Ungot 'W'
Got 'W'
Got 'A'
Got 'V'
Got '('
7: WAV -> '('
Ungot '('
Got 'W'
Got 'A'
Got 'V'
Got '('
7: WAV -> '('
Ungot '('
Got '('
Got 'u'
Got 's'
Got 't'
Got 'o'
Got 'p'
Got '/'
Got 'n'
Got 'u'
Got 'l'
Got 'l'
Got ')'
2: ustop/null -> ')'
Ungot ' '
Got ' '
Got 'b'
Got 'r'
Got 'k'
Got '
'
Got ' '
Got ' '
Got 'F'
7: brk -> 'F'
Ungot 'F'
Got 'F'
Got 'M'
Got 'T'
Got '('
7: FMT -> '('
Ungot '('
Got '('
Got 'r'
Got '3'
Got '/'
Got 'r'
Got '_'
Got 't'
Got 'r'
Got 'i'
Got 'l'
Got 'l'
Got ')'
2: r3/r_trill -> ')'
Ungot ' '
Got ' '
Got ' '
Got ' '
Got ' '
Got 'E'
Got 'n'
Got 'd'
Got 'S'
Got 'w'
Got 'i'
Got 't'
Got 'c'
Got 'h'
Got '
'
Got '
'
Got ' '
Got ' '
Got ' '
Got ' '
Got 'V'
7: EndSwitch -> 'V'

We can clearly see that after vowelout and WAV we don't get what we have unget. Worse, there's a full jump after the ustop/null that sends us way beyond the current position.

What's your opinion on this ?

Edit: This might be a wrong guess, it looks like there is also the UngetItem functions which may interfere here. Still investigating.

@rhdunn

This comment has been minimized.

Copy link
Member

@rhdunn rhdunn commented Nov 16, 2019

What I am seeing with the version on commit 07012f6 (which is working for me on Debian linux) is:

Compile phoneme: ?
7: vls : 3
7: glt : 3
7: stp : 3
7: lengthmod : 9
3: 3 : 1
7: nolink : 6
7: Vowelin : 7
7: glstop : 6
7: Vowelout : 8
7: Vowelout : 8
7: glstop : 6
7: WAV : 3
7: WAV : 3
2: ustop/null : 10
7: endphoneme : 10
7: phoneme : 7
2: : : 1
Compile phoneme: :
@BenTalagan

This comment has been minimized.

Copy link
Contributor Author

@BenTalagan BenTalagan commented Nov 16, 2019

Thanks! So it is the following sequence that looks really suspicious :

7: WAV -> '('
Ungot '('
Got '('
Got 'u'
Got 's'
Got 't'
Got 'o'
Got 'p'
Got '/'
Got 'n'
Got 'u'
Got 'l'
Got 'l'
Got ')'
2: ustop/null -> ')'
Ungot ' '
Got ' '
Got 'b'
Got 'r'
Got 'k'
Got '
'
Got ' '
Got ' '
Got 'F'
7: brk -> 'F'

After putting back a space in the stream, we jump far away in the file (in the R phoneme)

@rhdunn

This comment has been minimized.

Copy link
Member

@rhdunn rhdunn commented Nov 16, 2019

Yeah, that looks like it is the issue. I'm not sure what is going on with that yet.

@rhdunn

This comment has been minimized.

Copy link
Member

@rhdunn rhdunn commented Nov 16, 2019

Check UngetItem, which is calling fseek.

@BenTalagan

This comment has been minimized.

Copy link
Contributor Author

@BenTalagan BenTalagan commented Nov 16, 2019

Yes, it's exactly what I was doing :-) Getting this sequence :

Compile phoneme: ?
Got '/'
Got '/'
Got ' '
Got 'g'
Got 'l'
Got 'o'
Got 't'
Got 't'
Got 'a'
Got 'l'
Got ' '
Got 's'
Got 't'
Got 'p'
Got '
'
Got ' '
Got ' '
Got 'v'
Got 'l'
Got 's'
Got ' '
Got 'g'
7: vls -> 'g'
Ungot 'g'
Got 'g'
Got 'l'
Got 't'
Got ' '
Got 's'
7: glt -> 's'
Ungot 's'
Got 's'
Got 't'
Got 'p'
Got '
'
Got ' '
Got ' '
Got 'l'
7: stp -> 'l'
Ungot 'l'
Got 'l'
Got 'e'
Got 'n'
Got 'g'
Got 't'
Got 'h'
Got 'm'
Got 'o'
Got 'd'
Got ' '
Got '3'
7: lengthmod -> '3'
Ungot '3'
Got '3'
Got ' '
Got ' '
Got ' '
Got '/'
3: 3 -> '/'
Ungot '/'
Got '/'
Got '/'
Got ' '
Got '?'
Got '?'
Got '
'
Got ' '
Got ' '
Got 'n'
Got 'o'
Got 'l'
Got 'i'
Got 'n'
Got 'k'
Got '
'
Got ' '
Got ' '
Got 'V'
7: nolink -> 'V'
Ungot 'V'
Got 'V'
Got 'o'
Got 'w'
Got 'e'
Got 'l'
Got 'i'
Got 'n'
Got ' '
Got ' '
Got 'g'
7: Vowelin -> 'g'
Ungot 'g'
Got 'g'
Got 'l'
Got 's'
Got 't'
Got 'o'
Got 'p'
Got '
'
Got ' '
Got ' '
Got 'V'
7: glstop -> 'V'
Ungot 'V'
Got 'V'
Got 'o'
Got 'w'
Got 'e'
Got 'l'
Got 'o'
Got 'u'
Got 't'
Got ' '
Got 'g'
7: Vowelout -> 'g'
Ungot 'g'
Ungot item 2252
Got 'V'
Got 'o'
Got 'w'
Got 'e'
Got 'l'
Got 'o'
Got 'u'
Got 't'
Got ' '
Got 'g'
7: Vowelout -> 'g'
Ungot 'g'
Got 'g'
Got 'l'
Got 's'
Got 't'
Got 'o'
Got 'p'
Got '
'
Got ' '
Got ' '
Got 'W'
7: glstop -> 'W'
Ungot 'W'
Got 'W'
Got 'A'
Got 'V'
Got '('
7: WAV -> '('
Ungot '('
Ungot item 2270
Got 'W'
Got 'A'
Got 'V'
Got '('
7: WAV -> '('
Ungot '('
Got '('
Got 'u'
Got 's'
Got 't'
Got 'o'
Got 'p'
Got '/'
Got 'n'
Got 'u'
Got 'l'
Got 'l'
Got ')'
2: ustop/null -> ')'
Ungot ' '
Got ' '
Got 'b'
Got 'r'
Got 'k'
Got '
'
Got ' '
Got ' '
Got 'F'
7: brk -> 'F'
Ungot 'F'
Got 'F'
Got 'M'
Got 'T'
Got '('

(UngetItem is instrumented with :

printf("Ungot item %d\n", f_in_displ);

Looks like there were some calls to UngetItem, but not immediately before the jump.

@rhdunn

This comment has been minimized.

Copy link
Member

@rhdunn rhdunn commented Nov 16, 2019

Those are what I am seeing as well.

So it looks like it could be the code before reading the item_string in NextItem that skips whitespace. Maybe the comment skipping logic? Try something like:

			while (!feof(f_in) && ((c = get_char()) != '\n') && (c != '\n'))
				;

The Mac get_char may be normalising '\n` characters.

@BenTalagan

This comment has been minimized.

Copy link
Contributor Author

@BenTalagan BenTalagan commented Nov 16, 2019

No better luck :-) Unfortunately it has no effect. I have to make a pause, but I'll get back at it later, and try to use GDB for better insight.

@rhdunn

This comment has been minimized.

Copy link
Member

@rhdunn rhdunn commented Nov 16, 2019

Have a good day.

@BenTalagan

This comment has been minimized.

Copy link
Contributor Author

@BenTalagan BenTalagan commented Nov 16, 2019

Thanks, you too, and for your help!

@BenTalagan

This comment has been minimized.

Copy link
Contributor Author

@BenTalagan BenTalagan commented Nov 16, 2019

Further investigation, it looks like there's something really nasty happening with the stream, if I instrument like this :

static unsigned int get_char()
{
	unsigned int c;

  int pb = ftell(f_in);
	c = fgetc(f_in);
  int pa = ftell(f_in);
	if (c == '\n')
		linenum++;

  printf("Got '%c' %d -> %d\n", c, pb, pa);

	return c;
}

static void unget_char(unsigned int c)
{
  int pb = ftell(f_in);
	ungetc(c, f_in);
  int pa = ftell(f_in);
	if (c == '\n')
		linenum--;
  
  printf("Ungot '%c' %d -> %d\n", c, pb, pa);
}

The presence of ftells modifies the behaviour of the program! (It should be non intrusive I guess) ; the output :

7: glstop -> 'W'
Ungot 'W' 0 -> 0
Got 'W' 2270 -> 2271
Got 'A' 2271 -> 2272
Got 'V' 2272 -> 2273
Got '(' 2273 -> 2274
7: WAV -> '('
Ungot '(' 0 -> 0
Ungot item 2270
Got 'W' 2270 -> 2271
Got 'A' 2271 -> 2272
Got 'V' 2272 -> 2273
Got '(' 2273 -> 2274
7: WAV -> '('
Ungot '(' 0 -> 0
Got '(' 2273 -> 2274
Got 'u' 2274 -> 2275
Got 's' 2275 -> 2276
Got 't' 2276 -> 2277
Got 'o' 2277 -> 2278
Got 'p' 2278 -> 2279
Got '/' 2279 -> 2280
Got 'n' 2280 -> 2281
Got 'u' 2281 -> 2282
Got 'l' 2282 -> 2283
Got 'l' 2283 -> 2284
Got ')' 2284 -> 2285
2: ustop/null -> ')'
Ungot ' ' 0 -> 0
Got ')' 2284 -> 2285
7:  -> ')'
Ungot ' ' 0 -> 0
phonemes(125): The phoneme feature is not recognised: ''.
Got ')' 2284 -> 2285
7:  -> ')'
Ungot ' ' 0 -> 0
phonemes(125): The phoneme feature is not recognised: ''.
Got ')' 2284 -> 2285
7:  -> ')'
Ungot ' ' 0 -> 0
phonemes(125): The phoneme feature is not recognised: ''.
Got ')' 2284 -> 2285
7:  -> ')'
Ungot ' ' 0 -> 0
... INFINITE LOOP...

I wonder if the reason is not that we put back a character which is not the one it was before (?!). We get ')' but we unget ' ' after ustop/null.

@BenTalagan

This comment has been minimized.

Copy link
Contributor Author

@BenTalagan BenTalagan commented Nov 16, 2019

ftell is called at the start of NextItem, so if my hypothesis is correct, it could mess with ungetc.

@rhdunn

This comment has been minimized.

Copy link
Member

@rhdunn rhdunn commented Nov 16, 2019

That is strange. On my machine, I get output like:

ungot 'l' 21037 => 21036

So ungetc is broken on the Mac?

@BenTalagan

This comment has been minimized.

Copy link
Contributor Author

@BenTalagan BenTalagan commented Nov 16, 2019

I'm trying to narrow the problem ; could you try to compile and launch this small program on your machine ?

#include <stdio.h>
#include <stdlib.h>

FILE* f_in = NULL;

static unsigned int get_char()
{
	unsigned int c;
	c = fgetc(f_in);
  printf("Got '%c'\n", c);
	return c;
}

static void unget_char(unsigned int c)
{   
	ungetc(c, f_in);
  printf("Ungot '%c'\n", c);
}

int main(int argc, char** argv) {
    
  f_in = fopen("testfile","rb");
  
  if(!f_in)
  {
    printf("Failed to open test file \n");
    return -1;
  }
  
  int triggered = 0;
  while(!feof(f_in))
  {
    int c = get_char();
    if( c== '4' && !triggered) {
      unget_char(' ');
      triggered = 1;
      
      int cpos = ftell(f_in);
      printf("FTELL : %d\n",cpos);
    }
  }
  
  fclose(f_in);
  return 0;
}

The behaviour of that program is strange to me, it has the 'jump' effect :

Got '1'
Got '
'
Got '2'
Got '
'
Got '3'
Got '
'
Got '4'
Ungot ' '
FTELL : 14
Got '8'
Got '?'
@BenTalagan

This comment has been minimized.

Copy link
Contributor Author

@BenTalagan BenTalagan commented Nov 16, 2019

This is the test file "testfile" that goes alongside the program :

1
2
3
4
5
6
7
8
@rhdunn

This comment has been minimized.

Copy link
Member

@rhdunn rhdunn commented Nov 16, 2019

Got '1'
Got '
'
Got '2'
Got '
'
Got '3'
Got '
'
Got '4'
Ungot ' '
FTELL : 6
Got ' '
Got '
'
Got '5'
Got '
'
Got '6'
Got '
'
Got '7'
Got '
'
Got '8'
Got '
'
Got '�'
@BenTalagan

This comment has been minimized.

Copy link
Contributor Author

@BenTalagan BenTalagan commented Nov 16, 2019

Ouch. Looks like we have the problem :( The behavior of ungetc + ftell is no more compliant.

@rhdunn

This comment has been minimized.

Copy link
Member

@rhdunn rhdunn commented Nov 16, 2019

I've been thinking of trying a local implementation of the unget behaviour, but haven't currently figured out how to get it working.

I'm not sure why ungetc is not working properly on Mac in this case. It is most likely a bug in their implementation.

@BenTalagan

This comment has been minimized.

Copy link
Contributor Author

@BenTalagan BenTalagan commented Nov 16, 2019

Funny enough, if I replace the unget_char(' '); line by unget_char('4');, I have the same results as you. In fact, I trigger the jump if what I unget is different from what I've got.

@BenTalagan

This comment has been minimized.

Copy link
Contributor Author

@BenTalagan BenTalagan commented Nov 16, 2019

I'm not sure why ungetc is not working properly on Mac in this case. It is most likely a bug in their implementation.

Probably, you must be right. I'm a bit surprised that it was broken recently. Or maybe, the way espeak uses ftell + ungetc + fseek is not really legit ?

@rhdunn

This comment has been minimized.

Copy link
Member

@rhdunn rhdunn commented Nov 16, 2019

http://man7.org/linux/man-pages/man3/ungetc.3p.html doesn't say that the ungot character has to be the same as the character previously read.

The following simple ungetc replacement does not currently work:

diff --git a/src/libespeak-ng/compiledata.c b/src/libespeak-ng/compiledata.c
index acd62221..e55545ab 100644
--- a/src/libespeak-ng/compiledata.c
+++ b/src/libespeak-ng/compiledata.c
@@ -403,6 +403,7 @@ static FILE *f_report;
 static FILE *f_in;
 static int f_in_linenum;
 static int f_in_displ;
+static unsigned int f_in_ungetc = EOF;
 
 static int linenum;
 static int count_references = 0;
@@ -715,7 +716,11 @@ static int LookupPhoneme(const char *string, int control)
 static unsigned int get_char()
 {
        unsigned int c;
-       c = fgetc(f_in);
+       if (f_in_ungetc != EOF) {
+               c = f_in_ungetc;
+               f_in_ungetc = EOF;
+       } else
+               c = fgetc(f_in);
        if (c == '\n')
                linenum++;
        return c;
@@ -723,7 +728,7 @@ static unsigned int get_char()
 
 static void unget_char(unsigned int c)
 {
-       ungetc(c, f_in);
+       f_in_ungetc = c;
        if (c == '\n')
                linenum--;
 }

I'm getting a lot of errors, starting with:

phonemes(124): The phoneme feature is not recognised: 'gowelout'.
phonemes(359): The phoneme feature is not recognised: 'fowelout'.

That looks suspiciously similar to what you are seeing (esp. re: the line numbers), so maybe that is what the Mac implementation is doing internally.

@rhdunn

This comment has been minimized.

Copy link
Member

@rhdunn rhdunn commented Nov 16, 2019

Can you check what ungetc is returning, and if it is returning EOF then what is the errno value and associated message?

@rhdunn

This comment has been minimized.

Copy link
Member

@rhdunn rhdunn commented Nov 16, 2019

Adding:

@@ -881,6 +886,7 @@ static int NextItemBrackets(int type, int control)
 static void UngetItem()
 {
        fseek(f_in, f_in_displ, SEEK_SET);
+       f_in_ungetc = EOF;
        linenum = f_in_linenum;
 }

for the seek behaviour of ungetc results in the errors looking like:

phonemes(124): The phoneme feature is not recognised: 'owelout'.
phonemes(359): The phoneme feature is not recognised: 'owelout'.
phonemes(359): The phoneme feature is not recognised: '2'.
phonemes(359): The phoneme feature is not recognised: '1600'.

Note that seeking to f_in_displ - 1 does not work. Maybe it is only occasionally off by one (i.e. when it is trying to read a keyword like Vowelout or length). I haven't yet tracked down where this offset adjustment would be needed.

@BenTalagan

This comment has been minimized.

Copy link
Contributor Author

@BenTalagan BenTalagan commented Nov 16, 2019

Can you check what ungetc is returning, and if it is returning EOF then what is the errno value and associated message?

Ok, I've had a look at this. The result of ungetc looks ok : it always returns the value of the character that was ungot, even in the suspicious cases.

@BenTalagan

This comment has been minimized.

Copy link
Contributor Author

@BenTalagan BenTalagan commented Nov 16, 2019

However, I can see some potential problems with your implementation :

  • Multiple sequential calls to ungetc will not work (the buffer depth is 1)
  • If used in conjunction with fseek (like in UngetItem) or ftell (like at the start of NextItem), this may do strange things

Maybe one possible implementation would be to work with one big buffer instead of a file stream ?

@rhdunn

This comment has been minimized.

Copy link
Member

@rhdunn rhdunn commented Nov 16, 2019

I'm aware of that limitation, but IIUC, the compile phoneme code does not need a larger ungetc buffer -- it should only be restoring the last read character on a false branch of a loop (e.g. when reading a comment line).

I've adjusted for fseek in the subsequent comment.

I think this is similar to how the Mac logic is behaving, given the errors. Therefore, if we can figure out how to make this work (possibly in relation to the ftell call), it should hopefully shed some light on what needs to be modified to get the Mac implementation to work.

@BenTalagan

This comment has been minimized.

Copy link
Contributor Author

@BenTalagan BenTalagan commented Nov 16, 2019

Ok! Just tell me when you're good and if you want me to perform some tests.

@rhdunn

This comment has been minimized.

Copy link
Member

@rhdunn rhdunn commented Nov 16, 2019

Will do, thanks.

@BenTalagan

This comment has been minimized.

Copy link
Contributor Author

@BenTalagan BenTalagan commented Nov 16, 2019

Note : one other potential problem I see is that f_in can be switched to another file in the stack (thus the buffered byte for one file may interfere with another file). I don't know if it should be taken into account or not.

@BenTalagan

This comment has been minimized.

Copy link
Contributor Author

@BenTalagan BenTalagan commented Nov 16, 2019

One additional note, I could not find the source for Catalina. But in precedent versions of macOS, the code of ungetc is different depending on the fact that we push back the same character or not.

ungetc.c

In the first case, it's a simple rewind of the file pointer. In the second case, a buffer is used. That could explain why I see different behaviors depending on the fact that we push back the same character that was read and why it can interfere with ftell.

@rhdunn

This comment has been minimized.

Copy link
Member

@rhdunn rhdunn commented Nov 16, 2019

Interesting. Thanks.

I wonder what is causing espeak to unget a character different to the previously read character. Maybe addressing that will fix the issue you are seeing on the Mac (and possibly on other BSD-based platforms).

@BenTalagan

This comment has been minimized.

Copy link
Contributor Author

@BenTalagan BenTalagan commented Nov 16, 2019

I think it's this part :

while (isspace(c))
c = get_char();
item_terminator = ' ';
if ((c == ')') || (c == '(') || (c == ','))
item_terminator = c;
if ((c == ')') || (c == ','))
c = ' ';
if (!feof(f_in))
unget_char(c);

Line 798 will modify the character. I tried to invert the last line, but pushing back the ')' will result now in an infinite loop. I guess pushing back a space was a trick to get rid of the parsing of the ')'.

@BenTalagan

This comment has been minimized.

Copy link
Contributor Author

@BenTalagan BenTalagan commented Nov 16, 2019

Ok! I think I got it :

	while (isspace(c))
		c = get_char();
   
	item_terminator = ' ';
	if ((c == ')') || (c == '(') || (c == ','))
		item_terminator = c;

	if ((c == ')') || (c == ','))
		c = ' ';
	else if(!feof(f_in))
		unget_char(c);

This will allow to compile the full phoneme files. This is my result :

Refs 4021,  Reused 3068
Compiled phonemes: 0 errors.
touch dictsource/en_extra
  DICT      espeak-ng-data/en_dict
Can't read dictionary file: '/Users/ben/poub/espeak-ng/espeak-ng-data/en_dict'
Using phonemetable: 'en'
Compiling: 'en_list'
	5458 entries
Compiling: 'en_emoji'
	1690 entries
Compiling: 'en_extra'
	0 entries
Compiling: 'en_rules'
	6743 rules, 103 groups (0)

Is it ok ?

@rhdunn

This comment has been minimized.

Copy link
Member

@rhdunn rhdunn commented Nov 16, 2019

That works on my machine, so feel free to create a patch.

Are there any other problems?

@BenTalagan

This comment has been minimized.

Copy link
Contributor Author

@BenTalagan BenTalagan commented Nov 16, 2019

I don't think so. I remember having a problem with emscripten a few months ago (#584), the compiled js was unable to parse correctly the bundled data. I don't know, it might be related (or not). Will give it a try again later, but I will prepare a PR for now.

@rhdunn

This comment has been minimized.

Copy link
Member

@rhdunn rhdunn commented Nov 16, 2019

Great. Thanks.

@BenTalagan

This comment has been minimized.

Copy link
Contributor Author

@BenTalagan BenTalagan commented Nov 16, 2019

PR ready (#675) :-) Thanks a lot for having taken such time to help!

@rhdunn rhdunn closed this Nov 16, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.