Initial epsilons in ATT to lttoolbox compiled transducers not handled by lt-proc #9

ana-kuznetsova · 2018-05-22T05:56:24Z

Lttoolbox generates forms but fails to analyse them.
For example:

$ echo "^a<prn><p1><sg>+guata<v><iv><pres>$" | lt-proc -g grn.autogen.bin
aguata

But the morphological analyser does not recognise this form:

$ echo "aguata" | apertium -d . grn-morph
^aguata/*aguata$^./.<sent>$

Some forms, however, are analysed correctly:

$ echo "ndaguatái" | apertium -d . grn-morph
^ndaguatái/nd<neg>+a<prn><p1><sg>+guata<v><iv><pres>+i<neg>$^./.<sent>$

We would be very grateful if you could fix this.


unhammer · 2018-05-22T07:46:34Z

Forms with + in them have to be sent through apertium-pretransfer first, which turns ^a+b$ into ^a$ ^b$.

ftyers · 2018-05-22T11:16:47Z

That's not quite the issue. The problem is that lttoolbox and HFST have different behaviours. Consider the following:

$ echo "^a<prn><p1><sg>+guata<v><iv><pres>$" | lt-proc -g grn.autogen.bin
aguata

$ echo "a<prn><p1><sg>+guata<v><iv><pres>" | hfst-lookup grn.autogen.hfst 
a<prn><p1><sg>+guata<v><iv><pres>	aguata	0,000000

$ echo "aguata" | lt-proc grn.automorf.bin 
^aguata/*aguata$

$ echo "aguata" | hfst-lookup grn.automorf.hfst 
aguata	a<prn><p1><sg>+guata<v><iv><pres>	0,000000
aguata	a<prn><p1><sg>+guata<v><tv>	0,000000

I suspect it has something to do with initial epsilon transitions.

unhammer · 2018-05-22T11:35:22Z

Ah, sorry, I read the report a bit too quickly.

How was grn.autogen.bin created? Would it be possible to upload a minimal one that has only the relevant words?

jonorthwash · 2018-12-23T18:08:31Z

As far as I'm concerned, this is a bug in lttoolbox. See apertium/apertium-yid#2 and apertium/apertium-yid@efdb8ea for a similar issue and my fairly simple work-around (which may not be possible in all languages).

unhammer · 2018-12-24T20:05:46Z

the workaround is to shuffle around continuation lexicons?

is it possible to make a minimal test case out of this?

jonorthwash · 2018-12-24T20:47:48Z

> the workaround is to shuffle around continuation lexicons?

Yeah, pretty much, but that won't work in all cases. It seems to fail if you have abc: CONTIN ;. The workaround is to put something on the right side of the :.

> is it possible to make a minimal test case out of this?

Should be pretty simple...

jonorthwash · 2018-12-24T20:57:58Z

works.lexc:

Multichar_Symbols
%<det%>

LEXICON Root
Determiners ;

LEXICON INFL-Det
%<det%>: # ;

LEXICON Determiners
the:the INFL-Det ;

broken.lexc:

Multichar_Symbols
%<det%>

LEXICON Root
Determiners ;

LEXICON INFL-Det
%<det%>:the # ;

LEXICON Determiners
the: INFL-Det ;

Tested working and broken, respectively. Will post code to compile and test later.

jonorthwash · 2018-12-25T03:47:50Z

To compile the above files:

$ hfst-lexc broken.lexc | hfst-invert | hfst-fst2fst -O -o broken.hfst
$ hfst-lexc works.lexc | hfst-invert | hfst-fst2fst -O -o works.hfst
$ hfst-fst2txt works.hfst > works.att
$ lt-comp lr works.att works.bin
$ hfst-fst2txt broken.hfst > broken.att
$ lt-comp lr broken.att broken.bin

To test:

$ echo "the" | hfst-proc works.hfst
^the/the<det>$
$ echo "the" | hfst-proc broken.hfst
^the/the<det>$

$ echo "the" | lt-proc works.bin
^the/the<det>$
$ echo "the" | lt-proc broken.bin
^the/*the$

The contents of works.att:

0	1	t	t	0.000000	
1	2	h	h	0.000000	
2	3	e	e	0.000000	
3	4	ε	<det>	0.000000	
4	0.000000

The contents of broken.att:

0	1	@0@	t	0.000000
1	2	@0@	h	0.000000
2	3	@0@	e	0.000000
3	4	t	<det>	0.000000
4	5	h	@0@	0.000000
5	6	e	@0@	0.000000
6	0.000000

jonorthwash · 2018-12-25T04:09:53Z

I'm starting to wonder if this is a bug in hfst-fst2txt...

jonorthwash · 2018-12-25T06:51:13Z

Well, no, the "broken" transducer is valid for the given input and output, albeit not optimal.
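That claim can be checked without any FST tooling. The following is a minimal Python sketch, not part of lttoolbox or HFST, that enumerates the input/output string pairs of works.att and broken.att. The helper names `parse_att` and `pairs` are made up for illustration, and the sketch assumes the start state is the source state of the first line (0 in both listings) and that the transducer is acyclic.

```python
# Enumerate the (input, output) pairs of a small acyclic transducer in AT&T text format.
EPSILONS = {"@0@", "ε"}  # both epsilon spellings that appear in the two listings

WORKS_ATT = (
    "0\t1\tt\tt\t0.000000\n"
    "1\t2\th\th\t0.000000\n"
    "2\t3\te\te\t0.000000\n"
    "3\t4\tε\t<det>\t0.000000\n"
    "4\t0.000000\n"
)

BROKEN_ATT = (
    "0\t1\t@0@\tt\t0.000000\n"
    "1\t2\t@0@\th\t0.000000\n"
    "2\t3\t@0@\te\t0.000000\n"
    "3\t4\tt\t<det>\t0.000000\n"
    "4\t5\th\t@0@\t0.000000\n"
    "5\t6\te\t@0@\t0.000000\n"
    "6\t0.000000\n"
)


def parse_att(text):
    """Return (start_state, arcs, finals) for one transducer."""
    arcs = {}       # source state -> list of (target, input symbol, output symbol)
    finals = set()
    start = None
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.strip().split("\t")
        if start is None:
            start = fields[0]       # assumption: first line starts at the start state
        if len(fields) <= 2:        # "state [weight]" marks a final state
            finals.add(fields[0])
        else:
            src, dst, sym_in, sym_out = fields[:4]
            arcs.setdefault(src, []).append((dst, sym_in, sym_out))
    return start, arcs, finals


def pairs(start, arcs, finals):
    """Yield (input, output) for every path from the start to a final state."""
    stack = [(start, "", "")]
    while stack:
        state, s_in, s_out = stack.pop()
        if state in finals:
            yield s_in, s_out
        for dst, sym_in, sym_out in arcs.get(state, []):
            stack.append((
                dst,
                s_in + ("" if sym_in in EPSILONS else sym_in),
                s_out + ("" if sym_out in EPSILONS else sym_out),
            ))


works_pairs = set(pairs(*parse_att(WORKS_ATT)))
broken_pairs = set(pairs(*parse_att(BROKEN_ATT)))
print(works_pairs)
print(broken_pairs)
```

Both sets contain the single pair `('the', 'the<det>')`, so the two files describe the same relation and differ only in where the epsilons sit.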

The real issue seems to be related to how lt-proc deals with @0@.

On that note, what's the difference between @0@ and ε?

@flammie, this thread may be of interest.

unhammer · 2018-12-25T19:38:45Z

lt-print produces ε for epsilons, hfst-fst2txt produces @0@.

hfst-txt2fst expects @0@ but can handle ε if you give -e ε.
I don't know if lt-comp handles both.

In any case, running sed 's/@0@/ε/g' broken.att >broken.latt && lt-comp lr broken.latt broken.ltbin && echo the | lt-proc broken.ltbin doesn't help; the word is still unknown.

Also,

$ lt-print broken.ltbin
Error: empty set of final states

(whether @0@ or ε)

It may be that lttoolbox has some expectation that the input-side always has a symbol on the first transition. It's always possible to turn the FST into something where that's true, as long as there is no empty left-hand side (which I think lt-proc complains about if you try to do that). I don't know how easy that transformation is, though. There's some discussion at hfst/hfst#400 about a tool that would do something similar.

jonorthwash · 2018-12-25T20:23:21Z

> It may be that lttoolbox has some expectation that the input-side always has a symbol on the first transition.

I'm pretty sure this is true. At least, it won't follow a path where the first input-side character doesn't match the first input character (could be an easy fix?). I shifted the first left-side t up to state 0, and it seems to work:

nolongerbroken.att:

0	1	t	t	0.000000
1	2	@0@	h	0.000000
2	3	@0@	e	0.000000
3	4	@0@	<det>	0.000000
4	5	h	@0@	0.000000
5	6	e	@0@	0.000000
6	0.000000

Testing:

$ lt-comp lr nolongerbroken.att nolongerbroken.bin
$ echo the | lt-proc nolongerbroken.bin
^the/the<det>$
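A quick way to spot transducers that are likely to hit this is to look for arcs leaving the start state with an epsilon on the input side. The following is a minimal Python sketch under a few assumptions: the function name `initial_epsilon_arcs` is hypothetical, the start state is taken to be the source state of the first line, and both `@0@` and `ε` count as epsilon. It only inspects the AT&T text and says nothing about how lt-proc works internally.

```python
# Report arcs that leave the start state with an epsilon on the input side.
EPSILONS = {"@0@", "ε"}

BROKEN_ATT = (
    "0\t1\t@0@\tt\t0.000000\n"
    "1\t2\t@0@\th\t0.000000\n"
    "2\t3\t@0@\te\t0.000000\n"
    "3\t4\tt\t<det>\t0.000000\n"
    "4\t5\th\t@0@\t0.000000\n"
    "5\t6\te\t@0@\t0.000000\n"
    "6\t0.000000\n"
)

NOLONGERBROKEN_ATT = (
    "0\t1\tt\tt\t0.000000\n"
    "1\t2\t@0@\th\t0.000000\n"
    "2\t3\t@0@\te\t0.000000\n"
    "3\t4\t@0@\t<det>\t0.000000\n"
    "4\t5\th\t@0@\t0.000000\n"
    "5\t6\te\t@0@\t0.000000\n"
    "6\t0.000000\n"
)


def initial_epsilon_arcs(att_text):
    """Return (source, target, input, output) for start-state arcs with epsilon input."""
    rows = [line.strip().split("\t") for line in att_text.splitlines() if line.strip()]
    start = rows[0][0]  # assumption: the first line starts at the start state
    return [
        tuple(row[:4])
        for row in rows
        if len(row) >= 4 and row[0] == start and row[2] in EPSILONS
    ]


print(initial_epsilon_arcs(BROKEN_ATT))
print(initial_epsilon_arcs(NOLONGERBROKEN_ATT))
```

For broken.att this reports the arc `0 1 @0@ t`; for nolongerbroken.att the list is empty, which matches the observation that only the latter is analysed by lt-proc.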

flammie · 2018-12-28T03:00:57Z

I think the idea that lttoolbox doesn't support e.g. initial epsilons on the "input" side is probably correct; as far as I've understood, lttoolbox only has a compiler and a lookup tool for a very specific subset of finite-state automata. On a side note, such a transducer is also very suboptimal for any real FST library. I know openfst has the tools to fix this at least in some conditions, though it's possible that the ideal solution is not solvable in the general case in a reasonable amount of time; see http://www.openfst.org/twiki/bin/view/FST/PushDoc

jonorthwash · 2018-12-28T03:19:56Z

> I know openfst has the tools to fix this at least in some conditions

The linked tool fixes the transducer, iiuc. I think a fix to the parsing code would be simpler in the end, for several reasons.

ftyers · 2018-12-29T02:41:48Z

> > I know openfst has the tools to fix this at least in some conditions
>
> The linked tool fixes the transducer, iiuc. I think a fix to the parsing code would be simpler in the end, for several reasons.

This should be fixed in ATTCompiler probably.

MemduhG · 2018-12-29T02:52:12Z

$ echo deken | hfst-lookup ckb.automorf.hfst
deken   krdn<v><tv><neg><pri><p2><pl>   0,000000
deken   krdn<v><tv><neg><pri><p3><pl>   0,000000
deken   krdn<v><tv><pri><p2><pl>        0,000000
deken   krdn<v><tv><pri><p3><pl>        0,000000

$ echo "deken" | lt-proc ckb.automorf.bin
^deken/*deken$

I seem to be having this too.

jonorthwash · 2018-12-29T20:42:24Z

> I seem to be having this too.

Yeah, because of how you're using flag diacritics, the beginning of your transducer for some of those paths looks like this:

0       1       @0@     @0@     0.000000
1       2       @P.Asp.Prog@    @P.Asp.Prog@    0.000000
2       471     d       @0@     0.000000

So you start with @0@ on the input side for all paths, it looks like. It'd be a wonder if your .bin transducer worked for any input. As discussed on IRC, using twol-style constraints to do your prefixational morphology will probably help.

jonorthwash · 2018-12-29T22:52:01Z

> > The linked tool fixes the transducer, iiuc. I think a fix to the parsing code would be simpler in the end, for several reasons.
>
> This should be fixed in ATTCompiler probably.

No, not in the compiler. I meant in the parser of compiled transducers. That is, it shouldn't just give up on an input if it has to go through a null input character/arc before it gets to the first character of the input. My point was that the fix to the compiler would probably be much, much harder to implement, because it requires shifting stuff around.

unhammer · 2018-12-30T12:36:43Z

> the parser of compiled transducer

That would be fst_processor.cc, which eventually calls state.cc's State::apply. Perhaps it'd be possible to make that function skip epsilons, but it's quite an overloaded function, so you'd have to make the change in several places:

    101:State::apply(int const input)
    134:State::apply_override(int const input, int const old_sym, int const new_sym)
    198:State::apply(int const input, int const alt)
    246:State::apply_careful(int const input, int const alt)
    320:State::apply(int const input, int const alt1, int const alt2)
    382:State::apply(int const input, set<int> const alts)

I think it'd be more maintainable in the long run to do something like the pushlabels suggested by @flammie . Isn't this a problem that only manifests in languages already depending on HFST?

flammie · 2018-12-31T14:38:33Z

> I think it'd be more maintainable in the long run to do something like the pushlabels suggested by @flammie . Isn't this a problem that only manifests in languages already depending on HFST?

I think in the general case it can be said that this problem only appears in automata that were not produced by lt-comp's dix compilation? In terms of maintainability, it's a trade-off between complex C++ code getting even more complex and some lines of make scripts plus an extra dep, right? Both are quite bad.

unhammer · 2019-01-01T18:31:42Z

It won't mean an extra dep if it's already in HFST and only affects language packages that already depend on HFST. But HFST might need a cli tool that exposes the pushlabels feature of openfst.

AMR-KELEG · 2019-05-21T20:47:17Z

I bumped into the same problem while compiling HFST transducers that had been converted to AT&T format.
lt-print isn't working as expected, and lt-proc doesn't produce the correct weights for the generated analyses!

mr-martian · 2020-05-23T04:02:54Z

I think this is the same issue:

test.att

0	1	@0@	@0@	0.000000
1	2	@0@	c	0.000000
1	2	.	.	0.000000
2	0.000000

test2.att

0	1	@0@	c	0.000000
0	1	.	.	0.000000
1	0.000000

$ lt-comp lr test.att test.bin
main@standard 2 1
final@inconditional 3 2
$ lt-print test.bin
0	1	ε	ε	0.000000	
1	2	.	.	0.000000	
2	0.000000
--
0	1	ε	ε	0.000000	
$ lt-comp lr test2.att test2.bin
main@standard 1 0
final@inconditional 2 1
$ lt-print test2.bin
0	1	.	.	0.000000	
1	0.000000
--

(The default output of lt-print here is Error: empty set of final states; I deleted a call to joinFinals() in my local copy so that it would output anyway.)

The output I would expect is as follows:

$ lt-print test.bin
0	1	ε	ε	0.000000	
1	2	.	.	0.000000	
2	0.000000
--
0	1	ε	ε	0.000000	
1	2	ε	c	0.000000	
2	0.000000
$ lt-print test2.bin
0	1	.	.	0.000000	
1	0.000000
--
0	1	ε	c	0.000000	
1	0.000000

mr-martian · 2020-05-23T04:46:20Z

This .att file produces the expected output.

0	1	@0@	@0@	0.000000
1	2	.	.	0.000000
1	2	@0@	c	0.000000
2	0.000000

I believe the issue is the following check:

lttoolbox/lttoolbox/att_compiler.cc

Lines 238 to 243 in 5e69502

    
    /* skip lines that have an empty left side and output
       if we haven't seen an input symbol */
    if(upper == L"" && lower != L"" && !seen_input_symbol)
    {
      continue;
    }

This discards any transition in the file (except on the first line) that has an epsilon on the left side but not on the right, unless it is preceded by at least one line with a non-epsilon left side.

Removing this check results in the examples in my previous comment working as expected.
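To make the effect of the check concrete, here is a rough Python re-enactment of it, applied to test.att and to the reordered file from this comment. It is only a sketch and not the lttoolbox code: the function name `kept_and_skipped` is made up, it assumes epsilons have already been turned into empty strings when the check runs, it assumes `seen_input_symbol` flips as soon as any line with a non-epsilon left side has been read (the excerpt does not show where the flag is set), and it does not model any special handling of the first line.

```python
# Re-enact the quoted skip rule on AT&T text, line by line.
EPSILONS = {"@0@", "ε"}

TEST_ATT = (
    "0\t1\t@0@\t@0@\t0.000000\n"
    "1\t2\t@0@\tc\t0.000000\n"
    "1\t2\t.\t.\t0.000000\n"
    "2\t0.000000\n"
)

REORDERED_ATT = (
    "0\t1\t@0@\t@0@\t0.000000\n"
    "1\t2\t.\t.\t0.000000\n"
    "1\t2\t@0@\tc\t0.000000\n"
    "2\t0.000000\n"
)


def kept_and_skipped(att_text):
    """Return (kept, skipped) lists of (source, target, upper, lower) transitions."""
    kept, skipped = [], []
    seen_input_symbol = False
    for line in att_text.splitlines():
        fields = line.strip().split("\t")
        if len(fields) < 4:
            continue  # final-state lines carry no symbols
        src, dst, upper, lower = fields[:4]
        # Assumption: epsilon is stored as the empty string at this point.
        upper = "" if upper in EPSILONS else upper
        lower = "" if lower in EPSILONS else lower
        # Assumption: the flag is set by any earlier or current non-epsilon left side.
        if upper != "":
            seen_input_symbol = True
        # The rule from the quoted excerpt.
        if upper == "" and lower != "" and not seen_input_symbol:
            skipped.append((src, dst, upper, lower))
            continue
        kept.append((src, dst, upper, lower))
    return kept, skipped


print(kept_and_skipped(TEST_ATT)[1])
print(kept_and_skipped(REORDERED_ATT)[1])
```

Under these assumptions the ε:c line of test.att is skipped, while nothing is skipped in the reordered file, which lines up with the lt-print output above: whether the transition survives depends only on the order of lines in the file.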

Why is the check there?

It was added in 6bce53b but I don't know what bug is being referred to. @ftyers

AMR-KELEG · 2020-05-23T05:11:03Z

My analysis of the problem was that the current algorithm assumes that you need to have a way to determine whether a certain path belongs to the main or the inconditional part of the transducer.
An epsilon transition doesn't disambiguate this so you need to check the following transitions until you can determine the type of the path.

mr-martian · 2020-05-23T14:58:26Z

@AMR-KELEG it sets the initial classification to BOTH and then epsilon transitions get the type of whatever precedes them. Thus if you remove the check, all 3 of my ATT files get compiled such that all transitions are classified as both punctuation and standard.

jonorthwash · 2020-05-24T01:41:15Z

Aha, so this is related to #81 and #85?

mr-martian · 2020-05-24T01:49:18Z

> Aha, so this is related to #81 and #85?

The particular thing I was describing isn't about what counts as punctuation, but rather that everything counts as epsilon.

unhammer closed this as completed May 22, 2018

ftyers reopened this May 22, 2018

ftyers changed the title ~~Broken analyzer~~ Problem with ATT to lttoolbox compiled transducers May 22, 2018

ftyers mentioned this issue Dec 20, 2018

hfst and lttoolbox transducers behave differently apertium/apertium-yid#2

Closed

flammie mentioned this issue Jan 2, 2019

Create tool for label pushing? hfst/hfst#422

Closed

unhammer changed the title ~~Problem with ATT to lttoolbox compiled transducers~~ Initial epsilons in ATT to lttoolbox compiled transducers not handled by lt-proc May 22, 2019

unhammer mentioned this issue May 22, 2019

parsing issues with converted transducer #57

Open

mr-martian mentioned this issue May 27, 2020

retain initial epsilon transitions when compiling ATT files #87

Merged

TinoDidriksen closed this as completed in #87 May 27, 2020
