Initial epsilons in ATT to lttoolbox compiled transducers not handled by lt-proc #9

Closed
ana-kuznetsova opened this issue May 22, 2018 · 28 comments · Fixed by #87

@ana-kuznetsova

Lttoolbox generates forms, but fails to analyze them.
For example:

echo "^a<prn><p1><sg>+guata<v><iv><pres>$" | lt-proc -g grn.autogen.bin
aguata

But the morphological analyzer does not recognize that form:

echo "aguata" | apertium -d . grn-morph
^aguata/*aguata$^./.<sent>$

Although some forms are analyzed correctly:

echo "ndaguatái" | apertium -d . grn-morph
^ndaguatái/nd<neg>+a<prn><p1><sg>+guata<v><iv><pres>+i<neg>$^./.<sent>$

We would be very grateful if you could fix this.

@unhammer
Member

Forms with + in them have to be sent through apertium-pretransfer first, which turns ^a+b$ into ^a$ ^b$.
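
As a rough illustration of that ^a+b$ to ^a$ ^b$ step, here is a minimal Python sketch. It is not the apertium-pretransfer implementation: the function name split_joined is made up for this example, and it assumes a plain split on every + inside a lexical unit, ignoring escaping and any other cases the real tool handles.

```python
import re

def split_joined(stream: str) -> str:
    """Toy model: turn each '^a+b$' lexical unit into '^a$ ^b$'.

    Assumption: every '+' inside a unit is a join marker (no escapes).
    """
    def repl(match):
        parts = match.group(1).split("+")
        return " ".join(f"^{part}$" for part in parts)

    # A lexical unit is everything between '^' and the next '$'.
    return re.sub(r"\^([^$]*)\$", repl, stream)

print(split_joined("^a<prn><p1><sg>+guata<v><iv><pres>$"))
# prints: ^a<prn><p1><sg>$ ^guata<v><iv><pres>$
```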

@ftyers reopened this May 22, 2018
@ftyers changed the title from "Broken analyzer" to "Problem with ATT to lttoolbox compiled transducers" May 22, 2018
@ftyers
Member

ftyers commented May 22, 2018

That's not quite the issue. The problem is that lttoolbox and HFST have different behaviours. Consider the following:

$ echo "^a<prn><p1><sg>+guata<v><iv><pres>$" | lt-proc -g grn.autogen.bin
aguata

$ echo "a<prn><p1><sg>+guata<v><iv><pres>" | hfst-lookup grn.autogen.hfst 
a<prn><p1><sg>+guata<v><iv><pres>	aguata	0,000000

$ echo "aguata" | lt-proc grn.automorf.bin 
^aguata/*aguata$

$ echo "aguata" | hfst-lookup grn.automorf.hfst 
aguata	a<prn><p1><sg>+guata<v><iv><pres>	0,000000
aguata	a<prn><p1><sg>+guata<v><tv>	0,000000

It's something to do with initial epsilon transitions I suspect.

@unhammer
Member

Ah, sorry, I read the report a bit too quickly.

How was grn.autogen.bin created? Would it be possible to upload a minimal one that has only the relevant words?

@jonorthwash
Member

As far as I'm concerned, this is a bug in lttoolbox. See apertium/apertium-yid#2 and apertium/apertium-yid@efdb8ea for a similar issue and my fairly simple work-around (which may not be possible in all languages).

@unhammer
Member

the workaround is to shuffle around continuation lexicons?

is it possible to make a minimal test case out of this?

@jonorthwash
Member

the workaround is to shuffle around continuation lexicons?

Yeah, pretty much, but that won't work in all cases. It seems to fail if you have abc: CONTIN ;. The workaround is to put something on the right side of the :.

is it possible to make a minimal test case out of this?

Should be pretty simple...

@jonorthwash
Member

works.lexc:

Multichar_Symbols

%<det%>

LEXICON Root

Determiners ;

LEXICON INFL-Det

%<det%>: # ;

LEXICON Determiners

the:the INFL-Det ;

broken.lexc:

Multichar_Symbols

%<det%>

LEXICON Root

Determiners ;

LEXICON INFL-Det

%<det%>:the # ;

LEXICON Determiners

the: INFL-Det ;

Tested working and broken, respectively. Will post code to compile and test later.

@jonorthwash
Member

To compile the above files:

$ hfst-lexc broken.lexc | hfst-invert | hfst-fst2fst -O -o broken.hfst
$ hfst-lexc works.lexc | hfst-invert | hfst-fst2fst -O -o works.hfst
$ hfst-fst2txt works.hfst > works.att
$ lt-comp lr works.att works.bin
$ hfst-fst2txt broken.hfst > broken.att
$ lt-comp lr broken.att broken.bin 

To test:

$ echo "the" | hfst-proc works.hfst
^the/the<det>$
$ echo "the" | hfst-proc broken.hfst
^the/the<det>$

$ echo "the" | lt-proc works.bin
^the/the<det>$
$ echo "the" | lt-proc broken.bin
^the/*the$

The contents of works.att:

0	1	t	t	0.000000	
1	2	h	h	0.000000	
2	3	e	e	0.000000	
3	4	ε	<det>	0.000000	
4	0.000000

The contents of broken.att:

0	1	@0@	t	0.000000
1	2	@0@	h	0.000000
2	3	@0@	e	0.000000
3	4	t	<det>	0.000000
4	5	h	@0@	0.000000
5	6	e	@0@	0.000000
6	0.000000
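
The difference between the two files can be checked mechanically. The sketch below is only an illustration in Python, not part of lttoolbox or HFST: parse_att and initial_input_epsilons are hypothetical helper names, fields are split on any whitespace (real ATT files are tab-separated), and both @0@ and ε are treated as epsilon.

```python
EPSILONS = {"@0@", "ε"}  # epsilon as it appears in the two listings above

def parse_att(text):
    """Return (arcs, finals); arcs are (src, dst, inp, out), epsilon as ''."""
    arcs, finals = [], set()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 4:
            src, dst, inp, out = fields[:4]
            arcs.append((int(src), int(dst),
                         "" if inp in EPSILONS else inp,
                         "" if out in EPSILONS else out))
        elif fields:
            finals.add(int(fields[0]))  # final-state line: state, weight
    return arcs, finals

def initial_input_epsilons(arcs, start=0):
    """Arcs leaving the start state that consume no input symbol."""
    return [arc for arc in arcs if arc[0] == start and arc[2] == ""]

WORKS_ATT = """0 1 t t 0.000000
1 2 h h 0.000000
2 3 e e 0.000000
3 4 ε <det> 0.000000
4 0.000000"""

BROKEN_ATT = """0 1 @0@ t 0.000000
1 2 @0@ h 0.000000
2 3 @0@ e 0.000000
3 4 t <det> 0.000000
4 5 h @0@ 0.000000
5 6 e @0@ 0.000000
6 0.000000"""

print(initial_input_epsilons(parse_att(WORKS_ATT)[0]))   # prints: []
print(initial_input_epsilons(parse_att(BROKEN_ATT)[0]))  # prints: [(0, 1, '', 't')]
```

Only broken.att starts with an arc that has epsilon on the input side.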

@jonorthwash
Member

I'm starting to wonder if this is a bug in hfst-fst2txt...

@jonorthwash
Member

Well, no, the "broken" transducer is valid for the given input and output, albeit not optimal.

The real issue seems to be related to how lt-proc deals with @0@.
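
To back up the claim that the "broken" transducer is valid, here is a small Python sketch of a lookup that follows input-epsilon arcs freely. It is a toy model, not the lt-proc or hfst-proc algorithm: parse_att and lookup are hypothetical names, weights are ignored, and it assumes the transducer has no epsilon cycles.

```python
EPSILONS = {"@0@", "ε"}

def parse_att(text):
    """Return (arcs, finals); arcs are (src, dst, inp, out), epsilon as ''."""
    arcs, finals = [], set()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 4:
            src, dst, inp, out = fields[:4]
            arcs.append((int(src), int(dst),
                         "" if inp in EPSILONS else inp,
                         "" if out in EPSILONS else out))
        elif fields:
            finals.add(int(fields[0]))
    return arcs, finals

def lookup(arcs, finals, word, start=0):
    """All output strings for `word`; input-epsilon arcs consume nothing."""
    results, seen = set(), set()
    stack = [(start, 0, "")]  # (state, position in word, output so far)
    while stack:
        config = stack.pop()
        if config in seen:
            continue
        seen.add(config)
        state, pos, out = config
        if pos == len(word) and state in finals:
            results.add(out)
        for src, dst, inp, arc_out in arcs:
            if src != state:
                continue
            if inp == "":
                stack.append((dst, pos, out + arc_out))
            elif word.startswith(inp, pos):
                stack.append((dst, pos + len(inp), out + arc_out))
    return results

BROKEN_ATT = """0 1 @0@ t 0.000000
1 2 @0@ h 0.000000
2 3 @0@ e 0.000000
3 4 t <det> 0.000000
4 5 h @0@ 0.000000
5 6 e @0@ 0.000000
6 0.000000"""

arcs, finals = parse_att(BROKEN_ATT)
print(lookup(arcs, finals, "the"))  # prints: {'the<det>'}
```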

On that note, what's the difference between @0@ and ε?

@flammie, this thread may be of interest.

@unhammer
Member

unhammer commented Dec 25, 2018

lt-print produces ε for epsilons, hfst-fst2txt produces @0@.

hfst-txt2fst expects @0@ but can handle ε if you give -e ε.
I don't know if lt-comp handles both.

In any case, running sed 's/@0@/ε/g' broken.att >broken.latt && lt-comp lr broken.latt broken.ltbin && echo the | lt-proc broken.ltbin doesn't help; the word is still unknown.

Also,

$ lt-print broken.ltbin
Error: empty set of final states

(whether @0@ or ε)


It may be that lttoolbox has some expectation that the input-side always has a symbol on the first transition. It's always possible to turn the FST into something where that's true, as long as we have no empty left-hand side (which I think lt-proc complains about if you try to do that). I don't know how easy that transformation is, though. There's some discussion at hfst/hfst#400 about a tool that would do something similar.

@jonorthwash
Member

jonorthwash commented Dec 25, 2018

It may be that lttoolbox has some expectation that the input-side always has a symbol on the first transition.

I'm pretty sure this is true. At least, it won't follow a path where the first input-side character doesn't match the first input character (could be an easy fix?). I shifted the first left-side t up to state 0, and it seems to work:

nolongerbroken.att:

0	1	t	t	0.000000
1	2	@0@	h	0.000000
2	3	@0@	e	0.000000
3	4	@0@	<det>	0.000000
4	5	h	@0@	0.000000
5	6	e	@0@	0.000000
6	0.000000

Testing:

$ lt-comp lr nolongerbroken.att nolongerbroken.bin
$ echo the | lt-proc nolongerbroken.bin
^the/the<det>$
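
The manual shift above moves only the first t. As a hedged sketch of doing the same thing for every input symbol, the Python below realigns a single linear path so that all input symbols come before any input epsilons. push_inputs is a hypothetical name, and the approach assumes a transducer with exactly one path; it says nothing about branching transducers.

```python
EPS = "@0@"

def push_inputs(path):
    """Realign a linear path of (input, output) pairs.

    Input and output symbols keep their order, but are paired up from the
    start of the path, with epsilon padding only at the end.
    """
    inputs = [inp for inp, _ in path if inp != EPS]
    outputs = [out for _, out in path if out != EPS]
    length = max(len(inputs), len(outputs))
    inputs += [EPS] * (length - len(inputs))
    outputs += [EPS] * (length - len(outputs))
    return list(zip(inputs, outputs))

# The (input, output) labels of broken.att, in path order.
broken_path = [(EPS, "t"), (EPS, "h"), (EPS, "e"),
               ("t", "<det>"), ("h", EPS), ("e", EPS)]

print(push_inputs(broken_path))
# prints: [('t', 't'), ('h', 'h'), ('e', 'e'), ('@0@', '<det>')]
```

For this one path the result has the same labels as works.att, apart from the spelling of epsilon.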

@flammie
Member

flammie commented Dec 28, 2018

I think the idea that lttoolbox doesn't support e.g. initial epsilons on the "input" side is probably correct; as far as I've understood, lttoolbox only has a compiler and lookup for a very specific subset of finite-state automata. On a side note, this is also very suboptimal for any real FST library. I know openfst has the tools to fix this at least in some conditions; it's possible that the ideal solution is not solvable in the general case in a reasonable amount of time. See http://www.openfst.org/twiki/bin/view/FST/PushDoc

@jonorthwash
Member

I know openfst has the tools to fix this at least in some conditions

The linked tool fixes the transducer, iiuc. I think a fix to the parsing code would be simpler in the end, for several reasons.

@ftyers
Member

ftyers commented Dec 29, 2018

I know openfst has the tools to fix this at least in some conditions

The linked tool fixes the transducer, iiuc. I think a fix to the parsing code would be simpler in the end, for several reasons.

This should be fixed in ATTCompiler probably.

@MemduhG

MemduhG commented Dec 29, 2018

$ echo deken | hfst-lookup ckb.automorf.hfst
> deken krdn<v><tv><neg><pri><p2><pl>   0,000000
deken   krdn<v><tv><neg><pri><p3><pl>   0,000000
deken   krdn<v><tv><pri><p2><pl>        0,000000
deken   krdn<v><tv><pri><p3><pl>        0,000000

$ echo "deken" | lt-proc ckb.automorf.bin
^deken/*deken$

I seem to be having this too.

@jonorthwash
Member

I seem to be having this too.

Yeah, because of how you're using flag diacritics, the beginning of your transducer for some of those paths looks like this:

0       1       @0@     @0@     0.000000
1       2       @P.Asp.Prog@    @P.Asp.Prog@    0.000000
2       471     d       @0@     0.000000

So you start with @0@ on the input side for all paths, it looks like. It'd be a wonder if your .bin transducer worked for any input. As discussed on IRC, using twol-style constraints to do your prefixational morphology will probably help.

@jonorthwash
Member

The linked tool fixes the transducer, iiuc. I think a fix to the parsing code would be simpler in the end, for several reasons.

This should be fixed in ATTCompiler probably.

No, not in the compiler—I meant in the parser of compiled transducers. That is, it shouldn't just give up on an input if it has to go through a null input character/arc before it gets to the first character of the input. My point was that the fix to the compiler would probably be much, much harder to implement, because it requires shifting stuff around.
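
To make "go through a null input arc before the first character" concrete, here is a minimal Python sketch of an epsilon closure over the arcs of broken.att. It is not lttoolbox code: epsilon_closure is a hypothetical name, arcs are plain (src, dst, input, output) tuples with '' for epsilon, and it keeps only one collected output per state, which is enough for this linear example.

```python
def epsilon_closure(arcs, start=0):
    """Map each state reachable from `start` through input-epsilon arcs
    to the output collected on the way there."""
    closure = {start: ""}
    stack = [start]
    while stack:
        state = stack.pop()
        for src, dst, inp, out in arcs:
            if src == state and inp == "" and dst not in closure:
                closure[dst] = closure[state] + out
                stack.append(dst)
    return closure

# Arcs of broken.att, with @0@ written as ''.
BROKEN_ARCS = [
    (0, 1, "", "t"), (1, 2, "", "h"), (2, 3, "", "e"),
    (3, 4, "t", "<det>"), (4, 5, "h", ""), (5, 6, "e", ""),
]

closure = epsilon_closure(BROKEN_ARCS)
print(closure)  # prints: {0: '', 1: 't', 2: 'th', 3: 'the'}

# Input symbols that can be consumed first, once the closure is taken.
first_symbols = {inp for src, _, inp, _ in BROKEN_ARCS if src in closure and inp}
print(first_symbols)  # prints: {'t'}
```

A matcher that starts from the whole closure rather than from state 0 alone would find the t arc at state 3.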

@unhammer
Member

the parser of compiled transducer

That would be fst_processor.cc, which eventually calls state.cc's State::apply. Perhaps it'd be possible to make that function skip epsilons, but it's quite an overloaded function, so you'd have to make the change in several places:

    101:State::apply(int const input)
    134:State::apply_override(int const input, int const old_sym, int const new_sym)
    198:State::apply(int const input, int const alt)
    246:State::apply_careful(int const input, int const alt)
    320:State::apply(int const input, int const alt1, int const alt2)
    382:State::apply(int const input, set<int> const alts)

I think it'd be more maintainable in the long run to do something like the pushlabels suggested by @flammie . Isn't this a problem that only manifests in languages already depending on HFST?

@flammie
Member

flammie commented Dec 31, 2018

I think it'd be more maintainable in the long run to do something like the pushlabels suggested by @flammie . Isn't this a problem that only manifests in languages already depending on HFST?

I think in the general case it can be said that this problem only appears in automata that are not made by lt-comp's dix compilation? In terms of maintainability, it's a trade-off between complex C++ code getting even more complex or some lines of make scripts and an extra dep, right? Both are quite bad.

@unhammer
Member

unhammer commented Jan 1, 2019

It won't mean an extra dep if it's already in HFST and only affects language packages that already depend on HFST. But HFST might need a cli tool that exposes the pushlabels feature of openfst.

@AMR-KELEG
Contributor

AMR-KELEG commented May 21, 2019

I bumped into the same problem while compiling HFST transducers that were converted to AT&T format.
lt-print isn't working as expected and lt-proc doesn't produce the correct weights for the generated analyses!

@unhammer changed the title from "Problem with ATT to lttoolbox compiled transducers" to "Initial epsilons in ATT to lttoolbox compiled transducers not handled by lt-proc" May 22, 2019
@mr-martian
Contributor

I think this is the same issue:

test.att

0	1	@0@	@0@	0.000000
1	2	@0@	c	0.000000
1	2	.	.	0.000000
2	0.000000

test2.att

0	1	@0@	c	0.000000
0	1	.	.	0.000000
1	0.000000

$ lt-comp lr test.att test.bin
main@standard 2 1
final@inconditional 3 2
$ lt-print test.bin
0	1	ε	ε	0.000000	
1	2	.	.	0.000000	
2	0.000000
--
0	1	ε	ε	0.000000	
$ lt-comp lr test2.att test2.bin
main@standard 1 0
final@inconditional 2 1
$ lt-print test2.bin
0	1	.	.	0.000000	
1	0.000000
--

(The default output of lt-print here is Error: empty set of final states; I deleted a call to joinFinals() in my local copy so that it would output anyway.)

The output I would expect is as follows:

$ lt-print test.bin
0	1	ε	ε	0.000000	
1	2	.	.	0.000000	
2	0.000000
--
0	1	ε	ε	0.000000	
1	2	ε	c	0.000000	
2	0.000000
$ lt-print test2.bin
0	1	.	.	0.000000	
1	0.000000
--
0	1	ε	c	0.000000	
1	0.000000

@mr-martian
Contributor

mr-martian commented May 23, 2020

This .att file produces the expected output.

0	1	@0@	@0@	0.000000
1	2	.	.	0.000000
1	2	@0@	c	0.000000
2	0.000000

I believe the issue is the following check:

/* skip lines that have an empty left side and output
   if we haven't seen an input symbol */
if(upper == L"" && lower != L"" && !seen_input_symbol)
{
  continue;
}

This discards any transition in the file (except the first line) that has an epsilon on the left side but not on the right, unless at least one earlier line has a non-epsilon left side.

Removing this check results in the examples in my previous comment working as expected.
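
The effect of the check on line order can be reproduced with a small Python model. This is only a sketch of the quoted condition, not the ATT compiler itself: kept_lines is a hypothetical name, the point at which seen_input_symbol becomes true (any line with a non-epsilon left side) is an assumption, and any special handling of the first line is not modelled.

```python
EPSILONS = {"@0@", "ε"}

def kept_lines(att_text):
    """Return the ATT lines that survive the quoted check."""
    kept = []
    seen_input_symbol = False
    for line in att_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            kept.append(line)  # final-state line, not a transition
            continue
        upper = "" if fields[2] in EPSILONS else fields[2]
        lower = "" if fields[3] in EPSILONS else fields[3]
        if upper == "" and lower != "" and not seen_input_symbol:
            continue  # same condition as the quoted C++ check
        if upper != "":
            seen_input_symbol = True  # ASSUMPTION: set on any non-epsilon left side
        kept.append(line)
    return kept

TEST_ATT = """0 1 @0@ @0@ 0.000000
1 2 @0@ c 0.000000
1 2 . . 0.000000
2 0.000000"""

REORDERED_ATT = """0 1 @0@ @0@ 0.000000
1 2 . . 0.000000
1 2 @0@ c 0.000000
2 0.000000"""

print(len(kept_lines(TEST_ATT)))       # prints: 3
print(len(kept_lines(REORDERED_ATT)))  # prints: 4
```

In this model the @0@ c transition is dropped from test.att but kept in the reordered file, which matches the lt-print output shown earlier.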

Why is the check there?

It was added in 6bce53b but I don't know what bug is being referred to. @ftyers

@AMR-KELEG
Contributor

AMR-KELEG commented May 23, 2020

My analysis of the problem was that the current algorithm assumes that you need to have a way to determine whether a certain path belongs to the main or the inconditional part of the transducer.
An epsilon transition doesn't disambiguate this so you need to check the following transitions until you can determine the type of the path.

@mr-martian
Contributor

@AMR-KELEG it sets the initial classification to BOTH and then epsilon transitions get the type of whatever precedes them. Thus if you remove the check, all 3 of my ATT files get compiled such that all transitions are classified as both punctuation and standard.

@jonorthwash
Member

Aha, so this is related to #81 and #85?

@mr-martian
Contributor

Aha, so this is related to #81 and #85?

The particular thing I was describing isn't about what counts as punctuation, but rather that everything counts as epsilon.

8 participants