Initial epsilons in ATT to lttoolbox compiled transducers not handled by lt-proc #9
Comments
Forms with + in them have to be sent through |
That's not quite the issue. The problem is that
It's something to do with initial epsilon transitions I suspect. |
Ah, sorry, I read the report a bit too quickly. How was |
As far as I'm concerned, this is a bug in lttoolbox. See apertium/apertium-yid#2 and apertium/apertium-yid@efdb8ea for a similar issue and my fairly simple work-around (which may not be possible in all languages). |
the workaround is to shuffle around continuation lexicons? is it possible to make a minimal test case out of this? |
Yeah, pretty much, but that won't work in all cases. It seems to fall if you have
Should be pretty simple... |
Tested working and broken, respectively. Will post code to compile and test later. |
To compile the above files:
To test:
The contents of
The contents of
|
I'm starting to wonder if this is a bug in |
Well, no, the "broken" transducer is valid for the given input and output, albeit not optimal. The real issue seems to be related to how On that note, what's the difference between @flammie, this thread may be of interest. |
lt-print produces hfst-txt2fst expects In any case, a Also,
(whether It may be that lttoolbox has some expectation that the input-side always has a symbol on the first transition. It's always possible to turn the fst into something where that's true, as long as we have no empty left-hand-side (which I think lt-proc complains about if you try to do that). I don't know how easy it is to do that transformation though. There's some discussion at hfst/hfst#400 about a tool that would do something similar |
I'm pretty sure this is true. At least, it won't follow a path where the first input-side character doesn't match the first input character (could be an easy fix?) I shifted the first left-side
Testing:
|
I think the idea that lttoolbox doesn't support e.g. initial epsilons on the "input" side is probably correct; as far as I've understood, lttoolbox only has a compiler and a lookup for a very specific subset of finite-state automata. On a side note, this is also very suboptimal for any real FST library. I know OpenFst has the tools to fix this, at least in some conditions, but it's possible that the ideal solution is not solvable in the general case in a reasonable amount of time; see http://www.openfst.org/twiki/bin/view/FST/PushDoc |
The linked tool fixes the transducer, iiuc. I think a fix to the parsing code would be simpler in the end, for several reasons. |
This should be fixed in ATTCompiler probably. |
I seem to be having this too. |
Yeah, because of how you're using flag diacritics, the beginning of your transducer for some of those paths looks like this:
So you start with |
No, not in the compiler—I meant in the parser of compiled transducers. That is, it shouldn't just give up on an input if it has to go through a null input character/arc before it gets to the first character of the input. My point was that the fix to the compiler would probably be much, much harder to implement, because it requires shifting stuff around. |
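To make the proposed parser-side fix concrete, here is a minimal sketch in Python of a lookup that follows input-epsilon arcs before consuming the first input character and again after each consumed character. It is not lttoolbox's code: the arc tuples, the `EPS` marker, and the function names are assumptions made up for illustration, and the sketch assumes the transducer has no cycles made only of input-epsilon arcs.

```python
from collections import defaultdict

EPS = ""  # assumption: the empty string stands for an epsilon label


def build(arcs):
    """Index arcs (src, dst, in_sym, out_sym) by their source state."""
    table = defaultdict(list)
    for src, dst, in_sym, out_sym in arcs:
        table[src].append((dst, in_sym, out_sym))
    return table


def epsilon_closure(table, configs):
    """Extend (state, output) pairs along arcs whose input side is epsilon.

    Assumes there is no cycle of input-epsilon arcs that emits output,
    otherwise the set of configurations would grow without bound.
    """
    seen = set(configs)
    stack = list(configs)
    while stack:
        state, out = stack.pop()
        for dst, in_sym, out_sym in table[state]:
            if in_sym == EPS:
                nxt = (dst, out + out_sym)
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return seen


def lookup(arcs, finals, word, start=0):
    """Return every output string the transducer produces for `word`."""
    table = build(arcs)
    # Follow initial epsilons *before* looking at the first input character.
    configs = epsilon_closure(table, {(start, "")})
    for ch in word:
        stepped = {
            (dst, out + out_sym)
            for state, out in configs
            for dst, in_sym, out_sym in table[state]
            if in_sym == ch
        }
        configs = epsilon_closure(table, stepped)
    return sorted(out for state, out in configs if state in finals)


# A toy transducer whose first arc has an empty input side (ε:c), then a:a.
toy_arcs = [(0, 1, EPS, "c"), (1, 2, "a", "a")]
print(lookup(toy_arcs, {2}, "a"))  # ['ca']
```

A lookup that only compared the first input character against the arcs leaving the start state would return nothing for this toy transducer, which matches the behaviour reported in this thread.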
That would be
I think it'd be more maintainable in the long run to do something like the pushlabels suggested by @flammie . Isn't this a problem that only manifests in languages already depending on HFST? |
I think in the general case it can be said that this problem only appears in automata that are not produced by lt-comp's dix compilation? In terms of maintainability, it's a trade-off between complex C++ code getting even more complex, or a few lines of make scripts and an extra dep, right? Both are quite bad. |
It won't mean an extra dep if it's already in HFST and only affects language packages that already depend on HFST. But HFST might need a CLI tool that exposes the pushlabels feature of OpenFst. |
I bumped into the same problem while compiling HFST transducers that had been converted to AT&T format. |
I think this is the same issue: test.att
test2.att
$ lt-comp lr test.att test.bin
main@standard 2 1
final@inconditional 3 2
$ lt-print test.bin
0 1 ε ε 0.000000
1 2 . . 0.000000
2 0.000000
--
0 1 ε ε 0.000000
$ lt-comp lr test2.att test2.bin
main@standard 1 0
final@inconditional 2 1
$ lt-print test2.bin
0 1 . . 0.000000
1 0.000000
--
(the default output of
The output I would expect is as follows:
$ lt-print test.bin
0 1 ε ε 0.000000
1 2 . . 0.000000
2 0.000000
--
0 1 ε ε 0.000000
1 2 ε c 0.000000
2 0.000000
$ lt-print test2.bin
0 1 . . 0.000000
1 0.000000
--
0 1 ε c 0.000000
1 0.000000 |
This .att file produces the expected output.
I believe the issue is the following check: lttoolbox/lttoolbox/att_compiler.cc Lines 238 to 243 in 5e69502
This discards any transition (other than one on the first line of the file) that has an epsilon on the left side but not on the right, unless it is preceded by at least one line with a non-epsilon left side. Removing this check results in the examples in my previous comment working as expected. Why is the check there? It was added in 6bce53b but I don't know what bug is being referred to. @ftyers |
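The rule as described in the comment above can be paraphrased in a few lines of Python. This is only a sketch of the described behaviour, not a port of the C++ in att_compiler.cc: the function name, the `EPSILONS` set, and the sample transitions are assumptions for illustration, and the sample input is hypothetical rather than the contents of test.att.

```python
EPSILONS = {"@0@", "ε"}  # assumption: labels treated as epsilon


def apply_described_check(att_lines):
    """Split AT&T lines into (kept, dropped) under the rule described above."""
    kept, dropped = [], []
    seen_non_epsilon_left = False
    for index, line in enumerate(att_lines):
        fields = line.split("\t")
        if len(fields) < 4:
            # Final-state line: it carries no symbols, so it is always kept.
            kept.append(line)
            continue
        left_eps = fields[2] in EPSILONS
        right_eps = fields[3] in EPSILONS
        # Drop ε:x transitions (not on the first line) that appear before
        # any transition with a non-epsilon left side.
        if index > 0 and left_eps and not right_eps and not seen_non_epsilon_left:
            dropped.append(line)
            continue
        if not left_eps:
            seen_non_epsilon_left = True
        kept.append(line)
    return kept, dropped


# Hypothetical input: ε:ε, then ε:c, then a final state.
sample = ["0\t1\t@0@\t@0@", "1\t2\t@0@\tc", "2"]
kept, dropped = apply_described_check(sample)
print(dropped)  # ['1\t2\t@0@\tc']
```

Under this paraphrase the ε:c transition is dropped and the rest is kept, which is consistent with the missing `1 2 ε c` line in the lt-print output reported earlier.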
My analysis of the problem was that the current algorithm assumes that you need to have a way to determine whether a certain path belongs to the main or the inconditional part of the transducer. |
@AMR-KELEG it sets the initial classification to |
Lttoolbox generates forms but fails to analyze them.
For example:
$ echo "^a<prn><p1><sg>+guata<v><iv><pres>$" | lt-proc -g grn.autogen.bin
aguata
But the morphological analyzer does not recognize this form:
$ echo "aguata" | apertium -d . grn-morph
^aguata/*aguata$^./.<sent>$
Some forms, however, are analyzed correctly:
$ echo "ndaguatái" | apertium -d . grn-morph
^ndaguatái/nd<neg>+a<prn><p1><sg>+guata<v><iv><pres>+i<neg>$^./.<sent>$
We would be very grateful if you could fix this.