parsing issues with converted transducer #57

jonorthwash · 2019-05-22T04:48:41Z

hfst-proc behaviour (expected):

$ echo "с." | hfst-proc sah.automorf.hfst 
^с./с.<abbr>$
$ echo "с.1" | hfst-proc sah.automorf.hfst 
^с./с.<abbr>$^1/1<num>/1<num><subst><nom>/1<num><subst><nom>+э<cop><aor><p3><sg>$

lt-proc behaviour (second one is unexpected):

$ echo "с." | lt-proc sah.automorf.bin 
^с./с.<abbr>$
$ echo "с.1" | lt-proc sah.automorf.bin 
^с/*с$^./.<sent>$^1/1<num>/1<num><subst><nom>/1<num><subst><nom>+э<cop><aor><p3><sg>$

Specifically, с. doesn't receive an analysis above; instead the . alone receives an analysis. My expectation is that the parsing would be LMLR (longest match, left to right), but it seems to be something else?


unhammer · 2019-05-22T07:36:54Z

Just in case: could it be #9 ? (Does the path for c.1 start with epsilons on the input side?)

jonorthwash · 2019-05-23T02:17:12Z

> Just in case: could it be #9 ? (Does the path for c.1 start with epsilons on the input side?)

I mean, yes, but so do all other paths, in this transducer:

0       1       ε       ε       0.000000        
1       13681   с       с       0.000000        
13681   723     .       .       0.000000        
723     7       ε       <abbr>  0.000000        
7       8       ε       ε       0.000000        
8       0.000000

Cf.

0       1       ε       ε       0.000000        
1       10      .       .       0.000000        
10      3       ε       <sent>  0.000000        
3       4       ε       ε       0.000000        
4       0.000000

But the latter path is in a separate section of the transducer, separated by -- in lt-print output (the former path is below the --, with most other things, and the latter is above, with only a few other things). This makes me think that @ftyers's hypothesis that it has to do with inconditional/standard section status might be right:

(00:44:56) spectie: it might expect that string to be in an inconditional section
(00:45:06) spectie: (there are different behaviours of the different sections)
(00:45:16) spectie: but the AttCompiler probably puts it in the standard section

unhammer · 2019-05-23T06:59:53Z

Hm, I think #9 might be about initial epsilons on the input side only (i.e. not aligned, as in ε c and then c ε or something).

It's correct that lt-proc would need the path for c. to be in an inconditional section in order to appear immediately before other standard analyses. I guess the fix is that lt-comp on att files should put things ending in periods/punctuation in inconditional? That would also allow things like croc. being tokenised as ^cro$^c.$ (avoid that by making sure the dictionary also has croc as one entry).

Is analysis of 1 in the standard section btw? (If it is in inconditional, the hypothesis is wrong – you can have a standard analysis immediately followed by inconditional.)
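For anyone following along, here is a rough toy model of the section behaviour described above. It is not lt-proc's actual algorithm, only a sketch under two assumptions taken from this thread: a match from a "standard" entry is accepted only if the next character is not alphanumeric (or the input ends), while a match from an "inconditional" entry is always accepted. The function name `tokenize` and the two entry sets are hypothetical.

```python
# Toy model only; NOT lt-proc's real implementation.
# Assumption: "standard" matches need a non-alphanumeric character (or end of
# input) right after them; "inconditional" matches are accepted anywhere.

def tokenize(text, standard, inconditional):
    """Greedy left-to-right longest match over two hypothetical entry sets."""
    out = []
    i = 0
    while i < len(text):
        best = None
        # Try the longest candidate starting at i first.
        for j in range(len(text), i, -1):
            cand = text[i:j]
            if cand in inconditional:
                best = cand
                break
            if cand in standard and (j == len(text) or not text[j].isalnum()):
                best = cand
                break
        if best is not None:
            out.append(best)
            i += len(best)
        elif text[i].isalnum():
            # Unknown word: consume the alphanumeric run and mark it with '*'.
            j = i
            while j < len(text) and text[j].isalnum():
                j += 1
            out.append("*" + text[i:j])
            i = j
        else:
            # Unanalysed non-alphanumeric character: pass it through as-is.
            out.append(text[i])
            i += 1
    return out


# "с." compiled into standard: it is rejected before "1", so "с" is unknown.
print(tokenize("с.1", standard={"с.", "1"}, inconditional={"."}))
# → ['*с', '.', '1']

# "с." compiled into inconditional: it is accepted directly before "1".
print(tokenize("с.1", standard={"1"}, inconditional={"с.", "."}))
# → ['с.', '1']
```

Under these assumptions the first call reproduces the shape of the unexpected lt-proc output from the original report, and the second reproduces the hfst-proc tokenisation.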

jonorthwash · 2019-05-23T17:20:12Z

> Is analysis of 1 in the standard section btw? (If it is in inconditional, the hypothesis is wrong – you can have a standard analysis immediately followed by inconditional.)

Most of what's above the -- appears to be number-loop-related things, but I can't find any paths that are the analysis of just 1, whereas the part below -- does include the analysis of 1. I assume the part below -- is standard and not inconditional?

AMR-KELEG · 2019-06-08T01:34:51Z

I believe the AttCompiler's classify function needs some refactoring/bug fixes:
https://github.com/apertium/lttoolbox/blob/master/lttoolbox/att_compiler.cc#L375

I am not sure I can work on it given my GSoC project.
The fix shouldn't be that hard but I need to discuss it with my mentors.

mr-martian · 2022-04-22T18:50:24Z

Paths in the FST are classified based on the first non-tag non-epsilon symbol on the input side.

$ printf 'PATTERNS\n[c.]\n[1]\n' | lexd
0	1	c	c	0.000000	
0	2	1	1	0.000000	
1	2	.	.	0.000000	
2	0.000000
$ printf 'PATTERNS\n[c.]\n[1]\n' | lexd > blah.att
$ lt-comp lr blah.att blah.bin
main@standard 3 3
$ echo 'c. 1 c.1' | lt-proc blah.bin 
^c./c.$ ^1/1$ ^c/*c$.^1/1$

In this case, both c and 1 are alphanumeric, so they both go into the standard section type.
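To make the classification rule concrete, here is a minimal sketch of "classify by the first non-tag, non-epsilon input symbol". It is not the code in att_compiler.cc; the helper names (`parse_att`, `classify_first_arcs`), the set of epsilon spellings, and the simple two-way alphanumeric/other split are assumptions for illustration.

```python
# Sketch only; the real classification lives in lttoolbox's att_compiler.cc.
# Assumptions: epsilon may be spelled in any of the ways below, tags look like
# <...>, and alphanumeric first symbols mean "standard", anything else
# "inconditional".

EPSILONS = {"ε", "@0@", "@_EPSILON_SYMBOL_@"}


def parse_att(text):
    """Return {src: [(dst, input_symbol), ...]} from ATT-format text."""
    arcs = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 4:  # arc line: src, dst, input, output[, weight]
            arcs.setdefault(fields[0], []).append((fields[1], fields[2]))
    return arcs


def is_skippable(sym):
    """Epsilons and <tag> symbols do not count as the first 'real' symbol."""
    return sym in EPSILONS or (len(sym) > 2 and sym[0] == "<" and sym[-1] == ">")


def classify_first_arcs(arcs, start="0"):
    """Map each first 'real' input symbol reachable from start to a section."""
    result = {}
    seen = set()
    stack = [start]
    while stack:
        state = stack.pop()
        if state in seen:
            continue
        seen.add(state)
        for dst, sym in arcs.get(state, []):
            if is_skippable(sym):
                stack.append(dst)  # keep looking past epsilons and tags
            else:
                result[sym] = "standard" if sym.isalnum() else "inconditional"
    return result


ATT = "0\t1\tc\tc\t0.000000\t\n0\t2\t1\t1\t0.000000\t\n1\t2\t.\t.\t0.000000\t\n2\t0.000000\n"
print(classify_first_arcs(parse_att(ATT)))
# → {'c': 'standard', '1': 'standard'}
```

Note that the `.` arc is never inspected, because it is not the first symbol on its path. That is why `c.` ends up in the same section type as `1` in this sketch.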

I think maybe the solution here is to allow two standard entries without intervening whitespace if they begin or end with non-alphanumeric characters.

unhammer · 2022-04-23T08:50:16Z

Isn't the solution rather to compile into inconditional those entries that begin or end with non-alphanumeric characters? Allowing analyses without intervening whitespace is the whole reason for having the inconditional/postblank/preblank feature in the first place; it feels a bit redundant to also have special logic for entries in the standard section that are not quite standard.

mr-martian · 2022-04-24T01:52:57Z

Upon further investigation I think you're right, but I'm not sure how to do that efficiently. Checking whether the initial character is punctuation can almost be done while reading in the file, but I'm having trouble coming up with something better than O(|V|^2) for checking ends.

On the other hand, maybe that's not so bad and really I should test this.

unhammer · 2022-04-24T19:46:06Z

I feel like this should also somehow be possible to solve by first reading them all into standard and then somehow splitting, or copying those paths into inconditional. (Like take the intersect with .*[[:punct:]] and union that into incond)
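A very rough sketch of the "read everything, then split" idea, for illustration only. Instead of a real FST intersection with `.*[[:punct:]]`, it simply enumerates the input-side strings of every accepting path, which only works for small acyclic transducers. The names `input_strings` and `split_sections` are made up here, and "punctuation" is assumed to mean any Unicode category starting with `P`.

```python
# Sketch only: path enumeration stands in for a real FST intersection, so this
# will not terminate on cyclic transducers (e.g. number loops).
import unicodedata

EPSILONS = {"ε", "@0@", "@_EPSILON_SYMBOL_@"}  # assumed epsilon spellings


def input_strings(att_text, start="0"):
    """Enumerate input-side strings of all accepting paths (acyclic FSTs only)."""
    arcs, finals = {}, set()
    for line in att_text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 4:      # arc line
            arcs.setdefault(fields[0], []).append((fields[1], fields[2]))
        elif fields[0]:           # final-state line
            finals.add(fields[0])
    results = []

    def walk(state, prefix):
        if state in finals:
            results.append(prefix)
        for dst, sym in arcs.get(state, []):
            is_tag = len(sym) > 2 and sym.startswith("<") and sym.endswith(">")
            walk(dst, prefix if (sym in EPSILONS or is_tag) else prefix + sym)

    walk(start, "")
    return results


def is_punct(ch):
    return unicodedata.category(ch).startswith("P")


def split_sections(strings):
    """Entries that begin or end with punctuation go to inconditional."""
    incond = [s for s in strings if s and (is_punct(s[0]) or is_punct(s[-1]))]
    standard = [s for s in strings if s not in incond]
    return standard, incond


ATT = "0\t1\tc\tc\t0.000000\t\n0\t2\t1\t1\t0.000000\t\n1\t2\t.\t.\t0.000000\t\n2\t0.000000\n"
print(split_sections(input_strings(ATT)))
# → (['1'], ['c.'])
```

A real implementation would have to do the split on the transducer itself rather than on enumerated strings, so this says nothing about the complexity concern raised above.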

mr-martian mentioned this issue Feb 10, 2021

better ATT arc classification #110

Merged
