parsing issues with converted transducer #57

Open
jonorthwash opened this issue May 22, 2019 · 9 comments

Comments

@jonorthwash
Member

hfst-proc behaviour (expected):

$ echo "с." | hfst-proc sah.automorf.hfst 
^с./с.<abbr>$
$ echo "с.1" | hfst-proc sah.automorf.hfst 
^с./с.<abbr>$^1/1<num>/1<num><subst><nom>/1<num><subst><nom>+э<cop><aor><p3><sg>$

lt-proc behaviour (second one is unexpected):

$ echo "с." | lt-proc sah.automorf.bin 
^с./с.<abbr>$
$ echo "с.1" | lt-proc sah.automorf.bin 
^с/*с$^./.<sent>$^1/1<num>/1<num><subst><nom>/1<num><subst><nom>+э<cop><aor><p3><sg>$

Specifically, in the second lt-proc example с. is not analysed as a single token: с comes out as unknown and the . alone receives an analysis. I expected the parsing to be left-to-right longest match (LRLM), but it seems to be something else?

@unhammer
Member

Just in case: could it be #9 ? (Does the path for c.1 start with epsilons on the input side?)

@jonorthwash
Member Author

jonorthwash commented May 23, 2019

> Just in case: could it be #9 ? (Does the path for c.1 start with epsilons on the input side?)

I mean, yes, but so do all other paths, in this transducer:

0       1       ε       ε       0.000000        
1       13681   с       с       0.000000        
13681   723     .       .       0.000000        
723     7       ε       <abbr>  0.000000        
7       8       ε       ε       0.000000        
8       0.000000

Cf.

0       1       ε       ε       0.000000        
1       10      .       .       0.000000        
10      3       ε       <sent>  0.000000        
3       4       ε       ε       0.000000        
4       0.000000

But the latter path is in a separate section of the transducer, separated by -- in lt-print output (the former path is below the --, with most other things, and the latter is above, with only a few other things). This makes me think that @ftyers's hypothesis that it has to do with inconditional/standard section status might be right:

(00:44:56) spectie: it might expect that string to be in an inconditional section
(00:45:06) spectie: (there are different behaviours of the different sections)
(00:45:16) spectie: but the AttCompiler probably puts it in the standard section

@unhammer
Member

Hm, I think #9 might be about initial epsilons on the input side only (i.e. not aligned, as in ε c and then c ε or something).

It's correct that lt-proc would need the path for c. to be in an inconditional section in order to appear immediately before other standard analyses. I guess the fix is that lt-comp, when compiling att files, should put entries ending in periods/punctuation in inconditional? A side effect is that something like croc. could then be tokenised as ^cro$^c.$ (avoid that by making sure the dictionary also has croc as one entry).

Is analysis of 1 in the standard section btw? (If it is in inconditional, the hypothesis is wrong – you can have a standard analysis immediately followed by inconditional.)

@jonorthwash
Member Author

> Is analysis of 1 in the standard section btw? (If it is in inconditional, the hypothesis is wrong – you can have a standard analysis immediately followed by inconditional.)

Most of what's above the -- appears to be number-loop-related things, but I can't find any paths there that are the analysis of just 1, whereas the part below the -- does include the analysis of 1. I assume the part below the -- is standard and not inconditional?

@AMR-KELEG
Contributor

AMR-KELEG commented Jun 8, 2019

I believe the AttCompiler's classify function needs some refactoring or bug fixes:
https://github.com/apertium/lttoolbox/blob/master/lttoolbox/att_compiler.cc#L375

I am not sure I can work on it given my GSoC project.
The fix shouldn't be that hard but I need to discuss it with my mentors.

@mr-martian
Contributor

Paths in the FST are classified based on the first non-tag non-epsilon symbol on the input side.

$ printf 'PATTERNS\n[c.]\n[1]\n' | lexd
0	1	c	c	0.000000	
0	2	1	1	0.000000	
1	2	.	.	0.000000	
2	0.000000
$ printf 'PATTERNS\n[c.]\n[1]\n' | lexd > blah.att
$ lt-comp lr blah.att blah.bin
main@standard 3 3
$ echo 'c. 1 c.1' | lt-proc blah.bin 
^c./c.$ ^1/1$ ^c/*c$.^1/1$

In this case, both c and 1 are alphanumeric, so they both go into the standard section type.
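
As a rough illustration of the rule described above (not the actual lttoolbox code), the classification can be sketched in Python. The function names, the set of epsilon spellings, and the choice of "inconditional" for non-alphanumeric symbols are assumptions made for this example, based only on what is said in this thread.

```python
# Hypothetical sketch: classify a path by its first non-tag, non-epsilon
# input symbol. Names and section labels are assumptions for illustration.
EPSILONS = {"", "ε", "@0@"}  # assumed spellings of epsilon


def is_tag(symbol):
    """Treat multi-character symbols wrapped in <...> as tags."""
    return len(symbol) > 2 and symbol.startswith("<") and symbol.endswith(">")


def classify_path(input_symbols):
    """Return the section type for a path, given its input-side symbols."""
    for sym in input_symbols:
        if sym in EPSILONS or is_tag(sym):
            continue
        # Only the FIRST plain symbol decides; later symbols are ignored.
        return "standard" if sym[0].isalnum() else "inconditional"
    return "standard"  # assumption: default when no plain symbol is found


print(classify_path(["ε", "с", ".", "ε", "ε"]))  # the с. path
print(classify_path(["ε", ".", "ε", "ε"]))       # the bare . path
```

Under this rule the с. path is classified by its initial с alone, so the trailing period plays no role in which section it ends up in.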

I think maybe the solution here is to allow two standard entries without intervening whitespace if they begin or end with non-alphanumeric characters.

@unhammer
Member

unhammer commented Apr 23, 2022

Isn't the solution rather to compile into inconditional those entries that begin or end with non-alphanumeric characters? Allowing analyses without intervening whitespace is the whole reason for having the inconditional/postblank/preblank feature in the first place; it feels a bit redundant to also have special logic for entries in the standard section that are not quite standard.
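
A minimal sketch of what that rule could look like at the level of a single entry, assuming the input side of the entry is available as a list of symbols. The helper names are hypothetical and this is not how lt-comp is implemented.

```python
# Hypothetical sketch of the proposed rule: an entry goes to "inconditional"
# if its input side begins OR ends with a non-alphanumeric character.
EPSILONS = {"", "ε", "@0@"}  # assumed spellings of epsilon


def plain_symbols(input_symbols):
    """Drop epsilons and <tag> symbols, keeping only surface symbols."""
    return [
        s for s in input_symbols
        if s not in EPSILONS
        and not (len(s) > 2 and s.startswith("<") and s.endswith(">"))
    ]


def proposed_section(input_symbols):
    plain = plain_symbols(input_symbols)
    if not plain:
        return "standard"  # assumption: nothing to inspect, keep the default
    first_char, last_char = plain[0][0], plain[-1][-1]
    if first_char.isalnum() and last_char.isalnum():
        return "standard"
    return "inconditional"


print(proposed_section(["ε", "с", ".", "ε", "ε"]))  # с. ends in punctuation
print(proposed_section(["1"]))                      # purely alphanumeric
```

With this rule the с. entry would land in inconditional, while 1 stays in standard.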

@mr-martian
Contributor

Upon further investigation I think you're right, but I'm not sure how to do that efficiently. Checking whether the initial character is punctuation can almost be done while reading in the file, but I'm having trouble coming up with something better than O(|V|^2) for checking ends.

On the other hand, maybe that's not so bad and really I should test this.
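
One possible way to look at the "ends" check, sketched here under assumptions: walk backwards from the final states over arcs whose input side is epsilon, and then pick out the arcs that lead into that set. This only finds which arcs can carry the last input symbol of some accepting path; it does not by itself separate the paths into sections, and it is not taken from lttoolbox. The function name and data layout are made up for the example.

```python
from collections import defaultdict, deque

EPSILONS = {"", "ε", "@0@"}  # assumed spellings of epsilon


def last_symbol_arcs(arcs, finals):
    """Return arcs whose input symbol can be the last one consumed on a path.

    arcs: list of (src, dst, input_symbol); finals: set of final states.
    Each arc is visited a constant number of times.
    """
    # Reverse adjacency, restricted to arcs with epsilon on the input side.
    eps_preds = defaultdict(list)
    for src, dst, isym in arcs:
        if isym in EPSILONS:
            eps_preds[dst].append(src)

    # States that can reach a final state without consuming more input.
    closing = set(finals)
    queue = deque(finals)
    while queue:
        state = queue.popleft()
        for pred in eps_preds[state]:
            if pred not in closing:
                closing.add(pred)
                queue.append(pred)

    return [(s, d, i) for s, d, i in arcs if i not in EPSILONS and d in closing]


# Input side of the с. path from the lt-print output earlier in the thread.
arcs = [(0, 1, "ε"), (1, 13681, "с"), (13681, 723, "."), (723, 7, "ε"), (7, 8, "ε")]
ending = last_symbol_arcs(arcs, {8})
print(ending)  # [(13681, 723, '.')]
```

Checking whether the symbol on each returned arc is punctuation then says whether some path ends in punctuation; splitting those paths off from the rest is a separate step.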

@unhammer
Member

I feel like this should also be possible to solve by first reading everything into standard and then splitting, or copying the relevant paths into inconditional. (For example, take the intersection with .*[[:punct:]] and union that into inconditional.)
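
To make the idea concrete, here is a string-level sketch in Python using the small lexd example from earlier in the thread. It enumerates the input side of every path and partitions the paths by whether they end in punctuation. This assumes an acyclic transducer (enumeration would not terminate with loops), and it approximates [[:punct:]] with Unicode categories starting with "P"; a real implementation would presumably do the intersection on the transducer itself. All function names are hypothetical.

```python
import unicodedata
from collections import defaultdict

EPSILONS = {"", "ε", "@0@"}  # assumed spellings of epsilon

# ATT text for the lexd example in this thread (patterns [c.] and [1]).
ATT = "0\t1\tc\tc\t0.000000\n0\t2\t1\t1\t0.000000\n1\t2\t.\t.\t0.000000\n2\t0.000000\n"


def read_att(text):
    """Parse ATT lines into {src: [(dst, input, output)]} and a set of finals."""
    arcs, finals = defaultdict(list), set()
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 4:
            src, dst, isym, osym = fields[:4]
            arcs[src].append((dst, isym, osym))
        elif fields[0]:
            finals.add(fields[0])
    return arcs, finals


def input_strings(arcs, finals, state="0", prefix=""):
    """Yield the input side of every accepting path (acyclic FSTs only)."""
    if state in finals:
        yield prefix
    for dst, isym, _osym in arcs.get(state, []):
        step = "" if isym in EPSILONS else isym
        yield from input_strings(arcs, finals, dst, prefix + step)


def ends_in_punct(s):
    """Approximate .*[[:punct:]] with Unicode punctuation categories."""
    return bool(s) and unicodedata.category(s[-1]).startswith("P")


arcs, finals = read_att(ATT)
paths = list(input_strings(arcs, finals))
inconditional = [p for p in paths if ends_in_punct(p)]
standard = [p for p in paths if not ends_in_punct(p)]
print("standard:", standard, "inconditional:", inconditional)
```

Here c. ends up in the inconditional group and 1 stays in standard, which is the split the intersect-and-union approach is aiming for.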
