compounding on multiwords #138

Closed
unhammer opened this issue Apr 21, 2022 · 0 comments

unhammer commented Apr 21, 2022

Currently, we only compound when a sequence is marked as unknown. A sequence is delimited by inconditionals or spaces, so if "på togløpet." is analysed as ^på<pr>$ ^*togløpet$^.<sent>$ without compounding, we may get ^på<pr>$ ^tog<n><cmp>+løp<n>$^.<sent>$ with compounding. In this case, the compounded sequence had a space before it and an inconditional after it.

But this rule doesn't let us compound on multiwords that are in the dictionary. For example, even though formel 1<n> is in dix, "formel 1-løp" comes out as four elements, ^formel<n>$ ^1<det>$^-<guio>$^løp<n>$, instead of ^formel 1<n><cmp>+løp<n>$: the whole string cannot be one unknown sequence ready for compounding, since it contains a space.

Would it be possible to try compounding when we are reading a prefix that contains at least one space and ends in compound-only-L, and the following character is alphabetic? (This will happen when we've analysed formel 1-, the FST has an analysis that ends in compound-only-L, and the next character is l.) Should we only do it when the last character of the prefix is a dash, or can we do it in general?

Other examples where it'd be nice: "La Liga-målet" (La Liga<np> + mål<n>), "a cappella-konsertene"


A complication is that we still want the longest possible analysis if there is one in the dictionary, so if formel 1-løpet as a whole is in dix, we don't want to do compounding. That means we can't decide to compound as soon as we've seen formel 1-/<compound-only-L>. We first have to read on through the whole of formel 1-løpet, up to where regular analysis would give up, then try compounding, and only if that doesn't work fall back to the regular analysis (which in this example gives four elements).
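To make the ordering concrete, here is a rough Python sketch of the decision order described above. It is only a toy model: plain dicts stand in for the FST, the function and variable names are made up for illustration, the uninflected "formel 1-løp" is used instead of "formel 1-løpet" to avoid modelling inflection, and the regular analysis is a stub that just returns the four elements quoted earlier.

```python
from typing import Callable, Dict, List

# Hypothetical toy data, not the real dix or FST.
LEFTS: Dict[str, str] = {"formel 1-": "formel 1<n><cmp>"}  # entries ending in compound-only-L
RIGHTS: Dict[str, str] = {"løp": "løp<n>"}                 # possible compound heads


def regular_stub(seq: str) -> List[str]:
    # Stand-in for regular analysis: the four elements quoted in the issue.
    return ["formel<n>", "1<det>", "-<guio>", "løp<n>"]


def analyse(seq: str,
            full: Dict[str, str],
            lefts: Dict[str, str],
            rights: Dict[str, str],
            regular: Callable[[str], List[str]]) -> List[str]:
    # 1. Longest match first: if the whole sequence is an entry, never compound.
    if seq in full:
        return [full[seq]]
    # 2. Otherwise try a multiword left part (contains a space) that is
    #    followed by an alphabetic character and a known right part.
    for left in sorted(lefts, key=len, reverse=True):
        if " " in left and seq.startswith(left):
            rest = seq[len(left):]
            if rest[:1].isalpha() and rest in rights:
                return [lefts[left] + "+" + rights[rest]]
    # 3. Compounding failed: fall back to the regular analysis.
    return regular(seq)


print(analyse("formel 1-løp", {}, LEFTS, RIGHTS, regular_stub))
print(analyse("formel 1-løp", {"formel 1-løp": "formel 1-løp<n>"},
              LEFTS, RIGHTS, regular_stub))
```

With an empty set of full entries the first call takes the compound branch; once the whole string is itself an entry, the second call returns that entry and the compound branch is never reached.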

unhammer changed the title from "compounding on multiwords ending in dash" to "compounding on multiwords" Apr 21, 2022
unhammer added a commit that referenced this issue Apr 21, 2022
unhammer added a commit that referenced this issue Apr 22, 2022
unhammer added a commit that referenced this issue Apr 22, 2022
So if "kake" and "formel 1-<compound-only-L>" are in dix, we can
analyse "formel 1-kake" as a compound. One left-part has to have all
the spaces (so "kakeformel 1" isn't supported, nor is
"formel 1-formel 1").

Only takes effect when run with -e option.

This closes #138
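The restriction in the commit message (only the left part may contain spaces) can be illustrated with a small Python sketch. This is a toy check under assumed data, not the lttoolbox implementation: `split_compound` and the two sets are hypothetical names, and "formel 1" is put in the set of right parts only to show that the space check rejects it.

```python
from typing import Optional, Set, Tuple

# Hypothetical toy data standing in for dix entries.
LEFT_PARTS: Set[str] = {"formel 1-", "kake"}
RIGHT_PARTS: Set[str] = {"kake", "formel 1"}


def split_compound(surface: str,
                   lefts: Set[str],
                   rights: Set[str]) -> Optional[Tuple[str, str]]:
    """Return (left, right) if surface is one left part plus a space-free right part."""
    for left in sorted(lefts, key=len, reverse=True):
        if surface.startswith(left):
            rest = surface[len(left):]
            # All spaces must sit in the left part, so the rest may have none.
            if rest in rights and " " not in rest:
                return (left, rest)
    return None


for word in ["formel 1-kake", "kakeformel 1", "formel 1-formel 1"]:
    print(word, "->", split_compound(word, LEFT_PARTS, RIGHT_PARTS))
```

Only "formel 1-kake" yields a split here; the other two fail because the remainder after the left part still contains a space.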
unhammer added a commit that referenced this issue Apr 22, 2022