compounding on multiwords #138
unhammer changed the title from "compounding on multiwords ending in dash" to "compounding on multiwords" on Apr 21, 2022
unhammer added a commit that referenced this issue on Apr 22, 2022:
So if "kake" and "formel 1-<compound-only-L>" are in dix, we can analyse "formel 1-kake" as a compound. One left-part has to have all the spaces (so "kakeformel 1" isn't supported, nor is "formel 1-formel 1"). Only takes effect when run with -e option. This closes #138
Currently, we only compound when a sequence is marked as unknown. A sequence is delimited by inconditionals or spaces, so if `på togløpet.` were analysed as

`^på<pr>$ ^*togløpet$^.<sent>$`

without compounding, we may get

`^på<pr>$ ^tog<n><cmp>+løp<n>$^.<sent>$`

with compounding. In this case, the compounded sequence had a space before it and an inconditional after it.

But this rule doesn't let us compound on multiwords that are in the dictionary. For example, even though `formel 1<n>` is in dix, we get

`^formel<n>$ ^1<det>$^-<guio>$^løp<n>$`

as four elements instead of

`^formel 1<n><cmp>+løp<n>$`

because the whole string cannot be one unknown sequence ready for compounding, since it contains a space.

Would it be possible to try compounding when we are reading a prefix with at least one space that ends in compound-only-L, and the following character is alphabetic? (This will happen when we've analysed `formel 1-`, the FST has an analysis that ends in compound-only-L, and the next character is `l`.) Should we only do it when the last character is a dash, or can we do it in general?

Other examples where it'd be nice: "La Liga-målet" (`La Liga<np>` + `mål<n>`), "a cappella-konsertene".

A complication is that we still want the longest possible analysis if there is one in the dictionary, so if `formel 1-løpet` as a whole is in dix, we don't want to do compounding. So we can't make the decision to do compounding until we've first seen `formel 1-/<compound-only-L>` and then read the whole `formel 1-løpet` up until where regular analysis would give up. At that point we try compounding, and if that doesn't work, we fall back to the regular analysis (which in this example gives four elements).
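The decision order described above can be sketched in a few lines of Python. This is only a toy illustration, not how lttoolbox is implemented: the dictionaries `LEXICON` and `COMPOUND_ONLY_L`, their entries, and the function names are all hypothetical stand-ins for the FST, and the rule that only the first left part may contain spaces is taken from the commit message.

```python
# Hypothetical toy data standing in for the compiled dix/FST.
LEXICON = {
    "formel 1": "formel 1<n>",
    "løp": "løp<n>",
    "kake": "kake<n>",
}
# Entries whose analysis ends in compound-only-L (shown here as <cmp>).
COMPOUND_ONLY_L = {
    "formel 1-": "formel 1<n><cmp>",
    "tog": "tog<n><cmp>",
}


def read_word(text, start):
    """Return the index where the alphabetic run starting at `start` ends."""
    end = start
    while end < len(text) and text[end].isalpha():
        end += 1
    return end


def split_compound(word):
    """Split a space-free word into left parts plus a final lexicon word."""
    if word in LEXICON:
        return [LEXICON[word]]
    for i in range(1, len(word)):
        left = word[:i]
        if left in COMPOUND_ONLY_L:
            rest = split_compound(word[i:])
            if rest:
                return [COMPOUND_ONLY_L[left]] + rest
    return None


def analyse_multiword_compound(text):
    """Return (analysis, consumed_length), or None to signal a fallback
    to regular analysis."""
    for left, left_analysis in COMPOUND_ONLY_L.items():
        # Only multiword left parts (containing a space) are handled here.
        if " " not in left or not text.startswith(left):
            continue
        start = len(left)
        # The character after the prefix must be alphabetic.
        if start >= len(text) or not text[start].isalpha():
            continue
        end = read_word(text, start)
        whole = text[:end]
        # The longest dictionary match wins over compounding.
        if whole in LEXICON:
            return LEXICON[whole], end
        right = split_compound(text[start:end])
        if right is not None:
            return "+".join([left_analysis] + right), end
    return None


print(analyse_multiword_compound("formel 1-løp"))
# ('formel 1<n><cmp>+løp<n>', 12)
print(analyse_multiword_compound("formel 1-formel 1"))
# None
```

In this sketch, "formel 1-formel 1" and "kakeformel 1" both return `None`, matching the restriction that only one left part may carry the spaces; a real implementation would then continue with regular tokenisation.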