Carefulcase eats words it can't generate #35

Open
unhammer opened this issue Oct 25, 2018 · 4 comments

unhammer commented Oct 25, 2018

If the dictionary has

<?xml version="1.0" encoding="UTF-8"?>
<dictionary>
 <alphabet/>
 <sdefs>
   <sdef n="n"/>
   <sdef n="m"/>
   <sdef n="pl"/>
   <sdef n="def"/>
 </sdefs>
 <section id="main" type="standard">

<e><p><l>kakene</l><r>kake<s n="n"/><s n="m"/><s n="pl"/><s n="def"/></r></p></e>

<e><p><l>pc-ane</l><r>pc<s n="n"/><s n="m"/><s n="pl"/><s n="def"/></r></p></e>
<e><p><l>PC-ane</l><r>PC<s n="n"/><s n="m"/><s n="pl"/><s n="def"/></r></p></e>

 </section>
</dictionary>

then we get

$ echo '^kake<n><m><pl><def>$ ^KAKE<n><m><pl><def>$ ^kake<n><m><pl><def>$'|lt-proc -C nob.autogen.bin 
kakene  kakene

I would like it to fall back to "normal" generation for words whose exact case it can't find, i.e.

$ echo '^kake<n><m><pl><def>$ ^KAKE<n><m><pl><def>$ ^kake<n><m><pl><def>$'|lt-proc -C nob.autogen.bin 
kakene KAKENE kakene

while still retaining the -C functionality for words it can find exact matches for:

$ echo '^PC<n><m><pl><def>$ ^pc<n><m><pl><def>$' | lt-proc -C nob.autogen.bin
PC-ane pc-ane
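
The requested behaviour can be sketched as a two-step lookup. This is a toy model, not lttoolbox code: the generation dictionary is a plain Python dict mirroring the dix above, and the names `GEN`, `recase` and `generate_careful` are hypothetical. The recasing rule (all caps, or first letter capitalised) is a simplifying assumption about what "normal" generation does.

```python
# Toy stand-in for the compiled generation dictionary (assumption, not lt-proc internals).
GEN = {
    "kake<n><m><pl><def>": "kakene",
    "pc<n><m><pl><def>": "pc-ane",
    "PC<n><m><pl><def>": "PC-ane",
}

def recase(surface, lemma):
    """Copy the case pattern of the input lemma onto the generated form."""
    if len(lemma) > 1 and lemma.isupper():
        return surface.upper()
    if lemma[:1].isupper():
        return surface[:1].upper() + surface[1:]
    return surface

def generate_careful(lu, gen=GEN):
    """Prefer an exact-case entry; otherwise fall back to lowercase lookup plus recasing."""
    if lu in gen:                      # step 1: the exact-case match wins
        return gen[lu]
    cut = lu.find("<")
    lemma, tags = (lu, "") if cut == -1 else (lu[:cut], lu[cut:])
    key = lemma.lower() + tags
    if key in gen:                     # step 2: "normal" case-insensitive generation
        return recase(gen[key], lemma)
    return "#" + lemma                 # unknown word, marked as in the outputs in this thread

units = ["kake<n><m><pl><def>", "KAKE<n><m><pl><def>",
         "PC<n><m><pl><def>", "pc<n><m><pl><def>"]
print(" ".join(generate_careful(u) for u in units))
# prints: kakene KAKENE PC-ane pc-ane
```

The point of the ordering is that the fallback only runs when the exact-case lookup finds nothing, so entries like PC-ane and pc-ane stay distinct.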

jimregan commented Oct 28, 2018 via email

@unhammer

ouch :((

unhammer added a commit that referenced this issue Nov 17, 2018
…nowns

fix #34 - Carefulcase option -C not compatible with -g

but #35 - Carefulcase eats words it can't generate
still doesn't work if we get started on an ambiguous path
@unhammer

I added some tests in fd6e6dc. It turns out to be problematic if we start generating ^KAKE<n><f><pl><def>$ and see a possible path that starts with ^K but only leads to other analyses (e.g. ^KK<np>$). We then end up with #KAKE where we should have tried a lowercased analysis.

But if there are no such garden paths, ^KAKE<n><f><pl><def>$ does give an analysis; see the difference between the two test dictionaries added in fd6e6dc#diff-839e968af7bf80a08ea4d97247cbe7fdR1
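
The garden-path effect can be reproduced in a toy model. The sketch below is not the lttoolbox traversal; it assumes a dictionary stored as plain strings and compares two hypothetical strategies: `greedy`, which commits to the exact-case character whenever some entry continues with it, and `keep_alternatives`, which keeps both the exact-case and the lowercased path alive and prefers the one with the fewest lowercased characters at the end. Recasing of the result is left out.

```python
def prefixes(entries):
    """All prefixes of all dictionary keys (a flat stand-in for transducer states)."""
    return {key[:i] for key in entries for i in range(len(key) + 1)}

def greedy(lu, entries):
    """Commit to the exact-case character if any path continues with it, else try lowercase."""
    known = prefixes(entries)
    path = ""
    for ch in lu:
        for cand in (ch, ch.lower()):
            if path + cand in known:
                path += cand
                break
        else:
            return None            # dead end: no way forward from the committed path
    return entries.get(path)

def keep_alternatives(lu, entries):
    """Track every live path together with how many characters were lowercased."""
    known = prefixes(entries)
    alive = {"": 0}
    for ch in lu:
        nxt = {}
        for path, cost in alive.items():
            for cand in (ch, ch.lower()):
                step = path + cand
                if step in known:
                    new_cost = cost + (cand != ch)
                    if step not in nxt or new_cost < nxt[step]:
                        nxt[step] = new_cost
        alive = nxt
    finals = [(cost, path) for path, cost in alive.items() if path in entries]
    return entries[min(finals)[1]] if finals else None

with_kk = {"kake<n><f><pl><def>": "kakene", "KK<np>": "KK"}
without_kk = {"kake<n><f><pl><def>": "kakene"}
lu = "KAKE<n><f><pl><def>"

print(greedy(lu, without_kk))          # kakene
print(greedy(lu, with_kk))             # None: the K of KK<np> lured it down a dead end
print(keep_alternatives(lu, with_kk))  # kakene
```

In this model the KK entry alone is enough to make the greedy strategy fail, which matches the difference between the two test dictionaries described above.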

unhammer commented Apr 20, 2023

@mr-martian Do you think this is solvable? I'd love to have a solution for this (but in bilingual mode, lt-proc -b), so that I can, e.g., have a dix with

<e>       <re>[a-zA-Z]+</re><p><l></l><r><s n="np"/></r></p></e>
<e>       <i>med</i>        <p><l></l><r><s n="pr"/></r></p></e>

and get

$ echo '^Med<pr>$ ^AbCd<np>$' |lt-proc -C -b nob-nno.autogen.bin
^Med<pr>/Med$ ^AbCd<np>/AbCd$

Currently, we can get one or the other, but not both:

$ echo '^Med<pr>$ ^AbCd<np>$' |lt-proc  -C tmp.bin # eats Med
 AbCd

$ echo '^Med<pr>$ ^AbCd<np>$' |lt-proc  -b tmp.bin # includes extra "Abcd"
^Med<pr>/Med$ ^AbCd<np>/AbCd/Abcd$

$ echo '^Med<pr>$ ^AbCd<np>$' |lt-proc  -c -g tmp.bin # fails to generate Med since lemma is lowercase
#Med AbCd

Possibly related to #167
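
As a rough illustration of the combined -C -b behaviour asked for above, here is a toy lookup that mimics the two dix entries (the `[a-zA-Z]+` regex for `<np>` and the lowercase `med` for `<pr>`). Everything in it is an assumption for illustration: `lookup`, `recase` and `translate_careful` are hypothetical names, and this is not how lt-proc is implemented.

```python
import re

def lookup(lemma, tags):
    """Exact-case lookup in a toy version of the two dix entries."""
    hits = []
    if tags == "<np>" and re.fullmatch(r"[a-zA-Z]+", lemma):
        hits.append(lemma)          # regex entry: passes the lemma through unchanged
    if tags == "<pr>" and lemma == "med":
        hits.append("med")          # literal entry, lowercase only
    return hits

def recase(surface, lemma):
    """Copy the case pattern of the input lemma onto the result (simplified)."""
    if len(lemma) > 1 and lemma.isupper():
        return surface.upper()
    if lemma[:1].isupper():
        return surface[:1].upper() + surface[1:]
    return surface

def translate_careful(lu):
    """Return exact-case hits only; fall back to lowercase plus recasing if there are none."""
    cut = lu.find("<")
    lemma, tags = (lu, "") if cut == -1 else (lu[:cut], lu[cut:])
    hits = lookup(lemma, tags)
    if not hits:
        hits = [recase(h, lemma) for h in lookup(lemma.lower(), tags)]
    return hits

for lu in ("Med<pr>", "AbCd<np>"):
    print("^" + lu + "/" + "/".join(translate_careful(lu)) + "$")
# prints:
# ^Med<pr>/Med$
# ^AbCd<np>/AbCd$
```

Because the fallback runs only when the exact-case lookup is empty, AbCd does not pick up the extra Abcd reading, while Med is still found through the lowercase entry.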

unhammer added a commit to apertium/apertium-nno-nob that referenced this issue Apr 22, 2023
unhammer added a commit to apertium/apertium-nno-nob that referenced this issue Apr 26, 2023
unhammer added a commit to apertium/apertium-nno-nob that referenced this issue Jun 3, 2023