Make a postprocess to handle capitalisation #67

Closed
ftyers opened this issue Mar 9, 2020 · 8 comments
Labels
capitalisation enhancement New feature or request help wanted Extra attention is needed

Comments

@ftyers
Member

ftyers commented Mar 9, 2020

Capitalisation should not be done in transfer; it should be done in a postprocessing step, much like "recasing" in SMT.

@ftyers ftyers added enhancement New feature or request help wanted Extra attention is needed labels Mar 9, 2020
@hectoralos
Member

At what stage exactly, and on the basis of which information? I'm thinking about how to deal with the difference between French nouns like "allemand" (the language) and "Allemand" (a person). Currently, I do this in transfer.

@khannatanmai
Member

@ftyers we can use secondary tags to propagate the case as far as the post-generator and then apply it there if needed.

@ftyers
Member Author

ftyers commented Jul 3, 2020

This is related: #75

@ftyers
Member Author

ftyers commented Jul 3, 2020

@hectoralos I would do it in posttransfer using the LU and perhaps a 1-2 word context window.

@unhammer
Member

@ftyers so basically only using dictionary case and "is this a sentence end" context, and ignoring input case? We'd lose the ability to keep UPPER CASE and Titles with Titlecase, but maybe that's worth the code simplification …
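To make the trade-off concrete, here is a minimal sketch of a recaser that ignores input case entirely: it trusts the dictionary case of each token and only capitalises the first word after a sentence end. It is not Apertium code; the function name `recase`, the `SENTENCE_END` set, and the plain list-of-strings token format are all assumptions made for illustration.

```python
# Hypothetical sketch: dictionary case plus sentence-boundary context only.
# Tokens are assumed to arrive already in dictionary case
# (e.g. "allemand" vs "Allemand").
SENTENCE_END = {".", "!", "?"}  # assumed set of sentence-final tokens


def recase(tokens):
    """Capitalise sentence-initial words; leave everything else untouched."""
    out = []
    at_sentence_start = True
    for tok in tokens:
        if at_sentence_start and tok[:1].isalpha():
            tok = tok[0].upper() + tok[1:]
        out.append(tok)
        if tok in SENTENCE_END:
            at_sentence_start = True
        elif tok[:1].isalnum():
            # A real word or number was emitted, so we are mid-sentence now.
            at_sentence_start = False
    return out


example = ["il", "parle", "allemand", ".", "un", "Allemand", "arrive", "."]
print(" ".join(recase(example)))
```

Note that nothing in this sketch can reproduce an all-caps heading or a Titlecased title from the input, which is exactly the loss described above.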

@mr-martian
Contributor

lt-proc could record the original capitalization and put that in word-bound blanks, which could then be used to determine the output capitalization.

@unhammer
Member

@mr-martian lt-proc outputs the original word form anyway, so a separate step can do the job. I actually have a branch of nno-nob that adds the tags aa/Aa/AA to all words that way (capstag.rlx runs after morphological analysis/disambiguation); the tags are removed again in transfer. I'm considering switching to this system so we can get dictionary-based correction but keep input caps (for start of sentence, or where there are several upper-cased words in a row), but I have to make sure it doesn't lead to regressions first.
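As an illustration of the aa/Aa/AA idea, the sketch below classifies a surface form into one of the three tags and later reapplies the tag to a target-language word. This is not the capstag.rlx rule file; the function names `caps_tag` and `apply_caps` and the exact policy (a single upper-case letter counts as `Aa`, and `aa` leaves the dictionary case alone) are assumptions chosen for the example.

```python
# Hypothetical sketch of the aa/Aa/AA tagging scheme described above.

def caps_tag(surface):
    """Return "aa", "Aa" or "AA" depending on the casing of the surface form."""
    letters = [c for c in surface if c.isalpha()]
    if not letters or not letters[0].isupper():
        return "aa"
    if len(letters) > 1 and all(c.isupper() for c in letters):
        return "AA"
    return "Aa"  # includes single upper-case letters (assumed policy)


def apply_caps(tag, target):
    """Reapply a caps tag to a target word that is in dictionary case."""
    if tag == "AA":
        return target.upper()
    if tag == "Aa":
        return target[:1].upper() + target[1:]
    # "aa": keep dictionary case, so a proper noun like "Allemand" survives.
    return target


for src, tgt in [("HUS", "hus"), ("Hus", "hus"), ("tysk", "Allemand")]:
    print(src, caps_tag(src), apply_caps(caps_tag(src), tgt))
```

Keeping the tag separate from the lemma means the dictionary case can still win when the input was lower-case, while upper-cased input is preserved.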

@mr-martian
Contributor

Processor added in 7e7004d


5 participants