This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

truecasing #12

Closed
MaksymDel opened this issue Feb 21, 2019 · 5 comments

Comments


MaksymDel commented Feb 21, 2019

Hi,

Did you apply truecasing or lowercasing in your MT experiments? I can't find any sign of it in the code.

Is there any specific reason to do / not do it?

Thanks

Contributor

glample commented Feb 21, 2019

Hi,

No, we never used truecasing or lowercasing. It was quite popular with PBSMT models, but for NMT the best approach is to use BPE, which is typically applied to sentences with their regular casing.

Contributor

glample commented Feb 21, 2019

We did use lowercasing + BPE for XNLI, though. There the task is sentence classification, so casing is not very useful. But in MT, where you need to generate text, it's better to generate the correct casing directly, and BPE handles this very well.

Author

MaksymDel commented Feb 21, 2019

@glample Moses truecasing only modifies the case of the 1st word in a sentence (it does not modify things like Named Entities, though). This is to reduce the sparsity of the vocabulary (why have both "Starting" and "starting" in the vocab?).

With BPE you have the same issue with the 1st wordpiece of the 1st word (e.g. start ing vs Start ing).
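
To make the idea concrete, here is a minimal sketch of frequency-based truecasing in Python. It is not the Moses implementation: the function names `train_casing_model` and `truecase` and the toy corpus are made up for illustration, and it assumes whitespace-tokenized input. The casing model is learned only from tokens that are not sentence-initial, and only the first token of each sentence is rewritten.

```python
from collections import Counter, defaultdict


def train_casing_model(sentences):
    """Count surface forms of each word, skipping sentence-initial tokens."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        # The first token is capitalized by convention, so it carries no
        # evidence about the word's usual casing.
        for token in tokens[1:]:
            counts[token.lower()][token] += 1
    # Keep the most frequent surface form for every lowercased word.
    return {word: forms.most_common(1)[0][0] for word, forms in counts.items()}


def truecase(sentence, model):
    """Rewrite only the first token; words unseen in the model stay as-is."""
    tokens = sentence.split()
    if not tokens:
        return sentence
    tokens[0] = model.get(tokens[0].lower(), tokens[0])
    return " ".join(tokens)


# Toy corpus (hypothetical data, already tokenized).
corpus = [
    "We are starting the experiment .",
    "They keep starting late .",
    "Yesterday Paris was sunny .",
    "Starting early helps .",
    "Paris is large .",
]
model = train_casing_model(corpus)
truecased = [truecase(s, model) for s in corpus]
```

In this toy corpus, "Starting early helps ." becomes "starting early helps ." because "starting" is mostly seen lowercased mid-sentence, while "Paris is large ." is left alone because "Paris" is seen capitalized mid-sentence. The vocabulary then holds one entry for "starting" instead of two.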

Contributor

glample commented Feb 21, 2019

Yes, you could use truecasing in combination with BPE, but it probably wouldn't make a big difference. In practice it's also nice to limit the number of preprocessing steps, which I guess is why people don't use truecasing anymore. But it certainly wouldn't hurt to use it.

Author

@glample I hadn't realized that people don't use truecasing anymore, so I'll look into it more. Thanks for pointing that out!
