Handling \r & \n on CLI #120

Open
jelmervdl opened this issue Oct 3, 2022 · 2 comments

Comments

@jelmervdl
Collaborator

Lifted from @Godnoken #81 (comment)

I'm sorry if this is not on topic as you are talking about HTML primarily.

I noticed that I can't successfully translate text that contains \n through the command line. I "fixed" it by substituting \n with another symbol such as * before translation and swapping it back afterwards. It works sometimes but often does not, because my input mixes \n, \r and other line-break variants.
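One way to sidestep the mixed line endings described above is to normalize them first and keep the breaks out of the translator's input altogether, instead of relying on a placeholder symbol surviving translation. The sketch below is only an illustration of that idea, not something translateLocally provides: `translate` is a hypothetical callable (string in, string out) standing in for whatever actually calls the translator, and both function names are made up for this example.

```python
import re

# \r\n is listed first so a Windows line ending is consumed as one break,
# not as a \r followed by a separate \n.
_LINE_BREAK = re.compile(r"\r\n|\r|\n")


def normalize_newlines(text: str) -> str:
    """Turn \r\n, lone \r and \n into a single canonical \n."""
    return _LINE_BREAK.sub("\n", text)


def translate_preserving_breaks(text: str, translate) -> str:
    """Translate each line on its own and rejoin with \n.

    `translate` is a hypothetical str -> str callable (an assumption of
    this sketch). Blank or whitespace-only lines are passed through
    untouched so empty lines between paragraphs survive.
    """
    lines = normalize_newlines(text).split("\n")
    return "\n".join(
        translate(line) if line.strip() else line
        for line in lines
    )
```

The trade-off is that a sentence wrapped across two lines is translated as two separate fragments, so this only suits input where each line is a self-contained unit.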

Is there, or could there be, a way to ignore all sorts of carriage returns?

Edit

Hmm, apologies. It seems like it does indeed translate with no problems and automatically removes all carriage returns on the command line. The issue I had must have had to do with Tauri.

I'll revise my question: would it be possible to implement an option that only ignores \n and \r instead of removing them?

Can we add a command line option that makes translateLocally CLI not treat a new line as a paragraph boundary?

I'm aware the sentence splitter has a mode where it just ignores newlines, and will treat them as tokens in the sentence. I'm not sure how these survive the translation. I'm assuming they'd need to be part of the vocab? None of our models are trained on sentences that contain newlines.

@kpu
Collaborator

kpu commented Oct 3, 2022

Pretty sure the sentence splitter, when configured to unwrap lines, just replaces each newline with a space. There is no formatting preservation there. I guess we could do something alignment-based, but the main use case for this is wrapped text with what are effectively soft returns. And soft returns are simply reinserted at the column width, not at a semantic position.
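To make the soft-return point concrete, here is a rough sketch of the unwrap-then-rewrap flow: line breaks inside a paragraph become spaces before translation, and new breaks are placed purely by column width afterwards. It is not the sentence splitter's actual code. `translate` is a hypothetical str -> str callable, the function names and the `width` default are assumptions, and treating a blank line as the paragraph boundary is a choice made for this example.

```python
import re
import textwrap


def unwrap(paragraph: str) -> str:
    """Replace line breaks with spaces (this also collapses other
    whitespace runs, which is a simplification of this sketch)."""
    return " ".join(paragraph.split())


def translate_wrapped(text: str, translate, width: int = 72) -> str:
    """Unwrap each paragraph, translate it, then re-wrap at `width`.

    `translate` is a hypothetical str -> str callable. The original
    break positions are discarded; new ones depend only on `width`.
    """
    # Normalize line endings first so \r\n is never mistaken for two breaks.
    normalized = re.sub(r"\r\n|\r", "\n", text)
    # Assumption: one or more blank lines separate paragraphs.
    paragraphs = re.split(r"\n\s*\n", normalized)
    wrapped = [
        textwrap.fill(translate(unwrap(p)), width=width)
        for p in paragraphs
    ]
    return "\n\n".join(wrapped)
```

Because translated text rarely has the same length as the source, the re-inserted breaks will generally fall between different words than in the input, which is why this counts as re-wrapping and not as formatting preservation.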

@Godnoken

Thank you for bringing this up.

In case anyone wonders, I put a band-aid on it for now by setting the container to the width of the longest line and using automatic word-breaking. Replacing \n or \r didn't work reliably, because some of the translations broke when the line break was replaced with a symbol.


3 participants