Improving CJK Characters Support #186

Open
Cerallin opened this issue Mar 6, 2023 · 13 comments
Labels
addition Opportunity for more content

Comments

@Cerallin
Contributor

Cerallin commented Mar 6, 2023

I am working on translating the Nintendo DS article into Chinese.

There are 2 minor issues about CJK characters (Chinese, Japanese, and Korean) that I want to ask about.

Space after sentences

The commas and periods in Chinese are as wide as two Latin characters, the same as all the other Chinese characters. Therefore, we do not add spaces after periods and commas. Avoiding unnecessary spaces is easy when writing with markdown: do not append a new line or add a space after each sentence.

I would like to know whether these spaces after sentences can be removed in the Chinese translated Markdown file. I'm not sure whether this breaks the Crowdin translation workflow.
(It does not matter that much, so don't worry if not possible.)

Spaces between CJK characters and Latin characters

It is suggested to add "padding" between CJK characters and Latin characters. The simplest and best way to do this is to import a js file. I would like to know if it is convenient to import it into your project. If not, as a compromise, I will manually add spaces when translating.

@flipacholas
Owner

Hi, thanks for the translations. Just in case, there's been a previous translation that could possibly be used as a reference.

Regarding spaces after sentences, in general, I encourage translators to apply their best judgement. In this case, Crowdin separates translations by sentences, so I think it's automatically adding the spaces after periods and/or commas. I don't think it will allow you to fix that issue by yourself, so I'll try to see if there's an option somewhere that fixes it.

Regarding spaces between CJK characters and Latin characters, I've noticed the previous NES translation uses characters that include padding (i.e. ( and )). Would that solve the issue?

@Cerallin
Contributor Author

Cerallin commented Mar 7, 2023

Thanks for your reply. I've seen the previous translation. I am not going to change anything on that page because I'm not familiar with NES at all. But I will try to be consistent with its conventions.

Regarding spaces between CJK characters and Latin characters, I've noticed the previous NES translation uses characters that include padding (i.e. ( and )). Would that solve the issue?

Unfortunately, that's another issue. Choosing between the different brackets (full-width () or half-width ()) can get complicated in mixed Chinese-English text, but I will take care of it.

Adding spaces is what the NES translation actually did. For example, "This is an article about GBA" is translated into "这是关于 GBA 的文章", with spaces surrounding "GBA". That's a compromise because a space character is a little too wide. What I usually do is add custom HTML tags and set their width to 0.8em.

The best solution I've thought of is to import js files to Chinese-translated markdown files only, and there are alternatives too: a script that processes text (if it's possible), or a Crowdin app/plugin (I don't know much about it yet).

@flipacholas
Owner

Hmm, now that I think of it, when the website is generated, the Markdown is converted into HTML, and in between I can add regex calls.
So, as a more reliable alternative, I can try to come up with a regular expression that detects Chinese characters next to Latin characters and adds a space (or an HTML tag) there. This wouldn't conflict with Crowdin or Pandoc, and would even work without JS.

So, while I experiment with this in the build scripts, could you try to make the translation without using the extra spaces? Hopefully this will work and I'll be able to port it to other CJK scripts. Thanks!

@Cerallin
Contributor Author

Cerallin commented Mar 8, 2023

I can try to come up with a regular expression that detects Chinese characters next to Latin characters and adds a space (or an HTML tag) there. This wouldn't conflict with Crowdin or Pandoc, and would even work without JS.

I prefer workarounds that don't rely on JS running in the browser too. I realized that your articles are published not only on the website but also as EPUB using pandoc. In fact, there's no need to insert spaces in the EPUB, because most e-book readers take care of the padding. On the other hand, spaces after sentences still need to be dropped.

Trimming off the spaces after sentences seems simple to me: just remove all the spaces (and line feeds) after the , and 。 characters, which stand for the comma and period, respectively.

Now please let me introduce the regex rules for adding HTML tags with JS. The code below is part of my hexo plugin. Please feel free to use or modify it, and I hope I can explain it clearly.

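// Pattern rules taken from text-autospace.js
// Chinese characters (a set of CJK Unicode blocks, not all of CJK)
const hanzi = '[\u2E80-\u2FFF\u31C0-\u31EF\u3300-\u4DBF\u4E00-\u9FFF\uF900-\uFAFF\uFE30-\uFE4F]',
  punc = {
    // Unpaired punctuation (note: inside the class, \*-\+ is a range covering * and +)
    base: "[@&=_\\$%\\^\\*-\\+/]",
    // Opening brackets and quotes
    open: "[\\(\\[\\{<‘“]",
    // Closing brackets and quotes, plus sentence punctuation
    close: "[,\\.\\?!:\\)\\]\\}>’”]"
  },
  // Latin letters (including accented ranges), digits, and unpaired punctuation
  latin = '[A-Za-z0-9\u00C0-\u00FF\u0100-\u017F\u0180-\u024F\u1E00-\u1EFF]' + '|' + punc.base,
  patterns = [
    // A Chinese character followed by Latin text or an opening bracket/quote
    RegExp('(' + hanzi + ')(' + latin + '|' + punc.open + ')', 'gi'),
    // Latin text or a closing bracket/quote followed by a Chinese character
    RegExp('(' + latin + '|' + punc.close + ')(' + hanzi + ')', 'gi')
  ];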

Here are the explanations of each variable:

  1. hanzi: matches Chinese characters (but not all CJK characters).
  2. punc: punctuation characters.
    • base: punctuation characters that do not appear in pairs, e.g. @, &, _, etc.
    • open: opening quotes and brackets like (, [, “.
    • close: closing quotes and brackets like ), ], ”.
  3. latin: matches Latin characters and basic punctuation characters.
  4. patterns: determine where to insert a space.
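As a rough illustration of how the two patterns could be applied, here is a sketch that inserts an empty `<hl>` element at each boundary. The pattern definitions are copied from the snippet above; the helper name `addSpacing` and the replacement string `'$1<hl></hl>$2'` are assumptions for this example, since the thread does not show how the plugin performs the replacement.

```javascript
// Pattern definitions copied from the snippet above.
const hanzi = '[\u2E80-\u2FFF\u31C0-\u31EF\u3300-\u4DBF\u4E00-\u9FFF\uF900-\uFAFF\uFE30-\uFE4F]';
const punc = {
  base: "[@&=_\\$%\\^\\*-\\+/]",
  open: "[\\(\\[\\{<‘“]",
  close: "[,\\.\\?!:\\)\\]\\}>’”]"
};
const latin = '[A-Za-z0-9\u00C0-\u00FF\u0100-\u017F\u0180-\u024F\u1E00-\u1EFF]' + '|' + punc.base;
const patterns = [
  RegExp('(' + hanzi + ')(' + latin + '|' + punc.open + ')', 'gi'),
  RegExp('(' + latin + '|' + punc.close + ')(' + hanzi + ')', 'gi')
];

// Hypothetical helper: run both patterns in turn and place an empty <hl>
// element between the two captured characters of every match.
function addSpacing(text) {
  return patterns.reduce(
    (result, pattern) => result.replace(pattern, '$1<hl></hl>$2'),
    text
  );
}

console.log(addSpacing('这是关于GBA的文章'));
// → 这是关于<hl></hl>GBA<hl></hl>的文章
```

Because each match consumes both neighbouring characters, the two directions (Chinese before Latin, Latin before Chinese) are handled in separate passes rather than in one combined regex.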

Assume that tags named <hl> are added between Chinese characters and Latin characters, here's the corresponding stylesheet:

html hl:after {
    content: ' ';
    display: inline;
    font-family: inherit;
    font-size: 0.8em;
}

html code hl,
html pre hl,
html kbd hl,
html samp hl,
html ruby hl,
html .tag-list-item hl {
    display: none;
}
html ol > hl,
html ul > hl {
    display: none;
}

Don't worry if a custom tag ends up in the wrong place: we can still decide whether to show it or not with CSS.

So, while I experiment with this in the build scripts, could you try to make the translation without using the extra spaces? Hopefully this will work and I'll be able to port it to other CJK scripts. Thanks!

I'm glad to do so and see if this helps other CJK translations, though there are Chinese translations only at present :-).

@flipacholas
Owner

That's a great breakdown of the script and it will help me to port the regular expressions. Let me know when you get the Chinese translation ready and I'll test the regex. Many thanks!

@Cerallin
Contributor Author

I've just finished translating the Nintendo DS article (no extra spaces). Please handle it at a time that you deem appropriate.

@flipacholas
Owner

flipacholas commented Mar 11, 2023

Great, I've deployed it here for testing (it doesn't have the <hl> spaces, for now): https://www.copetti.org/zh-hans/writings/consoles/nintendo-ds/

I'm checking the effects of the regex on the Markdown article, and there seem to be the following bugs:

  1. Some sentences have extra spaces. Just did a quick review on Crowdin and deleted some of them, so it's just a matter of correcting the translation.

  2. There are some false positives (I think?) with the regexes. For instance, the following text at the start of the article:

和任天堂的[上一代掌机](game-boy-advance)一样,NDS的系统围绕一个名为**CPU NTR**的大芯片展开。

is replaced like this:

和任天堂的<hl>[</hl>上一代掌机](game-boy-advance)</hl>一样,NDS</hl>的系统围绕一个名为<hl>**CPU NTR**</hl>的大芯片展开。

The regex is applied on Markdown, so I think that's creating some confusion on the rules (I'm assuming it was originally made for HTML?). I guess I just need to tweak the regex.

But overall, this is very good progress and I really appreciate there's a new article available in Chinese. I'll try to find the causes of the regex problems meanwhile. Thanks.

@Cerallin
Contributor Author

Great, I've deployed it here for testing (it doesn't have the <hl> spaces, for now): https://www.copetti.org/zh-hans/writings/consoles/nintendo-ds/

Good news! Good to see my translation deployed. I may translate the GBA article later.

  1. Some sentences have extra spaces. Just did a quick review on Crowdin and deleted some of them, so it's just a matter of correcting the translation.

Okay, I will go through and check the spaces on Crowdin.

  2. There are some false positives (I think?) with the regexes. For instance, the following text at the start of the article:

和任天堂的[上一代掌机](game-boy-advance)一样,NDS的系统围绕一个名为**CPU NTR**的大芯片展开。

is replaced like this:

和任天堂的<hl>[</hl>上一代掌机](game-boy-advance)</hl>一样,NDS</hl>的系统围绕一个名为<hl>**CPU NTR**</hl>的大芯片展开。

The regex is applied on Markdown, so I think that's creating some confusion on the rules (I'm assuming it was originally made for HTML?). I guess I just need to tweak the regex.

You are right, it was originally made for HTML. If I were you, I would give up on writing rules for Markdown: Markdown is so flexible that the regex would become too complex and hard to read. I suggest applying the replacements to the HTML files instead.

Please tell me if you need it and I will modify my hexo plugin to handle HTML files as an executable. NodeJS executables run slowly, but a few seconds per file sounds tolerable to me.
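To make the HTML-based suggestion concrete, here is a minimal sketch that applies the replacement only to text between tags and skips the contents of `code`, `pre`, `kbd` and `samp`. It is not the hexo plugin: the function name `padHtml`, the simplified `latin` class and the regex-based tag splitting are all assumptions made for illustration, and a real build step would more likely use a proper HTML parser.

```javascript
// Same hanzi ranges as in the earlier snippet; `latin` is simplified here.
const hanzi = '[\u2E80-\u2FFF\u31C0-\u31EF\u3300-\u4DBF\u4E00-\u9FFF\uF900-\uFAFF\uFE30-\uFE4F]';
const latin = '[A-Za-z0-9]';
const patterns = [
  new RegExp('(' + hanzi + ')(' + latin + ')', 'g'),
  new RegExp('(' + latin + ')(' + hanzi + ')', 'g')
];

// Elements whose text should never receive padding (assumed list).
const skipTags = new Set(['code', 'pre', 'kbd', 'samp']);

// Hypothetical helper: pad text nodes only, leaving tags untouched.
function padHtml(html) {
  let skipDepth = 0;
  return html
    .split(/(<[^>]*>)/) // odd indices hold the tags, even indices the text
    .map((part, index) => {
      if (index % 2 === 1) {
        const tag = /^<(\/?)([A-Za-z][A-Za-z0-9]*)/.exec(part);
        if (tag && skipTags.has(tag[2].toLowerCase())) {
          // Track whether we are inside a skipped element.
          skipDepth = Math.max(0, skipDepth + (tag[1] ? -1 : 1));
        }
        return part;
      }
      if (skipDepth > 0) return part;
      return patterns.reduce(
        (text, pattern) => text.replace(pattern, '$1<hl></hl>$2'),
        part
      );
    })
    .join('');
}

console.log(padHtml('<p>NDS的系统<code>CPU的</code></p>'));
// → <p>NDS<hl></hl>的系统<code>CPU的</code></p>
```

One limitation of this sketch: boundaries that straddle a tag (for example a Chinese character immediately followed by `<strong>CPU</strong>`) are not padded, because each text segment is processed on its own.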

But overall, this is very good progress and I really appreciate there's a new article available in Chinese. I'll try to find the causes of the regex problems meanwhile. Thanks.

You are welcome. Please let me know if there's anything I can do to help.

P.S. I found some more characters with spaces after them to be trimmed off. The whole list is: ,。!?:

@flipacholas
Owner

Sounds good! By the way, don't forget to sign your name or username here so I can credit you for the translation.

@Cerallin
Contributor Author

P.S. I found some more characters with spaces after them to be trimmed off. The whole list is: ,。!?:

Oops, ;… are also in the list.

And, I've finished my spaces checking on Crowdin. ✌️

@flipacholas
Owner

flipacholas commented Mar 12, 2023

Thanks! For my part, I've been trying to learn more about how to improve the styling and layout for Chinese-speaking audiences (using the Simplified Chinese script, in this case). I've recently changed the following (only visible in the Chinese articles):

  • Reduced the margins between paragraphs.
  • Set the fonts to "Helvetica Neue", Helvetica, "Microsoft Yahei", "Hiragino Sans GB", "WenQuanYi Micro Hei", "微软雅黑", "华文细黑", STHeiti, sans-serif;
  • Added indentation at the start of each paragraph.
  • Justified text.

From your perspective, do you think they improve the reading experience for Chinese readers?

@Cerallin
Contributor Author

Wow! They do help a lot! The font families cover the default fonts of most devices. It looks pretty good with text justified and indented.

@flipacholas
Owner

Glad it helped! I think it will take me some time to get the regex rules to properly parse Latin text. However, I'm glad that I can improve the reading experience through CSS as well.

@flipacholas flipacholas changed the title CJK Characters Support Improving CJK Characters Support Apr 5, 2023
@flipacholas flipacholas added the addition Opportunity for more content label Apr 5, 2023