Use specialized line break strategy for Chinese and Japanese #1206

Closed
mojavelinux opened this issue Aug 12, 2019 · 7 comments

@mojavelinux
Member

Line break rules for Latin-based languages such as English and French are also being applied to Chinese and Japanese. However, Chinese and Japanese don't use spaces (at least not in the same way). Latin-based languages have spaces between words where line breaks can occur, whereas Chinese and Japanese are written without spaces, so a line break can occur between any two characters. Chinese and Japanese also use different punctuation for pause, full stop, and dash. These differences need to be taken into account.

While this isn't much of a problem when the text is written exclusively in a CJK language (since the line break is forced once the line is full), it becomes a problem when the text is mixed with another language such as English. All of a sudden, huge gaps appear because each run of CJK characters gets treated as a single "word".

Here's an example:

AsciiDoc 是一个人类可读的文件格式,语义上等同于 DocBook 的 XML,但使用纯文本标记了约定。可以使用任何文本编辑器创建文件把 AsciiDoc 和阅读“原样”,或呈现为HTML 或由 DocBook 的工具链支持的任何其他格式,如 PDF,TeX 的,Unix 的手册页,电子书,幻灯片演示等。

When rendered with Asciidoctor PDF, huge gaps appear in the line. This can be partially mitigated by changing the text alignment from justify to left, but then the gaps are just shifted to the end of the line.

The correct fix is to allow a line break between any two CJK characters, as long as one of the characters is not punctuation.

To activate this specialized logic, the author must set the scripts attribute in the document header to cjk.

Related issue: #82.

@mojavelinux mojavelinux added this to the v1.5.0.beta.3 milestone Aug 12, 2019
@mojavelinux mojavelinux self-assigned this Aug 12, 2019
@mojavelinux
Member Author

mojavelinux commented Aug 12, 2019

At first I thought it would be possible to patch the method in Prawn that scans for line break opportunities. However, that logic is very difficult to override (not to mention understand). A simpler approach, which @chloerei proposed, is to modify the string being typeset by inserting zero-width spaces at line break opportunities. This tells Prawn where it can break the line. While that may not be the most elegant solution, it gives us something we can use today and leaves room for better solutions to come along, including a fix in Prawn itself.

Here's the crux of that logic:

string = string.gsub %r/(?=[\u3000\u30a0-\u30ff\u3040-\u309f\p{Han}\uff00-\uffef])/, ZeroWidthSpace
  • \u3000 is the ideographic space character
  • \u30a0-\u30ff is Katakana
  • \u3040-\u309f is Hiragana (\p{Hiragana} is incomplete)
  • \p{Han} are the unified CJK ideographs
  • \uff00-\uffef are half-width and full-width CJK forms
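
For anyone who wants to try the idea outside of Asciidoctor PDF, here is a minimal, standalone sketch of the substitution. The method name `insert_cjk_break_opportunities` is made up for illustration, and defining `ZeroWidthSpace` as U+200B is an assumption about how the constant is set up in the converter; only the regular expression is taken from the line above.

```ruby
# Assumption: the ZeroWidthSpace constant holds U+200B (ZERO WIDTH SPACE).
ZeroWidthSpace = "\u200b"

# Same character class as the regex quoted above. The lookahead matches the
# empty position in front of each CJK character, so gsub inserts text there
# without consuming anything.
CjkBreakOpportunityRx = /(?=[\u3000\u30a0-\u30ff\u3040-\u309f\p{Han}\uff00-\uffef])/

# Hypothetical helper name, used only for this sketch.
def insert_cjk_break_opportunities(string)
  string.gsub(CjkBreakOpportunityRx, ZeroWidthSpace)
end

marked = insert_cjk_break_opportunities("AsciiDoc 是一个")
# Splitting on the zero-width space shows where a break is now permitted.
p marked.split(ZeroWidthSpace) # => ["AsciiDoc ", "是", "一", "个"]
```

Text that contains none of the listed characters passes through unchanged, so the substitution should be harmless for Latin-only content.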

mojavelinux added a commit to mojavelinux/asciidoctor-pdf that referenced this issue Aug 12, 2019
@mojavelinux
Member Author

cc: @diguage

@mojavelinux
Member Author

While we can proceed with this workaround for Asciidoctor PDF users, this issue really needs to be filed upstream in Prawn for a long-term, proper fix.

@mojavelinux
Member Author

I have one question. Is it normal to put spaces between English and Chinese or Japanese characters when mixing the languages? I've seen it done both ways and I'm just curious whether it's a rule or a stylistic choice.

@Gasol
Contributor

Gasol commented Oct 28, 2019

> I have one question. Is it normal to put spaces between English and Chinese or Japanese characters when mixing the languages? I've seen it done both ways and I'm just curious whether it's a rule or a stylistic choice.

It's just style.

FYI

https://chinese.stackexchange.com/questions/31746/spacing-guidelines-for-modern-chinese-writing
https://github.com/coldnew/pangu-spacing
https://pangu.space/

@mojavelinux
Member Author

Thanks!

@Gasol
Contributor

Gasol commented Oct 28, 2019

Thank you for bringing the CJK line-break fixes upstream.

Gasol added a commit to Gasol/asciidoctor-pdf that referenced this issue Oct 29, 2019
Gasol added a commit to Gasol/asciidoctor-pdf that referenced this issue Oct 30, 2019
Fix chloerei/asciidoctor-pdf-cjk#4, Also see
asciidoctor#1206 for details

Require fileutils explicitly to fix the following error when running
`rake spec`:

    An error occurred in a `before(:suite)` hook.
    Failure/Error: FileUtils.mkdir_p output_dir

    NameError:
      uninitialized constant FileUtils
      Did you mean?  FileTest
    # ./spec/spec_helper.rb:187:in `block (2 levels) in <top (required)>'
mojavelinux pushed a commit to Gasol/asciidoctor-pdf that referenced this issue Oct 31, 2019
mojavelinux pushed a commit that referenced this issue Nov 1, 2019
…s `cjk` (#1355)

break CJK characters in table when scripts attribute is `cjk`

A follow-up to #1206. Also chloerei/asciidoctor-pdf-cjk#4.