Add a CJK theme #82

Closed
chloerei opened this issue Jan 8, 2015 · 23 comments

@chloerei
Member

chloerei commented Jan 8, 2015

To generate CJK documents, we need to add a CJK theme that uses DroidSans by default.

Previous discussion: http://discuss.asciidoctor.org/How-to-generate-fonts-td2358.html

@mojavelinux
Member

Am I correct in saying that we should not use justified body text in the CJK theme?

Also, it seems like Prawn doesn't break sentences at the correct location in CJK text (it tries to keep it all on the same line, leaving large spaces). We may have to address that issue separately.

@chloerei
Member Author

chloerei commented Jan 9, 2015

Yes, base_align: justify causes large spaces. I changed it to left, but Prawn still does not break sentences very well.

Prawn's documentation says "line wrapping happens on white space or hyphens". It seems that when Chinese and English are mixed with white space, it breaks at the white space first. I will take more time to learn how to fix it. (A heavyweight solution is to use a word-segmentation library to insert Prawn::Text::ZWSP.)

@mojavelinux
Member

Thanks for confirming the issue.

The problem seems to be in the following lines in Prawn:

https://github.com/prawnpdf/prawn/blob/master/lib/prawn/text/formatted/line_wrap.rb#L143-L161

The break_chars should use the unicode groups for break chars instead of specific western language characters. I did something similar in Asciidoctor core to determine what a letter character is. Unicode provides blocks and shorthands for representing these characters in a multilingual way. For instance, it might be something like: \p{Space} or \p{Z}. See http://ruby-doc.org/core-2.1.5/Regexp.html#class-Regexp-label-Character+Properties

If necessary, we can patch Prawn in Asciidoctor PDF if they refuse to add this support. Fortunately, it's a method we can override ;)
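As a rough illustration of the Unicode property idea, the sketch below classifies single characters with `\p{Space}` and the script properties instead of a fixed list of Western characters. The `describe_char` helper and its category names are hypothetical and exist only for this example; they are not part of Prawn or Asciidoctor PDF.

```ruby
# Hypothetical helper: classify one character by Unicode property rather than
# by comparing it against a hard-coded list of Western break characters.
def describe_char(char)
  return :space if char.match?(/\p{Space}/)
  return :cj    if char.match?(/\p{Han}|\p{Hiragana}|\p{Katakana}/)
  :other
end

# ASCII space, ideographic space (U+3000), Han, Hiragana, Katakana, Latin.
[' ', "\u3000", '文', 'あ', 'カ', 'a'].each do |char|
  puts format('U+%04X %s', char.ord, describe_char(char))
end
```

The point of the sketch is only that one property covers both the ASCII space and the ideographic space, which a list of specific Western characters would miss.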

@mojavelinux
Member

We could also allow the ordering of break characters to be rearranged in the theme if necessary...though I'd like to try to do it in a universal way if possible.

@chloerei
Member Author

chloerei commented Jan 9, 2015

Following your suggestion, I found a way to fix Chinese and Japanese line wrapping. I rewrote this method:

https://github.com/prawnpdf/prawn/blob/master/lib/prawn/text/formatted/line_wrap.rb#L124-L133

        def scan_pattern
          pattern = "[^#{break_chars}]+#{soft_hyphen}|" +
            "[^#{break_chars}]+#{hyphen}+|" +
            "[^#{break_chars}]+|" +
            "[#{whitespace}]+|" +
            "#{hyphen}+[^#{break_chars}]*|" +
            "#{soft_hyphen}"

          Regexp.new(pattern)
        end

to this:

        def scan_pattern
          pattern = "\\p{Han}|\\p{Hiragana}|\\p{Katakana}|\\p{Common}|" + # <- break all CJ chars
            "[^#{break_chars}]+#{soft_hyphen}|" +
            "[^#{break_chars}]+#{hyphen}+|" +
            "[^#{break_chars}]+|" +
            "[#{whitespace}]+|" +
            "#{hyphen}+[^#{break_chars}]*|" +
            "#{soft_hyphen}"
          Regexp.new(pattern)
        end

Because Chinese and Japanese don't use break characters, I didn't change break_chars; instead I let it break after any CJ character. (Korean uses normal spaces as break characters, so it needs no fix.)

I use \p{Common} because \p{Katakana} does not contain 'ー' (https://bugs.ruby-lang.org/issues/5685); there may be other small character ranges with the same problem.
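To see what the extra alternatives change, here is a minimal sketch that runs both patterns through `String#scan` outside of Prawn. The `WHITESPACE`, `HYPHEN` and `BREAK_CHARS` values are simplified stand-ins chosen for this example (soft hyphens are left out), so the real Prawn pattern may differ.

```ruby
# Simplified stand-ins for Prawn's break characters (assumption: the real
# pattern also handles soft hyphens and may differ between versions).
WHITESPACE  = " \t"
HYPHEN      = '-'
BREAK_CHARS = WHITESPACE + HYPHEN

# Reduced form of the original scan pattern.
ORIGINAL_PATTERN = Regexp.new(
  "[^#{BREAK_CHARS}]+#{HYPHEN}+|" \
  "[^#{BREAK_CHARS}]+|" \
  "[#{WHITESPACE}]+|" \
  "#{HYPHEN}+[^#{BREAK_CHARS}]*"
)

# Same pattern with the single-character CJ alternatives tried first.
PATCHED_PATTERN = Regexp.new(
  '\p{Han}|\p{Hiragana}|\p{Katakana}|\p{Common}|' + ORIGINAL_PATTERN.source
)

sample = 'AsciiDoc 是一个格式'

# The original pattern keeps the whole Chinese run as one unbreakable token.
p sample.scan(ORIGINAL_PATTERN)
# The patched pattern emits each CJ character as its own token.
p sample.scan(PATCHED_PATTERN)
```

Because the alternatives are tried at each scan position, a Latin run that is directly followed by CJ characters with no space in between is still consumed as one token by `[^...]+`. That may be why the sample text separates English and CJK with spaces.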

Example (via Google Translate; I added spaces between English and CJK characters):

AsciiDoc is a human-readable document format, semantically equivalent to DocBook XML, but using plain-text mark-up conventions. AsciiDoc documents can be created using any text editor and read “as-is”, or rendered to HTML or any other format supported by a DocBook tool-chain, i.e. PDF, TeX, Unix manpages, e-books, slide presentations, etc.

AsciiDoc 是一个人类可读的文件格式,语义上等同于 DocBook 的 XML,但使用纯文本标记了约定。可以使用任何文本编辑器创建文件把 AsciiDoc 和阅读“原样”,或呈现为HTML 或由 DocBook 的工具链支持的任何其他格式,如 PDF,TeX 的,Unix 的手册页,电子书,幻灯片演示等。

AsciiDoc は、意味的には DocBook XML のに相当するが、プレーン·テキスト·マークアップの規則を使用して、人間が読めるドキュメントフォーマット、である。 AsciiDoc は文書は、任意のテキストエディタを使用して作成され、「そのまま"または、HTML や DocBook のツールチェーンでサポートされている他のフォーマット、すなわち PDF、TeX の、Unix の man ページ、電子書籍、スライドプレゼンテーションなどにレンダリングすることができます。

AsciiDoc 는 의미의 DocBook XML 에 해당하지만 일반 텍스트 마크 업 규칙을 사용하여 사람이 읽을 수있는 문서 형식입니다. AsciiDoc 문서는 텍스트 편집기를 사용하여 생성하고 "있는 그대로"읽거나, HTML 또는 DocBook 을 도구 체인에서 지원하는 다른 형식, 즉 PDF, 텍, 유닉스 맨 페이지, 전자 책, 슬라이드 프리젠 테이션 등을 렌더링 할 수 있습니다.

Before fix (font: DroidSansFallback.ttf):

[image: 1]

After fix:

[image: 2]

Line wrap looks nice.

Or aligned left:

[image: 3]

Looks nice too.

Conclusion: we need a way to customize line wrap behavior (theme or gem?).

@mojavelinux
Member

Great news! I agree, it looks much nicer.

Once again, incredible information. Seriously, this information is going to allow us to make major leaps forward. It's invaluable. Thank you!

As a first step, I recommend pursuing the line break change upstream in Prawn to see how it goes.

I think the line wrap behavior (left, justify) will definitely be something that belongs in the theme. There are already people writing in Western languages that prefer left justification, so it isn't just limited to CJK needs. We could even consider an AsciiDoc attribute, which I have been considering for core (see

@chloerei
Member Author

Because Prawn supports using a zero-width space (U+200B) to control line wrapping, I think it is better to handle this in asciidoctor-pdf than to rewrite Prawn.

So I need to insert zero-width spaces before calling Prawn's text method. I found a place to do it:

  def typeset_text string, line_metrics, opts = {}
    move_down line_metrics.padding_top
    opts = { leading: line_metrics.leading, final_gap: line_metrics.final_gap }.merge opts
    if (first_line_opts = opts.delete :first_line_options)
      # TODO good candidate for Prawn enhancement!
      text_with_formatted_first_line string, first_line_opts, opts
    else
      text string, opts
    end
    move_down line_metrics.padding_bottom
  end

With the insertion added:

  def typeset_text string, line_metrics, opts = {}
    move_down line_metrics.padding_top
    opts = { leading: line_metrics.leading, final_gap: line_metrics.final_gap }.merge opts

    # insert zero-width space after CJ chars. <-
    string.gsub!(/\p{Han}|\p{Hiragana}|\p{Katakana}|\p{Common}/) {|s| "#{s}#{::Prawn::Text::ZWSP}"}

    if (first_line_opts = opts.delete :first_line_options)
      text_with_formatted_first_line string, first_line_opts, opts
    else
      text string, opts
    end
    move_down line_metrics.padding_bottom
  end

The result is similar:

[image: 4]

This code should be extracted into a method, with the regexp loaded from the theme or an attribute.

One question: is this the best place to insert the zero-width space?
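One way the extraction could look is sketched below. The names `insert_break_hints`, `CJ_CHAR` and `ZWSP` are hypothetical; `ZWSP` is defined locally with the same code point as Prawn::Text::ZWSP so the snippet runs without Prawn. The default pattern is an assumption that lists only the three scripts, while the code above also includes \p{Common}; the pattern is a parameter so it could come from a theme or attribute.

```ruby
# Local stand-in for Prawn::Text::ZWSP (U+200B) so the sketch needs no gems.
ZWSP = "\u200B"

# Assumed default; the pattern used above also includes \p{Common}.
CJ_CHAR = /\p{Han}|\p{Hiragana}|\p{Katakana}/

# Hypothetical helper: return a copy of the string with a zero-width space
# after every character matched by the pattern. It uses gsub rather than
# gsub! so the caller's string is not modified.
def insert_break_hints(string, pattern = CJ_CHAR)
  string.gsub(pattern) { |char| "#{char}#{ZWSP}" }
end

source = '中文ab'
hinted = insert_break_hints(source)

puts hinted.length # two characters longer than the source
puts source.length
```

Returning a new string instead of calling gsub! also avoids an error if the caller passes a frozen string.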

@ProgramFan

@chloerei This code seems to break inline markup, for example:

测试中文和English混合编排。这是中文。This is English. 什么都没有。 *黑体*. _斜体_.
测试中文和English混合编排。这是中文。This is English. 什么都没有。 *黑体*. _斜体_.
测试中文和English混合编排。这是中文。This is English. 什么都没有。 *黑体*. _斜体_.
测试中文和English混合编排。这是中文。This is English. 什么都没有。 *黑体*. _斜体_.
测试中文和English混合编排。这是中文。This is English. 什么都没有。 *黑体*. _斜体_.
测试中文和English混合编排。这是中文。This is English. 什么都没有。 *黑体*. _斜体_.
测试中文和English混合编排。这是中文。This is English. 什么都没有。 *黑体*. _斜体_.
测试中文和English混合编排。这是中文。This is English. 什么都没有。 *黑体*. _斜体_.

This renders as:
[image: false]

It seems that the \p{Common} character property inserts a zero-width space around ">" characters (just a guess), which makes Prawn complain that it is 'unable to parse text'. Removing \p{Common} from the pattern results in correct rendering:
[image: true]

This may not be the right solution, but it works for my Chinese document. I would very much like this issue to be solved. Maybe I can contribute a CJK theme, just by replacing the fallback fonts in the original theme with DroidSansFallback or Adobe Source Han Sans (used in my example).
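The guess about \p{Common} can be checked outside of Prawn with a small sketch. The `fragment` string is an assumed shape for inline-formatted text (the exact tags Asciidoctor PDF passes to Prawn are not shown in this thread), and `add_zwsp` is a hypothetical helper written for this example.

```ruby
ZERO_WIDTH_SPACE = "\u200B"

# Hypothetical helper: insert a zero-width space after every pattern match.
def add_zwsp(text, pattern)
  text.gsub(pattern) { |char| "#{char}#{ZERO_WIDTH_SPACE}" }
end

# Assumed shape of the inline-formatted string; the real tags may differ.
fragment = '<strong>黑体</strong>'

with_common    = add_zwsp(fragment, /\p{Han}|\p{Hiragana}|\p{Katakana}|\p{Common}/)
without_common = add_zwsp(fragment, /\p{Han}|\p{Hiragana}|\p{Katakana}/)

# ASCII punctuation such as "<", ">" and "/" belongs to the Common script, so
# the first pattern splits the tag delimiters as well as the CJ characters.
puts with_common.inspect
puts without_common.inspect
```

With \p{Common} in the pattern, the zero-width space lands inside the tags themselves, which is consistent with the parser error. Without it, only the Han characters are affected.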

@chloerei
Member Author

chloerei commented Jul 8, 2015

@ProgramFan How do you generate the fonts? I have a problem generating a bold font. (FontForge crashed when I changed the weight of DroidSansFallbackFull.ttf.)

@ProgramFan

I use Adobe Source Han Sans, which includes fonts in different weights. If there are problems with DroidSansFallback, we can make a fully CJ-capable TTF font from the Adobe sources. But I am not familiar with FontForge.

@chloerei
Member Author

chloerei commented Jul 9, 2015

@ProgramFan Prawn doesn't support OTF yet. How do you set Source Han Sans as the font?

@ProgramFan

@chloerei I use a TTF version downloaded from http://akr.tw/2014/10/source-han-sans-ttf/. It should be possible to generate one directly from Adobe's open-source files with FontForge, but I have not tried that yet.

@chloerei
Member Author

chloerei commented Jul 9, 2015

@ProgramFan Thanks.

@chloerei
Member Author

I created two gems to fix the CJK problems:

https://github.com/chloerei/asciidoctor-pdf-cjk fixes CJK line wrapping and is meant to hold other CJK-specific patches.

https://github.com/chloerei/asciidoctor-pdf-cjk-kai_gen_gothic is a CJK theme containing theme.yml and fonts. It is named after the font KaiGen Gothic, which is converted from Source Han Sans.

Install:

$ gem install asciidoctor-pdf-cjk-kai_gen_gothic

Download the fonts from the Git release (run once):

$ asciidoctor-pdf-cjk-kai_gen_gothic-install

Then use it to render a PDF:

$ asciidoctor-pdf -r asciidoctor-pdf-cjk-kai_gen_gothic -a pdf-style=KaiGenGothicCN doc.asc

There are 4 pdf-style themes:

  • KaiGenGothicCN
  • KaiGenGothicJP
  • KaiGenGothicKR
  • KaiGenGothicTW

Each of these themes contains all CJK glyphs, with region-specific variants.

Preview:

[image: 5]

[image: 6]

I would also like to add these projects to the asciidoctor organization. What should I do?

@chloerei
Member Author

The KaiGen Gothic font was converted with FontCreator, a commercial program. Is that a problem for security or other reasons?

lethee added a commit to lethee/progit2-ko that referenced this issue Jul 27, 2015
@sunsolve

I used asciidoctor-pdf-cjk-kai_gen_gothic and Chinese characters are displayed correctly, but there is another problem.

I need to use SVG in AsciiDoc, and if an SVG file contains Chinese characters, it can't be converted correctly.

asciidoctor-pdf gave the following warning message when it executed:

The following text could not be fully converted to the Windows-1252 character set:
| 错误提示

[image: screen shot 2015-08-26 at 13 29 45]

@chloerei
Member Author

@sunsolve Can you post the document source for testing?

@sunsolve

test.adoc

I used PlantUML here to generate an SVG diagram.

=== SVG Chinese character testing ===
[plantuml,"svg-test",svg]
....
start;
if (测试成功?) then (yes)
   :成功;
else (no)
   :失败;
endif
stop;
....

Convert command:

$ asciidoctor-pdf -a lang=zh --safe-mode unsafe -d book -a icons=font -a toc -a experimental -r asciidoctor-diagram  -r asciidoctor-pdf-cjk-kai_gen_gothic -a pdf-style=KaiGenGothicCN test.adoc

@chloerei
Member Author

I found that asciidoctor-diagram sets the font family to sans-serif in SVG text, and prawn-svg aliases sans-serif to Helvetica, so the font settings in the theme are not used.

Neither asciidoctor-diagram nor prawn-svg lets you set the font yet, but I found a way to patch it. Create a file named config.rb:

Prawn::Svg::Font::GENERIC_CSS_FONT_MAPPING.merge!(
  'sans-serif' => 'KaiGen Gothic CN'
)

then

$ asciidoctor-pdf -r asciidoctor-diagram -r ./config.rb -r asciidoctor-pdf-cjk-kai_gen_gothic -a pdf-style=KaiGenGothicCN test.adoc

Preview:

[image: 10]

I don't know where the best place to patch this is. If asciidoctor/asciidoctor-diagram#70 is supported, then asciidoctor-pdf can add a theme attribute or use the base_font_family setting.

@sunsolve

Great, it worked!

@gengjiawen

@chloerei @mojavelinux
There is a bug in the TOC: if you use asciidoctor-pdf-cjk-kai_gen_gothic and click an item in the TOC, it doesn't jump.
The original works fine.

@mojavelinux
Member

@gengjiawen I believe that is now solved in alpha.14.

@mojavelinux
Member

This issue is too broad and therefore will never be addressed as is.

I've opened #1206 to address the line break issue.

The Asciidoctor Diagram issue doesn't pertain to this converter. The SVG follows the font used in the document. So as long as Asciidoctor Diagram produces the right SVG, then it should work fine in Asciidoctor PDF.
