Emphasis with CJK punctuation #650

ptmkenny · 2020-05-26T15:47:09Z

Hi, I encountered some strange behavior when using CJK full-width punctuation and trying to add emphasis.

Example punctuation that causes this issue:

。！？、

To my mind, all of these should work as emphasis, but some do and some don't:

**テスト。**テスト

**テスト**。テスト

**テスト、**テスト

**テスト**、テスト

**テスト？**テスト

**テスト**？テスト

I'm not sure if this is the spec as intended, but in Japanese, as a general rule there are no spaces in sentences, which leads to the following kind of problem when parsing emphasis.

In English, this is emphasized as expected:

This is **what I wanted to do.** So I am going to do it.

But the same sentence emphasized in the same way in Japanese fails:

これは**私のやりたかったこと。**だからするの。

The text was updated successfully, but these errors were encountered:

tats-u · 2023-11-13T09:50:22Z

This and the above issues are caused by the change in #618. It is mixed in only v0.30 spec.

https://spec.commonmark.org/0.30/changes

A left-flanking delimiter run is a delimiter run that is (1) not followed by Unicode whitespace, and either (2a) not followed by a Unicode punctuation character, or (2b) followed by a Unicode punctuation character and preceded by Unicode whitespace or a Unicode punctuation character. For purposes of this definition, the beginning and the end of the line count as Unicode whitespace.

A right-flanking delimiter run is a delimiter run that is (1) not preceded by Unicode whitespace, and either (2a) not preceded by a Unicode punctuation character, or (2b) preceded by a Unicode punctuation character and followed by Unicode whitespace or a Unicode punctuation character. For purposes of this definition, the beginning and the end of the line count as Unicode whitespace.

A single * character can open emphasis iff (if and only if) it is part of a left-flanking delimiter run.

A single * character can close emphasis iff it is part of a right-flanking delimiter run.

A double ** can open strong emphasis iff it is part of a left-flanking delimiter run.

A double ** can close strong emphasis iff it is part of a right-flanking delimiter run.

The definition of left- and -right-franking emphasis for * and ** must use ASCII punctuation characters instead of Unicode ones.

https://v1.mdxjs.com/

does not cause such problem, so remark depended by MDX v2+ is affected.

wooorm · 2023-11-13T10:03:55Z

Again, there is no change in 618. That PR is just about words, terminology.

MDX 1 did not follow CM correctly and had other bugs.

Can you please read what I say, and please stop spamming, and actually contribute?

tats-u · 2023-11-13T11:01:39Z

MDX 1 did not follow CM correctly and had other bugs.

The extension by MDX is not the culprit.

https://codesandbox.io/s/remark-playground-wmfor?file=/package.json

As of remark-parse v7, this problem is not reproduced, either.

https://prettier.io/playground/#N4Igxg9gdgLgprEAuEAqVhT00DTmg8qMHYMgZtGBSKoOGmgQAzrEl6A7EYM2xZIANCBAA4wCW0AzsqAEMATkIgB3AArCEfFAIA2YgQE8+LAEZCBYANZwYAZQEBbOABlOUOMgBmCnnA1bd+g222WA5shhCAro4gDsacPv6BPF7ycACKfhDwtvaBAFY8AB4GUbHxiUh28g4sAI65cBKibLIgAjwAtFZwACbNzCC+ApzyXgDCEMbGAsg18vJtkVCe0QCCML6c6n7wEnBCFlZJhYEAFjDG8gDq25zwPO5gcAYyJ5wAbifKw2A8aiC3AQCSUC2wBmBCnA402+BhgymimyKIDYogcBy0bGGMLgDiEt2sLEsqJgFQEnkGkMC7iEqOGgyEOia4igbRhlhgB04TRg22QAA4AAwsIRwUqcHm4-FDfLJFgwATqRnM1lIABMLD8DgAKhLZAUoXBjOpmi0mmYBJM-Hi4AAxCBCQZzLzDARLCAgAC+DqAA

Not reproduced in The latest Prettier (uses remark-parse v8), either.

That PR is just about words, terminology.

This means that the credit for the change goes to the fact that it turns to be clear that this specification is a terrible one that should be revised. Old remark-parse were based on an older ambiguous specification and consequently avoided this problem.

tats-u · 2023-11-13T11:04:28Z

https://spec.commonmark.org/0.29/

A punctuation character is an ASCII punctuation character or anything in the general Unicode categories Pc, Pd, Pe, Pf, Pi, Po, or Ps.

You are right. I'm sorry. I will look for another version.

tats-u · 2023-11-13T11:10:53Z

I finally found that the current broken definition sentences were introduced in 0.14.

https://spec.commonmark.org/0.14/changes

https://spec.commonmark.org/0.13/

I will investigate why these are introduced.

tats-u · 2023-11-13T11:59:16Z

https://github.com/commonmark/commonmark-spec/blob/0.14/changelog.spec.txt

Improved rules for emphasis and strong emphasis. This improves parsing of emphasis around punctuation. For background see http://talk.commonmark.org/t/903/6. The basic idea of the change is that if the delimiter is part of a delimiter clump that has punctuation to the left and a normal character (non-space, non-punctuation) to the right, it can only be an opener. If it has punctuation to the right and a normal character (non-space, non-punctuation) to the left, it can only be a closer. This handles cases like
**Gomphocarpus (*Gomphocarpus physocarpus*, syn. *Asclepias physocarpa*)**
and
**foo "*bar*" foo**

http://talk.commonmark.org/t/903/6

There are some good ideas here 4. It looks hairy, but if I understand correctly, basic idea is fairly simple:

Strings of * or _ are divided into “left flanking” and “right flanking,” based on two things: the character immediately before them and the character immediately after.

Left-flanking delimiters can open emphasis, right flanking can close, and non-flanking delimiters are just regular text.

A delimiter is left-flanking if the character to the left has a lower rank than the character to the right, according to the following ranking: spaces and newlines are 0, punctuation (unicode categories Pc, Pd, Ps, Pe, Pi, Pf, Po, Sc, Sk, Sm or So) is 1, the rest 2. And similarly a delimiter is right-flanking if the character to the left has a higher rank than the character to the right.

Note

I replaced the link with a cache by the Wayback machine.

I conclude that this problem was caused by a lack of consideration for Chinese and Japanese by @jgm and the author of vfmd(@roop or possibly @akavel).

tats-u · 2023-11-13T12:03:29Z

I would like to ask them why they included non-ASCII punctuation characters and why only ASCII punctuation characters are not sufficient.

tats-u · 2023-11-13T14:52:14Z

I will blame https://github.com/vfmd/vfmd-spec/blob/gh-pages/specification.md later.

The test cases in vfmd considered only ASCII punctuation.

https://github.com/vfmd/vfmd-test/blob/f74cf615198f788a99f14975cc14a59b1cd3b8fe/tests/span_level/emphasis/with_punctuation.md

tats-u · 2023-11-13T15:15:28Z

I found the commit containing the initial definition in the spec of vfmd:

vfmd/vfmd-spec@7b53f05

@roop seems to live in India, and this may be because he added non-ASCII punctuation characters, but the trouble is that I do not know Hindi at all. I wonder if a space is always adjacent to punctuation characters in that language like European ones.

vassudanagunta · 2023-11-13T22:11:37Z

@tats-u dude, here and in your comments on #618 you come off as arrogant and very disrespectful. You make absolutist claims and then frequently correcting yourself because it turns out you didn't do your homework. You need to have the humility to realize that your perception that "something broke or is broken" might have to do with you not understanding one or more of the following (I don't have the time to figure out which ones, the responsibility is on you):

your specific perspective, which may not be universal, which may miss the forest for the single tree that you are most focused on
the problem, if there actually is one, might be downstream of CommonMark, in the tool you are using
if CommonMark is involved:
- the facts, the history, or the priorities of CommonMark
- the impossible expectation that CommonMark can be all things to all people.
- the difficulty in maintaining a spec where many users expect it to work how they want it without understanding

A more reasoned, respectful and helpful approach would be to have a discussion with other people who are affected by what you claim is broken, including the makers and other users of the downstream tool that you claim is now broken. Diagnose the problem with them, assuming they agree with you that there is a problem, before making a claim that the source of the problem is upstream in CommonMark.

If it turns out that you are alone in this, that should tell you something.

wooorm · 2023-11-14T14:33:16Z

@tats-u This issue is still open, so indeed it is looking for a solution. It is also something I have heard from others.

However, it is not easy to solve.
Many languages do use whitespace.
No languages use only ASCII.
Not using unicode would harm many users, too.

There are also legitimate cases where you do want to use an asterisk or underscore but don’t want it to result in emphasis/strong. Also in East-Asian languages.

One idea I have, that could potentially help emphasis/strong, is the Unicode line breaking algorithm: https://unicode.org/reports/tr14/.
It has to be researched, but it might come up with line breaking points that are better indicators than solely relying on whitespace/punctuation.
It might also be worse.

tats-u · 2023-11-14T15:39:25Z

@vassudanagunta I had got too much angry at that time. I do think it was over the limit now. ~~I wish GitHub would provide the draft comment feature out of box, and I could post many things at once without editing or additional ones.~~

the problem, if there actually is one, might be downstream of CommonMark, in the tool you are using

Let me say there are never in each framework. This problem can be reproduced in the most major JS Markdown frameworks, remark (unified) and markdown-it. Remark-related issues that I have raised are closed immediately with the reason that they are on spec.

the impossible expectation that CommonMark can be all things to all people.

I never have. This is why I have looked into the background and the impact of my proposed changes now.

the difficulty in maintaining a spec where many users expect it to work how they want it without understanding

It looks like a lot of work to study the impact of breaking changes and decide whether or not to apply them.

many users expect it to work how they want it without understanding

Due to this problem, it became necessary for me (us) to tell all Japanese (and some Chinese) Markdown writers to refrain from surrounding whole sentences with **, to use JSX , or to compromise with adding an extra space after the full-width punctuation marks 。 and ． if they are going to continue additional sentences.

<!-- What would you feel if Markdown would not recognize ** here as <strong> if you remove 4 or 5 spaces?   -->
**Don't surround the whole sentence with the double-asterisk without adding extra spaces!**      The Foobar language which is spoken by most CommonMark maintainers use as many as 6 spaces to split sentences.

the facts, the history

This is what I have looked into by digging through rummaging through the Git history, change logs, and test cases now.

the priorities of CommonMark

It is not surprising that maintainers and you lower the priority of this problem, since it does not affect any European language family, which puts space next to punctuation or parentheses.
I had got angry because I assumed that Japanese and Chinese were not even seen as third-class citizens in the Markdown world due to the background of this problem. (The change causing this problem assumes that all languages puts space next to punctuation or parentheses)

If it turns out that you are alone in this, that should tell you something.

I clearly doubt this.
You had better know many of users of specific languages (and they are not minor ones!) are (going to be) suffered by this problem.

@wooorm I apologize again at this time for my anger and for being too militant in my remarks.

My humble suggestions and comments on them:

Revert the concept of left- and right-flanking to prior to 0.14 (0.14 itself is not included)
- Old remark v8 used in Prettier, which is said to violate CM 0.14+ spec, correctly parses the cases presented in the change log in CM v0.14.
- I would like to know and have to investigate the impact of this change because it is a breaking change
Left- and right-flanking + ASCII punctuation (Unicode punctuation can be used in other parts)
- In addition to the issues you mentioned, the combination with link **[製品ほげ](./product-foo)**と**[製品ふが](./product-bar)**をお試しください still cannot be parsed as expected. Compromised solution
Left- and right-flanking + exclude Chinese- and Japanese-related punctuation from list
- Some users use ( ) “ ” without adjacent space. Compromised solution

Many languages do use whitespace.

I know. It is the background of this problem.

There are also legitimate cases where you do want to use an asterisk or underscore but don’t want it to result in emphasis/strong. Also in East-Asian languages.

I have looked for ones and their frequency. Escaping them does not modify the rendered content itself, but I have been disgusted of having to modify the content by adding extra space or to depend on the inline raw JSX tag () to avoid this problem, which puts the shackles on Markdown's expressive power.

Unicode line breaking algorithm

I will look into it later. (I do not expect it either)

Crissov · 2023-11-15T07:41:15Z

Checking the general Unicode categories Pc, Pd, Pe, Pf, Pi, Po and Ps, U+3001 Ideographic Comma and U+3002 Ideographic Full Stop are of course included in what Commonmark considers punctuation marks, which are all treated alike.

For its definitions of flanking, CM could start to handle Open/Start Ps (e.g. () and Initial Pi (“) differently than Close/End Pe ()) and Final Pf (”), and both differently than the rest of Connector Pc (_), Dash Pd (-) and Other Po. However, this could only (somewhat) help with brackets and quotation marks or in contexts where they are present, since the characters in question are all part of that last category Po, which is the largest and most diverse by far.

Possibly affected Examples are, for instance: 363, 367+368, 371+372, 376 and 392–394.

tats-u · 2023-11-26T15:16:38Z

@Crissov

Possibly affected Examples are, for instance: 363, 367+368, 371+372, 376 and 392–394.

I checked the raised test cases. 367 is most affected in them.
I wonder how many Markdown writers use nested  for casual documents suitable for Markdown and whether if we can ask users to combine * and _ or use the raw  powered by MDX if they want to nest .
CJK languages do not use italic. They use https://en.wikipedia.org/wiki/Emphasis_mark, brackets (「」), or quotes (“”) for emphasizing words.
Emphasizing parens in that case may less natural for humans but is a simpler specification and easier to expect the behavior.
Japanese and Chinese do not use _-related syntax because it has too many restrictions, so 371 does not matter. You can keep the current behavior on _.
Other raised cases are not affected.

However, there are some ones not raised but more important. I am not convinced in the test case 378 (a**"foo"**\n→as is).
We may as well treat ** in it as .
It is popular to make text bold even in Chinese and Japanese and ** is used much more frequently than *.
MDN says that  can be nested but does not say that  is also nested.
It would be appreciated if the behavior of ** would be changed first. It is the highest priority for Chinese and Japanese.

handle Open/Start Ps (e.g. () and Initial Pi (“) differently than Close/End Pe ()) and Final Pf (”), and both differently than the rest of Connector Pc (_), Dash Pd (-) and Other Po.

Does it not mean that ** in 単語と**[単語と](word-and)**単語 is going to be treated as  by that change, does it?

FYI, as of https://hypestat.com/info/github.com, one in six visitors in GitHub live in China or Japan. This percentage would not be able to be ignored or underestimated.

wooorm · 2023-12-04T16:17:16Z

CJK languages do not use italic.

 elements have a default styling in HTML (italic), but you can change that. You can add 「」 before/after if you want, with CSS. Markdown does not dictate italic.

MDN says that  can be nested but does not say that  is also nested.

The “Permitted content: Phrasing content” bit allow it for both.

This percentage would not be able to be ignored or underestimated.

I don’t think anybody is underestimating that.
You can’t ignore all existing markdown users either, though, and break them.

Practically, this is also open source, which implies that somebody has to do the work for free here, probably because they think it’s fun or important to do. And then folks working on markdown parsers need to do it too. To illustrate, GitHub hasn’t really done anything in the last 3 years (just security vulnerabilities / new fancy footnote footnotes feature).

jgm · 2023-12-04T17:04:53Z

Getting emphasis right in markdown (especially nested emphasis) is very difficult. Changing the existing rules without messing up cases that currently work is highly nontrivial.

For what it's worth, my rationalized syntax djot has simpler rules for emphasis, gives you what you want in the above Japanese example, and allows you to use braces to clarify nesting in cases where it's unclear, e.g. {*foo{*bar*}*}. It might be worth a look.

tats-u · 2023-12-11T15:56:03Z

 elements have a default styling in HTML (italic), but you can change that. You can add 「」 before/after if you want, with CSS.

This is technically possible but not practical or necessary. It is much easier and faster to type "「" & "」" from the keyboard directly, and you cannot copy these brackets in ::before and ::after from the text.

Markdown does not dictate italic.

Almost all description on Markdown for newbies including the following say that * is for italic.

I do not know of SaaSes in Japan that customize the style of .

The current behavior of CommonMark forces newbies in China or Japan to try to decipher its spec. It is for developers of Markdown parsers, not for users except for experts.

CommonMark has now grown to the point where it can manipulate the largest Markdown implementations (remark, markdonw-it, goldmark (used by Hugo), commonmarker (possibly used by GitHub), and so on) from behind the scenes. We may well lobby to revise its specification. (unenforceable of course though!)

It would not be difficult to create a new specification of Markdown, but is difficult to give sufficient power to it.

These are why I had tried to stop the left- and right-flanking, but I have found a convincing plan to recently.

We have only to change by my plan:

The definitions of (2a) & (2b) in the left- and right-flanking delimiter run
Example 352 & 379, which should not occur in English and other many languages that are not suffered by this problem, because a space is mostly adjacent to punctuation in them.

Getting emphasis right in markdown (especially nested emphasis) is very difficult. Changing the existing rules without messing up cases that currently work is highly nontrivial.

We do not have to change the other. I hope most Chinese and Japanese can be convinced by it. Also, you can continue to nest  and  in other than Chinese or Japanese as you can do today. (We rarely need that feature in these languages) This will not break almost all existing documents written without abusing the details of the spec.

I don’t think anybody is underestimating that.

I am a little relieved to hear that. I apologize for the misunderstanding.

You can’t ignore all existing markdown users either, though, and break them.

It would affect too many documents if the left- & right-flanking rule were abolished. However, the new plan will not affect on most existing documents except for ones that abuse the details of the spec. Do you mean that they are also included in "all existing" ones?
For the first place, this feature is just an Easter egg. A little modification of that could be accepted. I would be appreciated if you could provide me some links to famous sites that describe on Markdown for intermediate level people and that mention the  &  nesting if you have time. I could not find one.

I suggest new terms "punctuation run preceded by space" & "puncuation run followed by space".

"... preceded ..." means: a sequence of Unicode punctuation characters preceded by Unicode whitespace
"... followed ..." means: a sequence of Unicode punctuation characters followed by Unicode whitespace

(2a) and (2b) is going to be changed like the following:

A left-flanking delimiter run is a delimiter run that is (1) not followed by Unicode whitespace, and either (2a) preceded by a Unicode whitespace, or (2b) not the first characters in puncuation run followed by space. For purposes of this definition, the beginning and the end of the line count as Unicode whitespace.
A right-flanking delimiter run is a delimiter run that is (1) not preceded by Unicode whitespace, and either (2a) followd by a Unicode whitespace, or (2b) not the last characters in puncuation run preceded by space. For purposes of this definition, the beginning and the end of the line count as Unicode whitespace.

This change treats punctuation characters that are not adjacent to space as normal letters. To see if the "**" works as intended, one need only check the nearest whitespace and the punctuation characters around it. It make it possible to parse all of the followings:

**これは太字になりません。**ご注意ください。

カッコに注意**（太字にならない）**文が続く場合に要警戒。

**[リンク](https://example.com)**も注意。（画像も同様）

先頭の**`コード`も注意。**

**末尾の`コード`**も注意。

Also, we can parse even the following English as intended:

You should write “John**'s**” instead.

We do not concatenate too many punctuation characters, so we do not have to search more than ten and some (e.g. 16) punctuation characters for space from the previous or next of the target delimiter run.

To check if the delimiter run is "the last characters in punctuation run preceded by space" (without using cache):

flowchart TD
    Next{"Is the<br>next character<br>an Unicode punctuation<br>chracter?"}
    Next--> |YES| F["<code>return false</code>"]
    Next--> |NO| Init["<code>current =</code><br>(previous character)<br><code>n =</code><br>(Length of delimiter run)"]
    Init--> Exceed{"<code>n >= 16</code>?"}
    Exceed--> |YES| F
    Exceed --> |NO| Previous{"What type is <code>current</code>?"}
    Previous --> |Not punctuation or space| F
    Previous --> |Space| T["<code>return true</code>"]
    Previous --> |Unicode punctuation| Iter["<code>n++<br>current =</code><br>(previous character)"]
    Iter --> Exceed

In the current spec, to non-advanced users especially in China or Japan, "*" and "**" sometimes appear to be abandoning its duties. We must not let non-advanced users write Markdown in fear of this hidden feature.

Crissov · 2024-02-01T07:23:56Z

0.31 changes the wording slightly, but as far as I can tell this does not change flanking behavior at all.

A Unicode punctuation character is …

old:

an [ASCII punctuation character] or anything in the general Unicode categories Pc, Pd, Pe, Pf, Pi, Po, or Ps.
new:

a character in the Unicode P (puncuation) or S (symbol) general categories.

tats-u · 2024-02-04T13:51:43Z

The change made the situation even worse.
The following sentences are now unable to be parsed properly.

税込**¥10,000**で入手できます。

正解は**④**です。

The few improvements are only that it is easier to explain the condition to beginners (we can now use the single word “symbols”) and more consistent with ASCII punctuation characters.

jgm · 2024-02-04T17:27:15Z

This particular change was not intended to address this issue; it was just intended to make things more consistent.

@tats-u I am sorry, I have not yet had time to give your proposal proper consideration.

tats-u · 2024-02-05T14:53:04Z

This particular change was not intended to address this issue; it was just intended to make things more consistent.

I guess it, but as a result it did cause a breaking change and break some documents (much less than ones affected by 0.14 though), which is a kind of regressions you have mostly feared and cared about.
This change will be the basis for determining what kind of breaking changes will be acceptable in the future.

For the first place, we cannot easily access to convincing and practical examples that describe how legitimate controversial parts of specifications and changes are; we can easily find only ones that are designed only for testing and do not have meaning (e.g. *$*a. and *$*alpha.)

What is needed is like:

Price: **€**10 per month (note: you cannot pay in US$!)

I have not yet had time to give your proposal proper consideration.

FYI you do not have evaluate how optimize the algorithm in the above flowchart; it is too naive and can be optimized. All I want you to do first is to evaluate how acceptable breaking changes brought by my revision are. It might be better for me to make a PoC to make it easy to do it.

jgm · 2024-02-05T17:15:07Z

To be honest, I didn't anticipate these breaking changes, and I would have thought twice about the change if I had.

Having a parser to play with that implements your idea would make it easier to see what its consequences would be. (Ideally, a minimally altered cmark or commonmark.js.) It's also important to have a plan that can be implemented without significantly degrading the parser's performance. But my guess is that if it's just a check that has to be run once for each delimiter + punctuation run, it should be okay.

tats-u · 2024-05-09T23:37:38Z

@ArcticLampyrid You shouldn't confuse “”‘’ with （）｛｝［］. The latter ones are in Halfwidth and Fullwidth Forms and used by only CJ(K). The other people use only ASCII ones instead of them.
As you know, the former ones are mistakenly forcibly shared with all people in the world. The double-width forms must have been separated.

ArcticLampyrid · 2024-05-10T03:10:33Z

You shouldn't confuse “”‘’ with （）｛｝［］. The latter ones are in Halfwidth and Fullwidth Forms and used by only CJ(K). The other people use only ASCII ones instead of them.

Take a look at Unicode spec, （ (U+FF08) has nothing related to CJK (https://util.unicode.org/UnicodeJsps/character.jsp?a=%EF%BC%88).
It's just a general full-width symbol. in Block Halfwidth_And_Fullwidth_Forms, for Script Common.

Full-width symbols are not exclusive to CJK at all. They can be used in any language for typesetting purposes.

tats-u · 2024-05-12T05:38:40Z

general full-width

Not general.
Those who use alphabets on a daily basis use ASCII ones instead of them. There's no reason for non-CJK people to choose full width forms over ASCII.
They're defined in only legacy CJK encodings (Shift_JIS, GBK, BIG5, UHC, and so on) except for Unicode. CJK people have used them since the 20th century. They're a part of (negative) legacy of CJK languages. Most non-CJK people mustn't have noticed the existence of these characters or have a custom to use them.

tats-u · 2024-05-12T09:06:49Z

This Halfwidth and Fullwidth Forms was defined to provided so that older encodings containing both halfwidth and fullwidth characters can have lossless translation to/from Unicode. (words after "to" was quoted from Wikipedia)
This means this block is prepared for CJK languages (encodings).
I think it wouldn't have been born if CJK languages didn't accept vertical writing.
Why don't you treat it as a part of CJK?

See commonmark/commonmark-spec#650 (comment)

jgm · 2024-05-12T16:37:23Z

I think we should skip a standard variation selector in determining the "character before."

There remains the question how to identify CJK characters. There are two main proposals:

(a) script-based: use the Unicode Script property and check for one of the following scripts: Han, Hangul, Hiragana, Katakana, Bopomofo, Yiii. In cmark we could substitute code point ranges derived from these.

(b) unicode block based: e.g., the following Unicode blocks from @woorm's comment above:

Unicode Range	Description
2e80-2eff	CJK Radicals Supplement
2f00-2fdf	Kangxi Radicals
2FF0-2FFF	Ideographic Description Characters
3000-303F	JK Symbols and Punctuation
3040-309f	Hiragana
30a0-30ff	Katakana
3100-312f	Bopomofo
3130-318F	Kanbun
3190-319F	Kanbun
31C0-31EF	CJK Strokes
31F0-31FF	Katakana Phonetic Extensions
3200-32ff	Enclosed CJK Letters and Months
3300-33FF	CJK Compatibility
3400-4dbf	CJK Unified Ideographs Extension A
4e00-9fff	CJK Unified Ideographs
A000-A48F	Yi Syllables
A490-A4CF	Yi Radicals
f900-faff	CJK Compatibility Ideographs
FE10-FE1F	Vertical Forms
FE30-FE4F	CJK Compatibility Forms
FE50-FE6F	Small Form Variants
FF00-FFEE	Halfwidth and Fullwidth Forms
1B000-1B0FF	Kana Supplement
1B100-1B12F	Kana Extended-A
1B130-1B16F	Small Kana Extension
20000-2A6DF	CJK Unified Ideographs Extension B
2A700-2B73F	CJK Unified Ideographs Extension C
2B740-2B81F	CJK Unified Ideographs Extension D
2B820-2CEAF	CJK Unified Ideographs Extension E
2CEB0-2EBEF	CJK Unified Ideographs Extension F
2F800-2FA1F	CJK Compatibility Ideographs Supplement
30000-3134F	CJK Unified Ideographs Extension G

I am still not sure about how to decide between these. The script-based approach seems simpler. @tats-u pointed out some characters that were missing from the script-based regex in #650 (comment) Does the revision in #650 (comment) fix this problem?

The Ideographic Variation Selectors (U+E0100–U+E01EF) should also be included (if they're not already captured by (a) or (b)).

Wide parentheses: these do seem to be used only in CJK and may appear in some of the examples we want to handle; on the other hand it seems unprincipled if they're not in the scripts/ranges above. One alternative as @ArcticLampyrid suggests would be to make the flankingness determination sensitive to Unicode markings as open vs closed punctuation; this might make it unnecessary to include these in the CJK range, and might have other good effects elsewhere.

It would be helpful if someone could go through this thread and compile a large list of the examples that have been proposed, which we can use for tests.

rxliuli · 2024-05-12T17:38:06Z

It would be helpful if someone could go through this thread and compile a large list of the examples that have been proposed, which we can use for tests.

All example

[
  "これは**私のやりたかったこと。**だからするの。",
  "**[製品ほげ](./product-foo)**と**[製品ふが](./product-bar)**をお試しください",
  "単語と**[単語と](word-and)**単語",
  "**これは太字になりません。**ご注意ください。",
  "カッコに注意**（太字にならない）**文が続く場合に要警戒。",
  "**[リンク](https://example.com)**も注意。（画像も同様）",
  "先頭の**`コード`も注意。**",
  "**末尾の`コード`**も注意。",
  "税込**¥10,000**で入手できます。",
  "正解は**④**です。",
  "太郎は\\ **「こんにちわ」**\\ といった",
  "太郎は&#x200B;**「こんにちわ」**&#x200B;といった",
  "太郎は**「こんにちわ」**といった",
  "太郎は **「こんにちわ」** といった",
  "太郎は**「こんにちわ」**といった",
  "太郎は**\"こんにちわ\"**といった",
  "太郎は**こんにちわ**といった",
  "太郎は**「Hello」**といった",
  "太郎は**\"Hello\"**といった",
  "太郎は**Hello**といった",
  "太郎は**「Oh my god」**といった",
  "太郎は**\"Oh my god\"**といった",
  "太郎は**Oh my god**といった",
  "**C#**や**F#**は**「.NET」**というプラットフォーム上で動作します。",
  "IDが**001号**になります。",
  "IDが**００１号**になります。",
  "Go**「初心者」**を対象とした記事です。",
  "Go**\"初心者\"**を対象とした記事です。",
  "**[リンク](https://example.com)**も注意。",
  "先頭の**",
  "も注意。**",
  "**⻲田太郎**と申します",
  "・**㋐**:選択肢１つ目",
  "**真，**她",
  "**真。**她",
  "**真、**她",
  "**真；**她",
  "**真：**她",
  "**真？**她",
  "**真！**她",
  "**真“**她",
  "**真”**她",
  "**真‘**她",
  "**真’**她",
  "**真（**她",
  "**真）**她",
  "**真【**她",
  "**真】**她",
  "**真《**她",
  "**真》**她",
  "**真—**她",
  "**真～**她",
  "**真…**她",
  "**真·**她",
  "**真〃**她",
  "**真-**她",
  "**真々**她",
  "**真**她",
  "**真，** 她",
  "**真**，她",
  "**真，**&ZeroWidthSpace;她",
  "私は**⻲田太郎**と申します",
  "選択肢**㋐**: 1つ目の選択肢",
  "**さようなら︙**と太郎はいった。",
  ".NET**（.NET Frameworkは不可）**では、",
  "「禰󠄀」の偏は示ではなく**礻**です。",
  "Git**（注：不是GitHub）**",
  "太郎は**「こんにちわ」**といった。",
  "𰻞𰻞**（ビャンビャン）**麺",
  "ハイパーテキストコーヒーポット制御プロトコル**(HTCPCP)**",
  "﨑**(崎)**",
  "国際規格**[ECMA-262](https://tc39.es/ecma262/)**",
  "㐧**(第の俗字)**",
  "𠮟**(こちらが正式表記)**",
  "𪜈**(トモの合略仮名)**",
  "𫠉**(馬の俗字)**",
  "谺𬤲**(こだま)**石神社",
  "石𮧟**(いしただら)**",
  "ハイパーテキストコーヒーポット制御プロトコル**（HTCPCP）**",
  "﨑**（崎）**",
  "㐧**（第の俗字）** (Unofficial form)",
  "𠮟**（こちらが正式表記）**",
  "𪜈**（トモの合略仮名）** (Mixed-in uncommon joined katakana)",
  "𫠉**（馬の俗字）** (Unofficial form)",
  "谺𬤲**（こだま）**石神社 (shrine)",
  "石𮧟**（いしただら）** (address)",
  "**推荐几个框架：**React、Vue等前端框架。",
  "葛󠄀**(こちらが正式表記)**城市",
  "禰󠄀**(こちらが正式表記)**豆子"
]

tats-u · 2024-05-13T12:10:39Z

Removed some unintended duplicated ones (originally proposed by me) and added some.
The last 4 are CJK Unified Ideographs Extension H (tentative example; U+31350–U+323AF), Katakana Phonetic Extensions, Kana Supplement, and CJK Unified Ideographic Extension I (U+2EBF0–U+2EE5F).

[
  "これは**私のやりたかったこと。**だからするの。",
  "**[製品ほげ](./product-foo)**と**[製品ふが](./product-bar)**をお試しください",
  "単語と**[単語と](word-and)**単語",
  "**これは太字になりません。**ご注意ください。",
  "カッコに注意**（太字にならない）**文が続く場合に要警戒。",
  "**[リンク](https://example.com)**も注意。（画像も同様）",
  "先頭の**`コード`も注意。**",
  "**末尾の`コード`**も注意。",
  "税込**¥10,000**で入手できます。",
  "正解は**④**です。",
  "太郎は\\ **「こんにちわ」**\\ といった",
  "太郎は&#x200B;**「こんにちわ」**&#x200B;といった",
  "太郎は**「こんにちわ」**といった",
  "太郎は **「こんにちわ」** といった",
  "太郎は**「こんにちわ」**といった",
  "太郎は**\"こんにちわ\"**といった",
  "太郎は**こんにちわ**といった",
  "太郎は**「Hello」**といった",
  "太郎は**\"Hello\"**といった",
  "太郎は**Hello**といった",
  "太郎は**「Oh my god」**といった",
  "太郎は**\"Oh my god\"**といった",
  "太郎は**Oh my god**といった",
  "**C#**や**F#**は**「.NET」**というプラットフォーム上で動作します。",
  "IDが**001号**になります。",
  "IDが**００１号**になります。",
  "Go**「初心者」**を対象とした記事です。",
  "**[リンク](https://example.com)**も注意。",
  "先頭の**",
  "も注意。**",
  "**⻲田太郎**と申します",
  "・**㋐**:選択肢１つ目",
  "**真，**她",
  "**真。**她",
  "**真、**她",
  "**真；**她",
  "**真：**她",
  "**真？**她",
  "**真！**她",
  "**真“**她",
  "**真”**她",
  "**真‘**她",
  "**真’**她",
  "**真（**她",
  "真**（她**",
  "**真）**她",
  "**真【**她",
  "真**【她**",
  "**真】**她",
  "**真《**她",
  "真**《她**",
  "**真》**她",
  "**真—**她",
  "**真～**她",
  "**真…**她",
  "**真·**她",
  "**真〃**她",
  "**真-**她",
  "**真々**她",
  "**真**她",
  "**真，** 她",
  "**真**，她",
  "**真，**&ZeroWidthSpace;她",
  "私は**⻲田太郎**と申します",
  "選択肢**㋐**: 1つ目の選択肢",
  "**さようなら︙**と太郎はいった。",
  ".NET**（.NET Frameworkは不可）**では、",
  "「禰󠄀」の偏は示ではなく**礻**です。",
  "Git**（注：不是GitHub）**",
  "太郎は**「こんにちわ」**といった。",
  "𰻞𰻞**（ビャンビャン）**麺",
  "𰻞𰻞**(ビャンビャン)**麺",
  "ハイパーテキストコーヒーポット制御プロトコル**(HTCPCP)**",
  "﨑**(崎)**",
  "国際規格**[ECMA-262](https://tc39.es/ecma262/)**",
  "㐧**(第の俗字)**",
  "𠮟**(こちらが正式表記)**",
  "𪜈**(トモの合略仮名)**",
  "𫠉**(馬の俗字)**",
  "谺𬤲**(こだま)**石神社",
  "石𮧟**(いしただら)**",
  "**推荐几个框架：**React、Vue等前端框架。",
  "葛󠄀**(こちらが正式表記)**城市",
  "禰󠄀**(こちらが正式表記)**豆子",
  "**(U+317DB)**",
  "阿寒湖アイヌシアターイコㇿ**(Akanko Ainu Theater Ikor)**",
  "あ𛀙**(か)**よろし",
  "𮹝**(simplified form of 龘 in China)**",
  "大塚︀**(or 大塚 / 大塚)**"
]

jgm · 2024-05-13T18:33:32Z

Great! Do these include any with a Standard Variation Selector? Would be good to have one of those, too, with a CJK character with Standard Variation Selector before the **.

tats-u · 2024-05-14T03:44:35Z

大塚︀**(or 大塚 / 大塚)**

The first and last ones have the same form but only the last one is converted to the middle one if normalized. Only the first one contains a SVS.

https://light.fusic.co.jp/2021/05/06/20210506-sakai/ (Japanese)

jgm · 2024-05-14T04:38:15Z

Thanks, I added it to the list.

tats-u · 2024-05-14T08:55:01Z

"Go**\"初心者\"**を対象とした記事です。"

We have to note the current plan can't cover this case because neither o or " are CJK.

rxliuli · 2024-05-14T09:01:04Z

"Go**\"初心者\"**を対象とした記事です。"

We have to note the current plan can't cover this case because neither o or " are CJK.

@tats-u This may be another issue and should not be addressed in this issue 🤔

tats-u · 2024-05-14T09:07:04Z

I've wanted to split this issue into more than one.
Which should we create a new issue for, the current plan or corner cases (including the above case) not covered by the current plan?

tats-u · 2024-05-14T10:15:21Z

@rxliuli

We have to note the current plan can't cover this case because neither o or " are CJK.

This just means "Go**\"初心者\"**を対象とした記事です。" isn't appropriate for a test case for the current plan and should be removed from the entire list (or the test won't pass).

tats-u · 2024-05-14T14:18:41Z

@rxliuli

  "太郎は**\"こんにちわ\"**といった",
  "**真-**她"

They should be passed even in the current plan because the outer characters are han.
Is the reason why you removed the above cases just because they're duplicated and can be substituted by others using [ or ( instead of " or -?

rxliuli · 2024-05-14T14:57:59Z

@rxliuli
  "太郎は**\"こんにちわ\"**といった",
  "**真-**她"
They should be passed even in the current plan because the outer characters are han. Is the reason why you removed the above cases just because they're duplicated and can be substituted by others using [ or ( instead of " or -?

@tats-u Sorry, my bad, hidden.

jgm · 2024-05-14T15:42:51Z

I'm still not sure about the question I asked above regarding unicode scripts vs code point blocks. Is the script-based solution viable?

ptmkenny · 2024-05-14T17:05:21Z

I appreciate all the discussion; thank you.

Reading through the issue, it's not clear to me whether half-width and full-width punctuation and numbers are going to be treated the same (for example, is ( going to be handled the same as （. Specifically, I would expect the following additional test cases to pass (pass = ** evaluates to bold):

// Half-width parentheses
(**色**)
// Full-width parentheses
（**色**）
// Half-width parentheses
**(色)**
// Half-width parentheses, surrounded by CJK
カラー**(色)**を入力
// Full-width parentheses
**（色）**
// Full-width parentheses, surrounded by CJK
カラー**（色）**を入力
// Half-width 5 and half-width %
**5%**
// Half-width 5 and half-width %, surrounded by CJK
消費税が**5%**です。
// Full-width 5 and half-width %
**５%**
// Full-width 5 and half-width %, surrounded by CJK
消費税が**５%**です。
// Full-width 5 and full-width %
**５％**
// Full-width 5 and full-width %, surrounded by CJK
消費税が**５％**です。

jgm · 2024-05-14T18:08:27Z

These should be fine because the ** are adjacent to a CJK character:

// Half-width parentheses
(**色**)
// Full-width parentheses
（**色**）
// Half-width parentheses, surrounded by CJK
カラー**(色)**を入力
// Full-width parentheses, surrounded by CJK
カラー**（色）**を入力
// Half-width 5 and half-width %
**5%**
// Half-width 5 and half-width %, surrounded by CJK
消費税が**5%**です。
// Full-width 5 and half-width %, surrounded by CJK
消費税が**５%**です。
// Full-width 5 and full-width %, surrounded by CJK
消費税が**５％**です。

These cases should be fine, too, as long as there is whitespace around the ** (as there is in these examples).

// Half-width parentheses
**(色)**
// Full-width parentheses
**（色）**
// Full-width 5 and full-width %
**５％**
// Full-width 5 and half-width %
**５%**

But if you had something like

// Half-width parentheses
aa**(色)**bb

it would not be handled, unless we adopt the suggestion above of making flankingness detection sensitive to the open- and close-punctuation classes. Even that suggestion would not help with

// Full-width 5 and full-width %
aa**５％**bb
// Full-width 5 and half-width %
aa**５%**bb

The first of those cases could be handled if we decided to treat full-width 5 and % as CJK characters. As for the second, I don't think it's handled by any current proposal.

tats-u · 2024-05-14T23:03:52Z

unicode scripts vs code point blocks

If we were going to adopt the former as the main, we would have to make up for the lack of characters by the latter because scripts don't seem to cover some essential CJK symbols.

Scripts has the pros the spec doesn't have to be revised even if new Unicode blocks for CJK languages are appended in the future.
I think we can combine both if we want.

These should be fine because the ** are adjacent to a CJK character:
...

All correct.

jgm · 2024-05-14T23:36:08Z

If we were going to adopt the former as the main, we would have to make up for the lack of characters by the latter because scripts don't seem to cover some essential CJK symbols.

Can you be more specific about what, exactly, is missing?

tats-u · 2024-05-14T23:43:13Z

@jgm #650 (comment) at least

I doubt scx is common among languages other than ECMAScript-based languages.

tats-u · 2024-05-14T23:49:58Z

I'll look into https://www.unicode.org/Public/15.0.0/ucd/ScriptExtensions.txt or try \p{scx=...} in a JS REPL later.

rxliuli · 2024-05-14T23:57:41Z

These should be fine because the ** are adjacent to a CJK character:

// Half-width parentheses
(**色**)
// Full-width parentheses
（**色**）
// Half-width parentheses, surrounded by CJK
カラー**(色)**を入力
// Full-width parentheses, surrounded by CJK
カラー**（色）**を入力
// Half-width 5 and half-width %
**5%**
// Half-width 5 and half-width %, surrounded by CJK
消費税が**5%**です。
// Full-width 5 and half-width %, surrounded by CJK
消費税が**５%**です。
// Full-width 5 and full-width %, surrounded by CJK
消費税が**５％**です。

These cases should be fine, too, as long as there is whitespace around the ** (as there is in these examples).

// Half-width parentheses
**(色)**
// Full-width parentheses
**（色）**
// Full-width 5 and full-width %
**５％**
// Full-width 5 and half-width %
**５%**

@jgm @tats-u I believe that even addressing just these cases would greatly improve the current CJK support; it’s not necessary to solve all the problems at once.

But if you had something like
// Half-width parentheses
aa**(色)**bb
it would not be handled, unless we adopt the suggestion above of making flankingness detection sensitive to the open- and close-punctuation classes. Even that suggestion would not help with
// Full-width 5 and full-width %
aa**５％**bb
// Full-width 5 and half-width %
aa**５%**bb
The first of those cases could be handled if we decided to treat full-width 5 and % as CJK characters. As for the second, I don't think it's handled by any current proposal

I will pull some Chinese content from the internet and analyze the frequency of these cases.

tats-u · 2024-05-15T09:03:28Z

@rxliuli

I believe that even addressing just these cases would greatly improve the current CJK support; it’s not necessary to solve all the problems at once.

Agree with you and it's why I've shelved other more complex solutions.

I will pull some Chinese content from the internet and analyze the frequency of these cases.

Could you share it with us how? (Especially how to scrape and filter)
I'd like to apply it to Japanese content if possible.

tats-u · 2024-05-15T09:49:11Z

I'll look into https://www.unicode.org/Public/15.0.0/ucd/ScriptExtensions.txt or try \p{scx=...} in a JS REPL later.

It worked only on 「」. The others were not covered.

node
Welcome to Node.js v20.11.1.
Type ".help" for more information.
> /^(\p{scx=Han}|\p{scx=Hangul}|\p{scx=Hiragana}|\p{scx=Katakana}|\p{scx=Bopomofo})+$/u.test('禰󠄀豆子')
false
> /^(\p{scx=Han}|\p{scx=Hangul}|\p{scx=Hiragana}|\p{scx=Katakana}|\p{scx=Bopomofo})+$/u.test('（）')
false
> /^(\p{scx=Han}|\p{scx=Hangul}|\p{scx=Hiragana}|\p{scx=Katakana}|\p{scx=Bopomofo})+$/u.test('「」')
true

In the first place, fullwidth ASCII-compatible symbols are not written in ScriptExtensions.txt.

ArcticLampyrid · 2024-05-15T16:25:23Z

It worked only on 「」. The others were not covered.

As expected.

For the first one, it's related to SVS, which should be handled with additional rules.
For the second one, general full-width symbols are marked as common (at least in Unicode spec). Maybe we can add the whole Halfwidth and Fullwidth Forms (Unicode block) as supplement.

tats-u · 2024-05-16T00:04:32Z

it's related to SVS,

This is IVS, not SVS, but can be handled in the same way as SVS (or just treated as CJK as I've said).

Maybe we can add

Definitely we have to do. This is why I assert we should use Unicode blocks as primary or at least secondary.

wooorm mentioned this issue Dec 2, 2020

Incorrect handling of emphasis for Japanese language micromark/micromark#41

Closed

tats-u mentioned this issue Nov 13, 2023

Clarify wording in spec for character groups #618

Merged

This was referenced Jan 3, 2024

docs: add admonition for incompatibility with * and ** in Chinese and Japanese in MDX v2+ facebook/docusaurus#9692

Merged

MDX 3 prettier/prettier#12209

Open

jgm added a commit to commonmark/cmark that referenced this issue May 12, 2024

In determining "before_char", skip standard variation selector.

07c7662

See commonmark/commonmark-spec#650 (comment)

This comment was marked as outdated.

Sign in to view

Emphasis with CJK punctuation #650

Emphasis with CJK punctuation #650

Comments

ptmkenny commented May 26, 2020

tats-u commented Nov 13, 2023 • edited Loading

wooorm commented Nov 13, 2023

tats-u commented Nov 13, 2023 • edited Loading

tats-u commented Nov 13, 2023 • edited Loading

tats-u commented Nov 13, 2023 • edited Loading

tats-u commented Nov 13, 2023 • edited Loading

tats-u commented Nov 13, 2023 • edited Loading

tats-u commented Nov 13, 2023

tats-u commented Nov 13, 2023 • edited Loading

vassudanagunta commented Nov 13, 2023

wooorm commented Nov 14, 2023 • edited Loading

tats-u commented Nov 14, 2023 • edited Loading

Crissov commented Nov 15, 2023 • edited Loading

tats-u commented Nov 26, 2023 • edited Loading

wooorm commented Dec 4, 2023 • edited Loading

jgm commented Dec 4, 2023

tats-u commented Dec 11, 2023 • edited Loading

Crissov commented Feb 1, 2024 • edited Loading

tats-u commented Feb 4, 2024 • edited Loading

jgm commented Feb 4, 2024

tats-u commented Feb 5, 2024 • edited Loading

jgm commented Feb 5, 2024

tats-u commented May 9, 2024 • edited Loading

ArcticLampyrid commented May 10, 2024

tats-u commented May 12, 2024 • edited Loading

tats-u commented May 12, 2024 • edited Loading

jgm commented May 12, 2024

rxliuli commented May 12, 2024

tats-u commented May 13, 2024 • edited by jgm Loading

jgm commented May 13, 2024

tats-u commented May 14, 2024 • edited Loading

jgm commented May 14, 2024

tats-u commented May 14, 2024

rxliuli commented May 14, 2024

tats-u commented May 14, 2024 • edited Loading

tats-u commented May 14, 2024

This comment was marked as outdated.

tats-u commented May 14, 2024 • edited Loading

rxliuli commented May 14, 2024

jgm commented May 14, 2024

ptmkenny commented May 14, 2024

jgm commented May 14, 2024

tats-u commented May 14, 2024 • edited Loading

jgm commented May 14, 2024

tats-u commented May 14, 2024 • edited Loading

tats-u commented May 14, 2024

rxliuli commented May 14, 2024 • edited Loading

tats-u commented May 15, 2024 • edited Loading

tats-u commented May 15, 2024 • edited Loading

ArcticLampyrid commented May 15, 2024

tats-u commented May 16, 2024 • edited Loading

tats-u commented Nov 13, 2023 •

edited

Loading

tats-u commented Nov 13, 2023 •

edited

Loading

tats-u commented Nov 13, 2023 •

edited

Loading

tats-u commented Nov 13, 2023 •

edited

Loading

tats-u commented Nov 13, 2023 •

edited

Loading

tats-u commented Nov 13, 2023 •

edited

Loading

tats-u commented Nov 13, 2023 •

edited

Loading

wooorm commented Nov 14, 2023 •

edited

Loading

tats-u commented Nov 14, 2023 •

edited

Loading

Crissov commented Nov 15, 2023 •

edited

Loading

tats-u commented Nov 26, 2023 •

edited

Loading

wooorm commented Dec 4, 2023 •

edited

Loading

tats-u commented Dec 11, 2023 •

edited

Loading

Crissov commented Feb 1, 2024 •

edited

Loading

tats-u commented Feb 4, 2024 •

edited

Loading

tats-u commented Feb 5, 2024 •

edited

Loading

tats-u commented May 9, 2024 •

edited

Loading

tats-u commented May 12, 2024 •

edited

Loading

tats-u commented May 12, 2024 •

edited

Loading

tats-u commented May 13, 2024 •

edited by jgm

Loading

tats-u commented May 14, 2024 •

edited

Loading

tats-u commented May 14, 2024 •

edited

Loading

tats-u commented May 14, 2024 •

edited

Loading

tats-u commented May 14, 2024 •

edited

Loading

tats-u commented May 14, 2024 •

edited

Loading

rxliuli commented May 14, 2024 •

edited

Loading

tats-u commented May 15, 2024 •

edited

Loading

tats-u commented May 15, 2024 •

edited

Loading

tats-u commented May 16, 2024 •

edited

Loading