The current tests in smart_punct.txt for en/em dashes don't define behavior for certain longer runs of hyphens, and the current code leaves hanging hyphens in cases where the whole run could instead be converted entirely into em/en dashes.
For example, a series of 10 hyphens results in 3 em dashes followed by a hyphen. In my opinion, it would make more sense for this to result in 5 en dashes. Additionally, 7 hyphens are converted into 2 em dashes and a hyphen, but I believe it should be 1 em dash and 2 en dashes. That is:
Current: ---------- => --- --- --- - => ———-
Better: ---------- => -- -- -- -- -- => –––––
Current: ------- => --- --- - => ——-
Better: ------- => --- -- -- => —––
To achieve this behavior, each run of 2 or more hyphens would be matched and counted at once (e.g., /^(?<!-)(-{2,})/), and then the best grouping would be worked out (shown in PHP for thephpleague/commonmark, but it should be easy to convert to JavaScript/C):
$count = strlen($matched);
$en_dash = '–';
$en_count = 0;
$em_dash = '—';
$em_count = 0;
if ($count % 3 === 0) { // If divisible by 3, use all em dashes
    $em_count = intdiv($count, 3);
} elseif ($count % 2 === 0) { // If divisible by 2, use all en dashes
    $en_count = intdiv($count, 2);
} elseif (($count - 2) % 3 === 0) { // If 2 extra hyphens, use en dash for last 2; em dashes for rest
    $em_count = intdiv($count - 2, 3);
    $en_count = 1;
} else { // Use en dashes for last 4 hyphens; em dashes for rest
    $em_count = intdiv($count - 4, 3);
    $en_count = 2;
}
// intdiv() keeps the counts as integers, which is what str_repeat() expects
$inlineContext->getInlines()->add(new Text(
    str_repeat($em_dash, $em_count).
    str_repeat($en_dash, $en_count)
));
return true;
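For anyone who wants to try the grouping outside of a parser, here is a rough standalone sketch of the same rules. The function names (dashes_for_run, greedy_dashes, smart_dashes) are made up for illustration and are not part of thephpleague/commonmark or any other implementation, and greedy_dashes is only my assumed model of the current behavior (replace "---" first, then "--"), based on the results described above. It needs PHP 7.4+ for the arrow function.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: map a run of $count hyphens (2 or more) to dashes
// using the grouping rules proposed above.
function dashes_for_run(int $count): string
{
    $em = 0;
    $en = 0;
    if ($count % 3 === 0) {              // all em dashes
        $em = intdiv($count, 3);
    } elseif ($count % 2 === 0) {        // all en dashes
        $en = intdiv($count, 2);
    } elseif (($count - 2) % 3 === 0) {  // em dashes, then one en dash
        $em = intdiv($count - 2, 3);
        $en = 1;
    } else {                             // em dashes, then two en dashes
        $em = intdiv($count - 4, 3);
        $en = 2;
    }
    return str_repeat('—', $em) . str_repeat('–', $en);
}

// Assumed model of the current behavior: replace every "---" first,
// then every remaining "--", leaving a single leftover hyphen in place.
function greedy_dashes(string $text): string
{
    return str_replace(['---', '--'], ['—', '–'], $text);
}

// Apply the proposed rules to every run of 2+ hyphens in a string.
function smart_dashes(string $text): string
{
    return preg_replace_callback(
        '/-{2,}/',
        fn (array $m): string => dashes_for_run(strlen($m[0])),
        $text
    ) ?? $text; // fall back to the input if the regex engine reports an error
}

foreach ([7, 10] as $n) {
    $run = str_repeat('-', $n);
    echo $n, ': ', greedy_dashes($run), ' vs ', smart_dashes($run), PHP_EOL;
}
// prints:
// 7: ——- vs —––
// 10: ———- vs –––––
```

A greedy /-{2,}/ already consumes the whole run, so this sketch doesn't need the lookbehind; a cursor-based inline parser may still want it.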
Is this something that CommonMark would be interested in implementing? (I can do it; I just don't want to spend the time writing the code if it won't be accepted.) Or should the smart_punct.txt file be updated with tests that check for these edge cases?