"Smart" replacement of hyphens with em/en dash seems strange #56

@0b10011

The current tests in smart_punct.txt for en/em dashes don't define behavior for certain longer runs of hyphens, and the current code leaves hanging hyphens in cases where the whole run could be replaced using only em and en dashes.

For example, a series of 10 hyphens results in 3 em dashes followed by a hyphen. In my opinion, it would make more sense for this to result in 5 en dashes. Additionally, 7 hyphens are converted into 2 em dashes and a hyphen, but I believe it should be 1 em dash and 2 en dashes. That is:

Current: ---------- => --- --- --- -  => ———-
 Better: ---------- => -- -- -- -- -- => –––––

Current: ------- => --- --- - => ——-
 Better: ------- => --- -- -- => —––
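The "Current" lines above are consistent with a greedy replacement that converts every "---" first and then every remaining "--". The sketch below models that behavior only for illustration: the function name greedyDashes is hypothetical, and this is an assumption about how the output arises, not code taken from any CommonMark implementation.

```php
<?php
// Hypothetical model of the reported behavior (an assumption, not real
// implementation code): replace every "---" with an em dash first, then
// every remaining "--" with an en dash. Leftover single hyphens stay.
function greedyDashes(string $text): string
{
    // With array arguments, str_replace applies the pairs in order.
    return str_replace(['---', '--'], ['—', '–'], $text);
}

echo greedyDashes(str_repeat('-', 10)), "\n"; // three em dashes, then a hyphen
echo greedyDashes(str_repeat('-', 7)), "\n";  // two em dashes, then a hyphen
```

Under this model, any run whose length leaves a remainder of 1 when divided by 3 ends in a stray hyphen, which is the case the proposal tries to avoid.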

To achieve this behavior, each run of 2 or more hyphens would be collected and counted at once (e.g., /^(?<!-)(-{2,})/), and then the optimal grouping would be worked out (in PHP for thephpleague/commonmark, but it should be fairly easy to convert to JavaScript/C):

// $matched holds the full run of hyphens (2 or more) captured by the regex.
$count = strlen($matched);
$en_dash = '–';
$en_count = 0;
$em_dash = '—';
$em_count = 0;
if ($count % 3 === 0) { // If divisible by 3, use all em dashes
    $em_count = intdiv($count, 3);
} elseif ($count % 2 === 0) { // If divisible by 2, use all en dashes
    $en_count = intdiv($count, 2);
} elseif (($count - 2) % 3 === 0) { // If 2 extra hyphens, use en dash for last 2; em dashes for rest
    $em_count = intdiv($count - 2, 3);
    $en_count = 1;
} else { // Use en dashes for last 4 hyphens; em dashes for rest
    $em_count = intdiv($count - 4, 3);
    $en_count = 2;
}
// Emit em dashes first, then en dashes, so no hyphen is left over.
$inlineContext->getInlines()->add(new Text(
    str_repeat($em_dash, $em_count).
    str_repeat($en_dash, $en_count)
));
return true;
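To try the grouping outside of the parser, here is a minimal standalone sketch of the same branching. The names dashesForRun and smartDashes are hypothetical, and using preg_replace_callback with /-{2,}/ is an assumption made for illustration; it stands in for the parser hook and is not how thephpleague/commonmark wires this in.

```php
<?php
// Hypothetical helper: convert a run of $count hyphens (2 or more) into
// em and en dashes only, using the same branches as the proposal above.
function dashesForRun(int $count): string
{
    $em = 0;
    $en = 0;
    if ($count % 3 === 0) {        // all em dashes
        $em = intdiv($count, 3);
    } elseif ($count % 2 === 0) {  // all en dashes
        $en = intdiv($count, 2);
    } elseif ($count % 3 === 2) {  // one en dash for the last 2 hyphens
        $em = intdiv($count - 2, 3);
        $en = 1;
    } else {                       // two en dashes for the last 4 hyphens
        $em = intdiv($count - 4, 3);
        $en = 2;
    }
    return str_repeat('—', $em) . str_repeat('–', $en);
}

// Hypothetical wrapper: rewrite every maximal run of 2+ hyphens in $text.
// The quantifier is greedy, so each match is a whole run; single hyphens
// never match and are left untouched.
function smartDashes(string $text): string
{
    $result = preg_replace_callback(
        '/-{2,}/',
        function (array $m): string {
            return dashesForRun(strlen($m[0]));
        },
        $text
    );
    return $result ?? $text; // fall back to the input if the regex fails
}

echo smartDashes(str_repeat('-', 10)), "\n"; // five en dashes
echo smartDashes(str_repeat('-', 7)), "\n";  // one em dash, two en dashes
```

Because the even check comes before the remainder-2 check, a run of 8 hyphens becomes four en dashes rather than two em dashes and an en dash; swapping those two branches would change that choice.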

Is this something that CommonMark would be interested in implementing? (I can do it; I just don't want to spend the time writing the code if it won't be accepted.) Or should the smart_punct.txt file be updated with tests that check for these edge cases?