Using bold in italics breaks italic markdown #1775

santaimpersonator · 2022-08-04T22:10:32Z

Description

I am running into a bug, similar to this discussion (squidfunk/mkdocs-material#2913 (comment)), where implementing a bold styling breaks the italics. However, it isn't an issue in reverse (italics in bold):

I believe I am using the latest release of the mkdocs-material theme 8.4.0, as the issue mentioned in the discussion is fixed:

I found a workaround to the problem by using underscores (_) instead of asterisks (*):

Minimal Reproduction

Issue:

*(i.e. **Tools** > **Board** > **ESP32 Arduino** > `<board>`).*

(i.e. Tools > Board > ESP32 Arduino > <board>).

Test 1:

**Something *italic* in a bold statement.**

Something italic in a bold statement.

Test 2:

*Something **bold** in an italic statement.*

Something bold in an italic statement.

Workaround:

_Something **bold** in an italic statement._

Something bold in an italic statement.

Generic example:

***I'm italic and bold* I am just bold.**

I'm italic and bold I am just bold.

***I'm bold and italic!** I am just italic.*

I'm bold and italic! I am just italic.
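The cases above can be rendered side by side with a small script. This is only a sketch: it assumes the `markdown` and `pymdown-extensions` packages are installed, and the helper names (`CASES`, `render_cases`, `betterem_render`) are made up for illustration. The exact HTML produced depends on the installed BetterEm version, so no output is shown.

```python
# Markdown sources taken from the reproduction cases in this report.
CASES = {
    "issue": "*(i.e. **Tools** > **Board** > **ESP32 Arduino** > `<board>`).*",
    "italic in bold": "**Something *italic* in a bold statement.**",
    "bold in italic": "*Something **bold** in an italic statement.*",
    "workaround": "_Something **bold** in an italic statement._",
}


def render_cases(render, cases=CASES):
    """Render every case with the given callable and return {label: html}."""
    return {label: render(source) for label, source in cases.items()}


def betterem_render(source):
    """Render one string with Python Markdown plus the BetterEm extension."""
    # Imported lazily so render_cases() can be used without the package.
    import markdown

    return markdown.markdown(source, extensions=["pymdownx.betterem"])
```

Calling `render_cases(betterem_render)` and printing each entry makes it easy to compare the asterisk and underscore variants on a given install.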

Version(s) & System Info

Operating System: Win 11 (running WSL) -> Ubuntu 20.04.4 LTS
Python Version: 3.8
Package Version: mkdocs 1.3.1


facelessuser · 2022-08-05T13:39:47Z

I wouldn't say this is a bug per se; it is currently expected behavior. The question is whether we can adjust the rules in a sane and meaningful way without causing other side effects.

Now, I'm not saying this won't be fixed, but we'd need to be careful. Python Markdown fundamentally handles its rules differently than other parsers. Some will gather all rules recursively as they are passed, some will execute all * token rules, then take another pass on the same paragraph with _ rules (Python Markdown), etc.

I'll give the TL;DR here:

Currently, this is expected behavior based on the rules that BetterEm implements. BetterEm uses a series of regexes, and yes, that can be limiting. The default Python Markdown also uses a series of regexes and may not run into this specific case, but it also has other confusing (or maybe, to some, less confusing) cases. What was stated as a workaround is what I would have done intuitively.

I admit that it would be nice if BetterEm handled the aforementioned case, but such a change would require new rules that do not affect current behavior and/or a rewrite moving away from the old regex approach that we simply extended from Python Markdown.

Why is this not a bug?

Historically, Python Markdown implemented bold and italic with simple regex. They just gobbled up text between the two tokens and called it a day. Ours improved upon that. Some intelligence was added to prevent some tokens from starting new spans, like (*<space>), and some rules were added to prevent tokens that trail a word but are separated from it by a space from ending a span (<space>*).

*(i.e. **Tools**

The first * is separated from the other two, and we don't allow bold or emphasis to close unless the token is at the tail of a word (e.g. *word<space>* won't match). But *word<space>** will match, as the last * is not separated anymore. Now the second * is inside an emphasis tag and is separated from the end ** so they won't interact:

<em>(i.e. *</em>Tools**
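The effect described above can be mimicked with a toy regex. To be clear, this is not BetterEm's actual pattern; it is a simplified, assumed rule (the name `TOY_EM` is hypothetical) that only encodes the two whitespace restrictions mentioned: the opening * may not be followed by a space, and the closing * may not be preceded by one.

```python
import re

# Toy emphasis rule, NOT BetterEm's real pattern (an assumption for illustration):
#   - the opening * must not be followed by whitespace
#   - the closing * must not be preceded by whitespace
TOY_EM = re.compile(r'\*(?!\s)(.+?)(?<!\s)\*')


def toy_em(text):
    """Wrap every toy-rule match in <em> tags."""
    return TOY_EM.sub(r'<em>\1</em>', text)


print(toy_em('*word *'))           # *word *  (closing * follows a space: no match)
print(toy_em('*word **'))          # <em>word *</em>  (second * is a valid closer)
print(toy_em('*(i.e. **Tools**'))  # <em>(i.e. *</em>Tools**
```

Because the rule runs left to right and accepts the first valid closer, the lone opening * pairs with the second * of the first ** and the remaining tokens no longer line up.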

Such rules are stated in the docs. We do have some special cases, I don't recall them all off the top of my head, but IIRC they are all *** related, such as:

***I'm italic and bold* I am just bold.**

Changes to this behavior would be considered an enhancement, and only if we can sanely do it without breaking other behavior.

What do I do in these cases?

Normally, I do not mix bold and italic in complex ways. I would have intuitively done your workaround.

_Something **bold** in an italic statement._

But I think BetterEm should handle this case!

That is a fair argument, but we'd have to come up with a sane way to handle this as the current implementation is just a series of complex regex. How do we identify this case and capture it with regex? Is there some sane regex rule we can inject that would catch this case and do exactly what we'd like it to do?

Maybe some rule that only matches a single * and a trailing * with no other * preceding it or following it? Maybe, there are potentially other ways.

How do other parsers do this?

Well, some may actually parse all rules in one pass, or they may use non-regex rules that are more flexible, unlike Python Markdown, which generally applies regex rules in multiple passes on each paragraph.

Can't we do something similar to other parsers?

Well, maybe. We could completely re-write BetterEm to not use regex.

facelessuser · 2022-08-05T13:54:10Z

Further thinking out loud. The one issue with the Python Markdown parser, particularly with using the inline regex rules, is you have to get all cases resolved in one pass. You can get around this by creating multiple plugin insertions (which we do use), but we are still using basic regex rules in all these cases, and that makes them limited. There is a tradeoff between adding some rules as opposed to being completely relaxed.

I believe a more recursive approach is probably the way to handle such cases, and the way most parsers probably do this.
It's certainly something to explore. It would allow us to handle *, **, and *** in a more dynamic way and maybe better match more modern parsers in this regard.

facelessuser · 2022-08-05T14:05:06Z

We could actually just tokenize all *, ** and ***+ cases in a paragraph and then resolve them all replacing them with appropriate <strong> and <em> tags... Anyways, I've got some ideas we can try.
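The first stage of that idea, collecting every run of * in a paragraph before deciding what each one means, might look roughly like the sketch below. It is purely illustrative and not the implementation that was eventually merged: `StarRun`, `tokenize_stars`, and the `can_open`/`can_close` whitespace checks are assumptions loosely modeled on the rules discussed earlier in this thread. The resolution step that pairs runs into <em> and <strong> tags is left out.

```python
import re
from typing import List, NamedTuple


class StarRun(NamedTuple):
    start: int       # index of the first * in the run
    length: int      # number of consecutive * characters
    can_open: bool   # run is followed by a non-space character
    can_close: bool  # run is preceded by a non-space character


def tokenize_stars(text: str) -> List[StarRun]:
    """Collect every run of * in the text with simple open/close hints."""
    runs = []
    for m in re.finditer(r'\*+', text):
        # Treat the paragraph boundaries as if they were whitespace.
        before = text[m.start() - 1] if m.start() > 0 else ' '
        after = text[m.end()] if m.end() < len(text) else ' '
        runs.append(StarRun(m.start(), m.end() - m.start(),
                            not after.isspace(), not before.isspace()))
    return runs


for run in tokenize_stars('*(i.e. **Tools**'):
    print(run)
```

With all runs known up front, a later pass could match the lone opening * against a lone closing * further along instead of settling for the first * it meets.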

As this is an enhancement, it won't take the same priority as a bug, but I do think it is worth exploring.

santaimpersonator · 2022-08-10T01:23:28Z

@facelessuser Thanks for the in-depth explanation and considering the improvement 👍


facelessuser · 2022-11-07T22:20:11Z

Potential fix for this issue:

>>> import markdown
>>> markdown.markdown('*(i.e. **Tools** > **Board** > **ESP32 Arduino** > `<board>`).*', extensions=['pymdownx.betterem'])
'<p><em>(i.e. <strong>Tools</strong> &gt; <strong>Board</strong> &gt; <strong>ESP32 Arduino</strong> &gt; <code>&lt;board&gt;</code>).</em></p>'

In short, I've added a case to favor the ideal case (*content*) over the less ideal case (*content**). It seems like it will work pretty well, but I need to do a little more testing. The slight change in test cases seems okay and matches other implementations, so I'm thinking this might be a suitable fix.

Fixes #1775

santaimpersonator added the T: bug Bug. label Aug 4, 2022

gir-bot added the S: triage Issue needs triage. label Aug 4, 2022

facelessuser added T: enhancement Enhancement. and removed T: bug Bug. S: triage Issue needs triage. labels Aug 5, 2022

facelessuser added a commit that referenced this issue Nov 7, 2022

Fix for em/strong corner cases

e3ae421

Fixes #1775

facelessuser mentioned this issue Nov 7, 2022

Fix for em/strong corner cases #1853

Merged

facelessuser added a commit that referenced this issue Nov 8, 2022

Fix for em/strong corner cases

bf7ef32

Fixes #1775

facelessuser added a commit that referenced this issue Nov 8, 2022

Fix for em/strong corner cases

0427045

Fixes #1775

facelessuser added a commit that referenced this issue Nov 8, 2022

Fix for em/strong corner cases

81e309c

Fixes #1775

facelessuser closed this as completed in #1853 Nov 8, 2022

facelessuser added a commit that referenced this issue Nov 8, 2022

Fix for em/strong corner cases (#1853)

18ba91e

Fixes #1775
