Using bold in italics breaks italic markdown #1775

Closed
santaimpersonator opened this issue Aug 4, 2022 · 5 comments · Fixed by #1853
Labels
T: enhancement Enhancement.

Comments

@santaimpersonator

santaimpersonator commented Aug 4, 2022

Description

I am running into a bug, similar to this discussion (squidfunk/mkdocs-material#2913 (comment)), where applying bold styling inside italics breaks the italics. However, it isn't an issue in reverse (italics in bold):

image

I believe I am using the latest release of the mkdocs-material theme 8.4.0, as the issue mentioned in the discussion is fixed:

image

I found a workaround to the problem by using underscores (_) instead of asterisks (*):

image

Minimal Reproduction

Issue:

*(i.e. **Tools** > **Board** > **ESP32 Arduino** > `<board>`).*

(i.e. Tools > Board > ESP32 Arduino > <board>).


Test 1:

**Something *italic* in a bold statement.**

Something italic in a bold statement.


Test 2:

*Something **bold** in an italic statement.*

Something bold in an italic statement.


Workaround:

_Something **bold** in an italic statement._

Something bold in an italic statement.


Generic example:

***I'm italic and bold* I am just bold.**

I'm italic and bold I am just bold.


***I'm bold and italic!** I am just italic.*

I'm bold and italic! I am just italic.

Version(s) & System Info

  • Operating System: Win 11 (running WSL) -> Ubuntu 20.04.4 LTS
  • Python Version: 3.8
  • Package Version: mkdocs 1.3.1
@gir-bot gir-bot added the S: triage Issue needs triage. label Aug 4, 2022
@facelessuser facelessuser added T: enhancement Enhancement. and removed T: bug Bug. S: triage Issue needs triage. labels Aug 5, 2022
@facelessuser
Owner

I wouldn't say this is a bug per se; it is currently expected behavior. The question is whether we can adjust the rules in a sane and meaningful way without causing other side effects.

Now, I'm not saying this won't be fixed, but we'd need to be careful. Python Markdown fundamentally handles its rules differently from other parsers. Some will gather all rules recursively as they are parsed, some will execute all * token rules, then take another pass on the same paragraph with _ rules (Python Markdown), etc.

I'll give the TL;DR here:

Currently, this is expected behavior based on the rules that BetterEm implements. BetterEm uses a series of regex patterns, and yes, that can be limiting. The default Python Markdown also uses a series of regex patterns and may not run into this specific case, but it has other confusing cases (or, to some, maybe less confusing ones). What was stated as a workaround is what I would have done intuitively.

I admit that it would be nice if BetterEm handled the aforementioned case, but such a change would require new rules that do not affect current behavior and/or a rewrite that moves away from the old regex approach we simply extended from Python Markdown.

Why is this not a bug?

Historically, Python Markdown implemented bold and italic with simple regex. They just gobbled up text between the two tokens and called it a day. Ours improved upon that. Some intelligence was added to prevent some tokens from starting new spans, like (*<space>), and some rules were added to prevent tokens that trail a word but are separated from it by a space from ending a span (<space>*).

*(i.e. **Tools**

The first * is separated from the other two, and we don't allow a closing bold or emphasis token that is not at the tail of a word (e.g. *word<space>* won't match). But *word<space>** will match, as the last * is no longer separated. The second * then ends up inside the emphasis tag, separated from the closing **, so they won't interact:

<em>(i.e. *</em>Tools**
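To make the mechanics concrete, here is a toy sketch using only Python's `re` module. The pattern `TOY_EM` and the helper `toy_em` are assumptions made up for illustration; this is not BetterEm's actual regex, only a single rule with the same two constraints (no whitespace right after the opening *, no whitespace right before the closing *).

```python
import re

# Hypothetical, simplified emphasis rule (NOT BetterEm's real pattern):
# - the opening * must not be followed by whitespace
# - the closing * must not be preceded by whitespace
TOY_EM = re.compile(r'\*(?!\s)(.+?)(?<!\s)\*')


def toy_em(text: str) -> str:
    """Wrap every span matched by the toy rule in <em> tags."""
    return TOY_EM.sub(r'<em>\1</em>', text)


# "*word *" does not match: the only candidate closing * follows a space.
print(toy_em('*word *'))

# "*word **" matches: the second trailing * follows a *, not a space,
# so the first trailing * is swallowed into the emphasized text.
print(toy_em('*word **'))

# The fragment from the issue: the first * of "**Tools**" lands inside <em>.
print(toy_em('*(i.e. **Tools**'))
```

With this toy rule the last call gives the same broken shape shown above: the opening * pairs with the second * of the ** run, which leaves an unmatched ** at the end.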

Such rules are stated in the docs. We do have some special cases, I don't recall them all off the top of my head, but IIRC they are all *** related, such as:

***I'm italic and bold* I am just bold.**

Changes to this behavior would be considered an enhancement, and only if we can sanely do it without breaking other behavior.

What do I do in these cases?

Normally, I do not mix bold and italic in complex ways. I would have intuitively done your workaround.

_Something **bold** in an italic statement._
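A rough sketch of why switching the outer token helps, assuming the pass order described earlier (all * rules first, then a separate pass with _ rules). The patterns and the `toy_render` helper below are hypothetical simplifications, not the real Python Markdown or BetterEm rules, and this toy does not reproduce how BetterEm handles the asterisk-only form.

```python
import re

# Hypothetical toy rules for illustration only.
STAR_STRONG = re.compile(r'\*\*(?!\s)(.+?)(?<!\s)\*\*')
STAR_EM = re.compile(r'\*(?!\s)(.+?)(?<!\s)\*')
UNDER_EM = re.compile(r'(?<!\w)_(?!\s)(.+?)(?<!\s)_(?!\w)')


def toy_render(text: str) -> str:
    """Apply the * rules first, then take a second pass with the _ rule."""
    # Pass 1: asterisk rules. The underscores are invisible to these patterns.
    text = STAR_STRONG.sub(r'<strong>\1</strong>', text)
    text = STAR_EM.sub(r'<em>\1</em>', text)
    # Pass 2: underscore rule. By now the ** pairs are already <strong> tags.
    text = UNDER_EM.sub(r'<em>\1</em>', text)
    return text


print(toy_render('_Something **bold** in an italic statement._'))
```

Because the outer _ and the inner ** are resolved by different rules in different passes, they cannot pair up with each other the way a lone * and a ** run can.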

But I think BetterEm should handle this case!

That is a fair argument, but we'd have to come up with a sane way to handle this as the current implementation is just a series of complex regex. How do we identify this case and capture it with regex? Is there some sane regex rule we can inject that would catch this case and do exactly what we'd like it to do?

Maybe some rule that only matches a single * and a trailing * with no other * preceding it or following it? Maybe, there are potentially other ways.
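As a rough sketch of that idea, the hypothetical pattern below only accepts a single * whose neighbors are not other * characters, on both the opening and the closing side. The names `SINGLE_EM`, `TOY_STRONG`, and `toy_render_strict` are made up for illustration; this is not a rule that exists in BetterEm, and it ignores many cases a real rule would have to cover.

```python
import re

# Hypothetical rule: a lone opening * (no * before it, no whitespace or *
# after it) paired with a lone closing * (no whitespace or * before it,
# no * after it).
SINGLE_EM = re.compile(r'(?<!\*)\*(?![\s*])(.+?)(?<![\s*])\*(?!\*)')

# Simplified strong rule, also an assumption for this sketch.
TOY_STRONG = re.compile(r'\*\*(?!\s)(.+?)(?<!\s)\*\*')


def toy_render_strict(text: str) -> str:
    """Resolve lone * pairs first, then the remaining ** pairs."""
    text = SINGLE_EM.sub(r'<em>\1</em>', text)
    text = TOY_STRONG.sub(r'<strong>\1</strong>', text)
    return text


print(toy_render_strict('*Something **bold** in an italic statement.*'))
```

Since neither * of a ** run can satisfy the "lone" constraints, the outer pair is the only candidate for emphasis, and the inner ** pairs are left intact for the strong rule.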

How do other parsers do this?

Well, some may actually parse all rules in one pass, or they may not use regex rules at all and so be more flexible, unlike Python Markdown, which generally applies regex rules in multiple passes on each paragraph.

Can't we do something similar to other parsers?

Well, maybe. We could completely re-write BetterEm to not use regex.

@facelessuser
Owner

Further thinking out loud. The one issue with the Python Markdown parser, particularly with using the inline regex rules, is that you have to get all cases resolved in one pass. You can get around this by creating multiple plugin insertions (which we do use), but we are still using basic regex rules in all these cases, and that makes them limited. There is a tradeoff between adding some rules and being completely relaxed.

I believe a more recursive approach is probably the way to handle such cases, and the way most parsers probably do this.
It's certainly something to explore. It would allow us to handle *, **, and *** in a more dynamic way and maybe better match more modern parsers in this regard.

@facelessuser
Owner

We could actually just tokenize all *, ** and ***+ cases in a paragraph and then resolve them all replacing them with appropriate <strong> and <em> tags... Anyways, I've got some ideas we can try.

As this is an enhancement, it won't take the same priority as a bug, but I do think it is worth exploring.
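A minimal sketch of the tokenizing step only, assuming a simple whitespace-based notion of whether a run may open or close a span. The `StarRun` type and `tokenize_stars` function are hypothetical names for illustration; resolving the runs into <strong> and <em> tags is deliberately left out.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class StarRun:
    start: int       # index of the first * in the run
    length: int      # 1 for *, 2 for **, 3 or more for ***+
    can_open: bool   # assumption: the run is followed by non-whitespace
    can_close: bool  # assumption: the run is preceded by non-whitespace


def tokenize_stars(text: str) -> List[StarRun]:
    """Collect every run of * in a paragraph along with its flanking info."""
    runs = []
    for m in re.finditer(r'\*+', text):
        # Treat the paragraph boundaries as whitespace.
        before = text[m.start() - 1] if m.start() > 0 else ' '
        after = text[m.end()] if m.end() < len(text) else ' '
        runs.append(StarRun(
            start=m.start(),
            length=m.end() - m.start(),
            can_open=not after.isspace(),
            can_close=not before.isspace(),
        ))
    return runs


for run in tokenize_stars('*a **b** c*'):
    print(run)
```

A resolver could then walk this list and pair openers with closers of compatible length, instead of relying on one regex to capture a whole span at once.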

@santaimpersonator
Author

@facelessuser Thanks for the in-depth explanation and considering the improvement 👍

@facelessuser
Owner

Potential fix for this issue:

>>> import markdown
>>> markdown.markdown('*(i.e. **Tools** > **Board** > **ESP32 Arduino** > `<board>`).*', extensions=['pymdownx.betterem'])
'<p><em>(i.e. <strong>Tools</strong> &gt; <strong>Board</strong> &gt; <strong>ESP32 Arduino</strong> &gt; <code>&lt;board&gt;</code>).</em></p>'

In short, I've added a case to favor the ideal case (*content*) over the less ideal case (*content**). It seems like it will work pretty well, but I need to do a little more testing. The slight change in test cases seems okay and matches other implementations, so I'm thinking this might be a suitable fix.

facelessuser added a commit that referenced this issue Nov 8, 2022