Fix quadratic behavior with inline HTML #380

Merged
merged 1 commit into from Jun 15, 2021

Conversation

nwellnhof
Contributor

Repeated starting sequences like `<?`, `<!DECL ` or `<![CDATA[` could
lead to quadratic behavior if no matching ending sequence was found.
Separate the inline HTML scanners. Remember when scanning the whole input
for a specific ending sequence has failed, and skip subsequent scans for
that sequence.

The basic idea is to remove suffixes `>`, `?>` and `]]>` from the
respective regexes. Since these regexes are already constructed to match
lazily, they will stop before an ending sequence. To check whether an
ending sequence was found, we can simply test whether the input buffer
is large enough to hold the match plus a potential suffix. If the regex
doesn't find the ending sequence, it will match so many characters that
this test is guaranteed to fail. In this case, we set a flag to avoid
further attempts to execute the regex.
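
A minimal sketch of this idea for the `?>` case, not the code from the commit. The names `scan_pi_body`, `subject`, `skip_pi` and `match_pi` are hypothetical, and a hand-written loop stands in for the lazily matching regex. The `scans` counter exists only to show that the second attempt does no work.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the lazy scanner: counts the bytes from `pos`
 * up to (not including) the first "?>", or up to the end of the input if
 * no "?>" exists. It never consumes the suffix itself. */
static size_t scan_pi_body(const char *buf, size_t len, size_t pos) {
  size_t i = pos;
  while (i < len && !(buf[i] == '?' && i + 1 < len && buf[i + 1] == '>'))
    i++;
  return i - pos;
}

typedef struct {
  const char *buf;
  size_t len;
  bool skip_pi;  /* set once a scan for "?>" has reached the end of input */
  size_t scans;  /* number of scans actually executed (illustration only) */
} subject;

/* `pos` points just after "<?". Returns the length of the body plus the
 * "?>" suffix, or 0 if there is no match. */
static size_t match_pi(subject *s, size_t pos) {
  if (s->skip_pi)
    return 0; /* an earlier scan already proved there is no "?>" ahead */
  size_t n = scan_pi_body(s->buf, s->len, pos);
  s->scans++;
  /* If the scanner ran to the end of the input, there is no room left
   * for the two-byte suffix, so this test fails. */
  if (pos + n + 2 > s->len) {
    s->skip_pi = true;
    return 0;
  }
  return n + 2;
}
```

The flag is safe to keep for the rest of the input because a failed scan from one position means no ending sequence exists at any later position either.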

To check which inline HTML regex to use, we inspect the start of the
text buffer. This allows some fixed characters to be removed from the
start of some regexes. `matchlen` is adjusted with a single addition
that accounts for both the relevant prefix and suffix.

Fixes #299.

jgm merged commit 3253e19 into commonmark:master on Jun 15, 2021
nwellnhof deleted the quadratic-html branch on July 13, 2021 16:06

Successfully merging this pull request may close these issues.

Quadratic behaviour on pathological html