Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Quadratic behaviour on pathological html #299

Closed
marcusklaas opened this issue Apr 29, 2019 · 10 comments
Closed

Quadratic behaviour on pathological html #299

marcusklaas opened this issue Apr 29, 2019 · 10 comments

Comments

@marcusklaas
Copy link

@marcusklaas marcusklaas commented Apr 29, 2019

Found this vulnerability in pulldown-cmark and md4c. It appears cmark is also vulnerable.

python -c 'print("a <![CDATA[" * 10000)' | time cmark > /dev/null
0.40user 0.00system 0:00.42elapsed 95%CPU (0avgtext+0avgdata 9720maxresident)k

python -c 'print("a <![CDATA[" * 20000)' | time cmark > /dev/null
1.60user 0.00system 0:01.62elapsed 98%CPU (0avgtext+0avgdata 17760maxresident)k

python -c 'print("a <![CDATA[" * 40000)' | time cmark > /dev/null
6.20user 0.02system 0:06.25elapsed 99%CPU (0avgtext+0avgdata 34372maxresident)k
@jgm
Copy link
Member

@jgm jgm commented Apr 29, 2019

Thanks!

Loading

@jgm
Copy link
Member

@jgm jgm commented Apr 29, 2019

This has to do with parsing of CDATA elements as inline HTML, not HTML blocks. And this is handled entirely by a scanner defined using re2c, which works in linear time. However, in this case the linear-time parser gets applied repeatedly to the tail of the input string.

It's tricky because, while with regular tags we can quit parsing when we hit <, with CDATA, we need to keep parsing when we hit <![CDATA[, because this can occur within a CDATA element.

I guess we'll need to special-case CDATA somehow, keeping track of the last position we've failed to find ]]>.

Loading

@marcusklaas
Copy link
Author

@marcusklaas marcusklaas commented Apr 29, 2019

I came to a similar conclusion. Maybe simply setting a flag would be enough? If we fail to find ]]> once, I think we will never find it after.

Loading

@mity
Copy link
Contributor

@mity mity commented Apr 29, 2019

cmark is also vulnerable to

time python -c 'print("a" + "<!A" * 40000)' | ./src/cmark >/dev/null

It's likely the same story as it was in MD4C.

EDIT: No. Copied something else. cmark has problem with the one in next comment. Does cmark handle HTML processing instructions differently then other inline raw HTML kinds?

Loading

@mity
Copy link
Contributor

@mity mity commented Apr 29, 2019

And also to

time python -c 'print("a" + "<?" * 40000)' | ./src/cmark >/dev/null

Loading

@jgm
Copy link
Member

@jgm jgm commented Feb 16, 2020

Also

time python -c 'print("a <!A " * 40000)' | time cmark > /dev/null

Loading

@jgm
Copy link
Member

@jgm jgm commented Feb 16, 2020

I think that, instead of handling these with the regex, we need a manual scanner that can keep track of where it failed.

Loading

@nwellnhof
Copy link
Contributor

@nwellnhof nwellnhof commented Mar 22, 2021

Also found by OSS-Fuzz: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=32344

Possible fix: Don't try to reparse these constructs if the (fixed) character sequence required to terminate them wasn't found during the first scan.

Loading

nwellnhof added a commit to nwellnhof/cmark that referenced this issue Mar 22, 2021
Repeated starting sequences like `<?`, `<!DECL ` or `<![CDATA` could
lead to quadratic behavior if no matching ending sequence was found.
Separate the inline HTML scanners. Remember if scanning the whole input
for a specific ending sequence failed and skip subsequent scans.

Fixes commonmark#299.
@nwellnhof
Copy link
Contributor

@nwellnhof nwellnhof commented Mar 22, 2021

Proposed fix: nwellnhof@1944fe7

Loading

@jgm
Copy link
Member

@jgm jgm commented Mar 25, 2021

@nwellnhof I haven't had a chance to really look in detail, but the idea looks good to me on first glance; do you want to submit a PR?

Loading

nwellnhof added a commit to nwellnhof/cmark that referenced this issue Mar 26, 2021
Repeated starting sequences like `<?`, `<!DECL ` or `<![CDATA[` could
lead to quadratic behavior if no matching ending sequence was found.
Separate the inline HTML scanners. Remember if scanning the whole input
for a specific ending sequence failed and skip subsequent scans.

The basic idea is to remove the suffixes `>`, `?>` and `]]>` from the
respective regex. Since these regexes are already constructed to match
lazily, they will stop before an ending sequence. To check whether an
ending sequence was found, we can simply check whether the input buffer
is large enough to hold the match plus a potential suffix. If the regex
doesn't find the ending sequence, it will match so many characters that
this test is guaranteed to fail. In this case, we set a flag to avoid
further attempts to execute the regex.

To check which inline HTML regex to use, we inspect the start of the
text buffer. This allows some fixed characters to be removed from the
start of some regexes. `matchlen`  is adjusted with a single addition
that accounts for both the relevant prefix and suffix.

Fixes commonmark#299.
nwellnhof added a commit to nwellnhof/cmark that referenced this issue Mar 26, 2021
Repeated starting sequences like `<?`, `<!DECL ` or `<![CDATA[` could
lead to quadratic behavior if no matching ending sequence was found.
Separate the inline HTML scanners. Remember if scanning the whole input
for a specific ending sequence failed and skip subsequent scans.

The basic idea is to remove suffixes `>`, `?>` and `]]>` from the
respective regex. Since these regexes are already constructed to match
lazily, they will stop before an ending sequence. To check whether an
ending sequence was found, we can simply test whether the input buffer
is large enough to hold the match plus a potential suffix. If the regex
doesn't find the ending sequence, it will match so many characters that
this test is guaranteed to fail. In this case, we set a flag to avoid
further attempts to execute the regex.

To check which inline HTML regex to use, we inspect the start of the
text buffer. This allows some fixed characters to be removed from the
start of some regexes. `matchlen`  is adjusted with a single addition
that accounts for both the relevant prefix and suffix.

Fixes commonmark#299.
@jgm jgm closed this in #380 Jun 15, 2021
jgm added a commit that referenced this issue Jun 15, 2021
Repeated starting sequences like `<?`, `<!DECL ` or `<![CDATA[` could
lead to quadratic behavior if no matching ending sequence was found.
Separate the inline HTML scanners. Remember if scanning the whole input
for a specific ending sequence failed and skip subsequent scans.

The basic idea is to remove suffixes `>`, `?>` and `]]>` from the
respective regex. Since these regexes are already constructed to match
lazily, they will stop before an ending sequence. To check whether an
ending sequence was found, we can simply test whether the input buffer
is large enough to hold the match plus a potential suffix. If the regex
doesn't find the ending sequence, it will match so many characters that
this test is guaranteed to fail. In this case, we set a flag to avoid
further attempts to execute the regex.

To check which inline HTML regex to use, we inspect the start of the
text buffer. This allows some fixed characters to be removed from the
start of some regexes. `matchlen`  is adjusted with a single addition
that accounts for both the relevant prefix and suffix.

Fixes #299.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

4 participants