Quadratic behaviour on pathological html #299

Open
marcusklaas opened this issue Apr 29, 2019 · 5 comments

Comments

@marcusklaas

commented Apr 29, 2019

I found this vulnerability in pulldown-cmark and md4c. It appears cmark is also vulnerable: in the timings below, doubling the input roughly quadruples the running time.

python -c 'print("a <![CDATA[" * 10000)' | time cmark > /dev/null
0.40user 0.00system 0:00.42elapsed 95%CPU (0avgtext+0avgdata 9720maxresident)k

python -c 'print("a <![CDATA[" * 20000)' | time cmark > /dev/null
1.60user 0.00system 0:01.62elapsed 98%CPU (0avgtext+0avgdata 17760maxresident)k

python -c 'print("a <![CDATA[" * 40000)' | time cmark > /dev/null
6.20user 0.02system 0:06.25elapsed 99%CPU (0avgtext+0avgdata 34372maxresident)k

@jgm

Member

commented Apr 29, 2019

Thanks!

@jgm

Member

commented Apr 29, 2019

This has to do with parsing of CDATA elements as inline HTML, not HTML blocks. And this is handled entirely by a scanner defined using re2c, which works in linear time. However, in this case the linear-time parser gets applied repeatedly to the tail of the input string.

It's tricky because, while with regular tags we can quit parsing when we hit <, with CDATA, we need to keep parsing when we hit <![CDATA[, because this can occur within a CDATA element.

I guess we'll need to special-case CDATA somehow, keeping track of the last position we've failed to find ]]>.
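
To make the repeated rescanning concrete, here is a minimal Python sketch of a toy model. It is not cmark's re2c scanner: `naive_scan_cost` and the constants are hypothetical names, and the function only counts how many characters the search for `]]>` examines when every `<![CDATA[` opener triggers a fresh scan of the remaining tail.

```python
OPEN = "<![CDATA["
CLOSE = "]]>"

def naive_scan_cost(text: str) -> int:
    """Count characters examined while searching for CLOSE after each OPEN.

    Toy model only: each opener starts a new linear search over the tail,
    which is what makes the total quadratic when CLOSE never appears.
    """
    examined = 0
    pos = text.find(OPEN)
    while pos != -1:
        start = pos + len(OPEN)
        end = text.find(CLOSE, start)
        if end == -1:
            # No terminator: this search walked the whole remaining tail.
            examined += len(text) - start
            pos = text.find(OPEN, start)
        else:
            # Terminator found: the search stopped right after it.
            examined += end + len(CLOSE) - start
            pos = text.find(OPEN, end + len(CLOSE))
    return examined

if __name__ == "__main__":
    for n in (1000, 2000, 4000):
        print(n, naive_scan_cost("a <![CDATA[" * n))
```

In this model the count for n repetitions is 11 * n * (n - 1) / 2, so doubling n roughly quadruples the work, which matches the shape of the timings reported above.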

@marcusklaas

Author

commented Apr 29, 2019

I came to a similar conclusion. Maybe simply setting a flag would be enough? If we fail to find ]]> once, I think we will never find it after.

@mity

Contributor

commented Apr 29, 2019

cmark is also vulnerable to

time python -c 'print("a" + "<!A" * 40000)' | ./src/cmark >/dev/null

It's likely the same story as it was in MD4C.

EDIT: No, I copied the wrong command. cmark has a problem with the one in the next comment. Does cmark handle HTML processing instructions differently than other kinds of inline raw HTML?

@mity

Contributor

commented Apr 29, 2019

And also to

time python -c 'print("a" + "<?" * 40000)' | ./src/cmark >/dev/null