Quadratic behaviour on pathological html #299
Found this vulnerability in pulldown-cmark and md4c. It appears cmark is also vulnerable.
python -c 'print("a <![CDATA[" * 10000)' | time cmark > /dev/null
0.40user 0.00system 0:00.42elapsed 95%CPU (0avgtext+0avgdata 9720maxresident)k
python -c 'print("a <![CDATA[" * 20000)' | time cmark > /dev/null
1.60user 0.00system 0:01.62elapsed 98%CPU (0avgtext+0avgdata 17760maxresident)k
python -c 'print("a <![CDATA[" * 40000)' | time cmark > /dev/null
6.20user 0.02system 0:06.25elapsed 99%CPU (0avgtext+0avgdata 34372maxresident)k
This has to do with parsing of CDATA elements as inline HTML, not HTML blocks. And this is handled entirely by a scanner defined using re2c, which works in linear time. However, in this case the linear-time parser gets applied repeatedly to the tail of the input string.
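To illustrate why a linear-time scanner can still add up to quadratic work, here is a toy model in Python. It is not cmark's code: `count_scanned_chars`, `OPEN`, and `CLOSE` are names made up for this sketch, and it assumes the scanner reads all the way to the end of the input whenever no closing `]]>` exists.

```python
OPEN = "<![CDATA["
CLOSE = "]]>"


def count_scanned_chars(text):
    """Toy model (hypothetical, not cmark's implementation).

    At every CDATA opener, search the rest of the input for a closer and
    count how many characters that search had to look at.
    """
    scanned = 0
    pos = 0
    while True:
        start = text.find(OPEN, pos)
        if start == -1:
            return scanned
        body = start + len(OPEN)
        end = text.find(CLOSE, body)
        if end == -1:
            # No closer: the scan ran over the whole tail of the input.
            scanned += len(text) - body
            # Treat '<' as literal text and resume right after it.
            pos = start + 1
        else:
            scanned += end + len(CLOSE) - body
            pos = end + len(CLOSE)


for n in (100, 200, 400):
    print(n, count_scanned_chars("a <![CDATA[" * n))
```

In this model, doubling the number of unclosed openers roughly quadruples the number of characters scanned, which matches the shape of the timings above (0.40s, 1.60s, 6.20s).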
It's tricky because, while with regular tags we can quit parsing when we hit
I guess we'll need to special-case CDATA somehow, keeping track of the last position we've failed to find
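One possible reading of this idea, sketched as a toy model in Python rather than as a patch to cmark: assume the thing we failed to find is the closing `]]>`, and remember the position from which that search failed. Any later search starts further right, so it must fail too and can be skipped. The names `count_scanned_chars_memo` and `failed_from` are hypothetical.

```python
OPEN = "<![CDATA["
CLOSE = "]]>"


def count_scanned_chars_memo(text):
    """Toy model (hypothetical): count characters scanned when failed
    searches for the CDATA closer are remembered."""
    scanned = 0
    pos = 0
    failed_from = None  # earliest position from which no closer was found
    while True:
        start = text.find(OPEN, pos)
        if start == -1:
            return scanned
        body = start + len(OPEN)
        if failed_from is not None and body >= failed_from:
            # A search from an earlier position already reached the end of
            # the input without finding a closer, so this one would too.
            pos = start + 1
            continue
        end = text.find(CLOSE, body)
        if end == -1:
            scanned += len(text) - body
            failed_from = body
            pos = start + 1
        else:
            scanned += end + len(CLOSE) - body
            pos = end + len(CLOSE)


for n in (100, 200, 400):
    print(n, count_scanned_chars_memo("a <![CDATA[" * n))
```

Under these assumptions only the first unclosed opener triggers a full scan, so the counted work grows linearly with the input instead of quadratically.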
It's likely the same story as it was in MD4C.
EDIT: No. Copied something else. cmark has a problem with the one in the next comment. Does cmark handle HTML processing instructions differently than other inline raw HTML kinds?