
ReDoS: add another example to the qhelp in poly-redos, showing how to just limit the length of the input #13164


Merged

erik-krogh merged 5 commits into github:main on May 23, 2023

Conversation

@erik-krogh (Contributor) commented May 15, 2023

Internal: See the backref for the motivation behind this change (scroll down).

github-actions bot (Contributor) commented May 15, 2023

QHelp previews:

java/ql/src/Security/CWE/CWE-730/PolynomialReDoS.qhelp

Polynomial regular expression used on uncontrolled data

Some regular expressions take a long time to match certain input strings to the point where the time it takes to match a string of length n is proportional to n^k or even 2^n. Such regular expressions can negatively affect performance, or even allow a malicious user to perform a Denial of Service ("DoS") attack by crafting an expensive input string for the regular expression to match.

The regular expression engine provided by Java uses a backtracking non-deterministic finite automaton to implement regular expression matching. While this approach is space-efficient and allows supporting advanced features like capture groups, it is not time-efficient in general. The worst-case time complexity of such an automaton can be polynomial or even exponential, meaning that for strings of a certain shape, increasing the input length by ten characters may make the automaton about 1000 times slower.

Typically, a regular expression is affected by this problem if it contains a repetition of the form r* or r+ where the sub-expression r is ambiguous in the sense that it can match some string in multiple ways. More information about the precise circumstances can be found in the references.

Note that Java versions 9 and above have some mitigations against ReDoS; however, these mitigations are not perfect, and more complex regular expressions can still be affected by this problem.

Recommendation

Modify the regular expression to remove the ambiguity, or ensure that the strings matched with the regular expression are short enough that the time-complexity does not matter. Alternatively, use a regex library that guarantees linear-time execution, such as Google's RE2J.
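
For illustration, the whitespace-trimming pattern used in the first example below could be run through RE2J roughly as follows. This is a minimal sketch, assuming the RE2J library (package com.google.re2j) is on the classpath and that text holds the untrusted input; RE2J mirrors much of the java.util.regex API, but it does not support features such as look-behind or back-references.

import com.google.re2j.Pattern;

// Sketch only: RE2J guarantees linear-time matching, so even an ambiguous
// pattern such as this one cannot be exploited for ReDoS.
String trimmed = Pattern.compile("^\\s+|\\s+$").matcher(text).replaceAll("");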

Example

Consider this use of a regular expression, which removes all leading and trailing whitespace in a string:

Pattern.compile("^\\s+|\\s+$").matcher(text).replaceAll(""); // BAD

The sub-expression "\\s+$" will match the whitespace characters in text from left to right, but it can start matching anywhere within a whitespace sequence. This is problematic for strings that do not end with a whitespace character. Such a string will force the regular expression engine to process each whitespace sequence once per whitespace character in the sequence.

This ultimately means that the time cost of trimming a string is quadratic in the length of the string. So a string like "a b" will take milliseconds to process, but a similar string with a million spaces instead of just one will take several minutes.
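
The quadratic behaviour can be observed directly. The following is a rough sketch, not part of the original example: it assumes Java 11 or later (for String.repeat) and builds a string of 50,000 spaces that does not end with whitespace. Exact timings depend on the JVM and hardware, but the run time typically reaches several seconds and grows roughly quadratically as the number of spaces increases.

import java.util.regex.Pattern;

// A string that does not end with whitespace: the "\s+$" alternative is retried
// from every position in the space run and backtracks each time, giving O(n^2) work.
String text = "a" + " ".repeat(50_000) + "b";
long start = System.nanoTime();
Pattern.compile("^\\s+|\\s+$").matcher(text).replaceAll("");
System.out.println("Trimming took " + (System.nanoTime() - start) / 1_000_000 + " ms");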

Avoid this problem by rewriting the regular expression to not contain the ambiguity about when to start matching whitespace sequences. For instance, by using a negative look-behind ("^\\s+|(?<!\\s)\\s+$"), or just by using the built-in trim method (text.trim()).
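
Applied to the original call, either fix looks like this (a sketch reusing the text variable from the example above):

Pattern.compile("^\\s+|(?<!\\s)\\s+$").matcher(text).replaceAll(""); // GOOD: unambiguous
text.trim(); // GOOD: no regex needed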

Note that the sub-expression "^\\s+" is not problematic as the ^ anchor restricts when that sub-expression can start matching, and as the regular expression engine matches from left to right.

Example

As a similar, but slightly subtler problem, consider the regular expression that matches lines with numbers, possibly written using scientific notation:

"^0\\.\\d+E?\\d+$"" 

The problem with this regular expression is in the sub-expression \d+E?\d+ because the second \d+ can start matching digits anywhere after the first match of the first \d+ if there is no E in the input string.

This is problematic for strings that do not end with a digit. Such a string will force the regular expression engine to process each digit sequence once per digit in the sequence, again leading to a quadratic time complexity.

To make the processing faster, the regular expression should be rewritten such that the two \d+ sub-expressions do not have overlapping matches: "^0\\.\\d+(E\\d+)?$".
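
In code, the rewritten pattern could be used like this (a sketch; str is assumed to hold the input line):

Pattern.matches("^0\\.\\d+(E\\d+)?$", str); // GOOD: the two digit groups no longer overlap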

Example

Sometimes it is unclear how a regular expression can be rewritten to avoid the problem. In such cases, it often suffices to limit the length of the input string. For instance, the following regular expression is used to match numbers, and on some non-number inputs it can have quadratic time complexity:

Pattern.matches("^(\\+|-)?(\\d+|(\\d*\\.\\d*))?(E|e)?([-+])?(\\d+)?$", str); 

It is not immediately obvious how to rewrite this regular expression to avoid the problem. However, you can mitigate performance issues by limiting the input to 1000 characters, so that matching always finishes in a reasonable amount of time.

if (str.length() > 1000) {
    throw new IllegalArgumentException("Input too long");
}

Pattern.matches("^(\\+|-)?(\\d+|(\\d*\\.\\d*))?(E|e)?([-+])?(\\d+)?$", str); 

References

javascript/ql/src/Performance/PolynomialReDoS.qhelp

Polynomial regular expression used on uncontrolled data

Some regular expressions take a long time to match certain input strings to the point where the time it takes to match a string of length n is proportional to n^k or even 2^n. Such regular expressions can negatively affect performance, or even allow a malicious user to perform a Denial of Service ("DoS") attack by crafting an expensive input string for the regular expression to match.

The regular expression engines provided by many popular JavaScript platforms use backtracking non-deterministic finite automata to implement regular expression matching. While this approach is space-efficient and allows supporting advanced features like capture groups, it is not time-efficient in general. The worst-case time complexity of such an automaton can be polynomial or even exponential, meaning that for strings of a certain shape, increasing the input length by ten characters may make the automaton about 1000 times slower.

Typically, a regular expression is affected by this problem if it contains a repetition of the form r* or r+ where the sub-expression r is ambiguous in the sense that it can match some string in multiple ways. More information about the precise circumstances can be found in the references.

Recommendation

Modify the regular expression to remove the ambiguity, or ensure that the strings matched with the regular expression are short enough that the time-complexity does not matter.

Example

Consider this use of a regular expression, which removes all leading and trailing whitespace in a string:

text.replace(/^\s+|\s+$/g, ''); // BAD

The sub-expression "\s+$" will match the whitespace characters in text from left to right, but it can start matching anywhere within a whitespace sequence. This is problematic for strings that do not end with a whitespace character. Such a string will force the regular expression engine to process each whitespace sequence once per whitespace character in the sequence.

This ultimately means that the time cost of trimming a string is quadratic in the length of the string. So a string like "a b" will take milliseconds to process, but a similar string with a million spaces instead of just one will take several minutes.

Avoid this problem by rewriting the regular expression to not contain the ambiguity about when to start matching whitespace sequences. For instance, by using a negative look-behind (/^\s+|(?<!\s)\s+$/g), or just by using the built-in trim method (text.trim()).

Note that the sub-expression "^\s+" is not problematic as the ^ anchor restricts when that sub-expression can start matching, and as the regular expression engine matches from left to right.

Example

As a similar, but slightly subtler problem, consider the regular expression that matches lines with numbers, possibly written using scientific notation:

/^0\.\d+E?\d+$/.test(str) // BAD

The problem with this regular expression is in the sub-expression \d+E?\d+ because the second \d+ can start matching digits anywhere after the first match of the first \d+ if there is no E in the input string.

This is problematic for strings that do not end with a digit. Such a string will force the regular expression engine to process each digit sequence once per digit in the sequence, again leading to a quadratic time complexity.

To make the processing faster, the regular expression should be rewritten such that the two \d+ sub-expressions do not have overlapping matches: ^0\.\d+(E\d+)?$.

Example

Sometimes it is unclear how a regular expression can be rewritten to avoid the problem. In such cases, it often suffices to limit the length of the input string. For instance, the following regular expression is used to match numbers, and on some non-number inputs it can have quadratic time complexity:

/^(\+|-)?(\d+|(\d*\.\d*))?(E|e)?([-+])?(\d+)?$/.test(str) // BAD

It is not immediately obvious how to rewrite this regular expression to avoid the problem. However, you can mitigate performance issues by limiting the input to 1000 characters, so that matching always finishes in a reasonable amount of time.

if (str.length > 1000) {
    throw new Error("Input too long");
}

/^(\+|-)?(\d+|(\d*\.\d*))?(E|e)?([-+])?(\d+)?$/.test(str)

References

python/ql/src/Security/CWE-730/PolynomialReDoS.qhelp

Polynomial regular expression used on uncontrolled data

Some regular expressions take a long time to match certain input strings to the point where the time it takes to match a string of length n is proportional to n^k or even 2^n. Such regular expressions can negatively affect performance, or even allow a malicious user to perform a Denial of Service ("DoS") attack by crafting an expensive input string for the regular expression to match.

The regular expression engine provided by Python uses a backtracking non-deterministic finite automaton to implement regular expression matching. While this approach is space-efficient and allows supporting advanced features like capture groups, it is not time-efficient in general. The worst-case time complexity of such an automaton can be polynomial or even exponential, meaning that for strings of a certain shape, increasing the input length by ten characters may make the automaton about 1000 times slower.

Typically, a regular expression is affected by this problem if it contains a repetition of the form r* or r+ where the sub-expression r is ambiguous in the sense that it can match some string in multiple ways. More information about the precise circumstances can be found in the references.

Recommendation

Modify the regular expression to remove the ambiguity, or ensure that the strings matched with the regular expression are short enough that the time-complexity does not matter.

Example

Consider this use of a regular expression, which removes all leading and trailing whitespace in a string:

re.sub(r"^\s+|\s+$", "", text) # BAD

The sub-expression "\s+$" will match the whitespace characters in text from left to right, but it can start matching anywhere within a whitespace sequence. This is problematic for strings that do not end with a whitespace character. Such a string will force the regular expression engine to process each whitespace sequence once per whitespace character in the sequence.

This ultimately means that the time cost of trimming a string is quadratic in the length of the string. So a string like "a b" will take milliseconds to process, but a similar string with a million spaces instead of just one will take several minutes.

Avoid this problem by rewriting the regular expression to not contain the ambiguity about when to start matching whitespace sequences. For instance, by using a negative look-behind (^\s+|(?<!\s)\s+$), or just by using the built-in strip method (text.strip()).

Note that the sub-expression "^\s+" is not problematic as the ^ anchor restricts when that sub-expression can start matching, and as the regular expression engine matches from left to right.

Example

As a similar, but slightly subtler problem, consider the regular expression that matches lines with numbers, possibly written using scientific notation:

^0\.\d+E?\d+$ # BAD

The problem with this regular expression is in the sub-expression \d+E?\d+ because the second \d+ can start matching digits anywhere after the first match of the first \d+ if there is no E in the input string.

This is problematic for strings that do not end with a digit. Such a string will force the regular expression engine to process each digit sequence once per digit in the sequence, again leading to a quadratic time complexity.

To make the processing faster, the regular expression should be rewritten such that the two \d+ sub-expressions do not have overlapping matches: ^0\.\d+(E\d+)?$.

Example

Sometimes it is unclear how a regular expression can be rewritten to avoid the problem. In such cases, it often suffices to limit the length of the input string. For instance, the following regular expression is used to match numbers, and on some non-number inputs it can have quadratic time complexity:

match = re.search(r'^(\+|-)?(\d+|(\d*\.\d*))?(E|e)?([-+])?(\d+)?$', str) 

It is not immediately obvious how to rewrite this regular expression to avoid the problem. However, you can mitigate performance issues by limiting the input to 1000 characters, so that matching always finishes in a reasonable amount of time.

if len(str) > 1000:
    raise ValueError("Input too long")

match = re.search(r'^(\+|-)?(\d+|(\d*\.\d*))?(E|e)?([-+])?(\d+)?$', str) 

References

ruby/ql/src/queries/security/cwe-1333/PolynomialReDoS.qhelp

Polynomial regular expression used on uncontrolled data

Some regular expressions take a long time to match certain input strings to the point where the time it takes to match a string of length n is proportional to n^k or even 2^n. Such regular expressions can negatively affect performance, or even allow a malicious user to perform a Denial of Service ("DoS") attack by crafting an expensive input string for the regular expression to match.

The regular expression engine used by the Ruby interpreter (MRI) uses backtracking non-deterministic finite automata to implement regular expression matching. While this approach is space-efficient and allows supporting advanced features like capture groups, it is not time-efficient in general. The worst-case time complexity of such an automaton can be polynomial or even exponential, meaning that for strings of a certain shape, increasing the input length by ten characters may make the automaton about 1000 times slower.

Note that Ruby 3.2 and later implement a caching mechanism that eliminates this worst-case behavior for the regular expressions flagged by this query. These regular expressions are therefore only problematic for Ruby versions prior to 3.2.

Typically, a regular expression is affected by this problem if it contains a repetition of the form r* or r+ where the sub-expression r is ambiguous in the sense that it can match some string in multiple ways. More information about the precise circumstances can be found in the references.

Recommendation

Modify the regular expression to remove the ambiguity, or ensure that the strings matched with the regular expression are short enough that the time-complexity does not matter.

Example

Consider this use of a regular expression, which removes all leading and trailing whitespace in a string:

text.gsub!(/^\s+|\s+$/, '') # BAD

The sub-expression "\s+$" will match the whitespace characters in text from left to right, but it can start matching anywhere within a whitespace sequence. This is problematic for strings that do not end with a whitespace character. Such a string will force the regular expression engine to process each whitespace sequence once per whitespace character in the sequence.

This ultimately means that the time cost of trimming a string is quadratic in the length of the string. So a string like "a b" will take milliseconds to process, but a similar string with a million spaces instead of just one will take several minutes.

Avoid this problem by rewriting the regular expression to not contain the ambiguity about when to start matching whitespace sequences. For instance, by using a negative look-behind (/^\s+|(?<!\s)\s+$/), or just by using the built-in strip method (text.strip!).

Note that the sub-expression "^\s+" is not problematic as the ^ anchor restricts when that sub-expression can start matching, and as the regular expression engine matches from left to right.

Example

As a similar, but slightly subtler problem, consider the regular expression that matches lines with numbers, possibly written using scientific notation:

/^0\.\d+E?\d+$/ # BAD

The problem with this regular expression is in the sub-expression \d+E?\d+ because the second \d+ can start matching digits anywhere after the first match of the first \d+ if there is no E in the input string.

This is problematic for strings that do not end with a digit. Such a string will force the regular expression engine to process each digit sequence once per digit in the sequence, again leading to a quadratic time complexity.

To make the processing faster, the regular expression should be rewritten such that the two \d+ sub-expressions do not have overlapping matches: /^0\.\d+(E\d+)?$/.

Example

Sometimes it is unclear how a regular expression can be rewritten to avoid the problem. In such cases, it often suffices to limit the length of the input string. For instance, the following regular expression is used to match numbers, and on some non-number inputs it can have quadratic time complexity:

is_matching = /^(\+|-)?(\d+|(\d*\.\d*))?(E|e)?([-+])?(\d+)?$/.match?(str)

It is not immediately obvious how to rewrite this regular expression to avoid the problem. However, you can mitigate performance issues by limiting the input to 1000 characters, so that matching always finishes in a reasonable amount of time.

if str.length > 1000
    raise ArgumentError, "Input too long"
end

is_matching = /^(\+|-)?(\d+|(\d*\.\d*))?(E|e)?([-+])?(\d+)?$/.match?(str)

References

@erik-krogh erik-krogh marked this pull request as ready for review May 15, 2023 17:07
@erik-krogh erik-krogh requested review from a team as code owners May 15, 2023 17:07
@yoff (Contributor) left a comment

The added help text is quite nice 💪 but the formatting change is a little annoying.
I see in the nice previews that it does not matter to the user, but I feel that the previous formatting was easier to read; could we have that back please?

@erik-krogh (Contributor, Author) replied:

I see in the nice previews that it does not matter to the user, but I feel that the previous formatting was easier to read; could we have that back please?

It does matter to the user. The nice previews render nicely because I fixed the formatting.
Check out this render of the previous version: https://codeql.github.com/codeql-query-help/javascript/js-polynomial-redos/#example

@erik-krogh erik-krogh requested a review from yoff May 15, 2023 20:37
@atorralba (Contributor) left a comment

Java changes and render mostly LGTM. Added a couple of comments regarding the use of contractions (which should be avoided according to https://github.com/github/codeql/blob/main/docs/query-help-style-guide.md#english-style).

Do you think we should ping the Docs team for a quick review from their side as well?

@yoff (Contributor) commented May 16, 2023

I see in the nice previews that it does not matter to the user, but I feel that the previous formatting was easier to read; could we have that back please?

It does matter to the user. The nice previews render nicely because I fixed the formatting. Check out this render of the previous version: https://codeql.github.com/codeql-query-help/javascript/js-polynomial-redos/#example

Aha! Well, in that case...

yoff previously approved these changes May 16, 2023

@yoff (Contributor) left a comment

LGTM - good to get the formatting fixed also 😁

@erik-krogh (Contributor, Author) replied:

Do you think we should ping the Docs team for a quick review from their side as well?

Let's do that 👍

@erik-krogh erik-krogh added the ready-for-doc-review This PR requires and is ready for review from the GitHub docs team. label May 17, 2023
@isaacmbrown isaacmbrown self-requested a review May 19, 2023 09:26
isaacmbrown previously approved these changes May 19, 2023

@isaacmbrown (Contributor) left a comment

Hi @erik-krogh from docs, this looks good generally from our point of view! I left a few minor suggestions (the first two apply to all instances just fyi).

<p>
Sometimes it is unclear how a regular expression can be rewritten to
avoid the problem. In such cases, it often suffices to limit the
length of the input string. For instance, the following complicated

I'd maybe say avoid saying 'complicated', I think it has negative connotations that don't really fit the docs tone. You could say 'complex' if you think it's needed?

Suggested change
length of the input string. For instance, the following complicated
length of the input string. For instance, the following


<p>
It is not immediately obvious how to rewrite this regular expression
to avoid the problem. However, it might be fine to limit the length

I think having an authoritative tone would encourage trust in the docs here, WDYT about replacing 'it might be fine', e.g. with something like this?

Suggested change
to avoid the problem. However, it might be fine to limit the length
to avoid the problem. However, you can mitigate performance issues by limiting the length

if (str.length > 1000) {
throw new Error("Input too long");
}
/^(\+|-)?(\d+|(\d*\.\d*))?(E|e)?([-+])?(\d+)?$/.test(str)</sample>

Could we add a line of whitespace here to match the other examples?

Suggested change
/^(\+|-)?(\d+|(\d*\.\d*))?(E|e)?([-+])?(\d+)?$/.test(str)</sample>
/^(\+|-)?(\d+|(\d*\.\d*))?(E|e)?([-+])?(\d+)?$/.test(str)</sample>

@erik-krogh erik-krogh removed the ready-for-doc-review This PR requires and is ready for review from the GitHub docs team. label May 21, 2023
@erik-krogh erik-krogh requested review from yoff and atorralba May 21, 2023 20:19
@yoff (Contributor) left a comment

LGTM

@erik-krogh erik-krogh merged commit 50cb5ea into github:main May 23, 2023