ReDoS: add another example to the qhelp in poly-redos, showing how to just limit the length of the input #13164
Conversation
QHelp previews:

java/ql/src/Security/CWE/CWE-730/PolynomialReDoS.qhelp — Polynomial regular expression used on uncontrolled data

Some regular expressions take a long time to match certain input strings, to the point where the time it takes to match a string of length n is proportional to n^k or even 2^n. Such regular expressions can negatively affect performance, or even allow a malicious user to perform a Denial of Service ("DoS") attack by crafting an expensive input string for the regular expression to match.

The regular expression engine provided by Java uses a backtracking non-deterministic finite automaton to implement regular expression matching. While this approach is space-efficient and allows supporting advanced features like capture groups, it is not time-efficient in general. The worst-case time complexity of such an automaton can be polynomial or even exponential, meaning that for strings of a certain shape, increasing the input length by ten characters may make the automaton about 1000 times slower.

Typically, a regular expression is affected by this problem if it contains a repetition of the form

Note that Java versions 9 and above have some mitigations against ReDoS; however, they are not perfect, and more complex regular expressions can still be affected by this problem.

Recommendation

Modify the regular expression to remove the ambiguity, or ensure that the strings matched with the regular expression are short enough that the time complexity does not matter. Alternatively, a regex library that guarantees linear-time execution, such as Google's RE2J, may be used.

Example

Consider this use of a regular expression, which removes all leading and trailing whitespace in a string:

Pattern.compile("^\\s+|\\s+$").matcher(text).replaceAll("") // BAD

The sub-expression

This ultimately means that the time cost of trimming a string is quadratic in the length of the string. So a string like

Avoid this problem by rewriting the regular expression to not contain the ambiguity about when to start matching whitespace sequences, for instance by using a negative look-behind (

Note that the sub-expression

Example

As a similar, but slightly subtler problem, consider the regular expression that matches lines with numbers, possibly written using scientific notation:

"^0\\.\\d+E?\\d+$"

The problem with this regular expression is in the sub-expression

This is problematic for strings that do not end with a digit. Such a string will force the regular expression engine to process each digit sequence once per digit in the sequence, again leading to a quadratic time complexity.

To make the processing faster, the regular expression should be rewritten such that the two

Example

Sometimes it is unclear how a regular expression can be rewritten to avoid the problem. In such cases, it often suffices to limit the length of the input string. For instance, the following regular expression is used to match numbers, and on some non-number inputs it can have quadratic time complexity:

Pattern.matches("^(\\+|-)?(\\d+|(\\d*\\.\\d*))?(E|e)?([-+])?(\\d+)?$", str);

It is not immediately obvious how to rewrite this regular expression to avoid the problem. However, you can mitigate performance issues by limiting the length to 1000 characters, which will always finish in a reasonable amount of time.

if (str.length() > 1000) {
    throw new IllegalArgumentException("Input too long");
}
Pattern.matches("^(\\+|-)?(\\d+|(\\d*\\.\\d*))?(E|e)?([-+])?(\\d+)?$", str);

References
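For the trimming example in the Java preview above, a sketch of an alternative fix (not part of the qhelp itself): since Java 11, the JDK's built-in String.strip() removes leading and trailing whitespace in linear time, avoiding the ambiguous regular expression entirely. Note the assumption that Character.isWhitespace semantics (used by strip()) are acceptable; they differ slightly from the regex character class \s.

```java
public class TrimExample {
    // Linear-time alternative to
    // Pattern.compile("^\\s+|\\s+$").matcher(text).replaceAll(""):
    // String.strip() (Java 11+) trims without regex backtracking.
    static String safeTrim(String text) {
        return text.strip();
    }

    public static void main(String[] args) {
        System.out.println("[" + safeTrim("  hello world \t") + "]");
    }
}
```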
javascript/ql/src/Performance/PolynomialReDoS.qhelp — Polynomial regular expression used on uncontrolled data

Some regular expressions take a long time to match certain input strings, to the point where the time it takes to match a string of length n is proportional to n^k or even 2^n. Such regular expressions can negatively affect performance, or even allow a malicious user to perform a Denial of Service ("DoS") attack by crafting an expensive input string for the regular expression to match.

The regular expression engines provided by many popular JavaScript platforms use backtracking non-deterministic finite automata to implement regular expression matching. While this approach is space-efficient and allows supporting advanced features like capture groups, it is not time-efficient in general. The worst-case time complexity of such an automaton can be polynomial or even exponential, meaning that for strings of a certain shape, increasing the input length by ten characters may make the automaton about 1000 times slower.

Typically, a regular expression is affected by this problem if it contains a repetition of the form

Recommendation

Modify the regular expression to remove the ambiguity, or ensure that the strings matched with the regular expression are short enough that the time complexity does not matter.

Example

Consider this use of a regular expression, which removes all leading and trailing whitespace in a string:

text.replace(/^\s+|\s+$/g, ''); // BAD

The sub-expression

This ultimately means that the time cost of trimming a string is quadratic in the length of the string. So a string like

Avoid this problem by rewriting the regular expression to not contain the ambiguity about when to start matching whitespace sequences, for instance by using a negative look-behind (

Note that the sub-expression

Example

As a similar, but slightly subtler problem, consider the regular expression that matches lines with numbers, possibly written using scientific notation:

/^0\.\d+E?\d+$/.test(str) // BAD

The problem with this regular expression is in the sub-expression

This is problematic for strings that do not end with a digit. Such a string will force the regular expression engine to process each digit sequence once per digit in the sequence, again leading to a quadratic time complexity.

To make the processing faster, the regular expression should be rewritten such that the two

Example

Sometimes it is unclear how a regular expression can be rewritten to avoid the problem. In such cases, it often suffices to limit the length of the input string. For instance, the following regular expression is used to match numbers, and on some non-number inputs it can have quadratic time complexity:

/^(\+|-)?(\d+|(\d*\.\d*))?(E|e)?([-+])?(\d+)?$/.test(str) // BAD

It is not immediately obvious how to rewrite this regular expression to avoid the problem. However, you can mitigate performance issues by limiting the length to 1000 characters, which will always finish in a reasonable amount of time.

if (str.length > 1000) {
  throw new Error("Input too long");
}
/^(\+|-)?(\d+|(\d*\.\d*))?(E|e)?([-+])?(\d+)?$/.test(str)

References
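The length guard from the JavaScript example above can be packaged as a reusable check; a minimal sketch (the helper name is an assumption, not from the qhelp):

```javascript
// The potentially-quadratic number pattern from the example above.
const NUMBER_RE = /^(\+|-)?(\d+|(\d*\.\d*))?(E|e)?([-+])?(\d+)?$/;

// Hypothetical helper: rejects oversized inputs before the regex can
// exhibit its worst-case backtracking behaviour.
function isNumber(str) {
  if (str.length > 1000) {
    throw new Error("Input too long");
  }
  return NUMBER_RE.test(str);
}
```

This keeps the guard and the pattern together, so callers cannot accidentally run the regex on unbounded input.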
python/ql/src/Security/CWE-730/PolynomialReDoS.qhelp — Polynomial regular expression used on uncontrolled data

Some regular expressions take a long time to match certain input strings, to the point where the time it takes to match a string of length n is proportional to n^k or even 2^n. Such regular expressions can negatively affect performance, or even allow a malicious user to perform a Denial of Service ("DoS") attack by crafting an expensive input string for the regular expression to match.

The regular expression engine provided by Python uses a backtracking non-deterministic finite automaton to implement regular expression matching. While this approach is space-efficient and allows supporting advanced features like capture groups, it is not time-efficient in general. The worst-case time complexity of such an automaton can be polynomial or even exponential, meaning that for strings of a certain shape, increasing the input length by ten characters may make the automaton about 1000 times slower.

Typically, a regular expression is affected by this problem if it contains a repetition of the form

Recommendation

Modify the regular expression to remove the ambiguity, or ensure that the strings matched with the regular expression are short enough that the time complexity does not matter.

Example

Consider this use of a regular expression, which removes all leading and trailing whitespace in a string:

re.sub(r"^\s+|\s+$", "", text) # BAD

The sub-expression

This ultimately means that the time cost of trimming a string is quadratic in the length of the string. So a string like

Avoid this problem by rewriting the regular expression to not contain the ambiguity about when to start matching whitespace sequences, for instance by using a negative look-behind (

Note that the sub-expression

Example

As a similar, but slightly subtler problem, consider the regular expression that matches lines with numbers, possibly written using scientific notation:

^0\.\d+E?\d+$ # BAD

The problem with this regular expression is in the sub-expression

This is problematic for strings that do not end with a digit. Such a string will force the regular expression engine to process each digit sequence once per digit in the sequence, again leading to a quadratic time complexity.

To make the processing faster, the regular expression should be rewritten such that the two

Example

Sometimes it is unclear how a regular expression can be rewritten to avoid the problem. In such cases, it often suffices to limit the length of the input string. For instance, the following regular expression is used to match numbers, and on some non-number inputs it can have quadratic time complexity:

match = re.search(r'^(\+|-)?(\d+|(\d*\.\d*))?(E|e)?([-+])?(\d+)?$', str)

It is not immediately obvious how to rewrite this regular expression to avoid the problem. However, you can mitigate performance issues by limiting the length to 1000 characters, which will always finish in a reasonable amount of time.

if len(str) > 1000:
    raise ValueError("Input too long")
match = re.search(r'^(\+|-)?(\d+|(\d*\.\d*))?(E|e)?([-+])?(\d+)?$', str)

References
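A small sketch of the Python mitigation above, wrapped as a reusable function (the function name and the configurable cap are assumptions, not from the qhelp):

```python
import re

# The potentially-quadratic number pattern from the example above,
# precompiled so the guard and the regex stay together.
NUMBER_RE = re.compile(r'^(\+|-)?(\d+|(\d*\.\d*))?(E|e)?([-+])?(\d+)?$')

def is_number(s, max_len=1000):
    """Reject oversized inputs before the regex can exhibit its
    worst-case (quadratic) backtracking behaviour."""
    if len(s) > max_len:
        raise ValueError("Input too long")
    return NUMBER_RE.search(s) is not None
```

Callers then get a hard upper bound on matching time regardless of what the input looks like.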
ruby/ql/src/queries/security/cwe-1333/PolynomialReDoS.qhelp — Polynomial regular expression used on uncontrolled data

Some regular expressions take a long time to match certain input strings, to the point where the time it takes to match a string of length n is proportional to n^k or even 2^n. Such regular expressions can negatively affect performance, or even allow a malicious user to perform a Denial of Service ("DoS") attack by crafting an expensive input string for the regular expression to match.

The regular expression engine used by the Ruby interpreter (MRI) uses a backtracking non-deterministic finite automaton to implement regular expression matching. While this approach is space-efficient and allows supporting advanced features like capture groups, it is not time-efficient in general. The worst-case time complexity of such an automaton can be polynomial or even exponential, meaning that for strings of a certain shape, increasing the input length by ten characters may make the automaton about 1000 times slower.

Note that Ruby 3.2 and later have implemented a caching mechanism that completely eliminates the worst-case time complexity for the regular expressions flagged by this query. The regular expressions flagged by this query are therefore only problematic for Ruby versions prior to 3.2.

Typically, a regular expression is affected by this problem if it contains a repetition of the form

Recommendation

Modify the regular expression to remove the ambiguity, or ensure that the strings matched with the regular expression are short enough that the time complexity does not matter.

Example

Consider this use of a regular expression, which removes all leading and trailing whitespace in a string:

text.gsub!(/^\s+|\s+$/, '') # BAD

The sub-expression

This ultimately means that the time cost of trimming a string is quadratic in the length of the string. So a string like

Avoid this problem by rewriting the regular expression to not contain the ambiguity about when to start matching whitespace sequences, for instance by using a negative look-behind (

Note that the sub-expression

Example

As a similar, but slightly subtler problem, consider the regular expression that matches lines with numbers, possibly written using scientific notation:

/^0\.\d+E?\d+$/ # BAD

The problem with this regular expression is in the sub-expression

This is problematic for strings that do not end with a digit. Such a string will force the regular expression engine to process each digit sequence once per digit in the sequence, again leading to a quadratic time complexity.

To make the processing faster, the regular expression should be rewritten such that the two

Example

Sometimes it is unclear how a regular expression can be rewritten to avoid the problem. In such cases, it often suffices to limit the length of the input string. For instance, the following regular expression is used to match numbers, and on some non-number inputs it can have quadratic time complexity:

is_matching = /^(\+|-)?(\d+|(\d*\.\d*))?(E|e)?([-+])?(\d+)?$/.match?(str)

It is not immediately obvious how to rewrite this regular expression to avoid the problem. However, you can mitigate performance issues by limiting the length to 1000 characters, which will always finish in a reasonable amount of time.

if str.length > 1000
  raise ArgumentError, "Input too long"
end
is_matching = /^(\+|-)?(\d+|(\d*\.\d*))?(E|e)?([-+])?(\d+)?$/.match?(str)

References
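The Ruby mitigation above can likewise be bundled into a small predicate method; a sketch under the assumption that the method name and the 1000-character cap are illustrative choices:

```ruby
# The potentially-quadratic number pattern from the example above.
NUMBER_RE = /^(\+|-)?(\d+|(\d*\.\d*))?(E|e)?([-+])?(\d+)?$/

# Hypothetical helper: caps input length before the regex runs, so the
# worst-case backtracking (pre-3.2 MRI) cannot be triggered.
def number?(str)
  raise ArgumentError, "Input too long" if str.length > 1000
  NUMBER_RE.match?(str)
end
```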
The added help text is quite nice 💪 but the formatting change is a little annoying.
I see in the nice previews that it does not matter to the user, but I feel that the previous formatting was easier to read; could we have that back please?
It does matter to the user. The nice previews render nicely because I fixed the formatting.
Java changes and rendering mostly LGTM. Added a couple of comments regarding the use of contractions (which should be avoided according to https://github.com/github/codeql/blob/main/docs/query-help-style-guide.md#english-style).
Do you think we should ping the Docs team for a quick review from their side as well?
Aha! Well, in that case...
LGTM - good to get the formatting fixed also 😁
Let's do that 👍
Hi @erik-krogh from docs, this looks good generally from our point of view! I left a few minor suggestions (the first two apply to all instances just fyi).
<p>
Sometimes it is unclear how a regular expression can be rewritten to
avoid the problem. In such cases, it often suffices to limit the
length of the input string. For instance, the following complicated
I'd maybe say avoid saying 'complicated', I think it has negative connotations that don't really fit the docs tone. You could say 'complex' if you think it's needed?
Suggested change:
- length of the input string. For instance, the following complicated
+ length of the input string. For instance, the following
<p>
It is not immediately obvious how to rewrite this regular expression
to avoid the problem. However, it might be fine to limit the length
I think having an authoritative tone would encourage trust in the docs here, WDYT about replacing 'it might be fine', e.g. with something like this?
Suggested change:
- to avoid the problem. However, it might be fine to limit the length
+ to avoid the problem. However, you can mitigate performance issues by limiting the length
if (str.length > 1000) {
  throw new Error("Input too long");
}
/^(\+|-)?(\d+|(\d*\.\d*))?(E|e)?([-+])?(\d+)?$/.test(str)</sample>
Could we add a line of whitespace here to match the other examples?
Suggested change:
- /^(\+|-)?(\d+|(\d*\.\d*))?(E|e)?([-+])?(\d+)?$/.test(str)</sample>
+ /^(\+|-)?(\d+|(\d*\.\d*))?(E|e)?([-+])?(\d+)?$/.test(str)</sample>
LGTM
Internal: See the backref for what the motivation for this change was (scroll down).