github · erik-krogh · Feb 27, 2020 · Feb 12, 2020
diff --git a/change-notes/1.24/analysis-javascript.md b/change-notes/1.24/analysis-javascript.md
@@ -37,6 +37,7 @@
 | Cross-site scripting through exception (`js/xss-through-exception`) | security, external/cwe/cwe-079, external/cwe/cwe-116              | Highlights potential XSS vulnerabilities where an exception is written to the DOM. Results are not shown on LGTM by default. |
 | Regular expression always matches (`js/regex/always-matches`) | correctness, regular-expressions | Highlights regular expression checks that trivially succeed by matching an empty substring. Results are shown on LGTM by default. |
 | Missing await (`js/missing-await`) | correctness | Highlights expressions that operate directly on a promise object in a nonsensical way, instead of awaiting its result. Results are shown on LGTM by default. |
+| Polynomial regular expression used on uncontrolled data (`js/polynomial-redos`) | security, external/cwe/cwe-730, external/cwe/cwe-400 | Highlights expensive regular expressions that may be used on malicious input. Results are shown on LGTM by default. | 
 | Prototype pollution in utility function (`js/prototype-pollution-utility`) | security, external/cwe/cwe-400, external/cwe/cwe-471 | Highlights recursive copying operations that are susceptible to prototype pollution. Results are shown on LGTM by default. |
 | Unsafe jQuery plugin (`js/unsafe-jquery-plugin`) | Highlights potential XSS vulnerabilities in unsafely designed jQuery plugins. Results are shown on LGTM by default. |
 

diff --git a/javascript/ql/src/Performance/PolynomialReDoS.qhelp b/javascript/ql/src/Performance/PolynomialReDoS.qhelp
@@ -0,0 +1,108 @@
+<!DOCTYPE qhelp PUBLIC
+"-//Semmle//qhelp//EN"
+"qhelp.dtd">
+
+<qhelp>
+
+	<include src="ReDoSIntroduction.qhelp" />
+
+	<example>
+		<p>
+
+			Consider this use of a regular expression, which removes
+			all leading and trailing whitespace in a string:
+
+		</p>
+
+		<sample language="javascript">
+			text.replace(/^\s+|\s+$/g, ''); // BAD
+		</sample>
+
+		<p>
+
+			The sub-expression <code>"\s+$"</code> will match the
+			whitespace characters in <code>text</code> from left to right, but it
+			can start matching anywhere within a whitespace sequence. This is
+			problematic for strings that do <strong>not</strong> end with a whitespace
+			character. Such a string will force the regular expression engine to
+			process each whitespace sequence once per whitespace character in the
+			sequence.
+
+		</p>
+
+		<p>
+
+			This ultimately means that the time cost of trimming a
+			string is quadratic in the length of the string. So a string like
+			<code>"a b"</code> will take milliseconds to process, but a similar
+			string with a million spaces instead of just one will take several
+			minutes.
+
+		</p>
+
+		<p>
+
+			Avoid this problem by rewriting the regular expression to
+			not contain the ambiguity about when to start matching whitespace
+			sequences. For instance, by using a negative look-behind
+			(<code>/^\s+|(?&lt;!\s)\s+$/g</code>), or just by using the built-in trim
+			method (<code>text.trim()</code>).
+
+		</p>
+
+		<p>
+
+			Note that the sub-expression <code>"^\s+"</code> is
+			<strong>not</strong> problematic as the <code>^</code> anchor restricts
+			when that sub-expression can start matching, and as the regular
+			expression engine matches from left to right.
+
+		</p>
+
+	</example>
+
+	<example>
+
+		<p>
+
+			As a similar, but slightly subtler problem, consider the
+			regular expression that matches lines with numbers, possibly written
+			using scientific notation:
+		</p>
+
+		<sample language="javascript">
+			^0\.\d+E?\d+$ // BAD
+		</sample>
+
+		<p>
+
+			The problem with this regular expression is in the
+			sub-expression <code>\d+E?\d+</code> because the second
+			<code>\d+</code> can start matching digits anywhere after the first
+			match of the first <code>\d+</code> if there is no <code>E</code> in
+			the input string.
+
+		</p>
+
+		<p>
+
+			This is problematic for strings that do <strong>not</strong>
+			end with a digit. Such a string will force the regular expression
+			engine to process each digit sequence once per digit in the sequence,
+			again leading to a quadratic time complexity.
+
+		</p>
+
+		<p>
+
+			To make the processing faster, the regular expression
+			should be rewritten such that the two <code>\d+</code> sub-expressions
+			do not have overlapping matches: <code>^0\.\d+(E\d+)?$</code>.
+
+		</p>
+
+	</example>
+
+	<include src="ReDoSReferences.qhelp"/>
+
+</qhelp>
diff --git a/javascript/ql/src/Performance/PolynomialReDoS.ql b/javascript/ql/src/Performance/PolynomialReDoS.ql
@@ -0,0 +1,23 @@
+/**
+ * @name Polynomial regular expression used on uncontrolled data
+ * @description A regular expression that can require polynomial time
+ *              to match user-provided values may be
+ *              vulnerable to denial-of-service attacks.
+ * @kind path-problem
+ * @problem.severity warning
+ * @precision high
+ * @id js/polynomial-redos
+ * @tags security
+ *       external/cwe/cwe-730
+ *       external/cwe/cwe-400
+ */
+
+import javascript
+import semmle.javascript.security.performance.PolynomialReDoS::PolynomialReDoS
+import DataFlow::PathGraph
+
+
+from Configuration cfg, DataFlow::PathNode source, DataFlow::PathNode sink
+where cfg.hasFlowPath(source, sink)
+select sink.getNode(), source, sink, "This expensive $@ use depends on $@.",
+  sink.getNode().(Sink).getRegExp(), "regular expression", source.getNode(), "a user-provided value"
diff --git a/javascript/ql/src/Performance/ReDoS.qhelp b/javascript/ql/src/Performance/ReDoS.qhelp
@@ -1,70 +1,34 @@
 <!DOCTYPE qhelp PUBLIC
-  "-//Semmle//qhelp//EN"
-  "qhelp.dtd">
+"-//Semmle//qhelp//EN"
+"qhelp.dtd">
+
 <qhelp>
 
-<overview>
-<p>
-Some regular expressions take a very long time to match certain input strings to the point where
-the time it takes to match a string of length <i>n</i> is proportional to <i>2<sup>n</sup></i>.
-Such regular expressions can negatively affect performance, or even allow a malicious user to
-perform a Denial of Service ("DoS") attack by crafting an expensive input string for the regular
-expression to match.
-</p>
-<p>
-The regular expression engines provided by many popular JavaScript platforms use backtracking
-non-deterministic finite automata to implement regular expression matching. While this approach
-is space-efficient and allows supporting advanced features like capture groups, it is not
-time-efficient in general. The worst-case time complexity of such an automaton can be exponential,
-meaning that for strings of a certain shape, increasing the input length by ten characters may
-make the automaton about 1000 times slower.
-</p>
-<p>
-Typically, a regular expression is affected by this problem if it contains a repetition of the
-form <code>r*</code> or <code>r+</code> where the sub-expression <code>r</code> is ambiguous in
-the sense that it can match some string in multiple ways. More information about the precise
-circumstances can be found in the references.
-</p>
-</overview>
+	<include src="ReDoSIntroduction.qhelp" />
 
-<recommendation>
-<p>
-Modify the regular expression to remove the ambiguity.
-</p>
-</recommendation>
+	<example>
+		<p>
+			Consider this regular expression:
+		</p>
+		<sample language="javascript">
+			/^_(__|.)+_$/
+		</sample>
+		<p>
+			Its sub-expression <code>"(__|.)+?"</code> can match the string <code>"__"</code> either by the
+			first alternative <code>"__"</code> to the left of the <code>"|"</code> operator, or by two
+			repetitions of the second alternative <code>"."</code> to the right. Thus, a string consisting
+			of an odd number of underscores followed by some other character will cause the regular
+			expression engine to run for an exponential amount of time before rejecting the input.
+		</p>
+		<p>
+			This problem can be avoided by rewriting the regular expression to remove the ambiguity between
+			the two branches of the alternative inside the repetition:
+		</p>
+		<sample language="javascript">
+			/^_(__|[^_])+_$/
+		</sample>
+	</example>
 
-<example>
-<p>
-Consider this regular expression:
-</p>
-<sample language="javascript">
-/^_(__|.)+_$/
-</sample>
-<p>
-Its sub-expression <code>"(__|.)+?"</code> can match the string <code>"__"</code> either by the
-first alternative <code>"__"</code> to the left of the <code>"|"</code> operator, or by two
-repetitions of the second alternative <code>"."</code> to the right. Thus, a string consisting
-of an odd number of underscores followed by some other character will cause the regular
-expression engine to run for an exponential amount of time before rejecting the input.
-</p>
-<p>
-This problem can be avoided by rewriting the regular expression to remove the ambiguity between
-the two branches of the alternative inside the repetition:
-</p>
-<sample language="javascript">
-/^_(__|[^_])+_$/
-</sample>
-</example>
+	<include src="ReDoSReferences.qhelp"/>
 
-<references>
-<li>
-OWASP:
-<a href="https://www.owasp.org/index.php/Regular_expression_Denial_of_Service_-_ReDoS">Regular expression Denial of Service - ReDoS</a>.
-</li>
-<li>Wikipedia: <a href="https://en.wikipedia.org/wiki/ReDoS">ReDoS</a>.</li>
-<li>Wikipedia: <a href="https://en.wikipedia.org/wiki/Time_complexity">Time complexity</a>.</li>
-<li>James Kirrage, Asiri Rathnayake, Hayo Thielecke:
-<a href="http://www.cs.bham.ac.uk/~hxt/research/reg-exp-sec.pdf">Static Analysis for Regular Expression Denial-of-Service Attack</a>.
-</li>
-</references>
 </qhelp>
diff --git a/javascript/ql/src/Performance/ReDoSIntroduction.qhelp b/javascript/ql/src/Performance/ReDoSIntroduction.qhelp
@@ -0,0 +1,55 @@
+<!DOCTYPE qhelp PUBLIC
+"-//Semmle//qhelp//EN"
+"qhelp.dtd">
+<qhelp>
+	<overview>
+		<p>
+
+			Some regular expressions take a long time to match certain
+			input strings to the point where the time it takes to match a string
+			of length <i>n</i> is proportional to <i>n<sup>k</sup></i> or even
+			<i>2<sup>n</sup></i>.  Such regular expressions can negatively affect
+			performance, or even allow a malicious user to perform a Denial of
+			Service ("DoS") attack by crafting an expensive input string for the
+			regular expression to match.
+
+		</p>
+
+		<p>
+
+			The regular expression engines provided by many popular
+			JavaScript platforms use backtracking non-deterministic finite
+			automata to implement regular expression matching. While this approach
+			is space-efficient and allows supporting advanced features like
+			capture groups, it is not time-efficient in general. The worst-case
+			time complexity of such an automaton can be polynomial or even
+			exponential, meaning that for strings of a certain shape, increasing
+			the input length by ten characters may make the automaton about 1000
+			times slower.
+
+		</p>
+
+		<p>
+
+			Typically, a regular expression is affected by this
+			problem if it contains a repetition of the form <code>r*</code> or
+			<code>r+</code> where the sub-expression <code>r</code> is ambiguous
+			in the sense that it can match some string in multiple ways. More
+			information about the precise circumstances can be found in the
+			references.
+
+		</p>
+	</overview>
+
+	<recommendation>
+
+		<p>
+
+			Modify the regular expression to remove the ambiguity, or
+			ensure that the strings matched with the regular expression are short
+			enough that the time-complexity does not matter.
+
+		</p>
+
+	</recommendation>
+</qhelp>
diff --git a/javascript/ql/src/Performance/ReDoSReferences.qhelp b/javascript/ql/src/Performance/ReDoSReferences.qhelp
@@ -0,0 +1,16 @@
+<!DOCTYPE qhelp PUBLIC
+"-//Semmle//qhelp//EN"
+"qhelp.dtd">
+<qhelp>
+	<references>
+		<li>
+			OWASP:
+			<a href="https://www.owasp.org/index.php/Regular_expression_Denial_of_Service_-_ReDoS">Regular expression Denial of Service - ReDoS</a>.
+		</li>
+		<li>Wikipedia: <a href="https://en.wikipedia.org/wiki/ReDoS">ReDoS</a>.</li>
+		<li>Wikipedia: <a href="https://en.wikipedia.org/wiki/Time_complexity">Time complexity</a>.</li>
+		<li>James Kirrage, Asiri Rathnayake, Hayo Thielecke:
+		<a href="http://www.cs.bham.ac.uk/~hxt/research/reg-exp-sec.pdf">Static Analysis for Regular Expression Denial-of-Service Attack</a>.
+		</li>
+	</references>
+</qhelp>
diff --git a/javascript/ql/src/semmle/javascript/security/performance/PolynomialReDoS.qll b/javascript/ql/src/semmle/javascript/security/performance/PolynomialReDoS.qll
@@ -0,0 +1,31 @@
+/**
+ * Provides a taint tracking configuration for reasoning about
+ * polynomial regular expression denial-of-service attacks.
+ *
+ * Note, for performance reasons: only import this file if
+ * `PolynomialReDoS::Configuration` is needed, otherwise
+ * `PolynomialReDoSCustomizations` should be imported instead.
+ */
+import javascript
+
+module PolynomialReDoS {
+  import PolynomialReDoSCustomizations::PolynomialReDoS
+
+  class Configuration extends TaintTracking::Configuration {
+    Configuration() { this = "PolynomialReDoS" }
+
+    override predicate isSource(DataFlow::Node source) { source instanceof Source }
+
+    override predicate isSink(DataFlow::Node sink) { sink instanceof Sink }
+
+    override predicate isSanitizerGuard(TaintTracking::SanitizerGuardNode node) {
+      super.isSanitizerGuard(node) or
+      node instanceof LengthGuard
+    }
+
+    override predicate isSanitizer(DataFlow::Node node) {
+      super.isSanitizer(node) or
+      node instanceof Sanitizer
+    }
+  }
+}