JS: add query js/useless-regexp-character-escape #2001

ghost · 2019-09-23T13:57:59Z

This adds a variant of https://eslint.org/docs/rules/no-useless-escape that focuses on semantic mistakes in regular expression strings. Unlike the eslint query, you want to fix most of these alerts immediately, and not just some rainy day.

The dual of this query, which flags the truly useless escapes in strings and regular expressions can be seen in javascript/ql/test/query-tests/Security/CWE-020/UselessCharacterEscape.ql. Productising this as a query is trivial, but evaluations show that it is going to be very noisy with semantic TPs that does not really need fixing.

The alert part of this PR contains a verbose use of newtype, as it improves the comprehensiveness of the alert messages significantly when the exact bad escape is highlighted rather than the entire source string. See https://lgtm.com/query/7716308191347971287/ for some example alert renderings for ace, vscode and codemirror. One semantic drawback of this UI improvement is that each mistake counts as one alert, as seen for example in the alerts for vscode.

Performance is fine, the overhead is negligble when evaluating the query with the three other CWE-20 queries (IncompleteHostnameRegExp.ql, IncompleteUrlSubstringSanitization.ql, MissingRegExpAnchor.ql). Crucially, CharacterEscapes::getAnEscapedCharacter is in fact so fast that we can easily afford to run it on all default.slugs source code strings, paving the way for a productised UselessCharacterEscape.ql if we want it one day.

Results are plentiful. Although some of the alerts are benign in the style of https://eslint.org/docs/rules/no-useless-escape, in particular because we do not attempt to distinguish the two different mistakes in new RegExp("\\.") and new RegExp("[\\.]").

asger-semmle

Excellent query overall! Very nice results.

the overhead is negligble when evaluating the query with the three other CWE-20 queries (IncompleteHostnameRegExp.ql, IncompleteUrlSubstringSanitization.ql, MissingRegExpAnchor.ql)

I'm not sure negligible is the right word here. How exactly was this data produced? In particular, is the time to compute the cached layers included in the absolute numbers?

If so, what I'm seeing is that it takes upwards of 8% of the time it takes to do SSA, type inference, call graph construction, etc. Not "8% more than these other queries".

asger-semmle · 2019-09-23T14:03:02Z

javascript/ql/src/Security/CWE-020/UselessRegExpCharacterEscape.qhelp

+
+		</p>
+
+		<p>


Empty paragraph?

asger-semmle · 2019-09-23T14:04:11Z

javascript/ql/src/Security/CWE-020/UselessRegExpCharacterEscape.ql

@@ -0,0 +1,142 @@
+/**
+ * @name Useless regular expression character escape
+ * @description Prepending a backslash to a non-special character in a string does not have any effect, and may make regular expressions behave unexpectedly.


Could you add line breaks here?

asger-semmle · 2019-09-23T15:30:15Z

javascript/ql/src/semmle/javascript/CharacterEscapes.qll

+        raw = rawStringNode.(TemplateElement).getRawValue() and
+        quote = "`"
+      |
+        escapableChars = "b" + Sets::ordinaryEscapeChars() + quote + "\\\\"


Could you explain the + "\\\\" part? I understand that we don't want to treat \\ as an identity escape, but escapableChars seems to be used as a set, so why add two backslashes?

Good catch, I have moved "\\" into Sets::ordinaryEscapeChars(). An earlier iteration of this query used this in a regexp character class, so it had to be escaped twice in that context.

asger-semmle · 2019-09-23T15:35:04Z

javascript/ql/src/Security/CWE-020/UselessRegExpCharacterEscape.ql

+}
+
+from RegExpPatternMistake mistake
+select mistake, "The escape sequence " + mistake.getMessage() + "  when it is used in a $@",


double space before when

asger-semmle · 2019-09-23T15:38:27Z

javascript/ql/src/semmle/javascript/CharacterEscapes.qll

+   */
+  string getAnIdentityEscapedCharacter(DataFlow::Node n, ASTNode rawStringNode, int i) {
+    exists(string raw, string escapableChars |
+      exists(string quote |


Do I understand correctly that this exists is to ensure that e.g. \" is flagged inside 'Foo \"bar\" baz'.
Is this issue even worth flagging?

We wont flag such an issue with js/useless-regexp-character-escape, as neither quote has special regex semantics.
But getAnIdentityEscapedCharacter is intended to be complete and independent of the any query purpose, and I think it is appropriate that the dual query in javascript/ql/test/query-tests/Security/CWE-020/UselessCharacterEscape.ql flags '\"' as a useless escape.

ghost · 2019-09-23T20:04:14Z

I'm not sure negligible is the right word here. How exactly was this data produced? In particular, is the time to compute the cached layers included in the absolute numbers?

If so, what I'm seeing is that it takes upwards of 8% of the time it takes to do SSA, type inference, call graph construction, etc. Not "8% more than these other queries".

The data is from a standard dist-compare run, so you are right that an interpretation is that "it takes upwards of 8% of the time it takes to do SSA". Still, the absolute timing number are small for a new query, so I am not concerned about performance at all. I can do a deeper performance inspection if you insist. (The ballpark 20k seconds for a full gecko-dev run puts the relative query cost at 1.5%. The ballpark 90k seconds for a full default.slugs run puts the relative query cost at 1%.)

xiemaisi

Broadly LGTM, with a few suggestions.

I'm not concerned about performance, given that the absolute overhead is small in most cases (except for gecko-dev).

xiemaisi · 2019-09-27T09:57:12Z

javascript/ql/src/Security/CWE-020/UselessRegExpCharacterEscape.qhelp

+	<overview>
+		<p>
+
+			Prepending a single backslash to a non-special character in a


I wonder whether this could be started off more naturally by first recapitulating the concept of backslash escaping in strings (which of course people will be familiar with), then move on to regular expressions, and finally explain the two layers of backslash interpretation happening in strings that are interpreted as regular expressions.

javascript/ql/src/Security/CWE-020/UselessRegExpCharacterEscape.ql

xiemaisi · 2019-09-27T10:05:17Z

javascript/ql/src/Security/CWE-020/UselessRegExpCharacterEscape.ql

+import javascript
+import semmle.javascript.CharacterEscapes::CharacterEscapes
+
+newtype TRegExpPatternMistake =


The injectors of this type could do with qldoc.

xiemaisi · 2019-09-27T10:07:15Z

javascript/ql/src/Security/CWE-020/UselessRegExpCharacterEscape.ql

+    |
+      "b" = getAnEscapedCharacter(raw, index) and
+      not (
+        // except if the string is exactly \b


Could this logic be simplified by additionally keeping the (cooked) value of the string literal around?

xiemaisi · 2019-09-27T10:14:31Z

javascript/ql/src/semmle/javascript/CharacterEscapes.qll

+  private module Sets {
+    string ordinaryEscapeChars() { result = "fnrtvux0\\" }
+
+    string regexpCharRangeChars() { result = "bBcdDsSwW" }


Perhaps also include p and P for Unicode property classes.

I think I'd prefer the terminology "character class" over "character range" here and elsewhere.

javascript/ql/src/semmle/javascript/CharacterEscapes.qll

xiemaisi · 2019-09-27T10:29:47Z

javascript/ql/src/semmle/javascript/CharacterEscapes.qll

+        or
+        rawStringNode = n.asExpr().(TemplateLiteral).getAnElement() and
+        raw = rawStringNode.(TemplateElement).getRawValue() and
+        quote = "`"


This again looks like it could benefit from extracting it into a new predicate.

javascript/ql/src/semmle/javascript/CharacterEscapes.qll

mchammer01

@esben-semmle - this LGTM, just a few minor comments to aid comprehension (and one suggestion to avoid repeating a word).

javascript/ql/src/Security/CWE-020/UselessRegExpCharacterEscape.qhelp

mchammer01 · 2019-09-27T11:26:53Z

javascript/ql/src/Security/CWE-020/UselessRegExpCharacterEscape.qhelp

+			However, the check does not work properly for white space
+			as the two <code>\s</code> occurrences are semantically equivalent to
+			just <code>s</code>, meaning that the check will succeed for strings
+			like <code>"smy-markers"</code> instead of <code>" my-marker


Just double-checking that the white spaces before and after the second expression are deliberate here ;-)

javascript/ql/src/Security/CWE-020/UselessRegExpCharacterEscape.qhelp

ghost · 2019-10-01T07:52:36Z

Comments has been addressed. @mchammer01, can you take another look at the qhelp content? I have elaborated on the intro, as seen in eeab226.

mchammer01

@esben-semmle - thanks for addressing my initial comments, looks good.
I've spotted a tiny typo, and asked a question (I don't have an answer, just asking).
Hope this helps.

javascript/ql/src/Security/CWE-020/UselessRegExpCharacterEscape.qhelp

ghost · 2019-10-01T08:29:13Z

Thanks @mchammer01, I think that is it for the docreview.

mchammer01 · 2019-10-01T10:41:43Z

@esben-semmle - Looks like you didn't apply the hyphen changes (backslashed-escaped). Is this deliberate?

ghost · 2019-10-01T11:04:25Z

Thanks, my push failed due to github's suggestion-commits.

mchammer01

Thank you @esben-semmle, good to go from my point of view 😀

asger-semmle · 2019-10-10T10:45:20Z

javascript/ql/src/Security/CWE-020/UselessRegExpCharacterEscape.qhelp

+			Characters in string/template literals, and
+			regular expression literals can be backslash escaped by prepending a
+			backslash to them. A common example of this is <code>"\n"</code> which
+			corresponds to a single <code>newline</code>.


The terminology in this paragraph and the next doesn't seem right to me. It's either incorrect or a tautology depending on what it means for a character to be "escaped".

It think it would be clearer if we just stick to "escape sequence":

When a character in a string literal or regular expression is preceded by a blackslash, it is interpreted as part of an escape sequence. For example, the escape sequence \n corresponds to the newline character, not the n character.

However, not all characters change meaning when used in an escape sequence. In this case, the backslash just makes the character appear to mean something else, but the backslash actually has no effect. For example, the escape sequence \k just means k.

asger-semmle · 2019-10-10T11:00:42Z

javascript/ql/src/Security/CWE-020/UselessRegExpCharacterEscape.qhelp

+
+			The set of characters that change meaning when they are
+			backslash-escaped is different for regular expression literals and
+			string/template literals.


The string/template literal wording seems unnecessary, string literal should be fine (here and below).

asger-semmle

Thanks, LGTM now

ghost · 2019-10-10T20:38:50Z

@mchammer01 can I get another review of the overview section of the qhelp file here? I rewrote it just now.

xiemaisi

A bunch of minor comments and suggestions.

xiemaisi · 2019-10-14T13:59:58Z

javascript/ql/src/Security/CWE-020/UselessRegExpCharacterEscape.qhelp

+			sequences is different for regular expression literals and string
+			literals.
+
+			This can be problematic when a regular expression literal


This paragraph is a little terse. What is a "dynamic regular expression source"? What is "the problem"?

javascript/ql/src/Security/CWE-020/UselessRegExpCharacterEscape.ql

xiemaisi · 2019-10-14T14:03:57Z

javascript/ql/src/Security/CWE-020/UselessRegExpCharacterEscape.ql

+    char = getALikelyRegExpPatternMistake(src, mistake, rawStringNode, index)
+  } or
+  /**
+   * An identity-escaped character at `index` of `rawStringNode` that


This description sounds very general, but the injector only seems to look at \b escapes.

javascript/ql/src/Security/CWE-020/UselessRegExpCharacterEscape.ql

javascript/ql/src/semmle/javascript/CharacterEscapes.qll

xiemaisi · 2019-10-14T14:15:02Z

javascript/ql/src/semmle/javascript/CharacterEscapes.qll

+  private module Sets {
+    string sharedEscapeChars() { result = "fnrtvux0\\" }
+
+    string regexpCharClassChars() { result = "bBcdDpPsSwW" }


As mentioned above, b and B are not character classes but assertions.

mchammer01

@esben-semmle - I did another editorial review of this, as requested.
One minor issue (see inline comment) where you repeat a word (the backslash) and use the wrong conjunction. I've made a suggestion, feel free to ignore if you don't agree.

mchammer01 · 2019-10-14T15:09:03Z

javascript/ql/src/Security/CWE-020/UselessRegExpCharacterEscape.qhelp

+
+			However, not all characters change meaning when used in an
+			escape sequence. In this case, the backslash just makes the character
+			appear to mean something else, but the backslash actually has no


Suggested change

appear to mean something else, but the backslash actually has no

appear to mean something else, and it actually has no

ghost · 2019-10-15T11:52:12Z

Comments addressed.

xiemaisi

Just one final nit (which you could perhaps address as part of cleaning up the commit history?), otherwise lgtm.

javascript/ql/src/Security/CWE-020/UselessRegExpCharacterEscape.qhelp

ghost · 2019-10-15T22:17:10Z

Squashed and rebased.

ghost added the JS label Sep 23, 2019

ghost self-requested a review as a code owner September 23, 2019 13:57

asger-semmle reviewed Sep 23, 2019

View reviewed changes

xiemaisi suggested changes Sep 27, 2019

View reviewed changes

xiemaisi requested a review from mchammer01 September 27, 2019 10:36

mchammer01 reviewed Sep 27, 2019

View reviewed changes

mchammer01 reviewed Oct 1, 2019

View reviewed changes

javascript/ql/src/Security/CWE-020/UselessRegExpCharacterEscape.qhelp Outdated Show resolved Hide resolved

javascript/ql/src/Security/CWE-020/UselessRegExpCharacterEscape.qhelp Outdated Show resolved Hide resolved

mchammer01 previously approved these changes Oct 1, 2019

View reviewed changes

asger-semmle reviewed Oct 10, 2019

View reviewed changes

ghost dismissed mchammer01’s stale review via 95a2e3a October 10, 2019 12:33

asger-semmle previously approved these changes Oct 10, 2019

View reviewed changes

ghost dismissed asger-semmle’s stale review via 95f201d October 10, 2019 20:37

xiemaisi requested a review from mchammer01 October 14, 2019 13:37

xiemaisi suggested changes Oct 14, 2019

View reviewed changes

mchammer01 reviewed Oct 14, 2019

View reviewed changes

xiemaisi suggested changes Oct 15, 2019

View reviewed changes

javascript/ql/src/Security/CWE-020/UselessRegExpCharacterEscape.qhelp Outdated Show resolved Hide resolved

JS: add query js/useless-regexp-character-escape

e1d7434

xiemaisi approved these changes Oct 16, 2019

View reviewed changes

xiemaisi merged commit 7127624 into github:master Oct 16, 2019

	appear to mean something else, but the backslash actually has no
	appear to mean something else, and it actually has no

JS: add query js/useless-regexp-character-escape #2001

JS: add query js/useless-regexp-character-escape #2001

Uh oh!

Conversation

ghost commented Sep 23, 2019

Uh oh!

asger-semmle left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ghost commented Sep 23, 2019

Uh oh!

xiemaisi left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mchammer01 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ghost commented Oct 1, 2019

Uh oh!

mchammer01 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

ghost commented Oct 1, 2019

Uh oh!

mchammer01 commented Oct 1, 2019

Uh oh!

ghost commented Oct 1, 2019

Uh oh!

mchammer01 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

asger-semmle left a comment

Choose a reason for hiding this comment

Uh oh!

ghost commented Oct 10, 2019

Uh oh!

xiemaisi left a comment

Choose a reason for hiding this comment