Skip to content

JS: add query js/useless-regexp-character-escape #2001

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from Oct 16, 2019
Merged

JS: add query js/useless-regexp-character-escape #2001

merged 1 commit into from Oct 16, 2019

Conversation

ghost
Copy link

@ghost ghost commented Sep 23, 2019

This adds a variant of https://eslint.org/docs/rules/no-useless-escape that focuses on semantic mistakes in regular expression strings. Unlike the eslint query, you want to fix most of these alerts immediately, and not just some rainy day.

The dual of this query, which flags the truly useless escapes in strings and regular expressions can be seen in javascript/ql/test/query-tests/Security/CWE-020/UselessCharacterEscape.ql. Productising this as a query is trivial, but evaluations show that it is going to be very noisy with semantic TPs that does not really need fixing.

The alert part of this PR contains a verbose use of newtype, as it improves the comprehensiveness of the alert messages significantly when the exact bad escape is highlighted rather than the entire source string. See https://lgtm.com/query/7716308191347971287/ for some example alert renderings for ace, vscode and codemirror. One semantic drawback of this UI improvement is that each mistake counts as one alert, as seen for example in the alerts for vscode.

Performance is fine, the overhead is negligble when evaluating the query with the three other CWE-20 queries (IncompleteHostnameRegExp.ql, IncompleteUrlSubstringSanitization.ql, MissingRegExpAnchor.ql). Crucially, CharacterEscapes::getAnEscapedCharacter is in fact so fast that we can easily afford to run it on all default.slugs source code strings, paving the way for a productised UselessCharacterEscape.ql if we want it one day.

Results are plentiful. Although some of the alerts are benign in the style of https://eslint.org/docs/rules/no-useless-escape, in particular because we do not attempt to distinguish the two different mistakes in new RegExp("\\.") and new RegExp("[\\.]").

@ghost ghost added the JS label Sep 23, 2019
@ghost ghost self-requested a review as a code owner September 23, 2019 13:57
Copy link
Contributor

@asger-semmle asger-semmle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent query overall! Very nice results.

the overhead is negligble when evaluating the query with the three other CWE-20 queries (IncompleteHostnameRegExp.ql, IncompleteUrlSubstringSanitization.ql, MissingRegExpAnchor.ql)

I'm not sure negligible is the right word here. How exactly was this data produced? In particular, is the time to compute the cached layers included in the absolute numbers?

If so, what I'm seeing is that it takes upwards of 8% of the time it takes to do SSA, type inference, call graph construction, etc. Not "8% more than these other queries".


</p>

<p>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Empty paragraph?

@@ -0,0 +1,142 @@
/**
* @name Useless regular expression character escape
* @description Prepending a backslash to a non-special character in a string does not have any effect, and may make regular expressions behave unexpectedly.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add line breaks here?

raw = rawStringNode.(TemplateElement).getRawValue() and
quote = "`"
|
escapableChars = "b" + Sets::ordinaryEscapeChars() + quote + "\\\\"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain the + "\\\\" part? I understand that we don't want to treat \\ as an identity escape, but escapableChars seems to be used as a set, so why add two backslashes?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I have moved "\\" into Sets::ordinaryEscapeChars(). An earlier iteration of this query used this in a regexp character class, so it had to be escaped twice in that context.

}

from RegExpPatternMistake mistake
select mistake, "The escape sequence " + mistake.getMessage() + " when it is used in a $@",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

double space before when

*/
string getAnIdentityEscapedCharacter(DataFlow::Node n, ASTNode rawStringNode, int i) {
exists(string raw, string escapableChars |
exists(string quote |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I understand correctly that this exists is to ensure that e.g. \" is flagged inside 'Foo \"bar\" baz'.
Is this issue even worth flagging?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We wont flag such an issue with js/useless-regexp-character-escape, as neither quote has special regex semantics.
But getAnIdentityEscapedCharacter is intended to be complete and independent of the any query purpose, and I think it is appropriate that the dual query in javascript/ql/test/query-tests/Security/CWE-020/UselessCharacterEscape.ql flags '\"' as a useless escape.

@ghost
Copy link
Author

ghost commented Sep 23, 2019

I'm not sure negligible is the right word here. How exactly was this data produced? In particular, is the time to compute the cached layers included in the absolute numbers?

If so, what I'm seeing is that it takes upwards of 8% of the time it takes to do SSA, type inference, call graph construction, etc. Not "8% more than these other queries".

The data is from a standard dist-compare run, so you are right that an interpretation is that "it takes upwards of 8% of the time it takes to do SSA". Still, the absolute timing number are small for a new query, so I am not concerned about performance at all. I can do a deeper performance inspection if you insist. (The ballpark 20k seconds for a full gecko-dev run puts the relative query cost at 1.5%. The ballpark 90k seconds for a full default.slugs run puts the relative query cost at 1%.)

Copy link

@xiemaisi xiemaisi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Broadly LGTM, with a few suggestions.

I'm not concerned about performance, given that the absolute overhead is small in most cases (except for gecko-dev).

<overview>
<p>

Prepending a single backslash to a non-special character in a

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether this could be started off more naturally by first recapitulating the concept of backslash escaping in strings (which of course people will be familiar with), then move on to regular expressions, and finally explain the two layers of backslash interpretation happening in strings that are interpreted as regular expressions.

import javascript
import semmle.javascript.CharacterEscapes::CharacterEscapes

newtype TRegExpPatternMistake =

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The injectors of this type could do with qldoc.

|
"b" = getAnEscapedCharacter(raw, index) and
not (
// except if the string is exactly \b

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this logic be simplified by additionally keeping the (cooked) value of the string literal around?

private module Sets {
string ordinaryEscapeChars() { result = "fnrtvux0\\" }

string regexpCharRangeChars() { result = "bBcdDsSwW" }

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps also include p and P for Unicode property classes.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd prefer the terminology "character class" over "character range" here and elsewhere.

or
rawStringNode = n.asExpr().(TemplateLiteral).getAnElement() and
raw = rawStringNode.(TemplateElement).getRawValue() and
quote = "`"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This again looks like it could benefit from extracting it into a new predicate.

@xiemaisi xiemaisi requested a review from mchammer01 September 27, 2019 10:36
Copy link
Contributor

@mchammer01 mchammer01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@esben-semmle - this LGTM, just a few minor comments to aid comprehension (and one suggestion to avoid repeating a word).

However, the check does not work properly for white space
as the two <code>\s</code> occurrences are semantically equivalent to
just <code>s</code>, meaning that the check will succeed for strings
like <code>"smy-markers"</code> instead of <code>" my-marker
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just double-checking that the white spaces before and after the second expression are deliberate here ;-)

@ghost
Copy link
Author

ghost commented Oct 1, 2019

Comments has been addressed. @mchammer01, can you take another look at the qhelp content? I have elaborated on the intro, as seen in eeab226.

Copy link
Contributor

@mchammer01 mchammer01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@esben-semmle - thanks for addressing my initial comments, looks good.
I've spotted a tiny typo, and asked a question (I don't have an answer, just asking).
Hope this helps.

@ghost
Copy link
Author

ghost commented Oct 1, 2019

Thanks @mchammer01, I think that is it for the docreview.

@mchammer01
Copy link
Contributor

@esben-semmle - Looks like you didn't apply the hyphen changes (backslashed-escaped). Is this deliberate?

@ghost
Copy link
Author

ghost commented Oct 1, 2019

Thanks, my push failed due to github's suggestion-commits.

mchammer01
mchammer01 previously approved these changes Oct 1, 2019
Copy link
Contributor

@mchammer01 mchammer01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @esben-semmle, good to go from my point of view 😀

Characters in string/template literals, and
regular expression literals can be backslash escaped by prepending a
backslash to them. A common example of this is <code>"\n"</code> which
corresponds to a single <code>newline</code>.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The terminology in this paragraph and the next doesn't seem right to me. It's either incorrect or a tautology depending on what it means for a character to be "escaped".

It think it would be clearer if we just stick to "escape sequence":

When a character in a string literal or regular expression is preceded by a blackslash, it is interpreted as part of an escape sequence. For example, the escape sequence \n corresponds to the newline character, not the n character.

However, not all characters change meaning when used in an escape sequence. In this case, the backslash just makes the character appear to mean something else, but the backslash actually has no effect. For example, the escape sequence \k just means k.


The set of characters that change meaning when they are
backslash-escaped is different for regular expression literals and
string/template literals.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The string/template literal wording seems unnecessary, string literal should be fine (here and below).

asger-semmle
asger-semmle previously approved these changes Oct 10, 2019
Copy link
Contributor

@asger-semmle asger-semmle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM now

@ghost
Copy link
Author

ghost commented Oct 10, 2019

@mchammer01 can I get another review of the overview section of the qhelp file here? I rewrote it just now.

@xiemaisi xiemaisi requested a review from mchammer01 October 14, 2019 13:37
Copy link

@xiemaisi xiemaisi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bunch of minor comments and suggestions.

sequences is different for regular expression literals and string
literals.

This can be problematic when a regular expression literal

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This paragraph is a little terse. What is a "dynamic regular expression source"? What is "the problem"?

char = getALikelyRegExpPatternMistake(src, mistake, rawStringNode, index)
} or
/**
* An identity-escaped character at `index` of `rawStringNode` that

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This description sounds very general, but the injector only seems to look at \b escapes.

private module Sets {
string sharedEscapeChars() { result = "fnrtvux0\\" }

string regexpCharClassChars() { result = "bBcdDpPsSwW" }

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned above, b and B are not character classes but assertions.

Copy link
Contributor

@mchammer01 mchammer01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@esben-semmle - I did another editorial review of this, as requested.
One minor issue (see inline comment) where you repeat a word (the backslash) and use the wrong conjunction. I've made a suggestion, feel free to ignore if you don't agree.


However, not all characters change meaning when used in an
escape sequence. In this case, the backslash just makes the character
appear to mean something else, but the backslash actually has no
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
appear to mean something else, but the backslash actually has no
appear to mean something else, and it actually has no

@ghost
Copy link
Author

ghost commented Oct 15, 2019

Comments addressed.

Copy link

@xiemaisi xiemaisi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one final nit (which you could perhaps address as part of cleaning up the commit history?), otherwise lgtm.

@ghost
Copy link
Author

ghost commented Oct 15, 2019

Squashed and rebased.

@xiemaisi xiemaisi merged commit 7127624 into github:master Oct 16, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants