-
Notifications
You must be signed in to change notification settings - Fork 1.8k
JS: Add UntrustedDataToExternalAPI query #4694
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments from a superficial review. Impressively sophisticated; it's a shame you had to fight the API-graphs library in so many places, I hope we can find the time to go back and improve its interface to make this sort of thing easier in future.
...l/src/semmle/javascript/security/dataflow/ExternalAPIUsedWithUntrustedDataCustomizations.qll
Outdated
Show resolved
Hide resolved
...l/src/semmle/javascript/security/dataflow/ExternalAPIUsedWithUntrustedDataCustomizations.qll
Outdated
Show resolved
Hide resolved
...l/src/semmle/javascript/security/dataflow/ExternalAPIUsedWithUntrustedDataCustomizations.qll
Outdated
Show resolved
Hide resolved
// getParameter(i) requires a bindingset for i, so use the raw label | ||
param = base.getASuccessor("parameter " + lbl) and | ||
lbl != "-1" // ignore receiver |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the same as param = base.getAParameter() and not param = base.getReceiver()
, which you've also used above? If so, should we have a predicate for that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to param = base.getAParameter() and not param = base.getReceiver()
for now. Looks like I needed the parameter index at some earlier version of the predicate and forgot to clean it up.
It might make sense to look at the API for this. Generally I'd be in favor of being consistent with the data-flow API here (where the receiver isn't considered to be an argument or parameter) and just introduce the verbose getParameterOrReceiver(i)
and getAParameterOrReceiver
for the cases where you want both. We could introduce similar predicates in the data-flow API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe in another PR. The refactoring would probably warrant more evaluations than I'd care to spend on this query.
Thanks for the ping. I'll review this tomorrow (had a bit of a disastrous day today and I am really behind). Hope it's ok! |
Co-authored-by: Max Schaefer <54907921+max-schaefer@users.noreply.github.com>
@mchammer01 no worries. It was GH who pinged you due to code ownership of the qhelp file. I should mention that the qhelp is nearly identical to the Java version from #3938. |
Oh I didn't know that (feel free to ignore some of my comments then) - just in the middle of reviewing this for you 😉 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@asgerf - this LGTM ✨
As mentioned to you, feel free to ignore some of my comments if the content within the qhelp file is nearly identical to that of the equivalent query for Java, which is already live.
Would be good if you could reword the query description (ql file) as it reads a bit oddly.
Where do we stand with regards to release notes? I know we've changed the way we generate these so I am not sure when these get written and by whom (and whether I need to review anything).
@@ -0,0 +1,17 @@ | |||
/** | |||
* @name Frequency counts for external APIs that are used with untrusted data | |||
* @description This reports the external APIs that are used with untrusted data, along with how |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This description doesn't follow the format specified in https://github.com/github/codeql/blob/master/docs/query-metadata-style-guide.md#query-descriptions-description.
Alternatively, what do you think of something like "Use this query to return the external APIs...."?
<overview> | ||
<p>Using unsanitized untrusted data in an external API can cause a variety of security issues. This query reports | ||
all external APIs that are used with untrusted data, along with how frequently the API is used, and how many | ||
unique sources of untrusted data flow to this API. This query is designed primarily to help identify which APIs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unique sources of untrusted data flow to this API. This query is designed primarily to help identify which APIs | |
unique sources of untrusted data flow to this API. This query is primarily designed to help identify the APIs |
<p>Using unsanitized untrusted data in an external API can cause a variety of security issues. This query reports | ||
all external APIs that are used with untrusted data, along with how frequently the API is used, and how many | ||
unique sources of untrusted data flow to this API. This query is designed primarily to help identify which APIs | ||
may be relevant for security analysis of this application.</p> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I am not sure I understand what you mean here, so let me know if my suggestion doesn't make any sense
may be relevant for security analysis of this application.</p> | |
that may be relevant for the analysis of security within this application.</p> |
may be relevant for security analysis of this application.</p> | ||
|
||
<p>An external API is defined as a call to a function that is not defined in the source code, not overridden | ||
in the source code, and is not modeled as a taint step in the default taint library. External APIs may be from the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in the source code, and is not modeled as a taint step in the default taint library. External APIs may be from the | |
in the source code, and not modeled as a taint step in the default taint library. External APIs may come from the |
<p>An external API is defined as a call to a function that is not defined in the source code, not overridden | ||
in the source code, and is not modeled as a taint step in the default taint library. External APIs may be from the | ||
third party dependencies or from internal dependencies. The query will report the external package name, followed | ||
by an access path leading to the function, followed <code>[param x]</code> where <code>x</code> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I don't understand here. Have we got two successive "followed" or is the second one an example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a typo in the second one, it was supposed to say followed by [param x]
, but yes, followed by
occurs twice.
|
||
<p>If the query were to return the API <code>express().get.[callback].[param 'res'].send() [param 0]</code>, | ||
this could correspond to the <code>X</code> in <code>express().get('/foo', (req, res) => res.send(X))</code>. | ||
First we should consider whether this a security relevant sink. In this case, this is writing to a HTTP response, so we should |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First we should consider whether this a security relevant sink. In this case, this is writing to a HTTP response, so we should | |
First we should consider whether this a security-relevant sink. In this case, this is writing to a HTTP response, so we should |
@mchammer01 Sorry to put you through the review unnecessarily. I'd rather not try to mobilise a coordinated change to multiple queries while they are in flight, so I'd prefer to leave it as-is. I'll just mention that these are not ordinary queries, they're not run by default, mainly for use by specialists, and so probably shouldn't be held to the same standards as those in a query suite. @max-schaefer can I ask you for another review? (you're requesting changes) |
Adds a JS version of the query @lcartey added in #3938. The public-facing aspects of the query (naming, API, etc) are based on that query.
We use API graphs to find calls to functions that came from an external API. This isn't a hard guarantee that the call is in fact external, and indeeed we have to filter out common built-ins method names like
substring
andindexOf
as they generate too much noise.Since JS doesn't have nice canonical names for everything, a large part of the problem is coming up with readable names for things. Since the query can have thousands of results I opted for readability over raw precision. We must be able to scan over the sink names to find the interesting sinks in a reasonable amount of time.
I found it was important to place the sink very close to the external call mentioned in the name. An earlier formulation of the query used API graphs to generate sinks based on a property RHS of an object escaping into an external library, but the sink could be arbitrarily far way from the external call and it was often too hard to see what was happening based on that sink location.
To avoid losing that flow, however, we use a flow label to track arbitrarily deep into an object and then report sinks where the value contains a user controlled value.
Mainly to verify that the query doesn't blow up, I ran an evaluation compared to an earlier commit with this query rebased on top. (The query has changed a bit since that run; I might want to run another evaluation.)