Skip to content

Dataflow: Add support for implicit reads #6107

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Jun 24, 2021

Conversation

aschackmull
Copy link
Contributor

This adds a new overridable predicate on configurations:

  /**
   * Holds if an arbitrary number of implicit read steps of content `c` may be
   * taken at `node`.
   */
  predicate allowImplicitRead(Node node, Content c)

The purpose is to support sinks and taint steps accepting non-empty accesspaths.

I've also added a default override in taint-tracking configurations, which the language specific implementations can use to add such default support. This is just a (hopefully helpful) default and can be overridden by specific configurations. For Java, the default implementation is all array-, collection-, and map-value-read steps that match the type at the given sink / additional taint step.

This means that we ought to be able to drop the temporary store-as-taint workaround, which was costing quite some performance.

@asgerf
Copy link
Contributor

asgerf commented Jun 21, 2021

Quick question about how this relates to something we do in the JS libraries. We support custom load steps via:

  /**
   * EXPERIMENTAL. This API may change in the future.
   *
   * Holds if the property `prop` of the object `pred` should be loaded into `succ`.
   */
  predicate isAdditionalLoadStep(DataFlow::Node pred, DataFlow::Node succ, string prop)

I'm wondering if allowImplicitRead(node, c) would be equivalent to isAdditionalLoadStep(node, node, c), or is there something else going on here?

@aschackmull
Copy link
Contributor Author

I'm wondering if allowImplicitRead(node, c) would be equivalent to isAdditionalLoadStep(node, node, c), or is there something else going on here?

It is almost equivalent, yes. There's a slight difference if you also use type pruning, because this would easily remove the flow you gained from isAdditionalLoadStep(node, node, c) (the types don't match up). So conceptually you may think of this as an in-place additional read step with the quirks related to type pruning sorted such that it actually works.

@aschackmull aschackmull force-pushed the dataflow/implicit-reads branch from 5cbbae0 to d383c0f Compare June 21, 2021 12:43
@github-actions
Copy link
Contributor

⚠️ The head of this PR and the base branch were compared for differences in the framework coverage reports. The generated reports are available in the artifacts of this workflow run. The differences will be picked up by the nightly job after the PR gets merged. The following differences were found:

java

Generated file changes for java

  • Changes to framework-coverage-java.rst:
-    `Apache Commons Lang <https://commons.apache.org/proper/commons-lang/>`_,``org.apache.commons.lang3``,,417,,,,,,,,
+    `Apache Commons Lang <https://commons.apache.org/proper/commons-lang/>`_,``org.apache.commons.lang3``,,420,,,,,,,,
-    Totals,,84,1622,181,13,6,6,,33,1,58
+    Totals,,84,1625,181,13,6,6,,33,1,58
  • Changes to framework-coverage-java.csv:
- org.apache.commons.lang3,,,417,,,,,,,,,,,,,,324,93
+ org.apache.commons.lang3,,,420,,,,,,,,,,,,,,292,128

@aschackmull aschackmull force-pushed the dataflow/implicit-reads branch from e2fa670 to c06e152 Compare June 21, 2021 14:09
@github-actions
Copy link
Contributor

⚠️ The head of this PR and the base branch were compared for differences in the framework coverage reports. The generated reports are available in the artifacts of this workflow run. The differences will be picked up by the nightly job after the PR gets merged. The following differences were found:

java

Generated file changes for java

  • Changes to framework-coverage-java.rst:
-    `Apache Commons Lang <https://commons.apache.org/proper/commons-lang/>`_,``org.apache.commons.lang3``,,417,,,,,,,,
+    `Apache Commons Lang <https://commons.apache.org/proper/commons-lang/>`_,``org.apache.commons.lang3``,,420,,,,,,,,
-    Totals,,84,1622,181,13,6,6,,33,1,58
+    Totals,,84,1625,181,13,6,6,,33,1,58
  • Changes to framework-coverage-java.csv:
- org.apache.commons.lang3,,,417,,,,,,,,,,,,,,324,93
+ org.apache.commons.lang3,,,420,,,,,,,,,,,,,,292,128

@smowton
Copy link
Contributor

smowton commented Jun 21, 2021

Started a benchmark to check effects on codeql-go (without any default implementation)

smowton
smowton previously approved these changes Jun 22, 2021
Copy link
Contributor

@smowton smowton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Go benchmark results are good (either neutral or slightly beneficial)

@aschackmull
Copy link
Contributor Author

Java perf measurements are thrown off by a case of bad magic. I'll need to fix it and restart them.

@github-actions
Copy link
Contributor

⚠️ The head of this PR and the base branch were compared for differences in the framework coverage reports. The generated reports are available in the artifacts of this workflow run. The differences will be picked up by the nightly job after the PR gets merged. The following differences were found:

java

Generated file changes for java

  • Changes to framework-coverage-java.rst:
-    `Apache Commons Lang <https://commons.apache.org/proper/commons-lang/>`_,``org.apache.commons.lang3``,,417,,,,,,,,
+    `Apache Commons Lang <https://commons.apache.org/proper/commons-lang/>`_,``org.apache.commons.lang3``,,420,,,,,,,,
-    Totals,,84,1624,181,13,6,6,,33,1,58
+    Totals,,84,1627,181,13,6,6,,33,1,58
  • Changes to framework-coverage-java.csv:
- org.apache.commons.lang3,,,417,,,,,,,,,,,,,,324,93
+ org.apache.commons.lang3,,,420,,,,,,,,,,,,,,292,128

@aschackmull
Copy link
Contributor Author

Looks like the performance regression by the store-as-taint workaround is indeed fixed:

Security/CWE/CWE-090/LdapInjection.ql	363	138	-225
Security/CWE/CWE-078/ExecUnescaped.ql	477	162	-315
Security/CWE/CWE-078/ExecTainted.ql	481	155	-326

But the new feature that's added to the dataflow library to do this does come with an added performance overhead - mainly due to tuple-numbering, as far as I have been able to see.

Copy link
Contributor

@hvitved hvitved left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks solid to me.


/** A reference through the contents of some collection-like container. */
private class CollectionContent extends Content, TCollectionContent {
override string toString() { result = "<element>" }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you change the C++ content toString()s?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To match Java. They were originally just copy-pasted from Java and weren't used in either language. When I added support for collection-flow in Java I updated them to something I thought made more sense (but the old somewhat arbitrary values still lingered in C++). So now that I moved them I thought I might as well update them. I did check with @MathiasVP beforehand that this was ok with him.

@github-actions
Copy link
Contributor

⚠️ The head of this PR and the base branch were compared for differences in the framework coverage reports. The generated reports are available in the artifacts of this workflow run. The differences will be picked up by the nightly job after the PR gets merged. The following differences were found:

java

Generated file changes for java

  • Changes to framework-coverage-java.rst:
-    `Apache Commons Lang <https://commons.apache.org/proper/commons-lang/>`_,``org.apache.commons.lang3``,,417,,,,,,,,
+    `Apache Commons Lang <https://commons.apache.org/proper/commons-lang/>`_,``org.apache.commons.lang3``,,420,,,,,,,,
-    Totals,,84,1634,181,13,6,6,,33,1,58
+    Totals,,84,1637,181,13,6,6,,33,1,58
  • Changes to framework-coverage-java.csv:
- org.apache.commons.lang3,,,417,,,,,,,,,,,,,,324,93
+ org.apache.commons.lang3,,,420,,,,,,,,,,,,,,292,128

@MathiasVP
Copy link
Contributor

https://jenkins.internal.semmle.com/job/Changes/job/CPP-Differences/2106/

CPP-Differences doesn't look super good. There's a clear performance regression on all of the dataflow queries. Is such a big slowdown expected?

Screenshot 2021-06-24 at 14 39 18

@aschackmull
Copy link
Contributor Author

Is such a big slowdown expected?

I'd say unfortunately yes for the time being. It seems like all dataflow queries are taking a hit across all languages. I've been trying to dig into it, but there's no clear single culprit - rather it seems like the wrapping of Node as TNodeNormal with all the necessary accompanying joins are to blame: It's a bunch of additional Node-sized scans and joins and tuplenumberings. On the plus side, I think a lot of the additional overhead can be run in parallel, and I'm not sure whether the LANG-Differences jobs does that (so the real cost may be less severe).

I do have some ideas for improving the situation, but I'd rather look into those as follow-up work.

Copy link
Contributor

@MathiasVP MathiasVP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation!

@aschackmull aschackmull merged commit 95ad8b5 into github:main Jun 24, 2021
@aschackmull aschackmull deleted the dataflow/implicit-reads branch June 24, 2021 13:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants