Skip to content

Data flow: Refactoring + performance improvements #2903

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Mar 23, 2020

Conversation

hvitved
Copy link
Contributor

@hvitved hvitved commented Feb 24, 2020

Highlights:

  • Introduce ReadTaintNode and TaintStoreNode to simplify logic for taint
    getters and taint setters, respectively.
  • nodeCandFwd2: Restrict stored column after a read, based on what it might
    be before a store of the same field.
  • nodeCand2: Restrict read column (renamed from stored) after a store, based
    on what it might be after a read of the same field.
  • Move big step predicates into a LocalFlowBigStep module.
  • Define predicates by dispatch in AccessPath[Front] class.
  • flowCandFwd0: Restrict apf column after a read, as it should be able to match
    a Boolean read column from nodeCand2.
  • flowFwd0: Restrict columns ap and apf after a read, by introducing a
    flowConsCandFwd predicate (similar to what is done in the previous pruning steps).
  • flowFwd0: Restrict columns ap and apf after a store, by introducing a
    flowConsCand predicate (similar to what is done in the previous pruning steps).

@hvitved hvitved force-pushed the dataflow/performance branch from 7a3f125 to bdeebc0 Compare February 26, 2020 08:32
@hvitved hvitved force-pushed the dataflow/performance branch 6 times, most recently from 6ebea81 to 17d491d Compare March 12, 2020 12:37
@hvitved hvitved changed the title WIP Data flow: Refactoring + performance improvements Mar 12, 2020
@hvitved
Copy link
Contributor Author

hvitved commented Mar 12, 2020

hvitved added 2 commits March 13, 2020 13:58
- Introduce `ReadTaintNode` and `TaintStoreNode` to simplify logic for taint
  getters and taint setters, respectively.
- `nodeCandFwd2`: Restrict `stored` column after a read, based on what it might
  be before a store of the same field.
- `nodeCand2`: Restrict `read` column (renamed from `stored`) after a store, based
  on what it might be after a read of the same field.
- Move big step predicates into a `LocalFlowBigStep` module.
- Define predicates by dispatch in `AccessPath[Front]` class.
- `flowCandFwd0`: Restrict `apf` column after a read, as it should be able to match
  a Boolean `read` column from `nodeCand2`.
- `flowFwd0`: Restrict columns `ap` and `apf` after a read, by introducing a
  `flowConsCandFwd` predicate (similar to what is done in the previous pruning steps).
- `flowFwd0`: Restrict columns `ap` and `apf` after a store, by introducing a
  `flowConsCand` predicate (similar to what is done in the previous pruning steps).
@hvitved hvitved force-pushed the dataflow/performance branch from 17d491d to f935f5e Compare March 13, 2020 13:10
@hvitved hvitved marked this pull request as ready for review March 13, 2020 13:24
@hvitved hvitved requested review from a team as code owners March 13, 2020 13:24
@jbj
Copy link
Contributor

jbj commented Mar 16, 2020

The C++ dist builds appear to be within the noise threshold, so no objections from me.

@aschackmull
Copy link
Contributor

A few removed results on JDK.

=== Changes per project (+new -fixed) ===
g/AdoptOpenJDK/openjdk-jdk11u                                               +2 -7
=== Changes per query (+new -fixed) ===
java/similar-file                                                           +2 -2
java/tainted-arithmetic                                                     +0 -4
java/zipslip                                                                +0 -1

Copy link
Contributor

@aschackmull aschackmull left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial comments; I haven't read all the way through yet.

@@ -702,22 +711,36 @@ private predicate argumentFlowsThrough(
)
}

pragma[noinline]
private predicate readStoreNode(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could use some qldoc.

Comment on lines 729 to 731
nodeCand1(arg, config) and
readStoreNode(call, arg, f1, config) and
readStoreCand1(f1, unbind(config))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason not to fold the remaining nodeCand1 and readStoreCand1 lines into readStoreNode?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because that would give a bad join-order. readStoreNode is only there to give a proper join-order.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also using unbind? Ignoring configs (which unbind allows us to do) then all the joins are unary filters, so I don't see how that could be the case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using unbind, i.e.

  TReadStoreNode(DataFlowCall call, ArgumentNode arg, Content f1, Configuration config) {
    exists(Content f2, Node out |
      nodeCand1(arg, config) and
      argumentValueFlowsThrough(call, arg, TContentSome(f1), TContentSome(f2), out) and
      nodeCand1(out, unbind(config)) and
      readStoreCand1(f1, unbind(config)) and
      readStoreCand1(f2, unbind(config))
    )
  }

yields this pipeline:

[2020-03-18 13:01:44] (121s) Tuple counts for dom#DataFlowImpl::TReadStoreNode#ffff:
                      20715     ~0%      {3} r1 = JOIN DataFlowImpl::readStoreCand1#ff AS L WITH DataFlowImplCommon::TContentSome#ff AS R ON FIRST 1 OUTPUT L.<1>, L.<0>, R.<1>
                      429111225 ~0%      {4} r2 = JOIN r1 WITH DataFlowImpl::readStoreCand1#ff_10#join_rhs AS R ON FIRST 1 OUTPUT R.<1>, r1.<1>, r1.<0>, r1.<2>
                      429111225 ~0%      {4} r3 = JOIN r2 WITH DataFlowImplCommon::TContentSome#ff AS R ON FIRST 1 OUTPUT r2.<3>, R.<1>, r2.<1>, r2.<2>
                      235878    ~45%     {5} r4 = JOIN r3 WITH DataFlowImplCommon::Cached::FlowThrough::Final::argumentValueFlowsThrough#fffff#origOrder_23014#join_rhs AS R ON FIRST 2 OUTPUT R.<3>, r3.<2>, r3.<3>, R.<2>, R.<4>
                      198355    ~40%     {6} r5 = JOIN r4 WITH DataFlowImpl::nodeCand1#ff AS R ON FIRST 1 OUTPUT r4.<1>, r4.<2>, r4.<3>, r4.<0>, r4.<4>, R.<1>
                      198355    ~40%     {6} r6 = SELECT r5 ON r5.<1> >= r5.<5>
                      198355    ~40%     {6} r7 = SELECT r6 ON r6.<1> <= r6.<5>
                      198355    ~46%     {5} r8 = SCAN r7 OUTPUT r7.<4>, r7.<0>, r7.<2>, r7.<3>, r7.<5>
                      160468    ~57%     {6} r9 = JOIN r8 WITH DataFlowImpl::nodeCand1#ff AS R ON FIRST 1 OUTPUT r8.<1>, r8.<2>, r8.<3>, r8.<0>, r8.<4>, R.<1>
                      160468    ~57%     {6} r10 = SELECT r9 ON r9.<5> >= r9.<4>
                      160468    ~57%     {6} r11 = SELECT r10 ON r10.<5> <= r10.<4>
                      160468    ~52%     {4} r12 = SCAN r11 OUTPUT r11.<1>, r11.<2>, r11.<0>, r11.<4>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That can't be right, that RA snippet cannot match that QL snippet. It's clearly joining on configuration and there's only two unbind in the RA vs 3 in the QL.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it looks weird, but I just double checked, and the RA above is indeed what I get.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But that RA cannot have come from that QL, so something is mismatched somewhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried it again, same result. Would you mind trying?

Copy link
Contributor

@aschackmull aschackmull left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more comments, still not finished reading through.

Copy link
Contributor

@aschackmull aschackmull left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more comments.

@hvitved
Copy link
Contributor Author

hvitved commented Mar 19, 2020

@aschackmull
Copy link
Contributor

The removed ZipSlip result looks like a FP, which is now successfully pruned by the additional cons candidate tracking.

@aschackmull
Copy link
Contributor

Let's leave the remaining join-order weirdness as-is, then we can deal with it once the cause is known.

@aschackmull aschackmull merged commit 888c504 into github:master Mar 23, 2020
@hvitved hvitved deleted the dataflow/performance branch March 23, 2020 09:14
max-schaefer pushed a commit to max-schaefer/codeql-go that referenced this pull request Mar 26, 2020
max-schaefer pushed a commit to max-schaefer/codeql-go that referenced this pull request May 6, 2020
max-schaefer pushed a commit to max-schaefer/codeql-go that referenced this pull request May 6, 2020
owen-mc pushed a commit to owen-mc/codeql-go that referenced this pull request Jun 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants