Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dataflow: Add a language specific term to join and branch #12236

Merged

Conversation

MathiasVP
Copy link
Contributor

@MathiasVP MathiasVP commented Feb 17, 2023

In the process of adding more dataflow to C/C++ we've seen a large performance regression on a codebase that contains code like:

void foo(int opcode, A* a) {
  switch(opcode) {
    case 0: /*...*/ break;
    case 1: /*...*/ break;
    ...
    case N: /*...*/ break;
  }
}

where the code inside /* ... */ causes the dataflow library to create a giant number of access paths:

# n stage nodes field conscand states tuples config
1 10 1 Fwd 95703 234 -1 1 116835 TaintedPathConfiguration
2 15 1 Rev 29976 111 -1 1 41712 TaintedPathConfiguration
3 20 2 Fwd 24169 85 147 1 61263 TaintedPathConfiguration
4 25 2 Rev 23181 82 130 1 68834 TaintedPathConfiguration
5 30 3 Fwd 6574 70 745 1 195395 TaintedPathConfiguration
6 35 3 Rev 5929 65 557 1 263019 TaintedPathConfiguration
7 40 4 Fwd 5884 65 1655 1 723901 TaintedPathConfiguration
8 45 4 Rev 5711 65 1561 1 1569580 TaintedPathConfiguration
9 50 5 Fwd 5510 64 4661 1 1657203 TaintedPathConfiguration
10 55 5 Rev 5433 64 3877 1 3534661 TaintedPathConfiguration
11 60 6 Fwd 5433 64 2246 1 1446310 TaintedPathConfiguration
12 65 6 Rev 5303 64 2246 1 1416719 TaintedPathConfiguration

As @aschackmull pointed out, this looks a lot like a virtual dispatch hierarchy like (sorry for my bad Java 😅):

public interface Foo {
  public void foo(A*);
}

public class Foo1 implements Foo {
  public void foo(A* a) { /* ... */ }
}

public class Foo2 implements Foo {
  public void foo(A* a) { /* ... */ }
}

...

public class FooN implements Foo {
  public void foo(A* a) { /* ... */ }
}

and when written like that, the dataflow library recognizes that such calls to foo may have a high fan-in/fan-out.

However, it's not clear how the dataflow library would recognize the switch pattern above to selectively disable field flow for such cases. Thus, this PR adds a new predicate that can be overwritten for each language to increase the join and branch values when calculating the expected fan-in and fan-out.

I've only implemented a "flow-into" version as this was the only thing that was necessary to fix the performance on C/C++.

Implementing this new predicate as none() (as I've done so in this PR) will preserve the current fan-in/fan-out estimates. I hope that makes this PR uncontroversial.

Once we implement the predicate for C/C++ (which I'll do in a follow-up PR) the stats will be the much more manageable:

# n stage nodes field conscand states tuples config
1 10 1 Fwd 95703 234 -1 1 116835 TaintedPathConfiguration
2 15 1 Rev 29976 111 -1 1 41712 TaintedPathConfiguration
3 20 2 Fwd 23561 85 145 1 54250 TaintedPathConfiguration
4 25 2 Rev 21495 79 110 1 49098 TaintedPathConfiguration
5 30 3 Fwd 5964 67 481 1 70626 TaintedPathConfiguration
6 35 3 Rev 5421 65 320 1 62158 TaintedPathConfiguration
7 40 4 Fwd 5393 65 885 1 134163 TaintedPathConfiguration
8 45 4 Rev 5360 65 846 1 248973 TaintedPathConfiguration
9 50 5 Fwd 5217 64 2337 1 432751 TaintedPathConfiguration
10 55 5 Rev 4657 54 2097 1 395159 TaintedPathConfiguration
11 60 6 Fwd 4657 54 1169 1 424518 TaintedPathConfiguration
12 65 6 Rev 4527 54 1169 1 409183 TaintedPathConfiguration

@MathiasVP MathiasVP force-pushed the language-specific-field-flow-branch-limit-term branch from 527a745 to c575155 Compare February 17, 2023 14:10
@MathiasVP MathiasVP marked this pull request as ready for review February 17, 2023 16:04
@MathiasVP MathiasVP requested review from a team as code owners February 17, 2023 16:04
@MathiasVP MathiasVP added the no-change-note-required This PR does not need a change note label Feb 17, 2023
tausbn
tausbn previously approved these changes Feb 17, 2023
Copy link
Contributor

@tausbn tausbn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as Python is concerned, this looks okay, I guess. 🙂
I'll be curious to see how this is actually used. At a glance, I would have no idea what kinds of integers to put in getAdditionalFlowIntoCallNodeTerm.

@MathiasVP
Copy link
Contributor Author

MathiasVP commented Feb 19, 2023

Thanks for the quick 👀, Taus!

[...] I'll be curious to see how this is actually used. At a glance, I would have no idea what kinds of integers to put in getAdditionalFlowIntoCallNodeTerm.

For C/C++ I intent to implement getAdditionalFlowIntoCallNodeTerm(call, p, arg) roughly as "when call's target does a switch on some parameter, this predicate returns the number of cases where p is used".

I'm not sure how relevant such a case would be for Python, since I imagine hope this pattern is replaced with some OOP concept involving inheritance (which would then be captured by the dataflow library already without any additions)

owen-mc
owen-mc previously approved these changes Feb 21, 2023
Copy link
Contributor

@owen-mc owen-mc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved for Go, as it doesn't change anything. Switch statements are used a fair amount in Go, but we haven't noticed any particular slow-down caused by them.

@MathiasVP MathiasVP dismissed stale reviews from owen-mc and tausbn via eed6f57 March 1, 2023 15:28
@MathiasVP MathiasVP requested a review from a team as a code owner March 1, 2023 15:28
@MathiasVP MathiasVP requested review from hmac and removed request for a team March 1, 2023 15:28
@MathiasVP
Copy link
Contributor Author

Thanks for the approvals on this already 🙇! I had to merge in main to resolve some Ruby conflicts.

Copy link

@github-advanced-security github-advanced-security bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CodeQL found more than 10 potential problems in the proposed changes. Check the Files changed tab for more details.

@MathiasVP MathiasVP force-pushed the language-specific-field-flow-branch-limit-term branch from 50e8f5d to 555444f Compare March 1, 2023 16:53
@aschackmull
Copy link
Contributor

You've got yourself a tiny merge conflict now 😉

@MathiasVP MathiasVP force-pushed the language-specific-field-flow-branch-limit-term branch from 981392e to 92ad099 Compare March 6, 2023 13:51
@MathiasVP
Copy link
Contributor Author

@aschackmull a rebase + force-push was the easiest way for me to resolve this fun conflict. I think this PR has been revived now 🤞.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants