Dataflow: Replace stage 3 type pruning with flow-insensitive type pruning. #16785

Merged: 6 commits, Jun 24, 2024

aschackmull commented Jun 19, 2024

Most of the benefit from type pruning in stage 3 appears to be closely correlated with the Content that is tracked in the access path. This means that we can drop flow-sensitive type pruning, which removes the risk of type-based fanout and the resulting performance problems. In its place, this PR adds a flow-insensitive type pruning that merely compares the tracked Content and the type of the current node with the possible container types given by the set of store steps.
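The check described above can be modelled outside of CodeQL. The following is a minimal Python sketch, not the library implementation: the type hierarchy, the `compatible` relation, the store steps, and the names `STORE_STEPS`, `container_types_by_content`, and `keep` are all hypothetical and chosen only for illustration. It shows that the decision depends only on the current node's type and the head Content of the access path, not on any type carried along the flow path.

```python
from collections import defaultdict
from typing import Optional

# Hypothetical toy type hierarchy: child -> parent (None marks the root).
SUPERTYPE = {
    "Object": None,
    "List": "Object",
    "ArrayList": "List",
    "Map": "Object",
    "String": "Object",
}


def ancestors(t):
    """Return t together with all of its supertypes."""
    result = set()
    while t is not None:
        result.add(t)
        t = SUPERTYPE[t]
    return result


def compatible(t1, t2):
    """Toy compatibility (an assumption): one type is a subtype of the other."""
    return t1 in ancestors(t2) or t2 in ancestors(t1)


# Hypothetical store steps: (container type, Content stored into it).
STORE_STEPS = [
    ("ArrayList", "element"),
    ("Map", "mapValue"),
]


def container_types_by_content(store_steps):
    """Collect, per Content, every container type that some store step writes to."""
    table = defaultdict(set)
    for container_type, content in store_steps:
        table[content].add(container_type)
    return dict(table)


CONTAINERS = container_types_by_content(STORE_STEPS)


def keep(node_type: str, head_content: Optional[str]) -> bool:
    """Flow-insensitive pruning check for a (node, access path head) pair.

    An empty access path (head_content is None) is never pruned here.
    Otherwise the node is kept only if its type is compatible with at
    least one container type known to hold that Content.
    """
    if head_content is None:
        return True
    return any(compatible(node_type, c) for c in CONTAINERS.get(head_content, ()))


# A List-typed node may hold an ArrayList, so the "element" path survives.
print(keep("List", "element"))    # True
# A String-typed node can never be a container for "element".
print(keep("String", "element"))  # False
```

Because the table of container types is computed once from the store steps, the check adds no per-path type state, which is the property that avoids type-based fanout in this model.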

So far this PR only changes stage 3. How this affects subsequent changes still needs to be investigated, but this change alone appears to benefit performance, in particular on large databases.

aschackmull commented Jun 21, 2024

I've measured the impact on data flow tuple counts in Java with MRVA on the top 1000, using a combined set of 10 representative queries. I defined the cost to be the sum of forward and reverse tuple counts in stages 2 through 6. The general trend shows an improvement, in particular on larger cases.

| project | before | after | abs diff | pct diff (%) |
| --- | ---: | ---: | ---: | ---: |
| orientechnologies/orientdb | 9,620,589 | 7,453,341 | -2,167,248 | -22 |
| apache/ignite | 6,597,900 | 5,959,172 | -638,728 | -9 |
| apache/hadoop | 5,288,020 | 4,663,205 | -624,815 | -11 |
| apache/solr | 3,237,851 | 2,513,010 | -724,841 | -22 |
| apache/geode | 3,060,906 | 2,881,422 | -179,484 | -5 |
| apache/doris | 1,724,528 | 1,856,326 | 131,798 | 7 |
| imagej/ImageJ | 886,770 | 994,844 | 108,074 | 12 |
| hazelcast/hazelcast | 822,251 | 941,352 | 119,101 | 14 |
| apache/nifi | 799,274 | 751,900 | -47,374 | -5 |
| apache/logging-log4j2 | 327,922 | 367,435 | 39,513 | 12 |
| bazelbuild/bazel | 283,181 | 334,014 | 50,833 | 17 |
| apache/activemq | 282,323 | 332,640 | 50,317 | 17 |
| alibaba/fastjson | 277,994 | 165,213 | -112,781 | -40 |
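
The cost metric and the two diff columns can be sketched in a few lines of Python. This is only an illustration: the `(stage, direction)` dictionary layout and the example tuple counts are made up, since the per-stage numbers are not part of this comment. The percentages in the table are consistent with truncation toward zero rather than rounding, so the sketch truncates as well.

```python
import math


def cost(tuple_counts):
    """Sum forward and reverse tuple counts over stages 2 through 6.

    tuple_counts maps (stage, direction) -> count, where direction is
    "fwd" or "rev". This layout is an assumption made for the sketch.
    """
    return sum(
        tuple_counts.get((stage, direction), 0)
        for stage in range(2, 7)
        for direction in ("fwd", "rev")
    )


def diff_row(before, after):
    """Return (abs diff, pct diff) with the percentage truncated toward zero."""
    abs_diff = after - before
    pct_diff = math.trunc(100 * abs_diff / before)
    return abs_diff, pct_diff


# Made-up per-stage counts; stage 1 is outside the 2..6 range and is ignored.
example_counts = {
    (1, "fwd"): 9999,
    (2, "fwd"): 500,
    (2, "rev"): 300,
    (3, "fwd"): 120,
    (3, "rev"): 80,
}
print(cost(example_counts))            # 1000

# Reproduces the orientdb and doris rows from the reported before/after costs.
print(diff_row(9_620_589, 7_453_341))  # (-2167248, -22)
print(diff_row(1_724_528, 1_856_326))  # (131798, 7)
```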

aschackmull added the `no-change-note-required` label ("This PR does not need a change note") on Jun 21, 2024
@@ -564,7 +564,6 @@ predicate neverSkipInPathGraph(Node n) {
* Holds if `t1` and `t2` are compatible, that is, whether data can flow from
* a node of type `t1` to a node of type `t2`.
*/
pragma[inline]
predicate compatibleTypes(DataFlowType t1, DataFlowType t2) { any() }
Contributor

Why has the stub implementation been changed to bind the parameters for some languages (cpp, go) but not others (python, swift)?

Contributor Author

Because in python and swift DataFlowType is a singleton.

Contributor

I know that Go doesn't use types in dataflow, but I forgot that we have a hack in place to avoid an optimiser bug which means DataFlowType isn't a singleton. Apparently I can remove it now.
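
One way to picture why a singleton `DataFlowType` makes the form of the stub less of a concern is to look at the size of the relation that an "everything is compatible" predicate denotes. The Python sketch below is only a model under that reading: it does not reproduce the actual stub changes in this PR, and the type universes in it are hypothetical.

```python
from itertools import product


def compatible_types_stub(type_universe):
    """Extent of a stub in which every pair of types is compatible.

    This is the full cross product of the type universe with itself.
    """
    return set(product(type_universe, repeat=2))


# Hypothetical universes: a language with a single DataFlowType value,
# and a language with 1000 distinct types.
singleton = ["unit"]
many = [f"T{i}" for i in range(1000)]

print(len(compatible_types_stub(singleton)))  # 1
print(len(compatible_types_stub(many)))       # 1000000
```

With a single type value the relation has exactly one tuple, whereas with many type values it grows quadratically.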

@aschackmull
Contributor Author

Spot-checked one of the removed results for cs/exposure-of-sensitive-information. This was indeed an FP (false positive) that is now removed thanks to the additional type strengthening in the cast node added in the first commit.

@aschackmull
Contributor Author

DCA looks good.

Contributor

hvitved left a comment

LGTM. Great work 🎉

@aschackmull
Contributor Author

The CI failure is unrelated and already broken on main. Merging.

aschackmull merged commit 25d520a into github:main on Jun 24, 2024
55 of 56 checks passed
aschackmull deleted the dataflow/stage3-notypes branch on June 24, 2024 at 13:21
3 participants