JS/ATM: Various compilation fixes and performance improvements #7307

adityasharad · 2021-12-03T22:44:46Z

This PR attempts to improve the evaluation performance of the ATM queries and libraries, focused on slow bottlenecks that take 10s or more on a moderately large database.

Semantics should be largely unchanged, with the exception of a few places where I've used the strict version of aggregates, which will now have no results (rather than the unit of the aggregation operation) if there are no tuples satisfying the aggregation expressions.

Commit-wise review highly recommended. See commit descriptions and inline comments for further explanation of each change. Careful testing with the actual models against some candidate databases highly recommended, to make sure I haven't changed how features are represented.

…eir marker comments

The `matchMarkerComment` predicate performs badly on any codebase with a moderately large number of comments, because the current implementation has to first compute the Cartesian product between the set of comments and the set of framework library comment regexes. Instead, match first against a single regex: the union of all framework library comment regexes. This computes a more benign Cartesian product, the same size as the set of comments. See inline comments for more details.
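The idea of prefiltering with a single combined regex can be sketched outside QL. The Python below is an illustration only, not the code in this PR: the regexes and the helper names `match_naive` and `match_union` are hypothetical, and Python's `re.fullmatch` stands in for QL's `regexpMatch`.

```python
import re

# Hypothetical stand-ins for the framework library marker-comment regexes.
FRAMEWORK_REGEXES = [
    r".*jQuery v\d+\.\d+\.\d+.*",
    r".*AngularJS v\d+\.\d+\.\d+.*",
]


def match_naive(comments, regexes):
    """Test every (comment, regex) pair: |comments| * |regexes| match attempts."""
    return {(c, r) for c in comments for r in regexes if re.fullmatch(r, c)}


def match_union(comments, regexes):
    """Prefilter with one combined regex, then pair only the surviving comments."""
    union = re.compile("|".join(f"(?:{r})" for r in regexes))
    # One match attempt per comment; most comments are discarded here.
    candidates = [c for c in comments if union.fullmatch(c)]
    # The expensive pairing now runs over the (usually small) candidate set.
    return {(c, r) for c in candidates for r in regexes if re.fullmatch(r, c)}


comments = ["/*! jQuery v3.6.0 */", "// TODO: fix this", "/* AngularJS v1.8.2 */"]
matches = match_union(comments, FRAMEWORK_REGEXES)
```

Both functions return the same pairs; the second simply avoids pairing every non-matching comment with every regex.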

Change the cutoff logic from `count` to `strictcount`, since we know it only applies to a non-empty set of results. Use a single `strictconcat` aggregate to combine tokens in order of location, instead of computing a `rank` followed by a `concat`. Strictness introduces a slight change of behaviour because missing tokens will now result in no results from the predicate rather than an empty feature string.
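The behaviour change from strictness can be illustrated with a small Python model. This is a sketch under assumptions, not the QL in this PR: `concat_tokens` and `strict_concat_tokens` are hypothetical names, tokens are modelled as `(location, token)` pairs, and `None` stands for "the predicate has no result".

```python
from typing import Iterable, Optional, Tuple

Token = Tuple[int, str]  # (location, token text); a simplified, hypothetical shape


def concat_tokens(tokens: Iterable[Token]) -> str:
    """Non-strict aggregate: an empty input yields the unit value, the empty string."""
    return " ".join(text for _, text in sorted(tokens))


def strict_concat_tokens(tokens: Iterable[Token]) -> Optional[str]:
    """Strict aggregate: an empty input yields no result at all (modelled as None)."""
    ordered = sorted(tokens)
    if not ordered:
        return None
    # Ordering by location happens in the same step as the concatenation,
    # rather than computing a separate rank first.
    return " ".join(text for _, text in ordered)


body = [(2, "f"), (1, "function"), (3, "(")]
```

For non-empty input the two agree; only the empty case differs, which is the "no results rather than an empty feature string" change described above.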

Use a `strictcount` to identify whether there is exactly one feature or not. If so, we use it. If not, we use the empty string. Add context to ensure we filter the set of data flow nodes down to only the set of endpoint nodes. This performance optimisation avoids calculating the Cartesian product of data flow nodes and feature names, but it does not avoid calculating the (slightly smaller) Cartesian product of endpoint nodes and feature names. Product size = number of endpoint nodes * number of feature names. At time of writing there are 8 feature names.
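A rough Python model of the "exactly one feature, otherwise empty string" rule and of the remaining product follows. The endpoint names, feature names, and helper functions are hypothetical and only serve to show the shape of the computation.

```python
def feature_value(values_by_key, endpoint, feature_name):
    """Return the feature value if there is exactly one candidate, else ''."""
    values = values_by_key.get((endpoint, feature_name), set())
    return next(iter(values)) if len(values) == 1 else ""


def feature_table(endpoints, feature_names, values_by_key):
    """One row per (endpoint, feature name): size = len(endpoints) * len(feature_names)."""
    return {
        (e, n): feature_value(values_by_key, e, n)
        for e in endpoints  # restricted to endpoint nodes, not all data flow nodes
        for n in feature_names
    }


endpoints = ["endpoint1", "endpoint2"]
feature_names = ["calleeName", "enclosingFunctionName"]  # hypothetical names
values_by_key = {
    ("endpoint1", "calleeName"): {"exec"},
    ("endpoint2", "calleeName"): {"a", "b"},  # ambiguous: more than one value
}
table = feature_table(endpoints, feature_names, values_by_key)
```

The table still has one entry per endpoint and feature name, so restricting the first factor to endpoint nodes is what keeps the product small.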

erik-krogh · 2021-12-07T08:33:21Z

LGTM 👍

An evaluation looks great 👍

But you need to run the autoformatter on EndpointFeatures.qll.

esbena · 2021-12-07T09:21:36Z

> An evaluation looks great 👍

While impressive, the evaluation is not representative since it does not make use of the ml-model. A small DCA change is required to download the model on the fly.
I'm about to look into fixing ATM FPs so I might just add DCA ml-model support immediately.

henrymercer · 2021-12-07T14:18:49Z

0e31439 appears to have some small side effects where we duplicate some tokens in the function body feature. I've pushed a couple of commits: the first removes that side effect, and the second autoformats EndpointFeatures.qll so that the language tests should start passing.

henrymercer

Will do some final testing of this internally, but LGTM otherwise.

henrymercer

LGTM for the ATM changes. There look to be a couple of failing tests in the standard JS library.

adityasharad · 2021-12-08T19:52:29Z

Looks like the regex-combining wasn't actually behaviour-preserving. Will see if I can fix that so it has no diffs, otherwise back out that change.

adityasharad · 2021-12-10T01:39:29Z

Updated (see latest commits) to fix a logic bug in regex matching that was causing test failures (r.regexpMatch(s) is not the same as s.regexpMatch(r)!).
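The asymmetry is easy to reproduce in any regex API. The Python below is only an analogy: `regexp_match` is a hypothetical helper that mimics the argument order of `receiver.regexpMatch(pattern)` and assumes whole-string matching, and the sample comment and regex are made up.

```python
import re


def regexp_match(receiver: str, pattern: str) -> bool:
    """Mimic `receiver.regexpMatch(pattern)`: the whole receiver must match pattern."""
    return re.fullmatch(pattern, receiver) is not None


comment = "jQuery v3.6.0"
regex = r"jQuery v\d+\.\d+\.\d+"

right_order = regexp_match(comment, regex)  # comment tested against the regex
wrong_order = regexp_match(regex, comment)  # regex text tested against the comment
```

With the arguments swapped, the comment text is interpreted as a pattern and the regex source as the input, so the match silently fails instead of raising an error.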

henrymercer

LGTM. Could I get a signoff from @github/codeql-javascript on the standard library changes?

erik-krogh

> LGTM. Could I get a signoff from @github/codeql-javascript on the standard library changes?

Yes you can 👍
I got an evaluation running of the stdlib changes, and that looks good.

I have tried to fix the performance of FrameworkLibraries.qll a few times myself, but without success.
So it's great to see that someone was finally able to do it 🎉

adityasharad added 5 commits December 3, 2021 14:20

JS: Fix compilation errors in EndpointFeatures library

d0840af

Use the LabelParameter API instead of manually constructing the edge label.

JS: Replace an exists+concat with an equivalent strictconcat

fac2769

adityasharad requested review from henrymercer and Z80coder December 3, 2021 22:44

adityasharad requested a review from a team as a code owner December 3, 2021 22:44

github-actions bot added the JS label Dec 3, 2021

henrymercer added 2 commits December 7, 2021 14:16

JS: Fix occasional duplicate body tokens

016727d

0e31439 introduces some occasional duplicate tokens due to duplicate AST node attributes. The long-term fix is to update `CodeToFeatures.qll`, but for the short-term, we update the concatenation to concatenate unique (location, token) pairs.
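The short-term fix amounts to deduplicating on the (location, token) pair before concatenating. A minimal Python sketch, with a hypothetical `body_tokens` helper and integer locations standing in for real source locations:

```python
def body_tokens(rows):
    """Concatenate tokens in location order, keeping each (location, token) pair once."""
    unique = sorted(set(rows))  # drops exact duplicate pairs, keeps location order
    return " ".join(token for _, token in unique)


# The pair (2, "f") appears twice, as it would with a duplicated AST node attribute.
rows = [(1, "function"), (2, "f"), (2, "f"), (3, "("), (4, "f")]
feature = body_tokens(rows)
```

Deduplicating on the pair rather than on the token alone keeps legitimate repeats: the same token text at a different location, such as `(4, "f")`, still appears in the output.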

JS: Autoformat

322e394

henrymercer reviewed Dec 7, 2021

henrymercer previously approved these changes Dec 7, 2021

adityasharad added 2 commits December 9, 2021 13:42

JS: Fix broken regex matching predicate

0c3daab

The receiver string and the regex were in the wrong order, leading to test failures when looking for matching comments.

JS: Expand explanatory comment about version placeholders

271b23b

adityasharad dismissed henrymercer’s stale review via a6a5a72 December 10, 2021 01:29

adityasharad force-pushed the atm/perf-debugging branch from a6a5a72 to 271b23b December 10, 2021 01:38

henrymercer approved these changes Dec 10, 2021

erik-krogh approved these changes Dec 10, 2021

henrymercer merged commit 6e16704 into github:main Dec 10, 2021
