Skip to content

JS/ATM: Various compilation fixes and performance improvements #7307

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Dec 10, 2021

Conversation

adityasharad
Copy link
Collaborator

This PR attempts to improve the evaluation performance of the ATM queries and libraries, focused on slow bottlenecks that take 10s or more on a moderately large database.

Semantics should be largely unchanged, with the exception of a few places where I've used the strict version of aggregates, which will now have no results (rather than the unit of the aggregation operation) if there are no tuples satisfying the aggregation expressions.

Commit-wise review highly recommended. See commit descriptions and inline comments for further explanation of each change. Careful testing with the actual models against some candidate databases highly recommended, to make sure I haven't changed how features are represented.

Use the LabelParameter API instead of manually constructing the edge label.
…eir marker comments

The `matchMarkerComment` predicate performs badly on any codebase with
a moderately large number of comments, because the current implementation
has to first compute the Cartesian product between the set of comments
and the set of framework library comment regexes.

Instead, match first against a single regex:
the union of all framework library comment regexes.
This computes a more benign Cartesian product, the same size as the set of comments.

See inline comments for more details.
Change the cutoff logic from `count` to `strictcount`, since we know it only applies
to a non-empty set of results.

Use a single `strictconcat` aggregate to combine tokens in order of location,
instead of computing a `rank` followed by a `concat`.

Strictness introduces a slight change of behaviour because missing tokens will now result
in no results from the predicate rather than an empty feature string.
Use a `strictcount` to identify whether there is exactly one feature or not.
If so, we use it. If not, we use the empty string.
Add context to ensure we filter the set of data flow nodes down to only
the set of endpoint nodes.

This performance optimisation avoids calculating the Cartesian product
of data flow nodes and feature names, but it does not avoid calculating
the (slightly smaller) Cartesian product of endpoint nodes and feature names.
Product size = number of endpoint nodes * number of feature names.
At time of writing there are 8 feature names.
@adityasharad adityasharad requested a review from a team as a code owner December 3, 2021 22:44
@github-actions github-actions bot added the JS label Dec 3, 2021
@erik-krogh
Copy link
Contributor

LGTM 👍

An evaluation looks great 👍

But you need to run the autoformatter on EndpointFeatures.qll.

@esbena
Copy link
Contributor

esbena commented Dec 7, 2021

An evaluation looks great +1

While impressive, the evaluation is not representative since it does not make use of the ml-model. A small DCA change is required to download the model on the fly.
I'm about to look into fixing ATM FPs so I might just add DCA ml-model support immediately.

0e31439 introduces some occasional duplicate tokens due to duplicate AST
node attributes. The long-term fix is to update `CodeToFeatures.qll`,
but for the short-term, we update the concatenation to concatenate
unique (location, token) pairs.
@henrymercer
Copy link
Contributor

0e31439 appears to some small side effects where we duplicate some tokens in the function body feature. I've pushed a couple of commits: the first removes that side effect, and the second autoformats EndpointFeatures.qll so that the language tests should start passing.

Copy link
Contributor

@henrymercer henrymercer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do some final testing of this internally, but LGTM otherwise.

henrymercer
henrymercer previously approved these changes Dec 7, 2021
Copy link
Contributor

@henrymercer henrymercer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for the ATM changes. There look to be a couple of failing tests in the standard JS library.

@adityasharad
Copy link
Collaborator Author

Looks like the regex-combining wasn't actually behaviour-preserving. Will see if I can fix that so it has no diffs, otherwise back out that change.

The receiver string and the regex were in the wrong order,
leading to test failures when looking for matching comments.
@adityasharad
Copy link
Collaborator Author

Updated (see latest commits) to fix a logic bug in regex matching that was causing test failures (r.regexpMatch(s) is not the same as s.regexpMatch(r)!).

Copy link
Contributor

@henrymercer henrymercer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Could I get a signoff from @github/codeql-javascript on the standard library changes?

Copy link
Contributor

@erik-krogh erik-krogh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Could I get a signoff from @github/codeql-javascript on the standard library changes?

Yes you can 👍
I got an evaluation running of the stdlib changes, and that looks good.

I have tried to fix the performance of FrameworkLibraries.qll a few times myself, but without success.
So it's great to see that someone was finally able to do it 🎉

@henrymercer henrymercer merged commit 6e16704 into github:main Dec 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants