-
Notifications
You must be signed in to change notification settings - Fork 1.8k
JS/ATM: Various compilation fixes and performance improvements #7307
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Use the LabelParameter API instead of manually constructing the edge label.
…eir marker comments The `matchMarkerComment` predicate performs badly on any codebase with a moderately large number of comments, because the current implementation has to first compute the Cartesian product between the set of comments and the set of framework library comment regexes. Instead, match first against a single regex: the union of all framework library comment regexes. This computes a more benign Cartesian product, the same size as the set of comments. See inline comments for more details.
Change the cutoff logic from `count` to `strictcount`, since we know it only applies to a non-empty set of results. Use a single `strictconcat` aggregate to combine tokens in order of location, instead of computing a `rank` followed by a `concat`. Strictness introduces a slight change of behaviour because missing tokens will now result in no results from the predicate rather than an empty feature string.
Use a `strictcount` to identify whether there is exactly one feature or not. If so, we use it. If not, we use the empty string. Add context to ensure we filter the set of data flow nodes down to only the set of endpoint nodes. This performance optimisation avoids calculating the Cartesian product of data flow nodes and feature names, but it does not avoid calculating the (slightly smaller) Cartesian product of endpoint nodes and feature names. Product size = number of endpoint nodes * number of feature names. At time of writing there are 8 feature names.
LGTM 👍 An evaluation looks great 👍 But you need to run the autoformatter on |
While impressive, the evaluation is not representative since it does not make use of the ml-model. A small DCA change is required to download the model on the fly. |
0e31439 introduces some occasional duplicate tokens due to duplicate AST node attributes. The long-term fix is to update `CodeToFeatures.qll`, but for the short-term, we update the concatenation to concatenate unique (location, token) pairs.
0e31439 appears to some small side effects where we duplicate some tokens in the function body feature. I've pushed a couple of commits: the first removes that side effect, and the second autoformats |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do some final testing of this internally, but LGTM otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM for the ATM changes. There look to be a couple of failing tests in the standard JS library.
Looks like the regex-combining wasn't actually behaviour-preserving. Will see if I can fix that so it has no diffs, otherwise back out that change. |
The receiver string and the regex were in the wrong order, leading to test failures when looking for matching comments.
a6a5a72
to
271b23b
Compare
Updated (see latest commits) to fix a logic bug in regex matching that was causing test failures ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Could I get a signoff from @github/codeql-javascript on the standard library changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Could I get a signoff from @github/codeql-javascript on the standard library changes?
Yes you can 👍
I got an evaluation running of the stdlib changes, and that looks good.
I have tried to fix the performance of FrameworkLibraries.qll
a few times myself, but without success.
So it's great to see that someone was finally able to do it 🎉
This PR attempts to improve the evaluation performance of the ATM queries and libraries, focused on slow bottlenecks that take 10s or more on a moderately large database.
Semantics should be largely unchanged, with the exception of a few places where I've used the strict version of aggregates, which will now have no results (rather than the unit of the aggregation operation) if there are no tuples satisfying the aggregation expressions.
Commit-wise review highly recommended. See commit descriptions and inline comments for further explanation of each change. Careful testing with the actual models against some candidate databases highly recommended, to make sure I haven't changed how features are represented.