Sink endpoint characteristics #11055

Merged 9 commits into main on Nov 4, 2022

Conversation

@tiferet (Contributor) commented Oct 31, 2022

This PR adds the class EndpointCharacteristic (formerly referred to as ClassificationReason in the design doc).

As a first step, it implements only the four characteristics that indicate that an endpoint is a sink. Subsequent PRs will add characteristics that indicate an endpoint is not a sink.

The definition of a known sink can now be written in a generic fashion in the base class ATMConfig.qll without needing each query's config to implement it independently.
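As an illustration only, the shape of that generic definition can be modelled in a few lines of Python (the real implementation is QL in ATMConfig.qll, and every name below is hypothetical):

```python
from dataclasses import dataclass

# Hypothetical model of an endpoint characteristic: for a given endpoint
# class it is either a positive or a negative indicator, with a confidence.
@dataclass(frozen=True)
class Indicator:
    endpoint_class: int  # 0 = the negative class; positive ints = sink types
    is_positive: bool
    confidence: float    # in [0, 1]

def is_known_sink(indicators, sink_class):
    """Generic known-sink definition: some characteristic is a
    maximal-confidence positive indicator for the given sink class."""
    return any(
        i.endpoint_class == sink_class and i.is_positive and i.confidence == 1.0
        for i in indicators
    )

# With this in the base class, a query config no longer needs its own isKnownSink.
inds = [Indicator(endpoint_class=2, is_positive=True, confidence=1.0)]
print(is_known_sink(inds, 2))  # True
print(is_known_sink(inds, 1))  # False
```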

The same logic will be used to surface positive training samples in a subsequent PR.

Update: I've written the characteristics that will replace NotASinkReason and verified that I can reproduce the current selection of training examples. I need to clean up that code before opening a PR with it, though.

Timing experiment: https://github.com/github/codeql-dca-main/issues/8273

Closes https://github.com/github/ml-ql-adaptive-threat-modeling/issues/2096

@tiferet marked this pull request as ready for review on November 1, 2022 01:01
@tiferet requested review from a team, jhelie and kaeluka on November 1, 2022 01:01
@tiferet (Author) commented Nov 1, 2022

@kaeluka IIRC you said there's extensive testing in the PR checks, so the fact that those have passed indicates this PR has made no change to the training data or the endpoints that get scored at inference time, right?

@kaeluka commented Nov 2, 2022

Yes, but they're no guarantee 👍. One crucial test is javascript/ql/experimental/adaptivethreatmodeling/test/endpoint_large_scale/ExtractEndpointDataTraining.qlref together with the ExtractEndpointDataTraining.expected file in the same dir.


@kaeluka left a comment:

I have left a few comments. Have you also started a performance evaluation for this change that you could link here?

@tiferet (Author) commented Nov 2, 2022

I have left a few comments. Have you also started a performance evaluation for this change that you could link here?

It's this experiment that I pinged you about yesterday, because I don't understand the failures.

@jhelie (Contributor) commented Nov 2, 2022

Thanks @tiferet - I've had a look and LGTM; I don't have anything to add to @kaeluka's comments. My only suggestion concerns the following:

Update: I've written the characteristics that will replace NotASinkReason and verified that I can reproduce the current selection of training examples

I think we should define a canary database (or a couple, to hit all the sink types) that we can all use and refer to, to ensure we're reproducing the set of endpoints. Concretely, I'd like us to record something along these lines:

Using databases `foo`, `bar` and `baz` the current implementation extracts the following number of endpoints:
- negative: ...
- {sinkType}: ...
- Unknown endpoints: ... 

In comparison, with this WIP design we are extracting the following number of endpoints:
- negative: ...
- {sinkType}: ...
- Unknown endpoints: ... 

While a couple of databases won't capture all edge cases the aim would be to have a simple target so we can track progress and easily identify regressions.
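The comparison being described can be sketched in a few lines of Python (purely illustrative; the class names and counts below are invented):

```python
from collections import Counter

def diff_counts(current, wip):
    """Return {class: (current count, WIP count)} for every endpoint class
    whose count differs between the two implementations."""
    classes = sorted(set(current) | set(wip))
    return {c: (current[c], wip[c]) for c in classes if current[c] != wip[c]}

# Hypothetical per-class endpoint counts on a fixed canary database.
current = Counter({"negative": 1200, "Xss": 40, "unknown": 300})
wip = Counter({"negative": 1200, "Xss": 38, "unknown": 302})
print(diff_counts(current, wip))  # {'Xss': (40, 38), 'unknown': (300, 302)}
```

Any non-empty diff would flag a regression (or an intended change) in the set of extracted endpoints.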

EndpointCharacteristic() { any() }

// Indicators with confidence at or above this threshold are considered to be high-confidence indicators.
float getHighConfidenceThreshold() { result = 0.8 }

A reviewer commented:

One question I had actually was: what is the role of this predicate? I'm guessing it's a categorical selection wrapper on the more fine-grained confidence floats, but why do we need it at this stage? (Same question for the Medium one below.)


The author replied:

You're right, I could have left these for a later PR. I use them in the endpoint selection code. For example, to implement logic such as "if the list of characteristics includes positive indicators with high confidence for this class, select this as a training sample belonging to the class". I put them in EndpointCharacteristic because I think the place that sets confidences for various types of endpoints should also define what we mean by "high confidence".
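A rough Python sketch of that selection rule (illustrative only; the names are invented and the real logic is QL):

```python
# Mirrors getHighConfidenceThreshold() from the snippet above.
HIGH_CONFIDENCE_THRESHOLD = 0.8

def select_training_sample(indicators, endpoint_class):
    """An endpoint is selected as a training sample for a class if some
    characteristic is a positive indicator for that class with
    high confidence. Each indicator is (class, is_positive, confidence)."""
    return any(
        cls == endpoint_class and positive and conf >= HIGH_CONFIDENCE_THRESHOLD
        for (cls, positive, conf) in indicators
    )

print(select_training_sample([(3, True, 0.9)], 3))   # True
print(select_training_sample([(3, True, 0.79)], 3))  # False: below threshold
print(select_training_sample([(3, False, 0.9)], 3))  # False: negative indicator
```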


The reviewer replied:

Should we introduce them in the PR that will use them then? I know you have written more code locally, but for us that will make things a little easier to follow along.


The author replied:

IMHO it's not worth deleting them from this PR just to add them in the next PR 🤷


The reviewer replied:

OK, but in the future please don't introduce concepts that weren't discussed beforehand and aren't used in the PR you open.


Another reviewer commented:

Agree with Jean. In general, it's nice to avoid introducing dead code.

@adityasharad (Collaborator) commented:

I think we could accomplish Jean's suggestion using QL unit tests. The test case could have sample code with all the possible endpoint types, and you can check in an .expected file for the expected output of the endpoint query on that test case.

@tiferet (Author) commented Nov 2, 2022

Regarding Jean and Aditya's testing suggestions, these QL tests already exist 👍 : e.g. javascript/ql/experimental/adaptivethreatmodeling/test/endpoint_large_scale/ExtractEndpointDataTraining.qlref together with the ExtractEndpointDataTraining.expected file

@tiferet (Author) commented Nov 2, 2022

I think I've addressed all the review comments 🏓

I ran endpoint_large_scale/ExtractEndpointData and endpoint_large_scale/ExtractEndpointDataTraining locally and they run as fast as on main. Excluding DB extraction and query compilation,

  • ExtractEndpointData takes 3.5-4 seconds on main and 3.6-3.9 seconds on this branch ✅
  • ExtractEndpointDataTraining takes 3.6-3.9 seconds on main and 3.5-3.7 seconds on this branch ✅

What's left is to get the timing DCA tests to run (latest attempt), plus a (hopefully) final review.

@tiferet requested a review from kaeluka on November 2, 2022 19:48
@tiferet (Author) commented Nov 2, 2022

@kaeluka Using the latest CLI seems to have solved the DCA failures. I don't know how to read the resulting report, though. The instructions say "If timing/ATM-Threshold issue summary, when done, contains failures, the respective repositories have exceeded their permissible overhead quota", but I can't find anything called timing/ATM-Threshold. If I understand this table correctly, though, it seems like we have a problem?

@jhelie (Contributor) commented Nov 2, 2022

I can't find anything called timing/ATM-Threshold. If I understand this table correctly, though, it seems like we have a problem?

@tiferet see the Note on performance we added to the instructions recently:

  • the table you linked to is not the one used to decide whether we meet our KPI (as it's using relative and not relative_with_overhead)
  • the fact you do not see any ATM-threshold summary in the experiment issue suggests we do not have a problem (I think we don't show anything if there's no problem as real estate is at a premium on the DCA issue)

cc @esbena to confirm

@jhelie (Contributor) commented Nov 2, 2022

Regarding Jean and Aditya's testing suggestions, these QL tests already exist

That's great - to clarify, I had in mind something more basic that just checked the endpoints, without coupling it to the format, as I thought we might have to break that format during implementation of the design (if only temporarily; as discussed, we'll need to write glue code to minimise disruption to the pipeline inputs). But we can proceed with these tests for now and cross that bridge when we get to it.

@tiferet (Author) commented Nov 3, 2022

I can't find anything called timing/ATM-Threshold. If I understand this table correctly, though, it seems like we have a problem?

@tiferet see the Note on performance we added to the instructions recently:

  • the table you linked to is not the one used to decide whether we meet our KPI (as it's using relative and not relative_with_overhead)
  • the fact you do not see any ATM-threshold summary in the experiment issue suggests we do not have a problem (I think we don't show anything if there's no problem as real estate is at a premium on the DCA issue)

cc @esbena to confirm

But the absolute time difference is sometimes big (e.g. 191.7 seconds). That's why I assumed the relative_with_overhead would fail too.

@jhelie (Contributor) commented Nov 3, 2022

But the absolute time difference is sometimes big (e.g. 191.7 seconds). That's why I assumed the relative_with_overhead would fail too.

I only had a quick look, but the diff is only big for one source, and from memory even then relative_with_overhead would be below the threshold - it's easy enough to do the maths, though.
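For what it's worth, the `relative` column in the DCA table quoted below is consistent with `diff / a_m`, so the maths is easy to check:

```python
# The `relative` column in the DCA timing table matches diff / a_m
# (the slowdown relative to the mean baseline run). Reproducing a few rows:
rows = {
    # source: (a_m, diff, reported relative)
    "nodejs__node": (621.0, 191.7, 0.309),
    "microsoft__playwright": (109.3, 32.67, 0.299),
    "Siteimprove__alfa": (72.0, 10.0, 0.139),
}
for source, (a_m, diff, reported) in rows.items():
    assert round(diff / a_m, 3) == reported, source
    print(f"{source}: {diff / a_m:.3f}")
```

relative_with_overhead itself is not shown in the table; presumably it allows a fixed overhead on top of the baseline before counting the difference, which is why even the 191.7-second absolute diff on nodejs__node need not breach the threshold.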

@esbena (Contributor) commented Nov 3, 2022

The ATM thresholds have not been crossed. You can confirm by looking in reports/any.md and observing the ToC:

| Title | Interesting rows | Undecided rows | Bad rows | Good rows | min | max |
| --- | --- | --- | --- | --- | --- | --- |
| Alert count using manual result classifications, per query | | | | | | |
| Alert count using manual result classifications, per source and query | | | | | | |
| Analysis time, per source (ATM threshold) | | | | | | |

^ nothing interesting.

For completeness:

| source | a targets | b targets | weight | conclusion | a_m | a_std | b_m | b_std | diff | relative |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Median (excl. partials) | - | - | | | | | | | 33 | 0.428 |
| Overall (excl. partials) | - | - | | | 1382 | | 1879 | | 497 | 0.36 |
| Siteimprove__alfa | 3 | 3 | | | 72 | 1 | 82 | 0 | 10 | 0.139 |
| microsoft__playwright | 3 | 3 | | | 109.3 | 1.155 | 142 | 2 | 32.67 | 0.299 |
| nodejs__node | 3 | 3 | | | 621 | 7.937 | 812.7 | 4.933 | 191.7 | 0.309 |
| son7211__demovul | 3 | 3 | | | 92.67 | 2.082 | 126.7 | 2.517 | 34 | 0.367 |
| MarToxAk__pdfsize | 3 | 3 | | | 73.33 | 3.215 | 104.7 | 0.577 | 31.33 | 0.427 |
| mozilla__pdf.js | 3 | 3 | | | 74 | 1.732 | 105.7 | 3.512 | 31.67 | 0.428 |
| mozyy__pdf.js | 3 | 3 | | | 72.67 | 0.577 | 105 | 1 | 32.33 | 0.445 |
| TechHamara__pdf.js | 3 | 3 | | | 72 | 1 | 105 | 1 | 33 | 0.458 |
| hckhanh__pdf.js-dist-viewer | 3 | 3 | | | 71.33 | 0.577 | 104.3 | 0.577 | 33 | 0.463 |
| uktrade__data-hub-frontend | 3 | 3 | | | 65.33 | 3.215 | 99.33 | 2.309 | 34 | 0.52 |
| navikt__fp-frontend | 3 | 3 | | | 58 | 1 | 91.33 | 4.726 | 33.33 | 0.575 |

(Analysis times are in seconds.)

kaeluka previously approved these changes Nov 3, 2022
@tiferet (Author) commented Nov 3, 2022

The ATM thresholds have not been crossed. You can confirm by looking in reports/any.md and observing the ToC

@esbena Thank you for the clarification about where to look! I've updated our instructions to reflect this ❤️


@henrymercer left a comment:

This looks generally good! A few comments.


@tiferet requested a review from kaeluka on November 4, 2022 14:06
kaeluka previously approved these changes Nov 4, 2022

@kaeluka left a comment:

all new changes are benign

Write the reasons that indicate that an endpoint is a sink for each sink type.

Also fix import error.
If the list of reasons includes positive indicators with maximal confidence for this class, it's a known sink for the class.

This negates the need for each query config to define the isKnownSink predicate individually.
@tiferet force-pushed the tiferet/sink-classification-reasons branch from fed54ef to 833041c on November 4, 2022 17:10
@tiferet requested a review from kaeluka on November 4, 2022 17:22
@tiferet merged commit 5198ad7 into main on Nov 4, 2022
@tiferet deleted the tiferet/sink-classification-reasons branch on November 4, 2022 18:24

@aeisenberg left a comment:

Just some minor comment changes. I'm not too familiar with this area, so I'm not even sure if these suggestions are correct. Feel free to take or leave them.

* This predicate describes what the characteristic tells us about an endpoint.
*
* Params:
* endpointClass: Class 0 is the negative class. Each positive int corresponds to a single sink type.

The reviewer commented:

I know what you mean here, but it took me a second since 0 is not negative. I don't have any suggestions on improvement, though.
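The encoding being discussed could be made concrete along these lines (hypothetical Python sketch; the actual sink-type numbering is whatever the QL code defines, and the type names here are invented):

```python
from enum import IntEnum

# Hypothetical encoding of endpoint classes: 0 is the "negative" class in the
# positive/negative-example sense (not a negative number), and each positive
# integer denotes one sink type.
class EndpointClass(IntEnum):
    NEGATIVE = 0      # "not a sink" -- the negative class, yet not < 0
    SINK_TYPE_A = 1   # invented sink-type names
    SINK_TYPE_B = 2

print(EndpointClass.NEGATIVE == 0)  # True
print(all(c > 0 for c in EndpointClass if c is not EndpointClass.NEGATIVE))  # True
```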

Comment on lines +35 to +36
* isPositiveIndicator: Does this characteristic indicate this endpoint _is_ a member of the class, or that it
* _isn't_ a member of the class?

The reviewer commented:

Minor:

Suggested change
* isPositiveIndicator: Does this characteristic indicate this endpoint _is_ a member of the class, or that it
* _isn't_ a member of the class?
* isPositiveIndicator: If true, this endpoint is a member of the class.

Comment on lines +37 to +38
* confidence: A number in [0, 1], which tells us how strong an indicator this characteristic is for the endpoint
* belonging / not belonging to the given class.

The reviewer commented:

I'm not confident that this comment change makes things better. It's mostly for my own understanding.

Suggested change
* confidence: A number in [0, 1], which tells us how strong an indicator this characteristic is for the endpoint
* belonging / not belonging to the given class.
* confidence: A float in [0, 1], which tells us how strong an indicator this characteristic is for the endpoint
* belonging / not belonging to the given class. 0 means complete confidence that the endpoint _does not_ belong to the class, and 1 means complete confidence that it _does_.

7 participants