Sink endpoint characteristics #11055

Merged 9 commits into main on Nov 4, 2022

Conversation

@tiferet (Contributor) commented Oct 31, 2022

This PR adds the class EndpointCharacteristic (formerly referred to as ClassificationReason in the design doc).

As a first step, it implements only the four characteristics that indicate that an endpoint is a sink. Subsequent PRs will add characteristics that indicate an endpoint is not a sink.

The definition of a known sink can now be written in a generic fashion in the base class ATMConfig.qll without needing each query's config to implement it independently.
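As an illustration only, the shape of that generic definition can be modelled in a few lines of Python (the real implementation is QL in ATMConfig.qll, and every name below is hypothetical):

```python
from dataclasses import dataclass

# Hypothetical model of an endpoint characteristic: for a given endpoint
# class it is either a positive or a negative indicator, with a confidence.
@dataclass(frozen=True)
class Indicator:
    endpoint_class: int  # 0 = the negative class; positive ints = sink types
    is_positive: bool
    confidence: float    # in [0, 1]

def is_known_sink(indicators, sink_class):
    """Generic known-sink definition: some characteristic is a
    maximal-confidence positive indicator for the given sink class."""
    return any(
        i.endpoint_class == sink_class and i.is_positive and i.confidence == 1.0
        for i in indicators
    )

# With this in the base class, a query config no longer needs its own isKnownSink.
inds = [Indicator(endpoint_class=2, is_positive=True, confidence=1.0)]
print(is_known_sink(inds, 2))  # True
print(is_known_sink(inds, 1))  # False
```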

The same logic will be used to surface positive training samples in a subsequent PR.

Update: I've written the characteristics that will replace NotASinkReason and verified that I can reproduce the current selection of training examples. I need to clean up that code before opening a PR with it, though.

Timing experiment: https://github.com/github/codeql-dca-main/issues/8273

Closes https://github.com/github/ml-ql-adaptive-threat-modeling/issues/2096

@tiferet marked this pull request as ready for review on November 1, 2022 01:01
@tiferet requested review from a team, jhelie and kaeluka on November 1, 2022 01:01
@tiferet (Author) commented Nov 1, 2022

@kaeluka IIRC you said there's extensive testing in the PR checks, so the fact that those have passed indicates this PR has made no change to the training data or the endpoints that get scored at inference time, right?

@kaeluka commented Nov 2, 2022

Yes, but they're no guarantee 👍. One crucial test is javascript/ql/experimental/adaptivethreatmodeling/test/endpoint_large_scale/ExtractEndpointDataTraining.qlref together with the ExtractEndpointDataTraining.expected file in the same dir.


@kaeluka left a comment:

I have left a few comments. Have you also started a performance evaluation for this change that you could link here?

@tiferet (Author) commented Nov 2, 2022

I have left a few comments. Have you also started a performance evaluation for this change that you could link here?

It's this experiment that I pinged you about yesterday, because I don't understand the failures.

@jhelie (Contributor) commented Nov 2, 2022

Thanks @tiferet - I've had a look and LGTM; I don't have anything to add to @kaeluka's comments. My only suggestion concerns the following:

Update: I've written the characteristics that will replace NotASinkReason and verified that I can reproduce the current selection of training examples

I think we should define a canary database (or a couple, to hit all the sink types) that we can all use and refer to, to ensure we're reproducing the set of endpoints. Concretely, I'd like us to record something along these lines:

Using databases `foo`, `bar` and `baz` the current implementation extracts the following number of endpoints:
- negative: ...
- {sinkType}: ...
- Unknown endpoints: ... 

In comparison, with this WIP design we are extracting the following number of endpoints:
- negative: ...
- {sinkType}: ...
- Unknown endpoints: ... 

While a couple of databases won't capture all edge cases the aim would be to have a simple target so we can track progress and easily identify regressions.
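The comparison being described can be sketched in a few lines of Python (purely illustrative; the class names and counts below are invented):

```python
from collections import Counter

def diff_counts(current, wip):
    """Return {class: (current count, WIP count)} for every endpoint class
    whose count differs between the two implementations."""
    classes = sorted(set(current) | set(wip))
    return {c: (current[c], wip[c]) for c in classes if current[c] != wip[c]}

# Hypothetical per-class endpoint counts on a fixed canary database.
current = Counter({"negative": 1200, "Xss": 40, "unknown": 300})
wip = Counter({"negative": 1200, "Xss": 38, "unknown": 302})
print(diff_counts(current, wip))  # {'Xss': (40, 38), 'unknown': (300, 302)}
```

Any non-empty diff would flag a regression (or an intended change) in the set of extracted endpoints.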

EndpointCharacteristic() { any() }

// Indicators with confidence at or above this threshold are considered to be high-confidence indicators.
float getHighConfidenceThreshold() { result = 0.8 }

A reviewer commented:

One question I had actually was: what is the role of this predicate? I'm guessing it's a categorical selection wrapper on the more fine-grained confidence floats, but why do we need it at this stage? (Same question for the Medium one below.)


The author replied:

You're right, I could have left these for a later PR. I use them in the endpoint selection code. For example, to implement logic such as "if the list of characteristics includes positive indicators with high confidence for this class, select this as a training sample belonging to the class". I put them in EndpointCharacteristic because I think the place that sets confidences for various types of endpoints should also define what we mean by "high confidence".
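A rough Python sketch of that selection rule (illustrative only; the names are invented and the real logic is QL):

```python
# Mirrors getHighConfidenceThreshold() from the snippet above.
HIGH_CONFIDENCE_THRESHOLD = 0.8

def select_training_sample(indicators, endpoint_class):
    """An endpoint is selected as a training sample for a class if some
    characteristic is a positive indicator for that class with
    high confidence. Each indicator is (class, is_positive, confidence)."""
    return any(
        cls == endpoint_class and positive and conf >= HIGH_CONFIDENCE_THRESHOLD
        for (cls, positive, conf) in indicators
    )

print(select_training_sample([(3, True, 0.9)], 3))   # True
print(select_training_sample([(3, True, 0.79)], 3))  # False: below threshold
print(select_training_sample([(3, False, 0.9)], 3))  # False: negative indicator
```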


The reviewer replied:

Should we introduce them in the PR that will use them then? I know you have written more code locally, but for us that will make things a little easier to follow along.


The author replied:

IMHO it's not worth deleting them from this PR just to add them in the next PR 🤷


The reviewer replied:

OK, but in the future please don't introduce concepts that weren't discussed beforehand and aren't used in the PR you open.


Another reviewer commented:

Agree with Jean. In general, it's nice to avoid introducing dead code.

@adityasharad (Collaborator) commented:

I think we could accomplish Jean's suggestion using QL unit tests. The test case could have sample code with all the possible endpoint types, and you can check in an .expected file for the expected output of the endpoint query on that test case.

@tiferet (Author) commented Nov 2, 2022

Regarding Jean and Aditya's testing suggestions, these QL tests already exist 👍 : e.g. javascript/ql/experimental/adaptivethreatmodeling/test/endpoint_large_scale/ExtractEndpointDataTraining.qlref together with the ExtractEndpointDataTraining.expected file

@tiferet (Author) commented Nov 2, 2022

I think I've addressed all the review comments 🏓

I ran endpoint_large_scale/ExtractEndpointData and endpoint_large_scale/ExtractEndpointDataTraining locally and they run as fast as on main. Excluding DB extraction and query compilation,

  • ExtractEndpointData takes 3.5-4 seconds on main and 3.6-3.9 seconds on this branch ✅
  • ExtractEndpointDataTraining takes 3.6-3.9 seconds on main and 3.5-3.7 seconds on this branch ✅

What's left is to get the timing DCA tests to run (latest attempt), plus a (hopefully) final review.

@tiferet requested a review from kaeluka on November 2, 2022 19:48
@tiferet (Author) commented Nov 2, 2022

@kaeluka Using the latest CLI seems to have solved the DCA failures. I don't know how to read the resulting report, though. The instructions say "If timing/ATM-Threshold issue summary, when done, contains failures, the respective repositories have exceeded their permissible overhead quota", but I can't find anything called timing/ATM-Threshold. If I understand this table correctly, though, it seems like we have a problem?

@jhelie (Contributor) commented Nov 2, 2022

I can't find anything called timing/ATM-Threshold. If I understand this table correctly, though, it seems like we have a problem?

@tiferet see the Note on performance we added to the instructions recently:

  • the table you linked to is not the one used to decide whether we meet our KPI (as it's using relative and not relative_with_overhead)
  • the fact you do not see any ATM-threshold summary in the experiment issue suggests we do not have a problem (I think we don't show anything if there's no problem as real estate is at a premium on the DCA issue)

cc @esbena to confirm

@jhelie (Contributor) commented Nov 2, 2022

Regarding Jean and Aditya's testing suggestions, these QL tests already exist

That's great - to clarify, I had in mind something more basic that just checked the endpoints, without coupling it to the format, as I thought we might have to break that format during implementation of the design (if only temporarily; as discussed, we'll need to write glue code to minimise disruption to the pipeline inputs). But we can proceed with these tests for now and cross that bridge when we get to it.

@tiferet (Author) commented Nov 3, 2022

I can't find anything called timing/ATM-Threshold. If I understand this table correctly, though, it seems like we have a problem?

@tiferet see the Note on performance we added to the instructions recently:

  • the table you linked to is not the one used to decide whether we meet our KPI (as it's using relative and not relative_with_overhead)
  • the fact you do not see any ATM-threshold summary in the experiment issue suggests we do not have a problem (I think we don't show anything if there's no problem as real estate is at a premium on the DCA issue)

cc @esbena to confirm

But the absolute time difference is sometimes big (e.g. 191.7 seconds). That's why I assumed the relative_with_overhead would fail too.

@jhelie (Contributor) commented Nov 3, 2022

But the absolute time difference is sometimes big (e.g. 191.7 seconds). That's why I assumed the relative_with_overhead would fail too.

I only had a quick look, but the diff is only big for one source, and from memory even then relative_with_overhead would be below the threshold - it's easy enough to do the maths, though.
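For what it's worth, the `relative` column in the DCA table quoted below is consistent with `diff / a_m`, so the maths is easy to check:

```python
# The `relative` column in the DCA timing table matches diff / a_m
# (the slowdown relative to the mean baseline run). Reproducing a few rows:
rows = {
    # source: (a_m, diff, reported relative)
    "nodejs__node": (621.0, 191.7, 0.309),
    "microsoft__playwright": (109.3, 32.67, 0.299),
    "Siteimprove__alfa": (72.0, 10.0, 0.139),
}
for source, (a_m, diff, reported) in rows.items():
    assert round(diff / a_m, 3) == reported, source
    print(f"{source}: {diff / a_m:.3f}")
```

relative_with_overhead itself is not shown in the table; presumably it allows a fixed overhead on top of the baseline before counting the difference, which is why even the 191.7-second absolute diff on nodejs__node need not breach the threshold.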

@esbena (Contributor) commented Nov 3, 2022

The ATM thresholds have not been crossed. You can confirm by looking in reports/any.md and observing the ToC:

| Title | Interesting rows | Undecided rows | Bad rows | Good rows | min | max |
| --- | --- | --- | --- | --- | --- | --- |
| Alert count using manual result classifications, per query | | | | | | |
| Alert count using manual result classifications, per source and query | | | | | | |
| Analysis time, per source (ATM threshold) | | | | | | |

^ nothing interesting.

For completeness:

| source | a targets | b targets | weight | conclusion | a_m | a_std | b_m | b_std | diff | relative |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Median (excl. partials) | - | - | | | | | | | 33 | 0.428 |
| Overall (excl. partials) | - | - | | | 1382 | | 1879 | | 497 | 0.36 |
| Siteimprove__alfa | 3 | 3 | | | 72 | 1 | 82 | 0 | 10 | 0.139 |
| microsoft__playwright | 3 | 3 | | | 109.3 | 1.155 | 142 | 2 | 32.67 | 0.299 |
| nodejs__node | 3 | 3 | | | 621 | 7.937 | 812.7 | 4.933 | 191.7 | 0.309 |
| son7211__demovul | 3 | 3 | | | 92.67 | 2.082 | 126.7 | 2.517 | 34 | 0.367 |
| MarToxAk__pdfsize | 3 | 3 | | | 73.33 | 3.215 | 104.7 | 0.577 | 31.33 | 0.427 |
| mozilla__pdf.js | 3 | 3 | | | 74 | 1.732 | 105.7 | 3.512 | 31.67 | 0.428 |
| mozyy__pdf.js | 3 | 3 | | | 72.67 | 0.577 | 105 | 1 | 32.33 | 0.445 |
| TechHamara__pdf.js | 3 | 3 | | | 72 | 1 | 105 | 1 | 33 | 0.458 |
| hckhanh__pdf.js-dist-viewer | 3 | 3 | | | 71.33 | 0.577 | 104.3 | 0.577 | 33 | 0.463 |
| uktrade__data-hub-frontend | 3 | 3 | | | 65.33 | 3.215 | 99.33 | 2.309 | 34 | 0.52 |
| navikt__fp-frontend | 3 | 3 | | | 58 | 1 | 91.33 | 4.726 | 33.33 | 0.575 |

(Analysis times are in seconds.)

kaeluka previously approved these changes Nov 3, 2022
@tiferet (Author) commented Nov 3, 2022

The ATM thresholds have not been crossed. You can confirm by looking in reports/any.md and observing the ToC

@esbena Thank you for the clarification about where to look! I've updated our instructions to reflect this ❤️


@henrymercer left a comment:

This looks generally good! A few comments.


@tiferet requested a review from kaeluka on November 4, 2022 14:06
kaeluka previously approved these changes Nov 4, 2022

@kaeluka left a comment:

all new changes are benign

Write the reasons that indicate that an endpoint is a sink for each sink type.

Also fix import error.
If the list of reasons includes positive indicators with maximal confidence for this class, it's a known sink for the class.

This negates the need for each query config to define the isKnownSink predicate individually.
@tiferet force-pushed the tiferet/sink-classification-reasons branch from fed54ef to 833041c on November 4, 2022 17:10
@tiferet requested a review from kaeluka on November 4, 2022 17:22
@tiferet merged commit 5198ad7 into main on Nov 4, 2022
@tiferet deleted the tiferet/sink-classification-reasons branch on November 4, 2022 18:24

@aeisenberg left a comment:

Just some minor comment changes. I'm not too familiar with this area, so I'm not even sure if these suggestions are correct. Feel free to take or leave them.

* This predicate describes what the characteristic tells us about an endpoint.
*
* Params:
* endpointClass: Class 0 is the negative class. Each positive int corresponds to a single sink type.

The reviewer commented:

I know what you mean here, but it took me a second since 0 is not negative. I don't have any suggestions on improvement, though.
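The encoding being discussed could be made concrete along these lines (hypothetical Python sketch; the actual sink-type numbering is whatever the QL code defines, and the type names here are invented):

```python
from enum import IntEnum

# Hypothetical encoding of endpoint classes: 0 is the "negative" class in the
# positive/negative-example sense (not a negative number), and each positive
# integer denotes one sink type.
class EndpointClass(IntEnum):
    NEGATIVE = 0      # "not a sink" -- the negative class, yet not < 0
    SINK_TYPE_A = 1   # invented sink-type names
    SINK_TYPE_B = 2

print(EndpointClass.NEGATIVE == 0)  # True
print(all(c > 0 for c in EndpointClass if c is not EndpointClass.NEGATIVE))  # True
```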

Comment on lines +35 to +36
* isPositiveIndicator: Does this characteristic indicate this endpoint _is_ a member of the class, or that it
* _isn't_ a member of the class?

The reviewer commented:

Minor:

Suggested change
* isPositiveIndicator: Does this characteristic indicate this endpoint _is_ a member of the class, or that it
* _isn't_ a member of the class?
* isPositiveIndicator: If true, this endpoint is a member of the class.

Comment on lines +37 to +38
* confidence: A number in [0, 1], which tells us how strong an indicator this characteristic is for the endpoint
* belonging / not belonging to the given class.

The reviewer commented:

I'm not confident that this comment change makes things better. It's mostly for my own understanding.

Suggested change
* confidence: A number in [0, 1], which tells us how strong an indicator this characteristic is for the endpoint
* belonging / not belonging to the given class.
* confidence: A float in [0, 1], which tells us how strong an indicator this characteristic is for the endpoint
* belonging / not belonging to the given class. 0 means complete confidence that the endpoint _does not_ belong to the class, and 1 means complete confidence that it _does_.

7 participants