Skip to content

Conversation

hmac
Copy link
Contributor

@hmac hmac commented Oct 20, 2022

This PR adds some internal documentation for the Ruby flow summary syntax. There's a few TODOs left where I'm not sure of some behaviour - if you know the answer, let me know and I'll fill them in.

@aibaars aibaars changed the base branch from main to codeql-cli-2.11.2 October 27, 2022 09:44
@aibaars
Copy link
Contributor

aibaars commented Oct 27, 2022

@hmac Could you split this PR? Documentation changes for the GA should go into codeql-cli-2.11.2 , but QL changes should go to main.

If you think the doc changes are not fit for the GA release, then please retarget this PR to main and remove the GA label.

@hmac
Copy link
Contributor Author

hmac commented Oct 27, 2022

I don't know why this has a GA label - these are internal docs (and also this PR is unfinished). I've removed it.

@hmac hmac changed the base branch from codeql-cli-2.11.2 to main October 27, 2022 22:17
@aibaars
Copy link
Contributor

aibaars commented Oct 28, 2022

I don't know why this has a GA label - these are internal docs (and also this PR is unfinished). I've removed it.

Ok great! I mistakenly labelled it as GA, sorry for that.

@hmac hmac force-pushed the flow-summary-docs branch from edadd59 to 9776dfd Compare November 3, 2022 23:14
@hmac hmac marked this pull request as ready for review November 4, 2022 01:32
@hmac hmac requested a review from a team as a code owner November 4, 2022 01:32
@hmac hmac added the no-change-note-required This PR does not need a change note label Nov 4, 2022
Copy link
Contributor

@hvitved hvitved left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the excellent write-up.


`input` and `output` define the "from" and "to" locations in the flow summary.
They use a custom string-based syntax which is similar but not identical to
Models as Data. These strings are often referred to as access paths.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These strings should be identical to the input and output string in MaD.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand it, MaD access paths map to API graph paths wheras flow summary paths map to dataflow nodes, and so the tokens supported by each can differ. This slack thread highlights one difference. Another example I've just spotted is that MaD has no support for the hash-splat specifier in Argument tokens.

Copy link
Contributor

@hvitved hvitved Nov 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I see what you mean. However, I think of MaD collectively as all the components that make up a summary; but you are right that the access path component (which identifies the relevant calls) maps to API graphs, where as the input/output components have their own (but similar) syntax.

Copy link
Contributor

@asgerf asgerf Nov 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should call out the difference with the MaD path column just now (called the "identifying access path"). We can elaborate on this when documenting the MaD format.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've slightly reworded this sentence, but I want to retain the reference to MaD because I think it's useful to the reader. Putting myself in the shoes of a newcomer to the codebase: if I've encountered MaD already then this helps provide context and a point of comparison, and if I haven't seen MaD yet then this lets me know that it exists and I can choose to explore more by reading the MaD documentation if I want (and if/when it exists).

Any keyword argument to the call.

#### `hash-splat`
The special "hash splat" argument/parameter, which is written as `**args`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this will also include all explicit keyword arguments, wrapped in an implicit hash splat argument.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've pushed a commit that elaborates on this.

`ReturnValue` refers to the return value of the element identified in the
preceding access path. For example, `Argument[0].ReturnValue` refers to the
return value of the first argument. Of course this only makes sense if the first
argument is a callable.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe callback instead of callable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, I wrote callable to mean that, of course, it makes no sense to use ReturnValue on an object. Instead it must be a proc, lambda or block. I guess depending on how you define "callback", then these would all be considered callbacks. Ruby doesn't tend to use that terminology, though, and in cases like arr.detect(&:even?) I would probably not call &:even? a callback. 🤷

Would it improve things if I change this to say if the first argument is callable - i.e. it is a proc, lambda or block.?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment on lines 182 to 183
TODO: I've not seen this component used in an output path; I don't know if it makes
sense to do so, or what meaning it would have.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't make sense to use in ouput.

Comment on lines 202 to 203
TODO: I've not seen this component used in an output path; I don't know if it makes
sense to do so, or what meaning it would have.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, doesn't make sense.

```

any data in any index of the first argument will be copied to the return value,
with the exception of data at index 0.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A prominent use case of WithoutElement is when you want to model a method that removes (a set of) element(s) from a collection. For example, if we want to model a method that removes the first element of the receiver collection, we can do so via

input = "Argument[self].WithoutElement[0]" and
output = "Argument[self]"

where both the input and the output refer to the same argument. The data-flow library will then prevent use-use flow from any corresponding receivers:

a[0] = taint
a[1] = taint
a.remove_first # use-use flow from `a` to the `a` below will be blocked, but there will still be flow from `[post-update] a` to `a` below
sink(a[0]) # good
sink(a[1]) # bad

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I've written this example into the docs. If you get time, it would be great to have an example of where you should use .WithoutElement[...].WithElement[...]. I've tried removing instances of this pattern from every flow summary and all tests still pass, which leads me to believe that it is redundant.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A good example of this is something like Hash#except. If we see a call like hash.except(:foo), we would like that to map to a flow summary with

input = Argument[self].WithoutElement[":foo"].WithElement[any]
output = ReturnValue

I.e. data can flow out of the call, but only when it is stored in some element that is not known to be at index :foo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not sure that WithElement[any] is necessary. Test 53 has an example:

private class S53 extends SimpleSummarizedCallable {
  S53() { this = "s53" }
  
  override predicate propagatesFlowExt(string input, string output, boolean preservesValue) {
    preservesValue = true and
    input = "Argument[self].WithoutElement[:foo]" and
    output = "ReturnValue"
  }
}
def m53(i, h)
    h[:foo] = source("a")
    h[:bar] = source("b")
    h[i] = source("c")
    
    sink h[:foo] # $ hasValueFlow=a hasValueFlow=c
    sink h[:bar] # $ hasValueFlow=b hasValueFlow=c
    
    x = h.s53()
    
    sink x[:foo]
    sink x[:bar] # $ hasValueFlow=b
end

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would also incorrectly include flow for sink(source("d").s53()).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm... that seems very unintuitive to me, but you're certainly right. How do I explain this? Maybe something like:

An important point to note is that in a summary such as

input = "Argument[self].WithoutElement[0]" and
output = "ReturnValue"

if Argument[self] contains data, it will be copied to ReturnValue. If you only want to copy data in elements, and not in the container itself, add WithElement[any] to the input path:

input = "Argument[self].WithoutElement[0].WithElement[any]" and
output = "ReturnValue"

See tests 53 and 54 [todo: i'll add this test] for examples of this behaviour.

?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that looks good. WithoutElement[x] just says that data must not be stored inside a collection at index x, not that data must be stored inside a collection.

This is also the reason why something like a reverse method needs a summary like

input = "Argument[self].WithElement[any]" and
output = "ReturnValue"

and not

input = "Argument[self]" and
output = "ReturnValue"

@calumgrant
Copy link
Contributor

Ping @hmac - ok to merge (after resolving conflicts)?

@hmac
Copy link
Contributor Author

hmac commented Nov 16, 2022

There's a few comments above where I'd like to get a response from @hvitved before merging this.

@hmac hmac force-pushed the flow-summary-docs branch from c8f67ba to f49507e Compare November 25, 2022 03:55
@calumgrant calumgrant requested a review from asgerf November 28, 2022 09:54
@hmac hmac merged commit b70ca77 into github:main Dec 27, 2022
@hmac hmac deleted the flow-summary-docs branch December 27, 2022 21:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation no-change-note-required This PR does not need a change note Ruby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants