Skip to content

Conversation

hvitved
Copy link
Contributor

@hvitved hvitved commented Apr 1, 2022

Background

In Ruby, we are tracking flow through arrays with precise array index information, when available. That is, we are correctly able to handle

a[0] = tainted;
a[1] = not_tainted;
sink(a[0]); # bad
sink(a[1]); # good

Flow is tracked using the data flow library's notion of contents, reads, and stores, and we can model known/unknown array information as follows

newtype Content =
  KnownArrayElementContent(int i) { i in [0 .. 10] } or
  UnknownArrayElementContent()

Then the following statements

a1[0] = b1
a2[x] = b2
a3[1]
a4[x]

correspond to the following read/store steps

graph TD
  B1[b1]-->|"store(KnownArrayElementContent(0))"| A1["[post update] a1"]
  B2[b2]-->|"store(UnknownArrayElementContent())"| A2["[post update] a2"]
  A3-->|"read(UnknownArrayElementContent())"| A3I
  A3[a3]-->|"read(KnownArrayElementContent(1))"| A3I["a3[1]"]
  A4[a4]-->|"read(UnknownArrayElementContent())"| A4I["a4[x]"]
  A4-->|"read(KnownArrayElementContent([0..10]))"| A4I
Loading

assuming that x does not have a known constant value.

The Problem

In the example above, we can see how a4[x] gives rise to 11 read steps, since we need to be able to read out values stored at any known or unknown index. Unfortunately, we need to encode this up-front in the readStep relation, which means that all such reads will be replicated 11 times, and it would become even worse if we were to increase the range of KnownArrayElementContent.

Another issue is that when the data-flow library generates reverse store steps from read steps

a[x][1] = taint

we will get the following reverse store steps

graph TD
  A["[post update] a[x]"]-->|"store(UnknownArrayElementContent())"| A1["[post update] a"]
  A-->|"store(KnownArrayElementContent([0..10]))"| A1
Loading

whereas what we really wanted was just a single store step

graph TD
  A["[post update] a[x]"]-->|"store(UnknownArrayElementContent())"| A1["[post update] a"]
Loading

A Solution

This PR proposes a solution to both problems above, where we generalize the interface of the data-flow library for readStep and storeStep, to use a new type ContentSet instead of Content. ContentSet will allow us to represent sets of Contents:

newtype ContentSet =
  SingletonContent(Content c) or
  AnyArrayElementContent()

And we will be able to interpret ContentSets differently, depending on whether they occur in reads or in stores

Content getAStoreContent(ContentSet set) {
  set = SingletonContent(result)
  or
  // for reverse stores
  set = AnyArrayElementContent() and
  result = UnknownArrayElementContent()
}

Content getAReadContent(ContentSet set) {
  set = SingletonContent(result)
  or
  set = AnyArrayElementContent() and
  (result = UnknownArrayElementContent() or result = KnownArrayElementContent(_))
}

We will still be using Contents in access paths, but we will use ContentSets internally, and the data-flow library will make sure to avoid fan-out duplication via getAStoreContent/getAReadContent.

Returning to the original example

a1[0] = b1
a2[x] = b2
a3[1]
a4[x]

we will now have the following read/store steps:

graph TD
  B1[b1]-->|"store(SingletonContent(KnownArrayElementContent(0)))"| A1["[post update] a1"]
  B2[b2]-->|"store(AnyArrayElementContent()))"| A2["[post update] a2"]
  A3-->|"read(SingletonContent(UnknownArrayElementContent()))"| A3I
  A3[a3]-->|"read(SingletonContent(KnownArrayElementContent(1)))"| A3I["a3[1]"]
  A4[a4]-->|"read(AnyArrayElementContent())"| A4I["a4[x]"]
Loading

And for

a[x][1] = taint

we get just a single reverse store step

graph TD
  A["[post update] a[x]"]-->|"store(AnyArrayElementContent())"| A1["[post update] a"]
Loading

which translates via getAStoreContent into a store into UnknownArrayElementContent as desired.

PR Structure

This PR is structured into a sequence of logically self-contained commits, and I strongly encourage reviewing them in order:

  • a5040fd: Adds a Ruby test for the reverse store example above.
  • 309fd93: Introduces ContentSet into the data-flow library.
  • c4fbc61: Syncs the data flow files.
  • 725d76e: Implements ContentSet for Ruby, as per the description above.
  • d99bb65: Implements ContentSet for C++ as a simple bijection into Content.
  • b91858e: Implements ContentSet for Java as a simple bijection into Content.
  • 7113c1b: Implements ContentSet for C# as a simple bijection into Content.
  • 57f2a74: Implements ContentSet for Python as a simple bijection into Content.
  • 415a1c2: Updates shared Java/C# code for generating flow summaries.
  • cee527e: Updates the documentation on the shared data flow library.

@hvitved hvitved force-pushed the dataflow/interpret-read-store branch 3 times, most recently from 68503d8 to c55f66e Compare April 4, 2022 11:09
@hvitved hvitved force-pushed the dataflow/interpret-read-store branch from c55f66e to 415a1c2 Compare April 4, 2022 11:52
@hvitved hvitved changed the title Data flow: Add ContentExt, interpretRead, and interpretStore Data flow: Introduce ContentSet Apr 4, 2022
@hvitved hvitved marked this pull request as ready for review April 6, 2022 09:18
@hvitved hvitved requested review from a team as code owners April 6, 2022 09:18
@hvitved hvitved added the no-change-note-required This PR does not need a change note label Apr 6, 2022
Before
```
[2022-04-06 13:19:29] (96s) Tuple counts for DataFlowImpl2::Stage1::revFlowConsCand#7ad53399#ff/2@i14#aa10f2wi after 4.4s:
                      10681    ~0%     {2} r1 = SCAN DataFlowImpl2::Stage1::revFlow#7ad53399#fff#prev_delta OUTPUT In.0, In.2 'config'
                      982      ~1%     {3} r2 = JOIN r1 WITH DataFlowImpl2::readSet#7ad53399#ffff_2301#join_rhs ON FIRST 2 OUTPUT Rhs.3, Lhs.1 'config', Rhs.2
                      83691528 ~2%     {3} r3 = JOIN r2 WITH DataFlowPublic::ContentSet::getAReadContent#dispred#f0820431#ff ON FIRST 1 OUTPUT Lhs.1 'config', Lhs.2, Rhs.1 'c'
                      83581763 ~2%     {3} r4 = r3 AND NOT DataFlowImpl2::Stage1::revFlowConsCand#7ad53399#ff#prev(Lhs.2 'c', Lhs.0 'config')
                      83581763 ~0%     {3} r5 = SCAN r4 OUTPUT In.2 'c', In.0 'config', In.1
                      0        ~0%     {3} r6 = JOIN r5 WITH DataFlowImpl2::Stage1::fwdFlowConsCand#7ad53399#ff ON FIRST 2 OUTPUT Lhs.2, Lhs.1 'config', Lhs.0 'c'
                      0        ~0%     {2} r7 = JOIN r6 WITH DataFlowImpl2::Stage1::fwdFlow#7ad53399#2#fff_02#join_rhs ON FIRST 2 OUTPUT Lhs.2 'c', Lhs.1 'config'
                                       return r7
```

After
```
[2022-04-06 13:44:38] (6s) Tuple counts for DataFlowImpl2::Stage1::revFlowConsCand#7ad53399#ff/2@i14#5abbf2wn after 6ms:
                      10681  ~0%     {2} r1 = SCAN DataFlowImpl2::Stage1::revFlow#7ad53399#fff#prev_delta OUTPUT In.0, In.2 'config'
                      982    ~1%     {3} r2 = JOIN r1 WITH DataFlowImpl2::readSet#7ad53399#ffff_2301#join_rhs ON FIRST 2 OUTPUT Rhs.3, Lhs.1 'config', Rhs.2
                      109765 ~0%     {3} r3 = JOIN r2 WITH DataFlowImpl2::Stage1::fwdFlowConsCandSet#7ad53399#fff#reorder_0_2_1 ON FIRST 2 OUTPUT Lhs.1 'config', Lhs.2, Rhs.2 'c'
                      0      ~0%     {3} r4 = r3 AND NOT DataFlowImpl2::Stage1::revFlowConsCand#7ad53399#ff#prev(Lhs.2 'c', Lhs.0 'config')
                      0      ~0%     {3} r5 = SCAN r4 OUTPUT In.1, In.0 'config', In.2 'c'
                      0      ~0%     {2} r6 = JOIN r5 WITH DataFlowImpl2::Stage1::fwdFlow#7ad53399#2#fff_02#join_rhs ON FIRST 2 OUTPUT Lhs.2 'c', Lhs.1 'config'
                                     return r6
```
@hvitved
Copy link
Contributor Author

hvitved commented Apr 6, 2022

I have pushed another commit that is not strictly needed yet (it will be for #7914), but it belongs logically with this PR.

@asgerf
Copy link
Contributor

asgerf commented Apr 8, 2022

I think there's a point in the design space worth talking about here. In the literature of (sound) JavaScript analysis the standard solution is to have two "unknown" contents rather than one:

  • StoredWithUnknownKey: any store with an unknown key will write to here
  • ReadWithUnknownKey: any read with unknown key will read from here

The rule is then:

  • Store to unknown key: writes to StoredWithUnknownKey and ReadFromUnknownKey
  • Store to known key K: writes to K and ReadFromUnknownKey
  • Read from unknown key: reads from ReadFromUnknownKey
  • Read from known key K: reads from K and StoredWithUnknownKey

(Another common solution is to collapse a heterogenous array into a homogenous array as soon as any operation acts on it with non-constant index, but that is hard to apply in our setting).

I can't immediately say whether this is better than ContentSet, or if it actually maps down to the same thing. I guess you'd still need to inform the core library that the reverse of ReadFromUnknownKey is StoredWithUnknownKey. And in terms of writing access paths by hand, I can certainly see the convenience in being able to talk about the "unknown element" without having to specify if it's the read or write variant.

Anyway, this isn't meant as an objection, just thought I'd mention it in case it helps somehow.

@aschackmull
Copy link
Contributor

I think the StoredWithUnknownKey and ReadFromUnknownKey approach sounds promising, although I'd probably tweak it a bit to be StoredWithUnknownKey and StoredWithKnownKey instead with the following rules:

  • Store to unknown key: writes to StoredWithUnknownKey
  • Store to known key K: writes to K and StoredWithKnownKey
  • Read from unknown key: reads from StoredWithUnknownKey and StoredWithKnownKey
  • Read from known key K: reads from K and StoredWithUnknownKey

This should be equivalent, but ought to perform a bit better in an analysis that is forward-first, as our is.

@hvitved
Copy link
Contributor Author

hvitved commented Apr 19, 2022

Thanks for your inputs @asgerf and @aschackmull . I agree that the above should work as well (except we would still have to handle reverse stores), but I think I have a slight preference for the ContentSet approach because:

  1. As @asgerf points out, there will be only one "unknown element" content, so logic that works with Contents (such as inspecting access paths inside PathNodes) will not have to know about the two types of unknowns. And, if we for some reason care about the exact index, we will lose that information when a store is matched up with an uknown read.
  2. One could imagine future extensions based on range analysis, where we will be able to refine unknown reads/stores based on known bounds. With the ContentSet approach, we would again not have to touch Content, but only define new (internal) ContentSet entities and define their interpretation in terms of getAStoreContent and getAReadContent.

@aschackmull
Copy link
Contributor

As discussed offline: Let's move the store-related ContentSet-to-Content join into DataFlowImplCommon and reduce the diff in DataFlowImpl for the store steps.

@hvitved hvitved force-pushed the dataflow/interpret-read-store branch 2 times, most recently from 26e7971 to 360e3e7 Compare April 22, 2022 13:07
@hvitved hvitved force-pushed the dataflow/interpret-read-store branch from 360e3e7 to bc6ee10 Compare April 22, 2022 13:10
@hvitved hvitved removed the no-change-note-required This PR does not need a change note label Apr 22, 2022
read(n1, c, n2, pragma[only_bind_into](config))
revFlowIsReadAndStored(pragma[only_bind_into](c), pragma[only_bind_into](config)) and
read(n1, c, n2, pragma[only_bind_into](config)) and
revFlow(n2, pragma[only_bind_into](config))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This join-order is not optimal, but presumably good enough. Realising the optimal order would require another helper predicate, so I'm not suggesting any change here, merely noting the fact.

Copy link
Contributor

@aschackmull aschackmull left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@hvitved hvitved merged commit bffa8fa into github:main Apr 25, 2022
@hvitved hvitved deleted the dataflow/interpret-read-store branch April 25, 2022 10:17
jketema added a commit to jketema/codeql that referenced this pull request Apr 27, 2022
…t-read-store"

This reverts commit bffa8fa, reversing
changes made to ba2a884.
owen-mc added a commit to owen-mc/codeql that referenced this pull request Nov 9, 2022
In github#8641, `localFlowExit` was
changed to use `Stage2::readStepCand` instead of `read`, which means
that the big-step relation is broken up less. This causes test result
changes. Nothing is lost from the `select` clause, but some results may
have fewer paths, and fewer nodes and edges are output in the test
results.
owen-mc added a commit to owen-mc/codeql that referenced this pull request Nov 9, 2022
In github#8641, `localFlowExit` was
changed to use `Stage2::readStepCand` instead of `read`, which means
that the big-step relation is broken up less. This causes test result
changes. Nothing is lost from the `select` clause, but some results may
have fewer paths, and fewer nodes and edges are output in the test
results.
owen-mc added a commit to owen-mc/codeql that referenced this pull request Nov 9, 2022
In github#8641, `localFlowExit` was
changed to use `Stage2::readStepCand` instead of `read`, which means
that the big-step relation is broken up less. This causes test result
changes. Nothing is lost from the `select` clause, but some results may
have fewer paths, and fewer nodes and edges are output in the test
results.
owen-mc added a commit to owen-mc/codeql that referenced this pull request Nov 17, 2022
In github#8641, `localFlowExit` was
changed to use `Stage2::readStepCand` instead of `read`, which means
that the big-step relation is broken up less. This causes test result
changes. Nothing is lost from the `select` clause, but some results may
have fewer paths, and fewer nodes and edges are output in the test
results.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants