Data flow: Introduce `ContentSet` #8641

hvitved · 2022-04-01T14:37:59Z

Background

In Ruby, we are tracking flow through arrays with precise array index information, when available. That is, we are correctly able to handle

a[0] = tainted;
a[1] = not_tainted;
sink(a[0]); # bad
sink(a[1]); # good

Flow is tracked using the data flow library's notion of contents, reads, and stores, and we can model known/unknown array information as follows

newtype Content =
  KnownArrayElementContent(int i) { i in [0 .. 10] } or
  UnknownArrayElementContent()

Then the following statements

a1[0] = b1
a2[x] = b2
a3[1]
a4[x]

correspond to the following read/store steps

graph TD
  B1[b1]-->|"store(KnownArrayElementContent(0))"| A1["[post update] a1"]
  B2[b2]-->|"store(UnknownArrayElementContent())"| A2["[post update] a2"]
  A3-->|"read(UnknownArrayElementContent())"| A3I
  A3[a3]-->|"read(KnownArrayElementContent(1))"| A3I["a3[1]"]
  A4[a4]-->|"read(UnknownArrayElementContent())"| A4I["a4[x]"]
  A4-->|"read(KnownArrayElementContent([0..10]))"| A4I

assuming that x does not have a known constant value.

The Problem

In the example above, we can see how a4[x] gives rise to 11 read steps, since we need to be able to read out values stored at any known or unknown index. Unfortunately, we need to encode this up-front in the readStep relation, which means that all such reads will be replicated 11 times, and it would become even worse if we were to increase the range of KnownArrayElementContent.

Another issue is that when the data-flow library generates reverse store steps from read steps

a[x][1] = taint

we will get the following reverse store steps

graph TD
  A["[post update] a[x]"]-->|"store(UnknownArrayElementContent())"| A1["[post update] a"]
  A-->|"store(KnownArrayElementContent([0..10]))"| A1

whereas what we really wanted was just a single store step

graph TD
  A["[post update] a[x]"]-->|"store(UnknownArrayElementContent())"| A1["[post update] a"]

A Solution

This PR proposes a solution to both problems above, where we generalize the interface of the data-flow library for readStep and storeStep, to use a new type ContentSet instead of Content. ContentSet will allow us to represent sets of Contents:

newtype ContentSet =
  SingletonContent(Content c) or
  AnyArrayElementContent()

And we will be able to interpret ContentSets differently, depending on whether they occur in reads or in stores

Content getAStoreContent(ContentSet set) {
  set = SingletonContent(result)
  or
  // for reverse stores
  set = AnyArrayElementContent() and
  result = UnknownArrayElementContent()
}

Content getAReadContent(ContentSet set) {
  set = SingletonContent(result)
  or
  set = AnyArrayElementContent() and
  (result = UnknownArrayElementContent() or result = KnownArrayElementContent(_))
}

We will still be using Contents in access paths, but we will use ContentSets internally, and the data-flow library will make sure to avoid fan-out duplication via getAStoreContent/getAReadContent.

Returning to the original example

a1[0] = b1
a2[x] = b2
a3[1]
a4[x]

we will now have the following read/store steps:

graph TD
  B1[b1]-->|"store(SingletonContent(KnownArrayElementContent(0)))"| A1["[post update] a1"]
  B2[b2]-->|"store(AnyArrayElementContent()))"| A2["[post update] a2"]
  A3-->|"read(SingletonContent(UnknownArrayElementContent()))"| A3I
  A3[a3]-->|"read(SingletonContent(KnownArrayElementContent(1)))"| A3I["a3[1]"]
  A4[a4]-->|"read(AnyArrayElementContent())"| A4I["a4[x]"]

And for

a[x][1] = taint

we get just a single reverse store step

graph TD
  A["[post update] a[x]"]-->|"store(AnyArrayElementContent())"| A1["[post update] a"]

which translates via getAStoreContent into a store into UnknownArrayElementContent as desired.

PR Structure

This PR is structured into a sequence of logically self-contained commits, and I strongly encourage reviewing them in order:

a5040fd: Adds a Ruby test for the reverse store example above.
309fd93: Introduces ContentSet into the data-flow library.
c4fbc61: Syncs the data flow files.
725d76e: Implements ContentSet for Ruby, as per the description above.
d99bb65: Implements ContentSet for C++ as a simple bijection into Content.
b91858e: Implements ContentSet for Java as a simple bijection into Content.
7113c1b: Implements ContentSet for C# as a simple bijection into Content.
57f2a74: Implements ContentSet for Python as a simple bijection into Content.
415a1c2: Updates shared Java/C# code for generating flow summaries.
cee527e: Updates the documentation on the shared data flow library.

Before ``` [2022-04-06 13:19:29] (96s) Tuple counts for DataFlowImpl2::Stage1::revFlowConsCand#7ad53399#ff/2@i14#aa10f2wi after 4.4s: 10681 ~0% {2} r1 = SCAN DataFlowImpl2::Stage1::revFlow#7ad53399#fff#prev_delta OUTPUT In.0, In.2 'config' 982 ~1% {3} r2 = JOIN r1 WITH DataFlowImpl2::readSet#7ad53399#ffff_2301#join_rhs ON FIRST 2 OUTPUT Rhs.3, Lhs.1 'config', Rhs.2 83691528 ~2% {3} r3 = JOIN r2 WITH DataFlowPublic::ContentSet::getAReadContent#dispred#f0820431#ff ON FIRST 1 OUTPUT Lhs.1 'config', Lhs.2, Rhs.1 'c' 83581763 ~2% {3} r4 = r3 AND NOT DataFlowImpl2::Stage1::revFlowConsCand#7ad53399#ff#prev(Lhs.2 'c', Lhs.0 'config') 83581763 ~0% {3} r5 = SCAN r4 OUTPUT In.2 'c', In.0 'config', In.1 0 ~0% {3} r6 = JOIN r5 WITH DataFlowImpl2::Stage1::fwdFlowConsCand#7ad53399#ff ON FIRST 2 OUTPUT Lhs.2, Lhs.1 'config', Lhs.0 'c' 0 ~0% {2} r7 = JOIN r6 WITH DataFlowImpl2::Stage1::fwdFlow#7ad53399#2#fff_02#join_rhs ON FIRST 2 OUTPUT Lhs.2 'c', Lhs.1 'config' return r7 ``` After ``` [2022-04-06 13:44:38] (6s) Tuple counts for DataFlowImpl2::Stage1::revFlowConsCand#7ad53399#ff/2@i14#5abbf2wn after 6ms: 10681 ~0% {2} r1 = SCAN DataFlowImpl2::Stage1::revFlow#7ad53399#fff#prev_delta OUTPUT In.0, In.2 'config' 982 ~1% {3} r2 = JOIN r1 WITH DataFlowImpl2::readSet#7ad53399#ffff_2301#join_rhs ON FIRST 2 OUTPUT Rhs.3, Lhs.1 'config', Rhs.2 109765 ~0% {3} r3 = JOIN r2 WITH DataFlowImpl2::Stage1::fwdFlowConsCandSet#7ad53399#fff#reorder_0_2_1 ON FIRST 2 OUTPUT Lhs.1 'config', Lhs.2, Rhs.2 'c' 0 ~0% {3} r4 = r3 AND NOT DataFlowImpl2::Stage1::revFlowConsCand#7ad53399#ff#prev(Lhs.2 'c', Lhs.0 'config') 0 ~0% {3} r5 = SCAN r4 OUTPUT In.1, In.0 'config', In.2 'c' 0 ~0% {2} r6 = JOIN r5 WITH DataFlowImpl2::Stage1::fwdFlow#7ad53399#2#fff_02#join_rhs ON FIRST 2 OUTPUT Lhs.2 'c', Lhs.1 'config' return r6 ```

hvitved · 2022-04-06T12:03:03Z

I have pushed another commit that is not strictly needed yet (it will be for #7914), but it belongs logically with this PR.

asgerf · 2022-04-08T12:36:56Z

I think there's a point in the design space worth talking about here. In the literature of (sound) JavaScript analysis the standard solution is to have two "unknown" contents rather than one:

StoredWithUnknownKey: any store with an unknown key will write to here
ReadWithUnknownKey: any read with unknown key will read from here

The rule is then:

Store to unknown key: writes to StoredWithUnknownKey and ReadFromUnknownKey
Store to known key K: writes to K and ReadFromUnknownKey
Read from unknown key: reads from ReadFromUnknownKey
Read from known key K: reads from K and StoredWithUnknownKey

(Another common solution is to collapse a heterogenous array into a homogenous array as soon as any operation acts on it with non-constant index, but that is hard to apply in our setting).

I can't immediately say whether this is better than ContentSet, or if it actually maps down to the same thing. I guess you'd still need to inform the core library that the reverse of ReadFromUnknownKey is StoredWithUnknownKey. And in terms of writing access paths by hand, I can certainly see the convenience in being able to talk about the "unknown element" without having to specify if it's the read or write variant.

Anyway, this isn't meant as an objection, just thought I'd mention it in case it helps somehow.

aschackmull · 2022-04-08T14:10:02Z

I think the StoredWithUnknownKey and ReadFromUnknownKey approach sounds promising, although I'd probably tweak it a bit to be StoredWithUnknownKey and StoredWithKnownKey instead with the following rules:

Store to unknown key: writes to StoredWithUnknownKey
Store to known key K: writes to K and StoredWithKnownKey
Read from unknown key: reads from StoredWithUnknownKey and StoredWithKnownKey
Read from known key K: reads from K and StoredWithUnknownKey

This should be equivalent, but ought to perform a bit better in an analysis that is forward-first, as our is.

hvitved · 2022-04-19T12:36:59Z

Thanks for your inputs @asgerf and @aschackmull . I agree that the above should work as well (except we would still have to handle reverse stores), but I think I have a slight preference for the ContentSet approach because:

As @asgerf points out, there will be only one "unknown element" content, so logic that works with Contents (such as inspecting access paths inside PathNodes) will not have to know about the two types of unknowns. And, if we for some reason care about the exact index, we will lose that information when a store is matched up with an uknown read.
One could imagine future extensions based on range analysis, where we will be able to refine unknown reads/stores based on known bounds. With the ContentSet approach, we would again not have to touch Content, but only define new (internal) ContentSet entities and define their interpretation in terms of getAStoreContent and getAReadContent.

cpp/ql/lib/semmle/code/cpp/dataflow/internal/DataFlowImpl.qll

aschackmull · 2022-04-22T12:29:53Z

As discussed offline: Let's move the store-related ContentSet-to-Content join into DataFlowImplCommon and reduce the diff in DataFlowImpl for the store steps.

…read-store

cpp/ql/lib/semmle/code/cpp/dataflow/internal/DataFlowImpl.qll

aschackmull · 2022-04-25T07:53:50Z

cpp/ql/lib/semmle/code/cpp/dataflow/internal/DataFlowImpl.qll

-    read(n1, c, n2, pragma[only_bind_into](config))
+    revFlowIsReadAndStored(pragma[only_bind_into](c), pragma[only_bind_into](config)) and
+    read(n1, c, n2, pragma[only_bind_into](config)) and
+    revFlow(n2, pragma[only_bind_into](config))


This join-order is not optimal, but presumably good enough. Realising the optimal order would require another helper predicate, so I'm not suggesting any change here, merely noting the fact.

aschackmull

LGTM.

…t-read-store" This reverts commit bffa8fa, reversing changes made to ba2a884.

In github#8641, `localFlowExit` was changed to use `Stage2::readStepCand` instead of `read`, which means that the big-step relation is broken up less. This causes test result changes. Nothing is lost from the `select` clause, but some results may have fewer paths, and fewer nodes and edges are output in the test results.

github-actions bot added C# C++ Java Python Ruby labels Apr 1, 2022

hvitved force-pushed the dataflow/interpret-read-store branch 3 times, most recently from 68503d8 to c55f66e Compare April 4, 2022 11:09

hvitved added 9 commits April 4, 2022 13:51

Ruby: Add data-flow test for reverse array stores

a5040fd

Data flow: Introduce ContentSet

309fd93

Data flow: Sync files

c4fbc61

Ruby: Implement ContentSet

725d76e

C++: Implement ContentSet

d99bb65

Java: Implement ContentSet

b91858e

C#: Implement ContentSet

7113c1b

Python: Implement ContentSet

57f2a74

Java/C#: Update CaptureModels.qll

415a1c2

hvitved force-pushed the dataflow/interpret-read-store branch from c55f66e to 415a1c2 Compare April 4, 2022 11:52

hvitved changed the title ~~Data flow: Add ContentExt, interpretRead, and interpretStore~~ Data flow: Introduce ContentSet Apr 4, 2022

Document flow through arrays in dataflow.md

cee527e

github-actions bot added the documentation label Apr 6, 2022

hvitved marked this pull request as ready for review April 6, 2022 09:18

hvitved requested review from a team as code owners April 6, 2022 09:18

hvitved assigned aschackmull Apr 6, 2022

hvitved added the no-change-note-required This PR does not need a change note label Apr 6, 2022

aschackmull reviewed Apr 21, 2022

View reviewed changes

cpp/ql/lib/semmle/code/cpp/dataflow/internal/DataFlowImpl.qll Show resolved Hide resolved

Merge remote-tracking branch 'upstream/main' into dataflow/interpret-…

b033f10

…read-store

hvitved force-pushed the dataflow/interpret-read-store branch 2 times, most recently from 26e7971 to 360e3e7 Compare April 22, 2022 13:07

hvitved added 2 commits April 22, 2022 15:09

Data flow: Inline getAStoreContent up-front

488a4ed

Data flow: Sync files

bc6ee10

hvitved force-pushed the dataflow/interpret-read-store branch from 360e3e7 to bc6ee10 Compare April 22, 2022 13:10

Update CaptureModels.qll

3edc72f

hvitved removed the no-change-note-required This PR does not need a change note label Apr 22, 2022

Add change notes

cf0a1e7

aschackmull reviewed Apr 25, 2022

View reviewed changes

cpp/ql/lib/semmle/code/cpp/dataflow/internal/DataFlowImpl.qll Outdated Show resolved Hide resolved

aschackmull reviewed Apr 25, 2022

View reviewed changes

Data flow: Simplify revFlowStore

2466288

aschackmull approved these changes Apr 25, 2022

View reviewed changes

hvitved merged commit bffa8fa into github:main Apr 25, 2022

hvitved deleted the dataflow/interpret-read-store branch April 25, 2022 10:17

jketema added a commit to jketema/codeql that referenced this pull request Apr 27, 2022

Revert "Merge pull request github#8641 from hvitved/dataflow/interpre…

7dd382a

…t-read-store" This reverts commit bffa8fa, reversing changes made to ba2a884.

hvitved mentioned this pull request May 18, 2022

Ruby: Data-flow through hashes #8942

Merged

owen-mc added the DataFlow Library label Nov 2, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Data flow: Introduce `ContentSet` #8641

Data flow: Introduce `ContentSet` #8641

Uh oh!

hvitved commented Apr 1, 2022 •

edited

Loading

Uh oh!

hvitved commented Apr 6, 2022

Uh oh!

asgerf commented Apr 8, 2022

Uh oh!

aschackmull commented Apr 8, 2022

Uh oh!

hvitved commented Apr 19, 2022

Uh oh!

Uh oh!

aschackmull commented Apr 22, 2022

Uh oh!

Uh oh!

aschackmull Apr 25, 2022

Uh oh!

aschackmull left a comment

Uh oh!

Uh oh!

Data flow: Introduce ContentSet #8641

Data flow: Introduce ContentSet #8641

Uh oh!

Conversation

hvitved commented Apr 1, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Background

The Problem

A Solution

PR Structure

Uh oh!

hvitved commented Apr 6, 2022

Uh oh!

asgerf commented Apr 8, 2022

Uh oh!

aschackmull commented Apr 8, 2022

Uh oh!

hvitved commented Apr 19, 2022

Uh oh!

Uh oh!

aschackmull commented Apr 22, 2022

Uh oh!

Uh oh!

aschackmull Apr 25, 2022

Choose a reason for hiding this comment

Uh oh!

aschackmull left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Data flow: Introduce `ContentSet` #8641

Data flow: Introduce `ContentSet` #8641

hvitved commented Apr 1, 2022 •

edited

Loading