Local Provenance Compression

SPADE models data provenance as a property graph, with a set of key-value annotations on each vertex and edge that describe activity in the monitored domain. Since provenance graphs can grow large, it is space-efficient to store them in compressed form. However, a conventional whole-file compression scheme requires the entire graph to be decompressed before queries such as path traversals can be performed. This is untenable for "big provenance".
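As a concrete illustration of the data model, a provenance graph can be represented as vertices and edges that each carry an annotation map. The class and field names below are hypothetical, chosen for the sketch rather than taken from SPADE's codebase.

```python
# Minimal sketch of a provenance property graph: every vertex and edge
# carries key-value annotations describing the monitored activity.
# The class and field names are illustrative, not SPADE's actual types.

class Vertex:
    def __init__(self, vid, annotations):
        self.vid = vid
        self.annotations = dict(annotations)

class Edge:
    def __init__(self, src, dst, annotations):
        self.src = src          # id of the source vertex
        self.dst = dst          # id of the destination vertex
        self.annotations = dict(annotations)

# A process reading a file, modeled as two vertices and one edge.
process = Vertex(1, {"type": "Process", "pid": "4242", "name": "python"})
artifact = Vertex(2, {"type": "Artifact", "path": "/etc/passwd"})
used = Edge(1, 2, {"type": "Used", "operation": "read"})
```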

SPADE's CompressedStorage module provides a proof of concept for multiple optimizations over the previous state of the art. In particular, the problem is decomposed into compressing the graph based on its structural properties and, independently, using dictionary-based compression to encode the annotation sets more succinctly. A sample lineage query is also implemented. However, this module is not integrated with SPADE's rich query surface, QuickGrail. Further, the current design does not achieve reductions that match whole-file compression.
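To make the second half of this decomposition concrete: annotation sets are often highly repetitive across neighboring graph elements, so a shared dictionary can replace full key-value strings with small integer codes. The sketch below is a simplified illustration of dictionary-based encoding, not the module's actual scheme.

```python
# Simplified dictionary-based encoding of annotation sets: repeated
# key-value pairs are replaced by small integer codes drawn from a
# shared dictionary. This illustrates the idea, not SPADE's format.

class AnnotationDictionary:
    def __init__(self):
        self.code_of = {}   # (key, value) -> integer code
        self.pair_of = []   # integer code -> (key, value)

    def encode(self, annotations):
        codes = []
        for pair in sorted(annotations.items()):
            if pair not in self.code_of:
                self.code_of[pair] = len(self.pair_of)
                self.pair_of.append(pair)
            codes.append(self.code_of[pair])
        return codes

    def decode(self, codes):
        return dict(self.pair_of[c] for c in codes)

d = AnnotationDictionary()
e1 = d.encode({"type": "Process", "name": "python"})
e2 = d.encode({"type": "Process", "name": "bash"})   # reuses the "type" code
assert d.decode(e1) == {"type": "Process", "name": "python"}
```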

Project Idea 1: Adding (De)Compression Extensions to the Data Ingestion / Query Layer

The goal will be to implement a new version of CompressedStorage that supports (possibly a subset of) the instruction set of QuickGrail. Since SPADE may record "big provenance", support for querying it efficiently is provided in the form of a custom query language. All queries issued in this language are translated into an intermediate representation (IR), framed as a graph processing language. Currently, support is implemented for translating this IR into the native query languages of Neo4j, Postgres, and Quickstep. The task will be to add support at the IR level for (de)compressing graph elements.
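One plausible shape for this task is sketched below, assuming a simple IR modeled as a list of graph operations; the operation names and the rewriting pass are assumptions for illustration, not QuickGrail's actual IR. The pass inserts an explicit decompression step before any operation that must inspect raw graph elements.

```python
# Hypothetical sketch: an IR-level pass that inserts decompression
# before operations that read graph elements. The operation names and
# pass structure are illustrative assumptions, not QuickGrail's IR.

COMPRESSED_INPUT_OPS = {"GetVertex", "GetEdge", "GetLineage"}

def insert_decompression(ir_program):
    """Rewrite a list of (op, args) IR instructions so that any
    operation reading graph elements is preceded by a Decompress step."""
    rewritten = []
    for op, args in ir_program:
        if op in COMPRESSED_INPUT_OPS:
            rewritten.append(("Decompress", args))
        rewritten.append((op, args))
    return rewritten

program = [("GetLineage", {"start": "$v", "depth": 3}),
           ("Limit", {"count": 100})]
print(insert_decompression(program))
```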

Project Idea 2: Adapting the Temporal Window Used for Property Compression

The goal will be to allow the window of temporal context used to build the dictionary for encoding annotations to be varied and, possibly, dynamically adapted. A smaller window is expected to yield higher query performance; a larger window should result in better annotation compression. Developing a variable-window approach will support tradeoffs between these. Once the window can be varied, a heuristic can be added to determine when to adjust its size.
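A minimal sketch of the variable-window idea follows, assuming annotations arrive as a time-ordered stream: the dictionary is rebuilt every `window` elements, and a toy heuristic widens the window when the dictionary hit rate is high (favoring compression) and narrows it when the hit rate is low (favoring query locality). All thresholds are illustrative, not tuned values.

```python
# Toy sketch of an adaptive temporal window for dictionary-based
# annotation compression. The thresholds and adaptation rule are
# illustrative assumptions, not a tuned heuristic.

def compress_stream(annotation_sets, window=1000,
                    min_window=100, max_window=100_000):
    dictionary, hits, seen, since_rebuild, out = {}, 0, 0, 0, []
    for annos in annotation_sets:
        for pair in sorted(annos.items()):
            seen += 1
            if pair in dictionary:
                hits += 1
            else:
                dictionary[pair] = len(dictionary)
            out.append(dictionary[pair])
        since_rebuild += 1
        if since_rebuild >= window:
            hit_rate = hits / max(seen, 1)
            if hit_rate > 0.9:
                window = min(window * 2, max_window)   # better compression
            elif hit_rate < 0.5:
                window = max(window // 2, min_window)  # smaller dictionaries
            # (A real encoder would also emit each rebuilt dictionary.)
            dictionary, hits, seen, since_rebuild = {}, 0, 0, 0
    return out
```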


Distributed Provenance Sketches

Data provenance records may be used to ascertain the veracity of claims about activity that transpired on a computational platform. In practice, the response to a query about the provenance of a digital artifact may span multiple, independent administrative domains. In cases where such information can provide evidence that contradicts a statement about past activity (or lack thereof), agents may be incentivized to fraudulently alter the logs of system activity on hosts under their control. One possible approach to counter this is for each machine in a network to periodically share local provenance records with others, effectively committing to a claim about its state. While this does not preclude a priori deception, it facilitates detection of history-revising attacks.
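A minimal sketch of the commitment step, assuming each host periodically hashes the provenance records accumulated in the current epoch and shares only the digest with peers; the message format and the use of SHA-256 are illustrative choices, not SPADE's protocol.

```python
# Sketch of a periodic provenance commitment: a host hashes the records
# from the current epoch and shares only the digest. Peers can later
# detect history revision by fetching the records and checking them
# against the committed digest. Format and hash are illustrative.

import hashlib, json, time

def commit_epoch(records, host_id, epoch):
    canonical = "\n".join(json.dumps(r, sort_keys=True) for r in records)
    digest = hashlib.sha256(canonical.encode()).hexdigest()
    return {"host": host_id, "epoch": epoch, "count": len(records),
            "sha256": digest, "timestamp": time.time()}

def verify_epoch(commitment, records):
    # Recompute the digest from the claimed records and compare.
    recomputed = commit_epoch(records, commitment["host"],
                              commitment["epoch"])
    return recomputed["sha256"] == commitment["sha256"]
```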

Project Idea 3: Design and Implementation of New Provenance Sketches

Since each host could generate gigabytes of logs a day, the goal will be to design a summary data structure (based on a Bloom filter, Cuckoo hash, or similar construct) to represent provenance sketches. The aim is to reduce the storage and network overhead of peers making provenance integrity commitments. Earlier versions of SPADE included limited variants of such sketches; the goal will be to learn from prior findings to design an improved version. The design can then be prototyped in the context of SPADE's distributed query support.
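As a starting point, the sketch below shows a minimal Bloom filter over serialized provenance elements, using double hashing derived from SHA-256. The sizes, hash scheme, and record serialization are illustrative assumptions.

```python
# Minimal Bloom filter over provenance elements, using double hashing
# derived from SHA-256. Sizes are illustrative; a real sketch would
# tune m and k to the target false-positive rate.

import hashlib

class BloomSketch:
    def __init__(self, m_bits=1 << 20, k=7):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _indexes(self, item):
        h = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(h[:8], "big")
        h2 = int.from_bytes(h[8:16], "big") | 1  # odd stride: coprime with m
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def __contains__(self, item):
        return all(self.bits[idx // 8] & (1 << (idx % 8))
                   for idx in self._indexes(item))

s = BloomSketch()
s.add("Process pid=4242 exec=/usr/bin/python")
assert "Process pid=4242 exec=/usr/bin/python" in s
```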

Project Idea 4: Data-driven Semantics-aware Sketch Construction

The goal will be to model the tradeoffs between the precision of detecting integrity violations, the quantity of data that would need to be transmitted over network links, and the freshness of the answers (based on the frequency of updates). This can be done using published provenance datasets. Further analysis can explore how to semantically partition provenance graphs to provide a higher utility-to-storage ratio than a baseline approach. This can then be implemented as a standalone tool for constructing provenance sketches.
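One concrete starting point for the modeling is shown below: for a Bloom-filter-based sketch, the standard false-positive estimate FPR ≈ (1 − e^(−kn/m))^k ties detection precision directly to the bits transmitted per element, so sweeping the sketch size and commitment frequency over a dataset yields the precision/bandwidth/freshness surface. The parameter values swept here are a toy illustration, not measured results.

```python
# Toy sweep of the precision/bandwidth tradeoff for a Bloom-filter
# sketch, using the standard false-positive estimate
#   FPR ~= (1 - exp(-k * n / m)) ** k
# where n = elements inserted, m = bits, k = hash functions.
# The element count and sizes are illustrative, not measured values.

import math

def false_positive_rate(n, m, k):
    return (1.0 - math.exp(-k * n / m)) ** k

n = 1_000_000                                # provenance elements per epoch
for bits_per_element in (4, 8, 12, 16):
    m = n * bits_per_element
    k = max(1, round(m / n * math.log(2)))   # optimal k for given m and n
    fpr = false_positive_rate(n, m, k)
    print(f"{bits_per_element:2d} bits/elem, k={k}: "
          f"FPR={fpr:.2e}, sketch={m // 8 / 1e6:.1f} MB")
```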


Container Provenance Analysis

The user-space notion of a container is implemented by virtualizing specific resources using namespaces in the Linux kernel. The choice of resources for which such support has been added was motivated by interest in more efficient resource sharing rather than security. Consequently, there are global resources in the kernel that are not isolated from containers. In principle, system calls may expose this residual sharing to applications. Container management infrastructure effects isolation between instances by leveraging various security constructs in an ad hoc manner, such as disabling access to resources at specific paths in the filesystem.
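For context on what namespace membership looks like at the system level, the snippet below enumerates a process's namespace identifiers from Linux's /proc filesystem; two processes share a namespace exactly when the corresponding inode values match. This requires a Linux host.

```python
# Enumerate the namespace identifiers of a process from /proc. On
# Linux, /proc/<pid>/ns/ holds symlinks such as "pid:[4026531836]";
# two processes are in the same namespace iff the inode values match.

import os

def namespaces(pid):
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for name in os.listdir(ns_dir):
        # Each entry is a symlink of the form "type:[inode]".
        result[name] = os.readlink(os.path.join(ns_dir, name))
    return result

print(namespaces(os.getpid()))  # e.g., {'pid': 'pid:[4026531836]', ...}
```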

Current commodity container management tools support numerous configuration options. Further, they accommodate flexible security by allowing the posture to be specified in detail. The combination may give rise to inadvertent gaps. For example, exceptions may be explicitly added to access control policies to allow specific cross-container flows. It is likely that these exceptions have not been modeled, and in particular that their composition has not been validated as safe. In the presence of the leaks that may result, monitoring whole-system provenance can facilitate detection of the use of cross-container channels.

Project Idea 5: Detecting Isolation Violations

The CamFlow project uses Linux Security Module hooks to infer data provenance. SPADE's CamFlow Reporter can be used to ingest the emitted records. The resulting graph is fine-grained, with thread- and file-specific details represented in multiple connected vertices. Consequently, detecting the use of cross-container channels requires identifying connectivity between subgraphs, subject to the constraint that there are differences in the namespace annotations of constituent vertices. The goal is to design and prototype an efficient approach to do so, assuming access to SPADE's query surface.
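A minimal sketch of the detection constraint follows, assuming vertices carry namespace annotations (the annotation keys below are hypothetical): a breadth-first search from a starting vertex that reports any reachable vertex whose namespace annotations disagree with the start's.

```python
# Sketch of cross-namespace connectivity detection: BFS from a start
# vertex, flagging reachable vertices whose namespace annotations
# differ from the start's. Annotation keys are hypothetical.

from collections import deque

NS_KEYS = ("mount namespace", "pid namespace", "net namespace")

def cross_namespace_flows(vertices, adjacency, start):
    """vertices: id -> annotation dict; adjacency: id -> list of ids."""
    origin = {k: vertices[start].get(k) for k in NS_KEYS}
    flagged, visited, queue = [], {start}, deque([start])
    while queue:
        v = queue.popleft()
        for w in adjacency.get(v, []):
            if w in visited:
                continue
            visited.add(w)
            diffs = {k for k in NS_KEYS
                     if vertices[w].get(k) not in (None, origin[k])}
            if diffs:
                flagged.append((w, diffs))  # reachable across a boundary
            queue.append(w)
    return flagged
```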

Project Idea 6: Identifying Provenance Integrity Challenges

SPADE's Audit Reporter infers data provenance from system-call-related records generated by the Linux kernel's Audit subsystem. The CrossNamespaces Filter implements an approach to detect cross-container channels in Audit-derived provenance. The goal will be to implement a streaming algorithm that builds on this filter to detect possible compromises in the integrity of the provenance that may arise from adversarial manipulation of the records generated.
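One possible shape for the streaming check, offered only as a hypothesis: remember each process's last-seen namespace annotations and flag records that change them without an intervening namespace-altering event. The record field names and the set of legitimizing system calls below are assumptions for illustration, not the CrossNamespaces Filter's logic.

```python
# Hypothetical streaming consistency check: remember each process's
# last-seen namespace annotations and flag records that change them
# without a namespace-altering syscall. Field names and the syscall
# set are illustrative assumptions.

NS_CHANGING_SYSCALLS = {"clone", "unshare", "setns"}

def stream_check(records):
    last_ns = {}                 # pid -> last-seen namespace tuple
    for rec in records:          # records arrive in time order
        pid = rec["pid"]
        ns = (rec.get("mount namespace"), rec.get("pid namespace"))
        if (pid in last_ns and last_ns[pid] != ns
                and rec.get("syscall") not in NS_CHANGING_SYSCALLS):
            # Namespace annotations changed with no explaining event:
            # possible manipulation of the records upstream.
            yield ("suspect", pid, rec)
        last_ns[pid] = ns
```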
