Provenance Privacy

To preserve privacy while sharing provenance metadata with other hosts in a network, SPADE provides three techniques: sanitization, encryption, and differential privacy. Each is described in the sections below.

Sanitization

Privacy preservation through sanitization performs an irreversible transformation on the response graphs. Under sanitization, a graph annotation, or part of one, is removed from the graph, leaving no trace. Which vertex and edge annotations are sanitized depends on the sanitization level and on the scheme defined for each level and annotation. Sanitization is available through the Sanitization transformer. Three levels of sanitization are defined: low, medium, and high. Provenance is sanitized only for the selected level. For example, if the level is high, provenance is sanitized for the high level only.

To use the transformer, execute the following on the control client:

add transformer Sanitization sanitizationLevel={low,medium,high}

sanitizationLevel defines the level of sanitization to perform before sharing the provenance graphs during this session. The settings of the sanitization process can be defined in the config file spade.transformer.Sanitization.config, whose structure is as follows:

low
cwd,fsgid,fsuid,sgid,suid,remote address[sanitizeIpAddress],path[sanitizePath],time[sanitizeTime]

medium
command line,uid,gid,remote address[sanitizeIpAddress],path[sanitizePath],time[sanitizeTime],size

high
name,euid,remote address[sanitizeIpAddress],path[sanitizePath],time[sanitizeTime],operation

For each level, the file contains a line with the sanitizationLevel, followed by a comma-separated list of annotations to sanitize for that level. Each annotation can be followed by the name of a custom code handler for sanitization in square brackets, like this: <annotation_name>[sanitizationHandler]. If no handler is given, the annotation is sanitized with the default strategy. The strategies for sanitizing composite annotations are described in the Strategies section below.
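As an illustration of the file layout described above, the following is a minimal Python sketch of a reader for it. It is not SPADE's own parser: the function name parse_sanitization_config and the returned dictionary shape are hypothetical, and it assumes that a level name sits on its own line with the annotation list on the line after it, as in the sample.

```python
import re

# Hypothetical helper, not part of SPADE: reads the layout shown in the
# sample config (a level line followed by a comma-separated annotation line).
LEVELS = ("low", "medium", "high")
ENTRY = re.compile(r"^(?P<name>[^\[\]]+?)(?:\[(?P<handler>[^\[\]]+)\])?$")


def parse_sanitization_config(text):
    """Return {level: {annotation: handler name, or None for the default}}."""
    config = {}
    level = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines separate the level blocks
        if line in LEVELS:
            level = line
            config[level] = {}
        elif level is not None:
            for entry in line.split(","):
                match = ENTRY.match(entry.strip())
                if match is None:
                    raise ValueError(f"malformed entry: {entry!r}")
                config[level][match["name"].strip()] = match["handler"]
    return config


SAMPLE = """\
low
cwd,fsgid,fsuid,sgid,suid,remote address[sanitizeIpAddress],path[sanitizePath],time[sanitizeTime]

medium
command line,uid,gid,remote address[sanitizeIpAddress],path[sanitizePath],time[sanitizeTime],size
"""

parsed = parse_sanitization_config(SAMPLE)
print(parsed["low"]["remote address"])  # handler named in square brackets
print(parsed["low"]["cwd"])             # None: falls back to the default strategy
```

Annotations without a bracketed handler map to None here, which stands in for "use the default strategy".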

Encryption

Privacy preservation through encryption performs reversible transformations on the response graphs. Data is encrypted using attribute-based encryption (ABE). In ABE, attributes serve as the credentials of a host, and a policy is defined over the encrypted data. Here, an attribute corresponds to the level of encryption or decryption to perform on the data. In a provenance graph, each annotation of a vertex or edge is encrypted according to the strategy defined for each level in the transformer.

Three levels of encryption are defined: low, medium, and high. Each level has an associated private key for encryption/decryption, and all levels share a common public key. The public key and the appropriate private keys have to be shared out-of-band with the other host so that it can decrypt the data shared with it. For a given level of encryption, provenance is individually encrypted/decrypted for that level.

Attribute-based encryption is available through the ABE transformer. SPADE uses the OpenABE implementation, which is available under the AGPL 3.0 license. OpenABE can be downloaded and installed for your system from the OpenABE GitHub repository. After installing OpenABE, complete the following steps:

Set up OpenABE and generate the master key pair for the Ciphertext-Policy (CP) ABE algorithm.
Generate the private keys for each given set of attributes. A set of attributes corresponds to the level of encryption in our scheme.
Share the master public key and the private key(s) with the party you want to communicate with.

The details for each step can be found in the first 6 pages of the OpenABE CLI documentation.

To use the transformer, execute the following on the control client:

add transformer ABE

Strategies

The following strategies are used for encrypting composite annotations. The Sanitization transformer described above uses the same strategies for sanitization.

remote address (xxx.xxx.xxx.xxx)

low, the second octet is encrypted.

medium, the third octet is encrypted.

high, the fourth octet is encrypted.

path (w/x/y/z/...)

low, path after first level is encrypted.

medium, path after the second level is encrypted.

high, path after the third level is encrypted.

time (yyyy-MM-dd HH:mm:ss.SSS)

low, day is encrypted.

medium, hour is encrypted.

high, minute, second and millisecond are encrypted.
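To make the per-level rules above concrete, here is a small Python sketch that marks which part of each composite annotation a given level touches. It is an illustration only, not SPADE's handler code: the function names and the MASK placeholder are hypothetical, the path is assumed to be in the w/x/y/z form without a leading slash, and a real transformer would substitute ciphertext (encryption) or remove the part (sanitization) rather than insert a marker.

```python
# Illustrative only: these helpers are hypothetical and do not reproduce
# SPADE's handlers. Each one applies `transform` to the part of the value
# that the chosen level affects, leaving the rest untouched.

def MASK(part):
    """Placeholder transform; stands in for encrypting or removing `part`."""
    return "#"


def protect_ip(address, level, transform=MASK):
    # low -> second octet, medium -> third octet, high -> fourth octet
    index = {"low": 1, "medium": 2, "high": 3}[level]
    octets = address.split(".")
    octets[index] = transform(octets[index])
    return ".".join(octets)


def protect_path(path, level, transform=MASK):
    # low -> everything after the first level, medium -> after the second,
    # high -> after the third
    keep = {"low": 1, "medium": 2, "high": 3}[level]
    parts = path.split("/")
    if len(parts) <= keep:
        return path  # nothing beyond the kept levels
    return "/".join(parts[:keep] + [transform("/".join(parts[keep:]))])


def protect_time(timestamp, level, transform=MASK):
    # Expects yyyy-MM-dd HH:mm:ss.SSS
    date, clock = timestamp.split(" ")
    year, month, day = date.split("-")
    hour, minute, seconds = clock.split(":")  # seconds holds ss.SSS
    if level == "low":
        day = transform(day)
    elif level == "medium":
        hour = transform(hour)
    elif level == "high":
        minute, seconds = transform(minute), transform(seconds)
    else:
        raise ValueError(f"unknown level: {level!r}")
    return f"{year}-{month}-{day} {hour}:{minute}:{seconds}"


print(protect_ip("10.20.30.40", "low"))                  # 10.#.30.40
print(protect_path("w/x/y/z", "medium"))                 # w/x/#
print(protect_time("2020-01-15 13:45:07.123", "high"))   # 2020-01-15 13:#:#
```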

The settings of the encryption process can be defined in the config file spade.transformer.ABE.config. In the sample config file below, 'keysDirectory' is the directory that contains the master public key for encryption and the secret keys used in decryption. After that, for each level, the file contains a line with the encryption level, followed by a comma-separated list of annotations to encrypt/decrypt. Each annotation can be followed by the name of the custom class that handles the encryption and decryption of that annotation in square brackets, like this: <annotation_name>[CustomClassName]. The custom class implements the functions containing the strategy for encryption and decryption of the annotation. If no custom class is given, the annotation is encrypted/decrypted with the default strategy.

keysDirectory=cfg/keys/attributes


low
cwd,fsgid,fsuid,sgid,suid,remote address[EncryptedIPAddress],path[EncryptedPath],time[EncryptedTime]

medium
command line,uid,gid,remote address[EncryptedIPAddress],path[EncryptedPath],time[EncryptedTime],size

high
name,euid,remote address[EncryptedIPAddress],path[EncryptedPath],time[EncryptedTime],operation

Differential Privacy

Differential privacy is a mechanism for sharing abstracted query responses from a database without disclosing information about individual records. With differential privacy, aggregate database information is returned with the addition of statistical noise. The aim is to provide useful but privacy-preserving information to the querier. Foundational work on ε-differential privacy provides a mathematical definition of the mechanism.
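As a rough illustration of adding statistical noise to an aggregate, the sketch below applies the standard Laplace mechanism to a count using only the Python standard library. It does not show SPADE's implementation or the API of the library SPADE uses. The function name laplace_mechanism and its parameters are hypothetical, and the sensitivity of 1 assumes a counting query in which one record changes the result by at most 1.

```python
import random


def laplace_mechanism(true_value, sensitivity, epsilon, rng=random):
    """Return true_value plus Laplace noise with scale sensitivity / epsilon.

    Hypothetical helper for illustration. A smaller epsilon means a larger
    noise scale, so stronger privacy and a less accurate answer.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    scale = sensitivity / epsilon
    # The difference of two independent exponential samples with mean `scale`
    # follows a zero-mean Laplace distribution with that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_value + noise


rng = random.Random(7)  # seeded only to make the demonstration repeatable
true_count = 100        # e.g. number of vertices matching some condition

for epsilon in (0.1, 1.0, 10.0):
    answers = [laplace_mechanism(true_count, 1.0, epsilon, rng) for _ in range(5)]
    print(epsilon, [round(a, 2) for a in answers])
```

The printed answers scatter widely around 100 for epsilon 0.1 and stay close to it for epsilon 10, which is the privacy versus accuracy trade-off that epsilon controls.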

The QuickGrail query surface in SPADE allows users to send four types of aggregate queries. These are: (i) mean, (ii) standard deviation, (iii) histogram, and (iv) distribution queries. The histogram query shows the count of each unique value of a specified annotation key. The distribution query is similar but instead automatically determines the range of values associated with the specified key, creates a specified number of sub-ranges (partitions), and reports the counts in each sub-range (partition).

SPADE enables the result of the above aggregate queries to be made differentially private. The implementation uses Google's open-source differential-privacy library.

To send an aggregate query in SPADE's query client, use the stat command as follows:

stat <vertex | edge> <annotation name> <aggregate type> [<additional arguments>] <graph variable>

The possible values for <aggregate type> are mean, std, histogram, and distribution. For example, the mean file size may be of interest. Given a graph variable $files, assume each file vertex in it has a filesize annotation. This query can then be used:

stat vertex filesize mean $files

As another example, the number of processes owned by each user may be of interest. Given a variable $processes, this query can be used:

stat vertex owner histogram $processes
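For readers who want to see what the mean and histogram queries compute, here is a small Python sketch over plain dictionaries. The data and the annotation_values helper are made up for illustration and are not SPADE's data model. The use of the population standard deviation is also an assumption, since this page does not say which variant the std query reports.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical stand-ins for the $files and $processes graph variables:
# each vertex is represented as a dict of annotation key/value strings.
files = [{"filesize": "120"}, {"filesize": "80"}, {"filesize": "100"}]
processes = [{"owner": "alice"}, {"owner": "bob"}, {"owner": "alice"}]


def annotation_values(elements, key):
    """Collect the values of one annotation key, skipping elements without it."""
    return [element[key] for element in elements if key in element]


sizes = [float(value) for value in annotation_values(files, "filesize")]
mean_size = mean(sizes)   # analogous to: stat vertex filesize mean $files
std_size = pstdev(sizes)  # population standard deviation (an assumption)

# analogous to: stat vertex owner histogram $processes
owner_histogram = Counter(annotation_values(processes, "owner"))

print(mean_size)              # 100.0
print(dict(owner_histogram))  # {'alice': 2, 'bob': 1}
```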

A distribution is an abstraction of a histogram where the values are grouped together into a specified number of partitions. Assume the time annotation on edges reports how long an operation took. Given an $operations variable, a distribution with 5 partitions can be computed with:

stat edge time distribution 5 $operations

Each partition contains the same number of unique values of the time annotation. The count for a partition is the number of edges whose time value is one of those values.
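The partitioning rule just described can be sketched in Python as follows. This is one reading of that rule, not SPADE's code: the distribution function is hypothetical, and how leftover values are assigned when the number of distinct values does not divide evenly by the number of partitions is an assumption made here.

```python
from collections import Counter


def distribution(values, partitions):
    """Group sorted distinct values into `partitions` groups and count elements.

    Hypothetical helper. Returns a list of ((lowest, highest), count) pairs.
    Groups hold equal numbers of distinct values when that divides evenly;
    otherwise later groups absorb the remainder (an assumption).
    """
    if partitions < 1:
        raise ValueError("partitions must be at least 1")
    counts = Counter(values)
    distinct = sorted(counts)
    total = len(distinct)
    result = []
    for i in range(partitions):
        group = distinct[i * total // partitions:(i + 1) * total // partitions]
        if not group:
            continue  # fewer distinct values than partitions
        result.append(((group[0], group[-1]), sum(counts[v] for v in group)))
    return result


# Made-up operation durations taken from the time annotation of ten edges.
times = [1, 1, 2, 3, 3, 3, 4, 5, 6, 6]
print(distribution(times, 3))
# [((1, 2), 3), ((3, 4), 4), ((5, 6), 3)]
```

With six distinct values and three partitions, each partition covers two distinct values, and the counts add up to the ten edges.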

Differential privacy for aggregate queries can be enabled in SPADE's configuration as follows. To enable differential privacy, set the epsilon value to the desired level of privacy in cfg/spade.core.AbstractAnalyzer.config. To disable differential privacy, set epsilon to -1.

For background on differential privacy and its implementation, see:

This material is based upon work supported by the National Science Foundation under Grants OCI-0722068, IIS-1116414, and ACI-1547467. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
