Ashish Gehani edited this page Jun 8, 2021 · 28 revisions

Data is continuously transformed by computation. Understanding the origins of a piece of data can help in a variety of circumstances. For example, the data's history can facilitate fault analysis, help decide how much the data should be trusted, or aid in profiling applications.

SPADE provides functionality to track and analyze the provenance of data that arises from multiple sources, distributed over the wide area, and at varied levels of abstraction.


SPADE provides a cross-platform service for distributed data provenance collection, filtering, storage, and querying. It includes support for collecting provenance from the Linux, Mac OS X, and Windows operating systems. SPADE uses the auditing functionality of each operating system, which remains stable across releases, to transparently record the provenance of all data. It can be installed from a pre-built package or from source code.

Easy to Deploy

SPADE automates the generation and collection of data provenance at the operating system level. It provides a broad view of activity across all the computers it is installed on in a distributed system. SPADE does this without requiring applications or the operating systems to be modified. It reports information about the name, owner, group, parent, host, creation time, command line, and environment variables of each process. It also reports the name, path, host, size, and modification time of files read or written during a computation. All this information can be collected with a few simple commands.
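
To make the reported attributes concrete, here is a minimal sketch of how such records could be modeled. The class names `ProcessRecord`, `FileRecord`, and `AccessEvent`, and the helper `files_written_by`, are hypothetical and chosen for illustration; they are not SPADE's own types. Only the field names follow the attributes listed above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ProcessRecord:
    # Attributes reported for each process (illustrative types).
    name: str
    owner: str
    group: str
    parent: str
    host: str
    creation_time: float
    command_line: str
    environment: Dict[str, str] = field(default_factory=dict)


@dataclass
class FileRecord:
    # Attributes reported for each file read or written.
    name: str
    path: str
    host: str
    size: int
    modification_time: float


@dataclass
class AccessEvent:
    # Links a process to a file it touched; operation is "read" or "write".
    process: ProcessRecord
    file: FileRecord
    operation: str


def files_written_by(events: List[AccessEvent], process_name: str) -> List[str]:
    """Return the paths of files written by processes with the given name."""
    return [
        e.file.path
        for e in events
        if e.operation == "write" and e.process.name == process_name
    ]


# Example data (entirely made up for this sketch).
editor = ProcessRecord("vim", "alice", "staff", "bash", "host-a", 1000.0, "vim out.txt")
source = FileRecord("in.txt", "/home/alice/in.txt", "host-a", 120, 990.0)
target = FileRecord("out.txt", "/home/alice/out.txt", "host-a", 64, 1005.0)
events = [
    AccessEvent(editor, source, "read"),
    AccessEvent(editor, target, "write"),
]
```

With records shaped like this, a question such as "which files did this program write?" reduces to a filter over the events, as `files_written_by(events, "vim")` does here.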

Flexible Querying

SPADE supports the use of variables, constraints, lineage, path, and set operators when searching local provenance records. It also supports graph and relational (SQL) queries over local provenance. Provenance collected by SPADE can also be inspected with third-party tools, such as Neoclipse and SQL Workbench. Finally, the SPADE query tool can transparently resolve path and lineage queries that span multiple hosts in a distributed system.
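
As an illustration of what a lineage query computes, the sketch below walks dependency edges backwards from a starting item to collect its ancestors. It is plain Python over an in-memory edge list, not SPADE's query language or client; the function name `lineage`, the `(dependent, dependency)` edge encoding, and the `max_depth` parameter are assumptions made for this example.

```python
from collections import deque
from typing import Dict, Iterable, List, Optional, Set, Tuple


def lineage(
    edges: Iterable[Tuple[str, str]],
    start: str,
    max_depth: Optional[int] = None,
) -> Set[str]:
    """Collect everything `start` depends on, directly or transitively.

    Each edge is (dependent, dependency). If max_depth is given, the walk
    stops after that many hops from `start`.
    """
    depends_on: Dict[str, List[str]] = {}
    for dependent, dependency in edges:
        depends_on.setdefault(dependent, []).append(dependency)

    ancestors: Set[str] = set()
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue
        for parent in depends_on.get(node, []):
            # Skip nodes already visited so cycles cannot loop forever.
            if parent != start and parent not in ancestors:
                ancestors.add(parent)
                queue.append((parent, depth + 1))
    return ancestors


# Made-up provenance: a PDF produced by latex from a file edited in vim.
example_edges = [
    ("report.pdf", "latex"),
    ("latex", "report.tex"),
    ("report.tex", "vim"),
]
```

Calling `lineage(example_edges, "report.pdf")` returns all three upstream items, while `max_depth=1` limits the result to the immediate dependency.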

Modular and Extensible

SPADE is designed to be extensible in multiple ways. A reporter can be implemented to collect provenance from a new domain of interest. A new filter can be written to perform novel transformations on provenance events. A new storage system can be added to record provenance in a different format. A new sketch can be designed to optimize distributed querying. A new transformer can be used to dynamically rewrite query responses.
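
The sketch below shows, in simplified form, how a reporter, filters, and a storage component might fit together in a pipeline. It is a conceptual illustration only: `ListReporter`, `drop_tmp_files`, `MemoryStorage`, and `run_pipeline` are hypothetical names and do not correspond to SPADE's actual extension interfaces. Sketches and transformers are left out for brevity.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional

Event = Dict[str, str]
Filter = Callable[[Event], Optional[Event]]


class ListReporter:
    """Hypothetical reporter that replays a fixed list of events."""

    def __init__(self, events: Iterable[Event]) -> None:
        self._events = list(events)

    def events(self) -> Iterator[Event]:
        yield from self._events


def drop_tmp_files(event: Event) -> Optional[Event]:
    """Hypothetical filter: discard events that touch files under /tmp/."""
    if event.get("path", "").startswith("/tmp/"):
        return None
    return event


class MemoryStorage:
    """Hypothetical storage that keeps records in a Python list."""

    def __init__(self) -> None:
        self.records: List[Event] = []

    def put(self, event: Event) -> None:
        self.records.append(event)


def run_pipeline(reporter: ListReporter, filters: List[Filter], storage: MemoryStorage) -> None:
    """Pass each reported event through every filter, then store survivors."""
    for event in reporter.events():
        current: Optional[Event] = event
        for apply_filter in filters:
            current = apply_filter(current)
            if current is None:
                break  # a filter dropped the event
        else:
            storage.put(current)


reporter = ListReporter([
    {"process": "gcc", "path": "/tmp/cc123.s", "operation": "write"},
    {"process": "gcc", "path": "/home/alice/main.o", "operation": "write"},
])
storage = MemoryStorage()
run_pipeline(reporter, [drop_tmp_files], storage)
```

Because each stage only depends on the shape of an event, swapping in a different reporter, filter, or storage does not require changing the others, which is the kind of modularity described above.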

Please use the links in the sidebar on the right to learn how to use SPADE to collect, filter, store, and query your provenance records.
