Overview

Tabular data sets are common, and many data processing tasks must be repeated on multiple similar data samples. In practice, however, there may be unexpected changes in structure across different batches of data, which are likely to break the analytical pipeline.

Datadiff identifies structural differences between pairs of (related) tabular data sets and returns an executable summary (or "patch") which is both a description of the differences and a corrective transformation.

In making comparisons, datadiff considers the following (composable) patch types:

  • column permutation
  • column insertion
  • column deletion
  • column replacement
  • recoding of categorical data
  • linear transformation of numerical data
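To make the patch types above concrete, here is a minimal base R sketch of three of them (permutation, recoding, linear transformation) written as plain functions and composed into one transformation. The helper names (`permute_cols`, `recode_grade`, `rescale_temp`, `compose`) and the example data are hypothetical illustrations, not part of the datadiff API.

```r
# Hypothetical illustration in base R; none of these helpers are datadiff functions.
df <- data.frame(
  id = 1:3,
  grade = c("a", "b", "a"),
  temp_c = c(0, 10, 20),
  stringsAsFactors = FALSE
)

# Column permutation: reorder the columns by position.
permute_cols <- function(d) d[, c(3, 1, 2), drop = FALSE]

# Recoding of categorical data: map old labels to new ones.
recode_grade <- function(d) {
  d$grade <- unname(c(a = "pass", b = "fail")[d$grade])
  d
}

# Linear transformation of numerical data: Celsius to Fahrenheit.
rescale_temp <- function(d) {
  d$temp_c <- d$temp_c * 9 / 5 + 32
  d
}

# Composition: apply the steps from left to right.
compose <- function(...) {
  steps <- list(...)
  function(d) Reduce(function(acc, f) f(acc), steps, d)
}

patch <- compose(recode_grade, rescale_temp, permute_cols)
patched <- patch(df)
print(patched)
```

Because each step takes a data frame and returns a data frame, the composed function is both a record of what changed and something that can be run on the next batch.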

Installation

Datadiff is implemented in R and can be built from source or installed using the devtools package as follows.

# Install the latest development version from GitHub:
# install.packages("devtools")
devtools::install_github("alan-turing-institute/datadiff")

Usage

Diff two data frames with ddiff(df1, df2).
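As a rough sketch of a session, the following diffs two batches of the built-in `mtcars` data whose columns are in a different order. Only the `ddiff(df1, df2)` call comes from this README. Calling the returned patch as a function on a data frame is an assumption based on the "executable" description in the Overview, and the choice of data is purely illustrative; the vignette is the authoritative reference.

```r
# Two batches of related data: same variables, different column order.
df1 <- mtcars[1:16, c("mpg", "hp", "wt")]
df2 <- mtcars[17:32, c("wt", "mpg", "hp")]

if (requireNamespace("datadiff", quietly = TRUE)) {
  # The call documented above; the result is the patch.
  patch <- datadiff::ddiff(df1, df2)
  print(patch)

  # Assumption: the patch can be applied by calling it on a data frame.
  patched <- patch(df1)
  print(names(patched))
} else {
  message("datadiff is not installed; see the Installation section.")
}
```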

For more information and examples, see the package vignette:

# Build the vignette on package installation:
devtools::install_github("alan-turing-institute/datadiff", build_vignettes = TRUE)
vignette("datadiff")

About

Datadiff is diff for data
