Datadiff is diff for data

Overview

Tabular data sets are common, and many data processing tasks must be repeated on multiple similar data samples. In practice, however, there may be unexpected changes in structure across different batches of data, which are likely to break the analytical pipeline.

Datadiff identifies structural differences between pairs of (related) tabular data sets and returns an executable summary (or "patch") which is both a description of the differences and a corrective transformation.

In making comparisons, datadiff considers the following (composable) patch types:

  • column permutation
  • column insertion
  • column deletion
  • column replacement
  • recoding of categorical data
  • linear transformation of numerical data
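
To make the idea of composable patches concrete, here is a minimal base-R sketch in which each patch is simply a function from a data frame to a data frame. The helper names (`permute_columns`, `recode_column`, `linear_transform`, `compose`) are hypothetical and chosen for illustration; they are not datadiff's own functions, and the package's actual representation of patches may differ.

```r
# Illustrative only: plain base-R closures standing in for patch types.
# None of these helpers are part of the datadiff API.

# Column permutation: reorder columns by position.
permute_columns <- function(perm) {
  function(df) df[, perm, drop = FALSE]
}

# Recoding of categorical data: map old labels to new labels in column i.
recode_column <- function(i, mapping) {
  function(df) {
    df[[i]] <- unname(mapping[as.character(df[[i]])])
    df
  }
}

# Linear transformation of numerical data: x -> scale * x + shift in column i.
linear_transform <- function(i, scale = 1, shift = 0) {
  function(df) {
    df[[i]] <- scale * df[[i]] + shift
    df
  }
}

# Composition: apply the given patches from left to right.
compose <- function(...) {
  patches <- list(...)
  function(df) Reduce(function(acc, p) p(acc), patches, df)
}

df <- data.frame(
  temp = c(0, 100),          # degrees Celsius
  grade = c("a", "b"),
  id = c(1, 2),
  stringsAsFactors = FALSE
)

patch <- compose(
  linear_transform(1, scale = 9 / 5, shift = 32),  # Celsius to Fahrenheit
  recode_column(2, c(a = "high", b = "low")),
  permute_columns(c(3, 1, 2))
)

result <- patch(df)
print(result)
```

Because every patch has the same shape (data frame in, data frame out), the composite is itself a patch, which is what allows a single object to serve as both a description of the differences and a corrective transformation.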

Installation

Datadiff is implemented in R and can be built from source or installed using the devtools package as follows.

# Install the latest version from GitHub:
# install.packages("devtools")
devtools::install_github("alan-turing-institute/datadiff")

Usage

Diff two data frames with ddiff(df1, df2).
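
As a hedged sketch of what a call might look like, the example below builds two small data frames that differ by a column swap and a change of units, then passes them to `ddiff`. The data, the column names and the use of `print` on the result are assumptions made for illustration; only the call `ddiff(df1, df2)` comes from this README, and the call is guarded so that the script still runs when datadiff is not installed.

```r
set.seed(1)

# First batch: height in metres, followed by a categorical group label.
df1 <- data.frame(
  height = rnorm(100, mean = 1.7, sd = 0.1),
  group = sample(c("a", "b"), 100, replace = TRUE),
  stringsAsFactors = FALSE
)

# Second batch: same kind of data, but the columns are swapped and
# height is recorded in centimetres.
df2 <- data.frame(
  group = sample(c("a", "b"), 100, replace = TRUE),
  height = 100 * rnorm(100, mean = 1.7, sd = 0.1),
  stringsAsFactors = FALSE
)

# Only call datadiff if it is available on this machine.
if (requireNamespace("datadiff", quietly = TRUE)) {
  patch <- datadiff::ddiff(df1, df2)
  print(patch)  # assumed: the returned patch has a readable printed form
}
```

The vignette referenced below is the authoritative source for how to inspect and apply the returned patch.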

For more information and examples, see the package vignette:

# Build the vignette on package installation:
devtools::install_github("alan-turing-institute/datadiff", build_vignettes = TRUE)
vignette("datadiff")