Find and replace erroneous fields in data using validation rules

Error localization

Find errors in data given a set of validation rules. The errorlocate package helps identify obvious errors in raw datasets.

It works in tandem with the package validate. With validate you formulate the data validation rules with which the data must comply.

For example:

  • "age cannot be negative": age >= 0.
  • "if a person is married, he must be older than 16 years": if (married == TRUE) age > 16.
  • "Profit is turnover minus cost": profit == turnover - cost.
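As a minimal sketch, the three rules above could be written with validate and checked against some data. The data frame `records` and its values are made up for illustration; only `validator()`, `confront()`, `summary()` and `values()` come from validate.

```r
library(validate)

# The three example rules, expressed as validate rules.
rules <- validator(
  age >= 0,
  if (married == TRUE) age > 16,
  profit == turnover - cost
)

# Hypothetical data: the first record is meant to pass every rule,
# the second is meant to break the marriage rule and the profit rule.
records <- data.frame(
  age      = c(35, 12),
  married  = c(TRUE, TRUE),
  profit   = c(100, 100),
  turnover = c(300, 300),
  cost     = c(200, 150)
)

# Confront the data with the rules and inspect the outcome.
checks <- confront(records, rules)
summary(checks)   # one line per rule: passes, fails, missing values
values(checks)    # logical matrix: one row per record, one column per rule
```

Note that the result says which rules fail for which record, not which fields are wrong.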

While validate can check whether a record is valid, it does not identify which variables are responsible for the violation. This may seem a simple task, but it is actually quite tricky: a set of validation rules forms a web of dependent variables, so changing a value to satisfy rule 1 may cause the record to violate rule 2.
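The dependency problem can be illustrated in a few lines of base R. This is only a sketch: the rule names and the `check()` helper are hypothetical, and the rules and record are borrowed from the Usage section of this README.

```r
# Rules written as plain R expressions (names are illustrative only).
rule_exprs <- list(
  balance      = quote(profit == turnover - cost),
  cost_share   = quote(cost >= 0.6 * turnover),
  turnover_pos = quote(turnover >= 0),
  cost_pos     = quote(cost >= 0)
)

# Hypothetical helper: evaluate every rule against one named record.
check <- function(record) {
  vapply(rule_exprs,
         function(r) isTRUE(eval(r, as.list(record))),
         logical(1))
}

record <- c(profit = 750, cost = 125, turnover = 200)
check(record)     # only the balance rule fails

# Naive repair: make the balance rule hold by adjusting turnover.
repaired <- record
repaired["turnover"] <- repaired["profit"] + repaired["cost"]
check(repaired)   # balance now holds, but cost_share is violated
```

Fixing one rule in isolation moved the problem to another rule, which is why the choice of fields to change has to consider all rules at once.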

errorlocate provides a small framework for record-based error detection and implements the Fellegi-Holt algorithm. This algorithm assumes that no information is available other than the values of a record and a set of validation rules. It minimizes the (weighted) number of values that need to be adjusted to make the record valid.
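To make the minimization concrete, here is a toy brute-force sketch in base R. It is not how errorlocate solves the problem: it simply tries subsets of variables in order of increasing total weight and tests, on a coarse grid of candidate values, whether changing only those variables can make the record valid. The helpers `is_valid()`, `can_repair()` and `locate_minimal()` and the grid are assumptions made for illustration; the rules and record come from the Usage section.

```r
record <- c(profit = 750, cost = 125, turnover = 200)

# TRUE when a named record satisfies all rules from the Usage section.
is_valid <- function(r) {
  r[["profit"]] == r[["turnover"]] - r[["cost"]] &&
    r[["cost"]] >= 0.6 * r[["turnover"]] &&
    r[["turnover"]] >= 0 &&
    r[["cost"]] >= 0
}

# Can the record be made valid by changing only `vars`?
# Assumption: candidate values are restricted to a coarse grid.
can_repair <- function(record, vars, grid = seq(0, 2000, by = 25)) {
  candidates <- as.matrix(expand.grid(rep(list(grid), length(vars))))
  for (i in seq_len(nrow(candidates))) {
    trial <- record
    trial[vars] <- candidates[i, ]
    if (is_valid(trial)) return(TRUE)
  }
  FALSE
}

# Return the cheapest set of variables whose adjustment can repair the record.
locate_minimal <- function(record, weight = rep(1, length(record))) {
  if (is_valid(record)) return(character(0))
  vars <- names(record)
  names(weight) <- vars
  # All non-empty subsets of the variables.
  subsets <- unlist(
    lapply(seq_along(vars), function(k) combn(vars, k, simplify = FALSE)),
    recursive = FALSE
  )
  total <- vapply(subsets, function(s) sum(weight[s]), numeric(1))
  for (s in subsets[order(total)]) {
    if (can_repair(record, s)) return(s)
  }
  NULL
}

locate_minimal(record)                        # equal weights
locate_minimal(record, weight = c(3, 1, 1))   # profit is trusted more
```

With equal weights the search blames `profit` alone. When `profit` is given a higher weight, changing `cost` and `turnover` together becomes the cheaper explanation, which is the role the weights play in the description above.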

Installation

errorlocate can be installed from CRAN:

install.packages("errorlocate")

Beta versions can be installed with drat:

drat::addRepo("data-cleaning")
install.packages("errorlocate")

The latest development version of errorlocate can be installed from GitHub with devtools:

devtools::install_github("data-cleaning/errorlocate")

Usage

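library(magrittr)     # provides the %>% pipe
library(validate)     # provides validator()
library(errorlocate)  # provides replace_errors() and errors_removed()

rules <- validator( profit == turnover - cost
                  , cost >= 0.6 * turnover
                  , turnover >= 0
                  , cost >= 0 # implied by the two rules above
)

data <- data.frame(profit=750, cost=125, turnover=200)

data_no_error <-
  data %>%
  replace_errors(rules)

# faulty data was replaced with NA
data_no_error

# inspect which values were removed
errors_removed(data_no_error)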
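If you only want to see which fields are flagged, without replacing them, a sketch along the following lines may help. It assumes that `locate_errors()` returns an object whose `values()` is a logical matrix with one column per variable; check the package documentation for the version you have installed.

```r
library(validate)
library(errorlocate)

rules <- validator( profit == turnover - cost
                  , cost >= 0.6 * turnover
                  , turnover >= 0
                  , cost >= 0
)

data <- data.frame(profit=750, cost=125, turnover=200)

# Locate the fields that should be changed, leaving the data untouched.
le <- locate_errors(data, rules)

# Assumed: TRUE marks a field considered erroneous.
values(le)
```

For this record, `profit` is the only single field whose adjustment can satisfy all rules, so it is the field expected to be flagged.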