Find errors in data given a set of validation rules.
errorlocate helps to identify obvious errors in raw datasets.
It works in tandem with the package validate.
With validate you formulate data validation rules to which the data must comply, for example:
- "age cannot be negative":
age >= 0.
- "if a person is married, he must be older than 16 years":
if (married == TRUE) age > 16.
- "Profit is turnover minus cost":
profit == turnover - cost.
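The three rules above could be written down and checked with validate roughly as follows. This is a minimal sketch: the `persons` data frame and its values are made up for illustration and are not part of the package.

```r
library(validate)

# The three example rules, expressed as a validator object
rules <- validator(
  age >= 0,
  if (married == TRUE) age > 16,
  profit == turnover - cost
)

# Hypothetical records (assumption, for illustration only)
persons <- data.frame(
  age      = c(35, -1, 12),
  married  = c(TRUE, FALSE, TRUE),
  profit   = c(100, 50, 10),
  turnover = c(300, 80, 40),
  cost     = c(200, 30, 20)
)

# confront() evaluates every rule on every record
cf <- confront(persons, rules)
summary(cf)

# One row per record, one column per rule (TRUE = rule satisfied)
values(cf)
```

Note that the result says which rules fail for which record, not which variable should be corrected.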
Although validate can check whether a record is valid, it does not identify
which of the variables are responsible for the violation. This may seem a simple task,
but it is actually quite tricky: a set of validation rules forms a web
of dependent variables, so changing a value to repair a record for rule 1 may invalidate
the record for rule 2.
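The dependency between rules can be seen in a small experiment. The sketch below uses only validate; the record and the two manual "repairs" are illustrative assumptions, not output of errorlocate.

```r
library(validate)

rules <- validator(
  profit == turnover - cost,
  cost >= 0.6 * turnover,
  turnover >= 0,
  cost >= 0
)

# Illustrative record: only the first rule is violated
record <- data.frame(profit = 750, cost = 125, turnover = 200)
values(confront(record, rules))

# Naive repair 1: adjust turnover so that the balance rule holds.
# The balance rule is now satisfied, but cost >= 0.6 * turnover breaks.
repaired_turnover <- transform(record, turnover = profit + cost)
values(confront(repaired_turnover, rules))

# Naive repair 2: adjust profit instead. All rules hold.
repaired_profit <- transform(record, profit = turnover - cost)
values(confront(repaired_profit, rules))
```

Which variable to change therefore cannot be decided by looking at one rule in isolation.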
errorlocate provides a small framework for record-based error detection and implements the Fellegi-Holt
algorithm. This algorithm assumes that no information is available other than the values of a record
and a set of validation rules. It minimizes the (weighted) number of values that need
to be adjusted so that the record satisfies all rules.
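How the weights steer the outcome can be sketched with `locate_errors()`. The record and the weight values below are illustrative assumptions; the weights are given in the column order of the data.

```r
library(validate)
library(errorlocate)

rules <- validator(
  profit == turnover - cost,
  cost >= 0.6 * turnover,
  turnover >= 0,
  cost >= 0
)

# Illustrative record that violates the balance rule
record <- data.frame(profit = 750, cost = 125, turnover = 200)

# Default: every variable has the same weight
le_equal <- locate_errors(record, rules)
le_equal$errors      # logical matrix, TRUE marks a suspected error

# Assumed weights: profit is trusted much more than cost and turnover
w <- c(profit = 5, cost = 1, turnover = 1)
le_weighted <- locate_errors(record, rules, weight = w)
le_weighted$errors
```

With equal weights, changing profit alone is the only single-value repair that satisfies all rules. With a high weight on profit, adjusting both cost and turnover (total weight 2) becomes cheaper than adjusting profit (weight 5).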
errorlocate can be installed from CRAN with install.packages("errorlocate").
Beta versions can be installed with
The latest development version of errorlocate can be installed from GitHub with
```r
library(magrittr)     # provides the %>% pipe
library(validate)     # validator()
library(errorlocate)  # replace_errors(), errors_removed()

rules <- validator( profit == turnover - cost
                  , cost >= 0.6 * turnover
                  , turnover >= 0
                  , cost >= 0 # is implied
                  )

data <- data.frame(profit=750, cost=125, turnover=200)

data_no_error <-
  data %>%
  replace_errors(rules)

# faulty data was replaced with NA
data_no_error

# show which values were removed
errors_removed(data_no_error)
```