added vignette
markvanderloo committed Oct 29, 2015
1 parent 72c5e78 commit f482cd4
Showing 3 changed files with 104 additions and 2 deletions.
2 changes: 0 additions & 2 deletions pkg/R/modifier.R
@@ -108,8 +108,6 @@ setGeneric("modify", function(dat, x, ...) standardGeneric("modify"))





#### S4 IMPLEMENTATIONS
setMethod("expr","modifier",function(x,...){
lapply(x$rules, function(r) r@expr)
1 change: 1 addition & 0 deletions pkg/R/modify.R
@@ -50,6 +50,7 @@ get_rule_guard <- function(r,dat){
}

#' @rdname modify
#' @export
setMethod("modify",c("data.frame","modifier"), function(dat, x, ...){
# options <- clone_and_merge(modify_options(x),...)
modifiers <- x$exprs(vectorize=FALSE)
103 changes: 103 additions & 0 deletions pkg/vignettes/introduction.Rmd
@@ -0,0 +1,103 @@
---
title: "Introduction to validate.modify"
author: "Mark van der Loo and Edwin de Jonge"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Introduction to validate.modify}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

### A first statement

In the iris dataset, replace the value of `Sepal.Width` with 4 if it exceeds 4.
```{r}
library(validate.modify)
library(magrittr)
iris %<>% modify_so( if(Sepal.Width > 4 ) Sepal.Width <- 4 )
```
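
For comparison, here is a minimal base R sketch of what this rule amounts to. It assumes the rule is applied to every record where the condition holds. The name `iris2` is only used for illustration, and the code is not taken from the package.

```r
# Base R sketch: cap Sepal.Width at 4 wherever it exceeds 4.
iris2 <- datasets::iris

# Records where the condition holds (a missing condition counts as "no").
too_wide <- !is.na(iris2$Sepal.Width) & iris2$Sepal.Width > 4

# Modify only those records; all other records are left untouched.
iris2$Sepal.Width[too_wide] <- 4
```

Writing the rule as `if (Sepal.Width > 4) Sepal.Width <- 4` expresses the same intent without the explicit indexing.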

### Why this package

Data cleaning workflows or scripts typically contain many statements of the
'if this, do that' type. Such statements are typically condensed expert knowledge.
With this package, such 'data modifying rules' are taken out of the code and
instead become parameters to the workflow. This allows you to maintain, document,
and reason about data modification rules separately from the flow of your programme.

This means that you, the data munger, can focus on the content and let R do the work.


### Basic workflow

The workflow of `validate.modify` is designed to take two concerns off your hands. The first concern is how to implement the many ideas and rules that define how and when to modify data. The second concern is how to apply such rules to your data. We therefore introduce two nouns and one verb that govern the basic workflow.

- data: This is your data. Currently it must be stored in a `data.frame`.
- `modifier`: This is an object that stores (conditional) data modification rules.
- `modify`: This is a function that applies the rules in a modifier to your data.

Here's an example using the `retailers` data set from the [validate](https://cran.r-project.org/package=validate) package.
```{r}
data("retailers", package="validate")
head(retailers[-(1:2)],3)
```

First we define a set of modifying rules, using `modifier`.
```{r}
m <- modifier(
if (other.rev < 0) other.rev <- -1 * other.rev
, if ( is.na(staff.costs) ) staff.costs <- mean(staff.costs)
)
```
Next, the rules can be applied to our data.
```{r}
ret1 <- modify(retailers,m)
```

Alternatively, if you're a fan of the [magrittr](https://cran.r-project.org/package=magrittr) package, you can do this
```{r,eval=FALSE}
library(magrittr)
ret2 <- retailers %>% modify(m)
```
or even
```{r,eval=FALSE}
retailers %<>% modify_so(
if ( other.rev < 0) other.rev <- -1 * other.rev
, if ( is.na(staff.costs) ) staff.costs <- mean(staff.costs)
)
```
Here, the `%<>%` operator makes sure that the original dataset gets overwritten, and `modify_so` is a shortcut function for defining modification rules in-line.

### Handling missing values

The rules you define in a `modifier` are executed on records where the condition yields `TRUE`. In R this raises the question of what to do when the condition evaluates to `NA` for a record. For example, the condition
```
other.rev < 0
```
in the first rule of `m` above evaluates to `NA` in the first record of the `retailers` dataset. Such cases are handled by treating the condition as if it evaluated to `FALSE`.
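
The behaviour described above can be mimicked in base R. The following is only a sketch: the data frame `d` and its values are made up for illustration (they are not the `retailers` data), and the code is not the package's implementation.

```r
# Hypothetical toy data: the first record has a missing other.rev.
d <- data.frame(other.rev = c(NA, -5, 3))

# The condition evaluates to NA for the first record.
cond <- d$other.rev < 0
print(cond)

# Treat NA as FALSE, so only records where the condition is TRUE are modified.
cond[is.na(cond)] <- FALSE
d$other.rev[cond] <- -1 * d$other.rev[cond]
print(d$other.rev)
```

The record with the missing value is left as it is; only the negative value changes sign.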



### Importing and exporting rules from file


### Performance, and a glimpse under the hood

You, the user, can assume that the rules are evaluated record-by-record. In reality, the package analyses the rules so that they can be evaluated in a vectorized manner.
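
To give an impression of what such an analysis could look like, here is a rough sketch in base R. It assumes a rule of the simple form `if (condition) variable <- value`. The helper `apply_rule_vectorized` and the toy data are hypothetical and are not part of the package; the actual implementation may work differently.

```r
# Hypothetical helper, for illustration only: apply one rule of the form
#   if (condition) variable <- value
# to all records of a data.frame at once.
apply_rule_vectorized <- function(dat, rule) {
  stopifnot(is.call(rule), identical(rule[[1]], as.name("if")))
  condition  <- rule[[2]]   # e.g. other.rev < 0
  assignment <- rule[[3]]   # e.g. other.rev <- -1 * other.rev
  stopifnot(is.call(assignment), identical(assignment[[1]], as.name("<-")))
  target <- as.character(assignment[[2]])
  value  <- assignment[[3]]

  # Evaluate the condition for all records at once; NA counts as FALSE.
  hit <- eval(condition, dat)
  hit[is.na(hit)] <- FALSE

  # Evaluate the right-hand side for all records, assign only where hit.
  new_values <- rep_len(eval(value, dat), nrow(dat))
  dat[[target]][hit] <- new_values[hit]
  dat
}

# Made-up toy data and a rule stored as an unevaluated expression.
d <- data.frame(other.rev = c(NA, -5, 3))
rule <- quote(if (other.rev < 0) other.rev <- -1 * other.rev)
d2 <- apply_rule_vectorized(d, rule)
```

Because the rule is stored as an unevaluated expression, its condition and assignment can be inspected and evaluated on whole columns, which avoids an explicit loop over records.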

In short, when you call `modify`, or `modify_so`, the following steps are performed.

1. The







### Difference with dplyr::mutate



