Handling of missing values #63

clauswilke · 2020-12-05T22:25:31Z

Do we have a systematic way of handling NA values on the Rust side? It's not clear to me what the current intent is to handling missing values. Most basic Rust data types (e.g., String) cannot handle missing values. Instead, Rust uses options. Maybe we should systematically wrap all incoming data types into options?

As an example, the current implementation silently converts NA_character_ into "NA", which I would argue is incorrect behavior.

library(rextendr)

rust_function(
  code = "fn rust_strings(input: &str) -> String {
    input.to_string()
  }",
  patch.crates_io = c(
    'extendr-api = { git = "https://github.com/extendr/extendr" }',
    'extendr-macros = { git = "https://github.com/extendr/extendr" }'
  )
)

rust_strings("a")
#> [1] "a"
rust_strings(NA_character_)
#> [1] "NA"

^{Created on 2020-12-05 by the reprex package (v0.3.0)}

The text was updated successfully, but these errors were encountered:

clauswilke · 2020-12-05T22:35:18Z

I think wrapping all input into options would be more idiomatic, but it doesn't currently work.

library(rextendr)

rust_function(
  code = r"(fn rust_strings(input: Option<&str>) -> String {
    // ... do something with input
    "hello".to_string() // some dummy output for now
  })",
  patch.crates_io = c(
    'extendr-api = { git = "https://github.com/extendr/extendr" }',
    'extendr-macros = { git = "https://github.com/extendr/extendr" }'
  )
)
#> Error: Rust code could not be compiled successfully. Aborting.

^{Created on 2020-12-05 by the reprex package (v0.3.0)}

This is the compiler error message from the above build attempt:

   Compiling rextendr8 v0.0.1 (/var/folders/b1/13gn4j655jddkfhxmtk5tsfm0000gn/T/RtmpxTXC7C/file283a636057aa)
error[E0599]: no variant or associated item named `from_robj` found for enum `Option<&str>` in the current scope
 --> src/lib.rs:3:1
  |
3 | #[extendr]
  | ^^^^^^^^^^ variant or associated item not found in `Option<&str>`
  |
  = note: this error originates in an attribute macro (in Nightly builds, run with -Z macro-backtrace for more info)

error: aborting due to previous error

For more information about this error, try `rustc --explain E0599`.
error: could not compile `rextendr8`

To learn more, run the command again with --verbose.

lebensterben · 2020-12-06T00:27:47Z

extendr/extendr-macros/src/lib.rs

Line 63 in 578ce65

Some(parse_quote!{ extendr_api::unwrap_or_throw(<#ty>::from_robj(&#varname)) })

Option<&str> doesn't implement FromRobj

This is easy to solve.
For all T implements FromRobj, the two following identities holds

<Option<T>>::from_robj(x) == <T>::from_robj(x) iff the Robj is NOT NA.
<Option<T>>::from_robj(x) == None iff the Robj is NA.

yutannihilation · 2020-12-06T05:54:27Z

Apache Arrow also uses Option<T> to represent the missingness (note that the arrow format internally has a separate bitmap to represent null c.f. doc), so this seems the right way.

https://docs.rs/arrow/2.0.0/arrow/#array

Handling NaN might be a bit more complicated.

yutannihilation · 2020-12-06T06:15:08Z

Handling NaN might be a bit more complicated.

Here's the current behavior. Should I file a separate issue for this?

library(rextendr)

rust_function(
  code = "fn rust_int(input: i32) -> i32 {
    input
  }",
  patch.crates_io = c(
    'extendr-api = { git = "https://github.com/extendr/extendr" }',
    'extendr-macros = { git = "https://github.com/extendr/extendr" }'
  )
)
#> build directory: /tmp/Rtmp9cpfVn/file226494fffa9e6
#> Prebuild libR bindings are not available. Run `install_libR_bindings()` to improve future build times.

rust_int(Inf)
#> [1] 2147483647
rust_int(NaN)
#> [1] 0

^{Created on 2020-12-06 by the reprex package (v0.3.0)}

clauswilke · 2020-12-06T06:53:32Z

I'm not sure there's a problem with Inf and NaN. They work if you use the f64 type.

library(rextendr)

rust_function(
  code = "fn rust_dbl(input: f64) -> f64 {
    input
  }",
  patch.crates_io = c(
    'extendr-api = { git = "https://github.com/extendr/extendr" }',
    'extendr-macros = { git = "https://github.com/extendr/extendr" }'
  )
)

rust_dbl(Inf)
#> [1] Inf
rust_dbl(NaN)
#> [1] NaN

^{Created on 2020-12-06 by the reprex package (v0.3.0)}

And note that if you append Inf or NaN to an integer vector in R it converts the vector to double:

library(tibble)
tibble(x = c(1L, 2L, 3L, Inf, NaN), y = 1L:5L)
#> # A tibble: 5 x 2
#>       x     y
#>   <dbl> <int>
#> 1     1     1
#> 2     2     2
#> 3     3     3
#> 4   Inf     4
#> 5   NaN     5

^{Created on 2020-12-06 by the reprex package (v0.3.0)}

So the question is rather whether extendr should implicitly convert integers to doubles. Given your example, I think the answer may be no. But either way, it's a separate issue.

clauswilke · 2020-12-06T06:59:20Z

One more Inf example, created by division by zero in Rust. I think Inf and NaN work because they are part of the IEEE standard, it's only NA that makes trouble.

library(rextendr)

rust_function(
  code = "fn div_by_zero(input: f64) -> f64 {
    input / 0.
  }",
  patch.crates_io = c(
    'extendr-api = { git = "https://github.com/extendr/extendr" }',
    'extendr-macros = { git = "https://github.com/extendr/extendr" }'
  )
)

div_by_zero(3.)
#> [1] Inf
div_by_zero(-3.)
#> [1] -Inf

^{Created on 2020-12-06 by the reprex package (v0.3.0)}

yutannihilation · 2020-12-06T08:06:01Z

because they are part of the IEEE standard, it's only NA that makes trouble.

Oh, thanks, got it!

So the question is rather whether extendr should implicitly convert integers to doubles. Given your example, I think the answer may be no. But either way, it's a separate issue.

Agreed. Now I feel my example should return NA for both Inf and NaN, but let's discuss this another time and focus on missing values.

as.integer(Inf)
#> Warning: NAs introduced by coercion to integer range
#> [1] NA
as.integer(NaN)
#> [1] NA

^{Created on 2020-12-06 by the reprex package (v0.3.0)}

lebensterben · 2020-12-06T08:11:48Z

@yutannihilation

See this
https://play.rust-lang.org/?version=nightly&mode=debug&edition=2018&gist=ece28b8252f43834c424594c86a31e1a

For your example, the only correct behaviour is to return an error, not even a NA.

yutannihilation · 2020-12-06T09:59:38Z

Aw, thanks. This is horrible... I agree this should return an error.

clauswilke · 2020-12-06T16:29:50Z

My take is slightly different. I think the example by @lebensterben shows that coercion of R double to any Rust numerical data types is not possible. Thus, the only reasonable approach is to always convert R doubles to Option<f64>, with NA_REAL converted to None. From there on, we can safely handle the value in Rust, e.g. extract the value from the option if it exists and then convert to f32 or i32 or whatever.

clauswilke · 2020-12-07T18:24:19Z

Thinking some more about this, I think there are three ways we can handle missing values, and only two of them are a good idea. In the following, let T be some type we want to extract from an R object, e.g. f64.

These are the options:

Only provide an interface to Option<T>. So function arguments for functions exported via #[extendr] would have to be Option<T>, they could never be just T.
Provide an interface to both Option<T> and T. For the latter, throw an R error if a missing value is encountered. In other words, we can use T as a function argument but such a function could never accept missing values as input.
Provide an interface to T that ignores missing values. This is what we're currently doing. It can lead to silent data corruption, as shown above.

I think both options 1 and 2 are fine, and are idiomatic within both the R and the Rust mindsets. Option 3 is not acceptable, I believe. It will cause data corruption, and thus people will eventually be very upset with us for providing this interface. Also, an interface that requires the programmer to handle a type in a special way or risk data corruption, without forcing this behavior at the compile stage, goes very much against the philosophy of Rust to be strictly typed and prevent undefined behavior.

At the same time, I understand having to constantly juggle Option<T> everywhere is annoying. Therefore, I think the second approach is best, provide both Option<T> and T interfaces but throw an error when the T interface is used with missing values.

clauswilke · 2020-12-07T18:36:24Z

Here is another example of silent data corruption. Again, I think this is a pervasive issue, not a minor corner case.

library(rextendr)

rust_function(
  code = "fn rust_add(x: i32, y: i32) -> i32 {
    x + y
  }",
  patch.crates_io = c(
    'extendr-api = { git = "https://github.com/extendr/extendr" }',
    'extendr-macros = { git = "https://github.com/extendr/extendr" }'
  )
)

# Ok
rust_add(2, 3)
#> [1] 5

# Ok
rust_add(NA_integer_, 0)
#> [1] NA

# This is bad
rust_add(NA_integer_, 1)
#> [1] -2147483647

^{Created on 2020-12-07 by the reprex package (v0.3.0)}

lebensterben · 2020-12-07T20:04:21Z

In terms of R INTEGER and REAL, their NA value are smallest representable i32, and a hard codes value, respectively.

The case of REAL_NA is actually easy to dealt with. Because it's seen as NaN by Rust, and algebraic operation on NaN returns NaN, so Rust compiler simply won't modify the value. When it's returned to R, it's still REAL_NA.

For INTEGER_NA, it's a bit complex. It's i32::MIN, and Rust algebraic operations WILL modify its value.

clauswilke · 2020-12-07T20:15:03Z

@lebensterben I think your previous example with type casts has shown that we cannot rely on Rust treating NA_real_ properly. If somewhere in the calculation the f64 is cast to some other type, we can expect incorrect results. Example follows.

library(rextendr)

rust_function(
  code = "fn rust_add(x: f64, y: f64) -> f32 {
    (x + y) as f32
  }",
  patch.crates_io = c(
    'extendr-api = { git = "https://github.com/extendr/extendr" }',
    'extendr-macros = { git = "https://github.com/extendr/extendr" }'
  )
)

# Ok
rust_add(2, 3)
#> [1] 5

# This is bad
rust_add(NA_real_, 0)
#> [1] NaN

# This is also bad
rust_add(NA_integer_, 0)
#> [1] -2147483648

^{Created on 2020-12-07 by the reprex package (v0.3.0)}

For comparison, this is the expected behavior:

R_add <- function(x, y) x + y

R_add(2, 3)
#> [1] 5
R_add(NA_real_, 0)
#> [1] NA
R_add(NA_integer_, 0)
#> [1] NA

^{Created on 2020-12-07 by the reprex package (v0.3.0)}

clauswilke · 2020-12-07T20:21:26Z

Oh, and even without type casts, if you were to argue that converting NA to NaN is acceptable, you'd end up with addition that is not commutative.

library(rextendr)

rust_function(
  code = "fn rust_add(x: f64, y: f64) -> f64 { x + y }",
  patch.crates_io = c(
    'extendr-api = { git = "https://github.com/extendr/extendr" }',
    'extendr-macros = { git = "https://github.com/extendr/extendr" }'
  )
)

rust_add(NaN, NA_real_)
#> [1] NaN
rust_add(NA_real_, NaN)
#> [1] NA

^{Created on 2020-12-07 by the reprex package (v0.3.0)}

lebensterben · 2020-12-07T20:24:49Z

@clauswilke
Both Rust and R use the standard way of checking the first few bits of a float to determine whether it is NaN. But in addition, R will explicitly do another round of check to distinguish NaN from NA, by checking some magical bits.

If the float is coerced to another type in Rust, then certainly the magical bits in NA_REAL won't work again.

But if f64 is always treated as f64, and never converted, any algebraic operation on it won't modify the value. For sure this is not bullet-proof from mistakes.

clauswilke · 2020-12-07T20:27:46Z

But if f64 is always treated as f64, and never converted, any algebraic operation on it won't modify the value. For sure this is not bullet-proof from mistakes.

As I pointed out in my other comment (which I may have been typing while you typed yours), that's only true if you're willing to forego the commutative property. Adding a NaN to an NA will be different from adding an NA to a NaN, because Rust simply keeps the first value.

lebensterben · 2020-12-07T20:36:39Z

rust_add(NaN, NA_real_)
#> [1] NaN
rust_add(NA_real_, NaN)
#> [1] NA

This exemplifies the way how Rust compiler works with NaN.
For any x of Rust float type, NaN + x == NaN.

This also holds for when there are two values, x and y, which are both seen as Rust NaN.
It follows that x + y == x.

That's why NaN + NA == NaN, while NA + NaN == NA

This IS the expected behaviour, and that's a lovely property to our application, that NA is never mutated.

Sure this violates commutativity. But from R's perspective, NaN is treated similar to NA.
When NaN is coerced to other type in R, it's turned to a NA in the appropriate type.

andy-thomason · 2020-12-09T09:25:09Z

Just to clarify. NA_REAL is defined in arithmetic.c:

static double R_ValueOfNA(void)
{
    /* The gcc shipping with Fedora 9 gets this wrong without
     * the volatile declaration. Thanks to Marc Schwartz. */
    volatile ieee_double x;
    x.word[hw] = 0x7ff00000;
    x.word[lw] = 1954;
    return x.value;
}

This generates a special NaN with (someone's birthday?) in the mantissa.

Any arithmetic using NA will yield a NaN. The asymetic behaviour is interesting!

andy-thomason · 2020-12-09T09:30:31Z

Note that comparisons with NaN will always fail. Which leads to the classical NaN test:

if x == x {
// x is not a NaN
} else {
// x is a NaN
}

andy-thomason · 2020-12-09T10:54:49Z

To summarise:

We should implement from_robj() for Option<i32> and `Option for inputs.
We should implement Robj::from() for Option<i32> and Option<f64> for outputs.

As a wider note, we may want iterators to do the same.
maybe str_iter_opt() numeric_iter_opt() integer_iter_opt() to give an optional iterators.

Likewise, collect_robj() should take options.

andy-thomason · 2020-12-09T10:56:45Z

Another question: should from_robj::<i32/f64/str> reject if the value is NA?

andy-thomason · 2020-12-09T11:01:07Z

Not as idiomatic, but the formal version becomes:

fn rust_add(x: Option<f64>, y: Option<f64>) -> Option<f64> {
  if let (Some(x), Some(y)) = (x, y) {
    Some(x + y)
  } else {
    None
  }
}

lebensterben · 2020-12-09T12:30:27Z

Any arithmetic using NA will yield a NaN. The asymetic behaviour is interesting!

This is because NA is technically NaN according to ISO 754.

https://play.rust-lang.org/?version=nightly&mode=debug&edition=2018&gist=aba2a339f61e2be2b36b33aaa9fbc73c

Essentially, R made the decision that NA_INTEGER is smaller than any other integers, and NA_REAL is "greater" than any other doubles.

clauswilke · 2020-12-09T15:47:19Z

Another question: should from_robj::<i32/f64/str> reject if the value is NA?

Yes, I think that's the correct approach. Either people deal with NAs explicitly by working with options, or they accept the implicit assumption that their code cannot handle NAs and thus there'll be an error if an NA is encountered.

This was referenced Dec 6, 2020

Stricter input checking for extendr functions #64

Closed

Issue #49: str_iter() for factors #58

Merged

clauswilke added this to the 0.1.11 milestone Dec 10, 2020

andy-thomason linked a pull request Dec 17, 2020 that will close this issue

Handling of missing values. #77

Merged

andy-thomason closed this as completed in #77 Dec 21, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handling of missing values #63

Handling of missing values #63

clauswilke commented Dec 5, 2020 •

edited

Loading

clauswilke commented Dec 5, 2020 •

edited

Loading

lebensterben commented Dec 6, 2020

yutannihilation commented Dec 6, 2020

yutannihilation commented Dec 6, 2020

clauswilke commented Dec 6, 2020

clauswilke commented Dec 6, 2020

yutannihilation commented Dec 6, 2020

lebensterben commented Dec 6, 2020 •

edited

Loading

yutannihilation commented Dec 6, 2020

clauswilke commented Dec 6, 2020

clauswilke commented Dec 7, 2020

clauswilke commented Dec 7, 2020

lebensterben commented Dec 7, 2020

clauswilke commented Dec 7, 2020 •

edited

Loading

clauswilke commented Dec 7, 2020

lebensterben commented Dec 7, 2020

clauswilke commented Dec 7, 2020

lebensterben commented Dec 7, 2020

andy-thomason commented Dec 9, 2020 •

edited

Loading

andy-thomason commented Dec 9, 2020

andy-thomason commented Dec 9, 2020 •

edited

Loading

andy-thomason commented Dec 9, 2020

andy-thomason commented Dec 9, 2020

lebensterben commented Dec 9, 2020

clauswilke commented Dec 9, 2020

Handling of missing values #63

Handling of missing values #63

Comments

clauswilke commented Dec 5, 2020 • edited Loading

clauswilke commented Dec 5, 2020 • edited Loading

lebensterben commented Dec 6, 2020

yutannihilation commented Dec 6, 2020

yutannihilation commented Dec 6, 2020

clauswilke commented Dec 6, 2020

clauswilke commented Dec 6, 2020

yutannihilation commented Dec 6, 2020

lebensterben commented Dec 6, 2020 • edited Loading

yutannihilation commented Dec 6, 2020

clauswilke commented Dec 6, 2020

clauswilke commented Dec 7, 2020

clauswilke commented Dec 7, 2020

lebensterben commented Dec 7, 2020

clauswilke commented Dec 7, 2020 • edited Loading

clauswilke commented Dec 7, 2020

lebensterben commented Dec 7, 2020

clauswilke commented Dec 7, 2020

lebensterben commented Dec 7, 2020

andy-thomason commented Dec 9, 2020 • edited Loading

andy-thomason commented Dec 9, 2020

andy-thomason commented Dec 9, 2020 • edited Loading

andy-thomason commented Dec 9, 2020

andy-thomason commented Dec 9, 2020

lebensterben commented Dec 9, 2020

clauswilke commented Dec 9, 2020

clauswilke commented Dec 5, 2020 •

edited

Loading

clauswilke commented Dec 5, 2020 •

edited

Loading

lebensterben commented Dec 6, 2020 •

edited

Loading

clauswilke commented Dec 7, 2020 •

edited

Loading

andy-thomason commented Dec 9, 2020 •

edited

Loading

andy-thomason commented Dec 9, 2020 •

edited

Loading