Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling of missing values #63

Closed
clauswilke opened this issue Dec 5, 2020 · 25 comments · Fixed by #77
Closed

Handling of missing values #63

clauswilke opened this issue Dec 5, 2020 · 25 comments · Fixed by #77
Milestone

Comments

@clauswilke
Copy link
Member

clauswilke commented Dec 5, 2020

Do we have a systematic way of handling NA values on the Rust side? It's not clear to me what the current intent is to handling missing values. Most basic Rust data types (e.g., String) cannot handle missing values. Instead, Rust uses options. Maybe we should systematically wrap all incoming data types into options?

As an example, the current implementation silently converts NA_character_ into "NA", which I would argue is incorrect behavior.

library(rextendr)

rust_function(
  code = "fn rust_strings(input: &str) -> String {
    input.to_string()
  }",
  patch.crates_io = c(
    'extendr-api = { git = "https://github.com/extendr/extendr" }',
    'extendr-macros = { git = "https://github.com/extendr/extendr" }'
  )
)

rust_strings("a")
#> [1] "a"
rust_strings(NA_character_)
#> [1] "NA"

Created on 2020-12-05 by the reprex package (v0.3.0)

@clauswilke
Copy link
Member Author

clauswilke commented Dec 5, 2020

I think wrapping all input into options would be more idiomatic, but it doesn't currently work.

library(rextendr)

rust_function(
  code = r"(fn rust_strings(input: Option<&str>) -> String {
    // ... do something with input
    "hello".to_string() // some dummy output for now
  })",
  patch.crates_io = c(
    'extendr-api = { git = "https://github.com/extendr/extendr" }',
    'extendr-macros = { git = "https://github.com/extendr/extendr" }'
  )
)
#> Error: Rust code could not be compiled successfully. Aborting.

Created on 2020-12-05 by the reprex package (v0.3.0)

This is the compiler error message from the above build attempt:

   Compiling rextendr8 v0.0.1 (/var/folders/b1/13gn4j655jddkfhxmtk5tsfm0000gn/T/RtmpxTXC7C/file283a636057aa)
error[E0599]: no variant or associated item named `from_robj` found for enum `Option<&str>` in the current scope
 --> src/lib.rs:3:1
  |
3 | #[extendr]
  | ^^^^^^^^^^ variant or associated item not found in `Option<&str>`
  |
  = note: this error originates in an attribute macro (in Nightly builds, run with -Z macro-backtrace for more info)

error: aborting due to previous error

For more information about this error, try `rustc --explain E0599`.
error: could not compile `rextendr8`

To learn more, run the command again with --verbose.

@lebensterben
Copy link
Contributor

Some(parse_quote!{ extendr_api::unwrap_or_throw(<#ty>::from_robj(&#varname)) })

Option<&str> doesn't implement FromRobj

This is easy to solve.
For all T implements FromRobj, the two following identities holds

  1. <Option<T>>::from_robj(x) == <T>::from_robj(x) iff the Robj is NOT NA.
  2. <Option<T>>::from_robj(x) == None iff the Robj is NA.

@yutannihilation
Copy link
Contributor

Apache Arrow also uses Option<T> to represent the missingness (note that the arrow format internally has a separate bitmap to represent null c.f. doc), so this seems the right way.

https://docs.rs/arrow/2.0.0/arrow/#array

Handling NaN might be a bit more complicated.

@yutannihilation
Copy link
Contributor

Handling NaN might be a bit more complicated.

Here's the current behavior. Should I file a separate issue for this?

library(rextendr)

rust_function(
  code = "fn rust_int(input: i32) -> i32 {
    input
  }",
  patch.crates_io = c(
    'extendr-api = { git = "https://github.com/extendr/extendr" }',
    'extendr-macros = { git = "https://github.com/extendr/extendr" }'
  )
)
#> build directory: /tmp/Rtmp9cpfVn/file226494fffa9e6
#> Prebuild libR bindings are not available. Run `install_libR_bindings()` to improve future build times.

rust_int(Inf)
#> [1] 2147483647
rust_int(NaN)
#> [1] 0

Created on 2020-12-06 by the reprex package (v0.3.0)

@clauswilke
Copy link
Member Author

I'm not sure there's a problem with Inf and NaN. They work if you use the f64 type.

library(rextendr)

rust_function(
  code = "fn rust_dbl(input: f64) -> f64 {
    input
  }",
  patch.crates_io = c(
    'extendr-api = { git = "https://github.com/extendr/extendr" }',
    'extendr-macros = { git = "https://github.com/extendr/extendr" }'
  )
)

rust_dbl(Inf)
#> [1] Inf
rust_dbl(NaN)
#> [1] NaN

Created on 2020-12-06 by the reprex package (v0.3.0)

And note that if you append Inf or NaN to an integer vector in R it converts the vector to double:

library(tibble)
tibble(x = c(1L, 2L, 3L, Inf, NaN), y = 1L:5L)
#> # A tibble: 5 x 2
#>       x     y
#>   <dbl> <int>
#> 1     1     1
#> 2     2     2
#> 3     3     3
#> 4   Inf     4
#> 5   NaN     5

Created on 2020-12-06 by the reprex package (v0.3.0)

So the question is rather whether extendr should implicitly convert integers to doubles. Given your example, I think the answer may be no. But either way, it's a separate issue.

@clauswilke
Copy link
Member Author

One more Inf example, created by division by zero in Rust. I think Inf and NaN work because they are part of the IEEE standard, it's only NA that makes trouble.

library(rextendr)

rust_function(
  code = "fn div_by_zero(input: f64) -> f64 {
    input / 0.
  }",
  patch.crates_io = c(
    'extendr-api = { git = "https://github.com/extendr/extendr" }',
    'extendr-macros = { git = "https://github.com/extendr/extendr" }'
  )
)

div_by_zero(3.)
#> [1] Inf
div_by_zero(-3.)
#> [1] -Inf

Created on 2020-12-06 by the reprex package (v0.3.0)

@yutannihilation
Copy link
Contributor

because they are part of the IEEE standard, it's only NA that makes trouble.

Oh, thanks, got it!

So the question is rather whether extendr should implicitly convert integers to doubles. Given your example, I think the answer may be no. But either way, it's a separate issue.

Agreed. Now I feel my example should return NA for both Inf and NaN, but let's discuss this another time and focus on missing values.

as.integer(Inf)
#> Warning: NAs introduced by coercion to integer range
#> [1] NA
as.integer(NaN)
#> [1] NA

Created on 2020-12-06 by the reprex package (v0.3.0)

@lebensterben
Copy link
Contributor

lebensterben commented Dec 6, 2020

@yutannihilation

See this
https://play.rust-lang.org/?version=nightly&mode=debug&edition=2018&gist=ece28b8252f43834c424594c86a31e1a

For your example, the only correct behaviour is to return an error, not even a NA.

@yutannihilation
Copy link
Contributor

Aw, thanks. This is horrible... I agree this should return an error.

@clauswilke
Copy link
Member Author

My take is slightly different. I think the example by @lebensterben shows that coercion of R double to any Rust numerical data types is not possible. Thus, the only reasonable approach is to always convert R doubles to Option<f64>, with NA_REAL converted to None. From there on, we can safely handle the value in Rust, e.g. extract the value from the option if it exists and then convert to f32 or i32 or whatever.

@clauswilke
Copy link
Member Author

Thinking some more about this, I think there are three ways we can handle missing values, and only two of them are a good idea. In the following, let T be some type we want to extract from an R object, e.g. f64.

These are the options:

  1. Only provide an interface to Option<T>. So function arguments for functions exported via #[extendr] would have to be Option<T>, they could never be just T.

  2. Provide an interface to both Option<T> and T. For the latter, throw an R error if a missing value is encountered. In other words, we can use T as a function argument but such a function could never accept missing values as input.

  3. Provide an interface to T that ignores missing values. This is what we're currently doing. It can lead to silent data corruption, as shown above.

I think both options 1 and 2 are fine, and are idiomatic within both the R and the Rust mindsets. Option 3 is not acceptable, I believe. It will cause data corruption, and thus people will eventually be very upset with us for providing this interface. Also, an interface that requires the programmer to handle a type in a special way or risk data corruption, without forcing this behavior at the compile stage, goes very much against the philosophy of Rust to be strictly typed and prevent undefined behavior.

At the same time, I understand having to constantly juggle Option<T> everywhere is annoying. Therefore, I think the second approach is best, provide both Option<T> and T interfaces but throw an error when the T interface is used with missing values.

@clauswilke
Copy link
Member Author

Here is another example of silent data corruption. Again, I think this is a pervasive issue, not a minor corner case.

library(rextendr)

rust_function(
  code = "fn rust_add(x: i32, y: i32) -> i32 {
    x + y
  }",
  patch.crates_io = c(
    'extendr-api = { git = "https://github.com/extendr/extendr" }',
    'extendr-macros = { git = "https://github.com/extendr/extendr" }'
  )
)

# Ok
rust_add(2, 3)
#> [1] 5

# Ok
rust_add(NA_integer_, 0)
#> [1] NA

# This is bad
rust_add(NA_integer_, 1)
#> [1] -2147483647

Created on 2020-12-07 by the reprex package (v0.3.0)

@lebensterben
Copy link
Contributor

In terms of R INTEGER and REAL, their NA value are smallest representable i32, and a hard codes value, respectively.

The case of REAL_NA is actually easy to dealt with. Because it's seen as NaN by Rust, and algebraic operation on NaN returns NaN, so Rust compiler simply won't modify the value. When it's returned to R, it's still REAL_NA.

For INTEGER_NA, it's a bit complex. It's i32::MIN, and Rust algebraic operations WILL modify its value.

@clauswilke
Copy link
Member Author

clauswilke commented Dec 7, 2020

@lebensterben I think your previous example with type casts has shown that we cannot rely on Rust treating NA_real_ properly. If somewhere in the calculation the f64 is cast to some other type, we can expect incorrect results. Example follows.

library(rextendr)

rust_function(
  code = "fn rust_add(x: f64, y: f64) -> f32 {
    (x + y) as f32
  }",
  patch.crates_io = c(
    'extendr-api = { git = "https://github.com/extendr/extendr" }',
    'extendr-macros = { git = "https://github.com/extendr/extendr" }'
  )
)

# Ok
rust_add(2, 3)
#> [1] 5

# This is bad
rust_add(NA_real_, 0)
#> [1] NaN

# This is also bad
rust_add(NA_integer_, 0)
#> [1] -2147483648

Created on 2020-12-07 by the reprex package (v0.3.0)

For comparison, this is the expected behavior:

R_add <- function(x, y) x + y

R_add(2, 3)
#> [1] 5
R_add(NA_real_, 0)
#> [1] NA
R_add(NA_integer_, 0)
#> [1] NA

Created on 2020-12-07 by the reprex package (v0.3.0)

@clauswilke
Copy link
Member Author

Oh, and even without type casts, if you were to argue that converting NA to NaN is acceptable, you'd end up with addition that is not commutative.

library(rextendr)

rust_function(
  code = "fn rust_add(x: f64, y: f64) -> f64 { x + y }",
  patch.crates_io = c(
    'extendr-api = { git = "https://github.com/extendr/extendr" }',
    'extendr-macros = { git = "https://github.com/extendr/extendr" }'
  )
)

rust_add(NaN, NA_real_)
#> [1] NaN
rust_add(NA_real_, NaN)
#> [1] NA

Created on 2020-12-07 by the reprex package (v0.3.0)

@lebensterben
Copy link
Contributor

@clauswilke
Both Rust and R use the standard way of checking the first few bits of a float to determine whether it is NaN. But in addition, R will explicitly do another round of check to distinguish NaN from NA, by checking some magical bits.

If the float is coerced to another type in Rust, then certainly the magical bits in NA_REAL won't work again.

But if f64 is always treated as f64, and never converted, any algebraic operation on it won't modify the value. For sure this is not bullet-proof from mistakes.

@clauswilke
Copy link
Member Author

But if f64 is always treated as f64, and never converted, any algebraic operation on it won't modify the value. For sure this is not bullet-proof from mistakes.

As I pointed out in my other comment (which I may have been typing while you typed yours), that's only true if you're willing to forego the commutative property. Adding a NaN to an NA will be different from adding an NA to a NaN, because Rust simply keeps the first value.

@lebensterben
Copy link
Contributor

rust_add(NaN, NA_real_)
#> [1] NaN
rust_add(NA_real_, NaN)
#> [1] NA

This exemplifies the way how Rust compiler works with NaN.
For any x of Rust float type, NaN + x == NaN.

This also holds for when there are two values, x and y, which are both seen as Rust NaN.
It follows that x + y == x.

That's why NaN + NA == NaN, while NA + NaN == NA

This IS the expected behaviour, and that's a lovely property to our application, that NA is never mutated.

Sure this violates commutativity. But from R's perspective, NaN is treated similar to NA.
When NaN is coerced to other type in R, it's turned to a NA in the appropriate type.

@andy-thomason
Copy link
Contributor

andy-thomason commented Dec 9, 2020

Just to clarify. NA_REAL is defined in arithmetic.c:

static double R_ValueOfNA(void)
{
    /* The gcc shipping with Fedora 9 gets this wrong without
     * the volatile declaration. Thanks to Marc Schwartz. */
    volatile ieee_double x;
    x.word[hw] = 0x7ff00000;
    x.word[lw] = 1954;
    return x.value;
}

This generates a special NaN with (someone's birthday?) in the mantissa.

Any arithmetic using NA will yield a NaN. The asymetic behaviour is interesting!

@andy-thomason
Copy link
Contributor

Note that comparisons with NaN will always fail. Which leads to the classical NaN test:

if x == x {
// x is not a NaN
} else {
// x is a NaN
}

@andy-thomason
Copy link
Contributor

andy-thomason commented Dec 9, 2020

To summarise:

  • We should implement from_robj() for Option<i32> and `Option for inputs.
  • We should implement Robj::from() for Option<i32> and Option<f64> for outputs.

As a wider note, we may want iterators to do the same.
maybe str_iter_opt() numeric_iter_opt() integer_iter_opt() to give an optional iterators.

Likewise, collect_robj() should take options.

@andy-thomason
Copy link
Contributor

Another question: should from_robj::<i32/f64/str> reject if the value is NA?

@andy-thomason
Copy link
Contributor

Not as idiomatic, but the formal version becomes:

fn rust_add(x: Option<f64>, y: Option<f64>) -> Option<f64> {
  if let (Some(x), Some(y)) = (x, y) {
    Some(x + y)
  } else {
    None
  }
}

@lebensterben
Copy link
Contributor

Any arithmetic using NA will yield a NaN. The asymetic behaviour is interesting!

This is because NA is technically NaN according to ISO 754.

https://play.rust-lang.org/?version=nightly&mode=debug&edition=2018&gist=aba2a339f61e2be2b36b33aaa9fbc73c

Essentially, R made the decision that NA_INTEGER is smaller than any other integers, and NA_REAL is "greater" than any other doubles.

@clauswilke
Copy link
Member Author

Another question: should from_robj::<i32/f64/str> reject if the value is NA?

Yes, I think that's the correct approach. Either people deal with NAs explicitly by working with options, or they accept the implicit assumption that their code cannot handle NAs and thus there'll be an error if an NA is encountered.

@clauswilke clauswilke added this to the 0.1.11 milestone Dec 10, 2020
@andy-thomason andy-thomason linked a pull request Dec 17, 2020 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants