Handling of missing values #63
Comments
I think wrapping all input into options would be more idiomatic, but it doesn't currently work.
Created on 2020-12-05 by the reprex package (v0.3.0)
This is the compiler error message from the above build attempt:
extendr/extendr-macros/src/lib.rs Line 63 in 578ce65
This is easy to solve.
Apache Arrow also uses https://docs.rs/arrow/2.0.0/arrow/#array Handling |
Here's the current behavior. Should I file a separate issue for this?
library(rextendr)
rust_function(
  code = "fn rust_int(input: i32) -> i32 {
    input
  }",
  patch.crates_io = c(
    'extendr-api = { git = "https://github.com/extendr/extendr" }',
    'extendr-macros = { git = "https://github.com/extendr/extendr" }'
  )
)
#> build directory: /tmp/Rtmp9cpfVn/file226494fffa9e6
#> Prebuild libR bindings are not available. Run `install_libR_bindings()` to improve future build times.
rust_int(Inf)
#> [1] 2147483647
rust_int(NaN)
#> [1] 0
Created on 2020-12-06 by the reprex package (v0.3.0)
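The output above is consistent with how Rust's `as` cast treats floats: out-of-range values saturate and NaN becomes 0. The sketch below shows that behaviour and contrasts it with a checked conversion that reports an error instead. `checked_f64_to_i32` is a hypothetical helper written for illustration; it is not part of extendr, and whether extendr performs the cast this way internally is an assumption.

```rust
/// Plain `as` cast: saturates at the i32 bounds and maps NaN to 0.
fn saturating_cast(x: f64) -> i32 {
    x as i32
}

/// Hypothetical checked conversion (not an extendr function):
/// returns an error for any double that has no exact i32 representation.
fn checked_f64_to_i32(x: f64) -> Result<i32, String> {
    if !x.is_finite() {
        // Covers NaN (which includes R's NA_real_) as well as +/- infinity.
        return Err(format!("{} is not a finite number", x));
    }
    if x.fract() != 0.0 {
        return Err(format!("{} is not a whole number", x));
    }
    if x < i32::MIN as f64 || x > i32::MAX as f64 {
        return Err(format!("{} is outside the i32 range", x));
    }
    // Safe: x is finite, whole, and within range.
    Ok(x as i32)
}
```

With this helper, `checked_f64_to_i32(f64::INFINITY)` and `checked_f64_to_i32(f64::NAN)` both return `Err`, whereas `saturating_cast` silently yields `i32::MAX` and `0`.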
I'm not sure there's a problem with
Created on 2020-12-06 by the reprex package (v0.3.0) And note that if you append
Created on 2020-12-06 by the reprex package (v0.3.0) So the question is rather whether extendr should implicitly convert integers to doubles. Given your example, I think the answer may be no. But either way, it's a separate issue. |
One more
Created on 2020-12-06 by the reprex package (v0.3.0) |
Oh, thanks, got it!
Agreed. Now I feel my example should return
as.integer(Inf)
#> Warning: NAs introduced by coercion to integer range
#> [1] NA
as.integer(NaN)
#> [1] NA
Created on 2020-12-06 by the reprex package (v0.3.0)
For your example, the only correct behaviour is to return an error, not even an NA.
Aw, thanks. This is horrible... I agree this should return an error. |
My take is slightly different. I think the example by @lebensterben shows that coercion of R double to any Rust numerical data types is not possible. Thus, the only reasonable approach is to always convert R doubles to |
Thinking some more about this, I think there are three ways we can handle missing values, and only two of them are a good idea. In the following, let These are the options:
I think both options 1 and 2 are fine, and are idiomatic within both the R and the Rust mindsets. Option 3 is not acceptable, I believe. It will cause data corruption, and thus people will eventually be very upset with us for providing this interface. Also, an interface that requires the programmer to handle a type in a special way or risk data corruption, without forcing this behavior at the compile stage, goes very much against the philosophy of Rust to be strictly typed and prevent undefined behavior. At the same time, I understand having to constantly juggle |
Here is another example of silent data corruption. Again, I think this is a pervasive issue, not a minor corner case.
Created on 2020-12-07 by the reprex package (v0.3.0) |
For R INTEGER and REAL, the NA values are the smallest representable i32 and a hard-coded value, respectively. The case of REAL_NA is actually easy to deal with: Rust sees it as NaN, and algebraic operations on NaN return NaN, so Rust simply won't modify the value. When it's returned to R, it's still REAL_NA. INTEGER_NA is a bit more complex: it's i32::MIN, and Rust algebraic operations WILL modify its value.
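A minimal sketch of the integer case described above, assuming (as stated in the comment) that R's integer NA is `i32::MIN`. The names `NA_INTEGER`, `from_r_int`, `to_r_int`, `add_one_raw`, and `add_one` are hypothetical and chosen for illustration only; they are not extendr APIs.

```rust
/// R's integer NA, stored as the smallest i32 (per the comment above).
const NA_INTEGER: i32 = i32::MIN;

/// Arithmetic on the raw value: NA silently becomes an ordinary number.
fn add_one_raw(x: i32) -> i32 {
    x.wrapping_add(1)
}

/// Decode a raw R integer into an Option, mapping NA to None.
fn from_r_int(x: i32) -> Option<i32> {
    if x == NA_INTEGER { None } else { Some(x) }
}

/// Encode an Option back into R's representation, mapping None to NA.
fn to_r_int(x: Option<i32>) -> i32 {
    x.unwrap_or(NA_INTEGER)
}

/// Arithmetic on the Option: NA stays NA, and overflow also becomes NA
/// (a design choice in this sketch, mirroring R's integer overflow result).
fn add_one(x: Option<i32>) -> Option<i32> {
    x.and_then(|v| v.checked_add(1))
}
```

Here `add_one_raw(NA_INTEGER)` returns `-2147483647`, which R would read as a regular integer, while `to_r_int(add_one(from_r_int(NA_INTEGER)))` round-trips to `NA_INTEGER`.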
@lebensterben I think your previous example with type casts has shown that we cannot rely on Rust treating
Created on 2020-12-07 by the reprex package (v0.3.0) For comparison, this is the expected behavior:
Created on 2020-12-07 by the reprex package (v0.3.0) |
Oh, and even without type casts, if you were to argue that converting
Created on 2020-12-07 by the reprex package (v0.3.0) |
@clauswilke If the float is coerced to another type in Rust, then the magic bits in NA_REAL certainly won't survive. But if f64 is always treated as f64, and never converted, any algebraic operation on it won't modify the value. Admittedly, this is not bulletproof against mistakes.
As I pointed out in my other comment (which I may have been typing while you typed yours), that's only true if you're willing to forego the commutative property. Adding a |
This exemplifies how the Rust compiler works with NaN. It also holds when there are two values, x and y, that Rust both sees as NaN. That's why NaN + NA == NaN, while NA + NaN == NA. This IS the expected behaviour, and it's a lovely property for our application that NA is never mutated. Sure, this violates commutativity. But from R's perspective, NaN is treated similarly to NA.
Just to clarify. NA_REAL is defined in arithmetic.c:
This generates a special NaN with (someone's birthday?) in the mantissa. Any arithmetic using NA will yield a NaN. The asymmetric behaviour is interesting!
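A sketch of how the special NaN could be recognised on the Rust side. It assumes the layout used in R's `arithmetic.c` as I understand it: a NaN whose low 32-bit word is 1954, with R's own `R_IsNA` checking only that low word. The constant and function names below are hypothetical, not extendr APIs.

```rust
/// Assumed bit pattern of R's NA_real_: exponent all ones, low word 1954 (0x07A2).
const NA_REAL_BITS: u64 = 0x7FF0_0000_0000_07A2;

/// Build the NA value from its bit pattern.
fn na_real() -> f64 {
    f64::from_bits(NA_REAL_BITS)
}

/// True only for NaNs carrying the 1954 payload in the low 32 bits.
/// Only the low word is compared, so the test does not depend on
/// whether the hardware has set the quiet-NaN bit in the high word.
fn is_na_real(x: f64) -> bool {
    x.is_nan() && (x.to_bits() & 0xFFFF_FFFF) == 1954
}
```

An ordinary `f64::NAN` is a NaN as far as Rust is concerned, but `is_na_real(f64::NAN)` is false because its payload differs. Which payload survives an operation with two NaN operands depends on the hardware, which is why the sketch makes no claim about the result of `NA + NaN`.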
Note that comparisons with NaN will always fail. Which leads to the classical NaN test: if x == x { |
To summarise:
As a wider note, we may want iterators to do the same. Likewise, |
Another question: should from_robj::<i32/f64/str> reject if the value is NA? |
Not as idiomatic, but the formal version becomes:
|
This is because NA is technically NaN according to IEEE 754. Essentially, R made the decision that NA_INTEGER is smaller than any other integer, and NA_REAL is "greater" than any other double.
Yes, I think that's the correct approach. Either people deal with |
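One way to picture the proposal is a conversion that rejects NA for plain types and accepts it only for `Option`. The trait below is a hypothetical illustration of that idea for integers; extendr's actual `from_robj` works on R objects and has a different signature.

```rust
/// R's integer NA, assumed to be the smallest i32.
const NA_INTEGER: i32 = i32::MIN;

/// Hypothetical conversion trait (not extendr's API).
trait FromRInt: Sized {
    fn from_r_int(raw: i32) -> Result<Self, String>;
}

/// A plain i32 cannot represent a missing value, so NA is an error.
impl FromRInt for i32 {
    fn from_r_int(raw: i32) -> Result<Self, String> {
        if raw == NA_INTEGER {
            Err("NA is not allowed here; use Option<i32> to accept it".to_string())
        } else {
            Ok(raw)
        }
    }
}

/// Option<i32> accepts every input and maps NA to None.
impl FromRInt for Option<i32> {
    fn from_r_int(raw: i32) -> Result<Self, String> {
        Ok(if raw == NA_INTEGER { None } else { Some(raw) })
    }
}
```

With this split, a function that declares an `i32` argument fails loudly when it receives NA, and a function that declares `Option<i32>` is forced by the type system to handle the missing case.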
Do we have a systematic way of handling `NA` values on the Rust side? It's not clear to me what the current intent is to handling missing values. Most basic Rust data types (e.g., `String`) cannot handle missing values. Instead, Rust uses options. Maybe we should systematically wrap all incoming data types into options?

As an example, the current implementation silently converts `NA_character_` into `"NA"`, which I would argue is incorrect behavior.

Created on 2020-12-05 by the reprex package (v0.3.0)