Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-3628: [R] Expose Decimal128Array using vctrs #2845

Closed
wants to merge 26 commits into from

Conversation

romainfrancois
Copy link
Contributor

This is a WIP for holding decimal types in a vctrs record, that I submit for initial feedback.

I'm probably doing something wrong, because the ToString methods gives me nonsense:

library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
#> The following objects are masked from 'package:base':
#> 
#>     array, table
library(vctrs)
#> 
#> Attaching package: 'vctrs'
#> The following object is masked from 'package:arrow':
#> 
#>     list_of

(x <- vec_cast(1:10, new_decimal128()))
#> <arrow_decimal128[10]>
#>  [1] 1  2  3  4  5  6  7  8  9  10
# ???
(a <- array(x))
#> arrow::Decimal128Array <0x7fc838463bd0> 
#> [
#>   18446744073709551616,
#>   36893488147419103232,
#>   55340232221128654848,
#>   73786976294838206464,
#>   92233720368547758080,
#>   110680464442257309696,
#>   129127208515966861312,
#>   147573952589676412928,
#>   166020696663385964544,
#>   184467440737095516160
#> ]

a$FormatValue(0L)
#> [1] "18446744073709551616"

a$as_vector()
#> <arrow_decimal128[10]>
#>  [1] 1  2  3  4  5  6  7  8  9  10

The basic idea is to host a Decimal128 (which takes 128 bits) in a complex (which also takes 128 bits), so that it minimizes copies (eventually when we use ALTREP).

@wesm wesm changed the title Arrow 3307/decimal ARROW-3628: [R] Expose Decimal128Array using vctrs Oct 26, 2018
@wesm
Copy link
Member

wesm commented Oct 26, 2018

I created a JIRA for this and updated the issue title

@romainfrancois romainfrancois force-pushed the ARROW-3307/decimal branch 2 times, most recently from 99bd849 to d428642 Compare November 2, 2018 14:14
@romainfrancois romainfrancois force-pushed the ARROW-3307/decimal branch 3 times, most recently from 11d41cf to 107b09b Compare November 9, 2018 20:48
r/src/array.cpp Outdated Show resolved Hide resolved
r/src/array.cpp Outdated Show resolved Hide resolved
r/src/array.cpp Outdated Show resolved Hide resolved
@wesm
Copy link
Member

wesm commented Nov 10, 2018

I opened https://issues.apache.org/jira/browse/ARROW-3747, we should probably go ahead and do that if it causes no problem as it will make things easier here

@romainfrancois
Copy link
Contributor Author

Please yes. This was very confusing that i could not just use the 16bytes of an Rcomplex or an arrow::Decimal128 interchangeably.

The ComplexVector is indeed used as a host for a contiguous array of things that occupy 16 bytes.

That’s the same workaround as the bit64 📦 uses for representing 64 bits integers hosted in a double vector.

@wesm
Copy link
Member

wesm commented Nov 10, 2018

OK if you rebase then things ought to work better

@wesm
Copy link
Member

wesm commented Nov 10, 2018

sorry, you'll need to wait for #2940 to be merged

@romainfrancois
Copy link
Contributor Author

This still needs some care about:

  • how to handle the nulls on the R side
  • what to do with a decimal128 record

but the basics are in place, and works more naturally with #2940:

library(arrow)
a <- array(1:10, new_decimal128())
a
#> arrow::Decimal128Array 
#> [
#>   1,
#>   2,
#>   3,
#>   4,
#>   5,
#>   6,
#>   7,
#>   8,
#>   9,
#>   10
#> ]
a$as_vector()
#> <arrow_decimal128[10]>
#>  [1] 1  2  3  4  5  6  7  8  9  10

Created on 2018-11-12 by the reprex package (v0.2.1.9000)

@wesm
Copy link
Member

wesm commented Nov 12, 2018

Maybe a question for @hadley, what's the NA strategy for custom vectors?

@romainfrancois
Copy link
Contributor Author

I would be inclined to use a something binary compatible with the arrow bitmap. buffers, hosted in e.g. a raw vector, but I'm not sure vctrs records would allow that.

In this example, we'd have a field data that is a complex vector (just because it offers 16 bytes), and conceptually we could have a null_bitmap field, following something like this: (with more care):

bitmap <- function(n) {
  # each byte (element of a raw vector) gives 
  # 8 values
  nbytes <- ceiling(n/8)
  new_vctr(raw(nbytes), n = n, class = "bitmap")
}

vec_size.bitmap <- function(x) {
  attr(x, "n")
}
# ... other methods to fool vctrs

The bitmap could also be an ALTREP logical vector, so that as far as R is concerned it's just a regular logical vector, but internally its memory come from the arrow::Buffer. We can do that in R 3.6 with wch/r-source@b60247b

The advantage is that vctrs would just see this as a logical vector, the downside is that it may be materialized into a true contiguous logical vector (i.e. a vector of int) if dataptr is called on the altrep object.

@romainfrancois
Copy link
Contributor Author

The alternative is to use some sort of sentinel here too, this is what e.g. integer64 does in bit64

@romainfrancois
Copy link
Contributor Author

🤔 maybe not as a field, but as another attribute that holds buffer + offset. I'll try that.

@hadley
Copy link
Contributor

hadley commented Nov 12, 2018

Personally, either I'd recommend holding off on Decimal128 coercion for now, or putting it in the simple possible structure that an expert could still compute with. There is no tooling in R available to work with numbers of this type, so there's no reason to invest in them at this time (except as much as needed to facilitate a round-trip between arrow and R).

@romainfrancois
Copy link
Contributor Author

Makes sense, having a minimum structure to work with for now is fine, and then we can leave things like + etc ... (which essentially would have to use methods of the C++ class arrow::Decimal128 for later).

What we have now can host the value buffer in a complex vector. because that's in a record, we get things for free, like vec_slice, ...

I'll see what I can do for the nulls, but yeah I don't want to spend too much time on this until there's a need.

@romainfrancois romainfrancois force-pushed the ARROW-3307/decimal branch 5 times, most recently from 0f31c8d to 0ff4ed9 Compare November 19, 2018 11:17
@romainfrancois romainfrancois added Component: R WIP PR is work in progress labels Nov 22, 2018
kou and others added 25 commits March 29, 2019 06:20
Author: Kouhei Sutou <kou@clear-code.com>

Closes apache#4069 from kou/release-fix-typo-in-mail-template and squashes the following commits:

15dc3c8 <Kouhei Sutou>  Fix typos in vote e-mail template
  * Use jq in local
  * Ensure making variables for "for" function local for parallel processing

Author: Kouhei Sutou <kou@clear-code.com>

Closes apache#4070 from kou/release-improve-binary-upload-performance and squashes the following commits:

2463d3c <Kouhei Sutou>  Improve 03-binary performance
@wesm
Copy link
Member

wesm commented May 16, 2019

Closing until a consensus about this feature can be reached

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants