# Read an Arrow file in R

In [1]:
ncvote <- arrow::read_ipc_file("ncvoter_Statewide.arrow")  # as a tibble
tibble::glimpse(ncvote)

Rows: 8,778,585
Columns: 67
$ county_id                <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ county_desc              <fct> ALAMANCE, ALAMANCE, ALAMANCE, ALAMANCE, ALAMA…
$ voter_reg_num            <int> 9005990, 9178574, 9205561, 9048723, 9019674, …
$ ncid                     <chr> "AA56273", "AA201627", "AA216996", "AA98377",…
$ last_name                <fct> AABEL, AARDEN, AARMSTRONG, AARON, AARON, AARO…
$ first_name               <fct> RUTH, JONI, TIMOTHY, CHRISTINA, CLAUDIA, JAME…
$ middle_name              <fct> EVELYN, AUTUMN, DUANE, CASTAGNA, HAYDEN, MICH…
$ name_suffix_lbl          <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ status_cd 

Counting the voter status categories can be done with the `dplyr` package.

In [2]:
dplyr::count(ncvote, status_cd, voter_status_desc, sort=TRUE)

# A tibble: 5 × 3
  status_cd voter_status_desc       n
  <fct>     <fct>               <int>
1 A         ACTIVE            7089338
2 R         REMOVED           1100070
3 I         INACTIVE           401159
4 D         DENIED             173591
5 S         TEMPORARY           14427

Reading a large Arrow IPC file as a `data.frame` or `tibble` is slow because the data must be copied from the Arrow representation to R's representation. R has only one integer type (32-bit signed) and one floating-point type (64-bit), and it requires factor levels to be strings, so columns stored in other Arrow types must be converted.
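
As a rough illustration of that conversion, the sketch below builds a small in-memory Arrow table and converts it to a data frame. The table and its column names (`demo_types`, `small_int`, `single`) are made up for this example and are not part of the voter file.

```r
library(arrow)

# Hypothetical toy table using Arrow types that R has no direct equivalent for.
demo_types <- arrow_table(
  small_int = Array$create(c(1L, 2L, 3L), type = int8()),
  single    = Array$create(c(1.5, 2.5, 3.5), type = float32())
)
demo_types$schema          # Arrow-side column types

# Converting to a data frame widens both columns to R's native types.
demo_df <- as.data.frame(demo_types)
sapply(demo_df, typeof)    # R-side storage types after conversion
```

The 8-bit integers and 32-bit floats cannot be handed to R as they are, which is why the conversion has to allocate new vectors.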

Also, R represents missing values with sentinel values, whereas Arrow uses an optional validity bitmap, so you can't simply memory-map the file and pass pointers.
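
The difference can be seen on a toy vector. This is only a sketch; `r_min` and `demo_arr` are names invented for the example.

```r
library(arrow)

# R side: NA_integer_ takes the bit pattern of -2^31, so that value
# cannot be stored as an ordinary integer and coerces to NA instead.
r_min <- suppressWarnings(as.integer(-2^31))
is.na(r_min)

# Arrow side: nulls are tracked separately from the values,
# and the array reports how many entries are null.
demo_arr <- Array$create(c(1L, NA, 3L))
demo_arr$null_count
```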

You can avoid the translation and copying by working with a pyarrow-like representation instead.

In [3]:
tbl <- arrow::read_ipc_file("ncvoter_Statewide.arrow", as_data_frame = FALSE)
tbl

Table
8778585 rows x 67 columns
$county_id <int8 not null>
$county_desc <dictionary<values=string, indices=int8> not null>
$voter_reg_num <int32 not null>
$ncid <string not null>
$last_name <dictionary<values=string, indices=int32> not null>
$first_name <dictionary<values=string, indices=int32>>
$middle_name <dictionary<values=string, indices=int32>>
$name_suffix_lbl <dictionary<values=string, indices=int8>>
$status_cd <dictionary<values=string, indices=int8> not null>
$voter_status_desc <dictionary<values=string, indices=int8> not null>
$reason_cd <dictionary<values=string, indices=int8>>
$voter_status_reason_desc <dictionary<values=string, indices=int8>>
$res_street_address <string not null>
$res_city_desc <dictionary<values=string, indices=int16>>
$state_cd <dictionary<values=string, indices=int8>>
$zip_code <int32>
$mail_addr1 <string>
$mail_addr2 <dictionary<values=string, indices=int32>>
$mail_addr3 <dictionary<values=string, indices=int16>>
$mail_addr4 <dictionary<values=string,

In [4]:
class(tbl)

[1] "Table"        "ArrowTabular" "ArrowObject"  "R6"          
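
To try the same pattern without the full voter file, a small round trip through a temporary file works as a sketch. The data frame, the temporary path, and the names `demo_df`, `demo_path` and `demo_tbl` are all hypothetical. `write_feather()` is used for writing on the assumption that its default (version 2) output is the Arrow IPC file format.

```r
library(arrow)

# Hypothetical stand-in for the voter data: one integer and one factor column.
demo_df <- data.frame(
  county_id = c(1L, 1L, 2L),
  status_cd = factor(c("A", "R", "A"))
)

# Write to a temporary file, then read it back without converting to R.
demo_path <- tempfile(fileext = ".arrow")
write_feather(demo_df, demo_path)
demo_tbl <- read_ipc_file(demo_path, as_data_frame = FALSE)

class(demo_tbl)
demo_tbl$schema      # the factor column is stored as a dictionary type
demo_tbl$num_rows
```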

Many of the dplyr functions have methods for this type of table. Their result stays on the Arrow side, so you would usually apply `as.data.frame` to it before printing.

In [5]:
as.data.frame(dplyr::count(tbl, status_cd, voter_status_desc, sort=TRUE))

  status_cd voter_status_desc       n
1         A            ACTIVE 7089338
2         R           REMOVED 1100070
3         I          INACTIVE  401159
4         D            DENIED  173591
5         S         TEMPORARY   14427
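
An alternative to `as.data.frame` is `dplyr::collect`, which pulls the result into R as a tibble. The sketch below uses a made-up three-row table (`demo_votes`) rather than the voter file.

```r
library(arrow)
library(dplyr)

# Hypothetical toy table standing in for the voter data.
demo_votes <- arrow_table(status_cd = c("A", "A", "R"))

# The count is set up against the Arrow table ...
demo_query <- count(demo_votes, status_cd, sort = TRUE)

# ... and collect() brings the (small) result into R.
demo_result <- collect(demo_query)
demo_result
```

Only the summarised rows are copied into R's representation, not the full table.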