Overhaul type serialization/deserialization machinery #156
Codecov Report
@@ Coverage Diff @@
## main #156 +/- ##
==========================================
+ Coverage 84.50% 85.58% +1.07%
==========================================
Files 23 23
Lines 2769 2865 +96
==========================================
+ Hits 2340 2452 +112
+ Misses 429 413 -16
If I get a chance in the next few days, I'll try this out over in beacon-biosignals/Onda.jl#68 (my initial use case) and report back.
src/arrowtypes.jl
Outdated
A few `ArrowKind`s allow slightly more custom overloads for their `fromarrow` methods:
* `ListKind{true}`: for `String` types, they may overload `fromarrow(::Type{T}, ptr::Ptr{UInt8}, len::Int) = ...` to avoid
  materializing a `String`
* `StructKind`: may overload `fromarrow(::Type{T}, x::NamedTuple)`, or `fromarrow(::Type{T}; kw...)` (values passed as
I have a `fromarrow(::Type{T}, x::NamedTuple)` method but it doesn't seem to be getting called:
julia> include("examples/tour.jl"); # run from root of Onda directory, generates some test state
julia> infos = [eeg.info, eeg.info, eeg.info, eeg.info]; # dummy data
julia> using Arrow
julia> tbl = Arrow.Table(Arrow.tobuffer((x=infos,)))
Arrow.Table: (x = SamplesInfo{String,Array{String,1},String,Float64,Int64,var"#s89",Float64} where var"#s89"<:Union{Float32, Float64, Int16, Int32, Int64, Int8, UInt16, UInt32, UInt64, UInt8}[Error showing value of type Arrow.Table:
ERROR: MethodError: no method matching SamplesInfo{String,Array{String,1},String,Float64,Int64,var"#s89",Float64} where var"#s89"<:Union{Float32, Float64, Int16, Int32, Int64, Int8, UInt16, UInt32, UInt64, UInt8}(::String, ::Array{String,1}, ::String, ::Float64, ::Int64, ::String, ::Float64)
Stacktrace:
[1] fromarrow(::Type{SamplesInfo{String,Array{String,1},String,Float64,Int64,var"#s89",Float64} where var"#s89"<:Union{Float32, Float64, Int16, Int32, Int64, Int8, UInt16, UInt32, UInt64, UInt8}}, ::String, ::Array{String,1}, ::String, ::Vararg{Any,N} where N) at /Users/jarrettrevels/.julia/dev/Arrow/src/arrowtypes.jl:134
[2] getindex at /Users/jarrettrevels/.julia/dev/Arrow/src/arraytypes/struct.jl:47 [inlined]
[3] isassigned(::Arrow.Struct{SamplesInfo{String,Array{String,1},String,Float64,Int64,var"#s89",Float64} where var"#s89"<:Union{Float32, Float64, Int16, Int32, Int64, Int8, UInt16, UInt32, UInt64, UInt8},Tuple{Arrow.List{String,Int32,Array{UInt8,1}},Arrow.List{Array{String,1},Int32,Arrow.List{String,Int32,Array{UInt8,1}}},Arrow.List{String,Int32,Array{UInt8,1}},Arrow.Primitive{Float64,Array{Float64,1}},Arrow.Primitive{Int64,Array{Int64,1}},Arrow.List{String,Int32,Array{UInt8,1}},Arrow.Primitive{Float64,Array{Float64,1}}}}, ::Int64) at ./abstractarray.jl:408
[4] show_delim_array(::IOContext{REPL.Terminals.TTYTerminal}, ::Arrow.Struct{SamplesInfo{String,Array{String,1},String,Float64,Int64,var"#s89",Float64} where var"#s89"<:Union{Float32, Float64, Int16, Int32, Int64, Int8, UInt16, UInt32, UInt64, UInt8},Tuple{Arrow.List{String,Int32,Array{UInt8,1}},Arrow.List{Array{String,1},Int32,Arrow.List{String,Int32,Array{UInt8,1}}},Arrow.List{String,Int32,Array{UInt8,1}},Arrow.Primitive{Float64,Array{Float64,1}},Arrow.Primitive{Int64,Array{Int64,1}},Arrow.List{String,Int32,Array{UInt8,1}},Arrow.Primitive{Float64,Array{Float64,1}}}}, ::Char, ::String, ::Char, ::Bool, ::Int64, ::Int64) at ./show.jl:740
[5] show_delim_array at ./show.jl:733 [inlined]
[6] show_vector(::IOContext{REPL.Terminals.TTYTerminal}, ::Arrow.Struct{SamplesInfo{String,Array{String,1},String,Float64,Int64,var"#s89",Float64} where var"#s89"<:Union{Float32, Float64, Int16, Int32, Int64, Int8, UInt16, UInt32, UInt64, UInt8},Tuple{Arrow.List{String,Int32,Array{UInt8,1}},Arrow.List{Array{String,1},Int32,Arrow.List{String,Int32,Array{UInt8,1}}},Arrow.List{String,Int32,Array{UInt8,1}},Arrow.Primitive{Float64,Array{Float64,1}},Arrow.Primitive{Int64,Array{Int64,1}},Arrow.List{String,Int32,Array{UInt8,1}},Arrow.Primitive{Float64,Array{Float64,1}}}}, ::Char, ::Char) at ./arrayshow.jl:476
[7] show_vector at ./arrayshow.jl:461 [inlined]
[8] show at ./arrayshow.jl:432 [inlined]
[9] show(::IOContext{REPL.Terminals.TTYTerminal}, ::NamedTuple{(:x,),Tuple{Arrow.Struct{SamplesInfo{String,Array{String,1},String,Float64,Int64,var"#s89",Float64} where var"#s89"<:Union{Float32, Float64, Int16, Int32, Int64, Int8, UInt16, UInt32, UInt64, UInt8},Tuple{Arrow.List{String,Int32,Array{UInt8,1}},Arrow.List{Array{String,1},Int32,Arrow.List{String,Int32,Array{UInt8,1}}},Arrow.List{String,Int32,Array{UInt8,1}},Arrow.Primitive{Float64,Array{Float64,1}},Arrow.Primitive{Int64,Array{Int64,1}},Arrow.List{String,Int32,Array{UInt8,1}},Arrow.Primitive{Float64,Array{Float64,1}}}}}}) at ./namedtuple.jl:150
[10] show(::IOContext{REPL.Terminals.TTYTerminal}, ::Arrow.Table) at /Users/jarrettrevels/.julia/dev/Tables/src/Tables.jl:196
[11] show(::IOContext{REPL.Terminals.TTYTerminal}, ::MIME{Symbol("text/plain")}, ::Arrow.Table) at ./multimedia.jl:47
[12] display(::REPL.REPLDisplay, ::MIME{Symbol("text/plain")}, ::Any) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:214
[13] display(::REPL.REPLDisplay, ::Any) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:218
[14] display(::Any) at ./multimedia.jl:328
[15] #invokelatest#1 at ./essentials.jl:710 [inlined]
[16] invokelatest at ./essentials.jl:709 [inlined]
[17] print_response(::IO, ::Any, ::Bool, ::Bool, ::Any) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:238
[18] print_response(::REPL.AbstractREPL, ::Any, ::Bool, ::Bool) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:223
[19] (::REPL.var"#do_respond#54"{Bool,Bool,REPL.var"#64#73"{REPL.LineEditREPL,REPL.REPLHistoryProvider},REPL.LineEditREPL,REPL.LineEdit.Prompt})(::Any, ::Any, ::Any) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:822
[20] #invokelatest#1 at ./essentials.jl:710 [inlined]
[21] invokelatest at ./essentials.jl:709 [inlined]
[22] run_interface(::REPL.Terminals.TextTerminal, ::REPL.LineEdit.ModalInterface, ::REPL.LineEdit.MIState) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/REPL/src/LineEdit.jl:2355
[23] run_frontend(::REPL.LineEditREPL, ::REPL.REPLBackendRef) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:1144
[24] (::REPL.var"#38#42"{REPL.LineEditREPL,REPL.REPLBackendRef})() at ./task.jl:356
Looks like it's splatting the arguments instead?
Yeah, I realized it would be quite a bit more inefficient to generate the full NamedTuple. My hope was that the compiler would optimize it out entirely, but I would need to comb over the code to see if that's even possible. We'd probably have to include the arrow type as another type parameter to the Struct array type. That might be a good idea anyways since I'm currently deep in the weeds on a bug where it gets pretty confusing whether the Struct eltype is Julia or arrow.
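
For readers following along, the difference between the two call styles can be sketched in plain Julia. This is a toy illustration only: `Info` and the local `fromarrow` below are hypothetical stand-ins, not the Arrow.jl implementation. It assumes a generic fallback that receives the field values as splatted positional arguments, which is what the stack trace above suggests.

```julia
# Hypothetical custom type, used only for this illustration.
struct Info
    name::String
    rate::Float64
end

# Generic fallback (assumed): field values arrive as positional arguments
# and are forwarded to the constructor.
fromarrow(::Type{T}, args...) where {T} = T(args...)

# User overload that expects all fields packed into a single NamedTuple.
fromarrow(::Type{Info}, x::NamedTuple) = Info(uppercase(x.name), x.rate)

fields = (name = "eeg", rate = 128.0)

# Splatting passes the values positionally, so the generic fallback runs
# and the NamedTuple overload is never reached.
splatted = fromarrow(Info, fields...)

# Passing the NamedTuple itself dispatches to the user overload.
packed = fromarrow(Info, fields)
```

In this sketch only `packed` goes through the custom method, which mirrors the report above that a `NamedTuple` overload is skipped when the caller splats the fields.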
Co-authored-by: Jarrett Revels <jarrettrevels@gmail.com>
Still to do:
    end
    return B, nodeidx, bufferidx
end

function build(f::Meta.Field, L::Meta.Null, batch, rb, de, nodeidx, bufferidx, convert)
    @debug 2 "building array: L = $L"
    return MissingVector(rb.nodes[nodeidx].length), nodeidx + 1, bufferidx
Oh goodness; somehow forgot to have the `+ 1` on my `nodeidx` here when I updated the code 😱. There goes an hour of my life.
all columns when serializing
…nsion type metadata that can be used in JuliaType
Ok, just pushed a new commit that allows us to properly support cases like #135. The wrinkle there is the custom type In the arrow spec about extension types, it also officially supports setting the
Ok, I've updated some docs + the manual and added some more tests. I think this is ready to go. I'll leave it open for another day or two for others to try things out if they want, but then I think we're ready for a merge + 1.3 release.
* Start work on overhauling type serialization architecture
* More work; serialization is pretty much done but not tested
* fix timetype ArrowTypes definitions
* more work to get tests passing
* get tests passing?
* fix
* Fix apache#75 by supporting Set serialization/deserialization
* Fix apache#85 by supporting tuple serialization/deserialization
* Lots of cleanup
* few more fixes
* Update src/arrowtypes.jl Co-authored-by: Jarrett Revels <jarrettrevels@gmail.com>
* Update src/arrowtypes.jl Co-authored-by: Jarrett Revels <jarrettrevels@gmail.com>
* fix NullKind reading
* Fix apache#134 by requiring concrete or union of concrete element types for all columns when serializing
* Add new ArrowTypes.arrowmetadata method for providing additional extension type metadata that can be used in JuliaType
* Update manual
* tests

Co-authored-by: Jarrett Revels <jarrettrevels@gmail.com>
Ok, this PR is in response to issues like #135, #134, #132, #88, #85, and a few conversations on slack. In short, the automatic registering of `struct` types is pretty smelly, we're not set up to handle parametric cases very well, and the usage of Dicts internally is fairly un-Julian. We were also pretty inconsistent internally around how things were handled; some types were baked into the `registertype!` machinery (Symbol, Char), some had custom `ArrowType` definitions (UUID), and some were baked in even further to the array conversion code (all the time types). The proposal in this PR is roughly as follows, though I encourage anyone interested to read through the documented interface methods in the arrowtypes.jl file:

* `ArrowType` to `ArrowKind`; it was getting all too muddled together that there is a specific set of native types that arrow supports, and those types all fall under a smaller set of physical layout configurations; the former is now tied to a "new" `ArrowType`, while the latter is now `ArrowKind`. Most custom types don't need to worry about overloading `ArrowKind`, because we define them generally on abstract types in the `ArrowTypes` module already (like `AbstractArray`, `AbstractDict`, etc.). Custom types will more often overload `ArrowType`, which maps a user's custom type to a natively supported arrow type; users should note, however, that custom structs defined like `struct` or `mutable struct` are natively supported for serialization, but without additional definitions (`arrowname`, `JuliaType`), there is no automatic deserialization
* If I have an `ArrowType` definition for my type, then I'm required to also define an `ArrowTypes.toarrow(x::T)` method that converts my type to the native arrow type I defined in `ArrowType`; this provides a serialization "hook" to do any desired transformation; note, however, that defining `ArrowTypes.toarrow` doesn't require having an `ArrowType` definition; I may want to simply ignore a field or two in my custom struct, which I can do by defining `toarrow` and dropping the fields by returning a pared down struct or NamedTuple
* `ArrowTypes.arrowname(T) => Symbol`, and `ArrowTypes.JuliaType(::Val{Symbol(name)}, S) = T`. The first takes my custom type as sole argument, and returns a `Symbol` of what my custom type will have stored in the column schema metadata. The latter definition will overload a `Val` wrapping my symbol name from `arrowname` (a not-uncommon Julia technique for value dispatch), and the native arrow serialized type as 2nd argument (`S`). These two definitions allow the "tagging" of my custom type during serialization (`arrowname`) and the conversion from native arrow type to my custom type during deserialization (`JuliaType`). In `JuliaType`, having the native arrow type as 2nd argument allows for parametric types to define a single `arrowname`, and rely on the serialized type to re-parameterize their custom type. Custom types may also choose to just include parameters in their `arrowname` definition; as long as a corresponding `JuliaType` definition exists that exactly matches the `Val` argument, all should be well
* `ArrowTypes.fromarrow`, which provides a deserialization "hook". It uses the result of the `JuliaType` definition to provide the native arrow objects as arguments and custom types can then "construct themselves" appropriately.

All in all, I like the power and flexibility of the new system. I think it follows more traditional Julia dispatch patterns, while providing enough customizability to cover necessary use-cases.
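
As a rough illustration of how the pieces in the list above fit together, here is a self-contained toy sketch in plain Julia. The functions below are local stand-ins that only mirror the names and argument shapes described in this PR; they are not the `ArrowTypes` methods themselves, and `Meters` and `roundtrip` are hypothetical names invented for the example.

```julia
# Hypothetical custom type, used only for this illustration.
struct Meters
    value::Float64
end

# Serialization side (local stand-ins, not the real ArrowTypes functions):
ArrowType(::Type{Meters}) = Float64   # map the custom type to a native type
toarrow(x::Meters) = x.value          # serialization "hook"
arrowname(::Type{Meters}) = :Meters   # tag that would go in column metadata

# Deserialization side: value dispatch on the tag via Val, with the
# serialized native type S as the second argument.
JuliaType(::Val{:Meters}, ::Type{S}) where {S} = Meters
fromarrow(::Type{Meters}, x::Float64) = Meters(x)  # deserialization "hook"

# Toy round trip: "write" native values plus a tag, then "read" them back.
function roundtrip(xs::Vector{T}) where {T}
    tag = arrowname(T)
    stored = ArrowType(T)[toarrow(x) for x in xs]
    J = JuliaType(Val(tag), eltype(stored))  # recover the Julia type from the tag
    return [fromarrow(J, s) for s in stored]
end

result = roundtrip([Meters(1.5), Meters(2.0)])
```

The point of the sketch is the division of labour: `arrowname` and `JuliaType` carry the type identity through the tag, while `toarrow` and `fromarrow` carry the values.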

The current test suite passes for me locally, but I believe I have run into an occasional race condition when reading. I'm going to look more into that. I purposely changed as little as possible in the test suite because I wanted this PR to be as non-breaking as possible. I don't think I'm aware of anyone who was using/defining things with `ArrowTypes.ArrowType` or other internals, which weren't really documented anyway, because I kind of knew it would need an overhaul at some point. I've kept the `registertype!` machinery in place for now to avoid breaking things too much.

I'd appreciate any feedback/concerns if people have them. My plan as of now is to go through all the type-related issues and ensure we can now handle the requests, fix any bugs, and add additional tests around some of the new functionality.