Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overhaul type serialization/deserialization machinery #156

Merged
merged 17 commits into from
Mar 29, 2021
Merged

Conversation

quinnj
Copy link
Member

@quinnj quinnj commented Mar 24, 2021

Ok, this PR is in response to issues like #135, #134, #132, #88, #85, and a few conversations on slack. In short, the automatic registering of struct types is pretty smelly, we're not set up to handle parametric cases very well, and the usage of Dicts internally is fairly un-Julian. We were also pretty inconsistent internally around how things were handled; some types were baked into the registertype! machinery (Symbol, Char), some had custom ArrowType definitions (UUID), and some were baked in even further to the array conversion code (all the time types). The proposal in this PR is roughly as follows, though I encourage anyone interested to read through the documented interface methods in the arrowtypes.jl file:

  • Rename ArrowType to ArrowKind; it was getting all too muddled together that there is a specific set of native types that arrow support, and those types all fall under a smaller set of physical layout configurations; the former is now tied to a "new" ArrowType, while the latter is now ArrowKind. Most custom types don't need to worry about overloading ArrowKind, because we define them generally on abstract types in the ArrowTypes module already (like AbstractArray, AbstractDict, etc.). Custom types will more often overload ArrowType, which maps a user's custom type to a natively supported arrow type; users should note that custom structs defined like struct or mutable struct, however, are natively supported for serialization, but without additional definitions (arrowname, JuliaType), there is no automatic deserialization
  • If I define a custom ArrowType definition for my type, then I'm required to also define an ArrowTypes.toarrow(x::T) method that converts my type to the native arrow type I defined in ArrowType; this provides a serialization "hook" to do any desired transformation; note however, that defining ArrowTypes.toarrow doesn't require having an ArrowType definition; I may want to simply ignore a field or two in my custom struct, which I can do by defining toarrow and dropping the fields by returning a pared down struct or NamedTuple
  • If I want my custom type to be deserializable, I need to implement two methods: ArrowTypes.arrowname(T) => Symbol, and ArrowTypes.JuliaType(::Val{Symbol(name)}, S) = T. The first takes my custom type as sole argument, and returns a Symbol of what my custom type will have stored in the column schema metadata. The latter definition will overload a Val wrapping my symbol name from arrowname (a not-uncommon Julia technique for value dispatch), and the native arrow serialized type as 2nd argument (S). These two definitions allow the "tagging" of my custom type during serialization (arrowname) and the conversion from native arrow type to my custom type during deserialization (JuliaType). In JuliaType, having the native arrow type as 2nd argument allows for parametric types to define a single arrowname, and rely on the serialized type to re-parameterize their custom type. Custom types may also choose to just include parameters in their arrowname definition; as long as a corresponding JuliaType definition exists that exactly matches Val argument, all should be well
  • The final new interface method is ArrowTypes.fromarrow, which provides a deserialization "hook". It uses the result of JuliaType definition to provide the native arrow objects as arguments and custom types can then "construct themselves" appropriately.

All in all, I like the power and flexibility of the new system. I think it follows more traditional Julia dispatch patterns, while providing enough customizability to cover necessary use-cases.

The current test suite passes for me locally, but I have run into an occasional race-condition when reading I believe. I'm going to look more into that. I purposely changed as little as possible in the test suite because I wanted this PR to be as non-breaking as possible. I don't think I'm aware of anyone who was using/defining things with ArrowTypes.ArrowType or other internals, which weren't really documented anyway, because I kind of knew it would need an overhaul at some point. I've kept the registertype!` machinery in place for now to avoid breaking things too much.

I'd appreciate any feedback/concerns if people have them. My plan as of now is to go through all the type-related issues and ensure we can now handle the requests, fix any bugs, and add additional tests around some of the new functionality.

@codecov
Copy link

codecov bot commented Mar 24, 2021

Codecov Report

Merging #156 (e35648a) into main (2cacbe5) will increase coverage by 1.07%.
The diff coverage is 91.25%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #156      +/-   ##
==========================================
+ Coverage   84.50%   85.58%   +1.07%     
==========================================
  Files          23       23              
  Lines        2769     2865      +96     
==========================================
+ Hits         2340     2452     +112     
+ Misses        429      413      -16     
Impacted Files Coverage Δ
src/Arrow.jl 54.54% <ø> (-1.98%) ⬇️
src/arraytypes/bool.jl 86.66% <66.66%> (ø)
src/arraytypes/primitive.jl 82.00% <66.66%> (ø)
src/write.jl 95.69% <66.66%> (+0.01%) ⬆️
src/arraytypes/map.jl 96.96% <71.42%> (+0.09%) ⬆️
src/arraytypes/fixedsizelist.jl 82.79% <80.00%> (-5.85%) ⬇️
src/arraytypes/struct.jl 96.72% <84.61%> (+11.42%) ⬆️
src/arraytypes/unions.jl 93.93% <90.00%> (+11.06%) ⬆️
src/arrowtypes.jl 91.66% <90.74%> (+1.66%) ⬆️
src/arraytypes/list.jl 90.75% <94.73%> (+0.57%) ⬆️
... and 13 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2cacbe5...e35648a. Read the comment docs.

@jrevels
Copy link
Contributor

jrevels commented Mar 24, 2021

if I get a chance in the next few days I'll try this out over in beacon-biosignals/Onda.jl#68 (my initial use case) and report back

src/arrowtypes.jl Outdated Show resolved Hide resolved
src/arrowtypes.jl Outdated Show resolved Hide resolved
A few `ArrowKind`s allow slightly more custom overloads for their `fromarrow` methods:
* `ListKind{true}`: for `String` types, they may overload `fromarrow(::Type{T}, ptr::Ptr{UInt8}, len::Int) = ...` to avoid
materializing a `String`
* `StructKind`: may overload `fromarrow(::Type{T}, x::NamedTuple)`, or `fromarrow(::Type{T}; kw...)` (values passed as
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

over in https://github.com/beacon-biosignals/Onda.jl/pull/68/files#diff-4c48d3b8e6e0b82d379c756b160263e0a23aabbb82a4f1fadfbbf02f35b0814cR304

I have a fromarrow(::Type{T}, x::NamedTuple) method but it doesn't seem to be getting called:

julia> include("examples/tour.jl"); # run from root of Onda directory, generates some test state

julia> infos = [eeg.info, eeg.info, eeg.info, eeg.info]; # dummy data

julia> using Arrow

julia> tbl = Arrow.Table(Arrow.tobuffer((x=infos,)))
Arrow.Table: (x = SamplesInfo{String,Array{String,1},String,Float64,Int64,var"#s89",Float64} where var"#s89"<:Union{Float32, Float64, Int16, Int32, Int64, Int8, UInt16, UInt32, UInt64, UInt8}[Error showing value of type Arrow.Table:
ERROR: MethodError: no method matching SamplesInfo{String,Array{String,1},String,Float64,Int64,var"#s89",Float64} where var"#s89"<:Union{Float32, Float64, Int16, Int32, Int64, Int8, UInt16, UInt32, UInt64, UInt8}(::String, ::Array{String,1}, ::String, ::Float64, ::Int64, ::String, ::Float64)
Stacktrace:
 [1] fromarrow(::Type{SamplesInfo{String,Array{String,1},String,Float64,Int64,var"#s89",Float64} where var"#s89"<:Union{Float32, Float64, Int16, Int32, Int64, Int8, UInt16, UInt32, UInt64, UInt8}}, ::String, ::Array{String,1}, ::String, ::Vararg{Any,N} where N) at /Users/jarrettrevels/.julia/dev/Arrow/src/arrowtypes.jl:134
 [2] getindex at /Users/jarrettrevels/.julia/dev/Arrow/src/arraytypes/struct.jl:47 [inlined]
 [3] isassigned(::Arrow.Struct{SamplesInfo{String,Array{String,1},String,Float64,Int64,var"#s89",Float64} where var"#s89"<:Union{Float32, Float64, Int16, Int32, Int64, Int8, UInt16, UInt32, UInt64, UInt8},Tuple{Arrow.List{String,Int32,Array{UInt8,1}},Arrow.List{Array{String,1},Int32,Arrow.List{String,Int32,Array{UInt8,1}}},Arrow.List{String,Int32,Array{UInt8,1}},Arrow.Primitive{Float64,Array{Float64,1}},Arrow.Primitive{Int64,Array{Int64,1}},Arrow.List{String,Int32,Array{UInt8,1}},Arrow.Primitive{Float64,Array{Float64,1}}}}, ::Int64) at ./abstractarray.jl:408
 [4] show_delim_array(::IOContext{REPL.Terminals.TTYTerminal}, ::Arrow.Struct{SamplesInfo{String,Array{String,1},String,Float64,Int64,var"#s89",Float64} where var"#s89"<:Union{Float32, Float64, Int16, Int32, Int64, Int8, UInt16, UInt32, UInt64, UInt8},Tuple{Arrow.List{String,Int32,Array{UInt8,1}},Arrow.List{Array{String,1},Int32,Arrow.List{String,Int32,Array{UInt8,1}}},Arrow.List{String,Int32,Array{UInt8,1}},Arrow.Primitive{Float64,Array{Float64,1}},Arrow.Primitive{Int64,Array{Int64,1}},Arrow.List{String,Int32,Array{UInt8,1}},Arrow.Primitive{Float64,Array{Float64,1}}}}, ::Char, ::String, ::Char, ::Bool, ::Int64, ::Int64) at ./show.jl:740
 [5] show_delim_array at ./show.jl:733 [inlined]
 [6] show_vector(::IOContext{REPL.Terminals.TTYTerminal}, ::Arrow.Struct{SamplesInfo{String,Array{String,1},String,Float64,Int64,var"#s89",Float64} where var"#s89"<:Union{Float32, Float64, Int16, Int32, Int64, Int8, UInt16, UInt32, UInt64, UInt8},Tuple{Arrow.List{String,Int32,Array{UInt8,1}},Arrow.List{Array{String,1},Int32,Arrow.List{String,Int32,Array{UInt8,1}}},Arrow.List{String,Int32,Array{UInt8,1}},Arrow.Primitive{Float64,Array{Float64,1}},Arrow.Primitive{Int64,Array{Int64,1}},Arrow.List{String,Int32,Array{UInt8,1}},Arrow.Primitive{Float64,Array{Float64,1}}}}, ::Char, ::Char) at ./arrayshow.jl:476
 [7] show_vector at ./arrayshow.jl:461 [inlined]
 [8] show at ./arrayshow.jl:432 [inlined]
 [9] show(::IOContext{REPL.Terminals.TTYTerminal}, ::NamedTuple{(:x,),Tuple{Arrow.Struct{SamplesInfo{String,Array{String,1},String,Float64,Int64,var"#s89",Float64} where var"#s89"<:Union{Float32, Float64, Int16, Int32, Int64, Int8, UInt16, UInt32, UInt64, UInt8},Tuple{Arrow.List{String,Int32,Array{UInt8,1}},Arrow.List{Array{String,1},Int32,Arrow.List{String,Int32,Array{UInt8,1}}},Arrow.List{String,Int32,Array{UInt8,1}},Arrow.Primitive{Float64,Array{Float64,1}},Arrow.Primitive{Int64,Array{Int64,1}},Arrow.List{String,Int32,Array{UInt8,1}},Arrow.Primitive{Float64,Array{Float64,1}}}}}}) at ./namedtuple.jl:150
 [10] show(::IOContext{REPL.Terminals.TTYTerminal}, ::Arrow.Table) at /Users/jarrettrevels/.julia/dev/Tables/src/Tables.jl:196
 [11] show(::IOContext{REPL.Terminals.TTYTerminal}, ::MIME{Symbol("text/plain")}, ::Arrow.Table) at ./multimedia.jl:47
 [12] display(::REPL.REPLDisplay, ::MIME{Symbol("text/plain")}, ::Any) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:214
 [13] display(::REPL.REPLDisplay, ::Any) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:218
 [14] display(::Any) at ./multimedia.jl:328
 [15] #invokelatest#1 at ./essentials.jl:710 [inlined]
 [16] invokelatest at ./essentials.jl:709 [inlined]
 [17] print_response(::IO, ::Any, ::Bool, ::Bool, ::Any) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:238
 [18] print_response(::REPL.AbstractREPL, ::Any, ::Bool, ::Bool) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:223
 [19] (::REPL.var"#do_respond#54"{Bool,Bool,REPL.var"#64#73"{REPL.LineEditREPL,REPL.REPLHistoryProvider},REPL.LineEditREPL,REPL.LineEdit.Prompt})(::Any, ::Any, ::Any) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:822
 [20] #invokelatest#1 at ./essentials.jl:710 [inlined]
 [21] invokelatest at ./essentials.jl:709 [inlined]
 [22] run_interface(::REPL.Terminals.TextTerminal, ::REPL.LineEdit.ModalInterface, ::REPL.LineEdit.MIState) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/REPL/src/LineEdit.jl:2355
 [23] run_frontend(::REPL.LineEditREPL, ::REPL.REPLBackendRef) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:1144
 [24] (::REPL.var"#38#42"{REPL.LineEditREPL,REPL.REPLBackendRef})() at ./task.jl:356

Looks like it's splatting the arguments instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I realized it would be quite a bit more inefficient to generate the full NamedTuple. My hope was that the compiler would optimize it out entirely, but I would need to comb over the code to see if that's even possible. We'd probably have to include the arrow type as another type parameter to the Struct array type. That might be a good idea anyways since I'm currently deep in the weeds on a big where it gets pretty confusing if the Struct eltype is Julia or arrow.

quinnj and others added 4 commits March 24, 2021 23:47
Co-authored-by: Jarrett Revels <jarrettrevels@gmail.com>
Co-authored-by: Jarrett Revels <jarrettrevels@gmail.com>
@quinnj
Copy link
Member Author

quinnj commented Mar 25, 2021

Still to do:

  • Still working on supporting unsupported ARROW:extension:name type: "JuliaLang.Nothing" #132; it's a doozy and that's really good because it's really stress testing all the type serialization stuff; there's some kind of bug either writing or reading the lines_of_code and contributors fields that's causing things to get messed up. It's been tricky to minimally reproduce, so I'll have to spend some more time on it tomorrow
  • Another pass over docs; things have evolved a bit and I still need to update the proper manual on all the new custom types stuff
  • More tests; I'm still working through open issues making sure we fix/support everything, but I'd like to do some additional unit tests on top of that as well

end
return B, nodeidx, bufferidx
end

function build(f::Meta.Field, L::Meta.Null, batch, rb, de, nodeidx, bufferidx, convert)
@debug 2 "building array: L = $L"
return MissingVector(rb.nodes[nodeidx].length), nodeidx + 1, bufferidx
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh goodness; somehow forgot to have the + 1 on my nodeidx here when I updated the code 😱 . There goes an hour of my life.

…nsion type metadata htat can be used in JuliaType
@quinnj
Copy link
Member Author

quinnj commented Mar 25, 2021

Ok, just pushed a new commit that allows us to properly support cases like #135. The wrinkle there is the custom type Interval has non-field-inferred/based type parameters. You could potentially support that by just having separate ArrowTypes.arrowname/ArrowTypes.JuliaType definitions for different type parameter combinations. That's probably reasonable if the # of unique type parameters is low, but otherwise, becomes impractical pretty quickly.

In the arrow spec about extension types, it also officially supports setting the "ARROW:extension:metadata" key with a string value to provide additional extension type metadata needed for recreating the original logical type. That seems to fit pretty well with our need, so I've added an optional ArrowTypes.arrowmetadata(T) method users can overload to provide any additional type information necessary when deserializing. This is then available by overloading ArrowTypes.JuliaType(::Val{Symbol(name)}, S, arrowmetadata) and using arrowmetadata however necessary to recreate the types. I gave an example of this for Intervals here.

@quinnj
Copy link
Member Author

quinnj commented Mar 26, 2021

Ok, I've updated some docs + the manual and added some more tests. I think this is ready to go. I'll leave it open for another day or two for others to try things out if they want, but then I think we're ready for a merge + 1.3 release.

@quinnj quinnj merged commit ff53d13 into main Mar 29, 2021
@quinnj quinnj deleted the jq/newarrowtypes branch March 29, 2021 13:22
tanmaykm pushed a commit to tanmaykm/arrow-julia that referenced this pull request Apr 7, 2021
* Start work on overhauling type serialization architecture

* More work; serialization is pretty much done but not tested

* fix timetype ArrowTypes definitions

* more work to get tests passing

* get tests passing?

* fix

* Fix apache#75 by supporting Set serialization/deserialization

* Fix apache#85 by supporting tuple serialization/deserialization

* Lots of cleanup

* few more fixes

* Update src/arrowtypes.jl

Co-authored-by: Jarrett Revels <jarrettrevels@gmail.com>

* Update src/arrowtypes.jl

Co-authored-by: Jarrett Revels <jarrettrevels@gmail.com>

* fix NullKind reading

* Fix apache#134 by requiring concrete or union of concrete element types for
all columns when serializing

* Add new ArrowTypes.arrowmetadata method for providing additional extension type metadata htat can be used in JuliaType

* Update manual

* tests

Co-authored-by: Jarrett Revels <jarrettrevels@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants