Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enable field-order-agnostic overloads of fromarrow for struct types #493

Merged
merged 11 commits into from
Dec 5, 2023

Conversation

jrevels
Copy link
Contributor

@jrevels jrevels commented Dec 3, 2023

Motivated by beacon-biosignals/Legolas.jl#94 (comment)

Still requires:

@jrevels jrevels marked this pull request as draft December 3, 2023 22:39
@codecov-commenter
Copy link

codecov-commenter commented Dec 3, 2023

Codecov Report

Attention: 3 lines in your changes are missing coverage. Please review.

Comparison is base (787768f) 87.45% compared to head (465025c) 87.37%.

Files Patch % Lines
src/arraytypes/struct.jl 81.81% 2 Missing ⚠️
src/ArrowTypes/src/ArrowTypes.jl 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #493      +/-   ##
==========================================
- Coverage   87.45%   87.37%   -0.08%     
==========================================
  Files          26       26              
  Lines        3283     3288       +5     
==========================================
+ Hits         2871     2873       +2     
- Misses        412      415       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jrevels jrevels marked this pull request as ready for review December 4, 2023 17:44
@@ -33,23 +33,33 @@ isnamedtuple(T) = false
istuple(::Type{<:Tuple}) = true
istuple(T) = false

@propagate_inbounds function Base.getindex(s::Struct{T,S}, i::Integer) where {T,S}
if isdefined(ArrowTypes, :StructElement)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went with this approach to avoid needing to change Arrow's compat bounds on ArrowTypes

@jrevels
Copy link
Contributor Author

jrevels commented Dec 4, 2023

Did some benchmarking, looks like there is a small perf hit here to the general case that needs to be mitigated before this can be merged:

using Arrow, ArrowTypes, Random, BenchmarkTools

struct Foo
    a::Int
    b::String
    c::Vector{String}
    d::Float64
end

ArrowTypes.arrowname(::Type{Foo}) = Symbol("JuliaLang.Foo")
ArrowTypes.JuliaType(::Val{Symbol("JuliaLang.Foo")}, T) = Foo

genfoo() = Foo(rand(1:10), randstring(10), [randstring(10) for _ in 1:rand(2:5)], rand())
t = (; f = [genfoo() for _ in 1:1000])
f = Arrow.Table(Arrow.tobuffer(t)).f
@benchmark sum(x -> x.d, $f)

results on jrevels:jr/structelement

BenchmarkTools.Trial: 6891 samples with 1 evaluation.
 Range (min … max):  631.291 μs …   2.410 ms  ┊ GC (min … max):  0.00% … 70.30%
 Time  (median):     646.250 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   724.383 μs ± 299.599 μs  ┊ GC (mean ± σ):  10.04% ± 15.20%

  █▅▃▁                                                      ▂▂  ▁
  ████▆▆▅▄▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃███ █
  631 μs        Histogram: log(frequency) by time       1.95 ms <

 Memory estimate: 2.53 MiB, allocs estimate: 22470.

results on main

BenchmarkTools.Trial: 8118 samples with 1 evaluation.
 Range (min … max):  531.667 μs …   2.338 ms  ┊ GC (min … max): 0.00% … 73.71%
 Time  (median):     550.333 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   614.904 μs ± 266.473 μs  ┊ GC (mean ± σ):  9.82% ± 14.78%

  █▇▄▂                                                      ▁▂▁ ▁
  ████▇▆▄▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁███ █
  532 μs        Histogram: log(frequency) by time       1.78 ms <

 Memory estimate: 2.23 MiB, allocs estimate: 18482.

@jrevels
Copy link
Contributor Author

jrevels commented Dec 4, 2023

Alright, I changed the approach here after profiling different candidate approaches. Looks like leaving NamedTuple construction (or even just any intermediate struct construction) in the critical path here borks getindex performance, annoyingly, so ended up not going that route. Users of this feature can still construct NamedTuples themselves, if needed, but this new approach avoids doing so on the default path.

on this PR now:

BenchmarkTools.Trial: 8170 samples with 1 evaluation.
 Range (min  max):  530.500 μs    2.318 ms  ┊ GC (min  max): 0.00%  74.22%
 Time  (median):     547.833 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   610.966 μs ± 263.948 μs  ┊ GC (mean ± σ):  9.75% ± 14.73%

  █▆▃▁                                                      ▁▃▁ ▁
  ████▇▇▅▃▅▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄███ █
  530 μs        Histogram: log(frequency) by time       1.75 ms <

 Memory estimate: 2.22 MiB, allocs estimate: 18387.

Copy link
Member

@ericphanson ericphanson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I searched for Struct{ and found a usage in

function arrowtype(b, x::Struct{T,S}) where {T,S}
, does that need updating?

Generally looks good to me though.

bench.jl Outdated Show resolved Hide resolved
src/ArrowTypes/src/ArrowTypes.jl Outdated Show resolved Hide resolved
Copy link
Member

@ericphanson ericphanson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh also, the formatter wants this change, it looks like

diff --git a/test/runtests.jl b/test/runtests.jl
index 5210685..ed288b3 100644
--- a/test/runtests.jl
+++ b/test/runtests.jl
@@ -1027,7 +1027,11 @@ end
                 end
                 ArrowTypes.arrowname(::Type{Foo493}) = Symbol("JuliaLang.Foo493")
                 ArrowTypes.JuliaType(::Val{Symbol("JuliaLang.Foo493")}, T) = Foo493
-                function ArrowTypes.fromarrowstruct(::Type{Foo493}, ::Val{fnames}, x...) where {fnames}
+                function ArrowTypes.fromarrowstruct(
+                    ::Type{Foo493},
+                    ::Val{fnames},
+                    x...,
+                ) where {fnames}
                     nt = NamedTuple{fnames}(x)
                     return Foo493(nt.x + 1, nt.y + 1)
                 end

@jrevels
Copy link
Contributor Author

jrevels commented Dec 5, 2023

searched for Struct{ and found a usage in arrow-julia/src/eltypes.jl
does that need updating?

interesting. I think based on the docstring of arrowtype, this doesn't need to be updated / is working as intended:

Given a FlatBuffers.Builder and a Julia column or column eltype, Write the field.type flatbuffer definition of the eltype

if the intent of the function was to yield the actual flatbuffer definition of the arrow data stored in the provided column, then actually we'd consider this method buggy on main and this PR could fix it. but i don't think that's the intent, so i think we're good as-is

@jrevels
Copy link
Contributor Author

jrevels commented Dec 5, 2023

AFAICT this is good to merge - idk what the hanging "verify release" GHA check is doing but everything else is green 👍

@ericphanson ericphanson merged commit 3712291 into apache:main Dec 5, 2023
17 of 18 checks passed
@ericphanson
Copy link
Member

@kou would you be able to help initiate a release for v2.7.0?

@jrevels
Copy link
Contributor Author

jrevels commented Dec 5, 2023

thanks @ericphanson and @kou !

@jrevels jrevels deleted the jr/structelement branch December 5, 2023 18:50
@kou
Copy link
Member

kou commented Dec 5, 2023

Sure!
I'll do it.

@kou
Copy link
Member

kou commented Dec 5, 2023

@kou
Copy link
Member

kou commented Dec 9, 2023

@ericphanson Could you check my reply on dev@? https://lists.apache.org/thread/6ww80f7m16bob09g5g0nx0mj5jyckxnw
I think that we can release a new version once we resolve your problem.

@kou
Copy link
Member

kou commented Dec 10, 2023

Passed: https://lists.apache.org/thread/z6y8484wsyd7155cq2jsv4gm4ojpdv3m

@ericphanson
Copy link
Member

Awesome, and I see you registered both package as well. Thanks very much!

positional arguments; so if my custom type `Interval` has two fields `first` and `last`, then I'd overload like
`ArrowTypes.fromarrow(::Type{Interval}, first, last) = ...`. Note the default implementation is
`ArrowTypes.fromarrow(::Type{T}, x...) = T(x...)`, so if your type already accepts all arguments in a constructor
no additional `fromarrow` method should be necessary (default struct constructors have this behavior).
* Alternatively, may overload `fromarrowstruct(::Type{T}, ::Val{fnames}, x...)`, where `fnames` is a tuple of the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to be so slow in the review here; it might be worth adding another example the docs here like we have above, something like:

So taking the example from before, if my arrow data happens to have the `first` and `last` fields for my `Interval` type reversed, I could implement `fromarrowstruct(::Type{Interval}, ::Val{(last, first}), last, first) = Interval(first, last)`

I find concrete examples usually help drive home the core point of a new feature.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants