Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parametrized parser for speed #30

Open
asinghvi17 opened this issue Apr 11, 2024 · 20 comments
Open

Parametrized parser for speed #30

asinghvi17 opened this issue Apr 11, 2024 · 20 comments

Comments

@asinghvi17
Copy link
Collaborator

It would be great to have a parameterized parser object for speed. Things we could specialize on are:

  • Endianness (little->big endian is just bswap on the result Float64 or Int)
  • Geometry type: when parsing SQL query results, we often know that the column contains only one kind of geometry. In this case, it would be nice to be able to call a parser which only parses, e.g., points, or polygons, etc. This parameterized parser could simply error if the geometry was found to be of an incorrect type. This also guarantees return type and would probably make fast inner loops possible when parsing columns in GeoParquet and GeoPackage.
    • The generic parser would be type unstable and do all necessary type checking, similar to the current approach.
@rafaqz
Copy link
Collaborator

rafaqz commented Apr 11, 2024

We also know when it contains a small union, and a function with a conditional like this should allow both optimisations:

function wrap_wkb(::Type{Traits}, ::Val{HasZ}, ::Val{HasM}, ::Val{BigEndian}, wkb) where {Traits, HasZ, HasM, BigEndian}
    if PolygonTrait <: Traits && _get_wkb_typenum(wkb) == POLYGON_CODE
        return WKBPolygon{HasZ,HasM,BigEndian}(wkb)
    elseif MultiPolygonTrait <: Traits &&  _get_wkb_typenum(wkb) == MULTIPOLYGON_CODE
        return WKBMultiPolygon{HasZ,HasM,BigEndian}(wkb)
    elseif MultiPoint <: Traits &&  _get_wkb_typenum(wkb) == MULTIPOINT_CODE
        return WKBMultiPoint{HasZ,HasM,BigEndian}(wkb)
    ... # Repeat for all GeoInterface traits

    else
        error("type code not recognised")
    end
end

The compiler should be able to delete the code for branches not in the Traits type union.

The idea is this would be used on a vector/column of geometries after a function barrier where the types were worked out for the whole column before the function barrier. In GeoPacakage this can be done with union on the geometry type column as Integers, then a conversion to types. So one point of instability.

Nested geoms can use views into the wkb, maybe with some type of runtime information to deal with the fact some data is missing from child objects (like endianness) that is present if they're at the top layer.

This should be able to mean all the cost is at the top layer and wrap_wkb is essentially free wrappers over the wkb data. The cost will increase as the type union gets larger and we lose first stability and then small union optimisations.

(The trait types should probably go in a wrapper rather than a raw type union, as in GeometryOps, but I left it simple for clarity)

As you said @asinghvi17 we would just error in the nighmare datasets where e.g. endianness or HasZ changed down the column.

Then the type parameter BigEndian would decide if we need to lazily apply bswap as we return point coordinates. So it would have no cost if the conversion isn't needed.

@asinghvi17
Copy link
Collaborator Author

asinghvi17 commented Apr 11, 2024

That sounds good! Another thing - I am not sure if this is the case, but a quick rematerializer for linestrings using reinterpret may also be useful, since it would batch the work of point.

For example, this works:

julia> reinterpret(Tuple{Float64, Float64}, rand(UInt8, 32))
2-element reinterpret(Tuple{Float64, Float64}, ::Vector{UInt8}):
 (3.2263176797028075e170, 1.0680129991772126e-122)
 (-8.397436769436745e-141, 6.732599484892397e-141)

and we could do the same to get a quick reinterpretation of an entire linestring, rather than going point-by-point. I haven't benchmarked this for an entire linestring, but it's 13% slower (1.7ns vs 1.5ns) than indexing into a pre-constructed vector of tuples - which is not bad!

@rafaqz
Copy link
Collaborator

rafaqz commented Apr 11, 2024

Oh cool. Absolutely. Is it already stored like that in wkb?

Things like this is where knowing HasZ and HasM from the start will really help.

@evetion
Copy link
Owner

evetion commented Apr 11, 2024

You need to parse some stuff for the dimensions and the number of coordinates, but sure. I use reinterpret all over (in serializing mostly), so this would work fine.

@rafaqz
Copy link
Collaborator

rafaqz commented Apr 11, 2024

I think we want to do that outside the function barrier and then maybe check at each geometry that it matches, and either error or reinterpret

@asinghvi17
Copy link
Collaborator Author

At least for values of "maybe has M" or "maybe has Z" it would have to be dynamic, but when it's defined that the geometry definitely has M or Z, that approach should be fine AFAICT.

We would have to allow the user to choose whether to use a strict or dynamic parser in this case, since it's possible that people made flawed datasets.

@rafaqz
Copy link
Collaborator

rafaqz commented Apr 11, 2024

Totally. We could have parser=:strict as the default and error with a messge "Dataset has inconsistent foo, try theparser=:dynamic keyword"

@asinghvi17
Copy link
Collaborator Author

ls_as_wkb is a WKB of 300,000 points

# Parsing approach 2 - reduce to reinterpret
@benchmark collect(reinterpret(Tuple{Float64, Float64}, view($ls_as_wkb, (1+4+4+1):length($ls_as_wkb))))

# BenchmarkTools.Trial: 10000 samples with 334 evaluations.
#  Range (min … max):  258.111 ns …  3.375 μs  ┊ GC (min … max): 0.00% … 89.28%
#  Time  (median):     274.826 ns              ┊ GC (median):    0.00%
#  Time  (mean ± σ):   280.685 ns ± 97.172 ns  ┊ GC (mean ± σ):  1.11% ±  2.94%

#            ▁▄█▇▅▂  ▁ ▂▁
#   ▁▁▂▃▃▃▄▃▅███████▇████▆▆▄▄▄▄▃▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
#   258 ns          Histogram: frequency by time          326 ns <

#  Memory estimate: 160 bytes, allocs estimate: 4.

which is not bad! GFT.WellKnownBinary(GFT.Geom(), ls_as_wkb) |> GI.getpoint |> collect is still running as I write this.

@evetion
Copy link
Owner

evetion commented Apr 11, 2024

ls_as_wkb is a WKB of 300,000 points

Nice! Can we create a repo, or attach a few of these binary test/benchmark files to a release, so we can add this to CI, and test ourselves as well?

@asinghvi17
Copy link
Collaborator Author

asinghvi17 commented Apr 11, 2024

In this case it's as simple as:

import GeoFormatTypes as GFT, WellKnownGeometry as WKG, GeoInterface as GI

tups = tuple.(rand(300_000), rand(300_000))
geoms_as_points = GFT.val.(WKG.getwkb.(GI.Point.(tups)))
geoms_as_linestring = GFT.val(WKG.getwkb(GI.LineString(tups)))

and try the different parsing approaches from there.

# parse geoms as points, each WKB individually
@benchmark begin
    map($(geoms_as_points)) do geom
        only(reinterpret(Tuple{Float64, Float64}, view(geom, (5+1):length(geom)))) # this has to be parsed differently in EWKB and GeoPkg wkb
    end
end # median time 3ms v/s regular parsing 300ms

# parse the whole linestring
@benchmark collect(reinterpret(Tuple{Float64, Float64}, view($ls_as_wkb, (1+4+4+1):length($ls_as_wkb))))
# median time 290ns vs unknown (still running) 
# we could even drop collect here

(there was an actual test geopkg file for reproducible coordinates, but I figure that doesn't matter so much for parsing).

@asinghvi17
Copy link
Collaborator Author

Looks like GI.getpoint hangs because it's calling ncoord every point...
[2] getgeom(T::GeoInterface.LineStringTrait, geom::GeoFormatTypes.WellKnownBinary{GeoFormatTypes.Geom, Vector{UInt8}}, i::Int64)
@ WellKnownGeometry ~/.julia/packages/WellKnownGeometry/t1n94/src/wkb.jl:190

We could optimize that pretty easily no?

@evetion
Copy link
Owner

evetion commented Apr 11, 2024

If you know it's a line string, or know the dimension, yes, absolutely.

@evetion
Copy link
Owner

evetion commented Apr 11, 2024

import GeoFormatTypes as GFT, WellKnownGeometry as WKG, GeoInterface as GI

tups = tuple.(rand(300_000), rand(300_000))
geoms_as_points = GFT.val.(WKG.getwkb.(GI.Point.(tups)))
geoms_as_linestring = GFT.val(WKG.getwkb(GI.LineString(tups)))

It seems we haven't defined ncoord on things other than Point in GeoInterface? Did you fix this locally?

julia> WKG.getwkb(GI.LineString(tups))
ERROR: MethodError: no method matching ncoord(::GeoInterface.LineStringTrait, ::GeoInterface.Wrappers.LineString{false, false, Vector{Tuple{Float64, Float64}}, Nothing, Nothing})

Closest candidates are:
  ncoord(::Any)
   @ GeoInterface ~/.julia/packages/GeoInterface/tuF7K/src/interface.jl:104
  ncoord(::Union{GeoInterface.GeometryCollectionTrait, GeoInterface.LineStringTrait, GeoInterface.LinearRingTrait, GeoInterface.MultiLineStringTrait, GeoInterface.MultiPointTrait, GeoInterface.MultiPolygonTrait, GeoInterface.PointTrait, GeoInterface.PolygonTrait}, ::GeoFormatTypes.WellKnownBinary{GeoFormatTypes.Geom, <:AbstractVector{UInt8}})
   @ WellKnownGeometry ~/code/WellKnownGeometry/src/wkb.jl:164
  ncoord(::Union{GeoInterface.GeometryCollectionTrait, GeoInterface.LineStringTrait, GeoInterface.LinearRingTrait, GeoInterface.MultiLineStringTrait, GeoInterface.MultiPointTrait, GeoInterface.MultiPolygonTrait, GeoInterface.PointTrait, GeoInterface.PolygonTrait}, ::GeoFormatTypes.WellKnownText{GeoFormatTypes.Geom})
   @ WellKnownGeometry ~/code/WellKnownGeometry/src/wkt.jl:136

Stacktrace:
 [1] ncoord(geom::GeoInterface.Wrappers.LineString{false, false, Vector{Tuple{Float64, Float64}}, Nothing, Nothing})
   @ GeoInterface ~/.julia/packages/GeoInterface/tuF7K/src/interface.jl:104
 [2] _getwkb!(data::Vector{…}, type::GeoInterface.LineStringTrait, geom::GeoInterface.Wrappers.LineString{…}, first::Bool, repeat::Bool)
   @ WellKnownGeometry ~/code/WellKnownGeometry/src/wkb.jl:113
 [3] getwkb!
   @ ~/code/WellKnownGeometry/src/wkb.jl:129 [inlined]
 [4] getwkb(geom::GeoInterface.Wrappers.LineString{false, false, Vector{Tuple{Float64, Float64}}, Nothing, Nothing})
   @ WellKnownGeometry ~/code/WellKnownGeometry/src/wkb.jl:56
Some type information was truncated. Use `show(err)` to see complete types.

@asinghvi17
Copy link
Collaborator Author

Huh, no this just worked for me...let me try it in a fresh session!

@asinghvi17
Copy link
Collaborator Author

asinghvi17 commented Apr 11, 2024

No, I'm on latest released GeoInterface and WKG v0.2.2 (one patch version behind)

Edit: trying this on WKG v0.2.3 fails as you posted.

@evetion
Copy link
Owner

evetion commented Apr 11, 2024

Ouch, my bad then it seems. In 0.2.3 I call ncoord for the first time on non-point geoms it seems, to indeed check is3d/ismeasured stuff. GeoInterface.wrappers don't implement that yet, but it should (and we need to update our tests there too?).

@asinghvi17
Copy link
Collaborator Author

Monkeypatching

GI.ncoord(::GI.Wrappers.WrapperGeometry{false, false}) = 2
GI.ncoord(::GI.Wrappers.WrapperGeometry{false, true}) = 3
GI.ncoord(::GI.Wrappers.WrapperGeometry{true, false}) = 3
GI.ncoord(::GI.Wrappers.WrapperGeometry{true, true}) = 4

fixes it for now, but this probably needs a better check. Maybe GI.is3d and GI.ismeasured?

@evetion
Copy link
Owner

evetion commented Apr 11, 2024

ls_as_wkb is a WKB of 300,000 points

# Parsing approach 2 - reduce to reinterpret
@benchmark collect(reinterpret(Tuple{Float64, Float64}, view($ls_as_wkb, (1+4+4+1):length($ls_as_wkb))))

# BenchmarkTools.Trial: 10000 samples with 334 evaluations.
#  Range (min … max):  258.111 ns …  3.375 μs  ┊ GC (min … max): 0.00% … 89.28%
#  Time  (median):     274.826 ns              ┊ GC (median):    0.00%
#  Time  (mean ± σ):   280.685 ns ± 97.172 ns  ┊ GC (mean ± σ):  1.11% ±  2.94%

#            ▁▄█▇▅▂  ▁ ▂▁
#   ▁▁▂▃▃▃▄▃▅███████▇████▆▆▄▄▄▄▃▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
#   258 ns          Histogram: frequency by time          326 ns <

#  Memory estimate: 160 bytes, allocs estimate: 4.

which is not bad! GFT.WellKnownBinary(GFT.Geom(), ls_as_wkb) |> GI.getpoint |> collect is still running as I write this.

For what it's worth, getpoint, gets you a point, as in creating a new WKB Point. You probably want coordinates instead, if it is written roughly like this:

function GI.coordinates(
    T::GI.LineStringTrait,
    geom::WKBtype,
)
    size = headersize + numsize
    ncoord = GI.ncoord(geom)
    offset = typesize(wkbsubtype(T), geom[size+1:end], ncoord)
    reinterpret(NTuple{ncoord,Float64}, @view geom.val[size+1:size+offset*GI.ngeom(geom)])
end

Which while still a bit unstable with ncoord, is already doable with 8.958 μs (15 allocations: 469.34 KiB).

@rafaqz
Copy link
Collaborator

rafaqz commented Apr 11, 2024

@asinghvi17 do you want to PR your patch to GI? I must have forgotten to add it ( or maybe we never had that, but we should)

@evetion does getpoint have to return a WKB Point? Why not a Tuple ? coordinates is not generic because it often allocates.

I thin we need getpoint(geom) to always be super fast or there is no generic fast way to get points.

@evetion
Copy link
Owner

evetion commented Apr 11, 2024

Yeah, it probably needs to be a tuple. And if the type has noop ncoord it should be pretty fast, even without refactoring it everywhere.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants