specialize offsets map #6
danhhz added a commit that referenced this issue on Jun 14, 2019:
FastIntMap is great, but we can do better by specializing the offsets map at code generation time. For now, switch everything to an array of `[MaxFieldID+1]uint16`, which should be the speed of light: the absolute fastest thing we could do. This also guarantees zero allocations for every operation (excepting only field `Set`s on a message with an under-allocated buffer). This is terrible for messages with sparse fields, and there are some ideas in #6 for a second offset map implementation for those messages and heuristics for when to use which map. This will be fixed by v1, but in the meantime zeropb will be an exceptionally bad fit for messages with sparse field ids. Touches #6.

name old time/op new time/op delta
DecodeSimpleAccessNone/pb-8 181ns ± 1% 183ns ± 1% ~ (p=0.079 n=5+5)
DecodeSimpleAccessNone/gogopb-8 106ns ± 0% 108ns ± 1% +1.51% (p=0.016 n=4+5)
DecodeSimpleAccessNone/zeropb-8 72.1ns ± 3% 50.4ns ± 3% -30.16% (p=0.008 n=5+5)
DecodeSimpleAccessAll/pb-8 235ns ± 1% 234ns ± 0% ~ (p=0.095 n=5+4)
DecodeSimpleAccessAll/gogopb-8 155ns ± 1% 156ns ± 1% ~ (p=0.103 n=5+5)
DecodeSimpleAccessAll/zeropb-8 166ns ± 0% 132ns ± 0% -20.48% (p=0.029 n=4+4)
DecodeSimpleAccessRepeatedly/pb-8 356ns ± 1% 348ns ± 2% -2.19% (p=0.024 n=5+5)
DecodeSimpleAccessRepeatedly/gogopb-8 267ns ± 1% 283ns ±25% ~ (p=0.690 n=5+5)
DecodeSimpleAccessRepeatedly/zeropb-8 362ns ± 1% 305ns ± 1% -15.81% (p=0.016 n=5+4)
DecodeComplexAccessOne/pb-8 1.70µs ± 1% 1.77µs ± 9% +4.38% (p=0.008 n=5+5)
DecodeComplexAccessOne/gogopb-8 848ns ± 1% 854ns ± 0% +0.64% (p=0.048 n=5+5)
DecodeComplexAccessOne/zeropb-8 478ns ± 0% 167ns ± 0% -65.09% (p=0.016 n=5+4)
DecodeComplexAccessRepeatedMessage/pb-8 1.89µs ± 0% 1.89µs ± 0% ~ (p=0.114 n=4+4)
DecodeComplexAccessRepeatedMessage/gogopb-8 1.03µs ± 0% 1.03µs ± 0% ~ (p=0.167 n=5+5)
DecodeComplexAccessRepeatedMessage/zeropb-8 1.24µs ± 3% 0.72µs ± 1% -42.22% (p=0.008 n=5+5)
EncodeSimpleSetAll/pb-8 232ns ± 2% 220ns ± 3% -4.84% (p=0.008 n=5+5)
EncodeSimpleSetAll/gogopb-8 182ns ± 1% 183ns ± 2% ~ (p=0.317 n=5+5)
EncodeSimpleSetAll/zeropb-8 139ns ± 0% 119ns ± 1% -14.39% (p=0.016 n=4+5)
EncodeSimpleSetRepeatedly/pb-8 283ns ± 2% 272ns ± 1% -3.89% (p=0.008 n=5+5)
EncodeSimpleSetRepeatedly/gogopb-8 196ns ± 1% 196ns ± 0% ~ (p=0.683 n=5+5)
EncodeSimpleSetRepeatedly/zeropb-8 378ns ± 0% 350ns ± 2% -7.35% (p=0.016 n=4+5)
EncodeComplex/pb-8 1.10µs ± 1% 1.11µs ± 2% ~ (p=0.881 n=5+5)
EncodeComplex/gogopb-8 761ns ± 0% 764ns ± 0% +0.47% (p=0.024 n=5+5)
EncodeComplex/zeropb-8 1.38µs ± 1% 0.66µs ± 0% -52.19% (p=0.016 n=5+4)

name old speed new speed delta
DecodeSimpleAccessNone/pb-8 594MB/s ± 1% 588MB/s ± 1% ~ (p=0.095 n=5+5)
DecodeSimpleAccessNone/gogopb-8 1.02GB/s ± 0% 1.00GB/s ± 0% -1.70% (p=0.008 n=5+5)
DecodeSimpleAccessNone/zeropb-8 1.50GB/s ± 3% 2.14GB/s ± 3% +43.21% (p=0.008 n=5+5)
DecodeSimpleAccessAll/pb-8 459MB/s ± 1% 460MB/s ± 1% ~ (p=0.421 n=5+5)
DecodeSimpleAccessAll/gogopb-8 695MB/s ± 1% 690MB/s ± 0% -0.80% (p=0.016 n=5+5)
DecodeSimpleAccessAll/zeropb-8 648MB/s ± 0% 814MB/s ± 0% +25.59% (p=0.008 n=5+5)
DecodeSimpleAccessRepeatedly/pb-8 303MB/s ± 1% 310MB/s ± 2% +2.27% (p=0.008 n=5+5)
DecodeSimpleAccessRepeatedly/gogopb-8 404MB/s ± 1% 387MB/s ±21% ~ (p=0.690 n=5+5)
DecodeSimpleAccessRepeatedly/zeropb-8 298MB/s ± 1% 354MB/s ± 1% +18.88% (p=0.016 n=5+4)
DecodeComplexAccessOne/pb-8 271MB/s ± 1% 260MB/s ± 8% -4.01% (p=0.008 n=5+5)
DecodeComplexAccessOne/gogopb-8 542MB/s ± 1% 538MB/s ± 0% ~ (p=0.056 n=5+5)
DecodeComplexAccessOne/zeropb-8 960MB/s ± 0% 2743MB/s ± 0% +185.66% (p=0.008 n=5+5)
DecodeComplexAccessRepeatedMessage/pb-8 244MB/s ± 0% 244MB/s ± 0% ~ (p=0.114 n=4+4)
DecodeComplexAccessRepeatedMessage/gogopb-8 445MB/s ± 0% 446MB/s ± 0% ~ (p=0.310 n=5+5)
DecodeComplexAccessRepeatedMessage/zeropb-8 370MB/s ± 3% 640MB/s ± 1% +73.03% (p=0.008 n=5+5)
EncodeSimpleSetAll/pb-8 466MB/s ± 2% 489MB/s ± 3% +4.99% (p=0.008 n=5+5)
EncodeSimpleSetAll/gogopb-8 592MB/s ± 1% 589MB/s ± 1% ~ (p=0.421 n=5+5)
EncodeSimpleSetAll/zeropb-8 776MB/s ± 1% 903MB/s ± 1% +16.40% (p=0.008 n=5+5)
EncodeSimpleSetRepeatedly/pb-8 381MB/s ± 2% 397MB/s ± 1% +4.06% (p=0.008 n=5+5)
EncodeSimpleSetRepeatedly/gogopb-8 550MB/s ± 1% 551MB/s ± 0% ~ (p=0.794 n=5+5)
EncodeSimpleSetRepeatedly/zeropb-8 286MB/s ± 0% 308MB/s ± 2% +7.86% (p=0.016 n=4+5)
EncodeComplex/pb-8 404MB/s ± 1% 401MB/s ± 2% ~ (p=0.841 n=5+5)
EncodeComplex/gogopb-8 604MB/s ± 0% 601MB/s ± 0% -0.47% (p=0.016 n=5+5)
EncodeComplex/zeropb-8 315MB/s ± 1% 659MB/s ± 0% +109.08% (p=0.008 n=5+5)

name old alloc/op new alloc/op delta
DecodeSimpleAccessNone/pb-8 136B ± 0% 136B ± 0% ~ (all equal)
DecodeSimpleAccessNone/gogopb-8 112B ± 0% 112B ± 0% ~ (all equal)
DecodeSimpleAccessNone/zeropb-8 0.00B 0.00B ~ (all equal)
DecodeSimpleAccessAll/pb-8 136B ± 0% 136B ± 0% ~ (all equal)
DecodeSimpleAccessAll/gogopb-8 112B ± 0% 112B ± 0% ~ (all equal)
DecodeSimpleAccessAll/zeropb-8 0.00B 0.00B ~ (all equal)
DecodeSimpleAccessRepeatedly/pb-8 136B ± 0% 136B ± 0% ~ (all equal)
DecodeSimpleAccessRepeatedly/gogopb-8 112B ± 0% 112B ± 0% ~ (all equal)
DecodeSimpleAccessRepeatedly/zeropb-8 0.00B 0.00B ~ (all equal)
DecodeComplexAccessOne/pb-8 1.11kB ± 0% 1.11kB ± 0% ~ (all equal)
DecodeComplexAccessOne/gogopb-8 960B ± 0% 960B ± 0% ~ (all equal)
DecodeComplexAccessOne/zeropb-8 0.00B 0.00B ~ (all equal)
DecodeComplexAccessRepeatedMessage/pb-8 1.11kB ± 0% 1.11kB ± 0% ~ (all equal)
DecodeComplexAccessRepeatedMessage/gogopb-8 960B ± 0% 960B ± 0% ~ (all equal)
DecodeComplexAccessRepeatedMessage/zeropb-8 0.00B 0.00B ~ (all equal)
EncodeSimpleSetAll/pb-8 84.0B ± 0% 84.0B ± 0% ~ (all equal)
EncodeSimpleSetAll/gogopb-8 192B ± 0% 192B ± 0% ~ (all equal)
EncodeSimpleSetAll/zeropb-8 0.00B 0.00B ~ (all equal)
EncodeSimpleSetRepeatedly/pb-8 92.0B ± 0% 92.0B ± 0% ~ (all equal)
EncodeSimpleSetRepeatedly/gogopb-8 192B ± 0% 192B ± 0% ~ (all equal)
EncodeSimpleSetRepeatedly/zeropb-8 0.00B 0.00B ~ (all equal)
EncodeComplex/pb-8 560B ± 0% 560B ± 0% ~ (all equal)
EncodeComplex/gogopb-8 1.02kB ± 0% 1.02kB ± 0% ~ (all equal)
EncodeComplex/zeropb-8 1.20kB ± 0% 0.00kB -100.00% (p=0.008 n=5+5)

name old allocs/op new allocs/op delta
DecodeSimpleAccessNone/pb-8 4.00 ± 0% 4.00 ± 0% ~ (all equal)
DecodeSimpleAccessNone/gogopb-8 1.00 ± 0% 1.00 ± 0% ~ (all equal)
DecodeSimpleAccessNone/zeropb-8 0.00 0.00 ~ (all equal)
DecodeSimpleAccessAll/pb-8 4.00 ± 0% 4.00 ± 0% ~ (all equal)
DecodeSimpleAccessAll/gogopb-8 1.00 ± 0% 1.00 ± 0% ~ (all equal)
DecodeSimpleAccessAll/zeropb-8 0.00 0.00 ~ (all equal)
DecodeSimpleAccessRepeatedly/pb-8 4.00 ± 0% 4.00 ± 0% ~ (all equal)
DecodeSimpleAccessRepeatedly/gogopb-8 1.00 ± 0% 1.00 ± 0% ~ (all equal)
DecodeSimpleAccessRepeatedly/zeropb-8 0.00 0.00 ~ (all equal)
DecodeComplexAccessOne/pb-8 33.0 ± 0% 33.0 ± 0% ~ (all equal)
DecodeComplexAccessOne/gogopb-8 7.00 ± 0% 7.00 ± 0% ~ (all equal)
DecodeComplexAccessOne/zeropb-8 0.00 0.00 ~ (all equal)
DecodeComplexAccessRepeatedMessage/pb-8 33.0 ± 0% 33.0 ± 0% ~ (all equal)
DecodeComplexAccessRepeatedMessage/gogopb-8 7.00 ± 0% 7.00 ± 0% ~ (all equal)
DecodeComplexAccessRepeatedMessage/zeropb-8 0.00 0.00 ~ (all equal)
EncodeSimpleSetAll/pb-8 2.00 ± 0% 2.00 ± 0% ~ (all equal)
EncodeSimpleSetAll/gogopb-8 2.00 ± 0% 2.00 ± 0% ~ (all equal)
EncodeSimpleSetAll/zeropb-8 0.00 0.00 ~ (all equal)
EncodeSimpleSetRepeatedly/pb-8 4.00 ± 0% 4.00 ± 0% ~ (all equal)
EncodeSimpleSetRepeatedly/gogopb-8 2.00 ± 0% 2.00 ± 0% ~ (all equal)
EncodeSimpleSetRepeatedly/zeropb-8 0.00 0.00 ~ (all equal)
EncodeComplex/pb-8 7.00 ± 0% 7.00 ± 0% ~ (all equal)
EncodeComplex/gogopb-8 3.00 ± 0% 3.00 ± 0% ~ (all equal)
EncodeComplex/zeropb-8 2.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5)
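For context, here is a minimal Go sketch of the kind of dense, code-generated offsets array the commit describes. The type name, sentinel handling, and method set are hypothetical illustrations for this writeup, not zeropb's actual generated code:

```go
package main

import "fmt"

// Hypothetical sketch of a code-generated dense offsets map: one array slot
// per possible field ID, holding the byte offset of that field's encoded
// data within the message buffer. maxFieldID would be emitted per message
// by the code generator.
const maxFieldID = 7

type fooOffsets [maxFieldID + 1]uint16

// Get returns the stored offset for a field ID and whether it was set.
// (A real implementation needs a distinct "unset" sentinel, since offset 0
// is a valid position; this sketch simply treats 0 as unset.)
func (o *fooOffsets) Get(fieldID int) (uint16, bool) {
	off := o[fieldID]
	return off, off != 0
}

// Set records the offset for a field ID: a single array write, no hashing,
// no allocation.
func (o *fooOffsets) Set(fieldID int, offset uint16) {
	o[fieldID] = offset
}

func main() {
	var o fooOffsets
	o.Set(3, 42)
	fmt.Println(o.Get(3)) // 42 true
}
```

Because the array is a fixed-size value embedded in the message struct, lookups are a single index with no hashing and no allocation, at the cost of one slot per possible field ID.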
The offsets map is currently implemented with FastIntMap since it was convenient, but FastIntMap has to either be tuned in one general way or have multiple versions code-generated to get the nice compiler specializations for all the masks. We could make the happy zero-allocation case more likely by specializing the map at code generation time, using our knowledge of the message fields.
For densely packed fields an array is sufficient. It'd be easiest to start with a `[]uint32`, but the values are offsets into a message, so we likely don't need the full range of 4294967296 and that would use more space than necessary. Maybe a followup to make it smaller? We'd then have a similar problem to FastIntMap, where we'd need all the various sizes we want to use in the zeropb package (or maybe codegen the crazy ones into the generated proto code, eek). We also have to decide how to handle offsets that are larger than whatever range we pick. If we want any valid input to work, we need to fall back to allocating a map, and we've basically re-derived FastIntMap (which is exactly why I used it in the first place).

My initial thought is to start with `[]uint32` and eat the struct size. That's definitely the simplest. It's also not as bad as it sounds: most of the fields generated by other protobuf libraries are 4-8 bytes per field, so the best case is the same struct size and the worst case is 2x. Sigh, maybe uint16 really is the way to go; 65536 bytes ought to be enough. Instead of the map fallback, decode could error if given a buffer longer than the max offset, which seems like a reasonable limitation.

If we do specialize densely packed fields, we need a heuristic and also a second impl for non-densely packed fields. My initial idea for the heuristic is that there should be no more holes than fields, i.e. `max field id <= 2 * num fields`. The second impl could initially always be a map (unfortunate but predictable performance). We could add a proto option to let the user override the heuristic if they're okay with the tradeoff.

Both offset impls would be structs that happen to expose the same set of methods. The codegen would simply put one or the other in the generated message struct and the rest of the codegen would be the same. No interfaces, so no allocations.
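To make the two decisions above concrete (the density heuristic and erroring instead of falling back to a map when a buffer outgrows uint16 offsets), here is a hedged Go sketch; the function names are invented for this example and are not part of zeropb:

```go
package main

import (
	"fmt"
	"math"
)

// Hypothetical codegen-time heuristic from the discussion above: use the
// dense array implementation when there are no more holes than fields.
func useDenseOffsets(maxFieldID, numFields int) bool {
	return maxFieldID <= 2*numFields
}

// With uint16 offsets and no map fallback, decode would simply reject
// buffers it cannot address rather than allocating a map.
func checkBufferLen(buf []byte) error {
	if len(buf) > math.MaxUint16 {
		return fmt.Errorf("buffer of %d bytes exceeds uint16 offset range", len(buf))
	}
	return nil
}

func main() {
	fmt.Println(useDenseOffsets(10, 8))              // true: 11 slots for 8 fields is dense enough
	fmt.Println(useDenseOffsets(1000, 3))            // false: use the sparse impl
	fmt.Println(checkBufferLen(make([]byte, 1<<20))) // error: too long for uint16 offsets
}
```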
There are probably things we could eventually do to improve the second impl if we cared (though initially I don't). It could be an array (of a size computed by some heuristic over the field types) plus a slice view of that array (or I guess it'd have to be two of these?). Then we'd keep all the fields sorted and do everything via binary search, appending to add newly parsed fields. That would be allocation-free until we got too many fields, which isn't bad. It also doesn't have the unpredictability of FastIntMap's flip from small to large.
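Sketching that idea under the assumption of a fixed backing array sized by codegen (all names here are hypothetical): a sorted slice over a small array, binary-searched on field ID, which only allocates once the backing array overflows:

```go
package main

import (
	"fmt"
	"sort"
)

type fieldOffset struct {
	fieldID uint32
	offset  uint16
}

// sparseOffsets keeps parsed fields sorted by field ID in a small backing
// array (whose size would come from a per-message codegen heuristic) and
// answers lookups via binary search. It stays allocation-free until the
// backing array overflows, at which point append spills to the heap.
type sparseOffsets struct {
	backing [8]fieldOffset
	fields  []fieldOffset // slice view over backing (or heap after overflow)
}

func (s *sparseOffsets) init() { s.fields = s.backing[:0] }

// Get binary-searches the sorted entries for a field ID.
func (s *sparseOffsets) Get(fieldID uint32) (uint16, bool) {
	i := sort.Search(len(s.fields), func(i int) bool { return s.fields[i].fieldID >= fieldID })
	if i < len(s.fields) && s.fields[i].fieldID == fieldID {
		return s.fields[i].offset, true
	}
	return 0, false
}

// Set inserts or updates an entry, keeping the slice sorted by field ID.
func (s *sparseOffsets) Set(fieldID uint32, offset uint16) {
	i := sort.Search(len(s.fields), func(i int) bool { return s.fields[i].fieldID >= fieldID })
	if i < len(s.fields) && s.fields[i].fieldID == fieldID {
		s.fields[i].offset = offset
		return
	}
	s.fields = append(s.fields, fieldOffset{}) // no allocation while within backing's capacity
	copy(s.fields[i+1:], s.fields[i:])         // shift the tail right to make room
	s.fields[i] = fieldOffset{fieldID: fieldID, offset: offset}
}

func main() {
	var s sparseOffsets
	s.init()
	s.Set(1000, 10)
	s.Set(5, 2)
	fmt.Println(s.Get(1000)) // 10 true
	fmt.Println(s.Get(6))    // 0 false
}
```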