Make osmflat more compact (especially when compressed) #70

Merged: 6 commits, Oct 26, 2022

VeaaC (Collaborator) commented Oct 24, 2022

  • Store granularity explicitly instead of pre-multiplied numbers
  • Move Ids to separate optional sub-archive
  • Reduce bits needed for coordinates from 40 to 32
  • Remove unused header information
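
The first and third changes go together: once the granularity is stored separately, the per-node coordinate values are smaller and fit in fewer bits. The following is a minimal sketch of that arithmetic, assuming the OSM PBF default granularity of 100 nanodegrees per unit and a longitude range of ±180°. The function names are hypothetical and are not part of osmflat's API.

```rust
/// Smallest signed bit width able to hold every value in [-max_abs, max_abs].
fn signed_bits_needed(max_abs: u64) -> u32 {
    // Bits for the magnitude, plus one sign bit.
    (64 - max_abs.leading_zeros()) + 1
}

/// Largest absolute stored coordinate for a given granularity
/// (granularity is expressed in nanodegrees per stored unit).
fn max_stored_coord(granularity_nanodeg: u64) -> u64 {
    const MAX_ABS_DEGREES: u64 = 180; // longitude range is [-180, 180]
    const NANODEG_PER_DEG: u64 = 1_000_000_000;
    MAX_ABS_DEGREES * NANODEG_PER_DEG / granularity_nanodeg
}

/// Convert a stored 32-bit coordinate back to degrees using an explicit
/// granularity (assumed here to live once in the header, not per node).
fn to_degrees(stored: i32, granularity_nanodeg: i32) -> f64 {
    stored as f64 * granularity_nanodeg as f64 * 1e-9
}
```

Under these assumptions, `signed_bits_needed(max_stored_coord(1))` is 39 for pre-multiplied nanodegree values, while `signed_bits_needed(max_stored_coord(100))` is 32 for values stored at the default granularity, which is consistent with the move from 40-bit to 32-bit fields.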

Comparison:

Configurations compared:

  • Compressed PBF (internal zlib compression)
  • Unpacked osmflat with and without optional Ids (only the new version makes them optional)
  • Compression with pzstd level 3
  • Compression with shuffly (https://github.com/VeaaC/shuffly) + pzstd level 3

Before:

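| Dataset | PBF (zlib) | osmflat w Ids | zstd + osmflat w Ids | shuffly + zstd + osmflat w Ids |
|---------|------------|---------------|----------------------|--------------------------------|
| Berlin  | 70M        | 227M          | 94M                  | 58M                            |
| Europe  | 26G        | 89G           | 41G                  | 23G                            |
| Planet  | 66G        | 223G          | 102G                 | 57G                            |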

After:

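| Dataset | PBF (zlib) | osmflat w Ids | osmflat w/o Ids | zstd + osmflat w Ids | zstd + osmflat w/o Ids | shuffly + zstd + osmflat w Ids | shuffly + zstd + osmflat w/o Ids |
|---------|------------|---------------|-----------------|----------------------|------------------------|--------------------------------|----------------------------------|
| Berlin  | 70M        | 214M          | 176M            | 83M                  | 69M                    | 53M                            | 49M                              |
| Europe  | 26G        | 84G           | 68G             | 36G                  | 30G                    | 20G                            | 19G                              |
| Planet  | 66G        | 208G          | 164G            | 97G                  | 76G                    | 48G                            | 47G                              |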
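
To put the before/after numbers in relative terms, here is a small sketch that computes the percentage reduction for the Planet rows with Ids. The sizes are copied from the tables above (in the same "G" unit used there); the helper names are made up for this illustration.

```rust
/// Relative size reduction in percent when going from `before` to `after`.
fn reduction_percent(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}

/// Planet sizes with Ids, as (label, before, after), taken from the tables.
const PLANET_WITH_IDS: [(&str, f64, f64); 3] = [
    ("osmflat", 223.0, 208.0),
    ("zstd + osmflat", 102.0, 97.0),
    ("shuffly + zstd + osmflat", 57.0, 48.0),
];

/// Percentage reduction for each Planet configuration with Ids.
fn planet_reductions() -> Vec<(&'static str, f64)> {
    PLANET_WITH_IDS
        .iter()
        .map(|&(label, before, after)| (label, reduction_percent(before, after)))
        .collect()
}
```

For the Planet dataset this gives roughly 6.7% for unpacked osmflat, 4.9% with zstd, and 15.8% with shuffly + zstd, all with Ids kept.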

Observations:

  • The new format is more compact in all scenarios.
  • Using shuffly is still worth it (Ids compress by more than a factor of 20), but slightly less so than before because of the rearranged data and the explicit granularity.
  • The shuffly-compressed version is smaller than the compressed PBF, both with and without Ids.
  • Most people will probably want the version without Ids, since it saves disk space.
  • Ids are almost free when compressed with shuffly, because the data is sorted by Id.
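
One way to see why sorted Ids compress so well is to transpose fixed-size records byte by byte, so that the rarely changing high bytes of neighbouring Ids end up in long identical runs. The sketch below shows this generic byte-shuffle technique; it is not taken from shuffly's source, and whether shuffly works exactly this way is an assumption. The 5-byte (40-bit) Id record layout is also assumed for illustration.

```rust
/// Transpose fixed-size records so that byte 0 of every record comes first,
/// then byte 1 of every record, and so on.
fn shuffle_bytes(data: &[u8], record_size: usize) -> Vec<u8> {
    assert!(record_size > 0 && data.len() % record_size == 0);
    let records = data.len() / record_size;
    let mut out = Vec::with_capacity(data.len());
    for byte_idx in 0..record_size {
        for rec in 0..records {
            out.push(data[rec * record_size + byte_idx]);
        }
    }
    out
}

/// Inverse of `shuffle_bytes`: restore the original record layout.
fn unshuffle_bytes(data: &[u8], record_size: usize) -> Vec<u8> {
    assert!(record_size > 0 && data.len() % record_size == 0);
    let records = data.len() / record_size;
    let mut out = vec![0u8; data.len()];
    for byte_idx in 0..record_size {
        for rec in 0..records {
            out[rec * record_size + byte_idx] = data[byte_idx * records + rec];
        }
    }
    out
}

/// Serialize ids as little-endian 5-byte records (assumed layout; ids must
/// fit in 40 bits, higher bytes are dropped).
fn ids_to_records(ids: &[u64]) -> Vec<u8> {
    ids.iter()
        .flat_map(|id| id.to_le_bytes()[..5].to_vec())
        .collect()
}
```

For consecutive Ids such as 1000 to 1003, only the first group of shuffled bytes varies; every later group is a run of identical bytes, which a general-purpose compressor like zstd handles very efficiently.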

hallahan (Contributor) commented

Such a big difference! I wonder how much better it would be if you reduced the integer sizes to 38 bits instead of 40?

VeaaC (Collaborator, Author) commented Oct 26, 2022

Not much, if anything at all: most structures would only be a few bits smaller, and flatdata rounds up to the next byte. It would only help if a structure had 4 references, each saving 2 bits.
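
The rounding argument can be checked with a few lines of arithmetic. This is a minimal sketch that assumes a struct's size is simply the sum of its field widths rounded up to a whole byte, as described in the comment above; `packed_struct_bytes` is a hypothetical helper, not a flatdata function.

```rust
/// Size in bytes of a bit-packed struct whose total width (the sum of all
/// field widths in bits) is rounded up to the next whole byte.
fn packed_struct_bytes(field_bits: &[u32]) -> u32 {
    let total_bits: u32 = field_bits.iter().sum();
    (total_bits + 7) / 8
}
```

With one reference, `packed_struct_bytes(&[40])` and `packed_struct_bytes(&[38])` both give 5 bytes, so nothing is saved. Only with four references does the size drop, from 20 bytes for `&[40; 4]` to 19 bytes for `&[38; 4]`.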

VeaaC merged commit 6c74740 into boxdot:master Oct 26, 2022
VeaaC deleted the compact branch October 26, 2022 15:08
hallahan (Contributor) commented

How do you do your size benchmarking? Is there a script somewhere?

boxdot (Owner) commented Oct 26, 2022

> How do you do your size benchmarking? Is there a script somewhere?

+1 for documenting the command lines that produced the above numbers.
