Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sorting vectors #66

Open
hallahan opened this issue Oct 24, 2022 · 9 comments
Open

Sorting vectors #66

hallahan opened this issue Oct 24, 2022 · 9 comments

Comments

@hallahan
Copy link
Contributor

Right now, it looks like we can only build a Vec by appending. It would be nice to be able to sort our vectors after they have been built. For example, I may want to compute a hilbert index for the nodes and sort them by that. That way, I can access the data with good locality.

Is there a straightforward way to sort?

@boxdot
Copy link
Owner

boxdot commented Oct 24, 2022

Usually, what we sort data in memory and then write it to disk with e.g. ExternalVector. We had some experimental work on sorting ExternalVector externally, which is useful when the data does not fit on disk, but unfortunately we never finished it.

@hallahan
Copy link
Contributor Author

You would need a beefy machine to sort the OSM planet. There are 7_992_520_838 nodes. Just sorting a 64bit index would be ~60GB of RAM. It would be cool if that wasn't needed, because I can run your tool otherwise on my laptop.

What is the challenge behind making vectors backed by memmap mutable?

@VeaaC
Copy link
Collaborator

VeaaC commented Oct 25, 2022

There are several problems with memory mapping mutable vectors:
A) It violates Rust's safety guarantees: A mutable reference invalidates all other references, but nothing stops another applications from opening the same file. We already violate this slightly by read-only memory mapping (somebody could manually open it writeable).
B) Just because it is mutably memory mapped doesn't mean you can sort it easily: Many sorting algorithms are not in-place, or optimized for external-memory sorting
C) Some flatdata structs are not self-contained, they also read data from the "next" item in the vector (@range attribution does that). Those types of structs must not be re-ordered.

That's why our general plan was to rather extend ExternalVector with a dedicated sorting method that is optimized for external memory, and can only be called when finalizing the vector (e.g. super-scalar-sample-sort, see https://github.com/lorenzhs/ssssort ), and prevents being used in case the struct is not self-contained.

@hallahan
Copy link
Contributor Author

The reason I ask is that I'm experimenting with creating tile indices using a Hilbert Curve. I am converting Node LatLons to their corresponding Hilbert number, then sorting Nodes by that. For ways, I do the same thing, with the Hilbert number derived from their Point on Surface.

osmflat-rs seems like a compelling fit for a back-end format, but an easy way of memmap sorting is needed.

@hallahan
Copy link
Contributor Author

Here's a general idea behind sorting vectors. Mutable accesors could be built into the storage code in flatdata to allow this sort of thing and maybe provide some guardrails...

#74

It would be super cool if you could annotate the schema to make the sorting populate throughout all references from the sorted vector.

I haven't yet tried sorting the nodes themselves yet.

@VeaaC
Copy link
Collaborator

VeaaC commented Oct 26, 2022

Building sorting into flatdata itself is definitely a feature we are looking forward to. Automatically updating any references is most likely not going to happen, since such a thing usually requires having a large in-memory index that remaps references, and also reordering data references with @range.

The nodes in osmflat itself are not sortable directly, though (as they have @range(tags) tag_first_idx: u64 : 40;), so one would most likely create a temporary intermediate storage for sorting.

Regarding the usability of osmflat as a backend format: While it can word for simple use cases I think it is much better suited as a input format to build your backend format: Most backends will need some of:

  • specialized indices (hilbert/zorder curves, quadtrees, tiles, etc)
  • pre-evaluated tags (convert tags into proper attributes, sanitize inputs, etc)
  • precomputed datastructures/algorithms (connect ways for routing, simplify polygons, assemble multi-polgyons, etc)

In my opinion osmflat is rather suited as an input format to build your own backend format out of: It is much easier to consume than PBF, requires less RAM (due to references being already resolved). We were mostly designing it as a data-exchange format replacing PBF.

@hallahan
Copy link
Contributor Author

Extending ExternalVector with a sort is a pretty interesting idea. Not sure if there is a parent issue on this.

Have you thought about radix sort?

https://docs.rs/radsort/latest/radsort/index.html

@VeaaC
Copy link
Collaborator

VeaaC commented Oct 31, 2022

Radix sort is usually not the best choice if you need to sort data that does not fit into RAM: You would need one pass over the data for each 4/8/16 bits (whatever your word size is). E.g. assuming 16 bits words, and 32 byte structures you would need up to 16 passes of reading/writing the data. Some of the later passes might fit into memory, but it mostly depends on the data.

Compare that e.g. with (assuming that your data size is less than RAM ^ 1.5, e.g. 4 GB RAM -> less than 281 PB data ):

  • a super-scalar-sample-sort: It has one pass for classifying data, one pass for shuffling it around.
  • k-way merge sort: one pass to sort each block in RAM, one pass to merge all of them

@hallahan
Copy link
Contributor Author

Makes sense. Basically, the sort needs to be localized first to reduce paging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants