Sorting vectors #66

hallahan · 2022-10-24T01:22:23Z

Right now, it looks like we can only build a Vec by appending. It would be nice to be able to sort our vectors after they have been built. For example, I may want to compute a hilbert index for the nodes and sort them by that. That way, I can access the data with good locality.

Is there a straightforward way to sort?

The text was updated successfully, but these errors were encountered:

boxdot · 2022-10-24T07:47:56Z

Usually, what we sort data in memory and then write it to disk with e.g. ExternalVector. We had some experimental work on sorting ExternalVector externally, which is useful when the data does not fit on disk, but unfortunately we never finished it.

hallahan · 2022-10-24T14:28:19Z

You would need a beefy machine to sort the OSM planet. There are 7_992_520_838 nodes. Just sorting a 64bit index would be ~60GB of RAM. It would be cool if that wasn't needed, because I can run your tool otherwise on my laptop.

What is the challenge behind making vectors backed by memmap mutable?

VeaaC · 2022-10-25T11:00:20Z

There are several problems with memory mapping mutable vectors:
A) It violates Rust's safety guarantees: A mutable reference invalidates all other references, but nothing stops another applications from opening the same file. We already violate this slightly by read-only memory mapping (somebody could manually open it writeable).
B) Just because it is mutably memory mapped doesn't mean you can sort it easily: Many sorting algorithms are not in-place, or optimized for external-memory sorting
C) Some flatdata structs are not self-contained, they also read data from the "next" item in the vector (@range attribution does that). Those types of structs must not be re-ordered.

That's why our general plan was to rather extend ExternalVector with a dedicated sorting method that is optimized for external memory, and can only be called when finalizing the vector (e.g. super-scalar-sample-sort, see https://github.com/lorenzhs/ssssort ), and prevents being used in case the struct is not self-contained.

hallahan · 2022-10-25T15:38:49Z

The reason I ask is that I'm experimenting with creating tile indices using a Hilbert Curve. I am converting Node LatLons to their corresponding Hilbert number, then sorting Nodes by that. For ways, I do the same thing, with the Hilbert number derived from their Point on Surface.

osmflat-rs seems like a compelling fit for a back-end format, but an easy way of memmap sorting is needed.

hallahan · 2022-10-26T00:47:15Z

Here's a general idea behind sorting vectors. Mutable accesors could be built into the storage code in flatdata to allow this sort of thing and maybe provide some guardrails...

#74

It would be super cool if you could annotate the schema to make the sorting populate throughout all references from the sorted vector.

I haven't yet tried sorting the nodes themselves yet.

VeaaC · 2022-10-26T06:12:43Z

Building sorting into flatdata itself is definitely a feature we are looking forward to. Automatically updating any references is most likely not going to happen, since such a thing usually requires having a large in-memory index that remaps references, and also reordering data references with @range.

The nodes in osmflat itself are not sortable directly, though (as they have @range(tags) tag_first_idx: u64 : 40;), so one would most likely create a temporary intermediate storage for sorting.

Regarding the usability of osmflat as a backend format: While it can word for simple use cases I think it is much better suited as a input format to build your backend format: Most backends will need some of:

specialized indices (hilbert/zorder curves, quadtrees, tiles, etc)
pre-evaluated tags (convert tags into proper attributes, sanitize inputs, etc)
precomputed datastructures/algorithms (connect ways for routing, simplify polygons, assemble multi-polgyons, etc)

In my opinion osmflat is rather suited as an input format to build your own backend format out of: It is much easier to consume than PBF, requires less RAM (due to references being already resolved). We were mostly designing it as a data-exchange format replacing PBF.

hallahan · 2022-10-29T17:44:38Z

Extending ExternalVector with a sort is a pretty interesting idea. Not sure if there is a parent issue on this.

Have you thought about radix sort?

https://docs.rs/radsort/latest/radsort/index.html

VeaaC · 2022-10-31T08:37:23Z

Radix sort is usually not the best choice if you need to sort data that does not fit into RAM: You would need one pass over the data for each 4/8/16 bits (whatever your word size is). E.g. assuming 16 bits words, and 32 byte structures you would need up to 16 passes of reading/writing the data. Some of the later passes might fit into memory, but it mostly depends on the data.

Compare that e.g. with (assuming that your data size is less than RAM ^ 1.5, e.g. 4 GB RAM -> less than 281 PB data ):

a super-scalar-sample-sort: It has one pass for classifying data, one pass for shuffling it around.
k-way merge sort: one pass to sort each block in RAM, one pass to merge all of them

hallahan · 2022-10-31T15:49:29Z

Makes sense. Basically, the sort needs to be localized first to reduce paging.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sorting vectors #66

Sorting vectors #66

hallahan commented Oct 24, 2022

boxdot commented Oct 24, 2022

hallahan commented Oct 24, 2022

VeaaC commented Oct 25, 2022

hallahan commented Oct 25, 2022

hallahan commented Oct 26, 2022

VeaaC commented Oct 26, 2022

hallahan commented Oct 29, 2022

VeaaC commented Oct 31, 2022

hallahan commented Oct 31, 2022

Sorting vectors #66

Sorting vectors #66

Comments

hallahan commented Oct 24, 2022

boxdot commented Oct 24, 2022

hallahan commented Oct 24, 2022

VeaaC commented Oct 25, 2022

hallahan commented Oct 25, 2022

hallahan commented Oct 26, 2022

VeaaC commented Oct 26, 2022

hallahan commented Oct 29, 2022

VeaaC commented Oct 31, 2022

hallahan commented Oct 31, 2022