
Add option for merging output into a single file #8

Closed
extremecarver opened this issue Apr 19, 2023 · 8 comments
Labels
enhancement New feature or request

Comments

@extremecarver

Phyghtmap used to output to a single .pbf file. Would it be possible to have the same behaviour again, instead of one output file per input file?
I know I could use osmconvert to do this - but I think it would be easier if pyhgtmap could output directly to a single file.

And yeah - great improvements overall, and much faster now! Is the jobs parameter doing anything? In phyghtmap it used to be broken in that values above 2 gave no further speedup (I didn't check this for a long time, so maybe it was solved at some point).

@agrenott added the enhancement label Apr 19, 2023
@agrenott (Owner)

The jobs parameter was actually working (at least on phyghtmap 2.23, the latest one); but due to the constraints put on parallelism (notably to handle single output), it was actually quite difficult to really use more than 2-3 CPUs at any given time.

This is why I took the decision to remove single file output as:

  • I don't need it :D
  • Constraints are different depending on the output format (OSM and O5M expect to have all nodes before the first way, PBF doesn't seem to care)
  • Actual writing of the output file is now the most time-consuming part, and can't be parallelized with a single output

I could probably re-introduce the single output option without adding too much complexity, but it means I probably won't bother handling parallelization in this case.

@extremecarver (Author)

Is it still handling the node numbering right? Did you have any problems merging with osmconvert? If the current approach (write multiple files, merge them with osmconvert, then delete the single files) is much faster than writing to a single file up front, I think it's okay. You should mention in the instructions how you intend them to be merged, however (for my use case separate files are not an option - but if I know merging is reliable, that's fine too).

Ah okay - I could not see any speed difference between jobs=2 and jobs=12 (hexa-core CPU with 12 threads). jobs=1 was much slower on 2.23.

@agrenott (Owner)

I kept the logic to avoid nodes & ways numbering overlap, so the resulting files should merge nicely. I didn't try though.
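For illustration only (this is not pyhgtmap's actual code, and all names are hypothetical), the non-overlap idea can be sketched as: each output file reserves a disjoint id range, so independently written files never reuse an id and can be concatenated safely.

```python
# Hypothetical sketch of disjoint id allocation across output tiles,
# so that separately written OSM files merge without id collisions.
# Not pyhgtmap's actual implementation.

def allocate_id_ranges(start_id, nodes_per_tile):
    """Return one (first_id, last_id) range per tile; ranges never overlap."""
    ranges = []
    next_id = start_id
    for count in nodes_per_tile:
        ranges.append((next_id, next_id + count - 1))
        next_id += count  # the next tile starts right after this one ends
    return ranges

# Node counts here are illustrative (the first matches the example later
# in this thread; the second is made up).
ranges = allocate_id_ranges(10_000_000, [11_215_767, 9_000_000])
# Each tile's last id is strictly below the next tile's first id.
assert all(end < nxt for (_, end), (nxt, _) in zip(ranges, ranges[1:]))
```

The same scheme applied to way ids would explain why the per-tile outputs "should merge nicely": a merge tool only has to concatenate, never renumber.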

@extremecarver (Author) commented Apr 19, 2023

Well, merging with osmconvert only works for o5m files, and is quite slow... Multiple PBF files cannot be merged. Writing to o5m is much slower than writing to PBF, however.

So this is a bit imprecise, as Germany is small and I didn't look at the seconds - but roughly it takes twice the time for me: 4 minutes, instead of 2 minutes for just writing multiple PBF files.

Here is my sample command for Germany (note: somehow bash has a problem with _, so I need to set a variable for it):

Underline=_
nice -n 19 pyhgtmap --earthexplorer-user=extremecarver --earthexplorer-password=Testmap0 --jobs=12 --polygon=/home/contourlines/bounds/"$COUNTRY".poly --step=$step --no-zero-contour --void-range-max=-420 --output-prefix="$COUNTRY2" --line-cat=$detail --start-node-id=10000000 --start-way-id=10000000 --source=$SOURCE --max-nodes-per-way=230 --max-nodes-per-tile=0 --o5m --hgtdir=/home/contourlines/hgt --simplifyContoursEpsilon=0.00001 -j16
dup3="$COUNTRY2""$Underline".o5m
osmconvert $dup3 -o="$COUNTRY2""$Underline".osm.pbf
rm "$COUNTRY2""$Underline".o5m

So maybe in that case writing PBF directly would be faster? I don't know of any tool that is faster than osmconvert.

@extremecarver (Author)

Following up here (instead of the closed topic on Europe) - the single-file option will be needed, because otherwise it is not possible to create continents as a single file.
Osmconvert can only process 1001 files, and osmium capitulates very quickly with PBF input files (maybe 300 max), while with o5m it will run out of memory.

Compiling Europe at 10 m interval to o5m with pyhgtmap took 2:20 hours, plus 18 minutes to merge the files with osmium.
Compiling Europe at 10 m interval to PBF took 1:13 hours - with no way to merge the files afterwards. Osmium crashes quickly with:
"Open failed for 'europe10m_lon36.00_37.00lat65.00_66.00_srtm1v3.0.osm.pbf': Too many open files"
and I doubt it could handle big countries anyhow, as it would run out of memory too.

So for continents - Russia, China, Canada, the USA, and maybe Brazil - if you want them in a single file, a "slow" output into a single PBF file would be needed.

And yeah, it's clear that writing to PBF is much faster than writing to o5m. That's running with --max-nodes-per-tile=0.
And actually I usually need to split those files again later to max-nodes=6400000 - however, as many flat areas would result in much smaller 1°x1° tiles, I first need to merge them and then split them again. In the end, for my use case, it makes a big difference whether I end up with 1077 tiles or 1700 tiles (because the current approach creates many 1°x1° tiles that are much smaller than 6400000 nodes, with some, like in the Alps, being much bigger for 1°x1°). 1077 vs 1700 is approximate for Europe.
For maps that I create with 20 m contour lines the difference will be double, as I then use twice the max-nodes value for splitting them (or I would need to run pyhgtmap again with a 20 m interval, instead of just running my map compiler later, dropping the 10 m, 30 m, ... lines and using only the 20 m, 40 m, ... ones).
Some other people may have other use cases however and it may be important for them to actually have a single output file.

I still wonder a bit about the comment "Actual writing of output file is now the most time-consuming part" - because the time difference above for Europe between o5m and PBF is certainly not down to writing to the HDD. While my HDD isn't blazing fast, it can write 200 MB per second (continuous), or maybe 50 MB/s for less continuous writes, and has a 512 MB buffer that would speed things up even more for files less than 1 GB in size (server-grade HDD).

@agrenott (Owner)

I'm off for a week, I'll have a look to the single file output when back.

Concerning the file generation, it's not the IO taking time (it's actually done in batches on a separate thread by pyosmium), but the computing. The pyosmium interface requires a function call per node, and for millions of nodes this takes a lot of CPU. In the latest profiling I did, I think this is now more than half of the total processing time.

@agrenott (Owner) commented May 1, 2023

More details concerning this point:

> Concerning the file generation, it's not the IO taking time (it's actually done in batches on a separate thread by pyosmium), but the computing. The pyosmium interface requires a function call per node, and for millions of nodes this takes a lot of CPU. In the latest profiling I did, I think this is now more than half of the total processing time.

Profiling the generation of a single output from 2 view1 local files (with `python -m yappi -f callgrind -o yappi_ex1.out ../../pyhgtmap/main.py --pbf --log=DEBUG --max-nodes-per-tile=0 /mnt/g/git/garmin_mtb/work/hgt/VIEW1/N46E014.hgt /mnt/g/git/garmin_mtb/work/hgt/VIEW1/N46E015.hgt`):

[screenshot: yappi/callgrind profiling output]

  1. is the time spent writing NODES to PBF output (11215767 nodes in this example)
  2. is the time spent actually generating contours
  3. is the time spent writing WAYS to PBF output (50796 ways in this example)

At best, parallelization could allow processing 2 in parallel with (1+3), which would be a ~25% improvement in the overall elapsed time. Not really worth the added complexity until one finds a way to optimize the actual PBF output part.
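The per-node call overhead described above can be illustrated with a plain-Python micro-sketch (this is not pyosmium itself; `add_node` is a hypothetical stand-in for a writer's per-node method, and timings are machine-dependent):

```python
# Illustrative micro-benchmark: one Python function call per node versus
# building the same data in a batch. Hypothetical stand-in, not pyosmium.
import timeit

def add_node(acc, node_id, lon, lat):
    # Stand-in for a writer object's per-node method call.
    acc.append((node_id, lon, lat))

def per_call(n):
    acc = []
    for i in range(n):
        add_node(acc, i, 14.0, 46.0)  # one call frame per node
    return acc

def batched(n):
    # Same data, built without a per-node function call.
    return [(i, 14.0, 46.0) for i in range(n)]

n = 100_000
t_call = timeit.timeit(lambda: per_call(n), number=5)
t_batch = timeit.timeit(lambda: batched(n), number=5)
# At the ~11 million nodes of the example above, the per-call variant pays
# millions of extra call frames, which is pure CPU time, not IO.
print(f"per-call: {t_call:.3f}s  batched: {t_batch:.3f}s")
```

Both variants produce identical data; only the call structure differs, which is why the node-writing phase shows up as CPU time rather than IO in the profile.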

@extremecarver (Author)

Thanks a lot for adding it back.
