Segfault during DAG build persists #80

Closed
patrickbr opened this issue Jan 31, 2024 · 6 comments

@patrickbr
Member

patrickbr commented Jan 31, 2024

Unfortunately, the latest weekly run crashed this morning with a segmentation fault during the DAG build for the USA dataset. This seems to be the same bug we encountered on the planet.osm dataset in autumn, and which we thought had been fixed. I am currently trying to reproduce it in gdb.

@patrickbr
Member Author

patrickbr commented Feb 5, 2024

The segfault occurred again this weekend, and again, multiple attempts since yesterday to reproduce this with gdb/valgrind/thread-sanitizer failed. I now manually went through the code line by line and found one line where an access to the first box ID of a geometry was not protected by a check whether the geometry has any box IDs at all. Normally, each geometry should be assigned at least one box ID, as the grid covers the entire globe. However, it might be that in extreme edge cases (for example, a polygon lying completely on an edge of the grid) a geometry may not have any box ID at all. This case is now caught with 755e5af.
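
The kind of guard described above can be sketched as follows. This is only an illustration, not the code from 755e5af: `BoxIdList` and `firstBoxIdOr` are hypothetical names, and the real data structures may well look different.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the list of grid box IDs assigned to a geometry.
using BoxIdList = std::vector<std::int32_t>;

// Returns the first box ID of a geometry, or `fallback` if the geometry
// was not assigned any box ID at all. Calling front() on an empty vector
// is undefined behaviour, so the emptiness check has to come first.
std::int32_t firstBoxIdOr(const BoxIdList& boxIds, std::int32_t fallback) {
  if (boxIds.empty()) {
    return fallback;
  }
  return boxIds.front();
}
```

A caller would pick a fallback that cannot be a valid box ID (an assumption about the ID scheme) and skip the geometry when it comes back.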

I now restarted the build with 755e5af, let's see what happens.

@patrickbr
Member Author

patrickbr commented Feb 5, 2024

Update: this wasn't the cause.

Additional runs of 755e5af today on our ob machine (✓ = success without any error/warning, ✗ = segfault):

  • usa.osm, in gdb, with -g: ✓
  • usa.osm, in gdb, without -g: ✓
  • usa.osm, with thread-sanitizer: ✓
  • usa.osm, release build: ✗

I will create an artificial USA dataset containing only named areas tomorrow and test it with valgrind.

At the moment, our weekly update runs entirely within gdb.

@patrickbr
Member Author

Still no luck reproducing this, even with synthetic datasets, and with the removal of all sanity checks we have in prepareDAG(). So far, I wasn't even able to reproduce it with master, on the same machine (ob), with the same command line and build parameters as for the weekly build, but outside of a Docker environment. I now bumped the Ubuntu version in the Docker container to 22.04, let's see how this behaves.

@patrickbr
Member Author

patrickbr commented Feb 8, 2024

I was now finally able to reproduce this in a single-threaded environment (so no data races) on the ABB (Asia) dataset with basic debug output.

The crash occurred during a check between these areas:

https://www.openstreetmap.org/relation/5615030
https://www.openstreetmap.org/relation/1949881

The latter crosses the dateline and thus the boundaries of the Mercator projection, so there is certainly something special about it. How that relates to the crash remains to be investigated :)
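
For readers wondering how basic debug output can pin a crash to a pair of areas, here is a minimal sketch. The names `describePair` and `logPair` are hypothetical and not taken from the project. The important detail is that each line is flushed before the check runs, so the last line in the log survives a segfault and names the pair that was being processed.

```cpp
#include <cstdint>
#include <iostream>
#include <sstream>
#include <string>

// Builds the log line for one pair of area IDs (hypothetical helper).
std::string describePair(std::uint64_t a, std::uint64_t b) {
  std::ostringstream os;
  os << "checking areas " << a << " and " << b;
  return os.str();
}

// Writes the log line and flushes immediately (std::endl flushes),
// so the output is not lost if the process crashes right afterwards.
void logPair(std::uint64_t a, std::uint64_t b) {
  std::cerr << describePair(a, b) << std::endl;
}
```

Calling `logPair(a, b)` directly before each containment or intersection check is enough in a single-threaded run; with several threads the lines would interleave and the last line would no longer identify the culprit reliably.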

@patrickbr
Member Author

patrickbr commented Feb 13, 2024

Current state. All runs use a single-threaded prepareDAG() and a deterministic processing order of areas.

Note that from Boost 1.78 upwards, GEOMETRYCOLLECTIONS() are supported and enabled.

| Machine | Env | Mode | Dataset | Crash? |
|---|---|---|---|---|
| ob | Docker (Ubuntu 22.04, Boost 1.78) | -O3 | abb.osm.pbf | Yes |
| ob | Docker (Ubuntu 22.04, Boost 1.78) | -g, gdb | abb.osm.pbf | No |
| ob | Docker (Ubuntu 22.04, Boost 1.74) | -g | abb.osm.pbf | No |
| ob | Docker (Ubuntu 22.04, Boost 1.79) | -g | abb.osm.pbf | No |
| ob | Docker (Ubuntu 22.04, Boost 1.80) | -g | abb.osm.pbf | No |
| ob | Docker (Ubuntu 22.04, Boost 1.84) | -g | abb.osm.pbf | No |
| ob | Docker (Ubuntu 22.04, Boost 1.84) | -O3 | abb.osm.pbf | No |
| ob | Docker (Ubuntu 22.04, Boost 1.79) | -O3 | abb.osm.pbf | No |
| ob | Docker (Ubuntu 22.04, Boost 1.78) | -g -fsanitize=address | abb.osm.pbf | Yes †† |
| ob | Docker (Ubuntu 22.04, Boost 1.78) | -g, gdb | abb.osm.pbf | TODO |
| ob | Docker (Ubuntu 22.04, Boost 1.74) | -g, gdb | abb.osm.pbf | TODO |
| ob | bare (Ubuntu 22.04, Boost 1.74) | -O3 | abb.osm.pbf | No |
| ob | bare (Ubuntu 22.04, Boost 1.74) | -g | abb.osm.pbf | No |
| ob | bare (Ubuntu 22.04, Boost 1.74) | -g -fsanitize=address | abb.osm.pbf | TODO |
| ob | bare (Ubuntu 22.04, Boost 1.74) | -g, valgrind | abb.osm.pbf | TODO, will take very long (weeks) |
| ob | bare (Ubuntu 22.04, Boost 1.74) | -g, valgrind | 5615030-1949881.osm.pbf | No |
| ob | bare (Ubuntu 22.04, Boost 1.74) | -g, valgrind | geo.osm.pbf (box around Antimeridian) | No |
| ob | bare (Ubuntu 22.04, Boost 1.74) | -g -fsanitize=address | abb.osm.pbf | No |
| ob | bare (Ubuntu 22.04, Boost 1.74) | -g -fsanitize=address | 5615030-1949881.osm.pbf | No |
| ob | bare (Ubuntu 22.04, Boost 1.74) | -g -fsanitize=address | geo.osm.pbf (box around Antimeridian) | No |
| ob | bare (Ubuntu 22.04, Boost 1.74) | -g, gdb | abb.osm.pbf | No |
| ob | bare (Ubuntu 22.04, Boost 1.74) | -g, gdb | 5615030-1949881.osm.pbf | No |
| ob | bare (Ubuntu 22.04, Boost 1.74) | -g, gdb | geo.osm.pbf (box around Antimeridian) | No |
| patrick (local) | bare (Ubuntu 22.04, Boost 1.78) | -g, valgrind | 5615030-1949881.osm.pbf | No |
| patrick (local) | bare (Ubuntu 22.04, Boost 1.74) | -g, valgrind | 5615030-1949881.osm.pbf | No |
| patrick (local) | bare (Ubuntu 22.04, Boost 1.74) | -g, valgrind | geo.osm.pbf (box around Antimeridian) | No |
| patrick (local) | bare (Ubuntu 22.04, Boost 1.74) | -g -fsanitize=address | 5615030-1949881.osm.pbf | No |
| patrick (local) | bare (Ubuntu 22.04, Boost 1.74) | -g -fsanitize=address | geo.osm.pbf (box around Antimeridian) | No |
| patrick (local) | bare (Ubuntu 22.04, Boost 1.74) | -g, gdb | 5615030-1949881.osm.pbf | No |
| patrick (local) | bare (Ubuntu 22.04, Boost 1.74) | -g, gdb | geo.osm.pbf (box around Antimeridian) | No |

† Segfault after checking the areas 5615030, 1949881 as described above, possibly during stack cleanup. Exactly reproduced every time (tested 4x)
†† AddressSanitizer failed to allocate 0x1f000 (126976) bytes at address fe31ea9b000 (errno: 12), during dataset loading via libosmium, took 4 days

@patrickbr
Member Author

patrickbr commented Feb 28, 2024

The latest build (now with Boost 1.84) ran through without problems. TLDR: with Boost 1.78, the code segfaulted every time during the comparison of areas 5615030 and 1949881 in a single-threaded environment, but only if nobody looked (it ran through fine with gdb, valgrind, and with the thread sanitizer enabled). Later and earlier Boost versions worked fine.

Closing this now, although I am still not 100% convinced that the cause is not in our code.
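
Since the crash was only ever observed with Boost 1.78, a build could flag that release explicitly. The following is a minimal sketch under that assumption, not something the project is known to do; `isBoost178` is a hypothetical helper. It relies only on the documented encoding of `BOOST_VERSION` (major * 100000 + minor * 100 + patch, so 1.78.0 is 107800).

```cpp
// Returns true if the given BOOST_VERSION value denotes any 1.78.x release.
// Dividing by 100 strips the patch level and leaves major * 1000 + minor.
constexpr bool isBoost178(int boostVersion) {
  return boostVersion / 100 == 1078;
}
```

In a real build this would be called with `BOOST_VERSION` from `<boost/version.hpp>`, for example inside a `static_assert` or a startup warning. Whether Boost 1.78 is actually at fault remains open, as noted above.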
