Data binning segfaults when data is in device memory on non-unified memory systems #1122

Open
BenWibking opened this issue Mar 28, 2023 · 9 comments

@BenWibking

Data binning works fine when running on CPU. However, when running this actions file

-
  action: "add_queries"
  queries:
    bin_pressure:
      params:
        expression: "binning('dP_over_P','avg', [axis('x',[-50,50]), axis('y', [-50,50]), axis('z', num_bins=64)])"
        name: dP_profile
-
  action: "save_session"

on A100 GPUs, I get a segmentation fault:

Loguru caught a signal: SIGSEGV
Stack trace:
14            0x45bc0e ./build-gpu/bin/athenaPK() [0x45bc0e]
13      0x7f3227bb8493 __libc_start_main + 243
12            0x445c1f ./build-gpu/bin/athenaPK() [0x445c1f]
11            0x737781 ./build-gpu/bin/athenaPK() [0x737781]
10            0x7df6fe ./build-gpu/bin/athenaPK() [0x7df6fe]
9             0x8c32f8 ./build-gpu/bin/athenaPK() [0x8c32f8]
8       0x7f324c485707 ascent::Ascent::execute(conduit::Node const&) + 439
7       0x7f324c4afd30 ascent::AscentRuntime::Execute(conduit::Node const&) + 1072
6       0x7f324c14d0d1 flow::Workspace::execute() + 817
5       0x7f324c66d500 ascent::runtime::filters::BasicQuery::execute() + 672
4       0x7f324c4c78ed ascent::runtime::expressions::ExpressionEval::evaluate(std::string, std::string) + 2717
3       0x7f324c14d0d1 flow::Workspace::execute() + 817
2       0x7f324c57ff8e ascent::runtime::expressions::Binning::execute() + 846
1       0x7f324c57f066 ascent::runtime::expressions::binning_interface(std::string const&, std::string const&, conduit::Node const&, conduit::Node const&, conduit::Node const&, conduit::Node&, conduit::Node&, conduit::Node&) + 2454
0       0x7f324c4ef27b ascent::runtime::expressions::binning(conduit::Node const&, conduit::Node&, std::string const&, std::string const&, double, std::string const&) + 6875
2023-03-28 12:23:58.600 (  32.416s) [main thread     ]                       :0     FATL| Signal: SIGSEGV

I can provide a full reproducer if needed.

@nicolemarsaglia
Contributor

Hey @BenWibking sorry for the delay.

A reproducer would be great and I can do my best to try to help you figure out this issue!

@BenWibking
Author

Thanks. I've put a reproducer here: parthenon-hpc-lab/athenapk#49. Let me know what you find.

@BenWibking
Author

BenWibking commented Mar 31, 2023

@nicolemarsaglia I've rebuilt Ascent + TPLs with debugging info and I get a more informative backtrace. The segmentation fault happens here:

(cuda-gdb) bt
#0  ascent::runtime::expressions::binning (dataset=..., bin_axes=..., reduction_var="Density", reduction_op="avg", empty_bin_val=0,
    component="") at /projects/cvz/bwibking/ascent_debug/ascent/src/libs/ascent/runtimes/expressions/ascent_blueprint_architect.cpp:1649
#1  0x00007fb75d836285 in ascent::runtime::expressions::binning_interface (reduction_var="Density", reduction_op="avg",
    n_empty_bin_val=..., n_component=..., n_axis_list=..., dataset=..., n_binning=..., n_output_axes=...)
    at /projects/cvz/bwibking/ascent_debug/ascent/src/libs/ascent/runtimes/expressions/ascent_expression_filters.cpp:3158
#2  0x00007fb75d836c1a in ascent::runtime::expressions::Binning::execute (this=0xc40cdc0)
    at /projects/cvz/bwibking/ascent_debug/ascent/src/libs/ascent/runtimes/expressions/ascent_expression_filters.cpp:3197
#3  0x00007fb75d0b9ba9 in flow::Workspace::execute (this=0x7ffd1f630098)
    at /projects/cvz/bwibking/ascent_debug/ascent/src/libs/flow/flow_workspace.cpp:303
#4  0x00007fb75d7423a7 in ascent::runtime::expressions::ExpressionEval::evaluate (this=0x7ffd1f630030,
    expr="binning('Density','avg', [axis('x',[-0.5,0.5]), axis('y', [-0.5,0.5]), axis('z', num_bins=64)])",
    expr_name="avg_density_profile")
    at /projects/cvz/bwibking/ascent_debug/ascent/src/libs/ascent/runtimes/ascent_expression_eval.cpp:1534
#5  0x00007fb75d927c35 in ascent::runtime::filters::BasicQuery::execute (this=0xa08c460)
    at /projects/cvz/bwibking/ascent_debug/ascent/src/libs/ascent/runtimes/flow_filters/ascent_runtime_query_filters.cpp:127
#6  0x00007fb75d0b9ba9 in flow::Workspace::execute (this=0x7df7460)
    at /projects/cvz/bwibking/ascent_debug/ascent/src/libs/flow/flow_workspace.cpp:303
#7  0x00007fb75d6fba4f in ascent::AscentRuntime::Execute (this=0x7df6ec0, actions=...)
    at /projects/cvz/bwibking/ascent_debug/ascent/src/libs/ascent/runtimes/ascent_main_runtime.cpp:1831
#8  0x00007fb75d6e1915 in ascent::Ascent::execute (this=0x7ffd1f6318d0, actions=...)
    at /projects/cvz/bwibking/ascent_debug/ascent/src/libs/ascent/ascent.cpp:410
#9  0x000000000085ba78 in parthenon::AscentOutput::WriteOutputFile(parthenon::Mesh*, parthenon::ParameterInput*, parthenon::SimTime*, parthenon::SignalHandler::OutputSignal) ()
#10 0x0000000000777e7e in parthenon::Outputs::MakeOutputs(parthenon::Mesh*, parthenon::ParameterInput*, parthenon::SimTime*, parthenon::SignalHandler::OutputSignal) ()
#11 0x00000000006cfcb1 in parthenon::EvolutionDriver::Execute() ()
#12 0x00000000004419df in main ()
(cuda-gdb) list
1644	//#endif
1645	        for(int i = 0; i < homes_size; ++i)
1646	        {
1647	          if(homes[i] != -1)
1648	          {
1649	            update_bin(bins, homes[i], values[i], reduction_op);
1650	          }
1651	        }
1652	      }
1653	    }

@BenWibking
Author

BenWibking commented Mar 31, 2023

Here's info args:

(cuda-gdb) info args
dataset = @0xc2bec10: {m_parent = 0x0, m_schema = 0xc2bfb20, m_owns_schema = true,
  m_children = std::vector of length 44, capacity 64 = {0xc309760, 0xc2bd670, 0xc2bd710, 0xc30a490, 0xc30a270, 0xc2bf2b0, 0xc2bf430,
    0xc2c1530, 0xc30feb0, 0xc2c4f60, 0xc2c5f20, 0xc2c2130, 0xc2c5ca0, 0xc2c91c0, 0xc2c9d10, 0xc2c8f90, 0xc2c25e0, 0xc2cbba0, 0xc2cbd10,
    0xc2c4bb0, 0xc2ccc90, 0xc2c2c10, 0xc2c7590, 0xc2cabe0, 0xc2c7380, 0xc2d51b0, 0xc2d4000, 0xc2d7020, 0xc355040, 0xc2d8e00, 0xa10ea60,
    0xc2dab80, 0xc2d4e00, 0xc2da950, 0xb841190, 0xc2db6e0, 0xc2de550, 0xc2d8930, 0xc2d5ea0, 0xc2e17a0, 0xc2e25d0, 0xc3d98b0, 0xc2e42f0,
    0xc2e5290}, m_data = 0x0, m_data_size = 0, m_alloced = false, m_mmaped = false, m_mmap = 0x0, m_allocator_id = 0}
bin_axes = @0x7ffd1f62ea80: {m_parent = 0x0, m_schema = 0xc413880, m_owns_schema = true,
  m_children = std::vector of length 3, capacity 4 = {0xc40fb70, 0xc40f570, 0xc40fa70}, m_data = 0x0, m_data_size = 0,
  m_alloced = false, m_mmaped = false, m_mmap = 0x0, m_allocator_id = 0}
reduction_var = "Density"
reduction_op = "avg"
empty_bin_val = 0
component = ""

And info locals:

(cuda-gdb) info locals
i = 26
values = {m_data = 0x7fb6ee6f9280, m_dtype = {m_id = 12, m_num_ele = 1728, m_offset = 0, m_stride = 8, m_ele_bytes = 8,
    m_endianness = 0}}
comp_path = ""
values_path = "fields/Density/values"
dom = @0xc309760: {m_parent = 0xc2bec10, m_schema = 0xc3096f0, m_owns_schema = false,
  m_children = std::vector of length 4, capacity 4 = {0xc309910, 0xc30a720, 0xc30ada0, 0xc30c280}, m_data = 0x0, m_data_size = 0,
  m_alloced = false, m_mmaped = false, m_mmap = 0x0, m_allocator_id = 0}
n_homes = {m_parent = 0x0, m_schema = 0xc413bc0, m_owns_schema = true, m_children = std::vector of length 0, capacity 0,
  m_data = 0xc4141f0, m_data_size = 6912, m_alloced = true, m_mmaped = false, m_mmap = 0x0, m_allocator_id = 0}
homes = 0xc4141f0
homes_size = 1728
dom_index = 0
var_names = std::vector of length 4, capacity 6 = {"x", "y", "z", "Density"}
topo_and_assoc = @0x7ffd1f62d800: {m_parent = 0x0, m_schema = 0xc411040, m_owns_schema = true,
  m_children = std::vector of length 2, capacity 2 = {0xc2cc2e0, 0xc410de0}, m_data = 0x0, m_data_size = 0, m_alloced = false,
  m_mmaped = false, m_mmap = 0x0, m_allocator_id = 0}
topo_name = "topo"
assoc_str = "element"
bounds = @0x7ffd1f62d760: {m_parent = 0x0, m_schema = 0xc4112a0, m_owns_schema = true,
  m_children = std::vector of length 2, capacity 2 = {0xc410fe0, 0xc413820}, m_data = 0x0, m_data_size = 0, m_alloced = false,
  m_mmaped = false, m_mmap = 0x0, m_allocator_id = 0}
min_coords = 0xc2c4370
max_coords = 0xc2c0b20
axes = {{"x", "i", "dx"}, {"y", "j", "dy"}, {"z", "k", "dz"}}
num_axes = 3
num_bins = 64
num_bin_vars = 2
bins_size = 128
bins = 0x6e99a90
mpi_comm = 0x7ffd1f62e240
global_bins = 0x10000000c2bec10
res = {m_parent = 0x0, m_schema = 0xc413880, m_owns_schema = false,
  m_children = std::vector of length -17553172076876, capacity -17553146376764 = {0x458b48c389481aeb, 0xf528ede8c78948e8,
    0xe8c78948d88948ff, 0xf85d8b48fff59272, 0x4853e5894855c3c9, 0x48e87d894818ec83, 0x48e8458b48e07589, 0x48fff4b0ffe8c789,
    0x53e8c78948e8458b, 0x48e0558b48fff4a6, 0x8948d68948e8458b, 0x1aebfff58e20e8c7, 0x48e8458b48c38948, 0x48fff5288fe8c789,
    0x9214e8c78948d889, 0xc3c9f85d8b48fff5, 0xec834853e5894855, 0x758948e87d894818, 0xc78948e8458b48e0, 0x458b48fff4b0a1e8,
    0xf4a5f5e8c78948e8, 0x458b48e0558b48ff, 0xe8c78948d68948e8, 0x89481aebfff49e32, 0xc78948e8458b48c3, 0xd88948fff52831e8,
    0xfff591b6e8c78948, 0x4855c3c9f85d8b48, 0x4848ec834853e589, 0x48b0758948b87d89, 0x43e8c78948b8458b, 0x48b8458b48fff4b0,
    0x48fff4a597e8c789, 0x1be8c78948ef458d, 0x48ef558d48fff592, 0x48c0458d48b04d8b, 0x4d94e8c78948ce89, 0x8b48c0558d48fff5,
    0xc78948d68948b845, 0x458d48fff49db1e8, 0xf4d5f5e8c78948c0, 0xc78948ef458d48ff, 0x483cebfff51dd9e8, 0x8948c0458d48c389,
    0x3ebfff4d5d8e8c7, 0x48ef458d48c38948, 0xebfff51db7e8c789, 0xb8458b48c3894803, 0xfff52776e8c78948, 0xfbe8c78948d88948,
    0xc9f85d8b48fff590, 0x8348e589485590c3, 0x8b48f87d894810ec, 0x5a9ce8c78948f845, 0x8948f8458b48fff5, 0xc990fff52740e8c7,
    0x8348e589485590c3, 0x8b48f87d894810ec, 0x5a74e8c78948f845, 0x485590c3c990fff5, 0xec8348535441e589, 0x758948b87d894840,
    0xc78948b8458b48b0, 0xef45c6fff483d1e8, 0xc78948b0458b4800, 0x458948fff4fe81e8, 0x5e7501d87d8348d8, 0xe8c78948b8458b48, 0x1ef45c6fff4ab0a, 0xe8c78948b0458b48, 0x48c38948fff4fb4a, 0xdbe8c78948b8458b, 0x8948de8948fff4bf, 0x8b48fff561a0e8c7, 0x94e4e8c78948b045, 0x458b48c38948fff5, 0xf48ae5e8c78948b8, 0xe8c78948de8948ff, 0x83482cebfff5196a, 0x458b48127502d87d, 0xf4a845e8c78948b8, 0x4813eb01ef45c6ff, 0x48b8458b48b0558b, 0x68bce8c78948d689, 0x840f00ef7d80fff5, 0xb8458b48000000af, 0xfff4d756e8c78948, 0xb0458b48d0458948, 0xfff4f0a6e8c78948, 0xe045c748c8458948, 0x8b4856eb00000000, 0x8948c8458b48e055, 0xf4cf75e8c78948d6, 0x40bf208b4cff, 0x8948fff51148e800, 0xe8df8948e6894cc3, 0xc05d8948fff5186a, 0xb8558b48c0458b48, 0xc0558d4838508948, 0x48d68948d0458b48, 0x48fff556b7e8c789, 0xc8458b4801e04583, 0xfff571e6e8c78948, 0x84c0920fe0453948, 0xc4894916eb9375c0, 0xfff50a0ee8df8948, 0x33e8c78948e0894c, 0x40c4834890fff58f, 0x485590c35d5c415b, 0x894810ec8348e589, 0x8b48f0758948f87d, 0x824ce8c78948f845, 0x8948f8458b48fff4, 0x8b48fff48530e8c7, 0x8948f0558b48f845, 0xf5518de8c78948d6, 0xe5894855c3c990ff, 0xf87d894810ec8348, 0xf8458b48f0758948, 0xfff4820ee8c78948, 0xe8c78948f0458b48, 0x1f88348fff4fcc2, 0x480e74c084c0940f, 0x4be8c78948f8458b, 0x458b4823ebfff4a9, 0xf4fc9de8c78948f0, 0xc0940f02f88348ff, 0xf8458b480c74c084, 0xfff4a6c6e8c78948, 0xf0558b48f8458b48, 0x43e8c78948d68948, 0x4855c3c990fff567, 0x894810ec8348e589, 0x8b48f0758948f87d, 0x8194e8c78948f845, 0x8b48f0558b48fff4, 0xc78948d68948f845, 0xc3c990fff4fe71e8, 0x10ec8348e5894855, 0xf0758948f87d8948, 0xf0453b48f8458b48, 0x8b48f0558b481374, 0xc78948d68948f845, 0x458b48fff4d771e8, 0xe589485590c3c9f8, 0xf87d894810ec8348, 0xf0558b48f0758948, 0x48d68948f8458b48, 0x48fff589d7e8c789, 0x485590c3c9f8458b, 0x894810ec8348e589, 0x8b48f0758948f87d, 0x8948f8458b48f055, 0xf49a1de8c78948d6, 0x90c3c9f8458b48ff, 0x40ec8348e5894855, 0xf845c748c87d8948, 0xc8458b4800000000, 0xfff4fb96e8c78948, 0xf07d8348f0458948, 0x2f07d8348077401, 0x8948c8458b487275, 
0x8948fff4ee58e8c7, 0x8948e8458b48e845, 0x8948fff486d8e8c7, 0xd8458d4827ebd845, 0xfff54356e8c78948, 0x95e8c78948008b48, 0x48f8450148ffffff, 0x2be8c78948d8458d, 0x48e8458b48fff519, 0x48fff5754fe8c789, 0x48e0558d48e04589, 0x8948d68948d8458d, 0xc084fff55598e8c7, 0xf07d834817ebb275, 0x48c8458b48107400, 0x48fff573dfe8c789, 0xc9f8458b48f84589, 0x8348e589485590c3, 0xc748c87d894840ec, 0x8b4800000000f845, 0xfad4e8c78948c845, 0x8348f0458948fff4, 0x7d8348077401f07d, 0xc8458b48727502f0, 0xfff4ed96e8c78948, 0xe8458b48e8458948, 0xfff48616e8c78948, 0x8d4827ebd8458948, 0x4294e8c78948d845, 0xc78948008b48fff5, 0x450148ffffff95e8, 0xc78948d8458d48f8, 0x458b48fff51869e8, 0xf5748de8c78948e8, 0x558d48e0458948ff, 0xd68948d8458d48e0, 0xfff554d6e8c78948, 0x834817ebb275c084...}, m_data = 0x7ffd1f630130, m_data_size = 205600896, m_alloced = 48, m_mmaped = 234, m_mmap = 0x7fb754b039c7 <conduit::Node::init_defaults()+93>, m_allocator_id = 140725130029616}
res_bins = 0x26bbb40 <ompi_mpi_comm_world>

@BenWibking
Author

I've uploaded the core files here: https://cloudstor.aarnet.edu.au/plus/s/hTgYZQWYDYTPZn9

@nicolemarsaglia
Contributor

Thanks for the info! I'm a tad sick so I'm taking the rest of the day (sorry!), but I can get back to this on Monday.

@cyrush added the bug label on Apr 8, 2023

@BenWibking
Author

This is a very strange bug that I cannot reproduce on either Frontier or Summit. Somehow it appears to only happen on A100s.

@BenWibking changed the title from "Data binning segmentation fault when run on GPU" to "Data binning segfaults when data is in device memory on non-unified memory systems" on Apr 27, 2023

@BenWibking
Author

OK, I've traced the issue: the binning operation runs on the CPU and attempts to dereference a device pointer, because our code passes device-resident data to Ascent via zero-copy. This works on systems with unified memory, such as Summit and Frontier, but fails on systems without it.
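
To illustrate the failure mode, here is a minimal C++ sketch of a host-side binning loop with the same shape as the one at ascent_blueprint_architect.cpp:1649. The `update_bin` body, the [sum, count] bin layout, and the `bin_on_host` and `bin_average` names are assumptions made for illustration, not Ascent's actual implementation. The point is that `values[i]` is read by the CPU, so `values` must be host-accessible. On a system without unified memory, a device pointer passed via zero-copy would first have to be copied into a host buffer (for example with `cudaMemcpy`) before a loop like this could run; a `std::vector` stands in for that host copy here.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for update_bin. Assumes "avg" keeps two slots per
// bin, laid out as [sum, count] pairs; Ascent's real layout may differ.
void update_bin(double *bins, int bin, double value, const std::string &op)
{
  if(op == "avg")
  {
    bins[2 * bin] += value;    // running sum for this bin
    bins[2 * bin + 1] += 1.0;  // number of samples in this bin
  }
}

// Same loop shape as the gdb listing. homes and values are dereferenced on
// the CPU, so both must point to host-accessible memory.
void bin_on_host(const int *homes, const double *values, int n,
                 double *bins, const std::string &op)
{
  assert(homes != nullptr && values != nullptr && bins != nullptr);
  for(int i = 0; i < n; ++i)
  {
    if(homes[i] != -1)  // entries with home -1 are skipped, as in the listing
    {
      update_bin(bins, homes[i], values[i], op);
    }
  }
}

// Convenience wrapper (hypothetical): bins host-resident values and returns
// per-bin averages. Empty bins are left at 0.
std::vector<double> bin_average(const std::vector<int> &homes,
                                const std::vector<double> &host_values,
                                int num_bins)
{
  assert(homes.size() == host_values.size());
  std::vector<double> bins(2 * num_bins, 0.0);
  bin_on_host(homes.data(), host_values.data(),
              static_cast<int>(homes.size()), bins.data(), "avg");
  std::vector<double> avg(num_bins, 0.0);
  for(int b = 0; b < num_bins; ++b)
  {
    if(bins[2 * b + 1] > 0.0)
    {
      avg[b] = bins[2 * b] / bins[2 * b + 1];
    }
  }
  return avg;
}
```

For example, `bin_average({0, 1, -1, 1}, {2.0, 4.0, 100.0, 6.0}, 3)` averages 4.0 and 6.0 into bin 1 and ignores the value whose home is -1. Passing a raw device pointer as `values` to `bin_on_host` is the case that faults on systems without unified memory.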

@cyrush
Member

cyrush commented May 5, 2023

Thanks for confirming this behavior. We will work to resolve these limits for binning.
