Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removing assertion while reading/appending dwarf #5553

Conversation

thaystg
Copy link

@thaystg thaystg commented Jan 9, 2023

While loading dwarf from this wasm file attached here in dotnet.zip, I got this error:

thread 'main' panicked at 'assertion failed: range_start < range_end', crates\cranelift\src\debug\transform\expression.rs:689:13

I tried to remove the assert and check if without it the debugging experience on lldb works correctly and as far as I saw in my tests, everything is working fine after removing the assert.

dotnet.zip

If you think this is not the correct fix, please let me know, I can close the PR and open an issue.

@thaystg thaystg changed the title Removing assert while reading/appending dwarf Removing assertion while reading/appending dwarf Jan 9, 2023
@jameysharp
Copy link
Contributor

This bug has previously been reported as #3999 as well as several duplicates, but we haven't had any suggestions for how to fix it before. Thanks for this PR!

I don't know whether removing this assert is correct. It might be! But I think the effect is that the for i in (i..j).rev() loop below doesn't run in that case because j will also be less than i. I'm suspicious that removing the assert may just be hiding the bug.

As noted in other issues about this bug:

We currently don't have anyone on the project who understands our DWARF-handling code well and has time to work on it; so while this is definitely a bug, it's not likely to have a fast resolution. We do have an intent to have someone eventually focus on this, as priorities allow, so we should keep this issue open. Just wanted to give some context on the current situation...

@thaystg
Copy link
Author

thaystg commented Jan 10, 2023

Minimal repro:

#include <stdio.h>

void
sample_func (int type, int parm)
{
    printf("sample_func\n");
    switch (type){
    case 1:
        if (type + parm!= 50)
            break;
    case 2:
        if (type + parm == 10 || type + parm == 15)
            return;
        break;
    default:
        break;
    }
}

int main() {
    int i = 2;
    sample_func (i, i);
    return 0;
}

Steps to Reproduce

wasi-sdk-14.0/bin/clang -g -O0 test.c -o test.wasm
lldb -- wasmtime run -g test.wasm   

@thaystg
Copy link
Author

thaystg commented Jan 10, 2023

This bug has previously been reported as #3999 as well as several duplicates, but we haven't had any suggestions for how to fix it before. Thanks for this PR!

I don't know whether removing this assert is correct. It might be! But I think the effect is that the for i in (i..j).rev() loop below doesn't run in that case because j will also be less than i. I'm suspicious that removing the assert may just be hiding the bug.

As noted in other issues about this bug:

We currently don't have anyone on the project who understands our DWARF-handling code well and has time to work on it; so while this is definitely a bug, it's not likely to have a fast resolution. We do have an intent to have someone eventually focus on this, as priorities allow, so we should keep this issue open. Just wanted to give some context on the current situation...

Can't we just print a warning, remove the assertion and exit the function?
It would be better then assertion probably, because we are still able to debug if we skip this "wrong" label.

@jameysharp
Copy link
Contributor

Can't we just print a warning, remove the assertion and exit the function? It would be better then assertion probably, because we are still able to debug if we skip this "wrong" label.

@alexcrichton and @fitzgen, do either of you have opinions on this suggestion?

If you can spare some more time to investigate this bug, I'd be happier if we better understood its cause.

  • Looking at process_label, I see that it's FunctionFrameInfo::value_ranges which has an invalid range for some label.
  • As far as I can tell, the values in that map are only set by get_function_frame_info.
  • That comes from CompiledFunction::value_labels_ranges in the wasmtime-cranelift crate.
  • That comes from CompiledCodeBase::value_labels_ranges in the cranelift-codegen crate.
  • That comes from target-specific code implementing TargetIsa::compile_function.
  • Every backend implements that by calling VCode::emit and getting the EmitResult::value_labels_ranges field.
  • That's computed by VCode::compute_value_labels_ranges.

At that point there's a somewhat complex interaction between the inst_offsets array constructed in cranelift-codegen and the debug_locations array constructed by regalloc2.

I think a good first troubleshooting step would be to assert that from_offset < to_offset before pushing a new ValueLocRange in VCode::compute_value_labels_ranges. Hopefully that assert fails instead of the assert in process_label; otherwise we'd need to investigate more to understand where these ranges are coming from.

Then we could check why these ranges are out of order. I think the only possibilities are either that regalloc.debug_locations has pairs that are out of order, or inst_offsets is out of order. If it's the former then we can switch to investigating regalloc2; otherwise we can look at VCode::emit.

Either way we'd know more than we know now and be closer to fixing this particular bug.

@fitzgen
Copy link
Member

fitzgen commented Jan 10, 2023

Can't we just print a warning, remove the assertion and exit the function? It would be better then assertion probably, because we are still able to debug if we skip this "wrong" label.

@alexcrichton and @fitzgen, do either of you have opinions on this suggestion?

I am generally very suspicious of sweeping-the-bugs-under-the-rug kind of thing. If it was a bug that was purely within the DWARF transform, then I would be more amenable to it since that code as-is has no future and needs a rewrite, but since it looks like this is bottoming out in Cranelift, I think it would be better to debug the Cranelift issue like you laid out.

@alexcrichton
Copy link
Member

I personally know so little about the DWARF transformations Wasmtime does that I don't think I'd have much to offer over what @jameysharp has already mentioned.

@pavelsavara
Copy link

The dwarf parser in chrome doesn't fail on it.

Speculations (I know little about all this yet):
Maybe if we add assert earlier in the wasmtime pipeline, to validate input dwarf, we could learn that the flipped start/end is in the input by LLVM ?
And maybe it's how it should be, for break, which is a jmp to label, but in wasm it's some nested block for some reason ?

@jameysharp
Copy link
Contributor

The dwarf parser in chrome doesn't fail on it.

Speculations (I know little about all this yet): Maybe if we add assert earlier in the wasmtime pipeline, to validate input dwarf, we could learn that the flipped start/end is in the input by LLVM ? And maybe it's how it should be, for break, which is a jmp to label, but in wasm it's some nested block for some reason ?

That sounds possible to me but, like you, I know little about all of this.

I think it's unlikely, if I'm reading the Wasmtime/Cranelift source correctly. I think the point of the process_label function is to combine Cranelift's data with the input DWARF from LLVM, but I think this assert is only looking at data that was generated by Cranelift and regalloc2, not from the input DWARF. So I think it's most likely that this is detecting a bug in Cranelift or regalloc2.

@brendandburns
Copy link
Contributor

fwiw, when I kicked on this a little bit, llvm-dwarfdump is perfectly happy to read out the dwarf data, so at some level, the data in the wasm file is "legal" as far as llvm is concerned, but I will also mention that when I tried exactly this fix, it just caused a panic in a different place (for my WASM file at least)

@squillace
Copy link

I've asked @thaystg to have a look at this again if she has time, @jameysharp, per our conversation the other day about debugging with the crew.

@william-stacken
Copy link

william-stacken commented Mar 23, 2023

fwiw, when I kicked on this a little bit, llvm-dwarfdump is perfectly happy to read out the dwarf data, so at some level, the data in the wasm file is "legal" as far as llvm is concerned, but I will also mention that when I tried exactly this fix, it just caused a panic in a different place (for my WASM file at least)

Same here, removing the assertion did not fix the problem, but changing the if statement above to continue if range_start is greater than range_end (not just when it is equal) did fix it and I am able to finally debug my Wasm module. Given that this bug is severe and that developers have complained about it for a long time, and that no one knows the cause making this bug difficult to fix for a long time, I think that it should be "swept-under-the-rug" ASAP. Then the root cause for the bug can be fixed later. At least this is better than debugging Wasm modules not being possible.

@bjorn3
Copy link
Contributor

bjorn3 commented Mar 23, 2023

but changing the if statement above to continue if range_start is greater than range_end (not just when it is equal) did fix it and I am able to finally debug my Wasm module.

I'm pretty sure that can make it impossible to single step at several places and will cause incorrect assignments from machine instructions to source lines.

@william-stacken
Copy link

but changing the if statement above to continue if range_start is greater than range_end (not just when it is equal) did fix it and I am able to finally debug my Wasm module.

I'm pretty sure that can make it impossible to single step at several places and will cause incorrect assignments from machine instructions to source lines.

I was able to both single step and set breakpoints, and I didn't observe any strange instruction to source line assignments. Now I know absolutely nothing about dwarf, and I'm sure you're right that there will be bugs. But from my perspective an unstable debugger is better than no debugger at all. At least for the time being.

@alexcrichton
Copy link
Member

In the time since this PR was opened I think #6931 addresed the underlying issue. I think this is no longer necessary so I'm going to close this, but please let me know if I'm wrong.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants