Memory corruption with Thread on musl #8738

Open
straight-shoota opened this issue Feb 4, 2020 · 6 comments

@straight-shoota
Member

The test_alpine job fails in some workflows on master (Example: https://circleci.com/gh/crystal-lang/crystal/37573) but not every time (Example: https://circleci.com/gh/crystal-lang/crystal/37643).
The error happens while running the stdlib specs with a newly built compiler. The error message is Failed to raise an exception: END_OF_STACK, so it's not clear what the issue is. I'd assumed #8735 would improve that, but on that branch the error message is identical: https://circleci.com/gh/crystal-lang/crystal/37656

@Blacksmoke16
Member

Blacksmoke16 commented Feb 4, 2020

Maybe #8743?

EDIT: NVM.

@bcardiff
Member

bcardiff commented Feb 7, 2020

It fails often. Either we find a fix soon or we should turn off the CI for alpine. It's getting noisy.

@straight-shoota
Member Author

straight-shoota commented Feb 7, 2020

Running against master with #8743, the error message is now: Error while trying to determine if a stack overflow has occurred. Probable memory corrpution.

The error seems to occur in different thread specs, I've found these locations:

  • spec/std/thread/condition_variable_spec.cr:17
  • spec/std/thread/mutex_spec.cr:5
  • spec/std/thread/thread_spec.cr:30
  • spec/std/thread/thread_spec.cr:4
  • spec/std/thread/thread_spec.cr:11

The location isn't consistent, and neither is whether the error occurs at all. When running only the thread specs (--example Thread), the failure rate is about 50%. I didn't tally failures when running the entire spec suite, but it feels like it fails more often.
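A failure rate like the one described can be tallied with a small loop instead of by feel. This is only a sketch: the helper name count_failures is made up for illustration, and the spec binary path in the usage comment is an assumption, not something stated in this thread.

```shell
#!/bin/sh
# count_failures RUNS CMD [ARGS...]
# Hypothetical helper: runs CMD RUNS times and prints "<failures>/<runs>",
# counting every non-zero exit status as a failure.
count_failures() {
  runs=$1
  shift
  failures=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    # Output is discarded; only the exit status is tallied.
    "$@" >/dev/null 2>&1 || failures=$((failures + 1))
    i=$((i + 1))
  done
  echo "$failures/$runs"
}

# Example invocation (the binary path is an assumption):
# count_failures 20 ./std_spec --example Thread
```

Because only the exit status is inspected, a crash from signal 11 counts the same as an ordinary spec failure.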

When running with strace (and --example Thread), the error occurs every single time. But it's always in one of the first specs of thread_spec.cr (most frequently the first one) which never failed when running the entire spec suite without strace.

Those are the traces:

write(4, "allows passing an argumentless f"..., 45allows passing an argumentless fun to execute) = 45
mmap(NULL, 143360, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f91579cf000
mprotect(0x7f91579d1000, 135168, PROT_READ|PROT_WRITE) = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1 RT_2], [], 8) = 0
clone(child_stack=0x7f91579f1a88, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID|0x400000, parent_tid=[13], tls=0x7f91579f1b20, child_tidptr=0x7f915ce7c31c) = 13
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
Error while trying to determine if a stack overflow has occurred. Probable memory corrpution
futex(0x7f91579f1b60, FUTEX_WAIT_PRIVATE, 1, NULLInvalid memory access (signal 11) at address 0x40
[0x55735e770d06] *CallStack::print_backtrace:Int32 +118
[0x55735e19c9dc] __crystal_sigfault_handler +348
[0x55735feb7db4] sigfault_handler +40
[0x7f915ce2d0d4] ???
) = ?
write(4, "raises inside thread and gets it"..., 40raises inside thread and gets it on join) = 40
mmap(NULL, 143360, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fca6bc6a000
mprotect(0x7fca6bc6c000, 135168, PROT_READ|PROT_WRITE) = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1 RT_2], [], 8) = 0
clone(child_stack=0x7fca6bc8ca88, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID|0x400000, parent_tid=[14], tls=0x7fca6bc8cb20, child_tidptr=0x7fca7111731c) = 14
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
futex(0x7fca6bc8cb60, FUTEX_WAIT_PRIVATE, 1, NULLError while trying to determine if a stack overflow has occurred. Probable memory corrpution
Invalid memory access (signal 11) at address 0x40
[0x55b6cf4c7d06] *CallStack::print_backtrace:Int32 +118
[0x55b6ceef39dc] __crystal_sigfault_handler +348
[0x55b6d0c0edb4] sigfault_handler +40
[0x7fca710c80d4] ???
) = ?
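
As a reading aid, the memory layout in the first trace can be worked out from the numbers it prints. The sketch below only does arithmetic on values copied from that trace. Treating the mapping as the new thread's stack is an interpretation (clone's child_stack points into it), not something stated in this thread.

```shell
#!/bin/sh
# Values copied verbatim from the first trace.
map_base=0x7f91579cf000     # mmap return value
map_len=143360              # mmap length, mapped PROT_NONE
rw_base=0x7f91579d1000      # mprotect address
rw_len=135168               # mprotect length, PROT_READ|PROT_WRITE
child_stack=0x7f91579f1a88  # clone child_stack

# Bytes left PROT_NONE at the low end of the mapping.
guard=$(( $rw_base - $map_base ))
# 1 if the writable region runs exactly to the end of the mapping, else 0.
ends_match=$(( $rw_base + $rw_len == $map_base + $map_len ))
# Writable bytes between the start of the writable region and child_stack.
below_sp=$(( $child_stack - $rw_base ))

echo "inaccessible bytes at start of mapping: $guard"
echo "writable region ends at mapping end: $ends_match"
echo "writable bytes below child_stack: $below_sp"
```

The second trace uses the same lengths, so the same arithmetic applies with its addresses substituted.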

@straight-shoota
Member Author

To reduce noise we could temporarily disable the specs in question for musl/alpine.

@RX14
Contributor

RX14 commented Feb 9, 2020

Can we at least disable the emails? Please?

@straight-shoota
Member Author

The specs have been disabled to remove noise on CI jobs (#8801). But the original issue still needs to be fixed.

@straight-shoota changed the title from "[CI] test_alpine occasionally fails on master" to "Memory corruption with Thread on musl" on Feb 17, 2020