AppSignal NIF is raising a segfault and killing the VM #113

Closed
costaraphael opened this Issue Feb 4, 2017 · 8 comments

@costaraphael
Contributor

costaraphael commented Feb 4, 2017

For the past few days, our production application has been silently shutting down without any error logging or reporting.

Looking at the server syslog, I found these entries very close to the times when the application shut down (it happened twice yesterday):

Feb  3 13:12:56 kernel: [2447327.434316] 2_scheduler[5145]: segfault at 7fb0bf600000 ip 00007fb0c615f8dc sp 00007fb0d55fdb60 error 4 in appsignal_extension.so[7fb0c60e8000+108000]
Feb  3 13:12:56 run_erl[5062]: Erlang closed the connection.
Feb  3 17:56:44 kernel: [2464355.762299] 1_scheduler[28152]: segfault at 7f381c900000 ip 00007f3824c058d6 sp 00007f382f7feb60 error 4 in appsignal_extension.so[7f3824b8e000+108000]
Feb  3 17:56:44 run_erl[28071]: Erlang closed the connection.

My conclusion is that the extension is segfaulting at some unknown point (I don't have stacktraces due to how NIFs operate in the BEAM) and shutting down the whole application.
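Even without a stack trace, the kernel lines above carry some usable information. As a rough sketch (the regex and the function name `decode_segfault` are illustrative, not part of any AppSignal tooling), the instruction pointer minus the mapping base gives the offset inside `appsignal_extension.so`, and the x86 page-fault error code can be split into its low three bits:

```python
import re

# Matches the kernel's x86 segfault report:
# "segfault at ADDR ip IP sp SP error CODE in MODULE[BASE+SIZE]"
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<error>[0-9a-f]+) "
    r"in (?P<module>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Return the faulting module, the offset into it, and the decoded error bits."""
    match = SEGFAULT_RE.search(line)
    if match is None:
        return None
    ip = int(match["ip"], 16)
    base = int(match["base"], 16)
    error = int(match["error"], 16)
    return {
        "module": match["module"],
        # Offset of the faulting instruction from the start of the mapping.
        "offset": hex(ip - base),
        # x86 page-fault error code bits: 0 = page present, 1 = write, 2 = user mode.
        "page_present": bool(error & 1),
        "write": bool(error & 2),
        "user_mode": bool(error & 4),
    }
```

For the two entries above this gives offsets 0x778dc and 0x778d6, only a few bytes apart, which suggests both crashes came from the same piece of code. Error 4 decodes to a user-mode read of a page that is not mapped. Assuming the mapping starts at file offset 0 and a build with debug symbols is available, such an offset could be passed to a tool like `addr2line -e appsignal_extension.so`.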

Wouldn't it be safer if the interaction with the extension went through a Port? Ports fit nicely in a supervision tree, and in a case like this a crash would not bring the entire VM down. Given Erlang/Elixir's focus on building reliable applications that can keep running for ages with almost no downtime, I think this is a pretty big deal.
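The isolation argument can be illustrated outside the BEAM. The following is only an analogy in Python, not the Erlang Port API: native code that crashes inside a separate OS process takes down that process alone, and the parent sees an exit status it can react to. The names `CRASHING_CHILD`, `run_isolated`, and `crashed_with_segfault` are made up for this sketch, and the signal-based exit status assumes a POSIX system.

```python
import signal
import subprocess
import sys

# Child program that kills itself with SIGSEGV, standing in for buggy native code.
CRASHING_CHILD = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"


def run_isolated(code):
    """Run `code` in a separate Python process and return its exit status."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


def crashed_with_segfault(returncode):
    # On POSIX, subprocess reports death by signal as the negated signal number.
    return returncode == -signal.SIGSEGV
```

Calling `crashed_with_segfault(run_isolated(CRASHING_CHILD))` in the parent returns `True` while the parent keeps running, which is roughly the situation a supervisor would be in when a port's external program dies.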

@thijsc

Member

thijsc commented Feb 4, 2017

Very sorry to hear this. We'll investigate this thoroughly. Which version of the package were you running, and on which OS? Did this start after upgrading to a newer version? I'm wondering if our recent changes for Alpine Linux support might have something to do with this.

Could you e-mail the full /tmp/appsignal.log file to support@appsignal.com?

We'd love to run this as a port, but the way ports work is not suitable for the way our monitoring works. The monitoring relies on multiple C calls during a transaction, each of which collects some information without influencing the host VM at all. Ports require serialisation and would rule out this approach. You can read some more about this here: http://docs.appsignal.com/elixir/why-nif.html

@costaraphael

Contributor

costaraphael commented Feb 6, 2017

> Which version of the package were you running, and on which OS?

We are using Ubuntu 14.04.3 LTS and AppSignal 0.11.2

> Did this start after upgrading to a newer version?

I'm not sure how long the shutdowns have been happening, but they probably started after this update.

@thijsc

Member

thijsc commented Feb 6, 2017

Thanks for the info. We're currently doing a full review of our C integration code and will get back to you ASAP.

@thijsc

Member

thijsc commented Feb 6, 2017

By the way, did you see this?

> Could you e-mail the full /tmp/appsignal.log file to support@appsignal.com?

@costaraphael

Contributor

costaraphael commented Feb 6, 2017

Sorry, emailing it now 😅

@thijsc thijsc added this to the Elixir 1.0 milestone Feb 13, 2017

@thijsc

Member

thijsc commented Feb 15, 2017

It was a long road, but we've finally been able to verify that this is fixed. Version 0.11.6 and up don't have this issue. We will post a post-mortem here tomorrow.

@jeffkreeftmeijer

Member

jeffkreeftmeijer commented Feb 17, 2017

What caused this?

There was a bug in the C source code for the integration which was introduced in the 0.10.0 release (released on January 17th). Besides not sending all available data to AppSignal, this bug caused a segfault in an application and was reported twice, after which we were able to reproduce it under the following circumstances.

What happened was that instead of sending a pointer to data to the agent, the C code sent a struct. The agent interpreted that as a None in most cases, causing the setting of data to fail silently. The segfault occurred in the cases where the struct, interpreted as a pointer, accidentally pointed to an existing object in memory.
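To make the mechanism easier to picture, here is a small simulation of what "a struct read as a pointer" can look like. The struct layout is purely hypothetical (a 64-bit length followed by a 64-bit data address, little-endian); the post-mortem does not say which struct was involved, and `as_misread_pointer` is an illustrative name. The sketch only manipulates bytes and never dereferences anything.

```python
import struct

# HYPOTHETICAL layout, not taken from the AppSignal source:
# a 64-bit length followed by a 64-bit data address, little-endian.
HYPOTHETICAL_STRUCT = struct.Struct("<QQ")
POINTER = struct.Struct("<Q")


def as_misread_pointer(length, data_address):
    """Return what a receiver expecting a pointer would see in the struct's first 8 bytes."""
    raw = HYPOTHETICAL_STRUCT.pack(length, data_address)
    (address,) = POINTER.unpack_from(raw)
    # An all-zero pointer is what a receiver would typically treat as "no value".
    return None if address == 0 else address
```

Under this assumed layout, a struct whose first field is zero reads as a null pointer, so the value is silently dropped, while any other first field is treated as an address that was never meant to be one. Dereferencing such an address is where a crash like the one reported here can come from.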

What did we do to fix it?

We found and fixed the bug on February 7th, and immediately released 0.11.6. To make sure this fix would solve the issue, we ran a benchmark with debug symbols on 0.11.2 to reproduce the crash, which let us verify that this bug was actually causing the segfault.

What will we do to prevent this in the future?

  • We’ve added testing hooks to the agent code to be able to test the integration functions more thoroughly
  • We’ll notify all users of versions 0.10.0, 0.11.0, 0.11.1, 0.11.2, 0.11.3, 0.11.4, and 0.11.5 via e-mail to upgrade to 0.11.6 or higher
  • We will yank these versions from hex.pm after giving affected users the chance to update to a more recent version

Last but not least, we'd like to thank @costaraphael and @PragTob for working with us to resolve this issue. ❤️

@PragTob

PragTob commented Feb 17, 2017

👋 Thanks for fixing the problem, the quick responses and for being so open about it! 🎉