Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Elixir 1.2: Logger crash when using `inspect` in production with protocols consolidated #4518

Closed
SimonWoolf opened this issue Apr 13, 2016 · 11 comments

Comments

@SimonWoolf
Copy link
Contributor

commented Apr 13, 2016

Wasn't sure whether to report this as not certain it's an elixir/logger bug rather than something I'm doing wrong (tried slack and SO first), but on I figure logger crashing seems likely to be a bug that I should report even if I am doing something odd. If it isn't a bug, many apologies.

So: the first inspect call that happens in production causes a logger crash, claiming that eg Elixir.Inspect.List.inspect is undefined:

=INFO REPORT==== 13-Apr-2016::09:26:01 ===
    application: logger
    exited: shutdown
    type: permanent

=ERROR REPORT==== 13-Apr-2016::09:26:01 ===
Error in process <0.438.0> with exit value:
{[{reason,undef},
  {mfa,{'Elixir.AppName.Web.CatchAll',handle,2}},
  {stacktrace,
      [{'Elixir.Inspect.List',inspect,
           [[],
            #{'__struct__' => 'Elixir.Inspect.Opts',
              base => decimal,
              binaries => infer,
              char_lists => infer,
              limit => 50,
              pretty => false,
              safe => true,
              structs => true,
              width => 80}],
           []},
       {'Elixir.Kernel',inspect,2,[{file,"lib/kernel.ex"},{line,1566}]},
       {'Elixir.AppName.Web.CatchAll',handle,2,
           [{file,"lib/web/catch_all.ex"},{line,11}]},
       {cowboy_handler,handler_handle,4,
           [{file,"src/cowboy_handler.erl"},{line,111}]},
       {cowboy_protocol,execute,4,
           [{file,"src/cowboy_protocol.erl"},{line,442}]}]},
  {req,
      [{socket,#Port<0.8619>},
       {transport,ranch_tcp},
       {connection,keepalive},
       {pid,<0.438.0>},
       {method,<<"GET">>},
       {version,'HTTP/1.1'},
       {peer,{{172,17,0,1},40454}},
       {host,<<"172.17.0.3">>},
       {host_info,undefined},
       {port,6100},
       {path,<<"/badpagr">>},
       {path_info,[<<"badpagr">>]},
       {qs,<<>>},
       {qs_vals,undefined},
       {bindings,[]},
       {headers,
           [{<<"user-agent">>,<<"curl/7.35.0">>},
            {<<"host">>,<<"172.17.0.3:6100">>},
            {<<"accept">>,<<"*/*">>}]},
       {p_headers,[]},
       {cookies,undefined},
       {meta,[]},
       {body_state,waiting},
       {buffer,<<>>},
       {multipart,undefined},
       {resp_compress,false},
       {resp_state,waiting},
       {resp_headers,[]},
       {resp_body,<<>>},
       {onresponse,undefined}]},
  {state,nil}],
 [{cowboy_protocol,execute,4,[{file,"src/cowboy_protocol.erl"},{line,442}]}]}

The code that triggers it is just

Logger.warn "Unknown path: #{:cowboy_req.path(request) |> elem(0)},
  params: #{:cowboy_req.qs_vals(request) |> elem(0) |> inspect}"

Here, :cowboy_req.qs_vals(request) |> elem(0) is just [], so the line that crashes is equivalent to inspect([]).

Elixir version: 1.2.2
Erlang version: 18.2.1
Presumably this is related to Elixir 1.2 consolidating the inspect protocol by default? I'm using build_embedded: Mix.env == :prod, so protocols are being consolidated. (and eg _build/prod/consolidated/Elixir.Inspect.beam is present in the slug).

We're using the heroku elixir buildpack to make slugs, which are then run using docker.

The weird thing is, this crash only happens in production. The same app running locally -- even when using MIX_ENV=prod, running from the same slug that runs in production -- does not have any problems.

Let me know what other details I can provide to help.

Thanks in advance :)

@josevalim

This comment has been minimized.

Copy link
Member

commented Apr 13, 2016

It is failing because it thinks the function Elixir.Inspect.List.inspect/2 is not available. I would look for the following things:

  1. See if Elixir.Inspect.List is present inside Elixir in the slug (or check if it is available from the console)
  2. See if you are not accidentally redefining it
  3. Maybe it is a failed build issue. Try pushing to heroku with the always_rebuild configuration set to true. You can find it here: https://github.com/HashNuke/heroku-buildpack-elixir
@SimonWoolf

This comment has been minimized.

Copy link
Contributor Author

commented Apr 13, 2016

Hi Jose! Thanks for the quick reply :)

See if Elixir.Inspect.List is present inside Elixir in the slug (or check if it is available from the console)

I'm a little confused as to how it can not be, but you're right, it isn't. IO.puts :erlang.function_exported(Elixir.Inspect.List, :inspect, 2) from the app prints true when running the app locally, but false in production.

Not sure how else to check if it's 'present inside Elixir'? Grepping the _build directory for Elixir.Inspect.List produces "Binary file Elixir.Inspect.beam matches", if that helps.

See if you are not accidentally redefining it

I'm not doing anything that redefines any Elixir builtins, so unless I'm missing something, I'm afraid I don't see how I could be. (And as I said, inspect works fine when running the app locally, even with compiling and running with MIX_ENV=prod)

Maybe it is a failed build issue.

I've added always_rebuild=true to the elixir_buildpack.config, but the issue remains. (to be clear I'm not actually using heroku itself, just using the buildpack to build a slug, which is then run with mix run --no-halt by docker).

@josevalim

This comment has been minimized.

Copy link
Member

commented Apr 14, 2016

I'm a little confused as to how it can not be, but you're right, it isn't. IO.puts :erlang.function_exported(Elixir.Inspect.List, :inspect, 2) from the app prints true when running the app locally, but false in production.

This seems to be the root cause. What does Code.ensure_loaded?(Elixir.Inspect.List) return?

Not sure how else to check if it's 'present inside Elixir'?

Elixir.Inspect.List is provided as part of Elixir. I am assuming somewhere in your slug, the Elixir project was downloaded and compiled. I am assuming that Elixir.Inspect.List.beam is either missing from the elixir project (not _build) in your slug or it somehow has a bogus version.

@SimonWoolf

This comment has been minimized.

Copy link
Contributor Author

commented Apr 18, 2016

This seems to be the root cause. What does Code.ensure_loaded?(Elixir.Inspect.List) return?

After some experimentation -- it seems to be random :(

  • Sometimes, Code.ensure_loaded?(Elixir.Inspect.List) returns true. In this case:
    • Subsequent function_exported(Elixir.Inspect.List, :inspect, 2) calls return true
    • all subsequent inspect calls work correctly (whether List, Map, whatever). (After a successful inspect of a type, the relevant function_exported calls start returning true)
  • Sometimes, Code.ensure_loaded?(Elixir.Inspect.List) returns false. In this case:
    • all function_exported(Elixir.Inspect.List, :inspect, 2) calls continue to return false
    • all subsequent inspect calls fail (or if done by Logger, crash logger)

Trying the non-boolean version: I obviously can't IO.inspect the error tuple returned by Code.ensure_loaded if it fails, but since all elements are atoms I can extract and IO.puts them, and: the error returned is {:error, :nofile}.

To be clear, this isn't just nondeterminism between builds, it's between deploys of a single build. I.e. I can deploy the exact same slug (same SHA) several times, and sometimes ensure_loaded will fail with :nofile, and sometimes it'll work.

@josevalim

This comment has been minimized.

Copy link
Member

commented Apr 18, 2016

That's quite surprising.

I can deploy the exact same slug (same SHA) several times

When you mention "deploy" the same slug, does it compile the code every time you deploy or you are referring to the exact same compiled artifacts? Would you be able to retrieve the .beam file for cases where it works and where it fails and upload somewhere so we can analyze them?

Thank you for following up.

@SimonWoolf

This comment has been minimized.

Copy link
Contributor Author

commented Apr 18, 2016

When you mention "deploy" the same slug, does it compile the code every time you deploy or you are referring to the exact same compiled artifacts? Would you be able to retrieve the .beam file for cases where it works and where it fails and upload somewhere so we can analyze them?

Sorry for ambiguity - I mean literally the exact same compiled artifacts. heroku-elixir-buildpack spits out a tarball with everything compiled, that's what I'm referring to by the 'slug'. 'deploying the slug' is just starting a docker container with the slug extracted to /app, and running this script (passing in mix run --no-halt --no-deps-check).

I've put the beam file from a slug up here in case it helps. Since it does sometimes work, I'm guessing everything is in there that should be else it couldn't ever work. :/

Thank you for following up.

No, thank you for helping! I realise this must be frustrating to debug remotely.

@SimonWoolf

This comment has been minimized.

Copy link
Contributor Author

commented Apr 19, 2016

After some more experimentation, turns out this bug only ever happens if the very first call to inspect for some datatype is done from a cowboy handler process.

In other words, the issue completely disappears if I do something like Logger.debug "Some data structures: #{inspect {%{}, [], MapSet.new, 0.1, 1..2, ~r//, :a, <<1>>}}" from one of my apps' application callback functions (on start).

(I've also made sure it wasn't an issue with one of my other deps - I can reproduce the issue with a dummy app with no external deps except cowboy (1.0.0), and no other code except an app callback that starts cowboy and one http handler that does an IO.inspect).

The workaround solves the pressing issue for us, but of course I'm happy to keep trying any experiments you want me to to track the underlying bug down.

@josevalim

This comment has been minimized.

Copy link
Member

commented Apr 19, 2016

This is really helpful. It seems to be related to Erlang's code loading then. When building releases, we should not have such problems, as all modules are preloaded upfront. Great job and thank you!

@josevalim

This comment has been minimized.

Copy link
Member

commented Apr 19, 2016

I will leave this open for now if I get a chance to investigate it later, specially to find out why it happens only with inspect.

@SimonWoolf

This comment has been minimized.

Copy link
Contributor Author

commented May 4, 2016

We think we know why this was happening now, and it was our fault. Not a bug in elixir or erlang. Sorry for the invalid bug report :(

If you're interested:

  • Our deploy process mounts the compiled slug using overlayfs, with the upperdir set to a directory in /tmp that's unique to each docker container, before running that with docker.
  • We apparently have a maintenance script that runs after each deploy that, among other things, removes any files in /tmp with an mtime older than 2 days.

Since the things in /tmp are overlayfs upper directories, this cleanup script didn't affect the underlying slug, which was still correct. But it did mark files as 'deleted' in overlayfs, so they were hidden from the docker container.

Result: random sadness, depending on a combination of when each .beam file was last modified, and which files had already been loaded by the erlang vm by the time the cleanup script ran. (So I guess my workaround only worked because it meant those files were loaded early on - and presumably most other protocols were either exercised early on or else not at all, so the problem only showed up with inspect).

We've now moved the overlayfs upperdirs out of /tmp, and everything is working fine. (Why this hasn't shown up as an issue for us before, who knows..)

Thank you for your help earlier, and apologies again!

@SimonWoolf SimonWoolf closed this May 4, 2016

@josevalim

This comment has been minimized.

Copy link
Member

commented May 4, 2016

I am glad this was fixed! ❤️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
2 participants
You can’t perform that action at this time.