Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition over current working directory in parallel compiler #7699

Closed
potatosalad opened this issue May 20, 2018 · 9 comments
Closed

Comments

@potatosalad
Copy link
Contributor

Environment

  • Linux (CircleCI and Travis)
    • Erlang/OTP 20.3.1 + Elixir 1.6.4 (elixir:1.6.4-alpine)
    • Erlang/OTP 20.3.1 + Elixir 1.6.5
  • macOS 10.13.4 (Darwin 17.5.0)
    • Erlang/OTP 20.2 + Elixir 1.6.2
    • Erlang/OTP 20.3 + Elixir 1.6.5
    • Erlang/OTP 20.3 + Elixir master-6fd0cd7
  • SmartOS joyent_20180510T153535Z
    • Erlang/OTP 20.3 + Elixir 1.6.5
    • Erlang/OTP 20.3 + Elixir master-6fd0cd7

Current behavior

A mix project with sentry as a dependency will occasionally fail to compile with the following exception (roughly ~5% of the time on Elixir master-6fd0cd7):

== Compilation error in file lib/sentry/http_client.ex ==
** (Code.LoadError) could not load /root/sentry_bug_232/deps/metrics/lib/sentry/http_client.ex
    (elixir) lib/code.ex:1136: Code.find_file/2
    (elixir) lib/code.ex:901: Code.compile_file/2
    (elixir) lib/kernel/parallel_compiler.ex:220: anonymous fn/5 in Kernel.ParallelCompiler.spawn_workers/6
could not compile dependency :sentry, "mix compile" failed. You can recompile this dependency with "mix deps.compile sentry", update it with "mix deps.update sentry" or clean it with "mix deps.clean sentry"

TL;DR There's a race condition over the current working directory competing with the expected working directory as used by Kernel.ParallelCompiler. See "Examples" below.

The compilation race condition while having getsentry/sentry-elixir as a dependency was found a while ago as reported in this issue: getsentry/sentry-elixir#232 (see also this forum post).

As partially pointed out by @michaelstalker in this comment, the race condition occurs between two separate processes fighting for control of the current working directory.

Sentry's problem is this call to Mix.Dep.loaded/1 which causes File.cd!/2 to be called for its rebar and rebar3 dependencies (see lib/mix/lib/mix/dep/loader.ex#L300). Occasionally at the same time, Code.find_file/2 attempts to lookup a file based on the current working directory which may be wrong.

I originally put together a demonstration mix project, but the race condition only occurred ~5% of the time and can be difficult to see what the actual problem is.

So, I put together potatosalad/elixir_uncompilable which has a much higher chance of exposing the race condition in Kernel.ParallelCompiler.

More details are provided below in the examples.

Examples

Simplified non-concurrent example
:ok = File.mkdir_p!("a")
:ok = File.mkdir_p!("b")
:ok = File.write!("a/foo", "bar")
File.cd!("a", fn -> File.cd!("../b", fn -> File.read!(Path.expand("foo")) end) end)
Simplified concurrent example
:ok = File.mkdir_p!("a")
:ok = File.mkdir_p!("b")
:ok = File.write!("a/foo", "bar")

spawn(fn ->
  parent = self()
  tag = make_ref()
  dir = File.cwd!()
  :ok = File.cd!("a")

  {pid1, mon1} =
    spawn_monitor(fn ->
      File.cd!("../b", fn ->
        send(parent, tag)

        receive do
          ^tag -> :ok
        end

        Process.sleep(1000)
      end)
    end)

  send(pid1, tag)

  receive do
    ^tag -> :ok
  end

  {pid2, mon2} =
    spawn_monitor(fn ->
      file = Path.expand("./foo")

      if File.regular?(file) do
        IO.puts("found #{file}")
      else
        IO.puts("not found #{file}")
      end
    end)

  receive do
    {:DOWN, ^mon1, :process, ^pid1, _reason} ->
      receive do
        {:DOWN, ^mon2, :process, ^pid2, _reason} ->
          IO.puts("done")
          File.cd!(dir)
          exit(:normal)
      end
  end
end)
Simplified mix project that is designed to force the race condition to occur

See potatosalad/elixir_uncompilable (example of failing build on travis).

The resulting exception differs slightly by version of Elixir.

For example, with Elixir 1.6.5:

$ MIX_DEBUG=1 elixir --erl '+S 2:2' -S mix compile
== Compilation error in file lib/elixir_uncompilable/b.ex ==
** (MatchError) no match of right hand side value: {:error, :enoent}
    (elixir) lib/kernel/parallel_compiler.ex:198: anonymous fn/4 in Kernel.ParallelCompiler.spawn_workers/6
** (MatchError) no match of right hand side value: {:error, :bad_directory}
    (mix) lib/mix/tasks/compile.all.ex:19: anonymous fn/1 in Mix.Tasks.Compile.All.run/1
    (mix) lib/mix/tasks/compile.all.ex:34: Mix.Tasks.Compile.All.with_logger_app/1
    (mix) lib/mix/task.ex:314: Mix.Task.run_task/3
    (mix) lib/mix/tasks/compile.ex:90: Mix.Tasks.Compile.run/1
    (mix) lib/mix/task.ex:314: Mix.Task.run_task/3
    (mix) lib/mix/cli.ex:80: Mix.CLI.run_task/2

Compared with master-ish (elixir-lang/elixir@6fd0cd7):

$ MIX_DEBUG=1 elixir --erl '+S 2:2' -S mix compile
== Compilation error in file lib/elixir_uncompilable/b.ex ==
** (Code.LoadError) could not load /tmp/lib/elixir_uncompilable/b.ex
    (elixir) lib/code.ex:1136: Code.find_file/2
    (elixir) lib/code.ex:901: Code.compile_file/2
    (elixir) lib/kernel/parallel_compiler.ex:221: anonymous fn/5 in Kernel.ParallelCompiler.spawn_workers/6
Demonstration mix project that first uncovered the race condition

See potatosalad/sentry_bug_232 (occurs roughly 5% of the time when schedulers are restricted to 2:2).

See lib/sentry/event.ex#L42-L46 for the call-that-caused-it-all.

Expected behavior

$ mix compile 2>&1 > /dev/null
$ echo $?
0

Possible solutions

First off, let's get the obvious one out of the way: change the sentry library to no longer call Mix.Dep.loaded(env: Mix.env()) during compilation.

However, I think this still exposes an interesting race condition that can and probably does occur with other libraries that either directly or indirectly cause the current working directory to change while in the middle of compilation.

The solution is a tricky one as some of the operations of mix and compilation are heavily reliant on cwd in general (which cannot be made thread-safe as far as I understand; see chdir(2)).

For example, OS-level processes could be improved by adding {:cd, "/path/to/dir"} as one of the options passed to Port.open/2.

Other solutions might involve performing a pre-compile path resolution check (which could also happen in parallel) before calling Kernel.ParallelCompiler.compile/2 with the non-relative paths.

There may be other solutions here that I'm not initially seeing, too. I primarily wanted to raise awareness of the existence of the race condition and supply documentation/examples on how to reproduce it.

@josevalim
Copy link
Member

josevalim commented May 20, 2018

Thanks @potatosalad for the fantastic report!

Mix.Dep.loaded is private API and it should not be called under any circumstance. Mix.Project.deps_paths will return the answer they want and it should work. If that still has the same directory issues, then it will be a bug.

I would also like to point out that many projects actually rely on the cd operation to work. For example: https://github.com/plataformatec/nimble_parsec/blob/master/lib/nimble_parsec.ex#L2

@potatosalad
Copy link
Contributor Author

potatosalad commented May 21, 2018

Mix.Project.deps_paths will return the answer they want and it should work. If that still has the same directory issues, then it will be a bug.

I ran the same test against potatosalad/sentry_bug_232 using Mix.Project.deps_paths/0 and the same race condition happened as a result.

Mix.Project.deps_paths/0calls Mix.Dep.cached/0 which has 2 branches that may call Mix.Dep.loaded/1, which eventually leads to Mix.Dep.Loader.rebar_dep/3 that contains a change to the working directory.

I would also like to point out that many projects actually rely on the cd operation to work.

Yeah, the more I've thought about it, I'm not sure if there is a "better" solution there at all. Short of changing the way erts/emulator/nifs/common/prim_file_nif.c functions to not use chdir(2) and instead use "at" versions of the system calls (like openat(2), fstatat(2), fchmodat(2), linkat(2), symlinkat(2), and anything else in erts/emulator/nifs/unix/unix_prim_file.c...not to even mention the windows version), I'm not sure what else could be done there to improve thread-safety.

Maybe resolving the paths prior to spawning the compilation workers would help prevent the exception from being raised from within the parallel compiler itself? Currently, if plataformatec/nimble_parsec had a module called lib/a.ex that made a compilation-level call to Mix.Project.deps_paths() there is a chance that the example line you linked will still raise an exception during compilation.

@josevalim
Copy link
Member

Awesome @potatosalad. I will make sure that Mix.Project.deps_paths won't be an issue. My understand is that the paths would already have been cached at this point but I am guessing that's not the case when dependencies are being compiled? I will investigate.

@josevalim
Copy link
Member

Closing in favor of #7833.

@mitchellhenke
Copy link
Contributor

@josevalim is it safe to use Mix.Dep.cached() at compile-time instead of Mix.Dep.loaded() to get application dependencies?

@josevalim
Copy link
Member

They are both private APIs, so none of them is recommended. :) The proper answer would be Mix.Project.deps_paths. :)

@mitchellhenke
Copy link
Contributor

Is there a good way to get the version of the dep via deps_path?

@josevalim
Copy link
Member

deps_paths is going to give you the deps names and you should be able to use Application.spec(app_name, :vsn) (or :version?) to get the app version.

@mitchellhenke
Copy link
Contributor

@josevalim cool, thank you!

pjk25 added a commit to rabbitmq/rabbitmq-cli that referenced this issue May 19, 2020
Perform a test run, skipping all tests, before actually running the tests, in order to force compilation of all the *_test.exs files, in an attempt to workaround the parallel compiler issue in elixir-lang/elixir#7699

The parallel compiler has changed since Elixir 1.8, such that this workaround may no longer be necessary when using later versions.
pjk25 added a commit to rabbitmq/rabbitmq-cli that referenced this issue Sep 1, 2020
Perform a test run, skipping all tests, before actually running the tests, in order to force compilation of all the *_test.exs files, in an attempt to workaround the parallel compiler issue in elixir-lang/elixir#7699

The parallel compiler has changed since Elixir 1.8, such that this workaround may no longer be necessary when using later versions.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

3 participants