Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Intermittent "invalid association" warnings at compile time #4293

Open
bkowalk opened this issue Oct 4, 2023 · 34 comments
Open

Intermittent "invalid association" warnings at compile time #4293

bkowalk opened this issue Oct 4, 2023 · 34 comments

Comments

@bkowalk
Copy link

bkowalk commented Oct 4, 2023

Elixir version

1.15.6

Database and Version

PostgreSQL 15.4

Ecto Versions

3.10.3

Database Adapter and Versions (postgrex, myxql, etc)

Postgrex

Current behavior

For a few months now, we've been seeing the following warning show up in our logs intermittently when compiling or running mix test:

warning: invalid association `family` in schema Core.UserDevices.UserDevice: associated module Core.Families.Family is not an Ecto schema lib/core/user_devices/user_device.ex:1: Core.UserDevices.UserDevice (module)

Sometimes there are multiple warnings for a few schemas, sometimes none. The subject of their complaint is often this Family module, though it sometimes varies and can apply to any one of our schemas. I've traced the source of the message to the after_compile_validation functions in association.ex in Ecto, where it seems that for some reason, Elixir doesn't believe the Family module has __schema__ exposed. It's definitely working as expected, though, and the app is functioning fine. So this warning seems to be a false alarm.

I assume that something strange is going on here with compilation order, but am unsure. Since it's intermittent, I don't have a simple code example that can reproduce the issue, but I saw this past issue that seems very similar, as well as this recent change one call upstream from our issue and am crossing my fingers that others may have reported something similar recently, resulting in this change? Here's a simplified version of our Family schema that is showing the issue (though as I mentioned, any of our Ecto schemas at varying times will sometimes show up, seemingly depending on compile order).

  @moduledoc """
  Representation of an entire family account.
  """

  # Pulls in the LetMe Ecto authorization library and some boilerplate for it.
  use Core.SecuredEctoSchema
  use Ecto.Schema

  import Ecto.Changeset
  import Ecto.Query

  alias __MODULE__
  alias Core.Accounts.Account
  alias Core.Addresses.Address
  alias Core.Cards.Card
  alias Core.Transfers.Transfer
  alias Core.UserDevices.UserDevice
  alias Core.Users.User

  @statuses [
    :activated,
    :deactivated,
    :failed
  ]

  schema "families" do
    field :uuid, Ecto.UUID, autogenerate: true

    field :name, :string
    field :status, Ecto.Enum, values: @statuses, default: :activated

    has_many :accounts, Account
    belongs_to :address, Address
    has_many :cards, Card
    has_many :devices, UserDevice
    has_many :transfers, Transfer
    has_many :users, User

    timestamps type: :utc_datetime_usec
  end

  @doc """
  Family changeset
  """
  def changeset(%Family{} = family, params \\ %{}) do
    cast(family, params, Family.__schema__(:fields))
  end

  @impl LetMe.Schema
  def scope(q, %User{family_id: family_id}, _opts \\ []) do
    where(q, id: ^family_id)
  end
end

The other schemas referencing (in the error above, UserDevice) simply have an alias to Core.Families.Family and then reference it with belongs_to :family, Family. Only things of note are that our schemas often use the LetMe library for authorization, and TypedEctoSchema to generate types for specs. Neither seem related since removing them from the schemas in the warning doesn’t change the warning, but worth mentioning just in case since I've never tried entirely pulling them from our codebase!

If this isn't enough information to confirm a bug since I'm unable to reproduce, feel free to close--I definitely recognize this isn't a support forum and I've come with some vague details! But if this is ringing any bells we'd love to help track down what's making Ecto mad! (And to get our logs happy again)!

Expected behavior

Not seeing a warning for a thing that's a functioning Ecto schema!

@josevalim
Copy link
Member

Thank you for the report. We can leave it open for a while to see if anyone has run into something similar.

@frerich
Copy link

frerich commented Oct 10, 2023

I noticed the exact same thing! 👋 Asked about it on Slack a few weeks ago, but the discussion unfortunately didn't lead to any conclusion other than "wat": https://elixir-lang.slack.com/archives/C087B66BY/p1695369972353739

I'm on Elixir 1.15.4 using Ecto 3.10.3. It started a few months ago, I would say - hard to pinpoint since it doesn't always happen. I noticed that if the warning occurs, it's always in the file for the schema with the has_many field. And it's always on line 1.

At one point, I was able to reproduce the issue at will - according to my notes, simply running mix compile would trigger the warning. Looking at https://github.com/elixir-ecto/ecto/blob/master/lib/ecto/association.ex#L784 I noticed that there is an 'after_compile_validation' hook which tries to make sure that any referenced associations are compiled and expose a __schema__/2 function .

When firing up IEx, I noticed:

iex(1)> function_exported?(MedHub.Medics.MedicActivity, :__schema__, 2)
false
iex(2)> MedHub.Medics.MedicActivity.__schema__(:fields) |> length()
15
iex(3)> function_exported?(MedHub.Medics.MedicActivity, :__schema__, 2)
true

I.e. for some obscure reason, the function didn't seem to be exported - yet, looking at the schema, it exposes plenty of fields and indeed, checking for whether it's exported now suddenly returns true.

@josevalim
Copy link
Member

The function_exported? behaviour is expected and documented: https://hexdocs.pm/elixir/Kernel.html#function_exported?/3

I don't think it is the root cause here, because Ecto.Association.ensure_compiled makes sure the module is loaded upfront.

@frerich
Copy link

frerich commented Oct 10, 2023

Ah, pity -- is there anything else I could try? I can reproduce it fairly easily by compiling repeatedly until I get a warning, via

#!/bin/sh
while true; do
    echo "[`date`] Compiling..."
    mix compile --force 2>&1 | grep warning && exit 1
done

...which usually takes about 10 runs or so. Once the warning is triggered, I can just run mix compile to trigger it.

@josevalim
Copy link
Member

Before returning {:error, ...}, try printing queryable.__schema__(:fields) but I am afraid doing that will "fix" the problem. :(

@frerich
Copy link

frerich commented Oct 10, 2023

I tried using https://github.com/hrzndhrn/beam_file to compare the decompiled code in both the good and the bad (i.e. when a warning is printed) states; alas, the decompiled Elixir code is mostly the same except for some fingerprint being different (but this fingerprint also differs between two successful builds, maybe the timestamp is part of it or such?).

One other bit which maybe helps: once I get to the faulty state by running mix compile --force repeatedly, any invocation of mix compile triggers the warning. I can get out of the faulty state by running mix compile --force again.

@frerich
Copy link

frerich commented Oct 10, 2023

I debugged this a bit further and noticed that I can reproduce this with Elixir 1.15 only, i.e. Elixir 1.15.0 occasionally triggers the warning but Elixir 1.14.5 never does.

@josevalim
Copy link
Member

does it happen on v1.15.6?

@frerich
Copy link

frerich commented Oct 10, 2023

Yes, v1.15.6 triggers it as well (I tried it locally, this is also what the OP used).

I just bisected https://github.com/elixir-lang/elixir (very pleased to see how easy it is to hack the compiler!), it appears elixir-lang/elixir@58b45e9 is the commit which triggers this.

@josevalim
Copy link
Member

Are those modules coming from a dependency or something? Are you using an umbrella?

@frerich
Copy link

frerich commented Oct 10, 2023

The modules are part of the main application, not coming from a dependency. I'm not using an umbrella.

@josevalim
Copy link
Member

Can you please run this inside iex -S mix:

Interactive Elixir (1.16.0-dev) - press Ctrl+C to exit (type h() ENTER for help)
iex(livebook_nvhlc7fi@Mac-Studio)1> :sys.get_status :code_server
{:status, #PID<0.50.0>, {:module, :code_server},
 [
   [],
   :running,
   #PID<0.49.0>,
   [],
   {:state, #PID<0.49.0>, ~c"/Users/jose/.kerl/26.0.2",
    [
      {~c"/Users/jose/OSS/livebook/_build/dev/lib/livebook/consolidated",
       :nocache},
      {~c"/Users/jose/OSS/livebook/_build/dev/lib/livebook/ebin", :nocache},
      {~c"/Users/jose/.kerl/26.0.2/lib/kernel-9.0.2/ebin",

You can see a list if paths. In my case, it is Livebook, so you see "livebook/_build/dev/lib/livebook/ebin". But you see :nocache in front of it. Is it nocache for your application's ebin?

@josevalim
Copy link
Member

And can you debug the Ecto code and add IO.inspect :sys.get_status :code_server in there when we return {:error, ...}? Is the path there cached or not?

@frerich
Copy link

frerich commented Oct 10, 2023

Yes, my application's ebin appears to have :nocache (med_hub is the application):

Interactive Elixir (1.15.5) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> :sys.get_status :code_server
{:status, #PID<0.50.0>, {:module, :code_server},
 [
   [],
   :running,
   #PID<0.49.0>,
   [],
   {:state, #PID<0.49.0>, ~c"/Users/frerich/.asdf/installs/erlang/26.0.2",
    [
      {~c"/Users/frerich/src/bd/med_hub/_build/dev/lib/med_hub/consolidated",
       :nocache},
      {~c"/Users/frerich/src/bd/med_hub/_build/dev/lib/med_hub/ebin", :nocache},
      {~c"/Users/frerich/.asdf/installs/erlang/26.0.2/lib/kernel-9.0.2/ebin",
      ...

In case the warning is printed, the output is slightly different: the consolidated entry appears to be missing:

{:status, #PID<0.50.0>, {:module, :code_server},
 [
   [],
   :running,
   #PID<0.49.0>,
   [],
   {:state, #PID<0.49.0>, ~c"/Users/frerich/.asdf/installs/erlang/26.0.2",
    [
      {~c"/Users/frerich/src/bd/med_hub/_build/dev/lib/med_hub/ebin", :nocache},
      {~c"/Users/frerich/.asdf/installs/erlang/26.0.2/lib/kernel-9.0.2/ebin",
      ...

@josevalim
Copy link
Member

Yeah, so we should be finding all modules defined within your application. :S

@josevalim
Copy link
Member

Any chance you can isolate it in a regular app or our example app in this repo?

@frerich
Copy link

frerich commented Oct 10, 2023

I think I might have trimmed down the real application far enough to exhibit the issue. In fact, now it always triggers the problem for me -- but just with Elixir 1.15. Here's a small tarball - it's the result of mix new plus Ecto, plus some failure-inducing source code. I tried simplifying it further, but oddly enough all the things I tried made the error go away.

See @bkowalk 's reduced example in #4293 (comment) for a test case to reproduce the issue.

Just mix deps.get && ./stress.sh is enough for me to trigger the problem:

frerich@Frerichs-Mini ecto_4293 % mix deps.get && ./stress.sh
Resolving Hex dependencies...
Resolution completed in 0.012s
Unchanged:
  decimal 2.1.1
  ecto 3.10.3
  telemetry 1.2.1
* Getting ecto (Hex package)
* Getting decimal (Hex package)
* Getting telemetry (Hex package)
[Tue Oct 10 20:32:29 CEST 2023] Compiling (attempt 1)...
warning: Logger.warn/1 is deprecated. Use Logger.warning/2 instead
warning: Logger.warn/1 is deprecated. Use Logger.warning/2 instead
warning: Logger.warn/1 is deprecated. Use Logger.warning/2 instead
warning: invalid association `edits` in schema MedHub.Medics.MedicActivity: associated module MedHub.Medics.MedicActivityEdit is not an Ecto schema

(The stress.sh script recompiles in a loop, looking for warnings)

@frerich
Copy link

frerich commented Oct 10, 2023

What puzzles me a little bit is that the warning now occurs always with Elixir 1.15, it's no longer intermittent. I hope I didn't reduce the real application too much, introducing a stupid mistake along the way... at least it's still the case that Elixir 1.14.5 compiles this without warnings.

@bkowalk
Copy link
Author

bkowalk commented Oct 10, 2023

Wow, just catching up - thanks for running with this, @frerich! Confirmed that I'm able to see the issue on that tarball as well--amazing work reproducing! Compiling on an M2 macbook pro, if that makes a difference.

Interestingly, I see the warning immediately every time if I kill the _build/dev/lib/ecto dir, but if I just run ./stress.sh after I've seen the warning once, it runs for a while. 39 times on this last try. So 100% warn rate if I see Ecto compile before it attempts app compilation, but pretty low rate if it's using a cached Ecto dep.

@bkowalk
Copy link
Author

bkowalk commented Oct 10, 2023

I think I've trimmed down the necessary schemas for the error in your example, @frerich. Looks like we're failing just with this attached set. It doesn't seem like any more can be removed and still reliably create the error.

Seems like there may just be a complicated compilation chain. There are circular references in the schemas, but that's sort of expected in a has_many vs belongs_to association, so I assume the compiler shouldn't trip over that... The edit schemas having module attributes that reference the canonical schemas, though, that may elevate that dependency and cause the problem, though, I'm guessing. The problem definitely seems to go away if you remove @editable_fields from either _edit schema, or if you remove the has_many reference to the edit schema from the canonical schema.

However, it also weirdly disappears if you remove the link from Medic to Workplace, or (for who knows what reason) if you remove the gba_workplace_mapping from workplace. Super strange.

ecto_4293.zip

@frerich
Copy link

frerich commented Oct 10, 2023

Awesome job there @bkowalk - I can confirm that your last simplified test case triggers the issue for me, too:

frerich@Frerichs-Mini ecto_4293 % ./stress.sh
[Tue Oct 10 23:30:01 CEST 2023] Compiling (attempt 1)...
...
[Tue Oct 10 23:30:20 CEST 2023] Compiling (attempt 12)...
warning: invalid association `edits` in schema MedHub.Medics.MedicActivity: associated module MedHub.Medics.MedicActivityEdit is not an Ecto schema

@bkowalk
Copy link
Author

bkowalk commented Oct 10, 2023

Noticed one more interesting thing - if I go into association.ex in Ecto and add IO.inspect(queryable.__info__) in our problem location (line 785 in my current clone from master), I get back the below stack trace saying we're in a deadlock rather than the warning. 4/5 times, it'll work fine. But that other 1/5, I get the stack trace instead of the warning now.

not function_exported?(queryable, :__schema__, 2) ->
        IO.inspect(queryable.__info__)
        {:error, "associated module #{inspect(queryable)} is not an Ecto schema"}
== Compilation error in file lib/modules/medic_activity_edit.ex ==
** (CompileError) deadlocked waiting on module MedHub.Medics.MedicActivity
    lib/modules/medic_activity_edit.ex:4: (module)

== Compilation error in file lib/modules/medic_activity.ex ==
** (CompileError) deadlocked waiting on module MedHub.Medics.MedicActivityEdit
    (ecto 3.11.0-dev) lib/ecto/association.ex:785: Ecto.Association.Has.after_compile_validation/2
    (ecto 3.11.0-dev) lib/ecto/schema.ex:2251: anonymous fn/3 in Ecto.Schema.__after_compile__/2
    (elixir 1.15.0) lib/enum.ex:1693: Enum."-map/2-lists^map/1-1-"/2
    (elixir 1.15.0) lib/enum.ex:1693: Enum."-map/2-lists^map/1-1-"/2
    (ecto 3.11.0-dev) lib/ecto/schema.ex:2248: Ecto.Schema.__after_compile__/2

Compilation failed because of a deadlock between files.
The following files depended on the following modules:

  lib/modules/medic_activity_edit.ex => MedHub.Medics.MedicActivity
       lib/modules/medic_activity.ex => MedHub.Medics.MedicActivityEdit

Ensure there are no compile-time dependencies between those files and that the modules they reference exist and are correctly named

@bkowalk
Copy link
Author

bkowalk commented Oct 10, 2023

In our codebase, btw, I'll add that our most common error is card saying its associated family isn't an Ecto Schema. Between those two, though, we don't see any module attributes that require a call to another module - all just plain atoms. I suspect if we dug down, though, we'd find some other form of compilation deadlock that can occasionally occur. Makes me wonder if the ensure_compiled call on 775 in assocation.ex might not always be getting us to the :skip cond that it seems we'd expect in a deadlock.

I'm assuming here, btw, that these aren't real deadlocks - seems like they work themselves out if you ignore the warning and just go on your way. But rather just a temporary deadlocked state where the compiler is still working out its course of action or something.

@josevalim
Copy link
Member

Awesome job @bkowalk in isolating it.

Keep in mind that, in your example, you do have a warning that is accurate, but another one that indeed is a false positive. I have good news and bad news: I have fixed this in Elixir main but, given it changes the compiler, I am not yet comfortable with backporting it to v1.15.

While I have confirmed it indeed no longer reproduces the issue for your example, I am not sure if that's generally true. If you could try elixir main, it would be fantastic. Thank you! ❤️

@frerich
Copy link

frerich commented Oct 11, 2023

Using elixir main, I can no longer reproduce the issue on my real application. Excellent! 🥳

@bkowalk
Copy link
Author

bkowalk commented Oct 11, 2023

Same here. I see the issue after 5-10 recompiles on 1.15.4, but haven't seen it after 50 on 1.16.0-dev. Thanks for the fix, @josevalim! And thanks for finding a way to reproduce, @frerich!

@TheMoonDawg
Copy link

TheMoonDawg commented Nov 3, 2023

Not sure if this will help anyone else, but we noticed this issue when we had multiple Ecto schemas defined in the same file that were all assocs on another schema.

defmodule MD.Scheduling.AppointmentStatusHistory do
  use Ecto.Schema

  schema "appointment_status_history" do
    field(:status, :string)
    field(:changed_at, :utc_datetime_usec)
    belongs_to(:changed_by, MD.Accounts.User)
    belongs_to(:appointment, MD.Scheduling.Appointment)
  end
end

defmodule MD.Scheduling.AppointmentTimeHistory do
  use Ecto.Schema

  schema "appointment_time_history" do
    field(:start, :utc_datetime)
    field(:end, :utc_datetime)
    field(:changed_at, :utc_datetime_usec)
    belongs_to(:changed_by, MD.Accounts.User)
    belongs_to(:appointment, MD.Scheduling.Appointment)
  end
end

defmodule MD.Scheduling.AppointmentVisitTypeHistory do
  use Ecto.Schema

  schema "appointment_visit_type_history" do
    field(:visit_type_id, :string)
    field(:changed_at, :utc_datetime_usec)
    belongs_to(:changed_by, MD.Accounts.User)
    belongs_to(:appointment, MD.Scheduling.Appointment)
  end
end

You can see the warning in the log below.

image

Once I moved these out into their own individual files, the warnings were all resolved! Hoping this helps someone else still on 1.15!

@Billzabob
Copy link

We're still seeing this on Elixir 1.16.0 :( Did the fix not end up making it in @josevalim ?

@josevalim
Copy link
Member

I don't believe it was reverted. You can try running the reproducer above and see if it errors or not on Elixir v1.16.0. If it does not error, you have a separate case and we need a way to reproduce it.

@Billzabob
Copy link

Billzabob commented Jan 11, 2024

Sorry, been slow getting back to this. Yes it does error on Elixir 1.16.0. Interestingly, I can't get it to fail normally, but it consistently fails in Docker and our pipelines, even with a --no-cache option. What can I do to help track this down?

@paulo-silva
Copy link

I'm also experiencing this issue in elixir 1.16.0, cc @josevalim @Billzabob

@dantswain
Copy link
Contributor

I am also still seeing this error somewhat sporadically with Elixir 1.16.0 and Ecto 3.11.2. It seems to come and go depending on the state of the code - once it's happening, it is consistent (e.g., repeated mix clean / mix compile gives the same warning), but then some code will change (code not directly related to the modules causing the warning) and it will go away / come back. I can't give a reproduction easily as it seems to be so random and it's happening in a very large proprietary codebase.

The most recent case I've been able to isolate is that I'm introducing an enum where the type values of the enum are pulled from a different module, e.g.,

defmodule A do
  schema "as" do
    field :trinket_type, values: [:curly, :slimy, :crunchy]
  end

  def trinket_types, do: Ecto.Enum.values(__MODULE__, :trinket_type)
end

defmodule B do
  schema "bs" do
   # this is the line that changed
   field :trinket_type, values: A.trinket_types()
 end
end

defmodule C do
  schema "cs" do
    # only indirectly related to A and B through relations, does not have `trinket_type` as a field
  end
end

defmodule D do
  schema "ds" do
    # also not directly related to A or B
    belongs_to :c, C
  end
end

Adding the indicated line is giving me "invalid association c in D". Removing the line makes it compile without warning, and manually re-defining the static values for trinket_type makes it compile without warning. If I leave it in and go about my business and just ignore the warning, past experience says there's a high probability that the warning will go away with some other update of code that doesn't seem to be related.

I could define trinket_types elsewhere and I may do that (it seemed to work in #4371), but then I am creating an indirection that I'd prefer to avoid. Plus the whole thing just seems so random that it has to be a sign of something not being quite right.

@josevalim
Copy link
Member

There is definitely a bug, we just need a mechanism to reproduce it. Btw, you can try elixir --erl "+T 9" -S mix compile --force and see if it happens more often. But even if you isolate the associations above into something that reproduces it some of the time, it would help a lot. :)

@stuartc
Copy link

stuartc commented Apr 25, 2024

I've been experiencing this issue repeatedly - but only on CircleCI. We're on 1.16.2 and OTP 26.

I just fixed it, going off Jose's +T 9 request - lead me to CircleCI's docs where it mentions that languages that read /proc for CPU information may need to be manually overridden or else they end up thinking there are way more cores than available.

So by limiting the number of schedulers (ERL_FLAGS="+S 4:4" in our case), this fixed it.

I tried locally to see if I could reproduce the inverse locally, by telling Erlang I have like 48 cores (elixir --erl "+T 9 +SDcpu 48:48 +S 48:32" -S mix compile --force). Kinda just throwing stuff at the wall. Unfortunately other than taking significantly longer everything worked on my Mac.

For reference on our specific situation: OpenFn/lightning#2028

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants