Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Random test failure - UndefinedFunctionError #28

Closed
nsweeting opened this issue Oct 16, 2018 · 24 comments
Closed

Random test failure - UndefinedFunctionError #28

nsweeting opened this issue Oct 16, 2018 · 24 comments

Comments

@nsweeting
Copy link

nsweeting commented Oct 16, 2018

Environment

Elixir version (elixir -v): Elixir 1.7.3 (compiled with Erlang/OTP 21)
Nebulex version (mix deps): 1.0.0-rc.3
Operating system: Ubuntu 18.04

Expected behavior

Fetching a key from the cache should return nil or an already cached value.

DistCache.get({__MODULE__, type, cache_key})

Actual behavior

The other day, I was running some tests locally, and had this error pop up.

     ** (UndefinedFunctionError) function :function_clause.exception/1 is undefined (module :function_clause is not available)
     code: assert "http://foo.admin/baz" = ResourceURL.get_admin_url(product)
     stacktrace:
       :function_clause.exception([])
       (nebulex) lib/nebulex/adapters/dist.ex:265: Nebulex.Adapters.Dist.rpc_call/4
       (app) lib/app/support/caches/dist_cache.ex:2: App.DistCache.execute/2
       (app) lib/app/ecommerce/resource_urls/resource_url_cache.ex:17: App.ResourceURLCache.get/2
       (app) lib/app/ecommerce/resource_urls/resource_url_cache.ex:24: App.ResourceURLCache.get_or_set/2
       (app) lib/app/ecommerce/resource_urls/implements/product.ex:10: App.ResourceURL.App.Product.get_admin_url/1
       test/ecommerce/resource_urls/resource_url_test.exs:52: (test)

It actually ended up occurring twice - in different locations - in different test runs.

Havent been able to easily identify the problem.

Thanks

@cabol
Copy link
Owner

cabol commented Oct 16, 2018

@nsweeting can you copy some code, for example ResourceURL.get_admin_url, I need to check out that code, thanks!

BTW, are you running the distributed cache on one node or multiple nodes? If you are running this on multiple nodes, are you sure the cache is started in all nodes? just to check, but again, as much code as you can share would be the best

@nsweeting
Copy link
Author

@cabol

We have a resource url protocol:

defprotocol App.ResourceURL do
  def get_admin_url(resource)
end

With an implementation done for Product structs.

defimpl App.ResourceURL, for: App.Product do
  def get_admin_url(product) do
    case App.ResourceURLCache.get_or_set(:admin_url, product.account_id) do
      {:ok, base_url} -> "#{base_url}#{product.admin_url_path}"
      _ -> ""
    end
  end
end

Which calls our cache containing the errored get/2 function.

  def get(type, cache_key) do
    case DistCache.get({__MODULE__, type, cache_key}) do
      nil -> {:error, :not_found}
      url -> {:ok, url}
    end
  end

  def get_or_set(type, cache_key) do
    with {:error, :not_found} <- get(type, cache_key) do
      set(type, cache_key)
    end
  end

And finally, the nebulex cache module:

defmodule App.DistCache do
  use Nebulex.Cache, otp_app: :apple: 
end

@cabol
Copy link
Owner

cabol commented Oct 17, 2018

@nsweeting about the second question:

BTW, are you running the distributed cache on one node or multiple nodes? If you are running this on multiple nodes, are you sure the cache is started in all nodes? just to check

@nsweeting
Copy link
Author

During the test runs, it was running on a single node.

@cabol
Copy link
Owner

cabol commented Oct 18, 2018

Ok, but yes, that error looks very weird, let me try to reproduce it, meanwhile, can you try but with the master branch? There have been several improvements, and perhaps we can have better info about the error. Pls let me know your results, stay tuned!

@cabol
Copy link
Owner

cabol commented Oct 22, 2018

@nsweeting any news? On other hand, help me with this, the Nebulex cache App.DistCache you copied, is it using Nebulex.Adapters.Dist? If so, you need also to define another cache with a local backend, where is that Nebulex cache using Nebulex.Adapters.Local? Or is App.ResourceURLCache your local Nebulex cache? If so, I have some comments about it, but let's confirm that first. BTW, may you copy also the Nebulex config section in your config file ?

@nsweeting
Copy link
Author

nsweeting commented Oct 22, 2018

@cabol - here is the config:

config :app, App.LocalCache,
  adapter: Nebulex.Adapters.Local,
  gc_interval: 86_400

config :app, App.DistCache,
  adapter: Nebulex.Adapters.Dist,
  local: App.LocalCache,
  node_picker: Nebulex.Adapters.Dist

No news right now - ALTHOUGH - it did happen once again randomly. I meant to copy the error output, but forgot to :( It was a different error though - related to :badarg or something. Same spot in nebulex though.

So far - this has ONLY happened in tests though. I have yet to see it in our staging or production servers.

I'm hoping it would crop up again, but I have yet to find a way to reproduce it...

@nsweeting
Copy link
Author

nsweeting commented Oct 24, 2018

Got hit by the error I described above running the tests just a moment ago.

** (UndefinedFunctionError) function :badarg.exception/1 is undefined (module :badarg is not available)
     code: assert "http://foo.category/baz" = ResourceURL.get_url(category)
     stacktrace:
       :badarg.exception([])
       (nebulex) lib/nebulex/adapters/dist.ex:265: Nebulex.Adapters.Dist.rpc_call/4
       (app) lib/app/support/caches/dist_cache.ex:2: App.DistCache.execute/2
       (app) lib/app/ecommerce/resource_urls/resource_url_cache.ex:17: App.ResourceURLCache.get/2
       (app) lib/app/ecommerce/resource_urls/resource_url_cache.ex:24: App.ResourceURLCache.get_or_set/2
       (app) lib/app/ecommerce/resource_urls/implements/category.ex:3: App.ResourceURL.App.Category.get_url/1
       test/ecommerce/resource_urls/resource_url_test.exs:60: (test)

Different - but the same.

@cabol
Copy link
Owner

cabol commented Oct 24, 2018

@nsweeting ok, I have some suspicious, one of the possible reason this might be happening is that maybe the local cache for the distributed one is being stopped before the function is completed, maybe a race condition in the test, but I don't know how you are starting and stopping the cache(s) in your tests. But I think we can test it out easily, in your test case, add a try rescue wrap in App.ResourceURL.App.Category.get_url/1 call, and within the rescue section, call for instance App.LocalCache.get("something"), let's check is the function crashes with the same error. Also, check if there is something in your tests, maybe another test case that is stopping, or crashing the app or something similar, fi the tests run asynchronously, that could affect other test cases.

On other hand, change nebulex dep in your mix.exs file and point to master branch or the latest commit; it would be good to see what happens with latests version.

@cabol
Copy link
Owner

cabol commented Oct 30, 2018

@nsweeting any news?

@cabol
Copy link
Owner

cabol commented Nov 1, 2018

Heads up! version 1.0.0 was released already, you should try it out!

@nsweeting
Copy link
Author

Hey @cabol - thanks for the heads up. Will try. I've had it happen randomly once since the last comment. Still have yet to find reproducible steps.

I dont think its the caches being started or not - I have both caches underneath a "Cache Supervisor" that I use to start some other non-nebulex caches. The cache supervisor is started on application start. Nothing in my code stops them at any point either.

In the meantime, feel free to close this if wanted...

@nsweeting
Copy link
Author

@cabol I upgraded to the latest 1.0.0 release. Had a random test failure. Slightly different than the others though.

     ** (ArgumentError) argument error
     code: assert "http://foo.checkout/baz" = ResourceURL.get_url(checkout)
     stacktrace:
       (nebulex) lib/nebulex/adapters/dist.ex:325: Nebulex.Adapters.Dist.rpc_call/6
       (nebulex) lib/nebulex/cache/object.ex:12: Nebulex.Cache.Object.get/3
       (app) lib/app/support/caches/dist_cache.ex:2: App.DistCache.with_hooks/3
       (app) lib/app/ecommerce/resource_urls/resource_url_cache.ex:17: App.ResourceURLCache.get/2
       (app) lib/app/ecommerce/resource_urls/resource_url_cache.ex:24: App.ResourceURLCache.get_or_set/2
       (app) lib/app/ecommerce/resource_urls/implements/checkout.ex:3: App.ResourceURL.App.Checkout.get_url/1
       test/ecommerce/resource_urls/resource_url_test.exs:12: (test)

@cabol
Copy link
Owner

cabol commented Nov 6, 2018

@nsweeting I think I have an idea what might be happening. Perhaps is related to the generations manager, maybe at the same time a new generation is pushed a new command is being processed, when a new generation is created, there is a moment (very small) where the ETS tables are removed, hence, if for some reason there is a command being processed, it will fail with ArgumentError. It is very weird and unusual scenario to happen, that's why it's happening randomically for you, after some time. This is just a suspicion, still very weird since in your config you have the gc_interval set to 86_400 , and I suppose your tests finish before that, isn't? BTW, the error happens when you run the unit tests right (if so, this is not the case)? or is it a different kind of test you are running?

@nsweeting
Copy link
Author

Sorry for the delay. Yes - its configured with the gc_interval: 86_400, and the tests do finish before then. And yeah - it seems to only occur while running tests. I have yet to see the error pop up in staging or production.

@cabol
Copy link
Owner

cabol commented Mar 17, 2019

@nsweeting any news? in the meantime, I'll close this issue, but feel free to re-open it again!

@nsweeting
Copy link
Author

No news. Thanks for checking in. I dont recall it happening in a while - but yeah - I'll just reopen if I run across it and have time to dive in a bit. 👍

@nsweeting
Copy link
Author

@cabol - So I just had it happen on another test run. I think I might have tracked down what the issue is. It has changed a little bit - most likely from recent PR's. So, at the moment, it has transformed into a general argument error:

     ** (ArgumentError) argument error
       ...
       (nebulex) lib/nebulex/adapters/dist.ex:325: Nebulex.Adapters.Dist.rpc_call/6
       (nebulex) lib/nebulex/cache/object.ex:12: Nebulex.Cache.Object.get/3
       ...

Which I believe is as a result of:

def call(node, mod, fun, args, task_sup, timeout) do
{task_sup, node}
|> Task.Supervisor.async(
__MODULE__,
:call,
[node, mod, fun, args, task_sup, timeout]
)
|> Task.await(timeout)
end

I'm assuming Task.Supervisor is raising as a result of the supervisor hitting the max children. Which, by default, would be the number of schedulers. Could easily happen - especially in a concurrent test scenario. Unbound child creation is probably not ideal - but I'm wondering if a better error might be in order?

@nsweeting nsweeting reopened this Mar 19, 2019
@cabol
Copy link
Owner

cabol commented Mar 19, 2019

Ok, I'll take a look at it and try to figure out what's going on. Thanks for the info :) !!

@cabol
Copy link
Owner

cabol commented Mar 19, 2019

I was checking out the Task.Supervisor options, and apparently the :max_children is :infinity by default – https://hexdocs.pm/elixir/DynamicSupervisor.html#init/1-options

Anyways, I will fix the adapter at least to allow to pass the supervisor available options.

@nsweeting
Copy link
Author

Sorry - you are correct! Not quite sure where I got that assumption from...

@cabol
Copy link
Owner

cabol commented Mar 20, 2019

NP, actually it did make sense. Do you think is possible to give me more details about the tests? I'm not sure if maybe there might be a race condition or something, in which the cache is stopped while other test is still running, I don't know. So if you can give me more details or even paste some code related to the tests you are running, that would be very helpful. Thanks!

@cabol
Copy link
Owner

cabol commented Apr 24, 2019

@nsweeting any news? Did you check the previous suggestion ^^^? Perhaps we should close this issue in the meantime, while we have more info and a clear way to reproduce the error.

@nsweeting
Copy link
Author

Sorry - no news. The application in question has stabilized, so less day-to-day work is currently occurring. Have not noticed the error in a while though. Will close for now. Sorry - this issue snowballed into a giant thread!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants