Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Why we need to handle {:error, {:already_started, pid}} ? #259

Closed
zookzook opened this issue Feb 20, 2023 · 4 comments
Closed

Why we need to handle {:error, {:already_started, pid}} ? #259

zookzook opened this issue Feb 20, 2023 · 4 comments

Comments

@zookzook
Copy link

First: thank you for this library and for your efforts to provide this library!!! I have a question about the :ignore value when call the start_link function:

What is the reason for converting a {:error, {:already_started, pid}} into an :ignore?

def start_link(arg) do
  case GenServer.start_link(...) do
    {:ok, pid} ->
      {:ok, pid}
    {:error, {:already_started, pid}} ->
      :ignore
  end
end

There is a little issue when starting a new process on another node. Assuming we have two nodes n1 and n2. I use the registry to look for pids. If I start a process on node n1 and the process is spawned on node n2, then it takes at least 200ms until the pid is inserted in the registry of node n1.

My code looks like this. I have a UUID and this is used for the process id:

  def find_pid(uuid, counter \\ 3) do
    case Horde.Registry.lookup(Registry, process_name(uuid)) do
      [{pid, _} | _] when is_pid(pid) ->
        {:ok, pid}

      [] ->
        case start_child(uuid) do
          {:ok, pid} ->
            {:ok, pid}

          :ignore ->
            case counter do
              0 ->
                :not_found

              _ ->
                ## the delta CRDT will attempt to sync its local changes with its neighbors at this interval (specified in milliseconds). The default is 200.
                Process.sleep(200)
                find_pid(uuid, counter - 1)
            end
        end
    end
  end

If the function find_pid is called, it first does a lookup and if the process is not found, then it calls a start_child function, which starts the process. If the process is started it returns the pid. If the function find_pid is called again and the process was started on node n2, then it takes about 200ms to be inserted in the registry of node n1. Meanwhile, the start_child function is called and from the implementation above it returns an :ignore. It would be better to return {:error, {:already_started, pid}} and the caller can use the PID. Actually, the code just waits 200ms and runs find_pid again. The number of calls is limited to avoid an endless loop.

If I would not call Process.sleep(200) then the caller would start the process a few times and the start_link function of the node n2 would return :ignoreinstead of {:error, {:already_started, pid}}. The function gives up and returns a :not_found which is not correct because the process exists on node n2. My naive solution now waits for 200ms because it is the duration for the next sync call of the DeltaCRDT package. After 200ms it is inserted in the registry of node n1 and then the find_pid function will return the PID.

Is there a way to handle those use cases better? Maybe insert the PID right after the start in the registry of node n1? Using a cache? Or just returning {:error, {:already_started, pid}} instead of :ignore??

@derekkraan
Copy link
Owner

The reason you must :ignore instead of {:error, {:already_started, pid}} is I believe to avoid double supervision.

You generally want to avoid registries or supervisors on node 2 from linking or supervising processes on node 1.

@zookzook
Copy link
Author

Thanks for the response. What would happen if the start_link function returns {:error, {:already_started, pid}}? Then the pid would be double supervised?

@derekkraan
Copy link
Owner

If you did something like this, then yes:

def start_link(arg) do
  case GenServer.start_link(...) do
    {:ok, pid} ->
      {:ok, pid}
    {:error, {:already_started, pid}} ->
      {:ok, pid}
  end
end

I don't recall exactly why we need to handle this error tuple and why we don't just pass the error through.

@zookzook
Copy link
Author

Ok! Thanks for your honest answer :-)! I will look at the code, maybe I can provide a PR, that improves this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants