Why we need to handle {:error, {:already_started, pid}} ? #259

zookzook · 2023-02-20T08:14:13Z

First: thank you for this library and for your efforts to provide this library!!! I have a question about the :ignore value when call the start_link function:

What is the reason for converting a {:error, {:already_started, pid}} into an :ignore?

def start_link(arg) do
  case GenServer.start_link(...) do
    {:ok, pid} ->
      {:ok, pid}
    {:error, {:already_started, pid}} ->
      :ignore
  end
end

There is a little issue when starting a new process on another node. Assuming we have two nodes n1 and n2. I use the registry to look for pids. If I start a process on node n1 and the process is spawned on node n2, then it takes at least 200ms until the pid is inserted in the registry of node n1.

My code looks like this. I have a UUID and this is used for the process id:

  def find_pid(uuid, counter \\ 3) do
    case Horde.Registry.lookup(Registry, process_name(uuid)) do
      [{pid, _} | _] when is_pid(pid) ->
        {:ok, pid}

      [] ->
        case start_child(uuid) do
          {:ok, pid} ->
            {:ok, pid}

          :ignore ->
            case counter do
              0 ->
                :not_found

              _ ->
                ## the delta CRDT will attempt to sync its local changes with its neighbors at this interval (specified in milliseconds). The default is 200.
                Process.sleep(200)
                find_pid(uuid, counter - 1)
            end
        end
    end
  end

If the function find_pid is called, it first does a lookup and if the process is not found, then it calls a start_child function, which starts the process. If the process is started it returns the pid. If the function find_pid is called again and the process was started on node n2, then it takes about 200ms to be inserted in the registry of node n1. Meanwhile, the start_child function is called and from the implementation above it returns an :ignore. It would be better to return {:error, {:already_started, pid}} and the caller can use the PID. Actually, the code just waits 200ms and runs find_pid again. The number of calls is limited to avoid an endless loop.

If I would not call Process.sleep(200) then the caller would start the process a few times and the start_link function of the node n2 would return :ignoreinstead of {:error, {:already_started, pid}}. The function gives up and returns a :not_found which is not correct because the process exists on node n2. My naive solution now waits for 200ms because it is the duration for the next sync call of the DeltaCRDT package. After 200ms it is inserted in the registry of node n1 and then the find_pid function will return the PID.

Is there a way to handle those use cases better? Maybe insert the PID right after the start in the registry of node n1? Using a cache? Or just returning {:error, {:already_started, pid}} instead of :ignore??

The text was updated successfully, but these errors were encountered:

derekkraan · 2023-02-20T12:05:29Z

The reason you must :ignore instead of {:error, {:already_started, pid}} is I believe to avoid double supervision.

You generally want to avoid registries or supervisors on node 2 from linking or supervising processes on node 1.

zookzook · 2023-02-20T12:12:18Z

Thanks for the response. What would happen if the start_link function returns {:error, {:already_started, pid}}? Then the pid would be double supervised?

derekkraan · 2023-02-20T12:21:22Z

If you did something like this, then yes:

def start_link(arg) do
  case GenServer.start_link(...) do
    {:ok, pid} ->
      {:ok, pid}
    {:error, {:already_started, pid}} ->
      {:ok, pid}
  end
end

I don't recall exactly why we need to handle this error tuple and why we don't just pass the error through.

zookzook · 2023-02-20T12:23:34Z

Ok! Thanks for your honest answer :-)! I will look at the code, maybe I can provide a PR, that improves this.

derekkraan closed this as completed Feb 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Why we need to handle {:error, {:already_started, pid}} ? #259

Why we need to handle {:error, {:already_started, pid}} ? #259

zookzook commented Feb 20, 2023

derekkraan commented Feb 20, 2023

zookzook commented Feb 20, 2023

derekkraan commented Feb 20, 2023

zookzook commented Feb 20, 2023

Why we need to handle {:error, {:already_started, pid}} ? #259

Why we need to handle {:error, {:already_started, pid}} ? #259

Comments

zookzook commented Feb 20, 2023

derekkraan commented Feb 20, 2023

zookzook commented Feb 20, 2023

derekkraan commented Feb 20, 2023

zookzook commented Feb 20, 2023