Skip to content

Intermittent Pool GenServer crash on hackney_conn:set_owner/2 failing to match {error, invalid_state} if the connection is closed #850

@ashutoshrishi

Description

@ashutoshrishi

Note: This issue was debugged with the assistance of Claude. I
have verified the analysis to the best of my knowledge.

Version: 4.0.0

What happens

The hackney pool GenServer crashes with a MatchError when a checkout races a
connection closing. hackney_sup restarts the pool immediately (one_for_one,
permanent) and checkout/4's try/catch converts the exit into {error, ...}
for the caller, so the system recovers. The practical impact is a single failed
checkout and log noise.

Noticed this when using
:broadway_sqs, which polls SQS
regularly. AWS closes idle connections on their end, triggering the race.

Have had similar experiences with other HTTP clients where AWS closing
connections needs handling in application or client library level.

** (MatchError) no match of right hand side value: {:error, :invalid_state}
    (hackney 4.0.0) hackney_pool.erl:450: :hackney_pool.handle_call/3
    (stdlib 7.3) gen_server.erl:2470: :gen_server.try_handle_call/4
    (stdlib 7.3) gen_server.erl:2499: :gen_server.handle_msg/3
    (stdlib 7.3) proc_lib.erl:333: :proc_lib.init_p_do_apply/3
Last message: {:checkout, {"sqs.eu-west-2.amazonaws.com", 443, :hackney_tcp}, #PID<...>, [recv_timeout: 30000]}

Suspected root cause

hackney_pool.erl:450 hard-matches the return of hackney_conn:set_owner/2:

ok = hackney_conn:set_owner(Pid, Requester),

find_available does call hackney_conn:is_ready/1 and only proceeds when it
gets {ok, connected}. It skips a connection in closed state.

Maybe: is_ready and set_owner are two separate synchronous
gen_statem calls, which may leave a window where a tcp_closed info message
arrives in the connection process mailbox between them:

  1. Pool calls is_ready. gen_statem responds {ok, connected}, mailbox empty
  2. TCP stack delivers tcp_closed to the gen_statem mailbox because AWS closed the connection?
  3. Pool receives {ok, connected} and dispatches set_owner
  4. gen_statem processes tcp_closed first (queued earlier). Transitions to closed
  5. gen_statem processes set_owner, hits the catch-all. Returns {error, invalid_state}

The set_owner call would then hit the handle_common catch-all in hackney_conn.erl:1587:

handle_common({call, From}, _, _State, _Data) ->
    {keep_state_and_data, [{reply, From, {error, invalid_state}}]};

The connection process stays alive for CLOSED_GRACE_MS (50 ms) before
terminating, so maybe the pool would not yet have received the 'DOWN' message to
remove the Pid from available?

The -spec for set_owner/2 only declares -> ok, and not {error, invalid_state}.

Possible fix

Maybe handle the {error, invalid_state}, regardless of how it happens?
Since it is part of the return path. Handle the error in hackney_pool.erl and
fall through to a fresh connection rather than crashing:

{ok, Pid, Available2} ->
    case hackney_conn:set_owner(Pid, Requester) of
        ok ->
            InUse2 = maps:put(Pid, Key, InUse),
            {reply, {ok, Pid}, State#state{available=Available2, in_use=InUse2}};
        {error, _} ->
            %% Connection closed between find_available and set_owner - start a fresh one
            case start_connection(Key, Requester, Opts, State#state{available=Available2}) of
                {ok, Pid2, State2} ->
                    InUse2 = maps:put(Pid2, Key, State2#state.in_use),
                    {reply, {ok, Pid2}, State2#state{in_use=InUse2}};
                {error, Reason} ->
                    {reply, {error, Reason}, State#state{available=Available2}}
            end
    end;

The spec for set_owner/2 updated:

-spec set_owner(pid(), pid()) -> ok | {error, invalid_state}.

I can create a PR if it is a issue and not expected behaviour.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions