Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add better handling of crashes on socket closure by client #270

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

saleyn
Copy link

@saleyn saleyn commented Jun 28, 2021

Add functionality to terminate gen_server gracefully in case of SSL handshake errors or premature closing of socket by an SMTP client.

@seriyps
Copy link
Collaborator

seriyps commented Sep 20, 2021

I wonder if it is really necessary to use funs in the scenarios where you use send_and_run? Maye it's possible to have just linear "happy path" and then signal errors with throw? Or I'm missing smth?

It just becomes a bit difficult to reason about the order in which code lines are executed looking at the functions using send_and_run

@saleyn
Copy link
Author

saleyn commented Sep 20, 2021

The main issue is that when errors are thrown, the error log has a ton of details pertaining to the crashed gen_server, whereas in the majority of cases, especially when the server closes the socket after ELHO or similar scenarios, we don't really need to put the crash details in the log. Generally, I think the current implementation needs a major revision to clean up and simplify that logic, but this is merely a patch to make the existing code more usable.

@seriyps
Copy link
Collaborator

seriyps commented Sep 20, 2021

@saleyn the thing is that throw from gen_server's callback is kind of equivalent to just returning from the gen_server:

https://github.com/erlang/otp/blob/a83541527625fae5d7f849718a052cf6f2833809/lib/stdlib/src/gen_server.erl#L697-L698

so

handle_cast(..., State) ->
   ...
   {noreply, NewState}.

is equivalent of

handle_cast(..., State) ->
   ...
   throw({noreply, NewState}),
   ....

(same works for handle_call/handle_info and I think even init).

Also, if you throw({stop, normal, State}) it would stop the process, but normal exit reason (so, no error logs, no signal propagation).

When I implemented send/2 function making it throw {stop, {send_error, ...}, St} I might actually just change it to throw {stop, normal, St}, but decided to better let user-defined handle_error decide (handle_error callback could return {stop, normal, St} instead of returning ok or {stop, Error, St}), but I made it more explicit (by generating error log) by-default.

User can define

-behaviour(gen_smtp_server_session).

handle_error(send_error, Err, St) ->
  logger:debug("socket send error ~p", [Err]),
  {stop, normal, St};
handle_error(_, _, St) ->
  {ok, St}.

to suppress error_logger reports when gen_smtp_server_session:send/2 fails. It's probably should be better documented though...

@saleyn
Copy link
Author

saleyn commented Sep 20, 2021

If my memory serves me, a while back when I was implementing an SMTP server based on gen_smtp, the issue was that there were many places were gen_tcp/ssl's send/2 function would return an error, and this would result either in a failed match ok = gen_tcp:send(...) or erlang:error/1 someplace, which obviously is not same thing as an erlang:throw/1 and would leak from the gen_server's exception handler, crashing the process.

I recall seeing this behavior on ELHO and QUIT messages, e.g. when the server was trying to send QUIT, by which time a client already closed the socket. Hence, I had to implement some more refined handing of those edge cases to eliminate useless log reports of process crashes.

@seriyps
Copy link
Collaborator

seriyps commented Sep 20, 2021

@saleyn

a while back when I was implementing an SMTP server based on gen_smtp

depends on how long ago, because I made 2 PRs where error handling have been noticeably improved relatively recently: #211 adds handle_error user-defined callback and #226 is exactly about handling send errors; both PRs were merged in mid-2020

@saleyn
Copy link
Author

saleyn commented Sep 20, 2021

depends on how long ago, because I made 2 PRs where error handling have been noticeably improved relatively recently: #211 adds handle_error user-defined callback and #226 is exactly about handling send errors; both PRs were merged in mid-2020

I believe it was more or less at the end of 2020 using the HEAD of master branch.

@seriyps
Copy link
Collaborator

seriyps commented Sep 20, 2021

Oh, that's strange. Looking at the code, it seems the only place where we are not using gen_smtp_server_session:send/2 to send server's replies is init callback (becaue return tuple format there is slightly different I think). And all the send/2 errors whould be possible to suppress by returning {stop, normal, State} from handle_error callback of gen_smtp_server_session.

I would rally appreciate if you could provide some small example where we could reproduce it!

@saleyn
Copy link
Author

saleyn commented Sep 21, 2021

I've had this patched code running in production for the good part of this year, and haven't seen these issues. So it would be hard to reproduce without rolling back to the unpatched version, which is not something I could do any time soon. So you can table this PR for now, I guess. If I have more updates on this, I'll let you know.

@mworrell mworrell added the on hold On hold. Could be delayed or kept as inspiration. label Sep 21, 2021
@seriyps
Copy link
Collaborator

seriyps commented Sep 21, 2021

@saleyn I also have gen_smtp master server running in production with the following customized handle_error/3 callback:

handle_error(Kind, [], State) when Kind == tcp_closed; Kind == ssl_closed ->
    %% It's normal; no need to log/metric
    {ok, State};
handle_error(out_of_order, <<"DATA">>, #state{orig_to = To, inbox = undefined} = State) when
    is_binary(To)
->
    %% RCPT TO and DATA are sent in pipeline, but rcpt failed due to recipient address not found
    {ok, State};
handle_error(Kind, Details, #state{peer = Ip} = State) ->
    tm_metrics:count_inc([smtp, error, total], 1, #{labels => [Kind]}),
    ?LOG_WARNING(
        #{
            tag => "Session error",
            kind => Kind,
            details => Details,
            state => State
        }
    ),
    case Kind of
        ssl_handshake_error ->
            tm_smtp_tls_blacklist:report(Ip),
            %% Force stop
            {stop, normal, State};
        ssl_error ->
            tm_smtp_tls_blacklist:report(Ip),
            % will stop anyway
            {ok, State};
        send_error when Details =:= closed ->
            %% seems lots of servers close the connection prematurely (without sending QUIT)
            {stop, normal, State};
        _ ->
            {ok, State}
    end.

and I indeed suppress send_error by returning normal as stop reason. But I did it after I made some analysis of the traffic I receive; I'm not sure we should suppress them by-default, because it may make debugging more difficult. Maybe we should add more tips to the smtp_server_example docstring suggesting which errors are a good candidate for being suppressed after initial analysis?

@saleyn
Copy link
Author

saleyn commented Sep 21, 2021

It certainly would be helpful to include this in the example server, and mention your comments from this thread there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
on hold On hold. Could be delayed or kept as inspiration.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants