Crash During Sending #4

Closed
chakhedik opened this Issue Apr 6, 2011 · 9 comments


Hi,

Today I tried to send 5k SMS, but it crashed at around 3k+ with the following error:

[error] 2011-04-06 15:42:41.698
** Generic server <0.120.0> terminating
** Last message in was {<0.121.0>,
                        {pdu,26,2147483652,0,71038063,
                             {submit_sm_resp,"600931801"}}}
** When Server state == {st_rx,<0.117.0>,#Ref<0.0.0.4221>,<0.118.0>,
                               #Ref<0.0.0.4222>,<0.121.0>,#Ref<0.0.0.4228>,
                               <0.122.0>,#Ref<0.0.0.4247>,<0.115.0>}
** Reason for termination ==
** {normal,{gen_server,call,
                       [<0.117.0>,
                        {deliver,<0.120.0>,
                                 {pdu,26,2147483652,0,71038063,
                                      {submit_sm_resp,"600931801"}}}]}}

** Reason for termination ==
** {{normal,
        {gen_server,call,
            [<0.117.0>,
             {deliver,<0.120.0>,
                 {pdu,26,2147483652,0,71038063,
                     {submit_sm_resp,"600931801"}}}]}},
    {gen_server,call,
        [<0.120.0>,
         {<0.121.0>,
          {pdu,26,2147483652,0,71038063,{submit_sm_resp,"600931801"}}}]}}

[error] 2011-04-06 15:42:41.708
Error in process <0.128.0> on node 'bulk@192.168.1.110' with exit value: {badarg,[{gen_esme34,transmit_pdu,5},{sms,chilledpush,10}]}

Any idea?

Never mind, this happens because of the normal blocking gen_server call. I just put a queue using gen_server cast in front of it.
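Roughly what I mean, as a minimal sketch (the module name smpp_queue is made up here; it assumes echo_esme:sendsms/3 is the blocking call being wrapped):

-module(smpp_queue).
-behaviour(gen_server).

-export([start_link/0, sendsms/3]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Callers never block; the request just queues up in this process's mailbox.
sendsms(Src, Dst, Msg) ->
    gen_server:cast(?MODULE, {sendsms, Src, Dst, Msg}).

init([]) ->
    {ok, nostate}.

%% The blocking gen_server call happens here, one message at a time.
handle_cast({sendsms, Src, Dst, Msg}, St) ->
    echo_esme:sendsms(Src, Dst, Msg),
    {noreply, St};
handle_cast(_Other, St) ->
    {noreply, St}.

handle_call(_Req, _From, St) ->
    {reply, ok, St}.

handle_info(_Info, St) ->
    {noreply, St}.

terminate(_Reason, _St) ->
    ok.

code_change(_OldVsn, St, _Extra) ->
    {ok, St}.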

@chakhedik chakhedik closed this Apr 6, 2011

Owner

essiene commented Apr 6, 2011

On Wed, Apr 6, 2011 at 8:53 AM, chakhedik <reply@reply.github.com> wrote:

Hi,

Today I tried to send 5k SMS, but it crashed at around 3k+ with the following error:

[error] 2011-04-06 15:42:41.698
** Generic server <0.120.0> terminating
** Last message in was {<0.121.0>,
                       {pdu,26,2147483652,0,71038063,
                            {submit_sm_resp,"600931801"}}}
** When Server state == {st_rx,<0.117.0>,#Ref<0.0.0.4221>,<0.118.0>,
                              #Ref<0.0.0.4222>,<0.121.0>,#Ref<0.0.0.4228>,
                              <0.122.0>,#Ref<0.0.0.4247>,<0.115.0>}
** Reason for termination ==
** {normal,{gen_server,call,
                      [<0.117.0>,
                       {deliver,<0.120.0>,
                                {pdu,26,2147483652,0,71038063,
                                     {submit_sm_resp,"600931801"}}}]}}

Looks like a submit_sm had just succeeded here...

** Reason for termination ==
** {{normal,
       {gen_server,call,
           [<0.117.0>,
            {deliver,<0.120.0>,
                {pdu,26,2147483652,0,71038063,
                    {submit_sm_resp,"600931801"}}}]}},
   {gen_server,call,
       [<0.120.0>,
        {<0.121.0>,
         {pdu,26,2147483652,0,71038063,{submit_sm_resp,"600931801"}}}]}}

[error] 2011-04-06 15:42:41.708
Error in process <0.128.0> on node 'bulk@192.168.1.110' with exit value: {badarg,[{gen_esme34,transmit_pdu,5},{sms,chilledpush,10}]}

Looks like there's a badarg somewhere in there... do you have a small
snippet of what you're trying to do?

Any idea?


I just call echo_esme:sendsms/3 in a loop over the 5k SMS. Something like:

do([H|T], Src, Msg) ->
    echo_esme:sendsms(Src, H, Msg),
    do(T, Src, Msg);
do([], _Src, _Msg) ->
    ok.

Then I modified handle_rx to differentiate the responses:

handle_rx(P, St) ->
    PDU = tuple_to_list(P),
    case lists:nth(6, PDU) of
        {submit_sm_resp, MessageId} ->
            SequenceNumber = lists:nth(5, PDU),
            smppque:update_outbox(integer_to_list(SequenceNumber), MessageId);
        _ ->
            log4erl:log(info, "RX --> ~p", [P])
    end,
    {noreply, St}.

By the way, there's another error from that time in echo_esme.log:

gen_esme34: Terminating with reason: {timeout,
{gen_server,call,
[<0.117.0>,
{tx,0,74541127,
{submit_sm,[],0,0,

and

gen_esme34: Terminating with reason: {function_clause,
[{bulk_esme,handle_tx,
[{'EXIT',
{timeout,
{gen_server,call,
[<0.118.0>,
{send,0,45693033,
{submit_sm,[],0,0,

I think that badarg appeared in gen_esme34:transmit_pdu/5 (the new one added in dev)...

Or maybe I need to check that P is a tuple with is_tuple(P) before doing tuple_to_list(P)...
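For example, something along these lines (just a sketch of the guard idea, reusing the handle_rx above):

%% Same handle_rx as above, but guarded so that a non-tuple message
%% can't hit tuple_to_list/1 and blow up with badarg.
handle_rx(P, St) when is_tuple(P) ->
    PDU = tuple_to_list(P),
    case lists:nth(6, PDU) of
        {submit_sm_resp, MessageId} ->
            SequenceNumber = lists:nth(5, PDU),
            smppque:update_outbox(integer_to_list(SequenceNumber), MessageId);
        _ ->
            log4erl:log(info, "RX --> ~p", [P])
    end,
    {noreply, St};
handle_rx(P, St) ->
    log4erl:log(warn, "unexpected RX --> ~p", [P]),
    {noreply, St}.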

OK, now I think the main problem is the timeout on the call. That badarg appeared only when another request came in after gen_esme34 had already terminated. Why don't you use gen_server cast instead of gen_server call? At least there's no timeout issue when there are too many requests. It was meant to be asynchronous, right? Just a thought...

How about submit_multi? :-)

Owner

essiene commented Apr 7, 2011

On Thu, Apr 7, 2011 at 7:02 AM, chakhedik <reply@reply.github.com> wrote:

OK, now I think the main problem is the timeout on the call. That badarg appeared only when another request came in after gen_esme34 had already terminated. Why don't you use gen_server cast instead of gen_server call? At least there's no timeout issue when there are too many requests. It was meant to be asynchronous, right? Just a thought...

Ahh I see.

Actually, I initially made it a gen_server cast, then during load
testing, I noticed that gen_esme34's mailbox could easily get filled
up if the throughput to the SMSC wasn't high enough and then
gen_esme34 would slow to a crawl and crash. Basically, gen_esme34 was
receiving messages faster than it was pushing out to the network.

The alternate design is supposed to apply some kind of flow control so
that transmission throughput is limited by how fast an actual transmit
happens on the network, and gen_esme34 does not blindly fill up its
mailbox and then die unceremoniously.

What I actually need to do is handle that condition as a real
system limit. Basically, I'm thinking I should allow the timeout for
transmit_pdu to be configurable, and when an actual timeout occurs,
the system returns '{error, etoobusy}'. That way, the caller knows to
back off a bit and then try again when things have calmed down.
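From the caller's side, that would look something like this (a sketch only; the {error, etoobusy} return does not exist yet, and the transmit_pdu/2 arity used here is just an assumption):

%% Hypothetical caller-side retry loop, assuming transmit_pdu/2 can
%% return {error, etoobusy} when the link is saturated.
send_with_backoff(Esme, Pdu) ->
    case gen_esme34:transmit_pdu(Esme, Pdu) of
        {error, etoobusy} ->
            timer:sleep(1000),            %% back off a bit
            send_with_backoff(Esme, Pdu); %% then try again
        Other ->
            Other
    end.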

btw, are you trying to build a full fledged SMPP gateway? Have a look
at http://github.com/essiene/mmyn


-I'm thinking I should allow the timeout for transmit_pdu to be configurable, and when an actual timeout occurs, the system returns '{error, etoobusy}'

That sounds good. Can't wait :-)

P/S: I'm using smpp34 from the dev branch because I need that transmit_pdu/5.

-btw, are you trying to build a full fledged SMPP gateway? Have a look at http://github.com/essiene/mmyn

Interesting. A little how-to would help me understand it better :-)

Owner

essiene commented Apr 7, 2011

On Thu, Apr 7, 2011 at 11:27 AM, chakhedik <reply@reply.github.com> wrote:

-I'm thinking I should allow the timeout for transmit_pdu to be configurable, and when an actual timeout occurs, the system returns '{error, etoobusy}'

That sounds good. Can't wait :-)

-btw, are you trying to build a full fledged SMPP gateway? Have a look at http://github.com/essiene/mmyn

Interesting. A little how-to would help me understand it better :-)

;)

Will do.. will do... will do... Now that someone else is actually
trying to use it apart from me deploying it, this is now a top
priority. I'll put up some preliminary docs to help get started and
then expand it from there.


Owner

essiene commented Apr 12, 2011

On Thu, Apr 7, 2011 at 12:06 PM, Essien Essien essiene@gmail.com wrote:

On Thu, Apr 7, 2011 at 11:27 AM, chakhedik <reply@reply.github.com> wrote:

-I'm thinking I should allow the timeout for transmit_pdu to be configurable, and when an actual timeout occurs, the system returns '{error, etoobusy}'

That sounds good. Can't wait :-)

Been busy coding and load-testing ;)

What I've done is introduce separate synchronous (transmit_pdu/2,3)
and asynchronous (async_transmit_pdu/2,3) APIs for sending PDUs. I've
pushed them to the dev branch now, along with some other changes, even
to the underlying smpp34pdu parsing library.

  1. The synchronous API is still limited in throughput by the actual
     transmission on the wire.
  2. I noticed that if you turn off the error_logger, the entire system
     stays up for longer periods when pushing crazy traffic into it. This
     of course is because error_logger's mailbox grows faster than it can
     write out to file. Resource hogging then becomes a problem for the
     whole system; I have a plan to deal with that using the os_mon memsup
     and cpu_sup applications or something similar.
  3. The cool new API is the asynchronous API. It introduces a new
     gen_smpp34 option 'max_async_transmits' (the default value is infinity
     for backward compatibility). When using the asynchronous API,
     gen_smpp34 tracks how many PDUs have actually made it out to the
     network. If the client is very fast and the unsent PDUs build up to
     the value of max_async_transmits, all further transmits are not
     attempted, but are instead reported as warnings to handle_tx/3 (the
     sketch after this list shows the general idea).
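To make the flow-control idea in item 3 concrete (this is only an illustration of the behaviour, not the actual gen_smpp34 code; the names and return values here are made up):

-module(flow_limit).
-export([new/1, maybe_transmit/2, pdu_sent/1]).

%% Track how many asynchronously submitted PDUs have not yet gone out on
%% the wire, and refuse to attempt new ones above the configured limit.
-record(flowst, {unsent = 0, max = infinity}).

new(Max) ->
    #flowst{max = Max}.

maybe_transmit(Pdu, #flowst{unsent = N, max = Max} = St)
  when is_integer(Max), N >= Max ->
    %% Over the limit: don't attempt this PDU; the real gen_smpp34 would
    %% report a warning to handle_tx/3 at this point.
    {{warning, etoobusy, Pdu}, St};
maybe_transmit(Pdu, #flowst{unsent = N} = St) ->
    %% Under the limit: attempt the transmit and count it as in flight.
    {{transmit, Pdu}, St#flowst{unsent = N + 1}}.

%% Called when the transport reports that one PDU actually went out.
pdu_sent(#flowst{unsent = N} = St) when N > 0 ->
    St#flowst{unsent = N - 1};
pdu_sent(St) ->
    St.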

My only problem is that all through the time the system is overloaded,
it sends a warning for "each" attempted transmission. I'm thinking
instead that it should just send "one" warning, and then when it
finally falls back to a reasonable level it sends an "ok" message, but
then handle_tx/3 may not be the proper place to be sending these
messages, and I'm wary about introducing yet another callback to
gen_smpp34.

If you have the time, can you play around with this API and tell me
your gut feeling? I'm leaning more towards introducing a
handle_overload/2 and handle_overload_recover/2 which will get
messages like:

handle_overload(cpu, St) ->
    {noreply, St};
handle_overload(memory, St) ->
    {noreply, St};
handle_overload(transmit_overflow, St) ->
    {noreply, St};

etc.

-btw, are you trying to build a full fledged SMPP gateway? Have a look at http://github.com/essiene/mmyn

Interesting. A little how-to would help me understand it better :-)

;)

Will do.. will do... will do... Now that someone else is actually
trying to use it apart from me deploying it, this is now a top
priority. I'll put up some preliminary docs to help get started and
then expand it from there.

Started work on this; should have something palatable by Wednesday :)


Owner

essiene commented Apr 12, 2011

On Wed, Apr 6, 2011 at 11:21 AM, Essien Essien essiene@gmail.com wrote:

On Wed, Apr 6, 2011 at 8:53 AM, chakhedik <reply@reply.github.com> wrote:

Hi,

Today I tried to send 5k SMS, but it crashed at around 3k+ with the following error:

[error] 2011-04-06 15:42:41.698
** Generic server <0.120.0> terminating
** Last message in was {<0.121.0>,
                       {pdu,26,2147483652,0,71038063,
                            {submit_sm_resp,"600931801"}}}
** When Server state == {st_rx,<0.117.0>,#Ref<0.0.0.4221>,<0.118.0>,
                              #Ref<0.0.0.4222>,<0.121.0>,#Ref<0.0.0.4228>,
                              <0.122.0>,#Ref<0.0.0.4247>,<0.115.0>}
** Reason for termination ==
** {normal,{gen_server,call,
                      [<0.117.0>,
                       {deliver,<0.120.0>,
                                {pdu,26,2147483652,0,71038063,
                                     {submit_sm_resp,"600931801"}}}]}}

I've consistently gotten smpp34_rx crashing when memory runs out. This
happens when some other rogue part of the system is consuming memory
and not releasing it fast enough, like the error_logger in the examples.
For this case, I have pushed a new branch where I'm testing a new idea:

  1. Catch all timeouts and return an {error, timeout} tuple back
     all the way to the tcprx module.
  2. When timeouts occur, suspend network receive and back off (for now
     a hard backoff of 5 seconds is set; this will be configurable before
     I merge this into dev and eventually into master).
  3. After the backoff, the receiver keeps trying until it stops getting
     the timeout error, then it resumes full network receive.

This work is going on in the 'throttled_network_rx' branch; the rough shape of the retry loop is sketched below.
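(Sketch only; the function name and backoff argument here are illustrative, not the actual tcprx internals.)

%% Illustrative sketch: deliver a PDU upstream; if the call times out,
%% stop pulling from the socket, sleep for BackoffMs, and retry the same
%% PDU until the timeout clears, then resume normal receiving.
deliver_with_backoff(Upstream, Pdu, BackoffMs) ->
    try gen_server:call(Upstream, {deliver, Pdu}) of
        Reply ->
            {ok, Reply}
    catch
        exit:{timeout, _} ->
            timer:sleep(BackoffMs),
            deliver_with_backoff(Upstream, Pdu, BackoffMs)
    end.

e.g. deliver_with_backoff(RxPid, Pdu, 5000) for the current hard 5-second backoff.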
