Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignore read and write errors if vio has been cleared #1522

Merged
merged 1 commit into from Mar 8, 2017

Conversation

bryancall
Copy link
Contributor

No description provided.

@bryancall
Copy link
Contributor Author

Here is a workaround for issue #1401. I ran into issues with the read also coring.

@bryancall bryancall self-assigned this Mar 1, 2017
@bryancall bryancall added this to the 7.1.0 milestone Mar 1, 2017
@bryancall bryancall requested review from zwoop and shinrich March 1, 2017 17:08
@atsci
Copy link

atsci commented Mar 1, 2017

FreeBSD build successful! See https://ci.trafficserver.apache.org/job/freebsd-github/1660/ for details.

@atsci
Copy link

atsci commented Mar 1, 2017

Linux build successful! See https://ci.trafficserver.apache.org/job/linux-github/1556/ for details.

@atsci
Copy link

atsci commented Mar 1, 2017

Intel CC build successful! See https://ci.trafficserver.apache.org/job/icc-github/92/ for details.

@atsci
Copy link

atsci commented Mar 1, 2017

clang-analyzer build successful! See https://ci.trafficserver.apache.org/job/clang-analyzer-github/224/ for details.

@shinrich
Copy link
Member

shinrich commented Mar 1, 2017

Looks reasonable to me.

@zwoop
Copy link
Contributor

zwoop commented Mar 1, 2017

Testing this on docs, it's similar to #1444 which did not solve the problems for us there (completely at least), hopefully the additions to the read case helps.

@zwoop
Copy link
Contributor

zwoop commented Mar 2, 2017

Still looking good on Docs, but would like to give it at least another 24 hours before we declare victory.

@oknet
Copy link
Member

oknet commented Mar 3, 2017

to review #947 again, my suggestion is

diff --git a/iocore/net/UnixNet.cc b/iocore/net/UnixNet.cc
index 85d7527..5ef65cb 100644
--- a/iocore/net/UnixNet.cc
+++ b/iocore/net/UnixNet.cc
@@ -505,7 +505,7 @@ NetHandler::mainNetEvent(int event, Event *e)
     set_cont_flags(vc->control_flags);
     if (vc->closed)
       close_UnixNetVConnection(vc, trigger_event->ethread);
-    else if ((vc->read.enabled || vc->read.error) && vc->read.triggered)
+    else if ((vc->read.enabled || vc->read.error) && vc->read.triggered && vc->read.vio._cont != nullptr)
       vc->net_read_io(this, trigger_event->ethread);
     else if (!vc->read.enabled) {
       read_ready_list.remove(vc);
@@ -522,7 +522,7 @@ NetHandler::mainNetEvent(int event, Event *e)
     set_cont_flags(vc->control_flags);
     if (vc->closed)
       close_UnixNetVConnection(vc, trigger_event->ethread);
-    else if ((vc->write.enabled || vc->write.error) && vc->write.triggered)
+    else if ((vc->write.enabled || vc->write.error) && vc->write.triggered && vc->write.vio._cont != nullptr)
       write_to_net(this, vc, trigger_event->ethread);
     else if (!vc->write.enabled) {
       write_ready_list.remove(vc);
@@ -540,7 +540,7 @@ NetHandler::mainNetEvent(int event, Event *e)
     diags->set_override(vc->control.debug_override);
     if (vc->closed)
       close_UnixNetVConnection(vc, trigger_event->ethread);
-    else if ((vc->read.enabled || vc->read.error) && vc->read.triggered)
+    else if ((vc->read.enabled || vc->read.error) && vc->read.triggered && vc->read.vio._cont != nullptr)
       vc->net_read_io(this, trigger_event->ethread);
     else if (!vc->read.enabled)
       vc->ep.modify(-EVENTIO_READ);
@@ -549,7 +549,7 @@ NetHandler::mainNetEvent(int event, Event *e)
     diags->set_override(vc->control.debug_override);
     if (vc->closed)
       close_UnixNetVConnection(vc, trigger_event->ethread);
-    else if ((vc->write.enabled || vc->write.error) && vc->write.triggered)
+    else if ((vc->write.enabled || vc->write.error) && vc->write.triggered && vc->write.vio._cont != nullptr)
       write_to_net(this, vc, trigger_event->ethread);
     else if (!vc->write.enabled)
       vc->ep.modify(-EVENTIO_WRITE);

Before PR#947:

  • IOCore Net Sub-System only callback to SM if vc->read.enabled is set to 1 and vc->read.triggered is set to 1.
  • vc->read.enabled must not 0 if vc->read.vio._cont is not nullptr.
  • there are 2 ways to set the read.enabled to 0:
    • do_io_read(NULL, 0, NULL), both set vio._cont to nullptr
    • do_io_read(this, 0, NULL), keep vio._cont not nullptr
  • that means read.enabled==1 is indicate vio._cont is not nullptr.
  • but read.enabled==0 is not means vio._cont is nullptr.

The target of PR#947 is making iocore to notify EPOLLERR to SM even read.enabled is set to 0.
But the PR#947 forgot to finger out vio._cont is nullptr state.

the vc->net_read_io is designed to callback SM and Net sub-system must assure the vio can be callback.

@oknet
Copy link
Member

oknet commented Mar 3, 2017

or we can use:

if (vc->write.enabled && vc->write.triggered || vc->write.error && vc->write.vio._cont != nullptr)

keep old condition then add new condition.

@oknet oknet changed the title Ingore read and write errors if vio has been cleared Ignore read and write errors if vio has been cleared Mar 3, 2017
@scw00
Copy link
Member

scw00 commented Mar 6, 2017

I think, it #never calls handler with read.enabled = 0(or write.enabled). It means the handler don not want this type of event .It may never handle this event and just assert!
#1531

@bryancall
Copy link
Contributor Author

@oknet Thank for the suggestion I am running in 7.1.0 in production with the change you mention above instead of this PR

diff --git a/iocore/net/UnixNet.cc b/iocore/net/UnixNet.cc
index 85d7527ec..636b9a255 100644
--- a/iocore/net/UnixNet.cc
+++ b/iocore/net/UnixNet.cc
@@ -505,7 +505,7 @@ NetHandler::mainNetEvent(int event, Event *e)
     set_cont_flags(vc->control_flags);
     if (vc->closed)
       close_UnixNetVConnection(vc, trigger_event->ethread);
-    else if ((vc->read.enabled || vc->read.error) && vc->read.triggered)
+    else if (vc->read.enabled && vc->read.triggered || vc->read.error && vc->read.vio._cont != nullptr)
       vc->net_read_io(this, trigger_event->ethread);
     else if (!vc->read.enabled) {
       read_ready_list.remove(vc);
@@ -522,7 +522,7 @@ NetHandler::mainNetEvent(int event, Event *e)
     set_cont_flags(vc->control_flags);
     if (vc->closed)
       close_UnixNetVConnection(vc, trigger_event->ethread);
-    else if ((vc->write.enabled || vc->write.error) && vc->write.triggered)
+    else if (vc->write.enabled && vc->write.triggered || vc->write.error && vc->write.vio._cont != nullptr)
       write_to_net(this, vc, trigger_event->ethread);
     else if (!vc->write.enabled) {
       write_ready_list.remove(vc);
@@ -540,7 +540,7 @@ NetHandler::mainNetEvent(int event, Event *e)
     diags->set_override(vc->control.debug_override);
     if (vc->closed)
       close_UnixNetVConnection(vc, trigger_event->ethread);
-    else if ((vc->read.enabled || vc->read.error) && vc->read.triggered)
+    else if (vc->read.enabled && vc->read.triggered || vc->read.error && vc->read.vio._cont != nullptr)
       vc->net_read_io(this, trigger_event->ethread);
     else if (!vc->read.enabled)
       vc->ep.modify(-EVENTIO_READ);
@@ -549,7 +549,7 @@ NetHandler::mainNetEvent(int event, Event *e)
     diags->set_override(vc->control.debug_override);
     if (vc->closed)
       close_UnixNetVConnection(vc, trigger_event->ethread);
-    else if ((vc->write.enabled || vc->write.error) && vc->write.triggered)
+    else if (vc->write.enabled && vc->write.triggered || vc->write.error && vc->write.vio._cont != nullptr)
       write_to_net(this, vc, trigger_event->ethread);
     else if (!vc->write.enabled)
       vc->ep.modify(-EVENTIO_WRITE);

@atsci
Copy link

atsci commented Mar 6, 2017

FreeBSD build failed! See https://ci.trafficserver.apache.org/job/freebsd-github/1680/ for details.

@bryancall
Copy link
Contributor Author

bryancall commented Mar 6, 2017

I updated the PR to use oknet's recommendation. I am going to test a another version of this fix to be:

diff --git a/iocore/net/UnixNet.cc b/iocore/net/UnixNet.cc
index 636b9a2..c7f1798 100644
--- a/iocore/net/UnixNet.cc
+++ b/iocore/net/UnixNet.cc
@@ -505,7 +505,7 @@ NetHandler::mainNetEvent(int event, Event *e)
     set_cont_flags(vc->control_flags);
     if (vc->closed)
       close_UnixNetVConnection(vc, trigger_event->ethread);
-    else if (vc->read.enabled && vc->read.triggered || vc->read.error && vc->read.vio._cont != nullptr)
+    else if (vc->read.triggered && (vc->read.enabled || (vc->read.error && vc->read.vio._cont != nullptr)))
       vc->net_read_io(this, trigger_event->ethread);
     else if (!vc->read.enabled) {
       read_ready_list.remove(vc);
@@ -522,7 +522,7 @@ NetHandler::mainNetEvent(int event, Event *e)
     set_cont_flags(vc->control_flags);
     if (vc->closed)
       close_UnixNetVConnection(vc, trigger_event->ethread);
-    else if (vc->write.enabled && vc->write.triggered || vc->write.error && vc->write.vio._cont != nullptr)
+    else if (vc->write.triggered && (vc->write.enabled || (vc->write.error && vc->write.vio._cont != nullptr)))
       write_to_net(this, vc, trigger_event->ethread);
     else if (!vc->write.enabled) {
       write_ready_list.remove(vc);
@@ -540,7 +540,7 @@ NetHandler::mainNetEvent(int event, Event *e)
     diags->set_override(vc->control.debug_override);
     if (vc->closed)
       close_UnixNetVConnection(vc, trigger_event->ethread);
-    else if (vc->read.enabled && vc->read.triggered || vc->read.error && vc->read.vio._cont != nullptr)
+    else if (vc->read.triggered && (vc->read.enabled || (vc->read.error && vc->read.vio._cont != nullptr)))
       vc->net_read_io(this, trigger_event->ethread);
     else if (!vc->read.enabled)
       vc->ep.modify(-EVENTIO_READ);
@@ -549,7 +549,7 @@ NetHandler::mainNetEvent(int event, Event *e)
     diags->set_override(vc->control.debug_override);
     if (vc->closed)
       close_UnixNetVConnection(vc, trigger_event->ethread);
-    else if (vc->write.enabled && vc->write.triggered || vc->write.error && vc->write.vio._cont != nullptr)
+    else if (vc->write.triggered && (vc->write.enabled || (vc->write.error && vc->write.vio._cont != nullptr)))
       write_to_net(this, vc, trigger_event->ethread);
     else if (!vc->write.enabled)
       vc->ep.modify(-EVENTIO_WRITE);

@atsci
Copy link

atsci commented Mar 6, 2017

Linux build failed! See https://ci.trafficserver.apache.org/job/linux-github/1576/ for details.

@postwait
Copy link
Contributor

postwait commented Mar 6, 2017

Given the mixing of || and &&, I'd highly suggest superfluous parentheses for clarity.

nm... looks like this was done.

@atsci
Copy link

atsci commented Mar 6, 2017

Intel CC build successful! See https://ci.trafficserver.apache.org/job/icc-github/112/ for details.

@atsci
Copy link

atsci commented Mar 6, 2017

Linux build successful! See https://ci.trafficserver.apache.org/job/linux-github/1577/ for details.

@atsci
Copy link

atsci commented Mar 6, 2017

Intel CC build successful! See https://ci.trafficserver.apache.org/job/icc-github/113/ for details.

@atsci
Copy link

atsci commented Mar 6, 2017

FreeBSD build successful! See https://ci.trafficserver.apache.org/job/freebsd-github/1681/ for details.

@scw00
Copy link
Member

scw00 commented Mar 7, 2017

Actually, there is still a problem in my test after this patch according to jtest.

#0  0x00007ffff2be9c37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007ffff2bed028 in __GI_abort () at abort.c:89
#2  0x00007ffff4bd8e55 in ink_abort (message_format=0x7ffff4c1da40 "%s:%d: failed assertion `%s`") at ink_error.cc:99
#3  0x00007ffff4bd34bb in _ink_assert (expression=0xb13620 "server_entry->read_vio == (VIO *)data", file=0xb11420 "HttpSM.cc", line=1827)
    at ink_assert.cc:37
#4  0x000000000068e24e in HttpSM::state_read_server_response_header (this=0x60700016e100, event=3, data=0x603c0039d3a0) at HttpSM.cc:1827
#5  0x0000000000695861 in HttpSM::main_handler (this=0x60700016e100, event=3, data=0x603c0039d3a0) at HttpSM.cc:2679
#6  0x0000000000539f32 in Continuation::handleEvent (this=0x60700016e100, event=3, data=0x603c0039d3a0)
    at /root/ATS-7.0/trafficserver/iocore/eventsystem/I_Continuation.h:153
#7  0x0000000000a2a2e0 in write_signal_and_update (event=3, vc=0x603c0039d200) at UnixNetVConnection.cc:176
#8  0x0000000000a2a6b2 in write_signal_done (event=3, nh=0x7fffef217650, vc=0x603c0039d200) at UnixNetVConnection.cc:218
#9  0x0000000000a2a7b8 in write_signal_error (nh=0x7fffef217650, vc=0x603c0039d200, lerrno=32) at UnixNetVConnection.cc:237
#10 0x0000000000a2bf13 in write_to_net_io (nh=0x7fffef217650, vc=0x603c0039d200, thread=0x7fffef213800) at UnixNetVConnection.cc:453
#11 0x0000000000a2bbe9 in write_to_net (nh=0x7fffef217650, vc=0x603c0039d200, thread=0x7fffef213800) at UnixNetVConnection.cc:430
#12 0x0000000000a19be5 in NetHandler::mainNetEvent (this=0x7fffef217650, event=5, e=0x601200005ab0) at UnixNet.cc:526
#13 0x0000000000539f32 in Continuation::handleEvent (this=0x7fffef217650, event=5, data=0x601200005ab0)
    at /root/ATS-7.0/trafficserver/iocore/eventsystem/I_Continuation.h:153
#14 0x0000000000a764af in EThread::process_event (this=0x7fffef213800, e=0x601200005ab0, calling_code=5) at UnixEThread.cc:143
#15 0x0000000000a7713f in EThread::execute (this=0x7fffef213800) at UnixEThread.cc:270
#16 0x0000000000a74dfc in spawn_thread_internal (a=0x600800015a50) at Thread.cc:84
#17 0x00007ffff4e63b98 in ?? () from /usr/lib/x86_64-linux-gnu/libasan.so.0
#18 0x00007ffff39a4184 in start_thread (arg=0x7fffe3d5a700) at pthread_create.c:312
#19 0x00007ffff2cad37d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

We got the error event and callback to SM, but SM do not want to handle this write event , because write.enabled is 0, then SM assert ! The backtrace is the same as #1531 .
I think if enabled is 0 , we should not callback to any SM. After all, 6.0.x logic may be better!

Here is 6.x.x

  while ((vc = write_ready_list.dequeue())) {
    set_cont_flags(vc->control_flags);
    if (vc->closed)
      close_UnixNetVConnection(vc, trigger_event->ethread);
    else if (vc->write.enabled && vc->write.triggered)
      write_to_net(this, vc, trigger_event->ethread);
    else if (!vc->write.enabled) {
      write_ready_list.remove(vc);

@bryancall
Copy link
Contributor Author

@scw00 What are server_entry->read_vio and data set to when it asserts?

@atsci
Copy link

atsci commented Mar 7, 2017

FreeBSD build successful! See https://ci.trafficserver.apache.org/job/freebsd-github/1684/ for details.

@scw00
Copy link
Member

scw00 commented Mar 7, 2017

(gdb) p *server_entry->read_vio
$3 = {_cont = 0x607000476100, nbytes = 9223372036854775807, ndone = 0, op = 1, buffer = {name = 0x0, mbuf = 0x6024000cebc0, entry = 0x0}, 
  vc_server = 0x603c00270900, mutex = {m_ptr = 0x601800129ac0}}
(gdb) p (VIO *)data
$4 = (VIO *) 0x603c00270aa0
(gdb) p *(VIO *)data
$5 = {_cont = 0x607000476100, nbytes = 172, ndone = 172, op = 2, buffer = {name = 0x0, mbuf = 0x6024000ce6c0, entry = 0x6024000ce6d8}, 
  vc_server = 0x603c00270900, mutex = {m_ptr = 0x601800129ac0}}
(gdb) 

@atsci
Copy link

atsci commented Mar 7, 2017

Linux build successful! See https://ci.trafficserver.apache.org/job/linux-github/1580/ for details.

@atsci
Copy link

atsci commented Mar 7, 2017

Intel CC build successful! See https://ci.trafficserver.apache.org/job/icc-github/116/ for details.

@oknet
Copy link
Member

oknet commented Mar 7, 2017

@bryancall @scw00 after #947 , We must change the SM to handle EVENT_ERROR event just like EVENT_TIMEOUT.

before #947 , an ERROR indicate a VIO error. after #947 , an ERROR indicate a VC error.

And we should combine vc->read.error & vc->write.error to one event if the vc->read._cont and vc->write._cont point to same SM. It is just like the EVENT_TIMEOUT callback that we do it in UnixNetVC::mainEvent().

@atsci
Copy link

atsci commented Mar 7, 2017

clang-analyzer build successful! See https://ci.trafficserver.apache.org/job/clang-analyzer-github/248/ for details.

@shinrich
Copy link
Member

shinrich commented Mar 7, 2017

I think the issue raised by @scw00 is different. It looks like what @zwoop identified in issue #1531. In that case the write vio errored, so the write vio is being sent up as data to a handler expecting only a read vio. In the error case it shouldn't matter whether it is a read or a write vio and the error clean up should occur regardless.

@shinrich
Copy link
Member

shinrich commented Mar 7, 2017

I ran my production box overnight with the latest work around. I think this change is a step in the right direction and we should merge it and bring it back to 7.1.

@atsci
Copy link

atsci commented Mar 7, 2017

Linux build successful! See https://ci.trafficserver.apache.org/job/RAT-github/2/ for details.

@scw00
Copy link
Member

scw00 commented Mar 7, 2017

Agree!!

@bryancall bryancall merged commit a128d56 into apache:master Mar 8, 2017
oknet added a commit to oknet/trafficserver that referenced this pull request Mar 9, 2017
After apache#947 (c1ac5f) and apache#1522 (a128d5) , the EVENT_ERROR leads by
EPOLLERR will be sent to read.vio._cont first and then write.vio._cont.

The reader SM could close or shutdown(WRITE) the VC, but we do not check
these operations before we callback write.vio._cont. The SM would
received EVENT_ERROR twice if write.vio._cont == read.vio._cont.
oknet added a commit to oknet/trafficserver that referenced this pull request Mar 9, 2017
After apache#947 (c1ac5f) and apache#1522 (a128d5) , the EVENT_ERROR leads by
EPOLLERR will be sent to read.vio._cont first and then write.vio._cont.

The reader SM could close or shutdown(WRITE) the VC, but we do not check
these operations before we callback write.vio._cont. The SM would
received EVENT_ERROR twice if write.vio._cont == read.vio._cont.
oknet added a commit to oknet/trafficserver that referenced this pull request Mar 9, 2017
After apache#947 (c1ac5f) and apache#1522 (a128d5) , the EVENT_ERROR leads by
EPOLLERR will be sent to read.vio._cont first and then write.vio._cont.

The reader SM could close or shutdown(WRITE) the VC, but we do not check
these operations before we callback write.vio._cont. The SM would
received EVENT_ERROR twice if write.vio._cont == read.vio._cont.
oknet added a commit to oknet/trafficserver that referenced this pull request Mar 9, 2017
After apache#947 (c1ac5f) and apache#1522 (a128d5) , the EVENT_ERROR which caused by
EPOLLERR will be sent to read.vio._cont first and then write.vio._cont.

The reader SM could close or shutdown(WRITE) the VC, but we do not check
these operations before we callback write.vio._cont. The SM would
received EVENT_ERROR twice if write.vio._cont == read.vio._cont.
zwoop pushed a commit that referenced this pull request Mar 9, 2017
After #947 (c1ac5f) and #1522 (a128d5) , the EVENT_ERROR which caused by
EPOLLERR will be sent to read.vio._cont first and then write.vio._cont.

The reader SM could close or shutdown(WRITE) the VC, but we do not check
these operations before we callback write.vio._cont. The SM would
received EVENT_ERROR twice if write.vio._cont == read.vio._cont.
zwoop pushed a commit that referenced this pull request Mar 9, 2017
After #947 (c1ac5f) and #1522 (a128d5) , the EVENT_ERROR which caused by
EPOLLERR will be sent to read.vio._cont first and then write.vio._cont.

The reader SM could close or shutdown(WRITE) the VC, but we do not check
these operations before we callback write.vio._cont. The SM would
received EVENT_ERROR twice if write.vio._cont == read.vio._cont.

(cherry picked from commit aee3f3b)
vmamidi added a commit to vmamidi/trafficserver that referenced this pull request Jun 8, 2017
@vmamidi vmamidi mentioned this pull request Jun 8, 2017
vmamidi added a commit that referenced this pull request Jun 8, 2017
…s to the VConnection"

this reverts PRs #1559, #1522 and #947

This reverts commit c1ac5f8.
vmamidi added a commit to vmamidi/trafficserver that referenced this pull request Jun 8, 2017
This reverts PRs apache#1559, apache#1522 and apache#947

PR apache#947 made the HTTP state machine unstable and lead to crashes in production like apache#1930 apache#1559 apache#1522 apache#1531 apache#1629

This reverts commit c1ac5f8.
zwoop pushed a commit that referenced this pull request Jun 8, 2017
This reverts PRs #1559, #1522 and #947

PR #947 made the HTTP state machine unstable and lead to crashes in production like #1930 #1559 #1522 #1531 #1629

This reverts commit c1ac5f8.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants