Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assertion in state_read_server_response_header (v7.1.0) #1531

Closed
zwoop opened this issue Mar 3, 2017 · 10 comments
Closed

Assertion in state_read_server_response_header (v7.1.0) #1531

zwoop opened this issue Mar 3, 2017 · 10 comments
Labels
Backport Marked for backport for an LTS patch release Crash Network
Milestone

Comments

@zwoop
Copy link
Contributor

zwoop commented Mar 3, 2017

Seeing this on docs (infrequently):

(gdb) bt full
#0  0x00007ffff57b21d7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
        resultvar = 0
        pid = 5856
        selftid = 5862
#1  0x00007ffff57b38c8 in __GI_abort () at abort.c:90
        save_stage = 2
        act = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, sa_mask = {__val = {0 <repeats 16 times>}}, sa_flags = 0, sa_restorer = 0x300a00440}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#2  0x00007ffff7bb28fe in ink_abort (message_format=0x7ffff7bc9160 "%s:%d: failed assertion `%s`") at ink_error.cc:99
        ap = {{gp_offset = 32, fp_offset = 48, overflow_arg_area = 0x2aaaaaffe960, reg_save_area = 0x2aaaaaffe8a0}}
#3  0x00007ffff7baffbf in _ink_assert (expression=0x902628 "server_entry->read_vio == (VIO *)data", file=0x9018ec "HttpSM.cc", line=1811) at ink_assert.cc:37
No locals.
#4  0x000000000064dbac in HttpSM::state_read_server_response_header (this=0x2aaab2ecc300, event=3, data=0x2aaadfe10fe0) at HttpSM.cc:1811
        __FUNCTION__ = "state_read_server_response_header"
        bytes_used = -1426065664
        vio = 0x0
        state = 10922
#5  0x0000000000650eca in HttpSM::main_handler (this=0x2aaab2ecc300, event=3, data=0x2aaadfe10fe0) at HttpSM.cc:2663
        jump_point = (int (HttpSM::*)(HttpSM * const, int, void *)) 0x64da58 <HttpSM::state_read_server_response_header(int, void*)>
        __FUNCTION__ = "main_handler"
        vc_entry = 0x2aaab2ecdb08
#6  0x000000000057472a in Continuation::handleEvent (this=0x2aaab2ecc300, event=3, data=0x2aaadfe10fe0) at /usr/local/src/trafficserver/iocore/eventsystem/I_Continuation.h:153
No locals.
#7  0x0000000000806eed in write_signal_and_update (event=3, vc=0x2aaadfe10e40) at UnixNetVConnection.cc:176
        __FUNCTION__ = "write_signal_and_update"
#8  0x0000000000807124 in write_signal_done (event=3, nh=0x7ffff3009e50, vc=0x2aaadfe10e40) at UnixNetVConnection.cc:218
No locals.
#9  0x00000000008071b4 in write_signal_error (nh=0x7ffff3009e50, vc=0x2aaadfe10e40, lerrno=32) at UnixNetVConnection.cc:237
No locals.
#10 0x00000000008080d0 in write_to_net_io (nh=0x7ffff3009e50, vc=0x2aaadfe10e40, thread=0x7ffff3006000) at UnixNetVConnection.cc:463
        err = 32
        errlen = 4
        s = 0x2aaadfe10fd8
        mutex = 0x7ffff3512db0
        lock = {m = {m_ptr = 0x2aaab0402ca0}, lock_acquired = true}
        __FUNCTION__ = "write_to_net_io"
        ntodo = 910
        buf = <error reading variable>
        towrite = 46913388875328
        signalled = 0
        needs = 32
        total_written = 4
        r = 140737294027776
#11 0x0000000000807ead in write_to_net (nh=0x7ffff3009e50, vc=0x2aaadfe10e40, thread=0x7ffff3006000) at UnixNetVConnection.cc:435
        mutex = 0x7ffff3512db0
#12 0x00000000007ff2d8 in NetHandler::mainNetEvent (this=0x7ffff3009e50, event=5, e=0x2aaaabc05ca0) at UnixNet.cc:526
        epd = 0x2aaadfe10da8
        poll_timeout = 0
        pd = 0x7ffff2f1d000
        vc = 0x2aaadfe10e40
        __FUNCTION__ = "mainNetEvent"
#13 0x000000000057472a in Continuation::handleEvent (this=0x7ffff3009e50, event=5, data=0x2aaaabc05ca0) at /usr/local/src/trafficserver/iocore/eventsystem/I_Continuation.h:153
No locals.
#14 0x000000000082b522 in EThread::process_event (this=0x7ffff3006000, e=0x2aaaabc05ca0, calling_code=5) at UnixEThread.cc:143
        c_temp = 0x7ffff3009e50
        lock = {m = {m_ptr = 0x7ffff3512480}, lock_acquired = true}
        __FUNCTION__ = "process_event"
#15 0x000000000082bb01 in EThread::execute (this=0x7ffff3006000) at UnixEThread.cc:270
        done_one = false
        e = 0x2aaaabc05ca0
        NegativeQueue = {<DLL<Event, Event::Link_link>> = {head = 0x2aaaabc04f80}, tail = 0x2aaaabc041a0}
        next_time = 1488388269045403248
        __FUNCTION__ = "execute"
#16 0x000000000082ab3c in spawn_thread_internal (a=0x7ffff4539fa0) at Thread.cc:84
        p = 0x7ffff4539fa0
#17 0x00007ffff604bdc5 in start_thread (arg=0x2aaaaafff700) at pthread_create.c:308
        __res = <optimized out>
        pd = 0x2aaaaafff700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {46912501708544, 8358616470746199031, 0, 140737488348672, 46912501708544, 0, 2786198315644908535, -8358635357929704457}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
#18 0x00007ffff587473d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
No locals.
@zwoop zwoop added Backport Marked for backport for an LTS patch release Crash Network labels Mar 3, 2017
@zwoop zwoop added this to the 7.1.0 milestone Mar 3, 2017
@shinrich
Copy link
Member

shinrich commented Mar 3, 2017

The server_entry->read_vio is NULL. From what I can tell this value only gets set due to do_io_read() calls in attach_server_session() and setup_server_read_response_header(). Presumably one of these do_io_read operations failed.

Looking at the stack, we had a failure which is getting trigger via the write path (errno 32 broken pipe).

I'm guessing that we should have started handling the failure where the do_io_read failed presumably in setup_server_read_response_header().

@zwoop
Copy link
Contributor Author

zwoop commented Mar 3, 2017

Hmmm, I see

(gdb) frame 4
#4  0x000000000064dbac in HttpSM::state_read_server_response_header (this=0x2aaab2ecc300, event=3, data=0x2aaadfe10fe0) at HttpSM.cc:1811
1811	  ink_assert(server_entry->read_vio == (VIO *)data);
(gdb) p *server_entry
$5 = {vc = 0x2aaae0604c60, read_buffer = 0x0, write_buffer = 0x0, read_vio = 0x2aaadfe10f70, write_vio = 0x2aaadfe10fe0, vc_handler = (int (HttpSM::*)(HttpSM * const, int, void *)) 0x64da58 <HttpSM::state_read_server_response_header(int, void*)>, vc_type = HTTP_SERVER_VC, eos = true, in_tunnel = false}
(gdb) p (VIO*)data
$6 = (VIO *) 0x2aaadfe10fe0
(gdb) print server_entry->read_vio
$7 = (VIO *) 0x2aaadfe10f70

So, as far as I can tell, "data" is set to the write_vio, but the assert checks against the read_vio.

@shinrich
Copy link
Member

shinrich commented Mar 3, 2017

Oh, nevermind. I saw vio = NULL in the stack, but that is the local variable.

That makes sense that the event would be passing up the write vio. It is a write event we are processing. But it looks like the handler is only expecting a read event. I guess a broken pipe shows up as a write event.

@scw00
Copy link
Member

scw00 commented Mar 6, 2017

I think, it #never calls handler with read.enabled = 0(or write.enabled). It means the handler don not want this type of event .It may never handle this event and just assert!
#1522

@oknet
Copy link
Member

oknet commented Mar 6, 2017

before #947, the EPOLLERR event only set write.triggered to 1, iocore callback SM only if write.enabled==1 and write.triggered==1.
after #947 , the EPOLLERR set write.error to 1, iocore could callback SM if write.error==1 and write.triggered==1 even write.enable==0. (the assert you met in this isssue).

@zwoop , please try to fix this issue by merge #1522 .

@oknet
Copy link
Member

oknet commented Mar 7, 2017

@zwoop @shinrich It could run into state_read_server_response_header twice if the data is point to write.vio and the handler function doest not do_io_close on vc while the 1st run into it.

after #947 , iocore set read.error and write.error to 1 if EPOLLERR that means for that VC iocore will callback read._cont with EVENT_ERROR first and then callback write._cont with EVENT_ERROR if the VC is not disabled or closed in the first callback.

I think the SM can not receive EVENT_ERROR twice just like it can not receive EVENT_TIMEOUT twice (UnixNetVConnection::mainEvent will ignore the write VIO callback if its _cont is the same as read VIO).

@oknet
Copy link
Member

oknet commented Mar 9, 2017

I created a PR #1559 to resolve this issue.

@zwoop zwoop closed this as completed Mar 9, 2017
@zwoop
Copy link
Contributor Author

zwoop commented Apr 2, 2017

I got this again on Docs.

(gdb) bt full
#0  0x00002b5b8bd881d7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
        resultvar = 0
        pid = 24448
        selftid = 24454
#1  0x00002b5b8bd89a08 in __GI_abort () at abort.c:119
        act = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, sa_mask = {__val = {18446744073709551615 <repeats 16 times>}}, sa_flags = 0, sa_restorer = 0x0}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#2  0x00002b5b89af48fe in ink_abort (message_format=0x2b5b89b0b160 "%s:%d: failed assertion `%s`") at ink_error.cc:99
        ap = {{gp_offset = 32, fp_offset = 48, overflow_arg_area = 0x2aaaab5fe960, reg_save_area = 0x2aaaab5fe8a0}}
#3  0x00002b5b89af1fbf in _ink_assert (expression=0x902748 "server_entry->read_vio == (VIO *)data", file=0x901a0c "HttpSM.cc", line=1811) at ink_assert.cc:37
No locals.
#4  0x000000000064dcc8 in HttpSM::state_read_server_response_header (this=0x2aaab14ee380, event=3, data=0x2aaaeea0a910) at HttpSM.cc:1811
        __FUNCTION__ = "state_read_server_response_header"
        bytes_used = -1419774208
        vio = 0x0
        state = 10922
#5  0x0000000000650fe6 in HttpSM::main_handler (this=0x2aaab14ee380, event=3, data=0x2aaaeea0a910) at HttpSM.cc:2663
        jump_point = (int (HttpSM::*)(HttpSM * const, int, void *)) 0x64db74 <HttpSM::state_read_server_response_header(int, void*)>
        __FUNCTION__ = "main_handler"
        vc_entry = 0x2aaab14efb88
#6  0x0000000000574752 in Continuation::handleEvent (this=0x2aaab14ee380, event=3, data=0x2aaaeea0a910) at /usr/local/src/trafficserver/iocore/eventsystem/I_Continuation.h:153
No locals.
#7  0x0000000000807027 in write_signal_and_update (event=3, vc=0x2aaaeea0a770) at UnixNetVConnection.cc:176
        __FUNCTION__ = "write_signal_and_update"
#8  0x000000000080725e in write_signal_done (event=3, nh=0x2b5b8e409e50, vc=0x2aaaeea0a770) at UnixNetVConnection.cc:218
No locals.
#9  0x00000000008072ee in write_signal_error (nh=0x2b5b8e409e50, vc=0x2aaaeea0a770, lerrno=32) at UnixNetVConnection.cc:237
No locals.
#10 0x00000000008081de in write_to_net_io (nh=0x2b5b8e409e50, vc=0x2aaaeea0a770, thread=0x2b5b8e406000) at UnixNetVConnection.cc:468
        err = 32
        errlen = 4
        s = 0x2aaaeea0a908
        mutex = 0x2b5b8d914c60
        lock = {m = {m_ptr = 0x2aaac4e02ed0}, lock_acquired = true}
        __FUNCTION__ = "write_to_net_io"
        ntodo = 959
        buf = <error reading variable>
        towrite = 46913636312944
        signalled = 0
        needs = 32
        total_written = 4
        r = 47672214753344
#11 0x0000000000808023 in write_to_net (nh=0x2b5b8e409e50, vc=0x2aaaeea0a770, thread=0x2b5b8e406000) at UnixNetVConnection.cc:443
        mutex = 0x2b5b8d914c60
#12 0x00000000007ff412 in NetHandler::mainNetEvent (this=0x2b5b8e409e50, event=5, e=0x2aaaabc05a60) at UnixNet.cc:526
        epd = 0x2aaaeea0ca18
        poll_timeout = 0
        pd = 0x2b5b8ea99000
        vc = 0x2aaaeea0a770
        __FUNCTION__ = "mainNetEvent"
#13 0x0000000000574752 in Continuation::handleEvent (this=0x2b5b8e409e50, event=5, data=0x2aaaabc05a60) at /usr/local/src/trafficserver/iocore/eventsystem/I_Continuation.h:153
No locals.
#14 0x000000000082b640 in EThread::process_event (this=0x2b5b8e406000, e=0x2aaaabc05a60, calling_code=5) at UnixEThread.cc:143
        c_temp = 0x2b5b8e409e50
        lock = {m = {m_ptr = 0x2b5b8d9141e0}, lock_acquired = true}
        __FUNCTION__ = "process_event"
#15 0x000000000082bc1f in EThread::execute (this=0x2b5b8e406000) at UnixEThread.cc:270
        done_one = false
        e = 0x2aaaabc05a60
        NegativeQueue = {<DLL<Event, Event::Link_link>> = {head = 0x2aaaabc04c80}, tail = 0x2aaaabc04080}
        next_time = 1490586447220437269
        __FUNCTION__ = "execute"
#16 0x000000000082ac5a in spawn_thread_internal (a=0x2b5b8d6aea90) at Thread.cc:84
        p = 0x2b5b8d6aea90
#17 0x00002b5b8b638dc5 in start_thread (arg=0x2aaaab5ff700) at pthread_create.c:308
        __res = <optimized out>
        pd = 0x2aaaab5ff700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {46912508000000, -4231554947831992312, 0, 140736391318576, 46912508000000, 0, -8065056817632415736, -7786325772008460280}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
#18 0x00002b5b8be4a73d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
No locals.

@zwoop zwoop reopened this Apr 2, 2017
@scw00
Copy link
Member

scw00 commented Apr 3, 2017

There are 2 reasons for this crash:
1、ATS try to connect the origin server, but receive RST package. So ATS try again until retry_times == 0. It will clean the server_entry (mark the VC as closed, remove this VC from write_ready_list and read_ready_list, then close it) in every times except the last one(It is easy to re-product), so it will call both read sm and write sm in the last times, and the write sm causes crash.
2、Epoll trigger EPOLLERR without EVENTIO_READ(hard to re-product).

In my opinion, the EPOLLERR always with read.enabled or write.enabled(otherwise, there is no sm could handle this event, we just ignore). So when EPOLLERR happens, we could push the vc into read_ready_list if the read.enabled == 1, otherwise push it into write_ready_list.

By the way, there is my program to re-product this crash:

package main

import (
	"net"
	"os"
	"fmt"
)

func main()  {

	ls, err := net.Listen("tcp", os.Args[1])
	if err != nil {
		fmt.Println(err)
		os.Exit(1)
	}

	for {
		conn, err := ls.Accept()
		if err != nil {
			fmt.Println(err)
			os.Exit(1)
		}
		go handleConnection(conn)
	}
}

func handleConnection(con net.Conn)  {
	con.Close()
}

Here is my patch #1629 @oknet

@zwoop zwoop modified the milestones: 7.1.0, 7.1.1 May 4, 2017
@vmamidi vmamidi mentioned this issue Jun 8, 2017
vmamidi added a commit to vmamidi/trafficserver that referenced this issue Jun 8, 2017
This reverts PRs apache#1559, apache#1522 and apache#947

PR apache#947 made the HTTP state machine unstable and lead to crashes in production like apache#1930 apache#1559 apache#1522 apache#1531 apache#1629

This reverts commit c1ac5f8.
zwoop pushed a commit that referenced this issue Jun 8, 2017
This reverts PRs #1559, #1522 and #947

PR #947 made the HTTP state machine unstable and lead to crashes in production like #1930 #1559 #1522 #1531 #1629

This reverts commit c1ac5f8.
@zwoop
Copy link
Contributor Author

zwoop commented Jun 21, 2017

This has been fixed / reverted.

@zwoop zwoop closed this as completed Jun 21, 2017
@zwoop zwoop modified the milestones: 7.1.1, 7.1.0 Jul 25, 2017
@zwoop zwoop removed this from the 7.1.1 milestone Jul 25, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Backport Marked for backport for an LTS patch release Crash Network
Projects
None yet
Development

No branches or pull requests

4 participants