WRITE_COMPLETE event in http server open causes traffic server to abort #1930
Print history in HttpSM
I don't have any more information
:( ... sad!!
@vmamidi What version is this for?
We are seeing this on 7.0.0
@vmamidi Can you post a better backtrace with line numbers in the code? bt full would be great!
@scw00 I was able to get the history. Here is the history for ATS on the 7.1.x branch.
Great!
Looks like the HttpSM is receiving VC_EVENT_ERROR and also a WRITE_COMPLETE event after EOS.
ATS does a hostdb lookup to retry the next IP by round robin.
The hostdb callback is at [37].
I don't know why [32] shows up in the history; it should be blocked by pending_action. Also, I noticed:
The event changed from 500 to 103. What does that mean?
@scw00 I think we should clean up server_entry at HttpSM.cc:7359, just as is done at HttpSM.cc:7386.
@vmamidi Did you see "Unknown event: 500"?
@oknet Yes, it is different from the stack trace that I originally posted, but it still falls under the same category of receiving the wrong event. The stack trace is: abort
@vmamidi I think I found the reason; a patch will be pushed later. Can you reproduce it and give the patch a try?
What do you think the problem is?
@vmamidi The HttpSM retries connecting to the other server IPs via DNS round robin (as [31] shows). From the HttpSM::history, [32] changes the state of the HttpSM and resets its default handler.
@oknet HttpSM::handle_server_setup_error at [30] should clear all of that.
@vmamidi No, [30] does not clean up the server entry.
Can we do the cleanup in HttpSM::handle_server_setup_error?
@scw00 I don't know. :-(
@oknet @scw00 Look at this stack trace: abort, and the corresponding log message.
Also, I am not entirely sure, but I don't think this was an issue in earlier versions of ATS.
@vmamidi The NetHandler calls back ERROR to the HttpSM at [32], which then runs into [33] and [34].
At [37], HostDB calls back to the HttpSM; the default handler should be HttpSM::state_hostdb_lookup, but it has been changed to HttpSM::state_http_server_open.
Hmm! In my opinion, we should clean up the server entry whenever we decide to retry, regardless of whether the server crashed.
@oknet Yes, I agree with respect to the history I pasted here, but I don't think that is the only scenario.
@scw00 Yes, I agree.
@oknet Here is another history.
Stack trace: abort, and the corresponding log message: traffic_server[33674]: {0x2aaab470c700} ERROR: [HttpSM::state_http_server_open] Unknown event: 103
@vmamidi What version is the last HttpSM::history for?
[25] WRITE_COMPLETE: is this a POST request? If it is a POST request with post transform enabled, this is a retry after the transform finished. PR #2034 only covers the first HttpSM::history.
Post retry is currently broken because the tunnel has consumed all of the post data by the time the origin server goes down, so we see the TIMEOUT at [27].
The transformed data is dropped by the HttpSM if the server fails, and the HttpSM cannot rebuild the tunnel chain when retrying the server connection. The server connection retry mechanism has some bugs; my suggestion is to set the retry count to 0 to disable the mechanism.
If congestion.config is used then:
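The "set the retry to 0" suggestion above would, assuming the standard records.config knobs for origin connection retries (these setting names are my assumption; they are not quoted in this thread), look something like:

```
CONFIG proxy.config.http.connect_attempts_max_retries INT 0
CONFIG proxy.config.http.connect_attempts_rr_retries INT 0
```

Disabling both the plain retry and the round-robin retry would avoid the broken retry path at the cost of failing the transaction on the first connect error.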
@oknet It is 7.0.0. Yes, ATS connection retries are broken and can cause a crash. This is why I was saying that a fix applied during the DNS lookup only solves one scenario.
In the second history, we can see that the client post times out ([27]) and we send a 408 to the client. The server side then receives the timeout and retries at [28]. The problem is that we did not close the VC directly. That was fixed in 8.0.x.
@oknet @scw00 @zwoop I am okay with having patches for each of the histories on this issue, but I am worried that the state machine can receive events for past states, partly because of something we did in #947, as pointed out by @oknet. As of now, I don't think the state machine is robust enough to handle events for past states; if that is expected behavior, we should make sure that the HttpSM is robust enough to handle it.
@oknet @scw00 @zwoop @bryancall @SolidWallOfCode I understand the scenario mentioned in #947, but the state machine can now receive unexpected events. We have to either fix the state machine to match the new expectation, or revert #947 and come up with a different solution that fixes the mentioned scenario without breaking the expectation. I personally do not think leaving the state machine in such a fragile state is a good idea.
This might be worthwhile to take to the mailing list (dev@trafficserver). I'm OK either way, but I think it'd be better discussed and decided on the mailing list (I'm not sure who, if anyone, actually reads these Issue comments :-).
Everyone: I think we're at the point now of considering backing out #947. Is this agreeable to everyone?
Agree!
+1
This reverts PRs apache#1559, apache#1522 and apache#947. PR apache#947 made the HTTP state machine unstable and led to crashes in production, such as apache#1930 apache#1559 apache#1522 apache#1531 apache#1629. This reverts commit c1ac5f8.
I'm removing the 7.1.0 Milestone from this Issue, but keeping this open for now as there's valuable discussions here. |
If this is still an issue please reopen. |
Stack trace :
ink_abort(char const*, ...)
HttpSM::state_http_server_open(int, void*)
HttpSM::main_handler(int, void*)
write_to_net_io(NetHandler*, UnixNetVConnection*, EThread*)
NetHandler::mainNetEvent(int, Event*)
EThread::execute()
and corresponding error messages:
traffic_server[33593]: {0x2aaab6d0c700} ERROR: [HttpSM::state_http_server_open] Unknown event: 103
traffic_server[33593]: FATAL: HttpSM.cc:1798: failed assertion
We are seeing this in 7.0.0