-
-
Notifications
You must be signed in to change notification settings - Fork 365
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Flush FPM logs in case of timeouts #1133
Conversation
When Lambda times out with the PHP-FPM layer, the logs written by the PHP script are never flushed to stderr by PHP-FPM. That means they never reach CloudWatch, which makes timeouts really hard to debug. With this change, Bref waits for the FPM response until 1 second before the actual Lambda timeout (via a connection timeout on the FastCGI connection). If Bref reaches that point, it will ask PHP-FPM to gracefully restart the PHP-FPM worker, which: - flushes the logs (logs end up in CloudWatch, which is great) - restarts a clean FPM worker, without doing a full FPM restart (which may take longer) Follow up of #770, #772, #895 May address some of #862 Note: this does not change anything for the Function layer (only affects FPM). Also this does not show a full stack track of the place in the code where the timeout happens (#895 did). Still it's an improvement over the current status.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clever solution. I don't know that we can test it for you, because only our production instances see timeouts.
Testing in production? 😛 I'm also considering releasing that behind a feature flag (env variable for example), but would that change anything for you? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's simple and smart enough 👍 (having FPM in Lambda is more or less a hack already 😅)
Just have one comment
@@ -1071,7 +1071,7 @@ public function test FPM timeouts are recovered from() | |||
'httpMethod' => 'GET', | |||
], $this->fakeContext); | |||
$this->fail('No exception was thrown'); | |||
} catch (FastCgiCommunicationFailed $e) { | |||
} catch (Timeout $e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FastCgiCommunicationFailed
is still thrown in FpmHandler.php
Shouldn't it be 👇 ?
} catch (Timeout $e) { | |
} catch (FastCgiCommunicationFailed|Timeout $e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this test I specifically want to test for a timeout, so if I get any other exception it's that the test is broken? (and so the FastCgiCommunicationFailed
exception should bubble up and fail my test)
haha funny we're facing this issue right now , trying your branch right now :) |
I tried some minutes ago your branch , and unfortunately for us we still only in clouwatch 2022-01-05T17:43:58.568+01:00 | START RequestId: f5bb07ad-7fb3-49cf-88a7-4e13a19735f2 Version: $LATEST |
@allan-simon thanks for trying it! It seems in your case the new feature did not kick in at all 🤔 If you have a 15 seconds Lambda timeout, then FPM should stop at 14 seconds and throw an explicit exception (the "Task timed out" message shouldn't even appear, because the 15 seconds shouldn't be reached). At the 14th second, the exception should be thrown no matter what, so that's confusing. Sorry to insist but could you double-check that the pull request was really deployed in your case? |
@mnapoli no problem, I will double check that, you're totally right that I may have been too excited and forgot something |
@mnapoli indeed, I only did "composer install" , and forgot it does not upgrade the composer.lock, I confirm your branch works like a charm and is a huge life saver for us ^^ (so great luck from us that you did the PR at that very same time ) |
@allan-simon AWESOME, thank you for testing! |
Woohoo! 🎊 |
if that allows this PR to be more quickly merged in mainstream , yes :) |
I don't think that would change anything for us, but it would probably be better for release planning. |
Let's get this out of the door! Thanks for testing on your end too @allan-simon! |
@@ -3,6 +3,7 @@ | |||
require __DIR__ . '/../vendor/autoload.php'; | |||
|
|||
if (isset($_GET['sleep'])) { | |||
error_log('This is a log'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mnapoli was this merged into 1.5.0? Seems like extraneous debugging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is demo code, nothing to worry about IMHO
It just allows to test this feature works but will never be executed in you app 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this directory isn't even a demo really, just a sample app for me to play/develop. It's not shipped to users.
Follow-up of #1133 This is because sending SIGUSR2 to FPM (the previously implemented solution) did not really stop with 100% certainty the PHP script that timed out. Indeed, it merely interrupted the currently blocked call (e.g. a sleep, a DB call, etc.), flushed the logs and carried on. My guess is that this could have caused the PHP script to continue to run in some cases, possibly running into yet another timeout on a next line (e.g. another DB call). This PR fixes the timeout test that wasn't really working (🤦) and restarts FPM completely in case of timeout. That is confirmed to completely stop the execution of the timed out script + flush the logs to stderr.
For reference, follow-up: #1144 |
Flush FPM logs in case of timeouts
Follow-up of #1133 This is because sending SIGUSR2 to FPM (the previously implemented solution) did not really stop with 100% certainty the PHP script that timed out. Indeed, it merely interrupted the currently blocked call (e.g. a sleep, a DB call, etc.), flushed the logs and carried on. My guess is that this could have caused the PHP script to continue to run in some cases, possibly running into yet another timeout on a next line (e.g. another DB call). This PR fixes the timeout test that wasn't really working (🤦) and restarts FPM completely in case of timeout. That is confirmed to completely stop the execution of the timed out script + flush the logs to stderr.
When Lambda times out with the PHP-FPM layer, the logs written by the PHP script are never flushed to stderr by PHP-FPM. That means logs never reach CloudWatch, which makes timeouts really hard to debug.
With this change, Bref waits for the FPM response until 1 second before the actual Lambda timeout (via a connection timeout on the FastCGI connection).
If Bref reaches that point, it will ask PHP-FPM to gracefully restart the PHP-FPM worker, which:
Follow up of #770, #772, #895, #1106
May address some of #862
Note: this does not change anything for the Function layer (only affects FPM). Also this does not show a full stack track of the place in the code where the timeout happens (#895 did). Still it's an improvement over the current status.
Here is an example of a timeout with the PHP logs correctly written to stderr:
I tried to make the error message as explicit as possible.
I would love for some of you to try this PR and confirm that it helps (that PHP logs end up in CloudWatch) and that there are no side effects.
You can do that via
composer.json
: