Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flush FPM logs in case of timeouts #1133

Merged
merged 1 commit into from
Jan 8, 2022
Merged

Flush FPM logs in case of timeouts #1133

merged 1 commit into from
Jan 8, 2022

Conversation

mnapoli
Copy link
Member

@mnapoli mnapoli commented Jan 4, 2022

When Lambda times out with the PHP-FPM layer, the logs written by the PHP script are never flushed to stderr by PHP-FPM. That means logs never reach CloudWatch, which makes timeouts really hard to debug.

With this change, Bref waits for the FPM response until 1 second before the actual Lambda timeout (via a connection timeout on the FastCGI connection).

If Bref reaches that point, it will ask PHP-FPM to gracefully restart the PHP-FPM worker, which:

  • flushes the logs (logs end up in CloudWatch, which is great)
  • restarts a clean FPM worker, without doing a full FPM restart (which may take longer)

Follow up of #770, #772, #895, #1106

May address some of #862

Note: this does not change anything for the Function layer (only affects FPM). Also this does not show a full stack track of the place in the code where the timeout happens (#895 did). Still it's an improvement over the current status.

Here is an example of a timeout with the PHP logs correctly written to stderr:

Screen 2022-01-04 17-24

I tried to make the error message as explicit as possible.


I would love for some of you to try this PR and confirm that it helps (that PHP logs end up in CloudWatch) and that there are no side effects.

You can do that via composer.json:

-    "bref/bref": "^1.0",
+    "bref/bref": "dev-timeouts-fpm",

When Lambda times out with the PHP-FPM layer, the logs written by the PHP script are never flushed to stderr by PHP-FPM. That means they never reach CloudWatch, which makes timeouts really hard to debug.

With this change, Bref waits for the FPM response until 1 second before the actual Lambda timeout (via a connection timeout on the FastCGI connection).

If Bref reaches that point, it will ask PHP-FPM to gracefully restart the PHP-FPM worker, which:

- flushes the logs (logs end up in CloudWatch, which is great)
- restarts a clean FPM worker, without doing a full FPM restart (which may take longer)

Follow up of #770, #772, #895

May address some of #862

Note: this does not change anything for the Function layer (only affects FPM). Also this does not show a full stack track of the place in the code where the timeout happens (#895 did). Still it's an improvement over the current status.
Copy link
Contributor

@shadowhand shadowhand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clever solution. I don't know that we can test it for you, because only our production instances see timeouts.

@mnapoli
Copy link
Member Author

mnapoli commented Jan 4, 2022

Testing in production? 😛

I'm also considering releasing that behind a feature flag (env variable for example), but would that change anything for you?

Copy link
Member

@t-richard t-richard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's simple and smart enough 👍 (having FPM in Lambda is more or less a hack already 😅)

Just have one comment

@@ -1071,7 +1071,7 @@ public function test FPM timeouts are recovered from()
'httpMethod' => 'GET',
], $this->fakeContext);
$this->fail('No exception was thrown');
} catch (FastCgiCommunicationFailed $e) {
} catch (Timeout $e) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FastCgiCommunicationFailed is still thrown in FpmHandler.php

Shouldn't it be 👇 ?

Suggested change
} catch (Timeout $e) {
} catch (FastCgiCommunicationFailed|Timeout $e) {

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this test I specifically want to test for a timeout, so if I get any other exception it's that the test is broken? (and so the FastCgiCommunicationFailed exception should bubble up and fail my test)

@allan-simon
Copy link
Contributor

haha funny we're facing this issue right now , trying your branch right now :)

@allan-simon
Copy link
Contributor

I tried some minutes ago your branch , and unfortunately for us we still only in clouwatch

2022-01-05T17:43:58.568+01:00 | START RequestId: f5bb07ad-7fb3-49cf-88a7-4e13a19735f2 Version: $LATEST
2022-01-05T17:44:13.586+01:00 | END RequestId: f5bb07ad-7fb3-49cf-88a7-4e13a19735f2
2022-01-05T17:44:13.586+01:00 | REPORT RequestId: f5bb07ad-7fb3-49cf-88a7-4e13a19735f2 Duration: 15015.50 ms Billed Duration: 15000 ms Memory Size: 1769 MB Max
2022-01-05T16:44:13.586Z f5bb07ad-7fb3-49cf-88a7-4e13a19735f2 Task timed out after 15.02 seconds

@mnapoli
Copy link
Member Author

mnapoli commented Jan 5, 2022

@allan-simon thanks for trying it! It seems in your case the new feature did not kick in at all 🤔

If you have a 15 seconds Lambda timeout, then FPM should stop at 14 seconds and throw an explicit exception (the "Task timed out" message shouldn't even appear, because the 15 seconds shouldn't be reached).

At the 14th second, the exception should be thrown no matter what, so that's confusing. Sorry to insist but could you double-check that the pull request was really deployed in your case?

@allan-simon
Copy link
Contributor

@mnapoli no problem, I will double check that, you're totally right that I may have been too excited and forgot something

@allan-simon
Copy link
Contributor

@mnapoli indeed, I only did "composer install" , and forgot it does not upgrade the composer.lock, I confirm your branch works like a charm and is a huge life saver for us ^^ (so great luck from us that you did the PR at that very same time )

@mnapoli
Copy link
Member Author

mnapoli commented Jan 6, 2022

@allan-simon AWESOME, thank you for testing!

@shadowhand
Copy link
Contributor

Woohoo! 🎊

@allan-simon
Copy link
Contributor

I'm also considering releasing that behind a feature flag (env variable for example), but would that change anything for you?

if that allows this PR to be more quickly merged in mainstream , yes :)

@shadowhand
Copy link
Contributor

I'm also considering releasing that behind a feature flag (env variable for example), but would that change anything for you?

I don't think that would change anything for us, but it would probably be better for release planning.

@mnapoli
Copy link
Member Author

mnapoli commented Jan 8, 2022

Let's get this out of the door!

Thanks for testing on your end too @allan-simon!

@mnapoli mnapoli merged commit 25c2415 into master Jan 8, 2022
@mnapoli mnapoli deleted the timeouts-fpm branch January 8, 2022 17:03
@@ -3,6 +3,7 @@
require __DIR__ . '/../vendor/autoload.php';

if (isset($_GET['sleep'])) {
error_log('This is a log');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mnapoli was this merged into 1.5.0? Seems like extraneous debugging.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is demo code, nothing to worry about IMHO
It just allows to test this feature works but will never be executed in you app 🙂

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this directory isn't even a demo really, just a sample app for me to play/develop. It's not shipped to users.

mnapoli added a commit that referenced this pull request Jan 27, 2022
Follow-up of #1133

This is because sending SIGUSR2 to FPM (the previously implemented solution) did not really stop with 100% certainty the PHP script that timed out.

Indeed, it merely interrupted the currently blocked call (e.g. a sleep, a DB call, etc.), flushed the logs and carried on. My guess is that this could have caused the PHP script to continue to run in some cases, possibly running into yet another timeout on a next line (e.g. another DB call).

This PR fixes the timeout test that wasn't really working (🤦) and restarts FPM completely in case of timeout. That is confirmed to completely stop the execution of the timed out script + flush the logs to stderr.
@mnapoli
Copy link
Member Author

mnapoli commented Jan 27, 2022

For reference, follow-up: #1144

mnapoli added a commit that referenced this pull request Feb 14, 2023
Flush FPM logs in case of timeouts
mnapoli added a commit that referenced this pull request Feb 14, 2023
Follow-up of #1133

This is because sending SIGUSR2 to FPM (the previously implemented solution) did not really stop with 100% certainty the PHP script that timed out.

Indeed, it merely interrupted the currently blocked call (e.g. a sleep, a DB call, etc.), flushed the logs and carried on. My guess is that this could have caused the PHP script to continue to run in some cases, possibly running into yet another timeout on a next line (e.g. another DB call).

This PR fixes the timeout test that wasn't really working (🤦) and restarts FPM completely in case of timeout. That is confirmed to completely stop the execution of the timed out script + flush the logs to stderr.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants