Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dan/bump pq #2858

Merged
merged 11 commits into from Feb 22, 2020
Merged

Dan/bump pq #2858

merged 11 commits into from Feb 22, 2020

Conversation

danielsdeleo
Copy link
Contributor

🔩 Description: What code changed, and why?

Giving it another go to see what happens. Probably won't actively work on it a lot if it's not easy so feel free to take over if you happen upon this PR and need it done.

⛓️ Related Resources

👍 Definition of Done

👟 How to Build and Test the Change

✅ Checklist

📷 Screenshots, if applicable

@danielsdeleo
Copy link
Contributor Author

To explain the approach here:

It seems the automatic reconnect and retry logic in lib/pq is kinda dangerous as-is: lib/pq#939 and attempts to bring back the previous retrying behavior petered out because of safety concerns: lib/pq#871

So it seems the best path forward is to assume the retrying behavior is not coming back in lib/pq, or at least not soon. I also think it's fine for Automate to return one 500 after a database disconnect/reconnnect. If you agree with that then the sensible thing to do is to make our tests a little more forgiving of the occasional 500 that we would expect to see from the database resets.

@danielsdeleo danielsdeleo force-pushed the dan/bump-pq branch 5 times, most recently from 7922677 to 0e7ecc8 Compare February 14, 2020 01:08
# Inspec tests are less tolerant of transient 500s; we expect a few of
# those as bad postgres connections get purged from the various services.
# Restart everything before running the tests
docker exec -t "$_frontend1_container_name" "$cli_bin" restart-services
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm not sure i understand whats happening here. We've just done a deploy. Why does restarting everything help

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The frontends come up with broken Postgres connections, so the Inspec tests get exactly 2 500s, see for example https://buildkite.com/chef/chef-automate-master-verify-private/builds/9454#d349a780-69be-4a0f-ac17-10f6ae8f2475. Retrying 500s in the Inspec would be the better solution but it's a lot harder.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For anyone who might stumble on this later, the code comment above is the correct answer--an HAProxy somewhere has the idle timeout set to 5 minutes, which kills the connections on frontend1 during the more-than-five-minutes when frontend2 is being deployed.

@danielsdeleo danielsdeleo force-pushed the dan/bump-pq branch 4 times, most recently from d8bf7e7 to 0d35487 Compare February 19, 2020 18:39
@@ -5,3 +5,5 @@ status_port = 10146

[timeouts]
connect = 5
# 12 hours
idle = 43200
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we wire this up to the config so its easy to change

@danielsdeleo danielsdeleo force-pushed the dan/bump-pq branch 8 times, most recently from 2f1161c to c3eeb38 Compare February 20, 2020 22:29
stevendanna and others added 9 commits February 20, 2020 17:39
Full set of changes:

lib/pq@83612a5...3427c32

Highlights include:

- A fix for a potential deadlock when related to commit and rollback
- QuoteLiteral function that can replace our custom implementation

Signed-off-by: Steven Danna <steve@chef.io>
Signed-off-by: Daniel DeLeo <dan@chef.io>
Signed-off-by: Daniel DeLeo <dan@chef.io>
Signed-off-by: Daniel DeLeo <dan@chef.io>
Signed-off-by: Daniel DeLeo <dan@chef.io>
Signed-off-by: Daniel DeLeo <dan@chef.io>
Signed-off-by: Daniel DeLeo <dan@chef.io>
Signed-off-by: Daniel DeLeo <dan@chef.io>
Signed-off-by: Daniel DeLeo <dan@chef.io>
@danielsdeleo danielsdeleo force-pushed the dan/bump-pq branch 3 times, most recently from ed041a1 to df2e8ff Compare February 21, 2020 21:08
Signed-off-by: Daniel DeLeo <dan@chef.io>
@danielsdeleo danielsdeleo requested a review from a team as a code owner February 21, 2020 23:09
@danielsdeleo danielsdeleo merged commit 5650c42 into master Feb 22, 2020
@danielsdeleo danielsdeleo deleted the dan/bump-pq branch February 22, 2020 01:34
@danielsdeleo danielsdeleo mentioned this pull request Mar 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants