Monitor celery-worker memory usage in Prod #3840

Closed
rjayasekera opened this issue Jun 26, 2019 · 10 comments

@rjayasekera
Contributor

rjayasekera commented Jun 26, 2019

We have not received the Slack #bot message about a successful reload of all AOs from the production environment for the past two days (6/24 and 6/25/19), but we are receiving the reload-completion messages from the DEV and STAGE environments.

[screenshot]

Looking at the Kibana logs, it appears the AO reload jobs are starting in the DEV/STAGE/PROD environments, but for some reason the AO reload job is not completing in PROD.

Links to Kibana logs:
https://logs.fr.cloud.gov/goto/a0518e7dcc8367cc11cb23efcc255989

[screenshot]

https://logs.fr.cloud.gov/goto/db1e2660c6fafdda0c0547b59bb35a99

@lbeaufort lbeaufort changed the title from "Daily reload of all AOs now working in Prod" to "Daily reload of all AOs not working in Prod" Jun 26, 2019
@lbeaufort lbeaufort added this to the Sprint 9.3 milestone Jun 26, 2019
@lbeaufort lbeaufort self-assigned this Jun 26, 2019
@lbeaufort
Member

lbeaufort commented Jun 26, 2019

Looks like the celery worker tried to load AO 2012-20 and then exited before the reload completed: https://logs.fr.cloud.gov/goto/4228f8f66a575f1bd3cc582360d11fb2

[2019-06-26 01:08:33,060: INFO/ForkPoolWorker-3] Loading AO: 2012-20
[2019-06-26 01:08:33,956: ERROR/MainProcess] Process 'ForkPoolWorker-3' pid:59 exited with 'signal 9 (SIGKILL)'

In the short term, I'm going to try to manually reload AOs in prod with a CF task. I'll start by reloading from AO 2012-20, where it errored before, and I'll use the zero-downtime reload task:
cf run-task api "python manage.py refresh_current_legal_docs_zero_downtime" -m 4G --name reload-legal
This might take a while, but reloading only the AOs would cause intermittent downtime.
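(Once the task is kicked off, its state should be visible with the standard CF CLI, e.g. `cf tasks api`, which lists the app's recent tasks and whether they are RUNNING, SUCCEEDED, or FAILED.)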

@lbeaufort
Member

It looks like we started getting this error (mostly in prod) around the beginning of the year.

billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL)

https://logs.fr.cloud.gov/goto/52250b0d73a83b390c7ff04edde4ed4c

This seems to indicate the reload task was terminated by another process.
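If the platform is killing the worker process for exceeding its memory quota, there are Celery settings that can soften the blow. This is only a sketch against Celery 4.x defaults (setting names from the Celery docs, values illustrative), not necessarily how our app is configured today:

```python
# Sketch only (Celery 4.x): recycle pool workers before they grow into the
# container's memory limit, and re-queue a task whose worker was SIGKILLed
# instead of silently losing it.
from celery import Celery

app = Celery("legal_docs")  # illustrative app name

app.conf.update(
    worker_max_memory_per_child=500_000,  # KiB; replace the child after its current task if exceeded
    worker_max_tasks_per_child=100,       # also recycle after N tasks to limit slow leaks
    task_acks_late=True,                  # acknowledge only after the task finishes...
    task_reject_on_worker_lost=True,      # ...so a SIGKILLed task goes back on the queue
)
```

One caveat: with `task_reject_on_worker_lost`, a task that reliably runs out of memory (e.g. always on AO 2012-20) would just be retried and killed again, so this helps with recovery, not with the underlying memory use.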

@lbeaufort
Member

There are some recent ‘billiard’ bug fixes that might address this issue. I pushed a manual deploy of #3842 to dev to see how the reloads and downloads work.

This might be hard to test because the issue isn’t easily replicated. We might need to update versions and check once the deploy goes live.

@rjayasekera
Contributor Author

The daily reload of AOs failed at the same record on Wednesday night as well:

   2019-06-26T21:08:33.02-0400 [APP/PROC/WEB/4] ERR [2019-06-27 01:08:33,025: ERROR/MainProcess] Process 'ForkPoolWorker-4' pid:62 exited with 'signal 9 (SIGKILL)'
   2019-06-26T21:08:33.03-0400 [APP/PROC/WEB/4] ERR [2019-06-27 01:08:33,036: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL).',)
   2019-06-26T21:08:33.03-0400 [APP/PROC/WEB/4] ERR Traceback (most recent call last):
   2019-06-26T21:08:33.03-0400 [APP/PROC/WEB/4] ERR   File "/home/vcap/deps/0/python/lib/python3.6/site-packages/billiard/pool.py", line 1223, in mark_as_worker_lost
   2019-06-26T21:08:33.03-0400 [APP/PROC/WEB/4] ERR     human_status(exitcode)),
   2019-06-26T21:08:33.03-0400 [APP/PROC/WEB/4] ERR billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL).
The above error occurred while trying to load 2012-20. Maybe we are running out of celery-workers in `PROD`, or something is wrong with AO 2012-20 itself, since this is the exact record your manual AO load was processing when it also failed.

@rjayasekera
Contributor Author

rjayasekera commented Jun 27, 2019

CPU and memory usage on the celery-worker instances (we should increase the memory limit to 2G and re-run the AO refresh):

state since cpu memory disk details
#0 running 2019-06-19T03:30:42Z 0.3% 638.7M of 1G 420.3M of 2G
#1 running 2019-06-19T02:36:00Z 0.5% 860.3M of 1G 420.3M of 2G
#2 running 2019-06-19T03:22:54Z 0.3% 681.1M of 1G 420.3M of 2G
#3 running 2019-06-19T02:50:28Z 0.3% 745.2M of 1G 420.3M of 2G
#4 running 2019-06-19T03:48:13Z 0.2% 620.1M of 1G 420.3M of 2G
#5 running 2019-06-19T02:59:13Z 0.2% 751.7M of 1G 420.3M of 2G
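If we go that route, bumping the limit should just be a matter of `cf scale celery-worker -m 2G` (standard CF CLI; it restarts the app instances), followed by re-running the refresh.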

@lbeaufort lbeaufort modified the milestones: Sprint 9.3, Sprint 9.4 Jun 28, 2019
@lbeaufort lbeaufort changed the title from "Daily reload of all AOs not working in Prod" to "Monitor celery-worker memory usage in Prod" Jun 28, 2019
@lbeaufort
Member

We can monitor memory usage by targeting prod and running `cf app celery-worker`. Waiting for the deploy on 7/9.

@lbeaufort
Member

name: celery-worker
requested state: started
routes:
last uploaded: Tue 09 Jul 11:15:50 EDT 2019
stack: cflinuxfs3
buildpacks: python

type: web
instances: 6/6
memory usage: 1024M

state since cpu memory disk details
#0 running 2019-07-09T15:16:06Z 0.2% 577.6M of 1G 421.2M of 2G
#1 running 2019-07-09T15:16:06Z 1.3% 704.4M of 1G 421.2M of 2G
#2 running 2019-07-09T15:16:05Z 0.2% 683.8M of 1G 421.2M of 2G
#3 running 2019-07-09T15:16:06Z 0.2% 663.8M of 1G 421.2M of 2G
#4 running 2019-07-09T15:16:07Z 0.2% 615.2M of 1G 421.2M of 2G
#5 running 2019-07-09T15:16:06Z 1.4% 647.2M of 1G 421.2M of 2G

@dorothyyeager dorothyyeager modified the milestones: Sprint 9.4, Sprint 9.5 Jul 12, 2019
@lbeaufort
Member

name: celery-worker
requested state: started
instances: 6/6
usage: 1G x 6 instances
routes:
last uploaded: Tue 09 Jul 11:13:50 EDT 2019
stack: cflinuxfs3
buildpack: python_buildpack

state since cpu memory disk details
#0 running 2019-07-11T20:30:53Z 0.2% 611.6M of 1G 421.2M of 2G
#1 running 2019-07-11T19:48:00Z 0.2% 647.2M of 1G 421.2M of 2G
#2 running 2019-07-11T20:25:09Z 0.2% 622.3M of 1G 421.2M of 2G
#3 running 2019-07-11T19:39:06Z 0.2% 635.3M of 1G 421.2M of 2G
#4 running 2019-07-11T20:19:11Z 0.2% 672.6M of 1G 421.2M of 2G
#5 running 2019-07-11T19:57:27Z 0.5% 698.2M of 1G 421.2M of 2G

@lbeaufort
Member

lbeaufort commented Jul 15, 2019

Here's a Kibana query that tracks celery-worker memory usage over time. We want to look at memory usage before and after 7/9/19.

https://logs.fr.cloud.gov/goto/6429fcbc24388c72b6c4094ff5b2377d
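If we ever want the numbers in the application logs themselves (so they land in Kibana next to the task that caused the growth), a small signal handler would do it. A rough sketch only; `psutil` and the handler below are assumptions, not something the app has today:

```python
# Hypothetical sketch: log the worker process RSS after every task so memory
# growth over time is visible in Kibana alongside the task logs.
import logging

import psutil
from celery.signals import task_postrun

logger = logging.getLogger(__name__)


@task_postrun.connect
def log_worker_memory(sender=None, **kwargs):
    rss_mib = psutil.Process().memory_info().rss / (1024 * 1024)
    task_name = sender.name if sender else "unknown task"
    logger.info("celery worker RSS after %s: %.1f MiB", task_name, rss_mib)
```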

@lbeaufort
Member

We don't have any other symptoms of celery-worker running out of memory, so I'm closing this issue and we can investigate if the refreshes fail again.
