Monitor celery-worker memory usage in Prod #3840

Closed
rjayasekera opened this issue Jun 26, 2019 · 10 comments

@rjayasekera
Contributor

rjayasekera commented Jun 26, 2019

We have not received the Slack #bot message about a successful reload of all AOs from the production environment for the past two days (6/24 and 6/25/19), but we are receiving the reload-completion messages from the DEV and STAGE environments.

[screenshot]

Looking at the Kibana logs, it appears the AO reload jobs are starting in the DEV/STAGE/PROD environments, but for some reason the AO reload job is not completing in PROD.

Links to Kibana logs:
https://logs.fr.cloud.gov/goto/a0518e7dcc8367cc11cb23efcc255989

[screenshot]

https://logs.fr.cloud.gov/goto/db1e2660c6fafdda0c0547b59bb35a99

@lbeaufort lbeaufort changed the title from "Daily reload of all AOs now working in Prod" to "Daily reload of all AOs not working in Prod" Jun 26, 2019
@lbeaufort lbeaufort added this to the Sprint 9.3 milestone Jun 26, 2019
@lbeaufort lbeaufort self-assigned this Jun 26, 2019
@lbeaufort
Member

lbeaufort commented Jun 26, 2019

Looks like the celery worker tried to load AO 2012-20 and then exited before the reload completed: https://logs.fr.cloud.gov/goto/4228f8f66a575f1bd3cc582360d11fb2

[2019-06-26 01:08:33,060: INFO/ForkPoolWorker-3] Loading AO: 2012-20
[2019-06-26 01:08:33,956: ERROR/MainProcess] Process 'ForkPoolWorker-3' pid:59 exited with 'signal 9 (SIGKILL)'

In the short term, I'm going to try to manually reload AOs in prod with a CF task. I'll start by reloading from AO 2012-20, where it errored before, and I'll use the zero-downtime reload task:
cf run-task api "python manage.py refresh_current_legal_docs_zero_downtime" -m 4G --name reload-legal
This might take a while, but reloading only the AOs would cause intermittent downtime.
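(Once the task is kicked off, its state should be visible with the standard CF CLI, e.g. `cf tasks api`, which lists the app's recent tasks and whether they are RUNNING, SUCCEEDED, or FAILED.)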

@lbeaufort
Member

It looks like we started getting this error (mostly in prod) around the beginning of the year.

billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL)

https://logs.fr.cloud.gov/goto/52250b0d73a83b390c7ff04edde4ed4c

This seems to indicate the reload task was terminated by another process.
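If the platform is killing the worker process for exceeding its memory quota, there are Celery settings that can soften the blow. This is only a sketch against Celery 4.x defaults (setting names from the Celery docs, values illustrative), not necessarily how our app is configured today:

```python
# Sketch only (Celery 4.x): recycle pool workers before they grow into the
# container's memory limit, and re-queue a task whose worker was SIGKILLed
# instead of silently losing it.
from celery import Celery

app = Celery("legal_docs")  # illustrative app name

app.conf.update(
    worker_max_memory_per_child=500_000,  # KiB; replace the child after its current task if exceeded
    worker_max_tasks_per_child=100,       # also recycle after N tasks to limit slow leaks
    task_acks_late=True,                  # acknowledge only after the task finishes...
    task_reject_on_worker_lost=True,      # ...so a SIGKILLed task goes back on the queue
)
```

One caveat: with `task_reject_on_worker_lost`, a task that reliably runs out of memory (e.g. always on AO 2012-20) would just be retried and killed again, so this helps with recovery, not with the underlying memory use.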

@lbeaufort
Member

There are some recent ‘billiard’ bug fixes that might address this issue. I pushed a manual deploy of #3842 to dev to see how the reloads and downloads work.

This might be hard to test because the issue isn’t easily replicated. We might need to update versions and check once the deploy goes live.

@rjayasekera
Contributor Author

The daily reload of AOs failed at the same record on Wednesday night as well:

   2019-06-26T21:08:33.02-0400 [APP/PROC/WEB/4] ERR [2019-06-27 01:08:33,025: ERROR/MainProcess] Process 'ForkPoolWorker-4' pid:62 exited with 'signal 9 (SIGKILL)'
   2019-06-26T21:08:33.03-0400 [APP/PROC/WEB/4] ERR [2019-06-27 01:08:33,036: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL).',)
   2019-06-26T21:08:33.03-0400 [APP/PROC/WEB/4] ERR Traceback (most recent call last):
   2019-06-26T21:08:33.03-0400 [APP/PROC/WEB/4] ERR   File "/home/vcap/deps/0/python/lib/python3.6/site-packages/billiard/pool.py", line 1223, in mark_as_worker_lost
   2019-06-26T21:08:33.03-0400 [APP/PROC/WEB/4] ERR     human_status(exitcode)),
   2019-06-26T21:08:33.03-0400 [APP/PROC/WEB/4] ERR billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL).
The above error occurred while trying to load 2012-20. Maybe we are running out of celery-workers in `PROD`, or something is wrong with AO 2012-20 itself, since this is the exact record your manual AO load was processing when it also failed.

@rjayasekera
Contributor Author

rjayasekera commented Jun 27, 2019

CPU and memory usage on the celery-worker instances (we should increase the memory limit to 2G and re-run the AO refresh):

state since cpu memory disk details
#0 running 2019-06-19T03:30:42Z 0.3% 638.7M of 1G 420.3M of 2G
#1 running 2019-06-19T02:36:00Z 0.5% 860.3M of 1G 420.3M of 2G
#2 running 2019-06-19T03:22:54Z 0.3% 681.1M of 1G 420.3M of 2G
#3 running 2019-06-19T02:50:28Z 0.3% 745.2M of 1G 420.3M of 2G
#4 running 2019-06-19T03:48:13Z 0.2% 620.1M of 1G 420.3M of 2G
#5 running 2019-06-19T02:59:13Z 0.2% 751.7M of 1G 420.3M of 2G
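If we go that route, bumping the limit should just be a matter of `cf scale celery-worker -m 2G` (standard CF CLI; it restarts the app instances), followed by re-running the refresh.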

@lbeaufort lbeaufort modified the milestones: Sprint 9.3, Sprint 9.4 Jun 28, 2019
@lbeaufort lbeaufort changed the title from "Daily reload of all AOs not working in Prod" to "Monitor celery-worker memory usage in Prod" Jun 28, 2019
@lbeaufort
Member

We can monitor memory usage by targeting prod and running `cf app celery-worker`. Waiting for the deploy on 7/9.

@lbeaufort
Member

name: celery-worker
requested state: started
routes:
last uploaded: Tue 09 Jul 11:15:50 EDT 2019
stack: cflinuxfs3
buildpacks: python

type: web
instances: 6/6
memory usage: 1024M

state since cpu memory disk details
#0 running 2019-07-09T15:16:06Z 0.2% 577.6M of 1G 421.2M of 2G
#1 running 2019-07-09T15:16:06Z 1.3% 704.4M of 1G 421.2M of 2G
#2 running 2019-07-09T15:16:05Z 0.2% 683.8M of 1G 421.2M of 2G
#3 running 2019-07-09T15:16:06Z 0.2% 663.8M of 1G 421.2M of 2G
#4 running 2019-07-09T15:16:07Z 0.2% 615.2M of 1G 421.2M of 2G
#5 running 2019-07-09T15:16:06Z 1.4% 647.2M of 1G 421.2M of 2G

@dorothyyeager dorothyyeager modified the milestones: Sprint 9.4, Sprint 9.5 Jul 12, 2019
@lbeaufort
Member

name: celery-worker
requested state: started
instances: 6/6
usage: 1G x 6 instances
routes:
last uploaded: Tue 09 Jul 11:13:50 EDT 2019
stack: cflinuxfs3
buildpack: python_buildpack

state since cpu memory disk details
#0 running 2019-07-11T20:30:53Z 0.2% 611.6M of 1G 421.2M of 2G
#1 running 2019-07-11T19:48:00Z 0.2% 647.2M of 1G 421.2M of 2G
#2 running 2019-07-11T20:25:09Z 0.2% 622.3M of 1G 421.2M of 2G
#3 running 2019-07-11T19:39:06Z 0.2% 635.3M of 1G 421.2M of 2G
#4 running 2019-07-11T20:19:11Z 0.2% 672.6M of 1G 421.2M of 2G
#5 running 2019-07-11T19:57:27Z 0.5% 698.2M of 1G 421.2M of 2G

@lbeaufort
Member

lbeaufort commented Jul 15, 2019

Here's a Kibana query that tracks celery-worker memory usage over time. We want to look at memory usage before and after 7/9/19.

https://logs.fr.cloud.gov/goto/6429fcbc24388c72b6c4094ff5b2377d
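If we ever want the numbers in the application logs themselves (so they land in Kibana next to the task that caused the growth), a small signal handler would do it. A rough sketch only; `psutil` and the handler below are assumptions, not something the app has today:

```python
# Hypothetical sketch: log the worker process RSS after every task so memory
# growth over time is visible in Kibana alongside the task logs.
import logging

import psutil
from celery.signals import task_postrun

logger = logging.getLogger(__name__)


@task_postrun.connect
def log_worker_memory(sender=None, **kwargs):
    rss_mib = psutil.Process().memory_info().rss / (1024 * 1024)
    task_name = sender.name if sender else "unknown task"
    logger.info("celery worker RSS after %s: %.1f MiB", task_name, rss_mib)
```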

@lbeaufort
Member

We don't have any other symptoms of celery-worker running out of memory, so I'm closing this issue and we can investigate if the refreshes fail again.
