
Supervisor removal #790

Merged · 7 commits into aiidateam:develop · Nov 13, 2017

Conversation

dev-zero (Contributor)

No description provided.

@giovannipizzi (Member)

For the test that fails: please run aiida/docs/update_req_for_rtd.py locally and commit the changes it makes. It's possible that the RTD requirements no longer need to be maintained, but for now we're updating them "by hand" with the script. Thanks!

@giovannipizzi (Member) left a comment

Hi @dev-zero,
a few small requests:

  • Could you rebase/merge and resolve the conflicts?
  • At the end of the tests (in .travis-data/test_daemon.py or something like that) the daemon log file used to be printed. It is not printed anymore because the filename changed. Can you adapt the test script?
  • In the Travis output I see three processes. Does this mean there are three independent daemon processes or workers acting in parallel? If so, we unfortunately have to limit it to only 1 worker for now: some parts of the code (mainly the old workflows at this stage, I think) would be run multiple times if several workers started working on them.
  • Apart from this, do I understand correctly that this is ready to be merged?

@dev-zero (Contributor, Author)

dev-zero commented Nov 3, 2017

  • The rebase is done.
  • I now have to wait for the Travis output to see whether everything works; I started the test here but had to interrupt it at some point.
  • What happened previously is that verdi daemon start started supervisord, which in turn started one celery worker parent process that spawns one worker per CPU. verdi daemon status only showed the processes started directly by supervisor, which is/was only the celery parent process. Since I now query the PID directly, I thought it would be important to show how many workers are actually running, so I additionally query the child processes of the (still single) celery parent process (see the sketch right after this list). So, unless I have missed a configuration somewhere that limits the number of celery workers to one, we are not doing anything different from before.
  • Yes, the code should be ready for wider testing (especially with the last commit, which polishes some parts of it).
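
For illustration, a minimal sketch of that kind of query using psutil (hypothetical code, not necessarily what the PR does; the output format is only indicative):

    import psutil

    def print_daemon_status(pid):
        """Show the daemon parent process and its celery worker children."""
        try:
            parent = psutil.Process(pid)
        except psutil.NoSuchProcess:
            print("Daemon is not running")
            return
        print("Daemon is running as pid {}".format(parent.pid))
        # Direct children of the celery parent process are the worker processes.
        for child in parent.children():
            print(" * {}[{}] {}".format(child.name(), child.pid, child.status()))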

The review thread below refers to these lines of the supervisord configuration template:

command=celery worker -A tasks --loglevel=INFO --beat --schedule={daemon_dir}/celerybeat-schedule
directory={aiida_code_home}/daemon/
user={local_user}
numprocs=1
@giovannipizzi (Member)

I think this is the line that was limiting the # of procs to 1 (and so probably indirectly the # of celery workers).

@dev-zero (Contributor, Author)

No, it does not. It only tells supervisor to start only one celery worker process instead of multiple ones.

Here is the analysis with the current develop branch, using supervisor:

(venv) tiziano@tcpc18 ~/work/aiida/aiida_core (develop *=) $ verdi daemon start
11/04/2017 02:45:19 PM, alembic.runtime.migration: [INFO] Context impl PostgresqlImpl.
11/04/2017 02:45:19 PM, alembic.runtime.migration: [INFO] Will assume transactional DDL.
Clearing all locks ...
Starting AiiDA Daemon ...
Daemon started
(venv) tiziano@tcpc18 ~/work/aiida/aiida_core (develop *=) $ verdi daemon status
11/04/2017 02:48:26 PM, alembic.runtime.migration: [INFO] Context impl PostgresqlImpl.
11/04/2017 02:48:26 PM, alembic.runtime.migration: [INFO] Will assume transactional DDL.
# Most recent daemon timestamp:0h:00m:02s ago
## Found 1 process running:
   * aiida-daemon[aiida-daemon] RUNNING    pid 17019, uptime 0:03:05
(venv) tiziano@tcpc18 ~/work/aiida/aiida_core (develop *=) $ pstree -p 17019
celery(17019)─┬─celery(17033)
              ├─celery(17034)
              ├─celery(17035)
              ├─celery(17036)
              ├─celery(17037)
              ├─celery(17038)
              ├─celery(17039)
              ├─celery(17040)
              ├─celery(17044)
              ├─{celery}(17041)
              ├─{celery}(17042)
              └─{celery}(17043)

and here is the output without supervisor:

(venv) tiziano@tcpc18 ~/work/aiida/aiida_core (supervisor-removal *>) $ verdi daemon start
11/04/2017 02:51:59 PM, alembic.runtime.migration: [INFO] Context impl PostgresqlImpl.
11/04/2017 02:51:59 PM, alembic.runtime.migration: [INFO] Will assume transactional DDL.
Clearing all locks ...
Starting AiiDA Daemon (log file: /users/tiziano/.aiida/daemon/log/celery.log)...
Daemon started
(venv) tiziano@tcpc18 ~/work/aiida/aiida_core (supervisor-removal *>) $ verdi daemon status
11/04/2017 02:52:13 PM, alembic.runtime.migration: [INFO] Context impl PostgresqlImpl.
11/04/2017 02:52:13 PM, alembic.runtime.migration: [INFO] Will assume transactional DDL.
# Most recent daemon timestamp:0h:00m:01s ago
Daemon is running as pid 17868 since 2017-11-04 14:51:59.590000, child processes:
   * celery[17899]   sleeping, started at 2017-11-04 14:52:01
   * celery[17900]   sleeping, started at 2017-11-04 14:52:01
   * celery[17901]   sleeping, started at 2017-11-04 14:52:01
   * celery[17902]   sleeping, started at 2017-11-04 14:52:01
   * celery[17903]   sleeping, started at 2017-11-04 14:52:01
   * celery[17904]   sleeping, started at 2017-11-04 14:52:01
   * celery[17905]   sleeping, started at 2017-11-04 14:52:01
   * celery[17906]   sleeping, started at 2017-11-04 14:52:01
   * celery[17910]   sleeping, started at 2017-11-04 14:52:01
(venv) tiziano@tcpc18 ~/work/aiida/aiida_core (supervisor-removal *>) $ pstree -p 17868
celery(17868)─┬─celery(17899)
              ├─celery(17900)
              ├─celery(17901)
              ├─celery(17902)
              ├─celery(17903)
              ├─celery(17904)
              ├─celery(17905)
              ├─celery(17906)
              ├─celery(17910)
              ├─{celery}(17907)
              ├─{celery}(17908)
              └─{celery}(17909)

I used verdi setup ... with the SQLA backend to set up my profile, so I don't think there is anything wrong with it.

Could you please check on one of your setups (preferably one with more than 1 CPU)?

@giovannipizzi (Member)

I just put a comment on the line that I think was limiting it to only 1 worker.

@dev-zero (Contributor, Author)

@giovannipizzi I looked into the worker limitation: I can't confirm it with my setup. The number of celery workers does not seem to be limited at any point.

@giovannipizzi (Member)

Ok, interesting, I was convinced that there was only 1 worker... Maybe I was wrong, or maybe this changed over time. Let's merge this, then. Do you know how to limit the number of processes in the new setup?
Just for better understanding, are the multiple workers now separate processes, or threads within the same process? This might also help in debugging some nasty behaviour we see in special cases with SQLAlchemy (@szoupanos might be interested).

@giovannipizzi merged commit 02295ef into aiidateam:develop on Nov 13, 2017
@szoupanos (Contributor)

This is a very interesting topic; it's a pity that I didn't notice it before.
I think the option that @giovannipizzi refers to is --concurrency:
http://docs.celeryproject.org/en/latest/userguide/workers.html#concurrency

So the command should change from
command=celery worker -A tasks --loglevel=INFO --beat --schedule=/home/aiida_test/.aiida/daemon/celerybeat-schedule
to
command=celery worker -A tasks --loglevel=INFO --beat --concurrency=1 --schedule=/home/aiida_test/.aiida/daemon/celerybeat-schedule

The result of the pstree for the first command is

supervisord(30013)───celery(30015)─┬─celery(30039)
                                   ├─celery(30040)
                                   ├─celery(30041)
                                   ├─celery(30042)
                                   ├─celery(30043)
                                   ├─celery(30044)
                                   ├─celery(30045)
                                   ├─celery(30046)
                                   ├─celery(30047)
                                   ├─celery(30048)
                                   ├─celery(30049)
                                   ├─celery(30050)
                                   ├─celery(30051)
                                   ├─celery(30052)
                                   ├─celery(30053)
                                   ├─celery(30054)
                                   ├─celery(30055)
                                   ├─{celery}(30056)
                                   ├─{celery}(30057)
                                   └─{celery}(30058)

And for the second

supervisord(29798)───celery(29800)─┬─celery(29823)
                                   ├─celery(29827)
                                   ├─{celery}(29824)
                                   ├─{celery}(29825)
                                   └─{celery}(29826)

I don't understand why there are 2 subprocesses with the --concurrency=1 argument, but I suppose the second one is a helper subprocess, since according to the docs:

-c, --concurrency
Number of child processes processing the queue. The default is the number of CPUs available on your system.

http://docs.celeryproject.org/en/latest/reference/celery.bin.worker.html#cmdoption-celery-worker-c

I believe all of this is closely related to the workflow testing and debugging we are currently doing, and it is a good opportunity to have a small chat about:

  • why the supervisor was removed and why it was there in the first place
  • why we need only one thread for celery (and how this affects the workflow execution)

I don't know if @lekah is also interested in this discussion.

@dev-zero (Contributor, Author)

I guess the reason for supervisor being there was the expectation that there would be more background processes to manage in the future, plus that the daemon processes should always be restarted in case of a crash. The second feature we no longer have now that supervisor is gone, but Celery does a good job of restarting its workers (which are spawned using fork and therefore run in a completely separate environment), so a full crash should never happen; and if it does (for example because the filesystem is full), we want the user to investigate.

About the 2 processes with --concurrency=1: I guess one of them is Celery Beat, which periodically injects new tasks into the work queue, while the other is the requested worker that processes the actual tasks.
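
For illustration, a minimal sketch of how a beat schedule drives such periodic tasks (hypothetical task name, broker URL and interval; not the actual aiida/daemon/tasks.py configuration):

    from datetime import timedelta
    from celery import Celery

    app = Celery('tasks', broker='amqp://localhost//')  # hypothetical broker URL

    @app.task
    def submitter():
        # Stand-in for one of the periodic daemon tasks.
        pass

    # Celery 4+ spelling; older versions use the CELERYBEAT_SCHEDULE setting.
    app.conf.beat_schedule = {
        'run-submitter-every-30s': {
            'task': 'tasks.submitter',
            'schedule': timedelta(seconds=30),
        },
    }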

Now, for the concurrency issues: all the tasks in aiida/daemon/tasks.py are started periodically and may run in parallel if they fail to finish before the next period kicks off. The way I handled this in my projects is by using PostgreSQL's "SELECT ... FOR UPDATE" feature:

    calc = (Calculation.query
            .with_for_update(of=Calculation)
            .get(calc_id))

which should write-lock only the selected row until the end of the transaction. If you can't restrict the lock to a small and otherwise mostly independent resource/row, this kind of locking may introduce other problems such as stalled workers, so you should then rather use with_for_update(nowait=True, ...) instead (see the sketch below), especially since the "retry" will happen anyway through the periodic task injection.
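
For illustration, a minimal sketch of the nowait variant (assuming the same Calculation model plus a session object, a calc_id and a hypothetical process() function; with the PostgreSQL driver the locked-row error surfaces as an OperationalError):

    from sqlalchemy.exc import OperationalError

    try:
        # Fail immediately instead of blocking if another worker holds the lock.
        calc = (Calculation.query
                .with_for_update(of=Calculation, nowait=True)
                .get(calc_id))
    except OperationalError:
        # Row already locked by another worker: skip it; the next periodic
        # run of the task will retry it anyway.
        session.rollback()
    else:
        process(calc)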
