
Supervisor removal #790

Merged · 7 commits into aiidateam:develop · Nov 13, 2017

Conversation

dev-zero (Contributor)

No description provided.

@giovannipizzi (Member)

For the test that fails: please run aiida/docs/update_req_for_rtd.py locally and commit the changes it makes. It's possible that the RTD requirements no longer need to be maintained, but for now we're updating them "by hand" with the script. Thanks!

@giovannipizzi (Member) left a comment

Hi @dev-zero,
a few small requests:

  • Could you rebase/merge and resolve the conflicts?
  • At the end of the tests (in .travis-data/test_daemon.py or something like that) the daemon log file used to be printed. It is not printed anymore because the filename changed. Can you adapt the test script?
  • In the Travis output I see three processes. Does this mean there are three independent daemon processes or workers acting in parallel? If so, we unfortunately have to limit it to only 1 worker for now: some parts of the code (mainly the old workflows at this stage, I think) would be run multiple times if several workers started working on them.
  • Apart from this, do I understand correctly that this is ready to be merged?

@dev-zero (Contributor, Author)

dev-zero commented Nov 3, 2017

  • The rebase is done.
  • I now have to wait for the Travis output to see whether everything works; I started the test here but had to interrupt it at some point.
  • What happened previously is that verdi daemon start started supervisord, which in turn started one celery worker parent process that spawns one worker per CPU. verdi daemon status only showed the processes started directly by supervisor, which is/was only the celery parent process. Since I now query the PID directly, I thought it would be important to show how many workers are actually running, so I additionally query the child processes of the (still single) celery parent process (see the sketch right after this list). So, unless I have missed a configuration somewhere that limits the number of celery workers to one, we are not doing anything different from before.
  • Yes, the code should be ready for wider testing (especially with the last commit, which polishes some parts of it).
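
For illustration, a minimal sketch of that kind of query using psutil (hypothetical code, not necessarily what the PR does; the output format is only indicative):

    import psutil

    def print_daemon_status(pid):
        """Show the daemon parent process and its celery worker children."""
        try:
            parent = psutil.Process(pid)
        except psutil.NoSuchProcess:
            print("Daemon is not running")
            return
        print("Daemon is running as pid {}".format(parent.pid))
        # Direct children of the celery parent process are the worker processes.
        for child in parent.children():
            print(" * {}[{}] {}".format(child.name(), child.pid, child.status()))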

The review thread below refers to these lines of the supervisord configuration template:

command=celery worker -A tasks --loglevel=INFO --beat --schedule={daemon_dir}/celerybeat-schedule
directory={aiida_code_home}/daemon/
user={local_user}
numprocs=1
@giovannipizzi (Member)

I think this is the line that was limiting the # of procs to 1 (and so probably indirectly the # of celery workers).

@dev-zero (Contributor, Author)

No, it does not. It only tells supervisor to start only one celery worker process instead of multiple ones.

Here is the analysis with the current develop branch, using supervisor:

(venv) tiziano@tcpc18 ~/work/aiida/aiida_core (develop *=) $ verdi daemon start
11/04/2017 02:45:19 PM, alembic.runtime.migration: [INFO] Context impl PostgresqlImpl.
11/04/2017 02:45:19 PM, alembic.runtime.migration: [INFO] Will assume transactional DDL.
Clearing all locks ...
Starting AiiDA Daemon ...
Daemon started
(venv) tiziano@tcpc18 ~/work/aiida/aiida_core (develop *=) $ verdi daemon status
11/04/2017 02:48:26 PM, alembic.runtime.migration: [INFO] Context impl PostgresqlImpl.
11/04/2017 02:48:26 PM, alembic.runtime.migration: [INFO] Will assume transactional DDL.
# Most recent daemon timestamp:0h:00m:02s ago
## Found 1 process running:
   * aiida-daemon[aiida-daemon] RUNNING    pid 17019, uptime 0:03:05
(venv) tiziano@tcpc18 ~/work/aiida/aiida_core (develop *=) $ pstree -p 17019
celery(17019)─┬─celery(17033)
              ├─celery(17034)
              ├─celery(17035)
              ├─celery(17036)
              ├─celery(17037)
              ├─celery(17038)
              ├─celery(17039)
              ├─celery(17040)
              ├─celery(17044)
              ├─{celery}(17041)
              ├─{celery}(17042)
              └─{celery}(17043)

and here is the output without supervisor:

(venv) tiziano@tcpc18 ~/work/aiida/aiida_core (supervisor-removal *>) $ verdi daemon start
11/04/2017 02:51:59 PM, alembic.runtime.migration: [INFO] Context impl PostgresqlImpl.
11/04/2017 02:51:59 PM, alembic.runtime.migration: [INFO] Will assume transactional DDL.
Clearing all locks ...
Starting AiiDA Daemon (log file: /users/tiziano/.aiida/daemon/log/celery.log)...
Daemon started
(venv) tiziano@tcpc18 ~/work/aiida/aiida_core (supervisor-removal *>) $ verdi daemon status
11/04/2017 02:52:13 PM, alembic.runtime.migration: [INFO] Context impl PostgresqlImpl.
11/04/2017 02:52:13 PM, alembic.runtime.migration: [INFO] Will assume transactional DDL.
# Most recent daemon timestamp:0h:00m:01s ago
Daemon is running as pid 17868 since 2017-11-04 14:51:59.590000, child processes:
   * celery[17899]   sleeping, started at 2017-11-04 14:52:01
   * celery[17900]   sleeping, started at 2017-11-04 14:52:01
   * celery[17901]   sleeping, started at 2017-11-04 14:52:01
   * celery[17902]   sleeping, started at 2017-11-04 14:52:01
   * celery[17903]   sleeping, started at 2017-11-04 14:52:01
   * celery[17904]   sleeping, started at 2017-11-04 14:52:01
   * celery[17905]   sleeping, started at 2017-11-04 14:52:01
   * celery[17906]   sleeping, started at 2017-11-04 14:52:01
   * celery[17910]   sleeping, started at 2017-11-04 14:52:01
(venv) tiziano@tcpc18 ~/work/aiida/aiida_core (supervisor-removal *>) $ pstree -p 17868
celery(17868)─┬─celery(17899)
              ├─celery(17900)
              ├─celery(17901)
              ├─celery(17902)
              ├─celery(17903)
              ├─celery(17904)
              ├─celery(17905)
              ├─celery(17906)
              ├─celery(17910)
              ├─{celery}(17907)
              ├─{celery}(17908)
              └─{celery}(17909)

I used verdi setup ... with the SQLA backend to set up my profile, so I don't think there is anything wrong with it.

Could you please check on one of your setups (preferably one with more than 1 CPU)?

@giovannipizzi (Member)

I just put a comment on the line that I think was limiting it to only 1 worker.

@dev-zero (Contributor, Author)

@giovannipizzi I looked into the worker limitation: I can't confirm it with my setup. The number of celery workers does not seem to be limited at any point.

@giovannipizzi (Member)

Ok, interesting, I was convinced that there was only 1 worker... Maybe I was wrong, or maybe this changed over time. Let's merge this, then. Do you know how to limit the number of processes in the new setup?
Just for better understanding, are the multiple workers now separate processes, or threads within the same process? This might also help in debugging some nasty behaviour we see in special cases with SQLAlchemy (@szoupanos might be interested).

@giovannipizzi merged commit 02295ef into aiidateam:develop on Nov 13, 2017
@szoupanos (Contributor)

This is a very interesting topic; it's a pity that I didn't notice it before.
I think the option that @giovannipizzi refers to is --concurrency:
http://docs.celeryproject.org/en/latest/userguide/workers.html#concurrency

So the command should change from
command=celery worker -A tasks --loglevel=INFO --beat --schedule=/home/aiida_test/.aiida/daemon/celerybeat-schedule
to
command=celery worker -A tasks --loglevel=INFO --beat --concurrency=1 --schedule=/home/aiida_test/.aiida/daemon/celerybeat-schedule

The result of the pstree for the first command is

supervisord(30013)───celery(30015)─┬─celery(30039)
                                   ├─celery(30040)
                                   ├─celery(30041)
                                   ├─celery(30042)
                                   ├─celery(30043)
                                   ├─celery(30044)
                                   ├─celery(30045)
                                   ├─celery(30046)
                                   ├─celery(30047)
                                   ├─celery(30048)
                                   ├─celery(30049)
                                   ├─celery(30050)
                                   ├─celery(30051)
                                   ├─celery(30052)
                                   ├─celery(30053)
                                   ├─celery(30054)
                                   ├─celery(30055)
                                   ├─{celery}(30056)
                                   ├─{celery}(30057)
                                   └─{celery}(30058)

And for the second

supervisord(29798)───celery(29800)─┬─celery(29823)
                                   ├─celery(29827)
                                   ├─{celery}(29824)
                                   ├─{celery}(29825)
                                   └─{celery}(29826)

I don't understand why there are 2 subprocesses with the --concurrency=1 argument, but I suppose the second one is a helper subprocess, since according to the docs:

-c, --concurrency
Number of child processes processing the queue. The default is the number of CPUs available on your system.

http://docs.celeryproject.org/en/latest/reference/celery.bin.worker.html#cmdoption-celery-worker-c

I believe all of this is closely related to the workflow testing and debugging we are currently doing, and it is a good opportunity to have a small chat about:

  • why the supervisor was removed and why it was there in the first place
  • why we need only one thread for celery (and how this affects the workflow execution)

I don't know if @lekah is also interested in this discussion.

@dev-zero (Contributor, Author)

I guess the reason for supervisor being there was the expectation that there would be more background processes to manage in the future, plus that the daemon processes should always be restarted in case of a crash. The second feature we no longer have now that supervisor is gone, but Celery does a good job of restarting its workers (which are spawned using fork and therefore run in a completely separate environment), so a full crash should never happen; and if it does (for example because the filesystem is full), we want the user to investigate.

About the 2 processes with --concurrency=1: I guess one of them is Celery Beat, which periodically injects new tasks into the work queue, while the other is the requested worker that processes the actual tasks.
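
For illustration, a minimal sketch of how a beat schedule drives such periodic tasks (hypothetical task name, broker URL and interval; not the actual aiida/daemon/tasks.py configuration):

    from datetime import timedelta
    from celery import Celery

    app = Celery('tasks', broker='amqp://localhost//')  # hypothetical broker URL

    @app.task
    def submitter():
        # Stand-in for one of the periodic daemon tasks.
        pass

    # Celery 4+ spelling; older versions use the CELERYBEAT_SCHEDULE setting.
    app.conf.beat_schedule = {
        'run-submitter-every-30s': {
            'task': 'tasks.submitter',
            'schedule': timedelta(seconds=30),
        },
    }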

Now, for the concurrency issues: all the tasks in aiida/daemon/tasks.py are started periodically and may run in parallel if they fail to finish before the next period kicks off. The way I handled this in my projects is by using PostgreSQL's "SELECT ... FOR UPDATE" feature:

    calc = (Calculation.query
            .with_for_update(of=Calculation)
            .get(calc_id))

which should write-lock only the selected row until the end of the transaction. If you can't restrict the lock to a small and otherwise mostly independent resource/row, this kind of locking may introduce other problems such as stalled workers, so you should then rather use with_for_update(nowait=True, ...) instead (see the sketch below), especially since the "retry" will happen anyway through the periodic task injection.
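
For illustration, a minimal sketch of the nowait variant (assuming the same Calculation model plus a session object, a calc_id and a hypothetical process() function; with the PostgreSQL driver the locked-row error surfaces as an OperationalError):

    from sqlalchemy.exc import OperationalError

    try:
        # Fail immediately instead of blocking if another worker holds the lock.
        calc = (Calculation.query
                .with_for_update(of=Calculation, nowait=True)
                .get(calc_id))
    except OperationalError:
        # Row already locked by another worker: skip it; the next periodic
        # run of the task will retry it anyway.
        session.rollback()
    else:
        process(calc)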
