Limit number of celery task executions per second per user #16232
Conversation
I have implemented the core logic. For this to work in practice, a task_user_id keyword parameter needs to be added to the celery task functions.
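The description above suggests each task function gains a task_user_id keyword that a before_start hook can consult. A rough, self-contained sketch of the idea (the class and attribute names here are illustrative, not Galaxy's actual GalaxyTask API):

```python
import time


class UserRateLimitedTask:
    """Illustrative Task-like class; the names task_user_id and
    tasks_per_second mirror this PR's description, not Galaxy's actual API."""

    def __init__(self, tasks_per_second: float):
        self.interval = 1.0 / tasks_per_second
        self._last_scheduled: dict = {}  # user_id -> monotonic timestamp

    def before_start(self, task_id, args, kwargs):
        user_id = kwargs.get("task_user_id")
        if user_id is None:
            return  # tasks that don't carry a user id are not rate limited
        now = time.monotonic()
        last = self._last_scheduled.get(user_id)
        # schedule no earlier than one interval after the previous execution
        next_slot = now if last is None else max(now, last + self.interval)
        self._last_scheduled[user_id] = next_slot
        time.sleep(next_slot - now)
```

In Galaxy the bookkeeping lives in the database rather than in memory, but the scheduling rule is the same: each user's next execution is pushed at least one interval past the previous one.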
This looks great! Just a few comments.

Configuration changes: once you've made a change to the config schema, please regenerate the derived sample config files (e.g. via make config-schema).

Can you, please, move the database migration into a separate commit? We try to keep those separate (see #15663 (comment)). In general, while there's nothing wrong with one large commit (and we sometimes have those), it's often helpful when a large modification is split into several smaller/narrowly focused commits: that way it's easier to debug stuff; it also makes it easier to read the history and understand the evolution of the code base.
...xy/model/migrations/alembic/versions_gxy/987ce9839ecb_create_celery_user_rate_limit_table.py
I made the changes. I agree that behavioral logic does not belong in model classes; model classes should be plain old objects. I only did so because there was no ideal place to put this type of database-specific access logic, and I do not think it belongs in the business logic layer either. In past projects I would create a data access layer to contain database-specific access logic; this layer sits between the business layer and database calls. In the case of Galaxy, the sqlalchemy orm serves the purpose of a data access layer. However, in order to take advantage of database-specific efficiencies available in postgres, the sqlalchemy layer exposes low-level details to the calling business logic layer. Ideally, we should keep database-specific logic in dedicated modules separate from the rest of the business logic. Another option is to write a custom sqlalchemy extension to abstract away this kind of detail from the business logic layer.
I have listed all of the celery tasks below, grouped according to how they are invoked. Marius mentioned in the meeting that we should user-rate-limit all tasks other than the celery beat ones. For tasks that are only invoked individually this makes sense. However, for tasks that are only invoked as part of a chain, this could introduce significant overhead. The rate-limiting logic implemented in the before_task hook makes a database update call for each task invocation (this is true for postgres; for sqlite it is a select followed by an update). If we rate-limit each individual task in a chain, we incur a separate database update call for each task in the chain. For the second chain listed below, which includes 4 tasks, each invocation of the chain would result in 4 database update calls. If we only rate-limited one of the tasks in the chain, there would be only one database update call per chain invocation. If, in practice, there could be thousands of invocations of this chain at a time, this could result in significantly fewer database update statements.

Celery beat tasks:
Invoked individually only asynchronously:
Invoked as part of a chain only:
Invoked as part of a chain and individually:
Invoked individually both synchronously and asynchronously:
Invoked synchronously only:
Can't find any calls:
2 Chains:
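To illustrate the per-invocation cost discussed above, here is a sketch of the "standard sql" select-then-update path against a stand-in sqlite table. Table and column names are assumptions based on this PR's description, not Galaxy's real schema; the single-statement postgres variant is shown in a comment.

```python
import datetime
import sqlite3


def schedule_next_execution_standard(conn, user_id, min_interval_secs):
    """Standard-sql path: a SELECT followed by an UPDATE (two statements).

    Table/column names follow the PR's description of celery_user_rate_limit
    but are otherwise assumptions.
    """
    now = datetime.datetime.now(datetime.timezone.utc)
    row = conn.execute(
        "SELECT last_scheduled_time FROM celery_user_rate_limit WHERE user_id = ?",
        (user_id,),
    ).fetchone()
    last = datetime.datetime.fromisoformat(row[0])
    # the next execution may not come sooner than min_interval_secs after the last
    next_time = max(now, last + datetime.timedelta(seconds=min_interval_secs))
    conn.execute(
        "UPDATE celery_user_rate_limit SET last_scheduled_time = ? WHERE user_id = ?",
        (next_time.isoformat(), user_id),
    )
    return next_time


# Under postgres the same bookkeeping fits in a single round trip, e.g.:
#   UPDATE celery_user_rate_limit
#      SET last_scheduled_time = GREATEST(
#              now(), last_scheduled_time + make_interval(secs => :interval))
#    WHERE user_id = :user_id
#    RETURNING last_scheduled_time;
```

This is why rate limiting every task in a 4-task chain costs 4 of these calls per chain invocation, while rate limiting only the first task costs 1.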
Right now there is only one task, recalculate_user_disk_usage, that accepts a user_id as a parameter. It gets the user id from the transaction (ProvidesUserContext) parameter. It was mentioned that the currently logged-in user isn't necessarily the right user for rate limiting. How do I determine where to get the user id from for the other tasks? Do I get it from the transaction as in recalculate_user_disk_usage? If not, from where?
Are there celery task execution statistics available for our production site? This could provide guidance on which tasks would benefit most from rate limiting.
Thank you for addressing the comments! Could you, please, do a rebase instead of a merge? That gives us a clean commit history that's easier to read and debug. (The merge commits will disappear once you do the rebase.)

The tests are broken because of a migration error, which is easily fixable. There have been new migrations added to the dev branch. As a result, Alembic cannot construct a revision script sequence: there are now two revision scripts claiming the same parent revision, so the migration history has branched.
To fix this, you need to change the parent of your revision (its down_revision) to the current head.

Once the error is fixed, could you, please, extract the migration (i.e., the revision script module) into its own commit? (You could squash 76664fe, then split 41cd07c into a commit with the migration and a commit with everything else.)
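The fix described above amounts to editing the revision script's header; as a sketch (the parent revision id below is a placeholder for whatever `alembic heads` reports, not a real value):

```python
# 987ce9839ecb_create_celery_user_rate_limit_table.py (header sketch)
revision = "987ce9839ecb"
# change this from the old parent to the current head of the branch:
down_revision = "<current-head-revision-id>"
```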
These are the steps I took when I started this work:

git remote add upstream git@github.com:galaxyproject/galaxy
Given the above, is it too late to rebase? If not, what should I do? Also, what steps should I have taken to do this the right way?
@claudiofr All your steps look fine; so just a few comments. (You certainly won't have to make any edits from scratch!)
To do a rebase (it is not too late, and there will be no conflicts - I've just tried it):
If a rebase cannot proceed automatically, there'll be a message about conflicts you'll need to resolve. But in this case, there should be no conflicts. (If there are, let me know and I'll help.) That's all for the rebase!

In order to extract the migration into its separate commit, here's what you can do. First, I suggest combining the 2 commits into 1: your first commit where you added the migration script and the commit where you made the change to it; then you can easily extract the migration in the next step. You can do it via squashing:
Replace "pick" with "squash" for the "Replace calls to alembic..." commit:
Then save and exit. You'll be asked to confirm (or change) the commit message. After that you're done. Your commit history looks like this:
Now you can split your first commit into the migration and everything else:
Save and exit. You are currently at the commit you want to change. You need to reset your changes, and then rearrange them as you see fit:
Now you can simply add the files and commit as you like. For example:
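The example that should follow here didn't survive extraction; one plausible shape for it, run in a throwaway repo so it is self-contained (the file paths and commit messages are hypothetical):

```shell
# sketch: commit the migration script on its own, then everything else
set -e
cd "$(mktemp -d)" && git init -q . && git config user.email dev@example.org && git config user.name dev
mkdir -p lib/galaxy/model/migrations
echo "# revision script" > lib/galaxy/model/migrations/987ce9839ecb_create_celery_user_rate_limit_table.py
echo "# rate limit logic" > lib/galaxy/celery_rate_limit.py

git add lib/galaxy/model/migrations/987ce9839ecb_create_celery_user_rate_limit_table.py
git commit -q -m "Add celery_user_rate_limit migration"
git add -A
git commit -q -m "Add celery user rate limit logic"
git log --oneline
```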
One note of caution: a rebase cannot be (easily) undone. So when I have potentially challenging rebases, I create a new branch and do a trial run to be on the safe side. Also, if things go wrong during the rebase, you can always do git rebase --abort.

Also, just in case, here's my go-to for all things git: https://git-scm.com/book/en/v2/Git-Tools-Rewriting-History (this link is specific to what we're doing here). I hope this helps! Please ping me if any of this doesn't make sense.
A rebase will fix this. See my previous comment (first part). When you see the output of the rebase command, run
I made the changes and the prior problem appears to be fixed. Do you know when you might have a chance to address some of the issues I posted above about passing user id to tasks and which tasks to rate limit?
Great! All tests are green! And yes, I'll address the other issues today.
Definitely.
required: false
desc: |
  Applies if celery_user_rate_limit is non-zero. Used for testing against
  a postgres db. Forces use of standard sql code rather than
Is this config option used for tests only? If so, can we keep it in the testing code and not in the config schema? (we try to limit our user-facing configuration to options relevant for Galaxy admins). I don't think an admin would need to select what SQL dialect is being used.
Originally I did not put it in the config schema for the reason you mention. That is why I referenced it in the code using the self.config.get("celery_user_rate_limit_standard_before_start", False) syntax, which you commented on earlier. However, it occurred to me that there is a chance, albeit small, that a developer may in the future introduce a config parameter with the same name in the config schema, so I decided to put it in the config schema. Without a central registry of all parameters, including those used only for testing, there is always a chance of a naming collision. That said, I'm ok with removing it from the config schema and using the self.config.get syntax to get its value.
Ah, I see! I didn't realize at the time that it was only there for the purpose of setting up the right configuration for a test. Given that this option is only relevant in the context of a test, it (ideally) shouldn't be part of the main code base. (I say "ideally" because we still have a couple of settings like that left over from ancient times, but we try to remove those as time permits). However, you don't really need it: Galaxy can determine what path to take based on the value of database_connection: with a non-postgres connection a test will automatically take the standard sql path.
I thought it would be useful to test the standard sql path even with postgres. That is why I introduced this testing parameter. Given that, it would have to be in the main code base.
It'll never take that path: any such alternative sql implementations in galaxy are strictly postgres OR anything else (with sqlite being the default and mysql no longer formally supported, but known to work). So as long as the conn string is identified as postgres, the postgres-specific implementation should be selected.
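The postgres-or-everything-else branching described above can be as simple as a prefix check on the connection string; a minimal sketch (Galaxy's real is_postgres helper may be implemented differently):

```python
def is_postgres(database_connection: str) -> bool:
    """Minimal sketch of a dialect check on a SQLAlchemy-style URL;
    Galaxy's actual helper may inspect the URL more robustly."""
    return database_connection.startswith(
        ("postgres://", "postgresql://", "postgresql+")
    )
```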
found_user_ids = conn.scalars(
    text("select id from galaxy_user where id between 1 and :high"), {"high": num_users}
).all()
if len(expected_user_ids) > len(found_user_ids):
With the default value of num_users (which is what the tests are currently using; EDIT: tests are using 3), len(expected_user_ids) will be 1. Is that correct? So, when num_users == 2, and the database has only one user, this code won't create any additional users, and will execute tasks for one user only. I would expect that the number of users tested would equal the value of the num_users parameter?

EDIT: I think that's indeed a bug: I've just run into an integrity error under postgres: it couldn't find user.id=2. Changing to num_users + 2 in setup_users fixed it. However, I haven't verified this - it could be an error in my code too.
It is a bug, but it never caused a problem because, at the time this code runs, there apparently are no users in the galaxy_user table, so len(found_user_ids) is always 0 and the condition len(expected_user_ids) > len(found_user_ids) is always true. Apparently the integration testing framework creates a brand new empty database every time. The test assumes that there will always be a user id = 1, but apparently it is not present at the time this function runs and it somehow gets populated at some point after this code runs. I will change the code to say:

select id from galaxy_user where id between 2 and :high
Sometimes a new db is created, sometimes it is not - it depends on how the tests are run. I think run_tests.sh will try to reuse the postgres database, whereas a direct execution via pytest will not, but I may be mistaken.

The assumption about user.id=1 holds when the test uses the database that has been created by the testing framework. However, when a separate db is created for a test (like when we need to execute under sqlite regardless of the db env var passed to the test in the CI), the default user is not created. So my suggestion would be to not make any assumptions about existing records in the db.
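Following that suggestion, a setup helper can derive everything from what is actually in the table instead of assuming fixed ids; a sketch against a stand-in sqlite schema (table and column names are assumptions, not Galaxy's real setup_users):

```python
import sqlite3


def setup_users(conn: sqlite3.Connection, num_users: int) -> list:
    """Create however many users are missing and return num_users user ids,
    without assuming any particular id (such as id=1) already exists."""
    existing = [row[0] for row in conn.execute("SELECT id FROM galaxy_user")]
    for i in range(num_users - len(existing)):
        cur = conn.execute(
            "INSERT INTO galaxy_user (email) VALUES (?)",
            (f"user{len(existing)}@example.org",),
        )
        existing.append(cur.lastrowid)  # take whatever id the db assigned
    return existing[:num_users]
```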
lib/galaxy/app.py
""" | ||
if self.config.celery_user_rate_limit: | ||
task_before_start: GalaxyTaskBeforeStart | ||
if ( |
We have a utility method is_postgres that does exactly that, but it is located in the wrong place. My suggestion is to move that method into galaxy.model.database_utils (there's only one other place where it's used - it would need to be adjusted), and use it here. Given that you wouldn't have the celery_user_rate_limit_standard_before_start config option, I think this code would become even simpler:

if is_postgres(self.config.database_connection):
    task_before_start = GalaxyTaskBeforeStartUserRateLimitPostgres(
        self.config.celery_user_rate_limit, self.model.session
    )
else:
    task_before_start = GalaxyTaskBeforeStartUserRateLimitStandard(
        self.config.celery_user_rate_limit, self.model.session
    )
I will make changes to use is_postgres. I would like to keep the celery_user_rate_limit_standard_before_start option, though I admit I can't really see a need for it if we are using postgres as the database.
OK, almost done here!
So, 3 tests. And a few edits: you can use …; add …; finally, move the call to … So, you might have something like this:

class TestRateLimit(TestCeleryUserRateLimitIntegration):
    def setUp(self):
        super().setUp()
        dburl = self._app.config.database_connection  # that's how you get to the galaxy app instance
        setup_users(dburl)

    @classmethod
    def handle_galaxy_config_kwds(cls, config):
        super().handle_galaxy_config_kwds(config)
        config["check_migrate_databases"] = False  # this goes ONLY in this one
        config["database_connection"] = sqlite_url()
        # ... all your sqlite-specific config settings + rate limiting go here

    def test_mock_pass_user_id_task(self):
        self._test_mock_pass_user_id_task([1, 2], 3, 0.1)

@skip_unless_postgres()
class TestRateLimitPostgres(TestCeleryUserRateLimitIntegration):
    # setUp: same as previous class
    # handle_galaxy_config_kwds: same, minus sqlite-related stuff
    # test_mock_pass_user_id_task: same

class TestNoRateLimit(TestCeleryUserRateLimitIntegration):
    # same as above; also no handle_galaxy_config_kwds

Overall, this is great work, really! We're almost done - I'm very much looking forward to merging this. Thank you for all your work on this, and thank you for bearing with me!
I'm adding task_user_id to all of the tasks. Calling functions contain either a user object of type model.User, or a trans object. Can I assume that the model.User object and trans objects are always not None? This would mean I can pass user.id or trans.user.id without first having to verify that either user or trans.user is not None.
I suppose it is safe to assume that if the caller's scope contains an object of type model.User, it is not None.
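Where that assumption is in doubt (e.g. anonymous sessions), a defensive extraction keeps the call sites simple; a sketch with a hypothetical helper name:

```python
def extract_task_user_id(trans):
    """Hypothetical helper: pull a user id from a trans-like object,
    tolerating anonymous sessions where trans.user is None."""
    user = getattr(trans, "user", None)
    return user.id if user is not None else None
```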
Introduce a celery_user_rate_limit config parameter which specifies how many tasks per second a user can execute. Create a custom celery Task class called GalaxyTask with a before_start hook that implements the rate limiting logic. Add a new table, celery_user_rate_limit, that tracks the last scheduled execution time by user id. Add a new integration test, test_celery_user_rate_limit.py.
Added a migration script to create the new table, which tracks the last scheduled execution time for a user; this is used to schedule the next execution time in order to limit the task execution rate.
…dule. Moved behavioral logic from CeleryUserRateLimit model class to base_task module because model classes should be plain old classes. Ran make config-schema to update galaxy_options.rst and galaxy.yml.sample based on changes made to config_schema.yml.
…nality. Remove unnecessary integration tests. Change calls to celery tasks to pass values to new task_user_id parameter which is used for user rate limiting.
…nality. Fix signature of the overridden PurgableManagerMixin.purge method. Rather than having an explicit user parameter, assume it is in the **kwargs param.
…nality. Added # type: ignore[arg-type] for dburl parameter to is_postgres function.
I made the suggested code changes. I also added task_user_id to most of the celery task functions and changed the calling code to pass the user id.
Thank you, @claudiofr, for your contribution and for addressing all the comments! This is great work!
Closes #14411
Introduce a celery_user_rate_limit config parameter which specifies how many tasks per second a user can execute. Create a custom celery Task class called GalaxyTask with a before_start hook that implements the rate limiting logic. Add a new table, celery_user_rate_limit, that tracks the last scheduled execution time by user id. Add a new integration test, test_celery_user_rate_limit.py.
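A minimal sketch of such a table in sqlite DDL (column names are assumptions based on the summary above, not the migration's actual definitions):

```python
import sqlite3

# Hedged sketch of the celery_user_rate_limit table described above:
# one row per user, recording when that user's last task was scheduled.
DDL = """
CREATE TABLE celery_user_rate_limit (
    user_id INTEGER PRIMARY KEY,
    last_scheduled_time TIMESTAMP NOT NULL
)
"""
conn = sqlite3.connect(":memory:")
conn.execute(DDL)
```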
A task function must include the keyword parameter task_user_id for user rate limiting to work.