-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix export and export cleanup job hangs in scheduled state #8198
Conversation
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the WalkthroughThe update enhances the retry logic for job exports in CVAT's dataset manager by introducing a new Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant View as View (views.py)
participant Scheduler as RQ Scheduler
participant Job as Export Job
User->>View: Trigger export
View->>Job: Start job
Job-->>View: LockNotAvailableError
View->>View: Call _retry_current_rq_job
View->>Scheduler: Enqueue retry job with delay
Scheduler->>Job: Retry job after delay
Job-->>View: Job completed
View-->>User: Export result
Poem
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (invoked as PR comments)
Additionally, you can add CodeRabbit Configuration File (
|
Quality Gate passedIssues Measures |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Outside diff range, codebase verification and nitpick comments (1)
cvat/apps/dataset_manager/tests/test_rest_api_formats.py (1)
33-33
: Remove unnecessary import.The
django.utils.timezone
import is not used within the file and should be removed to keep the code clean.
- File:
cvat/apps/dataset_manager/tests/test_rest_api_formats.py
- Line: 33
Analysis chain
Verify the necessity of the added import.
The
django.utils.timezone
import is added but not used within the provided code. Ensure it's necessary.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify if `django.utils.timezone` is used within the file. # Test: Search for `timezone` usage. Expect: At least one occurrence. rg --type python 'timezone'Length of output: 58
Script:
#!/bin/bash # Description: Verify if `django.utils.timezone` is used within the file `cvat/apps/dataset_manager/tests/test_rest_api_formats.py`. # Test: Search for `timezone` usage. Expect: At least one occurrence. rg 'timezone' cvat/apps/dataset_manager/tests/test_rest_api_formats.pyLength of output: 104
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (3)
- changelog.d/20240718_202907_mzhiltso_fix_export_job_retries.md (1 hunks)
- cvat/apps/dataset_manager/tests/test_rest_api_formats.py (4 hunks)
- cvat/apps/dataset_manager/views.py (7 hunks)
Additional comments not posted (9)
changelog.d/20240718_202907_mzhiltso_fix_export_job_retries.md (1)
1-4
: Changelog entry looks good.The changelog entry is clear and accurately reflects the changes made in the PR.
cvat/apps/dataset_manager/views.py (6)
16-16
: New import for Scheduler looks good.The import
from rq_scheduler import Scheduler
is necessary for the new retry mechanism.
22-22
: New import for get_rq_lock_by_user looks good.The import
from cvat.apps.engine.utils import get_rq_lock_by_user
is necessary for acquiring locks by user.
63-98
: Ensure robustness of the retry mechanism.The function
_retry_current_rq_job
is well-implemented to handle retries using the RQ scheduler. However, the TODO comment indicates potential future improvements for better reliability and error reporting. Additionally, the implementation might lead to two jobs with the same name running in parallel if the retry is enqueued immediately.Consider addressing the following:
- Ensure that the retry mechanism does not lead to multiple jobs running in parallel.
- Plan for future improvements to use the built-in RQ scheduler for better reliability.
150-152
: Scheduler initialization looks good.The initialization of the
Scheduler
for the export function is correct.
176-182
: Retry mechanism integration looks good.The integration of the new retry mechanism in the
export
function is correct. It properly handlesLockNotAvailableError
and logs retry attempts.
Line range hint
227-246
:
Retry mechanism integration in clear_export_cache looks good.The integration of the new retry mechanism in the
clear_export_cache
function is correct. It properly handlesLockNotAvailableError
and logs retry attempts.cvat/apps/dataset_manager/tests/test_rest_api_formats.py (2)
27-27
: Verify the necessity of the added import.The
django_rq
import is added but not used within the provided code. Ensure it's necessary.Verification successful
Verified: The
django_rq
import is necessary.The
django_rq
import is used within the file for patching purposes, confirming its necessity.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify if `django_rq` is used within the file. # Test: Search for `django_rq` usage. Expect: At least one occurrence. rg --type python 'django_rq'Length of output: 59
Script:
#!/bin/bash # Description: Verify if `django_rq` is used within the file `cvat/apps/dataset_manager/tests/test_rest_api_formats.py`. # Test: Search for `django_rq` usage. Expect: At least one occurrence. rg 'django_rq' cvat/apps/dataset_manager/tests/test_rest_api_formats.pyLength of output: 1295
31-31
: Verify the necessity of the added import.The
django.conf.settings
import is added but not used within the provided code. Ensure it's necessary.Verification successful
The import statement
from django.conf import settings
is necessary.The
settings
object is used within the filecvat/apps/dataset_manager/tests/test_rest_api_formats.py
, confirming the necessity of the import.
cvat/apps/dataset_manager/tests/test_rest_api_formats.py
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify if `django.conf.settings` is used within the file. # Test: Search for `settings` usage. Expect: At least one occurrence. rg --type python 'settings'Length of output: 58
Script:
#!/bin/bash # Description: Verify if `django.conf.settings` is used within the file. # Test: Search for `settings` usage. Expect: At least one occurrence. rg --type py 'settings'Length of output: 62252
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #8198 +/- ##
===========================================
+ Coverage 83.39% 83.40% +0.01%
===========================================
Files 388 388
Lines 41305 41314 +9
Branches 3854 3854
===========================================
+ Hits 34446 34459 +13
+ Misses 6859 6855 -4
|
**current_rq_job.kwargs, | ||
job_id=current_rq_job.id, | ||
meta=current_rq_job.meta, | ||
depends_on=current_rq_job.dependency_ids, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhiltsov-max, Probably, there are several bugs:
dependency_ids
can be outdated at the moment when this job will be enqueued, so one user can clog a queue with his jobs (despiteONE_RUNNING_JOB_IN_QUEUE_PER_USER
==True)- let's imagine a situation when:
there were 3 rq jobs:J1
<-J2(allow_dependency_failures=True)
<-J3(allow_dependency_failures=True)
J1
was executed successfully ->J2
was started ->J1
was deleted by CVAT code after downloading a file ->J2
was scheduled with allow_dependency_failures = None (because it is not copied in the current implementation), ->J1
was recreated after one more request from a user and nowJ1
depends on theJ3
. What will be at the end? Will J3 be enqueued afterJ2
is placed to the scheduled job registry?
Motivation and context
There are 2 schedulers supported by django_rq and by python RQ:
rq_scheduler
and a newer, builtin queue scheduler in RQ. rq_scheduler seems to die slowly in favor of the builtin scheduler. The schedulers have compatible API, but not the implementation. The existing job retry implementation relies onretry()
calls, which, in turn, rely on the builtin RQ scheduler. CVAT uses rq_scheduler a for some tasks, so it its executed. The builtin RQ scheduler needs the--with-scheduler
startup parameter on the worker processes. Thus, the jobs were hanging in the scheduled state, as the builtin RQ scheduler was not running on the queues. As CVAT is currently using rq_scheduler, it's decided to continue using it to avoid disruption and use of 2 schedulers together. The implementation in this PR does best efforts to be correct, but it's has potential problems with multiple same jobs running in parallel.In future we need to migrate to the builtin RQ scheduler, as it is the only one maintained as of February 2023.
How has this been tested?
Checklist
develop
branch(cvat-canvas,
cvat-core,
cvat-data and
cvat-ui)
License
Feel free to contact the maintainers if that's a concern.
Summary by CodeRabbit
Bug Fixes
Chores