Skip to content

Commit bf8bbae

Browse files
author
Philipp Stanner
committed
drm/sched: Avoid memory leaks with cancel_job() callback
Since its inception, the GPU scheduler can leak memory if the driver calls drm_sched_fini() while there are still jobs in flight. The simplest way to solve this in a backwards compatible manner is by adding a new callback, drm_sched_backend_ops.cancel_job(), which instructs the driver to signal the hardware fence associated with the job. Afterwards, the scheduler can safely use the established free_job() callback for freeing the job. Implement the new backend_ops callback cancel_job(). Suggested-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Link: https://lore.kernel.org/dri-devel/20250418113211.69956-1-tvrtko.ursulin@igalia.com/ Reviewed-by: Maíra Canal <mcanal@igalia.com> Acked-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Signed-off-by: Philipp Stanner <phasta@kernel.org> Link: https://lore.kernel.org/r/20250710125412.128476-4-phasta@kernel.org
1 parent fe69a39 commit bf8bbae

File tree

2 files changed

+39
-13
lines changed

2 files changed

+39
-13
lines changed

drivers/gpu/drm/scheduler/sched_main.c

Lines changed: 21 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -1352,26 +1352,30 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
13521352
}
13531353
EXPORT_SYMBOL(drm_sched_init);
13541354

1355+
static void drm_sched_cancel_remaining_jobs(struct drm_gpu_scheduler *sched)
1356+
{
1357+
struct drm_sched_job *job, *tmp;
1358+
1359+
/* All other accessors are stopped. No locking necessary. */
1360+
list_for_each_entry_safe_reverse(job, tmp, &sched->pending_list, list) {
1361+
sched->ops->cancel_job(job);
1362+
list_del(&job->list);
1363+
sched->ops->free_job(job);
1364+
}
1365+
}
1366+
13551367
/**
13561368
* drm_sched_fini - Destroy a gpu scheduler
13571369
*
13581370
* @sched: scheduler instance
13591371
*
13601372
* Tears down and cleans up the scheduler.
13611373
*
1362-
* This stops submission of new jobs to the hardware through
1363-
* drm_sched_backend_ops.run_job(). Consequently, drm_sched_backend_ops.free_job()
1364-
* will not be called for all jobs still in drm_gpu_scheduler.pending_list.
1365-
* There is no solution for this currently. Thus, it is up to the driver to make
1366-
* sure that:
1367-
*
1368-
* a) drm_sched_fini() is only called after for all submitted jobs
1369-
* drm_sched_backend_ops.free_job() has been called or that
1370-
* b) the jobs for which drm_sched_backend_ops.free_job() has not been called
1371-
* after drm_sched_fini() ran are freed manually.
1372-
*
1373-
* FIXME: Take care of the above problem and prevent this function from leaking
1374-
* the jobs in drm_gpu_scheduler.pending_list under any circumstances.
1374+
* This stops submission of new jobs to the hardware through &struct
1375+
* drm_sched_backend_ops.run_job. If &struct drm_sched_backend_ops.cancel_job
1376+
* is implemented, all jobs will be canceled through it and afterwards cleaned
1377+
* up through &struct drm_sched_backend_ops.free_job. If cancel_job is not
1378+
* implemented, memory could leak.
13751379
*/
13761380
void drm_sched_fini(struct drm_gpu_scheduler *sched)
13771381
{
@@ -1401,6 +1405,10 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
14011405
/* Confirm no work left behind accessing device structures */
14021406
cancel_delayed_work_sync(&sched->work_tdr);
14031407

1408+
/* Avoid memory leaks if supported by the driver. */
1409+
if (sched->ops->cancel_job)
1410+
drm_sched_cancel_remaining_jobs(sched);
1411+
14041412
if (sched->own_submit_wq)
14051413
destroy_workqueue(sched->submit_wq);
14061414
sched->ready = false;

include/drm/gpu_scheduler.h

Lines changed: 18 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -512,6 +512,24 @@ struct drm_sched_backend_ops {
512512
* and it's time to clean it up.
513513
*/
514514
void (*free_job)(struct drm_sched_job *sched_job);
515+
516+
/**
517+
* @cancel_job: Used by the scheduler to guarantee remaining jobs' fences
518+
* get signaled in drm_sched_fini().
519+
*
520+
* Used by the scheduler to cancel all jobs that have not been executed
521+
* with &struct drm_sched_backend_ops.run_job by the time
522+
* drm_sched_fini() gets invoked.
523+
*
524+
* Drivers need to signal the passed job's hardware fence with an
525+
* appropriate error code (e.g., -ECANCELED) in this callback. They
526+
* must not free the job.
527+
*
528+
* The scheduler will only call this callback once it stopped calling
529+
* all other callbacks forever, with the exception of &struct
530+
* drm_sched_backend_ops.free_job.
531+
*/
532+
void (*cancel_job)(struct drm_sched_job *sched_job);
515533
};
516534

517535
/**

0 commit comments

Comments
 (0)