Fix rare startup race condition between OpenCL background job and dt_wb_presets_init#20765
Fix rare startup race condition between OpenCL background job and dt_wb_presets_init#20765da-phil wants to merge 1 commit intodarktable-org:masterfrom
dt_wb_presets_init#20765Conversation
Fix rare starup race condition between OpenCL background job
and `dt_wb_presets_init`.
In darktable.c:1878-1887, the GUI startup path queues OpenCL detection as a background job on the system worker pool and then immediately continues to call `dt_wb_presets_init(NULL)`.
`dt_opencl_init` (called by a worker thread) loads OpenCL vendor libraries (ICD loader, driver). These vendor libraries are known to call `setlocale()` internally.
`setlocale()` in glibc:
* Acquires `__libc_setlocale_lock`
* Frees and replaces internal locale data structures (including cached locale strings)
Meanwhile, the main thread is inside `dcigettext` → `guess_category_value` → `getenv("LANGUAGE")`. The glibc `dcigettext` code is not re-entrant with `setlocale()`: it caches locale state that can be freed mid-read when `setlocale()` runs concurrently, leading to a SIGSEGV.
This does not happen in the non-GUI/CLI path (`init_gui=0`) because there `dt_opencl_init` is called synchronously before `dt_wb_presets_init`.
Potential fixes:
* Move `dt_wb_presets_init` (and `dt_noiseprofile_init`) to before the OpenCL background job is enqueued, so the JSON/gettext code finishes before any worker thread starts running the OCL job. In darktable.c around line 1876. The fix is confirmed clean for the non-GUI path since it already calls `dt_opencl_init` synchronously — the proposed reorder makes both paths consistent.
* Alternatively, a more targeted fix is to call `setlocale(LC_ALL, "")` once and freeze it (or save/restore it around `dt_opencl_init`) before JSON parsing begins, but reordering the init sequence is cleaner and less invasive.
3cc3ee3 to
61c41a7
Compare
|
IIUC we can't reproduce this bug. So, how did you find it? And more importantly, how are we going to test a "fix" to a "bug" we can't reproduce or verify? |
Who is "we"? I encountered it 3 times over the past days & weeks. I'll attach a backtrace if I still find one. I also don't know how to reproduce it. EDIT: found one and attached it to the original post. |
All the people who raise issues when darktable crashes? There's no issue, no steps to reproduce,
In your description you said it was rare. You never said you encountered it. An issue would have been good, because others might have chimed in with they had seen it too. Also all the devs that have done a lot of work on the startup sequence would have a chance to weigh in.
Honestly when I read this I thought it was an AI generated code review and not necessarily a real problem.
That'll make testing fun. I guess we test to make sure nothing breaks and sooner or later decide it's fixed if it doesn't happen again. |
To be honest, I was just too lazy to first create an issue, as I already spent some time on understanding and fixing the issue. But I created a an issue now for tracking purpose: #20770
I used AI to help me with understand the gdb backtrace and the rootcause and it helped writing this summary, but it wasn't done all by AI ;)
Yeah, makes sense, will also keep testing with this change. |
Fix rare startup race condition between OpenCL background job and
dt_wb_presets_initby reordering startup sequence. See gdb backtrace: darktable_bt_WNW7M3.txtIn darktable.c:1878-1887, the GUI startup path queues OpenCL detection as a background job on the system worker pool and then immediately continues to call
dt_wb_presets_init(NULL).dt_opencl_init(called by a worker thread) loads OpenCL vendor libraries (ICD loader, driver). These vendor libraries are known to callsetlocale()internally.setlocale()in glibc:__libc_setlocale_lockMeanwhile, the main thread is inside
dcigettext→guess_category_value→getenv("LANGUAGE"). The glibcdcigettextcode is not re-entrant withsetlocale(): it caches locale state that can be freed mid-read whensetlocale()runs concurrently, leading to a SIGSEGV. This does not happen in the non-GUI/CLI path (init_gui=0) because theredt_opencl_initis called synchronously beforedt_wb_presets_init.Potential fixes:
dt_wb_presets_init(anddt_noiseprofile_init) to before the OpenCL background job is enqueued, so the JSON/gettext code finishes before any worker thread starts running the OCL job. In darktable.c around line 1876. The fix is confirmed clean for the non-GUI path since it already callsdt_opencl_initsynchronously — the proposed reorder makes both paths consistent.setlocale(LC_ALL, "")once and freeze it (or save/restore it arounddt_opencl_init) before JSON parsing begins, but reordering the init sequence is cleaner and less invasive.Fix 1) was chosen in this case.
Fixes #20770