
Consistent null pointer exceptions from RLEScanPartialInternal when multiple databases are loaded #554

@jghoman


What happens?

We’re seeing consistent null pointer exceptions (NPEs) when executing queries in a Rust application that has multiple DuckDB databases loaded. A full stack trace from a coredump is here, but the relevant call is:

(gdb) info threads
  Id   Target Id                      Frame 
* 1    Thread 0x7f264abfd700 (LWP 48) 0x0000562085dbda04 in void duckdb::RLEScanPartialInternal<signed char, true>(duckdb::ColumnSegment&, duckdb::ColumnScanState&, unsigned long, duckdb::Vector&, unsigned long) ()
  2    Thread 0x7f26497fb700 (LWP 50) 0x00007f2651cac388 in futex_abstimed_wait_cancelable (private=0, 
    abstime=0x7f26497f8c10, clockid=0, expected=0, futex_word=0x56208d515c00) at ../sysdeps/nptl/futex-internal.h:323
  3    Thread 0x7f2650d9c700 (LWP 45) 0x00007f2651cac388 in futex_abstimed_wait_cancelable (private=0, 

This stack trace is identical across all of the NPEs. The crashing call originates either from our code, as here, or from the TaskScheduler.


We have a Rust application that uses the duckdb-rs library to access multiple DuckDB databases. We use R2D2 to pool the connections for each database.

One potential twist: the individual databases are newer versions of the same underlying data. The database files themselves have different names (if that matters; I don’t think it does?). The contents of the tables differ, but each file contains tables with the same names, etc. For example, we may have dataset-A-123.duckdb loaded and then also host dataset-A-345.duckdb.
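To make the setup above concrete, here is a minimal sketch of a "one pool per database file" layout, keyed by file path. `DbPool` and `PoolRegistry` are hypothetical stand-ins: in the real application `DbPool` would be an R2D2 pool of duckdb-rs connections opened on one file. The sketch only shows the intended isolation, namely that two versions of the same dataset never share a pool.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for one R2D2 pool of duckdb-rs connections.
/// In the real application this would wrap a pool opened on `path`.
#[derive(Debug)]
struct DbPool {
    path: PathBuf,
}

/// Hypothetical registry holding exactly one pool per database file.
#[derive(Default)]
struct PoolRegistry {
    pools: Mutex<HashMap<PathBuf, Arc<DbPool>>>,
}

impl PoolRegistry {
    /// Returns the pool for `path`, creating it on first use.
    fn pool_for(&self, path: &Path) -> Arc<DbPool> {
        let mut pools = self.pools.lock().expect("registry mutex poisoned");
        let pool = pools
            .entry(path.to_path_buf())
            .or_insert_with(|| Arc::new(DbPool { path: path.to_path_buf() }));
        Arc::clone(pool)
    }
}
```

For example, `registry.pool_for(Path::new("dataset-A-123.duckdb"))` and `registry.pool_for(Path::new("dataset-A-345.duckdb"))` return distinct pools, while repeated lookups of the same file return the same one. Keying by path means same-named tables in different files are only ever reached through different handles.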


We consistently see the NPEs when no queries run for a while (on the order of tens of minutes) and then a routine query comes in. Unfortunately, it’s not deterministic enough to reproduce reliably in a unit test; the queries against the databases are very complex and dynamic.

I’ve checked that there have been no recent changes to the RLEScanPartialInternal code where the crashes happen. We’re running v1.3.2, but also saw this with v1.3.1.


Questions:

  • Each DuckDB file that’s opened should have its own memory space, correct? Loading a new DuckDB file should not be able to affect an existing one, even though two or more files may contain schemas and tables with the same names.
  • Is there any way to explicitly isolate the memory of multiple DuckDB files loaded in the same process? (Running them out of process is not an option here.)
  • Is there something specific to the RLE encoding code that would trigger this? It’s the same call every time. The exact instruction that crashes is:


=> 0x0000562085dbda04 <+100>:   movzwl (%r12,%rax,2),%eax

movzwl loads a 16-bit value from memory and zero-extends it to 32 bits, filling the upper 16 bits with zeros. However, the address it reads from (computed as %r12 + 2*%rax) is zero when the instruction executes, hence the NPE. I haven’t yet traced what specifically invokes this instruction.
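For reference, the faulting instruction’s semantics can be sketched in Rust. In AT&T syntax, the operand `(%r12,%rax,2)` addresses `%r12 + 2*%rax`: base `%r12`, index `%rax`, scale 2, which is the usual shape of a read from an array of 16-bit elements. The functions below are illustrative only and are not DuckDB code. Which DuckDB variable ends up in `%r12` is an assumption that has not been confirmed.

```rust
/// Effective address of an AT&T-syntax memory operand `(base,index,scale)`,
/// e.g. `(%r12,%rax,2)` => r12 + rax * 2. Wrapping arithmetic mirrors the CPU.
fn effective_address(base: u64, index: u64, scale: u64) -> u64 {
    base.wrapping_add(index.wrapping_mul(scale))
}

/// What `movzwl` does with the 16-bit word it loads: zero-extend it to 32 bits.
fn movzwl(word: u16) -> u32 {
    u32::from(word)
}

/// Safe analogue of the faulting load: `None` stands in for a null base
/// pointer, and an out-of-range index also yields `None` instead of reading
/// past the buffer.
fn load_u16_zero_extended(base: Option<&[u16]>, index: usize) -> Option<u32> {
    base.and_then(|slice| slice.get(index)).map(|&w| movzwl(w))
}
```

With a null base and index 0, `effective_address(0, 0, 2)` is 0, which matches the observed fault. With a non-null base, the same operand reads element `index` of a `u16` array.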

To Reproduce

I'm still working to be able to reliably reproduce this.
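Because the crash follows an idle period, one way to attempt a reproduction is to alternate long idle periods with a routine query. This is a minimal sketch under that assumption. `run_query` is a hypothetical closure standing in for the application’s real query path, such as checking a connection out of the pool and running a representative query.

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical reproduction loop: for each of `rounds` rounds, stay idle for
/// `idle`, then invoke `run_query`, a placeholder for the application's real
/// query path. Returns the number of rounds that completed.
fn idle_then_query<F: FnMut()>(rounds: u32, idle: Duration, mut run_query: F) -> u32 {
    let mut completed = 0;
    for round in 1..=rounds {
        thread::sleep(idle);
        // stderr is unbuffered, so this line is written before the query runs;
        // if the query crashes the process, the last line shows which round did it.
        eprintln!("round {round}: idle for {idle:?}, running query");
        run_query();
        completed += 1;
    }
    completed
}
```

For example, `idle_then_query(10, Duration::from_secs(20 * 60), || { /* run the routine query */ })` would approximate the "idle for tens of minutes, then query" pattern. The 20-minute value is only illustrative. A segfault kills the process rather than returning an error, so the harness logs before each query instead of checking a result.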

OS:

x86_64-linux

DuckDB Version:

1.3.2

DuckDB Client:

Rust

Hardware:

Relatively large-memory instances (64 GB), with DuckDB configured to use ~60 GB.

Full Name:

Jakob Homan

Affiliation:

ASF

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

No - I cannot share the data sets because they are confidential

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
