CachingCatalog does not close FileIO on cache eviction, causing S3FileIO / SDK v2 thread leak in long-running applications #15898

@boaz-gold

Apache Iceberg version

1.10.0

Query engine

Spark

Please describe the bug 🐞

Apache Iceberg version: 1.10.0

Component: org.apache.iceberg.CachingCatalog

Description

CachingCatalog uses a Caffeine cache to hold Table objects. When an entry is evicted (by TTL via cache.expiration-interval-ms or by size via cache.max-total-bytes), the RemovalListener (MetadataTableInvalidatingRemovalListener) only invalidates related metadata table entries. It does not call table.io().close().

This means any resources held by the FileIO implementation are never released on eviction.

Impact

With io-impl = org.apache.iceberg.aws.s3.S3FileIO:

  • Each evicted Table leaves behind a live AWS SDK v2 S3Client
  • Each S3Client owns a ScheduledExecutorService (sdk-ScheduledExecutor-N) with background threads for credential refresh (IMDSv2)
  • Live threads are GC roots, so neither these threads nor the clients they reference are ever garbage-collected
  • In a long-running process (e.g. Spark Thrift Server), threads accumulate without bound until the JVM crashes with os::commit_memory failed; error='Not enough space' (errno=12)
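
The mechanism behind the list above can be reproduced with the JDK alone. The following is a minimal sketch, not the AWS SDK's code: `LeakedSchedulerDemo`, `startScheduler`, `countThreads`, and `shutdownAndAwait` are hypothetical names, and the scheduler is only a stand-in for the executor an S3Client is described as owning. It shows that a scheduler whose reference is dropped keeps its worker thread alive until it is shut down explicitly.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;

final class LeakedSchedulerDemo {

    // Counts live threads whose name starts with the given prefix.
    static long countThreads(String prefix) {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(t -> t.isAlive() && t.getName().startsWith(prefix))
                .count();
    }

    // Hypothetical stand-in for a client that owns a background scheduler.
    // Scheduling a periodic task starts the single worker thread.
    static ScheduledExecutorService startScheduler(String threadName) {
        ThreadFactory factory = r -> {
            Thread t = new Thread(r, threadName);
            t.setDaemon(true); // daemon, so the demo never blocks JVM exit
            return t;
        };
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1, factory);
        scheduler.scheduleAtFixedRate(() -> { }, 0, 1, TimeUnit.HOURS);
        return scheduler;
    }

    // The equivalent of close(): stops the worker and waits for termination.
    static boolean shutdownAndAwait(ScheduledExecutorService scheduler) {
        scheduler.shutdownNow();
        try {
            return scheduler.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Calling `startScheduler` in a loop and discarding the return values, then calling `System.gc()`, leaves `countThreads` unchanged; only `shutdownAndAwait` releases a worker.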

Observed in production (Spark Thrift Server, ~24h uptime):
Total JVM threads: 27,877
sdk-ScheduledExecutor: 27,657
Distinct pool instances: 8,075+
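
Counts like these can be derived from thread names, either inside the JVM or from the names in a thread dump. The sketch below is one way to do it and rests on an assumption: that thread names have the form `sdk-ScheduledExecutor-<pool>-<thread>`, so stripping the last `-<n>` segment yields a per-pool key. `ThreadNameSummary` and its methods are hypothetical helpers, not part of Iceberg or the AWS SDK.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ThreadNameSummary {

    // Groups thread names that start with `prefix` by pool and counts them.
    // ASSUMPTION: names look like "<prefix>-<pool>-<thread>"; the pool key is
    // the name with its last "-<thread>" segment removed.
    static Map<String, Integer> threadsPerPool(Collection<String> threadNames, String prefix) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String name : threadNames) {
            if (!name.startsWith(prefix)) {
                continue;
            }
            int lastDash = name.lastIndexOf('-');
            // Only strip a segment if something remains after the prefix.
            String pool = lastDash > prefix.length() ? name.substring(0, lastDash) : name;
            counts.merge(pool, 1, Integer::sum);
        }
        return counts;
    }

    // Names of all threads currently known to this JVM.
    static List<String> liveThreadNames() {
        List<String> names = new ArrayList<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            names.add(t.getName());
        }
        return names;
    }
}
```

For example, `threadsPerPool(liveThreadNames(), "sdk-ScheduledExecutor")` gives the distinct pool count as the map's size and the matching thread total as the sum of its values.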

Proof from bytecode

CachingCatalog$MetadataTableInvalidatingRemovalListener.onRemoval() decompiled from iceberg-spark-runtime-3.5_2.12-1.10.0:

// logs debug
// if EXPIRED and not a metadata table: cache.invalidateAll(metadataTableIdentifiers)
// return ← no close() call

There is no table.io().close() call anywhere in the eviction path.

Proposed fix

In CachingCatalog.java, MetadataTableInvalidatingRemovalListener.onRemoval():

if (value != null && value.io() instanceof Closeable) {
  try {
    ((Closeable) value.io()).close();
  } catch (IOException e) {
    LOG.warn("Failed to close FileIO for evicted table {}", key, e);
  }
}

Note: S3FileIO implements Closeable and its close() method calls S3Client.close(), which shuts down the ScheduledExecutorService and releases all threads. This fix is sufficient to resolve the leak.
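
To illustrate the close-on-eviction pattern without depending on Iceberg or Caffeine, here is a self-contained sketch. It is not the proposed patch: a `LinkedHashMap` with `removeEldestEntry` stands in for the Caffeine cache and its RemovalListener, and `ClosingCacheSketch`, `closeQuietly`, `closingLruCache`, and `TrackedResource` are hypothetical names used only for this example.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

final class ClosingCacheSketch {

    // Closes the value if it is Closeable. Returns true only when close()
    // was called and completed without an IOException.
    static boolean closeQuietly(Object value) {
        if (!(value instanceof Closeable)) {
            return false;
        }
        try {
            ((Closeable) value).close();
            return true;
        } catch (IOException e) {
            System.err.println("Failed to close evicted resource: " + e);
            return false;
        }
    }

    // Stand-in for the real cache: an access-ordered LRU map that closes the
    // eldest value when the size limit is exceeded.
    static <K, V> Map<K, V> closingLruCache(int maxEntries) {
        return new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                if (size() > maxEntries) {
                    closeQuietly(eldest.getValue());
                    return true;
                }
                return false;
            }
        };
    }

    // Hypothetical resource that records whether close() was called.
    static final class TrackedResource implements Closeable {
        volatile boolean closed;

        @Override
        public void close() {
            closed = true;
        }
    }
}
```

One design point worth checking before applying the same idea in the real listener: this sketch assumes each cached value owns its resource exclusively. If a FileIO instance can be shared by several tables that are still cached or in use, closing it on eviction of one of them would break the others.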

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
