Apache Iceberg version
1.10.0
Query engine
Spark
Please describe the bug 🐞
Apache Iceberg version: 1.10.0
Component: org.apache.iceberg.CachingCatalog
Description
CachingCatalog holds Table objects in a Caffeine cache. When an entry is evicted (by TTL via cache.expiration-interval-ms, or by size via cache.max-total-bytes), the removal listener (MetadataTableInvalidatingRemovalListener) only invalidates the related metadata table entries. It never calls table.io().close(), so any resources held by the FileIO implementation are never released on eviction.
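For context, a setup along these lines exercises the affected path. The catalog name and warehouse location below are placeholders; the property keys are standard Iceberg Spark catalog options:

import org.apache.spark.sql.SparkSession;

public class CacheEvictionSetup {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("iceberg-cache-eviction")
        .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.demo.type", "hadoop")
        .config("spark.sql.catalog.demo.warehouse", "s3://bucket/warehouse")
        .config("spark.sql.catalog.demo.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
        .config("spark.sql.catalog.demo.cache-enabled", "true")
        // Entries expire after the configured interval; each expiry fires onRemoval().
        .config("spark.sql.catalog.demo.cache.expiration-interval-ms", "30000")
        .getOrCreate();

    // Every re-load after expiry builds a fresh Table (and S3Client);
    // the evicted one is never closed.
    spark.sql("SELECT count(*) FROM demo.db.tbl").show();
  }
}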
Impact
With io-impl = org.apache.iceberg.aws.s3.S3FileIO:
- Each evicted Table leaves behind a live AWS SDK v2 S3Client
- Each S3Client owns a ScheduledExecutorService (sdk-ScheduledExecutor-N) with background threads for credential refresh (IMDSv2)
- Each live thread is a GC root, so neither the threads nor the objects they retain can ever be collected
- In a long-running process (e.g. Spark Thrift Server), threads accumulate without bound until the JVM crashes with os::commit_memory failed; error='Not enough space' (errno=12)
Observed in production (Spark Thrift Server, ~24h uptime):
- Total JVM threads: 27,877
- sdk-ScheduledExecutor threads: 27,657
- Distinct pool instances: 8,075+
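Counts like the above can be gathered in-process with a small diagnostic; a sketch, assuming the default AWS SDK v2 thread naming (sdk-ScheduledExecutor-<pool>-<thread>):

import java.util.Set;

public class SdkThreadCount {
  public static void main(String[] args) {
    Set<Thread> threads = Thread.getAllStackTraces().keySet();
    long total = threads.stream()
        .filter(t -> t.getName().startsWith("sdk-ScheduledExecutor"))
        .count();
    long pools = threads.stream()
        .filter(t -> t.getName().startsWith("sdk-ScheduledExecutor"))
        .map(t -> t.getName().replaceAll("-\\d+$", "")) // drop per-thread suffix, keep pool name
        .distinct()
        .count();
    System.out.printf("sdk-ScheduledExecutor threads: %d, distinct pools: %d%n", total, pools);
  }
}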
Proof from bytecode
CachingCatalog$MetadataTableInvalidatingRemovalListener.onRemoval() decompiled from iceberg-spark-runtime-3.5_2.12-1.10.0:
// logs debug
// if EXPIRED and not a metadata table: cache.invalidateAll(metadataTableIdentifiers)
// return ← no close() call
There is no table.io().close() call anywhere in the eviction path.
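The pattern is straightforward to reproduce with Caffeine alone (a standalone sketch, not Iceberg code): a removal listener that only does bookkeeping leaks whatever the evicted value holds:

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.RemovalCause;
import java.time.Duration;

public class EvictionWithoutClose {
  public static void main(String[] args) throws InterruptedException {
    Cache<String, AutoCloseable> cache = Caffeine.newBuilder()
        .expireAfterAccess(Duration.ofMillis(100))
        .removalListener((String key, AutoCloseable value, RemovalCause cause) ->
            // Like CachingCatalog's listener, this only does bookkeeping;
            // the evicted value's resources are never released.
            System.out.println("evicted " + key + " (" + cause + ") without close()"))
        .build();

    cache.put("db.tbl", () -> System.out.println("closed"));
    Thread.sleep(200);
    cache.cleanUp(); // runs the removal listener for the expired entry; "closed" never prints
  }
}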
Proposed fix
In CachingCatalog.java, extend MetadataTableInvalidatingRemovalListener.onRemoval() (needs java.io.Closeable and java.io.IOException imports):

@Override
public void onRemoval(TableIdentifier key, Table value, RemovalCause cause) {
  // ... existing debug logging and metadata-table invalidation ...
  // Release FileIO resources; for S3FileIO this closes the S3Client
  // and shuts down its sdk-ScheduledExecutor threads.
  if (value != null && value.io() instanceof Closeable) {
    try {
      ((Closeable) value.io()).close();
    } catch (IOException e) {
      LOG.warn("Failed to close FileIO for evicted table {}", key, e);
    }
  }
}
Note: S3FileIO implements Closeable and its close() method calls S3Client.close(), which shuts down the ScheduledExecutorService and releases all threads. This fix is sufficient to resolve the leak.
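A quick local check is possible with a sketch like the one below. It assumes AWS region/credentials resolve from the environment, that S3FileIO.client() forces construction of the lazy S3Client, and that scheduler threads have spun up by that point (this depends on the credential provider in use):

import java.util.Map;
import org.apache.iceberg.aws.s3.S3FileIO;

public class S3FileIOCloseCheck {
  private static long sdkThreads() {
    return Thread.getAllStackTraces().keySet().stream()
        .filter(t -> t.getName().startsWith("sdk-ScheduledExecutor"))
        .count();
  }

  public static void main(String[] args) throws InterruptedException {
    S3FileIO io = new S3FileIO();
    io.initialize(Map.of());
    io.client(); // build the lazy S3Client
    System.out.println("threads before close: " + sdkThreads());
    io.close();  // shuts down the S3Client and its scheduler
    Thread.sleep(1_000); // give the executor a moment to terminate
    System.out.println("threads after close:  " + sdkThreads());
  }
}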
Willingness to contribute