Apache Iceberg version
1.11.0 (latest release)
Query engine
Spark
Please describe the bug 🐞
Hello
We are ingesting files into Azure via Iceberg and Spark. We have disabled the Hadoop filesystem cache via the config fs.abfs.impl.disable.cache=true. During ingestion, the job got stuck with the error below:
[Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] ERROR org.apache.hadoop.util.BlockingThreadPoolExecutorService - Could not submit task to executor java.util.concurrent.ThreadPoolExecutor@24ad434d[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
The debug logs show that finalize() was called on AzureBlobFileSystem (indicating that the filesystem instance was garbage collected), after which Parquet attempted to write using the finalized filesystem:
18:18:20.127 [Finalizer] DEBUG org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem - finalize() called.
18:18:20.128 [Finalizer] DEBUG org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore - Gracefully shutting down executor service BlockingThreadPoolExecutorService{SemaphoredDelegatingExecutor{permitCount=48, available=48, waiting=0}, activeCount=0}. Waiting max 30 SECONDS
18:18:20.209 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 4: write data pages
18:18:20.209 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 4: write data pages content
18:18:20.210 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 55: end column
18:18:20.213 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 55: write data pages
18:18:20.213 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 55: write data pages content
18:18:20.213 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 100: end column
18:18:20.219 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 100: end block
18:18:20.220 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 100: column indexes
18:18:20.231 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 148: offset indexes
18:18:20.234 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 171: bloom filters
18:18:20.234 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 171: end
18:18:20.418 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 641: footer length = 470
18:18:20.421 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] ERROR org.apache.hadoop.util.BlockingThreadPoolExecutorService - Could not submit task to executor java.util.concurrent.ThreadPoolExecutor@24ad434d[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
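The sequence in the logs above can be modelled without any Hadoop or Iceberg classes. The sketch below is only a simplified illustration of what we suspect is happening, not the actual implementation: OwnerFs and OwnerStream are hypothetical stand-ins, and it assumes that the output stream keeps a reference to the filesystem's thread pool but not to the filesystem itself, so the filesystem can become unreachable and be finalized while the stream is still open. An explicit close() stands in for finalize().

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

public class FsLifecycleSketch {

    /** Hypothetical stand-in for a filesystem that owns an upload thread pool. */
    static final class OwnerFs implements AutoCloseable {
        private final ExecutorService pool = Executors.newFixedThreadPool(2);

        /** Hands out a stream that shares the pool but does not reference this object. */
        OwnerStream create() {
            return new OwnerStream(pool);
        }

        /** Models what finalize() would trigger: the shared pool is shut down. */
        @Override
        public void close() {
            pool.shutdownNow();
        }
    }

    /** Hypothetical stand-in for an output stream that submits uploads to the shared pool. */
    static final class OwnerStream {
        private final ExecutorService pool;

        OwnerStream(ExecutorService pool) {
            this.pool = pool;
        }

        /** Returns false when the pool has already been shut down. */
        boolean write(Runnable upload) {
            try {
                pool.submit(upload);
                return true;
            } catch (RejectedExecutionException e) {
                return false;
            }
        }
    }

    /** The owner is closed (as if finalized) while the stream is still in use. */
    static boolean writeAfterOwnerClosed() {
        OwnerFs fs = new OwnerFs();
        OwnerStream out = fs.create();
        fs.close();
        return out.write(() -> { });
    }

    /** The owner is kept alive for the whole lifetime of the stream. */
    static boolean writeWhileOwnerOpen() {
        try (OwnerFs fs = new OwnerFs()) {
            return fs.create().write(() -> { });
        }
    }

    public static void main(String[] args) {
        System.out.println("write after owner closed accepted: " + writeAfterOwnerClosed());
        System.out.println("write while owner open accepted: " + writeWhileOwnerOpen());
    }
}
```

If this model matches the real behaviour, whichever component opens the stream would also need to keep the filesystem instance strongly reachable until the stream is closed.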
Minimal code example to reproduce the issue. To reproduce the error we had to trigger GC frequently, so this example has to be run with the JVM args -Xmx480m -XX:G1ReservePercent=50:
var spark = SparkSession.builder()
    .master("local[*]")
    .appName("test")
    .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.spark_catalog.type", "hadoop")
    .config("spark.sql.catalog.spark_catalog.warehouse", "abfs://<container>@<storage>.dfs.core.windows.net")
    // setup rest of azure secrets for the storage account
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.log.level", "debug")
    // disable the Hadoop FileSystem cache for abfs:// paths
    .config("fs.abfs.impl.disable.cache", "true")
    .getOrCreate();
spark.sql("CREATE TABLE test_table_x (id LONG, data STRING) " +
    "USING iceberg " +
    "TBLPROPERTIES (" +
    " 'write.format.default'='parquet'" +
    ")");
// repeated small inserts, so that many data files are written under GC pressure
for (int i = 0; i < 20; i++) {
    spark.sql("INSERT INTO test_table_x VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (5, 'e'), (6, 'f'), (7, 'g'), (8, 'h'), (9, 'i'), (10, 'j')");
}
The dependencies used are listed below:
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>3.5.8</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <version>3.5.8</version>
    </dependency>
    <dependency>
        <groupId>org.apache.iceberg</groupId>
        <artifactId>iceberg-spark-runtime-3.5_2.12</artifactId>
        <version>1.11.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-azure</artifactId>
        <version>3.3.6</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>3.3.6</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client-api</artifactId>
        <version>3.3.6</version>
    </dependency>
</dependencies>
Java version: 17.0.18
Willingness to contribute