[Spark] Parquet ingestion to Azure gets stuck when the Hadoop FS cache is disabled #16640

@palladium-coder

Apache Iceberg version

1.11.0 (latest release)

Query engine

Spark

Please describe the bug 🐞

Hello

We are ingesting files into Azure using Iceberg and Spark, with the Hadoop filesystem cache disabled via the config fs.abfs.impl.disable.cache=true. During ingestion, the job gets stuck with the error below:

[Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] ERROR org.apache.hadoop.util.BlockingThreadPoolExecutorService - Could not submit task to executor java.util.concurrent.ThreadPoolExecutor@24ad434d[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
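For context on the setting above: with fs.<scheme>.impl.disable.cache=true, Hadoop's FileSystem.get returns a new instance on every call instead of a shared cached one, so nothing long-lived necessarily holds a strong reference to each instance. Below is a minimal sketch of that difference in plain Java. SketchFs and the map-based cache are hypothetical stand-ins, not Hadoop classes, and the real cache key is more than just the URI.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class FsCacheSketch {
    // Hypothetical stand-in for a filesystem instance (not a Hadoop class).
    static final class SketchFs {
        final String uri;
        SketchFs(String uri) { this.uri = uri; }
    }

    // Simplified cache keyed by URI only; an assumption made for illustration.
    private static final Map<String, SketchFs> CACHE = new ConcurrentHashMap<>();

    // With the cache enabled, callers share one strongly referenced instance.
    // With it disabled, each call creates an instance that only the caller references.
    static SketchFs get(String uri, boolean disableCache) {
        if (disableCache) {
            return new SketchFs(uri);
        }
        return CACHE.computeIfAbsent(uri, SketchFs::new);
    }

    public static void main(String[] args) {
        String uri = "abfs://container@account";
        System.out.println("cache enabled, same instance: " + (get(uri, false) == get(uri, false)));
        System.out.println("cache disabled, same instance: " + (get(uri, true) == get(uri, true)));
    }
}
```

In the uncached case, an instance becomes eligible for garbage collection as soon as its caller drops the reference, even if objects created from it are still in use.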

The debug logs show that finalize() was called on AzureBlobFileSystem (indicating that the filesystem instance had become eligible for garbage collection), after which Parquet attempted to write through the finalized filesystem:

18:18:20.127 [Finalizer] DEBUG org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem - finalize() called.
18:18:20.128 [Finalizer] DEBUG org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore - Gracefully shutting down executor service BlockingThreadPoolExecutorService{SemaphoredDelegatingExecutor{permitCount=48, available=48, waiting=0}, activeCount=0}. Waiting max 30 SECONDS
18:18:20.209 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 4: write data pages
18:18:20.209 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 4: write data pages content
18:18:20.210 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 55: end column
18:18:20.213 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 55: write data pages
18:18:20.213 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 55: write data pages content
18:18:20.213 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 100: end column
18:18:20.219 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 100: end block
18:18:20.220 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 100: column indexes
18:18:20.231 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 148: offset indexes
18:18:20.234 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 171: bloom filters
18:18:20.234 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 171: end
18:18:20.418 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 641: footer length = 470
18:18:20.421 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] ERROR org.apache.hadoop.util.BlockingThreadPoolExecutorService - Could not submit task to executor java.util.concurrent.ThreadPoolExecutor@24ad434d[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
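My reading of these logs, which is an assumption on my side and not confirmed against the hadoop-azure source, is that the output stream keeps using the thread pool owned by the filesystem without keeping the filesystem itself reachable. Below is a minimal plain-Java sketch of that pattern. OwnerFs and SketchStream are hypothetical stand-ins, not Hadoop or Iceberg classes, and an explicit close() stands in for finalize().

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

public class ClosedOwnerSketch {
    // Hypothetical stand-in for a filesystem that owns a thread pool.
    static final class OwnerFs implements AutoCloseable {
        private final ExecutorService pool = Executors.newFixedThreadPool(2);

        // The stream receives the pool but holds no reference back to the owner.
        SketchStream create() {
            return new SketchStream(pool);
        }

        @Override
        public void close() {
            pool.shutdownNow();
        }
    }

    // Hypothetical stand-in for an output stream that uploads via the owner's pool.
    static final class SketchStream {
        private final ExecutorService pool;

        SketchStream(ExecutorService pool) {
            this.pool = pool;
        }

        // Returns true if the pool accepted and ran the task, false if it was rejected.
        boolean write(Runnable upload) {
            try {
                pool.submit(upload).get();
                return true;
            } catch (RejectedExecutionException e) {
                return false;
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
    }

    // Writes once before and once after the owner is closed.
    static boolean[] run() {
        OwnerFs fs = new OwnerFs();
        SketchStream out = fs.create();
        boolean before = out.write(() -> { });
        fs.close(); // stands in for finalize() closing the filesystem
        boolean after = out.write(() -> { });
        return new boolean[] {before, after};
    }

    public static void main(String[] args) {
        boolean[] result = run();
        System.out.println("write before close accepted: " + result[0]);
        System.out.println("write after close accepted: " + result[1]);
    }
}
```

The sketch only shows the rejected submission. It does not reproduce the hang we see in the actual job.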

Below is a minimal code example that reproduces the error. Reproducing it required triggering the GC frequently, so the example has to be run with the JVM args -Xmx480m -XX:G1ReservePercent=50.

var spark = SparkSession.builder()
        .master("local[*]")
        .appName("test")
        .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.spark_catalog.type", "hadoop")
        .config("spark.sql.catalog.spark_catalog.warehouse", "abfs://<container>@<storage>.dfs.core.windows.net")
        // setup rest of azure secrets for the storage account
        .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .config("spark.log.level", "debug")
        .config("fs.abfs.impl.disable.cache", "true")
        .getOrCreate();

// Create an Iceberg table that writes Parquet data files
spark.sql("CREATE TABLE test_table_x (id LONG, data STRING) " +
        "USING iceberg " +
        "TBLPROPERTIES (" +
        "  'write.format.default'='parquet'" +
        ")");
// Insert repeatedly so that a GC cycle is likely to run during one of the writes
for (int i = 0; i < 20; i++) {
    spark.sql("INSERT INTO test_table_x VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (5, 'e'), (6, 'f'), (7, 'g'), (8, 'h'), (9, 'i'), (10, 'j')");
}

The dependencies used are listed below:

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>3.5.8</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <version>3.5.8</version>
    </dependency>
    <dependency>
        <groupId>org.apache.iceberg</groupId>
        <artifactId>iceberg-spark-runtime-3.5_2.12</artifactId>
        <version>1.11.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-azure</artifactId>
        <version>3.3.6</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>3.3.6</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client-api</artifactId>
        <version>3.3.6</version>
    </dependency>
</dependencies>

Java version: 17.0.18

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time

Labels: bug