Flink - Data File completed and added to Manifest but data file NotFound in S3 location #5310

@abmo-x

We have seen multiple occurrences of this issue in our Flink pipeline, where an Iceberg manifest references a data file that was never uploaded to S3.

This could be related to #4168

logger_name:  org.apache.iceberg.aws.s3.S3OutputStream 
    message:  S3OutputStream initialized with S3_PATH_TO_FILE_THAT_DOESN'T_EXIST

2022-07-15T15:04:38.220+00:00
Iceberg.flink.sink.IcebergFilesCommitter
      message:  Start to flush snapshot state to state backend,

2022-07-15T15:04:08.447+00:00 
logger_name:  org.apache.iceberg.io.BaseTaskWriter 
Complete data file GenericDataFile{content=data, file_path=PATH_TO_FILE_THAT_DOESN'T_EXIST file_format=PARQUET, spec_id=0, partition=PartitionData{date=2022-07-15, hour=14}

We have seen this for 5-6 data files in various partitions. The one common pattern I found for all of these files is that a snapshot was triggered, which in turn triggered the data file to be completed. However, it looks like the file never gets uploaded, as I don't see any S3 PutObject or multipart upload requests for these data files.
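For anyone trying to confirm the same symptom, a minimal sketch of the check might look like the following. It assumes you have already collected the data file paths referenced by the manifest and have some way to test whether an object exists in S3. The function name `find_missing_data_files`, the `exists` callback, and the example paths are all hypothetical and not part of Iceberg or any AWS SDK.

```python
from typing import Callable, Iterable, List


def find_missing_data_files(
    manifest_paths: Iterable[str],
    exists: Callable[[str], bool],
) -> List[str]:
    """Return the manifest-referenced paths for which `exists` reports False.

    `exists` is a caller-supplied check (hypothetical here); in practice it
    could wrap an S3 HEAD request, but that wiring is left out on purpose.
    """
    return [path for path in manifest_paths if not exists(path)]


if __name__ == "__main__":
    # Made-up stand-in data: one file was uploaded, one was not.
    uploaded = {"s3://example-bucket/data/a.parquet"}
    referenced = [
        "s3://example-bucket/data/a.parquet",
        "s3://example-bucket/data/b.parquet",
    ]
    print(find_missing_data_files(referenced, uploaded.__contains__))
```

Passing the existence check as a callback keeps the sketch independent of any particular S3 client, so the same comparison can be run against a bucket listing or against per-object lookups.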
