We have seen multiple occurrences of this issue in our Flink pipeline, where an Iceberg manifest references a data file that was never uploaded to S3.
This could be related to #4168.
logger_name: org.apache.iceberg.aws.s3.S3OutputStream
message: S3OutputStream initialized with S3_PATH_TO_FILE_THAT_DOESN'T_EXIST
2022-07-15T15:04:38.220+00:00
logger_name: org.apache.iceberg.flink.sink.IcebergFilesCommitter
message: Start to flush snapshot state to state backend,
2022-07-15T15:04:08.447+00:00
logger_name: org.apache.iceberg.io.BaseTaskWriter
Complete data file GenericDataFile{content=data, file_path=PATH_TO_FILE_THAT_DOESN'T_EXIST file_format=PARQUET, spec_id=0, partition=PartitionData{date=2022-07-15, hour=14}
We have seen this for 5-6 data files in various partitions. The one common pattern I found across all of these files is that a snapshot was triggered, which in turn caused the data file to be completed. However, it looks like the file never gets uploaded: I don't see any S3 PutObject or multipart upload requests for these data files.
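
For anyone who wants to check whether their table is affected, here is a rough sketch of how the dangling references could be detected. It is only an illustration, not something taken from Iceberg itself: `split_s3_uri` and `find_missing_files` are hypothetical helper names, the list of paths is assumed to come from the table's metadata (for example the `file_path` column of the `files` metadata table), and `exists` is assumed to be a callable you supply that wraps an S3 HeadObject request.

```python
from typing import Callable, Iterable, List, Tuple


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/key' (also s3a:// or s3n://) into (bucket, key)."""
    scheme, sep, rest = uri.partition("://")
    bucket, _, key = rest.partition("/")
    if sep == "" or scheme not in ("s3", "s3a", "s3n") or not bucket:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return bucket, key


def find_missing_files(
    paths: Iterable[str],
    exists: Callable[[str, str], bool],
) -> List[str]:
    """Return the data file paths for which exists(bucket, key) is False.

    `exists` is a hypothetical callback; in practice it would issue a
    HeadObject request against S3 and report whether the object was found.
    """
    missing = []
    for path in paths:
        bucket, key = split_s3_uri(path)
        if not exists(bucket, key):
            missing.append(path)
    return missing
```

With an in-memory stand-in for S3, usage would look like `find_missing_files(paths, lambda bucket, key: (bucket, key) in uploaded)`, where `uploaded` is a set of `(bucket, key)` pairs. Passing the existence check in as a callable keeps the sketch independent of any particular AWS client.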