File is overwritten with each sink operation #31

Open
pedro-muniz opened this issue Nov 23, 2023 · 7 comments
Comments


pedro-muniz commented Nov 23, 2023

When using target-s3 with more than 10k records, we've noticed that the only way to get the expected output is to set "append_date_to_prefix_grain": "microsecond". This is because each sink operation overwrites the generated file instead of appending to it, so if the tap sends more than 10k records to the target within a short time frame, some data may be lost when the file is rewritten. Could we consider changing the file mode from "w" to "a"? Is there a particular reason for using write mode? If so, it might be necessary to create a new file for each sink operation to prevent data loss.

The target-s3-parquet uses append mode to write data.
https://github.com/gupy-io/target-s3-parquet/blob/main/target_s3_parquet/sinks.py#L83C15-L83C25
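
For illustration, here is a rough sketch of why the microsecond grain works around the overwrite (the key layout and helper below are made up for this example, not taken from target-s3's code):

from datetime import datetime, timezone

def object_key(prefix: str, stream: str, grain_fmt: str) -> str:
    # the grain decides how much of the timestamp ends up in the key;
    # with a coarse grain, two sink flushes within the same second map to
    # the same key and the second write replaces the first object
    stamp = datetime.now(timezone.utc).strftime(grain_fmt)
    return f"{prefix}/{stream}/{stamp}.json"

# coarse grain: both flushes collide on one key, so data is lost
object_key("s3://my-bucket/raw", "users", "%Y%m%d%H%M%S")
# microsecond grain: each flush gets its own key, so nothing is overwritten
object_key("s3://my-bucket/raw", "users", "%Y%m%d%H%M%S%f")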

crowemi (Owner) commented Nov 23, 2023

@pedro-muniz -- this sounds great. Can you make this change (or make it configurable) and submit a PR?

pedro-muniz (Author) commented Nov 24, 2023

The smart_open library does not support append mode for S3. I'll try to work on this.
https://github.com/piskvorky/smart_open/blob/2894d20048bd8ee56c0a89060413eb8041603c30/smart_open/s3.py#L260C22-L260C30

https://github.com/crowemi/target-s3/blob/main/target_s3/formats/format_base.py#L10
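
For context, a minimal sketch of the limitation (the bucket and key are placeholders, and the exact error message may differ between smart_open versions):

import smart_open

# write mode works, but each open() replaces the existing S3 object
with smart_open.open("s3://my-bucket/prefix/data.json", "w") as f:
    f.write('{"id": 1}\n')

# append mode is not implemented for the S3 transport, since S3 objects
# are immutable; this raises NotImplementedError (see the line linked above)
smart_open.open("s3://my-bucket/prefix/data.json", "a")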

rstml (Contributor) commented Nov 25, 2023

Bumped into the very same behaviour with Parquet files, which have a different implementation.

The workaround could be to keep the writer open and keep appending to it, e.g.:

import pyarrow.parquet as pq

# this should be initialised somewhere outside, e.g. once per stream,
# using the schema of the first batch
pqwriter = pq.ParquetWriter('sample.parquet', table.schema)

def _write(self, contents: str = None) -> None:
    # append each batch (a pyarrow Table or RecordBatch) to the open
    # writer instead of recreating the file on every sink operation
    pqwriter.write(records)

# cleanup once the stream is finished
pqwriter.close()

However, each set of records may have a different schema, and I'm not sure how to overcome this.

rstml (Contributor) commented Nov 25, 2023

I managed to resolve my issue described above by increasing the batch size and age. Please see #32 for details.

In my case, I use hourly batching and a minute grain for the filename. This combination solves the overwriting problem for me.

pedro-muniz (Author) commented

> I managed to resolve my issue described above by increasing the batch size and age. Please see #32 for details.
>
> In my case, I use hourly batching and a minute grain for the filename. This combination solves the overwriting problem for me.

I think those are good parameters to have control over, but they don't resolve this issue. For example, setting the granularity to microseconds already works around it in most cases without those variables.

S3 objects don't support an append operation, so the best solution is to create a new file for each sink operation; IMHO this behavior shouldn't depend on a particular combination of parameters.

What do you think?
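
A sketch of that idea (all names below are illustrative, not target-s3's actual key-building code): give every sink flush its own object key, independent of the prefix grain, for example with a per-flush unique suffix.

import uuid
from datetime import datetime, timezone

def key_for_flush(prefix: str, stream: str) -> str:
    # the timestamp keeps keys roughly sortable; the uuid guarantees
    # uniqueness even if two flushes happen within the same microsecond
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    return f"{prefix}/{stream}/{stamp}-{uuid.uuid4().hex}.json"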

ShahBinoy commented

Another thing to add: when dealing with a large number of files, S3 I/O shows very good performance at an object size of around 100 MB. That size offers a good blend of I/O latency and record count, so rolling over to a new file based on size would also be a good option to add.
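
A rough sketch of size-based rolling (the buffer and flush callback are assumptions for illustration, not part of target-s3 today):

ROLL_SIZE_BYTES = 100 * 1024 * 1024  # roll to a new object around ~100 MB

class RollingBuffer:
    def __init__(self) -> None:
        self.part = 0
        self.buffer = bytearray()

    def add(self, chunk: bytes, flush) -> None:
        # accumulate serialized records; once the threshold is crossed,
        # hand the buffer to the flush callback (e.g. an S3 upload) and
        # start a new part
        self.buffer.extend(chunk)
        if len(self.buffer) >= ROLL_SIZE_BYTES:
            flush(bytes(self.buffer), self.part)
            self.buffer.clear()
            self.part += 1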

rstml (Contributor) commented Nov 30, 2023

Indeed, for Parquet files AWS recommends roughly 250 MB per file. However, I didn't see any built-in mechanism in Meltano to flush based on byte size. Moreover, with compression enabled, estimating output size from a byte-size limit is just as hard as from a row-count limit.
