Duplicated rows in Parquet files #163

@tomtaylor

What happens?

We run a nightly process that dumps some Postgres tables into Parquet files. Sometimes a handful of rows are duplicated in the output. In last night's run, 33 rows out of 5,396,400 had a duplicate copy. This might be unavoidable with PER_THREAD_OUTPUT enabled against a changing data set, but if so, it would be worth documenting.

Does it also mean some rows might be missing?
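
One way to check for missing rows would be to compare the exported ids against the source table. The query below is only a sketch: it assumes the files hold the export of pg.table1, that id uniquely identifies a row, and that the Postgres database is still attached as pg. Because the source keeps changing, rows inserted after the export ran will also show up here, so a non-empty result is a hint rather than proof.

```sql
-- Sketch: ids present in the Postgres source but absent from the exported files.
-- Assumes `id` uniquely identifies a row and that data_*.parquet is the table1 export.
SELECT t.id
FROM pg.table1 AS t
WHERE NOT EXISTS (
    SELECT 1
    FROM read_parquet('data_*.parquet') AS p
    WHERE p.id = t.id
);
```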

To Reproduce

I think this might be very difficult to reproduce, but our script looks something like this:

-- Install the nightly build of the Postgres scanner, plus httpfs for S3-style output.
FORCE INSTALL postgres_scanner FROM 'http://nightly-extensions.duckdb.org';
INSTALL httpfs;
LOAD httpfs;
-- Point the S3 API at Google Cloud Storage (credentials redacted).
SET s3_endpoint='storage.googleapis.com';
SET s3_access_key_id='id';
SET s3_secret_access_key='key';
ATTACH 'dbname=foo' AS pg (TYPE postgres);
USE pg;
-- Both exports run inside a single transaction.
BEGIN;
COPY (SELECT * FROM pg.table1) TO 's3://bucket/table1.parquet' (FORMAT 'parquet', CODEC 'ZSTD', PER_THREAD_OUTPUT true);
COPY (SELECT * FROM pg.table2) TO 's3://bucket/table2.parquet' (FORMAT 'parquet', CODEC 'ZSTD', PER_THREAD_OUTPUT true);
COMMIT;

Then:

SELECT id, filename, ROW_NUMBER() OVER (PARTITION BY id) FROM read_parquet('data_*.parquet', fil
┌──────────┬────────────────┬─────────────────────────────────────┐
│    id    │    filename    │ row_number() OVER (PARTITION BY id) │
│  int64   │    varchar     │                int64                │
├──────────┼────────────────┼─────────────────────────────────────┤
│ 60449480 │ data_4.parquet │                                   2 │
│ 60725890 │ data_4.parquet │                                   2 │
│ 61009724 │ data_4.parquet │                                   2 │
│ 60844642 │ data_0.parquet │                                   2 │
│ 53617707 │ data_4.parquet │                                   2 │
│ 60574594 │ data_4.parquet │                                   2 │
│ 56486342 │ data_4.parquet │                                   2 │
│ 60034575 │ data_4.parquet │                                   2 │
│ 60574565 │ data_0.parquet │                                   2 │
│ 60698777 │ data_3.parquet │                                   2 │
│ 61080027 │ data_4.parquet │                                   2 │
│ 60261247 │ data_4.parquet │                                   2 │
│ 61079630 │ data_0.parquet │                                   2 │
│ 60386713 │ data_4.parquet │                                   2 │
│ 60008204 │ data_2.parquet │                                   2 │
│ 60261152 │ data_4.parquet │                                   2 │
│ 60983239 │ data_4.parquet │                                   2 │
│ 61092457 │ data_3.parquet │                                   2 │
│ 60856837 │ data_1.parquet │                                   2 │
│ 59246489 │ data_0.parquet │                                   2 │
│ 60224537 │ data_3.parquet │                                   2 │
│ 60569503 │ data_4.parquet │                                   2 │
│ 60905359 │ data_0.parquet │                                   2 │
│ 60859433 │ data_4.parquet │                                   2 │
│ 60255325 │ data_4.parquet │                                   2 │
│ 60341075 │ data_0.parquet │                                   2 │
│ 60968139 │ data_0.parquet │                                   2 │
│ 60574631 │ data_0.parquet │                                   2 │
│ 60560326 │ data_2.parquet │                                   2 │
│ 60927674 │ data_3.parquet │                                   2 │
│ 61092552 │ data_4.parquet │                                   2 │
│ 60574652 │ data_0.parquet │                                   2 │
│ 61051974 │ data_1.parquet │                                   2 │
├──────────┴────────────────┴─────────────────────────────────────┤
│ 33 rows                                               3 columns │
└─────────────────────────────────────────────────────────────────┘
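
A more compact way to size the problem is to compare total and distinct counts. The sketch below assumes id is meant to be unique in the source table. Until the cause is understood, anyone reading these files could also drop the extra copies with QUALIFY, assuming either copy of a duplicated row is acceptable.

```sql
-- Total rows vs. distinct ids; the difference is the number of surplus copies.
SELECT count(*) AS total_rows,
       count(DISTINCT id) AS distinct_ids,
       count(*) - count(DISTINCT id) AS surplus_rows
FROM read_parquet('data_*.parquet');

-- Stopgap: keep one arbitrary copy per id when reading the export.
SELECT *
FROM read_parquet('data_*.parquet')
QUALIFY ROW_NUMBER() OVER (PARTITION BY id) = 1;
```

The second query only hides the symptom on the read side; it does not say anything about whether rows are missing.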

OS:

Linux

PostgreSQL Version:

14.7

DuckDB Version:

0.9.2

DuckDB Client:

CLI

Full Name:

Tom Taylor

Affiliation:

Breakroom

Have you tried this on the latest main branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree
