What happens?
We run a nightly process that dumps some Postgres tables into Parquet files. Sometimes a handful of rows are duplicated in the output; in last night's run, 33 rows out of 5,396,400 had a duplicate copy. This might be unavoidable when PER_THREAD_OUTPUT is enabled against a moving data set, but if so it might be worth documenting.
Does it also mean some rows might be missing?
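One way to check that (a sketch only, not something we have run against production): compare the source row count with the row count and distinct-id count of the per-thread Parquet files. The glob under the target path is an assumption about where the per-thread files land, and the check also assumes id is unique in Postgres; counts will drift if the table changes between the export and the check.
-- Sketch: look for missing rows as well as duplicates.
-- Assumes per-thread files are written under the COPY target path (adjust the glob).
SELECT
    (SELECT count(*) FROM pg.table1) AS pg_rows,
    (SELECT count(*) FROM read_parquet('s3://bucket/table1.parquet/data_*.parquet')) AS parquet_rows,
    (SELECT count(DISTINCT id) FROM read_parquet('s3://bucket/table1.parquet/data_*.parquet')) AS distinct_ids;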
To Reproduce
I think this might be very difficult to reproduce, but our script looks something like this:
FORCE INSTALL postgres_scanner FROM 'http://nightly-extensions.duckdb.org';
INSTALL httpfs;
LOAD httpfs;
SET s3_endpoint='storage.googleapis.com';
SET s3_access_key_id='id';
SET s3_secret_access_key='key';
ATTACH 'dbname=foo' AS pg (TYPE postgres);
USE pg;
BEGIN;
COPY (SELECT * FROM pg.table1) TO 's3://bucket/table1.parquet' (FORMAT 'parquet', CODEC 'ZSTD', PER_THREAD_OUTPUT true);
COPY (SELECT * FROM pg.table2) TO 's3://bucket/table2.parquet' (FORMAT 'parquet', CODEC 'ZSTD', PER_THREAD_OUTPUT true);
COMMIT;
Then:
SELECT id, filename, ROW_NUMBER() OVER (PARTITION BY id) FROM read_parquet('data_*.parquet', filename=true) QUALIFY ROW_NUMBER() OVER (PARTITION BY id) > 1;
┌──────────┬────────────────┬─────────────────────────────────────┐
│ id │ filename │ row_number() OVER (PARTITION BY id) │
│ int64 │ varchar │ int64 │
├──────────┼────────────────┼─────────────────────────────────────┤
│ 60449480 │ data_4.parquet │ 2 │
│ 60725890 │ data_4.parquet │ 2 │
│ 61009724 │ data_4.parquet │ 2 │
│ 60844642 │ data_0.parquet │ 2 │
│ 53617707 │ data_4.parquet │ 2 │
│ 60574594 │ data_4.parquet │ 2 │
│ 56486342 │ data_4.parquet │ 2 │
│ 60034575 │ data_4.parquet │ 2 │
│ 60574565 │ data_0.parquet │ 2 │
│ 60698777 │ data_3.parquet │ 2 │
│ 61080027 │ data_4.parquet │ 2 │
│ 60261247 │ data_4.parquet │ 2 │
│ 61079630 │ data_0.parquet │ 2 │
│ 60386713 │ data_4.parquet │ 2 │
│ 60008204 │ data_2.parquet │ 2 │
│ 60261152 │ data_4.parquet │ 2 │
│ 60983239 │ data_4.parquet │ 2 │
│ 61092457 │ data_3.parquet │ 2 │
│ 60856837 │ data_1.parquet │ 2 │
│ 59246489 │ data_0.parquet │ 2 │
│ 60224537 │ data_3.parquet │ 2 │
│ 60569503 │ data_4.parquet │ 2 │
│ 60905359 │ data_0.parquet │ 2 │
│ 60859433 │ data_4.parquet │ 2 │
│ 60255325 │ data_4.parquet │ 2 │
│ 60341075 │ data_0.parquet │ 2 │
│ 60968139 │ data_0.parquet │ 2 │
│ 60574631 │ data_0.parquet │ 2 │
│ 60560326 │ data_2.parquet │ 2 │
│ 60927674 │ data_3.parquet │ 2 │
│ 61092552 │ data_4.parquet │ 2 │
│ 60574652 │ data_0.parquet │ 2 │
│ 61051974 │ data_1.parquet │ 2 │
├──────────┴────────────────┴─────────────────────────────────────┤
│ 33 rows 3 columns │
└─────────────────────────────────────────────────────────────────┘
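For comparison, we could re-run the same export without PER_THREAD_OUTPUT to see whether the duplicates still appear when a single file is written (a sketch; the single-file target name here is made up):
-- Sketch: same export, single output file, to isolate whether the duplicates
-- are specific to the per-thread writer.
COPY (SELECT * FROM pg.table1) TO 's3://bucket/table1_single.parquet' (FORMAT 'parquet', CODEC 'ZSTD');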
OS:
Linux
PostgreSQL Version:
14.7
DuckDB Version:
0.9.2
DuckDB Client:
CLI
Full Name:
Tom Taylor
Affiliation:
Breakroom
Have you tried this on the latest main branch?
- I agree
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- I agree