Compressed GZIP generated from a read_csv with an order by not matching original CSV #3714

Closed
pdet opened this issue May 25, 2022 · 3 comments

@pdet
Member

pdet commented May 25, 2022

statement ok
PRAGMA enable_verification

statement ok
CREATE TABLE csv_data AS (SELECT * FROM 'test/sql/copy/csv/data/real/voter.tsv' order by 1,37,64);

statement ok
COPY csv_data TO '__TEST_DIR__/voter.tsv.gz' (COMPRESSION GZIP);

statement ok
CREATE TABLE csv_data_gz AS (SELECT * FROM '__TEST_DIR__/voter.tsv.gz');

query I
SELECT COUNT(*) FROM (SELECT * FROM csv_data_gz EXCEPT SELECT * FROM csv_data)
----
0

statement ok
DROP TABLE csv_data;

statement ok
DROP TABLE csv_data_gz;
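
A note on the check itself: `EXCEPT` is set-based and the query above only looks for rows that are in `csv_data_gz` but not in `csv_data`, so rows lost during the round trip would not show up in that count. The sketch below illustrates the difference between a one-directional and a two-directional check. It uses Python's built-in `sqlite3` purely as a stand-in for DuckDB, and the tables `original` and `reloaded`, their contents, and the helper `diff_counts` are all made up for illustration.

```python
import sqlite3


def diff_counts(conn, left, right):
    """Return (rows only in left, rows only in right) using set-based EXCEPT."""
    only_left = conn.execute(
        f"SELECT COUNT(*) FROM (SELECT * FROM {left} EXCEPT SELECT * FROM {right})"
    ).fetchone()[0]
    only_right = conn.execute(
        f"SELECT COUNT(*) FROM (SELECT * FROM {right} EXCEPT SELECT * FROM {left})"
    ).fetchone()[0]
    return only_left, only_right


# Hypothetical data: the reloaded table is missing one row of the original.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE original (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE reloaded (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO original VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")]
)
conn.executemany("INSERT INTO reloaded VALUES (?, ?)", [(1, "a"), (2, "b")])

# The first number is what a one-directional check (reloaded EXCEPT original)
# would report; the second is the reverse direction.
only_reloaded, only_original = diff_counts(conn, "reloaded", "original")
print(only_reloaded, only_original)
```

In this toy case the one-directional count is 0 even though a row is missing, which is why checking both directions (or comparing row counts as well) may give a more reliable round-trip test.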
pdet added the bug label May 25, 2022
pdet added a commit to JimStam/duckdb that referenced this issue May 25, 2022
…king too much space due to the generated row groups on sorting
@Tishj
Contributor

Tishj commented Jul 11, 2022

The number of columns in the ORDER BY clause doesn't seem to have anything to do with it.

The issue also triggers with:
13,64
37,64
The order of appearance does not matter either: 64,13 and 37,64 make no difference.
The type of the columns doesn't seem to matter either.
Looking at the PRAGMA information about the table:

│ cid │ name         │ type    │ notnull │ dflt_value │ pk    │

13:
│ 12  │ name_sufx_cd │ VARCHAR │ false   │            │ false │
37:
│ 36  │ area_cd      │ INTEGER │ false   │            │ false │

Ordering on any column in the range 52 to 67, in combination with 13/43, breaks.

Ordering on these column ids always breaks: 7, 37, 67
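
To narrow this down further, the combinations could be swept mechanically. Below is a rough sketch of a generator that emits one round-trip test block per ORDER BY column list, reusing the statements from the original report. `make_test` and `COMBINATIONS` are hypothetical names, and the output is only text meant to be pasted into a sqllogictest file after a single `PRAGMA enable_verification` statement.

```python
# Statements copied from the original report; only the ORDER BY list varies.
TEMPLATE = """\
statement ok
CREATE TABLE csv_data AS (SELECT * FROM 'test/sql/copy/csv/data/real/voter.tsv' ORDER BY {cols});

statement ok
COPY csv_data TO '__TEST_DIR__/voter.tsv.gz' (COMPRESSION GZIP);

statement ok
CREATE TABLE csv_data_gz AS (SELECT * FROM '__TEST_DIR__/voter.tsv.gz');

query I
SELECT COUNT(*) FROM (SELECT * FROM csv_data_gz EXCEPT SELECT * FROM csv_data)
----
0

statement ok
DROP TABLE csv_data;

statement ok
DROP TABLE csv_data_gz;
"""

# Column lists mentioned in this thread (positions as used in ORDER BY).
COMBINATIONS = [(1, 37, 64), (13, 64), (37, 64), (64, 13), (7,), (37,), (67,)]


def make_test(columns):
    """Render one round-trip test block for the given ORDER BY positions."""
    return TEMPLATE.format(cols=",".join(str(c) for c in columns))


if __name__ == "__main__":
    # Print all blocks, separated by a blank line.
    print("\n".join(make_test(cols) for cols in COMBINATIONS))
```

Extending `COMBINATIONS` with, say, every pair drawn from columns 52 to 67 would be one way to check whether the pattern described above holds across the whole range.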

@github-actions

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.

github-actions bot added the stale label Jul 30, 2023
@github-actions

This issue was closed because it has been stale for 30 days with no activity.

github-actions bot closed this as not planned Aug 30, 2023