Reproducible data corruption on simple insert #5488

Closed
2 tasks done
jaens opened this issue Nov 25, 2022 · 1 comment · Fixed by #5519
@jaens

jaens commented Nov 25, 2022

What happens?

Inserting a simple CSV that contains only string columns into a table deterministically corrupts the data: some values are replaced with empty strings. This happens when DuckDB chooses dictionary compression for a segment.

(I discovered this by running select * from pragma_storage_info('urls'); and looking at the segment that holds the corrupted data.)

After executing e.g. SET force_compression = 'FSST'; and rebuilding the database from the same source CSV with the same query, the corruption disappears.

To Reproduce

The CSV & SQL file (content warning: contains random URLs from the Internet):
https://gist.github.com/jaens/0be7a28adeec547e520ffcdc6dfc8d85

SELECT COUNT(*) c FROM urls WHERE hostname = '';
    c = 16768
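
To confirm that the empty strings come from the load and not from the source data, the count above can be compared against the CSV itself. A minimal sketch using only the Python standard library, assuming the CSV has a header row with a `hostname` column; the function name `count_empty_values` and the local file name `urls.csv` are placeholders of mine, not names from the gist:

```python
import csv


def count_empty_values(lines, column="hostname"):
    """Count data rows whose `column` field is an empty string.

    `lines` is any iterable of CSV lines (an open file works); the first
    line is assumed to be a header row that names `column`.
    """
    reader = csv.DictReader(lines)
    return sum(1 for row in reader if row[column] == "")


# Usage (hypothetical local copy of the CSV from the gist):
#
#     with open("urls.csv", newline="", encoding="utf-8") as f:
#         print(count_empty_values(f))
```

If this prints a number lower than the `c` returned by the query, the extra empty strings were introduced while loading the table.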

OS:

Linux x64

DuckDB Version:

0.6.0 & master 9479be7

DuckDB Client:

CLI

Full Name:

Jaen

Affiliation:

none

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree
@hannes
Member

hannes commented Nov 25, 2022

Maybe @samansmink can have a look?

@samansmink samansmink self-assigned this Nov 25, 2022
samansmink added a commit to samansmink/duckdb that referenced this issue Nov 28, 2022
@Mytherin Mytherin linked a pull request Nov 29, 2022 that will close this issue
Mytherin added a commit that referenced this issue Nov 29, 2022