regression: read_csv() uses quote of \0 if no quote char seen in sample_size rows #11838

Closed
2 tasks done
NickCrews opened this issue Apr 25, 2024 · 0 comments · Fixed by #11880

NickCrews commented Apr 25, 2024

What happens?

read_csv() claims to use a default quote character of ". But starting in 0.10.1, if no quote character is found within the first sample_size rows, it actually defaults to \0 (i.e., no quoting). This is still present in the nightly build.

I found this when trying to read the CSV from https://doi.org/10.7910/DVN/JXPREB: the first 482808 lines have no quote characters, but the 482809th one does. In 0.10.0, the quote char would correctly get sniffed as ". In 0.10.1+, the quote char incorrectly gets sniffed as \0. I can work around this by explicitly supplying the arg quote='"', but I shouldn't have to do this.
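To make the consequence concrete, here is a minimal sketch of what happens to a row like `"x,y",1` when quoting is on versus off. It uses Python's standard `csv` module purely as a stand-in. It is not DuckDB's parser, and `csv.QUOTE_NONE` is only an analogy for a sniffed quote of \0.

```python
import csv
import io

# A tiny file whose last row has a quoted field containing the delimiter.
data = 'col1,col2\na,0\n"x,y",1\n'

# Parse with '"' as the quote character (the documented default).
quoted = list(csv.reader(io.StringIO(data), quotechar='"'))

# Parse with quoting disabled, analogous to a quote character of \0.
unquoted = list(csv.reader(io.StringIO(data), quoting=csv.QUOTE_NONE))

print(quoted[-1])    # ['x,y', '1']
print(unquoted[-1])  # ['"x', 'y"', '1']
```

With quoting disabled, the comma inside the quotes is treated as a delimiter, so the row no longer matches the two-column header.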

To Reproduce

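import duckdb
import tempfile

with tempfile.NamedTemporaryFile(mode="w") as f:
    f.write("col1,col2\n")
    # Plain rows with no quote characters; the count decides the sniffed quote.
    # for _ in range(20_478):  # detected as "
    for _ in range(20_479):  # detected as \0
        f.write("a,0\n")
    # The only quoted row, written after all the plain rows.
    f.write('"x,y",1\n')
    f.flush()
    con = duckdb.connect(":memory:")
    # print() instead of display() so this also runs outside a notebook.
    print(con.sql(f"""FROM sniff_csv('{f.name}')"""))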
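The threshold in the reproduction is consistent with a sniffer that only inspects a fixed number of leading lines. The sketch below is a toy model of that idea, not DuckDB's actual detection logic: `sniff_quote` and `make_lines` are hypothetical helpers, and the sample size of 20,480 lines is an assumption chosen because it matches the 20,478 / 20,479 boundary above.

```python
from itertools import islice


def sniff_quote(lines, sample_size):
    """Return '"' if one appears in the first `sample_size` lines, else None."""
    for line in islice(lines, sample_size):
        if '"' in line:
            return '"'
    return None  # stands in for the "\0, no quoting" outcome


def make_lines(n_plain):
    """Yield a header, `n_plain` unquoted rows, then one quoted row."""
    yield "col1,col2"
    for _ in range(n_plain):
        yield "a,0"
    yield '"x,y",1'


SAMPLE_SIZE = 20_480  # assumed sample size, in lines

# Quoted row is line 20,480: still inside the sample.
print(sniff_quote(make_lines(20_478), SAMPLE_SIZE))
# Quoted row is line 20,481: just outside the sample.
print(sniff_quote(make_lines(20_479), SAMPLE_SIZE))
```

Under this model, a quote character that first appears after the sampled lines is never seen, so the only safe fallback is the documented default rather than "no quoting".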

OS:

macos M1

DuckDB Version:

nightly

DuckDB Client:

python

Full Name:

Nick Crews

Affiliation:

Ship Creek Group

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a nightly build

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have