Import from CSV not handling escaped quotes correctly #5977
Comments
Thanks for the report! The problem seems to be that the quotes are escaped but the value itself is unquoted. Adding quotes to the line makes DuckDB able to read it successfully:
D select * from 'data/csv/issue5977.csv';
┌──────────────────────────────────────┬──────────────────────┬─────────┬────────────┬────────────┬────────────┐
│ uuid │ company │ cid │ type │ date_start │ date_end │
│ varchar │ varchar │ varchar │ varchar │ date │ date │
├──────────────────────────────────────┼──────────────────────┼─────────┼────────────┼────────────┼────────────┤
│ 78d9d8bd-d957-58ec-9d3e-e4ef42c7e8ce │ "Tell Me" Short film │ n/a │ Production │ 2011-01-01 │ 2012-01-01 │
└──────────────────────────────────────┴──────────────────────┴─────────┴────────────┴────────────┴────────────┘
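For reference, here is a minimal sketch of what the quoted form of that row looks like when produced by a standard writer. It uses Python's built-in `csv` module rather than DuckDB, and the values are copied from the query result above; the exact layout of the original file is an assumption, since the raw line is not shown in this thread.

```python
import csv
import io

# Values copied from the query result above (the raw file line is assumed).
row = [
    "78d9d8bd-d957-58ec-9d3e-e4ef42c7e8ce",
    '"Tell Me" Short film',
    "n/a",
    "Production",
    "2011-01-01",
    "2012-01-01",
]

buf = io.StringIO()
# Default dialect: QUOTE_MINIMAL with doublequote=True, so a field that
# contains a quote is wrapped in quotes and its inner quotes are doubled.
csv.writer(buf).writerow(row)
line = buf.getvalue().rstrip("\r\n")
print(line)

# Reading the line back recovers the original values.
parsed = next(csv.reader([line]))
print(parsed == row)
```

The company field comes out as `"""Tell Me"" Short film"`: one pair of quotes wrapping the value, plus a doubled quote for each literal quote inside it.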
Interestingly, without the quotes, Postgres seems to remove the quotes altogether:
Pandas does not like the file at all:
>>> df = pd.read_csv('data/csv/issue5977.csv', sep=',', quotechar='"', escapechar='"')
pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at row 1
SQLite does not like it either:
I have not yet found a single CSV reader that reads the file correctly - all in all not an easy file to read. According to the CSV standard (RFC 4180) the file is also corrupt:
We could add support for escaped quotes in unquoted values, but this makes parsing considerably harder and creates ambiguities. For example, what does this mean:
Is that a single escaped quote, or an empty string that is quoted?
Is the first quote entering a quote, or escaping a quote? Do we need to keep track of the number of quotes, whether it is even or uneven, to determine if we are inside a quoted string? We could make a best effort guess to try and fix this exact case - or do a retry if an error pops up - but again, this is not trivial, as is evidenced by all other CSV readers also failing to parse the file correctly.
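To make the ambiguity concrete, here is a small sketch using Python's built-in `csv` module (not DuckDB's parser). The three-column lines and the `parse` helper are hypothetical stand-ins for the real file. It shows that a bare `""` is read as a quoted empty string, that the fully quoted form round-trips, and that the unquoted form is either misread (lenient mode) or rejected (`strict=True`).

```python
import csv

def parse(line, **kwargs):
    """Parse a single CSV line with Python's csv module (hypothetical helper)."""
    return next(csv.reader([line], **kwargs))

# Two quotes on their own are a quoted empty string, not an escaped quote.
empty = parse('a,"",b')

# Well-formed: the whole field is quoted and the inner quotes are doubled.
quoted = parse('a,"""Tell Me"" Short film",b')

# Malformed: doubled quotes, but the field itself is not wrapped in quotes.
# The lenient default accepts the line but does not recover the intended value.
lenient = parse('a,""Tell Me"" Short film,b')

# With strict=True the same line is rejected outright.
try:
    parse('a,""Tell Me"" Short film,b', strict=True)
    strict_error = None
except csv.Error as exc:
    strict_error = exc

print(empty)
print(quoted)
print(lenient)
print(strict_error)
```

The parser commits to "quoted field" as soon as it sees the first quote, so the second quote is taken as the end of an empty quoted value and everything after it is treated as trailing text. Guessing the writer's intent would require some form of lookahead or retry.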
Thanks for the really in-depth analysis. I have to apologize: I bungled the header on the test case CSV I gave; it only has 5 values...
I agree that the CSV is definitely malformed in this case, since pandas can't handle it either, though Postgres seems to handle it the most gracefully, however they manage that. Feel free to close.
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.
This issue was closed because it has been stale for 30 days with no activity.
What happens?
Running something akin to
is producing the following error:
Line 2548 is:
The CSV parser seems to not be able to handle the escaped quotes in ""Tell Me"", which is rather unfortunate behavior for real-world CSV parsing.
To Reproduce
should be enough to reproduce this problem with the above DDL.
OS:
MacOS
DuckDB Version:
0.6.1
DuckDB Client:
Python
Full Name:
Darius Russell Kish
Affiliation:
Ocient
Have you tried this on the latest master branch?
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?