Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Import from CSV not handling escaped quotes correctly #5977

Closed
2 tasks done
dkish-Ocient opened this issue Jan 23, 2023 · 4 comments
Closed
2 tasks done

Import from CSV not handling escaped quotes correctly #5977

dkish-Ocient opened this issue Jan 23, 2023 · 4 comments
Labels

Comments

@dkish-Ocient
Copy link

What happens?

Running something akin to

CREATE OR REPLACE TABLE people AS SELECT * FROM read_csv('data/people.csv', header=True, ignore_errors=True, columns={'ID': 'UUID', 'company_name': 'VARCHAR', 'company_linkedin_name': 'VARCHAR', 'last_title': 'VARCHAR', 'start_date': 'DATE', 'end_date': 'DATE'})

is producing the following error:

Invalid Input Error: Error in file "data/people.csv" on line 2548: quote should be followed by end of value, end of row or another quote. (DELIMITER=',' (default), QUOTE='"' (default), ESCAPE='"' (default), HEADER=1, SAMPLE_SIZE=20480, IGNORE_ERRORS=1, ALL_VARCHAR=0)

Line 2548 is:

78d9d8bd-d957-58ec-9d3e-e4ef42c7e8ce,""Tell Me"" Short film,n/a,Production,2011-01-01,2012-01-01

The CSV parser seems to not be able to handle the escaped quotes in ""Tell Me"", which is rather unfortunate behavior for real-world CSV parsing.

To Reproduce

uuid,company,cid,date_start,date_end
78d9d8bd-d957-58ec-9d3e-e4ef42c7e8ce,""Tell Me"" Short film,n/a,Production,2011-01-01,2012-01-01

should be enough to reproduce this problem with the above DDL.

OS:

MacOS

DuckDB Version:

0.6.1

DuckDB Client:

Python

Full Name:

Darius Russell Kish

Affiliation:

Ocient

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree
@Mytherin
Copy link
Collaborator

Thanks for the report!

The problem seems to be that the quotes are escaped but the value itself is unquoted. Adding quotes to the line makes DuckDB able to read it successfully:

uuid,company,cid,type,date_start,date_end
78d9d8bd-d957-58ec-9d3e-e4ef42c7e8ce,"""Tell Me"" Short film",n/a,Production,2011-01-01,2012-01-01
D select * from 'data/csv/issue5977.csv';
┌──────────────────────────────────────┬──────────────────────┬─────────┬────────────┬────────────┬────────────┐
│                 uuid                 │       company        │   cid   │    type    │ date_start │  date_end  │
│               varcharvarcharvarcharvarchardatedate    │
├──────────────────────────────────────┼──────────────────────┼─────────┼────────────┼────────────┼────────────┤
│ 78d9d8bd-d957-58ec-9d3e-e4ef42c7e8ce │ "Tell Me" Short film │ n/a     │ Production │ 2011-01-012012-01-01 │
└──────────────────────────────────────┴──────────────────────┴─────────┴────────────┴────────────┴────────────┘

Interestingly, without the quotes, Postgres seems to remove the quotes altogether:

myth=# COPY people FROM '/Users/myth/Programs/duckdb-bugfix/data/csv/issue5977.csv' (FORMAT CSV, HEADER, QUOTE '"', ESCAPE '"');
COPY 1
myth=# SELECT * FROM people;
                  id                  |    company_name    | company_linkedin_name | last_title | start_date |  end_date  
--------------------------------------+--------------------+-----------------------+------------+------------+------------
 78d9d8bd-d957-58ec-9d3e-e4ef42c7e8ce | Tell Me Short film | n/a                   | Production | 2011-01-01 | 2012-01-01
(1 row)

Pandas does not like the file at all:

>>> df = pd.read_csv('data/csv/issue5977.csv', sep=',', quotechar='"', escapechar='"')
pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at row 1

SQLite does not like it either:

sqlite> .import data/csv/issue5977.csv people --csv
data/csv/issue5977.csv:2: unescaped " character

I have not yet found a single CSV reader that reads the file correctly - all in all not an easy file to read. According to the CSV standard (RFC 4180) the file is also corrupt:

Fields that contain a special character (comma, CR, LF, or double quote), must be "escaped" by enclosing them in double quotes (Hex 22).

We could add support for escaping quotes outside of quotes but this creates a somewhat tough parsing situation and creates ambiguities, for example, what does this mean:

hello,"",world

Is that a single escaped quote, or an empty string that is quoted?

hello,"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""",42""

Is the first quote entering a quote, or escaping a quote? Do we need to keep track of the number of quotes, whether it is even or uneven, to determine if we are inside a quoted string?

We could make a best effort guess to try and fix this exact case - or do a retry if an error pops up - but again, this is not trivial, as is evidenced by all other CSV readers also failing to parse the file correctly.

@dkish-Ocient
Copy link
Author

Thanks for the really in-depth analysis. I have to apologize I bungled the header on the test case CSV I gave, it only has 5 values...

uuid,company,cid,title,date_start,date_end
78d9d8bd-d957-58ec-9d3e-e4ef42c7e8ce,""Tell Me"" Short film,n/a,Production,2011-01-01,2012-01-01
pandas.read_csv('data/people_test.csv')['company'].iloc[0]
'Tell Me"" Short film'

I agree that the CSV is definitely malformed in this case since pandas can't handle it, though Postgres seems to handle it the nicest however they manage that. Feel free to close

@github-actions
Copy link

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.

@github-actions github-actions bot added the stale label Jul 29, 2023
@github-actions
Copy link

This issue was closed because it has been stale for 30 days with no activity.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Aug 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants