Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

importccl: Inconsistent IMPORT CSV behavior related to quotation marks #39820

Closed
rolandcrosby opened this issue Aug 22, 2019 · 3 comments · Fixed by #40482

Comments

@rolandcrosby
Copy link
Collaborator

commented Aug 22, 2019

A forum user had trouble importing a TSV data set from IMDb - some lines in the input file containing quotation marks were being glued together. The user's specific issue can be reproduced in CockroachDB 19.2 as follows:

IMPORT TABLE title_basics (
	tconst STRING PRIMARY KEY,
	title_type STRING,
	primary_title STRING,
	original_title STRING,
	is_adult BOOL,
	start_year INT8,
	end_year INT8,
	runtime_minutes INT8,
	genres STRING
) CSV DATA ('https://datasets.imdbws.com/title.basics.tsv.gz')
WITH delimiter = e'\t', skip = '1', "nullif" = e'\\N';

select tconst, left(primary_title, 120) as left_of_primary_title from title_basics where tconst = 'tt0033122';

A more minimal reproduction shows that the issue arises when a column in a delimited file has a quoted string at the beginning of it:

col_a,col_b,col_c
1a,one b,one c
2a,"two b" is the string,two c
3a,three b,three c
4a,"four b",four c
5a,five "b",five c
6a,six b,six c
root@:26257/defaultdb> import table thing (a string primary key, b string, c string) csv data ('nodelocal:///thing.txt') with skip='1';
        job_id       |  status   | fraction_completed | rows | index_entries | system_records | bytes
+--------------------+-----------+--------------------+------+---------------+----------------+-------+
  479767674627817473 | succeeded |                  1 |    4 |             0 |              0 |   165
(1 row)

Time: 41.899ms

root@:26257/defaultdb> select * from thing;
  a  |             b              |   c
+----+----------------------------+--------+
  1a | one b                      | one c
  2a | two b" is the string,two c | four c
     | 3a,three b,three c         |
     | 4a,"four b                 |
  5a | five "b"                   | five c
  6a | six b                      | six c
(4 rows)

Time: 8.305ms

The input file here doesn't appear to treat double-quotes any differently than any other character. Our CSV parser has the following requirements, which this file doesn't meet:

If one of the following characters appears in a field, the field must be enclosed by double quotes:

  • delimiter (, by default)
  • double quote (")
  • newline (\n)
  • carriage return (\r)

If double quotes are used to enclose fields, then a double quote appearing inside a field must be escaped by preceding it with another double quote. For example: "aaa","b""bb","ccc"

Despite these requirements, we silently accept the input.

We could relax these requirements by adding a KV option to IMPORT ... CSV to allow the user to specify which character to use as the quote mark character (defaulting to "). Then, if the user specifies the empty string as the quote mark, we could suppress all logic around escaping. This would preclude the user from ever importing a column that contains the delimiter character, but for many import scenarios this is not a concern anyway.

@rolandcrosby rolandcrosby added this to Triage in Bulk I/O via automation Aug 22, 2019

@rolandcrosby

This comment has been minimized.

Copy link
Collaborator Author

commented Aug 22, 2019

Actually, in reading about the problem some more, I don't know that there's any value in allowing custom quote characters. The Go encoding/csv implementation is pretty specifically tailored to RFC 4180; five years ago, in response to other complaints about quote parsing issues, they said "Using encoding/csv to parse TSV doesn't really make sense". Maybe we just need to add a "dumber" delimited mode that splits at the delimiter, doesn't do any escaping, and doesn't treat quotes as special?

@spaskob spaskob self-assigned this Aug 28, 2019

@rolandcrosby

This comment has been minimized.

Copy link
Collaborator Author

commented Aug 29, 2019

Per discussion with @spaskob, we're not going to offer custom quotation characters in this fix. Instead, we'll add a new WITH verbatim option that just splits on the specified delimiter character (defaulting to ,) and preserves quotation marks exactly as they appeared in the input.

spaskob added a commit to spaskob/cockroach that referenced this issue Sep 3, 2019
importccl: force csv import fail on bad usaga of quotes in fields
Before this change we will incorrectly import fields like `abcxyz`.
The reason was setting `LazyQuotes = true` when parsing csv rows.

Tested: checked that the test runs correctly locally. Note that
the test is currently disabled because of failure on TC.

Touches: cockroachdb#39820.

Release note: quotes are not accepted in an unquoted csv field.
spaskob added a commit to spaskob/cockroach that referenced this issue Sep 3, 2019
importccl: force csv import to fail on bad usage of quotes in csv fields
Before this change we will incorrectly import fields like `abc"xy"z`.
The reason was setting `LazyQuotes = true` when parsing csv rows.
Now the client will return an error on an attempt to import a csv file
containing such fields.

Touches: cockroachdb#39820.

Release note: quotes are no longer accepted inside an unquoted csv field.
spaskob added a commit to spaskob/cockroach that referenced this issue Sep 3, 2019
importccl: force csv import to fail on bad usage of quotes in csv fields
Before this change we will incorrectly import fields like `abc"xy"z`.
The reason was setting `LazyQuotes = true` when parsing csv rows.
Now the client will return an error on an attempt to import a csv file
containing such fields.

Touches: cockroachdb#39820.

Release note (backward-incompatible change): quotes no longer accepted inside unquoted csv field.
spaskob added a commit to spaskob/cockroach that referenced this issue Sep 4, 2019
importccl: force csv import to fail on bad usage of quotes in csv fields
Before this change we will incorrectly import fields like `abc"xy"z`.
The reason was setting `LazyQuotes = true` when parsing csv rows.
Now the client will return an error on an attempt to import a csv file
containing such fields.

Touches: cockroachdb#39820.

Release note (backward-incompatible change): quotes no longer accepted inside unquoted csv field.
spaskob added a commit to spaskob/cockroach that referenced this issue Sep 4, 2019
importccl: force csv import to fail on bad usage of quotes in csv fields
Before this change we will incorrectly import fields like `abc"xy"z`.
The reason was setting `LazyQuotes = true` when parsing csv rows.
Now the client will return an error on an attempt to import a csv file
containing such fields.

Touches: cockroachdb#39820.

Release note (backward-incompatible change): quotes no longer accepted inside unquoted csv field.
spaskob added a commit to spaskob/cockroach that referenced this issue Sep 4, 2019
importccl: force csv import to fail on bad usage of quotes in csv fields
Before this change we will incorrectly import fields like `abc"xy"z`.
The reason was setting `LazyQuotes = true` when parsing csv rows.
Now the client will return an error on an attempt to import a csv file
containing such fields.

Touches: cockroachdb#39820.

Release note (backward-incompatible change): quotes no longer accepted inside unquoted csv field.
spaskob added a commit to spaskob/cockroach that referenced this issue Sep 4, 2019
importccl: force csv import to fail on bad usage of quotes in csv fields
Before this change we will incorrectly import fields like `abc"xy"z`.
The reason was setting `LazyQuotes = true` when parsing csv rows.
Now the client will return an error on an attempt to import a csv file
containing such fields.

Touches: cockroachdb#39820.

Release note (backward-incompatible change): quotes no longer accepted inside unquoted csv field.

@spaskob spaskob moved this from Triage to Current Milestone (Sept 16) [Stabilize] in Bulk I/O Sep 4, 2019

spaskob added a commit to spaskob/cockroach that referenced this issue Sep 4, 2019
importccl: replace the undocumented data format MYSQLOUTFILE with DEL…
…IMITED

MYSQLOUTFILE was originally used to help a client import data that was
not in csv proper format. This turns out to be useful for other users so
we call the new format DELIMITED. It's very fast and simple format for importing
delimited data disregarding issues with quoting. For example the csv format forbids
`field1,fieldsth2` as such fields that contain  quotes have to be enclosed in
quotes themselves. The whole format is fully described here
https://dev.mysql.com/doc/refman/8.0/en/load-data.html.

Fixes cockroachdb#39820.

Release note (cli change): add a new IMPORT DATA format DELIMITED.
craig bot pushed a commit that referenced this issue Sep 4, 2019
Merge #40424
40424: importccl: force csv import to fail on bad usage of quotes in csv fields r=spaskob a=spaskob

Before this change we will incorrectly import fields like `abc"xy"z`.
The reason was setting `LazyQuotes = true` when parsing csv rows.
Now the client will return an error on an attempt to import a csv file
containing such fields.

Touches: #39820.

Release note: quotes are no longer accepted inside an unquoted csv field.

Co-authored-by: Spas Bojanov <spas@cockroachlabs.com>
spaskob added a commit to spaskob/cockroach that referenced this issue Sep 4, 2019
importccl: replace the undocumented data format MYSQLOUTFILE with DEL…
…IMITED

MYSQLOUTFILE was originally used to help a client import data that was
not in csv proper format. This turns out to be useful for other users so
we call the new format DELIMITED. It's very fast and simple format for importing
delimited data disregarding issues with quoting. For example the csv format forbids
`field1,fieldsth2` as such fields that contain  quotes have to be enclosed in
quotes themselves. The whole format is fully described here
https://dev.mysql.com/doc/refman/8.0/en/load-data.html.

Fixes cockroachdb#39820.

Release note (cli change): add a new IMPORT DATA format DELIMITED.
@spaskob

This comment has been minimized.

Copy link
Collaborator

commented Sep 4, 2019

Recap: the customer problem can be solved by using DELIMITED data import vs CSV:

IMPORT TABLE title_basics (
  tconst text primary key, title_type string, primary_title string, original_title string, is_adult int,
  start_year int, end_year int, runtime_minutes int, genres text
) DELIMITED DATA ('nodelocal:///title.basics.tsv') WITH fields_escaped_by ='\';

SELECT COUNT(*) FROM title_basics;
   count
+---------+
  6137393
craig bot pushed a commit that referenced this issue Sep 4, 2019
Merge #40469 #40471 #40482
40469: opt: fix panic when building indirection exprs r=justinj a=justinj

Previously, we would ask a built array datum to tell us its type when
building an IndirectionExpr. This was problematic when the datum was
NULL, since while the optimizer-level NULLs track their inferred type,
built DNulls do not, so we ended up not knowing the element type.

This is fixed by grabbing the type from the opt expression, rather than
the built datum.

Fixes #40404.
Fixes #37794.

Release note (bug fix): fixed an optimizer panic when building array
access expressions.

40471: exec: add projections of AndExpr r=jordanlewis a=jordanlewis

This adds support for TPCH q19, which sees a 3x speedup with vectorized.

Release note: None

40482: importccl: replace the undocumented data format MYSQLOUTFILE with DELIMITED r=spaskob a=spaskob

MYSQLOUTFILE was originally used to help a client import data that was
not in csv proper format. This turns out to be useful for other users so
we call the new format DELIMITED. It's very fast and simple format for importing
delimited data disregarding issues with quoting. For example the csv format forbids
`field1,fieldsth2` as such fields that contain  quotes have to be enclosed in
quotes themselves. The whole format is fully described here
https://dev.mysql.com/doc/refman/8.0/en/load-data.html.

Fixes #39820.

Release note (cli change): add a new IMPORT DATA format DELIMITED.

Co-authored-by: Justin Jaffray <justin@cockroachlabs.com>
Co-authored-by: Jordan Lewis <jordanthelewis@gmail.com>
Co-authored-by: Spas Bojanov <spas@cockroachlabs.com>

@craig craig bot closed this in 3c3cece Sep 4, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
3 participants
You can’t perform that action at this time.