sql: fix COPY bugs for binary/csv formats #69066
Conversation
Force-pushed from 0ef1e72 to ec6c129.
pkg/sql/copy.go
Outdated
@@ -387,6 +387,15 @@ func (c *copyMachine) readTextData(ctx context.Context, final bool) (brk bool, err error) {
}

func (c *copyMachine) readCSVData(ctx context.Context, final bool) (brk bool, err error) {
	// Don't try to parse the CSV until we read all the data. This is because
should we look at improving this in future / make an issue?
as @steven-sheehy says this behavior isn't that good, and it might even be better to leave the CSV parsing as-is instead of doing this workaround
i'll give a real fix another shot.
pkg/sql/copy.go
Outdated
@@ -424,6 +424,17 @@ func (c *copyMachine) readCSVTuple(ctx context.Context, record []string) error {
			exprs[i] = tree.DNull
			continue
		}
		switch t := c.resultColumns[i].Typ; t.Family() {
		case types.BytesFamily,
where did you get this list from? how do we know it's complete?
(it's weird to me that e.g. Timestamp is here but Time is not)
the list comes from what COPY FROM ... TEXT is already doing:
Lines 660 to 670 in 61fdb08
switch t := c.resultColumns[i].Typ; t.Family() {
case types.BytesFamily,
	types.DateFamily,
	types.IntervalFamily,
	types.INetFamily,
	types.StringFamily,
	types.TimestampFamily,
	types.TimestampTZFamily,
	types.UuidFamily:
	s = decodeCopy(s)
}
it seems old, wouldn't be surprised if it has the wrong types in it
i can extract that into its own function
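For illustration, a minimal sketch of what that extraction might look like (the helper name `decodeIfEscapedFamily` and its exact signature are hypothetical, not taken from this PR):

```
// decodeIfEscapedFamily is a hypothetical extraction of the switch quoted
// above: it applies COPY TEXT escape decoding only to the type families
// that arrive as escaped text, and leaves other values untouched.
func decodeIfEscapedFamily(t *types.T, s string) string {
	switch t.Family() {
	case types.BytesFamily,
		types.DateFamily,
		types.IntervalFamily,
		types.INetFamily,
		types.StringFamily,
		types.TimestampFamily,
		types.TimestampTZFamily,
		types.UuidFamily:
		return decodeCopy(s)
	}
	return s
}
```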
hmmm actually this is wrong. I tested in PG, and COPY CSV does not do `\xdd` or `\ddd` escaping. Left a comment on #68804 explaining why that issue is happening.
pkg/sql/pgwire/testdata/pgtest/copy
Outdated
{"Type":"ReadyForQuery","TxStatus":"I"} | ||
|
||
# Verify that byte-escape sequences are processed in CSV mode. | ||
# `MixceDU0` is base64 for `2,\x54`. |
Does this support octal encoding as well, like PostgreSQL does? Something like `1,"\157\156\145"`. Would be good to add a test if so.
yes it should. adding a test
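As a standalone sanity check (editorial, not part of the PR), the encodings mentioned in this thread decode as follows in Go:

```
package main

import (
	"encoding/base64"
	"fmt"
	"strconv"
)

func main() {
	// The pgtest payload is base64-encoded raw CSV bytes.
	raw, err := base64.StdEncoding.DecodeString("MixceDU0")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s\n", raw) // prints: 2,\x54

	// `\x54` is bytea hex format for the single byte 0x54.
	b, _ := strconv.ParseUint("54", 16, 8)
	fmt.Printf("%c\n", b) // prints: T

	// `\157\156\145` is bytea escape format: three octal bytes.
	var decoded []byte
	for _, oct := range []string{"157", "156", "145"} {
		n, _ := strconv.ParseUint(oct, 8, 8)
		decoded = append(decoded, byte(n))
	}
	fmt.Println(string(decoded)) // prints: one
}
```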
pkg/sql/copy.go
Outdated
// we don't have a good way of knowing how many bytes to read for the next
// record ahead of time (without reimplementing all the CSV parsing logic
// here). This means the input buffer will keep growing -- the memory is
// accounted in processCopyData.
So what will happen if I want to upload a CSV file that's 10GB? It won't process a single row from it until all 10GB is uploaded to memory? If so, that just trades one problem for another.
It seems like instead the csv reader should be updated to only return complete records and put back incomplete records into the buffer if there's no newline or EOF after that incomplete record in the buffer yet.
I agree. I was thinking about how to do this and I think we'll need to fork the csv reader implementation from go. (we'd also need to do that anyway in order to support some of the other COPY options)
one added challenge is that the COPY CSV format requires a newline to appear after each record except for the last one. so if we encounter an EOF on an individual CopyData input buffer, it's hard to tell if that's the end of all the CSV input or if we need to check for another input buffer
tomorrow I'll take a look at how postgres does this and explore some other CSV parsers in go.
Would it help if I create a separate ticket just for partial records with COPY CSV with the info in my comment? That way the more complicated CSV reader refactoring doesn't block the nice improvements in this PR?
yes, sure another ticket would help. if i don't get to this soon then we can go ahead with this PR without the CSV buffering changes
I did find that Postgres solves this by buffering line-by-line, and taking into account whether it is in a quoted field or not: https://github.com/postgres/postgres/blob/9626325da5e8e23ff90091bc96535495d350f06e/src/backend/commands/copyfromparse.c#L1090-L1096
This is ready for another look -- I updated the CSV buffering logic so it searches for a non-escaped newline.
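A minimal sketch of that kind of scan, assuming the only state that matters is whether the position is inside a double-quoted CSV field (the function name is illustrative; the real logic lives in pkg/sql/copy.go):

```
// indexOfUnquotedNewline returns the offset of the first '\n' in buf that
// is not inside a double-quoted CSV field, or -1 if the buffer ends
// mid-record. A doubled quote ("") inside a quoted field toggles the
// state twice, so the field correctly remains quoted.
func indexOfUnquotedNewline(buf []byte) int {
	inQuotes := false
	for i, b := range buf {
		switch b {
		case '"':
			inQuotes = !inQuotes
		case '\n':
			if !inQuotes {
				return i
			}
		}
	}
	return -1
}
```

Everything up to and including that newline can be handed to the CSV parser as complete records, while the remainder stays buffered until the next CopyData message arrives.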
Fixes cockroachdb#63235 Release note (bug fix): When using COPY FROM .. BINARY, the correct format code will now be returned.
Fixes cockroachdb#68805 Release note (bug fix): Previously, COPY FROM ... BINARY would return an error if the input data was split across different messages. This is fixed now.
Release note (bug fix): Previously, COPY FROM ... CSV would require each CopyData message to be split at the boundary of a record. This is a bug since the COPY protocol allows messages to be split at arbitrary points. This is fixed now.
Fixes cockroachdb#68804 Release note (bug fix): COPY FROM ... CSV did not correctly handle octal byte escape sequences such as `\011` when using a BYTEA column. This is now fixed.
nice!
These are worth adding since we now have special logic for the CSV format for finding newlines. Release note: None
tftr! bors r=otan
Build succeeded.
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.

Backport to branch 21.1.x failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
69446: roachtest: ignore pgx COPY error test r=otan a=rafiss

resolves #69291

This is failing after #69066 got merged. `TestConnCopyFromFailServerSideMidwayAbortsWithoutWaiting` is checking that the COPY command errors out quickly if it tries to insert a null value into a non-nullable column. Previously this would pass against CRDB, since it would actually error out due to CRDB not handling partial batches of data (#68805). Now that #68805 is fixed, the pgx test gets a bit further. It fails this assertion:

```
endTime := time.Now()
copyTime := endTime.Sub(startTime)
if copyTime > time.Second {
	t.Errorf("Failing CopyFrom shouldn't have taken so long: %v", copyTime)
}
```

The COPY command does not fail immediately against CRDB because we have logic to delay inserting rows until we see 100 rows total: https://github.com/cockroachdb/cockroach/blob/8cae60f603ccc4d83137167b3b31cab09be9d41a/pkg/sql/copy.go#L370-L374

I don't think we should change our row batching behavior, so for now it makes sense to ignore this pgx test.

Release justification: test only change
Release note: None

Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
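The batching check described in that message has roughly this shape (identifiers are approximate; see the permalink above for the actual code):

```
// Rows accumulate in memory and are flushed to the table only once the
// batch fills up, so an error on an early row may not surface until the
// 100-row threshold is reached or the COPY input ends.
const copyBatchRowSize = 100

if len(c.rows) >= copyBatchRowSize {
	if err := c.processRows(ctx); err != nil {
		return err
	}
}
```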
69234: changefeedccl: signal changeAggregator shutdown from the kvfeed r=miretskiy a=stevendanna

During acceptance testing, we observed that changefeeds did not correctly restart on primary key changes and did not correctly stop when schema_change_policy was set to 'stop', when the changeFrontier and changeAggregator were running on different nodes (most production changefeeds).

The root cause of this was a bad assumption in the changeAggregator shutdown logic. Namely, we assumed that the changeAggregator (and kv feed) would see context cancellations as a result of the changeFrontier moving to draining. However, this is not guaranteed. When the changeFrontier moves to draining, all of its inputs will be drained. But a DrainRequest message is only sent to the input lazily, when the next message is received from that input. In the case of a schema change, the kv feed would stop sending messages to the changeAggregator, and thus no further messages would be sent to the changeFrontier and the drain request would not be triggered.

With this change, we now shut down the changeAggregator when the kvfeed indicates that no more messages will be returned.

Fixes #68791

Release note (enterprise change): Fixed a bug where CHANGEFEEDs would fail to correctly handle a primary key change.

69304: sql: no-op interleaved syntax for CREATE TABLE/INDEX r=fqazi a=fqazi

Fixes: #68344

Previously, CREATE TABLE/INDEX operations were completely blocked with an error once support was removed. This was inadequate because migrations may still use this syntax and we can't fully block it. To address this, this patch will make the interleaved syntax a no-op with a client warning.

Release justification: low risk and saner behaviour for customer migrations on this release.
Release note (sql change): Interleaved syntax for CREATE TABLE/INDEX is now a no-op, since support has been removed.

69439: storage/fs: remove RemoveDir function r=jbowens a=jbowens

Remove the RemoveDir function from the storage/fs.FS interface. It's a vestige from the RocksDB Env, and its Pebble implementation simply called the existing Remove function.

Release justification: non-production code changes
Release note: None

69446: roachtest: ignore pgx COPY error test r=otan a=rafiss (message quoted above)

Co-authored-by: Steven Danna <danna@cockroachlabs.com>
Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>
Co-authored-by: Jackson Owens <jackson@cockroachlabs.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
See individual commits:
sql: return correct format code for COPY
fixes #63235
sql: COPY BINARY messages can be split at arbitrary boundaries
fixes #68805
sql: COPY CSV handles arbitrary message boundaries
refs #68805
sql: fix byte escaping in COPY CSV input
fixes #68804
Release justification: Fixes for high-priority bugs in existing functionality. ("Another rule of thumb to consider: if a customer hit this bug after releasing, would we patch the release to fix it? If we would patch the release later to fix the bug, then usually it's best to fix the bug now.")