Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

importccl: enabling saving of bad csv rows to a side file #41430

Merged
merged 1 commit into from
Oct 14, 2019

Conversation

spaskob
Copy link
Contributor

@spaskob spaskob commented Oct 8, 2019

importccl: enabling saving of bad csv rows to a side file

When importing from large csv files it's common to have a few offending rows.
Currently IMPORT will abort at the first error which makes for a tedious
task of manually fixing ech problem and re-running. Instead users can specify
new option WITH experimental_save_rejected which will not stop on bad rows
but save them in a side file called <original_csv_file>.rejected and
continue. The user then can re-run the import command using IMPORT INTO using
the rejected file after fixing the problems in it.

Touches #40374.

Release note (sql change): enable skipping of faulty rows in IMPORT.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@spaskob spaskob requested a review from dt October 8, 2019 15:08
Copy link
Member

@dt dt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 8 of 11 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt and @spaskob)


pkg/ccl/importccl/import_stmt.go, line 292 at r1 (raw file):

			// is passed in all import formats not just DELIMITED..
			if _, ok := opts[importOptionSaveRejected]; ok {
				format.MysqlOut.SaveRejected = true

this doesn't feel like a per-format option to me? I guess the level at which we can reject depends on the format, and support will arrive in formats one-by-one, but the option feels more general to me and should be either at the job level or at least at the top level in format options.


pkg/ccl/importccl/import_stmt_test.go, line 240 at r1 (raw file):

		// MySQL OUTFILE
		// If err field is non-empty, the query filed specifies what expect
		// to get from the rows that are parsed correctly (see option experimental_save_rejected).

I think we should also have a field that is there content of the rejected row file?


pkg/ccl/importccl/read_import_base.go, line 133 at r1 (raw file):

				rejected = make(chan string)
			}
			grp := ctxgroup.WithContext(ctx)

tiny nit, but I might move the if rejected != nil out here and only spin up the group / use the the group to call fileFunc in the non-nil branch, keeping the direct call to fileFunc in normal, non-rejecting execution, just to avoid the extra goroutine(s) unless we need them.


pkg/ccl/importccl/read_import_base.go, line 140 at r1 (raw file):

				suffix := ".rejected"
				if es.Conf().Provider == roachpb.ExportStorageProvider_Http {
					suffix = "/rejected"

Why is this case different?


pkg/ccl/importccl/read_import_base.go, line 142 at r1 (raw file):

					suffix = "/rejected"
				}
				conf, err := storageccl.ExportStorageConfFromURI(dataFile + suffix)

I don't think you can blindly append to the string URI since almost all of them use query parameters to pass options, like keys or settings.

I think you need to do net/url.Parse(), append to the url.Path field, and then serialize the result to a string and pass that to makeStorage, so that trailing parameters are correctly preserved.


pkg/ccl/importccl/read_import_base.go, line 146 at r1 (raw file):

					return err
				}
				rejectedStorage, err := storageccl.MakeExportStorage(ctx, conf, settings)

I think we should wait to open this file until the first rejected row shows up in the loop below, so that we don't open (and create) it at all if all the rows process without rejection.


pkg/roachpb/io-formats.proto, line 77 at r1 (raw file):

  optional int32 escape = 6 [(gogoproto.nullable) = false];
  // If true, don't abort on failures but instead save the offending row and keep on.
  optional bool save_rejected = 7 [(gogoproto.nullable) = false];

ditto comment above.

Copy link
Contributor Author

@spaskob spaskob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt)


pkg/ccl/importccl/import_stmt.go, line 292 at r1 (raw file):

Previously, dt (David Taylor) wrote…

this doesn't feel like a per-format option to me? I guess the level at which we can reject depends on the format, and support will arrive in formats one-by-one, but the option feels more general to me and should be either at the job level or at least at the top level in format options.

agreed, notice the TODO just above that addresses this. Since this will be fairly big refactor are you ok with keeping it in DELIMITED at first and once we nail DELIMITED I will refactor it out and start implementing for other modes.


pkg/ccl/importccl/import_stmt_test.go, line 240 at r1 (raw file):

Previously, dt (David Taylor) wrote…

I think we should also have a field that is there content of the rejected row file?

yes, this is tbd until i figure out the threading through the http test server


pkg/ccl/importccl/read_import_base.go, line 133 at r1 (raw file):

Previously, dt (David Taylor) wrote…

tiny nit, but I might move the if rejected != nil out here and only spin up the group / use the the group to call fileFunc in the non-nil branch, keeping the direct call to fileFunc in normal, non-rejecting execution, just to avoid the extra goroutine(s) unless we need them.

done


pkg/ccl/importccl/read_import_base.go, line 140 at r1 (raw file):

Previously, dt (David Taylor) wrote…

Why is this case different?

http uri is 192.0.0.1:<port#> and you append .rejected to it becomes invalid http request
this is still WIP, hacking it like this for now to have unit tests working


pkg/ccl/importccl/read_import_base.go, line 142 at r1 (raw file):

Previously, dt (David Taylor) wrote…

I don't think you can blindly append to the string URI since almost all of them use query parameters to pass options, like keys or settings.

I think you need to do net/url.Parse(), append to the url.Path field, and then serialize the result to a string and pass that to makeStorage, so that trailing parameters are correctly preserved.

thanks, TBD


pkg/ccl/importccl/read_import_base.go, line 146 at r1 (raw file):

Previously, dt (David Taylor) wrote…

I think we should wait to open this file until the first rejected row shows up in the loop below, so that we don't open (and create) it at all if all the rows process without rejection.

good point, done


pkg/roachpb/io-formats.proto, line 77 at r1 (raw file):

Previously, dt (David Taylor) wrote…

ditto comment above.

ack

@spaskob spaskob force-pushed the csv-save-rejected branch 3 times, most recently from 828da27 to 75f30cd Compare October 12, 2019 02:26
Copy link
Contributor Author

@spaskob spaskob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt)


pkg/ccl/importccl/import_stmt_test.go, line 240 at r1 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

yes, this is tbd until i figure out the threading through the http test server

done


pkg/ccl/importccl/read_import_base.go, line 146 at r1 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

good point, done

Done.

@spaskob
Copy link
Contributor Author

spaskob commented Oct 12, 2019

PTAL

Copy link
Member

@dt dt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 7 of 11 files at r2, 1 of 3 files at r3, 3 of 3 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @spaskob)


pkg/ccl/importccl/read_import_base.go, line 115 at r4 (raw file):

						atFirstLine = false
					}
					if atFirstLine {

nit: I think we could just use len(buf) > 0

Copy link
Contributor Author

@spaskob spaskob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt)


pkg/ccl/importccl/read_import_base.go, line 115 at r4 (raw file):

Previously, dt (David Taylor) wrote…

nit: I think we could just use len(buf) > 0

you are right, I initially thought that buf can be empty but that there's still an error but in that case it still does not make sense to write the reject file anyway.

When importing from large csv files it's common to have a few offending rows.
Currently `IMPORT` will abort at the first error which makes for a tedious
task of manually fixing ech problem and re-running. Instead users can specify
new option `WITH experimental_save_rejected` which will not stop on bad rows
but save them in a side file called `<original_csv_file>.rejected` and
continue. The user then can re-run the import command using `IMPORT INTO` using
the rejected file after fixing the problems in it.

Release note (sql change): enable skipping of faulty rows in IMPORT.
@spaskob
Copy link
Contributor Author

spaskob commented Oct 14, 2019

bors r+

craig bot pushed a commit that referenced this pull request Oct 14, 2019
41430: importccl: enabling saving of bad csv rows to a side file r=spaskob a=spaskob

importccl: enabling saving of bad csv rows to a side file

When importing from large csv files it's common to have a few offending rows.
Currently `IMPORT` will abort at the first error which makes for a tedious
task of manually fixing ech problem and re-running. Instead users can specify
new option `WITH experimental_save_rejected` which will not stop on bad rows
but save them in a side file called `<original_csv_file>.rejected` and
continue. The user then can re-run the import command using `IMPORT INTO` using
the rejected file after fixing the problems in it.

Touches #40374.

Release note (sql change): enable skipping of faulty rows in IMPORT.

Co-authored-by: Spas Bojanov <spas@cockroachlabs.com>
@craig
Copy link
Contributor

craig bot commented Oct 14, 2019

Build succeeded

@craig craig bot merged commit 97788de into cockroachdb:master Oct 14, 2019
@spaskob spaskob deleted the csv-save-rejected branch November 1, 2019 01:55
spaskob pushed a commit to spaskob/cockroach that referenced this pull request Nov 12, 2019
There was severe error introduced by cockroachdb#41430 that causes the original
IMPORT file to be overwritten by the rejected file.

Release note (bug fix): fix bug when using ‘experimental_save_rejected’
that would cause the rejected row file to overwrite the original input
file.
craig bot pushed a commit that referenced this pull request Nov 12, 2019
42398: importccl: fix rejected file overwriting the original file r=spaskob a=spaskob

There was severe error introduced by #41430 that causes the original
IMPORT file to be overwritten by the rejected file.

Release note (bug fix): Release note (bug fix): fix bug when using ‘experimental_save_rejected’ that would cause the rejected row file to overwrite the original input file.

Co-authored-by: Spas Bojanov <spas@cockroachlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants