Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

importccl: Added row limit option for importing DELIMITED, AVRO formats #56135

Merged
merged 1 commit into from Nov 4, 2020

Conversation

mokaixu
Copy link
Contributor

@mokaixu mokaixu commented Oct 29, 2020

A row limit option allows users to specify a fixed number of rows
that they want to import. This functionality is only available for
CSV formats.

With this code change, users can limit the number of rows they want
to import for DELIMITED and AVRO data forms. That is, they can
specify: `WITH row_limit="{$num}"` to only import $num rows.

Release note (sql change): Added WITH row_limit="{$num}" option for importing
DELIMITED/AVRO data to allow users to do a quick test run on an import of $num rows.
Ex. IMPORT TABLE test ... DELIMITED/AVRO DATA ... WITH row_limit="3";

DIFF BASE on PR: #56080

@mokaixu mokaixu requested review from a team and miretskiy and removed request for a team October 29, 2020 23:01
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@mokaixu mokaixu removed the request for review from miretskiy October 29, 2020 23:01
@mokaixu mokaixu added the do-not-merge bors won't merge a PR with this label. label Oct 29, 2020
Copy link
Contributor

@miretskiy miretskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You seem to have 2 commits with the same title... Not sure if you need to squash them, or perhaps update one of the commits w/ correct description.

Nice job! I have mostly nits. Giving you a big :lgtm:
Feel free to ping me if you have any questions re my minor requests.

Reviewed 2 of 6 files at r1, 6 of 6 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @mokaixu)


pkg/ccl/importccl/import_stmt.go, line 456 at r2 (raw file):

				rowLimit, err := strconv.Atoi(override)
				if err != nil {
					return pgerror.Wrapf(err, pgcode.Syntax, "invalid %s value", csvRowLimit)

perhaps invalid numeric value %s to make it clear we expect a number there?


pkg/ccl/importccl/import_stmt.go, line 522 at r2 (raw file):

				format.SaveRejected = true
			}
			// TODO(mokaixu): rename csvRowLimit to rowLimit

Maybe just rename it?
Or keep it as csvRowLimit -- that's fine, I think (and drop this todo). We already use csv named vars in other formats.
Or... if you feel strongly, move this todo on top of this file and make it generic to all csv* options that we use in other formats (e.g. csvNullIf, etc)


pkg/ccl/importccl/import_stmt.go, line 913 at r2 (raw file):

Quoted 10 lines of code…
	if override, ok := opts[csvRowLimit]; ok {
		rowLimit, err := strconv.Atoi(override)
		if err != nil {
			return pgerror.Wrapf(err, pgcode.Syntax, "invalid %s value", csvRowLimit)
		}
		if rowLimit <= 0 {
			return pgerror.Newf(pgcode.Syntax, "%s must be > 0", csvRowLimit)
		}
		format.Avro.RowLimit = uint32(rowLimit)
	}

Perhaps let's move this into a helper

  func setNumericOption(opt string, opts map[string]string, int32* val) error {
     ....
  }

pkg/ccl/importccl/import_stmt_test.go, line 344 at r2 (raw file):

			typ:    "CSV",
			data:   "a string, b string\nfoo,normal\nbar,baz\nchocolate,cake\n",
			err:    "pq: row_limit must be > 0",

would row limit of 0 be useful? Perhaps I just want to create the table and check it worked?


pkg/ccl/importccl/import_stmt_test.go, line 376 at r2 (raw file):

			typ:    "CSV",
			data:   "a string, b string\nfoo,normal\nbar,baz\nchocolate,cake\n",
			query:  map[string][][]string{`SELECT * from t`: {{"foo", "normal"}, {"bar", "baz"}, {"chocolate", "cake"}}},

do we want test case w/ skip > max rows?
skip < 0?


pkg/ccl/importccl/import_stmt_test.go, line 1443 at r2 (raw file):

	// Set up a directory for the data files.
	err := os.Mkdir(filepath.Join(baseDir, "test"), 0777)
	require.NoError(t, err)

you don't need this. Your baseDir already points to a temporary directory.


pkg/ccl/importccl/import_stmt_test.go, line 1450 at r2 (raw file):

			// Create temporary file for test data.
			err := ioutil.WriteFile(filepath.Join(baseDir, "test", "data"), []byte(test.data), 0666)

You probably should create a different file for each test. And you probably don't need to create subdirectoreis (filepath.Join) since basedir is a temp directory.


pkg/ccl/importccl/read_import_base.go, line 411 at r2 (raw file):

	skip     int64       // Number of records to skip
	rejected chan string // Channel for reporting corrupt "rows"
	rowLimit uint32      // Number of records to process.

Just noticed. Let's not use uint (here and throughout). Use unsigned types only when representing bit patters. Check for invalid values (like you already do when checking for negative offset and return errors). See below.


pkg/ccl/importccl/read_import_base.go, line 555 at r2 (raw file):

			// Stop when we have processed row limit number of rows.
			numRowsToProcess := count - numSkipped
			if fileCtx.rowLimit != 0 && numRowsToProcess > int64(fileCtx.rowLimit) {

This is why rowLimit should be a signed quantity. Not that your code is effected (for now), but it's because signed extensions and conversions are rather subtle, language dependent and a source of bugs. See https://play.golang.org/p/O8_ERDumpvu

Copy link
Contributor Author

@mokaixu mokaixu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @miretskiy and @mokaixu)


pkg/ccl/importccl/import_stmt.go, line 456 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

perhaps invalid numeric value %s to make it clear we expect a number there?

done


pkg/ccl/importccl/import_stmt.go, line 522 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

Maybe just rename it?
Or keep it as csvRowLimit -- that's fine, I think (and drop this todo). We already use csv named vars in other formats.
Or... if you feel strongly, move this todo on top of this file and make it generic to all csv* options that we use in other formats (e.g. csvNullIf, etc)

ack. i'll keep it as csvRowLimit for the CSV only PR and will address this change in the next PR that covers AVRO & DELIMITED


pkg/ccl/importccl/import_stmt.go, line 913 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…
	if override, ok := opts[csvRowLimit]; ok {
		rowLimit, err := strconv.Atoi(override)
		if err != nil {
			return pgerror.Wrapf(err, pgcode.Syntax, "invalid %s value", csvRowLimit)
		}
		if rowLimit <= 0 {
			return pgerror.Newf(pgcode.Syntax, "%s must be > 0", csvRowLimit)
		}
		format.Avro.RowLimit = uint32(rowLimit)
	}

Perhaps let's move this into a helper

  func setNumericOption(opt string, opts map[string]string, int32* val) error {
     ....
  }

I refactored this for rowLimit, but will this apply to skip as well?


pkg/ccl/importccl/import_stmt_test.go, line 344 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

would row limit of 0 be useful? Perhaps I just want to create the table and check it worked?

Not sure how frequently this would come up, but right now 0 is used as a sentinel value, so I'm enforcing row_limit to be strictly greater than 0.


pkg/ccl/importccl/import_stmt_test.go, line 1443 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

you don't need this. Your baseDir already points to a temporary directory.

ack


pkg/ccl/importccl/import_stmt_test.go, line 1450 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

You probably should create a different file for each test. And you probably don't need to create subdirectoreis (filepath.Join) since basedir is a temp directory.

pinged you about this one!


pkg/ccl/importccl/read_import_base.go, line 411 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

Just noticed. Let's not use uint (here and throughout). Use unsigned types only when representing bit patters. Check for invalid values (like you already do when checking for negative offset and return errors). See below.

done


pkg/ccl/importccl/read_import_base.go, line 555 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

This is why rowLimit should be a signed quantity. Not that your code is effected (for now), but it's because signed extensions and conversions are rather subtle, language dependent and a source of bugs. See https://play.golang.org/p/O8_ERDumpvu

makes sense, thanks for the resource :)

Copy link
Contributor Author

@mokaixu mokaixu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @miretskiy and @mokaixu)


pkg/ccl/importccl/import_stmt.go, line 913 at r2 (raw file):

Previously, mokaixu (Monica Xu) wrote…

I refactored this for rowLimit, but will this apply to skip as well?

Following up, I reverted this code refactor because I feel like the error that could be thrown might be different for other numeric options. Wasn't sure how to cleanly account for this in a helper

Copy link
Contributor Author

@mokaixu mokaixu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @miretskiy)


pkg/ccl/importccl/import_stmt_test.go, line 376 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

do we want test case w/ skip > max rows?
skip < 0?

this is already tested in TestImportData
name: "skip -1 lines"
name: "skip > all lines"

@mokaixu mokaixu force-pushed the project6 branch 2 times, most recently from 3c37508 to 86781d3 Compare November 2, 2020 21:41
@mokaixu mokaixu changed the title [WIP] importccl: Added row limit option for importing DELIMITED, AVRO formats importccl: Added row limit option for importing DELIMITED, AVRO formats Nov 2, 2020
@mokaixu mokaixu removed the do-not-merge bors won't merge a PR with this label. label Nov 2, 2020
A row limit option allows users to specify a fixed number of rows
that they want to import. This functionality is only available for
CSV formats.

With this code change, users can limit the number of rows they want
to import for DELIMITED and AVRO data forms. That is, they can
specify: `WITH row_limit="{$num}"` to only import $num rows.

Release note (sql change): Added WITH row_limit="{$num}" option for importing
DELIMITED/AVRO data to allow users to do a quick test run on an import of $num rows.
Ex. IMPORT TABLE test ... DELIMITED/AVRO DATA ... WITH row_limit="3";
@mokaixu
Copy link
Contributor Author

mokaixu commented Nov 4, 2020

bors r=adityamaru,miretskiy

@craig
Copy link
Contributor

craig bot commented Nov 4, 2020

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants