New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
importccl: Added row limit option for importing DELIMITED, AVRO formats #56135
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You seem to have 2 commits with the same title... Not sure if you need to squash them, or perhaps update one of the commits w/ correct description.
Nice job! I have mostly nits. Giving you a big
Feel free to ping me if you have any questions re my minor requests.
Reviewed 2 of 6 files at r1, 6 of 6 files at r2.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @mokaixu)
pkg/ccl/importccl/import_stmt.go, line 456 at r2 (raw file):
rowLimit, err := strconv.Atoi(override) if err != nil { return pgerror.Wrapf(err, pgcode.Syntax, "invalid %s value", csvRowLimit)
perhaps invalid numeric value %s
to make it clear we expect a number there?
pkg/ccl/importccl/import_stmt.go, line 522 at r2 (raw file):
format.SaveRejected = true } // TODO(mokaixu): rename csvRowLimit to rowLimit
Maybe just rename it?
Or keep it as csvRowLimit -- that's fine, I think (and drop this todo). We already use csv named vars in other formats.
Or... if you feel strongly, move this todo on top of this file and make it generic to all csv* options that we use in other formats (e.g. csvNullIf, etc)
pkg/ccl/importccl/import_stmt.go, line 913 at r2 (raw file):
Quoted 10 lines of code…
if override, ok := opts[csvRowLimit]; ok { rowLimit, err := strconv.Atoi(override) if err != nil { return pgerror.Wrapf(err, pgcode.Syntax, "invalid %s value", csvRowLimit) } if rowLimit <= 0 { return pgerror.Newf(pgcode.Syntax, "%s must be > 0", csvRowLimit) } format.Avro.RowLimit = uint32(rowLimit) }
Perhaps let's move this into a helper
func setNumericOption(opt string, opts map[string]string, int32* val) error {
....
}
pkg/ccl/importccl/import_stmt_test.go, line 344 at r2 (raw file):
typ: "CSV", data: "a string, b string\nfoo,normal\nbar,baz\nchocolate,cake\n", err: "pq: row_limit must be > 0",
would row limit of 0 be useful? Perhaps I just want to create the table and check it worked?
pkg/ccl/importccl/import_stmt_test.go, line 376 at r2 (raw file):
typ: "CSV", data: "a string, b string\nfoo,normal\nbar,baz\nchocolate,cake\n", query: map[string][][]string{`SELECT * from t`: {{"foo", "normal"}, {"bar", "baz"}, {"chocolate", "cake"}}},
do we want test case w/ skip > max rows?
skip < 0?
pkg/ccl/importccl/import_stmt_test.go, line 1443 at r2 (raw file):
// Set up a directory for the data files. err := os.Mkdir(filepath.Join(baseDir, "test"), 0777) require.NoError(t, err)
you don't need this. Your baseDir already points to a temporary directory.
pkg/ccl/importccl/import_stmt_test.go, line 1450 at r2 (raw file):
// Create temporary file for test data. err := ioutil.WriteFile(filepath.Join(baseDir, "test", "data"), []byte(test.data), 0666)
You probably should create a different file for each test. And you probably don't need to create subdirectoreis (filepath.Join) since basedir is a temp directory.
pkg/ccl/importccl/read_import_base.go, line 411 at r2 (raw file):
skip int64 // Number of records to skip rejected chan string // Channel for reporting corrupt "rows" rowLimit uint32 // Number of records to process.
Just noticed. Let's not use uint (here and throughout). Use unsigned types only when representing bit patters. Check for invalid values (like you already do when checking for negative offset and return errors). See below.
pkg/ccl/importccl/read_import_base.go, line 555 at r2 (raw file):
// Stop when we have processed row limit number of rows. numRowsToProcess := count - numSkipped if fileCtx.rowLimit != 0 && numRowsToProcess > int64(fileCtx.rowLimit) {
This is why rowLimit should be a signed quantity. Not that your code is effected (for now), but it's because signed extensions and conversions are rather subtle, language dependent and a source of bugs. See https://play.golang.org/p/O8_ERDumpvu
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @miretskiy and @mokaixu)
pkg/ccl/importccl/import_stmt.go, line 456 at r2 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
perhaps
invalid numeric value %s
to make it clear we expect a number there?
done
pkg/ccl/importccl/import_stmt.go, line 522 at r2 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
Maybe just rename it?
Or keep it as csvRowLimit -- that's fine, I think (and drop this todo). We already use csv named vars in other formats.
Or... if you feel strongly, move this todo on top of this file and make it generic to all csv* options that we use in other formats (e.g. csvNullIf, etc)
ack. i'll keep it as csvRowLimit for the CSV only PR and will address this change in the next PR that covers AVRO & DELIMITED
pkg/ccl/importccl/import_stmt.go, line 913 at r2 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
if override, ok := opts[csvRowLimit]; ok { rowLimit, err := strconv.Atoi(override) if err != nil { return pgerror.Wrapf(err, pgcode.Syntax, "invalid %s value", csvRowLimit) } if rowLimit <= 0 { return pgerror.Newf(pgcode.Syntax, "%s must be > 0", csvRowLimit) } format.Avro.RowLimit = uint32(rowLimit) }
Perhaps let's move this into a helper
func setNumericOption(opt string, opts map[string]string, int32* val) error { .... }
I refactored this for rowLimit, but will this apply to skip as well?
pkg/ccl/importccl/import_stmt_test.go, line 344 at r2 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
would row limit of 0 be useful? Perhaps I just want to create the table and check it worked?
Not sure how frequently this would come up, but right now 0 is used as a sentinel value, so I'm enforcing row_limit to be strictly greater than 0.
pkg/ccl/importccl/import_stmt_test.go, line 1443 at r2 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
you don't need this. Your baseDir already points to a temporary directory.
ack
pkg/ccl/importccl/import_stmt_test.go, line 1450 at r2 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
You probably should create a different file for each test. And you probably don't need to create subdirectoreis (filepath.Join) since basedir is a temp directory.
pinged you about this one!
pkg/ccl/importccl/read_import_base.go, line 411 at r2 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
Just noticed. Let's not use uint (here and throughout). Use unsigned types only when representing bit patters. Check for invalid values (like you already do when checking for negative offset and return errors). See below.
done
pkg/ccl/importccl/read_import_base.go, line 555 at r2 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
This is why rowLimit should be a signed quantity. Not that your code is effected (for now), but it's because signed extensions and conversions are rather subtle, language dependent and a source of bugs. See https://play.golang.org/p/O8_ERDumpvu
makes sense, thanks for the resource :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @miretskiy and @mokaixu)
pkg/ccl/importccl/import_stmt.go, line 913 at r2 (raw file):
Previously, mokaixu (Monica Xu) wrote…
I refactored this for rowLimit, but will this apply to skip as well?
Following up, I reverted this code refactor because I feel like the error that could be thrown might be different for other numeric options. Wasn't sure how to cleanly account for this in a helper
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @miretskiy)
pkg/ccl/importccl/import_stmt_test.go, line 376 at r2 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
do we want test case w/ skip > max rows?
skip < 0?
this is already tested in TestImportData
name: "skip -1 lines"
name: "skip > all lines"
3c37508
to
86781d3
Compare
A row limit option allows users to specify a fixed number of rows that they want to import. This functionality is only available for CSV formats. With this code change, users can limit the number of rows they want to import for DELIMITED and AVRO data forms. That is, they can specify: `WITH row_limit="{$num}"` to only import $num rows. Release note (sql change): Added WITH row_limit="{$num}" option for importing DELIMITED/AVRO data to allow users to do a quick test run on an import of $num rows. Ex. IMPORT TABLE test ... DELIMITED/AVRO DATA ... WITH row_limit="3";
bors r=adityamaru,miretskiy |
Build succeeded: |
DIFF BASE on PR: #56080