-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
importccl: Allow wildcards in import URIs #40714
Conversation
73f1329
to
48d72c9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might have some other comments about some details later, but my main question is just about whether we're allowing general glob patterns beyond *
.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @g3orgia and @lucy-zhang)
pkg/ccl/importccl/csv_testdata_helpers_test.go, line 113 at r3 (raw file):
testFiles.fileWithShadowKeys = append(testFiles.fileWithShadowKeys, fmt.Sprintf(`'nodelocal:///%s'`, fmt.Sprintf("shadow-data%s", suffix))) wildcardFileName := "data-[0-9]"
Do we want to support arbitrary glob patterns beyond just *
? This seems risky to me. I think users are likely to be able to deal with *
, but in the context of cloud storage, it's not as obvious that we'll support globbing in general, and people might have other special characters in their filenames that causes unexpected behavior.
Also, I think these tests should include more uses of *
, since that's expected to be the most common case.
pkg/ccl/importccl/import_stmt.go, line 177 at r3 (raw file):
} listedFiles, err := filesFn()
This name is kind of vague. I would call it filenamePatterns
or something.
pkg/ccl/importccl/import_stmt.go, line 191 at r3 (raw file):
if err != nil { files = append(files, file) continue
Is this assuming that an error from ListFiles()
indicates that one of the file patterns provided doesn't have a wildcard?
I would think that we should return the error and fail immediately for most of the errors that could occur here (incorrect credentials/permissions, a nonexistent location, etc.). Right now, it looks like patterns where ListFiles()
fails for any reason will be added to files
, even if they have wildcards. Is that right?
(About patterns that match no files: I think if someone provides multiple patterns/paths, we should only return an error if we end up matching no files to import across all the patterns/paths, but I don't have a strong opinion on that.)
pkg/ccl/storageccl/export_storage.go, line 255 at r3 (raw file):
WriteFile(ctx context.Context, basename string, content io.ReadSeeker) error // ListFiles should return a list of files that a directory contains
nit: Comments should be like sentences, with punctuation at the end (the style guide is https://github.com/cockroachdb/cockroach/blob/master/docs/style.md).
pkg/ccl/storageccl/export_storage.go, line 256 at r4 (raw file):
// ListFiles should return a list of files that a directory contains ListFiles(ctx context.Context, basename string) ([]string, error)
I think this could use a more detailed doc comment in general, explaining what paths are returned, what basename
is (should it even be an argument? I don't see it being used in any of the implementations), what the matching rules are, etc., and providing a few examples.
I find the behavior of this method very non-obvious in general. Unlike the other methods, which represent operations on a "directory"/store, this method seems to assume that the URI is actually not a store URI, but a pattern for file URIs. It kind of strains the ExportStorage
abstraction IMO, but I'd be more fine with it if the documentation were more comprehensive. (When we read files with ReadFile()
during restore/import, we set the ExportStorage
URI to be the entire file path and basename
to the empty string anyway, so this issue of conflating file paths and store/directory paths isn't new.)
pkg/ccl/storageccl/export_storage.go, line 709 at r4 (raw file):
} if matches { fileList = append(fileList, fmt.Sprintf("s3://%s/%s", *s.bucket, *fileObject.Key))
Can this use os.Join()
? I think that's preferable to string manipulation for paths. (The same goes for the other paths.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt, @lucy-zhang, and @rolandcrosby)
pkg/ccl/importccl/csv_testdata_helpers_test.go, line 113 at r3 (raw file):
Previously, lucy-zhang (Lucy Zhang) wrote…
Do we want to support arbitrary glob patterns beyond just
*
? This seems risky to me. I think users are likely to be able to deal with*
, but in the context of cloud storage, it's not as obvious that we'll support globbing in general, and people might have other special characters in their filenames that causes unexpected behavior.Also, I think these tests should include more uses of
*
, since that's expected to be the most common case.
Previously discussed with @dt and @rolandcrosby, it seemed like supporting glob pattern matching might be useful. For example, a user might want to schedule a monthly job that imports s3://mybucket/2019-09-[0-9][0-9]/*.csv
. That was the use case in mind when making this decision.
If that case is uncommon/easy to work around, it would be easier to only match files within a directory. As for other glob patterns than *
, I think this would be an easy fix but it could be very useful (I actually need it for testing, otherwise it would match unwanted files).
Will add more tests with *
and ?
patterns for now. We can change the implementation if needed.
pkg/ccl/importccl/import_stmt.go, line 177 at r3 (raw file):
Previously, lucy-zhang (Lucy Zhang) wrote…
This name is kind of vague. I would call it
filenamePatterns
or something.
Done.
pkg/ccl/importccl/import_stmt.go, line 191 at r3 (raw file):
Previously, lucy-zhang (Lucy Zhang) wrote…
Is this assuming that an error from
ListFiles()
indicates that one of the file patterns provided doesn't have a wildcard?I would think that we should return the error and fail immediately for most of the errors that could occur here (incorrect credentials/permissions, a nonexistent location, etc.). Right now, it looks like patterns where
ListFiles()
fails for any reason will be added tofiles
, even if they have wildcards. Is that right?(About patterns that match no files: I think if someone provides multiple patterns/paths, we should only return an error if we end up matching no files to import across all the patterns/paths, but I don't have a strong opinion on that.)
Yes, currently, if they have wildcards but error out, it will be added to the list of files and later errors out based if it can read that file. This was to catch the case if an actual filename contained *
, []
, and/or ?
.
I agree it would be much cleaner if we failed immediately. Not sure if the extra check would ever be useful. Lemme know if what you think :)
pkg/ccl/storageccl/export_storage.go, line 256 at r4 (raw file):
Previously, lucy-zhang (Lucy Zhang) wrote…
I think this could use a more detailed doc comment in general, explaining what paths are returned, what
basename
is (should it even be an argument? I don't see it being used in any of the implementations), what the matching rules are, etc., and providing a few examples.I find the behavior of this method very non-obvious in general. Unlike the other methods, which represent operations on a "directory"/store, this method seems to assume that the URI is actually not a store URI, but a pattern for file URIs. It kind of strains the
ExportStorage
abstraction IMO, but I'd be more fine with it if the documentation were more comprehensive. (When we read files withReadFile()
during restore/import, we set theExportStorage
URI to be the entire file path andbasename
to the empty string anyway, so this issue of conflating file paths and store/directory paths isn't new.)
Will make the function description more useful and concise. Wanted to keep the function similar to other functions in this interface, but basename
definitely is not useful in this case, will remove it.
Since this function does not use the URI the way the other functions do, I've thought about extracting it from this interface. But ExportStorage
already sets up clients to all the different cloud providers, so it seems like that would be too much work for a change like this.
pkg/ccl/storageccl/export_storage.go, line 709 at r4 (raw file):
Previously, lucy-zhang (Lucy Zhang) wrote…
Can this use
os.Join()
? I think that's preferable to string manipulation for paths. (The same goes for the other paths.)
Done.
15064c2
to
23f0ad7
Compare
5deadb9
to
e2ac7e9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 4 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt, @g3orgia, @lucy-zhang, and @rolandcrosby)
pkg/ccl/storageccl/export_storage.go, line 232 at r5 (raw file):
// URINeedsExpansion checks if URI can be expanded by checking if it contains wildcard characters. // This should be used before passing a URI into ListFiles()
nit: Add a period
pkg/ccl/storageccl/export_storage.go, line 228 at r6 (raw file):
// URINeedsExpansion checks if URI can be expanded by checking if it contains wildcard characters. // This should be used before passing a URI into ListFiles() func URINeedsExpansion(uri string) bool {
I might call thisURINeedsGlobExpansion
to be more specific.
pkg/ccl/storageccl/export_storage.go, line 268 at r6 (raw file):
WriteFile(ctx context.Context, basename string, content io.ReadSeeker) error // ListFiles should return a list of files that matches glob pattern provided in URI.
Maybe ListFiles should treat the export storage URI as a glob pattern, and return a list of files that match the pattern.
?
pkg/ccl/storageccl/export_storage.go, line 298 at r6 (raw file):
cfg roachpb.ExportStorage_LocalFilePath // constains un-prefixed base -- DO NOT use for I/O ops. base string // the prefixed base, for I/O ops on this node. externalIODir string // base directory in which we can perform reads and writes.
Can this have a comment explaining what the difference between this and the other two fields, and why we need it?
(Also, should this have a different name? There's already a cluster setting that's accessed via settings.ExternalIODir
, which is very similar.)
pkg/ccl/storageccl/export_storage.go, line 731 at r6 (raw file):
} func getBucketBeforeWildcard(path string) string {
What happens if someone tries to use a glob pattern to match multiple buckets? Obviously this isn't possible, but what error do we return in that case?
pkg/ccl/storageccl/export_storage.go, line 1215 at r6 (raw file):
func (s *workloadStorage) ListFiles(_ context.Context) ([]string, error) { return nil, errors.New(`workload storage does not support listing files`)
nit: Can this (and the other error messages) use errors.Errorf()
for consistency?
pkg/ccl/storageccl/export_storage_test.go, line 217 at r6 (raw file):
} func testListFiles(t *testing.T, storeURI string) {
Should we also add tests for patterns like, e.g., s3://bucket/*/*
or s3://bucket/*/a/*
?
This adds support for using wildcard characters to specify the list of files to import from. Using cloud provider SDKs, we list files that could potentially match a pattern in the URI, then match files using specs as detailed [here](https://golang.org/pkg/path/filepath/#Match). This listing capability was added to the `ExportStorage` interface. Within `import_stmt`, we use the newly added interface to expand the listed files. Note that we will import the expanded list of files but the job description would show the originally listed files (with wildcards). This also adds an option flag that can specify turning off this feature `WITH disable_glob_matching`. Fixes: cockroachdb#40522 Release note (enterprise change): Within an import statement, users will be able to specify CSV filenames using wildcard characters. This behavior can be disabled with the following option: `WITH disabled_glob_matching`.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt, @lucy-zhang, and @rolandcrosby)
pkg/ccl/storageccl/export_storage.go, line 228 at r6 (raw file):
Previously, lucy-zhang (Lucy Zhang) wrote…
I might call this
URINeedsGlobExpansion
to be more specific.
Done.
pkg/ccl/storageccl/export_storage.go, line 268 at r6 (raw file):
Previously, lucy-zhang (Lucy Zhang) wrote…
Maybe
ListFiles should treat the export storage URI as a glob pattern, and return a list of files that match the pattern.
?
Done.
pkg/ccl/storageccl/export_storage.go, line 298 at r6 (raw file):
Previously, lucy-zhang (Lucy Zhang) wrote…
Can this have a comment explaining what the difference between this and the other two fields, and why we need it?
(Also, should this have a different name? There's already a cluster setting that's accessed via
settings.ExternalIODir
, which is very similar.)
This variable is in fact the same as settings.ExternalIODir
but settings is no longer accessible later on. The base
is the path provided in URI prefixed with externalIODir
. The reason we need it, is because we later strip it after the file expansions are done. Added comment to explain difference.
pkg/ccl/storageccl/export_storage.go, line 731 at r6 (raw file):
Previously, lucy-zhang (Lucy Zhang) wrote…
What happens if someone tries to use a glob pattern to match multiple buckets? Obviously this isn't possible, but what error do we return in that case?
Hmm I'm not sure I understand the question. But if someone matches file/path/*/file/*/path
, we will list everything in file/path
and attempt to match those files to the pattern. I don't think that should be an error case.
pkg/ccl/storageccl/export_storage_test.go, line 217 at r6 (raw file):
Previously, lucy-zhang (Lucy Zhang) wrote…
Should we also add tests for patterns like, e.g.,
s3://bucket/*/*
ors3://bucket/*/a/*
?
Yes, added more comprehensive testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt, @g3orgia, @lucy-zhang, and @rolandcrosby)
pkg/ccl/storageccl/export_storage.go, line 731 at r6 (raw file):
Previously, g3orgia (Georgia Hong) wrote…
Hmm I'm not sure I understand the question. But if someone matches
file/path/*/file/*/path
, we will list everything infile/path
and attempt to match those files to the pattern. I don't think that should be an error case.
Resolved offline. (The question was about users trying to do something like s3://bucket*/
, but we return a reasonable error in that case.)
bors r+ |
40714: importccl: Allow wildcards in import URIs r=g3orgia a=g3orgia This adds support for using wildcard characters to specify the list of files to import from. Using cloud provider SDKs, we list files that could potentially match a pattern in the URI, then match files using specs as detailed [here](https://golang.org/pkg/path/filepath/#Match). This listing capability was added to the `ExportStorage` interface. Within `import_stmt`, we use the newly added interface to expand the listed files. Note that we will import the expanded list of files but the job description would show the originally listed files (with wildcards). Fixes: #40522 Release note: None Co-authored-by: Georgia Hong <georgiah@cockroachlabs.com>
Build succeeded |
This adds support for using wildcard characters to specify the list of files to import from. Using cloud provider SDKs, we list files that could potentially match a pattern in the URI, then match files using specs as detailed here. This listing capability was added to the
ExportStorage
interface.Within
import_stmt
, we use the newly added interface to expand the listed files. Note that we will import the expanded list of files but the job description would show the originally listed files (with wildcards).Fixes: #40522
Release note: None