Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

importccl: Allow wildcards in import URIs #40714

Merged
merged 1 commit into from
Oct 9, 2019

Conversation

g3orgia
Copy link
Contributor

@g3orgia g3orgia commented Sep 12, 2019

This adds support for using wildcard characters to specify the list of files to import from. Using cloud provider SDKs, we list files that could potentially match a pattern in the URI, then match files using specs as detailed here. This listing capability was added to the ExportStorage interface.

Within import_stmt, we use the newly added interface to expand the listed files. Note that we will import the expanded list of files but the job description would show the originally listed files (with wildcards).

Fixes: #40522

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@thoszhang thoszhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might have some other comments about some details later, but my main question is just about whether we're allowing general glob patterns beyond *.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @g3orgia and @lucy-zhang)


pkg/ccl/importccl/csv_testdata_helpers_test.go, line 113 at r3 (raw file):

	testFiles.fileWithShadowKeys = append(testFiles.fileWithShadowKeys, fmt.Sprintf(`'nodelocal:///%s'`, fmt.Sprintf("shadow-data%s", suffix)))

	wildcardFileName := "data-[0-9]"

Do we want to support arbitrary glob patterns beyond just *? This seems risky to me. I think users are likely to be able to deal with *, but in the context of cloud storage, it's not as obvious that we'll support globbing in general, and people might have other special characters in their filenames that causes unexpected behavior.

Also, I think these tests should include more uses of *, since that's expected to be the most common case.


pkg/ccl/importccl/import_stmt.go, line 177 at r3 (raw file):

		}

		listedFiles, err := filesFn()

This name is kind of vague. I would call it filenamePatterns or something.


pkg/ccl/importccl/import_stmt.go, line 191 at r3 (raw file):

			if err != nil {
				files = append(files, file)
				continue

Is this assuming that an error from ListFiles() indicates that one of the file patterns provided doesn't have a wildcard?

I would think that we should return the error and fail immediately for most of the errors that could occur here (incorrect credentials/permissions, a nonexistent location, etc.). Right now, it looks like patterns where ListFiles() fails for any reason will be added to files, even if they have wildcards. Is that right?

(About patterns that match no files: I think if someone provides multiple patterns/paths, we should only return an error if we end up matching no files to import across all the patterns/paths, but I don't have a strong opinion on that.)


pkg/ccl/storageccl/export_storage.go, line 255 at r3 (raw file):

	WriteFile(ctx context.Context, basename string, content io.ReadSeeker) error

	// ListFiles should return a list of files that a directory contains

nit: Comments should be like sentences, with punctuation at the end (the style guide is https://github.com/cockroachdb/cockroach/blob/master/docs/style.md).


pkg/ccl/storageccl/export_storage.go, line 256 at r4 (raw file):

	// ListFiles should return a list of files that a directory contains
	ListFiles(ctx context.Context, basename string) ([]string, error)

I think this could use a more detailed doc comment in general, explaining what paths are returned, what basename is (should it even be an argument? I don't see it being used in any of the implementations), what the matching rules are, etc., and providing a few examples.

I find the behavior of this method very non-obvious in general. Unlike the other methods, which represent operations on a "directory"/store, this method seems to assume that the URI is actually not a store URI, but a pattern for file URIs. It kind of strains the ExportStorage abstraction IMO, but I'd be more fine with it if the documentation were more comprehensive. (When we read files with ReadFile() during restore/import, we set the ExportStorage URI to be the entire file path and basename to the empty string anyway, so this issue of conflating file paths and store/directory paths isn't new.)


pkg/ccl/storageccl/export_storage.go, line 709 at r4 (raw file):

				}
				if matches {
					fileList = append(fileList, fmt.Sprintf("s3://%s/%s", *s.bucket, *fileObject.Key))

Can this use os.Join()? I think that's preferable to string manipulation for paths. (The same goes for the other paths.)

Copy link
Contributor Author

@g3orgia g3orgia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @lucy-zhang, and @rolandcrosby)


pkg/ccl/importccl/csv_testdata_helpers_test.go, line 113 at r3 (raw file):

Previously, lucy-zhang (Lucy Zhang) wrote…

Do we want to support arbitrary glob patterns beyond just *? This seems risky to me. I think users are likely to be able to deal with *, but in the context of cloud storage, it's not as obvious that we'll support globbing in general, and people might have other special characters in their filenames that causes unexpected behavior.

Also, I think these tests should include more uses of *, since that's expected to be the most common case.

Previously discussed with @dt and @rolandcrosby, it seemed like supporting glob pattern matching might be useful. For example, a user might want to schedule a monthly job that imports s3://mybucket/2019-09-[0-9][0-9]/*.csv. That was the use case in mind when making this decision.

If that case is uncommon/easy to work around, it would be easier to only match files within a directory. As for other glob patterns than *, I think this would be an easy fix but it could be very useful (I actually need it for testing, otherwise it would match unwanted files).

Will add more tests with * and ? patterns for now. We can change the implementation if needed.


pkg/ccl/importccl/import_stmt.go, line 177 at r3 (raw file):

Previously, lucy-zhang (Lucy Zhang) wrote…

This name is kind of vague. I would call it filenamePatterns or something.

Done.


pkg/ccl/importccl/import_stmt.go, line 191 at r3 (raw file):

Previously, lucy-zhang (Lucy Zhang) wrote…

Is this assuming that an error from ListFiles() indicates that one of the file patterns provided doesn't have a wildcard?

I would think that we should return the error and fail immediately for most of the errors that could occur here (incorrect credentials/permissions, a nonexistent location, etc.). Right now, it looks like patterns where ListFiles() fails for any reason will be added to files, even if they have wildcards. Is that right?

(About patterns that match no files: I think if someone provides multiple patterns/paths, we should only return an error if we end up matching no files to import across all the patterns/paths, but I don't have a strong opinion on that.)

Yes, currently, if they have wildcards but error out, it will be added to the list of files and later errors out based if it can read that file. This was to catch the case if an actual filename contained *, [], and/or ?.

I agree it would be much cleaner if we failed immediately. Not sure if the extra check would ever be useful. Lemme know if what you think :)


pkg/ccl/storageccl/export_storage.go, line 256 at r4 (raw file):

Previously, lucy-zhang (Lucy Zhang) wrote…

I think this could use a more detailed doc comment in general, explaining what paths are returned, what basename is (should it even be an argument? I don't see it being used in any of the implementations), what the matching rules are, etc., and providing a few examples.

I find the behavior of this method very non-obvious in general. Unlike the other methods, which represent operations on a "directory"/store, this method seems to assume that the URI is actually not a store URI, but a pattern for file URIs. It kind of strains the ExportStorage abstraction IMO, but I'd be more fine with it if the documentation were more comprehensive. (When we read files with ReadFile() during restore/import, we set the ExportStorage URI to be the entire file path and basename to the empty string anyway, so this issue of conflating file paths and store/directory paths isn't new.)

Will make the function description more useful and concise. Wanted to keep the function similar to other functions in this interface, but basename definitely is not useful in this case, will remove it.

Since this function does not use the URI the way the other functions do, I've thought about extracting it from this interface. But ExportStorage already sets up clients to all the different cloud providers, so it seems like that would be too much work for a change like this.


pkg/ccl/storageccl/export_storage.go, line 709 at r4 (raw file):

Previously, lucy-zhang (Lucy Zhang) wrote…

Can this use os.Join()? I think that's preferable to string manipulation for paths. (The same goes for the other paths.)

Done.

@g3orgia g3orgia force-pushed the import-dir branch 4 times, most recently from 15064c2 to 23f0ad7 Compare September 18, 2019 19:49
@g3orgia g3orgia force-pushed the import-dir branch 2 times, most recently from 5deadb9 to e2ac7e9 Compare October 7, 2019 15:18
Copy link
Contributor

@thoszhang thoszhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 4 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @g3orgia, @lucy-zhang, and @rolandcrosby)


pkg/ccl/storageccl/export_storage.go, line 232 at r5 (raw file):

// URINeedsExpansion checks if URI can be expanded by checking if it contains wildcard characters.
// This should be used before passing a URI into ListFiles()

nit: Add a period


pkg/ccl/storageccl/export_storage.go, line 228 at r6 (raw file):

// URINeedsExpansion checks if URI can be expanded by checking if it contains wildcard characters.
// This should be used before passing a URI into ListFiles()
func URINeedsExpansion(uri string) bool {

I might call thisURINeedsGlobExpansion to be more specific.


pkg/ccl/storageccl/export_storage.go, line 268 at r6 (raw file):

	WriteFile(ctx context.Context, basename string, content io.ReadSeeker) error

	// ListFiles should return a list of files that matches glob pattern provided in URI.

Maybe ListFiles should treat the export storage URI as a glob pattern, and return a list of files that match the pattern.?


pkg/ccl/storageccl/export_storage.go, line 298 at r6 (raw file):

	cfg           roachpb.ExportStorage_LocalFilePath // constains un-prefixed base -- DO NOT use for I/O ops.
	base          string                              // the prefixed base, for I/O ops on this node.
	externalIODir string                              // base directory in which we can perform reads and writes.

Can this have a comment explaining what the difference between this and the other two fields, and why we need it?

(Also, should this have a different name? There's already a cluster setting that's accessed via settings.ExternalIODir, which is very similar.)


pkg/ccl/storageccl/export_storage.go, line 731 at r6 (raw file):

}

func getBucketBeforeWildcard(path string) string {

What happens if someone tries to use a glob pattern to match multiple buckets? Obviously this isn't possible, but what error do we return in that case?


pkg/ccl/storageccl/export_storage.go, line 1215 at r6 (raw file):

func (s *workloadStorage) ListFiles(_ context.Context) ([]string, error) {
	return nil, errors.New(`workload storage does not support listing files`)

nit: Can this (and the other error messages) use errors.Errorf() for consistency?


pkg/ccl/storageccl/export_storage_test.go, line 217 at r6 (raw file):

}

func testListFiles(t *testing.T, storeURI string) {

Should we also add tests for patterns like, e.g., s3://bucket/*/* or s3://bucket/*/a/*?

This adds support for using wildcard characters
to specify the list of files to import from. Using
cloud provider SDKs, we list files that could potentially
match a pattern in the URI, then match files using specs
as detailed [here](https://golang.org/pkg/path/filepath/#Match).
This listing capability was added to the `ExportStorage` interface.

Within `import_stmt`, we use the newly added interface
to expand the listed files. Note that we will import
the expanded list of files but the job description would
show the originally listed files (with wildcards).

This also adds an option flag that can specify turning off
this feature `WITH disable_glob_matching`.

Fixes: cockroachdb#40522

Release note (enterprise change):
Within an import statement, users will be able to specify
CSV filenames using wildcard characters. This behavior can
be disabled with the following option: `WITH disabled_glob_matching`.
Copy link
Contributor Author

@g3orgia g3orgia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @lucy-zhang, and @rolandcrosby)


pkg/ccl/storageccl/export_storage.go, line 228 at r6 (raw file):

Previously, lucy-zhang (Lucy Zhang) wrote…

I might call thisURINeedsGlobExpansion to be more specific.

Done.


pkg/ccl/storageccl/export_storage.go, line 268 at r6 (raw file):

Previously, lucy-zhang (Lucy Zhang) wrote…

Maybe ListFiles should treat the export storage URI as a glob pattern, and return a list of files that match the pattern.?

Done.


pkg/ccl/storageccl/export_storage.go, line 298 at r6 (raw file):

Previously, lucy-zhang (Lucy Zhang) wrote…

Can this have a comment explaining what the difference between this and the other two fields, and why we need it?

(Also, should this have a different name? There's already a cluster setting that's accessed via settings.ExternalIODir, which is very similar.)

This variable is in fact the same as settings.ExternalIODir but settings is no longer accessible later on. The base is the path provided in URI prefixed with externalIODir. The reason we need it, is because we later strip it after the file expansions are done. Added comment to explain difference.


pkg/ccl/storageccl/export_storage.go, line 731 at r6 (raw file):

Previously, lucy-zhang (Lucy Zhang) wrote…

What happens if someone tries to use a glob pattern to match multiple buckets? Obviously this isn't possible, but what error do we return in that case?

Hmm I'm not sure I understand the question. But if someone matches file/path/*/file/*/path, we will list everything in file/path and attempt to match those files to the pattern. I don't think that should be an error case.


pkg/ccl/storageccl/export_storage_test.go, line 217 at r6 (raw file):

Previously, lucy-zhang (Lucy Zhang) wrote…

Should we also add tests for patterns like, e.g., s3://bucket/*/* or s3://bucket/*/a/*?

Yes, added more comprehensive testing.

Copy link
Contributor

@thoszhang thoszhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @g3orgia, @lucy-zhang, and @rolandcrosby)


pkg/ccl/storageccl/export_storage.go, line 731 at r6 (raw file):

Previously, g3orgia (Georgia Hong) wrote…

Hmm I'm not sure I understand the question. But if someone matches file/path/*/file/*/path, we will list everything in file/path and attempt to match those files to the pattern. I don't think that should be an error case.

Resolved offline. (The question was about users trying to do something like s3://bucket*/, but we return a reasonable error in that case.)

@g3orgia
Copy link
Contributor Author

g3orgia commented Oct 9, 2019

bors r+

craig bot pushed a commit that referenced this pull request Oct 9, 2019
40714: importccl: Allow wildcards in import URIs r=g3orgia a=g3orgia

This adds support for using wildcard characters to specify the list of files to import from. Using cloud provider SDKs, we list files that could potentially match a pattern in the URI, then match files using specs as detailed [here](https://golang.org/pkg/path/filepath/#Match). This listing capability was added to the `ExportStorage` interface.

Within `import_stmt`, we use the newly added interface to expand the listed files. Note that we will import the expanded list of files but the job description would show the originally listed files (with wildcards).

Fixes: #40522

Release note: None

Co-authored-by: Georgia Hong <georgiah@cockroachlabs.com>
@craig
Copy link
Contributor

craig bot commented Oct 9, 2019

Build succeeded

@craig craig bot merged commit 2c7315b into cockroachdb:master Oct 9, 2019
@g3orgia g3orgia deleted the import-dir branch October 9, 2019 20:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

importccl: Allow wildcards in cloud storage URLs
3 participants