Shared extractor: support file path globs #13969

Merged: 6 commits merged on Aug 23, 2023

@hmac (Contributor) commented Aug 15, 2023

Update the tree-sitter extractor to support file globs. This replaces the existing `file_extensions` field with a `file_globs` field, which supports UNIX-style glob patterns powered by the `globset` crate.

This allows files with no extension (e.g. Dockerfiles) to be extracted by specifying a glob such as `*Dockerfile`.

One surprising aspect of this change is that the globs match against the
whole path, rather than just the file name. I'm not sure if this is an issue we should work around, or if it's OK.

@aibaars I'd be interested in your thoughts on this.

Fixes #13964

@hmac hmac marked this pull request as ready for review August 16, 2023 15:11
@hmac hmac requested a review from a team as a code owner August 16, 2023 15:11
@hmac hmac requested a review from aibaars August 18, 2023 10:26
@aibaars (Contributor) left a comment

Looks good to me in general. Let's ensure the wildcards don't match path separators: for example, *JsonFile should not match test/SampleJsonFile.

Make sure to test path separator handling on Windows too.

// Construct a single globset containing all language globs,
// and a mapping from glob index to language index.
let (globset, glob_language_mapping) = {
    let mut builder = GlobSetBuilder::new();
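
The comment in the fragment above describes one glob set shared by all languages, plus a lookup from glob index to language index. A minimal std-only sketch of that bookkeeping follows. The `Language` struct, its `name` field, the example globs, and both helper functions are hypothetical and are not taken from the PR. The sketch assumes the matcher reports the indices of the globs that matched, which is believed to be how `GlobSet::matches` in the `globset` crate behaves.

```rust
/// Hypothetical per-language configuration; only the `file_globs`
/// field name comes from the PR description.
struct Language {
    name: &'static str,
    file_globs: Vec<&'static str>,
}

/// Flatten every language's globs into one list, recording for each
/// glob the index of the language it belongs to.
fn flatten_globs(languages: &[Language]) -> (Vec<&'static str>, Vec<usize>) {
    let mut globs = Vec::new();
    let mut glob_language_mapping = Vec::new();
    for (language_index, language) in languages.iter().enumerate() {
        for glob in &language.file_globs {
            globs.push(*glob);
            glob_language_mapping.push(language_index);
        }
    }
    (globs, glob_language_mapping)
}

/// Translate matched glob indices into language indices,
/// dropping duplicates while keeping first-seen order.
fn languages_for(matched_globs: &[usize], mapping: &[usize]) -> Vec<usize> {
    let mut result = Vec::new();
    for &glob_index in matched_globs {
        let language_index = mapping[glob_index];
        if !result.contains(&language_index) {
            result.push(language_index);
        }
    }
    result
}

fn main() {
    // Hypothetical languages and globs, for illustration only.
    let languages = vec![
        Language { name: "ruby", file_globs: vec!["*.rb", "*Gemfile"] },
        Language { name: "dockerfile", file_globs: vec!["*Dockerfile"] },
    ];
    let (globs, mapping) = flatten_globs(&languages);
    // Pretend the matcher reported glob index 2 for some path.
    for language_index in languages_for(&[2], &mapping) {
        println!("{} -> {}", globs[2], languages[language_index].name);
    }
}
```

Keeping the mapping as a plain vector indexed by glob position means a single match call over the combined set is enough to find every language that claims a file.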

Let's set the literal_separator feature, so that the * and ? wildcards do not match path separators.

Suggested change
- let mut builder = GlobSetBuilder::new();
+ let mut builder = GlobSetBuilder::new().literal_separator(true);

@hmac (Contributor, Author) replied:

I remember now why I didn't do this:

> One surprising aspect of this change is that the globs match against the whole path, rather than just the file name.

If we prevent * from matching file separators, then *.txt will match foo.txt but not bar/foo.txt.
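
The trade-off discussed here can be illustrated with a small std-only matcher. This is a simplified stand-in written for illustration, not the `globset` implementation: `wildcard_match`, `matches_from`, and the `literal_separator` flag are hypothetical names, only `*` and `?` are supported, and `/` is assumed to be the only path separator.

```rust
/// Simplified wildcard matcher supporting only `*` and `?`.
/// When `literal_separator` is true, neither wildcard may match '/'.
fn wildcard_match(pattern: &str, text: &str, literal_separator: bool) -> bool {
    let p: Vec<char> = pattern.chars().collect();
    let t: Vec<char> = text.chars().collect();
    matches_from(&p, &t, literal_separator)
}

fn matches_from(p: &[char], t: &[char], literal_separator: bool) -> bool {
    match p.split_first() {
        // An empty pattern matches only empty text.
        None => t.is_empty(),
        Some((&'*', rest)) => {
            // Let `*` consume 0, 1, 2, ... characters in turn.
            for i in 0..=t.len() {
                if matches_from(rest, &t[i..], literal_separator) {
                    return true;
                }
                // The star cannot grow past a separator in literal mode.
                if i < t.len() && literal_separator && t[i] == '/' {
                    return false;
                }
            }
            false
        }
        Some((&'?', rest)) => match t.split_first() {
            // `?` consumes exactly one character, but not '/' in literal mode.
            Some((&c, t_rest)) if !(literal_separator && c == '/') => {
                matches_from(rest, t_rest, literal_separator)
            }
            _ => false,
        },
        Some((&c, rest)) => match t.split_first() {
            // Any other pattern character must match exactly.
            Some((&tc, t_rest)) if tc == c => matches_from(rest, t_rest, literal_separator),
            _ => false,
        },
    }
}

fn main() {
    let cases = [
        ("*.txt", "foo.txt"),
        ("*.txt", "bar/foo.txt"),
        ("*JsonFile", "test/SampleJsonFile"),
    ];
    for (pattern, path) in cases {
        println!(
            "{pattern} vs {path}: crossing separators = {}, literal separators = {}",
            wildcard_match(pattern, path, false),
            wildcard_match(pattern, path, true),
        );
    }
}
```

With separators treated literally, `*JsonFile` stops matching `test/SampleJsonFile`, as the reviewer wants, but `*.txt` also stops matching `bar/foo.txt`, which is the concern raised in this reply. Note also that in the `globset` crate the `literal_separator` option appears to be set per glob on `GlobBuilder` rather than on `GlobSetBuilder`, so the suggested change would likely need adapting before it compiles.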

@aibaars (Contributor) commented Aug 23, 2023

Sure, but if you don't want that then you shouldn't match the globs against the whole paths. Instead of globset.matches(&path); we could do something like globset.matches(&path.filename());

Alternatively, users could add **/ prefixes to their globs.
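
The first option above, matching only the final path component, can be sketched with the standard library. `Path::file_name` is the std method; the `path.filename()` spelling in the comment is presumably shorthand. The helpers `file_name_of` and `matches_star_txt` are hypothetical, and `matches_star_txt` is a hand-written stand-in for the single glob `*.txt` with literal separators, not a call into `globset`. The `**/` prefix alternative is not shown.

```rust
use std::path::Path;

/// Final path component as UTF-8, or None if the path has no file
/// name or the name is not valid UTF-8.
fn file_name_of(path: &Path) -> Option<&str> {
    path.file_name().and_then(|name| name.to_str())
}

/// Stand-in for the glob `*.txt` when `*` may not cross '/':
/// the candidate must contain no separator and end with ".txt".
fn matches_star_txt(candidate: &str) -> bool {
    !candidate.contains('/') && candidate.ends_with(".txt")
}

fn main() {
    let path = Path::new("bar/foo.txt");
    let whole = path.to_str().unwrap_or("");
    let name = file_name_of(path).unwrap_or("");
    // Matching the whole path fails because of the separator,
    // while matching only the file name succeeds.
    println!("whole path {:?} matches: {}", whole, matches_star_txt(whole));
    println!("file name {:?} matches: {}", name, matches_star_txt(name));
}
```

Matching on the file name keeps patterns such as `*.txt` working for nested files, at the cost of making directory-based patterns impossible to express.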

@hmac hmac force-pushed the shared-extractor-globs branch from 2409b71 to a9115c5 Compare August 23, 2023 13:11
@hmac hmac requested a review from a team as a code owner August 23, 2023 13:11
hmac added 4 commits August 23, 2023 14:11
Replace the `file_extensions` field with `file_globs`, which supports
UNIX style glob patterns powered by the `globset` crate.

This allows files with no extension (e.g. Dockerfiles) to be extracted,
by specifying a glob such as `*Dockerfile`.

One surprising aspect of this change is that the globs match against the
whole path, rather than just the file name.

This is a breaking change.
@github-actions github-actions bot added the Ruby label Aug 23, 2023
@hmac hmac force-pushed the shared-extractor-globs branch from a9115c5 to 60f2506 Compare August 23, 2023 13:11
@github-actions github-actions bot removed the Ruby label Aug 23, 2023
@hmac hmac removed the request for review from a team August 23, 2023 13:11
@hmac hmac force-pushed the shared-extractor-globs branch from 5645322 to 3680613 Compare August 23, 2023 15:13
@hmac hmac merged commit 96e9dfc into github:main Aug 23, 2023
@hmac hmac deleted the shared-extractor-globs branch August 23, 2023 15:42
Successfully merging this pull request may close these issues.

Tree-Sitter Shared Extractor doesn't support extension-less files