Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow any characters in filenames / labels #374

Open
damienmg opened this issue Aug 13, 2015 · 82 comments
Open

Allow any characters in filenames / labels #374

damienmg opened this issue Aug 13, 2015 · 82 comments
Labels
P4 This is either out of scope or we don't have bandwidth to review a PR. (No assignee) team-Loading-API BUILD file and macro processing: labels, package(), visibility, glob type: feature request
Milestone

Comments

@damienmg
Copy link
Contributor

Ultimately any character can be part of a filename. We should probably allow that.

Some mangling to generate the corresponding label should probably be done.

Original report on the mailing-list:
https://groups.google.com/d/msgid/bazel-discuss/CAN0GiO3__5jXo5rZqroSj0mFxpqCzUZZVkY%3DSNsJK1%2BZ1BdJLg%40mail.gmail.com

@abergmeier
Copy link
Contributor

  1. So are we talking Unicode?
  2. Where does any character stop?
  3. how do you treat characters that are not allowed on certain platforms?
  4. When using mangling, how do you handle collisions?

@kayasoze
Copy link

In POSIX, filenames are "bags of bytes"--there is no encoding; however, NUL and / are not allowed. Windows has a few more restrictions. Perhaps the BUILD file should be parsed in the encoding of the system locale, usually UTF-8, and filenames run though a ValidForCurrentPlatform() function which checks for disallowed characters. However, opting for strict platform neutrality in this way means that Bazel would have to represent filenames as a bag of bytes and not a Unicode string, as there is no guarantee that the filename will roundtrip through Unicode correctly. The problem can probably be simplified by restricting filenames to be UTF-8 or UTF-16, which should cover most people's needs even though that's not strict POSIX.

@ulfjack
Copy link
Contributor

ulfjack commented Oct 22, 2015

Well, I think we can probably require valid UTF-8 file names and strongly recommend that people use UTF-8 for their file system. For labels / BUILD files, we probably need an escaping scheme, at least for the control characters. If there's a file that isn't valid UTF-8, we give an error message?

@btelles
Copy link

btelles commented Nov 18, 2015

Our company codes mainly in C++, but our frontend uses a lot of JS and nodejs modules which have all sorts of characters in the filenames--for example, -, #, @, (, and ).

Right now this is a major blocker for getting all our codebase under one build system since we can't reference files with semi-special characters. I don't think Bazel should decide what characters are acceptable in file names, as that reduces file names to those that fit both (1) supported languages and (2) supported platforms. This seems unnecessarily restrictive, and is becoming a major pain point for us.

@ulfjack
Copy link
Contributor

ulfjack commented Nov 19, 2015

Agreed. Unfortunately, it's a bit tricky to fix, as a lot of code assumes that the mapping from labels to file names (and vice versa) is trivial, and doesn't require escaping. Any suggestions on an escaping scheme?

@damienmg
Copy link
Contributor Author

URL based?

@abergmeier
Copy link
Contributor

You mean an own URI scheme? Sounds good.

@damienmg
Copy link
Contributor Author

I mean replacing special characters by %XX where XX is the UTF-8 code in hexa.

@ulfjack ulfjack assigned philwo and unassigned ulfjack Mar 2, 2016
@ulfjack
Copy link
Contributor

ulfjack commented Mar 2, 2016

Sorry, I won't be able to work on this. @philwo had an interest, maybe he can make some progress here. :-/

@kayasoze
Copy link

This blocks our Bazel deployment as well.

@RonnieAtOracle
Copy link

This is blocking us. We have a templating system where we need to build our template files. The filenames themselves contains template variables (e.g. ${ServiceName}.java ). Both $, {, and } are not supported by Bazel in file names.

@philwo
Copy link
Member

philwo commented May 25, 2016

I totally agree that this is important, should be done, I want this myself, however I don't have the time to work on it in the coming months, thus I have to unassign it.

@DemiMarie
Copy link

Here is my proposal:

  • Metacharacters (:, %, =, any others) must be %-encoded
  • All other characters are allowed. This includes Unicode characters.
  • Non-UTF8 names are not allowed, even if escaped. This is because there is no good way to handle them cross-platform, nor to display them to the user.

@mihnita
Copy link

mihnita commented Oct 6, 2016

Plain ASCII (and even that partial) makes this feels like we are in the early 90s.

There are reasonable ways to handle that.
For POSIX using the default locale would be good enough.

If my project is C/C++, and it is cross-platform, and I have problems handling Unicode, then I will not use Unicode in file names. And the fact that bazel "explodes" is not such a problem.
But if I use something like Java, then "it just works", and bazel would work too.

Even better would be to to allow for a character-set option in the project file.
This is what maven does. And what Java does with -Dfile.encoding=UTF-8
So if it is there, use it. If not, then take the system charset.

I did not move one project to bazel because test units check that Unicode file names work.
So the files are there, make it through git, work with maven and ant and gradle and java.
But bazel fails because there is an "@" in the file name... Which is supported on all OSes.

lazcamus added a commit to lazcamus/rules_pyenv that referenced this issue Mar 10, 2022
There are 2 incompatibilities with the python install that pyenv creates:
* empty directories - bazel's sandbox mode + rules_pkg doesn't work with
  them: bazelbuild/rules_pkg#135
* spaces in filenames - bazelbuild/bazel#374

So pass flags to glob() to exclude these sticking points, and you can now build
a container (or tar) with rules_pyenv built python.
jorisvandenbossche pushed a commit to apache/arrow that referenced this issue Mar 17, 2022
Bazel pep_deps rule can not handle files names with `=` in the name [1]. Unfortunately, there is no way to fix this issue in Bazel.

[1] bazelbuild/bazel#374

Closes #12643 from iampat/iampat/fix_bazel_pep_import

Authored-by: Ali Amiri <ali@amiri.dev>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
tbg added a commit to tbg/panicparse that referenced this issue May 11, 2022
This is similar to 53e9d3e.

Over at https://github.com/cockroachdb/cockroach, we use bazel and it
chokes on this file:

```
java.io.IOException: Error extracting
[...]/com_github_maruel_panicparse_v2-v2.2.1.zip to [...]:
[...]/com_github_maruel_panicparse_v2/cmd/panic/internal/utf8/?tf8.go
(Illegal byte sequence)
```

which is why we're on a fork so far. It would be nice to avoid this.

There is an (eternal, it seems) [upstream discussion], so there is
little home of bazel starting to support non-unicode filenames.

I wonder if the loss in test coverage (if any) is acceptable in return
for avoiding build problems like the ones we've been seeing.

[upstream discussion]: bazelbuild/bazel#374
tbg added a commit to tbg/panicparse that referenced this issue May 12, 2022
This is similar to 53e9d3e.

Over at https://github.com/cockroachdb/cockroach, we use bazel and it
chokes on this file:

```
java.io.IOException: Error extracting
[...]/com_github_maruel_panicparse_v2-v2.2.1.zip to [...]:
[...]/com_github_maruel_panicparse_v2/cmd/panic/internal/utf8/?tf8.go
(Illegal byte sequence)
```

which is why we're on a fork so far. It would be nice to avoid this.

There is an (eternal, it seems) [upstream discussion], so there is
little home of bazel starting to support non-unicode filenames.

I wonder if the loss in test coverage (if any) is acceptable in return
for avoiding build problems like the ones we've been seeing.

[upstream discussion]: bazelbuild/bazel#374
tbg added a commit to tbg/panicparse that referenced this issue May 12, 2022
This is similar to 53e9d3e.

Over at https://github.com/cockroachdb/cockroach, we use bazel and it
chokes on this file:

```
java.io.IOException: Error extracting
[...]/com_github_maruel_panicparse_v2-v2.2.1.zip to [...]:
[...]/com_github_maruel_panicparse_v2/cmd/panic/internal/utf8/?tf8.go
(Illegal byte sequence)
```

which is why we're on a fork so far. It would be nice to avoid this.

There is an (eternal, it seems) [upstream discussion], so there is
little home of bazel starting to support non-unicode filenames.

I wonder if the loss in test coverage (if any) is acceptable in return
for avoiding build problems like the ones we've been seeing.

[upstream discussion]: bazelbuild/bazel#374
tbg added a commit to tbg/panicparse that referenced this issue May 13, 2022
This is similar to 53e9d3e.

Over at https://github.com/cockroachdb/cockroach, we use bazel and it
chokes on this file:

```
java.io.IOException: Error extracting
[...]/com_github_maruel_panicparse_v2-v2.2.1.zip to [...]:
[...]/com_github_maruel_panicparse_v2/cmd/panic/internal/utf8/?tf8.go
(Illegal byte sequence)
```

which is why we're on a fork so far. It would be nice to avoid this.

There is an (eternal, it seems) [upstream discussion], so there is
little home of bazel starting to support non-unicode filenames.

I wonder if the loss in test coverage (if any) is acceptable in return
for avoiding build problems like the ones we've been seeing.

[upstream discussion]: bazelbuild/bazel#374
maruel pushed a commit to maruel/panicparse that referenced this issue May 13, 2022
This is similar to 53e9d3e.

Over at https://github.com/cockroachdb/cockroach, we use bazel and it
chokes on this file:

```
java.io.IOException: Error extracting
[...]/com_github_maruel_panicparse_v2-v2.2.1.zip to [...]:
[...]/com_github_maruel_panicparse_v2/cmd/panic/internal/utf8/?tf8.go
(Illegal byte sequence)
```

which is why we're on a fork so far. It would be nice to avoid this.

There is an (eternal, it seems) [upstream discussion], so there is
little home of bazel starting to support non-unicode filenames.

[upstream discussion]: bazelbuild/bazel#374
@brandjon brandjon added untriaged team-Loading-API BUILD file and macro processing: labels, package(), visibility, glob and removed team-Build-Language untriaged labels Nov 4, 2022
@adam-azarchs
Copy link
Contributor

I'm reconsidering what I said earlier about allowing :. Yes, it's in common usage in filenames in perl code, however given bazel is intended for cross-platform support, there's also an argument against supporting filenames which wouldn't be legal in Windows, specifically those containing any of <>:"|?*\. Most of those are also problematic in cases of careless rule authors writing shell scripts. Though there's a limit to how much bazel can protect them there1, at least if " isn't legal in target names then it's always safe to "-quote them.

Footnotes

  1. I'd really rather not to also prohibit ';&#()[]{} or whitespace, all of which are also potentially problematic for a carelessly written shell command. I think whitespace is probably the most important use case for loosening the current restrictions, as there are a lot of use cases where file names become user-visible and normal humans don't automatically translate _ into spaces.

@mihnita
Copy link

mihnita commented Mar 23, 2023

given bazel is intended for cross-platform support, there's also an argument against supporting filenames which wouldn't be legal in Windows,

TLDR: I think it is not the job of the build tool to force naming conventions on a project

A project might be Linux and Mac only, no Windows.
Or might be Linux only, but using bazel just because they like it.
Or because use bazel because they us a mixture of targets / programming languages.
Or just because they know it (used it in other projects). Or want to learn.

The file naming rules would be selected by the developers looking at the big picture:
what they need to test, tools they use, version control tools, etc.
If each tool I use tries to force on me its own opinions on what's "the right thing" I end up with [A-Z0-9_]

Should a tool disallow the existence of the same file with different case?
(because there are case insensitive file systems)
If yes, what version of the case mapping table? (it can be system dependent)

Should a tool force a certain Unicode normalization?
Because the MacOS file system normalizes file names to decompose format.
(and what version of the normalization table?)

What about a Mac OS accessing a disk shared by a Linux machine using Samba (a Windows protocol)?

Long story short: not a bazel responsibility to force any limitations on what I can do or not.
IF bazel wants to help devs do "the right thing" then it can be as a validation step that can be disabled.
Not as a design limitation frozen in code.

@adam-azarchs
Copy link
Contributor

I agree, mostly. My point was that there are reasons why we might want to/need to preserve some limitations on legal target names. Ideally any name would be supported, but if certain patterns are difficult to support, if they would anyway be illegal on some platforms that might be a sufficient excuse to not support them. I definitely feel like some of the proposed solutions start to get a bit convoluted.

@thomaslangston-mongodb
Copy link

Ideally any name would be supported, but if certain patterns are difficult to support, if they would anyway be illegal on some platforms that might be a sufficient excuse to not support them.

Illegal on some platforms is not a sufficient excuse to not support a filename pattern. Some filenames are system provided and cannot change to match the build system's preferences. Choosing some subset of valid filename patterns is equivalent to not having full support for any platform that is not encapsulated by the entire subset. We should be supporting the superset of all filename patterns instead.

@atobiszei
Copy link

Encountered this issue while using aws-sdk-cpp as a dependency. Luckily in that case it was only test file that we could safely remove with patch_cmds parameter in git_repository. Leaving it for anybody trying to use aws-sdk-cpp with bazel. It would be nice to see it resolved.

atobiszei added a commit to openvinotoolkit/model_server that referenced this issue Feb 27, 2024
Additionally bump aws-sdk-cpp version to 1.11.279

Encountered issues with bazel. Non ASCII filenames don't work inside bazel. Workaround is to remove test file from aws-sdk-cpp.
bazelbuild/bazel#374

Building with rules_foreign cmake doesn't work with parallel builds. This is something we could optimize later, for now hardcode some low value:
bazelbuild/rules_foreign_cc#329

On Redhat to properly build targets we now have to add:
--//:distro=redhat
By default distro is set to ubuntu. This is workaround for not bazel not being able to differentiate between the two. There is difference in aws-sdk-cpp static libraries location after build. Ideally we should find how to use select on aws-sdk-cpp repo BUILD file which would rely on main repo flag, but I didn't find way to support it there.

JIRA:CVS-130367
psakamoori pushed a commit to psakamoori/model_server that referenced this issue Apr 2, 2024
Additionally bump aws-sdk-cpp version to 1.11.279

Encountered issues with bazel. Non ASCII filenames don't work inside bazel. Workaround is to remove test file from aws-sdk-cpp.
bazelbuild/bazel#374

Building with rules_foreign cmake doesn't work with parallel builds. This is something we could optimize later, for now hardcode some low value:
bazelbuild/rules_foreign_cc#329

On Redhat to properly build targets we now have to add:
--//:distro=redhat
By default distro is set to ubuntu. This is workaround for not bazel not being able to differentiate between the two. There is difference in aws-sdk-cpp static libraries location after build. Ideally we should find how to use select on aws-sdk-cpp repo BUILD file which would rely on main repo flag, but I didn't find way to support it there.

JIRA:CVS-130367
@snakethatlovesstaticlibs
Copy link

snakethatlovesstaticlibs commented Apr 11, 2024

We're running into this in erlang 26 which bundles emoji filenames: https://github.com/erlang/otp/blob/master/lib/stdlib/test/edlin_expand_SUITE_data/.hidden%F0%9F%98%80_file

Reported upstream as not a bug: erlang/otp#8005

@fmeum
Copy link
Collaborator

fmeum commented Apr 11, 2024

@atobiszei @snakethatlovesstaticlibs Both your cases sound like you are in a slightly different situation: Files with complex file names are part of some external dependency you consume, but aren't relevant to its build (for example, because they are part of the test suite). The errors arise while extracting an archive. This use case should be fully supported in Bazel 6.4.0 and higher, could you give that a try?

@snakethatlovesstaticlibs
Copy link

snakethatlovesstaticlibs commented Apr 11, 2024

@fmeum I worked around this by adding the test directories to an exclude in a filegroup, and it seems to be working now (because as you mentioned, the files are part of tests, so I don't need them)

For reference, the error I saw was:

ERROR: .../bazel_build/external/erlang/BUILD.bazel:10:15: Foreign Cc - Configure: Building erlang failed: error reading file '@@erlang//:erts/test/erlc_SUITE_data/src/😀/erl_test_unicode.erl': .../bazel_build/external/erlang/erts/test/erlc_SUITE_data/src/😀/erl_test_unicode.erl (No such file or directory)

sorry if this is unrelated to the issue in the original report, it sounded relevant while I was debugging

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P4 This is either out of scope or we don't have bandwidth to review a PR. (No assignee) team-Loading-API BUILD file and macro processing: labels, package(), visibility, glob type: feature request
Projects
None yet
Development

No branches or pull requests