Automatically detect character encoding of YAML files and ignore files #630

Open
wants to merge 6 commits into
base: master

Jayman2000
Contributor

This PR makes sure that yamllint never uses open()’s default encoding. Specifically, it uses the character encoding detection algorithm specified in chapter 5.2 of the YAML spec when reading both YAML files and files that are on the ignore-from-file list.

There are two other PRs that are similar to this one. Here’s how this PR compares to those two:

  • This PR doesn’t have any merge conflicts.
  • This PR has a cleaner commit history. You can run the tests and flake8 on each commit in this PR, and they’ll report no errors. I don’t think that you can do that with Detect encoding per yaml spec (fix #238) #240.
  • This PR has longer commit messages. I really tried to explain why I think that my changes make sense.
  • This PR detects the encoding of files being linted, config files, and files on the ignore-from-file list. Those two PRs only detect the encoding of files being linted.
  • Detect encoding per yaml spec (fix #238) #240 adds a dependency on chardet. This PR doesn’t add any dependencies.
  • This PR only supports UTF-8, UTF-16 and UTF-32. Both of those PRs support additional encodings.
  • Unicode yaml #581 adds support for running tests on Windows. This PR doesn’t.
  • The code that this PR adds to the yamllint package is simpler.
  • The code that this PR adds to the test package is much more complicated, but hopefully it tests things more thoroughly.

Fixes #218. Fixes #238. Fixes #347.

@coveralls

coveralls commented Jan 3, 2024

Coverage Status

coverage: 99.835% (+0.01%) from 99.825%
when pulling d569de6 on Jayman2000:auto-detect-encoding
into 81e9f98 on adrienverge:master.

@Jayman2000 Jayman2000 force-pushed the auto-detect-encoding branch 4 times, most recently from 8cedbee to 3fa4c57 Compare January 10, 2024 12:40
@Jayman2000 Jayman2000 force-pushed the auto-detect-encoding branch 2 times, most recently from fd2c72d to bb8dc2b Compare February 8, 2024 16:01
@Jayman2000
Contributor Author

I just noticed that one of the checks for this PR is failing. The coverage for yamllint/config.py went down, but that’s just because the total number of relevant lines went down. There are only two lines that aren’t covered, and those same two lines aren’t covered in the master branch either. Is there anything that I need to do here?

@adrienverge
Owner

> Is there anything that I need to do here?

At the moment, no. I'm sorry for the delay. This is a big change with a lot of impact, and I need a large time slot to review it, which I haven't found yet.

Before this change, build_temp_workspace() would always encode a path
using UTF-8 and the strict error handler [1]. Most of the time, this is
fine, but systems do not necessarily use UTF-8 and the strict error
handler for paths [2].
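
A minimal sketch of the difference, assuming Python’s `os.fsencode()` is an acceptable way to follow the filesystem encoding and error handler. The helper name `path_to_bytes` is hypothetical and is not taken from this PR:

```python
import os
import sys


def path_to_bytes(path):
    """Encode a path using the filesystem encoding and error handler.

    os.fsencode() asks the interpreter which encoding and error handler
    apply to paths instead of hard-coding UTF-8 and "strict".
    """
    return os.fsencode(path)


# What a hard-coded approach does instead.
hard_coded = "dir/file.yaml".encode("utf-8", "strict")

# The values printed here depend on the platform and its configuration.
print(sys.getfilesystemencoding(), sys.getfilesystemencodeerrors())
print(path_to_bytes("dir/file.yaml") == hard_coded)
```

The two results only differ on systems whose filesystem encoding is not UTF-8, or for paths containing bytes that the strict error handler rejects.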

[1]: <https://docs.python.org/3.12/library/stdtypes.html#str.encode>
[2]: <https://docs.python.org/3.12/glossary.html#term-filesystem-encoding-and-error-handler>

Before this commit, test_run_default_format_output_in_tty() changed the
values of sys.stdout and sys.stderr, but it would never change them
back. This commit makes sure that they get changed back.
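
One way to guarantee that the streams are restored is `unittest`’s `addCleanup()`. The following is only an illustrative sketch: the class and function names are hypothetical, and the PR may restore the streams differently.

```python
import io
import sys
import unittest


class StreamRestoreExample(unittest.TestCase):
    def test_streams_are_restored(self):
        original_stdout, original_stderr = sys.stdout, sys.stderr
        # Cleanup callbacks run even if the test body raises.
        self.addCleanup(setattr, sys, "stdout", original_stdout)
        self.addCleanup(setattr, sys, "stderr", original_stderr)

        sys.stdout = io.StringIO()
        sys.stderr = io.StringIO()
        print("captured")
        self.assertEqual(sys.stdout.getvalue(), "captured\n")


def run_example():
    """Run the test case and report whether the streams were restored."""
    before = (sys.stdout, sys.stderr)
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(
        StreamRestoreExample
    )
    result = unittest.TestResult()
    suite.run(result)
    return result.wasSuccessful() and (sys.stdout, sys.stderr) == before
```

Registering the cleanup before replacing the streams means a failure partway through the test cannot leak the replaced streams into later tests.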

At the moment, this commit doesn’t make a user-visible difference. A
future commit will add a new test named test_ignored_from_file_with_multiple_encodings().
That new test requires stdout and stderr to be restored, or else it will
fail.

Before this change, yamllint would open YAML files using open()’s
default encoding. As long as UTF-8 mode isn’t enabled, open() defaults
to using the system’s locale encoding [1][2].

Most of the time, the locale encoding on Linux systems is UTF-8 [3][4],
but it doesn’t have to be [5]. Additionally, the locale encoding on
Windows systems is the system’s ANSI code page [6]. As a result, you
would have to either enable UTF-8 mode, give Python a custom manifest or
enable a beta feature in Windows settings in order to lint UTF-8 YAML
files on Windows [2][7].

Finally, using open()’s default encoding is a violation of the YAML
spec. Chapter 5.2 says:

	“On input, a YAML processor must support the UTF-8 and UTF-16
	character encodings. For JSON compatibility, the UTF-32
	encodings must also be supported.

	If a character stream begins with a byte order mark, the
	character encoding will be taken to be as indicated by the byte
	order mark. Otherwise, the stream must begin with an ASCII
	character. This allows the encoding to be deduced by the pattern
	of null (x00) characters.” [8]

This change fixes all of those problems by implementing the YAML spec’s
character encoding detection algorithm. Now, as long as a YAML file
begins with either a byte order mark or an ASCII character, yamllint
will automatically detect it as being UTF-8, UTF-16 or UTF-32. Other
character encodings are not supported at the moment.
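
The detection rules quoted above can be sketched roughly as follows. This is an illustration based on the byte patterns in chapter 5.2 of the YAML spec, not the code that this PR adds; the function name `detect_encoding` and the choice of Python codec names are assumptions.

```python
import codecs


def detect_encoding(stream_data):
    """Guess a Python codec name from the first bytes of a YAML stream."""
    # UTF-32 is checked before UTF-16 because the UTF-32-LE byte order
    # mark begins with the same two bytes as the UTF-16-LE one.
    if stream_data.startswith(codecs.BOM_UTF32_BE):
        return "utf_32"
    if stream_data.startswith(b"\x00\x00\x00") and len(stream_data) >= 4:
        return "utf_32_be"
    if stream_data.startswith(codecs.BOM_UTF32_LE):
        return "utf_32"
    if stream_data[1:4] == b"\x00\x00\x00":
        return "utf_32_le"
    if stream_data.startswith(codecs.BOM_UTF16_BE):
        return "utf_16"
    if stream_data.startswith(b"\x00") and len(stream_data) >= 2:
        return "utf_16_be"
    if stream_data.startswith(codecs.BOM_UTF16_LE):
        return "utf_16"
    if stream_data[1:2] == b"\x00":
        return "utf_16_le"
    if stream_data.startswith(codecs.BOM_UTF8):
        return "utf_8_sig"
    # No byte order mark and no null bytes near the start.
    return "utf_8"


def decode_yaml_bytes(stream_data):
    """Decode raw bytes using the detected codec."""
    return stream_data.decode(detect_encoding(stream_data))
```

The `utf_16`, `utf_32` and `utf_8_sig` codecs consume the byte order mark while decoding, so the decoded text does not start with U+FEFF.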

Fixes adrienverge#218. Fixes adrienverge#238. Fixes adrienverge#347.

[1]: <https://docs.python.org/3.12/library/functions.html#open>
[2]: <https://docs.python.org/3.12/library/os.html#utf8-mode>
[3]: <https://sourceware.org/glibc/manual/html_node/Extended-Char-Intro.html>
[4]: <https://wiki.musl-libc.org/functional-differences-from-glibc.html#Character-sets-and-locale>
[5]: <https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/SUPPORTED;h=c8b63cc2fe2b4547f2fb1bff6193da68d70bd563;hb=36f2487f13e3540be9ee0fb51876b1da72176d3f>
[6]: <https://docs.python.org/3.12/glossary.html#term-locale-encoding>
[7]: <https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page>
[8]: <https://yaml.org/spec/1.2.2/#52-character-encodings>

Before this change, yamllint would decode files on the ignore-from-file
list using open()’s default encoding [1][2]. This can cause decoding to
fail on some systems and succeed on other systems (see the previous
commit message for details).

This change makes yamllint automatically detect the encoding for files
on the ignore-from-file list. It uses the same algorithm that it uses
for detecting the encoding of YAML files, so the same limitations apply:
files must use UTF-8, UTF-16 or UTF-32 and they must begin with either a
byte order mark or an ASCII character.
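
As a rough illustration of how detection can be plugged into `fileinput`, the sketch below uses an `openhook`. All function names are hypothetical, and `_bom_encoding()` is a deliberately simplified stand-in that only looks at byte order marks; it is not the full detection algorithm and is not the code from this PR.

```python
import codecs
import fileinput


def _bom_encoding(first_bytes):
    """Simplified stand-in detector: byte order marks only, else UTF-8."""
    # UTF-32 first: the UTF-32-LE mark starts with the UTF-16-LE mark.
    if first_bytes.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf_32"
    if first_bytes.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf_16"
    if first_bytes.startswith(codecs.BOM_UTF8):
        return "utf_8_sig"
    return "utf_8"


def open_with_detected_encoding(filename, mode):
    """fileinput openhook: peek at the file, then reopen it as text."""
    with open(filename, "rb") as raw:
        encoding = _bom_encoding(raw.read(4))
    return open(filename, mode, encoding=encoding)


def read_ignore_patterns(paths):
    """Return the non-empty lines of every file in paths."""
    with fileinput.input(
        files=paths, openhook=open_with_detected_encoding
    ) as lines:
        return [line.rstrip("\r\n") for line in lines if line.strip()]
```

Because the hook is called once per file, each file on the list can use a different encoding.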

[1]: <https://docs.python.org/3.12/library/fileinput.html#fileinput.input>
[2]: <https://docs.python.org/3.12/library/fileinput.html#fileinput.FileInput>

In general, using open()’s default encoding is a mistake [1]. This
change makes sure that every time open() is called, the encoding
parameter is specified. Specifically, it makes it so that all tests
succeed when run like this:

	python -X warn_default_encoding -W error::EncodingWarning -m unittest discover

[1]: <https://peps.python.org/pep-0597/#using-the-default-encoding-is-a-common-mistake>

The previous few commits have removed all calls to open() that use its
default encoding. That being said, it’s still possible that code added
in the future will contain that same mistake. This commit makes it so
that the CI test job will fail if that mistake is made again.

Unfortunately, it doesn’t look like coverage.py allows you to specify -X
options [1] or warning filters [2] when running your tests [3]. As a
result, the CI test job will also fail if coverage.py uses open()’s
default encoding. Hopefully, coverage.py won’t do that. If it does, then
we can always temporarily revert this commit.

[1]: <https://docs.python.org/3.12/using/cmdline.html#cmdoption-X>
[2]: <https://docs.python.org/3.12/using/cmdline.html#cmdoption-W>
[3]: <https://coverage.readthedocs.io/en/7.4.0/cmd.html#execution-coverage-run>