Tests fail when locale is not UTF-8 #136

Open
sarveshtamba opened this issue Apr 15, 2020 · 15 comments

@sarveshtamba

I am trying to build plexus-archiver v3.7.0 and v4.2.2 on the ppc64le platform, but the following test fails:

[ERROR] Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.102 s <<< FAILURE! - in org.codehaus.plexus.archiver.zip.ZipUnArchiverTest
[ERROR] testUnarchiveUtf8(org.codehaus.plexus.archiver.zip.ZipUnArchiverTest)  Time elapsed: 0.021 s  <<< FAILURE!
junit.framework.AssertionFailedError
        at org.codehaus.plexus.archiver.zip.ZipUnArchiverTest.testUnarchiveUtf8(ZipUnArchiverTest.java:86)
@michael-o
Member

michael-o commented Apr 15, 2020

This is a locale error. The test assumes you use an xx_YY.UTF-8 locale. What is your locale?

@sarveshtamba
Author

sh-4.2# locale
LANG=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=

@michael-o
Member

This will not work with a POSIX locale. We need to check file.encoding/Charset#defaultCharset() and skip such tests when the encoding is not UTF-8.

Can you disable this test and see whether the rest works?
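A minimal sketch of the kind of guard described above, using only the JDK. The class name `EncodingGuard` and the sample file name are hypothetical and not part of Plexus Archiver; a real test would likely use the test framework's own skip mechanism instead of printing a warning.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingGuard {

    /** Returns true if every character of the name can be encoded in the given charset. */
    public static boolean canEncode(String fileName, Charset charset) {
        return charset.newEncoder().canEncode(fileName);
    }

    public static void main(String[] args) {
        // "aPi\u00f1ata.txt" is "aPiñata.txt", written with an escape so the
        // source compiles the same way under any source encoding.
        String name = "aPi\u00f1ata.txt";

        // Assumption: the default charset reflects the locale. On JDK 18 and later
        // it is UTF-8 by default regardless of locale, so another signal (for
        // example the sun.jnu.encoding system property) would be needed there.
        Charset platform = Charset.defaultCharset();

        if (canEncode(name, platform)) {
            System.out.println("Running UTF-8 file name tests with " + platform);
        } else {
            System.out.println("WARNING: skipping UTF-8 file name tests, "
                    + platform + " cannot encode the test file names");
        }
    }
}
```

Checking whether the charset can encode the concrete test names, rather than comparing the charset name to "UTF-8", would also let the tests run under other encodings that happen to cover those characters.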

@michael-o
Member

michael-o commented Apr 15, 2020

I can reproduce with LC_ALL=C mvn verify:

[ERROR] Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.104 s <<< FAILURE! - in org.codehaus.plexus.archiver.zip.ZipUnArchiverTest
[ERROR] testUnarchiveUtf8(org.codehaus.plexus.archiver.zip.ZipUnArchiverTest)  Time elapsed: 0.021 s  <<< FAILURE!
junit.framework.AssertionFailedError
        at org.codehaus.plexus.archiver.zip.ZipUnArchiverTest.testUnarchiveUtf8(ZipUnArchiverTest.java:86)

on

$ LC_ALL=C mvn -v
Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T20:33:14+02:00)
Maven home: /usr/local/apache-maven-3.5.4
Java version: 1.8.0_242, vendor: Oracle Corporation, runtime: /usr/local/openjdk8/jre
Default locale: en_US, platform encoding: US-ASCII
OS name: "freebsd", version: "12.1-stable", arch: "amd64", family: "unix"

@michael-o changed the title from "Error while building plexus-archiver on ppc64le platform" to "Tests fail when locale is not UTF-8" on Apr 15, 2020
@michael-o
Member

These tests need to be reviewed to check whether they work as intended:

$ grep -ri -E -e utf-?8 src/test/java/
src/test/java/org/codehaus/plexus/archiver/jar/DirectoryArchiverUnpackJarTest.java:        archiver.addArchivedFileSet( afs, Charset.forName( "UTF-8" ) );
src/test/java/org/codehaus/plexus/archiver/tar/TarArchiverTest.java:        File tmpDir = getTestFile( "src/test/resources/utf8" );
src/test/java/org/codehaus/plexus/archiver/zip/ConcurrentJarCreatorTest.java:        zos.setEncoding( "UTF-8" );
src/test/java/org/codehaus/plexus/archiver/zip/PlexusIoZipFileResourceCollectionTest.java:                final String manifest = IOUtils.toString( contents1, "UTF-8" );
src/test/java/org/codehaus/plexus/archiver/zip/ZipArchiverTest.java:        zipArchiver.setEncoding( "UTF-8" );
src/test/java/org/codehaus/plexus/archiver/zip/ZipArchiverTest.java:        zipArchiver2.setEncoding( "UTF-8" );
src/test/java/org/codehaus/plexus/archiver/zip/ZipArchiverTest.java:        zipArchive.setEncoding( "UTF-8" );
src/test/java/org/codehaus/plexus/archiver/zip/ZipArchiverTest.java:    public void testDefaultUTF8()
src/test/java/org/codehaus/plexus/archiver/zip/ZipArchiverTest.java:        final ZipArchiver zipArchiver = getZipArchiver( new File( "target/output/utf8-default.zip" ) );
src/test/java/org/codehaus/plexus/archiver/zip/ZipArchiverTest.java:        zipArchiver.addDirectory( new File( "src/test/resources/miscUtf8" ) );
src/test/java/org/codehaus/plexus/archiver/zip/ZipArchiverTest.java:    public void testDefaultUTF8withUTF8()
src/test/java/org/codehaus/plexus/archiver/zip/ZipArchiverTest.java:        final ZipArchiver zipArchiver = getZipArchiver( new File( "target/output/utf8-with_utf.zip" ) );
src/test/java/org/codehaus/plexus/archiver/zip/ZipArchiverTest.java:        zipArchiver.setEncoding( "UTF-8" );
src/test/java/org/codehaus/plexus/archiver/zip/ZipArchiverTest.java:        zipArchiver.addDirectory( new File( "src/test/resources/miscUtf8" ) );
src/test/java/org/codehaus/plexus/archiver/zip/ZipUnArchiverTest.java:    public void testUnarchiveUtf8()
src/test/java/org/codehaus/plexus/archiver/zip/ZipUnArchiverTest.java:        File dest = new File( "target/output/unzip/utf8" );
src/test/java/org/codehaus/plexus/archiver/zip/ZipUnArchiverTest.java:        final File zipFile = new File( "target/output/unzip/utf8-default.zip" );
src/test/java/org/codehaus/plexus/archiver/zip/ZipUnArchiverTest.java:        zipArchiver.addDirectory( new File( "src/test/resources/miscUtf8" ) );

@sarveshtamba
Author

I set the locale with export LC_ALL=en_US.UTF-8, and after that the build succeeded.

[INFO] --- maven-install-plugin:2.5.2:install (default-install) @ plexus-archiver ---
[INFO] Installing /root/plexus-archiver-master/target/plexus-archiver-4.2.3-SNAPSHOT.jar to /root/.m2/repository/org/codehaus/plexus/plexus-archiver/4.2.3-SNAPSHOT/plexus-archiver-4.2.3-SNAPSHOT.jar
[INFO] Installing /root/plexus-archiver-master/pom.xml to /root/.m2/repository/org/codehaus/plexus/plexus-archiver/4.2.3-SNAPSHOT/plexus-archiver-4.2.3-SNAPSHOT.pom
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  41.100 s
[INFO] Finished at: 2020-04-15T13:41:29Z
[INFO] ------------------------------------------------------------------------

@plamentotev
Member

Hi @sarveshtamba, this is a very interesting issue. Thanks for reporting it. It seems that there is a bug in Plexus Archiver when it works with files whose names contain Unicode characters and the system locale is not UTF-8.

I've set my locale to POSIX to reproduce your environment. Here is the behavior on my system; could you please confirm that you see the same on yours?

When I clone Plexus Archiver, the test files are checked out:

$ ls src/test/resources/miscUtf8/
aFileWithA#.html
'aPi'$'\303\261''ata.txt'
'an'$'\303\274''mlaut.txt'
''$'\342\202\254''uro.txt'

Although ls shows the file names with escape sequences, the files are there and their names are the same as in the repository. When I run the Maven build, the files with special characters are missing from the output directory:

$ mvn clean verify
$ ls target/output/unzip/utf8
aFileWithA#.html

The generated zip file also includes a single file:

$ unzip -l target/output/unzip/utf8-default.zip
Archive:  target/output/unzip/utf8-default.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
       20  2020-04-17 10:06   aFileWithA#.html
---------                     -------
       20                     1 file

If I use zip and unzip, everything works as expected: all files are compressed and keep their correct names. So this is not a limitation of the system or of ZIP itself. It is a defect in Plexus Archiver.
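As a rough illustration that the ZIP format itself can carry these names, the sketch below writes and reads entry names entirely in memory with the JDK's `java.util.zip` classes and an explicit UTF-8 charset. It is an assumption-laden stand-in: Plexus Archiver does not necessarily use these classes, and the class name `ZipNameRoundTrip` is hypothetical. No file system paths are involved, so the locale plays no part here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipNameRoundTrip {

    /** Writes one entry per name into an in-memory ZIP, then reads the names back. */
    public static List<String> roundTrip(List<String> names) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            // Entry names are encoded with the charset passed here, not the locale.
            try (ZipOutputStream zos = new ZipOutputStream(buffer, StandardCharsets.UTF_8)) {
                for (String name : names) {
                    zos.putNextEntry(new ZipEntry(name));
                    zos.write("content".getBytes(StandardCharsets.UTF_8));
                    zos.closeEntry();
                }
            }
            List<String> read = new ArrayList<>();
            try (ZipInputStream zis = new ZipInputStream(
                    new ByteArrayInputStream(buffer.toByteArray()), StandardCharsets.UTF_8)) {
                for (ZipEntry e = zis.getNextEntry(); e != null; e = zis.getNextEntry()) {
                    read.add(e.getName());
                }
            }
            return read;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Escapes stand for "€uro.txt", "anümlaut.txt" and "aPiñata.txt".
        List<String> names = Arrays.asList(
                "\u20acuro.txt", "aFileWithA#.html", "an\u00fcmlaut.txt", "aPi\u00f1ata.txt");
        System.out.println(roundTrip(names).equals(names));
    }
}
```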

When I change the locale to en_US.UTF-8 Plexus Archiver also behaves as expected:

$ LC_ALL=en_US.UTF-8 mvn verify
$ ls target/output/unzip/utf8
aFileWithA#.html
'aPi'$'\303\261''ata.txt'
'an'$'\303\274''mlaut.txt'
''$'\342\202\254''uro.txt'
$ unzip -l target/output/unzip/utf8-default.zip
Archive:  target/output/unzip/utf8-default.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
       31  2020-04-17 10:06   €uro.txt
       20  2020-04-17 10:06   aFileWithA#.html
       39  2020-04-17 10:06   anümlaut.txt
       29  2020-04-17 10:06   aPiñata.txt
---------                     -------
      119                     4 files

P.S. I'm testing on Ubuntu with an ext4 file system and the locale set to POSIX, but I would expect the same behavior on other POSIX/Unix-like systems.

@michael-o
Member

The problem is that Java cannot properly map bytes to characters when the assumed encoding is wrong. Unix filesystems are not charset-aware: they simply store bytes, not code points.
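The byte-to-character mismatch can be sketched with plain JDK string conversions. This is only an analogy for what happens to file names, not the JDK's actual file system code; the class name `FileNameBytes` is hypothetical. The UTF-8 bytes of a name are decoded with a given charset and encoded back, and the check reports whether the original bytes survive.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class FileNameBytes {

    /** Decodes the raw name bytes with the charset and encodes them back; true if nothing was lost. */
    public static boolean survivesRoundTrip(byte[] rawName, Charset charset) {
        // Bytes the charset cannot decode become U+FFFD replacement characters,
        // which in turn cannot be encoded back to the original bytes.
        String decoded = new String(rawName, charset);
        return Arrays.equals(rawName, decoded.getBytes(charset));
    }

    public static void main(String[] args) {
        // The bytes a Unix file system would store for "aPiñata.txt" created under UTF-8.
        byte[] raw = "aPi\u00f1ata.txt".getBytes(StandardCharsets.UTF_8);

        System.out.println("UTF-8:    " + survivesRoundTrip(raw, StandardCharsets.UTF_8));
        System.out.println("US-ASCII: " + survivesRoundTrip(raw, StandardCharsets.US_ASCII));
    }
}
```

With US-ASCII the two bytes of the accented letter are lost in the conversion, so a name decoded this way no longer identifies the file on disk.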

@sarveshtamba
Author

Thanks for the inputs @michael-o
@plamentotev do you still want me to verify this?

@plamentotev
Member

@michael-o thanks for the tip. It really looks like the character encoding is the problem. Still, it looks like Java can work with such files as expected if Path and URI are used, because the URI has the bytes properly escaped. Since Java 7 is now required, maybe we can look into those "new" APIs to better support use cases like the one reported here.
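A small sketch of the URI escaping mentioned above, using only `java.net.URI`. The class name `EscapedFileUri` and the sample path are hypothetical. It shows only that the escaped string form of a file URI is ASCII and does not depend on the locale; it does not by itself demonstrate that the file can be opened under a non-UTF-8 locale.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class EscapedFileUri {

    /** Builds a file: URI for an absolute path and returns its pure-ASCII form. */
    public static String toEscapedUri(String absolutePath) {
        try {
            // toASCIIString() percent-encodes non-ASCII characters as UTF-8 bytes.
            return new URI("file", null, absolutePath, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // "\u00f1" is "ñ"; its UTF-8 bytes are 0xC3 0xB1.
        System.out.println(toEscapedUri("/tmp/aPi\u00f1ata.txt"));
        // prints file:/tmp/aPi%C3%B1ata.txt
    }
}
```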

@sarveshtamba thanks. I think I understood where the issue is, so no need to verify it.

@sarveshtamba
Author

@plamentotev @michael-o thanks for the inputs.

@jorsol
Contributor

jorsol commented Sep 9, 2021

This is not an issue in plexus-archiver; it is how Java works. Java takes the locale from the operating system, so if the OS is configured with a non-UTF-8 locale, Java will use that, and not even the new Java 7 APIs will help here:

Accented or extended UTF-8 characters cause "Malformed input or input contains unmappable characters" error.

java.nio.file.InvalidPathException: Malformed input or input contains unmappable characters: target/piñata.txt
        at java.base/sun.nio.fs.UnixPath.encode(UnixPath.java:145)
        at java.base/sun.nio.fs.UnixPath.<init>(UnixPath.java:69)
        at java.base/sun.nio.fs.UnixFileSystem.getPath(UnixFileSystem.java:280)
        at java.base/java.io.File.toPath(File.java:2290)

Java 11 won't support setting sun.jnu.encoding to UTF-8 via the command line to use UTF-8 for encoding file paths. It will silently ignore it and will not have any effect.

So the only possible and real solution is to use the correct locale, at least C.UTF-8.
LANG=C.UTF-8 mvn verify

So, as this is a Java thing, even trying to clean the project with LANG=C mvn clean will fail if the target directory contains a UTF-8 encoded filename.
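The failure mode in the stack trace above can be probed without touching the file system. The sketch below is a hypothetical helper (`PathProbe` is not part of any library) that asks the default file system whether a name can be turned into a `Path` and catches the `InvalidPathException` shown above. Its result for non-ASCII names depends on the JDK version and on the locale the JVM was started with, so no output is claimed for that case.

```java
import java.nio.file.InvalidPathException;
import java.nio.file.Paths;

public class PathProbe {

    /** Returns true if the default file system accepts the name as a path. */
    public static boolean isRepresentable(String fileName) {
        try {
            Paths.get(fileName);
            return true;
        } catch (InvalidPathException e) {
            // Thrown, for example, when the name contains characters that the
            // platform's file name encoding cannot map.
            return false;
        }
    }

    public static void main(String[] args) {
        // Compare the output of `LANG=C java PathProbe` and `LANG=C.UTF-8 java PathProbe`.
        System.out.println("plain.txt:   " + isRepresentable("plain.txt"));
        System.out.println("pi\u00f1ata.txt: " + isRepresentable("pi\u00f1ata.txt"));
    }
}
```

A test suite could call such a probe with its own fixture names and skip the affected tests, with a warning, when it returns false.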

@michael-o
Member

UnixPath operates on raw bytes, WindowsPath on wchar_t. I still think those tests need to be skipped with a warning.

@jorsol
Contributor

jorsol commented Sep 9, 2021

> UnixPath operates on raw bytes, WindowsPath on wchar_t. I still think those tests need to be skipped with a warning.

Showing a warning that a UTF-8 locale is required would work, but it just hides the real issue.

@michael-o
Member

> > UnixPath operates on raw bytes, WindowsPath on wchar_t. I still think those tests need to be skipped with a warning.
>
> Well, just to show a warning message that a utf-8 locale is required, it works, but just hides the real issue.

Correct, but unfortunately I don't see a better portable way on POSIX-like systems.
