New Check: Utf8EncodingCheck #265

Closed
rdiachenko opened this Issue Sep 2, 2014 · 4 comments

rdiachenko Sep 2, 2014

Contributor

@romani commented on Jun 7

A byte order mark is not required for UTF-8 files - http://en.wikipedia.org/wiki/Byte-order_mark#UTF-8

23:11 ~/java/git-others/checkstyle/checkstyle [master|✔] $ sudo apt-get install moreutils
.....
23:11 ~/java/git-others/checkstyle/checkstyle [master|✔] $ file -i pom.xml 
pom.xml: application/xml; charset=utf-8
23:11 ~/java/git-others/checkstyle/checkstyle [master|✔] $ file -i import-control.xml 
import-control.xml: application/xml; charset=us-ascii
23:12 ~/java/git-others/checkstyle/checkstyle [master|✔] $ isutf8 pom.xml 
23:12 ~/java/git-others/checkstyle/checkstyle [master|✔] $ isutf8 import-control.xml 
23:12 ~/java/git-others/checkstyle/checkstyle [master|✔] $ xxd pom.xml | head -2 
0000000: 3c3f 786d 6c20 7665 7273 696f 6e3d 2231  <?xml version="1
0000010: 2e30 2220 656e 636f 6469 6e67 3d22 5554  .0" encoding="UT
23:13 ~/java/git-others/checkstyle/checkstyle [master|✔] $ xxd import-control.xml | head -2 
0000000: 3c3f 786d 6c20 7665 7273 696f 6e3d 2231  <?xml version="1
0000010: 2e30 223f 3e0a 3c21 444f 4354 5950 4520  .0"?>.<!DOCTYPE 

We might need to port the "isutf8" application from C to Java; the sources are at https://joeyh.name/code/moreutils/ , file "isutf8.c".

Attention: we cannot force the use of UTF-8 only! Plain ASCII is preferable and should be accepted, see my example above.

We might need to use jChardet (http://jchardet.sourceforge.net/), which could give us full-featured detection of most encodings (not only UTF-8).
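
As a rough illustration of what such a check could look like in Java, here is a minimal sketch that validates bytes with the JDK's own strict UTF-8 decoder. It is not a port of isutf8.c, and the class and method names (`Utf8Sketch`, `isValidUtf8`, `isAscii`) are hypothetical, not part of Checkstyle or moreutils. Pure ASCII input passes the UTF-8 check as well, which matches the requirement that ASCII files be accepted.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Checkstyle or moreutils.
final class Utf8Sketch {
    private Utf8Sketch() { }

    /** Returns true if the bytes are well-formed UTF-8 (pure ASCII included). */
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Returns true if every byte is in the 7-bit ASCII range (0x00 to 0x7F). */
    static boolean isAscii(byte[] bytes) {
        for (byte b : bytes) {
            if (b < 0) { // Java bytes are signed, so 0x80..0xFF appear as negative values
                return false;
            }
        }
        return true;
    }
}
```

Note that this only answers "is this well-formed UTF-8?". It cannot tell which other encoding a non-UTF-8 file uses.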

rdiachenko Sep 2, 2014

Contributor

@maxvetrenko commented on Aug 31

I read that InputStream uses the operating system's encoding. All libraries read bytes from an InputStream, so all the bytes are already encoded in the operating system's encoding.
I ran into the same problem: http://stackoverflow.com/questions/8305635/javahow-can-i-get-the-encoding-from-inputstream
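
For reference, as far as the JDK API goes, an InputStream hands back raw bytes unchanged. A charset only comes into play when those bytes are decoded into characters, for example by `InputStreamReader` or `new String(bytes)`, which fall back to the platform default charset when none is given. The sketch below illustrates this with the two bytes `C3 A9`; the class and method names (`StreamDecodingSketch`, `readAll`, `decode`) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration only.
final class StreamDecodingSketch {
    private StreamDecodingSketch() { }

    /** Reads the whole stream as raw bytes; no charset is involved at this stage. */
    static byte[] readAll(InputStream in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Decoding is the step where a charset is chosen. */
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        byte[] original = {(byte) 0xC3, (byte) 0xA9};
        byte[] read = readAll(new ByteArrayInputStream(original));
        System.out.println(Arrays.equals(original, read));                      // true
        System.out.println(decode(read, StandardCharsets.UTF_8).length());      // 1
        System.out.println(decode(read, StandardCharsets.ISO_8859_1).length()); // 2
    }
}
```

The same two bytes decode without error under both charsets but yield different text, which is why the bytes alone do not reveal the encoding they were written in.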

rdiachenko Sep 2, 2014

Contributor

Here's my investigation of encoding detection by:

  1. the Linux command "file -ib <file>"
  2. juniversalchardet (https://code.google.com/p/juniversalchardet/)
  3. jChardet (http://jchardet.sourceforge.net/)

| Actual encoding | file -ib <file> | juniversalchardet | jChardet |
|---|---|---|---|
| Windows-1250 | charset=unknown-8bit | WINDOWS-1252 | windows-1252 |
| ISO8859-2 | charset=iso-8859-1 | ISO-8859-7 | ISO-8859-7 |
| CP866 | charset=iso-8859-1 | ISO-8859-5 | windows-1252 |
| KOI8-R | charset=utf-8 | UTF-8 | UTF-8 |
| GBK | charset=iso-8859-1 | IBM866 | [UTF-16BE, Big5, GB18030, UTF-16LE] |
| SHIFT_JIS | charset=utf-8 | UTF-8 | UTF-8 |
| ISO2022-KR | charset=utf-8 | UTF-8 | UTF-8 |
| UTF-8 | charset=us-ascii | No encoding detected | ASCII |

As input I used files in different encodings, each with content appropriate to that encoding, on Linux (Fedora). The output may differ on Windows.

We can't say for sure what a file's encoding is. This is not a task for Checkstyle.

rdiachenko Sep 2, 2014

Contributor

Won't fix
