Ensure we use UTF-8 when outputting and comparing SAX output#696
Conversation
mbeckerle
left a comment
There was a problem hiding this comment.
+1 if checks all pass.
| class DaffodilParseOutputStreamContentHandler(out: OutputStream, pretty: Boolean = false) | ||
| extends ContentHandler with Indentable { | ||
| private val writer = new OutputStreamWriter(out) | ||
| private val writer = new OutputStreamWriter(out, Charset.forName("UTF-8")) |
There was a problem hiding this comment.
You know, it would be super nice if there was a way to automate checks for things like this, where one could setup a rule named "always specify encoding", with a bunch of checks of constructors like this, and methods, basically deprecating the flavors that don't specify an encoding.
Ever heard of such a tool?
There was a problem hiding this comment.
I can't find a tool like that, but it really is need. Java allowing encodings to be optional really seems like a mistake, and the fact that it's also allow to be a string makes it even more difficult to find all the locations. I did some greps, but getting a sane number of results to check is difficult. We really need compiler help to find them all.
There was a problem hiding this comment.
For future reference, you can get a sane number of results to check by using ripgrep with options like rg -g \*.scala -i --sort=path utf-8 which avoids false positives from all the XML-type files but finds all the occurrences of utf-8 in Scala files. Now only half of the results look like false positives and the remaining results look like places that should be checked although I'm not sure if some APIs accept Charset instead of string.
There was a problem hiding this comment.
SonarQube (or other code quality scanners) may have a feature for this.
What we need sounds pretty simple and broadly useful: deprecate a method by rule from outside. I.e., "don't use this method" from Java or Scala (or anything else that can call it)
There is a plug-in way to create custom checkers, but they're language-being-scanned specific, not sure they have a scala API for this.
| class DaffodilParseOutputStreamContentHandler(out: OutputStream, pretty: Boolean = false) | ||
| extends ContentHandler with Indentable { | ||
| private val writer = new OutputStreamWriter(out) | ||
| private val writer = new OutputStreamWriter(out, Charset.forName("UTF-8")) |
There was a problem hiding this comment.
StandardCharsets.UTF_8 Is a constant. I think stylistically preferable to calling forName with a string.
There was a problem hiding this comment.
Agreed, I've updated most places where we use Charset.forname, except for those where the charset comes from a schema or something dynamic.
tuxji
left a comment
There was a problem hiding this comment.
+1
Yes, it would be nice if we had a tool to automate checking IO code for missing Charset parameters. Agree about StandardCharsets.UTF_8 being stylistically preferable here, although we would need more changes (probably out of scope here) if we wanted Daffodil code to be uniform in how it makes IO calls and uses encodings, alas.
Whenever Daffodil outputs an infoset, it always does so with UTF-8 encoding. The one exception to this when converting SAX evensts to in the CLI and TDML runner. This means that depending on a users system encoding, tests may act differently and lead to failure. This kind of system-dependent behavior is not desierable. So instead, thiss modifies SAX related output to use UTF-8 so that a users environment does not affect the results of tests or results of a parser. Also use StandardCharsets instead of a encoding name string where possible. DAFFODIL-2600
beb5cd3 to
5ac458f
Compare
Whenever Daffodil outputs an infoset, it always does so with UTF-8
encoding. The one exception to this when converting SAX evensts to in
the CLI and TDML runner. This means that depending on a users system
encoding, tests may act differently and lead to failure. This kind of
system-dependent behavior is not desierable. So instead, thiss modifies
SAX related output to use UTF-8 so that a users environment does not
affect the results of tests or results of a parser.
DAFFODIL-2600