Daffodil 2663 pluggable charsets#768
Conversation
mbeckerle
left a comment
There was a problem hiding this comment.
Good work so far. I'll review the tests more on next round.
| * A charset transformer is created at runtime as part of a single parse/unparse call. | ||
| * Hence, they can be stateful without causing thread-safety issues. | ||
| */ | ||
| abstract class CharsetTransformer(charsetName: String){ |
There was a problem hiding this comment.
I think you can collapse CharsetTransformer and have a CharsetTransformer just be a BitsCharset.
Unless there is a need for CharsetTransformer, the added indirection is misleading, as people will do what I just did, which is search for the reason it is needed.
| .newInstance() | ||
| if (cs2 == null) { | ||
| tci.schemaDefinitionError( | ||
| "Unsupported encoding: %s. Supported encodings: %s", |
There was a problem hiding this comment.
I think you should just Assert.invariant(cs2 ne null) because the find() call above will SDE if the encoding cannot be found, and the other calls are never supposed to return null. So this schemaDefinitionError call will never be hit.
That will get rid of your codeCov errors.
| # layering, but one should expect to have to adapt their code substantially. | ||
| # | ||
| org.apache.daffodil.charsets.CustomNonByteSizeCharsetCompiler | ||
| org.apache.daffodil.charsets.CustomJavaCharsetCompiler No newline at end of file |
There was a problem hiding this comment.
Add missing final line ending
| maybeCompiler.get | ||
| } | ||
| } | ||
| } No newline at end of file |
| val maybeCompiler: Option[CharsetCompiler] = charsetCompilerMap.get(name) | ||
| if (maybeCompiler.isEmpty) { | ||
| val choices = charsetCompilerMap.keySet.mkString(", ") | ||
| context.SDE("The charset '%s' was not found. Available choices are: %s", name, choices + ", " + CharsetUtils.supportedEncodingsString) |
There was a problem hiding this comment.
Messages should not use the term "charset", but either "encoding" or "charset encoding". The XML term for 'charset' is 'encoding' and DFDL schema authors know the concept by that name.
There was a problem hiding this comment.
Mike also said he would like Daffodil's current charsets re-implemented as pluggable charsets, so you'll probably need to remove CharsetUtils.supportedEncodingsString from this line.
| * The state is passed in order to provide diagnostic context if not found. | ||
| */ | ||
| def find(name: String, context: ThrowsSDE): CharsetCompiler = { | ||
| val maybeCompiler: Option[CharsetCompiler] = charsetCompilerMap.get(name) |
There was a problem hiding this comment.
We have the standard scala Option type, and our own more efficient but less rich Maybe type.
In the fast paths of the runtime we try to use Maybe, but elsewhere we use Option. You are using Option here which is right. I just suggest name this variable optCompiler for consistency.
| @@ -95,10 +96,23 @@ abstract class CharsetEvBase(encodingEv: EncodingEvBase, tci: DPathCompileInfo) | |||
| val encString = encodingEv.evaluate(state) | |||
| val cs = CharsetUtils.getCharset(encString) | |||
There was a problem hiding this comment.
I'd like to get rid of all the built-in charsets, and implement them all as CharsetCompiler (with meta-inf, etc.) so there is only one way a charset is found/resolved, etc.
They would still all be part of daffodil-runtime1, just they would be detected and loaded using the dynamic load SPI mechanism, just like any user-supplied charset.
It is quite confusing to have two such mechanisms, when there is no good reason not to just have one.
So CharsetUtils.getCharset will go away, or perhaps become the code that does what you are doing here.
DaffodilCharsetProvider would also go away, or morph similarly.
| import org.apache.daffodil.io.FormatInfo | ||
|
|
||
| final class CustomJavaCharset( | ||
| name2: String) |
There was a problem hiding this comment.
Strange indentation.
When there are only a small number of args, we tend not to do the thing where args each get a line, as it reduces code density for no adantage. At 3+ args, sure, because it will line wrap anyway. But just 1 or 2 args keep args on one line.
| extends CharsetTransformer( | ||
| name2){ | ||
|
|
||
| def bitsCharset = BitsCharsetISO885913 |
There was a problem hiding this comment.
So this is named CustomJavaCharset fot the test, but the bitsCharset is wired to BitsCharsetISO885913 ?
I think CharsetTransformers should go away entirely, but until then I would named this BitsCharsetISO885913Transformer, and why is name2 being ignored?
If you ignore an arg like this, probably a comment in the code is required.
| val byte = getByte(dis, 0) | ||
| byte.toChar | ||
| } | ||
| } No newline at end of file |
tuxji
left a comment
There was a problem hiding this comment.
Mostly amplifying Mike's suggestions for improving the pull request, including re-implementing Daffodil's own charsets as pluggable charsets.
| val maybeCompiler: Option[CharsetCompiler] = charsetCompilerMap.get(name) | ||
| if (maybeCompiler.isEmpty) { | ||
| val choices = charsetCompilerMap.keySet.mkString(", ") | ||
| context.SDE("The charset '%s' was not found. Available choices are: %s", name, choices + ", " + CharsetUtils.supportedEncodingsString) |
There was a problem hiding this comment.
Mike also said he would like Daffodil's current charsets re-implemented as pluggable charsets, so you'll probably need to remove CharsetUtils.supportedEncodingsString from this line.
| abstract class CharsetTransformerFactory(val name: String) | ||
| extends Serializable { | ||
|
|
||
| def newInstance(): CharsetTransformer |
There was a problem hiding this comment.
Mike said CharsetTransformer should simply be BitsCharset, so you probably want to rename CharsetTransformerFactory to BitsCharsetFactory.
| abstract class CharsetTransformerFactory(val name: String) | ||
| extends Serializable { | ||
|
|
||
| def newInstance(): CharsetTransformer |
There was a problem hiding this comment.
Mike said CharsetTransformer should simply be BitsCharset, so you probably want to rename CharsetTransformerFactory to BitsCharsetFactory.
| # | ||
| # These are partial/subsets of "real" functionality that give us test coverage | ||
| # of the layering features of Daffodil. They can be reused to create "real" | ||
| # layering, but one should expect to have to adapt their code substantially. |
There was a problem hiding this comment.
Those comments were pasted here from another place and don't belong here, or should be updated to talk only about your test charsets.
| # | ||
| # These are partial/subsets of "real" functionality that give us test coverage | ||
| # of the layering features of Daffodil. They can be reused to create "real" | ||
| # layering, but one should expect to have to adapt their code substantially. |
There was a problem hiding this comment.
Those comments were pasted here from another place and don't belong here, or should be updated to talk only about your test charsets.
|
|
||
| <annotation> | ||
| <appinfo source="http://www.ogf.org/dfdl/"> | ||
| <dfdl:format ref="tns:GeneralFormat" initiator="" terminator="" leadingSkip="0" trailingSkip="0" textBidi="no" floating="no" encoding="ASCII" byteOrder="bigEndian" alignment="implicit" alignmentUnits="bits" fillByte="f" occursCountKind="implicit" truncateSpecifiedLengthString="no" ignoreCase="no" representation="text" lengthKind="delimited" nilValueDelimiterPolicy="both" emptyValueDelimiterPolicy="none" documentFinalTerminatorCanBeMissing="yes" initiatedContent="no" separatorSuppressionPolicy="anyEmpty" separatorPosition="infix" textTrimKind="none" /> |
There was a problem hiding this comment.
Please override only properties that really need to be different than GeneralFormat's properties, and define them on multiple lines for readability.
|
|
||
| <annotation> | ||
| <appinfo source="http://www.ogf.org/dfdl/"> | ||
| <dfdl:format ref="tns:GeneralFormat" initiator="" terminator="" leadingSkip="0" trailingSkip="0" textBidi="no" floating="no" encoding="ASCII" byteOrder="bigEndian" alignment="implicit" alignmentUnits="bits" fillByte="f" occursCountKind="implicit" truncateSpecifiedLengthString="no" ignoreCase="no" representation="text" lengthKind="delimited" nilValueDelimiterPolicy="both" emptyValueDelimiterPolicy="none" documentFinalTerminatorCanBeMissing="yes" initiatedContent="no" separatorSuppressionPolicy="anyEmpty" separatorPosition="infix" textTrimKind="none" /> |
There was a problem hiding this comment.
Please override only properties that really need to be different than GeneralFormat's properties.
| xmlns:ex="http://example.com" | ||
| xmlns:tns="urn:org.apache.daffodil.charsets.CustomCharset" defaultRoundTrip="none"> | ||
|
|
||
| <tdml:parserTestCase name="parse_charsets" root="s1" model="org/apache/daffodil/charsets/CustomCharset.dfdl.xsd" roundTrip="none"> |
There was a problem hiding this comment.
We already have defaultRoundTrip="none" above so we don't need roundTrip="none" in each test.
| xmlns:ex="http://example.com" | ||
| xmlns:tns="urn:org.apache.daffodil.charsets.CustomCharset" defaultRoundTrip="none"> | ||
|
|
||
| <tdml:parserTestCase name="parse_charsets" root="s1" model="org/apache/daffodil/charsets/CustomCharset.dfdl.xsd" roundTrip="none"> |
There was a problem hiding this comment.
We already have defaultRoundTrip="none" above so we don't need roundTrip="none" in each test.
|
This now includes commits that are not really part of this change set, but they are showing up as commits in this review. That and... github UI is having trouble. I can't seem to narrow the review to just the last 3 commits. It doesn't seem to allow me to select that. (I can select 2 commits, but not 3) The exact best way to fix this isn't clear. I think the first 2 and last 5 commits should be in this change set, the other commits shouldn't be part of this. |
03fe6dd to
3269145
Compare
mbeckerle
left a comment
There was a problem hiding this comment.
Getting very close. I have only a few more changes I'd like to see.
|
|
||
| package org.apache.daffodil.processors.charset | ||
|
|
||
| //import org.apache.daffodil.processors.charset.BitsCharset |
There was a problem hiding this comment.
Hmmm. How does this work without this import. You are definitely referencing this BitsCharset below.
| # limitations under the License. | ||
| # | ||
|
|
||
| org.apache.daffodil.charsets.BitsCharsetISO88591ReverseCompiler |
There was a problem hiding this comment.
This is a good case for exception to the caml-case naming conventions is warranted.
BitsCharset_ISO_8859_1_ReverseCompiler
BitsCharset_ISO_8859_13_Compiler
Those would be far easier to understand. Being able to pick out the "13" is important.
| <complexType> | ||
| <sequence> | ||
| <element name="e1" dfdl:encoding="ISO-8859-13" dfdl:lengthKind="explicit" dfdl:length="8" type="xsd:string" /> | ||
| <element name="e2" dfdl:encoding="X-DFDL-ISO-88591-8-BIT-PACKED-LSB-FIRST-REVERSE" dfdl:lengthKind="delimited" type="xsd:string" /> |
There was a problem hiding this comment.
Use ISO-8859-1 (missing a hyphen) consistent with ISO-8859-13 above.
| import org.apache.daffodil.processors.charset.BitsCharsetFactory | ||
|
|
||
| object BitsCharsetISO88591Reverse extends{ | ||
| override val name = "X-DFDL-ISO-88591-8-BIT-PACKED-LSB-FIRST-REVERSE" |
| import org.apache.daffodil.processors.charset.CharsetCompiler | ||
| import org.apache.daffodil.processors.charset.BitsCharsetFactory | ||
|
|
||
| object BitsCharsetISO885913 extends { |
There was a problem hiding this comment.
Call this BitsCharsetTest8859_13 or something.
So this is half a charset.
But this makes your tests wierd. Since they're asymmetric. Parse it behaves as 8859-1, unparse as 8859-13.
You can fix this easily because you just need to define decodeOneChar so that it does the lookup of the char in a length 256 string having the chars corresponding to bytes 0 to 255 of iso-8859-13, which you can do via
val decodeString = {
val bytes = ByteBuffer.wrap((0 to 255).map{ _.toByte }.toArray)
Charset.forName("iso-8859-13").newDecoder().decode(bytes).toString
}
protected override def decodeOneChar(dis: InputSourceDataInputStream, finfo: FormatInfo): Char = {
val byte = getByte(dis, 0)
decodeString(byte)
}
|
|
||
|
|
||
| final class BitsCharsetISO88591ReverseCompiler | ||
| extends CharsetCompiler("X-DFDL-ISO-88591-8-BIT-PACKED-LSB-FIRST-REVERSE") { |
| <tdml:infoset> | ||
| <tdml:dfdlInfoset> | ||
| <s1> | ||
| <e1>À1234567</e1> |
There was a problem hiding this comment.
Add comment that first character is Unicode Code point 0xC0, LATIN CAPITAL LETTER A WITH GRAVE. Just in case someone is lacking a font that displays it.
daffodil-test/src/test/resources/org/apache/daffodil/charsets/CustomCharset.tdml
Show resolved
Hide resolved
| <tdml:dfdlInfoset> | ||
| <s2> | ||
| <e1>À1234567</e1> | ||
| <e2>À1234567</e2> |
There was a problem hiding this comment.
First char inside should be Ą once you have a real 8859-13 decoder, which I suggest you fix below in comments on the scala code for this charset.
Note to other reviewers - this got fixed - it's now just the right set of commits. |
3269145 to
f1a4187
Compare
tuxji
left a comment
There was a problem hiding this comment.
I think your pull request looks pretty close to completion. I have proposed a new naming strategy that would make your pull request even better. If the new naming strategy makes sense, the changes won't take too much time if you use your IDE's refactoring support to make the changes easier to do.
|
|
||
| org.apache.daffodil.processors.charset.ASCIICompiler | ||
| org.apache.daffodil.processors.charset.ASCIICompilerAlias | ||
| org.apache.daffodil.processors.charset.BitsCharsetAISPayloadArmoringCompiler |
There was a problem hiding this comment.
I have looked at this pull request as carefully as I had time to do and I think there are better ways to name some things. I know that Mike suggested doing the same thing as daffodil-runtime1/src/main/scala/org/apache/daffodil/layers/Layer* but I think copying Compiler and Transformer from the Layer* classes is going too far. These names make some sense for layer classes which transform data, but make less sense for classes which only create new BitsCharset objects.
I suggest that we use this naming strategy instead:
BitsCharset - Daffodil's primary abstraction for dealing with character sets
BitsCharsetFactory - Daffodil's abstraction for creating a new BitsCharset (no Transformer)
BitsCharsetDefinition - Daffodil's abstraction for creating a new BitsCharsetFactory (not CharsetCompiler)
BitsCharsetUSASCII - charset for "US-ASCII" and "ASCII" character sets
BitsCharsetUSASCIIFactory - factory for creating a new BitsCharsetUSACII (no Transformer)
BitsCharsetUSASCIIDefinition - definition for creating a new BitsCharsetUSASCIIFactory from "US-ASCII"
BitsCharsetUSASCIIAliasDefinition - alias definition for creating a new BitsCharsetUSASCIIFactory from "ASCII"
Given this new naming strategy, I would make the first three lines of this file look like this and replace Compiler with Definition in the rest of the lines:
org.apache.daffodil.processors.charset.BitsCharsetUSASCIIDefinition
org.apache.daffodil.processors.charset.BitsCharsetUSASCIIAliasDefinition
org.apache.daffodil.processors.charset.BitsCharsetAISPayloadArmoringDefinition
Then I would sort all the definitions' lines alphabetically to establish a standard ordering of definitions in this file. Finally, I would rename this file from CharsetCompiler to BitsCharsetDefinition.
| final class BitsCharsetAISPayloadArmoringCompiler | ||
| extends CharsetCompiler("X-DAFFODIL-AIS-PAYLOAD-ARMORING") { | ||
|
|
||
| override def compileCharset() = { | ||
| new BitsCharsetAISPayloadArmoringTransformerFactory(name) |
There was a problem hiding this comment.
Given our new naming strategy, I would make these lines look like this,
final class BitsCharsetAISPayloadArmoringDefinition
extends BitsCharsetDefinition("X-DAFFODIL-AIS-PAYLOAD-ARMORING") {
override def newFactory() = {
new BitsCharsetAISPayloadArmoringFactory
}
}Note that I would not pass name to the factory because I see no justification for passing it. The factory only creates one charset statically chosen at compile time; it's not supposed to look at name and create one of several possible charsets dynamically chosen at runtime. Do you agree, Mike?
There was a problem hiding this comment.
Keep in mind if BitsCharsetAISPayloadArmoringFactory can ever fail and issue a diagnostic, the schema author/user knows this as "X-DAFFODIL-AIS-PAYLOAD-ARMORING", and that's the meaningful name to them for use in a diagnostic. That's the only reason I can think of for passing the name.
There was a problem hiding this comment.
name() is required for the SimpleNamedServiceLoader, however it does not need to be passed, the implementing class can just set it. I will change it
| } | ||
| } | ||
|
|
||
| class BitsCharsetAISPayloadArmoringTransformerFactory(name: String) |
There was a problem hiding this comment.
final class BitsCharsetAISPayloadArmoringFactory()I would add 'final' (unless there is a reason to subclass factories?), remove Transformer, and remove name: String since I see no reason to pass a string that's never used inside the class and I can't imagine it would be useful for debugging either.
| final class BitsCharsetBase4LSBFCompiler | ||
| extends CharsetCompiler("X-DFDL-BASE4-LSBF") { | ||
|
|
||
| override def compileCharset() = { | ||
| new BitsCharsetBase4LSBFTransformerFactory(name) | ||
| } | ||
| } | ||
|
|
||
| class BitsCharsetBase4LSBFTransformerFactory(name: String) |
There was a problem hiding this comment.
Likewise, I would make similar changes to the rest of the new definitions in this pull request (Compiler -> Definition, compileCharset -> newFactory, TransfomerFactory -> Factory, add final, remove name):
final class BitsCharsetBase4LSBFDefinition
extends BitsCharsetDefinition("X-DFDL-BASE4-LSBF") {
override def newFactory() = {
new BitsCharsetBase4LSBFFactory
}
}
final class BitsCharsetBase4LSBFFactory()| @@ -0,0 +1,33 @@ | |||
|
|
|||
There was a problem hiding this comment.
Please rename this file from BitsCharsetTransformerFactory.scala to BitsCharsetFactory.scala and remove the first line (it's just a blank line).
|
|
||
| object CharsetCompilerRegistry { | ||
|
|
||
| private lazy val charsetCompilerMap: Map[String, CharsetCompiler] = |
There was a problem hiding this comment.
Change to
private lazy val definitionMap: Map[String, BitsCharsetDefinition] =
SimpleNamedServiceLoader.loadClass[BitsCharsetDefinition](classOf[BitsCharsetDefinition])Update rest of file likewise.
| def SDE(id: String, args: Any*): Nothing = { | ||
| throw new Exception(id.format(args: _*)) | ||
| } |
| def getCharset(name: String): BitsCharset = { | ||
| val cs = DaffodilCharsetProvider.charsetForName(name) | ||
| cs | ||
| CharsetCompilerRegistry.find(name).compileCharset.newInstance() |
There was a problem hiding this comment.
Will become
BitsCharsetDefinitionRegistry.find(name).newFactory().newInstance()Not sure if Scala conventions would keep empty parentheses or not.
...l-runtime1/src/main/resources/META-INF/services/org.apache.daffodil.charsets.CharsetCompiler
Outdated
Show resolved
Hide resolved
| # | ||
|
|
||
| org.apache.daffodil.charsets.BitsCharset_ISO_8859_1_Reverse_Compiler | ||
| org.apache.daffodil.charsets.BitsCharsetTest_ISO_8859_13_Compiler |
There was a problem hiding this comment.
Rename file from CharsetCompiler to BitsCharsetDefinition. Replace Compiler with Definition.
| <tdml:dfdlInfoset> | ||
| <s1> | ||
| <!-- First char in e1 is unicode char 0xC0 captial A with grave --> | ||
| <e1>Ą1234567</e1> |
There was a problem hiding this comment.
You have the comment for the wrong char. This one is the 0x104 captial A with Greek Ogonek. (Ą)
| <tdml:infoset> | ||
| <tdml:dfdlInfoset> | ||
| <s2> | ||
| <!-- First char in e1 is unicode char 0x104 captial A with Greek Ogonek --> |
There was a problem hiding this comment.
I think you have these two comments backward. This first one is the A with Grave.
| <!-- First char in e1 is unicode char 0x104 captial A with Greek Ogonek --> | ||
| <e1>À1234567</e1> | ||
| <!-- First char in e2 is unicode char 0xC0 captial A with grave --> | ||
| <e2>Ą1234567</e2> |
There was a problem hiding this comment.
This second one is the A with greek ogonek.
daffodil-io/src/main/scala/org/apache/daffodil/processors/charset/IBM037.scala
Show resolved
Hide resolved
|
corrected typo closed the branch, reopening with typo to preserve comments |
mbeckerle
left a comment
There was a problem hiding this comment.
+1 This is an excellent change set.
daffodil-io/src/main/scala/org/apache/daffodil/processors/charset/IBM037.scala
Show resolved
Hide resolved
tuxji
left a comment
There was a problem hiding this comment.
Excellent progress! Still needs some changes to ensure error paths are handled appropriately, though.
| } | ||
| } | ||
|
|
||
| class BitsCharsetBase4LSBFFactory() |
| * Must be implemented by all charsets. | ||
| * |
There was a problem hiding this comment.
Please remove these lines. I'd meant to ask you to move the first sentence after the second sentence, and you've done that so these lines are redundant.
| import org.apache.daffodil.exceptions.ThrowsSDE | ||
| import org.apache.daffodil.util.SimpleNamedServiceLoader | ||
|
|
||
| object BitsCharsetDefinitionRegistry { |
There was a problem hiding this comment.
It would be wonderful to add a comment here briefly explaining what this object is used for, something like "Finds all pluggable BitCharsets and makes them available to Daffodil."
P.S. In later comments, I'm now suggesting that you chop Registry off BitsCharsetDefinitionRegistry's name and move it to BitsCharsetDefinition.scala.
| } | ||
| } | ||
|
|
||
| def SDE(id: String, args: Any*): Nothing = { |
There was a problem hiding this comment.
You'll want to make this helper function private.
P.S. In later comments, I'm now suggesting that you change find(name) to return an Option[BitsCharsetDefinition] instead of throwing a SDE, move this private SDE function to CharsetUtils.scala, and call it in CharsetUtils.getCharset(name) instead.
| throw new Exception(id.format(args: _*)) | ||
| } | ||
|
|
||
| def supportedEncodingsString = {bitsCharsetDefinitionMap.keySet.mkString(", ")} |
There was a problem hiding this comment.
When you have a one-liner, you don't need braces, and it would be nice to add a test to TestBitsCharsetDefinition.scala which calls this function and verifies the existence of an encoding's name with a comma within the returned string to check the string has the right content. While you're at it, let's also add two negative tests which call each find method with an impossible to find encoding in order to put the error paths under test coverage.
| extends Serializable { | ||
|
|
||
| def newInstance(): BitsCharset | ||
| } |
There was a problem hiding this comment.
Just a thought. In Scala, we don't have to put each trait/object/class in its own file like we do in Java. You could define BitsCharsetDefinition, BitsCharsetDefinitionRegistry, and BitsCharsetFactory in the same file (BitsCharsetDefinition.scala) and that'll mean the extra tests logically belong in TestBitsCharsetDefinition.scala (no need to add another test class). Also, you can remove newInstance()'s parentheses.
|
|
||
| def name(): String | ||
|
|
||
| def newFactory(): BitsCharsetFactory |
There was a problem hiding this comment.
Since these methods have no side effects and no arguments, we can remove the parentheses.
There was a problem hiding this comment.
parentheses for name are required for simplenamedserviceloader consistency
| def getCharset(name: String): BitsCharset = { | ||
| val cs = DaffodilCharsetProvider.charsetForName(name) | ||
| cs | ||
| BitsCharsetDefinitionRegistry.find(name).newFactory.newInstance |
There was a problem hiding this comment.
I just realized something concerning about this change to Charset.getCharset. The scaladoc contract for DaffodilCharsetProvider.charsetForName(name) promised it would return null if the charset couldn't be found. However, the scaladoc contract for BitsCharsetDefinitionRegistry.find(name) (well, there isn't one because you haven't written it yet, but if you did base it on find(name)'s existing code, it would say) promises it will throw a SDE if the charset can't be found. We've just changed Charset.getCharset(name)'s behavior on its error path, but there are two places in Daffodil code which still check if getCharset returns null on its error path:
daffodil-runtime1/src/main/scala/org/apache/daffodil/processors/EvEncoding.scala
98: tci.schemaDefinitionError("Unsupported encoding: %s. Supported encodings: %s", encString, CharsetUtils.supportedEncodingsString)
daffodil-tdml-lib/src/main/scala/org/apache/daffodil/tdml/TDMLRunner.scala
2115: case null => Assert.usageError("Unsupported encoding: " + encodingName + ". Supported encodings: " + CharsetUtils.supportedEncodingsString)
Both of these error paths throw exceptions as well, so you can simply delete their code since they're now unnecessary. (P.S. I see you've already deleted the error path code from EvEncoding.scala but TDMLRunner.scala still has error path code.) I think it's better to guarantee by contract that getCharset will return a Charset or throw a SDE since too many existing places just call CharsetUtils.getCharset(name) without checking for a null - these places are sure to throw a NPE on the error path anyway.
However, I think we want to change BitsCharsetDefinitionRegistry.find(name) so it returns Option[BitsCharsetDefinition] instead of throwing a SDE and throw the SDE here from getCharset instead. Someone may have an use case that requires them to programmatically look up an user-specified encoding and use that charset if it can be found or else fall back to a default charset without throwing a SDE. Do you agree, @mbeckerle and @stevedlawrence?
P.S. If we're going to expect someone to call BitsCharsetDefinitionRegistry.find(name) directly, I think we ought to chop Registry off the object's name; don't you think BitsCharsetDefinition.find(name) is a more readable call (especially since it will return Option[BitsCharsetDefinition])?
There was a problem hiding this comment.
There does seem to be precedent for find() methods to return Option types in scala, and since getCharset provides a single place for the error throw, that does seem to be a useful change.
I don't like stripping Registry suffix off the names of the registry. Something needs to indicate that this is a collection of BitsCharsetDefinitions that you are finding a charset within. I don't like just using plurality (BitsCharsetDefinions plural).
The "Registry" suffix connotes that these are gathered from some diverse places and "registered" which they are. So I think this convention, despite the name getting long-ish, is useful.
There was a problem hiding this comment.
OK, let's keep the Registry suffix then.
| } | ||
| } | ||
|
|
||
| class BitsCharsetHexLSBFFactory() |
stevedlawrence
left a comment
There was a problem hiding this comment.
Nice change! Couple questions on deprecation ability and maybe we can remove the factory indirection?
| org.apache.daffodil.processors.charset.BitsCharset6BitDFI264DUI001Definition | ||
| org.apache.daffodil.processors.charset.BitsCharset6BitDFI311DUI002Definition | ||
| org.apache.daffodil.processors.charset.BitsCharset6BitICAOAircraftIDAliasDefinition | ||
| org.apache.daffodil.processors.charset.BitsCharset6BitICAOAircraftIDDefinition |
There was a problem hiding this comment.
Now that we have this capability for schemas to provide their own charset implementations, does it make sense to deprecate some of these that are very specific to certain schema? Perhaps we output a warning when certain charsets are used in a schema and nudge users to move these to some other package?
Doesn't have to be done as part of this, but maybe a separate issue/PR?
There was a problem hiding this comment.
I'd prefer deprecating charsets to be done as a separate issue/PR. Whoever is in the best place to know which charsets should be deprecated (Mike or Steve?) should list these charsets in the issue, and then any committer can work on the PR.
|
|
||
| def name(): String | ||
|
|
||
| def newFactory(): BitsCharsetFactory |
There was a problem hiding this comment.
Is there a use case for the Factory indirection? Based on the comment it seems to be related to serialization, but I don't see anywhere where we store the created factories where they could be serialized.
The only use I see is find(...).newFactory().newInstance(). So we create a factory, create an instance from that factory, and throw away the factory. I wonder if we can simplify this and just treat BitsCharsetDefinition as if it were the factory--it can have the newInstance function that is directly called when a new instance of a charset is needed?
There was a problem hiding this comment.
Well, the comments about serialization were copied from the LayerCompiler classes. It probably is true that Daffodil processors store / serialize layer classes in such a way that this indirection makes sense, but it may not be true that Daffodil processors need this indirection for BitCharset classes at all. @alexanderrevello, try removing the indirection and see if all the tests (including integration/CLI tests) still pass. If so, then Steve is right and the indirection is unneeded.
There was a problem hiding this comment.
Indirection is not needed and has been removed
| * Factory for a Daffodil charset. | ||
| * | ||
| * This is the serialized object which is saved as part of a processor. | ||
| * It constructs a charset at runtime when newInstance() is called. |
There was a problem hiding this comment.
Is this true anymore?Mentioned in another comment, but I don't see this factory being stored anywhere to be serialized? I wonder if things changed so we only serialized the name of a charset, and on deserialization we look up the charset by that string and create a new instance?
f66908d to
1cc18db
Compare
| cs | ||
| val cs = BitsCharsetDefinitionRegistry.find(name).getOrElse(null) | ||
| if (cs == null) | ||
| SDE("The encoding '%s' was not found. Available choices are: %s", name, BitsCharsetDefinitionRegistry.supportedEncodingsString) |
There was a problem hiding this comment.
This changes the behavior of this function. In the past it returned null if the encoding wasn't found and the caller was expected to handle it correctly. I'd guess we want to keep that behavior, since some callers might just want to know if a charset exists or not, and do something other than error if not.
Maybe something like this instead?
val cs = BitsCharsetDefinitionRegistry.find(name).map(_.newInstance).getOrElse(null)
csThere was a problem hiding this comment.
I think this SDE should be removed. This function should either return a charset or null. The callers will determine how to handle null.
|
|
||
| private def SDE(id: String, args: Any*): Nothing = { | ||
| throw new Exception(id.format(args: _*)) | ||
| } |
There was a problem hiding this comment.
We usually avoid throwing generic Exception's. It often makes it hard to know how to best handle the error. There's also no context associated with this SDE, like schema file or line number, which most SDE's have. I think that's because it just doesn't exist here. Which makes me think more that CharsetUtils should not be throwing any exceptions/SDE's but should instead rely on callers to handle that.
There was a problem hiding this comment.
I suggest get rid of SDE here. (remove the method) You don't have enough context to know whether something is a schema-definition-error or some other kind of error at this point. These are just charset utils, general purpose code.
You either need to define a EncodingUnavailableException, or you should just be returning null/None or such to indicate the lookup failed. I guess I think the latter is preferable.
There is a method to return a list of the available charsets so returning null is preferable, and the caller can create a sensible diagnostic message.
| # layering, but one should expect to have to adapt their code substantially. | ||
| # | ||
| org.apache.daffodil.processors.charset.CustomNonByteSizeCharsetCompiler | ||
| org.apache.daffodil.processors.charset.CustomJavaCharsetCompiler No newline at end of file |
There was a problem hiding this comment.
The comments mention that these charsets are defined in src/test/resources since they are only used for tests. But this services file is in src/main/resources. I think this service file also needs to be in src/test/resources so that these charsets are only loaded during tests.
There was a problem hiding this comment.
I dont' think this comment was resolved. This file should be in src/test/resources, right?
There was a problem hiding this comment.
file was removed as it was no longer needed. Correct version is in the test folder
|
|
||
| override def compute(state: ParseOrUnparseState) = { | ||
| val encString = encodingEv.evaluate(state) | ||
| val cs = CharsetUtils.getCharset(encString) |
There was a problem hiding this comment.
Should this still use CharsetUtils, and CharsetUtils should be the only thing that uses the registry? This way we can easily change how the registry works if we ever need to since it's hidden behind CharsetUtils
There was a problem hiding this comment.
This comment wasn't addressed.
| tci.schemaDefinitionError("Unsupported encoding: %s. Supported encodings: %s", encString, CharsetUtils.supportedEncodingsString) | ||
| val bcd = BitsCharsetDefinitionRegistry.find(encString).getOrElse(null) | ||
| if (bcd == null) { | ||
| tci.schemaDefinitionError("Unsupported encoding: %s. Supported encodings: %s", encString, BitsCharsetDefinitionRegistry.supportedEncodingsString) |
There was a problem hiding this comment.
This isn't your change, but supportedEncodingString is potentially very large (we have a lot of encodings) so this could be very verbose. I wonder if we should instead just output the unsuported encoding? We can modify the registry so that it does something like log.debug to output the encodings so that if a user really can't figure it out they can enable debug logs to get that information.
There was a problem hiding this comment.
I think it's better to be verbose here.
I think there are two errors. One the user didn't get the charset name right, so it isn't found.
The other is the user does have the name right, but the list of charsets doesn't contain the charset because they didn't get the classpath right.
The latter error you need to see that the charset you seek isn't in the list at all. So the verbose list is needed.
So I'm ok with this as is.
| object BitsCharsetDefinitionRegistry { | ||
|
|
||
| private lazy val bitsCharsetDefinitionMap: Map[String, BitsCharsetDefinition] = | ||
| SimpleNamedServiceLoader.loadClass[BitsCharsetDefinition](classOf[BitsCharsetDefinition]) |
There was a problem hiding this comment.
All charsets have a name, but some also have aliases. For example, US-ASCII is the official name, but we also support ASCII. This SimpleNamedServiceLoader only knows about the name, which I think will break aliases?
Does our SimpleNamedServiceLoader need to change so that it can somehow support aliases?
| /** | ||
| * Given name, finds the BitsCharsetDefinition or null if not found | ||
| */ | ||
| def find(name: String): Option[BitsCharsetDefinition] = bitsCharsetDefinitionMap.get(name) |
There was a problem hiding this comment.
This doesn't do any case insensitive stuff. I thought we supported lower or upper case charset names? But maybe I'm wrong? Should we, or should all encodings match exactly what is defined by the name function returned by the definition?
There was a problem hiding this comment.
I have confirmed that we do currently support case insensitive encodings. I actually don't see in the code where we handle that, but creating schemas with different cases of the same encoding does work as expected. I'm not sure if we need to do anything special with these new changes or if will just work.
There was a problem hiding this comment.
Ah, I found how we handle case insensitivity:
The encoding property is always capitalized by our encoding cooker. So in order for a charset to be findable, the name() function must return an all capitalized string. Ours all currently do that I think, but we should probably document that in the BitsCharsetDefinition class.
There was a problem hiding this comment.
I will add that as a comment
|
Hi @alexanderrevello, somehow you've merged two new commits from the main branch into your pull request which are now causing two conflicting files: from these commits: • 97992f6 Update copyright year to 2022 I did review your other actual (new) commits, • f1f1ac6 requested changes including new naming structure, making name not be passed where it isn't needed and comment fixes Aside from requesting that the charsets in daffodil-io/src/main/resources/META-INF/services/org.apache.daffodil.processors.charset.BitsCharsetDefinition be sorted alphabetically, I think your actual changes look good. If @stevedlawrence agrees, I propose that you make that additional change, rebase and squash your pull request so that these conflicting files go away, we let CI run its tests, and we take one last look before we approve and merge your pull request. Thanks for seeing this PR through! |
|
yeah I will get rid of those extra commits |
1cc18db to
e11c4aa
Compare
| */ | ||
| abstract class BitsCharsetDefinition { | ||
|
|
||
| def name(): String |
There was a problem hiding this comment.
Actually, I believe what @stevedlawrence meant was that we should add a comment here saying that name() must return an all-upper-case string for the encoding to be findable by Daffodil.
There was a problem hiding this comment.
Yep, that's correct, sorry for the confusion. Eventually users are going to start implementing their own custom BitsCharsetDefinition and looking at this class for documentation on what should be returned. Making it clear that name() must return all uppercase is important.
Another option might be modify SimpleNamedServiceLoader to have some kind of option to specifiy this behavior, so that it just always capitalizes whatever name returns. Then users don't have to really care. Might need changes to support aliases anyways (not sure how best to handle that).
There was a problem hiding this comment.
DFDL encoding names are case insensitive. Users will be defining a DFDL schema and its associated new encodings together. It should not matter what case they use here or in the DFDL schema. We should standardize this to upper-case or whatever we want internally to make the lookups work efficiently, which I think means we upcase the name here, and the user name supplied to the dfdl:encoding property gets upcased before being used for lookup.
| } | ||
|
|
||
| final class BitsCharsetIBM037AliasDefinition | ||
| extends BitsCharsetDefinition { |
There was a problem hiding this comment.
I was thinking about how to reduce much of this boilerplate and also how to handle aliases and the upper case issue. I like the idea of have separate Definition classes for aliases, which I didn't notice when I previously mentioned aliases, but what if we make BitsCharsetDefinition something like this:
abstract class BitsCharsetDefinition(charset: BitsCharset, alias: Option[String] = None) {
final def charset(): BitsCharset = charset
final def name(): String = alias.getOrElse(charset.name).toUpperCase()
}So the name is taken from the actual BitsCharset if there's no alias, but has a way to override that with an optional alias parameter.
Then the definitions become less verbose one-liners like this:
final class BitsCharsetIBM037Definition extends BitsCharsetDefinition(BitsCharsetIBM037)
final class BitsCharsetEBCDIC_CP_USDefinition extends BitsCharsetDefinition(BitsCharsetIBM037, Some("EBCDIC-CP-US"))Note that this replaces the newInstance function with charset because a BitsCharsetDefintion now can't ever actually return a new instance--it always returns whatever instance was passed to the constructor. But this is probably good. We shouldn't make it possible to create more than one instance of BitsCharset per charset. There should only ever be one stateless thread-safe BitsCharset instance per charset. It is the decoders/encoders created from those that can have state, and Daffodil will create new instances of those as needed.
Also note that we can remove the alias member from BitsCharset, since those are all now handled with unique Definition classes.
mbeckerle
left a comment
There was a problem hiding this comment.
One more minor change to get rid of Exception throw method.
|
|
||
| private def SDE(id: String, args: Any*): Nothing = { | ||
| throw new Exception(id.format(args: _*)) | ||
| } |
There was a problem hiding this comment.
I suggest get rid of SDE here. (remove the method) You don't have enough context to know whether something is a schema-definition-error or some other kind of error at this point. These are just charset utils, general purpose code.
You either need to define a EncodingUnavailableException, or you should just be returning null/None or such to indicate the lookup failed. I guess I think the latter is preferable.
There is a method to return a list of the available charsets so returning null is preferable, and the caller can create a sensible diagnostic message.
| tci.schemaDefinitionError("Unsupported encoding: %s. Supported encodings: %s", encString, CharsetUtils.supportedEncodingsString) | ||
| val bcd = BitsCharsetDefinitionRegistry.find(encString).getOrElse(null) | ||
| if (bcd == null) { | ||
| tci.schemaDefinitionError("Unsupported encoding: %s. Supported encodings: %s", encString, BitsCharsetDefinitionRegistry.supportedEncodingsString) |
There was a problem hiding this comment.
I think it's better to be verbose here.
I think there are two errors. One the user didn't get the charset name right, so it isn't found.
The other is the user does have the name right, but the list of charsets doesn't contain the charset because they didn't get the classpath right.
The latter error you need to see that the charset you seek isn't in the list at all. So the verbose list is needed.
So I'm ok with this as is.
| BitsCharsetIBM037 | ||
| } | ||
| } | ||
| final class BitsCharsetIBM037EBCDIC_CP_USDefinition |
There was a problem hiding this comment.
Thhoughts on naming this just BitsCharsetEBCDIC_CP_USDefintion? The fact that this is aliased to IBM037 doesn't really need to be in the name and makes things a bit less verbose.
| cs | ||
| val cs = BitsCharsetDefinitionRegistry.find(name).getOrElse(null) | ||
| if (cs == null) | ||
| SDE("The encoding '%s' was not found. Available choices are: %s", name, BitsCharsetDefinitionRegistry.supportedEncodingsString) |
There was a problem hiding this comment.
I think this SDE should be removed. This function should either return a charset or null. The callers will determine how to handle null.
| # layering, but one should expect to have to adapt their code substantially. | ||
| # | ||
| org.apache.daffodil.processors.charset.CustomNonByteSizeCharsetCompiler | ||
| org.apache.daffodil.processors.charset.CustomJavaCharsetCompiler No newline at end of file |
There was a problem hiding this comment.
I dont' think this comment was resolved. This file should be in src/test/resources, right?
|
|
||
| override def compute(state: ParseOrUnparseState) = { | ||
| val encString = encodingEv.evaluate(state) | ||
| val cs = CharsetUtils.getCharset(encString) |
There was a problem hiding this comment.
This comment wasn't addressed.
tuxji
left a comment
There was a problem hiding this comment.
I agree with Steve and Mike's suggestions. Once you make the changes they suggest, we can merge your pull request (which looks like a really nice improvement).
stevedlawrence
left a comment
There was a problem hiding this comment.
+1 👍 looks great!
Please squash into a single commit, update the commit message if needed (include the bug number on the last line) and this should be good to merged.
dbb4eec to
e6e2d58
Compare
Refactored existing charsets to use the BitsCharsetDefinition and created support for it. To implement a charset follow the format and add the classpath in the file in resources/META-INF/services. This process has been shown in daffodil-test/.../charsets DAFFODIL-2663
No description provided.