Symbol encoding strategy and capitalization #461

Merged
17 commits merged on Mar 4, 2018

@hosamaly
Contributor

hosamaly commented Jan 24, 2018

In this pull request, I am proposing a few modifications to the generation of classes from an XSD:

  • Accept every character that is a valid start or a valid part of a Java identifier. This allows many Unicode characters to be part of the class name.
  • When faced with the expected ASCII symbols (:, ., -, and _ according to the XML spec), replace each with its name. I think that is more descriptive than U002e.
  • When faced with a character that is not a valid part of a Java identifier, replace it with its Unicode value. The value is now encoded in the same way that it would be written in a string, i.e., with 4 hexadecimal digits. This change should make it easier for people to find out the original character.
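To make the three rules above concrete, here is a minimal sketch of how they could be combined. It is not the code in this PR: `IdentifierSketch` and `encode` are hypothetical names, only a trailing underscore is replaced (an assumption based on the compatibility notes), and for brevity only `Character.isJavaIdentifierPart` is checked, without a separate rule for the first character.

```scala
// Illustrative sketch only; names and structure are assumptions, not scalaxb's code.
object IdentifierSketch {
  // Encode an XML name so that it can be used as part of a class name.
  def encode(name: String): String =
    name.zipWithIndex.map { case (c, i) =>
      val isLast = i == name.length - 1
      c match {
        // Expected ASCII symbols are replaced with their names.
        case ':'           => "Colon"
        case '.'           => "Dot"
        case '-'           => "Hyphen"
        case '_' if isLast => "Underscore" // assumption: only a trailing underscore
        // Anything valid inside a Java identifier is kept as-is.
        case _ if Character.isJavaIdentifierPart(c) => c.toString
        // Everything else becomes 'U' plus the 4-digit hexadecimal code.
        case _ => f"U${c.toInt}%04x"
      }
    }.mkString
}
```

Under these assumptions, `IdentifierSketch.encode("data-format")` gives `dataHyphenformat`, and a space is encoded as `U0020`.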

This PR breaks compatibility with v1.5. In particular:

  • Spaces used to be encoded as u32. They are now U0020.
  • Trailing underscores used to be encoded as u93 (should've been 95). They are now Underscore.
  • Other characters that used to be encoded as u00, where 00 is the decimal value of the character, are now either left as-is if they are valid in a Java identifier, or replaced with U0000, where 0000 is the hexadecimal value of the character.

Merging this would change the behaviours introduced in #181, #191, and #415.

@hosamaly

Contributor

hosamaly commented Feb 16, 2018

@eed3si9n Could you please have a look at this PR? Thank you.

@eed3si9n

Owner

eed3si9n commented Feb 16, 2018

Sorry about the late response.
Does this address any problem you're facing using scalaxb?
I am kind of conflicted about this because it would mean that anyone who generated source using older scalaxb would now need to adjust the client-side code for the new name.

@hosamaly

Contributor

hosamaly commented Feb 19, 2018

Hi, @eed3si9n. Apologies for the late response; time zone difference meant I had started my weekend! :)

I'm trying to work with the News Industry Text Format (NITF). Its XSD has many tags and attributes that include punctuation marks in their names, e.g. data-format and tobject.subject.ipr.

ScalaXB 1.5.2 generates class names that look like datau45format and tobjectu46subjectu46ipr. I find this format really hard to read.

My initial suggestion is to replace these punctuation characters with their names, e.g. dataHyphenformat and tobjectDotsubjectDotipr, hoping to later capitalise the words after the punctuation marks, and finally add an option to let the user choose to remove the punctuation marks altogether (assuming no naming conflicts).

I understand that this change is backwards-incompatible but I think it's a step in the right direction.

What do you think?

@eed3si9n

Owner

eed3si9n commented Feb 20, 2018

Could you send a note to scalaxb mailing list summarizing the change, and get some responses from the existing stakeholders please? https://groups.google.com/forum/#!forum/scalaxb

@hosamaly

Contributor

hosamaly commented Feb 21, 2018

I sent this message to the mailing list.

hosamaly added some commits Feb 22, 2018

Capitalize identifier parts when they contain symbols

Names like 'data-format' or 'pre.content' would be transformed into
'data-Format' and 'pre.Content', so that the generated class names
are in camel-case.

This is related to issue #226 and PR #320.
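
As a rough illustration of the capitalization described in this commit message, the sketch below upper-cases any character that directly follows a symbol. `CapitalizeSketch` and `capitalizeParts` are hypothetical names, and treating every character that is not a letter or digit as a "symbol" is an assumption rather than the rule the commit actually uses.

```scala
// Illustrative sketch only; not the implementation from this commit.
object CapitalizeSketch {
  // Upper-case each character that directly follows a non-alphanumeric symbol.
  def capitalizeParts(name: String): String = {
    val sb = new StringBuilder
    var afterSymbol = false
    for (c <- name) {
      sb.append(if (afterSymbol) c.toUpper else c)
      // Assumption: anything that is not a letter or digit counts as a symbol.
      afterSymbol = !c.isLetterOrDigit
    }
    sb.toString
  }
}
```

With this sketch, `capitalizeParts("data-format")` yields `data-Format` and `capitalizeParts("pre.content")` yields `pre.Content`, matching the examples in the message.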

Unify the logic to generate "u1234" when identifiers include symbols

This changes the behaviour of trailing underscores which used to be
encoded as `u93`. They are now encoded as `u95`.

Add an option to discard non-identifier characters from generated class names

Pass "discard-non-identifiers" to enable this option.

Known issue: if the option is enabled and an identifier ends in multiple
underscores (e.g. `el__`) then the generated name will be invalid
(`el_`, as only the last underscore will be dropped).
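
The known issue can be reproduced with a small sketch. This is a guess at the shape of the logic, not the code from the commit: `DiscardSketch` and `discard` are hypothetical names, and the use of `Character.isJavaIdentifierPart` to decide what to keep is an assumption.

```scala
// Illustrative sketch only; it mimics the behaviour described in the commit message.
object DiscardSketch {
  def discard(name: String): String = {
    // Drop characters that cannot appear in an identifier (e.g. '.' and '-').
    val kept = name.filter(c => Character.isJavaIdentifierPart(c))
    // Drop a single trailing underscore; a second one would survive.
    if (kept.endsWith("_")) kept.dropRight(1) else kept
  }
}
```

Here `discard("data-format")` gives `dataformat`, while `discard("el__")` gives `el_`, which is the case the commit message warns about.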

hosamaly added some commits Feb 23, 2018

Merge branch 'discard-non-identifiers' into capitalize-identifier-parts
* discard-non-identifiers:
  Add an option to discard non-identifier characters from generated class names
  Unify the logic to generate "u1234" when identifiers include symbols

Add option to replace symbols in element and attribute names with the symbol's name

According to the XML spec, permissible symbols are colon (`:`), dot (`.`),
hyphen (`-`), and underscore (`_`).

There is no test case for colons because Scala XML has a hard time dealing with them.
(scala/scala-xml issues 94 and 182)

Accept all valid Java identifiers in element and attribute names

Previously, any non-latin-word character would have been encoded into uXX.
This change enables ScalaXB to accept all characters that are valid in a
Java identifier.

Use 4-digit hexadecimal Unicode points to encode non-identifier characters

Previously, such characters would be encoded into `uN` where N was the
decimal numeric value of the character.

The new implementation encodes them as `U0000`, where 0000 is replaced
by the zero-padded 4-digit hexadecimal numeric value of the character.

@hosamaly

Contributor

hosamaly commented Feb 28, 2018

@eed3si9n: I have updated this PR to include my other PRs with a feature flag for each behavioural change. I'll close the other PRs.

The difference between the output of v1.5.2 and this version with arguments --symbol-encoding-strategy=discard --capitalize-words can be seen in this gist.

Hopefully, these two parameters can become the default in v2.0.

@hosamaly changed the title from "Replace symbols in names with the symbol's name" to "Symbol encoding strategy and capitalization" on Feb 28, 2018

case object Discard extends Strategy("discard", "Discards any characters that are invalid in Scala identifiers, such as dots and hyphens")
case object SymbolName extends Strategy("symbol-name", "Replaces `.`, `-`, `:`, and trailing `_` in class names with `Dot`, `Hyphen`, `Colon`, and `Underscore`")
case object UnicodePoint extends Strategy("unicode-point", "Replaces symbols with a 'u' followed by the 4-digit hexadecimal code of the character (e.g. `_` => `u005f`)")
case object DecimalAscii extends Strategy("decimal-ascii", "Replaces symbols with a 'u' followed by the decimal code of the character (e.g. `_` => `u95`)")

@eed3si9n

eed3si9n Feb 28, 2018

Owner

Nice. Thanks so much for these compat flags ❤️

@hosamaly

hosamaly Feb 28, 2018

Contributor

You're welcome. Hopefully, these flags will make it easier to change the default later.
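
For readers comparing the four strategies quoted above, the following sketch shows how each might encode a single symbol. Only the aliases (`discard`, `symbol-name`, `unicode-point`, `decimal-ascii`) and the `_` examples come from the diff. The field names `alias` and `description`, the `lookup` and `encodeSymbol` helpers, and the hex fallback under `SymbolName` are assumptions made for illustration.

```scala
// Illustrative sketch only; the real definitions live in the PR's diff.
object SymbolEncodingSketch {
  // Field names are assumed; the diff only shows two positional string arguments.
  sealed abstract class Strategy(val alias: String, val description: String)
  case object Discard      extends Strategy("discard", "drop the symbol")
  case object SymbolName   extends Strategy("symbol-name", "use the symbol's name")
  case object UnicodePoint extends Strategy("unicode-point", "'u' + 4-digit hex code")
  case object DecimalAscii extends Strategy("decimal-ascii", "'u' + decimal code")

  val values: List[Strategy] = List(Discard, SymbolName, UnicodePoint, DecimalAscii)

  // Hypothetical lookup by alias, returning None for unknown names.
  def lookup(alias: String): Option[Strategy] = values.find(_.alias == alias)

  private val names =
    Map(':' -> "Colon", '.' -> "Dot", '-' -> "Hyphen", '_' -> "Underscore")

  // Encode one symbol under the chosen strategy.
  def encodeSymbol(strategy: Strategy, c: Char): String = strategy match {
    case Discard      => ""
    case SymbolName   => names.getOrElse(c, f"u${c.toInt}%04x") // fallback is assumed
    case UnicodePoint => f"u${c.toInt}%04x"
    case DecimalAscii => s"u${c.toInt}"
  }
}
```

For `_`, the sketch returns `u005f` under `UnicodePoint` and `u95` under `DecimalAscii`, in line with the examples in the strategy descriptions.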

@eed3si9n

LGTM

@hosamaly

Contributor

hosamaly commented Feb 28, 2018

I've added a final commit to add keys to the sbt plugin so that it has the same configuration options as the CLI app. Unfortunately, the sbt-test fails to compile on Travis CI because it isn't using the keys that are defined in the same commit. Is this something you can help with, @eed3si9n?

@@ -126,7 +126,9 @@ object ScalaxbPlugin extends sbt.AutoPlugin {
 (if (scalaxbVararg.value && !scalaxbGenerateMutable.value) Vector(VarArg) else Vector()) ++
 (if (scalaxbGenerateMutable.value) Vector(GenerateMutable) else Vector()) ++
 (if (scalaxbGenerateVisitor.value) Vector(GenerateVisitor) else Vector()) ++
-(if (scalaxbAutoPackages.value) Vector(AutoPackages) else Vector())
+(if (scalaxbAutoPackages.value) Vector(AutoPackages) else Vector()) ++
+(if (scalaxbCapitalizeWords.value) Vector(CapitalizeWords) else Vector()) ++

@eed3si9n

eed3si9n Mar 2, 2018

Owner

I think this setting needs to be initialized somewhere.

scalaxbCapitalizeWords := false

@hosamaly

hosamaly Mar 2, 2018

Contributor

I tried to do this around line 86 but then I found that it overrides the value I set in the sbt-test project. Did I miss something?

@eed3si9n

eed3si9n Mar 3, 2018

Owner

Then the setting would be set by the user as:

scalaxbCapitalizeWords in (Compile, scalaxb) := true,
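
Put together, a user's build might look roughly like the sketch below. It assumes the plugin supplies the default suggested in this thread (`scalaxbCapitalizeWords := false`) and that the project enables `ScalaxbPlugin` in the usual way; the project name `root` is only a placeholder.

```scala
// build.sbt (sketch): override the plugin's default in the (Compile, scalaxb) scope.
lazy val root = (project in file("."))
  .enablePlugins(ScalaxbPlugin)
  .settings(
    // Assumed plugin-side default: scalaxbCapitalizeWords := false
    scalaxbCapitalizeWords in (Compile, scalaxb) := true
  )
```

Scoping the override to `(Compile, scalaxb)` is what the comment above proposes, so that the value set in the sbt-test project is the one the task actually reads.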

@hosamaly

hosamaly Mar 4, 2018

Contributor

Done.

-(if (scalaxbAutoPackages.value) Vector(AutoPackages) else Vector())
+(if (scalaxbAutoPackages.value) Vector(AutoPackages) else Vector()) ++
+(if (scalaxbCapitalizeWords.value) Vector(CapitalizeWords) else Vector()) ++
+Vector(SymbolEncoding.withName(scalaxbSymbolEncodingStrategy.value.toString))

@eed3si9n

eed3si9n Mar 2, 2018

Owner

same for this one.

hosamaly added some commits Mar 2, 2018

@eed3si9n eed3si9n merged commit 5d0eea5 into eed3si9n:master Mar 4, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed

@eed3si9n

Owner

eed3si9n commented Mar 4, 2018

Merged!

@hosamaly

Contributor

hosamaly commented Mar 4, 2018

Great! Thanks!
I'm looking forward to the new release. 👍

@margussipria

Contributor

margussipria commented Mar 17, 2018

This pull request also broke numbers:

case object Number05 extends EciType { override def toString = "05" }

is now:

case object U485 extends EciType { override def toString = "05" }

This is unnecessarily complicated.

@hosamaly

Contributor

hosamaly commented Mar 19, 2018

Thanks for fixing this issue in #469, @margussipria.
