Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Symbol encoding strategy and capitalization #461

Merged
merged 17 commits into from
Mar 4, 2018

Conversation

hosamaly
Copy link
Contributor

@hosamaly hosamaly commented Jan 24, 2018

In this pull request, I am proposing a few modifications to the generation of classes from an XSD:

  • Accept all characters that are valid start or valid parts of a Java identifier. This allows many Unicode characters to be part of the class name.
  • When faced with expected ASCII symbols (:, ., -, and _ according to the XML spec), replace them with their name. I think that is more descriptive than U002e.
  • When faced with a character that is not a valid part of a Java identifier, replace it with its Unicode value. The value is now encoded in the same way that it would be written in a string, i.e., with 4 hexadecimal digits. This change should make it easier for people to find out the original character.

This PR breaks compatibility with v1.5. In particular:

  • Spaces used to be encoded as u32. They are now U0020.
  • Trailing underscores used to be encoded as u93 (should've been 95). They are now Underscore.
  • Other characters that used to be encoded as u00, where 00 is the decimal value of the characters, are now either left as-is if they are valid in a Java identifier or replaced with U0000 where 0000 is the hexadecimal value of the character.

Merging this would change the behaviours introduced in #181, #191, and #415.

testUnderscoreSuffix
testNamesWithDots
testNamesWithHyphens
testNamesWithUnderscores
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish I could have added a test case for colons but they suffered from issues #459, scala/scala-xml#94, and scala/scala-xml#182.

@hosamaly
Copy link
Contributor Author

@eed3si9n Could you please have a look at this PR? Thank you.

@eed3si9n
Copy link
Owner

Sorry about the late response.
Does this address any problem you're facing using scalaxb?
I am kind of conflicted about this because it would mean that anyone who generated source using older scalaxb would now need to adjust the client-side code for the new name.

@hosamaly
Copy link
Contributor Author

Hi, @eed3si9n. Apologies for the late response; time zone difference meant I had started my weekend! :)

I'm trying to work with the News Industry Text Format (NITF). Its XSD has many tags and attributes that include punctuation marks in their names, e.g. data-format and tobject.subject.ipr.

ScalaXB 1.5.2 generates class names that look like datau45format and tobjectu46subjectu46ipr. I find this format really hard to read.

My initial suggestion is to replace these punctuation characters with their names, e.g. dataHyphenformat and tobjectDotsubjectDotipr, hoping to later capitalise the words after the punctuation marks, and finally add an option to let the user choose to remove the punctuation marks altogether (assuming no naming conflicts).

I understand that this change is backwards-incompatible but I think it's a step in the right direction.

What do you think?

@eed3si9n
Copy link
Owner

Could you send a note to scalaxb mailing list summarizing the change, and get some responses from the existing stakeholders please? https://groups.google.com/forum/#!forum/scalaxb

@hosamaly
Copy link
Contributor Author

I sent this message to the mailing list.

Names like 'data-format' or 'pre.content' would be transformed into
'data-Format' and 'pre.Content', so that the generated class names
are in camel-case.

This is related to issue eed3si9n#226 and PR eed3si9n#320.
This changes the behaviour of trailing underscores which used to be
encoded as `u93`. They are now encoded as `u95`.
…ss names

Pass "discard-non-identifiers" to enable this option.

Known issue: if the option is enabled and an identifier ends in multiple
underscores (e.g. `el__`) then the generated name will be invalid
(`el_`, as only the last underscore will be dropped).
* discard-non-identifiers:
  Add an option to discard non-identifier characters from generated class names
  Unify the logic to generate "u1234" when identifiers include symbols
… symbol's name

According to the XML spec, permissible symbols are colon (`:`), dot (`.`),
hyphen (`-`), and underscore (`_`).

There is no test case for colons because Scala XML has a hard time dealing with them.
(scala/scala-xml issues 94 and 182)
@hosamaly hosamaly force-pushed the master branch 2 times, most recently from 0aa792c to cf187a0 Compare February 28, 2018 10:23
Previously, any non-latin-word character would have been encoded into uXX.
This change enables ScalaXB to accept all characters that are valid in a
Java identifier.
…cters

Previously, such characters would be encoded into `uN` where N was the
decimal numeric value of the character.

The new implementation encodes them as `U0000`, where 0000 is replaced
by the zero-padded 4-digit hexadecimal numeric value of the character.
@hosamaly
Copy link
Contributor Author

hosamaly commented Feb 28, 2018

@eed3si9n: I have updated this PR to include my other PRs with a feature flag for each behavioural change. I'll close the other PRs.

The difference between the output of v1.5.2 and this version with arguments --symbol-encoding-strategy=discard --capitalize-words can be seen in this gist.

Hopefully, these two parameters can become the default in v2.0.

@hosamaly hosamaly changed the title Replace symbols in names with the symbol's name Symbol encoding strategy and capitalization Feb 28, 2018
case object Discard extends Strategy("discard", "Discards any characters that are invalid in Scala identifiers, such as dots and hyphens")
case object SymbolName extends Strategy("symbol-name", "Replaces `.`, `-`, `:`, and trailing `_` in class names with `Dot`, `Hyphen`, `Colon`, and `Underscore`")
case object UnicodePoint extends Strategy("unicode-point", "Replaces symbols with a 'u' followed by the 4-digit hexadecimal code of the character (e.g. `_` => `u005f`)")
case object DecimalAscii extends Strategy("decimal-ascii", "Replaces symbols with a 'u' followed by the decimal code of the character (e.g. `_` => `u95`)")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. Thanks so much for these compat flags ❤️

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're welcome. Hopefully, these flags will make it easier to change the default later.

Copy link
Owner

@eed3si9n eed3si9n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@hosamaly
Copy link
Contributor Author

hosamaly commented Feb 28, 2018

I've added a final commit to add keys to the sbt plugin so that it has the same configuration options as the CLI app. Unfortunately, the sbt-test fails to compile on Travis CI because it isn't using the keys that are defined in the same commit. Is this something you can help with, @eed3si9n?

@@ -126,7 +126,9 @@ object ScalaxbPlugin extends sbt.AutoPlugin {
(if (scalaxbVararg.value && !scalaxbGenerateMutable.value) Vector(VarArg) else Vector()) ++
(if (scalaxbGenerateMutable.value) Vector(GenerateMutable) else Vector()) ++
(if (scalaxbGenerateVisitor.value) Vector(GenerateVisitor) else Vector()) ++
(if (scalaxbAutoPackages.value) Vector(AutoPackages) else Vector())
(if (scalaxbAutoPackages.value) Vector(AutoPackages) else Vector()) ++
(if (scalaxbCapitalizeWords.value) Vector(CapitalizeWords) else Vector()) ++
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this setting needs to be initialized somewhere.

scalaxbCapitalizeWords := false

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to do this around line 86 but then I found that it overrides the value I set in the sbt-test project. Did I miss something?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then the setting would set by the user as:

scalaxbCapitalizeWords in (Compile, scalaxb) := true,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

(if (scalaxbAutoPackages.value) Vector(AutoPackages) else Vector())
(if (scalaxbAutoPackages.value) Vector(AutoPackages) else Vector()) ++
(if (scalaxbCapitalizeWords.value) Vector(CapitalizeWords) else Vector()) ++
Vector(SymbolEncoding.withName(scalaxbSymbolEncodingStrategy.value.toString))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same for this one.

@eed3si9n eed3si9n merged commit 5d0eea5 into eed3si9n:master Mar 4, 2018
@eed3si9n
Copy link
Owner

eed3si9n commented Mar 4, 2018

Merged!

@hosamaly
Copy link
Contributor Author

hosamaly commented Mar 4, 2018

Great! Thanks!
I'm looking forward to the new release. 👍

@margussipria
Copy link
Contributor

This pull request did also brake numbers:

case object Number05 extends EciType { override def toString = "05" }

is now:

case object U485 extends EciType { override def toString = "05" }

This is unnecessary complicated

@hosamaly
Copy link
Contributor Author

Thanks for fixing this issue in #469, @margussipria.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants