Multi docs same tns. Hash based naming. #72

aaronmmanzano · 2018-10-25T21:12:54Z

Hello David,

I messaged you on Reddit about two weeks ago about making a pull request, and here it is. I'm a few months into learning Go, and it was a real pleasure tooling around your code. It really gave me a good feel for well-structured Go code. With that in mind, I don't expect this first pull request to meet those standards, and I view it more as a request for code review than a pull request. I know this project is no longer a major priority, but if you have time to take a look at the code I'd appreciate it.

I'll make line comments on the PR after I create it, but I'll briefly explain my intentions here as well. I initially ran into an issue with xsd/parse.go:Parse(). When Schema objects are stored in the parsed map, they are keyed by target namespace (tns). This causes two schemas with the same tns to collide in the map, and only the last schema using that tns survives.
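The collision described above can be reproduced in a few lines. This is a minimal sketch, not the project's code: the `Schema` struct here is a simplified stand-in for `xsd.Schema`, and `indexByNamespace` is a hypothetical helper that only mimics the `parsed[tns] = s` assignment.

```go
package main

import "fmt"

// Schema is a simplified stand-in for xsd.Schema; only the fields
// needed for the illustration are included.
type Schema struct {
	TargetNS string
	Types    []string
}

// indexByNamespace mimics storing each parsed schema under its target
// namespace: a later schema silently replaces an earlier one.
func indexByNamespace(schemas []Schema) map[string]Schema {
	parsed := make(map[string]Schema, len(schemas))
	for _, s := range schemas {
		parsed[s.TargetNS] = s
	}
	return parsed
}

func main() {
	schemas := []Schema{
		{TargetNS: "urn:example", Types: []string{"Order"}},
		{TargetNS: "urn:example", Types: []string{"Invoice"}},
	}
	parsed := indexByNamespace(schemas)
	// Only one entry survives, and it holds the types of the last schema.
	fmt.Println(len(parsed), parsed["urn:example"].Types)
}
```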

After correcting this I ran into XMLName collisions caused by joining the types of multiple schemas (that target the same namespace) into a single map, which happens immediately after the Schema objects are created in Parse(). In short, this led me down the path of rewriting the copyEltNamesToAnonTypes() and nameAnonymousTypes() functions called within Normalize() so that each is a two-pass process: the first pass gathers the information needed to assess possible collisions, and the second pass does the renaming. If collisions are detected, name suffixes are used to prevent them while preserving the context provided by the parent element names. The fewer types that fall through to nameAnonymousTypes(), the better.

And for those types that do fall through to being renamed by nameAnonymousTypes(), the name suffix is changed from an incrementing number to the leading characters of the type's hash. This makes the generated code more maintainable: it avoids the situation I encountered where adding an anon type early in the xsd renames every subsequent anon type with a suffix one higher than before. This way a Go type declaration only changes if the xsd type it models changes. It also allows identical anon xsd type declarations to be condensed into a single Go type.
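To illustrate the difference between counter-based and hash-based suffixes, here is a rough sketch. It assumes type definitions are plain strings and uses SHA-256; `hashSuffix`, `counterNames`, `hashNames`, and the `_anon` name prefixes are hypothetical and do not reflect the PR's actual hash() function, which operates on `*xmltree.Element` values.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashSuffix returns the first n hex characters (n <= 64) of the SHA-256
// digest of a type's serialized definition. Hypothetical helper.
func hashSuffix(definition string, n int) string {
	sum := sha256.Sum256([]byte(definition))
	return hex.EncodeToString(sum[:])[:n]
}

// counterNames names anonymous types by position, so inserting a type
// early renames every type after it.
func counterNames(defs []string) []string {
	names := make([]string, len(defs))
	for i := range defs {
		names[i] = fmt.Sprintf("_anon%d", i+1)
	}
	return names
}

// hashNames names anonymous types by content, so a name changes only
// when the definition it was derived from changes.
func hashNames(defs []string) []string {
	names := make([]string, len(defs))
	for i, d := range defs {
		names[i] = "_anon_" + hashSuffix(d, 8)
	}
	return names
}

func main() {
	before := []string{"<complexType>A</complexType>", "<complexType>B</complexType>"}
	// A new anonymous type is added at the front of the document.
	after := append([]string{"<complexType>New</complexType>"}, before...)

	fmt.Println(counterNames(before), counterNames(after))
	fmt.Println(hashNames(before), hashNames(after))
}
```

With the counter scheme the names of A and B shift by one in the second run; with the hash scheme they stay the same, and two identical definitions would receive the same name.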

Anyhow, please let me know what you think, and thanks so much for making this project public.
~Aaron

PS: I noticed there is an auto-build process that failed, I'll look into the tests.

Naming of anonymous types based on hash instead of counter. Decoupling anon type names from ordering of xsd document arguments.

Instead of leaving anon types to be named _anon (if they would collide with other types after inheriting a name from a parent element), append a deterministically derived suffix to the name to maintain context while avoiding a name collision.

aaronmmanzano · 2018-10-25T21:17:22Z

xsd/parse.go

 		if err := s.parse(root); err != nil {
 			return nil, err
 		}
-		parsed[tns] = s


Storing generated schemas by tns causes collisions, retaining only the types from the last schema to be generated targeting the specific namespace. Using a hash of the schema ensures all unique schemas are retained.

This isn't the approach I would have used. Being able to assume that your schema document is exhaustive for its target namespace makes later steps simpler. For example, if there was a 1:1 mapping from target namespace to xsd then you could remove one layer of maps for your prepCopyEltNamesToAnonTypes, and you would be able to push it into the CopyEltNamesToAnonTypes function.

To work around the pesky XSD specification allowing a schema to be split across multiple files, you could add a pass that merges all *xmltree.Element trees for schemas with the same target namespace. Another benefit of doing the transform on the tree structure is that it's easier to verify, by printing out the XML in intermediate steps. What do you think?

will think on this
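The merge pass suggested in this thread could look roughly like the sketch below. It is only an illustration under stated assumptions: `element` is a pared-down stand-in for `*xmltree.Element`, and `mergeByNamespace` is a hypothetical name, not a function in the project.

```go
package main

import "fmt"

// element is a pared-down stand-in for *xmltree.Element.
type element struct {
	Name     string
	TargetNS string
	Children []*element
}

// mergeByNamespace combines the children of all schema roots that share
// a target namespace under a single root, preserving first-seen order.
func mergeByNamespace(roots []*element) []*element {
	merged := make(map[string]*element)
	var order []string
	for _, r := range roots {
		m, ok := merged[r.TargetNS]
		if !ok {
			m = &element{Name: r.Name, TargetNS: r.TargetNS}
			merged[r.TargetNS] = m
			order = append(order, r.TargetNS)
		}
		m.Children = append(m.Children, r.Children...)
	}
	out := make([]*element, 0, len(order))
	for _, ns := range order {
		out = append(out, merged[ns])
	}
	return out
}

func main() {
	roots := []*element{
		{Name: "schema", TargetNS: "urn:a", Children: []*element{{Name: "Order"}}},
		{Name: "schema", TargetNS: "urn:b", Children: []*element{{Name: "Customer"}}},
		{Name: "schema", TargetNS: "urn:a", Children: []*element{{Name: "Invoice"}}},
	}
	for _, m := range mergeByNamespace(roots) {
		fmt.Println(m.TargetNS, len(m.Children))
	}
}
```

After such a pass each target namespace maps to exactly one tree, which is the 1:1 assumption the later steps would rely on.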

aaronmmanzano · 2018-10-25T21:24:10Z

xsd/parse.go

+		nTypes = namedTypesByNS[tns]
+		nTypesToCopy = namesToCopiedByNS[tns]
+	}
+
 	namedTypes := and(isType, hasAttr("", "name"))
 	for _, el := range root.SearchFunc(namedTypes) {


Here we collect all the explicitly named types in a tns so we can use them to detect type name collisions in pass 2. We fail if there is an XMLName collision here because XMLNames should be unique per namespace.
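A minimal sketch of this collection step, assuming each named type has already been reduced to a name and a hash. The `namedType` struct and `collectNamedTypes` function are hypothetical; the error text follows the wording suggested later in this review.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// namedType pairs an explicitly named type with a hash of its
// definition. Hypothetical stand-in for data read from the XML tree.
type namedType struct {
	Name xml.Name
	Hash string
}

// collectNamedTypes stores each name with its hash and fails on the
// first name that appears twice in the same namespace.
func collectNamedTypes(types []namedType) (map[xml.Name]string, error) {
	seen := make(map[xml.Name]string, len(types))
	for _, t := range types {
		if _, ok := seen[t.Name]; ok {
			return nil, fmt.Errorf("collision for type %v in target ns %q",
				t.Name.Local, t.Name.Space)
		}
		seen[t.Name] = t.Hash
	}
	return seen, nil
}

func main() {
	_, err := collectNamedTypes([]namedType{
		{Name: xml.Name{Space: "urn:example", Local: "Address"}, Hash: "aaa111"},
		{Name: xml.Name{Space: "urn:example", Local: "Address"}, Hash: "bbb222"},
	})
	fmt.Println(err)
}
```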

aaronmmanzano · 2018-10-25T21:26:42Z

xsd/parse.go

@@ -210,22 +295,113 @@ func copyEltNamesToAnonTypes(root *xmltree.Element) {
 		hasAnonymousType)



Next we identify candidates for inheriting the parent element's name, but we don't rename them immediately, so we can determine whether there are other candidates for the same name. We store the hash of each type that is a candidate for a particular XMLName so we can assess whether a suffix is required in the renaming pass.
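As a rough illustration of this first pass, the sketch below records the set of distinct hashes competing for each name without renaming anything. The `anonType` struct and `collectCandidates` function are assumptions made for the example and are not taken from the PR.

```go
package main

import "fmt"

// anonType describes an anonymous type by the name of the element that
// contains it and a hash of its definition. Hypothetical stand-in.
type anonType struct {
	ParentName string
	Hash       string
}

// collectCandidates records, for every parent element name, the set of
// distinct type hashes that would inherit that name. Nothing is renamed.
func collectCandidates(types []anonType) map[string]map[string]struct{} {
	candidates := make(map[string]map[string]struct{})
	for _, t := range types {
		if candidates[t.ParentName] == nil {
			candidates[t.ParentName] = make(map[string]struct{})
		}
		candidates[t.ParentName][t.Hash] = struct{}{}
	}
	return candidates
}

func main() {
	candidates := collectCandidates([]anonType{
		{ParentName: "address", Hash: "aaa111"},
		{ParentName: "address", Hash: "bbb222"},
		{ParentName: "address", Hash: "aaa111"}, // identical definition, same hash
		{ParentName: "phone", Hash: "ccc333"},
	})
	// "address" has two distinct candidates, "phone" has one.
	fmt.Println(len(candidates["address"]), len(candidates["phone"]))
}
```

A name with more than one distinct hash is one that would need a suffix in the renaming pass.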

aaronmmanzano · 2018-10-25T21:28:07Z

xsd/parse.go

+		or(isElem(schemaNS, "element"), isElem(schemaNS, "attribute")),
+		hasAttr("", "name"),
+		hasAnonymousType)
+


Here we begin the actual renaming process

aaronmmanzano · 2018-10-25T21:29:55Z

xsd/parse.go

+			isNamedType, suffix := getSuffix(xmlname, tHash)
+
+			// If the name/hash of the anon type matches that of an explicitly
+			// declared type, reference that type and throw away the anon type


If the type, after inheriting the name, exactly matches the hash of an explicitly named type, then we throw away the anon type and point the parent element at the explicitly named type. Otherwise we determine whether it requires a suffix to avoid a collision and rename it.
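The decision described here might be sketched as follows. This is not the PR's getSuffix(): the `registry` type, its `resolve` method, the `short` helper, and the underscore separator are all hypothetical, and hashes are plain strings for simplicity.

```go
package main

import "fmt"

// registry holds what an earlier pass collected. Hypothetical layout.
type registry struct {
	named      map[string]string          // explicit type name -> hash
	candidates map[string]map[string]bool // element name -> anon type hashes
}

// short returns the leading characters of a hash for use as a suffix.
func short(hash string) string {
	if len(hash) > 6 {
		return hash[:6]
	}
	return hash
}

// resolve decides what an anonymous type that wants to inherit name
// should be called. reuse is true when an identical explicitly named
// type exists and the anonymous copy can be dropped.
func (r registry) resolve(name, hash string) (reuse bool, final string) {
	if h, ok := r.named[name]; ok {
		if h == hash {
			return true, name
		}
		// Name is taken by a different explicit type.
		return false, name + "_" + short(hash)
	}
	if len(r.candidates[name]) > 1 {
		// Several distinct anonymous types compete for the name.
		return false, name + "_" + short(hash)
	}
	return false, name
}

func main() {
	r := registry{
		named: map[string]string{"address": "aaa111"},
		candidates: map[string]map[string]bool{
			"address": {"aaa111": true, "bbb222": true},
			"phone":   {"ccc333": true},
		},
	}
	fmt.Println(r.resolve("address", "aaa111"))
	fmt.Println(r.resolve("address", "bbb222"))
	fmt.Println(r.resolve("phone", "ccc333"))
}
```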

aaronmmanzano · 2018-10-25T21:31:50Z

xsd/parse.go

@@ -256,13 +432,32 @@ to
    </xs:sequence>
  </xs:complexType>
 */
-func nameAnonymousTypes(root *xmltree.Element) error {
+


The same process is used to give anon types names

droyo · 2018-10-27T04:23:27Z

Wow... first of all, thank you for this PR and your detailed explanation. I will make comments in-line.

droyo

I made a first pass. Again, thank you for your contribution. I have not fully grokked the code yet, so my current feedback is more around formatting. I will need a little more time to give specific feedback on the more significant changes.

Overall this looks pretty good. I have two qualms that will solidify over time:

You need tests for what you've fixed here. Specifically, you need to test the following:
** schema that are split over multiple files don't have missing types after Parse is done with them.
** changing the order of arguments to Normalize doesn't change the naming of anonymous or colliding types.
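The second check could be approached along these lines. The sketch is deliberately self-contained and does not call the real Parse or Normalize: `namer`, `byPosition`, and `byContent` are hypothetical stand-ins for a naming scheme, and the FNV hash is used only for illustration. A real test would live in a `_test.go` file and feed actual schema documents through the package.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"reflect"
	"sort"
)

// namer maps a list of documents to the sorted "doc -> name" pairs
// generated for them.
type namer func(docs []string) []string

// reversed returns a copy of docs in the opposite order.
func reversed(docs []string) []string {
	out := make([]string, len(docs))
	for i, d := range docs {
		out[len(docs)-1-i] = d
	}
	return out
}

// orderIndependent reports whether name yields the same pairs when the
// documents are supplied in reverse order.
func orderIndependent(name namer, docs []string) bool {
	return reflect.DeepEqual(name(docs), name(reversed(docs)))
}

// byPosition assigns counter-based names, which depend on argument order.
func byPosition(docs []string) []string {
	out := make([]string, len(docs))
	for i, d := range docs {
		out[i] = fmt.Sprintf("%s -> anon%d", d, i+1)
	}
	sort.Strings(out)
	return out
}

// byContent assigns names derived only from each document's content.
func byContent(docs []string) []string {
	out := make([]string, len(docs))
	for i, d := range docs {
		h := fnv.New32a()
		h.Write([]byte(d))
		out[i] = fmt.Sprintf("%s -> anon_%08x", d, h.Sum32())
	}
	sort.Strings(out)
	return out
}

func main() {
	docs := []string{"a.xsd", "b.xsd", "c.xsd"}
	fmt.Println(orderIndependent(byPosition, docs), orderIndependent(byContent, docs))
}
```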

Other feedback: you're really fixing two issues here:

1. handling schema split over multiple files
2. improving naming of anonymous types with named parent elements.

It would be nice (but not required!) to address each issue in a separate PR.

droyo · 2018-10-27T04:55:16Z

xsd/parse.go

+			nTypes[xmlName] = hash(el)
+		} else {
+			return fmt.Errorf(
+				"Type collision - Name: [ %s ] Namespaces: [ %s ]\n",


I'm sure there are exceptions in the code base, but try to keep all of the error messages in plain English, with minimal notation.

Suggested change

-				"Type collision - Name: [ %s ] Namespaces: [ %s ]\n",
+				"collision for type %v in target ns %q",

Could you clarify a bit more on this one for me?
I wanted to place the XMLName and namespace in the error response to aid in troubleshooting the troublesome elements. Perhaps, "Two elements with same XMLName in target namespace "

Or would you prefer "Naming collision in namespace on XMLName " ?

If neither of those, any suggestions?

I added a suggestion with my initial comment -- can you see it? Something like that.

Oh yes, I can see that, sorry I overlooked it.

droyo · 2018-10-27T05:42:52Z

xsd/parse.go

+
+		// If we've encountered a type name for the first time
+		// store it with its hash
+		if _, prevUsed := nTypes[xmlName]; !prevUsed {


Use ok for the map presence boolean. Go programmers should recognize the idiom well enough. exists is fine too.

Ok, I changed this and I agree. I do wonder if you feel the same way about the special case where the boolean is to be used multiple times outside the immediate proximity of the initial map access. I'm interested in your opinion on this one with regard to readability vs accepted idiom usage.

droyo · 2018-10-27T06:01:01Z

xsd/parse.go

 */
-func copyEltNamesToAnonTypes(root *xmltree.Element) {
-	used := make(map[xml.Name]struct{})
+func prepCopyEltNamesToAnonTypes(


I'm not a fan of this function name. As time has passed I've tried to structure parsing as a series of discrete passes over the data, where each "pass" is a function that makes a well-defined transformation over the data. And these transformations should be somewhat self-contained.

From the name prepCopyEltNamesToAnonTypes I guess this is necessary to run before some CopyEltNamesToAnonTypes stage. That doesn't tell me anything about what this step is for.

I'm still digesting this function so I don't have a good name in mind.

I understand, and that makes perfect sense. With that in mind, perhaps I could move the logic of prepCopyEltNamesToAnonTypes() into the body of CopyEltNamesToAnonTypes() as an anonymously defined function?
PROS:
1 - Don't have to define maps outside both functions to pass results of prepCopyEltNamesToAnonTypes() to CopyEltNamesToAnonTypes() cleaning up Normalize() a bit.
2 - CopyEltNamesToAnonTypes() continues to meet its original purpose as well as the naming convention.
CONS:
1 - CopyEltNamesToAnonTypes() gets approximately 74% larger. 87 -> 151 lines.

I have no issue with big fat functions :) . I vote for moving it into CopyEltNamesToAnonTypes

Roger that :-)

droyo · 2018-10-27T06:18:20Z

xsd/parse.go

@@ -140,20 +176,33 @@ func Parse(docs ...[]byte) ([]Schema, error) {
 	for _, root := range schema {
 		tns := root.Attr("", "targetNamespace")
 		s := Schema{TargetNS: tns, Types: make(map[xml.Name]Type)}
+		sHash := hash(root)


I'm a little concerned about this part. From my understanding of the hash() function, making changes to the tree structure in root could produce a different hash value. It looks like you don't make any transformations between here and when you look up the schema on line 205, so there's no bug... yet :)
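The concern can be demonstrated with a small sketch. It assumes a content-based hash over a toy tree; the `node` type and this `hash` function are illustrative stand-ins and are not the PR's hash() or `*xmltree.Element`.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// node is a toy tree standing in for *xmltree.Element.
type node struct {
	Name     string
	Children []*node
}

// hash serializes the tree and returns the SHA-256 digest of the result,
// so any structural change produces a different value.
func hash(n *node) string {
	var b strings.Builder
	var walk func(*node)
	walk = func(n *node) {
		b.WriteString("<" + n.Name + ">")
		for _, c := range n.Children {
			walk(c)
		}
		b.WriteString("</" + n.Name + ">")
	}
	walk(n)
	sum := sha256.Sum256([]byte(b.String()))
	return hex.EncodeToString(sum[:])
}

func main() {
	root := &node{Name: "schema"}
	parsed := map[string]string{}

	key := hash(root) // computed once, before any transformation
	parsed[key] = "schema #1"

	// A later transformation mutates the tree.
	root.Children = append(root.Children, &node{Name: "element"})

	_, foundByRehash := parsed[hash(root)]
	_, foundBySavedKey := parsed[key]
	fmt.Println(foundByRehash, foundBySavedKey)
}
```

Re-hashing after the mutation misses the stored entry, while the key saved before the mutation still finds it. Keeping the computed key alongside the tree is one way to stay safe if transformations are ever added between the store and the lookup.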

Will think on this

aaronmmanzano · 2018-10-29T19:57:48Z

Thank you for taking the time to read and comment on the code. I've got a bit of a busy week at work, but I'll take a deeper look into your comments ASAP. I'll also make sure to break the code into PRs with a better separation of concerns.

Thanks again,
Aaron

droyo · 2019-01-03T03:10:13Z

I finished reviewing this and I'm pretty happy with it. I would accept it as-is if you add some tests for the new functionality (see my first comment).

aaronmmanzano · 2019-01-04T00:04:15Z

> I finished reviewing this and I'm pretty happy with it. I would accept it as-is if you add some tests for the new functionality (see my first comment).

Thanks David, that's great news. I got a bit sidetracked as far as getting these changes committed upstream, but it is still very important to me. I've done some minor refactoring and fixed a few bugs in the meantime, and I'll realistically be able to get back to this in about 3 weeks. At that time I'll break this large commit down (in terms of separation of concerns) and submit PRs that include tests. I'd really prefer to do it that way, both for the experience and because I'd like the code to be as good as possible upon acceptance. In the meantime, could you please leave this PR here for my reference?
Thanks David

~Aaron

droyo · 2019-01-05T16:02:01Z

Sounds good to me! Thanks again for your contributions. As a heads up, I fixed a bug with commit 48d1a5a that may cause some merge conflicts with your changes; you're refactoring some of the same bits anyway, so you might want to just copy the tests over and make sure they work.

mpwalkerdine · 2020-01-15T23:20:54Z

I was about to submit a much smaller PR to resolve the multiple files problem, but not the types problem, because I can work around that for the schemas I'm working on with judicious argument ordering...

FWIW this fixed the problem for me:

diff --git a/xsd/parse.go b/xsd/parse.go
index ab1b059..779b0ec 100644
--- a/xsd/parse.go
+++ b/xsd/parse.go
@@ -128,7 +128,7 @@ func Normalize(docs ...[]byte) ([]*xmltree.Element, error) {
 func Parse(docs ...[]byte) ([]Schema, error) {
 	var (
 		result = make([]Schema, 0, len(docs))
-		parsed = make(map[string]Schema, len(docs))
+		parsed = make(map[*xmltree.Element]Schema, len(docs))
 		types  = make(map[xml.Name]Type)
 	)
 
@@ -143,7 +143,7 @@ func Parse(docs ...[]byte) ([]Schema, error) {
 		if err := s.parse(root); err != nil {
 			return nil, err
 		}
-		parsed[tns] = s
+		parsed[root] = s
 	}
 
 	for _, s := range parsed {
@@ -153,7 +153,7 @@ func Parse(docs ...[]byte) ([]Schema, error) {
 	}
 
 	for _, root := range schema {
-		s := parsed[root.Attr("", "targetNamespace")]
+		s := parsed[root]
 		if err := s.resolvePartialTypes(types); err != nil {
 			return nil, err
 		}

Edit: Spoke too soon - looks like the _self references are overlapping still, but also the flattening seems to be discarding needed types 🤷‍♂️

davidalpert · 2021-03-11T16:14:51Z

this is a lot of good work.

I started poking at this issue yesterday without finding this PR and the approach I settled on (working locally for me so far based on the code that's currently in master) added a MergeTypes method to the xsd.Schema type that threw an error if you call it for two schemas with different target namespaces. using this and the existing Imports method to discover and recursively follow/load imported schema files seemed to work for me, but I didn't test it extensively. If interested I could submit that as another PR, but it seems that this one is solving the same problem through a different approach and has already been approved in theory.

Perhaps I can help to resolve the pending conflicts with master and then we merge it and go from there?

davidalpert · 2021-03-11T18:31:38Z

not a trivial merge; would likely go faster with @droyo or @aaronmmanzano at the helm.

Aaron M. Manzano added 2 commits October 25, 2018 03:50

Support multiple documents with same target namespace

998caa6

Naming of anonymous types based on hash instead of counter. Decoupling anon type names from ordering of xsd document arguments.

davidalpert mentioned this pull request Mar 11, 2021

Use Import / Included schemas for code Generation #68

Open

This pull request was closed.
