-
Notifications
You must be signed in to change notification settings - Fork 115
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Multi docs same tns. Hash based naming. #72
Conversation
Naming of anonymous types based on hash instead of counter. Decoupling anon type names from ordering of xsd document arguments.
Instead of leaving anon types to be named _anon (if they would collide with other types after inheriting a name from a parent element) append a deterministically derrived suffix to the name to maintain context while avoiding a name collision.
if err := s.parse(root); err != nil { | ||
return nil, err | ||
} | ||
parsed[tns] = s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Storing generated schemas by tns causes collisions, retaining only the types from the last schema to be generated targeting the specific namespace. Using a hash of the schema ensures all unique schemas are retained.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't the approach I would have used. Being able to assume that your schema document is exhaustive for its target namespace makes later steps simpler. For example, if there was a 1:1 mapping from target namespace to xsd then you could remove one layer of maps for your prepCopyEltNamesToAnonTypes
, and you would be able to push it into the CopyEltNamesToAnonTypes
function.
To workaround the pesky XSD specification allowing schema to be split across multiple files you could add a pass that merges all *xmltree.Element trees for schema with the same target namespace. Another benefit of doing the transform on the tree structure is that it's easier to verify, by printing out the XML in intermediate steps. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will think on this
nTypes = namedTypesByNS[tns] | ||
nTypesToCopy = namesToCopiedByNS[tns] | ||
} | ||
|
||
namedTypes := and(isType, hasAttr("", "name")) | ||
for _, el := range root.SearchFunc(namedTypes) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we collect all the explicitly named types in a tns so we can use them to detect type name collisions in pass 2. We fail if there is a XMLName collision here because XMLNames should be unique per namespace.
@@ -210,22 +295,113 @@ func copyEltNamesToAnonTypes(root *xmltree.Element) { | |||
hasAnonymousType) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Next we identify candidates for inheriting parent element name but don't rename immediately so we can determine if there are other candidates for the same name. We store hashes of each type that is a candidate for a particular XMLName so we can assess if a suffix is required in the renaming pass.
or(isElem(schemaNS, "element"), isElem(schemaNS, "attribute")), | ||
hasAttr("", "name"), | ||
hasAnonymousType) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we begin the actual renaming process
isNamedType, suffix := getSuffix(xmlname, tHash) | ||
|
||
// If the name/hash of the anon type matches that of an explicity | ||
// declared type, refernce that type and throw away the anon type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It the type, after inheriting the name, exactly matches the hash of an explicitly named type, then we throw away the anon type and point the parent element at the explicitly named type. Otherwise we determine if it requires a suffix to avoid collision and rename it.
@@ -256,13 +432,32 @@ to | |||
</xs:sequence> | |||
</xs:complexType> | |||
*/ | |||
func nameAnonymousTypes(root *xmltree.Element) error { | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same process is used to give anon types names
Wow... first of all, thank you for this PR and your detailed explanation. I will make comments in-line. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a first pass. Again, thank you for your contribution. I have not fully grokked the code yet, so my current feedback is more around formatting. I will need a little more time to give specific feedback on the more significant changes.
Overall this looks pretty good. I have two qualms that will solidify over time:
- You need tests for what you've fixed here. Specifically, you need to test the following:
** schema that are split over multiple files don't have missing types after Parse is done with them.
** changing the order of arguments to Normalize doesn't change the naming of anonymous or colliding types.
Other feedback: you're really fixing two issues here:
- handling schema split over multiple files
- improving naming of anonymous types with named parent elements.
It would be nice, (but not required!) to address each issue in a separate PR.
nTypes[xmlName] = hash(el) | ||
} else { | ||
return fmt.Errorf( | ||
"Type collision - Name: [ %s ] Namespaces: [ %s ]\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sure there are exceptions in the code base, but try to keep all of the error messages in plain english, with minimal notation.
"Type collision - Name: [ %s ] Namespaces: [ %s ]\n", | |
"collision for type %v in target ns %q", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you clarify a bit more on this one for me?
I wanted to place the XMLName and namespace in the error response to aid in trouble shooting the troublesome elements. Perhaps, "Two elements with same XMLName in target namespace "
Or would you prefer "Naming collision in namespace on XMLName " ?
If neither of those, any suggestions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a suggestion with my initial comment -- can you see it? Something like that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yes, I can see that, sorry I overlooked it.
|
||
// If we've encountered a type name for the first time | ||
// store it with its hash | ||
if _, prevUsed := nTypes[xmlName]; !prevUsed { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use ok
for the map presence boolean. Go programmers should recognize the idiom well enough. exists
is fine too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I changed this and I agree. I do wonder if you feel the same way about the special case where the boolean is to be used multiple times outside the immediate proximity of the initial map access. I'm interested in your opinion on this one with regard to readability vs accepted idiom usage.
*/ | ||
func copyEltNamesToAnonTypes(root *xmltree.Element) { | ||
used := make(map[xml.Name]struct{}) | ||
func prepCopyEltNamesToAnonTypes( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a fan of this function name. As time has passed i've tried to structure parsing as a series of discrete passes over the data, where each "pass" is a function that makes a well-defined transformation over the data. And these transformations should be somewhat self-contained.
From the name prepCopyEltNamesToAnonTypes
I guess this is necessary to run before some CopyEltNamesToAnonTypes stage. That doesn't tell me anything about what this step is for.
I'm still digesting this function so I don't have a good name in mind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand, and that makes perfect sense. With that in mind, perhaps I could move the logic of prepCopyEltNamesToAnonTypes() into the body of CopyEltNamesToAnonTypes() as an anonymously defined function?
PROS:
1 - Don't have to define maps outside both functions to pass results of prepCopyEltNamesToAnonTypes() to CopyEltNamesToAnonTypes() cleaning up Normalize() a bit.
2 - CopyEltNamesToAnonTypes() continues to meet its original purpose as well as the naming convention.
CONS:
1 - CopyEltNamesToAnonTypes() get approximately 74% larger. 87 -> 151 lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no issue with big fat functions :) . I vote for moving it into CopyEltNamesToAnonTypes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Roger that :-)
if err := s.parse(root); err != nil { | ||
return nil, err | ||
} | ||
parsed[tns] = s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't the approach I would have used. Being able to assume that your schema document is exhaustive for its target namespace makes later steps simpler. For example, if there was a 1:1 mapping from target namespace to xsd then you could remove one layer of maps for your prepCopyEltNamesToAnonTypes
, and you would be able to push it into the CopyEltNamesToAnonTypes
function.
To workaround the pesky XSD specification allowing schema to be split across multiple files you could add a pass that merges all *xmltree.Element trees for schema with the same target namespace. Another benefit of doing the transform on the tree structure is that it's easier to verify, by printing out the XML in intermediate steps. What do you think?
@@ -140,20 +176,33 @@ func Parse(docs ...[]byte) ([]Schema, error) { | |||
for _, root := range schema { | |||
tns := root.Attr("", "targetNamespace") | |||
s := Schema{TargetNS: tns, Types: make(map[xml.Name]Type)} | |||
sHash := hash(root) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little concerned about this part. From my understanding of the hash()
function, making changes to the tree structure in root
could produce a different hash value. It looks like you don't make any transformations between here and when you lookup the schema on line 205, so there's no bug... yet :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will think on this
Thank you for taking the time to read and comment on the code. I've got a bit of a busy week at work, but I'll take a deeper look into your comments ASAP. Also I'll make sure to break the code into PR's with a better separation of concerns. Thanks again, |
I finished reviewing this and I'm pretty happy with it. I would accept it as-is if you add some tests for the new functionality (see my first comment). |
Thanks David that's great news. I got on a bit of a side-track as far as getting these changes committed upstream but it is still very important to me. I've done some minor refactoring and fixed a few bugs in the mean time and I'll realistically be able to be back to this in about 3 weeks. At that time I'll break this large commit down (in terms of separation of concerns) and submit PR's that include tests. I'd really prefer to do it that way both for the experience and because I'd like the code to be as good as possible upon acceptance. In the mean time could you please leave this PR here for my reference?? ~Aaron |
Sounds good to me! Thanks again for your contributions. As a heads up I fixed a bug with commit 48d1a5a that may cause some merge conflicts with you; you're refactoring some of the same bits anyway, so you might want to just copy the tests over and make sure they work. |
I was about to submit a much smaller PR to resolve the multiple files problem, but not the types problem because I can workaround this for the schemas I'm working on with judicious argument ordering... FWIW this fixed the problem for me: diff --git a/xsd/parse.go b/xsd/parse.go
index ab1b059..779b0ec 100644
--- a/xsd/parse.go
+++ b/xsd/parse.go
@@ -128,7 +128,7 @@ func Normalize(docs ...[]byte) ([]*xmltree.Element, error) {
func Parse(docs ...[]byte) ([]Schema, error) {
var (
result = make([]Schema, 0, len(docs))
- parsed = make(map[string]Schema, len(docs))
+ parsed = make(map[*xmltree.Element]Schema, len(docs))
types = make(map[xml.Name]Type)
)
@@ -143,7 +143,7 @@ func Parse(docs ...[]byte) ([]Schema, error) {
if err := s.parse(root); err != nil {
return nil, err
}
- parsed[tns] = s
+ parsed[root] = s
}
for _, s := range parsed {
@@ -153,7 +153,7 @@ func Parse(docs ...[]byte) ([]Schema, error) {
}
for _, root := range schema {
- s := parsed[root.Attr("", "targetNamespace")]
+ s := parsed[root]
if err := s.resolvePartialTypes(types); err != nil {
return nil, err
}
Edit: Spoke too soon - looks like the |
this is a lot of good work. I started poking at this issue yesterday without finding this PR and the approach I settled on (working locally for me so far based on the code that's currently in Perhaps I can help to resolve the pending conflicts with |
not a trivial merge; would likely go faster with @droyo or @aaronmmanzano at the helm. |
Hello David,
I messaged you on Reddit about 2 weeks ago about making a pull request, and here it is. I'm a few months into learning GO and it was a real pleasure tooling around your code. It really gave me a good feel for well structured GO code. With that in mind I don't expect this first pull request to meet those standards and view it more as a code review than pull request. I know this project is no longer a major priority but if you have time to take a look at the code I'd appreciate it.
I'll make line comments to the PR after I create it, but I'll briefly explain my intentions here as well. I initially ran into an issue in with xsd/parse.go:Parse(). When Schema objects are being stored in the parsed map, they are stored with target namespace as the key. This causes to schemas with the same tns to collide in the map and only the last schema using the tns survives.
After correcting this I came into issue with XMLName collisions caused by joining the types of multiple schemas (that target the same namespace) into a map, which happens immediately after schema objects are created in Parse() . In short this lead me down the path of rewriting the copyEltNamesToAnonTypes() and nameAnonymousTypes() functions called within Normalize() to both be a 2-pass process. Gathering information needed to assess possible collisions in the 2nd, renaming pass. If collisions are detected name suffixes are used to prevent it while preserving context provided by using parent element names. The fewer types that pass through to nameAnonymousTypes() the better.
And for those types that do fall through to being renamed by nameAnonymousTypes() the name suffix
process is changed from an incremented number to the leading characters of the type's hash. This allows for more maintainable code as it prevents the situation I encountered where the addition of an anon type early in the xsd causes the renaming of all subsequent anon types with a suffix one higher than its previous. This way a GO type declaration only changes if the xsd type it models changes. It also allows for condensing of exact anon xsd type declarations into a single GO type.
Anyhow, please let me know what you think, and thanks so much for making this project public.
~Aaron
PS: I noticed there is an auto-build process that failed, I'll look into the tests.