Skip to content

Make include/import logic more precise and consistent#1067

Merged
stevedlawrence merged 2 commits intoapache:mainfrom
stevedlawrence:daffodil-2840-resolve-schemaLocation
Aug 17, 2023
Merged

Make include/import logic more precise and consistent#1067
stevedlawrence merged 2 commits intoapache:mainfrom
stevedlawrence:daffodil-2840-resolve-schemaLocation

Conversation

@stevedlawrence
Copy link
Member

@stevedlawrence stevedlawrence commented Aug 11, 2023

We currently resolve include/import schema locations in two different
places, one in the DaffodilCatalogResolver and the other in dsom IIBase.
The logic is not obviously the same, so this adds a new XMLUtils
function specifically for resolving a schemaLocation attribute. The more
complex cases like catalog managers are not implemented here and are
still done in the different classes, but this at least consolidates the
search when catalog manager fails (which is the common case).

This also slightly tweaks the schemaLocation logic to the following

  • If a schemaLocation is an absolute URI (i.e. has a scheme) we just
    look it up directly
  • If a schemaLocation is not absolute and the path starts with a /, then
    we first look for it on the classpath. If not found we look for the
    absolute path on the file system
  • If a schemaLocation is not absolute and the path does not start with a
    /, then we resolve the path relative to the importing file. If not
    found, we look for the relative path on the classpath absolutely, but
    with a deprecation warning.

Note that this deprecates looking up relative include imports on the
classpath. A schemaLocation that is expected to be in different jar on
the classpath should be an absolute path. This removes any ambiguity
about if a file should be resolved relative to the importing schema or
on the classpath. Eventually we should remove the deprecated capability
completely, but most schemas currently depend on this behavior. For
example, many schemas have something like the following:

<import schemaLocation="org/apache/daffodil/xsd/DFDLGeneralFormat.dfdl.xsd" />

But this really should be an absolute path, like so:

<import schemaLocation="/org/apache/daffodil/xsd/DFDLGeneralFormat.dfdl.xsd" />

This also modifies the CLI to have similar logic, where we warn if a
schema path is provided absolutely.

Deprecation/Compatibility:

Resolving relative import/include schemaLocation paths absolutely is now
deprecated and will output a warning, and support may eventually be
removed. For example, a common import is:

<import schemaLocation="org/apache/daffodil/xsd/DFDLGeneralFormat.dfdl.xsd" />

Although this path is relative, previous versions of Daffodil allowed
this to be found on the classpath absolutely, as if it started with a
slash. This behavior is deprecated. If a schemaLocation is expected to
be found on the classpath and not relative to the importing file, then
it should start with a slash, for example:

<import schemaLocation="/org/apache/daffodil/xsd/DFDLGeneralFormat.dfdl.xsd" />

Daffodil still supports the old behavior and will look up relative
schemaLocatin paths absolutely, but will output a schema definition
warning. Future versions of Daffodil may change this to an error.

DAFFODIL-2840

Note that I've left this as two separate commits, one deprecating relative schemaLocations and one changing all the imports in Daffodil tests. This should hopefully make review easiser by mainly focusing on the first commit

Copy link
Contributor

@mbeckerle mbeckerle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Minor suggestions for comment enhancement.

* optContextURI must be provided or a usage exception is thrown.
*
* Returns None if this fails to resolve the schemaLocation to an URI that exists. If this
* successfully resolves, it returns a Some containing a tuple of the URI and a boolean that
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's worth using the word "compatibility" in here somewhere just for the keyword search value.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

SDW(
WarnID.DeprecatedBuiltInFormats,
"schemaLocation property uses deprecated include/import of edu/illinois/ncsa/daffodil/xsd/built-in-formats.xsd. Use org/apache/daffodil/xsd/DFDLGeneralFormat.dfdl.xsd instead and change the dfdl:format ref to \"GeneralFormat\".",
"schemaLocation property uses deprecated include/import of edu/illinois/ncsa/daffodil/xsd/built-in-formats.xsd. Use /org/apache/daffodil/xsd/DFDLGeneralFormat.dfdl.xsd instead and change the dfdl:format ref to \"GeneralFormat\".",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It has been years. I think we should get rid of this feature. We need to add that to a release note if we do, because it's a feature being removed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems resonable, this PR is probably a reasonable time to remove it. I'll add it to the Deprecation/Compatibility section.

val optURI = Misc.getResourceRelativeOption(sysId, baseURI)
// We did not get a catalog resolution of the nsURI. We now look for the systemID (which
// comes from the schemaLocation attribute) on classpath or as a file. Note that we
// ignore if boolean
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ignore if boolean..... what

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This "note" can be removed. It was saying that the boolean part of the tuple returned by resolveSchemaLocation can be ignored since we look at it elsewhere, but I move it as a comment a couple lines below and I guess I messed up removing this sentence. I'll remove this.

* is true if a relative schemaLocation path was resolved absolutely on the classpath and
* false otherwise. This capability may be disabled in the future.
*/
def resolveSchemaLocation(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should consider factoring out a daffodil-resolver library that contains the resolver stuff, and this utility.

Actually I guess breaking up daffodil-lib into compile-time and runtime parts is a separate topic, but I think it needs to happen.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I think he biggest obstacle is the XML catalog support, which I think is handled slightly different in the DaffodilXMLLoader and in the IIBase stuff. It also requires state for finding/loading the XML Catalog, so it felt a bit bigger task than I wanted to take on in this PR, which is relatively small and localized. I'll open a clean up ticket for it.

Copy link
Contributor

@tuxji tuxji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now Main is duplicating some of the same code as XMLUtils.resolveSchemaLocation but not doing it entirely correctly. The best thing would be for Main to call that function too, but if it can't, then the Main code needs to be tweaked and ideally some tests added (these new tests won't be necessary if Main calls the XMLUtils function).

Some(file.toURI)
} else {
Misc.getResourceRelativeOption(s, None)
val optURI = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This None.orElse{}... expression evaluates each alternative in turn, stopping when one returns a non-empty Option.

If you think a reader might appreciate the above line as a comment, please add it. I actually had to go look at the definition of .orElse in IDEA.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I wasn't too sure about the `None.orElse{}.orElse{}.orElse{} pattern. It felt kindof nice since it removed a bunch of if-else statements, but I'm not sure we use it anywhere else, so it does feel off. I'm definitely open to other suggestions, but will add a comment explaining this if there's nothing better.

.orElse {
if (s.startsWith("/")) {
val resource = this.getClass.getResource(s)
Option(resource.toURI)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the resource does not exist or is not visible due to security considerations, the getResource method returns null. Do you need a null check before calling resource.toURI and another None if the null check fails?

Also, lines 237 - 238 don't seem to be covered by tests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, good catch. At one point this was just Option(resource), and null was fine. But the toURI is needed since the resource is a URL. I fixed it in the XMLUtils function but not here. Will fix.

I'll see if I can add some CLI tests that hit these code paths. These are important use cases we want to make sure we don't accidentally break. I don't think we had tests for them before and that's probably a mistake.

val resource = this.getClass.getResource("/" + s)
if (resource != null) {
Logger.log.warn(s"Found relative path on classpath absolutely, did you mean /$s")
Some(resource.toURI)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the previous .orElse code returns None because a resource cannot be found when s starts with "/", this code will add another "/" to s, still fail to find the resource, and return None again.

Should we combine these two .orElse cases by assigning an if expression to val resource?

   val resource = if (s.startsWith("/")) s else "/" + s

Actually, you want to warn the user if a relative path is found absolutely on the classpath, but you can guard the warning with an if (!s.startsWith("/")).

Line 247 doesn't seem to be covered by tests too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking was that when we want to remove this fallback support and result in a normal "could not find schema" error, then we can just remove this orElse block and nothing else needs to be modified. I don't feel strongly about this though, I can merge the logic.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your reasoning makes sense to me - let's just add a comment to this orElse block saying the entire block is fallback support that we may remove a few releases later.

val optURI = Misc.getResourceRelativeOption(sysId, baseURI)
// We did not get a catalog resolution of the nsURI. We now look for the systemID (which
// comes from the schemaLocation attribute) on classpath or as a file. Note that we
// ignore if boolean
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sentence Note that we ignore if boolean is incomplete and I don't know what you mean by it. Do you mean Note that we ignore sysId's boolean value.?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mentioned above, but this note should just e removed. The original comment was moved a couple lines below and I guess I didn't delete this one correctly.

val pr = tdp.parse(input, b.length * 8)
assert(
!pr.isProcessingError && pf.getDiagnostics.isEmpty,
!pr.isProcessingError,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's discuss whether to remove && pf.getDiagnostics.isEmpty or not. When we compile the generated C code into an executable and run that executable, the program will invariably print validation diagnostics if it detects validation problems like an element's value not matching its fixed attribute. However, the program returns an error exit status only if you pass -V on or -V limited, otherwise it returns a successful exit status. I wanted to fail these test cases and print the validation diagnostics if any are printed.

Note that right now, the TDML Runner doesn't pass any validation option to the daffodilC executable, but the TDML Runner does return any printed messages as validation diagnostics in the TDML result, so that the TDML Runner will do the right thing with any validation error clauses in the test cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally removed this because of the new warnings about a missing leading slash that causes this to fail (which are fixed in the subsequent commit in this PR). So there were new schema compile-time diagnostics that this check didn't like.

the program will invariably print validation diagnostics if it detects validation problems like an element's value not matching its fixed attribute

Are these parse-time diagnostics that you expect to be empty? Not schema compile/code generation-time diagnostics? Maybe this should have be pr.getDiagnostics.isEmpty instead of pf.getDiagnostics.isEmpty? Would that be sufficient for the check you're looking for, since the new warnings should only be on the ProcessorFactory but not on the ParseResult (I think?).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you're right, pf.getDiagnostics.isEmpty should be pr.getDiagnostics.isEmpty (parse-time diagnostics, not schema compile/code generation-time diagnostics).

@stevedlawrence
Copy link
Member Author

The best thing would be for Main to call that function too, but if it can't, then the Main code needs to be tweaked and ideally some tests added (these new tests won't be necessary if Main calls the XMLUtils function).

The new function works slightly differently than how main works. I think the main different is that if the schema location is relative it requires a context URI to resolve relative to, or else it's an assertion failure. The CLI allows relative path locations without a context.

Though, I wonder if maybe when called from the CLI, we could say the optContextURI is the current working directory? Then relative paths are resolved relative to the CWD URI and resolve as expected? I can give that a shot.

cr match {
case Left(diagnostics) => fail(s"getProcessor failed: ${diagnostics.mkString}")
case Right((diagnostics, tdmlDFDLProcessor)) =>
assert(diagnostics.isEmpty)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tuxji, this place where I removed diagnostics is just about the getProcessor call and not about parse/unparse diagnostics. Is it safe to keep this one removed? Or maybe I an change it so that it only asserts that there are no error error diagnostics, so that new warnings that get created (like this one) do not break this test?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should fix the relative schemaLocation causing the schema compile-time warning if possible. If not, let's remove diagnostics.isEmpty to make the test more robust.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was hoping to leave all the updates to schemaLocations in a separate PR so that 1) it's easier to look at the real changes if we ever need to come back to it and 2) if we ever add a .git-blame-ignore-revs file then maybe the second commit that updates all the schemaLocations could go in that. I would also like to make sure all tests pass in both commits separately in case we ever need to do a git bisect. So I'd rather we remove or change the assert so this test passes without needing a schemaLocation updatew. I think something like this should work and is maybe reasonable?

assert(diagnostics.forall(!_.isError))

It's maybe the same idea as what the original assert was checking for, but a bit more robust against new warnings that might be added in the future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sounds good to me

@stevedlawrence
Copy link
Member Author

I've pushed a new commit that I think resolves all the issues. Also, I was able to to modify Main to use the same function by making the contextURI option a URI to the current working directory.

Copy link
Contributor

@tuxji tuxji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

A lot of the CLI tests failed on Windows for some reason, so you'll have to fix that problem first before you can merge.

@stevedlawrence
Copy link
Member Author

stevedlawrence commented Aug 16, 2023

A lot of the CLI tests failed on Windows for some reason, so you'll have to fix that problem first before you can merge.

The issue is that Windows paths are not valid URI's. I've pushed a commit that converts the schema path to a valid URI string, keeping it relative if possible so that resolveSchemaLocation can use its relative lookup logic. Makes things a bit more complicated in Main, but I've learned that windows paths are actually pretty complicated.

@tuxji
Copy link
Contributor

tuxji commented Aug 16, 2023

Yes, the new commit has made the CLI tests pass. I wonder why only the Main class needed to be changed? What about the Java/Scala API, are there ways to pass paths into Daffodil's japi/sapi which also need similar code to transform Windows paths into URIs (or do these API accept only URIs, not String paths, anyway)?

Copy link
Contributor

@mbeckerle mbeckerle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

One suggestion for a comment.

// This test makes sure a schema can be compiled from an absolute path on the filesystem
val schema = path(
"daffodil-cli/src/test/resources/org/apache/daffodil/cli/global_element.dfdl.xsd",
).toAbsolutePath
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this going to be a MS-Windows style path when this test is run on Windows?
I honestly did not know that. I assumed Java paths were somehow portable things at that point.
If so, please add comment like // on MS Windows this will be a windows-style path with backslash separators.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the path method just does Paths.get(...) which supports unix-style separators even on windows systems. And when call toString on that Path to get a string to pass in to the CLI, the string contains windows style paths.

So Paths.get("/path/containing/unix/separators").toString will work on Linux or Windows, and return the right thing depending on the OS.

@stevedlawrence
Copy link
Member Author

I wonder why only the Main class needed to be changed?

The Main class expects Strings that are file paths (or also Java resource paths), and file paths aren't necessarily URIs on windows, hence this stuff that Main has to do to convert these strings to URIs. Note that we could just do Paths.get(s).toUri.toString and pass that into resolveSchemaLocation, letting Paths.get() handle all the weirdness around resolving relative paths on different OSes. But that doesn't work with classpath resources. It also means we lose relative path information which I'm hoping to avoid (toUri() always returns an absolute URI, I think because not all relative Windows paths can be converted a relative URI). I think it might be useful to keep URIs relative as part of depersonalizing schema paths in saved parsers (see my comments in PR #1065). This commit doesn't attempt to address that PR, but it at least doesn't make things harder.

The schemaLocation attribute in a XSD schema must be a URI, regardless of OS. So we just treat that attribute as if it were a URI. Nothing fancy here.

The Scala/Java API compile functions only accept a File or a URI, so we don't have to do anything fancy there either. We assume the user knew how to create one of those.

We currently resolve include/import schema locations in two different
places, one in the DaffodilCatalogResolver and the other in dsom IIBase.
The logic is not obviously the same, so this adds a new XMLUtils
function specifically for resolving a schemaLocation attribute. The more
complex cases like catalog managers are not implemented here and are
still done in the different classes, but this at least consolidates the
search when catalog manager fails (which is the common case).

This also slightly tweaks the schemaLocation logic to the following

- If a schemaLocation is an absolute URI (i.e. has a scheme) we just
  look it up directly
- If a schemaLocation is not absolute and the path starts with a /, then
  we first look for it on the classpath. If not found we look for the
  absolute path on the file system
- If a schemaLocation is not absolute and the path does not start with a
  /, then we resolve the path relative to the importing file. If not
  found, we look for the relative path on the classpath absolutely, but
  with a deprecation warning.

Note that this deprecates looking up relative include imports on the
classpath. A schemaLocation that is expected to be in different jar on
the classpath should be an absolute path. This removes any ambiguity
about if a file should be resolved relative to the importing schema or
on the classpath. Eventually we should remove the deprecated capability
completely, but most schemas currently depend on this behavior. For
example, many schemas have something like the following:

  <import schemaLocation="org/apache/daffodil/xsd/DFDLGeneralFormat.dfdl.xsd" />

But this really should be an absolute path, like so:

  <import schemaLocation="/org/apache/daffodil/xsd/DFDLGeneralFormat.dfdl.xsd" />

This also modifies the CLI to have similar logic, where we warn if a
schema path is provided absolutely.

Deprecation/Compatibility:

Resolving relative import/include schemaLocation paths absolutely is now
deprecated and will output a warning, and support may eventually be
removed. For example, a common import is:

  <import schemaLocation="org/apache/daffodil/xsd/DFDLGeneralFormat.dfdl.xsd" />

Although this path is relative, previous versions of Daffodil allowed
this to be found on the classpath absolutely, as if it started with a
slash. This behavior is deprecated. If a schemaLocation is expected to
be found on the classpath and not relative to the importing file, then
it should start with a slash, for example:

  <import schemaLocation="/org/apache/daffodil/xsd/DFDLGeneralFormat.dfdl.xsd" />

Daffodil still supports the old behavior and will look up relative
schemaLocatin paths absolutely, but will output a schema definition
warning. Future versions of Daffodil may change this to an error.

Deprecation/Compatability:

Support for importing "edu/illinois/ncsa/daffodil/xsd/built-in-formats.xsd" has been removed.
Schemas should instead import "/org/apache/daffodil/xsd/DFDLGeneralFormat.dfdl.xsd" and change
dfdl:format to reference "GeneralFormat"

DAFFODIL-2828
This removes deprecated import/include schemaLocations and associated
warnings.

DAFFODIL-2828
@stevedlawrence stevedlawrence force-pushed the daffodil-2840-resolve-schemaLocation branch from 277029d to d8a2683 Compare August 16, 2023 18:44
@stevedlawrence stevedlawrence merged commit 531e9a6 into apache:main Aug 17, 2023
@stevedlawrence stevedlawrence deleted the daffodil-2840-resolve-schemaLocation branch August 17, 2023 11:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants